Technological competitiveness of China's internet platformers: comparison of Google and Baidu by using patent text information

Kazuyuki Motohashi (Department of Technology Management for Innovation, The University of Tokyo, Tokyo, Japan; Research Center for Advanced Science and Technology, The University of Tokyo, Tokyo, Japan and Research Institute of Economy Trade and Industry, Chiyoda-ku, Tokyo, Japan)

Chen Zhu (Department of Technology Management for Innovation, The University of Tokyo, Tokyo, Japan)

Asia Pacific Journal of Innovation and Entrepreneurship

ISSN: 2398-7812

Article publication date: 9 January 2024

Abstract

Purpose

This study aims to assess the technological capability of Chinese internet platforms (BAT: Baidu, Alibaba, Tencent) compared to US ones (GAFA: Google, Amazon, Facebook, Apple). More specifically, this study explores Baidu’s technological catching-up process with Google by analyzing their patent textual information.

Design/methodology/approach

The authors retrieved 26,383 Google patents and 6,695 Baidu patents from PATSTAT 2019 Spring version. The collected patent documents were vectorized using the Word2Vec model first, and then K-means clustering was applied to visualize the technological space of two firms. Finally, novel indicators were proposed to capture the technological catching-up process between Baidu and Google.

Findings

The results show that Baidu follows a trend of US rather than Chinese technology which suggests Baidu is aggressively seeking to catch up with US players in the process of its technological development. At the same time, the impact index of Baidu patents increases over time, reflecting its upgrading of technological competitiveness.

Originality/value

This study proposed a new method to analyze technology mapping and evolution based on patent text information. As both US and China are crucial players in the internet industry, it is vital for policymakers in third countries to understand the technological capacity and competitiveness of both countries to develop strategic partnerships effectively.

Citation

Motohashi, K. and Zhu, C. (2024), "Technological competitiveness of China's internet platformers: comparison of Google and Baidu by using patent text information", Asia Pacific Journal of Innovation and Entrepreneurship, Vol. ahead-of-print No. ahead-of-print. https://doi.org/10.1108/APJIE-02-2023-0032

Publisher

Emerald Publishing Limited

License

Published in Asia Pacific Journal of Innovation and Entrepreneurship. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode

1. Introduction

The advancement of artificial intelligence (AI) (machine learning) could turn massive data from the internet and IoT sensors into a gold mine (Agrawal et al., 2018). AI technology is versatile and applicable across various industries (Trajtenberg, 2018; Motohashi, 2020). Not only does it improve the accuracy of predictions, but it also enhances the economy of scope in big data analysis. The nature of general-purpose technology of AI, or non-rivalry of big data for various applications, allows internet business firms to grow as internet platforms, expanding their services to a variety of industries (Goldfarb and Trefler, 2018). Accordingly, Google, Amazon, Facebook and Apple (GAFA) have become top-listed firms in stock market valuation ranking.

At the same time, the concentration of data into a small number of firms, such as GAFA, has raised concern among national authorities outside the US. Google has been fined a combined $9.5bn since 2017 by EU antitrust regulators, and EU regulatory bodies have kept a close watch on the activities of other US internet firms. The EU also imposes the General Data Protection Regulation to ensure that European standards of privacy protection apply when private data are transferred beyond EU borders. Such policy actions could lead to “virtual nationalism,” where cyberspace is compartmentalized by nation/region (Economist, 2020).

In this regard, China is going its own way by virtually banning internet business on US internet platforms and international data transfer (Chorzempa et al., 2018). As a result, indigenous internet giants Baidu, Alibaba and Tencent (BAT) have emerged in a domestically segmented cyberspace insulated from international competition. Based on huge amounts of data from 800 million smartphone users, as well as large domestic markets in China, Alibaba and Tencent are listed in the global top 20 in terms of market capitalization. Recently, BAT have invested heavily in AI technology based on a large talent pool inside China. The Chinese Government plans to become a global AI leader by 2025, and BAT is supposed to play a crucial role (Biancotti and Ciocca, 2018).

This study focuses on Baidu and Google and assesses the technological capability of Chinese internet platforms compared to US ones. These two firms are quite comparable in terms of their business domain and advertising based on internet search queries, and both firms have recently made substantial investments in autonomous driving technology. We use text information (abstract) of patent applications submitted to the US Patent and Trademark Office (USPTO) and CNIPR (China patent authority). The text information of patent data is assumed to reflect the content of the invention precisely. The similarity score of two patents based on the patent abstract provides more accurate information than their IPC code (Arts et al., 2017). In addition, the vector space model with a high dimension of continuous variables gives finer-grained information about patent contents, as compared to one-dimensional IPC codes with discrete variables (Younge and Kuhn, 2016; Motohashi et al., 2019).

Understanding the technological capability of Chinese firms is important from the perspective of both business and policy. A firm in a developed economy, such as Japan, cannot conduct internet/IoT business in China by itself but needs to collaborate with local firms such as BAT. Under such conditions, it is critical to assess the technological capability of Chinese counterparts as the bargaining position in partnership negotiation depends on relative management resources, particularly technological capacity, to which Chinese firms are eager to gain access. In addition, as tensions between the US and China due to trade disputes become intense, information on technological competitiveness in both countries is essential intelligence for policymakers in third countries. This is particularly the case for Japan as both countries are very important partners, and an inappropriate strategy to deal with them may cause substantial damage to the domestic economy.

The remainder of this paper is organized as follows. Section 2 reviews catch-up related literature and our research framework. Section 3 outlines the data source and methodology of our vector space model based on internet technology patents from USPTO and CNIPR. Google and Baidu patents are compared via two types of empirical analysis in Sections 4 and 5. One is an overview of the technologies of these two firms using clustering analysis. The other is based on a more micro view of individual patents, together with the distribution of patented technologies of its neighbors in the technology space. Finally, we conclude with a summary of the findings and policy implications in Section 6.

2. Technological catch-up and proposed research framework

The concept of catch-up possesses a significant and enduring historical legacy, marked by notable examination in Abramovitz’s (1986) influential work. It achieved prominence in the post-Second World War period, characterized by the USA’s early adoption of advanced methods of production and industrial practices that other countries had not yet embraced. According to this scenario, catch-up is commonly defined by economic scholars as the process of reducing the disparities in productivity and income between a leading nation and a trailing one (Fagerberg and Godinho, 2005). Kashani et al. (2022) examine the evolution of catch-up studies and suggest that catch-up can be measured by a range of indicators, including productivity, income and technological capability. The primary focus of this study lies in the technological aspect of catch-up, defined as the significant improvements in technological capabilities by firms from technologically disadvantaged nations as they close the gap with advanced incumbents, moving closer to the global technological frontier (Miao et al., 2018).

Theoretically, Bell and Pavitt (1993) introduce a framework to conceptualize technology as a capability in the catch-up process. This framework emphasizes that technological capabilities, representing a firm's capacity to absorb and learn from imported technology, are critical determinants of successful technology transfers in developing countries. It has underpinned numerous empirical studies examining the growth of latecomer firms and the impediments to their leadership emergence. Studies on technology catch-up fall into the following two main categories based on research methods: qualitative case studies and quantitative empirical research. Qualitative studies have explored the success of catch-up among Asian firms in diverse industries, including consumer electronics, automotive and shipbuilding (Cho et al., 1998; Kim, 1998; Fan, 2006; Mathews, 2006). In quantitative research, patent data, often regarded as a common proxy for technological knowledge, have gained prominence in monitoring the technological catch-up process.

Considering catch-up as a learning process, prior research has used patent citations to track technology acquisition. Wang et al. (2014) leverage the citations of licensees’ patents to discern if latecomer firms had gleaned knowledge from prior licensing agreements. Besides, Lee (2013) conducts a comprehensive comparison of technological capabilities between Korean firms and their US counterparts, using a range of citation-based indicators such as quality, originality and diversity. Although citation information has been widely used to measure patent quality and technology spillover, such information is not available in many developing countries. In this light, we introduce a novel framework that leverages patent text data to monitor the catch-up process between latecomer firms and incumbents in advanced economies. Initially, we train our own Word2Vec model based on a large-scale patent corpus. This trained Word2Vec model is then used to convert patent texts, specifically abstracts, into vector format. Subsequently, clustering analysis is performed to provide an overview of the technological landscape and detailed technical domains. Following that, two semantic-based indicators are introduced to compare the technological capabilities of Google and Baidu. Traditionally, constructing pairwise cosine similarity scores for a large-scale data set, such as one exceeding one million entries, is computationally demanding. Therefore, we use a neighborhood graph and tree (NGT) to search for similar patent pairs. Figure 1 presents the proposed research framework.

3. Vector space model of internet technology

3.1 Data source

To conduct a fair comparison of a US firm (Google) and a Chinese firm (Baidu), we use the patent data from USPTO and CNIPR. Specifically, we retrieve all patent application information by Google (26,383 USPTO patents) and Baidu (6,695 CNIPR patents) from the PATSTAT 2019 Spring version. We then check the IPC subgroups of these patents to identify internet-related technology patents. We identify a total of 2,350 IPC subgroups, but many of them contain a very small number of Google or Baidu patents.

We treat the subgroups with at least 100 Google or Baidu patents as a core technology of internet search engine-related business and retrieve all patents belonging to these 50 subclasses for subsequent analysis. There are 680,241 US patents and 427,628 CN patents from 1959 to 2018. The subgroups span seven IPC classes, “F24,” “G01,” “G02,” “G06,” “G09,” “G10” and “H04,” but more than 95% of patents belong to the G06 (computing, calculating, counting) and H04 (electric communication technique) classes. Figure 2 shows the number of patents by application year. It should be noted that most patent applications via CNIPR have been made within the last five years, while USPTO patent applications were made relatively earlier. A drop in patent applications in recent years comes from data truncation associated with the time lag between application and publication years, particularly for USPTO patents.

3.2 Vector space model

A myriad of patents makes it difficult to mine useful information and relationships among them. Recent text mining techniques turn a document into a vector form so that existing machine learning algorithms can be used. We followed the classic Skip-gram model proposed by Mikolov et al. (2013) to build word vector representations for our patent corpus. We then calculated the document embedding for a patent by averaging the vectors of all nouns occurring in that patent. To do so, we first preprocessed the patent corpus. Wang et al. (2019) noted that the word representations should be able to demonstrate multifacetedness. That is, the trained Word2Vec model should yield meaningful representations for words in different forms (e.g. in different tenses). Furthermore, many pre-trained word embedding models (e.g. Google pre-trained Word2Vec models) kept words in their original forms.

In line with this convention, we did not conduct lemmatization; we only removed punctuation, placed all words in lowercase and turned all digits into a token “<num>”. The corpus was built on 1,107,869 patent applications. We retained words with frequencies higher than four. A Skip-gram model was then adopted to build a 300-dimensional vector for each word in the corpus. Our Skip-gram model generated vector representations for 170,340 words, of which 73,780 (43%) were nouns.
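
As an illustration, the preprocessing steps described above can be sketched in Python. This is a minimal sketch rather than the authors' code: the names `preprocess` and `build_vocab`, the tokenizing regular expression (which assumes English text) and the choice to treat a decimal number as a single “<num>” token are all assumptions.

```python
import re
from collections import Counter

NUM_TOKEN = "<num>"
# Assumed tokenizer: runs of letters are words; digit runs (optionally with
# "." or "," inside) are numbers. Everything else counts as punctuation.
TOKEN_RE = re.compile(r"[a-z]+|\d+(?:[.,]\d+)*")

def preprocess(text):
    """Lowercase, drop punctuation and map every number to NUM_TOKEN.
    No lemmatization is applied, so word forms are kept as written."""
    tokens = TOKEN_RE.findall(text.lower())
    return [NUM_TOKEN if tok[0].isdigit() else tok for tok in tokens]

def build_vocab(documents, min_count=5):
    """Keep words whose corpus frequency is higher than four."""
    counts = Counter(tok for doc in documents for tok in preprocess(doc))
    return {word for word, freq in counts.items() if freq >= min_count}
```

The resulting token lists would then be passed to a Skip-gram trainer; that step is omitted here because it depends on the library used.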

From the results of this word embedding (300-dimension vector expressions for each word), the document vector d_j (corresponding to the patent content expression) is computed by the following:

\[ d_j = \frac{1}{n_j} \sum_{w_i \in N} v_i \]

where v_i is the vector representation of word w_i; n_j is the number of nouns occurring in the document d_j; and N is a set of all nouns in the dictionary.
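
The averaging step can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `document_vector` and the toy two-dimensional vectors are hypothetical, the set of nouns is taken as given (the part-of-speech tagging that identifies nouns is not shown), and each occurrence of a noun is assumed to count once in the average.

```python
def document_vector(tokens, word_vectors, nouns):
    """Average the embedding vectors of all noun occurrences in one document."""
    noun_vecs = [word_vectors[t] for t in tokens if t in nouns and t in word_vectors]
    if not noun_vecs:
        return None  # no noun with a known embedding in this document
    n_j = len(noun_vecs)
    dim = len(noun_vecs[0])
    return [sum(vec[k] for vec in noun_vecs) / n_j for k in range(dim)]

# Toy 2-dimensional embeddings (the paper uses 300 dimensions).
word_vectors = {"image": [1.0, 0.0], "display": [0.0, 1.0], "shows": [5.0, 5.0]}
nouns = {"image", "display"}
# "shows" is not a noun, so it is ignored in the average.
d_j = document_vector(["image", "shows", "display", "image"], word_vectors, nouns)
```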

3.3 Validation of document embedding results

The document embedding results are created in two steps: (1) word embedding and (2) aggregation at the document level. In terms of the first step, we conduct a face validation of word embedding results. Specifically, we conduct k-means clustering of embedded words to check that similar words are clustered into the same cluster. The results of the clustering analysis are presented in Appendix 1. For example, the first cluster consists of “image-related” words, including “image,” “position,” “display,” and “picture.” The second one shows the list of text-related words (“document,” “language,” etc.). Accordingly, it is possible to conclude that our word embedding results are reasonable.

In the second step (aggregation at document level), we take a simple average of word embedding vectors in each document. To assess the document embedding results, we use Doc-DB patent family information. Within each patent family, all patents are based on the same invention, so the contents of these patents should be close to each other. We calculate pairwise cosine similarities of the patents corresponding to the same patent family. It should be noted that one patent family could have both USPTO and CNIPR patents. Therefore, we could evaluate document embedding results separately using US-US, CN-CN and US-CN pairs.
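
The family-based check can be sketched as follows. This is an illustrative sketch only: the function names and the representation of a patent family as a list of (authority, vector) tuples are assumptions, not the authors' implementation.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def family_pair_similarities(family):
    """family: list of (authority, vector) for the patents of one patent family,
    with authority "US" or "CN". Returns a list of (pair_type, cosine) where
    pair_type is "US-US", "CN-CN" or "US-CN"."""
    results = []
    for (auth_a, vec_a), (auth_b, vec_b) in combinations(family, 2):
        # Sort so that a mixed pair is always labelled "US-CN".
        pair_type = "-".join(sorted((auth_a, auth_b), reverse=True))
        results.append((pair_type, cosine_similarity(vec_a, vec_b)))
    return results
```

Collecting these values over all families and grouping them by pair type gives the distributions that are compared across US-US, CN-CN and US-CN pairs.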

Figures 3 and 4 show the distribution of cosine similarity of document embedding results between patent family pairs. For a given patent family, we calculated all pairwise cosine values of its patents and then described the results separately using US-US, CN-CN and US-CN pairs. The mode points of each type of pair correspond to 1 (showing exactly the same vector), and most pairs have cosine similarity close to 1. We could conclude that our document embedding method produces reasonable results. In addition, the US-US patent family pair is relatively closer in terms of the contents, as compared to the CN-CN pairs, and the US-CN pairs are in the middle. Therefore, there may not be any systematic bias associated with the data source (USPTO or CNIPR patents), which is important to make a fair comparison between Google and Baidu in the following sections.

Table 1 shows the results of descriptive statistics of cosine similarities of patent pairs by type of family and by type of document-level aggregation. We have again confirmed that the median point of each type of pair is close to 1 (at least 0.97), suggesting the validity of document embedding results. Table 1 also reports the results using TF-IDF weighted averages of word embedding results (figures with asterisks). The cosine similarity of these figures is even lower than that of the simple mean. Therefore, we proceed with the subsequent analysis by using the document embedding results with a simple average of word embedding vectors.

4. Clustering analysis

The contents of the patent corpus are explored by dividing the whole corpus into several clusters. We used k-means to conduct clustering based on the vectorized patent contents information. In terms of the granularity of clustering, we take the number of IPC subclasses, that is, 11. We could set this number arbitrarily, but it becomes difficult to gain a broad picture from too many clusters. In addition, the number of clusters should not be too small, as the corpus would then be divided too coarsely. We applied k-means clustering to 1,107,869 patents, and the word cloud of each cluster is presented in Figure 5. The size of each word in this figure corresponds to the aggregated TF-IDF value of that word in each cluster (the sum of patent-level TF-IDF values at the cluster level) and can be formally expressed as follows:

\[ \text{Aggregated TFIDF}_i = \sum_{w_i \in D_j;\, D_j \in C} t_{ji} \]

where D_j’s are patents in cluster C, and t_ji is the TF-IDF value of word w_i in patent D_j. Figure 5 also shows the label of each cluster, created by using this word cloud information, together with 10 patents located near the center point of each cluster (a list of titles of these patents is presented in Appendix 2).
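
The cluster-level aggregation can be sketched as follows. This is a minimal sketch: the function name `aggregated_tfidf` and its input layout are hypothetical, and the TF-IDF variant used here (raw term count multiplied by log(N/df)) is an assumption, since the exact weighting is not specified in the text.

```python
import math
from collections import Counter, defaultdict

def aggregated_tfidf(docs, cluster_of):
    """docs: {patent_id: list of tokens}; cluster_of: {patent_id: cluster label}.
    Returns {cluster: {word: sum of patent-level TF-IDF values}}."""
    n_docs = len(docs)
    # Document frequency: number of patents in which each word appears.
    doc_freq = Counter(word for tokens in docs.values() for word in set(tokens))
    totals = defaultdict(lambda: defaultdict(float))
    for pid, tokens in docs.items():
        for word, tf in Counter(tokens).items():
            idf = math.log(n_docs / doc_freq[word])  # assumed TF-IDF variant
            totals[cluster_of[pid]][word] += tf * idf
    return totals
```

Words with the largest aggregated values in a cluster would then be drawn largest in that cluster's word cloud.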

Figure 6 visualizes the contents of 1.1 million patents, together with the location of each of the 11 clusters. For this purpose, the 300-dimensional document vectors are reduced into 2D space. We use the Uniform Manifold Approximation and Projection (UMAP), which has a superior run-time efficiency (McInnes et al., 2018). UMAP can convert high-dimensional data into a low-dimensional space while preserving both local and global structures. There are three broad types of patent content:

web application, such as data analytics, language modeling and web content application;
display interface, such as image recognition and human interface; and
ICT infrastructure, such as storage system, file management and mobile communication.

Figure 7 shows the share of patent applications by cluster and country (USPTO or CNIPR). The share of ICT infrastructure patents (such as storage, file management systems and wireless communication) is found to be larger for the USA, while there are relatively more application-related patents (such as mobile user interaction and data analytics) for China. Such differences come from the difference in the timing of technological development in both countries. US patent applications started in the 1990s and grew rapidly in the early 2000s, while for China, most patent applications were submitted after 2010. Players in China, including Baidu, therefore focus more on application developments based on ICT infrastructure technologies developed by US players.

Figure 8 shows the location of Google and Baidu patents in the technology space based on the information compiled using UMAP in Figure 6. Google patents are more widely distributed in the space, while Baidu patents are concentrated in some particular fields, such as data analytics, mobile user interaction and Web search/language modeling. Google’s first patent application was submitted in 1997, while Baidu started applying for patents mainly after 2009. As is shown in cross-country trends in the USA and China, Baidu focuses on application development in the process of technologically catching up with Google.

To control for cross country differences in patent contents, we calculate the revealed comparative advantage (RCA) index for Google and Baidu by cluster as follows:

\[ \mathrm{RCA}_{ij} = \frac{P_{ij} / \sum_{j} P_{ij}}{\sum_{i \in \text{US or China}} P_{ij} / \sum_{i \in \text{US or China}} \sum_{j} P_{ij}} \]

where P_ij is patent count by firm “i” and cluster “j”. Figure 9 shows RCA for Google and Baidu (i = Google or Baidu) by cluster (j). It should be noted that the value of RCA is greater than 1 when a firm focuses on a particular field, and vice versa. First, the pattern of RCA by cluster is very similar across these two firms. As both are operating internet search engines, a high value can be found for web search and language modeling (Google: 2.48, Baidu: 3.36). In addition, the RCA of file management system is greater than 1 for both firms. Second, differences can be found between these firms in web content application (Google > Baidu) and mobile user interaction (Google < Baidu). This point can be explained by the difference in the ICT environment between the two countries, that is, mobile internet is diffused more widely in China. As a consequence, it is more important for Baidu to invest more in mobile specific applications, such as internet services taking user location information into account.
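
A worked example may help in reading the RCA index. The sketch below uses made-up patent counts for a hypothetical firm and its country; the function name `rca` and all numbers are illustrative assumptions, not values from the paper.

```python
def rca(firm_counts, country_counts):
    """firm_counts: {cluster: patent count} for one firm.
    country_counts: {cluster: patent count} for all applicants at the same
    patent authority. Returns {cluster: RCA}."""
    firm_total = sum(firm_counts.values())
    country_total = sum(country_counts.values())
    return {
        cluster: (firm_counts.get(cluster, 0) / firm_total)
        / (country_counts[cluster] / country_total)
        for cluster in country_counts
    }

# Hypothetical counts: the firm puts half of its patents into web search,
# while only a quarter of the country's patents fall in that cluster.
firm = {"web search": 40, "storage": 20, "display": 20}
country = {"web search": 250, "storage": 500, "display": 250}
values = rca(firm, country)
```

In this toy case the firm's RCA is 2 for web search (0.50 / 0.25), 0.5 for storage (0.25 / 0.50) and 1 for display (0.25 / 0.25), so a value above 1 marks a field in which the firm is more concentrated than its country as a whole.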

5. Technology space distribution analysis

The foregoing clustering analysis provides an overview of the technology space in terms of patenting, but it does not provide detailed information on the within-cluster distribution of individual patents. In this section, we generate statistics regarding the neighborhood patents to each of over one million patents in our sample in terms of content. Specifically, we estimate the top 200 nearest patents in terms of cosine similarity to each patent.

An apparent difficulty is that deriving all pairwise cosine similarities among one million patents involves a massive amount of computation. We therefore used an NGT proposed by Sugawara et al. (2016) for indexing, which is an approximate similarity search method. NGT has been developed for efficient retrieval of relevant internet content by search engines, but it can be applied to any type of text information. Motohashi et al. (2019) use NGT results for patent titles and abstracts published by the Japan Patent Office to understand the characteristics of academic patents (as compared to firm patents).

NGT uses a tree structure for indexing network graphs efficiently. Its key parameter, epsilon, sets the search range for nearest neighbors, and there is a trade-off between the search range and search time. We fit our samples and use epsilon = 0.35 with an accuracy rate of 0.997 (see Appendix 3 for details).
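
To make the trade-off concrete, the sketch below shows an exact brute-force neighbor search, which an approximate index such as NGT is meant to replace, and one way of scoring an approximate result against it. This is not NGT itself. The function names are hypothetical, and reading the accuracy rate as the share of exact nearest neighbors that the approximate search recovers is an assumption, since the definition is given only in the appendix.

```python
import heapq
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def exact_top_k(query_id, vectors, k):
    """Brute-force k nearest neighbors by cosine similarity, excluding the query.
    Cost grows with the number of patents for every single query."""
    scored = ((cosine_similarity(vectors[query_id], vec), pid)
              for pid, vec in vectors.items() if pid != query_id)
    return [pid for _, pid in heapq.nlargest(k, scored)]

def recall_at_k(approx_ids, exact_ids):
    """Share of the exact neighbors that an approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

On a sample of queries, the exact list can be compared with the list returned by the approximate index for several values of epsilon, and the smallest epsilon with acceptable recall can then be used for the full corpus.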

Figure 10 presents the average cosine similarity of the 200th nearest patents (i.e. the patents ranked 200th in terms of the cosine similarity) with each of 1.1 million patents by application year and patent authority. An upward time trend (technology space becomes denser over time) can be found in CNIPR patents, while it is not the case for USPTO patents. As a result, the cosine similarity of the 200th nearest patents for CNIPR patents (around 0.90) becomes greater than that of USPTO patents (around 0.88) on average.

Figure 11 shows the share of USPTO patents in the top 200 nearest patents by patent authority (CNIPR or USPTO). The share for USPTO patents is stable at around 70%, meaning 30% of the top 200 nearest patents are CNIPR patents. In contrast, the share for CNIPR patents rose until 2006, then fell. The upward trend corresponds to the period in which the number of USPTO patents increases, while a downward trend occurs when the number of CNIPR patent applications overtakes USPTO patents. More importantly, a pattern of technology divergence is revealed between the two countries, that is, increasing numbers of same-country patent pairs in terms of content similarity rather than cross-country pairs.

The information on 200 near patents in terms of patent contents provides a picture of the technology space around the patent to be examined. As shown in Figure 12, finding near patents corresponds to drawing a border within which 200 near patents are located. The border is a hypersphere (300 dimensions) with a radius of the distance (e.g. 1-cosine similarity) between the patent to be examined and the 200th nearest patent. The technology space is densely populated with surrounding patents if the radius (1-cosine similarity) is small, and vice versa. It should be noted that there are two types of surrounding patents. One is a patent applied for before the patent to be examined, and the other is one applied for thereafter. The former type provides information on preceding patents, and we refer to such patents as BASE. We refer to the latter as FOLLOW, as these patent applications were submitted following the patent to be examined.

BASE could be considered as a backward citation and FOLLOW as a forward citation. Hence, the number of BASE patents can be used as an indicator of the novelty of a patent (smaller BASE means more novelty), and the number of FOLLOW patents indicates the impact of a patent (larger FOLLOW means more impact).
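
Counting BASE and FOLLOW patents among the nearest neighbors can be sketched as follows. The function name `base_follow_counts` and the dates are hypothetical. Neighbors with the same application date as the focal patent are counted as FOLLOW here, which is an assumption made so that FOLLOW equals the number of neighbors minus BASE.

```python
from datetime import date

def base_follow_counts(focal_date, neighbor_dates):
    """Split the nearest neighbors of a patent into BASE (applied for earlier)
    and FOLLOW (all remaining neighbors)."""
    base = sum(1 for d in neighbor_dates if d < focal_date)
    follow = len(neighbor_dates) - base  # ties on the date count as FOLLOW (assumption)
    return base, follow

# Toy example with four neighbors instead of 200.
focal = date(2012, 5, 1)
neighbors = [date(2010, 1, 1), date(2011, 6, 30), date(2012, 5, 1), date(2015, 3, 3)]
base, follow = base_follow_counts(focal, neighbors)
```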

We use this information to assess the technological capability of Google and Baidu. As is the case for citation information, this indicator can be biased by data truncation, that is, the newer the patent to be examined, the more BASE patents and the fewer FOLLOW patents could be found. Therefore, we normalized the number of BASE and FOLLOW (200-BASE) using the number of patent applications before and after, respectively. In addition, there is a time trend of such indicators, particularly for CNIPR patents. As the number of patent applications increases (Figure 2) in densely populated fields (Figure 10) for CNIPR patents, IMPACT tends to be larger, while BASE is smaller. Therefore, we need to control for the patent authority difference (USPTO or CNIPR). Finally, we derive the following indicator for cumulativeness (less novel) and impact for each patent:

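\[ \mathrm{Cumulativeness}_i = \frac{\mathrm{BASE}_i / \sum_{t<T} P_t}{\mathrm{AVERAGE}_{i,\, c \in \text{US or China}} \left( \mathrm{BASE}_i / \sum_{t<T} P_t \right)} \]

\[ \mathrm{Impact}_i = \frac{\mathrm{FOLLOW}_i / \sum_{t>T} P_t}{\mathrm{AVERAGE}_{i,\, c \in \text{US or China}} \left( \mathrm{FOLLOW}_i / \sum_{t>T} P_t \right)} \]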

where BASE_i and FOLLOW_i are the numbers of BASE and FOLLOW patents of patent “i” with application date “T” and patent authority “c” (US or China), and P_t is the count of patent applications at application date “t.” Here we conduct double normalization: by timing (BASE is normalized by the number of patent applications before the patent to be examined, that is, all candidates for BASE, and likewise for FOLLOW) and by the country of patent authority.
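
The double normalization can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `normalized_indicator` and the record layout are hypothetical, the pool of candidate patents (applications before the focal date for BASE, after it for FOLLOW) is assumed to be precomputed and greater than zero, and the average is taken over all patents of the same patent authority.

```python
from collections import defaultdict

def normalized_indicator(records):
    """records: list of dicts with keys
         'id'        patent identifier
         'authority' patent authority, e.g. "US" or "CN"
         'count'     BASE (for cumulativeness) or FOLLOW (for impact)
         'pool'      number of applications before (or after) the focal date
    Returns {id: indicator}; 1.0 is the average for the patent's authority."""
    # Step 1: normalize by timing.
    ratios = {r["id"]: r["count"] / r["pool"] for r in records}
    # Step 2: normalize by the average ratio within each patent authority.
    by_authority = defaultdict(list)
    for r in records:
        by_authority[r["authority"]].append(ratios[r["id"]])
    mean = {c: sum(v) / len(v) for c, v in by_authority.items()}
    return {r["id"]: ratios[r["id"]] / mean[r["authority"]] for r in records}
```

Passing BASE counts gives the cumulativeness indicator, and passing FOLLOW counts gives the impact indicator. Averaging the patent-level values over a firm's patents in a given year would yield firm-level trends.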

As cumulativeness and impact are patent-level indicators, we could aggregate this at the firm level. Figure 13 presents the trend of cumulativeness indicators of Google and Baidu. Here, we produce three types of these indicators: (1) using all patents, (2) using USPTO patents only and (3) using CNIPR patents only in the 200 nearest patents. The distinction of patent authority allows us to investigate the technology trajectory of these firms within and across countries. The cumulativeness of Google used to be below 1, suggesting relatively novel patents under the USPTO patent standards, but it has recently reached one due to an increasing trend of US neighbor patents. This could be explained by the convergence of internet technologies among major players such as (G)AFA. The increasing trend of cumulativeness is clearer in the case of Baidu. Baidu patents used to be relatively novel (less than 1) under Chinese standards, but this has also recently reached 1. Increasing numbers of USPTO patents are used as a base, and Baidu has aggressively caught up with US players in the process of its technological development.

Figure 14 shows the impact indicators of Google and Baidu. Google’s performance is stable over time around 1, reflecting an average impact under US standards. However, the impact of USPTO patents is found to be more than average (around 1.2), while the impact of CNIPR patents is less than average (0.7 to 0.8). In contrast, Baidu shows quite dynamic patterns for this indicator. While the overall impact indicator has recently fallen, USPTO neighbor patents reveal an increase regarding this indicator. Together with the finding in Figure 13, Baidu is found to pay more attention to technological development in China and started patenting in mainstream technologies in the USA so that both cumulativeness and impact measured by US patents increase over time. It should be noted that the USPTO-based impact indicator has recently become greater than 1, suggesting Baidu has achieved technological catching up with US players to some extent.

6. Conclusion

Technology upgrading of China’s internet platforms has received growing attention given their huge data assets of a billion mobile users together with ample engineering talent for AI and data science. China has set a goal of becoming a global leader in AI by 2025, and it is assumed that BAT (China’s GAFA equivalent) will play a vital role. Using Google as the benchmark, this study assessed the technological capability of Baidu. We use patent text information (abstract of invention) to examine how these two firms have developed over time.

We extract internet-related technology patents from USPTO and CNIPR patent publication information to determine the technology trajectory of both countries’ patent applicants. Internet-related patent applications to CNIPR have increased significantly in the past five years, and the contents of patent applications in both countries are found to be diverging. This may be due to the fact that China’s internet market is segmented from the rest of the world and evolving in its own way. The rapid progress of mobile internet in China also explains the difference in technology portfolios across the two countries.

Given such general trends of technological development, Baidu and Google show similar patterns of focused areas of R&D in general, such as web search technology and data analytics for language modeling, reflecting their common business model built on internet search engines. However, our results reveal some differences, such as more mobile applications in Baidu and more web content applications in Google. In terms of the dynamics of technological development, Baidu follows a trend of US rather than Chinese technology, and it is assumed that Baidu is aggressively seeking to catch up in the process of technological development. At the same time, the impact index of Baidu patents increases over time, suggesting its upgrading of technological competitiveness.

This study proposes a new methodology to analyze technology mapping and evolution based on patent text information. The citation information has been used extensively for patent characteristics (mainly patent quality) and technology spillover (Nagaoka et al., 2010). However, patent citation information is unavailable in many countries, including China. In contrast, the proposed methodology offers wider geographic applicability, particularly when using patent information in developing countries, due to the availability of patent abstract information in most nations. Furthermore, recent studies have highlighted the utilization of companies’ Web pages to monitor their market-side opportunities (Park and Geum, 2022; Motohashi and Zhu, 2023). As web data are also in a textual format, our proposed methodology can be easily applied to these datasets for a better understanding of market-side catch-up and competition.

However, our methodology also has some limitations. First, we use word embeddings that are fixed over time. The content of a term such as “machine learning” should change over time as the technology progresses. Therefore, while our document embeddings can represent a wide range of technologies, they are weak at measuring the progress (or depth) of a particular technology component. Using an embedding methodology that takes the context of each word within a paragraph into account, such as BERT, may be a potential solution. In addition, the number of neighbor patents (200 in our case) is arbitrary. We could decrease or increase this number, but the appropriate value depends on the scope of the analysis and on how finely we want to identify the density of the technology (patent) distribution. Kernel smoothing in multi-dimensional space may be used in future research.
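The sensitivity to the neighborhood size can be illustrated with a minimal sketch. It is not the authors' implementation: it uses exact rather than approximate search, and the toy two-dimensional vectors, the function name us_share_in_neighbors and the office labels are hypothetical. It computes the share of USPTO documents among the k nearest neighbors of a query patent by cosine similarity, and shows that the value changes with k.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def us_share_in_neighbors(query, corpus, k):
    """Share of 'US' documents among the k corpus vectors most similar to query.

    corpus is a list of (vector, office) pairs, with office 'US' or 'CN'.
    """
    ranked = sorted(corpus, key=lambda item: cosine(query, item[0]), reverse=True)
    top = ranked[:k]
    return sum(1 for _, office in top if office == "US") / len(top)

# Toy 2-D embeddings (hypothetical values, not taken from the paper).
corpus = [
    ([1.0, 0.0], "US"),
    ([0.9, 0.1], "US"),
    ([0.8, 0.3], "CN"),
    ([0.1, 1.0], "CN"),
    ([0.0, 1.0], "CN"),
]
query = [1.0, 0.05]

# The same query patent gives different shares for different neighborhood sizes.
shares = {k: us_share_in_neighbors(query, corpus, k) for k in (2, 4)}
print(shares)
```

In this toy case a small neighborhood contains only US documents, while a larger one also reaches Chinese documents, which is why the choice of neighborhood size matters for density-based indicators.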

Figures

Figure 1.

Research framework

Figure 2.

Internet-related patents by application year

Figure 3.

Distribution of cosine similarity between pairs within patent families

Figure 4.

Histograms of cosine similarity between pairs within patent families

Figure 5.

Word cloud of clustering results

Figure 6.

UMAP visualization of patent contents and clustering results

Figure 7.

Composition of patent contents by country

Figure 8.

Comparison of Google and Baidu patents

Figure 9.

RCA of Google/Baidu patents in each country

Figure 10.

Cosine similarity of 200th nearest patents

Figure 11.

Share of USPTO patents in 200 neighbors by country

Figure 12.

Graphical interpretation of NGT results

Figure 13.

Cumulativeness of Google and Baidu patents

Figure 14.

Impact of Google and Baidu patents

Figure A1.

Word cloud results for word embedding

Figure A2.

Tuning explored range in NGT analysis

Table 1.

Cosine similarity between pairs within patent families

Country	Mean	SD	Min	25%	50%	75%	Max
US	0.97	0.05	0.25	0.99	1.00	1.00	1.00
CN	0.95	0.07	0.53	0.92	0.98	0.99	1.00
USCN	0.97	0.05	0.28	0.96	0.99	1.00	1.00
US*	0.95	0.10	0.14	0.99	1.00	1.00	1.00
CN*	0.89	0.14	0.24	0.84	0.97	0.99	1.00
USCN*	0.94	0.10	0.11	0.94	0.98	1.00	1.00

Note: (*) denotes the results of TF-IDF weighted document embedding

Source: Created by authors
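The comparison in Table 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: a document vector is assumed to be the TF-IDF weighted mean of its word vectors, the smoothed IDF formula is one common variant, and the toy word vectors and token lists are hypothetical. A pair of abstracts from the same patent family should yield a cosine similarity close to 1, while an unrelated abstract should score lower.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {word: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Term frequency times a smoothed inverse document frequency.
            word: (count / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0)
            for word, count in counts.items()
        })
    return weights

def embed(doc_weights, word_vectors):
    """Weighted mean of word vectors; words without a vector are skipped."""
    dim = len(next(iter(word_vectors.values())))
    total, acc = 0.0, [0.0] * dim
    for word, weight in doc_weights.items():
        if word in word_vectors:
            total += weight
            acc = [a + weight * x for a, x in zip(acc, word_vectors[word])]
    return [a / total for a in acc] if total else acc

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical 2-D word vectors standing in for a trained Skip-gram model.
word_vectors = {
    "search": [1.0, 0.0], "engine": [0.9, 0.2], "query": [0.8, 0.1],
    "method": [0.5, 0.5], "device": [0.4, 0.6],
    "image": [0.0, 1.0], "pixel": [0.1, 0.9],
}
docs = [
    ["search", "engine", "query", "method"],   # family member filed at one office
    ["search", "query", "method", "device"],   # same family, other office
    ["image", "pixel", "method", "device"],    # unrelated patent
]

vectors = [embed(w, word_vectors) for w in tfidf_weights(docs)]
sim_family = cosine(vectors[0], vectors[1])
sim_other = cosine(vectors[0], vectors[2])
print(round(sim_family, 3), round(sim_other, 3))
```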

Table A1.

Document cluster labels

Cluster label	Title of patent (ten nearest to cluster centroid)	IPC
0	Method and device for obtaining combined image	G06K9/62
0	Digital image visualized management and retrieval for communication network	G06F17/30
0	Terminal device, intelligent mobile phone, and face identification-based authentication method and system	G06K9/00
0	Remote sensing image significance target detection method and system based on Hadoop	G06F17/30
0	Method for detecting over-exposure area in monitoring video image combining multiple features	G06K9/62
0	Method and system for detection of representative area of automatic quasi object type image	G06F17/30
0	Station identification method and device	G06K9/00
0	Method for generating and applying image search code technique	G06F17/30
0	Image matching method and image matching device	G06K9/62
0	Method and system for replacing background images of smart camera in real time	G06F3/0484
1	Distributed storage method and apparatus, and data processing method and apparatus	G06F17/30
1	Massive real-time data synchronization system based on private cloud storage	H04L29/08
1	Distribution and utilization global total data transmission and storage method and device and electronic equipment	G06F17/30
1	Data rapid distribution method and device	H04L29/06
1	Method for acquiring and converting data of metering system of intelligent transformer substation	G06F17/30
1	Method of pre-caching or pre-fetching data utilizing thread lists and multimedia editing systems using such pre-caching	G06F3/06
1	Database normalization storage system and method suitable for use in multi-model satellite testing	G06F17/30
1	Data audits based on timestamp criteria in replicated data bases within digital mobile telecommunication system	G06F17/30
1	Write operation control method, system and device and computer storage medium	G06F3/06
1	Smart storage platform apparatus and method for efficient storage and real-time analysis of big data	G06F3/06
2	Context-based photograph sharing platform for property inspections	G06F17/30
2	Systems and methods for constructing and using models of memorability in computing and communications applications	G06F3/048
2	Systems and methods for constructing and using models of memorability in computing and communications applications	G06F3/048
2	Systems and methods for constructing and using models of memorability in computing and communications applications	G06F3/048
2	Incentives for content consumption	G06Q30/00
2	Method and apparatus for locating errors in documents via database queries, similarity-based information retrieval and modeling the errors for error resolution	G06F17/30
2	Method and system for electronic display of photographs	G06F17/30
2	Three-dimensional web crawler	G06F17/30
2	Intelligent integrating system for crowdsourcing and collaborative intelligence in human- and device- adaptive query-response networks	G06F17/30
2	Methods and systems for annotation of digital information	G06F17/24
3	Intelligent liquid warehousing device	G06K9/00
3	Internet-of-things-based water level monitoring system for water conservancy and hydropower engineering	H04L29/08
3	Touch control input device used for electronic information equipment	G06F3/041
3	Output device and wearable display	G09G5/00
3	Diversified reinforced tablet computer system	G06F1/16
3	Force touch module, preparation method thereof, touch screen panel and display device	G06F3/041
3	Luminous band display type sliding touch bar and display method of touch luminous band	G06F3/041
3	Economical skin-pattern-acquisition and analysis apparatus for access control; systems controlled thereby	G06K9/00
3	Shield machine posture solving device based on VBA writing	G06F9/44
3	Touch-control module, touch screen and intelligent device and stereo touch-control method	G06F3/041
4	Method for understanding questions in question type automatic question-answer systems on basis of rule	G06F17/27
4	Data searching method and system based on semantic analysis	G06F17/27
4	Information searching method based on metadata	G06F17/30
4	Relevancy priority ordering method used for environmental protection regulation retrieval	G06F17/30
4	Information management, retrieval and display system and associated method	G06F17/30
4	Information management, retrieval and display systems and associated methods	G06F7/00
4	Information management, retrieval and display system and associated method	G06F17/30
4	Method of indexing words in handwritten document images using image hash tables	G06F17/30
4	Method for searching pattern matching index	G06F17/30
4	System, method and program product for answering questions using a search engine	G06F17/30
5	Search engine method based on keyword resolution scheduling	G06F17/30
5	Method and system for automatically converting dynamic form page to HTML5 page	G06F17/22
5	Automatic access of electronic information through machine-readable codes on printed documents	G06F12/00
5	Electronic commerce system for updating information	G06F12/00
5	Web service multithreading file uploading system	H04L29/08
5	System and method for creating and posting media lists for purposes of subsequent playback	G06F3/0482
5	System and method for creating and posting media lists for purposes of subsequent playback	G06F15/16
5	System and method for creating and posting media lists for purposes of subsequent playback	G06F15/16
5	Pay per record system and method	H04L29/06
5	Dynamic generation of target files from template files and tracking of the processing of target files	G06F7/00
6	Wired security access control device of financial industry network and access method of wired security access control device	H04L29/06
6	Vehicle identification system and method	G06F17/30
6	Control system	G06F3/16
6	Plug type audio device and signal processing method	G06F3/16
6	Touch display device and touch display method	G06F3/041
6	Method and device for playing audio data in sound card signal input channel in real time	G06F3/16
6	Portal access control system	G06F7/04
6	Method and device for displaying states of ports of switch	H04L12/24
6	Computer control system	G06F3/00
6	Login method and device for user identified by radio frequency	G06F21/00
7	Device, method and equipment for information data interaction for processing information data	G06F17/30
7	Smart instant interaction technology for use in radius range of position	G06F17/30
7	Information processing method, terminal and electronic device	G06F17/30
7	System information security monitoring method and device, computer device and storage medium	G06Q10/10
7	Novel electronic device information collection and selective information orientation distribution method	H04L29/06
7	Interested object information acquisition method and system with mobile terminals coordinating with cloud terminal	H04L29/08
7	Information display method and device	H04L12/58
7	Method and device for feeding back information, and terminal	H04L12/58
7	Method, device and system for storing social networking service (SNS) content	G06F17/30
7	Method and system for automatically ordering dishes and settling account	G06Q30/02
8	Facial action unit strength estimation-based expression analysis method	G06K9/00
8	Spatial data matching method based on machine learning	G06F17/30
8	Method for quickly sorting electroencephalograph signal based on threshold analysis	G06F3/01
8	Intelligent analysis method for components of camera scene image	G06K9/62
8	Method and system for generating radio frequency identification data into tripping origin destination) matrix on the basis of Spark	G06F17/30
8	Target identification method based on geometry reconstruction and multi-scale analysis	G06K9/00
8	Time sequence similarity measurement method based on self-adaptive piecewise statistical approximation	G06F17/30
8	Judgment standard establishment method for identifying red and black time sequence through resistance method	G06K9/62
8	Data flow abnormality detection and multiple verification method based on enhancement-type angle abnormality factor	G06F17/30
8	Wi-Fi-based indoor personnel passive detection method	G06K9/00
9	Systems and methods of network operation and information processing	G06F15/16
9	Systems and methods of network operation and information processing	G06F17/30
9	Systems and methods of network operation and information processing, including engaging users of a public-access network	G06F15/16
9	Systems and methods of network operation and information processing, including use of unique/anonymous identifiers throughout all stages of information processing and delivery	G06F15/16
9	Video broadcast creation method and system, access device and management device	H04L29/06
9	System and method for realizing signaling firewall based on signaling point-free access technology	H04L29/06
9	Network device access authentication method in network video monitoring	H04L29/06
9	System and method for simulating an application for subsequent deployment to a device in communication with a transaction server	G06F7/00
9	Method and system for managing personal information	G06Q30/00
9	Method for monitoring resource utilization of server	H04L12/24
10	Off-line engine system based on software as a service (SaaS) mode	G06F17/30
10	System and method for providing a messaging application program interface	G06F3/00
10	Integrated chaining process for continuous software integration and validation	G06F9/44
10	Method for implementing configuration clause processing of policy-based network in cloud component software system	H04L29/06
10	Method for providing a virtual execution environment on a target computer using a virtual software machine	G06F9/44
10	Frame driving method of application construction platform	G06F9/44
10	Internal control management system capable of applying response type shared application architecture	G06F9/44
10	Computer flexible management construction system and interface storage and explanation method	G06F9/44
10	Method and system for connecting words, phrases, or symbols within the content of transmitted data to URI or IP address	G06F17/30
10	Realization method and system for device control by using HTTP interface	H04L29/08

Source: Created by authors

Appendix 1. Word cloud results for word embedding

k-means++ was used to assign all words derived by the Skip-gram model into 24 clusters. The number of clusters was chosen arbitrarily. The words in each cluster are presented in the form of a word cloud. The Skip-gram model assumes that similar words are more likely to appear in the same context (window). Therefore, the words in each cluster should be understood as associated and related rather than strictly similar.
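The clustering step can be sketched in plain Python as follows. This is an illustrative implementation of k-means with k-means++ seeding under simplifying assumptions, not the authors' code: the two-dimensional word vectors are hypothetical, only two clusters are used instead of 24, and the seeding assumes at least k distinct points.

```python
import random

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: each new centre is drawn with probability
    proportional to its squared distance from the nearest chosen centre."""
    centres = [list(rng.choice(points))]
    while len(centres) < k:
        d2 = [min(sq_dist(p, c) for c in centres) for p in points]
        centres.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centres

def assign(points, centres):
    """Index of the nearest centre for every point."""
    return [min(range(len(centres)), key=lambda j: sq_dist(p, centres[j])) for p in points]

def kmeans(points, k, iters=20, seed=0):
    """Lloyd iterations starting from k-means++ seeds."""
    rng = random.Random(seed)
    centres = kmeans_pp_init(points, k, rng)
    for _ in range(iters):
        labels = assign(points, centres)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster is empty
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centres), centres

# Hypothetical word vectors standing in for Skip-gram output.
word_vectors = {
    "search": [0.0, 0.0], "query": [0.0, 1.0], "index": [1.0, 0.0],
    "camera": [10.0, 10.0], "image": [10.0, 11.0], "pixel": [11.0, 10.0],
}
words = list(word_vectors)
labels, _ = kmeans([word_vectors[w] for w in words], k=2)

# Group words by cluster; each group would feed one word cloud.
clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(label, []).append(word)
print(clusters)
```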

Appendix 2. Document cluster labels

Rather than labeling document clusters by the word clouds alone, we also used patent titles as complementary information. For each cluster, we selected the ten patents nearest to its centroid.
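Selecting the patents nearest to a cluster centroid can be sketched as below. The distance measure is not stated in the text, so Euclidean distance is assumed here; the function name nearest_to_centroid and the toy vectors are hypothetical.

```python
import math

def nearest_to_centroid(vectors, labels, cluster, top_n=10):
    """Indices of the top_n members of `cluster` closest to its centroid
    (Euclidean distance is an assumption of this sketch)."""
    members = [i for i, lab in enumerate(labels) if lab == cluster]
    dim = len(vectors[0])
    centroid = [sum(vectors[i][d] for i in members) / len(members) for d in range(dim)]
    return sorted(members, key=lambda i: math.dist(vectors[i], centroid))[:top_n]

# Toy document embeddings and cluster labels (hypothetical values).
vectors = [[0.0, 0.0], [1.0, 0.0], [0.4, 0.1], [5.0, 5.0], [5.0, 6.0]]
labels = [0, 0, 0, 1, 1]

# The two documents of cluster 0 that best represent it.
representatives = nearest_to_centroid(vectors, labels, cluster=0, top_n=2)
print(representatives)
```

The titles of the returned documents would then serve as candidate labels for the cluster, as in Table A1.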

Appendix 3. Tuning of explored range in NGT

NGT has a primary parameter ϵ that defines the explored range of the graph search and thereby controls precision. In line with the “No Free Lunch” theorem, the more extensive the explored range, the higher the precision but the longer the search time. To investigate the relationship between the explored range ϵ and accuracy, we randomly sample n patents from the corpus. Denote by N_true(i) the true 200 nearest neighbors of patent i, and by N_ngt(i, ϵ) the approximate 200 nearest neighbors of patent i given by NGT. The accuracy for a given value of ϵ is then calculated as follows:

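$$\mathrm{Accuracy}(\epsilon) = \frac{1}{n} \sum_{i=1}^{n} \frac{\operatorname{len}\bigl(N_{\mathrm{true}}(i) \cap N_{\mathrm{ngt}}(i, \epsilon)\bigr)}{200}$$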

In our case, we collected a random sample of 500 patents and varied ϵ from 0.05 to 1 in steps of 0.05. Figure A2 shows how accuracy changes with the value of ϵ. For our analysis, we set ϵ to 0.35, which gave an accuracy of 0.997 with a reasonable running time in the experiment.
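The accuracy measure can be computed with a few lines of Python. This sketch takes the exact and approximate neighbor lists as given and does not call NGT itself; the function name ngt_accuracy and the toy neighbor lists (with k = 4 instead of 200) are hypothetical.

```python
def ngt_accuracy(true_neighbors, approx_neighbors, k=200):
    """Mean share of the exact k nearest neighbors that the approximate
    search also returns, averaged over the sampled patents."""
    n = len(true_neighbors)
    hits = sum(len(set(t) & set(a)) for t, a in zip(true_neighbors, approx_neighbors))
    return hits / (n * k)

# Toy example: two sampled patents, four neighbors each (patent ids are made up).
true_neighbors = [[1, 2, 3, 4], [5, 6, 7, 8]]
approx_neighbors = [[1, 2, 3, 9], [5, 6, 7, 8]]

accuracy = ngt_accuracy(true_neighbors, approx_neighbors, k=4)
print(accuracy)
```

Repeating this calculation for each candidate value of ϵ, with the approximate lists regenerated each time, would give an accuracy curve of the kind summarized in Figure A2.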

References

Abramovitz, M. (1986), “Catching up, forging ahead, and falling behind”, The Journal of Economic History, Vol. 46 No. 2, pp. 385-406.

Agrawal, A., Gans, J. and Goldfarb, A. (2018), Prediction Machines: The Simple Economics of Artificial Intelligence, Harvard Business School Press.

Arts, S., Cassiman, B. and Gomez, J.C. (2017), “Text matching to measure patent similarity”, Strategic Management Journal, Vol. 39 No. 1, pp. 62-84.

Bell, M. and Pavitt, K. (1993), “Technological accumulation and industrial growth: contrasts between developed and developing countries”, Industrial and Corporate Change, Vol. 2 No. 2, pp. 157-210.

Biancotti, C. and Ciocca, P. (2018), “Regulating data superpower in the age of AI”, Realtime Economic Issues Watch, October 23, 2018, Peterson Institute for International Economics.

Cho, D.S., Kim, D.J. and Rhee, D.K. (1998), “Latecomer strategies: evidence from the semiconductor industry in Japan and Korea”, Organization Science, Vol. 9 No. 4, pp. 489-505.

Chorzempa, M., Triolo, P. and Saks, S. (2018), “China’s social credit system: a mark of progress or a threat to privacy?”, Peterson Institute for International Economics, Policy Brief 18-14.

Economist (2020), “Special report: the data economy”, The Economist, Feb 22, 2020, London.

Fagerberg, J. and Godinho, M.M. (2005), “Innovation and catching-up”, The Oxford Handbook of Innovation, Oxford University Press, New York, NY, pp. 514-543.

Fan, P. (2006), “Catching up through developing innovation capability: evidence from China’s telecom-equipment industry”, Technovation, Vol. 26 No. 3, pp. 359-368.

Goldfarb, A. and Trefler, D. (2018), “AI and international trade”, NBER Working Paper #24254, Cambridge MA.

Kashani, E.S., Radosevic, S., Kiamehr, M. and Gholizadeh, H. (2022), “The intellectual evolution of the technological catch-up literature: bibliometric analysis”, Research Policy, Vol. 51 No. 7, p. 104538.

Kim, L. (1998), “Crisis construction and organizational learning: capability building in catching-up at Hyundai motor”, Organization Science, Vol. 9 No. 4, pp. 506-521.

Lee, K. (2013), Schumpeterian Analysis of Economic Catch-up: Knowledge, Path-Creation, and the Middle-Income Trap, Cambridge University Press, London.

McInnes, L., Healy, J. and Melville, J. (2018), “UMAP: uniform manifold approximation and projection for dimension reduction”, arXiv preprint arXiv:1802.03426, 6 December 2018.

Mathews, J.A. (2006), “Dragon multinationals: new players in 21st century globalization”, Asia Pacific Journal of Management, Vol. 23 No. 1, pp. 5-27.

Miao, Y., Song, J., Lee, K. and Jin, C. (2018), “Technological catch-up by east Asian firms: trends, issues, and future research agenda”, Asia Pacific Journal of Management, Vol. 35 No. 3, pp. 639-669.

Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013), “Efficient estimation of word representations in vector space”, in ICLR.

Motohashi, K. (2020), “Science and technology co-evolution in AI: empirical understanding through a linked dataset of scientific articles and patents”, RIETI Discussion Paper Series 20-E-010, RIETI, Tokyo Japan.

Motohashi, K. and Zhu, C. (2023), “Identifying technology opportunity using dual-attention model and technology-market concordance matrix”, Technological Forecasting and Social Change, Vol. 197, p. 122916.

Motohashi, K., Koshiba, H. and Ikeuchi, K. (2019), “A method of extracting content information from patent documents and comparison of their characteristics by applicant type by using the vector space model of distributed expressions”, NISTEP Discussion Paper No. 175, MEXT, Japan, Tokyo, (in Japanese).

Nagaoka, S., Motohashi, K. and Goto, A. (2010), “Patent statistics as an innovation indicator”, in Hall, B. and Rosenberg, N. (Eds), Handbook of the Economics of Innovation, Elsevier Science, North Holland, Vol. 2.

Park, M. and Geum, Y. (2022), “Two-stage technology opportunity discovery for firm-level decision making: GCN-based link-prediction approach”, Technological Forecasting and Social Change, Vol. 183, p. 121934.

Sugawara, K., Kobayashi, H. and Iwasaki, M. (2016), “On approximately searching for similar word embeddings”, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Trajtenberg, M. (2018), “Artificial intelligence as the next GPT: a Political-Economy perspective”, NBER Working Paper #24245, Cambridge MA.

Wang, Y., Roijakkers, N. and Vanhaverbeke, W. (2014), “How fast do Chinese firms learn and catch up? Evidence from patent citations”, Scientometrics, Vol. 98 No. 1, pp. 743-761.

Wang, B., Wang, A., Chen, F., Wang, Y. and Kuo, C. (2019), “Evaluating word embedding models: methods and experimental results”, APSIPA Transactions on Signal and Information Processing, Vol. 8 No. 1, p. e19.

Younge, K.A. and Kuhn, J.M. (2016), Patent-to-Patent Similarity: A Vector Space Model, SSRN.

Acknowledgements

This study is conducted as part of the Project “Digitalization and Innovation Ecosystem: A Holistic Approach” undertaken at the Research Institute of Economy, Trade, and Industry (RIETI). In addition, financial support from JSPS-KAKEN Fostering Joint International Research Program B (Grant No.19K0035) is acknowledged. The authors would like to thank the participants of the discussion seminar at RIETI for their helpful comments.

In total, there are 680,241 US patents and 427,628 Chinese patents. The abstracts of CNIPR patents are translated into English, so that all documents are in English.

It should be noted that differences in the type of document (USPTO or CNIPR patents) do not cause this pattern, as discussed in Section 2 based on the validation of document embedding with patent family information.

Corresponding author

Chen Zhu can be contacted at: zhujohn0425@g.ecc.u-tokyo.ac.jp