Why Gnutella Can't Scale. No, Really.

by Jordan Ritter

Foreword
------------------------------------------------------------------------

In the spring of 2000, when Gnutella was a hot topic on everyone's mind, a concerned few of us in the open-source community just sat back and shook our heads. Something just wasn't right. Any competent network engineer who observed a running Gnutella application could tell you, through simple empirical observation alone, that the application was an incredible burden on modern networks and would probably never scale. I myself was simply stupefied at the gross abuse of my limited bandwidth -- and that was just over DSL. Other informative articles persist on O'Reilly's P2P website, and elsewhere.

So where's my paper, and why haven't you seen it? Well, in case you didn't know, I'm one of the founding developers of Napster, and for several good reasons, including the sobering fact that I was one of the leaders of the main competitor, I did not release my material to the public. Several times I resigned myself to rewriting my paper to accommodate the release of new information and analyses, but I never finished. Now I regret having sat on this for so long, for every paper on Gnutella that has come out in the last year has served as nothing but vindication of my conclusion from so early on: Gnutella will never scale.

Following is what remains of my paper, hacked up, sliced, diced and rewritten. The information and analyses are still useful, but as I just said, the conclusions are the same. This paper simply proves those conclusions through mathematics.

Onward, Through the Fog
------------------------------------------------------------------------

This paper assumes a working knowledge of Gnutella networks and internals, and therefore uses terminology and phrasing specific to Gnutella. If the wording seems somewhat strange or foreign to you, please stop reading this paper and seek other documentation before proceeding.
Furthermore, the explanation of the accompanying math is intentionally terse. Every effort has been made to verify the accuracy of the equations herein, but the discussion is intentionally limited to what is solely relevant to Gnutella, in order to keep distraction from an already complex topic to a minimum.

To Scale, or Not to Scale

Scaling Gnutella will require more than just better resource management tools -- in its current incarnation, Gnutella is mathematically and technologically unable to scale to a network of any reasonably large size. Following herein is a discussion focused on mathematically describing the metrics of a GnutellaNet topology, and on using the derived equations to interpret and visualize realistic limits of the technology. In order to keep the math as simple as possible, let's assume we're examining a relatively quiet GnutellaNet, and dissect the flow of information one step at a time.

Variables and Equations

P           The number of users connected to the GnutellaNet.

N           The number of connections held open to other servents in the
            network. In the default configuration of the original
            Gnutella client, this is 4.

T           Our TTL, or Time To Live, on packets. TTLs are used to age a
            packet and ensure that it is relayed a finite number of times
            before being discarded.

B           The amount of available bandwidth, or alternatively, the
            maximum capacity of the network transport.

f(n, x, y)  A function describing the maximum number of reachable users
            that are at least x hops away, but no more than y hops away.

            f(n, x, y) = Sum[((n-1)^(t-1))*n, t = x->y]

g(n, t)     A function describing the maximum number of reachable users
            for any given n and t.

            g(n, t) = f(n, 1, t)

h(n, t, s)  A function describing the maximum amount of bandwidth
            generated by relaying a transmission of s bytes given any n
            and t. Generation is defined as the formulation and outbound
            delivery of data.
            h(n, t, s) = n*s + f(n, 1, t-1)*(n-1)*s

i(n, t, s)  A function describing the maximum amount of bandwidth
            incurred by relaying a transmission of s bytes given any n
            and t. Incurrence is defined as the reception or transmission
            of data across a unique connection to a network.

            i(n, t, s) = (1 + f(n, 1, t-1))*n*s + f(n, t, t)*s

It benefits the casual reader to explain first in terms of a balanced, equally distributed GnutellaNet, so for this exercise assume that everyone has the same N and T. In the initial release of Gnutella, N = 4 and T = 5. Further, let P = 2000, arbitrarily. Finally, let us assume no other interfering factors exist (for now).

Early reports of Gnutella's usage claimed upwards of 2000 to 4000 users on the GnutellaNet. This is significant because these reports inaccurately implied that all 4,000 users on the GnutellaNet were reachable and searchable. The reality is that even in an ideally balanced GnutellaNet, P is never relevant to your potential reach; N and T are the only limiting factors.

Reachable Users

      T=1   T=2   T=3    T=4     T=5      T=6        T=7        T=8
N=2     2     4     6      8      10       12         14         16
N=3     3     9    21     45      93      189        381        765
N=4     4    16    52    160     484    1,456      4,372     13,120
N=5     5    25   105    425   1,705    6,825     27,305    109,225
N=6     6    36   186    936   4,686   23,436    117,186    585,936
N=7     7    49   301  1,813  10,885   65,317    391,909  2,351,461
N=8     8    64   456  3,200  22,408  156,864  1,098,056  7,686,400

Raising N (the number of connections held open) and T (the number of hops) extends the number of reachable users geometrically. Keep in mind that the above illustrates potential reach given two assumptions: the network is fully balanced, and everyone shares the same N and T.

So, the next obvious step for an intrepid and now better-informed Gnutella user is to increase N and T, so as to extend their potential reach into the GnutellaNet web. Not so fast! As your reach increases geometrically, so does the amount of bandwidth generated and incurred. Let's now move the discussion toward B.
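Before moving on, the four functions above can be made concrete with a small sketch. The code and names below are mine, not part of the original analysis; it simply transcribes the equations:

```python
def f(n, x, y):
    # Maximum number of users at least x hops away but no more than y:
    # in an ideal balanced net, tier t contains ((n-1)^(t-1))*n nodes.
    return sum(((n - 1) ** (t - 1)) * n for t in range(x, y + 1))

def g(n, t):
    # Maximum number of users reachable within t hops.
    return f(n, 1, t)

def h(n, t, s):
    # Bandwidth generated relaying s bytes: the originator transmits on
    # n links, and each node in tiers 1..t-1 relays on its other n-1.
    return n * s + f(n, 1, t - 1) * (n - 1) * s

def i(n, t, s):
    # Bandwidth incurred: every link crossing counts as both a
    # transmission and a reception.
    return (1 + f(n, 1, t - 1)) * n * s + f(n, t, t) * s

# Original client defaults (N=4, T=5) and an 83-byte query:
print(g(4, 5))      # 484 reachable users
print(h(4, 5, 83))  # 40172 bytes generated
print(i(4, 5, 83))  # 80344 bytes incurred
```

These values reproduce the N=4, T=5 entry in the Reachable Users table above, and the corresponding entries in the bandwidth tables later in the paper.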
Delving Deeper into B

Before proceeding, it is very important to understand that many assumptions must be made in order to carry out these computations. Observed characteristics of GnutellaNet topologies simply vary too widely to generalize accurately. That said, I still believe that there exists a statistical mean for each characteristic of a GnutellaNet, which is to say that if I were to take a snapshot of the current topology of a public GnutellaNet, I could derive an average N, T, and so forth. While potentially inaccurate as a realistic representation, these means can still produce a useful generalization.

In our discussion of B, there are really two different perspectives on how to measure the amount of bandwidth: the amount generated, and the amount incurred. This is a very important distinction to make, because knowing the amount of raw data generated is statistically useful, but understanding the bandwidth cost incurred by individual events in the network is much more important, since it more realistically signifies the impact on an Internet connection. As previously stated, h(n, t, s) represents the amount of bandwidth generated by relaying a packet through the network, counting only data that is outbound to another destination. i(n, t, s), on the other hand, counts all outbound and inbound transmissions, yielding a more accurate perspective on bandwidth usage.

Let's introduce an example. Joe Smith likes classic rock, and is desperately searching for any live recordings of The Grateful Dead. Joe loads up his Gnutella client, connects to the GnutellaNet, and executes his search, "grateful dead live". What actually happens?
Search Query Packet Makeup

IP header            20 bytes
TCP header           20 bytes
Gnutella header      23 bytes
Minimum speed         2 bytes
Search string        18 bytes
                     --------
Total:               83 bytes

Bandwidth Generated in Bytes (S=83)

      T=1    T=2     T=3      T=4        T=5         T=6         T=7          T=8
N=2   166    332     498      664        830         996       1,162        1,328
N=3   249    747   1,743    3,735      7,719      15,687      31,623       63,495
N=4   332  1,328   4,316   13,280     40,172     120,848     362,876    1,088,960
N=5   415  2,075   8,715   35,275    141,515     566,475   2,266,315    9,065,675
N=6   498  2,988  15,438   77,688    388,938   1,945,188   9,726,438   48,632,688
N=7   581  4,067  24,983  150,479    903,455   5,421,311  32,528,447  195,171,263
N=8   664  5,312  37,848  265,600  1,859,864  13,019,712  91,138,648  637,971,200

From above, given a concurrent demographic comparable to Napster's (assuming an equally balanced network), searching for the simple 18-byte string "grateful dead live" unleashes 90 megabytes' worth of data to be transmitted. Even so, I don't consider h(n, t, s) to be the best measure. Let's now look at i(n, t, s), which comprises the originating transmission, 1 reception and N-1 transmissions for tiers 1 through T-1, and 1 reception for the last tier.

Bandwidth Incurred in Bytes (S=83)

        T=1     T=2     T=3      T=4        T=5         T=6          T=7            T=8
N=2     332     664     996    1,328      1,660       1,992        2,324          2,656
N=3     498   1,494   3,486    7,470     15,438      31,374       63,246        126,990
N=4     664   2,656   8,632   26,560     80,344     241,696      725,752      2,177,920
N=5     830   4,150  17,430   70,550    283,030   1,132,950    4,532,630     18,131,350
N=6     996   5,976  30,876  155,376    777,876   3,890,376   19,452,876     97,265,376
N=7   1,162   8,134  49,966  300,958  1,806,910  10,842,622   65,056,894    390,342,526
N=8   1,328  10,624  75,696  531,200  3,719,728  26,039,424  182,277,296  1,275,942,400

i(n, t, s) has the unique property of being exactly double h(n, t, s). From above, a whopping 1.2 gigabytes of aggregate data could potentially cross everyone's networks, just to relay an 18-byte search query. This is of course where Gnutella suffers greatly from being fully distributed. Also, let's not forget that there is no consideration of time in this set of calculations. In the average case, 1.2 gigabytes' worth of data takes a very long time to generate and propagate through the Internet.
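The doubling relationship between incurred and generated bandwidth is not an accident of the chosen parameters. A quick sketch (my code, transcribing the paper's formulas) verifies i(n, t, s) = 2 * h(n, t, s) across the entire range covered by the tables above:

```python
def f(n, x, y):
    return sum(((n - 1) ** (t - 1)) * n for t in range(x, y + 1))

def h(n, t, s):
    # Bandwidth generated (outbound deliveries only).
    return n * s + f(n, 1, t - 1) * (n - 1) * s

def i(n, t, s):
    # Bandwidth incurred (each link crossing counted on both ends).
    return (1 + f(n, 1, t - 1)) * n * s + f(n, t, t) * s

# Every generated transmission is received exactly once on the far end
# of its link, so incurred bandwidth is always double generated bandwidth.
assert all(i(n, t, 83) == 2 * h(n, t, 83)
           for n in range(2, 9) for t in range(1, 9))
```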
However, even in more realistic cases, propagating a few megabytes' worth of data through several hundred thousand nodes across the Internet still takes a considerable amount of time. At this point, though, our exercise is still incomplete. What percentage of Gnutella clients share content? Of them, what percentage are likely to respond to Joe's query? And of those, what would be the mean number of responses, and their mean length?

The Anatomy of a Firestorm
------------------------------------------------------------------------

This is where we'll begin to see generalizations diverging from reality. Still, though, let's take a quick gander at what evangelists thought Gnutella would be capable of. For this, we'll need to introduce a few more variables and equations.

More Variables and Equations

a           Mean percentage of users who typically share content.

b           Mean percentage of users who typically have responses to
            search queries.

r           Mean number of search responses the typical respondent
            offers.

l           Mean length of search responses the typical respondent
            offers.

R           A function representing the Response Factor, a constant
            value that describes the product of the percentage of users
            responding and the amount of data generated by each user.

            R = (a*b) * (88 + r*(10 + l))

j(n, T, R)  A function describing the amount of data generated in
            response to a search query by tier T, given any n and
            Response Factor R.

            j(n, T, R) = f(n, T, T) * R

k(n, t, R)  A function describing the maximum amount of bandwidth
            generated in response to a search query, including relayed
            data, given any n and t and Response Factor R.

            k(n, t, R) = Sum[ j(n, T, R) * T, T = 1->t ]

Assuming that a mean exists for each characteristic we measure makes these calculations much simpler. Recall that I believe this assumption to hold: at any given moment there does exist some measurable a, b, r and l.
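The response-side equations can be sketched the same way (again, the code and the response_factor name are mine, not the paper's; f() is the reachability function defined earlier in the paper):

```python
def f(n, x, y):
    return sum(((n - 1) ** (t - 1)) * n for t in range(x, y + 1))

def response_factor(a, b, r, l):
    # R: the fraction of reachable users who answer (a*b), times the
    # size of one response packet, 88 + r*(10 + l) bytes.
    return (a * b) * (88 + r * (10 + l))

def j(n, T, R):
    # Response data generated by the respondents exactly T hops away.
    return f(n, T, T) * R

def k(n, t, R):
    # Total response bandwidth: tier T's responses are relayed back
    # across T hops, hence the factor of T inside the sum.
    return sum(j(n, T, R) * T for T in range(1, t + 1))

# The first test case used below: a=30%, b=50%, r=5, l=40.
R = response_factor(0.30, 0.50, 5, 40)
print(round(R, 2))        # 50.7
print(round(k(4, 5, R)))  # 110932 bytes for the default N=4, T=5
```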
Let's assume conservative estimates for now, and apply observed behaviour from other reports later. The difficulty in gauging the sheer amount of data coming back to us stems from our inability to realistically discern where in the partial mesh of connections the data is coming from. By design, the only thing we will know about the packets received is the (hopefully) unique message ID. If the message ID correlates to the message ID of one of our pending queries, the response is ours. Otherwise, the response is someone else's traffic, and if it correlates to a known ID in our routing table, it is simply passed along.

Search Response Packet Makeup

IP header            20 bytes
TCP header           20 bytes
Gnutella header      23 bytes
Number of hits        1 byte
Port                  1 byte
IP Address            4 bytes
Speed                 3 bytes
Result Set           r * (8 + l + 2) bytes
Servent Identifier   16 bytes
                     ---------------------
Total:               88 + r*(10 + l) bytes

Let's take a look now at what varying N and T yields in terms of bandwidth costs. For our first case, let's choose some reasonable values: a = 30%, b = 50%, r = 5 and l = 40, giving R = 50.7.

Bandwidth Generated in Bytes (R=50.7)

        T=1      T=2       T=3       T=4        T=5         T=6          T=7            T=8
N=2
N=3   152.1    760.5   2,585.7   7,452.9   19,620.9    48,824.1      116,965        272,715
N=4   202.8  1,419.6   6,895.2  28,797.6    110,932     496,614    1,441,500      4,989,690
N=5   253.5  2,281.5  14,449.5  79,345.5    403,826   1,961,330    9,229,680     42,456,400
N=6   304.2  3,346.2  26,161.2   178,261  1,128,890   6,832,640   40,104,500    230,230,000
N=7   354.9  4,613.7  42,942.9   349,577  2,649,330  19,207,500  135,115,000    929,909,000
N=8   405.6    6,084  65,707.2   622,190  5,491,420  46,392,900  380,422,000  3,052,650,000

(Precision is limited to 6 or fewer digits; sorry, I don't know how to make Mathematica behave differently in this case.)

With 30% of Gnutella users sharing, and only half of them responding, the standard client settings yield over 14MB of return responses.
I believe this particular R value to be near reality as far as the percentages are concerned, but r and l are probably conservative, given recent reports by Clip2 DSS and others. Let's raise R a bit; here's R = 72.

Bandwidth Generated in Bytes (R=72)

      T=1    T=2     T=3      T=4        T=5         T=6          T=7            T=8
N=2
N=3   216  1,080   3,672   10,584     27,864      69,336      166,104        387,288
N=4   288  2,016   9,792   40,896    157,536     577,440    2,047,104      7,085,952
N=5   360  3,240  20,520  112,680    573,480   2,785,320   13,107,240     60,293,160
N=6   432  4,752  37,152  253,152  1,603,152   9,703,152   56,953,152    326,953,152
N=7   504  6,552  60,984  496,440  3,762,360  27,276,984  191,879,352  1,320,581,304
N=8   576  8,640  93,312  883,584  7,798,464  65,883,456  540,244,224  4,335,130,368

These different values don't appear to have much of an impact on the overall bottom line; just over 13MB of traffic is generated in response with standard client settings. Let's take one more look and adjust some of the values: a = 30%, b = 40%, r = 10 and l = 60, giving R = 94.56. I believe this R to be the most realistic.

Bandwidth Generated in Bytes (R=94.56)

         T=1       T=2       T=3        T=4         T=5         T=6          T=7            T=8
N=2
N=3   283.68   1,418.4  4,822.56   13,900.3    36,594.7    91,061.3      218,150        508,638
N=4   378.24  2,647.68  12,860.2   53,710.1     206,897     758,371    2,688,530      9,306,220
N=5    472.8   4,255.2  26,949.6    147,986     753,170   3,658,050   17,214,200     79,185,000
N=6   567.36  6,240.96    48,793    332,473   2,105,470  12,743,500   74,798,500    429,398,000
N=7   661.92  8,604.96  80,092.3    651,991   4,941,123  35,823,800  252,002,000  1,734,360,000
N=8   756.48  11,347.2   122,550  1,160,440  10,242,000  86,526,900  709,521,000  5,693,470,000

Standard client settings yield a whopping 17MB generated in response to Joe's search query.

In order to better understand the results above, one must understand the Response Factor, R, and the reasoning behind it. Recent analyses of Gnutella networks show a small percentage of participants actually sharing content, and a disproportionately small percentage of those sharing actually having most of the content.
It is highly improbable that any means exists to statistically describe the widely varying response characteristics of participants in a GnutellaNet. R is a compromise for this difficult task, representing a gross mean, across an ideal GnutellaNet, of the responses we can expect the average query to generate. The key word here is ideal; we know these gross means exist, but they are as yet unmeasurable, or at least at this point unverifiable, given the quickly changing network topology.

Bringing it all together

So, now that we have all the pieces of the puzzle, let's fit them together. How much aggregate data, including request and response, is generated by Joe's search for "grateful dead live"? Let's combine h(n, t, s) with k(n, t, R) to get The Big Picture.

Bandwidth Generated in Bytes (S=83, R=94.56)

           T=1       T=2       T=3        T=4         T=5         T=6          T=7            T=8
N=2
N=3     532.68   2,165.4  6,565.56   17,635.3    44,313.7     106,748      249,773        572,133
N=4     710.24  3,975.68  17,176.2   66,990.1     247,069     879,219    3,051,410     10,395,200
N=5      887.8   6,330.2  35,664.6    183,261     894,685   4,224,530   19,480,500     88,250,700
N=6   1,065.36  9,228.96    64,231    410,161   2,494,410  14,688,700   84,524,900    478,031,000
N=7   1,242.92    12,672   105,075    802,470   5,844,690  31,245,100  284,530,000  1,929,530,000
N=8   1,420.48  16,659.2   160,398  1,426,040  12,101,800  99,546,700  800,659,000  6,331,440,000

The Big Picture: h(n, t, s) and k(n, t, R) combined.

What's really stunning about the above table is the stark realization that, in supporting numbers of users comparable to Napster's, Gnutella would generate an unbelievably significant 800MB-plus worth of data for just one of those users to search the entire network for "grateful dead live" and receive responses.

Our job is still not finished, though. What remains is to apply these statistics to observed query rates, to gain an understanding of the real-time impact of a GnutellaNet on a network.

Behold, The Firestorm

When Napster, Inc.
was served with an injunction designed to halt all file-sharing service through the Napster network, Gnutella and similar services experienced what is now commonly referred to as the "Napster Flood". While an inordinate number of users perceived the injunction as their personal charge to download as much as possible from Napster before the service was brought down, a great many still flocked to other file-sharing services such as Gnutella. During this period, Clip2 DSS observed query rates peaking at 10 queries per second, double the normal 3-5 per second. Exceeding 10 qps during periods of heavy usage these days is not unlikely.

The final item of interest in this paper is the extrapolation of bandwidth rates (per second) from the bandwidth costs calculated above and the observed query rates. For thoroughness, query rates for a quiet (3 qps), normal (5 qps), and burdened (10 qps) GnutellaNet are examined. For each test case, the main assumption is that Joe Smith's behaviour satisfies the typical user demographic.
Bandwidth rates for 3 qps (S=83, R=94.56)

          T=1        T=2        T=3        T=4        T=5        T=6        T=7      T=8
N=2
N=3   1.6KBps    6.5KBps   19.7KBps   52.9KBps  132.9KBps  320.2KBps  749.3KBps  1.7MBps
N=4   2.1KBps   11.9KBps   51.5KBps     201KBps    741KBps    2.6MBps    9.1MBps 31.2MBps
N=5   2.7KBps     19KBps    107KBps  548.8KBps    2.7MBps   12.7MBps   58.4MBps   264MBps
N=6   3.2KBps   27.7KBps  192.7KBps    1.2MBps    7.5MBps   44.1MBps  253.6MBps  1.4GBps
N=7   3.7KBps   38.1KBps  315.2KBps    2.4MBps   17.5MBps  123.7MBps  853.6MBps  5.8GBps
N=8   4.2KBps     50KBps  481.2KBps    4.3MBps   36.3MBps  298.6MBps    2.4GBps   19GBps

Bandwidth rates for 5 qps (S=83, R=94.56)

          T=1        T=2        T=3        T=4        T=5        T=6        T=7       T=8
N=2
N=3   2.7KBps   10.8KBps   32.8KBps   88.1KBps  221.6KBps  533.7KBps    1.2MBps   2.9MBps
N=4   3.6KBps   19.9KBps   85.9KBps     335KBps    1.2MBps    4.4MBps   15.3MBps    52MBps
N=5   4.4KBps   31.7KBps  178.3KBps  916.3KBps    4.5MBps   21.1MBps   97.4MBps 441.3MBps
N=6   5.3KBps   46.1KBps  321.2KBps    2.1MBps   12.5MBps   73.4MBps  422.6MBps   2.4GBps
N=7   6.2KBps   63.4KBps  525.4KBps      4MBps   29.2MBps  206.2MBps    1.4GBps   9.6GBps
N=8   7.1KBps   83.3KBps    802KBps    7.1MBps   60.5MBps  497.7MBps      4GBps  31.7GBps

Bandwidth rates for 10 qps (S=83, R=94.56)

           T=1        T=2        T=3        T=4        T=5        T=6        T=7        T=8
N=2
N=3    5.4KBps   21.6KBps   65.6KBps  176.2KBps  443.2KBps    1.1MBps    2.4MBps    5.8MBps
N=4    7.2KBps   39.8KBps  171.8KBps    670KBps    2.4MBps    8.8MBps   30.6MBps    104MBps
N=5    8.8KBps   63.4KBps  356.6KBps    1.8MBps      9MBps   42.2MBps  194.8MBps  882.6MBps
N=6   10.6KBps   92.2KBps  642.4KBps    4.2MBps     25MBps  146.8MBps  845.2MBps    4.8GBps
N=7   12.4KBps  126.8KBps    1.1MBps      8MBps   58.4MBps  412.4MBps    2.8GBps   19.2GBps
N=8   14.2KBps  166.6KBps    1.6MBps   14.2MBps    121MBps  995.4MBps      8GBps   63.4GBps

Keeping things in Perspective

From the charts above, it becomes mind-numbingly clear that the Gnutella distributed architecture is fundamentally flawed and can have a horrific impact on any network. On a slow day, a GnutellaNet would have to move 2.4 gigabytes per second in order to support numbers of users comparable to Napster. On a heavy day, 8 gigabytes per second.
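The rate tables above are just the per-query aggregate cost (query plus responses) scaled by the observed query rate; a sketch under the same assumptions (code mine, using the paper's formulas):

```python
def f(n, x, y):
    return sum(((n - 1) ** (t - 1)) * n for t in range(x, y + 1))

def h(n, t, s):
    # Query bandwidth generated (outbound relays of the s-byte query).
    return n * s + f(n, 1, t - 1) * (n - 1) * s

def k(n, t, R):
    # Response bandwidth generated (tier T's responses relayed T hops).
    return sum(f(n, T, T) * R * T for T in range(1, t + 1))

def rate(qps, n, t, s=83, R=94.56):
    # Sustained bandwidth: per-query aggregate cost times query rate.
    return qps * (h(n, t, s) + k(n, t, R))

# A "slow" day (3 qps) versus a burdened day (10 qps), at N=8, T=7:
print(rate(3, 8, 7) / 1e9)   # roughly 2.4 GBps
print(rate(10, 8, 7) / 1e9)  # roughly 8 GBps
```

These reproduce the 2.4GBps and 8GBps figures quoted above for the N=8, T=7 configuration.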
A lot of potentially obscure assumptions are made here, though, and they should be carefully examined and understood before drawing conclusions:

  * the test GnutellaNet is ideal, which is to say that all participants
    form a topology conforming to g(n, t);
  * being ideal, its topology is static -- meaning all responses to a
    search query are received by the requestor, without being cut off by
    transient nodes;
  * query rates are constant;
  * query demographics correlate to the average case presented above;
  * all GnutellaNet participants are capable of supporting the bandwidth
    rates incurred;
  * search queries and responses represent the only relevant and
    bandwidth-significant activity on the GnutellaNet.

So why should the above charts be taken with a grain of salt? Well, the real GnutellaNet that exists today is certainly not ideal, and has occasionally been observed persisting as several smaller, fractured GnutellaNets. Also, there is a great deal of transience in the GnutellaNet; observations show that only roughly 30-40% of participants remain for 24 hours or more. And it should be obvious to even the most casual observer that query rates are not constant, and are more likely to burst and lull as the topology shifts and usage varies.

One important factor in evaluating the usefulness of the above is the usage demographic. Current usage may show 3-5 queries per second with anywhere between 4,000 and 8,000 users, but if Gnutella were ever to grow in size, both in users and consequently in files, search rates would likely increase dramatically. This would be for at least two reasons: more users equates to more people interested in locating content, which equates to more aggregate queries per second; and more content equates to wider variance in the type of material, which equates to, quite simply, more to search for.
So, applying query rates involving only thousands of users to GnutellaNet populations orders of magnitude greater in size is probably inaccurate; instead, at greater sizes, the bandwidth rates computed above are probably much too small. Indeed, one can extrapolate from the above, using a test case of 1,000,000 users:

  * 8,000 users generate 5 queries per second, which simplified means
  * 1,600 users generate 1 query per second, which then leads to
  * 1,000,000 users / 1,600 users per query per second == 625 queries
    per second

Therefore it is more likely that, given an ideal GnutellaNet and a capable Internet, Gnutella with one million users would generate 625 queries per second, rather than our test case's 5 qps -- which by itself already generates 4GBps worth of traffic. So how much data does a query rate of 625 qps generate? The calculation is left as a thoughtful exercise for the reader.

Most important of all, though, the above numbers assume a capable network connection exists for all participants. If networks weren't capable of relaying the amounts of traffic discussed above, traffic jams would occur: query rates would drop, query response rates would drop, and overall traffic rates would drop as a result. And we know they aren't capable; we know that a significant percentage of participants are dialup users, whose low bandwidth capabilities cause significant traffic congestion and topology fragmentation when improperly configured.

Conclusions
------------------------------------------------------------------------

Even though many assumptions were made throughout the course of these calculations, some of them provably unrealistic, these exercises still yield a useful perspective. In an ideal world, Gnutella is truly a "broadband killer app" in the most literal of senses -- it can easily bring the Internet infrastructure to its knees.
And it should also be noted that only search query and response traffic was accounted for, omitting various other types of Gnutella traffic such as PING and PONG, and most importantly, the bandwidth costs incurred by actual file transfers. 2.4GBps is just search and response traffic; what about the obnoxiously large amount of bandwidth necessary to transfer files between clients?

Those reading this paper should be careful to note that non-intended uses of the GnutellaNet also incur noticeable bandwidth hits: using search queries to chat with other participants, SPAM placed inside search queries and results to advertise various things, and gibberish, typically resulting from misbehaving users or clients.

Furthermore, with individuals writing their own clients and protocol extensions, we may begin to see loop detection rendered useless. Depending on how individual clients implement loop detection (comparing message IDs versus comparing message IDs plus a checksum of the packet's payload), protocol extensions may interfere with legacy clients and result in more traffic than necessary being generated and relayed.

The main argument against this paper is that GnutellaNets are never ideal, and as adoption and usage grow, are statistically less likely to be ideal, given the increase in complexity of the topology as the number of participants increases. I would agree with this principle, but I believe it only serves as better proof of the premise: if an ideally distributed and fully capable network generates 2.4GBps to accommodate 1M users (and we already know this figure to be unrealistic in terms of what the modern Internet is capable of), then a poorly distributed network with insufficient bandwidth will certainly not be able to support the same number of participants or the traffic they generate. In other words, again, Gnutella can't scale.
Another key argument against these computations is that they all focus on the center of an ideal GnutellaNet, and that applying this generalization to all configurations of nodes is misleading and inaccurate. Traffic is measured and generalized from a maximizable point; that is, the "center" node will always generate the most traffic given the same configuration throughout, whereas a leaf node in an ideal GnutellaNet generates only a fraction of that bandwidth. However, empirical analysis yields the observation that, in practice, leaf nodes generally don't have only one connection into the GnutellaNet. As a matter of fact, leaf nodes don't tend to occur naturally at all, since it is rarely in a participant's best interest to limit themselves to one connection when weighing bandwidth capacity against search depth. To date I've only observed this happening on a large scale with Reflectors: strategically placed Gnutella "proxies" at high-bandwidth locations on the Internet, aimed at serving dialup and other small-capacity clients. So the inaccuracy of these numbers likely lies in their being, again, much too small. Also, regardless of how intertwined and convoluted the connection paths are, the data path is effectively rendered semi-ideal through loop detection, so the methodology turns out to be more realistic than first thought.

Yet another valid question to raise against the premise is: what is a reasonable size? Is it 100 users? Is it 1,000? Or 100,000? Or 1,000,000? Nothing short of global domination? Discerning what's reasonable is assuredly a subjective comparison; however, I use the phrase interchangeably with original statements like "Gnutella will kick Napster in the pants." Common sense dictates that in order to accomplish that, Gnutella would have to perform more efficiently, scale higher, and be more capable. These exercises prove that, on a perfect level, Gnutella just can't rise to meet the challenge.
Consequently, they prove that on an imperfect level Gnutella has no hope of performing at the same level. In the final assessment, it's painfully obvious that Gnutella needs a complete overhaul. The major architectural flaws are fundamental in nature and cannot be mitigated effectively without redesign at the most basic level. Some intelligent caching could likely benefit the Gnutella architecture, since observation shows that many searches and responses result in repetitive, duplicate transmissions. However, given the transience of GnutellaNet participants and the wide variety of participating clients, it would be difficult to predict with any accuracy how effective such technology would be.

Various efforts claim to be underway to redesign the protocol; among them, gPulp stands out as the farthest along, with message boards and mailing lists set up for those wanting to get involved. But with its mission of consensual changes implemented through a working group, I harbor significant doubt as to whether it will ever be timely and effective at producing an alternative. GnutellaWorld, another revamp effort recently publicized by CNet's news.com, takes the lead on the initiative for developing Gnutella2. J.C. Nicholas, apparently representing GnutellaWorld, claimed in an interview with CNet that Gnutella2 technology would be out "soon". Characterized as an "Internet Earthquake" and promised to be "the greatest revolution since Linux", Gnutella2 sounds more like the same old hype than anything else. And with only 8-9 months under their collective belt as an organization, I personally wonder how far along the effort could be. If this open-source project's quite empty CVS repository, or its apparently dormant mailing lists, present any indication of progress, the Internet probably has some time to go before experiencing the next Internet cataclysm.
Considering GnutellaWorld's intention of supporting 20 million people or more, I can only hope that it's nothing like the original Gnutella.

Permission to reproduce this document in part or in full is granted only under the condition that credit for the work is visibly given to the author, Jordan Ritter.