Tuesday 15 March 2011

optimization - PostgreSQL: How to optimize my database for storing and querying a huge graph


I am running PostgreSQL 8.3 on a 1.83 GHz Intel Core Duo Mac Mini with 1 GB of RAM and Mac OS X 10.5.8. I have stored a huge graph in my PostgreSQL database; it has 1.6 million nodes and 30 million edges. My database schema looks like this:

  CREATE TABLE nodes (id INTEGER PRIMARY KEY, title VARCHAR(256));
  CREATE TABLE edges (id INTEGER, link INTEGER, PRIMARY KEY (id, link));
  CREATE INDEX id_idx ON edges (id);
  CREATE INDEX link_idx ON edges (link);

The data in the table edges looks like this:

  id  link
  1   234
  1   88865
  1   6
  2   365
  2   12
  ...

So for each node with id x, it stores the outgoing link to id y.

The time to find all outgoing links is fine:

  =# EXPLAIN ANALYZE SELECT link FROM edges WHERE id = 4620;
                                QUERY PLAN
  ---------------------------------------------------------------------------------
   Index Scan using id_idx on edges  (cost=0.00..101.61 rows=3067 width=4) (actual time=135.507..157.982 rows=1052 loops=1)
     Index Cond: (id = 4620)
   Total runtime: 158.348 ms
  (3 rows)

However, if I search for the incoming links to a node, the database is more than 100 times slower (although the resulting number of incoming links is only 5-10 times the number of outgoing links):

  =# EXPLAIN ANALYZE SELECT id FROM edges WHERE link = 4620;
                                QUERY PLAN
  ----------------------------------------------------------------------------------
   Bitmap Heap Scan on edges  (cost=846.31..100697.48 rows=51016 width=4) (actual time=322.584..48983.478 rows=26887 loops=1)
     Recheck Cond: (link = 4620)
     ->  Bitmap Index Scan on link_idx  (cost=0.00..833.56 rows=51016 width=0) (actual time=298.132..298.132 rows=26887 loops=1)
           Index Cond: (link = 4620)
   Total runtime: 49001.936 ms
  (5 rows)

I tried to force Postgres not to use a bitmap scan:

  =# SET enable_bitmapscan = false;

But the speed of the query for incoming links did not improve:

  =# EXPLAIN ANALYZE SELECT id FROM edges WHERE link = 1588;
                                QUERY PLAN
  -------------------------------------------------------------------------------------------
   Index Scan using link_idx on edges  (cost=0.00..4777.63 rows=1143 width=4) (actual time=110.302..51275.822 rows=43629 loops=1)
     Index Cond: (link = 1588)
   Total runtime: 51300.041 ms
  (3 rows)

I also increased my shared buffers from 24 MB to 512 MB, but this did not help. So I wonder why my queries for outgoing and incoming links show such asymmetric behavior. Is something wrong with my choice of indexes? Or would it be better to create a third table holding all incoming links for a node with id x? But that would be quite a waste of disk space. Since I'm new to SQL databases, maybe I'm missing something fundamental?

I think that's right.

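You can check this by clustering the edges table on link_idx (which physically reorders the rows by link) and analyzing it after filling the table. The second query should then be fast, and the first should be slow.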
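As a rough sketch of that check, assuming the table and index names from the question: the `correlation` column of the `pg_stats` view indicates how closely the physical row order follows each column's sort order, and `CLUSTER` rewrites the table in index order. Note that `CLUSTER` takes an exclusive lock and rewrites all 30 million rows, so it may take a while on this hardware.

```sql
-- Refresh statistics first so pg_stats reflects the current table contents.
ANALYZE edges;

-- correlation near 1 or -1: rows are stored roughly in that column's order;
-- near 0: matching rows are scattered across the table.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'edges';

-- Physically reorder the table by link_idx, then refresh planner statistics.
-- (Syntax available from PostgreSQL 8.3; older form: CLUSTER link_idx ON edges;)
CLUSTER edges USING link_idx;
ANALYZE edges;

-- The incoming-link query should now read mostly contiguous pages.
EXPLAIN ANALYZE SELECT id FROM edges WHERE link = 4620;
```

If the asymmetry comes from physical row order, the timings of the two queries from the question should roughly swap after this.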

To keep both queries fast, you will have to denormalize by using a second table, as you proposed. Just remember to cluster and analyze this second table after loading your data, so that all edges linking to a node are physically grouped together.
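A minimal sketch of such a second table follows. The names `edges_incoming` and `edges_incoming_link_idx` are placeholders chosen for illustration, not anything from the question.

```sql
-- Hypothetical reverse-lookup table holding the same rows as edges.
CREATE TABLE edges_incoming AS
    SELECT link, id FROM edges;

CREATE INDEX edges_incoming_link_idx ON edges_incoming (link);

-- Group all rows for the same link physically together, then update statistics.
CLUSTER edges_incoming USING edges_incoming_link_idx;
ANALYZE edges_incoming;

-- Outgoing links keep using edges; incoming links use the new table.
SELECT id FROM edges_incoming WHERE link = 4620;
```

Clustering is a one-time operation, so the table would need to be re-clustered after large loads, and both tables have to be kept in sync whenever edges change.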

If you will not query this all the time and you do not want to store and back up this second table, you can create it temporarily before querying:

  CREATE TEMPORARY TABLE edges_backwards AS SELECT link, id FROM edges ORDER BY link, id;
  CREATE INDEX edges_backwards_link_idx ON edges_backwards (link);

You do not need to cluster this temporary table, because it will be physically ordered right on creation. This does not make sense for a single query, but can help for several queries in a row.
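For instance, once the temporary table exists, a batch of incoming-link lookups in the same session might look like the following sketch. The `ANALYZE` step is an addition here: autovacuum does not process temporary tables, so statistics have to be collected by hand. The node ids are the ones used in the question.

```sql
-- Temporary tables are not analyzed automatically; collect statistics once.
ANALYZE edges_backwards;

-- Several lookups in a row reuse the same physically ordered table.
SELECT id FROM edges_backwards WHERE link = 4620;
SELECT id FROM edges_backwards WHERE link = 1588;

-- The table disappears when the session ends; it can also be dropped explicitly.
DROP TABLE edges_backwards;
```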

