This is an individual project. Please refrain from looking up solutions for similar projects online.
Download the proxylab-handout.tar file from Canvas. Copy the handout file to a protected directory on the Linux machine where you plan to do your work, and then issue the following command:
linux> tar xvf proxylab-handout.tar
This will generate a handout directory called proxylab-handout. The README file describes the various files.
The first step is implementing a basic sequential proxy that handles HTTP/1.1 GET requests. Other request types, such as POST, are strictly optional.
When started, your proxy should listen for incoming connections on a port whose number will be specified on the command line. Once a connection is established, your proxy should read the entirety of the request from the client and parse the request. It should determine whether the client has sent a valid HTTP request; if so, it can then establish its own connection to the appropriate web server and request the object the client specified. Finally, your proxy should read the server's response and forward it to the client.
When an end user enters a URL such as http://web.mit.edu/index.html into the address bar of a web browser, the browser will send an HTTP request to the proxy that begins with a line that might resemble the following:
GET http://web.mit.edu/index.html HTTP/1.1
In that case, the proxy should parse the request into at least the following fields: the hostname, web.mit.edu; and the path or query and everything following it, /index.html. Use the parseurl function from hw9. That way, the proxy can determine that it should open a connection to web.mit.edu and send an HTTP request of its own starting with a line of the following form:
GET /index.html HTTP/1.0
Note that all lines in an HTTP request end with a carriage return, \r, followed by a newline, \n. Also important is that every HTTP request is terminated by an empty line: "\r\n".
You should notice in the above example that the web browser's request line ends with HTTP/1.1, while the proxy's request line ends with HTTP/1.0. Modern web browsers will generate HTTP/1.1 requests, but your proxy should handle them and forward them as HTTP/1.0 requests.
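As a rough illustration of the two rules above (CRLF line endings and the HTTP/1.0 rewrite), the following C sketch builds the request line the proxy sends and checks whether a buffered request already contains its terminating empty line. The helper names build_request_line and request_is_complete are hypothetical and not part of the handout code, and the sketch assumes the request has been read into a NUL-terminated buffer.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "GET <path> HTTP/1.0\r\n" into out.
 * Returns the number of characters written, or -1 if out is too small. */
int build_request_line(const char *path, char *out, size_t cap)
{
    int n = snprintf(out, cap, "GET %s HTTP/1.0\r\n", path);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    return n;
}

/* Hypothetical helper: a request is complete once the header section is
 * followed by an empty line, i.e. the buffer contains "\r\n\r\n".
 * Assumes buf is NUL-terminated. */
int request_is_complete(const char *buf)
{
    return strstr(buf, "\r\n\r\n") != NULL;
}
```

A proxy built this way would keep reading from the client until request_is_complete reports 1, and would treat a -1 from build_request_line as a malformed (overlong) request rather than exiting.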
It is important to consider that HTTP requests, even just the subset of HTTP/1.0 GET requests, can be incredibly complicated. The textbook describes certain details of HTTP transactions, but you should refer to RFC 1945 for the complete HTTP/1.0 specification. Ideally your HTTP request parser will be fully robust according to the relevant sections of RFC 1945, except for one detail: while the specification allows for multiline request fields, your proxy is not required to properly handle them. Of course, your proxy should never prematurely abort due to a malformed request.
The important request headers for this lab are the Host, User-Agent, Connection, and Proxy-Connection headers:
It is possible that web browsers will attach their own Host headers to their HTTP requests. If that is the case, your proxy should use the same Host header as the browser.
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.3) Gecko/20120305 Firefox/10.0.
Your proxy should send this header as a single line. The User-Agent header identifies the client (in terms of parameters such as the operating system and browser), and web servers often use the identifying information to manipulate the content they serve. Sending this particular User-Agent string may improve, in content and diversity, the material that you get back during simple telnet-style testing.
The Connection and Proxy-Connection headers are used to specify whether a connection will be kept alive after the first request/response exchange is completed. It is perfectly acceptable (and suggested) to have your proxy open a new connection for each request. Specifying close as the value of these headers alerts web servers that your proxy intends to close connections after the first request/response exchange.
Also keep in mind, when analyzing request headers from the browser, that header names are case-insensitive: different browsers may capitalize the same field name differently. Your parsing should therefore compare header names case-insensitively.
To make your headers work, you will have to skip the browser-supplied Connection, User-Agent, and Proxy-Connection headers. You should also check whether the request from the browser already contains a Host header: if it does, use it; if it doesn't, make sure you add one.
For your convenience, the value of the User-Agent header described above is provided to you as a string constant in proxy.c.
Finally, if a browser sends any additional request headers as part of an HTTP request, your proxy should forward them unchanged.
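The header rules above can be sketched as a small classifier that decides, for each header line received from the browser, whether to drop it, treat it as the Host header, or forward it unchanged. This is only an illustrative sketch: the names header_action, header_is, and classify_header are hypothetical, and it relies on the POSIX function strncasecmp for the case-insensitive comparison.

```c
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Hypothetical classification of one browser-supplied header line. */
enum header_action { HDR_FORWARD, HDR_SKIP, HDR_HOST };

/* Return 1 if line starts with the header name followed by ':',
 * comparing the name without regard to case. */
static int header_is(const char *line, const char *name)
{
    size_t n = strlen(name);
    return strncasecmp(line, name, n) == 0 && line[n] == ':';
}

enum header_action classify_header(const char *line)
{
    if (header_is(line, "Connection") ||
        header_is(line, "Proxy-Connection") ||
        header_is(line, "User-Agent"))
        return HDR_SKIP;     /* the proxy sends its own values instead */
    if (header_is(line, "Host"))
        return HDR_HOST;     /* forward it, and remember not to add another */
    return HDR_FORWARD;      /* any other header is passed through unchanged */
}
```

Checking for the colon right after the name keeps a header such as "Hostile:" from being mistaken for Host. A proxy using this sketch would add its own Host header only if no line was classified as HDR_HOST.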
There are two significant classes of port numbers for this lab: HTTP request ports and your proxy's listening port.
The HTTP request port is an optional field in the URL of an HTTP request. That is, the URL may be of the form http://cse-cmpsc311.cse.psu.edu:8080, in which case your proxy should connect to the host cse-cmpsc311.cse.psu.edu on port 8080 instead of the default HTTP port, which is port 80.
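The handout asks you to reuse parseurl from hw9 for this step. For readers who want to see the idea in isolation, here is a minimal sketch of splitting a URL into host, port, and path, with the port defaulting to 80 when the URL does not name one. The function split_url and its signature are hypothetical and are not the hw9 interface; the sketch only accepts a lowercase "http://" prefix.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a URL parser: split "http://host[:port][/path]".
 * Returns 0 on success, -1 if the URL is malformed or a buffer is too small. */
int split_url(const char *url, char *host, size_t host_cap,
              int *port, char *path, size_t path_cap)
{
    const char *prefix = "http://";
    if (strncmp(url, prefix, strlen(prefix)) != 0)
        return -1;
    const char *p = url + strlen(prefix);

    /* The host runs up to the first ':' or '/' (or the end of the string). */
    size_t host_len = strcspn(p, ":/");
    if (host_len == 0 || host_len >= host_cap)
        return -1;
    memcpy(host, p, host_len);
    host[host_len] = '\0';
    p += host_len;

    *port = 80;                         /* default HTTP port */
    if (*p == ':') {
        if (!isdigit((unsigned char)p[1]))
            return -1;
        char *end;
        long v = strtol(p + 1, &end, 10);
        if (v < 1 || v > 65535)
            return -1;
        *port = (int)v;
        p = end;
    }
    if (*p != '\0' && *p != '/')
        return -1;

    /* An empty path means the root object. */
    int n = snprintf(path, path_cap, "%s", *p ? p : "/");
    if (n < 0 || (size_t)n >= path_cap)
        return -1;
    return 0;
}
```

Returning -1 instead of exiting lets the caller reject a bad request and keep serving other clients.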
The listening port is the port on which your proxy should listen for incoming connections. Your proxy should accept a command line argument specifying the listening port number for your proxy. For example, with the following command, your proxy should listen for connections on port 8081:
linux> ./proxy 8081
You may select any non-privileged listening port (greater than 1,024 and less than 65,536) as long as it is not used by other processes. Since each proxy must use a unique listening port and many people will simultaneously be working on each machine, the script port-for-user.pl is provided to help you pick your own personal port number. Use it to generate a port number based on your user ID:
$ ./port-for-user.pl yuw
yuw17: 62346
The port, p, returned by port-for-user.pl is always an even number. So if you need an additional port number, say for the Tiny server, you can safely use ports p and p + 1.
Please don't pick your own random port. If you do, you run the risk of interfering with another user.
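The listening-port rules in this section can be checked when the proxy starts. The sketch below validates the command-line argument against the non-privileged range given above (greater than 1,024 and less than 65,536); the function name parse_listen_port is hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: convert a command-line argument to a listening port.
 * Returns the port number, or -1 if the argument is not a number in the
 * non-privileged range (greater than 1024 and less than 65536). */
int parse_listen_port(const char *arg)
{
    char *end;
    errno = 0;
    long v = strtol(arg, &end, 10);
    if (errno != 0 || end == arg || *end != '\0')
        return -1;              /* empty, non-numeric, or trailing junk */
    if (v <= 1024 || v >= 65536)
        return -1;              /* privileged or out of range */
    return (int)v;
}
```

A main function might call parse_listen_port(argv[1]) after checking argc, and print a usage message if the result is -1.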
For the second part of the lab, you will add a cache to your proxy that stores recently-used Web objects in memory. HTTP actually defines a fairly complex model by which web servers can give instructions as to how the objects they serve should be cached and clients can specify how caches should be used on their behalf. However, your proxy will adopt a simplified approach.
When your proxy receives a web object from a server, it should cache it in memory as it transmits the object to the client. If another client requests the same object from the same server, your proxy need not reconnect to the server; it can simply resend the cached object.
Obviously, if your proxy were to cache every object that is ever requested, it would require an unlimited amount of memory. Moreover, because some web objects are larger than others, it might be the case that one giant object will consume the entire cache, preventing other objects from being cached at all. To avoid those problems, your proxy should have both a maximum cache size and a maximum cache object size.
The entirety of your proxy's cache should have the following maximum size:
When calculating the size of its cache, your proxy must only count bytes used to store the actual web objects; any extraneous bytes, including metadata, should be ignored.
Your proxy should only cache web objects that do not exceed the following maximum size:
For your convenience, both size limits are provided as macros in cache.h. When you are writing your code, you should always use the macro names to refer to these constants instead of using the actual numbers. This will make it easy to adapt your code to different configurations.
To ensure that you cache only web objects that were sent correctly over the network, you should only cache an object if all of the following are true:
The easiest way to implement a correct cache is to allocate a buffer for each active connection that meets the first three conditions stated above and accumulate data as it is received from the server. If the number of bytes received doesn't match the expected size of the content (that is, the fourth condition listed above is not met), an error has likely occurred during network transmission, and the object is most likely corrupted and shouldn't be cached. In that case, make sure you free the memory you are not caching to avoid a memory leak.
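The buffering approach described above might look roughly like the following sketch. It checks only two things: that the object stays within the size limit, and that the number of bytes received equals the expected size (assumed here to come from the response's Content-Length header). The ObjectBuffer type and the objbuf_* functions are hypothetical, and DEMO_MAX_OBJECT_SIZE is a small placeholder; real code should use MAX_OBJECT_SIZE from cache.h.

```c
#include <stdlib.h>
#include <string.h>

/* Placeholder limit for the sketch; use MAX_OBJECT_SIZE from cache.h instead. */
#define DEMO_MAX_OBJECT_SIZE ((size_t)16)

/* Hypothetical per-connection accumulation buffer. */
typedef struct {
    char  *data;       /* heap buffer, NULL once the object is uncacheable */
    size_t len;        /* bytes accumulated so far */
    int    cacheable;  /* cleared when the object is too large */
} ObjectBuffer;

void objbuf_init(ObjectBuffer *b)
{
    b->data = malloc(DEMO_MAX_OBJECT_SIZE);
    b->len = 0;
    b->cacheable = (b->data != NULL);
}

/* Record one chunk received from the server. The caller still forwards
 * the chunk to the client whether or not it can be cached. */
void objbuf_append(ObjectBuffer *b, const void *chunk, size_t n)
{
    if (!b->cacheable)
        return;
    if (n > DEMO_MAX_OBJECT_SIZE - b->len) {   /* object is too large */
        free(b->data);
        b->data = NULL;
        b->cacheable = 0;
        return;
    }
    memcpy(b->data + b->len, chunk, n);
    b->len += n;
}

/* Returns the buffer (ownership passes to the caller) if the byte count
 * matches expected_len; otherwise frees it and returns NULL. */
char *objbuf_finish(ObjectBuffer *b, size_t expected_len)
{
    if (b->cacheable && b->len == expected_len) {
        char *out = b->data;
        b->data = NULL;
        return out;
    }
    free(b->data);
    b->data = NULL;
    b->cacheable = 0;
    return NULL;
}
```

On every path the buffer is either handed to the caller or freed, which is what keeps an uncached object from leaking memory.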
MAX_CACHE_SIZE + MAX_OBJECT_SIZE
Your proxy's cache should employ a least-recently-used (LRU) eviction policy for your sequential proxy server. Notice that both reading an object from the cache and writing it into the cache count as using the object.
We are including cache.h and libcache.a, which you can use directly to implement the cache in your proxy server. The Makefile creates two executables: one using the implementation from the provided libcache.a (called proxy), the other using your implementation in cache.c (called proxycache).
Once your cache works with the supplied cache library, you should write your own cache.c file to implement the few functions described in cache.h. You should implement your cache as a doubly linked list, where each node (a CachedItem) has a pointer to the node in front of it and a pointer to the node behind it.
You will mainly need to implement four functions:
1. void cache_init(CacheList *list); This function initializes the list to an empty list by setting size to 0 and setting first and last to NULL.
2. void cache_URL(const char *URL, const char *headers, void *item, size_t size, CacheList *list); This function adds a new node to the front of the doubly linked list. It uses strdup to allocate its own copies of the URL and headers, but for item it simply stores, and takes ownership of, the pointer to the dynamically allocated memory holding the binary content of the file. It sets the size of the node and updates the overall size of the CacheList accordingly. The hardest part is when there is not enough space to cache the item and stale items must be evicted. Stale items sit at the tail of the list, so keep evicting from the last node, updating the cache size as you go, until there is enough room to add the new node with content of the given size.
Your code should also check whether the size of the item exceeds MAX_OBJECT_SIZE; if it does, free the item pointer and don't allocate a node. Consider all the edge cases for this step: if there were no nodes in the CacheList when inserting, you need to update both the first and last pointers in the CacheList; if there were nodes but some had to be evicted to make room, take care to update the linked list and the first/last pointers correctly.
3. CachedItem *find(const char *URL, CacheList *list); This function looks for a cached item in the CacheList with a matching URL key and returns a pointer to the node if it finds one. If there is no matching node, it returns NULL. But that is not all this function does. To maintain the CacheList's structure and enforce the LRU eviction policy, freshly accessed items belong at the front of the list. So if a matching node is found, it has just been visited and should be unlinked from the middle of the list (or wherever it was) and moved to the very front. Be very careful with all the edge cases for this step: what if the node was already at the front of the list? What if it was at the end of the list? What if it was in the middle?
4. void cache_destruct(CacheList *list); This last function cleans up the whole linked list and frees all memory. It will not be used in your proxylab code, since the proxy server never terminates and you should never clear the cache, but we will use it together with the other functions in a separate unit test set to check the correctness of your implementation.
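The edge cases mentioned for find (node at the front, at the end, or in the middle) are easiest to see on a bare doubly linked list. The sketch below shows one way to move a node to the front. It deliberately uses hypothetical Node and List types rather than the CachedItem and CacheList definitions from cache.h, whose exact fields are not reproduced here, and it leaves out URL lookup, size accounting, and eviction.

```c
#include <stddef.h>

/* Hypothetical node and list types, for illustration only. */
typedef struct Node {
    struct Node *prev, *next;
    int key;
} Node;

typedef struct {
    Node *first, *last;
} List;

/* Insert n at the front. An empty list needs both first and last updated. */
void list_push_front(List *l, Node *n)
{
    n->prev = NULL;
    n->next = l->first;
    if (l->first)
        l->first->prev = n;
    else
        l->last = n;
    l->first = n;
}

/* Detach n from the list, fixing first/last when n sits at either end. */
void list_unlink(List *l, Node *n)
{
    if (n->prev) n->prev->next = n->next; else l->first = n->next;
    if (n->next) n->next->prev = n->prev; else l->last  = n->prev;
    n->prev = n->next = NULL;
}

/* Mark n as most recently used by moving it to the front. */
void list_move_to_front(List *l, Node *n)
{
    if (l->first == n)
        return;                 /* already at the front: nothing to do */
    list_unlink(l, n);
    list_push_front(l, n);
}
```

Splitting the move into unlink followed by push keeps each pointer update in one place, so the front, middle, and tail cases all go through the same two helpers.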
This will be graded out of a total of 60 points:
Your handout materials include an autograder, called driver.sh, that your instructor will use to get preliminary scores for BasicCorrectness and Cache. From the proxylab-handout directory:
You must run the driver on a Linux machine.
The autograder does only simple checks to confirm that your code is acting like a caching proxy. For the final grade, we will do additional manual testing to see how your proxy deals with real pages. Here is a list of some pages that still use the HTTP protocol (as of December 2nd, 2019) that you can use for testing.
As always, you must deliver a program that is robust to errors and even malformed or malicious input. Servers are typically long-running processes, and web proxies are no exception. Think carefully about how long-running processes should react to different types of errors. For many kinds of errors, it is certainly inappropriate for your proxy to immediately exit.
Robustness implies other requirements as well, including invulnerability to error cases like segmentation faults and a lack of memory leaks and file descriptor leaks.
Besides the simple autograder, you will not have any sample inputs or a test program to test your implementation. You will have to come up with your own tests and perhaps even your own testing harness to help you debug your code and decide when you have a correct implementation. This is a valuable skill in the real world, where exact operating conditions are rarely known and reference solutions are often unavailable.
Fortunately there are many tools you can use to debug and test your proxy. Be sure to exercise all code paths and test a representative set of inputs, including base cases, typical cases, and edge cases.
Your handout directory includes the source code for the CS:APP Tiny web server. While not as powerful as thttpd, the CS:APP Tiny web server will be easy for you to modify as you see fit. It's also a reasonable starting point for your proxy code. And it's the server that the driver code uses to fetch pages.
You can use curl to generate HTTP requests to any server, including your own proxy. It is an extremely useful debugging tool. For example, if your proxy and Tiny are both running on the local machine, Tiny is listening on port 8080, and proxy is listening on port 8081, then you can request a page from Tiny via your proxy using the following curl command:
$ curl -v --proxy localhost:8081
GET HTTP/1.
User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.27.1 zlib/1.2.3 libidn/1.18 libssh2/1.4.
Host: localhost:
Accept: */*
Proxy-Connection: Keep-Alive
* Closing connection #
netcat, also known as nc, is a versatile network utility. You can use netcat just like telnet, to open connections to servers. Hence, imagining that your proxy were running on localhost using port 8081, you can do something like the following to manually test your proxy:
$ nc -C localhost 8081
GET yuw17/cmpsc311/vagrant/Vagrantfile HTTP/1.
HTTP/1.1 200 OK
Date: Tue, 03 Dec 2019 16:11:41 GMT
Server: Apache
Last-Modified: Wed, 04 Sep 2019 03:34:17 GMT
ETag: "41a2e68-624-591b1e132e040"
Accept-Ranges: bytes
Content-Length: 1572
Connection: close
Content-Type: text/plain
In addition to being able to connect to Web servers, netcat can also operate as a server itself. With the following command, you can run netcat as a server listening on port 12345:
sh> nc -l 12345
Once you have set up a netcat server, you can generate a request to a phony object on it through your proxy, and you will be able to inspect the exact request that your proxy sent to netcat.
Eventually you should test your proxy using the most recent version of Mozilla Firefox. Visiting About Firefox will automatically update your browser to the most recent version.
To configure Firefox to work with a proxy, visit
It will be very exciting to see your proxy working through a real Web browser. Although the functionality of your proxy will be limited, you will notice that you are able to browse the vast majority of websites through your proxy.
An important caveat is that you must be very careful when testing caching using a Web browser. All modern Web browsers have caches of their own, which you should disable before attempting to test your proxy's cache.
If you also want traffic to localhost (where you are running the Tiny web server) to go through the proxy, you will have to manually change the setting at:
Submit your cache.c and proxy.c files to Gradescope.