Talk:Guild Wars DAT
Preliminary specs
I have added my preliminary specs for this Guild Wars .DAT format. Feel free to update it as necessary, or discuss it here. For starters, the files in there are compressed, but I don't know how. I tried Zlib, to no avail. --Mr.Mouse
--SilentAvenger I disagree with some of the specs there. Using FileMon I looked at how GW opens the file. First, I don't know if the header size is really that, as GW just reads the first 32 bytes straight up. Second, ffna tables are not necessarily 512 bytes in size, but rather about 50 bytes, varying a bit. The data is stored in 256-byte blocks in the file, with the end padded if needed. Also, there is no ffna offset in the header, maybe only an offset to the first block. In an empty archive (achieved by killing the internet connection just as GW starts), there is no ffna block, and the header value is still 00 02 00 00.
Method by which GW opens the file, as far as I have managed to figure out so far: read the first 32 bytes. Go to the MFT pointed to by the offset at 0x10 from the file start. Read 384 bytes of that MFT, then either (I don't know which yet) figure out the size by looking at the self-reference entry, or take the int at 0x0c from the MFT start and multiply by 24 (the size of a file reference). Take the figured-out size, subtract 384, and read. (This reads the whole MFT.) Then, read the block pointed to by the other reference, before the self-reference. In my tests, this block is neither compressed data, MFT, ffna, nor ATEXDXT* (the four types of data you recognize on sight). That's as far as I went.
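For anyone who wants to follow along, here is a minimal C++ sketch of that open sequence. The 0x10 and 0x0c offsets and the 24-byte reference size come from the notes above; the variable names and the choice of the second size guess are made up, so treat it as a guess, not a spec.

 #include <cstdio>
 #include <cstdint>
 #include <cstring>
 #include <vector>
 
 int main(int argc, char** argv)
 {
     if (argc < 2) return 1;
     FILE* f = fopen(argv[1], "rb");
     if (!f) return 1;
 
     // Step 1: read the first 32 bytes straight up, like the exe does.
     uint8_t header[32];
     fread(header, 1, sizeof(header), f);
 
     // Step 2: the uint32 at 0x10 appears to be the absolute MFT offset.
     uint32_t mftOffset;
     memcpy(&mftOffset, header + 0x10, 4);
 
     // Step 3: read the first 384 bytes of the MFT.
     uint8_t mftHead[384];
     fseek(f, (long)mftOffset, SEEK_SET);
     fread(mftHead, 1, sizeof(mftHead), f);
 
     // Step 4 (second guess above): entry count at 0x0c from the MFT
     // start, times 24 bytes per file reference, gives the full MFT size.
     uint32_t entryCount;
     memcpy(&entryCount, mftHead + 0x0c, 4);
     uint32_t mftSize = entryCount * 24;
 
     // Step 5: read the remainder (mftSize - 384); now we have the whole MFT.
     std::vector<uint8_t> rest(mftSize > 384 ? mftSize - 384 : 0);
     fread(rest.data(), 1, rest.size(), f);
 
     printf("MFT at 0x%08x, %u entries, %u bytes total\n", mftOffset, entryCount, mftSize);
     fclose(f);
     return 0;
 }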
Also: ffna's do not store data. Data seems to be stored in compressed blocks, all ending with 08 00 01 80, followed by what I assume is the uncompressed file size. I have not been able to decompress this, but it's probably deflate or something similar. Someone told me, by some obscure method of reasoning, that it might be the method used by gzip, but he then disconnected from mIRC.
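If someone wants to hunt for those compressed blocks, a dumb scanner like the one below (C++; names made up) would at least list the candidates and the uint32 that follows each trailer. Keep in mind 08 00 01 80 may well occur by chance in other data, so expect false positives:

 #include <cstdio>
 #include <cstdint>
 #include <cstring>
 #include <vector>
 
 int main(int argc, char** argv)
 {
     if (argc < 2) return 1;
     FILE* f = fopen(argv[1], "rb");
     if (!f) return 1;
     fseek(f, 0, SEEK_END);
     long len = ftell(f);
     fseek(f, 0, SEEK_SET);
     std::vector<uint8_t> buf((size_t)len);
     fread(buf.data(), 1, buf.size(), f);
     fclose(f);
 
     // The trailer SilentAvenger observed at the end of compressed blocks.
     const uint8_t marker[4] = { 0x08, 0x00, 0x01, 0x80 };
     for (size_t i = 0; i + 8 <= buf.size(); ++i) {
         if (memcmp(&buf[i], marker, 4) == 0) {
             uint32_t rawSize;
             memcpy(&rawSize, &buf[i + 4], 4); // assumed uncompressed size
             printf("trailer at 0x%08lx, claimed uncompressed size %u\n",
                    (unsigned long)i, rawSize);
         }
     }
     return 0;
 }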
There are ATEXDXT* blocks (* being a digit), which a friend of mine thought were uncompressed DXT* textures.
Some ffna's seem to contain a self-reference entry.
About that last uint32 in each reference: I looked at the dat I had and ran some stat checking (will run it on a bigger archive once GW stops downloading). It is unique for every entry *not* pointing to a block that starts with ffna, and repeats for the ffna blocks. So it might not be a hash.
Wasn't sure whether to post this here or in Talk, tell me for next time :)
--Captain Hey man, if you don't agree with the contents, feel free to add corrections to the current page. The common goal is to figure this thing out, and all help is appreciated. :)
--Mr.Mouse I'd rather the discussion takes place here. If you have an alternative spec, please add it below the former one, name it Alternative 1. ;) I'll get back to this discussion soon.
As a matter of fact, here. :)
I disagree with some of the specs there. Using FileMon I looked at how GW opens the file. First, I don't know if the header size is really that, as GW just reads the first 32 bytes straight up. Second, ffna tables are not necessarily 512 bytes in size, but rather about 50 bytes, varying a bit. The data is stored in 256-byte blocks in the file, with the end padded if needed. Also, there is no ffna offset in the header, maybe only an offset to the first block. In an empty archive (achieved by killing the internet connection just as GW starts), there is no ffna block, and the header value is still 00 02 00 00.
1. Coders are sometimes weird. They may read the 32 bytes straight up, but they may have editors that don't do that. They may have noticed it never changes, just read it straight up, and not bothered to remove the specs code. Also, it's just one file, so we have nothing to compare it with (or do we? Let me know).
2. You have seen ffna tables that are not 512 bytes in size (that is, from the start to where the actual files begin)? I have just let it download a bit (to about 40 MB) and that is what I noticed. Coincidence that the value I saw was 512? I think not. You see, it perfectly matches the (offset, then size) order of the other table entries. It's intriguing.
3. Yes, the blocks are probably padded to 256. Shouldn't you wait, though, until you have the complete .DAT file? Why do you break off the download? Isn't the final file the one you play with? I think the specs should be retrieved from that file, not from an unfinished one. Or am I mistaken?
Method by which GW opens the file, as far as I have managed to figure out so far:
Read the first 32 bytes.
Go to the MFT pointed to by the offset at 0x10 from the file start.
Read 384 bytes of that MFT, then either (I don't know which yet)
- figure out the size by looking at the self-reference entry,
or - take the int at 0x0c from the MFT start and multiply by 24 (the size of a file reference).
Take the figured-out size, subtract 384, and read. (This reads the whole MFT.)
Then, read the block pointed to by the other reference, before the self-reference.
In my tests, this block is neither compressed data, MFT, ffna, nor ATEXDXT* (the four types of data you recognize on sight).
Again, it's good to look at how the executable handles it, but keep in mind that editors may do all of this differently. Also, don't rely too heavily on what coders do ;) IMHO ...
The last block you talk about (before the self-reference), don't you think this may be hashed filenames? We should look at this closely.
It's interesting that you mention GZip. But I thought GZip wasn't that good at compressing. Hmm.
Well, we'll get there eventually! Good work! Hope to hear your thoughts! Or of anyone who knows something we don't, of course. :)
--SilentAvenger I'm sorry it came out so list-like, I was in a hurry at the time.
Well, about ffna's, I think they are directories of sorts. See, I have downloaded the entire Gw.dat possible from the main menu (You know, how it sits there and downloads), came out about 200mb, and some ffna entries, except for having a self entry as most of them do, also had a bunch of other reference entries. My hunch is, as this archive format supports dynamic writing of content, and not only reading, is that they pre-allocate 2 blocks per ffna, so they can grow the listing if needed. I must say I have not found any ffna's with less than a 512 length, or, come to think of it, any block. And the blocks seem to be aligned on 512 jumps, now that I go check. So, the block size might be 512, and not 256.
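The 512-alignment hunch is easy to test mechanically. A quick check (C++; helper name made up), assuming you have the whole archive read into memory, is to see whether the known 'ffna'/'ATEX' signatures land on 512-byte boundaries:

 #include <cstdint>
 #include <cstdio>
 #include <cstring>
 #include <vector>
 
 // Counts 'ffna'/'ATEX' signatures and how many sit on a 512-byte boundary.
 // 'buf' is the whole archive read into memory.
 void checkAlignment(const std::vector<uint8_t>& buf)
 {
     long hits = 0, aligned = 0;
     for (size_t i = 0; i + 4 <= buf.size(); ++i) {
         if (memcmp(&buf[i], "ffna", 4) == 0 || memcmp(&buf[i], "ATEX", 4) == 0) {
             ++hits;
             if (i % 512 == 0) ++aligned;
         }
     }
     printf("%ld signatures, %ld on 512-byte boundaries\n", hits, aligned);
 }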
About decompression... we could try a bunch of things: just copy the compressed stuff out of the file, ignoring the constant bytes, and try various methods on it.
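Since deflate/gzip keeps coming up, here is one of those "various things" spelled out: a raw-deflate attempt with zlib (C++; function name made up). It assumes the last 4 bytes of a copied block are the uncompressed size, per the trailer observation above. If GW uses its own scheme, this will simply fail, which is still useful information:

 #include <cstdint>
 #include <cstdio>
 #include <cstring>
 #include <vector>
 #include <zlib.h>
 
 // Tries raw deflate on one compressed block copied out of the archive.
 // Assumes the last 4 bytes of 'blob' are the uncompressed size (the uint32
 // after the 08 00 01 80 trailer). Returns true if zlib was happy.
 bool tryRawInflate(const std::vector<uint8_t>& blob)
 {
     if (blob.size() < 4) return false;
     uint32_t rawSize;
     memcpy(&rawSize, &blob[blob.size() - 4], 4);
     std::vector<uint8_t> out(rawSize);
 
     z_stream zs = {};
     // -MAX_WBITS means headerless (raw) deflate; also worth trying
     // +MAX_WBITS for zlib framing and 16 + MAX_WBITS for gzip framing.
     if (inflateInit2(&zs, -MAX_WBITS) != Z_OK) return false;
     zs.next_in   = const_cast<Bytef*>(blob.data());
     zs.avail_in  = static_cast<uInt>(blob.size() - 4);
     zs.next_out  = out.data();
     zs.avail_out = static_cast<uInt>(out.size());
     int ret = inflate(&zs, Z_FINISH);
     inflateEnd(&zs);
     printf("inflate returned %d (%s)\n", ret,
            ret == Z_STREAM_END ? "looks like deflate!" : "not deflate, apparently");
     return ret == Z_STREAM_END;
 }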
Also, there seem to be various types of ffna's: some with file entries, some with blocks of data (without the suffix the compressed ones have).
About breaking off the download: being able to see the file at various stages of being downloaded has helped me a lot, for example with the fact that the number at 0x10 is an MFT offset. Having snapshots from different points in the file's development helps. I also run some statistical analysis on the files using tools I write in C++, and more test subjects means fewer errors.
--Republicola A few notes: I don't think you can get a "complete" archive. You download files as you need them, and when idle (in main menu, but maybe in other situations). But even if you left the game in the main menu for a long time, it is unlikely that you would get a special completed archive because they are regularly adding new files and changing old ones. The archive when there is nothing new to download could differ as little as one bit from an archive still in the middle of receiving files.
Gzip is a file format; the gzip tool uses compression from zlib. Keep in mind that compression quality is very subjective; both small data size *and* speedy decompression are desirable in a compression system. I got this interesting error message from someone else (it occurred when loading the Gates of Kryta map, I believe):
 (2) File 0x8381 stream 0x1 is corrupt
 (1) Map file '0x008381' failed to load. Attempting to re-bloat.
The "bloat" part seems to be a name for the decompression algorithm. Zlib calls it inflate, and I don't know about other systems. This is highly speculative, of course.
Finally, I suspect that file names are never used by GW. See how it refers to the map file with a number in the error message? I also remember noticing a lack of file name strings when I was running the game in my debugger a while ago. They could easily be stripped out of the process by the tools the designers use. This would mean no hashing and less space (of constant size) needed to store a file reference.
--Mr.Mouse 07:32, 3 Jun 2005 (EDT) We may introduce the rule to add new comments to the top, saves scrolling down. For now, I'll post below.
A few notes: I don't think you can get a "complete" archive. You download files as you need them, and when idle (in main menu, but maybe in other situations). But even if you left the game in the main menu for a long time, it is unlikely that you would get a special completed archive because they are regularly adding new files and changing old ones. The archive when there is nothing new to download could differ as little as one bit from an archive still in the middle of receiving files.
It doesn't matter if they change files constantly; they will still need a format for the "new" archive that the executable can read. Chances are therefore very high that the format of the game resource archive will always be the same when fully loaded. And that is what the game will need: a fully downloaded archive.
Well, it may indeed be that no filenames are used in the archive. Nevertheless, we should not dismiss the idea until the purpose of each bit in the archive is identified ;)
I have found that in the total file (256 MB) there are three MFT tables. They also point to three pre-self-reference tables. These last are exactly the same, it seems, while the MFT tables do have some differences (just the pointers). Seems like redundant info. Also, after the three MFT tables comes another large chunk of (compressed) data. Perhaps that is 1. junk, or 2. compressed filenames??
I have found some refs to compression in the exe:
- FCArchive
- CmpHuff --> would probably mean Huffman encoding? [1]
Also, it seems there are markers for different types of files that compress/decompress differently.
--SilentAvenger I checked the exe, and there seem to be a bunch of files called CmpIO, CmpHuff and CmpDict, and a function called CmpDecompress, which GW calls. I have managed to locate the function GW uses to decompress, but haven't messed around with it. To find it, search the strings for "Cmp"; there is an error message from FCArchive regarding problems extracting. Follow it, and you get to a proc which contains a call to the function that does the actual decompression.
--Republicola I don't understand your reasoning for using a complete archive. Even if the format is different, almost all the archives people have are not complete, so that format would be less useful. I think it will be easier if we work with different archives--it's like looking at an object from multiple viewpoints to determine its 3D structure.
My archive is a bit over 1 GB, and there were eight MFT sections when I last checked. Does that number increase roughly linearly with archive size, then? There might be a max size, which would make sense for a hash table.
Regarding Huffman encoding, it seems unlikely that it would be used as the standard compression algorithm throughout the archive. It is really bad for compression of random binary data. It is only useful for transmitting text that you know certain things about (that it's English, for example). What that article doesn't mention is that you also need to store the table of binary sequences along with the compressed data. There are standard tables for specific applications (like English text) that make it more efficient and useful because you don't have to include a table with the data. Maybe it is used just to compress text or something though.
--SilentAvenger I have updated the structure with information I found using statistical analysis of about 20 archives. Also, I must congratulate Republicola on extracting an ATEX entry! (Nothing much to look at)
Regarding the counters, I did some more stat analysis. They are completely sequential, and start at 11 or so. Except for one, they are in a contiguous block, some of them in the hash-table thing and some in the MFT table entries. The hash table sometimes has 2 entries for the same counter, with the second one's first uint32 much smaller than the first's.
Also, files in the MFT with sequential counters seem to have the flags 3,16 for the first one and 3,16,17,25 for the second one, except for rare occasions of a 3,16,17,24 file with no pair.
http://rep.undev.org/gw/finished.png
So here is the image I extracted from one of the ATEX sections in my archive. It was the second ATEX section, which I used because the first had 'DXTA', which isn't an actual compression format. This one was 'DXT5', though it isn't exactly that anyway. For reference, the data is here: http://rep.undev.org/gw/07537E00.dat
The image data starts at 0x1c. The only obvious element of the header is the width and height, which are the first two shorts after 'ATEXDXT5'. The image data works like normal DXT5 [2] with a few differences. First, only the alpha part is there. Normally, DXT5 is organized into 4x4-pixel blocks (texels) with eight bytes of alpha information and eight bytes of color information; here, the color information is missing, so texels are only eight bytes. To view the image, I added a DDS header and inserted eight-byte blocks of zeros every eight bytes throughout the image data.
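That rebuild step is mechanical; a small C++ sketch of it (function name made up), assuming the input is the bare sequence of 8-byte alpha blocks:

 #include <cstdint>
 #include <vector>
 
 // Rebuilds viewable DXT5 data from alpha-only blocks: after each 8-byte
 // alpha half, insert 8 zero bytes where the color half should be.
 std::vector<uint8_t> expandAlphaOnlyDxt5(const std::vector<uint8_t>& alphaBlocks)
 {
     std::vector<uint8_t> full;
     full.reserve(alphaBlocks.size() * 2);
     for (size_t i = 0; i + 8 <= alphaBlocks.size(); i += 8) {
         full.insert(full.end(), alphaBlocks.begin() + i, alphaBlocks.begin() + i + 8); // alpha half
         full.insert(full.end(), 8, 0); // missing color half, zeroed
     }
     return full;
 }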
With that done, the image still wasn't right, but it was obvious that it was missing several texels, which shifted all the following texels left. The missing texels were the ones in the four corners of the image (all-zero data, since they are completely black). I am pretty sure that, for extra compression, they leave out these texels and give their offsets elsewhere (only about two bytes to store the offset, rather than keeping the eight bytes of empty image data). If I counted correctly, the offsets (in texels, not bytes) of zero texels for that image are 0, 7, 56, 63 or 0, 6, 54, 60, depending on whether you add one for each previous texel; a 32x32 image is an 8x8 grid of texels, so 0, 7, 56, 63 are exactly the four corners. I haven't examined the data very carefully for these values yet.
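If the zero-texel theory holds, undoing it would look something like the sketch below (C++; names made up). It assumes the stored offsets are plain, sorted texel indices into the final image (the 0, 7, 56, 63 variant); if it is the other variant, the bookkeeping changes:

 #include <cstdint>
 #include <vector>
 
 // Reinserts an 8-byte all-zero block at each omitted texel index.
 // 'packed' holds the stored alpha blocks; 'zeroOffsets' must be sorted.
 std::vector<uint8_t> reinsertZeroTexels(const std::vector<uint8_t>& packed,
                                         const std::vector<unsigned>& zeroOffsets)
 {
     unsigned total = (unsigned)(packed.size() / 8 + zeroOffsets.size());
     std::vector<uint8_t> out;
     out.reserve(total * 8);
     size_t src = 0, zi = 0;
     for (unsigned t = 0; t < total; ++t) {
         if (zi < zeroOffsets.size() && zeroOffsets[zi] == t) {
             out.insert(out.end(), 8, 0); // omitted all-black texel
             ++zi;
         } else {
             out.insert(out.end(), packed.begin() + src, packed.begin() + src + 8);
             src += 8;
         }
     }
     return out;
 }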
I don't know how to get the lower mipmap levels right yet. What is interesting is that, assuming normal DXT5 data (color included and no missing texels), the sizes work out perfectly for a 32x32 image with 6 mipmap levels if you start the image data at 0x14. Maybe GW needs to trick DirectX into thinking it's dealing with real DXT5 data at some point.
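That size claim is easy to verify arithmetically: full DXT5 is 16 bytes per 4x4 texel, and each mip level is at least one texel, so a 32x32 image with 6 levels needs 1024 + 256 + 64 + 16 + 16 + 16 = 1392 bytes. A throwaway check in C++:

 #include <cstdio>
 #include <algorithm>
 
 int main()
 {
     unsigned total = 0, w = 32, h = 32;
     for (int level = 0; level < 6; ++level) {
         // Full DXT5: 16 bytes per 4x4 texel, at least one texel per level.
         total += std::max(w / 4, 1u) * std::max(h / 4, 1u) * 16;
         w = std::max(w / 2, 1u);
         h = std::max(h / 2, 1u);
     }
     printf("full DXT5, 32x32, 6 mips: %u bytes\n", total); // prints 1392
     return 0;
 }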
--Mr.Mouse Nice puzzling! Keep it up, I'd say! We are making progress here!
--SilentAvenger No kidding eh? Well. More good news. I can now extract files by hand via ollydbg. I have isolated some of the functions related to decompression.
Files are stored in streams. I have yet to find out how that works, probably with that weird hash table thing. The buffer passed to decompression is the exact contents of the block, minus the last 8 bytes.
If someone in here knows ASM, here are the relevant addresses:
004BCB72 - Call to allocate the buffer. ECX contains uncompressed file size (last uint32 of compressed file data). Returns allocated space in EAX.
004BCBBA - Call to decompress the file block. EDX contains compressed file size. EBX is the compressed buffer pointer. ECX contains destination buffer.
Inside the decompressing function:
004D8EBA - Call that actually decompresses. ECX contains destination buffer.
After that call completes, you can freely grab the decompressed file from the exe's memory. I've already looked at the prefs file and the GUI file for GW.
TODO - figure out the hash function, figure out about this "stream" stuff.
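For the brave: a very rough sketch of what driving that decompressor from injected code might look like, given the register usage above. The calling convention here is pure guesswork from the notes, the function address placeholder must be resolved in ollydbg first (the listed addresses are call sites, not targets), and it assumes 32-bit MSVC inline asm, no stack arguments, and a caller-provided destination buffer instead of the allocator at 004BCB72:

 #include <cstdint>
 
 // Hypothetical wrapper; GW_DECOMPRESS target and convention unconfirmed.
 void* callGwDecompress(void* compressed, uint32_t compressedSize, void* dest)
 {
     uint32_t fn = 0x00000000; // fill in the resolved call target here
     __asm {
         push ebx                 // ebx is callee-saved; keep MSVC happy
         mov  edx, compressedSize // EDX = compressed file size
         mov  ebx, compressed     // EBX = compressed buffer pointer
         mov  ecx, dest           // ECX = destination buffer
         mov  eax, fn
         call eax
         pop  ebx
     }
     return dest; // decompressed data lands in the destination buffer
 }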
--SilentAvenger Update: Using the ASM skills of xttocs and republicola, we managed to call some functions within GW at will, and this has allowed us to extract a bunch of files, but not all of them. Update: Now I have managed to extract all 9104 files from my archive. It seems like bit 3 really is a compression flag, but sometimes you have compressed files without it set; those seem like leftovers, as there appear to be multiple versions of the same file.
FFNAs are game data! So are ATEX! Some of the compressed files are ATEX or FFNA; therefore, these are *not* part of the archive structure itself. What we need to find out now is how GW knows which files to get.
On a side note, GW has one pretty funny language; if we manage to re-compress and insert the prefs file, we could try finding out which one it is.
--Mr.Mouse Has the archive format been investigated enough to move to the main GRAF index?
--Nicoli_s Well, we have an extractor partly working, but without filenames, so you'd have to talk to republicola and silentavenger about that. Plus, it hooks the dll instead of reading the .dat file straight.
--xttocs Did you manage to call the GW routines from the dll without crashing?
--Harrowed Someone wiped the data. *sigh* Restored.
--Mr.Mouse : Thanks. Damn children, probably. Or at least of that intellect. ;)
-- Saibot : Is it possible to get the source for the current extractor somewhere? I might be able to give a hand, but starting everything from scratch to contribute is not something that appeals to me ;-)
--Nicoli_s Well, I have no problem with it, but what we need to do is figure out how to actually read the .dat file instead of hooking the exe as we currently do. I'll talk to republicola.
-- Saibot : Any news?
-- MqsTout : It looks like GW stores a lot of tiny files, judging by how many thousands it downloads. Does it actually have coherent whole models, or does it use pieces that it then reconstructs together?
--Harrowed No updates in a while... Anyone got anything? If we can export, shouldn't we be able to import? But would this update real-time configs and the like, i.e. updating clipping range etc.?
-- Ral : Still no news?
--Otac0n Don't mean to pester, but does anyone know if the files end at (offset + size) or at (offset + size + 4)?
Just wondering, because it seems that "ATEX" and "ffna" are file types, and not necessarily counted in the file size.
--AzuiSleet: This almost seems like a lost cause, but there are some people left on irc.wc3campagins.com #hakhak. I was also able to extract sounds from the archive: http://azu.brokenedge.net/000000000B031600.dat.mp3 The keyword is to look for LAME in the file (LAME MP3 encoding).
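A trivial scanner for that, for anyone sifting through extracted files (C++; function name made up). 'LAME' can of course appear by accident in binary data, so treat a hit as a hint, not proof:

 #include <cstring>
 #include <vector>
 
 // Looks for the "LAME" tag the LAME encoder stamps into MP3 data.
 bool looksLikeLameMp3(const std::vector<unsigned char>& data)
 {
     for (size_t i = 0; i + 4 <= data.size(); ++i)
         if (memcmp(&data[i], "LAME", 4) == 0)
             return true;
     return false;
 }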