Talk:Guild Wars DAT


Preliminary specs

I have added my preliminary specs for this Guild Wars .DAT format. Feel free to update it as necessary, or discuss it here. For starters, the files in there are compressed, but I don't know how. I tried Zlib, to no avail. --Mr.Mouse


--SilentAvenger I disagree with some of the specs there. Using Filemon I looked at how GW opens the file. First, I don't know if the header size is really that, as GW just reads the first 32 bytes straight up. Second, ffna tables are not necessarily 512 bytes in size, but rather about 50 bytes, varying a bit. The data is stored in 256-byte blocks in the file, with the end padded if needed. Also, there is no ffna offset in the header; maybe only an offset to the first block. In an empty archive (achieved by killing the internet connection just as GW starts), there is no ffna block, and the header value is still 00 02 00 00.

The method GW uses to open the file, as far as I have managed to figure out so far:

  • Read the first 32 bytes.
  • Go to the MFT pointed to by the offset at 0x10 from the file start.
  • Read 384 bytes of that MFT, then either (I don't know which yet) figure out the size by looking at the self-reference entry, or take the int at 0x0c from the MFT start and multiply it by 24 (the size of a file reference).
  • Take the figured-out size, subtract 384, and read that much. (This reads the whole MFT.)
  • Then read the block pointed at by the other reference, the one before the self-reference. In my tests, this block is neither compressed data, MFT, ffna, nor ATEXDXT* (the four types of data you recognize on sight).

That's as far as I went.
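For anyone who wants to play along, here is a minimal C# sketch of that read sequence. The 0x10 header offset, the 384-byte first read, the count at 0x0c, and the 24-byte entry size come straight from the observations above; everything else (field names, the exact meaning of the count) is my own guess.

 // Sketch: replay GW's observed open sequence against Gw.dat.
 using System;
 using System.IO;

 class MftReader
 {
     const int HeaderSize = 32;      // GW reads the first 32 bytes straight up
     const int MftFirstRead = 384;   // then 384 bytes of the MFT
     const int EntrySize = 24;       // assumed size of one file reference

     static void Main(string[] args)
     {
         using var f = new BinaryReader(File.OpenRead(args[0]));

         byte[] header = f.ReadBytes(HeaderSize);
         uint mftOffset = BitConverter.ToUInt32(header, 0x10);  // offset to the MFT

         f.BaseStream.Seek(mftOffset, SeekOrigin.Begin);
         byte[] mftHead = f.ReadBytes(MftFirstRead);

         // Guess: the entry count lives at 0x0c from the MFT start.
         uint entryCount = BitConverter.ToUInt32(mftHead, 0x0c);
         long mftSize = (long)entryCount * EntrySize;

         // Read the rest of the MFT (total size minus the 384 already read);
         // assumes the MFT is at least 384 bytes.
         byte[] mftRest = f.ReadBytes((int)(mftSize - MftFirstRead));

         Console.WriteLine($"MFT at 0x{mftOffset:X}, {entryCount} entries, {mftSize} bytes");
     }
 }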

Also: ffna's do not store data. Data seems to be stored in compressed blocks, all ending with 08 00 01 80, followed by what I assume is the uncompressed file size. I have not been able to decompress this, but it's probably deflate or something similar. Someone told me, by some obscure method of reasoning, that it might be the method used by gzip, but then he disconnected from mIRC.
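One way to test that observation is to scan the archive for the 08 00 01 80 marker and dump the four bytes that follow as a candidate uncompressed size. A sketch, assuming the marker and trailing size are exactly as described above:

 // Sketch: find 08 00 01 80 markers and print the uint32 after each one.
 using System;
 using System.IO;

 class MarkerScan
 {
     static void Main(string[] args)
     {
         byte[] data = File.ReadAllBytes(args[0]);
         byte[] marker = { 0x08, 0x00, 0x01, 0x80 };

         for (int i = 0; i + marker.Length + 4 <= data.Length; i++)
         {
             bool hit = true;
             for (int j = 0; j < marker.Length; j++)
                 if (data[i + j] != marker[j]) { hit = false; break; }
             if (!hit) continue;

             uint size = BitConverter.ToUInt32(data, i + marker.Length);
             Console.WriteLine($"marker at 0x{i:X8}, claimed uncompressed size {size}");
         }
     }
 }

If the sizes it prints look sane (small for UI files, large for maps), that supports the theory.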

There are ATEXDXT* blocks (* being a digit), which a friend of mine thought are uncompressed DXT* textures.

Some ffna's seem to contain a self-reference entry.

About that last uint32 in each reference: I looked at the .dat I had and ran some stat checking (I will run it on a bigger archive once GW stops downloading). It is unique for every entry *not* pointing to a block that starts with ffna, and it repeats for the ffna blocks. So it might not be a hash.
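If anyone wants to repeat that check, this is roughly the tally, re-sketched in C#. It assumes 24-byte entries with the uint32 in question as the last field; that entry layout is a guess based on the discussion above.

 // Sketch: tally the last uint32 of each assumed 24-byte entry, report repeats.
 using System;
 using System.Collections.Generic;
 using System.IO;

 class Tally
 {
     static void Main(string[] args)
     {
         byte[] mft = File.ReadAllBytes(args[0]);   // a dumped MFT blob
         var counts = new Dictionary<uint, int>();

         for (int off = 0; off + 24 <= mft.Length; off += 24)
         {
             uint last = BitConverter.ToUInt32(mft, off + 20);  // last uint32 of the entry
             counts.TryGetValue(last, out int c);
             counts[last] = c + 1;
         }

         foreach (var kv in counts)
             if (kv.Value > 1)
                 Console.WriteLine($"0x{kv.Key:X8} repeats {kv.Value} times");
     }
 }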

Wasn't sure whether to post this here or in Talk; tell me for next time :)


--Captain Hey man, if you don't agree with the contents, feel free to add corrections to the current page. The common goal is to figure this thing out, and all help is appreciated. :)


--Mr.Mouse I'd rather the discussion take place here. If you have an alternative spec, please add it below the former one and name it Alternative 1. ;) I'll get back to this discussion soon.

As a matter of fact, here. :)

I disagree with some of the specs there. Using Filemon I looked at how GW opens the file. First, I don't know if the header size is really that, as GW just reads the first 32 bytes straight up. Second, ffna tables are not necessarily 512 bytes in size, but rather about 50 bytes, varying a bit. The data is stored in 256-byte blocks in the file, with the end padded if needed. Also, there is no ffna offset in the header; maybe only an offset to the first block. In an empty archive (achieved by killing the internet connection just as GW starts), there is no ffna block, and the header value is still 00 02 00 00.

1. Coders are sometimes weird. They may read the 32 bytes straight up, but they may have editing tools that don't. They may have noticed the header never changes, read it straight up, and not bothered to remove the spec-parsing code. Also, it's just one file, so we have nothing to compare it with (or do we? Let me know).

2. You have seen ffna tables that are not 512 bytes in size (measuring from the start to where the actual files begin)? I just let it download a bit (to about 40 MB), and 512 is what I noticed. Coincidence that the value I saw was 512? I think not. You see, it perfectly matches the (offset, then size) order of the entries in the other tables. It's intriguing.

3. Yes, the blocks are probably padded to 256. But shouldn't you wait until you have the complete .DAT file? Why do you break off the download? Isn't the final file the one you play with? I think the specs should be derived from that file, not from an unfinished one. Or am I mistaken?


The method GW uses to open the file, as far as I have managed to figure out so far:

  • Read the first 32 bytes.
  • Go to the MFT pointed to by the offset at 0x10 from the file start.
  • Read 384 bytes of that MFT, then either (I don't know which yet) figure out the size by looking at the self-reference entry, or take the int at 0x0c from the MFT start and multiply it by 24 (the size of a file reference).
  • Take the figured-out size, subtract 384, and read that much. (This reads the whole MFT.)
  • Then read the block pointed at by the other reference, the one before the self-reference. In my tests, this block is neither compressed data, MFT, ffna, nor ATEXDXT* (the four types of data you recognize on sight).


Again, it's good to look at how the executable handles it, but keep in mind that editing tools may do all of this differently. Also, don't rely too heavily on what coders do ;) IMHO ...

The last block you talk about (the one before the self-reference): don't you think it may be hashed filenames? We should look at it closely.

It's interesting that you mention GZip. But I thought GZip wasn't that good at compressing. Hmm.

Well, we'll get there eventually! Good work! Hope to hear your thoughts! Or of anyone who knows something we don't, of course. :)


--SilentAvenger I'm sorry it came out so list-like; I was in a hurry at the time.

Well, about ffna's, I think they are directories of sorts. See, I have downloaded as much of Gw.dat as possible from the main menu (you know, how it sits there and downloads); it came out to about 200 MB, and some ffna entries, besides having a self-entry as most of them do, also had a bunch of other reference entries. My hunch, since this archive format supports dynamic writing of content and not only reading, is that they pre-allocate 2 blocks per ffna so they can grow the listing if needed. I must say I have not found any ffna's with a length of less than 512, or, come to think of it, any block that small. And the blocks seem to be aligned on 512-byte boundaries, now that I go and check. So the block size might be 512, not 256.

About decompression... We could try a bunch of things: just copy the compressed stuff out of the file, ignoring the constant bytes, and try various decompressors on it.
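For instance, a sketch that throws .NET's raw-deflate and gzip decoders at a dumped block. This assumes the dump already has the constant bytes stripped; if neither decoder takes it, the format is probably custom.

 // Sketch: try standard decompressors against a dumped block.
 using System;
 using System.IO;
 using System.IO.Compression;

 class TryDecompress
 {
     static void Main(string[] args)
     {
         byte[] blob = File.ReadAllBytes(args[0]);   // one dumped compressed block

         Attempt("raw deflate", new DeflateStream(new MemoryStream(blob), CompressionMode.Decompress));
         Attempt("gzip", new GZipStream(new MemoryStream(blob), CompressionMode.Decompress));
     }

     static void Attempt(string name, Stream s)
     {
         try
         {
             using var outMs = new MemoryStream();
             s.CopyTo(outMs);
             Console.WriteLine($"{name}: OK, {outMs.Length} bytes out");
         }
         catch (Exception e)
         {
             Console.WriteLine($"{name}: failed ({e.Message})");
         }
     }
 }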

Also, there seem to be various types of ffnas, some with file entries, some with blocks of data (without the suffix the compressed ones have).

About breaking off the download: being able to see the file at different stages of being downloaded has helped me a lot; for example, that is how I found that the number at 0x10 is an MFT offset. Having different points in the file's development helps. I also run some statistical analysis on the files using tools I write in C++, and more test subjects means fewer errors.


--Republicola A few notes: I don't think you can get a "complete" archive. You download files as you need them, and when idle (in the main menu, but maybe in other situations too). But even if you left the game in the main menu for a long time, it is unlikely that you would get a special completed archive, because they are regularly adding new files and changing old ones. An archive with nothing new left to download could differ by as little as one bit from an archive still in the middle of receiving files.

Gzip is a file format; the gzip system uses compression from zlib. Keep in mind that compression quality is very subjective: both small data size *and* speedy decompression are desirable in a compression system. I got this interesting error message from someone else (it occurred when loading the Gates of Kryta map, I believe):

  (2) File 0x8381 stream 0x1 is corrupt
  (1) Map file '0x008381' failed to load. Attempting to re-bloat.

The "bloat" part seems to be a name for the decompression algorithm. Zlib calls it inflate, and I don't know about other systems. This is highly speculative, of course.

Finally, I suspect that file names are never used by GW. See how it refers to the map file with a number in the error message? I also remember noticing a lack of file-name strings when I was running the game in my debugger a while ago. They could easily be stripped out by the tools the designers use. This would mean no hashing, and less space (of constant size) needed to store a file reference.


--Mr.Mouse 07:32, 3 Jun 2005 (EDT) We may want to introduce a rule to add new comments at the top; it saves scrolling down. For now, I'll post below.

A few notes: I don't think you can get a "complete" archive. You download files as you need them, and when idle (in the main menu, but maybe in other situations too). But even if you left the game in the main menu for a long time, it is unlikely that you would get a special completed archive, because they are regularly adding new files and changing old ones. An archive with nothing new left to download could differ by as little as one bit from an archive still in the middle of receiving files.

It doesn't matter if they change files constantly; they will still need a format for the "new" archive that the executable can read. Chances are therefore very high that the format of the game resource archive will always be the same when fully loaded. And that is what the game will need: a fully downloaded archive.

Well, it may indeed be that no filenames are used in the archive. Nevertheless, we should not dismiss the idea until the purpose of every bit in the archive is identified ;)

I have found that in the total file (256 MB) there are three MFT tables. They also point to three pre-self-reference tables. These last ones are exactly the same, it seems, while the MFT tables do show some differences (just the pointers). Seems like redundant info. Also, after the three MFT tables comes another large chunk of (compressed) data. Perhaps that is 1. junk, or 2. filenames, compressed?

I have found some refs to compression in the exe:

  • FCArchive
  • CmpHuff --> would probably mean Huffman encoding? [1]

Also, it seems there are markers for different types of files that compress/decompress differently.


--SilentAvenger I checked the EXE, and there seem to be a bunch of files called CmpIO, CmpHuff and CmpDict, and a function called CmpDecompress, which GW calls. I have managed to locate the function GW uses to decompress, but haven't messed around with it. To find it, search the strings for "Cmp"; there is an error message from FcArchive regarding problems extracting. Follow it, and you get to a proc which contains a call to the function that does the actual decompression.


--Republicola I don't understand your reasoning for using a complete archive. Even if the format is different, almost all the archives people have are not complete, so that format would be less useful. I think it will be easier if we work with different archives--it's like looking at an object from multiple viewpoints to determine its 3D structure.

My archive is a bit over 1 GB, and there were eight MFT sections when I last checked. Does that number increase roughly linearly with archive size, then? There might be a max size, which would make sense for a hash table.

Regarding Huffman encoding, it seems unlikely that it would be used as the standard compression algorithm throughout the archive. It is really bad at compressing random binary data; it is only useful for data that you know certain things about (that it's English text, for example). What that article doesn't mention is that you also need to store the table of binary sequences along with the compressed data. There are standard tables for specific applications (like English text) that make it more efficient and useful, because you don't have to include a table with the data. Maybe it is used just to compress text or something, though.


--SilentAvenger I have updated the structure with information I found using statistical analysis of about 20 archives. Also, I must congratulate Republicola on extracting an ATEX entry! (Nothing much to look at)

Regarding the counters, I did some more stat analysis. They are completely sequential and start at 11 or so. Except for one, they sit in a contiguous block, some of them in the hash-table thing and some in the MFT table entries. The hash table sometimes has two entries for the same counter, with the second one's first uint32 much smaller than the first's.

Also, files in the MFT with sequential counters seem to have the flags 3,16 for the first one, and 3,16,17,25 for the second one, except for rare occasions of a 3,16,17,24 file with no pair.


--Republicola

http://rep.undev.org/gw/finished.png

So here is the image I extracted from one of the ATEX sections in my archive. It was the second ATEX section; I used it because the first had 'DXTA', which isn't an actual compression format. This one was 'DXT5', though it isn't exactly that either. For reference, the data is here: http://rep.undev.org/gw/07537E00.dat

The image data starts at 0x1c. The only obvious element of the header is the width and height, which are the first two shorts after 'ATEXDXT5'. The image data works like normal DXT5 [2] with a few differences. First, only the alpha part is there. Normally, DXT5 is organized into 4x4-pixel blocks (texels) with eight bytes of alpha information and eight bytes of color information; here, the color information is missing, so texels are only eight bytes. To view the image, I added a DDS header and inserted eight-byte blocks of zeros every eight bytes throughout the image data.

With that done, the image still wasn't right, but it was obvious that it was missing several texels, which shifts all the following texels left. The missing texels were the ones in the four corners of the image (all-zero data, since they are completely black). I am pretty sure that for extra compression they leave out these texels and store their offsets elsewhere (only about two bytes to store an offset, rather than keeping eight bytes of empty image data). If I counted correctly, the offsets (in texels, not bytes) of the zero texels for that image are 0, 7, 56, 63 or 0, 6, 54, 60, depending on whether you add one for each previous texel. I haven't examined the data very carefully for these values yet.
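A sketch of that expansion step: pad the alpha-only texels back to full 16-byte DXT5 texels and re-insert all-zero texels at the given offsets. The offset list and texel count here are the ones worked out above for the 32x32 sample; where the archive actually stores the offsets is still unknown.

 // Sketch: expand alpha-only DXT5 data (8 bytes/texel) to full DXT5
 // (16 bytes/texel), re-inserting all-zero texels at given texel offsets.
 using System;
 using System.Collections.Generic;
 using System.IO;

 class Dxt5Expand
 {
     static byte[] Expand(byte[] alphaOnly, int texelCount, int[] zeroTexels)
     {
         var zeros = new HashSet<int>(zeroTexels);
         var output = new byte[texelCount * 16];   // starts out all zero
         int src = 0;

         for (int t = 0; t < texelCount; t++)
         {
             if (zeros.Contains(t))
                 continue;   // a left-out texel stays 16 bytes of zeros (black)

             // 8 bytes of alpha from the archive, then 8 zero bytes of color.
             Array.Copy(alphaOnly, src, output, t * 16, 8);
             src += 8;
         }
         return output;
     }

     static void Main(string[] args)
     {
         byte[] data = File.ReadAllBytes(args[0]);    // raw ATEXDXT5 payload dump
         // 32x32 image => 8x8 = 64 texels; corner texels assumed missing.
         byte[] full = Expand(data, 64, new[] { 0, 7, 56, 63 });
         File.WriteAllBytes(args[0] + ".dxt5", full); // prepend a DDS header to view
     }
 }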

I don't know how to get the lower mipmap levels right yet. What is interesting is that, assuming normal DXT5 data (color included and no missing texels), the sizes work out perfectly for a 32x32 image with 6 mipmap levels if you start the image data at 0x14. Maybe GW needs to trick DirectX into thinking it's dealing with real DXT5 data at some point.


--Mr.Mouse Nice puzzling! Keep it up, I'd say! We are making progress here!


--SilentAvenger No kidding, eh? Well, more good news: I can now extract files by hand via OllyDbg. I have isolated some of the functions related to decompression.

Files are stored in streams. I have yet to find out how that works, probably via that weird hash-table thing. The buffer passed to decompression is the exact contents of the block, minus the last 8 bytes.

If someone in here knows ASM, here are the relevant addresses:

004BCB72 - Call to allocate the buffer. ECX contains uncompressed file size (last uint32 of compressed file data). Returns allocated space in EAX.

004BCBBA - Call to decompress the file block. EDX contains compressed file size. EBX is the compressed buffer pointer. ECX contains destination buffer.

Inside the decompressing function:

004D8EBA - Call that actually decompresses. ECX contains destination buffer.

After that call completes, you can freely grab the decompressed file from the EXE's memory. I've already looked at the prefs file and the GUI file for GW.

TODO - figure out the hash function, figure out about this "stream" stuff.


--SilentAvenger Update: Using the ASM skills of xttocs and Republicola, we managed to call some functions within GW at will, and this allowed us to extract a bunch of files, but not all of them. Update: Now I have managed to extract all 9104 files from my archive. It seems bit 3 really is a compression flag, but sometimes you have compressed files without it set; these seem to be leftovers, as there appear to be multiple versions of the same file.

FFNAs are game data! So are ATEX! Some of the compressed files are ATEX or FFNA; therefore, these are *not* part of the archive structure. What we need to find out now is how GW knows which files to get.

On a side note, GW has one pretty funny language; if we manage to re-compress and insert the prefs file, we could try finding out which one it is.


--Mr.Mouse Has the archive format been investigated enough to move to the main GRAF index?


--Nicoli_s Well, we have an extractor partly working, but without filenames, so you'd have to talk to Republicola and SilentAvenger about that. Plus, it hooks the DLL instead of reading the .dat file directly.


--xttocs Did you manage to call the GW routines from the dll without crashing?


--Harrowed Someone wiped the data. *sigh* Restored.


--Mr.Mouse : Thanks. Damn children, probably. At least: of that intellect. ;)


-- Saibot : Is it possible to get the source for the current extractor somewhere? I might be able to give a hand, but starting everything from scratch in order to contribute is not something that appeals to me ;-)


--Nicoli_s Well, I have no problem with that, but what we need to do is figure out how to actually read the .dat file, instead of hooking the exe as we currently do. I'll talk to Republicola.


-- Saibot : Any news?


-- MqsTout : Judging by how many thousands of files it downloads, it looks like GW stores a lot of tiny files. Does it actually have coherent whole models, or does it use pieces that it then reconstructs together?


--Harrowed No updates in a while... Anyone got anything? If we can export, shouldn't we be able to import? But would this update the real-time configs etc., i.e. updating the clipping range and so on?


-- Ral : Still no news?


--Otac0n Don't mean to pester, but does anyone know if the files end at (offset + size) or at (offset + size + 4)?

Just wondering, because it seems that "ATEX" and "ffna" are file types, and not necessarily counted in the file size.


--AzuiSleet: This almost seems like a lost cause, but there are some people left on irc.wc3campagins.com #hakhak. I was also able to extract sounds from the archive: http://azu.brokenedge.net/000000000B031600.dat.mp3 The trick is to look for LAME in the file (LAME MP3 encoding).
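A sketch of that kind of hunt: scan the archive for the ASCII bytes 'LAME' and print each hit's offset, then carve around the hits by hand. (The 'LAME' tag sits inside the MP3 data, so the actual file start is somewhere before the hit.)

 // Sketch: find 'LAME' encoder tags in Gw.dat as a lead on embedded MP3s.
 using System;
 using System.IO;
 using System.Text;

 class LameScan
 {
     static void Main(string[] args)
     {
         byte[] data = File.ReadAllBytes(args[0]);
         byte[] needle = Encoding.ASCII.GetBytes("LAME");

         for (int i = 0; i + needle.Length <= data.Length; i++)
         {
             bool hit = true;
             for (int j = 0; j < needle.Length; j++)
                 if (data[i + j] != needle[j]) { hit = false; break; }

             if (hit)
                 Console.WriteLine($"'LAME' tag at 0x{i:X8}");
         }
     }
 }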


--Mqstout: Has there been any further progress on this made? Something usable?


--Ricky26:

Hey, you have the cracker of B&W2's STUFF file here, and ready to hack... *readies axe*

Introductions made:

0tac0n: Yes, ATEX and ffna are counted in the size.

I've noticed that FileMon says it reads 384 bytes at the start of the MFT... That explains those extra indexes away. If we can find the 384 (is it hardcoded?)... *mutters to self*

Shall I whip up some C# for analysis? =)


Things to look for:

  • AMAT - Material
  • GRMT
  • DX9S - Shader


Also, I think that the encoding is in fact Huffman (and I'm working some C# magic now). Look at a Zip file and GW.DAT; see the similarities?
My guess is this:

  • The server has a bunch of Huffman files (or a complete GW.DAT file).
  • It sends the data ENCODED over the network (saves bandwidth).
  • The client stores it as Huffman (saves space).
  • The client decompresses it at runtime. (Is "bloat" an NCSoft-made term?)


If this is true, then the following may be the case:

Are the FFNAs the Huffman tree? (Each FFNA could be a file...?)


Actually, what if they just made one Huffman tree and everything uses that? (It could be hidden in the EXE.) Any thoughts on filenames? Or file references?


Day 2: (lol)

Anyway, I have some good news (for home-made decompression).
Their compression looks like a re-arranged Zip format (see the sketch after this list):

  • In flag theory 2: 0x08 = Deflate (same as in a ZIP). (0x00 = STORED... hmm, I smell some ZIPyness.) This _IS_ the first short. The second short is flags (the same as in ZIP).
  • Compression has flags 0x0102 and 0x0304 (dir & file headers in zip).
  • The CRC is an INT in zip.
  • Zips start with "PK" (0x0102), and these start with 0x???? 0x0102.
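Here is the checker I'd run over a pile of dumped blocks to see how often those values actually show up. Purely speculative; the field layout is an assumption of flag theory 2, nothing more.

 // Sketch (flag theory 2): read a block's first two shorts and compare
 // them with ZIP-style values (method 8 = deflate, 0 = stored).
 using System;
 using System.IO;

 class FlagCheck
 {
     static void Main(string[] args)
     {
         byte[] block = File.ReadAllBytes(args[0]);   // one dumped block

         ushort method = BitConverter.ToUInt16(block, 0);  // deflate/stored, per the theory
         ushort flags  = BitConverter.ToUInt16(block, 2);  // general-purpose flags, as in ZIP?

         Console.WriteLine($"method 0x{method:X4}, flags 0x{flags:X4}");
         if (method == 0x08) Console.WriteLine("looks like DEFLATE, per the theory");
         if (method == 0x00) Console.WriteLine("looks like STORED, per the theory");
     }
 }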



I'm gonna add these specs to the main page as Theory 2. =P
We almost know enough to write non-compressed files into the FcArchive.... =)
--- On another note, is there anyone here still alive?

THERE ARE FILENAMES.
I've found conclusive evidence that there are filenames...
The folders start with /. E.g. /Art/Ui/ (an actual dir).
Now I need to find them in the .DAT =P
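A crude way to go looking: scan the .DAT for runs of printable ASCII that start with '/'. A sketch; if the path strings are stored compressed, this will of course find nothing.

 // Sketch: hunt for '/'-prefixed path strings in the raw archive.
 using System;
 using System.IO;
 using System.Text;

 class PathScan
 {
     static void Main(string[] args)
     {
         byte[] data = File.ReadAllBytes(args[0]);

         for (int i = 0; i < data.Length; i++)
         {
             if (data[i] != (byte)'/') continue;

             // Collect the run of printable ASCII starting at the slash.
             int end = i;
             while (end < data.Length && data[end] >= 0x20 && data[end] < 0x7f)
                 end++;

             if (end - i >= 5)   // arbitrary minimum length, to cut down noise
                 Console.WriteLine($"0x{i:X8}: {Encoding.ASCII.GetString(data, i, end - i)}");
             i = end;
         }
     }
 }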


Mqstout: I'm still alive here, eagerly awaiting your answers. Sadly, I don't know my way around this stuff; the last file hacking I did was decomposing saved-game formats for that ancient DOS game Drakkhen. I just look forward to useful... uses (in extraction).

-- Ricky

Does anyone know how SilentAvenger ran the decompression code?

=(


--- Ricky26

"Files are stored in streams. I have yet to find out how that works, probably with that weird hash table thing. The buffer passed to decompression is the exact contents of the block, minus the last 8 bytes."

That's because the last 8 bytes are the "new flags" and the decompressed size (int32). ¬_¬
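So splitting a raw block would look roughly like this. A sketch assuming that trailer layout: a 4-byte flags field followed by the int32 decompressed size, right at the end of the block.

 // Sketch: split a raw block into payload and its assumed 8-byte trailer.
 using System;
 using System.IO;

 class BlockSplit
 {
     static void Main(string[] args)
     {
         byte[] block = File.ReadAllBytes(args[0]);   // one raw block dump

         int payloadLen = block.Length - 8;           // payload = everything but the trailer
         byte[] payload = new byte[payloadLen];
         Array.Copy(block, payload, payloadLen);

         uint flags = BitConverter.ToUInt32(block, payloadLen);           // the "new flags"?
         int decompressedSize = BitConverter.ToInt32(block, payloadLen + 4);

         Console.WriteLine($"payload {payloadLen} bytes, flags 0x{flags:X8}, " +
                           $"decompressed size {decompressedSize}");
     }
 }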


mqstout: I don't know if it'd be helpful to look at the files as they're stored on the CDs the game shipped with? The .dat file is spread across two disks (both for the original and the expansion), and diffing them or looking at what's around the break might be helpful.


--Otac0n I have definitely made some advancements, using a tool I created, available at http://67.67.35.17/GWDat.zip

Written in C#.

This has led me to believe that all of the textures listed as "DXTA" are actually DXT5-compressed. These are probably game textures, since they have a file ID.

If we can get some work done on the header of all these files, we should be able to export most of the game's textures.

Update, May 22nd:

I have re-coded a lot of the GWDat C# project. Grab it again if you want it. It should help out in delving around a lot.



Taylor Mouse : I read in another forum that you can decompress the contents of the dat file using gw.exe -image. This decompresses all files in the dat file.

User:Ricky26:

0tacon: I was doing the same thing! A C# project... but I was also doing some binary stuff in a C++ one...

You use XFire? ;) (ricky26)

@Taylor Mouse: Thanks for the tip-off... Unfortunately, it decompresses INTO the dat. ='(

Update: I think that -image creates the "clean" archives that you see on the packaged disks!

Update: -image compresses stuff =(...

Also: I can now extract files manually with OllyDbg... (reinventing the wheel! gah!) but not by choice! Just as they are requested =(.

I think there has to be some kind of executable code in the dat, a script or something, because "-image" isn't in the exe.

=)

There may be executables in the dat, but the -image switch (along with all the other command-line switches) IS in the .exe, without the "-" prefix. If you look for the "push" assembly instructions in the exec, you will find references to "params.cpp", "cmd.cpp" and "HcmdParser". These routines parse the options. The routines start at 40B780 and 40B8D0 in the current (as of June 5th) version of the exec.

Decompression Routine

The routine starts at:

508C70

in my version of the exec. The above poster is correct in that ecx contains the decompressed data.

Correction:

Sometimes ecx contains the buffer, but sometimes it is in edx instead.


--Otac0n

Posting addresses from the exe is useless; every recompile/relink is likely to jumble them around.

When I put a breakpoint on the routine that SilentAvenger posted, it only breaks when decompressing NEW files, i.e. only if the file was just now downloaded. So it may be just a preliminary decompression.


--Ricky26

Is there an IM we can use? This wiki is getting tedious...

Also, I searched the EXE with a hex editor... so maybe I had it set to UNICODE... Oops =)

I can call the compression from C++, but I get "Cannot access memory at" errors... =)


--Otac0n

We could IRC.

I'm gonna try to register a channel on GameSurge. "#GWDat"

Join up. I'm gonna log the channel and create a PHP script to show the logs from my webserver.