Thursday, June 30, 2016

Thursday, June 30

Finished writing the decode up! I also started debugging it for a while. I had a few hiccups, but it seems to be working well now. I'll keep debugging it, then put it together with the existing ANDing and ORing code I have, and then it should be pretty set!

Tuesday, June 28, 2016

Tuesday, June 28

Today I finished half of the decode up method. I finished coding the decoding of a fill word and started working on how to decode a partial fill. Tomorrow I'll finish that up; it overlaps with decoding a literal, so it will lead into the last part nicely.

Monday, June 27, 2016

Monday, June 27

Heard back from David today. It looks like there was an error in the VAL paper but that my approach was correct, so I'm glad I continued on and debugged the version I have. Today, I planned out the decode up algorithm by hand, and I'm going to write it into my code over the rest of the week. This one is much harder than the decode down, since it takes multiple segments to make a single larger segment, so there are many more combinations of fills and literals that can result from the subsegments of the pre-decoded words.

Friday, June 24, 2016

Friday, June 24

Still haven't heard from David so I've decided to go forward with my current implementation. Today I worked on thoroughly debugging my decode down method and it seems to be working well now, after a few hurdles. I also started planning out each of the four cases of the decode up and will finish implementing it next week. 

Thursday, June 23, 2016

Thursday, June 23

Still waiting for advice regarding the issue I posted about yesterday. There are two different ways to proceed depending on David's answer, so I took the day off to make sure I'm going in the right direction. If I don't have an answer by tomorrow, I will proceed how I was planning to and fix it later if needed.

Wednesday, June 22, 2016

Wednesday, June 22

Alright, so I finished writing the decode down. I want to test it some more to make sure it works how I want it to though.

After I finished writing it, I wanted to reread the VAL papers to make sure my code was doing exactly what they explain, but I think I ran into a discrepancy. The paper illustrates a word VAL(s=60) being decoded into a segment length of 15. The VAL(s=60) word is as follows: 1000 000000000...00000 0100 0000 0000 1110. This word represents 0100 0000 0000 1110 runs of 0s (16,398). The decoding example shows its reduction into the following VAL(s=15) word: 1100 011111111111111 00000000001111..., which represents 16,383 runs of 0s followed by 15 runs of 0s. While this adds up to 16,398 runs of 0s like the original word before decoding, the segment length has changed from 60 to 15. This means that while the original VAL(s=60) word represents 60*16,398 0s (983,880), the VAL(s=15) segments represent 15*16,383 (245,745) and 15*15 (225) 0s. So if we assume the segment length has changed from 60 to 15 with the decoding process, the decoded word in the example does not cover the same number of bits that the original word covers. I wrote to David to clarify the issue, so hopefully I will have an answer to this tomorrow!

Tuesday, June 21, 2016

Tuesday, June 21

Today I continued work on the decoding. I am almost done with writing the decoding down. When I'm done with that, I will work on the more difficult decoding up. A big part of today was writing various utilities involving the activeWord struct to hold the decoded words, including a loop that I'm rather proud of:

void updateActiveWord(activeWord *toUpdate, word_32 newWord){
     int i;
     for(i = 0; i < toUpdate->numSegs; i++){
          /* extract flag bit i from the 4-bit header */
          toUpdate->flag[i] = (newWord << (4 - toUpdate->numSegs + i))
                              >> ((toUpdate->length * toUpdate->numSegs) + 3);
          /* extract segment i from the payload */
          toUpdate->seg[i] = (newWord << (4 + toUpdate->length * i))
                             >> (WORD_LENGTH - toUpdate->length);
     }
     toUpdate->currSeg = 0;
}

It's not actually that complicated; the main reason I'm proud of it is that all of that looped bit work worked the first time through, so I did not have to debug! What a day!

Monday, June 20, 2016

Monday, June 20

Nothing much to report today. I started working on the decoding. That said, I can't even begin to describe how excited I am about my rewriting of the query engine. Not only is it much more organized and cleaner than before, but it's working so much better. All of the work was worth it!

Friday, June 17, 2016

Friday, June 17

Just putting the final touches on my complete revamp of my project. It sure was a headache but I'm excited to write the decoding methods for the VAL querying using my streamlined edits!

Thursday, June 16, 2016

Thursday, June 16

I have debugged the existing issue in the WAH query engine and it works fine now (a very silly mistake on my part: not initializing some of the utility package).
Today I did a MAJOR reorganization of the entire query engine. I have decided that though it currently feels like I am essentially rewriting my entire WAH query engine, it is very much worth the time because it will allow me to implement the VAL engine very easily (as well as any other algorithms). My old implementation was a pretty naive one, solely focused on making WAH work.

However, I am now generalizing the implementation around the activeWord struct. The main purpose of this is that I can use the activeWord struct as a parameter for my existing utilities for ORing and ANDing segments (there are six, to be precise: fillORfill, fillORlit, litORlit, fillANDfill, fillANDlit, and litANDlit). They currently take in a word_32 (an entire compressed word). However, if I change the implementation to work on activeWord structs instead, I can use the same methods for VAL. The WAH implementation should not change much, as each activeWord will be essentially equivalent to one WAH word. The main reason for this change is that each word_32 read in from the compressed files can be equivalent to 1, 2, or 4 segments, depending on the segment length, so this struct will allow me to separate the segments as needed and store the individual parts in the activeWord, which can then be ORed or ANDed.
Once I rewrite everything in the format that I'm aiming for, all I have to do is write the decodeNext method which calls the decodeUp and decodeDown methods that coordinate the flow of segments from the compressed files through the activeWord struct.
It feels slow rewriting this and often involves breaking a lot of what I have but I know this is going to make everything A LOT easier in the future.

Wednesday, June 15, 2016

Wednesday, June 15

I think I have isolated this bug. I believe it has to do with how I am allocating and freeing up my results data space as the query engine does fine for the first query but seems to be failing on the second one. I think tomorrow I should be able to move on.

Tuesday, June 14, 2016

Tuesday, June 14

I began structuring my VAL query engine and realized that to make this much more usable and extendable in the future, I would need to reorganize the path of my existing program. The way I wrote it before was solely for a WAH compressor and WAH query engine. Now that I'm extending its usage to VAL, it would be very messy to add VAL in, because I wrote all of the query utilities in one file even though half of them are used by both WAH and VAL. For this reason, I spent 7 hours today reorganizing and rewriting my existing code to separate out the strictly WAH functionality. Now, all queries pass through Query.c, which performs the tasks that both algorithms need and then passes them off to the appropriate engine. It was a headache and a half, and I managed to create a slight bug in the existing WAH implementation, but I think it is all for the better and much more streamlined as well. Tomorrow I will fix the bug I left today and then mirror my rewritten implementation for VAL.

Monday, June 13, 2016

Monday, June 13

I spent the day reading papers on VAL and planning on how to tackle the querying. VAL querying is very tricky and there are two different ways to go about it, both of which I will be implementing. The first method is to "decode down" which means that if we query two columns of different segment lengths, we essentially translate the column using the longer segment length down into words of the smaller segment lengths. The alternate method is to "decode up" which means doing the opposite (translating the word with a smaller length into an equivalent using a longer length). I want to make sure that I set up my query engine correctly. I brainstormed different tools to facilitate my implementation but what I have decided to do is add a new struct activeWord which contains the aligned information ready for ANDing and ORing between columns of different segment lengths. Essentially, my code will read the next word in the compressed file, translate them into activeWord structs (decoding the appropriate column in the user-selected method of decoding), and then query the activeWord structs together (assuming that they are now correctly aligned).

I need to sleep on all of this to make sure this is how I want to set up my query engine, but I'm confident this is the way I will proceed tomorrow.

Friday, June 10, 2016

Friday, June 10

Very productive day today. I finished debugging my compressors (there were a few bugs, including one pretty hefty one that I managed to pound out) and implemented the optimal compression. It compresses the bitmap using segment lengths of 7, 14, and 28 and picks the length that results in the shortest file size. It does this by storing the shortest bitmap in a buffer (the filename is the same path but with a column number that doesn't exist) and then simply renaming the compressed file to the appropriate name. It all works perfectly on all of the small files but fails at various points in the larger files, I think because of a lack of memory, so I may have a memory leak somewhere. I will look into it, but I'm not terribly concerned. I will be moving on to querying next week!

I also briefly talked to David to check in. Not very eventful, but I did confirm with him that after implementing the query engine, I should extend the implementation I have to support VAL64, not just my current VAL32 implementation. That shouldn't be terribly difficult, as all of my current utilities are very adaptable depending on segment length. It's more just a matter of variable and type declarations throughout the program.

Thursday, June 9, 2016

Thursday, June 9

I actually decided to take the day off today. I had to go to the hospital on Tuesday and still wasn't feeling great.

Wednesday, June 8, 2016

Wednesday, June 8

I ran tests today to make sure my VAL compressor engine is working correctly. The main issue I wanted to address was the extra padding of 0s at the end of each column, which I pored over trying to figure out why it was wrong, until I remembered that my computer is little-endian: I was just reading it backwards, and it was, in fact, correct all along.

While running all the tests, I found a bug that caused the first word in every column but the first to be incorrect because the memory from the previous column was not cleared, so I now set it to 0 at the beginning of compressing each column.

Other than that, it looks like it is working properly! The rest of this week I will focus on implementing the optimal length compressor, which compresses each column using segment lengths of 7, 14, and 28, picks the shortest compression for each column, saves it, and uses that one for querying.

Tuesday, June 7, 2016

Tuesday, June 7

Sorry for not checking in yesterday. Time got the best of me.

Big day today! It looks like my VAL compressor is working! I've only been running it with a segment length of 7 to be consistent, but it should be working with 14 and 28 bit segments as well. There's one little issue with the padding on the last word that I want to look into, but otherwise it seems to be functioning very well! The only thing left to do with it, really, is to implement the optimal segment length compressor. That basically involves running the 7, 14, and 28-bit compressors and then selecting the smallest compressed column. The entire compressed bitmap will then have compressed columns with different segment lengths. It will be interesting to see the distribution of segment lengths within each bitmap and find out which length is chosen most often.

The rest of the week I want to dedicate not only to finishing up these last few things but also to running tests with the compressor to make sure it is working EXACTLY like it should. The last thing I want is for a random bit to make its way somewhere into the compressed columns...

Friday, June 3, 2016

Friday, June 3

Still working on implementing VAL. I had to make what felt like a significant decision regarding file formatting and variable declarations. You see, the original bitmap formatter I wrote last summer rewrites text files as binary files with appropriate padding at the beginning of words. WAH is easy to reformat, as you just add an extra 0 every 31 or 63 bits. However, with VAL, the segments can vary in length, so I thought a lot about the best way to format these files. I eventually decided to pad the segments every 7 bits, since all of the lengths are divisible by 7. This also allows the compressor to read the file byte by byte and recreate the 14 and 28 bit segments as necessary by reading two or four bytes and building a 14 or 28 bit segment respectively.

This also raised an issue regarding the type definition of a word. WAH 32 and WAH 64 both read and write 32 bit words. However, VAL reads 8 bits at a time but writes 32 bit words. The WAH engine I wrote uses the same type for both reading and writing, so I had to add separate type definitions for reading and writing, which are the same for both WAH compressors but different for the VAL compressor.

Thursday, June 2, 2016

Thursday, June 2

Nothing terribly exciting to report today. Still working on the core of the compressBlock(struct blockSeg *) method. I've decided to focus on compressing with just one segment length first. Once I have that working properly, I will add in the optimal segment length compression that compresses the column using 7, 14, and 28 bit segments and then chooses the length that results in the smallest compressed column.

Wednesday, June 1, 2016

Wednesday, June 1

I spent today reading through my WAH code. I want to mimic its tools and format as much as possible, so I need to have a good understanding of how it maps out. Specifically, I use a struct I call a "blockSeg" that holds all of the information needed to begin compressing a column, like the column file pointer, the current block being compressed, etc. I wanted to read through how the blockSeg struct is used in the WAH code so I can easily mirror how it is handled, but with the VAL algorithm underneath instead of WAH.