2008-08-07 05:07 -!- tux3bot(~tux3bot@yzf.shapor.com) has joined #tux3 2008-08-07 05:56 http://web.archive.org/web/20060904185736/http://www.complang.tuwien.ac.at/anton/lfs/ 2008-08-07 05:57 similar thought process on versioning 2008-08-07 06:56 wow- that's pretty cool. 2008-08-07 06:57 very similar method of managing free space 2008-08-07 07:01 -!- pgquiles(~pgquiles@224.Red-81-39-154.dynamicIP.rima-tde.net) has joined #tux3 2008-08-07 09:23 more purdy: http://shapor.com/tux3/ 2008-08-07 09:50 bitchin' 2008-08-07 09:58 that really looks nice shapor 2008-08-07 09:58 is this now the "official site" or the mirror site? 2008-08-07 09:59 and wtf on zumastor? 2008-08-07 10:07 wtf? 2008-08-07 10:11 wtf wtf i mean 2008-08-07 10:11 :-0 2008-08-07 10:12 i'm thinking of organizing daniels mailing list posts in to a design document 2008-08-07 10:12 morning 2008-08-07 10:16 morning flips 2008-08-07 10:28 shapor, that would be most wonderful 2008-08-07 10:29 u guys heard of MogileFS? 2008-08-07 10:29 http://www.danga.com/mogilefs/ 2008-08-07 10:29 never 2008-08-07 10:29 its a selfish effort, to absorb it all ;) 2008-08-07 10:29 distributed fs that allows you to run any type of fs locally 2008-08-07 10:30 Application level -- no special kernel modules required. 2008-08-07 10:30 veoh networks uses it. was developed for live journal 2008-08-07 10:30 sounds like not a filesystem really 2008-08-07 10:30 actually sounds a lot like google's gfs 2008-08-07 10:31 MogileFS is not: 2008-08-07 10:31 * POSIX Compliant 2008-08-07 10:31 kerneltrap picked up the matt dillon dialogue 2008-08-07 10:31 so yeah, much like google's gfs, not a real filesystem ;) 2008-08-07 10:31 useful thing maybe though 2008-08-07 10:32 who knows 2008-08-07 10:32 cloud? 
2008-08-07 10:32 its very useful that its open source 2008-08-07 10:32 yeah, saw the *not* posix note 2008-08-07 10:32 like hadoop and memcached 2008-08-07 10:32 nice tools for building clusters for specific applications 2008-08-07 10:33 web 2.0 stuff 2008-08-07 10:33 speaking of which 2008-08-07 10:33 danga = livejournal guy 2008-08-07 10:33 someone at a startup i was talking to yesterday 2008-08-07 10:33 is running redhat's gfs 2008-08-07 10:33 heh 2008-08-07 10:33 fun? 2008-08-07 10:33 in a backend cluster 2008-08-07 10:33 and it keeps crashing 2008-08-07 10:33 when load gets high 2008-08-07 10:33 cascading failure 2008-08-07 10:33 all nodes die 2008-08-07 10:33 never woulda thunkit 2008-08-07 10:34 i'm lurking on the hadoop irc now 2008-08-07 10:34 that's how i heard of mogilefs 2008-08-07 10:35 the lfs article is a good read 2008-08-07 10:35 thinking way back, it was _really_ dumb of me to try to put the free space bitmaps inside the snapshot 2008-08-07 10:36 nobody questioned it at the time 2008-08-07 10:36 good thing you can question your own work, huh? 2008-08-07 10:37 if you live long enough 2008-08-07 10:37 I'm thinking about that because the lfs guy seems to be busy making the same mistake 2008-08-07 10:38 "At each snapshot we have a set of free blocks (and complementary, a set of allocated blocks) for the files reachable through this snapshot."
2008-08-07 10:39 btrfs uses per-block refcounts for that (ouch) and zfs uses dead block lists (complexity and weirdness) 2008-08-07 10:39 tux3 just has a conventional allocator and can figure out what blocks can actually be freed by looking at its version information 2008-08-07 10:40 this is a huge advantage to keeping all the version information for a given block together in one place 2008-08-07 10:40 flips: i found that site linked off comments on lwn from last year 2008-08-07 10:40 related to zumastor i think 2008-08-07 10:40 ah 2008-08-07 10:41 http://lwn.net/Articles/170346/ 2008-08-07 10:41 hm thats not the one 2008-08-07 10:42 http://lwn.net/Articles/239369/ 2008-08-07 10:42 that one 2008-08-07 10:42 ah yeah stumbled across it reading about gplv3 2008-08-07 10:43 that comment is not very accurate re wafl 2008-08-07 10:44 I like linus's comment 2008-08-07 10:44 "Umm. You are making the fundamental mistake of thinking that Sun is in 2008-08-07 10:44 this to actually further some open-source agenda." 2008-08-07 10:46 remember scott mcneally wearing the penguin suit? 2008-08-07 10:51 lol that was a long time ago 2008-08-07 10:51 seems like only yestaday 2008-08-07 10:55 see "Free-space management and clones" in http://www.complang.tuwien.ac.at/anton/lfs/ and you will see diagrams much like versioned pointers 2008-08-07 10:55 yes! 2008-08-07 10:56 but he didn't go on to realize you could actually represent the version data that way 2008-08-07 10:56 only has rather similar freeable block algorithm 2008-08-07 10:57 also, you don't need .killed, on .born 2008-08-07 10:57 only .born 2008-08-07 10:57 you should email him 2008-08-07 10:57 invite him to the list 2008-08-07 10:57 sure 2008-08-07 10:57 he is a professor i believe 2008-08-07 10:57 he was talking about potentialy putting some students on lfs 2008-08-07 10:57 might be useful for tux3 2008-08-07 10:58 very pretty site shapor 2008-08-07 10:58 hand coded html? 
2008-08-07 10:58 looks like you speak his language 2008-08-07 10:58 I think so 2008-08-07 10:58 yes I will email 2008-08-07 10:58 fartenpoopenspakenziepissin 2008-08-07 10:58 :-) 2008-08-07 10:58 genau 2008-08-07 10:59 yes hand coded html 2008-08-07 10:59 looked like crap until i added the css 2008-08-07 10:59 its very simple just 1996-ish h1, h2, and li tags 2008-08-07 10:59 I have issues with this post http://www.complang.tuwien.ac.at/anton/memory-wall.html 2008-08-07 11:00 I think its a few years old 2008-08-07 11:01 there's no such thing as an infinitely fast CPU 2008-08-07 11:52 mind if I plagiarize your html? 2008-08-07 11:53 tim_dimm, what about the disk wall? 2008-08-07 11:53 or more properly, disk chasm 2008-08-07 12:08 disk wall? you mean wall-o-disk ? 2008-08-07 12:09 the big leap? the humongoid gap? 2008-08-07 12:49 was playing lego indiana jones last night, why I thought about chasms 2008-08-07 13:32 why: 2008-08-07 13:33 -if (!tuxread(&sb, 6, buf, 11)) 2008-08-07 13:33 +if (tuxread(&sb, 6, buf, 11)) 2008-08-07 13:34 wow 711 diggs! 2008-08-07 13:36 "But will Tux3 be the ReiserFS killer?" 2008-08-07 13:36 "The ReiserFS murderer" 2008-08-07 13:36 lol 2008-08-07 13:37 "Are you saying that Tux3 will be made from the blood, sweat and horror filled tears of the developers Wives..........probably :P" 2008-08-07 13:37 http://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices <- this is really helping me think about lvm3 2008-08-07 13:37 url? 
2008-08-07 13:37 linked from http://shapor.com/tux3/ 2008-08-07 13:37 the "digg" link at the bottom 2008-08-07 13:38 ah, I looked at that line lots of times without noticing the digg link 2008-08-07 13:39 the reiser jokes seem to keep the conversation going 2008-08-07 13:39 70 comments 2008-08-07 13:39 http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-volumes.html <- actually, this is the interesting one 2008-08-07 13:39 most of the good ones re "below the threshold" 2008-08-07 13:39 heh 2008-08-07 13:44 wow. css is a completely different syntax than html/xml 2008-08-07 13:48 yes it is metadata for the html to use 2008-08-07 13:48 the one i put up is in no way minimalistic 2008-08-07 13:49 it is originally from doxygen 2008-08-07 13:49 so it has lots of crap in it 2008-08-07 13:49 could be trimmed down to 10 lines to do the same thing 2008-08-07 13:54 mistake in your html 2008-08-07 13:55 2008-08-07 13:55 looking forward to the 10 line version 2008-08-07 13:56 i wasn't planning on making it 10 lines 2008-08-07 13:57 its also probably not w3c compliant 2008-08-07 13:58 http://validator.w3.org/check?uri=http%3A%2F%2Fshapor.com%2Ftux3 2008-08-07 14:01 flips: i just quickly pulled all the css out for classes 2008-08-07 14:01 since i dont set any, they never get used 2008-08-07 14:01 so it is equivalent 2008-08-07 14:09 created a mercurial tree for tux3 kernel 2008-08-07 14:17 linus knocked us off the top of the lkml.org hot list with his 2.6.27-rc2 announcement 2008-08-07 14:17 now #2 2008-08-07 14:33 heh 2008-08-07 15:17 only 2008-08-07 15:17 only #? 2008-08-07 15:17 wtf good is that? 2008-08-07 15:17 ;-) 2008-08-07 15:49 flips: bitbucket has a wiki feature, its a nicer way of storing notes/links than editing html 2008-08-07 15:49 best part is, its all tracked in a mercurial repo 2008-08-07 15:49 http://www.bitbucket.org/shapor/tux3/wiki/OtherFilesystems 2008-08-07 15:50 where can I download the source? 
2008-08-07 15:50 http://www.wikicreole.org/ i think 2008-08-07 15:51 although i'm not sure what they are using to integrate with hg 2008-08-07 15:54 sudo apt-get install ikiwiki 2008-08-07 15:55 I like this: http://www.selenic.com/mercurial/wiki/ 2008-08-07 15:55 oh neat 2008-08-07 15:55 yeah thats better than using some 3rd party thing 2008-08-07 15:56 little more work to setup though 2008-08-07 15:56 no doubt 2008-08-07 15:56 there was also one with an issue tracker somewhere 2008-08-07 15:57 people seem to be using rcs as a data backend for more and more things these days 2008-08-07 15:57 it just makes sense 2008-08-07 15:58 however it is quite slow 2008-08-07 15:58 http://moinmoin.wikiwikiweb.de/MoinMoin 500 - Internal Server Error 2008-08-07 16:05 http://www.mantisbt.org/bugs/view_all_bug_page.php 2008-08-07 16:10 http://moinmo.in/ <- works 2008-08-07 16:13 tim_dimm, want to try the dcc again? 2008-08-07 16:13 I keep failing to notice the popup 2008-08-07 16:13 not that xchat dcc has ever worked for me 2008-08-07 16:13 hm 2008-08-07 16:13 call me then 2008-08-07 16:13 something about nat and xchat being lame 2008-08-07 16:57 -!- MaZe(~MaZe@64.173.151.3) has joined #tux3 2008-08-07 16:59 http://www.physorg.com/news137325794.html 2008-08-07 17:13 hey maze 2008-08-07 17:13 ok, let's consider a serious issue: how to get people to send us beer 2008-08-07 17:13 I'll put a link on the tux3.org page when we figure out what the link should say 2008-08-07 17:17 http://tux3.org/ <- send beer here 2008-08-07 17:33 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-07 17:52 struct btree_ops { 2008-08-07 17:52 int (*leaf_verify)(SB, void *leaf); 2008-08-07 17:52 }; 2008-08-07 17:52 struct btree_ops ftree_ops = { 2008-08-07 17:52 .leaf_verify = leaf_verify, 2008-08-07 17:52 }; 2008-08-07 17:53 btree.c on its way to crappy-c-style object oriented 2008-08-07 18:06 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
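[editor's sketch] The btree_ops paste at 17:52 shows the usual C way of making generic code "object oriented": a struct of function pointers that each tree type fills in. A minimal, self-contained version might look like the following. Only struct btree_ops, leaf_verify and ftree_ops come from the log; struct sb, the SB macro expansion, struct fleaf, FLEAF_MAGIC, fleaf_verify and btree_leaf_ok are hypothetical names made up for illustration and are not taken from the tux3 source.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a stand-in superblock; the real tux3 type is not shown in the log. */
struct sb { unsigned blocksize; };

/* Assumption: SB expands to a superblock parameter, as the pasted prototype suggests. */
#define SB struct sb *sb

/* Per-tree operations table, as pasted in the log. */
struct btree_ops {
	int (*leaf_verify)(SB, void *leaf);
};

/* Hypothetical leaf header; the magic value is invented for this sketch. */
#define FLEAF_MAGIC 0x1eaf
struct fleaf { uint16_t magic; uint16_t count; };

/* One concrete leaf_verify: returns 1 if the block looks like an fleaf, else 0. */
int fleaf_verify(SB, void *leaf)
{
	(void)sb; /* unused in this sketch */
	return ((struct fleaf *)leaf)->magic == FLEAF_MAGIC;
}

/* The file tree plugs its own method into the generic table. */
struct btree_ops ftree_ops = {
	.leaf_verify = fleaf_verify,
};

/* Generic btree code never names a leaf type; it only calls through ops. */
int btree_leaf_ok(SB, struct btree_ops *ops, void *leaf)
{
	assert(ops && ops->leaf_verify);
	return ops->leaf_verify(sb, leaf);
}
```

A second tree type (say, an inode table) would supply its own ops struct with a different leaf_verify, and btree_leaf_ok would stay unchanged. That is the property the later "generic btree thing came true" remark relies on.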
2008-08-07 18:37 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-07 19:13 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-07 21:44 shapor, a starting point: http://tux3.org/design.html 2008-08-07 22:11 oh thats nice 2008-08-07 22:12 i'm going to be pretty busy the next couple days, probably wont get a chance to do much until sunday 2008-08-07 22:13 docs should go in vcs too though 2008-08-07 22:15 flips: http://lwn.net/Articles/234441/ 2008-08-07 22:16 mentions "versioned pointers" 2008-08-07 22:16 only to the root inode 2008-08-07 22:16 that was one of the few pre-existing hits on versioned pointers before my post 2008-08-07 22:17 now 10 out of first 10 google hits point at me and/or tux3 2008-08-07 22:18 hit number 22 is finally on some msft concept 2008-08-07 22:19 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-07 22:20 there are a lot of similarities coming from the logfs people 2008-08-07 22:20 but basically they are another tree versioned concept 2008-08-07 22:20 hamer is closer 2008-08-07 22:20 hammer 2008-08-07 22:20 see the beer link yet? 2008-08-07 22:20 i did 2008-08-07 22:20 so far no beer 2008-08-07 22:21 heheh 2008-08-07 22:21 nice 2008-08-07 22:21 yeah i saw it too ;) 2008-08-07 22:21 flips: only because you can't email beer ;) 2008-08-07 22:21 I don't have any beer here, only... 2008-08-07 22:21 Results 1 - 10 of about 308,000 for tux3. 2008-08-07 22:21 if only you could 2008-08-07 22:21 flips: i gave you one the other day!! 2008-08-07 22:21 I'm still subsisting on ramback beer 2008-08-07 22:21 so i'm the first beer contributor 2008-08-07 22:21 :-) 2008-08-07 22:21 true 2008-08-07 22:22 and what does that make me? 
2008-08-07 22:22 although i think that may have hindered more than it helped ;) 2008-08-07 22:22 a contributor 2008-08-07 22:22 to juvenile delinquency 2008-08-07 22:22 I was pretty useless that night 2008-08-07 22:22 crashed just in front of my door 2008-08-07 22:22 good beer :-D 2008-08-07 22:22 hung out on 3rd st on the way back 2008-08-07 22:22 that was great 2008-08-07 22:22 hehe 2008-08-07 22:23 shapor, you caused me to almost break my wrist. you know that, right/ 2008-08-07 22:23 ? 2008-08-07 22:23 ACTION is a bad influence 2008-08-07 22:23 showing off for drunks 2008-08-07 22:23 that's not right 2008-08-07 22:23 juvenile delinquency is a bad influence on us respectable citizens 2008-08-07 22:25 "It's good that we have projects like this I think. Sun and ZFS really need some competition, and maybe soon we'll have no need to envy ZFS." 2008-08-07 22:25 on the zfs list? 2008-08-07 22:25 "do you have ZFS envy" -> tux3.org poll question 2008-08-07 22:26 nice 2008-08-07 22:26 nice hook 2008-08-07 22:26 tuxopen(&sb, 123, buf, sizeof(buf), 0); 2008-08-07 22:27 Segmentation fault 2008-08-07 22:28 left out the init_buffers and tux3_init 2008-08-07 22:28 new node 2008-08-07 22:28 new leaf blocksize = 4096 2008-08-07 22:28 root at 1 2008-08-07 22:28 leaf at 2 2008-08-07 22:28 Thu Aug 7 22:28:01 2008: [10463] probe: Failed assertion "(ops->leaf_sniff)(sb, buffer->data)" 2008-08-07 22:28 http://www.scdsource.com/article.php?id=296 2008-08-07 22:28 Trace/breakpoint trap 2008-08-07 22:29 Looks like sicortex interconnect design 2008-08-07 22:29 somebody should push forward the i/o pin tech 2008-08-07 22:30 would not be that hard 2008-08-07 22:30 3D stacking for pins 2008-08-07 22:30 micro pins 2008-08-07 22:30 get way more that way 2008-08-07 22:30 actually, 3D switches with pins on the ends of the switches 2008-08-07 22:31 pin->switch->pin 2008-08-07 22:31 you know pins are currently .1 mm right? how stupid is that? 
2008-08-07 22:31 ball pens 2008-08-07 22:31 they could easily be 1 um 2008-08-07 22:31 yeah, its a problem for the memory channel too 2008-08-07 22:31 problem isn't so much the pin, its the link 2008-08-07 22:31 on the mobo 2008-08-07 22:31 problem for most packages 2008-08-07 22:32 tim_dimm: from http://www.linux.com/feed/142781 comments 2008-08-07 22:32 idiotic that it hasn't been fixed reasonably 2008-08-07 22:32 nano-technology to the rescue 2008-08-07 22:33 "slew-rate requirements" 2008-08-07 22:33 data skew 2008-08-07 22:34 flips: wheres tuxopen ? 2008-08-07 22:34 not checked in yet 2008-08-07 22:34 pukes 2008-08-07 22:34 ah thats why its hard to find 2008-08-07 22:34 coming in about .5 hr, depending on whether I get into another ramback beer or not 2008-08-07 22:35 will that make it take more or less time? ;) 2008-08-07 22:35 more 2008-08-07 22:35 you found the fleaf bug yet? 2008-08-07 22:36 i haven't had a chance to look yet 2008-08-07 22:36 busy with work all day 2008-08-07 22:36 that's sick 2008-08-07 22:36 well some of us have day jobs! 2008-08-07 22:36 :P 2008-08-07 22:40 new leaf blocksize = 4096 2008-08-07 22:40 root at 1 2008-08-07 22:40 leaf at 2 2008-08-07 22:40 inode at 123 2008-08-07 22:40 Thu Aug 7 22:39:52 2008: [10543] ileaf_lookup: Failed assertion "at < leaf->count" 2008-08-07 22:43 pardon my noobness- what does assert do? 
2008-08-07 22:45 similar to what it means in english 2008-08-07 22:45 "abort if this is not true" 2008-08-07 22:46 just figured that out 2008-08-07 22:46 http://www.cplusplus.com/reference/clibrary/cassert/assert.html 2008-08-07 22:46 me learn to use the google 2008-08-07 22:46 teh google 2008-08-07 22:46 yeah, the correct answer is http://www.justfuckinggoogleit.com/ 2008-08-07 22:47 but its not very nice to send that link 2008-08-07 22:47 :P 2008-08-07 22:51 teh goggles 2008-08-07 22:53 beer goggles' 2008-08-07 22:55 lookup inode 123, 0 + 123 2008-08-07 22:55 0 inodes, 4084 free: 2008-08-07 22:55 release buffer for 1 2008-08-07 22:55 release buffer for 2 2008-08-07 22:55 Thu Aug 7 22:54:52 2008: [10710] tuxopen: no inode 123 2008-08-07 22:55 good 2008-08-07 22:55 now to set the create flag and create the inode 2008-08-07 22:58 I know you guys are going to be at it until the wee hrs 2008-08-07 22:58 whoever wrote the ddsnap warn macro needs to be hurt 2008-08-07 22:58 not me 2008-08-07 22:58 wasn't I 2008-08-07 22:58 gotta recover 2008-08-07 22:58 I've been getting up at 6am for contractors all week 2008-08-07 23:00 <-crashing 2008-08-07 23:00 ttyl 2008-08-07 23:02 creating the inode requires abstracting add_extent_to_tree to also be able to add an inode 2008-08-07 23:03 starting with renaming as add_entity_to_tree 2008-08-07 23:33 flips: not my bug :) 2008-08-07 23:34 ACTION does a happy dance 2008-08-07 23:36 :-) 2008-08-07 23:36 mine?
2008-08-07 23:39 actually, maybe mine 2008-08-07 23:39 you've got mail :) 2008-08-07 23:40 that was about 30 minutes of debugging 2008-08-07 23:41 most of which was just isolating the issue 2008-08-07 23:42 we should put a test in the main which tests all the boundary conditions 2008-08-07 23:42 good call on grouplim = 7 2008-08-07 23:42 would have taken us a lot longer to hit this otherwise 2008-08-07 23:43 yes 2008-08-07 23:43 use the same trick as I did in btree.c, redefine main as notmain, include in another file and go crazy 2008-08-07 23:44 there's a lot to be said for having a very simple single file sanity test as now, at this point 2008-08-07 23:44 when i have more time 2008-08-07 23:44 i can rest now knowing that one is fixed 2008-08-07 23:44 there's eventually going to be a generic fuzztester for all leaf methods 2008-08-07 23:44 i dont want to discover 5 more which require 30 min of debugging each ;) 2008-08-07 23:45 but knowledge of the specific boundary conditions means you can write a better single purpose test 2008-08-07 23:45 not tonight anyway 2008-08-07 23:46 nice fix 2008-08-07 23:46 did that used to read entry == entries 2008-08-07 23:47 i think i might have changed it to !group->count when i added splitting 2008-08-07 23:47 i remember poking it before, might have just been in testing 2008-08-07 23:48 how do i go back to an old rev in hg 2008-08-07 23:48 test now succeeds indeed 2008-08-07 23:48 hg co 2008-08-07 23:49 woo 2008-08-07 23:49 unsigned limit = !group->count || entry == entries ? 0 : (entry + 1)->limit; 2008-08-07 23:49 *entry = (struct entry){ .loglo = loglo, .limit = limit }; 2008-08-07 23:58 yeah, see anything wrong? 2008-08-07 23:58 lgtm 2008-08-07 23:59 if you're worried about limit=0, don't be, you go through and increment all the limits by 1 starting at entry* 2008-08-08 00:00 btw it was your bug...
hg annotate says rev 15 ;) 2008-08-08 01:22 Fri Aug 8 01:21:43 2008: [11736] tuxopen: no inode 123 2008-08-08 01:22 Fri Aug 8 01:21:43 2008: [11736] tuxopen: new inode 123 2008-08-08 01:22 expand inum 291 at 0/0 by 16 2008-08-08 01:22 :-) 2008-08-08 01:22 yes it was my bug 2008-08-08 01:22 subtle indeed 2008-08-08 01:23 have to check two boundary conditions 2008-08-08 01:30 it threw me for a loop at first because it happened right after a split 2008-08-08 01:30 so i scrutinized the split code, which of course is flawless :P 2008-08-08 01:50 Fri Aug 8 01:49:40 2008: [30550] leaf_insert: Failed assertion "tail >= 0" 2008-08-08 02:48 hmm 2008-08-08 07:50 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-08 09:09 flips: i made a new test.. that errors was happening due to a corrupt tree 2008-08-08 09:09 after split group with count 1 at 1 2008-08-08 09:10 due to one single entry having grouplim entries 2008-08-08 09:10 which would never happen if grouplim==255 and we limited snapshots to 255 2008-08-08 11:14 hi 2008-08-08 11:15 you mean the new assertion above or the one you just fixed? 
2008-08-08 11:15 g'mornin 2008-08-08 11:15 new one 2008-08-08 11:15 that sounds good 2008-08-08 11:15 the thing i fixed is right 2008-08-08 11:15 it looked right 2008-08-08 11:16 last night i used the #define main trick 2008-08-08 11:16 and gave it a few stress tests 2008-08-08 11:16 that is the only way it fails 2008-08-08 11:16 when a group has one entry which has grouplim entries in it 2008-08-08 11:17 in other words, more than grouplim different versions 2008-08-08 11:17 well no 2008-08-08 11:18 er 2008-08-08 11:18 it fell out of cache now 2008-08-08 11:18 heh 2008-08-08 11:18 oh right 2008-08-08 11:18 its exactly what i said earlier 2008-08-08 11:19 i think 2008-08-08 11:19 ACTION looks for all the things shapor said earlier 2008-08-08 11:19 the problem is splitting a group of count 1 2008-08-08 11:19 obviously useless 2008-08-08 11:20 because it has grouplim entries 2008-08-08 11:20 yes, what I said I thought 2008-08-08 11:20 oh, yes, you're right 2008-08-08 11:20 we're both right just using different words 2008-08-08 11:20 good 2008-08-08 11:21 sorry i confused yself 2008-08-08 11:21 myself* 2008-08-08 11:21 one "fix" is worth 3 dozen "confuseds" 2008-08-08 11:21 you're checking in your test? 
2008-08-08 11:23 sure 2008-08-08 11:32 short n sweet 2008-08-08 11:33 fleaf.c because dleaf.c, data leaf 2008-08-08 11:34 reason is, I want fleaf for the free tree leaf 2008-08-08 11:34 so: ileaf (inode) dleaf (file data) fleaf (free blocks) aleaf (atimes) 2008-08-08 11:35 oh, and directory leaf :-/ 2008-08-08 11:35 that is another dleaf 2008-08-08 11:35 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-08 11:35 ah, but that is hleaf 2008-08-08 11:36 dleaf could also be eleaf, extent leaf 2008-08-08 11:36 that is better I think 2008-08-08 11:37 -!- pgquiles(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-08 11:37 maybe use 3 letters instead 2008-08-08 11:37 dirleaf 2008-08-08 11:37 that would fix it 2008-08-08 11:37 filleaf 2008-08-08 11:37 freleaf 2008-08-08 11:37 less confusing 2008-08-08 11:38 tlaleaf 2008-08-08 11:38 heh 2008-08-08 11:38 tnaleaf 2008-08-08 11:40 ileaf (inode table leaf) eleaf (extent tree leaf) fleaf (free extents leaf) aleaf (atime table leaf) dleaf (directory dirent block) 2008-08-08 11:41 can change our mind a few more times 2008-08-08 11:41 -!- pgquiles_(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-08 11:41 mercurial and git both lack metadata for tracking name changes 2008-08-08 11:41 suxors 2008-08-08 12:44 -!- pgquiles_(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-08 18:25 "Results 1 - 10 of about 308,000 for tux3" 2008-08-08 18:37 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-08 19:06 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-08-08 19:52 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-08 19:56 http://www.linkedin.com/pub/0/a57/305 2008-08-08 22:18 zfs design docs are really lame 2008-08-08 22:18 e.g.: http://opensolaris.org/os/community/zfs/structures/;jsessionid=9640F315CA43C6E0BED9C5DAEE2767D2 2008-08-08 22:49 ACTION gives up on finding any 
complete, coherent zfs design doc 2008-08-09 00:26 flips: ping 2008-08-09 01:07 shapor, pong 2008-08-09 01:15 maze, hi over here 2008-08-09 01:15 sure, no problem, I've been very busy - what about you? 2008-08-09 01:16 me too 2008-08-09 01:16 I've seen there's a lot of new email in my inbox about tux3 2008-08-09 01:16 and I haven't finished reading all of the old emails... ;-( 2008-08-09 01:16 is there enough tux3 code up now to get out of the "lame no code" category? 2008-08-09 01:16 (I just got back from a 'conference' like 2 days) 2008-08-09 01:17 there are a lot of emails to read 2008-08-09 01:17 particularly the matt dillon series 2008-08-09 01:17 http://kerneltrap.org/Linux/Comparing_HAMMER_And_Tux3 2008-08-09 01:17 depends, ultimately I guess until you have something that works as a filesystem (even if only as a userspace library)... 2008-08-09 01:17 that isn't even all of them 2008-08-09 01:17 it's close to that 2008-08-09 01:17 can open/make an inode now, read/write a file 2008-08-09 01:18 now need to open _then_ read/write 2008-08-09 01:18 and have multiple files 2008-08-09 01:18 not far away 2008-08-09 01:18 then directories... bitmap allocation... 2008-08-09 01:18 deletion... 2008-08-09 01:18 versions... 2008-08-09 01:18 ;-) 2008-08-09 01:19 atomic commit... 2008-08-09 01:19 one of the things that seemed to consistently crop up about filesystems at the meeting I was just at, was that a small amount of flash, to either have the superblock, or all the metadata (possibly including 'small' files) in flash is something desirable 2008-08-09 01:21 for tux3, being able to write the beginning of forward log chains to nvram would be nice 2008-08-09 01:21 only a few bytes needed 2008-08-09 01:21 but the commit logging strategy is going to run near media speed I think 2008-08-09 01:22 will get to that in a week or two 2008-08-09 01:22 ..fragmentation...
2008-08-09 01:22 right, there is at least a plan 2008-08-09 01:23 using generating functions to do a quadratic hash-like bounce to successively further away allocation goals 2008-08-09 01:23 meaning that when data does get bounced away from home, different updates get bounced to the same place 2008-08-09 01:23 right 2008-08-09 01:24 also, the logging strategy likes a certain amount of fragmentation 2008-08-09 01:24 this leaves places to store commit blocks at convenient places all over the volume 2008-08-09 01:24 shapor did some killer bug hunting 2008-08-09 01:25 right, I'll really have to read the design docs to get a better feel for all this 2008-08-09 01:26 design doc is essentially the lkml post converted to html 2008-08-09 01:26 shapor talked about adding some of the mailing list posts to it 2008-08-09 01:27 if he doesn't get around to it, I will add some 2008-08-09 01:27 yes, that would be much appreciated 2008-08-09 01:27 the generic btree thing came true 2008-08-09 01:27 works like a charm 2008-08-09 01:27 I haven't been able to be as involved as I would like to be 2008-08-09 01:28 generic as in? 2008-08-09 01:28 both for inodes and file content? 2008-08-09 01:28 as in the same btree code works for the inode table and file indexes 2008-08-09 01:28 and soon will also implement the free map 2008-08-09 01:28 yes 2008-08-09 01:28 and atime table, and directory indexes 2008-08-09 01:28 and I think I will add yet another table, a table of volume roots 2008-08-09 01:29 it's basically trivial to have multiple filesystems share the same allocation space 2008-08-09 01:29 do multiple filesystems really differ from directories in one root?
2008-08-09 01:29 yes, they don't invade each other's directories 2008-08-09 01:42 -!- pgquiles(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-09 01:48 -!- pgquiles_(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-09 01:51 struct inode *inode = tuxopen(&sb, 0x123, 1); 2008-08-09 01:51 tuxwrite(inode, 6, "hello world", 11); 2008-08-09 01:51 if (tuxread(inode, 6, buf, 11)) 2008-08-09 01:51 return 1; 2008-08-09 01:51 hexdump(buf, 11); 2008-08-09 01:51 works 2008-08-09 01:51 "open by inum" 2008-08-09 01:51 need to make it open by name now 2008-08-09 01:52 6 is the blocknum to write it on... need to make that byte offset and update the size attribute properly 2008-08-09 02:00 -!- pgquiles__(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-08-09 02:05 -!- pgquiles_(~pgquiles@d515302D0.access.telenet.be) has joined #tux3 2008-08-09 02:14 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-08-09 02:27 -!- pgquiles__(~pgquiles@d515302D0.access.telenet.be) has joined #tux3 2008-08-09 02:35 -!- pgquiles_(~pgquiles@d515302D0.access.telenet.be) has joined #tux3 2008-08-09 03:42 my rip of ext2/dir.c compiles, I'll test it... 
later 2008-08-09 06:00 -!- pgquiles(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-09 06:04 -!- pgquiles_(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-09 06:13 -!- pgquiles__(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-08-09 06:17 -!- pgquiles_(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-09 07:39 -!- pgquiles(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-09 07:45 -!- pgquiles_(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-09 07:54 -!- pgquiles__(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-09 07:55 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-09 12:07 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-09 12:56 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-09 14:00 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-09 14:36 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-09 15:11 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-09 16:29 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-09 18:21 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-10 00:35 -!- pgquiles(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-10 00:39 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-08-10 01:03 -!- pgquiles__(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-10 01:22 -!- pgquiles_(~pgquiles@d515302D0.access.telenet.be) has joined #tux3 2008-08-10 02:47 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-08-10 02:55 -!- pgquiles(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-10 04:43 -!- pgquiles(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-10 04:48 -!- 
pgquiles_(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-10 04:53 -!- pgquiles__(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-08-10 05:00 -!- pgquiles_(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-10 06:19 -!- pgquiles__(~pgquiles@d515302D0.access.telenet.be) has joined #tux3 2008-08-10 08:28 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-10 10:01 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-10 10:45 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-10 11:29 -!- pgquiles(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-10 11:37 -!- pgquiles_(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-10 11:42 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-10 12:19 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-10 12:39 -!- pgquiles_(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-10 13:47 -!- pgquiles_(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-11 01:39 -!- pgquiles(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-11 01:46 -!- pgquiles_(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-11 02:07 -!- pgquiles__(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-08-11 03:16 -!- pgquiles_(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-11 06:38 -!- Kirantpatil(~kiran@122.167.181.85) has joined #tux3 2008-08-11 06:39 Hello list 2008-08-11 06:40 i have few questions 2008-08-11 06:51 i am getting confuse which filesystem to use for storage 2008-08-11 06:51 tux3 or btrfs 2008-08-11 06:53 i have been watching Daniel's posts from zumastor and tux3 list 2008-08-11 07:04 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-11 07:32 -!- pgquiles(~pgquiles@d515302CB.access.telenet.be) has joined 
#tux3 2008-08-11 08:07 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-08-11 08:37 -!- pgquiles(~pgquiles@d515302D0.access.telenet.be) has joined #tux3 2008-08-11 08:42 -!- pgquiles_(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-11 09:36 -!- Kirantpatil(~kiran@122.167.181.85) has left #tux3 2008-08-11 10:15 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-11 13:21 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-11 13:24 -!- pgquiles(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-11 13:46 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-11 15:06 -!- pgquiles_(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-11 15:11 -!- pgquiles__(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-11 16:56 -!- pgquiles__(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-11 22:20 -!- Kirantpatil(~kiran@122.167.211.229) has joined #tux3 2008-08-11 22:20 -!- Kirantpatil(~kiran@122.167.211.229) has left #tux3 2008-08-12 00:00 -!- pgquiles(~pgquiles@d515302D0.access.telenet.be) has joined #tux3 2008-08-12 00:01 -!- pgquiles(~pgquiles@d515302D0.access.telenet.be) has joined #tux3 2008-08-12 00:06 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-08-12 00:16 -!- pgquiles__(~pgquiles@d515302D0.access.telenet.be) has joined #tux3 2008-08-12 02:27 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-08-12 05:06 -!- pgquiles(~pgquiles@d515302D0.access.telenet.be) has joined #tux3 2008-08-12 05:12 -!- pgquiles(~pgquiles@d515302D0.access.telenet.be) has joined #tux3 2008-08-12 06:13 -!- Kirantpatil(~kiran@122.167.176.118) has joined #tux3 2008-08-12 06:21 -!- Kirantpatil(~kiran@122.167.176.118) has left #tux3 2008-08-12 07:02 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-12 07:09 -!- pgquiles(~pgquiles@d515302D0.access.telenet.be) has 
joined #tux3 2008-08-12 08:18 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-12 13:09 -!- pgquiles(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-12 13:42 -!- pgquiles(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-12 13:51 -!- pgquiles_(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-12 13:54 http://www.smh.com.au/news/off-the-field/bills-blue-screen-of-death-malfunction/2008/08/12/1218306871673.html <- bsod at the olympics opening ceremony 2008-08-12 13:54 bill gates was apparently present 2008-08-12 14:10 yeah saw that ;) 2008-08-12 14:17 -!- pgquiles_(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-12 14:38 -!- pgquiles_(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-12 14:43 -!- pgquiles__(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-12 15:21 -!- pgquiles_(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-12 16:26 grunt. Just ported viro's ext2_readdir to tux3 2008-08-12 17:19 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-12 21:44 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-12 23:26 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-08-13 01:15 -!- pgquiles(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-13 01:22 -!- pgquiles_(~pgquiles@d515302D0.access.telenet.be) has joined #tux3 2008-08-13 01:26 -!- pgquiles__(~pgquiles@d515302D0.access.telenet.be) has joined #tux3 2008-08-13 01:32 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-08-13 02:06 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-08-13 03:28 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-08-13 03:40 -!- Kirantpatil(~kiran@122.167.219.1) has joined #tux3 2008-08-13 03:40 -!- Kirantpatil(~kiran@122.167.219.1) has left #tux3 2008-08-13 05:37 -!- 
pgquiles(~pgquiles@d515302D0.access.telenet.be) has joined #tux3 2008-08-13 06:01 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-08-13 06:29 -!- pgquiles__(~pgquiles@d515302D0.access.telenet.be) has joined #tux3 2008-08-13 07:43 -!- pgquiles__(~pgquiles@d515302D0.access.telenet.be) has joined #tux3 2008-08-13 09:13 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-08-13 09:19 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-13 10:53 lwn linked the tux3 structure post 2008-08-13 10:53 got a google alert before the weekly edition is even posted 2008-08-13 10:54 that must mean that Jon creates the article pages already on the web before adding the top level link 2008-08-13 11:10 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-13 11:43 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-13 14:47 or it might mean that you traveled into the future and got the links. Well, did you? 2008-08-13 16:48 caught me ;-) 2008-08-13 16:48 ok, time to implement the allocation bitmaps now 2008-08-14 00:48 -!- pgquiles(~pgquiles@d515302D0.access.telenet.be) has joined #tux3 2008-08-14 00:54 -!- pgquiles_(~pgquiles@d515302CB.access.telenet.be) has joined #tux3 2008-08-14 00:59 -!- pgquiles__(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-08-14 01:15 -!- pgquiles_(~pgquiles@d515302D0.access.telenet.be) has joined #tux3 2008-08-14 11:48 -!- pgquiles(~pgquiles@d54C56A6E.access.telenet.be) has joined #tux3 2008-08-14 16:03 -!- boom(~boom@c-76-117-208-224.hsd1.nj.comcast.net) has joined #tux3 2008-08-14 16:04 Hello, was just reading about tux3 and it sounded interesting. 2008-08-14 16:39 boom: welcome 2008-08-14 16:56 shapor: Thanks :D 2008-08-14 16:56 I'm sorrry to report that I am not a developer and as such have little to offer but encouragement. 
2008-08-14 17:04 once we have some complete code you'd certainly be welcome to test it and find bugs :) 2008-08-14 17:07 Sounds good. 2008-08-14 18:02 if you're not a developer you can send beer :-) 2008-08-14 20:35 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-08-14 22:41 -!- flips(~phillips@phunq.net) has joined #tux3 2008-08-15 09:52 -!- boom(~boom@c-76-117-208-224.hsd1.nj.comcast.net) has joined #tux3 2008-08-15 09:52 -!- flips(~phillips@phunq.net) has joined #tux3 2008-08-15 10:59 -!- pgquiles(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-15 11:17 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-15 11:34 -!- pgquiles(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-15 11:35 Instead of beer, I have an offer for a free month of netflix, anyone interested? 2008-08-15 11:37 hmm, beer helps when coding, movies might be distracting 2008-08-15 11:37 thanks for the offer though! 2008-08-15 11:37 flips might 2008-08-15 11:38 we already have netflix :-) 2008-08-15 11:39 -!- pgquiles_(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-15 11:42 Ah well. 2008-08-15 11:49 -!- pgquiles__(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-15 12:11 -!- pgquiles__(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-15 12:11 flips: ping 2008-08-15 12:15 hi pqguiles 2008-08-15 12:15 good to see you here :-) 2008-08-15 12:15 pgquiles I meant 2008-08-15 12:15 whoops 2008-08-15 12:15 got to do something about those typos 2008-08-15 12:15 :-) 2008-08-15 12:16 flips: do you know Strigi? ( http://strigi.sf.net ) 2008-08-15 12:16 hi pgquiles 2008-08-15 12:16 I didn't know it 2008-08-15 12:16 I'm at aKademy and this morning while having breakfast I was talking with Strigi's lead developer. 
He says he would be interested in adding indices to the filesystem, so I acted as a pointer to you 2008-08-15 12:17 pau = &flips; 2008-08-15 12:17 I would like to provide support in tux3 for indexing daemons 2008-08-15 12:17 hi shapor 2008-08-15 12:18 I would like to work out a system where the filesystem can notice an indexer accurately about such things as new hard links 2008-08-15 12:18 perfect! 2008-08-15 12:18 ah cool, a hdd crawler 2008-08-15 12:18 so i'm guessing it does its own io throttling 2008-08-15 12:19 i've written such a tool before for backups 2008-08-15 12:19 well minus the indexing 2008-08-15 12:19 shapor: it does cpu throtting and he's working on implementing io throtting 2008-08-15 12:19 iothrottling is key, esp with fast cpus today 2008-08-15 12:20 i thought they were adding some kernel features to make that easier 2008-08-15 12:20 the good thing about strigi is it comes by default with kde4 and it integrates with nepomuk (semantic desktop and all) 2008-08-15 12:20 pgquiles, he should come onto the tux3 mailing list and say what he would like 2008-08-15 12:20 flips: that's exactly what I told him :-) 2008-08-15 12:20 :-) 2008-08-15 12:22 he should be arriving to The Netherlands in a couple of hours, I guess he'll subscribe in a few days 2008-08-15 12:22 79 members on the mailing list now 2008-08-15 12:23 there's a lot of interest in tux3 2008-08-15 12:25 well I'd better make another checkin then 2008-08-15 12:25 bitmap allocation almost integrated 2008-08-15 12:43 shapor, got a 64 bit printf patch for me? 
Just needs (long long) cast for all parameters printed as %L that are not actually long long on 64 bit linux 2008-08-15 12:43 or less verbosely, (u64), will work just as well 2008-08-15 12:43 on my laptop at home 2008-08-15 12:44 :( 2008-08-15 12:44 come to think of it, we should make all those easily findable 2008-08-15 12:44 by typedeffing something for them 2008-08-15 12:44 (foo_t) 2008-08-15 12:45 (llcompat) 2008-08-15 12:45 something 2008-08-15 12:45 something short hopefully 2008-08-15 12:46 (fudge) 2008-08-15 12:46 typedef long long fudge, ok? 2008-08-15 12:52 casting to u64 doesn't work 2008-08-15 12:52 because u64 is long unsigned int on 64 2008-08-15 12:53 and Lx expects long long unsigned int 2008-08-15 12:53 there isn't a u64 fmt string unfortunately 2008-08-15 12:54 casting to long long on 64 is kinda silly, i guess that is 128 or something? 2008-08-15 12:55 that's ok though 2008-08-15 12:55 it's just a printf 2008-08-15 12:55 mostly tracing anyway 2008-08-15 12:56 right about the u64 2008-08-15 12:56 so typedef long long fudge 2008-08-15 12:56 no u64 2008-08-15 12:57 typedef long long widen; 2008-08-15 12:57 there, that looks halfway civilized 2008-08-15 12:58 then when somebody gets around to implementing type aware printing in C it's an easy splat edit 2008-08-15 12:58 that is, when hell freezes over ;-) 2008-08-15 12:59 would be more accurate to say, when tux3 gets ported to c++ 2008-08-15 12:59 which will also be when hell freezes over because c++ does not support designated initializers 2008-08-15 13:00 it's really c--++ 2008-08-15 13:04 doesn't build 2008-08-15 13:04 dleaf.c:425: error: 'balloc' undeclared here (not in a function) 2008-08-15 13:05 need to include balloc.c or something 2008-08-15 13:06 happens in all the files actually 2008-08-15 13:07 hmm or forward declare it tux3.h ? and link to a balloc.o ? 
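A minimal sketch of the `widen` cast settled on in the printf exchange above. It assumes `u64` is a typedef for `uint64_t`. The `format_block` helper is made up for illustration, and the sketch uses the standard `ll` length modifier rather than the `%L` spelling used in the log. On common 64-bit Linux ABIs `long long` is still 64 bits wide, so the cast should only change the type the compiler sees, not the value.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint64_t u64;		/* assumed: tux3's u64 is a 64-bit unsigned type */
typedef long long widen;	/* the cast name proposed in the log */

/* Hypothetical helper: format a block number for a trace message. */
static int format_block(char *buf, size_t len, u64 block)
{
	/* %lld expects long long; (widen) makes that true on 32- and 64-bit builds. */
	return snprintf(buf, len, "block %lld", (widen)block);
}

/* Small self-check showing the cast preserves a value wider than 32 bits. */
static void widen_demo(void)
{
	char buf[32];
	format_block(buf, sizeof buf, (u64)1 << 40);
	assert(strcmp(buf, "block 1099511627776") == 0);
}
```

Because every cast site carries the same typedef name, a later switch to a different printing scheme becomes a search and replace, which is the motivation given in the log.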
2008-08-15 13:08 flips: ^ fail 2008-08-15 13:17 shapor, will fix 2008-08-15 13:32 shapor, that should do it 2008-08-15 13:35 it is stupid that you can't go back and fix commit comments in hg 2008-08-15 13:35 that is a problem with the git + hg + monotone breed 2008-08-15 13:36 also no real renames 2008-08-15 13:36 there remains room for another step forward in version control 2008-08-15 13:44 -!- pgquiles(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-15 13:50 -!- pgquiles_(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-15 14:10 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-15 14:25 -!- pgquiles_(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-15 14:44 -!- pgquiles__(~pgquiles@d54C5B8AA.access.telenet.be) has joined #tux3 2008-08-15 18:03 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-15 18:51 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-15 22:44 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-08-16 02:39 -!- Kirantpatil(~kiran@122.167.212.233) has joined #tux3 2008-08-16 02:39 -!- Kirantpatil(~kiran@122.167.212.233) has left #tux3 2008-08-16 11:14 -!- flips(~phillips@phunq.net) has joined #tux3 2008-08-16 11:14 -!- ChanServ changed mode/#tux3 -> -o flips 2008-08-16 14:42 -!- pgquiles(~pgquiles@132.Red-217-125-199.dynamicIP.rima-tde.net) has joined #tux3 2008-08-16 17:06 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-08-16 17:06 doesn't tux3 suck as a web server compared to apache2 ? 
2008-08-16 17:17 bh: this channel isn't about tux3 the web server 2008-08-16 17:18 -!- ChanServ changed mode/#tux3 -> +o shapor 2008-08-16 17:49 bh -> bill huery 2008-08-16 17:49 huey 2008-08-16 17:50 ah ;) 2008-08-16 17:50 nice troll bh 2008-08-16 17:51 ACTION gets on a little more caffein 2008-08-16 17:53 smelled like a troll, thats why i op'ed myself ;) 2008-08-16 17:53 didn't make the connection 2008-08-16 17:58 :-) 2008-08-16 17:58 bh is like that, only made one of him 2008-08-16 17:58 I am thinking... probably crazy thought 2008-08-16 17:58 but it's getting not far away from kernel port time 2008-08-16 17:59 I wonder if some of the porting could be automated with a perl script 2008-08-16 17:59 haha 2008-08-16 17:59 see crazy above 2008-08-16 17:59 perhaps 2008-08-16 17:59 things like changing printf to printk 2008-08-16 17:59 instead of #define printf printk 2008-08-16 18:27 a couple of big differences between the user space code and kernel: 1) kackers don't like c99 inline decls 2) bit fields... are they consistently implemented across arches? 
likely not 2008-08-16 18:28 could possibly isolate those differences behind inline access functions buried in header files 2008-08-16 22:06 woohoo, bitmap flush worked on the first try after plugging in the btree root 2008-08-16 22:08 merging the changes in with my format string fix went smoothly 2008-08-16 22:08 hg++ 2008-08-16 22:37 indeed 2008-08-16 22:38 could you try it with the (unsigned int) casts removed 2008-08-16 22:38 I don't think there should be warnings but I'm prepared to be re-educated 2008-08-16 22:38 oh your merge 2008-08-16 22:38 actually I haven't tried merge in hg yet 2008-08-16 22:39 I'm glad to hear it's pleasant 2008-08-16 22:40 ok, just need to do free block and bitmap allocation is finished right up to the first kernel drop 2008-08-16 22:44 dleaf.c:133: warning: format '%u' expects type 'unsigned int', but argument 2 has type 'long int' 2008-08-16 22:45 just needs to be (int) to make it happy i suppose though 2008-08-16 22:45 since we're doing pointer math we expect to be a small value 2008-08-16 23:22 I think 2008-08-16 23:23 (int) works 2008-08-16 23:23 although casting to int specifically for %u seems silly 2008-08-16 23:23 really 2008-08-16 23:23 it is odd that pointer difference returns long in instead of int 2008-08-16 23:23 (unsigned int) makes more sense to me 2008-08-16 23:23 only on 64 bit 2008-08-16 23:24 we can also do the cast on the 32 bit side 2008-08-16 23:24 and use %lu as the format string? 2008-08-16 23:25 best way to do it is a silly thing to worry about since all the printfs are going away 2008-08-16 23:25 yes 2008-08-16 23:26 nice to keep some of the tracing around and expect it to work 2008-08-16 23:26 well we know it will work 2008-08-16 23:26 but i suspect printk is somewhat different anyway? 2008-08-16 23:28 very similar 2008-08-16 23:28 been honed by unix wookies for years 2008-08-16 23:32 ok, put back pretty much the way you had it 2008-08-16 23:32 yuck huh? 
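The dleaf.c warning quoted above arises because subtracting two pointers yields a `ptrdiff_t`, which is `long` on 64-bit Linux, so `%u` does not match there. A small sketch of two portable options: narrow the (small) difference to `int`, or use the C99 `t` length modifier, which matches `ptrdiff_t` directly. The `struct entry` layout and the function names are illustrative only and are not taken from dleaf.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct entry { unsigned off; unsigned limit; };	/* illustrative, not from dleaf.c */

/* Option 1: the difference is known to be small, so narrow it explicitly. */
static int show_cast(char *buf, size_t len, struct entry *base, struct entry *at)
{
	return snprintf(buf, len, "entry %i", (int)(at - base));
}

/* Option 2: the C99 't' length modifier takes a ptrdiff_t argument as is. */
static int show_ptrdiff(char *buf, size_t len, struct entry *base, struct entry *at)
{
	return snprintf(buf, len, "entry %ti", at - base);
}

/* Both options print the same index for an in-range pointer. */
static void ptrdiff_demo(void)
{
	struct entry table[8];
	char a[32], b[32];
	show_cast(a, sizeof a, table, table + 5);
	show_ptrdiff(b, sizeof b, table, table + 5);
	assert(strcmp(a, "entry 5") == 0 && strcmp(a, b) == 0);
}
```

A signed conversion has the side benefit that a pointer that ends up before the base prints as a visibly negative number.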
2008-08-16 23:33 ACTION done for today 2008-08-16 23:34 hm 2008-08-16 23:34 have you thought about the user interface to snapshot data 2008-08-16 23:35 say i want to copy a directory and all the previous version of it 2008-08-16 23:35 snapshot data or snapshot? 2008-08-16 23:35 to say, another tux3 filesystem 2008-08-16 23:35 all previous versions, no 2008-08-16 23:35 just a specified version 2008-08-16 23:35 copy with history is an interesting challenge 2008-08-16 23:36 could be useful for quite a few things 2008-08-16 23:36 sounds hard 2008-08-16 23:36 i know, thats why i asked 2008-08-16 23:36 but if the use case is compelling... 2008-08-16 23:37 copy history does sounds useful 2008-08-16 23:37 very 2008-08-16 23:37 say you take daily snapshots 2008-08-16 23:37 but you want to do a weekly backup to tape 2008-08-16 23:37 which includes all those incremental daily changes 2008-08-16 23:37 or even hourly 2008-08-16 23:38 easy, back up a bunch of deltas 2008-08-16 23:38 can you get a delta for a particular directory? 2008-08-16 23:38 that is the plan 2008-08-16 23:38 of a given file 2008-08-16 23:38 what is the interface? 2008-08-16 23:38 some ddlink thing 2008-08-16 23:38 with some c program driving the ddlink 2008-08-16 23:39 in other words, go crazy 2008-08-16 23:39 yeah 2008-08-16 23:39 but that wasn't the usecase i originally thought of 2008-08-16 23:39 i was trying to simplify it 2008-08-16 23:39 probably best to think about the primitive ops and think how to link them up to do complex things 2008-08-16 23:40 thinking about an organized delta store 2008-08-16 23:40 that you can not only write to but read from... and do what with? 
2008-08-16 23:40 something like the guy did with zumastor maybe 2008-08-16 23:41 mountable deltas 2008-08-16 23:41 that would rule the universe 2008-08-16 23:41 but hard, probably 2008-08-16 23:41 yeah migrate them to nearline storage 2008-08-16 23:41 i bet you think doing it with ddsnap delta files was hard too though 2008-08-16 23:41 and someone did it ;) 2008-08-16 23:42 speaking of whom, you should loop him in about tux3 2008-08-16 23:44 yes 2008-08-16 23:44 remember the subject line? 2008-08-16 23:45 ddloop perhaps? 2008-08-16 23:45 I thought that was my name 2008-08-16 23:46 "Backups using ddsnap 2008-08-16 23:46 http://www.nomorevoid.com/downloads/dm-ddloop.tar.gz 2008-08-16 23:47 " 2008-08-16 23:47 right on our home page 2008-08-16 23:47 was the subject on the list 2008-08-16 23:47 yeah 2008-08-16 23:47 got it 2008-08-16 23:48 why dont google groups archives appear in google search results? 2008-08-16 23:48 because goog suxorx? 2008-08-16 23:48 http://groups.google.com/group/zumastor/browse_thread/thread/c95970acdc2e31ca/cc9931d18043f31f 2008-08-16 23:49 http://www.google.com/search?q=%22backups+using+ddsnap%22 2008-08-16 23:49 doesn't find it unfortunately 2008-08-16 23:57 googlegroups is really too lame to have a mailing list on 2008-08-16 23:57 it would be fine is gmane was subscribed 2008-08-16 23:57 I'll set up a new zumastor mailing list 2008-08-16 23:59 t A following integer conversion corresponds to a ptrdiff_t argument -- printf man page 2008-08-17 00:00 now lets see if printk has it 2008-08-17 00:00 set it up where? 2008-08-17 00:00 @tux3.org? 2008-08-17 00:01 if you move it you'd have re-subscribe all the members 2008-08-17 00:01 sounds like a pita 2008-08-17 00:03 true 2008-08-17 00:03 printk has %t 2008-08-17 00:03 so maybe we should use it on the use every feature principle 2008-08-17 00:04 whats %t? 2008-08-17 00:05 difference of pointers :p 2008-08-17 00:05 does printf? 
2008-08-17 00:06 also 2008-08-17 00:09 http://lxr.linux.no/linux+v2.6.26.2/lib/vsprintf.c#L779 2008-08-17 00:09 -!- flips(~phillips@phunq.net) has left #tux3 2008-08-17 00:09 -!- flips(~phillips@phunq.net) has joined #tux3 2008-08-17 00:10 http://lxr.linux.no/linux+v2.6.26.2/lib/vsprintf.c#L779 <- the documentation for printk 2008-08-17 00:11 ah i see 2008-08-17 00:11 yes we should be using that 2008-08-17 00:11 perfect 2008-08-17 00:13 finally 2008-08-17 00:13 i've always wanted to read the printf man page to see all the features i never use 2008-08-17 00:13 this is a good reason to do so 2008-08-17 00:14 uhoh 2008-08-17 00:15 ? 2008-08-17 00:16 nice no compile warnings on 64 2008-08-17 00:17 %ti does make more sense than %tu 2008-08-17 00:17 in case of an error it will be easy to see the negative value then 2008-08-17 00:18 right 2008-08-17 03:06 -!- pgquiles(~pgquiles@132.Red-217-125-199.dynamicIP.rima-tde.net) has joined #tux3 2008-08-17 06:38 morning pgquiles 2008-08-17 06:39 hey flips 2008-08-17 06:40 a little early for here 2008-08-17 06:44 what time is it over there? 2008-08-17 06:44 6:45 2008-08-17 06:44 that hurts :-) 2008-08-17 06:45 yup 2008-08-17 06:45 gotta get some more zzz's 2008-08-17 06:45 see all the tux3 checkins 2008-08-17 06:45 shapor got another patch, 3rd or 4th I think 2008-08-17 06:46 did a Jos van den Oever write you? 
2008-08-17 06:46 he's the guy developing Strigi 2008-08-17 06:52 not yet 2008-08-17 06:53 he'll eventually do 2008-08-17 06:54 getting indexing working well for kde would please me 2008-08-17 06:55 me too :-) 2008-08-17 06:55 anyway, strigi is not tied to kde 2008-08-17 06:56 even better 2008-08-17 06:56 but kde will use it early I suppose 2008-08-17 06:56 yes, it is already using it 2008-08-17 06:57 in 4.0 and 4.1 you need to explicitly enable it but in 4.2 will be enabled by default 2008-08-17 06:57 ACTION loves kde 2008-08-17 06:57 flips: there's Camp KDE in January in Jamaica, you know :-) 2008-08-17 06:57 wow, even more south than I already am 2008-08-17 06:58 I will keep it in mind 2008-08-17 06:59 I'm going to sleep one hour, I still need to recover from aKademy 2008-08-17 06:59 see you later 2008-08-17 06:59 bye 2008-08-17 11:56 -!- pgquiles_(~pgquiles@126.Red-80-39-172.dynamicIP.rima-tde.net) has joined #tux3 2008-08-17 13:42 ACTION goes wacks tree_expand into 2 big pieces 2008-08-17 13:43 typos galor 2008-08-17 13:54 whacking completed 2008-08-17 13:54 now to make it make sense 2008-08-17 13:58 7 parameters for insert_child now down to 6... 2008-08-17 15:14 iattr.c coming soon 2008-08-17 15:14 you think perhaps a dir should contain a default file attr entry? 
2008-08-17 15:15 yes for sure 2008-08-17 15:15 in most cases dirs contain very similar files 2008-08-17 15:15 and be inherited instead of storing the same attr in each inode 2008-08-17 15:15 exactly 2008-08-17 15:15 inode attrs come in groups 2008-08-17 15:15 one of the groups is ctime/mode/uid/guid 2008-08-17 15:15 the "create" group 2008-08-17 15:16 that one can usually be inherited except for ctime 2008-08-17 15:16 well 2008-08-17 15:16 so, break out the ctime 2008-08-17 15:16 so there is maybe two flavors of the create group 2008-08-17 15:16 ctime only and ctime plus ownership 2008-08-17 15:17 it might be ok to always separate them 2008-08-17 15:17 a version and a 4 bit attr type field have to go in each attr 2008-08-17 15:18 so: struct ctime { u64 kind:4, version:10, time: 50 }; 2008-08-17 15:20 and struct owner { u64 kind:4, version:10, mode: 50, uid:32, gid:32 }; 2008-08-17 15:20 8 bytes for the first, 16 bytes for the second 2008-08-17 15:21 16 bytes saved every time it is possible to inherit the owner from the directory 2008-08-17 15:21 which is something like 99.9% of the time on my system I think 2008-08-17 15:25 ACTION just wrote a sick one-liner to determine the average number of different file perms a directory contains 2008-08-17 15:26 here is the distribution from my web server: 2008-08-17 15:26 651 0 2008-08-17 15:26 4318 1 2008-08-17 15:26 92 2 2008-08-17 15:26 8 3 2008-08-17 15:26 4 4 2008-08-17 15:27 so, 651 directories with no files in them, 4318 with only one set of perms on the files contained in it 2008-08-17 15:27 92 with 2, etc 2008-08-17 15:27 aka, output of: 2008-08-17 15:27 find / -xdev -type d -exec sh -c 'ls -l $1 | awk "/^\-/ {print \$1}" | sort -u |wc -l' {} {} \; | sort | uniq -c 2008-08-17 15:38 across the entire filesystem there are only 30 unique permissions settings 2008-08-17 15:38 only about 500k files 2008-08-17 15:38 nice observation 2008-08-17 15:38 s/only/on/ 2008-08-17 15:38 so there should be owner atoms? 
2008-08-17 15:38 or sorry 2008-08-17 15:38 permission atoms 2008-08-17 15:40 above, 4318 out of 6000 files save 16 bytes 2008-08-17 15:41 running on my system 2008-08-17 15:41 those are directories 2008-08-17 15:41 not files 2008-08-17 15:41 right 2008-08-17 15:41 directories have attrs just like files, I'm getting a sense of overall saving 2008-08-17 15:42 ah 2008-08-17 15:42 considering just files, what is the single permissions percentage? 2008-08-17 15:43 you mean, how many files live in directories with only one set of permissions? 2008-08-17 15:43 this is also a fun one to run: 2008-08-17 15:43 find / -xdev -printf "%m\n" 2>/dev/null| sort | uniq -c| sort -n 2008-08-17 15:44 out of 53972 files, only 24 unique permissions sets 2008-08-17 15:44 41605 files are mode 644 2008-08-17 15:46 talking] 2008-08-17 15:49 39537 files live in directories with only one set of permissions 2008-08-17 15:52 its total brain damage to store 0644 all over the filesystem 2008-08-17 15:58 ACTION keeps getting more excited about tux3 2008-08-17 16:01 flips: where did "16 bytes" come from? 2008-08-17 16:02 each set of permissions is only 12 bits 2008-08-17 16:02 struct owner { u64 kind:4, version:10, mode: 50, uid:32, gid:32 }; 2008-08-17 16:02 they come in groups 2008-08-17 16:02 oh 2008-08-17 16:02 i see 2008-08-17 16:02 they have to be discriminated for traversal and versioned 2008-08-17 16:02 that is why putting them in groups is a win 2008-08-17 16:03 mode:50 ? 2008-08-17 16:03 some extra ;-) 2008-08-17 16:03 at this point I am trying to preserve 8 byte granularity 2008-08-17 16:03 I am not sure if that matters 2008-08-17 16:03 i see 2008-08-17 16:03 I can go to byte aligned and save more space 2008-08-17 16:03 probably the right thing to do actually 2008-08-17 16:04 but it is a detail of iattr.c 2008-08-17 16:04 thats a lot of extra mode bits 2008-08-17 16:04 why? 2008-08-17 16:04 struct owner { u64 kind:4, version:10, mode: 16, pad: 36, uid:32, gid:32 }; 2008-08-17 16:04 ok? 
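A quick bit-budget check of the attribute layouts proposed above, as a sketch. It only adds up the field widths quoted in the log and does not declare real bit-fields, since the in-memory layout of C bit-fields is implementation-defined. The helper names are invented for illustration. By this count the `ctime` layout fills one 8-byte word and the `mode:50` owner layout fills two, while the `mode:16, pad:36` variant comes to 130 bits, two more than 16 bytes hold. A pad of 34 would bring it back to 128, assuming 8-byte granularity is still the goal.

```c
#include <assert.h>
#include <stddef.h>

/* Add up a list of field widths, in bits. */
static unsigned total_bits(const unsigned *widths, size_t count)
{
	unsigned bits = 0;
	for (size_t i = 0; i < count; i++)
		bits += widths[i];
	return bits;
}

/* Round a bit count up to whole 64-bit words. */
static unsigned words_needed(unsigned bits)
{
	return (bits + 63) / 64;
}

/* Field widths copied from the structs proposed in the log. */
static void layout_demo(void)
{
	const unsigned ctime_attr[] = { 4, 10, 50 };			/* kind, version, time */
	const unsigned owner_50[] = { 4, 10, 50, 32, 32 };		/* mode:50 variant */
	const unsigned owner_pad36[] = { 4, 10, 16, 36, 32, 32 };	/* mode:16, pad:36 variant */

	assert(total_bits(ctime_attr, 3) == 64);	/* exactly one 8-byte word */
	assert(total_bits(owner_50, 5) == 128);		/* exactly 16 bytes */
	assert(total_bits(owner_pad36, 6) == 130);	/* two bits over 16 bytes */
	assert(words_needed(130) == 3);
}
```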
2008-08-17 16:05 oh i guess we should support "chattr" 2008-08-17 16:05 struct owner { u64 kind:4, version:10, mode:16, pad:4, uid:32, gid:32 }; <- 12 bytes] 2008-08-17 16:05 um 2008-08-17 16:05 I'm on drugs 2008-08-17 16:06 16 bytes 2008-08-17 16:06 so you were right, :50 was just silly 2008-08-17 16:06 chattr must be supported of course 2008-08-17 16:06 how many bits does chattr need 2008-08-17 16:06 chattr the command? 2008-08-17 16:07 yeah 2008-08-17 16:07 ioctl(3, EXT2_IOC_GETFLAGS, 0x7fff16bffeec) = 0 2008-08-17 16:08 struct owner { u64 kind:4, version:10, mode:20, uid:32, gid:32 }; <- 12 bytes] 2008-08-17 16:08 20 mode bits should do it 2008-08-17 16:08 we need to support that interface 2008-08-17 16:09 how does that work with the vfs layer 2008-08-17 16:09 http://lxr.linux.no/linux+v2.6.26.2/include/linux/ext2_fs.h#L200 2008-08-17 16:09 http://lxr.linux.no/linux+v2.6.26.2/+ident=11795060 2008-08-17 16:11 ext2 has 22 of its own per file flags, mostly bullshit 2008-08-17 16:11 unused 2008-08-17 16:11 things like compression 2008-08-17 16:12 tail packing 2008-08-17 16:12 #define EXT2_INDEX_FL FS_INDEX_FL /* hash-indexed directory */ <- one of the few that actually got implemented (by me) 2008-08-17 16:12 heh 2008-08-17 16:13 anyway, we will divide them into commonly used flags that go in the 20 mode bits and rare ones that go in some other attribute type 2008-08-17 16:14 16 basic attribute types should be enough 2008-08-17 16:14 one of those types is "extended attribute" 2008-08-17 16:14 ok 2008-08-17 16:14 well there are only 8 spare mode bits 2008-08-17 16:15 not counting chattr 2008-08-17 16:15 i was just trying to figure out how many chattr needed 2008-08-17 16:15 ACTION does man chattr 2008-08-17 16:15 not very many I think 2008-08-17 16:16 ----------------- /tmp 2008-08-17 16:16 thats a lot of dashes (output from lsattr) 2008-08-17 16:17 98% of your files could inherit struct owner from parent directory 2008-08-17 16:17 just crunched it 2008-08-17 16:17 
my system is: 2008-08-17 16:17 4113 0 2008-08-17 16:17 16326 1 2008-08-17 16:17 419 2 2008-08-17 16:17 20 3 2008-08-17 16:17 8 4 2008-08-17 16:17 3 5 2008-08-17 16:17 1 7 2008-08-17 16:18 yeah 2008-08-17 16:18 my system is 97.3 2008-08-17 16:19 % 2008-08-17 16:19 probably rarely under 90% 2008-08-17 16:19 you are 97.7 actually 2008-08-17 16:19 even on multiuser systems 2008-08-17 16:19 you are 97.6 actually 2008-08-17 16:19 true 2008-08-17 16:19 its rare to work in the same directory as another user 2008-08-17 16:20 that is 261K of kernel cache saved for full stat 2008-08-17 16:20 on my system 2008-08-17 16:20 which is not particularly big fs 2008-08-17 16:21 man ls -lR is fucking slow 2008-08-17 16:21 ddtree is fscking fast ;-) 2008-08-17 16:22 it stats /etc/localtime for EVERY file ?? 2008-08-17 16:22 bleah 2008-08-17 16:22 work of the fsf unfortunately 2008-08-17 16:27 http://lxr.linux.no/linux+v2.6.26.2/fs/ext2/inode.c#L1147 <- ext2_set_inode_flags 2008-08-17 16:28 total of 10 flags 2008-08-17 16:28 per inode 2008-08-17 16:28 almost all seem sensible 2008-08-17 16:31 struct owner { u64 kind:4, version:10, mode:18, uid:32, gid:32 }; <- 12 bytes 2008-08-17 16:31 too tight maybe 2008-08-17 16:32 unless mode is an atom 2008-08-17 16:32 anyway bask to inums 2008-08-17 20:38 -!- tux3bot(~tux3bot@yzf.shapor.com) has joined #tux3 2008-08-17 20:41 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2008-08-17 21:38 tuxkey_t *next_key(struct treepath *path, int levels) 2008-08-17 21:38 { 2008-08-17 21:38 for (int level = levels; --level < 0; ) 2008-08-17 21:38 if (!finished_level(path, level)) 2008-08-17 21:38 return &path[level].next->key; 2008-08-17 21:38 return NULL; 2008-08-17 21:38 } 2008-08-17 21:39 shapor, how does it look/ 2008-08-17 21:40 ok? 
2008-08-17 21:42 flips: that doesn't seem very complex 2008-08-17 21:43 just looks up the path until it finds a level where we haven't read all the way to the end of the index block yet 2008-08-17 21:44 there we find a key that separates the subtree we are in (a leaf) from the next subtree to the right 2008-08-17 21:45 yeah i was able to read the c, you dont need to put that in a comment ;) 2008-08-17 21:45 just wondering why you were pasting it 2008-08-17 21:45 aw, I was just doing that ;-) 2008-08-17 21:45 because I haven't tested it 2008-08-17 21:45 heh 2008-08-17 21:45 it's part of the inode table block insertion stuff 2008-08-17 21:46 oh 2008-08-17 21:46 i dont know how that all fits together 2008-08-17 21:46 knowing the successor key tells us whether we should advance to the next block to search it or insert a new one, becauase the successor block is too far away in key space 2008-08-17 21:46 i think you described it in a post recently 2008-08-17 21:46 or no 2008-08-17 21:48 not in that detail 2008-08-17 21:48 and not everybody will be able to see the purpose from the code, so I put in the comment 2008-08-17 21:53 this function the way it is written can actually replace a nasty little bit of code in the delete, because it returns a pointer 2008-08-17 21:53 instead of the key 2008-08-17 21:53 in the delete, we have to find that successor key and change it 2008-08-17 21:57 i was thinking about the size and mtime 2008-08-17 21:57 do we have to update those? 2008-08-17 21:57 yes 2008-08-17 21:57 every time the file changes 2008-08-17 21:57 with a little bit of slop 2008-08-17 21:57 definitely on every fsync 2008-08-17 21:58 this is a job for the log of course 2008-08-17 21:58 tuxkey_t *p = next_key(path, levels), next = p ? *p : MAX_INODES; <- obscure enough? 
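A self-contained sketch of `next_key` as described above: walk up from the deepest index level until a level is found whose index block has not been read to the end, and return a pointer to the separating key there. As pasted, the loop tests `--level < 0`, which is false on the first pass for any positive `levels`, so the body would never run. The sketch assumes `--level >= 0` was meant (the author notes the function was untested). The `tuxkey_t` width, `struct index_entry`, the `struct treepath` fields and `finished_level` are stand-ins invented here so the example compiles. They are not the tux3 definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t tuxkey_t;			/* assumed width */
struct index_entry { tuxkey_t key; };		/* hypothetical index block entry */
struct treepath {				/* hypothetical per-level cursor */
	struct index_entry *next;		/* first entry not yet visited */
	struct index_entry *end;		/* one past the last entry in the block */
};

/* Hypothetical: a level is finished once its cursor reaches the end. */
static int finished_level(struct treepath *path, int level)
{
	return path[level].next == path[level].end;
}

/* Walk from the deepest index level toward the root. */
static tuxkey_t *next_key(struct treepath *path, int levels)
{
	for (int level = levels; --level >= 0; )
		if (!finished_level(path, level))
			return &path[level].next->key;
	return NULL;	/* every level exhausted: no successor key */
}

/* Two-level path: the lower level is exhausted, the root still has key 200 ahead. */
static void next_key_demo(void)
{
	struct index_entry root[] = { { 0 }, { 100 }, { 200 } };
	struct index_entry lower[] = { { 100 }, { 150 } };
	struct treepath path[2] = {
		{ root + 2, root + 3 },
		{ lower + 2, lower + 2 },
	};
	tuxkey_t *p = next_key(path, 2);
	assert(p && *p == 200);
}
```

Returning a pointer rather than the key value is what lets the caller fall back to a sentinel on `NULL`, as in the one-liner quoted above, and lets the delete path modify the successor key in place.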
2008-08-17 21:59 no i like it 2008-08-17 21:59 makes perfect sense 2008-08-17 22:00 traditionalists have trouble with C code like that 2008-08-17 22:00 then they need to be reminded that C99 was last millenium already 2008-08-17 22:00 doesn't work though 2008-08-17 22:01 a lot of folks are still scared of structure assignment 2008-08-17 22:01 let along inline decls 2008-08-17 22:01 let alone 2008-08-17 22:01 what doesnt work 2008-08-17 22:01 it doesn't work to remind them 2008-08-17 22:01 oh 2008-08-17 22:01 traditionalists are like that 2008-08-17 22:02 it is convenient that the maximum inodes and maximum blocks of a filesystem are the same 2008-08-17 22:02 that means there is a density of one inode per block 2008-08-17 22:02 howso 2008-08-17 22:03 which is convenient for deciding which inum to use 2008-08-17 22:03 over time with deletes that nice simple relationship deteriorates 2008-08-17 22:03 then we need other mojo 2008-08-17 23:04 flips: why are some of the data structures in the c files and some in tux3.h? 2008-08-17 23:13 because me being lame 2008-08-17 23:13 the c files are included 2008-08-17 23:13 like header files 2008-08-17 23:13 this is only for while it's in heavy hack mode 2008-08-17 23:14 when everything has to be declared twice it gets in the way of quick refactoring 2008-08-17 23:21 so struct inode is just from ext2 i guess? 2008-08-17 23:21 vfs 2008-08-17 23:22 with version added ? 
2008-08-17 23:22 generic inode has version, yes, it is a different, bogus version 2008-08-17 23:22 stupid misnamed field 2008-08-17 23:22 i see 2008-08-17 23:23 it is really a change counter 2008-08-17 23:23 ah 2008-08-17 23:23 so that ext2/3 knows when it has to go look in the directory in case the entry it was positioned on was deleted 2008-08-17 23:23 it is that specific 2008-08-17 23:23 calling that a version was a heinous crime 2008-08-17 23:28 i'll just ignore it ;) 2008-08-17 23:29 it gets used in dir.c for revalidate, nowhere else 2008-08-17 23:36 causes running inode.c to fail 2008-08-17 23:36 what does? 2008-08-17 23:39 er i thought i pasted the assert 2008-08-17 23:40 whoops 2008-08-17 23:40 [10869] probe: Failed assertion "(ops->leaf_sniff)(sb, buffer->data)" 2008-08-17 23:41 that's fresh out of repository? 2008-08-17 23:42 maybe you have to give a filename on the command line? 2008-08-17 23:42 ah 2008-08-17 23:42 not a good error message to be sure 2008-08-17 23:42 yeah, my bad 2008-08-17 23:43 woo, it really tries hard even when it has no backing file 2008-08-17 23:44 I wonder if this is good 2008-08-17 23:45 generated 32 bitmap blocks, that is not good behaviou 2008-08-17 23:45 behavior 2008-08-17 23:46 [18006] main: Bitmap flush failed (Bad file descriptor) <- this is nice 2008-08-17 23:47 fd '(null)' = -1 (0xffffffff bytes) <- this maybe not so nice 2008-08-17 23:49 sb->image.blocks has a wierd number when fd is -1 2008-08-17 23:51 it's strange that printf doesn't support signed hexadecimal 2008-08-17 23:54 ok, fdsize64 now returns zero when it fails, better than overloading the size with -1, or maybe not 2008-08-17 23:54 anyway, inode.c then proceeds to fail on the leaf 2008-08-17 23:54 I wonder if those failure paths are decent 2008-08-17 23:55 probe needs to fail and not think it got a leaf 2008-08-17 23:55 hrm signed hex 2008-08-17 23:56 makes sense to me 2008-08-17 23:56 i suppose 2008-08-17 23:57 why doesn't fdsize64 just take a u64* 
2008-08-17 23:57 it should 2008-08-17 23:57 and return err 2008-08-17 23:57 I thought I made it do that 2008-08-17 23:57 but the patch probably got danked 2008-08-17 23:57 like the ioctl does 2008-08-18 00:01 i thought i remember seeing that too 2008-08-18 00:13 [18233] main: fdsize64 failed for '(null)' (Bad file descriptor) 2008-08-18 00:14 better 2008-08-18 00:14 I'm still interested in the behavior when it tries to keep running anyway 2008-08-18 00:14 I'll just change the error() to warn() 2008-08-18 00:16 int fdsize64(int fd, uint64_t *size) 2008-08-18 00:16 { 2008-08-18 00:16 struct stat stat; 2008-08-18 00:16 if (fstat(fd, &stat)) 2008-08-18 00:16 return -errno; 2008-08-18 00:16 if (S_ISREG(stat.st_mode)) { 2008-08-18 00:16 *size = stat.st_size; 2008-08-18 00:16 return 0; 2008-08-18 00:16 } 2008-08-18 00:16 return ioctl(fd, BLKGETSIZE64, size) ? -errno : 0; 2008-08-18 00:16 } 2008-08-18 00:16 maybe it should just return -1 like other libc stuff 2008-08-18 00:16 yes it should 2008-08-18 00:17 int fdsize64(int fd, uint64_t *size) 2008-08-18 00:17 { 2008-08-18 00:17 struct stat stat; 2008-08-18 00:17 if (fstat(fd, &stat)) 2008-08-18 00:17 return -1; 2008-08-18 00:18 if (S_ISREG(stat.st_mode)) { 2008-08-18 00:18 *size = stat.st_size; 2008-08-18 00:18 return 0; 2008-08-18 00:18 } 2008-08-18 00:18 return ioctl(fd, BLKGETSIZE64, size); 2008-08-18 00:18 } 2008-08-18 00:29 lgtm 2008-08-18 00:30 flips: did you see my patch on the list 2008-08-18 00:30 not yet 2008-08-18 00:30 makefile update + more 64 bit wanrings 2008-08-18 00:31 ah 2008-08-18 00:33 needs a little merge lovin 2008-08-18 00:34 it wouldn't if you committed more often ;) 2008-08-18 00:34 its been 8 hrs.. 
cmon, yer slackin 2008-08-18 00:35 you erred 2008-08-18 00:35 it's been 2 minutes 2008-08-18 00:35 ah my cron job hasn't run 2008-08-18 00:35 damn it 2008-08-18 00:35 thats it, every minute 2008-08-18 00:37 um 2008-08-18 00:38 please, I am monitoring the http accesses 2008-08-18 00:38 you will dominate 2008-08-18 00:38 you now have 8, 1 minute apart 2008-08-18 00:39 there has to be an event driven way to do this 2008-08-18 00:39 I thought that was what rss was about 2008-08-18 00:41 if your server can't take 1 pull per minute from me, that is sad 2008-08-18 00:41 rss is bs 2008-08-18 00:41 its just an xml formated page 2008-08-18 00:41 it can 2008-08-18 00:41 ajax is supposedly the browser "push" technology 2008-08-18 00:41 it's my eyes when I scan the log 2008-08-18 00:42 I see all these pulls 2008-08-18 00:42 but its really just javascript'ed pulls 2008-08-18 00:42 of more xml 2008-08-18 00:42 lame 2008-08-18 00:42 stupid heavy format 2008-08-18 00:42 yep, polling is the way of the web 2008-08-18 00:43 hairy footed hippies on the steering committee methinks 2008-08-18 00:45 there, shapor.com seems to have mellowed a little 2008-08-18 00:45 how could I notify your pull if I wanted to, email? 2008-08-18 00:45 do some magic http access to your server? 2008-08-18 00:46 i could just provide a repo you could push to 2008-08-18 00:46 why bother with the overhead of requesting a pull ? 2008-08-18 00:46 then you poll it ;-) 2008-08-18 00:46 sure 2008-08-18 00:46 we should do that 2008-08-18 00:46 well yeah, poll on the local box 2008-08-18 00:47 I want auto-push 2008-08-18 00:47 does hg do it? 2008-08-18 00:47 doubtful 2008-08-18 00:47 hrm perhaps 2008-08-18 00:47 well you could poll locally too :P 2008-08-18 00:47 what I was thinking and making funny faces 2008-08-18 00:48 and I am supposed to fix inotify ;-) 2008-08-18 00:48 heh 2008-08-18 00:48 oh right.. for the kde guys? 
2008-08-18 00:48 it works well enough 2008-08-18 00:48 right 2008-08-18 00:48 for this purpose 2008-08-18 00:48 they say it doesn't 2008-08-18 00:48 perhaps for this purpose 2008-08-18 00:49 hrm an hg watcher that pushes upstream on local commits would be useful 2008-08-18 00:49 could support git also 2008-08-18 00:49 let's get our whine in 2008-08-18 00:49 why whine? 2008-08-18 00:50 you'll just code it I know 2008-08-18 00:50 no whining 2008-08-18 00:50 it would be good 2008-08-18 00:50 some way you just say source/target like zumastor 2008-08-18 00:50 and have replicated nets of source code 2008-08-18 00:50 yeah certainly needs thought/first stab at it before whining 2008-08-18 00:51 bitbucket/github guys would be interested probably 2008-08-18 00:51 well, starting to get late 2008-08-18 00:51 i wonder if they have any programs to do it already 2008-08-18 00:51 not even 1 yet 2008-08-18 00:51 probably matt is no dummy 2008-08-18 00:52 I was just in the middle of setting up some btree unit testing 2008-08-18 00:52 got tired of testing btree in the live app 2008-08-18 00:52 things like advance should be tested in isolation 2008-08-18 00:52 and lots of other things 2008-08-18 00:52 if btrees are ever expected to be solid 2008-08-18 00:58 ok i cleaned up the 64 bit warning on the most recent code ;) 2008-08-18 00:58 patch? 2008-08-18 00:58 mailed 2008-08-18 00:58 this is already getting old 2008-08-18 00:58 right, I should pull from you 2008-08-18 00:59 i need to learn how to make a tree public with hg 2008-08-18 00:59 tomorrow 2008-08-18 01:00 I'll just look in my httpd.conf when you're ready 2008-08-18 01:01 basically nothing to do 2008-08-18 01:01 just give access to it 2008-08-18 01:02 if I can see the repo directory I can pull 2008-08-18 01:02 and unlike git this is efficient 2008-08-18 01:03 ah cool 2008-08-18 01:04 that is easy for me to do, just create a symlink 2008-08-18 01:04 yes 2008-08-18 01:04 the .hg dir? 2008-08-18 01:04 or the root of my checkout? 
2008-08-18 01:04 the root I think 2008-08-18 01:04 let me check 2008-08-18 01:05 the root of your repo, not the .hg 2008-08-18 01:05 http://shapor.com/tux3/hg/ 2008-08-18 01:06 we shoulda used your last patch to test 2008-08-18 01:06 i canlt clone from it though 2008-08-18 01:06 are you sure its that simple? 2008-08-18 01:06 can't* 2008-08-18 01:06 yes 2008-08-18 01:07 you should be able to clone 2008-08-18 01:07 abort: 'http://shapor.com/tux3/hg/' does not appear to be an hg repository! 2008-08-18 01:07 does it have a .hg? 2008-08-18 01:07 yes 2008-08-18 01:07 http://shapor.com/tux3/hg/.hg/ 2008-08-18 01:07 just a sec 2008-08-18 01:09 i dont think its as simple as you say it is 2008-08-18 01:10 ah, you can clone like that 2008-08-18 01:10 ah 2008-08-18 01:10 but not pull 2008-08-18 01:10 yeah 2008-08-18 01:11 er are you sure? 2008-08-18 01:11 i think you can 2008-08-18 01:11 ah 2008-08-18 01:11 static-http: 2008-08-18 01:11 you can pull after you clone ;-) 2008-08-18 01:11 no you can't 2008-08-18 01:11 http://www.selenic.com/mercurial/wiki/index.cgi/StaticHTTP 2008-08-18 01:11 clone is happy to clone, but pull says no repo, bad message 2008-08-18 01:12 are you using static-http ? 2008-08-18 01:12 static wha? 2008-08-18 01:12 hg pull 2008-08-18 01:12 pulling from static-http://shapor.com/tux3/hg 2008-08-18 01:13 you can't just use the http: prefix 2008-08-18 01:13 that expect a mercurial cgi script on the other end 2008-08-18 01:13 static-http: prefix expects just the regular old files 2008-08-18 01:13 on the other end 2008-08-18 01:13 see the link i sent above 2008-08-18 01:13 ok, here goes 2008-08-18 01:14 hg pull static-http://shapor.com/tux3/hg 2008-08-18 01:14 abort: no repo found! 2008-08-18 01:16 you have to be in an hg tree 2008-08-18 01:16 hg clone static-http://shapor.com/tux3/hg <- works 2008-08-18 01:16 run that pull command inside your tux3 dir 2008-08-18 01:16 yes that works 2008-08-18 01:16 now... 
it shows as directory hg 2008-08-18 01:17 that is not too good style 2008-08-18 01:17 doh 2008-08-18 01:17 would be better named shapor 2008-08-18 01:17 well tux3 2008-08-18 01:17 better 2008-08-18 01:17 tux3-shapor? 2008-08-18 01:17 sure 2008-08-18 01:18 you can call it whatever you like though 2008-08-18 01:18 http pull this way is efficient enough 2008-08-18 01:18 i think 2008-08-18 01:18 way better than email 2008-08-18 01:18 hell yeah 2008-08-18 01:18 calling it hg is lame ;-) 2008-08-18 01:19 makes sense tux3/hg 2008-08-18 01:19 its the mercurial repo for tux3 ;) 2008-08-18 01:19 renamed to shapor-tux3 2008-08-18 01:21 ok i commited a change 2008-08-18 01:21 pull from me 2008-08-18 01:22 dinner time here 2008-08-18 01:22 ok 2008-08-18 01:22 hg pull static-http://shapor.com/tux3-shapor 2008-08-18 01:22 abort: HTTP Error 403: Forbidden 2008-08-18 01:23 hg pull static-http://shapor.com/tux3/shapor-tux3 2008-08-18 01:25 worked 2008-08-18 01:25 now I need to see the diff 2008-08-18 01:26 oh cool, hg has support for all kinds of hooks 2008-08-18 01:26 look in man hgrc 2008-08-18 01:26 under "hooks" 2008-08-18 01:27 yes, what an excellent use of time 2008-08-18 01:27 ok, I need to pay attention the the family now 2008-08-18 01:27 catch ya tomorrow 2008-08-18 02:50 -!- pgquiles(~pgquiles@172.Red-83-38-37.dynamicIP.rima-tde.net) has joined #tux3 2008-08-18 10:11 -!- pgquiles(~pgquiles@251.Red-81-37-107.dynamicIP.rima-tde.net) has joined #tux3 2008-08-18 12:28 So... is tux3 going to really be a "ZFS killer?" 
2008-08-18 14:23 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-18 14:28 boom, zfs is self killing imho 2008-08-18 14:28 "zfs is a rampant layer violation" -- akpm 2008-08-18 14:28 tux3 + lvm3 will cover the checkbox items of zfs 2008-08-18 14:29 not that I am wildly excited about the idea of checksumming metadata, but tux3 will eventually have that too, as an option 2008-08-18 14:30 tux3 will have the immense advantage of running on Linux 2008-08-18 14:30 as long as Sun keeps being idiotic about the zfs license, zfs will not 2008-08-18 15:03 flips: The licensing was something that had me sort of confused. Is Linux's policy to never include non-GPL'd code in the kernel itself? 2008-08-18 15:13 more specific than that: the code has to be GPL v2 2008-08-18 15:13 v3 will not do 2008-08-18 15:13 Really..? 2008-08-18 15:13 Whose decision was that, Linus'? 2008-08-18 15:13 really 2008-08-18 15:14 Linus says he can't change his mind because every copyright holder would have to agree 2008-08-18 15:14 there are hundreds, some of them even died 2008-08-18 15:14 Okay, so even still, if the code is available (as it is for freebsd) what's preventing someone from just building and loading it as a third party module? 2008-08-18 15:15 you can, but binary modules without proper license are in a legal gray zone 2008-08-18 15:15 the code will certainly not go into mainline until Sun adds a GPL v2 license 2008-08-18 15:15 Isn't that "gray zone" what's currently being filled by things like nvidia's binary driver? 2008-08-18 15:16 nvidia's driver has caused all kinds of problems 2008-08-18 15:16 being occupied by is a better term than being filled by 2008-08-18 15:16 filled sounds like satisfies 2008-08-18 15:16 That's true 2008-08-18 15:17 And would tux3 be GPL2, then? 2008-08-18 15:17 user space code is gpl v3, kernel code is gpl v2 2008-08-18 15:17 Okay, thanks for the info. 
2008-08-18 15:18 I've gotta say, though, I've developed a growing respect for BSD-licensed code 2008-08-18 15:18 say, that reminds me, it is about time to collect my beer from Eben 2008-08-18 15:18 bsd is awesome 2008-08-18 15:18 I'm sure that would open you guys up to all kinds of abuse from the rest of capitalism though 2008-08-18 15:18 what would? 2008-08-18 15:18 oh 2008-08-18 15:18 bsd 2008-08-18 15:18 yes, I am not in this to give code to msoft 2008-08-18 15:19 I can support that. 2008-08-18 15:19 Hah. 2008-08-18 15:20 I wasted six hours today on their asses. 2008-08-18 15:21 Do you guys have a projected timeline? 2008-08-18 15:23 a draft roadmap is on the mailing list, revised one will go up later today 2008-08-18 15:25 Ah, I suppose I should subscribe to that bad boy. 2008-08-18 15:25 ;-) 2008-08-18 15:46 welcome to tux3 2008-08-18 15:51 Many thanks. 2008-08-18 15:51 np 2008-08-18 16:00 g99 -g -Wall buffer.c diskio.c btree.c && ./a.out foodev 2008-08-18 16:00 root at 0 2008-08-18 16:00 leaf at 1 2008-08-18 16:00 btree leaf with 0 entries 2008-08-18 16:00 leaf free = 3c 2008-08-18 16:00 btree unit tests starting to happen 2008-08-18 16:00 little leaves to make bushy trees 2008-08-18 17:25 folks ;0 2008-08-18 17:25 :) 2008-08-18 17:25 ACTION finds that the backlog has been truncated 2008-08-18 17:44 -!- konrad(~konrad@c-24-16-74-109.hsd1.wa.comcast.net) has joined #tux3 2008-08-18 22:48 hey shapor 2008-08-18 22:48 the btree unit test produces buggy results 2008-08-18 22:48 interested? 2008-08-18 22:48 seems I broke it on the port from ddsnap 2008-08-18 22:49 bh, you're coming up for air 2008-08-18 22:49 ? 2008-08-18 23:02 eh ? 2008-08-18 23:02 what's air ? 2008-08-18 23:03 just working late at night as usual 2008-08-18 23:54 flips: ah i see 2008-08-18 23:55 the insert after the split inserts to the wrong one 2008-08-18 23:55 right 2008-08-18 23:55 know why yet? 
2008-08-18 23:55 didn't look at the code 2008-08-18 23:55 I was just going in there 2008-08-18 23:56 i'm guessing similar to the fleaf, er whatever its called now, dleaf? 2008-08-18 23:56 it was broken in the same way initially i think 2008-08-18 23:56 of course it works in ddsnap, I broke it :-/ 2008-08-18 23:56 unit tests noticed 2008-08-19 00:04 test for swap was the wrong direction 2008-08-19 00:06 Results 1 - 10 of about 217,000 for tux3 2008-08-19 00:07 time to show some results then 2008-08-19 00:10 ah, there is another bug 2008-08-19 00:10 I also decided to have the leafbuf be the last element in the tree path instead of handled separately 2008-08-19 00:10 that means it is released by brelse_path 2008-08-19 00:11 but tree_expand needs to store it in the path and doesn't 2008-08-19 00:17 hey flips 2008-08-19 00:17 it was nice meeting up with you Friday 2008-08-19 00:17 that was fun 2008-08-19 00:17 when are you up in la, ever? 2008-08-19 00:18 when I'm speeding through in my VW to head to Mountain View 2008-08-19 00:18 I'm going to do it again later on this week for Burning Man 2008-08-19 00:18 well feel free to slow down for a pit stop 2008-08-19 00:19 yeah, now that I know it's that easy to hook up with you folks, yeah 2008-08-19 00:19 maybe do that for a fuel stop or something like that. 2008-08-19 00:19 btw, LA's club scene is a bit weird, haven't figured it out yet 2008-08-19 00:19 ACTION chill's with Goth/Industrial folks 2008-08-19 00:19 haven't been to a club in la 2008-08-19 00:19 daddy thing does that 2008-08-19 00:20 in the SF area that's actually a professional nerd scene 2008-08-19 00:20 I did goth while in berlin 2008-08-19 00:20 my wife really got into it 2008-08-19 00:20 the previous ATM maintainer is a significant DJ in that scene 2008-08-19 00:20 some MIT Media Lab folks, etc... 
2008-08-19 00:20 got some nice pics of us gothing 2008-08-19 00:20 oh really, funny 2008-08-19 00:20 you and wli should get together then :) 2008-08-19 00:20 ah wli 2008-08-19 00:20 he's more of that S&M type, I just like the music 2008-08-19 00:20 wow didn't know 2008-08-19 00:21 harald welte is a serious goth 2008-08-19 00:21 oh yeah, big time dude 2008-08-19 00:21 ran into him in a goth club in berlin, we both said what are you doing here 2008-08-19 00:21 disproportionate engineering and science folks are goth 2008-08-19 00:21 haha 2008-08-19 00:21 that's funnny 2008-08-19 00:21 my ex is a material scientist and love stuff like Joy Division and stuff 2008-08-19 00:21 small world 2008-08-19 00:22 yeah, folks in the SF scene know who I am for the most part, but not what I do per se 2008-08-19 00:22 they knew me when I was starting to complain about how irritating NetApp was and stuff ;) 2008-08-19 00:23 bitched me out when NetApp filed a lawsuit against Sun :) 2008-08-19 00:23 funny 2008-08-19 00:25 netapp should just stick to making money 2008-08-19 00:25 and not making people mad at them 2008-08-19 00:26 well, I fault Sun in this battle 2008-08-19 00:26 q: does ERR_PTR work out ok in userspace? 2008-08-19 00:26 they should have layed off 2008-08-19 00:26 I think I'd like to overload some pointer returns with (negative) error numbers 2008-08-19 00:27 sun probably thinks netapp's claim is weak 2008-08-19 00:27 I'd bet with sun on that 2008-08-19 00:27 well, that's for the courts to decide, but it was because Sun's lawyers stopped talking them is why they eventually filed the lawsuit 2008-08-19 00:28 that's publically known 2008-08-19 00:28 ok, I didn't know 2008-08-19 00:28 hard to know what happened with all the he said she said 2008-08-19 00:28 so really, in this industry with how patents are set up, they really had to cross sue Sun. 
They filed the lawsuit in a way intentionally so that Sun would also have to cross sue them 2008-08-19 00:29 It's in Dave Hitz's blog 2008-08-19 00:29 I can introduce you to those folks the next time you're up if you want 2008-08-19 00:29 hey how about applying your considerable intellect to the question of whether ERR_PTR is ok to use in userspace 2008-08-19 00:29 I know most of those folks well 2008-08-19 00:29 and I'm sure they'd like to talk to you out of curiosity and stuff 2008-08-19 00:29 maybe one day 2008-08-19 00:29 flips: I know nothing about userspace/kernel space boundary stuff, sorry 2008-08-19 00:30 has nothing to do with kernel 2008-08-19 00:30 well, the next time you head to Mountain View I can set something up for you folks 2008-08-19 00:30 has everything to do with memory mapps 2008-08-19 00:30 in userspace 2008-08-19 00:30 yeah, I'm retarded about this stuff, looking at a latency_trace now to see why the reschedule is taking so long 2008-08-19 00:30 btw, stay away from things like bit spins 2008-08-19 00:31 well, we will get to locking questions pretty soon 2008-08-19 00:31 bit spin... ok 2008-08-19 00:31 talk to rostedt if you have any unclarity about that 2008-08-19 00:31 always was suspicious about that 2008-08-19 00:31 lock_page is a bit spin 2008-08-19 00:31 that is used heavily 2008-08-19 00:31 but the current rwlock implementation sort of a miracle 2008-08-19 00:31 really readly really heavily 2008-08-19 00:31 really good work done by rostedt 2008-08-19 00:31 nice 2008-08-19 00:32 anything that's atomic is f-ed in -rt 2008-08-19 00:32 rwspinlock, right? 
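On the ERR_PTR question raised above: a minimal userspace sketch of the idiom, modeled on the kernel's err.h. As the remark about memory maps suggests, it relies on one assumption: no valid object ever lives in the top 4095 bytes of the address space, so those pointer values are free to carry a negative errno. That holds on Linux in practice but is not something the C standard promises. alloc_buffer is a hypothetical example function, not anything from tux3.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *ERR_PTR(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the errno value from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer falls in the top MAX_ERRNO bytes of the address space. */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical example: a pointer return overloaded with error numbers. */
static void *alloc_buffer(size_t size)
{
	if (size == 0)
		return ERR_PTR(-EINVAL);
	void *data = malloc(size);
	assert(data == NULL || !IS_ERR(data)); /* the assumption stated above */
	return data ? data : ERR_PTR(-ENOMEM);
}
```

A caller checks IS_ERR(p) before use and turns PTR_ERR(p) back into an error return. Note that NULL is not an error pointer under this scheme, so a function has to pick one convention and stick to it.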
2008-08-19 00:32 make sure that you don't those locks for that long 2008-08-19 00:32 I'm pretty good about that 2008-08-19 00:32 only things like timers and the scheduler rq turn off interrupts and rescheduling for relatively long periods of time 2008-08-19 00:32 usually just take a spin lock long enough to get some other synchronizer set up 2008-08-19 00:33 all of that has been type redefined to be backed by a variant of the rtmutex 2008-08-19 00:33 so things like spinlocks are actually mutexes with the ability to sleep across BKL and still have it be persistently held to maintain correctness 2008-08-19 00:33 I'm wondering if I should get some multithreading happening in the userspace code 2008-08-19 00:33 semantic corrrectness 2008-08-19 00:33 get the locks at least partially sorted in userspace 2008-08-19 00:33 using futexes 2008-08-19 00:33 yeah, that might be useful for a mock up 2008-08-19 00:34 ACTION needs to get back to work 2008-08-19 00:34 the alternative is to skip that and just do that part in the kernel port 2008-08-19 00:34 btw, one of the Coverity owners is a Goth 2008-08-19 00:34 and a Stanford CSE professor 2008-08-19 00:34 they hang out near goog in mtv 2008-08-19 00:35 that was on the hiring committed for Sebastian Thrum (sp?) Grand Challenge winner 2008-08-19 00:35 dawson somebody 2008-08-19 00:35 engler 2008-08-19 00:35 nice dude, I gave him Burning Man advice a year ago :) 2008-08-19 00:35 had a good time 2008-08-19 00:35 hiring? 2008-08-19 00:35 what kind of advice does one need for burning man? 2008-08-19 00:36 hiring committee for Stanford CSS 2008-08-19 00:36 CSE department 2008-08-19 00:36 "watch out for the brown tabs" 2008-08-19 00:36 flips: how to have a good time what to look out for, etc... 
2008-08-19 00:36 haha 2008-08-19 00:36 floppy naked chicks 2008-08-19 00:36 on bikes 2008-08-19 00:36 sounds, um, athletic 2008-08-19 00:37 well btree leaf ops are functioning ok 2008-08-19 00:37 one issue: inserting keys in sorted order results in many half full leaves 2008-08-19 00:37 because after a leaf is split it never gets inserted into again 2008-08-19 00:38 there must be something clever to do about that 2008-08-19 00:38 ok a node in the b-tree represents a file right ? 2008-08-19 00:38 and you put the versioning information at that node ? 2008-08-19 00:38 some btrees are inode table blocks, some are file indexes 2008-08-19 00:38 how are indirect blocks dumped into that ? 2008-08-19 00:38 a leaf in a btree gets the versioned pointers 2008-08-19 00:38 the btree is the indirect block stuff 2008-08-19 00:39 oh shit, now I get it 2008-08-19 00:39 that's what I was wondering about 2008-08-19 00:39 so the time space trade off is really all about the b-tree and the metadata shoved into it 2008-08-19 00:39 two levels of trees 1) inode table 2) file index 2008-08-19 00:39 is that a correct understanding ? or am I just lost ? 2008-08-19 00:40 btrees are fairly efficient space wise 2008-08-19 00:40 not as efficient as a classic ufs radix tree for an index 2008-08-19 00:40 is my articulation accurate regarding your FS ? 
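The half-full-leaves problem with sorted inserts mentioned above is easy to reproduce in a toy model. The sketch below is an assumption-laden simulation, not tux3 code: leaves hold 8 integer keys, the index levels are replaced by a flat array of leaves, and the "something clever" shown is one common trick (when the new key goes past the end of the rightmost leaf, leave the full leaf alone and start an empty one). Nothing in the log says this is what tux3 ended up doing.

```c
#include <assert.h>

#define LEAF_CAP 8
#define MAX_LEAVES 64

struct leaf { int count; int keys[LEAF_CAP]; };
struct tree { int nleaves; struct leaf leaves[MAX_LEAVES]; };

/* Rightmost leaf whose first key is <= key (leaf 0 if there is none). */
static int find_leaf(const struct tree *t, int key)
{
	int i = 0;
	while (i + 1 < t->nleaves && t->leaves[i + 1].keys[0] <= key)
		i++;
	return i;
}

/* Insert into a leaf that has room, keeping keys sorted. */
static void leaf_insert(struct leaf *l, int key)
{
	int i = l->count++;
	while (i > 0 && l->keys[i - 1] > key) {
		l->keys[i] = l->keys[i - 1];
		i--;
	}
	l->keys[i] = key;
}

static void tree_insert(struct tree *t, int key, int append_aware)
{
	int n = find_leaf(t, key);
	struct leaf *l = &t->leaves[n];
	if (l->count == LEAF_CAP) {
		assert(t->nleaves < MAX_LEAVES);
		int at = LEAF_CAP / 2; /* classic split down the middle */
		if (append_aware && n == t->nleaves - 1 && key > l->keys[LEAF_CAP - 1])
			at = LEAF_CAP; /* appending: keep the old leaf full */
		for (int i = t->nleaves; i > n + 1; i--) /* make room for the sibling */
			t->leaves[i] = t->leaves[i - 1];
		t->nleaves++;
		struct leaf *r = &t->leaves[n + 1];
		r->count = LEAF_CAP - at;
		for (int i = 0; i < r->count; i++)
			r->keys[i] = l->keys[at + i];
		l->count = at;
		if (r->count == 0 || key >= r->keys[0])
			l = r;
	}
	leaf_insert(l, key);
}

/* Insert 0..nkeys-1 in sorted order and report how many leaves were used. */
static int sorted_fill_leaves(int nkeys, int append_aware)
{
	struct tree t = { .nleaves = 1 };
	for (int key = 0; key < nkeys; key++)
		tree_insert(&t, key, append_aware);
	return t.nleaves;
}
```

With 64 sorted keys the middle split ends up with 15 leaves, all but the last stuck at half full, while the append-aware split packs the same keys into 8 full leaves. The tradeoff is that a full leaf has to split again if a later insert lands in the middle of it.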
2008-08-19 00:40 yes 2008-08-19 00:40 I get it 2008-08-19 00:40 fuck, wow 2008-08-19 00:40 I didn't at our conversation, but I do now after talking to you and reading the posts 2008-08-19 00:40 it's more efficient to have a bunch of versioned pointers at the leaves of btrees than to be constantly rewrite tree nodes 2008-08-19 00:40 in theory 2008-08-19 00:41 yeah, you'll be able to do all sorts of funky things with it 2008-08-19 00:41 probably 2008-08-19 00:41 there was an OLS paper that talked about something similar actually 2008-08-19 00:41 2005 2008-08-19 00:41 2006 2008-08-19 00:41 would be interesting to see 2008-08-19 00:41 usign some kind of things like what you're talking about but to do DB kind of stuff with file metadata 2008-08-19 00:41 I didn't get the proceedings that year 2008-08-19 00:42 you could take a jpg or something and have a different header or something like that 2008-08-19 00:42 it should be online regardless 2008-08-19 00:42 heh 2008-08-19 00:42 well you could use versioning for that 2008-08-19 00:42 which potentially a powerful thing 2008-08-19 00:42 yes 2008-08-19 00:42 but I'm being fairly unimagination and just using it to implement posix and versioning 2008-08-19 00:42 yeah, I just got your idea, I'm half blow away by it 2008-08-19 00:42 good night's work then 2008-08-19 00:43 blown 2008-08-19 00:43 holy shit 2008-08-19 00:43 this could potentially smoke zfs since it's so rigid 2008-08-19 00:43 you can do all sorts of fucking things with those b-tree nodes 2008-08-19 00:43 am I right ? 
2008-08-19 00:45 right 2008-08-19 00:45 it's about one zillion times more compact than zfs 2008-08-19 00:45 yes, wow 2008-08-19 00:45 it's brilliant 2008-08-19 00:48 there are some interesting things being implemented in the inode table leaves 2008-08-19 00:48 the file leaves aren't going to get much fancier 2008-08-19 00:48 they're already pretty darn fancy 2008-08-19 00:49 see dleaf.c 2008-08-19 00:49 insane 2008-08-19 00:49 ok, have to go do work 2008-08-19 00:49 later 2008-08-19 00:49 good luck 2008-08-19 00:49 bye 2008-08-19 01:01 shapor, there's another bug 2008-08-19 01:02 the unit test adds a tree level and should not 2008-08-19 01:08 another bug: some buffers not getting released 2008-08-19 01:38 bogus buffer counts are gone 2008-08-19 01:38 now about that bogus level add 2008-08-19 01:39 sb->entries_per_node wasn't set 2008-08-19 02:11 flips: how are you dealing with concurrency issues with b-tree access ? 2008-08-19 02:11 you'll be doing a lot of reads to that tree and it's got to be able to do it quickly 2008-08-19 02:12 start with a single btree mutex then make it more granular 2008-08-19 02:12 when probing, drop the lock on the level above each time it goes deeper 2008-08-19 02:12 so just the leaf ends up locked 2008-08-19 02:13 if there's a better idea, whack me 2008-08-19 02:13 have you thought about using rcu instead for the read-sides ? 2008-08-19 02:14 what about write coherency in that tree across some kind of atomic sync ? 
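The probe locking described above (drop the lock on the level above each time the probe goes deeper, so only the leaf ends up locked) is usually called lock coupling or hand-over-hand locking. A minimal userspace sketch with pthread mutexes follows. The node layout, the fanout of 4 and the held flag are assumptions for illustration only; the log mentions futexes for the userspace prototype, and nothing here is claimed to match what tux3 implemented.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define FANOUT 4

struct bnode {
	pthread_mutex_t lock;
	int held;                       /* debugging aid: 1 while the lock is held */
	int count;                      /* number of children, 0 for a leaf */
	int keys[FANOUT];               /* keys[i] is the lowest key under children[i] */
	struct bnode *children[FANOUT];
};

static void node_init(struct bnode *n)
{
	memset(n, 0, sizeof *n);
	pthread_mutex_init(&n->lock, NULL);
}

static void node_lock(struct bnode *n)
{
	pthread_mutex_lock(&n->lock);
	n->held = 1;
}

static void node_unlock(struct bnode *n)
{
	n->held = 0;
	pthread_mutex_unlock(&n->lock);
}

/* Descend from the root to the leaf covering key. Locks are always taken
 * root to leaf, and the parent is released only after the child is locked,
 * so a concurrent writer cannot slip in between the two levels. */
static struct bnode *probe(struct bnode *root, int key)
{
	assert(root != NULL);
	struct bnode *node = root;
	node_lock(node);
	while (node->count) {
		int i = 0;
		while (i + 1 < node->count && node->keys[i + 1] <= key)
			i++;
		struct bnode *child = node->children[i];
		node_lock(child);       /* take the lower lock first */
		node_unlock(node);      /* then drop the level above */
		node = child;
	}
	return node;                    /* leaf is returned locked */
}
```

The caller owns the leaf lock on return and has to call node_unlock() when done. The root lock is held only for the duration of one child lookup, but every probe still passes through it, which is the contention concern raised in the conversation that follows.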
2008-08-19 02:14 yes I have 2008-08-19 02:14 -!- konrad(~konrad@c-24-16-74-109.hsd1.mn.comcast.net) has joined #tux3 2008-08-19 02:14 not really deeply though 2008-08-19 02:14 it's not a great rcu candidate 2008-08-19 02:14 the granularity issues are tricky and you could be stuck with a contention issue accessing that tree 2008-08-19 02:14 writing has to be really efficient too 2008-08-19 02:15 and rcu pukes pretty badly for writing 2008-08-19 02:15 yeah, I know 2008-08-19 02:15 I think the contention will be pretty good, provided the locks are pushed down the tree 2008-08-19 02:15 I have another trick too 2008-08-19 02:15 cursors 2008-08-19 02:15 a cursor is a probe path into the btree that isn't released 2008-08-19 02:16 you can't have a top level lock or else you'll run into things like the radix tree stuff for the page cache right ? 2008-08-19 02:16 to change the higher level blocks that a cursor owns you have to get everybody to release their cursor 2008-08-19 02:16 and limit yourself to about 2.5 processors for scalability 2008-08-19 02:16 maybe you can push this down to the versioning pointers themselves 2008-08-19 02:16 there won't be a lock inversion with radix tree 2008-08-19 02:17 radix tree lock is taken after btree lock 2008-08-19 02:17 it's not about inversion but contention and cache issues 2008-08-19 02:17 locks are ordered root to leaf in the btree 2008-08-19 02:17 the idea is not to access the root very often 2008-08-19 02:17 that is what the cursors do 2008-08-19 02:17 well, I'd expect top-level locks to really be hammered hard 2008-08-19 02:17 ok 2008-08-19 02:17 what about per cpu locality instead ? 
2008-08-19 02:18 it needs to be stated more precisely 2008-08-19 02:18 so you can rip it apart ;-) 2008-08-19 02:18 localize things on an inode level or something like that with SLAB support 2008-08-19 02:18 one cursor per cpu would be nice 2008-08-19 02:18 flips: just trying to help you think it out, not to rip per se, I want you to succeed 2008-08-19 02:18 top level blocks will only change rarely and are possibly rcu candidates 2008-08-19 02:18 I want everybody to succeed :) 2008-08-19 02:18 :-) 2008-08-19 02:19 you might like to talk to peterz about some of these issues 2008-08-19 02:19 he's a tree concurrency expert 2008-08-19 02:19 ok, one concept about these cursors is, you can lop the top levels away from the cursor, so only the deeper levels hold locks 2008-08-19 02:19 then when you need to advance the cursor or something the top level locks are retaken... temporarily 2008-08-19 02:20 yes, peterz would be good 2008-08-19 02:20 I think I ought to port to kernel early and deal with the locking there 2008-08-19 02:20 instead of prototyping that in usespace 2008-08-19 02:20 you see how the locking works in ufs style file indexes? 2008-08-19 02:20 it's cute 2008-08-19 02:20 there is none 2008-08-19 02:21 property of the ind/dind/tind layout 2008-08-19 02:22 flips: you also need to think about file duping 2008-08-19 02:22 ? 
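One way to read the remark about the ind/dind/tind layout: in a ufs or ext2 style index, the route from a logical block number to its pointer slot is pure arithmetic, so the tree never splits or rebalances the way a btree does. The sketch below shows that arithmetic. The 12 direct pointers follow the classic ext2 layout; the function name and interface are made up for illustration, and whether this fixed shape is exactly the property meant in the chat is a guess.

```c
#include <assert.h>
#include <stdint.h>

#define NDIRECT 12 /* direct pointers in the inode, as in classic ext2 */

/* Map a logical block number to its path through the index.
 * ptrs is the number of block pointers per indirect block.
 * Returns the path depth (1..4) and fills offsets[], or 0 if the block
 * lies beyond what a triple indirect block can address. */
static int block_to_path(uint64_t lblock, unsigned ptrs, unsigned offsets[4])
{
	assert(ptrs > 0);
	uint64_t p = ptrs;

	if (lblock < NDIRECT) {                 /* direct */
		offsets[0] = (unsigned)lblock;
		return 1;
	}
	lblock -= NDIRECT;
	if (lblock < p) {                       /* ind */
		offsets[0] = NDIRECT;
		offsets[1] = (unsigned)lblock;
		return 2;
	}
	lblock -= p;
	if (lblock < p * p) {                   /* dind */
		offsets[0] = NDIRECT + 1;
		offsets[1] = (unsigned)(lblock / p);
		offsets[2] = (unsigned)(lblock % p);
		return 3;
	}
	lblock -= p * p;
	if (lblock < p * p * p) {               /* tind */
		offsets[0] = NDIRECT + 2;
		offsets[1] = (unsigned)(lblock / (p * p));
		offsets[2] = (unsigned)((lblock / p) % p);
		offsets[3] = (unsigned)(lblock % p);
		return 4;
	}
	return 0;
}
```

Because the path depends only on the block number and the block size, two lookups never have to agree on the shape of the tree. A btree probe has no such luxury, since a split can move keys between blocks underneath it.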
2008-08-19 02:22 particularly de-duping 2008-08-19 02:22 not sure what you mean 2008-08-19 02:22 using a sha1 hash to make sure a file's contents are the same and aren't replicated 2008-08-19 02:22 so just having a pointer to it will do 2008-08-19 02:23 say for backing up a Windows volume and not recopying every fucking .dll constant in the system 2008-08-19 02:23 and other immutable files 2008-08-19 02:23 oh right 2008-08-19 02:23 just something to think about 2008-08-19 02:23 yes 2008-08-19 02:24 also possible to handle that at the volume level 2008-08-19 02:26 well, that's got to be handled in the b-tree as well I'd think since it's your only metadata structure that I know of 2008-08-19 02:27 the volume manager can pretend it's giving different blocks to the filesystem when they are actually the same 2008-08-19 02:28 then probably you need reference counting at some level 2008-08-19 02:28 venti and stuff like that 2008-08-19 02:31 well, the metadata grows as you add more functionality, so packing becomes important 2008-08-19 02:32 if I'm coming up with uninterestng things please tell me and I'll shut up 2008-08-19 02:32 you aren't suggesting looking for identical metadata blocks? 2008-08-19 02:33 but having something that can also vary the flatness of a particular file would also be useful like for video applications 2008-08-19 02:33 you could represent discontinguous spans using a special indirect block or something and describe the spans using an extent (?) 2008-08-19 02:33 indirect pointer I mean 2008-08-19 02:33 er, no, block 2008-08-19 02:34 flips: I'm suggesting what ever will work 2008-08-19 02:34 there are also to be extents 2008-08-19 02:34 extents will really flatten things 2008-08-19 02:34 so spans could be represented by an extent right ? 2008-08-19 02:34 ok, good 2008-08-19 02:34 yes 2008-08-19 02:34 sparse extents too 2008-08-19 02:34 good 2008-08-19 02:35 ok, am I raising interesting points or not ? 
2008-08-19 02:35 oh yes 2008-08-19 02:35 especially the locking 2008-08-19 02:35 I need to make a specific proposal 2008-08-19 02:35 starting from easy and moving to efficient 2008-08-19 02:35 well, the b-tree thing is so obvious yet so powerful I'm surprised that somebody hasn't tried this already 2008-08-19 02:36 btrfs is btrees, so is zfs 2008-08-19 02:36 but versioning at the leaves is new 2008-08-19 02:36 yeah, but you're using it in a novell way which is why it's interesting to me 2008-08-19 02:36 ok 2008-08-19 02:36 what seems novelle to you? 2008-08-19 02:37 novel 2008-08-19 02:37 a problem with a single big b-tree I would think might be aging elements in memory so that certain frequently used things will be in core for use, like for checking the integrity of a volume without having to load the same indirect pointer again and again 2008-08-19 02:37 flips: using a b-tree generically for all sorts of things 2008-08-19 02:38 generic btrees are new too, right 2008-08-19 02:38 I also don't know as much as you about file systems so my comment could be out of ignorance 2008-08-19 02:38 flips: I'm interested in the power of generic b-trees for all sorts of metadata 2008-08-19 02:38 the buffer cache blocks are lru's 2008-08-19 02:38 lru'd 2008-08-19 02:39 clean, old ones get evicted 2008-08-19 02:39 dirty ones have to be cleaned regularly, that is the atomic commit 2008-08-19 02:39 I will add the third kind of btree probably tomorrow 2008-08-19 02:40 actually, I already added a third kind, the unit test implements a new btree just for testing 2008-08-19 02:40 and to demo what you have to do to specialize the btree 2008-08-19 02:40 what about sensitivity to things like an inode versus indirect versus lower level indirect blocks ? 2008-08-19 02:41 same for all other kinds of metadata 2008-08-19 02:41 there needs to be a kind of ordering or something like that I'd expect 2008-08-19 02:41 for commit? 
2008-08-19 02:42 like an NFS use of a volume might be different from a Samba use 2008-08-19 02:42 flips: for general reading 2008-08-19 02:42 ...and needs different kinds of metadata loaded and persistent in different ways 2008-08-19 02:42 this is why I'm suspicious about the Linux page cache 2008-08-19 02:43 everything is handled the same way 2008-08-19 02:43 in the page cache 2008-08-19 02:43 the aging seems overly simplistic 2008-08-19 02:43 probably is 2008-08-19 02:43 linux kind of sucks there 2008-08-19 02:43 yeah, I've noticed 2008-08-19 02:43 somebody measured and found our pageout performs worse than random 2008-08-19 02:44 bad 2008-08-19 02:44 well my wife is heading to to tomorrow 2008-08-19 02:45 ok 2008-08-19 02:45 and I will drive to the airport 2008-08-19 02:45 so night then right 2008-08-19 02:45 ? 2008-08-19 02:45 I'll be up still for a few more hours 2008-08-19 02:45 continue anytime ok? 2008-08-19 02:45 sure, I hope it was a useful conversation 2008-08-19 02:45 night for me 2008-08-19 02:45 it was 2008-08-19 02:45 night 2008-08-19 02:45 locking is getting imminent 2008-08-19 02:45 bye 2008-08-19 02:50 you should also think about how to cluster related data together in the b-tree for contiguous write allocation 2008-08-19 02:50 the block allocator is a bitch 2008-08-19 02:50 thinking very much about that 2008-08-19 02:51 I will post some thoughts pretty soon 2008-08-19 02:51 ok, just hope that i'm relevant about this :) 2008-08-19 02:51 inode number targeting is a big part of it 2008-08-19 02:51 oh yes 2008-08-19 02:51 rotating media still rules the wold 2008-08-19 02:51 world 2008-08-19 02:51 because different kinds of metadata need to be treated differently 2008-08-19 02:52 which could be a drawback of having a big b-tree manage all of this 2008-08-19 02:52 I guess you can always dump a shit load of ram into your system as well 2008-08-19 02:52 the allocator will try to place inode table blocks near the directories that link them (note impossibility 
with hard links) and data blocks near the inode table blocks 2008-08-19 02:52 also impossible in general 2008-08-19 02:53 what about related indirect blocks ? 2008-08-19 02:53 and allocation with regards to versioning pointers and that information ? 2008-08-19 02:53 meaning higher level btree blocks 2008-08-19 02:53 as long as I'm asking good questions, I'll not feel like a fucking dork 2008-08-19 02:53 allocation target needs to be derived from the allocation target of the data blocks 2008-08-19 02:54 versioning makes allocation much harder 2008-08-19 02:54 yes 2008-08-19 02:54 very much so 2008-08-19 02:54 because you basically have to store lots of the data in the same place 2008-08-19 02:54 so you'll have to have an upper bound on the fs for doing this allocation efficiently 2008-08-19 02:54 that is where the idea of generating functions for allocation comes in 2008-08-19 02:54 like a quadratic hash 2008-08-19 02:55 otherwise you'll be running into collisions 2008-08-19 02:55 there will be massive collisions 2008-08-19 02:55 I am aiming to collide elegantly 2008-08-19 02:55 what about allocation maps in the versioning system ? self contained in the b-tree itself ? 2008-08-19 02:55 that is a cool thing about versioned pointers 2008-08-19 02:55 if it's done on a per volume basis, it could be a lot of replication 2008-08-19 02:55 you can tell from the versioned pointers what blocks are free 2008-08-19 02:56 right, so it's unified into the algorithm right ? 2008-08-19 02:56 there is just one global free tree for the whole filesystem 2008-08-19 02:56 knowing when to free a block is part of the versioning algorithm, yes 2008-08-19 02:56 it's pretty subtle 2008-08-19 02:56 well, what about fragmentation of that data ? 
2008-08-19 02:56 about the hardest part actually 2008-08-19 02:56 yes, versioning can fragment stuff 2008-08-19 02:56 you'd generally like to have that easily accessible 2008-08-19 02:56 think of a mysql database with snapshots every 5 minutes 2008-08-19 02:56 wham 2008-08-19 02:57 this conversation is logged right ? 2008-08-19 02:57 I believe so 2008-08-19 02:57 ok, just so that folks can ponder this stuff and come up with answers 2008-08-19 02:57 see tux3bot up them 2008-08-19 02:57 well, the allocation map is a bitch 2008-08-19 02:58 true 2008-08-19 02:58 the bitmap thing is pretty cute 2008-08-19 02:58 your read performance and friends are really tightly connected to how fast you can do a lookup in a b-tree 2008-08-19 02:58 you just reminded me, I can't have the allocation bitmap in my inode table 2008-08-19 02:58 it's global to multiple volumes 2008-08-19 02:59 one crude trick: cache the root of the btree 2008-08-19 02:59 and the 1st level for good measure 2008-08-19 02:59 well, replicated it 2008-08-19 02:59 branching factor is 2^8 2008-08-19 02:59 say there are 10 million inodes 2008-08-19 03:00 packed 32/block 2008-08-19 03:00 I think you should think about per CPU-ification straight up initially as apart of the design 2008-08-19 03:00 so that you avoid these issues 2008-08-19 03:00 2^18 blocks about 2008-08-19 03:00 you might have to push it down to an inode level or something and replicate all of the volume bits above it 2008-08-19 03:00 which is 3 btree levels 2008-08-19 03:01 yes, that is the right way to think about it 2008-08-19 03:01 no bouncing 2008-08-19 03:01 yeah, talking to matt about it will help us 2008-08-19 03:01 it's nearly 2 levels 2008-08-19 03:01 worth trying to make it 2 levels 2008-08-19 03:01 er, you. 
I'm avoiding work right now ;) 2008-08-19 03:01 then cache the root 2008-08-19 03:02 I'll stay up later to compensate 2008-08-19 03:02 that's one probe to get to the inode 2008-08-19 03:02 flips: I think it's critical to think about how you're going to organize the metadata, what for specific use at a specific time 2008-08-19 03:02 the versioning pointer stuff is really potentially powerful 2008-08-19 03:02 been thought about a lot 2008-08-19 03:03 I'm thinking about how to pack the btree nodes better now 2008-08-19 03:03 because caching this shit properly is a major bitch 2008-08-19 03:03 yes 2008-08-19 03:04 right now it's big an homogenous 2008-08-19 03:04 an=and 2008-08-19 03:04 which sounds like shitty cache performance 2008-08-19 03:05 which means that you have to think about these things straight up 2008-08-19 03:05 before trying to really fully implement it 2008-08-19 03:05 it's not homogenous 2008-08-19 03:05 inode table blocks try to have related inodes 2008-08-19 03:06 blocks ? 2008-08-19 03:06 directory blocks have temporally related entries 2008-08-19 03:06 leaves of the inode table btree 2008-08-19 03:06 have more than one inode per blocks 2008-08-19 03:28 -!- pgquiles(~pgquiles@246.Red-81-37-88.dynamicIP.rima-tde.net) has joined #tux3 2008-08-19 03:29 getting sleepy 2008-08-19 03:29 night 2008-08-19 03:29 night 2008-08-19 03:29 you're up late as well, wow 2008-08-19 04:51 -!- juancarlos(~juancarlo@33.Red-83-53-239.dynamicIP.rima-tde.net) has joined #tux3 2008-08-19 04:51 -!- juancarlos(~juancarlo@33.Red-83-53-239.dynamicIP.rima-tde.net) has left #tux3 2008-08-19 10:13 -!- pgquiles_(~pgquiles@154.Red-83-33-145.dynamicIP.rima-tde.net) has joined #tux3 2008-08-19 11:17 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-19 13:54 flips: you there ? 2008-08-19 13:55 have you thought about cluster failover in your system yet ? 
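The inode table depth estimate from earlier in the log (branching factor 2^8, 10 million inodes packed 32 per block) can be checked with a few lines of Python. This is only a back-of-the-envelope sketch. The function name is made up, and it assumes "levels" means index levels above the leaf blocks, which is how the 3-level figure in the conversation comes out.

```python
import math


def btree_levels(leaf_blocks: int, branching: int) -> int:
    """Index levels needed above the leaves so a single root reaches them all."""
    levels = 0
    reach = 1  # number of leaf blocks reachable from one node at this height
    while reach < leaf_blocks:
        reach *= branching
        levels += 1
    return levels


inodes = 10_000_000
inodes_per_block = 32
branching = 2 ** 8

# 312,500 leaf blocks, a little over 2^18 (262,144)
leaf_blocks = math.ceil(inodes / inodes_per_block)

# 2^8 * 2^8 = 2^16 leaves is not enough, so a third level is needed
levels = btree_levels(leaf_blocks, branching)

# Smallest branching factor that would fit the same table in two levels
two_level_branching = math.isqrt(leaf_blocks - 1) + 1
```

Under these assumptions the table needs 3 levels, and two levels would need a branching factor of about 560 rather than 256, or more inodes packed into each leaf.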
2008-08-19 13:55 yes 2008-08-19 13:55 yes 2008-08-19 13:55 a little 2008-08-19 13:55 what do you think about union FS and btrfs ? 2008-08-19 13:55 mostly about how atomic commit will work on a cluster 2008-08-19 13:55 neither has much to do with a cluster 2008-08-19 13:56 perhaps you are talking about failing over the underlying volume? 2008-08-19 13:56 yes 2008-08-19 13:57 well, something like paired nodes taking over when one or the other fails 2008-08-19 13:57 this would be in a kind of grid computing environment 2008-08-19 13:57 cluster lite 2008-08-19 13:58 what's your opinion aobut btrfs ? 2008-08-19 13:58 about 2008-08-19 13:58 I was thinking about the extend tux3 to be a clusterfs issue 2008-08-19 13:58 btrfs in general? I wish them good luck 2008-08-19 13:58 get stable and be better than zfs 2008-08-19 13:58 folks seem to be interested in it and there's increasing engineering effort going into it 2008-08-19 13:58 but it has the same design flaw as zfs 2008-08-19 13:59 I wasn't impressed by it when I looked at it actually 2008-08-19 13:59 mashes the lvm together with the filesystem, bad 2008-08-19 13:59 me neither 2008-08-19 13:59 oh 2008-08-19 13:59 I meant zfs 2008-08-19 13:59 btrfs is a zfs knockoff, and I think zfs kind of sucks 2008-08-19 13:59 zfs is slow for one thing 2008-08-19 14:05 btree algorithms sure got solid fast once I implemented the unit test 2008-08-19 14:05 now let's try the shiny new advance method 2008-08-19 14:08 I just wasnt impressed by it 2008-08-19 14:08 it's like a bad knock off of WAFL 2008-08-19 14:09 without any of the coolness of that system 2008-08-19 14:09 maybe I'm wrong, we'll find out 2008-08-19 14:09 btrfs is getting a lot of attention and resources right now so we'll see 2008-08-19 14:11 I don't think you're far off the mark 2008-08-19 14:11 spent some time in the code myself 2008-08-19 14:12 tux3 file index btrees just got better 2008-08-19 14:12 two line hack 2008-08-19 14:12 improves average leaf fullness from 50% 
to 100% 2008-08-19 14:12 nice 2008-08-19 14:13 the thing that I wondered about regarding btrfs is that it's pulling all sorts of things together, but I just don't understand why and to what ends 2008-08-19 14:13 doesn't seem to break anything either. I'd appreciate comment on the post I just did on tux3 though 2008-08-19 14:13 that's the main problem I have with it 2008-08-19 14:13 right 2008-08-19 14:13 as well as it kind of ignoring all of the intricacies of how complicated a COW FS is 2008-08-19 14:13 it's really dumb to do that stuff with the volume manager when it isn't necessary 2008-08-19 14:14 I already figured out how to do the redudant metadata thing they are obsessed with, without violating the lvm boundary 2008-08-19 14:14 don't know much about the lvm, but it seems like a bunch of grab bag items thrown together for some unclear reasons 2008-08-19 14:15 you should comment on some of the intricacies 2008-08-19 14:15 one of them is certainly allocation 2008-08-19 14:15 like lvm isn't known to handle metadata specifically, so I don't know about how they're going to pull that together 2008-08-19 14:15 they don't do it in the lvm 2008-08-19 14:15 or anything really with regards to lvm 2008-08-19 14:16 they make multiple copies of each metadata block and have multiple pointers to them 2008-08-19 14:16 that is just dumb 2008-08-19 14:16 have one pointer to the metadata block and make the block redundant at the lvm level 2008-08-19 14:16 duh 2008-08-19 14:16 it's not bad if it's done for a specific reason to solve a particular problem with metadata 2008-08-19 14:16 it's a general fear of bad blocks 2008-08-19 14:16 or disks going bad 2008-08-19 14:17 it's a really top heavy solution 2008-08-19 14:17 so let the RAID layer handle it ? 
2008-08-19 14:17 yes 2008-08-19 14:17 hmmm 2008-08-19 14:17 well, hard to say 2008-08-19 14:17 easy 2008-08-19 14:17 yeah, that makes sense, but I'm wondering where this will all go to 2008-08-19 14:17 have regions of different redundancy 2008-08-19 14:18 25% redundant for data, 200% for metadata 2008-08-19 14:18 interleave the regions, the filesystem knows which have which level of redundancy and sets allocation targets accordingly 2008-08-19 14:19 if necessary, have the lvm remap some regions to achieve higher or lower redundancy levels 2008-08-19 14:19 it would almost certainly be good enough just to let 1% of the volume be 200% redundant 2008-08-19 14:20 and distribute that evenly through the volume 2008-08-19 14:20 how are things going today ? was our discussion useful in pointing out problems last night ? 2008-08-19 14:20 was good 2008-08-19 14:20 reviewing a little now 2008-08-19 14:21 yes, the locking stuff 2008-08-19 14:21 need to make a coherent proposal on the list 2008-08-19 14:21 don't know, maybe btrfs will win and I'm wrong about my skepticism 2008-08-19 14:21 also need to make a coherent proposal on atomic commit, get the basics working in user space 2008-08-19 14:22 linux filesystem projects do tend to keep moving along 2008-08-19 14:22 btrfs has some good helpers 2008-08-19 14:22 yeah, maybe they'll win 2008-08-19 14:22 though most of the coding still seems to fall on chris 2008-08-19 14:23 it's btrfs vs zfs, not vs tux3 imho 2008-08-19 14:23 I think btrfs has a good chance against zfs 2008-08-19 14:23 but my experience is that the linux page cache is inadequate for enterprise level filers 2008-08-19 14:23 somewhat true 2008-08-19 14:23 you really need something different than that 2008-08-19 14:23 the radix tree stuff is pretty good 2008-08-19 14:23 something really particular to buffers because of the mirroring logic and stuff 2008-08-19 14:24 buffer handling needs a big fix, that is true 2008-08-19 14:24 buffers have to be individually marked so 
that you know that it's been replicated properly, etc... 2008-08-19 14:24 tux3 worries about taht 2008-08-19 14:24 oh, mirroring 2008-08-19 14:24 whether the buffers you're copying and indirect blocks are valid for the copy after online checking 2008-08-19 14:24 not a good way to mirror 2008-08-19 14:25 delta mirroring is the right way to go, otherwise you probably just want raid1 2008-08-19 14:26 to delta mirror, you don't try to copy indirect blocks, just leaf data 2008-08-19 14:26 let the destination worry about setting up the indirect blocks 2008-08-19 14:27 got some 30 level btress happening ;-) 2008-08-19 14:27 by cutting the leafs down to 7 elements per 2008-08-19 14:28 beauty is when all the smoke clears and every buffer has zero use count 2008-08-19 14:35 -!- vandenoever(~vandenoev@ip5657eb5b.direct-adsl.nl) has joined #tux3 2008-08-19 14:35 good evening 2008-08-19 14:35 hi 2008-08-19 14:35 vandenoever: hi 2008-08-19 14:35 hi flips, i hear you rule this realm 2008-08-19 14:35 flips: vandenoever is the guy behind strigi 2008-08-19 14:35 vandenoever: flips is the guy behind tux3 2008-08-19 14:35 let the party begin! 2008-08-19 14:35 rule would be a bit of an exaggeration 2008-08-19 14:35 :-) 2008-08-19 14:36 friend of yours pgquiles_? 2008-08-19 14:36 right 2008-08-19 14:36 flips: you're the main attraction 2008-08-19 14:36 dunno, shapor is kind of cute 2008-08-19 14:36 ACTION ducks 2008-08-19 14:36 strigi looks very cool 2008-08-19 14:36 and I am a huge kde fan 2008-08-19 14:37 flips: i'd have to go by glyhp curves on that , which is rather hard 2008-08-19 14:37 flips: that 's a good start :-) 2008-08-19 14:37 so i was wondering if at some point there should be indexes as part of the filesystem 2008-08-19 14:37 I used to use glimpse a lot, just for lxr 2008-08-19 14:37 it never got gree 2008-08-19 14:38 then htdig came along 2008-08-19 14:38 which kde still uses for docs 2008-08-19 14:38 the shame! 
2008-08-19 14:38 well, change the world step by step 2008-08-19 14:38 I suppose strigi beats it in every way? 2008-08-19 14:38 well 2008-08-19 14:38 sort of 2008-08-19 14:38 I would like to solve the problem of accurately maintaining an index 2008-08-19 14:39 without necessarily building it into the fs 2008-08-19 14:39 flips: that's the most urgent one 2008-08-19 14:39 flips: yes, let's not overdo it 2008-08-19 14:39 ddnotify ;-) 2008-08-19 14:39 you just invented that? nice 2008-08-19 14:39 nope 2008-08-19 14:39 I invented a bunch of other ddthings 2008-08-19 14:40 and ddlink might be really useful 2008-08-19 14:40 i mean: you just invented the name 2008-08-19 14:40 right 2008-08-19 14:40 ddlink is cool 2008-08-19 14:40 it is a tight two way coupling between kernel and userspace 2008-08-19 14:40 suitable for tasks like sending change notifies 2008-08-19 14:40 never heard of it ... 2008-08-19 14:41 google 2008-08-19 14:41 "ddlink kernel" 2008-08-19 14:41 yes 2008-08-19 14:41 it doesn't have a high profile 2008-08-19 14:41 ddlink phillips 2008-08-19 14:41 ACTION finds a pdf about instant startup 2008-08-19 14:41 even then... 2008-08-19 14:41 bleah 2008-08-19 14:41 just a sec 2008-08-19 14:41 An alternative interface to device mapper 2008-08-19 14:42 yes 2008-08-19 14:42 In more detail: ddlink is a generic pipe-like interface for controlling 2008-08-19 14:42 device drivers. 2008-08-19 14:42 hmm 2008-08-19 14:42 show you how much the world cares about that ;-) 2008-08-19 14:43 anything, the thing is, you can poll on a ddlink 2008-08-19 14:43 and it can send you, say, filesystem specific change notifications 2008-08-19 14:43 but i dont want to poll 2008-08-19 14:43 what would you like? 
2008-08-19 14:43 oh 2008-08-19 14:43 not on a given inode 2008-08-19 14:43 a stream of file changes to read from 2008-08-19 14:43 that's right 2008-08-19 14:43 ddlink does that 2008-08-19 14:44 ok, cool 2008-08-19 14:44 poll just lets you read it efficiently 2008-08-19 14:44 filtered for user rights? 2008-08-19 14:44 that's a detail of the ddlink instance 2008-08-19 14:44 but yes 2008-08-19 14:44 obeys access rules 2008-08-19 14:44 by default 2008-08-19 14:44 so i go and say: give me a pipe to read file changes on /dev/sda3 2008-08-19 14:44 ? 2008-08-19 14:45 exactly 2008-08-19 14:45 and this is a kernel module? how is this exposed? 2008-08-19 14:45 it could also be on a superblock 2008-08-19 14:45 it is a kernel library 2008-08-19 14:45 a module instantiates a ddlink with a few methods 2008-08-19 14:45 ok, so no userspace api yet 2008-08-19 14:46 sure 2008-08-19 14:46 it's a normal pipish kind of api 2008-08-19 14:46 good 2008-08-19 14:46 posted some minimal demos 2008-08-19 14:46 and have much nicer ones 2008-08-19 14:46 so that's step 1 2008-08-19 14:46 this idea has been in kernel for a long time 2008-08-19 14:46 here's problem 2 2008-08-19 14:46 see rpc_pipefs 2008-08-19 14:47 flips: but not generally part of vfs? 2008-08-19 14:47 not at the level, doesn't need to be 2008-08-19 14:47 it just uses the vfs to do its thing 2008-08-19 14:47 so let's assume this would work (i'll read up) 2008-08-19 14:48 see this scenario: kernel boots 2008-08-19 14:48 fs is mounted, X is started 2008-08-19 14:48 user logs in 2008-08-19 14:48 files are changed 2008-08-19 14:48 desktop start 2008-08-19 14:48 desktop search starts 2008-08-19 14:48 ddlink is opened 2008-08-19 14:48 unf. we have missed file changes at this point 2008-08-19 14:49 i'd like the indexer to say to the filesystem: 2008-08-19 14:49 "the last change i got from you was N. what has happened since?" 
2008-08-19 14:49 so fs needs a circular log 2008-08-19 14:49 good, no problem 2008-08-19 14:50 ddlink maintains an arbitrarily long queue 2008-08-19 14:50 waiting for someone to come along and slurp it up 2008-08-19 14:50 but not on disk, right? 2008-08-19 14:50 doesn't make the fs or anything wait synchronously either 2008-08-19 14:50 no 2008-08-19 14:50 memory 2008-08-19 14:50 because the same happens on shutdown 2008-08-19 14:50 ok, you want something on disk too? 2008-08-19 14:50 sounds reasonable 2008-08-19 14:50 or when indexer crashes 2008-08-19 14:51 you don't want kernel to buffer forever, right? 2008-08-19 14:51 or when user logs in without starting the indexer 2008-08-19 14:51 no, should be a reasonable limit 2008-08-19 14:51 because we can always do a full scan 2008-08-19 14:51 fs can say: " i dont remember all of that" and indexer does a full scan 2008-08-19 14:52 slight security problem: N should not be sequence 2008-08-19 14:52 ? 2008-08-19 14:52 anyway this is all a can do 2008-08-19 14:53 intruder could know how much was written by monitoring N 2008-08-19 14:53 even having the filesystem buffer the changes on disk 2008-08-19 14:53 nothing hard about it 2008-08-19 14:53 no, just has to be in the design 2008-08-19 14:53 which is why pgquiles_ pushed me here 2008-08-19 14:53 he said: go, go, flips is designing, we can add cruft! 
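The exchange above (the indexer asks "the last change i got from you was N", the filesystem keeps a bounded log and may answer "i dont remember all of that") can be sketched in Python as follows. This is an illustration only: the class and method names are hypothetical, it is not the ddlink interface, and it keeps the log in memory rather than on disk. It also uses plain sequence numbers for N, which is exactly what the security remark above warns against.

```python
from collections import deque


class ChangeLog:
    """Bounded log of change events, each tagged with a sequence number."""

    def __init__(self, capacity: int):
        self.events = deque(maxlen=capacity)  # oldest entries fall off the front
        self.next_seq = 1

    def record(self, change: str) -> int:
        """Append a change and return the sequence number it was given."""
        seq = self.next_seq
        self.events.append((seq, change))
        self.next_seq += 1
        return seq

    def since(self, last_seen: int):
        """Changes after last_seen, or None if some were already dropped."""
        oldest_needed = last_seen + 1
        if oldest_needed >= self.next_seq:
            return []  # the indexer is already in sync
        if not self.events or self.events[0][0] > oldest_needed:
            return None  # log no longer goes back that far: do a full scan
        return [change for seq, change in self.events if seq > last_seen]
```

An indexer that was not running for a while calls since() with the last number it saw; a list means it can catch up cheaply, and None means it falls back to the full scan.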
2008-08-19 14:54 just kidding, but he did push me here because this is a good point to take this stuff into account 2008-08-19 14:54 :-) 2008-08-19 14:55 flips: the btrfs folks have a more concurrent b-tree implementation now 2008-08-19 14:55 according to their announcement 2008-08-19 14:55 ok, convince me that the events actually have to be buffered on disk as opposed to in memory 2008-08-19 14:55 I think I am closed to convinced 2008-08-19 14:55 but I bet you dillon has something to say about that with replications 2008-08-19 14:55 nice excuse to add that cruft to the design ;-) 2008-08-19 14:55 flips: it's a performance thing 2008-08-19 14:55 bh, I was aware of it 2008-08-19 14:55 ok 2008-08-19 14:55 if a user logs in, now the first thing that happens is that the indexer puts inotify watches everywhere 2008-08-19 14:56 bh, you're syncing up pretty fast 2008-08-19 14:56 or scans all dirs for changes 2008-08-19 14:56 yeah 2008-08-19 14:56 sucks 2008-08-19 14:56 I know 2008-08-19 14:56 thought about it 2008-08-19 14:56 with a cache on disk, indexer gets a short list of modified changes and is in sync 2008-08-19 14:56 yes 2008-08-19 14:56 that is the right way to go, I will make a design note 2008-08-19 14:57 eh syncing up on what ? current development on linux file systems ? 2008-08-19 14:57 ACTION dances 2008-08-19 14:57 just taking an interest largely because of your announcement 2008-08-19 14:57 change notification needs to be a first class citizen of a filesystem, you showed that 2008-08-19 14:57 flips: deliver us from inotify! 
;-) 2008-08-19 14:57 I will do my best 2008-08-19 14:57 don't want to be negative, but I've been kind of down about Linux fs development overall 2008-08-19 14:57 this buffering could possibly be done at the vfs level too 2008-08-19 14:57 it just seems to scattered and disjointed 2008-08-19 14:57 only after gaining experience at the fs level 2008-08-19 14:58 bh, syncing up with btrfs facts 2008-08-19 14:58 flips: you mean vfs writes to a log file, so it works for al fses? 2008-08-19 14:58 most people just go on general impressions 2008-08-19 14:58 exactly 2008-08-19 14:58 but first some filesystem has to implement it and get it right 2008-08-19 14:58 before generalizing 2008-08-19 14:58 that would be even better, but log format would have to allow for sanity checking it 2008-08-19 14:58 and getting a mess like quota files 2008-08-19 14:59 yep 2008-08-19 14:59 obviously getting it in any fs is fine with me 2008-08-19 15:00 i was just wondering how this should be started 2008-08-19 15:00 a simple fuse with a change log could be used for designing 2008-08-19 15:00 starts with a design note I think 2008-08-19 15:00 uhuh 2008-08-19 15:00 well 2008-08-19 15:00 a bogus kernel module faking a ddlink would be good 2008-08-19 15:00 flips: i know, i'm cursing in the church of kernelspace 2008-08-19 15:01 you could do this: have two ddlinks 2008-08-19 15:01 you use one to feed fake filesystem behaviour into the kernel 2008-08-19 15:01 your index code uses the other 2008-08-19 15:01 as it would if the filesystem were generating the fake events 2008-08-19 15:01 my index code can make a ddlink? 
2008-08-19 15:02 the module does 2008-08-19 15:02 by any method 2008-08-19 15:02 oki 2008-08-19 15:02 I currently favor ioctl for creating ddlinks 2008-08-19 15:03 for example, ioctl a file, the root of a fs or any other file 2008-08-19 15:03 to get your ddlink 2008-08-19 15:03 I use ioctl code 0xdd for that ;-) 2008-08-19 15:03 :-) 2008-08-19 15:05 flips: then that fd is a pipe from which to read the changes? 2008-08-19 15:05 yes 2008-08-19 15:05 the ioctl returns a fd 2008-08-19 15:05 boy, kernel programming almost sounds easy! 2008-08-19 15:06 this was pretty clean 2008-08-19 15:06 then invent a protocol 2008-08-19 15:06 right 2008-08-19 15:06 that part is fun 2008-08-19 15:06 too bad we cannot use inodes in the protocol 2008-08-19 15:06 I mostly just send structs over the pipe 2008-08-19 15:06 ? 2008-08-19 15:06 or can we 2008-08-19 15:07 you can use anything that positively identifies the change 2008-08-19 15:07 inode numbers would be good 2008-08-19 15:07 we need to tell the path and the type of change i guess 2008-08-19 15:07 much better than names I think 2008-08-19 15:07 can i map inode to path? 2008-08-19 15:07 yes 2008-08-19 15:07 the heavens open! 2008-08-19 15:07 really? how? 
2008-08-19 15:07 you want to use some kind of handle for a directory, not a path I think 2008-08-19 15:08 path handling is crufty 2008-08-19 15:08 gets hard when the path changes asynchronously 2008-08-19 15:08 index uses urls as handles 2008-08-19 15:08 you would use the ddlink to ask the fs to tell you the name of an inode 2008-08-19 15:08 now 2008-08-19 15:08 of course 2008-08-19 15:08 there is a problem 2008-08-19 15:09 the inode can be multiply linked 2008-08-19 15:09 flips: ah ok, yes, that's possible, but i was not planning on talking ddlink, just to listen 2008-08-19 15:09 what you want is directory handles 2008-08-19 15:09 much linke openat etc 2008-08-19 15:09 much link 2008-08-19 15:09 much like 2008-08-19 15:10 directory handle + name 2008-08-19 15:10 instead of path/name 2008-08-19 15:10 then ask for directory name + parent handle till we reach root? 2008-08-19 15:10 right, that is always precisely defined 2008-08-19 15:10 unix semantics 2008-08-19 15:11 i see 2008-08-19 15:11 some notion of filesystem object would be cool 2008-08-19 15:11 an inode is a good object id 2008-08-19 15:11 the thing is, to decouple the object id from the name 2008-08-19 15:11 filesystem object it root for the ddlink module 2008-08-19 15:12 anyway, you're the expert there 2008-08-19 15:12 flips: for id we do use the path and we have currently no mechanism of transferring indexed information when moving a file 2008-08-19 15:12 so you could have a real id 2008-08-19 15:13 and map your current paths to a made up id 2008-08-19 15:13 but use the real, inode id if available 2008-08-19 15:13 we could but we'd have to change the entire indexer api 2008-08-19 15:14 do we need to do this to ensure that we are in sync? 2008-08-19 15:15 do it some time in the future 2008-08-19 15:15 i realize that inodes are more efficient in terms of moving and double linking 2008-08-19 15:15 it's just more accurate 2008-08-19 15:15 I think 2008-08-19 15:15 flips: what if a defrag tool comes along? 
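A small sketch of the "directory handle + name" scheme discussed above, where the path is rebuilt by asking for the directory name and parent handle until the root is reached. It is written in Python with a plain dict standing in for the filesystem; the handle numbers, the table and the function name are all made up for illustration.

```python
ROOT = 1  # hypothetical handle for the filesystem root

# handle -> (parent handle, name within parent); stands in for asking the fs
directories = {
    2: (ROOT, "home"),
    3: (2, "user"),
    4: (3, "docs"),
}


def path_of(dir_handle: int, name: str) -> str:
    """Rebuild a full path from a (directory handle, name) pair."""
    parts = [name]
    handle = dir_handle
    while handle != ROOT:
        parent, dirname = directories[handle]
        parts.append(dirname)
        handle = parent
    return "/" + "/".join(reversed(parts))
```

A change event carrying (4, "notes.txt") stays valid if a directory higher up is renamed; only the path computed from it at lookup time changes, which is the decoupling of object id from name that the conversation is after.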
2008-08-19 15:15 at least, use the directory id's I think 2008-08-19 15:15 that should map to your stuff 2008-08-19 15:15 defraggers renumbering inodes? 2008-08-19 15:16 sure it's a danger 2008-08-19 15:16 but that should just look like a series of valid operations to you 2008-08-19 15:16 or what if user restores a backup and index was on another disk? 2008-08-19 15:16 you know you have it right when you can follow events through that maze 2008-08-19 15:16 then depending on type of restore, inodes might be different 2008-08-19 15:16 flips: it's still subject to implementation issues like everything, I couldn't predict the performance of either btrfs or tux3 until there was an implementation in place for testing 2008-08-19 15:16 true 2008-08-19 15:17 I'd tend to go for some kind of "meld" process to handle extreme events like that 2008-08-19 15:17 flips: it's an index so we can always rebuild it 2008-08-19 15:17 sounds like invalidating the whole index would be right in those cases 2008-08-19 15:17 right 2008-08-19 15:17 what we aim for is to be 99% sure that we dont need to do much work 2008-08-19 15:17 when starting up 2008-08-19 15:17 bh, you can't just assume that I'll make it kickass? ;-) 2008-08-19 15:18 yes 2008-08-19 15:18 I think I get it 2008-08-19 15:18 and to be able to tolerate startup + shutdown + startup where the indexer doesn't run for a whole cycle 2008-08-19 15:18 we can still have a 'do full scan' button, but it should not be needed 2008-08-19 15:18 and just picks up as if it did 2008-08-19 15:18 right 2008-08-19 15:19 I think I have a pretty clear picture 2008-08-19 15:19 will dust off ddsnap code 2008-08-19 15:19 ddlink I mean 2008-08-19 15:19 and refresh to current 2008-08-19 15:19 ddlink lives as a patch? 
2008-08-19 15:19 flips: I hope so, but man the rumors about you and stuff give me doubt about you 2008-08-19 15:19 especially those freaky roller blades and stuff 2008-08-19 15:19 and funky hat 2008-08-19 15:20 weird friends 2008-08-19 15:20 http://phunq.net/ddtree 2008-08-19 15:20 I'll make it a patch 2008-08-19 15:20 ACTION giggles 2008-08-19 15:20 bh: oh my, now i have images of the 90s in my head 2008-08-19 15:20 bh, I take off the roller blades to debug 2008-08-19 15:20 and put them on your head to keep out the voice from outer spaec ? 2008-08-19 15:21 space ? 2008-08-19 15:21 :) 2008-08-19 15:21 my daughter likes to put them on 2008-08-19 15:21 no matter what I do, I can't keep the voices out 2008-08-19 15:21 flips: is ddlink tux3 specific? 2008-08-19 15:21 right now they're telling me to test the btree advance ;-) 2008-08-19 15:22 v, not at all 2008-08-19 15:22 completely generic, I find new places to use it all the time 2008-08-19 15:22 it will be an integral part of lvm3 2008-08-19 15:22 trond is even interested in changing nfs to use it 2008-08-19 15:23 it's cleaner than rpc_pipefs 2008-08-19 15:23 ACTION runs and hides 2008-08-19 15:24 when do you think you'll overrun btrfs ? :) 2008-08-19 15:24 christmas 2008-08-19 15:24 which christmas is left unspecified 2008-08-19 15:24 seriously ? 2008-08-19 15:24 :-D 2008-08-19 15:24 I was just joking actually 2008-08-19 15:24 I'm always serious ;-) 2008-08-19 15:24 ah, ok 2008-08-19 15:24 so before abolition of christianity and capitalism 2008-08-19 15:25 days before 2008-08-19 15:25 faster if coders send patches 2008-08-19 15:25 bh, you can still grab the glory of being contributor #3 2008-08-19 15:26 oh fuck 2008-08-19 15:26 :) 2008-08-19 15:26 I have this nasty schedule code/bug to work through 2008-08-19 15:26 scheduler 2008-08-19 15:26 and I'm kind of half clueless about what's going on 2008-08-19 15:26 bh, how about I do a quick kernel port and you make nice btree locking? 
2008-08-19 15:27 should not be a huge time investment 2008-08-19 15:27 it's a hard problem I'm not sure what the best method is 2008-08-19 15:27 rcu on the upper nodes 2008-08-19 15:27 well, it depends 2008-08-19 15:27 hmmm 2008-08-19 15:27 needs to be cognizant of the atomic commit algorithm 2008-08-19 15:27 which I must crystallize first 2008-08-19 15:27 even spinning on the upper nodes would be find 2008-08-19 15:28 mutex on the deep nodes 2008-08-19 15:28 flips: i'm going to read up on ddlink and await the resurrection of it 2008-08-19 15:28 it's in the pipeline 2008-08-19 15:28 ddsetup is the example program, a clone of dmsetup 2008-08-19 15:28 but written in a fraction of the code 2008-08-19 15:28 flips: can you mail me when you get an update? then i can add another fs event backend to the indexer 2008-08-19 15:29 how about this: you describe your ask on tux3 mailing list 2008-08-19 15:29 then I respond by giving you something concrete? 2008-08-19 15:29 ok, will do so tomorrow, now ddlink bedtime reading and sleep 2008-08-19 15:29 cross post to your own ml 2008-08-19 15:30 there was some good discussion of it on lkml 2008-08-19 15:30 flips: how long ago? 2008-08-19 15:30 jon corbet asked me why not netlink 2008-08-19 15:30 and I showed why not pretty convincingly 2008-08-19 15:30 year or so 2008-08-19 15:30 was it on lwn.net if corbet was asking? 2008-08-19 15:30 lkml I think 2008-08-19 15:30 jon sometimes posts 2008-08-19 15:31 http://lkml.org/lkml/2008/3/5/327 2008-08-19 15:32 right 2008-08-19 15:32 a slightly exaggerated comparison 2008-08-19 15:32 but only slightly 2008-08-19 15:32 netlink really does suck 2008-08-19 15:32 for the press :-) 2008-08-19 15:32 right 2008-08-19 15:36 ok, good night! 
2008-08-19 15:36 ACTION gets back to the advance test 2008-08-19 15:46 pop to level 1, 3 of 3 nodes 2008-08-19 15:46 [5815] devmap_blockio: read block dddddddddddddddd 2008-08-19 15:46 [5815] devmap_blockio: Failed assertion "dev->bits >= 9 && dev->fd" 2008-08-19 15:46 bugz ;-) 2008-08-19 15:46 ACTION is still reading the backlog, he ran away for (very late) dinner when tech discussion started 2008-08-19 15:46 was good 2008-08-19 15:46 good intro 2008-08-19 15:50 flips: I also happen to know one of the guys developing Tracker (http://www.gnome.org/projects/tracker/), in case you want another point of view... 2008-08-19 15:51 it he interfacing to strigi? 2008-08-19 15:51 no 2008-08-19 15:51 because? 2008-08-19 15:51 I'd be interested in why 2008-08-19 15:51 because it's like strigi but started by the gnome people 2008-08-19 15:52 :-) 2008-08-19 15:52 gnome guys usually get everything wrong 2008-08-19 15:52 and IIRC they have their own indexing engine (strigi uses clucene) 2008-08-19 15:52 I have all this marginally useless gnome invention cruft on my system 2008-08-19 15:52 some gnome thing was screwing up this morning and I'm not running gnome 2008-08-19 15:53 but I'm interested in the reasoning in the fascination with the bizarre sense 2008-08-19 15:53 :-D 2008-08-19 15:54 linus holds similar views btw 2008-08-19 15:57 I know 2008-08-19 15:57 found a bug in btree probe, wow 2008-08-19 15:57 it's been in ddsnap all these years 2008-08-19 15:57 never had to probe for a key that already exists 2008-08-19 15:57 fact is, when I started developing on linux when I was at the university, I used gtk+ (1.1.something, IIRC) 2008-08-19 15:57 that horrified me so much, I quickly moved to qt :-) 2008-08-19 15:58 I know, I used to check the gtk web page every day to see all the amazing new ideas 2008-08-19 15:58 loved glade 2008-08-19 15:58 glade was good, yeah 2008-08-19 15:58 that was before I learned about oop and the lack of it in gnome thinking 2008-08-19 15:58 gtk is one 
of the things that really hurts linux desktop adoption now 2008-08-19 15:59 particularly its use in moz firefox 2008-08-19 15:59 but corba, bonobo, gnomevfs (lack of use in gnome applications, actually), etc screwed things royally 2008-08-19 15:59 really 2008-08-19 15:59 clusterfsck 2008-08-19 15:59 reinvention of KIO as GIO is the last great idea from gtk people 2008-08-19 15:59 serial braindamage 2008-08-19 16:00 dbus is a mess 2008-08-19 16:00 dcop was so nice 2008-08-19 16:00 seemed to work 2008-08-19 16:00 now uses dbus, right? 2008-08-19 16:00 dbus has serious borkness 2008-08-19 16:00 something invented in a day after getting beer drunk must really work well :-) 2008-08-19 16:00 yes, now dbus 2008-08-19 16:00 I remember 2008-08-19 16:00 used X ICE 2008-08-19 16:01 :-) 2008-08-19 16:01 there is also some good stuff in dbus 2008-08-19 16:01 it's not the worst thing the gnome mafia came up with 2008-08-19 16:02 pop to level 1, 3 of 1 nodes 2008-08-19 16:02 [5908] devmap_blockio: read block c0de00000007 2008-08-19 16:02 [5908] devmap_blockio: Failed assertion "dev->bits >= 9 && dev->fd" 2008-08-19 16:02 whoops 2008-08-19 16:02 hmm 2008-08-19 16:03 shapor, getting close to sk8 o'clock? 
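The btree probe bug mentioned above (it only showed up when probing for a key that already exists) is not spelled out in the log. As a rough sketch of where an exact match can go wrong, here is the within-node step of a probe, assuming a node layout where child i covers keys from pivot[i] up to but not including pivot[i + 1]. The struct and function names are hypothetical, not ddsnap or tux3 code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical index node: child i covers keys in [pivot[i], pivot[i + 1]).
 * pivot[0] is treated as minus infinity and is never compared. */
struct index_node {
    unsigned count;        /* number of children in use, at least 1 */
    uint64_t pivot[16];    /* sorted ascending */
    uint64_t child[16];    /* block number of each child */
};

/* Return the index of the child to descend into for key. */
static unsigned probe_node(const struct index_node *node, uint64_t key)
{
    unsigned lo = 0, hi;

    assert(node->count >= 1 && node->count <= 16);
    hi = node->count;
    /* Invariant: the child we want has an index in [lo, hi). */
    while (hi - lo > 1) {
        unsigned mid = lo + (hi - lo) / 2;
        if (key < node->pivot[mid])   /* strict: a key equal to the pivot goes right */
            hi = mid;
        else
            lo = mid;
    }
    return lo;
}
```

The strict comparison is the part an exact match exercises: with `<=` a key equal to a pivot would descend into the child to the left of the one that holds it.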
2008-08-19 16:46 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-19 17:29 -!- boom(~boom@c-76-117-208-224.hsd1.nj.comcast.net) has joined #tux3 2008-08-20 00:05 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-08-20 01:17 hey flips 2008-08-20 01:18 oh everybody is asleep now :) 2008-08-20 01:18 ACTION is awake 2008-08-20 01:21 yeah, it was nice meeting you the other day, I hope you had fun with us 2008-08-20 01:23 yeah, it was fun 2008-08-20 01:24 tried to absorb some of your scheduler speak, i don't know much about it 2008-08-20 01:30 there's a lot of activity trying to cross lock the rq so that you can move tasks across to another processor 2008-08-20 01:31 depending on the run category, FIFO or OTHER, you can migrate it directly or use a migration thread to facilitate the move 2008-08-20 01:31 problem with -rt is that the FIFO detection code can be very aggressive about scanning other run queues which can be a bit unbounded 2008-08-20 01:32 that can hold the spinlock protecting for a long time and can cause other kinds of contention if a lot of migration operations are happening 2008-08-20 01:32 if the algorithm is polynomial time this might cause severe contention on those locks 2008-08-20 01:32 have you folks thought about how to do online disk checking yet ? 2008-08-20 01:35 online fsck? 2008-08-20 01:36 yes 2008-08-20 01:37 have you folks thought about extending the page cache to do more sophisticated things like explicit buffer tracking ? 
2008-08-20 01:38 i haven't, i believe flips has mumbled about it 2008-08-20 01:39 yeah, I wonder what i know or not is patented already 2008-08-20 02:14 -!- pgquiles(~pgquiles@239.Red-83-41-113.dynamicIP.rima-tde.net) has joined #tux3 2008-08-20 07:01 -!- pgquiles(~pgquiles@239.Red-83-41-113.dynamicIP.rima-tde.net) has joined #tux3 2008-08-20 13:21 the new iterative btree dump is way prettier than the recursive one and shorter 2008-08-20 13:54 shapor, next_key worked perfectly on the first try 2008-08-20 14:51 "Having actually run ZFS in production, there are some serious drawbacks with the remaining features (copy-on-write fragmentation, problems in SAN environments, etc), that may leave one wishing they'd implemented the ZFS features in a more stackable way so you could easily discard inappropriate layers and features" -- znork 2008-08-20 15:27 "You've run ZFS in production, yet you can't see the improvement on Linux's model? You mean the fact that md is completely broken and LVM is unreliable and slow by comparison?" -- outZider 2008-08-20 15:27 "Sir, I wish I had points to mod this up!" -- doomicon 2008-08-20 15:28 "ZFS is really, really nice but it does have some warts and the biggest for many would be that arcane operating system that's dangling off its nutsack" -- Kent Recal 2008-08-20 15:31 lol 2008-08-20 15:31 "Maybe he wants a cluster file system, or one that does HSM. I know I do and ZFS as of today does neither. ZFS is designed for managing a bunch of direct attached hard disks in thumper or similar device. At anything else it is frankly a bit sucky." 
jabuzz 2008-08-20 16:13 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-20 23:46 I need a clever 16 bit hex magic number for the volume tree 2008-08-20 23:47 existing magics are 0x1eaf for file index leaves and 0x90de for inode table leaves 2008-08-20 23:47 could be 32 bits, since this table is special and size doesn't matter 2008-08-20 23:50 for now the magic is 0x2008 2008-08-21 01:27 hey 2008-08-21 01:28 flips: cow allocation is pretty important 2008-08-21 01:28 to prevent fragmentation 2008-08-21 01:30 really 2008-08-21 01:30 so there are some concepts being considered 2008-08-21 01:30 delayed writes help I guess 2008-08-21 01:30 they should 2008-08-21 01:30 ACTION is tired today 2008-08-21 01:30 there is a concept of generating function driven goals 2008-08-21 01:31 been up since about 12pm and have been in front of a computer for most of that time 2008-08-21 01:31 12 pm... 25 hours ago? 2008-08-21 01:31 or only 13? 2008-08-21 01:32 flips: 12pm != midnight ;) 2008-08-21 01:32 12am is 25+ hours ago 2008-08-21 01:32 midday 2008-08-21 01:32 anyway, the idea is when data writes do collide with other versions of the data, either for atomic commit reasons or because a snapshot is held, then the write gets bounced away to a new goal, and if it collides there, to a further away goal 2008-08-21 01:32 the thing is, related data should get bounced to a similar place 2008-08-21 01:33 how do you choose where? 
2008-08-21 01:34 another concept is to avoid completely filling any given region, which would interfere with placing the small amount of metadata in the region that is needed to do certain atomic commits 2008-08-21 01:34 first bounce to a little higher, then more higher, and even more, then try a little lower, then bounce way far away 2008-08-21 01:34 generating function decides 2008-08-21 01:35 like a quadratic hash 2008-08-21 01:35 if you can keep say 4MB globs of data together, it doesn't matter much that it is stored far from its inode 2008-08-21 01:36 the seek time ends up about 10% of the transfer time, which is ok 2008-08-21 01:36 it's when you have lots of itty bitty pieces scattered around that seeking gets dominant 2008-08-21 01:36 there is also a concept of keeping an allocation density per region 2008-08-21 01:37 say, per 128 MB region 2008-08-21 01:37 so the bounce function could take that into account 2008-08-21 01:39 rewrite of a 1 GB file is not necessarily as scary as it sounds, if the truncate is committed first and synced, then the old blocks can be freed and rewritten 2008-08-21 01:39 if snapshotted, you want to take a huge bounce far away 2008-08-21 01:39 so the bounce function needs to take the size of the file into account 2008-08-21 01:39 bigger file = bigger bounces 2008-08-21 01:39 ACTION has a massive headache 2008-08-21 01:40 ACTION recommends that people with headaches not think about impossible problems too much 2008-08-21 01:40 just get flash storage 2008-08-21 01:40 right 2008-08-21 01:40 so far, no actual coding has gone into allocation strategy 2008-08-21 01:41 i'd say relying on seekless storage would be a good first cut 2008-08-21 01:41 anyway, I now need to think of a name other than "btree" for the on disk representation of a btree root 2008-08-21 01:41 well that's happening by default 2008-08-21 01:42 but I want at least some minimal allocation policy right from the start 2008-08-21 01:42 based on inode number 2008-08-21 
01:43 roughly speaking, the idea is to allocate inodes in clumps, the clumps all belonging to files created in the same directory 2008-08-21 01:43 and the clumps scattered fairly far apart 2008-08-21 01:43 whys is that significant? 2008-08-21 01:43 to make a tar benchmark go fast? 2008-08-21 01:43 the file data will then be targetted to the region of that clump of inodes 2008-08-21 01:43 tar is a big deal 2008-08-21 01:44 but in general, inodes should be near their directories and data block should be near the inodes 2008-08-21 01:44 would be better to put data in a place where the head will be when you are most likely to need it 2008-08-21 01:44 because the patter goes: look up dirent; open inode; read data 2008-08-21 01:44 i like the idea of spraying the drive with data if its idle 2008-08-21 01:44 files in the same directory tend to have some relationship to each other 2008-08-21 01:45 why choose when you can write it in more than one place 2008-08-21 01:45 that is kind of what hammer does 2008-08-21 01:45 oh? 2008-08-21 01:45 what does hammer do? 2008-08-21 01:46 it sprays writes into roughly the region it thinks they should go, then the reblocking process comes along later and arranges things tidily 2008-08-21 01:46 also, this is the only way space is freed in hammer 2008-08-21 01:46 free blocks are 128 MB I think 2008-08-21 01:46 i'm talking about writing the same data more than once 2008-08-21 01:46 which are obtained by compacting via reblocking 2008-08-21 01:46 that takes more time 2008-08-21 01:47 not if the drive is idle 2008-08-21 01:47 what if it isn't? 
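The goal bouncing described a few minutes earlier in the log (a little higher, then more, then a little lower, then far away, "like a quadratic hash") can be sketched as a small generating function. This is only an illustration: the function name, the exact offsets, and the choice of half the volume as "far away" are assumptions, and the log itself says no allocation code had been written yet. Because the sequence depends only on the starting goal, related data that starts from the same goal bounces to the same places.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t block_t;

/* Hypothetical bounce function: given the original goal and the number of
 * collisions so far, return the next goal to try. unit is the bounce step in
 * blocks; a caller could scale it with file size ("bigger file = bigger bounces"). */
static block_t bounce_goal(block_t goal, unsigned attempt, block_t unit, block_t volume)
{
    block_t offset;

    assert(volume > 0 && goal < volume);
    if (attempt <= 3)
        offset = (block_t)attempt * attempt * unit;  /* 0, 1, 4, 9 units higher */
    else if (attempt == 4)
        offset = volume - unit % volume;             /* one unit lower, modulo the volume */
    else
        offset = volume / 2;                         /* give up on locality: half a volume away */
    return (goal + offset % volume) % volume;
}
```

A per-region allocation density, as suggested in the log, could be consulted by the caller to skip goals that land in nearly full regions; that part is left out here.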
2008-08-21 01:47 takes less time to move the data later 2008-08-21 01:47 since you dont have to copy it, just erase an extra copy 2008-08-21 01:48 there might be something there 2008-08-21 01:48 you don't really want the disk spinning for minutes after a big episode of writes though 2008-08-21 01:48 if you have io's waiting, you obviously dont do it 2008-08-21 01:49 although if you have buffer laying around in ram, and the drive gets idle, why not write them somemore 2008-08-21 01:49 then you can trivially break it just by having a long running write, like untarring dozens of kernel trees 2008-08-21 01:49 true, the last point 2008-08-21 01:49 the drive should never be idle when there is a dirty buffer in cache 2008-08-21 01:49 big flaw in linux there 2008-08-21 01:49 or even clean! 2008-08-21 01:49 :P 2008-08-21 01:50 well 2008-08-21 01:50 not so sure that writing out clean data is a win 2008-08-21 01:50 if you know it should be migrated, sure it might be a good time to migrate 2008-08-21 01:50 opportunistic defrag 2008-08-21 01:50 but, that will be slow 2008-08-21 01:50 because? 2008-08-21 01:51 if it's in cache its just a write 2008-08-21 01:51 because you have to seek to do it 2008-08-21 01:51 ah in cache 2008-08-21 01:51 true 2008-08-21 01:51 yeah... 
writing clean data is kind of a crazy idea 2008-08-21 01:51 the thing is, most of the badly fragmented stuff won't be in cache 2008-08-21 01:51 but there still might be a slight win 2008-08-21 01:52 there is a something similar planned 2008-08-21 01:52 you could do it if you are reading 2008-08-21 01:52 that is the so called log rollup 2008-08-21 01:52 say you read a heavily fragmented file 2008-08-21 01:52 right 2008-08-21 01:52 it ends up in buffers 2008-08-21 01:52 good point 2008-08-21 01:52 you should neve pay that high price again 2008-08-21 01:52 paint it down in some free space 2008-08-21 01:52 and update metadata 2008-08-21 01:52 you choose a new allocation goal for the whole file, then take the opportunity to migrate it, since you had to read it anyway 2008-08-21 01:52 yes 2008-08-21 01:53 however 2008-08-21 01:53 there better be some write activity going on at the same time 2008-08-21 01:53 people don't really like when writing happens when you are just reading 2008-08-21 01:53 like atime 2008-08-21 01:53 course maybe nobody will notice 2008-08-21 01:54 no one will care 2008-08-21 01:54 fragmentation is the biggest problem with cow style filesystems 2008-08-21 01:54 almost* 2008-08-21 01:54 it should be an advantage that tux3 will not rewrite nearly as much metadata 2008-08-21 01:54 i'm reading a bit about it and i like the hammer approach 2008-08-21 01:54 I like hammer too 2008-08-21 01:55 I want to get on lkml and advocate somebody start porting 2008-08-21 01:55 nice and simple really 2008-08-21 01:55 for what it does, yes 2008-08-21 01:55 see, we got a new file checked in 2008-08-21 01:55 volume.c 2008-08-21 01:56 i'm not sold on always trying to defragment files though 2008-08-21 01:56 that will suck on a lot of workloads 2008-08-21 01:56 tomorrow I will try and actually have it reference the master inode table 2008-08-21 01:56 like a log file server 2008-08-21 01:56 the allocator has to try hard to lay down the data in a reasonable place on the 
first try 2008-08-21 01:57 would be nice if there was some userspace interface to opportunistic readahead 2008-08-21 01:57 say you have a log file server which is appead mostly 2008-08-21 01:58 then you want to grep all the logs for something 2008-08-21 01:58 drive seeks because files are all badly fragmented 2008-08-21 01:59 there is 2008-08-21 01:59 fadvise 2008-08-21 01:59 no that doesn't help 2008-08-21 01:59 I recall you explaining this before 2008-08-21 01:59 need to explain again ;-) 2008-08-21 02:00 i want to sweep the drive once and grep all the files i read 2008-08-21 02:00 idealy ;) 2008-08-21 02:00 I think I can handle the append slowly case 2008-08-21 02:00 a heuristic is triggered when the log file grows to a certain size and is opened for append 2008-08-21 02:01 then, the file will grow in chunks 2008-08-21 02:01 hm 2008-08-21 02:01 "big log file" trigger? 2008-08-21 02:01 hm 2008-08-21 02:01 the allocation goal function will choose a location to target the next chunk where there exists a fair amount of empty space 2008-08-21 02:01 and other things will be discouraged from squatting there 2008-08-21 02:01 could just profile access patterns in general 2008-08-21 02:01 could 2008-08-21 02:01 maybe should 2008-08-21 02:02 and just store that in ram 2008-08-21 02:02 but some important ones can be determined without much analysis 2008-08-21 02:02 doesn't need to be persistent 2008-08-21 02:02 unless the drive is idle of course ;) 2008-08-21 02:02 wow, zumstor built and passed tests with the mem monitor excised ;-) 2008-08-21 02:03 exactly 2.5 hrs 2008-08-21 02:03 true, and analyzing allocation pattern provides work for lazy cpus 2008-08-21 02:04 I think there may be some allocation "zones", for example, zones where 4 MB is the minimum allocation unit 2008-08-21 02:05 could profile directories too 2008-08-21 02:05 and no more than a single file is allowed in the same 4MB zone 2008-08-21 02:05 4MB chunk I mean 2008-08-21 02:05 directory x usually gets files 
that dont grow beyond 16kb 2008-08-21 02:05 right 2008-08-21 02:05 while directory y usually gets files that grow to 10gb 2008-08-21 02:05 and then they will be targetted to a small granularity zone 2008-08-21 02:05 yeah 2008-08-21 02:05 and new inode table blocks may be created in that zone too 2008-08-21 02:06 that is the beauty of variable attributes 2008-08-21 02:06 eventually, the original inode table blocks of a directory that was "mispredicted" might be moved to the new, more appropriate zone 2008-08-21 02:06 can just add more on the fly even 2008-08-21 02:06 yes 2008-08-21 02:07 or disable them altogether on flash 2008-08-21 02:07 there is also a concept of inode numbers "folding" over the volume 2008-08-21 02:07 so that two inode numbers very far apart can have allocation goals into the same physical region 2008-08-21 02:07 why do inode numbers matter 2008-08-21 02:08 the inode number determines the physical allocation goal 2008-08-21 02:08 the initial goal anyway 2008-08-21 02:08 so you set the allocation goal for a given file by choosing the inode number 2008-08-21 02:09 so that is saying your primary goal is to place it close to other files in the same directory? 2008-08-21 02:09 yes, and place the data near the inode 2008-08-21 02:09 that could be totally wrong 2008-08-21 02:09 example? 2008-08-21 02:09 maildirs 2008-08-21 02:10 directories full of files, one per message in your mailbox 2008-08-21 02:11 usually just add new files one at a time 2008-08-21 02:11 why is it wrong to place the data near the inode then? 2008-08-21 02:11 read them 1 or 2 at a time, never access them again 2008-08-21 02:11 that is, the file data near the file inode 2008-08-21 02:11 hm there must be a better.. er worse case 2008-08-21 02:11 what if you search your mailbox? 2008-08-21 02:12 "don't do that" ? 
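The idea above that the inode number picks the initial allocation goal, with inode numbers "folding" over the volume, might look roughly like the following. The log does not say how the folding is computed, so the modulo mapping and the names here are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t block_t;

/* Hypothetical mapping from inode number to initial allocation goal.
 * stride is the number of volume blocks set aside per inode number. */
static block_t inode_goal(uint64_t inum, block_t stride, block_t volume)
{
    uint64_t slots;

    assert(stride > 0 && volume >= stride);
    slots = volume / stride;          /* inode numbers that fit in one pass over the volume */
    return (inum % slots) * stride;   /* wrapping lets distant inode numbers share a region */
}
```

With this mapping, choosing an inode number is the same thing as choosing where the file's data should start, which is why a directory's files end up near each other if their inode numbers are allocated in a clump.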
2008-08-21 02:12 depends on how brain dead the mail server software is 2008-08-21 02:12 most are pretty brain dead 2008-08-21 02:13 some keep a keywords index db file because search is slow 2008-08-21 02:13 grep * with 20000 files is slow 2008-08-21 02:13 although if you do need to do that, it woud be nice not to see 2008-08-21 02:13 seek* 2008-08-21 02:14 now that would be a cool system call 2008-08-21 02:14 "search these files for this pattern" 2008-08-21 02:14 and please dont seek 2008-08-21 02:15 for that kind of grep you want to ls -U | grep foo 2008-08-21 02:15 err 2008-08-21 02:15 well like that 2008-08-21 02:15 |xargs 2008-08-21 02:15 right 2008-08-21 02:15 hrm never thought of that 2008-08-21 02:15 smrt 2008-08-21 02:16 htree will then provide the entries in hash order 2008-08-21 02:16 no better than lexical order 2008-08-21 02:16 hm 2008-08-21 02:16 but phtree will provide them in physical order 2008-08-21 02:16 things will sing 2008-08-21 02:17 that sounds really sucky of htree 2008-08-21 02:17 btrfs guys are busy inplementing the htree idea 2008-08-21 02:17 htree is very fast as most things 2008-08-21 02:17 but it's not the best solution 2008-08-21 02:17 imho 2008-08-21 02:18 htree is really good for huge volumes when nothing is in cache 2008-08-21 02:18 with the caveat that the above load will still suck 2008-08-21 02:18 no matter what allocation strategy is used the key will be benchmarking common worklodas 2008-08-21 02:18 and making sure they sing 2008-08-21 02:18 right 2008-08-21 02:18 untarring kernel trees is one of the important ones 2008-08-21 02:18 and also trying to tickle worst cases 2008-08-21 02:18 then grep the kernel tree, stuff like taht 2008-08-21 02:20 would be neat to hint allocation strategy with ioctls or something 2008-08-21 02:20 similar idea to fadvise 2008-08-21 02:20 "this is a log file" 2008-08-21 02:20 or "this file will never be more than 8k" 2008-08-21 02:21 struct root { u64 block:48, levels:8, unused:8; }; 2008-08-21 
02:21 struct btree { struct root root; u16 entries_per_leaf; }; 2008-08-21 02:21 but if fadvise is any indicator, such an interface would never get used 2008-08-21 02:21 sadly 2008-08-21 02:21 s/never/very rarely/ 2008-08-21 02:22 well we can make a ddlink interface and you can go crazy with hints 2008-08-21 02:22 see what works 2008-08-21 02:22 most important thing though is to act fairly reasonable in common loads 2008-08-21 02:23 yeah because all those great ideas go to shit if you are serving the volume over nfs 2008-08-21 02:23 and every write is sync too, that hurts 2008-08-21 02:23 heh 2008-08-21 02:23 sync has to be fast 2008-08-21 02:23 I think tux3 will have a really fast sync 2008-08-21 02:23 hammer would probably kick all ass as an nfs server 2008-08-21 02:23 because of the forward log thing 2008-08-21 02:24 quite possibly 2008-08-21 02:45 flips: see the mail on the list? 2008-08-21 02:45 ACTION looks 2008-08-21 02:46 so people are reading your messags afterall ;) 2008-08-21 02:46 :-) 2008-08-21 02:47 so hopefully it will be less of a blog in future 2008-08-21 02:47 or at least one that gets lots of comments ;) 2008-08-21 02:51 ok, time to respond 2008-08-21 02:51 just checked in a big splat change 2008-08-21 02:51 need to restructure the way args are passed to the btree methods somewhat 2008-08-21 02:51 so that leaf methods can use fields in the struct btree 2008-08-21 02:52 anyway... microchange 2008-08-21 02:52 but macro patches to do it 2008-08-21 02:53 86 members on tux3 now 2008-08-21 02:53 just passed zumastor a little while ago 2008-08-21 02:53 we need to get to a beanery the day it passes 100 2008-08-21 04:37 -!- konrad(~konrad@c-24-16-74-109.hsd1.wa.comcast.net) has joined #tux3 2008-08-21 06:07 flips: you going to the linux plumbers conf? 
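The `struct root` bit-field pasted above packs a 48-bit block number and an 8-bit level count into 64 bits. Bit-field layout is implementation-defined in C, so an on-disk format usually needs explicit shifts. A minimal sketch follows, assuming the block number sits in the low 48 bits and the level count in the next 8; that layout and the function names are assumptions, not the tux3 format.

```c
#include <assert.h>
#include <stdint.h>

/* Same fields as the struct pasted in the channel, with uint64_t for u64. */
struct root { uint64_t block:48, levels:8, unused:8; };

/* Pack into a single 64-bit value with a fixed, compiler-independent layout. */
static uint64_t pack_root(struct root root)
{
    return (uint64_t)root.levels << 48 | (uint64_t)root.block;
}

/* Inverse of pack_root; the unused field comes back as zero. */
static struct root unpack_root(uint64_t packed)
{
    struct root root = {
        .block = packed & ((1ULL << 48) - 1),
        .levels = (packed >> 48) & 0xff,
    };

    assert(pack_root(root) == (packed & ((1ULL << 56) - 1)));
    return root;
}
```

A real on-disk format would also fix the byte order of the packed value before writing it out.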
2008-08-21 06:29 pgquiles, wasn't planning on it 2008-08-21 11:54 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-21 14:23 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2008-08-21 14:29 shapor: ping 2008-08-21 14:36 wow- Tux3 finally got booted off the hottest messages list on lkml. hanging in there in the 1/2 life = 1 day list though 2008-08-21 14:58 tim_dimm: pong 2008-08-21 15:00 any response to your bind post? 2008-08-21 15:01 not on the list 2008-08-21 15:01 any privately? 2008-08-21 15:02 some guy from ISC replied, thanked me for the patch and said that it wouldn't compile on all platorms due to "compiler constructs" 2008-08-21 15:02 isn't (struct in_addr){ .s_addr = htonl(hst->ip)} ANSI C? 2008-08-21 15:05 shapor, it is C99 2008-08-21 15:05 ah i've gotten used to c99 i guess 2008-08-21 15:05 rewrite as .s_addr = htonl(hst->ip); 2008-08-21 15:06 obviously 2008-08-21 15:06 yeah 2008-08-21 15:06 boneheads over there sounds like 2008-08-21 15:06 didn't tell you the error message I bet 2008-08-21 15:06 no i had to ask what construct he was talking about 2008-08-21 15:07 they are in the business of intentially producing buggy software 2008-08-21 15:07 I am even more on the leading edge of insanity, write in gnu-c99 2008-08-21 15:07 fancy stuff 2008-08-21 15:07 the only practical difference I have noticed is, g99 has typeof 2008-08-21 15:07 it's beyond me how anybody can get by without it 2008-08-21 17:48 -!- MaZe(~MaZe@216-239-45-4.google.com) has left #tux3 2008-08-21 18:01 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2008-08-21 20:09 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3 2008-08-21 23:30 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-08-22 00:49 folks 2008-08-22 00:50 flips: or put the metadata delta in a contiguous block run 2008-08-22 00:56 delayed allocation is your best friend, the allocator is 
going to be a pain in the ass 2008-08-22 01:59 true that 2008-08-22 02:01 bh, I don't get your comment about the metadata delta 2008-08-22 02:02 hey 2008-08-22 02:02 oh maybe I'm being stupid 2008-08-22 02:03 like changes in the b-tree itself might benefit from being contiguous 2008-08-22 02:03 you mean for replication? 2008-08-22 02:04 the principle is simple: changes in metadata do not mean a thing 2008-08-22 02:04 for dealing with changes to the b-tree itself, maybe make a distinction in how metadata is written versus data 2008-08-22 02:04 it is only changes in the logical data that have to be replicated 2008-08-22 02:04 the packing for that is known so it might benefit from a special treatment of that case 2008-08-22 02:04 oh, ok, like that log things you wrote up about ? 2008-08-22 02:05 log thing? 2008-08-22 02:05 that is a way of doing atomic commit 2008-08-22 02:05 yes, it matters 2008-08-22 02:05 because the log can contain part of the logical data 2008-08-22 02:05 if it has not been rolled up into the "real" fs structure yet 2008-08-22 02:25 -!- pgquiles(~pgquiles@239.Red-83-41-113.dynamicIP.rima-tde.net) has joined #tux3 2008-08-22 03:33 ACTION is about to head to bed 2008-08-22 08:22 -!- pgquiles(~pgquiles@239.Red-83-41-113.dynamicIP.rima-tde.net) has joined #tux3 2008-08-22 10:25 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-22 10:32 -!- pgquiles(~pgquiles@239.Red-83-41-113.dynamicIP.rima-tde.net) has joined #tux3 2008-08-22 10:33 -!- konrad(~konrad@c-24-16-77-169.hsd1.wa.comcast.net) has joined #tux3 2008-08-22 11:24 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-22 11:29 ah 2008-08-22 11:30 got through a big jogjam, inode creation is now in a plausible from 2008-08-22 11:30 going to do a call for testing pretty soon 2008-08-22 11:30 ACTION thinks about maze 2008-08-22 11:31 or more properly, a call for reality check on basic algorithms 2008-08-22 11:31 testing/debugging is too easy for the great minds on this 
channel ;-) 2008-08-22 11:53 ;-) 2008-08-22 12:15 hey, there's a guy who hasn't joined our tux3 LinkedIn List 2008-08-22 12:16 got to engage buttgears 2008-08-22 12:16 http://www.linkedin.com/e/gis/154012 2008-08-22 12:16 buttgears 2008-08-22 12:16 that's a new one to me 2008-08-22 12:17 is that like toosh_drive? 2008-08-22 12:17 geekified version of got to get ass in gear 2008-08-22 12:38 hey 2008-08-22 12:38 hi bh 2008-08-22 22:33 -!- konrad(~konrad@c-24-16-74-109.hsd1.wa.comcast.net) has joined #tux3 2008-08-23 00:48 flips: you there >? 2008-08-23 00:48 I think you need to put compression "blobs" in tux3 from the get go 2008-08-23 00:48 so that you can dedup things 2008-08-23 00:48 hi bh 2008-08-23 00:48 I think it's rather important to do that kind of thing because of the various enterprise uses of that 2008-08-23 00:49 potentially by folks like facebook and stuff. 2008-08-23 00:49 can I translate that as just have everything blob-ready? 2008-08-23 00:49 because it already is, kind of 2008-08-23 00:49 having a kind type file specific metadata for large linearly seeked files is another thing that's minor, but valuable as well 2008-08-23 00:50 you mean, with a radix tree instead of a btree? 2008-08-23 00:50 extents make large linear files quite nice 2008-08-23 00:50 flips: you might like to consider and experiment with various sizes of compression blobs to see how efficient the compression and storage is, then store the sha1 hash to "union" common file segments 2008-08-23 00:50 I think that's really important IMO 2008-08-23 00:51 wait, I'm merging two things into one 2008-08-23 00:51 forget compression, replace that completely with deduping 2008-08-23 00:51 maybe you can then extend that to compresssion as well using the same blob infrastructure 2008-08-23 00:51 how about deduping at the lvm level? 2008-08-23 00:51 no, at the file level 2008-08-23 00:51 why does deduping ahve to involve the filesystem? 
2008-08-23 00:52 file fragment level so that things like .jpgs and stuff can save on storage 2008-08-23 00:52 because I can't see how that can be done at the RAID level 2008-08-23 00:53 it's about how a file is represented as a piece of data in the fs 2008-08-23 00:53 that would be valuable for a lot of enterprise ready folks that use those kind of filers, you don't want to just add it on later and do a half as job at it 2008-08-23 00:53 use an extent-based interface to the lvm 2008-08-23 00:54 it is coming inevitably anyway 2008-08-23 00:54 how's that going to solve the problem ? 2008-08-23 00:54 it gives you variable length data 2008-08-23 00:54 well 2008-08-23 00:54 dedupping 2008-08-23 00:54 just not sure how that fits in, you mean extent per file segment or something like that ? 2008-08-23 00:54 I was thinking of the other 2008-08-23 00:54 ACTION is prepping for Burning Man tonight 2008-08-23 00:54 ok, so you want to have the filesystem reference data with finer granularity than blocks? 2008-08-23 00:55 so that you can do real micro-idenfication of similar data? 2008-08-23 00:55 no, large than blocks, with the magic size of a "blob" 2008-08-23 00:55 what is the difference between a blob and an extent/ 2008-08-23 00:56 so that you can do an incremental back up or something like that for, say, a 32k commit/write and have that be backed already by another segment existing on the disk 2008-08-23 00:56 there's a tradeoff between the metadata size and the savings of space. 
2008-08-23 00:56 I'm just making up the term "blob" for a compression cluster of blocks 2008-08-23 00:56 contiguous in a file 2008-08-23 00:57 replication is a decent argument 2008-08-23 00:57 because the filesystem should replicate at the filesystem level 2008-08-23 00:57 this is heavy enough that it might benefit from scoping out which parts of the file system do it, say, by volume or specific directory 2008-08-23 00:57 a filesystem like this one anyway 2008-08-23 00:58 flips: I'm trying to communicate this to you because I think it's critically important' 2008-08-23 00:58 you will succeed in getting me thinking about it 2008-08-23 00:58 the fact's a file segment or cluster shouldn't matter since they might be able to use the same generic framework 2008-08-23 00:58 encryption would also fall into this category 2008-08-23 00:58 having multiple pointers pointing at the same blob is currently an alien concept to tux3 2008-08-23 00:59 how do you know when the blob can be released? 2008-08-23 00:59 well, give some thought and maybe you'll think it's important enough to change it 2008-08-23 00:59 not sure, good question 2008-08-23 00:59 I'd imagine it would be similar to the hard link problem 2008-08-23 00:59 maybe there is a good answer 2008-08-23 00:59 ref counting is not nice 2008-08-23 01:00 you have to have all those counts persistent 2008-08-23 01:00 yeah, well, it's better to do stuff like this up front if you decided you need it 2008-08-23 01:01 well I suppose refcounts could be done like extents 2008-08-23 01:01 and you only incur the overhead if using deduping 2008-08-23 01:01 which you expect to go slower I would hope 2008-08-23 01:02 pluggable btree leaf formats as tux3 has now gives you all the blob referencing machinery you need 2008-08-23 01:04 bh: so you're saying file-level checksumming for deduping? 2008-08-23 01:08 or.. 
a blob would be larger than an extent and smaller than a whole file 2008-08-23 03:29 zfs has dnodes and znodes, I wonder what those are 2008-08-23 03:34 the problem with "de-duping" with extents will be alignment 2008-08-23 03:34 yes, you have to include a header in the blob 2008-08-23 03:34 but alignment is not nearly as serious an issue with extent based filesystem as block based 2008-08-23 03:34 well even if you store it in the metadata 2008-08-23 03:35 if 2 files have a common chunk, it needs to be broken up in to extents the same way in both 2008-08-23 03:35 even if the files are identical 2008-08-23 03:35 the approximate size would be in the metadata, exact size in the blob 2008-08-23 03:35 they could be broken up in to extents different 2008-08-23 03:35 differently* 2008-08-23 03:36 the object is just to identify common blocks? 2008-08-23 03:36 and not arbitrary regions? that can also be done 2008-08-23 03:36 yeah, single instance storage, right? 2008-08-23 03:36 how do you decide where to draw the line? 2008-08-23 03:36 i think that is what bh was suggesting 2008-08-23 03:37 there is a vanishingly small chance it will get into the prototype implementation 2008-08-23 03:38 therefore the object of the exercise must be to see how we could be blob friendly 2008-08-23 03:38 i think the only reasonable way to do it is in the background 2008-08-23 03:38 and sharing between files 2008-08-23 03:38 not take a lot of decisions that would make it hard to do bloby things 2008-08-23 03:38 ugly code 2008-08-23 03:38 if sharing can be supported a background thing could be added on 2008-08-23 03:38 it will be very ugly 2008-08-23 03:39 will it really? 2008-08-23 03:39 another layer of indirection? 
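The dedup discussion above raises two questions: how to find an identical blob, and how to know when a shared blob can be released. A toy in-memory sketch of the hash plus reference count approach is below. It assumes fixed-size blobs cut at the same boundaries in every file (the alignment issue noted above), uses FNV-1a as a stand-in for the SHA-1 hash suggested in the channel, and keeps counts in memory only, whereas the log points out that real counts would have to be persistent. All names and sizes are made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { BLOB_SIZE = 16, MAX_BLOBS = 64 };   /* toy sizes for illustration */

struct blob {
    uint64_t hash;
    unsigned refs;                  /* 0 means the slot is free */
    unsigned char data[BLOB_SIZE];
};

struct blob_store { struct blob slot[MAX_BLOBS]; };

/* 64-bit FNV-1a, standing in for a cryptographic hash. */
static uint64_t fnv1a(const unsigned char *data, size_t len)
{
    uint64_t hash = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        hash ^= data[i];
        hash *= 0x100000001b3ULL;
    }
    return hash;
}

/* Store one blob, sharing an identical existing one. Returns the slot, or -1 if full. */
static int blob_put(struct blob_store *store, const unsigned char *data)
{
    uint64_t hash = fnv1a(data, BLOB_SIZE);
    int free_slot = -1;

    for (int i = 0; i < MAX_BLOBS; i++) {
        struct blob *blob = &store->slot[i];
        if (!blob->refs) {
            if (free_slot < 0)
                free_slot = i;       /* remember the first free slot */
        } else if (blob->hash == hash && !memcmp(blob->data, data, BLOB_SIZE)) {
            blob->refs++;            /* identical content: share it */
            return i;
        }
    }
    if (free_slot < 0)
        return -1;
    store->slot[free_slot].hash = hash;
    store->slot[free_slot].refs = 1;
    memcpy(store->slot[free_slot].data, data, BLOB_SIZE);
    return free_slot;
}

/* Drop one reference. Returns the references left; 0 means the slot is free again. */
static unsigned blob_release(struct blob_store *store, int slot)
{
    assert(slot >= 0 && slot < MAX_BLOBS && store->slot[slot].refs > 0);
    return --store->slot[slot].refs;
}
```

The byte comparison after the hash match guards against collisions; the linear scan would be replaced by an index keyed on the hash in anything beyond a toy.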
2008-08-23 03:39 heh 2008-08-23 03:39 doing it in the lvm would be cleaner, provided an extent based interface is available 2008-08-23 03:39 to the lvm 2008-08-23 03:39 that would be pretty neat 2008-08-23 03:39 we sort of have something like that already, namely bio 2008-08-23 03:40 work with all filesystems 2008-08-23 03:41 and you could only enable it on the "slow" data device 2008-08-23 03:41 not the fast metadata one 2008-08-23 03:41 right 2008-08-23 03:41 I'd like to see a killer argument why it has to be done in the filesystem 2008-08-23 03:41 can you change the bio size? 2008-08-23 03:42 yes 2008-08-23 03:42 especially with my stacking patch 2008-08-23 03:42 so the device will only accept 4k bios 2008-08-23 03:42 bio size can even be changed on the fly 2008-08-23 03:42 while the bio is in flight 2008-08-23 03:42 bios are pretty loosey goosey 2008-08-23 03:54 shapor, a problem with your ownership inheritance suggestion: a file does not know what directory it is in. 2008-08-23 03:54 so can't inherit anything from the directory 2008-08-23 03:54 it can however inherit from its inode table block 2008-08-23 03:54 which gives the same effect 2008-08-23 03:54 hrm 2008-08-23 03:56 "Shapor has suggested that there be per-directory default uid, gid and 2008-08-23 03:56 mode attributes, so any file with exactly those attributes does not have 2008-08-23 03:56 to represent ownership at all, but inherits it from the inode table 2008-08-23 03:56 block it lives in. Allocation policy will be such that its neighbours 2008-08-23 03:56 are likely to have identical ownership." 
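The quoted proposal, where an inode stores ownership only when it differs from a default held in its inode table block, can be sketched as below. The struct layouts and names are hypothetical; tux3's variable inode attributes are not shown in the log, so a simple flag stands in for "the ownership attribute is present".

```c
#include <assert.h>
#include <stdint.h>

struct owner { uint32_t uid, gid; uint16_t mode; };

/* Hypothetical inode table block header carrying the shared defaults. */
struct itable_block { struct owner defaults; };

/* Hypothetical inode: ownership is stored only when has_owner is set. */
struct toy_inode {
    int has_owner;
    struct owner owner;
};

/* Effective ownership: the inode's own, or else the block defaults. */
static struct owner inode_owner(const struct itable_block *block, const struct toy_inode *inode)
{
    assert(block && inode);
    return inode->has_owner ? inode->owner : block->defaults;
}

/* Record ownership, dropping the attribute when it matches the defaults. */
static void set_owner(const struct itable_block *block, struct toy_inode *inode, struct owner owner)
{
    const struct owner *def = &block->defaults;

    inode->has_owner = owner.uid != def->uid || owner.gid != def->gid || owner.mode != def->mode;
    if (inode->has_owner)
        inode->owner = owner;
}
```

The saving depends on the allocation policy mentioned in the quote: it only pays off if inodes in the same table block usually share an owner.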
2008-08-23 03:57 permissions also 2008-08-23 03:57 ah "mode attributes" 2008-08-23 03:57 glazed over that 2008-08-23 03:57 right 2008-08-23 03:58 the idea could be extended to acls with some effort 2008-08-23 04:09 sleepy time 2008-08-23 04:19 had to play the new star wars ps3 demo, it's awesome 2008-08-23 04:19 advances the state of the art of that kind of game 2008-08-23 04:19 unlike the recent movies 2008-08-23 04:53 -!- pgquiles(~pgquiles@239.Red-83-41-113.dynamicIP.rima-tde.net) has joined #tux3 2008-08-23 12:19 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-08-23 13:01 so how big will the tux3 superblock be 2008-08-23 13:01 traditional 4k? 2008-08-23 13:01 smaller I think 2008-08-23 13:02 maybe 1K or 512 bytes 2008-08-23 13:02 a tux3 filesystem will never be smaller than 4K, but not the entire 4K needs to be valid superblock data 2008-08-23 13:02 so it can be probed by reading 4K to find the superblock magic and block size 2008-08-23 13:02 does variable size make sense? 2008-08-23 13:03 variable size superblock? what would you do with the extra space on large block size? 2008-08-23 13:03 I think, just let it be 2008-08-23 13:03 text description of the filesystem? heh i dunno 2008-08-23 13:03 bitmaps of the devs 2008-08-23 13:04 pi to as many digits as will fit? 2008-08-23 13:04 the bigger the block size the more detailed pix you have of the devs 2008-08-23 13:04 hah, we don't want to scare people 2008-08-23 13:04 no, especially not topless rollerskating pictures of the devs 2008-08-23 13:04 at any resolution 2008-08-23 13:04 yikes 2008-08-23 13:04 speaking of which... 2008-08-23 13:05 isn't there something happening in venice today?
2008-08-23 13:06 starts at noon 2008-08-23 13:06 time to get a coffee in me and get skates on 2008-08-23 13:28 hey flips 2008-08-23 13:28 ACTION fell asleep last night fairly suddenly 2008-08-23 13:30 flips: shapor yeah, that's kind of what I meant, other than, maybe a mechanism that can dedup it at the userspace level 2008-08-23 13:30 maybe it's not the role of the FS to do that, just a suggestion 2008-08-23 13:30 but it shouldn't complicate the prototype imo because getting it out is more important 2008-08-23 13:33 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-08-23 13:38 bh, see if you can find a killer argument why the fs has to do it 2008-08-23 13:38 as opposed to the lvm 2008-08-23 13:38 assuming an extent-aware lvm 2008-08-23 13:39 (note for lvm3 design) 2008-08-23 14:57 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-23 14:58 hey tim_dimm 2008-08-23 14:58 hey 2008-08-23 14:58 coding away I would assume 2008-08-23 14:58 yes 2008-08-23 14:58 I'll roll out for a skate sometime 2008-08-23 14:58 I'm doing house duty today 2008-08-23 15:17 btree.c:451: error: 'typeof' applied to a bit-field <- lame :p 2008-08-23 15:17 lazy gcc devs 2008-08-23 15:17 lazy & ugly 2008-08-23 15:46 g99 is too advanced 2008-08-23 16:29 [11882] tuxread: read 0/c 2008-08-23 16:29 got 12 bytes 2008-08-23 16:29 0xbf93990c: 68 65 6c 6c 6f 20 77 6f 72 6c 64 21 "hello world!" 
2008-08-23 16:47 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-23 16:49 flips: all tests compile (without warnings) and run on 64 bit 2008-08-23 16:51 :-) 2008-08-23 16:51 I tried to fill in the (L)s where needed 2008-08-23 16:53 sk8 oclock 2008-08-23 17:28 ah, getting close anyway 2008-08-23 17:28 now should I skate first or have another coffee first 2008-08-23 17:28 leaning towards the latter 2008-08-23 17:28 makes the skate more interesting 2008-08-23 17:30 make it an irish coffee 2008-08-23 17:30 and see how far you can make it before you fall 2008-08-23 17:30 not without moral support 2008-08-23 17:30 get joelle down for one ;-) 2008-08-23 17:31 you married folks are too boring 2008-08-23 17:32 oh, I misspoke in my latest tux3 post 2008-08-23 17:32 hah, not quite 2008-08-23 17:32 I can make a 1EB file no problem, even if bitmaps can't map 1EB 2008-08-23 17:33 the block doesn't have to be high up at all 2008-08-23 17:33 but I will put it up high anyway 2008-08-23 17:33 1EB/2 - 1 2008-08-23 17:33 just set the allocation goal at 1EB/2 - 100 or so 2008-08-23 17:34 the whole fs will be allocated way up there 2008-08-23 17:34 possibly turning up some unhandled boundary conditions ;-) 2008-08-23 17:36 why? 2008-08-23 17:37 because it happens 2008-08-23 17:37 on occasion 2008-08-23 17:37 wait 2008-08-23 17:37 most of the big changes I made over the last couple weeks turned up something unhandled 2008-08-23 17:37 how are you going to allocate blocks at 1eb/2 ? 2008-08-23 17:37 you have a device that big? 2008-08-23 17:37 or going to use sparse file as your device? 2008-08-23 17:37 use a sparse file for the volume 2008-08-23 17:38 well 2008-08-23 17:38 what fs supports that big a spare file? 2008-08-23 17:38 you're right 2008-08-23 17:38 can't be quite that high 2008-08-23 17:38 16 TB 2008-08-23 17:38 pretty far off ;) 2008-08-23 17:38 your 64 bit system should be able to do 1 EB / 2 2008-08-23 17:38 on ext3? 
2008-08-23 17:38 I'lll make it dependent on sizeof(int) so that gets tested 2008-08-23 17:38 yes 2008-08-23 17:39 let me see 2008-08-23 17:39 or now 2008-08-23 17:39 :) 2008-08-23 17:39 or no 2008-08-23 17:39 yeah i'm checking 2008-08-23 17:39 ext4 2008-08-23 17:39 ext4? 2008-08-23 17:39 good chance to eval ext4 2008-08-23 17:39 ext4 of course 2008-08-23 17:39 doubles the size of all pointers etc 2008-08-23 17:42 gah, I have written tuxopen in the wrong order 2008-08-23 17:42 need to create the inode before creating the dirent 2008-08-23 17:42 blush 2008-08-23 17:43 dd: truncating at 1125899906842624 bytes in output file `t': File too large 2008-08-23 17:43 on ext3 2008-08-23 17:43 want to try 16tb - 1 while you're at it? 2008-08-23 17:43 then 16TB even? 2008-08-23 17:44 1tb is ok 2008-08-23 17:44 2tb is not 2008-08-23 17:44 1 tb + 1? 2008-08-23 17:45 ah, there is some other lame limitaion in ext3 2008-08-23 17:45 the horrors are slowly coming back to me 2008-08-23 17:45 2tb - 1 seems to be the limit 2008-08-23 17:45 something about signed offsets 2008-08-23 17:46 the highest signed offset 2008-08-23 17:46 which is lame 2008-08-23 17:46 er maybe exactly 2t 2008-08-23 17:46 I would hope exactly 2008-08-23 17:46 since i'm using seek 2008-08-23 17:47 using 64 bit seek I suppose 2008-08-23 17:47 because of 64 bit system 2008-08-23 17:47 hrmwell i'm doing bs=1M 2008-08-23 17:47 skip=2M 2008-08-23 17:47 er seek=2M 2008-08-23 17:47 http://en.wikipedia.org/wiki/Ext3 2008-08-23 17:48 maybe dd is mathing it out 2008-08-23 17:48 Max file size 2 TiB 2008-08-23 17:48 now what is the boneheaded limit 2008-08-23 17:48 shoudl be 16 TB, the size that the page cache can handle 2008-08-23 17:48 ftruncate(1, 2199023255552) = -1 EFBIG (File too large) 2008-08-23 17:49 somebody lied? 2008-08-23 17:49 or do they not mean by limit what they think they mean? 
2008-08-23 17:50 Linux yzf.shapor.com 2.6.18-6-amd64 #1 SMP Mon Jun 16 22:30:01 UTC 2008 x86_64 GNU/Linux 2008-08-23 17:51 from man ftruncate 2008-08-23 17:51 EFBIG The argument length is larger than the maximum file size. (XSI) 2008-08-23 17:51 the 2 TB limit comes from the structure of the ufs-style index, maybe 2008-08-23 17:51 oh might have to do with block size 2008-08-23 17:51 it does 2008-08-23 17:51 but ext3 is pretty much always 4K 2008-08-23 17:52 except if you make a fs on a floppy 2008-08-23 17:52 so many it onyl supports 16tb if you make it larger than 4k 2008-08-23 17:52 anyway, Tux3 will exactly hit its limits, not be off by one 2008-08-23 17:53 branching factor is 2^10 for ext2/3 index block 2008-08-23 17:53 Block size: 4096 2008-08-23 17:54 triple indirect is 10 + 10 + 10 + 12 bits 2008-08-23 17:54 42 bits 2008-08-23 17:54 add in some braindamage for signedness, and maybe that is the limit 2008-08-23 17:54 maybe not 2008-08-23 17:55 albert cahahan has a post from a few years back 2008-08-23 17:55 treasts the question accurately 2008-08-23 17:55 hrm ext3 limit wasn't alwaus 16t 2008-08-23 17:55 unlike me at the moment ;-) 2008-08-23 17:55 i see some discussions of people trying to get it to work 2008-08-23 17:55 back in '06 2008-08-23 17:55 and my kernel is pretty old 2008-08-23 17:56 hrm no that was fs size 2008-08-23 17:56 not file size 2008-08-23 17:57 flips: you suck at reading comprehension 2008-08-23 17:57 from wikipedia 2008-08-23 17:57 Max file size 2 TiB 2008-08-23 17:57 http://lwn.net/Articles/91731/ 2008-08-23 17:58 I pasted taht above 2008-08-23 17:58 oh 2008-08-23 17:58 i mean i suck at reading 2008-08-23 17:58 heh 2008-08-23 17:58 comprehension 2008-08-23 17:59 oh right 2008-08-23 17:59 it is about measuring blocks in sectors 2008-08-23 17:59 blah 2008-08-23 17:59 bleah 2008-08-23 17:59 hrm maybe on tmpfs 2008-08-23 17:59 it's ok, I don't need the underlying volume that big 2008-08-23 17:59 well it would be nice to test 2008-08-23 
18:00 the handling of the bitmaps 2008-08-23 18:00 maybe xfs? 2008-08-23 18:00 Max file size 8 exabytes 2008-08-23 18:00 hrm nope, tmpfs fail as well 2008-08-23 18:00 what is this - 1 byte bs? 2008-08-23 18:01 cant be signed? 2008-08-23 18:01 that would be ... retarded 2008-08-23 18:01 that's not it 2008-08-23 18:01 it's just less by one byte 2008-08-23 18:01 nonsensicle 2008-08-23 18:01 nonsensical 2008-08-23 18:03 limit on tmpfs also seems to be 2 TB - 1 2008-08-23 18:05 no 2008-08-23 18:05 whoops i was in the wrong dir 2008-08-23 18:07 tmpfs is actually a bit over 256G 2008-08-23 18:07 I wonder what the limit is there 2008-08-23 18:07 swapper most likely 2008-08-23 18:07 what about ramfs? 2008-08-23 18:09 http://lkml.org/lkml/2004/1/30/101 2008-08-23 18:09 something related to total memory size 2008-08-23 18:10 suggested workaround of echo 1 >/proc/sys/vm/overcommit_memory 2008-08-23 18:10 didn't help 2008-08-23 18:12 wow ramfs is the ticket 2008-08-23 18:13 -rw-r--r-- 1 shapor shapor 8.0E 2008-08-23 18:13 t 2008-08-23 18:13 ramfs isn't phased at 8EB even 2008-08-23 18:13 dd runs out of offset first ;) 2008-08-23 18:14 dd: offset too large: cannot truncate to a length of seek=8808038400000 (1048576-byte) blocks 2008-08-23 18:15 I put a shot of kahlua in my coffee just for you 2008-08-23 18:15 don't know where anna stashed the wiskey or would have done it properly 2008-08-23 18:15 she's always one step ahead of me ;-) 2008-08-23 18:16 so we use ramfs for testing? 2008-08-23 18:16 yes 2008-08-23 18:16 good sleuthing 2008-08-23 18:17 ramfs on 64 bit 2008-08-23 18:18 because ramfs on 32 bit maxes out at 2^44 2008-08-23 18:18 16 TB 2008-08-23 18:18 due to the page cache index 2008-08-23 18:19 both for volumes and files 2008-08-23 18:20 did btrfs prototype in userspcae first too? 2008-08-23 18:20 anyway, I will work with the ext3 limit for the first big file test. 
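The size limits that came up while probing can be checked with a little arithmetic. A minimal sketch, assuming 4-byte block pointers, 512-byte sectors and a 32-bit sector count for the ext3 case; the function names are made up for illustration and do not come from any kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes addressable through a page cache index of index_bits bits
 * when each page holds 2^page_bits bytes. */
uint64_t page_cache_limit(unsigned index_bits, unsigned page_bits)
{
    assert(index_bits + page_bits < 64);
    return (uint64_t)1 << (index_bits + page_bits);
}

/* Bytes mapped by a triple-indirect tree: three levels of index blocks,
 * each holding 2^(block_bits - ptr_bits) pointers, over 2^block_bits byte blocks. */
uint64_t triple_indirect_span(unsigned block_bits, unsigned ptr_bits)
{
    unsigned fanout_bits = block_bits - ptr_bits;   /* 12 - 2 = 10 for 4K blocks */
    assert(3 * fanout_bits + block_bits < 64);
    return (uint64_t)1 << (3 * fanout_bits + block_bits);
}

/* Bytes representable when file size is tracked as a count of
 * 512-byte sectors held in a counter of count_bits bits. */
uint64_t sector_count_limit(unsigned count_bits)
{
    assert(count_bits + 9 < 64);
    return (uint64_t)1 << (count_bits + 9);
}
```

With 4K blocks the triple-indirect tree spans 2^42 bytes, so under these assumptions the 2 TB ceiling seen with ftruncate lines up with the sector count (2^32 * 512 = 2199023255552) rather than with the index structure, and the 32-bit page cache index gives 2^44 bytes, the 16 TB figure mentioned for ramfs.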
We can still create a 1EB file in tux3 whatever the physical volume size 2008-08-23 18:20 I doubt it 2008-08-23 18:21 cut & paste of something most likely 2008-08-23 18:21 well 2008-08-23 18:21 don't have any clue 2008-08-23 18:21 tux3 is a cut n paste of ddsnap in part 2008-08-23 18:31 ok, what do we do if during a file create the inode creation and allocation succeeds but the dirent creation fails? 2008-08-23 18:32 probably better consult good old ext2 for guidance 2008-08-23 18:44 roll back? 2008-08-23 18:45 to most recent snapshot? 2008-08-23 18:45 no jsut give back the inode 2008-08-23 18:45 seeing as the changes only happened in a buffer that hasn't been committed yet, that is practical 2008-08-23 18:45 and the right thing to do 2008-08-23 18:45 just invalidate the buffer, done 2008-08-23 18:45 even if it has 2008-08-23 18:45 why does it matter 2008-08-23 18:45 orphan inode 2008-08-23 18:46 invalidating the buffer is right, and really nice 2008-08-23 18:46 it's a true rollback 2008-08-23 18:46 thanks 2008-08-23 18:47 why would dirent creation fail, io error or something? 2008-08-23 18:49 out of memory 2008-08-23 18:49 out of disk space 2008-08-23 18:49 udev fucked up? 
2008-08-23 18:50 who knows 2008-08-23 18:50 yes and io error 2008-08-23 18:50 bad sector 2008-08-23 18:51 uncorrectable ecc error they call it these days 2008-08-23 18:51 or a cascading failure resulting from that, or bad cpu memory 2008-08-23 18:51 the chance of not corrupting any data when that happens seems low 2008-08-23 18:52 in every case the right thing is to invalidate the buffer 2008-08-23 18:52 yes 2008-08-23 18:52 ext2/3 is very good about no corrupting disk in cases like that 2008-08-23 18:53 that's partly why it's still our standard fs even with much sexier things about 2008-08-23 18:53 yeah 2008-08-23 18:53 true 2008-08-23 18:53 something tells me ZFS is 5 years away from that 2008-08-23 18:53 at least 2008-08-23 18:54 even with the hyped checksumming 2008-08-23 18:54 doesn't save you from a fs bug 2008-08-23 18:54 or memory errors 2008-08-23 18:54 the two most common causes of corruption 2008-08-23 18:54 which seems infinitely more likely than a non detected io corruption 2008-08-23 18:54 right 2008-08-23 18:55 even when the data center i ran hit 115 degrees ambient with crappy old ide drives 2008-08-23 18:55 we didnt see any of that 2008-08-23 18:55 some drives did die 2008-08-23 18:56 but it is obvious 2008-08-23 18:56 io errors up the ass 2008-08-23 18:56 ecc failures is mostly a rather successfull marketing ploy from vendors 2008-08-23 18:56 or ecc uncaught failures I meant 2008-08-23 18:57 ACTION gone skating 2008-08-23 18:57 reminds me of kmfdm lyrics about their music 2008-08-23 18:57 "its made my machines cause they dont make mistakes" 2008-08-23 18:57 heh 2008-08-23 18:58 however maybe all this hyped checksumming will convince hardware vendors to remove error checking, so we should still support it eventually 2008-08-23 19:00 it's already supported at replication time 2008-08-23 19:00 more support you mean 2008-08-23 19:25 yeah, what if you're not replicating 2008-08-23 19:45 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has 
joined #tux3 2008-08-23 21:41 shapor, then replicate ;-) 2008-08-23 21:41 another question is, what if you're not willing to put aside the space to keep a snapshot to checksum against 2008-08-23 21:42 answer to that may be: only keep the checksums. But for a snapshot, see? 2008-08-23 21:43 imho, checksuming every read is braindamage unless you have hardware sitting idle 2008-08-23 21:43 in which case you spent too much money on your box, probably 2008-08-23 21:43 now... 2008-08-23 22:31 -!- konrad(~konrad@c-24-16-74-109.hsd1.wa.comcast.net) has joined #tux3 2008-08-23 22:31 -!- flips(~phillips@phunq.net) has joined #tux3 2008-08-23 22:31 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-08-24 02:41 88 interested observers on the mailing list 2008-08-24 02:42 needs to translate into more commentary 2008-08-24 04:17 -!- pgquiles(~pgquiles@239.Red-83-41-113.dynamicIP.rima-tde.net) has joined #tux3 2008-08-24 04:21 flips: is lvm3 extent aware ? 2008-08-24 04:21 lvm4 is 2008-08-24 04:22 um 2008-08-24 04:22 sorry! 2008-08-24 04:22 it is just a block device 2008-08-24 04:22 hey, nice to see that you're awake now 2008-08-24 04:22 that means you throw bio structs at it 2008-08-24 04:22 ACTION just finished clubbing around San Francisco 2008-08-24 04:22 I sped thorugh LA today to get up there 2008-08-24 04:22 here 2008-08-24 04:22 ACTION got back from death race 2000 not too long ago 2008-08-24 04:23 I would have stopped if I had time, beside we saw each other not that long ago 2008-08-24 04:23 what is that ? 2008-08-24 04:23 bios are kinda like extents 2008-08-24 04:23 beyond that, lvm knows nothing about it 2008-08-24 04:23 so does that mean that the sha1 hash can be done a per extent basis in lvm4 ? 2008-08-24 04:23 it has a concept called extents 2008-08-24 04:23 but it isn't extents 2008-08-24 04:23 it is just fixed size multiples of the lvm allocation unit 2008-08-24 04:23 what the purpose of putting that in the raid layer ? 
2008-08-24 04:24 there is no lvm4, I mispoke 2008-08-24 04:24 sha1 is a stupid hash to use 2008-08-24 04:24 it is ridiculously expensive to compute 2008-08-24 04:24 we're not doing crypto 2008-08-24 04:25 oh 2008-08-24 04:25 but you were thinking about hashed storage 2008-08-24 04:25 that is, content addressed storage 2008-08-24 04:25 sha1 is still pretty extravagant 2008-08-24 04:26 anyway, why the raid layer vs below it? 2008-08-24 04:26 don't know 2008-08-24 04:26 you have to reassemble your raid bits to do your content hash on them 2008-08-24 04:26 well, what else could be used ? 2008-08-24 04:26 depending on the properties of the hash it might be hard to do otherwise 2008-08-24 04:26 xor 2008-08-24 04:26 check out dx_hack_hash 2008-08-24 04:27 it performs well and is cheap to compute 2008-08-24 04:27 performs => distributes evenly 2008-08-24 04:27 sha1 is better, but not so much better as to be worth the cpu load 2008-08-24 04:28 it is also a much wider hash 2008-08-24 04:28 you can't use a 32 bit hash for this without a collision scheme 2008-08-24 04:30 well, what did you think about getting some kind of generic blob support to universally represetn chnk of data in a file so that you can avoid replicating it ? 2008-08-24 04:30 it would apply to uncompressed as well as compress storage 2008-08-24 04:30 http://en.wikipedia.org/wiki/Content-addressable_storage 2008-08-24 04:30 I wouldn't expect a radi layer to be a aware of that stuff 2008-08-24 04:31 ACTION can't chat much longer 2008-08-24 04:31 kay, I can't either 2008-08-24 04:31 got to consider the sleep thing 2008-08-24 04:32 interesting 2008-08-24 04:32 some kind of cheap hash that results in a protocol exchange between upstream and downstream to see if the blocks are really identical would be useful 2008-08-24 04:33 last bit, think about cdta localit and online disk checking. 2008-08-24 04:33 internet is dying 2008-08-24 04:33 night 2008-08-24 04:33 cdta? 
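The "cheap hash, then confirm the blocks are really identical" approach can be sketched as follows. This is an illustration only: it uses 64-bit FNV-1a as a stand-in cheap hash, not dx_hack_hash and not anything from tux3 or lvm, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-bit FNV-1a: a cheap, non-cryptographic hash used here as a stand-in. */
uint64_t fnv1a64(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;     /* FNV offset basis */
    assert(data || len == 0);
    while (len--) {
        h ^= *p++;
        h *= 0x100000001b3ULL;              /* FNV prime */
    }
    return h;
}

/* Treat two blocks as duplicates only if the hashes match and a
 * byte-for-byte comparison confirms it, so a collision cannot merge
 * different data. */
bool same_block(const void *a, const void *b, size_t len)
{
    if (fnv1a64(a, len) != fnv1a64(b, len))
        return false;                       /* different hash: certainly different */
    return memcmp(a, b, len) == 0;          /* same hash: verify the contents */
}
```

In a real deduplicator the hashes would be stored and looked up rather than recomputed for both sides; the point of the sketch is only that a weak hash is a filter, and the comparison is what makes the decision safe.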
2008-08-24 04:33 ah 2008-08-24 04:33 data locality 2008-08-24 04:33 sure 2008-08-24 04:34 been thinking indeed 2008-08-24 04:34 see the log 2008-08-24 04:34 for performance reasons, they should be considered together 2008-08-24 04:34 how far up ? 2008-08-24 04:34 is anything I say useful ? 2008-08-24 04:35 yes 2008-08-24 04:35 very 2008-08-24 04:35 check it out later 2008-08-24 04:35 shapor and me 2008-08-24 04:35 conclusion is that storing a thing that is kind of like a snapshot but does not have the actual snapshot data, just hashes of it 2008-08-24 04:36 would be useful for checking 2008-08-24 04:36 and efficient 2008-08-24 04:36 vs the stupid braindamage that zfs has popularized 2008-08-24 04:36 anything you say? 2008-08-24 04:36 yes 2008-08-24 04:37 inspired the hash snap idea 2008-08-24 05:56 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-24 12:47 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-24 14:07 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-24 14:36 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-24 15:18 iattr.c decoded an attribute list 2008-08-24 15:18 now an encoder 2008-08-24 15:18 or maybe I should check it in as is 2008-08-24 15:19 make shapor's eyes bleed 2008-08-24 15:21 check it in 2008-08-24 15:22 should be more exciting than just "return 0" 2008-08-24 15:22 it is 2008-08-24 15:22 it's real gouge your eyes out stuff 2008-08-24 15:23 strigi dox suck 2008-08-24 15:23 :-( 2008-08-24 15:24 pgquiles, just put out a call for tech writers 2008-08-24 15:24 flips: problem is tech writers would first need to understand what each class, method, etc do, which is the difficult part of these docs 2008-08-24 15:25 I'm not even sure which classes are internal and which ones are intended for use by applications! 
2008-08-24 15:25 g99 -g -Wall iattr.c && ./a.out 2008-08-24 15:25 block = 1234, depth = 1 2008-08-24 15:25 I guess I will make it decode more than one attr before checking in 2008-08-24 15:25 pgquiles, decent tech writers understand that stuff 2008-08-24 15:26 jon corbet for example 2008-08-24 15:26 but he doesn't work for free... always 2008-08-24 15:42 shapor, iattr.c skeleton decoder is in 2008-08-24 15:42 next, skeleton encoder 2008-08-24 15:42 then really use both 2008-08-24 15:44 ACTION considers the wisdom of changing his skate wheels 2008-08-24 15:44 http://www.officemax.com/omax/catalog/sku.jsp?skuId=21607263 2008-08-24 15:44 nifty 2008-08-24 15:48 sounds big 2008-08-24 16:10 getting close to sk8 oclock 2008-08-24 17:54 getting really close to sk8 oclock 2008-08-24 18:38 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-24 21:50 ACTION has a relatively solid internet connection now 2008-08-24 22:59 flips: pull from me, some minor fixes 2008-08-24 23:00 hopefully not to iattr.c 2008-08-24 23:00 no 2008-08-24 23:00 kay 2008-08-24 23:00 now how do I do that again 2008-08-24 23:00 forgot to write down te prescription 2008-08-24 23:00 oh 2008-08-24 23:00 it was hard 2008-08-24 23:00 you have a bunch of heads 2008-08-24 23:00 and when I pulled I got them all 2008-08-24 23:01 hg pull static-http://shapor.com/tux3/shapor-tux3 2008-08-24 23:01 was a major pain to get rid of them 2008-08-24 23:01 right 2008-08-24 23:01 hrm 2008-08-24 23:01 but not just that 2008-08-24 23:01 best is to clone, then pull into that, then selectively pull from the local copy 2008-08-24 23:01 i should get rid of the heads first 2008-08-24 23:01 well there are probably better ways 2008-08-24 23:01 hm 2008-08-24 23:01 right 2008-08-24 23:02 make a clean repo to pull from 2008-08-24 23:02 that's what the kernel crowd does 2008-08-24 23:02 I'll peek at the repo online 2008-08-24 23:03 why not set up a cgi? 
2008-08-24 23:04 ok i recloned 2008-08-24 23:04 and put one of my changes is (fix dependencies in Makefile) 2008-08-24 23:07 have you tried hg view? 2008-08-24 23:07 no 2008-08-24 23:07 try it :-) 2008-08-24 23:07 I can see your repo is clean right away with it 2008-08-24 23:08 $ hg view 2008-08-24 23:08 /usr/bin/env: wish: No such file or directory 2008-08-24 23:08 heh 2008-08-24 23:08 make it there 2008-08-24 23:13 ey 2008-08-24 23:14 I've got a more solid internet connection right now 2008-08-24 23:14 so I can talk for a bit before it drops out 2008-08-24 23:14 ACTION is still prepping for Burning Man 2008-08-24 23:16 shapor, merged 2008-08-24 23:16 not painful 2008-08-24 23:16 only shakey spot was forgetting the url 2008-08-24 23:16 which I have written down this time 2008-08-24 23:16 bh, hi 2008-08-24 23:17 I'm not clear on why prepping is required 2008-08-24 23:17 probably I just don't understand 2008-08-24 23:19 flips: should inode test be asserting ? 2008-08-24 23:19 no 2008-08-24 23:19 (just pulled) 2008-08-24 23:19 perhaps 64 bit bug 2008-08-24 23:19 i'll look 2008-08-24 23:20 outputs about 15 lines then [5972] brelse: Failed assertion "buffer->count" 2008-08-24 23:20 doesn't assert for me 2008-08-24 23:20 double free 2008-08-24 23:21 included the filename? 2008-08-24 23:21 should clean that up 2008-08-24 23:21 yes i included the filename 2008-08-24 23:21 valgrind says uninitialized value 2008-08-24 23:21 in tuxopen 2008-08-24 23:22 let me check here 2008-08-24 23:23 yup 2008-08-24 23:23 just a sec 2008-08-24 23:25 http://pastebin.com/m6e86586e 2008-08-24 23:26 fixed 2008-08-24 23:26 you beat me to it? 
2008-08-24 23:26 no that is the output 2008-08-24 23:26 valgrind really blew up 2008-08-24 23:26 illegal opcode 2008-08-24 23:26 i havne't seen that before 2008-08-24 23:27 I inherited that dodgy interface from ripping the ext2 dir code 2008-08-24 23:27 fragile 2008-08-24 23:27 sure blame someone else :P 2008-08-24 23:28 well I'm the one who didn't run valgrind 2008-08-24 23:28 the encode/decode verges on pretty now, does it not? 2008-08-24 23:28 its pretty obvious you aren't using the makefile 2008-08-24 23:28 I use make 2008-08-24 23:29 fairly often 2008-08-24 23:29 not for running the test 2008-08-24 23:29 but not in development, usually just before or after a commit 2008-08-24 23:29 right 2008-08-24 23:29 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-24 23:29 hi tim 2008-08-24 23:29 hey shapor 2008-08-24 23:29 hiyah tim 2008-08-24 23:29 u missed a great skate today 2008-08-24 23:30 flips! 2008-08-24 23:30 dude! 2008-08-24 23:30 I had a pretty good one 2008-08-24 23:30 ;-_ 2008-08-24 23:30 doing a little faux grinding 2008-08-24 23:30 skating on the skateboard obstacles 2008-08-24 23:30 really? 2008-08-24 23:30 wow 2008-08-24 23:30 tim_dimm: did you get my email? 2008-08-24 23:30 about chris? 2008-08-24 23:30 yeah 2008-08-24 23:31 yeah, were you involved in that? 2008-08-24 23:31 we went to ikea tonight, so I didn't have time to respond 2008-08-24 23:31 ah, sounds... manly 2008-08-24 23:31 I also had to clean 40 mini bearings for my race 2008-08-24 23:31 that course looks amazing 2008-08-24 23:31 I'm crashing guys. just logged on to plug in my phone 2008-08-24 23:32 he sent a pic of the road rash he got longboarding downhill 2008-08-24 23:32 night 2008-08-24 23:32 night 2008-08-24 23:32 wanna see the road rash in the am 2008-08-24 23:32 http://tinyurl.com/6ql2h6 2008-08-24 23:32 tim_dimm: g'night 2008-08-24 23:32 I have a special treatment for road rash 2008-08-24 23:33 das ugly 2008-08-24 23:33 wiskey? 
2008-08-24 23:33 hip crash 2008-08-24 23:33 no, trying to remember the name 2008-08-24 23:33 special bandages 2008-08-24 23:33 its late, I'll remember in the am 2008-08-24 23:33 k, crashin 2008-08-24 23:33 anna put mine away 2008-08-24 23:33 see you 2008-08-24 23:33 ttyl 2008-08-24 23:33 tegaderm 2008-08-24 23:40 flips: what is this {en,de}code_{two,four,six,eight} mess? 2008-08-24 23:40 serial encoding/decoding 2008-08-24 23:40 always looks like that or worse 2008-08-24 23:42 maybe a macro could make it cleaner? 2008-08-24 23:42 would make it worse 2008-08-24 23:42 try it 2008-08-24 23:43 make once the dust settles 2008-08-24 23:43 inlines are to be preferred over macros 2008-08-24 23:43 yes good attitude 2008-08-24 23:43 also read some similar code 2008-08-24 23:43 s/make/maybe/ 2008-08-24 23:43 say, xdelta 2008-08-24 23:43 then come back and complain ;-) 2008-08-24 23:44 heh 2008-08-24 23:44 serial coding/decoding never looks pretty because the compiler can help very little 2008-08-24 23:44 the endian conversion is the biggest mess 2008-08-24 23:44 if you can make that pretty, show me 2008-08-24 23:45 btw i fixed some (L) warnings 2008-08-24 23:45 if you want to pull 2008-08-24 23:45 ok 2008-08-24 23:47 why do we care about endianness? 2008-08-24 23:48 on disk format should be whatever is native 2008-08-24 23:48 because if somebody writes a filesystem on a ppc they want to be able to read it on an x86 2008-08-24 23:49 right, just record that in the superblock 2008-08-24 23:50 theres no reason do jerk around with the endianness in the normal case if you never swap dicks 2008-08-24 23:50 disks* 2008-08-24 23:50 all filesystems do endian conversion 2008-08-24 23:51 zfs will store in native format and convert if you read on the other format, that sucks the worst 2008-08-24 23:51 why? 
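For readers wondering what the serial encode/decode helpers look like, here is a minimal sketch of fixed byte order coding. It assumes big-endian ("network order") on disk, which is a choice made for the example and not something the log states about tux3, and the names `encode_be` and `decode_be` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Write the low nbytes of val at p, most significant byte first,
 * and return the position just past what was written. */
uint8_t *encode_be(uint8_t *p, uint64_t val, unsigned nbytes)
{
    assert(p && nbytes >= 1 && nbytes <= 8);
    for (unsigned i = nbytes; i-- > 0;)
        *p++ = (uint8_t)(val >> (8 * i));
    return p;
}

/* Read nbytes at p, most significant byte first, into *val,
 * and return the position just past what was read. */
const uint8_t *decode_be(const uint8_t *p, uint64_t *val, unsigned nbytes)
{
    uint64_t v = 0;
    assert(p && val && nbytes >= 1 && nbytes <= 8);
    while (nbytes--)
        v = v << 8 | *p++;
    *val = v;
    return p;
}
```

Because the bytes are produced by shifts, the same code runs unchanged on little-endian and big-endian hosts, which is the argument for picking one disk format and always converting. Two, four, six and eight byte variants would be thin wrappers that pass the width.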
2008-08-24 23:51 that seems right 2008-08-24 23:51 because you have a whole different code patch if somebody goes to a different endian host 2008-08-24 23:52 better to pick a format and stick with it 2008-08-24 23:52 that's why there is a network byte order for example 2008-08-24 23:52 sounds ideal, its a uncommon case 2008-08-24 23:52 could do the same wanking with context senstive conversion there, it just isn't wise 2008-08-24 23:52 everyone has pcs 2008-08-24 23:53 no filesystem will get merged in linux without conversion 2008-08-24 23:53 extept for ramfs/tmpfs 2008-08-24 23:53 and tux3 2008-08-24 23:53 :P 2008-08-24 23:53 welcome to the filesystem world 2008-08-24 23:54 there are some unpretty things that have to be done 2008-08-24 23:54 thats just dumb 2008-08-24 23:54 for no good reason 2008-08-24 23:54 not having a consistent disk format would be way dumber 2008-08-24 23:55 think about it: you end up with all the same code if you do context conversion anyway, plus other code 2008-08-24 23:55 oh wait 2008-08-24 23:55 yeah but its not normally in the code path 2008-08-24 23:55 chances are you end up with two copies of the conversion code 2008-08-24 23:55 it's cruft 2008-08-24 23:55 cpu cost of always converting is small 2008-08-24 23:55 there are dedicated processor instructions that do it 2008-08-24 23:56 ACTION is done grumbling about it being a waste 2008-08-24 23:57 ddsnap doesn't do anything with endianness iirc 2008-08-24 23:59 that has to be fixed before merging 2008-08-24 23:59 it's written in the comments at the top of the file 2008-08-25 00:00 big job, nobody wants to do it, its ugly 2008-08-25 00:02 ok zfs goes a bit too far they suport little and big both in the same filesystem 2008-08-25 00:02 thats just stupid 2008-08-25 00:02 Every data structure in ZFS is written in the byte order of the machine writing it, along with a flag to indicate what byte order was used. 
A ZFS volume on an Opteron machine will be little-endian; one controlled by an UltraSPARC will be big-endian. If you swap the disk between the two machines, it still will work, and the more you write to it, the more it will become optimized for native reading. 2008-08-25 00:03 i was just saying put a little/big flag in the sb 2008-08-25 00:04 but yes, then you need double the conversion code 2008-08-25 00:04 bfd 2008-08-25 00:04 merge collision 2008-08-25 00:04 in iattr.c 2008-08-25 00:05 oops 2008-08-25 00:09 merged 2008-08-25 00:09 was worth it just to see how merge conflicts go in mercurial 2008-08-25 00:09 the answer is: smooth and obvious 2008-08-25 00:09 I love it how it starts the editor for you 2008-08-25 00:10 yeah i did that once 2008-08-25 00:10 yes, smooth and obvious 2008-08-25 00:11 i was pleasantly surprised, thought "this is too easy" 2008-08-25 00:11 sure smacks svn 2008-08-25 00:12 hg does need a way to delete a head 2008-08-25 00:12 though if I had that, I probably would not have tried merging like a good boy 2008-08-25 00:13 it's only just after midnight, I should add another attr maybe 2008-08-25 00:14 link count 2008-08-25 00:14 the last one that's needed for initial prototype I think 2008-08-25 00:15 say, are we going to allow more than 4 billion links to the same file? 2008-08-25 00:16 why is ctime clumped with mode,uid,gid? 2008-08-25 00:17 aren't they normally all set at the same time? 2008-08-25 00:17 ctime?
2008-08-25 00:17 create time 2008-08-25 00:17 change time 2008-08-25 00:17 no it is create, you're right about being set together 2008-08-25 00:17 when the uid/gid/mode are normally set 2008-08-25 00:18 but not likely to be shared 2008-08-25 00:18 true 2008-08-25 00:18 just feels wrong to put ctime in there 2008-08-25 00:18 we can have a separate attr that only encodes ctime, maybe 2008-08-25 00:18 could be wrong indeed 2008-08-25 00:19 it doesn't make a huge difference, about 2 extra bytes/inode 2008-08-25 00:19 to have it separate 2008-08-25 00:19 that's about 5% of the size of an inode 2008-08-25 00:19 basic inode 2008-08-25 00:21 so 16 attribute types supported? or 15? 2008-08-25 00:22 16 2008-08-25 00:22 one of them is "extended attribute" 2008-08-25 00:22 attribute structure and variations can get arbitrarily complicated, so the 16 is just to capture the most common ones 2008-08-25 00:23 right so how do you know how many are stored? 2008-08-25 00:23 the inode dictionary gives the size of the inode 2008-08-25 00:23 ileaf dict 2008-08-25 00:23 how many bits is that? 2008-08-25 00:24 64K 2008-08-25 00:24 but limit is the size of a table block 2008-08-25 00:24 we are going to have to let inodes overflow into the next block 2008-08-25 00:25 added another attribute 2008-08-25 00:25 took about 5 minutes this time 2008-08-25 00:25 sign of a good interface 2008-08-25 00:26 will think about separating out mtime 2008-08-25 00:26 or have an alternate, separate mtime 2008-08-25 00:26 that is probably the way to go 2008-08-25 00:26 you mean ctime?
2008-08-25 00:26 the presence of mtime with no owner means "inherit" 2008-08-25 00:26 yes ctime 2008-08-25 00:27 could get tricky with multiple versions 2008-08-25 00:27 also might have a separate mtime, for when database-type writes modify the file without changing the size 2008-08-25 00:27 yeah i was thinking that when i read iattr.c 2008-08-25 00:28 the immediate goal is just to get the prototype up 2008-08-25 00:28 optimize for ownership inheritance later 2008-08-25 00:28 I think all the necessary attributes are there now 2008-08-25 00:29 possibly want a blocks count, but can just ignore that to start 2008-08-25 00:29 its funny though, when you trim the size down this much you see how much fat is really there 2008-08-25 00:29 compared to? 2008-08-25 00:29 with the non-inheritent compact metadata storage the vast majority will be redundant mode,owner 2008-08-25 00:30 yes 2008-08-25 00:30 it would be nice to get down to 16 byte minimum inode size 2008-08-25 00:30 24 bytes for a "hello" immediate file 2008-08-25 00:31 why 24? 2008-08-25 00:32 just thinkin what the minimum would be 2008-08-25 00:32 need 5 bytes for hello 2008-08-25 00:32 2 bytes to say this is an immediate data attribute + version 2008-08-25 00:32 so 16 + 7 ~= 24 2008-08-25 00:33 ;-) 2008-08-25 00:33 yeah newline at the end ;) 2008-08-25 00:33 so the whole point of tux3 is to compress the zumastor configuration database? i knew it!! 2008-08-25 00:34 right 2008-08-25 00:35 versus how big on zfs?
2008-08-25 00:35 I shudder to think 2008-08-25 00:35 512 bytes dnode I think 2008-08-25 00:36 but I don't know what a dnode is 2008-08-25 00:36 I imagine it's hugely grosser than that 2008-08-25 00:36 they use 128 bytes minimum for a pointer 2008-08-25 00:36 I don't know if they have immediate data 2008-08-25 00:36 oh they do 2008-08-25 00:36 bits you mean 2008-08-25 00:36 in the 128 byte pointer maybe 2008-08-25 00:36 bytes I mean 2008-08-25 00:36 ACTION blinks 2008-08-25 00:36 not kidding 2008-08-25 00:37 zfs is pretty gross actually 2008-08-25 00:37 it looks best in a brochure 2008-08-25 00:38 sounds like ipv6, but much worse 2008-08-25 00:40 I wonder which one has more deployments 2008-08-25 00:40 ipv6 by a lot 2008-08-25 00:41 its been around.... 15 years? 2008-08-25 00:41 I've never run into one in the wild 2008-08-25 00:41 as a percentage of who _could_ use it, it may be around a tie right now 2008-08-25 00:42 ok, iattr.c should be nearly done for now 2008-08-25 00:42 have to hook it up to inode.c 2008-08-25 00:42 nah, big networks are all doing v6 for backbones 2008-08-25 00:43 any big win? 2008-08-25 00:43 makes routing easier 2008-08-25 00:43 supported in hardware on the big routers 2008-08-25 00:43 you actully get a discount if you talk v6 to them 2008-08-25 00:43 discount? 2008-08-25 00:44 oh 2008-08-25 00:44 onthe peering 2008-08-25 00:44 yeah 2008-08-25 00:44 thanks, but no thanks 2008-08-25 00:44 i'll keep my nat 2008-08-25 00:44 it is kind of sad that I now will replace 4 lines with 150 lines 2008-08-25 00:45 few computers need public ips 2008-08-25 00:45 the amount of code that was needed to do the endian conversions + encode/decode attrs 2008-08-25 00:45 why 2008-08-25 00:45 why? 2008-08-25 00:46 typical phillips bloatware code :P 2008-08-25 00:46 somebody should add endian attributes to gcc 2008-08-25 00:46 and reliable, predictable bit fields 2008-08-25 00:47 whats not reliable/predictable about bit fields in gcc? 
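A minimal Python sketch of the encode/decode idea mentioned above: each attribute field is written in a fixed byte order and width instead of dumping an in-memory struct, so the on-disk bytes do not depend on the host's endianness or on compiler layout. The field set (mode, uid, gid), the field widths, the big-endian choice, and the function names are assumptions for illustration only; none of them are taken from iattr.c.

```python
import struct

# Hypothetical on-disk layout: 16-bit mode, 32-bit uid, 32-bit gid,
# big-endian with no padding (">" disables native alignment).
OWNER_FORMAT = ">HII"

def encode_owner(mode, uid, gid):
    """Pack the three fields into 10 bytes with a fixed byte order."""
    return struct.pack(OWNER_FORMAT, mode, uid, gid)

def decode_owner(data):
    """Unpack 10 bytes back into a (mode, uid, gid) tuple."""
    return struct.unpack(OWNER_FORMAT, data)

encoded = encode_owner(0x81c0, 0, 0)
print(encoded.hex())
print(decode_owner(encoded))
```

The same bytes come out on a little-endian or big-endian host, which is the property a disk format needs and which a compiler's bit field layout does not promise.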
2008-08-25 00:47 there's no guarantee on where they will end up in the data object 2008-08-25 00:48 so you can't use them to define disk formats 2008-08-25 00:48 code would be a lot less if you could 2008-08-25 00:50 shall we go with the "howmuch" function, or think of a more respectable name? 2008-08-25 00:54 howmuch is respectable 2008-08-25 00:54 it allready changed to howbig ;-) 2008-08-25 00:55 how about __iattr_how_many_bytes 2008-08-25 00:55 ooh pretty 2008-08-25 00:56 just preface every function with __ 2008-08-25 00:56 so you always remember which file you're looking at 2008-08-25 00:56 and that the authors _ key was working 2008-08-25 00:57 what about the type of the return value? 2008-08-25 00:57 aren't you supposed to encode that in the name? 2008-08-25 00:57 oh right 2008-08-25 00:59 i threw up a lot working on bind last week 2008-08-25 00:59 you know its going to be bad when you have to look in the "bin" dir for all the source code 2008-08-25 01:00 wow, that was easy to integrate 2008-08-25 01:00 the high level code got about half the size 2008-08-25 01:01 vs the straight struct banging 2008-08-25 01:02 grep -r seems to indicate 64844 occurances of "isc" in the bind9 source tree 2008-08-25 01:02 isc? 2008-08-25 01:02 oh, thats just lines containing isc 2008-08-25 01:02 internet systems consortium 2008-08-25 01:02 the company that makes bind 2008-08-25 01:03 everything is isc_ 2008-08-25 01:03 about about discdrive? 
2008-08-25 01:03 :-) 2008-08-25 01:03 oh and thats just *lines containing isc* 2008-08-25 01:03 not occurances 2008-08-25 01:04 if you add up all the bytes "isc" takes up in the bind source its probably an order of magnitude larger than the djbdns source 2008-08-25 01:08 isc_uint32_t isc_random_jitter(isc_uint32_t max, isc_uint32_t jitter); 2008-08-25 01:08 heh 2008-08-25 01:08 oh, now we can have a generic attribute dumper 2008-08-25 01:08 instead of a lame hexdump 2008-08-25 01:08 indeed 2008-08-25 01:09 kudos for you for at least trying to make that stinking thing a little better 2008-08-25 01:10 waste of time 2008-08-25 01:10 what I meant 2008-08-25 01:10 the direct c file includes are going to break pretty soon 2008-08-25 01:11 and it will go to a "proper" makefile 2008-08-25 01:12 for the moment I think I will include iattr.c in ileaf.c 2008-08-25 01:12 the notmain thing will break then 2008-08-25 01:12 ugh 2008-08-25 01:12 why did you start the attrs at 6? 2008-08-25 01:12 maybe it all breaks right now 2008-08-25 01:12 just so I could see them 2008-08-25 01:12 time to change the base to zero, almost 2008-08-25 01:13 though zeros are rather common 2008-08-25 01:13 most of the mistakes were caught by the assert on unknown kind, which would not have been unknown if zero was an attribute 2008-08-25 01:15 good call 2008-08-25 01:19 oh wow, the c file includes/notmain hack didn't break when I included iattr.c in ileaf.c 2008-08-25 01:20 this monster gets to stumble on another cycle 2008-08-25 01:24 http://lwn.net/Articles/112567/ 2008-08-25 01:25 should support xattr out of the gate 2008-08-25 01:31 ah, inode table dump looks much better with proper attribute dump 2008-08-25 01:31 yes, xattr is in from the start, except not the first kernel port 2008-08-25 01:31 got to cut some corners somewhere 2008-08-25 01:32 ah, now I notice that a bogus empty inode table leaf has crept in 2008-08-25 01:33 now that the table dump isn't all noisy 2008-08-25 01:33 1 level btree 
0xbf900988 at 64: 2008-08-25 01:33 0x0/0, 4084 free: 2008-08-25 01:33 0x47/1, 4054 free: 2008-08-25 01:33 0x47: ctime 0 mode 81c0 uid 0 gid 0 btree (block 48 depth 1) (30 bytes) 2008-08-25 01:33 0x64/1, 4054 free: 2008-08-25 01:33 0x64: ctime 0 mode 41c0 uid 0 gid 0 btree (block 45 depth 1) (30 bytes) 2008-08-25 01:34 what sort of work is involved in porting to the kernel? 2008-08-25 01:34 have to un c99 it 2008-08-25 01:34 lindent 2008-08-25 01:34 locks 2008-08-25 01:34 hook up to bio interface 2008-08-25 01:34 hook up to vfs interfaces 2008-08-25 01:34 another dozen things I forgot 2008-08-25 01:35 hm, fun 2008-08-25 01:41 whoops, broke make with the #define main tricks 2008-08-25 01:41 I knew that was too easy 2008-08-25 01:48 The ctime--change time--is the time when changes were made to the file or directory's inode (owner, permissions, etc.). The ctime is also updated when the contents of a file change. It is needed by the dump command to determine if the file needs to be backed up. You can view the ctime with the ls -lc command 2008-08-25 01:49 so ctime always gets incremented when mtime does 2008-08-25 01:49 bleah 2008-08-25 01:49 the should probably be bundled 2008-08-25 01:49 what use is that? 2008-08-25 01:49 yes 2008-08-25 01:50 pretty crap feature 2008-08-25 01:50 beyond crap 2008-08-25 01:50 what's your source? 2008-08-25 01:50 so its essentially a copy of the mtime, but also gets updated if you chmod/chown 2008-08-25 01:50 flips: the internets 2008-08-25 01:50 and experimenting 2008-08-25 01:50 thanks for the heads up 2008-08-25 01:50 so yes, refactor that steaming pile 2008-08-25 01:51 are the drugs we have these days as good as the ones those guys were on? 
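The "experimenting" part is easy to repeat. A small Python sketch, assuming a POSIX system where st_ctime is the inode change time (on Windows it means something else): it changes the mode of a scratch file and reports whether mtime moved and whether ctime advanced. The function name and the 0.05 second pause are arbitrary choices for the illustration.

```python
import os
import tempfile
import time

def chmod_effect():
    """Return (mtime_changed, ctime_advanced) for a chmod on a scratch file."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello\n")
        os.close(fd)
        before = os.stat(path)
        time.sleep(0.05)           # step past the timestamp granularity
        os.chmod(path, 0o640)      # inode change only, file data untouched
        after = os.stat(path)
        return (after.st_mtime_ns != before.st_mtime_ns,
                after.st_ctime_ns > before.st_ctime_ns)
    finally:
        os.unlink(path)

print(chmod_effect())
```

On a typical Linux filesystem the first element should be False, matching the point above that chmod/chown touch ctime without touching mtime.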
2008-08-25 01:51 heh 2008-08-25 01:52 reminds me of a line my car geek friends have 2008-08-25 01:52 80s engine management is the result of 60s drug use 2008-08-25 01:52 what we are going to do is let mtime shadow ctime 2008-08-25 01:52 yeah and only add ctime if needed 2008-08-25 01:52 yes 2008-08-25 01:52 more cruft 2008-08-25 01:53 to debug 2008-08-25 01:53 xfs inodes are 256bytes by default 2008-08-25 01:54 caused some problems with selinux xattrs 2008-08-25 01:54 not enough room for them to fit in 2008-08-25 01:55 ext3 fits them in 2008-08-25 01:55 hey, want to collaborate on an article about filesystems for linux world? 2008-08-25 01:55 I've been asked to write about versioning filesystems 2008-08-25 01:56 zfs, btrfs, tux3 2008-08-25 01:56 sure 2008-08-25 01:56 I'll send email around 2008-08-25 01:56 it will be fun 2008-08-25 01:56 something like proctology 2008-08-25 01:56 get to learn all about them 2008-08-25 01:56 i should try btrfs 2008-08-25 01:56 yes 2008-08-25 02:00 ah the ctime and mtime are actually often different 2008-08-25 02:01 hmm, char * has a special feature that void * does not have 2008-08-25 02:02 you can subtract it from a pointer to anything else, and the other pointer is quietly converted to char * 2008-08-25 02:02 void * causes an error on subtract 2008-08-25 02:02 that is probably a flaw in definition of void * 2008-08-25 02:04 because you can set the mtime using utime() 2008-08-25 02:04 like tar does, when you untar something 2008-08-25 02:05 but you cannot alter the ctime 2008-08-25 02:05 however for the majority of the files we write() to, they will be equal 2008-08-25 02:06 I see 2008-08-25 02:06 ctime can stay in the owner group 2008-08-25 02:06 well 2008-08-25 02:06 no way 2008-08-25 02:06 let me see 2008-08-25 02:07 I was thinking if mtime can shadow it 2008-08-25 02:07 but now it seems it can't 2008-08-25 02:07 mtime could shadow ctime, yes 2008-08-25 02:07 only add mtime if someone sets it with utime() 2008-08-25 02:08 but if 
mtime is explicitly set then it can't shadow ctime 2008-08-25 02:08 it can be set to some time in the past, right? 2008-08-25 02:08 yeah 2008-08-25 02:08 why not have it shadow ctime 2008-08-25 02:08 right 2008-08-25 02:08 unless mtime is present 2008-08-25 02:08 it should 2008-08-25 02:09 and that attribute group should be ctime/size instead of mtime/size 2008-08-25 02:09 mtime gets its own attribute 2008-08-25 02:09 yep 2008-08-25 02:28 done 2008-08-25 02:30 and done for the evening 2008-08-25 02:32 its early 2008-08-25 02:34 mtime also needs to be added on a chmod/chown 2008-08-25 02:34 as a copy of the old ctime (if it doesn't already exist) 2008-08-25 02:35 mtime is also always deleted when ctime is updated due to a write 2008-08-25 03:15 -!- pgquiles(~pgquiles@239.Red-83-41-113.dynamicIP.rima-tde.net) has joined #tux3 2008-08-25 05:17 -!- pgquiles(~pgquiles@6.Red-81-39-193.dynamicIP.rima-tde.net) has joined #tux3 2008-08-25 08:07 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-25 10:38 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-25 10:49 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-25 14:01 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-25 14:25 crappy python 2008-08-25 14:25 quietly overflows its numerics in spite of running at 1/200th the speed of C 2008-08-25 14:28 oh wait 2008-08-25 14:29 was me 2008-08-25 14:29 python uses a**b for power, not a^b 2008-08-25 14:49 then how do you do bitwise xor? 2008-08-25 14:51 oh i thought you said that the other way 2008-08-25 14:51 why would you think it was ^? 
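The mix-up above is easy to reproduce. In Python, `**` is exponentiation and `^` is bitwise XOR, and integers are arbitrary precision, so a surprising result from `a^b` is XOR at work rather than overflow. A minimal illustration (60 is just an example exponent):

```python
power = 2 ** 60        # exponentiation
xor = 2 ^ 60           # bitwise XOR: 0b000010 ^ 0b111100

print(power)           # 1152921504606846976
print(xor)             # 62
print(2 ** 64)         # 18446744073709551616, no wraparound at 64 bits
```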
2008-08-25 14:52 hard to see why python needs xor 2008-08-25 14:52 well 2008-08-25 14:52 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-25 14:52 I guess it does 2008-08-25 14:52 haven't finished my first cup of coffee is why 2008-08-25 14:52 navel gazing post on time resolution is up 2008-08-25 14:53 now back to the question of creating an actual exabyte file 2008-08-25 14:53 and that will be 2^60 bytes, not 2^60 less one 2008-08-25 14:55 to do that I think I will store the highest addressable byte in the csize attribute, not the actual size, which is one greater 2008-08-25 14:55 which means zero is not allowed as a csize value 2008-08-25 14:56 we will remove the size attribute instead 2008-08-25 14:56 or I might just spend the extra bit ;-) 2008-08-25 15:26 flips: you've got mail 2008-08-25 15:26 so I do 2008-08-25 15:27 interesting 2008-08-25 15:39 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2008-08-25 15:43 15:40 <== One of the features of btrfs is fs-based softraid with duplicated metadata. So if it detects a bad checksum of the metadata on one disk, it can get it from another location. It can even do that with 1 disk 2008-08-25 15:44 misfeature imho 2008-08-25 15:44 raid doesn't belong in the fs 2008-08-25 15:44 they will be debugging that for ages 2008-08-25 15:45 tux3 may take a more practical approach 2008-08-25 15:45 if it detects an error in metadata, rebuild the index 2008-08-25 15:46 the index is just an accelerator after all 2008-08-25 15:52 how will it detect the error? 2008-08-25 15:52 what index is just an accelerator?
2008-08-25 15:52 btree index 2008-08-25 15:53 only the leaves matter 2008-08-25 15:53 they each will be identified with the data they reference 2008-08-25 15:53 since btrfs gives less protection to data, that puts it on equal footing 2008-08-25 15:53 ok what if you lose a sector some of your metadata is on 2008-08-25 15:53 both can lose data if redundancy of underlying storage is exceeded 2008-08-25 15:54 which metadata? 2008-08-25 15:54 say a dleaf 2008-08-25 15:54 anything really 2008-08-25 15:54 then you blow a hole in some data, what is new? 2008-08-25 15:54 each dleaf stands on its own 2008-08-25 15:54 with btrfs you just read from a dedundant copy 2008-08-25 15:54 doesn't help if a chunk of data was lost 2008-08-25 15:54 what if its an inode table 2008-08-25 15:54 same thing 2008-08-25 15:55 lose a range of inodes 2008-08-25 15:55 it isn't going to happen 2008-08-25 15:55 if it does, so some data is gone 2008-08-25 15:55 but btrfs has a feature which makes that not happen, right? 2008-08-25 15:55 try pulling two disks from a btrfs array and see what happens 2008-08-25 15:55 (oops probably) 2008-08-25 15:55 i am only talking 1 disk 2008-08-25 15:55 no 2008-08-25 15:55 btrfs claims to have such a feature 2008-08-25 15:55 it's wanking 2008-08-25 15:56 job of the raid system 2008-08-25 15:56 ok i'm not referring to the implementation 2008-08-25 15:56 i'm referring to the design 2008-08-25 15:56 stupid design 2008-08-25 15:56 very costly in terms of complexity 2008-08-25 15:56 ok 2008-08-25 15:56 wrong level 2008-08-25 15:56 too complex 2008-08-25 15:56 yes 2008-08-25 15:56 complexity = unreliability 2008-08-25 15:56 ACTION blushes about dleaf 2008-08-25 15:56 heh 2008-08-25 15:57 at least that complexity is confined 2008-08-25 15:57 raid complexity tends to get splatted throughout an entire system 2008-08-25 15:57 if you don't take measures to confine it 2008-08-25 15:57 most peopledont really care about losing data on a single disk system 2008-08-25 15:57 its 
proably more likely you lose the whole disk than get a bad block anyway 2008-08-25 15:58 true, and the only time it ever happened to me was when the disk stopped and died 2008-08-25 15:58 or upgraded an ubuntu system with root on lvm 2008-08-25 15:58 only two times 2008-08-25 15:58 in 15 years of abusing my disks and filesystems 2008-08-25 15:58 the other claimed feature if being able to read the metadata off the least busy device 2008-08-25 15:59 hah 2008-08-25 15:59 I'll believe that when I see it 2008-08-25 15:59 just have lots of spindles 2008-08-25 15:59 and let it go 2008-08-25 15:59 detecting that seems...hard 2008-08-25 15:59 yes 2008-08-25 15:59 quixotic 2008-08-25 15:59 there are so many things that actually matter 2008-08-25 15:59 like having a light footprint on cache 2008-08-25 16:00 never mind tlb 2008-08-25 16:01 got that guitar player's cd 2008-08-25 16:01 ten bucks and I am a happy puppy 2008-08-25 16:25 serial typos :-/ 2008-08-25 17:12 flips: you've got mail 2008-08-25 17:13 my penis is already long enough, thanks <- somebody made me type that 2008-08-25 17:14 was that somebody shapor? 2008-08-25 17:14 man in the middle attack I think 2008-08-25 17:14 lol 2008-08-25 17:14 nice 2008-08-25 17:15 http://www.shiningsilence.com/dbsdlog/2008/08/25/3041.html 2008-08-25 17:15 reading 2008-08-25 17:18 tim_dimm, you have male 2008-08-25 17:18 it is just about sk8 oclock 2008-08-25 17:49 time for a wheel change 2008-08-25 17:49 these ones ground down to the nub from slaloming down seaview terrace ;-) 2008-08-25 17:50 doubt I'll get out today. maybe. 2008-08-25 17:50 have a 7pm with my spousal unit 2008-08-25 17:50 interviewing the night nurse 2008-08-25 17:51 night nurse? 2008-08-25 17:51 "interviewing the night nurse" sounds like a bad porn 2008-08-25 17:51 shapor and I will drink a toast to you 2008-08-25 17:51 it's on shapor 2008-08-25 17:51 helps out a few nights during first few weeks 2008-08-25 17:51 hey! 
2008-08-25 17:51 I should be there for that 2008-08-25 17:51 ! 2008-08-25 17:52 sure should 2008-08-25 17:52 topic swap; these ceramic mini bearings will be fast at maryhill 2008-08-25 17:53 doing a good job scatching the paint off my frames sliding down that rail 2008-08-25 17:53 too bad I only have 16 2008-08-25 17:53 could not convince shapor to try 2008-08-25 17:53 maybe today 2008-08-25 17:53 then there's the barrel the skateboarders jump over 2008-08-25 17:53 should be good for a pretty spectacular crash 2008-08-25 17:54 on thurday, it will be 120 skateboarders vs. 8 inliners 2008-08-25 17:54 we'll be a bunch of dorks 2008-08-25 17:54 they get out of the way more obligingly lately 2008-08-25 17:54 but we might be faster than them 2008-08-25 17:54 I take care to skate close to them when they veer towards me 2008-08-25 17:55 they seem to like it 2008-08-25 17:55 clapping when they land one seems to help too 2008-08-25 17:55 don't have to do that much 2008-08-25 18:00 I wore the front and back 80's down to 74 2008-08-25 18:00 got nice rocker 2008-08-25 18:00 lousy tracking 2008-08-25 18:00 and the middle? 2008-08-25 18:01 this is very autistic of you, btw 2008-08-25 18:02 ;-) 2008-08-25 18:02 76 2008-08-25 18:02 discussing skating on the irc- that's not right 2008-08-25 18:02 because? 
2008-08-25 18:02 cause irc is geeky enough 2008-08-25 18:03 :-D 2008-08-25 18:03 skating is an essential part of the design process 2008-08-25 18:03 alan skates a double rocker 2008-08-25 18:03 80's in the middle, 76's on the ends, with the ends raised up 2mm 2008-08-25 18:03 gives him 4mm of rocker 2008-08-25 18:04 explaining why he doesn't go faster than 20mph 2008-08-25 18:04 I need at least 3 mm, only have 2 2008-08-25 18:04 you only have 1mm 2008-08-25 18:04 you measured diameter, not radius 2008-08-25 18:04 right 2008-08-25 18:05 even that makes a big difference 2008-08-25 18:05 I have 2mm in my missions 2008-08-25 18:05 when I set the elastomers to soft 2008-08-25 18:05 2 would do me then 2008-08-25 18:05 yeah you skaters are a bunch of dorks 2008-08-25 18:05 can bias the preload left/right too 2008-08-25 18:05 make big diff on the feel 2008-08-25 18:06 never knew there was so much detail to skating 2008-08-25 18:06 these k2's have no adjustment whasoever 2008-08-25 18:06 I'll be going for a pair of seba fr1's pretty soon 2008-08-25 18:06 konrad: http://homepage.mac.com/timothyhuber/downhill/iMovieTheater68.html 2008-08-25 18:06 figure out how to get them from europe 2008-08-25 18:06 figured 2008-08-25 18:07 urk, virgin bearings 2008-08-25 18:07 got to push hard 2008-08-25 18:07 ACTION didn't write that either 2008-08-25 18:07 have to take em apart, get the nasty factory grease out and *cough* relube 2008-08-25 18:07 shapor pwning my keyboard probably 2008-08-25 18:07 tim_dimm: oh wow 2008-08-25 18:08 we hit 51.2 on sunday 2008-08-25 18:08 slight tailwind 2008-08-25 18:08 temps were cool 2008-08-25 18:08 jeez 2008-08-25 18:08 traction fell off by 10am 2008-08-25 18:09 just as good, the car clubs came out with their ferraris, porsches, lotus, subbies, etc 2008-08-25 18:09 you should see shapor doing downhill 2008-08-25 18:09 fucker just learned to skate 5 months ago- already hitting 35 2008-08-25 18:09 I'm a skier, front-back balance is a lot easier on us 
2008-08-25 18:10 oh yeah 2008-08-25 18:10 although with the 5 wheel skates, that's not a problem 2008-08-25 18:10 slowing down is the problem 2008-08-25 18:10 how does one do that? 2008-08-25 18:10 http://www.dailymotion.com/tag/descente/video/764 2008-08-25 18:10 thought you'd ask 2008-08-25 18:11 that's how the best in the world do it 2008-08-25 18:11 I throw slalom turns, can scrub 15mph in ~25 ft 2008-08-25 18:12 next week is a world cup event in maryhill 2008-08-25 18:12 black & white wheels on the same skate look kinda cool 2008-08-25 18:13 http://www.maryhillfestivalofspeed.com/ 2008-08-25 18:14 wow. 2008-08-25 18:14 most of us are geeks over 40 2008-08-25 18:15 scott peer works at jpl. cassini runs on his nav software 2008-08-25 18:15 warren focke is an astrophysicist at stanford 2008-08-25 18:16 Washington state, not DC? 2008-08-25 18:16 y 2008-08-25 18:17 awesome road 2008-08-25 18:17 http://www.panoramio.com/photos/original/7534977.jpg 2008-08-25 18:17 40 is when you realize your knees can't survive jogging for another 20 years 2008-08-25 18:17 i learned that at 28 when I started skating 2008-08-25 18:18 tim_dimm: looks like eastern washington 2008-08-25 18:18 y 2008-08-25 18:19 90 miles east of portland 2008-08-25 18:21 we should do a tux3 ski trip 2008-08-25 18:23 we need to 2008-08-25 18:24 mount washington 2008-08-25 18:24 got the best snow in the pacific northwest last year 2008-08-25 18:24 and I have an in with the restaurant owner 2008-08-25 18:25 my right wheels show much more asymmetric wear then left 2008-08-25 18:25 got to fix that 2008-08-25 18:25 3 yrs ago, had thigh deep powder at mt baldy, 45 min from la 2008-08-25 18:25 heh 2008-08-25 18:26 many el nina will do it for us again this year 2008-08-25 18:26 maby 2008-08-25 18:26 i wish 2008-08-25 18:26 ACTION was born a powder pig 2008-08-25 18:26 ow 2008-08-25 18:27 konrad: where r u? 2008-08-25 18:27 seattle, washington area 2008-08-25 18:28 so you get what, 20-30 days /yr? 
2008-08-25 18:28 pff 2008-08-25 18:28 I'm more lazy than that 2008-08-25 18:28 I think last season I only skied about 10 days 2008-08-25 18:29 i got 100 in '93 2008-08-25 18:29 ACTION skated 10 days in the last 10 days 2008-08-25 18:29 that's the difference between skiing and skating 2008-08-25 18:29 yeah. 2008-08-25 18:29 I'd like to pick it up 2008-08-25 18:29 they feel kind of the same except you can skate whenever you feel like it 2008-08-25 18:29 the only skating I've done was as a kid in a super smooth rink 2008-08-25 18:30 it's the same outside but bigger and bumpier 2008-08-25 18:30 mhm 2008-08-25 18:30 my wife used to be a professor at U W 2008-08-25 18:30 and steeper sometimes 2008-08-25 18:30 fun 2008-08-25 18:31 more cars too 2008-08-25 18:31 more bikinis too 2008-08-25 18:31 :) 2008-08-25 18:32 tim taught me to skate backwards for that very reason 2008-08-25 18:32 heh 2008-08-25 18:32 nice' 2008-08-25 18:32 skiing backwards isn't so hard 2008-08-25 18:33 I'm big on skiing sideways 2008-08-25 18:33 I like to roll too 2008-08-25 18:33 no arials, that is sick 2008-08-25 18:33 skiing sideways? 2008-08-25 18:33 yup 2008-08-25 18:33 fast 2008-08-25 18:33 for fun 2008-08-25 18:34 easy to catch an edge that way 2008-08-25 18:34 yup 2008-08-25 18:34 that's when the foll skillz help 2008-08-25 18:34 come up skiing, don't lose the rhythm 2008-08-25 18:34 heh 2008-08-25 18:34 I don't have those skills :( 2008-08-25 18:34 easy to get 2008-08-25 18:34 I just do my best not to fall 2008-08-25 18:34 just don't care ;-) 2008-08-25 18:34 ah, falling is part of a run for me 2008-08-25 18:35 got boring after not falling for a couple years ;) 2008-08-25 18:35 flips, you should try flips ! 
2008-08-25 18:35 no way 2008-08-25 18:35 I value my neck 2008-08-25 18:35 I did a little 2008-08-25 18:35 couple feeble attempts 2008-08-25 18:36 ok inline dh then 2008-08-25 18:36 was into springboard diving and trampolline 2008-08-25 18:36 my brother much more so 2008-08-25 18:36 I like to do multiple flips in freefall 2008-08-25 18:36 air is soft 2008-08-25 18:36 hard packed snow is insanity 2008-08-25 18:38 back flips in freefall is fun, each one goes faster 2008-08-25 18:38 because of the way your head and heels catch air when you're tucked 2008-08-25 18:38 front fliops require real exertion 2008-08-25 18:40 we need a poll feature on the irc 2008-08-25 18:40 like they have on forums 2008-08-25 18:40 shapor? 2008-08-25 18:40 we could have a poll to see how many want flips to demonstrate 2008-08-25 18:40 lol 2008-08-25 18:42 heh 2008-08-25 18:42 anna gave my rig away years ago 2008-08-25 18:42 then bought live insurance on me ;-) 2008-08-25 18:44 had one of those http://wraggj.people.cofc.edu/skydive_hist.html 2008-08-25 18:44 more than one 2008-08-25 18:44 three at my worst ;) 2008-08-25 18:46 rolling 2008-08-25 18:48 -!- boom(~boom@c-76-117-208-224.hsd1.nj.comcast.net) has joined #tux3 2008-08-25 20:21 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3 2008-08-25 20:41 50 turns, top to bottom of seaside terrace 2008-08-25 20:48 is that not many? 
2008-08-25 20:58 that is many 2008-08-25 20:59 got lucky and there were no cars 2008-08-25 20:59 ah 2008-08-25 20:59 I havn't skated in too long 2008-08-25 20:59 my old skates are too small :S 2008-08-25 21:00 skates are cheap 2008-08-25 21:00 really nice ones for $200 2008-08-25 21:00 pff 2008-08-25 21:13 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-25 21:14 wow, they are asking $449 au in australia for the skates I paid $200 for and almost got for $150 2008-08-25 21:14 http://www.baysideblades.com.au/inline_skates_dt/inline_skates/k2/k2_frontman.htm 2008-08-25 21:16 hey 2008-08-25 21:16 ACTION is at Hot Chips 2008-08-25 21:16 sounds tasty 2008-08-25 21:17 ah, sushi time for me 2008-08-25 21:17 nice, I need exercise 2008-08-25 21:17 I'm rawling out of my skin right now, maybe do some push ups or something like that later 2008-08-25 21:18 get skates 2008-08-25 21:18 worked out the details of versioning the inode attributes on that skate 2008-08-25 21:21 hm 2008-08-25 21:21 mmm raw fish 2008-08-25 21:28 flips: how will that work 2008-08-25 21:28 versioned attributes? 2008-08-25 21:43 yeah 2008-08-25 21:43 writing a note about it for the list? 2008-08-25 21:44 eventually 2008-08-25 21:44 it's pretty straightforward 2008-08-25 21:44 works just like versioned pointers 2008-08-25 21:44 same algorithms 2008-08-25 21:44 only thing is, when we walk through a collection of attributes instead of computing just one "max ord" value, we compute an array 2008-08-25 21:45 one element for each attribute group 2008-08-25 21:45 then for each attribute group, the operative item is the one with highest ord 2008-08-25 21:46 a single pass through the unordered attribute list does all attributes 2008-08-25 21:46 see the latest checkin for something resembling that (dump_attrs is rewritten) 2008-08-25 21:51 shapor, would it make sense to put link_count together with mtime? 2008-08-25 21:55 why? 
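A rough Python sketch of the single pass described above, where one "max ord" per attribute group is tracked in an array while walking an unordered attribute list. Everything concrete here is an assumption for illustration: attributes are modeled as (group, version, value) tuples, the version tree is a child-to-parent map, and ord is taken to be a version's depth in the target version's ancestry (versions outside that ancestry get ord 0 and are ignored). It does not reproduce dump_attrs or any other tux3 code.

```python
def ancestry_ord(parent, target):
    """Map each version on the path from target to the root to an ordinal.

    The target itself gets the highest ordinal, the root gets 1.
    """
    chain = []
    version = target
    while version is not None:
        chain.append(version)
        version = parent.get(version)
    return {v: len(chain) - i for i, v in enumerate(chain)}

def resolve(attrs, parent, target, ngroups):
    """Pick, per attribute group, the value visible from the target version."""
    ords = ancestry_ord(parent, target)
    best_ord = [0] * ngroups       # one "max ord" slot per attribute group
    best = [None] * ngroups
    for group, version, value in attrs:        # single pass, any order
        o = ords.get(version, 0)               # 0: not an ancestor of target
        if o > best_ord[group]:
            best_ord[group] = o
            best[group] = value
    return best

# Hypothetical version tree: 0 is the root, 1 and 3 are children of 0,
# 2 is a child of 1.
parent = {0: None, 1: 0, 2: 1, 3: 0}
attrs = [
    (0, 0, "mode 81c0"),
    (1, 0, "size 10"),
    (1, 2, "size 99"),
    (1, 3, "size 7"),
    (0, 1, "mode 41c0"),
]
print(resolve(attrs, parent, target=2, ngroups=2))
print(resolve(attrs, parent, target=3, ngroups=2))
```

Seen from version 2 the mode comes from version 1 and the size from version 2 itself; seen from version 3 the mode falls back to the root and the size comes from version 3.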
2008-08-25 21:55 that rarely changes 2008-08-25 21:55 and they really never change together 2008-08-25 21:55 the question is, does mtime change when link count changes 2008-08-25 21:56 maybe not 2008-08-25 21:56 no why would it? 2008-08-25 21:56 it's a modification? 2008-08-25 21:56 ok 2008-08-25 21:56 forget that 2008-08-25 21:56 no mtime is modificatino of the file data 2008-08-25 21:56 next consideration is whether link count should be part of the data attribute 2008-08-25 21:56 bearing in mind that there can be multiple data attributes with different versions 2008-08-25 21:57 changing the link count only changes the ctime 2008-08-25 21:57 since its considered an inode change 2008-08-25 21:57 s/considered / 2008-08-25 21:57 / 2008-08-25 21:57 and size changes are way more frequent than link count changes 2008-08-25 21:58 so does not make sense to bundle with ctime/isize 2008-08-25 22:00 hmm, I bet I broke the make 2008-08-25 22:00 yup 2008-08-25 22:12 1 level btree at 64: 2008-08-25 22:12 0 inode(s) starting at 0x0 (4084 free) 2008-08-25 22:12 1 inode(s) starting at 0x47 (4060 free) 2008-08-25 22:12 0x47: mode 81c0 uid 0 gid 0 btree 48/1 2008-08-25 22:12 1 inode(s) starting at 0x64 (4060 free) 2008-08-25 22:12 0x64: mode 41c0 uid 0 gid 0 btree 45/1 2008-08-25 22:12 that 0 inodes block is the original inode table leaf 2008-08-25 22:12 then I set an inode goal way higher 2008-08-25 22:12 so it didn't get used 2008-08-25 22:13 in practice it's always going to get used 2008-08-25 22:13 but still 2008-08-25 22:13 I wonder if I should let a btree be degenerate without a root until something tries to put an inode in it 2008-08-25 22:14 then make the initial leaf hold that first thing instead of assuming its based at inode zero 2008-08-25 22:14 probably not worth any effort 2008-08-26 00:15 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-08-26 03:09 -!- cdk(~chinmay@121.246.34.93) has joined #tux3 2008-08-26 03:15 -!- cdk(~chinmay@121.246.34.93) 
has left #tux3 2008-08-26 03:19 -!- cdk(~chinmay@121.246.34.93) has joined #tux3 2008-08-26 03:19 -!- cdk(~chinmay@121.246.34.93) has left #tux3 2008-08-26 03:43 folks 2008-08-26 04:03 -!- pgquiles(~pgquiles@6.Red-81-39-193.dynamicIP.rima-tde.net) has joined #tux3 2008-08-26 04:56 -!- flipz(~phillips@phunq.net) has joined #tux3 2008-08-26 07:23 -!- pgquiles(~pgquiles@189.Red-81-44-176.dynamicIP.rima-tde.net) has joined #tux3 2008-08-26 07:39 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-26 09:46 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-26 11:13 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-26 11:41 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-26 12:58 hey 2008-08-26 12:58 ACTION is trying to do final packing before heading to BM 2008-08-26 13:38 -!- cybergirl(~cybergirl@ANantes-257-1-135-233.w90-32.abo.wanadoo.fr) has joined #tux3 2008-08-26 13:54 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-26 14:14 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-08-26 16:27 1225 /* We now have enough fields to check if the inode was active or not. 2008-08-26 16:27 1226 * This is needed because nfsd might try to access dead inodes 2008-08-26 16:27 1227 * the test is that same one that e2fsck uses 2008-08-26 16:27 1228 * NeilBrown 1999oct15 2008-08-26 16:27 1229 */ 2008-08-26 16:27 -- http://lxr.linux.no/linux+v2.6.26.3/fs/ext2/inode.c#L1205 2008-08-26 16:27 :p 2008-08-26 16:30 "active" ? 2008-08-26 16:31 ah ESTALE handling 2008-08-26 16:31 hrm wouldn't get invalidated somehow when nlink drops to zero? 2008-08-26 17:16 I haven't plumbed those greasy depths 2008-08-26 17:17 http://lxr.linux.no/linux+v2.6.26.3/fs/ext2/super.c#L330 2008-08-26 17:17 -- nfs hack 2008-08-26 17:17 we can do it somewhat more cleanly I think 2008-08-26 17:49 flipz: trying to distribute some of the kernel port work? 
;) 2008-08-26 17:53 of course 2008-08-26 17:53 you going to skate today? 2008-08-26 17:53 I mean, so far it's been mainly lets sit back and watch 2008-08-26 17:53 except for you 2008-08-26 17:53 as usual 2008-08-26 17:54 and as usual, that gets old after a while 2008-08-26 17:54 I'm sure chris mason went through the same thing 2008-08-26 17:59 once it mounts people will be interested 2008-08-26 18:31 is this like "will it blend?" 2008-08-26 18:32 ah that is a great idea for publicity 2008-08-26 18:32 get a hard drive and a blentec blender 2008-08-26 18:32 a few hard drives 2008-08-26 18:32 "zfs: will it blend?" 2008-08-26 18:33 tux3 could be the only filesystem that destroys the blendtec 2008-08-26 18:33 :P 2008-08-26 18:33 or just do it once and let tux3 blend ;) 2008-08-26 18:34 right on 2008-08-26 18:34 we could rig the blendtec with plastic blades 2008-08-26 18:34 and make smoke come from the motor 2008-08-26 18:34 when we try Tux3 2008-08-26 20:23 -!- flipz(~phillips@phunq.net) has joined #tux3 2008-08-27 00:42 shapor, once it mounts then people will pile on 2008-08-27 00:42 people are already interested 2008-08-27 00:42 but interested lazy people who do not want to help get it to the point of mounting are uninteresting 2008-08-27 00:49 indeed 2008-08-27 00:49 flipz, just read your post a couple times, looking at the ext2 along side 2008-08-27 00:50 annoying on this 1024x768 laptop display 2008-08-27 00:50 a phone? 2008-08-27 00:50 inode.c is still pretty braindamaged 2008-08-27 00:51 does not yet implement the code in the post 2008-08-27 00:51 last checkin was 26 hours ago :( 2008-08-27 00:51 oops 2008-08-27 00:51 forgot to pull 2008-08-27 00:53 also your post didnt get recogized as a response to the previous one 2008-08-27 00:53 it wasn't 2008-08-27 00:53 should have been 2008-08-27 00:54 ah 2008-08-27 00:54 saw that it started with "Re:" 2008-08-27 00:54 why would you do that? 
2008-08-27 00:55 pulled 2008-08-27 00:55 distracted 2008-08-27 00:55 most probably I did reply 2008-08-27 00:55 and something messed up 2008-08-27 00:56 that's why good archivers have "probably in reply to" 2008-08-27 00:56 no the in-reply to header wasn't there 2008-08-27 00:56 you can make it all better by incorporing them into design.html ;-) 2008-08-27 00:57 yes 2008-08-27 00:57 got the subject line from somewhere 2008-08-27 00:57 did not type it in by hand 2008-08-27 00:57 therefore something screwed up 2008-08-27 00:57 it is quite hard actually 2008-08-27 00:57 much harder than coding 2008-08-27 00:57 cutting and pasting? 2008-08-27 00:58 in a cohesive fashion, yes 2008-08-27 00:58 or thinking clearly when faced with a bunch of fuzzy rambling design notes? 2008-08-27 00:58 :) 2008-08-27 00:58 thinking hurts my head too 2008-08-27 00:58 try to avoid it whenever possible 2008-08-27 00:59 getting that all together would take a whole day of sitting down with it all really 2008-08-27 00:59 dont have the time right now :( 2008-08-27 01:00 don't get it all together then 2008-08-27 01:00 just get one piece of it together 2008-08-27 01:00 yeah 2008-08-27 01:00 the koreans have a saying: starting is half 2008-08-27 01:00 heh 2008-08-27 01:00 since the koreans are the masters of lazy, it is important for them to have their motivators all lined up in a row 2008-08-27 01:11 there, another commit 2008-08-27 01:11 send html ;-) 2008-08-27 01:31 heh 2008-08-27 01:37 ah, I just realized why the inum allocation goal keeps changing 2008-08-27 01:37 because I decided to make it the same as the block allocation goal 2008-08-27 01:37 for now 2008-08-27 01:38 probably not going to stay that way 2008-08-27 01:38 but it is ok to start with 2008-08-27 01:38 planned that carefullly then forgot I did it ;-) 2008-08-27 01:38 needs a comment 2008-08-27 01:45 /* 2008-08-27 01:45 * For now the inum allocation goal is the same as the block allocation 2008-08-27 01:45 * goal. 
This gives us a maximum inum density of one per block and 2008-08-27 01:46 * should give pretty good spacial correlation between inode table blocks 2008-08-27 01:46 * and file data belonging to those inodes provided somebody sets the 2008-08-27 01:46 * block allocation goal based on the directory the file will be created. 2008-08-27 01:46 */ 2008-08-27 01:46 will be in I mean 2008-08-27 02:47 flipz: http://shapor.com/tux3/shapor-tux3/doc/design.html 2008-08-27 02:47 wee 2008-08-27 02:47 i've started dropping the pieces in the original post 2008-08-27 02:47 ooh, pretty 2008-08-27 02:48 growing in to a real doc 2008-08-27 02:48 should we check it into the repo yet? 2008-08-27 02:48 no, needs a lot more work, an hour ago it was just the 2008-July.txt mailing list archive from your mailman 2008-08-27 02:49 -!- jennyf(~jennyf@ANantes-257-1-135-233.w90-32.abo.wanadoo.fr) has joined #tux3 2008-08-27 02:49 although, it is the only copy 2008-08-27 02:49 hi jennyf 2008-08-27 02:49 there is this too: http://tux3.org/design.html 2008-08-27 02:49 yeah i looked at that 2008-08-27 02:50 mostly rubbish? 2008-08-27 02:50 is it anything more than html-ized lkml post? 
2008-08-27 02:50 microscopically more 2008-08-27 02:50 didn't look like it was 2008-08-27 02:50 hm 2008-08-27 02:50 there are a few bits 2008-08-27 02:51 worth not killing 2008-08-27 02:51 i've only done minor editing, removing list like "i forgot to mention this in my original post:" 2008-08-27 02:51 the phtree part 2008-08-27 02:51 ok 2008-08-27 02:51 just keep posting to the list and i'll extract 2008-08-27 02:51 ;) 2008-08-27 02:51 "new user interfaces" 2008-08-27 02:52 other things like inode attributes have been completely superceded I think 2008-08-27 02:52 should not take too long to snarf the few bits that aren't treated better elsewhere 2008-08-27 02:53 I think it's just the two I mentioned 2008-08-27 02:53 ok 2008-08-27 02:54 actually, maybe i will make a commit with that in it 2008-08-27 02:54 simply because its my only copy 2008-08-27 02:54 want to pull it in? 2008-08-27 02:54 sure 2008-08-27 02:54 just say when 2008-08-27 02:55 one quick scan for things that jump out and rip out your eyeballs 2008-08-27 02:55 committed 2008-08-27 02:55 ACTION looks for the pull address 2008-08-27 02:57 hg view is a godsend for this process 2008-08-27 02:57 I wish it didn't use such an ugly widget set though 2008-08-27 03:01 still haven't seen it 2008-08-27 03:01 oh btw 2008-08-27 03:01 i had the first hg fail today 2008-08-27 03:01 my inbox got spammed with cron failures, trying to pull from you was failing 2008-08-27 03:01 for 12 hours or so 2008-08-27 03:02 when i ran hg pull on the command line, it complained that it needed a lock 2008-08-27 03:02 hmm 2008-08-27 03:02 there was some stuck hg pull process 2008-08-27 03:02 why was the pull failing? 2008-08-27 03:02 your side? 2008-08-27 03:02 stupidly i killed it 2008-08-27 03:02 or mine? 
2008-08-27 03:02 yeah on my side 2008-08-27 03:02 i should have tried to figure out why it was hung before i killed it 2008-08-27 03:03 hopefully it will happen again 2008-08-27 03:03 yeah, we'll see 2008-08-27 03:04 well there it is 2008-08-27 03:05 well I wonder if I am going to get my 1 exabyte file in 8 k volume demo for tomorrow 2008-08-27 03:05 maybe not 2008-08-27 03:05 I'll relax a little on that 2008-08-27 03:06 other important stuff is getting done too 2008-08-27 03:06 yeah, as long as progress marches on 2008-08-27 03:07 I think I will add the logic to round down the inode table split boundaries to multiples of some binary number like 64 2008-08-27 03:07 at that point I might have to play the shapor card 2008-08-27 03:07 to expunge the bugs 2008-08-27 03:07 cuz its a little hairy already 2008-08-27 03:07 not a great deal of code, but logic is subtle 2008-08-27 03:08 whyd you change your nick 2008-08-27 03:08 z mean something special? 2008-08-27 03:08 huh? 2008-08-27 03:08 what nick? 2008-08-27 03:08 hah 2008-08-27 03:08 it means the other one was in use 2008-08-27 03:08 because I went oom and crashed 2008-08-27 03:08 linux is ugly that way 2008-08-27 03:08 eek 2008-08-27 03:09 without ulimits 2008-08-27 03:09 happens regularly 2008-08-27 03:09 sucks 2008-08-27 03:09 set ulimits ;) 2008-08-27 03:09 I don't, because I want to feel that pain 2008-08-27 03:09 and fix it one day 2008-08-27 03:09 speaking of feeling pain 2008-08-27 03:09 watched gladiator all the way through 2008-08-27 03:09 buy more ram isn't a solution really 2008-08-27 03:09 you need to see it 2008-08-27 03:09 no 2008-08-27 03:09 since ff will just eat it up 2008-08-27 03:10 crap on my system expands to fill all space 2008-08-27 03:10 crap being mainly firefox 2008-08-27 03:10 with all those porn tabs open 2008-08-27 03:10 yes 2008-08-27 03:10 pitiful 2008-08-27 03:10 russion porn tabs, the worst 2008-08-27 03:10 so, is windows better for something then... porn? 
2008-08-27 03:10 windows is way worse 2008-08-27 03:10 oh wait, russian? than it is worse 2008-08-27 03:11 I think 2008-08-27 03:11 spyware 2008-08-27 03:11 based on my knowledge of its kernel structure 2008-08-27 03:11 hrm true, joelle does have to reboot a lot 2008-08-27 03:12 the last round of windows kernel development was mainly copying linux 2.6 features 2008-08-27 03:12 we suck at oom, so they suck worse 2008-08-27 03:12 funny really 2008-08-27 03:19 there we go 2008-08-27 03:19 nice meaty post 2008-08-27 03:19 on allocation strategy 2008-08-27 03:19 barely scratched the surface though 2008-08-27 03:20 that ones been cooking a while 2008-08-27 03:21 that post? 2008-08-27 03:21 wrote it starting a couple hours ago 2008-08-27 03:21 but yes 2008-08-27 03:21 been blathering about it a while 2008-08-27 03:53 valgrind errors in ileaf 2008-08-27 03:53 must have been there a while 2008-08-27 05:55 -!- pgquiles(~pgquiles@189.Red-81-44-176.dynamicIP.rima-tde.net) has joined #tux3 2008-08-27 07:18 you guys had a late night 2008-08-27 07:18 talking about russian porn and all 2008-08-27 07:18 http://lxr.linux.no/linux+v2.6.26.3/CREDITS 2008-08-27 07:18 might be a source of tux3 developers 2008-08-27 07:19 flips: when you get a chance, troll that list and mark who's been naughty or nice. I'll fire off emails to the nice ones 2008-08-27 08:04 and another request- I could use a more condensed general description of Tux3 illustrating the main features/benefits. 2008-08-27 10:16 tim_dimm, I'll pen something 2008-08-27 10:16 question for u 2008-08-27 10:16 how would map reduce fit in along with tux3? 2008-08-27 10:16 or visa versa 2008-08-27 10:17 no idea 2008-08-27 10:17 lot of discussion about hadoop / map reduce on the cloud computing list 2008-08-27 10:17 Maybe shapor knows something about it 2008-08-27 11:45 tim_dimm: i don't think hadoop stuff really asks much of the filesystem 2008-08-27 14:46 So... 
exabyte file written 2008-08-27 14:46 with Tux3, an exabyte means an exabyte 2008-08-27 14:47 not an exabyte less one 2008-08-27 14:51 [12818] tuxseek: seek to 0xffffffffffffff4 2008-08-27 14:51 [12818] tuxread: read ffffffffffffff4/c 2008-08-27 14:51 0xbfd504a4: 68 65 6c 6c 6f 20 77 6f 72 6c 64 21 "hello world!" 2008-08-27 14:53 [12843] tuxread: file pos 1000000000000000/c 2008-08-27 14:53 whoops 2008-08-27 14:54 flips: have you taken a decision about time resolution? 2008-08-27 14:54 [12861] tuxread: file pos 1000000000000000 2008-08-27 14:54 pgquiles, general idea is to go with 48 bits unless somebody jumps up with a use case that has to have more 2008-08-27 14:55 pgquiles, and 32 bits for atime, using the measures we discussed on the list to avoid breakage 2008-08-27 14:55 but not set in stone 2008-08-27 14:55 ext3's 1 second resolution is annoying for some uses 2008-08-27 14:55 that is too crude, yes 2008-08-27 14:56 millisecond resolution ought to be good enough for a file 2008-08-27 14:56 agreed 2008-08-27 14:56 0.1 seconds is good enough, IMHO 2008-08-27 14:56 probably 2008-08-27 14:56 1 second not quite 2008-08-27 14:57 100ms is NTFS' time resolution and my users are happy with that 2008-08-27 14:57 we should merge #zumastor and #tux3 in #daniel's :-P 2008-08-27 14:59 they're pretty separate 2008-08-27 14:59 for now 2008-08-27 14:59 at least until tux3 is ready to replicate 2008-08-27 15:29 ok, now tuxread and tuxwrite return EIO for access about 1 EB 2008-08-27 15:29 above 2008-08-27 15:43 mmm potstickers 2008-08-27 20:27 flips: minor (L) cleanups and Makefile committed to my repo 2008-08-27 20:32 I'll pull 2008-08-27 20:48 sefault boo 2008-08-27 20:48 inode->btree = new_btree(sb, &dtree_ops); // error??? 
2008-08-27 20:48 needs to be handled 2008-08-27 20:49 segfault* 2008-08-27 20:54 ah this is the real culprit: 2008-08-27 20:54 486 struct buffer *rootbuf = new_node(&btree); 2008-08-27 20:55 hmm lots of missing error checking 2008-08-27 21:14 yup 2008-08-27 21:14 need to start using ERR_PTR 2008-08-27 21:15 which allows an errno to be overloaded on a pointer return 2008-08-27 21:15 used extensively in kernel 2008-08-27 21:16 hmm 2008-08-27 21:17 http://lxr.linux.no/linux+v2.6.26.3/include/linux/err.h#L22 2008-08-27 21:17 the getblk interface is classic evil 2008-08-27 21:18 returns NULL if anything goes wrong 2008-08-27 21:18 so everybody makes up some different random error to report higher 2008-08-27 21:20 whats the point of ERR_PTR? 2008-08-27 21:20 to return an error without adding a new parameter 2008-08-27 21:21 instead of checking for NULL return you check for IS_ERR 2008-08-27 21:21 just changes a type though 2008-08-27 21:21 looks quite pointless? 2008-08-27 21:21 it's how you use it 2008-08-27 21:22 if (IS_ERR(result = some_function())) return result; 2008-08-27 21:23 and some_function returns ERR_PTR(ENOMEM) etc 2008-08-27 21:24 I can never remember if it is supposed to be ERR_PTR(ENOMEM) or ERR_PTR(-ENOMEM) 2008-08-27 21:24 one of those will cause oopses 2008-08-27 21:24 ah i see 2008-08-27 21:24 hah 2008-08-27 21:24 crappy interface actually 2008-08-27 21:24 but 2008-08-27 21:24 everything in C is crappy 2008-08-27 21:25 so it fits 2008-08-27 21:25 exceptions yeah 2008-08-27 21:25 lack thereof 2008-08-27 21:25 errno! 2008-08-27 21:25 well we should adopt that interface 2008-08-27 21:25 if its what we need to kernel port anyway 2008-08-27 21:25 will make the port easier 2008-08-27 21:25 yes 2008-08-27 21:25 you'll see various shouts to myself about that 2008-08-27 21:32 ERR_PTR is in 2008-08-27 21:38 why is DATA_BTREE_ATTR called "root" ?
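[Editor's note] The ERR_PTR convention discussed above can be sketched in userspace roughly as follows. This is a minimal illustration modelled on the kernel's include/linux/err.h, not the code that went into tux3. The names new_node_sketch and new_tree_sketch and the struct node type are made up for the example. In the kernel the argument is a negative errno, as in ERR_PTR(-ENOMEM).

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Userspace imitations of the kernel helpers: the top 4095 addresses
 * of the address space are reserved to carry a negative errno. */
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct node { int level; }; /* hypothetical stand-in for a btree node */

/* Hypothetical allocator: reports -ENOMEM instead of a bare NULL. */
static struct node *new_node_sketch(int simulate_failure)
{
	struct node *node = simulate_failure ? NULL : calloc(1, sizeof *node);
	if (!node)
		return ERR_PTR(-ENOMEM);
	return node;
}

/* Hypothetical caller: hands the callee's error up unchanged. */
static struct node *new_tree_sketch(int simulate_failure)
{
	struct node *root = new_node_sketch(simulate_failure);
	if (IS_ERR(root))
		return root;
	root->level = 1;
	return root;
}
```

The point of the pattern is that the caller does not have to invent its own error code: whatever the callee reported travels up through the ordinary pointer return, and PTR_ERR recovers it at the top.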
2008-08-27 21:47 hmm 2008-08-27 21:47 because it's the root of a btre 2008-08-27 21:47 btree 2008-08-27 21:47 gives the on-disk block that's the root 2008-08-27 21:48 wouldn't "data" make more sense though 2008-08-27 21:48 in the context of the other attributes 2008-08-27 21:49 it's a kind of data attribute 2008-08-27 21:49 one of four kinds 2008-08-27 21:49 it's btree data 2008-08-27 21:49 could call it btree, but that's already taken 2008-08-27 21:49 that is the in-memory version 2008-08-27 21:50 that has a bunch of extra fields and no endian requirment 2008-08-27 21:51 ACTION reads the inode attributes post 2008-08-27 21:51 which does? 2008-08-27 21:51 struct btree 2008-08-27 21:51 like struct inode, it's the cached version 2008-08-27 21:51 of a btree root 2008-08-27 21:54 ah 2008-08-27 22:11 zzz time 2008-08-28 02:46 -!- cdk(~chinmay@121.246.36.77) has joined #tux3 2008-08-28 02:48 -!- cdk(~chinmay@121.246.36.77) has left #tux3 2008-08-28 03:12 -!- pgquiles(~pgquiles@189.Red-81-44-176.dynamicIP.rima-tde.net) has joined #tux3 2008-08-28 12:02 wow, time flies. It's already sk8 oclock 2008-08-28 12:14 hmm, does the root directory have a dirent? 2008-08-28 12:15 I don't think it does 2008-08-28 12:24 ACTION starts to draft the lkml post for next week 2008-08-28 16:19 simplifying assumption: inode attributes are always encoded in the same order 2008-08-28 16:19 I wonder if this makes things simpler 2008-08-28 16:19 hmm 2008-08-28 16:19 maybe random order is simpler 2008-08-28 16:19 always insert new attributes at the end of the list 2008-08-28 16:21 so updating attributes is a 3 step process: 0) figure out total size of attributes and old size then expand or shrink inode as necessary 1) remove attributes no longer used 2) insert new attributes at end of list 3) rewrite any attributes that changed 2008-08-28 16:21 that's 3 steps counting from 0 ;-) 2008-08-28 16:21 bh, there? 
2008-08-28 16:34 let's try this algorithm: 2008-08-28 16:34 / for each attribute from bottom to top 2008-08-28 16:34 / if the attribute changed 2008-08-28 16:34 / encode new attribute 2008-08-28 16:34 / else unless the attribute is dropped 2008-08-28 16:34 / copy old attribute 2008-08-28 16:34 / for each new attribute 2008-08-28 16:34 / encode new attribute 2008-08-28 16:35 with a small optimization to avoid doing anything if the attribute neither has to be change or moved 2008-08-28 16:47 oh, slight mistake 2008-08-28 16:47 can't shrink the inode until after decoding the attributes 2008-08-28 16:48 iattr.c:218: warning: format '%i' expects type 'int', but argument 2 has type 'long int' 2008-08-28 16:48 should be %ti 2008-08-28 16:48 ah 2008-08-28 16:49 hm you want to copy all the attributes instead of change in place? 2008-08-28 16:49 not really 2008-08-28 16:49 but for a first cut maybe it's easiest 2008-08-28 16:49 lame 2008-08-28 16:50 you can optimize it 2008-08-28 16:50 heh ok, i think it will be less code to change in place 2008-08-28 16:50 I doubt it, but show me 2008-08-28 16:50 the lame version should be working sometime later today 2008-08-28 16:51 well then i wont get to optimize it until next week 2008-08-28 16:51 ACTION is not bringing a laptop on the motorcycle 2008-08-28 16:51 incentive for you to get back with your fingers intact 2008-08-28 16:51 I'll write it especially lamely to that end 2008-08-28 16:52 the %ti warning doesn't show up on 32 bit apparently 2008-08-28 16:53 yeah since the size matches i guess 2008-08-28 16:54 those unpack and repack ops are really efficient by the way 2008-08-28 16:54 they translate into just a couple of asm instructions most of the time 2008-08-28 16:54 once declared inline 2008-08-28 17:01 weekend reading material: http://students.cs.byu.edu/~cs460ta/cs460/labs/pthreads.html 2008-08-28 17:07 wow it's sk8 oclock again 2008-08-28 17:40 -!- olgagirl(~olgagirl@ANantes-257-1-135-233.w90-32.abo.wanadoo.fr) has joined 
#tux3 2008-08-28 19:12 what's with these racy nicks from france? 2008-08-28 19:34 -!- lafille(~lafille@ANantes-257-1-135-233.w90-32.abo.wanadoo.fr) has joined #tux3 2008-08-28 19:56 cool, everything also works with 256 byte blocks 2008-08-28 19:56 including writing an exabyte sparse file 2008-08-28 23:18 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-08-28 23:45 -!- konrad(~konrad@c-24-16-77-169.hsd1.mn.comcast.net) has joined #tux3 2008-08-29 07:10 -!- pgquiles(~pgquiles@189.Red-81-44-176.dynamicIP.rima-tde.net) has joined #tux3 2008-08-29 10:17 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-08-29 11:47 -!- konrad(~konrad@c-24-16-74-109.hsd1.mn.comcast.net) has joined #tux3 2008-08-29 12:45 -!- konrad(~konrad@c-24-16-74-109.hsd1.mn.comcast.net) has joined #tux3 2008-08-29 12:45 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-08-29 20:51 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-08-30 05:49 -!- flips(~phillips@phunq.net) has joined #tux3 2008-08-30 12:08 -!- pgquiles(~pgquiles@195.Red-83-41-45.dynamicIP.rima-tde.net) has joined #tux3 2008-08-30 13:44 -!- pgquiles_(~pgquiles@161.Red-83-41-44.dynamicIP.rima-tde.net) has joined #tux3 2008-08-30 16:22 hmm, it's about sk8 oclock 2008-08-30 16:22 enough refactoring for the moment 2008-08-30 17:20 -!- pgquiles__(~pgquiles@161.Red-83-41-44.dynamicIP.rima-tde.net) has joined #tux3 2008-08-30 23:39 heh 2008-08-30 23:39 my old skates have 80mm wheels 2008-08-31 07:19 -!- pgquiles(~pgquiles@64.Red-81-44-62.dynamicIP.rima-tde.net) has joined #tux3 2008-08-31 10:36 -!- pgquiles(~pgquiles@153.Red-83-35-242.dynamicIP.rima-tde.net) has joined #tux3 2008-08-31 13:08 -!- pgquiles(~pgquiles@153.Red-83-35-242.dynamicIP.rima-tde.net) has joined #tux3 2008-08-31 17:37 tux3 is the 133rd google hit for "filesystem" 2008-08-31 17:37 this needs to be improved 2008-08-31 17:37 by posting working code of course 2008-08-31 17:37 but not only that 2008-09-01 
02:37 -!- pgquiles(~pgquiles@153.Red-83-35-242.dynamicIP.rima-tde.net) has joined #tux3 2008-09-01 02:52 -!- pgquiles(~pgquiles@110.Red-83-41-45.dynamicIP.rima-tde.net) has joined #tux3 2008-09-01 03:01 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-01 11:50 flips: revision 199 broke inode.c, can pull fix from me 2008-09-01 11:50 how broke? 2008-09-01 11:51 I only see 198 revisions in my repo 2008-09-01 11:57 removed parameter from ext2_dump_entries 2008-09-01 11:57 ah yeah revision numbers aren't global 2008-09-01 11:58 already fixed 2008-09-01 11:58 here 2008-09-01 11:58 ok ;) 2008-09-01 11:58 the new interpreter makes it much easier to find bugs 2008-09-01 11:58 I found a bunch 2008-09-01 11:58 working on an inode table leaf split corruption one now 2008-09-01 11:59 the checking functions are really badly needed 2008-09-01 11:59 inode table block 0x0/40 (8c bytes free) 2008-09-01 11:59 0x0: [0] mode 0000000 uid 0 gid 0 root 22:1 ctime 0 size 2000 2008-09-01 11:59 0xd: [40] mode 0100700 uid 0 gid 0 root 24:1 2008-09-01 11:59 0x27: [64] mode 0100700 uid 0 gid 0 root 27:1 ctime 0 size ffffffffffffff 2008-09-01 11:59 resize inum 0xd at 0x28 from 24 to 40 2008-09-01 11:59 inode table block 0x0/40 (7c bytes free) 2008-09-01 11:59 0x0: [0] mode 0000000 uid 0 gid 0 root 22:1 ctime 0 size 2000 2008-09-01 11:59 you haven't checked it in yet? 
2008-09-01 11:59 0xd: [40] mode 0100700 uid 0 gid 0 root 81c00000:24576 2008-09-01 11:59 0x27: [80] mode 0100700 uid 0 gid 0 root 27:1 ctime 0 size ffffffffffffff 2008-09-01 11:59 not yet 2008-09-01 12:00 right after this bug 2008-09-01 12:00 cool 2008-09-01 12:00 see the root attribute of inode d get messed up by the resize 2008-09-01 12:00 ah 2008-09-01 12:00 yeah 2008-09-01 12:01 for one thing, inode d isn't at offset 28, I don't know why it thinks it is 2008-09-01 12:01 anyway 2008-09-01 12:01 this one is my mess 2008-09-01 12:02 it turns out Tux3 can only do 64 petabytes with 256 byte blocks 2008-09-01 12:09 thats it?! 2008-09-01 12:09 it's because the dump was printing in decimal :p 2008-09-01 12:12 you mean pebibytes? 2008-09-01 12:12 http://en.wikipedia.org/wiki/Petabyte 2008-09-01 12:13 ACTION doesn't subscribe to that hairy footed nonsense 2008-09-01 12:13 why does anyone use base 10 anyway? 2008-09-01 12:14 when describing these things 2008-09-01 12:14 blame hard drive manufacturers i suppose 2008-09-01 12:16 why does anyone use base 10 for anything? 2008-09-01 12:16 something to do with counting on fingers and toes 2008-09-01 12:27 there was no bug 2008-09-01 12:28 intermediate state produced funny behavior 2008-09-01 12:40 on bug down 2008-09-01 12:40 parens around a conditional expression 2008-09-01 12:41 oh, that took care of two bugs 2008-09-01 12:41 nice 2008-09-01 12:41 ok time to check in 2008-09-01 12:41 now thats efficient ;) 2008-09-01 12:43 ./tux3 read --seek 72057594037927930 foodev foo <- this works 2008-09-01 12:43 reads the 64th petabyte of file foo in device foodev, with 256 byte blocks 2008-09-01 13:04 according to this ( http://blogs.netapp.com/standards_watch/2007/12/emc-netapp-dona.html ), there should be an NDMP implementation available from SNIA, which would make it easier to implement NDMP support in strigi, but yesterday I was unable to find that source code :-? 
2008-09-01 13:04 oops, wrong channel 2008-09-01 13:04 hi all, btw :-) 2008-09-01 13:04 :) 2008-09-01 13:05 flips: going to add tux3 to the Makefile? 2008-09-01 13:05 not before I have a nap 2008-09-01 13:05 fee free 2008-09-01 13:05 feel free 2008-09-01 13:05 I'm writing a little post 2008-09-01 13:05 paste the cmdline in your shell history to build it ;) 2008-09-01 13:05 which should help make a test 2008-09-01 13:06 yes 2008-09-01 13:06 so i can add it without thinking as much 2008-09-01 13:07 g99 -g -Wall -lpopt buffer.c diskio.c tux3.c -otux3 2008-09-01 14:02 hey 2008-09-01 14:02 flips: just came back last night from Burning Man 2008-09-01 14:03 burned out? 2008-09-01 14:03 eh ? 2008-09-01 14:03 no, I had a blast 2008-09-01 14:03 joke 2008-09-01 14:04 oh ok, yeah, I figured half of Google's infrastructure engineering went out there as well 2008-09-01 14:05 I checked in several thousands lines of patches while you were taking care of more important things ;-) 2008-09-01 14:06 nice, I have a lot of work todo but I'm half clueless about certain parts of the scheduler code 2008-09-01 14:07 gregory is coming up with fixes for various scheduler path issues, but I doubt that it's going to fix the latency problem. The cross locks are very problematic 2008-09-01 14:08 I'll get gone for a bit 2008-09-01 14:08 ok 2008-09-01 17:10 back 2008-09-01 18:36 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-01 19:33 flips: what does checking the integrity of leaf nodes involve? 2008-09-01 19:35 also, gcc 4.3 won't build the tux3 tests 2008-09-01 20:20 ld complains that /usr/lib/libpopt.so is incompatible 2008-09-01 20:20 perhaps because tux3 is -std=gnu99? 
2008-09-01 20:21 oh wait 2008-09-01 20:21 need to install 64-bit popt-devel 2008-09-01 20:23 konrad, hi 2008-09-01 20:24 hi 2008-09-01 20:24 konrad, it would be a great start just to check that the entries are all in non-descending order 2008-09-01 20:24 for dleaf 2008-09-01 20:25 for both dleaf and ileaf, the upside down dictionaries should contain offsets in non-descending order 2008-09-01 20:25 the bottom of the dictionary should not be below the top of the highest entry 2008-09-01 20:25 can get fancier from there, but that will already detect most corruption 2008-09-01 20:25 ok 2008-09-01 20:26 :-) 2008-09-01 20:26 sounds like a hack about to begin 2008-09-01 20:26 shh 2008-09-01 20:26 if that's what it takes :-) 2008-09-01 20:27 before that 2008-09-01 20:27 another pretty straightforward project is to add more commands to tux3.c 2008-09-01 20:27 I get some weird errors building tux3.c 2008-09-01 20:27 like "remove" 2008-09-01 20:27 ah 2008-09-01 20:27 for some reason references to the inline functions go through ld 2008-09-01 20:27 you need to build with gcc -std=gnu99 2008-09-01 20:27 I am 2008-09-01 20:27 the errors are? 2008-09-01 20:28 ACTION checks to see if make works 2008-09-01 20:28 tux3/user/test/iattr.c:95: undefined reference to `encode16' 2008-09-01 20:28 x10 2008-09-01 20:28 you 2008-09-01 20:28 um 2008-09-01 20:28 lines 98, 99, 100, 103, 104, 107, 111, 114 2008-09-01 20:29 and 128, ...
2008-09-01 20:29 some others 2008-09-01 20:29 that is odd 2008-09-01 20:29 check your compile output 2008-09-01 20:29 there are some warnings about iattr.c: In function ‘decode16’: 2008-09-01 20:29 iattr.c:20: warning: ‘be_to_u16’ is static but used in inline function ‘decode16’ which is not static 2008-09-01 20:29 just a sec 2008-09-01 20:29 ah 2008-09-01 20:29 interesting 2008-09-01 20:29 ok 2008-09-01 20:29 what it to static inline 2008-09-01 20:30 change it to static inline 2008-09-01 20:30 I'll do that right now 2008-09-01 20:30 k 2008-09-01 20:30 should I do it to, or will you tell me when to pull? 2008-09-01 20:30 just about done 2008-09-01 20:31 builds now 2008-09-01 20:31 updated in repo 2008-09-01 20:31 good 2008-09-01 20:31 what's your gcc version? 2008-09-01 20:31 gcc --version 2008-09-01 20:31 4.3.0 2008-09-01 20:31 20080428 2008-09-01 20:31 I'm 4.1.2 2008-09-01 20:32 looks like a gcc regression 2008-09-01 20:32 smells like 2008-09-01 20:32 certainly possible 2008-09-01 20:32 but I' 2008-09-01 20:32 but I'm ok with this resolution 2008-09-01 20:32 I'm pretty happy with those endian macros 2008-09-01 20:32 very efficiently implemented 2008-09-01 20:32 mhm 2008-09-01 20:32 didn't even know they were there until last week 2008-09-01 20:33 found them by accident 2008-09-01 20:33 really essential 2008-09-01 20:33 #include <- the magic words 2008-09-01 20:34 yep 2008-09-01 20:34 I'm looking at that right now 2008-09-01 20:34 29 /* Return a value with all bytes in the 16 bit argument swapped. */ 2008-09-01 20:34 30 #define bswap_16(x) __bswap_16 (x) 2008-09-01 20:34 does it do-the-right-thing on big endian archs? 
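[Editor's note] A likely explanation for the link errors above is that gcc 4.3 in -std=gnu99 mode follows C99 inline semantics, where a plain inline definition does not provide an external definition for the linker, while gcc 4.1 still used the older GNU semantics. Declaring the helpers static inline sidesteps the difference. The sketch below shows the shape of such helpers under that assumption. The names encode16_sketch and decode16_sketch are hypothetical, and the bodies use shifts rather than the bswap macros mentioned in the log, so they produce big endian bytes on any host.

```c
#include <stdint.h>

/* Hypothetical helper: store val as two big endian bytes and return a
 * pointer just past them, so that calls can be chained. */
static inline void *encode16_sketch(void *at, uint16_t val)
{
	unsigned char *p = at;
	p[0] = (unsigned char)(val >> 8);   /* most significant byte first */
	p[1] = (unsigned char)(val & 0xff);
	return p + 2;
}

/* Hypothetical helper: load two big endian bytes into *val and return
 * a pointer just past them. */
static inline const void *decode16_sketch(const void *at, uint16_t *val)
{
	const unsigned char *p = at;
	*val = (uint16_t)((p[0] << 8) | p[1]);
	return p + 2;
}
```

Because the byte order is spelled out with shifts, there is nothing to swap or skip on a big endian host. A bswap based version would instead compile to nothing there and to a single instruction on x86.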
2008-09-01 20:34 you just did your first bug hunt ;-) 2008-09-01 20:34 yay 2008-09-01 20:35 I still get a couple format string warnings 2008-09-01 20:35 bswap is a 2 byte asm instruction as I recall 2008-09-01 20:35 runs at superscaler speed these days I think - can do more than one bswap per cycle 2008-09-01 20:35 what about on ppc, m68k or other BE archs? 2008-09-01 20:35 in other words, as close to free as it gets 2008-09-01 20:35 does it omit those or still try to swap? 2008-09-01 20:36 there are similar instructions on some of the other arches 2008-09-01 20:36 but since the native resolution for tux3 is bigendian, ppc is fine 2008-09-01 20:36 no swapping to do 2008-09-01 20:36 k 2008-09-01 20:36 big endian is _way_ nicer for debugging 2008-09-01 20:37 big endian is way nicer for everything :) 2008-09-01 20:37 bummer x86 isn't big endian 2008-09-01 20:37 had to stare at the hexdumps sometimes for a while, wondering why the lsb was up at the high end of the struct, I'm so used to the braindamaged intel order 2008-09-01 20:37 x86 is most braindamage, that happened to be implemented better than the other guys 2008-09-01 20:37 too bad about that 2008-09-01 20:38 motorola really blew it 2008-09-01 20:38 ibm's doing ok with ppc 2008-09-01 20:38 but I'd like to see them do more on power efficiency 2008-09-01 20:38 ppc rules the console world 2008-09-01 20:38 which is getting to be more machines than the biz world even 2008-09-01 20:39 more than certain search engine operators even 2008-09-01 20:41 Are: 2008-09-01 20:41 inode.c:421: warning: format ‘%Lx’ expects type ‘long long unsigned int’, but argument 2 has type ‘block_t’ 2008-09-01 20:41 tux3.c:139: warning: format ‘%Li’ expects type ‘long long int’, but argument 2 has type ‘u64’ 2008-09-01 20:41 tux3.c:175: warning: format ‘%Li’ expects type ‘long long int’, but argument 2 has type ‘u64’ 2008-09-01 20:41 supposed to happen? 
2008-09-01 20:46 cast the printf arg using (L) 2008-09-01 20:47 that's a tux3 macro, same as (long long unsigned) 2008-09-01 20:47 I'm running 32 bit here, so you need to post your hg patch to the mailing list 2008-09-01 20:48 ACTION has to get his hammer machine online 2008-09-01 20:50 right 2008-09-01 20:51 that's what I thought, so I made the patch already 2008-09-01 20:51 was just waiting on posting it 2008-09-01 20:55 tux3 will need to go GPLv2 at some point to get into the kernel, no? 2008-09-01 21:02 applied, thanks 2008-09-01 21:03 that is correct, there is a post about that 2008-09-01 21:04 I reserve the right to relicense tux3, including the downgrade to v2 for the kernel port 2008-09-01 21:04 I suppose I should ping Eben and see if he likes my slight hack of his license ;-) 2008-09-01 21:16 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2008-09-01 21:16 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3 2008-09-01 21:51 flips: what are the offsets? 2008-09-01 21:51 offsets? 
2008-09-01 21:51 oh 2008-09-01 21:51 in dleaf 2008-09-01 21:51 16 bit address within a dleaf 2008-09-01 21:51 looks like there's one per entry 2008-09-01 21:51 ok 2008-09-01 21:51 or 16 bit offset from the beginning of data in an ileaf 2008-09-01 21:52 ileaf would be an easier place to start 2008-09-01 21:52 dleaf has a two level index, both levels upside down, which is a little confusing 2008-09-01 21:52 heh 2008-09-01 21:53 the other confusing detail is that the 0th index entry is not actually represented, it is assumed to be zero 2008-09-01 21:53 for both dleaf and ileaf 2008-09-01 21:54 I think I might be able to code an inline to make that a little clearer 2008-09-01 22:01 another complication for ileaf is that leaf->count is allowed to be zero 2008-09-01 22:02 that means that dict[-i] can be invalid either because i is zero or leaf->count is zero, which usually implies the same, but not for ileaf 2008-09-01 22:02 which allows offsets higher than leaf->count 2008-09-01 22:03 because the ileaf dictionary can be extended to accommodate. 2008-09-01 22:57 how is the size of the zeroth inode an ileaf found? 2008-09-01 22:59 inode of an ileaf* 2008-09-01 23:01 er, if leaf->count is zero 2008-09-01 23:02 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-01 23:07 http://pastie.org/264354.txt <-- does that look about right?
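[Editor's note] The two dictionary checks described earlier (offsets in non-descending order, and the bottom of the dictionary not below the top of the highest entry) might look roughly like this. The layout is a simplification assumed for the example, not the real tux3 ileaf format: dict points just past the end of the block, dict[-1] through dict[-count] are the stored offsets in host byte order, the offset of entry zero is implied to be zero, and room is the number of bytes from the start of the data area to the end of the block. The function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical check of an upside down offset dictionary.
 * Returns 0 if the dictionary looks sane, -1 otherwise. */
static int dict_check_sketch(const uint16_t *dict, unsigned count, unsigned room)
{
	unsigned prev = 0; /* offset of entry zero is implied, not stored */

	/* refuse to read a dictionary that cannot fit in the block */
	if (count * sizeof(uint16_t) > room)
		return -1;

	for (unsigned i = 1; i <= count; i++) {
		unsigned offset = dict[-(int)i];
		if (offset < prev)
			return -1; /* offsets must be non-descending */
		prev = offset;
	}

	/* the highest entry must end at or below the bottom of the dictionary */
	if (prev + count * sizeof(uint16_t) > room)
		return -1;
	return 0;
}
```

With count equal to zero the loop never touches dict[-1], which matches the remark in the log that the entry does not exist for an empty leaf.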
2008-09-01 23:17 er 2008-09-01 23:18 I guess dict[-1] exists even if there's only the zeroth inum 2008-09-01 23:18 therefore dict[-btree->entries_per_leaf - 1] is part of the dict too 2008-09-02 00:06 alright I'm heading to bed 2008-09-02 00:13 sorry, didn't notice your chat 2008-09-02 00:13 if you type "flips" then the tab lights up 2008-09-02 00:15 konrad, your writeup is accurate 2008-09-02 00:16 konrad, if leaf->count is zero, dict[-1] does not exist either 2008-09-02 00:17 valgrind will complain if you try to pretend it exists ;-) 2008-09-02 00:17 ACTION loves valgrind 2008-09-02 06:27 flips: right 2008-09-02 06:42 sorry for the exploded code 2008-09-02 09:30 it was nice code 2008-09-02 09:30 now it's compressed code ;-) 2008-09-02 09:31 yay 2008-09-02 09:32 now what? 2008-09-02 09:49 more checks? 2008-09-02 09:49 let me see 2008-09-02 09:49 what else can be checked about ileaf 2008-09-02 09:50 could check for unknown attributes 2008-09-02 09:50 or could tackle dleaf, much harder 2008-09-02 09:50 I'll do the former then start on the latter, I guess. Sound good? 2008-09-02 09:51 sounds good 2008-09-02 09:51 excellent 2008-09-02 09:51 attr_check 2008-09-02 09:52 I wonder why mailman fails to post some list posts 2008-09-02 09:53 I see your answer to masoud, but not masoud's post 2008-09-02 09:53 ah 2008-09-02 09:53 ACTION looks for logs 2008-09-02 09:53 he didn't write to list 2008-09-02 09:53 just replied to me 2008-09-02 09:53 ah right 2008-09-02 09:53 so I CC'd the list 2008-09-02 09:53 replying back to list is good 2008-09-02 09:59 ok time for me to start another hack 2008-09-02 09:59 truncate I think it was 2008-09-02 10:04 yep 2008-09-02 10:17 hm, how do I setup an hg username? 2008-09-02 10:18 (and anything else it needs) 2008-09-02 10:20 flips: http://pastie.caboo.se/264630 <-- look ok?
2008-09-02 10:25 konrad, also need to check that the attribute list ends exactly at the size limit 2008-09-02 10:25 and do that without accessing out of bounds 2008-09-02 10:25 slightly tricky 2008-09-02 10:25 the neat thing about an rcs like hg is you don't have to ask permission or have a user name 2008-09-02 10:26 it makes commits to the local repo nicer 2008-09-02 10:26 what do you mean by the list ends at the size limit? 2008-09-02 10:27 the attributes are all variable sizes 2008-09-02 10:28 so you need to do that attr = decode(attr...) thing 2008-09-02 10:28 checking that the resulting pointer is not out of range 2008-09-02 10:28 why not just look up the size and check that? 2008-09-02 10:29 sure 2008-09-02 10:29 which is exactly what decode* does 2008-09-02 10:29 ah 2008-09-02 10:29 the magic numbers 6 and 10 should be replaced by constants, you can add those constants to the enum 2008-09-02 10:30 flips: did you see the post about performance on the zumastor list? 2008-09-02 10:30 k 2008-09-02 10:30 mornin' all 2008-09-02 10:30 have not yet 2008-09-02 10:30 hiyah 2008-09-02 10:30 konrad, which post is that 2008-09-02 10:31 konrad: good work, welcome :) 2008-09-02 10:31 flips: which post is what? 2008-09-02 10:31 shapor: thanks 2008-09-02 10:31 oh 2008-09-02 10:31 I'm not on the sumastor list 2008-09-02 10:31 shapor said it 2008-09-02 10:31 :D 2008-09-02 10:31 let me see 2008-09-02 10:32 flips: Subject: Re: RHEL5 2.6.18 support? 2008-09-02 10:33 yes 2008-09-02 10:33 good post 2008-09-02 10:33 and we have the answer: tux3 + backport to zumastor 2008-09-02 10:34 flips: should attr_check fail if the size of an attr is less than 2, or is that allowed? 
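The bounds check discussed above (walk variable-size attributes, never read past the end, and require the list to end exactly at the size limit) might look roughly like the sketch below. Everything concrete in it is an assumption for illustration: a two-byte header with the kind in the top four bits, the constant names, and the per-kind payload sizes in attr_size are made up and are not the tux3 on-disk format. Only the kind range 6 to 10 echoes the magic numbers mentioned in the chat.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants standing in for the magic numbers 6 and 10. */
enum { ATTR_HEAD = 2, MIN_KIND = 6, MAX_KIND = 10 };

/* Made-up payload sizes per attribute kind, in bytes. */
static const uint8_t attr_size[16] = {
	[6] = 14, [7] = 12, [8] = 8, [9] = 4, [10] = 8,
};

/* Returns 1 if the attributes in [attrs, attrs + size) are all of a known
 * kind and the last one ends exactly at the limit, otherwise 0. */
static int attr_check(const void *attrs, unsigned size)
{
	const uint8_t *p = attrs, *limit = p + size;

	while (p < limit) {
		if (limit - p < ATTR_HEAD)
			return 0;	/* not even room for a header */
		unsigned kind = p[0] >> 4;
		if (kind < MIN_KIND || kind > MAX_KIND)
			return 0;	/* unknown attribute kind */
		unsigned need = ATTR_HEAD + attr_size[kind];
		if ((size_t)(limit - p) < need)
			return 0;	/* attribute would overrun the limit */
		p += need;
	}
	return 1;	/* p advanced only by amounts that fit, so p == limit */
}
```

Because the pointer is advanced only after checking that the whole attribute fits, the loop can exit only with p equal to limit, which gives the "ends exactly at the size limit" property without a separate test.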
2008-09-02 10:34 allowed I think, but there is no attribute with that size 2008-09-02 10:35 right 2008-09-02 10:35 I mean 2 including the header 2008-09-02 10:35 that's a bug 2008-09-02 10:35 which is itself 2 bytes 2008-09-02 10:35 ok, I'll fail if that happens 2008-09-02 10:35 headers are never less than 2 bytes, I don't see changing that 2008-09-02 10:35 we're not quite that insane about compression 2008-09-02 10:35 ok 2008-09-02 10:38 "The long and short of truncate" -- new post coming 2008-09-02 10:39 flips: http://pastie.caboo.se/264644 2008-09-02 10:41 konrad, there can be multiple attributes per leaf entry 2008-09-02 10:41 attr_check should not know about dictionary format at all 2008-09-02 10:41 just take (base, size) 2008-09-02 10:42 hm? 2008-09-02 10:42 to set up a unit test, you need to actually encode some attributes, so this function belongs in iattr.c rather than ileaf.c 2008-09-02 10:42 ah 2008-09-02 10:44 attr_check(void *attrs, unsigned size)? 2008-09-02 10:45 right 2008-09-02 10:45 k 2008-09-02 10:45 would return yes/no I think 2008-09-02 10:45 and the caller would complain 2008-09-02 10:45 maybe 2008-09-02 10:53 hm 2008-09-02 10:53 in encode_attrs() in iattr.c 2008-09-02 10:53 the for loop does kind from 0 to 32 2008-09-02 10:53 when kind only gets 4 bits on disk 2008-09-02 10:54 yes, sloppy 2008-09-02 10:54 ;-) 2008-09-02 10:54 :) 2008-09-02 10:54 feel free to improve 2008-09-02 10:54 the reason the lowest attr kind is not zero is, catches more bugs if it isn't 2008-09-02 10:54 right, I saw that earlier 2008-09-02 10:55 I think attr kind zero will only get used when all 15 others are used 2008-09-02 10:55 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-02 10:55 and then it will likely mean "just pad this" 2008-09-02 10:55 heh 2008-09-02 10:55 we might declare at some point that attributes are always padded to even numbers of bytes 2008-09-02 10:56 or we might allow odd numbers 2008-09-02 10:56 then I'd have to recant the
above statement about no attr kind less than 2 bytes 2008-09-02 10:56 we'd introduce at least one one byte attr 2008-09-02 10:56 "noop" 2008-09-02 10:56 or pad 2008-09-02 10:57 so that we can update attrs in some cases without moving everything in the leaf 2008-09-02 10:57 future optimization 2008-09-02 10:57 anyway, just to say the design might evolve a little some weeks down the road 2008-09-02 10:58 for now it is a two-byte granularity 2008-09-02 10:58 that means when immediate data attributes get added, they need to be padded out 2008-09-02 10:59 hey maze 2008-09-02 11:01 flips: more like this? http://pastie.caboo.se/264662 2008-09-02 11:01 exactly like that I think 2008-09-02 11:02 hello 2008-09-02 11:02 hey 2008-09-02 11:02 ready for your vfs tutorial? 2008-09-02 11:04 <- maze 2008-09-02 11:04 right now? no, sorry, I'm in the middle of a big turnup which is already behind schedule 2008-09-02 11:04 konrad, you can use the new enum you just declared for both the lower and upper limit of the encode loop 2008-09-02 11:04 no I meant in general 2008-09-02 11:05 it's going to be a long tutorial ;-) 2008-09-02 11:05 ah ok 2008-09-02 11:05 period of 3 weeks I'd think 2008-09-02 11:05 in general? haven't really had the time :-( to do much - as in almost none. 2008-09-02 11:05 I need a vacation... 2008-09-02 11:05 at the end you get the "phillips certificate of vfs competency" 2008-09-02 11:05 and the right to flame newbies on lkml 2008-09-02 11:05 well worth having 2008-09-02 11:06 cool ;-) I'd love to. 2008-09-02 11:06 ACTION listens carefully and hears the sound of google data centers burning down 2008-09-02 11:06 should be able to run a class of two, shapor is about ready for this 2008-09-02 11:07 konrad too I think 2008-09-02 11:07 I'll listen in certainly 2008-09-02 11:07 they're not burning too quickly at least ;-) 2008-09-02 11:08 maze, did you notice your comments on that fat key space were highly relevant? 
2008-09-02 11:08 hammer essentially implements what you suggested 2008-09-02 11:08 so the idea is far from useless 2008-09-02 11:08 I'm not sure what you're referring to ;-) fat key space? 2008-09-02 11:09 your "beautiful idea" you had afterthe initial tux3 whiteboarding 2008-09-02 11:09 to incorporate the file offset in the btree key 2008-09-02 11:09 hammer does that 2008-09-02 11:09 ok, right that one 2008-09-02 11:09 just like you imagined 2008-09-02 11:09 I did rather like that one 2008-09-02 11:09 tux3 does not, that is the main difference between them, and the allocation method 2008-09-02 11:10 it makes a beautifully simple design 2008-09-02 11:10 exactly... so why doesn't tux3 use it? 2008-09-02 11:10 but my guess is, tux3 will end up faster as it is more cache efficient to have a two level tree 2008-09-02 11:10 the number of probes is the same 2008-09-02 11:11 hmm, I really htink it should work better with just one 2008-09-02 11:11 not probes, but btree compares 2008-09-02 11:11 I ran the numbers in detail 2008-09-02 11:11 hmm, really? interesting. 2008-09-02 11:11 having a single tree means a deeper tree 2008-09-02 11:11 it works out exactly 2008-09-02 11:11 true 2008-09-02 11:11 log(something) either way 2008-09-02 11:11 and it also probably means less children per node because of larger keys... 2008-09-02 11:12 it does 2008-09-02 11:12 yes, but... 2008-09-02 11:12 it should spread out better over the entire filesystem 2008-09-02 11:12 hammer: 64 tux3: 256 or 512 2008-09-02 11:12 so instead of access being log(# files) + log(size of file) 2008-09-02 11:12 tux3: sometimes 384 2008-09-02 11:12 you have access being something like (log used disk space) 2008-09-02 11:13 - extents 2008-09-02 11:13 in tux3? 
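To put rough numbers on the one-tree versus two-tree comparison, the sketch below counts btree levels for a given fanout. The fanouts 64 and 256 are the ones quoted in the chat; the file and extent counts used in the usage note are purely illustrative assumptions, and btree_depth is a hypothetical helper, not part of either filesystem.

```c
/* Number of levels a btree with the given fanout needs to index n keys,
 * assuming completely full nodes. Intended for n up to about 2^40 and
 * fanout up to 512, so the running product cannot overflow. */
static unsigned btree_depth(unsigned long long n, unsigned fanout)
{
	unsigned depth = 1;
	unsigned long long reach = fanout;	/* keys reachable with 'depth' levels */

	while (reach < n) {
		reach *= fanout;
		depth++;
	}
	return depth;
}
```

For example, with 2^20 files of 2^10 extents each, a single tree keyed on (inode, offset) holds 2^30 keys and needs 5 levels at fanout 64, while the two-level layout needs 3 levels for the inode table plus 2 levels for the per-file tree at fanout 256. The level counts happen to match in this example; the difference argued in the chat is in which of those levels stay cached.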
2008-09-02 11:13 no comparison of fat vs thin btrees 2008-09-02 11:13 it's mainly log(inode table size) in tux3 2008-09-02 11:14 and the inodes are cached 2008-09-02 11:14 so that disappears mostly 2008-09-02 11:14 leaving the nice little per-file btrees 2008-09-02 11:14 so I guess the two level approach will significantly outperform, just a constant factor but a big one 2008-09-02 11:14 I just want the metadata on a different disk/ram/flash-backed/etc ;-) 2008-09-02 11:15 ah, that is coming 2008-09-02 11:15 my answer to zfs's mess 2008-09-02 11:15 is a rather nice hack 2008-09-02 11:15 I really really like forward logging 2008-09-02 11:15 involving having tux3 work together with lvm3 2008-09-02 11:15 although I hate the fact there's that 0.0001% chance of it breaking 2008-09-02 11:15 the forward logging thing is working out design wise, I should incorporate it into the userspace prototype now 2008-09-02 11:16 the chance is nowhere that big 2008-09-02 11:16 its completely under our control 2008-09-02 11:16 and we will have an option to disable it completely, just for the ultraparanoid 2008-09-02 11:16 the really trick there is to use a sufficiently paranoid checksumming signature 2008-09-02 11:16 with the "phase" commit philosophy, it will still be efficient even without relying on a hash 2008-09-02 11:17 right 2008-09-02 11:17 and even then - it can only fail on non-clean remount 2008-09-02 11:17 and the checksum can be avoided completely in significant cases 2008-09-02 11:17 right again 2008-09-02 11:17 which should be rare... 
so the chance of failure should be as close to '0' as can be while still theoretically possible 2008-09-02 11:18 so, the really nice thing is, when you have a whole bunch of transactions ready to commit, the forward logging can be done without any hash: wait for transaction completions, then mark complete in a known location 2008-09-02 11:18 that will be the ultra paranoid option 2008-09-02 11:19 I would think a 64 bit decent hash would get close to 0 error chance 2008-09-02 11:19 calculating the hashes should be cheap 2008-09-02 11:19 maybe make that configurable 2008-09-02 11:19 yes 2008-09-02 11:19 so long as its not m5 2008-09-02 11:19 we're not talking about calculating the hash from a large amount of data 2008-09-02 11:19 md5 2008-09-02 11:19 or something like that 2008-09-02 11:20 even if it's md5, it's still fast, because it's much-much faster than a disk seek 2008-09-02 11:20 zfs and btrfs find its a significant cost if they checksum everything 2008-09-02 11:20 and you'd be hashing something like 256 bytes or so 2008-09-02 11:20 about the biggest bottleneck in fact 2008-09-02 11:20 its really important to have an efficient hash 2008-09-02 11:21 oh, no, I thought of it as literally a block signature for the superblock 2008-09-02 11:21 not for everything else 2008-09-02 11:32 right 2008-09-02 11:33 I think, just checksum all the _used_ data in the commit block and part of the data blocks 2008-09-02 11:33 right, that'd be nice - use something like a crc32 (cpu support) for that - maybe two crc32's in parallel (or a crc64 if sse will support that) 2008-09-02 11:34 it can be an option whether we rely on the checksum to know that the data part of the transaction got onto media, or wait for completion on data before submitting the commit block 2008-09-02 11:34 cpu support for crs32? 
2008-09-02 11:34 but I was thinkning of each block in the forward log and the superblock having a sort of tail signature which would look kind of like 2008-09-02 11:34 I don't know that instruction ;-) 2008-09-02 11:34 crc32 - yeap, coming in sse4.1 or so 2008-09-02 11:34 oh, that's too bad 2008-09-02 11:34 should be out in nehalem or even earlier 2008-09-02 11:34 crc32 sucks 2008-09-02 11:34 for hashing 2008-09-02 11:34 well.... 2008-09-02 11:35 you should be able to do a crc32*4 easily enough 2008-09-02 11:35 I hope it's not crc32 specific 2008-09-02 11:35 it is 2008-09-02 11:35 still not good 2008-09-02 11:35 crc32 has funnels 2008-09-02 11:35 lots of them 2008-09-02 11:35 yes, well... 2008-09-02 11:35 bleah 2008-09-02 11:35 ACTION hates intel 2008-09-02 11:35 I wish they supported md5/sha1 and aes in the cpu 2008-09-02 11:35 it's a powerful argument for using a substandard hash 2008-09-02 11:36 double bleah 2008-09-02 11:36 SSE4.2 Instruction Description CRC32 Accumulate CRC32C value using the polynomial 0x11EDC6F41 (or, without the high order bit, 0x1EDC6F41).[5] 2008-09-02 11:37 Nehalem and on, so next year 2008-09-02 11:37 I'll ping a mathematician to analyze it 2008-09-02 11:38 see if we can make something useful out of that turd 2008-09-02 11:38 I'd hate to incorporate crc32 into tux3 on-disk format just because intel farted 2008-09-02 11:38 we'll see what amd comes up with 2008-09-02 11:38 right 2008-09-02 11:38 in fact 2008-09-02 11:38 I know who to talk to about that 2008-09-02 11:39 amd is about to go intel one better 2008-09-02 11:39 lol, how? 
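For reference, the CRC32C that the SSE4.2 instruction accumulates can be computed in portable C as sketched below. This is a plain bit-at-a-time version for illustration, not tux3 code and not a statement about which checksum tux3 will use; 0x82F63B78 is the bit-reversed form of the polynomial 0x1EDC6F41 quoted above, and the usual initial value and final inversion are assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32C (Castagnoli), reflected form, initial value 0xFFFFFFFF,
 * result inverted at the end. Slow but dependency free. */
static uint32_t crc32c(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t crc = 0xFFFFFFFFu;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			/* shift right, xor in the polynomial if the low bit was set */
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
	}
	return ~crc;
}
```

The standard check input "123456789" gives 0xE3069283 with these parameters, which is a convenient way to compare a software fallback against a hardware path.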
2008-09-02 11:39 and I'd be happy to run a few cycles slower on intel just to force intel to do it right 2008-09-02 11:39 heh 2008-09-02 11:39 sekrit 2008-09-02 11:39 ok I need to get my mathematical ducks in a row for this 2008-09-02 11:40 AMD claims SSE5 will provide dramatic performance improvements, particularly in high performance computing (HPC), multimedia and computer security applications, including a 5x performance gain for Advanced Encryption Standard (AES) encryption and a 30% performance gain for discrete cosine transform (DCT) used to process video streams.[1] 2008-09-02 11:40 that's more like it 2008-09-02 11:40 I'll go get into the nda loop there 2008-09-02 11:40 AMD's) SSE5 does not include all (Intel's) SSE4 instructions. In other words, it is not a superset of SSE4 but a competitor to it. Likewise, Intels pre-Nehalem cores contain only a partial implementation of SSE4, called SSE4.1. This poses some difficulty and extra work for compilers and assembly-level hand tuning of code 2008-09-02 11:40 make sure amd is a tux-ready machine 2008-09-02 11:43 SSE5 includes: 2008-09-02 11:43 Fused multiply-accumulate (FMACxx) instructions Integer multiply-accumulate (IMAC, IMADC) instructions Permutation (PPERM, PERMPx) and conditional move (PCMOV) instructions Precision control, rounding, and conversion instructions 2008-09-02 11:44 note the permutation stuff 2008-09-02 11:44 probably what gives the aes boost 2008-09-02 11:44 should be useable for hash/crypt stuff as well 2008-09-02 11:45 noted 2008-09-02 11:45 that's the right way to do it 2008-09-02 11:45 it's perfect 2008-09-02 11:45 amd rulez 2008-09-02 11:45 intel suckorz 2008-09-02 11:45 sukzorz 2008-09-02 11:55 the fused multiply will also mean a huge amount for flops freaks everywhere ;-) 2008-09-02 11:56 ie. 
anybody doing anything high-precision 2008-09-02 11:58 :-) 2008-09-02 11:58 ACTION is a flops freak 2008-09-02 12:06 oh, weird, wonder how I managed to do that 2008-09-02 12:07 yes, odd 2008-09-02 12:07 indeed 2008-09-02 12:08 edit without compile most probably 2008-09-02 12:08 thought I did compile though 2008-09-02 12:08 odd 2008-09-02 12:12 scamjet time 2008-09-02 12:16 konrad, I tghi 2008-09-02 12:16 konrad, I think you have ileaf under control 2008-09-02 12:16 dleaf is 10x harder ;-) 2008-09-02 12:16 maybe 100x 2008-09-02 12:17 heh 2008-09-02 12:17 I suggest shapor for code review on that 2008-09-02 12:17 ok 2008-09-02 12:18 ACTION runs and hides 2008-09-02 12:19 not quick enough 2008-09-02 12:25 tux3 is the... 6th google result for tux3 2008-09-02 12:26 I get first 10 2008-09-02 12:27 bbl 2008-09-02 12:28 'tux 3' 2008-09-02 12:33 interesting 2008-09-02 12:33 http://pastie.caboo.se/264727 <-- building tux3 on my ppc machine 2008-09-02 12:42 comes from trace.h 2008-09-02 12:43 flips: ping 2008-09-02 13:58 hey 2008-09-02 14:32 konrad, pong 2008-09-02 14:32 why is there an asm("int3") in trace.h? 2008-09-02 14:37 it generates a trap into gcc on assert failure 2008-09-02 14:37 really useful 2008-09-02 14:37 sorry 2008-09-02 14:37 doesn't work on non-x86 2008-09-02 14:37 :( 2008-09-02 14:37 trap into gdb 2008-09-02 14:37 just comment it out 2008-09-02 14:38 did 2008-09-02 14:38 and hunt around for something that does work 2008-09-02 14:38 it's really useful 2008-09-02 14:38 you can put "b break" into your gdb .rc 2008-09-02 14:38 and void break(void) { } 2008-09-02 14:38 called from assert 2008-09-02 14:43 konrad, what non-x86 do you run on? 2008-09-02 14:43 ppc 2008-09-02 14:43 mac? 
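A portable stand-in for the int3 trap could follow the breakpoint-hook idea from the chat. Note that break itself is a C keyword and cannot be used as a function name, so the sketch uses the hypothetical name debug_break; check and debug_trap are likewise made-up names, and the counter exists only so the hook is not an empty function the compiler might drop.

```c
#include <signal.h>
#include <stdio.h>

/* Breakpoint hook: put "b debug_break" in .gdbinit to stop here. */
static unsigned break_hits;
void debug_break(void) { break_hits++; }

/* Optional hard trap: int3 on x86, SIGTRAP elsewhere (for example ppc). */
#if defined(__i386__) || defined(__x86_64__)
#define debug_trap() __asm__ volatile ("int3")
#else
#define debug_trap() raise(SIGTRAP)
#endif

/* Evaluates to 1 if expr holds; otherwise reports it, calls the hook
 * and evaluates to 0. */
#define check(expr) \
	((expr) ? 1 : (fprintf(stderr, "%s:%d: check failed: %s\n", \
		__FILE__, __LINE__, #expr), debug_break(), 0))
```

Under gdb the breakpoint on debug_break stops at the failing check on any architecture; outside gdb the program just keeps running, which is the main difference from the hard trap.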
2008-09-02 14:44 ibook 2008-09-02 14:44 cool 2008-09-02 14:44 perfect for checking endian issues 2008-09-02 14:44 and wordsize 2008-09-02 14:44 yep 2008-09-02 14:44 all of ileaf and dleaf have to be converted for endian at some point 2008-09-02 14:44 not right away 2008-09-02 14:53 ACTION is back from Burning Man 2008-09-02 14:53 I feel great 2008-09-02 15:00 me too 2008-09-02 15:01 by the way, what is it that makes you feel great? (only the legal part please) 2008-09-02 15:03 I love this uniden phone system 2008-09-02 15:03 got the 8 series corded base station about 4 years ago 2008-09-02 15:03 its still the best home phone system on the planet 2008-09-02 15:04 just got two new handsets for it, the upgraded 905 series work fine 2008-09-02 15:04 and they're better than the original handsets 2008-09-02 15:04 almost like cell phones 2008-09-02 15:08 flips: I don't do drugs as a rule 2008-09-02 15:08 never really did 2008-09-02 15:08 hard to explain, it's just the overall intensity of the experience 2008-09-02 15:08 like a rage? 2008-09-02 15:09 having such community orientied people really disarms the typical resistence you'd might have dealing with people in a city 2008-09-02 15:09 ah, people not being aholes 2008-09-02 15:09 I get it 2008-09-02 15:09 that's a medium for other things, art, partying, etc... 2008-09-02 15:09 even aholes pretending not to be 2008-09-02 15:09 you'd like it 2008-09-02 15:09 I know I would 2008-09-02 15:09 it's like everthing wrong with US society reversed. 2008-09-02 15:09 kids not compatible I'd think 2008-09-02 15:10 no, folks bring their kids 2008-09-02 15:10 ah 2008-09-02 15:10 then next year for sure 2008-09-02 15:10 it's not a big deal, just avoid certain camps and you're set 2008-09-02 15:10 certain camps where... what? is happening 2008-09-02 15:10 they aren't exhibiting that stuff openly anyways, so it's no big deal 2008-09-02 15:10 death yoga? 
2008-09-02 15:10 porn & eggs 2008-09-02 15:10 spike's 2008-09-02 15:10 stuff like that 2008-09-02 15:10 ic 2008-09-02 15:11 right 2008-09-02 15:12 not any worse than a goth festival I'd think 2008-09-02 15:13 you'd like that 2008-09-02 15:13 german version 2008-09-02 15:13 not really 2008-09-02 15:13 I guarantee it 2008-09-02 15:13 for one thing, there's a high concentration of ubergeeks 2008-09-02 15:14 yeah, your infrastructure engineering group is out there for sure, Tim Hockin 2008-09-02 15:14 the death guild camp is full f nerds as well 2008-09-02 15:14 larry & sergey even 2008-09-02 15:17 handset #4 now online, my home pbx is good for another 2 years 2008-09-02 15:18 going to celebrate with some french roast 2008-09-02 15:20 how's tux3 going ? 2008-09-02 15:20 any of my suggestions been thought about futher ? 2008-09-02 15:20 further ? 2008-09-02 15:20 oh yes 2008-09-02 15:21 I'm getting ready to set up a nice environment for you to develop the locking ;-) 2008-09-02 15:21 you'll see growth of the project with more folks joining when you get more stuff working 2008-09-02 15:21 oh shit 2008-09-02 15:21 that's true 2008-09-02 15:21 it's already happening 2008-09-02 15:21 good 2008-09-02 15:21 major stuff now works, see tux3.c 2008-09-02 15:21 yeah, because I don't have faith in Linux file systems after seeing a bunch of NetApp code 2008-09-02 15:21 can create and read/write a tux3 volume from shell commands now 2008-09-02 15:22 nice 2008-09-02 15:22 really did make a 64 petabyte file in an 8k volume image 2008-09-02 15:22 that's with 4K spare for the boot loader 2008-09-02 15:23 decided to make the tux3 superblock 1K just to have that work out ;-) 2008-09-02 15:23 that leaves 12 256 byte blocks for the filesystem structure, root directory, bitmaps, inode table 2008-09-02 15:23 is this all you're doing at Google right now ? 
2008-09-02 15:23 you could say that 2008-09-02 15:24 but its actually part time 2008-09-02 15:24 you should see me when I work ;-) 2008-09-02 15:29 "I don't like the flashing red light in the upper left hand corner of each handset. This is a charge indicator that lets you know the phone is charged and ready to go. There is nothing wrong with letting consumers know this, but to have a light that continuously flashes can be a tremendous distraction." -- amazon idiot who doesn't know he owns a digital answering machine 2008-09-02 15:36 maybe I will pthread tux3 before doing delete 2008-09-02 15:36 just for bh 2008-09-02 15:40 flips: how fine-grained are you planning on going with locking? 2008-09-02 15:40 very 2008-09-02 15:40 ask bh ;-) 2008-09-02 15:40 leaf? 2008-09-02 15:40 yes 2008-09-02 15:40 hrm will you do that in the generic btree code? 2008-09-02 15:40 yes 2008-09-02 15:40 with the help of pthreads 2008-09-02 15:40 and futexes 2008-09-02 15:41 where are you planning on storing the locks? 2008-09-02 15:41 bh is going to have fun with it ;-) 2008-09-02 15:41 in the buffer heads 2008-09-02 15:41 or in a hash 2008-09-02 15:41 it's in flux 2008-09-02 15:41 either would work in kernel 2008-09-02 15:41 so i'm guessing locks in the intermediate nodes as well? 
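Per-buffer locks of the kind being discussed could be prototyped in userspace with pthreads along these lines. This is a generic sketch under assumptions, not the planned tux3 design: struct buffer here is a stand-in, and taking two locks in address order is just one common way to keep two-buffer operations free of lock-order deadlocks. A real btree would more likely order its locks by position in the tree.

```c
#include <pthread.h>
#include <stdint.h>

/* Stand-in for a buffer head carrying its own lock; block data omitted. */
struct buffer {
	pthread_mutex_t lock;
};

/* Pick which of two buffers is locked first: always the lower address,
 * so every thread agrees on the order. */
static struct buffer *lock_first(struct buffer *a, struct buffer *b)
{
	return (uintptr_t)a < (uintptr_t)b ? a : b;
}

/* Lock both buffers in the agreed order; a and b may be the same buffer. */
static void lock_pair(struct buffer *a, struct buffer *b)
{
	struct buffer *first = lock_first(a, b);
	struct buffer *second = first == a ? b : a;

	pthread_mutex_lock(&first->lock);
	if (second != first)
		pthread_mutex_lock(&second->lock);
}

/* Unlock order does not matter for correctness. */
static void unlock_pair(struct buffer *a, struct buffer *b)
{
	pthread_mutex_unlock(&a->lock);
	if (b != a)
		pthread_mutex_unlock(&b->lock);
}
```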
2008-09-02 15:42 for merge/split 2008-09-02 15:42 yes 2008-09-02 15:42 careful about deadlocks there 2008-09-02 15:42 all down the chain 2008-09-02 15:42 always 2008-09-02 15:42 anybody who thinks abba is a swedish pop group is not touching the locking code 2008-09-02 15:42 lol 2008-09-02 15:43 bh knows that stuff I'm pretty sure 2008-09-02 15:43 didn't ask, but what he talks about is beyond that 2008-09-02 15:45 hrm how about transactional stuff 2008-09-02 15:46 like where you have to create an inode, then reference from a directory 2008-09-02 15:46 which involves more than one tree 2008-09-02 15:47 I'll write it up in a few days 2008-09-02 15:47 it's pretty much all there in the hammer thread 2008-09-02 15:47 we track every time a buffer gets dirty 2008-09-02 15:47 i still haven't had the time to digest that whole brain dump 2008-09-02 15:48 then either add it to the current transaction phase or cow the buffer 2008-09-02 15:48 it's basically the phase part of phase tree, the part that netapp never tried to own 2008-09-02 15:50 cowing the buffer is a simple matter of setting its index to some other physical block 2008-09-02 15:50 or in the case of file blocks, changing the pointer in its parent 2008-09-02 15:50 index block 2008-09-02 15:50 which is done only in cache 2008-09-02 15:50 not on disk 2008-09-02 15:51 so you have one view of the vs on disk, and another, current one that the vfs sees, in memory 2008-09-02 15:51 of the fs I mean 2008-09-02 15:51 when you get the aha on that it's going to be fun 2008-09-02 15:52 I think I'll use the term "fork" instead of cow 2008-09-02 15:53 it's much more descriptive of what happens 2008-09-02 15:53 so tux3's transaction model is to fork any buffer written to after a phase has closed 2008-09-02 15:53 if the phase is still open, just write to it normally 2008-09-02 15:54 unspeakably efficient 2008-09-02 15:54 tux3 has exactly two ways of getting info onto media 1) write to a buffer 2) save the superblock 2008-09-02
15:54 there will eventually be 3) directio 2008-09-02 15:55 which will require more fiddling 2008-09-02 15:56 I wonder if it would be worth the very minor regularity improvement to hold the superblock in a buffer 2008-09-02 15:56 well 2008-09-02 15:56 kind of dumb 2008-09-02 15:56 you don't know the block size for the superblock 2008-09-02 15:56 or 2008-09-02 15:56 more accurately, the blocksize of the superblock may not match the buffer cache blocksize 2008-09-02 15:57 or the filesystem blocksize 2008-09-02 15:57 both making it unnatural to force the sb into a buffer 2008-09-02 15:58 I think we may be studly and to the initial sb load and later saves directly via the bio interface 2008-09-02 15:58 which means we need to handle completion, get the interrupt back into foreground 2008-09-02 15:58 interrupt completion that is 2008-09-02 15:59 which we need to do anyway if we want to avoid the decrepit old block io library 2008-09-02 16:06 http://interviews.slashdot.org/comments.pl?sid=950917&cid=24845533 2008-09-02 16:20 all the remaining conditional exprs in ileaf.c involve leaf->count, there has to be a way to make a macro 2008-09-02 16:20 macroizing those will be a big help in easing the pain of endian conversion 2008-09-02 16:25 ACTION picks up konrad's cute negative for loop for dleaf_trunc 2008-09-02 16:25 I think I grabbed it from somewhere in ileaf.c 2008-09-02 16:25 really? 2008-09-02 16:25 or maybe that was my imagination 2008-09-02 16:25 yeah 2008-09-02 16:25 looks original 2008-09-02 16:26 I had something remotely like it 2008-09-02 16:26 but yours is actually readable 2008-09-02 16:26 ileaf->dump 2008-09-02 16:26 er 2008-09-02 16:26 ileaf_dump 2008-09-02 16:26 same thing 2008-09-02 16:26 roughly 2008-09-02 16:26 oh heh 2008-09-02 16:26 I forget some of the stuff I write :-) 2008-09-02 16:26 :D 2008-09-02 16:27 yours is better 2008-09-02 16:27 it's how I should have written it 2008-09-02 16:27 I'll change ileaf_dump to match, or do you want to do that? 
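Going back to the transaction rule stated a little earlier (write in place while the buffer's phase is still open, fork the buffer once that phase has closed), a toy version of the decision is sketched below. It is a deliberately simplified illustration under assumptions: struct buffer, its fields, and buffer_for_write are hypothetical, the fork is a plain heap copy, and nothing here models remapping the block in the cache or writing the closed phase to disk.

```c
#include <stdlib.h>
#include <string.h>

enum { BLOCK_SIZE = 256 };	/* illustrative block size */

/* Toy buffer: remembers the phase in which it was last dirtied. */
struct buffer {
	unsigned phase;
	int dirty;
	unsigned char data[BLOCK_SIZE];
};

/* Returns the buffer the caller should modify in the open phase.
 * A buffer that is dirty from an earlier, closed phase is left untouched
 * and a forked copy is returned instead; otherwise the buffer itself is
 * marked dirty in the open phase. Returns NULL if the fork cannot be
 * allocated. */
static struct buffer *buffer_for_write(struct buffer *buf, unsigned open_phase)
{
	if (buf->dirty && buf->phase != open_phase) {
		struct buffer *fork = malloc(sizeof *fork);
		if (!fork)
			return NULL;
		memcpy(fork->data, buf->data, BLOCK_SIZE);
		fork->phase = open_phase;
		fork->dirty = 1;
		return fork;
	}
	buf->phase = open_phase;
	buf->dirty = 1;
	return buf;
}
```

The point of the rule is that the closed phase keeps a stable image of every buffer it dirtied, so it can be committed while later writes continue against the forked copies.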
2008-09-02 16:27 go ahead 2008-09-02 16:27 I'm attempting to wrap my head around dleaf 2008-09-02 16:27 good 2008-09-02 16:27 don't bother with ilead :-) 2008-09-02 16:27 dleaf is pure braindamange, ask shapor 2008-09-02 16:28 of the good kind 2008-09-02 16:28 it will make your head hurt 2008-09-02 16:28 heh 2008-09-02 16:32 u16 *gdict = (void *)leaf + btree->sb->blocksize; 2008-09-02 16:32 u16 *edict = (void *)(gdict - leaf->groups); 2008-09-02 16:32 more regular form 2008-09-02 16:32 plus a cute varname 2008-09-02 16:34 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-02 16:35 hey tim_dimm 2008-09-02 16:35 hey flips 2008-09-02 16:35 you can shell in any time ;-) 2008-09-02 16:35 no time to be stealthy, huh 2008-09-02 16:35 that's stealthy 2008-09-02 16:35 yup 2008-09-02 16:36 flips: shouldn't those be u32? 2008-09-02 16:36 dcc still doesn't work 2008-09-02 16:36 nat issue 2008-09-02 16:36 konrad, which? 2008-09-02 16:36 [16:32:04] u16 *gdict = (void *)leaf + btree->sb->blocksize; 2008-09-02 16:36 [16:32:04] u16 *edict = (void *)(gdict - leaf->groups); 2008-09-02 16:36 um 2008-09-02 16:36 oh yes 2008-09-02 16:37 shows I cut n pasted 2008-09-02 16:37 without engaging brain 2008-09-02 16:37 heh 2008-09-02 16:37 I'm gonna borrow those 2008-09-02 16:37 they are actually struct something * 2008-09-02 16:37 well yeah 2008-09-02 16:37 but u32 is the same size as said struct 2008-09-02 16:38 struct entry and struct group I think 2008-09-02 16:38 right, a dangerous coindicdence 2008-09-02 16:38 but safe in this case 2008-09-02 16:39 the cutest thing about dleaf is the way the entry offset is incremented inthe lookup loop 2008-09-02 16:39 that is where the brain hurt gets serious 2008-09-02 16:40 the for loops using struct pointers are gratuitous 2008-09-02 16:40 flips: remember to think about the allocator so that related bits of metadata are located closely to each other, this is very important for online disk checking 2008-09-02 16:40 
it's more clear using array indices 2008-09-02 16:40 and the complier can optimize to the same thing in theory 2008-09-02 16:40 practice is different of course ;-) 2008-09-02 16:41 the ext3 paper, OLS 2007 ?, might be of interest here, they made a modification to ext3 so that fsck would runs much faster 2008-09-02 16:41 bh, you haven't been reading the recent posts ;-) 2008-09-02 16:41 talke about that very thing 2008-09-02 16:41 gcc -O9999999 linux.c 2008-09-02 16:41 did you get to read the paper btw ? 2008-09-02 16:41 :-) 2008-09-02 16:41 bh, I'm one of the stars in it ;-) 2008-09-02 16:41 yeah, I'm overloaded with -rt work right now, first day back 2008-09-02 16:41 downloaded it last week 2008-09-02 16:41 folks are hitting me up for stuff already 2008-09-02 16:42 oh really ? 2008-09-02 16:42 url ? 2008-09-02 16:42 yep 2008-09-02 16:42 um 2008-09-02 16:42 http://ext2.sourceforge.net/2005-ols/2005-ext3-paper.pdf 2008-09-02 16:43 getting close to sk8 oclock 2008-09-02 16:44 ACTION does another piece of 75% cacao chocolate 2008-09-02 16:44 tim_dimm might have a skate left in him 2008-09-02 16:45 went through 30 wheels in 4 days at Maryhill 2008-09-02 16:45 I'll be on the strand by 5:30 2008-09-02 16:45 about 2008-09-02 16:45 wow 2008-09-02 16:45 http://www.silverfishlongboarding.com/option,com_gallery2/Itemid,53/?g2_itemId=237609/ 2008-09-02 16:45 "vintage plantation" <- I highly recommend this chocolate 2008-09-02 16:46 i'm the one *not* in the rubber suit 2008-09-02 16:46 flips: there might be a newer paper on the matter from IBM 2008-09-02 16:46 beautiful 2008-09-02 16:46 bh, link? 
2008-09-02 16:46 OLS 2007 or something like that 2008-09-02 16:46 I'll need a better hint 2008-09-02 16:47 tim, you're the one who looks cool 2008-09-02 16:47 except you need a mirrored helment 2008-09-02 16:47 helmet 2008-09-02 16:48 if you kept your elbows in I bet you woulda won 2008-09-02 16:48 and spray some pam on that jacket 2008-09-02 16:49 ACTION isn't very keen on online disk fragmention either in ext4 2008-09-02 16:49 seems kind of like bottom scrapping to me 2008-09-02 16:50 I was trying to grab some air at that point. Those guys just passed me, and I knew they were about to slam on the brakes. 2008-09-02 16:51 variable metadata is useful for homogenous file types like media files, hmmm, interesting 2008-09-02 17:05 bh, you really need to read my musings 2008-09-02 17:05 let me see if I can find a subject line 2008-09-02 17:06 scott on the right? 2008-09-02 17:06 in blue, yes 2008-09-02 17:06 how'd I guess ;) 2008-09-02 17:06 f'n magic 2008-09-02 17:06 shinyness 2008-09-02 17:07 serious about the pam 2008-09-02 17:07 slickness 2008-09-02 17:07 on the list ? 2008-09-02 17:07 should be 2008-09-02 17:07 yes 2008-09-02 17:07 I look at it a bit, but I didn't see very much 2008-09-02 17:07 or lkml ? 2008-09-02 17:07 the list 2008-09-02 17:08 tux3 ? 2008-09-02 17:08 "Spacial correlation between directory entries, inodes and file data" 2008-09-02 17:08 you have to read between the lines 2008-09-02 17:08 all I see is stuff about patches 2008-09-02 17:08 I have a followup post in the works 2008-09-02 17:08 but there is stuff ahead of it 2008-09-02 17:08 in the queue 2008-09-02 17:08 flips: how does the magic zero entry worth with the dleaf dicts? 2008-09-02 17:08 or is it present? 
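For the dleaf layout under discussion, with the group dictionary at the very end of the block and the entry dictionary directly below it, the pointer setup with properly typed pointers might look like the sketch below. The struct bodies are placeholders: the chat only establishes that struct group and struct entry are each the size of a u32 and that the leaf header has a groups count, so every other detail here, including the helper names, is an assumption.

```c
#include <stdint.h>

/* Placeholders, 32 bits each as in the discussion; real fields omitted. */
struct group { uint32_t bits; };
struct entry { uint32_t bits; };

/* Leaf header reduced to the one field used here. */
struct dleaf { uint16_t groups; };

/* One past the top of the group dictionary, which is the end of the block. */
static struct group *dleaf_gdict(struct dleaf *leaf, unsigned blocksize)
{
	return (struct group *)((char *)leaf + blocksize);
}

/* One past the top of the entry dictionary, just below the last group.
 * The subtraction is done on a struct group pointer, so it steps back
 * 4 bytes per group. */
static struct entry *dleaf_edict(struct dleaf *leaf, unsigned blocksize)
{
	return (struct entry *)(dleaf_gdict(leaf, blocksize) - leaf->groups);
}
```

With u16 pointers, as in the pasted lines, the same subtraction would step back only 2 bytes per group and land in the middle of the group dictionary, which is why the pointer type matters here.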
2008-09-02 17:09 konrad, same way 2008-09-02 17:09 0th entry is implied 2008-09-02 17:09 dict should be positioned one past the top of the list 2008-09-02 17:09 flips: you should make online disk checking the default mechanism for your file system, create a common fsck library to shared between the online checker and offline 2008-09-02 17:09 that is violated in dleaf.c sometimes for no good reason 2008-09-02 17:09 just because we were figuring out how to do it at the time 2008-09-02 17:09 offline checking would be used only in a dev situation 2008-09-02 17:09 bh, planned 2008-09-02 17:09 indeed 2008-09-02 17:09 good 2008-09-02 17:10 need to write a tech note 2008-09-02 17:10 ah ok 2008-09-02 17:10 the tux3 userspace implementation is in fact the base of the online tools 2008-09-02 17:10 because until we get reverse pointers and supporting stuff for file systems that's the only things that's going to work 2008-09-02 17:10 including defrag 2008-09-02 17:10 online and offline 2008-09-02 17:10 volume are getting so large that .... you know... 2008-09-02 17:11 reverse pointers is planned, tech note needed 2008-09-02 17:11 I've mentioned some details from time to time 2008-09-02 17:11 I know 2008-09-02 17:11 it's already broken 2008-09-02 17:11 broke years ago 2008-09-02 17:11 tux3 is going to be allocation groups as well 2008-09-02 17:11 and maybe... not sure about it... 
relative pointers 2008-09-02 17:12 maybe that is tux3.1 2008-09-02 17:12 don't know, it's too experimental 2008-09-02 17:12 right 2008-09-02 17:12 scary 2008-09-02 17:12 get the basics as much as you can first, format changes are another matter 2008-09-02 17:12 http://pastie.caboo.se/264894 2008-09-02 17:12 like that 2008-09-02 17:12 my dumper "from scratch" if you will 2008-09-02 17:12 that's the plan 2008-09-02 17:12 so I think I'm doing something right 2008-09-02 17:13 konrad, kool 2008-09-02 17:13 oh yes 2008-09-02 17:14 if I go out for a skate, your new dumper will be finished when I get back and I can use it 2008-09-02 17:14 hm? 2008-09-02 17:14 it's sort of redundant to the existing dleaf_dump 2008-09-02 17:14 I just wanted to be sure I understand how to loop through the groups 2008-09-02 17:14 er, entries 2008-09-02 17:14 and groups 2008-09-02 17:15 yours is going to be better, I like to backport like that 2008-09-02 17:15 it's called evolution 2008-09-02 17:17 should I make the output look like the old one? 2008-09-02 17:17 good place to start 2008-09-02 17:17 k time to get rolling 2008-09-02 17:19 what's the purpose of (struct entry*)foo->limit ? 2008-09-02 17:19 flips: tying up some loose ends. I'll be out by 6 2008-09-02 17:21 ok, I'll slow down a little 2008-09-02 17:21 see you at the skate park? 2008-09-02 17:21 sure 2008-09-02 17:22 I'll do slaloms at the pier for a while ;-)_ 2008-09-02 17:22 more fun than slowing down 2008-09-02 17:28 flips: stuff posted today on lkml ? 
2008-09-02 17:28 bh, not today 2008-09-02 17:28 soon 2008-09-02 17:28 oh ok, so you haven't posted this yet then 2008-09-02 17:29 mainly just on the current state of the disk format 2008-09-02 17:29 ok 2008-09-02 17:29 "Spacial correlation between directory entries, inodes and file data" 2008-09-02 17:29 (read between the lines) 2008-09-02 17:29 it's working out well as far as it goes 2008-09-02 17:29 there's a lot more detail coming on that 2008-09-02 17:30 read the hint about generating functions 2008-09-02 17:30 spatial 2008-09-02 17:30 I've blabbed about that to you personally, but I don't know if it registered yet 2008-09-02 17:30 right 2008-09-02 17:30 spacial is my new word ;-) 2008-09-02 17:30 I googled for that and go nothing useful 2008-09-02 17:30 it's on the tux3 list 2008-09-02 17:31 I totally don't see it 2008-09-02 17:31 it's just patch discussion that I'm seeing 2008-09-02 17:31 you're right 2008-09-02 17:31 google is damaged or mailman 2008-09-02 17:33 http://tux3.org/pipermail/tux3/2008-August/000083.html 2008-09-02 17:33 google is braindamaged 2008-09-02 17:33 :-p 2008-09-02 17:33 later... 
2008-09-02 17:44 ACTION reading 2008-09-02 17:53 ok, dumper2 worsk 2008-09-02 18:45 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2008-09-02 18:45 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3 2008-09-02 19:35 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2008-09-02 19:36 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3 2008-09-02 19:50 konrad, kool 2008-09-02 19:50 I'm not sure it's any more clear than the original 2008-09-02 19:51 it's 3 lines shorter, but that doesn't mean clearer 2008-09-02 20:01 it would be hard to be less clear than the original 2008-09-02 20:03 eh, it's not an easy process 2008-09-02 20:04 -easy +simple 2008-09-02 20:07 it's going to get even less easy when we add in versioned extents 2008-09-02 20:07 to the same code 2008-09-02 20:07 so it has to be clean 2008-09-02 20:12 mhm 2008-09-02 20:12 well, mine uses less pointer arithmetic and more array notation 2008-09-02 20:13 *(a + b) vs a[b] 2008-09-02 20:15 I think that's better 2008-09-02 20:15 for the dumper 2008-09-02 20:15 easier to read 2008-09-02 20:15 can save the pointer tricks for something that matters 2008-09-02 20:15 assuming the compiler can't optimize that well 2008-09-02 20:16 which is not a safe assumption 2008-09-02 20:18 flips: http://pastie.caboo.se/264975 there's a first poke at it 2008-09-02 20:19 not sure I like the ent == -1 logic 2008-09-02 20:20 otherwise looks pretty good 2008-09-02 20:20 offset -= doesn't look right 2008-09-02 20:20 should be += 2008-09-02 20:20 wow, dleaf_isinorder looks nice 2008-09-02 20:21 ent == -1 is the same as the check against the magical zero 2008-09-02 20:21 right 2008-09-02 20:21 ent < -1 ? 2008-09-02 20:21 something seems off by one 2008-09-02 20:21 hm? 2008-09-02 20:22 as in, for every entry but the first entry, check against the previous entry 2008-09-02 20:22 no? 2008-09-02 20:22 how can ent be < -1 ? 
2008-09-02 20:22 ent grows smaller? 2008-09-02 20:22 -1, -2, -3 2008-09-02 20:22 oh right :-) 2008-09-02 20:22 (-3 < -1) 2008-09-02 20:22 upside down 2008-09-02 20:22 => true 2008-09-02 20:22 yeah :) 2008-09-02 20:22 braindamage 2008-09-02 20:22 sory 2008-09-02 20:22 same with offset -= 2008-09-02 20:22 offset is negative 2008-09-02 20:22 grows smaller 2008-09-02 20:22 should be ent < 0 though 2008-09-02 20:22 let's see what the skew is 2008-09-02 20:23 hmm 2008-09-02 20:23 maybe it's my brain 2008-09-02 20:23 ah 2008-09-02 20:23 I see 2008-09-02 20:23 I'm assuming the -1th element is greater than zero 2008-09-02 20:23 which isn't good 2008-09-02 20:23 true 2008-09-02 20:23 I think you want a structure where you assign a variable once outside the loop 2008-09-02 20:24 like you had in ileaf 2008-09-02 20:24 alright 2008-09-02 20:24 it's not bad 2008-09-02 20:24 but given how important it is, it should be crystalline 2008-09-02 20:24 yes 2008-09-02 20:25 ok, I can read it now 2008-09-02 20:25 you're right 2008-09-02 20:25 you need to get the correctness of the first value 2008-09-02 20:25 then induce correctness from there 2008-09-02 20:26 and all the little bits have to join together 2008-09-02 20:26 so you need to init your first value outside both loops 2008-09-02 20:26 you can safely init it to zeor 2008-09-02 20:27 zero 2008-09-02 20:27 since keys are unsigned 2008-09-02 20:27 flips: http://pastie.caboo.se/264977 2008-09-02 20:27 oh, keys are unsigned? 
2008-09-02 20:28 not crystalline yet ;-) 2008-09-02 20:28 keys are 2008-09-02 20:28 u64 2008-09-02 20:28 see tuxkey_t 2008-09-02 20:28 wait 2008-09-02 20:28 but I'm testing limits 2008-09-02 20:28 not keys 2008-09-02 20:28 they're keys 2008-09-02 20:28 limits are u8 2008-09-02 20:28 that's what dleaf is 2008-09-02 20:28 a key dict 2008-09-02 20:28 right 2008-09-02 20:29 you have to expand those u8s to keys 2008-09-02 20:29 that's the clever thing here 2008-09-02 20:29 ah, I'm just doing what I did in ileaf_isinorder 2008-09-02 20:29 making sure the limits are non-descending 2008-09-02 20:29 when you get the aha it's going to be a big one ;-) 2008-09-02 20:29 which is still important 2008-09-02 20:29 we have two 48 bit fields we combine to make a key 2008-09-02 20:29 two 24 bit fields 2008-09-02 20:29 and the 8 bit fields are just indexes to allow us to do that 2008-09-02 20:29 to make a 48 bit key 2008-09-02 20:29 right 2008-09-02 20:30 sorry 2008-09-02 20:30 so you need to assemble the 48 bit key at each step and compare to prevkey 2008-09-02 20:30 ah, and it should be greater always? 2008-09-02 20:30 yes 2008-09-02 20:30 non-descending or ascending? 2008-09-02 20:30 the 8 bit fields within groups also ascend 2008-09-02 20:30 can two keys be the same? 2008-09-02 20:30 which is what your code checks 2008-09-02 20:30 right 2008-09-02 20:30 which is also good 2008-09-02 20:31 alright 2008-09-02 20:31 twe keys can be the same 2008-09-02 20:31 that's going to be critically important 2008-09-02 20:31 nondescending 2008-09-02 20:32 assembling the 48 bit key is pretty easy 2008-09-02 20:32 it's just 24 bits from the entry and the other 24 bits from the group that owns it 2008-09-02 20:32 computing the offset is a little trickier 2008-09-02 20:32 offset into data 2008-09-02 20:35 http://pastie.caboo.se/264980 checking offsets within groups and non-descending keys now 2008-09-02 20:36 it triggers 3 times running ./dleaf 2008-09-02 20:36 triggers? 
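Editor's sketch of the key assembly described above: 24 bits from the group, 24 bits from the entry, and each assembled key compared to prevkey. Only `tuxkey_t`, `loghi`, `loglo` and `prevkey` come from the log; the `sketch_` names are hypothetical, and the keys are held in a flat forward array rather than the real dleaf layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t tuxkey_t; /* stand-in: the log only says keys are u64 */

/* Combine the 24 high bits (from the group) with the 24 low bits (from the entry). */
static tuxkey_t sketch_make_key(uint32_t loghi, uint32_t loglo)
{
	assert(loghi < (1u << 24) && loglo < (1u << 24));
	return ((tuxkey_t)loghi << 24) | loglo;
}

/* Return 1 if the keys never descend; equal neighbours are allowed. */
static int sketch_keys_nondescending(const tuxkey_t *keys, size_t count)
{
	tuxkey_t prevkey = 0; /* safe initial value because keys are unsigned */

	for (size_t i = 0; i < count; i++) {
		if (keys[i] < prevkey)
			return 0;
		prevkey = keys[i];
	}
	return 1;
}
```

Initialising prevkey to zero outside the loop is the "assign a variable once outside the loop" structure asked for earlier; it works only because the keys are unsigned.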
2008-09-02 20:37 dleaf_check returns negative with "dleaf entries out of order!" as the error message 2008-09-02 20:37 probably not because of a bug in dleaf.c 2008-09-02 20:37 or should I say, "possibly" 2008-09-02 20:38 I still don't much like the inits to -1 2008-09-02 20:38 well 2008-09-02 20:38 I think I see 2008-09-02 20:38 you want a do { } while (cond) structure 2008-09-02 20:38 probably 2008-09-02 20:39 so the loop iterates over the final n-1 elements 2008-09-02 20:39 it is never allowed to have zero iterations 2008-09-02 20:39 so return false if you find that before entering the do loop 2008-09-02 20:40 so groups aren't allowed to have zero entries? 2008-09-02 20:40 right 2008-09-02 20:40 I should write the definition 2008-09-02 20:40 and post it 2008-09-02 20:41 about time 2008-09-02 20:41 the comment is a little lame 2008-09-02 20:41 in editing a dleaf, and group that drops to zero has to be deleted immediately 2008-09-02 20:42 s/and/any/ 2008-09-02 20:42 what sort of formatting do you prefer for do/while loops? 2008-09-02 20:42 hmm 2008-09-02 20:43 lindent 2008-09-02 20:43 that's with the first curly on the same line as the do 2008-09-02 20:43 I don't like it, but linus does 2008-09-02 20:43 used to write them like you 2008-09-02 20:43 and the second curly on the same or different line as the while? 2008-09-02 20:44 but in the end there is no way to make c pretty ;-) 2008-09-02 20:44 heh 2008-09-02 20:44 different line 2008-09-02 20:44 ok 2008-09-02 20:47 hm, the implied zero entry, what loglo does it have? zero? 
2008-09-02 20:54 um 2008-09-02 20:54 ACTION thinks 2008-09-02 20:54 it's not actually there 2008-09-02 20:54 that's where the aha happens 2008-09-02 20:55 only the nonzero entries are actually there, and they encode the upper bound from the key, rather than the usual offset 2008-09-02 20:55 we start one entry away and have an implied zero because we are picking up a pair of entries at each step 2008-09-02 20:56 the current entry and the one above in a sense 2008-09-02 20:56 or maybe better to think of it as the current entry and the one below 2008-09-02 20:56 hm 2008-09-02 20:56 start at -1, and compare to -2, and so on? 2008-09-02 20:56 where you can always directly look at the limit 2008-09-02 20:56 but have to use a clever trick to look at the offset 2008-09-02 20:57 hmm 2008-09-02 20:57 yes 2008-09-02 20:57 well 2008-09-02 20:58 first set offset to zero 2008-09-02 20:58 mhm 2008-09-02 20:58 then enter the loop at i = 0, and look up dict [i -1] -> limit 2008-09-02 20:59 it's a matter of taste 2008-09-02 20:59 and mine was not good when I wrote the original ;-) 2008-09-02 20:59 the loop should always execute eactly n iterations 2008-09-02 20:59 and it should start from zero, but never access dict[0] 2008-09-02 21:00 I think 2008-09-02 21:00 even if it fails early? 
2008-09-02 21:00 fail means bail 2008-09-02 21:00 zero tolerance of errors 2008-09-02 21:00 so no, except when it fals 2008-09-02 21:00 fails 2008-09-02 21:01 so what I meant was, the loop should not execute n-1 times 2008-09-02 21:01 but n times 2008-09-02 21:01 yeah 2008-09-02 21:01 and let the actual index i not be used in the loop, but i - 1 instead 2008-09-02 21:01 that's like docmentation 2008-09-02 21:01 because you can arrange the loop to be able to use i directly, but that makes it harder to understand 2008-09-02 21:01 the optimizer can easily do that on its own 2008-09-02 21:02 any, I'm talking about what I _should_ have thought about when I wrote the original 2008-09-02 21:02 was in kind of a hurry to get something running 2008-09-02 21:03 wow, genuine uniden replacement batteries cost almost as much as a new handset 2008-09-03 00:43 konrad, remember I suggested shapor review your code for dleaf ;-) 2008-09-03 00:43 I'm afraid I haven't been as good as reviewer as I could 2008-09-03 00:43 hm? 
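Editor's sketch of the loop shape described above: offset starts at the implied zero, the loop runs exactly n times, and the body reads dict[i - 1] so dict[0] is never touched. It assumes a dictionary pointer positioned one past the top of the list, with entries growing downward. `sketch_entry` and `sketch_entry_sizes` are hypothetical names; only `limit` comes from the log.

```c
#include <assert.h>

struct sketch_entry { unsigned limit; }; /* hypothetical: only the field needed here */

/*
 * dict points one past the end of the entry storage, so the first entry is
 * dict[-1], the second dict[-2], and so on. Writes the size of each entry
 * into sizes[] and returns 1, or returns 0 as soon as a limit goes backwards.
 */
static int sketch_entry_sizes(const struct sketch_entry *dict, int count, unsigned *sizes)
{
	unsigned offset = 0; /* the implied zero entry */

	assert(count >= 0);
	for (int i = 0; i > -count; i--) {
		unsigned limit = dict[i - 1].limit; /* dict[0] is never accessed */

		if (limit < offset)
			return 0; /* fail means bail */
		sizes[-i] = limit - offset;
		offset = limit;
	}
	return 1;
}
```

The index runs 0, -1, -2 and so on because the dictionary grows downward, which is why the loop condition reads `i > -count`.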
2008-09-03 00:43 I'm just looking at all the bits of your code I didn't really read ;-) 2008-09-03 00:44 ah that patch was sort of huge glob of stuff 2008-09-03 00:45 int dleaf_ordered(BTREE, struct dleaf *leaf) 2008-09-03 00:45 { 2008-09-03 00:45 struct group *gdict = (void *)leaf + btree->sb->blocksize; 2008-09-03 00:45 struct entry *edict = (void *)(gdict - leaf->groups); 2008-09-03 00:45 tuxkey_t key = 0; 2008-09-03 00:45 --gdict; 2008-09-03 00:45 --edict; 2008-09-03 00:45 for (int group = 0; group < -leaf->groups; group--) { 2008-09-03 00:45 tuxkey_t basekey = (tuxkey_t)gdict[group].loghi << 24; 2008-09-03 00:46 for (int entry = 0; entry < -gdict[group].count; entry--) { 2008-09-03 00:46 tuxkey_t newkey = basekey | edict[entry].loglo; 2008-09-03 00:46 if (key > newkey) 2008-09-03 00:46 return 0; 2008-09-03 00:46 key = newkey; 2008-09-03 00:46 } 2008-09-03 00:46 } 2008-09-03 00:46 return 1; 2008-09-03 00:46 } 2008-09-03 00:46 notice my cavalier attitude towards channel spam ;-) 2008-09-03 00:46 it does less than yours in more lines 2008-09-03 00:46 but it seems to work better 2008-09-03 00:47 looks clear 2008-09-03 00:50 probably complete broken 2008-09-03 00:50 looks ok except you're looking up dict[n] instead of dict[n - 1] 2008-09-03 00:51 which won't work on n = 0 2008-09-03 00:52 the loop inequalities are backwards 2008-09-03 00:52 it sucks ;-) 2008-09-03 00:53 heh 2008-09-03 00:53 these things are a little different than ileafs 2008-09-03 00:53 could just forward loop 2008-09-03 00:53 well 2008-09-03 00:53 to generate the keys it's straightforward 2008-09-03 00:53 generating the offsets is a little trickier 2008-09-03 00:58 ah, a flaw in both of our code 2008-09-03 00:58 you have to keep resetting edict 2008-09-03 00:59 oh? 2008-09-03 00:59 oh, for the current group? 2008-09-03 00:59 I did that 2008-09-03 01:00 I don't see where 2008-09-03 01:00 hm, what are you looking at? 
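Editor's sketch of how the three flaws named above might be addressed together: loop conditions that match a downward-running index, lookups at dict[n - 1], and edict stepped past each group's entries. It is not the code that was eventually posted. The structs are hypothetical plain-integer stand-ins (the real on-disk layout may differ), and both dictionaries are passed in as pointers one past the top of their lists.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t tuxkey_t; /* stand-in for the u64 key type */

/* Hypothetical stand-ins carrying the fields named in the log. */
struct sketch_group { uint32_t count, loghi; };
struct sketch_entry { uint32_t limit, loglo; };

/* Return 1 if every group is non-empty and the assembled keys never descend. */
static int sketch_dleaf_ordered(const struct sketch_group *gdict, int groups,
				const struct sketch_entry *edict)
{
	tuxkey_t prevkey = 0;

	assert(groups >= 0);
	for (int g = 0; g > -groups; g--) {
		const struct sketch_group *group = &gdict[g - 1];
		tuxkey_t basekey = (tuxkey_t)group->loghi << 24;
		int count = (int)group->count;

		if (!count)
			return 0; /* groups may not have zero entries */
		for (int e = 0; e > -count; e--) {
			tuxkey_t key = basekey | edict[e - 1].loglo;

			if (key < prevkey)
				return 0;
			prevkey = key;
		}
		edict -= count; /* move on to the next group's entries */
	}
	return 1;
}
```

This covers key order only; checking that offsets ascend needs the limit handling discussed separately.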
2008-09-03 01:00 http://pastie.org/264980 2008-09-03 01:01 didn't there, right 2008-09-03 01:02 works now 2008-09-03 01:04 cool 2008-09-03 01:04 ok, calculating the offset to make sure that ascends is trickier, we get back into those special zero cases 2008-09-03 01:05 and we have the key generation using just straightforward indexing (though backwards) and the offset generation using demented-off-by-one 2008-09-03 01:05 so there is no pretty way to write it 2008-09-03 01:06 probably the best to is have the dicts one above like before 2008-09-03 01:06 and make the adjustment by one for key lookup, because that is easy 2008-09-03 01:07 and use the same method as for ileaf for the offset generation 2008-09-03 01:07 well 2008-09-03 01:07 hmm 2008-09-03 01:07 dunno 2008-09-03 01:07 like I said, no pretty way 2008-09-03 01:08 it's only the entry->offset that is off by one 2008-09-03 01:08 group->count is directly indexed 2008-09-03 01:08 so mumble 2008-09-03 01:14 ok, this works 2008-09-03 01:14 unsigned members = edict[entry].limit - (entry ? 
edict[entry + 1].limit : 0); 2008-09-03 01:14 could write "size" instead of members 2008-09-03 01:15 then members is the amount by which we increment base offset 2008-09-03 01:15 by definition positive 2008-09-03 01:15 so it can't go backwards, only too high 2008-09-03 01:15 well, the limits can go backwards 2008-09-03 01:15 I think you checked for that 2008-09-03 01:16 so offset can go backwards too 2008-09-03 01:17 I can write this better 2008-09-03 01:29 struct group *gdict = (void *)leaf + btree->sb->blocksize; 2008-09-03 01:29 struct entry *edict = (void *)(--gdict - leaf->groups); 2008-09-03 01:29 better 2008-09-03 01:30 have the dicts sitting on the zeroth entry for dleaf 2008-09-03 01:32 it's how I wrote it originally 2008-09-03 01:32 yay for convergent evolution 2008-09-03 01:34 :D 2008-09-03 01:34 I'll post mine to the list 2008-09-03 01:34 you can slice and dice as you like 2008-09-03 01:34 it's ever so slightly more readable than the original dump 2008-09-03 01:34 sounds good to me 2008-09-03 01:39 I needed to get that skeleton written anyway, as part of the delte 2008-09-03 01:39 delete 2008-09-03 01:41 I'm probably heading to sleep soon 2008-09-03 01:41 gnight 2008-09-03 01:41 I should likewise 2008-09-03 01:41 night 2008-09-03 01:41 bye 2008-09-03 01:41 heh, is it 2am already? 2008-09-03 01:41 bye 2008-09-03 01:41 yup 2008-09-03 09:51 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-03 12:21 hey 2008-09-03 13:18 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-03 14:07 bh hi 2008-09-03 14:08 well there is some leaf truncate code 2008-09-03 14:08 icky code 2008-09-03 14:08 but... now onto the btree part of the truncate 2008-09-03 14:09 soon we will be able to delete files and be much more like a filesystem 2008-09-03 14:14 flips: there's a linkedIn group for file system engineers 2008-09-03 14:15 pointer? 
2008-09-03 14:15 http://www.linkedin.com/groups?about=&gid=64287 2008-09-03 14:15 is there a group for rollerskating filesystem engineers? 2008-09-03 14:15 did that work? 2008-09-03 14:15 who rollerskates around here? 2008-09-03 14:15 ;-) 2008-09-03 14:24 linkedin customer service sucks 2008-09-03 14:43 yup 2008-09-03 14:46 what's your beef with LinkedIn? 2008-09-03 14:47 bugs 2008-09-03 14:47 tell you about it later 2008-09-03 14:47 k 2008-09-03 14:57 -!- pgquiles(~pgquiles@75.Red-81-44-176.dynamicIP.rima-tde.net) has joined #tux3 2008-09-03 14:59 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-03 15:03 http://techvideoblog.com/ifa/98-linux-laptop-the-hivision-mininote/ 2008-09-03 15:03 <- $98 linux laptop 2008-09-03 15:03 apparently costs $120 at the moment 2008-09-03 15:03 600 MHz ARM 2008-09-03 15:03 does it blend? 2008-09-03 15:03 right in 2008-09-03 15:03 nice 2008-09-03 15:04 kay, I just have to hook up the good old ddsnap btree deletion to tux3 and we have file delete 2008-09-03 15:46 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-03 15:52 -!- pgquiles(~pgquiles@75.Red-81-44-176.dynamicIP.rima-tde.net) has joined #tux3 2008-09-03 16:27 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-03 17:04 http://www-01.ibm.com/support/docview.wss?uid=swg21230196 2008-09-03 17:05 that must be ancient 2008-09-03 17:05 found that on the LinkedIn File Systems developers group list 2008-09-03 17:05 look like '02 2008-09-03 17:05 Modified date: 2008-09-03 17:05 2007-02-15 2008-09-03 17:05 my bad 2008-09-03 17:05 it's a network filesystem 2008-09-03 17:17 anyone heard of storspeed before? 2008-09-03 17:21 nope, looks like a stealth storage startup in austin 2008-09-03 17:21 flips: the second looks really clear to me 2008-09-03 17:22 the flattened one? 
2008-09-03 17:22 yeah 2008-09-03 17:23 glad you like it 2008-09-03 17:23 your interest helped me get going on the delete 2008-09-03 17:25 from what I can gather, storspeed is developing a nfs cache solution 2008-09-03 17:35 ack, just spent a couple hours struggling with btree delete where I did a minor typo in the port from ddsnap 2008-09-03 17:36 well, at least this code can be worked on now 2008-09-03 17:36 unlike ddsnap, where it is buried in a huge system with no unit tests 2008-09-03 17:36 scary 2008-09-03 17:36 good thing it never had a bug 2008-09-03 17:38 heh 2008-09-03 17:38 sk8 oclock 2008-09-03 17:41 getting near the end of the just plain filesystem mechanism stuff 2008-09-03 17:41 some interesting bits coming soon 2008-09-03 17:44 wheel swap before skate 2008-09-03 17:47 that means you're planning to skate? 2008-09-03 17:50 hey 2008-09-03 17:50 should be 2008-09-04 02:07 -!- pgquiles(~pgquiles@75.Red-81-44-176.dynamicIP.rima-tde.net) has joined #tux3 2008-09-04 03:14 flips: ? 2008-09-04 03:33 bh, hi 2008-09-04 04:49 -!- pgquiles_(~pgquiles@156.Red-83-33-70.dynamicIP.rima-tde.net) has joined #tux3 2008-09-04 04:50 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-04 05:10 tux3 just learned how to delete 2008-09-04 06:04 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-04 06:42 whoo! 
2008-09-04 10:05 flips: http://kerneltrap.org/Linux/Tux3_Acting_Like_A_Filesystem 2008-09-04 11:07 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-04 11:38 -!- pgquiles__(~pgquiles@209.Red-81-32-36.dynamicIP.rima-tde.net) has joined #tux3 2008-09-04 12:15 konrad@hopeless test $ echo "Hello tux3 world" > tmp/tux3 2008-09-04 12:15 konrad@hopeless test $ cat tmp/tux3 2008-09-04 12:15 Hello tux3 world 2008-09-04 12:15 FUSE + tux3 2008-09-04 12:19 just for fun :) 2008-09-04 12:20 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-04 12:20 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-04 12:20 flips: for when you wake up, ping 2008-09-04 13:01 konrad: oh? 2008-09-04 13:01 i was wondering when someone was going to do that ;) 2008-09-04 13:02 :D 2008-09-04 13:08 it's very much ugly, incorrect, and a hack 2008-09-04 13:14 konrad, pong 2008-09-04 13:15 fuse-tux3, reads/writes sort of 2008-09-04 13:15 :-D 2008-09-04 13:15 using routines basically copied from tux3.c 2008-09-04 13:16 that's what they're for 2008-09-04 13:16 konrad, you are hereby offically annoited the maintainer for the fuse fork 2008-09-04 13:17 ouch 2008-09-04 13:17 :D 2008-09-04 13:18 ok what you probably need most right now is control over the tracing output 2008-09-04 13:18 I was going to turn all the trace(printf( into just trace( 2008-09-04 13:19 and have trace be a real function instead of a macro 2008-09-04 13:20 I wasn't totally serious about fuse-tux3 2008-09-04 13:20 fsck, my typo is now splatted all over the web 2008-09-04 13:20 I wasn't totally serious about tux3 2008-09-04 13:20 heh 2008-09-04 13:21 see my post to the btrfs list 2008-09-04 13:21 "considering the wisdom" 2008-09-04 13:21 so now I know it was a stupidly big job ;-) 2008-09-04 13:21 is that a list worth subscribing to? 
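Editor's sketch of the idea above of turning trace(printf(...)) call sites into plain trace(...) backed by a real function, which gives one place to switch the output on and off. `sketch_trace` and `sketch_tracing` are hypothetical names, and sending the output to stderr is an assumption; the log does not say how tux3 implements this.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

static int sketch_tracing = 0; /* hypothetical runtime switch, off by default */

/* Returns the number of characters written, or 0 when tracing is off. */
static int sketch_trace(const char *format, ...)
{
	va_list args;
	int written;

	assert(format != NULL);
	if (!sketch_tracing)
		return 0;
	va_start(args, format);
	written = vfprintf(stderr, format, args);
	va_end(args);
	return written;
}
```

A call site then reads `sketch_trace("leaf %d\n", n);`, and a tool such as the FUSE front end could silence it by leaving the switch at zero.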
2008-09-04 13:21 http://kerneltrap.org/Linux/Tux3_Acting_Like_A_Filesystem 2008-09-04 13:22 mildly interesting 2008-09-04 13:22 mostly meat and potatoes debugging 2008-09-04 13:22 saw that 2008-09-04 13:22 zfs list is more interesting 2008-09-04 13:22 because more bugs ;-) 2008-09-04 13:22 heh 2008-09-04 13:22 I know someone who switched his /home over to fuse-zfs recently 2008-09-04 13:23 but he was on reiser4 before that 2008-09-04 13:23 hardcore 2008-09-04 13:23 perfect crash test dummy for tux3 2008-09-04 13:23 heh, yes 2008-09-04 13:24 get ready to post your fuse hack to lkml I would say 2008-09-04 13:24 pick the 3 worst things about it, fix, then post 2008-09-04 13:25 ehhh 2008-09-04 13:25 1. It exists. 2008-09-04 13:26 that's both a bug and a feature 2008-09-04 13:26 2. It brings tux3 up even with zfs on linux :D 2008-09-04 13:26 sort of 2008-09-04 13:26 haha 2008-09-04 13:27 only basic reads/ writes work 2008-09-04 13:27 just kidding 2008-09-04 13:27 morally even 2008-09-04 13:31 have to say fuse is pretty easy to work with 2008-09-04 13:40 I am ashamed to say I never tried 2008-09-04 13:43 ah, kerneltrap posted my repost without the typo 2008-09-04 13:51 flips: did you see the lvm snapshot merging patch? 2008-09-04 13:52 folks 2008-09-04 13:53 hey bh 2008-09-04 13:53 hey 2008-09-04 13:53 when arey ou oflks going to rock the LInux file systems world ? 2008-09-04 13:53 bah 2008-09-04 13:53 when are you folks going to rock the LInux file systems world ? 
2008-09-04 13:53 better :) 2008-09-04 13:53 man my typing if f-ed right now 2008-09-04 13:53 depends on how much sleep flips gets 2008-09-04 13:54 yeah, the sooner the better 2008-09-04 13:54 the more he misses, the closer it comes to linux fs world rocking 2008-09-04 13:54 :-) 2008-09-04 13:54 show folks out of some barbaric crap 2008-09-04 13:54 tim_dimm, that's for the heads up 2008-09-04 13:54 yeah, I was talking to him late last night, he should be up by now 2008-09-04 13:55 thanks I mean 2008-09-04 13:55 my pleasure dude 2008-09-04 13:55 it was convenient I actually check in the patch I was talking about in the post kerneltrap linked 2008-09-04 13:55 when its done, you can sleep ;-) 2008-09-04 13:55 before sleeping 2008-09-04 13:58 -!- eli(~elicriffi@66.249.86.209) has joined #tux3 2008-09-04 13:59 about time to improve the tux3 front page, no? 2008-09-04 13:59 I think the geek chic may be wearing a little thin 2008-09-04 14:00 shapor's site looks pretty good 2008-09-04 14:00 just port that over 2008-09-04 14:00 flips: yeah, reading that online checking post more, lots of details in it 2008-09-04 14:00 shapor? 2008-09-04 14:01 bh, the main point is there 2008-09-04 14:01 that one can accelerate online checking with a small amount of additional metadata, rarely accessed 2008-09-04 14:03 shapor already put up today's press link :-) 2008-09-04 14:03 have you mapped out the cases yet for checking ? 2008-09-04 14:03 other than the basic inode tree integrity stuff ? 2008-09-04 14:05 I thought I did that in the post 2008-09-04 14:05 you mean more detail? 2008-09-04 14:05 yeah, well, in your mind regardless of the docs 2008-09-04 14:05 yes 2008-09-04 14:05 good :) 2008-09-04 14:05 there is the inode level and the directory level 2008-09-04 14:06 because I'm sick of how lame Linux file systems are 2008-09-04 14:06 you have to have a "good" bit on a compartment of the inode level before checking the directory level 2008-09-04 14:06 what about the directory link count ? 
solved by groups again ? 2008-09-04 14:10 ACTION is still reading the post 2008-09-04 14:11 hi eli 2008-09-04 14:11 shapor, you want to meet eli 2008-09-04 14:12 googler, ex cluster admin 2008-09-04 14:12 maze too 2008-09-04 14:12 ? 2008-09-04 14:13 maze, do you know eli? 2008-09-04 14:13 nope 2008-09-04 14:13 came to the zumastor talk, we hung out after 2008-09-04 14:13 googler alert, aisle 5 2008-09-04 14:13 has got hands on with some interesting stuff, like lustre and redhat gfs 2008-09-04 14:13 hmm, that's not his login name then is it? 2008-09-04 14:14 prolly not 2008-09-04 14:15 hmm, that guy we sat at the whiteboard with? 2008-09-04 14:15 hmm, and where's aisle 5? 2008-09-04 14:16 heh 2008-09-04 14:16 :-) 2008-09-04 14:16 just me being the smartass that I am 2008-09-04 14:16 MaZe, haven't met you yet either 2008-09-04 14:16 you need to 2008-09-04 14:16 two awesome dudes 2008-09-04 14:17 hi 2008-09-04 14:17 in fact now that I think about it, everybody on the channel is awesome ;-) 2008-09-04 14:17 even the bot 2008-09-04 14:17 eli, hi 2008-09-04 14:17 elicriffield@google 2008-09-04 14:17 hey, eli ;-) 2008-09-04 14:17 ah 2008-09-04 14:17 by our mere presence, we are awesome 2008-09-04 14:17 flips: yeah, I was about to suggest using some kind of centralize reverse map metadata file of some sort 2008-09-04 14:18 the bot is obviously the most awesome, by virtue of being here the longest 2008-09-04 14:18 focusing on early mounting is the end goal here 2008-09-04 14:18 the sooner it can be done the better 2008-09-04 14:19 Maze: :-) 2008-09-04 14:19 so maybe using delayed freeing or something like for deletion until enough of the disk has been verified so that you know it's safe to do so without crushing something would be good 2008-09-04 14:19 basically the normal stuff 2008-09-04 14:20 hi eli 2008-09-04 14:20 hey 2008-09-04 14:20 problem here is that it's kind of wacky for a Unix/Linux style system to groke 2008-09-04 14:20 bh, not sure the reverse makes sense as 
a file... maybe 2008-09-04 14:20 grok 2008-09-04 14:20 actually maybe that could be good 2008-09-04 14:20 flips: well you have it as something else 2008-09-04 14:20 if the reverses can be fixed size 2008-09-04 14:20 so you can directly index the reverse of a block 2008-09-04 14:21 the thing about it is it'll be small 2008-09-04 14:21 get a token into some other structure 2008-09-04 14:21 and easily verified by a simple inode integrity check 2008-09-04 14:21 well, then you can map it into the page cache 2008-09-04 14:21 decending downward 2008-09-04 14:21 downward? 2008-09-04 14:21 you can special case it and not worry about aliasing, etc... 2008-09-04 14:21 inode->indirect->data 2008-09-04 14:21 the normal stuff 2008-09-04 14:22 aliasing=multiple references 2008-09-04 14:23 bh, I'm going to refresh the online check doc and add it to the design mix 2008-09-04 14:23 got to be thought about from early 2008-09-04 14:23 because you'll have only one per voluem and it's special cased 2008-09-04 14:23 right 2008-09-04 14:23 right, which is why I'm mentioning it to you 2008-09-04 14:23 going to drop the multivolume wanking 2008-09-04 14:23 so you have considered a lot of things beforehand and not deadend your project 2008-09-04 14:24 from a design point of view 2008-09-04 14:24 it's a simple solution for a difficult to track and reproduce data structure, could be too simplistic, you're the fs expert here not me 2008-09-04 14:25 in the online cheker code, it'll have to be one of the first metadata files to check other than the allocation map 2008-09-04 14:27 bh, I think it's about as simplistic as necessary 2008-09-04 14:28 it would be wrong to bog down the fs with a topheavy structure just for fsck 2008-09-04 14:29 it's just an idea 2008-09-04 14:29 I wasn't rejecting 2008-09-04 14:29 hugs? 
2008-09-04 14:29 just pointing out that it's not necessary for the required persistent structure to be complex 2008-09-04 14:29 sure, like I said, it's just an idea 2008-09-04 14:30 so what's necessary is to reintroduce a notion of allocation groups 2008-09-04 14:30 the same could apply for not so frequent reverse mappings as well, there still has to be an ordering issue to be considered at mount/check time 2008-09-04 14:30 and record any pointers that cross those groups 2008-09-04 14:31 ordering? 2008-09-04 14:31 the good thing about having it as a file is that it can be easily checked in a single file 2008-09-04 14:31 yeah mount check ordering 2008-09-04 14:31 still don't get it 2008-09-04 14:31 you want to be able to moun this volume asap 2008-09-04 14:32 exactly 2008-09-04 14:32 but you need a certain set of metadata checked first 2008-09-04 14:32 that's why all checks are incremental 2008-09-04 14:32 whether or not it's consistent or not 2008-09-04 14:32 very little is checked before it stops 2008-09-04 14:32 starts 2008-09-04 14:32 not much more than the sb magic number 2008-09-04 14:32 tux3 already checks magic numbers on inode table leaves and file index leaves by the way 2008-09-04 14:32 well, what about checking the bare minimal metadata before you can mount it ? what is that ? 2008-09-04 14:33 1) allocation map 2008-09-04 14:33 respectively 0x90de and 0xc0de 2008-09-04 14:33 2) some random reverse map 2008-09-04 14:33 ... 2008-09-04 14:33 what else ? 2008-09-04 14:33 list it 2008-09-04 14:33 that's my suggestion 2008-09-04 14:33 no checking before mount ;-) 2008-09-04 14:33 well, ok 2008-09-04 14:33 incremental checking starts as soon as root dir is opened 2008-09-04 14:34 don't you want your b-tree checked before doing something with it ? 2008-09-04 14:34 it gets checked on the fly 2008-09-04 14:34 the btree code has to be written not to oops even if you feed it random numbers 2008-09-04 14:34 what about other metadata like the allocation map ? 
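Editor's sketch of an on-the-fly leaf check using the two magic numbers quoted above (0x90de for inode table leaves, 0xc0de for file index leaves). The function name, the position of the magic at the start of the block, and the host byte order comparison are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
	SKETCH_ILEAF_MAGIC = 0x90de, /* inode table leaf, value from the log */
	SKETCH_DLEAF_MAGIC = 0xc0de, /* file index leaf, value from the log */
};

/* Return 1 if the block is large enough and begins with the expected magic. */
static int sketch_leaf_sniff(const void *block, size_t size, uint16_t magic)
{
	uint16_t found;

	assert(block != NULL);
	if (size < sizeof found)
		return 0; /* too short to hold a header */
	memcpy(&found, block, sizeof found); /* avoids unaligned access */
	return found == magic;
}
```

A check like this is cheap enough to run every time a leaf is read, which fits the plan of mounting with no up-front checking.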
2008-09-04 14:34 also checked on the fly 2008-09-04 14:34 what if you have a corruption in the b-tree ? how are you going to deal with that ? 2008-09-04 14:35 ext2/3 work like this and it is very effective 2008-09-04 14:35 or a corruption in the allocation map ? 2008-09-04 14:35 except they don't do the only the fly fsck 2008-09-04 14:35 corruption detected in the btree... means a scan for lost leaves has to occur 2008-09-04 14:35 and eio on that file until it's complete 2008-09-04 14:35 maybe 2008-09-04 14:37 just had a thought 2008-09-04 14:37 we could have a special log item every now and then that just duplicates some throwaway copies of the top few levels of the inode btree 2008-09-04 14:37 these structures rarely change 2008-09-04 14:38 its quiting time for me, i'll be back, good seeing you again flips and maze 2008-09-04 14:38 so when they do, just invalidate that log item 2008-09-04 14:39 flips: can we please support front truncation of files? 2008-09-04 14:39 maze, I am writing a post about that right now ;-) 2008-09-04 14:39 among other things 2008-09-04 14:39 oh, awesome 2008-09-04 14:39 "the long and short of truncation" 2008-09-04 14:39 sparse checking should be easy 2008-09-04 14:39 do you mean, actually moving data forward logically, or just punching a hole in the front? 2008-09-04 14:39 it's been my belief that appendable front truncatable files are the best 2008-09-04 14:40 because unaligned front truncation is nasty 2008-09-04 14:40 no, freeing blocks from the front, if it's unaligned then you end up with a full block, with some bytes 'unused' 2008-09-04 14:40 right, and you leave the logical addresses untouched? 2008-09-04 14:41 by logical you mean from the point of view of the apps? 
yes, they just see zeroes there, if they seek, but opening the file would preferably (probably not doable, since this is vfs) seek to the first 'used' byte 2008-09-04 14:42 right, that's exactly what I'm implementing 2008-09-04 14:42 this would be just awesome for any sort of logging 2008-09-04 14:42 there's a big comment in the code about how deficient my first cut truncate function is in that respect ;-) 2008-09-04 14:42 otherwise the vfs would need a patch to seek to non-free byte first 2008-09-04 14:43 http://tux3.org/tux3?fd=b64615fb8a11;file=user/test/dleaf.c 2008-09-04 14:43 my belief is large files should be immutable, front-truncatable, appendable 2008-09-04 14:43 front truncation is not an option in tux3 2008-09-04 14:43 versioning requires it 2008-09-04 14:44 heh 2008-09-04 14:44 "hole punch" is the usual term 2008-09-04 14:44 there's a nasty little corner case with extents 2008-09-04 14:44 punch a hole in the middle of an extent, the metadata expands 2008-09-04 14:45 unlike all other hole punches 2008-09-04 14:45 punch a hole in the middle of a versioned extent and braindamage is a clear and present danger 2008-09-04 14:49 hole punch is a telecine term 2008-09-04 14:49 hole punching is different - because you can do it in the middle of files 2008-09-04 14:49 that's why we like it 2008-09-04 14:49 that also has a usecase 2008-09-04 14:50 what is the difference between a hole punch at the beginning of a file and a front truncate? 
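The front truncation behaviour described above (free whole blocks from the front, leave logical addresses untouched, keep a partly used block when the cut is unaligned) comes down to a little arithmetic, sketched here in C. The struct and function names are hypothetical and `blockbits` (log2 of the block size) is an assumed parameter; this is not the tux3 truncate code.

```c
#include <stdint.h>

/* Result of a front truncation request; names are hypothetical. */
struct front_trunc {
    uint64_t freed_blocks;  /* whole blocks that can be released */
    uint64_t first_present; /* first byte offset still backed by a block */
};

/*
 * cut:       number of bytes the application wants dropped from the front
 * blockbits: log2 of the block size (12 for 4096-byte blocks)
 *
 * Only whole blocks are freed. If the cut is unaligned, the block that
 * contains the cut point stays allocated, with its leading bytes unused.
 * Logical offsets of the remaining data do not change.
 */
static struct front_trunc front_truncate(uint64_t cut, unsigned blockbits)
{
    struct front_trunc result;

    result.freed_blocks = cut >> blockbits;
    result.first_present = result.freed_blocks << blockbits;
    return result;
}
```

With 4096-byte blocks, a cut of 10000 bytes frees two blocks and leaves offset 8192 as the first present byte: the cut point rounded down to a block boundary.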
2008-09-04 14:51 just had a thought: it would be nice to have an ioctl to return the first "present" offset in a file, for log reading 2008-09-04 14:51 offset padded down to block boundary with zeros 2008-09-04 14:52 then that exabyte upper limit actually makes sense 2008-09-04 14:52 it could be conceivable to actually hit that one day 2008-09-04 14:52 and have to rotate ;-) 2008-09-04 14:56 news item: msft browsers lose just .5 more share to non-msft then it will be a market share tie on w3schools 2008-09-04 14:56 meaning half the people learning about html as learning it with a non-msft browser 2008-09-04 14:57 just had a thought: it would be nice to have an ioctl to return the first "present" offset in a file, for log reading -> precisely, I'd envision that to be the default location when you open such a file 2008-09-04 14:57 hah! 2008-09-04 14:57 nice 2008-09-04 14:57 open(...O_FRONT); 2008-09-04 14:58 loc padded down to block boundary 2008-09-04 14:58 pos I mean 2008-09-04 14:58 and aligned I mean 2008-09-04 14:59 everybody missed my pun on tree chopping so far 2008-09-04 14:59 btree_chop 2008-09-04 15:00 veritable paul bunyan 2008-09-04 15:00 I suppose that's because I didn't spell it that way 2008-09-04 15:00 currently leaf_chop, I have to change it 2008-09-04 15:07 tim_dimm, you now have a commit dedicated to you 2008-09-04 15:07 http://tux3.org/tux3 2008-09-04 15:07 the paul bunyan commit? 2008-09-04 15:07 flips: online checking the most important and igored things I know of in most file systems today 2008-09-04 15:07 that one yes 2008-09-04 15:07 bh, ignored no longer 2008-09-04 15:07 my first commit dedication- I'm touched 2008-09-04 15:07 I'll get a post together in the next couple of days 2008-09-04 15:08 flips: by you or folks in the general community ? 
2008-09-04 15:08 I'll also start thinking about relative block pointers 2008-09-04 15:09 could get some excellent compression that way 2008-09-04 15:09 also code complexity ;-) 2008-09-04 15:09 because I think its the best stop gap measure we have so far for petabyte volumes 2008-09-04 15:09 bh, by the tux3 community 2008-09-04 15:09 yeah, thanks ;) 2008-09-04 15:09 I'm so glad somebody listens to me 2008-09-04 15:09 because this is a critical problem 2008-09-04 15:09 we'll always be here for you ;-) 2008-09-04 15:10 probably the most important problem for file systems as this moment followed by snapshots 2008-09-04 15:10 :) 2008-09-04 15:10 one of them, true 2008-09-04 15:10 I just hope that I'm being useful in these discussions 2008-09-04 15:10 just plain bugginess is probably the number one problem 2008-09-04 15:10 affects reiser4, zfs, btrfs 2008-09-04 15:10 maybe not hammer 2008-09-04 15:11 and maybe that is because I'm just not reading the bug reports 2008-09-04 15:11 I suspect it's just plain "not hammer" 2008-09-04 15:14 bh: just curious. what's your day gig? 2008-09-04 15:15 (pardon my curiosity) 2008-09-04 15:16 flips, after return 0;, how about a print "TIMBER" 2008-09-04 15:16 good call 2008-09-04 15:16 well 2008-09-04 15:17 we want to chop the tree, not bring down the system ;-) 2008-09-04 15:17 it's really tree_disintegrate 2008-09-04 15:17 tim_dimm: Novell's R&D group 2008-09-04 15:17 gotcha 2008-09-04 15:17 I'm mostly a concurrency person with the -rt patch 2008-09-04 15:17 mr locking 2008-09-04 15:17 how about writing "timberrrr" on the _error exit_ to chop tree? 2008-09-04 15:17 locking instrumentation specifically 2008-09-04 15:18 and -rt conversion of the kernel to be fully preemptible 2008-09-04 15:18 bh is going to test his locking intrumentation on tux3 2008-09-04 15:18 ok, I'm thinking... 
what next 2008-09-04 15:18 nice 2008-09-04 15:18 pretty much me and Ingo are the only two folks on this planet to have made some attempt and map out the problem space for that 2008-09-04 15:18 and I'm am awfully close to saying, kernel port 2008-09-04 15:18 but it's ingo's patch 2008-09-04 15:18 -rt that is 2008-09-04 15:18 I think it's kernel port next 2008-09-04 15:19 first a little cleanup of the current source 2008-09-04 15:19 not much cleanup 2008-09-04 15:19 we're going to do a proper fork for the kernel I think 2008-09-04 15:19 too hard to make a lot of the code match 2008-09-04 15:19 we'll see 2008-09-04 15:19 tim_dimm: but I'm ex-netapp WAFL 2008-09-04 15:19 bh does a good job of not telling me any secrets ;-) 2008-09-04 15:19 mostly as a sustaining engineer which the vast majority of that kind of work there 2008-09-04 15:19 oh wow 2008-09-04 15:20 yeah, so I've seen what an enterprise file system should look like roughly and Linux is far far behind that 2008-09-04 15:20 agreed 2008-09-04 15:20 as one of the guilty parties 2008-09-04 15:20 bh: I'm bizdev at MetaRAM 2008-09-04 15:20 what is that ? 2008-09-04 15:20 thus the _dimm part 2008-09-04 15:20 big memory 2008-09-04 15:20 metaram is cool 2008-09-04 15:21 fred weber, former cto of amd's startup 2008-09-04 15:21 double the ddr2 limit 2008-09-04 15:21 so you're an engineer or a business person ? 2008-09-04 15:21 and ddr3 2008-09-04 15:21 biz with an ear for engineering 2008-09-04 15:21 and why are you here btw ? 2008-09-04 15:21 bizdev for flips 2008-09-04 15:21 roller skate instructor 2008-09-04 15:21 right flips? 2008-09-04 15:21 yup 2008-09-04 15:21 I consult 2008-09-04 15:21 our biz is sk8ing 2008-09-04 15:22 zen of sk8biz 2008-09-04 15:22 used to be at violin memory 2008-09-04 15:22 that's where we met 2008-09-04 15:22 right 2008-09-04 15:22 1/2TB of DRAM used as storage 2008-09-04 15:22 tim_dimm: eh ? 
2008-09-04 15:22 now you can cram that 1/2TB into a server 2008-09-04 15:22 violin is a ssd startup 2008-09-04 15:22 I'm just wondering what your interest here is as a non-engineer, seems odd 2008-09-04 15:23 uh, kinda hard to articulate in one line 2008-09-04 15:23 and I'm suspicious of business folks in general as a rule :) 2008-09-04 15:23 well, I'm interested 2008-09-04 15:23 you could be an EMC mole of some sort or something 2008-09-04 15:23 :) 2008-09-04 15:23 my role here is tux3 evangelism 2008-09-04 15:23 no, i just skate fast 2008-09-04 15:24 no mole action 2008-09-04 15:24 so you're in LA with flips ? 2008-09-04 15:24 yup 2008-09-04 15:24 venice 2008-09-04 15:24 ah ok 2008-09-04 15:24 nice ot know 2008-09-04 15:24 to know 2008-09-04 15:24 bh, tim_dimm is learning C 2008-09-04 15:24 u? 2008-09-04 15:24 among other things 2008-09-04 15:24 bh, from the biggest C slackers on the block 2008-09-04 15:24 just a random angry kernel dude 2008-09-04 15:25 akd 2008-09-04 15:25 apparently angry enough to be a Solaris kernel engineer according to them :) 2008-09-04 15:25 rakd 2008-09-04 15:25 tim_dimm is learning C about the same rate I'm learning skating 2008-09-04 15:25 gee, hope so 2008-09-04 15:25 bh, I used to be in post production as an online editor and colorist 2008-09-04 15:25 I wonder how big a bribe sun would offer for me to work on zfs ;-) 2008-09-04 15:26 I'd be happy to 2008-09-04 15:26 what got me into technology was i/o bottlenecks 2008-09-04 15:26 start: rm * 2008-09-04 15:26 oh really 2008-09-04 15:26 ? 2008-09-04 15:26 and abandon tux3 ? 2008-09-04 15:26 depends on the size of the bribe 2008-09-04 15:26 uncompressed 4k digital cinema is 48MB per frame, 1.2GB/s 2008-09-04 15:26 you're doing well at Google and happy right ? 
2008-09-04 15:26 ACTION has his price 2008-09-04 15:26 ok 2008-09-04 15:26 nice to know that you have a price 2008-09-04 15:26 good thing it's open source hmm 2008-09-04 15:27 can't take it back now 2008-09-04 15:27 it's unlikely that they'll give you that bribe unless they're in need of something for that NetApp/Sun lawsuit 2008-09-04 15:27 don't you think that tux3 would be a better file system ? 2008-09-04 15:28 bh: did you ever read flips' ramback, faster than a speeding bullet post? 2008-09-04 15:28 bh, of course it will 2008-09-04 15:28 tux3 will have about 1/10th the cache footprint of zfs 2008-09-04 15:28 which is a major source of bugs for them 2008-09-04 15:29 and I really don't care that tux3 can only access an exabyte while zfs can boil the oceans 2008-09-04 15:29 my fs is not for boiling oceans, I like the oceans the way they are 2008-09-04 15:29 really, who's going to utilize more than that in a single namespace 2008-09-04 15:30 ...in the next 5 years 2008-09-04 15:30 cern 2008-09-04 15:30 in 5 years tux3.1 will be out 2008-09-04 15:30 t-minus 6 days, btw 2008-09-04 15:31 nice to know we've got that long to live 2008-09-04 15:31 6 days or 5 yrs 2008-09-04 15:31 "the god particle" 2008-09-04 15:31 just don't be evil for the next 6 days as insurance 2008-09-04 15:31 6 days 2008-09-04 15:31 I suggested we name our son "higgs" 2008-09-04 15:31 Higgs Huber sounds dumb though 2008-09-04 15:31 that would be cool 2008-09-04 15:32 sounds good 2008-09-04 15:32 really 2008-09-04 15:32 so we settled for Pi 2008-09-04 15:32 I like higgs more for what it's worth 2008-09-04 15:32 I did too 2008-09-04 15:32 tim_dimm: no 2008-09-04 15:32 flips, wanna give the ramback one liner? 
2008-09-04 15:32 higgs would be a most excellent middle name 2008-09-04 15:32 sounds "rich" 2008-09-04 15:33 hoidy toidy 2008-09-04 15:33 bh, ramback: every little factor of 25 performance improvement really helps 2008-09-04 15:34 http://lwn.net/Articles/272534/ 2008-09-04 15:39 ACTION reads 2008-09-04 15:53 konrad, could you post your fuse recipe to the tux3 list? 2008-09-04 15:54 ACTION rolls, really 2008-09-04 15:54 what happen to your userspace porting of the Linux page cache or something like that ? 2008-09-04 15:55 fuse-tux3.c + build instructions? 2008-09-04 15:56 it's not pretty, but ok 2008-09-04 15:56 I'll try and clean it up a bit first 2008-09-04 16:01 $ dd if=/dev/zero of=tmp/zeros2 count=1000 2008-09-04 16:01 1000+0 records in 2008-09-04 16:01 1000+0 records out 2008-09-04 16:01 512000 bytes (512 kB) copied, 13.9749 s, 36.6 kB/s 2008-09-04 16:02 it's not the quickest thing 2008-09-04 16:27 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-04 16:55 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-04 17:28 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-04 18:19 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-04 20:37 bh, the user space portin of the linux page cache was completed 3 or 4 weeks ago 2008-09-04 20:37 porting 2008-09-04 20:37 there was a kind of dazed sounding post about it 2008-09-05 00:07 -!- stargazr5(~gaurav@59.95.6.25) has joined #tux3 2008-09-05 00:11 -!- cdk(~chinmay@59.95.14.95) has joined #tux3 2008-09-05 00:15 got to consider what to hack next 2008-09-05 00:15 I'll sleep on it 2008-09-05 00:17 wow, yummy fuse code from conrad 2008-09-05 00:17 unyum 2008-09-05 00:17 :D 2008-09-05 00:19 @konrad:: trying to compile ur fuse file....getting errors.. 2008-09-05 00:19 line 256 2008-09-05 00:19 hm? 2008-09-05 00:19 error? 
2008-09-05 00:19 unknown filed 'key' 2008-09-05 00:19 specified in initializer 2008-09-05 00:20 is your checkout of tux3 up to date? 2008-09-05 00:20 ok....i thought so....one moment will get back 2008-09-05 00:21 installing fuse now 2008-09-05 00:22 enjoy the pain 2008-09-05 00:22 lots of segfaults 2008-09-05 00:22 and no proper readdir 2008-09-05 00:22 I'm not sure how it's done, so I just ignored it completely 2008-09-05 00:25 tux3fuse.c:113:47: error: macro "fuse_main" passed 4 arguments, but takes just 3 2008-09-05 00:25 eh, you must have a different version of fuse than me 2008-09-05 00:25 the example on the fuse site shows 3 args 2008-09-05 00:25 but my version takes 4 2008-09-05 00:25 I have 2.7.3 2008-09-05 00:27 just get rid of the NULL argument 2008-09-05 00:29 fuse: failed to exec fusermount: No such file or directory 2008-09-05 00:30 the mountpoint or the fake filesystem? 2008-09-05 00:31 I think there is no fusermount 2008-09-05 00:31 ah, weird 2008-09-05 00:31 what distro? 2008-09-05 00:32 or I don't have fuse compiled into my kernel 2008-09-05 00:32 debian etch 2008-09-05 00:32 I'll run it on something else 2008-09-05 00:32 hm 2008-09-05 00:32 I'm on Fedora 9 2008-09-05 00:32 need to install fuse-utils I think 2008-09-05 00:32 ah 2008-09-05 00:33 got the latest version .... still cant compile :: undefined reference to `btree_delete' 2008-09-05 00:33 are you compiling it correctly? 2008-09-05 00:33 cdk, change that to tree_chop 2008-09-05 00:33 oh, did it change? 
2008-09-05 00:33 k 2008-09-05 00:33 it did 2008-09-05 00:33 gratuitous 2008-09-05 00:33 heh, I guess I'm out of date :D 2008-09-05 00:33 going to have to be more careful about that now 2008-09-05 00:34 you might want to post a rebased version 2008-09-05 00:34 just for now until its merged 2008-09-05 00:34 I'll do that tomorrow, got to sleep now 2008-09-05 00:34 fusermount: failed to open /dev/fuse: No such file or directory 2008-09-05 00:34 I suppose I have to make the devnode 2008-09-05 00:34 is the fuse module loaded into your kernel? 2008-09-05 00:35 or somit 2008-09-05 00:35 no 2008-09-05 00:35 I'm not totally familiar with fuse 2008-09-05 00:35 but I need to make the devnode anyway 2008-09-05 00:35 I think 2008-09-05 00:35 ok compiled 2008-09-05 00:35 cdk is going to beat me ;-) 2008-09-05 00:35 heh 2008-09-05 00:36 sudo mknod -m 666 /dev/fuse c 10 229 2008-09-05 00:37 fusermount: fuse device not found, try 'modprobe fuse' first <- maybe as far as I get tonight 2008-09-05 00:37 yes....mounted and visible.. 2008-09-05 00:37 :) 2008-09-05 00:37 :) 2008-09-05 00:38 and I'm sure, very breakable 2008-09-05 00:38 cdk: now count the seconds until segfault 2008-09-05 00:38 :D 2008-09-05 00:38 we'll fix that 2008-09-05 00:38 :D 2008-09-05 00:39 compiling a fuse module 2008-09-05 00:39 stupid make decided to recompile the whole kernel 2008-09-05 00:39 oh no 2008-09-05 00:40 actually it did the right thing 2008-09-05 00:40 but I compiled the wrong thing 2008-09-05 00:40 configfs :p 2008-09-05 00:41 eh 2008-09-05 00:41 heh 2008-09-05 00:41 ok....loop while deleting file.. 2008-09-05 00:41 found it 2008-09-05 00:41 "filesystem in userspace support" 2008-09-05 00:42 cdk, loop? 2008-09-05 00:42 have more that one file on the fs....after fuse shows only one file hello??? 2008-09-05 00:42 cdk: yes. 
2008-09-05 00:42 readdir doesn't work 2008-09-05 00:42 so remember the name you used before and cat the file 2008-09-05 00:42 that much should work 2008-09-05 00:42 mounted 2008-09-05 00:42 k 2008-09-05 00:43 readdir will work tomorrow morning ;-) 2008-09-05 00:43 need to sleep now 2008-09-05 00:43 there is the hello file 2008-09-05 00:43 how did it get there? 2008-09-05 00:43 where are u guys...its 1:30 afternoon here 2008-09-05 00:43 flips: look at tux3_readdir() 2008-09-05 00:43 it's static 2008-09-05 00:44 cdk: Pacific time, west coast USA 2008-09-05 00:44 in india 2008-09-05 00:44 konrad, nice 2008-09-05 00:44 I'll make it real pretty soon 2008-09-05 00:44 excellent 2008-09-05 00:45 konrad, this is a most pleasant development 2008-09-05 00:45 good 2008-09-05 00:45 cdk, maybe I'll drop by one of these days ;-) 2008-09-05 00:45 ofcourse 2008-09-05 00:45 always welcome 2008-09-05 00:46 and you're good enough to get fuse working faster than me ;-) 2008-09-05 00:46 course I'm kind of lame at stuff like that 2008-09-05 00:46 ;-) 2008-09-05 00:46 shapor is missing the fun 2008-09-05 00:46 yeah only good at making stuff work....no good at devel 2008-09-05 00:46 cdk, we can fix that 2008-09-05 00:46 hi flips 2008-09-05 00:47 everybody's running tux3-fuse while you're... 2008-09-05 00:47 doing something ;-) 2008-09-05 00:47 having a life maybe 2008-09-05 00:47 yeah, a bit of one anyway lol 2008-09-05 00:47 sh-3.1# echo hello world >foo/foo 2008-09-05 00:47 sh-3.1# cat foo/foo 2008-09-05 00:47 hello world 2008-09-05 00:47 sh-3.1# 2008-09-05 00:48 amazing 2008-09-05 00:48 :D 2008-09-05 00:48 it's mountable 2008-09-05 00:48 it is 2008-09-05 00:48 developers should come pouring in now 2008-09-05 00:48 tomorrow morning readdir will work 2008-09-05 00:48 yah 2008-09-05 00:48 I need to prepare for this by sleeping now ;-) 2008-09-05 00:49 oh 2008-09-05 00:49 I'll check it in first 2008-09-05 00:49 rebased and all? 
2008-09-05 00:49 yes, that was easy 2008-09-05 00:49 k, good 2008-09-05 00:49 saves me 30 seconds tomorrow morning 2008-09-05 00:50 konrad: great work on getting tux3 up in fuse ;) 2008-09-05 00:50 ok....flips u beat me to cat... 2008-09-05 00:51 i still cant do that 2008-09-05 00:51 thanks shapor 2008-09-05 00:51 heh 2008-09-05 00:51 I caught up 2008-09-05 00:51 tux3fuse.c, ok konrad? 2008-09-05 00:51 and have it in the same directory for now 2008-09-05 00:51 sure 2008-09-05 00:51 its great to see it looking like a filesystem before the kernel port 2008-09-05 00:52 really 2008-09-05 00:52 will make testing quite scriptable 2008-09-05 00:52 when can we expect the kernel port? 2008-09-05 00:52 and tux3 university starts soon 2008-09-05 00:52 heh, tomorrow night, and if it doesn't happen, blame flips 2008-09-05 00:52 sounds good 2008-09-05 00:52 :) 2008-09-05 00:53 a few of my friends are also interested in tux3 might get them together and get some devel work done. 2008-09-05 00:55 ok...cat working now.. 2008-09-05 00:55 @konrad:: great works.... 2008-09-05 00:56 great work 2008-09-05 00:57 konrad, want your email in the commit message or not? 2008-09-05 00:57 hm? 2008-09-05 00:58 Port of tux3 to fuse, contributed by Conrad Meyer <- for example 2008-09-05 00:58 oh, sure 2008-09-05 00:58 it's in 2008-09-05 00:58 no more version skew ;-) 2008-09-05 00:59 yay 2008-09-05 00:59 added it to make? 2008-09-05 00:59 nope ;-) 2008-09-05 01:00 next 2008-09-05 01:02 tux3 is not in make either... 
2008-09-05 01:03 some lamer didn't put it in I guess 2008-09-05 01:03 heh 2008-09-05 01:03 night night 2008-09-05 01:03 night 2008-09-05 01:04 i am off as well 2008-09-05 01:04 nice development 2008-09-05 01:04 thanks again konrad 2008-09-05 01:04 np 2008-09-05 01:06 -!- cdk(~chinmay@59.95.14.95) has left #tux3 2008-09-05 01:09 ok, makefile is there 2008-09-05 01:09 test isn't 2008-09-05 01:09 patches gratefully accepted 2008-09-05 01:09 there are some warnings to clear up 2008-09-05 01:21 sleepy time 2008-09-05 01:21 today was fun 2008-09-05 01:34 -!- Danjel(~chatzilla@c-a721e255.1143-1-64736c12.cust.bredbandsbolaget.se) has joined #tux3 2008-09-05 01:52 ok, rm doesn't segfault if the file doesn't exist now 2008-09-05 02:06 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-05 02:07 flips: heh you make almost the identical change to the Makefile that I did in that commit 2008-09-05 02:07 was only a few characters off, heh 2008-09-05 02:08 hey flips 2008-09-05 02:10 do you think that the ZFS notion of having sha1 hashes all over the place is sufficient to guarantee the integrity of a volume given that you can fix the corruption from redundant copies ? 2008-09-05 02:10 in place of online checking or any kind of checking ? 2008-09-05 02:12 ooh too bad 2008-09-05 02:13 droppoed off 2008-09-05 02:13 -!- flips(~phillips@phunq.net) has joined #tux3 2008-09-05 02:13 dropped off 2008-09-05 02:13 flips: you there ? 2008-09-05 02:21 sha1 in expensive 2008-09-05 02:21 probably needlessly 2008-09-05 02:21 and unless they are real real careful about the order of operations 2008-09-05 02:21 they still can't guarantee memory/cpu/disk channel corruption won't go undetected 2008-09-05 02:22 s/in/is/ 2008-09-05 02:22 [not that you can ever be 100% sure, if your cpu/memory could be bad - unless you're making a net fs] 2008-09-05 02:23 yeah, they could be summing a bad chunk of memory 2008-09-05 02:24 what else ? 2008-09-05 02:24 what about phantom writes ? 
2008-09-05 02:46 ok night 2008-09-05 03:48 -!- nobody(c8c32a07@67.207.141.120) has joined #tux3 2008-09-05 04:27 -!- nobody(c8c32a07@67.207.141.120) has left #tux3 2008-09-05 05:06 -!- cdk(~chinmay@59.95.49.53) has joined #tux3 2008-09-05 05:07 -!- cdk(~chinmay@59.95.49.53) has left #tux3 2008-09-05 05:17 -!- stargazr5(~gauravstt@59.95.17.124) has joined #tux3 2008-09-05 07:12 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-05 09:36 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-09-05 10:18 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2008-09-05 10:35 -!- stargazr5(~gauravstt@59.95.19.213) has joined #tux3 2008-09-05 10:41 where does fuse log to? 2008-09-05 10:41 funny it doesn't just write to console when running on foreground 2008-09-05 10:58 oh it does 2008-09-05 10:58 silly me 2008-09-05 11:16 whoever came up with the readdir interface needs to be badly hurt 2008-09-05 12:07 hey 2008-09-05 12:07 hi 2008-09-05 12:11 moonbase:/more/src/hg/tux3/user/test# ls foo 2008-09-05 12:11 /bar /bar2 /bar3 /bar4 /bar5 /bar6 /foo /foo2 /foo3 /hello 2008-09-05 12:11 not sure what all the slashes are about 2008-09-05 12:14 the kernel filldir interface is the most amazing stinking pile of poo I have ever seen 2008-09-05 12:14 http://lxr.linux.no/linux+v2.6.26.3/fs/readdir.c#L146 2008-09-05 12:14 next time I'm in public I'll wear a bag over my head that says "no I am not a linux kernel hacker" 2008-09-05 12:15 dang it intel 2008-09-05 12:16 i'm looking for a x48 PCIe chipset, and they named theirs x48 2008-09-05 12:16 but it only supports 38 lanes 2008-09-05 12:16 I know I've seen a x48 lane pcie chipset out there 2008-09-05 12:16 bounders] 2008-09-05 12:17 that's a lot of lanes 2008-09-05 12:35 fuse marches on 2008-09-05 12:35 directory listing works now 2008-09-05 12:35 seems to 2008-09-05 12:36 I don't think it should be printing / for every file though 2008-09-05 12:36 wonder why it does that 2008-09-05 
12:36 exercise for konrad ;-) 2008-09-05 12:46 http://hardware.slashdot.org/comments.pl?sid=954803&cid=24889733 2008-09-05 12:47 someone passed on my call for helpers 2008-09-05 12:48 nice 2008-09-05 12:49 btw, I pinged wook for an assist on the visualization we discussed 2008-09-05 12:50 no response yet 2008-09-05 12:57 wooklag 2008-09-05 13:31 wook_lag 2008-09-05 13:31 :-) 2008-09-05 13:57 wow- tux3 got a slashdot lkml bounce 2008-09-05 13:58 ? 2008-09-05 13:58 its #2 on the lkml hottest message list now 2008-09-05 13:59 ah 2008-09-05 13:59 heh 2008-09-05 13:59 speculating that was from the slashdot reference earlier 2008-09-05 13:59 i bet its just flips clicking reload ;) 2008-09-05 14:00 actually probably due to the fact that its been on the front page of kerneltrap for the past couple days 2008-09-05 14:00 well past day 2008-09-05 14:00 could be 2008-09-05 14:38 hullo 2008-09-05 14:42 hi konrad 2008-09-05 14:42 see the extra slashes on the ls output? 2008-09-05 14:42 I did not have time to investigate 2008-09-05 14:42 nope, build error :) 2008-09-05 14:42 tux3fuse.c:251:41: error: macro "fuse_main" requires 4 arguments, but only 3 given 2008-09-05 14:42 fun fun 2008-09-05 14:43 wow 2008-09-05 14:43 back up cheek by jowel with Linus's post 2008-09-05 14:43 lkml? 
2008-09-05 14:43 slashdot blowback 2008-09-05 14:43 http://lkml.org/ 2008-09-05 14:44 konrad, I guess you have a different version of fuse 2008-09-05 14:44 too bad they couldn't keep the interface stable 2008-09-05 14:44 just add the NULL back 2008-09-05 14:44 we'll figure out what to do later 2008-09-05 14:45 konrad, anyway it should be tux3fs.c by now 2008-09-05 14:45 right 2008-09-05 14:45 got to go out for a bit 2008-09-05 14:45 heh, typo 2008-09-05 14:45 it's fux3fs.c 2008-09-05 14:47 whoops 2008-09-05 14:47 hm, getattr seems to be failing here 2008-09-05 14:47 yes, getattr is totally not implemented 2008-09-05 14:47 yet 2008-09-05 17:27 -!- tim_dimm_(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-05 17:32 konrad, which version of libfuse2 do you have? 2008-09-05 17:32 2.7.3 2008-09-05 17:32 2.5.3 here 2008-09-05 17:33 it's a shame they couldn't keep the api stable across 2 point releases 2008-09-05 17:33 I guess we need a #if on version 2008-09-05 17:33 hard to see how that will work 2008-09-05 17:34 heh 2008-09-05 17:34 or I upgrade 2008-09-05 17:34 and we define to work only on recent fuse 2008-09-05 17:34 :p 2008-09-05 17:34 if you feel like it 2008-09-05 17:34 I don't 2008-09-05 17:34 it's a backport to etch 2008-09-05 17:34 could just pass a define to gcc or somit 2008-09-05 17:34 hate fiddling with that 2008-09-05 17:35 yes, but the define has to key on something 2008-09-05 17:36 konrad, can you do ls /usr/lib/libfuse* 2008-09-05 17:37 then have stuff like /usr/include/fuse/fuse_compat.h 2008-09-05 17:38 and /usr/include/fuse/fuse_lowlevel_compat.h 2008-09-05 17:38 flips: you going to try and get this running on fuse ? 2008-09-05 17:38 bh, it's already running on fuse 2008-09-05 17:38 ok 2008-09-05 17:38 konrad did it 2008-09-05 17:38 nice 2008-09-05 17:38 when did konrad come on board ? 
2008-09-05 17:38 now debugging the bugs 2008-09-05 17:38 nice 2008-09-05 17:38 earlier this week 2008-09-05 17:38 nice 2008-09-05 17:38 I think 2008-09-05 17:38 maybe last week 2008-09-05 17:38 I have /usr/include/fuse/fuse_compat.h 2008-09-05 17:39 but no /usr/lib/libfuse 2008-09-05 17:39 ah, redhat puts them somewhere else I suppose 2008-09-05 17:39 ok, out 2008-09-05 17:39 for no good reason 2008-09-05 17:39 well, and 64-bit 2008-09-05 17:39 /usr/lib64 2008-09-05 17:39 ah 2008-09-05 17:39 and what's there? 2008-09-05 17:40 nothing 2008-09-05 17:40 libfuse2 it should be 2008-09-05 17:40 ah 2008-09-05 17:40 /lib64/libfuse.so 2008-09-05 17:40 should be a bunch more 2008-09-05 17:41 ls /usr/lib64/libfuse.so* 2008-09-05 17:41 ls /usr/lib64/libfuse*.so* 2008-09-05 17:42 nope 2008-09-05 17:42 rpm -ql fuse-devel 2008-09-05 17:42 ah, I dimly recall rpm -ql 2008-09-05 17:42 /lib64/libfuse.so 2008-09-05 17:42 /lib64/libulockmgr.so 2008-09-05 17:43 that's it 2008-09-05 17:43 for .so's 2008-09-05 17:43 packaged quite differently 2008-09-05 17:43 and incompatibly 2008-09-05 17:43 I suspect the problem isn't the fuse guys 2008-09-05 17:43 but "creative" redhatters 2008-09-05 17:45 #define FUSE_USE_VERSION 26 <- aha 2008-09-05 17:45 that must have something to do with it 2008-09-05 17:46 ah, does 2.5.3 do version 26? 2008-09-05 17:46 read your fuse/fuse.h 2008-09-05 17:49 #ifndef FUSE_USE_VERSION 2008-09-05 17:49 #define FUSE_USE_VERSION 21 2008-09-05 17:49 #endif <- then there are lots of function mismatches and no .readdir 2008-09-05 17:49 hm 2008-09-05 17:50 #define fuse_main(argc, argv, op) \ 2008-09-05 17:50 fuse_main_real(argc, argv, op, sizeof(*(op))) 2008-09-05 17:50 what have you got there? 
2008-09-05 17:51 #define fuse_main(argc, argv, op, user_data) \ 2008-09-05 17:51 fuse_main_real(argc, argv, op, sizeof(*(op)), user_data) 2008-09-05 17:51 idiotic api breakage 2008-09-05 17:52 :D 2008-09-05 17:52 well we can grep fuse.h for user_data 2008-09-05 17:52 in the makefile 2008-09-05 17:52 and write nasty notes there 2008-09-05 17:52 or we can ask on their mailing list 2008-09-05 17:53 better style 2008-09-05 17:53 there's probably an irc channel on this server 2008-09-05 17:53 I think I tried actually 2008-09-05 17:53 here and freenode 2008-09-05 17:53 nope 2008-09-05 17:54 http://lists.sourceforge.net/lists/listinfo/fuse-devel 2008-09-05 17:55 konrad, how would you like to be our representative on the fuse-devel list? 2008-09-05 17:55 :-) 2008-09-05 17:55 joining 2008-09-05 17:56 got an idea how to frame the question? 2008-09-05 17:56 "Why did you break the %#& API?" 2008-09-05 17:56 :-) 2008-09-05 17:56 excellent way to get silence 2008-09-05 17:57 right 2008-09-05 17:57 "what is the expected workaround for the extra user_data parameter added to fuse_main? 2008-09-05 18:02 konrad, you know what I'm going to do?
2008-09-05 18:03 hack my fuse.h 2008-09-05 18:03 and add an extra parameter that is ignored 2008-09-05 18:03 k 2008-09-05 18:03 I'll rev tux3 pretty soon to user the 4 parameter flavor of fuse_main 2008-09-05 18:03 skate first 2008-09-05 18:03 enjoy :) 2008-09-05 18:56 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-05 20:37 konrad, fixed the rename damage and adapted to the 4 arg from or fuse_main 2008-09-05 22:34 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-06 02:29 bleh 2008-09-06 02:30 the mailserver on sf.net rejects senders who don't have a postmaster@ address 2008-09-06 02:33 sending/subscribing from my other address 2008-09-06 03:35 -!- cdk(~chinmay@121.246.36.66) has joined #tux3 2008-09-06 03:43 -!- cdk(~chinmay@121.246.36.66) has left #tux3 2008-09-06 05:01 -!- stargazr5(~gauravstt@59.95.4.235) has joined #tux3 2008-09-06 07:39 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-06 09:04 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-06 11:34 -!- stargazr5(~gauravstt@59.95.21.222) has joined #tux3 2008-09-06 14:34 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-06 15:12 nearly sk8 oclock 2008-09-06 15:12 test earlier these days 2008-09-06 15:12 gets earlier 2008-09-06 15:13 the french girls go back to their hotels earlier too 2008-09-06 15:13 temperature drops slightly below bikini degrees, centigrade 2008-09-06 15:25 we have a response 2008-09-06 15:26 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-06 15:30 For example autoconf that tests for the number of parameters. 2008-09-06 15:30 See http://tinyurl.com/6k6hnm on how this is done. 2008-09-06 15:30 Then again, FUSE 2.5.3 is almost 2 1/2 years old.. 2008-09-06 15:30 """ 2008-09-06 15:40 response? 
2008-09-06 15:40 oh 2008-09-06 15:40 the fuse list :-) 2008-09-06 15:40 autoconf is banned from tux3 2008-09-06 15:41 solutions that avoid autoconf are welcome 2008-09-06 15:41 for example, whatever test autoconf uses 2008-09-06 15:41 we cut & paste 2008-09-06 15:41 stripping out the fluff 2008-09-06 15:42 2.5 years is short on the unix timescale 2008-09-06 15:42 got to think longer term than that 2008-09-06 15:43 nice piece of autoconf wanking 2008-09-06 15:43 has nothing to do with fuse 2008-09-06 15:43 right, I don't like autoconf either 2008-09-06 15:43 hopefully a more clueful response is in the pipe :-) 2008-09-06 15:44 anyway, we have solved it, I hacked my fuse.h 2008-09-06 15:44 alright 2008-09-06 15:44 it's good you are on the fuse list 2008-09-06 15:45 I suspect there are many other things to complain about, most of them more important 2008-09-06 15:47 note that fuse has a parallel mode 2008-09-06 15:47 I actually tried to use fi->fh earlier 2008-09-06 15:47 so once we have it basically working 2008-09-06 15:47 ah 2008-09-06 15:47 but it was failing on the second read 2008-09-06 15:47 maybe I was shoving the wrong pointer in it 2008-09-06 15:47 need to get the fuse source and compile 2008-09-06 15:47 and debug the two together 2008-09-06 15:48 in uml even 2008-09-06 15:48 ok, I can cook up a recipe for that 2008-09-06 15:48 just turning on fuse event tracing in the kernel would be a huge help 2008-09-06 15:48 I wonder if there is an easy way to do that 2008-09-06 15:49 let's see what's in the fuse kernel code ;-) 2008-09-06 15:49 http://lxr.linux.no/linux+v2.6.26.3/fs/fuse/ <- fuse kernel code 2008-09-06 15:50 http://lxr.linux.no/linux+v2.6.26.3/include/linux/fuse.h <- and here 2008-09-06 15:51 http://lxr.linux.no/linux+v2.6.26.3/+ident=14862186 <- FOPEN_KEEP_CACHE for example 2008-09-06 15:52 http://lxr.linux.no/linux+v2.6.26.3/fs/fuse/dir.c#L755 <- /* Directories have separate file-handle space */ 2008-09-06 15:53 FUSE_GETATTR_FH 2008-09-06 15:53 
konrad, better than a pointer is the inum, for now 2008-09-06 15:54 call open_inode 2008-09-06 15:54 we don't really care about performance at this point 2008-09-06 15:54 just working right 2008-09-06 15:54 k 2008-09-06 15:55 the libfuse stuff is important too 2008-09-06 15:55 more important that the kernel I think 2008-09-06 15:55 so we should compile it with -g 2008-09-06 15:55 to set breaks and see how it screws up ;-) 2008-09-06 15:55 let's see, how do you get a debug build in debian 2008-09-06 15:56 weakness of debian 2008-09-06 15:56 this is where gentoo is good 2008-09-06 15:56 but I can just apt-get remove libfuse2 2008-09-06 15:56 and build from tarball like a proper hacker 2008-09-06 16:01 sudo apt-get remove --purge fuse-utils libfuse-dev libfuse2 2008-09-06 16:05 all the pain of autoconf is coming back to me now 2008-09-06 16:05 including library path skew 2008-09-06 16:13 ok, installed, working 2008-09-06 16:13 needlessly painful 2008-09-06 16:14 ld.so.config... not documented in man ld :-P 2008-09-06 16:14 ld.so.conf I meant 2008-09-06 16:14 see? pain 2008-09-06 16:15 aha! it was a fuse bug 2008-09-06 16:15 now does not claim to be busy on umount 2008-09-06 16:16 heh 2008-09-06 16:16 and tux3fs does not exit file file not found when you unmount 2008-09-06 16:16 what version of fuse are you using now? 2008-09-06 16:16 um 2008-09-06 16:16 2.7.4 2008-09-06 16:16 is it leet? 
2008-09-06 16:17 heh 2008-09-06 16:17 more recent anyways 2008-09-06 16:17 I decided not to track their unstable 2008-09-06 16:17 any more than they should track ours 2008-09-06 16:17 for their home dirs ;-) 2008-09-06 16:18 kay, got to get serious about skating 2008-09-06 16:18 tux3 fuse is looking good 2008-09-06 16:32 it's about time to create the version table 2008-09-06 16:33 will be inode number 2, maybe 2008-09-06 16:33 or maybe it deserves to be number 0 2008-09-06 16:33 even more important than bitmap 2008-09-06 16:33 which is after all a redundant structure 2008-09-06 16:34 0: version 1: bitmap 2: extentmap 2008-09-06 16:34 maybe 2008-09-06 16:35 the rule is: inums below 0x10 do not have dirents 2008-09-06 16:35 they are special tux3 files never seen by user space 2008-09-06 18:09 ok, sk8 oclock, really 2008-09-06 18:51 hey 2008-09-06 19:54 hi bh 2008-09-06 20:44 -!- ChanServ changed mode/#tux3 -> +o flips 2008-09-06 20:44 -!- flips changed topic to "Tux3 list members just hit 100! ~ http://tux3.org" 2008-09-06 20:44 -!- flips changed topic to "Tux3 list membership just hit 100! 
~ http://tux3.org" 2008-09-06 20:44 -!- flips changed mode/#tux3 -> -o flips 2008-09-06 23:17 -!- stargazr5(~gauravstt@59.95.22.62) has joined #tux3 2008-09-07 00:24 flips: ping 2008-09-07 00:24 pong 2008-09-07 00:24 hmm, my autoponger seems to be working 2008-09-07 00:25 heh 2008-09-07 00:25 so, tux3fs isn't working for me now 2008-09-07 00:26 whoops 2008-09-07 00:26 i upgraded my fuse 2008-09-07 00:26 let's see if it works for me 2008-09-07 00:26 but when i run tux3fs, it just hangs 2008-09-07 00:26 gdb is your friend 2008-09-07 00:26 # ./tux3fs /tmp/testdev /tmp/test -f 2008-09-07 00:26 devmap_blockio: read [2] 2008-09-07 00:26 devmap_blockio: read [3] 2008-09-07 00:26 lookup inode 0x0, 0 + 0 2008-09-07 00:26 mode 0000000 uid 0 gid 0 root 4:1 ctime 0 size 20 2008-09-07 00:26 lookup inode 0xd, 0 + d 2008-09-07 00:26 mode 0040755 uid 0 gid 0 root 8:1 2008-09-07 00:26 is this make testfs? 2008-09-07 00:27 well yeah, but run manually 2008-09-07 00:27 due to no sudo being installed 2008-09-07 00:27 ls -ld /tmp/test 2008-09-07 00:27 drwxr-xr-x 0 root root 0 Dec 31 1969 /tmp/test 2008-09-07 00:27 although, actually, it does work 2008-09-07 00:27 i just have to be root to see it 2008-09-07 00:28 so what is the hang? <- joke 2008-09-07 00:28 I see 2008-09-07 00:28 as a regular user i get 2008-09-07 00:28 ?--------- ? ? ? ? ? /tmp/test 2008-09-07 00:28 heh 2008-09-07 00:28 right 2008-09-07 00:28 until somebody fixes it 2008-09-07 00:28 ok i'll fix it 2008-09-07 00:28 I'm back to adding new grooviness to tux3 itself 2008-09-07 00:28 thanks 2008-09-07 00:29 post on the version table coming in a few minutes 2008-09-07 00:29 just had a bunch of caffeine, i'll be up a while 2008-09-07 00:29 heh 2008-09-07 00:29 fun 2008-09-07 00:29 I just played them demo of disney's new sick trick quad game, I'm kinda hyped 2008-09-07 00:29 it's good 2008-09-07 00:29 recommended 2008-09-07 00:29 video game? 
2008-09-07 00:29 ridiculously over the top quad bike racing/tricks game 2008-09-07 00:29 yes 2008-09-07 00:30 eh 2008-09-07 00:30 you'd like it 2008-09-07 00:30 it's sick 2008-09-07 00:30 not much gore though 2008-09-07 00:30 i can't get an adrenaline rush from video games anymore 2008-09-07 00:30 kinda pointless 2008-09-07 00:30 probably would from this one 2008-09-07 00:31 if i want to entertain myself i can go and do stupid shit in real life 2008-09-07 00:31 .. the advantage of having a 180hp motorcycle ;) 2008-09-07 00:31 you can watch this, then go out and kill yourself completely 2008-09-07 00:31 it's pretty sylish 2008-09-07 00:32 funny thing is, people are actually doing tricks really close to what's inthe game 2008-09-07 00:32 the game was meant to be ridiculous 2008-09-07 00:32 well 2008-09-07 00:32 real quads don't have helicopters flying beside them at the top of a jump 2008-09-07 00:35 they could 2008-09-07 05:10 -!- stargazr5(~gauravstt@59.95.24.129) has joined #tux3 2008-09-07 05:14 -!- cdk(~chinmay@121.246.36.66) has joined #tux3 2008-09-07 05:19 -!- cdk(~chinmay@121.246.36.66) has left #tux3 2008-09-07 06:56 -!- dipanjan(~chatzilla@122.167.27.144) has joined #tux3 2008-09-07 09:42 -!- dipanjan(~chatzilla@122.167.27.144) has joined #tux3 2008-09-07 11:19 -!- pgquiles(~pgquiles@253.Red-83-44-239.dynamicIP.rima-tde.net) has joined #tux3 2008-09-07 15:56 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-07 16:11 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-07 16:13 another big design note up 2008-09-07 16:14 in case anybody thinks these take longer to read that to write, I can assure you that is not the case ;-) 2008-09-07 16:16 http://slashdot.org/comments.pl?sid=954803&cid=24892667 <- warm n fuzzy 2008-09-07 16:20 flips: i still don't see your design note 2008-09-07 16:21 ah just delivered.. 
2008-09-07 16:21 slow 2008-09-07 16:22 flips: i emailed a patch, and actually just commited a fix for a non-bug (yet) in it in my repo 2008-09-07 16:24 actually this is dumb: 2008-09-07 16:24 return inode ? 0 : -1; 2008-09-07 16:25 return -same as: 2008-09-07 16:25 return -!inode; 2008-09-07 16:27 you mean -!!inode? 2008-09-07 16:27 oh, nevermind 2008-09-07 16:30 back 2008-09-07 16:30 flips: how many bits for atom numbers? 2008-09-07 16:30 variable 2008-09-07 16:30 start with 16 bits for small ones 2008-09-07 16:31 and have a bigger variant 2008-09-07 16:31 with say 48 bits 2008-09-07 16:31 or 32 2008-09-07 16:31 don't have to get silly 2008-09-07 16:31 this may be painfully obvious 2008-09-07 16:31 but, why limit that approach to just xattrs? 2008-09-07 16:32 because nearly every file has a data attribute 2008-09-07 16:32 i was talking about this for sharing user, group, mode 2008-09-07 16:32 and mutliple versions of it 2008-09-07 16:32 ah 2008-09-07 16:32 of course 2008-09-07 16:32 bundle them all together 2008-09-07 16:32 what I meant in my post 2008-09-07 16:32 how bundle? 2008-09-07 16:33 i mean, put the user, group, and mode in the atom table too 2008-09-07 16:33 along with xattrs 2008-09-07 16:33 sure 2008-09-07 16:33 good idea 2008-09-07 16:33 separately or combined? 2008-09-07 16:33 the latter gets into content addressing 2008-09-07 16:33 probably combined 2008-09-07 16:33 more efficient that way 2008-09-07 16:34 more corner cases 2008-09-07 16:34 howso 2008-09-07 16:35 maxgid * maxuid * anymdode = huge 2008-09-07 16:36 yeah but that is a vrey unlikely case 2008-09-07 16:36 no reason to pay that max price for every file if you have only 10 or so on most systems 2008-09-07 16:36 how about a list post? 
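The `return -!inode` shorthand from the exchange above can be checked in isolation. A minimal sketch, assuming a stand-in `struct inode` and hypothetical function names (none of this is tux3 code): `!inode` is 1 for a NULL pointer and 0 otherwise, so negating it gives -1 on failure and 0 on success, the same as `inode ? 0 : -1`. The `-!!inode` variant is the inverse, which is why it was withdrawn.

```c
#include <stddef.h>

/* Hypothetical stand-in for the inode type; only the pointer value matters. */
struct inode { int unused; };

/* Long form: 0 when an inode was found, -1 when the pointer is NULL. */
static int status_ternary(const struct inode *inode)
{
	return inode ? 0 : -1;
}

/* Short form: !inode is 1 for NULL and 0 otherwise, so this is -1 or 0. */
static int status_negated(const struct inode *inode)
{
	return -!inode;
}

/* For contrast: !!inode is 1 for non-NULL, so this is -1 on success. */
static int status_double_negated(const struct inode *inode)
{
	return -!!inode;
}
```

Both forms should compile to much the same code, so the choice is about readability rather than speed.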
2008-09-07 16:36 yeah, working on it ;) 2008-09-07 16:36 recalling my one-liner from irc logs ;) 2008-09-07 16:37 find / -xdev -type d -exec sh -c 'ls -l $1 | awk "/^\-/ {print \$1}" | sort -u |wc -l' {} {} \; | sort | uniq -c 2008-09-07 16:37 heh 2008-09-07 16:37 another idea: record the actual user name in the fs, not just the id 2008-09-07 16:37 so that security will work when you copy a volume to a different system 2008-09-07 16:38 or at least some kind of mapping table 2008-09-07 16:38 which might well be an atom table 2008-09-07 16:39 shapor: what's that bit of shell do? 2008-09-07 16:39 using the atom table we could map 32 bit uid and gid down to 16 or even 8 bits 2008-09-07 16:39 its not quite what i wanted 2008-09-07 16:39 but it basically breaks down the number of unique permissions sets per directory on the system 2008-09-07 16:40 rather, it creates a distribution of uniqueness of permissions in a single directory 2008-09-07 16:41 shapor, is your tux3fs patch ready to go in, should I pull? 2008-09-07 16:42 yeah 2008-09-07 16:44 whoops, I got 3 heads somehow 2008-09-07 16:45 damn, there they are 2008-09-07 16:46 sitting in your repo, I did view but didn't look all the way down 2008-09-07 16:46 now how do you revert a pull 2008-09-07 16:46 I guess this is a big flaming deficiency in hg 2008-09-07 16:47 ok, hg rollback works 2008-09-07 16:47 shapor, could you merge your heads before I pull? 
2008-09-07 16:48 how do i do that 2008-09-07 16:48 "hg merge" 2008-09-07 16:48 if it won't merge, tell it the version you want merged 2008-09-07 16:48 wish there was a simple delete head 2008-09-07 16:48 abort: there is nothing to merge, just use 'hg update' or look at 'hg heads' 2008-09-07 16:49 right, you need to tell it exactly what head 2008-09-07 16:49 start with raponen's patch you added independently 2008-09-07 16:49 that might have conflicts 2008-09-07 16:49 because I adjusted a little 2008-09-07 16:49 hg is broken here 2008-09-07 16:51 http://www.selenic.com/pipermail/mercurial/2006-September/010628.html "Delete a branch/head" 2008-09-07 16:51 hm ok 2008-09-07 16:51 merged my old heads 2008-09-07 16:51 now there is only one 2008-09-07 16:51 i think all is well with my repo 2008-09-07 16:52 "added 12 changesets with 11 changes to 3 files" 2008-09-07 16:53 so I got all your braindumps with that pull too ;-) 2008-09-07 16:53 well they are now carved in the history of tux3 for eternity 2008-09-07 16:54 "You can use either: 2008-09-07 16:54 hg clone -r oldrepo newrepo 2008-09-07 16:54 (safest method) 2008-09-07 16:54 or hg strip 2008-09-07 16:54 (does in repo stripping, you need mq extension activated to have the 2008-09-07 16:54 strip command)" http://www.selenic.com/pipermail/mercurial/2006-September/010628.html 2008-09-07 16:55 need to get in the habit of preparing a clean repo for pulling I think 2008-09-07 16:55 it can be automated 2008-09-07 16:55 or maybe In can tell pull to just pull the tip 2008-09-07 16:55 any reason for choosing hg over git? 
2008-09-07 16:56 it's _way_ nicer to use 2008-09-07 16:56 git has a bunch of fuzzy thinking that nobody fixed just because it came from linus 2008-09-07 16:56 it's about exactly as fast, in spite of being written in super slow python 2008-09-07 16:57 I wasn't questioning the speed 2008-09-07 16:58 shapor, hg has pull -r, which I can use if I'm alert to see extra heads 2008-09-07 16:58 but I think the better approach is to clone the version you want me to pull, so I always pull from the same place, and you decide exactly what I pull 2008-09-07 16:58 knew you weren't 2008-09-07 16:58 alright :) 2008-09-07 16:58 it's just amazing that mercurial is as fast as it is, in spite of being hobbled by python 2008-09-07 16:59 matt mackall - amazing hacker 2008-09-07 16:59 heh 2008-09-07 17:00 I'd bet it either employs a lot of the stuff python is fast at (because underneath it's in C) or it employs its own C extensions 2008-09-07 17:00 python's dicts are really speedy, for example 2008-09-07 17:00 shapor, one thing I forgot to mention in my xattr post - we are going to use ext2 dirops for the atom table for the time being ;-) 2008-09-07 17:01 konrad, all of that 2008-09-07 17:01 plus it's a better design 2008-09-07 17:01 I bet it would kick git's tail if it were converted to c++ 2008-09-07 17:01 heh 2008-09-07 17:02 sk8 oclock 2008-09-07 17:02 whoo! 2008-09-07 17:02 enjoy 2008-09-07 17:02 we will 2008-09-07 17:02 going to meet shap for cocktails on the strand 2008-09-07 17:02 skating under the influence is fun and legal 2008-09-07 17:02 so far 2008-09-07 17:06 heh 2008-09-07 19:24 relatively legal anyway 2008-09-07 19:34 this channel is logged :) 2008-09-07 19:48 by me ;) 2008-09-07 20:07 heh 2008-09-07 20:07 me too, but it's not like I'm publicizing them 2008-09-07 20:07 konrad, completely legal 2008-09-07 20:07 unless of course you take out a baby buggy, just don't do that 2008-09-07 20:08 shapor, what say we put the logs up on tux3.org? 
2008-09-07 20:09 and let googlebot chew on them 2008-09-07 20:09 help somebody learn something about skating maybe 2008-09-07 20:11 ok, howbig has to know how to include the size of variable sized items now 2008-09-07 20:12 I suppose easiest is to make the size of variable sized items a generic field 2008-09-07 20:13 something like [ kind:version:16, atom:16, size:16, data[size] } 2008-09-07 20:14 or for a file data attribute: [ kind:version:16, size:16, data[size] } 2008-09-07 20:14 hmm 2008-09-07 20:14 might be better to include the atom in the data 2008-09-07 20:14 { kind:version:16, size:16, atom:16 data[size - 2] } 2008-09-07 20:15 ok that seems better 2008-09-07 20:15 I'll make it so unless somebody has a better idea 2008-09-07 20:15 0x1dea <- great magic number, got to use it for something 2008-09-07 20:51 "howmuch" makes its grand return 2008-09-07 20:52 howbig -> size of fixed attributes, howmuch -> size of variable attributes 2008-09-07 22:30 flips: googlebot is already chewing on them 2008-09-07 22:31 they are linked off shapor.com/tux3 2008-09-07 22:31 ah good 2008-09-07 22:31 lets see what comes up for tux3 sk8 2008-09-07 22:31 nothing 2008-09-07 22:31 wtf is up with googlebot 2008-09-07 22:31 bad googlebot 2008-09-07 22:50 hm the lots are .txt, dunno if that has any effect 2008-09-07 23:13 flips: ah you pulled from me this time so i didn't have to merge my own changes, heh 2008-09-07 23:53 :-) 2008-09-08 00:13 hey 2008-09-08 00:13 how's it going ? 2008-09-08 00:14 ACTION just made it back to San Diego 2008-09-08 00:32 hey 2008-09-08 00:32 extended attributes are in process of being implemented 2008-09-08 00:32 only the disk format had any kind of design as of yesterday 2008-09-08 00:36 static struct xattrs foo = { .size = sizeof(foo), .list = { 2008-09-08 00:36 { .atom = 666, .size = 5, .data = "hello" }, 2008-09-08 00:36 { .atom = 777, .size = 6, .data = "world!" 
}, 2008-09-08 00:36 } }; 2008-09-08 02:46 the gates/seinfeld ad is awful, omgponies 2008-09-08 02:46 I couldn't keep watching it 2008-09-08 02:46 I've never seen anything that bad, not even close 2008-09-08 02:47 what were they thinking 2008-09-08 02:47 and how much did it cost ;-) 2008-09-08 02:48 -!- kbingham(~kbingham@193.132.141.186) has joined #tux3 2008-09-08 02:54 agreed 2008-09-08 03:21 ext3 delete speed really is pathetic 2008-09-08 03:22 that's one thing we must do much better 2008-09-08 03:22 and we will 2008-09-08 03:22 going to be fun when we get to the transaction handling part 2008-09-08 03:50 I found newer old skates! 2008-09-08 03:50 these are only a half size smaller than what I currently wear 2008-09-08 03:50 I think 2008-09-08 03:51 :-) 2008-09-08 03:51 it's spreading 2008-09-08 03:51 isn't it a little late over there? 2008-09-08 03:51 or early? 2008-09-08 03:51 same timezone as you silly 2008-09-08 03:51 oh, I got the idea it was mn 2008-09-08 03:52 ah, comcast's rdns is broken 2008-09-08 03:52 they kinda fit 2008-09-08 03:55 it's late all the same 2008-09-08 03:55 I'd better crash 2008-09-08 03:55 xattr hacking is going slowly 2008-09-08 03:56 slowly is better than not at all 2008-09-08 03:57 true 2008-09-08 03:57 it's going to be good I think 2008-09-08 03:58 I expect better that average attriburte cache performance 2008-09-08 04:00 with this much effort invested I don't doubt it 2008-09-08 04:00 ok these skates kinda hurt my bones after a little wearing 2008-09-08 04:00 but so do ski boots so what's new 2008-09-08 04:00 heh, it's tiny compared to the effort invested in any other fs I know of 2008-09-08 04:01 you can get decent skates online for $200 2008-09-08 04:01 right 2008-09-08 04:01 that don't hurt 2008-09-08 04:01 but I have no income at present 2008-09-08 04:01 student 2008-09-08 04:01 and these aren't too bad 2008-09-08 04:01 ah, go crazy on your skates then 2008-09-08 04:01 been them up 2008-09-08 04:01 I'm busily destroying mine 
2008-09-08 04:01 need new ones pretty soon 2008-09-08 04:02 heh 2008-09-08 04:02 I'll take em for a spin tomorrow 2008-09-08 04:02 seattle's kind of hilly for skates 2008-09-08 04:02 I used to bike a bit, which seems easier on hills 2008-09-08 04:05 static struct xattrs foo = { .blob = { 2008-09-08 04:05 { .code = 666, .size = 6, .data = "hello" }, 2008-09-08 04:05 { .code = 777, .size = 7, .data = "world!" }, 2008-09-08 04:05 } }; 2008-09-08 04:06 cache form of immediate xattrs 2008-09-08 04:06 how do we know how big .blob is? 2008-09-08 04:06 struct xattrs { unsigned size; struct xattr { u16 code, size; char data[]; } blob[]; } PACKED; 2008-09-08 04:06 we count it when loading the inode 2008-09-08 04:06 right, your struct above didn't mention size 2008-09-08 04:06 there is a .size field 2008-09-08 04:06 right 2008-09-08 04:06 because C is too braindamaged to calculated it 2008-09-08 04:07 mhm 2008-09-08 04:07 does very much the wrong thing when you ask it to do something reasonable 2008-09-08 04:07 the linux kernel isn't written in ocaml though 2008-09-08 04:07 static struct xattrs foo = { .size = sizeof(foo), .blob = { <- ought to work 2008-09-08 04:07 but it does not 2008-09-08 04:07 jw, what does it do? 2008-09-08 04:08 it uses offsetof(struct xattrs, blob) 2008-09-08 04:08 some wanking about flexible arrays 2008-09-08 04:08 ah 2008-09-08 04:08 but the compiler bloody initialized the thing to a certain size and should use that as sizeof 2008-09-08 04:09 flaw in C imho 2008-09-08 04:09 C or GCC? 2008-09-08 04:09 committee misfeature 2008-09-08 04:09 standard 2008-09-08 04:12 anyway: 2008-09-08 04:12 xattr 666: 0x804a37c: 68 65 6c 6c 6f 00 "hello." 2008-09-08 04:12 xattr 777: 0x804a386: 77 6f 72 6c 64 21 00 "world!." 
2008-09-08 04:12 dump_xattrs: zero length xattr 2008-09-08 04:12 that's what it does if there is garbage in it 2008-09-08 04:14 the plan is to walk across the inode attrs filling in a vector with pointer to each xattr encountered 2008-09-08 04:15 then add up the sizes of all the xattrs, allocate memory big enough, and copy the xattr data into the cache struct shown above 2008-09-08 04:17 the xattr cache vector will be a binary sized multiple 2008-09-08 04:17 so there is some slack space, some of the time, to store more xattrs in it 2008-09-08 04:18 but no big deal to just realloc to the next binary size up when necessary 2008-09-08 04:18 better than a linked list 2008-09-08 04:19 immediatel file data probably better go in this cache struct too 2008-09-08 04:19 though it can possibly also go in the page cache 2008-09-08 04:19 it will go there 2008-09-08 04:20 but if the page gets evicted, I think we want to be able to go back to the in-memory inode to repopulate the page cache, rather than going all the way back to the inode table block 2008-09-08 04:20 this means there will be some triple caching of immediate file data: 1) in the inode table block 2) in the xattr cache 3) in the page cache 2008-09-08 04:21 seems a little excessive 2008-09-08 04:21 got to think about that 2008-09-08 04:22 we also have double caching of xattr data: 1) in the inode table block and 2) in the page cache 2008-09-08 04:22 don't really like that either 2008-09-08 04:22 sorry 2008-09-08 04:23 we also have double caching of xattr data: 1) in the inode table block and 2) in the inode's xattr cache 2008-09-08 04:23 what might make more sense is to pin the inode table block in memory and have the inode point into the block buffer 2008-09-08 04:24 so if the inode is evicted by the vm, it drops its count on the inode table block, which may now be evicted if no other inode holds a count on it 2008-09-08 04:25 if we do that, xattr caching looks maybe more sensible 2008-09-08 04:26 if attributes are 
heavily versioned though, we may pin a log of inode table blocks in memory, which hold a very low density of data we are actually using 2008-09-08 04:26 --sensible 2008-09-08 04:27 I am probably overworried about this double/tripe caching issue 2008-09-08 04:27 we have that anyway with existing inode attributes 2008-09-08 04:29 the triple caching of immediate data can be improved to just double caching by not loading immediate file data into the attribute cache. I dunno. 2008-09-08 04:29 instead, we let a miss in the page cache to retrieve the inode table block and load the immediate data into the page cache 2008-09-08 04:30 ok, this is right 2008-09-08 04:31 then when we update the inode on disk, we store an immediate data attribute only if the page cache is a) dirty and b) still small enough to be an immediate data attribute 2008-09-08 07:17 -!- kbingham(~kbingham@193.132.141.186) has joined #tux3 2008-09-08 07:17 -!- pgquiles(~pgquiles@253.Red-83-44-239.dynamicIP.rima-tde.net) has joined #tux3 2008-09-08 07:17 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2008-09-08 07:17 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-09-08 07:17 -!- flips(~phillips@phunq.net) has joined #tux3 2008-09-08 07:17 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2008-09-08 07:17 -!- konrad(~konrad@c-24-16-74-109.hsd1.mn.comcast.net) has joined #tux3 2008-09-08 07:17 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-09-08 09:00 -!- stargazr5(~gauravstt@59.95.36.185) has joined #tux3 2008-09-08 10:07 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-08 11:24 so where were we 2008-09-08 11:24 xattrs 2008-09-08 11:25 so we are going to have this little xattr cache hanging off each inode, which is just a binary size chunk of memory a la kmalloc 2008-09-08 11:25 and xattrs get loaded into it when the inode is loaded, and written into it when somebody sets an xaddr, marking the inode 'xattr dirty' 2008-09-08 11:26 meaning 
the xattr cache has to be written back to the inode table block when the inode is saved/synced 2008-09-08 11:27 xattr atoms are stored on disk in directory format, specifically ext2 directory format for now. So we do an ext2_find_entry to resolve an xattr name to an atom on sys_getxattr or sys_setxattr 2008-09-08 11:28 then search the inode's xattr cache for that atom 2008-09-08 11:29 if ext2_find_entry fails for sys_setxattr, then we do ext2_create_entry 2008-09-08 11:30 for now, we are always going to load all xattrs when an inode is loaded and write all xattrs when the inode is saved 2008-09-08 11:30 later, especially with versioning, we don't want to load all xattrs every time 2008-09-08 11:31 so getxattr will first search the cache, then go to the inode table block if that fails 2008-09-08 11:31 and setxattr will have to scan the xattrs present in the inode to know which ones to keep and which ones to overwrite on save 2008-09-08 11:32 the new size of a saved inode will then not be completely determinable from examining the inode alone, as it is now 2008-09-08 11:32 so we get a little more complexity here, not too bad 2008-09-08 11:33 the initial implementation of load all, save all, no versioning will be pretty simple and fast 2008-09-08 11:35 always calling ext2_find_entry for each xaddr atom can be avoided by keeping a hash of xattr atoms, and we only do the find_entry on a miss in the xattr atom hash, or we always keep the xattr hash fully populated for now (there are usually only a few different kinds of xattrs) 2008-09-08 11:43 -!- elicriffield(~elicriffi@66.249.86.209) has joined #tux3 2008-09-08 11:58 hey 2008-09-08 12:07 hey 2008-09-08 12:09 hi eli 2008-09-08 12:09 it's xattrs day today ;-) 2008-09-08 12:09 oh fun :) 2008-09-08 12:09 yah, not the most exciting, exciting to some folks though 2008-09-08 12:10 man im just trying to stay awake today 2008-09-08 12:10 I guess a reasonable approach is emerging, should have something working by tomorrow say 
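The xattr cache described above (records packed back to back, each tagged with an atom and a size) could be searched roughly like this. This is a sketch under assumptions, not the code from iattr.c: `struct xattr_head`, `xcache_add` and `xcache_find` are hypothetical names, the cache is a plain byte buffer with a caller-tracked `used` count, and a zero-length record is treated as garbage, matching the "zero length xattr" complaint in the dump output. Growing the buffer to the next binary size is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header, mirroring the { code, size, data[] } shape. */
struct xattr_head { uint16_t atom, size; };

/*
 * Append one record to a cache holding *used bytes out of max.
 * Returns 0 on success, -1 if the record is empty or does not fit
 * (a real cache would realloc to the next binary size instead).
 */
static int xcache_add(unsigned char *cache, size_t *used, size_t max,
	uint16_t atom, const void *data, uint16_t size)
{
	struct xattr_head head = { .atom = atom, .size = size };
	assert(*used <= max);
	if (!size || max - *used < sizeof head + size)
		return -1;
	memcpy(cache + *used, &head, sizeof head);
	memcpy(cache + *used + sizeof head, data, size);
	*used += sizeof head + size;
	return 0;
}

/*
 * Find the record for atom in a cache of used bytes.
 * Returns a pointer to the record data and stores its length in *size,
 * or returns NULL if the atom is absent or the cache looks corrupt.
 */
static const unsigned char *xcache_find(const unsigned char *cache, size_t used,
	uint16_t atom, uint16_t *size)
{
	size_t pos = 0;
	while (used - pos >= sizeof(struct xattr_head)) {
		struct xattr_head head;
		memcpy(&head, cache + pos, sizeof head); /* records may be unaligned */
		pos += sizeof head;
		if (!head.size || head.size > used - pos)
			return NULL; /* zero length or runs past the end: garbage */
		if (head.atom == atom) {
			*size = head.size;
			return cache + pos;
		}
		pos += head.size;
	}
	return NULL;
}
```

Resolving an xattr name to its atom (the ext2_find_entry step, or the atom hash in front of it) would happen before calling `xcache_find`; the sketch only covers the per-inode search.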
2008-09-08 12:10 right, it's one of those for me too 2008-09-08 12:13 make iattr && ./iattr 2008-09-08 12:13 gcc -std=gnu99 -Wall -g iattr.c -o iattr 2008-09-08 12:13 xattr 666: 0x804a37c: 68 65 6c 6c 6f 00 "hello." 2008-09-08 12:13 xattr 777: 0x804a386: 77 6f 72 6c 64 21 00 "world!." 2008-09-08 12:13 two xattrs 2008-09-08 12:13 in an xcache ;-) 2008-09-08 12:13 now to wrap that with lots of xattr access yumminess 2008-09-08 12:20 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-08 12:28 ACTION reads shapors diversity post 2008-09-08 13:04 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2008-09-08 13:42 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-09-08 14:37 flips: 2008-09-08 14:37 You don't even need to play with autoconf, just do 2008-09-08 14:37 #define FUSE_USE_VERSION 26 2008-09-08 14:37 #include <fuse.h> 2008-09-08 14:37 ... 2008-09-08 14:37 #if FUSE_VERSION >= 26 2008-09-08 14:38 fuse_main(argc, argv, &my_op, NULL); 2008-09-08 14:38 #else 2008-09-08 14:38 fuse_main(argc, argv, &my_op); 2008-09-08 14:38 #endif 2008-09-08 14:38 But all this is only important if you need some API features from 2008-09-08 14:38 2.6.x/2.7.x.
Otherwise you can just use the old API unconditionally: 2008-09-08 14:38 #define FUSE_USE_VERSION 25 2008-09-08 14:38 #include <fuse.h> 2008-09-08 15:00 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-08 16:56 ok, xattr cache lookup works 2008-09-08 16:56 konrad, we now do #define FUSE_USE_VERSION 27 2008-09-08 16:56 aj 2008-09-08 16:57 but good suggestion 2008-09-08 16:57 nice solution 2008-09-08 16:57 alright 2008-09-08 16:57 I should have xattr support in tux3.c by tomorrow 2008-09-08 16:58 then we can try it out in fuse 2008-09-08 16:58 fun 2008-09-08 16:58 I don't think we have functions for that yet 2008-09-08 16:58 not in tux3.c 2008-09-08 16:58 writing them now 2008-09-08 16:58 or rather, not in inode.c 2008-09-08 16:59 I meant in tux3fs.c 2008-09-08 16:59 maybe tux3fuse.c but I havn't looked at it much 2008-09-08 16:59 xcache_dump and xcache_lookup work, now writing xcache_update, which is considerably harder 2008-09-08 16:59 I wonder what passes for an xattr delete 2008-09-08 17:00 setxattr to empty? 2008-09-08 17:00 the low level fuse api has it 2008-09-08 17:14 alright 2008-09-08 17:15 touch: setting times of `tmp/abc': Function not implemented 2008-09-08 17:15 (high level api) 2008-09-08 17:17 I know 2008-09-08 17:17 don't know what it is 2008-09-08 17:17 sniffed at it a little 2008-09-08 17:17 libc braindamage it seems 2008-09-08 17:18 triggered by some combination of fuse things 2008-09-08 17:19 konrad, I suggest asking on the fuse list 2008-09-08 17:24 hm?
I think we just havn't implemented one of the functions fuse wants us to 2008-09-08 17:24 might be 2008-09-08 17:25 Function 'name' not implemented would be much more informative 2008-09-08 17:31 hey tim_dimm 2008-09-08 17:31 hey flips 2008-09-08 17:35 shapor: ping 2008-09-08 17:40 xcache_delete works 2008-09-08 17:48 xcache_update works 2008-09-08 17:49 now some memory management to take care of changing the size of the xcache as necessary 2008-09-08 17:49 but first, a skate 2008-09-08 17:50 it's looking quite like the tux3 command will have get/set xattr by tomorrow 2008-09-08 19:43 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-08 19:59 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-08 22:53 time for a checkin 2008-09-08 23:16 been 30 hours :( 2008-09-08 23:33 folks 2008-09-09 00:04 yo 2008-09-09 00:17 ACTION starts writing encode_xattrs 2008-09-09 00:24 hmm, issue 2008-09-09 00:24 what endianness of xattr data? 2008-09-09 00:25 I suppose the filesystem does not care 2008-09-09 00:26 if the application cares, it better take care of it 2008-09-09 00:28 flips: unlink a 0 length file causes an infinite loop heh 2008-09-09 00:28 heh 2008-09-09 00:29 fixed yet? 2008-09-09 00:29 free <- [ffffdddddddddddd] 2008-09-09 00:29 bfree: block 0xffffdddddddddddd already free! 2008-09-09 00:29 free <- [ffffdddddddddddd] 2008-09-09 00:29 bfree: block 0xffffdddddddddddd already free! 2008-09-09 00:29 ACTION thinks about why that might be 2008-09-09 00:29 those dddddddd are uninited data 2008-09-09 00:29 shapor: that happens to me a lot 2008-09-09 00:29 bad 2008-09-09 00:29 you should complain more 2008-09-09 00:30 :D 2008-09-09 00:30 does tree_chop do the right thing if there is no data int he btree? 
2008-09-09 00:30 hm 2008-09-09 00:30 and by 'that' I mean the infinite loop with the exact same values 2008-09-09 00:30 or is the way we're creating files in fuse broken 2008-09-09 00:31 /* let's clear out the buffer array and data and set to deadly data 0xdd */ 2008-09-09 00:31 memset(data_pool, 0xdd, max_buffers*bufsize); 2008-09-09 00:31 do you want me to fix it, or do you want to sniff first? 2008-09-09 00:31 it's deep in the poo 2008-09-09 00:32 no question its a tux3 bug 2008-09-09 00:32 i suspected fuse 2008-09-09 00:32 the d's above prove it 2008-09-09 00:32 hm 2008-09-09 00:32 should be able to reproduce easily either with tux3.c or inode.c 2008-09-09 00:33 hack the test a little 2008-09-09 00:33 starts out like this: 2008-09-09 00:33 ---- delete file ---- 2008-09-09 00:33 lookup inode 0x21, 0 + 21 2008-09-09 00:33 open_inode: found inode 0x21 2008-09-09 00:33 mode 0100666 uid 0 gid 0 root 21:1 2008-09-09 00:33 free <- [0] 2008-09-09 00:33 free <- [0] 2008-09-09 00:33 bfree: block 0x0 already free! 2008-09-09 00:33 free <- [0] 2008-09-09 00:33 bfree: block 0x0 already free! 2008-09-09 00:34 ffffdddddddddddd <- I'm surprised the allocator manages to deal with this 2008-09-09 00:34 that repeats for a while 2008-09-09 00:34 then 2008-09-09 00:34 filemap_blockio: read <0:bbbbbbbb> 2008-09-09 00:34 filemap_blockio: unmapped block bbbbbbbb 2008-09-09 00:34 free <- [ffffdddddddddddd] 2008-09-09 00:34 got to check some limits there 2008-09-09 00:34 bfree: block 0xffffdddddddddddd already free! 2008-09-09 00:34 and the loop begins 2008-09-09 00:34 it's possibly trying to treat a block of zeros a dleaf 2008-09-09 00:35 or it did a getblk where it should have done a bread 2008-09-09 00:35 something :-) 2008-09-09 00:35 is this the first tux3 bug fuse has found? 2008-09-09 00:35 think so 2008-09-09 00:36 actually, fuse found a couple bugs in dir.c 2008-09-09 00:40 ah i see 2008-09-09 00:40 so soon? 
2008-09-09 00:40 reproduced the bug by adding a test case to dleaf.c 2008-09-09 00:40 and running make dleaftest 2008-09-09 00:40 your favorite file 2008-09-09 00:41 empty dleaf is the culprit 2008-09-09 00:41 and the fix? 2008-09-09 00:42 fix? 2008-09-09 00:42 its more fun to break! 2008-09-09 00:42 :P 2008-09-09 00:42 so we allocate a dleaf even for a 0 length file? 2008-09-09 00:43 that is something that will be optimized away right? 2008-09-09 00:44 143 /* 2008-09-09 00:44 144 * Reasons this dleaf truncater sucks: 2008-09-09 00:44 yes 2008-09-09 00:44 haha 2008-09-09 00:44 it will be optimized away 2008-09-09 00:44 good thing we aren't doing premature optimization 2008-09-09 00:44 very good 2008-09-09 00:45 bug wouldnt be noticed 2008-09-09 00:45 flips: were you high when you wrote this? 2008-09-09 00:45 quite 2008-09-09 00:45 konrad was complicit 2008-09-09 00:46 so what is the condition it doesn't handle? 2008-09-09 00:46 heh 2008-09-09 00:46 empty dleaf, but initialized? 2008-09-09 00:46 where did those ddddddd's come from? 
2008-09-09 00:46 nah it just overflowed 2008-09-09 00:47 to some other area 2008-09-09 00:47 at first it was all zeros 2008-09-09 00:48 flips: in general how are we going to deal with corruption 2008-09-09 00:48 we need a lot more integrity checking 2008-09-09 00:48 it will slow it down 2008-09-09 00:48 yes 2008-09-09 00:48 that's life 2008-09-09 00:49 the rule is: you should be able to randomize any block and not cause an oops 2008-09-09 00:49 everything has to fail on nasty random disk data 2008-09-09 00:49 definitely 2008-09-09 00:49 filesystems are hard 2008-09-09 00:49 that can be pretty lightweight checking 2008-09-09 00:49 see dir.c 2008-09-09 00:49 very mature code 2008-09-09 00:49 has obviously gone through a lot of fixes in that regard 2008-09-09 00:50 not nice code mind you 2008-09-09 00:50 just heavily fixed 2008-09-09 00:51 so i was able to find that bug anyway 2008-09-09 00:51 because "touch file" is working in my repo :) 2008-09-09 00:52 :-) 2008-09-09 00:53 well sort-of working 2008-09-09 00:53 the mtime is wrong 2008-09-09 00:53 expected 2008-09-09 00:54 ah duh 2008-09-09 00:54 is getaatr just broken? 2008-09-09 00:54 hm 2008-09-09 00:54 could be 2008-09-09 00:54 i'm calling store_inode after setting the i_mtime 2008-09-09 00:54 er save_inode 2008-09-09 00:55 it does seem to be doing it by looking at the debug info 2008-09-09 00:55 huh, the xattr encoder seems to work 2008-09-09 00:55 you also have to set the ->present bit for the mtime 2008-09-09 00:56 I don't think anything does that for you 2008-09-09 00:58 i thought i saw something in inode.c 2008-09-09 00:58 hrm 2008-09-09 00:59 way back 2008-09-09 00:59 then it became discretionary 2008-09-09 00:59 without a test case ;-) 2008-09-09 00:59 oh 2008-09-09 00:59 tisk tisk 2008-09-09 01:12 ? 
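The rule stated above (any block may contain random data, and nothing should oops) can be met with fairly cheap checks before a block is trusted. The sketch below assumes a made-up leaf layout with a magic number and an entry count; it is not the tux3 dleaf format, and it reads the header in host byte order for brevity.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LEAF_MAGIC 0x1eaf   /* hypothetical magic number */
#define ENTRY_SIZE 8        /* hypothetical fixed entry size in bytes */

struct leaf_header {
    uint16_t magic;
    uint16_t count;         /* number of entries following the header */
};

/* Return 0 if the block looks like a sane leaf, -1 otherwise.
 * Nothing past the header is touched until the count is known to fit. */
static int leaf_check(const void *block, size_t blocksize)
{
    struct leaf_header head;

    if (blocksize < sizeof head)
        return -1;
    memcpy(&head, block, sizeof head);
    if (head.magic != LEAF_MAGIC)
        return -1;
    if (head.count > (blocksize - sizeof head) / ENTRY_SIZE)
        return -1;
    return 0;
}
```

A block of zeros or of `0xdd` poison fails the magic test immediately, which is the case the truncate loop above ran into.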
2008-09-09 01:12 encode_xattrs is functional, now for decode 2008-09-09 01:14 flips: you can pull fuse utime support + dealf bug test from me 2008-09-09 01:14 ok, need a place to decode the xattrs to 2008-09-09 01:14 hmm 2008-09-09 01:14 kay 2008-09-09 01:14 no fix yet, been dicking with the mtime issue 2008-09-09 01:14 you just want to send your cruft over? 2008-09-09 01:14 oh 2008-09-09 01:14 utime 2008-09-09 01:14 its annoying me that everything is 1969 2008-09-09 01:15 epoch 2008-09-09 01:15 we need the logic to make the timestamps default to eachother 2008-09-09 01:15 but thats not even the issue 2008-09-09 01:15 even the ctime is broken 2008-09-09 01:15 have you thought about setting up a dedicated repo for me to pull from, into which you only pull one head? 2008-09-09 01:15 or do you want your dicking around recorded for posterity ;-) 2008-09-09 01:16 theres only one head in my repo 2008-09-09 01:16 because you merged 2008-09-09 01:16 but all your merges show up over here 2008-09-09 01:16 ah, annoying 2008-09-09 01:16 why is hg such a pita 2008-09-09 01:16 unless you do that version specific pull 2008-09-09 01:16 this is supposed to just work 2008-09-09 01:16 git is identical 2008-09-09 01:17 well its not really wrong 2008-09-09 01:17 cant you cherry pick my change? 2008-09-09 01:17 yes 2008-09-09 01:17 but its time consuming 2008-09-09 01:17 should be oneliner 2008-09-09 01:17 much more efficient for you to just pull your time to a dedicated repo. that can be automatic 2008-09-09 01:17 pull your tip 2008-09-09 01:17 well 2008-09-09 01:18 not much difference 2008-09-09 01:18 try hg view 2008-09-09 01:18 you'll see all the extra stuff 2008-09-09 01:20 shapor, the times are all broken because they are never set 2008-09-09 01:20 you can see where to set them in inode.c 2008-09-09 01:20 there's a gettime available I think 2008-09-09 01:20 http://www.selenic.com/mercurial/wiki/index.cgi/CommunicatingChanges#line-63 2008-09-09 01:20 would that be easier? 
2008-09-09 01:21 doesn't decode_attrs set them in struct inode? 2008-09-09 01:21 gets called in open_inode 2008-09-09 01:22 they never get set to an actual time 2008-09-09 01:23 they should now with my utime code though 2008-09-09 01:23 not sure why its not working 2008-09-09 01:23 hrm 2008-09-09 01:23 too many interfaces 2008-09-09 01:24 bah 2008-09-09 01:24 way too many 2008-09-09 01:24 inode->i_mtime = inode->i_ctime = inode->i_atime = iattr->mtime; ><- this is where to set the time 2008-09-09 01:24 in inode.c 2008-09-09 01:25 I thought some of the times were stored differently (lower resolution) 2008-09-09 01:25 brings up the question: what is the format of a tux3 time? 2008-09-09 01:25 yeah 2008-09-09 01:25 yes 2008-09-09 01:25 so, I'd like to get away from the traditional decimal encoding 2008-09-09 01:25 were the fraction is millionths or billionths 2008-09-09 01:25 and use strictly binary 2008-09-09 01:26 that means multiplying and dividing to convert to the braindamaged linux format 2008-09-09 01:26 not sure whether this is wise 2008-09-09 01:27 hrm 202 u64 i_size, i_mtime, i_ctime, i_atime; 2008-09-09 01:27 u64? 2008-09-09 01:28 that's just in the inode, the memory version 2008-09-09 01:28 on disk it gets sqzed, maybe 2008-09-09 01:28 yeah but still 2008-09-09 01:28 whats the point of u64 in memory even? 2008-09-09 01:28 u32 is too small 2008-09-09 01:28 and c doesn't play well with others 2008-09-09 01:28 so we have our own format? 
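The "strictly binary" time idea above, and the multiply and divide it costs when converting to seconds plus nanoseconds, can be sketched as follows. The 32.32 fixed-point split is an assumption for illustration; the log does not say which layout tux3 settled on.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Pack seconds and nanoseconds into 32.32 binary fixed point.
 * The caller must pass nsec < NSEC_PER_SEC. */
static uint64_t time_to_fixed(uint32_t sec, uint32_t nsec)
{
    uint64_t frac = ((uint64_t)nsec << 32) / NSEC_PER_SEC;

    return ((uint64_t)sec << 32) | frac;
}

/* Whole seconds are simply the high 32 bits. */
static uint32_t fixed_to_sec(uint64_t t)
{
    return (uint32_t)(t >> 32);
}

/* Scale the binary fraction back to nanoseconds (rounds down). */
static uint32_t fixed_to_nsec(uint64_t t)
{
    return (uint32_t)(((t & 0xffffffffULL) * NSEC_PER_SEC) >> 32);
}
```

Because both directions round down, a round trip can come back one nanosecond low for values that are not exact binary fractions.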
2008-09-09 01:29 we have to use the linux format there 2008-09-09 01:29 when we got to kernel 2008-09-09 01:29 hm 2008-09-09 01:29 because those are in the generic part of the inode 2008-09-09 01:29 i'm getting tired 2008-09-09 01:29 I haven't divided up the inode into generic and filesystem specific parts yet 2008-09-09 01:29 i think my changes are crap, might not want to merge them 2008-09-09 01:29 they dont add much 2008-09-09 01:29 other than sort-of working touch 2008-09-09 01:29 sleep on it 2008-09-09 01:29 fly away 2008-09-09 01:29 hack onthe plane 2008-09-09 01:30 see how nicely encode_xattrs came out 2008-09-09 01:30 in iattr.c 2008-09-09 01:31 decode_xattrs is a little messier because we have to guess how big to make the cache 2008-09-09 01:31 I suppose I'd better run a size-guessing pass first to know the size for thecache 2008-09-09 02:30 -!- kbingham(~kbingham@193.132.141.186) has joined #tux3 2008-09-09 02:32 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-09 02:52 development is going pretty well 2008-09-09 03:02 I'd say 2008-09-09 04:42 konrad, your fuse post got linked from lwn.net: http://lwn.net/Articles/297308/ 2008-09-09 04:42 jesus 2008-09-09 04:43 night 2008-09-09 04:43 night 2008-09-09 04:55 where'd that get linked from? 2008-09-09 06:31 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-09 12:17 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-09 12:41 nice 2008-09-09 14:22 ACTION makes another cuppa before pulling the trigger on the "one less feature" psot 2008-09-09 14:22 post 2008-09-09 14:41 flips: by depending so much on LVM, aren't you limiting the OSs on top of tux3 can be run? What if I want to use tux3 with Solaris, FreeBSD or HP-UX? 
2008-09-09 14:42 by "limiting the OSs" I mean "it will require a lot of work to make tux3 work with non-Linux OSs" 2008-09-09 14:43 they will have a lot of work anyway 2008-09-09 14:43 gpl license is incompatible with all those 2008-09-09 14:43 but if tux3 catches on in linux then they will just have to do it 2008-09-09 14:44 pgquiles, you mean to say you already read the one less feature post? 2008-09-09 14:46 ACTION thinks about making it an early sk8 day 2008-09-09 14:46 flips: yes 2008-09-09 14:47 anyway, this isn't more than people already depend on lvm 2008-09-09 14:47 ah, ok 2008-09-09 14:48 what I don't want to do is depend on an lvm built into a filesystem, maintained by a very small group of developers and used by only one application 2008-09-09 14:48 it's bad enough depending on my generic btree code ;-) 2008-09-09 14:49 :-) 2008-09-09 14:50 I'm really impressed at how much you do in so few lines of code 2008-09-09 14:50 easy when you leave out the error handling :-) 2008-09-09 14:50 well 2008-09-09 14:51 you make filesystem development sound easy, like "hey, this morning I feel like I'm going to write my filesystem" :-) 2008-09-09 14:51 trying to put some of that in too 2008-09-09 14:51 that's how I felt 6 weeks ago 2008-09-09 14:51 thought it could be prototyped in 2 weeks 2008-09-09 14:51 I was wrong 2008-09-09 14:51 looks like 8 weeks 2008-09-09 14:52 and that is only with the unexpected help of fuse 2008-09-09 14:52 and of course with the help of all the helpers 2008-09-09 14:52 are seem to be increasing exponentially 2008-09-09 14:53 who seem I mean 2008-09-09 14:54 timothy has a baby girl 2008-09-09 14:54 newborn? 2008-09-09 14:56 they are very nice while in the poop-machine age 2008-09-09 14:57 when teeth show... 
they cry all the time :-) 2008-09-09 14:58 newborn indeed 2008-09-09 14:58 it's worth the effort 2008-09-09 14:59 it is, indeed 2008-09-09 15:00 until they are 12 years old and start behaving like young terrorists :-D 2008-09-09 15:00 lkml is running slow today 2008-09-09 15:05 bedtime 2008-09-09 15:07 bye 2008-09-09 15:07 adios 2008-09-09 15:44 -!- kbingham(~kbingham@92.8.9.246) has joined #tux3 2008-09-09 15:44 finally showed up on lkml: http://lkml.org/lkml/2008/9/9/402 2008-09-09 15:45 "Tux3 Report: One less feature" 2008-09-09 16:15 new file 2008-09-09 16:15 xattr.c 2008-09-09 16:16 will skate first then make atom resolution actually work 2008-09-09 16:25 a quick q: is it ok to submit some patches to allow tux3 to compile on mac? :P 2008-09-09 16:34 if you can work out the license issues 2008-09-09 16:34 not sure how you'd do that 2008-09-09 16:34 compile under a linux emulator maybe 2008-09-09 16:36 I'm only interested in running tux3 in userspace 2008-09-09 16:36 does this conflict with any license? 2008-09-09 16:37 not that I know of 2008-09-09 16:38 good :D 2008-09-09 16:38 sounds like a cool project 2008-09-09 16:39 the fuse lib you link with will need to be gpl 2008-09-09 16:39 or compatible 2008-09-09 16:39 gpl v3 compatible, which is a little easier than v2 2008-09-09 16:39 I don't want to link to fuse actually 2008-09-09 16:40 directly using kernel calls is traditionally ok 2008-09-09 16:40 you're probably using libc, right? 
2008-09-09 16:40 I think macos uses libc 2008-09-09 16:40 that's compatible 2008-09-09 16:40 true 2008-09-09 16:40 some things are not quite the same 2008-09-09 16:41 the libc is deeply embedded in mac 2008-09-09 16:41 I can imagine 2008-09-09 16:41 obviously, tux3 will really need forks now ;-) 2008-09-09 16:41 tux3 xattrs are intended to be forks as well 2008-09-09 16:41 one cool thing I want to accomplish to be able to run an FTP server to expose the tux3 ;-) 2008-09-09 16:42 don't be in a big rush to do that unless you are great at debugging 2008-09-09 16:42 well... I am actually attempting this for some linux fs anyway 2008-09-09 16:43 doing it also for tux3 fits well :D 2008-09-09 16:43 ftp load is fairly forgiving 2008-09-09 16:43 read mostly 2008-09-09 16:44 ok, I'd better get my skate in 2008-09-09 16:44 promised to do a tux3 university session tonight at 8 2008-09-09 16:46 amazing how many more penis extension spams I get when I am actively posting about Tux3, I wonder if there is a relation 2008-09-09 16:46 8pm? (like in 15 minutes?) 2008-09-09 16:47 ohh... the other coast 2008-09-09 16:47 right 2008-09-09 16:48 great! we have time to go home and eat till then :P 2008-09-09 16:49 do we need to review some material before showing up? :D 2008-09-09 16:50 optional 2008-09-09 16:50 if you have time, read "understanding the linux kernel" and "linux device drivers" ;-) 2008-09-09 16:50 haha 2008-09-09 16:50 understanding is on my right, the ldd is at home :P 2008-09-09 16:51 I found the "understanding..." 
more useful so that's why I hauled it to school 2008-09-09 16:52 linux device drivers will grow on you 2008-09-09 16:52 this 3rd edition really put some weight 2008-09-09 16:52 it's the deeper of the two 2008-09-09 16:52 lets see which edition I have 2008-09-09 16:52 LDD is the first one I got in touch actually 2008-09-09 16:52 the first edition 2008-09-09 16:52 some years ago :D 2008-09-09 16:53 3rd, and I still find it on the superficial side 2008-09-09 16:53 for fs stuff or in general? 2008-09-09 16:57 in general 2008-09-09 16:57 vfs, vm, whatever 2008-09-09 16:58 :D 2008-09-09 17:31 damn... looks like 8pm your time is about 4am in uk :S 2008-09-09 17:33 i'll have to be a passive observer in the morning :) 2008-09-09 18:46 back 2008-09-09 18:46 what's wrong with being up at 4 a.m? 2008-09-09 19:11 mmm, a little sake and sushi to get focussed 2008-09-09 19:11 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2008-09-09 19:50 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-09 19:55 who's here and awake? 2008-09-09 19:56 I am ;-) 2008-09-09 19:56 that's a quorum 2008-09-09 19:56 mmm, sake and sushi... 2008-09-09 19:57 was good 2008-09-09 19:57 well 2008-09-09 19:57 I better heat up another one 2008-09-09 19:58 ACTION is also awake 2008-09-09 19:59 got a browser ready? 2008-09-09 20:00 always... 2008-09-09 20:00 lxr? :P 2008-09-09 20:00 http://lxr.linux.no/linux <- ok, open this 2008-09-09 20:00 of course 2008-09-09 20:00 as expected 2008-09-09 20:00 -!- RalucaME(~ral@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-09 20:00 -!- nataliep(~nataliep@72.14.224.1) has joined #tux3 2008-09-09 20:00 hi natalie 2008-09-09 20:00 hi dan 2008-09-09 20:01 max, have you met natalie? 2008-09-09 20:01 maze? 2008-09-09 20:01 flips: http://lxr.linux.no/linux <- ok, open this 2008-09-09 20:01 hmm? 
2008-09-09 20:01 http://lxr.linux.no/linux+v2.6.26.5/fs/open.c#L1106 <- let's start with sys_open 2008-09-09 20:02 everybody see it? 2008-09-09 20:02 yes - we should perhaps first ask though how many people are listnening 2008-09-09 20:02 ACTION nods 2008-09-09 20:02 ACTION nods too 2008-09-09 20:02 3 is fine with me 2008-09-09 20:02 ACTION nods sagely 2008-09-09 20:02 nods :) 2008-09-09 20:03 it's logged anyway 2008-09-09 20:03 true 2008-09-09 20:03 ok, every syscall in linux starts with sys_ 2008-09-09 20:03 and continues with the name you get from man 2008-09-09 20:03 so man 2 open 2008-09-09 20:03 all it does is a little linkage 2008-09-09 20:04 then real action starts in do_sys_open 2008-09-09 20:04 so lets go there by clicking on it 2008-09-09 20:04 and click on the Function link 2008-09-09 20:04 http://lxr.linux.no/linux+v2.6.26.5/fs/open.c#L1084 2008-09-09 20:04 why isn't sys_open isn't just a call to sys_openat(AT_FDCWD, ...)? 2008-09-09 20:04 we're still in the same file 2008-09-09 20:04 good question 2008-09-09 20:05 ask al viro and add some epithet on the end ;-) 2008-09-09 20:05 can sys_* never call sys_* ? 2008-09-09 20:05 or is this something that could be cleaned up? 2008-09-09 20:05 syscalls use a weird linkage 2008-09-09 20:05 gcc and do it, but its odd 2008-09-09 20:05 can do it 2008-09-09 20:05 sys_creat calls sys_open 2008-09-09 20:05 so it could probably be replaced 2008-09-09 20:06 I open call sys_ functions from deep in kernel 2008-09-09 20:06 often 2008-09-09 20:06 let me recant 2008-09-09 20:06 syscalls _sometimes_ use a weird linkage 2008-09-09 20:06 so yes you could nest them 2008-09-09 20:06 al doesn't for no reason I know 2008-09-09 20:07 it's like "yuck, is a nasty top level entry point" 2008-09-09 20:07 gcc and do it, but its odd - what did you mean? 
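On the `sys_open` versus `sys_openat(AT_FDCWD, ...)` question above: from userspace the two calls are documented to resolve a relative path the same way. The sketch below only demonstrates that visible equivalence; it says nothing about how the kernel factors the two entry points internally. The function name is made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return 1 if open() and openat(AT_FDCWD, ...) reach the same inode,
 * 0 if they do not, -1 if either call fails. */
static int same_file_via_open_and_openat(const char *path)
{
    struct stat a, b;
    int result = -1;
    int fd1 = open(path, O_RDONLY);
    int fd2 = openat(AT_FDCWD, path, O_RDONLY);

    if (fd1 >= 0 && fd2 >= 0 && !fstat(fd1, &a) && !fstat(fd2, &b))
        result = a.st_dev == b.st_dev && a.st_ino == b.st_ino;
    if (fd1 >= 0)
        close(fd1);
    if (fd2 >= 0)
        close(fd2);
    return result;
}
```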
2008-09-09 20:07 I was rambling 2008-09-09 20:07 ;-) 2008-09-09 20:07 the weird stuff happens before we even get there 2008-09-09 20:07 in the syscall table 2008-09-09 20:08 so by the time we hit sys_* we're in pure C land? 2008-09-09 20:08 #ok, so we're in a different address space than we were a nanosceond 2008-09-09 20:08 yes 2008-09-09 20:08 usually 2008-09-09 20:08 some syscalls have strange register linkage 2008-09-09 20:08 anyway the vfs doesn't much care about that 2008-09-09 20:08 it gets away from syscall land as soon as it can 2008-09-09 20:09 what we are going to see, is a lot of messing around with user addresses 2008-09-09 20:09 because a nanosecond ago or so, we were in processor ring 3 2008-09-09 20:09 userspace 2008-09-09 20:09 now we're in ring 0 2008-09-09 20:09 different address space 2008-09-09 20:09 kind of 2008-09-09 20:10 different priveledge level 2008-09-09 20:10 that too 2008-09-09 20:10 everthing is a little different 2008-09-09 20:10 kind of like the twilight zone 2008-09-09 20:10 we're on the inside of the glass looking out now, like in that harry potter movie 2008-09-09 20:10 ok 2008-09-09 20:10 so we have to get the name for the open 2008-09-09 20:10 it's in a different address space 2008-09-09 20:11 so we do copy_from_user to get it 2008-09-09 20:11 just looking for the def of getname 2008-09-09 20:11 it's not very interesting actually 2008-09-09 20:11 it stores the name on a full page of kernel memory 2008-09-09 20:12 or it used to 2008-09-09 20:12 now I see we use a kmem cache for it 2008-09-09 20:12 http://lxr.linux.no/linux+v2.6.26.5/fs/namei.c#L141 2008-09-09 20:12 http://lxr.linux.no/linux+v2.6.26.5/include/linux/fs.h#L1615 2008-09-09 20:12 thanks 2008-09-09 20:12 and an audit hook 2008-09-09 20:13 things change around in here fairly frequently 2008-09-09 20:13 it's usually worth starting from the top in lxr every time 2008-09-09 20:13 just so you can check for details that changed 2008-09-09 20:13 by the top you mean all the 
way from sys_open or some other top? 2008-09-09 20:13 that audit thingy is new 2008-09-09 20:13 right 2008-09-09 20:14 like I said, getname is boring 2008-09-09 20:14 let's go back to do_sys_open 2008-09-09 20:14 perhaps, another question: how much of a fs is driven by userspace triggered syscalls? 2008-09-09 20:14 nearly all of it 2008-09-09 20:14 particularly for traditional fs's 2008-09-09 20:14 new ones tend to have some daemons helping 2008-09-09 20:15 generally, the more daemons, the less reliable 2008-09-09 20:15 which are effectively kernel threads doing syscalls? 2008-09-09 20:15 not doing syscalls 2008-09-09 20:15 using internal interfaces 2008-09-09 20:15 using syscalls internally sucks, because of being in the wrong address space 2008-09-09 20:15 the syscall expects to get its data from userspace 2008-09-09 20:15 oh, right the copy_from_user stuff 2008-09-09 20:15 right 2008-09-09 20:16 anyway if you're using syscalls internally, something in linux is broken 2008-09-09 20:16 or you're stupid 2008-09-09 20:16 heh 2008-09-09 20:16 about 50/50 2008-09-09 20:16 the next interesting place is do_filp_open 2008-09-09 20:17 lxr is a little funky indexing some of these 2008-09-09 20:17 would do_filp_open be the kernel-internal interface to open? 2008-09-09 20:18 factoring is a little arbitrary 2008-09-09 20:18 [ie. would this be what you would call from above mentioned kernel threads/daemons?]
2008-09-09 20:18 it's another helper that happens to do almost all the work 2008-09-09 20:18 yes you can 2008-09-09 20:18 if its not static, then something is using it 2008-09-09 20:18 often something bogus 2008-09-09 20:18 something external that is 2008-09-09 20:18 not statie = part of kernel api 2008-09-09 20:19 often unwisely ;-) 2008-09-09 20:19 http://lxr.linux.no/linux+v2.6.26.5/fs/namei.c#L1761 2008-09-09 20:19 I wish lxr was smarter about finding defs of extern functions 2008-09-09 20:19 I went to usage, then to the first reference onthe list 2008-09-09 20:19 now things are happening 2008-09-09 20:20 [actually filp_open seems to be the kernel-interface, not that it much matters] 2008-09-09 20:20 that is true 2008-09-09 20:20 see "arbitrary factoring" above 2008-09-09 20:20 it's kind of a pile is some ways ;-) 2008-09-09 20:20 in other ways it's beautiful 2008-09-09 20:20 only about 3 of those ;-) 2008-09-09 20:21 we'll get to some scary code now 2008-09-09 20:21 do_filp_open is pretty big... 2008-09-09 20:21 path_lookup_open 2008-09-09 20:22 it;'s big because it's implementing all of unix semantics + all of linux semantics + historical cruft + arcane voodooism nobody is quite sure about 2008-09-09 20:22 http://lxr.linux.no/linux+v2.6.26.5/fs/namei.c#L1238 2008-09-09 20:23 so the vfs layer does permissions checking... not the fs itself? 2008-09-09 20:23 we're going to stay away from path lookup to avoid brain damage 2008-09-09 20:23 that is correct 2008-09-09 20:23 the vfs checks permissions and does a lot of locking too 2008-09-09 20:23 also implements the namespace caching 2008-09-09 20:23 it does a huge amount of work 2008-09-09 20:24 what's namespace caching? 2008-09-09 20:24 dentry cache 2008-09-09 20:24 every time you open a file, linux creates a dentry for the name 2008-09-09 20:24 that lives in cache 2008-09-09 20:24 dentry points at inode 2008-09-09 20:24 so what does the dentry cache map between? 
2008-09-09 20:24 dentries are pretty big, inodes are pretty big memory structures too 2008-09-09 20:25 filename and inode (possibly lack of inode)? 2008-09-09 20:25 the dentry maps filename -> inode 2008-09-09 20:25 in cache 2008-09-09 20:25 does filename include full path? 2008-09-09 20:25 only when tere is a miss in the dentry cache does the vfs go to the filesystem 2008-09-09 20:25 no, not the full path 2008-09-09 20:25 Important sizes: 2008-09-09 20:25 block 1024 2008-09-09 20:25 inode 300 2008-09-09 20:25 dentry 128 2008-09-09 20:25 bh 56 2008-09-09 20:25 kmem_cache 12 2008-09-09 20:25 the parent inode and the filename 2008-09-09 20:25 relative to the fs root? 2008-09-09 20:25 ah 2008-09-09 20:25 razvanm, nice 2008-09-09 20:26 flips might want to correct me ;-) 2008-09-09 20:26 so every time you open a file, you get a dentry+inode+file 2008-09-09 20:26 already a lot of cache memory 2008-09-09 20:26 for a tiny thing maybe 2008-09-09 20:26 with 6 btyes in it echo hello >foo 2008-09-09 20:26 oh thats why it gets pretty big 2008-09-09 20:27 it's only the beginning 2008-09-09 20:27 among other slabs 2008-09-09 20:27 you also get an "address_space" for the inode 2008-09-09 20:27 misnamed 2008-09-09 20:27 that is the radix tree 2008-09-09 20:27 so if the file is opened, the dentry + inode are locked in cache? 2008-09-09 20:27 yes 2008-09-09 20:27 and the whole chain of parents 2008-09-09 20:27 up to the superblock of the fs 2008-09-09 20:27 not locked 2008-09-09 20:28 they can be evicted 2008-09-09 20:28 usage count increased? 2008-09-09 20:28 only until the inode goes away 2008-09-09 20:28 sorry 2008-09-09 20:28 you've lost me then. 2008-09-09 20:28 the inode's use count is elevated 2008-09-09 20:28 until the dentry goes away 2008-09-09 20:28 it's about the nastiest part of the whole vfs 2008-09-09 20:28 and we're here already 2008-09-09 20:29 so dentries can come and go as they please? 
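The mapping described above, (parent inode, filename) to inode with no full path stored, can be modelled with a small hash table. This is a userspace toy with invented names, not the kernel's dcache; it leaves out the superblock part of the key, reference counts and eviction.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64

struct toy_dentry {
    uint64_t parent_ino;        /* inode number of the containing directory */
    uint64_t ino;               /* inode number this name resolves to */
    char name[32];              /* one path component, never a full path */
    struct toy_dentry *next;    /* hash chain */
};

static struct toy_dentry *buckets[NBUCKETS];

static unsigned toy_hash(uint64_t parent_ino, const char *name)
{
    unsigned h = (unsigned)parent_ino * 31;

    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return h % NBUCKETS;
}

/* Cache the fact that `name` inside directory `parent_ino` is `ino`. */
static int toy_dcache_add(uint64_t parent_ino, const char *name, uint64_t ino)
{
    struct toy_dentry *d;
    unsigned h;

    if (strlen(name) >= sizeof d->name)
        return -1;
    d = malloc(sizeof *d);
    if (!d)
        return -1;
    d->parent_ino = parent_ino;
    d->ino = ino;
    strcpy(d->name, name);
    h = toy_hash(parent_ino, name);
    d->next = buckets[h];
    buckets[h] = d;
    return 0;
}

/* Return the cached entry, or NULL on a miss (where a real VFS would
 * go and ask the filesystem). */
static struct toy_dentry *toy_dcache_lookup(uint64_t parent_ino, const char *name)
{
    struct toy_dentry *d = buckets[toy_hash(parent_ino, name)];

    for (; d; d = d->next)
        if (d->parent_ino == parent_ino && !strcmp(d->name, name))
            return d;
    return NULL;
}
```

Resolving a path like `usr/bin` is then two lookups: `usr` under the root inode, then `bin` under whatever inode `usr` resolved to.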
2008-09-09 20:29 what happens is, dentries spend a lot of their life sitting around in cache with zero use count 2008-09-09 20:29 that's what happens if you open a file, do something, and close it 2008-09-09 20:29 $ cat /proc/slabinfo | egrep 'dentry|#' 2008-09-09 20:29 # name : tunables : slabdata 2008-09-09 20:29 dentry 253015 253576 132 29 1 : tunables 120 60 8 : slabdata 8744 8744 0 2008-09-09 20:29 only when the vm comes along and tries to shrink the caches to recover memory do the dentries and inodes go away 2008-09-09 20:29 note 132 2008-09-09 20:30 yes, something ahs been pushing stuff out 2008-09-09 20:30 it changes from time to time 2008-09-09 20:30 these days, linux pushes too much cache out at the wrong times 2008-09-09 20:30 you will notice that if you run on a slow machine 2008-09-09 20:30 see, this is the real vfs course ;-) 2008-09-09 20:31 ok, lifetime of objects in cache is one of the biggest touchy spots in linux 2008-09-09 20:31 it's often very hard to know what owns what 2008-09-09 20:31 there's no way to tell the vfs you're doing a bg filesystem scan and to not cache for eternity? 2008-09-09 20:31 and yet, you have to when you work on fs code 2008-09-09 20:32 there are various ways to tell it that 2008-09-09 20:32 good ways is another question 2008-09-09 20:32 we have the concept of hot and cold ends of lru list 2008-09-09 20:32 when something is gets accessed, it gets moved to the hot end 2008-09-09 20:32 and stuff is evicted from the cold end 2008-09-09 20:32 in theory 2008-09-09 20:32 in practice... 
well 2008-09-09 20:33 linux has been benchmarked as worse than random replacement policy 2008-09-09 20:33 somebody needs to go in and fix that 2008-09-09 20:33 so a queue basically 2008-09-09 20:33 the only way i'm aware of to inform the kernel about your intentions is posix_fadvise, and that doesnt let you do much 2008-09-09 20:33 a lru list, yes 2008-09-09 20:33 acts like a queue 2008-09-09 20:33 certainly nothing related to dentries 2008-09-09 20:33 old stuff is supposed to move down to the cold end and get evicted 2008-09-09 20:34 shapor, though you were on a plane 2008-09-09 20:34 just a sec 2008-09-09 20:34 not yet 2008-09-09 20:34 back 2008-09-09 20:34 is there one global dentry cache? per cpu? per socket? per fs? per inode? 2008-09-09 20:34 ok, we're not doing vm 2008-09-09 20:34 this is vfs ;-) 2008-09-09 20:34 is global, right? :P 2008-09-09 20:34 there is one global dentry cache 2008-09-09 20:35 it is indexed by fs*dir*name 2008-09-09 20:35 so it acts like one per fs 2008-09-09 20:35 so it maps superblock:inode:filename -> inode? 2008-09-09 20:35 the only way i know of purging it is umount(), right flips? 2008-09-09 20:35 yes 2008-09-09 20:35 in general,yes 2008-09-09 20:35 there are internal interfaces for purging 2008-09-09 20:36 a fs has access to that 2008-09-09 20:36 but almost nobody understands how to use that or cares to find out 2008-09-09 20:36 if you get it wrong, al will bark at you 2008-09-09 20:36 aren't we wasting a lot of memory by continuously keeping the 'fs' in there? 
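The hot end and cold end of an LRU list mentioned above can be sketched with a circular doubly linked list. This is a generic textbook LRU with invented names, offered only to pin down the terminology; it is not the kernel's actual reclaim logic, which the discussion says behaves differently in practice.

```c
#include <stddef.h>

struct lru_node {
    int key;
    struct lru_node *prev, *next;
};

/* head.next is the hot end, head.prev is the cold end. */
struct lru {
    struct lru_node head;
};

static void lru_init(struct lru *l)
{
    l->head.prev = l->head.next = &l->head;
}

static void lru_unlink(struct lru_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Insert a node at the hot end. */
static void lru_add_hot(struct lru *l, struct lru_node *n)
{
    n->next = l->head.next;
    n->prev = &l->head;
    l->head.next->prev = n;
    l->head.next = n;
}

/* An access moves the node back to the hot end. */
static void lru_touch(struct lru *l, struct lru_node *n)
{
    lru_unlink(n);
    lru_add_hot(l, n);
}

/* Remove and return the coldest node, or NULL if the list is empty. */
static struct lru_node *lru_evict(struct lru *l)
{
    struct lru_node *n = l->head.prev;

    if (n == &l->head)
        return NULL;
    lru_unlink(n);
    return n;
}
```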
most systems don't have that many mounted filesystems 2008-09-09 20:36 we waste huge buckets of memory 2008-09-09 20:37 yes, linux is a little special in this regard 2008-09-09 20:37 dentry cache is a linux only thing 2008-09-09 20:37 it gives a performance advantage in general 2008-09-09 20:37 but it uses massive gobs of memory 2008-09-09 20:37 it's tricky 2008-09-09 20:38 you can always print out the path a file was opened by 2008-09-09 20:38 other OSs don't cache the dentries? 2008-09-09 20:38 by following parent links in the dentry cache 2008-09-09 20:38 I don't think other os's have dentries? 2008-09-09 20:38 I'm not _that_ familiar with bsd etc 2008-09-09 20:38 but I think not 2008-09-09 20:38 earlier you'd said the dentries could be evicted? 2008-09-09 20:38 the above gets interesting when the namespace topology is changing while you follow links 2008-09-09 20:39 they can 2008-09-09 20:39 let's go find the dentry cache 2008-09-09 20:39 so how does that match up with being able to follow the parent links in the dentry cache? 2008-09-09 20:39 2008-09-09 20:40 hmm, no dentry.c 2008-09-09 20:40 http://lxr.linux.no/linux+v2.6.26.5/include/linux/dcache.h#L81 2008-09-09 20:40 there's a dcache.h 2008-09-09 20:41 ok, namei.c is the home of the dentry cache 2008-09-09 20:41 inconsistent naming 2008-09-09 20:42 struct dentry is defined in dcache.h though 2008-09-09 20:42 in general in linux, you want to be looking for "get" and "put" operations 2008-09-09 20:42 get means inc object count, put means dec 2008-09-09 20:42 strange terminology, made in linux I think 2008-09-09 20:43 dcache.c 2008-09-09 20:43 http://lxr.linux.no/linux+v2.6.26.5/fs/dcache.c 2008-09-09 20:43 so, dput 2008-09-09 20:43 http://lxr.linux.no/linux+v2.6.26.5/fs/dcache.c#L185 2008-09-09 20:44 big mess 2008-09-09 20:44 but you have to get familiar with it 2008-09-09 20:44 we also have iput, drop usage count of an inode 2008-09-09 20:45 is this too much down in the nitty gritty?
2008-09-09 20:45 nope 2008-09-09 20:45 speaking for myself only of course - but the nitty gritty is always what I failed to grasp 2008-09-09 20:45 figuring out how an inode gets released is challenging 2008-09-09 20:46 look at iput, then iput_final 2008-09-09 20:46 http://lxr.linux.no/linux+v2.6.26.5/fs/inode.c#L1149 2008-09-09 20:46 generally, if the fs does not want to take care of something, the vfs will do it for it 2008-09-09 20:46 this is the case in iput_final 2008-09-09 20:47 normally, inodes are dropped by generic_drop_inode 2008-09-09 20:47 there we see some classic unix 2008-09-09 20:48 the decision whether to delete an unlinked inode or not 2008-09-09 20:48 by this time, the dentry is long gone 2008-09-09 20:48 so is the directory entry, if i_nlink is zero 2008-09-09 20:48 we're nearly done for today 2008-09-09 20:49 I'm a little surpised by the fact than op can be null.. 2008-09-09 20:49 hour is coming up 2008-09-09 20:49 we still have 10 more minutes! :D 2008-09-09 20:49 the op can be null because the vfs does it in that case 2008-09-09 20:49 I'm going to answer questions for the next 10 minutes 2008-09-09 20:49 ACTION was about to ask about about deleting a directory 2008-09-09 20:50 so you than have a null op in the superblock operations and instead have it handled through such if statements all over the place? 2008-09-09 20:50 ok, let's go look a file_operations 2008-09-09 20:50 that seems like very non-OO 2008-09-09 20:50 maze, correy 2008-09-09 20:50 correct 2008-09-09 20:50 it's oo linux style 2008-09-09 20:50 very few linux hackers known any oo language 2008-09-09 20:50 is there a reason for that? that also seems worse performance wise... 
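The get/put convention and the "if the op is null, the vfs does it" pattern discussed above can be modelled in a few lines. This is a simplified userspace toy with invented names; the real `iput`, `iput_final` and `generic_drop_inode` also deal with locking, writeback and the inode lists, none of which is shown.

```c
#include <stddef.h>

struct toy_inode;

/* Per-filesystem hooks; a NULL member means "use the generic default". */
struct toy_super_ops {
    void (*drop_inode)(struct toy_inode *inode);
};

enum toy_state { TOY_IN_USE, TOY_CACHED, TOY_DELETED };

struct toy_inode {
    int count;                  /* use count, raised by get, lowered by put */
    int nlink;                  /* directory entries pointing at this inode */
    enum toy_state state;
    const struct toy_super_ops *ops;
};

/* Generic fallback: an inode with no links left is deleted, otherwise
 * it just sits in cache with a zero use count. */
static void toy_generic_drop(struct toy_inode *inode)
{
    inode->state = inode->nlink ? TOY_CACHED : TOY_DELETED;
}

/* "get" raises the use count. */
static void toy_iget(struct toy_inode *inode)
{
    inode->count++;
    inode->state = TOY_IN_USE;
}

/* "put" lowers the use count; the last put decides the inode's fate. */
static void toy_iput(struct toy_inode *inode)
{
    if (--inode->count)
        return;
    if (inode->ops && inode->ops->drop_inode)
        inode->ops->drop_inode(inode);
    else
        toy_generic_drop(inode);
}
```

The `if` in `toy_iput` is the non-OO dispatch being questioned above: rather than every filesystem installing a default method, the caller checks for NULL and falls back itself.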
2008-09-09 20:51 since we then have the if instead of just calling the method 2008-09-09 20:51 it doesn't cost much cpu 2008-09-09 20:51 it's sloppy 2008-09-09 20:51 and looks ugly 2008-09-09 20:51 and is inconsistent 2008-09-09 20:51 it's another branch that can be mispreditcted though 2008-09-09 20:51 every operation has its own custom way of doing things, usually 2008-09-09 20:51 if the branch matters, we tell the compiler not to mispredict 2008-09-09 20:52 ugh... 2008-09-09 20:52 see "likely/uinlikely" 2008-09-09 20:52 unlikely 2008-09-09 20:52 yeah, I know 2008-09-09 20:52 the inefficiencies here are somewhat covered up by the fact that there are slow disks underneath 2008-09-09 20:52 and then, it's not really inefficient 2008-09-09 20:53 the stuff that _can_ cost lots of cpu has been profiled and fixed long ago 2008-09-09 20:53 ok, so it's just disgusting and extra code complexity ;-) 2008-09-09 20:53 these days, it costs a lot more to contend a spinlock than mispredict a branch 2008-09-09 20:53 yes, it's fairly disgusting 2008-09-09 20:53 one never learns to love it ;-) 2008-09-09 20:53 respect it, yes 2008-09-09 20:54 it does a lot, has a huge amount of flexibility 2008-09-09 20:54 ok, there was a question 2008-09-09 20:54 let's go look at how ext2 deletes a directory 2008-09-09 20:54 is stuff like this not fixable? 2008-09-09 20:54 right, delete a directory. 2008-09-09 20:54 the right person could fix it 2008-09-09 20:54 you have to have memorized stevens 2008-09-09 20:55 and you have to like fighting in pig shit 2008-09-09 20:55 do you llike fighting in pig shit? 2008-09-09 20:55 because you have some of the other qualifications ;-) 2008-09-09 20:56 I have a tendency to fight uphill battles, yes. 
2008-09-09 20:56 http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/namei.c#L275 <- ext2_rmdir 2008-09-09 20:56 pretty easy to read 2008-09-09 20:56 and write for that matter 2008-09-09 20:56 I didn't say uphill ;-) 2008-09-09 20:56 it's not a hill 2008-09-09 20:57 it's a ditch at the bottom of the farm 2008-09-09 20:57 flips: ack 2008-09-09 20:57 ah, but sh*t flows downhill, and if you're at the bottom 2008-09-09 20:57 stevens - which book is that referring to? 2008-09-09 20:57 when I asked the question I did not remember that the OS will refuse to delete a non-empty dir :P 2008-09-09 20:57 ext2_rmdir is plugged into this thing: http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/namei.c#L376 2008-09-09 20:57 and *_operations structure 2008-09-09 20:58 passes for an instance of a class in linux 2008-09-09 20:58 Advanced Programming in the UNIX Environment, Addison-Wesley, 1992. 2008-09-09 20:59 ok, now that we have found what ext2_rmdir is plugged into, we can follow it back up into the vfs 2008-09-09 20:59 clock on ext2_dir_inode_operations 2008-09-09 21:00 sorry 2008-09-09 21:00 clock on inode_operations 2008-09-09 21:00 http://lxr.linux.no/linux+v2.6.26.5/include/linux/fs.h#L1250 2008-09-09 21:00 then usage 2008-09-09 21:01 lxr is spinning 2008-09-09 21:01 this is the slowest operations... 2008-09-09 21:01 yes, and 3 doing it at the same time is enough to bring it to its knees 2008-09-09 21:01 apparently 2008-09-09 21:02 actually, I would say is as slow as usual 2008-09-09 21:02 as you can see, this is a popular struct 2008-09-09 21:02 true 2008-09-09 21:02 your are looking for the instances that are _not_ in a specific filesystem 2008-09-09 21:03 fs/namei.c, line 2971 2008-09-09 21:03 for example 2008-09-09 21:03 whoops, not interesting 2008-09-09 21:03 bad_inode.c inode.c libfs.c 2008-09-09 21:03 razvanm probably had the right idea 2008-09-09 21:04 yes, inode.c is good 2008-09-09 21:04 http://lxr.linux.no/linux+v2.6.26.5/fs/inode.c#L114 2008-09-09 21:05 uhh... 
we're 5 minutes over time :P 2008-09-09 21:05 ACTION says big thanks! 2008-09-09 21:05 ;-) 2008-09-09 21:05 yep 2008-09-09 21:05 so homework: 2008-09-09 21:05 find out were inode_operations->rmdir is called 2008-09-09 21:06 it isn't spelled that way 2008-09-09 21:06 this is what makes linux fun ;-) 2008-09-09 21:06 very little is spelled the way you would expect 2008-09-09 21:06 :D 2008-09-09 21:06 ok, did we have fun today? 2008-09-09 21:06 that was awesome, too short :) thanks to all... i love the format of this class :) 2008-09-09 21:06 ACTION did had fun :-) 2008-09-09 21:07 thanks natalie :-) 2008-09-09 21:07 thanx flips, was cool 2008-09-09 21:07 the most important item is how to navigate lxr 2008-09-09 21:07 welcome, ralucame 2008-09-09 21:07 ACTION is RalucaME's twin ;-) 2008-09-09 21:07 :) 2008-09-09 21:07 where it's called from outside of fs'es? or within? 2008-09-09 21:08 way too short - agreed. 2008-09-09 21:08 at this rate we'll need more than a few of these ;-) 2008-09-09 21:08 Thanks! 2008-09-09 21:08 aha 2008-09-09 21:08 from the vfs 2008-09-09 21:08 that is, outside the fs 2008-09-09 21:08 things eventually start to fit a pattern 2008-09-09 21:09 and you don't need me to suggest how to follow the twisty paths any more 2008-09-09 21:09 at first it looks like random gibberish 2008-09-09 21:09 then later, you learn it is actually random gibberish 2008-09-09 21:09 :-) 2008-09-09 21:09 but it is fast and flexible gibberish 2008-09-09 21:09 flips: i_op->rmdir ? 
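The guess just above points at the usual shape of things: the filesystem plugs its rmdir into an operations table, and generic code calls it through the inode's i_op pointer. Below is a small user-space model of that dispatch, assuming made-up toy_ types. The -EPERM result for a missing method and the -ENOTEMPTY refusal are meant to mirror the behaviour described in the lesson, not to reproduce fs/namei.c.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_dir;

/* Toy stand-in for inode_operations: one slot per directory method. */
struct toy_inode_ops {
	int (*rmdir)(struct toy_dir *parent, struct toy_dir *victim);
};

struct toy_dir {
	const struct toy_inode_ops *i_op; /* method table, as in inode->i_op */
	unsigned entries;                 /* entries other than "." and ".." */
	int removed;                      /* set once the directory is gone */
};

/* Filesystem side: refuses to remove a directory that still has entries. */
static int toyfs_rmdir(struct toy_dir *parent, struct toy_dir *victim)
{
	if (victim->entries)
		return -ENOTEMPTY;
	victim->removed = 1;
	parent->entries--;
	return 0;
}

/* The "instance of a class": a filled-in operations structure. */
static const struct toy_inode_ops toyfs_dir_ops = { .rmdir = toyfs_rmdir };

/* Generic side: knows nothing about the filesystem, only calls through i_op. */
static int toy_vfs_rmdir(struct toy_dir *parent, struct toy_dir *victim)
{
	assert(parent && victim);
	if (!parent->i_op || !parent->i_op->rmdir)
		return -EPERM;  /* this directory type has no rmdir method */
	return parent->i_op->rmdir(parent, victim);
}
```

Searching for the call site by the struct name (inode_operations) finds nothing. Searching for the field name (i_op->rmdir) finds it, which is the point of the homework.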
2008-09-09 21:10 sounds good 2008-09-09 21:10 ups ;-) 2008-09-09 21:10 aww 2008-09-09 21:10 I personally always spell my ops "ops" 2008-09-09 21:10 makes it much easier to navitage 2008-09-09 21:10 so you look for ops->rmdir and you always find it 2008-09-09 21:11 the code was "inode->i_op = &empty_iops;" in alloc_inode 2008-09-09 21:11 boring 2008-09-09 21:11 haven't found the real one nwo 2008-09-09 21:11 yet I mean 2008-09-09 21:12 http://lxr.linux.no/linux+v2.6.26.5/fs/namei.c#L2256 2008-09-09 21:13 that's it 2008-09-09 21:13 gold star 2008-09-09 21:14 :P 2008-09-09 21:16 oh, lol 2008-09-09 21:16 more homework? 2008-09-09 21:16 and I just sent it via pm 2008-09-09 21:16 figure out how a struct inode gets deleted ;-) 2008-09-09 21:17 ACTION will do pm next time 2008-09-09 21:17 wild guess: iput gets called? 2008-09-09 21:17 thursday again ok? 2008-09-09 21:17 at 8pm? 2008-09-09 21:17 maze, that's a good first order approximation 2008-09-09 21:17 yes 2008-09-09 21:17 sounds good 2008-09-09 21:18 see you then! 2008-09-09 21:19 cu 2008-09-09 21:41 -!- nataliep(~nataliep@72.14.224.1) has left #tux3 2008-09-09 22:09 -!- pranith(~ca4bcee2@66.90.73.223) has joined #tux3 2008-09-09 22:14 -!- pranith(~ca4bcee2@66.90.73.223) has joined #tux3 2008-09-09 22:15 hello 2008-09-09 22:15 anyone here? 2008-09-09 22:21 hi 2008-09-09 22:23 hi flips 2008-09-09 22:23 what's up? 2008-09-09 22:23 are u daniel phillips? 2008-09-09 22:23 yes 2008-09-09 22:24 ok, i mailed you yesterday about tux3 :) 2008-09-09 22:24 I remember 2008-09-09 22:24 welcome to #tux3 2008-09-09 22:24 thank you :) 2008-09-09 22:24 have you been reading the mailing list archives? 2008-09-09 22:24 hmm, not much actually 2008-09-09 22:25 that is a very good place to start 2008-09-09 22:25 im joining it now.. 2008-09-09 22:25 http://tux3.org/pipermail/tux3/ 2008-09-09 22:26 ok 2008-09-09 22:28 flips, any particular mail you want me to start with? 
2008-09-09 22:28 any order 2008-09-09 22:28 just poke around until you find one that interests yhou 2008-09-09 22:29 and follow the thread 2008-09-09 22:29 see what people are doing 2008-09-09 22:29 the fuse stuff is very interesting 2008-09-09 22:29 ok 2008-09-09 22:35 excellent ramen 2008-09-09 22:35 that japanese store wasn't kidding when they said it was "a little spicy" 2008-09-09 22:36 hmm 2008-09-09 22:46 flips, do u think i need to have some background about file systems to start with? 2008-09-09 22:46 always helpful 2008-09-09 22:46 as i mentioned, i've just recently started going throught the design book... 2008-09-09 22:46 there is a lot written on it 2008-09-09 22:47 a linux specific book would be good too 2008-09-09 22:47 I never read the beos book 2008-09-09 22:47 or any book on filesystem design ;-) 2008-09-09 22:47 i couldn't find any linux specific os book :( 2008-09-09 22:47 filesystem* 2008-09-09 22:47 "understanding the linux kernel" 2008-09-09 22:47 probably best for this 2008-09-09 22:48 for filesystems? 2008-09-09 22:48 wikipedia is good too 2008-09-09 22:48 yes 2008-09-09 22:48 ok 2008-09-09 22:48 http://www.yolinux.com/TUTORIALS/LinuxClustersAndFileSystems.html 2008-09-09 22:48 any concepts i need to pay particular attention to? 2008-09-09 22:48 vfs 2008-09-09 22:48 locking 2008-09-09 22:48 struct bio 2008-09-09 22:49 struct page 2008-09-09 22:49 struct inode 2008-09-09 22:49 struct dentry 2008-09-09 22:49 struct buffer_head 2008-09-09 22:50 http://en.wikipedia.org/wiki/Filesystem 2008-09-09 22:50 http://en.wikipedia.org/wiki/Ext2 2008-09-09 22:50 http://en.wikipedia.org/wiki/Ext3 2008-09-09 22:51 http://en.wikipedia.org/wiki/Journaling_file_system 2008-09-09 22:51 http://en.wikipedia.org/wiki/Comparison_of_file_systems 2008-09-09 22:51 http://en.wikipedia.org/wiki/ACID 2008-09-09 22:51 <- very important 2008-09-09 22:51 hmm, ACID? 
first time i'm hearing about this 2008-09-09 22:52 it is the most important concept of all 2008-09-09 22:52 ohk, i knew abt these. just never heard the term ACID :) 2008-09-09 22:53 knew in the sense, i heard abt them. not "know" knowing :) 2008-09-09 22:53 need to memorize those concepts 2008-09-09 22:53 ok 2008-09-09 23:14 feh I just got bitten by the stupid ext2 convention than zero inode means deleted entry 2008-09-09 23:14 what's the matter with having an inode numbered zero again? 2008-09-09 23:14 zero inode? 2008-09-09 23:14 inum = 0 2008-09-09 23:15 hmm 2008-09-09 23:15 that's what ext2 uses to determine a dirent is deleted 2008-09-09 23:15 a little better than DOS, which sets the first character of the filename to 'e' 2008-09-09 23:15 but not much 2008-09-09 23:16 why not rely on name_len = zero instead? 2008-09-09 23:16 dumb 2008-09-09 23:16 I might change that for tux3 2008-09-09 23:16 u better do :) 2008-09-09 23:17 can't call it ext2_create_entry any more then ;-) 2008-09-09 23:17 so far it's exactly compatible 2008-09-09 23:17 but this is annoying 2008-09-09 23:18 I think I will at least create an is_deleted macro 2008-09-09 23:18 everywhere it relies on inum = 0 2008-09-09 23:18 might sound stupid, but i dont know.. whats the need for a deleted dirent? 2008-09-09 23:18 so you can recover the space to use for some other filename 2008-09-09 23:19 isn't that recovered during deletion time? 2008-09-09 23:19 that's the code I'm working on 2008-09-09 23:20 oh! :) 2008-09-09 23:20 and inum is what? the number of inodes? 2008-09-09 23:22 static inline int is_deleted(ext2_dirent *dirent) 2008-09-09 23:22 { 2008-09-09 23:22 return !dirent->inum; 2008-09-09 23:22 } 2008-09-09 23:23 hmm 2008-09-09 23:23 thats nice 2008-09-09 23:45 -!- cdk(~chinmay@121.246.36.139) has joined #tux3 2008-09-09 23:46 can now find/create xattr atoms in the atom table 2008-09-09 23:46 enough for today 2008-09-09 23:48 sleeping? 2008-09-09 23:50 soon 2008-09-09 23:57 which place? 
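A minimal sketch of the "inum zero means deleted" convention from the 23:14 to 23:22 exchange, and of why a deleted dirent is worth keeping around: its slot can be reused for a later filename. The fixed-size toy_dirent is a simplification invented here. Real ext2 entries are variable length and the tux3 structure may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory entry; not the on-disk ext2 layout. */
struct toy_dirent {
	uint32_t inum;      /* 0 means "deleted", per the ext2 convention */
	uint8_t name_len;
	char name[15];
};

/* Same test as the is_deleted inline pasted at 23:22. */
static inline int toy_is_deleted(const struct toy_dirent *dirent)
{
	return !dirent->inum;
}

/* Mark an entry deleted; its slot stays in the block for later reuse. */
static void toy_delete_entry(struct toy_dirent *dirent)
{
	dirent->inum = 0;
}

/* Create an entry, reusing the first deleted slot. Returns the slot or NULL. */
static struct toy_dirent *toy_create_entry(struct toy_dirent *block, size_t count,
					   const char *name, uint32_t inum)
{
	size_t len = strlen(name);

	assert(inum != 0);  /* inode number 0 can never name a live file here */
	if (len > sizeof block->name)
		return NULL;
	for (size_t i = 0; i < count; i++) {
		if (toy_is_deleted(&block[i])) {
			block[i].inum = inum;
			block[i].name_len = (uint8_t)len;
			memcpy(block[i].name, name, len);
			return &block[i];
		}
	}
	return NULL;  /* no free slot in this block */
}
```

Wrapping the test in one helper is what makes it cheap to switch later to a different marker such as name_len == 0. Only the helper changes, not every caller.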
2008-09-09 23:59 santa monica, CA 2008-09-09 23:59 and you? 2008-09-10 00:09 new delhi, india 2008-09-10 00:10 what do you do? 2008-09-10 00:13 hey 2008-09-10 00:14 hi bh 2008-09-10 00:23 johns hopkins 2008-09-10 00:23 u 2008-09-10 00:24 wrong channel 2008-09-10 00:35 hmm 2008-09-10 00:35 u work at johns hopkins? 2008-09-10 00:37 no, in santa monica 2008-09-10 00:37 linux kernel hacker 2008-09-10 00:37 time to sleep 2008-09-10 00:37 see you later 2008-09-10 00:50 :) 2008-09-10 00:50 goodnite 2008-09-10 01:52 -!- kbingham(~kbingham@92.0.11.166) has joined #tux3 2008-09-10 03:13 -!- kbingham(~kbingham@193.132.141.186) has joined #tux3 2008-09-10 06:07 -!- stargazr5(~gauravstt@59.95.3.98) has joined #tux3 2008-09-10 07:02 -!- stargazr5(~gauravstt@59.95.35.187) has joined #tux3 2008-09-10 08:03 -!- RzM|Away(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-10 09:28 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-10 10:36 -!- Kirantpatil(~kiran@122.167.179.145) has joined #tux3 2008-09-10 10:37 -!- Kirantpatil(~kiran@122.167.179.145) has left #tux3 2008-09-10 11:19 -!- Bobby(~Bobby@122.160.64.177) has joined #tux3 2008-09-10 11:19 hello 2008-09-10 11:25 -!- Bobby(~Bobby@122.160.64.177) has joined #tux3 2008-09-10 11:51 -!- Bobby(~Bobby@122.160.64.177) has joined #tux3 2008-09-10 12:19 morning 2008-09-10 12:35 reading the lesson from last night 2008-09-10 12:35 test of fire 2008-09-10 12:49 int set_xattr(struct inode *inode, char *name, unsigned len, void *data, unsigned size) 2008-09-10 12:49 { 2008-09-10 12:49 atom_t atom = get_atom(inode->sb->atable, name, len); 2008-09-10 12:49 return xcache_update(inode, atom, data, len); 2008-09-10 12:49 } 2008-09-10 12:49 short and sweet? 2008-09-10 12:51 yes 2008-09-10 12:52 should that be xcache_update(inode, atom, data, size)? 
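To make the question above concrete, here is a toy model of the len versus size mixup in the set_xattr pasted at 12:49. The toy_xcache type and both toy_set_xattr variants are invented for this sketch, and the atom lookup is left out. It only shows what happens when the name length is forwarded in place of the data size.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_VALUE 32

/* Hypothetical one-slot attribute cache standing in for the xcache. */
struct toy_xcache {
	unsigned size;
	char body[TOY_MAX_VALUE];
};

/* Store `size` bytes of `data`; returns 0 on success, -1 if too large. */
static int toy_xcache_update(struct toy_xcache *xcache, const void *data, unsigned size)
{
	assert(xcache && data);
	if (size > TOY_MAX_VALUE)
		return -1;
	memcpy(xcache->body, data, size);
	xcache->size = size;
	return 0;
}

/* Buggy variant: forwards the name length where the data size belongs. */
static int toy_set_xattr_buggy(struct toy_xcache *xcache, const char *name, unsigned len,
			       const void *data, unsigned size)
{
	(void)name;
	(void)size;
	return toy_xcache_update(xcache, data, len);
}

/* Fixed variant: the value is stored with its own size. */
static int toy_set_xattr(struct toy_xcache *xcache, const char *name, unsigned len,
			 const void *data, unsigned size)
{
	(void)name;
	(void)len;
	return toy_xcache_update(xcache, data, size);
}
```

With a name and value of equal length, such as "foo" and "bar", both variants store the same bytes, so a test using that pair cannot tell them apart. A test whose name and value lengths differ catches it at once.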
2008-09-10 12:54 it should 2008-09-10 12:54 now is 2008-09-10 12:55 was originally going to be namelen and datalen 2008-09-10 12:55 it reads better as it is now 2008-09-10 12:55 now lets see if it works 2008-09-10 12:57 the test sets xattrs on the atom table directory inode 2008-09-10 12:58 it works 2008-09-10 12:58 set_xattr(inode, "foo", 3, "bar", 3); 2008-09-10 12:58 xcache_dump(inode); 2008-09-10 12:58 xattr 1: 0x8050fc8: 62 61 72 "bar" 2008-09-10 12:58 the test would have passed even with the bug above ;-) 2008-09-10 13:10 -!- kbingham(~kbingham@92.20.206.84) has joined #tux3 2008-09-10 13:14 next step is to get xattrs on/off disk 2008-09-10 13:28 notice in the above factoring we can easily have multiple atom tables 2008-09-10 13:28 I wonder what that is good for 2008-09-10 13:28 multiple xattr namespaces 2008-09-10 13:31 -!- kbingham(~kbingham@92.22.74.132) has joined #tux3 2008-09-10 13:35 "Stasticially speaking, the part of your disk that loses data is probably in the movies that suck, not the movies that are good, simply because the vast majority of movies suck. 
" -- http://alumnit.ca/~apenwarr/log/?m=200809#08 2008-09-10 13:35 wiser words were seldom spoken 2008-09-10 13:43 ACTION discusses concurrency issues with flips  2008-09-10 13:45 thinking about how'd I parallelize b-tree operations 2008-09-10 13:47 righto 2008-09-10 13:48 first obvious invariant: lock acquisition order is root-to-leaf 2008-09-10 13:49 obvious optimization rule: nodes near the root must not stay locked for long periods, including never locked across a read of disk data, other than the data of the ndoe itself 2008-09-10 13:51 nuther obvious one: no holding of locks above a node being read from disk 2008-09-10 13:51 this one is kinda tough sometimes, in cases like rename where the vfs holds locks 2008-09-10 13:59 here's a subtle and important one: subtrees below a locked node can be locked by another process, this is normal and important for throughput 2008-09-10 14:00 time for another hit of that excellent ramen 2008-09-10 14:28 hmm, I should probably stop put contributers email addys in the first line of the log 2008-09-10 14:28 where spambots can get them 2008-09-10 14:29 second line is most likely ok, it's the contributers call: do they want fame + spam? or just satisfaction? 2008-09-10 14:29 personally I try to err on the side of fame and have efficient spam filters 2008-09-10 14:31 my email is in tons of public places already, I've already got decent spam filters 2008-09-10 14:33 finished reading through last night's lesson 2008-09-10 14:33 and accompanying lxr pages 2008-09-10 14:34 and? 2008-09-10 14:35 ACTION can confirm that this ramien is excellent 2008-09-10 14:37 so my question (maybe it's stupid): what is an inode? 
2008-09-10 14:37 my current guess is 'an in-memory representation of the stuff surrounding a file, except for the contents' 2008-09-10 14:38 it's two or three things that are often conflated 2008-09-10 14:39 1) (the real def) it is a structure that caches the details of a filesystem object or device object in the vfs 2008-09-10 14:39 2) it is the image of an above object in file store 2008-09-10 14:40 3) it is the numeric id of the backing store image of an inode 2008-09-10 14:40 actually, 0) it is an object that caches all the data and attributes of a filesystem object or system device 2008-09-10 14:41 the notion of object being separate from how it is cached or stored 2008-09-10 14:41 so you guess is close to the mark 2008-09-10 14:41 very close 2008-09-10 14:41 alright 2008-09-10 14:41 in tux3, sometimes caches the contents as well 2008-09-10 14:42 well 2008-09-10 14:42 always caches the contents 2008-09-10 14:42 because it points at a "mapping" 2008-09-10 14:42 which is a radix tree that points at the cached pages of the data of the inode 2008-09-10 14:43 also caches the xattr data, which in tux3 is nearly the same as the file data 2008-09-10 14:43 heh 2008-09-10 14:43 how does ext[234] and friends handle xattr data and quotas? 
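Stepping back to the b-tree locking rules listed from 13:48 to 13:59: one common way to get root-to-leaf acquisition order while keeping locks near the root short-lived is lock coupling, where the child is locked before the parent is released. The sketch below is single-threaded and uses a toy flag in place of a real mutex, so it only demonstrates the ordering. It does not model disk reads, the vfs-held locks in rename, or other processes working in neighbouring subtrees, and nothing here is taken from tux3.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_FANOUT 4

/* Toy lock: a flag plus global bookkeeping, standing in for a real mutex. */
struct toy_lock { int held; };

static int toy_locks_held;      /* locks currently held by the walker */
static int toy_max_locks_held;  /* high-water mark over all walks so far */

static void toy_lock(struct toy_lock *lock)
{
	assert(!lock->held);
	lock->held = 1;
	if (++toy_locks_held > toy_max_locks_held)
		toy_max_locks_held = toy_locks_held;
}

static void toy_unlock(struct toy_lock *lock)
{
	assert(lock->held);
	lock->held = 0;
	toy_locks_held--;
}

struct toy_node {
	struct toy_lock lock;
	int count;                          /* children in use; 0 marks a leaf */
	int keys[TOY_FANOUT];               /* keys[i] = lowest key under child[i] */
	struct toy_node *child[TOY_FANOUT];
};

/* Descend root-to-leaf; returns the leaf still locked, caller unlocks it. */
static struct toy_node *toy_probe(struct toy_node *root, int key)
{
	struct toy_node *node = root;

	toy_lock(&node->lock);
	while (node->count) {
		int i = node->count - 1;

		/* Pick the last child whose lowest key is <= key. */
		while (i > 0 && key < node->keys[i])
			i--;
		struct toy_node *next = node->child[i];
		toy_lock(&next->lock);    /* take the child first... */
		toy_unlock(&node->lock);  /* ...then let go of the parent */
		node = next;
	}
	return node;
}
```

At most two locks are held at any moment, and the root is released as soon as the walk has moved one level down. That is what leaves the rest of the tree open to other walkers.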
2008-09-10 14:44 weirdly 2008-09-10 14:44 in both cases 2008-09-10 14:44 read ext3/xattr.c 2008-09-10 14:44 going 2008-09-10 14:45 tries to pack xattrs for different inodes together in blocks, then notice when entire xattr blocks are the same and have multiple pointers to them from different inodes 2008-09-10 14:45 quotas are done through this awful vfs-level abstraction 2008-09-10 14:45 quota files 2008-09-10 14:45 a real mess 2008-09-10 14:45 wrong idea 2008-09-10 14:46 I am not sure whether there is any connection between xaddrs and quota in ext* 2008-09-10 14:47 actually, the vfs thoughtfully provides a bypass around the quota file mess so a filesystem that wants to do it right can do so 2008-09-10 14:47 don't know if anybody uses that bypass 2008-09-10 14:52 for ext3: All attributes must fit in the inode and one additional block. 2008-09-10 14:55 right. lame. 2008-09-10 14:56 heh. 2008-09-10 14:57 tux3 goes at it more like HFS file fork 2008-09-10 14:57 ext3 uses the macros le32_to_cpu (and equivalently for tux3, be??_to_cpu) 2008-09-10 14:57 strangely, macos doesn't do xattrs like file forks, it limits them like ext* 2008-09-10 14:57 I'm going to respell those I think 2008-09-10 14:57 they're clunky 2008-09-10 14:58 from_be_u32 and to_be_u32 <- less clunky 2008-09-10 14:59 or from_beu32 and to_beu32 2008-09-10 15:00 or from_u32b and to_u32b 2008-09-10 15:00 vs from_u32l and to_u32l 2008-09-10 15:00 or from_u32be and to_u32be 2008-09-10 15:00 vs from_u32le and to_u32le 2008-09-10 15:01 not quite decided 2008-09-10 15:01 but probably will do a big spam edit in the next couple of days to make it better than it is 2008-09-10 15:01 we have a lot of endian work ahead of us and the inlines should support it, not get i n the way 2008-09-10 15:03 from_beu32 and to_beu32 <- this is probably the form that is the easiest the edit and least likely to offend kernel hacks 2008-09-10 15:03 easist to edit I mean 2008-09-10 15:04 in kernel they will likely just be #defined to 
be the kernel faves 2008-09-10 15:04 it's pathetic that gcc doesn't just make this an attribute 2008-09-10 15:56 my vote: 2008-09-10 15:56 from_u32be and to_u32be 2008-09-10 15:57 ok 2008-09-10 15:57 seems easiest to read 2008-09-10 15:57 I think yours is the casting vote because you are the only one who voted 2008-09-10 15:57 ;-) 2008-09-10 15:59 I know there's already been tons of discussion about C++ in the kernel, but sometimes, some aspects of it (OO, private fields, accessors) would make the interfaces so much cleaner... 2008-09-10 15:59 it's too bad some sort of OO shim can't be included in C 2008-09-10 16:00 really 2008-09-10 16:00 too bad the C and C++ camps don't even speak to each other any more 2008-09-10 16:00 feel free to write a chunk of tux3 in c++ if you want 2008-09-10 16:00 for example, tux3.c 2008-09-10 16:01 c++ desperately needs designated initializers 2008-09-10 16:13 maze, there ya go 2008-09-10 16:14 hmm? what do you mean? 2008-09-10 16:14 I see that the tux3 announcement is more popular than the 2.6.24.4 announcement on lkml.org 2008-09-10 16:14 endian conversions respelled according to your taste 2008-09-10 16:14 2.6.24.7 was out already ;-) 2008-09-10 16:14 oh, right , cool! 2008-09-10 16:14 so you probably mean 2.6.26.4 which was a screw up and was soon 2.6.26.5 2008-09-10 16:15 heh 2008-09-10 16:15 you'd think that would make the announcement even more popular 2008-09-10 16:15 what was the screw up? 2008-09-10 16:15 it didn't compile ;-) 2008-09-10 16:16 wow 2008-09-10 16:16 really? 2008-09-10 16:16 how could that happen 2008-09-10 16:16 at least some relativelly important option 2008-09-10 16:16 we should donate gregkh a computer 2008-09-10 16:16 http://kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.26.5 2008-09-10 16:17 there's a build option exactly to catch stuff like that 2008-09-10 16:17 and... how about trying to build before posting? 
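For the endian helper naming settled between 14:57 and 15:57, here is one portable way such helpers could look in user space. The names to_u32be and from_u32be come from the vote above. The toy_u32be wrapper type, the signatures, and store_u32be are assumptions made for this sketch and may not match what was committed to tux3. In kernel the same names could simply be defined in terms of cpu_to_be32 and be32_to_cpu.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wrapper type so a big-endian value cannot be used as a
 * host integer by accident. */
typedef struct { unsigned char bytes[4]; } toy_u32be;

/* Host order to big-endian, independent of the host's own byte order. */
static inline toy_u32be to_u32be(uint32_t host)
{
	toy_u32be be = { {
		(unsigned char)(host >> 24),
		(unsigned char)(host >> 16),
		(unsigned char)(host >> 8),
		(unsigned char)host,
	} };
	return be;
}

/* Big-endian to host order. */
static inline uint32_t from_u32be(toy_u32be be)
{
	return ((uint32_t)be.bytes[0] << 24) | ((uint32_t)be.bytes[1] << 16) |
	       ((uint32_t)be.bytes[2] << 8) | (uint32_t)be.bytes[3];
}

/* Write a host value into a disk buffer in big-endian order. */
static inline void store_u32be(void *disk, uint32_t host)
{
	toy_u32be be = to_u32be(host);

	assert(disk != NULL);
	memcpy(disk, be.bytes, sizeof be.bytes);
}
```

Shifting bytes explicitly avoids any #ifdef on host endianness. The wrapper struct turns a forgotten conversion into a compile error rather than a silent bug.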
2008-09-10 16:17 not that I always do it 2008-09-10 16:17 oh, at least he fixed it by releasing a new number 2008-09-10 16:18 the the consequences of messing that up are less far reaching with tux3 than stable linux 2008-09-10 16:18 I've got some best-left-unnamed teams that like to release bugfixed version of sw with the same version number 2008-09-10 16:18 right, and hundreds or thousands of people who innocently download the screwup will be confused and/or annoyed 2008-09-10 16:19 people don't download that quickly 2008-09-10 16:19 because one of the things gregkh also doesn't do, is rename screwups as -dontuse 2008-09-10 16:19 it was fixed in 6 hours 2008-09-10 16:19 yes they do 2008-09-10 16:19 most people pick it up through the distros 2008-09-10 16:19 check out the load spikes on lkml.org 2008-09-10 16:19 sorry 2008-09-10 16:19 kernel.org 2008-09-10 16:19 I'm still running 2.6.26.3 and waiting for a reboot to hit 2.6.26.5... 2008-09-10 16:20 with even my laptop getting average uptimes of 3-4 weeks, reboots ain't that often, so upgrades ain't that often 2008-09-10 16:20 behind ;-) 2008-09-10 16:20 I won't mention what I'm running ;-) 2008-09-10 16:20 my workstation is also the tux3.org server 2008-09-10 16:20 I really want 2.6.27.1 to come out though 2008-09-10 16:20 so I don't reboot much 2008-09-10 16:20 heh 2008-09-10 16:20 or try the latest flights of fancy of kernel devs to soon 2008-09-10 16:20 the wireless driver ath9k should fix a lot of my wireless woes 2008-09-10 16:21 me too 2008-09-10 16:21 <- notice the 2.6.27.1 ;-) when 2.6.27 isn't out yet 2008-09-10 16:21 I figure once 2.6.27.1 is out 2.6.27 shouldn't eat your data anymore 2008-09-10 16:21 looking forward to it with almost as much anticipation as an open ati driver that performs better than 50% as well as the bespoke one 2008-09-10 16:21 heh 2008-09-10 16:22 I did upgrade to 2.6.27rc3 and I'm not sure whether it was ath9k, or 2.6.27 or a bad build 2008-09-10 16:22 ok, have to do something 
about a haircut now 2008-09-10 16:22 or bad config options 2008-09-10 16:22 before jumping back in and finishing up adding xattr support to tux3.c 2008-09-10 16:22 but it promptly ran out of swiotlb buffers and corrupted my hard disk 2008-09-10 16:23 my ath woes are taken care of by having an eee 2008-09-10 16:23 spent a weekend recovering... thankfully had a one week old backup (earlier that week my laptop stopped booting...) 2008-09-10 16:23 they messed with setting up the binary/evil driver 2008-09-10 16:23 works great 2008-09-10 16:23 course I've got a bunch of other aths around taht would benefit 2008-09-10 16:23 my wife's machine for example 2008-09-10 16:24 I'm using madwifi drivers now on a macbook pro 3,1 - they work, but occasionally they disconnect, and you need to unload and reload the entire wireless stack 2008-09-10 16:24 which has a wire running into it because I don't have the energy to mess with the braindamanged firmware laod 2008-09-10 16:24 madwifi has been solid as a rock for me in the 4-5 years I've used it 2008-09-10 16:24 on a pci wireless 2008-09-10 16:25 I have a 'fix-wireless.sh' running in an xterm - it pings default gateway, if it's unreachable for 5 seconds, then prompty shuts down dhcpc/wpa_supplicant/wireless and brings it all back up - best part is you don't even lose existing established network conenctions (ssh, etc) 2008-09-10 16:26 it's still annoying though, because you occasionally get these 15 second pauses (happens maybe once an hour) 2008-09-10 16:26 :p 2008-09-10 16:27 I was running a very old incarnation of madwifi 2008-09-10 16:27 likely some value in that 2008-09-10 16:27 couldn't get the latest working after trying for an hour or so, so just plugged in the wire 2008-09-10 16:27 which is way faster anyway 2008-09-10 16:28 agreed, you want wired for everything stationary 2008-09-10 16:28 laptops aren't stationary though... 
2008-09-10 16:29 right, which is why I love my eee 2008-09-10 16:29 don't have to worry about a thing, somebody else does 2008-09-10 16:30 not to mention the fact that it fits comfortably in the flap of my camera backpack 2008-09-10 16:32 it runs linux? 2008-09-10 16:32 yes 2008-09-10 16:32 beautifully 2008-09-10 16:32 by default? 2008-09-10 16:33 everybody loves it here 2008-09-10 16:33 yes 2008-09-10 16:33 hmm 2008-09-10 16:33 funny thing is, the linux and windows versions cost the same, but you get 20G of flash disk with linux and only 12G with XP 2008-09-10 16:33 901 is the one to get 2008-09-10 16:34 I have the 900 and I'll probably pick up a 901 pretty soon 2008-09-10 16:34 need more than one for this family 2008-09-10 16:40 tux3 has bloated up to ~6,500 lines of .c + .h, sparsely commented and densely written, including unit tests 2008-09-10 16:41 so I think the kernel port will come in around 10K lines, complete with versioning, commit transactions and shiny new directory index 2008-09-10 16:41 about 1-2 months away from knowing 2008-09-10 16:41 depends a lot on how tux3 university goes ;-) 2008-09-10 16:52 http://userweb.kernel.org/~warthog9/damaged_server/ <- wow 2008-09-10 16:52 I'll think twice about using fedex 2008-09-10 16:57 sk8 oclock 2008-09-10 17:03 ouch 2008-09-10 17:03 are you counting the somewhat redundant tux3fuse/tux3fs in those 6500? 
2008-09-10 18:30 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-10 18:51 -!- nataliep_(~nataliep@66-102-14-1.google.com) has joined #tux3 2008-09-10 19:22 ACTION just realized that the next tux3 university is tomorrow and not today :P 2008-09-10 19:49 konrad, yes 2008-09-10 19:49 konrad, it was a joke 2008-09-10 19:49 6,500 lines at this point is really tight 2008-09-10 19:50 also includes the buffer and page cache emulation 2008-09-10 19:51 and none of tux3.c belongs in kernel, though it is probably the base on which tux3 mkfs and fsck will be built 2008-09-10 19:54 incidentally, that is about 1,000 lines/week 2008-09-10 19:54 a respectable pace 2008-09-10 19:55 especially considering the rewrite ratio is really high 2008-09-10 19:56 ACTION extracts the cork from a bottle of cabernet 2008-09-10 19:57 nice um, tasty california wine to go with the pasta 2008-09-10 19:57 unpretentious, unsophisticated, gives you its phone number right away 2008-09-10 19:58 top note of jelly beans 2008-09-10 20:55 next, inode.c need to know about loading and saving xattrs 2008-09-10 21:29 flips: very respectable 2008-09-10 21:29 especially if it all works 2008-09-10 21:30 the babernet worked fine 2008-09-10 21:30 cabernet 2008-09-10 21:30 <- proof 2008-09-10 21:30 ACTION has to arrange another cabal meeting 2008-09-10 21:31 bh, when you up here next? 2008-09-10 21:32 don't know 2008-09-10 21:32 I mean, really any time I want 2008-09-10 21:32 doesn't take long for san diego does it? 2008-09-10 21:32 right 2008-09-10 21:32 you'e not that far away 2008-09-10 21:32 I'll ping you when the next cabernet meeting comes up 2008-09-10 21:33 cabernet ? 2008-09-10 21:33 err 2008-09-10 21:33 cabal ;-) 2008-09-10 21:33 my tastes run more to bordeaux at cabal meetings 2008-09-10 21:33 or sake 2008-09-10 21:33 sorry, oh ok, well, that would be great. Did one happen today ? 2008-09-10 21:33 no, last was a week or so ago 2008-09-10 21:33 who came ? 
2008-09-10 21:33 can't say, it's a cabal 2008-09-10 21:34 good peeps 2008-09-10 21:34 ok 2008-09-10 22:08 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-10 22:19 -!- stargazr5(~gauravstt@59.95.19.195) has joined #tux3 2008-09-10 22:54 -!- pgquiles(~pgquiles@253.Red-83-44-239.dynamicIP.rima-tde.net) has joined #tux3 2008-09-11 00:30 inode table block 0x0/15 (f2c bytes free) 2008-09-11 00:30 0x0: new_xcache: realloc xcache to 9999 2008-09-11 00:30 mode 0000000 uid 0 gid 0 root 4:1 ctime 0 size 200 2008-09-11 00:30 0x2: new_xcache: realloc xcache to 9999 2008-09-11 00:30 mode 0000000 uid 0 gid 0 root 6:1 2008-09-11 00:30 0xa: new_xcache: realloc xcache to 9999 2008-09-11 00:30 mode 0000000 uid 0 gid 0 root a:1 2008-09-11 00:30 0xd: new_xcache: realloc xcache to 9999 2008-09-11 00:30 mode 0040755 uid 0 gid 0 root 8:1 2008-09-11 00:30 0xe: new_xcache: realloc xcache to 9999 2008-09-11 00:30 mode 0100700 uid 0 gid 0 root d:1 ctime 0 size 1008 xattr(s) 2008-09-11 00:30 {1} => 0x805f110: 68 65 6c 6c 6f 20 77 6f 72 6c 64 21 "hello world!" 2008-09-11 00:30 inode 0xe (14) has an extended attribute with atom number 1 and body "hello world!" 2008-09-11 00:31 so an xattr made it into the inode table 2008-09-11 00:31 and onto disk I think 2008-09-11 00:31 need to verify that by trying to get it back 2008-09-11 01:29 nice 2008-09-11 02:16 one step closer to ruling the world 2008-09-11 02:18 flips: what you were trying to do was add reference counting to attributes ? 2008-09-11 02:18 but aborted it ? 2008-09-11 02:18 aborted? 2008-09-11 02:18 no reference counting just now 2008-09-11 02:18 just trying to get xattrs onto disk and back off 2008-09-11 02:18 very close to that now 2008-09-11 02:19 right there was some discussion about it and you decided to go with a simpler approach 2008-09-11 02:19 I decided to go with reference counting 2008-09-11 02:19 yeah, looks like it 2008-09-11 02:19 oh really ? 
2008-09-11 02:19 but not just yet 2008-09-11 02:19 have you thought about having extensions for easy of use with samba ? 2008-09-11 02:19 they just want xattrs that work well 2008-09-11 02:20 ok 2008-09-11 02:20 tridge was disappointed with the performance of pretty well every filesystem wrt xattrs 2008-09-11 02:21 it's a hard problem to solve 2008-09-11 02:21 more folks just ignore it 2008-09-11 02:21 more=most 2008-09-11 02:21 generally done badly from what I've seen 2008-09-11 02:21 I like the way it's coming out in tux3 2008-09-11 02:21 store_attrs: Failed assertion "attr == base + size"! 2008-09-11 02:21 Trace/breakpoint trap 2008-09-11 02:21 got to debug 2008-09-11 02:22 ok 2008-09-11 02:23 ah I see the problem 2008-09-11 02:24 the attribute size estimation done for xattrs before saving inode should not include the xattr header size, only the variable data part 2008-09-11 02:32 there, got my xattr back 2008-09-11 02:32 lets see if I can set a new one 2008-09-11 02:32 yay 2008-09-11 02:32 nope, that's the problem, the second set fails 2008-09-11 02:32 but it's progress 2008-09-11 02:33 very definite progress 2008-09-11 02:46 it works now 2008-09-11 02:46 konrad, you can say yay for real ;-) 2008-09-11 02:47 yay for real 2008-09-11 02:47 :-) 2008-09-11 02:47 bug was in a part of the code you worked on 2008-09-11 02:47 for (int kind = MIN_ATTR; kind < VAR_ATTRS; kind++) { 2008-09-11 02:47 but xattrs didn't exist then 2008-09-11 02:48 the attribute encode now has two parts 2008-09-11 02:48 the part that encodes 'standard' attributes 2008-09-11 02:48 and the part that enocdes extended attribute from the xcache 2008-09-11 02:48 ah 2008-09-11 02:49 the standard attribute encoder better not write out headers for extended attributes, which it was doing 2008-09-11 02:49 this part of the code is going to evolve a lot as things progress 2008-09-11 02:50 it gets more complex when versioning arrives 2008-09-11 02:50 then we can't just blindly overwrite the entire set of 
attributes in the verison table 2008-09-11 02:50 because the inode only has the attributes for one version 2008-09-11 02:50 attributes for other versions have to be left alone 2008-09-11 02:50 messy 2008-09-11 02:51 but also some weeks away 2008-09-11 02:51 this code will do for the nonversioning protoytpe 2008-09-11 02:55 committed 2008-09-11 02:55 enough for today 2008-09-11 02:58 I think I need to reward myself with a pair of these: http://www.skatehut.co.uk/acatalog/Seba_FR1_Skates_-_Orange_White___195.html 2008-09-11 02:59 £200 is a lot in USD 2008-09-11 03:01 can get them for $350 here 2008-09-11 03:01 I think 2008-09-11 03:01 not easy to get 2008-09-11 03:01 americans have dodgy taste in skates ;-) 2008-09-11 03:02 everybody is either fitness or agressive 2008-09-11 03:02 aggressive skates are just stupid 2008-09-11 03:02 made for only one thing: sliding down rails 2008-09-11 03:02 yeah 2008-09-11 03:02 tiny little wheels 2008-09-11 03:02 don't need wheels for that 2008-09-11 03:02 heh 2008-09-11 03:03 just wear a pair of shoes with no traction :) 2008-09-11 03:03 "extreme walking" 2008-09-11 03:03 right 2008-09-11 03:03 I saw a couple of aggro skaters for the first time on the strand 2008-09-11 03:03 jumping up on things, seemed like fun 2008-09-11 03:03 heh 2008-09-11 03:04 but I can do that on my street skates too 2008-09-11 03:04 really? 2008-09-11 03:04 kind of tough to slide down rails 2008-09-11 03:04 or impossible 2008-09-11 03:04 you have enough space between the middle wheels? 
2008-09-11 03:04 yeah 2008-09-11 03:04 no I don't grind 2008-09-11 03:04 there's a grind plate, I can get up on some things with it 2008-09-11 03:04 yeah 2008-09-11 03:04 ah 2008-09-11 03:05 sounds like you have some experience 2008-09-11 03:05 no 2008-09-11 03:05 I just put on skates for the first time in 6-7 years a few days ago 2008-09-11 03:05 they're kind of a size or size and a half too small 2008-09-11 03:05 ouch 2008-09-11 03:05 yeah 2008-09-11 03:07 started skating down the little stub walls the skateboarders grind on 2008-09-11 03:07 that seems to impress the skateboarders 2008-09-11 03:07 it's easier to do it on one foot 2008-09-11 03:07 probably looks harder though 2008-09-11 03:17 heh 2008-09-11 03:52 -!- kbingham(~kbingham@193.132.141.186) has joined #tux3 2008-09-11 04:14 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-11 04:51 -!- kmeyer(~konrad@c-24-16-74-109.hsd1.mn.comcast.net) has joined #tux3 2008-09-11 05:00 flips: tux3fuse has xattrs now (not my doing) 2008-09-11 05:30 -!- kbingham(~kbingham@193.132.141.186) has joined #tux3 2008-09-11 05:30 -!- pgquiles(~pgquiles@253.Red-83-44-239.dynamicIP.rima-tde.net) has joined #tux3 2008-09-11 05:30 -!- nataliep_(~nataliep@66-102-14-1.google.com) has joined #tux3 2008-09-11 05:30 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-09-11 05:30 -!- eli(~elicriffi@66.249.86.209) has joined #tux3 2008-09-11 05:30 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2008-09-11 05:58 -!- konrad(~konrad@c-24-16-74-109.hsd1.mn.comcast.net) has joined #tux3 2008-09-11 05:58 -!- RzM|Away(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-11 05:58 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2008-09-11 05:58 -!- flips(~phillips@phunq.net) has joined #tux3 2008-09-11 05:58 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-09-11 07:33 -!- kbingham(~kbingham@193.132.141.186) has joined #tux3 2008-09-11 07:48 -!- 
RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-11 09:02 -!- pgquiles(~pgquiles@253.Red-83-44-239.dynamicIP.rima-tde.net) has joined #tux3 2008-09-11 12:09 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-11 13:45 -!- kbingham(~kbingham@92.20.246.248) has joined #tux3 2008-09-11 14:04 -!- cdk(~chinmay@121.246.36.119) has joined #tux3 2008-09-11 14:06 konrad, that was amazingly fast of tero hmm? Nice code too. 2008-09-11 14:06 konrad, but you made this all happen, and your code was very decent as well 2008-09-11 14:07 tero is in the real pro category, lots to learn from him 2008-09-11 14:12 -!- cdk(~chinmay@121.246.36.119) has left #tux3 2008-09-11 14:15 -!- cdk(~chinmay@121.246.36.119) has joined #tux3 2008-09-11 14:19 -!- cdk(~chinmay@121.246.36.119) has joined #tux3 2008-09-11 14:31 My new boombox arrive 2008-09-11 14:31 now I can go totally ghetto, skating down to the beach with a ghetto blaster in my hand 2008-09-11 14:32 ACTION is degenerating under the influence of certain skaters 2008-09-11 15:04 haircut time 2008-09-11 15:04 later... 2008-09-11 15:37 http://linux.slashdot.org/article.pl?sid=08/09/11/1913229 <- first time I ever say the "pigfuckers" tag on slashdot 2008-09-11 15:37 re lenova caving to msft on shipping linux preinstalls 2008-09-11 16:42 -!- tim_dimm(~timothyhu@adsl-67-114-40-138.dsl.scrm01.pacbell.net) has joined #tux3 2008-09-11 16:43 howdy 2008-09-11 18:06 sk8 oclock 2008-09-11 18:06 on this skate, I will think about implementation details of atom refcounts 2008-09-11 19:28 that was fun 2008-09-11 19:28 I did a move that got the sk8rs clapping 2008-09-11 19:28 then they said "ok rollerbladers are allowed" 2008-09-11 19:28 in the sk8 park that is 2008-09-11 19:29 got to grab a quick bite, then hopefully we can do chapter two of tux3 university 2008-09-11 19:29 anybody here for that? 
2008-09-11 19:30 ACTION nods 2008-09-11 19:47 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-11 19:55 hiyah 2008-09-11 19:55 just warming up for the next episode 2008-09-11 19:55 with a some pasta and a glass of cabernet 2008-09-11 19:55 ran out of chianti ;-) 2008-09-11 19:56 :D 2008-09-11 19:56 btw, what is brand is that ramen you mention the other day? 2008-09-11 19:57 important point 2008-09-11 19:57 shin ramyun 2008-09-11 19:57 made by nog shim 2008-09-11 19:58 sorry 2008-09-11 19:58 nong shim 2008-09-11 19:58 "family pack" 2008-09-11 19:58 "gourmet spicy" 2008-09-11 19:58 overdid it a little yesterday, had three packs ;-) 2008-09-11 19:58 don't do that 2008-09-11 19:58 -!- RalucaME(~ral@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-11 19:58 http://kimchimamas.typepad.com/.shared/image.html?/photos/uncategorized/2007/12/13/nong_shim.jpg ? 2008-09-11 19:59 exactly 2008-09-11 20:00 I like it hot :-) 2008-09-11 20:00 seldom had better ramyun, even in korea 2008-09-11 20:00 it's actually korean of course 2008-09-11 20:00 got it from a japanese grocery 2008-09-11 20:01 -!- ebiederm(~eric@c-24-130-11-59.hsd1.ca.comcast.net) has joined #tux3 2008-09-11 20:01 hi eric 2008-09-11 20:01 so I'll have no chance to find it at Giant or Superfresh :( 2008-09-11 20:01 let me introduce you to one of the foremost kernel hackers in the known universe 2008-09-11 20:01 eric biederman 2008-09-11 20:01 say hi :-) 2008-09-11 20:01 hello all. 2008-09-11 20:02 eric is responsible for much of what makes linux great in the supercomputing cluster space 2008-09-11 20:02 ACTION also says Hello! 
2008-09-11 20:02 konrad, don't be shy ;-) 2008-09-11 20:02 hi Eric 2008-09-11 20:02 hello :) 2008-09-11 20:02 well eric is not really a vfs guy, just a general genius 2008-09-11 20:03 knows everything about everything nearly 2008-09-11 20:03 :-) 2008-09-11 20:03 lol 2008-09-11 20:03 ACTION double checks that the logging is enabled 2008-09-11 20:03 hey 2008-09-11 20:03 also, nataliep_ up there is the linux kernel bug manager 2008-09-11 20:03 more folks have joined, nice 2008-09-11 20:03 ok, let's start 2008-09-11 20:04 first let me ask some questions: what does VFS stand for? 2008-09-11 20:04 virtual file system 2008-09-11 20:04 close but no 2008-09-11 20:04 subsystem? :D 2008-09-11 20:04 ACTION listens to the sound of googling 2008-09-11 20:04 hey 2008-09-11 20:04 maze! 2008-09-11 20:05 yeah, so 8pm is a little tight ;-) 2008-09-11 20:05 maze is about the smartest smart person I met a google 2008-09-11 20:05 hehe, thanks! 2008-09-11 20:05 no exaggeration 2008-09-11 20:05 ok, let's try again: what does VFS stand for? 2008-09-11 20:05 googlling is ok 2008-09-11 20:05 ACTION is diluting the quality of the channel :P 2008-09-11 20:06 I doubt that, razvanm 2008-09-11 20:06 virtual file system 2008-09-11 20:06 er wait 2008-09-11 20:06 switch? 2008-09-11 20:06 that's been said hasn't it 2008-09-11 20:06 right! 2008-09-11 20:06 see? 
2008-09-11 20:06 razvanm wins 2008-09-11 20:06 it stands for virtual filesystem switch 2008-09-11 20:06 versioning file system :P 2008-09-11 20:06 firefox had 'AVFS' at the top of my url bar for vfs :( 2008-09-11 20:06 how it got that name, I don't know 2008-09-11 20:06 it was the first hit for 'vfs lnux' :P 2008-09-11 20:06 eric probably does 2008-09-11 20:07 lol 2008-09-11 20:07 it switches between the different filesystems like a network switch switches between computers 2008-09-11 20:07 somebody better find out, because it's sure to come up at a geek challenge context at linuxtag eventually 2008-09-11 20:07 yes 2008-09-11 20:07 it is a colletion of methods that together implement a filesystem 2008-09-11 20:07 find out what? 2008-09-11 20:08 how it came to be called that 2008-09-11 20:08 I know where it came from but not why they picked the name. When the implemented the second filesystem on BSD they needed an abstraction layer. 2008-09-11 20:08 the vfs.txt from Documentation says: Overview of the Linux Virtual File System 2008-09-11 20:08 who came up with it 2008-09-11 20:08 etc 2008-09-11 20:08 ah 2008-09-11 20:08 trivia ;-) 2008-09-11 20:08 I knew eric would win that somehow ;-) 2008-09-11 20:08 well let me tell you 2008-09-11 20:08 the foremost filesystem dev on bsd does not know what vfs means 2008-09-11 20:08 :D 2008-09-11 20:08 or who called it taht, or why 2008-09-11 20:08 yet he is definitely the foremost fs dev 2008-09-11 20:09 everybody know his name? 2008-09-11 20:09 quick... 2008-09-11 20:09 hint: 2008-09-11 20:09 I suck at trivia... I'm lucky to know my own name... 2008-09-11 20:09 McKusick? 2008-09-11 20:09 he engaged in a discussion re tux3 design recently 2008-09-11 20:09 mckusick is close but no 2008-09-11 20:10 hint: firefly 2008-09-11 20:10 Dillon? 2008-09-11 20:10 the dragonfly hammer guy? 2008-09-11 20:10 yes! 2008-09-11 20:10 Matt Dillon IIRC 2008-09-11 20:10 hammer? 
2008-09-11 20:10 also responsible for linux having a reverse mapped vm 2008-09-11 20:10 used to be the bsd vm guy 2008-09-11 20:10 is now the vm fs guy 2008-09-11 20:10 and runs his own distro 2008-09-11 20:10 intensely clueful person 2008-09-11 20:10 ok 2008-09-11 20:10 let's do some vfs 2008-09-11 20:11 ACTION is ready 2008-09-11 20:11 and let's start from the opposite end that we started from yesterday 2008-09-11 20:11 everybody got their browsers ready? 2008-09-11 20:11 yesterday? 2008-09-11 20:11 eh 2008-09-11 20:11 day before yesterday 2008-09-11 20:11 last time ;-) 2008-09-11 20:11 :D 2008-09-11 20:11 loaded ;-) 2008-09-11 20:12 lxr.linux.no should be my homepage or something 2008-09-11 20:12 lets go here: http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c 2008-09-11 20:12 super.c is the "main" for a linux filesystem 2008-09-11 20:12 we might call it tux3.c for tux3, or we might go with tradition and call it super.c 2008-09-11 20:13 it's got module_{init,exit} 2008-09-11 20:13 it has two basic tasks: 1) parse the mount options 2) load the fs superblock 2008-09-11 20:13 right 2008-09-11 20:13 it takes care of a few other details besides 2008-09-11 20:13 so let's take a look at some really crappy parsing code 2008-09-11 20:14 parse_options 2008-09-11 20:14 http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L428 2008-09-11 20:14 line 429 2008-09-11 20:14 oops :) 2008-09-11 20:14 depends on the version of course 2008-09-11 20:15 429 on mine as well 2008-09-11 20:15 nothing really interesting here 2008-09-11 20:15 just good to know where it is 2008-09-11 20:15 so, there isn't actually such a thing as a linux "mount" program 2008-09-11 20:15 so it gets a string and a pointer to the superblock? 
2008-09-11 20:15 all we do is call the fs's mount entry point 2008-09-11 20:15 sbi 2008-09-11 20:16 not quite the same 2008-09-11 20:16 sbi is the filesystem-specific bit of a superblock 2008-09-11 20:16 so that's the in-mem representation of an ext2 superblock 2008-09-11 20:16 superblocks and inodes in linux are both generic structures 2008-09-11 20:16 almost 2008-09-11 20:16 re in-mem rep 2008-09-11 20:17 there is also an exact image of the disk superblock that ext2 keeps around 2008-09-11 20:17 I don't know if tux3 will bother 2008-09-11 20:17 we shall see, that is a fiddly detain 2008-09-11 20:17 the sbi corresponds to what is called struct sb in the tux3 userspace 2008-09-11 20:18 and tux3 doesn't really have a generic superblock implemented at the moment 2008-09-11 20:18 linux kernel does 2008-09-11 20:18 superblock fields are separated into two classes: 1) ones that core vfs knows what to do with 2) ones that only mean something to the fielsystem 2008-09-11 20:18 inodes are separated the same way 2008-09-11 20:19 by a completely different mechanism, for not good reason 2008-09-11 20:19 any idea what the 0pt_ in the tokens means? 2008-09-11 20:19 the superblock specialization is via a fs-specific pointer 2008-09-11 20:19 oh, its opt not 0pt ;-) 2008-09-11 20:19 http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L395 2008-09-11 20:20 not really 2008-09-11 20:20 5 minutes of poking will answer that 2008-09-11 20:20 or 1 minute 2008-09-11 20:20 there is some fairly trivial macro magic going on here and there 2008-09-11 20:20 [I mis-parsed as zero-pt font size...] 
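The Opt_ tokens asked about above are enum values that a table maps option strings onto. Below is a minimal userspace sketch of that table-driven idea. It is not the ext2 code: the two option names, struct mount_opts, match_option and parse_options_sketch are all made up for illustration, and plain libc string calls stand in for the kernel's own helpers.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical option tokens, in the spirit of an Opt_ enum. */
enum { Opt_debug, Opt_sb, Opt_err };

/* Hypothetical record of parsed options (stands in for fields of an sbi). */
struct mount_opts {
	int debug;		/* set when "debug" is seen */
	unsigned long sb;	/* value from "sb=N" */
};

/* Map one "name" or "name=value" option to a token. */
static int match_option(const char *opt, const char **value)
{
	*value = NULL;
	if (strcmp(opt, "debug") == 0)
		return Opt_debug;
	if (strncmp(opt, "sb=", 3) == 0) {
		*value = opt + 3;
		return Opt_sb;
	}
	return Opt_err;
}

/* Parse a comma separated option string; 1 on success, 0 on error. */
int parse_options_sketch(const char *options, struct mount_opts *out)
{
	const char *p = options;

	while (*p) {
		char opt[64];
		const char *value;
		const char *end = strchr(p, ',');
		size_t len = end ? (size_t)(end - p) : strlen(p);

		if (len >= sizeof(opt))
			return 0;
		memcpy(opt, p, len);
		opt[len] = '\0';
		p += len;
		if (*p == ',')
			p++;
		if (len == 0)
			continue;	/* skip empty options such as ",," */

		switch (match_option(opt, &value)) {
		case Opt_debug:
			out->debug = 1;
			break;
		case Opt_sb:
			out->sb = strtoul(value, NULL, 10);
			break;
		default:
			return 0;	/* unknown option */
		}
	}
	return 1;
}
```

Calling parse_options_sketch("debug,sb=8193", &opts) on a zeroed struct sets both fields; an unknown option makes it return 0.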
2008-09-11 20:20 anyway 2008-09-11 20:21 like I said, awful parsing code 2008-09-11 20:21 used to be a lot worse 2008-09-11 20:21 gets the job done in way too many lines 2008-09-11 20:21 well lets look at a more interesting bit 2008-09-11 20:21 loading the superblock 2008-09-11 20:21 quite tricky 2008-09-11 20:21 because the filesystem isn't working yet 2008-09-11 20:21 we don't even know the blocksize 2008-09-11 20:22 we have ext2_get_sb 2008-09-11 20:22 which is stored in the ext2_fs_type structure 2008-09-11 20:23 http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L366 2008-09-11 20:23 of type "file_system_type" 2008-09-11 20:23 this is the starting point for any filesystem 2008-09-11 20:23 the tip of the iceberg 2008-09-11 20:23 root of the tree 2008-09-11 20:23 heart of the dragon etc 2008-09-11 20:24 file_system_type defines a few methods, by far the most important of which is get_sb 2008-09-11 20:24 this structure is passed to register_filesystem 2008-09-11 20:24 when the module is initialized 2008-09-11 20:24 which happens these days whether or not is actually a module by the way 2008-09-11 20:25 and that makes the filesystem appear in /proc/filesystems 2008-09-11 20:25 so everybody should do cat /proc/filesystems now 2008-09-11 20:25 and tell what they see there that is really interesting 2008-09-11 20:26 lots of nodev's 2008-09-11 20:26 lots of internal no-blockdev fs'es and 4 dev-fs'es 2008-09-11 20:26 suggesting that nodiv is a stupid idea... 2008-09-11 20:26 which is true 2008-09-11 20:26 and? 2008-09-11 20:26 my oly non-nodev are ext3 and vfat :P 2008-09-11 20:26 well, there's ext3, hfsplus, iso9660, fuseblk 2008-09-11 20:26 right 2008-09-11 20:26 and there is no tux3 2008-09-11 20:26 and a ton of internal ones (usb, ramfs, etc...) 
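For anyone who wants to do the /proc/filesystems exercise in code rather than with cat, here is a small sketch that counts nodev entries against block-device-backed ones. It assumes only the usual file format, where each line is either "nodev", a tab and a name, or a tab and a name. The names fs_counts, line_is_nodev and count_filesystems are invented for this example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical tally of the two kinds of entries. */
struct fs_counts {
	int nodev;	/* filesystems that need no block device */
	int blockdev;	/* filesystems that mount on a block device */
};

/* Return 1 if the line starts with the "nodev" marker column. */
int line_is_nodev(const char *line)
{
	return strncmp(line, "nodev", 5) == 0 &&
	       (line[5] == '\t' || line[5] == ' ');
}

/* Count entries from an open stream in /proc/filesystems format. */
void count_filesystems(FILE *f, struct fs_counts *c)
{
	char line[256];

	c->nodev = 0;
	c->blockdev = 0;
	while (fgets(line, sizeof(line), f)) {
		if (line[0] == '\n' || line[0] == '\0')
			continue;	/* ignore blank lines */
		if (line_is_nodev(line))
			c->nodev++;
		else
			c->blockdev++;
	}
}
```

On a Linux box the stream would come from fopen("/proc/filesystems", "r"); the counts obviously depend on which filesystems that kernel has registered.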
2008-09-11 20:26 :D 2008-09-11 20:26 that is the most important thing to notice 2008-09-11 20:27 and that is why there is a tux3 university 2008-09-11 20:27 notice also that there is a ramfs 2008-09-11 20:27 ramfs is the second most useful filesystem for learning about the vfs 2008-09-11 20:27 the most useful being ext2 2008-09-11 20:27 also sockfs 2008-09-11 20:27 is sockfs for unix domain sockets? 2008-09-11 20:27 suckfs 2008-09-11 20:27 right 2008-09-11 20:28 :-) 2008-09-11 20:28 I'd prefer a shoefs 2008-09-11 20:28 don't take anything from the net side of linux as an example of anything besides "fast" 2008-09-11 20:28 sk8fs 2008-09-11 20:28 yup 2008-09-11 20:28 I see "fuse" 2008-09-11 20:28 interesting 2008-09-11 20:28 in fact, 3 or them 2008-09-11 20:28 fuse ,fuseblk, and fusectl 2008-09-11 20:28 3 of them 2008-09-11 20:29 that's a little over the top 2008-09-11 20:29 oh, this one is always a laugh: hugetlbfs 2008-09-11 20:29 a naive person would think one would be enough 2008-09-11 20:29 or would already be one too many 2008-09-11 20:29 hugetlbfs is indded the worst fs ever conceived 2008-09-11 20:29 what's the difference between rootfs/ramfs/tmpfs ? 2008-09-11 20:29 sometimes even the great penguin has bad days 2008-09-11 20:29 rootfs exists just to get linux booted 2008-09-11 20:30 probably a bad idea 2008-09-11 20:30 but that's how it works 2008-09-11 20:30 ramfs is really interesting 2008-09-11 20:30 it is basically just the vfs cache layer of a fs with all backing store stripped away 2008-09-11 20:30 it is worth reading every line 2008-09-11 20:30 is the split merely to be able to shave off more code in embedded? 
2008-09-11 20:31 it is split for tutorial reasons 2008-09-11 20:31 ;-) 2008-09-11 20:31 ramfs is to serve as an example of a minimal fs with no backing store 2008-09-11 20:31 somehow it bloated up to 589 lines though 2008-09-11 20:31 when it really only needs 150 maybe 2008-09-11 20:32 so I guess somebody didn't get the memo ;-) 2008-09-11 20:32 tmpfs is the real workhorse 2008-09-11 20:32 that is basically ramfs backed by the swap device 2008-09-11 20:32 $ wc -l file-mmu.c 2008-09-11 20:32 53 file-mmu.c 2008-09-11 20:32 common mounted on /tmp these days 2008-09-11 20:32 commonly 2008-09-11 20:32 ok, I'll take a short break 2008-09-11 20:33 to refill my cabernet 2008-09-11 20:33 so tmpfs can be swapped out, while ramfs and rootfs can't 2008-09-11 20:33 linus pronounces 'vfs' as 'virtual filesystems' in ramfs/inode.c 2008-09-11 20:33 and why don't you compare notes? 2008-09-11 20:33 linus doesn't always get it right ;-) 2008-09-11 20:33 tytso would normally clobber him in a geek trivial contest 2008-09-11 20:34 http://farm1.static.flickr.com/164/413387043_ab2c7569a4.jpg :P 2008-09-11 20:34 :-) 2008-09-11 20:35 the reflection isn't quite as nice here 2008-09-11 20:35 but it does reflect, in this idea desk 2008-09-11 20:35 ikea 2008-09-11 20:35 ACTION also sits at an ikea desk ;-) 2008-09-11 20:36 ok, let's go up to ext2_fill_super 2008-09-11 20:36 we pass that as a method to a vfs library call 2008-09-11 20:36 http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L737 2008-09-11 20:36 if you think that is an odd way to init a fs you'd be right ;-) 2008-09-11 20:36 so what is an sbi? 2008-09-11 20:37 ACTION waits 2008-09-11 20:37 sb info 2008-09-11 20:37 ext2_sb_info ptr 2008-09-11 20:37 right, and what points at it? 
2008-09-11 20:37 sb->s_fs_info 2008-09-11 20:38 right 2008-09-11 20:38 so that is how the linux fs specializes a superblock 2008-09-11 20:38 by haing s_fs_info point at something allocated and initialized by the fs 2008-09-11 20:38 that only the fs will ever use 2008-09-11 20:38 how does it know how big to make it? 2008-09-11 20:39 http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L768 <- here we read the superblock 2008-09-11 20:39 MaZe: sizeof(*sbi) 2008-09-11 20:39 maze, the fs declares it, and it makes it sizeof(that) 2008-09-11 20:39 won't that be fs dependant though? 2008-09-11 20:39 it is 2008-09-11 20:39 that is why it is a fs-specific pointer field 2008-09-11 20:39 pointer is always the same size :) 2008-09-11 20:40 core vfs will never look there 2008-09-11 20:40 right 2008-09-11 20:40 oh, right it's allocated within ext2 code 2008-09-11 20:40 thank goodness for that small mercy 2008-09-11 20:40 right 2008-09-11 20:40 here http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L755 2008-09-11 20:40 http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L144 2008-09-11 20:40 one can easily imagine a universe in which pointers on the same machine are not all the same size 2008-09-11 20:41 keep'em beasties far away from me... 
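The s_fs_info pattern described above can be shown with a toy userspace model. This is only a sketch: the toy_ structs and functions are invented here and merely mimic the shape of the kernel's generic superblock and ext2's sbi, with calloc standing in for the kernel allocator. The point is that the generic structure carries an opaque pointer, and only the filesystem knows the type behind it, and therefore its size.

```c
#include <stdlib.h>

/* Toy stand-in for the generic superblock that core code understands. */
struct toy_super_block {
	unsigned long s_blocksize;	/* generic field */
	void *s_fs_info;		/* fs-private: core code never looks here */
};

/* Toy stand-in for one filesystem's private superblock info (its "sbi"). */
struct toy_sb_info {
	unsigned long s_inodes_count;
	unsigned long s_frag_size;
};

/* Only the filesystem knows the type, so only it can size the allocation. */
int toy_fill_super(struct toy_super_block *sb)
{
	struct toy_sb_info *sbi = calloc(1, sizeof(*sbi));

	if (!sbi)
		return -1;
	sbi->s_inodes_count = 128;	/* made-up value for the demo */
	sb->s_fs_info = sbi;
	return 0;
}

/* Accessor: turn the opaque pointer back into the fs-specific type. */
struct toy_sb_info *TOY_SB(struct toy_super_block *sb)
{
	return sb->s_fs_info;
}

/* Release the private part again, as an unmount path would. */
void toy_put_super(struct toy_super_block *sb)
{
	free(sb->s_fs_info);
	sb->s_fs_info = NULL;
}
```

Because the generic struct only ever holds a void pointer, its layout stays the same no matter how large or small any one filesystem's private info is.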
2008-09-11 20:41 so there is a some braindamage about trying to use the "blocksize as the device" to load the superblock 2008-09-11 20:41 bad idea 2008-09-11 20:41 should just assume that it is always the same size 2008-09-11 20:41 there is no legitimate concept of blocksize on a device, actually 2008-09-11 20:42 never mind that I have coded one in my vfs emulation ;-) 2008-09-11 20:42 that is a wart I will get rid of probably one day when it irritates me enough 2008-09-11 20:42 only the fs sbi should know the blocksize of the filesystem 2008-09-11 20:43 so, that nonsense about device blocksize is so that ext2 can use "sb_bread" to read the superblock 2008-09-11 20:43 again, there is no reason for this 2008-09-11 20:43 the tux3 userspace code directlly case "diskIo" there 2008-09-11 20:43 bypassing the buffer emulation 2008-09-11 20:43 and ext2 really should do the same, not have that fragile blocksize code there 2008-09-11 20:44 not get_sb_bdev ? 2008-09-11 20:44 right 2008-09-11 20:44 equivalent of tux3 diskio 2008-09-11 20:44 well 2008-09-11 20:44 these fns have a lot of cruft attached 2008-09-11 20:44 been through many iterations of doing things the wrong way 2008-09-11 20:45 so you want to go to the lowest level thing that will actually read if you want to be clear and robust here 2008-09-11 20:45 I'd be tempted to submit a bio 2008-09-11 20:45 but anyway 2008-09-11 20:45 we'll get there soon enough, and have to implement our own version of that 2008-09-11 20:45 let's do it a little more cleanly, but we don't have to save the world 2008-09-11 20:45 just now 2008-09-11 20:46 873 /* If the blocksize doesn't match, re-read the thing.. 
*/ <- excellent example of yunk 2008-09-11 20:46 huck 2008-09-11 20:46 yuck 2008-09-11 20:46 :-) 2008-09-11 20:46 "yunk" is short for "yucky junk" 2008-09-11 20:46 and "huck" is what we will do with that in tux3 2008-09-11 20:47 so by here ext2 has managed to read its superblock: http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L898 2008-09-11 20:47 should actually have only been 3 lines, though we did do some options processing as well 2008-09-11 20:48 most of that is historical cruft 2008-09-11 20:48 keep in mind that ext2 is one of the cleanest filesystems ;-) 2008-09-11 20:48 :D 2008-09-11 20:48 http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L915 <- ext2 dutifully reads the frag size, even though this bsd ufs concept was never implemented and never will be 2008-09-11 20:49 http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L941 <- it checks the super magic 2008-09-11 20:49 tux3 gets to this point about 20 lines in or so 2008-09-11 20:50 a few more than that actually 2008-09-11 20:50 tux3.c 2008-09-11 20:50 but in the kernel implementation, it will be about a dozen lines from the fill_super entry 2008-09-11 20:50 as it should be 2008-09-11 20:51 next big job is to read the root directory! 2008-09-11 20:51 this is exciting because the filesystem isn't working yet 2008-09-11 20:51 wouldn't it be enough to just get the rootdir's inode number? 
2008-09-11 20:51 we need to get the root dir up and running as an inode 2008-09-11 20:51 so that that open (2) and readdir work on it 2008-09-11 20:52 so yes 2008-09-11 20:52 we need to know the rootdirs inode number 2008-09-11 20:52 that has evolved over time with ext2 2008-09-11 20:52 it used to just be a fixed number 2008-09-11 20:52 now there is a fancier method 2008-09-11 20:52 for no good reason 2008-09-11 20:53 Tux3 uses inode number 0xd (for "directory" or "daniel") for the root dir 2008-09-11 20:53 http://lxr.linux.no/linux+v2.6.26.5/include/linux/ext2_fs.h#L61 2008-09-11 20:53 right 2008-09-11 20:53 somewhere there is "good ol'" something 2008-09-11 20:54 first non-reserved is 11 2008-09-11 20:54 I might have conflated that with something else 2008-09-11 20:54 #define EXT2_GOOD_OLD_FIRST_INO 11 2008-09-11 20:54 well it doesn't matter except for geek quizes 2008-09-11 20:54 yah 2008-09-11 20:54 that's it 2008-09-11 20:54 ok, we have 6 minutes for questions 2008-09-11 20:54 going to stop right here, just before doing anything interesting ;-) 2008-09-11 20:55 exactly! :O 2008-09-11 20:55 ouch 2008-09-11 20:55 when's the next meeting? next tuesday at 8pm? 2008-09-11 20:55 this lesson was definitely shorter... 2008-09-11 20:55 well it was fun looking at all that busy looking code that doesn't actually do much, no? 2008-09-11 20:55 it seemed too little time 2008-09-11 20:55 how about tomorrow? :D 2008-09-11 20:55 next tuesday, yes 2008-09-11 20:55 yeah, tomorrow works 2008-09-11 20:55 tuesday, then? 2008-09-11 20:56 homework is: know how the root dir is loaded and initialized, and now that differs from how any other inode is opened 2008-09-11 20:56 and how 2008-09-11 20:56 I meant 2008-09-11 20:56 tomorrow is friday ;-) 2008-09-11 20:57 so what's the 'desired' way to read data off disk in a fs? submit bio-s? 
would that also be the best way to read the superblock (you seem to have suggested that) 2008-09-11 20:57 friday is my most productive day :P 2008-09-11 20:57 not only do I have to relax then, I have to get atom refcounting working 2008-09-11 20:57 maze, I like submit_bio, yes 2008-09-11 20:57 then you have to wait on some lock 2008-09-11 20:57 is that the lowest level interface to the block device layer? 2008-09-11 20:57 two or three lines 2008-09-11 20:57 it is 2008-09-11 20:57 the lowest one you can use without getting shouted at 2008-09-11 20:57 does it support priorities? 2008-09-11 20:57 depends on the elevator 2008-09-11 20:58 mostly linux elevators are pretty crappy 2008-09-11 20:58 no good rt elevator for example 2008-09-11 20:58 if that's what you're asking 2008-09-11 20:58 yeah, something like that 2008-09-11 20:58 feel free to write a noncrappy one 2008-09-11 20:58 you're the man to do it 2008-09-11 20:59 if I'm operating on behalf of a user, and he's running at some prio, or asking for some priority on his read/write/file op, than I'd like to be able to pass that down to the blockdev layer 2008-09-11 20:59 yes, and save us from that broken pos that is the current io scheduler 2008-09-11 20:59 I mean I obviously shouldn't be dealing with that in the fs, except for making sure I submit requests with the right priorities 2008-09-11 20:59 no, not in the fs 2008-09-11 20:59 though one can imagine the fs making suggestions 2008-09-11 21:00 and a realtime fs most certainly has to interact with the io scheduler 2008-09-11 21:00 (submit_bio is used only by xfs, ocfs2, jfs, gfs2 and ext4) 2008-09-11 21:00 the fs also has to answer the question "can I submit this request at all, and meet the constraints" 2008-09-11 21:00 what about networking? how would you go about sending/receiving udp? tcp? raw frames? other protocol? 2008-09-11 21:00 (what do the others use?) 
2008-09-11 21:01 only the fs can know certainly crucial information about those constrainnts 2008-09-11 21:01 networking? 2008-09-11 21:01 sorry, missed the connection 2008-09-11 21:01 you mean realtime? 2008-09-11 21:01 [have to be careful - low prio process fetches a directory, higher priority process than needs to fetch it again - needs to result in increasing the bio priority or resubmitting it or something] 2008-09-11 21:01 razvanm, notice that submit_bio is used in all _modern_ fs's 2008-09-11 21:01 networking connection - I'm imagining a disk and network based multi-node fs 2008-09-11 21:02 I'm imaginative ;-) 2008-09-11 21:02 gfs2 only loosely meeting that definition 2008-09-11 21:02 flips: right :D 2008-09-11 21:02 maze,you're already IO fixing priority inversion? 2008-09-11 21:03 ah 2008-09-11 21:03 right 2008-09-11 21:03 no, just pointing out you have to be careful 2008-09-11 21:03 that kind of networking 2008-09-11 21:03 you do 2008-09-11 21:03 and as a rule we are not 2008-09-11 21:03 far from it 2008-09-11 21:03 tcp/ip is not realtime 2008-09-11 21:03 however 2008-09-11 21:03 so there's a lot of things I'd like to work on if I had the time ;-) 2008-09-11 21:03 you can kinda sorta pretend it is, sometimes 2008-09-11 21:03 networking is real-time if you have caching done correctly ;-) 2008-09-11 21:03 right 2008-09-11 21:04 really? 
2008-09-11 21:04 you will have to convince me of that 2008-09-11 21:04 I think that random backout already makes it not realtime 2008-09-11 21:04 CSMACD 2008-09-11 21:04 or something like that 2008-09-11 21:04 oh, ok, I don't mean RT as in rtlinux rt 2008-09-11 21:04 carrier sense multiple access collision detect 2008-09-11 21:04 I meant usable on a desktop 2008-09-11 21:04 ah 2008-09-11 21:04 I always mean actual rt when somebody says rt 2008-09-11 21:05 flips: if the data is represented identified by some hashes over them then it could be ;-) 2008-09-11 21:05 I meant usable and not get killed by background tasks 2008-09-11 21:05 there linus and I differ 2008-09-11 21:05 (for reading) 2008-09-11 21:05 uhm, I never mentioned rt ;-) 2008-09-11 21:05 razvanm, what could be? 2008-09-11 21:05 maze, ok 2008-09-11 21:05 sorry 2008-09-11 21:05 try again? 2008-09-11 21:05 too many threads :D 2008-09-11 21:05 yup 2008-09-11 21:05 while rt is nice of course, and you should design with making it possible in the future of course 2008-09-11 21:05 the phillips switch is overloading 2008-09-11 21:06 I just wanted fg tasks to be able to run at higher priority than bg tasks (a garbage collector or bg file scan or ...) 
2008-09-11 21:06 ;-) 2008-09-11 21:06 ok, well a single node filesystem has no business knowing anything about networking 2008-09-11 21:06 right 2008-09-11 21:06 yes, you have control over that 2008-09-11 21:06 complete control 2008-09-11 21:06 you are root 2008-09-11 21:06 beyond root 2008-09-11 21:06 we've already determined that a fs has to provide some interfaces to the vfs layer, and it interfaces with the blockdev layer via bio's 2008-09-11 21:07 there's only one limitation to what a filesystem in linux can do: use symbols that are not exported to modules, when it is compiled as a module 2008-09-11 21:07 add in some atomics/locks/primitives already provided by the kernel and mem management, and you have all pieces ;-) 2008-09-11 21:07 yes 2008-09-11 21:07 watch out for layer violations 2008-09-11 21:07 but in general, go crazy 2008-09-11 21:08 so basically, now the question was: how to implement nfs - what would the interface not to blockdev, but to network, be? 2008-09-11 21:08 there is not much to do 2008-09-11 21:08 nfs basically runs on top of a filesystem that doesn't even have to know its there 2008-09-11 21:08 there are a few small, weird hooks 2008-09-11 21:08 uhm? 2008-09-11 21:08 the details of which I forget 2008-09-11 21:08 nfs stacks on top of a host fs 2008-09-11 21:09 the host fs doesn't have to know it's being stacked on 2008-09-11 21:09 it just have to behave itself 2008-09-11 21:09 like a unix fs 2008-09-11 21:09 what do you mean by host fs? oh you mean for the nfs server? 
2008-09-11 21:09 that's actually pretty hard ;p) 2008-09-11 21:09 I was thinking about the nfs client 2008-09-11 21:09 right 2008-09-11 21:09 ah, nfs client 2008-09-11 21:09 strange exception to pretty much everything 2008-09-11 21:09 it stacks on top of a remote host fs 2008-09-11 21:10 with all the oddities that implies 2008-09-11 21:10 including mid-flight reboots 2008-09-11 21:10 indeed 2008-09-11 21:10 there are papers written about how much this sucks 2008-09-11 21:10 let me see 2008-09-11 21:10 http://www.cc.gatech.edu/classes/AY2007/cs4210_fall/papers/nfsOLS.pdf 2008-09-11 21:10 the reboot? yeah, that's terrible, but it can be done in a way that it would work 2008-09-11 21:10 marginally 2008-09-11 21:10 you'd detect remote server reboot and have to dump caches, etc... 2008-09-11 21:11 I've been living/breathing that for the last 3 years 2008-09-11 21:11 I know ;-) 2008-09-11 21:11 yes, but we don't 2008-09-11 21:11 it's pathetic 2008-09-11 21:11 nobody pays attention to statd 2008-09-11 21:11 except lockd 2008-09-11 21:11 no excuse 2008-09-11 21:11 oh, I'm not thinking about NFS, I hate NFS, I'm thinking about a networkfs 2008-09-11 21:11 sun braindamage 2008-09-11 21:11 and linux too, because we should have fixed it by now 2008-09-11 21:12 oh a real networkfs 2008-09-11 21:12 just trying to figure out what the layering is there vfs / networkfs (missing this interface layer) networking 2008-09-11 21:12 well, lustre is getting close 2008-09-11 21:12 oscfs2 also 2008-09-11 21:12 I'm sure you will crack that one 2008-09-11 21:12 will be fun to watch your progress 2008-09-11 21:12 in the meantime, goals with tux3 are modest 2008-09-11 21:12 I need more than 24 hours in a day 2008-09-11 21:13 that is: support nfs no worse than any other filesystem 2008-09-11 21:13 hopefully much better 2008-09-11 21:13 hehe 2008-09-11 21:13 ebiederm, thanks for visiting 2008-09-11 21:14 I hope we did not disappoint ;-) 2008-09-11 21:14 an OT question: why hg and not git? 
2008-09-11 21:14 ok, it is back to the question of atom refcounting 2008-09-11 21:14 you been following the thread, maze? 2008-09-11 21:15 sorry, which thread? 2008-09-11 21:15 razvanm, hg is a lot more usable than git 2008-09-11 21:15 about mercurial? 2008-09-11 21:15 instand on 2008-09-11 21:15 maze, no, about xattr atoms 2008-09-11 21:15 ah, no. 2008-09-11 21:15 on the tux3 list 2008-09-11 21:15 should I? 2008-09-11 21:15 please 2008-09-11 21:15 you subscribed? 2008-09-11 21:16 glancing ;-) 2008-09-11 21:16 I think I subscribed you 2008-09-11 21:16 more xattr design details? 2008-09-11 21:16 right, and associated posts 2008-09-11 21:16 the parent of that is the root of that tree 2008-09-11 21:17 uhm, gmail doesn't do trees ;-) 2008-09-11 21:17 they should fix that 2008-09-11 21:17 :p 2008-09-11 21:17 it's only beta 2008-09-11 21:17 right, it's also slow... 2008-09-11 21:17 let me see 2008-09-11 21:17 I know, I run exim4 here and it's beyond fast 2008-09-11 21:17 it's scary 2008-09-11 21:18 so I'm a big fan of atoms, because the space saving can be extreme 2008-09-11 21:18 [Tux3] The long and short of extended attributes 2008-09-11 21:18 ah, I like the sound of that 2008-09-11 21:18 you probably want to support even more atoms for selinux... but then the code gets complex 2008-09-11 21:18 I've been doing a lot of introspecting about it 2008-09-11 21:19 so you have the easy solution - use no atoms 2008-09-11 21:19 always on the verge of mass deleting that code 2008-09-11 21:19 I know, but I also feel its lame 2008-09-11 21:19 and just store rep { string=string } 2008-09-11 21:19 no null's thanks ;-) 2008-09-11 21:19 ext3 is 8 bit clean 2008-09-11 21:19 but otherwise yes 2008-09-11 21:19 (mind you I'd actually store that in reversed order, at the front of the file, going backwards towards negative offsets) 2008-09-11 21:20 reccount, namecount, , 2008-09-11 21:20 have it stored the same way as the rest of the file data 2008-09-11 21:20 ? 
2008-09-11 21:20 xattr1=value1 xattr2=value2 filecontent="hello" ==> 2008-09-11 21:21 sorry, I meant tux3 is 8 bit clean 2008-09-11 21:21 where are the negative offsets? 2008-09-11 21:21 oh I see 2008-09-11 21:21 2eulav=2rttax 1eulav=1rttax hello 2008-09-11 21:21 | offset 0 at [H] in hello 2008-09-11 21:21 demented ;-) 2008-09-11 21:21 interesting idea 2008-09-11 21:21 it means you don't have to implement it though ;-) 2008-09-11 21:22 well the page cache doesn't have negative offsets 2008-09-11 21:22 you'd have to store at the top of the index range 2008-09-11 21:22 that's a good idea 2008-09-11 21:22 it should work out fine 2008-09-11 21:22 means you can't quite have a 16 TB file on 32 bit linux though 2008-09-11 21:23 16 TB less the maximum size of attributes 2008-09-11 21:23 no, you shave it down by however many xattrs you have 2008-09-11 21:23 so maybe a few kilobytes - in the future maybe more... who knows 2008-09-11 21:23 ok, that's twisted enough for me 2008-09-11 21:23 in what sense is it twisted? 2008-09-11 21:23 works perfectly on 64 bit linux... probably find a couple of radix tree bugs 2008-09-11 21:24 eeking out a small simplification by using the other end of the address range 2008-09-11 21:24 twisted 2008-09-11 21:24 I like it 2008-09-11 21:24 right you have to be signedness clean, or you can offset everything by a zero offset constant 2008-09-11 21:24 right 2008-09-11 21:24 like I way 2008-09-11 21:24 probably turn up a couple core linux bugs there 2008-09-11 21:24 but worth doing just for that reason 2008-09-11 21:24 or you can even just store it like this 0:hello empty space for expansion reverse xattrs :-1 2008-09-11 21:25 since you have to support holes anyway... 
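The "hello at 0, reversed xattrs at -1" layout can be sketched in userspace with a small byte array standing in for the file's whole index range. Everything here is an assumption made for illustration: the 64 byte RANGE, the function names, and the flat "name=value" text encoding borrowed from the example above. Data grows up from offset 0, and byte i of the xattr text lands at RANGE - 1 - i, which plays the role of offset -(i + 1). The sketch does not check that the two regions stay apart.

```c
#include <string.h>

#define RANGE 64	/* toy size of the file's whole address range */

/* Store file data upward from offset 0. Returns 0 on success. */
int put_data(unsigned char *range, const char *data)
{
	size_t len = strlen(data);

	if (len > RANGE)
		return -1;
	memcpy(range, data, len);
	return 0;
}

/* Store xattr text downward from the top of the range. */
int put_xattrs(unsigned char *range, const char *xattrs)
{
	size_t i, len = strlen(xattrs);

	if (len > RANGE)
		return -1;
	for (i = 0; i < len; i++)
		range[RANGE - 1 - i] = (unsigned char)xattrs[i];
	return 0;
}

/* Read len xattr bytes back into out, which must hold len + 1 bytes. */
void get_xattrs(const unsigned char *range, size_t len, char *out)
{
	size_t i;

	for (i = 0; i < len; i++)
		out[i] = (char)range[RANGE - 1 - i];
	out[len] = '\0';
}
```

The gap in the middle is the "empty space for expansion"; in a real file it would simply be a hole.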
2008-09-11 21:25 sure 2008-09-11 21:25 it allows us to treat xattrs more like file data in kernel 2008-09-11 21:25 that's a tux3 meme 2008-09-11 21:25 exactly 2008-09-11 21:25 so I like it 2008-09-11 21:26 it means xattr support in the fs on-disk image is basically free 2008-09-11 21:26 for now we have the "xcache" 2008-09-11 21:26 which is even faster to access than a page cache mapping page 2008-09-11 21:26 well 2008-09-11 21:26 hmm 2008-09-11 21:26 is it? 2008-09-11 21:26 somewhat 2008-09-11 21:26 I think it's mostly free 2008-09-11 21:26 gets close 2008-09-11 21:26 I was going to have separate btree for big xattrs 2008-09-11 21:27 and small ones go inthe inode, just like immediate file data 2008-09-11 21:27 (still imagining a world with just one btree) 2008-09-11 21:27 but mapping intermediate sized attributes into the top of the file address space is a possibility 2008-09-11 21:27 theoretically you can put almost all file metadata at the -1 point 2008-09-11 21:27 not only xattrs 2008-09-11 21:27 thejust one btree idea has already been done, it's called hammer 2008-09-11 21:27 not sure how that would work for performance 2008-09-11 21:28 but you'd get versioning for free 2008-09-11 21:28 I think that two level btree is significantly more cache efficient 2008-09-11 21:28 I've played with mapping file metadata into the file address space before 2008-09-11 21:28 perhaps. 2008-09-11 21:28 without joy 2008-09-11 21:28 spent a lot of mental energy on it, found no real wins 2008-09-11 21:28 where are the problems? 2008-09-11 21:29 finding a reason to do it 2008-09-11 21:29 an example that runs faster 2008-09-11 21:29 yeah, it's probably worth optimizing the hell out of inode stat time 2008-09-11 21:29 stat time? 2008-09-11 21:29 ah 2008-09-11 21:30 yes 2008-09-11 21:30 how fast you can stat a bunch of inodes 2008-09-11 21:30 tux3 is going to work very well there 2008-09-11 21:30 basically just run down the inode table 2008-09-11 21:30 wait a minute, it's a table? 
not a btree? 2008-09-11 21:30 and the inode table will be intitionally laid out in a clumpy way 2008-09-11 21:30 it's a btree 2008-09-11 21:30 oh, ok. 2008-09-11 21:30 call it a table for historical reasons 2008-09-11 21:31 variable size inodes 2008-09-11 21:31 a tux3 exclusive, maybe 2008-09-11 21:31 really defines the design and implementation 2008-09-11 21:31  2) Refcount all atoms and delete any that fall to zero <- my vote 2008-09-11 21:31 mine too 2008-09-11 21:31 just challenging to do as fast as the crude approach 2008-09-11 21:31 possibly delaying cleanup till unmount, not sure if that would ease up anything though 2008-09-11 21:32 tux3 has the concept of log rollup 2008-09-11 21:32 I'll be posting about that in much more detail over the next week or so 2008-09-11 21:32 it's continuous cleanup 2008-09-11 21:32 doesn't have to be a flurry of cleanup wither at umount or mount 2008-09-11 21:32 or remount after crash even 2008-09-11 21:33 you can actually put it in the btree ;-) 2008-09-11 21:33 why? 2008-09-11 21:33 you want search through it to be efficient - both ways 2008-09-11 21:33 oh right 2008-09-11 21:33 both atom -> string conversion and string -> atom conversion 2008-09-11 21:33 interesting idea 2008-09-11 21:33 oh 2008-09-11 21:33 I thought you meant the log 2008-09-11 21:33 have some reserved btree prefix 2008-09-11 21:34 of course the atom table will be a btree 2008-09-11 21:34 it will be an HTree in facrt 2008-09-11 21:34 fact 2008-09-11 21:34 the log? 
yeah though about how the log could be in the btree 2008-09-11 21:34 even had some half-baked concept, but didn't think about it long enough to really know if that's worth even thinking about 2008-09-11 21:34 turns out that the deficiencies of HTree that make it tough to implement readdir accurately don't apply at all to the xattr atom use case 2008-09-11 21:35 and htree is just about optimal for that 2008-09-11 21:35 atom->string is just an array, since there's no holes 2008-09-11 21:35 as far as reverse conversion goes... 2008-09-11 21:35 there are two ideas I'm considering 2008-09-11 21:35 one is to use the address of the dirent as the atom number 2008-09-11 21:36 this decreases thedensity of the atom space somewaht 2008-09-11 21:36 huh, how does that work? 2008-09-11 21:36 by a factor of 4 to be precise 2008-09-11 21:36 oh, right, I think I see 2008-09-11 21:36 just look up the dirent and return the offset fromthe beginning of the file as the atom number 2008-09-11 21:36 have the atoms themselves be pointers 2008-09-11 21:36 cute 2008-09-11 21:36 ACTION has to put a different keyboard onthis machine with a better space bar 2008-09-11 21:36 right 2008-09-11 21:37 the other option is to have a reverse lookup table, that points back at the dirents 2008-09-11 21:37 potentially div 4 or something to make em more likely to fit in a byte 2008-09-11 21:37 I favor the second 2008-09-11 21:37 because I like the atoms to be as dense as possible 2008-09-11 21:37 for compression reasons 2008-09-11 21:37 I already took the div4 into account ;-) 2008-09-11 21:37 I'm still not convinced compression of this part of the fs really matters... 
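[Editor's sketch] A minimal in-memory model of the refcounted atom table being discussed: an array for atom -> string, a lookup structure for string -> atom, and deletion when a count falls to zero. The class and method names are hypothetical, the dict merely stands in for the HTree mentioned above, and reusing freed atom numbers to keep the atom space dense is an assumption, not something the chat settles.

```python
class AtomTable:
    def __init__(self):
        self.names = []     # atom -> name (None marks a freed slot)
        self.refs = []      # atom -> reference count
        self.by_name = {}   # name -> atom (stand-in for the HTree lookup)
        self.free = []      # freed atom numbers available for reuse

    def get(self, name):
        """Take one reference on name, allocating an atom if needed."""
        atom = self.by_name.get(name)
        if atom is None:
            if self.free:
                atom = self.free.pop()      # reuse keeps atom numbers small
                self.names[atom] = name
            else:
                atom = len(self.names)
                self.names.append(name)
                self.refs.append(0)
            self.by_name[name] = atom
        self.refs[atom] += 1
        return atom

    def put(self, atom):
        """Drop one reference; delete the atom when the count hits zero."""
        self.refs[atom] -= 1
        if self.refs[atom] == 0:
            del self.by_name[self.names[atom]]
            self.names[atom] = None
            self.free.append(atom)

    def name(self, atom):
        """atom -> string is a plain array index."""
        return self.names[atom]


table = AtomTable()
selinux = table.get("security.selinux")
table.get("security.selinux")        # a second file using the same name
scratch = table.get("user.hash")
table.put(scratch)                   # last user gone, atom is deleted
```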
2008-09-11 21:38 sure it does 2008-09-11 21:38 atom number field is current 16 bits 2008-09-11 21:38 64K atoms 2008-09-11 21:38 before having to go to a 32 bit atom number 2008-09-11 21:38 that's comfortable 2008-09-11 21:38 -!- stargazr5(~gauravstt@59.95.38.255) has joined #tux3 2008-09-11 21:38 14 bits not so much 2008-09-11 21:38 still 2008-09-11 21:39 could go either way on that 2008-09-11 21:39 Terrible hack: 2008-09-11 21:39 $ getfattr -n user.hash -e text -h --absolute-names -L xhash 2008-09-11 21:39 # file: xhash 2008-09-11 21:39 user.hash="1114234:1191:1219215805:e233bf8dd0415ec9b7fea0193803357c:6325f0060bd5f23cf6ba106fd6500efa76d9bc5e" 2008-09-11 21:39 Storing mtime/md5sum/sha1sum in a xattr for fs recovery ;-) 2008-09-11 21:39 got to decide by midnight ;-) 2008-09-11 21:39 ? 2008-09-11 21:40 so I store the mtime:md5sum:sha1sum of each file on my drive in a xattr for that file 2008-09-11 21:40 I get constant time md5sum calculation on files 2008-09-11 21:40 ease of verifying file integrity 2008-09-11 21:40 cool 2008-09-11 21:40 and I can verify integrity of files in case of fs crash (ie. like when I upgraded to 2.6.27-rc3) 2008-09-11 21:41 I think I like it more than zfs "checksum everything" mentality 2008-09-11 21:41 makes sense to only checksum logically 2008-09-11 21:41 and yes it does need to be regenerated on file modifications, so the newest files lack it 2008-09-11 21:41 sha1 is ok only if you want crytographic verifiability, otherwise it's slower than necessary 2008-09-11 21:42 compare that with my laptop 20mb/s read speed... 2008-09-11 21:42 and it doesn't matter 2008-09-11 21:42 it matters if you're running a server 2008-09-11 21:42 a lot 2008-09-11 21:42 true 2008-09-11 21:42 option? 
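[Editor's sketch] The mtime:md5sum:sha1sum record described above could be built and checked roughly as follows. The function names are made up for illustration; the example xattr in the chat carries two extra leading fields whose meaning is not stated, so they are left out here, and actually storing the record in a `user.hash` xattr is also omitted.

```python
import hashlib
import os


def make_record(path):
    """Build an 'mtime:md5:sha1' record for the file at path."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            md5.update(chunk)
            sha1.update(chunk)
    mtime = int(os.stat(path).st_mtime)
    return f"{mtime}:{md5.hexdigest()}:{sha1.hexdigest()}"


def check(path, record):
    """Return 'stale', 'ok' or 'corrupt' for a previously made record."""
    recorded_mtime = int(record.split(":", 1)[0])
    if recorded_mtime != int(os.stat(path).st_mtime):
        return "stale"      # modified since the record was made: regenerate
    return "ok" if make_record(path) == record else "corrupt"
```

A record that is "stale" only says the file changed legitimately after hashing, which is why the newest files lack a usable record; "corrupt" means the contents changed while the mtime did not.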
2008-09-11 21:42 right 2008-09-11 21:42 probably include something like crc64 or whatever cheap 64-bit hash you can find 2008-09-11 21:43 (no idea what a fast good 64-bit hash is nowadays) 2008-09-11 21:43 crc is bad 2008-09-11 21:43 funnels to hell 2008-09-11 21:43 dx_hack_hash is getting closer 2008-09-11 21:43 uses a hacked lfsr idea 2008-09-11 21:43 needs analysis 2008-09-11 21:44 maze, you'd be good at that 2008-09-11 21:44 I think 2008-09-11 21:44 we appear to have scared of everybody else... noone is asking any other questions ;-) 2008-09-11 21:44 yeah 2008-09-11 21:44 analysis of speed? or of hash spread? 2008-09-11 21:44 and they're the ones who actually check in code ;-) 2008-09-11 21:45 got to be careful about that 2008-09-11 21:45 hash spread 2008-09-11 21:45 etc 2008-09-11 21:45 I'm in the middle of a cluster turn up... 2008-09-11 21:45 speed is about optimal 2008-09-11 21:45 I made sure of that 2008-09-11 21:45 well 2008-09-11 21:45 truth be told I could make it much faster 2008-09-11 21:45 hopefully I can at least provide 'inspiration' or something 2008-09-11 21:45 it's meant for hashing short strings with good spread 2008-09-11 21:46 short, very nonrandom strings 2008-09-11 21:46 does a good job of that 2008-09-11 21:46 re: atom refcounting 2008-09-11 21:46 you don't have to sync it to disk really if you are hacky/smart about it 2008-09-11 21:47 I'm going to post the results of my design thinking from the skate earlier 2008-09-11 21:47 really? 
2008-09-11 21:47 since you can put it in the log 2008-09-11 21:47 sounds like magic 2008-09-11 21:47 of course 2008-09-11 21:47 planned 2008-09-11 21:47 or I wouldn't have gone this route at all 2008-09-11 21:47 and if the order is right, then it can never get out of sync 2008-09-11 21:47 again, of course 2008-09-11 21:47 and the entire thing should be small enough you can periodically just write out a new copy 2008-09-11 21:47 I've been computing the exact percentages of log bandwdith that will be required ;-) 2008-09-11 21:47 of the entire thing 2008-09-11 21:48 again, of course 2008-09-11 21:48 but we don't 2008-09-11 21:48 we even do that incrementally 2008-09-11 21:48 and: you can afford to lose decrements, since at most the ref counts will be too high 2008-09-11 21:48 and arrange the structure that have to be updated to be close together 2008-09-11 21:48 and compact 2008-09-11 21:48 ah 2008-09-11 21:48 which is kind of dirty... 2008-09-11 21:48 really? 2008-09-11 21:48 way dirty 2008-09-11 21:48 but there is likely something there 2008-09-11 21:49 you can't lose track long term 2008-09-11 21:49 that would be bad 2008-09-11 21:49 -!- tim_dimm(~timothyhu@adsl-67-114-40-138.dsl.scrm01.pacbell.net) has joined #tux3 2008-09-11 21:49 but you can do a false-positive-to-be-tested-later kind of thing 2008-09-11 21:49 I'd guess most fs'es will have 2 dozen or less atoms 2008-09-11 21:49 he tim_dimm 2008-09-11 21:49 welcome back, daddy! 2008-09-11 21:49 hey 2008-09-11 21:49 wassap? 2008-09-11 21:49 well 2008-09-11 21:50 they're doing good 2008-09-11 21:50 we just did episode 2 of tux3 university 2008-09-11 21:50 how'd I do, maze? 2008-09-11 21:50 ah, missed it! 2008-09-11 21:50 keeping your interesting, hit the right level? 2008-09-11 21:50 ok, though I think the first was more action packed 2008-09-11 21:50 enough swear words? too many? 
2008-09-11 21:50 hehe 2008-09-11 21:50 well I can easily pick up the pace 2008-09-11 21:50 it's just that, where we were is where the tux3 kernel port willa actully start 2008-09-11 21:51 I wish there was a: these are your primitives, this is how they function, know this and C and data structures and you don't need to know anything else linux specific 2008-09-11 21:51 can shapor put together one of those word clouds for tux3 university? 2008-09-11 21:51 in an ideal world 2008-09-11 21:51 word cloud? 2008-09-11 21:51 ah 2008-09-11 21:51 scrape the logs 2008-09-11 21:51 make a book ;-) 2008-09-11 21:51 you know, a clear definition of interfaces ;-) 2008-09-11 21:51 you know, the more common words are bigger 2008-09-11 21:51 full of swear words 2008-09-11 21:52 and embarrassing stories about certain well known kernel hackers 2008-09-11 21:52 who joined the U? 2008-09-11 21:52 maze, good louck with that 2008-09-11 21:52 bunch of people up there in the channel list 2008-09-11 21:52 good folks 2008-09-11 21:52 I'm seeing a lot of new names on the irc 2008-09-11 21:52 missed natalie tonight 2008-09-11 21:52 yes, it's bulking up 2008-09-11 21:52 so is the subscribe list 2008-09-11 21:52 and so are the checkins 2008-09-11 21:53 can ask for more 2008-09-11 21:53 what's that up to? 2008-09-11 21:53 you may want to have something selinux specific to optimize that straight in the metadata 2008-09-11 21:53 same for acl's 2008-09-11 21:53 tux3 subscribers are over 100 2008-09-11 21:53 a dictionary with a few hundred entries would optimize down all selinux entries on my machine down to 3 bytes 2008-09-11 21:53 MaZe: you refering to the xattrs? 
2008-09-11 21:53 maze, I'm waiting for the selinux people to smell the coffee and come tell use what we're doing right/wrong 2008-09-11 21:53 maze, I have reason to believe that will happen soon ;-) 2008-09-11 21:54 problem is: dictionary needs to be dynamic 2008-09-11 21:54 so basically atoms x 2 2008-09-11 21:54 maze, just what I'm thinking 2008-09-11 21:54 read the posts 2008-09-11 21:54 but again, that's kind of vital to be fast 2008-09-11 21:54 you'll see I addressed that specifically 2008-09-11 21:54 you know where I work, you know how much email I get 2008-09-11 21:54 ;-) 2008-09-11 21:54 the last thing I want to do when I'm 'done' with work is read more email 2008-09-11 21:54 the more details post talks about it I think 2008-09-11 21:55 well read it on the job 2008-09-11 21:55 still parsing (in 2nd window) 2008-09-11 21:55 everybody knows sre's have that kind of time ;-) 2008-09-11 21:55 yeah... 2008-09-11 21:55 when there are no data centers burning down 2008-09-11 21:56 it's job-related 2008-09-11 21:56 so xattr's don't need to be fast - (overly fast) - for anything - except for the parts that are actually used within the kernel 2008-09-11 21:56 got to keep the finger on the pulse of sel development 2008-09-11 21:56 ie. 
acl's and selinux 2008-09-11 21:56 yup 2008-09-11 21:56 got that in my post too 2008-09-11 21:56 selinux is on every file, acl only on special files 2008-09-11 21:56 haven't gotten there obviously yet 2008-09-11 21:57 the xattr interface kind of sucks for efficiency 2008-09-11 21:57 choke point 2008-09-11 21:58 another solution, is to support atoms for the important stuff, and leave the rest as strings 2008-09-11 21:58 that way you compress selinux/acl but leave the user stuff uncompressed 2008-09-11 21:59 don't have to deal with denial of service against atom space attacks 2008-09-11 21:59 kind of best of both worlds 2008-09-11 21:59 possibly allow a superblock list of optimized entries, and a utility (mount time option), to include a new atom 2008-09-11 22:00 that might be both simple (very) and efficient and trivial to implement 2008-09-11 22:00 flips: see the latest list post? 2008-09-11 22:01 -!- ebiederm(~eric@c-24-130-11-59.hsd1.ca.comcast.net) has left #tux3 2008-09-11 22:01 you still have to deal with negative lookups correctly, but that's an easy optimization 2008-09-11 22:01 maze, also an option, yes 2008-09-11 22:01 important = atom 2008-09-11 22:02 exactly 2008-09-11 22:02 negative lookups? 2008-09-11 22:02 and what is actually an atom is specified by the admin during (previous and current) mount 2008-09-11 22:02 konrad,not yet 2008-09-11 22:02 if you have fs with 'selinux' not being an atom 2008-09-11 22:02 and have that written out to disk as a string 2008-09-11 22:03 and then you remount with selinux as atom (so it promotes) 2008-09-11 22:03 then if you lookup selinux atom on a file with it from before the remount you won't find it, unless you search the string entries as well 2008-09-11 22:03 in which case lack of the field, must mean search for the string instead and promote if needed to atom 2008-09-11 22:04 but the atom table is part of the fs, so how does remoutn come into it? 
2008-09-11 22:04 the atom table can't be shrunk 2008-09-11 22:04 but can have new entries added via mount options 2008-09-11 22:04 I sort of get it 2008-09-11 22:04 ie. mount -o atomize=selinux -t tux3 /dev/hda3 / 2008-09-11 22:04 would be worth a list post maybe 2008-09-11 22:05 and than at the beginning you atomize (awesome term) the entries you know will be common 2008-09-11 22:05 so the security.selinux 2008-09-11 22:05 anyway, I think the truth is, the refcounting is going to be so efficient that nobody will care about the slight overhead and will love the warm fuzzy feeling of compression 2008-09-11 22:05 the subpieces of security selinux (since it's split in 3 parts) 2008-09-11 22:05 and being able to use long xattr names without penalty 2008-09-11 22:05 refcounting does have issues with quota 2008-09-11 22:06 unless you count xattrs against user quota 2008-09-11 22:06 which you probably should... 2008-09-11 22:06 yes, the refcounting is primarily to address quota 2008-09-11 22:06 my solution has the benefit you don't need refcounts 2008-09-11 22:06 you still get optimal performance for anything that matters 2008-09-11 22:07 - what matters being selected by the admin (and you can compile in a list of default atoms into tux3, being what we grab from selinux in fedora or whatever) 2008-09-11 22:07 do acl's store numeric ids or text strings? 2008-09-11 22:07 (ids = uids/gids) 2008-09-11 22:07 I'd hope numeric... 2008-09-11 22:09 anyway, that way you should be able to store all selinux data straight along with the mtime/ctime in the inode, using up a 32bit int or something like that 2008-09-11 22:12 btw, you're wrong on the ACLs being the most important use of xattr's - selinux is _BY FAR_ 2008-09-11 22:14 maze, could you work it up into a post? 
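[Editor's sketch] The "negative lookup" case described earlier (an xattr written as a plain string before its name was atomized at a later mount) amounts to a two-step lookup with promotion. The data shapes below are assumptions chosen for brevity: a dict per inode keyed either by atom number or by raw name, and a dict mapping atomized names to atom numbers.

```python
def lookup(inode_xattrs, atoms, name):
    """Find an xattr by name, falling back to the pre-atomization form.

    inode_xattrs maps atom numbers (int) or raw names (str) to values.
    atoms maps atomized names to their atom numbers.
    """
    atom = atoms.get(name)              # None if the name is not atomized
    if atom is not None and atom in inode_xattrs:
        return inode_xattrs[atom]
    if name in inode_xattrs:            # stored as a string before promotion
        value = inode_xattrs[name]
        if atom is not None:            # promote: rewrite under the atom key
            del inode_xattrs[name]
            inode_xattrs[atom] = value
        return value
    return None


atoms = {}
xattrs = {"security.selinux": b"system_u:object_r:etc_t:s0"}
before = lookup(xattrs, atoms, "security.selinux")   # plain string hit

atoms["security.selinux"] = 0                        # e.g. after remounting
after = lookup(xattrs, atoms, "security.selinux")    # string hit, then promoted
```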
2008-09-11 22:14 btw 2008-09-11 22:14 if you look at man getfacl, you'll see: 2008-09-11 22:14 1: # file: somedir/ 2008-09-11 22:14 2: # owner: lisa 2008-09-11 22:14 3: # group: staff 2008-09-11 22:14 4: user::rwx 2008-09-11 22:14 5: user:joe:rwx #effective:r-x 2008-09-11 22:14 6: group::rwx #effective:r-x 2008-09-11 22:14 7: group:cool:r-x 2008-09-11 22:14 8: mask:r-x 2008-09-11 22:14 9: other:r-x 2008-09-11 22:14 10: default:user::rwx 2008-09-11 22:14 11: default:user:joe:rwx #effective:r-x 2008-09-11 22:14 I hope acl's are binary but I don't know yet 2008-09-11 22:14 12: default:group::r-x 2008-09-11 22:14 13: default:mask:r-x 2008-09-11 22:14 14: default:other:--- 2008-09-11 22:15 from which you'll notice that for permissions you want to fit the standard ugo+-rwx in the inode 2008-09-11 22:15 but also 2008-09-11 22:15 some of the other stuff which isn't per used 2008-09-11 22:15 s/used/user/ 2008-09-11 22:15 I can't tell if the other latest list post is spam or not 2008-09-11 22:15 so what is going to be cool is also doing atoms on the acl bodies 2008-09-11 22:15 which one? 2008-09-11 22:16 the mask on line 8 in particular 2008-09-11 22:16 "sir I want join your proyect" 2008-09-11 22:16 not spam 2008-09-11 22:16 definitely 2008-09-11 22:16 they do some clever spam nowadays 2008-09-11 22:16 directories also need the default ACLs off of lines 10-14 2008-09-11 22:16 konrad, you meant the latest from tero? 
2008-09-11 22:16 which I think don't need special handling 2008-09-11 22:17 flips: nanden yen 2008-09-11 22:17 Not Tero 2008-09-11 22:17 yes, the best kind 2008-09-11 22:17 the problem with acl's though is that of course the amount of space they can take up is unbounded 2008-09-11 22:17 I'll respond 2008-09-11 22:17 maze, sure 2008-09-11 22:17 so only atomize the short ones 2008-09-11 22:17 although you can get it down to like 40 bits per entry 2008-09-11 22:18 so acl's don't really need atomization per say 2008-09-11 22:18 selinux needs atomization 2008-09-11 22:18 ah 2008-09-11 22:18 because selinux stores arbitrary strings 2008-09-11 22:18 I didn't realize the distinction 2008-09-11 22:19 it has to do compartments and stuff 2008-09-11 22:19 acl's basically store the above listed fields (user, user:uid, group, group:gid, mask, other) with 3 bits (rwx) and possibly the actual uid/gid (32 bits) 2008-09-11 22:19 that's not an acl as I understand it 2008-09-11 22:20 that is DAC 2008-09-11 22:20 course getfacl is definitive ;-) 2008-09-11 22:20 $ getfattr -d -m . 
-e text -h --absolute-names -L /etc 2008-09-11 22:20 # file: /etc 2008-09-11 22:20 security.selinux="system_u:object_r:etc_t:s0\000" 2008-09-11 22:20 so there you have an example selinux xattr 2008-09-11 22:20 the system_u object_r and etc_t are going to appear on tens of thousands of files 2008-09-11 22:20 -!- tim_dimm_(~timothyhu@adsl-67-114-40-138.dsl.scrm01.pacbell.net) has joined #tux3 2008-09-11 22:20 and come from a set of like maybe 200 entries on my system 2008-09-11 22:21 it's going to be very satisfying to compress "security.selinux" to 2 bytes 2008-09-11 22:21 so you need to have security.selinux xattr stored as a 32 bit field in the inode 2008-09-11 22:21 storing 10 bits per each of the 3 fields (u/r/t) 2008-09-11 22:21 and having a dictionary for each of those fields 2008-09-11 22:22 [actually 4 bytes for the entire thing most likely] 2008-09-11 22:22 we could build selinux fields directly into tux3 inodes if they're decent 2008-09-11 22:22 all tux3 attributes are optional 2008-09-11 22:22 exactly - you really want to do that 2008-09-11 22:22 so there's no real cost 2008-09-11 22:22 I guess we better do that 2008-09-11 22:22 could you write a post asking for that? 2008-09-11 22:22 and explaining what they are? 
:-) 2008-09-11 22:22 heh 2008-09-11 22:22 I'd need to run some stats gathering on my local system 2008-09-11 22:23 sure 2008-09-11 22:23 that's a yes I take it 2008-09-11 22:23 ACTION considers the ramyun deficiency problem 2008-09-11 22:24 ok, generating xattr dump from my machine 2008-09-11 22:24 I'll write something up 2008-09-11 22:24 about selinux/acl/other xattrs 2008-09-11 22:24 and what's important 2008-09-11 22:24 kay, I'll get some ranyun into me then prototype the refcounting 2008-09-11 22:24 and include something about my md5/sha1 idea above into it 2008-09-11 22:24 yah 2008-09-11 22:24 to point out why you want it to be user extensible 2008-09-11 22:25 nice 2008-09-11 22:25 for selinux you can even refuse to accept stuff from outside the dictionary 2008-09-11 22:26 although to be fair that's only settable by root, so might as well just transparently extend the dicts 2008-09-11 22:29 we can have some fun 2008-09-11 22:29 letting security folks play with stuff 2008-09-11 22:29 btw, here's a file with both selinux attrs, and extended acls (extra read rights to local user) 2008-09-11 22:29 $ getfattr -d -m . -e text -h --absolute-names -L junk 2008-09-11 22:29 # file: junk 2008-09-11 22:29 security.selinux="unconfined_u:object_r:default_t:s0\000" 2008-09-11 22:29 system.posix_acl_access="\002\000\000\000\001\000\007\000\377\377\377\377\002\000\004\000d\000\000\000\004\000\005\000\377\377\377\377\020\000\005\000\377\377\377\377 \000\005\000\377\377\377\377" 2008-09-11 22:29 Would you have guessed? 2008-09-11 22:30 everybody needs to have fun 2008-09-11 22:30 ah 2008-09-11 22:30 looks binary 2008-09-11 22:30 notice how 'local' (a user name) doesn't show up 2008-09-11 22:30 very 2008-09-11 22:30 instead the uid (100) does 2008-09-11 22:31 where is it? 2008-09-11 22:31 100 = 0144 = 'd' 2008-09-11 22:32 (there's extra stuff there that always gets set on any file with extended acls... basically cruft...) 
2008-09-11 22:32 there sure are a lot of 377s 2008-09-11 22:32 it looks like it uses 32-bit ints 2008-09-11 22:33 probably for the u/gids 2008-09-11 22:33 wow, that d is really well hidden 2008-09-11 22:33 hehe 2008-09-11 22:33 bad choice of username 2008-09-11 22:33 crappy dump 2008-09-11 22:33 see hexdump.c 2008-09-11 22:34 yeah ;-), well I asked for text 2008-09-11 22:34 $ getfattr -d -m . -e hex -h --absolute-names -L junk 2008-09-11 22:34 # file: junk 2008-09-11 22:34 security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000 2008-09-11 22:34 system.posix_acl_access=0x0200000001000700ffffffff020004006400000004000500ffffffff10000500ffffffff20000500ffffffff 2008-09-11 22:34 still not readable - but better 2008-09-11 22:34 just installed the acl package 2008-09-11 22:34 I will try to clue up a bit 2008-09-11 22:34 the fs has to be mounted with acl support 2008-09-11 22:35 basically all you care about are getfattr/setfattr for xattr mods 2008-09-11 22:35 tux3 will always mount it xattr support, anyway 2008-09-11 22:35 and getfacl/setfacl for acl stuff 2008-09-11 22:35 I wonder that acl support is 2008-09-11 22:35 extra module? 
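[Editor's sketch] The hex dump of `system.posix_acl_access` above decodes cleanly if it is read as a little-endian 32-bit version followed by 8-byte entries (16-bit tag, 16-bit permission bits, 32-bit id). That reading is the editor's understanding of the Linux on-disk ACL xattr layout and it matches the dump; the function and table names are made up for illustration.

```python
import struct

# Tag values as used in the dump above.
TAGS = {0x01: "user_obj", 0x02: "user", 0x04: "group_obj",
        0x08: "group", 0x10: "mask", 0x20: "other"}


def decode_acl(blob):
    """Decode a posix_acl_access blob into (version, [(tag, rwx, id), ...])."""
    (version,) = struct.unpack_from("<I", blob, 0)
    entries = []
    for offset in range(4, len(blob), 8):
        tag, perm, ident = struct.unpack_from("<HHI", blob, offset)
        rwx = "".join(ch if perm & bit else "-"
                      for ch, bit in (("r", 4), ("w", 2), ("x", 1)))
        # 0xffffffff is the "no id" filler seen as \377\377\377\377 above
        entries.append((TAGS.get(tag, hex(tag)), rwx,
                        None if ident == 0xFFFFFFFF else ident))
    return version, entries


blob = bytes.fromhex(
    "0200000001000700ffffffff020004006400000004000500ffffffff"
    "10000500ffffffff20000500ffffffff")
version, entries = decode_acl(blob)
for entry in entries:
    print(entry)
```

The only entry carrying a real id is the `user` one, with id 100 (the 'd' in the text dump) and read-only permission, which is the extra read right mentioned in the chat. Everything else is 8 bytes per entry of mostly filler, consistent with the "about 40 bits per entry" estimate earlier.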
2008-09-11 22:35 to actually respect extended acls stored as xattrs 2008-09-11 22:35 $ cat /proc/mounts 2008-09-11 22:35 /dev/root / ext3 rw,relatime,errors=continue,user_xattr,acl,data=ordered 0 0 2008-09-11 22:35 notice user_xattr,acl 2008-09-11 22:36 I wonder what acl does 2008-09-11 22:36 user xattr allows setting user.* xattrs (not selinux nor acl) 2008-09-11 22:36 could always read the fscking source 2008-09-11 22:36 hehe 2008-09-11 22:36 man 5 attr 2008-09-11 22:36 ah, well tux3 will yet users set xattrs by default 2008-09-11 22:36 man 5 acl 2008-09-11 22:36 no reason not to 2008-09-11 22:37 I meant, what does it do in ext3 mount 2008-09-11 22:37 uhm, I think that's actually a vfs switch - so you have no choice ;-) 2008-09-11 22:37 it seems strange you'd have to ask for it 2008-09-11 22:37 notice you don't have to ask for htree ;-) 2008-09-11 22:37 I think it's the default nowadays 2008-09-11 22:37 that's a pretty big deal 2008-09-11 22:37 not sure though 2008-09-11 22:37 what is? 2008-09-11 22:37 not having to ask for htree 2008-09-11 22:38 you can mount -o remount,acl/noacl 2008-09-11 22:38 you get dir indexing by default 2008-09-11 22:38 even though it's horribly complex 2008-09-11 22:38 you've lost me - how does user_xattr and acl correspond to htree/dir indexing? 2008-09-11 22:38 doesn't 2008-09-11 22:38 ah 2008-09-11 22:38 just talking about defaults 2008-09-11 22:39 it makes no sense you'd have to ask for xattr or acl 2008-09-11 22:39 if you want to prevent you users from mucking around with it 2008-09-11 22:39 why would you? 
2008-09-11 22:39 I believe the reason is 2008-09-11 22:40 (mind you both are ext3 options, not vfs I believe) 2008-09-11 22:40 true 2008-09-11 22:40 it requries a newer version of the ext3 superblock 2008-09-11 22:40 so does htree 2008-09-11 22:40 so you need it for backward compatibility 2008-09-11 22:40 ah 2008-09-11 22:40 that is the difference 2008-09-11 22:40 htree was forward compatible 2008-09-11 22:40 in case you don't want to generate xattrs on the fs 2008-09-11 22:40 right 2008-09-11 22:40 now it makes sense 2008-09-11 22:40 well 2008-09-11 22:40 tux3 doesn't have that problem 2008-09-11 22:40 forward? or backward? 2008-09-11 22:41 backward 2008-09-11 22:41 there is no backward, but of course we need to plan for forward 2008-09-11 22:41 could an old system r/w a newer fs with data stored in htreE? 2008-09-11 22:41 yes 2008-09-11 22:41 cute, no? 2008-09-11 22:41 very tricky to make that happen 2008-09-11 22:41 ah, ok, then clearly need no option if it's better 2008-09-11 22:41 cute - yes! 2008-09-11 22:41 right, it was never worse, that was another cute thing 2008-09-11 22:41 wicked. 2008-09-11 22:42 because it would fall back to _exactly_ the old code at the crossover point 2008-09-11 22:42 hehe 2008-09-11 22:42 which turned out to be two dirent blocks 2008-09-11 22:42 at two blocks htree was already faster 2008-09-11 22:42 so it just creates the index when the first block overfills 2008-09-11 22:43 and at that point that's still a cheap op 2008-09-11 22:44 right 2008-09-11 22:44 htree is really fast 2008-09-11 22:45 dirops measured in tens of usec, even back then 2008-09-11 22:46 ugh, I wish I could do some coding... 
some real low level put-nose-in-the-deep low-level hackery 2008-09-11 22:47 you can, after your post ;-) 2008-09-11 22:47 that in itself will take a while to write 2008-09-11 22:48 it's job-related 2008-09-11 22:49 good security keeps data centers from catching fire 2008-09-11 22:50 uhm, not so sure about that ;-) 2008-09-11 22:50 they catch fire for totally non-security related reasons 2008-09-11 22:51 ok, _sometimes_ keeps data centers from catching fire 2008-09-11 22:51 could potentially, one day, keep a fire from starting 2008-09-11 22:52 you know alan cox figure out how to remotely disable the temperture override on intel processors? 2008-09-11 22:52 he could literally melt down processors remotely 2008-09-11 22:52 let me see 2008-09-11 22:53 make sure that isn't apocryphal 2008-09-11 22:53 pretty sure not 2008-09-11 22:55 heh 2008-09-11 22:56 I think our machines would lose power at that point, although I'm not sure 2008-09-11 22:56 care to let him try? ;-) 2008-09-11 22:56 it's rather odd that you can do that in software 2008-09-11 22:57 you'd think: overheat is overheat 2008-09-11 22:57 there should be no gate on it 2008-09-11 22:57 I think they have gates on stuff like that for a simple reason 2008-09-11 22:58 they don't know where the overheat point is until after they've built and tested the cpu 2008-09-11 22:58 some batches are better than others 2008-09-11 22:58 those get sold as higher frequency cpus, and or with more cache 2008-09-11 22:59 today faster/more expensive/more top of the line cpus are cpus with less broken parts 2008-09-11 22:59 less of the cache disabled - because it didn't work, less power consumption at higher speed, less heat generated, better freqeuency tolerances etc 2008-09-11 23:00 less alu units disabled (there are always spares) 2008-09-11 23:00 less cores disabled 2008-09-11 23:00 already the 486sx was a dx with the floating point unit disabled because it failed qa 2008-09-11 23:01 -!- flips(~phillips@phunq.net) has left #tux3 
2008-09-11 23:01 -!- flips(~phillips@phunq.net) has joined #tux3 2008-09-11 23:01 creating some excellent virus payload opportunities 2008-09-11 23:01 huh? 2008-09-11 23:02 course, it's usually not in the interest of a virus to physicall destroy its host 2008-09-11 23:02 oh, as in burn the cpu virus? 2008-09-11 23:02 right, a real HACF instruction 2008-09-11 23:02 spread, mutate, exterminate later? 2008-09-11 23:02 HACF? 2008-09-11 23:02 Halt And Catch Fire 2008-09-11 23:02 $ cat /tmp/xattr.se | wc -l 2008-09-11 23:02 383 2008-09-11 23:03 so I have 383 different security.selinux xattr entries on my laptop 2008-09-11 23:03 highly compressable if that's what you're syaing 2008-09-11 23:03 yup 2008-09-11 23:03 MaZe: how'd you count? 2008-09-11 23:04 sort | uniq -c | wc -l 2008-09-11 23:04 er, before that bit 2008-09-11 23:04 this is going to hurt your eyes ;-) 2008-09-11 23:04 find / | xargs getxattr (something)? 2008-09-11 23:04 find / -xdev -print0 | xargs -0 -n 1 getfattr -d -m . -e text -h --absolute-names -L | egrep '^security\.selinux=' | sort | uniq -c | wc -l 2008-09-11 23:04 thanks 2008-09-11 23:05 relatively unpainful line of shell 2008-09-11 23:05 I actually dumped to file in their 2008-09-11 23:05 no perl or awk ;-) 2008-09-11 23:05 so hope I didn't mix that up 2008-09-11 23:05 don't know perl 2008-09-11 23:05 avoid awk 2008-09-11 23:05 abuse egrep and sed instead 2008-09-11 23:05 sed is good for some nice chicken tracks 2008-09-11 23:05 base64 to base16 conversion in sed anyone? 2008-09-11 23:06 ouch, really? 2008-09-11 23:06 or the other way? 2008-09-11 23:06 or base32? 2008-09-11 23:06 or to binary? 
2008-09-11 23:06 really ;-) 2008-09-11 23:06 ow :) 2008-09-11 23:07 (obviously you go through binary) 2008-09-11 23:08 so I split up my selinux strings, here's the results (there's 4 : seperated pieces) 2008-09-11 23:08 $ cat /tmp/xattr.se1 2008-09-11 23:08 679862 system_u 2008-09-11 23:08 24514 unconfined_u 2008-09-11 23:08 $ cat /tmp/xattr.se2 2008-09-11 23:08 704376 object_r 2008-09-11 23:08 $ cat /tmp/xattr.se3 | wc -l 2008-09-11 23:08 356 2008-09-11 23:08 $ cat /tmp/xattr.se4 2008-09-11 23:08 704376 s0 2008-09-11 23:09 so, yeah, badly needs a dict 2008-09-11 23:09 how about base63 to base15? 2008-09-11 23:09 just one less 2008-09-11 23:09 lol 2008-09-11 23:09 how hard could that be 2008-09-11 23:10 ah, but how useful would that be? 2008-09-11 23:10 ACTION feels that way about base64 2008-09-11 23:11 base64 is nice and concise, when you still want something printable (ie. a filename) 2008-09-11 23:11 but doesn't need to be remembered (still needs to be cut'n'pasteable though) 2008-09-11 23:11 see, I just didn't arrive on the planet as a sysop 2008-09-11 23:11 so yeah files with names being base64 encoding of content hash... 2008-09-11 23:12 useful stuff ;-) 2008-09-11 23:12 good thing lots of sysops like to hang around filesystem projects 2008-09-11 23:12 makes for some fancy scripts 2008-09-11 23:12 $ cat /tmp/xattr.se3 | sort -n | tail -n 8 2008-09-11 23:12 6125 modules_object_t 2008-09-11 23:12 13658 locale_t 2008-09-11 23:12 16690 man_t 2008-09-11 23:12 25337 src_t 2008-09-11 23:12 26456 lib_t 2008-09-11 23:12 32875 user_home_t 2008-09-11 23:12 109266 usr_t 2008-09-11 23:12 461436 default_t 2008-09-11 23:13 so you can see the long tail distribution of the 3rd element 2008-09-11 23:13 that's with all the _t ? 2008-09-11 23:13 type 2008-09-11 23:13 right, which I associate with source code 2008-09-11 23:13 or somit 2008-09-11 23:13 is that source? 
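[Editor's sketch] With only 2 users, 1 role, 356 types and 1 fourth-field value in the counts above, the earlier idea of a 32-bit inode field with 10 bits per dictionary index is easy to model. The class name is hypothetical, and giving the two spare bits to the fourth field is purely an assumption; the chat does not say how that field would be stored.

```python
class ContextCodec:
    # Bit widths for user, role, type and the fourth field.
    # 10/10/10 comes from the chat; the 2-bit fourth field is an assumption.
    WIDTHS = (10, 10, 10, 2)

    def __init__(self):
        self.dicts = [[] for _ in self.WIDTHS]   # per field: index -> string

    def _index(self, field, text):
        """Return the dictionary index for text, extending the dictionary."""
        entries = self.dicts[field]
        if text not in entries:
            if len(entries) >= 1 << self.WIDTHS[field]:
                raise OverflowError(f"dictionary {field} is full")
            entries.append(text)
        return entries.index(text)

    def pack(self, context):
        """Pack 'user:role:type:level' into one integer below 2**32."""
        parts = context.rstrip("\0").split(":", 3)
        if len(parts) != 4:
            raise ValueError("expected four ':'-separated fields")
        word = 0
        for field, width in enumerate(self.WIDTHS):
            word = (word << width) | self._index(field, parts[field])
        return word

    def unpack(self, word):
        """Inverse of pack()."""
        parts = []
        for field in reversed(range(len(self.WIDTHS))):
            width = self.WIDTHS[field]
            parts.append(self.dicts[field][word & ((1 << width) - 1)])
            word >>= width
        return ":".join(reversed(parts))


codec = ContextCodec()
etc = codec.pack("system_u:object_r:etc_t:s0\0")
junk = codec.pack("unconfined_u:object_r:default_t:s0\0")
```

A linear `list.index` lookup is fine for a few hundred entries in a sketch; a real dictionary would want the same fast string -> index path as the atom table.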
2008-09-11 23:13 u = user 2008-09-11 23:13 r = role 2008-09-11 23:14 t = type 2008-09-11 23:14 ah 2008-09-11 23:14 parts of the tripplet have different suffixes 2008-09-11 23:14 it's basically always .*_u:.*_r:.*_t 2008-09-11 23:14 let me see, that's MAC terminology 2008-09-11 23:14 not surprising there... 2008-09-11 23:14 not sure what exactly the last quad (s0) is 2008-09-11 23:14 possibly range? 2008-09-11 23:15 anyway, on a bigger box, there would obviously be more 2008-09-11 23:15 but we're not talking about a lot here - just the order of hundreds to thousands 2008-09-11 23:16 instructive 2008-09-11 23:16 I'm going to have to aborb this as we go 2008-09-11 23:16 but anyway 2008-09-11 23:16 I think the last quad isn't used for anything yet but is reserved for future stuff? 2008-09-11 23:16 seem to be on the right track, just by dumb luck 2008-09-11 23:16 maybe, it's newer than the rest 2008-09-11 23:16 err 2008-09-11 23:16 no, that's not right 2008-09-11 23:16 hold on 2008-09-11 23:17 oh, I know 2008-09-11 23:17 if user sets a selinux context which isn't atomized - he gets eperm, if root, the atom table is auto-extended 2008-09-11 23:18 deals with dos correctly 2008-09-11 23:18 or make it a mount option, selinux-auto-atomize={always,if-root,never} 2008-09-11 23:21 atomize is the new tux3 thing? 2008-09-11 23:24 215 different selinux states here 2008-09-11 23:25 :-) 2008-09-11 23:25 well atomize is descriptive in what it does to the number of bytes 2008-09-11 23:25 heh 2008-09-11 23:25 what is the average length of an acl? 2008-09-11 23:26 and what is the percentage of files that have them? 2008-09-11 23:26 selinux? all 2008-09-11 23:26 extended acl? almost none 2008-09-11 23:26 avg length of selinux acl... working 2008-09-11 23:27 security.selinux="...:...:...:..\000"[newline] - avg at 53 2008-09-11 23:27 so like 48 in memory 2008-09-11 23:28 extended acl: 2008-09-11 23:28 system.posix_acl_acces="..." 
with minimum length of 2008-09-11 23:29 so we will compress those about 25/1 2008-09-11 23:29 61 or so 2008-09-11 23:29 30/1 then 2008-09-11 23:30 minimum length 2008-09-11 23:30 oh 2008-09-11 23:30 well 2008-09-11 23:30 well selinux is everywhere - and compressed to 4 bytes 2008-09-11 23:30 "significant" metadata compression coming up 2008-09-11 23:30 acl is not everywhere... 2008-09-11 23:30 ah 2008-09-11 23:30 so only 2/1 2008-09-11 23:30 when we compress the 4 bytes to 2 byte atoms 2008-09-11 23:30 in a selinux system everyfile has to have a selinux context 2008-09-11 23:30 but not every file has to have extended acls 2008-09-11 23:31 I see 2008-09-11 23:31 so it depends on how you do the selinux compression 2008-09-11 23:31 how badly you want to compress it 2008-09-11 23:31 look up every xattr body in the atom table 2008-09-11 23:31 easy 2008-09-11 23:31 the most relaxed method uses 16 bytes 2008-09-11 23:31 -!- stargazr5(~gauravstt@59.95.38.255) has joined #tux3 2008-09-11 23:31 the most compressed 2 bytes 2008-09-11 23:32 I see 2008-09-11 23:32 varying levels of extendability 2008-09-11 23:32 future-proofing 2008-09-11 23:32 you could do selinux compression like standard unix priveleges compression 2008-09-11 23:32 yes, but how big do you make the bitfields? 2008-09-11 23:32 i.e. 
if it's the same as the parent directory, don't store it in the inode 2008-09-11 23:32 so they've already done a pretty good job of compressing bodies 2008-09-11 23:32 it's the xattr labels that will stick out 2008-09-11 23:32 you don't need the xattr labels at all 2008-09-11 23:33 the selinux and acl xattr labels are trivially obvious well known labels, that you fake on xattr access 2008-09-11 23:33 konrad, not there's no way to find the partent directory actually 2008-09-11 23:33 so if we do that, it will be the containing inode table block 2008-09-11 23:33 or region 2008-09-11 23:34 "note there's no way to find the parent directory actually 2008-09-11 23:34 " 2008-09-11 23:34 so you don't need to store them as xattrs anyway 2008-09-11 23:34 hardlinks 2008-09-11 23:34 hard to type and eat ramyun at the same time 2008-09-11 23:34 inode can exist in multiple dirs 2008-09-11 23:35 time to do my atom refcount post 2008-09-11 23:35 I'll post the design, then implement 2008-09-11 23:35 like a good boy should 2008-09-11 23:36 also, note that xattrs 2008-09-11 23:36 basically come in a couple variaties 2008-09-11 23:36 security.* system.* trusted.* user.* 2008-09-11 23:37 so it's worthwhile to compress {security|system|trusted|user} regardless 2008-09-11 23:37 flips: er, sorry, yes, that. 2008-09-11 23:38 security is basically for selinux (& other such systems), system for extended acls (& other such system - capabilities), user for users, trusted (accessible to CAP_SYS_ADMIN) 2008-09-11 23:39 ACTION trashes the latest lucasarts game on slashdot 2008-09-11 23:39 got to keep focussed here 2008-09-11 23:39 what's the game? 2008-09-11 23:39 I just received sport 2008-09-11 23:39 erm, spore 2008-09-11 23:40 spore is on my never play list 2008-09-11 23:40 with the drm 2008-09-11 23:40 even if they're doing it to windows boxes 2008-09-11 23:40 there's drm? 
2008-09-11 23:40 hmm, well I have a mac 2008-09-11 23:40 it fscks with the registry 2008-09-11 23:40 for copy protection 2008-09-11 23:41 hacks the os security 2008-09-11 23:41 details are all over the web 2008-09-11 23:41 not that hacking windows security is all that leet... still it's agin the law 2008-09-11 23:42 hmm, all my windoze are safely in kvm cages 2008-09-11 23:42 you having fun with spore? 2008-09-11 23:42 makes for the fastest windows installs ever 2008-09-11 23:43 haven't launched it yet ;-) 2008-09-11 23:43 I had fun with civ revolutions 2008-09-11 23:43 never played a civ game before 2008-09-11 23:43 will need to reboot into mac 2008-09-11 23:43 surprised me, I didn't think I'd like it 2008-09-11 23:43 well 2008-09-11 23:43 post coming, for realz 2008-09-11 23:46 man setxattr: If extended attributes are not supported by the filesystem, or are disabled, errno is set to ENOTSUP. 2008-09-11 23:53 -!- kd(kdpict@118.94.53.35) has joined #tux3 2008-09-11 23:54 yah 2008-09-11 23:54 just replied 2008-09-12 00:00 as did I 2008-09-12 00:03 good to do this stuff in public 2008-09-12 00:03 keeps a useful record, and it's a subtle way of poking at the libxattr guys to fix their packages 2008-09-12 00:03 ACTION wonders who the libxattr guys are 2008-09-12 00:04 choose one: 1) redhat 2) suse 2008-09-12 00:05 http://oss.sgi.com/projects/xfs/ 2008-09-12 00:05 possible 2008-09-12 00:06 $ rpm -qf `which getfattr ` 2008-09-12 00:06 attr-2.4.41-1.fc9.x86_64 2008-09-12 00:06 it was a suse guy who put xattr+acl support into ext3 2008-09-12 00:06 $ rpm -qi attr | grep URL 2008-09-12 00:06 URL : http://oss.sgi.com/projects/xfs/ 2008-09-12 00:06 ah, ok 2008-09-12 00:06 -!- stargazr5(~gauravstt@59.95.35.250) has joined #tux3 2008-09-12 00:07 well all the slashdotters seem to agree that the new lucasarts game lacks in the gameplay department 2008-09-12 00:07 it's fun though 2008-09-12 00:08 for a few minutes 2008-09-12 00:08 throwing big chunks of metal at little robots 
2008-09-12 00:08 making lightning come out of your fingers and not do much 2008-09-12 00:10 what's it called? 2008-09-12 00:11 force unleashed 2008-09-12 00:11 you play as a sith apprentice 2008-09-12 00:11 darth's personal waterboy 2008-09-12 00:14 flips: tried KOTOR? 2008-09-12 00:14 both parts are good 2008-09-12 00:14 enjoyed kotor 2008-09-12 00:15 force unleashed better ? 2008-09-12 00:16 but I don't want to play any more bioware games 2008-09-12 00:16 too cookie cutter 2008-09-12 00:16 even jade empire felt like kotor 2008-09-12 00:16 and I got spoiled by oblivion 2008-09-12 00:16 everything you couldn't do in kotor,you could do in oblivion 2008-09-12 00:16 nope. haven't played either 2008-09-12 00:16 yeah. oblivion was great. 2008-09-12 00:17 force unleashed is mildly entertaining 2008-09-12 00:17 they got some mechanics right, others badly wrong 2008-09-12 00:17 and it's way linear 2008-09-12 00:18 I'll probably get force unleashed 2008-09-12 00:18 but I don't expect much 2008-09-12 00:18 just filling time until bethesda comes up with something new ;-) 2008-09-12 00:18 :) 2008-09-12 00:19 thanks for ur support to the de-duplication idea 2008-09-12 00:19 will get back to you when we have something more concrete 2008-09-12 00:19 welcome, so that's you 2008-09-12 00:20 sure 2008-09-12 00:20 deduplication seems to be a big hot button 2008-09-12 00:20 yeah...me, stargazr5 , kd and another 2008-09-12 00:20 for new fs design 2008-09-12 00:20 ah 2008-09-12 00:20 were you here for the vfs tour today? 2008-09-12 00:21 no....missed it... :( but following it now. 2008-09-12 00:21 good 2008-09-12 00:21 how many years of C has each of you got? 2008-09-12 00:22 look through the logs - there's bound to be some jewels in there 2008-09-12 00:23 about 2 and half years 2008-09-12 00:23 MaZe: thanks. will do. 2008-09-12 00:24 and have you done an OS and/or FS course yet? 2008-09-12 00:24 I think this is for an advanced fs course, right? 
2008-09-12 00:24 ACTION reads again 2008-09-12 00:25 yes. OS. 2008-09-12 00:26 how much low level experience do you have? C / assembly interfaces? While it's not really needed, it comes in useful from time to time. 2008-09-12 00:27 [of course, IMHO, assembly is always useful to know... so I may be biased] 2008-09-12 00:27 I agree 2008-09-12 00:27 not as a first language 2008-09-12 00:27 but certainly as one of the first 5 2008-09-12 00:28 it's been years now since I've written any 2008-09-12 00:28 if I do write some, it's likely to be for some strange arch like cell spe 2008-09-12 00:28 oh, I've never written true assembly... it's always been inline, often entire procedures, but never entire programs (unless the entire program was a hundred lines or less) 2008-09-12 00:29 ah, I've written tens of thousands of lines 2008-09-12 00:29 wallowed in it 2008-09-12 00:29 have a bit of experience in assembly...not much.. 2008-09-12 00:29 oh, I've written tens of thousands of lines, never all in one piece though 2008-09-12 00:29 got really good at it, then realized there's people much, much better 2008-09-12 00:30 this is part of project that we can do on any cs topic 2008-09-12 00:30 right 2008-09-12 00:30 The biggest pieces I've written were usually either asm-coded bigint adders and the like, or something like a boot sector 2008-09-12 00:30 just reread your post 2008-09-12 00:30 I've done stuff llike transcoded knuth's algorithms for infinite precision math from MIX and x86 2008-09-12 00:31 neither of which has a lot of code, but either has huge performance boosts from assembly, or just needs to be in asm 2008-09-12 00:31 making the carries work out is hard ;-) 2008-09-12 00:31 MIX to x86 I mean 2008-09-12 00:31 oh, but that sort of stuff is still not pure asm, you write that in C with good macros and inline asm 2008-09-12 00:31 not me 2008-09-12 00:31 pure 2008-09-12 00:31 there is almost never a reason to write pure asm 2008-09-12 00:32 sure, when the OS isn't linux 
2008-09-12 00:32 and the compiler isn't gcc 2008-09-12 00:32 gcc asm syntax blows by the way ;-) 2008-09-12 00:32 true, if compiler != gcc, then hang yourself 2008-09-12 00:32 blows as in bad? or in good? 2008-09-12 00:32 bad 2008-09-12 00:32 sucks chunks 2008-09-12 00:32 the syntax is bad, but it is extremely powerful 2008-09-12 00:33 although it takes quite some getting used to 2008-09-12 00:33 I know, so why not have the syntax be good and be extremely powerful? 2008-09-12 00:33 hehe 2008-09-12 00:33 yeah, well... that's like 2008-09-12 00:33 C 2008-09-12 00:33 the syntax for C also blows 2008-09-12 00:33 kinda yes 2008-09-12 00:33 at&t asm blows orders worse 2008-09-12 00:33 I think that's where it came from 2008-09-12 00:34 well 2008-09-12 00:34 let's not scare the visitors ;-) 2008-09-12 00:34 cranky old hacks 2008-09-12 00:35 a (*b(c d, e (*f)(g)))(h); 2008-09-12 00:35 what kind of syntax is that? 2008-09-12 00:35 next tux3 university? 2008-09-12 00:36 tue at 8 pm pacific 2008-09-12 00:36 tuesday 8 pm 2008-09-12 00:36 right, I tend to forget that timezone 2008-09-12 00:36 will be there this time. 2008-09-12 00:36 ok 2008-09-12 00:36 like the world revolves around silly valley 2008-09-12 00:36 :) 2008-09-12 00:36 :) 2008-09-12 00:36 doesn't it? 2008-09-12 00:36 not sure 2008-09-12 00:36 I didn't use to think so ;) 2008-09-12 00:37 I have some stories about silicon valley and time zones that I can't share ;-( 2008-09-12 00:37 nice to know there's a reason to get you drunk 2008-09-12 00:38 speaking of which 2008-09-12 00:38 a sake would help get this post written 2008-09-12 00:38 or should I just hack 2008-09-12 00:38 hmm 2008-09-12 00:38 I mean, with the sake in hand of course 2008-09-12 00:39 [btw, that declaration above, that's valid C - that's the declaration of signal from the std C library, where a,e=void b=signal c,g,h=int d=sig f=func 2008-09-12 00:39 taking abt timezones...its luch time here...i am off.. 
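With the substitutions given above, the declaration reads `void (*signal(int sig, void (*func)(int)))(int);`. A small sketch in C of the same shape, next to the typedef spelling that is usually easier to read; `set_handler`, `handler_t`, `set_handler_alias`, `on_signal` and `install_and_raise` are hypothetical names made up for this example:

```c
#include <assert.h>
#include <signal.h>

/* Same shape as the line in the log, a (*b(c d, e (*f)(g)))(h):
 * a function taking (int, pointer to handler) and returning a pointer
 * to a handler. It simply forwards to the standard signal(). */
static void (*set_handler(int sig, void (*func)(int)))(int)
{
    return signal(sig, func);
}

/* The typedef spelling of the handler type. */
typedef void (*handler_t)(int);

/* Compile-time check that both spellings name the same function type. */
static handler_t (*const set_handler_alias)(int, handler_t) = set_handler;

static volatile sig_atomic_t last_signal;

static void on_signal(int signo)
{
    last_signal = signo;
}

/* Install a handler, raise SIGINT once, restore the previous handler.
 * Returns the signal number seen by the handler, or -1 on failure. */
static int install_and_raise(void)
{
    handler_t old = set_handler_alias(SIGINT, on_signal);
    if (old == SIG_ERR)
        return -1;
    int rc = raise(SIGINT);
    set_handler(SIGINT, old);
    assert(rc == 0);
    return (int)last_signal;
}
```

Reading from the inside out: `set_handler` is a function of `(int, void (*)(int))`, and what it returns is a pointer to a function taking `int` and returning `void`.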
2008-09-12 00:40 wow 2008-09-12 00:40 me too 2008-09-12 00:41 see you cdk, stargazr5 2008-09-12 00:42 you know I can read that without thinking? 2008-09-12 00:42 that's scary 2008-09-12 00:42 also used to to hex multiply/divide in my head at the most geekiest 2008-09-12 00:42 still can do it, more slowly 2008-09-12 00:50 the C syntax above? without thinking? really? that is scary 2008-09-12 00:50 that is like the ugliest part of C... 2008-09-12 00:51 arguably 2008-09-12 00:51 personally, I think const is 2008-09-12 00:51 dreamed up by a sadist 2008-09-12 00:51 oh, const is relatively simple to parse though 2008-09-12 00:51 especially if you write it so 'char const *' instead of 'const char *' 2008-09-12 00:51 but a devilishly effective makework project 2008-09-12 00:52 and then read from the back, since C is mostly read from the back/center anyway 2008-09-12 00:52 bottom up reading is a powerful organizing force 2008-09-12 00:52 I so much prefer Pascal syntax for type definitions 2008-09-12 00:52 you just read it left to right 2008-09-12 00:52 me too 2008-09-12 00:53 but pascal as a whole is just plain irritating 2008-09-12 00:53 if the things a pointer it says so in the first character 2008-09-12 00:53 I am also offended by == 2008-09-12 00:53 irritating? a little long-winded, true, but so frickin' easy to understand 2008-09-12 00:53 and friends 2008-09-12 00:53 yes, I prefer the pascal := and = as opposed to = and == 2008-09-12 00:54 I don't mind != nor <> - doesn't really matter to me 2008-09-12 00:54 the most pleasant language I've worked in is pick basic 2008-09-12 00:54 as modified by a friend of mine 2008-09-12 00:54 not aware of that - does that differ from basic? 2008-09-12 00:54 most of the stupidities gone 2008-09-12 00:54 and missing stuffyou need in, like structuring primitives 2008-09-12 00:54 [are you aware modern oo pascal like delphi, like freepascal, is 32bit and has object oriented programming, operator overloading, function overloading, etc...] 
2008-09-12 00:55 it's very different from basic 2008-09-12 00:55 totally not anything like msft basic 2008-09-12 00:55 h 2008-09-12 00:55 ehrm, ah 2008-09-12 00:55 I haven't used the latest borland stuff, no 2008-09-12 00:55 msft liked it enough to headhunt the guy as I recall 2008-09-12 00:55 and we got c# :p 2008-09-12 00:56 another flavor of C-that-blows 2008-09-12 00:56 C# is just a flavour of java 2008-09-12 00:56 with the worst of C added in 2008-09-12 00:58 I want a melding of pascal (type declaration syntax, ease of reading code), Java (generics, interfaces [extended]), gnu-ism (inline asm power), C (low level control), C++ (some of the OO, dropping multiple inheritance), not sure what to do with some things [exceptions] 2008-09-12 00:59 let me know when you have code to try 2008-09-12 00:59 make it a very strongly typed language, drop most of the legacy crap, support useful UI candy (type in constants in any base, etc...) 2008-09-12 01:00 don't forget to make it interactive 2008-09-12 01:00 and managed 2008-09-12 01:00 and semicolons optional 2008-09-12 01:00 being able to recompile program blocks and replace them on the fly - yes I've wanted that ;-) 2008-09-12 01:00 managed? 2008-09-12 01:00 likewise parens, including fn call parens 2008-09-12 01:00 a function needs parens 2008-09-12 01:00 managed = can't segfault 2008-09-12 01:00 a procedure doesn't 2008-09-12 01:01 what does that mean? 2008-09-12 01:01 can't segfault? 2008-09-12 01:01 yes. 2008-09-12 01:01 lisp can't segfault in principle 2008-09-12 01:01 java can't either 2008-09-12 01:01 oh, but then it's not low-level 2008-09-12 01:01 I'm really concerned as to why 65% of my memory is in use (not including cache) 2008-09-12 01:01 in principle. 2008-09-12 01:01 konrad, in what context? 2008-09-12 01:01 the above would be a language you could write a kernel in 2008-09-12 01:02 konrad, running mozilla? 
2008-09-12 01:02 oh 2008-09-12 01:02 flips: yeah, but that's only eating 600M or something 2008-09-12 01:02 I have 6 gigs 2008-09-12 01:02 I've got 45% in programs on a 4 g machine 2008-09-12 01:02 ncie 2008-09-12 01:02 nice 2008-09-12 01:02 50% cache 2008-09-12 01:02 yeah. I'm concerned. 2008-09-12 01:02 need to get yourself a memory map 2008-09-12 01:02 from proc 2008-09-12 01:02 there must be a tool 2008-09-12 01:03 (and in my case that probably is a gig and a half of firefox3) 2008-09-12 01:03 wicked 2008-09-12 01:03 13457 maze 20 0 103m 12m 9.9m S 0.0 0.3 0:29.40 gnome-power-man 2008-09-12 01:04 103M! 2008-09-12 01:04 3409 root 20 0 1454m 1.1g 28m S 5.6 18.0 531:18.61 Xorg 2008-09-12 01:04 just say gno to gnome 2008-09-12 01:04 but that's still insubstantial relative to 6 2008-09-12 01:05 kmail's using 500M 2008-09-12 01:05 what do you get from cat /proc/meminfo? 2008-09-12 01:05 pastie maybe? 2008-09-12 01:06 http://pastie.caboo.se/271078 2008-09-12 01:10 .6 gig of buffers, woof 2008-09-12 01:11 A gig into swap 2008-09-12 01:11 that's braindamage 2008-09-12 01:11 something is using 2.7 gig of straight memory 2008-09-12 01:11 should not be hard to find 2008-09-12 01:11 well 2008-09-12 01:11 X leaks often 2008-09-12 01:11 if you don't see it in the processes then yes, that is worrisome 2008-09-12 01:12 but you'd see the X usage even when it's leaking 2008-09-12 01:12 you know Shift-M with top? 
2008-09-12 01:12 I think that's the one 2008-09-12 01:12 shows your proces in rss order 2008-09-12 01:12 or is it total vm size 2008-09-12 01:12 one of those 2008-09-12 01:12 vm size I think 2008-09-12 01:13 in order, Xorg, firefox, nautilus, gnome-panel, kmail, pidgin, gnome-terminal 2008-09-12 01:13 and some others 2008-09-12 01:13 VmallocTotal: 34359738367 kB 2008-09-12 01:13 that has to be broken 2008-09-12 01:14 VmallocChunk: 34359675895 kB 2008-09-12 01:14 haven't spent much time crawling in vm lately 2008-09-12 01:14 you might want to go onto #mm on this server 2008-09-12 01:15 and complain about that vmalloctotal 2008-09-12 01:15 it's late enough that I can't be arsed 2008-09-12 01:15 works for me 2008-09-12 01:15 alright, I got back a significant chunk of it by ditching firefox 2008-09-12 01:16 down to 51% used 2008-09-12 01:16 still 2008-09-12 01:17 check your anon 2008-09-12 01:17 I don't think firefox was using that 2008-09-12 01:18 600M of it went away after closing firefox 2008-09-12 01:18 leaving 2.1 gig in anon? 2008-09-12 01:18 that's broken 2008-09-12 01:18 yes 2008-09-12 01:19 check top 2008-09-12 01:19 for what? 2008-09-12 01:19 shift-M 2008-09-12 01:19 look at virtual size 2008-09-12 01:19 and rss 2008-09-12 01:19 fes 2008-09-12 01:19 res 2008-09-12 01:20 xorg has 1471m virt 1.1g res, nautilus 767m virt 212m res, kmail 536m virt 146m res 2008-09-12 01:20 top 3 2008-09-12 01:20 fsking pigs 2008-09-12 01:20 wtf is X doing? 2008-09-12 01:21 nautilus... 
2008-09-12 01:21 :p 2008-09-12 01:21 shh :) 2008-09-12 01:21 I'd say you' 2008-09-12 01:21 I'd say you've got a simple case of out of control X plus 4 x bloatware 2008-09-12 01:21 sounds about right 2008-09-12 01:22 I'd call the X part a bug 2008-09-12 01:22 the other is just sloth 2008-09-12 01:22 it leaks like crazy 2008-09-12 01:22 write nasty emails to xorg 2008-09-12 01:22 tell them you want your money back 2008-09-12 01:23 kernel is unsing an unconscionable amount of buffers 2008-09-12 01:23 metadata is supposed to be small. Buffers -> metadata 2008-09-12 01:23 you can go complain on #mm 2008-09-12 01:23 tell peterz to do something ;-) 2008-09-12 01:23 that might be a result of encrypted harddrive 2008-09-12 01:24 probably 2008-09-12 01:24 well 2008-09-12 01:24 dodgy encryption layer 2008-09-12 01:24 right 2008-09-12 01:25 what's the encryption method, craptoloop or dm-crapt? 2008-09-12 01:25 dm-crypt 2008-09-12 01:25 complain on the dm-devel list 2008-09-12 01:25 there 2008-09-12 01:25 got it all sorted ;-) 2008-09-12 01:25 :) 2008-09-12 01:49 incremental refcount block update cost <- I love english 2008-09-12 01:49 don't you maze? 2008-09-12 01:51 all that stuff in the refcount post was actually designed during the skate today 2008-09-12 01:51 the same skate the resulting in "rollerbladers are allowed" inthe sk8board park 2008-09-12 01:52 so it was a good skate, all things considered 2008-09-12 01:52 g'night 2008-09-12 01:57 "Meanwhile, Rockbox has performed a valuable service for Debian developers who would otherwise have to struggle to find a project with longer release cycles than their own. " hah 2008-09-12 01:59 :-) 2008-09-12 01:59 what is rockbox? 
2008-09-12 02:03 different firmware for your ipod 2008-09-12 02:03 or other 'mp3 player'-class devices 2008-09-12 02:04 (with additional functionality in mind) 2008-09-12 02:09 ACTION talks about b-tree parallelization with flips 2008-09-12 02:09 I was thinking about how something like RCU could be integrated into a b-tree. I don't know the specifics of a b-tree per se other than it's a tree that's flatter and better suited for storage 2008-09-12 02:11 flips: so I was thinking about per inode processing if we decided to parallelize it on that basis 2008-09-12 02:11 we don't have rcu in userspace 2008-09-12 02:11 but we need locking in userspace 2008-09-12 02:11 also: rcu has some scary artifacts 2008-09-12 02:11 file locking would have to be done on a per inode basis 2008-09-12 02:12 I think it's a case of, do some decent spinlock + mutex work, then try rcu 2008-09-12 02:12 not rcu first 2008-09-12 02:12 rcu is weird when it goes wrong 2008-09-12 02:12 so that code would have to be split up in that manner if you do it that way 2008-09-12 02:12 yeah, I know 2008-09-12 02:12 so are spinlocks etc, but the weirdness is a lot easier to grasp 2008-09-12 02:12 you have to validate the data and stuff before using it 2008-09-12 02:12 making sure that it's not stale 2008-09-12 02:12 per inode is too coarse 2008-09-12 02:13 across quiescence periods 2008-09-12 02:13 well, what then ? 
2008-09-12 02:13 simple spinlocks and mutexes 2008-09-12 02:13 decide how many 2008-09-12 02:13 what order acquired 2008-09-12 02:13 for how long 2008-09-12 02:13 what granularity of data protected 2008-09-12 02:14 estimate contention 2008-09-12 02:14 decide where rw is appropriate 2008-09-12 02:14 where trylock works 2008-09-12 02:14 then think about per cpu 2008-09-12 02:14 problem here is that rwlocks suck badly since they still depend on an atomic operation, limited scalability which is why some kind of per CPU-ism is useful 2008-09-12 02:14 at first, pretend bouncing costs nothing 2008-09-12 02:15 ok 2008-09-12 02:15 when it's working reliably and bouncing starts to show in the profile (because everything else is so fast) then take anti-bounce measures 2008-09-12 02:15 let think, how do we want to break this up ? 2008-09-12 02:15 what are we protecting and what circumstances ? 2008-09-12 02:16 there should be any number of concurrent readers and writes allowed in one file inode at the same time 2008-09-12 02:16 same with the inode table 2008-09-12 02:16 they will be partitioned by subtree 2008-09-12 02:16 higher levels of the tree can have rwlocks 2008-09-12 02:17 I think 2008-09-12 02:17 low level, simple mutex or spinlock is better 2008-09-12 02:17 what parts of the subtree ? 2008-09-12 02:17 =and how are they realted to the inode itself ? 2008-09-12 02:17 and how are they related to the inode itself ? 2008-09-12 02:17 a file index tree descends from the inode table 2008-09-12 02:18 think 50 petabyte file 2008-09-12 02:18 -!- kbingham(~kbingham@193.132.141.186) has joined #tux3 2008-09-12 02:18 well, it depends on what you're guarding, we have to define the relationship first 2008-09-12 02:18 to get the sense of how many read/writes can be in it at the same time 2008-09-12 02:18 let's start from the beginning 2008-09-12 02:18 what happens on a file open ? 
2008-09-12 02:18 guarding changes to the index leaf nodes, which is to say, the block pointers, and later extents 2008-09-12 02:18 and define the common operations 2008-09-12 02:18 open, read, write, close 2008-09-12 02:19 on file open we first look in the directoy file 2008-09-12 02:19 find the inode number 2008-09-12 02:19 then probe into the inode table 2008-09-12 02:19 that's a flat file, right ? 2008-09-12 02:19 find the inode table block, and the inode in it 2008-09-12 02:19 the directory? 2008-09-12 02:19 currently flat 2008-09-12 02:19 diretory file 2008-09-12 02:19 directory file 2008-09-12 02:19 later will have a btree mapped into the flat file 2008-09-12 02:19 has its own locking considerations 2008-09-12 02:19 I'm assuming that's it's a specific inode on the file system 2008-09-12 02:20 what is? 2008-09-12 02:20 we have two structures so far right ? 2008-09-12 02:20 1) directory map file 2008-09-12 02:20 2) b-tree 2008-09-12 02:20 not quite like that 2008-09-12 02:20 tux3 is a two level btree structure 2008-09-12 02:20 top level btree is the inode table 2008-09-12 02:21 from the inode table descend some large number of file index btrees 2008-09-12 02:21 a directory is the leaves of one of those btrees 2008-09-12 02:21 that is, the data blocks 2008-09-12 02:21 the leaves of a file index btree actually contain pointers to data blocks 2008-09-12 02:22 so we go probing around in some directory, taking the same locks as we would for any file 2008-09-12 02:22 ACTION reads 2008-09-12 02:22 that is, locking various levels of the index btree of the directory file 2008-09-12 02:23 once we find a data block we read it into the page cache and drop our locks 2008-09-12 02:23 maybe not all of them, maybe just up to some level 2008-09-12 02:23 well 2008-09-12 02:24 that is a little tricky, because the linux generic_file_read etc functions don't work that way 2008-09-12 02:24 they generally cause the filesystem to walk its index tree over and over again, for each block 
2008-09-12 02:24 sucks 2008-09-12 02:24 ACTION is a bit confused 2008-09-12 02:24 ACTION thinks 2008-09-12 02:24 we don't need to worry about that 2008-09-12 02:25 for the moment we only need to be able to dive down into the btree and find a pointer to some data block 2008-09-12 02:25 see inode.c 2008-09-12 02:25 "filemap_blockio" 2008-09-12 02:26 most of the work is done by "probe" 2008-09-12 02:26 probe is where most of the locking action will happen 2008-09-12 02:27 so a file is a b-tree ? 2008-09-12 02:27 which is lower in level to the inode b-tree ? 2008-09-12 02:27 that's the relationship ? correct ? 2008-09-12 02:27 http://tux3.org/tux3?f=6ea2692d2839;file=user/test/btree.c 2008-09-12 02:27 see probe in there 2008-09-12 02:27 a file is _indexed_ by a btree 2008-09-12 02:27 that's in the lower level right ? 2008-09-12 02:28 ACTION looks 2008-09-12 02:28 a file lives in data blocks, that are pointed to by pointers that live in the leaves of a btree, called a data index btree 2008-09-12 02:28 the leavesof that btree are called dleaves 2008-09-12 02:28 see dleaf.c 2008-09-12 02:29 the situation with dleaf.c is pretty simple 2008-09-12 02:29 we can protext an entire dleaf as one logical entitity 2008-09-12 02:30 that covers about 500 file data blocks 2008-09-12 02:30 which is an ok granularity 2008-09-12 02:30 top level b-tree right ? 2008-09-12 02:30 what top level btree? 2008-09-12 02:30 a dtree is a second level btree 2008-09-12 02:30 you have an inode b-tree and a data index b-tree, correct ? 
2008-09-12 02:30 the top level btree is the inode table 2008-09-12 02:30 right 2008-09-12 02:30 I'm just trying to understand the terminology here 2008-09-12 02:30 ok, good 2008-09-12 02:30 that's what I though 2008-09-12 02:30 thought 2008-09-12 02:30 itree vs dtree 2008-09-12 02:31 good 2008-09-12 02:31 yeah, good terminology 2008-09-12 02:31 thanks 2008-09-12 02:31 itree->dtree 2008-09-12 02:31 right 2008-09-12 02:31 terminology is important 2008-09-12 02:31 agreed 2008-09-12 02:31 it's what I figured you said in the first place, but I had to be sure 2008-09-12 02:31 right, protect a dtree entirely with a lock 2008-09-12 02:31 http://kerneltrap.org/Linux/Tux3_Hierarchical_Structure 2008-09-12 02:32 some of this is wrong now 2008-09-12 02:32 ACTION reads 2008-09-12 02:32 dropped the volume table, moved the free map inside the itree 2008-09-12 02:32 update it before the next tux3 university 2008-09-12 02:32 as a normal file 2008-09-12 02:32 that's hard 2008-09-12 02:32 that's on somebody else's site 2008-09-12 02:32 but I can post something on tux3.org 2008-09-12 02:32 ACTION really appreciates the help in learning this from flips 2008-09-12 02:33 do you have an allocation maps that's shared 2008-09-12 02:33 ? 2008-09-12 02:33 at least you can see the inode table / data index table relationship there 2008-09-12 02:33 but it is obscured by the volume table, which I determined to be useless 2008-09-12 02:33 that's a potentially huge problem for contention with regards to the allocator 2008-09-12 02:33 allocation map? 2008-09-12 02:34 there is an allocation bitmpa 2008-09-12 02:34 block allocation map 2008-09-12 02:34 which is a normal file 2008-09-12 02:34 well, how do you modify it, say, under heavy delete or data creation pressure ? 2008-09-12 02:34 sb->bitmap in inode.c 2008-09-12 02:34 doesn't it need a lock around it ? 2008-09-12 02:34 currently there is no locking 2008-09-12 02:34 or concurrency 2008-09-12 02:34 soon 2008-09-12 02:35 well, doesn't it need it ? 
2008-09-12 02:35 but it is just a normal file 2008-09-12 02:35 lock it with the same granularity 2008-09-12 02:35 there's a lot of activity there so I expect it to be heavily hit 2008-09-12 02:35 sure 2008-09-12 02:35 same granularity as what ? 2008-09-12 02:35 other files too 2008-09-12 02:35 but! 2008-09-12 02:35 there is a difference with tux3 2008-09-12 02:35 ok 2008-09-12 02:35 tux3 has this way of logging changes to the bitmaps 2008-09-12 02:35 it doesn't have to lock, write block, wait 2008-09-12 02:35 ok 2008-09-12 02:35 that kind of thing 2008-09-12 02:35 oh nice 2008-09-12 02:36 so locks on the bitmap are just page cache locks 2008-09-12 02:36 deltas to the allocation map are just appended 2008-09-12 02:36 that is, most like actually locking pages when we get to kernel 2008-09-12 02:36 or we could lock buffers 2008-09-12 02:36 locking pages is a little faster 2008-09-12 02:36 what about during concurrent access against an online checker that needs to know about all of the appended logs ? 2008-09-12 02:36 yes, deltas to the allocation map are just logged 2008-09-12 02:37 and every now and then we pour a bunch of them into the allocation map and write it out 2008-09-12 02:37 differ those checks until the log has been commit to the disk and then restart it ? 2008-09-12 02:37 committed 2008-09-12 02:37 the allocation map always has the most recent version of the allcoation 2008-09-12 02:37 in buffers 2008-09-12 02:37 in memory 2008-09-12 02:37 because, say, you want to verify if data blocks that some indirect mapping is pointing is allocated or not 2008-09-12 02:37 so an online check, ah, needs to check the disk image 2008-09-12 02:37 not the cached image 2008-09-12 02:37 right? 
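The logging idea described above (the cached allocation map is always current, changes are appended as deltas, and every so often a batch is poured into the map that gets written out) can be sketched roughly as follows. This is a toy model in C under stated assumptions, not tux3 code: `struct alloc_state`, `record_change`, `flush_log`, `NBLOCKS` and `LOG_MAX` are hypothetical, and the `disk` array merely stands in for the last bitmap image written out.

```c
#include <assert.h>

#define NBLOCKS 64 /* hypothetical volume size in blocks */
#define LOG_MAX 16 /* hypothetical number of deltas kept before a flush */

struct delta {
    unsigned block; /* block number whose state changed */
    int set;        /* 1 = allocated, 0 = freed */
};

struct alloc_state {
    unsigned char cached[NBLOCKS / 8]; /* always the most recent allocation state */
    unsigned char disk[NBLOCKS / 8];   /* stand-in for the last written image */
    struct delta log[LOG_MAX];         /* changes not yet applied to disk[] */
    unsigned logged;
};

static int test_bit(const unsigned char *map, unsigned b)
{
    return (map[b / 8] >> (b % 8)) & 1;
}

static void put_bit(unsigned char *map, unsigned b, int v)
{
    unsigned char mask = (unsigned char)(1u << (b % 8));
    if (v)
        map[b / 8] |= mask;
    else
        map[b / 8] &= (unsigned char)~mask;
}

/* Update the cached map and append a delta; no bitmap block is rewritten.
 * Returns -1 when the log is full and the caller should flush first. */
static int record_change(struct alloc_state *s, unsigned block, int set)
{
    assert(block < NBLOCKS);
    if (s->logged == LOG_MAX)
        return -1;
    put_bit(s->cached, block, set);
    s->log[s->logged].block = block;
    s->log[s->logged].set = set;
    s->logged++;
    return 0;
}

/* Apply every logged delta, in order, to the written image and empty the log. */
static void flush_log(struct alloc_state *s)
{
    for (unsigned i = 0; i < s->logged; i++)
        put_bit(s->disk, s->log[i].block, s->log[i].set);
    s->logged = 0;
}
```

In this model, a checker that reads only `disk` sees stale state until it also replays `log`, which is the concern raised about online checking.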
2008-09-12 02:38 pretty hard to do otheriwse 2008-09-12 02:38 anway, that's not the immediate problem 2008-09-12 02:38 the immediate problem is just tohave fast, concurrent access to everything 2008-09-12 02:38 if it's not and a log is being committed, we should delay it until that log has been committed ? 2008-09-12 02:38 just thinking out loud 2008-09-12 02:38 if what is not? 2008-09-12 02:38 what ? the log ? 2008-09-12 02:39 the log itself 2008-09-12 02:39 "if it's not"you said 2008-09-12 02:39 don't know what "it" is 2008-09-12 02:39 is there a scenario where the online checking of that portion of the disk and ...on that'll never happen 2008-09-12 02:39 because of the atomic commit 2008-09-12 02:39 it's should be consistent at that point from previous commits 2008-09-12 02:40 we don't really need to check logs that are being committed to disk and wait for them to complete 2008-09-12 02:40 or do we ? 2008-09-12 02:40 we do 2008-09-12 02:40 because the logs form a promise of what the "real" disk image "should" look like 2008-09-12 02:40 yeah, well then we have to lock them down or something like that 2008-09-12 02:40 so we need to take it into account during checking 2008-09-12 02:40 but checking is far in the future 2008-09-12 02:41 under, say a rwlock lock, reader side 2008-09-12 02:41 at least 3 months 2008-09-12 02:41 probably 4 2008-09-12 02:41 ok 2008-09-12 02:41 worth thinking about 2008-09-12 02:41 let's do beer on it 2008-09-12 02:41 I'll think about it on my next skate, if refcounting is done ;-) 2008-09-12 02:42 going back to the bitmap 2008-09-12 02:42 so, at least each bit has to be protected 2008-09-12 02:42 we do scan, find, change 2008-09-12 02:42 and that scan/find/change to allocate a block has to be under a spinlock 2008-09-12 02:43 flips: am I being helpful or not ? 2008-09-12 02:43 or in userspace, under a pthread mutex 2008-09-12 02:43 or saying stupid irrelevant things ? 
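The scan/find/change step mentioned above could look roughly like this in userspace. This is a sketch only, assuming one pthread mutex over a small fixed bitmap; `balloc` and `bfree` are hypothetical helper names used here for illustration, and a kernel version would use a spinlock or page lock instead.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

enum { NBLOCKS = 64 };

struct bitmap {
	pthread_mutex_t lock;        /* protects every bit in bits[] */
	uint8_t bits[NBLOCKS / 8];   /* 1 = allocated, 0 = free */
};

/* Scan for a free bit, set it, and return its block number, or -1 if full.
 * The scan and the change happen under the same lock, so two threads can
 * never claim the same block. */
static long balloc(struct bitmap *map)
{
	long found = -1;

	pthread_mutex_lock(&map->lock);
	for (long block = 0; block < NBLOCKS; block++) {
		uint8_t mask = (uint8_t)(1u << (block & 7));
		if (!(map->bits[block >> 3] & mask)) {
			map->bits[block >> 3] |= mask;
			found = block;
			break;
		}
	}
	pthread_mutex_unlock(&map->lock);
	return found;
}

/* Clear the bit for a block that is no longer in use. */
static void bfree(struct bitmap *map, long block)
{
	pthread_mutex_lock(&map->lock);
	map->bits[block >> 3] &= (uint8_t)~(1u << (block & 7));
	pthread_mutex_unlock(&map->lock);
}
```

A single lock over the whole bitmap is the coarsest possible choice; it is shown here only because it makes the atomicity of scan/find/change obvious.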
just checking 2008-09-12 02:43 of course 2008-09-12 02:43 ok 2008-09-12 02:43 I haven't been required to be precise about this up till now 2008-09-12 02:43 or deal with somebody who had written nontrivial locking 2008-09-12 02:43 we'll I hope I'm helping 2008-09-12 02:43 yep 2008-09-12 02:43 anyway, the allocation bitmap is a good place to start 2008-09-12 02:44 because there is a pretty simple situation there 2008-09-12 02:44 I think you can definitely isolate a dtree using an individual dtree lock 2008-09-12 02:44 once you know your bitmap block isn't going away 2008-09-12 02:44 well, one lock per dtree is way too crude 2008-09-12 02:44 that's good, but you have to think about the upward relationship between than than itree 2008-09-12 02:44 actually you don't 2008-09-12 02:44 you can treat it as individual blocks 2008-09-12 02:44 is the lock against the dtree sufficient to protect the link in the itree pointing to it ? 2008-09-12 02:45 stuff like that 2008-09-12 02:45 you lock your way down through the btree until you get to the datablock, lock the data block and let everything else go 2008-09-12 02:45 ok, let's define what a read would look like through that. 2008-09-12 02:45 ok, right 2008-09-12 02:45 you look up the inode in the itree 2008-09-12 02:45 you get it 2008-09-12 02:45 what next ? 2008-09-12 02:45 it points to a dtree 2008-09-12 02:45 well that's a good point 2008-09-12 02:45 so you want to delete a data block 2008-09-12 02:45 that means you have to lock the data block 2008-09-12 02:46 so I think that's clear, right? 2008-09-12 02:46 yes 2008-09-12 02:46 that means, the read has to be off it 2008-09-12 02:46 reader 2008-09-12 02:46 what do you lock then ? the dtree or some part of the dtree ? 2008-09-12 02:46 you lock the block 2008-09-12 02:46 how would region locking look like ? 
2008-09-12 02:46 region locking looks like locking a subtree node 2008-09-12 02:46 then you have to wait for _every_ other lock to go away 2008-09-12 02:47 not a good idea 2008-09-12 02:47 why do you want to lock a region? 2008-09-12 02:48 posix semantics or something like that 2008-09-12 02:48 totally different locking level 2008-09-12 02:48 can't you lock a range in the file under posix ? 2008-09-12 02:48 waaay up in the vfs 2008-09-12 02:48 layered, independent 2008-09-12 02:49 also, the linux posix locking code blows 2008-09-12 02:49 coarse grained as hell 2008-09-12 02:49 true, but you can bypass it 2008-09-12 02:49 single fucking lock 2008-09-12 02:49 yup, and a linear list 2008-09-12 02:49 blows 2008-09-12 02:49 but you still do it at the same level 2008-09-12 02:49 that's not our concern now 2008-09-12 02:50 or maybe I'm just stuck in the wrong mindset 2008-09-12 02:50 could be 2008-09-12 02:50 anyway, at least we can let that suck exaclty as it always has 2008-09-12 02:50 we won't lose a benchmark showdown for that reason 2008-09-12 02:50 who uses posix locks anyway? ;) 2008-09-12 02:52 there is a case where you want to lock a region 2008-09-12 02:52 cluster fs 2008-09-12 02:52 but that's not us 2008-09-12 02:52 yet 2008-09-12 02:53 ok 2008-09-12 02:53 let's continue 2008-09-12 02:53 how do you lock the data block ? 2008-09-12 02:53 I guess for tux3 we can think of a single block as our unit of locking 2008-09-12 02:53 this is for a read remember... 2008-09-12 02:53 in kernel? 2008-09-12 02:53 take the block lock 2008-09-12 02:53 it's a bitspin lock as I recall 2008-09-12 02:54 same with the page lock 2008-09-12 02:54 it's fast enough for this purpose 2008-09-12 02:54 we'll its all something to think about 2008-09-12 02:54 in userspace 2008-09-12 02:54 pthread mutex, we will put one in each buffer 2008-09-12 02:54 that's pretty nasty 2008-09-12 02:54 so you lock the mutex in the buffer 2008-09-12 02:54 because? 
2008-09-12 02:54 how big of a file chunk are we deleting ? 2008-09-12 02:55 ah, delete 2008-09-12 02:55 well, nice thing about truncate is, we don't have to wait for it 2008-09-12 02:55 we can just mark the inode as "truncated" and we're done 2008-09-12 02:56 we don't even have to update the inode 2008-09-12 02:56 just promise to in our log 2008-09-12 02:56 or do we need to lock during the read as well ? 2008-09-12 02:56 and take our sweet time, walking through the dtree, taking locks, freeing blocks 2008-09-12 02:56 on a block basis 2008-09-12 02:56 we need to lock on read, yes 2008-09-12 02:56 on a block basis 2008-09-12 02:56 which ? what does the lock hierarchy look like 2008-09-12 02:56 ? 2008-09-12 02:57 just long enough to enter the block into the cache 2008-09-12 02:57 do we lock the itree ? dtree ? what ? 2008-09-12 02:57 we work our way down the levels of the two trees, taking locks and releasing them 2008-09-12 02:57 we only hold a lock long enough to know that we can see the next object in cache 2008-09-12 02:57 do we take reader locks along the way ? 2008-09-12 02:57 if we don't see it in cache, drop everything, read it in, start over from the top 2008-09-12 02:58 simple minded algorithm 2008-09-12 02:58 a starting point 2008-09-12 02:58 and hold them across the entire operation ? 2008-09-12 02:58 let's define this 2008-09-12 02:58 only across the operation of finding the next level down in the cache 2008-09-12 02:58 soon as we find that, we lock it, release the parent 2008-09-12 02:58 be specific 2008-09-12 02:58 make sense? 2008-09-12 02:58 I thought that was specific 2008-09-12 02:59 hold the itree lock until we get the specific dtree ?
2008-09-12 02:59 there is no itree lock 2008-09-12 02:59 no, more specific :) 2008-09-12 02:59 ok, lock the root of the itree 2008-09-12 02:59 ok 2008-09-12 02:59 that is, look for it in cache 2008-09-12 02:59 ok 2008-09-12 02:59 if it's not there, issue a read 2008-09-12 02:59 block until it is 2008-09-12 02:59 to load the portion of the itree, right ? 2008-09-12 02:59 then block until we own the read lock 2008-09-12 02:59 have a read lock 2008-09-12 03:00 nope 2008-09-12 03:00 wait... 2008-09-12 03:00 to probe down where we want to go 2008-09-12 03:00 let me summarize 2008-09-12 03:00 yes, that stops everybody 2008-09-12 03:00 probing the itree 2008-09-12 03:00 well 2008-09-12 03:00 it stops writers 2008-09-12 03:00 yes 2008-09-12 03:00 because we have a read lock on the root 2008-09-12 03:00 right 2008-09-12 03:00 we aren't going to keep it long 2008-09-12 03:00 that would be unfriendly 2008-09-12 03:00 yes 2008-09-12 03:01 so what we do is, we find the next index block down in the inode table index tree 2008-09-12 03:01 check it's in cache 2008-09-12 03:01 if so, take a read lock on it 2008-09-12 03:01 if not, drop the root lock, issue a read, block on it, start again at the root 2008-09-12 03:01 obviously this may never terminate ;-) 2008-09-12 03:02 but we have other problems if it doesn't 2008-09-12 03:02 so we look up an inode; reader lock the itree; if it's not there issue a read to load that in, release all of the above locks until that block's wait queue wakes; read that block, get that dtree link 2008-09-12 03:02 while holding the itree reader lock 2008-09-12 03:02 we don't read lock the itree 2008-09-12 03:02 correct ? 2008-09-12 03:02 we read lock the root of the itree 2008-09-12 03:02 big difference 2008-09-12 03:02 ok 2008-09-12 03:02 so let's say the itree has seven levels of index 2008-09-12 03:02 big itree 2008-09-12 03:02 what's the difference ?
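The "drop everything, read it in, start again at the root" probe described above can be sketched as follows. This is a simplified model, not tux3 code: `struct node`, `read_in` and `probe` are hypothetical, each node has a single child to keep the example short, and the blocking read is replaced by a function that just marks the node as cached.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	pthread_rwlock_t lock;
	bool cached;          /* is this block present in the buffer cache? */
	struct node *child;   /* NULL at the bottom level */
};

/* Stand-in for issuing a read and blocking until it completes. */
static void read_in(struct node *node)
{
	node->cached = true;
}

/* Walk from the root to the bottom node and return it read-locked.
 * A lock is held only long enough to see whether the next level is in
 * cache. If it is not, all locks are dropped, the block is read in and
 * the walk starts over. *restarts counts how often that happened. */
static struct node *probe(struct node *root, unsigned *restarts)
{
	*restarts = 0;
restart:
	if (!root->cached)
		read_in(root);
	pthread_rwlock_rdlock(&root->lock);
	struct node *node = root;
	while (node->child) {
		struct node *next = node->child;
		if (!next->cached) {
			pthread_rwlock_unlock(&node->lock);  /* never block on I/O holding a lock */
			read_in(next);
			(*restarts)++;
			goto restart;
		}
		pthread_rwlock_rdlock(&next->lock);   /* lock the child first ... */
		pthread_rwlock_unlock(&node->lock);   /* ... then let the parent go */
		node = next;
	}
	return node;
}
```

With a cold cache the walk restarts once per missing level, which is why the conversation calls it a simple-minded starting point.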
2008-09-12 03:02 ok 2008-09-12 03:03 we start by locking the root 2008-09-12 03:03 then we lock level one index, and drop the root lock 2008-09-12 03:03 then lock level 2 index block, and drop the level 1 2008-09-12 03:03 and so on 2008-09-12 03:03 down to level 7 2008-09-12 03:03 then we start the same process on the dtree 2008-09-12 03:03 make sense? 2008-09-12 03:03 or propagate downwards, releasing a lock 2008-09-12 03:03 kind of scary 2008-09-12 03:03 ok, it's not that scary 2008-09-12 03:04 big reason: we will normally keep hitting the same inode table block several times 2008-09-12 03:04 so we keep a "cursor" 2008-09-12 03:04 right, advance the cursor as needed 2008-09-12 03:04 what about rebalancing operations ? how does this affect it ? 2008-09-12 03:04 got to worry about how cursors interact with write locks on the itree 2008-09-12 03:04 but then 2008-09-12 03:04 that's why we're talking about it 2008-09-12 03:05 I don't know how to manipulate it other than with a big coarse grained lock at this time 2008-09-12 03:05 ok when you want to rebalance, delete, insert, split, whatever, you need a write lock 2008-09-12 03:05 on the parent and on the blocks being changed 2008-09-12 03:05 so you do the same thing 2008-09-12 03:05 I simply don't know enough about b-trees to know how to downward propagate the lock 2008-09-12 03:05 cursor 2008-09-12 03:05 me neither 2008-09-12 03:05 haven't done this before 2008-09-12 03:06 it's just brainwork though 2008-09-12 03:06 no magic 2008-09-12 03:06 ok, that's a big deal 2008-09-12 03:06 the expert on tree locking that I know of is peterz 2008-09-12 03:06 what kind of tree?
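The level-by-level descent described above (lock the next level, then drop the previous one) is commonly called lock coupling or hand-over-hand locking. The sketch below assumes a complete binary tree stored in an array purely to keep it short; `struct tree`, `take`, `drop` and `descend` are hypothetical names, and the `held`/`peak` counters exist only to show that a reader never holds more than two locks at once (they are not thread safe themselves).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

enum { LEVELS = 3, NODES = (1 << (LEVELS + 1)) - 1 };

struct tree {
	pthread_rwlock_t lock[NODES];  /* one lock per index block */
	int held, peak;                /* bookkeeping for this sketch only */
};

static void tree_init(struct tree *tree)
{
	for (int i = 0; i < NODES; i++)
		pthread_rwlock_init(&tree->lock[i], NULL);
	tree->held = tree->peak = 0;
}

static void take(struct tree *tree, unsigned node)
{
	pthread_rwlock_rdlock(&tree->lock[node]);
	if (++tree->held > tree->peak)
		tree->peak = tree->held;
}

static void drop(struct tree *tree, unsigned node)
{
	pthread_rwlock_unlock(&tree->lock[node]);
	tree->held--;
}

/* Descend from the root (node 0) to the leaf for key, one key bit per
 * level. Node i has children 2i+1 and 2i+2. Returns the leaf's index
 * with its read lock still held; the caller drops it when done. */
static unsigned descend(struct tree *tree, unsigned key)
{
	unsigned node = 0;

	take(tree, node);
	for (int level = LEVELS - 1; level >= 0; level--) {
		unsigned next = 2 * node + 1 + ((key >> level) & 1);
		take(tree, next);   /* lock the next level down ... */
		drop(tree, node);   /* ... then release the level above */
		node = next;
	}
	return node;
}
```

Because the parent is released as soon as the child is locked, a second reader can follow immediately behind and branch into a different subtree.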
2008-09-12 03:06 he's done all sorts of shit 2008-09-12 03:06 radix tree and other things 2008-09-12 03:06 I'll ping him 2008-09-12 03:06 higly concurrent trees 2008-09-12 03:06 radix tree is pretty simple 2008-09-12 03:06 highly concurrent trees 2008-09-12 03:06 compared to a filesystem index 2008-09-12 03:07 he's the best person for the job that I know of 2008-09-12 03:07 yes, I'll point peterz at it 2008-09-12 03:07 you're not bad 2008-09-12 03:07 you're asking the right questions 2008-09-12 03:07 he might not have time, but I don't think that your current track of fine graining the system upfront is the best solution 2008-09-12 03:07 ah 2008-09-12 03:08 we have another knob we can tweak 2008-09-12 03:08 you should consider seriously per cpu-ing it if possible or faking it userspace 2008-09-12 03:08 there is also a refcount on each buffer 2008-09-12 03:08 we can get highly concurrent reads with RCU, that's a given 2008-09-12 03:08 per-cpuing it before figuring out how to do it with normal locks would not be wise 2008-09-12 03:08 1) walk 2) run 2008-09-12 03:08 it's just a matter of how we can modify it to apply to your current atomic log at that time 2008-09-12 03:09 ok, see that recount comment 2008-09-12 03:09 maybe the use of an atomic counter would help to version the logs for both RCU tree nodes and the atomic disk log 2008-09-12 03:09 very important 2008-09-12 03:09 forget rcu 2008-09-12 03:09 rcu is braindamage 2008-09-12 03:10 when we want it rcu'd, we'll hand it to the rcu guys 2008-09-12 03:10 it can be but it's also a bad ass algorithm 2008-09-12 03:10 it's a real use of time 2008-09-12 03:10 depends on the kind of guarantees you need 2008-09-12 03:10 not consistent with getting a solid prototype up 2008-09-12 03:10 ok 2008-09-12 03:10 I guess basic thread safety is first 2008-09-12 03:10 anyway, the question, what happens when the itree geometry needs to change 2008-09-12 03:11 so we have all these readers walking down the tree, that's great 
2008-09-12 03:11 and they release their locks as they go so somebody can come behind and maybe go down a different subtree 2008-09-12 03:11 very nice already 2008-09-12 03:11 but 2008-09-12 03:11 how do you change the tree geometry? 2008-09-12 03:11 well 2008-09-12 03:12 you can actually do it when there are readers buzzing away inside subtrees that you're moving around 2008-09-12 03:12 that is cool 2008-09-12 03:12 you just need to write lock the parent and read lock the children, so you know that the tasks ahead of you have gotten off the parent 2008-09-12 03:13 then you can change the parent 2008-09-12 03:13 make sense? 2008-09-12 03:13 you can for example, split the parent 2008-09-12 03:13 and then you may have to change the parent's parent 2008-09-12 03:13 well 2008-09-12 03:13 fun 2008-09-12 03:14 ACTION reads 2008-09-12 03:14 you might have to check the path to find out how high up the splits will go, and get write locks on that whole chain 2008-09-12 03:14 I was just talking to peterz 2008-09-12 03:14 asked him the same questions we were talking about here 2008-09-12 03:14 he's not going to be around much since he's headed to plumber's 2008-09-12 03:14 the comparison of answers must be fascinating 2008-09-12 03:15 I think that itree node deletion needs to be tied to file handle semantics somehow 2008-09-12 03:15 tell him good luck with the plumbing, there is a lot of shit in those pipes 2008-09-12 03:15 na, forget that 2008-09-12 03:15 good 2008-09-12 03:15 it's way wrong ;) 2008-09-12 03:16 peterz used a lock in a linked list node to protect link modification 2008-09-12 03:16 so think about why you're deleting an itree node 2008-09-12 03:16 yeah, you're completely removing it 2008-09-12 03:16 but why? 2008-09-12 03:16 it's not really what you want 2008-09-12 03:17 I'm agreening with you 2008-09-12 03:17 it's because you're coalescing the itree, and why are you doing that?
2008-09-12 03:17 agreeing 2008-09-12 03:17 I know 2008-09-12 03:17 I'm doing rhetoric 2008-09-12 03:17 ok 2008-09-12 03:17 so you're doing that because you've just delete masses of files and you want to tighten up the inode table tree a little 2008-09-12 03:17 actually, this is quite optional 2008-09-12 03:18 we don't really need to do that 2008-09-12 03:18 particularly if we tend to reuse the same inode numbers in the not too distant future 2008-09-12 03:18 it's only if we are determined to use completely different ones, for no good reason, that we need to fiddle the geometry of the itree on delete 2008-09-12 03:19 anway 2008-09-12 03:19 let's assume that we do want to be tidy and coalesce the itree frequently, even if we are not required to 2008-09-12 03:20 that means merging nodes in general 2008-09-12 03:20 just what I was talking about 2008-09-12 03:20 well 2008-09-12 03:20 no, I was thinking of splitting above 2008-09-12 03:20 merging is more forgiving as far as locking the access path goes 2008-09-12 03:20 tough problem 2008-09-12 03:21 I'd go with something simple first 2008-09-12 03:21 not really, you only need to write lock the parent and the two blocks being merged 2008-09-12 03:21 this is too much for a prototype 2008-09-12 03:21 but the first thing to do is to define specifically how a coarse grained set of locks would work on it 2008-09-12 03:21 it is traditional to start with a single global lock on each btree 2008-09-12 03:21 and then propagate downwards 2008-09-12 03:21 and find out how badly that sucks 2008-09-12 03:21 right 2008-09-12 03:21 we know in advance it sucks too much 2008-09-12 03:21 so why bother? 
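For the merge case mentioned above ("write lock the parent and the two blocks being merged"), a rough sketch might look like this. The structures and `merge_leaves` are hypothetical and much simpler than a real btree: one parent holding pointers to leaves, with locks taken parent first and then left to right, which matches the root-to-leaf order used for reads.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <string.h>

enum { MAXKEYS = 8 };

struct leaf {
	pthread_rwlock_t lock;
	unsigned count;
	unsigned keys[MAXKEYS];
};

struct parent {
	pthread_rwlock_t lock;
	unsigned count;
	struct leaf *child[MAXKEYS];
};

/* Merge child[at + 1] into child[at] and unlink it from the parent.
 * Returns 1 on success, 0 if there is no right sibling or the combined
 * keys would not fit in one leaf. */
static int merge_leaves(struct parent *parent, unsigned at)
{
	int merged = 0;

	pthread_rwlock_wrlock(&parent->lock);          /* parent first */
	if (at + 1 < parent->count) {
		struct leaf *left = parent->child[at];
		struct leaf *right = parent->child[at + 1];

		pthread_rwlock_wrlock(&left->lock);    /* then left to right */
		pthread_rwlock_wrlock(&right->lock);
		if (left->count + right->count <= MAXKEYS) {
			memcpy(left->keys + left->count, right->keys,
			       right->count * sizeof right->keys[0]);
			left->count += right->count;
			right->count = 0;
			/* close the gap left by the removed child pointer */
			memmove(&parent->child[at + 1], &parent->child[at + 2],
				(parent->count - at - 2) * sizeof parent->child[0]);
			parent->count--;
			merged = 1;
		}
		pthread_rwlock_unlock(&right->lock);
		pthread_rwlock_unlock(&left->lock);
	}
	pthread_rwlock_unlock(&parent->lock);
	return merged;
}
```

Only three blocks are locked, so readers working in unrelated subtrees are not disturbed. A split that propagates upward needs more care, as the conversation notes.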
2008-09-12 03:21 we have lockstat so we can get a good idea of how sucky it is and it will be sucky 2008-09-12 03:21 maybe in next week's prototype 2008-09-12 03:21 then that's it 2008-09-12 03:22 ok 2008-09-12 03:22 the biggest focus at this time is to get your prototype working fully 2008-09-12 03:22 true 2008-09-12 03:22 with concurrency 2008-09-12 03:22 in user space 2008-09-12 03:22 had a slight change of philosophy there 2008-09-12 03:22 the best thing we can do is make provisions to do fine grained locking or per cpu-ification in the future easily 2008-09-12 03:22 when the fuse stuff landed 2008-09-12 03:22 not solve the entire problem upfront 2008-09-12 03:22 right 2008-09-12 03:22 well 2008-09-12 03:23 I don't think you mean by per-cpu what I mean 2008-09-12 03:23 but let's ask interesting questions and get help from folks like peterz 2008-09-12 03:23 per-cpu to me means replicating the relevant date per-cpu 2008-09-12 03:23 that's a big mess 2008-09-12 03:23 last resort 2008-09-12 03:23 all that 2008-09-12 03:24 no, 2008-09-12 03:24 it's about avoiding locking in the first place during an inode operation 2008-09-12 03:24 like how? 
2008-09-12 03:24 as much locking as possible under that operation 2008-09-12 03:25 like making the entire read path as per cpu as possible 2008-09-12 03:25 vfs is going to take i_sem, can't tell it not to 2008-09-12 03:25 we have to stick to the part we own 2008-09-12 03:25 yeah 2008-09-12 03:25 and are responsible for 2008-09-12 03:25 which is our indexing structures 2008-09-12 03:25 so we're already taking an inode lock of some sort 2008-09-12 03:25 different inode 2008-09-12 03:26 there's the struct inode, which the vfs locks, and the image of the inode on an inode table block, which we lock 2008-09-12 03:41 sleeping time 2008-09-12 03:46 ok 2008-09-12 03:46 night 2008-09-12 03:46 later flips sleep well 2008-09-12 03:46 ACTION is going to be up still 2008-09-12 03:47 night 2008-09-12 08:44 -!- tim_dimm(~timothyhu@adsl-67-114-40-138.dsl.scrm01.pacbell.net) has joined #tux3 2008-09-12 09:10 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-12 09:26 -!- eli(~elicriffi@66.249.86.209) has joined #tux3 2008-09-12 10:24 -!- kd(kdpict@118.94.54.179) has joined #tux3 2008-09-12 11:15 -!- kushal(kdpict@118.94.54.179) has joined #tux3 2008-09-12 11:39 -!- pgquiles(~pgquiles@229.Red-83-49-101.dynamicIP.rima-tde.net) has joined #tux3 2008-09-12 13:05 flips: 2008-09-12 13:05 05:09 < peterz> bh: not too hard - I send you a paper on that iirc 2008-09-12 13:05 05:09 < giel> not too hard implementation-wise, or time/space wise? 2008-09-12 13:05 05:10 < giel> complexity theory! 2008-09-12 13:05 05:10 < peterz> implementation wise :-) 2008-09-12 13:05 05:10 < peterz> the btree space/time considerations don't change 2008-09-12 13:05 05:11 < peterz> the thing that's hardest about the fine grain locking is the optimistic locking approach 2008-09-12 13:05 05:11 < peterz> you'd have to work out where upwards traversal stops on your way down 2008-09-12 13:05 which channel? 
2008-09-12 13:05 he's exactly right 2008-09-12 13:06 woke up thinking about precisely that 2008-09-12 13:06 this morning 2008-09-12 13:10 #offtopic2 2008-09-12 13:10 but he's traveling right now 2008-09-12 13:11 to KS and Plumbers 2008-09-12 13:11 it's not offtopic ;-) 2008-09-12 13:11 I'll invite peterz here 2008-09-12 13:11 better than #offtopic 2008-09-12 13:12 KS is going to be buzzing about tux3 ;-) 2008-09-12 13:12 lots of trash talking from the trash talkers 2008-09-12 13:13 KS has degenerated to mostly wanking 2008-09-12 13:13 very little tech gets done there any more 2008-09-12 13:13 just climbers getting fact time 2008-09-12 13:13 are they ? 2008-09-12 13:13 regarding buzz ? 2008-09-12 13:13 ? 2008-09-12 13:14 face time I mean 2008-09-12 13:14 oh are they a bunch of wankers now ? this is a publically logged channel keep in mind :) 2008-09-12 13:14 ah right 2008-09-12 13:14 well that's a public comment 2008-09-12 13:14 ACTION giggles 2008-09-12 13:14 not trying to make friends, eh ? :) 2008-09-12 13:14 never have gotten along well with wankers 2008-09-12 13:15 just me 2008-09-12 13:15 yeah, well, I can do with less political wanking and more changes into the kernel 2008-09-12 13:15 yep 2008-09-12 13:15 opensolaris helps focus on that 2008-09-12 13:15 not enough yet 2008-09-12 13:16 specificallly a couple of things I've been planning for years but never had the time to really do 2008-09-12 13:16 linux is losing "customers" to opensolaris 2008-09-12 13:16 it's a fact 2008-09-12 13:16 oh really ? 2008-09-12 13:16 not desktoppers, but datacenters 2008-09-12 13:16 backrooms 2008-09-12 13:16 the guys with money 2008-09-12 13:17 flips: have peterz resend you the paper, I don't know where it is for at the moment 2008-09-12 13:17 paper? 
2008-09-12 13:18 the paper on the topic regarding trees and locking 2008-09-12 13:18 would be nice 2008-09-12 13:18 ask him for it 2008-09-12 13:18 sure 2008-09-12 13:19 we have to do some minor changes to btree.c I think 2008-09-12 13:19 because it currently climbs the path when it has to split 2008-09-12 13:19 it has to descend instead, and it has to drop locks before doing that 2008-09-12 13:20 so it may find that somebody else has changed the object is was looking at when it gets back down 2008-09-12 13:20 the object can't be deleted fortunately 2008-09-12 13:20 because its caller must hold a reference 2008-09-12 13:20 it'll have to be locked in-roder 2008-09-12 13:20 in-order 2008-09-12 13:20 on the cache image 2008-09-12 13:20 what ever that is 2008-09-12 13:20 so that is the rule: you have to hold a ref on the cached object before you can delete the disk object 2008-09-12 13:21 and the ref count of the cached object must be equal to one 2008-09-12 13:21 good example of unwritten lore about the kernel 2008-09-12 13:21 books don't tell you that 2008-09-12 13:21 but anybody who is allowed to touch core vfs knows that 2008-09-12 13:21 fs hackers have to know it do, and often don't 2008-09-12 13:21 know it too that is 2008-09-12 13:22 locking order for a btree is simple 2008-09-12 13:22 root-to-leaf 2008-09-12 13:22 left-to-right if that granularity matters, which it doesn't 2008-09-12 13:22 so just root-to-leaf 2008-09-12 13:23 but resize_btree goes leaf-to-root, doesn't work 2008-09-12 13:31 ok, time to stop cleaning up and write some refcounting code 2008-09-12 13:31 no comments back on my post last night 2008-09-12 13:31 I thought folks would chew on that 2008-09-12 13:32 it's really core to tux3 performance in general 2008-09-12 13:32 not just atom refcounting 2008-09-12 13:33 hi. i just copied and pasted the irclogs from the tux3 university sessions. As I haven't seen them on the mailing list as of yet, should I post them? 
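The rule stated above (hold a reference on the cached object, and delete the disk object only when that reference is the last one) can be written down as a small check. This is an illustrative, single-threaded sketch with hypothetical names (`cached_block`, `get_block`, `put_block`, `try_delete`); real code would have to make the test and the delete atomic with respect to other threads.

```c
#include <assert.h>
#include <stdbool.h>

struct cached_block {
	unsigned refcount;   /* references held on the cached image */
	bool on_disk;        /* does the disk object still exist? */
};

static void get_block(struct cached_block *block)
{
	block->refcount++;
}

static void put_block(struct cached_block *block)
{
	assert(block->refcount > 0);
	block->refcount--;
}

/* The caller must already hold one reference. The disk object may be
 * deleted only if that is the sole reference, meaning nobody else can
 * be looking at the cached image. */
static bool try_delete(struct cached_block *block)
{
	if (block->refcount != 1)
		return false;
	block->on_disk = false;
	return true;
}
```

A count of zero also fails the check, which covers the case where the caller forgot to take its own reference first.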
2008-09-12 13:33 sure 2008-09-12 13:34 complete with all the swearing ;-) 2008-09-12 13:34 this is real life university 2008-09-12 13:34 i'll make sure of it :) 2008-09-12 13:34 might replace some of the bad words with @%$@# 2008-09-12 13:34 or not 2008-09-12 13:34 guess not :) 2008-09-12 13:34 whatever you think is right ;) 2008-09-12 13:34 nice nick 2008-09-12 13:35 well, it was datapunk when I was like 15 or so 2008-09-12 13:35 also good 2008-09-12 13:35 and as they tend to get shorte it now resembles a trekkie name 2008-09-12 13:35 but thanks 2008-09-12 13:35 you have piercings? or just virtual piercings? 2008-09-12 13:35 just virtual 2008-09-12 13:36 :) 2008-09-12 13:36 some of my best friends in berlin had some interesting piercings 2008-09-12 13:36 but for example, harald avoids it 2008-09-12 13:36 works better in the boardroom 2008-09-12 13:36 I was going to look at the reasons for it, but is the problem with deleting files known? 2008-09-12 13:37 frist I heard of it 2008-09-12 13:37 go ahead on it 2008-09-12 13:37 well, i don't particularly like them 2008-09-12 13:37 I wasn't very careful when I put that in 2008-09-12 13:37 ok, will do, after a little algebra session 2008-09-12 13:37 see you later 2008-09-12 13:37 wo wohnst du? 2008-09-12 13:37 karlsruhe 2008-09-12 13:37 if you know it 2008-09-12 13:37 ah cool 2008-09-12 13:37 near SAS 2008-09-12 13:37 sure, been there a few times 2008-09-12 13:38 quite 2008-09-12 13:38 quiet 2008-09-12 13:38 just like the name 2008-09-12 13:38 lots of geeks in the area 2008-09-12 13:38 yep, they are 2008-09-12 13:38 CS is pretty strong 2008-09-12 13:38 suse not far away 2008-09-12 13:38 ibm 2008-09-12 13:38 not sas 2008-09-12 13:38 um 2008-09-12 13:38 um 2008-09-12 13:38 sap? 2008-09-12 13:38 right 2008-09-12 13:38 where I've been too 2008-09-12 13:39 there's a great guy there 2008-09-12 13:39 gotten around a lot? 
2008-09-12 13:39 drei jahre in Deutscheland 2008-09-12 13:39 um, 6 jahre 2008-09-12 13:40 that would be 6 Jahre 2008-09-12 13:40 getting rusty 2008-09-12 13:40 for work i guess? 2008-09-12 13:40 and fun 2008-09-12 13:40 well, i've only been to the usa for 11 months 2008-09-12 13:40 and that was for school... and fun 2008-09-12 13:41 that's about enough to be honest 2008-09-12 13:41 berlin is a lot more fun 2008-09-12 13:41 and less tense 2008-09-12 13:41 contrary to popular belief 2008-09-12 13:41 only been there a few times, mostly during the ccc congresses 2008-09-12 13:41 but it certainly is fun 2008-09-12 13:41 geek hotbed 2008-09-12 13:42 ok gotta go, will be back in an hour or so 2008-09-12 13:42 hottest hotbed in europe imho 2008-09-12 13:42 bis spater dann 2008-09-12 13:42 und zu weit weg :) 2008-09-12 13:42 bis dann 2008-09-12 13:42 ACTION is getting really rusty ;) 2008-09-12 13:44 I just had a thought 2008-09-12 13:44 we should schedule an official tux3 cabal meeting for Oct 31 2008-09-12 13:44 on irc, plus a real location in LA 2008-09-12 13:47 might be a good time 2008-09-12 14:12 -!- tim_dimm(~timothyhu@adsl-67-114-40-138.dsl.scrm01.pacbell.net) has joined #tux3 2008-09-12 14:12 hi tim_dimm 2008-09-12 14:12 coming up for air? 2008-09-12 14:12 hi flips 2008-09-12 14:12 trying to 2008-09-12 14:12 manage a quick skate today? 2008-09-12 14:13 still in sacramento 2008-09-12 14:13 oh right 2008-09-12 14:13 Pi and Persey are still in the ICU 2008-09-12 14:13 and it would be inadvisable anyway 2008-09-12 14:13 how's that? 
2008-09-12 14:13 you got a week with no weeks under you to look forward to 2008-09-12 14:13 heh 2008-09-12 14:13 can't justify skating, unless it is to the nursery 2008-09-12 14:14 with no wheels under you I meant 2008-09-12 14:14 I can justify it if the heart rate is up enough 2008-09-12 14:14 getting bad typoitis here 2008-09-12 14:14 full word typos now 2008-09-12 14:14 happens when the volume of code goes stratospheric 2008-09-12 14:14 just read through the first of two tux3 U sessions 2008-09-12 14:14 it hung together pretty well 2008-09-12 14:14 going to now 2008-09-12 14:15 not too much pure bs 2008-09-12 14:15 some content 2008-09-12 14:15 what time of day did you start? 2008-09-12 14:15 today? 2008-09-12 14:15 or last night? 2008-09-12 14:15 8 pm tue and thur 2008-09-12 14:15 no, for the U 2008-09-12 14:15 k 2008-09-12 14:15 will be regular 2008-09-12 14:15 as far as I can manage 2008-09-12 14:15 sounds like a great tool for building community 2008-09-12 14:15 we had one inner linux guru here thursday 2008-09-12 14:16 eric biederman 2008-09-12 14:16 linux cluster guy 2008-09-12 14:16 the linux cluster guy 2008-09-12 14:16 and of course natalie is an inner linux gal 2008-09-12 14:16 googling... 2008-09-12 14:16 you'll get a few hits ;-) 2008-09-12 14:17 you get that email this am about SE Linux? 2008-09-12 14:17 223K to be exact 2008-09-12 14:18 no 2008-09-12 14:18 let me check 2008-09-12 14:18 oh 2008-09-12 14:18 yes 2008-09-12 14:19 right direction? 2008-09-12 14:19 knew about apparmor, suse's better answer to selinux 2008-09-12 14:19 uses the same kernel hooks 2008-09-12 14:19 yes 2008-09-12 14:19 I'm not sure how much apparmor is being worked on right now 2008-09-12 14:19 it's another one of those good projects that gets beaten up by something sloppier but more devs 2008-09-12 14:20 apparently, Novell canned all the engineers working on it in '07 2008-09-12 14:20 bh...Got any intel on apparmor? 2008-09-12 14:20 ? 
2008-09-12 14:20 ah 2008-09-12 14:20 ACTION pokes bh 2008-09-12 14:21 heh 2008-09-12 14:21 be nice to know what happened there 2008-09-12 14:21 see who's maintaining it even 2008-09-12 14:21 somebody always maintains os projects 2008-09-12 14:21 they never die... except for evms 2008-09-12 14:21 RIP 2008-09-12 14:21 well 2008-09-12 14:22 lvm3 will rise ;-) 2008-09-12 14:22 we're about a month away from serious lvm3 development 2008-09-12 14:22 http://www.novell.com/linux/security/apparmor/selinux_comparison.html 2008-09-12 14:22 kickoff 2008-09-12 14:22 tim_dimm, there's a proposal to have a public tux3 cabal meeting on Oct 31 2008-09-12 14:22 physically located at a certain garage I'm thinking of 2008-09-12 14:23 and on the web/net 2008-09-12 14:23 what think you? 2008-09-12 14:23 I'm there barring spit-up, diaper changes and burping sessions 2008-09-12 14:23 barring? 2008-09-12 14:23 should be irrespective of 2008-09-12 14:23 uh, poor choice of words 2008-09-12 14:23 anything except burping 2008-09-12 14:23 farting? 2008-09-12 14:23 not good enough 2008-09-12 14:23 too early to teeth 2008-09-12 14:24 how do you spell that- teeeethhh 2008-09-12 14:24 you know 2008-09-12 14:24 dana had one the first week 2008-09-12 14:24 was hell for anna 2008-09-12 14:24 i bet 2008-09-12 14:24 she quickly learned how to punish mommy with it 2008-09-12 14:25 so began a somewaht tense relationship ;) 2008-09-12 14:25 still? 
2008-09-12 14:25 ;-) 2008-09-12 14:25 of course 2008-09-12 14:25 but detent has set in, mutual respect, mommy love, all that 2008-09-12 14:25 this apparently lasts till about 9 YO 2008-09-12 14:26 with luck 2008-09-12 14:26 guess they grow out of it 2008-09-12 14:26 tween is the new teen 2008-09-12 14:26 anyway 2008-09-12 14:26 we better stop talking like that 2008-09-12 14:26 or all the devs willrun away screaming 2008-09-12 14:27 tux3 and child-rearing 2008-09-12 14:27 and skating 2008-09-12 14:27 you will have lots of time to learn C while you're burping 2008-09-12 14:27 and rocking 2008-09-12 14:27 right, on that note I think I'll go skate now 2008-09-12 14:27 oh, big news 2008-09-12 14:27 k 2008-09-12 14:28 all ears 2008-09-12 14:28 (eyes) 2008-09-12 14:28 skateboarders clapped for my move yesterday 2008-09-12 14:28 nice- what was it? 2008-09-12 14:28 pronounced: "ok rollerbladers are allowed now" 2008-09-12 14:28 nothing much 2008-09-12 14:28 grind? 2008-09-12 14:28 skated up on the little vert wall on one skate, tapped the top with the other, skated down on one skate 2008-09-12 14:29 nice 2008-09-12 14:29 been grinding and getting nodes 2008-09-12 14:29 also skating down the top of the grinding wall 2008-09-12 14:29 very skinny 2008-09-12 14:29 tough to stay on 2008-09-12 14:29 it has an S curve at the end 2008-09-12 14:29 careful, grinds lead to crashes which leads to wrist injuries 2008-09-12 14:29 not much, but enough to drop you off 2008-09-12 14:29 my grinds aren't really grinds 2008-09-12 14:30 just slding down the rail onthe side of my skate 2008-09-12 14:30 one foot 2008-09-12 14:30 no danger 2008-09-12 14:30 I need to get protection before doing anything more 2008-09-12 14:30 makes lots of noise 2008-09-12 14:30 attracts attention ;) 2008-09-12 14:31 found the head of the U logs 2008-09-12 14:31 reading now 2008-09-12 14:31 have fun 2008-09-12 14:31 loads 2008-09-12 15:08 i just noticed: someone (?) said that dentries were 132 bytes. 
on my system it says 200. Normal deviations? 2008-09-12 15:09 or just a different kernel version? 2008-09-12 15:10 I'm reading the logs right now, see 8 references to dentries. none mention how many bytes 2008-09-12 15:11 second.05:29 < RazvanM> dentry 253015 253576 132 29 1 : tunables 120 60 8 : slabdata 8744 8744 0 2008-09-12 15:12 my bad- I searched for dentries 2008-09-12 15:12 not dentry 2008-09-12 15:13 well, not really important. just something i was wondering about 2008-09-12 15:14 data, 64 bit kernel? 2008-09-12 15:15 grossly big aren't they 2008-09-12 15:15 filename "foo" turns into a 200 byte dentry, and that's far from all the cache gobbling for that little guy 2008-09-12 15:15 yes, it is. right on both accounts. 2008-09-12 15:16 that's what makes sysfs such an idiotic idea 2008-09-12 15:16 take tiny little ascii strings which are already bloated way beyond the binary rep, and blow them up into gigantic, slow, awkward things 2008-09-12 15:17 then implement it badly on top of that 2008-09-12 15:17 and have a crappy internal and external interface 2008-09-12 15:17 bugs 2008-09-12 15:17 unstable api 2008-09-12 15:17 and you have the piece of shit we see today 2008-09-12 15:17 just thought I'd share that ;-) 2008-09-12 15:50 sk8 oclock 2008-09-12 15:54 ACTION is back 2008-09-12 15:54 I know nothing about apparmor 2008-09-12 15:57 kernel klink I presume ;-) 2008-09-12 15:57 (hogan's hero's) 2008-09-12 15:58 diaper-30 2008-09-12 15:58 l8tr 2008-09-12 18:17 nuther cuppa 2008-09-12 18:17 should be good enough to get refcounting implemented 2008-09-12 18:32 ok I see october 31st is a friday 2008-09-12 18:32 that means that the tux3 cabal meeting has to be a party 2008-09-12 18:32 might have to scale this up 2008-09-12 19:04 #define REFCOUNT_TABLE_BLOCK (1ULL << 28) 2008-09-12 19:04 #define REFCOUNT_HIGH_BLOCK (REFCOUNT_TABLE_BLOCK + (1ULL << 21)) 2008-09-12 19:04 #define UNATOM_TABLE_BLOCK (REFCOUNT_TABLE_BLOCK + (1ULL << 23)) 2008-09-12 19:10 -!- 
Aks(~ankitsriv@123.237.71.198) has joined #tux3 2008-09-12 19:16 Hi all, I am new to this project and want to know abt versioned pointers. 2008-09-12 19:16 welcome 2008-09-12 19:16 read the post yet? 2008-09-12 19:17 http://lwn.net/Articles/288896/ 2008-09-12 19:17 thanks for the link 2008-09-12 19:18 enjoy 2008-09-12 19:22 atom = dir->sb->atomgen++; /* use refcount for allocation */ 2008-09-12 19:22 if (!ext2_create_entry(dir, name, len, atom, 0)) { 2008-09-12 19:22 unsigned block = ATOM_REFCOUNT_BLOCK + (atom >> (dir->sb->blockbits - 1)); 2008-09-12 19:22 struct buffer *buffer = bread(dir->map, block); 2008-09-12 19:22 *(u16 *)buffer->data += 1; 2008-09-12 19:22 brelse(buffer); 2008-09-12 19:22 return atom; 2008-09-12 19:22 } 2008-09-12 19:23 got to put in the carry bit handling 2008-09-12 19:29 mistake in that code 2008-09-12 19:30 ACTION challenges tux3 readers to find it 2008-09-12 19:30 it's in the block calc 2008-09-12 19:32 ACTION steps out for a bit 2008-09-12 19:32 oh, and I have to do endian conversion 2008-09-12 19:32 almost forgot 2008-09-12 20:20 atom >> (dir->sb->blockbits - 1) - that looks weird, although I'm not actually sure what exactly blockbits is, either way the -1 smells wrong 2008-09-12 20:21 ah, never mind 2008-09-12 20:21 *(u16 *)buffer->data += 1 <- this is wrong since this is first u16 in block 2008-09-12 20:23 lacks [atom & ((1 << (dir->sb->blockbits)) - 1] instead of the prefix '*' 2008-09-12 20:23 ie. it should be ((u16 *)buffer->data)[atom & ((1 << (dir->sb->blockbits)) - 1]++; 2008-09-12 20:24 ((u16 *)buffer->data)[atom & ((1 << dir->sb->blockbits) - 1)]++; 2008-09-12 20:24 parentheses got mixed up 2008-09-12 20:25 from which I guess blockbits on a 4K filesystem is lg2(4Ki) = 12 2008-09-12 20:26 at which point the first comment about it smelling funny is irrelevant (I thought blockbits was the number of bits in a block, ie. 4Ki * 8 for a 4KiB block) 2008-09-12 20:26 so you're using bread/brelse, not bios? 
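The block and index arithmetic being debugged above can be sketched roughly as follows. This is only an illustration under stated assumptions: blockbits is log2 of the block size in bytes (12 for a 4 KiB block, as guessed above), each refcount is a u16 so one block holds 2^(blockbits - 1) counters, and the names REFCOUNT_BASE, struct slot and refcount_slot are hypothetical, not taken from the tux3 source. Note the index mask covers blockbits - 1 bits, one fewer than in the expression posted above, because each counter occupies two bytes.

```c
#include <stdint.h>

/* Hypothetical base block of the refcount table; the value is arbitrary
 * and only assumed here for the sake of the example. */
#define REFCOUNT_BASE (1ULL << 28)

struct slot {
	uint64_t block;  /* table block that holds the counter */
	unsigned index;  /* which u16 within that block */
};

/* Map an atom number to the location of its 16-bit refcount.
 * A block of 2^blockbits bytes holds 2^(blockbits - 1) u16 counters. */
static struct slot refcount_slot(uint32_t atom, unsigned blockbits)
{
	unsigned shift = blockbits - 1;
	struct slot slot = {
		.block = REFCOUNT_BASE + (atom >> shift),
		.index = atom & ((1U << shift) - 1),
	};
	return slot;
}
```

With a slot in hand, the increment would then be something like ((u16 *)buffer->data)[slot.index]++ on the buffer read for slot.block, still subject to the endian conversion and carry handling mentioned in the discussion.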
2008-09-12 20:26 does brelse bwrite? 2008-09-12 20:27 I guess it must work like some sort of in kernel mmap 2008-09-12 20:27 hence the no nead for explicit write back 2008-09-12 20:27 at which point I guess we rely on cpu page dirty bits to actually know whether we need to write back 2008-09-12 20:31 unless bread/brelse, are actually operations on blocks within a file, which is suggested by the dir->map first parameter 2008-09-12 20:31 hmm, isn't it clear I have no bloody idea what I'm talking about yet? 2008-09-12 20:31 and I'm talking into a blackhole... 2008-09-12 20:31 start talking to yourself... then you know you're going crazy. 2008-09-12 20:41 -!- Kirantpatil(~kiran@122.167.202.116) has joined #tux3 2008-09-12 21:08 maze, it's all messed up, actually 2008-09-12 21:08 rev on the way 2008-09-12 21:08 you are right about the lacks 2008-09-12 21:08 maze, I should just have asked you to write it ;-) 2008-09-12 21:08 ACTION will be back 2008-09-12 21:09 which lacks 2008-09-12 21:09 don't leave ;-) I'm here 2008-09-12 21:09 what am I right about? I was guessing above? 
2008-09-12 21:10 ofcourse with u16* in their it's all only host-endian compatible 2008-09-12 21:10 s/their/there 2008-09-12 21:11 I think stick to 1 byte counters - deal away with endianness in all but 0.5% of the cases 2008-09-12 21:12 eh, I'm not sure this entire effort is worth it 2008-09-12 21:12 I think you just need to support a half-dozen hard coded atoms, a small list of atoms for selinux, and store the rest as strings 2008-09-12 21:12 ultimately just gzipping the xattr block may be the easiest 2008-09-12 21:13 for everything besides selinux/acl 2008-09-12 21:13 [still parsing through the binary acl encoding, to see if it can be faked] 2008-09-12 21:20 -!- nataliep(~nataliep@207.47.98.129.static.nextweb.net) has joined #tux3 2008-09-12 21:33 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-12 21:35 although linux currently seems to use something more like 32 [4 byte header] + (16 [tag/type] + 16[...rwx] + 32[default=-1]) * 4 + (16 [tag/type] + 16[...rwx] + 32[uid/gid]) * [# of exceptions] bits 2008-09-12 21:35 either way, while these are normally small - or not even present - they can grow arbitrally large 2008-09-12 21:37 hmm 2008-09-12 21:37 interesting questions 2008-09-12 21:37 do filesystems in linux implement selinux and acls, or do they just implement xattr - would think just xattr, but... there are ext2/3/4... etc acl.h 2008-09-12 21:38 oh, directories get doubled entries, one being default 2008-09-12 21:38 the other being actual 2008-09-12 21:39 the 4 byte header is the version (lendian 2) 2008-09-12 22:00 back 2008-09-12 22:00 maze, it's not much effort 2008-09-12 22:00 . 2008-09-12 22:00 and it exactly emulates the ascii xattr behaviour, with superior compression 2008-09-12 22:01 now I just need to write it right ;) 2008-09-12 22:01 yes, but there are some issues there 2008-09-12 22:01 there are? 
2008-09-12 22:01 for example: selinux is a quad of four values 2008-09-12 22:01 I thought I was just about done 2008-09-12 22:01 if you have a lot of valid states for each of them 2008-09-12 22:01 then the total number of states for the entire quad can blossom 2008-09-12 22:01 those are attr bodies 2008-09-12 22:02 we're working on attr names at the moment 2008-09-12 22:02 ah, and see... this is the tricky part 2008-09-12 22:02 you kind of have to look at some of them at the same time 2008-09-12 22:02 some of... selinux acls? 2008-09-12 22:03 so basically, once you strip out the selinux and extended acl xattrs (that's 3 different xattr strings), all that's left is barely used by anyone 2008-09-12 22:03 ok I see what you're saying 2008-09-12 22:03 sorry, didn't read carefully 2008-09-12 22:03 you're running out way ahead of me 2008-09-12 22:03 as usual 2008-09-12 22:03 it may not be worth optimizing that... 2008-09-12 22:03 well it can be optimized at the selinux level 2008-09-12 22:04 exactly. 2008-09-12 22:04 maybe not quite as efficiently 2008-09-12 22:04 maybe more 2008-09-12 22:04 so selinux basically needs to be optimized 2008-09-12 22:04 let's let them tell us 2008-09-12 22:04 acls need to be optimized 2008-09-12 22:04 probably 2008-09-12 22:04 and then all the rest needs to be (maybe?) optimized if we feel like it 2008-09-12 22:04 so I suggest that once xattrs are working properly, we invite the selinux folks to come over and do an audit 2008-09-12 22:04 and what needs to be optimized is not the headers (ie. the security.something= part) 2008-09-12 22:04 but the bodies 2008-09-12 22:05 ie. the part after the = 2008-09-12 22:05 both 2008-09-12 22:05 imho 2008-09-12 22:05 the part before the = sign is trivial, there are about 9 values to compress as atoms, leave the rest as strings 2008-09-12 22:05 the heads are much less variable, therefore so much easier to optimize 2008-09-12 22:05 agreed. 
2008-09-12 22:05 the problem, though, isn't so much what to optimize 2008-09-12 22:06 so... bodies 2008-09-12 22:06 not that hard 2008-09-12 22:06 but how to store this, and where... 2008-09-12 22:06 but maybe not appropriate at this level 2008-09-12 22:06 we'll see 2008-09-12 22:06 and which parts to store on disk where 2008-09-12 22:06 it's kind of important to plan this out correctly to begin with, because this is ondisk format, not in memory 2008-09-12 22:06 lyou know, we'd probably get more mileage out of giving the selinux guys a way to run their own dictionary 2008-09-12 22:06 just like our atom dictionary 2008-09-12 22:06 exactly... 2008-09-12 22:06 and have an api for it 2008-09-12 22:06 kay 2008-09-12 22:06 we'll propose it 2008-09-12 22:06 I'm convinced now we need 4 dicts for selinux (one per quad) 2008-09-12 22:07 but first let them have a look at the basics 2008-09-12 22:07 and a dictionary for acls 2008-09-12 22:07 and a dictionary for 'other' xattrs headers 2008-09-12 22:07 it's my understanding they're usually disappointed with performance etc of just basic xattrs 2008-09-12 22:07 so basically what we need is some extensible dict interface 2008-09-12 22:07 trying not to let tux3 fall into that 2008-09-12 22:07 which falls into the log nicely 2008-09-12 22:07 right 2008-09-12 22:07 and it's app specific 2008-09-12 22:07 we need to export an api, not for dicts 2008-09-12 22:07 not four dicts 2008-09-12 22:08 the api would be the normal add/remove/list xattr api that everybody uses 2008-09-12 22:08 what's important is how we store it internally in the fs 2008-09-12 22:08 plus a wahy of dividing it into four 2008-09-12 22:08 that's easy - split on : 2008-09-12 22:08 bleah 2008-09-12 22:08 no parsing 2008-09-12 22:08 in the fs 2008-09-12 22:09 mechanism, not policy 2008-09-12 22:09 yeah, well, that's gonna have to happen, unless you don't want to split the quads 2008-09-12 22:09 which could potentiall explode the dics 2008-09-12 22:09 no parsing ;-) 
2008-09-12 22:09 notice, tux3 has no parsing 2008-09-12 22:09 if we want we can use zlib 2008-09-12 22:09 or something more global 2008-09-12 22:09 you're missing the point here ;-) 2008-09-12 22:09 zlib is nice and all that 2008-09-12 22:10 but selinux xattrs are used on every frickin file access 2008-09-12 22:10 still 2008-09-12 22:10 they have to be blazing fast 2008-09-12 22:10 no parsing ;-) 2008-09-12 22:10 we need a better solution 2008-09-12 22:10 you can make the parsing in such a way that it'll still work even if it doesn't parse 2008-09-12 22:10 preferably one that performs even better than stupid ascii colon separated strings 2008-09-12 22:10 ah, but the ascii colon seperated strings are the api 2008-09-12 22:11 you have to do it that way 2008-09-12 22:11 sucky api 2008-09-12 22:11 unless we rip through all of the selinux code in the kernel 2008-09-12 22:11 anyway, linux does not have an acl api 2008-09-12 22:11 selinux does 2008-09-12 22:11 different 2008-09-12 22:11 the vfs layer provides an xattr api 2008-09-12 22:11 that's what we have to implement 2008-09-12 22:11 right, only that 2008-09-12 22:11 well 2008-09-12 22:11 we don't have to parse 2008-09-12 22:11 however internally we have to make it deal with the common cases quickly 2008-09-12 22:12 we can compress on byte pair if we like 2008-09-12 22:12 byte pair of what? you're getting strings in the api? 
2008-09-12 22:12 byte pairs is a typical compression method 2008-09-12 22:12 16 bit values work better with a dict than 8 or 48 2008-09-12 22:12 for example 2008-09-12 22:12 and the common cases are going to be reading (and to a lesser extent writing) selinux xattrs and less often (but still very often) extended acls 2008-09-12 22:13 verging on premature optimization here 2008-09-12 22:13 no no, don't think lzw compression - that doesn't buy us anything here 2008-09-12 22:13 the selinux guys will cream all over if xattrs just work fine 2008-09-12 22:13 what needs to be done is we need to explicitly remove the selinux/acl from the xattr code and not treat them in the fs as xattrs at all 2008-09-12 22:13 treat them like you treat inode permissions 2008-09-12 22:14 put them directly in the inode 2008-09-12 22:14 have you measured the actual disbribution of unique quads? 2008-09-12 22:14 I thought you did that 2008-09-12 22:14 yes 2008-09-12 22:14 and it came out very tight 2008-09-12 22:14 exactly 2008-09-12 22:14 tightly clustered too 2008-09-12 22:14 so what's the problem 2008-09-12 22:14 but that's not something we can guarantee on a prod system 2008-09-12 22:14 just atomize the common ones 2008-09-12 22:14 store the weirdos literally 2008-09-12 22:14 agreed. 2008-09-12 22:14 ok 2008-09-12 22:14 so let's do it 2008-09-12 22:15 hmm, how to put this 2008-09-12 22:15 you don't want the xattr_get(selinux_xattr) 2008-09-12 22:15 to have to parse the entire xattr block for the inode 2008-09-12 22:16 it doesn't 2008-09-12 22:16 it only looks in the xcache 2008-09-12 22:16 but xattrs of other types, can be pretty much unique per file... 
2008-09-12 22:16 (above md5/sha1 hash case) 2008-09-12 22:17 right 2008-09-12 22:17 so I guess we're going to check a hash of the xattr 2008-09-12 22:17 htree style 2008-09-12 22:17 so having the two very differently performing/characteristic concepts in one place will most likely break performance 2008-09-12 22:18 it's not very hard to look for likely atomize candidates I think 2008-09-12 22:18 depends on what the symbols of the alphabet 2008-09-12 22:18 and how deeply you're parsing 2008-09-12 22:18 I say, just put everything in the dict 2008-09-12 22:18 why not? 2008-09-12 22:18 has to be stored somewhere 2008-09-12 22:19 security.selinux="unconfined_u:object_r:default_t:s0\000" 2008-09-12 22:19 system.posix_acl_access="0sAgAAAAEABwD/////AgAEAGQAAAAEAAUA/////xAABQD/////IAAFAP////8=" 2008-09-12 22:19 user.hash="sdfsdfjhsdjfhsdjkfhdjskahfjkdsahkj" 2008-09-12 22:19 the dict is as good a place as any 2008-09-12 22:19 how would you atomize the above? 2008-09-12 22:19 ext2_find_entry 2008-09-12 22:19 later, htree_find_entry 2008-09-12 22:20 sucky compression in the example 2008-09-12 22:21 so how will the dict deal, with a few dozen entries with milions of occurences, a few hundred with tens of thousands, and a few million entries with one (to a couple) occurence(s) each 2008-09-12 22:21 that's a real world scenario straight of my laptop 20G drive 2008-09-12 22:21 millions of occurences, what's the problem? 
2008-09-12 22:21 few million, easy 2008-09-12 22:21 that's what htree does 2008-09-12 22:21 handles millions of entries 2008-09-12 22:21 really fast 2008-09-12 22:22 uhm, I think my problem is I'm not convinced it's fast enough, when it could be O(1) 2008-09-12 22:22 o(1) is always good 2008-09-12 22:22 the millions of unique entries are blossoming the tree 2008-09-12 22:22 but damm fast is damm fast 2008-09-12 22:23 it's a btree 2008-09-12 22:23 slowing down accesses for the millions of entries 2008-09-12 22:23 it likes to blossom 2008-09-12 22:23 it says "go ahead, make my day" 2008-09-12 22:23 right, but are common entries stored nearer the root? 2008-09-12 22:23 never 2008-09-12 22:23 no - because it's a btree 2008-09-12 22:23 right 2008-09-12 22:23 so you've got o(depth) lookups 2008-09-12 22:23 very flat 2008-09-12 22:23 usually only two levels 2008-09-12 22:23 precisely what you want to avoid 2008-09-12 22:23 for a few million entires 2008-09-12 22:23 depth is smaller than you think 2008-09-12 22:24 much smaller 2008-09-12 22:24 but you're thinking of access speed from a disk io performance 2008-09-12 22:24 outlook 2008-09-12 22:24 nope 2008-09-12 22:24 we need to be fast in ram 2008-09-12 22:24 cpu speed 2008-09-12 22:24 that's what it works at 2008-09-12 22:24 dirops are cpu bound 2008-09-12 22:24 not disk bound 2008-09-12 22:25 ok, I'm firmly of the opinion we need 2 different dicts/htrees at the minimum 2008-09-12 22:25 I agree: 1) heads 2) bodies 2008-09-12 22:25 but I know you want to parse and segment 2008-09-12 22:25 one small one for the stuff which is known to exist all over the place (selinux/acl bodies) 2008-09-12 22:25 I don't think we should, the selinux guys should 2008-09-12 22:25 but 2008-09-12 22:25 the other for non-standard bodies 2008-09-12 22:25 I;'ll keep an open mind 2008-09-12 22:25 we would need to change the kernel vfs interface of selinux - I don't see that happening 2008-09-12 22:26 and then we'd need to keep around the old one 
for other fs'es anyway 2008-09-12 22:26 well xattrs already have namespaces 2008-09-12 22:26 part of the abpi 2008-09-12 22:26 api 2008-09-12 22:26 braindamaged part 2008-09-12 22:26 and even if we don't split, we'll still get perf boosts 2008-09-12 22:26 (don't split on :) 2008-09-12 22:26 that's a colon ) 2008-09-12 22:26 I thought it was a smile :) 2008-09-12 22:26 well it was a ':' than a ')' 2008-09-12 22:26 ;-) 2008-09-12 22:27 ok, immediate goal is to fix the brain damage in my refcounting 2008-09-12 22:27 if you have a dentry and inode already in memory 2008-09-12 22:27 sorry about the pile of poo I posed ;) 2008-09-12 22:27 how long does it take to fetch the xattrs for that inode? 2008-09-12 22:27 order of magnitude 2008-09-12 22:27 oh that's another thing... we can easily put a small hash in front of the atom dict 2008-09-12 22:28 very easily 2008-09-12 22:28 sub microsecond 2008-09-12 22:28 I guess 2008-09-12 22:28 [because with the correct implementation the above is less than 50 cycles] 2008-09-12 22:28 sure, and a microsecond is about 3,000 2008-09-12 22:28 60 times slower 2008-09-12 22:29 so there's something to be gained 2008-09-12 22:29 I thinjk we gain most of it from putting a hash in front of the dirops 2008-09-12 22:29 hash of what? 2008-09-12 22:29 so we end up with level 1, level 2 2008-09-12 22:29 hash of the thing we're atomizing 2008-09-12 22:29 just keep the common ones there 2008-09-12 22:29 let the cold ones drop off 2008-09-12 22:30 ok, now you've gotten ahead of me... 2008-09-12 22:30 it's a linux meme 2008-09-12 22:30 dentry hash as an example of that 2008-09-12 22:30 ok, how about, first a question: what exactly is an atom (example?) 2008-09-12 22:30 sucky example 2008-09-12 22:30 [how big is an atom] 2008-09-12 22:30 it's just a small integer with a name 2008-09-12 22:30 and a refcount 2008-09-12 22:30 ok, the names, can we have an example? 
2008-09-12 22:30 the name of an atom is up to 255 chars (tradition) 2008-09-12 22:31 names are unrestricted 2008-09-12 22:31 pascal strings ;-) 2008-09-12 22:31 right 2008-09-12 22:31 in fact they are pascal strings 2008-09-12 22:31 that's what ext2 uses 2008-09-12 22:31 and what I always use 2008-09-12 22:31 ok, so how would you atomize (where would the atom boundaries be) in the above 3 line xattr example I posted? 2008-09-12 22:32 beats the crap out of shitty C strings 2008-09-12 22:32 strlen is fast ;-) 2008-09-12 22:32 sucks compared to looking up a byte 2008-09-12 22:32 strlen does cacheline damage 2008-09-12 22:32 "considered harmful to cache lines" 2008-09-12 22:32 I meant strlen is fast on pascal strings 2008-09-12 22:32 right 2008-09-12 22:33 I'd atomize the whole 3 line xattr 2008-09-12 22:33 and store the atom 2008-09-12 22:33 the whole thing? 2008-09-12 22:33 and have a limit of 2^48 atoms 2008-09-12 22:33 ok, on my drive, you'd have all refcounts = 1 2008-09-12 22:33 right now, it's 2^32 atoms 2008-09-12 22:33 if we do bodies, might want to widen that 2008-09-12 22:33 sure 2008-09-12 22:33 who cares 2008-09-12 22:34 selinux does 2008-09-12 22:34 refcounts take hardly any space 2008-09-12 22:34 2 bytes each 2008-09-12 22:34 now, as soon as something _does_ collide, you know right away 2008-09-12 22:34 that is 2008-09-12 22:34 match another body 2008-09-12 22:34 well 2008-09-12 22:34 anyway 2008-09-12 22:34 it's premature 2008-09-12 22:35 xattrs have to work 2008-09-12 22:35 or nobody cares how well we compress acls 2008-09-12 22:35 yes, but they have to be treated seperately 2008-09-12 22:35 here - I'll write up how it should be done IMHO 2008-09-12 22:35 I know, you want to tokenize 2008-09-12 22:35 the xattr string 2008-09-12 22:35 and compress that way 2008-09-12 22:36 so why don't they tokenize? 
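The length-prefixed ("pascal string") atom names described above might look something like this in C. It is a minimal sketch, assuming a single length byte followed by up to 255 bytes of name with no NUL terminator; struct pname, pname_set and pname_equal are hypothetical names for illustration and are not the ext2 or tux3 structures.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Length-prefixed name: one length byte, then the bytes of the name.
 * No terminator, so the length is found by reading a single byte. */
struct pname {
	uint8_t len;
	char text[255];
};

/* Store a name; returns -1 if it does not fit in 255 bytes. */
static int pname_set(struct pname *name, const char *src, size_t len)
{
	if (len > 255)
		return -1;
	name->len = (uint8_t)len;
	memcpy(name->text, src, len);
	return 0;
}

/* Compare lengths first, so most mismatches never touch the text. */
static int pname_equal(const struct pname *a, const struct pname *b)
{
	return a->len == b->len && !memcmp(a->text, b->text, a->len);
}
```

Checking the length byte before the text is the point of the "looking up a byte" remark: unlike strlen on a C string, no scan of the name is needed to learn how long it is.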
2008-09-12 22:36 yes please 2008-09-12 22:36 and let's invite the selinuxen to read that post 2008-09-12 22:37 builtin:[security.selinux] sedict1:[unconfined_u] sedict2:[object_r] sedict3:[default_t] sedict4:[s0] 2008-09-12 22:37 builtin:[system.posix_acl_access] acl_dict:[0sAgAAAAEABwD/////AgAEAGQAAAAEAAUA/////xAABQD/////IAAFAP////8] 2008-09-12 22:37 builtin:[user.] user_dict:[hash] user_dict:[sdfsdfjhsdjfhsdjkfhdjskahfjkdsahkj] 2008-09-12 22:37 I'm still researching the subject 2008-09-12 22:37 and then you want to store the first two directly within the inode 2008-09-12 22:37 got to fix my mess here now 2008-09-12 22:37 that's for one file? 2008-09-12 22:38 sedict1=2=3=4 could be the same dict, potentially could be the same dict as the acl_dict, potentially the same as builtin 2008-09-12 22:38 yes 2008-09-12 22:38 we stoe all of the directly in the inode 2008-09-12 22:38 on disk 2008-09-12 22:38 and cache them in memory 2008-09-12 22:38 when the inode is loaded 2008-09-12 22:38 so what's the size of an inode on disk? 2008-09-12 22:38 variable 2008-09-12 22:38 maximum? 2008-09-12 22:39 from about 40 bytes to unlimited 2008-09-12 22:39 current limitation is an inode table block 2008-09-12 22:39 but that will go away 2008-09-12 22:40 hmm, I need to start writing a junk fs 2008-09-12 22:40 to get a better feeling for the kernel interfaces 2008-09-12 22:40 you'll get an excellent chance very soon 2008-09-12 22:40 we're going to start by porting a junk fs to kernel 2008-09-12 22:40 next tuesday we'll do that 2008-09-12 22:41 really? 
2008-09-12 22:41 hmm 2008-09-12 22:41 promise 2008-09-12 22:41 you said you wanted me to pick up the pace 2008-09-12 22:41 pick it up we shall 2008-09-12 22:41 hope we don't lose anybody ;) 2008-09-12 22:41 I'm beginning to think an fs should actually have (at least) two layers 2008-09-12 22:42 basically the frontend (UI / interface with vfs) and the backend (interface with block devices) 2008-09-12 22:42 with the ability for the middle to be network seperated 2008-09-12 22:42 what about the inodes in the middle? 2008-09-12 22:42 you need a clean api in the middle that deals correctly with coherency issues 2008-09-12 22:43 but I think this is the only way to get a well performing net fs 2008-09-12 22:43 you are entirely correct, and that is how tux3 is structured 2008-09-12 22:43 it has the cache level and the block level 2008-09-12 22:43 they are separately cleanly... better be 2008-09-12 22:43 or it simply won't work 2008-09-12 22:44 the backend is then a get/set/lock/unlock/notify system 2008-09-12 22:44 well 2008-09-12 22:44 kinda 2008-09-12 22:44 the back end is more like async messages 2008-09-12 22:44 loosely 2008-09-12 22:44 very loosely 2008-09-12 22:44 yeah, that kind of describes what I'm thinking 2008-09-12 22:45 hard to phrase really 2008-09-12 22:45 especially since it's still unclear to me ;-) 2008-09-12 22:45 the fact it has to be implemented as to separate pieces is now clear to me, with an interface layer that uses tcp-ip 2008-09-12 22:45 BUT 2008-09-12 22:45 can short-circuit the network stack on local host 2008-09-12 22:46 great, it was never clear to me ;) 2008-09-12 22:46 it just came out like that 2008-09-12 22:46 did it itself 2008-09-12 22:46 here's an example: 2008-09-12 22:47 application -> user space -> kernel space -> vfs layer -> client file system layer -> send rpc call -> network stack -> receive rpc call -> dispatch -> server file system layer -> block device layer 2008-09-12 22:47 and that's only half the loop 2008-09-12 22:47 for 
an nfs 2008-09-12 22:47 now if the tcp/ip network stack layer does cookies or UUID to identify that it's talking to itself, than it can zip it up to 2008-09-12 22:48 client fs layer -> direct dispatch -> server fs layer 2008-09-12 22:48 and of course there's the return path 2008-09-12 22:48 that's the kind of thinking that originally lead to nfs 2008-09-12 22:48 actually, the reverse of that 2008-09-12 22:48 and it has to be part sync, part async, part notify 2008-09-12 22:48 we had your second one 2008-09-12 22:48 and some genius decided it could easily be hacked to be the first one 2008-09-12 22:48 anyway 2008-09-12 22:49 it's not a NFS 2008-09-12 22:49 the real problem is now how to minimize the latency and data sent across the net in the middle 2008-09-12 22:49 it will become a cluster fs before it becomes an nfs 2008-09-12 22:49 and that relies on doing cache coherency and read/write (various types of) and notifications of changes/lock/lock-breaking correctly 2008-09-12 22:49 yes 2008-09-12 22:50 anyway... 
enough about my plans to conquer the world 2008-09-12 22:50 which _nobody_ in the oss world has succeed in doing well 2008-09-12 22:50 probably also not in the propietary world either 2008-09-12 22:50 anyway, you need to plan the entire fs from the ground up with the assumption all the clients (even the local host) are remote 2008-09-12 22:50 since we can't see the code or try it we don't know 2008-09-12 22:50 that way you don't need to deal with the local host specially 2008-09-12 22:50 well 2008-09-12 22:51 you don't have to put in remote hooks from the beginning 2008-09-12 22:51 [except for the dispatch optimization] 2008-09-12 22:51 you just have to be aware of where problems can be created 2008-09-12 22:51 you have to design it as if they were there 2008-09-12 22:51 you might not code it quite like that 2008-09-12 22:51 although I think you should 2008-09-12 22:51 whre it costs nothing, yes 2008-09-12 22:51 even if the net code is a shim .h file 2008-09-12 22:51 that's seldom the case 2008-09-12 22:52 but the real problem is, you're answering a demand that doesn't exist 2008-09-12 22:52 people have been optimizing for the wrong situation ;-) 2008-09-12 22:52 you're hoping your nfs will be so much more amazing, everybody will use it instead of sucky nfs 2008-09-12 22:52 what demand do you think that is? 2008-09-12 22:52 but you're likely to be amazed and disappointed 2008-09-12 22:52 truthfully? 
2008-09-12 22:52 there's very little demand for a good nfs outside of hpc 2008-09-12 22:52 and they like lustre 2008-09-12 22:52 I don't care about who uses it or not ;-) 2008-09-12 22:53 prefectly happy 2008-09-12 22:53 they just want it to be more reliable and faster 2008-09-12 22:53 I just like good design 2008-09-12 22:53 well 2008-09-12 22:53 just writing a dlm to support it will keep you busy for months 2008-09-12 22:53 if you know _exactly_ what to do 2008-09-12 22:53 yeah, I know, not a good way to design it ;-) 2008-09-12 22:53 if you want to make money 2008-09-12 22:53 but oh well 2008-09-12 22:54 you can do it 2008-09-12 22:54 if you already have something working 2008-09-12 22:54 that people want 2008-09-12 22:54 and are willing to bribe you to make even more like what they want 2008-09-12 22:54 eh, dlm's aren't that hard if you have a clean api and don't have to deal with prior borkage 2008-09-12 22:54 well 2008-09-12 22:54 "not hard" translates into several months, trust me 2008-09-12 22:54 problem is the leakage of breakage from outside 2008-09-12 22:54 but prove me wrong 2008-09-12 22:55 I would like to have a good dlm 2008-09-12 22:55 in fact 2008-09-12 22:55 would you be kind enough to post a design note on dlm? 2008-09-12 22:55 because I'd like to cluster tux3 2008-09-12 22:55 nope, because I don't have the design yet ;-) 2008-09-12 22:55 by this time next year 2008-09-12 22:55 well 2008-09-12 22:55 when? 
2008-09-12 22:55 I think I'm going to try writing a junk fs this weekend 2008-09-12 22:55 unless stuff burns (I'm oncall) 2008-09-12 22:55 great 2008-09-12 22:56 and we'll see how well I understand the kernel apis 2008-09-12 22:56 I'll check in with you on saturday when you're 50% done 2008-09-12 22:56 you'll figure them out fast 2008-09-12 22:56 and I'm going to start writing directly in kernel space, because that's the entire purpose of the exercise ;-) 2008-09-12 22:56 little painful to get some of the crap to behave 2008-09-12 22:56 I believe debugging is probably easiest in kvm? 2008-09-12 22:56 uml 2008-09-12 22:56 far and away 2008-09-12 22:56 why? 2008-09-12 22:57 can you strace ? 2008-09-12 22:57 just: make defconfig ARCH=um && make linux ARCH=um; ./linux ubd0=/my/rootfs 2008-09-12 22:57 that's it 2008-09-12 22:57 you can gdb it 2008-09-12 22:57 ah 2008-09-12 22:57 takes a little coaxing 2008-09-12 22:58 where are the logs stored for this channel btw? 2008-09-12 22:58 checking tux3.org 2008-09-12 22:58 linked from shapor's page I think, which is linked from tux3.org 2008-09-12 22:58 http://shapor.com/tux3/irclogs/current.txt 2008-09-12 22:58 hehe, uptodate to the second 2008-09-12 22:59 ok, you will be having fun writing lots of new fs code and I will be slaving away finishing xattrs 2008-09-12 22:59 you got the better deal 2008-09-12 23:00 $ wget -q -O - "http://shapor.com/tux3/irclogs/current.txt" | cut -b18- | sed -rn 's@^<([^>]*)>.*@\1@p' | sort | uniq -c | sort -nr | head -n 9 2008-09-12 23:00 6427 flips 2008-09-12 23:00 1199 shapor 2008-09-12 23:00 908 MaZe 2008-09-12 23:00 818 bh 2008-09-12 23:00 616 konrad 2008-09-12 23:00 369 tim_dimm 2008-09-12 23:00 113 vandenoever 2008-09-12 23:00 104 RazvanM 2008-09-12 23:00 96 flipz 2008-09-12 23:00 interesting stats there 2008-09-12 23:01 how the hell I'm number 3 on that list I'll never know... 
2008-09-12 23:01 you're moving up fast 2008-09-12 23:01 fast typer 2008-09-12 23:01 just have to press the enter key enough :) 2008-09-12 23:01 well, yeah 2008-09-12 23:01 and wiggle those fingers 2008-09-12 23:01 now everybody is just typing stuff to boost their ranking ;-) 2008-09-12 23:02 nice example of sed chickentracks 2008-09-12 23:02 I am a sed-maniac 2008-09-12 23:02 you can cut and paste your code examples if you need a quick boost 2008-09-12 23:03 right, well, I also have a copy of spore waiting for me... 2008-09-12 23:03 wonder if it's any good 2008-09-12 23:04 let me know 2008-09-12 23:04 and I should reinstall my desktop with the next version of ubuntu 2008-09-12 23:04 my 4 year old can't wait to get her hands on pure 2008-09-12 23:04 the quad racing game 2008-09-12 23:04 from disney 2008-09-12 23:04 demon is much fun 2008-09-12 23:04 demo 2008-09-12 23:05 I'll probably take a few spins around the italian track after I do the next refcount iter 2008-09-12 23:05 folks 2008-09-12 23:05 unsigned attomoff = (atom << 1) & (-1 << blockbits); 2008-09-12 23:05 hey bh 2008-09-12 23:05 re above: new fs code includes xattrs eventually ;-) 2008-09-12 23:05 wow, when I need a reason not to code, one quickly arrives 2008-09-12 23:05 good luck with that ;) 2008-09-12 23:06 for me, a week on xattrs alone 2008-09-12 23:06 maybe you're faster 2008-09-12 23:06 uhm that code you posted looks wrong 2008-09-12 23:06 missing ~ 2008-09-12 23:06 yeah 2008-09-12 23:06 unsigned attomoff = (atom << 1) & ~(-1 << blockbits); 2008-09-12 23:06 that's why I pasted it ;) 2008-09-12 23:06 better than a compiler 2008-09-12 23:06 oh, sorry 2008-09-12 23:06 didn't realize it was a quiz 2008-09-12 23:06 heh 2008-09-12 23:07 no it was me actually fucking up 2008-09-12 23:07 in a way 2008-09-12 23:07 oh paste in wrong window? 
2008-09-12 23:07 and in a way, not 2008-09-12 23:07 no 2008-09-12 23:07 I pasted it, you saw the bug 2008-09-12 23:07 nice 2008-09-12 23:07 heh 2008-09-12 23:08 unsigned block = ATOM_REFCOUNT_BLOCK + ((atom >> blockits) << 1)); 2008-09-12 23:09 unsigned block = ATOM_REFCOUNT_BLOCK + (atom >> (blockits - 1)); 2008-09-12 23:09 the above is wrong 2008-09-12 23:09 again ;-) 2008-09-12 23:09 since it's always even 2008-09-12 23:09 are you running an IQ test or something? 2008-09-12 23:10 a stupidity test on myself 2008-09-12 23:10 atom is not always even 2008-09-12 23:10 why would it be? 2008-09-12 23:10 block is 2008-09-12 23:11 ((atom >> blockits) << 1)) <- always even 2008-09-12 23:11 the code you pasted results in blocks even/oddness being constant 2008-09-12 23:11 yeah, gosh 2008-09-12 23:11 I'd assume you don't want that 2008-09-12 23:11 well actually I think I do 2008-09-12 23:11 the even block is for the low 16 bits 2008-09-12 23:11 the odd for the high 2008-09-12 23:12 then you're still 1 off 2008-09-12 23:12 probably 2008-09-12 23:12 since what you described is true with double 8 bits 2008-09-12 23:12 not with double 16 bits 2008-09-12 23:12 I was definitely conflating things and being fuzzy 2008-09-12 23:12 that's why you will get your xattrs written in _less_ than a week 2008-09-12 23:13 unsigned block = ATOM_REFCOUNT_BLOCK + (atom >> (blockits - 1)) << 1; 2008-09-12 23:13 is what you want if you want double-blocks of u16s with low and high blocks 2008-09-12 23:13 yes 2008-09-12 23:13 thanks 2008-09-12 23:13 although that deserves a comment ;-) 2008-09-12 23:13 see, now I'm getting semantically addressable code 2008-09-12 23:14 I loosely indicate the semantics, and the code comes back 2008-09-12 23:14 what do you mean? 
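A minimal sketch of the addressing worked out above, assuming the interleaved layout described in the conversation (the even block holds the low 16 bits, the odd block the high 16 bits) and an even base block. The value of ATOM_REFCOUNT_BLOCK and the helper names are hypothetical, not taken from the tux3 source. Two details differ from the pasted one-liners: the mask is built from `~0u` rather than by left-shifting `-1`, and the final shift is parenthesized, because `+` binds tighter than `<<` in C and the expression as pasted would shift the base as well.

```c
#include <assert.h>

/* Hypothetical base block of the refcount table; assumed even so that
 * "even block = low half, odd block = high half" holds. */
enum { ATOM_REFCOUNT_BLOCK = 1000 };

/* Byte offset of an atom's u16 counter inside its block:
 * two bytes per atom, wrapped to the block size. */
static unsigned atom_offset(unsigned atom, unsigned blockbits)
{
    assert(blockbits >= 1 && blockbits < 32);
    return (atom << 1) & ~(~0u << blockbits);
}

/* A block of 2^blockbits bytes holds 2^(blockbits - 1) u16 counters,
 * so atom >> (blockbits - 1) selects the block pair; doubling that index
 * gives the even (low 16 bits) block of the pair. */
static unsigned atom_low_block(unsigned atom, unsigned blockbits)
{
    assert(blockbits >= 1 && blockbits < 32);
    return ATOM_REFCOUNT_BLOCK + ((atom >> (blockbits - 1)) << 1);
}

/* The odd block right after it carries the high 16 bits. */
static unsigned atom_high_block(unsigned atom, unsigned blockbits)
{
    return atom_low_block(atom, blockbits) + 1;
}
```

With 4096-byte blocks (blockbits = 12), atoms 0 through 2047 share the first block pair and atom 2048 starts the next pair at offset 0.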
2008-09-12 23:14 that's the one we want I think 2008-09-12 23:14 I think you'd actually want to spread it in a different way for performance reasons 2008-09-12 23:15 first all the low blocks, then all the high blocks 2008-09-12 23:15 since the high blocks will almost never be updated 2008-09-12 23:15 thus you want low blocks to be sequential on disk 2008-09-12 23:15 16 bit count with carry into the high block is a very nice balance between getting lots of atoms onto a block and not carrying too often 2008-09-12 23:15 there's no advantage to separating it that way 2008-09-12 23:15 that I know of 2008-09-12 23:15 a _very_ small advantage in the radix tree lookup 2008-09-12 23:16 but it was my first idea, what you said 2008-09-12 23:16 and I might stick with that indeed 2008-09-12 23:16 I think having two different ATOM_REFCOUNT_BLOCK would make for cleaner easier to understand code 2008-09-12 23:16 two different? 2008-09-12 23:16 ATOM_REFCOUNT_{LOW,HIGH}_BLOCK 2008-09-12 23:17 sure 2008-09-12 23:17 it's that way now 2008-09-12 23:17 before you posted that expression 2008-09-12 23:17 I haven't actually looked at the code ;-) 2008-09-12 23:17 just buggy 2008-09-12 23:17 ok 2008-09-12 23:17 give me an hour 2008-09-12 23:17 and you get to look at working, tested code 2008-09-12 23:17 I program much better when I'm not chatting 2008-09-12 23:18 ok, I think I'm gonna finally head home from work, and maybe reboot into mac os x and start up spore - and stop bothering you ;-) 2008-09-12 23:18 wow 2008-09-12 23:18 didn't realize you're camping out there 2008-09-12 23:18 but why not 2008-09-12 23:18 infinite junk food 2008-09-12 23:18 sweet sound of vacuum cleaners in the distance 2008-09-12 23:19 plus 30" screen 2008-09-12 23:19 not to mention my place is a studio 2008-09-12 23:19 that can be fixed 2008-09-12 23:19 just put in a ticket 2008-09-12 23:19 has an empty fridge (ok, I have spaghetti) 2008-09-12 23:19 you'll have one at home 2008-09-12 23:19 yah 2008-09-12 23:19 is a 
total mess 2008-09-12 23:19 you need to get that ramyun 2008-09-12 23:20 get it on the way home 2008-09-12 23:20 and is lacking (even after more than 2 years) basic amenities like a desk 2008-09-12 23:20 ouch 2008-09-12 23:20 just order one online 2008-09-12 23:20 ikea 2008-09-12 23:20 have it in 3 days 2008-09-12 23:20 because I've never found a good way to fit one in 2008-09-12 23:20 some assembly required 2008-09-12 23:20 that's the trick 2008-09-12 23:20 so to be fair, even back in Poland, when I had a desk, I spent most time with laptop on lap on bed 2008-09-12 23:20 and move to phoenix where you can have a real house ;) 2008-09-12 23:20 which is one of the reasons I've never bothered 2008-09-12 23:21 how do you say "hello" in polish? 2008-09-12 23:21 hallo? 2008-09-12 23:21 Dzień Dobry. 2008-09-12 23:21 is Good Day 2008-09-12 23:21 hello is 'halo' 2008-09-12 23:21 characters didn't work in xchat 2008-09-12 23:21 but that's kind of like pulled in from english 2008-09-12 23:22 that's the unaccented version of above? 2008-09-12 23:22 Dzien' Dobry 2008-09-12 23:22 n with accent like / 2008-09-12 23:22 got it 2008-09-12 23:22 like sheen dobry 2008-09-12 23:22 kinda 2008-09-12 23:22 with more bite 2008-09-12 23:22 and 'halo' is more like a phone greeting when you pick up then really hello 2008-09-12 23:23 ok, I'll try it on a pole tomorrow and see if it works 2008-09-12 23:23 you're more likely to use 'hej' (hey) between friends in person, and dzien dobry for more formal purposes and 'halo' when picking up the phone 2008-09-12 23:23 hmm 2008-09-12 23:23 let me write it down more phonetically 2008-09-12 23:23 ugh 2008-09-12 23:24 dobry = [hard d] [short o] [hard b] [hard trilled r] [short i/y] 2008-09-12 23:25 oh yeah I can say dobry 2008-09-12 23:25 even with the trill 2008-09-12 23:25 so dzien, is dzie - en - two syllables 2008-09-12 23:26 tshee - ehn 2008-09-12 23:26 ?
2008-09-12 23:26 were dzi is a consonant, e is a vowel and n' is a soft nasal n 2008-09-12 23:26 nice 2008-09-12 23:26 okay maybe more like one syllable, kind of hard to say because soft vowels (and there's two of them here) are kind of syllable like 2008-09-12 23:27 so the i in dzi takes the sound dz and makes it soft 2008-09-12 23:27 good enough to try 2008-09-12 23:27 I'll get corrected soon enough 2008-09-12 23:27 which is exactly the purpose of the accent on the n (accent on consonants is written as an i infront of a vowel - hence the dz "i" en' 2008-09-12 23:27 that's really dz'en' 2008-09-12 23:28 ah 2008-09-12 23:28 and dz is a single letter 2008-09-12 23:28 wow, more complex than say czech 2008-09-12 23:28 that just happens to be written with two 2008-09-12 23:28 since we ran out of latin characters 2008-09-12 23:29 hence dz rz sz cz ch should basically be treated as single letters 2008-09-12 23:29 and then there's sounds like drz brz and so on, which are almost like a single letter 2008-09-12 23:29 ugh 2008-09-12 23:29 with a trill? 2008-09-12 23:30 while you can theoretically read it as b-rz and d-rz, almost everybody pronounces is it quickly where it kind of melds into one 2008-09-12 23:30 and there's no 'r' in their ;-) 2008-09-12 23:30 since rz is actually 'z with a dot' ie. 
ż 2008-09-12 23:30 which is the first letter of my last name 2008-09-12 23:30 which is almost like the j in french 'je' 2008-09-12 23:30 so more like 'zh' 2008-09-12 23:31 IC 2008-09-12 23:31 it's actually very consistant 2008-09-12 23:31 very few words can't be read correctly if you know the relatively short set of rules 2008-09-12 23:31 there are very few exceptions 2008-09-12 23:31 I'll learn those rules 2008-09-12 23:31 over time 2008-09-12 23:32 slavic languages are fun 2008-09-12 23:32 great langues for being cynical in 2008-09-12 23:32 from what I have seen 2008-09-12 23:32 an example being frozen (zmarznięty), where the syllable split is zmar-znię-ty [the ę is nasal e, or e with a , ogonek accent - like french c with cedilla, but flipped other way] 2008-09-12 23:32 whoops 2008-09-12 23:32 none of those chars are working on braindead xchat 2008-09-12 23:32 so even though you see rz it ain't zh 2008-09-12 23:33 those were all the same character 2008-09-12 23:33 time for a less braindead chat client 2008-09-12 23:33 probably because I'm writing in utf8 2008-09-12 23:33 yes 2008-09-12 23:33 although they show up fine in pidgin 2008-09-12 23:33 but xchat should grok that 2008-09-12 23:33 xchat probably does 2008-09-12 23:33 xchat sucks at everything by default 2008-09-12 23:33 I'd guess your terminal is not utf8 or something 2008-09-12 23:33 I wouldn't assume taht 2008-09-12 23:33 it doesn't run in a terminal 2008-09-12 23:35 oh, it doesn't? 2008-09-12 23:35 oh right the x ;-) 2008-09-12 23:36 what are we on here?
2008-09-12 23:36 -!- maze_pallas(~elbereth@216-239-45-4.google.com) has joined #tux3 2008-09-12 23:36 oftc 2008-09-12 23:36 let's see if my xchat works 2008-09-12 23:36 ąćęłńóśźż 2008-09-12 23:36 works 4 me 2008-09-12 23:36 ąćęłńóśźż 2008-09-12 23:36 yup 2008-09-12 23:37 ĄĆĘŁŃÓŚŹŻ 2008-09-12 23:37 xchat does utf8 great here 2008-09-12 23:37 ĄĆĘŁŃÓŚŹŻ 2008-09-12 23:37 well 2008-09-12 23:37 right 2008-09-12 23:37 I have to set something 2008-09-12 23:37 or upgrade 2008-09-12 23:37 but you probably need to have proper locale 2008-09-12 23:37 oh 2008-09-12 23:37 xchat_2.6.1-0ubuntu2_i386.deb here 2008-09-12 23:37 I have xchat 2.8.4 2008-09-12 23:37 freshly installed 2008-09-12 23:37 I thought unicode was supposed to be independent of locale 2008-09-12 23:38 -!- maze_(~maze@216-239-45-4.google.com) has joined #tux3 2008-09-12 23:38 testing ;-) 2008-09-12 23:38 ?????ó??? 2008-09-12 23:38 nope 2008-09-12 23:38 2.6.8-0.3 2008-09-12 23:38 etch 2008-09-12 23:38 :p 2008-09-12 23:38 try LC_ALL=en_US.utf-8 xchat as the startup command 2008-09-12 23:39 -!- maze__(~maze@216-239-45-4.google.com) has joined #tux3 2008-09-12 23:39 testing 2008-09-12 23:39 ĄĆĘŁŃÓŚŹŻ ąćęłńóśźż 2008-09-12 23:39 yep works 2008-09-12 23:39 it's gotten crowded in here 2008-09-12 23:40 I guess I don't really know what that means... 2008-09-12 23:40 amazed at all the maze clones 2008-09-12 23:40 you better get home 2008-09-12 23:40 I need to hear about spore 2008-09-12 23:40 but obviously something is broken if xchat's wire encoding depends on the locale 2008-09-12 23:41 maybe there's a switch or something 2008-09-12 23:42 I'm giving up on getting xchat to show those chars 2008-09-12 23:42 ah no idea 2008-09-12 23:42 just quit 2008-09-12 23:42 one day I will do a gui for irssi 2008-09-12 23:42 and restart with LC_ALL=en_US.utf-8 xchat 2008-09-12 23:42 keep promising myself 2008-09-12 23:42 decloned. 2008-09-12 23:43 going to try the above? or shall I head home?
2008-09-12 23:44 head home 2008-09-12 23:44 ok will do so then 2008-09-12 23:44 I'll have it working when you get there 2008-09-12 23:44 and you can start your fs 2008-09-12 23:44 or your spore review 2008-09-12 23:45 pick up some ramyun on the way in case it turns into a long one 2008-09-12 23:45 ok, might do so, there's a store on route after all... 2008-09-12 23:48 hey 2008-09-12 23:48 hi 2008-09-12 23:48 yeah, just got some crude gdb script to scan through all system threads and print out their state 2008-09-12 23:48 more to come 2008-09-12 23:49 this is to help with core examinations 2008-09-12 23:49 which it seems that nobody does under Linux 2008-09-12 23:49 true 2008-09-12 23:49 I have never 2008-09-12 23:49 should 2008-09-12 23:50 it's not a criticism, just a new tool to help with debugging 2008-09-12 23:50 I'm getting a bunch of nfsd threads in state 4 which is a bit odd 2008-09-12 23:51 it's about time I learned summa that fu 2008-09-12 23:53 summa ? 2008-09-12 23:53 =some of 2008-09-12 23:53 ok 2008-09-12 23:53 yap 2008-09-12 23:53 that's all we did at netapp practically 2008-09-12 23:53 for better or worse 2008-09-12 23:53 gotta get me summa data 2008-09-12 23:53 gotta get me summa dat 2008-09-13 00:13 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-13 00:13 test 2008-09-13 00:20 it didn't work 2008-09-13 00:33 ok, one nasty little issue with putting the atom tables in the atomdict... 
ext2_create_entry likes to rely on the inode->i_size to know how many dirent blocks there are 2008-09-13 00:34 now the poor thing thinks there are an awful lot of them 2008-09-13 00:52 so this being inside the filesystem, I can actually let it have stuff out past the end of i_size 2008-09-13 00:53 and let the dirops happilly continue using i_size to know how many dir blocks there are 2008-09-13 00:54 that's probably ok 2008-09-13 00:54 got to think about it 2008-09-13 00:55 now this is where I'd like somebody to be awake ;) 2008-09-13 00:55 I guess I'll go take a run around the pure track 2008-09-13 00:55 see if that focusses me 2008-09-13 01:10 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-13 02:17 -!- Aks(~ankitsriv@123.237.71.198) has joined #tux3 2008-09-13 02:27 should have pinged if you needed someone awake 2008-09-13 02:28 maze, do you know what pread is supposed to do with read past end of file? 2008-09-13 02:29 I would think, return zero 2008-09-13 02:29 ah, I'll ptrace ;) 2008-09-13 02:29 brilliant 2008-09-13 02:29 you could just right a tiny program to test it 2008-09-13 02:30 my guess is pread should behave exactly like lseek read 2008-09-13 02:31 probably EINVAL 2008-09-13 02:31 but testing would see 2008-09-13 02:33 it's supposed to return zero 2008-09-13 02:34 and in fact something else is going on 2008-09-13 02:34 so never mind I'll sort it 2008-09-13 02:34 how was spore? 2008-09-13 02:34 haven't gotten to it yet 2008-09-13 02:34 how's the new fs? 2008-09-13 02:34 going through open tabs in firefox and closing them 2008-09-13 02:34 right 2008-09-13 02:34 lots of stuff from the week to catch up on before I reboot 2008-09-13 02:34 yah 2008-09-13 02:34 I'm about 30% through... 
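The "tiny program" suggested above for the pread question could look roughly like the following sketch. It writes a short scratch file and reads at a caller-chosen offset, so the return value at and beyond end of file can be observed directly. The helper name and the /tmp scratch location are assumptions for illustration and have nothing to do with the tux3 tree.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Write `len` zero bytes to a fresh temp file, then pread() `want` bytes
 * at offset `off`. Returns whatever pread() returned, or -2 if the
 * scratch file could not be set up. */
static long pread_probe(size_t len, off_t off, size_t want)
{
    char path[] = "/tmp/pread_probe_XXXXXX";  /* hypothetical scratch path */
    char buf[64] = { 0 };
    int fd;
    long got;

    assert(len <= sizeof buf && want <= sizeof buf);
    fd = mkstemp(path);
    if (fd < 0)
        return -2;
    unlink(path);                 /* the file disappears once fd is closed */
    if (write(fd, buf, len) != (ssize_t)len) {
        close(fd);
        return -2;
    }
    got = (long)pread(fd, buf, want, off);
    close(fd);
    return got;
}
```

For a 10-byte file, a read that straddles the end comes back short, and a read starting at or past the end returns zero rather than failing with EINVAL, which matches the conclusion reached in the channel.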
2008-09-13 02:35 that's why it's good that firefox crashes 2008-09-13 02:35 it does down, you lose 150 tabs, you find out they didn't matter, you get hours of your life back 2008-09-13 02:36 nope it restarts with all the tabs 2008-09-13 02:36 thankfully 2008-09-13 02:36 I actually always killall firefox-bin instead of closing it 2008-09-13 02:36 ok, got it sorted 2008-09-13 02:36 that way the tabs don't get lost ;-) 2008-09-13 02:36 so how is it? 2008-09-13 02:36 it's because I running 32 bit fileops 2008-09-13 02:36 shifted the block number, passed zero to pread 2008-09-13 02:37 everything makes sense 2008-09-13 02:37 well, It's a little tricky to test this high offset sparse file stuff 2008-09-13 02:38 I'll find a way around it 2008-09-13 02:38 it still seems like a good thing to do 2008-09-13 02:40 there we are, compiled with 64 bit r/w and got the proper error 2008-09-13 02:40 % 2008-09-13 02:40 5 2008-09-13 02:40 EIO 2008-09-13 02:48 hmm, I don't think there's really any need to test in 32bit userspace if the kernels 64 bit, right? 2008-09-13 02:48 all the conversions happen way earlier at the syscall entry point... 2008-09-13 02:49 all combinations need testing eventually 2008-09-13 02:49 but true, you can live blissfully in 64 bit 2008-09-13 02:49 never worry about 32 bit 2008-09-13 02:50 yes. all combos need testing, but much later 2008-09-13 02:50 right 2008-09-13 02:50 I don't think that's something that needs worrying in the dev stage 2008-09-13 02:50 I'll stay in 32 bit 2008-09-13 02:50 it's the most demanding 2008-09-13 02:50 32 bit kernel? 2008-09-13 02:50 yes 2008-09-13 02:50 ah 2008-09-13 02:50 shapor runs 64 bit 2008-09-13 02:50 most others do 2008-09-13 02:50 so do I 2008-09-13 02:50 but if it doesn't work on 32 bit it doesn't exist 2008-09-13 02:50 I actually run an interesting system 2008-09-13 02:51 45 bit fedora 8.5 2008-09-13 02:51 45? 
2008-09-13 02:51 yeah, a joke 2008-09-13 02:51 haha 2008-09-13 02:51 I run a 23 db system 2008-09-13 02:51 it's a 32-bit fedora 8 system, with 64-bit kernel installed, some stuff to support that, and a 64-bit compiler, than upgraded to fedora 9, than more stuff upgraded to 64-bits 2008-09-13 02:51 that's the most important thing about it from my point of view 2008-09-13 02:51 wait, sorry 2008-09-13 02:52 it's 28 db 2008-09-13 02:52 hmm... 2008-09-13 02:52 let me see what it really is 2008-09-13 02:52 mine's a macbook pro laptop, so it's almost silent unless I run the procs full throttle 2008-09-13 02:52 I also have a headless (well with projector) box, which is also very quiet, since it's a shuttle 2008-09-13 02:53 anyway since my laptop is mostly 32 bits, I felt avg of 32 + 64 = 48 was not right 2008-09-13 02:53 so instead went with geometric mean sqrt(32 * 64) = sqrt(2) * 32 = 1.4 *32 = 32 + 12.8 = 44.8 ~ 45 2008-09-13 02:54 and that's why it's a 45 bit fedora 8.5 system 2008-09-13 02:54 brilliant logic, ain't it? 2008-09-13 02:55 right 2008-09-13 02:55 29 dba @ 1 meter 2008-09-13 02:55 ok night 2008-09-13 02:55 and it's still too noisy for me 2008-09-13 02:55 I want 27 now 2008-09-13 02:55 hard to get 2008-09-13 02:55 flips: good luck with design and coding as usual :) 2008-09-13 02:55 heh, that's why I use a laptop, and a headless box 2008-09-13 02:55 this is quieter than any laptop I've used 2008-09-13 02:55 the box can be on the other end of the room 2008-09-13 02:55 considerably 2008-09-13 02:56 really? even when you run with no cpu consumption on a laptop and spun down drives? 
2008-09-13 02:56 spun down ok 2008-09-13 02:56 the hard drive is the noisiest component 2008-09-13 02:56 and I got the quiest ones on the market 2008-09-13 02:56 remember my laptop is basically a remote X/xterm/ssh server 2008-09-13 02:56 quietest 2008-09-13 02:56 sure 2008-09-13 02:57 but the drive doesn't stay spun down 2008-09-13 02:57 yeah, I actually use flash for some things 2008-09-13 02:57 in my experience 2008-09-13 02:57 and don't like things suddenly going "whirr" either ;-) 2008-09-13 02:57 quiet means, makes no noise 2008-09-13 02:57 like the root fs part that is read only (8gb), with tons of the non-permanent stuff living in tmpfs (although no swap) 2008-09-13 02:58 ok, you're hardcore 2008-09-13 02:58 I'd expect no less 2008-09-13 02:58 that way I have ro root fs with 8 gb, tmpfs with /var pieces 2008-09-13 02:58 my favorite box here is the fit pc 2008-09-13 02:58 no fan 2008-09-13 02:58 has a quiet 2.5 in drive 2008-09-13 02:58 ah 2008-09-13 02:58 which will be replaced with a 2.5 in flash drive 2008-09-13 02:58 pretty soon 2008-09-13 02:59 my daughter's favorite box too 2008-09-13 02:59 I actually have my flash drive in a raid array with an identical size partition on the hard disk 2008-09-13 02:59 has a fine linux distro on it 2008-09-13 02:59 didn't have to do a thing 2008-09-13 02:59 with the hard drive part set to raid mode 'write_mostly' 2008-09-13 02:59 that way if you pull the flash it falls back to the hard drive 2008-09-13 02:59 than you can remount,rw 2008-09-13 02:59 update the system 2008-09-13 02:59 ok, now I have this little logical problem 2008-09-13 02:59 remount,ro 2008-09-13 02:59 put the flash back 2008-09-13 02:59 in 2008-09-13 03:00 and then 8gbs sync to flash in one go 2008-09-13 03:00 - presto wear levelling solved even with ext3 ;-) 2008-09-13 03:00 cool 2008-09-13 03:00 that's md? 
2008-09-13 03:00 must be 2008-09-13 03:00 dm can't sync ;-) 2008-09-13 03:00 once both are synced, there are no writes (read-only mount), and the hdd is write-mostly, so all reads hit swap 2008-09-13 03:00 yup it's md 2008-09-13 03:00 erm 2008-09-13 03:01 not swap - flash 2008-09-13 03:01 ddraid project is going to get going pretty soon 2008-09-13 03:01 cluster raid 2008-09-13 03:01 been gathering dust for some time 2008-09-13 03:01 nice and quiet - and fast - since seek time is awesome 2008-09-13 03:01 but it's going to be really useful 2008-09-13 03:01 and linear read is pretty much the same as a normal 2.5 inch drive 2008-09-13 03:01 (25mb/s) 2008-09-13 03:02 which drive is it? 2008-09-13 03:02 and since it's an expresscard 8gb flash - it doesn't stick out of the notebook - just sits nestled within the cavity 2008-09-13 03:02 ah 2008-09-13 03:02 uhm, some no name I picked up off of ebay for like 40 bucks a year and a half back 2008-09-13 03:02 seen em at fry's 2008-09-13 03:03 need 32 gb I think 2008-09-13 03:03 and with the drive spun down, there's less heat - thus less need for fans 2008-09-13 03:03 I don't really want to do work on a smaller one 2008-09-13 03:03 remember - this is just the OS 2008-09-13 03:03 data lives in the cloud 2008-09-13 03:03 in this case on the headless box 2008-09-13 03:04 nice little cloud 2008-09-13 03:04 a cumullo closetus 2008-09-13 03:04 or when at work... it lives elsewhere ;-) 2008-09-13 03:04 ok, my logistical problem 2008-09-13 03:04 and of course email and so on, are already in the cloud to begin with 2008-09-13 03:04 I'm testing this gigantic sparse file stuff 2008-09-13 03:04 and my linux oss can't do those big files 2008-09-13 03:04 my fs 2008-09-13 03:05 let me see 2008-09-13 03:05 why not? 2008-09-13 03:05 what kernel? 2008-09-13 03:05 oss? 2008-09-13 03:05 I'm writing stuff at 2^40 bytes out 2008-09-13 03:05 was that a typo for os? 
2008-09-13 03:05 yes 2008-09-13 03:05 uhm, compiling 32-bit userspace with 32-bit kernel? 2008-09-13 03:05 so, ext3 just can't do that 2008-09-13 03:05 does 34 bits work? 2008-09-13 03:05 right, 32/32 2008-09-13 03:05 no 2008-09-13 03:06 2^40, see? 2008-09-13 03:06 tux3 is 2^48 2008-09-13 03:06 tux3 can do it 2008-09-13 03:06 asking whether 34 works 2008-09-13 03:06 even mapped into a loopback file 2008-09-13 03:06 34 doesn't 2008-09-13 03:06 um 2008-09-13 03:06 wait 2008-09-13 03:06 then your problem is compile options 2008-09-13 03:06 no, 34 should be ok 2008-09-13 03:06 #define USE_LARGEFILEOFFSET 64 or so 2008-09-13 03:06 yep, done 2008-09-13 03:07 the problem is the loopback file 2008-09-13 03:07 so 34 works? 2008-09-13 03:07 well 2008-09-13 03:07 34 is 16gb 2008-09-13 03:07 that should definitely work 2008-09-13 03:07 see 2^40 above 2008-09-13 03:07 33 is 8gb - that I know works - since that's dvd images ;-) 2008-09-13 03:07 well, that's why I'm asking if you've tested with 33 2008-09-13 03:07 terabyte 2008-09-13 03:07 no 2008-09-13 03:07 I guess I will 2008-09-13 03:07 then probably worth checking ;-) 2008-09-13 03:07 but I can't leave it that way 2008-09-13 03:08 not satisfactory 2008-09-13 03:08 if it doesn't work at 33, your problem is not in the kernel 2008-09-13 03:08 sure, I can do 15 minutes of testing with a much lower offset 2008-09-13 03:08 or half an hour 2008-09-13 03:08 but I need to write real code 2008-09-13 03:08 if it works at 33, but not at 40, then you've got a kernel internal problem 2008-09-13 03:08 it works fine 2008-09-13 03:08 like I said, this is just logical 2008-09-13 03:09 flips: btw, mmLinux was my first kernel project ever 2008-09-13 03:09 wait a minute 2008-09-13 03:09 just so that you know 2008-09-13 03:09 I think 2 TB is the biggest sparse file I can make on this system 2008-09-13 03:09 we're talking about the size of a file right? 
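The "does 33 work, does 40 work" bracketing suggested above can be sketched as a small probe that writes one byte at offset 2^bits in a sparse scratch file. The macro recalled in the conversation was quoted from memory; the glibc feature-test macro that selects a 64-bit off_t in 32-bit userspace is `_FILE_OFFSET_BITS`. The helper name and the scratch path passed to it are hypothetical.

```c
#define _FILE_OFFSET_BITS 64   /* must come before every #include */
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to write one byte at offset 2^bits in a sparse scratch file.
 * Returns 0 on success, otherwise an errno value (for example EFBIG
 * when the host filesystem cannot represent a file that large). */
static int probe_sparse_offset(const char *path, unsigned bits)
{
    int fd, err;
    ssize_t n;

    assert(bits < 63);
    fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0)
        return errno;
    n = pwrite(fd, "x", 1, (off_t)1 << bits);
    err = (n == 1) ? 0 : (n < 0 ? errno : EIO);
    close(fd);
    unlink(path);
    return err;
}
```

Calling this with bits = 33, 40 and 41 on the host filesystem would show where the limit sits. On a host with the 2 TB ceiling mentioned above, the 2^41 probe would be expected to fail while the smaller ones succeed.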
2008-09-13 03:09 I figured it was either going to make me or kil me 2008-09-13 03:09 bh, looks like a fine project 2008-09-13 03:09 right 2008-09-13 03:09 I've done well with the rap that it's given me 2008-09-13 03:09 yes 2008-09-13 03:09 that makes sense it's probably 4GB * 512 or something 2008-09-13 03:09 I didn't know about it at all 2008-09-13 03:09 however, I think I can do better 2008-09-13 03:10 maze, exactly 2008-09-13 03:10 lame 2008-09-13 03:10 it's the blocks count in the ext3 inode 2008-09-13 03:10 http://en.wikipedia.org/wiki/Ext3 2008-09-13 03:10 yeah, I was in the middle of the entire -rt thing and I got forgotten about, dropped out of existence 2008-09-13 03:10 see file size limit = 2tb 2008-09-13 03:10 measured in sectors instead of blocks, lamest idea ever 2008-09-13 03:10 unfixable 2008-09-13 03:10 apparently 2008-09-13 03:11 use jfs 2008-09-13 03:11 ok, so the correct way to test this is to boot tux3 up far enough that I can do the testing in tux3 files 2008-09-13 03:11 lol 2008-09-13 03:11 which go up to 2^48 (true) 2008-09-13 03:11 I wasn't joking 2008-09-13 03:11 been doing that already 2008-09-13 03:11 jfs does 4pb which is 52 bits 2008-09-13 03:11 for a few weeks even 2008-09-13 03:11 ah 2008-09-13 03:12 that's another option 2008-09-13 03:12 but it would limit people's abiltiy to test 2008-09-13 03:12 or xfs 2008-09-13 03:12 8 exabyte limit 2008-09-13 03:12 which is 63 bits 2008-09-13 03:12 it's important that tux3 builds and tests on the lowest common denominator linux system 2008-09-13 03:12 can't require a specific host fs 2008-09-13 03:12 you can if you stick it in a loopback 2008-09-13 03:12 I am 2008-09-13 03:13 that get's me up to 2 TB on ext3 2008-09-13 03:13 then you merely need to format the loopback with jfs 2008-09-13 03:13 can't expect users to do that 2008-09-13 03:13 oh, sparse loopback 2008-09-13 03:13 I thought base fs -> file -> loopback -> jfs -> sparse file -> loopback -> tux3 2008-09-13 03:13 but that does get 
complex 2008-09-13 03:13 and probably blows the stack 2008-09-13 03:13 can't expect the user even to have jfs 2008-09-13 03:14 it's not compiled in by default 2008-09-13 03:14 well 2008-09-13 03:14 default on fedora 2008-09-13 03:14 probably comes in modules on most recent distros 2008-09-13 03:14 ok 2008-09-13 03:14 and ubuntu too, in a module I expect 2008-09-13 03:14 but still 2008-09-13 03:14 module of course 2008-09-13 03:14 then they have to mess with a fragile loopback 2008-09-13 03:14 it's pi o'clock 2008-09-13 03:14 and really tux3 can do that by itself 2008-09-13 03:14 heh 2008-09-13 03:14 so it is 2008-09-13 03:15 what are you plans for october 31? 2008-09-13 03:15 uhm 2008-09-13 03:15 it's a friday 2008-09-13 03:15 halloween 2008-09-13 03:15 looks like it's a friday 2008-09-13 03:16 looks like none at the moment 2008-09-13 03:16 I'm thinking of arranging an official cabal meeting that day 2008-09-13 03:16 were? 2008-09-13 03:16 just an idea 2008-09-13 03:16 rather: where? 2008-09-13 03:17 somewhere in the fear and loathing of LA 2008-09-13 03:17 ugh, would need to drive down... 2008-09-13 03:17 with a web presence 2008-09-13 03:17 unless we organized it somewhere mid-way 2008-09-13 03:17 possible 2008-09-13 03:17 like I say, just an idea at the moment 2008-09-13 03:18 bay area certainly could be good for attendance 2008-09-13 03:18 where are you? 
2008-09-13 03:18 santa monica 2008-09-13 03:18 socal 2008-09-13 03:18 something like santa maria 2008-09-13 03:19 close 2008-09-13 03:19 or grover beach 2008-09-13 03:19 I'm sure those saints all live near each other 2008-09-13 03:19 that was a suggestion of a mid-way meet point 2008-09-13 03:19 right next to the beach 2008-09-13 03:19 indeed 2008-09-13 03:19 oh 2008-09-13 03:19 I see 2008-09-13 03:20 it's outside the LA basin 2008-09-13 03:20 outside of LA jams 2008-09-13 03:20 so you deal with LA 2008-09-13 03:20 we norcal folks get a larger distance to travel 2008-09-13 03:21 I'd probably take highway 1 and leave 3 hours early - since I love that drive... but oh well ;-) 2008-09-13 03:21 150 miles on pch... 2008-09-13 03:21 yes, it's not too far 2008-09-13 03:21 anyway, just an idea 2008-09-13 03:21 that's 130 miles from you, 230 from me 2008-09-13 03:22 ok I know what I'll do 2008-09-13 03:22 something around there, some restaurant? 2008-09-13 03:22 about my logistical problem 2008-09-13 03:22 I'll make the position of the atom refcount map a variable in the superblock 2008-09-13 03:22 and set it really low for unit testing 2008-09-13 03:22 duh 2008-09-13 03:23 yes, something like that 2008-09-13 03:23 130 miles I can handle 2008-09-13 03:23 you're younger, can handle further 2008-09-13 03:23 I need to check with such folks as natalie 2008-09-13 03:23 we could do this as an LA-only event 2008-09-13 03:23 or try for larger coverage 2008-09-13 03:24 where's she from? 2008-09-13 03:24 ukraine 2008-09-13 03:24 lives in LA 2008-09-13 03:24 I brought her into goog 2008-09-13 03:24 looks like SM 2008-09-13 03:24 goog's lucky about that 2008-09-13 03:24 most likely 2008-09-13 03:24 we can patch you in ;-) 2008-09-13 03:25 patch as in? 2008-09-13 03:25 remotey 2008-09-13 03:25 that's another option 2008-09-13 03:25 to do it with a web presence 2008-09-13 03:25 oh, as in organize it in the office? to vc? 
2008-09-13 03:25 the idea is to have some, um, ethanol involved 2008-09-13 03:26 not quite 2008-09-13 03:26 let's have an email loop 2008-09-13 03:26 cause I very heavily doubt I have the net uplink for any decent vc at home 2008-09-13 03:26 about it 2008-09-13 03:26 it's just comsucktik 2008-09-13 03:26 yes 2008-09-13 03:26 well 2008-09-13 03:42 just a question: do you test tux3 on 64bit? 2008-09-13 03:44 because it seems that all I get are error messages :) 2008-09-13 03:45 or maybe I need a newer fuse-version. Which one are you using? 2008-09-13 03:45 2.7.3 here 2008-09-13 03:53 flips is on 2.7.4 2008-09-13 03:53 and shapor and I are on 64 bit 2008-09-13 03:53 fuse version is set to 27 2008-09-13 03:53 hmm... I thought at least creating files should work under fuse, shouldn't it? 2008-09-13 03:53 in the source 2008-09-13 03:54 it should 2008-09-13 03:54 post your error? 2008-09-13 03:54 data, you have a web paste utility you use? 2008-09-13 03:55 well, there are a few, but not a single one 2008-09-13 03:55 which do you prefer? 2008-09-13 03:55 any 2008-09-13 03:55 just paste your output there 2008-09-13 03:55 and let konrad go at it ;-) 2008-09-13 03:55 heh 2008-09-13 03:56 I think tux3fuse is the toy to use 2008-09-13 03:56 I can't see tux3fs as being useful anymore with the low level one there 2008-09-13 03:56 desktop test # touch test 2008-09-13 03:56 desktop test # ls 2008-09-13 03:56 desktop test # 2008-09-13 03:56 nothing shows up 2008-09-13 03:56 konrad, I'll leave that question to you and shapor 2008-09-13 03:56 first error. 
2008-09-13 03:56 data: at least it doesn't crash :) 2008-09-13 03:56 I think you're right but I'm not the expert 2008-09-13 03:57 pff, I first looked at fuse the day I sent that email 2008-09-13 03:57 that's more than me 2008-09-13 03:57 still haven't looked at it 2008-09-13 03:57 heh 2008-09-13 03:57 -su: echo: write error: Transport endpoint is not connected 2008-09-13 03:57 atom refcounting getting closer 2008-09-13 03:58 echo "foo" > bar 2008-09-13 03:58 means fuse didn't start 2008-09-13 03:58 run with -f 2008-09-13 03:58 that is, make defuse 2008-09-13 03:58 right. using that 2008-09-13 03:59 and it hangs like it should? 2008-09-13 03:59 anyway, you want to paste all the output 2008-09-13 03:59 there should be lots 2008-09-13 03:59 hm, I'm not getting tux3fuse to mount anything here 2008-09-13 03:59 http://www.nomorepasting.com/getpaste.php?pasteid=20198 2008-09-13 04:00 when I do: echo "foo" > bar 2008-09-13 04:00 the steps are something like: dd if=/dev/zero of=./dev seek=100M count=1; ./tux3 mkfs ./dev; ./tux3fuse dev tmp/ 2008-09-13 04:00 yes? 2008-09-13 04:00 oh, segment fault 2008-09-13 04:01 I'm not even getting that 2008-09-13 04:01 i just used make defuse 2008-09-13 04:01 you want to find where the segfault is 2008-09-13 04:01 tux3_init: fdsize64 failed for 'dev' (Bad file descriptor)! 2008-09-13 04:01 konrad's on it ;) 2008-09-13 04:01 nah I got to get to sleep 2008-09-13 04:02 and i have to do more algebra (bah!) 2008-09-13 04:02 and I'm moving in ~5 days so I may be otherwise occupied come this tuesday evening 2008-09-13 04:02 data, try: sudo gdb -args ./tux3fuse /tmp/testdev /tmp/test -f 2008-09-13 04:02 slight variation 2008-09-13 04:03 run under gdb 2008-09-13 04:03 0x0000000000406239 in xcache_limit (xcache=0x0) at tux3.h:284 2008-09-13 04:03 284 return (void *)xcache + xcache->size; 2008-09-13 04:03 yup it's a bug 2008-09-13 04:04 shall we chase it tomorrow? 2008-09-13 04:04 it's the middle of the day? 
:P 2008-09-13 04:04 oh right 2008-09-13 04:04 well 2008-09-13 04:04 data: we're in PST, it's 4am for flips and I :D 2008-09-13 04:04 we need to get a tux3 debug center going over there in europe 2008-09-13 04:04 i'll have a look at it if I find the time 2008-09-13 04:04 ok 2008-09-13 04:04 good luck 2008-09-13 04:04 it's just a bug ;-) 2008-09-13 04:04 but otherwise I'll be around tomorrow 2008-09-13 04:05 now you can run under gdb, makes it easier to chase 2008-09-13 08:21 -!- pgquiles(~pgquiles@229.Red-83-49-101.dynamicIP.rima-tde.net) has joined #tux3 2008-09-13 10:10 -!- Aks(~ankitsriv@123.237.71.198) has left #tux3 2008-09-13 12:22 well, xattr get/set actually seem to be an interesting method for extended operations on inodes 2008-09-13 13:26 maze, how is the new fs going? 2008-09-13 13:26 writing the makefile... 2008-09-13 13:27 that's most of the work, if your write the fs in "make" and use fuse 2008-09-13 13:27 you didn't sleep much 2008-09-13 13:28 starting from the makefile ;-) 2008-09-13 13:28 want something that compiles 2008-09-13 13:28 and I'm writing straight in kernel-space 2008-09-13 13:28 hardcore 2008-09-13 13:28 I'd like to have a build-debug environment 2008-09-13 13:29 good exercise 2008-09-13 13:29 hey, I'm not doing this to test a concept of a fs 2008-09-13 13:29 but to learn the API 2008-09-13 13:29 I know 2008-09-13 13:29 I'm excited about that 2008-09-13 13:29 you're probably going to be telling me about it in a week 2008-09-13 13:29 things I didn't know and should have ;) 2008-09-13 13:30 one would only hope... 
2008-09-13 13:30 but right know, I'm not even getting it to compile a module ;-) 2008-09-13 13:30 I'm also expecting to hear some swearing in the channel 2008-09-13 13:30 the secret, the way everbody starts a new fs: cut and paste ramfs 2008-09-13 13:31 even lazier people cut and paste tux2 2008-09-13 13:31 sorry 2008-09-13 13:31 ext2 2008-09-13 13:31 I'm lazy - but I think that would be counter productive 2008-09-13 13:31 I'm starting with a clean slate, with ramfs/tmpfs/ext2 as cut-n-paste sources 2008-09-13 13:31 but planning on writing it all 2008-09-13 13:32 I want to understand every line of code 2008-09-13 13:32 and the only way to do that is to write it yourself... 2008-09-13 13:32 well - I've got a working makefile. 2008-09-13 13:32 of course it currently doesn't build any modules... 2008-09-13 13:32 ugh 2008-09-13 13:32 and use jon corbet's examples 2008-09-13 13:32 so maybe the definition of working is more like ' it doesn't report parse errors' 2008-09-13 13:32 there is a particularly good example from linux device drivers on building a minimal module 2008-09-13 13:35 http://lwn.net/Articles/21817/ 2008-09-13 13:35 enjoy 2008-09-13 13:35 hmm 2008-09-13 13:35 this was around the time rusty fscked with the module system and messed it all up 2008-09-13 13:38 don't neglect to have a close look at my use_atom code I just posted to the list, they way it handles the positive and negative carries between shorts might be interesting to you 2008-09-13 13:38 a form of bit bashing you don't see much these days 2008-09-13 13:38 clumsy in c 2008-09-13 13:40 okay, have a junkfs.ko 2008-09-13 13:41 well it loads into running kernel (yay for testing on machine you're working on) 2008-09-13 13:41 and unloads 2008-09-13 13:41 of course all it has is empty init/exit 2008-09-13 13:42 [maze@nike junkfs]$ make clean 2008-09-13 13:42 rm -f *~ *.o *.ko *.mod.c .*.cmd 2008-09-13 13:42 rm -f modules.order .depend .version .*.o.flags .*.o.d 2008-09-13 13:42 rm -rf .tmp_versions 
2008-09-13 13:42 rm -f Module.markers Module.symvers 2008-09-13 13:42 [maze@nike junkfs]$ make 2008-09-13 13:42 make -C /lib/modules/2.6.26.3-29.fc9.x86_64/build SUBDIRS=/home/maze/junkfs modules 2008-09-13 13:42 make[1]: Entering directory `/usr/src/kernels/2.6.26.3-29.fc9.x86_64' 2008-09-13 13:42 CC [M] /home/maze/junkfs/super.o 2008-09-13 13:42 LD [M] /home/maze/junkfs/junkfs.o 2008-09-13 13:42 Building modules, stage 2. 2008-09-13 13:42 MODPOST 1 modules 2008-09-13 13:42 CC /home/maze/junkfs/junkfs.mod.o 2008-09-13 13:42 LD [M] /home/maze/junkfs/junkfs.ko 2008-09-13 13:42 make[1]: Leaving directory `/usr/src/kernels/2.6.26.3-29.fc9.x86_64' 2008-09-13 13:42 (reverse-i-search)`modp': modprobe ath_pci 2008-09-13 13:42 [maze@nike junkfs]$ /sbin/lsmod | egrep ju 2008-09-13 13:42 [maze@nike junkfs]$ sudo /sbin/insmod ./junkfs.ko 2008-09-13 13:42 [maze@nike junkfs]$ /sbin/lsmod | egrep ju 2008-09-13 13:42 junkfs 9856 0 2008-09-13 13:42 [maze@nike junkfs]$ sudo /sbin/rmmod junkfs 2008-09-13 13:42 [maze@nike junkfs]$ /sbin/lsmod | egrep ju 2008-09-13 13:42 [maze@nike junkfs]$ 2008-09-13 13:50 cat /proc/filesystems | grep junk 2008-09-13 14:02 ok, atom reverse mapping then we are done with atoms for a while 2008-09-13 14:04 ok, printk debugging v0.1 ready 2008-09-13 14:05 moving to v0.2 2008-09-13 14:05 @/home/maze/junkfs/super.c:26 - Entering: init_junk_fs() 2008-09-13 14:05 @/home/maze/junkfs/super.c:27 - Exiting: init_junk_fs() 2008-09-13 14:05 @/home/maze/junkfs/super.c:32 - Entering: exit_junk_fs() 2008-09-13 14:05 @/home/maze/junkfs/super.c:33 - Exiting: exit_junk_fs() 2008-09-13 14:05 registered it yet? 
2008-09-13 14:05 guess not 2008-09-13 14:05 or you would have grepped 2008-09-13 14:06 that would be v0.0.2 I think 2008-09-13 14:06 well 2008-09-13 14:06 that's just me ;) 2008-09-13 14:07 you're building in your home directory, most hacks build right in a kernel tree 2008-09-13 14:07 so you can git the whole tree 2008-09-13 14:08 I don't even have the source for the kernel ;-) 2008-09-13 14:08 it's just the normal fedora core 9 kernel from koji 2008-09-13 14:08 leet 2008-09-13 14:08 sploit time 2008-09-13 14:08 no, working on making it verbose 2008-09-13 14:08 that is in fact how I started tux2 2008-09-13 14:09 worked with modules up until I realized I was going to be bringing down my workstation a lot 2008-09-13 14:09 well, next step will be kvm 2008-09-13 14:09 you're moving along 2008-09-13 14:10 well, first still have to get debugging more verbose 2008-09-13 14:10 it's not dumping function entry or exit values 2008-09-13 14:10 ltt? 2008-09-13 14:10 and after all - the entire point of this exercise is to learn the api 2008-09-13 14:10 which means seeing what gets passed in 2008-09-13 14:10 (and out) 2008-09-13 14:11 plus it makes debugging easier 2008-09-13 14:17 ok, I have to "unbundle" the ext2 dirops so I can find out which block and offset it created a new dirent at 2008-09-13 14:17 easiest way is to return the dirent and buffer I guess 2008-09-13 14:17 and to be able to search a given dirent block 2008-09-13 14:17 probably how it should have been written in the first place 2008-09-13 14:25 Current debug output: 2008-09-13 14:25 @/home/maze/junkfs/super.c:44 - Entering: init_junk_fs() 2008-09-13 14:25 @/home/maze/junkfs/super.c:39 - Entering: test(5, 6) 2008-09-13 14:25 @/home/maze/junkfs/super.c:40 - Exiting: test(...) = 0 2008-09-13 14:25 @/home/maze/junkfs/super.c:46 - Exiting: init_junk_fs(...) = 0 2008-09-13 14:25 @/home/maze/junkfs/super.c:50 - Entering: exit_junk_fs() 2008-09-13 14:25 @/home/maze/junkfs/super.c:51 - Exiting: exit_junk_fs(...) 
2008-09-13 14:26 here's the code: 2008-09-13 14:26 static int test (int a, int b) { 2008-09-13 14:26 <------>DBG_ENTER2(int,a,int,b); 2008-09-13 14:26 <------>DBG_RETURN1(int,0); 2008-09-13 14:26 } 2008-09-13 14:26 static int __init init_junk_fs(void) { 2008-09-13 14:26 <------>DBG_ENTER0(); 2008-09-13 14:26 <------>test(5, 6); 2008-09-13 14:26 <------>DBG_RETURN1(int,0); 2008-09-13 14:26 } 2008-09-13 14:26 static void __exit exit_junk_fs(void) { 2008-09-13 14:26 <------>DBG_ENTER0(); 2008-09-13 14:26 <------>DBG_RETURN0(); 2008-09-13 14:26 } 2008-09-13 14:26 what are those funny minuses? 2008-09-13 14:26 tabs? 2008-09-13 14:26 oh, that's tabs 2008-09-13 14:27 dark blue on darker blue background, but they show up fine after pasting 2008-09-13 14:27 you need spaces after your commas or my head will explode getting yucky goo everywhere 2008-09-13 14:28 ok, you're ready to register/unregister your fs 2008-09-13 14:28 right, also need to force func declare and debug into the same line I think, and dedup it 2008-09-13 14:28 will total about 6 lines 2008-09-13 14:28 right. 2008-09-13 14:28 plus a couple for a stub fill_super 2008-09-13 14:28 well, and the filesystem_type decl ;-) 2008-09-13 14:28 it starts to bloat 2008-09-13 14:28 of course ;-) 2008-09-13 14:29 I'd go with the separate code and debug lines 2008-09-13 14:31 so, the following isn't better? 
2008-09-13 14:31 DECLARE0(static,int,__init,init_junk_fs) 2008-09-13 14:31 <------>test(5, 6); 2008-09-13 14:31 <------>DBG_RETURN1(int, 0); 2008-09-13 14:31 } 2008-09-13 14:31 I'm not sure myself 2008-09-13 14:31 seperate lines means it's easier to turn off on a per function basis 2008-09-13 14:32 makes my eyes bleed 2008-09-13 14:32 if not sure, go with the unbundled form 2008-09-13 14:32 oh right spaces ;-) 2008-09-13 14:32 right 2008-09-13 14:32 separate lines rules the world for debug traces 2008-09-13 14:33 got to remember, we're writing in C, don't try to make it pretty, you will not succeed, and if it doesn't look ugly then the fates will not smile upon you 2008-09-13 14:34 yeah, but for return unless I use a temp variable, I kind of have to return from within the macro 2008-09-13 14:34 This is what test looks like now: 2008-09-13 14:34 DECLARE2(static, int, , test, int, a, int, b) 2008-09-13 14:34 DBG_RETURN1(int, 0); 2008-09-13 14:34 } 2008-09-13 14:35 hmm, requires a little thinking, and I'm hungry 2008-09-13 14:35 try ltt 2008-09-13 14:35 it does this for you 2008-09-13 14:35 what's ltt? 2008-09-13 14:35 so you can concentrate on the problem 2008-09-13 14:35 linux trace toolkit 2008-09-13 14:36 yes, google found it 2008-09-13 14:37 ok, I have a better idea than shelling ext2_create_entry to return buffer and dirent... 
instead of an error code, return the dir file pos 2008-09-13 14:37 or -1 if there was an error 2008-09-13 14:37 the only error ext2_create_entry returns anyway is -EIO 2008-09-13 14:37 just more braindamaged C style error handling, or lack of it 2008-09-13 14:37 I'm not making it worse, honest 2008-09-13 14:46 oh, right this is C 2008-09-13 14:46 can't declare vars in the middle of func body 2008-09-13 14:46 you can 2008-09-13 14:46 well 2008-09-13 14:47 you have to override the kernel compile flags 2008-09-13 14:47 so you don't get warnings 2008-09-13 14:47 we might build tux3 that way for a while 2008-09-13 14:47 folks 2008-09-13 14:48 until the squacking from old schoolers gets too much to bear 2008-09-13 14:48 I can't remember what the reason for not using C++ was in the kernel? 2008-09-13 14:48 was it the programmers? 2008-09-13 14:49 or were there actual issues with the compiler 2008-09-13 14:49 fear of exceptions and crazy hidden semantics 2008-09-13 14:49 no real issues 2008-09-13 14:49 I know you _can_ compile non-C++-std-compliant C++ without any libraries 2008-09-13 14:49 on, no designated initializers 2008-09-13 14:49 that's a killer 2008-09-13 14:50 even c99 is permabanned 2008-09-13 14:50 I probably know the feature, but I'm not sure what that refers to, is that the .something = something struct initializer 2008-09-13 14:50 for no reason whatsoever 2008-09-13 14:50 or the [d] = something 2008-09-13 14:50 right 2008-09-13 14:50 essential 2008-09-13 14:50 both? 2008-09-13 14:50 agreed, they're essential 2008-09-13 14:50 the former mostly 2008-09-13 14:50 I have used the second exactly once, and that was last week 2008-09-13 14:50 in tux3 2008-09-13 14:51 fear of exceptions is a good one, but you should just not use them ;-) 2008-09-13 14:51 it's mostly fear of hidden behavior 2008-09-13 14:51 linus hates that 2008-09-13 14:51 I've never seen a c++ prog that didn't have it 2008-09-13 14:51 hidden behaviour... is the C++ compiler more loose? 
2008-09-13 14:51 way loose 2008-09-13 14:52 code generated is beyond pathetic compared to hand crafted C 2008-09-13 14:52 you can also write hand crafted c++ of course but nobody does 2008-09-13 14:52 even if you don't use code-killing features? 2008-09-13 14:52 (like exceptions, multiple inheritance, large parts of OO, etc) 2008-09-13 14:52 linuxers don't have that discpline 2008-09-13 14:52 [templates...] 2008-09-13 14:53 remember, 90%+ of linux is dodgy drivers written by people who wish it was saturday 2008-09-13 14:53 ah, so it really boils down to programmers 2008-09-13 14:53 I wish it was saturday 2008-09-13 14:53 (and it is!) 2008-09-13 14:53 me too 2008-09-13 14:53 right 2008-09-13 14:53 it's nice to be happy :) 2008-09-13 14:53 speaking of which 2008-09-13 14:53 nice when wishes come true 2008-09-13 14:53 nearly sk8 oclock 2008-09-13 14:53 and I'm nearly done with the atom revmap 2008-09-13 14:53 woohoo 2008-09-13 14:54 I need to include linux/fs.h apparently ;-) 2008-09-13 14:56 the fun starts 2008-09-13 14:58 well registering was simple 2008-09-13 14:59 of course there's no get_sb function declared... 
2008-09-13 14:59 naturally 2008-09-13 14:59 now you can see it in proc 2008-09-13 14:59 a lot of stuff is happening 2008-09-13 15:00 that's where you realize the vfs is actually oo, even if it was developed by folks with very little understanding of oo 2008-09-13 15:00 which includes me ;-) 2008-09-13 15:00 though to be sure my role in vfs devel was minor 2008-09-13 15:01 mainly just contributed the inode specialization model 2008-09-13 15:02 ok, atom reverse entries are being created 2008-09-13 15:02 now let's reverse an atom 2008-09-13 15:02 I suppose I could use the readdir interface for this 2008-09-13 15:03 that would be kind of perverse 2008-09-13 15:03 no, sorry, really perverse 2008-09-13 15:03 nodev junkfs 2008-09-13 15:03 I just won't do that 2008-09-13 15:03 good 2008-09-13 15:03 now to take a look at the flags 2008-09-13 15:04 the difference between a nodev and a dev is significant ;-) 2008-09-13 15:04 have fun will kill_litter_super 2008-09-13 15:04 with 2008-09-13 15:04 well, right now none are declared, although it is easy to make it dev 2008-09-13 15:04 I'd like to see what is available 2008-09-13 15:04 kinda easy 2008-09-13 15:04 and kinda not 2008-09-13 15:04 you will see 2008-09-13 15:04 it's not as crystalline as you think right now 2008-09-13 15:06 /* public flags for file_system_type */ #define FS_REQUIRES_DEV 1 #define FS_BINARY_MOUNTDATA 2 #define FS_HAS_SUBTYPE 4 #define FS_REVAL_DOT 16384 /* Check the paths ".", ".." for staleness */ #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
2008-09-13 15:06 so requires dev, means a block device with the data, as opposed to just in mem 2008-09-13 15:06 I wonder if nfs needs dev 2008-09-13 15:06 probably not 2008-09-13 15:07 right 2008-09-13 15:07 nfs is nodev 2008-09-13 15:07 binary mount data is probably for nfs and smb because they use binary mount options and hence have special mount programs 2008-09-13 15:07 subtype - no idea 2008-09-13 15:07 good observation 2008-09-13 15:07 dev means "block dev" 2008-09-13 15:07 maybe something like fat 2008-09-13 15:07 is considered subtyped 2008-09-13 15:08 no idea either 2008-09-13 15:08 sounds like rot 2008-09-13 15:08 reval_dot seems especially useful for nfs, maybe others 2008-09-13 15:08 d_move - seems like something worth knowing 2008-09-13 15:08 although probably later on 2008-09-13 15:08 see, I never looked at all those flags 2008-09-13 15:08 worthwhile knowing there's an implementation option there 2008-09-13 15:08 they come and go 2008-09-13 15:08 not a stable api 2008-09-13 15:08 well, you have to develop for some api ;-) 2008-09-13 15:09 right 2008-09-13 15:09 internal kernel api is a moving target 2008-09-13 15:09 of course 2008-09-13 15:09 partly intentional to encourage out of tree people to merge 2008-09-13 15:09 partly to improve it 2008-09-13 15:10 probably worthwhile, although breakage for breakage's sake should be frowned upon, if it's just okay to change stuff, but only for the 'better', then that's another issue 2008-09-13 15:11 ok, let's see which fs'es use which flags 2008-09-13 15:11 yeah, I'm needlessly thorough...
but oh, well, can't change who and what I am 2008-09-13 15:11 that means I'll be able to ask you questions soon 2008-09-13 15:12 it's a feature, not a bug 2008-09-13 15:13 blockdev - tons, as expected - including nfsd (the server) 2008-09-13 15:13 although for nfsd it's actually checking you're exporting a fs with a dev backing 2008-09-13 15:13 wonder if that means you can't export ramfs 2008-09-13 15:14 oh because nfs uses the dev from cookie to lookup the fs 2008-09-13 15:15 ugh, broken 2008-09-13 15:15 [by design] 2008-09-13 15:16 although there's a hack to be able to re-export nfs mounts 2008-09-13 15:16 binary_mountdata 2008-09-13 15:16 we're going to have to copy this buffer and make it tux3 U #3 2008-09-13 15:17 coda, ncpfs (netware), nfs, smbfs/cifs 2008-09-13 15:17 so basically the complex net file systems 2008-09-13 15:17 all nodev? 2008-09-13 15:17 of course 2008-09-13 15:17 probably because they take so many options related to networking 2008-09-13 15:17 -> no binary_mountdata 2008-09-13 15:17 never looked at that 2008-09-13 15:18 the opposite of that is? 
2008-09-13 15:18 subtype appears to be a fuse hack 2008-09-13 15:18 ah, right 2008-09-13 15:18 the opposite of binary_mountdata is not putting it in flags 2008-09-13 15:18 see my complaint about fuse on that topic, thursday 2008-09-13 15:18 all 'normal' filesystems use text string mount options 2008-09-13 15:18 wrong idea 2008-09-13 15:18 each fuse fs should get its own type 2008-09-13 15:18 not all of the "fuse" 2008-09-13 15:18 just wrong 2008-09-13 15:19 okay, skipping fuse parsing ;-) giving me a headache 2008-09-13 15:20 REVAL_DOT 2008-09-13 15:20 is nfs only 2008-09-13 15:20 related to parent directory entries of a path being able to go stale 2008-09-13 15:20 revalidate 2008-09-13 15:20 something which is related to nfs protocol borkenness 2008-09-13 15:20 yes 2008-09-13 15:20 subtle 2008-09-13 15:20 although maybe hard to fix in a new netfs 2008-09-13 15:20 no, not hard 2008-09-13 15:21 just has to be stateful 2008-09-13 15:21 right 2008-09-13 15:21 stateless is unworkable braindamage 2008-09-13 15:21 but you want it both stateful, and stateless 2008-09-13 15:21 a pox upon us 2008-09-13 15:21 I don't agree 2008-09-13 15:21 lightweight state 2008-09-13 15:21 that scales 2008-09-13 15:21 is good 2008-09-13 15:21 nfs is bad 2008-09-13 15:21 doesn't work properly 2008-09-13 15:21 you want lightweight statefull with fallback to stateless 2008-09-13 15:21 trond will disagree of course 2008-09-13 15:21 the fallback being mostly for the server reboot/failover case 2008-09-13 15:22 you never want stateless 2008-09-13 15:22 stateless == brainless 2008-09-13 15:22 :-) 2008-09-13 15:22 well, would have to think about it more... stateless has nice features that you do want 2008-09-13 15:22 a nematode is getting close to stateless 2008-09-13 15:22 metanode? 2008-09-13 15:23 heh 2008-09-13 15:23 is that an anagram? 2008-09-13 15:23 wow 2008-09-13 15:23 so which is it? 
2008-09-13 15:23 don't start with puns now ;-) 2008-09-13 15:23 nematode: disgusting little worm 2008-09-13 15:23 right, but is there an fs concept called nematode? 2008-09-13 15:23 nfs: disgusting little hack that grew up into a huge disgusting little worm 2008-09-13 15:24 no 2008-09-13 15:24 just me dissing nfs 2008-09-13 15:24 wish trond were here ;-) 2008-09-13 15:24 so you just mistyped metanode? or you meant the worm 2008-09-13 15:24 seem - I'm clueless 2008-09-13 15:24 sarcasm/irony/human interaction just fly right over me 2008-09-13 15:24 no, I meant to type nematode, I was comparing nfs to a nematode 2008-09-13 15:24 s/seem/see/ 2008-09-13 15:25 both are nearly stateless 2008-09-13 15:25 wait a minute - nfs is stateless... sin't it? 2008-09-13 15:25 not quite 2008-09-13 15:25 lockd implements a stateful protocol 2008-09-13 15:25 it's fakery to pretend it doesn't 2008-09-13 15:25 right - those are extensions 2008-09-13 15:25 although 2008-09-13 15:26 to be fair running with out it doesn't happen 2008-09-13 15:26 also tcp 2008-09-13 15:26 can't really be separated 2008-09-13 15:26 not really 2008-09-13 15:30 nfs is actually like 4 fs'es 2008-09-13 15:30 2 being v3 vs v4 2008-09-13 15:30 and 2 being normal vs cross-device registration hackery 2008-09-13 15:30 so you have 2 * 2 = 4 2008-09-13 15:30 right 2008-09-13 15:30 anyway 2008-09-13 15:30 D_MOVE 2008-09-13 15:30 it's rather cleverly and lazily compressed into fairly small source 2008-09-13 15:30 in linux 2008-09-13 15:30 apparently used by 2008-09-13 15:31 nfs and ocfs2 2008-09-13 15:31 probably related to directory deletions in some way 2008-09-13 15:31 what does it do? 
2008-09-13 15:31 so much hackery in linux is because of nfs 2008-09-13 15:31 we'd be way better off if it had never been written 2008-09-13 15:32 well, some people make a living from it 2008-09-13 15:32 so they are ok 2008-09-13 15:32 and they are generally good to drink with 2008-09-13 15:32 especially good to drink with 2008-09-13 15:32 I think there must be a connection 2008-09-13 15:32 nope renames 2008-09-13 15:33 right, dentry move 2008-09-13 15:33 actually I dimly recall that 2008-09-13 15:33 so basically this is something along the lines of support for atomic renames 2008-09-13 15:33 a big wart in dentry cache 2008-09-13 15:33 and somehow nfs and ocfs2 are special 2008-09-13 15:33 I wonder why ocfs2 needs it 2008-09-13 15:34 can ask mark fasheh about that 2008-09-13 15:34 FS will handle d_move() during rename() internally. 2008-09-13 15:34 from the header file for that #define 2008-09-13 15:34 okay, looks like at first glance (as expected) we just need a backing blockdev 2008-09-13 15:34 I strongly suspect it was to solve a locking bottleneck in ocfs2 2008-09-13 15:35 but worth knowing for future design that there are such hacks 2008-09-13 15:35 for rename and for stale . and .. 2008-09-13 15:36 yes 2008-09-13 15:36 probably want to avoid it, but... 2008-09-13 15:36 what's ocfs2? 2008-09-13 15:36 nice little cluster filesystem from oracle 2008-09-13 15:36 quite underrated 2008-09-13 15:37 question about filenames 2008-09-13 15:37 does the vfs layer enforce, no nulls and no slashes in a file name? 2008-09-13 15:37 but otherwise anything goes?
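The fs_flags survey in this stretch of the log can be condensed into a small userspace sketch. It assumes only the flag values quoted from the header earlier; `struct fs_entry`, `proc_line` and the three sample entries are hypothetical stand-ins rather than kernel code, and the nfs flag set simply follows what was said in the channel. It illustrates why an unflagged junkfs is listed as `nodev junkfs`: the /proc/filesystems listing prints `nodev` whenever FS_REQUIRES_DEV is not set.

```c
#include <assert.h>
#include <stdio.h>

/* Flag values as quoted from the header in the log. */
#define FS_REQUIRES_DEV        1
#define FS_BINARY_MOUNTDATA    2
#define FS_HAS_SUBTYPE         4
#define FS_REVAL_DOT           16384
#define FS_RENAME_DOES_D_MOVE  32768

/* Hypothetical stand-in for the two file_system_type fields used here. */
struct fs_entry {
    const char *name;
    int fs_flags;
};

/* Format one /proc/filesystems-style line: "nodev" unless a block device is required. */
static int proc_line(const struct fs_entry *fs, char *buf, size_t size)
{
    assert(fs && buf && size > 0);
    return snprintf(buf, size, "%s\t%s",
                    (fs->fs_flags & FS_REQUIRES_DEV) ? "" : "nodev", fs->name);
}

/* Sample entries: flag sets as described in the channel, for illustration only. */
static const struct fs_entry junkfs = { "junkfs", 0 };
static const struct fs_entry junkfs_dev = { "junkfs", FS_REQUIRES_DEV };
static const struct fs_entry nfs = {
    "nfs", FS_BINARY_MOUNTDATA | FS_REVAL_DOT | FS_RENAME_DOES_D_MOVE
};
```

Calling `proc_line(&junkfs, buf, sizeof buf)` fills `buf` with `nodev`, a tab and `junkfs`; with `junkfs_dev` the `nodev` prefix disappears, which is the one-flag change referred to as "easy to make it dev".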
2008-09-13 15:38 okay nodev is out 2008-09-13 15:39 next step - what the hell is get_sb ;-) 2008-09-13 15:39 main: >>> found unatom entry 12 for atom 1 2008-09-13 15:40 now to print out the name 2008-09-13 15:41 main: found unatom entry 0 for atom 0 2008-09-13 15:41 main: found unatom entry 12 for atom 1 2008-09-13 15:41 main: found unatom entry 24 for atom 2 2008-09-13 15:41 main: found unatom entry 0 for atom 3 2008-09-13 15:41 main: found unatom entry 0 for atom 4 2008-09-13 15:41 etc 2008-09-13 15:42 well, it should not be 0 for unknown atoms 2008-09-13 15:42 probably 2008-09-13 15:42 oh 2008-09-13 15:42 sure it should 2008-09-13 15:42 unused entry in the unatom table 2008-09-13 15:46 unused entry in the unatom table 2008-09-13 15:46 whoops 2008-09-13 15:47 main: found unatom entry 0 for atom 0 2008-09-13 15:47 0xb7d10400: 00 00 00 00 0c 00 03 00 66 6f 6f dd 01 00 00 00 "........foo....." 2008-09-13 15:47 there we go 2008-09-13 15:47 reversed 2008-09-13 15:47 time to skate 2008-09-13 16:01 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2008-09-13 16:02 6,800 lines 2008-09-13 16:02 only added about 300 including xattr support and atom refcounting 2008-09-13 16:02 xattrs will come in at around 500 lines total and be perfectly usuable 2008-09-13 16:02 superior maybe 2008-09-13 16:03 sk8 oclock 2008-09-13 16:03 really 2008-09-13 16:09 made further improvements to debugging: 2008-09-13 16:09 @/home/maze/junkfs/super.c:41 - Entering: init_junk_fs() 2008-09-13 16:09 @/home/maze/junkfs/super.c:26 - Entering: test(a=(int)5, b=(int)6) 2008-09-13 16:09 @/home/maze/junkfs/super.c:27 - Returning: test(...) = a + b = (int)11 2008-09-13 16:09 @/home/maze/junkfs/super.c:46 - Mark in init_junk_fs(...) err=(int)0 2008-09-13 16:09 @/home/maze/junkfs/super.c:49 - Returning: init_junk_fs(...) 
= 0 = (int)0 2008-09-13 16:09 @/home/maze/junkfs/super.c:55 - Entering: exit_junk_fs() 2008-09-13 16:10 @/home/maze/junkfs/super.c:57 - Returning: exit_junk_fs(...) = void 2008-09-13 16:10 I have to admit, your plan to write a fs to learn vfs is working out well 2008-09-13 16:10 spaces around the = please ;-) 2008-09-13 16:11 hmm, but those are like colons or something 2008-09-13 16:11 then colon with one space after 2008-09-13 16:11 lindent 2008-09-13 16:11 might as well get used to it 2008-09-13 16:11 well, this is text output from dmesg 2008-09-13 16:12 but yeah ': ' is probably better 2008-09-13 16:12 still 2008-09-13 16:12 it may well escape to lkml one day 2008-09-13 16:12 who knows 2008-09-13 16:12 @/home/maze/junkfs/super.c:26 - Entering: test(a: (int)5, b: (int)6) 2008-09-13 16:12 @/home/maze/junkfs/super.c:27 - Returning: test(...) = a + b = (int)11 2008-09-13 16:12 @/home/maze/junkfs/super.c:46 - Mark in init_junk_fs(...) err: (int)0 2008-09-13 16:13 does look better 2008-09-13 16:13 printk bytes are cheap ;-) 2008-09-13 16:13 yes 2008-09-13 16:13 easier on my eyes 2008-09-13 16:13 pleasant even 2008-09-13 16:13 changed already 2008-09-13 16:13 I noticed 2008-09-13 16:13 it was after all a 2 byte change 2008-09-13 16:13 changed, tested and pasted into the cloud 2008-09-13 16:13 that's the spirit 2008-09-13 16:13 I should probably setup a repository for this junkfs 2008-09-13 16:14 and work on getting a kvm debug working 2008-09-13 16:14 probably don't want to muck around with the get_sb stuff on my live box 2008-09-13 16:14 also have to figure out how to get compile junk to go elsewhere 2008-09-13 16:14 than the source dir 2008-09-13 16:16 yes 2008-09-13 16:17 cd your/source 2008-09-13 16:17 hg init 2008-09-13 16:17 hg add . 2008-09-13 16:17 hg commit 2008-09-13 16:17 that's all there is to it 2008-09-13 16:17 hg is mercurial? 
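The entry/return tracing pasted above can be approximated in userspace as a rough sketch. Everything here is an assumption: `trace`, `DBG_RETURN_INT` and `sum2` are hypothetical names, `printf` stands in for `printk`, and junkfs's real macros were never shown in full. It illustrates one answer to the "temp variable" worry raised earlier: the macro opens its own block, so the temporary is declared at the start of a block and the `return` still happens inside the macro.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

static int trace_lines; /* number of trace lines emitted so far */

/* Print one trace line prefixed with its source position. */
static void trace(const char *file, int line, const char *fmt, ...)
{
    va_list args;

    assert(file && fmt);
    printf("@%s:%d - ", file, line);
    va_start(args, fmt);
    vprintf(fmt, args);
    va_end(args);
    putchar('\n');
    trace_lines++;
}

/* The first argument must be a string literal so it can be pasted onto the prefix. */
#define DBG_ENTER(...) trace(__FILE__, __LINE__, "Entering: " __VA_ARGS__)

/* Evaluate expr once into a block-local temporary, log it, then return it. */
#define DBG_RETURN_INT(expr) do { \
        int ret_ = (expr); \
        trace(__FILE__, __LINE__, "Returning: %s(...) = %s = (int)%d", \
              __func__, #expr, ret_); \
        return ret_; \
    } while (0)

/* Example function using separate debug lines, as recommended in the channel. */
static int sum2(int a, int b)
{
    DBG_ENTER("%s(a: (int)%d, b: (int)%d)", __func__, a, b);
    DBG_RETURN_INT(a + b);
}
```

Calling `sum2(5, 6)` emits one "Entering" line and one "Returning" line and returns 11; because the debug calls sit on their own lines, they can be deleted per function without touching the declaration.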
2008-09-13 16:17 yes 2008-09-13 16:18 probably need to install it first then 2008-09-13 16:19 I'm really pleased with the xattr atom stuff 2008-09-13 16:19 awesome 2008-09-13 16:19 need to use some slight imagination to see how it will perform with a little cache in front of it, and to see the impact of atomic update/log rollup 2008-09-13 16:19 but otherwise I guess it's done 2008-09-13 16:19 some fiddling 2008-09-13 16:20 no more questions about potential lurking complexity 2008-09-13 16:20 and whether it can emulate straight ascii strings 2008-09-13 16:20 I don't think we need a option, really 2008-09-13 16:21 cool 2008-09-13 16:21 one thing missing: find a free atom 2008-09-13 16:21 to use 2008-09-13 16:21 instead of bindly generating new ones, need to code that 2008-09-13 16:22 the plan is to just let the thing expand up to some size, count the deletions in it, then when deletions/size exceeds a threshold, we rescan for deleted entries 2008-09-13 16:22 deleted atoms 2008-09-13 16:22 probably overkill 2008-09-13 16:22 an alternative is to put a linked list of free atoms in the unatom table 2008-09-13 16:22 better 2008-09-13 16:23 oh shiny - finaly rhel4.7 is out in centos 4.7 2008-09-13 16:23 yup free atom list is always better 2008-09-13 16:24 it shall be so 2008-09-13 16:24 will code that when I get back 2008-09-13 16:24 also need to code the atom table dump 2008-09-13 16:25 so some more fiddling until I can escape to more interesting things 2008-09-13 16:32 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3 2008-09-13 17:27 one slight drawback to my atable design I just noticed 2008-09-13 17:27 putting the tables up so high will make the radix tree quite deep 2008-09-13 17:27 I think 2008-09-13 17:27 so when I map at block 2^28 2008-09-13 17:28 radix tree has 2^5 fanout 2008-09-13 17:28 that is 6 radix tree levels 2008-09-13 17:28 probably nothing to worry about 2008-09-13 17:28 we zip through those very fast 2008-09-13 17:28 and with 
the hash in front of it, the overhead will disappear in the noise, if it was not already 2008-09-13 17:29 against that, we have the pleasing property of only having to sync one file to sync the entire atable including recounts and reverse map 2008-09-13 20:25 -!- tim_dimm(~mobile@32.156.233.244) has joined #tux3 2008-09-13 20:27 -!- tim_dimm(~mobile@32.156.233.244) has joined #tux3 2008-09-13 20:28 Howdy 2008-09-13 20:29 Got a geeky irc app for my phone 2008-09-13 20:29 :-) 2008-09-13 20:32 -!- tim_dimm(~mobile@32.156.233.244) has joined #tux3 2008-09-13 22:15 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-13 22:15 wb maze 2008-09-13 22:15 hey 2008-09-13 22:15 show_freeatoms: next = dead000000000666 2008-09-13 22:16 linked list 2008-09-13 22:16 nontrivial proposition when it's linked through disk blocks 2008-09-13 22:16 see the cute magic number 2008-09-13 22:16 for deleted atom 2008-09-13 22:16 I've a question about scheduling priorities... [yes, cute indeed] 2008-09-13 22:17 I'm not much of a scheduler person but fire away 2008-09-13 22:17 so, if we have a pre-emptible kernel, and one low-priority (ie. niced) user process does something which results in a call to the fs code, what priority does that code run within the kernel? 2008-09-13 22:18 same 2008-09-13 22:18 niced 2008-09-13 22:18 will it also be effectively scheduled as a niced task? giving way to other threads of execution of higher priority? 2008-09-13 22:18 yes 2008-09-13 22:18 so does linux then automatically boost thread execution priority within the kernel if a low-prio thread is blocking a higher-prio thread by holding a lock? 2008-09-13 22:19 [and, can thread priority be manually temporarily increase/decreased/changed within kernel code - for whatever reason] 2008-09-13 22:20 there's some priority inheritance stuff, yes, but I'm not familiar with it 2008-09-13 22:20 ie. 
how does the linux kernel deal with std priority inversion jazz 2008-09-13 22:20 you can do whatever you want in kernel 2008-09-13 22:20 ahh 2008-09-13 22:20 including changing priority 2008-09-13 22:20 of your task or any other 2008-09-13 22:21 you can also fill the entire kernel with zero ;-) 2008-09-13 22:21 true - good point 2008-09-13 22:21 although I was trying to read the bootid uuid from my module, and that actually turns out to be very non-trivial 2008-09-13 22:21 didn't say anything was easy 2008-09-13 22:21 almost nothing is 2008-09-13 22:21 but you can do it 2008-09-13 22:21 since it's not exported, and I don't see a good way to grab sysctl's from within the kernel ;-) and the interfaces are always user-oriented 2008-09-13 22:22 [obviously here easiest solution is to fix random.c to export the boot_id... but that's not something that can be done in a module] 2008-09-13 22:22 lyou'll get frustrated about what is not exported, until you realize... just export it 2008-09-13 22:22 right 2008-09-13 22:22 if it's a stupid idea you'll find out soon enough 2008-09-13 22:22 it just won't compile on 'older' kernels then 2008-09-13 22:23 or get past linus usually 2008-09-13 22:23 following linked lists is always scary 2008-09-13 22:23 never feels like it's going to terminate 2008-09-13 22:23 it did: 2008-09-13 22:23 show_freeatoms: next = dead000000000666 2008-09-13 22:23 show_freeatoms: next = dead000000000000 2008-09-13 22:24 this time 2008-09-13 22:24 well there's basically a static char[16] with the bootid, it's got links to it, but... 
ugh 2008-09-13 22:24 notice the frist atom to die was number 0 2008-09-13 22:24 so 0 is a valid atom 2008-09-13 22:24 probably going to regret that 2008-09-13 22:24 yup ;-) 2008-09-13 22:24 I always make 0 invalid 2008-09-13 22:24 or free 2008-09-13 22:24 or something 2008-09-13 22:24 well 2008-09-13 22:25 don't make it invalid just because you're lame ;) 2008-09-13 22:25 make it invalid because you have a good reason 2008-09-13 22:25 I don't have a good reason for atom zero yet 2008-09-13 22:25 but there likely is one 2008-09-13 22:25 good reason: it's easier on the eyes when you later debug it 2008-09-13 22:25 you never expect a 0 value to actually be pointing/referencing to something 2008-09-13 22:25 the magic number there makes things pretty unambiguous 2008-09-13 22:26 actually with the magic number - sure 2008-09-13 22:26 depends 2008-09-13 22:26 it's without that I'd be worried 2008-09-13 22:26 zero is often a valid offset 2008-09-13 22:26 ie. before it's dead 2008-09-13 22:26 it is a valid dirent offset in ext2 for example 2008-09-13 22:26 yes, but offset is a seperate matter 2008-09-13 22:26 well 2008-09-13 22:26 they're all offsets 2008-09-13 22:26 ok, right it needs some deeper thinko 2008-09-13 22:26 there is no such thing as an absolute address any more ;) 2008-09-13 22:27 -!- tim_dimm(~timothyhu@adsl-67-114-40-138.dsl.scrm01.pacbell.net) has joined #tux3 2008-09-13 22:27 hi daddy_dimm 2008-09-13 22:27 chatting on your iphone? 
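The free-atom list discussed above can be sketched in memory under loudly stated assumptions: the entry layout (a 0xdead tag in the top 16 bits, the next free atom in the low bits), the `NO_ATOM` end marker and all function names are hypothetical, chosen only to echo the `dead...` values in the pasted output. The real list is threaded through blocks of the atom table file, which this sketch does not attempt. Because 0 is a valid atom, the end-of-list marker has to be something other than zero.

```c
#include <assert.h>
#include <stdint.h>

#define FREE_TAG   0xdead000000000000ULL /* marks a freed entry (layout assumed) */
#define TAG_MASK   0xffff000000000000ULL
#define NO_ATOM    0x0000ffffffffffffULL /* end of list; not 0, since atom 0 is valid */
#define MAX_ATOMS  8

static uint64_t unatom[MAX_ATOMS];   /* stand-in for the on-disk reverse map */
static uint64_t free_head = NO_ATOM; /* most recently freed atom */
static uint64_t next_unused = 0;     /* first atom never handed out */

static int atom_is_free(uint64_t atom)
{
    assert(atom < MAX_ATOMS);
    return (unatom[atom] & TAG_MASK) == FREE_TAG;
}

/* Push a dead atom on the list: its entry records the previous head. */
static void free_atom(uint64_t atom)
{
    assert(atom < next_unused && !atom_is_free(atom));
    unatom[atom] = FREE_TAG | free_head;
    free_head = atom;
}

/* Reuse the most recently freed atom if there is one, otherwise extend the table. */
static uint64_t alloc_atom(void)
{
    if (free_head != NO_ATOM) {
        uint64_t atom = free_head;

        free_head = unatom[atom] & ~TAG_MASK;
        unatom[atom] = 0; /* live again; a real table would store the dirent position */
        return atom;
    }
    return next_unused < MAX_ATOMS ? next_unused++ : NO_ATOM;
}
```

In this layout, freeing atom 0 and then atom 2 leaves entry 2 holding `dead000000000000`, meaning "free, next is atom 0". That is one possible reading of the output pasted in the channel, not a confirmed one.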
2008-09-13 22:27 yup 2008-09-13 22:27 leet 2008-09-13 22:27 ulberleet 2008-09-13 22:27 uber even 2008-09-13 22:27 makin' sure it came through 2008-09-13 22:27 it did 2008-09-13 22:27 k, pissin off the wife now 2008-09-13 22:27 then you had to change a diaper or something 2008-09-13 22:27 I should go 2008-09-13 22:27 heh 2008-09-13 22:28 short leash 2008-09-13 22:28 it was ever thus 2008-09-13 22:28 it was your idea ;) 2008-09-13 22:28 got 2500 to change minumim if I do my part 2008-09-13 22:28 k 2008-09-13 22:28 yammer at ya later 2008-09-13 22:28 later 2008-09-13 22:28 toodles 2008-09-13 22:29 what's up with zumastor? 2008-09-13 22:29 ok, now I just need to allocate from that list, then I'm done for the night 2008-09-13 22:29 again, nontrivial 2008-09-13 22:29 when the list is linked through file blocks 2008-09-13 22:29 entirely different scale of hacking than in memory 2008-09-13 22:31 I see I have some buffer leaks to chase 2008-09-13 22:31 so... it's going to be a while 2008-09-13 22:31 before I can rest 2008-09-13 22:35 -!- stargazr5(~gauravstt@59.95.17.142) has joined #tux3 2008-09-13 22:38 http://lxr.linux.no/linux+v2.6.26.5/drivers/md/md.c#L595 2008-09-13 22:38 what the hell is that code doing - and why? 2008-09-13 22:38 isn't that spurious? 2008-09-13 22:38 never mind 2008-09-13 22:38 dealing with carry 2008-09-13 22:39 two iterations, yes 2008-09-13 22:39 still looks crappy 2008-09-13 22:41 it's the sort off stuff that is cleaner in assembler 2008-09-13 22:42 adc %ah,%al; adc 0,%al - or whatever the proper registers are called nowadays 2008-09-13 22:42 uhm, first one add, second adc 2008-09-13 22:43 and that's merely 16 bit -> 8 bit, not 32->16 2008-09-13 22:43 much cleaner 2008-09-13 22:46 okay, trying to read in a superblock now, using bios 2008-09-13 22:56 brave 2008-09-13 22:57 submit_bio()... then what? 
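The "dealing with carry, two iterations" remark about the md.c link can be sketched in portable C. This assumes the code in question is the usual fold of a 32-bit sum down to 16 bits with end-around carry; the linked source was not re-checked here, and `fold16` is a hypothetical name. The second pass is needed because the first addition can itself carry into bit 16.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit sum into 16 bits, adding the carry back in (end-around carry). */
static uint32_t fold16(uint32_t sum)
{
    sum = (sum & 0xffff) + (sum >> 16); /* may carry into bit 16: at most 0x1fffe */
    sum = (sum & 0xffff) + (sum >> 16); /* second pass absorbs that carry */
    assert(sum <= 0xffff);
    return sum;
}
```

For example, `fold16(0x0001ffff)` gives 0x10000 after the first pass and 1 after the second. In assembler the same effect comes from an add followed by an add-with-carry, which is the point made in the channel.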
2008-09-13 23:00 glacing through other fs'es 2008-09-13 23:00 and other kernel subsystem 2008-09-13 23:00 the submit code in swap.c seems promising 2008-09-13 23:00 also check block_read_full_page 2008-09-13 23:00 and friends 2008-09-13 23:01 you need to set up an endio that unlocks something 2008-09-13 23:01 wakes up your process typically 2008-09-13 23:01 and you have to remember what you are supposed to wake up somehow 2008-09-13 23:01 in the private field of the bio 2008-09-13 23:01 you will typically stick some state struct in there 2008-09-13 23:01 this is working on the metal 2008-09-13 23:02 :-) cool. 2008-09-13 23:02 you could try the prepare_to_sleep etc api here 2008-09-13 23:02 submit, then sleep 2008-09-13 23:03 see, that's why people tend to use submit_bh, because then you can do wait_on_buffer 2008-09-13 23:03 but it's a very crufty path 2008-09-13 23:08 http://lxr.linux.no/linux+v2.6.26.5/mm/page_io.c#L25 2008-09-13 23:08 that looks like it might be a decent example of asynch bio handling 2008-09-13 23:09 pretty good 2008-09-13 23:10 end page writeback will do a bunch of stuff you don't need 2008-09-13 23:11 but writebackk will only happen if I dirty the page right? 2008-09-13 23:11 actually, scratch that 2008-09-13 23:11 I'm still not quite sure, whether this interface is read/write or mmap or both 2008-09-13 23:11 you're in complete control 2008-09-13 23:12 when you go submit_bio, stuff starts to happen 2008-09-13 23:12 but you will not be able to use these functions directly 2008-09-13 23:12 just use as a guid to write your own 2008-09-13 23:12 right 2008-09-13 23:12 somebody ought to make a simple "read that into this page" 2008-09-13 23:12 can pages be shared between kernel and userspace? 2008-09-13 23:12 based on this 2008-09-13 23:12 but nobody has that I know 2008-09-13 23:13 only by mapping into a page table 2008-09-13 23:13 ie. both the kernel and userspace have a part of disk mmap'ed into phys memory? 
2008-09-13 23:13 and whichever edits, the other sees? 2008-09-13 23:13 yes 2008-09-13 23:13 by doing the mapping through the pagetable? 2008-09-13 23:13 done all the time 2008-09-13 23:13 yes 2008-09-13 23:14 I'm taking baby steps here ;-) 2008-09-13 23:14 you have to ask the right way, get it setup correctly 2008-09-13 23:14 so that it will be recovered properly when your process exits 2008-09-13 23:14 and so on 2008-09-13 23:15 it's a big topic 2008-09-13 23:15 we're going to do "read" on tuesday 2008-09-13 23:16 that in itself is a big topic 2008-09-14 02:04 ACTION is back from (goth club) Sabbat 2008-09-14 02:04 flips: oct 31th is halloween btw 2008-09-14 02:04 you might like to reschedule 2008-09-14 02:04 of course 2008-09-14 02:04 that was the point 2008-09-14 02:05 ACTION chuckles 2008-09-14 02:05 you going trick or treating? 2008-09-14 02:05 ok, now I would like to know your rationale for this 2008-09-14 02:05 probably neither, I don't live in an area that's family friendly 2008-09-14 02:06 be back in a bit 2008-09-14 02:06 I applaud you for your sense of humor, but I'm left a bit hanging as to what you're intending 2008-09-14 02:06 ACTION is freaking drunk now 2008-09-14 02:06 drunk IRCing 2008-09-14 02:06 I'll be sober in about 1/2 hour or so 2008-09-14 02:06 ok 2008-09-14 02:08 ACTION heads to get late night food 2008-09-14 02:08 flips: btw, I live right behind a big goth night/club in San Diego, Sabbat 2008-09-14 02:09 you can get a listing of clubs from socalgoth (southern cal goth) 2008-09-14 02:09 which has unified LA through SD listing 2008-09-14 02:09 for goth/industrial events 2008-09-14 04:13 -!- trymeeeee(~zxcvbnm@123.236.188.107) has joined #tux3 2008-09-14 05:28 -!- Aks(~ankitsriv@123.237.71.198) has joined #tux3 2008-09-14 05:34 -!- Aks(~ankitsriv@123.237.71.198) has left #tux3 2008-09-14 11:05 -!- pgquiles(~pgquiles@50.Red-79-153-248.staticIP.rima-tde.net) has joined #tux3 2008-09-14 11:13 -!- stargazr5(~gauravstt@59.95.30.8) has joined 
#tux3 2008-09-14 11:51 -!- Kirantpatil(~kiran@122.167.212.171) has joined #tux3 2008-09-14 11:51 -!- Kirantpatil(~kiran@122.167.212.171) has left #tux3 2008-09-14 11:54 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-14 13:06 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-14 16:04 ACTION starts a new edition of the Tux3 Report 2008-09-14 16:05 "Xattrs and Atoms" 2008-09-14 16:05 nice to have this as a fait accompli 2008-09-14 16:05 almsot 2008-09-14 16:05 just have to set the layout fields for the real filesystem and do some full system testing 2008-09-14 16:38 sk8 oclock 2008-09-14 17:17 hey 2008-09-14 19:46 flips: check my repo 2008-09-14 19:46 fix for inode->xcache leak 2008-09-14 19:59 will do 2008-09-14 20:00 pulls very slowly when there's a simultaneous kernel download in progress 2008-09-14 20:00 got to get me sum more a that bandwidth 2008-09-14 20:02 shapor, what's that (int)strlen(name) for? 2008-09-14 20:13 hey, should I do the tux3 kernel part with git or mercurial? 2008-09-14 20:14 maybe I should as on the mercurial channel 2008-09-14 20:16 hey shapor 2008-09-14 20:22 it'll eventually need to be git in the kernel 2008-09-14 20:27 %.*s expects int not size_t 2008-09-14 20:28 flips: ^ 2008-09-14 20:28 konrad, not really 2008-09-14 20:28 there are some mercurial projects for kernel things, for example btrfs 2008-09-14 20:28 hm, ok 2008-09-14 20:29 it's rather stupid for strlen to return size_t 2008-09-14 20:29 indeed 2008-09-14 20:29 kinda makes you want to reimplement it, doesnt it? 2008-09-14 20:29 like anybody should scan that much ascii text looking for a crappy null byte 2008-09-14 20:29 in asm ;) 2008-09-14 20:30 your basic 5 byte assembly program 2008-09-14 20:30 12 if you can a really fancy fast one 2008-09-14 20:30 scasb makes it easy doesnt it? 
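The `(int)strlen(name)` question above comes down to the `*` precision in a printf format consuming an `int` argument, while `strlen` returns `size_t`. A small sketch of the cast in context, assuming a hypothetical helper `format_name` that is not part of the tux3 source:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format the first `len` bytes of `name` into `out`.
 * Returns what snprintf returns (length the full output would need). */
int format_name(char *out, size_t outsize, const char *name, size_t len)
{
	/* The '*' precision is read as an int from the varargs list, so a
	 * size_t length has to be cast before it is passed. Lengths above
	 * INT_MAX would not survive the cast; names are assumed short. */
	return snprintf(out, outsize, "name: %.*s", (int)len, name);
}
```

Length modifiers apply to the conversion itself, not to a `*` width or precision, which is why the cast is the usual workaround.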
2008-09-14 20:30 scasb is slow on a lot of procs 2008-09-14 20:30 have'nt been keeping up with the latest 2008-09-14 20:30 hmm 2008-09-14 20:31 but a simple look using basic register instructions is fastest today 2008-09-14 20:31 let the superscaler logic do its thing 2008-09-14 20:31 and the shadow registers 2008-09-14 20:31 simple loop 2008-09-14 20:31 anyway, we're git 2008-09-14 20:31 I just checked it in 2008-09-14 20:31 tux3 stub kernel fs is landing tonight 2008-09-14 20:32 shapor, anyway we have %t 2008-09-14 20:32 that's for this braindamage I think 2008-09-14 20:32 wll 2008-09-14 20:32 doesn't work for %.*s 2008-09-14 20:32 yuck 2008-09-14 20:33 stupid ancient unix gods 2008-09-14 20:34 heh 2008-09-14 20:38 there we go 2008-09-14 20:41 typical linux: CONFIG_MMU means "CONFIG_NOMMU" 2008-09-14 20:42 tux3 will not support nommu for now 2008-09-14 20:42 if somebody wants that they can pay for it 2008-09-14 20:43 ah, ramfs has an actual application 2008-09-14 20:43 it implements rootfs 2008-09-14 20:43 that's why it got a little bloaty 2008-09-14 20:43 lately 2008-09-14 20:47 * Tux3 Versioning Filesystem 2008-09-14 20:47 * 2008-09-14 20:47 * Portions Copyright (C) 2000 Linus Torvalds, 2000 Transmeta Corp. 2008-09-14 20:47 * Licensed under the GPL v2 2008-09-14 20:47 */ 2008-09-14 20:47 well 2008-09-14 20:47 what about the other (c) 2008-09-14 20:48 * Tux3 Versioning Filesystem 2008-09-14 20:48 * 2008-09-14 20:48 * Copyright (c) 2008, Daniel Phillips 2008-09-14 20:48 * Portions Copyright (C) 2000 Linus Torvalds, 2000 Transmeta Corp. 
2008-09-14 20:48 * Licensed under the GPL v2 2008-09-14 20:48 */ 2008-09-14 20:48 there we go 2008-09-14 20:50 one little c one C 2008-09-14 20:52 hrm there is a still one leak in the inode test 2008-09-14 20:52 ==15560== 8,160 (8,040 direct, 120 indirect) bytes in 1 blocks are definitely lost in loss record 4 of 7 2008-09-14 20:52 ==15560== at 0x4A1B858: malloc (vg_replace_malloc.c:149) 2008-09-14 20:52 ==15560== by 0x401DC2: new_map (buffer.c:442) 2008-09-14 20:52 ==15560== by 0x40988C: new_inode (inode.c:111) 2008-09-14 20:52 ==15560== by 0x40AD46: make_tux3 (inode.c:476) 2008-09-14 20:52 linus wrote the big C 2008-09-14 20:52 ==15560== by 0x40B17B: main (inode.c:530) 2008-09-14 20:53 I'm not correcting his typos 2008-09-14 20:53 I treat his copyright notice as (c) linus 2008-09-14 20:53 well 2008-09-14 20:53 it does look stupid 2008-09-14 20:53 there we go, changed to (c), I'm a flagrant copyright scofflaw 2008-09-14 20:53 arrest me 2008-09-14 20:54 i do not expect that leak to last long 2008-09-14 20:59 config TUX3_FS 2008-09-14 20:59 tristate "Tux3 Versioning Filesystem" 2008-09-14 20:59 help 2008-09-14 20:59 To compile this file system support as a module, choose M here: the 2008-09-14 20:59 module will be called tux3. 2008-09-14 20:59 If unsure, say Maybe. 
2008-09-14 20:59 hrm its only the map in the sb inode 2008-09-14 21:00 in make_tux3 2008-09-14 21:00 seems odd since free_inode does indeed free the map unless its null 2008-09-14 21:00 the way those initializers work is dodgy 2008-09-14 21:00 structure assignments 2008-09-14 21:01 combined with desginated init = brainmuck 2008-09-14 21:01 probably should do it all with mallocs 2008-09-14 21:01 the fs init that is 2008-09-14 21:01 the reason for the cute little minimal struct decs is getting old 2008-09-14 21:38 -!- amey(~amey@116.73.35.180) has joined #tux3 2008-09-14 21:39 hi 2008-09-14 21:40 -!- amey(~amey@116.73.35.180) has left #tux3 2008-09-14 21:41 cat /proc/filesystems 2008-09-14 21:41 nodev sysfs 2008-09-14 21:41 nodev rootfs 2008-09-14 21:41 nodev bdev 2008-09-14 21:41 nodev proc 2008-09-14 21:41 nodev sockfs 2008-09-14 21:41 nodev pipefs 2008-09-14 21:41 nodev anon_inodefs 2008-09-14 21:41 nodev tmpfs 2008-09-14 21:41 nodev inotifyfs 2008-09-14 21:41 nodev devpts 2008-09-14 21:41 reiserfs 2008-09-14 21:41 ext3 2008-09-14 21:41 ext2 2008-09-14 21:41 nodev tux3 2008-09-14 21:41 nodev ramfs 2008-09-14 21:41 nodev hostfs 2008-09-14 21:41 nodev mqueu 2008-09-14 21:41 let's get rid of some useless ones 2008-09-14 21:42 anon_inodefs <- :p 2008-09-14 21:42 what is that? 
2008-09-14 21:42 crap 2008-09-14 21:42 haven't looked at it 2008-09-14 21:42 but I can tell from the name 2008-09-14 21:42 few other dodgy looking ones 2008-09-14 21:43 now, are job is to get rid of the nodev on tux3 2008-09-14 21:43 let's try to mount 2008-09-14 21:45 root@deep:~# mount -t tux3 tux3 /mnt 2008-09-14 21:45 root@deep:~# echo hello >/mnt/foo 2008-09-14 21:45 root@deep:~# cat /mnt/foo 2008-09-14 21:45 hello 2008-09-14 21:46 root@deep:~# mount 2008-09-14 21:46 /dev/ubda on / type ext2 (rw) 2008-09-14 21:46 proc on /proc type proc (rw) 2008-09-14 21:46 devpts on /dev/pts type devpts (rw,gid=5,mode=620) 2008-09-14 21:46 tux3 on /mnt type tux3 (rw) 2008-09-14 21:46 ok, time to check it in 2008-09-14 22:00 http://phunq.net/ddtree 2008-09-14 22:00 http://phunq.net/ddtree?p=tux3fs;a=summary 2008-09-14 22:01 just for now 2008-09-14 22:06 git... it's actually pretty bad 2008-09-14 22:06 compared to mercurial 2008-09-14 22:06 user unfriendly 2008-09-14 22:07 does not do what you expect 2008-09-14 22:18 you need to do commit -a 2008-09-14 22:18 to get what mercurial does for just commit 2008-09-14 22:19 and what any rational person would want 2008-09-14 22:22 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-14 22:34 nice 2008-09-14 22:42 http://shapor.com/tux3/ updated 2008-09-14 22:50 :) 2008-09-14 22:51 shapor, when's the next round of updates on the design doc? 
2008-09-14 22:56 i was just thinking about that 2008-09-14 23:39 -!- nataliep_(~nataliep@207.47.98.129.static.nextweb.net) has joined #tux3 2008-09-14 23:45 so, fuse's pkg-config wants -D_FILE_OFFSET_BITS and so does everything else, or else off_t will be wrong (current just diskio has it) 2008-09-14 23:45 the only problem is, if I put -D_FILE_OFFSET_BITS on everything, then it shoes up twice in the fuse compile 2008-09-14 23:45 esthetically irritating 2008-09-14 23:46 well, it's just going to be that way 2008-09-14 23:46 and our build will start to suck, just like every build 2008-09-15 01:17 hey 2008-09-15 01:22 hi 2008-09-15 01:22 new lkml post just when out 2008-09-15 01:23 "Tux3 Report: What next?" 2008-09-15 01:28 http://lkml.org/lkml/2008/9/15/23 2008-09-15 01:52 -!- konrad(~konrad@c-24-16-74-109.hsd1.wa.comcast.net) has joined #tux3 2008-09-15 02:10 just read the post 2008-09-15 02:10 good advertizement :) 2008-09-15 02:11 that's the idea 2008-09-15 02:12 ok night :) 2008-09-15 02:45 http://www.letterp.com/~dbg/practical-file-system-design.pdf <- a book on filesystem design 2008-09-15 02:45 I should read it 2008-09-15 02:45 learn something 2008-09-15 03:06 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-09-15 03:56 flips: git has another level between the repository and your checkout. that's why you need to do the commit -a 2008-09-15 03:56 you can just alias it though 2008-09-15 04:47 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-09-15 07:02 -!- eli(~elicriffi@66.249.86.209) has joined #tux3 2008-09-15 07:48 -!- Kirantpatil(~kiran@122.167.194.220) has joined #tux3 2008-09-15 07:48 -!- Kirantpatil(~kiran@122.167.194.220) has left #tux3 2008-09-15 08:08 -!- amey(~amey@116.73.35.180) has joined #tux3 2008-09-15 08:08 100 2008-09-15 08:08 100! 
2008-09-15 08:09 -!- amey(~amey@116.73.35.180) has left #tux3 2008-09-15 09:21 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-15 09:46 -!- Kirantpatil(~kiran@122.167.211.98) has joined #tux3 2008-09-15 09:46 -!- Kirantpatil(~kiran@122.167.211.98) has left #tux3 2008-09-15 10:13 -!- Kirantpatil(~kiran@122.167.211.98) has joined #tux3 2008-09-15 10:13 -!- Kirantpatil(~kiran@122.167.211.98) has left #tux3 2008-09-15 11:48 data, the thing is, it is not clear that git needs that extra level 2008-09-15 11:49 in fact, hg makes it clear that it doesn't 2008-09-15 12:23 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-15 12:23 folks 2008-09-15 12:37 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-09-15 12:40 hi bh 2008-09-15 12:40 hey maze 2008-09-15 12:40 see, today there is a tux3 filesystem in kernel 2008-09-15 12:40 that's the good news, the bad news is it's really just ramfs 2008-09-15 12:54 url to the post ? 2008-09-15 12:55 http://lkml.org/lkml/2008/9/15/23 2008-09-15 12:55 flips: tux3.org seems... slooow 2008-09-15 12:56 shapor, true 2008-09-15 12:56 don't know why 2008-09-15 12:56 getting lots of traffic? 2008-09-15 12:56 let's see what traffic I've got 2008-09-15 12:56 probably not 2008-09-15 12:56 haven't been monitoring 2008-09-15 12:56 could just be crappy linux vm stepping on itself 2008-09-15 12:56 means: close firefox 2008-09-15 12:57 perhaps git tree getting crawled 2008-09-15 12:57 googlebot is hammering me 2008-09-15 12:57 really hammering 2008-09-15 12:58 fsck 2008-09-15 12:58 DoS 2008-09-15 12:58 oh god 2008-09-15 12:58 it's indexing my git tree 2008-09-15 12:58 stupid, stupid bot 2008-09-15 12:58 /kickban googlebot 2008-09-15 12:58 thats easy 2008-09-15 12:58 robots.txt 2008-09-15 12:59 suggestions? 2008-09-15 12:59 where do I put it? 
2008-09-15 12:59 http://en.wikipedia.org/wiki/Robots.txt 2008-09-15 12:59 put it in / 2008-09-15 12:59 User-agent: * 2008-09-15 12:59 Crawl-delay: 10 2008-09-15 12:59 damn I thought I could use the shapedia 2008-09-15 13:00 that will make it wait 10 seconds between requests to your box 2008-09-15 13:00 oh good I can 2008-09-15 13:00 I only want it to stay out of the git tree 2008-09-15 13:00 anything else it's welcome to index 2008-09-15 13:00 you can put in a disallow line then 2008-09-15 13:00 why not let it crawl the git tree with a 10 sec delay ? 2008-09-15 13:00 you can disallow like 2008-09-15 13:01 Disallow: /ddtree/ 2008-09-15 13:01 just think how many millions of dollars worth of storage git + lxr consume in goog datacenters 2008-09-15 13:01 so 2008-09-15 13:01 User-agent: * 2008-09-15 13:01 Disallow: /ddtree/ 2008-09-15 13:01 add some intelligence to googlebot, then donate 1/2 the savings to oss projects 2008-09-15 13:01 hrm although 2008-09-15 13:02 maybe you dont want the trailing slash 2008-09-15 13:02 in fact i dont think you do 2008-09-15 13:02 ACTION puts that in the suggestion box 2008-09-15 13:09 you should just disallow areas not meant to be indexed ;-) 2008-09-15 13:10 crawl-delay has a tendency to me of little actual use 2008-09-15 13:10 s/me/be/ 2008-09-15 13:10 MaZe: it should at least spread out the pain ;) 2008-09-15 13:15 mayhaps 2008-09-15 15:19 daniel@moonbase:/src/2.6.26.5.tux3$ git add robots.txt 2008-09-15 15:19 daniel@moonbase:/src/2.6.26.5.tux3$ git add fs/robots.txt 2008-09-15 15:19 daniel@moonbase:/src/2.6.26.5.tux3$ git diff 2008-09-15 15:19 daniel@moonbase:/src/2.6.26.5.tux3$ git diff -a 2008-09-15 15:19 no output 2008-09-15 15:19 fscking git 2008-09-15 15:19 I know there is a way, but how about it should just work 2008-09-15 15:19 good example of why we should mainly work with mercurial 2008-09-15 15:23 git diff HEAD 2008-09-15 15:24 true 2008-09-15 15:24 or git diff WANK 2008-09-15 15:24 ;-) 2008-09-15 15:24 thanks 2008-09-15 
15:24 I like git ;-) 2008-09-15 15:24 I do too 2008-09-15 15:24 but not nearly as much as mercurial 2008-09-15 15:25 I'm using git for tinyos... what would be the reasons to switch to mercurial? :D 2008-09-15 15:29 RazvanM, mercurial makes it much easier for new contributers to get up to speed 2008-09-15 15:29 and doesn't get in the way of git experts 2008-09-15 15:30 basically, you just don't type the options that always seemed a little odd 2008-09-15 15:30 my ramp up for mercurial after git was about, um, 10 minutes 2008-09-15 15:30 same for Shapor I think 2008-09-15 15:30 Git took "some getting used to" 2008-09-15 15:30 hit more bugs in git than mercurial too 2008-09-15 15:30 and some things that should be bugs 2008-09-15 15:31 but are instead treated as features 2008-09-15 15:32 :D 2008-09-15 15:32 I definitely agree that it took me some time to get used to git 2008-09-15 15:33 for the dealing with changes for tinyos was very good for me though 2008-09-15 15:33 anyway as you can see I'm being even handed and using both 2008-09-15 15:33 puts me in a better position to complain about git ;) 2008-09-15 15:33 that is true ;-) 2008-09-15 15:33 tried mercurial? 
2008-09-15 15:33 well you must 2008-09-15 15:33 if you have tux3 checked out 2008-09-15 15:34 for tux3 I had a chance to play with it 2008-09-15 15:34 it's just amazing how it seems to do the right thing by default 2008-09-15 15:34 the only whine I've had about it is, there isn't a simple command to delete a head 2008-09-15 15:34 there are a couple of longish commands available 2008-09-15 15:35 I did some changes to make it run on mac but then the new updates failed so I removed the files :P 2008-09-15 15:35 sorry 2008-09-15 15:35 I'll try to make fewer changes ;) 2008-09-15 15:35 hehe 2008-09-15 15:35 not your fault 2008-09-15 15:35 see if you can come up with some guidlines to make that smoother 2008-09-15 15:35 stuff I can do to make life easier for you 2008-09-15 15:37 I hope I'll find some time to try to make it work on mac 2008-09-15 15:37 these days you could probably make it work on a cell phone 2008-09-15 15:37 certainly a Nokia 800 series 2008-09-15 15:37 one thing that I did was to change mode_t to tux3_mode_t to avoid the collisions with the system's one 2008-09-15 15:37 :D :D :D 2008-09-15 15:38 want to send a patch or should I just do that? 2008-09-15 15:38 I'll do that 2008-09-15 15:38 what does it collide with? 2008-09-15 15:38 some libc thing? 2008-09-15 15:38 I think so 2008-09-15 15:38 could you find out? 
2008-09-15 15:38 I need to reboot to see if my kernel panic is still in 10.5.5 2008-09-15 15:38 I'll be back 2008-09-15 15:39 (with the answer about mode_t :D) 2008-09-15 15:39 tux3 doesn't have a mode_t 2008-09-15 15:40 it uses the libc mode_t 2008-09-15 15:40 man 2 stat 2008-09-15 15:41 oops he's gone 2008-09-15 15:41 understandably 2008-09-15 15:47 i don't like the fact hg is python 2008-09-15 15:48 I don't care much 2008-09-15 15:48 I don't like the fact that guido doesn't support native code gen for python 2008-09-15 15:48 iow, guido is more of a problem than matt 2008-09-15 16:04 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-15 16:05 wb 2008-09-15 16:05 damn... they didn't fix my kernel panic :( 2008-09-15 16:05 my robots.txt fu is insufficient 2008-09-15 16:05 help me ;) 2008-09-15 16:05 who didn't? 2008-09-15 16:05 apple :P 2008-09-15 16:06 they just released 10.5.5 2008-09-15 16:06 how yahoobot it beating me up 2008-09-15 16:06 now 2008-09-15 16:06 ah 2008-09-15 16:06 linux seldom panics 2008-09-15 16:06 maybe apple will see the light 2008-09-15 16:06 about the mote_t: ./stdlib.h:typedef __darwin_mode_t mode_t; 2008-09-15 16:07 in my case it panics with a one line C program :P 2008-09-15 16:07 so... 2008-09-15 16:07 in exaclty what way is it incompatible? 2008-09-15 16:08 tux3 was also typedef-ing mode_t I think... 2008-09-15 16:08 perhaps it doesn't anymore :D 2008-09-15 16:10 it does not from what I see... 2008-09-15 16:12 I need to go back on some graphs for a paper. I'll try to make an attempt to get tux3 to compile on mac later tonight. 
2008-09-15 16:17 cu 2008-09-15 16:17 tux3 u tomorrow ;) 2008-09-15 16:18 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2008-09-15 16:19 typedef __uint16_t __darwin_mode_t; 2008-09-15 16:19 lamerz 2008-09-15 16:22 robots.txt standard is really lame 2008-09-15 16:22 only works in the root of the server 2008-09-15 16:22 "standard" :p 2008-09-15 16:23 yep 2008-09-15 16:23 tux3 uses mode_t in just one place: fuse 2008-09-15 16:23 does fuse work on mac? 2008-09-15 16:23 so send your complaint to our fuser department ;) 2008-09-15 16:23 I wonder 2008-09-15 16:24 would be weird hmm? 2008-09-15 16:24 mhm 2008-09-15 16:24 but fuse won't compile by default 2008-09-15 16:24 the pkg-config will just not fire 2008-09-15 16:24 I think 2008-09-15 16:24 ifeq ($(shell pkg-config fuse && echo found), found) 2008-09-15 16:24 binaries += tux3fs tux3fuse 2008-09-15 16:24 endif 2008-09-15 16:24 right 2008-09-15 16:24 so if there's a mode_t problem, it's not in the tux3 source 2008-09-15 16:25 nice piece of make scripting by the way 2008-09-15 16:25 who did that again? 2008-09-15 16:25 RazvanM maybe 2008-09-15 16:37 moonbase:/var/www# cat robots.txt 2008-09-15 16:37 User-agent: * 2008-09-15 16:37 Disallow: / 2008-09-15 16:38 I'll refine that later 2008-09-15 16:38 needless to say, googlebot is still bothering me 2008-09-15 16:38 flips: you know robots.txt does not work for rogue bots, do you? 
2008-09-15 16:39 "don't take no for an answer" -- googlebot 2008-09-15 16:39 pgquiles, if I see a rogue bot I know what to do 2008-09-15 16:39 I can say that googlebot is very impolite 2008-09-15 16:39 :-) 2008-09-15 16:39 has no concept of staying within a reasonable share of bandwidth 2008-09-15 16:40 obviously, googlebot is coded and maintained by "smart people" ;) 2008-09-15 16:40 :-D 2008-09-15 16:48 oh, msnbot has joined the party 2008-09-15 16:48 everybody knows where there's a good party it seems 2008-09-15 16:49 now, I want to explain to them: crawl my site, just don't index every version of linus's git tree 2008-09-15 16:49 please 2008-09-15 16:57 tell someone at google about git 2008-09-15 17:01 flips: with a robots.txt like that it is likely you won't show up in search results 2008-09-15 17:02 I know 2008-09-15 17:02 I needed some peace 2008-09-15 17:02 while I write the real file 2008-09-15 17:02 Disallow: /ddtree didn't work ? 2008-09-15 17:02 i guess you dont really care about phunq.net anyway 2008-09-15 17:02 let's try it 2008-09-15 17:03 from my access logs it looks like most bots only grab robots.txt before a scan 2008-09-15 17:03 not before earch request 2008-09-15 17:03 so if they have already started crawling it might just keep continuing 2008-09-15 17:03 i dunno 2008-09-15 17:04 I know 2008-09-15 17:04 that's impolite 2008-09-15 17:04 also for your disallow i think you need to give full paths 2008-09-15 17:04 so 2008-09-15 17:04 Disallow: fifo.c 2008-09-15 17:04 specially for one like googlebot with zigabucks spent on it 2008-09-15 17:04 won't do anything 2008-09-15 17:04 true 2008-09-15 17:04 and it has to be in the root 2008-09-15 17:04 all of which is work 2008-09-15 17:04 if you run apache you could add a disallow for googlebot in the .htaccess 2008-09-15 17:04 which isn't pleasant when my system is getting hammered 2008-09-15 17:04 based on user agent 2008-09-15 17:05 does it work for nonhtml files? 
2008-09-15 17:05 I guess 2008-09-15 17:05 yes 2008-09-15 17:05 but you have to name the bot 2008-09-15 17:05 right? 2008-09-15 17:05 yeah 2008-09-15 17:05 funny how these standards don't get improved 2008-09-15 17:05 intrenched 2008-09-15 17:05 hard to change 2008-09-15 17:06 also no interest from goog in making it easy to exclude the bot, it's all or nothing, or pain 2008-09-15 17:06 just look for +http:// in the user agent 2008-09-15 17:06 same with the others of course 2008-09-15 17:06 i mean cmon we use http + javascript for "pushing" data to a browser 2008-09-15 17:06 semi-polite bots all have that 2008-09-15 17:06 whcih is just a polling pull loop 2008-09-15 17:06 its all crap 2008-09-15 17:07 but xml makes it better ;) 2008-09-15 17:07 ACTION pukes a little in his mouth 2008-09-15 17:07 flips: .htpasswd? 2008-09-15 17:07 for the moment no robot's are hitting me 2008-09-15 17:07 pgquiles_, but the site needs to be public 2008-09-15 17:08 moonbase:/var/www# cat robots.txt 2008-09-15 17:08 User-agent: * 2008-09-15 17:08 Disallow: /ddtree 2008-09-15 17:08 ok? 
2008-09-15 17:08 seems reasonable 2008-09-15 17:09 I think the bots are staying away until their next sniff cycle 2008-09-15 17:09 http://www.whitehouse.gov/robots.txt 2008-09-15 17:09 because of the / 2008-09-15 17:09 interesting 2008-09-15 17:09 from before 2008-09-15 17:09 thats the first result for robots.txt disallow on google 2008-09-15 17:10 that's the one site that should be completely indexed 2008-09-15 17:10 without option 2008-09-15 17:10 heh 2008-09-15 17:10 the only url it is pulling is /ddtree 2008-09-15 17:10 User-agent: whsearch 2008-09-15 17:10 everything else are just paremeters being passed to it 2008-09-15 17:11 so i think /ddtree is sufficient 2008-09-15 17:11 can search an extra dozen things 2008-09-15 17:11 ok 2008-09-15 17:11 later I will refine that 2008-09-15 17:11 so it allows searching the tux3 part of ddtree 2008-09-15 17:11 and only that 2008-09-15 17:27 it's time to do extents 2008-09-15 17:27 never mind I claimed I'd do versioning next on lkml 2008-09-15 17:27 extends = benchmarkability 2008-09-15 17:27 easy 2008-09-15 17:28 and that big zfs bully won't kick sand in the face of skinny little tux3 any more 2008-09-15 17:28 tux3 will learn ju-extent-fu 2008-09-15 17:29 well where do we start 2008-09-15 17:29 the hardest part is actually versioned extents, so it's convenient that's the part getting deferred 2008-09-15 17:29 -!- pgquiles__(~pgquiles@50.Red-79-153-248.staticIP.rima-tde.net) has joined #tux3 2008-09-15 17:30 we need to be able to alloc an extent for one thing 2008-09-15 17:30 so the bitmap scanning gets fancier 2008-09-15 17:30 not just searching for a bit any more, but a contiguous run of bits 2008-09-15 17:30 and sometimes it might be best for it to say: here's the longest run I found in that region 2008-09-15 17:31 instead of just failing because it didn't find the length asked 2008-09-15 17:31 gets into heuristics 2008-09-15 17:31 then what else 2008-09-15 17:32 extents only appear in dleaf 2008-09-15 17:32 so... 
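The extent allocation idea above (search for a contiguous run of bits, and report the longest run seen rather than simply failing) can be sketched in userspace C. This is only an illustration under stated assumptions: a set bit means "in use", bits are numbered least significant first within each byte, and `struct run` and `find_free_run` are hypothetical names, not tux3's balloc code.

```c
#include <stddef.h>

/* Hypothetical result type: where a free run starts and how long it is. */
struct run {
	size_t start;
	size_t len;
};

/* Assumption: bit set = block in use, least significant bit first. */
static int bit_is_set(const unsigned char *map, size_t bit)
{
	return (map[bit >> 3] >> (bit & 7)) & 1;
}

/*
 * Scan bits [start, start + count) for `want` contiguous free bits
 * (want must be at least 1). If such a run exists, return the first one
 * with len == want. Otherwise return the longest free run seen, which
 * has len < want and may have len == 0.
 */
struct run find_free_run(const unsigned char *map, size_t start,
			 size_t count, size_t want)
{
	struct run best = { start, 0 };
	size_t run_start = start, run_len = 0;

	for (size_t i = start; i < start + count; i++) {
		if (bit_is_set(map, i)) {
			run_len = 0;		/* run broken by a used block */
			continue;
		}
		if (run_len == 0)
			run_start = i;		/* a new free run begins here */
		run_len++;
		if (run_len > best.len) {
			best.start = run_start;
			best.len = run_len;
		}
		if (best.len >= want)
			break;			/* long enough, stop early */
	}
	return best;
}
```

Returning the best run instead of an error leaves the heuristic choice (take the shorter extent, or look in another region) to the caller.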
2008-09-15 17:32 caller of deaf methods has impact 2008-09-15 17:32 those are in btree.c and inode.c 2008-09-15 17:32 I wonder if there is any btree.c impact 2008-09-15 17:32 there should not be 2008-09-15 17:33 actually, btree.c never even knows its calling dleaf methods 2008-09-15 17:33 so I guess the entire impact is in inode.c 2008-09-15 17:33 maybe 2008-09-15 17:33 and balloc.c 2008-09-15 17:33 big and 2008-09-15 17:34 yeah, so that makes them easy right? :P 2008-09-15 17:34 relatively 2008-09-15 17:34 compard to versioned extents 2008-09-15 17:34 should just make them versioned to begin with 2008-09-15 17:34 but still one of the messier bits so far 2008-09-15 17:34 my head hurts just thinking about it 2008-09-15 17:34 defer 2008-09-15 17:34 haha 2008-09-15 17:34 gain experience 2008-09-15 17:34 with the simpler case 2008-09-15 17:34 that is like, at the core of tux3 though 2008-09-15 17:35 oh yeah 2008-09-15 17:35 heart and soul 2008-09-15 17:35 but winning benchmarks is too 2008-09-15 17:35 and I can smell blood ;) 2008-09-15 17:35 without versioning, boo 2008-09-15 17:35 :P 2008-09-15 17:35 doesn't bother me 2008-09-15 17:35 yeah 2008-09-15 17:35 walk first 2008-09-15 17:35 if versioning arrives a month later 2008-09-15 17:35 run later 2008-09-15 17:35 then jump 2008-09-15 17:35 then arabesque 2008-09-15 17:35 teleport 2008-09-15 17:36 right 2008-09-15 17:36 then stretch space and time so travel isn't necessary anymore 2008-09-15 17:36 that's called multicore 2008-09-15 17:36 heh 2008-09-15 17:36 or is it qubits? 
2008-09-15 17:36 bubytes would rule 2008-09-15 17:36 drugs i think 2008-09-15 17:36 qubytes 2008-09-15 17:37 that too 2008-09-15 17:37 just imagine the great hack 2008-09-15 17:37 "in xanadu did kubla geek a stately filesystem arch decree" 2008-09-15 17:38 -- anon 2008-09-15 17:42 ok, I"m going to get another eee 2008-09-15 17:42 seems one isn't enough 2008-09-15 17:42 wow your original tux3 announcement is still in the hottest threads on lkml 2008-09-15 17:42 the wife likes it ;) 2008-09-15 17:42 so it is, pushed down a little by the time travel 2008-09-15 17:43 probably a plot by linus to get his spot back 2008-09-15 17:43 and who is frans pop, is that a real name? ;) <- jk 2008-09-15 17:44 wow, what next is in the top 2008-09-15 17:44 right after -rc6 and time travel 2008-09-15 17:46 getting close to sk8 oclock 2008-09-15 17:53 ACTION flips_rollin 2008-09-15 19:43 flips: http://milek.blogspot.com/2008/03/zfs-de-duplication.html 2008-09-15 19:46 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-15 20:02 -!- tux3bot(~tux3bot@yzf.shapor.com) has joined #tux3 2008-09-15 20:03 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-15 20:05 -!- Kirantpatil(~kiran@122.167.199.254) has joined #tux3 2008-09-15 20:05 -!- Kirantpatil(~kiran@122.167.199.254) has left #tux3 2008-09-15 20:06 -!- ChanServ changed mode/#tux3 -> -o hirofumi 2008-09-15 20:06 -!- ChanServ changed topic to "Tux3 list membership just hit 100! ~ http://tux3.org" 2008-09-15 21:04 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-09-15 21:04 back 2008-09-15 21:04 that was a bit wierd 2008-09-15 21:07 flips: what do you think about the link ? 2008-09-15 21:25 -!- Aks(~ankitsriv@59.90.32.1) has joined #tux3 2008-09-15 22:13 -!- Aks(~ankitsriv@59.90.32.1) has left #tux3 2008-09-15 23:49 flips: you there ? 2008-09-16 01:29 `now 2008-09-16 01:30 autokilled... 
graphic 2008-09-16 01:32 march, it's old 2008-09-16 01:33 truth is, I fully trust in our students to out deduplicate any weeny engineers 2008-09-16 01:35 "what next" is #5 on lkml.org 2008-09-16 01:35 #6 2008-09-16 01:41 hehe 2008-09-16 01:43 hey maze 2008-09-16 01:44 hey 2008-09-16 01:44 we need to get your bio transfer working 2008-09-16 01:44 20 lines or less is defined as "working" 2008-09-16 01:44 ;-) 2008-09-16 01:44 just write a simple endio 2008-09-16 01:44 yeah, I've mostly been reading docs all sunday 2008-09-16 01:44 and browsing the source code 2008-09-16 01:45 put your submitter into sleep on a wait queue 2008-09-16 01:45 that's it 2008-09-16 01:45 one line wait queue declaration 2008-09-16 01:45 I could probably write something that works now, but I need to setup a better (less likely to crash machine) debug scenario than insmod into running machine 2008-09-16 01:45 naw 2008-09-16 01:45 and now it's unfortunately the work week ;-) 2008-09-16 01:45 write something that works 2008-09-16 01:45 forget about debugging 2008-09-16 01:45 hehe 2008-09-16 01:45 it works or it doesn' 2008-09-16 01:45 doesn't 2008-09-16 01:46 you'll know by the amount of smoke 2008-09-16 01:46 I don't like my work machine smoking though ;-) 2008-09-16 01:46 you're not going to make me write it for you? 2008-09-16 01:46 oh, wait a minute 2008-09-16 01:46 there should be a simple-ish solution 2008-09-16 01:46 there is 2008-09-16 01:47 sleep 2008-09-16 01:47 endio wakes 2008-09-16 01:47 simple 2008-09-16 01:47 no, no, no - not mucking around with disk io on my live machine 2008-09-16 01:47 ACTION has to pop some popcorn 2008-09-16 01:47 I've already gone through a painful weekend of data recovery 2008-09-16 01:47 why not? 2008-09-16 01:47 trust me 2008-09-16 01:47 nothing will break 2008-09-16 01:47 I trust you... I don't trust myself. 
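The "sleep, endio wakes" recipe above, with the submitter's state kept in the bio's private field, can be modelled in plain userspace C. The sketch below is a toy: `toy_bio`, `toy_submit`, `toy_device_poll` and the rest are hypothetical stand-ins, not the kernel's `struct bio` or `submit_bio()`, and the "sleep" is a polling loop where a kernel version would block on a wait queue.

```c
#include <stddef.h>

/* Toy stand-in for a bio: a completion callback plus private state. */
struct toy_bio {
	void (*end_io)(struct toy_bio *bio, int error);	/* called on completion */
	void *private;					/* submitter's state struct */
	struct toy_bio *next;				/* pending-queue link */
};

/* The state struct the submitter hangs off the private field. */
struct toy_waiter {
	int done;	/* set by end_io; a kernel version would wake a sleeper */
	int error;
};

static struct toy_bio *pending;	/* requests the toy "device" has not finished */

/* Queue a request; nothing completes until the device runs. */
void toy_submit(struct toy_bio *bio)
{
	bio->next = pending;
	pending = bio;
}

/* The toy "device": finish one pending request and call its end_io.
 * Returns 1 if a request was completed, 0 if the queue was empty. */
int toy_device_poll(void)
{
	struct toy_bio *bio = pending;

	if (!bio)
		return 0;
	pending = bio->next;
	bio->end_io(bio, 0);
	return 1;
}

/* Completion callback: recover the waiter from the private field. */
void toy_end_io(struct toy_bio *bio, int error)
{
	struct toy_waiter *w = bio->private;

	w->error = error;
	w->done = 1;
}

/* Submit, then wait until end_io has marked the request done. */
int toy_read_sync(void)
{
	struct toy_waiter w = { 0, 0 };
	struct toy_bio bio = { toy_end_io, &w, NULL };

	toy_submit(&bio);
	while (!w.done)
		toy_device_poll();	/* stands in for sleeping until woken */
	return w.error;
}
```

The point of the private pointer is that the callback runs in some other context and has no other way to find out who is waiting on the request.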
2008-09-16 01:47 except you might have to reboot ;) 2008-09-16 01:47 but probably not 2008-09-16 01:48 hard to go wrong with a read 2008-09-16 01:48 these days, your task can oops and the machine keeps right on running 2008-09-16 01:49 if you fear, compile uml 2008-09-16 01:49 make ARCH=um 2008-09-16 01:50 hmm 2008-09-16 01:50 ok, trying to write something 2008-09-16 01:50 will try pasting here in a moment 2008-09-16 01:51 deal 2008-09-16 01:51 ACTION goes to write some extent code 2008-09-16 01:51 totally drunk 2008-09-16 01:51 prolly better concentrate on the popcorn 2008-09-16 01:52 we had to celebrate tim's twins tonight 2008-09-16 01:55 balloc_from_range has to become balloc_extent_from_range 2008-09-16 01:56 going to be a mess 2008-09-16 01:56 fortunately, balloc_from_range is pretty tight 2008-09-16 01:57 going to be tighter with big endian scan 2008-09-16 02:17 block_t balloc_extent_from_range(struct inode *inode, block_t start, block_t count, unsigned length) 2008-09-16 02:17 declare looks could 2008-09-16 02:17 could somebody implement this please? 2008-09-16 02:18 ;-) 2008-09-16 02:22 maze, you need my uml recipe 2008-09-16 02:22 hmm, wouldn't mind 2008-09-16 02:22 nothing to fear except fear itself 2008-09-16 02:22 ok 2008-09-16 02:22 would you imagine I just roped myself into helping out someone for work... 2008-09-16 02:22 work? 2008-09-16 02:22 what's that? 2008-09-16 02:23 hehe - that annoying aspect of life 2008-09-16 02:24 that occupies 5pm-9am 2008-09-16 02:25 maze, wget http://phunq.net/root_fs 2008-09-16 02:25 100M 2008-09-16 02:26 exactly 2008-09-16 02:34 20% 2008-09-16 02:34 you've got slow uplink ;-) 2008-09-16 02:35 I could have pulled a dvd off of kernel.org by now 2008-09-16 02:36 we'll move 2008-09-16 02:36 to a faster host 2008-09-16 02:36 pretty soon 2008-09-16 02:36 this host is my desktop 2008-09-16 02:36 kernel.org hosts dvds now? 
2008-09-16 02:37 not kernel.org 2008-09-16 02:37 home grown 2008-09-16 02:37 homeboy 2008-09-16 02:37 in response to MaZe's comment 2008-09-16 02:37 ;) 2008-09-16 02:37 got it? 2008-09-16 02:38 28% 2008-09-16 02:38 wah 2008-09-16 02:38 sucks 2008-09-16 02:38 what isp is your desktop on? 2008-09-16 02:38 speakeasy 2008-09-16 02:38 they are looking sucky 2008-09-16 02:38 especially at the price I pay 2008-09-16 02:39 granted comcast sent out a letter saying that starting october users going over 250G a month have to pay extra 2008-09-16 02:39 I'm nowhere close 2008-09-16 02:39 internet is barbaric in us of a 2008-09-16 02:40 primitive savages 2008-09-16 02:40 yep 2008-09-16 02:40 I don't come close either, at least I don't think 2008-09-16 02:40 -!- pgquiles__(~pgquiles@50.Red-79-153-248.staticIP.rima-tde.net) has joined #tux3 2008-09-16 02:40 I could if upload was better ... 2008-09-16 02:40 ok, how's it? 2008-09-16 02:40 you have to average roughly 0.8mbit a month to get 250G 2008-09-16 02:40 34% 2008-09-16 02:40 er, 0.8mbit/s constantly 2008-09-16 02:40 cuz I have to write the rest of the recipe by the time you get it 2008-09-16 02:40 good 2008-09-16 02:40 gives me time to think 2008-09-16 02:40 39.5KB/s 2008-09-16 02:40 lol 2008-09-16 02:41 bleah 2008-09-16 02:41 modem 2008-09-16 02:41 fyi: 2008-09-16 02:41 maze@athina:~$ wget http://mirrors.kernel.org/centos/4.7/isos/x86_64/CentOS-4.7-x86_64-binDVD.iso 2008-09-16 02:41 --02:39:10-- http://mirrors.kernel.org/centos/4.7/isos/x86_64/CentOS-4.7-x86_64-binDVD.iso 2008-09-16 02:41 => `CentOS-4.7-x86_64-binDVD.iso' 2008-09-16 02:41 Resolving mirrors.kernel.org... 204.152.191.39, 204.152.191.7 2008-09-16 02:41 Connecting to mirrors.kernel.org|204.152.191.39|:80... connected. 2008-09-16 02:41 HTTP request sent, awaiting response... 
200 OK 2008-09-16 02:41 Length: 2,699,399,168 (2.5G) [application/x-iso9660-image] 2008-09-16 02:41 100%[==================================>] 2,699,399,168 13.21M/s ETA 00:001 2008-09-16 02:41 02:41:15 (21.04 MB/s) - `CentOS-4.7-x86_64-binDVD.iso' saved [2699399168/2699399168] 2008-09-16 02:41 maze@athina:~$ 2008-09-16 02:42 38% 2008-09-16 02:42 m 2008-09-16 02:42 ah right 2008-09-16 02:42 ok, I live in santa monica 2008-09-16 02:42 give me a break 2008-09-16 02:42 ;-) 2008-09-16 02:43 I find kernel.org to be ridiculously fast though 2008-09-16 02:43 never a good benchmark 2008-09-16 02:43 not a coincidence 2008-09-16 02:43 wow 2008-09-16 02:43 21MB/s is impressive 2008-09-16 02:43 I don't think I can write to my harddrive that fast 2008-09-16 02:43 kernel.org is not far from the backbone 2008-09-16 02:44 I can 2008-09-16 02:44 yeah, few ms ping time 2008-09-16 02:44 I can do 60MB/s 2008-09-16 02:44 I can max out gigabit 2008-09-16 02:44 I actually know kernel.org does readahead buffering in 64mb chunks 2008-09-16 02:44 max out gigabit = 60 MB/sec 2008-09-16 02:44 before you hit chipset 2008-09-16 02:45 unless it's a servers chipset 2008-09-16 02:45 because I can see 64MB fly in in 0.7 seconds, then 1-2 second wait as kernel.org reads in the next 64mb 2008-09-16 02:45 nah, my laptop does 117 MB/s tcp 2008-09-16 02:45 assuming there's no disk IO involved 2008-09-16 02:45 no it doesn't 2008-09-16 02:45 the disk is much slower of course 2008-09-16 02:45 tested - it does 2008-09-16 02:45 well 2008-09-16 02:46 true 2008-09-16 02:46 that's the max 2008-09-16 02:46 unbellievable 2008-09-16 02:47 ok, the rest of the recipe: 2008-09-16 02:47 [maze@nike ~]$ time dd if=/dev/zero bs=65536 count=20480 | ssh -c arcfour128 maze@athina cat \> /dev/null 2008-09-16 02:47 20480+0 records in 2008-09-16 02:47 20480+0 records out 2008-09-16 02:47 1342177280 bytes (1.3 GB) copied, 17.1915 s, 78.1 MB/s 2008-09-16 02:47 real 0m17.202s 2008-09-16 02:47 user 0m11.025s 2008-09-16 02:47 sys 
0m5.620s 2008-09-16 02:47 that's with both systems actually doing work 2008-09-16 02:47 make defconfig ARCH=um && make lines ARCH=um && ./linus ubdo=root_fs 2008-09-16 02:48 and notice with ssh 2008-09-16 02:48 (spot the typo) 2008-09-16 02:48 lines? 2008-09-16 02:48 typo #1 2008-09-16 02:49 ubd or ubdo - not sure 2008-09-16 02:49 ./linus - weird 2008-09-16 02:49 make defconfig ARCH=um && make linux ARCH=um && ./linux ubd0=root_fs 2008-09-16 02:49 but I really have never used this so I'm guessing 2008-09-16 02:49 ah 2008-09-16 02:50 55% 2008-09-16 02:50 I told you I was totally drunk 2008-09-16 02:50 later you will realize you could have created your own root_fs 5x faster than downloading from me 2008-09-16 02:51 only thing is, it might not have nano in it 2008-09-16 02:51 and it might now be 105 MB 2008-09-16 02:51 well, I'm actually writing the code now, so it parallelizes 2008-09-16 02:51 besides I actually have a rootfs somewhere around here 2008-09-16 02:51 it's also 100mb or so 2008-09-16 02:51 mine is better 2008-09-16 02:51 includes ssh/mc a few other things 2008-09-16 02:51 sshd of course 2008-09-16 02:51 reboot time! 2008-09-16 02:51 that's good 2008-09-16 02:51 reboot konrad 2008-09-16 02:52 new kernel yay 2008-09-16 02:52 i can get > 1 reboot second with uml 2008-09-16 02:53 well my move in 2 days gets me a little closer to the backbone 2008-09-16 02:53 I am happy 2008-09-16 02:53 I would be too 2008-09-16 02:53 my connection sucks 2008-09-16 02:54 alright, reboot for real this time 2008-09-16 02:54 I might blog about speakeasy 2008-09-16 02:54 see if that gets it upgraded 2008-09-16 02:55 we are signed up for FAST by the way 2008-09-16 02:55 in advance 2008-09-16 02:55 poseter seeion, WIP report and BOF 2008-09-16 02:56 session 2008-09-16 02:57 konrad's not back 2008-09-16 02:57 seem, reboots take longer the faster cpus get 2008-09-16 02:57 bio_alloc(GFS_KERNEL, 1); 2008-09-16 02:57 wicked... 
2008-09-16 02:57 72% 2008-09-16 02:57 pathetic 2008-09-16 02:58 yea! 2008-09-16 02:58 I have a race in my code 2008-09-16 02:58 getting a race on sleep is easy 2008-09-16 02:59 that race? 2008-09-16 03:00 nah 2008-09-16 03:00 possibility to remove module, before bio comes back 2008-09-16 03:00 true 2008-09-16 03:00 don't worry about it 2008-09-16 03:00 just don't have twitchy fingers 2008-09-16 03:01 if you want to fix the race 2008-09-16 03:01 flips: ok 2008-09-16 03:01 see module_inc 2008-09-16 03:01 or whatever it's called 2008-09-16 03:01 just wanted to know if you knew about that news 2008-09-16 03:01 forget 2008-09-16 03:01 always 2008-09-16 03:01 lived and breathed it 2008-09-16 03:03 news? 2008-09-16 03:03 -!- konrad(~konrad@c-24-16-74-109.hsd1.mn.comcast.net) has joined #tux3 2008-09-16 03:04 new kernel, new kmod-nvidia breakage 2008-09-16 03:04 bh, when zfs stop panicking on boot it might matter ;) 2008-09-16 03:06 sha256, leet 2008-09-16 03:08 -!- Kirantpatil(~kiran@122.167.182.147) has joined #tux3 2008-09-16 03:10 AH! 2008-09-16 03:10 what do you know, someone already wrote all that code, and I don't need to reimplement it... 2008-09-16 03:10 lol 2008-09-16 03:10 what, sha256? 2008-09-16 03:10 ACTION boggles at the concept 2008-09-16 03:11 nah, get_sb_bdev and kill_block_super 2008-09-16 03:11 well, still worth having written half of it by myslef 2008-09-16 03:11 ah 2008-09-16 03:17 maze, true 2008-09-16 03:17 just have to put up with some minor oddities 2008-09-16 03:17 you will still have to reimplement it 2008-09-16 03:17 but not yet 2008-09-16 03:17 probably 2008-09-16 03:18 still try to figure out what it actually accomplishes 2008-09-16 03:18 does it work? 
2008-09-16 03:18 ask away 2008-09-16 03:18 not done yet, so no 2008-09-16 03:18 doesn't compile 2008-09-16 03:19 the kernel is in bad need of documentation 2008-09-16 03:20 flips: I got into a disagreement with some zfs fanboys about their file system 2008-09-16 03:20 I don't like this 2008-09-16 03:20 it's doing something and I don't know what 2008-09-16 03:20 ugh 2008-09-16 03:21 told them that they can't do much other than reformat their volume(s) if there's a corruption since they don't have an integrity checker 2008-09-16 03:21 they didn't get it, oh well 2008-09-16 03:22 lol 2008-09-16 03:22 bh, understandable 2008-09-16 03:23 the more I know about it, the more I'm glad it's not on linux 2008-09-16 03:23 they're kind of stupid and they weren't listening, no sense in arguing with folks like that 2008-09-16 03:23 they think that checksums will save the entire fucking world 2008-09-16 03:24 it'll help, but it's not the full story 2008-09-16 03:24 they need to add some metamagical themas 2008-09-16 03:24 that, and only that will save the world 2008-09-16 03:24 what's that ? 2008-09-16 03:24 the worlds already been saved - over two thousand years ago 2008-09-16 03:24 special form of checksum 2008-09-16 03:24 ... 
2008-09-16 03:24 requires qubit processor 2008-09-16 03:24 uh 2008-09-16 03:25 bear in mind there was some drinking involved tonight 2008-09-16 03:25 celebrating tim's tiwn 2008-09-16 03:25 just get your file system working dude so that folks will stop trash talking behind your back and stuff 2008-09-16 03:25 tim's twins 2008-09-16 03:25 heh 2008-09-16 03:25 let them trash talk 2008-09-16 03:26 only makes them trash talkers 2008-09-16 03:26 and trash talkers always do it behind one's back 2008-09-16 03:26 nothing changes 2008-09-16 03:26 true 2008-09-16 03:26 besides, I know who they are ;) 2008-09-16 03:27 funny how those reports get around 2008-09-16 03:27 anybody who would be trash talking now is simply somebody who can't read code 2008-09-16 03:30 maze, you must have it by now 2008-09-16 03:30 boot uml and change your life 2008-09-16 03:30 nah, I'm slow 2008-09-16 03:30 oh I have the rootfs 2008-09-16 03:30 the commands are elementary 2008-09-16 03:30 I don't have runnable (or even compileable) code 2008-09-16 03:30 doesn't matter 2008-09-16 03:30 just boot a defconfig 2008-09-16 03:30 addiction is instant 2008-09-16 03:30 faster than crack 2008-09-16 03:31 I ran out of disk space during kernel compile. 2008-09-16 03:31 :P 2008-09-16 03:31 delete gnome 2008-09-16 03:32 ow.
2008-09-16 03:32 get rid of that centos image MaZe 2008-09-16 03:32 it's on another machine 2008-09-16 03:32 oh :( 2008-09-16 03:32 get out your credit card and order a new hd 2008-09-16 03:32 660GB/$85 newegg 2008-09-16 03:32 rush 2008-09-16 03:32 overnight 2008-09-16 03:33 pay 50% of the cost of the disk ;) 2008-09-16 03:33 lol 2008-09-16 03:33 it's a laptop drive 2008-09-16 03:33 oh 2008-09-16 03:33 pay $150 2008-09-16 03:33 then 2008-09-16 03:33 better idea 2008-09-16 03:33 throw in dvd 2008-09-16 03:33 and burn 2008-09-16 03:34 delete the windows partition 2008-09-16 03:34 you know you don't use it 2008-09-16 03:34 it's only 8gb 2008-09-16 03:34 but I have a compressed fast install dump, that I'll burn 2008-09-16 03:35 nothing installs windows xp quite as fast as bzcat winxp.img.bz2 > /dev/win 2008-09-16 03:36 err, re-installs 2008-09-16 03:36 grab the image after activating? :D 2008-09-16 03:36 of course 2008-09-16 03:36 and updating 2008-09-16 03:36 nice 2008-09-16 03:36 fully patched sp3 2008-09-16 03:38 8g is about 50 compiled kernels 2008-09-16 03:38 delete and be happy 2008-09-16 03:39 well, maybe only 25 2008-09-16 03:39 objs got bloaty 2008-09-16 03:40 actually I had 2G free 2008-09-16 03:40 the build took it all up 2008-09-16 03:40 gross 2008-09-16 03:41 centos feature? 2008-09-16 03:41 ACTION couldn't reisist 2008-09-16 03:41 rather kernel build bloat 2008-09-16 03:41 sure 2008-09-16 03:41 didn't realize it was that bad 2008-09-16 03:41 although maybe because I was doing a test build of a patch 2008-09-16 03:41 there was a decent kernel build once... 
2008-09-16 03:42 build with -g bloats it 2008-09-16 03:42 make allmodconfig; make bzImage 2008-09-16 03:42 oh now 2008-09-16 03:42 just don't 2008-09-16 03:42 was testing a patch 2008-09-16 03:42 make defconfig 2008-09-16 03:43 even defconfig is bloated 2008-09-16 03:43 but not even in the ballpark of what you did 2008-09-16 03:46 hehe 2008-09-16 03:46 I wonder where your kernel pulls modules from - or indeed even if it is modular 2008-09-16 03:48 do you realize your root_fs, compresses down to 24MB with bzip? 2008-09-16 03:48 you should switch to using an initramfs.cpio.gz 2008-09-16 03:48 oh yeah 2008-09-16 03:48 :-/ 2008-09-16 03:48 forgot 2008-09-16 03:48 about my lame uplink 2008-09-16 03:48 WARNING: vmlinux: 'memcpy' exported twice. Previous export was in vmlinux 2008-09-16 03:49 huh? 2008-09-16 03:49 what kernel? 2008-09-16 03:49 2.6.27-rc6 2008-09-16 03:49 make mrproper 2008-09-16 03:49 yeah! 2008-09-16 03:49 panic 2008-09-16 03:49 was 2008-09-16 03:49 ok 2008-09-16 03:49 let's go back to 2.6.26.5 2008-09-16 03:49 good idea 2008-09-16 03:50 dvd burned 2008-09-16 03:50 did windoze 2008-09-16 03:50 die 2008-09-16 03:50 ? 2008-09-16 03:50 hmm? why? 2008-09-16 03:51 oh, not yet 2008-09-16 03:51 just wondering 2008-09-16 03:51 verify 2008-09-16 03:52 I'll rzip my root_fs 2008-09-16 03:52 see how small it gets 2008-09-16 03:52 ok building 2.6.26.5 2008-09-16 03:53 it's trying to rzip 2008-09-16 03:53 kind of dimming the lights here 2008-09-16 03:53 memory wise 2008-09-16 03:53 hmm, wonder if one of the two patches I was testing was what broke rc6 2008-09-16 03:53 probably 2008-09-16 03:53 better exit firefox 2008-09-16 03:54 what the hell is rzip? 2008-09-16 03:54 tridge's zip 2008-09-16 03:54 besides sounding powerfull 2008-09-16 03:54 beats pretty much anything 2008-09-16 03:54 like a chainsaw 2008-09-16 03:54 also author of rsync 2008-09-16 03:55 ls -l root_fs* 2008-09-16 03:55 -rwxr-xr-x 1 root root 18067032 Sep 16 02:24 root_fs.rz 2008-09-16 03:55 ok? 
2008-09-16 03:57 yum install rzip 2008-09-16 03:57 good taste 2008-09-16 03:58 Kernel panic - not syncing: Out of memory and no killable processes... 2008-09-16 03:58 wtf 2008-09-16 03:58 host or guest? 2008-09-16 03:58 console [mc-1] enabled 2008-09-16 03:58 ubda: unknown partition table 2008-09-16 03:58 VFS: Mounted root (ext2 filesystem) readonly. 2008-09-16 03:58 request_module: runaway loop modprobe binfmt-464c 2008-09-16 03:58 request_module: runaway loop modprobe binfmt-464c 2008-09-16 03:58 request_module: runaway loop modprobe binfmt-464c 2008-09-16 03:58 request_module: runaway loop modprobe binfmt-464c 2008-09-16 03:58 request_module: runaway loop modprobe binfmt-464c 2008-09-16 03:58 Kernel panic - not syncing: Out of memory and no killable processes... 2008-09-16 03:58 guest, after all - I'm still here 2008-09-16 03:59 you did the commands above? 2008-09-16 03:59 ah 2008-09-16 03:59 you sure that's ubd0 2008-09-16 03:59 yes 2008-09-16 03:59 cause $ file ../root_fs 2008-09-16 03:59 ../root_fs: Linux rev 0.0 ext2 filesystem data 2008-09-16 03:59 try fsck on the root_fs 2008-09-16 03:59 ah, no nevermind, that runs 2008-09-16 04:00 does um have to run as root? 2008-09-16 04:00 no 2008-09-16 04:00 wonder if it's a 64 bit bug 2008-09-16 04:00 email jeff 2008-09-16 04:00 jdike 2008-09-16 04:01 I would expect it to work though 2008-09-16 04:01 I think somebody at intel must have a 64 bit workstation 2008-09-16 04:01 $ gcc --version 2008-09-16 04:01 gcc (GCC) 4.3.0 20080428 (Red Hat 4.3.0-8) 2008-09-16 04:01 ight 2008-09-16 04:01 night 2008-09-16 04:01 red hat... that's always scary 2008-09-16 04:01 bye bye 2008-09-16 04:01 lol 2008-09-16 04:01 bye 2008-09-16 04:02 oh 2008-09-16 04:02 I know 2008-09-16 04:02 your image is 32-bit 2008-09-16 04:02 my kernel is 64-bit 2008-09-16 04:02 right 2008-09-16 04:02 it can't handle the 32-bit binaries 2008-09-16 04:02 it's supposed to work 2008-09-16 04:02 ah, but is the support code compiled in? 
2008-09-16 04:02 or is it trying to modprobe 2008-09-16 04:02 seeing a 32-bit modprobe 2008-09-16 04:03 it should be? 2008-09-16 04:03 and trying to modprobe 2008-09-16 04:03 should not be modprobing 2008-09-16 04:03 anything 2008-09-16 04:03 not, it's not 2008-09-16 04:03 request_module: runaway loop modprobe binfmt-464c 2008-09-16 04:03 but you're probalby right 2008-09-16 04:03 ;-) 2008-09-16 04:03 it's supposed to work and doesn't 2008-09-16 04:03 kernel bugz 2008-09-16 04:03 ahah 2008-09-16 04:03 turn off module loading 2008-09-16 04:03 in the kernel config 2008-09-16 04:04 rusty's code 2008-09-16 04:04 I take no responsibility 2008-09-16 04:04 except module loading is precisely what I want to debug my module... 2008-09-16 04:05 CONFIG_IA32_EMULATION defaults to no 2008-09-16 04:08 forget modules in uml 2008-09-16 04:08 not the point 2008-09-16 04:08 your whole kernel is a module 2008-09-16 04:08 yeah, will this doesn't work at all 2008-09-16 04:08 modules ain't the problems 2008-09-16 04:08 modules and uml don't work that great 2008-09-16 04:08 for debugging 2008-09-16 04:09 the problem is 64-bit kernel, 32-bit userspace, no 32bit emulation 2008-09-16 04:09 well 2008-09-16 04:09 grab a rootfs 2008-09-16 04:09 yougot another bootable partition? 2008-09-16 04:09 or 2008-09-16 04:09 _compile a 32 bit kernel_ 2008-09-16 04:09 remember this is uml 2008-09-16 04:09 yeah, but how to compile a 32-bit uml 2008-09-16 04:09 ARCH=um32? 2008-09-16 04:09 hmm 2008-09-16 04:09 yah 2008-09-16 04:09 busted isn't it 2008-09-16 04:10 bummer 2008-09-16 04:10 well 2008-09-16 04:10 hmm 2008-09-16 04:10 sure, just set the config option 2008-09-16 04:10 well I do happen to have a handy setarch 2008-09-16 04:10 nope 2008-09-16 04:10 that doesn't work 2008-09-16 04:11 I guess you're right 2008-09-16 04:11 it's lamer than that 2008-09-16 04:11 well 2008-09-16 04:11 you need a 64 bit rootfs 2008-09-16 04:11 you got any other bootable partitions? 
2008-09-16 04:12 mac os x ;-) 2008-09-16 04:12 that's why running under kvm would be easier 2008-09-16 04:12 I've now attempted: 2008-09-16 04:12 make defconfig ARCH=um 2008-09-16 04:13 make menuconfig ARCH=um 2008-09-16 04:13 turned on IA32_EMULATION 2008-09-16 04:13 now 2008-09-16 04:13 make oldconfig ARCH=um 2008-09-16 04:13 lots of enters 2008-09-16 04:13 make linux ARCH=um 2008-09-16 04:13 I really should get home and to bed 2008-09-16 04:14 wow, still at the plex 2008-09-16 04:14 you should 2008-09-16 04:14 yeah 2008-09-16 04:14 was playing cards till 11 2008-09-16 04:14 I apologize on behalf of all lame linux hackers 2008-09-16 04:14 about the 64/32 bit thing 2008-09-16 04:15 WARNING: vmlinux: 'memcpy' exported twice. Previous export was in vmlinux 2008-09-16 04:15 but it built... 2008-09-16 04:15 thank god for small meries 2008-09-16 04:15 mercies 2008-09-16 04:16 VFS: Cannot open root device "98:0" or unknown-block(98,0) 2008-09-16 04:16 Please append a correct "root=" boot option; here are the available partitions: 2008-09-16 04:16 Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(98,0) 2008-09-16 04:16 yeah 2008-09-16 04:16 awesome 2008-09-16 04:16 progress 2008-09-16 04:16 not really 2008-09-16 04:16 this time panic was before mount 2008-09-16 04:16 you just need to get the path to your root_fs right now 2008-09-16 04:16 no, it didn't find your rootfs 2008-09-16 04:16 lame error 2008-09-16 04:16 path was right 2008-09-16 04:17 tab completed 2008-09-16 04:17 see your command? 2008-09-16 04:17 $ linux-2.6.26.5/linux ubd0=root_fs 2008-09-16 04:17 [maze@nike l]$ ls -al root_fs 2008-09-16 04:17 -rw-r--r-- 1 maze eng 104857600 2008-09-16 02:24 root_fs 2008-09-16 04:17 ubd0 most not be compiled into the kernel 2008-09-16 04:18 must somehow have gotten unselected 2008-09-16 04:18 true 2008-09-16 04:18 if you didn't use make defconfig, probable 2008-09-16 04:18 you can strace uml 2008-09-16 04:19 I think... 
2008-09-16 04:21 I think problem is 2008-09-16 04:21 make menuconfig was without ARCH=um 2008-09-16 04:21 looks like 64-bit um doesn't have an emulate 32 option 2008-09-16 04:21 that will do you 2008-09-16 04:22 64 bit kernel is supposed to run 32 bit binaries 2008-09-16 04:22 thunking is build into the kernel 2008-09-16 04:25 I don't think it knows how to parse a 32-bit elf header though 2008-09-16 04:26 you might think about that go home and crash option 2008-09-16 04:26 tomorrow get a 64 bit rootfs from somewhere, including copying it from yourself 2008-09-16 04:26 just copy your root onto another disk 2008-09-16 04:26 and plug it in 2008-09-16 04:26 beside your good old root 2008-09-16 04:30 hmm 2008-09-16 04:35 last attempt 2008-09-16 04:35 I should probably make -j2 2008-09-16 04:35 oh well 2008-09-16 04:36 don't make many modules 2008-09-16 04:36 j4, for dual intel 2008-09-16 04:36 really? 2008-09-16 04:36 just make 2008-09-16 04:36 seems to keep 1 cpu pegged 2008-09-16 04:36 two x smt 2008-09-16 04:36 lame smt 2008-09-16 04:36 oh, this is one core duo 2008-09-16 04:36 not double dual 2008-09-16 04:37 right, they kinda dropped smt 2008-09-16 04:37 ddin't they 2008-09-16 04:37 mixed bag 2008-09-16 04:37 now if we actually had a decent migrating os, It could migrate to my desktop 2008-09-16 04:37 seemedlike a good idea 2008-09-16 04:37 not really 2008-09-16 04:37 smt is still coming back 2008-09-16 04:37 dec made it work, intel not so much 2008-09-16 04:37 ht = smt lite 2008-09-16 04:37 migration is another pet peeve of mine - it shouldn't be that frickin hard 2008-09-16 04:38 of course - a decent fs is the first step ;-) 2008-09-16 04:38 it's hard because we like it hard 2008-09-16 04:38 I like reaching not only for the moon, but for the sun and alpha centauri at the same time ;-) 2008-09-16 04:38 that's we we keep getting in our own way 2008-09-16 04:39 they already had it mostly working in 2.4 2008-09-16 04:39 and then they had a beta for 2.6 2008-09-16 
04:39 but I think part of the problem is the overblown syscall interface 2008-09-16 04:39 and its small compared to windows 2008-09-16 04:39 linux should be stripped down, and the linux syscall interface should be a lodable module 2008-09-16 04:40 given multiple enemas 2008-09-16 04:40 from both ends 2008-09-16 04:40 agreed 2008-09-16 04:40 it won't happen in our lifetimes 2008-09-16 04:40 that way you could replace it with whatever 2008-09-16 04:40 heh 2008-09-16 04:40 email linus 2008-09-16 04:40 and experiment with a set of syscalls which would be inherently migratable 2008-09-16 04:40 tell him it's time to make the syscall table loadable 2008-09-16 04:40 and have it per - process selectable 2008-09-16 04:41 is he going to make a laughing stock of me? 2008-09-16 04:41 he's going to say something memorable anyway 2008-09-16 04:41 might be nice to you if he's having a good day 2008-09-16 04:41 and I don't email him first ;) 2008-09-16 04:43 well, balloc_extent_from_range looks ready to try 2008-09-16 04:43 not pretty 2008-09-16 04:43 far from it 2008-09-16 04:43 qemu-kvm -M pc -cpu qemu64 -m 256 -smp 2 -net none -kernel linux-2.6.26.5/arch/x86_64/boot/bzImage -drive file=root_fs,boot=on -append 'ro root=/dev/hda' 2008-09-16 04:43 works 2008-09-16 04:43 :) 2008-09-16 04:43 amazing 2008-09-16 04:43 make mrproper && make clean && make defconfig && make bzImage 2008-09-16 04:43 what about the ARCH=um? 2008-09-16 04:43 that's not uml I guess 2008-09-16 04:44 normal 64-bit kernel 2008-09-16 04:44 yah, and a real boot 2008-09-16 04:44 me and uml send our regrets 2008-09-16 04:44 but... 2008-09-16 04:44 didn't see you boot 2008-09-16 04:44 huh? 2008-09-16 04:44 ah... 2008-09-16 04:45 oh 2008-09-16 04:45 trying to root me remotely? 
2008-09-16 04:45 qemu 2008-09-16 04:45 point taken 2008-09-16 04:45 ;-) 2008-09-16 04:45 I'll leave that to shapor 2008-09-16 04:45 I thought you had some sort of ping in the image 2008-09-16 04:45 ok, you are qemu and I am uml 2008-09-16 04:45 yeah, looks like I'll stick to qemu 2008-09-16 04:45 seems to work fine 2008-09-16 04:45 somebody needs to findout why 32 bit root_fs doesn't work on 64 bit kernel 2008-09-16 04:46 I'm sure jdike will be fascinated 2008-09-16 04:46 there's some mongo dir and stuff in their 2008-09-16 04:46 I'm guessing nobody's even tried 2008-09-16 04:46 balloc_extent_from_range is ready to try 2008-09-16 04:46 not tonight 2008-09-16 04:46 or rather 2008-09-16 04:46 not now 2008-09-16 04:46 I've got a meeting at 9:30 2008-09-16 04:47 argh 2008-09-16 04:47 what rootfs did you use with qemu? 2008-09-16 04:47 yours 2008-09-16 04:47 don't go hom 2008-09-16 04:47 64-bit kernel, 32-bit rootfs - works fine 2008-09-16 04:47 sleep on the massage chair 2008-09-16 04:47 good, like it's supposed to 2008-09-16 04:47 sounds like a jdike issue 2008-09-16 04:47 yeah, I think I'll do that 2008-09-16 04:48 set it on repeat 2008-09-16 04:49 so there probably wasn't a problem with 2.6.27-rc6 after all 2008-09-16 04:49 nice to have a stable point to assign blame from 2008-09-16 04:50 7875 flips 2008-09-16 04:50 1715 MaZe 2008-09-16 04:50 1289 shapor 2008-09-16 04:50 882 bh 2008-09-16 04:50 668 konrad 2008-09-16 04:50 380 tim_dimm 2008-09-16 04:50 128 RazvanM 2008-09-16 04:50 113 vandenoever 2008-09-16 04:50 96 flipz 2008-09-16 04:50 lol 2008-09-16 04:50 irc? 2008-09-16 04:50 yeah 2008-09-16 04:50 as if I don't get enough typing 2008-09-16 04:50 maze is rising 2008-09-16 04:50 am I? 2008-09-16 04:51 think so 2008-09-16 04:51 I thought you were going up much much faster 2008-09-16 04:51 race for 2nd 2008-09-16 04:51 what do you mean race for 2nd? 2008-09-16 04:51 flipz is a real lame 2008-09-16 04:51 don't you tink? 
2008-09-16 04:51 heh 2008-09-16 04:52 better get summa that sleep 2008-09-16 04:52 yeah, that's probably because he does the coding, while you sit around on irc 2008-09-16 04:52 I could be accused of contributing to the delinquency of a googler 2008-09-16 04:52 you could 2008-09-16 04:53 return found - contig + 1; <- tomorrow we see if it works 2008-09-16 04:53 new extent balloc 2008-09-16 04:53 anyway - I'm gone 2008-09-16 04:53 bye 2008-09-16 04:53 me 2 2008-09-16 07:11 -!- kushal(~kushal@121.246.32.210) has joined #tux3 2008-09-16 08:01 -!- Kirantpatil(~kiran@122.167.215.81) has joined #tux3 2008-09-16 08:01 hello list 2008-09-16 08:02 how many hours to go for the Part-3 of tux3 university ? 2008-09-16 09:53 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-16 10:25 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-16 10:45 -!- kushal(~kushal@121.246.33.21) has joined #tux3 2008-09-16 10:51 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-16 10:55 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-16 10:58 howdy 2008-09-16 10:58 when's the next tux3 university scheduled? 2008-09-16 11:37 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-16 11:50 -!- Kirantpatil(~kiran@122.167.176.249) has joined #tux3 2008-09-16 11:52 -!- Kirantpatil(~kiran@122.167.176.249) has left #tux3 2008-09-16 12:00 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2008-09-16 12:00 ACTION is back 2008-09-16 12:00 hrm thats weird i got banned from oftc.net 2008-09-16 12:00 maybe my bot went crazy 2008-09-16 12:03 ah no i guess it was everyone 2008-09-16 12:03 heh 2008-09-16 12:52 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-16 13:17 folks 2008-09-16 13:58 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2008-09-16 14:06 morning 2008-09-16 14:07 it's an extent morning for #tux3 2008-09-16 14:38 hey 2008-09-16 14:38 hi bh 2008-09-16 14:39 how's it going ? 
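The `return found - contig + 1;` line quoted above suggests a scan that counts consecutive free bits and, once the run is long enough, returns the first block of the run. A minimal sketch of that idea follows. It is not the tux3 implementation: it assumes a plain in-memory byte array instead of bitmap blocks reached through an inode, assumes little-endian bit order within each byte, and uses the hypothetical names `map_test`, `map_set` and `extent_from_range`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t block_t;

/* Test and set single bits in a byte-array bitmap (1 = allocated). */
static int map_test(const unsigned char *map, block_t n)
{
	return (map[n >> 3] >> (n & 7)) & 1;
}

static void map_set(unsigned char *map, block_t n)
{
	map[n >> 3] |= (unsigned char)(1u << (n & 7));
}

/*
 * Find `length` contiguous free blocks in [start, start + count),
 * mark them allocated and return the first block, or -1 if no run
 * of that length exists in the range.
 */
static block_t extent_from_range(unsigned char *map, block_t start,
				 block_t count, unsigned length)
{
	unsigned contig = 0;      /* length of the current free run */

	assert(map != NULL);
	if (!length)
		return -1;
	for (block_t found = start; found < start + count; found++) {
		if (map_test(map, found)) {
			contig = 0;   /* run broken by an allocated block */
			continue;
		}
		if (++contig == length) {
			block_t first = found - contig + 1;
			for (block_t i = first; i <= found; i++)
				map_set(map, i);
			return first;
		}
	}
	return -1;
}
```

For example, with a bitmap whose first byte is `0x13` (blocks 0, 1 and 4 in use), asking for 2 blocks yields block 2, and a following request for 4 blocks yields block 5. A bit-at-a-time scan like this is slow on large bitmaps; it only shows the bookkeeping.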
2008-09-16 14:41 coding bitops 2008-09-16 14:41 fun 2008-09-16 14:41 extents are fun 2008-09-16 14:41 been researching tree locking? 2008-09-16 14:41 there's a lot written on the subject 2008-09-16 14:55 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-16 14:57 ok, looks like we have extent enabled balloc now 2008-09-16 14:57 but 2008-09-16 14:57 lots still to do 2008-09-16 14:57 on extents 2008-09-16 14:58 messy messy 2008-09-16 16:02 tux3 noses over 7K lines 2008-09-16 16:02 decimal K 2008-09-16 16:05 Sun is using mercurial for its new project site 2008-09-16 16:06 I think that means mercurial wins 2008-09-16 16:06 big props to matt 2008-09-16 16:06 http://projectkenai.com/projects/xvmserver/sources/earlyaccess/show 2008-09-16 16:06 clueful of Sun 2008-09-16 16:06 I'm shocked ;) 2008-09-16 16:18 projectkenai is new and doesn't have git yet, but they plan to 2008-09-16 16:18 if it's the same site I'm remembering 2008-09-16 16:19 yep, it is 2008-09-16 16:58 flips: nice 2008-09-16 17:02 -!- kbingham(~kbingham@92.9.62.202) has joined #tux3 2008-09-16 17:17 I'm thinking back on a design decision I made pretty early for the prototype - to depart from the usual kernel get_block model and have tux3 actually initiate the IO at that point, unlike get_block where the fs just tells the VFS where a particular logical block is supposed to be read/written physically 2008-09-16 17:17 I am increasingly getting the feeling that that decision was right 2008-09-16 17:17 especially as I get working on extents 2008-09-16 17:18 and took a look at how the btrfs guys do extents 2008-09-16 17:18 that is scary 2008-09-16 17:18 looks like they want to go make big changes to the vfs 2008-09-16 17:18 without really considering the alternatives 2008-09-16 17:18 I might not have looked close enough, but that's what it looks like on first blush 2008-09-16 17:28 when are atomic commits going to work ? 
2008-09-16 17:28 after extents 2008-09-16 17:29 or sooner if you want to code it 2008-09-16 17:29 fun 2008-09-16 17:31 flips: what was that disk failure article you were mentioning last night? 2008-09-16 17:32 got a link? 2008-09-16 17:32 just a sec 2008-09-16 17:32 http://alumnit.ca/~apenwarr/log/?m=200809#08 2008-09-16 17:40 interesting 2008-09-16 17:41 nearly sk8 oclock 2008-09-16 17:42 ACTION is getting tired of checking in extent bits 2008-09-16 17:43 wow i'd never heard of ionice 2008-09-16 17:43 awesome! 2008-09-16 17:43 "Linux supports io scheduling priorities and classes since 2.6.13 with the CFQ io scheduler." 2008-09-16 17:43 !! 2008-09-16 17:44 well 2008-09-16 17:44 have i been living under a rock? 2008-09-16 17:44 don't get _too_ excited 2008-09-16 17:44 cfq is, um 2008-09-16 17:44 you know 2008-09-16 17:44 there's a reason it's not the default 2008-09-16 17:44 just the fact there is such an interface is reassuring 2008-09-16 17:44 yes 2008-09-16 17:45 pluggable disk elevators 2008-09-16 17:45 if it doesn't work as advertised thats simply a bug to file 2008-09-16 17:45 been in for 4-5 years 2008-09-16 17:45 then someone smarter than me can fix it ;) 2008-09-16 17:45 danger is when somebody less smart than you fixes it 2008-09-16 17:47 ok, we have extent allocate and extent free now 2008-09-16 17:47 not really great versions of, but simple and serviceable for now 2008-09-16 17:47 that was the easy part 2008-09-16 18:06 -!- cdk(~chinmay@121.246.33.227) has joined #tux3 2008-09-16 18:29 -!- RalucaM(~ral@londo.cnds.jhu.edu) has joined #tux3 2008-09-16 18:33 -!- Aks(~ankitsriv@123.239.79.30) has joined #tux3 2008-09-16 18:34 -!- Aks(~ankitsriv@123.239.79.30) has left #tux3 2008-09-16 18:53 -!- kbingham(~kbingham@92.8.19.189) has joined #tux3 2008-09-16 19:03 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-16 19:09 -!- kbingham(~kbingham@92.8.3.46) has joined #tux3 2008-09-16 19:13 -!- stargazr5(~gauravstt@59.95.18.36) has 
joined #tux3 2008-09-16 19:20 -!- Kirantpatil(~kiran@122.167.218.72) has joined #tux3 2008-09-16 19:20 -!- Kirantpatil(~kiran@122.167.218.72) has left #tux3 2008-09-16 19:43 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-16 19:45 15 minutes and counting 2008-09-16 19:54 OT: http://www.noodleson.com/store/images/nongshim/vegetal.jpg 2008-09-16 19:57 -!- RalucaM(~ral@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-16 19:57 seems on topic to me 2008-09-16 19:57 :-) 2008-09-16 19:57 hi ralucam 2008-09-16 19:57 OT but important 2008-09-16 19:57 my new chair is comfy 2008-09-16 19:57 hey tim 2008-09-16 19:58 yes 2008-09-16 19:58 online 2008-09-16 19:58 that's important 2008-09-16 19:58 and this watermellon is delicious 2008-09-16 19:58 hi everybody 2008-09-16 19:58 everybody warming up their browers? 2008-09-16 19:59 ACTION is trying to do at least a modest part of the homework... 2008-09-16 19:59 standard precaution is to restart firefox 2008-09-16 19:59 so it doesn't go oom when I'm trying to talk ;) 2008-09-16 19:59 ugh, oh right, what was the homework? 2008-09-16 20:00 read the superblock? ;-) 2008-09-16 20:00 flips: homework is: know how the root dir is loaded and initialized, and now that differs from how any other inode is opened 2008-09-16 20:00 it was about loading the root directory 2008-09-16 20:00 -!- pranith(7aa040b1@webchat.mibbit.com) has joined #tux3 2008-09-16 20:00 and what did we find? 2008-09-16 20:01 that it gets loaded explicitely 2008-09-16 20:01 because... 2008-09-16 20:01 because dir lookup doesn't work 2008-09-16 20:01 well it's the mount point 2008-09-16 20:02 because there is no dir to look up in 2008-09-16 20:02 root of the tree and all that 2008-09-16 20:02 ACTION is searching for s_root... 
2008-09-16 20:02 so we have to open the root dir "manually", using functionality that normally gets called by something like ext2_lookup 2008-09-16 20:02 not quite that function 2008-09-16 20:02 anyway 2008-09-16 20:03 we're starting somewhere different today 2008-09-16 20:03 because maze wants to go faster ;) 2008-09-16 20:03 so let's go to sys_write 2008-09-16 20:03 I'm guiltless - I tell you... 2008-09-16 20:03 http://lxr.linux.no/linux+v2.6.26.5/fs/dcache.c#L1062 2008-09-16 20:03 ok ok ok ok 2008-09-16 20:03 I think we killed lxr 2008-09-16 20:04 seems 2008-09-16 20:04 next time I'll go there before I announce the destination ;) 2008-09-16 20:04 http://lxr.linux.no/linux+v2.6.26.5/fs/read_write.c#L370 2008-09-16 20:04 http://lxr.linux.no/linux+v2.6.26.5/fs/read_write.c#L370 2008-09-16 20:04 it works from here 2008-09-16 20:04 works here too 2008-09-16 20:04 Razvan's always faster ;-) 2008-09-16 20:04 ok, who wants to walk down into it? 2008-09-16 20:04 instead of me this time? 2008-09-16 20:05 seems to me, razvanm does that pretty well 2008-09-16 20:05 you know the first few layers 2008-09-16 20:05 it's just the same idea as sys_open 2008-09-16 20:05 ACTION is doesn't too much about fs yet :( 2008-09-16 20:05 you know how to poke down into a syscall though 2008-09-16 20:05 file_pos_read and file_pos_write are probably to fetch and store the current file offset 2008-09-16 20:05 just keep clicking until you see something that isn't obvious 2008-09-16 20:06 let's look at those 2008-09-16 20:06 fget_light and fput_light must be fd to struct file lookup with locking 2008-09-16 20:06 so all that's left is vfs_write 2008-09-16 20:06 pretty simple (file_pos_read/write) 2008-09-16 20:06 which was kind of obvious to begin with ;-) 2008-09-16 20:06 I don't know why they're even abstracted 2008-09-16 20:07 fget/put_light are demented 2008-09-16 20:07 two of the most subtle and demented functions in the entire kernel 2008-09-16 20:07 don't worry about them today ;) 
2008-09-16 20:07 they were conceived by a vile an twisted mind, and get to live because they are fast 2008-09-16 20:07 what's demented about them? 2008-09-16 20:07 heh 2008-09-16 20:07 later 2008-09-16 20:07 really 2008-09-16 20:08 google if you must 2008-09-16 20:08 http://lxr.linux.no/linux+v2.6.26.5/fs/read_write.c#L313 2008-09-16 20:08 ok, that's vfs_write 2008-09-16 20:08 suffice to say that they keep our file from disappearing while we are writing to it 2008-09-16 20:08 it would be bad otherwise 2008-09-16 20:08 right - locking 2008-09-16 20:08 razvanm, good, and what do you see there? 2008-09-16 20:09 a bunch of permission checks 2008-09-16 20:09 and then a f_op->write call 2008-09-16 20:09 f_op->write if exists 2008-09-16 20:09 typical, right? 2008-09-16 20:09 provided it's available 2008-09-16 20:09 what you don't see is any locks being taken 2008-09-16 20:09 ot do_sync_write otherwise 2008-09-16 20:09 ot = or 2008-09-16 20:09 there is _very little locking_ in this path 2008-09-16 20:09 helping make it fast 2008-09-16 20:09 and a cute inc_syscw 2008-09-16 20:09 -!- kbingham(~kbingham@92.8.217.48) has joined #tux3 2008-09-16 20:10 the consequence of that is, the filesystem can be hit in a very parallel way 2008-09-16 20:10 what is rw_verify_area? 2008-09-16 20:10 probably locking 2008-09-16 20:10 sometimes in ways that don't make sense, or are from buggy, racy applications, and the filesystem has to do something reasonable 2008-09-16 20:10 i.e., not crash and not corrupt 2008-09-16 20:10 rw_verify_area... 
hmm 2008-09-16 20:10 as in byte-range locks 2008-09-16 20:10 newish thing 2008-09-16 20:11 no sorry 2008-09-16 20:11 it's implementing flock 2008-09-16 20:11 bad name 2008-09-16 20:11 very 2008-09-16 20:11 http://lxr.linux.no/linux+v2.6.26.5/fs/read_write.c#L196 2008-09-16 20:11 we don't care about it really 2008-09-16 20:11 I'd guess it checks no-one else has locked the area we're about to write to 2008-09-16 20:11 normally nobody uses flock 2008-09-16 20:12 crufty old baggage 2008-09-16 20:12 more interesting that selinux has a hook there 2008-09-16 20:12 the "security_*" <- typical selinux hook 2008-09-16 20:12 flips: inc_syscw.. tsk->syscw++ 2008-09-16 20:12 but this is not really interesting, let's pop back out and go deeper 2008-09-16 20:12 that's a generic security hook though right? 2008-09-16 20:12 yes 2008-09-16 20:13 I forget what we call the generic harness 2008-09-16 20:13 http://lxr.linux.no/linux+v2.6.26.5/fs/read_write.c#L313 <- back here 2008-09-16 20:13 next we see that meme again 2008-09-16 20:14 our fs can either completely replace the write logic with its own, or the vfs will supply a basic framework and call lower level methods in the fs 2008-09-16 20:14 327 if (file->f_op->write) 2008-09-16 20:14 328 ret = file->f_op->write(file, buf, count, pos); 2008-09-16 20:14 very few fs's will use this hook 2008-09-16 20:15 i thought we were supposed to use the vfs framework... 
2008-09-16 20:15 http://lxr.linux.no/linux+v2.6.26.5/fs/read_write.c#L288 2008-09-16 20:15 almost all continue on down into do_sync_write 2008-09-16 20:15 which is still the vfs 2008-09-16 20:15 most filesystems don't want to have the responsibility of doing all the things the vfs is about to do now 2008-09-16 20:16 -!- amey(~amey@116.73.35.180) has joined #tux3 2008-09-16 20:16 http://lxr.linux.no/linux+v2.6.26.5/fs/read_write.c#L288 <- do_sync_write 2008-09-16 20:16 so, internally the kernel is kind of aio oriented 2008-09-16 20:16 asynchronous IO 2008-09-16 20:17 and synchronous IO is just a shell around it of the form "start and IO op; wait on a wait queue until its done" 2008-09-16 20:17 we see that here 2008-09-16 20:17 very simple... if you don't poke into the details 2008-09-16 20:17 we will, but later 2008-09-16 20:17 http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/file.c#L50 2008-09-16 20:17 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2487 2008-09-16 20:17 so now... we lose the trail 2008-09-16 20:18 because the vfs calls the real write action through a variable 2008-09-16 20:18 any suggestions how we can pick up that trail again? 2008-09-16 20:18 aio_write :P 2008-09-16 20:18 filp->f_op->aio_write 2008-09-16 20:18 right 2008-09-16 20:18 we can grep the entire kernel for it 2008-09-16 20:19 or we can go back to ext2/inode.c 2008-09-16 20:19 where I know it is ;) 2008-09-16 20:19 let's do that 2008-09-16 20:19 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2364 2008-09-16 20:19 you're getting ahead ;) 2008-09-16 20:19 let's see how we get there 2008-09-16 20:20 and I was wrong about the file 2008-09-16 20:20 http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/file.c#L50 2008-09-16 20:20 interesting 2008-09-16 20:21 ? 
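The dispatch walked through above (vfs_write calls f_op->write if the filesystem supplies one, otherwise falls into do_sync_write, which starts an asynchronous op and waits for it) can be sketched in plain userspace C. This is a toy model, not kernel code: every toy_* name, the TOY_QUEUED value, and the 64-byte "disk" are assumptions made up for illustration, and the wait queue is replaced by a loop that simply drives the completion itself.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define TOY_QUEUED (-1000) /* made-up code: "accepted, completion comes later" */

struct toy_file;

/* One in-flight request, loosely in the role of the kernel's struct kiocb. */
struct toy_iocb {
    struct toy_file *file;
    int done;        /* set by the completion ("endio") path */
    ssize_t result;  /* bytes written, valid once done is set */
};

struct toy_file_ops {
    /* Optional: a filesystem may replace the whole write path. */
    ssize_t (*write)(struct toy_file *, const char *, size_t, off_t *);
    /* Otherwise the generic wrapper drives this asynchronous method. */
    ssize_t (*aio_write)(struct toy_iocb *, const char *, size_t, off_t);
};

struct toy_file {
    const struct toy_file_ops *f_op;
    char data[64];             /* pretend disk */
    struct toy_iocb *pending;  /* request awaiting completion, if any */
    size_t pending_len;
};

/* "Generic" async write: start the I/O and return without finishing it. */
static ssize_t toy_generic_aio_write(struct toy_iocb *iocb, const char *buf,
                                     size_t len, off_t pos)
{
    struct toy_file *f = iocb->file;

    if (pos < 0 || (size_t)pos > sizeof(f->data) ||
        len > sizeof(f->data) - (size_t)pos)
        return -1;
    memcpy(f->data + pos, buf, len);
    f->pending = iocb;
    f->pending_len = len;
    return TOY_QUEUED;
}

/* Completion path: in a real kernel this runs later, from I/O completion. */
static void toy_complete_io(struct toy_file *f)
{
    if (f->pending) {
        f->pending->result = (ssize_t)f->pending_len;
        f->pending->done = 1;
        f->pending = NULL;
    }
}

/* Stand-in for sleeping on a wait queue: here we just drive completion. */
static ssize_t toy_wait_on_iocb(struct toy_iocb *iocb)
{
    while (!iocb->done)
        toy_complete_io(iocb->file);
    return iocb->result;
}

/* Synchronous shell around the async method: start the op, then wait. */
static ssize_t toy_do_sync_write(struct toy_file *f, const char *buf,
                                 size_t len, off_t *pos)
{
    struct toy_iocb iocb = { .file = f, .done = 0, .result = 0 };
    ssize_t ret = f->f_op->aio_write(&iocb, buf, len, *pos);

    if (ret == TOY_QUEUED)
        ret = toy_wait_on_iocb(&iocb);
    if (ret > 0)
        *pos += ret;
    return ret;
}

/* Top-level dispatch: use the fs's own write if present, else the wrapper. */
ssize_t toy_vfs_write(struct toy_file *f, const char *buf, size_t len,
                      off_t *pos)
{
    if (f->f_op->write)
        return f->f_op->write(f, buf, len, pos);
    return toy_do_sync_write(f, buf, len, pos);
}

/* A filesystem that, like ext2 here, only plugs in the generic method. */
const struct toy_file_ops toy_generic_ops = {
    .aio_write = toy_generic_aio_write,
};
```

A file whose ops table leaves .write unset takes the do_sync_write-style path, which is the route the session follows for ext2.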
2008-09-16 20:21 now we see that ext2 just fills that in with a generic function 2008-09-16 20:21 that maze already found 2008-09-16 20:21 so lets clikc on it and go to filemap 2008-09-16 20:21 even this a fs is not interesting in implementing it :D 2008-09-16 20:21 that's right 2008-09-16 20:21 ext2 mostly lets the vfs do everything for it 2008-09-16 20:21 and its still 7,500 lines long 2008-09-16 20:21 worth considering what's in those 7,500 lines 2008-09-16 20:22 keep in mind that the VFS was essentially created just by taking a functioning filesystem and chopping it in half 2008-09-16 20:22 the top half, which became the vfs 2008-09-16 20:23 and the bottom half, which is a bunch of specific methods for doing things like figuring out the position of a block on disk 2008-09-16 20:23 and the bottom half which became the fs drivers 2008-09-16 20:23 ext2 should still have something to say about the write... 2008-09-16 20:23 which because ext2 and all its friends 2008-09-16 20:23 might not 2008-09-16 20:23 ext2 is not journaled 2008-09-16 20:23 might just have a get_disk_block(file, offset) 2008-09-16 20:23 ext2 is happy to let the vfs take over completely here, but of course, the vfs will come back to ext2 at some point 2008-09-16 20:23 why not ext3? 2008-09-16 20:23 and allocate/free_disk_block 2008-09-16 20:23 we will get there in about 5-10 minutes 2008-09-16 20:24 ok 2008-09-16 20:24 for comparison, you could look at ext3/file.c 2008-09-16 20:24 let's do that later 2008-09-16 20:24 http://lxr.linux.no/linux+v2.6.26.5/+code=generic_file_aio_write 2008-09-16 20:24 http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/file.c#L50 2008-09-16 20:24 ext2 is not journaled - so each file is just a read/write collection of blocks on disk 2008-09-16 20:25 even ext3 doesn't normally journal data 2008-09-16 20:25 so all you need is the ability to lookup a given files/offsets block location on disk and you can read/write just fine 2008-09-16 20:25 but it can... 
2008-09-16 20:25 next step: http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2364 2008-09-16 20:25 yes, and so it must supply different methods for its different journalling options 2008-09-16 20:26 http://lxr.linux.no/linux+v2.6.26.5/fs/ext3/file.c#L113 2008-09-16 20:26 not *must*, but that is what it does 2008-09-16 20:26 113 .aio_read = generic_file_aio_read, 2008-09-16 20:26 114 .aio_write = ext3_file_write, 2008-09-16 20:26 so ext3 has it's own write, but uses the generic read 2008-09-16 20:26 thanks razvanm 2008-09-16 20:26 notice that generic_file_aio_write didn't really do much 2008-09-16 20:27 generic read but custom write... interesting 2008-09-16 20:27 jsut took care of some options 2008-09-16 20:27 optional unix semantics 2008-09-16 20:27 razvanm, sure, no journal needed on read 2008-09-16 20:28 finally, __generic_file_aio_write_nolock is doing something 2008-09-16 20:28 not much... but more than the others 2008-09-16 20:28 aaaa... ext3 :D 2008-09-16 20:28 since on read you can just let the generic file/offset block lookup code handle it, but on write - you might need to go through the journal if the right mount optiones (data=ordered I think) were used 2008-09-16 20:28 or data=journaled - never sure 2008-09-16 20:28 here we see readv being implemented 2008-09-16 20:28 um 2008-09-16 20:28 writev 2008-09-16 20:29 where? 2008-09-16 20:29 generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ); 2008-09-16 20:29 nr_segs... writev segs 2008-09-16 20:29 not important 2008-09-16 20:29 easy enough to understand 2008-09-16 20:30 is that verifying we can read the ram the user passed us? 2008-09-16 20:30 probably 2008-09-16 20:30 let's find out 2008-09-16 20:30 1149 /* 2008-09-16 20:30 1150 * If any segment has a negative length, or the cumulative 2008-09-16 20:30 1151 * length ever wraps negative then return -EINVAL. 
2008-09-16 20:30 1152 */ 2008-09-16 20:31 no, just checking for properly formed structs 2008-09-16 20:31 if (access_ok(access_flags, iv->iov_base, iv->iov_len)) 2008-09-16 20:31 I htink it does full access checks 2008-09-16 20:31 security 2008-09-16 20:31 note the return -EFAULT 2008-09-16 20:32 so we will rely on the mmu 2008-09-16 20:32 to fault 2008-09-16 20:32 and sometimes check for faulting contitions by hand 2008-09-16 20:32 http://lxr.linux.no/linux+v2.6.26.5/include/asm-m32r/uaccess.h#L108 <- access_ok just within memory or not 2008-09-16 20:32 no I think it checks by hand, but only returns EFAULT if first part is bad, otherwise it marks how many are good, and ignore the rest 2008-09-16 20:33 vfs_check_frozen implements the filesystem "freeze" feature... which is used for snapshotting 2008-09-16 20:33 kind of misconceived 2008-09-16 20:33 so you'll get a partial write instead of an EFAULT if you have a bad mapping in the middle of a writev 2008-09-16 20:33 sounds reasonable 2008-09-16 20:33 can't realy on mmu since we probably will use dma 2008-09-16 20:34 then we have a bunch of code associated with direct IO 2008-09-16 20:34 which we are going to skip 2008-09-16 20:34 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2319 2008-09-16 20:34 maze, true 2008-09-16 20:34 so we're going to check access somewhere 2008-09-16 20:34 but not here 2008-09-16 20:35 notice, no real work got done 2008-09-16 20:35 we're still just deepening the call chain and allowing for various options and whatnot 2008-09-16 20:35 at this point, we're seriously not expecting any real work to get done ;-) 2008-09-16 20:35 then we get to generic_file_buffered_write 2008-09-16 20:35 ACTION does! :D 2008-09-16 20:35 think that's going to do work? 2008-09-16 20:36 nope 2008-09-16 20:36 you'd be right 2008-09-16 20:36 short break 2008-09-16 20:36 while I fill the wine glass 2008-09-16 20:37 wine? 
i thought u wanted beer 2008-09-16 20:37 ;) 2008-09-16 20:37 nobody sent any 2008-09-16 20:37 aww 2008-09-16 20:37 ok here we go again 2008-09-16 20:38 ACTION thinks a_ops->write_begin must be the key... 2008-09-16 20:38 we have a ->write_begin option 2008-09-16 20:38 which is new for me 2008-09-16 20:38 the two functions are right next to each other 2008-09-16 20:38 and look similar 2008-09-16 20:38 and that 2copy thing, likewise 2008-09-16 20:38 probably something aio related 2008-09-16 20:38 looks like brain damage 2008-09-16 20:39 the 2copy is also using some a_ops 2008-09-16 20:39 notice a_ops 2008-09-16 20:39 is struct address_space_operations 2008-09-16 20:40 http://lxr.linux.no/linux+v2.6.26.5/include/linux/fs.h#L444 2008-09-16 20:40 lost the scent for a moment 2008-09-16 20:40 ACTION knows readpage from romfs... 2008-09-16 20:40 sounds mmap-ish 2008-09-16 20:42 ACTION has to go to work :( 2008-09-16 20:42 ACTION says bbyee, do post the logs ... 2008-09-16 20:42 guessing a_ops are operations that can be performed on mmaped fs pages 2008-09-16 20:42 with ability for fs to override it to trigger journaling etc 2008-09-16 20:42 bye bye 2008-09-16 20:42 ok, this code has been "worked on" 2008-09-16 20:42 rearranged hopefully for a good reason 2008-09-16 20:43 readpage is the only 'read' the romfs is doing 2008-09-16 20:43 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2231 2008-09-16 20:43 so it's called not only for mmap stuff 2008-09-16 20:43 generic_perform_write 2008-09-16 20:43 that may be an optimization though 2008-09-16 20:43 this is where the real action happens 2008-09-16 20:43 who knows... 
2008-09-16 20:43 or one form of real action 2008-09-16 20:43 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2231 2008-09-16 20:43 we're going to talk about a_ops 2008-09-16 20:44 this is the key to most filesystem io in linux 2008-09-16 20:45 ok, so here is a typical write meme 2008-09-16 20:45 write_begin, write_end 2008-09-16 20:45 right 2008-09-16 20:45 and in between we copy data from userspace 2008-09-16 20:45 onto a page 2008-09-16 20:45 copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes); 2008-09-16 20:45 so what is in write_begin? probably get a page into the page cache of an inode 2008-09-16 20:46 and write_end will send that page down to the hardware 2008-09-16 20:46 looks like the kernel basically mmaps in the page and then mmaps it out 2008-09-16 20:46 copy_from_user gets the data, and generates EFAULT if necessary 2008-09-16 20:46 either because of illegal access, or page swapped out 2008-09-16 20:47 pagefault_disable(); 2008-09-16 20:47 uhm? 2008-09-16 20:47 things get interesting if the page was swapped out to a swapfile on the same filesystem 2008-09-16 20:47 interesting 2008-09-16 20:47 swapfile on the same filesystem?? 2008-09-16 20:47 right 2008-09-16 20:47 swapfile is not a separate fs? 
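The write_begin / copy / write_end pattern described above can be sketched as a small userspace loop. This is only a model under stated assumptions: the toy_* types, the 16-byte page size, and the fixed array standing in for the page cache are all made up, and memcpy stands in for the copy from userspace. It shows how a write that straddles a page boundary is split into per-page chunks, each bracketed by the two a_ops hooks.

```c
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 16 /* tiny on purpose so page boundaries are easy to hit */
#define TOY_NR_PAGES  8

struct toy_page {
    char data[TOY_PAGE_SIZE];
    int dirty;
};

struct toy_mapping;

/* Hypothetical cut-down analogue of address_space_operations. */
struct toy_a_ops {
    int (*write_begin)(struct toy_mapping *, size_t pos, size_t len,
                       struct toy_page **pagep);
    int (*write_end)(struct toy_mapping *, size_t pos, size_t len,
                     size_t copied, struct toy_page *page);
};

struct toy_mapping {
    const struct toy_a_ops *a_ops;
    struct toy_page pages[TOY_NR_PAGES]; /* stand-in for the page cache */
    size_t size;                         /* file size seen so far */
};

/* Find (here: index into) the cache page that covers pos. */
static int toy_write_begin(struct toy_mapping *m, size_t pos, size_t len,
                           struct toy_page **pagep)
{
    size_t index = pos / TOY_PAGE_SIZE;

    (void)len;
    if (index >= TOY_NR_PAGES)
        return -1;
    *pagep = &m->pages[index];
    return 0;
}

/* Mark the page dirty and extend the file size; writeback happens elsewhere. */
static int toy_write_end(struct toy_mapping *m, size_t pos, size_t len,
                         size_t copied, struct toy_page *page)
{
    (void)len;
    page->dirty = 1;
    if (pos + copied > m->size)
        m->size = pos + copied;
    return (int)copied;
}

const struct toy_a_ops toy_aops = {
    .write_begin = toy_write_begin,
    .write_end = toy_write_end,
};

/* Buffered write loop: one write_begin/copy/write_end round per page touched. */
long toy_perform_write(struct toy_mapping *m, const char *buf, size_t count,
                       size_t pos)
{
    size_t written = 0;

    while (written < count) {
        size_t offset = pos % TOY_PAGE_SIZE;     /* offset inside the page */
        size_t bytes = TOY_PAGE_SIZE - offset;   /* room left in the page */
        struct toy_page *page;

        if (bytes > count - written)
            bytes = count - written;
        if (m->a_ops->write_begin(m, pos, bytes, &page) < 0)
            break;
        memcpy(page->data + offset, buf + written, bytes); /* "copy from user" */
        if (m->a_ops->write_end(m, pos, bytes, bytes, page) < 0)
            break;
        pos += bytes;
        written += bytes;
    }
    if (written == 0 && count > 0)
        return -1;
    return (long)written;
}
```

Note that nothing in this loop touches the disk: the pages are only dirtied in the cache, which matches the point made later in the session that a separate method has to push dirty pages out.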
2008-09-16 20:47 trying to prevent recursive fault 2008-09-16 20:47 sounds like that just turned off page-in 2008-09-16 20:47 I don't have the details at hand just now 2008-09-16 20:48 razvanm, swap can be separate, or it can be on a filesystem 2008-09-16 20:48 there are some nasty possible recursions when its on a filesystem 2008-09-16 20:48 very nasty 2008-09-16 20:48 ACTION doesn't know how to create a swap on a fs :| 2008-09-16 20:48 2 minutes until question time 2008-09-16 20:49 it's going to be another "cliffhanger" ending 2008-09-16 20:49 :-) 2008-09-16 20:49 lol 2008-09-16 20:49 now this function is not very instructive 2008-09-16 20:49 because it doesn't directly use the page cache ops 2008-09-16 20:49 it provides hooks for them 2008-09-16 20:49 are you sure we went into the right function? not the 2copy one? 2008-09-16 20:49 let's see if we can pop out and find a variant that does use the page cache ops 2008-09-16 20:50 I'm sure we didn't 2008-09-16 20:50 somebody has been messing with names 2008-09-16 20:50 I hope it was for a good reason 2008-09-16 20:50 it isn't always 2008-09-16 20:50 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2063 2008-09-16 20:50 and as you can see, the call chain is kind of unreasonably deep 2008-09-16 20:50 this all seems extremely complex 2008-09-16 20:51 for now I can't say unnecessarily... but... 2008-09-16 20:51 what does the 2copy mean? 
2008-09-16 20:51 yes, this looks like what remains of good old generic_write 2008-09-16 20:51 it means brain-dead original 1st copy apparently 2008-09-16 20:51 maze, I am happy to have reached your "complex" threshold 2008-09-16 20:51 it gets more complex 2008-09-16 20:52 in _2copy, we will alloc pages, map them into a page cache, copy data onto them, and submit them to disk 2008-09-16 20:52 we will call the fs's ->write_page method to do the latter 2008-09-16 20:53 and that method will figure out _where_ on disk the page should go 2008-09-16 20:53 I don't know wyat 2copy means 2008-09-16 20:53 why do we have to copy_from_user 2008-09-16 20:53 can't we write directly from userspace data? 2008-09-16 20:53 feels like... wanking... but I will know for sure for thursdays's session 2008-09-16 20:54 maze, because this is _buffered_ write 2008-09-16 20:54 we are placing the data in cache 2008-09-16 20:54 oh, right 2008-09-16 20:54 we can't just place references to pages in cache 2008-09-16 20:54 because the user data is not necessarily properly aligned 2008-09-16 20:54 couldn't we just rip the page out from under the user, and give him a r/o cow page? 2008-09-16 20:54 linus does want to attempt something like that 2008-09-16 20:54 but it's too hard, even for him 2008-09-16 20:55 ACTION doesn't see the write_page.... 
2008-09-16 20:55 me neither 2008-09-16 20:55 there is prepare_write 2008-09-16 20:55 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2192 2008-09-16 20:55 homework is: see the writepage 2008-09-16 20:55 ;-) 2008-09-16 20:55 and commit_write 2008-09-16 20:55 on thursday we will pick up at the writepage 2008-09-16 20:56 I'm not sure why there would need to be a write page 2008-09-16 20:56 yep, it looks like _2copy really is the new incarnation of generic_write 2008-09-16 20:56 it used to just be generic_write 2008-09-16 20:56 but then it started getting more and more "wrapped" 2008-09-16 20:56 until we see this thing 2008-09-16 20:56 unreadable thing you could say 2008-09-16 20:56 :-) 2008-09-16 20:57 maze, the purpose of the ->writepages in there is to get dirty, buffered pages onto disk 2008-09-16 20:57 -!- kbingham(~kbingham@92.20.210.138) has joined #tux3 2008-09-16 20:57 won't commit_write do that? 2008-09-16 20:57 ah, that's what you asked 2008-09-16 20:57 why two 2008-09-16 20:57 no good reason actually 2008-09-16 20:58 there's usually a "prepare_write" and a "commit_write" 2008-09-16 20:58 http://lxr.linux.no/linux+v2.6.26.5/include/linux/fs.h#L458 2008-09-16 20:58 one or the other generally doesn't do much 2008-09-16 20:59 there's a writepage, writepages, prepare_write, commit_write, write_begin, write_end ... 2008-09-16 20:59 pick'n'choose 2008-09-16 20:59 yes 2008-09-16 20:59 big mess 2008-09-16 20:59 linux IO is trying to find its identity 2008-09-16 20:59 lol 2008-09-16 21:00 it was simpler and nicer in the past? 
2008-09-16 21:00 beginning of 2.6 was simpler, yes 2008-09-16 21:00 o_direct is a very good thing, but it added considerable complexity 2008-09-16 21:00 it looks like different file systems use different interfaces 2008-09-16 21:00 likewise aio 2008-09-16 21:01 maze, somewhat true 2008-09-16 21:01 almost everybody uses generic_write 2008-09-16 21:01 and thus we have a lot 2008-09-16 21:01 not much global structural analysis goes on 2008-09-16 21:01 so that the structure can be simplified 2008-09-16 21:01 because that doesn't add new features 2008-09-16 21:01 or fix bugs 2008-09-16 21:02 are the address_space_operations fs internal? 2008-09-16 21:02 introduces them more likely 2008-09-16 21:02 or are they more global mm? 2008-09-16 21:02 but it makes the code messy 2008-09-16 21:02 like many such things in linux, they are usually library methods 2008-09-16 21:02 kernel library 2008-09-16 21:02 which the fs can lightly wrap 2008-09-16 21:02 or use directly 2008-09-16 21:03 the ->writepages thing is a relatively new invention 2008-09-16 21:03 that allows the filesystem to map more than one page at a time for IO 2008-09-16 21:03 lead to nice benchmark improvements 2008-09-16 21:03 and more mess in filemap.c 2008-09-16 21:03 and this is where variable page sizes will get interesting 2008-09-16 21:04 filemap.c is where most of the impact is, yes 2008-09-16 21:04 insightful 2008-09-16 21:04 4 minutes over ;) 2008-09-16 21:04 how did we do for pacing today? 
2008-09-16 21:04 i try 2008-09-16 21:04 nice pace 2008-09-16 21:04 pretty decent I think 2008-09-16 21:05 sorry I asked so many questions 2008-09-16 21:05 ok, we will be back into write on thursday 2008-09-16 21:05 ;-) 2008-09-16 21:05 tim_dimm: ask questions - it's the only way to learn anything 2008-09-16 21:05 ACTION is not happy with the length though ;-) 2008-09-16 21:05 homework is: find the implementations of the ->writepage calls in ext2 2008-09-16 21:05 I was just trying to figure out what / where to read 2008-09-16 21:05 never been inside the kernel like that before 2008-09-16 21:06 it's bizarre, isn't it 2008-09-16 21:06 yeah 2008-09-16 21:06 so here's a question: buffered, aio, o_direct - what are the permutations/combinations, what do they mean, and how do they interact with each other if the same spot is being accessed via different means 2008-09-16 21:06 maze, very good question, and the answer is: with considerable complexity 2008-09-16 21:06 lovely answer 2008-09-16 21:06 it is necessary to maintain cache consistency with all possible combinations 2008-09-16 21:07 that's like my friend at work, who sits next to me and regularly answers either/or questions with a 'yes' spoken in a deadpan voice 2008-09-16 21:07 are there hooks for cache consistency or is it handle another way? 2008-09-16 21:07 that is why that section handling o_direct that we skipped is so... um... interesting 2008-09-16 21:07 tim_dimm, the vfs handles it 2008-09-16 21:07 and there are rules that the fs has to follow 2008-09-16 21:07 O_DIRECT means unbuffered straight to disk, right? 2008-09-16 21:08 basically "do not skate over that cliff" 2008-09-16 21:08 and is pretty meaningless for read... 
2008-09-16 21:08 maze, right 2008-09-16 21:08 o_direct write has to invalidate any buffer data at that point 2008-09-16 21:08 all synchronous io should be easily implementable via aio 2008-09-16 21:08 also flush out dirty buffered data in that range 2008-09-16 21:08 did you guys cover vfs on another tux3 night? 2008-09-16 21:08 maze, it is 2008-09-16 21:09 tim_dimm, partly 2008-09-16 21:09 this is part of the vfs we're doing now 2008-09-16 21:09 so you basically need to support {buffered | direct } asynchronous io 2008-09-16 21:09 would it be worthwhile to have an entire session on it?' 2008-09-16 21:09 we did an easy one first 2008-09-16 21:09 maze, yes 2008-09-16 21:09 in fact we already looked at the functions that support it 2008-09-16 21:10 tim_dimm, that was essentially the first session 2008-09-16 21:10 o_direct write has to invalidate any buffered data at that point - uh? 2008-09-16 21:10 k, I'll revisit in the logs 2008-09-16 21:10 maze, yes 2008-09-16 21:10 buffered data for what? 2008-09-16 21:10 somebody might have been reading/writing the device with buffered ops at the same time 2008-09-16 21:11 this is not uncommon 2008-09-16 21:11 oh, the buffered but not yet written stuff gets dropped? 2008-09-16 21:11 flushed to disk 2008-09-16 21:11 or overwritten with the - so flushed, not invalidated 2008-09-16 21:11 what gets invalidateD? 2008-09-16 21:11 you're right, fully replaced pages get dropped 2008-09-16 21:11 partially replaced pages have to be flushed 2008-09-16 21:12 so it's not so much invalidated, as overwritten and thus dropped/replaced with the new data 2008-09-16 21:12 right 2008-09-16 21:12 haven't spent a lot of time in that code myself 2008-09-16 21:12 but that's correct 2008-09-16 21:12 does O_DIRECT mean anything on read? 
2008-09-16 21:12 yes 2008-09-16 21:13 will not read from buffer afaic 2008-09-16 21:13 Try to minimize cache effects of the I/O to and from this file 2008-09-16 21:13 but I could be wrong 2008-09-16 21:13 according to man open, basically skip buffer cache populating 2008-09-16 21:13 anything not buffered is read directly from disk and not added to the page cache 2008-09-16 21:13 unless already there 2008-09-16 21:13 so o_direct read avoids double buffering 2008-09-16 21:14 O_DIRECT (Since Linux 2.4.10) 2008-09-16 21:14 Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special 2008-09-16 21:14 situations, such as when applications do their own caching. File I/O is done directly to/from user space buffers. The I/O is syn- 2008-09-16 21:14 chronous, that is, at the completion of a read(2) or write(2), data is guaranteed to have been transferred. See NOTES below for further 2008-09-16 21:14 discussion. 2008-09-16 21:14 A semantically similar (but deprecated) interface for block devices is described in raw(8). 2008-09-16 21:14 I'm not sure what it does with already-buffered data 2008-09-16 21:14 if dirty then it _must_ use the dirty version 2008-09-16 21:14 so, how expensive is a write to read only page fault? 2008-09-16 21:14 from man 2 open, sorry for the long lines 2008-09-16 21:14 but I don't know if it does that by flushing it first, then reading it back, or doing buffered read just for that bit 2008-09-16 21:14 yeah, found it 2008-09-16 21:14 doesn't look like there's any requirement to flush 2008-09-16 21:15 seems like O_DIRECT read is meant for access once - not worth caching - data 2008-09-16 21:15 yes 2008-09-16 21:15 still leaves the question about what it does with pages already in cache, or dirty in cache 2008-09-16 21:15 it says minimize 2008-09-16 21:15 shall we leave that as your homework? 
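From the application side, the man page text quoted above boils down to: open with O_DIRECT and hand the kernel aligned buffers and lengths. A minimal userspace sketch follows. The function name write_block_direct and the 4096-byte alignment are assumptions for illustration (the real requirement depends on the device and filesystem), and the fallback to buffered I/O on EINVAL is a convenience added here for filesystems that refuse O_DIRECT, not something the man page prescribes.

```c
#define _GNU_SOURCE /* O_DIRECT is a Linux extension in <fcntl.h> */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0 /* headers without O_DIRECT: degrade to plain buffered I/O */
#endif

#define DIRECT_ALIGN 4096 /* assumed alignment; real devices may differ */

/*
 * Write len bytes from src to path, zero-padded up to a multiple of
 * DIRECT_ALIGN, bypassing the page cache where the filesystem allows it.
 * Returns 0 on success, -1 on failure.
 */
int write_block_direct(const char *path, const void *src, size_t len)
{
    size_t padded = (len + DIRECT_ALIGN - 1) / DIRECT_ALIGN * DIRECT_ALIGN;
    int flags = O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT;
    void *buf;
    ssize_t n;
    int fd;

    /* The buffer address, not just the length, has to be aligned. */
    if (len == 0 || posix_memalign(&buf, DIRECT_ALIGN, padded) != 0)
        return -1;
    memset(buf, 0, padded);
    memcpy(buf, src, len);

    fd = open(path, flags, 0600);
    n = fd >= 0 ? write(fd, buf, padded) : -1;
    if (n < 0 && errno == EINVAL) {
        /* Filesystem rejected direct I/O: retry through the page cache. */
        if (fd >= 0)
            close(fd);
        fd = open(path, flags & ~O_DIRECT, 0600);
        n = fd >= 0 ? write(fd, buf, padded) : -1;
    }
    if (fd >= 0)
        close(fd);
    free(buf);
    return n == (ssize_t)padded ? 0 : -1;
}
```

Reading the file back with an ordinary buffered read afterwards is exactly the mixed-mode access the session is asking about: the kernel has to keep the page cache consistent with what the direct write put on disk.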
2008-09-16 21:16 not ignore cache 2008-09-16 21:16 can't rely on the man page 2008-09-16 21:16 the pages should not be dirty for too long 2008-09-16 21:16 have to read the code 2008-09-16 21:16 :D 2008-09-16 21:17 from NOTES 2008-09-16 21:17 Applications should avoid mixing O_DIRECT and normal I/O to the same 2008-09-16 21:17 file, and especially to overlapping byte regions in the same file. 2008-09-16 21:17 Even when the filesystem correctly handles the coherency issues in this 2008-09-16 21:17 situation, overall I/O throughput is likely to be slower than using 2008-09-16 21:17 either mode alone. Likewise, applications should avoid mixing mmap(2) 2008-09-16 21:17 of files with direct I/O to the same files. 2008-09-16 21:17 one thing you see is that o_direct has to be constantly checking the page cache to be sure nothing is aliased there 2008-09-16 21:17 "The thing that has always disturbed me about O_DIRECT is that 2008-09-16 21:17 the whole interface is just stupid, and was probably designed by 2008-09-16 21:17 a deranged monkey on some serious mind-controlling substances." 
2008-09-16 21:17 — Linus 2008-09-16 21:17 maze, the advice is often ignored 2008-09-16 21:18 linux is not absolved from responsibiltiy for keeping the cache consistent 2008-09-16 21:18 right 2008-09-16 21:18 linus doesn't run a database company 2008-09-16 21:18 lol 2008-09-16 21:18 which is why he thinks that 2008-09-16 21:18 the interface is quite simple 2008-09-16 21:19 open with o_direct, make sure your data is aligned 2008-09-16 21:20 hi all 2008-09-16 21:20 maze, how'd you do with reading your superblock 2008-09-16 21:20 hey 2008-09-16 21:20 shapor, right on time ;) 2008-09-16 21:20 I slept well, thank you ;-) 2008-09-16 21:20 good thing we have logs 2008-09-16 21:20 yeah 2008-09-16 21:20 reading now 2008-09-16 21:20 I'm going to be working on it now 2008-09-16 21:20 maze, that little subproject will be highly instructive 2008-09-16 21:21 agreed 2008-09-16 21:21 it already has been 2008-09-16 21:21 especially if you write your own custom endio 2008-09-16 21:21 and figure out how to have your task (which is "mount") wait on a wait queue for the io to complete 2008-09-16 21:21 exactly 2008-09-16 21:22 well, it's the in-kernel portion of mount 2008-09-16 21:22 it's all not very much code, but each line takes about 15 minutes of study 2008-09-16 21:22 or maybe an hour the first time 2008-09-16 21:22 I expect I need something, sleep on something, wake something from endio 2008-09-16 21:22 precisely 2008-09-16 21:22 apparently something called a waitqueue 2008-09-16 21:23 the waiting bits are covered in a nice tutorial manner on lwn 2008-09-16 21:23 so probably something like a dynamic init of a waitqueue 2008-09-16 21:23 ACTION is off to bed. Tomorrow he needs to be early at school. 2008-09-16 21:23 then submit io 2008-09-16 21:23 bio is... 
an acquired taste 2008-09-16 21:23 then sleep on wq 2008-09-16 21:23 acquired ore 2008-09-16 21:23 acquired lore 2008-09-16 21:23 in endio wake wq 2008-09-16 21:23 more like acquired love 2008-09-16 21:23 exactly 2008-09-16 21:23 probably using the "wake" function 2008-09-16 21:24 that sounds awesome 2008-09-16 21:24 and either wake or wakeall likely 2008-09-16 21:24 here wakeall being more appropriate 2008-09-16 21:24 usually wake 2008-09-16 21:24 no need for a thundering herd 2008-09-16 21:24 of course you know there is only one waiter 2008-09-16 21:25 there better not be more, or something else broke 2008-09-16 21:25 well, but in general, since the op is complete - I should wake all 2008-09-16 21:25 interesting question then is how to dealloc the wq 2008-09-16 21:25 must be some put_wq in the waiters 2008-09-16 21:25 which on last dec to zero does free 2008-09-16 21:25 next move for me is to drop over to whole foods to pick up some munchies 2008-09-16 21:26 I only have a few more days left as a bachelor 2008-09-16 21:26 before the girls get back ;) 2008-09-16 21:26 flips: hah thats where i was instead of class 2008-09-16 21:26 at which time I'm afraid my checking rate will drop somewhat 2008-09-16 21:26 linux/wait.h 2008-09-16 21:26 didn't think it'd be so early 2008-09-16 21:26 checkin 2008-09-16 21:26 shapor, 8 pm tue and thur 2008-09-16 21:28 hmm 2008-09-16 21:28 looks like it's too late for whole food 2008-09-16 21:28 unless I really run 2008-09-16 21:28 don't feel like really running 2008-09-16 21:28 maybe it's 3rd street for dinner tonight 2008-09-16 21:28 so i need to make a dynamic wq, init with init_waitqueue_head() 2008-09-16 21:28 yes, and there are various convenience wrappers 2008-09-16 21:29 best is to write it on the metal the first time 2008-09-16 21:30 #define wake_up_all(x) __wake_up(x, TASK_NORMAL, 0, NULL) 2008-09-16 21:30 well if I don't go shopping there will be no coffee for breakfast 2008-09-16 21:30 seems to be the way to wake 2008-09-16 
21:30 so I'm gone... 2008-09-16 21:30 folks 2008-09-16 21:31 hi bh 2008-09-16 21:32 -!- RalucaM(~ral@pool-151-196-118-156.balt.east.verizon.net) has left #tux3 2008-09-16 21:32 -!- cdk(~chinmay@121.246.33.227) has joined #tux3 2008-09-16 21:34 interesting 2008-09-16 21:34 how did I become a contributor on zumastor? 2008-09-16 21:35 aposter 2008-09-16 21:35 ah 2008-09-16 21:35 u do that? 2008-09-16 21:37 flips:are the latest tuxfs binaries working fine for everyone? 2008-09-16 21:38 i am getting segfaults for each file that i copy 2008-09-16 21:38 sync rootdir 2008-09-16 21:38 filemap_blockio: write 2008-09-16 21:38 devmap_blockio: read [8] 2008-09-16 21:38 devmap_blockio: read [9] 2008-09-16 21:38 balloc -> [10] 2008-09-16 21:38 new group at 0 2008-09-16 21:38 insert 0x0 at 0 in group 0 2008-09-16 21:38 limit = 0, free = 4088 2008-09-16 21:38 save_inode: save inode 0xd 2008-09-16 21:39 lookup inode 0xd, 0 + d 2008-09-16 21:39 resize inum 0xd at 0x58 from 18 to 28 2008-09-16 21:39 sync atom table 2008-09-16 21:39 Segmentation fault 2008-09-16 21:41 thats the inode.c right? 2008-09-16 21:41 or is that tux3 fuse? 2008-09-16 21:42 tux3 fuse running in the foreground 2008-09-16 21:42 let me try to reproduce 2008-09-16 21:43 did you try running under gdb? 2008-09-16 21:43 no .. that i did not 2008-09-16 21:43 probably have to attach to it after you start it i haven't tried yet 2008-09-16 21:45 cdk: you're running tux3fs right? 2008-09-16 21:45 not tux3fuse 2008-09-16 21:45 yeah tux3fs 2008-09-16 21:46 ah yes, happens for me too 2008-09-16 21:46 i am sure it worked before... 2008-09-16 21:47 i mean two days ago 2008-09-16 21:48 yeah appears to be on write 2008-09-16 21:48 new bug 2008-09-16 21:49 not sure why its sync'ing atom table at all 2008-09-16 21:49 hrm 2008-09-16 21:50 btw...ls is yet to work for tux3fuse isnt it? 
2008-09-16 21:50 yeah i think tux3fuse is very broken 2008-09-16 21:51 but perhaps a better approach 2008-09-16 21:51 using the "low level" api 2008-09-16 21:51 yes. 2008-09-16 21:54 anyways...i need to go...will keep track of the changes. 2008-09-16 21:54 hopefully we resolve this soon. 2008-09-16 21:54 cdk: thanks for the bug report... i'm on it 2008-09-16 21:54 i think it see whats wrong 2008-09-16 22:00 -!- amey(~amey@116.73.35.180) has left #tux3 2008-09-16 22:00 yes 2008-09-16 22:01 kernel locked ;-) 2008-09-16 22:09 heh 2008-09-16 22:09 fun 2008-09-16 22:09 whatd you do 2008-09-16 22:10 cdk: fixed :) 2008-09-16 22:11 oh hes gone 2008-09-16 22:11 cdk, no doubt it's my fault 2008-09-16 22:11 I didn't try it 2008-09-16 22:12 probably have to attach to it after you start it i haven't tried yet <- or just change the makefile 2008-09-16 22:13 segfault in atom stuff... no surprise 2008-09-16 22:13 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-09-16 22:23 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-16 22:24 back from kernel lala land 2008-09-16 22:24 hmm 2008-09-16 22:24 hello all 2008-09-16 22:25 maze, log of tux univ posted? 2008-09-16 22:25 yes 2008-09-16 22:26 http://shapor.com/tux3/irclogs/current.txt 2008-09-16 22:26 okay, looks like I definitely need to put a little effort into making my system more debuggable 2008-09-16 22:27 that was a totally harmless piece of code - one'd think 2008-09-16 22:29 uh 2008-09-16 22:29 ugh 2008-09-16 22:29 stupid thing 2008-09-16 22:29 this one takes an object, that one takes a pointer to an object... 2008-09-16 22:29 and they're all macros, so who'd guess 2008-09-16 22:30 okay, so that actually works 2008-09-16 22:30 waits the appropriate number of jiffies 2008-09-16 22:31 i'm going to split up those logs soon 2008-09-16 22:31 gotta make a cron job 2008-09-16 22:31 that way we can link to TuxU sessions 2008-09-16 22:31 by day? by week? by month? 
2008-09-16 22:32 not sure yet 2008-09-16 22:32 i have this script http://zumastor.org/irclogs/ 2008-09-16 22:32 leave current the way it is 2008-09-16 22:32 yeah i like the one big long 2008-09-16 22:32 log 2008-09-16 22:32 easy to grep ;) 2008-09-16 22:32 would be nice to hit record and stop for tux3 U 2008-09-16 22:32 lol 2008-09-16 22:32 I love the most recent conversation 2008-09-16 22:33 #zumastor doesn't get a lot of traffic these days 2008-09-16 22:33 wonder why 2008-09-16 22:33 what's up with zumastor? 2008-09-16 22:33 not much these days 2008-09-16 22:34 no one really works on it anymore 2008-09-16 22:34 its waiting for tux3 goodness to be backported 2008-09-16 22:34 really? 2008-09-16 22:34 to increase performance 2008-09-16 22:34 yup 2008-09-16 22:35 I guess that's kind of sad 2008-09-16 23:55 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-09-17 00:23 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-17 00:31 -!- tim_vimm(~Tim@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-17 00:40 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-09-17 00:44 ok, I have a basic idea how I'm going to make extent creation happen in inode.c 2008-09-17 00:44 may slightly depart from tradition and not write a design note first 2008-09-17 00:45 -!- tim_vimm(~Tim@cpe-76-90-98-247.socal.res.rr.com) has left #tux3 2008-09-17 00:45 so its a surprise? 
2008-09-17 01:07 http://www.phoronix.com/scan.php?page=news_item&px=NjcyNQ 2008-09-17 01:08 "An Update On The Tux3 File-System" 2008-09-17 01:08 very nice little news piece 2008-09-17 01:09 oh, and "Tux3 Report" is somehow got to #1 on the lkml.org hot list 2008-09-17 01:09 the next three posts are linus, then alan cox, then "time travel", then the original Tux3 announcement 2008-09-17 01:19 http://lwn.net/Articles/296568/ 2008-09-17 01:19 some comments there 2008-09-17 01:19 regarding the namespace 2008-09-17 01:20 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-17 01:20 geeks firing on random mode 2008-09-17 01:20 doesn't mean a thing until kernel untar works 2008-09-17 01:21 don't see anything on the namespace 2008-09-17 01:21 oh 2008-09-17 01:21 right 2008-09-17 01:21 "taking back teh tux" 2008-09-17 01:22 http://www.hpcwire.com/features/Cray_Unveils_Personal_Supercomputer.html 2008-09-17 01:22 noticed that one 2008-09-17 01:22 msft involvement 2008-09-17 01:22 got to be a disaster 2008-09-17 01:22 ack 2008-09-17 01:22 trying to find a higher waste-to-recovery ratio than even xbox I think 2008-09-17 01:23 tim_dimm, shap & I had a nice midnight skate, sorry we forgot to ping you 2008-09-17 01:24 don't know how we overlooked that 2008-09-17 01:24 it won't happen again 2008-09-17 01:24 dang- I would have rolled 2008-09-17 01:24 I know 2008-09-17 01:24 ACTION kicks /me 2008-09-17 01:24 ACTION agrees 2008-09-17 01:25 well I drown my sorrows in a glass of cabernet 2008-09-17 01:25 and see if I can get some progress on extent writing 2008-09-17 01:25 I spent the evening hanging shelves 2008-09-17 01:26 that's fun too 2008-09-17 01:26 loads 2008-09-17 01:26 more fun than watching paint dry 2008-09-17 01:26 -!- ChanServ changed mode/#tux3 -> +o shapor 2008-09-17 01:26 uhoh 2008-09-17 01:27 that set off all kinds of alarms 2008-09-17 01:27 ACTION aims ops priv 2008-09-17 01:27 -!- ChanServ changed mode/#tux3 -> +o flips 2008-09-17 
01:28 -!- flips changed mode/#tux3 -> +o tim_dimm 2008-09-17 01:28 -!- flips changed mode/#tux3 -> +o konrad 2008-09-17 01:28 smooth operator 2008-09-17 01:28 now u r 1 2 2008-09-17 01:29 -!- flips changed mode/#tux3 -> +o tux3bot 2008-09-17 01:29 bot can kick now 2008-09-17 01:29 what you call a kickass bot 2008-09-17 01:29 hah 2008-09-17 01:29 kickbut_bot 2008-09-17 01:29 i dont think the tux has the code to do that 2008-09-17 01:30 lets find out 2008-09-17 01:32 folks 2008-09-17 01:33 <- crashes 2008-09-17 01:45 lol 2008-09-17 01:46 lots of sheriffs around now 2008-09-17 01:46 this must be the safest place on the planet 2008-09-17 01:46 ? 2008-09-17 01:46 you causing trouble again? 2008-09-17 01:46 oh 2008-09-17 01:46 all the ops 2008-09-17 01:47 -!- flips changed mode/#tux3 -> +o MaZe 2008-09-17 01:47 just don't make me use the kickbot 2008-09-17 01:48 lol 2008-09-17 01:48 I don't even know how to use *o* powers 2008-09-17 01:48 lots of chiefs not enough indians 2008-09-17 01:48 ACTION just barely refrains from kicking maze to communicate the concept 2008-09-17 01:48 I got waitqueues working 2008-09-17 01:48 flips: how's it going ? 2008-09-17 01:48 -!- flips changed mode/#tux3 -> -o flips 2008-09-17 01:48 there we are 2008-09-17 01:49 hi bh 2008-09-17 01:49 also found a bio_kern_map func 2008-09-17 01:49 pretty good, extents coming into focus 2008-09-17 01:49 yeah, I was reading something about it from your adventures with btrfs 2008-09-17 01:49 sounds... um... map what? 2008-09-17 01:49 erm, bio_map_kern 2008-09-17 01:49 apparently takes kernel data ptr and returns a bio 2008-09-17 01:50 sounds really automagic 2008-09-17 01:50 except no-one seems to use it... 2008-09-17 01:51 http://lxr.linux.no/linux+v2.6.26.5/fs/bio.c#L922 2008-09-17 01:51 add_pc_page... 
still don't know what pc stands for 2008-09-17 01:52 looks like a good exercise in taking something simple and making it look complex 2008-09-17 01:53 yes, well not sure what the extra pc means 2008-09-17 01:53 all that request queue passing looks doubtful 2008-09-17 01:53 what's it for? 2008-09-17 01:53 looks less than clean 2008-09-17 01:53 a lot of the bio code is like that 2008-09-17 01:53 http://lxr.linux.no/linux+v2.6.26.5/block/blk-map.c#L284 2008-09-17 01:53 seems to be the only place its used 2008-09-17 01:54 is this entire thing really such spaghetti? 2008-09-17 01:54 well, the request_queue, is apparently something you can pull out of the bio 2008-09-17 01:54 it's bogus to say "map kern" 2008-09-17 01:54 bio by default references kernel memory 2008-09-17 01:55 it's essentially just a vector of page headers 2008-09-17 01:55 the entire thing is pretty much that spagetti like or worse 2008-09-17 01:55 looks superficially plausible 2008-09-17 01:55 is mostly fluff when you dig 2008-09-17 01:56 the more I read this the more scared I am of running linux 2008-09-17 01:56 anyway, you now have officially arrived at the underbelly of linux 2008-09-17 01:56 few kernel hacks even look at this stuff 2008-09-17 01:56 I'd almost switch to windows... 
except better the devil you know, then devil you don't ;-) 2008-09-17 01:57 hah 2008-09-17 01:57 windows is worse with a very high degree of probability 2008-09-17 01:57 yeah, pretty sure of that 2008-09-17 01:57 bio is fast 2008-09-17 01:57 that's the redeeming thing 2008-09-17 01:57 I'm actually most annoyed about the complete lack of useful documentation 2008-09-17 01:58 yes, especially here 2008-09-17 01:58 nobody documents bios, it's new 2008-09-17 01:58 http://kerneltrap.org/man/linux/9?page=4 2008-09-17 01:58 have to let it age a bit first 2008-09-17 01:59 anyway, just write your root loader 2008-09-17 01:59 super loader 2008-09-17 01:59 then I'll kick sand at it ;) 2008-09-17 01:59 yeah, yeah, I hate writing without understanding 2008-09-17 02:00 bio is just a handle for a biovec which is just a vector of page heads with a short offset and length of data on each one 2008-09-17 02:00 it transfers to a _contiguous_ physical region 2008-09-17 02:00 right 2008-09-17 02:00 to or from 2008-09-17 02:00 the memory side can be completely discontiguous 2008-09-17 02:00 very useful 2008-09-17 02:00 it's physically contiguous? 
2008-09-17 02:00 on disk it is 2008-09-17 02:00 as in on disk 2008-09-17 02:00 right 2008-09-17 02:01 not memory 2008-09-17 02:01 that's the most important aspect of the api 2008-09-17 02:01 it's just a preadv / pwritev 2008-09-17 02:01 there is tons of cruft you can ignore connected with queueing, elevatoring, and mapping bio to dma 2008-09-17 02:01 just ignore it 2008-09-17 02:01 you only care about the length field, sector address, count of bvecs, couple of other things 2008-09-17 02:02 transfer direction 2008-09-17 02:02 list is getting short 2008-09-17 02:02 endio 2008-09-17 02:02 private field 2008-09-17 02:02 fill in the fields, submit your bio, wait fot the computer to catch fire 2008-09-17 02:03 right, except bio_add_page takes pages, and I'm still not to clear on kaddr -> page conversion 2008-09-17 02:03 so I'm trying to parse that 2008-09-17 02:03 forget that 2008-09-17 02:04 just set the bvec fields yourself 2008-09-17 02:04 you only need to "map" a page in kernel if you're going to play with the data on it 2008-09-17 02:04 that's an advantage of using buffers 2008-09-17 02:04 they're always in kernel memory 2008-09-17 02:04 but 2008-09-17 02:04 you get to set up this bio 2008-09-17 02:04 virt_to_page(data) 2008-09-17 02:04 meaning you can allocate the page it's going to read the super into 2008-09-17 02:04 offset_in_page(kaddr); 2008-09-17 02:05 seem to be relevant 2008-09-17 02:05 and make that a kernel page 2008-09-17 02:05 so you don't have to "map" it 2008-09-17 02:05 you can already address it 2008-09-17 02:05 right, so I have a kmalloc 2008-09-17 02:05 which gives me a void * kaddr 2008-09-17 02:05 offset_in_page isn't anything I've used 2008-09-17 02:05 sounds like some more wanabe api 2008-09-17 02:06 so you're saying to literally fill in all the bio fields by hand? 
seems terrible 2008-09-17 02:06 not kmalloc, you want alloc_pages 2008-09-17 02:06 order 0 2008-09-17 02:06 = one page 2008-09-17 02:06 you'll write a helper 2008-09-17 02:06 just like everybody does 2008-09-17 02:06 nope 2008-09-17 02:06 and everybody writes a crappy helper that nobody else wants to use ;) 2008-09-17 02:06 why not kmalloc? I don't need a full page. 2008-09-17 02:07 I should be fine with kmalloc 2008-09-17 02:07 you can't store it in the bvec is why 2008-09-17 02:07 and then passing the converted - it'll work 2008-09-17 02:07 you need a _page_ 2008-09-17 02:07 I see it 2008-09-17 02:07 won't work 2008-09-17 02:07 bvecs point at struct pages 2008-09-17 02:07 don't be shy about taking a full page to read the superblock 2008-09-17 02:08 it's a tiny blip in terms of kernel memory wastage 2008-09-17 02:08 eh, it compiles 2008-09-17 02:08 it'll work - I'm stubborn 2008-09-17 02:08 kay, I'll wait for the code 2008-09-17 02:09 ;-) 2008-09-17 02:12 once you have done this you have figured out a huge part of the kernel io system 2008-09-17 02:13 it's actually simple, just wrapped in layers of crud to make it look complex 2008-09-17 02:19 hmm, well I have something which actually might work 2008-09-17 02:19 now to reread the code and then test it 2008-09-17 02:23 agh, lets test it live 2008-09-17 02:33 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-17 02:33 hmm 2008-09-17 02:33 computer caught fire? 2008-09-17 02:33 not quite 2008-09-17 02:34 it didn't quite go away 2008-09-17 02:34 and I think it might have done something right 2008-09-17 02:34 but a reboot was needed 2008-09-17 02:34 no more testing live - not worth it 2008-09-17 02:34 right 2008-09-17 02:34 qemu or uml 2008-09-17 02:34 you got uml running didn't you? 
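The kaddr -> page conversion discussed above (virt_to_page() plus offset_in_page()) amounts to splitting an address into a page-aligned base and a byte offset within that page, which is the shape of one bvec entry: page, offset, length. A minimal userspace sketch of that arithmetic, assuming a 4096-byte page; SKETCH_PAGE_SIZE, struct sketch_vec and the sketch_* helpers are hypothetical names for illustration, not kernel API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for illustration; the kernel's PAGE_SIZE depends on the architecture. */
#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical userspace analog of one bvec entry: page, offset, length. */
struct sketch_vec {
    uintptr_t page_base;  /* start address of the page holding the data */
    unsigned int offset;  /* byte offset of the data inside that page */
    unsigned int len;     /* number of bytes described */
};

/* Analog of offset_in_page(): the low bits of the address. */
static unsigned int sketch_offset_in_page(uintptr_t addr)
{
    return (unsigned int)(addr & (SKETCH_PAGE_SIZE - 1));
}

/* Analog of the virt_to_page() step: round the address down to its page. */
static uintptr_t sketch_page_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(SKETCH_PAGE_SIZE - 1);
}

/*
 * Describe [addr, addr + len) as a single entry.
 * Returns 0 on success, -1 if the range is empty or would cross a page
 * boundary (one entry can only describe data inside one page).
 */
static int sketch_describe(uintptr_t addr, unsigned int len, struct sketch_vec *out)
{
    unsigned int off = sketch_offset_in_page(addr);

    assert(out != NULL);
    if (len == 0 || len > SKETCH_PAGE_SIZE - off)
        return -1;
    out->page_base = sketch_page_base(addr);
    out->offset = off;
    out->len = len;
    return 0;
}
```

A 512-byte buffer at offset 3584 in a page still fits in one entry; the same buffer at offset 3840 would straddle two pages and is rejected here. A page-aligned allocation sidesteps that question entirely.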
2008-09-17 02:34 ACTION forgets
2008-09-17 02:35 Sep 17 02:25:00 nike kernel: loop: module loaded
2008-09-17 02:35 Sep 17 02:25:15 nike kernel: 0000: FA EB 21 5B 4D 61 5A 65 42 6F 6F 74 5D 00 60 B4
2008-09-17 02:35 Sep 17 02:25:15 nike kernel: 0010: 0E BB 07 00 89 E5 8B 76 10 FF 46 10 8A 04 FE 04
2008-09-17 02:35 Sep 17 02:25:15 nike kernel: 0020: CD 10 61 C3 31 C0 8E D0 BC 00 7C FB E8 DF FF 2A
2008-09-17 02:35 Sep 17 02:25:15 nike kernel: 0030: 8E D8 89 E6 06 8E C0 57 BF 00 06 FC B9 00 01 F3
2008-09-17 02:35 Sep 17 02:25:15 nike kernel: 0040: A5 EA 57 06 00 00 56 BB 07 00 B4 0E CD 10 5E AC
2008-09-17 02:35 Sep 17 02:25:15 nike kernel: 0050: 08 C0 75 F2 F4 EB FD E8 B4 FF 5B 89 C5 BF BE 07
2008-09-17 02:35 Sep 17 02:25:15 nike kernel: 0060: B1 04 E8 A9 FF 31 80 3D 80 75 0B 09 ED BE 22 07
2008-09-17 02:35 Sep 17 02:25:15 nike kernel: 0070: 75 DD 89 FD EB 08 80 3D 00 BE 3F 07 75 D1 83 C7
2008-09-17 02:35 Sep 17 02:25:15 nike kernel: 0080: 10 E2 DF E8 88 FF 5D 09 ED BE 5A 07 74 C1 E8 7D
2008-09-17 02:35 Sep 17 02:25:15 nike kernel: 0090: FF 76 BF 05 00 B4 41 BB AA 55 CD 13 72 33 81 FB
2008-09-17 02:35 Sep 17 02:25:15 nike kernel: 00A0: 55 AA 75 2D F6 C1 01 74 28 E8 62 FF 4C 8B 76 08
2008-09-17 02:35 Sep 17 02:25:15 nike kernel: 00B0: 89 36 1A 07 E8 57 FF 31 BE 12 07 B4 42 CD 13 BE
2008-09-17 02:35 Sep 17 02:25:15 nike kernel: 00C0: 97 07 73 33 31 C0 CD 13 4F 75 E9 BE 71 07 E9 7E
2008-09-17 02:35 Sep 17 02:25:15 nike kernel: 00D0: FF 8A 76 01 8B 4E 02 E8 34 FF 31 BB 00 7C B8 01
2008-09-17 02:35 Sep 17 02:25:15 nike kernel: 00E0: 02 57 CD 13 5F BE 9C 07 73 0D 31 C0 CD 13 4F 75
2008-09-17 02:35 Sep 17 02:25:15 nike kernel: 00F0: E6 BE 76 07 E9 58 FF E8 14 FF 3D 81 3E FE 7D 55
2008-09-17 02:35 qemu
2008-09-17 02:35 still need to craft a better rootfs for it though
2008-09-17 02:36 so, was that a successful read or is that cruft?
2008-09-17 02:36 looks rather crufty
2008-09-17 02:36 0000000: faeb 215b 4d61 5a65 426f 6f74 5d00 60b4 ..![MaZeBoot].`.
2008-09-17 02:36 0000010: 0ebb 0700 89e5 8b76 10ff 4610 8a04 fe04 .......v..F.....
2008-09-17 02:36 0000020: cd10 61c3 31c0 8ed0 bc00 7cfb e8df ff2a ..a.1.....|....*
2008-09-17 02:36 0000030: 8ed8 89e6 068e c057 bf00 06fc b900 01f3 .......W........
2008-09-17 02:36 0000040: a5ea 5706 0000 56bb 0700 b40e cd10 5eac ..W...V.......^.
2008-09-17 02:36 0000050: 08c0 75f2 f4eb fde8 b4ff 5b89 c5bf be07 ..u.......[.....
2008-09-17 02:36 0000060: b104 e8a9 ff31 803d 8075 0b09 edbe 2207 .....1.=.u....".
2008-09-17 02:36 0000070: 75dd 89fd eb08 803d 00be 3f07 75d1 83c7 u......=..?.u...
2008-09-17 02:36 0000080: 10e2 dfe8 88ff 5d09 edbe 5a07 74c1 e87d ......]...Z.t..}
2008-09-17 02:36 0000090: ff76 bf05 00b4 41bb aa55 cd13 7233 81fb .v....A..U..r3..
2008-09-17 02:36 00000a0: 55aa 752d f6c1 0174 28e8 62ff 4c8b 7608 U.u-...t(.b.L.v.
2008-09-17 02:36 00000b0: 8936 1a07 e857 ff31 be12 07b4 42cd 13be .6...W.1....B...
2008-09-17 02:36 00000c0: 9707 7333 31c0 cd13 4f75 e9be 7107 e97e ..s31...Ou..q..~
2008-09-17 02:36 00000d0: ff8a 7601 8b4e 02e8 34ff 31bb 007c b801 ..v..N..4.1..|..
2008-09-17 02:36 00000e0: 0257 cd13 5fbe 9c07 730d 31c0 cd13 4f75 .W.._...s.1...Ou
2008-09-17 02:36 00000f0: e6be 7607 e958 ffe8 14ff 3d81 3efe 7d55 ..v..X....=.>.}U
2008-09-17 02:36 it worked!
2008-09-17 02:36 ooh, Mazeboot
2008-09-17 02:36 so it did perform the read from loop
2008-09-17 02:37 that's a hand crafted lba capable boot sector
2008-09-17 02:37 that I should get around to sending to hpa
2008-09-17 02:37 hpa?
2008-09-17 02:37 hpa@zytor.com
2008-09-17 02:37 what I thought
2008-09-17 02:37 I think is his nick, he's syslinux guy
2008-09-17 02:37 oh yes
2008-09-17 02:38 anyway, so the bio part mostly worked
2008-09-17 02:38 it must be a very special boot sector
2008-09-17 02:38 most likely locking got messed up somewhere
2008-09-17 02:38 or kfree happened too quickly
2008-09-17 02:38 you should post code around now
2008-09-17 02:38 give it a little more fiddling, then post
2008-09-17 02:38 hmm
2008-09-17 02:38 I'll post now
2008-09-17 02:38 :)
2008-09-17 02:38 since it'll take a while to get a rootfs
2008-09-17 02:39 static int junkfs_get_sb(struct file_system_type *fs_type, int flags, const char *dev_name, void *data, struct vfsmount *mnt)
2008-09-17 02:39 {
2008-09-17 02:39     return get_sb_bdev(fs_type, flags, dev_name, data, junkfs_fill_super, mnt);
2008-09-17 02:39 }
2008-09-17 02:39 standard stuff
2008-09-17 02:39 i'll skip all the other stuff
2008-09-17 02:39 the important stuff is all in junkfs_fill_super
2008-09-17 02:39 ah, I thought this was your bio thing
2008-09-17 02:40 well
2008-09-17 02:40 I guess it is
2008-09-17 02:40 struct mz_t {
2008-09-17 02:40     wait_queue_head_t wq;
2008-09-17 02:40     int completed;
2008-09-17 02:40 };
2008-09-17 02:40 nice and simple
2008-09-17 02:40 the struct to stick in the bio to wait on and mark completion
2008-09-17 02:40 static void end_io_read(struct bio *bio, int err)
2008-09-17 02:40 {
2008-09-17 02:40     struct mz_t * mzp;
2008-09-17 02:40     DBG_ENTER0();
2008-09-17 02:40     mzp = (struct mz_t *)bio->bi_private;
2008-09-17 02:40     mzp->completed = 1;
2008-09-17 02:40     bio_put(bio);
2008-09-17 02:40 you can post to tux3 list
2008-09-17 02:40     wake_up(&mzp->wq);
2008-09-17 02:40     DBG_RETURN0();
2008-09-17 02:40 }
2008-09-17 02:40 don't be shy :)
2008-09-17 02:40 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3
2008-09-17 02:40 the end completion
2008-09-17 02:41 I'm shy.
2008-09-17 02:41 yes, good
2008-09-17 02:41 don't be
2008-09-17 02:41 I want to get a non-crashing version first
2008-09-17 02:41 optional, but if that's your fav...
2008-09-17 02:41 anyway the above just grabs the private part, marks it as completed, and wakes up the wq
2008-09-17 02:41 yes
2008-09-17 02:41 the bio_put might be the problem...
2008-09-17 02:41 it's right
2008-09-17 02:41 um
2008-09-17 02:41 might not, not sure
2008-09-17 02:42 should be ok
2008-09-17 02:42 #define SB_SIZE 512
2008-09-17 02:42 static int junkfs_fill_super(struct super_block *sb, void *data, int silent)
2008-09-17 02:42 {
2008-09-17 02:42     struct mz_t *mz;
2008-09-17 02:42     u8 *buf;
2008-09-17 02:42     struct bio *bio;
2008-09-17 02:42     int err;
2008-09-17 02:42     int s, y, x;
2008-09-17 02:42     DBG_ENTER0();
2008-09-17 02:42     mz = kmalloc(sizeof(struct mz_t), GFP_KERNEL);
2008-09-17 02:42     if (IS_ERR(mz)) {
2008-09-17 02:42         err = PTR_ERR(mz);
2008-09-17 02:42         goto out_mz_null;
2008-09-17 02:42     };
2008-09-17 02:42     init_waitqueue_head(&mz->wq);
2008-09-17 02:42     mz->completed = 0;
2008-09-17 02:42     buf = kmalloc(SB_SIZE, GFP_KERNEL);
2008-09-17 02:42     if (IS_ERR(buf)) {
2008-09-17 02:42         err = PTR_ERR(buf);
2008-09-17 02:42         goto out_sb_null;
2008-09-17 02:42     };
2008-09-17 02:42     bio = bio_alloc(GFP_KERNEL, 1);
2008-09-17 02:42     if (IS_ERR(bio)) {
2008-09-17 02:42         err = PTR_ERR(bio);
2008-09-17 02:42         goto out_bio_null;
2008-09-17 02:42     };
2008-09-17 02:42 then we have the last function remaining - still very crufty
2008-09-17 02:42 basically, up till this point it's all kmalloc's
2008-09-17 02:43     bio->bi_bdev = sb->s_bdev;
2008-09-17 02:43     bio->bi_sector = 0; // first sector
2008-09-17 02:43 what we actually want to read - from sector 0 on our bdev
2008-09-17 02:43     if (bio_add_page(bio, virt_to_page(buf), SB_SIZE, offset_in_page(buf)) == SB_SIZE) {
2008-09-17 02:43         bio->bi_end_io = end_io_read;
2008-09-17 02:43         bio->bi_private = mz;
2008-09-17 02:43         submit_bio(READ, bio);
2008-09-17 02:43         s = wait_event_interruptible(mz->wq, mz->completed);
2008-09-17 02:43         DBG_MARK1(int, s);
2008-09-17 02:43         for (y = 0; y < 16; ++y) {
2008-09-17 02:43             printk(KERN_INFO "%04X:", y * 16);
2008-09-17 02:43             for(x = 0; x < 16; ++x) {
2008-09-17 02:43                 printk(" %02X", buf[y * 16 + x]);
2008-09-17 02:43             };
2008-09-17 02:43             printk("\n");
2008-09-17 02:43         };
2008-09-17 02:43     } else {
2008-09-17 02:43         DBG_MARK0();
2008-09-17 02:43     };
2008-09-17 02:43     err = -1;
2008-09-17 02:44 there's the actual read, wait on wq, dump to dmesg
2008-09-17 02:44 ah, you did virt_to_page, that's how it worked
2008-09-17 02:44 forget that
2008-09-17 02:44 obviously the dump happened, and already had the proper content
2008-09-17 02:44 just do page = alloc_pages
2008-09-17 02:44 and use the page head directly, like all the other hacks
2008-09-17 02:44 kmallocing that will get you shouted at, trust me
2008-09-17 02:44 this is cleaner, I'm actually allocating exactly how much I need
2008-09-17 02:44 nope
2008-09-17 02:44 it's not good
2008-09-17 02:45 well
2008-09-17 02:45 unless you have an _actual_ other user of the page
2008-09-17 02:45 false economy
2008-09-17 02:45 so, you're saying grabbing an empty page is cheaper? especially since it'll be returned soon anyway?
2008-09-17 02:45 it's very cheap
2008-09-17 02:45     kfree(bio);
2008-09-17 02:45     bio = NULL;
2008-09-17 02:45 out_bio_null:
2008-09-17 02:45     kfree(sb);
2008-09-17 02:45     sb = NULL;
2008-09-17 02:45 out_sb_null:
2008-09-17 02:45     kfree(mz);
2008-09-17 02:45 and it's the superblock
2008-09-17 02:45     mz = NULL;
2008-09-17 02:45 out_mz_null:
2008-09-17 02:46     DBG_RETURN1(int, err);
2008-09-17 02:46 }
2008-09-17 02:46 and that's it
2008-09-17 02:46 it deserves a page of its own
2008-09-17 02:46 and this apparently reads and dumps correctly, but still has some nasty bug in it
2008-09-17 02:46 anyway, it looks good
2008-09-17 02:46 nothing to be shy about
2008-09-17 02:46 maybe the kfree(bio)?
2008-09-17 02:46 you can post that to the tux3 list
2008-09-17 02:46 yep
2008-09-17 02:46 don't do that
2008-09-17 02:46 you need to give that bio back to the bio system
2008-09-17 02:46 since bio_put already freed it?
2008-09-17 02:46 via bio_put
2008-09-17 02:46 right
2008-09-17 02:47 so I need to bio_put on the error path
2008-09-17 02:47 you might want to put some trace output in bio_put
2008-09-17 02:47 not on the normal return path?
2008-09-17 02:47 bio_endio should put the bio
2008-09-17 02:47 I forget
2008-09-17 02:47 very forgettable detail
2008-09-17 02:47 you need to look at the source
2008-09-17 02:48 well another endio func did bio_put
2008-09-17 02:48 hence I did as well
2008-09-17 02:48 there's a bio error mechanism too
2008-09-17 02:48 so kfree(bio) is the problem
2008-09-17 02:48 you can endio when you have an error
2008-09-17 02:48 and it will put the bio
2008-09-17 02:48 don't need to do it on a separate path
2008-09-17 02:48 right endio(bio,err)
2008-09-17 02:49 the big deal is, your wait queue and wake worked
2008-09-17 02:49 that's fun, hmm?
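The bio_put versus kfree(bio) point above is the usual get/put discipline: the object is released by whoever drops the last reference, so freeing it directly after a put can free it twice. A minimal userspace sketch of that discipline follows. struct sketch_obj and the sketch_* functions are hypothetical stand-ins, not the kernel's bio code, and a plain int counter is assumed here where kernel code would use an atomic one:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted object standing in for a bio. */
struct sketch_obj {
    int refcount;     /* number of outstanding references */
    int *freed_flag;  /* set to 1 on release, so callers can observe it */
};

/* Allocate an object holding one reference (the caller's). */
static struct sketch_obj *sketch_alloc(int *freed_flag)
{
    struct sketch_obj *obj = malloc(sizeof(*obj));

    if (!obj)
        return NULL;
    obj->refcount = 1;
    obj->freed_flag = freed_flag;
    *freed_flag = 0;
    return obj;
}

/* Take an extra reference. */
static void sketch_get(struct sketch_obj *obj)
{
    obj->refcount++;
}

/*
 * Drop one reference. The object is freed only when the last reference
 * goes away; returns 1 in that case, 0 otherwise. After a put that
 * returns 1 the pointer must not be touched or freed again.
 */
static int sketch_put(struct sketch_obj *obj)
{
    assert(obj->refcount > 0);
    if (--obj->refcount > 0)
        return 0;
    *obj->freed_flag = 1;
    free(obj);
    return 1;
}
```

With this shape the completion handler and the error path both just call the put function, and nobody calls free() on the object directly.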
2008-09-17 02:49 powerful 2008-09-17 02:49 oh the wq was easy 2008-09-17 02:50 the only problem with wq, was the sometimes need for & or * 2008-09-17 02:52 I'm assuming writes would be just as simple 2008-09-17 02:54 yes 2008-09-17 02:54 it's symmetric 2008-09-17 02:59 anyway, once you have tracked down your double free I encourage you to post it to the tux3 list 2008-09-17 02:59 it's especially interesting now while it's still minimal 2008-09-17 03:01 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-17 03:01 argh 2008-09-17 03:01 okay so that kfree ain't the only problem in there 2008-09-17 03:01 now, really no more live testing 2008-09-17 03:02 anyway, once you have tracked down your double free I encourage you to post it to the tux3 list 2008-09-17 03:02 it's especially interesting now while it's still minimal 2008-09-17 03:02 will do so 2008-09-17 03:02 there's still a lot of stuff in there I'm not quite clear on 2008-09-17 03:03 so I think I'll devote a little more time into understanding what all these functions _do_... 2008-09-17 03:04 anyway, these bio's seem to be usable for aio pretty nicely 2008-09-17 03:04 inherently aio 2008-09-17 03:04 you have to work at it to make it sync 2008-09-17 03:04 right, that's what I meant 2008-09-17 03:05 once I get this working without crashing, and understand the code better, I'm going to need to create a content-less file system 2008-09-17 03:05 ie. ability to create/chmod/chown/etc files in pure ram without having content (all zero length) 2008-09-17 03:05 that will of course be a lot... 
2008-09-17 03:05 and no backing to disk 2008-09-17 03:06 so basically reimplement ramfs 2008-09-17 03:06 just take the tux3 checkin 2008-09-17 03:06 little point in doing otherwise 2008-09-17 03:06 was just planning on building on this 2008-09-17 03:06 you don't really want to reverse engineer the twisty thoughts of the vfs maintainer ;) 2008-09-17 03:06 sure I do 2008-09-17 03:07 I'll ask you again in two weeks ;) 2008-09-17 03:07 you can't write a good fs without understanding the twistyness of the layer above it 2008-09-17 03:07 ah 2008-09-17 03:07 and the layer beneath it 2008-09-17 03:07 but you can understand the twists without deriving them from first principle 2008-09-17 03:07 learning by trying is painful, but very efficient in the long term 2008-09-17 03:07 just saying, examples are what you want now 2008-09-17 03:08 you remember the errors you make along the way 2008-09-17 03:08 not really looking at the bits and imagining how they fit together 2008-09-17 03:08 well, I feel I need a real deep understanding of the vfs layer to even attempt to try what I would like to do 2008-09-17 03:08 you can do that to a certain extent 2008-09-17 03:09 but there is a high percentage of "arbitrary" inthe bit you're just about to go exploring 2008-09-17 03:09 not really, it's just the next on the list ;-) 2008-09-17 03:09 kill_litter_super ;) 2008-09-17 03:09 throw in some more block layer, and some networking, and a lot of memory management/userspace/mmap 2008-09-17 03:09 work till the end of the year if I'm lucky 2008-09-17 03:10 and have the time 2008-09-17 03:10 yeah, seen kill_little_super, although haven't read it with understanding yet 2008-09-17 03:10 litter 2008-09-17 03:11 you meant litter? 2008-09-17 03:12 you did... 
2008-09-17 03:15 that does seem to be the one triggered by ramfs sb cleanup 2008-09-17 03:15 anyway, enough for tonight 2008-09-17 03:15 need to sleep 2008-09-17 03:17 me too 2008-09-17 03:17 think I found the bug 2008-09-17 03:18 kfree(sb) instead of kfree(buf) 2008-09-17 03:18 basically typo 2008-09-17 03:18 thinko 2008-09-17 03:37 -!- openblast(~quassel@static.230.173.47.78.clients.your-server.de) has joined #tux3 2008-09-17 03:47 -!- konrad(~konrad@c-24-16-74-109.hsd1.wa.comcast.net) has joined #tux3 2008-09-17 04:42 -!- kmeyer(~konrad@c-24-16-74-109.hsd1.wa.comcast.net) has joined #tux3 2008-09-17 08:00 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-17 08:09 -!- kbingham(~kbingham@92.9.151.25) has joined #tux3 2008-09-17 08:14 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-17 08:53 http://www.sciencedaily.com/releases/2008/09/080915105733.htm 2008-09-17 08:53 cool 2008-09-17 09:29 -!- Kirantpatil(~kiran@122.167.207.73) has joined #tux3 2008-09-17 09:42 http://en.gogloom.com/OFTC/tux3/ 2008-09-17 09:42 cool 2008-09-17 09:42 nice ! 
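The kfree(sb) / kfree(buf) mix-up found above is easy to make in a goto unwind ladder, where each label has to release exactly what was acquired just before the jump. A minimal userspace sketch of the pattern, using malloc/free in place of kmalloc/kfree; the sketch_* names, the live-allocation counter and the fail_at failure injection are hypothetical scaffolding for illustration. Failure is checked against NULL here, which is also how kmalloc reports failure:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define SKETCH_SB_SIZE 512

struct sketch_state {
    int completed;
};

/* malloc wrapper that counts live allocations and can fail on request. */
static void *sketch_malloc(size_t size, int step, int fail_at, int *live)
{
    void *p = (step == fail_at) ? NULL : malloc(size);

    if (p)
        (*live)++;
    return p;
}

/* free wrapper that keeps the live-allocation count in step. */
static void sketch_free(void *p, int *live)
{
    assert(*live > 0);
    free(p);
    (*live)--;
}

/*
 * Acquire two resources, do some work, release in reverse order.
 * fail_at: 0 = no injected failure, 1 = first allocation fails,
 * 2 = second allocation fails. Returns 0 or -ENOMEM.
 */
static int sketch_fill(int fail_at, int *live)
{
    struct sketch_state *st;
    unsigned char *buf;
    int err = -ENOMEM;

    st = sketch_malloc(sizeof(*st), 1, fail_at, live);
    if (!st)
        goto out;
    buf = sketch_malloc(SKETCH_SB_SIZE, 2, fail_at, live);
    if (!buf)
        goto out_free_st;

    st->completed = 1;  /* stand-in for the real work */
    buf[0] = 0;
    err = 0;

    sketch_free(buf, live);  /* the buffer, not some unrelated pointer */
out_free_st:
    sketch_free(st, live);
out:
    return err;
}
```

Naming each label after what it frees (out_free_st rather than a name describing the failed step) makes a wrong argument to free stand out when reading the ladder.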
2008-09-17 09:53 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-09-17 10:58 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-17 11:01 -!- Kirantpatil(~kiran@122.167.183.230) has joined #tux3 2008-09-17 11:01 -!- Kirantpatil(~kiran@122.167.183.230) has left #tux3 2008-09-17 12:50 tux3 on linuxtoday: http://www.linuxtoday.com/ 2008-09-17 12:51 http://www.linuxtoday.com/developer/2008091702135NWKN 2008-09-17 12:52 the phoronix mention is nice too: http://phoronix.com/forums 2008-09-17 12:53 http://www.phoronix.com/scan.php?page=news_item&px=NjcyNQ 2008-09-17 13:42 sync bitmap 2008-09-17 13:42 filemap_blockio: write <0:0> 2008-09-17 13:42 filemap_blockio: egad, wrote a clean buffer 2008-09-17 13:42 yay for driveby debug checks 2008-09-17 13:46 ah, it's because the dirty buffer is set clean before being written 2008-09-17 13:46 which is ok 2008-09-17 13:46 but the buffer emulation probably should have a writeback state 2008-09-17 13:46 like kernel 2008-09-17 13:46 or maybe that is wanking 2008-09-17 13:47 just turn off the debug check for now I guess 2008-09-17 13:47 maybe turn it into a check for writing out a non-uptodate buffer 2008-09-17 13:48 if (buffer_empty(buffer)) 2008-09-17 13:48 warn("egad, wrote an invalid buffer"); 2008-09-17 14:31 struct buffer *nextbuf = findblk(buffer->map, ends[down] + (int[2]){ 1, -1 }[down] ); 2008-09-17 14:31 hurt anybody's eyes? 2008-09-17 14:33 struct buffer *nextbuf = findblk(buffer->map, ends[down] + (down ? 
-1 : 1)); <- little less barbaric 2008-09-17 14:33 I don't know 2008-09-17 14:33 (int[2]){1,-1}[x] is pretty cute 2008-09-17 14:33 yup 2008-09-17 14:33 these days it's the slower of the two 2008-09-17 14:33 kinda shocking for us oldtimers 2008-09-17 14:34 1-2*!!down 2008-09-17 14:34 second will compile to a cmov 2008-09-17 14:34 or hmm 2008-09-17 14:34 you'd better hope the compiler optimizes that ;) 2008-09-17 14:34 it does 2008-09-17 14:34 but 2008-09-17 14:34 the condex is optimal 2008-09-17 14:34 for any proc with cmov or equivalent, which is pretty much all these days 2008-09-17 14:35 yeah 2008-09-17 14:35 cmov is a great 'invention' 2008-09-17 14:35 amazing it took so freakin' long 2008-09-17 14:35 I'd like to write (down ? + : -)1 2008-09-17 14:35 why can't I? 2008-09-17 14:35 lol 2008-09-17 14:35 that's wicked 2008-09-17 14:36 (down ? (+) :( -))1 2008-09-17 14:36 make it unambiguous 2008-09-17 14:36 (int)(ror 1,!down) 2008-09-17 14:37 not quite 2008-09-17 14:37 you need to propagate the sign all the way 2008-09-17 14:37 [by this point I'm not sure if we want -1 +1 or +1 -1 2008-09-17 14:37 oh right 2008-09-17 14:38 tricky 2008-09-17 14:38 cmov will win 2008-09-17 14:39 anyway, enough wan^review, got to push this extent maker a little further 2008-09-17 14:39 or down, down; adc ax,ax; leal (ax,ax,-1),ax 2008-09-17 14:39 nah 2008-09-17 14:39 the or sets z not c 2008-09-17 14:40 cmov will clean its clock 2008-09-17 14:40 true 2008-09-17 14:40 even when it's working 2008-09-17 14:40 although 2008-09-17 14:41 (down ? 
ends[down] - 1 : ends[0] + 1) 2008-09-17 14:41 will probably be better 2008-09-17 14:42 it'll probably end up as a compute in parallel and cmov to select 2008-09-17 14:42 this is a pure idiocy though 2008-09-17 14:42 it doesn't matter 2008-09-17 14:43 right, but nice 2008-09-17 14:43 you're a born demo coder 2008-09-17 14:51 if (ends[1] - ends[0]) 2008-09-17 14:51 printf("extent from %x to %x\n", ends[0], ends[1]); 2008-09-17 14:51 works, seems to 2008-09-17 14:51 time to check in 2008-09-17 14:53 ends[up] = next; <- reads kind of cutely 2008-09-17 14:55 for (int up = 0, sign = -1; up < 2; up++, sign = -sign) { 2008-09-17 14:55 the most efficient of all 2008-09-17 14:55 so far 2008-09-17 15:09 there we go, a checkin 2008-09-17 15:09 that gives me the moral right to go for a skate 2008-09-17 15:09 early skate today 2008-09-17 15:09 in honor of the cabal meeting 2008-09-17 15:20 mmm, sushi for breakfast 2008-09-17 15:35 -!- kbingham(~kbingham@92.9.135.11) has joined #tux3 2008-09-17 16:12 http://www.phoronix.com/forums/showthread.php?t=12704 2008-09-17 16:13 "either we will still be using ext5-6-7 in the future but with new ideas that were proven to be valuable by other projects like Tux3 or we might actually see a shift towards a completely new filesystem like tux3" 2008-09-17 16:14 "Lot's of information here though: http://shapor.com/tux3/shapor-tux3/doc/design.html" 2008-09-17 16:14 hehe 2008-09-17 16:15 there's a ringer in the thread 2008-09-17 17:02 folks 2008-09-17 17:10 reading fanboy mail ? 
:) 2008-09-17 21:24 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-17 21:56 -!- tim_dimm_(~mobile@32.174.56.165) has joined #tux3 2008-09-17 21:57 Greetings from the cabal 2008-09-17 21:59 Flips proposed a mellowing, Shapor wants swear words 2008-09-17 22:00 New sys call: 2008-09-17 22:00 un_fuck 2008-09-17 22:02 Cabal suggest sys_unfuck 2008-09-18 02:48 -!- kbingham(~kbingham@92.10.191.55) has joined #tux3 2008-09-18 03:35 folks 2008-09-18 03:35 not much irc traffic today 2008-09-18 03:35 how's it going ? 2008-09-18 03:44 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-09-18 04:03 today was pretty busy 2008-09-18 04:03 off channel action irl 2008-09-18 04:04 a sys_unfuck syscall was proposed, and useful work was also done 2008-09-18 04:05 irl ? 2008-09-18 04:05 in real life 2008-09-18 04:05 ok 2008-09-18 04:05 good, cabal meeting of sorts ? 2008-09-18 04:07 full blown 2008-09-18 04:07 oh really ? unannounced ? 2008-09-18 04:07 true 2008-09-18 04:07 who was there ? 2008-09-18 04:07 flips: are you getting private /msg ? 2008-09-18 04:07 can't say it was a cabal meeting 2008-09-18 04:07 ok 2008-09-18 04:09 regarding extents ? 
2008-09-18 04:10 one thing indeed 2008-09-18 04:10 coding right now 2008-09-18 04:10 tricky 2008-09-18 04:10 yeah 2008-09-18 04:16 ok night 2008-09-18 04:17 surprised you're up this late still 2008-09-18 04:18 me too 2008-09-18 07:16 -!- Kirantpatil(~kiran@122.167.223.69) has joined #tux3 2008-09-18 07:16 -!- Kirantpatil(~kiran@122.167.223.69) has left #tux3 2008-09-18 07:57 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-18 08:36 -!- openblast(~quassel@static.230.173.47.78.clients.your-server.de) has joined #tux3 2008-09-18 08:57 -!- openblast(~quassel@static.230.173.47.78.clients.your-server.de) has joined #tux3 2008-09-18 09:21 -!- kbingham(~kbingham@92.20.194.187) has joined #tux3 2008-09-18 10:15 -!- tim_dimm_(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-18 10:20 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-18 10:24 -!- tim_dimm_(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-18 10:42 -!- kbingham(~kbingham@92.20.194.187) has joined #tux3 2008-09-18 10:47 -!- konrad(~konrad@D-128-208-53-196.dhcp4.washington.edu) has joined #tux3 2008-09-18 11:00 top 2008-09-18 11:57 -!- pgquiles(~pgquiles@50.Red-79-153-248.staticIP.rima-tde.net) has joined #tux3 2008-09-18 13:53 flips: btrfs claims to eventually have online disk checking 2008-09-18 13:54 a coworker just attended a btrfs talk 2008-09-18 16:17 dwalk_next is hard to write 2008-09-18 16:17 given some context already set up, returns the next extent from a dleaf 2008-09-18 16:18 probably will turn into a post to the list 2008-09-18 16:18 big complexity in a small corner 2008-09-18 16:18 as expected, actually 2008-09-18 16:56 hey 2008-09-18 17:02 pong 2008-09-18 17:02 how's it going ? 2008-09-18 19:07 -!- ChanServ changed mode/#tux3 -> +o flips 2008-09-18 19:07 -!- flips changed topic to "Tux3 list membership roars past 100! ~ http://tux3.org ~ Tux3 U, right here Tue and Thur 8 p.m. 
Pacific Time ~ Next session: bio level data transfer" 2008-09-18 19:08 -!- flips changed topic to "Tux3 list membership roars past 100! ~ http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 p.m. Pacific Time ~ Next session: bio level data transfer" 2008-09-18 19:08 maze, ping 2008-09-18 19:19 -!- flips changed topic to "Tux3 list membership roars past 100! ~ http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 p.m. Pacific Time ~ Next session: bio level data transfer ~ Seinfeld ads canned, thanks for small mercies" 2008-09-18 19:19 -!- flips changed mode/#tux3 -> -o flips 2008-09-18 19:27 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-18 19:31 I figure if I make myself a new cuppa dark french right now have a fighting chance of getting streaming dleaf read working by midnight 2008-09-18 19:31 maybe even write 2008-09-18 19:31 ACTION takes action on that item 2008-09-18 19:34 ACTION is browsing LDD a little... 2008-09-18 19:48 -!- BSD(~bandan@pool-71-174-177-86.bstnma.east.verizon.net) has joined #tux3 2008-09-18 19:52 -!- Kirantpatil(~kiran@122.167.219.189) has joined #tux3 2008-09-18 19:53 -!- Kirantpatil(~kiran@122.167.219.189) has left #tux3 2008-09-18 19:53 -!- RalucaM(~ral@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-18 19:55 Um.. How do I clone the git ddtree ? 2008-09-18 19:55 tried git clone? 2008-09-18 19:55 on the url I posted? 2008-09-18 19:55 Ya I mean what's the URL ? Sorry I probably missed it :( 2008-09-18 19:56 in a message somewhere, "tux3 report: what's next" 2008-09-18 19:56 alternatively, go to phunq.net/ddtree 2008-09-18 19:57 has gitweb and everything 2008-09-18 19:58 git clone http://phunq.net/tux3fs is what I tried 2008-09-18 19:58 it would be nice it git just worked 2008-09-18 19:59 like mercurial 2008-09-18 19:59 kay 2008-09-18 19:59 hmm.. 
2008-09-18 19:59 a matter of getting the url right 2008-09-18 19:59 I think it gets confused by symlinks 2008-09-18 20:00 Yay I will just do it with hg, never mind :) 2008-09-18 20:00 git is just the kernel part 2008-09-18 20:00 you don't need that right now 2008-09-18 20:01 so mercurial 2008-09-18 20:01 nice nick 2008-09-18 20:01 :) 2008-09-18 20:03 I'll clean up the git cloneability later 2008-09-18 20:03 Thanks! 2008-09-18 20:03 manshack underwent a major re-arrange 2008-09-18 20:03 just another point on the "merucial rules" curve I think 2008-09-18 20:03 yummy 2008-09-18 20:03 wow 2008-09-18 20:03 we started 3 minutes ago 2008-09-18 20:04 no maze 2008-09-18 20:04 so we will take a slight change in session plan 2008-09-18 20:04 instead of doing bio transfers we will continue drilling down into generic_write 2008-09-18 20:05 ok, somebody summarize where we got to, please... mention _2copy 2008-09-18 20:06 ACTION looks at RazvanM 2008-09-18 20:06 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2063 2008-09-18 20:06 and the summary? 2008-09-18 20:07 and we got there from here: http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2319 2008-09-18 20:07 the 2copy is used when there is no support for write_begin 2008-09-18 20:08 what is happening in this function? 2008-09-18 20:08 and we use prepare_Write and commit_write 2008-09-18 20:09 the data is moved to some kernel pages and then to some user memory? :P 2008-09-18 20:09 hi all 2008-09-18 20:09 hi 2008-09-18 20:09 ACTION takes a seat at the back of the room 2008-09-18 20:09 the data is moved from user memory onto buffer pages 2008-09-18 20:10 then the buffer pages are committed to disk 2008-09-18 20:10 sorry... 
I got the order wrong :P 2008-09-18 20:10 2copy is the lamest name anybody could have possibly chosen :p 2008-09-18 20:10 appears to be the real thing though 2008-09-18 20:11 just where we should be reading 2008-09-18 20:11 __grab_cache_page is the heart of it 2008-09-18 20:11 other things are decoration 2008-09-18 20:11 such as fault_in_readable 2008-09-18 20:12 just a quick q: why some functions start with uppercase? 2008-09-18 20:12 attempts to deal with the many dangerous recursions 2008-09-18 20:12 with varying degrees of success in terms of robustness and readability 2008-09-18 20:12 razvanm, random hackers 2008-09-18 20:12 what is write_begin? 2008-09-18 20:12 sometimes have studly caps days 2008-09-18 20:12 hey 2008-09-18 20:13 write_begin is a hook for some specialized user I don't know about 2008-09-18 20:13 "completely general interface used inexactly one place" like as not 2008-09-18 20:13 or "homework for shapor" 2008-09-18 20:13 hey maze 2008-09-18 20:13 :) 2008-09-18 20:13 ok 2008-09-18 20:13 ok, we can return to the original session plan 2008-09-18 20:14 maze, the plan is for you to report your findings on basic bio transfers 2008-09-18 20:14 lol 2008-09-18 20:14 point to code (you might want to pastie it) 2008-09-18 20:14 uhm, lol 2008-09-18 20:14 how about I put a tar.gz up? 2008-09-18 20:14 don't copy in the channel unless it's 1/2 lines 2008-09-18 20:14 that too 2008-09-18 20:14 pastie is good, use your taste 2008-09-18 20:15 if you had it checked in you could point a urls 2008-09-18 20:15 so... remember to check in next time ;) 2008-09-18 20:15 uploading 2008-09-18 20:15 since you code is so short I'd suggest just pasting the whole thing 2008-09-18 20:16 http://m.a.z.e.pl/junkfs.tar.gz 2008-09-18 20:16 lol nice domain! 
2008-09-18 20:16 really 2008-09-18 20:16 leet 2008-09-18 20:16 yeah, I own z.e.pl 2008-09-18 20:17 almost as cool as cr.yp.to 2008-09-18 20:17 so I also have m.a@z.e.pl 2008-09-18 20:17 heh 2008-09-18 20:17 "opened with ark" 2008-09-18 20:17 or m@z.e.pl - whichever you prefer 2008-09-18 20:17 ok, who has got the code open, and who not? 2008-09-18 20:17 me not 2008-09-18 20:18 ok, got it open 2008-09-18 20:18 ark works pretty fscking well 2008-09-18 20:18 I'm impressed 2008-09-18 20:18 mind you - this is very rough, and mostly was debugging plus getting it working 2008-09-18 20:18 I'm still not quite sure of everything, and although I fixed the last hang bug I found 2008-09-18 20:18 I haven't since tested 2008-09-18 20:18 so I'm not sure ;-) 2008-09-18 20:18 don't worry, shapor will hurt you if you get anything wrong 2008-09-18 20:19 lol 2008-09-18 20:19 ACTION wields axe 2008-09-18 20:19 so... where does the bio read setup start? 2008-09-18 20:20 do you want me answering? 2008-09-18 20:20 yes 2008-09-18 20:20 you should have been asking ;) 2008-09-18 20:20 hmm. 
2008-09-18 20:20 right 2008-09-18 20:20 so pretty much everything except super.c is either makefile or debug 2008-09-18 20:20 noticed 2008-09-18 20:21 and the bottom of super.c is pretty standard module init stuff 2008-09-18 20:21 nicely lindented 2008-09-18 20:21 for the moment we only care about the bio transfer 2008-09-18 20:21 and above that is the standard fs registering and fs_ops stuff 2008-09-18 20:22 and from there we get to junkfs_get_sb which calls into get_sb_bdev 2008-09-18 20:22 which calls junkfs_fill_super as a callback 2008-09-18 20:22 and that's were all the action is 2008-09-18 20:22 action :) 2008-09-18 20:22 get_sb_bdev also exclusively opens the block device for us, so that's nice 2008-09-18 20:22 finally, after 4 days of tux3 U 2008-09-18 20:22 at the point we enter into junkfs_fill_super, we have an exclusively opened block device 2008-09-18 20:23 which is passed in the superblock 2008-09-18 20:23 sb->s_bdev 2008-09-18 20:23 in junkfs_fill_super we then proceed to allocate memory for 3 basic objects 2008-09-18 20:23 1) memory to read in the 512 byte (SB_SIZE) superblock 2008-09-18 20:23 1 sector sb, leet 2008-09-18 20:23 2) an object to store state (in the bio->b_private field) 2008-09-18 20:24 c) a bio 2008-09-18 20:24 1 and 2 are just normal kmalloc's 2008-09-18 20:24 3 is via bio_alloc 2008-09-18 20:24 thus 1 and 2 will need to be kfree'd 2008-09-18 20:24 -!- Bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-09-18 20:24 and 3 will need to be bio_put'ed at some point before the end of junkfs_fill_super 2008-09-18 20:24 or we'll leak 2008-09-18 20:25 anyway, standard handling of error returns on all the allocs 2008-09-18 20:25 and we get to: 2008-09-18 20:25 bio->bi_bdev = sb->s_bdev; 2008-09-18 20:25 <------>bio->bi_sector = 0; // first sector 2008-09-18 20:25 <------>s = bio_add_page(bio, virt_to_page(buf), SB_SIZE, offset_in_page(buf)); 2008-09-18 20:25 which is most of the bio preparation stage 2008-09-18 20:25 
Bushman: hi Marcin 2008-09-18 20:25 the real meat 2008-09-18 20:25 we set the bio to refer to the correct block device 2008-09-18 20:25 marcin, hi 2008-09-18 20:25 and (for now - this is all junkfs ;-) ) we just read the first sector 2008-09-18 20:26 sectors in new linux are always exactly 512 bytes 2008-09-18 20:26 that's leet nuff for us 2008-09-18 20:26 so we're saying here offset 0 * 512 into the block dev 2008-09-18 20:26 then we need to tell the bio where to store the data 2008-09-18 20:26 (or read from, since a write would be identical) 2008-09-18 20:26 right, struct bio is sector-addressed for no good reason 2008-09-18 20:26 s = bio_add_page(bio, virt_to_page(buf), SB_SIZE, offset_in_page(buf)) 2008-09-18 20:26 hello Daniel 2008-09-18 20:27 this actually gives our carefully allocated memory to the bio as memory 2008-09-18 20:27 bushman, enjoy ;) 2008-09-18 20:27 note that bio_add_page takes (bio, struct page*, len, ofs) 2008-09-18 20:27 i dunno if enjoy is the right word for kernel code just before bedtime ;) 2008-09-18 20:27 so we pass in the bio, then convert the bufs address to a page via virt_to_page 2008-09-18 20:27 and you could write it out in full in about as much code as the function call takes 2008-09-18 20:27 pass the length of the block 2008-09-18 20:28 and calc the offset from the page struct for the ofs via offset_in_page 2008-09-18 20:28 bushman, then just enjoy the geek banter 2008-09-18 20:28 virt_to_page? 2008-09-18 20:28 I'm assuming at this point that a kmalloc can't give us memory split across pages 2008-09-18 20:28 - not sure if this is correct 2008-09-18 20:28 shapor, great question 2008-09-18 20:28 maze, correct 2008-09-18 20:28 so buf was kmalloc'ed, so it's a virtual kernel memory address 2008-09-18 20:29 maze, unless the kmalloc is bigger than a page 2008-09-18 20:29 virt_to_page gives us the struct page * for the kaddr we pass to it 2008-09-18 20:29 [flips: of course] 2008-09-18 20:29 maze, and why do we need the struct page? 
2008-09-18 20:29 because that's what bios want 2008-09-18 20:29 if you look at what a bio is 2008-09-18 20:29 it's 3 things 2008-09-18 20:30 the struct bio 2008-09-18 20:30 which has a lot of management fields 2008-09-18 20:30 the bvec which 2008-09-18 20:30 is an array of a tiny struct with 3 fields 2008-09-18 20:30 { struct page * p; int len; int ofs; } 2008-09-18 20:31 so basically a list of where to put the next len bytes, specifying memory via page/ofs pairs 2008-09-18 20:31 this is for two reasons: 2008-09-18 20:31 [at least as far as i can tell] 2008-09-18 20:31 a) most hw (ie. stuff the blockdevice drivers care about) 2008-09-18 20:31 cares about physicall addresses and not virtual kernel addresses 2008-09-18 20:31 right 2008-09-18 20:31 ie. for dma and all that good for performance goodness 2008-09-18 20:32 b) this can also be used for data xfr into userspace 2008-09-18 20:32 and there is no guarantee userspace memory has a mapping into kernel space 2008-09-18 20:32 [high mem] 2008-09-18 20:32 the big reason: scatter gather 2008-09-18 20:32 this is a dma interface in disguise 2008-09-18 20:32 very effective one 2008-09-18 20:32 this also makes it easier to coallesce physically neighboring memory together into the bvecs 2008-09-18 20:32 precisely 2008-09-18 20:33 right, another way of saying scatter gather 2008-09-18 20:33 notice that in bio_alloc 2008-09-18 20:33 we passed in a 1 2008-09-18 20:33 that 1 is the number of bvecs in the bvec area allocated to the bio 2008-09-18 20:33 so that limits how many non-contig pieces of memory we can have in the bio 2008-09-18 20:33 ah 2008-09-18 20:33 here - all we need is 1 2008-09-18 20:33 and because you did that, you could have initialized your one bvec with a simple structure assignment 2008-09-18 20:33 instead of the function call 2008-09-18 20:33 right. 2008-09-18 20:33 which does a bunch of stuff you don't need 2008-09-18 20:34 oh well. 2008-09-18 20:34 does a bio_vec describes exactly one page? 
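A rough userspace sketch of the bvec idea described above: a buffer is described as a short list of (page, len, ofs) entries, and the number of slots is fixed up front, much like the 1 passed to bio_alloc. The names `seg`, `split_into_segs` and the 4096-byte `PAGE_SZ` are made up for illustration. This is not the kernel's `struct bio_vec` or `bio_add_page`, and it assumes each entry stays inside a single page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ 4096u  /* assumed page size, for illustration only */

/* Hypothetical stand-in for one bvec entry: page / length / offset. */
struct seg {
	uintptr_t page;  /* page-aligned base address (stands in for struct page *) */
	unsigned len;    /* bytes covered by this entry */
	unsigned ofs;    /* byte offset of the data within the page */
};

/*
 * Split [addr, addr + len) into per-page segments.
 * Returns the number of segments used, or -1 if more than max slots
 * would be needed (like running out of preallocated bvec entries).
 */
static int split_into_segs(uintptr_t addr, size_t len, struct seg *segs, int max)
{
	int count = 0;

	assert(segs != NULL && max >= 0);
	while (len) {
		unsigned ofs = (unsigned)(addr & (PAGE_SZ - 1));  /* like offset_in_page */
		unsigned room = PAGE_SZ - ofs;                    /* bytes left in this page */
		unsigned chunk = len < room ? (unsigned)len : room;

		if (count == max)
			return -1;
		segs[count++] = (struct seg){ .page = addr - ofs, .len = chunk, .ofs = ofs };
		addr += chunk;
		len -= chunk;
	}
	return count;
}
```

With one slot, a 512-byte buffer that sits inside a page fits in a single entry, while the same buffer straddling a page boundary needs two entries and is rejected.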
2008-09-18 20:34 maze, exactly 2008-09-18 20:34 no 2008-09-18 20:34 bv_len 2008-09-18 20:34 it describes a start page with ofset and a length 2008-09-18 20:34 the length may exceed that page and cross into however many next ones 2008-09-18 20:34 the precise rules for merging are overridable 2008-09-18 20:34 it describes a data region that resides within one page 2008-09-18 20:35 so the bio interface will be quite good for extents 2008-09-18 20:35 many device drivers have limits on how many sectors they can transfer in one go (ie. 200 or so) 2008-09-18 20:35 maze, you can't cross a page with a bvec 2008-09-18 20:35 flips, you sure? 2008-09-18 20:35 sadly, or perhaps sanely 2008-09-18 20:35 I certainly ain't ;-) 2008-09-18 20:36 pretty sure 2008-09-18 20:36 but then I don't know what I'm talking about here 2008-09-18 20:36 never seen it done ;) 2008-09-18 20:36 these are still all guesses 2008-09-18 20:36 pollacks ain't sane, just ask Shap 2008-09-18 20:36 I thought they merged by themselves 2008-09-18 20:36 hmm, well, first homework I;d guess 2008-09-18 20:36 one more q: bv_len is counting bytes or sectors? 
:P 2008-09-18 20:37 merging happens in the physical driver 2008-09-18 20:37 good question 2008-09-18 20:37 anyway bio_add_page returns how much it successfully added (or what the current total is, not sure) in bytes 2008-09-18 20:37 bytes I think 2008-09-18 20:37 so if everything is good it should be 512 at this point 2008-09-18 20:37 hence the check 2008-09-18 20:37 it's pretty badly braindamaged i that respect, counting in different units for no good reason 2008-09-18 20:37 if it doesn't match, we've got a problem - which mind you - AFAICT - can't happen 2008-09-18 20:37 and we bio_put to free the structure and basically error out 2008-09-18 20:38 [of course here we always error out, because this is junkfs (tm)] 2008-09-18 20:38 anyway if s==512 then we're good 2008-09-18 20:38 oh bv_len is definitely bytes 2008-09-18 20:38 we setup to more fields in the bio 2008-09-18 20:38 bi_end_io is the call back for when the bio is processed (or errors out) 2008-09-18 20:39 when the disk completion interrupt fires 2008-09-18 20:39 key point 2008-09-18 20:39 bi_private is a pointer to our data (the mz struct) so that we can figure out what we're talking about in the endio handler 2008-09-18 20:39 and then we submit the bio for READ 2008-09-18 20:39 now this (ie. bios) are inherently asynchronous 2008-09-18 20:40 so at this point it might have already completed - it could have been cached and come back immediately 2008-09-18 20:40 right... 
it's the _only_ way to recover a memory context for a completed bio 2008-09-18 20:40 [I think] 2008-09-18 20:40 or we might need to wait some indeterminate amount of time 2008-09-18 20:40 it's much more direct than that 2008-09-18 20:40 here's where we make use of the waitqueue which we helpfully placed in the mz struct 2008-09-18 20:40 disk raises interrupt -> endio gets called 2008-09-18 20:40 in interrupt context 2008-09-18 20:40 this is as on the metal as you will get without going hypervisor 2008-09-18 20:41 oh, so basically end_io should do as little as feasibly possible 2008-09-18 20:41 preferably as simple as it is here 2008-09-18 20:41 yes 2008-09-18 20:41 again yes 2008-09-18 20:41 is it the right place to call bio_put ? 2008-09-18 20:41 though I often get excessive there ;) 2008-09-18 20:41 anyway, earlier on, we'd already initialized the waitqueue, so now we can just wait on it 2008-09-18 20:41 in the endio handler? 2008-09-18 20:42 except wait needs not only a waitqueue (wq) but also a condition 2008-09-18 20:42 [which it checks _first_] 2008-09-18 20:42 maze, _interruptible? 2008-09-18 20:42 hence mz struct also contains a boolean 2008-09-18 20:42 flips: yeah, no idea what the right choice is there, meaning to ask about this 2008-09-18 20:42 shapor, yes 2008-09-18 20:42 very important question 2008-09-18 20:42 flips, so how would it behave in a hypervisor? any changes? does it lose determinism? 2008-09-18 20:42 why does it matter? 2008-09-18 20:43 if interruptible, you better be prepared to field anything that can be thrown at you 2008-09-18 20:43 if uninterruptible, you'd better be able to prove it always completes 2008-09-18 20:43 is that the basis for atomicity then? 2008-09-18 20:43 so what could get thrown at us, and will the bio always complete? 
2008-09-18 20:43 flips: what happens if there is an error 2008-09-18 20:43 bushman, we don't touch hypervisors 2008-09-18 20:43 disk io error or something 2008-09-18 20:43 if we did, it would be to implement hard realtime or something 2008-09-18 20:43 hypervisors should be transparent to the os 2008-09-18 20:43 does the endio handler get called? 2008-09-18 20:44 yes endio has err parameter 2008-09-18 20:44 bushman, there is some sense of atomicity here in the interruptible/noninterrupble distinction 2008-09-18 20:44 loose sense 2008-09-18 20:44 just to finish off this (junkfs_fill_super) function, we then dump the superblock via printk and free everything and return an error (junkfs remember.?) 2008-09-18 20:44 maze, in kernel interrupts don't just happen, you have to ask for them 2008-09-18 20:45 even with preemption 2008-09-18 20:45 ? 2008-09-18 20:45 or they get fielded on syscall exit 2008-09-18 20:45 SHOULD be transparrent, but since most of them mangle time into nonlinear, doesnt it screw up our predictions when interrupt is gonna finish? 2008-09-18 20:45 task switch is not interrupt 2008-09-18 20:45 it's caused by an interrupt 2008-09-18 20:45 oh i see you just aren't checking the err parameter in end_io_read 2008-09-18 20:45 you can get a task switch even with wait_uninterruptible 2008-09-18 20:45 probably should ;) 2008-09-18 20:45 so while in kernel space, my thread of execution is guaranteed not get interrupted by anything? 2008-09-18 20:46 right I should ;-) 2008-09-18 20:46 all that means is, an interrupt won't cause the wait to bail early 2008-09-18 20:46 you have to wrap your interruptible wait in a loop 2008-09-18 20:46 or write uninterruptible 2008-09-18 20:46 so interruptible here refers to what? can be interrupted by killing the mount process? 
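The "wrap your wait in a loop" advice can be sketched in userspace with POSIX threads. This is only an analogue under stated assumptions: `io_wait`, `io_wait_complete` and `io_wait_for` are hypothetical names, not the kernel waitqueue API, and a real endio handler runs in interrupt context where a sleeping mutex like this one would not be allowed. The point is the shape: the completion side records the error and sets a flag, and the waiter re-checks the flag after every wakeup instead of trusting the wakeup itself.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical analogue of the per-request state hung off bi_private. */
struct io_wait {
	pthread_mutex_t lock;
	pthread_cond_t wake;
	bool done;  /* the condition the waiter checks */
	int err;    /* error code reported by the completion side */
};

static void io_wait_init(struct io_wait *w)
{
	int rc = pthread_mutex_init(&w->lock, NULL);

	assert(rc == 0);
	rc = pthread_cond_init(&w->wake, NULL);
	assert(rc == 0);
	(void)rc;  /* quiet unused warning when asserts are compiled out */
	w->done = false;
	w->err = 0;
}

/* Completion side: record the result, set the flag, wake the waiter. */
static void io_wait_complete(struct io_wait *w, int err)
{
	pthread_mutex_lock(&w->lock);
	w->err = err;
	w->done = true;
	pthread_cond_signal(&w->wake);
	pthread_mutex_unlock(&w->lock);
}

/* Waiting side: returns the recorded error once the flag is set. */
static int io_wait_for(struct io_wait *w)
{
	int err;

	pthread_mutex_lock(&w->lock);
	while (!w->done)  /* condition is checked first, and again after each wakeup */
		pthread_cond_wait(&w->wake, &w->lock);
	err = w->err;
	pthread_mutex_unlock(&w->lock);
	return err;
}
```

If the completion has already happened before the waiter arrives, the flag check returns immediately, which matches the "might have already completed" case mentioned earlier.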
2008-09-18 20:46 which is probably what you want here 2008-09-18 20:46 just means the wait may bail before the wak 2008-09-18 20:46 wake 2008-09-18 20:47 so has to be in a loop, and you can't assume that what you were waiting for actually happened 2008-09-18 20:47 so i guess the big question here is how do we guarantee that the write is gonna complete? 2008-09-18 20:47 so I'd want uninterruptible? or interruptible and then on some interrupts somehow cancel and free the bio 2008-09-18 20:47 just write uninterruptible until you know kernel scheduling better ;) 2008-09-18 20:47 (read here) 2008-09-18 20:47 uninterruptable will cause it to be D too iirc 2008-09-18 20:47 bushman, it always completes 2008-09-18 20:47 D state 2008-09-18 20:48 with or without an error 2008-09-18 20:48 Bushman: it may complete with an error 2008-09-18 20:48 which gets passed to the endio handler 2008-09-18 20:48 yes, this is d state, the real thing 2008-09-18 20:48 which as written ignores all errors, and just marks the io as completed, frees the bio, and wakes the wq 2008-09-18 20:48 interruptable is not quite so severe i guess 2008-09-18 20:48 you are in d state any time you're waiting in kernel 2008-09-18 20:48 even interruptable? 2008-09-18 20:48 yes 2008-09-18 20:48 unless you're doing wait_interruptible? 2008-09-18 20:49 hmm 2008-09-18 20:49 flips: didn't we find that not to be the case 2008-09-18 20:49 with ddsnap 2008-09-18 20:49 even then I think 2008-09-18 20:49 hmm, so how could I get this to be abortable, in case for example the block device hangs on network? 2008-09-18 20:49 remember our threads were all D state 2008-09-18 20:49 you get a qualifier on your ps output 2008-09-18 20:49 until we changed it to interruptable 2008-09-18 20:50 maze, that's not your job, it's the job of the device insert/remove 2008-09-18 20:50 which of course means it's badly mismanaged ;) 2008-09-18 20:50 but... 
2008-09-18 20:50 not your problem for now 2008-09-18 20:50 well what if we're running this off of a nbd or something like that, and the network gets pulled 2008-09-18 20:50 would the bio then just (eventually) return with an error to endio? 2008-09-18 20:50 that's nbd's problem 2008-09-18 20:50 again not yours 2008-09-18 20:51 you can try to do timeouts and things, but you're risking redudancy 2008-09-18 20:51 and confusion 2008-09-18 20:51 right 2008-09-18 20:51 risking redundancy ? 2008-09-18 20:51 duplicating functionality that is better performed at some other layer 2008-09-18 20:52 constant risk with the blind leading the blind ;) 2008-09-18 20:52 yeah 2008-09-18 20:52 good point 2008-09-18 20:52 but the blind leading the deaf is ok 2008-09-18 20:52 maze, that was a great walkthrough, and the code is great too 2008-09-18 20:52 yes! 2008-09-18 20:52 not perfect, but you don't need that to be great in linux ;) 2008-09-18 20:52 I stil don't quite understand a bunch of it 2008-09-18 20:52 MaZe: thanks, i was following closely with little time to type 2008-09-18 20:53 a few warts make it more real, like a european movie 2008-09-18 20:53 hah 2008-09-18 20:53 ACTION rolls eyeballs 2008-09-18 20:53 lol 2008-09-18 20:53 maze, I am going to cut and paste your code into fs/tux3/super.c 2008-09-18 20:53 and tux3 is going to read a leet sector sized sb too 2008-09-18 20:54 heh 2008-09-18 20:54 s/junkfs/tux3/ 2008-09-18 20:54 hehe 2008-09-18 20:54 exactly 2008-09-18 20:54 or s/tux3/junkfs/ 2008-09-18 20:54 depending on leetness or lack of it 2008-09-18 20:54 so it seems silly for every fs to have to do this 2008-09-18 20:54 is the vfs totally useless? 
2008-09-18 20:54 yes 2008-09-18 20:55 pretty much 2008-09-18 20:55 what I still haven't found is how to specify the io priority of the bio you submit 2008-09-18 20:55 pretty close 2008-09-18 20:55 not completely 2008-09-18 20:55 lame but not useless 2008-09-18 20:55 better than NT 2008-09-18 20:55 I'm assuming it inherits from the ionice'ness of the process in whose context you're running 2008-09-18 20:55 maze, completely separate 2008-09-18 20:55 it's part of the elevator abstraction 2008-09-18 20:55 oh? 2008-09-18 20:56 huh? 2008-09-18 20:56 i was wondering that too 2008-09-18 20:56 inheriting anything is completely a property of the elevator plugin 2008-09-18 20:56 shouldn't submitting a read/write request to a blockdevice be exactly when this matters? 2008-09-18 20:56 see "request queue" 2008-09-18 20:56 oh, the mysterious q parameter 2008-09-18 20:56 one of the harder code reading projects in kernel 2008-09-18 20:56 it's a mess 2008-09-18 20:56 I saw all over the place 2008-09-18 20:56 that is apparently a field in the bio struct 2008-09-18 20:57 q is a carpet under which all kinds of doggie poo is swept 2008-09-18 20:57 it's really a bag tied onto the side of the bio 2008-09-18 20:57 we'll get rid of it before next christmas 2008-09-18 20:57 I hope 2008-09-18 20:57 I just want a nice aio read/write with priority interface for my coding 2008-09-18 20:57 you got it 2008-09-18 20:57 already 2008-09-18 20:58 well s/nice/nicer than what we had before/ 2008-09-18 20:58 that would be a good project.. a new aio interface 2008-09-18 20:58 right, I have the aio rw 2008-09-18 20:58 sounds like it should map easily enough.... 
2008-09-18 20:58 bio transfer is aio at its purest 2008-09-18 20:58 yeah 2008-09-18 20:58 right, but you want prioritization in there 2008-09-18 20:58 should be easier than non aio realy 2008-09-18 20:58 and that's what I'm failing to see 2008-09-18 20:58 maze, in the elevator 2008-09-18 20:58 'scuze my newbness, but wouldnt priority be at odds with queuing that the controllers try to do? 2008-09-18 20:58 so does the bio go through the elevator? 2008-09-18 20:59 bushman, interactions, yes 2008-09-18 20:59 not all good 2008-09-18 20:59 well, you want something htb like for io 2008-09-18 20:59 best to try and harmonize with them 2008-09-18 20:59 wait a minute, what's the layering here? 2008-09-18 21:00 is the physical hw under the elevator under the bio 2008-09-18 21:00 vfs <-> bio <-> driver 2008-09-18 21:00 and where's the elevator? 2008-09-18 21:00 between bio and driver 2008-09-18 21:00 vfs <-> bio <-> elevator <-> driver 2008-09-18 21:00 right? 2008-09-18 21:00 vfs <-> bio <-> elevator <-> driver 2008-09-18 21:00 ? 2008-09-18 21:00 heh 2008-09-18 21:00 heh 2008-09-18 21:00 exactly 2008-09-18 21:00 so by choosing the request queue in the bio, I choose priority of the request with regards to other requests? 2008-09-18 21:00 and the presence/lack of the elevator is up to the driver or virtual driver even 2008-09-18 21:01 so the elevator can appear at multiple or no places in the stack 2008-09-18 21:01 so the elevator messes with fields in the bios? 2008-09-18 21:01 is this screwy? or is this just me...? 2008-09-18 21:01 and vice versa in an idiotic way... 
sometimes useful way 2008-09-18 21:01 maze, it's screwy 2008-09-18 21:01 not just you 2008-09-18 21:01 but better than we had in 2.4 2008-09-18 21:02 it's damn fast actually, compared to a disk 2008-09-18 21:02 we didn't have that a few years ago 2008-09-18 21:02 now it's looking slow again 2008-09-18 21:02 and people are asking me to fix it 2008-09-18 21:02 it shall be done 2008-09-18 21:03 wait a minute - what is slow? 2008-09-18 21:03 the interfaces / kernel code? 2008-09-18 21:03 this who kooky chain 2008-09-18 21:03 whole 2008-09-18 21:03 vfs <-> bio <-> elevator <-> driver 2008-09-18 21:03 layering is right 2008-09-18 21:03 implementation is faulty 2008-09-18 21:03 agreed 2008-09-18 21:04 anyway 2008-09-18 21:04 we're using the existing one for now 2008-09-18 21:04 it will work for tux3 as well as it works for anybody 2008-09-18 21:04 better, because we will use it more directly 2008-09-18 21:04 and have fewer strange waits and so on 2008-09-18 21:04 right 2008-09-18 21:04 and when we do see a strange wait, we will be able to pounce on it 2008-09-18 21:04 that's why I wanted to go all the way down to the bio on the sb read 2008-09-18 21:05 a) for practice 2008-09-18 21:05 b) because it's the way it should be done 2008-09-18 21:05 unlike if you use the... odd... vfs block io helpers 2008-09-18 21:05 well I think we are going to stay all the way down here for tux3 2008-09-18 21:06 tux3 has no use asking other subsystems to submit bios on its behalf, unless that subsystem is an lvm 2008-09-18 21:06 and even then, we just submit a bio to the lvm without caring its not a real device 2008-09-18 21:06 still have to figure out how to do mmap like stuff (ie. 
trigger read in, on page fault, or write out, both for kernel and userspace, and cow, etc) 2008-09-18 21:06 maze, handled for you 2008-09-18 21:06 like magic 2008-09-18 21:06 cool - assuming it does the right thing (tm) 2008-09-18 21:06 see filemap.c -> nopage 2008-09-18 21:06 kinda right 2008-09-18 21:06 some messed locking 2008-09-18 21:07 which I'm not sure it does for cache coherency netfs 2008-09-18 21:07 bottlenecks on i_mutex during fault in 2008-09-18 21:07 bad 2008-09-18 21:07 so it probably needs to be gone through with a fine comb then 2008-09-18 21:07 even nfs is cache coherent/consistent with respect to mmap 2008-09-18 21:07 as I was expecting 2008-09-18 21:07 yes 2008-09-18 21:07 right in to the danger zone 2008-09-18 21:08 speaking of which 2008-09-18 21:08 what bottlenecks on i_mutex? 2008-09-18 21:08 time to turn on the ghetto blaster 2008-09-18 21:08 and get back to coding 2008-09-18 21:08 I'm assuming the code in filemap.c which deals with page-in/outs of mmapped pages 2008-09-18 21:08 oh, right it's already 10 past 9 2008-09-18 21:08 so is that it for this time? 2008-09-18 21:08 ACTION puts on Holst's the planets, performed by korean rock band 2008-09-18 21:09 ACTION scrolls back to remember his homework 2008-09-18 21:09 that's it, nice one maze 2008-09-18 21:09 is anybody sticking around to ask lame(er) questions? 2008-09-18 21:09 next time it will be razvanm's turn 2008-09-18 21:09 :P 2008-09-18 21:09 oh, awesome, what's he doing? 2008-09-18 21:09 to explain some more of _2copy 2008-09-18 21:09 ah 2008-09-18 21:09 lame question period is officially open 2008-09-18 21:10 intelligent questions banned 2008-09-18 21:10 what's an elevator? 
2008-09-18 21:10 ACTION doesn't have anything to ask this time 2008-09-18 21:10 a kernel elevator 2008-09-18 21:10 when you read/write data to a hard disk 2008-09-18 21:10 otherwise you're going to get some dumb jokes 2008-09-18 21:10 which is a spinning platter with a seeking head 2008-09-18 21:10 elevator = io scheduler 2008-09-18 21:10 then depending on the order you send out request 2008-09-18 21:10 just caught up 2008-09-18 21:11 you may need to do a small or large number of seeks 2008-09-18 21:11 like tivo for geeks 2008-09-18 21:11 yup, and it's algorithms are the same as a busy elevator in a skyscraper 2008-09-18 21:11 seeks are very expensive 2008-09-18 21:11 so you try to minimize seeks 2008-09-18 21:11 for good performance (b/w), but higher latency 2008-09-18 21:11 so are tlb misses 2008-09-18 21:11 and page cache misses 2008-09-18 21:11 you basically scan the disk from top to bottom, doing read writes at increasing lba addresses 2008-09-18 21:11 irregardless of the order they were submitted in 2008-09-18 21:11 then do the same thing going downwards 2008-09-18 21:12 somewhat downwards 2008-09-18 21:12 ok great, but from this level, can we be aware of what media we're writing to so we dont make it overinvolved in cases it doesnt matter, like solid state disks? 
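[Editor's sketch] The elevator described here (service requests in increasing LBA order regardless of submission order, then come back) can be illustrated with a few lines of Python. This is only a toy model under stated assumptions: `elevator_order`, `seek_distance` and the `sweep_back` flag are hypothetical names, seek cost is taken to be plain LBA distance, and none of it reflects the actual Linux io scheduler code.

```python
def elevator_order(pending, head, sweep_back=False):
    """Order pending LBAs the way a simple elevator would.

    Requests at or beyond the current head position are served first,
    in increasing LBA order. The remainder is served either on a
    downward sweep (sweep_back=True) or, after jumping back to the
    lowest request, in increasing order again (sweep_back=False).
    """
    ahead = sorted(lba for lba in pending if lba >= head)
    behind = sorted(lba for lba in pending if lba < head)
    if sweep_back:
        behind.reverse()
    return ahead + behind


def seek_distance(order, head):
    """Total head travel, assuming cost is proportional to LBA distance."""
    total = 0
    for lba in order:
        total += abs(lba - head)
        head = lba
    return total


# Hypothetical workload: requests in submission order, head at LBA 50.
pending = [95, 10, 60, 12, 90, 58]
head = 50

print("submission order:", seek_distance(pending, head))
print("two-way sweep:   ", seek_distance(elevator_order(pending, head, sweep_back=True), head))
print("one-way sweep:   ", seek_distance(elevator_order(pending, head), head))
```

The point of the comparison is the trade-off mentioned in the discussion: reordering cuts total seek distance (bandwidth) at the price of some requests waiting longer than they would in submission order (latency).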
2008-09-18 21:12 right 2008-09-18 21:12 the disk doesn't like going backwards as much as forwards 2008-09-18 21:12 the consecutive read/write sectors are still upwards 2008-09-18 21:12 Bushman: you can pick an io scheduler on a per-block-device basis 2008-09-18 21:12 and sometimes you skip the backwards step entirely 2008-09-18 21:12 depends 2008-09-18 21:12 bushman, mostly we don't care, where we do care we care a lot 2008-09-18 21:12 lots of fine tuning required to get optimal performance 2008-09-18 21:13 and it heavily depends on usecases 2008-09-18 21:13 /sys/block/sda/queue/scheduler 2008-09-18 21:13 as long as it's adjustable from userspace i'm good ;) 2008-09-18 21:13 plus you can throw in individual io priorities into the mix (ie. reading this sector is more important) 2008-09-18 21:13 we try to design for whole classes of usecases, rather than one at a time 2008-09-18 21:13 and b/w per job, and hard read/write deadlines, etc 2008-09-18 21:13 and it all gets complex 2008-09-18 21:13 http://friedcpu.wordpress.com/2007/07/17/why-arent-you-using-ionice-yet/ 2008-09-18 21:13 shapor, nice, i havent gotten used to the new linux, i've been bsd'ing since '03 2008-09-18 21:13 i only recently discovered ionice 2008-09-18 21:13 and the elevator is the piece of code which gets requests thrown at it 2008-09-18 21:14 i think mentioned on here 2008-09-18 21:14 does some algo mumbo jumbo to put them in the 'best' order 2008-09-18 21:14 shapor, because it doesn't work that well? 
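[Editor's sketch] The per-device scheduler choice lives in `/sys/block/<dev>/queue/scheduler`, which lists the available schedulers on one line with the active one in square brackets. A minimal reader might look like the following; `parse_schedulers` and `current_scheduler` are hypothetical helper names, and the sample string is an assumed example of what a kernel of that era might print, not output captured from any machine in this log.

```python
import os


def parse_schedulers(text):
    """Split a sysfs scheduler line into (active, available).

    The active scheduler is the entry wrapped in square brackets.
    """
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def current_scheduler(device="sda"):
    """Return (active, available) for a block device, or None if the
    sysfs file is not present (non-Linux system, or no such device)."""
    path = f"/sys/block/{device}/queue/scheduler"
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return parse_schedulers(f.read())


# Assumed sample line for illustration only.
sample = "noop anticipatory deadline [cfq]"
print(parse_schedulers(sample))
print(current_scheduler("sda"))
```

Writing one of the listed names back to the same file (as root) switches the scheduler for that device, which is what makes it adjustable from userspace.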
2008-09-18 21:14 and throws them at the disk 2008-09-18 21:14 flips: yes but the interface is there 2008-09-18 21:14 if people use it they can report bugs 2008-09-18 21:14 sure 2008-09-18 21:14 if people dont report bugs or say it sucks on lkml it wont get fixed 2008-09-18 21:14 same problem with posix_fadvise 2008-09-18 21:14 note that for a network nic 2008-09-18 21:14 we will take it for a spin at some point 2008-09-18 21:15 you have a certain amount of b/w 2008-09-18 21:15 maze will ;) 2008-09-18 21:15 and it's all pretty easy - conceptually 2008-09-18 21:15 and shapor will make some nice charts of the event logs 2008-09-18 21:15 vfs + bio events 2008-09-18 21:15 oh i almost forgot about that 2008-09-18 21:15 sending each packet involves a fixed amount of headroom, (header fields), the packet itself, and a fixed footer 2008-09-18 21:15 still no clue how to glue those together 2008-09-18 21:15 so when you send a packet you know exactly how much of the nic (ie. for how long) you're using it up 2008-09-18 21:16 thus you can make very nice guarantees 2008-09-18 21:16 and this is what htb + sfq does for networking 2008-09-18 21:16 htb? sfq? 2008-09-18 21:16 you can partition your network card pretty much arbitrarily between diifferent apps 2008-09-18 21:16 giving different apps different priorities, then different priorities different amounts of bw 2008-09-18 21:16 and the priorities don't need to be strictly linear either 2008-09-18 21:16 htb? sfq? 2008-09-18 21:16 htb 2008-09-18 21:16 oh could i get in on the testing? i've done a lot of work visualizing sequences of events in temporal OSPF loops, this should be i could do ;) 2008-09-18 21:17 htb is basically a tree structure 2008-09-18 21:17 the nodes are were requests come in 2008-09-18 21:17 what's the tla mean? 
2008-09-18 21:17 the root is were requests come out 2008-09-18 21:17 so each application (or tcp stream, or whatever you're using) gets assigned to a leaf node in this tree 2008-09-18 21:17 (Stochastic Fairness Queueing) 2008-09-18 21:18 and the network driver then (when it wants to send) always pulls from the root 2008-09-18 21:18 gah 2008-09-18 21:18 each node in this tree has a certain speed of accumulating tokens 2008-09-18 21:18 (htb = hierarchical token buckets) 2008-09-18 21:18 that it accumulates in the bucket in that node 2008-09-18 21:18 wouldnt stochastic approach that every client is equally unhappy? ;) 2008-09-18 21:19 Bushman: sfq is used in the leafs to randomly select between clients / tcp streams you consider equivalent 2008-09-18 21:19 you hang an sfq off of each leaf node in htb, so you actually throw the packets at the correct sfq, and the htb leaf pulls it from the attached sfq 2008-09-18 21:19 network peeps are always reinventing the world ;) 2008-09-18 21:20 ah, so you use the hiarchical token buckets to assign different classes of service to different apps/streams? 2008-09-18 21:20 anyway, you divide up each nodes bandwidth among it's children 2008-09-18 21:20 and then define how and when they can borrow/lend tokens to each other 2008-09-18 21:20 I'm not doing a very good job of defining it here 2008-09-18 21:20 but it's wicked! 
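[Editor's sketch] The token bucket tree described here can be reduced to a very small model: each node accumulates tokens at its own rate, and a leaf that has run dry may borrow from its ancestors. The code below is a deliberately simplified illustration, not the Linux `htb` qdisc algorithm: the `Bucket` class, its `refill` and `try_send` methods, and the rule "borrow from the nearest ancestor with enough tokens" are all assumptions made for the example, and the per-leaf sfq stage is left out entirely.

```python
class Bucket:
    """One node of a hypothetical hierarchical token bucket tree."""

    def __init__(self, rate, burst, parent=None):
        self.rate = rate        # tokens (bytes) earned per second
        self.burst = burst      # maximum tokens the bucket can hold
        self.tokens = burst     # start full
        self.parent = parent    # None for the root

    def refill(self, seconds):
        """Accumulate tokens for the elapsed time, capped at burst."""
        self.tokens = min(self.burst, self.tokens + self.rate * seconds)

    def try_send(self, size):
        """Pay for a packet from this bucket, or borrow from an ancestor.

        Returns True if some bucket on the path to the root could pay.
        """
        node = self
        while node is not None:
            if node.tokens >= size:
                node.tokens -= size
                return True
            node = node.parent
        return False


# Hypothetical split of a 1000 byte/s link between two traffic classes.
root = Bucket(rate=1000, burst=1000)
web = Bucket(rate=600, burst=600, parent=root)
bulk = Bucket(rate=400, burst=400, parent=root)

print(bulk.try_send(400))   # paid from bulk's own tokens
print(bulk.try_send(300))   # bulk is empty, so it borrows from root
print(bulk.try_send(800))   # nobody on the path can cover this one
```

In this toy version, how much a class may borrow is limited only by what its ancestors hold; the real thing adds configurable ceilings and priorities on top of the same idea.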
2008-09-18 21:20 no- you're doing a great job 2008-09-18 21:20 maze, I'm getting the idea 2008-09-18 21:20 sounds wicked 2008-09-18 21:20 yea i just did a project with filtering/limiting at work, so i'm getting it 2008-09-18 21:21 it sounds a lot smarter than it is ;) 2008-09-18 21:21 well, disk layer doesn't have any such pretentions to sophistication 2008-09-18 21:21 yet 2008-09-18 21:21 heh 2008-09-18 21:21 damn academis justifying their existence 2008-09-18 21:21 anyway, basically htb + sfq is the best I've seen for networking, and would probably be awesome for other stuff as well like scheduling cpus 2008-09-18 21:21 I can imagine the mess if it did 2008-09-18 21:21 Bushman: gee filtering and limiting, i wouldn't have guessed :P 2008-09-18 21:21 except it's probably to compute intensive for that and can't take cache-heat or memory nearness into account 2008-09-18 21:21 shapor: stfu ;) 2008-09-18 21:22 :) 2008-09-18 21:22 anyway, with disk it gets tougher 2008-09-18 21:22 if it did, could be interesting as a cache coherency protocal 2008-09-18 21:22 because you can't just up and calculate how long a particular operation will take 2008-09-18 21:22 network peeps always trying to find the must obscrue TLA 2008-09-18 21:22 Bushman: don't you guys use bullets for limiting ? :P 2008-09-18 21:22 mot <- most obscure tla 2008-09-18 21:22 haha 2008-09-18 21:22 (with the nic, you know its line rate, you know how many bytes your sending, the size of the pre and post-amble, the wait between packets, you thus now the _entire_ cost of sending any given packet] 2008-09-18 21:23 dont make me whip out stories about invalidating keys with thermite granades 2008-09-18 21:23 motley cru 2008-09-18 21:23 tla? 2008-09-18 21:23 mot? 
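[Editor's sketch] The claim that the full cost of sending a packet on a nic is known in advance can be made concrete for classic Ethernet, where the per-frame overhead is fixed: 8 bytes of preamble and start delimiter, a 14 byte header, a 4 byte frame check sequence and a 12 byte inter-frame gap, with payloads padded to at least 46 bytes. The function name `wire_time` and the choice to ignore VLAN tags and any collision backoff are assumptions of this example.

```python
PREAMBLE_SFD = 8      # preamble plus start-of-frame delimiter, bytes
HEADER = 14           # destination MAC, source MAC, ethertype
FCS = 4               # frame check sequence
INTERFRAME_GAP = 12   # mandatory idle time, expressed in byte times
MIN_PAYLOAD = 46      # shorter payloads are padded up to this


def wire_time(payload_bytes, line_rate_bps):
    """Seconds of link time one frame occupies, overhead included."""
    payload = max(payload_bytes, MIN_PAYLOAD)
    on_wire = PREAMBLE_SFD + HEADER + payload + FCS + INTERFRAME_GAP
    return on_wire * 8 / line_rate_bps


# A full-size frame on gigabit, and a tiny one on 100 Mbit.
print(wire_time(1500, 1_000_000_000))
print(wire_time(1, 100_000_000))
```

A disk offers no equivalent formula, since the time for a request depends on where the head happens to be, which is the contrast being drawn in the conversation.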
2008-09-18 21:23 maze, and you don't know much carrier sense backout is going to cost ;) 2008-09-18 21:23 most obscure three letter acronym 2008-09-18 21:23 ah, so you use the hiarchical token buckets to assign different classes of service to different apps/streams? - precisely 2008-09-18 21:23 and that's where your pretentions to realtime control come crashing down 2008-09-18 21:23 which is a fla 2008-09-18 21:24 which is a tla 2008-09-18 21:24 which is a tla 2008-09-18 21:24 third time lucky 2008-09-18 21:24 for example I would give each user in my network their own sfq for local traffic to another nic (just switching) to another network via wireless and to the internet (via the same wireless) 2008-09-18 21:24 to make delivery time guaranteed, woudlnt you have to have full preempt kernel? (oh i miss 80ties Amigas) 2008-09-18 21:24 ACTION thinks of some keys he'd like invalidated 2008-09-18 21:24 and then use htb to make sure everything was fair on the slow internet link, and on the others at the same time - worked awesome 2008-09-18 21:25 be right back in 10. 2008-09-18 21:25 was a good one 2008-09-18 21:25 so who's hungry? 2008-09-18 21:25 me? 2008-09-18 21:25 was just going to order from bruno's 2008-09-18 21:25 we could meet there instead 2008-09-18 21:25 you don't coult, you're always hungry 2008-09-18 21:25 flips: i thought you were coding not slacking tonight 2008-09-18 21:25 ;-) 2008-09-18 21:25 i need to sleep, it's past midnight here damn it 2008-09-18 21:26 shapor, what do think I was doing while maze was talking? 2008-09-18 21:26 Bushman: I'll drink a zyweic for you :) 2008-09-18 21:26 Bushman: east coast? 2008-09-18 21:26 bushman, laterz 2008-09-18 21:26 you guys keep it too interesting 2008-09-18 21:26 heh thanks 2008-09-18 21:26 ACTION also goes to bed. Good night to everyone. 2008-09-18 21:27 shapor- you up for grub? 
2008-09-18 21:27 Shapor: you better have some Zywiec/Okocim handy when i invade LA again 2008-09-18 21:27 that would be "more grub" 2008-09-18 21:27 safe bet, shapor already cooked tonight 2008-09-18 21:27 tim_dimm_: yeah i already ate 2008-09-18 21:27 k, beer? 2008-09-18 21:27 flips: no, i had gDinner 2008-09-18 21:27 how about some chianti 2008-09-18 21:27 ? 2008-09-18 21:27 flips: did Shap introduce you to polish beer yet? 2008-09-18 21:27 heh 2008-09-18 21:27 don't need shap for that 2008-09-18 21:28 used to live in berlin 2008-09-18 21:28 ah yes, the spoils of war... ;) 2008-09-18 21:28 heh 2008-09-18 21:28 not sure which way that one cuts 2008-09-18 21:28 all the kings horses.... 2008-09-18 21:28 couldn't stop tanks 2008-09-18 21:28 but they stopped for a beer! 2008-09-18 21:29 berlin has lots of wayward poles 2008-09-18 21:29 drinking, mostly 2008-09-18 21:29 some leggy poles 2008-09-18 21:29 drinking 2008-09-18 21:29 flips: since its late night, how's swingers sound? 2008-09-18 21:29 even the ubermensch need a brewsky 2008-09-18 21:29 or playing with the berlin boys 2008-09-18 21:29 toying actually 2008-09-18 21:29 berlin boy toys 2008-09-18 21:29 also fun for the finhish girls 2008-09-18 21:29 finnish 2008-09-18 21:29 tim_dimm_: people who dont know LA might take that out of context ;) 2008-09-18 21:30 i thought of that as soon as I hit enter 2008-09-18 21:30 esp with the PC bunch we have in here 2008-09-18 21:30 for the record, swingers is a diner 2008-09-18 21:30 shapor: maybe i should tell them about how you behaved when i took you out to boystown in chicago ;) 2008-09-18 21:30 hahah 2008-09-18 21:30 lol 2008-09-18 21:30 tim_dimm_, 802 Broadway? 2008-09-18 21:31 yeah 2008-09-18 21:31 corner of lincoln and broadway 2008-09-18 21:31 shap? 2008-09-18 21:31 sure 2008-09-18 21:31 tim_dimm_, 22 oclock? 2008-09-18 21:31 pick u up 2008-09-18 21:31 ? 
2008-09-18 21:31 i could go for a vanilla chai latte 2008-09-18 21:31 kay 2008-09-18 21:31 sure 2008-09-18 21:31 good idea 2008-09-18 21:31 you commie bastards 2008-09-18 21:31 keep those wrists safe tonight 2008-09-18 21:31 i got waffle house 2008-09-18 21:32 :) 2008-09-18 21:32 k rollin in ten 2008-09-18 21:32 you coming by here? 2008-09-18 21:32 shapor: drive by u then flips 2008-09-18 21:32 yeah 2008-09-18 21:32 good 2008-09-18 21:32 sure 2008-09-18 21:32 see you then 2008-09-18 21:32 k 2008-09-18 21:32 got 28 minutes to hack on dleaf 2008-09-18 21:32 bushman, good to meet you 2008-09-18 21:32 ACTION puts pants on 2008-09-18 21:32 nice to talk with everyone 2008-09-18 21:33 swingers, pants 2008-09-18 21:33 bushman, see you soon 2008-09-18 21:33 wtf dude 2008-09-18 21:33 haha 2008-09-18 21:33 ;-) 2008-09-18 21:33 oh if shap is putting pants on... SAY HI TO JOELLE! 2008-09-18 21:33 bushman, we need to meet up 2008-09-18 21:33 yea i know, end of fiscal year madness here, maybe this weekend we'll talk more 2008-09-18 21:33 Bushman: she says hi ;0 2008-09-18 21:33 ;) rather 2008-09-18 21:34 bushman, works for me 2008-09-18 21:35 flips: my boss been just tasked with writing the next orange book like thing, so we can make our requirements whatever we want, literally 2008-09-18 21:36 this is DoD/govt wide stuff, seriously influential development for the next decade, so it's the perfect moment to sneak in all kinds of security goodness 2008-09-18 21:37 bushman, sweet 2008-09-18 21:37 means I'd better bootstrap my clue 2008-09-18 21:38 i get to be the technical ideas feeder, as tehy're more policy, so if you got good ideas, i'm all ears 2008-09-18 21:38 what color is this one going to be? 2008-09-18 21:38 green book? 2008-09-18 21:38 this is la after all 2008-09-18 21:38 nah, the rainbow series been retired, dunno what it's gonna be called 2008-09-18 21:39 leetbook 2008-09-18 21:39 they've realized common criteria was an EPIC FAIL! 
2008-09-18 21:39 nice 2008-09-18 21:39 onion book 2008-09-18 21:39 Bushman: isn't that what you spent a year of raduate school pissing and moaning about? 2008-09-18 21:39 or... maybe pomegranite 2008-09-18 21:39 anyway, must sleep, got some hacking certification to pass tommorow 2008-09-18 21:39 pomegrantis have excellent security... isolation... 2008-09-18 21:40 compartmentalization 2008-09-18 21:40 yes, that was the class i was forced into when you were 'visiting' 2008-09-18 21:40 robustness... 2008-09-18 21:40 Bushman: thanks again for that ;) 2008-09-18 21:40 hmm 2008-09-18 21:40 they always pick such meaningful names like dod8200.2 or dcid6/3 2008-09-18 21:42 now , now, now, not all poles drink 2008-09-18 21:42 ...heavily... 2008-09-18 21:42 no? 2008-09-18 21:42 we take breaks 2008-09-18 21:43 true 2008-09-18 21:43 to sleep... 2008-09-18 21:43 of sorts 2008-09-18 21:43 my break been too long, i need my okocim porter damn it 2008-09-18 21:44 orange book on what? 2008-09-18 21:44 security requirements for high assurance computer systems 2008-09-18 21:44 ah 2008-09-18 21:44 govt/mil style stuff 2008-09-18 21:44 http://en.wikipedia.org/wiki/TCSEC 2008-09-18 21:45 that can be interesting 2008-09-18 21:45 red hat and suse got b2 a while back 2008-09-18 21:45 so long as you don't have to write or read it 2008-09-18 21:45 I seem to recall 2008-09-18 21:45 those documents need an interface layer 2008-09-18 21:45 heh, i forgot the official name of it, i actually got the orange covered book on my shelf ;) 2008-09-18 21:45 so did windows nt... for a rather specialized configuration 2008-09-18 21:45 (ie. 
come with your own personal interpreter) 2008-09-18 21:45 with the network unplugged I think it was 2008-09-18 21:45 well, you need stuff like labeled packets on a network, 2008-09-18 21:46 i totally agree, that's why we're redoing it, cuz everything up to this point sucks 2008-09-18 21:46 isolation 2008-09-18 21:46 yeah, lots of fun 2008-09-18 21:46 maze, I agree with the concept of defining the functionality rather than the interface 2008-09-18 21:46 heh, if you had citizenship i could hire you right now ;) 2008-09-18 21:46 and while you're at it, you should probably make sure to sneak in good performance 2008-09-18 21:46 defining interfaces just doesn't fly with linux kern hacks 2008-09-18 21:47 I'd make sure a decent qos and the like makes it in 2008-09-18 21:47 these people couldnt give less shit about performance 2008-09-18 21:47 barriers 2008-09-18 21:47 that's what they want 2008-09-18 21:47 MaZe: good point.. you really have to sneak in performance 2008-09-18 21:47 I'd want these to be able to interoperate 2008-09-18 21:47 can't get there from here, provably 2008-09-18 21:47 with a public network like the internet 2008-09-18 21:47 the only way to sneak in performance is to push for small code 2008-09-18 21:48 fortunately, performance and provability tend to go hand in hand 2008-09-18 21:48 that's the only spot where security and performance principles meet 2008-09-18 21:48 or make this be the standard for the backbone or something 2008-09-18 21:48 yup 2008-09-18 21:48 because to get performance you need simple 2008-09-18 21:48 less code - easier to understand 2008-09-18 21:48 NOT YOUR CODE! 2008-09-18 21:48 easier to prove correct (or believe correct)/understand 2008-09-18 21:48 Bushman: no US Citizenship, just Canadian ;-) we Canadians rule the world. 2008-09-18 21:49 i've been reading tux' code, holy crap, did you all grow up coding up demos for amigas in the 80ties? 
2008-09-18 21:49 imo, that's a requirement for decent coding 2008-09-18 21:49 bushman, you need to read some other filesystem code 2008-09-18 21:49 it's just as dense, but not as performant for the most part 2008-09-18 21:50 i know, i'm kidding, but i saw bitslicing and i went 'oh shnap, they didnt...' 2008-09-18 21:50 the real trick, is you want to define nice clean apis/interfaces, then stick to them without breaking through the layering 2008-09-18 21:50 ACTION thinks about cutting and pasting some vfat code 2008-09-18 21:50 while at the same time avoid layer for the sake of layers - a lot of the vfs code is just wrappers on wrappers - sad 2008-09-18 21:50 bushman, the bitfields stuff will go, mostly 2008-09-18 21:50 can't have that on the actual media 2008-09-18 21:50 and a vital point is, the apis have to be precisely and accurately defined and documented 2008-09-18 21:50 -!- tim_dimm__(~mobile@32.172.89.233) has joined #tux3 2008-09-18 21:51 too much variation between machine architectures 2008-09-18 21:51 yea i was thinking how easy that would be to do buffer overflows on 2008-09-18 21:51 maze, careful there 2008-09-18 21:51 does this function sleep? what's the input? the output? how does it deal with errors? what errors can it return? how long can it take? 2008-09-18 21:51 maze, there is a long and star studded history of api proposals to the linux kernel core that failed 2008-09-18 21:51 hmm? 
2008-09-18 21:52 Shapor: outside dude 2008-09-18 21:52 tim_dimm__: k 2008-09-18 21:52 you almost want a language where you can have write constraints for before/after/during execution of each function...like Eiffel 2008-09-18 21:52 selinux only just squeeked in 2008-09-18 21:52 Bushman: agreed 2008-09-18 21:52 that should almost be a gov requirement for the code thats secure 2008-09-18 21:52 most other consortium/thinktank type apis have failed to get merged 2008-09-18 21:52 that's how stock market software is done 2008-09-18 21:52 not to say it can't happen 2008-09-18 21:52 but one has to be _very careful_ 2008-09-18 21:53 careful with what? 2008-09-18 21:53 i got to sit down with the creator of eiffel few months ago, very smart dude 2008-09-18 21:53 maze, proposing apis to linus 2008-09-18 21:53 ah 2008-09-18 21:53 better to propose functionality 2008-09-18 21:53 define the functionality and the invariants 2008-09-18 21:53 where's the problem? he doesnt' like them? 2008-09-18 21:53 oh 2008-09-18 21:53 let linus and friends take it to api 2008-09-18 21:53 yea, linus seems to be a big proponent of order emergent out of chaos... 
2008-09-18 21:53 perhaps with some helpful guidance 2008-09-18 21:54 yeah, I kind of consider the api to be the functionality/invariants 2008-09-18 21:54 I'm not sure I'm really aware of the difference 2008-09-18 21:54 maze, he has a healthly disrespect for anybody else's ability to design a robust api 2008-09-18 21:54 linus isn't the world's greatest either, but its his kernel 2008-09-18 21:54 he's not the worst either, or even below 99th percentile 2008-09-18 21:54 ok, by api, I don't mean it can't be changed later - as in stable 2008-09-18 21:55 api for demo apps, sure 2008-09-18 21:55 I mean a layer below which you don't have to descend to understand what it will do 2008-09-18 21:55 yea but how do you arrive at stable without having it in real action for a while 2008-09-18 21:55 reference implementation 2008-09-18 21:55 jsut don' 2008-09-18 21:55 jsut don't let it grow into a huge undertaking with emotional baggage 2008-09-18 21:55 bushman, true 2008-09-18 21:55 well 2008-09-18 21:55 some strange bootstrap 2008-09-18 21:56 usually best to work with incremental modifactions 2008-09-18 21:56 you have to have a clear idea of where you're heading 2008-09-18 21:56 usually you get that the second or third time around the block 2008-09-18 21:56 because you've tried it yourself, yes 2008-09-18 21:56 but that still doesn't cut it with core 2008-09-18 21:56 i was a sysadmin for a long time, if i learned anything is that long term reality beats out fuzzing/use cases anytime ;) 2008-09-18 21:56 -!- tim_dimm__(~mobile@32.172.89.233) has joined #tux3 2008-09-18 21:56 bushman, exactly 2008-09-18 21:56 and the reality is posix 2008-09-18 21:57 I'm not sure what you mean by that, especially by fuzzing 2008-09-18 21:57 yeah and postfix is broken 2008-09-18 21:57 so this will succeed to the extent it builds on that 2008-09-18 21:57 posix is nice because it's a standard 2008-09-18 21:57 but, oh boy, what a standard it is... 
2008-09-18 21:57 Flips: be outside in 3 min 2008-09-18 21:57 and because linus cares about it in a backhanded way 2008-09-18 21:57 kay 2008-09-18 21:57 ACTION thinks about pants 2008-09-18 21:58 anybody gonna come pick me up? 2008-09-18 21:58 housecoat currently in case any of you were wondering ;) 2008-09-18 21:58 let anything run in real production environment long enough and it's gonna encounter more bugs than all the test cases you can predict/generate. all tests are contrived. reality is strangely objective 2008-09-18 21:58 maze, we'll send the learjet by in 3 years 2008-09-18 21:58 Bushman: I'm an SRE right now, so I know ;-) 2008-09-18 21:59 what's SRE? 2008-09-18 21:59 bushman, it's not really true of core kernel though 2008-09-18 21:59 -!- tim_dimm__(~mobile@32.172.89.233) has joined #tux3 2008-09-18 21:59 way more bugs are squeezed out before it gets into hands of users 2008-09-18 21:59 or we'd be dead 2008-09-18 21:59 site reliability engineer for google, running crawling and indexing, all the way from machines to 70% or so of the way up the stack 2008-09-18 22:00 that's true, to me kernel is something that joins the clarity of time travel with readability of alchemy ;) 2008-09-18 22:01 well, we used to have the nice 2.odd trees, now all users are beta testers :/ 2008-09-18 22:01 they are also much smaller changes though 2008-09-18 22:01 alrightt, 1am, time to pass out, over an out, great meeting you all 2008-09-18 22:02 nice to meet you 2008-09-18 22:02 and you have the stable kernels as well - the ones in RHEL4/5 and 2.6.16.X and then you have the newer ones being tested in fedora and ubuntu and desktop distros, and then you have bleeding edge in unreleased distros (fedora 10, etc) 2008-09-18 22:02 mainline 2008-09-18 22:03 so this is something I'm not sure about, but I think the trees used by distros have gotten _MUCH_ closer to mainline 2008-09-18 22:03 Flips outside now 2008-09-18 22:03 cu 2008-09-18 22:03 bye guys 2008-09-18 22:04 I used to compile 
kernels from source back in 2.4 days 2008-09-18 22:05 Mobile irc = busted 2008-09-18 22:05 :) 2008-09-18 22:05 nowadays I use whatever distro provided kernel is available 2008-09-18 22:05 ie. right now I'm running fedora 9 and tracking koji (ie. running 2.6.26.5-42 now) 2008-09-18 22:06 the problem with building your own kernel is it's so freaking complex to get the right config options 2008-09-18 22:06 not to mention you end up with a config noone has tested... 2008-09-18 22:06 and you end up building so many modules you'll never use 2008-09-18 22:07 (make config could really use a detect usb/pci/etc devices present in system and enable those forcibly, disable the rest) 2008-09-18 22:07 I would guess this actually means almost everybody is running a distro provided kernel 2008-09-18 22:11 Thanks! 2008-09-18 23:29 maze, make defconfig is your friend 2008-09-18 23:29 hmm? 2008-09-18 23:29 try it 2008-09-18 23:29 oh, speaking of that, yeah, still 2008-09-18 23:30 I've got a perfectly good kernel someone else deals with 2008-09-18 23:30 and I can compile modules against it 2008-09-18 23:30 I'm happy ;-) 2008-09-18 23:30 you'll get over it 2008-09-18 23:31 that being happy thing 2008-09-18 23:31 these days you just cat your config out of proc 2008-09-18 23:32 cat /proc/config.gz | gunzip | less 2008-09-18 23:32 and lsmod 2008-09-18 23:34 lots of windows peeps reading our mailing list archives 2008-09-18 23:35 chances are, linux hacks running company laptops 2008-09-18 23:35 but you never know 2008-09-18 23:40 or bots trying to be subtle 2008-09-18 23:40 sneakbots 2008-09-18 23:41 flipz_out: do you have stats? 2008-09-18 23:41 maybe 2008-09-18 23:41 installed the stats thing 2008-09-18 23:41 didn't check it 2008-09-18 23:41 what's the command? 
2008-09-18 23:41 webalizer 2008-09-18 23:41 it produces html output 2008-09-18 23:42 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-18 23:42 lame 2008-09-18 23:42 well 2008-09-18 23:42 by default /var/www/webalizer in debian i think 2008-09-18 23:42 maybe 2008-09-18 23:42 got to get it doing something more sensible 2008-09-18 23:42 only giving me per-month right now 2008-09-18 23:43 I want per-hour 2008-09-18 23:43 monthly for may??? 2008-09-18 23:43 wtf 2008-09-18 23:43 Usage Statistics for tux3.org 2008-09-18 23:43 Summary Period: May 2007 2008-09-18 23:43 Generated 18-Sep-2008 23:41 PDT 2008-09-18 23:43 [Daily Statistics] [Hourly Statistics] [URLs] [Entry] [Exit] [Sites] [Referrers] [Search] [Agents] [Locations] 2008-09-18 23:43 Monthly Statistics for May 2007 2008-09-18 23:44 got more important things to do than give enemas to stats scripts 2008-09-18 23:45 1 45 58.44% slashdot.org/comments.pl 2008-09-18 23:47 microsoft seems to be crawling my site with the user-agent 2008-09-18 23:47 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322) 2008-09-18 23:48 ah 2008-09-18 23:48 seen that much 2008-09-18 23:48 from diverse ip addresses 2008-09-18 23:48 evil? 2008-09-18 23:48 its obviously a bot 2008-09-18 23:48 what makes you think its msftbot? 
2008-09-18 23:48 its grabbing an html file 2008-09-18 23:48 msnbot 2008-09-18 23:48 and not any of the jpgs linked on it 2008-09-18 23:48 also 2008-09-18 23:49 the ip's belong to msft ;) 2008-09-18 23:49 they're not even good at sneaking 2008-09-18 23:50 65.55.109.0/24 and 65.55.110.0/24 2008-09-18 23:51 OrgName: Microsoft Corp 2008-09-18 23:51 and they set the referrer to 2008-09-18 23:51 http://search.live.com/results.aspx?q=camera 2008-09-18 23:51 to make it look like people are using live.com to find my site 2008-09-18 23:51 seems shady 2008-09-18 23:52 where "camera" is any common word which appears in my site 2008-09-18 23:53 shady 2008-09-18 23:53 ballmer style 2008-09-18 23:53 http://ekstreme.com/thingsofsorts/blogging/yell-if-microsofts-livecom-spammed-you-too 2008-09-18 23:54 with all the subtletly of a giraffe in a japanese tea house 2008-09-18 23:54 this has been going on a long time 2008-09-18 23:58 i guess its them trying to emulate a human 2008-09-18 23:58 seems stupid 2008-09-19 00:02 but can it pogo 2008-09-19 00:15 ... 2008-09-19 00:20 Shapor: # Idle priority is VERY cautious about marking block devices idle. If your foreground tasks are using disk, then your background tasks will become noticeably slower, as they get blocked from touching the disks until Linux knows for sure your foreground tasks have all had a chance at the disk. Most of the times, you don't care about this anyway, but don't run a torrent in non-idle class and expect a 20GB copy to finish till the torrent's done! 2008-09-19 00:20 == lame 2008-09-19 00:20 http://friedcpu.wordpress.com/2007/07/17/why-arent-you-using-ionice-yet/ 2008-09-19 00:21 got to be a hint why only phb oriiented vendors provide it by default 2008-09-19 00:21 course...
maybe that's every vendor ;) 2008-09-19 00:32 yeah its not very fine grained 2008-09-19 00:33 that VERY is a red flag 2008-09-19 00:34 i didn't generate hard data but when i use a combinatino of ionice and nice compressing log files the system seems more responsive 2008-09-19 00:34 than if i dont use them 2008-09-19 00:34 i dont care how long it takes to compress my log files really 2008-09-19 00:35 so if ANY other io wants my disk let it have it 2008-09-19 00:35 if the log compression never finishes thats ok 2008-09-19 00:35 it is useful even in its current state 2008-09-19 00:36 now i just need a version of cat which puts posix_fadvise in the io even loop so it doesn't piss on my buffer cache either 2008-09-19 00:37 cat --dont-piss-on-my-buffer-cache 2008-09-19 00:37 mount -ttux3 -oloop foodev /mnt 2008-09-19 00:37 we start here. 2008-09-19 00:37 wow! we got here 2008-09-19 00:38 my cut n paste of mazes junk just worked 2008-09-19 00:38 not junk 2008-09-19 00:38 junkfs :) 2008-09-19 00:38 junkfs reulz 2008-09-19 00:39 unlike maze, I did not have to reboot my workstation 2008-09-19 00:39 because I ran it under uml 2008-09-19 00:39 worth getting that working 2008-09-19 00:40 heh 2008-09-19 00:40 mount: wrong fs type, bad option, bad superblock on /dev/loop0, 2008-09-19 00:40 or too many mounted file systems 2008-09-19 00:40 (aren't you trying to mount an extended partition, 2008-09-19 00:40 instead of some logical partition inside?) 
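[Editor's sketch] The wished-for `cat` that uses posix_fadvise so it does not crowd out the buffer cache could look roughly like this. It is a minimal sketch, assuming a Linux-like system where Python exposes `os.posix_fadvise`; the function name `cat_without_caching` is hypothetical, and `POSIX_FADV_DONTNEED` is only advice, so the kernel is free to keep the pages anyway.

```python
import io
import os
import tempfile


def cat_without_caching(path, out, chunk=1 << 20):
    """Copy a file to `out`, telling the kernel after each chunk that
    the pages just read will not be needed again.

    Returns the number of bytes copied.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            data = os.read(fd, chunk)
            if not data:
                break
            out.write(data)
            # Advisory only, and not available on every platform.
            if hasattr(os, "posix_fadvise"):
                os.posix_fadvise(fd, offset, len(data), os.POSIX_FADV_DONTNEED)
            offset += len(data)
    finally:
        os.close(fd)
    return offset


# Small self-contained demonstration on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"log line\n" * 1000)
    demo_path = tmp.name

sink = io.BytesIO()
copied = cat_without_caching(demo_path, sink, chunk=4096)
os.remove(demo_path)
print(copied)
```

Combined with ionice and nice on the process itself, this covers the log compression case described in the conversation: the job yields the disk, the cpu and the cache to anything more important.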
2008-09-19 00:40 ok, let's get out the junk mop 2008-09-19 00:40 and make it presentable for tux3 checkin #3 2008-09-19 00:41 i thought you were working on extents ;) 2008-09-19 00:41 or #4 if you count the lame git checkin 2008-09-19 00:41 this is related to extents, just as ketchup is related to ice cream 2008-09-19 00:42 maze done good in here 2008-09-19 00:43 I particularly like the little hexdump in 7 lines 2008-09-19 00:43 ACTION cuts it down to 6 2008-09-19 01:00 flipz: where did dir.c come from 2008-09-19 01:00 looks a lot different from fs/ext2/dir.c in a recent kernel 2008-09-19 01:00 dir.c? 2008-09-19 01:00 oh 2008-09-19 01:00 it's the same 2008-09-19 01:01 just marginally cleaned up 2008-09-19 01:01 and got rid of the page wanking 2008-09-19 01:01 when back to buffer ops as god intended 2008-09-19 01:02 changed the interface a bit? 2008-09-19 01:02 not really 2008-09-19 01:02 ext2_create_entry 2008-09-19 01:02 ? 2008-09-19 01:03 pretty much the same 2008-09-19 01:03 that level isn't implemented 2008-09-19 01:03 in tux3 2008-09-19 01:03 well 2008-09-19 01:03 it's in inode.c 2008-09-19 01:03 well i dont see ext2_create_entry in the ext2/dir.c 2008-09-19 01:03 in fact lxr says it doesn't exist 2008-09-19 01:03 try namei.c 2008-09-19 01:03 mknod 2008-09-19 01:03 or something 2008-09-19 01:03 did you rename it? 2008-09-19 01:03 buncha verbosity 2008-09-19 01:04 were they passing in a dentry before? 
2008-09-19 01:04 no 2008-09-19 01:04 it's trivial 2008-09-19 01:04 um 2008-09-19 01:04 ok 2008-09-19 01:04 you want to know the name 2008-09-19 01:04 justa sec 2008-09-19 01:05 ext2_add_link 2008-09-19 01:05 dumb name 2008-09-19 01:05 yeah thats what i thought 2008-09-19 01:06 you changed the interface 2008-09-19 01:06 I really didn't change much 2008-09-19 01:06 did not want to discover new bugz 2008-09-19 01:06 because they create the dentry first 2008-09-19 01:06 and pass that 2008-09-19 01:06 rather than a filename 2008-09-19 01:06 hmm I did a little 2008-09-19 01:06 because no dentries in tux3 userspace 2008-09-19 01:06 and they call ext2_create seperately 2008-09-19 01:07 hmm 2008-09-19 01:07 caught me 2008-09-19 01:07 perhaps there should be 2008-09-19 01:07 to make kernel port easier 2008-09-19 01:07 ext2 is not an exemplary model for namespace structure 2008-09-19 01:07 hmm 2008-09-19 01:07 this is all fs internal 2008-09-19 01:07 ok 2008-09-19 01:08 might as well drop some of the braindamage 2008-09-19 01:08 good call on that though 2008-09-19 01:08 i'm just trying to fix a bug in it ;) 2008-09-19 01:08 bug! 2008-09-19 01:08 i think you did introduce one 2008-09-19 01:08 ;) 2008-09-19 01:08 happens 2008-09-19 01:15 feels like there are too many interfaces in ext2/dir.c 2008-09-19 01:15 yes 2008-09-19 01:15 a linux meme 2008-09-19 01:16 making interfaces looks confusingly like productive work 2008-09-19 01:19 now... why did maze put a wait queue inside the bio 2008-09-19 01:19 looking forward to the explanation ;) 2008-09-19 01:19 ACTION unborks 2008-09-19 01:20 it seems like a lot of stuff is landing in our inode.c 2008-09-19 01:20 sposed to put a pointer to the wait queue there, not the wait queue itself 2008-09-19 01:20 sure 2008-09-19 01:20 inode.c is a toilet 2008-09-19 01:20 heh 2008-09-19 01:20 by tradition 2008-09-19 01:20 dont flush it! 
2008-09-19 01:20 might lose something good 2008-09-19 01:22 ah so the vfs does indeed hand you a dentry 2008-09-19 01:22 not a filename 2008-09-19 01:22 man lxr is fucking slow 2008-09-19 01:23 i'm going to run my own 2008-09-19 01:23 damn europeans much be awake 2008-09-19 01:23 good luck installing it 2008-09-19 01:23 must even 2008-09-19 01:23 let me know how it works out 2008-09-19 01:23 hrm the interface is kinda crap 2008-09-19 01:24 ACTION tries not getting sidetracked making lxr not suck as much 2008-09-19 01:29 shapor, know a shell command for writing a few bytes at the beginning of a file without truncating the file? 2008-09-19 01:30 reiserfs has some weird looking shit in it 2008-09-19 01:30 you don't say 2008-09-19 01:31 take a simple idea and make it weird 2008-09-19 01:31 flipz: dd ? 2008-09-19 01:31 how bout that shell command? 2008-09-19 01:31 ah 2008-09-19 01:31 didn't know it could do that 2008-09-19 01:31 notrunc i think 2008-09-19 01:33 conv=notrunc 2008-09-19 01:33 lets you plop down data in it without truncating 2008-09-19 01:33 dd conv=notrunc if=hello of=foodev 2008-09-19 01:33 dd has a really weird command syntax 2008-09-19 01:34 root@usermode:~# ./tux3 2008-09-19 01:34 we start here. 2008-09-19 01:34 wow! we got here 2008-09-19 01:34 super = 68 65 6C 6C 6F 0A 00 00 00 00 00 00 00 00 00 00 2008-09-19 01:34 mount: Not a directory 2008-09-19 01:34 with maze's 'art' fixed 2008-09-19 01:35 the number of right things in maze's little hack _vastly_ outnumbers the wrong things 2008-09-19 01:35 but the wrong things are doozers ;) 2008-09-19 01:36 "It is rumored to have been based on IBM's JCL, and though the syntax may have been a joke[1], there seems never to have been any effort to write a more Unix-like replacement." 2008-09-19 01:36 from the wikipedia dd page 2008-09-19 01:36 http://en.wikipedia.org/wiki/Dd_(Unix) 2008-09-19 01:36 :p 2008-09-19 01:36 longest running joke in unix 2008-09-19 01:37 dd deprecated? 
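[editor's sketch] What dd conv=notrunc does in the exchange above can be written in a few lines of C: open the file without O_TRUNC and overwrite bytes at offset 0. The helper name write_at_start is invented for this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: overwrite the first len bytes of an existing file
 * without changing its length (unless len exceeds the current size).
 * Returns 0 on success, -1 on error. */
int write_at_start(const char *path, const void *data, size_t len)
{
    /* No O_TRUNC and no O_CREAT: the file must exist and keeps its tail. */
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    const char *p = data;
    off_t off = 0;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0) {
            close(fd);
            return -1;
        }
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return close(fd);
}
```

Calling write_at_start("foodev", "hello\n", 6) would have the same effect as the dd command in the log, which is why the superblock dump starts with 68 65 6C 6C 6F 0A.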
2008-09-19 01:37 i think not 2008-09-19 01:37 flipz: we should fix it :) 2008-09-19 01:38 right, if only because we own the name 2008-09-19 01:38 yup 2008-09-19 01:38 ddcp 2008-09-19 01:38 nah too long 2008-09-19 01:38 and its not cp 2008-09-19 01:38 dd --oldbroken 2008-09-19 01:38 dd2 2008-09-19 01:38 dd --muchbetter 2008-09-19 01:39 ddd 2008-09-19 01:39 or how about just "d" 2008-09-19 01:40 dd with a symlink 2008-09-19 01:42 hardlink 2008-09-19 01:42 mandatory 2008-09-19 01:42 provide legacy compatability if the argv[0] is dd 2008-09-19 01:42 otherwise new hawtness 2008-09-19 01:43 root@usermode:~# ./tux3 2008-09-19 01:43 we start here. 2008-09-19 01:43 wow! we got here 2008-09-19 01:43 super = 68 65 6C 6C 6F 0A 00 00 00 00 00 00 00 00 00 00 2008-09-19 01:43 root@usermode:~# mount 2008-09-19 01:43 /dev/ubda on / type ext2 (rw) 2008-09-19 01:43 proc on /proc type proc (rw) 2008-09-19 01:43 devpts on /dev/pts type devpts (rw,gid=5,mode=620) 2008-09-19 01:43 /root/foodev on /mnt type tux3 (rw,loop=/dev/loop0) 2008-09-19 01:43 that's enough for tonight 2008-09-19 01:43 sweet 2008-09-19 01:43 almost ;) 2008-09-19 01:43 so i'm trying to get a backup of your git tree up on github.com 2008-09-19 01:43 how'd you clone it? 
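[editor's sketch] The "legacy compatibility if argv[0] is dd" idea is the usual multi-call binary trick: look at the name the program was invoked under and pick a behaviour. A minimal version of the check, with an invented function name, might be:

```c
#include <string.h>

/* Hypothetical check: returns 1 if the program was invoked under the
 * legacy name "dd" (via hardlink or symlink), 0 otherwise. */
int invoked_as_legacy_dd(const char *argv0)
{
    /* Strip any leading directory components from argv[0]. */
    const char *slash = strrchr(argv0, '/');
    const char *base = slash ? slash + 1 : argv0;
    return strcmp(base, "dd") == 0;
}
```

A main() would call invoked_as_legacy_dd(argv[0]) first and parse the old if=/of= syntax only in that case.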
2008-09-19 01:43 they already have linus's tree 2008-09-19 01:44 I failed 2008-09-19 01:44 so i forked it 2008-09-19 01:44 always forget how 2008-09-19 01:44 now i'm just trying to push your changes in to it 2008-09-19 01:44 just clone mine 2008-09-19 01:44 i dont think i can 2008-09-19 01:44 don't rebase to anything 2008-09-19 01:44 well 2008-09-19 01:44 I'll fix that 2008-09-19 01:44 tomorrow 2008-09-19 01:44 you need to have the git service running 2008-09-19 01:44 maybe you do 2008-09-19 01:44 I do 2008-09-19 01:44 it's just configged borkly 2008-09-19 01:45 git ui braindamage as much as anything 2008-09-19 01:45 nothing is obvious 2008-09-19 01:45 telnet phunq.net 9418 2008-09-19 01:45 yeah you do 2008-09-19 01:45 mercurial is altogether more usable in this and other ways 2008-09-19 01:46 we should get the whole vfs running in user space 2008-09-19 01:46 would be killer for testing 2008-09-19 01:46 I'm milding interested in doing a dentry like thing 2008-09-19 01:46 but we have fuse for that, really 2008-09-19 01:46 fuse is... 
2008-09-19 01:46 ugh 2008-09-19 01:46 we just need to use it better 2008-09-19 01:46 yeah 2008-09-19 01:46 true 2008-09-19 01:46 we're really fitting sideways into it right now 2008-09-19 01:46 yeah its gross 2008-09-19 01:46 I'm amazed anything at all works 2008-09-19 01:47 the bug i was fixing 2008-09-19 01:47 is trying to create a file with a name which is too long 2008-09-19 01:47 returns an error 2008-09-19 01:47 that its too long 2008-09-19 01:47 as it should 2008-09-19 01:47 but creates it anyway 2008-09-19 01:47 oh, bad 2008-09-19 01:47 with the name truncated 2008-09-19 01:47 and no inode 2008-09-19 01:47 naughty 2008-09-19 01:47 its fucked 2008-09-19 01:47 heh 2008-09-19 01:47 I doubt that was my idea 2008-09-19 01:47 its a case you never get in dir.c 2008-09-19 01:48 because it checks when it creates the dentry 2008-09-19 01:48 nice catch 2008-09-19 01:48 in the vfs 2008-09-19 01:48 since we dont use it 2008-09-19 01:48 lamissimo 2008-09-19 01:48 yeah 2008-09-19 01:48 "always check your inputs" 2008-09-19 01:48 yeah 2008-09-19 01:48 that one musta got quietly slipped by ted 2008-09-19 01:48 subtle due to a minor interface change 2008-09-19 01:49 and its not liek dentries have fixed sized strings in d_name 2008-09-19 01:49 so there is no hard maximum 2008-09-19 01:49 its just supposed be be checked 2008-09-19 01:49 i dunno the limit is rediculously short 2008-09-19 01:50 i think 255 bytes maybe 2008-09-19 01:50 why not allow for long filenames 2008-09-19 01:50 that's considered long 2008-09-19 01:51 i suppose 2008-09-19 01:51 silly limitation 2008-09-19 01:51 useful silly limitation 2008-09-19 01:51 of course it comes from wanting to represent the length with a byte 2008-09-19 01:51 so you can have fixed size dentries? 2008-09-19 01:51 fixed size? 2008-09-19 01:52 i'm asking 2008-09-19 01:52 is that the reason? 
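[editor's sketch] The bug described above (a too-long name returns an error but a truncated entry is created anyway) comes from doing the length check after, or separately from, the directory update. The mock below is not the tux3 or ext2 code; all names and the fixed-size directory are invented to show the order of operations: validate first, modify second.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MOCK_NAME_MAX 255   /* the length is stored in one byte */
#define MOCK_DIR_SLOTS 16

/* Hypothetical in-memory directory, for illustration only. */
struct mock_dirent {
    unsigned char name_len;
    char name[MOCK_NAME_MAX];
    unsigned inum;
};

struct mock_dir {
    struct mock_dirent slot[MOCK_DIR_SLOTS];
    unsigned count;
};

/* Returns 0 on success or a negative errno. On any error the directory
 * is left untouched, so no truncated, inode-less entry can appear. */
int mock_create_entry(struct mock_dir *dir, const char *name, size_t len,
                      unsigned inum)
{
    if (len == 0)
        return -EINVAL;
    if (len > MOCK_NAME_MAX)
        return -ENAMETOOLONG;       /* checked before any modification */
    if (dir->count == MOCK_DIR_SLOTS)
        return -ENOSPC;

    struct mock_dirent *e = &dir->slot[dir->count++];
    e->name_len = (unsigned char)len;
    memcpy(e->name, name, len);
    e->inum = inum;
    return 0;
}
```

In the kernel the VFS performs this check while building the dentry, which is why ext2's dir.c never sees the case. A userspace port that passes raw filenames has to do it itself.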
2008-09-19 01:52 oh i see 2008-09-19 01:52 they certainly aren't fixed size 2008-09-19 01:53 hm one byte lengths 2008-09-19 01:53 true, useful 2008-09-19 01:53 qstr 2008-09-19 01:53 yeah 2008-09-19 01:53 was looking at that earlier 2008-09-19 01:53 len is int 2008-09-19 01:54 it's checked somewhere but you're right 2008-09-19 01:54 ext3 is violating, not checking it 2008-09-19 01:54 ext2 2008-09-19 01:54 __d_path or something 2008-09-19 01:54 playing fast and loosey goosey 2008-09-19 01:55 er no 2008-09-19 01:55 where do the dentries get created? 2008-09-19 01:55 i guess i should look top-down 2008-09-19 01:55 start with sys_open 2008-09-19 01:56 rather than bottom up 2008-09-19 01:56 somewhere in path_walk 2008-09-19 02:01 ah yeah, just got there 2008-09-19 02:01 3440 objp = ____cache_alloc(cache, flags); 2008-09-19 02:01 :p 2008-09-19 02:01 damn thats twisty 2008-09-19 02:01 guy who invented slab also invented zfs 2008-09-19 02:02 eh? 2008-09-19 02:02 course I doubt he wrote four underbars there 2008-09-19 02:02 true 2008-09-19 02:03 this is by way of checking whether kmalloc returns ERR_PTR or just NULL on error 2008-09-19 02:03 seems to be the latter 2008-09-19 02:03 http://lxr.linux.no/linux+v2.6.26.5/fs/namei.c#L869 2008-09-19 02:03 maze naively assumed otherwise, of course maze is right and we are wrong 2008-09-19 02:04 but he have to match linux fart for fart 2008-09-19 02:05 -!- kushal(~kushal@121.246.36.162) has joined #tux3 2008-09-19 02:07 so i got all the way down to http://lxr.linux.no/linux+v2.6.26.5/fs/dcache.c#L1241 2008-09-19 02:07 but i can't find the damn length check 2008-09-19 02:07 let me know ;) 2008-09-19 02:07 try get_name 2008-09-19 02:09 bah better to do this during the day when europeans are sleep and lxr is fast 2008-09-19 02:10 heh 2008-09-19 02:10 you're poking in the right place 2008-09-19 02:10 by the time lxr loads i've lost my train of thought 2008-09-19 02:10 right 2008-09-19 02:10 it would be useful to install it 2008-09-19 
02:10 then you can teach me 2008-09-19 02:11 involves postgres & mod_perl 2008-09-19 04:15 -!- kushal_(~kushal@121.246.36.194) has joined #tux3 2008-09-19 05:13 -!- kushal(~kushal@121.246.36.194) has joined #tux3 2008-09-19 07:52 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-19 08:34 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-19 09:27 -!- guile(~guile@89-159-217-245.rev.numericable.fr) has joined #tux3 2008-09-19 09:28 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-09-19 09:35 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-19 10:46 -!- Kirantpatil(~kiran@122.167.197.109) has joined #tux3 2008-09-19 10:46 -!- Kirantpatil(~kiran@122.167.197.109) has left #tux3 2008-09-19 11:09 actually whether I had used IS_ERR and ERR_PTR correctly was something else I'd been meaning to ask ;-) 2008-09-19 11:12 flipz: "Ext3cow was designed as a platform for regulatory compliance, and has been used to implement secure deletion, authenticated encryption, and incremental authentication. See the publications page for more details." 2008-09-19 11:13 http://www.ext3cow.com/Publications.html 2008-09-19 11:13 about idle - not lame - idle class is not meant to affect anything else using io, neither b/w wise nor latency wise - hence it has to be very conservative, if that's not what you want... don't use idle (and even then idle can still impact performance of non-idle tasks...) 2008-09-19 11:14 and ionice has more classes than just idle 2008-09-19 11:15 exactly two more 2008-09-19 11:15 although it's not as powerful as it could/should be 2008-09-19 11:15 although i think idle is the most appealing one 2008-09-19 11:16 its common to want to do io intensive tasks in the background like backups or whatnot 2008-09-19 11:16 yeah, using kvm now 2008-09-19 11:16 ionice'ing a kvm session? 
2008-09-19 11:16 mind you of course, all my printk's are non-multithreaded-printk compatible - who cares ;-) [for now] 2008-09-19 11:18 I put the wq in the bio, cause I needed something to wait on... was there something else I could wait on, and wake up from the endbio func? 2008-09-19 11:18 uhm, what's wrong with just putting the wq there? what use is the extra level of indirection? 2008-09-19 11:19 If you do install your own lxr - pass links to it ;-) 2008-09-19 11:19 we don't need all kversions 2008-09-19 11:19 dd can write bytes without trunc 2008-09-19 11:20 ah, there it is in the log - still catchingup 2008-09-19 11:20 ;) 2008-09-19 11:20 hey what did you expect... I have no bloody idea what I'm doing ;-) [about the wrong things being doozers] 2008-09-19 11:20 hrm it would be cool if lxr could be tied to a git repo 2008-09-19 11:21 and dd is weird... but it works and is everywhere... 2008-09-19 11:21 or is it already 2008-09-19 11:22 you know what is annoying is the number of clicks you need to do to download anything from sourceforge 2008-09-19 11:22 ah yeah it talks to git 2008-09-19 11:25 yeah I was thinking I should be checking both for errors and null... 2008-09-19 11:26 kvm, and ionice, no was referring to running my tests in kvm, like flips is in uml 2008-09-19 11:27 clicks - yeah agreed 2008-09-19 11:27 Ah, caught up.... 
2008-09-19 11:27 seems you guys had a productive night 2008-09-19 11:27 mine was as well 2008-09-19 11:27 first time in a long time that I'm not sleepy before noon 2008-09-19 11:27 i was going to ask what the problem was with putting the wq in the bio as well 2008-09-19 11:28 whats the difference if you put a pointer there 2008-09-19 11:28 well, I need both the wq, and a bool 2008-09-19 11:28 so I put in a pointer to a struct with both 2008-09-19 11:28 (also should probably have an error return field in there) 2008-09-19 11:30 ok, back to work 2008-09-19 11:56 +<----->bio->bi_io_vec[bio->bi_vcnt] = (struct bio_vec){ 2008-09-19 11:56 +<-----><------>.bv_page = virt_to_page(buf), 2008-09-19 11:56 +<-----><------>.bv_offset = offset_in_page(buf), 2008-09-19 11:56 +<-----><------>.bv_len = SB_SIZE }; 2008-09-19 11:56 +<----->bio->bi_size = SB_SIZE; 2008-09-19 11:56 +<----->bio->bi_end_io = end_io_read; 2008-09-19 11:56 +<----->bio->bi_private = &mz; 2008-09-19 11:56 +<----->bio->bi_vcnt = 1; 2008-09-19 11:57 either that should be bio->bi_io_vec[0] = ... 2008-09-19 11:57 or bio->bi_vcnt++; 2008-09-19 11:59 plus putting the wq on the stack is stack bloat - isn't that bad if we want 4k stacks? 2008-09-19 12:00 actually, adding some sort of debug stack depth tracking might be useful. 
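[editor's sketch] The pasted hunk stores into bi_io_vec[bio->bi_vcnt] and then sets bi_vcnt = 1, which is only right when the count happened to be 0 beforehand. The userspace mock below shows the "bi_vcnt++" variant suggested above. The struct and function names are stand-ins invented here, not the kernel's struct bio.

```c
#define MOCK_MAX_VECS 4

/* Hypothetical stand-ins for struct bio_vec and struct bio. */
struct mock_vec {
    void *page;
    unsigned offset;
    unsigned len;
};

struct mock_bio {
    struct mock_vec vec[MOCK_MAX_VECS];
    unsigned vcnt;   /* number of vecs in use */
    unsigned size;   /* total bytes described */
};

/* Append one segment: index by the current count and bump it in the same
 * expression, so the index and the count cannot drift apart.
 * Returns 0 on success, -1 if the vec array is full. */
int mock_bio_add_vec(struct mock_bio *bio, void *page, unsigned offset,
                     unsigned len)
{
    if (bio->vcnt >= MOCK_MAX_VECS)
        return -1;
    bio->vec[bio->vcnt++] = (struct mock_vec){
        .page = page,
        .offset = offset,
        .len = len,
    };
    bio->size += len;
    return 0;
}
```

Accumulating size alongside the count also keeps the total consistent if a second segment is ever added, which a hard-coded bi_size assignment would not.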
2008-09-19 12:00 record deepest spot on stack ever hit in your code 2008-09-19 12:00 hmm, maybe the kernel already does that automatically 2008-09-19 12:07 -!- nataliep(~nataliep@207.47.98.129.static.nextweb.net) has joined #tux3 2008-09-19 12:12 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-19 13:16 maze, ping 2008-09-19 13:17 maze, your ERR_PTR is mostly wrong, where you've applied it to functions that only return ptr or null 2008-09-19 13:17 entirely wrong to be precise ;) 2008-09-19 13:19 maze, putting the work queue in the bio is inherently fragile, the bio can disappear 2008-09-19 13:19 put a pointer to the work queue in the bio 2008-09-19 13:20 right, the wq is in a pointer in the bio 2008-09-19 13:20 yeah checking whether err-ptr is required or not was a todo 2008-09-19 13:20 but of course that's actually not documented anywhere 2008-09-19 13:20 useful, like at the top of said functions 2008-09-19 13:21 the way you've done it, you've got double indirection - I've got single 2008-09-19 13:21 maze, it would be very cool if lxr could be tied to a repo... think about versioned indexes :) 2008-09-19 13:21 heh 2008-09-19 13:21 you either indirect on the wait, or indirect on the complete 2008-09-19 13:22 the wait indirection will be executed more often than the complete 2008-09-19 13:23 bio->bi_vcnt++ would be an improvement 2008-09-19 13:24 the mz goes on the stack anyway 2008-09-19 13:24 if we're that tight for stack space we should not be doing 4K stacks 2008-09-19 13:24 (which people are slowly learning) 2008-09-19 13:31 Iceweasel can't find the server at m.a.z.e.pl. 2008-09-19 13:38 maze, by the way, a wait queue is tiny 2008-09-19 13:38 just a spinlock and a list 2008-09-19 13:38 yeah my university is migrating to a new building 2008-09-19 13:38 why does the mz go on stack if it's kmalloc'ed? 
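[editor's sketch] The ERR_PTR point above is that the two error conventions are not interchangeable: a function that returns NULL on failure (kmalloc being the case checked earlier in the log) must be tested against NULL, because IS_ERR(NULL) is false. The macros below are a userspace re-creation of the kernel's err.h helpers written for this example. The two check_* functions are invented names.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace imitation of the kernel helpers: small negative errno values
 * are encoded in the top 4095 addresses of the pointer range. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Correct check for a function that returns a pointer or NULL. */
int check_null_style(const void *p)
{
    return p ? 0 : -ENOMEM;
}

/* Correct check for a function that returns a pointer or ERR_PTR(-errno).
 * Applied to a NULL-returning function it reports success on failure. */
int check_err_style(const void *p)
{
    return IS_ERR(p) ? (int)PTR_ERR(p) : 0;
}
```

So check_err_style(NULL) returns 0, which is exactly the failure mode being pointed out: the error goes unnoticed and the NULL gets dereferenced later.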
2008-09-19 13:39 my mz wan't kmalloced 2008-09-19 13:39 oh 2008-09-19 13:39 can't see your original code any more 2008-09-19 13:39 good to post that kind of thing to the list 2008-09-19 13:39 it was a nice hack 2008-09-19 13:39 very nice 2008-09-19 13:39 which hack? 2008-09-19 13:39 junkfs 2008-09-19 13:39 oh 2008-09-19 13:40 too bad bio is such a bloaty interface 2008-09-19 13:40 not easy to make useful helpers for it 2008-09-19 13:42 I have ;-) 2008-09-19 13:42 int bioio(int rw, dec_t dev, sector_t sector, unsigned size, 2008-09-19 13:42 endio_t endio, void *private, unsigned vecs, struct page *page, 2008-09-19 13:42 unsigned off, unsigned len); 2008-09-19 13:42 :p 2008-09-19 13:43 yeah, should probably call it something like synchronous_bio_io 2008-09-19 13:44 can shell it as syncbio 2008-09-19 13:44 or synchbio standing for synch[ronous]_b[io]_io 2008-09-19 13:44 don't want sync since that means something else 2008-09-19 13:44 not really 2008-09-19 13:44 it's just a part of a sync 2008-09-19 13:44 well, it won't sync to disk 2008-09-19 13:44 oh, wait it will 2008-09-19 13:44 it will 2008-09-19 13:44 uhm, even if it's an lvm volume? 2008-09-19 13:44 syncbio is the one 2008-09-19 13:45 yes 2008-09-19 13:45 right it will 2008-09-19 13:45 the page cache is above this level 2008-09-19 13:45 things get screwy when virtual block devices cache 2008-09-19 13:45 which some do 2008-09-19 13:45 I'm still not quite clear on how to do barriers and permit reordering in the elevator at this level 2008-09-19 13:45 and theyget screwy 2008-09-19 13:45 barriers are a big mess 2008-09-19 13:46 mostly we just close our eyes and try to do simple things 2008-09-19 13:46 but anyway 2008-09-19 13:46 as you can tell ... 
I don't do/like simple 2008-09-19 13:46 a barrier is a flag on any bio 2008-09-19 13:46 bad idea actually 2008-09-19 13:46 I like powerful - shoot yourself in the foot things ;-) 2008-09-19 13:46 barrier should be separate bio 2008-09-19 13:46 but a barrier should be more like a pointer to another bio(s) which should be first 2008-09-19 13:46 this write must happen after those writes 2008-09-19 13:46 maybe 2008-09-19 13:47 a new barrier api would be a nice contribution 2008-09-19 13:47 no reason for barriers between fs'es on two partitions on the same bdev 2008-09-19 13:47 current one is teh suck 2008-09-19 13:47 and I don't think you need barriers on read... 2008-09-19 13:47 you do 2008-09-19 13:47 I'll even remember why 2008-09-19 13:47 not badly 2008-09-19 13:47 why? something net related? 2008-09-19 13:47 but its the same as memory ops 2008-09-19 13:48 need all combinations if you look hard enough 2008-09-19 13:48 oh, but then it's a these rights must hit disk before this read 2008-09-19 13:48 s/rights/writes/ 2008-09-19 13:48 that kind of thing 2008-09-19 13:48 barriers between readS? 2008-09-19 13:48 barries between reads... hmm 2008-09-19 13:48 I'd think no 2008-09-19 13:48 probably tackling something at the wrong level 2008-09-19 13:48 I can see writes -> barrier -> writes/reads 2008-09-19 13:49 I don't see reads -> barrier -> reads, nor reads -> barrier -> writes 2008-09-19 13:49 although I guess what exactly should happen if read, write to same sector gets reordered... hmm. 2008-09-19 13:49 the arrow directions are ambiguous 2008-09-19 13:49 arrows pointing out time 2008-09-19 13:49 flow 2008-09-19 13:49 commas do that ;) 2008-09-19 13:50 will we really have to fix the bdev interface first? 2008-09-19 13:50 I can see reads/barrier/writes 2008-09-19 13:50 but not a strong case 2008-09-19 13:50 the bdev barrier interface? 
2008-09-19 13:50 yes, it's naive 2008-09-19 13:50 well, and the prio interface 2008-09-19 13:51 there is none 2008-09-19 13:51 get it all kind of nice and usable 2008-09-19 13:51 the prio ideas are just a hack in one elevator option 2008-09-19 13:51 we (I?) need a bdev interface which is aio read/write scatter/gather with priorities htb-like and barriers 2008-09-19 13:52 true 2008-09-19 13:52 be happy to work on it with you 2008-09-19 13:52 a lot of it is there 2008-09-19 13:52 a lot isn't 2008-09-19 13:52 I have plenty of apps 2008-09-19 13:52 starting with media... 2008-09-19 13:53 how exactly barriers should work is an interesting question 2008-09-19 13:53 yes 2008-09-19 13:53 you don't want them to be too strong 2008-09-19 13:53 or awkward 2008-09-19 13:53 but strong enough to implement the consistency the fs needs 2008-09-19 13:53 you want it to solve the primary problem well, which is journalling 2008-09-19 13:53 exactly 2008-09-19 13:54 and it has to take into consideration real world disks 2008-09-19 13:54 and the fact they spin/seek - something to be very aware of when working on this, since impacts priorities much 2008-09-19 13:54 and you might desire consistency x-dev 2008-09-19 13:54 need to write to journal dev before hitting base dev 2008-09-19 13:54 maze, notice there is nothing read-specific about your endio 2008-09-19 13:55 *cute* 2008-09-19 13:55 needs a different name 2008-09-19 13:55 I know 2008-09-19 13:55 hey it was a hack ;-) 2008-09-19 13:55 not any more 2008-09-19 13:55 hehe 2008-09-19 13:55 right 2008-09-19 13:56 maze, I don't know what I was going on about with your bio private pointer, your usage is fine 2008-09-19 13:56 on the stack is more leet 2008-09-19 13:57 kmallocs are bad things 2008-09-19 13:57 I don't know 2008-09-19 13:57 fragment 2008-09-19 13:57 fragile 2008-09-19 13:57 stack is small nowadays 2008-09-19 13:57 not that small 2008-09-19 13:57 yeah, I've wanted to see exactly how much stack space I actually have 2008-09-19 
13:57 for my leet new fs idea, I actually need to be very careful 2008-09-19 13:58 sure 2008-09-19 13:58 since with both a net layer and an fs layer it might get tight 2008-09-19 13:58 but this is on the other side of too careful 2008-09-19 13:58 mayhaps 2008-09-19 13:58 I'm still new ;-) 2008-09-19 13:58 you are? 2008-09-19 13:59 I think you're past 50 percentile in hacking skills of people who call themselves that 2008-09-19 13:59 kernel hacking 2008-09-19 13:59 another couple months will get you past 90 2008-09-19 13:59 I still have no idea about anything yet ;-) 2008-09-19 13:59 you think anybody else does? 2008-09-19 13:59 how'd we get all that crap in kernel if anybody had a clue? 2008-09-19 14:00 oh one thing, there are a few null statements in your code that you may not think are there 2008-09-19 14:01 extra semicolons 2008-09-19 14:05 oh, well I like semicolons 2008-09-19 14:05 think every } should be followed by a ; 2008-09-19 14:06 heading to grab lunch 2008-09-19 14:06 (except in } else {) 2008-09-19 14:06 and C/C++ just has bad syntax with semicolon 2008-09-19 14:06 s 2008-09-19 14:07 I assume they'll be gone ;) 2008-09-19 14:08 extra parents and curlies are also frowned at, but extra semicolons are cause for shouting 2008-09-19 14:08 extra parens I mean 2008-09-19 14:08 extra parents are probably ok, particularly in utah 2008-09-19 14:11 let em shout 2008-09-19 14:12 hmm, I wonder 2008-09-19 14:12 is it true that removing a semicolon will either 2008-09-19 14:12 a) result in code functioning the exact same way as before 2008-09-19 14:12 or 2008-09-19 14:12 b) result in a compile failure 2008-09-19 14:13 probably noy 2008-09-19 14:13 extra semicolons make the code more fragile 2008-09-19 14:13 you can get a big surprise if somebody adds a seemingly innocuous conditional 2008-09-19 14:14 in theory there is no effect on generated code 2008-09-19 14:14 in practice, theory and practice are different 2008-09-19 14:15 yeah while I know ';' is a statement 
seperator, I much prefer to think of them as end-of-statement markers 2008-09-19 14:15 hmm 2008-09-19 14:15 closet pascal groupie ;) 2008-09-19 14:15 maybe I got that backwards 2008-09-19 14:15 well 2008-09-19 14:15 I love pascal syntax 2008-09-19 14:15 soor 2008-09-19 14:15 sorry 2008-09-19 14:16 yes, backwards 2008-09-19 14:16 but still a closet pascaller I think 2008-09-19 14:16 whatever - everything should end with a ';' 2008-09-19 14:16 semicolons are stupid 2008-09-19 14:16 should be optional 2008-09-19 14:16 designers of c are/were stupid 2008-09-19 14:16 but since they are there, have to use them lindentally 2008-09-19 14:16 imho should be required ;-) 2008-09-19 14:17 every line should be required to have two semicolons, one at the beginning, one at the end 2008-09-19 14:17 because you need them anyway, - can't live without em 2008-09-19 14:17 nah 2008-09-19 14:17 every statement should end with a ';' 2008-09-19 14:18 might be at the end of line, might be in the middle, might extend into the next line 2008-09-19 14:18 whitespace shouldn't matter (although could cause compiler warnings) 2008-09-19 14:18 e l s e 2008-09-19 14:19 anway 2008-09-19 14:19 let's not go there ;) 2008-09-19 14:19 else should always be } else { 2008-09-19 14:19 you either have a simple if (blah) something; 2008-09-19 14:19 not considered lindenty to have curlies around single statements 2008-09-19 14:19 or an if () { ... } else { ... 
}; 2008-09-19 14:19 not saying I think that's good or bad, it's just not lindenty 2008-09-19 14:20 yeah, I know 2008-09-19 14:20 my personal opinion, is: 2008-09-19 14:20 either it's short and sweet fits on a line if (something) something; 2008-09-19 14:21 or should be the full multi-line if () {\n ...\n } else {\n ...\n };\n 2008-09-19 14:21 my personal opinion is, if it's written in C is going to look ugly and there is little you can do about it 2008-09-19 14:21 possibly without the else clause if not needed 2008-09-19 14:21 break your heart trying 2008-09-19 14:21 true 2008-09-19 14:21 folks 2008-09-19 14:22 yes, fixing C is something I've thought of, codenamed 'the language advanced', a curious mix of pascal/c/c++/java/asm/gnu-isms/lisp 2008-09-19 14:22 but besides thinking about it never got anywhere 2008-09-19 14:22 (never tried) 2008-09-19 14:22 bh: hey 2008-09-19 14:22 your brainpower is needed more badly elsewhere ;) 2008-09-19 14:22 hehe 2008-09-19 14:23 but if you write the language first, you can then write the kernel in a language which doesn't suck... 
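[editor's sketch] On the earlier question of whether removing a semicolon always gives either identical behaviour or a compile failure ("probably noy"): a loop with an empty body is a counterexample. Both functions below compile cleanly and differ by a single semicolon. The function names are invented for the example.

```c
/* The lone ';' is the whole loop body, so the increment after it
 * runs exactly once, whatever the string length. */
int count_with_semicolon(const char *s)
{
    int i, count = 0;
    for (i = 0; s[i]; i++)
        ;
    count++;
    return count;
}

/* Same text with that one semicolon removed: the increment becomes
 * the loop body and runs once per character. */
int count_without_semicolon(const char *s)
{
    int i, count = 0;
    for (i = 0; s[i]; i++)
        count++;
    return count;
}
```

For the string "abc" the first returns 1 and the second returns 3, with no diagnostic required in either case.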
2008-09-19 14:23 you can and nobody will care 2008-09-19 14:23 agreed 2008-09-19 14:23 still an interesting exercise 2008-09-19 14:23 a disappear for years exercise 2008-09-19 14:23 true 2008-09-19 14:24 hence the 'haven't ever tried' part 2008-09-19 14:24 save it for when you're old 2008-09-19 14:24 show those whippersnappers 2008-09-19 14:24 I'm hoping someone else will do it 2008-09-19 14:24 they will 2008-09-19 14:24 or I'll get some smart/bright friends and students to do it 2008-09-19 14:24 there's always somebody with enough time on their hands to write an os from scratch 2008-09-19 14:25 they get 15 minutes of slashdot fame and a nice job where they can stew 2008-09-19 14:25 hehe 2008-09-19 14:25 if/when I go back to school to get my phd, I've been thinking about leading a course for some of the best'n'brightest with design and implementation of a language or os as the topic 2008-09-19 14:26 you can practice here 2008-09-19 14:26 anyway, back to earth 2008-09-19 14:26 you're already TA at tux3u 2008-09-19 14:26 sure 2008-09-19 14:26 got to think about the next level for junkfs/tux3fs 2008-09-19 14:26 right 2008-09-19 14:27 right now I'm trying to think of what I want from the mm subsystem for my fs 2008-09-19 14:27 it's cool how tux3fs is both ramfs and diskfs now, hmm? 2008-09-19 14:27 hehe 2008-09-19 14:27 that's the most instructive thing so far 2008-09-19 14:27 re vfs 2008-09-19 14:28 and at which layer of the vfs (generics for most ops or not) the interfaces need to happen 2008-09-19 14:28 linux kinda gets it right, it's just warty 2008-09-19 14:28 also... 
error trapping 2008-09-19 14:29 I'd like to see a stack unwinding/resource recovery discipline 2008-09-19 14:29 also wondering if implementing reads by userspace (with appropriate aligned buffer) by unmap and map in ro cow pages from page cache or somewhere else would be appropriate and fast and/or sow 2008-09-19 14:29 slow 2008-09-19 14:29 it would be appropriate even on linux 2008-09-19 14:29 and if there's any race there 2008-09-19 14:29 there are cases where it's slow, but in general it's powerful 2008-09-19 14:29 just linux isn't orgainized that way 2008-09-19 14:30 linux has a loopy approach 2008-09-19 14:30 very naive 2008-09-19 14:30 would get zero copy reads, and most of the time you don't write over that data 2008-09-19 14:30 as in... too many loops 2008-09-19 14:30 yes, and it gets even more fun with net + disk + buffer + both ways 2008-09-19 14:30 exactly 2008-09-19 14:30 hence I'm thinking of this as a two layer fs to begin with 2008-09-19 14:30 that's why nobody has been crazy enough to attempt it 2008-09-19 14:30 look a splice 2008-09-19 14:30 simple thing 2008-09-19 14:31 big disaster 2008-09-19 14:31 I'm not actually sure where splice is atm? 2008-09-19 14:31 what happened there? last I knew there was an exploit and fix and exploit and fix... 
2008-09-19 14:31 freesearch lxr 2008-09-19 14:32 it's a feature build on a base of jello 2008-09-19 14:33 there does appear to be a way to get your own page to trigger on all sorts of page operations, so that's good 2008-09-19 14:33 oh yes 2008-09-19 14:33 it's fun 2008-09-19 14:33 meant code not page 2008-09-19 14:33 "stupid page tricks" 2008-09-19 14:34 yeah, but I'm guessing it's needed for decent cache coherency 2008-09-19 14:34 even if it will mean locking will effectively end up being page (not byte-range) based 2008-09-19 14:35 anyway, I figure it's important to know what's possible, to know what can be later implemented, and to design the possibility in from the start 2008-09-19 14:35 you'll be getting into vm soon enough 2008-09-19 14:36 you can help me with the variable page rewrite if you like 2008-09-19 14:36 linus said he would open 2.7 if I did that hack 2008-09-19 14:36 I've realized that the xattr interface can probably be used as a nice ioctl layer for the fs 2008-09-19 14:36 it can? 2008-09-19 14:36 yeah like setting fs.tux3.option = something on an inode 2008-09-19 14:37 and then reading it back 2008-09-19 14:37 have the stuff be auto-generated 2008-09-19 14:37 ioctls would not be pleasant for that 2008-09-19 14:37 and have options for stuff like type of optimizations to be used on this file or etc 2008-09-19 14:37 we have ddlink for that 2008-09-19 14:37 I think xattr is nice here - although haven't looked at ddlink 2008-09-19 14:37 reiser5 ;) 2008-09-19 14:37 hmm? 2008-09-19 14:37 ddlink is cool 2008-09-19 14:38 really cook 2008-09-19 14:38 cool 2008-09-19 14:38 reiser5? what's with reiser4? 2008-09-19 14:38 "dead" 2008-09-19 14:38 is reiser even being worked on? 2008-09-19 14:38 slowly 2008-09-19 14:38 very slowly 2008-09-19 14:39 is reiser 4 done? stable? dropped? 
2008-09-19 14:39 quasi stable 2008-09-19 14:40 should be merged, under a different name imho 2008-09-19 14:41 chris mason was one of the big driving forces on reiser, at least reiser 3, and he's entirely devoted to btrfs now 2008-09-19 14:41 which is something like reiser 3.5 2008-09-19 15:00 ah 2008-09-19 15:00 I came up with a few interesting network fs related ideas last night 2008-09-19 15:00 was a very productive bath ;-) 2008-09-19 15:01 works for me too, showers though 2008-09-19 15:01 something about that running water 2008-09-19 15:01 settles the lame ideas, let's bouyant ones float to the top 2008-09-19 15:11 maze, 2008-09-19 15:11 while (vecs--) 2008-09-19 15:11 bio->bi_io_vec[bio->bi_vcnt++] = va_arg(args, struct bio_vec); 2008-09-19 15:13 int bio(int rw, dev_t dev, sector_t sector, bio_end_io_t endio, void *private, unsigned vecs, ...) 2008-09-19 16:57 what are you folks going to finished the file system ? 2008-09-19 16:57 what=when 2008-09-19 16:57 are were there yet ? 2008-09-19 16:58 ACTION grins 2008-09-19 17:31 gregkh is an idiot 2008-09-19 17:32 oh was that public 2008-09-19 17:32 http://dustinkirkland.wordpress.com/2008/09/18/whats-behind-gregkhs-latest-rant/ 2008-09-19 17:32 wouldn't be so bad if he could design, code or debug 2008-09-19 17:33 bh, we're getting closer 2008-09-19 17:33 the kernel port is getting a little attention 2008-09-19 17:33 needs a lot more 2008-09-19 17:56 true fact: the linux kernel makefile is 1600 lines long 2008-09-19 18:01 541 KBUILD_CFLAGS += $(call cc-option,-Wdeclaration-after-statement,) <- this is the line we kill to enable inline decls 2008-09-19 18:01 I guess we are going to do taht 2008-09-19 18:01 for now until just before merge 2008-09-19 18:09 sk8 oclock 2008-09-19 18:09 one could say sk8teen oclock 2008-09-19 18:18 -!- BSD(~bandan@70-4-203-156.area3.spcsdns.net) has joined #tux3 2008-09-19 18:26 -!- BSD(~bandan@70-4-203-156.area3.spcsdns.net) has joined #tux3 2008-09-19 19:02 -!- 
BSD(~bandan@70-4-203-156.area3.spcsdns.net) has joined #tux3 2008-09-19 19:46 -!- BSD(~bandan@68-244-245-217.area3.spcsdns.net) has joined #tux3 2008-09-19 19:57 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-09-19 20:42 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-19 20:51 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-19 21:20 Results 1 - 10 of about 283,000 for tux3. 2008-09-19 21:20 up 100k in a day 2008-09-19 21:20 wonder what happened 2008-09-19 21:21 100k...hits? 2008-09-19 21:21 musta been the waking post 2008-09-19 21:21 100k up in one day, yes 2008-09-19 21:21 damn 2008-09-19 21:21 the internet loves wanking I guess 2008-09-19 21:21 hey, did you guys want a faster lxr? 2008-09-19 21:21 very much 2008-09-19 21:22 ok, i just got one going on my home quad, i gotta tweak postgres and we should be ready to rock 2008-09-19 21:22 excellent 2008-09-19 21:22 of course bandwidth might be a problem 2008-09-19 21:22 your admin skillz rock 2008-09-19 21:22 shapor can fix that 2008-09-19 21:22 shap can fix anything, that bastard! :) 2008-09-19 21:23 truth 2008-09-19 21:23 do you need the free text searches? that's some extra software that i'd have to get/configure 2008-09-19 21:23 oh yes 2008-09-19 21:23 the whole enchilada 2008-09-19 21:23 freetext is essential 2008-09-19 21:23 ok, i'll go play with that 2008-09-19 21:24 thanks much 2008-09-19 21:24 have to run out before whole foods closes 2008-09-19 21:24 didnt know, i just started dicking around with it to remind myself what sysadmining was like in linux ;) 2008-09-19 21:24 or I don't get my sushi tonight 2008-09-19 21:24 go get sushi, that's a moral imperative 2008-09-19 21:24 it's a mess, isn't it. 
LXR install I mean 2008-09-19 21:24 it aint pretty, but than again i dont do normal...anything 2008-09-19 21:25 that's a good sign 2008-09-19 21:25 bbiaf 2008-09-19 21:26 lol 2008-09-19 21:43 Bushman: what kind of bandwidth do you have? 2008-09-19 21:43 ACTION puts head down in shame 2008-09-19 21:43 cable modem 2008-09-19 21:44 its good enough 2008-09-19 21:44 most requests are only a few 10's of kb probably 2008-09-19 21:44 i can 'pimp my apache' and turn on compression since it's text it should be alright 2008-09-19 21:44 by looking at the code cpu is gonna be the bottle neck first ;) 2008-09-19 21:45 yea the db design makes baby jesus cry 2008-09-19 21:45 so i gotta go through postgres configs first, which can take hours to do right, that shit is complicated 2008-09-19 21:46 i'm i can get a dual Dual Core Woodcrest 2008-09-19 21:46 for $120/mo 2008-09-19 21:46 i might do that 2008-09-19 21:47 since i'm finally more than breaking even with my current servers 2008-09-19 21:47 the database for a whole 2.6.26.5 tree is about 1.1gb, so i'll just shove the whole thing into memory 2008-09-19 21:47 how much ram does it have? 2008-09-19 21:47 braking even? what are you hosing? 2008-09-19 21:47 6gb ;) 2008-09-19 21:47 a few sites 2008-09-19 21:48 it's my Matlab cruncher ;) 2008-09-19 21:48 well a few paying sites 2008-09-19 21:48 persiankitty is back? :) 2008-09-19 21:48 haha 2008-09-19 21:48 then various freebees like zumastor.org 2008-09-19 21:49 Bushman: you installed lxrng right? 2008-09-19 21:49 thats the one running on lxr.linux.no 2008-09-19 21:49 lxrng? 
i just found lxr-devel 2008-09-19 21:49 i think that ones out of date 2008-09-19 21:49 it's like 0.9.5 2008-09-19 21:49 but if it works thats cool 2008-09-19 21:50 there are annoying bugs in the one running on lxr.linux.no 2008-09-19 21:50 like it will show the same result multiple times 2008-09-19 21:50 see http://lxr.linux.no/ at the bottom of the page 2008-09-19 21:50 step by step instructions for setting it up too ;) 2008-09-19 21:50 well i havent seen the web part of it yet, the first run through the code indexing just ended like 10 mins ago 2008-09-19 21:51 ah cool 2008-09-19 21:51 took a while 2008-09-19 21:51 too bad i dont have any phat hardware at home 2008-09-19 21:51 i have 15/15 fios 2008-09-19 21:51 should i ship you a box? :) 2008-09-19 21:52 how many watts ? ;) 2008-09-19 21:52 if i were to guess it probably sounds like a helicopter 2008-09-19 21:52 never knew you to have an quiet boxes 2008-09-19 21:52 ups says about 90 sitting idle with cpu frequency throttling, 140+ at full boogie 2008-09-19 21:53 so 100 on average 2008-09-19 21:53 it's a bulb 2008-09-19 21:53 so about $8/mon 2008-09-19 21:54 are you all green, or just being a cheapass? 2008-09-19 21:54 little from column a... 2008-09-19 21:54 dont have to answer, i know the answer ;) 2008-09-19 21:54 hm the place i used to live had free electricity and fios available 2008-09-19 21:55 should build a datacenter in it 2008-09-19 21:55 or a grow house ;) 2008-09-19 21:55 ok u turn 2008-09-19 21:55 is that a 'weeds' reference? 2008-09-19 21:55 get me a prius 2008-09-19 21:56 so i can sneak up on mofos real quiet 2008-09-19 21:56 'so i can sneak up on motherfuckers' 2008-09-19 21:56 hah 2008-09-19 21:56 that's a bit spooky 2008-09-19 21:56 annnyway 2008-09-19 21:57 bandwidth shouldnt be a problem 2008-09-19 21:57 if it is, ship it to me ;) 2008-09-19 21:57 cant we just set it up on one of your real servers? 
2008-09-19 21:57 sure 2008-09-19 21:57 but i dont want to kill the cpus 2008-09-19 21:57 we'll see how it goes on cable modem 2008-09-19 21:57 i can put it up on marcintology, but that's just a normal shared web account 2008-09-19 21:57 i can run dyndns for you if you dont have it already 2008-09-19 21:58 oh i got dns for it ;) 2008-09-19 21:58 lxr.tux3.org ? 2008-09-19 21:58 ooh could we pull from ddtree too? 2008-09-19 21:58 ok, let's do that than 2008-09-19 21:58 that would be slick! 2008-09-19 21:58 flipz: you like that idea? 2008-09-19 21:58 well first i wanna see if i can get it working nicely 2008-09-19 21:58 have the ddtree lxr'ed ? ;) 2008-09-19 21:58 flipz is doing sushi 2008-09-19 21:59 ah ok well i'm out for a bit too 2008-09-19 21:59 i sushi'ed m'self for lunch yesterday so i should be good for a day or two before i'm gonna start jonsing again 2008-09-19 21:59 good, i cant work with you making me laugh 2008-09-19 22:08 http://blogs.pcworld.com/staffblog/archives/007783.html 2008-09-19 22:08 take a look at the windows ad at the bottom... 2008-09-19 22:09 some enlightened folk at microsoft snuck in penguins... 2008-09-19 22:09 they're everywhere 2008-09-19 22:12 if needs be I can probably stick lxr on an athlon64 at my univ in poland 2008-09-19 22:12 or, maybe I could host a second copy from home off of comcast 2008-09-19 22:13 ddtree.tux3.org 2008-09-19 22:14 no functionen 2008-09-19 22:14 Host ddtree.tux3.org not found: 3(NXDOMAIN) 2008-09-19 22:15 it was just a suggestion 2008-09-19 22:15 we can make it happen any time 2008-09-19 22:15 haha 2008-09-19 22:16 nice redmond penquins 2008-09-19 22:16 are they that clueless or are there subversives inside m$? 2008-09-19 22:19 I'm guessing subversion from inside 2008-09-19 22:19 but it looks like windows out onto the future to me, penguins everywhere 2008-09-19 22:22 ugh, reading a notebook review and someone is claiming gigabit wired is overkill... 2008-09-19 22:22 what the hell are they drinking? 
2008-09-19 22:22 or smoking 2008-09-19 22:23 "The thing that gets us out of bed every day is the prospect of creating pathways above, around and through walls." msft marketdroid 2008-09-19 22:23 sheesh 2008-09-19 22:23 I'm glad I'm not him 2008-09-19 22:23 or them 2008-09-19 22:23 what gets me out of bed is sheer effort of will 2008-09-19 22:23 and the prospect of some french roast 2008-09-19 22:25 heh, for me it's usually the buzzer and a sense of duty 2008-09-19 22:25 when the family gets back tomorrow it will be a four year old jumping on me 2008-09-19 22:26 "time to get up and play daddy" 2008-09-19 22:27 "An approach dedicated to engineering the absence of anything that might stand in the way" -- my gawd who came up with that one, ballmer? 2008-09-19 22:27 ugh 2008-09-19 22:27 engineering the absence - new msft slogan 2008-09-19 22:31 i'm all for engineering the absence 2008-09-19 22:31 of msft 2008-09-19 22:32 is the stupid question hour still on? 2008-09-19 22:32 yep 2008-09-19 22:33 vm.swappiness, wanna explain it to me, what's it really do, what rules of thumb i wanna use to determine it, etc 2008-09-19 22:33 oh, one of those 2008-09-19 22:34 it's an andrewism 2008-09-19 22:34 did i pick a touchy topic? :) 2008-09-19 22:34 take a vm that just plain doesn't swap very well and bolt knobs on it 2008-09-19 22:34 akpm is a friend 2008-09-19 22:34 but the linux vm has been dire for a few years 2008-09-19 22:34 `http://kerneltrap.org/node/3000 2008-09-19 22:34 swappiness is one of the attempted bandaids 2008-09-19 22:36 what's wrong with knobs? better to have them than to not have them, isnt it? 
2008-09-19 22:36 better to have it work 2008-09-19 22:36 than to give up and offer a knob which also doesn't work 2008-09-19 22:37 like those old televisions 2008-09-19 22:37 you had a color control that ranged from "very green" to "very red" with "very blue" in between 2008-09-19 22:38 anyway 2008-09-19 22:38 I'm taking a sabbatical from vm 2008-09-19 22:38 so I am licensed to throw turds 2008-09-19 22:42 Bushman: the vm operates in a delicate balance right now with knobs pulling in 4 dimensions 2008-09-19 22:42 this breaks more than you think 2008-09-19 22:43 i've seen memory recursion deadlocks with AoE and ddsnap 2008-09-19 22:43 as well as other wacky behavior like: 2008-09-19 22:43 http://pengaru.com/~swivel/pop_comparisons/04-26-2006/ 2008-09-19 22:44 I don't even want to think about the vm 2008-09-19 22:44 let's stick with fixing fs and bdev 2008-09-19 22:44 Vito sounds like he could use to do some PCA to reduce the dimensionality in his quest for performance 2008-09-19 22:46 its not a lot to ask of a modern server to be able to parse silly protocols like pop at wire speed 2008-09-19 22:46 and cache some stuff 2008-09-19 22:47 thats a case of the os clearly pissing in your cheerios 2008-09-19 22:48 evicting >2GB of buffer cache at a time due to brain damage in the vm 2008-09-19 22:49 shapor: you're spoiled, you need to work with windows for a while ;) 2008-09-19 22:49 no thanks 2008-09-19 22:49 once i reinstalled a SCSI driver and all my fonts went bold 2008-09-19 22:49 wanna explain that one 2008-09-19 22:51 mmm the nigiri was fine, now time to look into the sake issue 2008-09-19 22:51 lol 2008-09-19 22:52 bushman, use after free? 2008-09-19 22:52 in the registry? 2008-09-19 22:53 that was a long time ago, i dont remember. i saw that and i thought i had some bad sushi or something ;) 2008-09-19 22:55 i'm still working on the freetext indexing, do we need just straight text/html parsing, or do we want more like PDF or .doc? 
2008-09-19 22:55 there's no such thing as bad sushi 2008-09-19 22:55 careful there ;) 2008-09-19 22:55 MaZe: let it sit out in the sun for few hours, then you'll experience bad sushi 2008-09-19 22:56 bushman, I didn't know lxr had that knob 2008-09-19 22:56 but straight text, yes 2008-09-19 22:56 it's not lxr, it's swish, the text indexer 2008-09-19 22:56 right, so lxr doesn't recommend a config for swish? 2008-09-19 22:56 I thought they did 2008-09-19 22:57 kinda, but if i'm doing it by hand, might as well pimp it out a bit 2008-09-19 22:57 anyway, text 2008-09-19 22:58 plus Shap tells me that i grabbed a slightly different code than what the norwiegian lxr site is running 2008-09-19 22:58 remember when it was glimpse? 2008-09-19 22:58 ıʞsʍoʞʎzɔuǝż ˙ɐ ɾǝıɔɐɯ - who can read this? 2008-09-19 22:58 a kinda sort open indexer 2008-09-19 22:58 university project 2008-09-19 22:58 the lxr manual says glimpse doesnt support all the functions it wants, so i went with swish 2008-09-19 22:58 they tried to close it up and make some dough, nobody used it, finally somebody wrote swishe and nobody remembers glimpse 2008-09-19 22:59 i've used this russian indexer/search engine called mnogosearch before, if i'm really bored i might see if i can use that here 2008-09-19 23:00 one of the things we plan to get happening with tux3 is proper incremental indexintg 2008-09-19 23:04 why are we leaking the sb inode map in inode.c test?
2008-09-19 23:04 ==31367== 8,160 (8,040 direct, 120 indirect) bytes in 1 blocks are definitely lost in loss record 4 of 7 2008-09-19 23:04 ==31367== at 0x4A1B858: malloc (vg_replace_malloc.c:149) 2008-09-19 23:04 ==31367== by 0x401E44: new_map (buffer.c:452) 2008-09-19 23:04 ==31367== by 0x40A088: new_inode (inode.c:128) 2008-09-19 23:04 ==31367== by 0x40B53C: make_tux3 (inode.c:493) 2008-09-19 23:04 ==31367== by 0x40BA21: main (inode.c:554) 2008-09-19 23:04 because it's broken ;-) 2008-09-19 23:04 shapor, we want to know 2008-09-19 23:05 does seem like the test is broken 2008-09-19 23:05 doesn't* 2008-09-19 23:06 want hints to track it down? 2008-09-19 23:06 put exit(1) somewhere and see if you put it before or after the leak 2008-09-19 23:06 smrt 2008-09-19 23:06 question: 2008-09-19 23:06 if a page-fault gets triggered 2008-09-19 23:07 due to the page being not present or write to read-only page 2008-09-19 23:07 from user space 2008-09-19 23:07 what context do you end up in the kernel? is that considered process context or interrupt context? 2008-09-19 23:07 process 2008-09-19 23:07 but not process 2008-09-19 23:07 it's non-interrupt kernel 2008-09-19 23:08 lol, so what can/can you not do - how does it differ from process and interrupt? 2008-09-19 23:08 can you sleep? 2008-09-19 23:08 any doc pointers? 2008-09-19 23:08 yes, you have to in order to read in the page 2008-09-19 23:08 right - fair point 2008-09-19 23:09 and of course another thread can trigger the same page fault again before you read it in, so you need to lock appropriately 2008-09-19 23:09 or that might happen automagically 2008-09-19 23:09 http://lxr.linux.no/linux+v2.6.26.5/arch/x86/mm/fault.c 2008-09-19 23:09 so how many different types of contexts do we have? 2008-09-19 23:10 process? kthread? fault handler? interrupt? anything else? 2008-09-19 23:10 wtf if i put an exit(1) right before the return 0; in main it doesn't detect a leak 2008-09-19 23:10 try exit(0)? 
2008-09-19 23:10 does it always leak the same way? 2008-09-19 23:10 maybe instead of exit(1) do goto return 0 at end of main? 2008-09-19 23:11 same 2008-09-19 23:11 cant you printf a bunch of usual suspects? 2008-09-19 23:11 exit(0) also no error 2008-09-19 23:11 maze, http://lxr.linux.no/linux+v2.6.26.5/Documentation/exception.txt#L266 2008-09-19 23:11 maybe it bypasses the check code? 2008-09-19 23:11 i'll try moving the return up instead 2008-09-19 23:12 right that's the dealing with exceptions in kernel space 2008-09-19 23:12 shapor, right, valgrind does that 2008-09-19 23:12 and using them to detect unauthorized reads, etc 2008-09-19 23:12 maze, process, kthread and fault handler are all the same 2008-09-19 23:12 shap: you got ida pro handy? 2008-09-19 23:13 oh really? so basically there's just 2: process/kthread/fault vs interrupt 2008-09-19 23:13 the only real difference with process is it has a user address space to work with 2008-09-19 23:13 mm 2008-09-19 23:13 that's actually just a flag bit 2008-09-19 23:13 and the doc you referred to is about faults triggered from the kernel 2008-09-19 23:13 like having a file table 2008-09-19 23:13 ah 2008-09-19 23:14 right, but since we don't have any pointers passed in as parameters from userspace that doesn't really much matter 2008-09-19 23:14 maze, the doc tells you about do_page_fault 2008-09-19 23:15 that it gets called? 2008-09-19 23:15 I know that ;-) 2008-09-19 23:15 but do_page_fault does special stuff if the page fault got triggered with eip in kernel space 2008-09-19 23:16 ah i see whats going on 2008-09-19 23:16 although I guess we can trigger page fault-ins for user pages from kernel space as well 2008-09-19 23:17 with the leak 2008-09-19 23:17 so the two cases shouldn't really differ 2008-09-19 23:17 how would that change anythign? if you need to get a page, gotta fetch it, regardless where EIP is pointing at, isnt it? 
2008-09-19 23:18 we don't support faults in kernel space 2008-09-19 23:18 you don't want to trigger sigsegv from kernel space 2008-09-19 23:18 only object is to oops properly 2008-09-19 23:18 used to panic on that 2008-09-19 23:18 you return EFAULT instead 2008-09-19 23:18 ok so how do you prevent that? 2008-09-19 23:19 wait, so what happens if userspace syscall writes and the buffer I passed in is swapped out? 2008-09-19 23:19 I'd always assumed the page-ins would happen via fault, do we actually map the memory in in some other way? 2008-09-19 23:23 maze, ok it took me a while to remember 2008-09-19 23:23 when a fault occurs, unlike an interrupt it isn't interrupting some random process 2008-09-19 23:23 right 2008-09-19 23:23 it faults in the process that needs to work, so the fault handler uses that context 2008-09-19 23:24 it just has to do a little register fiddling and play with the intruction pointer 2008-09-19 23:24 in some cases, parsing the instruction stream 2008-09-19 23:24 dimly remembering this from the last time I did it, many years ago 2008-09-19 23:24 fault semantics on x86 are utter crap 2008-09-19 23:24 so basically a fault triggered from userspace is almost like a syscall 2008-09-19 23:24 yes 2008-09-19 23:25 besides the fact it can happen anywhere, and needs some special asm on entry/exit to deal with the weird semantics 2008-09-19 23:25 http://lxr.linux.no/linux+v2.6.26.5/arch/sh/kernel/cpu/sh5/entry.S#L1134 2008-09-19 23:25 for example 2008-09-19 23:25 how about a fault triggered on access from within the kernel? 2008-09-19 23:26 oops if you're lucky 2008-09-19 23:26 panic if not 2008-09-19 23:26 and what if its triggered from irq context? 2008-09-19 23:26 death 2008-09-19 23:26 try it ;) 2008-09-19 23:26 it's easy 2008-09-19 23:26 so how does the kernel guarantee the userspace memory accessed by syscalls is present? 
2008-09-19 23:27 it goes delving into page table entries and things 2008-09-19 23:27 do we grab and release locks on the userspace memory before and after the copy (and if not present call the pagein handlers manually?) 2008-09-19 23:27 well 2008-09-19 23:27 it doesn't need to be present 2008-09-19 23:27 because it can fault 2008-09-19 23:27 huh? 2008-09-19 23:27 well 2008-09-19 23:27 sorry 2008-09-19 23:27 not in kernel ;) 2008-09-19 23:27 double huh 2008-09-19 23:27 it does the fault by hand 2008-09-19 23:27 see get_user_pages 2008-09-19 23:28 right, so like I said above at 11:27:10 2008-09-19 23:28 don't have timestamps on 2008-09-19 23:28 (11:27:10 PM) MaZe: do we grab and release locks on the userspace memory before and after the copy (and if not present call the pagein handlers manually?) 2008-09-19 23:29 we don't grab locks on user memory 2008-09-19 23:29 not sure what the question is 2008-09-19 23:29 not on struct page *? 2008-09-19 23:29 no 2008-09-19 23:29 not for that 2008-09-19 23:29 we take ref counts 2008-09-19 23:29 ain't that the same thing? 2008-09-19 23:29 nonzero ref count holds a page in memory 2008-09-19 23:29 no 2008-09-19 23:30 but it prevents the page from disappearing from under us? right? so it's like a lock - except others can access it as well, right 2008-09-19 23:30 see why lock is an inappropriate name here 2008-09-19 23:30 I meant lock in the sense of lock into ram 2008-09-19 23:30 it's not at all like a lock 2008-09-19 23:30 it's a refcount 2008-09-19 23:31 a lock is a serializer, and recount is a don't kill me 2008-09-19 23:31 http://lxr.linux.no/linux+v2.6.26.5/mm/memory.c#L962 <- follow_page, doing the job of mm hardware by hand 2008-09-19 23:32 does refcount = 0 immediately result in memory being destroyed? 2008-09-19 23:32 yes 2008-09-19 23:32 or will it only get evicted if need be at that point? 2008-09-19 23:32 now... 
depends whether it's anon or not 2008-09-19 23:33 anon has to be swept up, page cache is immediately freed at that point 2008-09-19 23:33 don't quote me, I used to hack that stuff ;) 2008-09-19 23:33 but it's been a while 2008-09-19 23:33 so normally a processes mapping holds a refcount on it's memory? 2008-09-19 23:34 but that doesn't work 2008-09-19 23:34 check __free_page 2008-09-19 23:34 there have to be 2 layers here 2008-09-19 23:34 one virtual - what a process needs, one physical - what's in memory 2008-09-19 23:34 that puts the page back on the buddy as soon as count hits zero I believe 2008-09-19 23:34 probaby vma vs page 2008-09-19 23:34 so... treatment of inodes by the vfs is rather different 2008-09-19 23:35 there is one refcount on a page for each pointer to it, basically 2008-09-19 23:35 including one for the lru, I tried to implement, andrew never took the patch 2008-09-19 23:35 but definitely one for the page cache 2008-09-19 23:35 and one for each page table entry pointing at it 2008-09-19 23:35 there aren't really two layers in linux 2008-09-19 23:35 unlike freebsd 2008-09-19 23:36 it's all one layer 2008-09-19 23:36 -!- amey(~amey@116.73.35.180) has joined #tux3 2008-09-19 23:36 vma just specifies some access rights to memory regions 2008-09-19 23:37 hmm 2008-09-19 23:37 you have lots of time to get that sorted 2008-09-19 23:37 has no real impact on fs development 2008-09-19 23:39 am I right in assuming that during kernel startup 2008-09-19 23:39 a fragment of physical memory is reserved for a big honking array 2008-09-19 23:39 of struct pages - one per page of physical memory in the system? including high-mem and all that? 
2008-09-19 23:40 possibly with some hacks for discontig mem 2008-09-19 23:40 yes 2008-09-19 23:40 it's rather crude 2008-09-19 23:40 and struct page as struct address_space * mapping in it 2008-09-19 23:40 s/as/has/ 2008-09-19 23:41 which appears to be inode related 2008-09-19 23:41 very much so 2008-09-19 23:41 we make every page header have that field even if its anon where the field has no use 2008-09-19 23:42 how big is a struct page - around 50 bytes? 2008-09-19 23:42 address_space, mapping, and page cache are different names for the same thing by the way 2008-09-19 23:42 stupidly sloppy terminology 2008-09-19 23:42 less I think 2008-09-19 23:43 it's been heavily sqzd 2008-09-19 23:43 maybe 50 on 64 bit 2008-09-19 23:43 so we basically throw 1% of memory out for accounting purposes. 2008-09-19 23:43 go into junkfs and printf(... sizeof(struct page)) 2008-09-19 23:43 much more than that 2008-09-19 23:44 and that's not even including the cpu pagetables 2008-09-19 23:44 dentry and inode cache are really extravagant 2008-09-19 23:44 it's not lean and mean 2008-09-19 23:44 only compared to even worse kernels 2008-09-19 23:45 56 2008-09-19 23:45 64 bit, right? 2008-09-19 23:45 yes 2008-09-19 23:46 multiply 56 times 1 TB / 4096 2008-09-19 23:46 kind of wicked how you can use junkfs as a code injector 2008-09-19 23:46 that's the point 2008-09-19 23:46 side door into the kernel for dodgy people 2008-09-19 23:46 14 GB 2008-09-19 23:46 so... imagine the suck when we scan that 2008-09-19 23:47 the sound of sucking is the only thing you hear from that computer 2008-09-19 23:47 scan it for what? 2008-09-19 23:47 anything 2008-09-19 23:47 freeable memory 2008-09-19 23:47 why would we want to scan it? 
2008-09-19 23:47 oh 2008-09-19 23:47 so there's no heap structure of memory or anything like that 2008-09-19 23:47 nope 2008-09-19 23:47 it's the crudest imaginable system 2008-09-19 23:47 oh, that is indeed quite a vacuum 2008-09-19 23:48 while 1 TB is still rare 2008-09-19 23:48 it was only recently that linus allow vma to be a tree isntead of a linear list 2008-09-19 23:48 32-128G is perfectly reasonable nowadays 2008-09-19 23:49 and at that point we have almos 1-2G of struct page's 2008-09-19 23:49 1 tb is right around the corner 2008-09-19 23:49 argh 2008-09-19 23:49 -!- amey(~amey@116.73.35.180) has left #tux3 2008-09-19 23:49 so basically another part that wasn't designed 2008-09-19 23:49 indeed 2008-09-19 23:50 that comment applies to almost all the parts 2008-09-19 23:50 s/almost// 2008-09-19 23:50 do other kernels get this even worse? 2008-09-19 23:50 yes 2008-09-19 23:50 unbelievable? 2008-09-19 23:50 yes 2008-09-19 23:50 but true 2008-09-19 23:50 possible exception of, oh, qnx 2008-09-19 23:51 I'd always assumed the kernel to be this awesome C/assembler layer of wicked algos and data structures 2008-09-19 23:51 haha 2008-09-19 23:51 welcome, you're a kernel hacker now 2008-09-19 23:51 evrything optimized and tuned to hell and back 2008-09-19 23:51 to hell, not back 2008-09-19 23:51 lol 2008-09-19 23:52 some bits are ok 2008-09-19 23:52 yeah 2008-09-19 23:52 some bits are pretty damm amazing 2008-09-19 23:52 and I'm assuming it is in general getting better over time 2008-09-19 23:52 but most bits are just plain crap 2008-09-19 23:52 hard to say 2008-09-19 23:52 it's getting bigger 2008-09-19 23:53 yes, I've noticed 2008-09-19 23:53 I'm not sure its getting faster, seems to be regressing a little 2008-09-19 23:53 but I've assumed that hasn't been core functionality 2008-09-19 23:53 more just new drivers 2008-09-19 23:53 new filesystems 2008-09-19 23:53 etc 2008-09-19 23:53 also core 2008-09-19 23:53 all the big iron stuff 2008-09-19 23:53 from sgi and ibm 
2008-09-19 23:54 hmm 2008-09-19 23:54 buffer.c and filemap.c get longer and longer 2008-09-19 23:54 so 2.7 happens when we get rid of bh? 2008-09-19 23:54 things like mpage.c appear 2008-09-19 23:54 linus said the variable sized page patch would be enough to open 2.7 2008-09-19 23:54 that was a while ago 2008-09-19 23:54 thing is, I'm not sure I would want 2.7 to happen 2008-09-19 23:54 2.5 was an utter mess 2008-09-19 23:55 don't know how much he treats an email as a promise ;) 2008-09-19 23:55 and 2.6. up to 2.6.7 or so was junk 2008-09-19 23:55 going through that again would be painful 2008-09-19 23:55 2.6 is kinda starting to stink 2008-09-19 23:55 it was fresh and new once 2008-09-19 23:56 well 2008-09-19 23:56 it's different 2008-09-19 23:56 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-09-19 23:56 in 2.3/4/5 we had a desparate situation 2008-09-19 23:56 nobody had a kernel that worked properly 2008-09-19 23:56 otoh, it does seem like there's a lot of deep in the core stuff that should be changed 2008-09-19 23:56 complaints about paging artifacts every day on lkml 2008-09-19 23:57 for years 2008-09-19 23:57 that was bad 2008-09-19 23:57 it just didn't work 2008-09-19 23:57 it's better now, where 2.6 may suck but it works 2008-09-19 23:57 that's a good base to step out and do some housecleaning 2008-09-19 23:57 how much of this is sucky code, or bad algos, and how much is it just being np-complete or unsolvable problems 2008-09-19 23:58 most of it is just sucky code 2008-09-19 23:58 nearly all 2008-09-19 23:58 we know about impossibility 2008-09-19 23:58 don't count that 2008-09-19 23:59 also, we have much better processes for bug tracking and nailing regressions 2008-09-19 23:59 back in the day it was just linus and some text mode mailer 2008-09-20 00:00 not pine, something like that 2008-09-20 00:00 does the fact linux is x-platform make it harder to write good code, or is that actually a benefit? 
2008-09-20 00:00 benefit, but not for the reason you'd think 2008-09-20 00:00 the arch maintainers have autonomy and don't have to listen to linus about what they check in 2008-09-20 00:01 that's a benefit? 2008-09-20 00:01 only thing arch can't touch is core, and there are numerous bypasses 2008-09-20 00:01 it is 2008-09-20 00:01 they can drop some stupidities 2008-09-20 00:01 like they can have kdb in their arch if they want 2008-09-20 00:01 ah 2008-09-20 00:01 doesn't that fragment the kernel though? 2008-09-20 00:02 yes 2008-09-20 00:02 enormous amount of cut and paste sloth across arches 2008-09-20 00:02 if code which is/should be x-platform ends up being arch 2008-09-20 00:02 that's why some interfaces change very slowly 2008-09-20 00:03 somehow I'd assumed arch-code was mostly asm and declarations of macros/support functions for the rest 2008-09-20 00:03 you'd think 2008-09-20 00:03 but now, huge parts of the vm are per-arch 2008-09-20 00:03 things like do_fault 2008-09-20 00:03 which hardly vary 2008-09-20 00:03 but each arch has its own 2008-09-20 00:05 right 2008-09-20 00:06 all arches are forced to follow the x86 page table model by the way, even if their paging does not work that way 2008-09-20 00:07 sometime you need to meet bill irwin 2008-09-20 00:07 far more ascerbic than me 2008-09-20 00:07 kill_litter_super - where does the name come from? 2008-09-20 00:07 litterbug 2008-09-20 00:08 crappiest name ever, almost 2008-09-20 00:09 why? 2008-09-20 00:10 what does a giant cockroach have to do with fs? 2008-09-20 00:10 doesn't say anything? 2008-09-20 00:10 oh 2008-09-20 00:10 litter... 
leave things lying around 2008-09-20 00:11 oh, maybe, ok 2008-09-20 00:11 as in trash 2008-09-20 00:11 still seems pointless 2008-09-20 00:11 true 2008-09-20 00:14 I think the litter refers to all dentries for the fs have to stay in cache 2008-09-20 00:14 because it has no backing store 2008-09-20 00:14 thus litter memory 2008-09-20 00:14 you'll have to ask viro to know for sure 2008-09-20 00:14 by any measure, one of the worst names ever 2008-09-20 00:15 but shouldn't those no longer be used at the point we unmount? 2008-09-20 00:23 ok, for an unbacked fs like ramfs, the dentries/inodes have to be prevented from disappearing 2008-09-20 00:23 because if they do there is not way to get them back 2008-09-20 00:23 a block backed fs can have them first flushed then evicted 2008-09-20 00:23 so dentry counts can drop to zero 2008-09-20 00:24 unback fs, they have to be forced to zero 2008-09-20 00:24 see d_genocide 2008-09-20 00:24 unbelievably crappy naming 2008-09-20 00:24 and nonexistent documentation 2008-09-20 00:24 take something simple and make it mysterious, it's hacker's job security 2008-09-20 00:28 :-) 2008-09-20 00:30 free_super_unpin would be more informative 2008-09-20 00:30 or 2008-09-20 00:30 unpin_super 2008-09-20 00:33 maze, what's the next move on junkfs? 
2008-09-20 00:34 going to be mucking around with the vm or implementing the root directory 2008-09-20 00:34 not sure which, maybe both 2008-09-20 00:35 if you just go with fs/tux3 which has your code in it you get the root directory for free 2008-09-20 00:36 vm doesn't really have much to do with fs 2008-09-20 00:36 right, but I don't learn anything ;-) 2008-09-20 00:36 you just use the interfaces, like cache_alloc etc 2008-09-20 00:36 learn about vma then if you must hurt yourself 2008-09-20 00:36 err 2008-09-20 00:36 dma 2008-09-20 00:37 though vma would not be a bad second chocie 2008-09-20 00:37 choice 2008-09-20 00:37 for hurt 2008-09-20 00:37 heh 2008-09-20 00:37 actually, learning about memory barriers would be useful 2008-09-20 00:37 even to fs work 2008-09-20 00:37 I really want to understand what all the generic implementations do 2008-09-20 00:37 and when and why we would want to not use them 2008-09-20 00:38 sure 2008-09-20 00:38 that's the block IO library 2008-09-20 00:38 try using it as an alternative to what we just did 2008-09-20 00:38 read your superblock via ->readpage 2008-09-20 00:38 hmm? you mean bread and brelse? 2008-09-20 00:38 also doing it sb_bread would be useful 2008-09-20 00:38 easier 2008-09-20 00:38 cruftier 2008-09-20 00:39 create some dentries 2008-09-20 00:39 and inodes 2008-09-20 00:39 link them up properly 2008-09-20 00:39 by hand 2008-09-20 00:39 right 2008-09-20 00:40 well, I'm out all day tomorrow and have work stuff for sunday (yeah, suxorz) so I won't get much done this weekend (end of quarter and all that) 2008-09-20 00:40 read path_walk 2008-09-20 00:40 that will keep you busy 2008-09-20 01:02 first checkin in two days 2008-09-20 01:02 haven't had a two day gap since the start 2008-09-20 01:02 maze's fault for getting me interested in bio hackery 2008-09-20 01:02 lol 2008-09-20 01:02 working on the kernel port I see 2008-09-20 01:02 was 2008-09-20 01:02 maze is going to next 2008-09-20 01:02 how bout you? 
2008-09-20 01:02 fun n easy 2008-09-20 01:03 oh still trying to fix out nfs 2008-09-20 01:03 out=our 2008-09-20 01:03 you work 24/7? 2008-09-20 01:03 well, a lot at times 2008-09-20 01:03 isn't child labor illegal? 2008-09-20 01:03 or something like that 2008-09-20 01:03 mostly struggling with various silly road blocks at this time, but I'm happy if it makes some kind of progress 2008-09-20 01:04 send over some of your friends then 2008-09-20 01:04 novell guys can't all be working 16 hours 2008-09-20 01:04 well, I've been watching TV tonight for a good chunk of it 2008-09-20 01:04 but I do work a lot at times 2008-09-20 01:04 need I point out that novell and redhat are both 100% mia for tux3? 2008-09-20 01:04 who is going to win the lamer race? 2008-09-20 01:04 sometimes more focused at various points than others, but I do schedule breaks for myself 2008-09-20 01:05 working on a file system is non-trival 2008-09-20 01:09 most folks in our group are focused on kvm which is a full time deal 2008-09-20 01:09 you know how it works 2008-09-20 01:10 and there's plenty of scheduler work that needs to be done in the general community 2008-09-20 01:10 plenty of design level stuff that needs to be done as well which is also non-trivial 2008-09-20 01:10 hard to find time 2008-09-20 01:11 it tux3 was more complete you might see more effort from various folks at RH or Novell to bug fix, but I think that a lot of folks in the Linux community aren't really interested in solving those probelms since they are happy with ext3 2008-09-20 01:12 but surely somebody in novell 2008-09-20 01:12 can read the code and understand 2008-09-20 01:12 in the entire time I've been there I've yet to talk to a file systems person 2008-09-20 01:12 true 2008-09-20 01:12 at novell 2008-09-20 01:12 can't think of one myself 2008-09-20 01:13 at red hat there's only stephen tweedie, then the gfs2 wan^developers 2008-09-20 01:13 I'm about the closest since I use to work on WAFL which has a completely 
different set of APIs 2008-09-20 01:13 I should ping sct 2008-09-20 01:13 come to think of it 2008-09-20 01:13 so you folks are on your own for the most part 2008-09-20 01:13 still 2008-09-20 01:13 which is a good and bad thing 2008-09-20 01:13 fs hackers can be made 2008-09-20 01:13 good ;-) 2008-09-20 01:13 it takes a weekend or too 2008-09-20 01:13 two 2008-09-20 01:13 it's a lot harder than that unless you're working on a toy system 2008-09-20 01:13 :) 2008-09-20 01:13 more then that 2008-09-20 01:14 not really 2008-09-20 01:14 hacking ability is more important that fs knowledge 2008-09-20 01:14 logical thinking 2008-09-20 01:14 bug seeing 2008-09-20 01:14 well yeah, but that's true for everything ;-) 2008-09-20 01:14 in hacking,yes 2008-09-20 01:14 well, you know the code really well and have had a lot of experience with file systems, I have some notion of how an enterprise file system should function, but really blank on the implementation details 2008-09-20 01:15 I'd much rather teach a talented python hack to code kernel than an average kernel hack to... do anything 2008-09-20 01:15 like where to go once you've done the b-tree things, atomic logging, etc... 
2008-09-20 01:15 I understand 2008-09-20 01:15 it's going to be: extents; atomic commit; kernel 2008-09-20 01:15 with kernel in parallel 2008-09-20 01:15 maze is on it 2008-09-20 01:16 I mean the good thing about this project is the fact that is has some kind of leader, you 2008-09-20 01:16 can't think of a project that doesn't 2008-09-20 01:16 that allows you to guide folks to an implementation, even if I know the fs interfaces and stuff, it wouldn't complete the knowledge base you have with write allocation and stuff 2008-09-20 01:17 you've thought it out carefully and stored that in your head for retrival 2008-09-20 01:17 there's always a part for everybody to work on 2008-09-20 01:17 the fuse thing was sweet 2008-09-20 01:17 flipz: for folks to justify time into tux3, it's got to work on a basic level 2008-09-20 01:18 it does 2008-09-20 01:18 that means in-kernel, atomic commits, snapshots, off-line checker and all of the basic for it to function like a regular file system 2008-09-20 01:18 most of which are still in the future 2008-09-20 01:18 not really 2008-09-20 01:18 working in fuse is working "to a level" 2008-09-20 01:18 don't understand why you'd want an offline checker 2008-09-20 01:19 wouldn't online be better? 2008-09-20 01:19 gh, it doesn't 2008-09-20 01:19 nonsense 2008-09-20 01:19 every filesystem has gotten off the ground before all the stuff was done 2008-09-20 01:19 MaZe: basic checking, online checking is harder 2008-09-20 01:19 zfs still doesn't have a checker 2008-09-20 01:19 yeah, well, zfs is driven by a lot of hyper 2008-09-20 01:19 hype 2008-09-20 01:19 and tons of Sun marketing 2008-09-20 01:20 so there you go 2008-09-20 01:20 zfs is weird 2008-09-20 01:20 MaZe: tell me about it 2008-09-20 01:20 I'd say, yes 2008-09-20 01:20 not that what I'm planning isn't weird... 
2008-09-20 01:20 you're planning to reimplment hammer 2008-09-20 01:20 flipz: getting a strong functional implementation is key to getting a lot more developers 2008-09-20 01:20 why not just port hammer? 2008-09-20 01:20 nah 2008-09-20 01:20 bh, I don't want a lot more devs 2008-09-20 01:21 I want a couple more good ones 2008-09-20 01:21 redhat hacks can join the party late, what's new 2008-09-20 01:21 flipz: you have a lot of technical problems still, I can make some suggestions if I see something useful that I can add 2008-09-20 01:21 it's true that what you probably want is a half dozen good devs you can put in one room and have them bounce off of each other 2008-09-20 01:21 cyber room 2008-09-20 01:22 flipz: all of which is a good thing in that it still needs to be solved 2008-09-20 01:22 cyber is good, but real is better 2008-09-20 01:22 tons of intersting work still ahead and discoveries 2008-09-20 01:22 maze, halloween cabal party 2008-09-20 01:22 two locations in venice 2008-09-20 01:22 primary and overflow 2008-09-20 01:22 venice in europe? 2008-09-20 01:22 venice beach 2008-09-20 01:22 in socal 2008-09-20 01:23 oh, in santa monica 2008-09-20 01:23 a little south 2008-09-20 01:23 flipz: unfortunately, I can't help you get it off the ground, I might of help in the future 2008-09-20 01:23 skating distance 2008-09-20 01:23 btw venice, ca on gmaps finds venice, ab, ca 2008-09-20 01:23 hah 2008-09-20 01:23 fix it 2008-09-20 01:23 regarding things like concurrency as you run into scalability problems, etc... 
2008-09-20 01:23 354 minles 2008-09-20 01:23 miles 2008-09-20 01:24 just book a trip 2008-09-20 01:24 nah, I'm cheap 2008-09-20 01:24 you don't have to justify, it's a short hop 2008-09-20 01:24 I mean on goog's nickle 2008-09-20 01:24 won't fly 2008-09-20 01:25 santa monica peeps lose touch unless mtv peeps show up from time to time 2008-09-20 01:25 it's mercy travel 2008-09-20 01:25 lol 2008-09-20 01:25 well 2008-09-20 01:25 try this 2008-09-20 01:25 gotta take it private 2008-09-20 01:25 yeah, well I find it hard to even visit a datacenter and that's actually work related 2008-09-20 01:26 flipz: LA is wierd 2008-09-20 01:27 seems ok to me 2008-09-20 01:27 besides I like driving, and don't like flying... 2008-09-20 01:28 LA is weird, but so is SF 2008-09-20 01:28 SF is much better, it's much more diverse 2008-09-20 01:28 you can yourself into and out of trouble in SF 2008-09-20 01:29 I know most of the trouble spots 2008-09-20 01:29 plenty of experience with that 2008-09-20 01:30 only thing is 2008-09-20 01:30 the halloween cabal party is in venice, not sfo 2008-09-20 01:31 anyway it's warmer here 2008-09-20 01:31 better rollerskating 2008-09-20 01:31 more sand 2008-09-20 01:31 you know 2008-09-20 01:31 heh 2008-09-20 01:31 yeah, it's getting cool up north 2008-09-20 01:32 chicks wear less clothing and are on more hippy drugs 2008-09-20 01:32 flipz: a high performance parallelized implementation will be interesting because the Linux kernel itself isn't very scalable in the core fs core 2008-09-20 01:32 code 2008-09-20 01:32 bh, you said it 2008-09-20 01:32 namely posix file locking 2008-09-20 01:33 ugh, posix, 2008-09-20 01:33 I can help you there 2008-09-20 01:33 ugh, posix 2008-09-20 01:33 can't live without it 2008-09-20 01:33 no shit 2008-09-20 01:33 nobody uses it really 2008-09-20 01:33 can't live with it 2008-09-20 01:33 I never saw anybody use it 2008-09-20 01:33 ah 2008-09-20 01:34 so you have some ideas on range locks? 
2008-09-20 01:34 yes 2008-09-20 01:34 want to expound? 2008-09-20 01:34 tree? 2008-09-20 01:34 but of course 2008-09-20 01:34 or something smarter? 2008-09-20 01:34 now what kind... 2008-09-20 01:34 maybe it should be done on a per inode basis 2008-09-20 01:34 smarter than a tree? dwimlocks 2008-09-20 01:34 it's in memory 2008-09-20 01:34 so something optimized for in-mem 2008-09-20 01:35 so probably not b-tree 2008-09-20 01:35 use an hball structure 2008-09-20 01:35 a magic 8ball that would be coo 2008-09-20 01:35 l 2008-09-20 01:35 ;-) 2008-09-20 01:35 yah 2008-09-20 01:35 geodesic pointers 2008-09-20 01:36 I think CLR has a red-black tree generalize to interval tree exercise 2008-09-20 01:36 If I remember correctly the generlizement works correctly ... 2008-09-20 01:36 blah 2008-09-20 01:36 no need 2008-09-20 01:36 actually 2008-09-20 01:36 flipz: one of the only places that I've seen lock contention activity is one of the inode locks 2008-09-20 01:36 no need 2008-09-20 01:36 no you do need it 2008-09-20 01:36 bh, which one? 2008-09-20 01:36 I think it's lock during directory traversal or something like that 2008-09-20 01:36 since you can have multi-read-locks, but only one write -lock 2008-09-20 01:37 sure it's not rename? 
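Editorial sketch of the per-inode range lock idea raised above: many overlapping readers are allowed, but a writer excludes everything it overlaps. This is only an illustration under stated assumptions. All names (`range_lockset`, `range_trylock`, and so on) are hypothetical and are not tux3 or kernel interfaces, and a plain linked list stands in for the red-black/interval tree mentioned in the chat, so lookups here are linear rather than logarithmic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical names throughout; none of this is tux3 or kernel API. */
enum range_mode { RANGE_READ, RANGE_WRITE };

struct range_lock {
	uint64_t start, end;        /* inclusive byte range [start, end] */
	enum range_mode mode;
	struct range_lock *next;
};

/* One lock set per inode, as suggested in the discussion. */
struct range_lockset {
	struct range_lock *head;
};

static bool ranges_overlap(uint64_t s1, uint64_t e1, uint64_t s2, uint64_t e2)
{
	return s1 <= e2 && s2 <= e1;
}

/* Two locks conflict when they overlap and at least one is a writer. */
static bool range_conflicts(const struct range_lock *held,
			    uint64_t start, uint64_t end, enum range_mode mode)
{
	if (!ranges_overlap(held->start, held->end, start, end))
		return false;
	return held->mode == RANGE_WRITE || mode == RANGE_WRITE;
}

/* Try to take a lock without blocking; returns NULL on conflict or OOM. */
struct range_lock *range_trylock(struct range_lockset *set,
				 uint64_t start, uint64_t end, enum range_mode mode)
{
	assert(start <= end);
	/* Linear scan: an interval tree would replace this loop. */
	for (struct range_lock *l = set->head; l; l = l->next)
		if (range_conflicts(l, start, end, mode))
			return NULL;

	struct range_lock *lock = malloc(sizeof *lock);
	if (!lock)
		return NULL;
	*lock = (struct range_lock){ start, end, mode, set->head };
	set->head = lock;
	return lock;
}

/* Drop a lock previously returned by range_trylock(). */
void range_unlock(struct range_lockset *set, struct range_lock *lock)
{
	for (struct range_lock **p = &set->head; *p; p = &(*p)->next) {
		if (*p == lock) {
			*p = lock->next;
			free(lock);
			return;
		}
	}
}
```

A real implementation would also need to block or queue waiters and protect the set itself with a spinlock or similar; the sketch only shows the conflict rule.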
2008-09-20 01:37 I looked at the problem at there was many places in ext3 that uses this lock in a generic fashion, 2008-09-20 01:37 atomic rename 2008-09-20 01:37 flipz: can't remember 2008-09-20 01:37 the toughest thing to get right, that you absolutely must have 2008-09-20 01:37 it showed up in a "find" load so it's easy to reproduce 2008-09-20 01:37 I implemented the first revision of lockstat for this purpose 2008-09-20 01:38 low impact contention measurements in -rt 2008-09-20 01:38 peterz reimplemented this in lockdep 2008-09-20 01:38 I wonder if an fs could be done with something rcu like 2008-09-20 01:38 so this should definitely show up in trivially reproducable runs 2008-09-20 01:38 bh, let us know when you reproduce 2008-09-20 01:38 MaZe: fine grained locking maybe 2008-09-20 01:39 or per cpu-ification 2008-09-20 01:39 I use to have the runs for it 2008-09-20 01:39 maze, rcu is certainly applicable to fs 2008-09-20 01:39 maze, but its a walk/run situation 2008-09-20 01:39 that won't be easy to get right 2008-09-20 01:39 flipz: just compile in lockdep with stats tracking and cat /proc/lock_stats 2008-09-20 01:39 bh, just tell us ;) 2008-09-20 01:40 we'll do lockdep etc when we get there 2008-09-20 01:40 I have to compile a custom kernel or something like that to figure that out 2008-09-20 01:40 flipz: it'll still be there when you get it into the kernel :) 2008-09-20 01:40 no rush 2008-09-20 01:40 locking is really not the major issue in fs, seek is, and if you mess up, indexing 2008-09-20 01:40 and block allocation 2008-09-20 01:41 that is major 2008-09-20 01:41 blows away locking in actual impact 2008-09-20 01:41 well, what if you have a high performance IO situation ? wouldn't that eventually push things ? 
2008-09-20 01:41 probably would need many cpus for it to push 2008-09-20 01:41 if your allocation/seeking suck, which they probably do, you don't care 2008-09-20 01:41 and some sort of wicked raid array to run on 2008-09-20 01:42 what if you have contention against, say, atomic logging 2008-09-20 01:42 fix it when you get there 2008-09-20 01:42 MP atomic logging 2008-09-20 01:42 could be very tricky 2008-09-20 01:42 it's not the logging which is tricky 2008-09-20 01:42 probably isn't 2008-09-20 01:42 it's the mutates 2008-09-20 01:42 just use bio for everything 2008-09-20 01:43 but tux3 isn't there yet so we can't talk about the issues 2008-09-20 01:43 contention will never be in logging 2008-09-20 01:43 inherently async, inherently multi cpu 2008-09-20 01:43 it'll always be on metadata in-mem updates 2008-09-20 01:43 mostly at the dir and up level 2008-09-20 01:43 maze, interesting proposition 2008-09-20 01:43 well, that's still subject to atomic logging right ? 2008-09-20 01:43 yeah, but atomic logging is almost a no-op 2008-09-20 01:43 what if you have a very heavy metadata load ? 2008-09-20 01:44 it's short and sweet and very easy to shard 2008-09-20 01:44 I'd tend to agree 2008-09-20 01:44 the contention is in metadata updates, possibly in block allocation 2008-09-20 01:44 possilby 2008-09-20 01:44 not in the logging, since logging is pretty much just adding a new element to a queue 2008-09-20 01:44 these kind of systems could put a lot of pressure on a single tree data structure, etc... maybe you need to lock for online checking, etc... 
it's hard to say until you have a fairly complete basic implementation 2008-09-20 01:45 allowing parallel access to the allocation bitmap will be fun 2008-09-20 01:45 right 2008-09-20 01:45 sounds like a pain in the ass kind of problem :) 2008-09-20 01:45 bh, the nice thing about trees is they have subtrees 2008-09-20 01:45 it might be you may end up using a 'usually-works' type of algo 2008-09-20 01:45 flipz: yeah, but the allocation bitmap is a bitch 2008-09-20 01:45 with fallback on collision detection to locking 2008-09-20 01:45 bh, why? 2008-09-20 01:46 the bitch is finding the area to allocate 2008-09-20 01:46 it'll have to be protected in cases where you have heavy write operations, right ? 2008-09-20 01:46 one word: range lock 2008-09-20 01:46 which is inherently a read only op 2008-09-20 01:46 maze, right 2008-09-20 01:46 and the writes are again - quick 2008-09-20 01:46 well, those traversals can be long right ? 2008-09-20 01:46 yes, against cache 2008-09-20 01:46 so basically something rcu-like works 2008-09-20 01:46 spinlock zone 2008-09-20 01:47 although writing the algo with per-cpu 2008-09-20 01:47 when the long traversals get to be the problem, we're winning 2008-09-20 01:47 something like using a hash function with a cpunum parameter 2008-09-20 01:47 then we need a second order map to say where the high/low density areas are 2008-09-20 01:47 shorten the traversal, right 2008-09-20 01:47 rcu might work really well with the bitmap 2008-09-20 01:48 theoretically you could switch allocation strategies based on load and disk fullness 2008-09-20 01:48 simply because tux3 has the notion of logging allocations rather than directly entering them 2008-09-20 01:48 well 2008-09-20 01:48 hmm 2008-09-20 01:48 hah 2008-09-20 01:48 it also has the notion of keeping the cache blocks up to date 2008-09-20 01:48 right, you'd almost need rcu 2008-09-20 01:48 which conflicts with rcu 2008-09-20 01:48 um 2008-09-20 01:48 it doesn't have to be precise fortunately
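Editorial sketch of the "second order map" idea from the exchange above: a small per-group free counter lets the allocator skip dense regions without walking their bitmap bits, which shortens the traversal. This is a toy under stated assumptions, not tux3 code. The names `alloc_map` and `alloc_block` are hypothetical, groups are only 64 blocks wide to keep it short, and the `goal_group` argument is where a per-CPU hash (as floated in the chat) could plug in to spread allocators apart.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GROUP_BITS 64  /* blocks per summary group; tiny, for illustration */

struct alloc_map {
	uint64_t *groups;     /* one bitmap word per group, set bit = allocated */
	uint8_t *free_count;  /* second-order map: free blocks left per group */
	size_t ngroups;
};

/*
 * Allocate one block, starting the search at goal_group and wrapping.
 * Returns the block number, or -1 if no group has a free block.
 */
int64_t alloc_block(struct alloc_map *map, size_t goal_group)
{
	assert(goal_group < map->ngroups);
	for (size_t i = 0; i < map->ngroups; i++) {
		size_t g = (goal_group + i) % map->ngroups;

		if (!map->free_count[g])
			continue;  /* full group: skipped without touching its bitmap */

		for (unsigned bit = 0; bit < GROUP_BITS; bit++) {
			uint64_t mask = UINT64_C(1) << bit;
			if (!(map->groups[g] & mask)) {
				map->groups[g] |= mask;
				map->free_count[g]--;
				return (int64_t)(g * GROUP_BITS + bit);
			}
		}
	}
	return -1;
}
```

As noted in the chat, the summary does not have to be precise: a stale counter only costs a wasted scan of one group, because the bitmap itself remains the authority.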
2008-09-20 01:48 but 2008-09-20 01:48 don't need rcu 2008-09-20 01:48 why conflicts? 2008-09-20 01:49 can do as well as rcu without it 2008-09-20 01:49 rcu doesn't like changing things 2008-09-20 01:49 that's very slow 2008-09-20 01:49 I'm not even sure a bitmap is the right way to go 2008-09-20 01:49 true 2008-09-20 01:49 you may want something sparser 2008-09-20 01:49 read my writing on that? 2008-09-20 01:49 nope 2008-09-20 01:49 there's a post 2008-09-20 01:49 analyzed in detail 2008-09-20 01:49 thinking of something tree with bitmap leaf like 2008-09-20 01:49 bitmap has 25/2 advantage or something like that in some cases 2008-09-20 01:49 extent tree has just as much or more in others 2008-09-20 01:50 braindead simple solution 2008-09-20 01:50 indeed? 2008-09-20 01:50 use both? 2008-09-20 01:50 in regions where bitmap is better, use a bitmap, otherwise extents 2008-09-20 01:50 in regions where it's a tie you don't care which 2008-09-20 01:50 uh, so tree with bitmap leafs ;-) 2008-09-20 01:50 use whichever is already there 2008-09-20 01:50 simpler 2008-09-20 01:50 sparse bitmap? 
2008-09-20 01:50 bitmap stays like it is 2008-09-20 01:51 extent tree can have bitmap regions as leaves instead of extent blocks 2008-09-20 01:51 just a logical offset in the bitmap 2008-09-20 01:51 exactly what I was thinking 2008-09-20 01:51 instead of a pointer to a leaf block 2008-09-20 01:51 good 2008-09-20 01:51 so you have a tree which can have leaves - either describing the state, or pointing to the right fragment of a sparse bitmap 2008-09-20 01:52 yes 2008-09-20 01:52 use some left over bits in the extent tree for acceleration 2008-09-20 01:52 and you probably allocate space for it just like for any normal file 2008-09-20 01:52 of course 2008-09-20 01:52 in fact both are mapped into normal files 2008-09-20 01:52 possibly need to have a few blocks of reserve space to prevent weird cases 2008-09-20 01:53 I showed that's better than direct block pointers 2008-09-20 01:53 there is a weird case 2008-09-20 01:53 very weird 2008-09-20 01:53 when, while allocating space, the tree and bitmap are full and you need to split and allocate more blocks, etc 2008-09-20 01:53 the bitmap is sparse... so you go set a bit in it, that allocs a block and marks a bit in another block... which might be sparse...
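Editorial sketch of the hybrid free-space map described above: each leaf either states the allocation state of a whole run of blocks directly (extent style) or carries a logical offset into a bitmap where the per-block state lives. Everything here is an assumption made for illustration. The names `region`, `free_map` and `block_is_free` are hypothetical, a sorted array with binary search stands in for the tree descent, and the bitmap is a flat in-memory byte array rather than a sparse file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum region_kind { REGION_FREE, REGION_USED, REGION_BITMAP };

struct region {
	uint64_t start;          /* first block covered by this leaf */
	uint64_t count;          /* number of blocks covered */
	enum region_kind kind;
	uint64_t bitmap_offset;  /* REGION_BITMAP only: bit index of `start` */
};

struct free_map {
	const struct region *regions;  /* sorted by start, non-overlapping */
	size_t nregions;
	const uint8_t *bitmap;         /* one bit per block, set = allocated */
};

static bool bitmap_test(const uint8_t *bitmap, uint64_t bit)
{
	return (bitmap[bit >> 3] >> (bit & 7)) & 1;
}

/* Binary search over the sorted leaves; stands in for a tree descent. */
static const struct region *find_region(const struct free_map *map, uint64_t block)
{
	size_t lo = 0, hi = map->nregions;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const struct region *r = &map->regions[mid];

		if (block < r->start)
			hi = mid;
		else if (block - r->start >= r->count)
			lo = mid + 1;
		else
			return r;
	}
	return NULL;
}

bool block_is_free(const struct free_map *map, uint64_t block)
{
	const struct region *r = find_region(map, block);

	assert(r != NULL);  /* assumption: every block is covered by some leaf */
	switch (r->kind) {
	case REGION_FREE:
		return true;   /* state described by the leaf itself */
	case REGION_USED:
		return false;
	default:
		/* state lives in the bitmap, at a logical offset */
		return !bitmap_test(map->bitmap, r->bitmap_offset + (block - r->start));
	}
}
```

The sketch only covers lookup. The recursion worry raised in the chat (setting a bit may itself require allocating a bitmap block) belongs to the update path, which is not shown.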
2008-09-20 01:53 yeah, obviously need to either be very careful or do this right 2008-09-20 01:53 "terminate me this" 2008-09-20 01:54 do this right 2008-09-20 01:54 be aware of it 2008-09-20 01:54 think about it clearly 2008-09-20 01:54 the advantage of sparse allocation of the bitmap is compelling imho 2008-09-20 01:54 it should terminate if you lean the algos toward choosing non-split blocks 2008-09-20 01:54 I also posted about that 2008-09-20 01:54 same thing for the logging 2008-09-20 01:54 algorithm proposals welcome 2008-09-20 01:54 should also keep the tree from degenerating to a non-sparse bitmap 2008-09-20 01:55 shapor was going to scrape the ml for those posts 2008-09-20 01:55 well 2008-09-20 01:55 it's more important to choose optimal locations 2008-09-20 01:55 I've barely read any of the ml posts... just not enough time 2008-09-20 01:55 than worry about how the bitmap is split 2008-09-20 01:55 ah 2008-09-20 01:55 well let me find one 2008-09-20 01:55 yes ... but no 2008-09-20 01:55 about the bitmaps vs btree 2008-09-20 01:56 8910 flips 2008-09-20 01:56 2885 MaZe 2008-09-20 01:56 1696 shapor 2008-09-20 01:56 1003 bh 2008-09-20 01:56 876 flipz 2008-09-20 01:56 671 konrad 2008-09-20 01:56 418 tim_dimm 2008-09-20 01:56 196 RazvanM 2008-09-20 01:56 133 Bushman 2008-09-20 01:56 latest stats 2008-09-20 01:56 you're closing 2008-09-20 01:56 of late it seems flipz has abandoned coding in favour of frivolous chat 2008-09-20 01:57 ;-) 2008-09-20 01:57 or he just likes to talk about interesting problems 2008-09-20 01:58 allocation is a good one 2008-09-20 01:58 oh, you're still here? so what was your idea ;-)?
2008-09-20 01:59 you seemed to disappear right when you were getting to the good part 2008-09-20 01:59 flipz: actually pulled an old trick out of his hat and came up wit a good solution 2008-09-20 01:59 maze, "More about the free tree" 2008-09-20 02:00 http://kerneltrap.org/mailarchive/tux3/2008/8/16/2959334 2008-09-20 02:00 not the one I was thinking of 2008-09-20 02:01 reading 2008-09-20 02:01 "All about the free tree" 2008-09-20 02:01 http://kerneltrap.org/mailarchive/tux3/2008/8/13/2929244 2008-09-20 02:01 right previous one 2008-09-20 02:01 whoever said the fs has to be clean on unmount? 2008-09-20 02:02 just rely on the journal recovery getting memory state consistent 2008-09-20 02:02 problem solved 2008-09-20 02:02 right 2008-09-20 02:02 relating to recursion 2008-09-20 02:02 continuing to parse 2008-09-20 02:02 my point from the first tux3 post 2008-09-20 02:02 breaking new ground it seems 2008-09-20 02:03 hmm? 2008-09-20 02:03 nobody else does it that way 2008-09-20 02:03 oh, yeah, scarredy cats 2008-09-20 02:03 hmm, that's mis-spelt (as is this) I think 2008-09-20 02:03 there are scaredy cats an scarred cats ;) 2008-09-20 02:04 you just have to log all (de-)allocates etc 2008-09-20 02:04 yeah it's a little complex but very satisfying 2008-09-20 02:04 nice we have some kickass bio primitives to log with huh? 2008-09-20 02:05 those ones I did earlier today are really quite nice and general 2008-09-20 02:05 you start the fs with just the journal 2008-09-20 02:05 only thing you might want is alloc flags... 
but then maybe not even 2008-09-20 02:05 and empty trees and structures 2008-09-20 02:05 no journal 2008-09-20 02:05 then you dump the allocation for the superblock and log into the journal 2008-09-20 02:05 journals are for lamerz -- flipz 2008-09-20 02:06 and you let the kernel module (journal -> forward log) deal with creating the fs 2008-09-20 02:06 no complexity in the mkfs code at all 2008-09-20 02:06 tux3 already creates the fs extremely elegantly 2008-09-20 02:06 hard to improve on, really 2008-09-20 02:06 still parsing ;-) 2008-09-20 02:06 Creating a new Tux3 filesystem requires allocating a number of objects, 2008-09-20 02:06 including objects involved in allocation.  Another nice recursion 2008-09-20 02:06 there: you have to allocate space for objects, but the block allocator 2008-09-20 02:06 is not initialized yet. <- where I am 2008-09-20 02:07 ah 2008-09-20 02:07 right that was fun 2008-09-20 02:07 a few segfaults on the way to sorting it 2008-09-20 02:08 do you mark the superblock as a used block in the alloc tree? 2008-09-20 02:08 superblock(s) 2008-09-20 02:08 yes 2008-09-20 02:08 ok 2008-09-20 02:08 ah, you chickened out at the allocation strategy ;-) 2008-09-20 02:09 that's always where my reasoning breaks down... 2008-09-20 02:09 line 480 "reserve superblock" http://tux3.org/tux3?f=bcfdc76d14a8;file=user/test/inode.c 2008-09-20 02:09 didn't chicken out 2008-09-20 02:09 recognized how big a post it would be and deferred it 2008-09-20 02:09 but there are writings 2008-09-20 02:09 just one? 
2008-09-20 02:09 the essential point 2008-09-20 02:09 a fairly long one 2008-09-20 02:10 ends with "generating functions" 2008-09-20 02:10 I've been thinking an fs should have two superblocks 2008-09-20 02:10 haven't written that one yet 2008-09-20 02:10 so called bracket-blocks 2008-09-20 02:10 one in front, one at the end 2008-09-20 02:10 you can then support extending a file-system forward and backward 2008-09-20 02:10 http://kerneltrap.org/mailarchive/tux3/2008/8/27/3094404 [Tux3] Spacial correlation between directory entries, inodes and file data 2008-09-20 02:11 if the front or back of the bdev remains unchanged between mounts you can fix up and resize the filesystem on the fly 2008-09-20 02:11 should make resizing easier 2008-09-20 02:11 why do you want to extend forwards? 2008-09-20 02:11 or I'm not sure, backwards? 2008-09-20 02:12 interesting ain't it? 2008-09-20 02:12 not sure 2008-09-20 02:12 because normal fs extend -> 2008-09-20 02:12 so it's natural to want to be able to extend <- 2008-09-20 02:12 to be able to share space between them 2008-09-20 02:12 without having to have lvm in the middle 2008-09-20 02:12 did god intend that? 2008-09-20 02:13 why not have lvm in the middle? 2008-09-20 02:13 an extra layer 2008-09-20 02:13 not really 2008-09-20 02:13 with little apparent benefit 2008-09-20 02:13 provisioning 2008-09-20 02:13 plus it then ends up remapping disk order 2008-09-20 02:13 big benefit 2008-09-20 02:13 so?
2008-09-20 02:13 which breaks space optimizations you might otherwise attempt 2008-09-20 02:13 remap in big chunks 2008-09-20 02:13 seeks across an lvm 2008-09-20 02:13 that fixes that 2008-09-20 02:13 are no longer like seeks across a normal disk 2008-09-20 02:14 no, they're faster 2008-09-20 02:14 across an array 2008-09-20 02:14 yes, true 2008-09-20 02:14 but I still think you want to be able to keep the order of blocks in the fs the same as on disk - even if it gets split across multiple disks and raid5/6ed 2008-09-20 02:15 the cost of lvm is much less than you think, see my bio stacking patches 2008-09-20 02:15 argh, bh disappeared again 2008-09-20 02:15 I'm not thinking of cpu kernel nor stack cost 2008-09-20 02:15 youcan permute in big chunks without loss of performance 2008-09-20 02:15 I'm thinking of cost of the disk no longer being linear 2008-09-20 02:15 that's key 2008-09-20 02:15 that's one of the few things we have going for us 2008-09-20 02:16 but you can't have fluid space allocation between multiple fs 2008-09-20 02:16 you do 2008-09-20 02:16 although I'm not sure having that would be a good thing ;-) 2008-09-20 02:16 just at a coarser granularity than you're thinking 2008-09-20 02:17 I am thinking, the provision granularity will be units of 128 MB 2008-09-20 02:17 oh, I know another reason why I liked this 2008-09-20 02:17 because of dr 2008-09-20 02:17 coincidentally, the number of 4K blocks you can map with one 4K block 2008-09-20 02:17 recovering the fs from a partially damaged disk 2008-09-20 02:18 getting late 2008-09-20 02:18 you basically want to survive in some pseudo decent state a situation in which multi-megabyte pieces of the disk go awol 2008-09-20 02:18 and I only checked in half the code I'd planeed 2008-09-20 02:18 indeed 2008-09-20 02:18 been meaning to say: 2008-09-20 02:18 yes 2008-09-20 02:18 good rule to write into the plan 2008-09-20 02:18 it shall be so 2008-09-20 02:18 signing off and rebooting into mac to finally install 
spore 2008-09-20 02:18 but have gotten caught up in conversations ;-) 2008-09-20 02:19 heh 2008-09-20 02:19 maybe I have to try that lucasarts demo again 2008-09-20 02:19 have to put on a show for my daughter tomorrow 2008-09-20 02:19 get back from hols 2008-09-20 02:19 I believe it should be possible to write the kernel fs 2008-09-20 02:19 code in such a way that no user space utilities would be required 2008-09-20 02:19 there is no reason why the kernel can't repair the fs 2008-09-20 02:20 oh yeah, especially considering I already did it 2008-09-20 02:20 since the kernel code has to be fault tolerant anyway 2008-09-20 02:20 well there's no fsck 2008-09-20 02:20 but there is mkfs 2008-09-20 02:20 can easily be done on mount 2008-09-20 02:20 -!- pgquiles(~pgquiles@229.Red-83-49-101.dynamicIP.rima-tde.net) has joined #tux3 2008-09-20 02:20 even that could be a couple function in kernel with a mount mkfs option 2008-09-20 02:20 hey pgquiiles ;) 2008-09-20 02:20 since most of the mkfs ends up being inkernel anyways 2008-09-20 02:20 just when we were heading to our consoles 2008-09-20 02:21 there's going to be -omake for tux3 2008-09-20 02:21 huh? 2008-09-20 02:21 oh 2008-09-20 02:21 as in -omkfs? 2008-09-20 02:21 mount -ttux3 -omake /dev/foo /mnt/rulez 2008-09-20 02:21 yep 2008-09-20 02:21 mount -ttux3 -omkfs /dev/foo /mnt/rulez 2008-09-20 02:21 better I guess 2008-09-20 02:22 trivial 2008-09-20 02:22 exactly 2008-09-20 02:22 useful 2008-09-20 02:22 I didn't realize I always wanted that tillyou mentioned it 2008-09-20 02:22 ok... options parsing 2008-09-20 02:22 yeah, and it'll probably share a fair bit of code with the recovery parts of the kernel 2008-09-20 02:22 my ask for next in junkfs 2008-09-20 02:22 ok? 2008-09-20 02:22 ...maybe... 
2008-09-20 02:22 ok 2008-09-20 02:23 text strings of course 2008-09-20 02:23 I want to see waht kind of parser you write ;) 2008-09-20 02:23 lol 2008-09-20 02:23 something tells me you're a parser kind of guy 2008-09-20 02:23 the parser doesn't have to be efficient 2008-09-20 02:23 shift reduce is second nature 2008-09-20 02:23 hmm? 2008-09-20 02:23 since it'll be used barely ever 2008-09-20 02:23 hmm? 2008-09-20 02:23 sure 2008-09-20 02:23 shift reduce? 2008-09-20 02:24 lalr-1? 2008-09-20 02:24 context sensitive? 2008-09-20 02:24 well 2008-09-20 02:24 ok, you're going to have fun learning that 2008-09-20 02:24 should take you about a day 2008-09-20 02:24 I think I'm not aware of the terms you're using 2008-09-20 02:25 at least not in English 2008-09-20 02:25 basic parser lingo 2008-09-20 02:25 you're in for a formative experience 2008-09-20 02:25 that's a whole class of algorithmic thoughts you haven't had yet 2008-09-20 02:25 what kind of options do we need to parse anyway? 2008-09-20 02:25 opt=# 2008-09-20 02:25 opt=[on/off] 2008-09-20 02:26 on=[enum] 2008-09-20 02:26 erm opt=[enum] 2008-09-20 02:26 opt=[ip4|ip6|ip4:port...hostname,etc.] 2008-09-20 02:26 -omkfs 2008-09-20 02:26 mostly on/off present/not present and integers it seems 2008-09-20 02:27 case sensitivity? 2008-09-20 02:27 gnu opt syntax would be nice 2008-09-20 02:27 but 2008-09-20 02:27 you don't get to parse the command line 2008-09-20 02:27 just the -o part 2008-09-20 02:27 right 2008-09-20 02:27 it has a set syntax 2008-09-20 02:27 but you do end up getting a single char* 2008-09-20 02:27 every options parser I've seen is sickening 2008-09-20 02:28 what part is sickening? 2008-09-20 02:28 the implementation 2008-09-20 02:28 I meant which part ... what don't you like? 
2008-09-20 02:28 ext2/3 used to be even worse than they are 2008-09-20 02:28 cut and paste 2008-09-20 02:28 poor use of tables 2008-09-20 02:28 have to look in 3 places to see what's going on 2008-09-20 02:28 that kind of thing 2008-09-20 02:28 long code 2008-09-20 02:28 mostly fluff 2008-09-20 02:29 hmm 2008-09-20 02:30 { 2008-09-20 02:30 { "uid32", &flag, OP_OR, FLAG_UID32 } 2008-09-20 02:30 that's from? 2008-09-20 02:30 head 2008-09-20 02:30 right 2008-09-20 02:30 good 2008-09-20 02:31 less use of enums is better 2008-09-20 02:31 but hard to avoid entirely 2008-09-20 02:31 but yes, that's the way it should be 2008-09-20 02:31 I'm thinking a parse_options(char *, options_table_t *) 2008-09-20 02:31 and please no destroying the input string ;) 2008-09-20 02:32 sure, most should be handled by directly setting a flag, no special option code to invoke 2008-09-20 02:32 possibly multiple flags 2008-09-20 02:32 and setting clearing 2008-09-20 02:33 with appropriate callback, sure 2008-09-20 02:33 right callbacks 2008-09-20 02:33 also strings can be handled via store length and store pointer to non-asciiz 2008-09-20 02:33 that way we don't modify 2008-09-20 02:33 bonus points for being able to parse binary 2008-09-20 02:33 I _always_ store length 2008-09-20 02:34 right being able to parse 0x000 and 000b and 007 and 23#blah 2008-09-20 02:34 never asciiz except for throwaway 2008-09-20 02:34 although... 2008-09-20 02:34 converting commas to nulls in the input string, might be acceptable... 
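Editorial sketch of the table-driven `-o` parser discussed above, built around the `{ "uid32", &flag, OP_OR, FLAG_UID32 }` row quoted in the chat. It is a minimal illustration under stated assumptions, not the parser that was eventually written. `parse_options`, `OP_OR` and the row layout follow the chat; `OP_CLEAR`, `OP_UINT`, `struct option_entry` and the helper names are hypothetical additions. The input string is never modified: names and values are handled as (pointer, length) pairs, and numbers go through the standard `strtoul` with base 0, so only decimal, `0x` hex and leading-zero octal are accepted (the `000b` and `23#blah` forms from the chat are not).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum option_op { OP_OR, OP_CLEAR, OP_UINT };

struct option_entry {
	const char *name;
	unsigned long *target;  /* variable the option acts on */
	enum option_op op;
	unsigned long arg;      /* flag bits for OP_OR / OP_CLEAR */
};

/* Find the entry whose name equals the len-byte string at name. */
static const struct option_entry *lookup(const struct option_entry *table,
					 const char *name, size_t len)
{
	for (; table->name; table++)
		if (strlen(table->name) == len && !memcmp(table->name, name, len))
			return table;
	return NULL;
}

/* Parse an unsigned number from a (pointer, length) pair. */
static int parse_uint(const char *s, size_t len, unsigned long *out)
{
	char buf[32], *end;
	unsigned long value;

	if (!len || len >= sizeof buf || s[0] < '0' || s[0] > '9')
		return -EINVAL;
	memcpy(buf, s, len);    /* private copy, so the input stays intact */
	buf[len] = 0;
	errno = 0;
	value = strtoul(buf, &end, 0);
	if (errno || *end)
		return -EINVAL;
	*out = value;
	return 0;
}

/* Parse "name,name=value,..."; the table ends with a NULL name. */
int parse_options(const char *opts, const struct option_entry *table)
{
	while (*opts) {
		size_t len = strcspn(opts, ",");  /* one item */
		const char *eq = memchr(opts, '=', len);
		size_t namelen = eq ? (size_t)(eq - opts) : len;
		const struct option_entry *entry = lookup(table, opts, namelen);

		if (!entry)
			return -EINVAL;
		switch (entry->op) {
		case OP_OR:
			if (eq)
				return -EINVAL;  /* flags take no value */
			*entry->target |= entry->arg;
			break;
		case OP_CLEAR:
			if (eq)
				return -EINVAL;
			*entry->target &= ~entry->arg;
			break;
		case OP_UINT:
			if (!eq || parse_uint(eq + 1, len - namelen - 1, entry->target))
				return -EINVAL;
			break;
		}
		opts += len;
		if (*opts == ',')
			opts++;
	}
	return 0;
}
```

With a table such as `{ "uid32", &flags, OP_OR, FLAG_UID32 }`, `{ "blocksize", &blocksize, OP_UINT, 0 }`, `{ NULL, NULL, OP_OR, 0 }`, a call like `parse_options("uid32,blocksize=0x1000", table)` sets the flag bit and stores 4096. Callbacks and length-plus-pointer string options, both mentioned in the chat, would be further `option_op` cases.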
2008-09-20 02:34 there isn't a unicode requirement 2008-09-20 02:34 fortunately 2008-09-20 02:34 utg8 2008-09-20 02:34 utf8 2008-09-20 02:34 not even 2008-09-20 02:34 should be utf8 clean 2008-09-20 02:35 though 2008-09-20 02:35 command lines are not utf8 2008-09-20 02:35 although that's obvious 2008-09-20 02:35 correct me if I'm wrong 2008-09-20 02:35 there is unfortunately no standard 2008-09-20 02:35 but utf8 is winning 2008-09-20 02:35 and the fs is utf8 2008-09-20 02:35 so they basically are 2008-09-20 02:35 utf8 options... 2008-09-20 02:35 probably not necssary 2008-09-20 02:36 probably also no hurt to support 2008-09-20 02:36 might not even need any code if done right 2008-09-20 02:36 I'm geeking the question 2008-09-20 02:37 hmm? 2008-09-20 02:37 what does that mean 2008-09-20 02:37 geeking? 2008-09-20 02:37 almost like googling 2008-09-20 02:37 well geeking the question 2008-09-20 02:37 oh 2008-09-20 02:37 except geekier 2008-09-20 02:38 so in case of conflicting options 2008-09-20 02:38 last one on the right wins 2008-09-20 02:38 right? 2008-09-20 02:38 or all get computed and accumulated 2008-09-20 02:40 http://developer.apple.com/technotes/tn2002/tn2065.html 2008-09-20 02:40 Q: What does do shell script do with non-ASCII text (accented characters, Japanese, etc.)? 
2008-09-20 02:40 useful commentary 2008-09-20 02:40 exit with an error 2008-09-20 02:40 on any option conflict 2008-09-20 02:41 informative error 2008-09-20 02:41 properly formatted 2008-09-20 02:41 saying what conflicted with what 2008-09-20 02:41 give up on first conflict 2008-09-20 02:42 that might be hard to do 2008-09-20 02:42 since what conflicts with what is a problem in and of itsel 2008-09-20 02:42 f 2008-09-20 02:42 if it's not hard you wont' get pheromones from it 2008-09-20 02:44 right 2008-09-20 02:44 the comments above from apple are not quire 2008-09-20 02:44 quite 2008-09-20 02:44 most linux international os'es 2008-09-20 02:44 have LC_ALL=something.utf-8 2008-09-20 02:44 $ locale 2008-09-20 02:44 LANG=en_US.UTF-8 2008-09-20 02:44 LC_CTYPE="en_US.UTF-8" 2008-09-20 02:44 LC_NUMERIC="en_US.UTF-8" 2008-09-20 02:44 LC_TIME="en_US.UTF-8" 2008-09-20 02:44 LC_COLLATE="en_US.UTF-8" 2008-09-20 02:44 LC_MONETARY="en_US.UTF-8" 2008-09-20 02:44 LC_MESSAGES="en_US.UTF-8" 2008-09-20 02:44 LC_PAPER="en_US.UTF-8" 2008-09-20 02:44 LC_NAME="en_US.UTF-8" 2008-09-20 02:44 LC_ADDRESS="en_US.UTF-8" 2008-09-20 02:44 LC_TELEPHONE="en_US.UTF-8" 2008-09-20 02:44 LC_MEASUREMENT="en_US.UTF-8" 2008-09-20 02:44 LC_IDENTIFICATION="en_US.UTF-8" 2008-09-20 02:44 LC_ALL= 2008-09-20 02:45 those that don't are broken ;-) and why your xchat didn't work 2008-09-20 02:45 and since multi-byte encodings (like unicode16 or nicode32) don't work in shell 2008-09-20 02:45 mount options themselves have no unicode impact 2008-09-20 02:45 precisely 2008-09-20 02:45 only possibility is values supplied to options 2008-09-20 02:45 so everything is ascii 2008-09-20 02:45 and I don't know of any of those 2008-09-20 02:46 theoreticaly a mount option could be the directory relative to root fs to mount as root 2008-09-20 02:46 and that might be utf8 2008-09-20 02:46 is there one of those? 
2008-09-20 02:46 but really at that point it's just a string of bytes 2008-09-20 02:46 there should be ;-) 2008-09-20 02:47 passphrase 2008-09-20 02:47 that too 2008-09-20 02:47 remote machines hostname 2008-09-20 02:47 when using international domains 2008-09-20 02:47 passphrase in cleartext on the mount command would be really lame ;) 2008-09-20 02:47 let's not do that 2008-09-20 02:47 well 2008-09-20 02:47 anyway, it's not so much a matter of support, but non-breaking it 2008-09-20 02:47 I suppose it's ok if nobody can see it (secret fstab) 2008-09-20 02:48 nah, not the right spot for passwords 2008-09-20 02:48 and tux3 shouldn't do crypto anyway, not really 2008-09-20 02:49 that's something you should definitely rely on dm-crypt or whatever to do 2008-09-20 02:49 it may end up doing some namespace stuff 2008-09-20 02:49 that only the filesystem can do 2008-09-20 02:49 oh, had an idea 2008-09-20 02:49 for pipe files 2008-09-20 02:50 data is a pipe, with appropriate flags, do cause reading/writing it to launch a zcat/gzip to data.gz 2008-09-20 02:50 userspace compression ;-) 2008-09-20 02:50 the pipe ends up being not seekable though 2008-09-20 02:50 ah, overload the file semantics 2008-09-20 02:50 but maybe it'd be of some use 2008-09-20 02:50 so you can plug a filter in front of a file 2008-09-20 02:50 right 2008-09-20 02:51 and remember that on the fs 2008-09-20 02:51 exactly 2008-09-20 02:51 should be fun to write up 2008-09-20 02:51 always wanted that 2008-09-20 02:51 but I didn't think it should be within the fs 2008-09-20 02:51 more a vfs thing 2008-09-20 02:51 like a per-file mount 2008-09-20 02:51 but then how do you make it persistent? 
2008-09-20 02:51 that's why the fs has to support storing it 2008-09-20 02:52 that's where the must-be-in-fs might come in 2008-09-20 02:52 and where once again xattr options could come in useful 2008-09-20 02:52 xattr has the nice benefit that archival utilities already support storing them 2008-09-20 02:52 trick is to come up with a mechanism-not-policy on that 2008-09-20 02:52 and have it be a nice mechanism and not totally single purpose 2008-09-20 02:53 yeah, the above is just a rough concept 2008-09-20 02:53 wow I let this vino stand on end too long 2008-09-20 02:53 cork's resisting 2008-09-20 02:53 theoretically the command to run on read/write could be an xattr option 2008-09-20 02:53 maybe fifo better than pipe 2008-09-20 02:54 unsure 2008-09-20 02:54 maybe a fifo that becomes a pipe on read or write access 2008-09-20 02:54 that sounds better 2008-09-20 02:54 since I don't think unix supports pipes on fs 2008-09-20 02:54 maybe there's a new syscall that associates 2008-09-20 02:54 unix does support pipe on fs 2008-09-20 02:54 where have you been? 2008-09-20 02:55 oh it does? 2008-09-20 02:55 use it heavily in zumastor 2008-09-20 02:55 almost never use pipes and fifos (except pipes in shell via | ) 2008-09-20 02:55 named pipes 2008-09-20 02:55 bash daemons ;) 2008-09-20 02:56 wow, that was tight 2008-09-20 02:56 probably the semantics are wrong though... or can you open the same pipe multiple times for write and/or read and not have conflicts? 2008-09-20 02:56 the cork?
2008-09-20 02:56 it's 3am 2008-09-20 02:56 bash semantics for pipes are slightly odd 2008-09-20 02:56 since it can't hold them open 2008-09-20 02:57 there also needs to be a way to store metadata associated with a file with information on when to invalidate it 2008-09-20 02:57 it relies on the fringe behavior 2008-09-20 02:57 what happens before the other side opens and after it closes 2008-09-20 02:57 or some very powerful way to keep track of file state 2008-09-20 02:57 well 2008-09-20 02:57 hold that thought ;) 2008-09-20 02:57 bash can keep em open 2008-09-20 02:57 till after the junkfs option parser 2008-09-20 02:57 use exec to redirect 2008-09-20 02:58 right, option parser first ;-) 2008-09-20 02:58 oh, sick 2008-09-20 02:58 I need to improve my demented index 2008-09-20 02:58 huh? 2008-09-20 02:58 redirecting a pipe in bash 2008-09-20 02:58 to keep it open 2008-09-20 02:58 sick 2008-09-20 02:58 huh 2008-09-20 02:58 why? 2008-09-20 02:59 it's how I do tcp connections in bash 2008-09-20 02:59 bash can keep em open <- just commenting 2008-09-20 02:59 well for our app that would be way sick 2008-09-20 03:00 why? 
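For readers following the "bash can keep em open ... use exec to redirect" remark: a minimal sketch of holding a named pipe open on a file descriptor with exec, so that short-lived writers can come and go without the reader ever seeing end-of-file. This is not the zumastor code; the temp directory, the fifo name and the messages are made up for the demo, and opening a FIFO read-write like this is Linux behavior rather than something POSIX promises.

```bash
#!/usr/bin/env bash
# Sketch: keep a named pipe open on fd 3 with exec.
# Paths and messages are placeholders, not taken from the chat.
workdir=$(mktemp -d)
fifo="$workdir/demo.fifo"
mkfifo "$fifo"

# Opening the FIFO read-write does not block on Linux. While this shell
# holds fd 3, the FIFO always has a reader and a writer attached.
exec 3<>"$fifo"

# Two separate short-lived writers. Neither blocks on open, and closing
# them does not produce EOF for the reader, because fd 3 is still open.
echo "first request" > "$fifo"
echo "second request" > "$fifo"

# Read both messages back through the held descriptor.
read -r first <&3
read -r second <&3

# Close the descriptor and clean up.
exec 3<&-
rm -rf "$workdir"

echo "$first / $second"
```

Without the exec, each `echo > fifo` would block until some reader opened the pipe, and a reader would get EOF as soon as the last writer closed, which is the "fringe behavior" mentioned above.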
2008-09-20 03:00 here let me find some code 2008-09-20 03:01 I know what you're talking about 2008-09-20 03:01 you don't know how we use pipes 2008-09-20 03:01 CR=`echo -en "\r"` 2008-09-20 03:01 open3() { 2008-09-20 03:01 exec 3<>/dev/tcp/$HOSTNAME/$HOSTPORT 2008-09-20 03:01 } 2008-09-20 03:01 if you did you wouldn't have to ask about the sick 2008-09-20 03:01 close3() { 2008-09-20 03:01 exec 3<&- 2008-09-20 03:01 } 2008-09-20 03:01 get() { 2008-09-20 03:01 open3 || { sleep 1; exit; } 2008-09-20 03:01 echo -en "GET $1 HTTP/1.0\r\n\r\n" >&3 2008-09-20 03:01 cat <&3 2008-09-20 03:01 close3 2008-09-20 03:01 } 2008-09-20 03:01 getfile() { 2008-09-20 03:01 get "$1" | while read line; do 2008-09-20 03:01 if [ "$line" == "" -o "$line" == "$CR" ]; then cat; exit; fi 2008-09-20 03:01 done 2008-09-20 03:01 } 2008-09-20 03:01 there 2008-09-20 03:01 wget 2008-09-20 03:01 heh 2008-09-20 03:01 ok 2008-09-20 03:01 that is sick 2008-09-20 03:01 really 2008-09-20 03:01 shapor needs to see it 2008-09-20 03:01 works wonders 2008-09-20 03:02 putfile() { 2008-09-20 03:02 open3 2008-09-20 03:02 cat > $TMPDIR/post-$$ 2008-09-20 03:02 LEN=`wc -c < $TMPDIR/post-$$` 2008-09-20 03:02 echo -en "POST $1 HTTP/1.0\r\n" >&3 2008-09-20 03:02 echo -en "Content-Length: $LEN\r\n" >&3 2008-09-20 03:02 echo -en "\r\n" >&3 2008-09-20 03:02 cat $TMPDIR/post-$$ >&3 2008-09-20 03:02 rm -f $TMPDIR/post-$$ 2008-09-20 03:02 close3 2008-09-20 03:02 } 2008-09-20 03:02 why in bash may I ask? 
2008-09-20 03:02 even does post 2008-09-20 03:02 because this was for a disk-less system 2008-09-20 03:02 and the entire thing ran in bash basically 2008-09-20 03:03 sick^2 2008-09-20 03:03 nope 2008-09-20 03:03 this is where it gets sick: 2008-09-20 03:03 open3pr() { 2008-09-20 03:03 exec 3<>/dev/tcp/$IPPNAME/$IPPPORT 2008-09-20 03:03 } 2008-09-20 03:03 agreed 2008-09-20 03:03 putchar() { 2008-09-20 03:03 echo -en "\x"`printf "%02X" $1` 2008-09-20 03:03 } 2008-09-20 03:03 ippstr() { 2008-09-20 03:03 SIZE=`echo -en "$*" | wc -c` 2008-09-20 03:03 putchar $[$SIZE/256] 2008-09-20 03:03 putchar $[$SIZE%256] 2008-09-20 03:03 echo -en "$*" 2008-09-20 03:03 } 2008-09-20 03:03 ippstr3() { 2008-09-20 03:03 echo -en "$1" 2008-09-20 03:03 ippstr "$2" 2008-09-20 03:03 ippstr "$3" 2008-09-20 03:03 } 2008-09-20 03:03 print_header() { 2008-09-20 03:03 PRINTJOB=$[$PRINTJOB+1] 2008-09-20 03:03 echo -en "\1\1\0\2\0\0\0\1" 2008-09-20 03:03 echo -en "\1" 2008-09-20 03:03 ippstr3 "G" "attributes-charset" "iso-8859-1" 2008-09-20 03:03 ippstr3 "H" "attributes-natural-language" "en-us" 2008-09-20 03:03 ippstr3 "E" "printer-uri" "ipp://$IPPNAME:$IPPPORT/printers/$PRINTER" 2008-09-20 03:04 ippstr3 "B" "requesting-user-name" "root" 2008-09-20 03:04 ippstr3 "B" "job-name" "judge-job-$PRINTJOB-$1" 2008-09-20 03:04 ippstr3 "I" "document-format" "application/octet-stream" 2008-09-20 03:04 echo -en "\2" 2008-09-20 03:04 ippstr3 "B" "job-sheets" "none" 2008-09-20 03:04 ippstr3 "B" "" "none" 2008-09-20 03:04 echo -en "\3" 2008-09-20 03:04 } 2008-09-20 03:04 print() { 2008-09-20 03:04 open3pr 2008-09-20 03:04 TEMPFILE="$TMPDIR/header-$$" 2008-09-20 03:04 print_header "$1" >"$TEMPFILE" 2008-09-20 03:04 cat >> "$TEMPFILE" 2008-09-20 03:04 LENGTH=`wc -c<"$TEMPFILE"` 2008-09-20 03:04 echo -en "POST /printers/$PRINTER HTTP/1.1\r\n" >&3 2008-09-20 03:04 echo -en "Content-Length: $LENGTH\r\n" >&3 2008-09-20 03:04 echo -en "Content-Type: application/ipp\r\n" >&3 2008-09-20 03:04 echo -en "Host: 
$IPPNAME\r\n" >&3 2008-09-20 03:04 echo -en "\r\n" >&3 2008-09-20 03:04 cat "$TEMPFILE" >&3 2008-09-20 03:04 rm -f "$TEMPFILE" 2008-09-20 03:04 LENGTH=5 2008-09-20 03:04 #cat <&3 2008-09-20 03:04 while read line <&3; do 2008-09-20 03:04 if [ "$line" == "" -o "$line" == "$CR" ]; then break; fi 2008-09-20 03:04 case "$line" in 2008-09-20 03:04 "Content-Length: "*) 2008-09-20 03:04 LENGTH=`echo "$line" | sed "s/^Content-Length: //;s/\r\\$//"` 2008-09-20 03:04 ;; 2008-09-20 03:04 esac 2008-09-20 03:04 echo "$line" 2008-09-20 03:04 done 2008-09-20 03:04 dd bs=1 count=$LENGTH 2>/dev/null <&3 | xxd 2008-09-20 03:04 close3 2008-09-20 03:04 } 2008-09-20 03:04 and you have printing to an ipp print spool 2008-09-20 03:04 no bash-that-writes-bash? 2008-09-20 03:05 oh, let me find another snipper 2008-09-20 03:05 heh 2008-09-20 03:05 pastie please 2008-09-20 03:05 the /dev/tcp trick is disabled in bash on ubuntu/debian 2008-09-20 03:05 so we are hacking routers? 2008-09-20 03:05 dsl router or something? 2008-09-20 03:05 print server? 2008-09-20 03:05 no online contest judge system 2008-09-20 03:06 for programming contest 2008-09-20 03:06 ah 2008-09-20 03:06 basically my MSc thesis 2008-09-20 03:06 uses diskless nodes to perform testing of untrusted code 2008-09-20 03:06 your msc thesis was a contest? 2008-09-20 03:06 I see 2008-09-20 03:06 contest is the algorithm 2008-09-20 03:06 the code to support the national eliminations for ICPC 2008-09-20 03:06 called AMPPZ 2008-09-20 03:07 in Poland 2008-09-20 03:07 (ACM ICPC) 2008-09-20 03:07 so you wrote the judge script for it and got a msc for it? 
2008-09-20 03:07 no, a lot more 2008-09-20 03:08 the rootfs file system, the stripped down compilers 2008-09-20 03:08 the judging server 2008-09-20 03:08 the scripting 2008-09-20 03:08 the firewall rules 2008-09-20 03:08 etc 2008-09-20 03:08 the entire system 2008-09-20 03:08 the whole infrastructure 2008-09-20 03:08 sandbox 2008-09-20 03:08 right 2008-09-20 03:08 yup 2008-09-20 03:08 still in active use for student coursework 2008-09-20 03:08 fun 2008-09-20 03:08 I was so much more boring as a student 2008-09-20 03:08 makes'em write code that actually runs and works on frickin' wickedly selected tests 2008-09-20 03:09 writing compilers and such 2008-09-20 03:09 ...still searching... 2008-09-20 03:10 $ cat rpc.sh 2008-09-20 03:10 #!/bin/echo You must include this file: 2008-09-20 03:10 escape() { 2008-09-20 03:10 local arg 2008-09-20 03:10 local ch 2008-09-20 03:10 for arg in "$@"; do 2008-09-20 03:10 echo -n " \$'" 2008-09-20 03:10 echo -n "${arg}" \ 2008-09-20 03:10 | while IFS= read -n 1 -r ch; do 2008-09-20 03:10 echo -n "\\x$(xxd -ps -l1 <<< "${ch}")" 2008-09-20 03:10 done 2008-09-20 03:10 echo -n "'" 2008-09-20 03:10 done 2008-09-20 03:10 echo 2008-09-20 03:10 } 2008-09-20 03:10 rpc() { 2008-09-20 03:10 local HOST="$1" 2008-09-20 03:10 shift 2008-09-20 03:10 local PROC="$1" 2008-09-20 03:10 shift 2008-09-20 03:10 local FUNC=$(type "${PROC}" | sed -rn '2,$p') 2008-09-20 03:10 # echo "rpc HOST[${HOST}] PROC[${PROC}] FUNC[${FUNC}] ARGS[$*]" 2008-09-20 03:10 local ARGS=`escape "$@"` 2008-09-20 03:10 ssh -ax "root@${HOST}" "${FUNC}; ${PROC}${ARGS}" } 2008-09-20 03:11 usage: 2008-09-20 03:11 rpc hostname shell_function parameters... 2008-09-20 03:11 executes shell_function on remote machine 2008-09-20 03:11 using a ssh-based rpc scheme 2008-09-20 03:11 yup 2008-09-20 03:12 crazy eh? 2008-09-20 03:12 well I was going to say, you out leeted the zumastor script, but then... did you write daemons in bash? 
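The rpc() paste above works by sending the text of a shell function plus escaped arguments to a remote shell. A minimal sketch of the same idea, with two substitutions that are not in the original: `declare -f` instead of `type | sed` to get the function source, and `printf %q` instead of the hex-escaping loop. A local `bash -c` stands in for the ssh hop so the example runs on its own; `run_in_fresh_shell`, `quote_args` and `greet` are hypothetical names.

```bash
#!/usr/bin/env bash
# Sketch: ship a function definition and quoted arguments to another shell.
# A local `bash -c` replaces `ssh root@host` so no network is needed.

# Quote each argument so the receiving shell parses it as exactly one word.
quote_args() {
    local arg
    for arg in "$@"; do
        printf ' %q' "$arg"
    done
}

# run_in_fresh_shell FUNCTION [ARGS...]
run_in_fresh_shell() {
    local proc=$1
    shift
    local func
    # declare -f prints the complete definition, name included.
    func=$(declare -f "$proc") || return 1
    bash -c "${func}; ${proc}$(quote_args "$@")"
}

# Example function to ship; it only reports what it received.
greet() {
    echo "argc=$# first=[$1]"
}

# The first argument contains spaces, a semicolon and a command
# substitution; all of it should arrive as literal text.
result=$(run_in_fresh_shell greet 'a b; $(echo nope)' second)
echo "$result"
```

The quoting step is the part that matters: without it the receiving shell would split the first argument and run the embedded command.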
2008-09-20 03:12 local FUNC=$(type "${PROC}" | sed -rn '2,$p') 2008-09-20 03:12 this line is the kicker 2008-09-20 03:12 I have a web server running in bash, yes 2008-09-20 03:13 ok 2008-09-20 03:13 we're officially outleeted 2008-09-20 03:13 although it cheats and uses xinetd to launch itself 2008-09-20 03:13 since I can't figure out how to do listens in pure bash 2008-09-20 03:13 our daemons listen on pipes, but of course you need socks 2008-09-20 03:14 I can listen on a nc pipe 2008-09-20 03:14 they do spawn other daemons 2008-09-20 03:14 but that makes it single threaded 2008-09-20 03:14 so only one outstanding request 2008-09-20 03:14 right 2008-09-20 03:14 that was a pain 2008-09-20 03:14 oh, I also had a proxy running in bash 2008-09-20 03:14 somebody wrote a little bit of c to do nonblocking read 2008-09-20 03:14 I felt that was cheating 2008-09-20 03:15 yup 2008-09-20 03:15 anyway my proxy is 4K bash code 2008-09-20 03:15 I need to test the code 2008-09-20 03:15 before I sleep 2008-09-20 03:15 mostly debug and strings really 2008-09-20 03:15 test this dleaf walker 2008-09-20 03:15 it's been kind of a block for me 2008-09-20 03:15 unfun 2008-09-20 03:15 okay, I'm going to bed, since I have to get up at 9 2008-09-20 03:15 needs to be done 2008-09-20 03:15 is in the way of finishing extents 2008-09-20 03:15 you enjoy your testing... 2008-09-20 03:16 I won't, but the chat was fun 2008-09-20 03:16 and bh didn't share his idea with us ;-( 2008-09-20 03:16 maybe next time 2008-09-20 03:17 8910 flips 2008-09-20 03:17 3261 MaZe 2008-09-20 03:17 I think those cut'n'pastes should classify as cheating 2008-09-20 03:17 your ratio is catching up 2008-09-20 03:17 oh yeah 2008-09-20 03:17 well most of my chat is cut and paste too 2008-09-20 03:17 I've said this all before ;) 2008-09-20 03:18 you switched to another username to let me catch up 2008-09-20 03:18 oh right 2008-09-20 03:18 you just hit 10K 2008-09-20 03:19 anyway, enough is enough... 
2008-09-20 03:19 good night and good testing 2008-09-20 03:19 good night 2008-09-20 03:22 ah! 2008-09-20 03:22 brilliance 2008-09-20 03:23 the OP_OR will actually be the name of a function 2008-09-20 03:23 it will simply be a callback 2008-09-20 03:23 find 2008-09-20 03:23 fine 2008-09-20 03:23 anyway back to bed ;-) 2008-09-20 03:23 heh 2008-09-20 03:23 just like me 2008-09-20 03:24 dleaf probe seems to be working 2008-09-20 03:24 wish I'd tested it earlier 2008-09-20 03:24 could have moved on 2008-09-20 03:25 whoops, nope 2008-09-20 07:07 -!- amey(~amey@116.73.35.180) has joined #tux3 2008-09-20 07:10 -!- amey(~amey@116.73.35.180) has left #tux3 2008-09-20 09:29 -!- BSD(~bandan@38.117.250.152) has joined #tux3 2008-09-20 11:19 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-20 12:01 I copied over the patches from Daniel's tree. Until he fixes cloning, feel free to clone it from here: git clone git://makefile.in/tux3fs 2008-09-20 12:02 I will update it with patches from his tree regularly 2008-09-20 14:19 folks 2008-09-20 14:33 bsd, thanks 2008-09-20 14:38 flipz: np 2008-09-20 14:38 ACTION found a nice weekend to look around the tux3fs code 2008-09-20 15:50 BSD: yeah me too, I'm finally getting a chance to look at the lowlevel fuse api 2008-09-20 15:51 it is a nice weekend indeed 2008-09-20 15:51 i got sidetracked with a memory leak in inodetest last night 2008-09-20 15:51 and then fell asleep 2008-09-20 15:52 its strange if i put the main return right after make_tux3 it leaks even more memory 2008-09-20 15:52 someone else can chew on that one 2008-09-20 15:52 it's a known behavior of valgrind 2008-09-20 15:53 oh? 2008-09-20 15:53 has something to do with c library cleanup 2008-09-20 15:53 so not a bug? 
2008-09-20 15:53 whether or not _exit gets called or something 2008-09-20 15:53 not sure 2008-09-20 15:53 yeah i didnt find anything 2008-09-20 15:53 maybe not a bug 2008-09-20 15:53 I used to know and I forgot 2008-09-20 15:54 vtable inode was getting detected as leaked 2008-09-20 15:54 ah 2008-09-20 15:54 but if i comment initialized it, other shit get detected 2008-09-20 15:54 (the next malloc) 2008-09-20 15:54 hmm 2008-09-20 15:54 maybe its because of some memmove or something 2008-09-20 15:54 confusing valgrind 2008-09-20 15:55 nice to know how valgrind works 2008-09-20 15:55 vtable is the only inode that doesn't actually get used 2008-09-20 15:55 I should remove it 2008-09-20 15:55 yeah thats why i figured just comment it out 2008-09-20 15:55 but it didnt help ;) 2008-09-20 15:55 it's out of the design already 2008-09-20 15:55 posted a nice troll to lkml over that and nobody bit 2008-09-20 15:56 yeah i was notcing we still have a version.c or something too 2008-09-20 15:56 i dont buy your argument 2008-09-20 15:56 wanted to get a discussion going re whether that stuff should be in the filesystem or the volume manager 2008-09-20 15:56 well 2008-09-20 15:56 discuss or shut up ;) 2008-09-20 15:56 er i mean volume.c 2008-09-20 15:56 you see a use for subvolumes? 2008-09-20 15:56 I don't any more 2008-09-20 15:57 funny, we started from the opposite positions 2008-09-20 15:57 sharing free space? 2008-09-20 15:57 right 2008-09-20 15:57 useless 2008-09-20 15:57 super useful 2008-09-20 15:57 they should share volume free space 2008-09-20 15:57 only 2008-09-20 15:57 it allows you to compartmentalize 2008-09-20 15:57 you can do the same with volumes 2008-09-20 15:58 share free space? 
2008-09-20 15:58 yes 2008-09-20 15:58 lvm3 2008-09-20 15:58 lvm2 even, though lame 2008-09-20 15:59 so your solution is to punt to the block layer 2008-09-20 15:59 yes 2008-09-20 16:00 layering 2008-09-20 16:00 there's no compelling argument to do otherwise 2008-09-20 16:00 the strongest argument is, perhaps you could get finer grained free space sharing within the fs 2008-09-20 16:00 but that's bad for performance 2008-09-20 16:00 see the fsync story 2008-09-20 16:00 also matt's comments 2008-09-20 16:00 yeah 2008-09-20 16:00 hmm 2008-09-20 16:00 his granularity is 128 MB, ours should be too 2008-09-20 16:01 since zfs has such lame quota support what people do is create one volume per user for example 2008-09-20 16:01 /home/user is a volume 2008-09-20 16:01 and let them all share free space 2008-09-20 16:02 lets do quots right then 2008-09-20 16:02 so if you only had a /home volume 2008-09-20 16:02 all the users would share free space anyway i suppose 2008-09-20 16:03 let's do per directory quota, with hardlinks between them disabled 2008-09-20 16:03 hm 2008-09-20 16:03 the rule is: a directory with quota may not have hard links into it 2008-09-20 16:03 just can't do that boz 2008-09-20 16:04 sounds reasonable 2008-09-20 16:04 you mean out of it 2008-09-20 16:05 the direction is ambiguous 2008-09-20 16:05 not in or out 2008-09-20 16:05 just like a home directory 2008-09-20 16:05 well 2008-09-20 16:05 are there any hardlinked things in home directories? 2008-09-20 16:05 not usually. 2008-09-20 16:05 lets see 2008-09-20 16:06 is there a way to perl that? 2008-09-20 16:06 maybe with the help of locate? 2008-09-20 16:06 challenge 2008-09-20 16:07 find 2008-09-20 16:07 ? 2008-09-20 16:07 find -links 2 ! -type d 2008-09-20 16:07 how do you reject internal links? 
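The quota rule above only holds if no hard link crosses the directory boundary. A rough sketch of one way to audit an existing tree, assuming GNU find and GNU stat: for every multiply-linked file, compare the inode's link count with the number of names found inside the directory; a mismatch means at least one name lives outside. The function name and the demo tree are invented for illustration.

```bash
#!/usr/bin/env bash
# Sketch: print inodes under a directory that also have names outside it.
# Assumes GNU find (-printf). crossing_inodes is a hypothetical name.
crossing_inodes() {
    local dir=$1
    # Each multiply-linked non-directory prints "inode nlink" once per name.
    # uniq -c prepends how many names were seen inside $dir, giving
    # "count inode nlink". If count != nlink, a link exists outside $dir.
    find "$dir" -links +1 ! -type d -printf '%i %n\n' \
        | sort | uniq -c | awk '$1 != $3 { print $2 }'
}

# Demo tree: one link pair entirely inside, one pair crossing the boundary.
top=$(mktemp -d)
mkdir "$top/home" "$top/other"
echo a > "$top/home/internal"
ln "$top/home/internal" "$top/home/internal2"
echo b > "$top/home/crossing"
ln "$top/home/crossing" "$top/other/outside"

found=$(crossing_inodes "$top/home")
expected=$(stat -c %i "$top/home/crossing")
rm -rf "$top"

echo "crossing inode: $found (expected $expected)"
```

Only the inode number comes out, so a second pass such as `find -inum` is needed to turn it back into a path.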
2008-09-20 16:08 cant 2008-09-20 16:08 would be -links +1 i think 2008-09-20 16:08 "more than 1 link" 2008-09-20 16:08 hard to find those hard links 2008-09-20 16:08 needs help from the indexer 2008-09-20 16:08 locate 2008-09-20 16:08 somehow 2008-09-20 16:09 hmm 2008-09-20 16:09 no 2008-09-20 16:09 i'm not even finding any internal links 2008-09-20 16:09 it's much harder 2008-09-20 16:09 have to list by inode 2008-09-20 16:09 and check everything outside by inode 2008-09-20 16:10 I don't know of any common use for hardlinks across home dir boundaries 2008-09-20 16:10 no its not hard 2008-09-20 16:10 you certainly can't have them if the home dir is a separate volume like zfs 2008-09-20 16:10 just print out nlinks and inode number 2008-09-20 16:10 well hrm 2008-09-20 16:11 you can ls the inodes inside and outside and intersect 2008-09-20 16:11 well 2008-09-20 16:11 the intersection has to be empty 2008-09-20 16:11 translate into simple bash ;) 2008-09-20 16:12 there is a way 2008-09-20 16:14 find, excluding the home dir, recursive list all inodes 2008-09-20 16:14 then recursive list the inodes of the dir 2008-09-20 16:14 find -links +1 ! -type d -printf "%i %n\n" | sort | uniq -c | awk '($1!=$3) {print}' 2008-09-20 16:14 no inode in the second is allowed to be in the first 2008-09-20 16:14 onliner boo yeah 2008-09-20 16:14 oneliner* 2008-09-20 16:14 it works? 2008-09-20 16:14 duh 2008-09-20 16:14 i wrote it 2008-09-20 16:15 of course it does 2008-09-20 16:15 but does it work the way we hope 2008-09-20 16:15 it only finds files which are externally linked 2008-09-20 16:15 or link to external files 2008-09-20 16:15 same thing really 2008-09-20 16:15 yes 2008-09-20 16:15 symmetric 2008-09-20 16:15 it only prints the inode number unfortunately 2008-09-20 16:16 did it find any? 2008-09-20 16:16 so you have to do another pass and find which file is the offender 2008-09-20 16:17 why won't it find hardlinks within a directory? 2008-09-20 16:19 find -links +1 ! 
-type d -printf "%i %n\n" | sort | uniq -c | awk '($1!=$3) {print $2}' | xargs --no-run-if-empty -n 1 find -inum 2008-09-20 16:19 that does the pass to translate inodes to filenames 2008-09-20 16:19 why won't it find hardlinks within a directory? <- still wondering 2008-09-20 16:19 ok 2008-09-20 16:20 for the first find 2008-09-20 16:20 it prints out the inode number and nlinks 2008-09-20 16:20 for each file with more than 1 link 2008-09-20 16:20 then the sort | uniq -c 2008-09-20 16:20 counts how many times each occurs 2008-09-20 16:20 oh right 2008-09-20 16:20 yes 2008-09-20 16:20 good 2008-09-20 16:20 nice 2008-09-20 16:20 then the awk compares 2008-09-20 16:20 and prints out the inode number if they dont match 2008-09-20 16:20 i tested it it works 2008-09-20 16:20 wait 2008-09-20 16:21 it will still pick up hardlinks strictly within the dir 2008-09-20 16:21 no 2008-09-20 16:21 it will not 2008-09-20 16:21 oh right 2008-09-20 16:24 got it 2008-09-20 16:24 you look for number of hard links greater than number of occurances within the dir 2008-09-20 16:25 sweet 2008-09-20 16:25 or != 2008-09-20 16:25 no, can't be < 2008-09-20 16:25 but you check for less anyway ;) 2008-09-20 16:25 hoping to find disk corrpution I assume 2008-09-20 16:31 getting close to sk8 oclock 2008-09-20 16:40 i skated this morning 2008-09-20 16:41 and? 2008-09-20 16:41 so no more for me 2008-09-20 16:41 lamer 2008-09-20 16:41 got other stuff to do 2008-09-20 16:41 like... figure out the fuse stuff 2008-09-20 16:41 they all say that 2008-09-20 16:41 oh right 2008-09-20 17:40 so what's the difference between volumes and subvolumes? 
i'm thinking subvolumes might be helpful for some security partitioning 2008-09-20 17:43 bushman, a subvolume shares nothing but the allocation space with another subvolume 2008-09-20 17:43 since it can share allocation space, it goes in the opposite direction of security partitioning 2008-09-20 17:44 what you want are real volumes 2008-09-20 17:44 which is another reason I feel it is right to drop subvolumes 2008-09-20 17:44 so to have separate partitions for data with different labels, we'd just have regular volumes, or use traditional partitions? 2008-09-20 17:45 yes 2008-09-20 17:45 any drawback? 2008-09-20 17:45 well wait 2008-09-20 17:45 we're going to be able to separate the allocation space for data within the same namespace 2008-09-20 17:46 and we're going to be able to separate namespaces 2008-09-20 17:46 now a multimillion dollar question: could i make them transparent? as in if i'm a user with two levels, can i see both partitions/volumes on top of each other? 2008-09-20 17:46 just not using the subvolume idea 2008-09-20 17:46 yes 2008-09-20 17:46 that's the plan 2008-09-20 17:46 do you know where i'm going with this? i want polyinstantiated directories 2008-09-20 17:47 try again with words of one syllable 2008-09-20 17:47 that might be more namespacing tricks tho, i'm not sure how you'd design it 2008-09-20 17:47 with namespacing tricks 2008-09-20 17:47 I'm working on it 2008-09-20 17:48 hierarchically inherited namespaces to be exact 2008-09-20 17:48 much like the versioning model 2008-09-20 17:48 except it's not versions, it's namespaces 2008-09-20 17:49 I suspect that amounts to polyinstantiated directories 2008-09-20 17:49 they look different depending on who you are 2008-09-20 17:49 i basically want a /home/marcin directory, and if i got let's say S and TS labels then i see both sets of files inside my home dir, but let's say i've been naughty and they pulled my TS, and I should be seeing only S files.
however the rest of my files shouldn't delete, just sit there for someone with a security dominating mine to pick them up or whatever 2008-09-20 17:50 yes 2008-09-20 17:50 you put it much more succinctly 2008-09-20 17:50 correct, that's the plan 2008-09-20 17:50 you put it in terms that don't require leaps of logic 2008-09-20 17:51 awesome, as long as the files are stored on different physical disks or partitions 2008-09-20 17:51 right 2008-09-20 17:51 namespace partitioning is one of two forms of partitioning we have in mind 2008-09-20 17:51 the other is physical data 2008-09-20 17:51 there's a lot of discussion lately if one disk with different encryptions should be treated as equivalent to multiple disks 2008-09-20 17:52 partitioned onto different volumes according to the class of data, and the filesystem amalgamates those volumes into a... filesystem 2008-09-20 17:52 to be more precise, the volume manager amalgamates those volumes, but the filesystem knows the layout 2008-09-20 17:53 and does data allocation accordingly 2008-09-20 17:53 this differs somewhat from the zfs model 2008-09-20 17:53 which takes the task of amalgamation into itself 2008-09-20 17:54 tux3 sitting on lvm3 will just want to look at the lvm's mapping table 2008-09-20 17:54 and be able to specify how the lvm should change that table 2008-09-20 17:55 so that it can provision itself with as much of the different kinds of storage as it needs, in the places it wants it 2008-09-20 17:55 so you guys wanna utilize the lvm underneath? i thought the whole idea of zumastor is to eliminate it?
2008-09-20 17:55 rewrite the lvm 2008-09-20 17:55 that will be lvm3 2008-09-20 17:55 but we can get by with lvm2 2008-09-20 17:55 it just sucks for adminning 2008-09-20 17:55 everything manual 2008-09-20 17:55 we want automatic 2008-09-20 17:56 sounds great 2008-09-20 17:56 that's what we think :) 2008-09-20 17:56 it's one of those itches 2008-09-20 17:56 multi year itch 2008-09-20 17:56 if i can demo any of this, in however raw form, i will definitely get some raised eyebrows 2008-09-20 17:57 I think we can get an early demo, yes 2008-09-20 17:57 we'll do the provisioning manually, using the existing lvm 2008-09-20 17:57 and the fs will proceed to partition data as promised 2008-09-20 17:57 partitioning namespace requires more effort 2008-09-20 17:57 we'd have to find a way of bootstrapping that project wise 2008-09-20 18:00 so how would you deal with union'ed namespaces? 2008-09-20 18:00 if i got /home/marcin/attackatdawn(TS) and another one at (S), how would it show? 2008-09-20 18:01 and more importantly, how does a user pick which one they're dealing with? 2008-09-20 18:02 when you have high security clearance you have the option of covering up a lower clearance file by creating one of the same name 2008-09-20 18:02 would it internally be stored as /home/marcin/TS/file and /home/marcin/S/file or something wackier than that? 2008-09-20 18:03 a tag goes on to the beginning of the filename internally 2008-09-20 18:03 define 'covering up' 2008-09-20 18:03 and is part of the namespace lookup 2008-09-20 18:03 covering up means: by default you will get EXIST, but you can override that and create a new entry that overrides the old one 2008-09-20 18:04 obviously, it is best not to cover up a lower security file 2008-09-20 18:04 override as in delete the old one? 2008-09-20 18:04 but you can if you want 2008-09-20 18:04 no, as in both exist 2008-09-20 18:04 override as in cover up the old one. 
It will reappear if you delete yours 2008-09-20 18:04 but only the highest security you have access to can be read 2008-09-20 18:04 so i cannot access both at the same time? 2008-09-20 18:04 right 2008-09-20 18:04 no 2008-09-20 18:04 you could access the S if the TS privs were dropped 2008-09-20 18:04 if you want that, log in as two people or don't make the names collide 2008-09-20 18:05 you'd usually be doing this anyway to store junk misdirecting data at a lower sec level 2008-09-20 18:05 if i wanted to have multiple users for multiple levels of security, i'd just stick to regular systems, not building an MLS one 2008-09-20 18:05 bushman, we could fiddle around with the idea and make both visible through some messed up name syntax, it's hard to see why that would be better though 2008-09-20 18:06 agreed 2008-09-20 18:06 If you're going to access it via diff names 2008-09-20 18:06 why not give it diff names to begin with? 2008-09-20 18:06 i need to be able to access all versions if my security dominates labels on multiple files 2008-09-20 18:06 rename TS 2008-09-20 18:06 access S 2008-09-20 18:06 that's why it's polyinstantiated 2008-09-20 18:06 rename TS back to old name 2008-09-20 18:06 yes 2008-09-20 18:07 Bushman: you're making this sound like resource forks... 2008-09-20 18:07 bushman, the problem is, you start drifting away from unix semantics 2008-09-20 18:07 no, the idea is to have multiple files with same name at different levels and access them all as long as i'm cleared 2008-09-20 18:07 but if they have the same name.. 2008-09-20 18:07 bushman, ok, could you show an example of accessing two of them? 2008-09-20 18:08 what command would you write?
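The cover-up rule described above (a tag in front of the filename, only the highest level you have access to is read, and the lower entry reappears when the higher one is deleted) can be modelled in a few lines. This is a toy in userspace, not tux3 code: the `<level>:<name>` on-disk layout, the integer levels (0 for S, 1 for TS) and the function names are all assumptions made for the demo.

```bash
#!/usr/bin/env bash
# Toy model of label-tagged directory entries with "cover up" lookup.
# Layout "<level>:<name>" and levels 0 = S, 1 = TS are invented here.
store=$(mktemp -d)

# create LEVEL NAME CONTENT: add an entry at the given level.
create() {
    printf '%s\n' "$3" > "$store/$1:$2"
}

# lookup CLEARANCE NAME: print the entry with the highest level that is
# at or below the caller's clearance; fail if none is visible.
lookup() {
    local level
    for ((level = $1; level >= 0; level--)); do
        if [ -e "$store/$level:$2" ]; then
            cat "$store/$level:$2"
            return 0
        fi
    done
    return 1
}

create 0 attackatdawn "decoy"
create 1 attackatdawn "real plan"

ts_view=$(lookup 1 attackatdawn)       # TS clearance: covering entry wins
s_view=$(lookup 0 attackatdawn)        # S clearance: only the S entry
rm "$store/1:attackatdawn"             # delete the TS entry
after_delete=$(lookup 1 attackatdawn)  # the S entry shows through again
rm -rf "$store"

echo "$ts_view | $s_view | $after_delete"
```

In this model a TS user cannot reach the S entry while the TS one exists, which is exactly the limitation being argued about in the chat.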
2008-09-20 18:08 oh i know, that's why i came to you guys with this, it's out there stuff 2008-09-20 18:08 I guess you could suffix the filenames with something like /path/file#S to override access to /path/file (TS) into /path/file (S) 2008-09-20 18:08 but we'd need to waste a character 2008-09-20 18:08 for the # symbol 2008-09-20 18:08 that's not unix though 2008-09-20 18:08 ok, let's make a user marcin with labels of s0,s1 2008-09-20 18:08 agreed 2008-09-20 18:09 bushman, we can apply the versioned symlink idea 2008-09-20 18:09 that lets you access files from different versions, on the same version 2008-09-20 18:09 so in this case, you'd have a privileged symlink to the directory you're in 2008-09-20 18:10 i need to have files /supersecretcrap (s0) and /supersecretcrap (s1), and i should be able to pick either one to work on 2008-09-20 18:10 when you read the directory through the symlink, you see one view, and a different view if you read it directly 2008-09-20 18:10 so why not call them /s0_supersecretcrap and /s1_supersecretcrap? 2008-09-20 18:10 privileged symlink? what's that? 2008-09-20 18:10 bushman, new invention 2008-09-20 18:10 just now 2008-09-20 18:10 based on the idea of versioned symlink I have written about 2008-09-20 18:11 Maciek, that's not my call, that's what the bigwigs in bunkers want ;) 2008-09-20 18:11 only works if the symlink is parsed at the sub-vfs layer - or if we muck around with the vfs 2008-09-20 18:11 maze, that is the plan 2008-09-20 18:11 sub-vfs 2008-09-20 18:11 we will need to extent some syscalls 2008-09-20 18:11 extend 2008-09-20 18:11 so to apps, does it look like a symlink? 2008-09-20 18:11 one syscall, actually, ln 2008-09-20 18:12 it looks like a symlink yes 2008-09-20 18:12 why extend the syscall? can't we make the data stored in a symlink (it's binary remember) suffice? 2008-09-20 18:12 oh so you want a file /secretcrap with multiple symlinks to it called s0, or s1?
2008-09-20 18:12 no, to the entire directory 2008-09-20 18:12 ln needs to know how to create one, until it does we provide our own utility 2008-09-20 18:12 to create one of these 2008-09-20 18:13 bushman, yes 2008-09-20 18:13 ln -s 'binary_blob' filename 2008-09-20 18:13 including no symlink 2008-09-20 18:13 maze, possibly ;) 2008-09-20 18:13 I don't like this 2008-09-20 18:13 it's dirty 2008-09-20 18:13 that'd work, cuz you can still work on it as a normal file...hmm 2008-09-20 18:13 make, your idea? 2008-09-20 18:13 versioned symlinks is clean 2008-09-20 18:13 what's the problem? 2008-09-20 18:14 it's fine for versioning 2008-09-20 18:14 I'm not sure I like it for this s* crap 2008-09-20 18:14 don't have to use it 2008-09-20 18:14 with versioning you get a view from the past that's not mutatable 2008-09-20 18:14 flips, one question: would the links be created dynamically when requesting info on a file, or would they actually be laying around at all times for people to use? 2008-09-20 18:14 maze, not so 2008-09-20 18:14 versions are all rw 2008-09-20 18:15 oh, right 2008-09-20 18:15 bushman, they'd be lying around, or you'd create them if you have the right clearance 2008-09-20 18:15 sun is getting lower 2008-09-20 18:16 I need to hustle out if I'm going to get to the strand before nightfall 2008-09-20 18:17 so would a regular user see /file /filesecret /filetopsecret with the last two being links, or the real /file would be not visible, only the links would be visible? 
2008-09-20 18:18 I believe a regular user would not see anything 2008-09-20 18:18 since he'd have no clearance 2008-09-20 18:18 the regular users would only see /file 2008-09-20 18:18 well 2008-09-20 18:18 i'm trying to prevent mistakes from mindlessly pickign a wrong link/file, we all know how unaware of ownership/permissions most people are 2008-09-20 18:18 if file is low clearance 2008-09-20 18:18 no no, regular user i meant not root ;) 2008-09-20 18:19 you don't even see the links unless they exist at or below your level 2008-09-20 18:19 not a user that's cleared only to unclass 2008-09-20 18:19 same with the files 2008-09-20 18:19 I believe unclass user would see zlich 2008-09-20 18:19 yes 2008-09-20 18:19 secret user would see file and filesecret 2008-09-20 18:19 topsecret would see all 3 2008-09-20 18:19 yes 2008-09-20 18:19 if you see filesecret or/and filetopsecret it's a symlink 2008-09-20 18:19 maze, I think /file is supposed to be unsecret 2008-09-20 18:20 oh, I don't know, and that's one of the reasons I don't like this 2008-09-20 18:20 yea, that's just standard domination/lattice based security scheme, SElinux does it for you behind the scenes 2008-09-20 18:20 wait till i throw in compartments into TS ;) 2008-09-20 18:21 bushman, it would not make sense to create a security link for a high clearance level, so that a low clearance person can see it 2008-09-20 18:21 they're just their for the convenience of the supersecret spooks 2008-09-20 18:21 so nobody less secret has to see them 2008-09-20 18:21 yes of course 2008-09-20 18:22 i dunno if shap talked to you about the whole gaugin-messenger non-interference model yet 2008-09-20 18:22 i spent like 2hrs talking him through the subtleties of that and general 'need to know' stuff 2008-09-20 18:23 bushman, you know what is really cool about this? 
it all happens far away from the vfs, which never gets to see it 2008-09-20 18:23 which makes it harder to subvert 2008-09-20 18:23 that's the goal 2008-09-20 18:24 otoh 2008-09-20 18:24 this is precisely what the vfs should be dealing with... 2008-09-20 18:24 you mean the 'visibility' issues? 2008-09-20 18:24 yes, all of it 2008-09-20 18:25 there is no benefit to doing this below the vfs, except for code duplication across fs'es 2008-09-20 18:25 and multiple opportunities to screw it up 2008-09-20 18:25 and If someone compromises the vfs, they've got your kernel and your fs drivers as well 2008-09-20 18:25 other than political reasons of people saying 'no' simply because noone else but tux using these features 2008-09-20 18:25 [unless we're not talking about linux here] 2008-09-20 18:26 i know Daniel got a lot of pull, but i dunno how much we can force down these people's throats for the sake for esoteric security features 2008-09-20 18:26 ah, yes, politics... 2008-09-20 18:27 so for political reasons we might have to smuggle this shit under tux's branch so the rest of people dont have any say in it 2008-09-20 18:27 thing is this should be done with xattr and something selinux like 2008-09-20 18:28 yes, SElinux does it all on xattr, that's how we get all the functionality i need 2008-09-20 18:28 so what are you missing? 2008-09-20 18:28 multiple clearance levels for diff files with the same name? 2008-09-20 18:28 maze, I don't know if this belongs in vfs 2008-09-20 18:29 maze, maybe in ten years 2008-09-20 18:29 polyinstantiation is the huge goal to get Linux to be a true MLS system 2008-09-20 18:29 yes, and we can do it at the filesystem level so we should 2008-09-20 18:29 cut out as much bs as possible 2008-09-20 18:29 there is no existing model 2008-09-20 18:29 if we have this, govt/mil wont even look at solaris trusted extensions anymore 2008-09-20 18:29 why is polyinstantiation such a big feature? 
2008-09-20 18:30 versioning is already a kind of polyinstantiation 2008-09-20 18:30 yes, true 2008-09-20 18:31 that's a fair point that we get it almost for free... 2008-09-20 18:31 nearly 2008-09-20 18:31 except for storing different sec levels on different back end devices 2008-09-20 18:31 it's the buliding block of MultiLevelSecurity, which ends up giving you a higher classification of a system, so we'll be applicable to go into crazier installations like subs, planes, and other deep dark holes 2008-09-20 18:31 linux in subs? 2008-09-20 18:31 my god 2008-09-20 18:31 maze, that's where we just give them the xattrs 2008-09-20 18:31 that's almost reason enough to not do this 2008-09-20 18:31 HAVE YOU SEEN THE CODE? 2008-09-20 18:31 ;-) 2008-09-20 18:31 bushman, ooh 2008-09-20 18:31 not yet, that's why we're pushing for more functionality ;) 2008-09-20 18:32 ok I'm going to actually read the wikipedia article now 2008-09-20 18:32 but after skating 2008-09-20 18:32 sun is getting critically low 2008-09-20 18:32 oh you wont find much on this on the internet 2008-09-20 18:32 the linux code base sucks, I wouldn't want it anywhere near anything life-critical or nuclear 2008-09-20 18:32 well at least the unclass portion of the internet ;) 2008-09-20 18:33 the classified portions of the internet are predominantly porn and warez ;-) 2008-09-20 18:33 what do you want instead? trusted solaris 8? 2008-09-20 18:33 maze, so you want something worse there? 
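[editor's note] The visibility rule discussed above ("you don't even see the links unless they exist at or below your level") combined with polyinstantiation (one name, one instance per level) can be sketched roughly as below. This is only a toy model: the level names, their ordering, and the PolyDir class are hypothetical, and nothing here is taken from tux3 or SELinux code.

```python
# Hypothetical sketch: level names and their ordering are assumptions.
LEVELS = {"unclass": 0, "secret": 1, "topsecret": 2}


def dominates(clearance, label):
    """True if a subject cleared at `clearance` may see an object labelled `label`."""
    return LEVELS[clearance] >= LEVELS[label]


class PolyDir:
    """Toy directory in which one name may have one instance per security level."""

    def __init__(self):
        self.entries = {}  # (name, label) -> contents

    def create(self, name, label, contents):
        self.entries[(name, label)] = contents

    def listing(self, clearance):
        """Return the (name, label) pairs visible at this clearance.

        Instances above the caller's level are simply absent from the result.
        """
        return sorted(key for key in self.entries if dominates(clearance, key[1]))

    def lookup(self, name, label, clearance):
        """Open a specific instance; behave as if it does not exist when not cleared."""
        if not dominates(clearance, label) or (name, label) not in self.entries:
            raise FileNotFoundError(name)
        return self.entries[(name, label)]


d = PolyDir()
d.create("/supersecretcrap", "secret", "secret contents")
d.create("/supersecretcrap", "topsecret", "top secret contents")
print(d.listing("unclass"))    # nothing visible
print(d.listing("topsecret"))  # both instances of the same name
```

A subject cleared to "topsecret" can pick either instance by label, which is the "pick either one to work on" requirement; compartments are deliberately left out of this sketch.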
2008-09-20 18:34 I'd always thought the military had something home-grown, and tiny for the really important stuff 2008-09-20 18:34 this quickly becomes a question of lesser evils 2008-09-20 18:34 maze, teehee 2008-09-20 18:34 you know sucky performance, runs on well bug-cleared 486 cores 2008-09-20 18:34 bwahaha, i went to military gradschool with a bunch of officers, with very few exceptions they couldnt code their way out of a paper bag 2008-09-20 18:35 maze, from my skydiving years I remember when the miltary gave up deveoping and just bought sport chutes 2008-09-20 18:35 but has been gone through with a fine comb once a quarter for two dozen years 2008-09-20 18:35 that's disappointing 2008-09-20 18:35 this is what i'm trying to push, if we give them one big important piece, they should get off their ass and sponser massive code audits on the rest of linux 2008-09-20 18:35 I'm not aware of any commercial code that isn't bug-ridden 2008-09-20 18:36 hmm, I see 2008-09-20 18:36 now - that - seems like a worthwhile goal 2008-09-20 18:36 have they done that on solaris? 2008-09-20 18:37 here's the problem with 'secure' development: you end up hiregin people that have proper clearances, instead of people with mad skillz. but since it's all done behind well guarded doors, noone ever will know how much it sucks 2008-09-20 18:37 I wonder if the US is the only country doing anything serious in this area... I mean there's so many other countries, and I don't buy them all buying off of the us... 2008-09-20 18:37 clearances are a very good way to hide incompetence :/ 2008-09-20 18:37 I'm clear of clearances ;-) 2008-09-20 18:37 and none of these people will ever share 2008-09-20 18:37 cuz it's national security 2008-09-20 18:37 national insecurity rather 2008-09-20 18:38 clearances are a good way into a depression 2008-09-20 18:38 tell me about it :/ 2008-09-20 18:38 the more you learn, the more surprised you are you're (we're) still alive 2008-09-20 18:38 you ever seen dr. 
strangelove? 2008-09-20 18:39 might have, don't recall 2008-09-20 18:39 was that a bond movie? 2008-09-20 18:39 old 60s movie from Kubric, about atomic bomb and the doomsday device 2008-09-20 18:39 no don't recall 2008-09-20 18:40 then my jokes wouldnt mean much ;) 2008-09-20 18:40 ah 2008-09-20 18:40 well, the worst part is none of what you've written sounds like a joke... 2008-09-20 18:41 but in general it's a very strange environment, that's why i'm here, so we can bridge some of the well audited/tested code into domains that are usually very very separated from the rest of the world 2008-09-20 18:41 i was about to crack a joke, but realized it'd sound very goofy and out of place unless you know dr strangelove 2008-09-20 18:41 go ahead 2008-09-20 18:41 anyway 2008-09-20 18:42 nah, doesnt matter 2008-09-20 18:42 I'm assuming this means your a civilian contractor with necessary clearances working for some military/defense whatever arm of the us gov? 2008-09-20 18:43 not a contractor, actual govt worker 2008-09-20 18:43 Do the true military guys treat civilians like sh*t? 2008-09-20 18:43 some 2008-09-20 18:43 ah 2008-09-20 18:43 in gradschool we got a lot of it, cuz it was 300 military officers and 10 civillians 2008-09-20 18:43 so it's got all that (and more) beautiful politics to live with 2008-09-20 18:44 so why civilian then and not military? 2008-09-20 18:44 who, me? 2008-09-20 18:44 in general, if almost everybody is military, why the pslit? 2008-09-20 18:44 ah, whatever, disregard... 2008-09-20 18:45 I'm not sure I even want to know how it all works ;-) 2008-09-20 18:45 or whether it works 2008-09-20 18:45 you dont, it's depressing shit 2008-09-20 18:45 so, Marcin - Polish roots? 2008-09-20 18:46 100% 2008-09-20 18:46 heh 2008-09-20 18:46 that's more than me then (unless I treat my roots as adding up to > 100) 2008-09-20 18:46 i thought you had a site from UJ 2008-09-20 18:47 I do 2008-09-20 18:47 but you know... 
standard story: 2008-09-20 18:47 oh that's weird 2008-09-20 18:47 conceived in France, born in Britain, grew up in Poland, kindergarten - grade 8 in Canada, high school and university in Poland, now working in California... 2008-09-20 18:48 weird stuff ;-) 2008-09-20 18:48 but, yeah, my Family's Polish, just done a lot of travelling. 2008-09-20 18:48 no wonder you end up working with shap, most of his friends in chicago were pollacks ;) 2008-09-20 18:49 Shap's in Santa Monica though, I'm in Mountain View 2008-09-20 18:49 well, work as in on tux 2008-09-20 18:49 I don't think I've ever actually met Shapor 2008-09-20 18:49 he's a fantastic character, love him dearly 2008-09-20 18:49 scary smart but without ego, which is strange 2008-09-20 18:50 this is growing to be a great crew 2008-09-20 18:50 yeah, way better than the other way round (ego without smarts) 2008-09-20 18:50 i wish i could code anywhere near what's needed here 2008-09-20 18:50 coding isn't actually the problem 2008-09-20 18:50 coding is trivial 2008-09-20 18:51 not to me, i'm a codeing retard 2008-09-20 18:51 the real problems are figuring out what the interfaces of the rest of the kernel are and how to obey them 2008-09-20 18:51 and what algos and data structures to use for what purpose 2008-09-20 18:51 and where 2008-09-20 18:51 yea that's a problem trying to merge yourself into a huge prexisting infrastructure 2008-09-20 18:52 there is actually very little coding and code involved ;-) 2008-09-20 18:52 i did little kernel coding in minix in gradschool, but that's it, what you guys are talking about gives me a headache 2008-09-20 18:53 the real problem for me at least - is the terrible lack of any documentation 2008-09-20 18:53 i'm just trying to steer it in the right direction, i'm more of a lobbyst/fanclub ;) 2008-09-20 18:53 and the documentation that is present is often either wrong or partial or outdated 2008-09-20 18:53 for the interfaces you mean? 
2008-09-20 18:53 yeah 2008-09-20 18:54 i'm setting up lxr for us 2008-09-20 18:54 cool 2008-09-20 18:54 trying to do some support work 2008-09-20 18:54 you're somewhere in south carolina right? 2008-09-20 18:54 it's mostly set up, but i did it wtihout the free text searches, and Daniel wanted it, so i gotta redo a big chunk 2008-09-20 18:54 yea, near Charleston 2008-09-20 18:54 oh, sad. 2008-09-20 18:55 eh no problem, gotta refresh myself on linux sysadmining, i'm kinda rusty, school and new job took me out of commision for 3yrs 2008-09-20 18:55 let's look up Charleston on the map 2008-09-20 18:55 look where civilisation ends, and it's right on the border ;) 2008-09-20 18:56 that's true 2008-09-20 18:56 it's on the coast ;-) 2008-09-20 18:56 right next to hollywood 2008-09-20 18:56 I'll be passing through hollywood on halloween 2008-09-20 18:57 hollywood? 2008-09-20 18:57 the one with movies or some other one? 2008-09-20 18:57 the one in LA 2008-09-20 18:58 yea, i gotta go bug shap again, we need to stage a gettogether 2008-09-20 18:58 man, SC is weird 2008-09-20 18:58 'NO SHIT 2008-09-20 18:58 there doesn't seem to be anything there judging from the map 2008-09-20 18:58 swamps with aligators providing free security around military basis 2008-09-20 18:58 lol 2008-09-20 18:58 and ghettos with poor people 2008-09-20 18:59 been to Atlanta, hated the wather 2008-09-20 18:59 weather 2008-09-20 18:59 i aint kidding, my base has a pet aligator named Charlie, he's like 14ft ;) 2008-09-20 18:59 oh, 2008-09-20 18:59 you have your own base? 2008-09-20 18:59 wow, you're high up ;-) 2008-09-20 19:00 not my personal one ;) 2008-09-20 19:00 oh, is it just one of your holdings? 
2008-09-20 19:00 oh yea, i'm big pimpin it ;) 2008-09-20 19:01 eh, every time I look at a map of the east coast of the US, I realize how I bloody don't know what the hell it's like 2008-09-20 19:01 the west coast is so easy: 2008-09-20 19:01 seattle, portland, san francisco, los angeles, san diego 2008-09-20 19:01 me neither, i've lived in chicago and cali, east coast is uncharted territory to me 2008-09-20 19:01 and you're done ;-) 2008-09-20 19:02 which also happens to be the 5 states: 2008-09-20 19:02 washington, oregon, north california, south california and mexico 2008-09-20 19:02 oh, ok, maybe a little off there ;-) 2008-09-20 19:03 and you should probably start with Vancouver... 2008-09-20 19:06 so how did you pick up this systems stuff? i read logs from last night, and your bash foo is out there 2008-09-20 19:07 mostly self taught 2008-09-20 19:07 really 2008-09-20 19:08 people coming out of european universities know a lot more than most american counterparts 2008-09-20 19:08 not having a life really helps ;-) 2008-09-20 19:08 here college is so common to do, it's more of an extension of highschool 2008-09-20 19:08 yeah, but most of my skills are stuff I picked up on my own 2008-09-20 19:08 indeed I have a MSc in Physics 2008-09-20 19:09 but I actually dropped out of univ, and then got back in and took 7 years to do it 2008-09-20 19:09 oh i'm back to not having a life, i had a life and it got too dramatic, thus new school/job 2008-09-20 19:09 because CS and running my own ISP was more fun and challenging and interesting 2008-09-20 19:09 (before you ask - mini-ISP for like 300 people) 2008-09-20 19:10 hehe, a friend of mine in poland did UJ too, molecular chemistry or something, took like 6yrs 2008-09-20 19:10 here if you're smart you get out in 3yrs, shap did it, i did like 3.5 2008-09-20 19:10 kinda a joke 2008-09-20 19:11 didnt learn shit about computers in college, that was mostly about hacking my way into coeds panties 2008-09-20 19:11 hey, that is at 
least worthwhile ;-) 2008-09-20 19:11 not in a long run 2008-09-20 19:12 I know 2008-09-20 19:12 i learned more by hanging out with shap hacking shit till 4am 2008-09-20 19:12 ah, so the two of you went to the same uni? 2008-09-20 19:13 no, different highschools, universities, just kept hanging out 2008-09-20 19:13 ah 2008-09-20 19:13 I think a lot of the truth here is that people finish college 2008-09-20 19:13 not university 2008-09-20 19:13 we met cuz he was running a bbs in the neighbourhood (which mattered back in the day of cost being distance dependent) 2008-09-20 19:13 and they end up with bachelors, not masters 2008-09-20 19:13 there's a big difference 2008-09-20 19:13 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-09-20 19:14 in US the difference between college and university is very blurry 2008-09-20 19:14 doesnt really mean much 2008-09-20 19:14 degrees do matter, sometimes quite a bit 2008-09-20 19:14 a lot of the colleges etc are really just lame local schools 2008-09-20 19:15 that's 2yr degrees 2008-09-20 19:15 more different stuff 2008-09-20 19:15 we have the same thing in Poland - most higher ed schools suck, but then they're mostly not called univseristies, and most good ones are. 2008-09-20 19:15 i was talking about 4yr degree 2008-09-20 19:15 bachelor's is 2 years? since when? I thought bachelors was 3-4 2008-09-20 19:15 no no 2008-09-20 19:15 with masters being another 2 on top of that 2008-09-20 19:16 associates is 2yrs 2008-09-20 19:16 bachelor is 4yrs 2008-09-20 19:16 masters is another 2 on top of that 2008-09-20 19:16 depends on program/school, etc 2008-09-20 19:16 but that's the general rundown 2008-09-20 19:16 so associates is ike worthless right? 
2008-09-20 19:16 associates is worthless, that's like something you want if you want to be a cop 2008-09-20 19:17 bachelor is the normal 4yr degree 2008-09-20 19:17 cops aren't worthless though :-) 2008-09-20 19:17 oh yea, they're my favorite people 2008-09-20 19:18 was actually talking about cops and how hard a job they have at work yesterday 2008-09-20 19:18 some people do the 2yr associates and then transfer to 4yr programs to save money cuz the local 'community colleges' are usually very very cheap 2008-09-20 19:18 so doing the 2 in local comm col? 2008-09-20 19:18 yea, 2here and 2there 2008-09-20 19:19 you end up with a normal 4yr bachelor degree 2008-09-20 19:19 yeah, why education (higher) is so expensive in the US is something I never understood 2008-09-20 19:19 college can be very expensive so it's a viable technique if you dont want to bury yourself in debt 2008-09-20 19:19 oh i worked at a private university for 5yrs, i can explain that one ;) 2008-09-20 19:20 yeah, a car per year in expenses - cute for someone who isn't working... 2008-09-20 19:20 yea, my housemate is a smart dude, very hard working too, but comes from a poorass family, even with scholarships he racked up like 50+k in debt for undergrad 2008-09-20 19:21 but at least it's a possibility, if you got talent but no cash, it's still doable 2008-09-20 19:21 right 2008-09-20 19:21 it takes a long time to pay that off afterwards 2008-09-20 19:22 unless you end up landing an awesome job 2008-09-20 19:22 these days most colleges are 30k/yr+, which is insane, you start life at -$100,000 2008-09-20 19:22 true, but... 
2008-09-20 19:22 you should also be earning an extra 20/30k per year because of them 2008-09-20 19:22 over the course of your entire life it adds up 2008-09-20 19:22 still sucks of course 2008-09-20 19:23 eh, college loans are relatively low interest, i know people who are 40yrs old and multimillionare, but still pay off their school loans to earn good credit ;) 2008-09-20 19:23 lol 2008-09-20 19:24 eh, some people i went to undergrad with got 2-3 degrees and sell cellphones for a living 2008-09-20 19:24 geeks have it made comparing to others 2008-09-20 19:25 in what sense? 2008-09-20 19:25 income after college? 2008-09-20 19:25 it's hard for us to be unemployed 2008-09-20 19:25 true 2008-09-20 19:25 if you really got little talent and just wanna do basic IT work you still can make 40-50k/yr easily 2008-09-20 19:26 if you got any sort of brains you get to 70k/yr quickly 2008-09-20 19:26 I'm assuming you're talking about east coast here 2008-09-20 19:26 first year as a high school teacher in good district in chicago payed like 28k 2008-09-20 19:27 yea, cali is a bit different 2008-09-20 19:27 we start higher, and go up easily 2008-09-20 19:28 and it's difficult to be unemplyed 2008-09-20 19:28 I spend something like 15K+ a year just to rent a studio... 2008-09-20 19:28 and then add utilities on top of that. 
2008-09-20 19:28 heh, move to SC, you can buy a new house for 130k ;) 2008-09-20 19:29 I can buy half a studio in a sh*tty area for that 2008-09-20 19:29 oh i know, i lived in monterey for 2yrs, a shed on my block sold for 570k 2008-09-20 19:29 it was an old house the side of my current living room 2008-09-20 19:30 s/side/size/ 2008-09-20 19:30 right 2008-09-20 19:30 the california fixer upper for just slightly less than a million 2008-09-20 19:30 me and shap were thinking of moving to cali at the end of 2000, just before the market exploded 2008-09-20 19:31 imploded 2008-09-20 19:31 looked at prices of houses, median price was like 890k 2008-09-20 19:31 but the weather's nice ;-) 2008-09-20 19:31 and there's lots of job opportunities 2008-09-20 19:31 and interesting ones at that 2008-09-20 19:31 i'm a geek, i live in rooms in AC so there's no condensation on servers ;) 2008-09-20 19:32 the weather outside is irrelevant unless it's a hurricane 2008-09-20 19:32 I have an AC - haven't turned it on in 2+ years I've lived here 2008-09-20 19:32 truthfully didn't turn the heater on last winter either 2008-09-20 19:33 i havent turn OFF AC since march, it's 100F and 100% humidity at all times 2008-09-20 19:33 sh*tty gas stove, more trouble than it's worth 2008-09-20 19:33 south sucks weatherwise 2008-09-20 19:33 yep 2008-09-20 19:33 i understand now why they had slavery here, noone's gonna volunteer their ass to be outside 2008-09-20 19:34 lol... 2008-09-20 19:34 it's ahorrible joke, but it's true 2008-09-20 19:34 yeah 2008-09-20 19:34 the non-white folks are probably more used to the heat... 2008-09-20 19:35 and can thus better deal with it 2008-09-20 19:35 i have to mow my lawn at 8am, cuz by 9:30am it's too hot for my pasty white ass not to get scorched 2008-09-20 19:48 hm that hasn't happened in a while X was chewing up half my ram even after closing firefox 2008-09-20 20:44 hrm does readdir work for anyone using tux3fuse ? 
2008-09-20 20:44 (not tux3fs) 2008-09-20 21:38 folks 2008-09-20 23:55 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-21 00:39 -!- pranith(7aa040b1@webchat.mibbit.com) has joined #tux3 2008-09-21 00:42 -!- Bobby(~Bobby@122.160.64.177) has joined #tux3 2008-09-21 00:49 hello 2008-09-21 00:49 :) 2008-09-21 00:53 hi 2008-09-21 00:53 hey shapor 2008-09-21 02:45 hey 2008-09-21 02:45 ACTION is back from a night of clubing in SD 2008-09-21 02:45 lubbing 2008-09-21 02:45 bah 2008-09-21 02:45 clubbing 2008-09-21 03:35 walk->estop -= walk->group--->count; <- true code 2008-09-21 07:43 -!- Kirantpatil(~kiran@122.167.211.253) has joined #tux3 2008-09-21 07:44 -!- Kirantpatil(~kiran@122.167.211.253) has left #tux3 2008-09-21 08:22 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-21 10:02 -!- pgquiles(~pgquiles@250.Red-79-144-194.staticIP.rima-tde.net) has joined #tux3 2008-09-21 10:25 ok, we have a pretty nice dleaf streamwise reader now 2008-09-21 10:25 next tricky issue is writing into the dleaf 2008-09-21 10:26 tricky because we're writing into the middle of a big glob of extents 2008-09-21 10:26 possibly truncating some at the beginning and end of the range of interest 2008-09-21 13:33 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-21 13:53 -!- mback(~mback@netblock-68-183-189-239.dslextreme.com) has joined #tux3 2008-09-21 14:08 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-21 15:07 -!- ajonat(~ajonat@190.48.117.81) has joined #tux3 2008-09-21 15:39 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-21 16:50 flipzout: dwalk_pack? 
2008-09-21 16:52 shapor, email just sent out 2008-09-21 16:52 to tux3 list 2008-09-21 16:52 ah still didnt get it 2008-09-21 16:53 lets you poke extents into a dleaf one at a time, and it builds up the group and entry index as you go 2008-09-21 16:53 or it will when it works, which is far away 2008-09-21 16:53 heading out 2008-09-21 17:33 -!- kbingham(~kbingham@92.10.66.117) has joined #tux3 2008-09-21 17:38 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-21 19:54 -!- ajonat(~ajonat@190.48.117.81) has joined #tux3 2008-09-21 20:00 hey 2008-09-21 20:11 -!- BSD(~bandan@38.117.250.152) has joined #tux3 2008-09-21 21:55 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-21 22:17 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-22 01:48 yo 2008-09-22 04:13 7/3: 1000033 => 777; 1000044 => 888; 1000099 => 999; 2008-09-22 04:13 8/3: 3000044 => 444 666; 3000055 => 555; 3000000 => ; <- result of first dwak_pack attempt 2008-09-22 04:13 not bad, it actually added an entry 2008-09-22 04:13 messed it up, but. 
2008-09-22 04:14 but bug hunt commences 2008-09-22 04:21 8/2: 3000044 => 444 666; 3000055 => 555 123 456; <- 3rd try = good 2008-09-22 04:21 now some more broken cases 2008-09-22 04:25 3000044 => 444 666; 3000055 => 555; 3000056 =>; <- 4th attempt, hmm seem to be missing something 2008-09-22 04:33 8/3: 3000044 => 444 666; 3000055 => 555; 3000056 => 123; <- result of fixing an off by one 2008-09-22 04:33 next boundary: add a new group 2008-09-22 04:37 9/1: 4000000 =>; <- not bad for a first attempt 2008-09-22 04:37 also overwrote the 0th extent of the leaf with the new group descriptor ;) 2008-09-22 04:46 9/1: 4000123 => 123 0 0 0; <- closer 2008-09-22 04:46 few extra zeros came from somewhere 2008-09-22 04:46 hmm 2008-09-22 04:47 ah, walk extend base needs to be bumpbed for the new group 2008-09-22 04:47 extent base 2008-09-22 04:48 ...by the group count of the current group 2008-09-22 04:49 err, no 2008-09-22 04:49 by the amount of the most recent entry limit 2008-09-22 04:50 9/1: 4000123 => 123; <- correct 2008-09-22 04:50 this is too easy 2008-09-22 04:50 you lamers who didn't rise to my sunday afternoon challenge should hang your heads ;) 2008-09-22 04:50 going to be a funny lkml post about this 2008-09-22 04:52 couple more boundaries to check 2008-09-22 04:52 next one: group count overflow 2008-09-22 04:55 8/7: 3000044 => 444 666; 3000055 => 555; 3001001 => 1; 3001002 => 1; 3001003 => 1; 3001004 => 1; 3001005 => 1; 2008-09-22 04:55 9/1: 3001006 => 1; <- works great 2008-09-22 04:55 first time 2008-09-22 04:55 now what? 
2008-09-22 04:55 got to be more 2008-09-22 04:55 dwalk_pack looks too simple 2008-09-22 04:57 need to be able to add stuff in the middle of a dleaf, not just at the end I suppose 2008-09-22 04:57 though for now we can just re-append everything after the add point 2008-09-22 04:57 easy 2008-09-22 04:57 will serve for some time 2008-09-22 05:07 I know, I'll add some asserts on leaf full 2008-09-22 05:07 though of course it will never happen ;) 2008-09-22 05:18 ok, all properly and anally asserted 2008-09-22 05:18 now just about time to write dwalk_mock 2008-09-22 05:18 going to be the cute+funny subject of the lkml post 2008-09-22 05:19 hmm, maybe time for a checkin 2008-09-22 05:24 final score: 9 lines of the dwalk_pack prototype survived, 4 were changed, 12 were added not counting comments and asserts 2008-09-22 05:27 ok, dwalk_mock 2008-09-22 05:32 http://www.linuxtoday.com/infrastructure/2008092200135OSCY <- LOL 2008-09-22 05:32 really 2008-09-22 05:33 "Microsoft isn't the answer. Microsoft is the question. 'Linux' or 'No' is the answer." -- some wag 2008-09-22 05:33 where is everybody? 
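[editor's note] The append-only packing described above (poke extents in one at a time, building the group and entry index as you go, starting a new group on a key boundary or on group count overflow) might look roughly like this toy model. It is not the real dwalk_pack: the 24-bit key split and the limit of 7 entries per group are assumptions read off the test output pasted in the log, and ToyLeaf, Group, and pack are hypothetical names.

```python
from dataclasses import dataclass, field

KEY_LO_BITS = 24  # assumption: the low 24 bits of a key are stored per entry
MAX_ENTRIES = 7   # assumption: the test output shows a group rolling over after 7 entries


@dataclass
class Group:
    keyhi: int                                   # high key bits shared by the group
    entries: list = field(default_factory=list)  # each entry is [keylo, [extents]]


class ToyLeaf:
    """Toy stand-in for a dleaf that only supports appending at the end."""

    def __init__(self):
        self.groups = []
        self.last_key = -1

    def pack(self, key, extent):
        assert key >= self.last_key, "toy model is append-only"
        keyhi = key >> KEY_LO_BITS
        keylo = key & ((1 << KEY_LO_BITS) - 1)
        group = self.groups[-1] if self.groups else None
        if group and group.keyhi == keyhi and group.entries[-1][0] == keylo:
            # Same key as the last entry: just add another extent to it.
            group.entries[-1][1].append(extent)
        else:
            # A new entry is needed; open a new group on a key boundary
            # or when the current group's entry count would overflow.
            if group is None or group.keyhi != keyhi or len(group.entries) >= MAX_ENTRIES:
                group = Group(keyhi)
                self.groups.append(group)
            group.entries.append([keylo, [extent]])
        self.last_key = key


leaf = ToyLeaf()
for key, extent in [(0x3000044, 0x444), (0x3000044, 0x666), (0x3000055, 0x555)]:
    leaf.pack(key, extent)
leaf.pack(0x3000056, 0x123)  # new entry in the existing group
leaf.pack(0x4000123, 0x123)  # different high bits: starts a new group
for g in leaf.groups:
    print(hex(g.keyhi), [(hex(lo), [hex(e) for e in ext]) for lo, ext in g.entries])
```

Adding in the middle of a leaf is not modelled; as suggested in the log, a first version could re-append everything after the add point.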
2008-09-22 05:34 it's only 5 in the morning 2008-09-22 05:34 wimps 2008-09-22 06:10 ACTION is alive 2008-09-22 07:09 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-09-22 07:14 ok, dwalk_mock is a fait accompli 2008-09-22 07:15 almost within striking zone of putting this together in inode.c 2008-09-22 07:15 need to think about extenty implications now 2008-09-22 07:15 extents just about here :) 2008-09-22 07:15 => let the benchmark wars begin 2008-09-22 07:35 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-09-22 07:36 -!- ceatinge(~ceatinge@veryclever.net) has joined #tux3 2008-09-22 09:41 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-22 09:50 -!- nataliep_(~nataliep@207.47.98.129.static.nextweb.net) has joined #tux3 2008-09-22 10:10 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-22 10:15 ok, prolly about time to try dropping these gizmos into inode.c 2008-09-22 10:15 and see if they make extents happen 2008-09-22 10:16 I wonder if we need extent.c 2008-09-22 11:24 we got filemap.c instead 2008-09-22 11:24 extents are a detail 2008-09-22 12:30 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-22 12:44 -!- nataliep_(~nataliep@207.47.98.129.static.nextweb.net) has joined #tux3 2008-09-22 13:29 hey flipz 2008-09-22 13:30 hi 2008-09-22 13:30 got extents working or something like that ? 2008-09-22 13:37 check out the code and try it 2008-09-22 14:18 I'll do so a bit later. I've been reading a bit of the code online 2008-09-22 14:18 flipz: you know there's a #linuxfs channel, right ? 2008-09-22 14:24 with clueful stuff happening? 
2008-09-22 15:39 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-22 15:40 -!- openblast(~quassel@static.230.173.47.78.clients.your-server.de) has left #tux3 2008-09-22 16:47 -!- ajonat(~ajonat@190.48.103.186) has joined #tux3 2008-09-22 17:29 ok, here we go, final push for an extents prototype 2008-09-22 20:41 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-22 21:12 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-22 21:39 -!- Bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-09-22 21:50 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-09-23 01:55 folks 2008-09-23 02:41 -!- tim_dimm_(~mobile@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-23 02:42 Greetings 2008-09-23 03:17 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-09-23 04:49 -!- kbingham(~kbingham@92.10.168.77) has joined #tux3 2008-09-23 05:43 -!- Bobby(~Bobby@nat-inn.mentorg.com) has joined #tux3 2008-09-23 05:44 hellooo 2008-09-23 05:44 anyone here mind explaining extents to me??? 2008-09-23 05:45 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-09-23 06:14 -!- smitht(~chatzilla@ool-182f94db.dyn.optonline.net) has joined #tux3 2008-09-23 06:16 Hello all, I am eager to look over the kernel port of Tux3, however I cannot get git to clone the repo at http://phunq.net/ddtree, has anyone done this successfully? Thanks. Trevor 2008-09-23 09:50 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-23 10:45 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-23 10:48 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-23 11:58 "Results 1 - 10 of about 413,000 for tux3" 2008-09-23 12:12 maze, there? 
2008-09-23 12:12 yes ;-( 2008-09-23 12:12 your suggestion re putting the xattrs up high in files 2008-09-23 12:13 hmm 2008-09-23 12:13 really a good idea, except for the problem of deepening the radix tree 2008-09-23 12:13 I'd think that would be relativelly insiginificant 2008-09-23 12:13 every file with xattrs would have a 6 level radix tree 2008-09-23 12:13 its significant I think 2008-09-23 12:14 compared to 1 level for most files now 2008-09-23 12:14 however 2008-09-23 12:14 if the radix tree code were to be modified to have one level of btree at the top level... 2008-09-23 12:14 perhaps optionally 2008-09-23 12:14 then it gets practical 2008-09-23 12:14 I'm not quite sure why it would get so deep so quickly? 2008-09-23 12:15 its a radix tree 2008-09-23 12:15 if you map something at the top of the space, the entire tree has to deepen 2008-09-23 12:15 you could use signed 2008-09-23 12:15 ah 2008-09-23 12:15 but 2008-09-23 12:15 well, with an offset 2008-09-23 12:16 so the only cost is to look up and maintain that offset 2008-09-23 12:16 yes, that improves it without much stress 2008-09-23 12:16 good idea 2008-09-23 12:16 I guess, my problem is, I'm not quite sure what a radix tree is 2008-09-23 12:16 well... you can take a run at your first core kernel hack then, after we have tux3's requirements to justify it 2008-09-23 12:17 ah 2008-09-23 12:17 better clear that up 2008-09-23 12:17 I'd assumed you'd be using a normal number of leaves determines depth type of tree here 2008-09-23 12:17 fundamental tool of software engineering 2008-09-23 12:17 is a radix tree what the cpu uses for tlb? 2008-09-23 12:17 radix tree is to btree as bucket sort is to quicksort 2008-09-23 12:17 sorta 2008-09-23 12:17 radix tree is directly indexed at each level instead of binsearched 2008-09-23 12:18 therefore needs no keys 2008-09-23 12:18 index nodes are twice as compact 2008-09-23 12:18 right, so isn't the cpu virt to phys mapping a radix tree? 
2008-09-23 12:18 probe is much faster 2008-09-23 12:18 but you still pay a l1 cache pressure penalty for each level of the tree 2008-09-23 12:18 it is, and that is a problem in some cases 2008-09-23 12:18 64 bit machines struggle with it 2008-09-23 12:19 ok, so I just wan't aware of the name 'radix tree' 2008-09-23 12:19 no other name for it that I know 2008-09-23 12:19 like radix sort 2008-09-23 12:19 wasn't aware of any name ;-) 2008-09-23 12:19 you were looking for some reason to split something into two for 32-bit machines 2008-09-23 12:19 can't remember what it was... something about 1EB of space? 2008-09-23 12:20 was that for total fs? 2008-09-23 12:20 hmm, must have been 2008-09-23 12:20 I wasn't either until "wind" showed up on #kernelnewbies with his plan of changing the page cache from a hash to something better 2008-09-23 12:20 I recall he considered half a dozen types of trees I'd never heard of 2008-09-23 12:20 he? 2008-09-23 12:20 the one thing he was sure of, all of them would be better than a hash 2008-09-23 12:20 he was right 2008-09-23 12:20 John Levon 2008-09-23 12:20 h 2008-09-23 12:21 ah 2008-09-23 12:21 after that, showed very little interest in Linux 2008-09-23 12:21 not sure why 2008-09-23 12:21 heading out with team for Lunch. Will be back in 30 minutes. 2008-09-23 12:21 bye 2008-09-23 12:21 ;-) 2008-09-23 12:48 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-09-23 12:51 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-09-23 12:55 -!- pgquiles(~pgquiles@42.Red-83-39-60.dynamicIP.rima-tde.net) has joined #tux3 2008-09-23 13:12 back 2008-09-23 13:23 crawl-8.cuill.com ;-) 2008-09-23 13:23 that was fast 2008-09-23 13:52 folks 2008-09-23 14:10 flipz: talking about implementing xattrs ? 
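The radix tree points made above (direct indexing at each level, no keys, and the whole tree deepening when something is mapped high in the index space) can be sketched in a few lines of C. This is a minimal illustration only: the 64-slot fan-out and the names `radix_depth` and `radix_slot` are assumptions made for the example, not values or functions taken from tux3 or the kernel.

```c
#include <stdint.h>

/* Illustrative fan-out of 64 slots per node; an assumption for this sketch. */
enum { RADIX_BITS = 6, RADIX_SLOTS = 1 << RADIX_BITS };

/* Number of levels the tree needs before max_index becomes addressable. */
static unsigned radix_depth(uint64_t max_index)
{
	unsigned depth = 1;

	while ((max_index >>= RADIX_BITS) != 0)
		depth++;
	return depth;
}

/*
 * Slot used at a given level (0 is the leaf level, valid up to level 10
 * for 64-bit indexes). The slot is a slice of the index bits, so no keys
 * are stored or compared, unlike a btree node.
 */
static unsigned radix_slot(uint64_t index, unsigned level)
{
	return (unsigned)(index >> (level * RADIX_BITS)) & (RADIX_SLOTS - 1);
}
```

With this fan-out, a single mapping at index 2^32 + 3 forces a six-level tree, while the same mapping expressed relative to a stored base of 2^32 needs one level, which is the attraction of the offset idea raised in the discussion.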
2008-09-23 14:10 already ahve 2008-09-23 14:10 you should check out the code 2008-09-23 14:19 yeah, been looking at it 2008-09-23 14:19 and thinking about the allocation map problem a bit 2008-09-23 14:20 not sure what kind of tree structure to use to represent areas on the disk that might have a certain amount of free blocks 2008-09-23 14:27 it's a bitmap 2008-09-23 14:27 not a tree 2008-09-23 14:27 there will be accelerator bits in the pointers to bitmap blocks eventually 2008-09-23 14:27 to know which blocks have how much space free 2008-09-23 14:27 for now it's just a big linear block map 2008-09-23 14:34 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-23 15:22 big linear scan then ? 2008-09-23 15:37 for now 2008-09-23 15:37 it certainly won't stay that way 2008-09-23 15:37 even now 2008-09-23 15:37 the scan is directed 2008-09-23 15:37 to a preferred target area 2008-09-23 18:11 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-23 18:35 -!- ajonat(~ajonat@190.48.115.242) has joined #tux3 2008-09-23 19:16 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-23 19:52 -!- RalucaM(~ral@scout-9.cnds.jhu.edu) has joined #tux3 2008-09-23 19:56 ACTION is working for a deadline for tomorrow so it will not be present at today's lesson :-( 2008-09-23 20:00 :-( 2008-09-23 20:08 -!- ajonat(~ajonat@190.48.119.175) has joined #tux3 2008-09-23 20:22 -!- Kirantpatil(~kiran@122.167.199.68) has joined #tux3 2008-09-23 20:22 -!- Kirantpatil(~kiran@122.167.199.68) has left #tux3 2008-09-23 20:29 no session at all today? 
2008-09-23 20:29 teacher skipped class 2008-09-23 20:30 looks like it 2008-09-23 20:37 -!- macan(~chatzilla@159.226.41.129) has joined #tux3 2008-09-23 22:40 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-09-23 22:47 -!- tim_dimm_(~mobile@166.134.66.229) has joined #tux3 2008-09-23 23:15 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-23 23:26 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-23 23:35 -!- nataliep(~nataliep@207.47.98.129.static.nextweb.net) has joined #tux3 2008-09-23 23:38 -!- ajonat(~ajonat@190.48.127.55) has joined #tux3 2008-09-23 23:46 flake 2008-09-23 23:46 flipz: = flake 2008-09-23 23:59 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-24 00:10 -!- shapor kicked bh ("be nice") 2008-09-24 00:25 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-09-24 00:25 hello 2008-09-24 00:45 hi pranith 2008-09-24 00:46 sleep time... 2008-09-24 01:17 -!- pgquiles(~pgquiles@42.Red-83-39-60.dynamicIP.rima-tde.net) has joined #tux3 2008-09-24 01:26 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-09-24 01:32 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-09-24 01:52 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-24 01:58 hello 2008-09-24 01:58 anyone here help me on setting up a fuse tux3? 2008-09-24 01:59 MaZe: ? 2008-09-24 01:59 hmm 2008-09-24 02:00 ? 2008-09-24 02:00 oh, tux3 fuse... I'm probably not the best person, seeing as I haven't tried running it yet 2008-09-24 02:00 hmm, ok 2008-09-24 02:00 I'm mucking around with the kernel and haven't compiled tux3 yet even... still working on (a) learning about the kernel, (b) writing options parsing, and (c) getting a solid debug environment, and most importantly... I have work ;-( 2008-09-24 02:01 oh.. 2008-09-24 02:01 speaking of options parsing... 
2008-09-24 02:01 flipz: -o mkfs ofcourse, but can also potentially do -o resize=#blocks 2008-09-24 02:01 hmm 2008-09-24 02:01 pranith: sure 2008-09-24 02:02 i can help 2008-09-24 02:02 hey shapor 2008-09-24 02:02 cool 2008-09-24 02:02 flipz: and support that as a -o remount option as well 2008-09-24 02:02 MaZe: which code are u working on? 2008-09-24 02:02 will have to look see how remount is implemented 2008-09-24 02:03 right now? generic kernel space options parser 2008-09-24 02:03 shapor: what do i need to do? 2008-09-24 02:03 well tux3fs since ls works with it 2008-09-24 02:03 readdir doesn't work in tux3fuse 2008-09-24 02:03 MaZe: ok 2008-09-24 02:03 well, at least not for me, i was going to try a new version of fuse 2008-09-24 02:03 it seems to be returning some data but somethings not right 2008-09-24 02:03 shapor: hmm. where can i get the code? and any install options? 2008-09-24 02:04 tux3fuse is the low level api one 2008-09-24 02:04 its in the repo 2008-09-24 02:04 user/test 2008-09-24 02:04 right next to all the other files 2008-09-24 02:04 ok 2008-09-24 02:05 shapor: can i have the link please... 2008-09-24 02:05 :) 2008-09-24 02:05 its in the mercurial repository at http://tux3.org/tux3 2008-09-24 02:06 hg clone http://tux3.org/tux3 2008-09-24 02:06 then it will be in tux3/user/test 2008-09-24 02:06 just run "make makefs" 2008-09-24 02:07 and "make debug" 2008-09-24 02:07 and it should be mounted on /tmp/test 2008-09-24 02:07 ok 2008-09-24 02:44 anyone mind explaining something about extents to me ? :) 2008-09-24 02:45 i want to know how they help in addressing a larger disk area 2008-09-24 02:45 shapor: ? 2008-09-24 02:45 MaZe: ? 2008-09-24 02:45 hmm? 2008-09-24 02:45 any idea on extents? 2008-09-24 02:45 they don't really... 
they're really just a performance optimization 2008-09-24 02:45 instead of splitting a file into blocks 2008-09-24 02:46 and then storing the location of every single block 2008-09-24 02:46 ok, u map a chunk of block to a single extent... 2008-09-24 02:46 you split the file into linear sequence of blocks (linear in the sense they are ordered sequentially one after the other on disk) 2008-09-24 02:46 this way you only need to store a mapping (file blocks N..M) -> (disk block X..Y) 2008-09-24 02:47 hmm, ok 2008-09-24 02:47 or indeed just a map of [N] -> [X] is enough (since M and Y are -1 of the next set) 2008-09-24 02:47 thus you have a file as a a set of extents (linear set of blocks), instead of as a set of blocks 2008-09-24 02:47 hmm, nice 2008-09-24 02:48 since you want files to be linear as much as possible (and thus contain few extents) [hence running defragmentors, etc in windows] 2008-09-24 02:48 you will usually end up with relatively few extents, and thus it takes less space to store and can have better performance (especially if well implemented) than just a naive block list 2008-09-24 02:50 basically you get both space savings on disk (and in memory), and better performance, due to having/needing to read in fewer disk blocks (which have a tendency to get pretty randomly distributed) than in a block based fs 2008-09-24 02:51 hmm, thats the main advantage then 2008-09-24 02:51 not addressing a large disk area 2008-09-24 02:56 shapor: problem with fuse :( 2008-09-24 02:56 pranith: whats the problem 2008-09-24 02:57 permission denied 2008-09-24 02:57 doing what 2008-09-24 02:57 cd /tmp/test 2008-09-24 02:58 there are no permissions on the test directory 2008-09-24 02:58 it just say 'd?????' on a ls -l 2008-09-24 02:58 btw, make debug did not return 2008-09-24 02:59 to the command prompt 2008-09-24 03:02 oh i've seen that before 2008-09-24 03:02 hmm 2008-09-24 03:02 are you running ls as root? 2008-09-24 03:02 what do i do? 
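The extent mapping described above, (file blocks N..M) -> (disk blocks X..Y), can be sketched as a sorted array searched by file block. This is a hedged illustration: `struct extent` and `extent_lookup` are hypothetical names, and an explicit block count is stored here for simplicity rather than deriving each extent's end from the start of the next one, as the chat suggests is possible.

```c
#include <stddef.h>
#include <stdint.h>

/* One run of contiguous blocks; layout is an assumption for this sketch. */
struct extent {
	uint64_t file_block;	/* first file block covered (N) */
	uint64_t disk_block;	/* disk block backing it (X) */
	uint32_t count;		/* number of contiguous blocks */
};

/*
 * Translate a file block to a disk block by binary search over extents
 * sorted by file_block and not overlapping. Returns UINT64_MAX when the
 * block is not mapped (a hole, or past the last extent).
 */
static uint64_t extent_lookup(const struct extent *map, size_t n,
			      uint64_t file_block)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (file_block < map[mid].file_block)
			hi = mid;
		else if (file_block - map[mid].file_block >= map[mid].count)
			lo = mid + 1;
		else
			return map[mid].disk_block +
			       (file_block - map[mid].file_block);
	}
	return UINT64_MAX;
}
```

A file laid out in three runs needs three entries here, where a per-block list would need one entry for every block, which is the space and read saving mentioned in the discussion.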
2008-09-24 03:02 nope 2008-09-24 03:02 try that 2008-09-24 03:02 as a normal user 2008-09-24 03:02 ok 2008-09-24 03:02 try root 2008-09-24 03:03 ohk, got the permissions now as root 2008-09-24 03:03 but... why? 2008-09-24 03:03 because the fuse implementation is *very* rough around the edges 2008-09-24 03:03 shouldn't fuse be accessible as a normal user 2008-09-24 03:04 yes, patches welcome :) 2008-09-24 03:04 :) 2008-09-24 03:04 if you hit ctrl-c 2008-09-24 03:04 and re run it as a normal user 2008-09-24 03:04 ./tux3fs /tmp/testdev /tmp/test -f 2008-09-24 03:04 instead of make debug 2008-09-24 03:04 i think it will work 2008-09-24 03:04 hmm 2008-09-24 03:05 i dint run make debug as root... 2008-09-24 03:05 but it has a sudo command in it 2008-09-24 03:05 oh 2008-09-24 03:05 :) 2008-09-24 03:05 the fuse implementation is really just meant as a test harness 2008-09-24 03:05 we know there are a lot of bugs in it 2008-09-24 03:05 ok 2008-09-24 03:06 i was trying to figure out why readdir isn't working right in tux3fuse (the low level one) 2008-09-24 03:06 oh 2008-09-24 03:06 since i think porting to that will make the kernel port a little easier 2008-09-24 03:06 since the api is more vfs-ish 2008-09-24 03:06 remains to be seen, i havne't had a lot of time to work on it recently 2008-09-24 03:07 you part time work on tux3?? 2008-09-24 03:08 wow! tux3 is gplv3!! 2008-09-24 03:08 how do u get it into the kernel? 2008-09-24 03:15 i think it will be v2 in kernel 2008-09-24 03:15 i work nights and weekends when i have time 2008-09-24 03:16 its not my day job ;) 2008-09-24 03:16 its not anyones day job afaik 2008-09-24 03:17 hmm, what does flips do? 2008-09-24 03:18 i mean what does he do for a day job :D 2008-09-24 03:18 i know he develops tux3... 2008-09-24 03:21 i think hes been mostly working on tux3 recently 2008-09-24 03:33 pranith: regarding gplv2 vs 3. 
This is solved by the remark 2008-09-24 03:33 * By contributing changes to this file you grant the original copyright holder 2008-09-24 03:33 * the right to distribute those changes under any license. 2008-09-24 03:41 :) 2008-09-24 03:41 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-24 03:41 its just what reiser did :D 2008-09-24 03:41 hope flips doesnt flip out like reiser :P 2008-09-24 03:52 -!- pgquiles(~pgquiles@42.Red-83-39-60.dynamicIP.rima-tde.net) has joined #tux3 2008-09-24 04:37 what is tux3fs.c and tux3fuse.c? 2008-09-24 04:37 any difference between the two? 2008-09-24 04:41 tux3fuse uses the low level fuse api 2008-09-24 04:41 and readdir is currently broken 2008-09-24 04:49 yeah 2008-09-24 04:49 ok 2008-09-24 04:56 any idea on how i can debug this code?? 2008-09-24 04:56 printf's are nice.. but gdb rocks 2008-09-24 04:56 :D 2008-09-24 05:03 what does printf("'%.*s'", namelen, name) do? 2008-09-24 05:03 im not sure of this printf specifier :( 2008-09-24 05:58 hello 2008-09-24 05:59 anybody here? 2008-09-24 07:22 pranith: A field width or precision, or both, may be indicated by an asterisk `*' or an asterisk followed by one or more decimal digits and a `$' instead of a digit string. In this case, an int argument supplies the field width or precision. A negative field width is treated as a left adjustment flag followed by a positive field width; a negative precision is treated as though it were missing. If a single format directi 2008-09-24 07:22 (from man 3 printf) 2008-09-24 07:23 (on mac :P) 2008-09-24 07:23 RzM|Away: thanks :) 2008-09-24 07:23 RzM|Away: mac is lame 2008-09-24 07:23 :P 2008-09-24 07:23 so the namelen will tell how much of the name to show :D 2008-09-24 07:23 hmm, roger that 2008-09-24 07:24 The field width 2008-09-24 07:24 An optional decimal digit string (with nonzero first digit) specifying a minimum field width. 
If the converted value has fewer characters than the field width, it will be padded with spaces on the left (or right, if the left-adjustment flag has been given). Instead of a decimal digit string one may write `*' or `*m$' (for some decimal integer m) to specify that the field width is given in the next argument, or in the 2008-09-24 07:24 from linux 2008-09-24 07:24 I use both ;-) 2008-09-24 07:25 and like both :P 2008-09-24 07:25 hmm 2008-09-24 07:25 i use ubuntu dressed up as mac 2008-09-24 07:25 so i have the best of both worlds 2008-09-24 07:25 :D 2008-09-24 07:25 ;-) 2008-09-24 07:26 any idea why readdir fails in tux3fuse? 2008-09-24 07:26 check this out http://xkcd.com/424/ 2008-09-24 07:26 I didn't have a chance to try that :( 2008-09-24 07:27 got to go 2008-09-24 07:27 hmm, xkcd.. lol 2008-09-24 07:27 have fun 2008-09-24 07:27 u too 2008-09-24 07:27 bbye 2008-09-24 07:57 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-09-24 08:10 -!- kbingham(~kbingham@92.22.1.228) has joined #tux3 2008-09-24 08:10 flipz: -o mkfs ofcourse, but can also potentially do -o resize=#blocks <- ah yes 2008-09-24 08:20 MaZe, and for that matter, -o remount,resize=#blocks 2008-09-24 08:21 maze, see super_operations->remount_fs 2008-09-24 08:21 there we go, that will have to do for a tux3 U session this time 2008-09-24 08:45 -!- flips(~phillips@phunq.net) has joined #tux3 2008-09-24 08:48 tim_dimm, dcc chat doesn't work with my connection 2008-09-24 08:48 probably have to configure my router or something 2008-09-24 08:49 I'm timing out on the other connection 2008-09-24 08:50 that's because speakeasy went down 2008-09-24 08:50 just reconnect 2008-09-24 08:50 can u do a 11am call friday ? 
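On the `printf("'%.*s'", namelen, name)` question above: the `*` after the dot takes the precision from an `int` argument, and for `%s` the precision is the maximum number of bytes printed. A small sketch follows; `format_name` is a hypothetical helper, and the guess that the idiom is used because the name is not NUL-terminated is an assumption, not something the log confirms.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Write at most namelen bytes of name, wrapped in single quotes, into out.
 * With "%.*s" the string does not need a terminating NUL as long as
 * namelen bytes are readable. Returns snprintf's result.
 */
static int format_name(char *out, size_t outsize,
		       const char *name, int namelen)
{
	return snprintf(out, outsize, "'%.*s'", namelen, name);
}
```

Note that this is the precision, not the field width quoted from the Linux man page: a field width of `*` would pad the output, while a precision of `*` truncates it.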
2008-09-24 08:50 ACTION points at the query chat 2008-09-24 09:45 -!- pgquiles__(~pgquiles@42.Red-83-39-60.dynamicIP.rima-tde.net) has joined #tux3 2008-09-24 10:08 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-24 10:51 -!- ceatinge(~ceatinge@veryclever.net) has left #tux3 2008-09-24 11:08 -!- ceatinge(~ceatinge@72.232.13.50) has joined #tux3 2008-09-24 11:32 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-09-24 11:40 -!- pranith(7aa040b1@webchat.mibbit.com) has joined #tux3 2008-09-24 11:40 heya 2008-09-24 11:40 hi 2008-09-24 11:40 anyone here? 2008-09-24 11:40 hey flips 2008-09-24 11:40 nope 2008-09-24 11:41 was going through the fuse code today... 2008-09-24 11:41 whats wrong with the readdir function? 2008-09-24 11:41 I haven't looked at it 2008-09-24 11:41 shapor started to look at it 2008-09-24 11:41 getting it working in tux3fs was tricky 2008-09-24 11:41 ohk 2008-09-24 11:41 hmm 2008-09-24 11:42 the readdir internal interface is super crappy 2008-09-24 11:42 tux3fuse.c and tux3fs.c are different... 2008-09-24 11:42 on of the worst interfaces anywhere, for anything 2008-09-24 11:42 that's right 2008-09-24 11:42 is it because fuse uses the fuse api? 2008-09-24 11:42 or something like that? 2008-09-24 11:42 no idea 2008-09-24 11:42 hmm, ok 2008-09-24 11:43 you might try emailing tero 2008-09-24 11:43 who checked in the original tux3fuse patch 2008-09-24 11:43 hmm, ok 2008-09-24 11:43 cc tux3 list if you do please 2008-09-24 11:43 yeah, sure 2008-09-24 11:43 :) 2008-09-24 11:43 you can also bug shapor 2008-09-24 11:43 if you like 2008-09-24 11:43 shapor is fun to bug 2008-09-24 11:44 hehe 2008-09-24 11:44 shapor is busy with his day job 2008-09-24 11:44 he told me 2008-09-24 11:44 so better to try tero for that :) 2008-09-24 11:44 "busy" 2008-09-24 11:44 it's all relative 2008-09-24 11:45 hmm 2008-09-24 11:45 can i have the encrypted email id of tero? 
2008-09-24 11:47 ok, got it 2008-09-24 12:08 -!- pranith(7aa040b1@webchat.mibbit.com) has joined #tux3 2008-09-24 12:17 debugging of filemap extents finally begins 2008-09-24 12:17 was hard code to write 2008-09-24 12:18 I found it hard 2008-09-24 12:18 prolly easy for shapor though ;) 2008-09-24 12:36 irob's thoughtful initializing of buffers to "deadly data" has the unintended side effect of preventing valgrind from detecting access to unitialized buffer data 2008-09-24 12:37 ACTION removes 2008-09-24 12:37 ah, there we go, lots of valgrind complaints 2008-09-24 12:38 -!- ceatinge(~ceatinge@72.232.13.50) has joined #tux3 2008-09-24 13:42 -!- pgquiles__(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-09-24 14:26 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-09-24 14:26 folks :) 2008-09-24 14:57 -!- pgquiles_(~pgquiles@42.Red-83-39-60.dynamicIP.rima-tde.net) has joined #tux3 2008-09-24 15:46 hmm, linux-fsdevel server is slower than molasses in January 2008-09-24 15:47 last time I post to it without ccing lkml, I think 2008-09-24 15:47 flips: sk8 30 for new dads 2008-09-24 15:47 still there? 2008-09-24 15:47 rolling out now 2008-09-24 15:47 I'll head out in about 15 2008-09-24 15:47 meet at the pier? 
2008-09-24 15:48 got to home by 5 2008-09-24 15:48 pier-ish 2008-09-24 15:48 should work 2008-09-24 15:48 k 2008-09-24 15:48 rollin' 2008-09-24 15:48 cu 2008-09-24 15:48 no crashes 2008-09-24 15:48 certainly no blue screen of death 2008-09-24 15:49 ;-) 2008-09-24 15:49 don't skate in the dark 2008-09-24 15:49 u haven;'t lived until you bomb latigo in the moonlight 2008-09-24 15:50 hmm, sounds like "haven't died" 2008-09-24 15:51 hey flips 2008-09-24 15:51 hey 2008-09-24 15:51 got to go 2008-09-24 15:51 folks were complaining last night about you being delinquent 2008-09-24 15:51 and now you're abandoning us 2008-09-24 15:55 bh, complaining about what you get for free is not normally considered good style 2008-09-24 15:55 besides, I haven't seen you at one of the sessions 2008-09-24 15:55 I'm also joking if you haven't figured that out by now 2008-09-24 15:56 the internet doesn't see the smile 2008-09-24 15:56 never does 2008-09-24 15:56 you say it like you're never going to see the sun rise again or something like that 2008-09-24 15:57 nothing in santa monica is *that* awful 2008-09-24 16:18 -!- mingming_(~mingming@bi01p1.co.us.ibm.com) has joined #tux3 2008-09-24 16:24 ACTION waves to flips 2008-09-24 16:33 flips: ping me when you get back 2008-09-24 16:34 shapor: the core code is in dleaf.c right ? 
2008-09-24 17:36 aw, missed mingming 2008-09-24 17:40 bh, ping 2008-09-24 17:41 the core code is indeed in dleaf.c 2008-09-24 17:56 hey shapor, I'm forced to link my ugly tux3 page and cr*ppy design doc from the lkml post because your version of the design doesn't have headings in most of it 2008-09-24 17:57 flips: yeah, looking over it now 2008-09-24 17:57 er was about an hour ago 2008-09-24 17:58 it's starting to get interesting now 2008-09-24 17:58 with the dwalk stuff 2008-09-24 17:58 yeah, it's going to get more and more complicated as well 2008-09-24 17:58 this step is making it less complicated 2008-09-24 17:58 general resizing isn't implemented yet which includes truncation 2008-09-24 17:59 a couple of big ugly functions will disappear in a week or so 2008-09-24 17:59 all of that needs to be integrated into atomic logging as well, not really that easy 2008-09-24 17:59 roughly zero impact on dleaf 2008-09-24 17:59 talking about the refactoring ? 2008-09-24 18:00 logging impact 2008-09-24 18:00 logging just needs to be done once for all forms of btree, its generic 2008-09-24 18:00 president bush is giving a speech btw 2008-09-24 18:00 hope he chokes 2008-09-24 18:01 he isn't your favorite president of all time ? 2008-09-24 18:01 least in fact 2008-09-24 18:01 can't think of a worse one 2008-09-24 18:01 well, you should start putting in blank functions for atomic logging 2008-09-24 18:01 anyway 2008-09-24 18:01 #offtopic 2008-09-24 18:02 real functions for atomic logging will go in around the time of the kernel port 2008-09-24 18:02 so no need to fire blanks 2008-09-24 18:03 the blanks are helpful for other folks 2008-09-24 18:03 got to get back to my post 2008-09-24 18:03 nearly running out of time 2008-09-24 18:03 and you can start training people to think in terms of it 2008-09-24 18:03 if somebody steps up to implement it I'll put in some stubs 2008-09-24 18:03 otherwise... 
2008-09-24 18:04 got other things to do 2008-09-24 18:04 like train up some hackers for the port 2008-09-24 18:05 you should mark it so that folks understand your thinking regarding it 2008-09-24 18:05 I'm tell'n you that it's going to be useful for me and will indicate a direction with your implementation 2008-09-24 18:07 it's been described in a number of posts 2008-09-24 18:08 we can get links up on the page 2008-09-24 18:27 that would be good 2008-09-24 18:54 -!- tim_dimm_(~mobile@166.135.68.85) has joined #tux3 2008-09-24 19:22 "Tux3 gets a high speed atom smasher" -- just posted to lkml 2008-09-24 20:10 -!- Kirantpatil(~kiran@122.167.206.163) has joined #tux3 2008-09-24 20:21 -!- amey(~amey@116.73.35.180) has joined #tux3 2008-09-24 20:21 -!- amey(~amey@116.73.35.180) has left #tux3 2008-09-24 21:37 -!- amey(~amey@116.73.35.180) has joined #tux3 2008-09-24 21:38 -!- amey(~amey@116.73.35.180) has left #tux3 2008-09-24 22:16 -!- Kirantpatil(~kiran@122.167.197.205) has joined #tux3 2008-09-24 22:16 hello list.. 2008-09-24 22:16 i tried to get junkfs 2008-09-24 22:16 but link http://m.a.z.e.pl/junkfs.tar.gz is not working.. 2008-09-24 22:17 any one point me to the right location.. 2008-09-24 22:30 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-24 22:37 hello.. 2008-09-24 22:38 howdy 2008-09-24 22:45 could you please get me the link from where i can download junkfs ? 2008-09-24 22:45 as http://m.a.z.e.pl/junkfs.tar.gz is not working.. 
2008-09-25 00:10 that was fun 2008-09-25 00:11 beach cruiser on the strand at midnight 2008-09-25 00:11 can't get more california than that 2008-09-25 00:49 -!- pgquiles(~pgquiles@42.Red-83-39-60.dynamicIP.rima-tde.net) has joined #tux3 2008-09-25 01:04 -!- ajonat(~ajonat@190.48.120.246) has joined #tux3 2008-09-25 01:07 bah 2008-09-25 01:18 flipsout: enjoy it before winter hits 2008-09-25 01:53 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-09-25 03:19 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-25 04:19 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-09-25 05:00 -!- pgquiles_(~pgquiles@42.Red-83-39-60.dynamicIP.rima-tde.net) has joined #tux3 2008-09-25 05:08 -!- Kirantpatil(~kiran@122.167.196.127) has joined #tux3 2008-09-25 05:09 -!- Kirantpatil(~kiran@122.167.196.127) has left #tux3 2008-09-25 11:09 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-25 11:11 file_bwrite: block write <0:0> 2008-09-25 11:11 <<< extent 0x0/4 >>> 2008-09-25 11:11 0 entry groups: 2008-09-25 11:11 file_bwrite: fill gap at 0x0/4 2008-09-25 11:11 balloc_extent_from_range: balloc 4 blocks from [0/1000] 2008-09-25 11:11 balloc extent -> [2/4] 2008-09-25 11:12 extent writing almost happening 2008-09-25 11:12 lots of combinatorics to take care of 2008-09-25 11:53 segs: 0x2/4 (1) 2008-09-25 11:53 dwalk_mock: add entry key 0x4 after 0x0 2008-09-25 11:53 dwalk_mock: add extent 0x2/4 2008-09-25 11:53 getting closer... 2008-09-25 11:57 -!- amey(~amey@116.73.35.180) has joined #tux3 2008-09-25 12:02 -!- amey(~amey@116.73.35.180) has left #tux3 2008-09-25 12:33 -!- tim_dimm_(~mobile@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-25 12:57 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-09-25 14:07 fyi, I might be late for tux3 U.. hopefully not... 
2008-09-25 14:23 bah 2008-09-25 15:23 -!- Ryback_(~ulisses@201.82.39.16) has joined #tux3 2008-09-25 17:03 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-25 18:49 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-25 18:55 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-25 19:14 -!- Kirantpatil(~kiran@122.167.212.31) has joined #tux3 2008-09-25 19:14 -!- Kirantpatil(~kiran@122.167.212.31) has left #tux3 2008-09-25 19:33 -!- ajonat(~ajonat@190.48.120.246) has joined #tux3 2008-09-25 19:56 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-25 19:57 I'll miss tonight's Tux3 U 2008-09-25 19:57 looking forward to reading the logs 2008-09-25 20:02 present 2008-09-25 20:02 although were's everyone else 2008-09-25 20:02 i'm here 2008-09-25 20:03 just got back from an offsite 2008-09-25 20:03 whiskey, steak, and guns ;-) 2008-09-25 20:04 okay, so I chose wine instead of whiskey 2008-09-25 20:04 so that makes it 2008-09-25 20:04 wine, steak and guns ... 2008-09-25 20:05 and I might be getting the order a little mixed up since I'm a little tini-iny-bit buzzed 2008-09-25 20:05 so it might just have been 2008-09-25 20:05 guns, steak, and wine... 2008-09-25 20:05 hmm 2008-09-25 20:05 where's the teacher? 2008-09-25 20:09 heh, tonight you get to teach instead 2008-09-25 20:10 nope 2008-09-25 20:10 I'm close to calling myself drunk... I need a nap. 2008-09-25 20:11 flipsout: ping pong? 2008-09-25 20:12 I believe Razvan was supposed to be the teacher this time around 2008-09-25 20:12 but he's RzM|Away, apparently afk 2008-09-25 20:13 maze 2008-09-25 20:13 here 2008-09-25 20:13 ok 2008-09-25 20:13 cool 2008-09-25 20:13 where were we 2008-09-25 20:13 any requests? 
2008-09-25 20:13 last one was bio xfrs 2008-09-25 20:13 then last tuesday we skipped 2008-09-25 20:13 right, conducted by maze 2008-09-25 20:13 it was good 2008-09-25 20:14 now tux3fs has a rather nice generic set of bio fns 2008-09-25 20:14 an async and a sync bio transfer flavor 2008-09-25 20:14 right 2008-09-25 20:14 fully general, except maybe it could take some alloc flags 2008-09-25 20:14 alloc flags? 2008-09-25 20:14 yes, like how hard the kernel should try to satisfy a request 2008-09-25 20:15 you will see that functions like kmalloc take gfp flags 2008-09-25 20:15 memory wise or io wise? 2008-09-25 20:15 "gfp: get free pages" 2008-09-25 20:15 memory wise 2008-09-25 20:15 well 2008-09-25 20:15 is coupled to io 2008-09-25 20:15 in an incestuous way 2008-09-25 20:15 most of the time, the kernel cache will be just about full 2008-09-25 20:16 what we have now I believe asks for memory in a 'can sleep' way 2008-09-25 20:16 except for right after boot, or after unmounting a volume, say, which invalidates a bunch of cache 2008-09-25 20:16 unless you specify GFP_ATOMIC, it is always "can sleep" 2008-09-25 20:16 also when you delete a dvd you just checksummed ;-) 2008-09-25 20:16 so that io transfers can take place and other things can run while waiting for memory to get free 2008-09-25 20:17 we have __NOFAIL as a gfp flag 2008-09-25 20:17 just means it will try for infinity... 2008-09-25 20:17 it means, under no circumstances return without completing the allocation 2008-09-25 20:17 until it suceeds 2008-09-25 20:18 yes 2008-09-25 20:18 and what could prevent it from succeeding? 2008-09-25 20:18 asking for 100M on 50M machine 2008-09-25 20:18 true 2008-09-25 20:18 or 120M machine with 20+M already allocated 2008-09-25 20:18 or not enough memory of a specific type 2008-09-25 20:18 or on a 200M machine on which 195M has leaked 2008-09-25 20:19 (ie. 
asking for low memory, when only high mem is free) 2008-09-25 20:19 also true 2008-09-25 20:19 [or dma16 or dma32] 2008-09-25 20:19 but the most common reason is: when memory is full of dirty pages that cannot be written out for some reason 2008-09-25 20:19 in general it is a bug 2008-09-25 20:20 in general, memory can always be allocated in kernel, by kicking out some cache 2008-09-25 20:20 so writing out dirty pages should not need to allocate memory, since otherwise it can deadlock? 2008-09-25 20:20 or you need to have a pre-allocated pool of temporary pages 2008-09-25 20:20 I believe the kernel even provides such features 2008-09-25 20:21 exactly 2008-09-25 20:21 you nailed that 2008-09-25 20:21 in fact, this is an unsolved problem in linux kernel 2008-09-25 20:21 or it is solved, but the fix is not in mainline 2008-09-25 20:21 see bio-throttle 2008-09-25 20:22 there has been an attempt to fix the problem by limiting total memory that is allowed to be dirty in kernel 2008-09-25 20:22 "dirty limits" 2008-09-25 20:22 complex, fragile, and doesn't work 2008-09-25 20:22 but has been good for creating lots of bugfixing activity lately 2008-09-25 20:22 anyway 2008-09-25 20:22 enough on memory for now? 2008-09-25 20:23 I think so... 2008-09-25 20:23 this is just something to be aware of off? 2008-09-25 20:23 let's get back to __copy2 2008-09-25 20:23 get you thinking about it, yes 2008-09-25 20:23 and some practical facts about GFP_ flags to memory allocators 2008-09-25 20:24 when you allocate a bio, there is an attempt made to provide a pre-allocated pool, so in theory a bio alloc will never fail 2008-09-25 20:25 in practice, it can slow to a craw as the pre-allocated pool only gaurantees 2 bios 2008-09-25 20:25 and it often gets into that corner 2008-09-25 20:25 hmm, so should you keep a couple pre-alloced bios for yourself? 
2008-09-25 20:26 youcan maintain your own pool, yes 2008-09-25 20:26 perhaps a good idea when the kernel is in the broken state it is 2008-09-25 20:26 extra complexity 2008-09-25 20:26 see the "mempool" mechanism 2008-09-25 20:27 is it worth it? 2008-09-25 20:27 better is to fix the bug 2008-09-25 20:27 it works for some situations 2008-09-25 20:27 its messy 2008-09-25 20:28 much harder to fix bugs, then to work around them ;-) 2008-09-25 20:28 the first requires understanding the entire system 2008-09-25 20:28 the second only the way it affects you 2008-09-25 20:28 true 2008-09-25 20:28 we can return to that issue 2008-09-25 20:29 it is fully understood, but not by everybody 2008-09-25 20:29 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2063 <- _2copy 2008-09-25 20:29 I'm totally mystified by the name, 2copy 2008-09-25 20:29 as we all are 2008-09-25 20:29 speaking of all 2008-09-25 20:29 how many of us are here? 2008-09-25 20:29 I'm feeling lonely 2008-09-25 20:30 [and drunk...] 2008-09-25 20:30 heh 2008-09-25 20:30 we'll keep it light then 2008-09-25 20:30 and short 2008-09-25 20:30 I'm attempting to get feeling drunk ;) 2008-09-25 20:30 you're well ahead it would seem 2008-09-25 20:30 heh 2008-09-25 20:31 yea, 2 glasses of white (some pinot), and 2 of red (not sure what it was), plus steak, plus an afternoon at a gun range 2008-09-25 20:31 the gun range made you drunk I presume 2008-09-25 20:31 wine before or after shooting? 2008-09-25 20:31 nah, that was first, and was fun ;-) 2008-09-25 20:31 the range was first 2008-09-25 20:32 afterward you rode around and shot up a few stop signs? 2008-09-25 20:32 nah, we left the range gun-less 2008-09-25 20:32 just checking 2008-09-25 20:32 we then invaded an italian restaurant in downtown mountain view 2008-09-25 20:33 castro street? 
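The idea of keeping a small preallocated reserve, as in the "mempool" mechanism mentioned above, can be sketched in userspace C. This is an analogy under stated assumptions, not kernel code: `struct pool`, `pool_init`, `pool_alloc` and `pool_free` are hypothetical names, `malloc` stands in for the normal allocator, and the `allocator_failed` flag simulates memory pressure. The reserve of two objects mirrors the figure quoted in the chat for bios.

```c
#include <stddef.h>
#include <stdlib.h>

enum { POOL_RESERVE = 2 };	/* objects kept back for emergencies */

struct pool {
	void *reserve[POOL_RESERVE];
	int avail;		/* how many reserve objects are present */
	size_t objsize;
};

/* Preallocate the reserve up front, while memory is still available. */
static int pool_init(struct pool *p, size_t objsize)
{
	p->avail = 0;
	p->objsize = objsize;
	while (p->avail < POOL_RESERVE) {
		void *obj = malloc(objsize);

		if (!obj) {
			while (p->avail > 0)
				free(p->reserve[--p->avail]);
			return -1;
		}
		p->reserve[p->avail++] = obj;
	}
	return 0;
}

/* Try the normal allocator first; fall back to the reserve if it fails. */
static void *pool_alloc(struct pool *p, int allocator_failed)
{
	void *obj = allocator_failed ? NULL : malloc(p->objsize);

	if (!obj && p->avail > 0)
		obj = p->reserve[--p->avail];
	return obj;
}

/* Refill the reserve before handing memory back to the allocator. */
static void pool_free(struct pool *p, void *obj)
{
	if (p->avail < POOL_RESERVE)
		p->reserve[p->avail++] = obj;
	else
		free(obj);
}
```

Once the reserve is drained, this sketch simply returns NULL; a caller that may block would instead wait for an object to be freed, which is where the slow crawl described in the chat would come from when only two objects are guaranteed.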
2008-09-25 20:33 yep 2008-09-25 20:34 ok, 2copy 2008-09-25 20:34 right 2008-09-25 20:34 seems we're pretty much alone 2008-09-25 20:35 the basic scheme is: alloc page; ->prepare write; copy data onto it; ->commit_write 2008-09-25 20:35 the -> are calls into the filesystem 2008-09-25 20:36 interesting 2008-09-25 20:36 what's the purpose of the prepare? 2008-09-25 20:36 the channel log will be preserved for posterity 2008-09-25 20:36 verify there's enough disk space, etc? 2008-09-25 20:36 I've alwasy wonder about that 2008-09-25 20:36 yes including the comments about wine 2008-09-25 20:36 for a partial page write, the prepare does a read before write 2008-09-25 20:36 otherwise it seems pretty useless 2008-09-25 20:36 I think it is useless 2008-09-25 20:37 but it has been in linux since eternity, which is an argument for it staying another eternity 2008-09-25 20:37 how do you know if it's a partial or full page write? 2008-09-25 20:37 see the parameters passed to it 2008-09-25 20:37 and where they come from 2008-09-25 20:37 ah 2008-09-25 20:37 this comes from the file pos and write len 2008-09-25 20:38 2159 status = a_ops->prepare_write(file, page, offset, offset+bytes); 2008-09-25 20:38 so there may be a partial page at the beginning and one at the end 2008-09-25 20:38 so I'm assuming that the 3rd and 4th paramts 2008-09-25 20:38 are 0,4096 if we're writing a full page 2008-09-25 20:38 pretty dumb to have the ->prepare on every page when only two per transfer need the special treatment 2008-09-25 20:38 oh, moment 2008-09-25 20:38 3rd 0 2008-09-25 20:38 do we call prepare_write, commit_write per page 2008-09-25 20:38 otherwise right 2008-09-25 20:38 or on page ranges 2008-09-25 20:39 per page 2008-09-25 20:39 dumb 2008-09-25 20:39 actually, this whole part of the kernel sucks pretty hard 2008-09-25 20:39 why just 3rd 0, why not 4th PAGE_SIZE (4096)? 
2008-09-25 20:40 4th is normally page_size, yes 2008-09-25 20:40 ok 2008-09-25 20:40 3rd is zero normally because it's an offset 2008-09-25 20:40 in the page 2008-09-25 20:40 right hence the 0,4096 above I was asking about 2008-09-25 20:40 see the flush_dcache page 2008-09-25 20:41 ah 2008-09-25 20:41 sorry, read wrong 2008-09-25 20:41 oki 2008-09-25 20:41 the dcache flush is a noop on x86 2008-09-25 20:41 some arches need it 2008-09-25 20:41 mips I think 2008-09-25 20:41 what's the purpose? 2008-09-25 20:41 tlb hackery? 2008-09-25 20:41 could not swear to that 2008-09-25 20:41 also not really clear to me 2008-09-25 20:41 it's like L1 cache 2008-09-25 20:42 that has to be explicitly flushed 2008-09-25 20:42 why... another matter 2008-09-25 20:42 seems like braindamage to design a processor that doesn't know how to flush its cache 2008-09-25 20:42 but people do it, they have their reasons I suppose 2008-09-25 20:42 maybe the asm code can be much more efficient on some archs if you assume explicit flushes on any change 2008-09-25 20:43 put that one aside to bother the mips maintainer about 2008-09-25 20:43 there is some sparse kernel doc on the subject 2008-09-25 20:43 but there is a general principle here: just because your code works on x86 does not mean it works 2008-09-25 20:44 hmm 2008-09-25 20:44 same is true if all your spinlocks work, because you compiled with smp disabled 2008-09-25 20:44 so how do you test on the dozen+ archs linux supports? 2008-09-25 20:44 get users to report errors? 2008-09-25 20:44 that's the question isn't it? 2008-09-25 20:44 after testing on the 2-3 you have access to? 
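The per-page scheme discussed above (one prepare/commit pair per page, with the in-page range derived from the file position and write length) can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel's code: `SKETCH_PAGE_SIZE`, `struct page_chunk`, `split_write` and `chunk_is_partial` are hypothetical names, and a 4096-byte page is assumed as in the chat.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumption: 4 KiB pages, as in the chat */

/* One per-page step of a buffered write (hypothetical helper type). */
struct page_chunk {
    unsigned long index; /* page index within the file */
    unsigned long from;  /* first byte written inside the page */
    unsigned long to;    /* one past the last byte written inside the page */
};

/*
 * Split a write of len bytes at file position pos into per-page chunks.
 * Stores at most max chunks in out and returns the total number of chunks
 * the write needs. Each chunk's (from, to) pair plays the role of the 3rd
 * and 4th arguments (offset, offset+bytes) in the line quoted in the log.
 */
static size_t split_write(unsigned long long pos, size_t len,
                          struct page_chunk *out, size_t max)
{
    size_t n = 0;

    assert(out != NULL || max == 0);
    while (len > 0) {
        unsigned long offset = (unsigned long)(pos % SKETCH_PAGE_SIZE);
        unsigned long bytes = SKETCH_PAGE_SIZE - offset;

        if (bytes > len)
            bytes = (unsigned long)len;
        if (n < max) {
            out[n].index = (unsigned long)(pos / SKETCH_PAGE_SIZE);
            out[n].from = offset;
            out[n].to = offset + bytes;
        }
        n++;
        pos += bytes;
        len -= bytes;
    }
    return n;
}

/* A chunk is partial when it does not cover the whole page. */
static int chunk_is_partial(const struct page_chunk *c)
{
    return c->from != 0 || c->to != SKETCH_PAGE_SIZE;
}
```

For a write of 10000 bytes at position 100, the sketch yields three chunks: a partial first page (100 to 4096), a full middle page (0 to 4096) and a partial last page (0 to 1908). Only the first and last chunks are partial, which matches the remark that at most two pages per transfer need the read-before-write treatment.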
2008-09-25 20:44 well, I can test smp 32 and 64 bit x86 2008-09-25 20:44 you try to be aware of the issues and write using the generic apis that work on every arch 2008-09-25 20:44 I could probably get my hands on power32 and maybe alpha 2008-09-25 20:45 and eventually, somebody with that arch will hit your bug and complain 2008-09-25 20:45 but that's about it 2008-09-25 20:45 right... 2008-09-25 20:45 but... 2008-09-25 20:45 it's good to test on a couple different arches 2008-09-25 20:45 bugs like that are damn near impossible to trace down 2008-09-25 20:45 big/lttle end 2008-09-25 20:45 is any of the archs the most difficult to program for? 2008-09-25 20:45 and if one can find it, maybe something that has to do explicit dcache flush and other such horrors 2008-09-25 20:45 (I know alpha has the most lenient memory cache coherency model) 2008-09-25 20:45 sparc maybe 2008-09-25 20:46 sparc is pretty horrible 2008-09-25 20:46 who still has sparc machines? 2008-09-25 20:46 pretty much complete absence of atomic instructions 2008-09-25 20:46 dave miller 2008-09-25 20:46 sparc maintainer 2008-09-25 20:46 sun has the niagara box 2008-09-25 20:46 well, hopefully the maintainer does ;-) 2008-09-25 20:46 but it's true, sparc is nearly dead 2008-09-25 20:46 arm 2008-09-25 20:47 arm is embedded 2008-09-25 20:47 on the rise 2008-09-25 20:47 it's a big constituency these days 2008-09-25 20:47 easy to find, hard to find with a lot of ram or power or disk 2008-09-25 20:47 would testing in emulators work? 2008-09-25 20:47 if it has such a great mips/watt ratio you'd expect to see it in hpc 2008-09-25 20:48 some sort of qemu or something? 2008-09-25 20:48 but it's not there 2008-09-25 20:48 makes me wonder 2008-09-25 20:48 about that mips/watt ratio 2008-09-25 20:48 of arm? 2008-09-25 20:48 yes 2008-09-25 20:48 arm is good for stuff which needs high mips 2008-09-25 20:48 but rarely 2008-09-25 20:48 possilby you can test in emulation 2008-09-25 20:48 ie. 
high peak, but mostly idle 2008-09-25 20:48 I think qemu is x86 only 2008-09-25 20:48 i've been doing a lot of stuff on amd geode and intel atom if that helps, i can test something, they're low end but still powerful 2008-09-25 20:48 but those are x86 aren't tyhey? 2008-09-25 20:49 bushman, they' 2008-09-25 20:49 bushman, they're x86 arch 2008-09-25 20:49 but testing is _always_ useful 2008-09-25 20:49 x86 is a sick arch... but it's so dominant 2008-09-25 20:49 true 2008-09-25 20:49 POS86 2008-09-25 20:49 hmm no idea what that is 2008-09-25 20:50 you'll decode it eventually ;) 2008-09-25 20:50 are we done with _2copy? 2008-09-25 20:50 oh piece of 2008-09-25 20:50 balance_dirty_pages_ratelimited(mapping); <- attempt to limit kernel dirty pages 2008-09-25 20:50 so it goes a page at a time right? 2008-09-25 20:50 nasty thing 2008-09-25 20:50 yes, yuck 2008-09-25 20:51 and even then it's a mess 2008-09-25 20:51 there are many different flavors of similar kinds of io transfer loops 2008-09-25 20:51 in filemap.c 2008-09-25 20:51 take a browse and enjoy some of them 2008-09-25 20:52 oh, look at that vmtruncate at the end 2008-09-25 20:52 scary stuff 2008-09-25 20:52 why is there so much of it? 2008-09-25 20:52 much of what? 2008-09-25 20:52 copy loops? 
2008-09-25 20:52 code;-) 2008-09-25 20:52 badly designed 2008-09-25 20:52 or not designed at all 2008-09-25 20:52 just grows 2008-09-25 20:53 changes in response to bug reports 2008-09-25 20:53 it feels like we have multiple interfaces/apis for everything 2008-09-25 20:53 including performance bug reports 2008-09-25 20:53 you're starting to get a feeling for it 2008-09-25 20:53 and eventually none of them get fully tested 2008-09-25 20:53 it's not unmanageable, just unconscionable 2008-09-25 20:53 at least not in all the myriad of combinations 2008-09-25 20:54 they get pretty well tested 2008-09-25 20:54 hmm 2008-09-25 20:54 I _think_ pretty much all buffer wries get funneled through _2copy 2008-09-25 20:54 though I haven't completely read through since this thing landed 2008-09-25 20:54 here's a question then 2008-09-25 20:54 how would I go about tracing a syscall 2008-09-25 20:55 seeing exactly which kernel funcs 2008-09-25 20:55 linux trace toolkit 2008-09-25 20:55 got called in what order with what params? 2008-09-25 20:55 puts probes into the kernel 2008-09-25 20:55 dprobe? kprobe? 2008-09-25 20:55 http://www.opersys.com/LTT/ 2008-09-25 20:55 kprobe 2008-09-25 20:55 now part of ltt I think 2008-09-25 20:55 hmm, so that's the 2nd time you've mentioned ltt 2008-09-25 20:56 it's good I take it? 
2008-09-25 20:56 I haven't used it 2008-09-25 20:56 I should 2008-09-25 20:56 but it's the only game in town 2008-09-25 20:56 I think 2008-09-25 20:56 latest news is 2004 2008-09-25 20:56 I think that may because it got at least partially merged 2008-09-25 20:57 http://ltt.polymtl.ca/ 2008-09-25 20:57 moved 2008-09-25 20:58 right 2008-09-25 20:58 it current 2008-09-25 20:58 to 2.6.27-rc7 2008-09-25 20:58 current to yesterday or so ;) 2008-09-25 20:58 I should try it, there are no doubt many times when it could have saved me time 2008-09-25 20:59 patch-2.6.27-rc7-lttng-0.26.tar.bz225-Sep-2008 16:05 177K - right 2008-09-25 21:00 we should have looked at grab_cache_page in _2copy 2008-09-25 21:00 another poorly named function 2008-09-25 21:00 but important, and it will serve as our introduction the the page cache api 2008-09-25 21:00 Find or create a page at the given pagecache position. Return the locked 2038 * page. This function is specifically for buffered writes. 2008-09-25 21:00 one of the worst apis in the kernel ;) 2008-09-25 21:01 2038? 2008-09-25 21:01 line # 2008-09-25 21:01 oh 2008-09-25 21:01 what does page cache position mean? 2008-09-25 21:01 index within the cache for a particular inode 2008-09-25 21:01 so many logical pages offset in the file 2008-09-25 21:01 so a page cache position is a superblock:inode:offset triplet? 2008-09-25 21:02 just inode:offset 2008-09-25 21:02 because inode->sb 2008-09-25 21:02 ah 2008-09-25 21:02 so now inode #, but instead inode ptr 2008-09-25 21:02 yes 2008-09-25 21:02 s/now/not/ 2008-09-25 21:02 the "page cache" is in fact not a single cache 2008-09-25 21:02 maybe it was at one time 2008-09-25 21:03 but now it is a radix tree that hangs off of each inode 2008-09-25 21:03 giving you an idea maybe how bloating things get with lots of small files 2008-09-25 21:03 so page cache is effectively per inode? 
2008-09-25 21:03 and what a bad idea sysfs is, which uses files and all the cache stuff that goes with it, to communicate tiny, 4 byte quantities, to the kernel 2008-09-25 21:04 page cache is per inode 2008-09-25 21:04 not effectively, absolutely 2008-09-25 21:04 why this split to per inode level? 2008-09-25 21:05 we think it's a good idea 2008-09-25 21:05 doesn't it make it harder to find what to free when memory runs low? 2008-09-25 21:05 no, because all the pages are linked together via a lru list 2008-09-25 21:05 but anyway 2008-09-25 21:05 lru is probably a bad idea 2008-09-25 21:05 lru = least recently used 2008-09-25 21:05 yes 2008-09-25 21:05 self organizing list 2008-09-25 21:06 simple minded 2008-09-25 21:06 just for the folks reading this later 2008-09-25 21:06 not very effective, especially since we mostly bypass it 2008-09-25 21:06 who bypasses it? 2008-09-25 21:06 we do 2008-09-25 21:06 in writeout for example 2008-09-25 21:06 it's mostly per-inode using the inode dirty lists 2008-09-25 21:07 we, as in tux 2008-09-25 21:07 or we as in fs drivers? 2008-09-25 21:07 there as in linuxen 2008-09-25 21:07 we as in linux penguins 2008-09-25 21:08 hmm 2008-09-25 21:08 the lru has exactly one purpose: to decide which page to evict next 2008-09-25 21:08 we mess with the lru idea so much that we don't get good decisions on that 2008-09-25 21:09 what do you mean mess? 2008-09-25 21:09 all kinds of mess 2008-09-25 21:09 there is the concept of hot and cold end of the lru list 2008-09-25 21:09 does it evict both clean and dirty pages?
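The point that a page cache position is an inode:offset pair, with a separate cache hanging off each inode, can be sketched in user-space C. This is a hedged illustration only: `struct sketch_inode`, `struct sketch_page`, `find_page` and `find_or_create_page` are hypothetical names, a plain linked list stands in for the per-inode radix tree mentioned in the log, and the page locking done by the real grab_cache_page path is left out.

```c
#include <assert.h>
#include <stdlib.h>

/* A cached page, identified by its index (page offset) within one file. */
struct sketch_page {
    unsigned long index;      /* logical page offset in the file */
    struct sketch_page *next; /* list linkage, standing in for the radix tree */
    char data[4096];          /* assumption: 4 KiB pages */
};

/* Each inode owns its own cache; there is no single global table here. */
struct sketch_inode {
    struct sketch_page *pages;
};

/* Look up a page by index in this inode's cache only. */
static struct sketch_page *find_page(struct sketch_inode *inode,
                                     unsigned long index)
{
    assert(inode != NULL);
    for (struct sketch_page *p = inode->pages; p; p = p->next)
        if (p->index == index)
            return p;
    return NULL;
}

/* Find the page at the given position, or create a zeroed one. */
static struct sketch_page *find_or_create_page(struct sketch_inode *inode,
                                               unsigned long index)
{
    struct sketch_page *p = find_page(inode, index);

    if (p)
        return p;
    p = calloc(1, sizeof *p);
    if (!p)
        return NULL; /* caller must handle allocation failure */
    p->index = index;
    p->next = inode->pages;
    inode->pages = p;
    return p;
}
```

Because the lookup key is the inode pointer plus the index, the same index in two different inodes resolves to two unrelated pages, and a file with a single tiny page still carries its own cache structure.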
2008-09-25 21:10 and there is code to try to move pages to the hot or cold end of the list according to whether we think the page is hot or cold 2008-09-25 21:10 both clean and dirty 2008-09-25 21:10 actually only clean 2008-09-25 21:10 it cleans dirty pages 2008-09-25 21:10 and evicts clean pages 2008-09-25 21:10 yes, to prefer using hot pages, since those are likely to be in cache 2008-09-25 21:10 so we won't be wasting cache by using hot pages 2008-09-25 21:10 right, except I think we mostly blow chunks in deciding what will be hot 2008-09-25 21:11 since the caches of hot page, can replace spot of previous page 2008-09-25 21:11 yes, evicting pages that will be faulted in again immediately does no good, quite the contrary 2008-09-25 21:12 or read via filesystem operations 2008-09-25 21:12 ok, it's getting late 2008-09-25 21:12 true 2008-09-25 21:12 and I'm falling asleep 2008-09-25 21:12 see you 2008-09-25 21:12 ;-) 2008-09-25 21:13 should be time for questions now 2008-09-25 21:13 overtime 2008-09-25 21:13 and since I've been asking questions all session... 2008-09-25 21:13 I'll let other folks ask questions now 2008-09-25 21:13 next tuesday we will continue with grab_cache_page 2008-09-25 21:14 finally 2008-09-25 21:14 ;-) 2008-09-25 21:14 -!- ChanServ changed mode/#tux3 -> +o flips 2008-09-25 21:14 Topic for #tux3 is: Tux3 list membership roars past 100! ~ http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 p.m. Pacific Time ~ Next session: grab_cache_page and friends 2008-09-25 21:14 -!- flips changed mode/#tux3 -> -o shapor 2008-09-25 21:15 -!- flips changed mode/#tux3 -> -o flips 2008-09-25 21:15 yeah, tried changing that earlier and failed 2008-09-25 21:15 I'll unlock the topic 2008-09-25 21:15 nah 2008-09-25 21:15 anyway, let's find a bed 2008-09-25 21:16 -!- ChanServ changed mode/#tux3 -> +o flips 2008-09-25 21:16 Topic for #tux3 is: Tux3 list membership roars past 100! ~ http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 p.m.
Pacific Time ~ Next session: grab_cache_page and friends 2008-09-25 21:16 -!- flips changed mode/#tux3 -> -o flips 2008-09-25 21:16 hmm 2008-09-25 21:16 -!- ChanServ changed mode/#tux3 -> +o flips 2008-09-25 21:17 -!- flips changed topic to "http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 p.m. Pacific Time ~ Next session: grab_cache_page and friends" 2008-09-25 21:17 -!- flips changed mode/#tux3 -> -o flips 2008-09-25 21:38 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-26 01:28 folks 2008-09-26 01:28 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-09-26 01:28 hello 2008-09-26 01:28 anyone here? 2008-09-26 01:52 -!- pgquiles(~pgquiles@42.Red-83-39-60.dynamicIP.rima-tde.net) has joined #tux3 2008-09-26 02:04 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-09-26 02:04 hello 2008-09-26 02:44 -!- pgquiles(~pgquiles@42.Red-83-39-60.dynamicIP.rima-tde.net) has joined #tux3 2008-09-26 03:58 hola 2008-09-26 04:49 anyone here? 2008-09-26 05:17 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-09-26 05:17 hmm 2008-09-26 05:23 well, i am here, but i will not be able to answer any questions :) 2008-09-26 05:23 and the rest is probably asleep. it's 5 am there or something like that 2008-09-26 05:25 hehe 2008-09-26 05:25 ok 2008-09-26 05:25 hmm 2008-09-26 06:06 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-09-26 06:06 reading yesterday's logs 2008-09-26 06:06 since no one posted the tux u proceedings... 2008-09-26 06:06 i think im going to do that 2008-09-26 06:23 could you post those from tuesday too? 
2008-09-26 06:23 i think they are still missing 2008-09-26 07:13 hmm 2008-09-26 07:13 wait 2008-09-26 08:03 hmmm 2008-09-26 08:49 hola 2008-09-26 08:49 sad to hear the feature drop from tux3 2008-09-26 08:50 was hoping we could blow away zfs 2008-09-26 08:50 that was an awesome idea of dynamically increasing your fs across multiple hds 2008-09-26 08:56 pranith: well, lvm3 will be used for that, hopefully 2008-09-26 08:57 _hopefully_ 2008-09-26 08:57 yeah, the argument for not doing that is reasonable 2008-09-26 09:12 will tux3 be having plugins?? :D 2008-09-26 09:12 we can implement this as a plugin if possible 2008-09-26 09:12 ;) 2008-09-26 09:30 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-26 09:31 ACTION is sorry he missed the class from last night :-( 2008-09-26 09:37 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-26 10:20 -!- flips(~phillips@phunq.net) has joined #tux3 2008-09-26 10:22 nah, plugins for this are pretty much as implementing it to begin with 2008-09-26 10:22 as hard ;-) 2008-09-26 10:48 -!- Kirantpatil(~kiran@122.167.222.6) has joined #tux3 2008-09-26 10:48 -!- Kirantpatil(~kiran@122.167.222.6) has left #tux3 2008-09-26 10:59 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-09-26 12:24 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-26 12:42 -!- pgquiles(~pgquiles@42.Red-83-39-60.dynamicIP.rima-tde.net) has joined #tux3 2008-09-26 14:38 -!- pgquiles(~pgquiles@82.Red-81-33-103.dynamicIP.rima-tde.net) has joined #tux3 2008-09-26 14:53 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-26 15:21 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-26 15:35 -!- pgquiles(~pgquiles@82.Red-81-33-103.dynamicIP.rima-tde.net) has joined #tux3 2008-09-26 15:36 maze, plugins? 2008-09-26 15:36 hmm? 
2008-09-26 15:36 moment 2008-09-26 15:36 MaZe> nah, plugins for this are pretty much as implementing it to begin with 2008-09-26 15:37 earlier on talk of plugins for multi-disk 2008-09-26 15:37 I must have been d/c for that 2008-09-26 15:37 filesystem plugins? 2008-09-26 15:38 something like that 2008-09-26 15:38 implementing support for plugins in tux3 2008-09-26 15:38 and then implementing support for multi-disk as a plugin 2008-09-26 15:38 If I understood correctly 2008-09-26 15:38 I was commenting about 2-3 hours later 2008-09-26 15:38 plugins make me think if reiserfs 2008-09-26 15:39 (08:49:58 AM) pranith: sad to hear the feature drop from tux3 2008-09-26 15:39 (08:50:09 AM) pranith: was hoping we could blow away zfs 2008-09-26 15:39 (08:50:36 AM) pranith: that was an awesome idea of dynamically increasing your fs across multiple hds 2008-09-26 15:39 (08:56:11 AM) RzM|Away left the room (quit: Quit: Computer goes to sleep!). 2008-09-26 15:39 (08:56:47 AM) data: pranith: well, lvm3 will be used for that, hopefully 2008-09-26 15:39 (08:57:04 AM) pranith: _hopefully_ 2008-09-26 15:39 (08:57:31 AM) pranith: yeah, the argument for not doing that is reasonable 2008-09-26 15:39 (09:12:00 AM) pranith: will tux3 be having plugins?? :D 2008-09-26 15:39 (09:12:10 AM) pranith: we can implement this as a plugin if possible 2008-09-26 15:39 if there are plugins, they will certainly not land before initial merge 2008-09-26 15:39 and there will be no aborbing the volume manager into tux3 2008-09-26 15:40 when designing lvm3 we need to figure out some sort of fs-bdev interface which provides more info than currently available 2008-09-26 15:40 what tux3 will do is work more closely with the volume manager 2008-09-26 15:40 which we will also develop 2008-09-26 15:40 maze, did you know you were going to be developing a volume manager? 2008-09-26 15:40 heh 2008-09-26 15:40 serious 2008-09-26 15:40 see comment above 2008-09-26 15:40 which one? 
2008-09-26 15:40 the fs driver in order to schedule some stuff 2008-09-26 15:40 needs to know more about bdev disk layout for non disk bdevs 2008-09-26 15:40 ie. raid multi-disk etc 2008-09-26 15:41 exactly 2008-09-26 15:41 the fs is going to be able to specify the volume map 2008-09-26 15:41 and to retrieve it from the lvm 2008-09-26 15:41 very important 2008-09-26 15:41 a driver for this is already sketched out 2008-09-26 15:41 it's called "table block device" 2008-09-26 15:42 and will be a plugin for lvm3 2008-09-26 15:42 which we are going to develop 2008-09-26 15:42 hmm interesting 2008-09-26 15:42 -!- pgquiles(~pgquiles@82.Red-81-33-103.dynamicIP.rima-tde.net) has joined #tux3 2008-09-26 15:42 covered stuff like this in my berlin talk 2008-09-26 15:43 and should probably get it on the agenda for some upcoming linux confab 2008-09-26 15:43 though sometimes I feel like I'm showing television to the family dog as far as understanding from other kernel devs goes 2008-09-26 15:43 hopefuly repeated impressions makes a difference 2008-09-26 15:44 that's one reason why we need to make some new kernel devs 2008-09-26 15:44 heh 2008-09-26 15:45 I'm being pulled in 4 directions here 2008-09-26 15:46 bio is pulling strongest I think 2008-09-26 15:46 and not having enough time to devote myself fully to any of these 2008-09-26 15:46 just don't get distracted by mm 2008-09-26 15:46 no, not even talking about in-kernel 2008-09-26 15:46 regard it as an interesting, funny little friend with occasionally curious opinions and you will be ok 2008-09-26 15:47 talking about my team at work, another team (kernel), and a 3rd team in the process of being created (networking), plus pure kernel as fourth 2008-09-26 15:47 fuck real life ;) 2008-09-26 15:48 anyway, the key is not to get the idea you have to understand the whole kernel at once 2008-09-26 15:48 even linus doesn't 2008-09-26 15:48 or akpm, though he probably gets closest 2008-09-26 15:51 of course you don't 2008-09-26 15:51 
but it's best to understand as much as possible 2008-09-26 15:52 and preferably at least one level down from where you muck around 2008-09-26 15:52 true 2008-09-26 15:52 if the kernel internal apis were clean and well documented, this wouldn't be that needed 2008-09-26 15:52 with the total lack of docs, it's not possible to write non-buggy code 2008-09-26 15:52 sometimes you are just plain forced to proceed on induction though 2008-09-26 15:52 without knowing what it will actually do 2008-09-26 15:52 it depends on what you're trying to do of course 2008-09-26 15:53 anyway 2008-09-26 15:53 lxr is the answer 2008-09-26 15:53 write bug-less code is easy, if and only if, there is no undocumented code in the layers beneath (and to a lesser extent above) you 2008-09-26 15:53 right, hence I've been reading code as a passtime 2008-09-26 15:53 knowing everything in advance helps you design, but is not strictly necessary for developing or debugging 2008-09-26 15:53 true 2008-09-26 15:54 but browsing code requires less commitment 2008-09-26 15:54 in other words, you can pick up the info on-demand for the latter two 2008-09-26 15:54 and that's something I have time for 2008-09-26 15:54 right 2008-09-26 15:54 also: taking in too much at once can lead to burn out and drop out 2008-09-26 15:55 true 2008-09-26 15:55 what does NPI mean? 
2008-09-26 15:55 NFI 2008-09-26 15:55 <- "no fscking idea" 2008-09-26 15:56 heh 2008-09-26 15:56 it's a TLA thought up by a SFI 2008-09-26 16:08 -!- pgquiles(~pgquiles@82.Red-81-33-103.dynamicIP.rima-tde.net) has joined #tux3 2008-09-26 16:14 -!- pgquiles(~pgquiles@82.Red-81-33-103.dynamicIP.rima-tde.net) has joined #tux3 2008-09-26 16:21 Settlement-Free Interconnect 2008-09-26 16:21 apparently SFI is a valid TLA in networking 2008-09-26 16:21 just hit on it in a doc 2008-09-26 16:38 -!- pgquiles(~pgquiles@82.Red-81-33-103.dynamicIP.rima-tde.net) has joined #tux3 2008-09-26 17:00 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-26 17:01 -!- pgquiles(~pgquiles@82.Red-81-33-103.dynamicIP.rima-tde.net) has joined #tux3 2008-09-26 17:06 -!- pgquiles(~pgquiles@82.Red-81-33-103.dynamicIP.rima-tde.net) has joined #tux3 2008-09-26 17:37 -!- pgquiles(~pgquiles@82.Red-81-33-103.dynamicIP.rima-tde.net) has joined #tux3 2008-09-26 20:02 -!- ajonat(~ajonat@190.48.119.128) has joined #tux3 2008-09-26 20:30 -!- ajonat(~ajonat@190.48.119.128) has joined #tux3 2008-09-26 21:16 maze, in my lexicon a SFI is a stupid fscking idiot ;) 2008-09-26 21:16 somebody who likes to invent TLAs to be leet 2008-09-26 21:17 I suppose that would now include me 2008-09-26 21:18 course we could consider an exemption for those who only make them up as satire 2008-09-26 22:12 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-26 22:32 hey tim_dimm 2008-09-27 02:12 hey 2008-09-27 02:34 -!- flips(~phillips@phunq.net) has joined #tux3 2008-09-27 02:55 -!- pranith(7aa040b1@webchat.mibbit.com) has joined #tux3 2008-09-27 02:55 hello guys 2008-09-27 03:02 no one is here when im around 2008-09-27 03:02 :( 2008-09-27 03:02 its bad to be on the other side of the world 2008-09-27 03:08 yeah, we're all hard asleep 2008-09-27 03:11 hehe 2008-09-27 03:12 yeah 2008-09-27 03:12 wht u doin? 
2008-09-27 03:25 -!- pranith(7aa040b1@webchat.mibbit.com) has joined #tux3 2008-09-27 03:27 falling asleep while skimming code 2008-09-27 03:28 hmm 2008-09-27 03:28 which code are u skimming 2008-09-27 03:40 just some of the other places in the kernel which do options parsing 2008-09-27 04:01 hmm 2008-09-27 04:01 ok 2008-09-27 05:19 -!- Kirantpatil(~kiran@122.167.179.185) has joined #tux3 2008-09-27 05:22 -!- Kirantpatil(~kiran@122.167.179.185) has left #tux3 2008-09-27 07:26 -!- BSD(~bandan@38.117.250.152) has joined #tux3 2008-09-27 07:44 -!- pgquiles(~pgquiles@82.Red-81-33-103.dynamicIP.rima-tde.net) has joined #tux3 2008-09-27 09:14 -!- pgquiles(~pgquiles@82.Red-81-33-103.dynamicIP.rima-tde.net) has joined #tux3 2008-09-27 09:41 -!- pgquiles(~pgquiles@82.Red-81-33-103.dynamicIP.rima-tde.net) has joined #tux3 2008-09-27 09:45 -!- pgquiles(~pgquiles@82.Red-81-33-103.dynamicIP.rima-tde.net) has joined #tux3 2008-09-27 11:34 -!- pgquiles(~pgquiles@82.Red-81-33-103.dynamicIP.rima-tde.net) has joined #tux3 2008-09-27 11:43 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-27 12:19 -!- pranith(7aa040b1@webchat.mibbit.com) has joined #tux3 2008-09-27 12:19 heya 2008-09-27 12:20 hi. thought you wanted to post the univ-sessions? 2008-09-27 12:21 yeah, i was going through them... 2008-09-27 12:22 dint finish reading.. 2008-09-27 13:14 pranith: thanks 2008-09-27 13:15 data, welcome 2008-09-27 13:15 data, what do you do? 2008-09-27 13:15 for tux3? nothing but reading :) 2008-09-27 13:15 reading as in? 2008-09-27 13:16 reading the channel, university sessions, some code. But it's still exam-time, and since i am double majoring i don't have that much spare time 2008-09-27 13:16 not for tux, in gen 2008-09-27 13:17 you mean what i read in my leisure time? 2008-09-27 13:17 i meant if you went to college or something? 2008-09-27 13:17 u are majoring in? 
2008-09-27 13:17 computer science and math 2008-09-27 13:18 nice 2008-09-27 13:18 so yes, university, in germany 2008-09-27 13:18 which univ/ 2008-09-27 13:18 if you know it, karlruhe, technical university 2008-09-27 13:18 soon to be known as KIT 2008-09-27 13:18 karlsruhe, actually 2008-09-27 13:19 hmm 2008-09-27 13:19 ok 2008-09-27 13:20 and what are youdoing? 2008-09-27 13:20 im working 2008-09-27 13:20 completed my bachelors recently 2008-09-27 13:27 -!- ajonat(~ajonat@190.48.124.246) has joined #tux3 2008-09-27 13:50 sorry, i am not that talkative right now. still reading a little about real time systems and scheduling for my exam on monday 2008-09-27 13:50 oh 2008-09-27 13:50 all the best 2008-09-27 13:50 thanks 2008-09-27 13:51 it's not that hard, but they have a lot of analog to digital stuff in there 2008-09-27 13:51 where it gets kind of hairy 2008-09-27 13:51 op-amps e.g. 2008-09-27 13:52 hmm 2008-09-27 13:52 nice stuff 2008-09-27 13:52 u have that in rts? 2008-09-27 13:52 yes, dunno why 2008-09-27 13:52 i can see the relation, but still 2008-09-27 13:52 hmm, interesting 2008-09-27 13:53 closed loop controls with their laplace transforms, z-transforms 2008-09-27 13:53 hmm 2008-09-27 13:56 oh, and one should not forget all the stuff about cnc machines, programming ofthem, robot controls and ... 
just too much stuff not really related to the topic 2008-09-27 14:11 the relationship between analog and realtime is deep and important 2008-09-27 14:11 filters come into realtime a lot 2008-09-27 14:12 and time derivatives and integrals 2008-09-27 14:12 your delta t has to be exact ;) 2008-09-27 14:12 or your rocket explodes 2008-09-27 14:12 file_bwrite: block write <0:0> 2008-09-27 14:12 ---- extent 0x0/4 ---- 2008-09-27 14:12 balloc extent -> [2/4] 2008-09-27 14:12 segs: 0x2/4 (1) 2008-09-27 14:12 group -1/0 at entry -1/0 2008-09-27 14:12 1 entry groups: 2008-09-27 14:12 0/1: 0 => 2/4; 2008-09-27 14:12 file_bwrite: block write <0:5> 2008-09-27 14:12 ---- extent 0x5/2 ---- 2008-09-27 14:12 balloc extent -> [9/2] 2008-09-27 14:12 segs: 0x9/2 (1) 2008-09-27 14:13 group 0/1 at entry 0/1 2008-09-27 14:13 1 entry groups: 2008-09-27 14:13 0/2: 0 => 2/4; 5 => 9/2; 2008-09-27 14:13 flush... Success 2008-09-27 14:13 extent writing a lot closer to working 2008-09-27 14:13 now two discontiguous extents formed and written into the dleaf 2008-09-27 14:14 flips: i see the connection... learnt enough about it :0 2008-09-27 14:45 next test is to rewrite a region that already has extents 2008-09-27 14:45 and expose the next flock of bugs 2008-09-27 14:45 shapor... 
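The debug output above ends with a map of the form "0 => 2/4; 5 => 9/2". Reading each entry as "logical block => physical block/count" is an assumption drawn from the surrounding messages, not a documented format. Under that reading, looking up a logical block in a list of discontiguous extents can be sketched as follows; `struct sketch_extent` and `map_block` are hypothetical names and this is not the tux3 dleaf code.

```c
#include <assert.h>
#include <stddef.h>

/* One extent: count blocks starting at logical, stored at physical. */
struct sketch_extent {
    unsigned long logical;  /* first logical block covered */
    unsigned long physical; /* first physical block allocated */
    unsigned long count;    /* number of contiguous blocks */
};

/*
 * Translate a logical block to a physical block using an extent list.
 * Returns 1 and stores the result in *physical when the block is mapped,
 * or 0 when the block falls in a hole between extents.
 */
static int map_block(const struct sketch_extent *map, size_t n,
                     unsigned long logical, unsigned long *physical)
{
    assert(physical != NULL);
    for (size_t i = 0; i < n; i++) {
        if (logical >= map[i].logical &&
            logical - map[i].logical < map[i].count) {
            *physical = map[i].physical + (logical - map[i].logical);
            return 1;
        }
    }
    return 0;
}
```

With the two extents from the log, {0, 2, 4} and {5, 9, 2}, logical block 3 maps to physical block 5, logical block 6 maps to physical block 10, and logical block 4 is an unmapped hole between the two extents.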
2008-09-27 15:00 Ok, here we go again: 2008-09-27 15:00 ---- extent 0x0/7 ---- 2008-09-27 15:00 balloc extent -> [c/5] 2008-09-27 15:00 segs: 0xc/5 0x9/2 (2) 2008-09-27 15:00 group 0/1 at entry 0/2 2008-09-27 15:00 group 0/1 at entry 0/2 2008-09-27 15:00 1 entry groups: 2008-09-27 15:00 0/3: 0 => 2/4 c/5; 5 => 9/2; 0 => ; 2008-09-27 15:01 close actually 2008-09-27 15:01 well not actually 2008-09-27 15:01 buncha bugs 2008-09-27 15:33 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-27 15:34 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-27 15:50 -!- joededman(~chatzilla@S0106005004b0be73.ed.shawcable.net) has joined #tux3 2008-09-27 16:01 folks 2008-09-27 16:05 ok, let's see where this puppy strays afield 2008-09-27 16:09 flips: how far do you feel like skating today? 2008-09-27 16:10 hard to say 2008-09-27 16:10 I take it you had an adventure in mind? 2008-09-27 16:10 thinking about ti 2008-09-27 16:10 it 2008-09-27 16:29 -!- pgquiles(~pgquiles@82.Red-81-33-103.dynamicIP.rima-tde.net) has joined #tux3 2008-09-27 16:52 ok, there's the first bug, obvious enough: after retrieving an extent to see if it should be skpped, if it should not be skipped then we need to rewind before the next step, or save the extent as the current one and scan on from there 2008-09-27 16:53 maybe I should explain what I'm doing on the list so we can get a couple more hands pulling on the oars 2008-09-27 16:54 I guess it is probably better to avoid rewinds and save some cpu 2008-09-27 16:58 but then when we do rewind we need to include the extent we just found, so a rewind would have to not only reset the dwalk state but the saved extent just found there 2008-09-27 16:59 fiddly, but probably the most efficient way to do it 2008-09-27 17:03 getting close to sk8 oclock 2008-09-27 17:27 ah, dwalk_back is the answer 2008-09-27 17:27 unread the last returned extent 2008-09-27 19:45 -!- BSD(~bandan@38.117.250.152) has joined #tux3 2008-09-27 
21:10 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-27 22:12 folks 2008-09-27 22:23 ok, next bug is clear 2008-09-27 22:23 clear braindamage 2008-09-27 22:24 have to truncate before repacking 2008-09-27 22:24 obvious :-P 2008-09-27 22:41 -!- ajonat(~ajonat@190.48.124.246) has joined #tux3 2008-09-27 22:48 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-09-27 22:52 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-09-28 00:21 -!- pranith(7aa040b1@webchat.mibbit.com) has joined #tux3 2008-09-28 02:06 -!- Aks(~ankitsriv@123.237.69.19) has joined #tux3 2008-09-28 02:07 -!- Aks(~ankitsriv@123.237.69.19) has left #tux3 2008-09-28 03:05 -!- Kirantpatil(~kiran@122.167.219.252) has joined #tux3 2008-09-28 03:05 -!- Kirantpatil(~kiran@122.167.219.252) has left #tux3 2008-09-28 03:05 -!- bobby(~bobby@122.162.68.241) has joined #tux3 2008-09-28 03:05 hey guys 2008-09-28 03:12 anyone awake? 2008-09-28 03:41 -!- bobby(~bobby@122.162.68.241) has joined #tux3 2008-09-28 03:47 -!- paola(~paola@ppp-219-23.20-151.libero.it) has joined #tux3 2008-09-28 05:23 -!- pgquiles(~pgquiles@166.Red-83-35-243.dynamicIP.rima-tde.net) has joined #tux3 2008-09-28 06:01 -!- pgquiles(~pgquiles@166.Red-83-35-243.dynamicIP.rima-tde.net) has joined #tux3 2008-09-28 07:02 -!- paola(~paola@ppp-157-16.20-151.libero.it) has joined #tux3 2008-09-28 08:04 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-28 08:04 -!- tim_dimm_(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-28 09:45 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-09-28 10:24 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-09-28 11:18 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-09-28 11:45 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-09-28 
11:53 -!- pgquiles__(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-09-28 12:10 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-28 12:14 -!- pranith(7aa24857@webchat.mibbit.com) has joined #tux3 2008-09-28 12:14 hey guys 2008-09-28 12:15 flipsout: was thinking about the unit testing you wanted for dwalk_pack 2008-09-28 12:15 i am not sure about how you go about doing that.. was reading the dleaf.c code today. 2008-09-28 12:16 u mentioned some unit test already present 2008-09-28 12:16 can u point out where that is? 2008-09-28 15:01 -!- paola(~paola@ppp-157-16.20-151.libero.it) has left #tux3 2008-09-28 15:34 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-28 16:02 -!- orgthingy(~orgthingy@62.150.55.188) has joined #tux3 2008-09-28 16:02 hi 2008-09-28 16:03 so, wait wait, why is this Tux3 better than ext3 ? 2008-09-28 16:03 orgthing, did you read the initial post? 2008-09-28 16:03 orgthingy 2008-09-28 16:03 initial post? 2008-09-28 16:04 "Tux3, a versioning filesystem" 2008-09-28 16:04 tux3.org seems empty :P 2008-09-28 16:04 did you follow the links there? 2008-09-28 16:04 i dont want to sound like an idiot, but what exactly is "versioning" filesystem ? 2008-09-28 16:05 another word for snapshots 2008-09-28 16:06 flips : and, how is tux2 opensource if it was never released? 2008-09-28 16:07 flips : and, what do you think is better than ext3 (discarding tux2 and tux3) in your opinion ? 2008-09-28 16:07 the code is linked on the site and is under gpl v3 2008-09-28 16:08 orgthingy, care to say something about yourself and where you are coming from? 2008-09-28 16:08 flips : well, what shall i say... Im just an opensource fan and my goal is to be UNIX genius? :P 2008-09-28 16:08 UNIX/Linux to be exact xD 2008-09-28 16:09 worked on which projects? 
2008-09-28 16:09 yes, mostly bug and features reports though 2008-09-28 16:09 used to do stuff with python iirc 2008-09-28 16:09 ext3 does not have versioning of any form, or extents 2008-09-28 16:09 flips : Id be happy to help with tux3, but not sure how this whole thing works 2008-09-28 16:10 but ill study it and understand it 2008-09-28 16:10 it is also slow at deleting 2008-09-28 16:10 yes, slow at deleting.. cant deny that :P 2008-09-28 16:10 it is also limited to files and volumes of a few TB 2008-09-28 16:11 ext4 is what you should be asking about 2008-09-28 16:11 which also has no snapshots 2008-09-28 16:12 few TB? 2008-09-28 16:12 isnt that quite.. a lot :P 2008-09-28 16:12 not these days, a TB disk costs a little over $100 2008-09-28 16:14 :| 2008-09-28 16:14 man, they lied to me then! 2008-09-28 16:14 i bought 250GB external HD for $90 !! 2008-09-28 16:15 and i was like "oh, quite fair price" ! 2008-09-28 16:15 yes, you got ripped off 2008-09-28 16:15 :( 2008-09-28 16:15 all prices were like that in different shops 2008-09-28 16:15 :'( 2008-09-28 16:16 anyway, flips, may you tell me about you? 2008-09-28 16:16 google me? 2008-09-28 16:18 lol, ok 2008-09-28 16:18 ACTION googles "flips" 2008-09-28 16:18 flips : nothing that seems to be u 2008-09-28 16:19 ACTION googles himself 2008-09-28 16:19 ACTION saw something embarrassing that happened 6 years ago 2008-09-28 16:19 me hides xD 2008-09-28 16:19 /me * 2008-09-28 16:20 try /whois flips 2008-09-28 16:20 maybe consider using your real name on irc 2008-09-28 16:21 I glaube dass du kanst Deutsch 2008-09-28 16:21 deutsch? 2008-09-28 16:21 german? 
2008-09-28 16:22 omg 2008-09-28 16:22 Daniel Phillips :| 2008-09-28 16:22 flips : nice to meet you 2008-09-28 16:23 nice to meet you 2008-09-28 16:23 flips : Im gonna learn german soon, but online :( 2008-09-28 16:23 no classes over here 2008-09-28 16:23 i mean, i gotta learn in LiveMocha 2008-09-28 16:23 doesn't matter, I just thought you were german because you are connected to a server in darmstadt 2008-09-28 16:23 flips : is there ANY GOOD WAY that I can learn about filesystems? tux3 sounds like a nice project 2008-09-28 16:24 http://en.wikipedia.org/wiki/Filesystems <- start here 2008-09-28 16:24 ah, wikipedia... :P 2008-09-28 16:25 http://lxr.linux.no/linux+v2.6.26.5/fs/open.c#L1106 <- continue here 2008-09-28 16:26 flips : thanks sir, but a usual feature i tell devoloper is: How about making a program for Windows and probably OS X that makes it possible to read your filesystem, which is tux3 is this situation"? 2008-09-28 16:27 exercise left for the interested reader 2008-09-28 16:27 because, FAT32 is bad, and seems to be only way, after NTFS, to be read by all Oses.. and since NTFS is proprietary, and not *that good*, i suggest making these kind of readers 2008-09-28 16:39 flips : may I ask what FileSystem you are using right now? 2008-09-28 16:40 ext3 2008-09-28 16:40 I see 2008-09-28 16:42 flips : it says in wikipedia that stuff get mounted at boot time.. but what about, like, umm.. we inserted a CD after we booted? 2008-09-28 16:42 how can it be automatically mounted? 2008-09-28 16:44 automount? 2008-09-28 16:44 flips, yes 2008-09-28 16:44 how exactly does automount work? 
2008-09-28 16:45 beyond the scope of this channel, http://www.google.com/search?q=automount 2008-09-28 16:47 ah, Google again xD 2008-09-28 17:20 finally, dwalk_next and dwalk_back seem to kind of make sense 2008-09-28 17:20 could be near the end of pain for extent writing 2008-09-28 17:21 checkin coming a little after sk8 oclock 2008-09-28 19:22 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-28 20:02 tim_dimm, ping 2008-09-28 20:09 folks 2008-09-28 20:11 flips: give me a few 2008-09-28 21:04 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-28 21:50 -!- Kirantpatil(~kiran@122.167.215.234) has joined #tux3 2008-09-28 21:50 -!- Kirantpatil(~kiran@122.167.215.234) has left #tux3 2008-09-28 22:45 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-09-28 22:45 hey guys 2008-09-28 22:45 flips: you got some time? 2008-09-28 22:45 flips: need u to tell me about unit tests for tux3 2008-09-28 22:45 .. 2008-09-28 22:45 hi 2008-09-28 22:46 hello 2008-09-28 22:46 parnith, for example to run the dleaf unit test: make dleaf && make dleaftest 2008-09-28 22:46 ok, where are the tests written? dleaftest.c? 2008-09-28 22:47 dleaf test is written in dleaf.c 2008-09-28 22:47 it's a main routine that is only compiled if you're compiling just dleaf by itself 2008-09-28 22:47 the other unit tests are similar 2008-09-28 23:14 ok.. 2008-09-28 23:14 going through them 2008-09-28 23:25 -!- pranith(~bobby@nat-inn.mentorg.com) has joined #tux3 2008-09-28 23:27 -!- ajonat(~ajonat@190.48.124.246) has joined #tux3 2008-09-28 23:36 hmm 2008-09-28 23:48 not clear what the unit tests are testing? 
2008-09-28 23:56 btw, I'm coming through LA to SF on Friday 2008-09-28 23:56 I'll be around 2008-09-29 00:09 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-29 00:12 flips, going through the code 2008-09-29 00:12 trying to figure out the unit tests 2008-09-29 00:12 some are obvious, others not so obvious 2008-09-29 00:17 file_bwrite: block write <0:0> 2008-09-29 00:17 ---- extent 0x0/7 ---- 2008-09-29 00:17 existing extents: 0x0 => 2/4; 0x5 => 9/2; 2008-09-29 00:17 ---- rewind to 0x0 => 2/4 ---- 2008-09-29 00:17 balloc extent -> [4/1] 2008-09-29 00:17 segs: 0x2/4 0x4/1 0x9/2 (3) 2008-09-29 00:17 dwalk_chop_after: 1 groups, 0 entries in last 2008-09-29 00:17 1 entry groups: 2008-09-29 00:17 0/0: 2008-09-29 00:17 group 0/1 at entry -1/0 2008-09-29 00:17 group 0/1 at entry 1/1 2008-09-29 00:17 group 0/1 at entry 3/2 2008-09-29 00:17 1 entry groups: 2008-09-29 00:17 0/3: 0 => 2/4; 4 => 4/1; 5 => 9/2; 2008-09-29 00:17 flush... Success 2008-09-29 00:17 woohoo, first time this ever worked 2008-09-29 00:17 rewrite a region of file containing two discontiguous extents 2008-09-29 00:18 it properly fills in the 1 block gap between them 2008-09-29 00:18 nice 2008-09-29 00:41 hmm, a bug in set_bits, that has to hurt 2008-09-29 00:54 ah, no it was a bug in balloc_extent_from_range 2008-09-29 01:42 ACTION resets his firefox home page from google.com to tux3.org 2008-09-29 01:45 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-09-29 01:45 morning pgquiles 2008-09-29 01:46 hey flips 2008-09-29 01:47 I was thinking of you yesterday 2008-09-29 01:47 I'm touched ;) 2008-09-29 01:48 mkfs reserves 5% of the blocks for root, what's the use case for that? system crashes or is full and root comes to the rescue? 
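The file_bwrite trace above rewrites a 7-block region that spans two discontiguous extents, and one block is allocated for the hole between them. The search for that hole can be sketched roughly as below. This is only a model: it assumes the trace notation means logical => physical/count, and find_gaps is a hypothetical helper, not a function from the tux3 source.

```python
def find_gaps(extents, start, count):
    """Return (logical, count) holes inside [start, start + count).

    extents: list of (logical, physical, count) tuples, assumed sorted by
    logical address and non-overlapping, as in the trace.
    """
    gaps = []
    cursor = start
    end = start + count
    for logical, _physical, length in extents:
        if logical >= end:
            break                      # extent starts past the write range
        if logical + length <= cursor:
            continue                   # extent lies entirely before the cursor
        if logical > cursor:
            gaps.append((cursor, logical - cursor))
        cursor = max(cursor, logical + length)
    if cursor < end:
        gaps.append((cursor, end - cursor))
    return gaps


# Existing extents from the trace: 0x0 => 2/4 and 0x5 => 9/2.
existing = [(0x0, 2, 4), (0x5, 9, 2)]
print(find_gaps(existing, 0x0, 7))
```

With the extents from the trace, the only hole is the single block at logical address 4, which matches the "1 block gap" mentioned in the log.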
2008-09-29 01:48 I'm asking because that accounts for a whopping 50GB in my 1TB disk, which is a lot of seemingly wasted space :-/ 2008-09-29 01:48 attempt to avoid DoS of root 2008-09-29 01:48 by ordinary user, in absense of quotas 2008-09-29 01:49 but 5% is excessive 2008-09-29 01:49 on the other hand, it does need to scale with time as everything gets more bloated 2008-09-29 01:49 I guess 1GB would more more than enough 2008-09-29 01:49 50 MB would be enough 2008-09-29 01:49 probably 2008-09-29 01:50 I looked at the source of mkfs and was surprised to discover it accepts -mDOUBLE, I thought it only accepted an int 2008-09-29 01:50 ( I used -m1 and it still hurts!) 2008-09-29 01:51 you never know how many blocks sombody might want to reserve 2008-09-29 01:51 takes a double... 2008-09-29 01:52 yes, that's dumb but does no harm 2008-09-29 01:52 email ted and ask for a decimal point 2008-09-29 01:53 "avoids fragmnentation"... I doubt that 2008-09-29 01:54 I think what the man page means there is, performance gets so pathetically bad when the filesystem is 95%+ full that we just don't allow it 2008-09-29 01:54 think of it as a lameness tax ;-) 2008-09-29 01:54 we will try to make tux3 perform reasonably well even at 99% full 2008-09-29 01:54 going to be lots of work, hope you are ready to help 2008-09-29 01:55 send us some of your spanish dev buddies 2008-09-29 02:00 :-D 2008-09-29 02:00 I don't know anybody in Spain developing filesystems 2008-09-29 02:01 althought I must say the tux3 university is really useful 2008-09-29 02:01 I hope I'll have some spare time to catch up 2008-09-29 02:01 eventually :-) 2008-09-29 02:02 i think we need a video version of tux3 U someday 2008-09-29 02:02 hi pgguiles 2008-09-29 02:13 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-09-29 03:07 -!- kbingham(~kbingham@193.132.141.186) has joined #tux3 2008-09-29 03:25 hello ! 
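The reserve figures quoted above are plain percentages of the volume size. A small sketch of the arithmetic, assuming a 1 TB disk counted as 1000 GB; reserved_gb is an illustrative helper and not part of mkfs.

```python
def reserved_gb(volume_gb, percent):
    """Space held back for root, in GB, for a given reserve percentage."""
    return volume_gb * percent / 100.0


# 5 is the default discussed in the log, 1 is the -m1 setting mentioned there,
# and 0.1 is a fractional value of the kind the log says -m accepts.
for percent in (5, 1, 0.1):
    print(f"{percent}% of 1000 GB: {reserved_gb(1000, percent):g} GB reserved")
```

At 5% the reserve is the 50 GB complained about in the log; at 1% it is still 10 GB.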
2008-09-29 03:32 i finally understood what "versioning filesystem" really means :P 2008-09-29 03:33 how about saying it in your own words 2008-09-29 03:34 flips : its like when it keeps old copies of file or folder? 2008-09-29 03:35 partly 2008-09-29 03:38 ACTION keeps searching about "versioning filesystems" 2008-09-29 03:38 flips : today, Im gonna buy some books about that xD 2008-09-29 03:38 haha 2008-09-29 03:38 i have to learn about these things.. sounds interesting :P 2008-09-29 03:38 we're writing the books ;) 2008-09-29 03:39 source codes? 2008-09-29 03:41 orgthingy, did u understand the version pointer part? 2008-09-29 03:41 pranith : not yet 2008-09-29 03:41 about how maintaining versions this way is useful? 2008-09-29 03:41 hmm 2008-09-29 03:41 because once i want to learn something, i find out that i need to learn another thing 2008-09-29 03:41 complicated :P 2008-09-29 03:42 hmm, yeah 2008-09-29 03:42 kind of 2008-09-29 03:42 i had to read the document thrice 2008-09-29 03:42 before i got a semblence of understanding 2008-09-29 03:43 orgthingy, u from germany? 2008-09-29 03:43 peanitth : no, but im interested in german language 2008-09-29 03:43 :P 2008-09-29 03:44 ohk 2008-09-29 03:44 ACTION stares at http://www.ext3cow.com/Welcome_files/example1.jpg 2008-09-29 03:44 ACTION smiles 2008-09-29 04:06 -!- pgquiles__(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-09-29 04:30 flips, can u explain the fields in struct dwalk? 2008-09-29 04:31 pranith, sure 2008-09-29 04:31 specially about the gdict and edict fields 2008-09-29 04:31 they are just what is necessary for dwalk_next to work efficiently 2008-09-29 04:31 ok, what are they used for? 
2008-09-29 04:31 needs to have a pointer to an extent and a group and an entry, and a limit for each to know when to go to the next 2008-09-29 04:32 exbase is a little different, the ->limit field of entry is relative to that 2008-09-29 04:32 and the "mock" fields are for calculating the finished size of a packed leaf, without actually writing the leaf 2008-09-29 04:32 i suppose exbase is the base of the extent 2008-09-29 04:33 yes 2008-09-29 04:33 i.e., the current extent's pointer 2008-09-29 04:33 no 2008-09-29 04:33 it's the lowest extent for an entire group of entries 2008-09-29 04:33 there is a description of the dleaf format somewhere 2008-09-29 04:34 ping shapor about it maybe 2008-09-29 04:34 he's the expert 2008-09-29 04:34 ohk, will do 2008-09-29 04:34 writing a mail.. that will better document it 2008-09-29 04:34 notices that the "limit" field of the entry and the "count" field of the group are only one byte, that goes a long way towards explaining some of the apparent complexity 2008-09-29 04:35 a dleaf index is highly compressed and therefore a little tricky to edit 2008-09-29 04:35 hmm 2008-09-29 04:35 yeah, u were trying to simplify that 2008-09-29 04:36 with an api... 2008-09-29 04:36 successfully, see the latest checkins 2008-09-29 04:36 filemap.c is now pretty obvious I think 2008-09-29 04:37 ok, how do i update using hg? i usually remove the tux3 folder and pull it again :( 2008-09-29 04:37 that works 2008-09-29 04:37 just hg pull will do it 2008-09-29 04:37 ok 2008-09-29 04:37 then hg update I think 2008-09-29 04:37 ok 2008-09-29 04:39 night 2008-09-29 04:40 gn 2008-09-29 04:42 flips : have you compared ext3cow to tux3 ? 
2008-09-29 06:08 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-09-29 07:16 -!- orgthingy_(~orgthingy@62.150.55.188) has joined #tux3 2008-09-29 07:28 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-09-29 07:54 -!- smitht(~chatzilla@ool-182f94db.dyn.optonline.net) has joined #tux3 2008-09-29 07:57 -!- smitht(~chatzilla@ool-182f94db.dyn.optonline.net) has left #tux3 2008-09-29 10:08 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-29 10:18 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-29 11:21 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2008-09-29 12:13 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-29 13:06 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-09-29 13:10 -!- orgthingy(~orgthingy@62.150.55.188) has joined #tux3 2008-09-29 14:16 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-29 14:22 -!- kbingham(~kbingham@92.21.238.93) has joined #tux3 2008-09-29 15:15 bah 2008-09-29 15:15 ACTION is kind of annoyed by how sloppy the lockdep code is 2008-09-29 15:16 like folks never figured out in the Linux community to use a lot of small abstraction so that function bodies express things simply and are readable in that way as well 2008-09-29 15:16 it's just extra brain wankery that I could do without 2008-09-29 15:30 -!- orgthingy(~orgthingy@62.150.55.188) has left #tux3 2008-09-29 15:30 -!- orgthingy(~orgthingy@62.150.55.188) has joined #tux3 2008-09-29 15:30 ops 2008-09-29 16:01 -!- orgthingy(~orgthingy@62.150.55.188) has joined #tux3 2008-09-29 16:06 so, are you all developers in tux3? 
2008-09-29 16:56 -!- orgthingy(~orgthingy@62.150.55.188) has joined #tux3 2008-09-29 18:02 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-29 18:43 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-29 19:04 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-29 19:42 bh, ingo isn't big on abstraction, more a hack it now kinda guy 2008-09-29 19:42 has its uses 2008-09-29 19:43 bh, but you can always abstract it as it should be 2008-09-29 19:43 beauty of open source 2008-09-29 19:44 orgthingy, ext3cow looks like a fine project 2008-09-29 19:44 but it's missing a few essential things 2008-09-29 19:45 like: writable snapshots, snapshots of snapshots, deletion of snapshots 2008-09-29 19:46 it's very clever as far as it goes 2008-09-29 19:46 oh and limited to 2 TB 2008-09-29 19:47 like ext3 2008-09-29 19:47 well 2008-09-29 19:47 maybe they have increased that to 16TB 2008-09-29 19:47 still too small by today's standards 2008-09-29 19:48 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-29 19:53 flips: trying to get my patches integrate, I fear, would be a pain 2008-09-29 19:53 if they're good developers will use them integrated or not 2008-09-29 19:53 they're for devs anyway 2008-09-29 19:54 yeah, looking at his code drives me up the fucking wall at times 2008-09-29 19:54 I mean, I'm use to it now after all of these years, but, man, it's massive mess for clean up for other developers 2008-09-29 19:54 like with the scheduler 2008-09-29 19:54 or rtmutex, et.c.. 
2008-09-29 19:55 there's a certain point where you can't really do much other than hack it in a more limited way without a mass refactorng 2008-09-29 20:50 -!- Kirantpatil(~kiran@122.167.178.24) has joined #tux3 2008-09-29 20:50 -!- Kirantpatil(~kiran@122.167.178.24) has left #tux3 2008-09-29 21:02 -!- ajonat(~ajonat@190.48.120.169) has joined #tux3 2008-09-29 22:29 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-29 22:59 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-29 23:57 -!- bobby(~bobby@nat-inn.mentorg.com) has joined #tux3 2008-09-29 23:57 heya 2008-09-29 23:59 hello all, anyone here? 2008-09-30 00:00 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-30 00:00 MaZe, hello 2008-09-30 00:07 hi pranith 2008-09-30 00:09 flips, no reply from shapor :( 2008-09-30 00:09 about the struct fields description... 2008-09-30 00:09 may be u can reply ? ;) 2008-09-30 00:09 right 2008-09-30 00:11 pranith, have you read the comment that begins: /* Leaf index format 2008-09-30 00:11 ? 2008-09-30 00:11 in dleaf.c 2008-09-30 00:11 i tried :) 2008-09-30 00:12 the header contains the two level index followed by the table of extents 2008-09-30 00:13 i dint understand the limit on number of versions at the same level 2008-09-30 00:13 dinner time for me 2008-09-30 00:14 ohkies 2008-09-30 00:14 it's simple: you can't have more than 255 entries in one group, therefore can't have more than 255 entries with the same logical address 2008-09-30 00:14 well 2008-09-30 00:14 actually that is probably wrong now 2008-09-30 00:15 hmm, anything changed? 2008-09-30 00:15 you can have multiple dleaf groups with the same logical address now I think 2008-09-30 00:15 sure, lots of code changed 2008-09-30 00:15 every day 2008-09-30 00:15 later... 
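The 255-entry limit follows from the one-byte group count field mentioned earlier in the log. A rough model of the consequence, assuming only that a group's count must fit in an unsigned byte; split_into_groups is hypothetical and says nothing about the real dleaf layout.

```python
MAX_GROUP_COUNT = 0xFF  # largest value a one-byte count field can hold


def split_into_groups(entries, limit=MAX_GROUP_COUNT):
    """Chunk a list of entries so that no group exceeds the one-byte limit."""
    return [entries[i:i + limit] for i in range(0, len(entries), limit)]


groups = split_into_groups(list(range(600)))
print([len(g) for g in groups])
```

Once entries can spill into a further group, as the log suggests is now allowed, the per-group limit no longer caps the total number of entries.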
2008-09-30 00:15 okies 2008-09-30 01:59 folks 2008-09-30 01:59 ACTION is back from a night of goofing off 2008-09-30 01:59 feels great 2008-09-30 01:59 I was working myself into the ground much of last week 2008-09-30 02:14 -!- ajonat(~ajonat@190.48.120.169) has joined #tux3 2008-09-30 02:14 -!- orgthingy(~orgthingy@62.150.55.188) has joined #tux3 2008-09-30 02:14 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-09-30 02:14 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2008-09-30 02:14 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-09-30 02:14 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-09-30 02:14 -!- flips(~phillips@phunq.net) has joined #tux3 2008-09-30 02:14 -!- ceatinge(~ceatinge@72.232.13.50) has joined #tux3 2008-09-30 02:14 -!- nataliep(~nataliep@207.47.98.129.static.nextweb.net) has joined #tux3 2008-09-30 02:14 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-09-30 02:14 -!- Bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-09-30 02:14 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2008-09-30 02:14 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2008-09-30 02:14 -!- ChanServ changed mode/#tux3 -> -o tux3bot 2008-09-30 02:31 -!- kbingham(~kbingham@193.132.141.186) has joined #tux3 2008-09-30 03:01 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-09-30 03:26 -!- Kirantpatil(~kiran@122.166.94.37) has joined #tux3 2008-09-30 03:26 hello 2008-09-30 03:26 ah, what a great night ^______^ 2008-09-30 03:35 -!- Kirantpatil(~kiran@122.166.94.37) has left #tux3 2008-09-30 03:35 -!- Kirantpatil(~kiran@122.166.94.37) has joined #tux3 2008-09-30 03:53 orgthingy, where from/ 2008-09-30 03:53 ? 2008-09-30 03:54 why everybody keeps asking where im from xD 2008-09-30 04:10 why? 
you have a problem in that/ 2008-09-30 04:17 pranith : well, not really 2008-09-30 04:17 but i dont usually say any personal info in IRC :P 2008-09-30 04:18 ohk 2008-09-30 04:18 i dint knw that ones country is personal 2008-09-30 04:20 hi pranith 2008-09-30 04:21 Kirantpatil, helo 2008-09-30 04:21 is the session over today morning ? 2008-09-30 04:22 again i failed join .. 2008-09-30 04:22 TUX 3 university session. 2008-09-30 04:23 hmm 2008-09-30 04:23 yeah, morning is a bad time for people here 2008-09-30 04:23 i too miss it 2008-09-30 04:23 usually get it from the logs 2008-09-30 04:23 i am from bengaluru.. 2008-09-30 04:24 how about you.. 2008-09-30 04:24 delhi 2008-09-30 04:24 oh cool.. 2008-09-30 04:24 what do u do? 2008-09-30 04:24 i run Freesoftware training center.. 2008-09-30 04:24 oh 2008-09-30 04:24 which place in bangy? 2008-09-30 04:25 Driver programming, Administration.. 2008-09-30 04:25 it is in rajaji nagar. 2008-09-30 04:25 hmm 2008-09-30 04:25 i knw only koramangala 2008-09-30 04:25 and marathahalli 2008-09-30 04:25 ok.. 2008-09-30 04:25 and mg road ;) 2008-09-30 04:26 i am planning for giving filesystem training. 2008-09-30 04:26 so i am preparing for it.. 2008-09-30 04:26 oh 2008-09-30 04:26 nice 2008-09-30 04:26 i have it scheduled from Nov 1. 2008-09-30 04:27 hmm 2008-09-30 04:27 too soon i guess 2008-09-30 04:27 hope for getting some contributors ... 2008-09-30 04:27 oh 2008-09-30 04:27 which level are the students here? 2008-09-30 04:28 basically they span from college grads to experienced fellows.. 2008-09-30 04:28 hmm 2008-09-30 04:28 ok 2008-09-30 04:29 my motives are to spread linux kernel programming in easy way 2008-09-30 04:30 here is our website www.turtlelinuxlabs.in 2008-09-30 04:30 nice 2008-09-30 04:31 i am still in learning phase of filesystems.. 2008-09-30 04:32 hmm 2008-09-30 04:32 i tried to apply the patch of daniels posted in lwn.net and compile the kernel.. 2008-09-30 04:32 it was showing some error. 
2008-09-30 04:34 give me some guidelines on this.. 2008-09-30 04:37 which patch? 2008-09-30 04:37 tux3 is not yet in kernel 2008-09-30 04:38 its still in userspace in fuse 2008-09-30 04:38 you dont need to compile the kernel to test this 2008-09-30 04:38 just use the fuse version 2008-09-30 04:39 please see this http://lwn.net/Articles/299740/ 2008-09-30 04:39 i followed that link.. 2008-09-30 04:41 hmm, i think i missed that 2008-09-30 04:41 :( 2008-09-30 04:41 what shall i do then.. 2008-09-30 04:41 flips, why dont u cc to tux3?? 2008-09-30 04:42 im not sure.. 2008-09-30 04:42 ive never compiled this in a kernel before.. 2008-09-30 04:43 where can i get the fuse version.. 2008-09-30 04:48 am i doing right here.. 2008-09-30 04:51 hmm 2008-09-30 04:51 use the mercurial repo 2008-09-30 04:52 hg pull http://phunq.net/tux3 2008-09-30 04:52 install mercurial 2008-09-30 04:53 ok, i will try that 2008-09-30 04:57 then cd tux3/user/test 2008-09-30 04:57 make && make debug 2008-09-30 04:58 it will mount in /tmp/ 2008-09-30 05:04 thanks pranith. 2008-09-30 05:12 welcome :) 2008-09-30 05:13 i am getting some errors, shall i paste it here 2008-09-30 05:13 i am using ubuntu gibbon. 2008-09-30 05:15 tux3.c:14:18: error: popt.h: No such file or directory 2008-09-30 05:16 sudo apt-get install libpopt-dev 2008-09-30 05:34 Kirantpatil, worked? 2008-09-30 05:42 yes it worked. 2008-09-30 05:43 i am just execting sudo make testfuse 2008-09-30 05:50 ok.. i played with testfuse and testfs 2008-09-30 05:50 they are working fine.. 2008-09-30 05:51 next what should i do ?? 2008-09-30 05:51 work with dleaf and dleaftest 2008-09-30 05:51 make dleaf 2008-09-30 05:51 make dleaftest 2008-09-30 05:51 ./dleaf 2008-09-30 05:51 there is a bug with testfuse 2008-09-30 05:51 in readdir... 
2008-09-30 05:52 ls 2008-09-30 05:52 touch hello 2008-09-30 05:52 ls 2008-09-30 05:52 rm hello 2008-09-30 05:52 ls 2008-09-30 05:53 i am now installing valgrind 2008-09-30 05:54 no need 2008-09-30 05:54 u can run it directly 2008-09-30 05:54 ./dleaf 2008-09-30 05:54 ok.. 2008-09-30 05:57 i did run ./dleaf 2008-09-30 05:57 it is showing lot of dwalk messages.. 2008-09-30 05:58 i didnt understand where should i do "touch hello" "ls" and "rm hello" 2008-09-30 06:00 make debugfs 2008-09-30 06:00 "make debug" 2008-09-30 06:00 go to /tmp/test 2008-09-30 06:00 them do touch and rm then ls 2008-09-30 06:03 yeah 2008-09-30 06:03 root@kiran-desktop:/tmp/test# ls 2008-09-30 06:03 ???@???? 2008-09-30 06:04 this is how looks after "rm hello" 2008-09-30 06:04 mode 0100666 uid 0 gid 0 root d:1 2008-09-30 06:04 ---- get attr for '/' ---- 2008-09-30 06:04 ---- get attr for '/' ---- 2008-09-30 06:04 ---- readdir '/' at 0 ---- 2008-09-30 06:04 ---- get attr for '� 2008-09-30 06:04 @�' ---- 2008-09-30 06:04 ---- get attr for '/� 2008-09-30 06:04 @�' ---- 2008-09-30 06:04 ---- readdir '/' at 1000 ---- 2008-09-30 06:05 in debug message. 2008-09-30 06:10 then i think i need to understand the code .. 2008-09-30 06:10 am i right.. 2008-09-30 06:13 yup 2008-09-30 06:13 u need to.. 2008-09-30 06:13 something wrong in readdir 2008-09-30 06:13 i dint look further 2008-09-30 06:49 -!- Kirantpatil(~kiran@122.166.94.37) has left #tux3 2008-09-30 07:33 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-09-30 07:41 flips, there? 
2008-09-30 07:47 -!- pgquiles__(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-09-30 09:21 -!- orgthingy(~orgthingy@62.150.55.188) has joined #tux3 2008-09-30 09:27 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-30 09:38 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-30 09:49 -!- Kirantpatil(~kiran@122.167.222.94) has joined #tux3 2008-09-30 09:49 -!- Kirantpatil(~kiran@122.167.222.94) has left #tux3 2008-09-30 09:50 -!- Kirantpatil(~kiran@122.167.222.94) has joined #tux3 2008-09-30 09:50 -!- Kirantpatil(~kiran@122.167.222.94) has left #tux3 2008-09-30 09:52 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-30 10:28 -!- Kirantpatil(~kiran@122.167.222.94) has joined #tux3 2008-09-30 10:28 -!- Kirantpatil(~kiran@122.167.222.94) has left #tux3 2008-09-30 10:48 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-30 10:51 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-30 12:01 -!- orgthingy(~orgthingy@62.150.55.188) has joined #tux3 2008-09-30 12:43 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-30 17:10 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-09-30 18:58 -!- ajonat(~ajonat@190.48.107.189) has joined #tux3 2008-09-30 19:32 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-30 19:52 19:52:21 2008-09-30 19:54 that's true 2008-09-30 19:54 ah, and I missed the ping from pranith too 2008-09-30 19:54 yesterday 2008-09-30 19:55 I guess I'd better fix the readdir bug in fuse 2008-09-30 19:55 specially as a provisional fix has been offered 2008-09-30 19:55 -!- ajonat_(~ajonat@190.48.122.185) has joined #tux3 2008-09-30 19:59 t -30 & counting 2008-09-30 20:00 t -> tux3 2008-09-30 20:00 browsers running? 
2008-09-30 20:00 mayhaps 2008-09-30 20:01 we start here: http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2040 2008-09-30 20:01 or maybe we should start from where this is called in _copy2 2008-09-30 20:01 _2copy 2008-09-30 20:01 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2106 2008-09-30 20:02 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-30 20:02 razvanm: http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2106 2008-09-30 20:02 -!- RalucaM(~ral@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-30 20:02 ACTION is sorry that he is late 2008-09-30 20:02 flips: razvanm: http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2106 2008-09-30 20:02 ACTION is glad you're here 2008-09-30 20:02 page = __grab_cache_page(mapping, index); 2008-09-30 20:03 ok, we're getting a cache page that user data will be copied onto 2008-09-30 20:03 later that page will be added to a bio and thrown at a device 2008-09-30 20:03 but today we're just going to look at the page cache 2008-09-30 20:04 that is, the list of pages belonging to a particular inode that have been read in via some buffer IO operation 2008-09-30 20:04 or directly created, as here 2008-09-30 20:04 since we know we're going to write to this page, normally the entire thing, there is no need to read it first 2008-09-30 20:05 we just "grab" it, and by that, viro means look into the cache and allocated a page if one is not already there 2008-09-30 20:06 so lets got to http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2040 and see how it works 2008-09-30 20:06 quick q: this "grab" is unique to this case? 
2008-09-30 20:06 pretty much 2008-09-30 20:06 page cache ops are highly non-orthogonal 2008-09-30 20:06 there may or may not be justification for that 2008-09-30 20:07 they just kind of grew from usage, like most of linux 2008-09-30 20:07 and the comment claims it's just for buffered writes 2008-09-30 20:07 possibly true 2008-09-30 20:07 should we take a look at what mapping and index? :P 2008-09-30 20:08 struct address_space * mapping 2008-09-30 20:08 we'll be looking at those, yes 2008-09-30 20:08 aaa... this was in some time ago 2008-09-30 20:08 that's what this is all about 2008-09-30 20:08 pgoff_t index; 2008-09-30 20:08 index is for practical purposes unsigned int 2008-09-30 20:08 that means 32 bit on 32 arches 2008-09-30 20:08 interesting that pgoff_t seems to be a page offset, but it would make sense to be a file offset div page_size ? 2008-09-30 20:09 maybe just a misnomer 2008-09-30 20:09 limiting the size of any file to 2^(32 + 12) 2008-09-30 20:09 oh, as in offset into file in pages 2008-09-30 20:09 exactly 2008-09-30 20:10 bad terminology 2008-09-30 20:10 -!- Kirantpatil(~kiran@122.167.219.78) has joined #tux3 2008-09-30 20:10 does this mean files can't be larger than 16TB? 2008-09-30 20:10 (on 32-bit arch) 2008-09-30 20:10 yes 2008-09-30 20:10 that's where that comes from 2008-09-30 20:10 volumes too 2008-09-30 20:10 does any linux filesystem workaround this somehow? 2008-09-30 20:10 volumes? 2008-09-30 20:10 because each volume has a page cache dedicated to non-file pages on the volume, that is, metadata 2008-09-30 20:11 there is no workaround 2008-09-30 20:11 "speed of sound in a 32 bit vacuum" 2008-09-30 20:11 :p 2008-09-30 20:11 :D 2008-09-30 20:11 ok, what does the index index? 2008-09-30 20:11 A: a radix tree 2008-09-30 20:12 let's drill down into find_lock_page, which is used in more than one place thankfully 2008-09-30 20:12 index = pos >> PAGE_CACHE_SHIFT; 2008-09-30 20:12 so how does tux3 scale beyond this? 2008-09-30 20:12 it doesn't? 
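The index calculation and the 16TB ceiling discussed above can be checked with a few lines of arithmetic. This sketch assumes 4 KiB pages (a shift of 12, as in the 2^(32 + 12) figure) and a 32-bit page index; the function names are illustrative, not kernel APIs.

```python
PAGE_CACHE_SHIFT = 12   # assumes 4 KiB pages
PGOFF_BITS = 32         # assumed width of the page index on a 32-bit arch


def page_index(pos):
    """Index of the page cache page that holds byte offset pos."""
    return pos >> PAGE_CACHE_SHIFT


def max_file_bytes(pgoff_bits=PGOFF_BITS, shift=PAGE_CACHE_SHIFT):
    """Bytes addressable through a page index of the given width."""
    return 1 << (pgoff_bits + shift)


print(page_index(8191), page_index(8192))
print(max_file_bytes() // 2**40, "TiB")
```

Byte 8191 is the last byte of page 1 and byte 8192 is the first byte of page 2; the addressable size works out to 2^44 bytes.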
2008-09-30 20:12 razvanm, good point 2008-09-30 20:13 razvanm, you will see code like that in tux3.c 2008-09-30 20:13 it does not on 32 bit 2008-09-30 20:13 fact of life 2008-09-30 20:13 ACTION sits down in the back row 2008-09-30 20:13 hey shapor 2008-09-30 20:13 hi flips 2008-09-30 20:13 ACTION throws some chalk 2008-09-30 20:13 always wanted to do that 2008-09-30 20:13 is it illegal yet? 2008-09-30 20:14 so you simply can't mount such a large tux3 fs on a 32-bit os? 2008-09-30 20:14 simply can't 2008-09-30 20:14 we'd better produce a nice error though 2008-09-30 20:14 because somebody will try 2008-09-30 20:14 hm 2008-09-30 20:14 to tell the truth, it would not be that big a deal to fix 2008-09-30 20:14 somebody who wants to is welcome 2008-09-30 20:15 pretty easy hack for a great deal of fame 2008-09-30 20:15 ACTION listens for the thundering herd of volunteers 2008-09-30 20:15 ok, let's go to find_lock_page 2008-09-30 20:15 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L661 2008-09-30 20:16 takes a mapping and index 2008-09-30 20:16 mapping is what tux3 calls "map" 2008-09-30 20:16 I called it map because that saved me about 80,000 keystrokes over the life of the project 2008-09-30 20:17 how does a page become a pagecache page? 2008-09-30 20:17 also, a tux3 userspace map maps blocks, whereas linux page cache maps pages 2008-09-30 20:17 shapor, we're looking at that right now 2008-09-30 20:17 somewhere in here we will find an alloc_pages(order 1) 2008-09-30 20:17 order 0 I mean 2008-09-30 20:18 first thing we do is try to find it already in the radix tree, but let's skip that and find out what happens when it's not there 2008-09-30 20:18 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-30 20:18 nothing happens :P 2008-09-30 20:18 exactly 2008-09-30 20:18 this function does not allocate pages 2008-09-30 20:18 alloc_pages(order n) allocates 2^n pages, with linear physical addresses? 
2008-09-30 20:19 Returns zero if the page was not present. 2008-09-30 20:19 ok, let's go back up to _2copy and find out where the page is really alloced 2008-09-30 20:19 if we don't find it here 2008-09-30 20:20 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2049 2008-09-30 20:20 status = -ENOMEM; 2008-09-30 20:20 break 2008-09-30 20:20 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2107 2008-09-30 20:20 2049 page = page_cache_alloc(mapping); 2008-09-30 20:21 8-) 2008-09-30 20:21 :D 2008-09-30 20:21 better ;-) 2008-09-30 20:21 right 2008-09-30 20:21 knew it was in there ;) 2008-09-30 20:22 static inline struct page *page_cache_alloc(struct address_space *x) { return __page_cache_alloc(mapping_gfp_mask(x)); } 2008-09-30 20:22 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L500 2008-09-30 20:22 which is just a call to alloc_pages 2008-09-30 20:22 as promised 2008-09-30 20:22 return alloc_pages(gfp, 0); 2008-09-30 20:22 what is the point of page_cache_alloc? 2008-09-30 20:23 some new bs about mapping_gfp_mask 2008-09-30 20:23 calling __page_cache_alloc 2008-09-30 20:23 if (cpuset_do_page_mem_spread()) { 2008-09-30 20:23 for numa 2008-09-30 20:23 shapor, probably little point if you really dig 2008-09-30 20:23 lots of accumulated cruft in there 2008-09-30 20:23 why grabbing fails if adding to the lru fails? 2008-09-30 20:23 it's basically a numa-diverse alloc_pages(0) 2008-09-30 20:24 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2057 2008-09-30 20:24 goto repeat; 2008-09-30 20:24 notice it shouldn't fail 2008-09-30 20:24 razvanm, because that got a lot more complex recently 2008-09-30 20:24 let's take a look at it 2008-09-30 20:25 getting well outside the scope of vfs 2008-09-30 20:25 I guess there must be a reason why the page must be in the lru. Is there an obvious one? 
:P 2008-09-30 20:25 I think every page must be 2008-09-30 20:25 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L459 <- :P 2008-09-30 20:25 otherwise how do you know what to flush on memory low condition? 2008-09-30 20:25 it's about reverse mapping 2008-09-30 20:26 lots of complexity has been added to optimize it 2008-09-30 20:26 oh sorry 2008-09-30 20:26 I was blathering 2008-09-30 20:26 O:-) 2008-09-30 20:26 truth is, it's just a wrapper for the radix tree insert 2008-09-30 20:26 MaZe: to deal with the low memory you need some of the pages to be there :P 2008-09-30 20:27 add to page cache should never, ever fail 2008-09-30 20:27 but it could 2008-09-30 20:27 RazvanM: I think that's with extremely low memory conditions 2008-09-30 20:27 if it does fail we are in deep doodoo 2008-09-30 20:27 :D 2008-09-30 20:27 flips: yeah i see it gets called a few times in that file 2008-09-30 20:27 not jsut extremely low, bug buggy in the kernel bug sense 2008-09-30 20:27 the page could already be there, probably we're not fully locked against smp, and thus could potentially hit this on 2 cpus 2008-09-30 20:27 shapor, yes, this is the main interface to the page cache 2008-09-30 20:28 maze, we are fully locked against smp 2008-09-30 20:28 necessarily 2008-09-30 20:28 so where does the EEXIST check come from? 2008-09-30 20:28 write_lock_irq does that, and turns off interrupts for good measure 2008-09-30 20:29 if the page is already there, somebody needs to tell us 2008-09-30 20:29 MaZe: the page is already in lru, right? 2008-09-30 20:29 not lru 2008-09-30 20:29 radix tree 2008-09-30 20:29 badly named function here 2008-09-30 20:29 very bad 2008-09-30 20:29 it means "add to page cache and also to lru" 2008-09-30 20:29 not add to lru 2008-09-30 20:30 aaaa 2008-09-30 20:30 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L490 2008-09-30 20:30 right, only adding to cache can fail 2008-09-30 20:30 got a rul for the EEXIST test? 2008-09-30 20:30 url? 
2008-09-30 20:30 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2055 2008-09-30 20:31 mem_cgroup_uncharge_page <- wow, 75 cent name 2008-09-30 20:31 shapor, this should get a rise out of the hotrodder in you 2008-09-30 20:31 memory accounting for containerization 2008-09-30 20:31 hah 2008-09-30 20:32 nitro chardged pages 2008-09-30 20:32 maze, thanks 2008-09-30 20:32 for? 2008-09-30 20:32 for the comment re containers 2008-09-30 20:32 oh. 2008-09-30 20:32 explains why I haven't seen the beast before 2008-09-30 20:32 crappy name 2008-09-30 20:32 therefore fits nicely ;) 2008-09-30 20:33 cgroup is the containers stuff, both cpu and mem 2008-09-30 20:33 uncharge must be in the release page path, and charge in the alloc path 2008-09-30 20:33 yeah this unfortunately all gets pretty complex 2008-09-30 20:33 because we're supporting numa and containers 2008-09-30 20:34 why does this matter for tux3 2008-09-30 20:34 all right, the EEXIST is about what happens if somebody adds the page while we are waiting to acquire the radix tree lock 2008-09-30 20:34 ok? 2008-09-30 20:34 i thoguht we were intentially avoiding the vm 2008-09-30 20:34 page cache is vfs, not vm 2008-09-30 20:34 numa = non-uniform memory access machines (multi-socket machines) and containers (good for jails/vms/isolating users/apps, etc...) 2008-09-30 20:34 flips: right, hence my comment about not having all the locks in smp 2008-09-30 20:35 shapor, we only need to know to recognize what is mm and therefore can be ignored ;) 2008-09-30 20:35 ok 2008-09-30 20:35 maze, right then 2008-09-30 20:35 as usual ;) 2008-09-30 20:35 you get to run the next class ;) 2008-09-30 20:35 well 2008-09-30 20:35 no... 2008-09-30 20:35 got to wait and see what you hack next 2008-09-30 20:35 ACTION runs away... 2008-09-30 20:35 flips: did you feel the mini quake a while ago? 
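The lookup-then-insert pattern discussed above (look in the radix tree, allocate if missing, retry if the insert reports EEXIST) can be sketched in userspace roughly as follows. This is a toy model, not the kernel code: `toy_mapping`, `toy_find`, `toy_add` and `toy_grab` are hypothetical names, a fixed array stands in for the radix tree, and there is no locking, so the EEXIST branch only illustrates where a lost race would be handled.

```c
#include <errno.h>
#include <stdlib.h>

#define TOY_SLOTS 16    /* hypothetical: a fixed array stands in for the radix tree */

struct toy_page { unsigned long index; };
struct toy_mapping { struct toy_page *slot[TOY_SLOTS]; };

/* Lookup only: like find_lock_page in the discussion, this never allocates. */
struct toy_page *toy_find(struct toy_mapping *map, unsigned long index)
{
    return index < TOY_SLOTS ? map->slot[index] : NULL;
}

/* Insert a page; reports -EEXIST if somebody else got there first. */
int toy_add(struct toy_mapping *map, struct toy_page *page, unsigned long index)
{
    if (index >= TOY_SLOTS)
        return -EINVAL;
    if (map->slot[index])
        return -EEXIST;
    page->index = index;
    map->slot[index] = page;
    return 0;
}

/* Find the page or create it, retrying the lookup if the insert lost a race. */
struct toy_page *toy_grab(struct toy_mapping *map, unsigned long index)
{
    for (;;) {
        struct toy_page *page = toy_find(map, index);
        if (page)
            return page;

        page = calloc(1, sizeof *page);     /* stands in for page_cache_alloc */
        if (!page)
            return NULL;                    /* the -ENOMEM case */

        int err = toy_add(map, page, index);
        if (!err)
            return page;

        free(page);
        if (err != -EEXIST)
            return NULL;
        /* -EEXIST: another task added the page meanwhile, so look it up again */
    }
}
```

Grabbing the same index twice returns the same page, and a direct second insert at that index gives -EEXIST, which is the case the retry loop exists for.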
2008-09-30 20:35 shapor, no, missed it 2008-09-30 20:35 didn't feel anything up here 2008-09-30 20:36 we had a great one a month or two ago 2008-09-30 20:36 got the familly up and huddled under a door jamb 2008-09-30 20:36 anyway 2008-09-30 20:36 life in paradise 2008-09-30 20:36 right 2008-09-30 20:36 duck'n'cover 2008-09-30 20:36 ;-) 2008-09-30 20:36 where were we 2008-09-30 20:37 we've nearly done everything interesting in there 2008-09-30 20:37 yes, we did get sidetracked a little. 2008-09-30 20:37 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2055 2008-09-30 20:37 sorry ;) 2008-09-30 20:37 details of radix tree aren't that interesting 2008-09-30 20:37 could we have a little more info on mapping? 2008-09-30 20:37 the channel topic says "and friends" 2008-09-30 20:37 ok, question? 2008-09-30 20:37 who fills it in and what it is for? 2008-09-30 20:37 who fills in the mapping? 2008-09-30 20:37 2066 struct address_space *mapping = file->f_mapping; 2008-09-30 20:38 so it comes from the file, so the vfs? 2008-09-30 20:38 it's just the per-inode page cache 2008-09-30 20:38 vfs usually 2008-09-30 20:38 though filesystem can too, and some do 2008-09-30 20:38 the fs has access to the whole misshapen page cache api 2008-09-30 20:38 for better or worse 2008-09-30 20:38 you will see all the functions are EXPORT()ed 2008-09-30 20:39 not even _GPL 2008-09-30 20:39 you can write evil/fringer binary modules that use this interface 2008-09-30 20:39 fringe 2008-09-30 20:39 ok, did we do who fills it in enough? 
2008-09-30 20:40 probably 2008-09-30 20:40 still don't get it ;-) but nevermind 2008-09-30 20:40 then we didn't 2008-09-30 20:40 http://lxr.linux.no/linux+v2.6.26.5/include/linux/fs.h#L499 2008-09-30 20:40 the mapping _is_ the page cache 2008-09-30 20:40 is where the struct is defined 2008-09-30 20:41 so the page cache 2008-09-30 20:41 it's basically one to one with file inodes 2008-09-30 20:41 is actually not a page cache, but rather a page cache per inode per superblock 2008-09-30 20:41 exactly 2008-09-30 20:41 just per inode 2008-09-30 20:41 as in not _a_ but _one per_ 2008-09-30 20:41 one per inode 2008-09-30 20:41 actually 2008-09-30 20:41 one per file-backed inode 2008-09-30 20:42 okay, it's always talked of as if it was one beast 2008-09-30 20:42 to be precise 2008-09-30 20:42 yes, that's just sloppy 2008-09-30 20:42 non file-backed inode being non-file/dir stuff? (sockets, pipes, symlinks, fifos, devs?) 2008-09-30 20:42 while we're looking at struct address_space (which really should have been called struct mapping) 2008-09-30 20:42 let's look at some of the fields there 2008-09-30 20:43 device inode, socket, etc 2008-09-30 20:43 so 4k per inode? 2008-09-30 20:43 everything is an inode, and not all have caches 2008-09-30 20:43 er at least 2008-09-30 20:43 at least 2008-09-30 20:43 why 4k? 2008-09-30 20:43 inodes are big bloating things, especially when they have all their decorations attached 2008-09-30 20:44 shapor, which 4k were you referring to? 2008-09-30 20:44 at least one page gets allocated, right? 2008-09-30 20:44 size of a page 2008-09-30 20:44 for what? 2008-09-30 20:44 for the struct? 2008-09-30 20:44 shapor, probably not 2008-09-30 20:44 I'm sure we use a slab-cache of some sort? 2008-09-30 20:45 some inodes could lack any page cache, right? 
2008-09-30 20:45 ok i read what flips said wrong 2008-09-30 20:45 shapor, look at the inode, you will see there's an address space embedded right in it 2008-09-30 20:45 kind of confusing 2008-09-30 20:45 hmm 2008-09-30 20:45 where is that defined/ 2008-09-30 20:46 624 struct address_space i_data; 2008-09-30 20:46 http://lxr.linux.no/linux+v2.6.26.5/include/linux/fs.h#L623 2008-09-30 20:46 http://lxr.linux.no/linux+v2.6.26.5/include/linux/fs.h#L624 2008-09-30 20:46 yeah i found it 2008-09-30 20:46 cat /proc/slabinfo - 3rd number means objsize 2008-09-30 20:46 radix_tree_node 21273 21294 560 14 2 : tunables 0 0 0 : slabdata 1521 1521 0 2008-09-30 20:46 bdev_cache 43 63 768 21 4 : tunables 0 0 0 : slabdata 3 3 0 2008-09-30 20:46 sysfs_dir_cache 11921 12189 80 51 1 : tunables 0 0 0 : slabdata 239 239 0 2008-09-30 20:46 inode_cache 4952 4970 568 14 2 : tunables 0 0 0 : slabdata 355 355 0 2008-09-30 20:46 dentry 486172 486172 208 19 1 : tunables 0 0 0 : slabdata 25588 25588 0 2008-09-30 20:47 razvanm, that's the pointer to it, which points into the address space itself, immediately after 2008-09-30 20:47 why the mapping has to be a pointer isn't clear to me 2008-09-30 20:47 probably bogus 2008-09-30 20:47 flips: that was what I about to ask :P 2008-09-30 20:47 razvanm, it's the homework assignment then, to find out by thursday 2008-09-30 20:47 in case you don't want that one? 2008-09-30 20:48 maze, sure, but what use case? 2008-09-30 20:48 share across many inodes? 2008-09-30 20:48 I suspect a highly dogdy one 2008-09-30 20:48 like cow 2008-09-30 20:48 aaa... hard links? :P 2008-09-30 20:48 nah, hard links -> same inode 2008-09-30 20:48 never assume that what we do in kernel actually makes sense ;) 2008-09-30 20:48 soft link? 
:-) 2008-09-30 20:48 often things are the way they are just because they are 2008-09-30 20:48 and that is eventually proven when somebody rips it out and changes it completely 2008-09-30 20:49 616 spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */ 2008-09-30 20:49 how can a lock maybe protect a field? 2008-09-30 20:49 maybe not everybody follow the rules? 2008-09-30 20:49 there's a lot of weirdness around i_size locking wise 2008-09-30 20:50 maze, likely more bogosity, and the cause of thousands of hours worth of bug chasing the last few years 2008-09-30 20:50 anyway, so our inode, including the mapping, seems to use about 570 bytes 2008-09-30 20:50 jsut for starters 2008-09-30 20:50 then you get a couple of pages linked to it... dentries... it gets bloaty 2008-09-30 20:50 inode_cache 162 264 340 11 1 : tunables 54 27 8 : slabdata 24 24 0 2008-09-30 20:51 dentry per hardlink to the inode right? 2008-09-30 20:51 per path element used to open the inode 2008-09-30 20:51 + radix_tree_nodes (and actual pages) for the parts that are in memory? 2008-09-30 20:51 right, but those higher levels are shared 2008-09-30 20:51 n directory names and one file name 2008-09-30 20:51 maybe 2008-09-30 20:52 ? 2008-09-30 20:52 not necessarily 2008-09-30 20:52 except for root 2008-09-30 20:52 you can easily have lots of very unshared paths 2008-09-30 20:52 like in a java class tree 2008-09-30 20:52 bushy 2008-09-30 20:52 in what sense not necessarily, as in there may not be any other files using the same prefix? 2008-09-30 20:52 right 2008-09-30 20:52 or we can have the same prefix and still not share? 2008-09-30 20:52 ah, ok 2008-09-30 20:52 right, of course, then 2008-09-30 20:53 but dentries are pretty small (200 bytes) 2008-09-30 20:53 just by way of showing that the average pinned cache per inode can be quite large 2008-09-30 20:53 you also typically have a struct file 2008-09-30 20:53 if its open 2008-09-30 20:53 right? 
2008-09-30 20:53 so file->dentry->inode->pages 2008-09-30 20:53 yes 2008-09-30 20:54 all the dentries up to the root and the inode, and the radix tree have to be in ram, as long as we are using the page cache from that inode, right? 2008-09-30 20:54 destroyed when closed always? 2008-09-30 20:54 when stuff is just hanging in cache you have dentry->inode->pages 2008-09-30 20:54 that's with a closed file? 2008-09-30 20:54 struct file always destroyed on close, dentry not 2008-09-30 20:54 yes 2008-09-30 20:55 right, but eventually it would get evicted, since the close would flush? 2008-09-30 20:55 maze, close doesn't flush 2008-09-30 20:55 only umount evicts like that 2008-09-30 20:55 really? 2008-09-30 20:55 well 2008-09-30 20:55 close does not evict 2008-09-30 20:55 I thought close was guaranteed to give you back any error messages 2008-09-30 20:55 or flush actually 2008-09-30 20:55 in case you were to run out of disk, etc 2008-09-30 20:56 and thus close had to wait for a flush? 2008-09-30 20:56 hmm 2008-09-30 20:56 doubt that 2008-09-30 20:56 if close was equivalent to fsync performance would tank 2008-09-30 20:57 so close may be defined that way, the filesystem does not have to implement it that way 2008-09-30 20:57 hm but if dentries get purged from you dont lose the cache right? 2008-09-30 20:57 man 2 open 2008-09-30 20:57 It is quite possible that errors on a pre- 2008-09-30 20:57 vious write(2) operation are first reported at the final close(). Not 2008-09-30 20:57 checking the return value when closing the file may lead to silent loss 2008-09-30 20:57 of data. This can especially be observed with NFS and with disk quota. 2008-09-30 20:57 A successful close does not guarantee that the data has been success- 2008-09-30 20:57 fully saved to disk, as the kernel defers writes. It is not common for 2008-09-30 20:57 a filesystem to flush the buffers when the stream is closed. If you 2008-09-30 20:57 need to be sure that the data is physically stored use fsync(2). 
(It 2008-09-30 20:57 will depend on the disk hardware at this point.) 2008-09-30 20:57 shapor, dentries stay around as long as the cache does 2008-09-30 20:57 so, close can return errors, but it still doesn't flush unless you fsync 2008-09-30 20:57 right 2008-09-30 20:58 cute 2008-09-30 20:58 It is probably unwise to close file descriptors while they may be in 2008-09-30 20:58 use by system calls in other threads in the same process. Since a file 2008-09-30 20:58 descriptor may be re-used, there are some obscure race conditions that 2008-09-30 20:58 may cause unintended side effects. 2008-09-30 20:58 two minutes 2008-09-30 20:59 my girl has decided daddy needs to play with her 2008-09-30 20:59 :-) 2008-09-30 20:59 she doesn't like the linux kernel (yet)? 2008-09-30 20:59 maze, we have fixed most of those races 2008-09-30 20:59 couple were fixed this year 2008-09-30 20:59 not yet 2008-09-30 20:59 I like the sound of confidence there... 2008-09-30 20:59 working on it 2008-09-30 20:59 ...most... 2008-09-30 20:59 file table is a nasty thing 2008-09-30 20:59 race wise 2008-09-30 21:00 but yes, the known holes are closed now 2008-09-30 21:00 remember I told you fget_light was the most perverse function in the kernel? 
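The close/fsync point above (close can report errors but does not flush; fsync is what asks for the data to reach the disk) looks roughly like this from userspace. A minimal sketch using only POSIX calls; `write_durably` is a hypothetical helper name, and as the man page excerpt says, what happens past fsync still depends on the disk hardware.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Write len bytes to path, fsync, then close, checking every return value.
 * Returns 0 on success or a negative errno value.
 */
int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -errno;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before writing: try again */
            int err = errno;
            close(fd);
            return -err;
        }
        if (n == 0) {               /* no progress: give up rather than spin */
            close(fd);
            return -EIO;
        }
        p += n;
        len -= (size_t)n;
    }

    /* close() alone does not flush; fsync() is the explicit request. */
    if (fsync(fd) < 0) {
        int err = errno;
        close(fd);
        return -err;
    }

    /* Errors from earlier writes may first show up here (NFS, quota). */
    if (close(fd) < 0)
        return -errno;
    return 0;
}
```

Dropping the fsync call would still give a successful close in the common case, which is exactly why a successful close says nothing about the data being on disk.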
2008-09-30 21:00 interesting that closing an fd drops locks on the file even if you have duped it to another fd 2008-09-30 21:01 indeed 2008-09-30 21:01 second homework is to find out why 2008-09-30 21:01 ok, first home work was why both ptr and struct address_space (*i_mapping and i_data) in struct inode 2008-09-30 21:01 right 2008-09-30 21:02 well we did grab_cache_page pretty well, did not get to the friends 2008-09-30 21:02 gives a starting point for next time if we want 2008-09-30 21:07 can lxr do regexp search 2008-09-30 21:17 coda, raw, bdev 2008-09-30 21:17 quantum electro-dynamics 2008-09-30 21:18 ACTION goes to bed 2008-09-30 21:19 lol 2008-09-30 21:23 -!- Kirantpatil(~kiran@122.167.219.78) has left #tux3 2008-09-30 21:27 yeah, I typed /nick instead of /me in '/nick says thanks for the lesson :D' 2008-09-30 21:27 I've looked at fget_light, and it doesn't seem that scary... 2008-09-30 21:27 something's wrong with me 2008-09-30 21:27 or I'm missing the point 2008-09-30 21:28 or both ;-) 2008-09-30 21:33 linux-2.6.26.5$ egrep -rn -C 2 "[-][>]i_mapping *=" . 2008-09-30 21:47 folks 2008-09-30 21:47 hmm 2008-09-30 21:49 -!- Kirantpatil(~kiran@122.167.219.78) has joined #tux3 2008-09-30 21:50 -!- Kirantpatil(~kiran@122.167.219.78) has left #tux3 2008-09-30 21:55 maze, you haven't spotted it yet 2008-09-30 21:56 hmm? the fact it uses rcu? and doesn't always increment usage counters, nor does it always call fput? 2008-09-30 21:56 oh does it use rcu now? 2008-09-30 21:56 yeah 2008-09-30 21:57 struct files_struct *files = current->files; 2008-09-30 21:57 is protected by rcu 2008-09-30 21:57 although I'm guessing in many cases the cu is partial and not full 2008-09-30 21:57 fget itself isn't though 2008-09-30 21:58 "You can use this only if it is guranteed that the current task already 2008-09-30 21:58 fget is copied verbatim into fget_light 2008-09-30 21:58 313 * holds a refcnt to that file. 
That check has to be done at fget() only 2008-09-30 21:58 314 * and a flag is returned to be passed to the corresponding fput_light()" 2008-09-30 21:58 in other words, if the current task drops its reference... well it can't 2008-09-30 21:59 and there is no way an external observer can tell that the file is held by fget_light 2008-09-30 21:59 for starters 2008-09-30 21:59 right 2008-09-30 21:59 hence the 'doesn't always increment usage counters' 2008-09-30 21:59 but since, you're already holding the refcnt, it doesn't matter 2008-09-30 21:59 hence line 312 2008-09-30 22:00 and in cases where that doesn't work (threads), it falls back to using full fget 2008-09-30 22:00 oh wait, it does use rcu now 2008-09-30 22:00 used to be much worse 2008-09-30 22:01 wait again 2008-09-30 22:01 it uses rcu on the slow path 2008-09-30 22:02 right 2008-09-30 22:02 but the slow path is actually pretty common 2008-09-30 22:02 -> threads 2008-09-30 22:02 or anything that through clone ended up with shared fd table 2008-09-30 22:04 if the fd table isn't shared, then there is no need for locking, since it's local to this task 2008-09-30 22:04 and we're running in this tasks context 2008-09-30 22:04 /as this task/ 2008-09-30 22:05 otherwise, we need to synchronize via rcu with other tasks which share our fd table 2008-09-30 22:06 it may become shared 2008-09-30 22:06 after the fget_light 2008-09-30 22:06 nope 2008-09-30 22:06 notice the comment 2008-09-30 22:07 cannot be used if clone before fput_light 2008-09-30 22:07 315 * There must not be a cloning between an fget_light/fput_light pair. 2008-09-30 22:07 that's basically the only case were you are not allowed to use fget_light 2008-09-30 22:07 starting to see the perversity? 2008-09-30 22:07 hmm? 2008-09-30 22:07 doesn't seem perverse 2008-09-30 22:07 seems pretty clean 2008-09-30 22:08 oh... 
hmm 2008-09-30 22:08 I'm just worried that it may not be worth the effort with how multithreaded nowadays everything is getting 2008-09-30 22:09 (of course threads, don't necessarily share fd tables, but in most languages they probably do) 2008-09-30 22:09 isn't tux3 the works of satan ? 2008-09-30 22:09 I thought rcu was supposed to be pretty efficient... I wonder how much this gains in a single-thread case 2008-09-30 22:10 ACTION reads the backlog 2008-09-30 22:10 what I was thinking 2008-09-30 22:11 flips: you saw? linux-2.6.26.5$ egrep -rn -C 2 "[-][>]i_mapping *=" . --> blockdev, raw char dev, coda -> basically my guess was right 2008-09-30 22:11 ah, didn't notice you were already doing the challenge 2008-09-30 22:12 raw char dev maps on block dev, so remaps mapping to blockdevs mapping to share page cache, coda does hackery in case localfs is exported (AFAICT) 2008-09-30 22:12 and then re-imported via coda 2008-09-30 22:13 to share page cache between the codafs import and the original export 2008-09-30 22:13 at least, that's my guess 2008-09-30 22:13 how did you guess raw char dev, coda? 2008-09-30 22:14 linux-2.6.26.5$ egrep -rn -C 2 "[-][>]i_mapping *=" . 2008-09-30 22:14 ah, a computerized guess 2008-09-30 22:14 not really a guess 2008-09-30 22:14 so the short answer is: when the cache must be shared between inodes 2008-09-30 22:14 the guess was earlier, when I said multiple inodes with the same mapping 2008-09-30 22:14 but it isn't clear whether the sharing cases are valid 2008-09-30 22:15 the coda case at least isn't clear 2008-09-30 22:15 now what about the raw char dev? 2008-09-30 22:15 why should that have a cache at all? 2008-09-30 22:15 raw char dev is basically opening block dev with O_DIRECT 2008-09-30 22:15 and is the ancient way to do it 2008-09-30 22:15 oh 2008-09-30 22:15 raw dev 2008-09-30 22:15 so the raw char dev case maps in the mapping from the block dev 2008-09-30 22:16 seems wrong somehow 2008-09-30 22:16 in what sense? 
2008-09-30 22:16 why doesn't it just return the device inode? 2008-09-30 22:16 use that when you open the raw device 2008-09-30 22:16 probably because it's a raw char not a block dev 2008-09-30 22:16 and behaviour is different? 2008-09-30 22:17 so can't get there from here maybe 2008-09-30 22:17 hmm 2008-09-30 22:17 or the raw char dev should point at the device inode 2008-09-30 22:17 not at the mapping 2008-09-30 22:18 you'll note that for some reason it's not a raw block dev, but a raw char dev, so probably alignment issues and ioctls force it to have a shim-layer 2008-09-30 22:18 right, but it is not clear it can't reference the block device inode 2008-09-30 22:18 haven't looked at that thing at all 2008-09-30 22:18 some sort of ancientness, nowadays raw char devs are close to getting dropped I think 2008-09-30 22:18 always used o_direct instead 2008-09-30 22:19 exactly 2008-09-30 22:20 coda... who knows 2008-09-30 22:20 doing stacking on a vfs that wasn't designed for it is going to be fun 2008-09-30 22:24 http://lkml.org/lkml/2003/5/2/157 2008-09-30 22:24 (maze) 2008-09-30 22:33 hmm 2008-09-30 22:33 coda... 
who knows - not really 2008-09-30 22:33 it's a network filesystem with local caching and offline operation 2008-09-30 22:33 pretty obvious it needs to tie the codafs inodes with the local backing store inodes 2008-09-30 22:35 the one you found is just an inlining of fput_light 2008-09-30 22:36 or rather of the first if in it 2008-09-30 22:37 sorry, me bad 2008-09-30 22:38 earlier on in that thread 2008-09-30 22:38 and even then those comparisons are before rcu 2008-09-30 22:40 so not clear what the gain is with rcu 2008-09-30 22:40 as opposed to normal r/w locks like before 2008-09-30 23:04 maze, what is not obvious is why it can't keep references to the backing store inodes 2008-09-30 23:05 instead of sharing the inode's cache with its own inodes 2008-09-30 23:07 then there is assoc_mapping 2008-09-30 23:12 assoc_mapping is only used by sync_mapping_buffers, which is only used by brainless filesystems like ext2 2008-09-30 23:12 is used incorrectly it would seem in reiserfs 2008-09-30 23:13 ocfs2 also uses it, perhaps because mark overlooked it 2008-09-30 23:14 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-09-30 23:35 -!- Kirantpatil(~kiran@122.166.169.45) has joined #tux3 2008-09-30 23:35 -!- Kirantpatil(~kiran@122.166.169.45) has left #tux3 2008-09-30 23:39 I'm guessing it does keep reference to the backing store inodes 2008-09-30 23:39 but I'm also guessing that it wants access to the coda file to hit the same pagecache as the backing file, isn't that easiest to do by having the same pagecache, by having the i_mapping pointer point to the same location? 
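The homework answer above (i_mapping is a pointer so that several inodes can share one page cache, while i_data is the cache embedded in each inode) can be modeled in a few lines. This is a userspace toy with hypothetical `toy_` names, not the kernel structures; it only shows the pointer arrangement that the grep for `->i_mapping =` turned up for the raw char dev and coda cases.

```c
#include <stddef.h>

struct toy_inode;

struct toy_address_space {
    struct toy_inode *host;     /* inode that owns this cache */
    unsigned long nrpages;      /* pages currently cached */
};

struct toy_inode {
    struct toy_address_space *i_mapping;    /* cache actually used for I/O */
    struct toy_address_space i_data;        /* cache embedded in this inode */
};

/* Default case: the inode uses its own embedded cache. */
void toy_inode_init(struct toy_inode *inode)
{
    inode->i_data.host = inode;
    inode->i_data.nrpages = 0;
    inode->i_mapping = &inode->i_data;
}

/* Sharing case: redirect one inode's I/O to another inode's cache. */
void toy_share_mapping(struct toy_inode *upper, struct toy_inode *backing)
{
    upper->i_mapping = backing->i_mapping;
}
```

After `toy_share_mapping(&raw, &blk)`, anything cached through `raw.i_mapping` lands in `blk.i_data`, and `raw.i_data` simply goes unused. Whether the same effect could be had by keeping a reference to the backing inode instead is the open question in the discussion.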
2008-10-01 00:00 I'd think just has easy to redirect each sys_* call to a different inode 2008-10-01 00:00 well I haven't tried it 2008-10-01 00:00 have not looked nearly deep enough to know 2008-10-01 00:01 one day when there's nothing else to do 2008-10-01 00:01 the arrangement does look a little less than elegant 2008-10-01 00:02 kay, time to explain why extents just go so much work 2008-10-01 00:25 flips: is the 16 terabyte limitation for a file a problem ? 2008-10-01 00:25 I can image some situation where a DB might need a file that size or larger 2008-10-01 00:26 yes, it's uncomfortably small at a time where 1TB disks are $100 and it doubles every 18 months 2008-10-01 00:26 let alone storage arrays 2008-10-01 00:27 in six years, $100 will get you 16TB 2008-10-01 00:31 -!- Kirantpatil(~kiran@122.167.192.43) has joined #tux3 2008-10-01 00:32 http://lwn.net/Articles/194869/ <- the original patch for extents in ext4 2008-10-01 00:34 -!- bobby(~bobby@nat-inn.mentorg.com) has joined #tux3 2008-10-01 00:35 hello 2008-10-01 00:35 anyone here? 2008-10-01 00:35 flips: how's tux3 going overall ? well IYO ? 2008-10-01 00:35 hi pranith 2008-10-01 00:35 hello flips 2008-10-01 00:35 it's getting more like a filesystem every day 2008-10-01 00:35 seems like extents were a bitch to pull together, bug fixing and all 2008-10-01 00:36 indeed 2008-10-01 00:36 pranith, I will probably put the fuse ls fix in tomorrow or the next day 2008-10-01 00:36 flips, the groups table and second level entry table are stored in reverse. any reason for that? 2008-10-01 00:36 flips, you found the problem? 
2008-10-01 00:37 pranith, no but I will 2008-10-01 00:37 :) 2008-10-01 00:37 ok 2008-10-01 00:37 flips: that implementation looked pretty big and complicated 2008-10-01 00:37 pranith, the groups and entries grow down towards the extent table, therefore it is more efficient to store them in reverse 2008-10-01 00:38 that way, appending a new entry at the end does not require moving all the other entries 2008-10-01 00:38 and the computer doesn't care which way they are stored, only the poor unfortunately programmers who have to try to understand that code 2008-10-01 00:39 i dont understand what u mean by reverse here... 2008-10-01 00:39 the entries are growing from top to bottom 2008-10-01 00:39 yes, I call that reverse 2008-10-01 00:39 hmm 2008-10-01 00:40 isn't that the usual way to do it? 2008-10-01 00:40 top to bottom 2008-10-01 00:40 flips: in general, it's a good thing that all of these file systems are getting development 2008-10-01 00:40 btrfs, ext3 and tux3 2008-10-01 00:40 pranith, I don't know about usual, but btrfs does something similar 2008-10-01 00:40 flips, ohk 2008-10-01 00:40 maybe one of then can get us out of the Linux file system dark ages 2008-10-01 00:40 then=them 2008-10-01 00:40 bh, I am awed by the epic nature of the ext4 extents implementation 2008-10-01 00:41 I am sure there are good reasons for everything 2008-10-01 00:41 flips: it's huge, do you think it needs to be that way ? 2008-10-01 00:41 struct dleaf { u16 magic, free, used, groups;struct extent table[];} 2008-10-01 00:41 well, mine is about 400 lines, maybe has a 100 lines more to go 2008-10-01 00:41 so no 2008-10-01 00:41 maybe they're working around legacy issues or something with that allocator, don't know unless I look at the patch more carefully 2008-10-01 00:42 flips, magic? 
2008-10-01 00:42 the magic number is to do an in-memory check that you really have got a dleaf when you think you have 2008-10-01 00:42 ACTION takes a longer look at the patch 2008-10-01 00:42 where are the entries in the dleaf struct? 2008-10-01 00:43 I think the magic number is 0x1eaf 2008-10-01 00:43 i mean u dint use struct group and struct entry there 2008-10-01 00:43 pranith, extents at the bottom of the leaf are indexed by dictionary entries at the top of the leaf 2008-10-01 00:44 ACTION looking at the code 2008-10-01 00:44 a picture would be most helpful 2008-10-01 00:45 anybody good with graphics? 2008-10-01 00:45 flips, ohk, these dict entries are formed in dwalk_probe 2008-10-01 00:45 dwalk_probe does a lookup 2008-10-01 00:46 they are created by dwalk_pack 2008-10-01 00:46 that's for the extent code 2008-10-01 00:46 ok, looking at that 2008-10-01 00:46 the existing pointer code uses dleaf_lookup to find an entry and dleaf_resize to create space for a new one 2008-10-01 00:47 ok 2008-10-01 00:51 oh look, ext4 has an extent walk too: ext4_ext_walk_space 2008-10-01 00:53 and uses the word "gap" 2008-10-01 00:54 bh, I think the reason the ext4 extent code is so long is, it actually includes all the ext4 btree indexing code too 2008-10-01 00:55 I'm looking over the patch right now 2008-10-01 00:55 A good chunk of the complexity is with tree manipulation which you largely have already in your b-tree implementation I believe 2008-10-01 00:56 flips, leaf_free function is a bit confusing.. shouldn't that just return to_dleaf(leaf)->free?? 2008-10-01 00:56 there's an extent merge operation which I don't know yet where it's used and how 2008-10-01 00:56 pranith, no, I used some confusing field names there 2008-10-01 00:57 flips: I didn't see the b-tree stuff in there, but I saw a bunch of stuff that looked like generic tree manipulation, no b-tree 2008-10-01 00:57 flips, whats free for? and whats used for? 
2008-10-01 00:57 I'm about 2/5th through so far, nearing the halfway point 2008-10-01 00:57 dleaf->free is the offset within the dleaf of the top of the extents table, dleaf->used is offset of the bottom of the entries list 2008-10-01 00:57 bh, truee 2008-10-01 00:57 the "extent tree" does not appear to be a btree 2008-10-01 00:58 pranith, it's used to know when there is space to add a new extent to the leaf, or if the leaf has to be split 2008-10-01 00:58 hmm 2008-10-01 01:01 leaf->groups contains the number of groups currently in the leaf? 2008-10-01 01:01 yes 2008-10-01 01:02 flips: they have more operations for modifying the extents, but that's about it. It doesn't look that bad regarding complexity 2008-10-01 01:02 -!- _ajonat(~ajonat@190.48.125.115) has joined #tux3 2008-10-01 01:02 does tux3 have similar operations yet ? 2008-10-01 01:02 something like that 2008-10-01 01:02 something more powerful imho 2008-10-01 01:02 flips: I think it's just about general manipulation of file system data structures, nothing more than that 2008-10-01 01:03 looks like 2008-10-01 01:03 it's a lot of code 2008-10-01 01:03 hard to understand the structure in one reading 2008-10-01 01:03 no comments... 2008-10-01 01:03 I think it's just because of it's legacy roots with ext3 which is what this patch is aimed at work around 2008-10-01 01:03 flips: it wasn't hard, you'd be able to get through it in a concentrated day 2008-10-01 01:03 there's only one really large core routine 2008-10-01 01:04 but I got the general idea of what was going on 2008-10-01 01:05 if tux3 uses the b-tree for this already then you don't have to go through the process of "fitting" extents to the disk, etc... 2008-10-01 01:05 and other optimizations with allocation 2008-10-01 01:06 I wonder how they get away with 32 bits for the logical block number of the extent 2008-10-01 01:06 the only worry about tux3 that I have is the lack of kernel support since there's logic manipulation buffer states, ec... 
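In lieu of the picture asked for earlier, here is a rough sketch of the dleaf layout as described in the discussion: extents grow up from the header, dictionary entries grow down from the end of the block, `free` is the offset of the top of the extent table and `used` is the offset of the bottom of the entry list. Everything here is an assumption for illustration: the `toy_` types, the 8-byte extent and 4-byte entry, and the reading that the remaining space is `used - free`. Groups are not modeled, and this is not the tux3 dleaf.c code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct toy_extent { uint64_t block; };  /* hypothetical 8-byte extent */
struct toy_entry  { uint32_t key; };    /* hypothetical 4-byte dictionary entry */

struct toy_dleaf {
    uint16_t magic, free, used, groups; /* field names as quoted in the log */
    struct toy_extent table[];          /* extents start right after the header */
};

/* Allocate an empty leaf of `size` bytes: nothing between header and end. */
struct toy_dleaf *toy_dleaf_create(uint16_t size)
{
    if (size < sizeof(struct toy_dleaf))
        return NULL;
    struct toy_dleaf *leaf = calloc(1, size);
    if (!leaf)
        return NULL;
    leaf->magic = 0x1eaf;                   /* value recalled in the discussion */
    leaf->free = sizeof(struct toy_dleaf);  /* top of the (empty) extent table */
    leaf->used = size;                      /* bottom of the (empty) entry list */
    return leaf;
}

/* Bytes left between the extent table and the entry list. */
unsigned toy_dleaf_free_bytes(const struct toy_dleaf *leaf)
{
    return (unsigned)leaf->used - leaf->free;
}

/* Append one extent and its entry; returns -1 when the leaf would need a split. */
int toy_dleaf_add(struct toy_dleaf *leaf, struct toy_extent extent,
                  struct toy_entry entry)
{
    if (toy_dleaf_free_bytes(leaf) < sizeof extent + sizeof entry)
        return -1;

    /* Extents append upward: nothing already stored has to move. */
    size_t slot = (leaf->free - sizeof(struct toy_dleaf)) / sizeof extent;
    leaf->table[slot] = extent;
    leaf->free += sizeof extent;

    /* Entries append downward, which is the "reverse" order in question. */
    leaf->used -= sizeof entry;
    memcpy((char *)leaf + leaf->used, &entry, sizeof entry);
    return 0;
}
```

Because the two regions grow toward each other, appending at either end never moves existing data, and the leaf is full exactly when the two offsets meet.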
2008-10-01 01:07 the tux3 buffer cache emulation is quite close to the real thing 2008-10-01 01:08 maybe they don't want to have large an extent ? 2008-10-01 01:09 bh, I think they do have a btree in there 2008-10-01 01:09 I saw generic extent manipulation stuff in there 2008-10-01 01:09 no b-tree that I immediately saw 2008-10-01 01:09 there is this notion of an extent cache, I wonder what that is about 2008-10-01 01:10 generally, where you see "path" variables there is a btree 2008-10-01 01:11 and not another on-disk structure ? 2008-10-01 01:13 there are lead manipulation routines but it didnt look like they were manipulating a b-tree 2008-10-01 01:14 +err = ext4_ext_insert_extent(handle, inode, path, &newex); 2008-10-01 01:14 the "inode" parameter makes me think it's against an inode tree instead of a b-tree 2008-10-01 01:17 bh, see ext4_ext_show_path, it's walk a structure that looks much like a btree 2008-10-01 01:17 probing I mean 2008-10-01 01:18 showing 2008-10-01 01:19 it's a very generic routine 2008-10-01 01:19 well look at /* walk through the tree */ 2008-10-01 01:19 it's probing a btree 2008-10-01 01:20 ext4_ext_path 2008-10-01 01:20 What's backing that struct ? 2008-10-01 01:21 blocks 2008-10-01 01:21 It's a path 2008-10-01 01:21 they're written out via mark_buffer_dirty_inode 2008-10-01 01:21 right ? 2008-10-01 01:22 question is what is a path in this code ? 
2008-10-01 01:22 that's an in-memory structure 2008-10-01 01:22 just for keeping track of position in the btree 2008-10-01 01:23 no question it's a btree 2008-10-01 01:23 funny that word doesn't appear in the code 2008-10-01 01:23 don't know, I have to stop looking at the path 2008-10-01 01:23 flips: correct 2008-10-01 01:24 ...patch 2008-10-01 01:24 me too 2008-10-01 01:24 that put me in a bad mood ;) 2008-10-01 01:24 too much code 2008-10-01 01:24 it's as big as all of tux3 at the moment 2008-10-01 01:26 http://www.phoronix.com/forums/showthread.php?t=1765 <- fs benchmarks from 2006 2008-10-01 01:26 reiser4 wins apparently 2008-10-01 01:28 actually, if you read it, ext2 wins, and ext4 not far behind 2008-10-01 01:28 marginally beats reiser4 at tar -c and -x 2008-10-01 01:29 kills reiser on delete 2008-10-01 01:29 so to speak 2008-10-01 01:34 hes already dead ;) 2008-10-01 02:07 -!- Kirantpatil(~kiran@122.167.192.43) has left #tux3 2008-10-01 02:25 flips: as long as tux3 has the same kind of stuff you're set 2008-10-01 02:25 or different stuff that works as well or better 2008-10-01 02:26 the goot thing about a b-tree being the center of the universe is that implementations like that can be done in terms of it 2008-10-01 02:26 good 2008-10-01 02:27 which is what you've done 2008-10-01 02:27 it's better to be efficient than to use a brute force method to building a system if possible 2008-10-01 02:28 pretty much need to have a btree in order to have extents 2008-10-01 02:28 ACTION still needs to look at dleaf.c more  2008-10-01 02:29 well, good luck with it 2008-10-01 02:37 ACTION is just about to hit the bed 2008-10-01 02:37 getting late and it's rally hot today/tonight 2008-10-01 02:55 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-01 02:55 hey daddy 2008-10-01 02:59 -!- kbingham(~kbingham@193.132.141.186) has joined #tux3 2008-10-01 03:45 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-10-01 
04:35 Fire! 2008-10-01 04:35 there goes another tux3 report off into the wild blue yonder 2008-10-01 04:41 flips: from where do you have the numbers? 2008-10-01 04:42 for the comparisons? 2008-10-01 04:42 which ones? 2008-10-01 04:42 ext3 vs tux3 with regard to deletion e.g. 2008-10-01 04:42 simple arithmetic 2008-10-01 04:43 average seek time: 6 ms 2008-10-01 04:43 transfer speed: 64 MB/sec 2008-10-01 04:43 ok, so you calculated it. did you account for the overhead of more complex computations for extents? Or are they nothing to worry about (probably not, regarding how slow disks are) 2008-10-01 04:43 1K pointers/4K block for ext2 comes from the 4 byte pointer size 2008-10-01 04:44 no, I did not add in the cpu overhead 2008-10-01 04:44 but cpu overhead is actually less 2008-10-01 04:44 well, less cases to look at 2008-10-01 04:44 because many simple operations are replaced by one complex operation 2008-10-01 04:45 will you do another post about the extent design? 2008-10-01 04:45 yes 2008-10-01 04:45 maybe tomorrow, maybe later 2008-10-01 04:46 I guess I will get it working the rest of the way first 2008-10-01 04:46 there are still a couple of biggish things to do 2008-10-01 04:46 have to put in the leaf splitting 2008-10-01 04:47 and the actual IO 2008-10-01 04:47 and the extent reading 2008-10-01 04:48 sounds like the better plan. have something to show :) 2008-10-01 04:48 agreed 2008-10-01 04:48 and even better plan: me get some sleep 2008-10-01 04:49 i was just wondering if you were still awake or already 2008-10-01 04:49 still 2008-10-01 04:49 not for very much longer 2008-10-01 04:50 ok. i'll get going too 2008-10-01 04:50 see you 2008-10-01 04:50 bye. 
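The figures quoted above (6 ms average seek, 64 MB/sec transfer, 4 byte block pointers in a 4K block) can be worked through as a small C sketch. This is only an illustration of the "simple arithmetic", not the calculation from the tux3 report itself: the helper names are made up here, and MB is assumed to mean 10^6 bytes.

```c
#include <assert.h>

/* Hypothetical helpers, named for this sketch only. */

/* Number of block pointers that fit in one index block. */
static unsigned pointers_per_block(unsigned block_size, unsigned pointer_size)
{
	assert(pointer_size != 0);
	return block_size / pointer_size;
}

/* Bytes the disk could have streamed in the time taken by one average seek. */
static unsigned long long bytes_per_seek(unsigned seek_ms,
					 unsigned long long bytes_per_sec)
{
	return bytes_per_sec * seek_ms / 1000;
}
```

With these, pointers_per_block(4096, 4) gives the 1K pointers per 4K block mentioned for ext2, and bytes_per_seek(6, 64000000ULL) gives 384000, so under these assumptions each avoided seek is worth about as much as streaming 384 KB.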
2008-10-01 04:51 -!- bobby(~bobby@nat-inn.mentorg.com) has joined #tux3 2008-10-01 05:33 -!- Kirantpatil(~kiran@122.167.192.43) has joined #tux3 2008-10-01 05:33 -!- Kirantpatil(~kiran@122.167.192.43) has left #tux3 2008-10-01 06:05 flips, congrats on completing(partially even) extents 2008-10-01 06:24 -!- Bobby(~Bobby@nat-inn.mentorg.com) has joined #tux3 2008-10-01 06:25 hello all 2008-10-01 06:57 -!- Kirantpatil(~kiran@122.167.192.43) has joined #tux3 2008-10-01 06:57 -!- Kirantpatil(~kiran@122.167.192.43) has left #tux3 2008-10-01 08:37 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-01 09:38 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-01 11:19 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-01 11:55 flips: ping 2008-10-01 12:02 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-10-01 12:15 tim_dimm, pong 2008-10-01 12:21 dang- now she's eating again 2008-10-01 12:21 20 minutes 2008-10-01 12:21 life in the daddy zone 2008-10-01 12:24 -!- Bobby(~Bobby@122.162.68.20) has joined #tux3 2008-10-01 12:24 hey all 2008-10-01 12:25 hi pranith 2008-10-01 12:25 hello flips 2008-10-01 12:25 was reading the ext2 chapter 2008-10-01 12:26 from utlk 2008-10-01 12:26 good thing to do 2008-10-01 12:26 its got some great info 2008-10-01 12:26 utlk is pretty good 2008-10-01 12:26 yup, but pretty outdated i guess 2008-10-01 12:26 he says that ext3 still in dev.. 2008-10-01 12:26 :D 2008-10-01 12:27 :) 2008-10-01 12:27 things actually change pretty slowly in kernel 2008-10-01 12:27 I wonder if he covers reverse mapping 2008-10-01 12:27 and rcu 2008-10-01 12:27 don't think either 2008-10-01 12:28 hmm 2008-10-01 12:28 rcu? 
2008-10-01 12:28 weird new locking method 2008-10-01 12:28 :( 2008-10-01 12:28 ohk 2008-10-01 12:28 read/copy/update 2008-10-01 12:29 ohk, found this http://lwn.net/Articles/174641/ 2008-10-01 12:30 ACTION is going to miss this place for the next few days 2008-10-01 12:30 :( out on a trip 2008-10-01 12:30 that's a use of rcu 2008-10-01 12:30 by the way 2008-10-01 12:30 hmm 2008-10-01 12:30 next tuesday I will be away 2008-10-01 12:31 oh 2008-10-01 12:31 maybe maze can do it 2008-10-01 12:31 hmm 2008-10-01 12:31 ok 2008-10-01 12:31 should give him some warning 2008-10-01 12:31 time to study up on something ;) 2008-10-01 12:31 okies 2008-10-01 12:31 flips, one thing 2008-10-01 12:31 you've any idea of aio in the current linux system? 2008-10-01 12:32 i tried it today.. 2008-10-01 12:32 yes 2008-10-01 12:32 din't seem to work... 2008-10-01 12:32 is it _properly_ supported now? 2008-10-01 12:32 yes 2008-10-01 12:32 I have a simple demo somewhere 2008-10-01 12:32 glibc uses the kernel implementation?? 2008-10-01 12:33 yes 2008-10-01 12:33 well 2008-10-01 12:33 it's confusing 2008-10-01 12:33 hmm 2008-10-01 12:33 two different interfaces 2008-10-01 12:33 I can never remember which is which 2008-10-01 12:33 and both complex 2008-10-01 12:33 :) 2008-10-01 12:33 ok 2008-10-01 12:33 irritating to use 2008-10-01 12:33 aio is in general irritating 2008-10-01 12:34 hmm 2008-10-01 12:34 what do u suggest if we had to write say 200mb of data 2008-10-01 12:34 and got some processing to do later 2008-10-01 12:34 i think aio would be perfect for this 2008-10-01 12:34 just dump it and continue with your work 2008-10-01 12:35 and finally check the status and wait till it is completed 2008-10-01 12:35 flips: done 2008-10-01 12:35 instead of waiting forever for the data to be written 2008-10-01 12:35 hey tim_dimm 2008-10-01 12:35 done with what? 
:) 2008-10-01 12:35 pranith, http://groups.google.com/group/zumastor/browse_thread/thread/7b0f5350a99c0d7d/c0bd2f4d698b3ad6?fwc=1 2008-10-01 12:35 hey pranith 2008-10-01 12:36 thnx flips 2008-10-01 12:36 done with feeding my very fussy 3 week old daughter 2008-10-01 12:36 pranith, basically just copied from the aio man page and cleaned up a little 2008-10-01 12:36 okies 2008-10-01 12:39 ok guys, c u on sunday 2008-10-01 12:39 tata 2008-10-01 12:58 Hello :D !!! 2008-10-01 13:09 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-01 13:14 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-10-01 13:43 -!- flips(~phillips@phunq.net) has joined #tux3 2008-10-01 14:18 -!- ajonat(~ajonat@190.48.125.115) has joined #tux3 2008-10-01 14:18 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-10-01 14:20 folks 2008-10-01 14:21 flips: where the anouncement btw ? 2008-10-01 14:21 you might lke to put the update in the topic 2008-10-01 14:25 -!- ChanServ changed mode/#tux3 -> +o flips 2008-10-01 14:25 -!- flips changed topic to "http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 p.m. Pacific Time ~ Next session: friends of grab_cache_page" 2008-10-01 14:25 -!- ChanServ changed mode/#tux3 -> -o flips 2008-10-01 14:35 -!- kbingham(~kbingham@92.19.37.221) has joined #tux3 2008-10-01 14:41 flips: tux3 email posting I mean 2008-10-01 14:41 which one? 2008-10-01 14:41 the most recent tux3 update 2008-10-01 14:42 you were up last last nigh 2008-10-01 14:42 night 2008-10-01 14:49 http://tux3.org/tux3 2008-10-01 16:05 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-10-01 16:10 flips : ever decided to put some CMS/CSS :P ? 2008-10-01 16:10 orgthingy, sure, shapor was going to 2008-10-01 16:11 css? 
2008-10-01 16:11 if cms, then i really advice to use b2evolution 2008-10-01 16:11 had experience with it in my old bluecirclet.com site 2008-10-01 16:11 we newed DB at bluecirclet.com :( 2008-10-01 16:11 well, it was my partner's fault.. 2008-10-01 16:11 :P 2008-10-01 16:12 check out shapor's site, linked from tux3.org 2008-10-01 16:12 k 2008-10-01 16:12 http://www.totalnetsolutions.net/2007/08/13/how-to-increase-battery-life-in-ubuntu-or-debian-linux/ 2008-10-01 16:12 interesting 2008-10-01 16:13 why is b2evolution better than other php blogging packages? 2008-10-01 16:13 flips : because it was my fav.. i tried many 2008-10-01 16:13 it was probably (one of) best i tried 2008-10-01 16:14 but some like WordPress are fine as well 2008-10-01 16:14 couple of reasons for it being your fav? 2008-10-01 16:19 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-10-01 16:19 ACTION thinks plain text files are the best :P 2008-10-01 16:19 spot the bug, it has been in since forever: 2008-10-01 16:19 int ileaf_check(BTREE, struct ileaf *leaf) 2008-10-01 16:19 { 2008-10-01 16:19 char *why; 2008-10-01 16:19 why = "not an inode table leaf"; 2008-10-01 16:19 if (leaf->magic != 0x90de) 2008-10-01 16:19 goto eek; 2008-10-01 16:19 why = "dict out of order"; 2008-10-01 16:20 if (!isinorder(btree, leaf)) 2008-10-01 16:20 goto eek; 2008-10-01 16:20 return 0; 2008-10-01 16:20 eek: 2008-10-01 16:20 printf("%s!\n", why); 2008-10-01 16:20 return -1; 2008-10-01 16:20 } 2008-10-01 16:20 well 2008-10-01 16:20 I gave it away 2008-10-01 16:20 that's the fixed version 2008-10-01 16:20 if (leaf->magic != 0x90de); 2008-10-01 16:20 goto eek; 2008-10-01 16:20 if (leaf->magic != 0x90de); 2008-10-01 16:20 goto eek; 2008-10-01 16:20 ; ?!? 2008-10-01 16:21 :D 2008-10-01 16:21 nobody noticed, and nobody noticed the error that has been printed out every time make tests is run 2008-10-01 16:22 21 days till I have time for FS again :| 2008-10-01 16:23 what happens then? 
2008-10-01 16:23 out of jail? 2008-10-01 16:26 it's pretty 2008-10-01 16:26 I might run it just for laughs 2008-10-01 16:27 better than the crap on tux3.org now 2008-10-01 16:27 thats the best part about bash blogger 2008-10-01 16:27 it just pre renders static html/css 2008-10-01 16:28 so no added security issues either 2008-10-01 16:28 and faster 2008-10-01 16:28 the way it ought to be 2008-10-01 16:29 I noted that 2008-10-01 16:29 plus it has attitude 2008-10-01 16:33 i've been meaning to set it up for the ugly page which is shapor.com 2008-10-01 16:41 shapor : hello 2008-10-01 16:42 has attitude? 2008-10-01 16:42 lol 2008-10-01 16:43 on 22 is the IPSN deadline 2008-10-01 16:44 till then it will be worse and worse :P 2008-10-01 16:44 timewise speaking 2008-10-01 16:45 how come all of you guys are C programmers and know about FS :( 2008-10-01 16:45 no good place to learn C/FS in Middle east 2008-10-01 16:45 no C books in middle east 2008-10-01 16:45 :( 2008-10-01 16:46 I come from Romania ;-) 2008-10-01 16:47 with so much code around a C book is not as useful as it was before 2008-10-01 17:00 flips 2008-10-01 17:00 i was wondering 2008-10-01 17:00 since ext3cow is just FS you wanted 2008-10-01 17:00 but has missing features 2008-10-01 17:00 why dont you just simply make tux3 based on ext3cow ? 2008-10-01 17:01 ipsn? 2008-10-01 17:02 orgthingy, because by the time I finished changing it, it would not resemble ext3 any more at all, and that would be more work than just starting from the beginning 2008-10-01 17:02 for one thing, tux3 does not use a journal 2008-10-01 17:03 i see 2008-10-01 17:03 and the ext3 file index format cannot be adapted for extents, it has to be thrown away and rewritten 2008-10-01 17:03 but little people over here.. do people know what tux3 is 2008-10-01 17:03 ?
2008-10-01 17:03 the list goes on 2008-10-01 17:03 i knew tux3 from stumble-upon 2008-10-01 17:03 basically very little survives 2008-10-01 17:03 i see 2008-10-01 17:04 at least tux3 uses the ext2 directory code 2008-10-01 17:04 for now 2008-10-01 17:05 -!- inverse(~none@h80-net10.simres.netcampus.ca) has joined #tux3 2008-10-01 17:06 flips : how did you get involved in all this opensource project? 2008-10-01 17:06 opensource world* 2008-10-01 17:06 tux2 2008-10-01 17:07 so, tux2 was very first project you were ever involved in? 2008-10-01 17:07 anyway, ext3cow is not the fs I want, I said it is a good project, not that I want to use it myself 2008-10-01 17:07 same goes for btrfs and ext4 2008-10-01 17:08 oh common, ext4 sounds exciting 2008-10-01 17:08 not to me 2008-10-01 17:08 aha ^_^ 2008-10-01 17:10 sk8 oclock 2008-10-01 17:17 http://ipsn.acm.org/2009/ 2008-10-01 17:17 I love the sk8 oclock thing :D 2008-10-01 17:29 flipsout: nice 2008-10-01 17:34 flipsout: have you heard of dmapi ? 2008-10-01 17:34 http://en.wikipedia.org/wiki/XFS#DMAPI 2008-10-01 17:56 shapor, I have 2008-10-01 17:56 permabanned from linux for some unknown reason 2008-10-01 17:56 hrm 2008-10-01 17:57 some people do use it apparently 2008-10-01 17:57 let me take a closer look 2008-10-01 17:58 wow. i've just written the worst code since my first hello world... static allocation of 10000 slots for the free list for the toy os we are coding at university :) 2008-10-01 17:58 but shared mem is supported :) 2008-10-01 17:59 razvanm, good luck with your paper then, it's a paper, right?
2008-10-01 17:59 flips: yup 2008-10-01 17:59 ACTION has yet to write his worst code 2008-10-01 17:59 true, the worse code is always ahead ;-) 2008-10-01 17:59 ok, worst code yet :) 2008-10-01 20:04 -!- ajonat(~ajonat@190.48.123.108) has joined #tux3 2008-10-01 20:11 -!- Kirantpatil(~kiran@122.167.213.15) has joined #tux3 2008-10-01 20:11 -!- Kirantpatil(~kiran@122.167.213.15) has left #tux3 2008-10-01 20:18 Question: does Tux3 support file creation dates? I started playing with it today and noticed that everything was set to "1969-12-31 19:00" 2008-10-01 20:24 inverse: yeah, they're just not used by fuse iirc 2008-10-01 20:24 by the tux3fuse program that is 2008-10-01 20:25 ah, that makes sense. I was confused because I saw an email on the list that referred to them being correct :) 2008-10-01 20:30 hmm, there are 2 different fuse implemenations 2008-10-01 20:30 maybe its in one not the other 2008-10-01 20:30 there is tux3fs and tux3fuse 2008-10-01 20:35 yes, now I have it working :) 2008-10-01 20:35 cool! 2008-10-01 20:35 whatd you chanfe? 2008-10-01 20:35 change* 2008-10-01 20:36 "make makefs" instead of "make testfs" after looking at the source files again I now understand what those are :p 2008-10-01 21:08 http://en.wikipedia.org/wiki/Ext4 <- the one feature I see here we need to add is persistent preallocation 2008-10-01 21:08 should be able to do that with essentially zero code 2008-10-01 21:36 hey flips 2008-10-01 21:49 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-01 21:56 shapor: your link took me down the rabbit hole to this 2008-10-01 21:56 http://en.wikipedia.org/wiki/Comparison_of_file_systems 2008-10-01 21:56 Daniel needs a wiki entry 2008-10-01 22:19 hey tim_dimm 2008-10-01 22:19 wassap? 
2008-10-01 22:20 tim_dimm: there already is one for tux3 2008-10-01 22:20 saw that- needs to be one for Daniel too 2008-10-01 22:20 everyone else has one 2008-10-01 22:20 (all the kids are doing it) 2008-10-01 22:21 shapor to the bat channel 2008-10-02 03:07 -!- kbingham(~kbingham@193.132.141.186) has joined #tux3 2008-10-02 04:19 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-10-02 07:08 -!- orgthingy(~orgthingy@62.150.55.188) has joined #tux3 2008-10-02 07:08 hello! 2008-10-02 08:55 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-02 09:35 -!- Kirantpatil(~kiran@122.166.93.80) has joined #tux3 2008-10-02 09:35 -!- Kirantpatil(~kiran@122.166.93.80) has left #tux3 2008-10-02 11:23 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-02 14:05 -!- kd(~kd@121.246.35.242) has joined #tux3 2008-10-02 14:09 -!- orgthingy_(~orgthingy@62.150.55.188) has joined #tux3 2008-10-02 15:47 well, extent read is a little harder than extent write in one respect 2008-10-02 15:47 for the write, we know to form up extents by searching for adjoining dirty regions 2008-10-02 15:47 dirty buffers I mean 2008-10-02 15:48 for read we don't 2008-10-02 15:48 I suppose I could implement readahead here 2008-10-02 15:48 and just go read a whole extent every time somebody asks for a buffer 2008-10-02 15:48 for now 2008-10-02 15:48 the problem is, the buffer at a time high level interface is lame 2008-10-02 15:49 but accurately models what we will get in kernel 2008-10-02 15:49 we need an extent at a time interface that comes from the sys_write level 2008-10-02 15:50 which there is a hook for 2008-10-02 15:50 but it means bypassing the whole generic_read/write mess 2008-10-02 15:50 which might be ok in that it means bypassing a big mess 2008-10-02 15:50 but it also means we will have to maintain essentially a forked version of the read/write library 2008-10-02 15:51 volunteers? 
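The write-side idea described above, forming extents by searching for adjoining dirty buffers, can be sketched in C roughly as follows. This is a minimal illustration and not the tux3 code: the struct run type, the dirty_runs name and the one-byte-per-block dirty map are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type for this sketch; not the tux3 extent format. */
struct run {
	unsigned start;	/* first logical block of the run */
	unsigned count;	/* number of adjoining dirty blocks */
};

/*
 * Scan a per-block dirty map and emit one run for each maximal group of
 * adjoining dirty blocks. Returns the number of runs stored in out[],
 * which is never more than max_runs.
 */
static size_t dirty_runs(const unsigned char *dirty, unsigned blocks,
			 struct run *out, size_t max_runs)
{
	size_t n = 0;
	unsigned i = 0;

	assert(dirty != NULL && out != NULL);
	while (i < blocks && n < max_runs) {
		if (!dirty[i]) {
			i++;
			continue;
		}
		unsigned start = i;
		while (i < blocks && dirty[i])
			i++;
		out[n++] = (struct run){ .start = start, .count = i - start };
	}
	return n;
}
```

For a read there is no dirty state to scan, which is the asymmetry noted above: the caller asks for one buffer and the code has to decide for itself how much of the surrounding extent to bring in.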
2008-10-02 16:13 -!- ChanServ changed mode/#tux3 -> +o flips 2008-10-02 16:15 -!- flips changed topic to "http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 pm Pacific Time ~ Next session: friends of grab_cache_page ~ Postponed till 9 pm tonight, thursday Oct 2" 2008-10-02 16:15 -!- flips changed mode/#tux3 -> -o flips 2008-10-02 16:15 nice 2008-10-02 16:15 flips : you should ask in other channels 2008-10-02 16:15 here all people try to help 2008-10-02 16:16 flips : ask in Linux and opensource channels 2008-10-02 16:16 or, in EFnet 2008-10-02 16:16 ? 2008-10-02 16:16 some may-be interested 2008-10-02 16:16 about what? 2008-10-02 16:16 in helping you 2008-10-02 16:16 oh 2008-10-02 16:16 feel free to ask 2008-10-02 16:16 well, i really dont know much about it? 2008-10-02 16:16 so, i dont really know if i can ask 2008-10-02 16:16 that's ok 2008-10-02 16:16 :P 2008-10-02 16:17 ACTION invites flips to #ubuntu-offtopic @ irc.freenode.net 2008-10-02 16:17 just tell people a filesystem project needs devs, is willing to help them learn 2008-10-02 16:17 sure 2008-10-02 16:18 :) 2008-10-02 16:20 doesn't sound too professional. :| 2008-10-02 16:20 * snuxoll knows nothing about filesystem design flips 2008-10-02 16:20 see? looks matter :P 2008-10-02 16:20 but, still, ask 2008-10-02 16:20 and explain 2008-10-02 16:20 there are lots of people there 2008-10-02 16:24 sure, but C coders? 2008-10-02 16:24 flips : common 2008-10-02 16:24 its linux channel 2008-10-02 16:24 its full of C coders :P 2008-10-02 16:26 you'd be surprised 2008-10-02 16:26 these days linux coding seems to be more about php than anything 2008-10-02 16:27 flips : php? 
2008-10-02 16:27 nay 2008-10-02 16:27 Python and C 2008-10-02 16:27 C++ a bit 2008-10-02 16:27 << python 2008-10-02 16:28 orgthingy, we could use more people playing with the fuse stuff 2008-10-02 16:28 and just trying it and complaining about broken things 2008-10-02 16:28 that's one way to get shapor to code ;) 2008-10-02 16:28 flips : FreeNode is like a UNIX and Linux network 2008-10-02 16:28 lots of programmers there 2008-10-02 16:28 he's awesome when he does 2008-10-02 16:28 you *should* find someone there 2008-10-02 16:28 heh :) 2008-10-02 16:29 flips : and sourceforge would be good idea, so is stumble-upon and digg 2008-10-02 16:30 flips : if it stays small, it'd be "just another free software project" but if you "market" it (not business term) it'd be "just another great opensource project" 2008-10-02 16:30 ok, there's my troll 2008-10-02 16:30 anything else has to come from the grassroots 2008-10-02 16:30 that means you, orgthingy 2008-10-02 16:31 "troll" ? 2008-10-02 16:31 ACTION didnt understand what flips meant  2008-10-02 16:32 wikepedia that 2008-10-02 16:33 you mean troll as is in "annoying useless dude in IRC" ? 2008-10-02 16:33 orgthingy, my time is better spent making code happen, it's up to people who want to help to go spread the word 2008-10-02 16:33 flips : well, i think time is worth looking for people to *code* with you 2008-10-02 16:33 troll as in "saying something controversial in order to get a response" 2008-10-02 16:33 get what i mean? 
2008-10-02 16:34 oh, common :( 2008-10-02 16:34 orgthingy, my time is also better spent encouraging people to go out and find coders than going out and hunting myself 2008-10-02 16:34 ok 2008-10-02 16:34 ACTION hides 2008-10-02 16:35 time being at a premium here 2008-10-02 16:35 sorry :| 2008-10-02 16:35 got to get extent reading working today according to me 2008-10-02 16:35 see the resounding lack of response on ubuntu channel 2008-10-02 16:35 prevailing attitude seems to be "work is somebody else's problem, we're here to hang and feel leet" 2008-10-02 16:36 maybe that's not accurate 2008-10-02 16:36 flips : maybe because it's offtopic channel 2008-10-02 16:36 and maybe because ubuntu users dont program 2008-10-02 16:36 I think the latter is the reason 2008-10-02 16:36 one of the reasons 2008-10-02 16:37 willingness to contribute could certainly be better 2008-10-02 16:37 being willing to always lose the early adopter race to gentoo is not a healthy attitude 2008-10-02 16:37 ok, im asking while you're coding 2008-10-02 16:37 :D 2008-10-02 16:38 seems fair 2008-10-02 16:38 best strategy is just for somebody like you to say there, "there's cool stuff going down on oftc #tux3, why not drop by for a visit" 2008-10-02 16:39 flips : well, thats called trolling in freenode :P 2008-10-02 16:40 or, spamming 2008-10-02 16:40 Id rather ask if someone is interested 2008-10-02 16:40 if they are, ill tell them to come by 2008-10-02 16:40 drop* 2008-10-02 16:40 orgthing, not in #offtopic 2008-10-02 16:40 ok 2008-10-02 16:40 im asking in many channels like #C 2008-10-02 16:40 if it worries you, then give a url instead of a channel 2008-10-02 16:40 #c would be good 2008-10-02 16:41 any c coder can become a kernel coder, or if they don't like kernel, fuse is entirely userspace 2008-10-02 16:41 flips : sorry, but i said "we" but i think using "we" is better than "he" :P 2008-10-02 16:41 we is correct 2008-10-02 16:42 "we" as in "everybody who thinks fat baby penguins are cute"
2008-10-02 16:42 haha 2008-10-02 16:44 flips : i think all programmers are asleep now 2008-10-02 16:44 usually, all of them go like "i want to join!" 2008-10-02 16:44 meh, maybe they're asleep :P 2008-10-02 16:44 programmers are usually asleep 2008-10-02 16:44 just as shapor 2008-10-02 16:45 just ask shapor 2008-10-02 16:45 some are drunk :P (yes really xD) 2008-10-02 16:45 or drunk, right, when they're awake 2008-10-02 16:45 that's why tux3 project has a requirement for beer to be sent 2008-10-02 16:46 to keep our programmers "in the zone" 2008-10-02 16:47 hmm, some say they're already programming in other projects :P 2008-10-02 16:47 haha 2008-10-02 16:47 haha 2008-10-02 16:48 don't take that for an answer ;) 2008-10-02 16:48 flips : ill ask Eloxoph people (i was staff there once) 2008-10-02 16:48 got to have a better excuse than that for being lame 2008-10-02 16:48 its full of C programmers 2008-10-02 16:48 but they're already working on 2 projects 2008-10-02 16:48 sounds good 2008-10-02 16:48 but ill ask them anyway :P 2008-10-02 16:48 2 is not enough 2008-10-02 16:48 should be 3 2008-10-02 16:54 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-02 16:55 I'll be hanging on #freenode if any ubuntus want to chat, flipz, not on any channel 2008-10-02 16:55 heh 2008-10-02 16:55 FelipeS : hello 2008-10-02 16:55 orgthingy, hey 2008-10-02 16:55 so, finally got interested? 2008-10-02 16:56 friend of yours orgthingy? 2008-10-02 16:56 flips : from ##not-physics over there 2008-10-02 16:56 any dev with physics background tends to be interesting to me 2008-10-02 16:56 likewise music and math 2008-10-02 16:57 seem to be generally more aware design wise 2008-10-02 16:57 I'm just a young student. You got me there with the music background. I'm not into music at all. 2008-10-02 16:57 ok, let's see if the read extent generator works 2008-10-02 17:02 felipes, physics? 2008-10-02 17:02 or does #not-physics really mean not physics? 
2008-10-02 17:02 flips : no, not-physics is offtopic channel of ##physics that i founded 2008-10-02 17:03 right, so it means "physics student" essentially? 2008-10-02 17:03 folks 2008-10-02 17:03 :P 2008-10-02 17:03 ok, physics prof then 2008-10-02 17:04 ACTION has to find his hotel receipts and stuff for reimbursement 2008-10-02 17:04 mad scientist? 2008-10-02 17:04 flips : uuuumm? 2008-10-02 17:05 what is #physics about? 2008-10-02 17:05 flips : PHYSICS :P ? 2008-10-02 17:05 flips : how about going on topic with felipes? 2008-10-02 17:06 felipes, got c skillz? 2008-10-02 17:08 eh 2008-10-02 17:08 Honestly I'm in no position for being a dev or working on serious stuff 2008-10-02 17:08 at least I don't think so 2008-10-02 17:09 I just came here to maybe see some conversations on real work being done. 2008-10-02 17:09 sure, good place for that 2008-10-02 17:09 flips : he knows C++ most 2008-10-02 17:09 are you like Linus T. ? not allowing c++ code at all :P 2008-10-02 17:09 c++ has a certain influence over what we do 2008-10-02 17:10 I have nothing against c++, I would be ok with allowing some files in kernel to be compiled that way, with appropriate care 2008-10-02 17:10 I'm a first year computer eng major at tech, programming is a hobby that got ahold of me about 4 years ago. I've never done anything impressive with it however. just learning here and there. 2008-10-02 17:10 but linus will not allow it so that ends that 2008-10-02 17:10 and it's all been with c++ 2008-10-02 17:11 computer eng is a good place to get a perspective on software efficiency 2008-10-02 17:11 c++ includes c, most of it 2008-10-02 17:11 yeah I know 2008-10-02 17:11 c++ lacks designated initializers, which we use extensively, without them I'd be unwilling to do a kernel project in c++ even if it was ok with linus 2008-10-02 17:11 is it not 100% backwards compatible?
2008-10-02 17:11 'backwards' :P 2008-10-02 17:11 not 100% 2008-10-02 17:11 stupidly so 2008-10-02 17:15 "sure, good place for that"; was that sarcasm? flips 2008-10-02 17:16 not at all 2008-10-02 17:16 sarcasm always comes with a :p 2008-10-02 17:16 some of the best devs in the known universe hang here 2008-10-02 17:16 just have to catch them talking ;) 2008-10-02 17:18 oh well that's great. I'll be sure to add it to my favs then. 2008-10-02 17:18 filemap_extent_read: logical block 0x5 of inode 0x0 2008-10-02 17:18 ---- extent 0x5/1 ---- 2008-10-02 17:18 prior extents: 2008-10-02 17:18 ---- rewind to 0x0 => 0/1 ---- 2008-10-02 17:18 filemap_extent_read: index 5, limit 6 2008-10-02 17:19 filemap_extent_read: offset = 0, gap = 0 2008-10-02 17:19 filemap_extent_read: fill gap at 5/1 2008-10-02 17:19 balloc extent -> [2/1] 2008-10-02 17:19 segs (offset = 0): 5 => 2/1; (1) 2008-10-02 17:19 well things are starting to happen with extent read 2008-10-02 17:20 course it should not be allocating blocks on read 2008-10-02 17:20 that's because it started as a cut n paste of extent write 2008-10-02 17:20 needs to fill those buffers with zero instead 2008-10-02 17:20 hmm 2008-10-02 17:21 no, just the one buffer 2008-10-02 17:21 ...maybe 2008-10-02 17:21 I doubt "fill ahead" is a win 2008-10-02 17:29 getting closer... 
2008-10-02 17:29 filemap_extent_read: logical block 0x5 of inode 0x0 2008-10-02 17:29 ---- extent 0x5/1 ---- 2008-10-02 17:29 prior extents: 2008-10-02 17:29 ---- rewind to 0x0 => 0/1 ---- 2008-10-02 17:29 filemap_extent_read: index 5, limit 6 2008-10-02 17:29 filemap_extent_read: offset = 0, next = 6, gap = 1 2008-10-02 17:29 filemap_extent_read: fill gap at 5/1 2008-10-02 17:29 balloc extent -> [2/1] 2008-10-02 17:29 segs (offset = 0): 5 => 2/1; (1) 2008-10-02 17:29 filemap_extent_read: extent 0x5/1 => 2 2008-10-02 17:29 filemap_extent_read: read block 0x5 => 2 2008-10-02 17:29 now need to get rid of that balloc/read and replace with fill for unmapped buffer 2008-10-02 18:08 seg[segs++] = *(struct extent *)(u64[]){ -1LL }; <- some nasty c 2008-10-02 18:08 thought I'd show that before throwing it away 2008-10-02 18:10 now does the right thing for unmapped buffers: 2008-10-02 18:10 filemap_extent_read: logical block 0x5 of inode 0x0 2008-10-02 18:10 ---- extent 0x5/1 ---- 2008-10-02 18:10 prior extents: 2008-10-02 18:10 ---- rewind to 0x0 => 0/1 ---- 2008-10-02 18:10 filemap_extent_read: index 5, limit 6 2008-10-02 18:10 filemap_extent_read: offset = 0, next = 6, gap = 1 2008-10-02 18:10 filemap_extent_read: fill gap at 5/1 2008-10-02 18:10 segs (offset = 0): 5 => ffffffffffff/1; (1) 2008-10-02 18:10 filemap_extent_read: extent 0x5/1 => ffffffffffff 2008-10-02 18:10 filemap_extent_read: read block 0x5 => ffffffffffff 2008-10-02 18:10 filemap_extent_read: zero fill buffer 2008-10-02 18:10 which brings us to sk8 oclock 2008-10-02 19:53 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-02 19:55 oh... no class today? 2008-10-02 20:01 -!- RalucaM(~ral@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-02 20:02 aaa... 9pm tonight :P 2008-10-02 20:17 really? 9? 2008-10-02 20:20 I have to be up in about 6h... :( 2008-10-02 20:28 I'm falling asleep already... 
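The read-side behaviour in the last trace above, where an unmapped logical block shows up with an all-ones physical address and the buffer is zero filled instead of a block being allocated, can be sketched in C as below. This is a guess at the shape of the logic, not the tux3 implementation: the 48-bit sentinel is inferred from the ffffffffffff in the trace, and every name here (SKETCH_UNMAPPED, read_logical_block, fake_read_block) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLOCK_SIZE 4096
/* Assumed sentinel: all ones in a 48-bit block address, as in the trace. */
#define SKETCH_UNMAPPED 0xffffffffffffULL

/* Hypothetical stand-in for "read one block from the volume". */
typedef int (*read_block_fn)(uint64_t physical, void *data);

/* Toy device for the sketch: fills the buffer with the low byte of the address. */
static int fake_read_block(uint64_t physical, void *data)
{
	memset(data, (int)(physical & 0xff), SKETCH_BLOCK_SIZE);
	return 0;
}

/*
 * Fill one buffer for a read. A hole (unmapped block) is zero filled and
 * nothing is allocated; a mapped block is read from its physical address.
 */
static int read_logical_block(uint64_t physical, void *data,
			      read_block_fn read_block)
{
	assert(data != NULL);
	if (physical == SKETCH_UNMAPPED) {
		memset(data, 0, SKETCH_BLOCK_SIZE);
		return 0;
	}
	return read_block(physical, data);
}
```

Only the single requested buffer is zeroed here, which matches the doubt expressed above about whether "fill ahead" would be a win.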
2008-10-02 20:28 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-02 20:28 me too 2008-10-02 20:28 http://lxr.linux.no/linux+v2.6.26.5/fs/inode.c#L124 2008-10-02 20:29 about the first Q from the homework... 2008-10-02 20:29 what was the second one? 2008-10-02 20:31 aaa... the locks when a file is closed 2008-10-02 20:34 hmm 2008-10-02 20:36 I've already answered both. 2008-10-02 20:36 http://lxr.linux.no/linux+v2.6.26.5/fs/open.c#L1175 files->file_lock cannot be the lock we were talking about... 2008-10-02 20:36 let me check my logs 2008-10-02 20:37 ok, so let's pick it up again next thursday 2008-10-02 20:37 ok, first homework was why both ptr and struct address_space (*i_mapping and i_data) in struct inode 2008-10-02 20:37 http://lxr.linux.no/linux+v2.6.26.5/fs/locks.c#L1567 so the locks are tied to filp 2008-10-02 20:37 I will be offline for a few days 2008-10-02 20:37 ah 2008-10-02 20:37 it goes on without me ;) 2008-10-02 20:37 (the ideal situation) 2008-10-02 20:38 :-) 2008-10-02 20:38 and the above L124 is not quite the answer 2008-10-02 20:38 ralucam, you following ok? 2008-10-02 20:38 MaZe: true, it's the point where they are made the same... 2008-10-02 20:38 that's where it gets set to the default value, but why do you need a pointer, couldn't u always use &i_data 2008-10-02 20:38 the i_data doesn't seem to be used much 2008-10-02 20:39 inode->i_mapping is defined to be 2008-10-02 20:39 always valid 2008-10-02 20:39 re 1175 - not sure what you mean 2008-10-02 20:39 sorry... I was wrong about L124 2008-10-02 20:40 the locks in L1175 and L1567 are not quite the ones we were talking about re fget_light/fput_light 2008-10-02 20:40 so the thing from L124 is put in the structure here: http://lxr.linux.no/linux+v2.6.26.5/fs/inode.c#L184 2008-10-02 20:40 MaZe: aaaa...
2008-10-02 20:40 right 2008-10-02 20:40 but that's the default path 2008-10-02 20:41 -!- RalucaM(~ral@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-02 20:41 now I'm searching for the places where i_mapping is changed :P 2008-10-02 20:41 the one which could have sed / "->i_mapping->" "->i_data." / 2008-10-02 20:41 killing lxr in the process ;-) 2008-10-02 20:42 maze, not quite 2008-10-02 20:42 hmm? 2008-10-02 20:42 i_mapping is a pointer, i_data is an object 2008-10-02 20:42 hence -> to . 2008-10-02 20:42 I'm searching for i_mapping only ;-) 2008-10-02 20:42 right 2008-10-02 20:43 I don't expect the name to be used for anything else 2008-10-02 20:43 sorry, my eyes got crossed with the sed ;) 2008-10-02 20:43 ok, searching logs for the second homework question 2008-10-02 20:43 there was a bonus about why fget_light is demented 2008-10-02 20:45 a: breaks refcounting purely to reduce cacheline pinging 2008-10-02 20:45 ah, ok, so that wasn't the bonus, that was just the second homework 2008-10-02 20:45 accurately characterized by akpm as "foul" 2008-10-02 20:45 http://lxr.linux.no/linux+v2.6.26.5/fs/block_dev.c#L290 2008-10-02 20:45 looks closer 2008-10-02 20:46 but no cigar 2008-10-02 20:46 that's the default value as well 2008-10-02 20:46 I still am not sure why we need i_mapping -> -i_data 2008-10-02 20:46 i_mapping = &i_data 2008-10-02 20:46 and http://lxr.linux.no/linux+v2.6.26.5/fs/block_dev.c#L425 2008-10-02 20:46 http://lxr.linux.no/linux+v2.6.26.5/fs/block_dev.c#L450 2008-10-02 20:46 to be more precise 2008-10-02 20:47 I suspect it's bogus, but a pretty extensive survey of usage would be required to say one way or the other 2008-10-02 20:47 that is the _main_ use case in the kernel 2008-10-02 20:47 (there are two others) 2008-10-02 20:47 right, coda and ? 2008-10-02 20:47 raw char devs 2008-10-02 20:47 char devs? 2008-10-02 20:48 why only char devs? 
2008-10-02 20:48 the primary use case seems to be block devices, the secondary is raw char devices (ie. for direct io to block devs, ancient pre-O_DIRECT interface), and third/last is the coda fs 2008-10-02 20:48 well the blockdev usage smells really bogus 2008-10-02 20:48 -!- ajonat(~ajonat@190.48.123.108) has joined #tux3 2008-10-02 20:48 inode->i_mapping = &inode->i_data; 2008-10-02 20:48 from coda ;-) 2008-10-02 20:48 wrong line 2008-10-02 20:49 since that's restoring the default 2008-10-02 20:49 I pick it because of that ;-) 2008-10-02 20:49 the rest are in file.c 2008-10-02 20:50 jeez, that blockdev code is twisted 2008-10-02 20:50 I think, product of fuzzy thinking and not necessity 2008-10-02 20:50 but more analysis is needed to be sure 2008-10-02 20:51 there should be some system to mark as 'looks wrong to me' some stuff in the kernel :P 2008-10-02 20:51 heh 2008-10-02 20:51 we carve our comments in the internet, right hew 2008-10-02 20:51 right here 2008-10-02 20:51 ./fs/coda/file.c-106- host_inode = host_file->f_path.dentry->d_inode; 2008-10-02 20:51 ./fs/coda/file.c-107- coda_file->f_mapping = host_file->f_mapping; 2008-10-02 20:51 ./fs/coda/file.c:108: if (coda_inode->i_mapping == &coda_inode->i_data) 2008-10-02 20:51 ./fs/coda/file.c:109: coda_inode->i_mapping = host_inode->i_mapping; 2008-10-02 20:51 ./fs/coda/file.c-110- 2008-10-02 20:51 ./fs/coda/file.c-111- /* only allow additional mmaps as long as userspace isn't changing 2008-10-02 20:52 where they will be unearthed by archaeologists millenia later 2008-10-02 20:52 that's the relevant part of coda 2008-10-02 20:52 ./drivers/char/raw.c-76- filp->f_mapping = bdev->bd_inode->i_mapping; 2008-10-02 20:52 ./drivers/char/raw.c-77- if (++raw_devices[minor].inuse == 1) 2008-10-02 20:52 ./drivers/char/raw.c:78: filp->f_path.dentry->d_inode->i_mapping = 2008-10-02 20:52 ./drivers/char/raw.c-79- bdev->bd_inode->i_mapping; 2008-10-02 20:52 ./drivers/char/raw.c-80- filp->private_data = bdev; 2008-10-02 20:52 
and that's for raw char dev 2008-10-02 20:53 bdev again 2008-10-02 20:53 while I can see/understand the need for the raw char dev and coda use cases 2008-10-02 20:53 I don't yet get what the normal bdev case is for 2008-10-02 20:53 actually... they look similar 2008-10-02 20:53 coda and raw 2008-10-02 20:53 not surprising 2008-10-02 20:54 coda is a networked file system with local caching and offline operation 2008-10-02 20:54 the raw.c stuff looks bogus too 2008-10-02 20:54 at least partly 2008-10-02 20:54 why does d_inode need to have a mapping? 2008-10-02 20:54 it needs a way to tell the kernel that the page cache for a file in codafs is actually the page cache for another file in a local filesystem (ie. in the cache fs store) 2008-10-02 20:54 sorry 2008-10-02 20:54 while the raw char dev case needs to remap the raw char dev to the block dev 2008-10-02 20:55 bd_acquire is only called in 3 places... 2008-10-02 20:55 lying underneath it 2008-10-02 20:55 why does filp need a mapping, I meant to say 2008-10-02 20:55 maze, but that raw char dev case sounds like it could be done some other way 2008-10-02 20:56 hmm... how does caching for char devices work? :P 2008-10-02 20:56 ah, maybe 2008-10-02 20:56 it's if we open the same block dev from different inodes? 2008-10-02 20:56 from different entries and/or filesystems? 2008-10-02 20:56 something weird 2008-10-02 20:56 that offends my sense of form and balance 2008-10-02 20:56 since they're all actually the same block device, but they're not the same inode 2008-10-02 20:56 that makes sense 2008-10-02 20:56 want to have them backed by the same page cache 2008-10-02 20:57 makes sense! :D 2008-10-02 20:57 how about coda? 
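The i_mapping / i_data question discussed here can be sketched in plain userspace C. This is a toy model, not kernel code: `toy_mapping`, `toy_inode` and the three helper functions are hypothetical names invented for illustration, and the structs keep only the two fields the discussion is about. It shows the default case (the pointer refers to the inode's own embedded mapping) and the redirect case seen in the raw.c and coda snippets, where several inodes end up sharing one page cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct address_space: just counts cached pages. */
struct toy_mapping {
    unsigned long nrpages;
};

/* Toy stand-in for struct inode: an embedded mapping (like i_data)
 * plus a pointer to the mapping actually in use (like i_mapping). */
struct toy_inode {
    struct toy_mapping  i_data;     /* this inode's own page cache */
    struct toy_mapping *i_mapping;  /* page cache the inode really uses */
};

/* Default setup: the pointer refers to the inode's own embedded mapping. */
static void toy_inode_init(struct toy_inode *inode)
{
    assert(inode != NULL);
    inode->i_data.nrpages = 0;
    inode->i_mapping = &inode->i_data;
}

/* Redirect one inode to whatever cache another inode uses, the same
 * shape as the assignments in the raw.c and coda snippets. */
static void toy_redirect(struct toy_inode *node, const struct toy_inode *backing)
{
    assert(node != NULL && backing != NULL);
    node->i_mapping = backing->i_mapping;
}

/* True if both inodes would look pages up in the same cache. */
static bool toy_same_cache(const struct toy_inode *a, const struct toy_inode *b)
{
    return a->i_mapping == b->i_mapping;
}
```

In this model two freshly initialised inodes have separate caches. Once both are redirected to a common backing inode, `toy_same_cache` returns true for them, which is the behaviour wanted when two device nodes name the same block device. With only the embedded `i_data` and no pointer, that aliasing could not be expressed.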
2008-10-02 20:57 we don't open block devices "from inodes" 2008-10-02 20:57 we open them from names 2008-10-02 20:57 sure we do ;-) 2008-10-02 20:57 we're talking about i_node->i_mapping remember ;-) 2008-10-02 20:57 right name -> dentry -> inode -> bla bla -> mapping 2008-10-02 20:57 $ stat hda1 2008-10-02 20:57 File: `hda1' 2008-10-02 20:57 Size: 0 Blocks: 0 IO Block: 4096 block special file 2008-10-02 20:57 Device: dh/13d Inode: 2343 Links: 1 Device type: 3,1 2008-10-02 20:57 Access: (0660/brw-rw----) Uid: ( 0/ root) Gid: ( 6/ disk) 2008-10-02 20:57 Access: 2008-10-01 15:41:08.151619201 -0400 2008-10-02 20:57 Modify: 2008-10-01 15:40:54.344130563 -0400 2008-10-02 20:58 Change: 2008-10-01 15:40:54.344130563 -0400 2008-10-02 20:59 stat hdaX 2008-10-02 20:59 File: `hdaX' 2008-10-02 20:59 Size: 0 Blocks: 0 IO Block: 4096 block special file 2008-10-02 20:59 Device: dh/13d Inode: 911348 Links: 1 Device type: 3,1 2008-10-02 20:59 Access: (0644/brw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) 2008-10-02 20:59 MaZe is right 2008-10-02 20:59 # ls -al sda 2008-10-02 20:59 brw-rw---- 1 root disk 8, 0 2008-09-29 17:52 sda# mknod sda__ b 8 0 2008-10-02 20:59 # ls -al sda__ 2008-10-02 20:59 brw-r--r-- 1 root root 8, 0 2008-10-02 20:58 sda__ 2008-10-02 20:59 # stat sda 2008-10-02 20:59 Device: eh/14d Inode: 339 Links: 1 Device type: 8,0 2008-10-02 20:59 # stat sda__ 2008-10-02 20:59 Device: eh/14d Inode: 844038 Links: 1 Device type: 8,0 2008-10-02 20:59 :D 2008-10-02 20:59 I was first ;-) 2008-10-02 20:59 yup 2008-10-02 21:00 and it'll have an even different inode on a different partition/filesystem 2008-10-02 21:00 hence, different sb,inode pairs referring to the same block dev, but having to have the same page cache 2008-10-02 21:00 that for sure 2008-10-02 21:00 I don't understand the page cache for char dev though 2008-10-02 21:01 the raw char dev is an abomination 2008-10-02 21:01 it behave like a block dev 2008-10-02 21:01 nowadays the correct way is to open the normal (base) 
block dev with option O_DIRECT 2008-10-02 21:01 maze, but why does kernel ever have to use the inode from the mknod file directly? 2008-10-02 21:01 oh... 2008-10-02 21:01 why not dereference to the real inode first? 2008-10-02 21:01 flips: dentry? 2008-10-02 21:02 could be a dentry reason, still sounds bogus 2008-10-02 21:02 real inode? 2008-10-02 21:02 even if it's in the dentry 2008-10-02 21:02 it used to be you would map the base dev to a raw dev and then get direct io on the raw dev, which for reasons which escape me wasn't a block dev, but a char dev instead 2008-10-02 21:02 dreference after getting hold of the dentry 2008-10-02 21:02 because bdevs don't have inodes? 2008-10-02 21:02 "if" it's a dentry reason 2008-10-02 21:02 the names can be different 2008-10-02 21:02 so the entries are diferent 2008-10-02 21:03 there is no real name, no real inode, the only thing there is is the block major, minor pair 2008-10-02 21:03 you want to alias them to the same inode? 2008-10-02 21:03 my sense that this extra level of indirection in ->mapping is being abused by blockdevs is getting stronger 2008-10-02 21:03 you could probably get away with having a blockdevfs 2008-10-02 21:03 which would be the fs with inodes for blockdevs, which you could then return instead of the proper inode from the normal filesystem 2008-10-02 21:03 except 2008-10-02 21:04 inodes also store permissions 2008-10-02 21:04 which might be different between different blockdev entries on the filesystem 2008-10-02 21:04 :D 2008-10-02 21:04 well let's put that one aside 2008-10-02 21:04 return to it later 2008-10-02 21:04 and you couldn't delete them... 
2008-10-02 21:04 since you'd be deleting the wrong inode 2008-10-02 21:04 after I've had time to read more around that part of the code ;) 2008-10-02 21:05 so the inode references the entry in the filesystem 2008-10-02 21:05 it's often a mistake to assume that something which is really weird is that way because it has to be 2008-10-02 21:05 [hmm although deletions, really delete dentries, not inodes] 2008-10-02 21:05 dirent, not dentry 2008-10-02 21:06 we reserve the latter name to mean the cache item 2008-10-02 21:06 right nomenclature... 2008-10-02 21:06 just convention 2008-10-02 21:06 I have a feeling you'd still need it for stuff like coda, or more advanced caching network fs'es anyway 2008-10-02 21:07 _maybe_ 2008-10-02 21:07 why don't you write a stacking filesystem and see if you're forced to use that feature? 2008-10-02 21:07 ordinary filesystem is too easy for you ;) 2008-10-02 21:08 in truth, I've taken a light run at that myself and found the issues... hurtful 2008-10-02 21:08 our vfs was not designed to be stacked 2008-10-02 21:08 it's very much a fixed number of levels kind of thing 2008-10-02 21:09 every now and then some people show up to improve the stackability, and after pushing uphill for a while they go away again 2008-10-02 21:10 then when fuse came along, most of the motivation for stacking filesystems in kernel went away 2008-10-02 21:10 leaving just nfs... a quasi stacking filesystem... and coda... I think I have very little understanding of 2008-10-02 21:10 for a while there was intermezzo 2008-10-02 21:10 which never quite got working 2008-10-02 21:11 well I don't think it was an official tux3 u session today 2008-10-02 21:11 let's not post logs 2008-10-02 21:12 will resume in earnest next thursday 2008-10-02 21:12 with the friends, right? :D 2008-10-02 21:13 that would be a good place, or any requests? 2008-10-02 21:13 what is the biggest mystery remaining? 
2008-10-02 21:13 dentry cache ranks up there 2008-10-02 21:13 path walk 2008-10-02 21:14 ->rename, worth a visit 2008-10-02 21:14 pdflush? 2008-10-02 21:14 it's really vm, but you need to know how it works to write fast filesystem code 2008-10-02 21:15 ACTION will still 'float' till 22 :| 2008-10-02 21:15 hmm, it may even not be generic enough for cool stuff like cow or versioned or other types of opts filesystems 2008-10-02 21:16 what is pdflush? 2008-10-02 21:16 maze, what might not be? 2008-10-02 21:16 pdflush? certainly isn't 2008-10-02 21:16 the current i_mapping pointer 2008-10-02 21:16 ah 2008-10-02 21:16 since you could potentially have 2008-10-02 21:16 yes, you could get into some really demented versioning tricks 2008-10-02 21:16 2 files on different inodes on different filesystems 2008-10-02 21:17 but we will use a very simple one... 2008-10-02 21:17 one version of a file gets loaded into the inode->mapping and that is it 2008-10-02 21:17 and the page cache could potentially be shared between them if they (or parts of them) refer to the same data 2008-10-02 21:17 and shared with the block dev cache that the files are stored on , etc 2008-10-02 21:17 we don't try to share mapping pages between different versions of the same file 2008-10-02 21:17 ACTION is off to bed. Good night to everyone! 2008-10-02 21:17 sharing mapping pages would require deep surgery 2008-10-02 21:17 see you 2008-10-02 21:17 yep, just what I realized 2008-10-02 21:17 good night! 2008-10-02 21:18 save that for linux 2.9 2008-10-02 21:18 but sharing pages between the blockdev and the fs on top of it, and the netfs exported/imported from it, and the various versions and cow files is what should happen 2008-10-02 21:19 why between various versions of cow files? 2008-10-02 21:19 if it lives in one spot on disk, it should only live in one spot in memory 2008-10-02 21:19 why does that matter? 
2008-10-02 21:19 for the regions which are identical 2008-10-02 21:19 memory 2008-10-02 21:19 so we waste some cache by duplicating pages, what's the problem? 2008-10-02 21:19 you can get by with a much smaller cache, or make much better use of existing cache 2008-10-02 21:20 we already suck beyond belief for in-cache diff 2008-10-02 21:20 hmm? what do you mean/ 2008-10-02 21:20 maybe fix the obvious breakage first 2008-10-02 21:20 ? 2008-10-02 21:20 try diffing two kernel trees 2008-10-02 21:20 and see how much memory you need to keep both 100% in cache 2008-10-02 21:20 it's ballooned from before, not because the tree got bigger 2008-10-02 21:21 [and yes doing all this is hard because of writes and read-only stuff, and when to dupe, when to modify in place, etc] 2008-10-02 21:21 $ time diff -qr linux-2.6.26.5_ linux-2.6.26.5 2008-10-02 21:21 real 0m1.208s 2008-10-02 21:22 hmm 2008-10-02 21:22 but how much cache am I using 2008-10-02 21:22 on, now suppose you want to share cache pages, that means at find_cache_page miss time you need to be able to know the target page is already in some other cache 2008-10-02 21:22 the way to do that is by putting a forwarding pointer in the page cache for the device 2008-10-02 21:22 Cached: 3193700 kB 2008-10-02 21:22 hmm wonder how much of that was the 2 kernel trees? 2008-10-02 21:23 mazem, 4 GB machine? 2008-10-02 21:23 yup 2008-10-02 21:23 try it with 512 MB 2008-10-02 21:23 $ du -hs * 2008-10-02 21:23 323M linux-2.6.26.5 2008-10-02 21:23 323M linux-2.6.26.5_ 2008-10-02 21:23 won't work 2008-10-02 21:23 since the kernels take 640M themselves 2008-10-02 21:24 1G then 2008-10-02 21:24 Cached: 2552060 kB 2008-10-02 21:24 after deleting both kernel trees 2008-10-02 21:24 would it have gc'ed all the stuff I deleted? 
2008-10-02 21:25 it should not have 2008-10-02 21:25 um 2008-10-02 21:25 no, of course it should have 2008-10-02 21:25 deleted = not cacheable 2008-10-02 21:26 so it seems to have deleted exactly the right amount -> 641MB 2008-10-02 21:27 oh, so the block dev and the file system mounted on top of it have seperate page caches, which aren't shared till they hit disk 2008-10-02 21:27 which is why you should never touch the block dev directly, if there's an fs mounted on it 2008-10-02 21:27 not quite 2008-10-02 21:28 (except potentially for reads) 2008-10-02 21:28 they aren't shared, but the set of blocks in the two is disjoint 2008-10-02 21:28 better be 2008-10-02 21:28 uhm 2008-10-02 21:28 I don't think there's any guarantee for that 2008-10-02 21:28 file cache is data, blockdev cache is metadata 2008-10-02 21:28 the filesystem has to make that gaurantee 2008-10-02 21:28 oh, you mean if the fs is using the blockdev 2008-10-02 21:29 sure 2008-10-02 21:29 the fs always uses the blockdev 2008-10-02 21:29 well 2008-10-02 21:29 if you do it from userspace you can easily screw that up 2008-10-02 21:29 it always uses the buffer cache 2008-10-02 21:30 you will see code in filesystems to invalidate buffer cache pages when metadata is freed 2008-10-02 21:30 ah 2008-10-02 21:30 in case they later get used for normal data 2008-10-02 21:30 in some caches, clean alias pages are left around, but that is an accident waiting to happen 2008-10-02 21:30 in some cases I meant 2008-10-02 21:30 yes 2008-10-02 21:31 classic badness 2008-10-02 21:31 has bitten many times 2008-10-02 21:31 in some cases it's impossible to avoid aliases 2008-10-02 21:31 namely when one block on a page is data and another is metadata 2008-10-02 21:32 the blocks themselves are not aliased but the pages are 2008-10-02 21:32 right 2008-10-02 21:33 right, this entire page cache system is nice and simple, and has good performance, but can't really represent all the edge cases, or more complex scenarios 2008-10-02 
21:33 it's pretty simple minded true 2008-10-02 21:34 so how would you go about answering the question: in which page cache(s) is a given physical block already mapped? 2008-10-02 21:34 if you could change everything? 2008-10-02 21:35 I'm not sure yet, but it's pretty clear that (if possible to get good performance with something like this) we would want the minimum amount of duplication possible 2008-10-02 21:36 ie. if it's in one physical location on disk, it should remain in one (or zero) pages in ram regardless of how many levels it crosses 2008-10-02 21:36 all the way down 2008-10-02 21:36 block dev, virtual block dev, file system, network fs, userspace mmap, possibly userspace read hack opts 2008-10-02 21:37 specifying exactly what happens when you trigger a write to a page, would be non-trivial 2008-10-02 21:38 in some cases you simply allocate a new page with duped data that is not mapped to anything else (a modify in ram only scenario) 2008-10-02 21:38 in others you'd need to allocate space on the filesystem and map to a 'sync to this location on disk' page 2008-10-02 21:38 etc 2008-10-02 21:39 my main worry would be that we could potentially be triggering spurious context switches on writes to read only pages 2008-10-02 21:39 spurious - wrong word - more like 'an excessive number of' that would hurt performance 2008-10-02 21:40 you need to be able to throw pages around 2008-10-02 21:41 a read from block dev through virtual dev (lvm) to fs, would somehow result in the same page being held from all 3 places 2008-10-02 21:41 you're going to have a lot of trouble when data and metadata live on the same page 2008-10-02 21:41 and then a full page write would result in an existing page getting mapped in to all 3 places at the same time (reverse process), while partial writes would flag pages as dirty etc 2008-10-02 21:42 yes, but I'm not sure metadata and data deserve to be treated seperately 2008-10-02 21:42 you don't need context switch when you write to a 
read only page, you can check the page flags explicitly 2008-10-02 21:42 and not take a fault 2008-10-02 21:42 in kernel space sure 2008-10-02 21:42 not so in userspae 2008-10-02 21:42 file data is certainly treated separately from metadata 2008-10-02 21:42 not going to change soon 2008-10-02 21:42 in kernel space I could simply check the counters 2008-10-02 21:43 in memory? inodes dentries, etc, sure 2008-10-02 21:43 but on disk? 2008-10-02 21:43 not so sure 2008-10-02 21:43 tux3 already kind of has less of a distinction than normal 2008-10-02 21:43 between? 2008-10-02 21:44 between a file and it's contents, and metadata (logs, btrees, etc) 2008-10-02 21:44 oh, some metadata is mapped as data 2008-10-02 21:44 I think it should be possible to have all the metadata on disk behave like data, with possible exception of (a few?) superblocks 2008-10-02 21:44 I'm sure we're going to hit some interesting recursions in there at some point 2008-10-02 21:45 sometimes I try to map file index metadata into a page cache and it never seems to work out very well 2008-10-02 21:45 well imagine we have a filesystem 2008-10-02 21:45 it already works 2008-10-02 21:46 now we need to store metadata 2008-10-02 21:46 so we write it out to a logfile in the first filesystem 2008-10-02 21:46 metadata != xattrs 2008-10-02 21:46 and store trees as sparse files in the first filesystem, etc 2008-10-02 21:46 sure 2008-10-02 21:46 that's actually done already 2008-10-02 21:46 now if the first filesystem is the filesystem for which we're storing metadata 2008-10-02 21:46 in lustre 2008-10-02 21:46 we've got a problem... 2008-10-02 21:47 since we get updates on updates 2008-10-02 21:47 however 2008-10-02 21:47 if all we update is up to a specific point 2008-10-02 21:47 and the rest is handled via forward logging 2008-10-02 21:47 ok, so you could make it work, but what is the win? 
2008-10-02 21:47 then so long as you can guarantee that generating X KB of updates generates less than X KB of new updates, then it converges 2008-10-02 21:48 I think it should actually turn out to be pretty simple 2008-10-02 21:48 code wise 2008-10-02 21:48 if not conceptually 2008-10-02 21:48 so which piece of tux3 would go into a file next? 2008-10-02 21:48 no idea 2008-10-02 21:48 the inode table is problematic because of variable sized inodes 2008-10-02 21:48 does not map into a page cache nicely 2008-10-02 21:48 I'm still at the phase, where I'm thinking this should be doable 2008-10-02 21:49 and since we rely on logging during mount anyway, it never has to fully converge 2008-10-02 21:49 otherwise, filesystems with fixed sized inodes could put the inode table in page cache and it would be a win 2008-10-02 21:49 the file system is always dirty 2008-10-02 21:50 tux3 only has three kinds of things that are not in files: 1) inode table 2) file indexes 3) update logs 2008-10-02 21:50 ie. it's always: what's on disk reflects last commit point + forward log which contains the changes which were made to the fs to perform the last commit (and any other changes from userspace in the mean time) 2008-10-02 21:50 ok, so why ain't the update log in a file? 2008-10-02 21:50 I can't see winning on any of those three by mapping to a file 2008-10-02 21:50 ah 2008-10-02 21:51 don't know ;) 2008-10-02 21:51 recursion for one thing 2008-10-02 21:51 log the updates to the log file 2008-10-02 21:51 see the forward log should just be a periodically front-truncated normal file 2008-10-02 21:51 and the win is? 
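The convergence claim at 21:47 (generating X KB of updates generates less than X KB of new updates) can be written as a geometric series. This is a sketch under an added assumption that is not in the log: each round of logging produces at most a fixed fraction r < 1 of the previous round's volume. The symbols U_k and r are introduced here for illustration only.

```latex
U_0 = X, \qquad U_{k+1} \le r\,U_k \quad (0 \le r < 1)
\;\Longrightarrow\;
\sum_{k=0}^{\infty} U_k \;\le\; X \sum_{k=0}^{\infty} r^k \;=\; \frac{X}{1-r}.
```

For example, with r = 1/2, 4 KB of updates leads to at most 8 KB of total log traffic. Without a bound r strictly below 1 the sum need not stay finite in general, although when volumes are counted in whole blocks a strictly decreasing sequence still reaches zero after finitely many rounds.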
2008-10-02 21:51 the win is all we have to support is normal files 2008-10-02 21:52 got to be more of a win than that 2008-10-02 21:52 to make up for the extra problems 2008-10-02 21:52 and except from the initial 'recover during mount' phase it's simple 2008-10-02 21:52 you share code for more stuff, you don't have to (at least theoretically) special case allocation for the forward log, etc 2008-10-02 21:53 since we have not done the log at all yet, if you come up with a convincing win argument, we can do it that way 2008-10-02 21:53 although that might be a bad thing 2008-10-02 21:53 sharing code is a minor plus 2008-10-02 21:53 -!- Kirantpatil(~kiran@122.167.195.107) has joined #tux3 2008-10-02 21:53 ACTION is looking for the big win 2008-10-02 21:53 dinner time 2008-10-02 21:53 ACTION thinks this would be the first file system which would deserve the name 2008-10-02 21:53 when is the next burst of activity on junkfs, or is there nothing interesting left to try? 2008-10-02 21:54 lots of interesting stuff 2008-10-02 21:54 it's also end-of-quarter time 2008-10-02 21:54 that was last week 2008-10-02 21:54 working on-and-off on options 2008-10-02 21:54 one would wish it was done last week ;-) 2008-10-02 21:54 we're scoring on monday, so I want to finish two more things I've left before then 2008-10-02 21:58 -!- Kirantpatil(~kiran@122.167.195.107) has left #tux3 2008-10-02 22:21 -!- amey(~amey@116.73.35.180) has joined #tux3 2008-10-02 22:43 ok, time to finish up the extent drop 2008-10-02 22:43 make it the default 2008-10-02 22:43 start exposing bugs 2008-10-02 23:08 flips: how was the sk8 today 2008-10-02 23:09 was a fine skate in the dark 2008-10-02 23:09 started at sunset 2008-10-02 23:09 just down to the sk8 park and back? 
2008-10-02 23:09 up to 3rd st 2008-10-02 23:09 ah 2008-10-02 23:10 musicians on the strand were doing special things 2008-10-02 23:10 "funk you we're playing what we want" 2008-10-02 23:10 i got out at 3pm on the road bike 2008-10-02 23:10 rode up tuna canyon for the first time in months 2008-10-02 23:11 tuna is like entering a different zone all together 2008-10-02 23:11 you get 100 yards off the pch, and you're in the wilderness 2008-10-02 23:11 sounds nice 2008-10-03 00:14 extents just landed 2008-10-03 00:15 probably with a big crash, tinkle sound 2008-10-03 00:15 bug hunters should have fun 2008-10-03 00:36 -!- amey(~amey@116.73.35.180) has left #tux3 2008-10-03 00:52 -!- Kirantpatil(~kiran@122.167.195.107) has joined #tux3 2008-10-03 00:53 -!- Kirantpatil(~kiran@122.167.195.107) has left #tux3 2008-10-03 01:19 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-10-03 01:31 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-10-03 01:33 -!- Kirantpatil(~kiran@122.167.195.107) has joined #tux3 2008-10-03 01:33 -!- Kirantpatil(~kiran@122.167.195.107) has left #tux3 2008-10-03 02:20 -!- flips(~phillips@phunq.net) has joined #tux3 2008-10-03 03:18 -!- kbingham(~kbingham@193.132.141.186) has joined #tux3 2008-10-03 03:51 flips: good work, was just reading the code 2008-10-03 04:20 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-10-03 04:42 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-03 04:44 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-03 05:23 -!- FelipeS(~Felipe@lawn-128-61-118-191.lawn.gatech.edu) has joined #tux3 2008-10-03 06:15 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-03 07:08 morning tim_dimm 2008-10-03 10:05 flips : hello 2008-10-03 10:05 I just bought a new laptop (with no OS) 2008-10-03 10:05 :D 2008-10-03 10:05 Toshiba Sat. 
L300 2008-10-03 10:08 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-03 10:14 morning flips 2008-10-03 10:52 I just bought a PPC G4 :P 2008-10-03 10:52 specs? 2008-10-03 11:44 it will be quite small, 733 MHz, no cache L2 2008-10-03 11:44 it will come with 512MB of RAM but I have 1GB from my old eMac 2008-10-03 11:45 it's a quicksilver: http://support.apple.com/specs/powermac/Power_Mac_G4_Quicksilver.html 2008-10-03 11:45 at 105$ including shipping I simply could not resist :P 2008-10-03 12:18 I've got a 1.5Ghz powerbook g4 I've been thinking about putting into linux duty 2008-10-03 12:18 let me know what you load onto it 2008-10-03 12:18 be curious to see what works 2008-10-03 12:19 I've heard there's limited drivers for things like trackpads on ppc 2008-10-03 12:21 I'll put mac os on it 2008-10-03 12:21 I run linux for about 2 years on my iBook G3 2008-10-03 12:22 interesting experience 2008-10-03 12:44 flips and shapor give me major crap about me running mac os 2008-10-03 12:44 flips calls it "shiny" 2008-10-03 12:45 its the only shiny os that I can run FCP on though 2008-10-03 12:50 what is FCP? Flexible Control Protocol? :P http://cs.jhu.edu/~razvanm/ipsn2008koala.pdf 2008-10-03 13:48 final cut pro 2008-10-03 13:48 video editing app 2008-10-03 13:48 my former life as a creative dude 2008-10-03 14:27 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-03 14:52 your FCP is better ;-) 2008-10-03 15:26 119 list members 2008-10-03 15:47 my new fit-pc slim should arrive any day 2008-10-03 15:47 ACTION was going to go up today but is lagging behind work and home related chores 2008-10-03 15:47 http://www.fit-pc.com/new/ 2008-10-03 15:47 bh, so what did you like about the new code? 
2008-10-03 15:49 it's simplifying te system 2008-10-03 15:49 indeed 2008-10-03 15:50 stuff like that is better done earlier than later 2008-10-03 15:50 remove 150 lines or so 2008-10-03 15:50 because of entropy related things if you buy into that model of looking at software development 2008-10-03 15:50 more still will be removed, so actually the LoC cost of extents gets close to zero 2008-10-03 15:50 complexity went up though 2008-10-03 15:50 yeah, the complexity of the file system is increasing but the code base is getting more and more dense, sign of maturation 2008-10-03 15:51 you're like the last hope I have for Linux file systems so I'm definitely interested in positive progress of this project 2008-10-03 15:55 hg diff -r921a58bdbf8b | diffstat 2008-10-03 15:55 b/user/test/filemap.c | 330 ++++++++++++++++++++++++++++++ 2008-10-03 15:55 ... 2008-10-03 15:55 15 files changed, 906 insertions(+), 423 deletions(-) 2008-10-03 15:55 <- extents cost 517 lines so far 2008-10-03 15:55 how's it working so far ? 2008-10-03 15:55 ok as far as I've tested it which is not far 2008-10-03 15:56 waiting for you to download it and try it ;-) 2008-10-03 15:56 ACTION might be able to leave tonight or something 2008-10-03 15:56 yeah, looking for testers 2008-10-03 15:56 make filemap && ./filemap foodev 2008-10-03 15:56 you broadcast that to the world yet ? 2008-10-03 15:56 then it outputs lots of stuff 2008-10-03 15:56 if there are no exclamation marks, you're ok 2008-10-03 15:56 seen the post on tux3 ml? 2008-10-03 15:56 and the one on lkml? 2008-10-03 15:57 when ? a couple of days ago ? 2008-10-03 15:57 or yesterday ? 2008-10-03 15:57 today ? 
2008-10-03 15:57 rising steadily up the "hot messages" list 2008-10-03 15:57 ok 2008-10-03 15:57 http://kerneltrap.org/ 2008-10-03 15:58 I don't have a subscription to that unfortunately 2008-10-03 15:58 you don't need it 2008-10-03 15:58 just surf in 2008-10-03 15:58 http://groups.google.com/group/linux.kernel/browse_frm/thread/ce1094c9b82a6768/3f40ebbd1d197f60?lnk=gst&q=daniel+phillips#3f40ebbd1d197f60 2008-10-03 15:59 try and get some ibm folks to support it 2008-10-03 15:59 like those ibm folks that got the et4 extents working 2008-10-03 15:59 funny you should mention that 2008-10-03 16:00 how about some of those novell folks? 2008-10-03 16:00 let me see, who is clueful re filesystems at suse 2008-10-03 16:00 andi for sure 2008-10-03 16:00 clueful about everything nearly 2008-10-03 16:01 ok, time to move the devel env to my laptop 2008-10-03 16:01 so I'm not completely useless for the next 5 days 2008-10-03 16:01 that is, the eee 2008-10-03 16:01 can't say how cool it is to be able to develop tux3 under uml and fuse on the eee 2008-10-03 16:02 flips: we're all working on kvm, scheduling, lockdep and stuff like that with one person overextended that has an interest in the oracle file system 2008-10-03 16:02 that's just in our group, most folks are spread pretty thin as is, same for me as well 2008-10-03 16:03 flips: for what are you going to use the fitpc? 
2008-10-03 16:03 ACTION continues reading the post 2008-10-03 16:11 flips: the best way of getting resources is to get your project up to a certain stage of functionality where it has popular following 2008-10-03 16:11 then it's much easier to convince folks to commit resources to something like tux3 development, that's just market reality unfortunately 2008-10-03 16:11 -!- kbingham(~kbingham@92.9.147.219) has joined #tux3 2008-10-03 16:12 If I was unestablished and unemployeed, I'd be on it or -rt 2008-10-03 16:17 rzm|away, web server 2008-10-03 16:18 I was going to use my original fit pc for that but it got commandeered by my daughter 2008-10-03 16:18 bh, it already got there when the fuse port went in 2008-10-03 16:19 don't want to hear reasons for not contributing 2008-10-03 16:20 especially from novell 2008-10-03 16:20 who has about 1 bazillion times more resources than me 2008-10-03 16:21 pull somebody off mono ;) 2008-10-03 16:26 http://kerneltrap.org/ <- number one popular lkml message on kernel trap 2008-10-03 16:53 flips: I disagree 2008-10-03 16:54 with? 
2008-10-03 16:54 fuse definitely helps but it needs at least a kernel port an a significant following outside of folks just on this channel 2008-10-03 16:54 devs who wait until the code is already in the hands of users are not the devs I'm interested in 2008-10-03 16:55 novell doesn't have infinite amount of resources, quite the opposite 2008-10-03 16:55 of course I know that 2008-10-03 16:55 but compared to me 2008-10-03 16:55 then your best bet is doing what you're currently doing, training the staff to be able to do this kind of work yourself 2008-10-03 16:55 protestations about lack of resources get close to whining ;) 2008-10-03 16:56 dude, it's reality 2008-10-03 16:56 getting on my case, and I'm on your side, doesn't help the situation 2008-10-03 16:56 whining is reality, yes, especially from $billion corps 2008-10-03 16:56 like google 2008-10-03 16:56 masters of whining 2008-10-03 16:56 oh yes 2008-10-03 16:56 like google too, google whines better than anybody 2008-10-03 16:57 corporate slackasses ;) 2008-10-03 16:57 good thing there are actual devs with a clue to hand them their business model 2008-10-03 16:57 ACTION can't find receipts for reembursement :\ 2008-10-03 16:58 anyway, until novell contributes a patch, novell is in the whining/slackass category, simple fact of life 2008-10-03 16:58 I don't speak for all of novell 2008-10-03 16:58 ACTION worres he's beginning to sound a bit like gregkh 2008-10-03 16:58 you're speaking to me 2008-10-03 16:58 I'm on your side 2008-10-03 16:58 I was speaking about novell 2008-10-03 16:59 so light a fire under those pansy asses 2008-10-03 16:59 whoops, is this all logged publicly 2008-10-03 16:59 well, if you're pressuring me it's going to get no where 2008-10-03 16:59 flips: :) 2008-10-03 16:59 doesn't really matter 2008-10-03 17:00 andrea ought to have some fun with this 2008-10-03 17:00 I'll ping him 2008-10-03 17:02 getting close to sk8 oclock 2008-10-03 17:03 bh, just define it as rt work and submit a patch 
for rt io sheduling 2008-10-03 17:03 can even be done in fuse 2008-10-03 17:03 yeah, right 2008-10-03 17:03 rt doesn't care about performance, only deadlines 2008-10-03 17:03 good way of getting me fired 2008-10-03 17:03 really? 2008-10-03 17:04 yeah, dong something you're not suppose to be dong 2008-10-03 17:04 doing 2008-10-03 17:04 I thought you were doing rt 2008-10-03 17:04 that and more 2008-10-03 17:04 but only for our -rt product 2008-10-03 17:04 that's lame 2008-10-03 17:04 flips: at this time 2008-10-03 17:04 that's life baby 2008-10-03 17:04 just include it in the product 2008-10-03 17:04 tell the pm 2008-10-03 17:04 and I can't just like quit Novell just to work on tux3, that's kind of silly 2008-10-03 17:04 "rt with tux3 rt schedule will rool the woorld" 2008-10-03 17:05 flips: the best I can do is when the right opportunity comes up, I'll spread the good word, but that's going to have limited influence if I'm considered a flake in the group/company 2008-10-03 17:05 rt filesystem is a key blocker for rt 2008-10-03 17:06 without it you've got no rt 2008-10-03 17:06 not for our purposes 2008-10-03 17:06 give it time 2008-10-03 17:06 however much one would like to pretend otherwise 2008-10-03 17:06 and keep working on stuff, it's looking good 2008-10-03 17:06 and you're getting folks on board, that's all good forward progress 2008-10-03 17:09 now if only we'd some some participation from novell 2008-10-03 17:10 course the satisfying point will be the day an oracle dev submits a patch 2008-10-03 17:10 probable a couple months away from that 2008-10-03 17:10 or maybe we have to wait for the cows to come home first, depends 2008-10-03 17:11 on enlightenment of management there mainly 2008-10-03 17:11 then when sun submits a patch we know we RLY ROOL 2008-10-03 17:12 when we expand the addressing to 128 bits we can rename as ZTS (Zigabyte Tux3 fileSystem) 2008-10-03 17:13 -!- ajonat(~ajonat@190.48.123.108) has joined #tux3 2008-10-03 18:26 -!- 
hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-03 18:26 -!- ajonat(~ajonat@190.48.123.108) has joined #tux3 2008-10-03 19:36 -!- ajonat(~ajonat@190.48.116.103) has joined #tux3 2008-10-03 19:43 -!- ajonat(~ajonat@190.48.116.103) has joined #tux3 2008-10-03 20:56 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-03 20:59 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-03 23:11 -!- Aks(~ankitsriv@123.237.70.127) has joined #tux3 2008-10-03 23:20 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-03 23:30 -!- wweng(~chatzilla@p67-47.acedsl.com) has joined #tux3 2008-10-03 23:57 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-04 00:38 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-04 00:39 ACTION is off to bed after almost 21h of uptime... 2008-10-04 00:40 -!- ChanServ changed mode/#tux3 -> +o flips 2008-10-04 00:41 -!- flips changed topic to "http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 pm Pacific Time ~ Next session Thursday Oct 9: friends of grab_cache_page ~ No Tux3 U on Tuesday 7th ~ flips out" 2008-10-04 00:42 -!- flips changed mode/#tux3 -> -o flips 2008-10-04 00:52 cool, the eee is all set up as a tux3 dev station 2008-10-04 00:52 mercurial rocks, the whole toolchain rocks 2008-10-04 01:41 yes, it does 2008-10-04 01:41 I really like the web interface for it and stuff 2008-10-04 01:42 it's quite nice 2008-10-04 03:18 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-04 03:18 -!- tim_dimm_(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-04 04:04 -!- kbingham(~kbingham@92.9.147.219) has joined #tux3 2008-10-04 07:55 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-04 08:29 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-04 09:11 -!- 
FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-04 09:44 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-04 11:50 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-04 11:56 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-04 12:26 ACTION is going to start on his SD to SF trip soon 2008-10-04 12:30 -!- ajonat(~ajonat@190.48.116.103) has joined #tux3 2008-10-04 12:35 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-04 12:53 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-04 12:54 -!- hubar(~chatzilla@p67-47.acedsl.com) has joined #tux3 2008-10-04 13:04 -!- hubar(~chatzilla@p67-47.acedsl.com) has left #tux3 2008-10-04 13:12 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-04 13:15 -!- paola(~paola@ppp-23-17.20-151.libero.it) has joined #tux3 2008-10-04 13:46 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-04 13:51 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-04 13:54 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-04 13:55 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-04 14:11 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-04 14:42 -!- paola(~paola@ppp-23-17.20-151.libero.it) has left #tux3 2008-10-04 14:52 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-04 15:24 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-04 17:28 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-04 18:18 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-04 18:35 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-04 18:41 -!- 
FelipeS_(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-04 20:48 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-10-04 20:49 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-04 20:50 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-04 22:13 -!- FelipeS_(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-04 23:20 -!- ajonat(~ajonat@190.48.116.103) has joined #tux3 2008-10-04 23:40 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-10-05 00:13 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-05 00:18 -!- RazvanM2(~razvanm2@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-05 01:59 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-05 03:43 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-05 04:25 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-05 04:50 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-05 05:10 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-05 07:44 -!- kbingham(~kbingham@92.1.55.205) has joined #tux3 2008-10-05 09:44 -!- kbingham(~kbingham@92.23.22.99) has joined #tux3 2008-10-05 10:13 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-10-05 10:40 -!- kbingham(~kbingham@92.21.131.250) has joined #tux3 2008-10-05 10:45 -!- ajonat(~ajonat@190.48.122.100) has joined #tux3 2008-10-05 10:48 -!- Bobby(~Bobby@122.162.70.237) has joined #tux3 2008-10-05 10:55 hey guys 2008-10-05 10:55 hello all 2008-10-05 11:03 flips, there? 2008-10-05 11:03 got the extent code working? 2008-10-05 11:03 i think he does 2008-10-05 11:04 hey tim_dimm 2008-10-05 11:04 hey dude 2008-10-05 11:05 how's it going? 
2008-10-05 11:05 hmm 2008-10-05 11:05 not great 2008-10-05 11:05 what's not great? 2008-10-05 11:06 been trying to get working on tux3 since long 2008-10-05 11:06 what's the block? 2008-10-05 11:08 not sure of the concepts 2008-10-05 11:08 and no time :( 2008-10-05 11:08 its not easy, and it does take time to digest 2008-10-05 11:10 i'm just now learning C, so my role has been evangelism so far 2008-10-05 11:10 hmm 2008-10-05 11:10 im going to put more work into this 2008-10-05 11:10 learning the concepts and the code 2008-10-05 11:11 is tux3 U helping? 2008-10-05 11:11 yeah, i just go through the logs 2008-10-05 11:11 the timing doesn't exactly match :( 2008-10-05 11:11 where r u located? 2008-10-05 11:11 india 2008-10-05 11:11 where in india? 2008-10-05 11:11 delhi 2008-10-05 11:11 its all the same time zone 2008-10-05 11:12 havent been tere, but have been to chennai 2008-10-05 11:12 oh 2008-10-05 11:12 ok 2008-10-05 11:12 whr u frm/ 2008-10-05 11:12 LA 2008-10-05 11:12 los angeles 2008-10-05 11:12 why chennai? 2008-10-05 11:12 ya. 
im familiar 2008-10-05 11:13 i did visual effects for a film there 2008-10-05 11:13 ohk 2008-10-05 11:13 frameflow, got bought by sony imageworks 2008-10-05 11:14 ohk 2008-10-05 11:14 hey tim_dimm, g2g, ttyl 2008-10-05 11:14 cool, l8tr 2008-10-05 13:03 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-05 14:17 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-05 14:41 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-05 17:25 -!- kbingham(~kbingham@92.9.150.238) has joined #tux3 2008-10-05 17:57 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-05 19:04 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-05 21:57 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-05 23:02 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-05 23:23 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-10-05 23:24 hey all 2008-10-05 23:39 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-06 00:12 flips: creating sourceforge page for tux3 2008-10-06 00:12 so that people with no mercurial can access it 2008-10-06 00:12 i will take the responsibility to keep the project there in sync with mercurial 2008-10-06 00:12 i hope this is ok with you 2008-10-06 00:33 -!- ajonat(~ajonat@190.48.122.100) has joined #tux3 2008-10-06 03:14 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-10-06 04:05 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-06 04:27 -!- kbingham(~kbingham@193.132.141.186) has joined #tux3 2008-10-06 05:23 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-06 05:52 -!- tux3bot(~tux3bot@yzf.shapor.com) has joined #tux3 2008-10-06 05:53 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has 
joined #tux3 2008-10-06 05:54 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-10-06 07:30 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-06 09:19 -!- rollen(~none@195.129-243-81.adsl-dyn.isp.belgacom.be) has joined #tux3 2008-10-06 10:07 -!- Bobby(~Bobby@122.162.68.177) has joined #tux3 2008-10-06 10:09 -!- Bobby(~Bobby@122.162.68.177) has joined #tux3 2008-10-06 10:17 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-06 10:37 -!- Bobby(~Bobby@122.162.68.177) has joined #tux3 2008-10-06 10:37 hey all 2008-10-06 10:37 hey Bobby 2008-10-06 10:38 hello tim_dimm 2008-10-06 10:38 you r now help? 2008-10-06 10:38 heh 2008-10-06 10:39 -!- Bobby_(~Bobby@122.162.68.177) has joined #tux3 2008-10-06 10:39 forgot to log off from office :D 2008-10-06 10:40 it was not accepting my nick changes 2008-10-06 10:40 it happens 2008-10-06 10:51 tim_dimm, which movie did u work on?? 2008-10-06 10:51 here in chennai? 2008-10-06 10:51 my movie, midgetman 2008-10-06 10:51 was never completed 2008-10-06 10:51 hmm 2008-10-06 10:51 which firm? 2008-10-06 10:52 frameflow was purchased by sony 2008-10-06 10:52 ohk 2008-10-06 10:52 indy 2008-10-06 10:52 what are u doing now? 2008-10-06 10:52 i mean job 2008-10-06 10:53 after midgetman, I was doing color correction for film 2008-10-06 10:53 ok 2008-10-06 10:53 otherwise known as DI (digital intermiediate) 2008-10-06 10:53 and I learned about a ssd startup - violin memory 2008-10-06 10:53 and believed that could accelerate certain post production processes 2008-10-06 10:54 so I made a few introductions, then found myself employed by violin 2008-10-06 10:54 hmm 2008-10-06 10:54 yeah, violin.. that 16gb ram disk... 2008-10-06 10:54 which lead me to MetaRAM after violin didn't get market traction 2008-10-06 10:54 512GB ram disk 2008-10-06 10:54 oh 2008-10-06 10:54 awesome.. 2008-10-06 10:55 metaram? 
2008-10-06 10:55 I was hoping they'd release flash 2008-10-06 10:55 MetaRAM makes 8GB DDR2 rDIMMs and 8GB and 16GB DDR3 rDIMMS 2008-10-06 10:55 and I'm doing bizdev for them 2008-10-06 10:56 hmm 2008-10-06 10:56 started just in media + entertaiment, but now cover all 2008-10-06 10:56 u seem to be having fun :) 2008-10-06 10:56 memory market is tough right now 2008-10-06 10:56 all markets are tough :) 2008-10-06 10:56 now* 2008-10-06 10:56 all the big memory fabs are losing money 2008-10-06 10:56 yeah 2008-10-06 10:57 especially with the recession and all 2008-10-06 10:57 is india in a recession too? 2008-10-06 10:57 thought your economy was growing fast 2008-10-06 10:57 hmm 2008-10-06 10:57 the market reached its 2 year low point today 2008-10-06 10:57 US market affects all other markets 2008-10-06 10:58 of course 2008-10-06 10:58 foreign inflows are drying up.. 2008-10-06 10:58 people are pulling out of markets left and right 2008-10-06 10:58 how is that affecting IT spending? 2008-10-06 10:59 IT spending is also being cut overseas 2008-10-06 10:59 and that will affect the IT companies here which mostly do outsourcing work... 2008-10-06 10:59 speaking of work, what do u do? 2008-10-06 11:00 i work in mentor graphics 2008-10-06 11:00 ever heard of it? 2008-10-06 11:00 name is familiar 2008-10-06 11:00 got a website? 2008-10-06 11:00 and no, its not a graphics company :) 2008-10-06 11:00 yup 2008-10-06 11:00 mentor.com 2008-10-06 11:00 electronic design 2008-10-06 11:01 yup 2008-10-06 11:01 thought I'd heard it 2008-10-06 11:01 what's your role at mentor? 2008-10-06 11:02 im member, technical staff 2008-10-06 11:02 how'd you hear about tux3? 2008-10-06 11:02 hmm 2008-10-06 11:02 im an kernel enthusiast 2008-10-06 11:03 like operating systems in general 2008-10-06 11:03 lkml ? 
2008-10-06 11:03 been reading about os for 1-2 years 2008-10-06 11:03 yup 2008-10-06 11:03 i was getting nowhere without a project 2008-10-06 11:03 so finally decided to take the plunge 2008-10-06 11:04 and how's that going? 2008-10-06 11:04 not great :( 2008-10-06 11:04 my day job is like my entire day job 2008-10-06 11:04 dont get the time exactly 2008-10-06 11:04 to devote to tux3 2008-10-06 11:04 so its a question of time then 2008-10-06 11:05 yup 2008-10-06 11:05 i need to make up more of it 2008-10-06 11:05 well, its cool you're here though 2008-10-06 11:05 hmm, im trying :) 2008-10-06 11:06 I've got the benefit of being able to hear about it straight from flips 2008-10-06 11:06 I met him at linux world over a year ago 2008-10-06 11:06 he tested violin when he was at google 2008-10-06 11:06 oh! cool 2008-10-06 11:07 he isn't at google anymore? 2008-10-06 11:07 nope 2008-10-06 11:07 then/ 2008-10-06 11:07 tux3 100% 2008-10-06 11:07 hmm, ok 2008-10-06 11:07 tux3 96%, skate 4% 2008-10-06 11:07 hehe 2008-10-06 11:07 lol 2008-10-06 11:07 he's gotten quite good 2008-10-06 11:07 just started skating in feb 2008-10-06 11:07 i never understood what sk8 oclock meant 2008-10-06 11:08 I do downhill 2008-10-06 11:08 http://homepage.mac.com/timothyhuber/downhill/iMovieTheater68.html 2008-10-06 11:08 grt 2008-10-06 11:08 plus skate on the boardwalk in venice and santa monica 2008-10-06 11:08 venice?? 2008-10-06 11:08 venice is in los angeles 2008-10-06 11:09 hehe 2008-10-06 11:09 i was thinking about the other venice... :) 2008-10-06 11:09 wow 2008-10-06 11:09 venice was designed in early 1900 by a developer to resemble venice, italy 2008-10-06 11:09 cool videoo 2008-10-06 11:10 is that you in that video? 2008-10-06 11:10 http://maps.google.com/maps?f=q&hl=en&geocode=&q=Venice,+CA+90291&ie=UTF8&z=14&iwloc=addr 2008-10-06 11:10 yeah, I'm in red 2008-10-06 11:10 and the other? 
2008-10-06 11:10 george merkert, three time world champ 2008-10-06 11:10 for downhill inline 2008-10-06 11:10 woho 2008-10-06 11:10 nicee 2008-10-06 11:11 and he is a visual effects producer 2008-10-06 11:11 total recall, starship troopers and ~20 others 2008-10-06 11:11 hmm 2008-10-06 11:11 nice 2008-10-06 11:11 awesome.. 2008-10-06 11:11 scott peer is our other big dh inline guy 2008-10-06 11:11 works at the JPL 2008-10-06 11:12 Cassini navigation software is his design 2008-10-06 11:12 aren't there any vehicles on the road? 2008-10-06 11:13 heh 2008-10-06 11:13 yeah, but not many 2008-10-06 11:13 we pick roads with little traffic 2008-10-06 11:13 it looked like you were going pretty fast 2008-10-06 11:13 its remote, and very curvy 2008-10-06 11:13 hmm 2008-10-06 11:13 in that video, about 35mph 2008-10-06 11:13 we top out in the mid-high 50's 2008-10-06 11:14 there's one video on there where I was wearing a gps 2008-10-06 11:14 hmm 2008-10-06 11:14 and at the end of the run it shows 51.5 2008-10-06 11:14 hmm, cool 2008-10-06 11:14 its the mammoth video 2008-10-06 11:15 seeing it 2008-10-06 11:17 shapor does downhill with us now 2008-10-06 11:17 hmm, you all stay in LA?? 2008-10-06 11:17 flips won't do it with us 2008-10-06 11:17 yeah 2008-10-06 11:17 why? 2008-10-06 11:17 la is pretty cool 2008-10-06 11:17 there's plenty to do here 2008-10-06 11:17 naah, i mean why doesn't flips do downhill with you? :) 2008-10-06 11:17 and there are mountains right here 2008-10-06 11:17 oh, 2008-10-06 11:18 :D 2008-10-06 11:18 doesn't want to go that fast 2008-10-06 11:18 ohk 2008-10-06 11:18 he's heard about our crashes 2008-10-06 11:19 <-never gotten hurt, but crash a few times/year 2008-10-06 11:19 hehe 2008-10-06 11:19 that's what body armor is for 2008-10-06 11:20 watchin mammoth video... which one is u? 
2008-10-06 11:21 i'm carrying the camera 2008-10-06 11:21 :) 2008-10-06 11:21 ohk 2008-10-06 11:22 tim on stunt, crouching skater, tuna canyon, tuna bottom section, brake testing on little t 2008-10-06 11:22 man, we've gotten way off topic 2008-10-06 11:22 :-) 2008-10-06 11:23 hehe 2008-10-06 11:23 yeah 2008-10-06 11:27 i am getting plenty of errors using make tests 2008-10-06 11:27 looking into them... 2008-10-06 11:41 tim_dimm, you looked around the tux3 code? 2008-10-06 11:41 yup 2008-10-06 11:41 which part are u familiar with? 2008-10-06 11:44 feeding my daughter...gimme a few to respond 2008-10-06 11:45 oh! sorry, please go on 2008-10-06 11:45 goin for a smoke, be back soon 2008-10-06 11:50 -!- Bobby_(~Bobby@122.162.68.179) has joined #tux3 2008-10-06 11:52 -!- Bobby_(~Bobby@122.162.68.179) has joined #tux3 2008-10-06 11:57 -!- Bobby_(~Bobby@122.162.68.179) has joined #tux3 2008-10-06 11:58 back 2008-10-06 11:59 -!- pranith_(~bobby@122.162.68.179) has joined #tux3 2008-10-06 12:17 hmm 2008-10-06 13:10 -!- zbrown(~rufius@208.64.37.45) has joined #tux3 2008-10-06 13:48 pranith: yeah looks like the extent stuff broke something in dleaf.c.. not too suprising 2008-10-06 14:55 folks 2008-10-06 16:23 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-10-06 17:26 -!- ajonat(~ajonat@190.48.122.100) has joined #tux3 2008-10-06 21:57 -!- flipz(~daniel@d75-157-56-124.bchsia.telus.net) has joined #tux3 2008-10-06 22:02 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-06 22:13 -!- daniel(~daniel@d75-157-56-124.bchsia.telus.net) has joined #tux3 2008-10-06 23:24 hey all 2008-10-06 23:25 anyone here? 2008-10-06 23:27 hi pranith 2008-10-06 23:27 wha 2008-10-06 23:28 what's up? 2008-10-06 23:28 hey flipz, u back!! 2008-10-06 23:28 i thought u were out today too 2008-10-06 23:29 I'm still on the road 2008-10-06 23:29 oh 2008-10-06 23:29 seems like ur extent stuff broke a lot of things... 
2008-10-06 23:29 trying to find them out 2008-10-06 23:29 heh 2008-10-06 23:29 have fun reading the code 2008-10-06 23:29 messages on the mailing list? 2008-10-06 23:29 :) 2008-10-06 23:29 yeah, i will 2008-10-06 23:30 please 2008-10-06 23:30 I'll have a read when I see something 2008-10-06 23:30 ohkies 2008-10-06 23:30 ill post the errors 2008-10-06 23:30 thanks 2008-10-06 23:30 how do i turn the trace function on? 2008-10-06 23:31 #define trace trace_on 2008-10-06 23:31 instead of #define trace trace_off 2008-10-06 23:31 hmm 2008-10-06 23:31 ohk 2008-10-06 23:31 thnx 2008-10-06 23:32 flipz: its defined as trace on... 2008-10-06 23:32 #ifndef trace #define trace trace_on #endif 2008-10-06 23:33 unless it was already defined 2008-10-06 23:33 hmm, ok. trace has been defined somewhere else... 2008-10-06 23:33 ACTION searching 2008-10-06 23:34 look for #include "something.c" 2008-10-06 23:35 ok 2008-10-06 23:35 trace.h 2008-10-06 23:39 shapor: mind updating the bitbucket mirror?? 2008-10-06 23:40 it's not exactly mirroring stuff now 2008-10-07 00:10 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-07 01:49 hey 2008-10-07 01:51 hello bh 2008-10-07 01:56 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-07 02:05 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-07 02:17 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-07 02:38 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-07 02:42 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-07 02:50 -!- kbingham(~kbingham@193.132.141.186) has joined #tux3 2008-10-07 05:15 hey all 2008-10-07 05:15 real life is boring 2008-10-07 08:00 -!- Kirantpatil(~kiran@122.167.195.60) has joined #tux3 2008-10-07 08:01 -!- Kirantpatil(~kiran@122.167.195.60) has left #tux3 2008-10-07 09:44 greetings earthlings 2008-10-07 09:44
pranisout: what makes you think bitbucket isn't updating? 2008-10-07 09:47 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-07 10:01 -!- pranith(~Bobby@122.162.71.155) has joined #tux3 2008-10-07 10:01 hello 2008-10-07 10:01 anyone here 2008-10-07 10:06 -!- pranith(~Bobby@122.162.71.155) has joined #tux3 2008-10-07 10:07 hello 2008-10-07 10:25 hi 2008-10-07 10:42 shapor, hello 2008-10-07 10:57 you were mentioning the bitbucket mirror yesterday 2008-10-07 10:57 it looks up to date to me 2008-10-07 11:05 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-07 11:06 shapor, hmm 2008-10-07 11:07 yup, my mistake 2008-10-07 11:07 sorry 2008-10-07 11:14 -!- Bobby_(~Bobby@122.162.69.183) has joined #tux3 2008-10-07 11:20 its just that flips has been out of commission so there aren't any updates ;) 2008-10-07 11:29 hmm 2008-10-07 11:35 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-07 11:36 hey tim_dimm 2008-10-07 11:36 hey pranith 2008-10-07 12:59 -!- nataliep(~nataliep@207.47.98.129.static.nextweb.net) has joined #tux3 2008-10-07 15:59 -!- flipz(~daniel@d75-157-56-124.bchsia.telus.net) has joined #tux3 2008-10-07 16:45 -!- pravin(~pravin@145-116-238-192.uilenstede.casema.nl) has joined #tux3 2008-10-07 17:10 -!- pravin(~pravin@145-116-238-192.uilenstede.casema.nl) has joined #tux3 2008-10-07 17:32 flipz: ping 2008-10-07 17:33 hi tim_dimm 2008-10-07 17:33 meet me on the bat channel 2008-10-07 17:37 just a sec 2008-10-07 17:37 k 2008-10-07 18:30 -!- garns(~garns@marvin.cs.uni-dortmund.de) has joined #tux3 2008-10-07 18:42 -!- ajonat(~ajonat@190.48.122.100) has joined #tux3 2008-10-07 18:57 -!- ajonat(~ajonat@190.48.122.100) has joined #tux3 2008-10-07 19:06 -!- flipz(~daniel@d75-157-56-124.bchsia.telus.net) has joined #tux3 2008-10-07 19:37 -!- daniel_(~daniel@d75-157-56-124.bchsia.telus.net) has joined #tux3 2008-10-07 19:43 -!- RalucaM(~ral@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 
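A minimal sketch of the trace switch discussed on 2008-10-06, where the guarded default `#define trace trace_on` had no effect because trace "was already defined". The bodies of trace_on and trace_off below are hypothetical stand-ins (the log does not show the real tux3 macros), and trace_lines, demo_default and demo_enabled are illustration-only names. The point is only that an `#ifndef` default is skipped once an including file has defined the macro first:

```c
#include <stdio.h>

static int trace_lines; /* counts trace messages actually emitted */

/* Hypothetical stand-ins for the two trace flavors; not the tux3 definitions. */
#define trace_on(...)  do { trace_lines++; printf(__VA_ARGS__); printf("\n"); } while (0)
#define trace_off(...) do { } while (0)

/* Assumption: some file that #includes this one has already chosen the quiet flavor. */
#define trace trace_off

/* The guarded default quoted in the log: skipped, because trace is already defined. */
#ifndef trace
#define trace trace_on
#endif

static int demo_default(void)
{
	trace("not printed: trace is still trace_off here");
	return trace_lines;
}

/* Turning tracing on therefore means changing the earlier definition. */
#undef trace
#define trace trace_on

static int demo_enabled(void)
{
	trace("printed: trace now expands to trace_on");
	return trace_lines;
}
```

Under these assumptions, the place to look is whichever file defines trace before the guarded default is reached, which matches the advice in the log to look for an `#include` of a .c file.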
2008-10-07 19:45 -!- RalucaM(~ral@pool-151-196-118-156.balt.east.verizon.net) has left #tux3 2008-10-07 20:12 -!- Kirantpatil(~kiran@122.167.194.225) has joined #tux3 2008-10-07 20:17 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-07 20:49 -!- Kirantpatil(~kiran@122.167.194.225) has left #tux3 2008-10-07 21:36 -!- ajonat(~ajonat@190.48.122.100) has joined #tux3 2008-10-07 22:13 hey all 2008-10-07 22:23 hi prantih 2008-10-07 22:23 think you typoed your nick 2008-10-07 22:25 flipz, hi 2008-10-07 22:26 yeah, :( 2008-10-07 22:26 hehe 2008-10-07 22:26 u still on road? 2008-10-07 22:41 still 2008-10-07 22:41 pranith, were you going to report a bug to the mailing list? 2008-10-07 22:51 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-07 23:04 flipz, yup.. doing it now... 2008-10-07 23:04 flipz, was thinking of debugging it myself 2008-10-07 23:05 flipz, let me share it with you 2008-10-07 23:10 good to post whether you debug it yourself or not 2008-10-07 23:57 flipz, anything? 2008-10-08 00:08 pranith, I see the same valgrind issues 2008-10-08 00:10 yeah i noticed those too 2008-10-08 00:11 what is the actual bug? 2008-10-08 00:12 good question 2008-10-08 00:12 I see the very first valgrind complaint does not really matter, it is assigning an invalid groups pointer, but that pointer will be ignored because the code relies on knowing groups = 0 there, and should not use the groups pointer in that case 2008-10-08 00:12 the group pointer I meant 2008-10-08 00:13 I mean, what is the symptom? Just that make tests stops there? 
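The first valgrind complaint described above, a group pointer that is set even when the leaf has no groups, can be illustrated with a small sketch. Everything here is hypothetical: the struct layouts and the names walk_start and walk_first_count are simplified stand-ins, not the tux3 dleaf or dwalk code. It only shows the general idea of leaving the pointer NULL when there is nothing to point at, so that later code cannot read through it by accident:

```c
#include <stddef.h>

/* Hypothetical, simplified layouts; the real tux3 structures differ. */
struct group { unsigned count; };
struct leaf { unsigned groups; struct group table[4]; };
struct walk { struct leaf *leaf; struct group *group; };

/* Point the walker at the first group only when the leaf actually has one. */
static void walk_start(struct walk *walk, struct leaf *leaf)
{
	walk->leaf = leaf;
	walk->group = leaf->groups ? &leaf->table[0] : NULL;
}

/* Callers check the pointer itself rather than relying on a separate count. */
static unsigned walk_first_count(const struct walk *walk)
{
	return walk->group ? walk->group->count : 0;
}
```

With a NULL pointer, a stray dereference fails immediately instead of silently reading whatever memory the stale address happens to cover.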
2008-10-08 00:14 the next interesting valgrind complaint at line 419 is just a printf 2008-10-08 00:17 the issue at 469 is, I didn't set dleaf->free or ->used at the end of the leaf packing operations 2008-10-08 00:18 dleaf_dump is complaining about an unitialized count field, which sounds bad but the leaf seems perfectly ok when printed out 2008-10-08 00:19 and there is a leak of 1K in the test 2008-10-08 00:19 shapor, that should be enough for you to kill the all off 2008-10-08 00:20 the first one above can be fixed by not doing the two offending assignments if leaf->groups is zero 2008-10-08 00:21 or maybe better, assigning the two pointers in the struct dwalk to NULL 2008-10-08 00:31 I won't be able to fix these for another couple days, anybody else is welcome 2008-10-08 00:43 flipz: no its not copying the pointer at 641 2008-10-08 00:43 the first error is a problem 2008-10-08 00:48 flipz, is the leak genuine? 2008-10-08 00:48 don't know 2008-10-08 00:48 try putting in exit(1) instead of return 2008-10-08 00:53 shapor, walk->group at 641 is valid but *walk->group is not 2008-10-08 00:54 so right, the pointer is fine, it's the object that's invalid 2008-10-08 00:55 ACTION goes for some zzzs 2008-10-08 01:14 flipzzz: http://arstechnica.com/journals/linux.ars/2008/10/07/wizbit-a-linux-filesystem-with-distributed-version-control 2008-10-08 01:14 very interesting 2008-10-08 01:27 wow shiny 2008-10-08 01:28 uses gnome vfs I think I saw 2008-10-08 01:35 yeah 2008-10-08 01:44 ok really sleeping this time 2008-10-08 02:42 -!- less(~less@145-116-238-192.uilenstede.casema.nl) has joined #tux3 2008-10-08 05:05 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-08 05:10 hey tim_dimm 2008-10-08 05:10 hey pranith 2008-10-08 05:10 i'm up feeding my twins 2008-10-08 05:10 oh, i thought u had a girl 2008-10-08 05:10 lucky u 2008-10-08 05:10 boy and a girl 2008-10-08 05:11 cool 2008-10-08 05:12 how old? 
2008-10-08 05:12 4 weeks 2008-10-08 05:12 quite a handful 2008-10-08 05:13 oh, cool 2008-10-08 05:42 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-08 07:06 -!- kbingham(~kbingham@cvs.mpc-ogw.co.uk) has joined #tux3 2008-10-08 08:39 -!- Kirantpatil(~kiran@122.167.197.37) has joined #tux3 2008-10-08 08:39 -!- Kirantpatil(~kiran@122.167.197.37) has left #tux3 2008-10-08 08:42 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-08 10:27 hey all 2008-10-08 11:27 -!- daniel_(~daniel@d75-157-56-124.bchsia.telus.net) has joined #tux3 2008-10-08 13:38 folks 2008-10-08 13:59 the mails about encoding of extent information: To me, it sounds like a variant of a buddy-system 2008-10-08 14:40 what's that ? 2008-10-08 15:08 data, right, and he mentions knuth too 2008-10-08 15:09 but it is an original application 2008-10-08 15:10 question is: is the per pointer bit saving more than the extra pointers required? 2008-10-08 15:10 It is quite possible that it is 2008-10-08 15:10 need to spreadsheet that 2008-10-08 15:26 worth giving it a shot anyway 2008-10-08 15:26 g'night 2008-10-08 15:47 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-08 17:59 -!- ajonat(~ajonat@190.48.94.249) has joined #tux3 2008-10-08 18:36 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-08 18:51 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-08 20:03 -!- Kirantpatil(~kiran@122.167.211.254) has joined #tux3 2008-10-08 21:20 -!- natalie(~natalie@72.14.228.1) has joined #tux3 2008-10-08 21:34 -!- natalie(~natalie@72.14.228.1) has left #tux3 2008-10-08 21:35 -!- Kirantpatil(~kiran@122.167.192.198) has joined #tux3 2008-10-08 21:35 -!- Kirantpatil(~kiran@122.167.192.198) has left #tux3 2008-10-08 22:25 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-08 23:21 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) 
has joined #tux3 2008-10-08 23:32 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-08 23:53 -!- pranith(~Bobby@122.162.67.161) has joined #tux3 2008-10-08 23:53 hey all 2008-10-09 00:00 -!- flips(~phillips@phunq.net) has joined #tux3 2008-10-09 00:40 reading dleaf.c 2008-10-09 00:40 will try to run some tests today.. 2008-10-09 00:40 hoping to find atleast one bug :D 2008-10-09 00:41 to make myself useful here 2008-10-09 00:48 pranith, I spotted a couple of bugs 2008-10-09 00:48 will put in fixes pretty soon 2008-10-09 00:49 stupid things 2008-10-09 00:49 changed the interface to dleaf_dump 2008-10-09 00:49 will change it back I think 2008-10-09 00:49 hmm 2008-10-09 00:49 ohk 2008-10-09 00:49 just running the tests is already useful 2008-10-09 00:49 hmm 2008-10-09 00:49 stuff like tux3 mkfs 2008-10-09 00:50 ? 2008-10-09 00:50 make tux3? 2008-10-09 00:50 and echo foo | tux3 write testdev testfile 2008-10-09 00:50 you do make tux3 && ./tux3 mkfs testdev 2008-10-09 00:50 for example 2008-10-09 00:50 we need some docs around now 2008-10-09 00:50 -!- pranith(~Bobby@122.162.67.161) has joined #tux3 2008-10-09 00:51 sry, dc 2008-10-09 00:52 flips, u back? 
2008-10-09 02:44 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-10-09 03:11 -!- kbingham(~kbingham@cvs.mpc-ogw.co.uk) has joined #tux3 2008-10-09 03:22 back now 2008-10-09 03:22 next move is to get some sleep 2008-10-09 04:24 -!- pranith(~Bobby@122.162.67.161) has joined #tux3 2008-10-09 04:24 hey all 2008-10-09 04:55 -!- Bobby_(~Bobby@122.162.67.161) has joined #tux3 2008-10-09 04:55 -!- Bobby_(~Bobby@122.162.67.161) has left #tux3 2008-10-09 05:03 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-10-09 07:34 -!- pranith(~Bobby@122.162.67.161) has joined #tux3 2008-10-09 07:49 -!- pgquiles(~pgquiles@16.Red-83-41-239.dynamicIP.rima-tde.net) has joined #tux3 2008-10-09 08:32 -!- pgquiles(~pgquiles@16.Red-83-41-239.dynamicIP.rima-tde.net) has joined #tux3 2008-10-09 08:34 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-09 09:04 -!- pranith(~Bobby@122.162.67.161) has joined #tux3 2008-10-09 09:15 hello 2008-10-09 09:23 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-09 09:23 ACTION is back :P 2008-10-09 09:28 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-09 09:40 hey pranith 2008-10-09 10:04 morning tim_dimm 2008-10-09 10:04 morning flips 2008-10-09 10:04 got your sleep? 2008-10-09 10:04 some of it 2008-10-09 10:04 you? 2008-10-09 10:04 feeling rejuvenated? 
2008-10-09 10:04 hah 2008-10-09 10:04 you are hilarious dude 2008-10-09 10:05 ;-) 2008-10-09 10:27 -!- less(~less@145-116-238-192.uilenstede.casema.nl) has joined #tux3 2008-10-09 11:15 there we go, tux3 command should function again 2008-10-09 11:16 dleaf_dump interface was changed and commented out of the ops, caused seg fault 2008-10-09 11:16 now put back as it was 2008-10-09 11:16 got to fix the valgrind issues, none of which seem serious 2008-10-09 11:17 then there is a real bug in the new extents stuff that make tux3 read seg fault 2008-10-09 11:17 tux3 write seems to work ok 2008-10-09 11:17 then time to write a mail on atomic commit 2008-10-09 11:18 and there is tux3 u tonight 2008-10-09 11:18 -!- ChanServ changed mode/#tux3 -> +o flips 2008-10-09 11:19 -!- flips changed topic to "http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 pm Pacific Time ~ Next session: friends of grab_cache_page " 2008-10-09 11:19 -!- flips changed mode/#tux3 -> -o flips 2008-10-09 12:43 shapor, found the valgrind issue, it was real 2008-10-09 12:53 -!- ajonat(~ajonat@190.48.94.249) has joined #tux3 2008-10-09 12:58 make tests compile without valgrind errors now 2008-10-09 12:59 bunch of little things, real bugs 2008-10-09 12:59 -!- alaine(~alaine@kevbroadley.demon.co.uk) has joined #tux3
2008-10-09 13:01 I wonder if it is actually good to have make mkfs do its thing in /tmp 2008-10-09 13:02 makes for more typing running tests 2008-10-09 13:02 but does keep the local source free of big loopback volumes 2008-10-09 13:03 good call on the autokill 2008-10-09 13:43 i like bitbucket's diffs 2008-10-09 13:43 http://www.bitbucket.org/shapor/tux3/changeset/1b6cf87c7234/ 2008-10-09 13:43 make tests runs successfully now 2008-10-09 13:44 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-09 14:03 tux3 read has a segfault in the new extents code 2008-10-09 14:03 true, bitbucket diffs are nice 2008-10-09 14:05 for some reason the bitbucket mirror is lagged 77 minutes 2008-10-09 14:05 oh 2008-10-09 14:05 sorry 2008-10-09 14:05 because I used your url ;) 2008-10-09 14:24 this has xattr bugs: make inode && ./inode foodev 2008-10-09 14:24 there should be no xattrs but the inode table listing thinks there are 2008-10-09 14:24 something stupid 2008-10-09 15:03 shapor: trac does a similar output and I think they are using some kind of package for it 2008-10-09 15:03 butit does look nice 2008-10-09 17:14 Tonight on tux3 u, provided I can stay awake that long, we will be looking at the relationship between buffers and pages in the page cache 2008-10-09 17:21 2.6.27 is out with lockless page cache 2008-10-09 17:21 changes at the heart of tux3-u :) 2008-10-09 17:27 hey flips 2008-10-09 17:28 shapor: it's been in -rt forever 2008-10-09 17:28 and is a major pain in the ass regarding scalability 2008-10-09 17:33 bh: oh? 2008-10-09 17:34 howso 2008-10-09 19:13 -!- FelipeS_(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-09 19:55 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-09 20:00 who's here tonight? 2008-10-09 20:00 ACTION is 2008-10-09 20:00 Ral will also be.
2008-10-09 20:00 saw shapor not too long ago 2008-10-09 20:01 let's start in two places this time 2008-10-09 20:01 http://lxr.linux.no/linux+v2.6.26.6/include/linux/mm_types.h#L36 <- struct page 2008-10-09 20:02 http://lxr.linux.no/linux+v2.6.26.6/include/linux/buffer_head.h#L60 <- struct buffer_head 2008-10-09 20:03 struct page is the thing we use as a handle for a physical page 2008-10-09 20:03 it has an object count so we know when to release the page 2008-10-09 20:03 _mapcount is something new I haven't really looked at 2008-10-09 20:04 -!- RalucaM(~ral@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-09 20:04 it has a private field for whatever the owner, whoever alloced the page usually, wants to put there 2008-10-09 20:04 quick q: what is 'ptes' and 'mms'? 2008-10-09 20:04 in practice, that is usually a list of buffers attached to the page 2008-10-09 20:05 pte is a page table entry 2008-10-09 20:05 mm is a memory management context 2008-10-09 20:05 that is, an address space 2008-10-09 20:05 it's a struct mm_struct 2008-10-09 20:05 each process has one, threads share one 2008-10-09 20:06 the struct page also used to have a lock 2008-10-09 20:06 seems to have gone missing now 2008-10-09 20:06 so we don't have one lock per page 2008-10-09 20:06 probably replaced by a hashed lock 2008-10-09 20:07 need to chase that down 2008-10-09 20:07 it has a very important field: index 2008-10-09 20:07 this is the position of the page within a page cache radix tree, if it is in one 2008-10-09 20:08 and that is the tie to vfs 2008-10-09 20:08 the page also has a pointer to the mapping it is in 2008-10-09 20:08 so mapping + index => retrieve the page 2008-10-09 20:09 and we can remove the page from a mapping because of those fields recorded in it 2008-10-09 20:10 there is also the lru link, which gives the vmm an idea of which page should be recovered when cache memory gets full 2008-10-09 20:10 over to buffer_head 2008-10-09 20:11 also has flags and a
count, though the buffer flags field is named b_state for no particular reason 2008-10-09 20:11 has a pointer to the page the buffer head is attached to, on which the data belonging to the buffer is stored 2008-10-09 20:12 we figure out where on the page the buffer data is stored by looking at the low bits of the index, I think... 2008-10-09 20:12 we will come back to that and check it 2008-10-09 20:13 the buffer also points at a block device b_bdev, but this field is redundant now 2008-10-09 20:13 because we have buffer->page->mapping->... bdev 2008-10-09 20:14 there is an end_io function like the endio for a bio 2008-10-09 20:14 serves the same purpose, and is now also largely redundant 2008-10-09 20:15 b_assoc_buffers is a crude scheme for flushing file metadata along with data for primitive filesystems like ext2 that let the vfs do all their work for them 2008-10-09 20:16 we also don't see a lock in the buffer_head itself 2008-10-09 20:16 though for both pages and buffers, locking is a huge element of how they are used 2008-10-09 20:18 in the case of buffers, we spin on one of the state bits 2008-10-09 20:18 we will go find that code later also 2008-10-09 20:20 it's __lock_buffer 2008-10-09 20:20 defined somewhere lxr can't find 2008-10-09 20:21 (I'd thought we were meeting at 9pm today...) 2008-10-09 20:22 it should be in buffer.c 2008-10-09 20:22 because we did last time? 2008-10-09 20:22 should be 2008-10-09 20:22 http://lxr.linux.no/linux+v2.6.26.5/fs/buffer.c#L70 2008-10-09 20:23 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L70 2008-10-09 20:23 thanks 2008-10-09 20:23 ok, see it's a lock that spins on a bit 2008-10-09 20:23 another quick q: how are the page structures managed? Is there a huge array somewhere? Or some lists?
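A minimal userspace sketch of a lock kept in one bit of a state word, which is the idea behind the buffer lock mentioned above. It assumes C11 atomics. The names demo_buffer, DEMO_LOCK_BIT, DEMO_UPTODATE_BIT and the demo_* functions are hypothetical, not the kernel's, and the busy-wait loop is only a stand-in for whatever waiting __lock_buffer really does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit positions; the kernel's own BH_* numbering is not assumed. */
enum { DEMO_UPTODATE_BIT = 0, DEMO_LOCK_BIT = 2 };

struct demo_buffer {
    atomic_ulong state;   /* flags and the lock bit share one word */
};

/* Try to take the lock: atomically set the bit and report whether it was clear. */
bool demo_trylock(struct demo_buffer *buf)
{
    unsigned long mask = 1UL << DEMO_LOCK_BIT;
    return (atomic_fetch_or(&buf->state, mask) & mask) == 0;
}

/* Busy-wait until the bit can be taken. */
void demo_lock(struct demo_buffer *buf)
{
    while (!demo_trylock(buf))
        ;
}

/* Clear the lock bit and leave every other flag untouched. */
void demo_unlock(struct demo_buffer *buf)
{
    unsigned long mask = 1UL << DEMO_LOCK_BIT;
    unsigned long old = atomic_fetch_and(&buf->state, ~mask);
    assert(old & mask);   /* unlocking an unlocked buffer is a bug */
    (void)old;
}
```

Keeping the lock in the same word as the other flags is what lets the structure get by without a separate lock field, at the cost of every locker hammering on that one word.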
2008-10-09 20:23 not very efficient 2008-10-09 20:23 a huge array 2008-10-09 20:23 very simple/crude 2008-10-09 20:24 and sometimes a big problem because of the size of that array 2008-10-09 20:24 let's have a look at lock_page for comparison 2008-10-09 20:25 searching... 2008-10-09 20:25 yeah coming up empty 2008-10-09 20:25 been worked on lately 2008-10-09 20:25 http://lxr.linux.no/linux+v2.6.26.5/include/linux/pagemap.h#L167 2008-10-09 20:25 :) 2008-10-09 20:25 2.6.27 has no index 2008-10-09 20:26 right 2008-10-09 20:26 I should have mentioned 2008-10-09 20:26 2.6.26.6 2008-10-09 20:26 oh... I'm still on .5 :P 2008-10-09 20:26 http://lxr.linux.no/linux+v2.6.26.6/include/linux/pagemap.h#L167 :D 2008-10-09 20:26 http://lxr.linux.no/linux+v2.6.26.6/mm/filemap.c#L599 <- we see that the page lock is another bit spin lock 2008-10-09 20:27 does might_sleep do something or is that a statement for code testing? 2008-10-09 20:27 the closer you look at buffer_heads and struct pages, the more similar to each other they look 2008-10-09 20:27 might_sleep will generate a printk warning if it is called under a spinlock 2008-10-09 20:28 if you have that debug option turned on 2008-10-09 20:28 ok, now we have some slight familiarity with those two, let's go look at a place where they are used together 2008-10-09 20:28 like block_read_full_page 2008-10-09 20:30 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L2093 <- block_read_full_page 2008-10-09 20:31 the page may or may not have a list of buffers attached to it 2008-10-09 20:31 if the blocksize is the same as page size, the list will have one buffer 2008-10-09 20:31 otherwise some power-of-two number of buffers 2008-10-09 20:32 the first thing _full_page does is put a list of buffer (heads) on the page if it has none 2008-10-09 20:33 then it loops over the buffer list (again usually one buffer) to find any buffers not uptodate 2008-10-09 20:33 if it just put the buffers on the page, it should already know of course 2008-10-09 20:34
for any buffer that is not up to date, it makes a call into the filesystem, get_block 2008-10-09 20:34 which is a callback passed to it by the filesystem 2008-10-09 20:34 because block_read_full_page is always called from filesystem code 2008-10-09 20:35 it is just a library helper to make it easy to do IO on a page 2008-10-09 20:35 easy, but pretty sloppy and executing way too much code 2008-10-09 20:36 which can be masked by a slow disk, but not entirely 2008-10-09 20:37 see further down, there is some coupling of the buffer flags and the page flags in that if all buffers are up to date, the page is set up to date as well 2008-10-09 20:37 does this mean we still have a page cache, even if the file system block size is not page sized? and this is what converts from sub-page sized buffers to the page used by the page cache? 2008-10-09 20:37 this is where we handle the impedance mismatch between page size and block size, if that answers your question 2008-10-09 20:38 the page cache is indexed by pages, but filesystems like ext3 treat it as if it was indexed by buffers 2008-10-09 20:38 (perhaps) 2008-10-09 20:38 see ext3_bread 2008-10-09 20:39 this mismatch is a huge source of complexity in vfs and mm, and a nasty source of bugs 2008-10-09 20:39 by the time we've done the tux3 kernel port, everybody will know exactly what I'm talking about 2008-10-09 20:40 ok, buffer locking is a little counterintuitive 2008-10-09 20:40 we will keep a buffer locked while reading, but not while writing in general 2008-10-09 20:40 same with pages 2008-10-09 20:41 while reading from disk into the buffer, or from the buffer? 2008-10-09 20:41 disk into buffer 2008-10-09 20:41 always what is meant by "read" in here 2008-10-09 20:42 finally, the buffers that need reading are submitted via submit_bh 2008-10-09 20:42 doesn't it make sense to lock reads then? since that's when the memory content actually changes?
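A rough userspace model of the read path walked through above: attach buffers to the page if it has none, ask the filesystem's get_block callback to map each buffer that is not up to date, "read" it, then mark the page up to date once every buffer is. This is a sketch under stated assumptions, not the kernel's block_read_full_page: all demo_* names and the 4096-byte page size are made up for illustration, and locking, holes and real IO submission are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_PAGE_SIZE   4096u
#define DEMO_MAX_BUFFERS 8u    /* allows block sizes down to 512 bytes */

struct demo_buf {
    bool mapped;               /* physical block already known? */
    bool uptodate;             /* data already read in? */
    unsigned long blocknr;     /* cached physical block number */
};

struct demo_page {
    unsigned long index;       /* logical position in the file, in pages */
    unsigned nbufs;            /* 0 means no buffers attached yet */
    bool uptodate;
    struct demo_buf bufs[DEMO_MAX_BUFFERS];
};

/* Filesystem callback: map a logical block to a physical one, 0 on success. */
typedef int (*demo_get_block_t)(unsigned long logical, unsigned long *physical);

/* Hypothetical mapping for the example: physical = logical + 100. */
int demo_get_block(unsigned long logical, unsigned long *physical)
{
    *physical = logical + 100;
    return 0;
}

/* Returns how many buffers needed a read, or -1 if a mapping call failed. */
int demo_read_full_page(struct demo_page *page, unsigned blocksize,
                        demo_get_block_t get_block)
{
    unsigned per_page = DEMO_PAGE_SIZE / blocksize;
    int submitted = 0;

    assert(per_page >= 1 && per_page <= DEMO_MAX_BUFFERS);

    /* Step 1: put a list of buffers on the page if it has none. */
    if (page->nbufs == 0) {
        page->nbufs = per_page;
        for (unsigned i = 0; i < per_page; i++)
            page->bufs[i] = (struct demo_buf){ false, false, 0 };
    }

    /* Step 2: map and read every buffer that is not up to date. */
    for (unsigned i = 0; i < page->nbufs; i++) {
        struct demo_buf *buf = &page->bufs[i];

        if (buf->uptodate)
            continue;
        if (!buf->mapped) {
            if (get_block(page->index * per_page + i, &buf->blocknr))
                return -1;
            buf->mapped = true;
        }
        buf->uptodate = true;  /* stand-in for submitting the read and its completion */
        submitted++;
    }

    /* Step 3: all buffers are up to date, so the page is too. */
    page->uptodate = true;
    return submitted;
}
```

With 1024-byte blocks, page index 3 covers logical blocks 12 through 15, so a first call maps and reads four buffers and a second call finds nothing left to do.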
2008-10-09 20:42 which is just a simple wrapper on submit_bio 2008-10-09 20:42 which is an old friend of ours 2008-10-09 20:42 yes, we lock reads, not writes 2008-10-09 20:43 this is a horribly inefficient code path we're looking at 2008-10-09 20:44 and doesn't actually get used much any more, though there are cases that still trigger it 2008-10-09 20:44 I don't know what they are exactly, but again we will have a good idea after doing the port 2008-10-09 20:44 let's look at block_write_full_page while we are in here 2008-10-09 20:45 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L2093 2008-10-09 20:45 sorry 2008-10-09 20:45 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L1645 2008-10-09 20:45 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L1645 2008-10-09 20:45 right 2008-10-09 20:46 starts the same way 2008-10-09 20:46 as often happens in disk io 2008-10-09 20:46 but unfortunately, the kernel takes little advantage of such symmetry 2008-10-09 20:47 we take care of zeroing a partial page extending beyond end of file here 2008-10-09 20:47 "unmap_underlying_metadata" is a scary function to see here 2008-10-09 20:47 we'll leave that for another day 2008-10-09 20:48 see, we keep a state bit in the buffer to tell us whether we need to call the fs get_block method or not 2008-10-09 20:48 what is a non-blockdev mapping?
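A small sketch of the "zero the partial page beyond end of file" step mentioned above, assuming a 4096-byte page and a plain byte array standing in for the page data. demo_zero_beyond_eof and DEMO_PAGE_SIZE are hypothetical names; the point is only the index and offset arithmetic, not the kernel's actual helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u

/*
 * Zero the part of a page that lies beyond end of file.
 * Only the page that straddles EOF has anything to clear.
 * Returns the number of bytes zeroed.
 */
size_t demo_zero_beyond_eof(unsigned char *data, unsigned long index,
                            unsigned long long file_size)
{
    unsigned long long last_index = file_size / DEMO_PAGE_SIZE;
    size_t offset = (size_t)(file_size % DEMO_PAGE_SIZE);

    assert(data != NULL);
    if (index != last_index || offset == 0)
        return 0;   /* page is fully inside the file, or fully outside it */

    memset(data + offset, 0, DEMO_PAGE_SIZE - offset);
    return DEMO_PAGE_SIZE - offset;
}
```

For a file of 4096 + 100 bytes, page 0 is left alone and page 1 keeps its first 100 bytes while the remaining 3996 are cleared, so stale memory past EOF never reaches the disk.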
2008-10-09 20:48 page cache I guess 2008-10-09 20:48 funny terminology 2008-10-09 20:50 a slight fib, it seems we keep the buffer locked all the way through the write here 2008-10-09 20:50 the page however gets unlocked 2008-10-09 20:50 it is probably unnecessary to keep the buffer locked 2008-10-09 20:51 "redirty_page_for_writepage" is another scary visitor to see here 2008-10-09 20:51 hacking around various subtle loopholes in the vm design 2008-10-09 20:51 now let's see, where is the get_block call 2008-10-09 20:52 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L1696 2008-10-09 20:53 by the way, each buffer has a size field, which is now redundant 2008-10-09 20:53 still hanging around 2008-10-09 20:54 we now find out the size from the block device pointed to by the buffer->page->mapping->inode->sb 2008-10-09 20:54 something like that 2008-10-09 20:55 (large number of indirects...) 2008-10-09 20:55 ah, what we are doing in unmap_underlying_metadata is taking care of the lack of coherence between the inode page cache and the block device buffer cache 2008-10-09 20:56 there might have been a page sitting around in the buffer cache mapped to the same physical block 2008-10-09 20:56 see the bh->b_blocknr, that is what we call buffer->index in the tux3 userspace code 2008-10-09 20:57 but the usage is different here 2008-10-09 20:57 in kernel, this caches the _physical_ block the buffer is mapped to, even if the buffer is on a page in an inode page cache 2008-10-09 20:58 in the tux3 userspace code, the physical mapping is never cached 2008-10-09 20:58 and we use that field like kernel uses the page->index field, to know what the logical offset of the data is 2008-10-09 20:59 it turns out that caching the physical block pointer is pretty useless, almost always 2008-10-09 21:00 since nearly all writes will just write the buffer to the physical location once then address it out of cache after that 2008-10-09 21:00 it might save a get_block trip into the filesystem only
for a rewrite 2008-10-09 21:01 the 9 oclock horn just sounded 2008-10-09 21:01 this was a pretty dry one today, no? 2008-10-09 21:01 but important 2008-10-09 21:01 seemed pretty hard core 2008-10-09 21:01 this little corner of the kernel will be visited frequently by anybody doing filesystem work 2008-10-09 21:01 yup 2008-10-09 21:02 ok, on tuesday we're going to get much more hard core 2008-10-09 21:02 is a lot of the complication around here cruft? or is it actually needed for performance and/or edge cases? 2008-10-09 21:02 because we want to use this buffer+page mechanism in ways it was not necessarily designed for 2008-10-09 21:02 major cruft, yes 2008-10-09 21:03 I hope to make it obsolete 2008-10-09 21:03 in due course 2008-10-09 21:03 but we're going to have to work with it for now 2008-10-09 21:03 changing core kernel to merge tux3 isn't really wise 2008-10-09 21:04 questions? 2008-10-09 21:04 otherwise my little girl wants to try out the new video game 2008-10-09 21:04 ACTION doesn't have any :( 2008-10-09 21:04 aaa... what video game is it? :P 2008-10-09 21:04 razvanm, try to read through some more of the block_ functions in buffer.c 2008-10-09 21:04 ACTION doesn't really know how to ask intelligent questions... 2008-10-09 21:04 bioshock 2008-10-09 21:05 my first free time will be after 22 :( 2008-10-09 21:05 I heard about bioshock... 
2008-10-09 21:05 oh right 2008-10-09 21:05 where "some" means a little bit 2008-10-09 21:05 it's important to have the layout of the code seeping into you in the background 2008-10-09 21:06 bok 2008-10-09 21:06 a few minutes of looking, then you can go away and let it seep 2008-10-09 21:06 :-) 2008-10-09 21:07 when I get into the nitty gritty of atomic commit, this buffer cache interface gets really important 2008-10-09 21:07 this is where most of the action happens 2008-10-09 21:07 sorry 2008-10-09 21:07 page cache - with buffers attached 2008-10-09 21:07 and we will make it act like a buffer cache, as tux3 uses in user space 2008-10-09 21:07 ok, I'm out 2008-10-09 21:07 have a nice evening 2008-10-09 21:14 -!- RalucaM(~ral@pool-151-196-118-156.balt.east.verizon.net) has left #tux3 2008-10-09 21:42 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-09 22:02 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-09 22:47 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-09 23:01 hello 2008-10-09 23:01 anyone? 2008-10-09 23:48 hey 2008-10-09 23:48 hi pranith 2008-10-09 23:52 shapor: locking that data structure limited it about 2.5 processor scalability 2008-10-09 23:52 it's in old OLS papers if you want to read about it 2008-10-10 00:34 bh: which ds are u talking about? 2008-10-10 02:22 the page cache dictionary itself 2008-10-10 02:22 it's a well known problem 2008-10-10 02:59 -!- kbingham(~kbingham@cvs.mpc-ogw.co.uk) has joined #tux3 2008-10-10 03:12 flipsout: leaf->groups is being used without being initialized... 2008-10-10 03:12 are u assuming that it is 0 in dleaf_chop? 2008-10-10 03:15 so is group->count 2008-10-10 03:25 -!- pgquiles(~pgquiles@16.Red-83-41-239.dynamicIP.rima-tde.net) has joined #tux3 2008-10-10 04:20 HI, I am new guy here. I wanted to learn about Tux3 and contribute to it.... any guidance will be greatly appreciated... 
2008-10-10 04:54 hello less 2008-10-10 04:54 you still here? 2008-10-10 04:56 pranith : yes 2008-10-10 04:56 you know C? 2008-10-10 04:56 yep, i have done some kernel programming also.. 2008-10-10 04:56 good 2008-10-10 04:56 bt not very extensive.. 2008-10-10 04:57 did you go throught the design document? 2008-10-10 04:57 of tux3? 2008-10-10 04:57 http://shapor.com/tux3/shapor-tux3/doc/design.html 2008-10-10 04:57 yep, i have taken overview of it 2008-10-10 04:57 -!- FelipeS_(~Felipe@lawn-128-61-26-178.lawn.gatech.edu) has joined #tux3 2008-10-10 04:57 ok 2008-10-10 04:58 you can go through the code.. run the tests 2008-10-10 04:58 write your own tests 2008-10-10 04:58 implement the ideas 2008-10-10 04:58 ok, i will start with running the test's then... 2008-10-10 04:58 ok 2008-10-10 04:58 that should give me some more insite.. 2008-10-10 04:58 thanx.. :-) 2008-10-10 04:58 yup 2008-10-10 04:59 and please 2008-10-10 04:59 try to document what you learn 2008-10-10 04:59 we can compare notes later ;) 2008-10-10 04:59 ohh, sure... 2008-10-10 04:59 has anyone documented it before..?? 2008-10-10 04:59 so that i can get some help from it..? 2008-10-10 05:00 nope, the incode documentation is all that we have 2008-10-10 05:00 :) 2008-10-10 05:00 i've done some basic stuff 2008-10-10 05:00 but not online 2008-10-10 05:01 its in a nice notebook 2008-10-10 05:01 ok... 2008-10-10 05:01 will digitize it 2008-10-10 05:01 soon 2008-10-10 05:01 i will create my logs online only.. 2008-10-10 05:01 that's good 2008-10-10 05:01 may b it will help futute beginers.. 2008-10-10 05:01 :-) 2008-10-10 05:02 yeah 2008-10-10 05:02 you good at diagrams? 2008-10-10 05:02 may be you can make those 2008-10-10 05:02 i think they are very helpful and are missing now 2008-10-10 05:02 yep... bt not asci diagrams... 2008-10-10 05:03 any diagrams will do 2008-10-10 05:03 they take quite a lot time.. 
2008-10-10 05:03 gif, png :) 2008-10-10 05:03 yeah 2008-10-10 05:03 thats the main problem 2008-10-10 05:03 yep.. i can do dat.. 2008-10-10 06:29 -!- less(~less@145-116-238-192.uilenstede.casema.nl) has joined #tux3 2008-10-10 06:56 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-10 07:59 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-10 08:33 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-10 08:33 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-10 08:37 -!- ajonat(~ajonat@190.48.94.249) has joined #tux3 2008-10-10 09:19 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-10 09:21 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-10 09:47 -!- Kirantpatil(~kiran@122.167.208.110) has joined #tux3 2008-10-10 09:47 -!- Kirantpatil(~kiran@122.167.208.110) has left #tux3 2008-10-10 09:56 -!- ajonat_(~ajonat@190.48.107.122) has joined #tux3 2008-10-10 10:02 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-10 10:08 anyone check out UBIFS yet? 2008-10-10 10:08 http://www.linux-mtd.infradead.org/doc/ubifs.html 2008-10-10 10:14 Violin report on 2.6.27 : pci_dma_mapping_error(dma_handle) in 2.6.27 to pci_dma_mapping_error(pdev, dma_handle) 2008-10-10 10:23 morning flips 2008-10-10 10:23 good morning 2008-10-10 10:23 violin report? 
2008-10-10 10:35 chatted with brad this morning 2008-10-10 10:35 was talking about 2.6.27 2008-10-10 10:35 my syntax was bad 2008-10-10 10:36 should have said "violin reports" 2008-10-10 12:29 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-10-10 14:55 folks 2008-10-10 16:02 -!- pgquiles(~pgquiles@16.Red-83-41-239.dynamicIP.rima-tde.net) has joined #tux3 2008-10-10 16:57 sk8 oclock 2008-10-10 16:57 comes earlier in winter 2008-10-10 17:58 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-10 20:34 -!- ajonat(~ajonat@190.48.107.122) has joined #tux3 2008-10-10 22:17 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-10 23:18 -!- Bobby_(~Bobby@122.162.70.20) has joined #tux3 2008-10-10 23:31 hey all 2008-10-11 00:40 hi 2008-10-11 00:41 title of the next post is set 2008-10-11 00:41 "Thinking about Syncing" 2008-10-11 00:41 even better, there's an algorithm in mind 2008-10-11 01:50 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-11 02:47 -!- Bobby_(~Bobby@122.162.70.20) has joined #tux3 2008-10-11 05:35 -!- Bobby_(~Bobby@122.162.70.20) has joined #tux3 2008-10-11 06:20 -!- Bobby_(~Bobby@122.162.70.20) has joined #tux3 2008-10-11 06:40 -!- Bobby_(~Bobby@122.162.70.20) has joined #tux3 2008-10-11 07:49 -!- less(~less@145-116-238-192.uilenstede.casema.nl) has joined #tux3 2008-10-11 08:13 -!- Bobby__(~Bobby@122.162.70.206) has joined #tux3 2008-10-11 09:18 -!- pgquiles(~pgquiles@247.Red-83-41-112.dynamicIP.rima-tde.net) has joined #tux3 2008-10-11 09:36 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-11 12:12 -!- Bobby__(~Bobby@122.162.70.206) has joined #tux3 2008-10-11 12:22 -!- pranihome(~Bobby@122.162.70.206) has joined #tux3 2008-10-11 12:34 http://lodge.glasgownet.com/2008/10/11/its-hammer-time/ <- hammer and tux3 2008-10-11 12:44 flips, are hammer and tux3 comparable? 2008-10-11 12:48 never went through what hammer was really... 
2008-10-11 13:07 they use a similar method to do versioning 2008-10-11 13:08 significant differences too 2008-10-11 13:08 hammer has linear versioning, one long chain of snapshots at high granularity while tux3 does tree versioning with snapshots of snapshots 2008-10-11 13:35 -!- ajonat(~ajonat@190.48.107.122) has joined #tux3 2008-10-11 13:39 -!- bobby(~bobby@122.162.70.206) has joined #tux3 2008-10-11 13:39 amazing facts department: a freshly untarred 2.6.26.5 tree has 276008641 bytes of file data / 25715 files, average file size 10733 bytes 2008-10-11 13:40 since untarring kernel trees is one of the main linux fs benchmarks, we care about this 2008-10-11 13:40 average file size has grown over the years from about 8k to nearly 11k 2008-10-11 13:40 not growing very fast, really 2008-10-11 13:41 in the same period the total file data size has about tripled 2008-10-11 15:41 yet another amazing fact: average length of a filename in 2.6.26.5 kernel tree is 36 chars 2008-10-11 15:41 translates into 85 names per ext3 dirent block 2008-10-11 15:50 -!- ajonat_(~ajonat@190.48.116.113) has joined #tux3 2008-10-11 17:01 hey 2008-10-11 17:06 9080 / 1000. 
2008-10-11 17:06 whoops 2008-10-11 17:06 wrong window ;) 2008-10-11 17:09 my calculations suggest we will be able to achieve 97.6% of raw disk write bandwidth for the case of untarring a kernel tree to an empty filesystem, complete with atomic commit, but provided the direct data pointer inode attribute is implemented, to get rid of the btree root + leaf per file 2008-10-11 17:10 with the current layout, we can get about 56% of raw 2008-10-11 17:14 Ext3 at present achieves 14% of raw bandwidth as measured on my system here 2008-10-11 17:14 I'm sure I'm missing some overheads that will pull that 97% number lower 2008-10-11 17:14 but we have an awful lot of room for error here 2008-10-11 17:15 put it another way: ext3 write performance sucks hard 2008-10-11 17:15 low bar to jump over 2008-10-11 19:11 what accounts for ext3's write performance ?issues ? 2008-10-11 19:24 journal for one thing 2008-10-11 19:26 what else ? 2008-10-11 19:27 is xfs faster ? 2008-10-11 19:29 don't know, why don't you run some tests? 2008-10-11 19:33 jjfjfjfjfjfjf 2008-10-11 19:33 shit 2008-10-11 19:33 sorry 2008-10-11 19:39 ok, a fairer test shows ext3 coming in at 25 MB/sec 2008-10-11 19:39 untar a tarfile from ramfs 2008-10-11 19:40 still a very far way from there to speed of the disk 2008-10-11 19:45 most disk top out at that rate 2008-10-11 19:45 unless you have a raid array or something like that 2008-10-11 19:46 -!- data`(~data@echo489.server4you.de) has joined #tux3 2008-10-11 19:47 not really 2008-10-11 19:47 dd will get about 64 MB/sec on this disk 2008-10-11 20:00 ok, just confirmed... 
dd from ramfs to my disk runs from 56 to 64 MB/sec 2008-10-11 20:01 same disk that's running my workstation and server right now ;) 2008-10-11 20:01 course, have to be quite careful not to destroy it while running the dd write test 2008-10-11 20:02 let's round that throughput to 60 MB/s 2008-10-11 20:03 Ext3 therefore hits about 42% of raw bandwidth 2008-10-11 20:04 we are aiming higher with tux3 2008-10-11 20:05 initially, just over 50% of raw would be nice, before optimizing to get rid of the two btree blocks per file for small files 2008-10-11 20:05 then I don't see why we can't hit 80-90% of raw after optimizing 2008-10-11 22:18 -!- pranith_(~bobby@122.162.70.206) has joined #tux3 2008-10-11 22:18 hey all 2008-10-11 22:29 hi 2008-10-11 23:11 flips, hey 2008-10-11 23:11 hi 2008-10-12 00:19 cricket rocks :) 2008-10-12 01:28 -!- pgquiles(~pgquiles@247.Red-83-41-112.dynamicIP.rima-tde.net) has joined #tux3 2008-10-12 04:53 -!- Bobby_(~Bobby@122.162.70.206) has joined #tux3 2008-10-12 04:53 hey all 2008-10-12 05:04 -!- pranith_(~Bobby@122.162.70.206) has joined #tux3 2008-10-12 05:09 -!- pranith_(~Bobby@122.162.70.206) has joined #tux3 2008-10-12 05:22 -!- pranith_(~Bobby@122.162.70.206) has joined #tux3 2008-10-12 05:35 -!- pranith_(~Bobby@122.162.70.206) has joined #tux3 2008-10-12 06:39 hehehe 2008-10-12 08:02 sry, wrong channel 2008-10-12 08:02 :) 2008-10-12 08:11 -!- pgquiles(~pgquiles@249.Red-79-155-127.staticIP.rima-tde.net) has joined #tux3 2008-10-12 08:12 flips, you have pretty cool coding skills.. 
am learning a lot by reading the code :) 2008-10-12 09:13 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-12 10:27 -!- pranith_(~Bobby@122.162.70.206) has joined #tux3 2008-10-12 12:14 -!- ajonat(~ajonat@190.48.106.99) has joined #tux3 2008-10-12 18:48 -!- less(~less@145.116.238.192) has joined #tux3 2008-10-12 20:07 -!- Kirantpatil(~kiran@122.167.201.131) has joined #tux3 2008-10-12 20:07 -!- Kirantpatil(~kiran@122.167.201.131) has left #tux3 2008-10-12 23:30 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-10-13 00:20 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-10-13 00:28 -!- kbingham(~kbingham@cvs.mpc-ogw.co.uk) has joined #tux3 2008-10-13 05:46 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-13 08:07 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-13 09:15 -!- Bobby_(~Bobby@122.162.69.155) has joined #tux3 2008-10-13 09:31 -!- Bobby_(~Bobby@122.162.69.155) has joined #tux3 2008-10-13 09:38 -!- Bobby_(~Bobby@122.162.69.155) has joined #tux3 2008-10-13 09:45 -!- Bobby_(~Bobby@122.162.69.155) has joined #tux3 2008-10-13 09:50 -!- Bobby_(~Bobby@122.162.69.155) has joined #tux3 2008-10-13 09:58 -!- Bobby_(~Bobby@122.162.69.155) has joined #tux3 2008-10-13 10:01 -!- Bobby_(~Bobby@122.162.69.155) has joined #tux3 2008-10-13 10:02 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-13 10:03 -!- tim_dimm_(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-13 10:06 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-13 10:10 -!- Bobby_(~Bobby@122.162.69.155) has joined #tux3 2008-10-13 10:15 -!- Bobby_(~Bobby@122.162.69.155) has joined #tux3 2008-10-13 10:19 -!- Bobby_(~Bobby@122.162.69.155) has joined #tux3 2008-10-13 10:25 -!- Bobby_(~Bobby@122.162.69.155) has joined #tux3 2008-10-13 10:30 -!- Bobby_(~Bobby@122.162.69.155) has joined #tux3 2008-10-13 10:31 -!- Bobby_(~Bobby@122.162.69.155) has joined 
#tux3 2008-10-13 10:36 -!- Bobby_(~Bobby@122.162.69.155) has joined #tux3 2008-10-13 10:38 -!- Bobby_(~Bobby@122.162.69.155) has joined #tux3 2008-10-13 10:51 -!- Bobby_(~Bobby@122.162.69.155) has joined #tux3 2008-10-13 10:54 -!- pranihome(~Bobby@122.162.69.155) has joined #tux3 2008-10-13 10:56 -!- pranihome(~Bobby@122.162.69.155) has joined #tux3 2008-10-13 11:20 helloo 2008-10-13 11:20 its been pretty quite here lately 2008-10-13 11:45 revising trees... complete, balanced, red-black... 2008-10-13 12:13 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-13 12:40 flips: your posting is at the top of popular messages on kernel trap (again :) 2008-10-13 12:42 :) 2008-10-13 12:42 hi nataliep 2008-10-13 12:42 hi dan :) 2008-10-13 12:43 the one where I predict we will get 99% of media speed presumably 2008-10-13 12:43 now... to make that happen 2008-10-13 13:25 folks 2008-10-13 13:27 flips: how's it going ? 2008-10-13 13:28 moving along 2008-10-13 14:04 -!- flips(~phillips@phunq.net) has joined #tux3 2008-10-13 14:04 -!- data`(~data@echo489.server4you.de) has joined #tux3 2008-10-13 14:04 -!- less(~less@145.116.238.192) has joined #tux3 2008-10-13 14:04 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-13 14:04 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-13 14:04 -!- pgquiles(~pgquiles@249.Red-79-155-127.staticIP.rima-tde.net) has joined #tux3 2008-10-13 14:04 -!- ceatinge(~ceatinge@72.232.13.50) has joined #tux3 2008-10-13 14:04 -!- nataliep(~nataliep@207.47.98.129.static.nextweb.net) has joined #tux3 2008-10-13 14:04 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-10-13 14:04 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-10-13 14:04 -!- zbrown(~rufius@208.64.37.45) has joined #tux3 2008-10-13 14:04 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-10-13 14:04 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2008-10-13 14:29 -!- 
tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-13 14:53 nearly sk8 oclock 2008-10-13 16:23 -!- mdakin(~chatzilla@79.97.85.155) has joined #tux3 2008-10-13 19:42 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-13 20:50 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-13 22:31 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-10-14 00:24 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-10-14 01:50 -!- less(~less@145-116-238-192.uilenstede.casema.nl) has joined #tux3 2008-10-14 02:41 -!- pgquiles_(~pgquiles@249.Red-79-155-127.staticIP.rima-tde.net) has joined #tux3 2008-10-14 02:42 -!- ceatinge_(~ceatinge@veryclever.net) has joined #tux3 2008-10-14 02:54 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-14 02:54 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-14 06:31 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-14 06:31 -!- nataliep(~nataliep@207.47.98.129.static.nextweb.net) has joined #tux3 2008-10-14 09:36 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-14 12:12 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-14 15:44 -!- mokkpr01(~chatzilla@133-132.127-70.tampabay.res.rr.com) has joined #tux3 2008-10-14 18:30 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-14 18:57 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-10-14 19:23 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-14 19:45 ACTION is wondering in buffer.c... 2008-10-14 19:47 wandering? 2008-10-14 19:48 just saw an article in the news about McCain and YouTube and the DMCA... cute 2008-10-14 19:49 right, wandering :P 2008-10-14 19:50 damn... 
I press the wrong button and the whole chat window was cleared :| 2008-10-14 19:52 -!- RalucaM(~ral@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-14 19:52 Hi 2008-10-14 19:56 hi ralucam 2008-10-14 19:57 T-3 minutes 2008-10-14 19:59 T-1 2008-10-14 20:00 ACTION is ready 2008-10-14 20:01 ok 2008-10-14 20:02 last time we went delving into the relationship between pages and buffers 2008-10-14 20:02 let's do some more of that 2008-10-14 20:02 let's look at sb_bread 2008-10-14 20:03 http://lxr.linux.no/linux+v2.6.26.6/include/linux/buffer_head.h#L278 ? 2008-10-14 20:03 2.6.27 indexed yet? 2008-10-14 20:03 http://lxr.linux.no/linux+v2.6.27/include/linux/buffer_head.h#L278 :D 2008-10-14 20:03 right 2008-10-14 20:03 you don't think sticking to 2.6.26 is worth it? 2008-10-14 20:03 I don't know if the search works though... 2008-10-14 20:04 2.6.26 is fine 2008-10-14 20:04 we don't need to get off on lockless page cache right now 2008-10-14 20:04 ok, bread is the classic bsd way of accessing buffer cache 2008-10-14 20:05 just one parameter, the buffer, and the block to read is in the buffer struct 2008-10-14 20:05 which I will just call buffer instead of buffer_head from now on 2008-10-14 20:05 the _head is entirely fluff, doesn't mean anything 2008-10-14 20:06 struct buffer traditionally also has a size 2008-10-14 20:06 that was a stupid idea 2008-10-14 20:06 and we have largely dropped that now 2008-10-14 20:07 instead, the size is taken from a field in the superblock, which is why we now have sb_bread, taking an sb, a physical block on the device referenced by the sb, and returning a buffer 2008-10-14 20:07 sb_bread(struct super_block *sb, sector_t block) 2008-10-14 20:07 seems to take a block number as a parameter... 
2008-10-14 20:07 yes 2008-10-14 20:08 oh, misinterpreted your comment 2008-10-14 20:08 (to mean the sb had a field with the block number) 2008-10-14 20:08 and my comment re the original bread was wrong, does not take a buffer 2008-10-14 20:08 http://www.ipnom.com/FreeBSD-Man-Pages/bread.3.html 2008-10-14 20:08 bread(struct uufsd *disk, ufs2_daddr_t blockno, void *data, size_t size) 2008-10-14 20:09 let's go find the old linux one just for interest 2008-10-14 20:09 2.4? 2008-10-14 20:09 yes 2008-10-14 20:10 you sure sb_bread doesn't just read the superblock from a specific block number? 2008-10-14 20:10 http://lxr.linux.no/linux-old+v2.4.31/fs/buffer.c#L1189 2008-10-14 20:10 yes, I'm sure 2008-10-14 20:10 oh, right 2008-10-14 20:11 missed the previous line ;-) someone posted a link to 278 instead of 277 ;-) 2008-10-14 20:11 struct buffer_head * bread(kdev_t dev, int block, int size) <- the legacy linux version 2008-10-14 20:11 I feel stupid... 2008-10-14 20:11 the freebsd version fell even more off the tracks 2008-10-14 20:11 ah, you just tied me for mistakes tonight ;) 2008-10-14 20:12 the trick of the pro is to make those mistakes faster than the amateur 2008-10-14 20:12 Maze: sorry :P 2008-10-14 20:12 lol 2008-10-14 20:13 ok, back to sb_bread 2008-10-14 20:13 simply calls __bread 2008-10-14 20:13 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L1437 2008-10-14 20:13 which no longer needs to know anything about the sb 2008-10-14 20:14 the only reason we needed the sb was to know the blocksize and the underlying device 2008-10-14 20:14 this should have simply been called "bread" 2008-10-14 20:14 that is, the sb_bread should have been bread 2008-10-14 20:14 I'm guessing there are some weird interactions if you call these functions with non-constant size values 2008-10-14 20:15 don't do it 2008-10-14 20:15 never has worked properly 2008-10-14 20:15 never will 2008-10-14 20:15 putting the blocksize in the struct buffer was just a big mistake 2008-10-14 20:15 so size is 
basically a device property then? 2008-10-14 20:15 not really 2008-10-14 20:15 has been at times 2008-10-14 20:16 has caused lots of bugs 2008-10-14 20:16 see set_block_size 2008-10-14 20:16 or some name like that 2008-10-14 20:16 again, doesn't work properly 2008-10-14 20:16 the buffer size is properly just a property of the superblock, and actually one you can ignore 2008-10-14 20:16 as long as you don't overlap buffers of different sizes 2008-10-14 20:17 there is no cache coherence in that case 2008-10-14 20:17 meta-question: if we want to use bio's for everything... why do we care about bufferheads? 2008-10-14 20:17 we're going to arrive at a bio pretty soon in this little side trip 2008-10-14 20:17 let's try __bread_slow 2008-10-14 20:17 this code path has gotten deeply messed lately 2008-10-14 20:18 with various optimizations + historical cruft 2008-10-14 20:18 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L1239 2008-10-14 20:18 we see submit_bh there 2008-10-14 20:18 let's go in 2008-10-14 20:18 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L2862 2008-10-14 20:19 nothing terribly surprising 2008-10-14 20:19 and there we see some code much like you wrote for junkfs 2008-10-14 20:19 this is actually kind of a stupid arrangement 2008-10-14 20:19 the bio could have been allocated on the stack of the caller 2008-10-14 20:19 because we do a sync wait in __bread_slow 2008-10-14 20:20 ok, that is it for sb_bread 2008-10-14 20:20 anything not completely clear there? 
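[Editor's sketch] The call chain walked through above (sb_bread calls __bread, which takes the buffer from the cache via __getblk and only goes to the device when the buffer is not up to date) can be modelled in user space roughly as follows. This is a toy under stated assumptions, not kernel code: every toy_* name, the fixed-size array cache and the in-memory "device" are invented for illustration, and there is no locking, eviction or real I/O.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_BLOCKSIZE 1024u  /* one block size per "superblock" */
#define TOY_NBLOCKS   16u    /* size of the pretend device, in blocks */
#define TOY_NCACHE    8u     /* slots in the pretend buffer cache */

struct toy_dev {             /* hypothetical stand-in for a block device */
	unsigned char data[TOY_NBLOCKS][TOY_BLOCKSIZE];
	unsigned reads;          /* counts simulated device reads */
};

struct toy_sb {              /* hypothetical stand-in for struct super_block */
	struct toy_dev *dev;
	unsigned blocksize;
};

struct toy_buffer {          /* hypothetical stand-in for struct buffer_head */
	struct toy_dev *dev;
	unsigned long block;
	int uptodate;
	unsigned char data[TOY_BLOCKSIZE];
};

static struct toy_buffer *toy_cache[TOY_NCACHE];

/* Role of __getblk: find the buffer in the cache, or create and insert it. */
static struct toy_buffer *toy_getblk(struct toy_dev *dev, unsigned long block)
{
	size_t i;

	for (i = 0; i < TOY_NCACHE; i++)
		if (toy_cache[i] && toy_cache[i]->dev == dev &&
		    toy_cache[i]->block == block)
			return toy_cache[i];
	for (i = 0; i < TOY_NCACHE; i++) {
		if (!toy_cache[i]) {
			struct toy_buffer *bh = calloc(1, sizeof(*bh));

			if (!bh)
				return NULL;
			bh->dev = dev;
			bh->block = block;
			return toy_cache[i] = bh;
		}
	}
	return NULL;             /* toy cache is full; no eviction in this sketch */
}

/* Role of __bread: get the buffer, read from the device only if it is stale. */
static struct toy_buffer *toy_bread(struct toy_dev *dev, unsigned long block,
				    unsigned size)
{
	struct toy_buffer *bh;

	assert(size == TOY_BLOCKSIZE);   /* mixed sizes are what the log warns against */
	if (block >= TOY_NBLOCKS)
		return NULL;             /* the only error report is NULL */
	bh = toy_getblk(dev, block);
	if (bh && !bh->uptodate) {       /* slow path, the role of __bread_slow */
		memcpy(bh->data, dev->data[block], size);
		dev->reads++;
		bh->uptodate = 1;
	}
	return bh;
}

/* Role of sb_bread: the sb only supplies the device and the block size. */
static struct toy_buffer *toy_sb_bread(struct toy_sb *sb, unsigned long block)
{
	return toy_bread(sb->dev, block, sb->blocksize);
}
```

Calling toy_sb_bread(&sb, 3) twice returns the same buffer and touches the toy device only once; the second call is served entirely from the cache. An out-of-range block yields NULL with no indication of why.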
2008-10-14 20:20 error handling ;-) 2008-10-14 20:21 hah 2008-10-14 20:21 very poor on this path 2008-10-14 20:21 these functions historically had no error report except "return NULL" 2008-10-14 20:21 much of Linux is still that way, very slowly changing 2008-10-14 20:22 if you check out my latest revs to the tux3 bio interface, there is a mechanism for sending back accurate errors there 2008-10-14 20:22 but usually we tend to drop the ball somewhere in the call chain, and not return the actual error, the higher level just guesses 2008-10-14 20:22 usually the guess is EIO or ENOMEM, randomly 2008-10-14 20:22 yeah, submit_bh returns a value, but it doesn't get checked, etc 2008-10-14 20:22 "don't be part of the problem" 2008-10-14 20:23 when you write your own kernel code 2008-10-14 20:23 you will even see stuff like that in my user space simulation 2008-10-14 20:23 C is just not very good at returning error codes 2008-10-14 20:23 anyway... I will fix it over time 2008-10-14 20:23 -!- cydork(~vihang@59.184.62.147) has joined #tux3 2008-10-14 20:23 it's the penalty you pay for having full control of exceptions... 2008-10-14 20:24 for not having? 2008-10-14 20:24 oh 2008-10-14 20:24 kind of 2008-10-14 20:25 it's more about having no good way to return multiple results from a function, one of which is an error code 2008-10-14 20:25 there's IS_ERR 2008-10-14 20:25 ever used it? 2008-10-14 20:25 painful 2008-10-14 20:25 beautiful hack if there ever was one... 2008-10-14 20:25 semantics are not completely obvious either 2008-10-14 20:25 but yeah, it's a little painful 2008-10-14 20:26 often not clear whether it wants err or -err 2008-10-14 20:26 I think part of the problem is it doesn't consider null an error 2008-10-14 20:26 IS_ERR? it wants a pointer 2008-10-14 20:26 and returns bool 2008-10-14 20:26 ERR_PTR 2008-10-14 20:26 I think of it as all one thing 2008-10-14 20:26 clumsy 2008-10-14 20:26 but what can you do? 
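[Editor's sketch] For readers who have not met the IS_ERR / ERR_PTR / PTR_ERR trio mentioned above, here is a user-space imitation modelled on the kernel's include/linux/err.h. It is meant to show the two points raised in the discussion: ERR_PTR takes the negative errno value (-err, not err), and IS_ERR does not treat NULL as an error. The assert inside ERR_PTR and the toy_read_block function are additions for illustration only and are not part of the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Encode a NEGATIVE errno value (for example -EIO) as a pointer. */
static inline void *ERR_PTR(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);  /* sketch-only sanity check */
	return (void *)(intptr_t)error;
}

/* Decode: gives back the same negative errno value. */
static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True only for the top MAX_ERRNO addresses; NULL is not an error here. */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Hypothetical reader: returns a zeroed 512-byte block, or an encoded
 * error. It never returns NULL, so the caller checks IS_ERR only.
 */
static void *toy_read_block(long block)
{
	void *data;

	if (block < 0)
		return ERR_PTR(-EIO);
	data = calloc(1, 512);
	if (!data)
		return ERR_PTR(-ENOMEM);
	return data;
}
```

A caller would write something like `p = toy_read_block(n); if (IS_ERR(p)) return PTR_ERR(p);`, which passes the real error code up the call chain instead of leaving the higher level to guess between EIO and ENOMEM.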
2008-10-14 20:27 everybody seen vecio and syncio from tux3/super.c ? 2008-10-14 20:28 ACTION did not :( 2008-10-14 20:28 linky? 2008-10-14 20:28 http://phunq.net/ddtree?p=tux3fs;a=blob;f=fs/tux3/super.c;h=1023f06407bc8752e0afc4c2c71940023a18b9f9;hb=HEAD 2008-10-14 20:29 I have to set this repo up better 2008-10-14 20:29 junkfs_fill_super - dead code? 2008-10-14 20:29 it does the only actual work 2008-10-14 20:30 oh lol 2008-10-14 20:30 called from ext3_fill_super 2008-10-14 20:30 yeah see it now 2008-10-14 20:30 tux3_... 2008-10-14 20:30 will disappear next rev, yes it was humor 2008-10-14 20:30 anyway, there you see a far more elegant way of getting a block into memory than sb_bread 2008-10-14 20:31 we only need to know about sb_bread to know how other filesystems do it 2008-10-14 20:31 well 2008-10-14 20:31 sb_bread is still important to us 2008-10-14 20:31 let's go take a look at another part of it 2008-10-14 20:32 where it enters the buffer into the buffer cache 2008-10-14 20:32 almost forgot about that, the most important thing 2008-10-14 20:32 __getblk actually creates the buffer and does this job 2008-10-14 20:33 just like in the tux3 buffer cache emulation 2008-10-14 20:33 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L1403 2008-10-14 20:33 thanks 2008-10-14 20:33 you got there seconds ahead of me ;) 2008-10-14 20:34 what I'm still unsure of... 2008-10-14 20:34 also: http://tux3.org/tux3?f=a4a6f8e640c5;file=user/test/buffer.c <- search for "bread" 2008-10-14 20:34 is why we need a buffer cache 2008-10-14 20:34 don't we have a page cache already? 
2008-10-14 20:34 short answer: filesystem metadata 2008-10-14 20:35 there is a page cache dedicated to the block device itself, in addition to a page cache for each inode 2008-10-14 20:35 but when block size does not match page size, it is pretty much impossible to do locking properly with page sized units 2008-10-14 20:36 so what we do instead is use the buffers attached to the pages as our locking units 2008-10-14 20:36 we looked at that last thursday 2008-10-14 20:36 I thought we'd already put the metadata in files... ;-) 2008-10-14 20:36 but at the time did not really know what it was for 2008-10-14 20:36 heh 2008-10-14 20:36 well what about the metadata for the metadata files? 2008-10-14 20:37 ultimately we have to go cache some absolute blocks 2008-10-14 20:37 I thought that was in RAM ;-) 2008-10-14 20:37 uhm the superblock? 2008-10-14 20:37 but let's not divert the topic too much 2008-10-14 20:38 and the blocks that index the files 2008-10-14 20:38 they can't themselves be in files, unless you want a really evil result like ntfs 2008-10-14 20:38 and even then, you can't put _all_ of the file index blocks in files 2008-10-14 20:39 ok, __getblk 2008-10-14 20:39 there's a "friend of grab_cache_page" 2008-10-14 20:39 __find_get_block 2008-10-14 20:40 just a wrapper for __find_get_block_slow 2008-10-14 20:40 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L1118 2008-10-14 20:40 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L1370 2008-10-14 20:40 see the touch_buffer() there? 
that implements the lru 2008-10-14 20:41 brings the underlying page to the hot end of the lru 2008-10-14 20:41 we should be here now http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L262 2008-10-14 20:42 now we found a real friend of grab_cache_page, as opposed to a mere hanger on 2008-10-14 20:42 find_get_page 2008-10-14 20:42 http://lxr.linux.no/linux+v2.6.26.6/mm/filemap.c#L630 2008-10-14 20:43 we don't need to go there just now, suffice to say that if the page isn't in the page cache it doesn't try to add it 2008-10-14 20:44 back at _slow... 2008-10-14 20:45 if we find a page in the page cache and it has buffers, then we loop across the buffer list mod the ratio of the buffer size to the page size 2008-10-14 20:46 ah, and we do some evil cruft with the buffer_mapped concept 2008-10-14 20:46 buffer_mapped meaning that the b_blocknr field in the buffer is filled in with a physical block number 2008-10-14 20:46 and a bit is set in the buffer flags to indicate this is so 2008-10-14 20:47 actually, that field is entirely redundant in the case of the buffer cache 2008-10-14 20:47 because we can always know the physical device offset from the page->index of the underlying page that stores the buffer data 2008-10-14 20:48 this code returns with a pointer to buffer in "ret" 2008-10-14 20:48 crufty stuff 2008-10-14 20:49 ok, that was the fast path 2008-10-14 20:49 if the buffer wasn't there then we fall onto the slow path 2008-10-14 20:49 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L1403 2008-10-14 20:49 in __getblk 2008-10-14 20:50 see that hardsector size stuff 2008-10-14 20:50 largely legacy 2008-10-14 20:50 doesn't do much except create bugs these days 2008-10-14 20:51 lol 2008-10-14 20:51 in __getblk_slow we see an attempt at integration with the vm cache shrinking code 2008-10-14 20:51 it's not pretty 2008-10-14 20:52 but sometime go look at grow_buffers 2008-10-14 20:52 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L1118 ?
2008-10-14 20:52 yes 2008-10-14 20:53 "__getblk() cannot fail - it just keeps trying." http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L1396 2008-10-14 20:54 in spite of this assertion, the kernel is littered with code to take evasive action if getblk returns NULL 2008-10-14 20:54 lol 2008-10-14 20:55 should just let that segfault in the tux3 kernel port, perhaps with a pointer to the comment 2008-10-14 20:55 * __getblk() will lock up the machine if grow_dev_page's try_to_free_buffers() 2008-10-14 20:55 * attempt is failing. FIXME, perhaps? 2008-10-14 20:55 wheee 2008-10-14 20:55 try_to_free_buffers is the worst function in the entire kernel 2008-10-14 20:55 whee indeed 2008-10-14 20:56 it's vm, not vfs so we will not look at it right now 2008-10-14 20:56 all this buffer stuff is very fragile and arguably broken 2008-10-14 20:56 it's a credit to the bug chasing talents of people like akpm and linus that it works at all 2008-10-14 20:57 just to give everybody some sense of confidence in what we are about to do ;) 2008-10-14 20:57 you didn't come to university to have the truth softened, right? 2008-10-14 20:57 :P 2008-10-14 20:57 I'm trying to understand why we have per-file and per-metadata pagecache + bufferheads, instead of just device cache 2008-10-14 20:58 this is just device cache 2008-10-14 20:58 ah 2008-10-14 20:58 you mean why not throw away file caches? 2008-10-14 20:58 basically 2008-10-14 20:58 because we need to index cache objects by logical file offset 2008-10-14 20:58 or have filecaches just be pointers to the right pages of the device cache 2008-10-14 20:58 good idea 2008-10-14 20:59 probably a very good idea 2008-10-14 20:59 we're kind of in transition here 2008-10-14 20:59 unifying the page and buffer cache, which used to be a lot more separate 2008-10-14 21:00 linux 2.0 actually copied data between them to get something resembling coherence 2008-10-14 21:00 right, but right now we have a lot of copying of data around, right? 
2008-10-14 21:00 so it's better than it was, which was really really awful 2008-10-14 21:00 we don't, no 2008-10-14 21:00 it's pretty much all done with pointers 2008-10-14 21:00 if you have disk - partition - lvm physical - lvm volume - lvm logical - filesystem - file 2008-10-14 21:00 then how many times do we copy 4KB of data in order to read 4KB? 2008-10-14 21:00 still all just pointers 2008-10-14 21:00 none usually 2008-10-14 21:01 sorry 2008-10-14 21:01 one 2008-10-14 21:01 copy_to_user 2008-10-14 21:01 one dma into the cache and one copy_to_user 2008-10-14 21:01 if it's memory mapped, then no copy_to_user 2008-10-14 21:01 just the dma 2008-10-14 21:03 right, because there's no device cache 2008-10-14 21:03 so unless there's some raid involved, it doesn't much matter 2008-10-14 21:04 it would be very nice to use the buffer cache as a device cache 2008-10-14 21:04 and we should 2008-10-14 21:04 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-10-14 21:04 so, if instead of reading from userspace, we mmap, will that be just a dma to the mmapped region? 2008-10-14 21:04 but it is a lot of work, and as you can see, there is some very fragile code that _will_ break when we mess with this 2008-10-14 21:04 that will 2008-10-14 21:04 dma to the physical page, which is mapped into a process memory space 2008-10-14 21:05 will the dma happen on mmap, or when we later try to read a not-present page? 2008-10-14 21:05 I'm guessing the latter 2008-10-14 21:06 makes sense to be the latter! 
2008-10-14 21:06 you can make a big areas :P 2008-10-14 21:06 ok, just to wrap up our tour of getblk, the place where a buffer is actually created and inserted into the buffer cache is grow_buffers 2008-10-14 21:06 folks 2008-10-14 21:06 rather badly misnamed function, which is why I had to look at it half a dozen times to know where this happens 2008-10-14 21:06 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L1081 2008-10-14 21:07 (interestingly a second implementation in raid5) 2008-10-14 21:07 grow_dev_page <- trying to continue the grand tradition of finding ever worse names for functions 2008-10-14 21:07 no wonder bsd guys tend to slit their wrists when forced to read linux code 2008-10-14 21:08 probably explains why there are so few bsd guys 2008-10-14 21:08 :-) 2008-10-14 21:08 yes, we all know linux leads to killer filesystems 2008-10-14 21:08 eek 2008-10-14 21:09 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L1093 <- compare to #1108 2008-10-14 21:09 and promise me you will never write code like that 2008-10-14 21:10 I believe that does actually do what it's supposed to 2008-10-14 21:10 :D 2008-10-14 21:10 could use a comment though 2008-10-14 21:10 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L1028 <- finally, here is where the work gets done 2008-10-14 21:10 we enter a page into the page cache, with buffers on it 2008-10-14 21:10 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L1027 2008-10-14 21:11 I think this code actually came from me 2008-10-14 21:11 way back 2008-10-14 21:11 started as my hack to make htree work in the page cache of a file 2008-10-14 21:11 we seem to have a lot of ways to allocate memory... 
2008-10-14 21:11 linus liked that idea and decided to use it for the buffer cache 2008-10-14 21:11 there doesn't seem to be much deallocation 2008-10-14 21:11 a good idea 2008-10-14 21:11 maze, there only needs to be deallocation in one spot 2008-10-14 21:12 shrink_caches 2008-10-14 21:12 that is indeed a magical and good thing 2008-10-14 21:12 yes 2008-10-14 21:12 are all page caches actually in one lru then? 2008-10-14 21:12 the kernel is this kind of organica, self cleaning thing 2008-10-14 21:12 has to keep moving, like a shark 2008-10-14 21:13 filling up cache with new stuff about to be used, evicting old stuff to make room for it 2008-10-14 21:13 (wouldn't that lru then be a source of lock contention on multi-way smp?) 2008-10-14 21:13 yes, all pages in the system are in one lru 2008-10-14 21:13 this actually doesn't make complete sense 2008-10-14 21:13 since dirty pages these days do not tend to be evicted via the page lru at all 2008-10-14 21:13 but by inode flushes 2008-10-14 21:14 because a filesystem can't afford to have the vm writing out random pages in orders that violate ACID constraints 2008-10-14 21:15 ok we went into bonus time 2008-10-14 21:15 questions on thursday ;) 2008-10-14 21:15 :-) 2008-10-14 21:16 ;-) 2008-10-14 21:17 how'd we do on the interesting front this time? 2008-10-14 21:17 it's complex... 2008-10-14 21:17 I feel like vast pieces of this should be avoided in new fs code 2008-10-14 21:17 we're going to wallow in it, unfortunately 2008-10-14 21:18 because using anything other than buffers to access your metadata blocks leads to worse horrors 2008-10-14 21:18 ie. 
there should be no references to buffer heads at all 2008-10-14 21:18 nice idea except when your block size is smaller than a page 2008-10-14 21:18 see, the concept of 'metadata blocks' 2008-10-14 21:18 is something I have issue with ;-) 2008-10-14 21:18 fixing that problem will lead to reinventing the buffer cache 2008-10-14 21:19 I anxiously await your proposal to replace the notion of metadata 2008-10-14 21:19 not metadata... just metadata blocks 2008-10-14 21:19 "hyperdata" 2008-10-14 21:19 I see 2008-10-14 21:19 with? 2008-10-14 21:19 metadata extents? 2008-10-14 21:19 which would be... simpler? 2008-10-14 21:19 buy not having metadata blocks, you should be able to get acid with no effort (or very little additional effort) 2008-10-14 21:20 buy -> by 2008-10-14 21:20 what do you use instead of metadata blocks? 2008-10-14 21:21 hmm, it's hard to describe a not fully thought out idea 2008-10-14 21:21 but basically a forward log 2008-10-14 21:21 what about the cache? 2008-10-14 21:21 combined with always writing to free disk space 2008-10-14 21:21 cache of what? 2008-10-14 21:21 metadata 2008-10-14 21:21 in memory structure doesn't have to have anything in common with on-disk 2008-10-14 21:22 probably some sort of tree in sparse file or something though 2008-10-14 21:22 it's very helpful if it does 2008-10-14 21:22 the buffer cache already is a tree 2008-10-14 21:22 and if you have a sparse file, you have to have metadata for that file somewhere 2008-10-14 21:22 in the tree of course ;-) 2008-10-14 21:22 where, in another file? and how does that recursion terminate? 2008-10-14 21:23 that's why you have a forward log 2008-10-14 21:23 with care ;-) 2008-10-14 21:23 but you still haven't explained how your metadata is cached 2008-10-14 21:23 the tree is in a file, the file is page cached 2008-10-14 21:23 and how are the blocks of that page cache mapped to the disk? 2008-10-14 21:24 using the tree 2008-10-14 21:24 which tree? 
2008-10-14 21:24 the one stored on those blocks 2008-10-14 21:24 what if you have a cache miss on one of those blocks? 2008-10-14 21:24 yeah, it's hard to describe 2008-10-14 21:25 I should probably work it out fully... 2008-10-14 21:25 yes, and you will realize that we're not that far off with the current arrangement 2008-10-14 21:25 (a cache miss is not a big problem with a tree, so long as it's not a mere radix-tree) 2008-10-14 21:25 the part that sucks is being stuck with page size resolution, we need more flexibility than that 2008-10-14 21:26 which is what this whole creaky mess of buffer_heads is about 2008-10-14 21:26 it's a solution, just not a good solution 2008-10-14 21:26 improving it would be a good project 2008-10-14 21:26 not a summer project though 2008-10-14 21:26 right 2008-10-14 21:27 I can understand why it's done the way it is 2008-10-14 21:27 it's just duplication of code/concepts and multiple opportunities to screw up and get locking wrong in edge cases 2008-10-14 21:27 a couple of things I propose to do about it 2008-10-14 21:27 1) let struct page denote objects with sizes larger and smaller than page size, thus obviating the need for struct buffer_head 2008-10-14 21:28 larger... probably easy 2008-10-14 21:28 smaller... 
ur 2008-10-14 21:28 brain fault 2008-10-14 21:28 2) unify the page and buffer cache so that a miss in a page cache then looks in the buffer cache to see if the page is there, so we use the buffer cache as a large device cache 2008-10-14 21:29 3) implement physical readahead in the unified cache 2008-10-14 21:29 4) implement active page table defragmentation so that we can realistically work with larger block sizes 2008-10-14 21:29 5) dynamically allocate struct page's ;-) 2008-10-14 21:30 that's part of (1) 2008-10-14 21:30 yeah, I was wondering if you meant to include that or not 2008-10-14 21:31 a crude form 2008-10-14 21:31 only dynamically allocate to fill in the gaps between the ones in the array 2008-10-14 21:31 gaps 2008-10-14 21:31 ? 2008-10-14 21:32 I think if you want to do dynamic allocation of struct page it's an all-or-nothing scenario 2008-10-14 21:32 yes, you have an array of 4k physical pages, but want to have 1K struct pages, so 3 1K struct pages go between each two 4K physical pages 2008-10-14 21:32 not at all 2008-10-14 21:32 currently physical page address -> struct page is a (PA>>PAGE_SHIFT)*sizeof(struct page) + base operation 2008-10-14 21:32 just dynamically allocating for the sub-physical sized pages works out ok 2008-10-14 21:33 you could get much more invasive about this, but probably not a good idea for a first try 2008-10-14 21:33 ACTION says good night (and thanks for the lecture) 2008-10-14 21:33 by virtue of what a 'page' is for the cpu, I'm not sure sub-pagesize pages are realistic 2008-10-14 21:33 night raz 2008-10-14 21:33 good night 2008-10-14 21:33 what happens when you have conflicting access permissions on two sub-pages? 
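[Editor's sketch] The physical-address-to-struct-page arithmetic mentioned above can be sketched in user space as below, assuming a flat array of page descriptors. The shift is by the base-2 logarithm of the page size (PAGE_SHIFT in the kernel), which gives the page frame number used to index the array. All toy_* names, the two-field descriptor and the 8-page "memory" are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12u                     /* 4K pages */
#define TOY_PAGE_SIZE  (1ul << TOY_PAGE_SHIFT)
#define TOY_NPAGES     8u                      /* pretend physical memory */

struct toy_page {                              /* hypothetical descriptor */
	unsigned long flags;
	unsigned long index;
};

/* One descriptor per physical page frame, in frame order. */
static struct toy_page toy_mem_map[TOY_NPAGES];

/* Physical address -> descriptor: base + (pa >> shift) * sizeof(descriptor). */
static struct toy_page *toy_pa_to_page(uint64_t pa)
{
	uint64_t pfn = pa >> TOY_PAGE_SHIFT;       /* page frame number */

	assert(pfn < TOY_NPAGES);
	return &toy_mem_map[pfn];                  /* pointer arithmetic scales by size */
}

/* Descriptor -> physical address of the start of its page frame. */
static uint64_t toy_page_to_pa(const struct toy_page *page)
{
	return (uint64_t)(page - toy_mem_map) << TOY_PAGE_SHIFT;
}
```

Because adjacent array entries describe adjacent 4K frames, there is no slot "between" two entries for a 1K sub-page descriptor. That appears to be why the discussion proposes allocating descriptors dynamically only for sub-page objects while leaving the array as it is.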
2008-10-14 21:34 the sub-pagesize pages are basically just for locking 2008-10-14 21:34 which is the only really indispensable thing that buffer_heads do at present 2008-10-14 21:34 don't conflict 2008-10-14 21:34 subpages are not entered into page table entries 2008-10-14 21:34 they can't be 2008-10-14 21:34 in that case couldn't we just have a byte of 8 bits for 8 512 byte locks in struct page? 2008-10-14 21:35 possibly, but the page oriented code uses more fields than that 2008-10-14 21:35 for example, the ->index 2008-10-14 21:35 used to locate the page in a page cache 2008-10-14 21:35 we want all that code to continue to work 2008-10-14 21:36 otherwise we have a massive rewrite in store for everything that touches a page 2008-10-14 21:37 a change like this... 2008-10-14 21:37 it would probably end up with a massive rewrite almost any decent way you do it 2008-10-14 21:38 I don't think it'd be possible to have some sort of shim compatibility translation layer 2008-10-14 21:38 you could potentially leave buffer_heads around, until everything had been ported... 2008-10-14 21:39 but actually have the same interface... unlikely 2008-10-14 21:44 the subpage concept isn't that big a deal 2008-10-14 21:45 mostly just affects things like grab_cache_page that we looked at 2008-10-14 21:45 a whole bunch of block io library cruft goes away 2008-10-14 21:45 because we lose the list of buffers per page 2008-10-14 22:05 http://www.newmobilecomputing.com/thread?333779 2008-10-14 22:06 where? 2008-10-14 22:06 ?where? 2008-10-14 22:07 oh 2008-10-14 22:07 mention of ftux3 2008-10-14 22:07 trying to locate the location 2008-10-14 22:07 of the summit 2008-10-14 22:07 what summit? 
2008-10-14 22:07 oh right yeah 2008-10-14 22:07 the one in the article 2008-10-14 22:07 was a pretty lame summit 2008-10-14 22:07 flips: you get google alerts for tux3 also i see ;) 2008-10-14 22:07 oh that summit 2008-10-14 22:07 even worse 2008-10-14 22:08 shapor, they're great 2008-10-14 22:08 NYC 2008-10-14 22:09 http://www.linux.com/feature/132203 <- joe barr's take 2008-10-14 22:09 all in all, a very bad thing 2008-10-14 22:09 for linux 2008-10-14 22:09 getting way too much corp polictics in the works 2008-10-14 22:10 linux foundation... not really representing the community 2008-10-14 22:10 http://www.austinlug.org/node/259 2008-10-14 22:12 "Just out of respect for the natives of Austin, they should have made a choice not to slouch back on bureaucratic policy and instead, make an exception to that policy in order to be good guests and pay respect to the local Linux Kernel enthusiasts. 2008-10-14 22:12 Instead, they big-timed him and sent him home. That's when you know your movement has been co-opted and it's no longer a progressive social force." 2008-10-14 22:16 http://blog.internetnews.com/skerner/2008/10/no-press-at-linux-foundation-e.html 2008-10-14 22:51 hey all 2008-10-14 22:51 hi 2008-10-14 22:52 hello flips 2008-10-14 23:06 say goodnight all 2008-10-14 23:06 hey pranith 2008-10-14 23:06 hey tim_dimm 2008-10-14 23:06 past my bedtime 2008-10-14 23:07 off to sleep huh? 
2008-10-14 23:07 yup 2008-10-14 23:07 hmm 2008-10-14 23:07 twins wore me out today 2008-10-14 23:07 goodnight then 2008-10-14 23:07 :) 2008-10-14 23:07 :-) 2008-10-14 23:07 hmm, lucky you 2008-10-14 23:07 boy and a girl 2008-10-14 23:07 very lucky 2008-10-14 23:07 yeah, i remember :) 2008-10-14 23:07 later guys 2008-10-14 23:46 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-10-14 23:48 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-15 02:07 the eee is back 2008-10-15 02:07 turns out what you have to do is spam the F9 key to get to the grub menu 2008-10-15 02:07 not what it says on the boot screen 2008-10-15 02:08 also there is advice out there for me that the little fan is so pathetic you can just disconnect it without affecting the operating temperature 2008-10-15 02:08 that is good news to be because the only thing I really hate about this little guy is the fan noise 2008-10-15 02:25 flips: wasn't the eeepc supposed to be ultra-silent? 2008-10-15 02:26 no 2008-10-15 02:26 not one of the advertised features 2008-10-15 02:26 but it is if you disconnect the fan 2008-10-15 02:27 which implies underclocking it, I assume 2008-10-15 02:27 http://www.newegg.com/Product/Product.aspx?Item=N82E16883220004 2008-10-15 02:27 "It is quieter than a whisper" 2008-10-15 02:27 on the other hand, silentpcreview measured the eee box pc at 22 db 2008-10-15 02:27 whereas asus claims 26 db, which is already extremely quiet 2008-10-15 02:27 901 maybe 2008-10-15 02:28 mine is a 900, has a celeron-m 2008-10-15 02:28 flips: btw, how come you were not in the linux foundation summit? 
btrfs and ext4 were presented 2008-10-15 02:28 next year when there is working code 2008-10-15 02:28 actually, summit doesn't do anything of use 2008-10-15 02:29 we've already demonstrated we can get just as much pr from grassroots as you can by summiting 2008-10-15 02:29 with internet, conferences are mostly a place to meet people and see friends you seldomly see :-) 2008-10-15 02:29 right, it's mainly about the drunk 2008-10-15 02:29 :-D 2008-10-15 02:29 done plenty of those things in the past 2008-10-15 02:30 I'm actually more interested in outside events now 2008-10-15 02:30 doing linux conferences gets to be a lot like preaching to the choir 2008-10-15 02:31 indeed 2008-10-15 02:31 speaking of pr, there should be another post by tomorrow 2008-10-15 02:31 on design details of atomic commit 2008-10-15 02:31 code will follow not too long after 2008-10-15 02:32 you gotta write that book on implementation of filesystems in linux :-) 2008-10-15 02:32 maybe just collect all the posts and put a cover on them? 
2008-10-15 02:32 heh, some cleaning and editing would be needed :-) 2008-10-15 02:33 for instance, put the code in the appropiate place among the text 2008-10-15 02:33 :-) 2008-10-15 06:05 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-15 08:05 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-15 09:08 -!- mingming(~mingming@bi01p1.co.us.ibm.com) has joined #tux3 2008-10-15 09:09 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-15 10:03 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-15 11:04 -!- mingming(~mingming@bi01p1.co.us.ibm.com) has joined #tux3 2008-10-15 11:05 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-15 13:05 -!- natalie(~nataliep@207.47.98.129.static.nextweb.net) has joined #tux3 2008-10-15 13:07 hi mingming 2008-10-15 14:41 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-10-15 15:22 folks 2008-10-15 15:50 woohoo, I now have a fanless eee 900 2008-10-15 15:50 http://wiki.eeeuser.com/howto:disconnect_fan?s=turn%20off%20fan 2008-10-15 15:50 ACTION <- modder 2008-10-15 15:50 minor mod 2008-10-15 15:51 nothing melting so far 2008-10-15 15:52 there are officially no moving parts on this eee now, unless you count the keyboard 2008-10-15 16:39 sk8 oclock 2008-10-15 17:52 -!- less(~less@145-116-238-192.uilenstede.casema.nl) has joined #tux3 2008-10-15 18:11 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-15 19:50 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-15 20:23 -!- nataliep(~nataliep@72.14.224.1) has joined #tux3 2008-10-15 20:39 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-15 22:17 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-15 23:51 flips: did you see this? 
http://blogs.sun.com/erwann/entry/zfs_on_the_desktop_zfs 2008-10-15 23:52 fuse? 2008-10-15 23:54 oh, solaris 2008-10-15 23:54 hey, I have a great idea 2008-10-15 23:54 why don't we give Gnome to sun to use exclusively on Solaris? 2008-10-15 23:54 kill two birds with one stone 2008-10-15 23:54 :-D 2008-10-15 23:54 you are bad, very bad 2008-10-15 23:55 sometimes I don't try to hide it 2008-10-15 23:55 some guy implemented something like that for KDE3 a few years ago for NILFS 2008-10-15 23:55 but it was never committed to KDE's svn repository 2008-10-15 23:56 previous versions works with ddsnap 2008-10-15 23:56 I forget what the gui front end was 2008-10-15 23:56 so... what about a time travel interface for kde, and what to test it with? 2008-10-15 23:57 it'd be nice 2008-10-15 23:57 2008-10-15 23:58 by the way, I wonder where people got the idea that infinite snapshots come for free 2008-10-15 23:58 there is this thing called churn 2008-10-15 23:59 and if you've got it, kiss your time travel goodbye 2008-10-15 23:59 http://www.sandeepranade.com/html/ComputerScience/time-travelling-file-manager.html 2008-10-15 23:59 it will be back to good ol hourly/daily/weekly and precious few of the latter 2008-10-15 23:59 oh, dead 2008-10-16 00:00 wayback? 2008-10-16 00:01 http://web.archive.org/web/20080128190254/http://www.sandeepranade.com/html/ComputerScience/time-travelling-file-manager.html 2008-10-16 00:01 yeah 2008-10-16 00:01 ext3cow, the obvious tool to test it with 2008-10-16 00:02 that was it, not NILFS as I said 2008-10-16 00:02 and I wonder when the ext3cow guys are going to go for kernel merge 2008-10-16 00:02 when they are able to rm again? 
:-) 2008-10-16 00:02 they can rm, it just doesn't go away 2008-10-16 00:03 btw, a few months ago I briefly looked into adding previous versions support to DolphinPart (which is what KDE uses these days, both in Dolphin and in Konqueror) and it did not look too hard 2008-10-16 00:03 should be integrated with a generic time shifter 2008-10-16 00:03 I like the zoom buttons on the link above 2008-10-16 00:04 the other thing needed on such a slider is little marks where significant changes actually happened 2008-10-16 00:04 I don't know how you'd do that 2008-10-16 00:04 anything at all, to show where the activity was 2008-10-16 00:05 a little squashed histogram of activity instead of linear time scale, maybe 2008-10-16 00:06 adding little marks where significant changes actually happened should not be difficult, but how to find significant changes? a change which affects many files (and how many is "many"?) ? a change which affects a large amount of data (is deleting a DVD-ripped movie actually a big change?) ? 2008-10-16 00:07 exactly 2008-10-16 00:07 well there we have an advantage in tux3, you see all the changes at the same time, at least for one file 2008-10-16 00:08 some kind of out of band activity report 2008-10-16 00:08 we need that actually, for the new generation of filesystems 2008-10-16 00:08 like git and hg have 2008-10-16 00:09 do git and hg have heuristics to tell you "hey, this was a HUGE change"? I didn't know 2008-10-16 00:09 they don't report that 2008-10-16 00:10 it's interesting stuff, we should be mining that data out of our filesystems somehow 2008-10-16 00:10 something to think about 2008-10-16 00:17 ACTION turns in early 2008-10-16 00:17 got to get back in the coding saddle tomorrow, plus post a new post 2008-10-16 00:20 see you! 
2008-10-16 00:20 pgquiles_, I hope somebody picks up that time travel interface work again, most practically with ext3cow I think 2008-10-16 00:20 I hope I have time to do that 2008-10-16 00:20 :-) 2008-10-16 00:20 gnight 2008-10-16 01:56 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-10-16 02:09 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-16 07:56 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-16 08:29 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-16 09:45 -!- Bobby_(~Bobby@122.162.71.144) has joined #tux3 2008-10-16 12:31 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-10-16 13:28 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-16 15:00 need a catchy subject line for the follow up post to thinking about syncing 2008-10-16 15:34 -!- flips(~phillips@phunq.net) has joined #tux3 2008-10-16 15:58 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-16 16:52 title for the next post is set 2008-10-16 16:52 "Of Phases, Quanta, and Episodes" 2008-10-16 16:54 nearly sk8 oclock 2008-10-16 16:57 -!- ChanServ changed mode/#tux3 -> +o flips 2008-10-16 16:57 -!- flips changed topic to "http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 pm Pacific Time ~ Next session: page locking and IO life cycle" 2008-10-16 16:58 -!- ChanServ changed mode/#tux3 -> -o flips 2008-10-16 16:58 -!- ChanServ changed mode/#tux3 -> +o flips 2008-10-16 16:59 -!- flips changed topic to "http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 pm Pacific Time ~ Next session: IO life cycle of a page" 2008-10-16 16:59 -!- ChanServ changed mode/#tux3 -> -o flips 2008-10-16 16:59 scans better 2008-10-16 18:00 -!- tux3bot(~tux3bot@yzf.shapor.com) has joined #tux3 2008-10-16 18:03 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-16 19:45 -!- 
FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-16 19:48 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-16 19:48 Hi! 2008-10-16 19:52 hi 2008-10-16 19:52 ACTION has 8 minutes to shower after the sk8 2008-10-16 19:53 -!- RalucaM(~ral@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-16 19:53 hi 2008-10-16 19:57 lhi raluca 2008-10-16 19:59 2 days into life with its fan disconnected and the eee is happy and healthy 2008-10-16 19:59 recommended imho 2008-10-16 19:59 :-) 2008-10-16 19:59 what are you using it for? 2008-10-16 19:59 now I just need to do the spacebar mod and it will be a fine machine 2008-10-16 19:59 use it for a laptop 2008-10-16 19:59 much nicer to carry around than the thinkpad 2008-10-16 20:00 doesn't bend your shoulder 2008-10-16 20:00 :-) 2008-10-16 20:00 in fact, fits in the _flap_ of my camera backpack, and it isn't a big backpack 2008-10-16 20:00 mine is ;-) 2008-10-16 20:00 what kind of camera in it? 
2008-10-16 20:00 whoops 2008-10-16 20:00 usually the canon 10d 2008-10-16 20:01 I can't know that, it's tux3 u kind 2008-10-16 20:01 ah 2008-10-16 20:01 20d here 2008-10-16 20:01 old but awesome camera :D 2008-10-16 20:01 oh yes 2008-10-16 20:01 even better :P 2008-10-16 20:01 40d will replace it as soon as I get clearance from my wife 2008-10-16 20:01 :D 2008-10-16 20:01 this is entirely justified by the 11000 baby pictures I took in the last two years 2008-10-16 20:01 ACTION is ready for tux3 2008-10-16 20:02 :-) :-) :-) 2008-10-16 20:02 ok, let's go 2008-10-16 20:02 today will be nittier and grittier 2008-10-16 20:02 let's go look at block_read_full_page 2008-10-16 20:03 ACTION listens to the sound of browsers revving up 2008-10-16 20:03 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L2086 2008-10-16 20:04 beat me ;) 2008-10-16 20:04 on entry to this function the page must be locked 2008-10-16 20:05 or this task will oops with a BUG 2008-10-16 20:05 so where does the page get unlocked? 2008-10-16 20:05 see if you can find it, max 3 minutes 2008-10-16 20:06 ACTION takes the chance to snag a glass of vino 2008-10-16 20:07 ACTION is searching 2008-10-16 20:08 talk about your reasoning as you go if you like 2008-10-16 20:09 ok, hint: we've seen the mechanism before 2008-10-16 20:09 right now I'm searching for unlock_page usage 2008-10-16 20:09 I looked for block_read_full_page first 2008-10-16 20:09 didn't get too far 2008-10-16 20:09 when _should_ the page that is being read be unlocked 2008-10-16 20:09 ? 2008-10-16 20:10 I mean, if you were designing your own os 2008-10-16 20:10 when it will be evicted 2008-10-16 20:10 sorry 2008-10-16 20:10 when it is not in use anymore 2008-10-16 20:10 apology accepted ;) 2008-10-16 20:11 when it has been successfully read 2008-10-16 20:11 that is when it should be unlocked 2008-10-16 20:11 aaa...
2008-10-16 20:11 I was very wrong :P 2008-10-16 20:11 then all tasks blocked trying to get the page lock will be unblocked, and can then read the data on the page 2008-10-16 20:11 sure, but you thought about it 2008-10-16 20:11 that's the important step 2008-10-16 20:11 ok, so the locking is only for waiting, right? 2008-10-16 20:12 locking is always only for waiting 2008-10-16 20:12 true 2008-10-16 20:12 locking enforces synchronized access to data by making tasks wait 2008-10-16 20:13 in this case, tasks must wait because the page does not contain valid data yet, the data has to be read from disk 2008-10-16 20:13 ack 2008-10-16 20:13 the very first task that needs the data will grab the lock and then be responsible for launching the read 2008-10-16 20:13 http://lxr.linux.no/linux+v2.6.26.6/fs/buffer.c#L2154 2008-10-16 20:14 that's one unlock location 2008-10-16 20:14 not the common one 2008-10-16 20:14 so it makes perfect sense to check if the lock is on 2008-10-16 20:14 if there is none then nobody is waiting 2008-10-16 20:14 what about read ahead? :P 2008-10-16 20:15 whole nuther topic 2008-10-16 20:15 in fact, this page could be being read as part of readahead 2008-10-16 20:15 the locking logic doesn't change 2008-10-16 20:15 the unlock happens in the bio endio 2008-10-16 20:15 does it make sense to wait for a read ahead? 2008-10-16 20:16 much as maze implemented in junkfs 2008-10-16 20:16 yes, it is mandatory to wait for the readahead 2008-10-16 20:16 any task waiting on the lock intends to use the data on the page 2008-10-16 20:16 it could be a parallel task reading at a different place in the file 2008-10-16 20:17 or a page fault caused by memory access to an mmap region 2008-10-16 20:17 we don't care why somebody is trying to read the page, only that the read is happening 2008-10-16 20:17 now, where does the page get locked?
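The read-side lifecycle discussed so far (first reader locks the page and launches the read, the completion handler unlocks it, everyone else just waits on the lock) can be modeled in a few lines. This is only a toy Python sketch, not kernel code: `Page`, `submit_read`, `end_io` and `read_page` are hypothetical names invented for illustration, a `threading.Lock` stands in for the page lock, and a timer thread plays the role of the block layer.

```python
import threading

class Page:
    """Toy stand-in for a page cache page (hypothetical, not the kernel struct)."""
    def __init__(self):
        self.lock = threading.Lock()   # plays the role of the page lock
        self.uptodate = False          # True once the page holds valid data
        self.data = None
        self.reads_submitted = 0       # how many times IO was actually launched

def end_io(page, data):
    """Completion handler: fill the page, mark it valid, then unlock it."""
    page.data = data
    page.uptodate = True
    page.lock.release()                # the unlock happens here, not in the submitter

def submit_read(page, data):
    """Pretend block layer: the read completes later, on another thread."""
    page.reads_submitted += 1          # safe: caller holds the page lock
    threading.Timer(0.05, end_io, args=(page, data)).start()

def read_page(page, data_on_disk):
    """What every reader does, whether it is the first one or a late arrival."""
    if page.uptodate:
        return page.data
    page.lock.acquire()                # first task in becomes responsible for the read
    if page.uptodate:                  # somebody else's read finished while we waited
        page.lock.release()
        return page.data
    submit_read(page, data_on_disk)    # lock stays held until end_io releases it
    page.lock.acquire()                # wait for completion by retaking the lock
    page.lock.release()
    return page.data

if __name__ == "__main__":
    page = Page()
    results = []
    readers = [threading.Thread(target=lambda: results.append(read_page(page, b"block 7")))
               for _ in range(4)]
    for t in readers:
        t.start()
    for t in readers:
        t.join()
    print(results, page.reads_submitted)
```

Note that the lock is released by a different thread than the one that took it, which is the whole point: the lock means "this page is not valid yet", not "this task owns the page".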
2008-10-16 20:17 3 minute search 2008-10-16 20:17 :D 2008-10-16 20:18 reason out loud 2008-10-16 20:18 hint: we've been in the neighbourhood before 2008-10-16 20:18 the locking should happen in somebody that needs a page 2008-10-16 20:19 grab_page or some of its friends :P 2008-10-16 20:19 right, and what piece of code might? 2008-10-16 20:19 grab_page doesn't actually read a page 2008-10-16 20:19 it's part of the mechanism for doing file write 2008-10-16 20:19 (that was another hint) 2008-10-16 20:20 well... we need the data from the page so to me it makes sense to look in the read part 2008-10-16 20:20 yes 2008-10-16 20:20 what read part should we look at? 2008-10-16 20:21 get_block? checking now... 2008-10-16 20:21 get_block just performs the mapping between a logical block number and a physical block number, it doesn't actually do IO on the block 2008-10-16 20:22 (though in the tux3 user space code, our equivalent does) 2008-10-16 20:22 http://lxr.linux.no/linux+v2.6.26.6/mm/filemap.c#L1170 2008-10-16 20:23 generic_blah_read 2008-10-16 20:23 it is actually the responsibility of the application, in most cases a filesystem, to lock the page 2008-10-16 20:23 go_generic_file_read? 2008-10-16 20:23 maybe 2008-10-16 20:23 because I still don't see the locking :P 2008-10-16 20:24 let's find the actual call to lock_page here 2008-10-16 20:24 yes, do_* 2008-10-16 20:24 you see the for loop 2008-10-16 20:24 something interesting: http://lxr.linux.no/linux+v2.6.26.6/mm/filemap.c#L61 2008-10-16 20:25 http://lxr.linux.no/linux+v2.6.26.6/mm/filemap.c#L894 yes... 2008-10-16 20:25 yes, that's a message to you straight from akpm 2008-10-16 20:25 the first to actually bother to write this stuff down 2008-10-16 20:25 in fact, what we are covering here today is written in no book 2008-10-16 20:26 http://lxr.linux.no/linux+v2.6.26.6/mm/filemap.c#L1003 2008-10-16 20:26 one day maybe we will write the book, can you write well?
2008-10-16 20:26 ACTION writes very badly 2008-10-16 20:26 but a book about FS is a great idea! :D 2008-10-16 20:26 really-really great 2008-10-16 20:26 research well then? 2008-10-16 20:26 that I already know 2008-10-16 20:27 good researcher 2008-10-16 20:27 ok, that is one important place the page gets locked 2008-10-16 20:27 but I don't think it's the main one, let me check 2008-10-16 20:28 oh yes it is 2008-10-16 20:28 you nailed it 2008-10-16 20:28 ok, do_blah_read is far from the only place a page can be read 2008-10-16 20:29 it's the not up to date branch so it make some sense to be the main one 2008-10-16 20:29 the _get_block method may have to read one or more pages to figure out what the physical mapping for a file page is 2008-10-16 20:29 yes, it is 2008-10-16 20:30 the other big branch is the not present branche 2008-10-16 20:30 we're not going over those details today, though we should later 2008-10-16 20:30 ack 2008-10-16 20:31 ok, what else do we need to know about page locking? 2008-10-16 20:31 what we didn't do is the buffer locking, which is mixed together with the page locking in block_read_full_page 2008-10-16 20:31 I think we will leave that for later, we've done enough buffers recently 2008-10-16 20:32 suffice to say, that that part is an unholy mess 2008-10-16 20:32 :-) 2008-10-16 20:32 (I don't see the buffer locking in block_read_full_page :() 2008-10-16 20:32 I should mention that the locking path we just looked at used to be the main path for file reading in the past 2008-10-16 20:33 it isn't any more 2008-10-16 20:33 which one is it? 2008-10-16 20:34 first, take a look at this: http://lxr.linux.no/linux+v2.6.26.6/mm/filemap.c#L1021 2008-10-16 20:35 error = mapping->a_ops->readpage(filp, page); 2008-10-16 20:35 looks like the main read, correct? 
2008-10-16 20:35 right 2008-10-16 20:36 the truth is, the fs does not have to limit itself to reading just this page 2008-10-16 20:36 it must read the page asked for, or you see we return EIO here 2008-10-16 20:36 but it can also read a bunch more pages at the same time 2008-10-16 20:37 which new incarnations of ext3 do 2008-10-16 20:37 we will take a look at that another time 2008-10-16 20:37 it's the multipage path 2008-10-16 20:37 a whole, big, much messier topic 2008-10-16 20:37 just a quick q: there is a generic implementation for readpage somewhere, right? 2008-10-16 20:37 yes, we started with it today 2008-10-16 20:38 block_read_full_page 2008-10-16 20:38 aaaaaa :D 2008-10-16 20:38 ok 2008-10-16 20:38 look for all occurrences, you will find about one/fs 2008-10-16 20:38 tux3 will have one too 2008-10-16 20:38 (the romfs implements it :P) 2008-10-16 20:38 well 2008-10-16 20:38 sorry 2008-10-16 20:38 tux3 is going to ignore block_read_full_page and do the io by a different method I think 2008-10-16 20:39 as you can see, the block_read_full_page function is rather more convoluted than you would expect 2008-10-16 20:39 mpage_readpage 2008-10-16 20:39 that costs cpu, and worse, does operations a page at a time, at best 2008-10-16 20:39 right 2008-10-16 20:39 I haven't really looked at that in depth myself 2008-10-16 20:40 we'll do it as a group, in fact that can be homework for next time 2008-10-16 20:40 ack 2008-10-16 20:40 :D 2008-10-16 20:40 read the mpage path 2008-10-16 20:40 noted 2008-10-16 20:40 ok, that is only half the story of the page locking lifecycle 2008-10-16 20:40 the other half is on the write side 2008-10-16 20:41 so let's start similarly by looking at block_write_full_page (again) 2008-10-16 20:41 http://lxr.linux.no/linux+v2.6.26.5/fs/buffer.c#L2801 2008-10-16 20:41 :) 2008-10-16 20:41 and then http://lxr.linux.no/linux+v2.6.26.5/fs/buffer.c#L1645 2008-10-16 20:42 same locking check :P 2008-10-16 20:42 right 2008-10-16 20:42 good 2008-10-16
20:42 io is by nature symmetric 2008-10-16 20:42 in linux you often have to see past a lot of cruft to see that 2008-10-16 20:43 what would be the equivalent of read ahead for write? :P 2008-10-16 20:43 aaa... flushing 2008-10-16 20:43 write_garbage_ahead? 2008-10-16 20:43 sure 2008-10-16 20:43 I should have thought more before asking :P 2008-10-16 20:43 flushing 2008-10-16 20:43 me too 2008-10-16 20:43 couldn't resist the temptation to make a joke 2008-10-16 20:44 write_garbage_ahead is funny :P 2008-10-16 20:44 locking strategy is asymmetric in a surprising way here 2008-10-16 20:44 many loops in this functions... 2008-10-16 20:45 ...reading code here... 2008-10-16 20:45 function 2008-10-16 20:45 yes, it's going steadily cruftier over time 2008-10-16 20:45 has reached a truly startling stage by now 2008-10-16 20:45 we can skip the first one, right? :D 2008-10-16 20:45 sure, and look at this: http://lxr.linux.no/linux+v2.6.26.5/fs/buffer.c#L1748 2008-10-16 20:46 unlock_page(page); 2008-10-16 20:46 :D 2008-10-16 20:46 symmetry, we haz it! 2008-10-16 20:46 the interesting thing is, this is done right after the submit_bh, which doesn't wait for the actual IO to take place 2008-10-16 20:46 this is the asymmetric part 2008-10-16 20:46 the page is unlocked during the actual write, but for a read it is locked 2008-10-16 20:47 why do you suppose that might be? 2008-10-16 20:47 for read we need the content so we need to wait for the result 2008-10-16 20:47 right, and why can we drop the lock for the write? 2008-10-16 20:47 for write we don't need to wait if we don't care if it fails :P 2008-10-16 20:48 and what advantage is there to dropping the lock for write? 2008-10-16 20:48 stupid q: the locks are counting locks?
2008-10-16 20:48 the truth is, I don't really know the advantage, it is always a racy bug to write to a page that is in the process of being written to media 2008-10-16 20:49 no, the locks are nonrecursive 2008-10-16 20:49 good guess though 2008-10-16 20:49 in some cases, there is no way to prevent the race of writing to a page that is currently being transferred to disk 2008-10-16 20:49 actually, this unlock will wake up somebody... 2008-10-16 20:50 true, which will do a racy, useless write to the page 2008-10-16 20:50 who will be woken up? 2008-10-16 20:50 good question 2008-10-16 20:50 some buggy application probably 2008-10-16 20:50 there must be a lock somewhere... 2008-10-16 20:51 we check for the lock to be on at the start of the function 2008-10-16 20:51 the reason this is always a race is, a write to this memory location while the page is in flight could easily take place at exactly the same time as the dma transfer 2008-10-16 20:51 so we can't predict whether the page will have new or old data or part of each on disk 2008-10-16 20:51 the dma will use the same lock? 2008-10-16 20:51 dma uses no lock 2008-10-16 20:51 hmm... 2008-10-16 20:52 really? 2008-10-16 20:52 once we have sent a page down to the block layer, dma can be initiated at any time 2008-10-16 20:52 really 2008-10-16 20:52 before starting the dma the page is not locked? 2008-10-16 20:52 scary? 2008-10-16 20:52 if you're scared by that you're starting to get it 2008-10-16 20:52 dma doesn't care a bit about page locks 2008-10-16 20:52 hmm... 2008-10-16 20:52 look through all the dma code, you will find no synchronization there 2008-10-16 20:52 I thought the OS will take some care...
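The "new or old data or part of each" outcome is easy to reproduce in a deterministic toy model. The following Python sketch is an illustration only, under loud assumptions: `dma_transfer`, `mmap_store` and `write_out` are hypothetical names, the "DMA" is just a copy done in two bursts, and the interleaved store is forced to land exactly between the bursts. Real hardware gives no such ordering guarantee, which is exactly why the result is unpredictable in practice.

```python
PAGE_SIZE = 8  # tiny "page" so the result is easy to read

def dma_transfer(page, disk, interleaved_write=None):
    """Copy the page to 'disk' in two bursts; nothing here consults a page lock."""
    half = PAGE_SIZE // 2
    disk[:half] = page[:half]          # first burst sees the old contents
    if interleaved_write is not None:
        interleaved_write(page)        # e.g. a store through an mmap of the file
    disk[half:] = page[half:]          # second burst sees whatever is there now

def mmap_store(page):
    """Hypothetical application write that lands while the page is in flight."""
    page[:] = b"B" * PAGE_SIZE

def write_out(interleaved_write=None):
    """Write one page of 'A's and return what ended up on the toy disk."""
    page = bytearray(b"A" * PAGE_SIZE)
    disk = bytearray(PAGE_SIZE)
    dma_transfer(page, disk, interleaved_write)
    return bytes(disk)

if __name__ == "__main__":
    print(write_out())             # b'AAAAAAAA'  (no concurrent store)
    print(write_out(mmap_store))   # b'AAAABBBB'  (torn: part old, part new)
```

Holding the page lock across the transfer would not help against the mmap case, since a plain memory store never takes the page lock in the first place.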
2008-10-16 20:52 except with the disk hardware 2008-10-16 20:53 the filesystem should take care all right, but it can't do anything about mmaped writes for example 2008-10-16 20:53 tux3 will take a great deal of care there 2008-10-16 20:53 because we can also be writing out metadata here 2008-10-16 20:54 and it is always a bug to have a racy write to metadata 2008-10-16 20:54 in other words, the synchronization is performed by caller 2008-10-16 20:54 ack :D 2008-10-16 20:54 the vfs/block library can't possibly know enough to do the synchronization itself 2008-10-16 20:55 ok, the rest is just a reading exercise 2008-10-16 20:55 the page locks will be found in generic_* like as for the read case 2008-10-16 20:55 though in some cases they will be buried in filemap functions like grab_cache_page 2008-10-16 20:56 what would you like to look at for next tuesday? 2008-10-16 20:56 my deadline for 22 was canceled so I'll have some time to work on the fs stuff again :P 2008-10-16 20:56 :) 2008-10-16 20:57 well tomorrow I'll probalby get back in the hacking chair 2008-10-16 20:57 maze would know better to answer to that question 2008-10-16 20:57 fix some filemap bugs and maybe that readdir thing 2008-10-16 20:57 let's take a run at mpage 2008-10-16 20:57 what's the question? 2008-10-16 20:57 "what next" 2008-10-16 20:57 flips: what would you like to look at for next tuesday? 2008-10-16 20:57 MaZe: where have you been?? 2008-10-16 20:58 mpage.c I think 2008-10-16 20:58 unfortunately, working since 8am today 2008-10-16 20:58 flips: you are working on tux3 only your free time? 2008-10-16 20:58 so we can see how akpm goes about bypassing what looks like the main IO paths in the kernel 2008-10-16 20:58 I've read through till 8:30 pm 2008-10-16 20:58 razvanm, indeed 2008-10-16 20:59 wow... 2008-10-16 20:59 MaZe: reading? 2008-10-16 20:59 ok, you mentioned my name, so it pinged me and I looked, but I'm not sure what exact question and how to answer it 2008-10-16 21:00 (reading? 
trying, but not having the time to do it well really) 2008-10-16 21:00 MaZe: the questions was what to talked about next tuesday 2008-10-16 21:00 maze is busy handling a large fire in a google data center 2008-10-16 21:00 at the moment 2008-10-16 21:01 ACTION lies through his teeth 2008-10-16 21:01 fire?!? 2008-10-16 21:01 standard goog joke 2008-10-16 21:01 :-) 2008-10-16 21:01 disk drive caught fire 2008-10-16 21:01 flames leaping into the statosphere 2008-10-16 21:02 causing your gmail to lag by tens of seconds 2008-10-16 21:02 nothing beats the waiting time for loading gmail the first time 2008-10-16 21:02 so about that 10d... 2008-10-16 21:02 when I first noticed the progress bar I though it's a joke :P 2008-10-16 21:03 we got the 10d after a rebel xt :P 2008-10-16 21:03 that's mainly crappy javascript parsing in firefox 2008-10-16 21:03 because I had the chance to hold a 10d in my hand and I fall in love :P 2008-10-16 21:03 rebel xt is a nice machine in its own right 2008-10-16 21:03 there is no time for the progress bar in chrome? 2008-10-16 21:03 but when you get a real camera the difference is obvious 2008-10-16 21:04 I really like that big wheel 2008-10-16 21:04 and the way you can very quickly change the settings 2008-10-16 21:04 chrome's big deal is a faster javascript parser 2008-10-16 21:04 and the much lower shutter lag 2008-10-16 21:04 and the heft 2008-10-16 21:04 that's a big one for me 2008-10-16 21:04 really helps in framing shots 2008-10-16 21:04 you really noticed the diffence in shutter speed? 2008-10-16 21:05 shutter lag 2008-10-16 21:05 lag sorry... 
2008-10-16 21:05 faster data path from the sensor etc 2008-10-16 21:05 faster focus setup (though still sucks) 2008-10-16 21:05 faster motor drive, that's a big one for me 2008-10-16 21:05 changing the focus points is not my main strength :P 2008-10-16 21:06 200 ms on the 20d 2008-10-16 21:06 bigger controls, also a big deal 2008-10-16 21:06 the funny thing, the eye tracking in and old A2E I have is really working :-) 2008-10-16 21:06 a2e? 2008-10-16 21:06 looks similar with the 10d but it's on film 2008-10-16 21:07 I only dabbled in real film a little bit 2008-10-16 21:07 I like being able to take thousands of shots at $0/shot 2008-10-16 21:08 I shoot film after I did on digital 2008-10-16 21:08 is nice 2008-10-16 21:08 in a different way :P 2008-10-16 21:08 yes, I can see that, I don't like waiting to see the results though 2008-10-16 21:08 :D 2008-10-16 21:08 well by now on digital I usually know whether I've got a good shot without looking 2008-10-16 21:08 I also use a ricoh GX100 almost daily 2008-10-16 21:09 digital is awesome when you use flashes :D 2008-10-16 21:09 that fixes a lot, true 2008-10-16 21:09 http://www.canon.com/camera-museum/camera/film/data/1991-1995/1992_eos5_qd.html the A2E 2008-10-16 21:09 about time for me to get a real flash 2008-10-16 21:10 yup! :D 2008-10-16 21:10 I'm goint to take my camera down to the boardwalk tomorrow 2008-10-16 21:10 in abovementioned backpack 2008-10-16 21:10 boardwalk? 2008-10-16 21:10 the sunsets are beyond belief, with the fires going on up the valley 2008-10-16 21:10 venice beach 2008-10-16 21:11 http://www.imagekandi.com/photo/images/Venice-Beach-Board-Walk.jpg 2008-10-16 21:11 aaa 2008-10-16 21:11 awesome :D 2008-10-16 21:11 saw exactly that tonight, except much redder 2008-10-16 21:11 skate through that spot every day 2008-10-16 21:11 stop! 
:D 2008-10-16 21:12 there was a movie being shot just north ;) 2008-10-16 21:12 I skated into the middle of it, the security guy assumed I must be with the crew 2008-10-16 21:12 :-) 2008-10-16 21:12 funny 2008-10-16 21:12 could have gotten myself a free helping at the buffet 2008-10-16 21:12 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-10-16 21:12 but had to get home for tux3 u 2008-10-16 21:13 aaa 2008-10-16 21:13 so later would be better? :D 2008-10-16 21:13 it's fine 2008-10-16 21:13 there will be another shoot 2008-10-16 21:13 and more free food 2008-10-16 21:13 (we had to come home earlier to catch it :P) 2008-10-16 21:13 it was already dark by that time 2008-10-16 21:13 I was asking about the tux3 u :P 2008-10-16 21:13 current time for tux3 u works fine for me 2008-10-16 21:14 do you have flickr stream? :P 2008-10-16 21:14 now, is it good? 2008-10-16 21:14 oh 2008-10-16 21:14 (thanx for the lesson tonight) 2008-10-16 21:14 you mean for for the skate 2008-10-16 21:14 I'll get pix tomorrow 2008-10-16 21:14 funny I never thought of doing that before 2008-10-16 21:14 take it for granted 2008-10-16 21:14 but last few days have been way over the top 2008-10-16 21:15 everybody just stopping and staring with their mouths open 2008-10-16 21:15 :-) 2008-10-16 21:16 got to clean my sensor 2008-10-16 21:16 got one of those visible dust thingies, haven't used it yet 2008-10-16 21:16 but I will hate myself if there is visible dust on my photos tomorrow 2008-10-16 21:17 if you open the aperture then they not be so visible... 2008-10-16 21:17 kind of hard when you're shooting straight into the sun 2008-10-16 21:17 well 2008-10-16 21:17 20d also doesn't have the auto-cleaning stuff... 2008-10-16 21:17 I'll set the shutter fast 2008-10-16 21:17 right 2008-10-16 21:17 yeah... 2008-10-16 21:17 can't do that either 2008-10-16 21:17 all the palm trees will be black 2008-10-16 21:18 so I'll clean the sensor 2008-10-16 21:18 well... 
if the range of light is big there is not much to do anyway... 2008-10-16 21:18 aaa... you could use a filter 2008-10-16 21:18 the main reasons for going to the 40d: 1) 20% more bit fat pixels 2) 3 inch display 2008-10-16 21:18 a gradient filter... 2008-10-16 21:18 everything else I don't really care that much about 2008-10-16 21:19 I think that working with big files is a pain :P 2008-10-16 21:19 I'll bring my filters 2008-10-16 21:19 play around 2008-10-16 21:19 depends on what machine you have though :D 2008-10-16 21:19 cool :P 2008-10-16 21:19 the ee handles those files just fine 2008-10-16 21:19 make a perfect complement to the canon 2008-10-16 21:20 wow! you open the files on that machine??? 2008-10-16 21:20 you can't fill the 16 GB flash in a day 2008-10-16 21:20 works great 2008-10-16 21:20 it's a pretty fast little machine 2008-10-16 21:20 1 GB memory 2008-10-16 21:21 I can't believe it only cost $500, now costs closer to $300 2008-10-16 21:21 nice... 2008-10-16 21:21 it's going to get a big brother pretty soon, eee 1000 2008-10-16 21:22 what I need to be able effectively on the road 2008-10-16 21:22 OT: http://farm4.static.flickr.com/3017/2894438808_e0d5f9bfbb.jpg taken with a cheap old flash and a cheap umbrella 2008-10-16 21:22 the 9 inch keyboard causes some strain 2008-10-16 21:22 the screen is bigger on the 1000? 2008-10-16 21:23 10 inches, and a 92% keyboard 2008-10-16 21:23 crisp indeed 2008-10-16 21:23 time delay? 2008-10-16 21:23 or raluca on the camera maybe 2008-10-16 21:24 time delay? 2008-10-16 21:24 that's a shot of you, no? 2008-10-16 21:25 http://farm4.static.flickr.com/3286/2893639655_3aed45ecd4_b.jpg 2008-10-16 21:25 yup, that was me 2008-10-16 21:25 the other one is one with Ral 2008-10-16 21:25 notice the black border from the bottom 2008-10-16 21:25 arty 2008-10-16 21:25 I used a 250 shutter speed 2008-10-16 21:25 that background is fine 2008-10-16 21:26 the cap is 200 for 10d 2008-10-16 21:26 what's the black at the bottom? 
2008-10-16 21:26 it's the shutter :D 2008-10-16 21:26 so how'd you get 250? 2008-10-16 21:26 ah 2008-10-16 21:26 the 250 was the shutter speed 2008-10-16 21:26 you can tell it do do it, it won't 2008-10-16 21:27 sorry? 2008-10-16 21:27 you can set it to 1/250, but they you get a picture of the shutter, no? 2008-10-16 21:27 and thought the shutter moved left to right 2008-10-16 21:28 not top to bottom 2008-10-16 21:28 I thought I meant 2008-10-16 21:28 it's top to bottom in slr :D 2008-10-16 21:28 ok, and to 1/8000 in the 20d 2008-10-16 21:28 the sync with the flash will be also aroun 1/200 2008-10-16 21:28 or 1/250... 2008-10-16 21:29 you are a more leet photog than me 2008-10-16 21:29 I haven't even gotten into flash sync let 2008-10-16 21:29 yet 2008-10-16 21:29 btw: in my lab the colors are very nice with a color balance of 4000K 2008-10-16 21:30 lab? 2008-10-16 21:30 the office 2008-10-16 21:30 let me see, it means everything is very blue there? 2008-10-16 21:31 we have fluorescent light 2008-10-16 21:31 condolences 2008-10-16 21:31 I use to shoot using the fluorescent setting but the 4000K is much better 2008-10-16 21:31 just twist the tubes out 2008-10-16 21:31 which have 5500K? like the natural light? 2008-10-16 21:31 is that what it is? 
2008-10-16 21:32 (the flash sync for 20d is 1/250, congrats :P) 2008-10-16 21:32 I think the color balance doesn't really affect the camera settings, just the jpg conversion 2008-10-16 21:32 so going on that theory, I always shoot raw and never change the temperature 2008-10-16 21:32 I shoot raw for some time 2008-10-16 21:33 the size of the files and the processing was too much for me :P 2008-10-16 21:33 now I'm using jpg so I need to set it right :D 2008-10-16 21:33 they haven't really improved on the 20d mechanics in the 30d and 40d 2008-10-16 21:33 bad canon 2008-10-16 21:33 still 200 ms/shot is the state of the art for prosumer 2008-10-16 21:34 raw is quite comfortable on the 20d 2008-10-16 21:34 except sometimes you have to wait for a shot while the transfer to flash is in progress 2008-10-16 21:34 do you shoot a lot? 2008-10-16 21:34 can fix that with a faster flash card 2008-10-16 21:34 is 5,000 shots/year a lot? 2008-10-16 21:35 not really... 2008-10-16 21:35 right 2008-10-16 21:35 more than most people, less than a true photog 2008-10-16 21:35 we have about 1000 per month 2008-10-16 21:35 that's what I did when I first got it 2008-10-16 21:36 (shoot at 4000K: http://farm4.static.flickr.com/3223/2893706927_bbfc0360d7_b.jpg ) 2008-10-16 21:36 we took so far 33K of pictures... 2008-10-16 21:36 major boca 2008-10-16 21:36 from mid 2003 till now 2008-10-16 21:36 major boca? 2008-10-16 21:37 fuzzy background 2008-10-16 21:37 did I spell that right? 2008-10-16 21:37 spanish? :D 2008-10-16 21:37 camera term 2008-10-16 21:38 we use a cheap 50mm f1.8 2008-10-16 21:38 best for boca 2008-10-16 21:38 bokeh? 2008-10-16 21:38 prime 2008-10-16 21:38 right 2008-10-16 21:38 http://en.wikipedia.org/wiki/Bokeh ? 
2008-10-16 21:38 that's the one 2008-10-16 21:38 we only have two zoom lens 2008-10-16 21:38 considered high art 2008-10-16 21:39 all the rest are primes 2008-10-16 21:39 I have yet to get a prime lens 2008-10-16 21:39 just been lazy 2008-10-16 21:39 I like primes because you don't have the problem of zooming ;-) 2008-10-16 21:39 I shoot with this almost exclusively: http://www.the-digital-picture.com/reviews/Canon-EF-S-17-55mm-f-2.8-IS-USM-Lens-Review.aspx 2008-10-16 21:40 barely fits in my holster bag 2008-10-16 21:40 nice!! 2008-10-16 21:40 2.8... IS :D 2008-10-16 21:40 get lots of attention 2008-10-16 21:40 people wonder what is the point of all that glass 2008-10-16 21:40 one of our zooms is the 17-40 F4 :P 2008-10-16 21:40 weighs a kilo, more than the camera 2008-10-16 21:40 the 2.8 :P 2008-10-16 21:41 http://www.flickr.com/gp/46249124@N00/U5gk98 some of the speakers from our CS Seminar 2008-10-16 21:41 can do some nice things with the IS 2008-10-16 21:41 I use the 85mm f1.8 2008-10-16 21:41 like shoot without flash in dim light 2008-10-16 21:41 I don't have any IS lens 2008-10-16 21:41 cool :D 2008-10-16 21:41 anyway, time to go do family stuff 2008-10-16 21:42 I'll be back working on the next post later 2008-10-16 21:42 have a nice evening! 2008-10-16 21:42 you too 2008-10-16 21:42 thanks for the lesson 2008-10-16 21:42 thanks for coming 2008-10-16 21:42 well... I did the easy thing :P 2008-10-16 21:42 this is eventually going to turn into a book I think 2008-10-16 21:42 :D 2008-10-16 21:42 not many people need to read this book, but those who do need it bad 2008-10-16 21:43 I would love to do a fast fwd to see it ;-) 2008-10-16 22:46 hey 2008-10-16 22:46 hi 2008-10-16 22:47 funny, I've been thinking about getting a canon G10 2008-10-16 22:47 seems to be the biggest bang for the buck at this time 2008-10-16 22:47 flips: how's it going ?
2008-10-16 22:48 ACTION is playing around with some lockdep/stat related changes to track rq lock contention 2008-10-16 22:48 going fine 2008-10-16 22:48 bh: dslr elitests wont speak of such a camera 2008-10-16 22:49 shapor: really ? don't like it ? 2008-10-16 22:49 powershot says it all 2008-10-16 22:49 or is it the case that it makes their purchase look bad ? 2008-10-16 22:49 ACTION is not a dslr elitest like flips 2008-10-16 22:49 20d can be had for $450 now 2008-10-16 22:49 now reason not to get a real camera 2008-10-16 22:49 yeah, but that's older technology 2008-10-16 22:49 point and shoots are much better 2008-10-16 22:50 beats heck out of any point n shoot 2008-10-16 22:50 more likely to have it with you when you want it 2008-10-16 22:50 because of optics ? 2008-10-16 22:50 still gets oohs an ahs pretty much every time it comes out of the bag 2008-10-16 22:50 like with a normal 50mm lens and stuff ? 2008-10-16 22:50 most people on the dslr bandwagon are poor photographers who dont even use 1% of the 100's of features their cameras have 2008-10-16 22:50 I'm interested in night photography, indoor club stuff 2008-10-16 22:50 posers... 2008-10-16 22:50 shapor: I agree 2008-10-16 22:51 shapor, and most people aren't posers period? 2008-10-16 22:51 which is why I went with a good consumer casio 2008-10-16 22:51 why limit the discussion to photo posers? 2008-10-16 22:51 it's done well for me and taken tons of punishment from the playa, etc... 2008-10-16 22:51 also its a huge plus being able to have a camera you dont mind dropping or taking on a camping trip for fear of getting ruined 2008-10-16 22:51 I'm never going to get any really expensive lens so I'm seriously thinking about a G10 2008-10-16 22:52 or be burdened with its weight 2008-10-16 22:52 something about that big fat Ka-LICK is addictive 2008-10-16 22:52 my gf has a nikon dslr w/a few $1500+ lenses 2008-10-16 22:52 its more of a hassle than anything else 2008-10-16 22:52 nikon... 
2008-10-16 22:52 gotta have a rediculously big tripod to hold it stable 2008-10-16 22:53 totaly not worth it 2008-10-16 22:53 I still manage to fit mine in a holster 2008-10-16 22:53 just barely 2008-10-16 22:53 kind of bulges 2008-10-16 22:53 my sony that fits in my pocket i can prop up on my jacket and take 30s exposures with has gotten much better use 2008-10-16 22:53 can the ex-pro photographer weigh in here? 2008-10-16 22:54 get the lumix or the leica 2008-10-16 22:54 pro's have to stay quiet ;) 2008-10-16 22:54 same optics 2008-10-16 22:54 flips: see even the pro agrees :P 2008-10-16 22:54 great compact camera 2008-10-16 22:54 yeah i considered the lumix 2008-10-16 22:54 wide lens is nice 2008-10-16 22:54 he's just trying not to shame you in public ;) 2008-10-16 22:55 i found their noise reduction a bit dated 2008-10-16 22:55 lumix? 2008-10-16 22:55 yeah 2008-10-16 22:55 uh oh 2008-10-16 22:55 on the panasonic anyway 2008-10-16 22:55 persie 2008-10-16 22:55 everybody knows that in the circles timothy used to move in there are only canon and nikon marks to be seen 2008-10-16 22:55 i think the leica has different software 2008-10-16 22:55 although i may be wrong 2008-10-16 22:55 it does 2008-10-16 22:55 and its better 2008-10-16 22:55 yeah 2008-10-16 22:55 that's why I bought it 2008-10-16 22:56 $100 more or so 2008-10-16 22:56 noise reduction should be done offline 2008-10-16 22:56 let's argue about roller skates next 2008-10-16 22:56 flips: how often do you shoot raw? 2008-10-16 22:57 its a pita 2008-10-16 22:57 big files 2008-10-16 22:57 with post processing 2008-10-16 22:57 if you want to do anything with it 2008-10-16 22:57 do any of the cameras do lossless compression? 
2008-10-16 22:58 i alwys wondered why they didnt 2008-10-16 22:58 shapor, always shoot raw 2008-10-16 22:58 some do lzw 2008-10-16 22:58 i always shoot jpegs 2008-10-16 22:58 unless its for money 2008-10-16 22:58 yeah i can fill a 4GB card with jpgs between dumps 2008-10-16 22:58 raws would kill that 2008-10-16 22:59 flips: how's tux3 development going. Seeing that various bug fixes went in about a week ago, but I'm assuming that you're working on other stuff that's yet to be committed. 2008-10-16 22:59 I shoot about 200 pics a day on the theory it's not a video camera 2008-10-16 22:59 i shoot with my iPhone just to annoy my linux geek friends 2008-10-16 22:59 bh, working on a follow up atomic commit degisn post 2008-10-16 22:59 tim_dimm: :) 2008-10-16 23:00 flips: so you're working on atomic commits now ? 2008-10-16 23:00 yes 2008-10-16 23:00 last big thing before kernel port 2008-10-16 23:01 tim_dimm, you got off some decent shots in spite of the beyond belief shutter lag 2008-10-16 23:01 noisy, but in focus 2008-10-16 23:01 the 3G has a much nicer camera 2008-10-16 23:02 better shadow detail 2008-10-16 23:02 nicer lens 2008-10-16 23:02 get a gphone 2008-10-16 23:02 ACTION hides 2008-10-16 23:02 its getting ripped for usability 2008-10-16 23:02 too many buttons 2008-10-16 23:02 remember, I've been using a one button mouse for years 2008-10-16 23:02 http://shapor.com/pics/trips/out_west-2004-10-11/vegas/.html/IMAG0249.JPG.html 2008-10-16 23:02 ;-) 2008-10-16 23:02 macheads get confused by buttons I know 2008-10-16 23:02 ^ reason i use a cheap camera 2008-10-16 23:02 peeceers like them 2008-10-16 23:02 flips: reading your post now 2008-10-16 23:03 what gear? 2008-10-16 23:03 cause that looks like 12k rpm 2008-10-16 23:03 exposure issues? 2008-10-16 23:03 based on the speedo, 6th 2008-10-16 23:03 (thats the 600) 2008-10-16 23:03 160? 
2008-10-16 23:03 oh 2008-10-16 23:03 146 i think 2008-10-16 23:04 clutch cable is kinda blocking it 2008-10-16 23:04 lcd speedo 2008-10-16 23:04 tim_dimm, if you get a gphone you can use the gps to measure your speed 2008-10-16 23:04 ...maybe 2008-10-16 23:04 actually I think it's just fake cell tower gps 2008-10-16 23:04 i've dropped 3 cameras off the bike now :( 2008-10-16 23:05 explains your attachment to point n shoots 2008-10-16 23:05 i should maybe a tether 2008-10-16 23:05 no emotional involvement 2008-10-16 23:05 make* 2008-10-16 23:06 that was a $69 vivitar from walmart 6 years ago 2008-10-16 23:06 http://shapor.com/pics/trips/out_west-2004-10-11/vegas/.html/IMAG0270.JPG.html 2008-10-16 23:06 thats right off the camera, no effects, heh 2008-10-16 23:06 super slow processing 2008-10-16 23:06 focal plane effect? 2008-10-16 23:06 no 2008-10-16 23:07 must be sensor scanout 2008-10-16 23:07 but... 2008-10-16 23:07 yeah 2008-10-16 23:07 the later pixels would be overexposed if that were the case 2008-10-16 23:07 it may compensate for that 2008-10-16 23:08 dunno 2008-10-16 23:08 that would be a trick 2008-10-16 23:08 what kind of shutter? 2008-10-16 23:08 a $69 walmart one back in 2002 ;) 2008-10-16 23:08 certainly digital 2008-10-16 23:09 digital shutter? 2008-10-16 23:09 ACTION doubts there is such a thing 2008-10-16 23:09 non-mechanical 2008-10-16 23:09 like a phone has, right? 
2008-10-16 23:09 hmm 2008-10-16 23:09 aka cheap 2008-10-16 23:09 hm maybe its not that bad 2008-10-16 23:10 also running off commodity battery is a key feature i look for 2008-10-16 23:10 AA or AAA 2008-10-16 23:11 in a pinch all you do is stop at a gas station 2008-10-16 23:11 instead of this recharging business 2008-10-16 23:11 but the big canon battery will drive the built in flash all day 2008-10-16 23:12 you'd have a bag full of dead aa's 2008-10-16 23:12 actually a pocket full of them 2008-10-16 23:12 ammo for cars who cut you off 2008-10-16 23:12 dual purpose ;) 2008-10-16 23:15 oh, I just need to learn to frame my thoughts the right way 2008-10-16 23:16 :) 2008-10-16 23:16 ACTION heads down to wallmart to pick up a point n shoot and a shopping back fulla aa's 2008-10-16 23:16 flip the evil switch on, i'm sure you have one 2008-10-16 23:16 I parked illegally once 2008-10-16 23:17 I parked legally once 2008-10-16 23:19 trying to decide now whether I should call the thing that has a commit block a quantum or not 2008-10-16 23:21 what would be a better word? 2008-10-16 23:21 phase? 2008-10-16 23:21 phase is already used, a phase is made up of quanta at the moment 2008-10-16 23:21 and an episode is made up of phases 2008-10-16 23:21 whats a quantum made up of? 2008-10-16 23:22 pointers to extents 2008-10-16 23:22 and pointers to parent blocks to plug them into 2008-10-16 23:22 why isn't that just a commit block? 2008-10-16 23:23 there are commit blocks for each of quanta, phases and episodes 2008-10-16 23:23 so it would be a quantum commit block ;) 2008-10-16 23:23 sounds cool for sure 2008-10-16 23:40 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-10-16 23:40 hey all 2008-10-16 23:42 hi pranith 2008-10-16 23:49 flips: hello 2008-10-16 23:50 you were going to post a new mail? 2008-10-16 23:50 working on it 2008-10-16 23:50 maybe 30% done 2008-10-16 23:50 what is it about? 
2008-10-16 23:51 details of how we get the writeout pattern I wrote about in the previous post 2008-10-16 23:51 hmm 2008-10-16 23:51 pretty much the most important issue besides the versioned pointers 2008-10-16 23:52 ok 2008-10-16 23:52 looking forward to it 2008-10-16 23:52 i was seeing the sandeepranade's time machine implementation using ext3cow... 2008-10-16 23:53 one thing i wanted to ask.. using versioned pointers how many copies can we store at a time? 2008-10-16 23:53 like if i use bittorrent.. the blocks keep changing all the time.. 2008-10-16 23:53 how do you handle such situations? 2008-10-16 23:54 we drop off old versions to make room for new ones 2008-10-16 23:54 without doing anything special, we can store about 500 versions 2008-10-16 23:54 hmm, we keep the versions until we run out of space? 2008-10-16 23:54 or 8,000 if we save some bits as discussed on the list, using the buddy system idea 2008-10-16 23:54 that might lead to fragmentation... 2008-10-16 23:54 you don't have to 2008-10-16 23:54 but you will be able to 2008-10-16 23:54 hmm 2008-10-16 23:55 we did that in zumastor with success 2008-10-16 23:55 so we can tune that number of versions? 2008-10-16 23:55 that part's not designed yet 2008-10-16 23:55 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-10-16 23:55 would be very useful though 2008-10-16 23:55 it asnwers the question, how do you avoid enospc when writing to a filesystem holding lots of snapshots 2008-10-16 23:56 hmm... yeah 2008-10-16 23:57 where might we be storing this information? 2008-10-16 23:57 if given as an option? 2008-10-17 00:00 which information? 2008-10-17 00:00 the number of versions to store? 
2008-10-17 00:02 we handle that pretty nicely in zumastor 2008-10-17 00:02 each version has a priority, for any ties you discard the oldest 2008-10-17 00:03 anyway, that would go in the version table, a special file much as the allocation bitmap and atom table are 2008-10-17 00:03 hmm 2008-10-17 01:15 another question is how do you tell the fs your preference to hold a snapshot unless the space is needed 2008-10-17 01:16 more like, how to explain the priority system to an admin 2008-10-17 01:17 would also be useful to control on a per-file or per-directory basis 2008-10-17 01:17 aka "i dont care about snapshots of /var/log" 2008-10-17 01:17 which could burn a lot of our snapshot space 2008-10-17 01:30 shapor: right 2008-10-17 01:52 the argument that "we can't afford to fill up the filesystem because it would fragment to hell" is kind of interesting 2008-10-17 01:52 so how much can we fill the filesystem? 2008-10-17 01:53 why not just leave 5% free and misreport the available space? 2008-10-17 01:56 pranith, shapor, we get to play with new interfaces when we get to snapshots 2008-10-17 01:56 pioneering stuff 2008-10-17 02:04 flips: it would be good to have a professional design document with the things which are new in tux3... 2008-10-17 02:06 would be easy to refer to concepts 2008-10-17 02:07 and *figures* would really help :) 2008-10-17 02:07 shapor's working on it 2008-10-17 02:07 feel free to contribute figures 2008-10-17 02:07 they're quite time consuming to make 2008-10-17 02:07 we need some volunteers 2008-10-17 02:08 yeah.. ill ping him.. 2008-10-17 02:08 shapor: you need any help with figures :D 2008-10-17 02:08 pranith: yes! 2008-10-17 02:08 what exactly are u working on? 2008-10-17 02:08 any draft?
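The discard rule mentioned at 00:02 (each version has a priority, ties go to the oldest) can be sketched in a few lines of C. This is only an illustration: the struct, its field names, and the choice that a lower priority number means "discard first" are assumptions, not zumastor or tux3 code, and the version table format was not designed at the time of this log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical version table entry; names are illustrative only. */
struct version_sketch {
    unsigned id;        /* version (snapshot) number */
    int priority;       /* assumed: lower value is discarded first */
    uint64_t created;   /* creation time, smaller means older */
};

/*
 * Pick the version to drop when space is needed: lowest priority wins,
 * and among equal priorities the oldest one is chosen.
 * Returns an index into v[], or -1 if the table is empty.
 */
static int pick_victim(const struct version_sketch *v, size_t n)
{
    int victim = -1;

    for (size_t i = 0; i < n; i++) {
        if (victim < 0 ||
            v[i].priority < v[victim].priority ||
            (v[i].priority == v[victim].priority &&
             v[i].created < v[victim].created))
            victim = (int)i;
    }
    return victim;
}
```

A per-directory preference such as "i dont care about snapshots of /var/log" would need more than this single table, which is the open interface question raised at 01:15.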
2008-10-17 02:09 very rough, its in the repo 2008-10-17 02:09 http://shapor.com/tux3/shapor-tux3/doc/design.html 2008-10-17 02:09 hmm 2008-10-17 02:10 no figures at all :( 2008-10-17 02:11 yeah, it would be useful for dleaf 2008-10-17 02:11 ok, i think this is worth doing.. 2008-10-17 02:11 very useful 2008-10-17 02:11 the only way i got it was flips drawing it on a whiteboard 2008-10-17 02:11 oh 2008-10-17 02:11 -!- pgquiles_(~pgquiles@121.Red-88-16-37.dynamicIP.rima-tde.net) has joined #tux3 2008-10-17 02:12 can you give me a picture or something.. 2008-10-17 02:12 ill make a figure from that? 2008-10-17 02:12 could 2008-10-17 02:12 hmm, great! 2008-10-17 02:13 perhaps this weekend 2008-10-17 02:13 tomorrow i'm going to be busy with work 2008-10-17 02:13 er later today i guess 2008-10-17 02:13 its 2am here 2008-10-17 02:14 oh 2008-10-17 02:14 hmm 2008-10-17 02:14 ok 2008-10-17 02:14 you mind mailing it to me? 2008-10-17 02:14 when u have it? 2008-10-17 02:15 sure 2008-10-17 02:15 okies 2008-10-17 02:20 hello 2008-10-17 02:20 can I ask for forward logging? 2008-10-17 02:21 are there any papers of detail? 2008-10-17 02:24 hirofumi: mail logs i guess 2008-10-17 02:24 ah 2008-10-17 02:25 but, that hard to understand all of detail for me 2008-10-17 02:26 btw, i have some patches. it should go to mailing list? 2008-10-17 02:26 for user/test/* 2008-10-17 02:32 hirofumi: yeah 2008-10-17 02:33 post them to the mailing list 2008-10-17 02:33 ok, thanks. 2008-10-17 02:35 hirofumi: what are those patches related to? 2008-10-17 02:37 little bug fixes, and draw tux3 intenal graph for newbie like me 2008-10-17 02:37 internal graph -> internal format 2008-10-17 02:37 internal graph as in a figure? 2008-10-17 02:37 you have a figure for that? 2008-10-17 02:38 I think some kind of 2008-10-17 02:38 draw by graphviz 2008-10-17 02:39 but, not perfect 2008-10-17 02:39 hmm 2008-10-17 02:39 something is better than nothing 2008-10-17 02:39 :) 2008-10-17 02:39 you posting it to the list? 
2008-10-17 02:39 yeah :) 2008-10-17 02:40 yes 2008-10-17 02:40 ok, waiting for that... 2008-10-17 02:40 ok, I'll post those soon 2008-10-17 02:43 hirofumi, I have written a little bit about it 2008-10-17 02:43 some discussion with matt dillon 2008-10-17 02:43 some in the initial post 2008-10-17 02:43 and there will be some more in tomorrow's post 2008-10-17 02:44 oh, good. btw, what is big difference with usual jornaling? 2008-10-17 02:44 it puts the commit blocks all over the disk, not in a fixed place 2008-10-17 02:45 and it isn't limited to the size of the journal 2008-10-17 02:45 i see. and top of commit is pointed by superblock? 2008-10-17 02:47 btw, i posted some patches. could you check those? 2008-10-17 02:47 yes, re superblock 2008-10-17 02:48 and next commit is pointed by previous commit? 2008-10-17 02:50 yes 2008-10-17 02:51 I'll explain the details tomorrow 2008-10-17 02:51 your patches are looking good 2008-10-17 02:51 i see. thanks. btw, where does it come from? original? 2008-10-17 02:51 thanks 2008-10-17 02:51 I won't get through all ten tonight 2008-10-17 02:51 but by tomorrow 2008-10-17 02:52 original 2008-10-17 02:52 ah, I googled many times. 2008-10-17 02:53 heh 2008-10-17 02:54 hirofumi, you work for ntt or something like that? 2008-10-17 02:54 no 2008-10-17 02:55 student? sysadmin? 2008-10-17 02:55 flips: funny question.. why did u suspect so? 2008-10-17 02:55 foggy memory 2008-10-17 02:55 I'm finding next office, so I have some time for now. 2008-10-17 02:56 going to try that graph drawing patch right now 2008-10-17 02:56 btw, I sent example as reply 2008-10-17 03:03 wow 2008-10-17 03:03 that is beyond awesome 2008-10-17 03:04 pranith, I guess we have some figures now 2008-10-17 03:04 yippeee 2008-10-17 03:04 :) 2008-10-17 03:04 thanks :) kudos to graphviz 2008-10-17 03:06 wow.. 2008-10-17 03:06 the figure looks great.. 
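The forward logging answers above (commit blocks live anywhere on disk, the superblock points at the first one, each commit points at the next) amount to a linked chain of blocks. A rough C sketch of walking such a chain follows. Every name here is hypothetical, the "disk" is just an in-memory array, and the real tux3 commit block format is not described in this log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_BLOCK UINT64_MAX   /* assumed end-of-chain marker */

/* Hypothetical commit block: only the forward link matters here. */
struct commit_sketch {
    uint64_t next;            /* block number of the following commit */
};

/* Hypothetical superblock field pointing at the start of the chain. */
struct super_sketch {
    uint64_t first_commit;
};

/*
 * Follow the chain from the superblock and count the commit blocks found.
 * The count is capped at nblocks so a corrupted, cyclic chain terminates.
 */
static size_t walk_commits(const struct super_sketch *sb,
                           const struct commit_sketch *disk, size_t nblocks)
{
    size_t count = 0;
    uint64_t at = sb->first_commit;

    while (at != NO_BLOCK && at < nblocks && count < nblocks) {
        count++;
        at = disk[at].next;
    }
    return count;
}
```

Because the chain is threaded through ordinary blocks, its length is bounded by free space rather than by a fixed journal area, which is the difference from usual journaling given at 02:45.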
2008-10-17 03:06 hirofumi: thanks 2008-10-17 03:07 all most work by graphviz 2008-10-17 03:07 :) 2008-10-17 03:07 graphviz is just great 2008-10-17 03:09 so far it has not made a graph for me 2008-10-17 03:09 where does it write the output? 2008-10-17 03:10 if it's test/tux3.img, tux3graph should output test/test3.img.dot 2008-10-17 03:10 ok, it put it in /tmp 2008-10-17 03:10 it just add ".dot" postfix 2008-10-17 03:11 and "-v" is for full dump 2008-10-17 03:22 I used the comand: cat testdev.dot | dot -Tpng >testdev.png 2008-10-17 03:22 couldn't get dot to work otherwise 2008-10-17 03:22 I don't understand the syntax 2008-10-17 03:23 hirofumi, what name would you like for out "about us" page? 2008-10-17 03:23 we're listing all contributers 2008-10-17 03:23 and you just became one 2008-10-17 03:23 big time 2008-10-17 03:24 thanks. OGAWA Hirofumi 2008-10-17 03:24 it commented in tux3graph 2008-10-17 03:25 dot -Tpng -O foo.dot 2008-10-17 03:25 ogawa is your family name? 2008-10-17 03:25 yes 2008-10-17 03:25 you want all caps? 2008-10-17 03:26 yes, it may indicate it's family name 2008-10-17 03:26 my version of dot is too old to support -O I guess 2008-10-17 03:26 i see. i'll check... 2008-10-17 03:26 you could also write Ogawa, Hirofumi 2008-10-17 03:27 English way of showing family name 2008-10-17 03:27 for example, Phillips, Daniel 2008-10-17 03:27 I'm sure that's the problem 2008-10-17 03:27 this machine is running Etch 2008-10-17 03:27 so the command syntax I used works 2008-10-17 03:28 English way is Hirofumi Ogawa? I'm not sure 2008-10-17 03:28 i see. I'm using testing (lenny) 2008-10-17 03:29 normally I write Daniel Phillips, but it's also common to write Phillips, Daniel 2008-10-17 03:29 and the comma shows which is the family name 2008-10-17 03:29 Daniel is family name? 2008-10-17 03:30 oops, Phillips is family name? 
2008-10-17 03:39 right 2008-10-17 03:39 all your changes look good 2008-10-17 03:39 I'll get some sleep first though before I start merging them 2008-10-17 03:40 that graph think is really sweet 2008-10-17 03:40 graph thing 2008-10-17 03:41 oh yay, the eee is booting off the dvd 2008-10-17 03:43 ok, it seems I could install centos on the eee 2008-10-17 03:43 not going to though 2008-10-17 03:51 -Werror is killing tux3fuse... 2008-10-17 03:51 i think its better we dont use -Werror for now... 2008-10-17 04:00 sure 2008-10-17 04:00 what are the problematic warnings? 2008-10-17 04:00 posting... 2008-10-17 04:00 time for me to get some zzz's 2008-10-17 04:01 1 min 2008-10-17 04:01 -!- bobby(~bobby@nat-inn.mentorg.com) has joined #tux3 2008-10-17 04:02 i dint understand 2008-10-17 04:02 from the code... 2008-10-17 04:02 cc1: warnings being treated as errors 2008-10-17 04:02 tux3fuse.c: In function ‘tux3_lookup’: 2008-10-17 04:02 tux3fuse.c:78: warning: initialized field overwritten 2008-10-17 04:02 tux3fuse.c:78: warning: (near initialization for ‘ep.attr’) 2008-10-17 04:02 tux3fuse.c:79: warning: initialized field overwritten 2008-10-17 04:02 tux3fuse.c:79: warning: (near initialization for ‘ep.attr’) 2008-10-17 04:02 tux3fuse.c:80: warning: initialized field overwritten 2008-10-17 04:02 tux3fuse.c:80: warning: (near initialization for ‘ep.attr’) 2008-10-17 04:02 the code looks ok to me 2008-10-17 04:06 yes, I don't know what it means by "initialized field overwritten" 2008-10-17 04:07 yup, am on ubuntu, gcc version 4.2.3 2008-10-17 04:07 but we can write that differently 2008-10-17 04:07 in a way that makes it happier maybe 2008-10-17 04:07 like? 2008-10-17 04:07 .attr = { .st_mode = ... }, 2008-10-17 04:07 hmm 2008-10-17 04:07 ok, am trying 2008-10-17 04:09 man I am going to be busy merging patches tomorrow 2008-10-17 04:10 yup, thats working 2008-10-17 04:10 submit a patch?
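For reference, the rewrite suggested at 04:07 looks roughly like this in C99. The original tux3fuse.c lines are not quoted in the log, so this assumes they initialized the nested struct one member at a time (`.attr.st_mode = ...`, `.attr.st_size = ...`), which is the style that draws "initialized field overwritten" from that gcc version. The struct names below are made-up stand-ins for the FUSE entry parameter and its stat member.

```c
#include <assert.h>

/* Hypothetical stand-ins for struct stat and the FUSE entry struct. */
struct attr_sketch {
    unsigned mode;
    long size;
    long mtime;
};

struct entry_sketch {
    unsigned long ino;
    struct attr_sketch attr;
};

static struct entry_sketch make_entry(unsigned long ino, unsigned mode,
                                      long size, long mtime)
{
    /*
     * The nested struct is initialized once, with a single brace-enclosed
     * designated initializer, instead of one ".attr.member = value" entry
     * per field. Both forms are valid C99; this one simply gives the
     * compiler nothing it could report as an overwritten initializer.
     */
    struct entry_sketch ep = {
        .ino = ino,
        .attr = { .mode = mode, .size = size, .mtime = mtime },
    };
    return ep;
}
```

Members not named in the inner braces are zero-initialized, the same as with the per-member form.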
2008-10-17 04:16 sure 2008-10-17 04:16 I think the compiler was wrong actually 2008-10-17 04:16 about the initialized field overwritten 2008-10-17 04:16 might also be worth a gcc bug report 2008-10-17 04:16 hmm 2008-10-17 04:17 ohk... 2008-10-17 04:17 or at least a post to the gcc list 2008-10-17 04:17 ok, will do that 2008-10-17 04:17 just the struct and the error message, ask if it's worth a bug report 2008-10-17 04:17 struct init I meant 2008-10-17 04:18 ok, doing it now.. 2008-10-17 04:19 flips: iattr.c:109: warning: left shift count >= width of type 2008-10-17 04:19 im sure this warning is also not genuine 2008-10-17 04:19 this is with gcc 3.4.6 on rhel 5 2008-10-17 04:20 ACTION looks 2008-10-17 04:21 does it still give the warning with (((u64)root->depth) << 48) ? 2008-10-17 04:23 yup 2008-10-17 04:23 did that :) 2008-10-17 04:23 still gives the same 2008-10-17 04:23 warning 2008-10-17 04:26 seems like the compiler is wrong indeed 2008-10-17 04:26 yup, done with the mail to gcc and gcc-bugs 2008-10-17 04:26 submitting patch to tux3... 2008-10-17 04:27 anyway, it is evil to give a warning when the shift count is == to the object size 2008-10-17 04:27 that is perfectly legitimate in many cases 2008-10-17 04:27 idiot compiler hackers ;) 2008-10-17 04:27 lol 2008-10-17 04:27 but hats off to them 2008-10-17 04:27 that too 2008-10-17 04:27 i find compiler writing a boring thing 2008-10-17 04:28 the price is right as well 2008-10-17 04:28 but storage is exciting? 2008-10-17 04:28 gets more attention anyway 2008-10-17 04:28 compiler only gets attention when it doesn't work 2008-10-17 04:28 hehe 2008-10-17 04:28 ok, I really have to sleep 2008-10-17 04:28 storage is THE thing 2008-10-17 04:29 hehe, gudnite 2008-10-17 04:29 thanks for all the great work everybody 2008-10-17 04:35 woohoo 2008-10-17 04:35 http://tux3.org 2008-10-17 04:36 i'm sorry, frames are so 1997 2008-10-17 04:41 new look? 
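On the "left shift count >= width of type" warning at 04:19: in C, shifting by an amount greater than or equal to the width of the promoted left operand is undefined, so a 48-bit shift is only safe once the operand has been widened to 64 bits, as the cast tried at 04:21 does. The sketch below shows that pattern in isolation. The packing layout (16 bits of depth above a 48-bit block number) is an assumption for illustration, and the log does not establish why the gcc in question kept warning after the cast.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pack a tree depth into the top 16 bits of a 64-bit word, above a
 * 48-bit block number. The cast happens before the shift, so the shift
 * is done on a 64-bit operand and the count 48 is within range.
 */
static uint64_t pack_root(unsigned depth, uint64_t block)
{
    assert(depth < (1u << 16));
    assert(block < ((uint64_t)1 << 48));
    return ((uint64_t)depth << 48) | block;
}

/* Recover the depth from the top 16 bits. */
static unsigned root_depth(uint64_t packed)
{
    return (unsigned)(packed >> 48);
}

/* Recover the block number from the low 48 bits. */
static uint64_t root_block(uint64_t packed)
{
    return packed & (((uint64_t)1 << 48) - 1);
}
```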
2008-10-17 04:43 shapor: still waiting for the dleaf picture :) 2008-10-17 04:43 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-17 05:16 -!- FelipeS(~Felipe@lawn-128-61-26-125.lawn.gatech.edu) has joined #tux3 2008-10-17 06:10 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-17 06:55 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-10-17 07:41 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-17 07:52 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-17 08:07 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-17 08:13 -!- cydork_(~cydoork@122.169.100.164) has joined #tux3 2008-10-17 08:55 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-17 08:57 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-17 09:38 -!- prani(~bobby@122.163.49.185) has joined #tux3 2008-10-17 09:38 anyone here/ 2008-10-17 09:52 -!- prani(~bobby@122.163.49.185) has joined #tux3 2008-10-17 10:07 -!- mingming(~mingming@c-24-22-117-202.hsd1.or.comcast.net) has joined #tux3 2008-10-17 10:07 -!- prani(~bobby@122.163.49.185) has joined #tux3 2008-10-17 10:16 -!- prani(~bobby@122.163.49.185) has joined #tux3 2008-10-17 10:27 hmmm 2008-10-17 10:44 ponk 2008-10-17 11:00 grrr... switching to the latest 2.6 head is not painless :| 2008-10-17 11:13 who the hack is CONFIG_X86_WP_WORKS_OK? 2008-10-17 11:39 yeah... I'll stick with 2.6.26 for now :| 2008-10-17 12:17 hmmm 2008-10-17 12:17 been trying to figure out the leak reported by valgrind... 2008-10-17 12:17 hirofumi just posted a patch... 2008-10-17 12:21 ACTION added tar.gz snapshots of the hg repo for download on the tux3.org page 2008-10-17 12:39 shapor, hello 2008-10-17 12:40 ACTION really needs to understand the dleaf format 2008-10-17 13:12 prani, it's not memory leak. it read the memory beyond buffer size (e.g. rewind.group points the tail of buffer if leaf->groups == 0). 
2008-10-17 13:57 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-17 16:25 ACTION getting closer to working on merges 2008-10-17 16:26 hirofumi, are you able to expose your mercurial repository on the net? 2008-10-17 16:27 anybody, what's a decent fast way to make diagrams on linux, so I can diagram the dleaf format? 2008-10-17 16:27 I've used xfig, it's powerful but weird 2008-10-17 16:27 and the dia clone, I forget what it's called 2008-10-17 16:27 name changed somewhere along the way 2008-10-17 16:27 pretty lame 2008-10-17 16:27 inkscape... tedious 2008-10-17 16:28 maybe I'll try inkscape again 2008-10-17 16:28 I like xfig... 2008-10-17 16:28 I like it, but I always end up tearing my hair out 2008-10-17 16:28 I find hard to control things in inkscape... perhaps I'm too old :| 2008-10-17 16:28 and it's really a lot of work making slick diagrams with it 2008-10-17 16:28 what are you trying to diagram? 2008-10-17 16:29 the dleaf format 2008-10-17 16:29 like I did no the whiteboard 2008-10-17 16:29 you need an expert in illustrator to do that? 2008-10-17 16:29 maybe shapor can remembe rthe whiteboard diagram and come up with some picture 2008-10-17 16:29 have you took a picture of the whiteboard? :D 2008-10-17 16:29 tim_dimm, would help 2008-10-17 16:29 I did 2008-10-17 16:29 I'm game 2008-10-17 16:29 in my phone 2008-10-17 16:29 can get it out with bluetooth 2008-10-17 16:30 OmniGraffle is quite nice for some stuff... 2008-10-17 16:30 ok, later for that one 2008-10-17 16:30 nice name 2008-10-17 16:30 if you can get it out to me, I'll work it out 2008-10-17 16:30 http://www.omnigroup.com/applications/OmniGraffle/ :P 2008-10-17 16:31 including pictures from there in papers doesn't work too well... 
:| 2008-10-17 16:31 need to do any diagrams with an opensource tool chain 2008-10-17 16:31 or at least the file formats have to be open and editable by open tools 2008-10-17 16:32 then inkscape might be the best option 2008-10-17 16:32 quite possibly 2008-10-17 16:32 do it in 3D with blender 2008-10-17 16:32 ;-) 2008-10-17 16:33 what?? :D 2008-10-17 16:33 3D graphics tool 2008-10-17 16:33 I know about it 2008-10-17 16:33 in 3D text mode with aalib 2008-10-17 16:33 looks like a big hammer for this 2008-10-17 16:33 tim was joking 2008-10-17 16:33 ascii?? :D 2008-10-17 16:34 I was joking, sorta 2008-10-17 16:34 anybody who hasn't run quake in text mode needs to 2008-10-17 16:34 with the complexity of your design, how cool would it be to explore it in 3D space? 2008-10-17 16:34 I actually run it once some time ago... 2008-10-17 16:35 it was doom II perhaps... don't remember exactly 2008-10-17 16:35 tim_dimm, sure, every project benefits from adding some smoke and caustics to their data structure diagrams 2008-10-17 16:35 physics 2008-10-17 16:37 povray anyone? :P 2008-10-17 16:39 I'm holding out for a shader to emboss them in 3D and I can blow them to dust with my bfg 2008-10-17 16:41 inkscape package is broken on etch, I'm going back to my tech post 2008-10-17 16:49 OT: http://ozviz.wasp.uwa.edu.au/~pbourke/miscellaneous/scifigure/ 2008-10-17 16:50 I'm sure tim was only half joking 2008-10-17 16:50 we really need our diagrams sitting on shiny disks like that 2008-10-17 16:52 ok ok ok ok :D 2008-10-17 16:54 i could output a library of images 2008-10-17 16:55 you could arrange them any way you want with little lines and everything 2008-10-17 16:55 got to get down to the strand with my camera tonight 2008-10-17 16:56 get some of that brushfire sunset 2008-10-17 16:56 show you point n shoot lamers what a camera can do ;) 2008-10-17 16:56 oh, should I break out the 8x10 camera? 
2008-10-17 16:56 only if it's a point n shoot 2008-10-17 16:57 point, pull focus, point again, pull focus some more, then polaroid, then shoot, but only one sheet at a time 2008-10-17 16:57 plus it won't fit in my pocket 2008-10-17 16:57 wish I never sold that thing 2008-10-17 16:58 used to shoot portraits with it 2008-10-17 16:58 ACTION only has medium format cameras :P 2008-10-17 16:58 ah, found my filters 2008-10-17 16:58 polaroid was discontinued, right? 2008-10-17 16:59 flips: did you download the picture of the whiteboard? 2008-10-17 16:59 razvanm, not yet, I need to transfer it to somebody with bluetooth 2008-10-17 16:59 get it out of the t-mobile feature phone 2008-10-17 16:59 probably to tim's ipod 2008-10-17 17:00 or my laptop 2008-10-17 17:00 or maybe mac 2008-10-17 17:00 right 2008-10-17 17:00 :D 2008-10-17 17:00 I'll suspend any mac trashing during the transfer process 2008-10-17 17:11 folks 2008-10-17 17:11 flips: so you do have/use a mac? 2008-10-17 17:12 razvanm, not me 2008-10-17 17:12 being a linux apostle and all 2008-10-17 17:12 :D 2008-10-17 17:13 besides, I have too many fingers for a one button mouse 2008-10-17 17:13 and too many brain cells ;) 2008-10-17 17:13 ACTION ducks 2008-10-17 17:13 :D 2008-10-17 17:14 I got one finger for you ;-) 2008-10-17 17:14 http://osxbook.com/ 2008-10-17 17:14 huge book 2008-10-17 17:14 a googler also ;-) 2008-10-17 17:14 ACTION invokes his invisibility spell 2008-10-17 17:15 it seems to work ;-) 2008-10-17 17:17 ACTION has a question about d_child and d_subdirs 2008-10-17 17:18 sk8 oclock 2008-10-17 17:18 enjoy 2008-10-17 17:18 and take nice pictures :P 2008-10-17 17:19 I'm just thinking how well skates and camera combine 2008-10-17 17:19 rapid change of viewpoint 2008-10-17 17:19 great way to destroy equipment too 2008-10-17 17:19 perfect ;-) 2008-10-17 17:19 true also 2008-10-17 18:48 ACTION is exploring dentry_unused... 
2008-10-17 19:39 usbdevfs is seriously braindamaged 2008-10-17 19:39 terminally fragile 2008-10-17 19:42 hmm, connect(3, {sa_family=AF_FILE, path="/var/run/dbus/system_bus_socket"}, 33) = 0 2008-10-17 19:43 dbus getting involved in anything gives me a sense of impending doom 2008-10-17 19:54 http://tux3.org/woohoo.jpg 2008-10-17 20:00 folks 2008-10-17 20:05 -!- prani(~bobby@122.162.70.80) has joined #tux3 2008-10-17 20:07 hello 2008-10-17 20:09 file:///var/www/tux3/paradise.jpg 2008-10-17 20:09 check it out 2008-10-17 20:09 huh 2008-10-17 20:09 var? 2008-10-17 20:09 file? 2008-10-17 20:09 whoops 2008-10-17 20:09 hehe 2008-10-17 20:09 http://tux3.org/paradise.jpg 2008-10-17 20:09 :) 2008-10-17 20:11 is that a shooting star? 2008-10-17 20:11 its too bright for that... 2008-10-17 20:12 a meteoroid i guess.. 2008-10-17 20:12 jet 2008-10-17 20:12 coming in to lax 2008-10-17 20:12 huh! 2008-10-17 20:12 nice... 2008-10-17 20:12 well going away actually 2008-10-17 20:13 and too high for lax 2008-10-17 20:13 military possibly 2008-10-17 20:13 or flying from san diego somewhere 2008-10-17 20:13 hmm 2008-10-17 20:14 nice evening... 2008-10-17 20:14 yah, it's been like that all week 2008-10-17 20:14 bushfire special 2008-10-17 20:14 is that by the sea side? 2008-10-17 20:17 flips, mind explaining the figure by hirofumi ? 2008-10-17 20:17 the blocks... 2008-10-17 20:17 it shows graphically the interrelationship between blocks 2008-10-17 20:18 http://tux3.org/boardwalk.jpg 2008-10-17 20:18 this is where I work on the design each day by the way 2008-10-17 20:19 u have any pic of urs? 2008-10-17 20:20 bful place you live in 2008-10-17 20:20 I took these pics tonight 2008-10-17 20:20 about an hour ago 2008-10-17 20:21 http://tux3.org/casadelmere.jpg 2008-10-17 20:22 where we practice skating down the steps 2008-10-17 20:22 crazy guys like tim and shapor 2008-10-17 20:24 http://tux3.org/thepier.jpg 2008-10-17 20:28 nice... 
2008-10-17 20:28 http://tux3.org/carousel.jpg 2008-10-17 20:28 that's it, high points for tonight 2008-10-17 20:28 was fun skating with the camera 2008-10-17 20:29 someone shud have taken your pics tooo 2008-10-17 20:29 :) 2008-10-17 20:29 had to get from the skate park back to santa monica in about 5 minutes to catch the sunset 2008-10-17 20:29 flips: ur bandwidth is teh suck 2008-10-17 20:29 yeah.. pretty slow 2008-10-17 20:29 maybe tim_dimm will bring a real camera one time 2008-10-17 20:30 tim_dimm showed me a few videos last time 2008-10-17 20:30 skating down the hill 2008-10-17 20:30 http://tux3.org/casadelmere.jpg what is low bandwidth about this? 2008-10-17 20:30 oh 2008-10-17 20:30 my uplink 2008-10-17 20:30 true, and expensive too 2008-10-17 20:30 speakeasy 2008-10-17 20:30 not sure why I stick with them 2008-10-17 20:31 flips: http://shapor.com/bashgal 2008-10-17 20:31 I wonder if I can take my ip with me and go to verizon 2008-10-17 20:31 no way 2008-10-17 20:32 I thought there was a law that said I could 2008-10-17 20:32 whats with the ip? 2008-10-17 20:32 you have the domanin name... 2008-10-17 20:32 was ist das bashgal? 2008-10-17 20:32 sure 2008-10-17 20:32 redirect to the new site for a few days until the dns servers are updated... 2008-10-17 20:32 generates thumbnails 2008-10-17 20:32 and smaller pics 2008-10-17 20:33 right 2008-10-17 20:33 well 2008-10-17 20:33 i can lower the ttl so dns can flip over in 5 min 2008-10-17 20:33 every one of my pixels is loaded with goodness 2008-10-17 20:33 :) 2008-10-17 20:33 :) 2008-10-17 20:33 resolution is kind of rediculous on those 2008-10-17 20:34 flips, the dleaf block in the diagram ... 2008-10-17 20:34 in bitmap_dtree 2008-10-17 20:34 the extent table seems to be at the top... 
2008-10-17 20:35 followed by the entry and group index 2008-10-17 20:35 http://tux3.org/carousel.jpg <- on the right is a nice ramp I slalom down and tim does pirouettes down 2008-10-17 20:35 ok 2008-10-17 20:35 back to tux3 :) 2008-10-17 20:37 prani, low addresses are at the top of the box 2008-10-17 20:37 hm 2008-10-17 20:38 extent table is at the bottom of a dlead, just after the header that mainly has the groups count 2008-10-17 20:38 dleaf 2008-10-17 20:39 whats the field immediately after the magic field.. 2008-10-17 20:39 3rd row 2008-10-17 20:39 says extent 0 2008-10-17 20:40 hmm, so we are growing from higher address to lower address 2008-10-17 20:40 higher address is the actual top of the leaf... 2008-10-17 20:40 ? 2008-10-17 20:40 the block directory (dictionary) grows down, the extent table grows up 2008-10-17 20:40 lower address at the top of the dleaf box 2008-10-17 20:41 the block directory (dictionary) grows down towards lower addresses, the extent table grows up towards higher addresses 2008-10-17 20:41 hmm.. ok 2008-10-17 20:42 I have to get all those patches merged 2008-10-17 20:42 and get my post out 2008-10-17 20:42 sigh 2008-10-17 20:43 hardly time to fit in any sk8photography 2008-10-17 20:43 skateography 2008-10-17 20:46 :) 2008-10-17 20:47 flips: http://shapor.com/dp/ 2008-10-17 20:47 thats bashgal 2008-10-17 20:47 1. put pics in directory 2008-10-17 20:47 2. cd to directory 2008-10-17 20:47 3. run bashgal script :) 2008-10-17 20:47 autorotates and thumbnail generation 2008-10-17 20:47 I better get summa dat 2008-10-17 20:49 shapor, whos that skating guy?
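The dleaf layout described between 20:37 and 20:41 (header at the low end, extent table growing up from just after it, dictionary growing down from the high end) is the familiar two-ended block layout, where the free space is whatever lies between the two regions. A small C sketch of the bookkeeping follows. All sizes and names are invented for illustration, and it simplifies by adding one extent and one dictionary entry together; the real dleaf has separate group and entry dictionaries and is not specified in this log.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes, for illustration only. */
enum {
    LEAF_SIZE = 4096,       /* whole block */
    HEADER_SIZE = 8,        /* magic, groups count, ... */
    EXTENT_SIZE = 8,        /* one extent table entry */
    DICT_ENTRY_SIZE = 4     /* one dictionary entry */
};

struct leaf_sketch {
    size_t extents;         /* entries in the table growing up */
    size_t dict_entries;    /* entries in the dictionary growing down */
};

/* First byte past the extent table (low end of the free gap). */
static size_t extents_end(const struct leaf_sketch *leaf)
{
    return HEADER_SIZE + leaf->extents * EXTENT_SIZE;
}

/* First byte of the dictionary (high end of the free gap). */
static size_t dict_start(const struct leaf_sketch *leaf)
{
    return LEAF_SIZE - leaf->dict_entries * DICT_ENTRY_SIZE;
}

/* Free bytes left between the two regions. */
static size_t leaf_free(const struct leaf_sketch *leaf)
{
    return dict_start(leaf) - extents_end(leaf);
}

/* Add one extent plus its dictionary entry; -1 means the leaf is full. */
static int leaf_add(struct leaf_sketch *leaf)
{
    if (leaf_free(leaf) < EXTENT_SIZE + DICT_ENTRY_SIZE)
        return -1;
    leaf->extents++;
    leaf->dict_entries++;
    return 0;
}
```

With both regions growing toward the middle, neither needs a fixed size up front, and the leaf is full only when the two meet.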
2008-10-17 20:49 http://shapor.com/dp/.600/.html/carousel.jpg.html <- f2.8, nailed it 2008-10-17 20:50 that's just some sk8er dude 2008-10-17 20:50 I just asked for volunteers to do a couple tricks for the camera 2008-10-17 20:51 lol 2008-10-17 20:53 http://tux3.org/wheee.jpg 2008-10-17 20:53 this one needed a little faster exposure 2008-10-17 20:54 rather nice motion blur on the background 2008-10-17 20:55 guy on the bike looks like he's doing something unnatural with his nose though 2008-10-17 20:55 kay 2008-10-17 20:55 I'd better try bashgal 2008-10-17 20:55 optimize my pathetic bandwidth 2008-10-17 20:55 and I don't really need to share my potential postcard resolution with the entire internet 2008-10-17 20:56 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-17 20:56 hello 2008-10-17 20:56 Hi! 2008-10-17 20:56 hi 2008-10-17 20:57 flips: that shots a bit blurry 2008-10-17 20:57 contrast seems to bleed on that camera 2008-10-17 20:57 high contrast* 2008-10-17 20:57 that's what I was saying 2008-10-17 20:57 where? where? 2008-10-17 20:57 boardwalk where venice meets santa monica 2008-10-17 20:57 http://shapor.com/dp/.600/.html/carousel.jpg.html 2008-10-17 20:57 daniels photos from tonight 2008-10-17 20:58 hmm.. what's with the jpeg artifacts? :P 2008-10-17 20:58 what I needed was a big honking flash 2008-10-17 20:58 I like it :P 2008-10-17 20:58 why?? 2008-10-17 20:59 to get the fast closeup action 2008-10-17 20:59 I would have make the bottom part even darker ;-) 2008-10-17 20:59 no, I mean a big flash :) 2008-10-17 20:59 http://shapor.com/dp/.600/.html/thepier.jpg.html 2008-10-17 21:00 I think the pictures would look better with some more saturation to the colors ;-) 2008-10-17 21:00 I'll gimp them ;) 2008-10-17 21:00 unforunately I didn't get raws for those 2008-10-17 21:00 turned off raw for the sk8t pics 2008-10-17 21:00 oh... 
2008-10-17 21:01 the raw would have been handy here :P 2008-10-17 21:01 yo have more space in the buffer? 2008-10-17 21:01 let's see which ones got it 2008-10-17 21:01 yo = to 2008-10-17 21:01 ya, turned it off right at the beginning of the shoot 2008-10-17 21:02 next time will remember to turn it back on for the senery 2008-10-17 21:02 pretty nice jpgs though 2008-10-17 21:02 flips: nothing a decent p&s couldnt do ;) 2008-10-17 21:02 I don't like the jpg artifacts... :| 2008-10-17 21:02 show me :) 2008-10-17 21:02 could you increase the quality when the make the thumbs? :P 2008-10-17 21:03 http://tux3.org/carousel.jpg <- this one has saturation out the yinyang 2008-10-17 21:04 lookit those highlights 2008-10-17 21:04 almost not like digital 2008-10-17 21:05 though I reall want a 12 bit dac on my next camera 2008-10-17 21:05 http://www.flickr.com/cameras/ricoh/gr_digital/ 2008-10-17 21:06 flips: can you share the original from carousel? :P 2008-10-17 21:06 ACTION doesn't trust cameras with square holes at the front 2008-10-17 21:06 square holes at the front? 2008-10-17 21:06 that's the original 2008-10-17 21:06 aaa... ok :D 2008-10-17 21:06 check it out at one to one 2008-10-17 21:07 let me play a little with it :P 2008-10-17 21:08 ACTION feels an unsharp mask coming 2008-10-17 21:08 nooo :P 2008-10-17 21:08 :) 2008-10-17 21:09 I'd like my next camera to have 1/4 the noise at that light level too 2008-10-17 21:09 same as wanting a 12 bit adc 2008-10-17 21:09 I like noise ;-) 2008-10-17 21:09 I like being able to turn it off 2008-10-17 21:09 ACTION likes signal 2008-10-17 21:09 noise is data too 2008-10-17 21:10 you slept through your information theory class 2008-10-17 21:10 :P 2008-10-17 21:10 http://razvan.musaloiu.com/shoebox/ till I play with some photoshop... 2008-10-17 21:11 noise is to signal as shit is to strawberries 2008-10-17 21:11 RazvanM: some nice photos 2008-10-17 21:11 shapor: thanks... 
:P 2008-10-17 21:12 art :) 2008-10-17 21:17 why current kernel (junkfs) doesn't use page/buffer cache? 2008-10-17 21:18 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-17 21:19 hirofumi, just because we haven't added that yet 2008-10-17 21:20 we going to use those normal way? 2008-10-17 21:20 mostly 2008-10-17 21:20 i see. 2008-10-17 21:20 but we will use those bio transfer functions directly to update buffer pages instead of using the block library I think 2008-10-17 21:21 and we're going to do some pretty fancy things with the buffers in page cache to avoid blocking 2008-10-17 21:21 writing about that now 2008-10-17 21:21 i see 2008-10-17 21:22 and it manage locking of buffers for forward logging? 2008-10-17 21:30 it will lock buffers, but not for forward logging 2008-10-17 21:30 manages not overwriting buffers that haven't been written out yet 2008-10-17 21:30 by removing them from the buffer cache before overwriting 2008-10-17 21:32 http://farm4.static.flickr.com/3011/2951108022_9cc58e9464_b.jpg 2008-10-17 21:32 flips: perhaps not the way you wanted :P 2008-10-17 21:49 um.. ah, COW? well, it seems I have to wait next doc. thanks. 2008-10-17 22:34 razvanm :) 2008-10-17 22:34 hirofumi, right 2008-10-17 22:34 mooo 2008-10-17 22:37 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-17 23:19 ACTION is (re)watching Chunking Express... 2008-10-18 00:00 have you seen fallen angels? 2008-10-18 00:15 yup 2008-10-18 00:15 (just finished the Chinking Express) 2008-10-18 00:16 just saw both recently myself 2008-10-18 00:16 I also liked Falledn Angels :P 2008-10-18 00:16 :D 2008-10-18 00:16 I think it's his best 2008-10-18 00:16 how about the Ashes of Time? :D 2008-10-18 00:16 or maybe I just like vinyl skirts 2008-10-18 00:16 ashes was funny 2008-10-18 00:17 the first one I watch was In the Mood for Love... 
2008-10-18 00:17 put me to sleep 2008-10-18 00:17 :P 2008-10-18 00:17 I watched Ashed of Time several times to understand something from it ;-) 2008-10-18 00:17 you can only take so many hours of chungsams 2008-10-18 00:17 I also like it though ;-) 2008-10-18 00:18 chungsams ?!? 2008-10-18 00:18 chinese dress 2008-10-18 00:18 (I also had to watch a bunch of time In The Mood for Love to figure out who is who ;-)) 2008-10-18 00:18 :D 2008-10-18 00:22 http://images.google.com/images?hl=en&q=Cheongsam 2008-10-18 00:22 nice... 2008-10-18 00:22 somehow related: http://razvan.musaloiu.com/2006/12/03/zhou-yu 2008-10-18 00:22 they've been working on it for thousands of years I understand 2008-10-18 00:23 they still love amazing :D 2008-10-18 00:23 (to me at least :P) 2008-10-18 00:24 have you seen chinese ghost story? 2008-10-18 00:24 don't think so? 2008-10-18 00:24 (they should have more pictures here: http://en.wikipedia.org/wiki/Cheongsam ) 2008-10-18 00:25 http://en.wikipedia.org/wiki/A_Chinese_Ghost_Story 2008-10-18 00:25 if you have a thing for chinese you need to see it 2008-10-18 00:25 http://www.imdb.com/title/tt0093978/ 2008-10-18 00:26 I never heard of it :( 2008-10-18 00:26 cult hit 2008-10-18 00:26 87... I was in 2nd grade then :P 2008-10-18 00:28 awesome... they have at the library 2008-10-18 00:29 (I canceled my netflix subscription a few weeks ago) 2008-10-18 00:29 figured you saw everything worth seeing? 2008-10-18 00:29 nope... I didn't have time to see any of the movies for more than one year :D 2008-10-18 00:30 it was a relief 2008-10-18 00:30 :D 2008-10-18 00:31 I had to watch again Chunking Express though... 
it fades in my memory and I need a refresh from time to time 2008-10-18 00:31 some sort of timeout ;-) 2008-10-18 00:32 now I'll head to bed 2008-10-18 00:32 good night 2008-10-18 00:32 I fixed one last leak for romfs ;-) 2008-10-18 00:33 tomorrow I'll try minix :P 2008-10-18 00:33 and time to watch a netflix movie with my wife 2008-10-18 00:33 doctor strangelove 2008-10-18 00:33 aaaaaaaaaaaa 2008-10-18 00:33 very-very nice! :D 2008-10-18 00:33 extra nice ;-) 2008-10-18 00:33 enjoy :D 2008-10-18 00:33 :) 2008-10-18 00:33 see you 2008-10-18 00:33 bye 2008-10-18 01:04 -!- Bobby_(~Bobby@122.162.70.80) has joined #tux3 2008-10-18 01:04 hey all 2008-10-18 02:16 -!- Bobby_(~Bobby@122.162.74.234) has joined #tux3 2008-10-18 02:20 -!- Bobby_(~Bobby@122.162.72.218) has joined #tux3 2008-10-18 02:22 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-18 02:23 -!- prani(~bobby@122.162.72.218) has joined #tux3 2008-10-18 03:33 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-18 03:34 -!- Bobby_(~Bobby@122.162.72.218) has joined #tux3 2008-10-18 03:46 -!- prani(~bobby@122.162.72.218) has joined #tux3 2008-10-18 06:57 -!- pgquiles(~pgquiles@252.Red-83-41-113.dynamicIP.rima-tde.net) has joined #tux3 2008-10-18 07:46 -!- stargazr5(~gauravstt@59.95.27.129) has joined #tux3 2008-10-18 08:58 -!- stargazr5(~gauravstt@59.95.22.5) has joined #tux3 2008-10-18 09:24 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-18 09:37 -!- Bobby_(~Bobby@122.162.72.218) has joined #tux3 2008-10-18 10:23 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-18 11:00 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-18 12:01 -!- bobby(~bobby@122.162.68.224) has joined #tux3 2008-10-18 12:01 -!- Bobby_(~Bobby@122.162.68.224) has joined #tux3 2008-10-18 12:01 hey all 2008-10-18 12:20 flips, shapor dleaf pics pleaseee 2008-10-18 12:20 i think i got a pretty good idea from 
hirofumi's diagram 2008-10-18 12:20 it shows the dleaf structure there... 2008-10-18 12:23 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-18 14:20 ACTION is experimenting with mercurial... 2008-10-18 14:38 mercurial rocks 2008-10-18 14:44 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-18 15:08 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-18 16:17 ACTION is trying to revive an old G3 with Panther... 2008-10-18 16:26 -!- flips(~phillips@phunq.net) has joined #tux3 2008-10-18 16:37 http://tux3.org/images/.600/.html/boardwalk.jpg.html <- gallery courtesy of shapor 2008-10-18 16:38 probably should move this to phunq 2008-10-18 16:38 http://phunq.net/images/ <- there we go 2008-10-18 16:40 nice... can you take more pictures? 2008-10-18 16:40 you too :) 2008-10-18 16:40 nice art you made yesterday 2008-10-18 16:40 very postcard 2008-10-18 16:41 http://phunq.net/images/sunset/ <- new home 2008-10-18 16:42 I like the 9x16 format ;-) 2008-10-18 16:42 entirely shapor's doing 2008-10-18 16:42 well and the camera 2008-10-18 16:42 no crops or retouching 2008-10-18 16:43 http://phunq.net/images/sunset/.1024/.html/carousel.jpg.html <- what do you call the darkening effect at the corners 2008-10-18 16:43 oh... so shapor took all the pictures? :D 2008-10-18 16:43 heh, did all the gallery processing 2008-10-18 16:43 http://en.wikipedia.org/wiki/Vignetting 2008-10-18 16:43 :D 2008-10-18 16:43 right 2008-10-18 16:44 especially prominent at f2.8 2008-10-18 16:44 what camera was it? 2008-10-18 16:44 wide angle 2008-10-18 16:44 20d 2008-10-18 16:44 aaa 2008-10-18 16:44 oldie but goodie 2008-10-18 16:45 can you take more? :D 2008-10-18 16:45 you are skating every day, right? 2008-10-18 16:46 right 2008-10-18 16:46 tonight without the camera backpack though 2008-10-18 16:46 stupid q: what kind of skating? 
2008-10-18 16:46 roller ;) 2008-10-18 16:46 :D 2008-10-18 16:46 street I guess is the word 2008-10-18 16:46 why a backpack? 2008-10-18 16:47 camera backpack 2008-10-18 16:47 you don't want a holster swinging around 2008-10-18 16:47 I would just tied the camera to my hand :P 2008-10-18 16:47 when slaloming down the hill 2008-10-18 16:47 3 kilos of camera 2008-10-18 16:47 uh... 2008-10-18 16:47 aaa... the lens! :D 2008-10-18 16:47 right 2008-10-18 16:47 do you have a cheap 50mm? :P 2008-10-18 16:47 the f1.8 type ;-) 2008-10-18 16:47 let me see, is it really that much 2008-10-18 16:47 I do 2008-10-18 16:48 takes good pics 2008-10-18 16:48 http://www.dpreview.com/reviews/specs/Canon/canon_eos20d.asp 2008-10-18 16:48 770g 2008-10-18 16:48 in some situations 2008-10-18 16:48 and the lens is 1.something kilos 2008-10-18 16:48 so 2 kilos 2008-10-18 16:48 the 1.8 would e also nice ;-) 2008-10-18 16:48 I'll get a couple primes 2008-10-18 16:48 btw, do you drive there? 2008-10-18 16:49 skate 2008-10-18 16:49 or is that close to where you live? 2008-10-18 16:49 along ocean 2008-10-18 16:49 nice :_) 2008-10-18 16:49 rich people's sidewalk 2008-10-18 16:49 then down a fairly respectable hill 2008-10-18 16:49 so how long is this? 2008-10-18 16:49 several km? 2008-10-18 16:50 (I left my roller blades in Ro :|) 2008-10-18 16:50 right, about 10 km/day 2008-10-18 16:50 when I don't skate with tim that is 2008-10-18 16:50 with tim is more? 
:P 2008-10-18 16:51 (something funny, you sk8 hour is around 8 here :D) 2008-10-18 16:51 heh 2008-10-18 16:53 http://phunq.net/images/sunset/.1024/.html/carousel.jpg.html <- on the right side of that pic is a ramp I slalom on 2008-10-18 16:53 there are a bunch of posts at the bottom, so it has to be controlled 2008-10-18 16:53 shapor would probably just tuck and aim for a gap between the posts 2008-10-18 16:53 :P 2008-10-18 16:53 narrowly missing a couple baby carriages at the bottom 2008-10-18 16:54 tim does airials and spins all the way down 2008-10-18 16:54 ?? 2008-10-18 16:54 narrlowly missing the posts at the bottom, but not by chance as with shapor ;) 2008-10-18 16:55 aerial -> jump up in the air and do a trick 2008-10-18 16:55 like land backwards 2008-10-18 16:55 or spin 2008-10-18 16:55 oh... you guys are all good :P 2008-10-18 16:55 shapor and I are relative beginners 2008-10-18 16:55 tim is hardcore 2008-10-18 16:56 any pictures of them? 2008-10-18 16:56 there were two guys in your pictures... 2008-10-18 16:56 random dudes 2008-10-18 16:56 we'll get pics one day 2008-10-18 16:56 tim took a couple of me with his iphone 2008-10-18 16:56 noisy 2008-10-18 16:57 and timing is almost impossible with the shutter lag 2008-10-18 16:57 I'll bring down the canon and hand it to him one day 2008-10-18 16:57 let me guess, it will contain 4 people and the letters on their shirts will spell: t u x 3 :D 2008-10-18 16:57 heh 2008-10-18 16:57 about time to get rolling 2008-10-18 16:58 enjoy 2008-10-18 16:58 the G3 is almost ready... 2008-10-18 16:58 g3? 2008-10-18 16:59 PPC G3 2008-10-18 16:59 there was an old one laying around 2008-10-18 16:59 266Mhz 2008-10-18 16:59 I just put 2x128 in it 2008-10-18 16:59 ah 2008-10-18 16:59 os/x ? 2008-10-18 16:59 yup 2008-10-18 16:59 10.3 2008-10-18 17:00 I didn't have any luck with any of the bsd 2008-10-18 17:00 somebody else tried a linux but there was no X 2008-10-18 17:01 what will you use it for? 2008-10-18 17:01 fuse? 
2008-10-18 17:01 I wanted to make it work when I didn't have a PPC around 2008-10-18 17:02 the tricks that I did to compile the linux modules on my mac depends on the fact that is intel 2008-10-18 17:02 I wanted to get them to compile on PCC also 2008-10-18 17:03 Ral has a PPC but I didn't want to work on her machine 2008-10-18 17:03 she is using wifi sometime :P 2008-10-18 17:04 hardcore 2008-10-18 17:53 wow... for minix I have 50 more functions to add :| 2008-10-18 18:20 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-18 18:27 block_sync_page looks can be nothing... 1 down, 49 to go 2008-10-18 18:36 find_first_zero_bit... 2 down, 48 to go 2008-10-18 18:48 clear_inode... 3 down, 47 to go 2008-10-18 19:08 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-18 19:57 kmap_atomic... 4 down, 46 to go 2008-10-18 20:01 implementing the linux vfs on minix? 2008-10-18 20:01 nope 2008-10-18 20:01 no, os/x I suppose 2008-10-18 20:01 compiling minix fs in mac os :D 2008-10-18 20:02 what is it page_follow_link_light and page_put? 2008-10-18 20:02 some old stuff? 2008-10-18 20:02 there will be a bunch of support functions not having much to do with modern filesystems 2008-10-18 20:02 hmm... it is in ext4 though 2008-10-18 20:02 follow__light is a new one for me 2008-10-18 20:03 some optimization of symlink I suppose 2008-10-18 20:03 and you'd have to read path_walk and friends 2008-10-18 20:03 I suppose we should do that one day 2008-10-18 20:03 :D 2008-10-18 20:03 I've never completely analyzed it myself, it's a royal mess 2008-10-18 20:03 another one 2008-10-18 20:04 the cool thing is I can fake some stuff if they will never be called :P 2008-10-18 20:05 just compile them all as error stubs, execute the mount and fix the first thing that breaks 2008-10-18 20:05 that's the approach 2008-10-18 20:05 I spent some time to understand them though 2008-10-18 20:05 been there ;) 2008-10-18 20:05 at least a little :D 2008-10-18 20:05 hehe... 
2008-10-18 20:06 in what context? 2008-10-18 20:06 ported an app from an embedded controller to msdos 2008-10-18 20:06 uuu... 2008-10-18 20:06 had to emulate the graphics controller 2008-10-18 20:07 dozens of features, some of them bizarre 2008-10-18 20:07 different graphics memory layout, line acceleration... 2008-10-18 20:07 everything to make life miserable 2008-10-18 20:07 did it work? :D 2008-10-18 20:07 it took over the world 2008-10-18 20:07 did it worth the effort? 2008-10-18 20:08 canon cameras are running msdos... 2008-10-18 20:08 http://www.herkules-group.com/ >- bailed these guys out of a dead end and they now monopolize the world's roll grinder industry 2008-10-18 20:09 roll grinder... serious stuff! :D 2008-10-18 20:09 big stuff... 2008-10-18 20:09 (I'm looking at the pictures) 2008-10-18 20:09 chances are you used some plastic wrap in the last week made by a roll ground by my software 2008-10-18 20:10 and are driving around in a car wearing sheet metal made by a roll ground by my software ;) 2008-10-18 20:10 :D 2008-10-18 20:11 and few home for christmas in a plane with superstructure made by... etc etc 2008-10-18 20:11 so far I only know a bug that affects all the mac os x I tried so far :P 2008-10-18 20:12 bugs in that software tend to make large explosions and/or cut people in half 2008-10-18 20:13 "crash and burn" literally 2008-10-18 20:13 how big was your code? 2008-10-18 20:13 http://www.mohansteels.com/rollmill3.jpg 2008-10-18 20:13 bug free? 2008-10-18 20:13 pretty big 2008-10-18 20:13 not bug free, but reliable 2008-10-18 20:14 far more than the original 2008-10-18 20:14 I never have the chance to walk in such a place 2008-10-18 20:14 did some of the debugging live in places like in the picture, wearing a hard hat 2008-10-18 20:14 downtime cost $90,000 hour, so upgrades had to be fast 2008-10-18 20:17 http://www.sumitomometals.co.jp/e/osakasteelworks/vc/images/vc3-3.jpg 2008-10-18 20:18 :D 2008-10-18 20:18 how long were the debug sessions? 
2008-10-18 20:18 days or weeks 2008-10-18 20:18 had to be careful 2008-10-18 20:19 finding other people's bugs pretty much 2008-10-18 20:19 how long have you done this? 2008-10-18 20:19 usually by replacing masses of code with stuff a fraction of the size 2008-10-18 20:19 did that for two or three years 2008-10-18 20:19 did you like it? :P 2008-10-18 20:20 would not have given up the experience 2008-10-18 20:20 rather different than the rest of my life ;) 2008-10-18 20:20 whole nuther side of the world 2008-10-18 20:20 don't need any more of that 2008-10-18 20:20 this was before Google? 2008-10-18 20:21 or much earlier? 2008-10-18 20:21 way before 2008-10-18 20:21 been a linux hacker since 2008-10-18 20:22 (nameidata... another thing I don't know about...) 2008-10-18 20:22 it could have been something else beside linux? 2008-10-18 20:22 bsd perhaps? 2008-10-18 20:22 mac os? :D 2008-10-18 20:22 http://www.industry.siemens.com/metals/EN/solutions/autom_siflat.htm <- you don't want to step on one of these when it's moving 2008-10-18 20:22 not much to prevent that either 2008-10-18 20:23 should have been linux 2008-10-18 20:23 the reason I left is because the company decided to do the next generation in windows instead of linux 2008-10-18 20:23 I didn't want to stick around to clean up the body parts 2008-10-18 20:23 they're probably using linux by now 2008-10-18 20:23 and they did do in windows?!? 2008-10-18 20:24 aha... :D 2008-10-18 20:24 they did, just the user interface 2008-10-18 20:24 and used off the shelf controllers for the back end 2008-10-18 20:24 still 2008-10-18 20:25 the off the shelf controllers are almost certainly running linux now 2008-10-18 20:25 then they were probalby wind river or similar 2008-10-18 20:25 VxWorks? 
2008-10-18 20:25 probably 2008-10-18 20:26 details weren't interesting to me 2008-10-18 20:26 in linux, you do all the stuff the controller is doing and throw away the controller 2008-10-18 20:26 control the motors directly, throw away the motor controller too 2008-10-18 20:26 :D 2008-10-18 20:27 you still need something to get to those motors 2008-10-18 20:27 http://planet.wwu.edu/fall04/images/carmeltedscrap.jpg 2008-10-18 20:27 wish I had my camera then 2008-10-18 20:27 exactly! :D 2008-10-18 20:27 these shots don't give have the sense of what it's like 2008-10-18 20:27 it's like science fiction 2008-10-18 20:27 :-) 2008-10-18 20:29 http://www.hebig.org/blogs/archives/main/IMG_7068_4.jpg <- that's more like it 2008-10-18 20:29 crappy photo though 2008-10-18 20:30 I wonder if they let people take picture now... 2008-10-18 20:30 sure 2008-10-18 20:30 do they? 2008-10-18 20:30 no secrets 2008-10-18 20:30 they use the photos to get business 2008-10-18 20:31 :D 2008-10-18 20:31 you by equipment by the pound in that industry 2008-10-18 20:31 there's only one source for most of it 2008-10-18 20:31 where are this things? can you ask to go and visit? 2008-10-18 20:31 you can 2008-10-18 20:31 they get so few visitors you'd get the royal treatment 2008-10-18 20:32 but where are these? 2008-10-18 20:32 usually in the middle of nowhere where land is cheap and the cows don't complain about the noise and smoke 2008-10-18 20:32 ouch... 2008-10-18 20:32 you're in chicago, right? 
probably don't have far to go to get to the rust belt 2008-10-18 20:32 touring america and taking pics in some of these places would be awesome :D 2008-10-18 20:33 I'm in Baltimore :P 2008-10-18 20:33 even better I think 2008-10-18 20:33 I don't jave a car though :| 2008-10-18 20:33 probably got a rolling mill or two on the same block ;) 2008-10-18 20:33 school is in the walking distance and so is the supermarket :P 2008-10-18 20:34 ok, maryland, shows how much american geography I know 2008-10-18 20:34 the mills all moved west long time ago 2008-10-18 20:34 the JHU campus is inside the city 2008-10-18 20:35 how's the weather right now? 2008-10-18 20:35 pretty warm 2008-10-18 20:36 I go to school in short sleeve 2008-10-18 20:36 looks like a nice situation 2008-10-18 20:36 I put a jacket when I come though 2008-10-18 20:36 right now the max is around 20-22 and the min about 10 2008-10-18 20:36 I prefer much warmer weather :P 2008-10-18 20:36 Singapore was heaven :D 2008-10-18 20:37 http://www.baltimoresun.com/media/photo/2008-03/37043371.jpg <- close to you 2008-10-18 20:38 how did you find that? :D 2008-10-18 20:38 creative googling 2008-10-18 20:39 maryland steel factories? :D 2008-10-18 20:39 maps.google.com, go to baltimore, type in steel mill, look for one close to you, type in the address, go to images 2008-10-18 20:39 let's see... 2008-10-18 20:40 http://maps.google.com/maps?f=q&hl=en&geocode=&q=%22steel+mill%22+maryland&ie=UTF8&filter=0&z=7 2008-10-18 20:41 that would have been me, wearing a blue hard hat and birkenstocks 2008-10-18 20:41 perhaps I should mention the terrible state of the public transportation in baltimore :P 2008-10-18 20:41 birkenstocks? 2008-10-18 20:42 http://en.wikipedia.org/wiki/Birkenstocks 2008-10-18 20:42 german sandals 2008-10-18 20:42 still have the same ones 2008-10-18 20:42 they last forever 2008-10-18 20:42 can you wear something like this in a steel factory?? 
2008-10-18 20:49 they care more about your head than your feet 2008-10-18 20:49 save the important parts :D 2008-10-18 20:50 just be careful not to step on anything that's glowing 2008-10-18 20:51 43 calls to go... 2008-10-18 20:51 I should go home :| 2008-10-18 20:53 funny: the fs is exactly the last chapter in operating systems, design and implementation :P 2008-10-18 20:58 uses every aspect of the fs just about 2008-10-18 20:58 with the exception of some multimedia 2008-10-18 21:21 going home... 2008-10-18 21:21 have a nice evening! 2008-10-18 21:21 thanks for chat :P 2008-10-18 22:07 folks 2008-10-18 22:09 hi 2008-10-18 22:15 how's it going ? 2008-10-18 22:16 finally got some good results out of my schedule instrumentation and I found that the cpu_idle thread is a significant source of lock contention against the rq locks which is surprising 2008-10-18 22:16 I haven't posted my patch nor the result to lkml yet 2008-10-18 22:24 there might be some logic that too aggressive to try and idle a processor under various thread loads and it might be another aspect that's putting a lot of pressure against those rq locks 2008-10-18 22:24 ACTION thinks about file system parallelism 2008-10-18 22:42 -!- prani(~Bobby@122.163.48.97) has joined #tux3 2008-10-18 22:43 -!- prani(~Bobby@122.163.48.97) has joined #tux3 2008-10-18 23:40 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-18 23:53 -!- Bobby_(~Bobby@122.163.48.97) has joined #tux3 2008-10-19 00:04 -!- bobby(~bobby@122.163.48.97) has joined #tux3 2008-10-19 00:23 -!- bobby(~bobby@122.163.48.97) has joined #tux3 2008-10-19 03:07 -!- paola(~paola@ppp-139-17.20-151.libero.it) has joined #tux3 2008-10-19 04:35 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-10-19 09:47 -!- Bobby_(~Bobby@122.162.71.160) has joined #tux3 2008-10-19 09:55 -!- bobby(~bobby@122.162.71.160) has joined #tux3 2008-10-19 10:06 -!- stargazr5(~gauravstt@59.95.14.136) has joined 
#tux3 2008-10-19 10:18 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-19 11:03 -!- pgquiles(~pgquiles@71.Red-79-154-137.staticIP.rima-tde.net) has joined #tux3 2008-10-19 11:20 -!- pgquiles(~pgquiles@71.Red-79-154-137.staticIP.rima-tde.net) has joined #tux3 2008-10-19 12:07 -!- Bobby_(~Bobby@122.162.69.167) has joined #tux3 2008-10-19 12:07 hey akk 2008-10-19 12:07 all* 2008-10-19 12:07 anyone ever heard of 1Gbps data transfer? 2008-10-19 12:09 -!- zbrown(~rufius@208.64.37.45) has left #tux3 2008-10-19 12:41 look up agami systems wikipedia 2008-10-19 12:45 -!- Bobby_(~Bobby@122.162.69.103) has joined #tux3 2008-10-19 13:54 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-19 14:14 hmm... I think I reach the point where I cannot avoid buffer heads anymore... 2008-10-19 15:05 doctor strangelove was great 2008-10-19 15:05 peter sellers is beyond brilliant 2008-10-19 15:07 that's the guy that played a few characters there? 2008-10-19 15:08 that's him 2008-10-19 15:08 I didn't know when I watch it :D 2008-10-19 15:08 including the illustrious doctor 2008-10-19 15:09 I also liked the texan pilot :P 2008-10-19 15:09 (which he was suppose to also play) 2008-10-19 15:09 slim pickens 2008-10-19 15:10 apparently not acting 2008-10-19 15:10 that's how he really was 2008-10-19 15:10 yeah! :D 2008-10-19 15:10 I read about that 2008-10-19 15:10 he came in boots and with the hat from the first day :P 2008-10-19 15:11 it's amazing how sellers could change his face to play the president 2008-10-19 15:11 didn't recognize him 2008-10-19 15:11 Kubrick biographer John Baxter further explains in the documentary Inside the Making of Dr. Strangelove: 2008-10-19 15:11 As it turns out, Slim Pickens had never left the United States. He had to hurry and get his first passport. He arrived on the set, and somebody said, "Gosh, he's arrived in costume!," not realizing that that's how he always dressed... 
with the cowboy hat and the fringed jacket and the cowboy boots -- and that he wasn't putting on the character -- that's the way he talked. 2008-10-19 15:11 (from wikipedia :D) 2008-10-19 15:12 seems to be rated about number 5 movie of all time 2008-10-19 15:12 on lots of lists 2008-10-19 15:13 President Merkin Muffley: Gentlemen, you can't fight in here! This is the War Room. 2008-10-19 15:13 I saw it when it first came out, I guess my dad didn't know it wasn't for kids 2008-10-19 15:13 wow! :D 2008-10-19 15:13 all I remembered from it was the guy sitting on the bomb waving his hat 2008-10-19 15:13 how old were you? 2008-10-19 15:13 8 2008-10-19 15:13 :D 2008-10-19 15:14 actually, my 4 year old could relate to the sitting on the bomb scene 2008-10-19 15:15 she need to know why he wanted to sit on the bomb 2008-10-19 15:15 the part about the bodily fluids was hilarious... 2008-10-19 15:15 answer: because he'd stupid 2008-10-19 15:15 :-) 2008-10-19 15:16 reading out the phone call... serious right up until the last line 2008-10-19 15:17 Director Kubrick tricked Scott into playing the role of Gen. Turgidson far more ridiculously than Scott was comfortable with doing. Kubrick talked Scott into doing "over the top" practice takes, which Kubrick told Scott would never be used, as a way to warm up for the "real" takes. Subsequently, Kubrick used these takes in the final film, causing Scott to swear never to work with Kubrick again. 
2008-10-19 15:18 (also from wikipedia) 2008-10-19 15:18 considered scott's breakout role 2008-10-19 15:18 so I suppose it was just a quick swear 2008-10-19 15:19 kubrick gets my vote for greatest director of all time 2008-10-19 15:20 I think the doctor strangelove is the only film I watch from him 2008-10-19 15:20 awesome song for the ending :P 2008-10-19 15:20 there's 2001 2008-10-19 15:20 barry lyndon 2008-10-19 15:21 the shining 2008-10-19 15:21 clockwork orange 2008-10-19 15:21 all must see 2008-10-19 15:21 aaaa 2001 2008-10-19 15:21 he's got at least 5 films in the top 100 2008-10-19 15:21 I have to see that 2008-10-19 15:25 apparently some of the terrain scenes in strangelove were recycled in the light show part of 2001 2008-10-19 15:25 optically/chemically alterered 2008-10-19 15:25 didn't know about that :D 2008-10-19 15:26 64, 68... 2008-10-19 15:26 4 years apart 2008-10-19 15:26 aaa... I watched Eyes Wide Shut :P 2008-10-19 15:27 I need to see it 2008-10-19 15:28 critics thought the pacing was slow... that's a good sign 2008-10-19 15:28 not with the kids around though :P 2008-10-19 15:28 it was slow... 2008-10-19 15:28 that's what they thought about barry lyndon, which is since recognized as one of the great films 2008-10-19 15:28 one of my favorites 2008-10-19 15:29 need to watch that 2008-10-19 15:29 cool... they also have it at the library ;-) 2008-10-19 15:30 checkout out but the next free time is not soon 2008-10-19 15:30 you've got plenty of time ;) 2008-10-19 15:31 "after five or ten years came the realization that 2001 or Barry Lyndon or The Shining was like nothing else before or since" -- Scorsese 2008-10-19 15:31 exactly 2008-10-19 15:31 :D 2008-10-19 15:32 ah, full meta jacket, another kubrick 2008-10-19 15:33 metal even 2008-10-19 15:33 metajacket... sounces like sequel to men in black 2008-10-19 15:34 btw, did you like Red Thin Line? 2008-10-19 15:35 didn't see it 2008-10-19 15:35 http://www.imdb.com/title/tt0120863/ 2008-10-19 15:35 1998? 
2008-10-19 15:35 I like it a lot 2008-10-19 15:36 http://www.rottentomatoes.com/m/1084146-thin_red_line/ 2008-10-19 15:36 yup 2008-10-19 15:36 it's from '98 2008-10-19 15:36 I'll check it out 2008-10-19 15:37 http://www.youtube.com/watch?v=Gm6ZgOBlzII 2008-10-19 15:37 funny how kubrick seems to have intentionally made one of each of the major film genre's, almost 2008-10-19 15:37 no western 2008-10-19 15:37 I'm a sucker for voiceovers ;-) 2008-10-19 15:40 just watched the trailer... 2008-10-19 15:40 I had to watch it again 2008-10-19 15:40 thin red line? 2008-10-19 15:41 yup 2008-10-19 15:41 ok, it's on my list 2008-10-19 15:41 let me know what do you think after you watch it :D 2008-10-19 15:44 -!- paola(~paola@ppp-139-17.20-151.libero.it) has left #tux3 2008-10-19 15:46 #include "itree_common.c" :D 2008-10-19 15:46 in itree_v1.c from minix 2008-10-19 15:47 :) 2008-10-19 15:47 so there's some precedent... from linus maby even 2008-10-19 15:48 source includes are useful and obvious technique, it's strange how most c coders consider it somehow dirty 2008-10-19 15:48 * Copyright (C) 1991, 1992 Linus Torvalds 2008-10-19 15:48 * 2008-10-19 15:48 * Copyright (C) 1996 Gertjan van Wingerde (gertjan@cs.vu.nl) 2008-10-19 15:48 * Minix V2 fs support. 2008-10-19 15:49 in my case I didn't expect to find it in the middle of the itree_v1.c :D 2008-10-19 15:49 I was tracking a static function in itree_common.c 2008-10-19 15:49 and it was not used in itree_common.c ;-) 2008-10-19 15:51 shapor, brilliant merge 2008-10-19 15:52 post your merge helper scripts on the list? 2008-10-19 16:06 trying to think of all the things that need to be done for the kernel port now 2008-10-19 16:07 so far the testing was done only using fuse? 2008-10-19 16:07 and that very light 2008-10-19 16:07 so light that I think it's been broken for a couple weeks 2008-10-19 16:07 since the extents merge 2008-10-19 16:07 so the more serious one was using userland programs? 
2008-10-19 16:08 there hasn't been serious testing 2008-10-19 16:08 ack :D 2008-10-19 16:09 we're getting closer though 2008-10-19 16:10 as soon as I get over this atomic commit hump, and produce some code for it then that finishes the big hacking projects in user space for the time being 2008-10-19 16:10 well, except for making fuse actually work well 2008-10-19 16:10 coding the atomic commit will hopefully start tomorrow 2008-10-19 16:12 38 more calls... I'm slow... 2008-10-19 16:13 now how are we going to do the source code management for the kernel port 2008-10-19 16:13 hg? 2008-10-19 16:14 I'm thinking, we have a script that copies in the files from the userspace directory, maybe makes some slight changes to them, then applies a patch to produce the kernel code 2008-10-19 16:14 and we hack on the kernel code in-tree, with a git tree cloned from mainline 2008-10-19 16:15 so to update the above patch, we run the import script then do git-diff 2008-10-19 16:15 why not make the userland compatible with the kernel code? :P 2008-10-19 16:15 so that way we keep the user space and kernel code from diverging a lot for the time being 2008-10-19 16:15 (like I'm doing now ;-)) 2008-10-19 16:15 we will make it mostly compatible 2008-10-19 16:15 but there are obvious places where it can't be 2008-10-19 16:15 like be don't need buffer.c or diskio.c at all 2008-10-19 16:16 s/be/we/ 2008-10-19 16:16 bok... I'll make the code run on my vfs then :P 2008-10-19 16:16 the kernel code I mean 2008-10-19 16:16 good luck 2008-10-19 16:16 why not just hack on the real kernel? 2008-10-19 16:17 I like the userland 2008-10-19 16:17 I'll set up a uml tarball 2008-10-19 16:17 but I also like to keep the code untouched 2008-10-19 16:17 you can work completely in userland 2008-10-19 16:17 just not on macos 2008-10-19 16:17 well... 
I'm am working on macos ;) 2008-10-19 16:17 :D 2008-10-19 16:17 all good things come to an end ;) 2008-10-19 16:17 ACTION isn't sure whether that's actually true 2008-10-19 16:18 hmm... the market going up did :P 2008-10-19 16:18 it I'm able to run ext2 I should be able to run tux3, right? 2008-10-19 16:19 it = if 2008-10-19 16:21 possibly 2008-10-19 16:21 you need to emulate the bio interface 2008-10-19 16:21 should not be that hard 2008-10-19 16:21 I'll got to that in ext2 I guess... 2008-10-19 16:21 minix doesn't use it 2008-10-19 16:21 not really 2008-10-19 16:22 ext2 doesn't use bios directly 2008-10-19 16:22 last time I checked 2008-10-19 16:22 ext3? :D 2008-10-19 16:22 flips: not script, just some droppings in my .bash_history 2008-10-19 16:22 probably script worthy though 2008-10-19 16:22 shapor. scrape .bash_history ? 2008-10-19 16:22 yeah 2008-10-19 16:23 before it rides off into the sunset 2008-10-19 16:23 speaking of which 2008-10-19 16:23 copied it so it wouldn't get rotated out 2008-10-19 16:23 getting close to sk8 thirty 2008-10-19 16:23 i'm kinda beat just rode 120 miles 2008-10-19 16:23 next time we have clouds I'll bring the camera down to the beach again 2008-10-19 16:23 mostly on the edge of my tires ;) 2008-10-19 16:23 gotta clean the sensor 2008-10-19 16:23 heh 2008-10-19 16:24 gotta make sure you don't just wear the middles 2008-10-19 16:25 whoa, it's cool out, have to wear an extra layer 2008-10-19 16:25 winter is setting in in socal 2008-10-19 16:26 71F in my lab right now :( 2008-10-19 16:30 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-19 16:34 saw mongol the russian movie last night 2008-10-19 16:34 way excellent 2008-10-19 16:35 blows away any hollywood release in the last couple years imho 2008-10-19 16:35 http://www.imdb.com/title/tt0416044/ ? 
2008-10-19 16:37 yes 2008-10-19 16:37 underrated on english speaking sites I think 2008-10-19 16:38 http://www.apple.com/trailers/picturehouse/mongol/ 2008-10-19 16:38 looks cool :D 2008-10-19 16:39 apparently mongolians didn't like it and russians loved it 2008-10-19 16:40 btw, have you watched: http://www.imdb.com/title/tt1032846/ ? 2008-10-19 16:41 no, never heard of it 2008-10-19 16:42 it's dark... 2008-10-19 16:42 looks like a must see 2008-10-19 16:42 one of the best Romanian movie 2008-10-19 16:43 hmm, genghis kahn was played by a japanese 2008-10-19 16:43 looks pretty mongolian, that has something to be with being conquered perhaps 2008-10-19 16:44 probably also has something to do with mongolians hating it 2008-10-19 16:44 :-) 2008-10-19 16:48 the military logic is to history reality as hollywood westerns are to actual cow pies 2008-10-19 17:03 rasvanm, could you email me the original of http://farm4.static.flickr.com/3011/2951108022_9cc58e9464_b.jpg ? 2008-10-19 17:03 the psd file? 2008-10-19 17:04 whatever it is 2008-10-19 17:04 I wonder if gimp can read that 2008-10-19 17:04 apparently yes 2008-10-19 17:05 it might, I only used some simple filters 2008-10-19 17:08 the psd is ~90MB and is here: http://cs.jhu.edu.edu/~razvanm/carousel.psd 2008-10-19 17:08 the full size jpg is here http://farm4.static.flickr.com/3011/2951108022_fd19ba9383_o.jpg 2008-10-19 17:09 I should have spend some time to fix the carousel fringes... 2008-10-19 17:09 razvanm, later tonight ;) 2008-10-19 17:10 :D 2008-10-19 17:10 I like the fringes 2008-10-19 17:10 are you going to take more? 2008-10-19 17:10 it's 8 ;-) 2008-10-19 17:10 gives it that misaligned lithograph look 2008-10-19 17:11 some (usually dark) nice pictures http://www.dianevarner.com/ 2008-10-19 17:11 90 mb is gross 2008-10-19 17:11 I wonder how above achieves that, xml? 
2008-10-19 17:11 let me check 2008-10-19 17:12 -rw-r--r-- 1 razvanm users 54747319 Oct 18 00:27 carousel.psd.gz 2008-10-19 17:12 there is some xml inside in some parts 2008-10-19 17:13 http://cs.jhu.edu.edu/~razvanm/carousel.psd.gz 2008-10-19 17:14 90 mb from 7 MB raw is unconscionable 2008-10-19 17:14 why I don't like proprietary/evil 2008-10-19 17:14 even when shiny 2008-10-19 17:15 it's not rgb... I converted to lab colors :P 2008-10-19 17:15 and I also make a layer using one the channels 2008-10-19 17:15 and I also used some gradient on a mask or two 2008-10-19 17:15 and I think that is stored as bitmap :P 2008-10-19 17:16 isn't the _o what you need? 2008-10-19 17:18 _o ? 2008-10-19 17:19 RazvanM: the full size jpg is here http://farm4.static.flickr.com/3011/2951108022_fd19ba9383_o.jpg 2008-10-19 17:19 http://cs.jhu.edu.edu/~razvanm/carousel.psd.gz gives 404 to wget and top level url to firefox 2008-10-19 17:19 looks like a censor job ;) 2008-10-19 17:20 bandwidth police maybe 2008-10-19 17:20 sorry.. I put two .edu-s :D 2008-10-19 17:21 http://cs.jhu.edu/~razvanm/carousel.psd.gz 2008-10-19 17:21 heh 2008-10-19 17:22 now it works, right? 2008-10-19 17:35 18 more calls... 2008-10-19 17:57 14 more... 2008-10-19 18:06 9 more... 2008-10-19 18:13 5 more... 2008-10-19 18:23 1 more... 
2008-10-19 18:33 0 :P 2008-10-19 18:34 Bus error ;-) 2008-10-19 18:57 time to go home 2008-10-19 20:16 razvanm went home on the bus 2008-10-19 20:33 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-19 21:10 -!- mint(~mint@71-90-82-221.dhcp.stpt.wi.charter.com) has joined #tux3 2008-10-19 21:10 sup 2008-10-19 21:10 going on bitches 2008-10-19 21:12 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-19 21:42 flips: reading postings from tso 2008-10-19 21:42 regarding btrfs, it's also /.-ed 2008-10-19 22:00 flips: I wouldn't be discouraged 2008-10-19 22:00 just keep at it 2008-10-19 22:58 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-19 23:00 need a wiki editor 2008-10-19 23:00 http://en.wikipedia.org/wiki/Versioning_file_system 2008-10-19 23:32 guess tim didn't notice the [edit] link 2008-10-19 23:32 hey flips 2008-10-19 23:32 i think tim_dimm wanted someone to edit that... 2008-10-19 23:32 anyway, we don't get to go in there until it actually versions imho 2008-10-19 23:32 as in a person 2008-10-19 23:32 :) 2008-10-19 23:33 ah, we should bring our own more up to date 2008-10-19 23:33 http://en.wikipedia.org/wiki/Tux3 2008-10-19 23:33 who in here is Mr. Phillips? 2008-10-19 23:33 wild guess? 2008-10-19 23:34 flips: why are we not using rbtree in our fs? 2008-10-19 23:34 you! 2008-10-19 23:34 ah 2008-10-19 23:34 yeah right click confirmed that 2008-10-19 23:37 flips, if you don't mind me asking, how old are you? 2008-10-19 23:38 ha I went to google-stalk you 2008-10-19 23:38 FelipeS: i had no luck doing that :) 2008-10-19 23:38 I typed daniel phillips and google's "autocomplete" feature associated you with linux 2008-10-19 23:39 well I just found out he has a degreen in music 2008-10-19 23:40 from the Univ of British Columbia 2008-10-19 23:40 which he obtained in 1975 2008-10-19 23:40 lets say 1975 - 20 2008-10-19 23:40 1955 ? 
2008-10-19 23:42 :D 2008-10-19 23:42 exactly 2008-10-19 23:42 good sneaky work, we need you :) 2008-10-19 23:43 was that really you? 2008-10-19 23:43 nah we all just need Google 2008-10-19 23:45 flips: rbtree? 2008-10-19 23:46 rbtree? 2008-10-19 23:46 htree maybe 2008-10-19 23:46 hmm 2008-10-19 23:46 not sure what you're talking about 2008-10-19 23:46 ACTION really doesnt knw a htree :( 2008-10-19 23:46 hmm 2008-10-19 23:46 i see that u are using a btree in tux3 2008-10-19 23:47 several 2008-10-19 23:47 yeah... 2008-10-19 23:47 so is a htree or rbtree better than a plain btree? 2008-10-19 23:49 rbtree is not a btree 2008-10-19 23:49 ! 2008-10-19 23:49 and a plain btree is not great for directory indexing 2008-10-19 23:50 rbtree is not? 2008-10-19 23:50 it has an extra color bit, everything else i guess makes it a btree? 2008-10-19 23:51 an rb tree is a type of binary tree 2008-10-19 23:51 a btree is not a binary tree, it is an nary tree 2008-10-19 23:51 oh :) 2008-10-19 23:52 the "b" in btree most likely stands for "balanced", but nobody knows for sure 2008-10-19 23:52 hmm.. thats something i learnt now.. 
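A minimal sketch of the distinction drawn above, in case it helps: a red-black tree node is a binary tree node with a color bit, while a btree node is n-ary and holds several keys. The type names, the fanout of 4 and the helper functions are all hypothetical, chosen for illustration; none of this is taken from tux3 or from the kernel's rbtree code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fanout for illustration; real btrees size a node to fit a disk block. */
#define SKETCH_FANOUT 4

/* Red-black tree node: a binary tree node plus one color bit. */
struct rb_node_sketch {
    int key;
    int is_red;
    struct rb_node_sketch *left, *right;   /* never more than two children */
};

/* Btree node: n-ary, several sorted keys, one more child slot than keys. */
struct btree_node_sketch {
    int nkeys;
    int keys[SKETCH_FANOUT - 1];
    struct btree_node_sketch *child[SKETCH_FANOUT];   /* all NULL in a leaf */
};

/* Binary descent: only two possible directions per node. */
static int rb_child_index(const struct rb_node_sketch *node, int key)
{
    return key < node->key ? 0 : 1;
}

/* N-ary descent: skip every key that is <= the target, then take that slot. */
static int btree_child_index(const struct btree_node_sketch *node, int key)
{
    int i = 0;

    assert(node->nkeys <= SKETCH_FANOUT - 1);
    while (i < node->nkeys && key >= node->keys[i])
        i++;
    return i;
}
```

With a larger fanout each node visit rules out far more of the key space, which is why on-disk indexes tend to be n-ary rather than binary.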
2008-10-19 23:52 ACTION looking for htree 2008-10-20 01:41 -!- Kirantpatil(~kiran@122.167.206.199) has joined #tux3 2008-10-20 01:42 -!- Kirantpatil(~kiran@122.167.206.199) has left #tux3 2008-10-20 01:43 -!- macan(~chatzilla@159.226.41.129) has joined #tux3 2008-10-20 01:49 hey flips 2008-10-20 01:58 -!- stargazr5(~gauravstt@59.95.52.222) has joined #tux3 2008-10-20 03:08 threading in C++ sucks 2008-10-20 06:14 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-20 06:45 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-10-20 07:02 -!- bobby(~bobby@nat-inn.mentorg.com) has joined #tux3 2008-10-20 07:26 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-20 08:49 -!- mingming(~mingming@c-24-22-117-202.hsd1.or.comcast.net) has joined #tux3 2008-10-20 09:14 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-20 10:27 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-10-20 10:35 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-20 10:45 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-20 11:53 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-20 13:42 ACTION is implementing read_cache_page... 2008-10-20 15:53 should I not try to upgrade my inkscape from sid? 2008-10-20 15:55 I have 0.46-1+b1 and the latest one in sid is... 0.46-2.1 2008-10-20 16:50 my god... inkscape is awesome! 2008-10-20 17:02 failed to install on etch 2008-10-20 17:03 last time I used it it was sodipodi and it was promising 2008-10-20 17:03 pardon me, it in installs and fails to run 2008-10-20 17:04 can't find library libgc.so.1 2008-10-20 17:05 but I have /usr/lib/libgc.so.1.0.2 2008-10-20 17:05 so I guess the library was installed incorrectly 2008-10-20 17:08 indeed that was the issue, fixed by reinstall :-/ 2008-10-20 17:08 http://cs.jhu.edu/~razvanm/get_sb.png 2008-10-20 17:08 what do you think? 
:P 2008-10-20 17:08 I think inkspace was forked from sodipodi long time 2008-10-20 17:08 nice arrows 2008-10-20 17:08 not sure what they mean 2008-10-20 17:09 the tip of the arrow is the return of the function 2008-10-20 17:09 it's incredibly stupid that svg graphics do not scale with ctrl +/- 2008-10-20 17:09 should I upload an svg? :D 2008-10-20 17:09 oh 2008-10-20 17:09 thought it was ;) 2008-10-20 17:09 sure 2008-10-20 17:09 try it 2008-10-20 17:10 for that matter, it's stupid that pngs and jpgs don't scale with ctrl +/- 2008-10-20 17:11 http://cs.jhu.edu/~razvanm/get_sb.svg 2008-10-20 17:11 displays fine 2008-10-20 17:11 hey flips 2008-10-20 17:11 I also have on paper the init_module part 2008-10-20 17:12 does the ctrl +/- works with svg? :D 2008-10-20 17:12 it doesn't in safari... 2008-10-20 17:12 ACTION is cleaning up around the house today and will work tonight 2008-10-20 17:13 it does in firefox :D 2008-10-20 17:13 neat 2008-10-20 17:13 btw, does the calls look right? :D 2008-10-20 17:13 for me in firefox it only scales the svg text 2008-10-20 17:13 which looks really stupid 2008-10-20 17:14 hm... scales fine in mine... 2008-10-20 17:14 iceweasel 3.0 2008-10-20 17:14 so the FS is on the left and VFS on the right 2008-10-20 17:14 what is the significance of arrow going to the right vs to the left? 2008-10-20 17:14 ok 2008-10-20 17:15 I should also add the init_module + kmem_cache_create + register_filesystem on top... 2008-10-20 17:15 let me do that... 2008-10-20 17:15 btw, it's close to sk8 o'clock :P 2008-10-20 17:16 where does _fill_super go? 2008-10-20 17:16 you mean that by just fill_super? 2008-10-20 17:16 it is 2008-10-20 17:16 leaving in 4 minutes 2008-10-20 17:17 the fill_super is called from the get_sb_dev by VFS 2008-10-20 17:17 right? 
2008-10-20 17:20 the _fill_super, or ->fill_super 2008-10-20 17:20 ->fill_super :D 2008-10-20 17:20 or as I write it sometimes, ->fill_super -> _fill_super 2008-10-20 17:20 sk8 oclock 2008-10-20 17:21 enjoy 2008-10-20 17:21 ACTION straps on skates 2008-10-20 17:22 see you at the top of seaside? 2008-10-20 17:22 actually, the fill_super is passed as a parameter to get_sb_bdev 2008-10-20 17:22 probalby for your second run 2008-10-20 17:22 if you survive the first 2008-10-20 17:22 yup 2008-10-20 17:45 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-20 19:49 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-20 20:13 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-10-20 20:19 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-20 20:27 -!- less(~less@145.116.238.192) has joined #tux3 2008-10-20 22:02 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-20 22:55 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-21 01:09 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-21 01:28 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-21 04:20 -!- mlankhorst(~m@fw1.astro.rug.nl) has joined #tux3 2008-10-21 08:43 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-21 08:57 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-21 09:10 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-21 09:46 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-10-21 10:01 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-21 10:39 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-21 12:37 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-21 12:47 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-21 13:59 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined 
#tux3 2008-10-21 15:30 ACTION is reading Chapter 15 from Understanding the Linux Kernel... 2008-10-21 16:09 ah, never looked at it to tell the truth 2008-10-21 16:09 mostly it just reads out the source code to us without interpretation 2008-10-21 16:10 it's modern enough to know that submit_bh calls submit_bio 2008-10-21 16:24 hey 2008-10-21 16:25 sk8 oclock 2008-10-21 16:27 enjoy : 2008-10-21 16:28 utlk is actually pretty good on the page IO life cycle 2008-10-21 16:28 is missing the recent stuff on dirty page limits and all the changes related to that 2008-10-21 16:28 which were pretty major 2008-10-21 16:30 utlk? 2008-10-21 16:31 understanding the linux kernel 2008-10-21 16:48 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-21 16:51 uh :D 2008-10-21 16:51 I though it will be some sort of tool 2008-10-21 18:47 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-21 18:50 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-21 19:07 -!- macan(~chatzilla@159.226.41.129) has joined #tux3 2008-10-21 19:54 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-21 19:54 ACTION is getting ready... 2008-10-21 19:58 -!- RalucaM(~ral@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-21 19:58 hi 2008-10-21 20:00 hi raluca 2008-10-21 20:00 how'd you like the photo gallery? 2008-10-21 20:00 http://phunq.net/sunset 2008-10-21 20:01 not in the ballpark of razvan's skillz, but... 2008-10-21 20:01 hmm, seems like "next session" was last session 2008-10-21 20:01 any maze? 2008-10-21 20:02 let's see if 2.6.27 is indexed yet 2008-10-21 20:02 yes 2008-10-21 20:02 ok it is 2008-10-21 20:03 hello 2008-10-21 20:03 hi hirofumi 2008-10-21 20:03 hi 2008-10-21 20:03 let's take a look at how fast path writepages works 2008-10-21 20:03 start with ext3_writepages 2008-10-21 20:03 remind me to ask a question about halloween afterwards... 
and flips, you're not really out are you ;-)? 2008-10-21 20:04 who me? 2008-10-21 20:04 flips: oh, I didn't see the gallery yet, let me check 2008-10-21 20:04 ->writepages is an address_space_operation 2008-10-21 20:05 meaning, associated with struct mapping 2008-10-21 20:05 err 2008-10-21 20:05 with struct address_space 2008-10-21 20:05 usually referenced by ->mapping 2008-10-21 20:05 are we at a specific place in the code? 2008-10-21 20:05 fun skew there 2008-10-21 20:05 we will be soon 2008-10-21 20:06 http://lxr.linux.no/linux+v2.6.26.6/fs/ext3/inode.c 2008-10-21 20:07 http://lxr.linux.no/linux+v2.6.26.6/fs/ext3/inode.c#L1769 2008-10-21 20:07 that's ext3_readpages 2008-10-21 20:07 ext3 doesn't support ->writepages 2008-10-21 20:08 I guess thats why I couldn't find it :P 2008-10-21 20:08 I should have looked in the struct from the start 2008-10-21 20:08 interesting question, why it's in ext2 and not ext3 2008-10-21 20:08 I'd guess it's the journaling 2008-10-21 20:09 makes it much harder to support 2008-10-21 20:09 because ext2 is old? 
2008-10-21 20:09 because ext3 has more rules about writing I think 2008-10-21 20:09 you'll note writepage has 3 different implementations for ext3 2008-10-21 20:09 and the vfs writepages doesn't know those rules 2008-10-21 20:09 we'll get more specifc about that later 2008-10-21 20:09 oh yes 2008-10-21 20:10 so ext3_readpages is just a wrapper for the library function 2008-10-21 20:10 mpage_readpages 2008-10-21 20:10 where it supplies its *_get_block function 2008-10-21 20:10 as is ext3_readpage 2008-10-21 20:11 yes 2008-10-21 20:11 and not the way tux3 is structured at the moment 2008-10-21 20:11 and possibiliy we will avoid creating tux3_get_block and using that whole library interface 2008-10-21 20:11 ACTION arrives fashionably late 2008-10-21 20:11 I'm leaning in that direction 2008-10-21 20:12 very fashionable 2008-10-21 20:12 http://lxr.linux.no/linux+v2.6.26.6/fs/mpage.c#L371 2008-10-21 20:12 it builds up a bio containing a bunch of pages instead of just one 2008-10-21 20:13 wait a moment, which direction are you leaning in? (and what directions are there?) 2008-10-21 20:13 which saves merging in the block elevator among onther things 2008-10-21 20:13 the direction I'm leaning in is not to have a tux3_get_block 2008-10-21 20:13 and therfore not using any library function that expects a get_block callback 2008-10-21 20:13 those library functions being ancient, crufty 2008-10-21 20:14 in practice I may find it's impractical, or it's totally practical 2008-10-21 20:14 don't know yet 2008-10-21 20:14 ah 2008-10-21 20:14 homework assignment? ;) 2008-10-21 20:14 so basically implementing readpage(s) manually? 
2008-10-21 20:14 right 2008-10-21 20:15 like be beloved romfs :D 2008-10-21 20:15 be = my 2008-10-21 20:15 maybe 2008-10-21 20:15 probably work the effort 2008-10-21 20:15 the callback mess is really a mess, and the concept of the get_block interface is kind of broken 2008-10-21 20:15 it implies some place to cache the physical address 2008-10-21 20:16 whereass that really should be the business of the fs 2008-10-21 20:16 ext2 read/write page/pages just look like wrappers too 2008-10-21 20:16 http://lxr.linux.no/linux+v2.6.26.6/fs/mpage.c#L371 <- we're here now 2008-10-21 20:16 yes, because it's a non-journalled file system, which actually has a read/write block interface 2008-10-21 20:17 one thing we don't know from looking here, is where the list of pages we're writing came from 2008-10-21 20:17 mpage_readpage(s) are pretty much the same thing 2008-10-21 20:18 we write pages when they are dirty write? 2008-10-21 20:18 and we want to use them for something else 2008-10-21 20:18 anyway, we're going to write the whole list, and if we're lucky, the list refers to pages contiguous on disk 2008-10-21 20:18 athough apparently there's no page cache lru interaction in the single page version 2008-10-21 20:18 because a single bio can only handle contiguous pages 2008-10-21 20:18 http://lxr.linux.no/linux+v2.6.26.6/fs/ext3/inode.c#L1423 eek 2008-10-21 20:19 there's no lru interaction in the multipage version either 2008-10-21 20:19 nope 2008-10-21 20:19 do_mpage_readpage can submit bios 2008-10-21 20:19 shapor, odd indeed 2008-10-21 20:20 so the data does not have to fit in a single bio 2008-10-21 20:20 http://lxr.linux.no/linux+v2.6.26.6/fs/mpage.c#L384 <- nano optimization 2008-10-21 20:20 somebody must have measured it and determined it actually matters 2008-10-21 20:20 warming up a cache line ahead of time 2008-10-21 20:21 what do you mean theres no lru interaction? 
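The point about a single bio only covering contiguous pages can be sketched in plain userspace C. The function below is hypothetical (it is not from mpage.c): it takes the physical block number of each page in a list, assumes one block per page, and counts how many contiguous runs there are, that is, how many separate submissions the list would need. It ignores the per-bio size limit that real code also has to respect.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the contiguous runs in a list of physical block numbers.
 * Assumption for illustration: one block per page, and one bio per run.
 */
static size_t count_contiguous_runs(const unsigned long *blocks, size_t n)
{
    size_t runs = 0;

    for (size_t i = 0; i < n; i++) {
        /* A new run starts at the first block or wherever contiguity breaks. */
        if (i == 0 || blocks[i] != blocks[i - 1] + 1)
            runs++;
    }
    return runs;
}
```

A fully contiguous list yields one run, which is the lucky case mentioned above; every break in contiguity costs one more submission.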
2008-10-21 20:21 ok, so what's the add_to_page_cache_lru all about 2008-10-21 20:22 the bio also gets allocated inside do_mpage_readpage 2008-10-21 20:23 but the last submit happens outside of the main loop 2008-10-21 20:23 we peek into the page cache, if there's no page there we read it 2008-10-21 20:23 that's what that's doing 2008-10-21 20:23 why do we need that only in the multiple page case? 2008-10-21 20:23 the assumption: if we find a page, it must be either uptodate or dirty 2008-10-21 20:23 because the readpage won't be called on an uptodate page 2008-10-21 20:24 basically, no point in reading data we already have in the page cache 2008-10-21 20:24 http://lxr.linux.no/linux+v2.6.26.6/fs/mpage.c#L168 <- here's the big rambling hack 2008-10-21 20:24 why does it matter if its dirty? 2008-10-21 20:24 either way if its in page cache, just use that copy, no? 2008-10-21 20:24 it's going to deal with issues like noncontiguous physical disk locations 2008-10-21 20:25 dirty or update, either way, don't have to read it 2008-10-21 20:25 dirty or uptodate I meant 2008-10-21 20:25 right 2008-10-21 20:25 right 2008-10-21 20:26 let's skim through this quickly and see if there's anything interesting 2008-10-21 20:26 which function are we skimming through? 2008-10-21 20:26 188 if (page_has_buffers(page)) 2008-10-21 20:26 189 goto confused; <- now why would that be 2008-10-21 20:27 http://lxr.linux.no/linux+v2.6.26.6/fs/mpage.c#L168 2008-10-21 20:27 do_mpage_readpage 2008-10-21 20:27 big rambling hack 2008-10-21 20:27 fast 2008-10-21 20:27 unpretty 2008-10-21 20:27 akpm coded this whole file in a couple days as I recall 2008-10-21 20:28 yes 2008-10-21 20:28 shortly before being annointed mm czar ;) 2008-10-21 20:28 because we're not expecting there to be cached in mem data for the page we're reading 2008-10-21 20:29 because we're not expecting any filesystem that monkeys with buffers to touch this mapping? 
I don't know 2008-10-21 20:29 oh 2008-10-21 20:29 because we just added it 2008-10-21 20:29 and therefore it shouldn't have buffers 2008-10-21 20:30 could write BUG there, it would have to be a race 2008-10-21 20:30 http://lxr.linux.no/linux+v2.6.26.6/fs/mpage.c#L350 2008-10-21 20:30 another hack 2008-10-21 20:30 it's a shame this function gets called one page at a time 2008-10-21 20:30 must move the cpu needle 2008-10-21 20:31 that's where I think we're going to do the whole bio prep look in tux3 2008-10-21 20:31 instead of interfacing to the library 2008-10-21 20:31 it's similar to the code already in filemap.c 2008-10-21 20:32 http://lxr.linux.no/linux+v2.6.26.6/fs/mpage.c#L350 <- let's consider that issue later 2008-10-21 20:33 re tux3 2008-10-21 20:33 199 * Map blocks using the result from the previous get_blocks call first. 2008-10-21 20:34 nigh on unreadable 2008-10-21 20:35 my brain is sizzling... 2008-10-21 20:35 223 * Then do more get_blocks calls until we are done with this page. <- makes more sense 2008-10-21 20:35 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-21 20:35 from 223 we see the ->get_block calls 2008-10-21 20:35 one for each buffer on the page 2008-10-21 20:36 sorry 2008-10-21 20:36 for each block on the page 2008-10-21 20:36 because we're doing this without buffers 2008-10-21 20:36 that's the main point of it 2008-10-21 20:36 hrm 2008-10-21 20:36 avoids buffer oriented IO for most file data 2008-10-21 20:37 we're using a fake buffer 2008-10-21 20:37 called map_bh 2008-10-21 20:37 just so the ->get_block interface will work 2008-10-21 20:37 crufty? yes very 2008-10-21 20:37 hm get_block gets called alot 2008-10-21 20:38 247 /* some filesystems will copy data into the page during 2008-10-21 20:38 248 * the get_block call, <- for example, tail packing filessystem, for example reiserfs 2008-10-21 20:39 290 * This page will go to BIO. Do we need to send this BIO off first? 
<- what happens if we hit discontiguous blocks 2008-10-21 20:39 the entire vfs interface/libraries really seem to be optimized/written for older 'simpler' filesystems 2008-10-21 20:39 it's because we "evolve" linux 2008-10-21 20:39 with incremental changes 2008-10-21 20:40 generally helpful for stability, but not for structure 2008-10-21 20:40 and then all the newer filesystems basically reimplement this or skip parts of it 2008-10-21 20:40 true 2008-10-21 20:40 the whole page, block, buffer seems like it should be much simpler 2008-10-21 20:40 major cut & paste culture 2008-10-21 20:40 mapping* 2008-10-21 20:40 nobody wants to read/understand this shit ;) 2008-10-21 20:40 hmm... isn't this the way MS does with their OS :P 2008-10-21 20:40 shapor, yes way simpler 2008-10-21 20:40 it's fairly obscene at the moment 2008-10-21 20:40 perhaps, but how many fs'es does MS support? 2008-10-21 20:40 I don't think akpm would argue that 2008-10-21 20:41 maze, a fraction of linux 2008-10-21 20:41 I was thinking about the OS not the FS part ;-) 2008-10-21 20:41 one thing bsders seem to tout is the fact they've done away with buffer heads 2008-10-21 20:41 I should check out how they went about that 2008-10-21 20:41 had some discussion with dilon about it 2008-10-21 20:42 bit didn't follow up by reading code 2008-10-21 20:42 I got the impression... by implementing a new, xfs like layer 2008-10-21 20:42 sounds like they just hid them then 2008-10-21 20:43 199 * Map blocks using the result from the previous get_blocks call first. <- ok I think I grok this now 2008-10-21 20:43 the filesystem is free to go ahead and map more blocks than the one asked for 2008-10-21 20:45 map in what sense? 
2008-10-21 20:45 interesting project is to go trace the lifetime of map_bh through this code 2008-10-21 20:45 map as in call ->get_block 2008-10-21 20:45 to get a physical mapping, store it in the bh->block 2008-10-21 20:45 bh->b_blocknr I think it was 2008-10-21 20:46 and bh->b_size is how much was mapped, in bytes 2008-10-21 20:47 I didn't notice that 2008-10-21 20:47 good eyes 2008-10-21 20:47 very ugly hack 2008-10-21 20:48 really pushing the buffer interface past the breaking point 2008-10-21 20:48 yes, incremental change 2008-10-21 20:48 are we finished with _readpages for now? 2008-10-21 20:48 I sure hope so ;-) 2008-10-21 20:49 let's see if we can figure out why writepages is used by ext2 and not by ext3 in the next 11 minutes 2008-10-21 20:49 uhm, wild guess... journalling 2008-10-21 20:49 which would only matter for data-journalled - except 2008-10-21 20:50 for writes beyond eof and in holes 2008-10-21 20:50 of course, but that's not a sufficiently precise answer 2008-10-21 20:50 getting more precise 2008-10-21 20:50 you can't use the get_block interface? 2008-10-21 20:50 the proposal is "it only matters for data=journalled" 2008-10-21 20:50 for jbd, i think 2008-10-21 20:50 uhm, no. 
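A rough userspace sketch of the map_bh trick discussed above: the callback is asked about one logical block but may report a longer mapping, and the caller keeps using that result until it runs out. Everything here is an assumption made for illustration, including the struct, the callback type, the toy 4-block extents and the 512-byte block size; only the two field names b_blocknr and b_size come from the discussion. In the kernel's mpage code b_size is a byte count, so the sketch shifts it by blkbits to get a number of blocks.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the two buffer_head fields mentioned above; not the kernel struct. */
struct map_bh_sketch {
    unsigned long b_blocknr;   /* first physical block of the mapping */
    size_t b_size;             /* bytes mapped; 0 means nothing mapped yet */
};

/* Hypothetical callback: map one logical block, possibly more than asked for. */
typedef void (*get_block_sketch_t)(unsigned long logical, struct map_bh_sketch *bh);

static int get_block_calls;    /* counts callback invocations for the demo */

/* Toy layout: logical block L is at physical 1000 + L, in extents of 4 blocks. */
static void toy_get_block(unsigned long logical, struct map_bh_sketch *bh)
{
    unsigned long left_in_extent = 4 - (logical % 4);

    get_block_calls++;
    bh->b_blocknr = 1000 + logical;
    bh->b_size = left_in_extent * 512;   /* assumes 512-byte blocks */
}

/* Resolve count logical blocks, reusing the previous mapping while it lasts. */
static void map_range(unsigned long first, size_t count, unsigned blkbits,
                      get_block_sketch_t get_block, unsigned long *phys_out)
{
    struct map_bh_sketch bh = { 0, 0 };
    unsigned long map_start = 0;   /* logical block where the cached mapping begins */

    for (size_t i = 0; i < count; i++) {
        unsigned long logical = first + i;
        size_t nblocks = bh.b_size >> blkbits;

        /* Only call back into the filesystem when the cached mapping is used up. */
        if (nblocks == 0 || logical < map_start || logical >= map_start + nblocks) {
            get_block(logical, &bh);
            map_start = logical;
        }
        phys_out[i] = bh.b_blocknr + (logical - map_start);
    }
}
```

Mapping eight blocks starting at logical block 0 with blkbits of 9 costs two callback invocations here instead of eight, which is the saving the fake buffer buys.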
2008-10-21 20:51 ok, to start with, writepages only works on data, not metadata 2008-10-21 20:51 the proposal was, it would only matter for data=journalled, except for write past eof, and sparse files, which is why it's always needed 2008-10-21 20:51 and hence the 3 different ext3 writepage implementations 2008-10-21 20:52 and no writepages implementation, which is the interesting question 2008-10-21 20:52 let's see under what conditions vfs calls ->writepages 2008-10-21 20:52 my guess is lack of writepages, means it falls back to using writepage one at a time 2008-10-21 20:52 perhaps writepages is too complicated to journalize 2008-10-21 20:53 actually data=ordered also needs special handling because of consistency guarantees it offers 2008-10-21 20:53 perhaps it can fail in too many ways :P 2008-10-21 20:53 or noone has bothered to yet ;-) 2008-10-21 20:53 (my guess) 2008-10-21 20:53 [since with journaling writes are slow anyways...] 2008-10-21 20:53 akpm would bother if it would make ext3 go faster 2008-10-21 20:54 so I'm rejecting that theory 2008-10-21 20:54 hmm, really? 2008-10-21 20:54 http://lxr.linux.no/linux+v2.6.26.6/mm/page-writeback.c#L1003 2008-10-21 20:54 you see the lengths that have gone to already 2008-10-21 20:54 we take our slight advantage over bsd seriously ;) 2008-10-21 20:54 right, so we use generic 2008-10-21 20:54 hmm? where's the advantage? 2008-10-21 20:55 so the answer may be: generic_writepages works for ext3, not for ext2 2008-10-21 20:55 maybe 2008-10-21 20:56 don't buy that 2008-10-21 20:56 anyway, we have found our way to the main place that pages are written in linux 2008-10-21 20:56 maybe the ext2 case could be more optimized? 2008-10-21 20:56 http://lxr.linux.no/linux+v2.6.26.6/fs/ext2/inode.c#L778 2008-10-21 20:56 http://lxr.linux.no/linux+v2.6.26.6/mm/page-writeback.c#L862 <- write_cache_pages 2008-10-21 20:56 let's see what is the generic one... 
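The fallback guessed at above (no ->writepages means ->writepage gets called one page at a time) can be sketched with a function-pointer table. This is a simplified userspace model with hypothetical names; the real hooks take a mapping and a writeback_control rather than an array of pages, and the real generic path also deals with locking and page lookup, which are left out here.

```c
#include <assert.h>
#include <stddef.h>

struct page_sketch { int dirty; };

/* Hypothetical, trimmed-down stand-in for address_space_operations. */
struct aops_sketch {
    int (*writepage)(struct page_sketch *page);
    int (*writepages)(struct page_sketch *pages, size_t n);   /* optional */
};

static int writepage_calls;   /* counts single-page writes for the demo */

/* Toy single-page writer: pretend the IO succeeded and clean the page. */
static int toy_writepage(struct page_sketch *page)
{
    writepage_calls++;
    page->dirty = 0;
    return 0;
}

/* Generic path: hand each dirty page to ->writepage, one at a time. */
static int generic_writepages_sketch(const struct aops_sketch *ops,
                                     struct page_sketch *pages, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int err;

        if (!pages[i].dirty)
            continue;
        err = ops->writepage(&pages[i]);
        if (err)
            return err;
    }
    return 0;
}

/* Use the filesystem's batch hook if it supplies one, otherwise fall back. */
static int do_writepages_sketch(const struct aops_sketch *ops,
                                struct page_sketch *pages, size_t n)
{
    if (ops->writepages)
        return ops->writepages(pages, n);
    return generic_writepages_sketch(ops, pages, n);
}
```

A filesystem that leaves writepages unset, as in this toy table, still gets all of its dirty pages written, just without the chance to batch them itself.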
2008-10-21 20:56 _2copy will only get a few fringe cases, most ext3 traffic will go through here 2008-10-21 20:57 I suppose nobody got around to plugging generic_writepages into ext2 2008-10-21 20:57 block_dev.c: .writepages = generic_writepages, 2008-10-21 20:58 hmm, I don't like my latest theory either 2008-10-21 20:58 truth is, I don't know and with 2 minutes to go I'm declaring it homework 2008-10-21 20:58 that, and "read generic_writepages" 2008-10-21 20:59 :-) 2008-10-21 20:59 did we have fun today? 2008-10-21 20:59 we're certainly wading in it 2008-10-21 20:59 sinking... 2008-10-21 20:59 I feel it was shorter... 2008-10-21 20:59 one thing worth remembering: there's weird locking going on through all of this 2008-10-21 21:00 and scheduling 2008-10-21 21:00 in other words, we're taking a superficial view of it so far 2008-10-21 21:01 see, mpage_writepages is just a wrapper for write_cache_pages too 2008-10-21 21:02 for every one thing i learn in these sessions i find out about 10 more i have no clue about, makes it feel like a net loss ;) 2008-10-21 21:02 lol 2008-10-21 21:02 ACTION feels pretty much the same... 2008-10-21 21:02 there's also so much history behind how it all is... 2008-10-21 21:03 anyway what's up with halloween? 2008-10-21 21:04 and who's on these 2 photos? http://phunq.net/sunset/.1024/.html/woohoo.jpg.html and http://phunq.net/sunset/.1024/.html/wheee.jpg.html 2008-10-21 21:04 ok, write_cache_pages is for filesystems that don't supply a get_block, but do supply a ->writepage 2008-10-21 21:04 when is holloween? :D 2008-10-21 21:04 oct 31 2008-10-21 21:04 maze, we're making arrangements for something rather cool 2008-10-21 21:05 cool, but where and when? 2008-10-21 21:05 oct 31 2008-10-21 21:05 venice beach 2008-10-21 21:05 we can start early 2008-10-21 21:05 on 3rd street 2008-10-21 21:05 so Friday next week 2008-10-21 21:05 soon, yes 2008-10-21 21:05 expect email 2008-10-21 21:05 does venice beach, mean beach in venice, ca? 
2008-10-21 21:06 yes, just south of santa monica 2008-10-21 21:06 caveat: you need to be on the southwest size of the 405 before late afternoon 2008-10-21 21:06 ugh, that's even farther than Malibu... 2008-10-21 21:06 before early afternoon even 2008-10-21 21:06 I'm about 10 minutes from malibu 2008-10-21 21:06 MaZe: planning on being in malibu? 2008-10-21 21:06 15 maybe 2008-10-21 21:07 no, just Malibu has somehow always seemed to be as a tropical paradise on the other end of the world (back when I lived in eu) 2008-10-21 21:07 :) 2008-10-21 21:07 oh, 26 minutes it tells me 2008-10-21 21:08 depends from where in malibu 2008-10-21 21:08 malibu is 27 miles itself 2008-10-21 21:08 I can't think of as "where movie stars get arresting for driving their suvs drunk" 2008-10-21 21:08 355 miles 2008-10-21 21:08 i think of it as the gateway to the santa monica mountains 2008-10-21 21:09 all the nice roads, vrewm 2008-10-21 21:10 start early? around when? 2008-10-21 21:11 ah, ext3 has its own kernel feature: ->write_begin, ->write_end for order write 2008-10-21 21:11 also used by btrfs I think 2008-10-21 21:12 btrfs doesn't have 2008-10-21 21:12 planned to be used 2008-10-21 21:13 prepare_write? 2008-10-21 21:13 not really 2008-10-21 21:13 different thing 2008-10-21 21:14 having a hard time finding the call point in vfs 2008-10-21 21:14 e.g. pagecache_write_begin? 
2008-10-21 21:15 this one call: http://lxr.linux.no/linux+v2.6.26.6/drivers/block/loop.c#L769 2008-10-21 21:15 but surely that isn't the only one 2008-10-21 21:15 http://lxr.linux.no/linux+v2.6.26.6/mm/filemap.c#L1912 2008-10-21 21:15 hirofumi, right 2008-10-21 21:16 from cscope: 2008-10-21 21:16 fs/affs/file.c affs_truncate 829 res = mapping->a_ops->write_begin(NULL, mapping, size, 0, 0, &page, &fsdata); 2008-10-21 21:16 fs/ext4/inode.c ext4_page_mkwrite 4851 ret = mapping->a_ops->write_begin(file, mapping, page_offset(page), 2008-10-21 21:16 mm/filemap.c pagecache_write_begin 2020 return aops->write_begin(file, mapping, pos, len, flags, 2008-10-21 21:16 mm/filemap.c generic_perform_write 2429 status = a_ops->write_begin(file, mapping, pos, bytes, flags, 2008-10-21 21:16 well, btrfs seems to do much difference way 2008-10-21 21:17 ok, it's a wrapper for grab_cache_page and prepare_write, or a fs hook 2008-10-21 21:17 really crufty 2008-10-21 21:18 luckly, prepare_write will go away soon 2008-10-21 21:18 and all one will use ->write_begin 2008-10-21 21:18 it will? never understood what it was for 2008-10-21 21:18 in the first place 2008-10-21 21:19 so I guess the answer is, it was always bogus 2008-10-21 21:20 three calls from buffer.c look like the only interesting ones 2008-10-21 21:21 hmm 2008-10-21 21:21 block_prepare_write()? not ->prepare_write 2008-10-21 21:21 those are frigne cases 2008-10-21 21:22 it's quite impressive how all this churn has happened and most filesystem code is barely affected 2008-10-21 21:23 recent bloatup in core is pretty scary 2008-10-21 21:23 ->prepare_write is replaced by ->write_begin 2008-10-21 21:24 ->commit_write was replaced by ->write_end 2008-10-21 21:31 flips, btw, do you already have ideas for buffer management? 
2008-10-21 21:32 hirofumi, yes 2008-10-21 21:33 oh, great 2008-10-21 21:33 hirofumi, it's the main topic of the post I've been working on for the last week 2008-10-21 21:33 hopefully I'll post in about 2 hours 2008-10-21 21:34 great :) 2008-10-21 21:43 -!- FelipeS_(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-21 21:46 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-21 21:52 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-21 21:53 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-21 21:54 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-22 00:47 -!- pgquiles(~pgquiles@71.Red-79-154-137.staticIP.rima-tde.net) has joined #tux3 2008-10-22 01:02 ACTION reads the backlog 2008-10-22 01:05 ACTION goes to bed 2008-10-22 01:13 night 2008-10-22 03:59 -!- pgquiles_(~pgquiles@156.Red-88-25-133.staticIP.rima-tde.net) has joined #tux3 2008-10-22 04:59 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-22 06:55 -!- mingming(~mingming@c-24-22-117-202.hsd1.or.comcast.net) has joined #tux3 2008-10-22 08:21 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-22 10:07 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-10-22 10:45 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-22 10:47 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-22 15:02 -!- kbingham_(~kbingham@82-46-4-172.cable.ubr03.aztw.blueyonder.co.uk) has joined #tux3 2008-10-22 15:30 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-22 15:54 cscope it pretty nice! 
2008-10-22 16:02 it is 2008-10-22 16:02 faster than lxr 2008-10-22 16:03 but no back button or open in new tab or post url to code 2008-10-22 16:03 I'd better wrap up this post and post it before my head explodes 2008-10-22 16:03 I'm looking forward to the sound of other people's heads exploding 2008-10-22 17:15 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-22 18:22 shapor, what do you suppose is more correct for ctime, should it be as of the write to buffer cache, or as of the actual transfer to disk? 2008-10-22 18:39 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-22 18:40 -!- ajonat(~ajonat@190.48.101.29) has joined #tux3 2008-10-22 20:09 buffer cache 2008-10-22 20:45 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-22 20:51 ...one filnal prooread... 2008-10-22 20:53 hey 2008-10-22 21:20 there, enough of that 2008-10-22 21:22 http://tux3.org/pipermail/tux3/2008-October/000300.html 2008-10-22 22:05 emacs (and modified cscope.el) and cscope is best tool to me for now 2008-10-22 22:10 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-23 00:27 flips: nice 2008-10-23 00:28 ACTION is reading now 2008-10-23 00:28 enjoy 2008-10-23 00:29 was doing some testing tonight against the scheduler. there's no way the current scheduler rebalancing code can guarantee determinancy since it still balance on a best-effort and can double/cross lock runqueues delaying the cpu local schedule() calls from being able to reschedule 2008-10-23 00:29 I'll have do some kind of rt based processor isolations that's possibly dynamic 2008-10-23 00:29 determinacy? 
2008-10-23 00:29 flips: you and matt are a great combination, like two peas in a po 2008-10-23 00:30 flips: deterministic latency 2008-10-23 00:30 seems like 2008-10-23 00:30 the -rt patch is fully preemptible but the schedule is mismatch because it's largely best effort 2008-10-23 00:31 s/realtime/rubbertime/ 2008-10-23 00:34 I wonder if the -rt patch still fails to do swsuspend properly 2008-10-23 00:35 don't know 2008-10-23 00:37 po=pod 2008-10-23 00:54 flips: btw, you'll have ot abstract the reagular file handling code with the metadata file stuff if you didn't know that already using routines so that the metadata files aren't... 2008-10-23 00:54 treat the same as regular files. They'll still use basic file load routines and stuff, but not have the same semantics in the fs 2008-10-23 00:54 you probably know that already 2008-10-23 00:56 you'll need some kind atomic write barrier, I guess, as well 2008-10-23 00:59 ACTION remembers soft-updates in ffs 2008-10-23 00:59 er FreeBSD UFs 2008-10-23 01:03 flips: the email is too design heavy for most regular lkml folks, but keep on going... 2008-10-23 01:05 ACTION never liked that aspect of Linux kernel culture 2008-10-23 01:07 flips: isn't this going to require VM changes as well for the forked-buffer stuff ? 2008-10-23 01:07 or does that stuff already exist from the ext3 work ? 2008-10-23 01:09 Well, it's the best of the worse ways :\ 2008-10-23 01:15 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2008-10-23 01:18 bh, I didn't post it to lkml 2008-10-23 01:19 bh, all the necessary interfaces are available to modules 2008-10-23 01:19 and if they weren't, I 'd make them 2008-10-23 01:19 ok, what about the buffer forking ? 
2008-10-23 01:19 ok 2008-10-23 01:19 makes snse 2008-10-23 01:19 sense 2008-10-23 01:19 should work out fine 2008-10-23 01:19 because it's kind of a dramatic thing for the VM 2008-10-23 01:19 we'll make it a tux3 U homework project 2008-10-23 01:20 ext3 does a similar thing 2008-10-23 01:20 yeah, figured as much 2008-10-23 01:20 which is one of the reasons the interfaces have to be exposed 2008-10-23 01:20 but it's probably not as sophisticated 2008-10-23 01:21 it's a journal, it has its own sophistication 2008-10-23 01:21 read the linked pdf for some great entertainment 2008-10-23 01:22 showing design stuff on lkml isn't entirely pointless, at least jon corbet reads it 2008-10-23 01:22 and understands 2008-10-23 01:23 that's good 2008-10-23 01:23 he kind of cares 2008-10-23 01:24 man this is going to trigger edge cases like crazy 2008-10-23 01:26 ACTION wishes he can work on this :\ 2008-10-23 01:26 easy enough to get that wish granted 2008-10-23 01:28 nice speculative recovery 2008-10-23 01:28 posting to lkml won't hurt 2008-10-23 01:28 it's interesting reading 2008-10-23 01:29 flips: you store the metadata/header for extents in reverse order right ? 2008-10-23 01:30 yes 2008-10-23 01:30 nice 2008-10-23 01:31 because I'm thiking about that blob thing 2008-10-23 01:31 you can versio that metadata with constant in a fixed location 2008-10-23 01:32 versio? 
2008-10-23 01:32 version I guess 2008-10-23 01:32 extent version using a special number 2008-10-23 01:32 yes 2008-10-23 01:33 so that you kow how to read that structure 2008-10-23 01:33 versioning strategy's pretty well worked out 2008-10-23 01:33 typing with one hand right now :) 2008-10-23 01:52 ok night 2008-10-23 01:52 hello 2008-10-23 03:48 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-23 04:30 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-10-23 07:07 -!- pgquiles_(~pgquiles@156.Red-88-25-133.staticIP.rima-tde.net) has joined #tux3 2008-10-23 07:12 -!- mlankhorst(~m@fw1.astro.rug.nl) has joined #tux3 2008-10-23 09:07 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-23 09:10 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-23 10:08 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-10-23 10:13 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-23 10:42 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-23 10:59 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-23 11:41 quick q: page->lru is something that filesystem will not mess with, right? 2008-10-23 12:02 -!- pgquiles(~pgquiles@156.Red-88-25-133.staticIP.rima-tde.net) has joined #tux3 2008-10-23 12:06 IIRC, page->lru is vm stuff. it will be used to manage which is page active. and if vm want more free memory, it may ask to clean inactive pages to fs 2008-10-23 12:07 great! 
2008-10-23 12:07 I'm the VM in my case ;-) 2008-10-23 12:35 -!- pgquiles(~pgquiles@156.Red-88-25-133.staticIP.rima-tde.net) has joined #tux3 2008-10-23 12:56 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-23 13:23 -!- FelipeS_(~Felipe@lawn-128-61-31-5.lawn.gatech.edu) has joined #tux3 2008-10-23 14:46 razvanm, correct 2008-10-23 14:46 thought the filesystem my move pages to the front or back of the lru queue if it thinks it knows something the vmm doesn't 2008-10-23 15:02 my tux3.notes file is about 2,000 lines long 2008-10-23 15:02 mostly consisting of posts I haven't posted 2008-10-23 15:18 wow... big backlog :P 2008-10-23 15:59 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-23 16:43 folks 2008-10-23 17:16 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-23 17:30 ponk 2008-10-23 17:33 sk8 oclock 2008-10-23 19:11 hello 2008-10-23 19:11 which stage will do "physical remapping"? rollup? 2008-10-23 19:13 hirofumi, not rollup 2008-10-23 19:13 phase transition 2008-10-23 19:13 oh 2008-10-23 19:13 when a new phase is ready to commit to disk, first think to do is flush all dirty inodes 2008-10-23 19:14 all dirty inodes means ileaf? or whole itable btree? 
2008-10-23 19:15 flushing dirty inodes in kernel would call write_pages, in tux3 userspace calls write_buffer->map->ops->brwrite 2008-10-23 19:15 dirty inode table blocks have to be flushed too 2008-10-23 19:15 well 2008-10-23 19:15 not flushed, but committed 2008-10-23 19:16 flushing is just the process of committing cached data to writeout 2008-10-23 19:16 yes 2008-10-23 19:16 the whole btree does not have to be committed 2008-10-23 19:17 because we have the "promise" system 2008-10-23 19:17 -!- FelipeS_(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-23 19:17 ACTION have to read that email more deeply 2008-10-23 19:17 we just write out the leaf nodes and "promise" to update the pointers in parents 2008-10-23 19:18 maybe promise is logical logging? 2008-10-23 19:18 yes 2008-10-23 19:18 i see 2008-10-23 19:19 I used to call it logical records in commit blocks 2008-10-23 19:19 promise is short for that 2008-10-23 19:19 i see 2008-10-23 19:20 another question is: 2008-10-23 19:20 modified buffers in active tree can't free, because we don't know 2008-10-23 19:20 final state of that buffer until rollup? If so, we must pin many 2008-10-23 19:20 btree-index buffers (etc.) of active tree if user modified may inodes? 2008-10-23 19:20 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-23 19:20 exactly 2008-10-23 19:21 i see 2008-10-23 19:21 that is the "dirty metadata" 2008-10-23 19:21 which must be reconstructed if we crash 2008-10-23 19:21 using the logical records in the commit blocks (promises) 2008-10-23 19:21 yes 2008-10-23 19:22 even if vm wants more memory, we can't free those buffers? 2008-10-23 19:27 how will we handle ENOSPC? we must do rollup/pahse transition to make more free space? 
2008-10-23 19:28 if it is possible 2008-10-23 19:39 hirofumi, yes, those buffers pin memory even of the vm is low on memory, so we need to make sure not to use too much 2008-10-23 19:40 i see 2008-10-23 19:40 the closer we get to filesystem full, the shorter a phase can be 2008-10-23 19:41 when the vmm is very low on memory, it sets the PF_MEMALLOC flag and calls ->writepage to free memory 2008-10-23 19:41 the PF_MEMALLOC flag gives the filesystem access to an emergency reserve of a few megabytes 2008-10-23 19:42 yes 2008-10-23 19:43 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-23 19:43 otherwise, if cache memory is low but the vmm has not called our filesystem to write out dirty pages, then we will just block in alloc_pages waiting for memory to be freed 2008-10-23 19:45 i see. I thought if we can handle with more few memory, it's great 2008-10-23 19:46 it won't use much memory, it's just index blocks that are pinned 2008-10-23 19:46 each index block references up to 512 data blocks 2008-10-23 19:47 index blocks and bitmap blocks, each bitmap block covers 128 MB of filesystem blocks 2008-10-23 19:47 i see 2008-10-23 19:48 if we need to unpin some then we do a rollup 2008-10-23 19:48 being careful to always have enough memory in the emergency reserve to do the rollup 2008-10-23 19:49 i see 2008-10-23 19:50 i read someone says modern fs uses too much memory 2008-10-23 19:50 btrfs/hammer etc. 2008-10-23 19:51 zfs 2008-10-23 19:51 yes 2008-10-23 19:51 they aren't careful with memory 2008-10-23 19:51 I have been very careful 2008-10-23 19:51 i really dislike that 2008-10-23 19:52 right, it's no good to have more memory if it is just wasted 2008-10-23 19:52 how do they waste it? 2008-10-23 19:52 ZFS has 128 byte block pointers for example 2008-10-23 19:54 128 bytes?!? 2008-10-23 19:54 1024 bits? 2008-10-23 19:54 huge space 2008-10-23 19:54 why do they need them so big? 2008-10-23 19:55 good question 2008-10-23 19:55 128bits? 
not bytes 2008-10-23 19:55 -!- RalucaM(~ral@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-23 19:55 128 bits sounds better :D 2008-10-23 19:55 one thing is, when they do their raid, they like to put multiple pointers to redundant copies of the block in the block pointer 2008-10-23 19:55 hi 2008-10-23 19:55 128 bytes, yes 2008-10-23 19:56 hi raluca 2008-10-23 19:56 ACTION has bad manners :P 2008-10-23 19:57 ah 2008-10-23 19:57 in that huge size they have plenty of space to even store a sha1 or an md5 2008-10-23 19:58 yes, they do that 2008-10-23 20:00 it's tux3 oclock 2008-10-23 20:00 yup 2008-10-23 20:00 I was looking for the zfs repository 2008-10-23 20:00 ACTION is ready 2008-10-23 20:00 I've been in there before 2008-10-23 20:00 google didn't find it right away 2008-10-23 20:01 left as an exercise for the interested reader how they wasted, err, needed 128 bytes for a block pointer 2008-10-23 20:01 i'm useing git for opensolaris 2008-10-23 20:01 right, there's an online repo somewhere on opensolaris.org 2008-10-23 20:02 google just didn't find it right away 2008-10-23 20:02 git://repo.or.cz/opensolaris.git 2008-10-23 20:02 ok, let's be a little selfish today and take a look at something we actually need for tux3 2008-10-23 20:02 mirror though 2008-10-23 20:02 the latest post mentions "forking" a buffer 2008-10-23 20:03 that happens when we want to change a buffer, but it is already committed to writeout 2008-10-23 20:03 hey 2008-10-23 20:03 or another way of putting it, it not in the current phase 2008-10-23 20:03 so we can't change it any more 2008-10-23 20:04 what we do is remove the underlying page from the buffer cache, or in kernel, the page cache 2008-10-23 20:04 copy the data to another page, and put that in the page cache 2008-10-23 20:04 let's look at kernel code to see how we might make that work 2008-10-23 20:04 where should we look first? 2008-10-23 20:04 buffer.c? :D 2008-10-23 20:05 what are we looking for? 
2008-10-23 20:05 a write 2008-10-23 20:05 we're looking for where the block is cached 2008-10-23 20:05 remember, in kernel, buffers are just handles for block IO 2008-10-23 20:06 as opposed to in tux3 userspace where we tend to think of them as cached blocks 2008-10-23 20:06 well, the still are, but in kernel they are not the primary unit 2008-10-23 20:06 pages are 2008-10-23 20:07 there are two kinds of places where filesystems cache block 2008-10-23 20:07 the "buffer cache", which is just a page cache mapped one to one to the block device 2008-10-23 20:07 and the so called "page cache" which is a page cache per inode 2008-10-23 20:08 "page cache" is actually misnamed, it sounds like one big caches, it's actually lots of caches 2008-10-23 20:08 ok, so where we should look depends on the kind of block we need to fork 2008-10-23 20:09 suppose it is a directory entry block, where do we look? 2008-10-23 20:09 dirent is cached as page cache, or buffer cache? 2008-10-23 20:09 tell me 2008-10-23 20:10 and your logic 2008-10-23 20:10 dirent is allocted using the slab allocator 2008-10-23 20:10 that was a wild guess 2008-10-23 20:10 and the content of the directory is just a file 2008-10-23 20:10 right 2008-10-23 20:10 so we should look in... 2008-10-23 20:11 ok, let's look at ext3_bread 2008-10-23 20:11 see where it goes 2008-10-23 20:12 should we use .26 or .27 today? 2008-10-23 20:12 let's try .27 2008-10-23 20:12 http://lxr.linux.no/linux+v2.6.27/fs/ext3/inode.c#L1054 2008-10-23 20:12 keep up with mainline, more or less 2008-10-23 20:12 yes 2008-10-23 20:12 RazvanM, ok, follow it in, see where it goes 2008-10-23 20:14 found the next function in? 2008-10-23 20:14 (sorry, I was looking for the dentry_cache :P) 2008-10-23 20:15 next is ext3_getblk 2008-10-23 20:15 right 2008-10-23 20:15 and where does it go from there? 
2008-10-23 20:15 the main call 2008-10-23 20:15 hint: don't worry about the ext3 handle 2008-10-23 20:15 ll_rw_block 2008-10-23 20:16 for now 2008-10-23 20:16 which looks to be deprecated :| 2008-10-23 20:16 look closer 2008-10-23 20:16 sb_getblk after get_block like op 2008-10-23 20:16 right 2008-10-23 20:16 let's see how that works 2008-10-23 20:16 sb_getblk 2008-10-23 20:16 we've looked at it before 2008-10-23 20:16 it sounds familiar :P 2008-10-23 20:17 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L1403 2008-10-23 20:17 right 2008-10-23 20:18 follow the _slow path 2008-10-23 20:18 I know those 2008-10-23 20:18 this is where the kernel code is really crappy ;) 2008-10-23 20:18 the block to page mapping is done in grow_buffers :p 2008-10-23 20:18 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L1119 2008-10-23 20:19 grow_buffers :D 2008-10-23 20:19 love the name 2008-10-23 20:19 I woudn't expect anybody to guess taht 2008-10-23 20:19 took me 10 minutes to figure it out last time we were in here 2008-10-23 20:19 1109 /* Create a page with the proper size buffers.. */ 2008-10-23 20:20 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L1028 2008-10-23 20:20 1048 if (!try_to_free_buffers(page)) 2008-10-23 20:20 1049 goto failed; 2008-10-23 20:20 <- lovely 2008-10-23 20:21 and at failed we BUG 2008-10-23 20:21 this mechanism is in a state of transition ;) 2008-10-23 20:21 1055 bh = alloc_page_buffers(page, size, 0); 2008-10-23 20:21 flips: can you explain again the link between the page and buffer_head? :D 2008-10-23 20:21 each page has a page to attach a circular list of buffer_heads to it 2008-10-23 20:22 as many as there can be blocks on the page 2008-10-23 20:22 usually one 2008-10-23 20:22 page is 4k usually 2008-10-23 20:22 yes 2008-10-23 20:22 is the block usually 4k? 
2008-10-23 20:22 yes 2008-10-23 20:22 cool 2008-10-23 20:22 default for nearly all filesystems 2008-10-23 20:22 didn't know that :D 2008-10-23 20:23 it's not very cool actually, because 4K is a bit small on modern hardware 2008-10-23 20:23 non-unix fs has 512bytes 2008-10-23 20:23 this is a big flaw in linux 2008-10-23 20:23 can't ahve buffer bigger than page 2008-10-23 20:23 romfs has 1KB I think :P 2008-10-23 20:24 smaller blocks create less external fragmentation 2008-10-23 20:24 one more q: what happen when there is more than one bh in a page? 2008-10-23 20:24 that will only be the case when block size is a fraction of page size 2008-10-23 20:24 right 2008-10-23 20:24 see all that code that checks for buffers being there and puts them there if they are not 2008-10-23 20:25 sometime when you have a lot of time on your hands, go read try_to_free_buffers 2008-10-23 20:25 ...the worst fundtion in the entire kernel 2008-10-23 20:25 so the bh in page are continuos? 2008-10-23 20:25 continous? 2008-10-23 20:25 continuous? 2008-10-23 20:26 contiguous 2008-10-23 20:26 yes 2008-10-23 20:26 what space on the disk do they cover 2008-10-23 20:26 and that's important 2008-10-23 20:26 because what it does in the case of block smaller than page is create false sharing 2008-10-23 20:26 good... I didn't know that :D 2008-10-23 20:27 not contiguous on disk 2008-10-23 20:27 contiguous in memory 2008-10-23 20:27 aaaaaa 2008-10-23 20:27 sorry 2008-10-23 20:27 grrr... 2008-10-23 20:27 I liked the other answer better :D 2008-10-23 20:27 yes, that leads to a lot of headaches 2008-10-23 20:27 exactly!! 2008-10-23 20:28 so in tux3, we want to branch a buffer, but we actually have to mess with a whole page 2008-10-23 20:28 but in tux3 the buffer will be 4k, right? 
2008-10-23 20:28 not necessarily 2008-10-23 20:28 tux3 can handle 256 byte blocks 2008-10-23 20:29 I think we decided to make the smallest 512 2008-10-23 20:29 linux sector size 2008-10-23 20:29 yes :) 2008-10-23 20:29 let's keep going in 2008-10-23 20:29 ok 2008-10-23 20:29 find_or_create_page 2008-10-23 20:30 http://lxr.linux.no/linux+v2.6.27/mm/filemap.c#L720 2008-10-23 20:30 add_to_page_cache_lru 2008-10-23 20:30 add_to_page_cache 2008-10-23 20:31 add_to_page_cache_locked 2008-10-23 20:31 this looks also familiar... 2008-10-23 20:31 radix_tree_insert 2008-10-23 20:32 http://lxr.linux.no/linux+v2.6.27/lib/radix-tree.c#L291 2008-10-23 20:32 there we see the nice new rcu code that got added by peterz in the last cycle 2008-10-23 20:32 lockless pagecache? 2008-10-23 20:33 wait, it was there before 2008-10-23 20:34 ok, let's poke around in _insert for a while 2008-10-23 20:34 it's good to have an idea what happens there 2008-10-23 20:35 it's a radix tree with branching factor 64 2008-10-23 20:35 meaning we have a lot of page cache pointers sitting next to each other 2008-10-23 20:35 it's tempting to use that fact when we are operating on pages that are contiguous in the page cache 2008-10-23 20:35 to avoid lookups 2008-10-23 20:36 I don't know of any kernel code that has actually done that though 2008-10-23 20:36 also haven't looked hard 2008-10-23 20:37 note: lockless page cache is due to nick piggin 2008-10-23 20:37 and it isn't completely merged yet 2008-10-23 20:37 I presume that a goal is to get rid of even the rcu locks from the radix tree 2008-10-23 20:38 oh, i thought it was done 2008-10-23 20:38 part went in 2008-10-23 20:38 _insert is actually pretty simple 2008-10-23 20:39 add levels if we're trying to insert at a high address 2008-10-23 20:39 otherwise drill down through levels masking off the index 2008-10-23 20:39 empty parts of the tree have null pointers, fill them in if in our path 2008-10-23 20:40 that's about it. 
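The _insert walk described above (add levels when the index is too high, otherwise drill down masking off the index, filling in null pointers on the path) can be sketched in userspace C as follows. This is a simplified model under stated assumptions, not the kernel's lib/radix-tree.c: all toy_ names are hypothetical, there is no RCU or locking, no tags, and an empty tree is represented as height 0.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_SHIFT 6                       /* branching factor 64, as in the log */
#define TOY_SLOTS (1UL << TOY_SHIFT)
#define TOY_MASK  (TOY_SLOTS - 1)

struct toy_node { void *slots[TOY_SLOTS]; };

struct toy_root {
    unsigned height;                      /* 0 means empty tree */
    struct toy_node *node;
};

/* Largest index a tree of the given nonzero height can hold. */
static unsigned long toy_maxindex(unsigned height)
{
    unsigned bits = height * TOY_SHIFT;

    if (bits >= sizeof(unsigned long) * 8)
        return ~0UL;
    return (1UL << bits) - 1;
}

/* Add one level on top; the old tree becomes subtree 0 of the new top node. */
static int toy_extend(struct toy_root *root)
{
    struct toy_node *node = calloc(1, sizeof(*node));

    if (!node)
        return -1;
    node->slots[0] = root->node;
    root->node = node;
    root->height++;
    return 0;
}

/* Returns 0 on success, -1 on allocation failure, -2 if the slot is taken. */
static int toy_insert(struct toy_root *root, unsigned long index, void *item)
{
    struct toy_node *node;
    unsigned shift;

    /* Add levels if we are trying to insert at a high index. */
    while (root->height == 0 || index > toy_maxindex(root->height))
        if (toy_extend(root))
            return -1;

    /* Drill down, masking off six bits of the index per level. */
    node = root->node;
    shift = (root->height - 1) * TOY_SHIFT;
    while (shift > 0) {
        unsigned long offset = (index >> shift) & TOY_MASK;

        if (!node->slots[offset]) {       /* empty part of the tree: fill it in */
            node->slots[offset] = calloc(1, sizeof(struct toy_node));
            if (!node->slots[offset])
                return -1;
        }
        node = node->slots[offset];
        shift -= TOY_SHIFT;
    }

    if (node->slots[index & TOY_MASK])
        return -2;
    node->slots[index & TOY_MASK] = item;
    return 0;
}

/* Same walk without allocation; NULL if nothing is stored at index. */
static void *toy_lookup(const struct toy_root *root, unsigned long index)
{
    struct toy_node *node;
    unsigned shift;

    if (root->height == 0 || index > toy_maxindex(root->height))
        return NULL;
    node = root->node;
    shift = (root->height - 1) * TOY_SHIFT;
    while (shift > 0) {
        node = node->slots[(index >> shift) & TOY_MASK];
        if (!node)
            return NULL;
        shift -= TOY_SHIFT;
    }
    return node->slots[index & TOY_MASK];
}
```

Because each leaf node holds 64 neighbouring slots, entries for contiguous page indexes sit next to each other in memory, which is the property the log suggests might be used to avoid repeated lookups.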
RCU strangeness to think about 2008-10-23 20:40 otherwise we're done here 2008-10-23 20:40 ok, so what is a tux3 buffer fork going to look like, based on what we just looked at? 2008-10-23 20:41 ok, lookup cache, then copy data to new cache, and insert new pos on radix tree 2008-10-23 20:42 ? 2008-10-23 20:42 basically, and we'll need to worry about locking 2008-10-23 20:42 copy dat to new cache will happen in multiple steps, right? 2008-10-23 20:42 and we need to worry about false sharing 2008-10-23 20:42 there are other blocks onthe same page, what happens to them? 2008-10-23 20:43 it's a per-block operation as currently conceived 2008-10-23 20:44 don't copy other blocks, becase new cache may not be contiguous 2008-10-23 20:44 hirofumi, when we branch a block we don't change its position 2008-10-23 20:44 um.. 2008-10-23 20:44 we just pull the page that carries the block data out of the page cache, leaving a copy in its place 2008-10-23 20:45 we don't necessarily even need buffer heads on the page we pull out of cache 2008-10-23 20:45 because nobody is going to be changing it, hence no need for per-block locking 2008-10-23 20:45 and we can do the actual transfer to disk with a bio 2008-10-23 20:46 so we will just be pulling the underlying page out and replacing it with a new page 2008-10-23 20:46 that has the effect of forking all the buffers on the same page 2008-10-23 20:46 so we must have some bits in the buffer_head flags to tell us which phase a buffer belongs to 2008-10-23 20:47 um.. one may be ileaf, and one may be dleaf etc.? 
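A rough userspace model of the fork operation discussed above: if the cached page belongs to an older phase (already committed to writeout), pull it out of the cache and leave a copy in its place at the same index. The names, the fixed-size slot array standing in for the per-inode radix tree, and the per-page phase field are all assumptions made for illustration; the log only says that some flag bits must record which phase a buffer belongs to, and locking is ignored entirely here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE   4096
#define TOY_CACHE_SLOTS 16

struct toy_page {
    unsigned phase;                       /* phase these contents belong to */
    unsigned char data[TOY_PAGE_SIZE];    /* may hold several blocks */
};

struct toy_cache {
    struct toy_page *slots[TOY_CACHE_SLOTS];  /* stand-in for the radix tree */
};

/*
 * Return a page at 'index' that may be modified in 'current_phase'.
 * If the cached page is from an older phase, it is replaced in the cache
 * by a copy and handed back through *frozen so writeout can finish with it.
 * Returns NULL if nothing is cached at 'index' or if allocation fails.
 */
static struct toy_page *toy_fork_page(struct toy_cache *cache, unsigned index,
                                      unsigned current_phase,
                                      struct toy_page **frozen)
{
    struct toy_page *old, *copy;

    assert(index < TOY_CACHE_SLOTS);
    old = cache->slots[index];
    *frozen = NULL;

    if (!old || old->phase == current_phase)
        return old;                       /* nothing to fork */

    copy = malloc(sizeof(*copy));
    if (!copy)
        return NULL;
    memcpy(copy->data, old->data, TOY_PAGE_SIZE);
    copy->phase = current_phase;

    cache->slots[index] = copy;           /* same position, new page */
    *frozen = old;                        /* old page is now out of the cache */
    return copy;
}
```

Since the whole page is copied, every block on that page is forked together, which is the false-sharing effect mentioned in the discussion when the block size is smaller than the page size.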
2008-10-23 20:47 that is, whether it has already been forked or not 2008-10-23 20:47 here in a file page cache we will only find file data or directory data or bitmap block 2008-10-23 20:47 later atom stuff 2008-10-23 20:47 ileaf and dleaf live in the buffer cache 2008-10-23 20:48 which is direct-mapped to the block device 2008-10-23 20:48 it is handled in much the same way 2008-10-23 20:48 ext3 does not directly perform operations on the buffer cache, I think I recall 2008-10-23 20:49 but lets the vfs do it 2008-10-23 20:49 using the generic_ functions 2008-10-23 20:49 and ext3 just supplies a ->get_block function 2008-10-23 20:49 well, let's go see how ext3_get_block works 2008-10-23 20:50 at some point it obviously has to go read some metadata 2008-10-23 20:50 most likely with sb_bread 2008-10-23 20:50 yes 2008-10-23 20:51 http://lxr.linux.no/linux+v2.6.27/fs/ext3/inode.c#L953 2008-10-23 20:51 then http://lxr.linux.no/linux+v2.6.27/fs/ext3/inode.c#L786 2008-10-23 20:51 these functions are a little oddly structured 2008-10-23 20:51 because they are lockless 2008-10-23 20:52 ext3_block_to_path just does arithmetic 2008-10-23 20:52 because caller has lock of requested page? 2008-10-23 20:53 no, it's completely lockless 2008-10-23 20:53 it uses the block pointers like locks 2008-10-23 20:53 um.. what happen if it was truncated? 
2008-10-23 20:53 it checks and backs out the operation 2008-10-23 20:53 see verify_chain 2008-10-23 20:54 ok, 367 bh = sb_bread(sb, le32_to_cpu(p->key)); 2008-10-23 20:54 in ext3_get_branch 2008-10-23 20:54 so we got to a function that is familiar 2008-10-23 20:55 yes 2008-10-23 20:55 one thing to watch out for: some of the blocks on the buffer cache page may be data, that is, in an inode page cache, and some may be metadata, in the buffer cache 2008-10-23 20:55 I haven't thought about what impact that might have on the forking operation 2008-10-23 20:55 it's probably ok, but needs to be thought about 2008-10-23 20:56 ok, that's enough for today 2008-10-23 20:56 how'd we do on the interesting front? 2008-10-23 20:57 do forking? 2008-10-23 20:57 I meant, was it interesting? 2008-10-23 20:58 yes 2008-10-23 20:58 this lesson was above my knowledge level :P I have to dig more to qualify for it :P 2008-10-23 20:58 razvanm, it's not above your level 2008-10-23 20:59 just read through the log once more and it will all look simple 2008-10-23 20:59 I will :D 2008-10-23 21:00 next time I think we might take a look at how we might go about doing filesystem IO without having a tux3_get_block function 2008-10-23 21:00 I don't know myself whether it's practical to avoid this 2008-10-23 21:00 it's not particularly hard to implement 2008-10-23 21:00 but I think I want to see if we can just avoid the whole block IO library and work directly with the page cache and bio 2008-10-23 21:00 IIRC, btrfs uses get_extents or something 2008-10-23 21:00 sb_bread is our friend 2008-10-23 21:01 new invention 2008-10-23 21:01 I don't want to go that far just yet 2008-10-23 21:01 the api is likely to take some time to settle 2008-10-23 21:01 well 2008-10-23 21:01 I think it is the get_ model that is broken 2008-10-23 21:01 not the block vs extents 2008-10-23 21:02 the get_ model assumes there is a place to cache the physical pointer 2008-10-23 21:02 which historically has been the buffer_head
2008-10-23 21:02 but it turns out that the cached physical pointer is rarely used 2008-10-23 21:02 i see 2008-10-23 21:02 not enough to justify the strange mechanisms that are in place to handle it 2008-10-23 21:03 ah, becase we use forking on write. physical pointer is not used much? 2008-10-23 21:03 because 2008-10-23 21:04 I am thinking that if caching physical pointers is a win, then there should be a library function to do that that all filesystems can use, so that the whole IO path does not revolve around the need for a fs to generate a physical pointer 2008-10-23 21:04 forking happens on the "top end" of filesystem and is not part of writeout 2008-10-23 21:04 it is to avoid stalls in buffer IO calls 2008-10-23 21:05 "remapping" is where we change physical pointers around 2008-10-23 21:05 which occurs in filemap.c, during an inode flush 2008-10-23 21:05 and will also need to happen when we write out inode table blocks during phase commit 2008-10-23 21:06 and index blocks during rollup 2008-10-23 21:07 um.. natural delayed write.. 2008-10-23 21:08 delayed write -> delayed allocation 2008-10-23 21:09 yes 2008-10-23 21:09 I'm thinking about adding delayed inode number assignment too 2008-10-23 21:09 oh, i see 2008-10-23 21:10 could possibly confuse nfs 2008-10-23 21:13 thanks. this talk seems to clear my brain more or less 2008-10-23 21:15 you're welcome 2008-10-23 21:30 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-23 21:30 hmm, a little late 2008-10-23 21:30 ;-) 2008-10-23 21:31 good think we have a log 2008-10-23 21:31 good thing 2008-10-23 21:35 yup catching up now 2008-10-23 21:52 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-23 21:52 hi tim_dimm 2008-10-23 21:53 flips...
2008-10-23 21:53 word 2008-10-23 22:02 dword 2008-10-23 22:03 but not msword 2008-10-23 22:03 ddword 2008-10-23 22:11 qword 2008-10-23 22:11 f word 2008-10-23 22:16 wor dup 2008-10-23 22:16 sword 2008-10-23 22:16 *swish* 2008-10-23 22:16 ACTION cuts a swatch through the witty banter 2008-10-23 22:16 swishy 2008-10-23 22:17 wordy 2008-10-24 01:51 folks 2008-10-24 01:51 I see people are talking about the lockless page cache 2008-10-24 02:17 a bit of... 2008-10-24 05:54 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-24 08:30 -!- mingming(~mingming@c-24-22-117-202.hsd1.or.comcast.net) has joined #tux3 2008-10-24 08:52 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-24 09:05 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-24 09:15 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-10-24 09:21 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-24 10:06 -!- pgquiles(~pgquiles@156.Red-88-25-133.staticIP.rima-tde.net) has joined #tux3 2008-10-24 10:15 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-24 10:36 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-24 10:45 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-10-24 12:31 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-24 15:11 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-24 15:44 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-24 16:12 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-24 16:13 ACTION got to the point where he needs to understand how get_block works 2008-10-24 16:17 ACTION wishes underscores didn't make him cringe 2008-10-24 16:18 ravanm, the vfs calls the filesystem saying "for this buffer, what is the physical address?" 
and the filesystem fills in the b_blocknr in the buffer 2008-10-24 16:19 the function is misnamed 2008-10-24 16:19 it doesn't get the block, but the address of the block 2008-10-24 16:20 or precisely, "for this inode at this logical block offset, please fill in the buffer->b_blocknr" 2008-10-24 16:20 it's actually dumb to use a buffer to pass the result back, it should just be the function result 2008-10-24 16:25 thanks for the answer! :D 2008-10-24 16:25 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-10-24 16:25 to fill in the context, what I'm trying to implement is block_read_full_page 2008-10-24 16:28 well, I think it is time to do a big directory move, move everything out of user/test into / in the repo 2008-10-24 16:28 (thinking out loud) the block_read_full_page assumes that the page is filled by some bhs 2008-10-24 16:28 any objection? 2008-10-24 16:28 block_read_full_page will add buffers if there are none 2008-10-24 16:29 where will it take the info to do that from? 2008-10-24 16:29 from the index of the page? 2008-10-24 16:30 it only needs to know the block size 2008-10-24 16:30 which it gets from page->mapping->sb 2008-10-24 16:30 how about the start point? 2008-10-24 16:30 ok, let me see, we have doc/ in the top level of the repo 2008-10-24 16:30 (in romfs it was from index: http://lxr.linux.no/linux+v2.6.26/fs/romfs/inode.c#L432 ) 2008-10-24 16:31 so maybe I should just move user/test/* to user/* 2008-10-24 16:31 and then there will be a kernel and a fuse one? 2008-10-24 16:31 aaa... fuse is inside the user now 2008-10-24 16:32 yes, it would be doc/ user/ kernel/ 2008-10-24 16:33 doc/ user/ kernel/ README COPYING INSTALL Makefile 2008-10-24 16:33 something like that 2008-10-24 16:34 and the Makefile from the root will build both user and kernel?
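(editor's sketch) The get_block contract described above, "for this inode at this logical block offset, please fill in the buffer->b_blocknr", can be illustrated in user space roughly as follows. Everything named toy_* is hypothetical, as are the flat block map and the convention that 0 means "not mapped"; this is not kernel or tux3 code. The second function shows the alternative suggested in the log, where the address is simply the function result.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NBLOCKS 4
#define TOY_HOLE 0UL                 /* assumption: 0 means "not mapped" */

/* Hypothetical stand-ins for the kernel's inode and buffer_head. */
struct toy_inode  { unsigned long map[TOY_NBLOCKS]; };
struct toy_buffer { unsigned long b_blocknr; int mapped; };

/* get_block style: the physical address comes back through the buffer. */
int toy_get_block(const struct toy_inode *inode, unsigned long iblock,
                  struct toy_buffer *bh)
{
    assert(inode != NULL && bh != NULL);
    if (iblock >= TOY_NBLOCKS)
        return -1;                   /* beyond the end of the toy file */
    bh->b_blocknr = inode->map[iblock];
    bh->mapped = (bh->b_blocknr != TOY_HOLE);
    return 0;
}

/* The alternative: the physical address is just the function result. */
unsigned long toy_map_block(const struct toy_inode *inode, unsigned long iblock)
{
    assert(inode != NULL);
    return iblock < TOY_NBLOCKS ? inode->map[iblock] : TOY_HOLE;
}
```

With a toy inode whose map is { 100, 101, TOY_HOLE, 250 }, both functions report physical block 250 for logical block 3; the first one needs a buffer object to carry the answer, the second does not.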
2008-10-24 16:34 and user/ will eventually have fsck.tux3 2008-10-24 16:34 I guess we won't try to build kernel 2008-10-24 16:34 probably just call the makefile in user 2008-10-24 16:35 it could possibly extract a patch from a mercurial repo it knows about 2008-10-24 16:35 err 2008-10-24 16:35 git repo for kernel stuff 2008-10-24 16:35 it could also do make tarball and make docs 2008-10-24 16:35 or make deb even 2008-10-24 16:36 ok, anyway /test/ is going to go away right now 2008-10-24 16:37 sucks hg and git don't really understand the concept of mv 2008-10-24 16:37 hg does! 2008-10-24 16:37 I mean I moved some stuff around using it 2008-10-24 16:37 renamed a directory 2008-10-24 16:38 "rename files; equivalent of copy + remove" 2008-10-24 16:38 it fakes it 2008-10-24 16:38 yup 2008-10-24 16:38 but the history is not lost 2008-10-24 16:41 it is kind of 2008-10-24 16:41 it doesn't have the notion that the moved object is the same 2008-10-24 16:41 in fact hg and git really don't have notions of objects, only of equality 2008-10-24 16:41 equality of file text 2008-10-24 16:41 deficiency 2008-10-24 16:41 why is not enought? :D 2008-10-24 16:42 but it works sort of ok most of the time 2008-10-24 16:42 well for example if a person changes the name, you know they're the same person, even if they change hats too 2008-10-24 16:43 so if in a changeset you both mv and edit the mved file, git and hg lose track 2008-10-24 16:44 true 2008-10-24 16:44 Depends on how seriously it is modified 2008-10-24 16:44 I changed a file for less than 10% and it was still tracked 2008-10-24 16:45 this was in hg? 2008-10-24 16:45 or git? 
2008-10-24 16:47 git 2008-10-24 16:47 ok there we go 2008-10-24 16:48 On less than 10% it says 'moved' more was a rewrite iirc 2008-10-24 16:48 one can imagine the messy heuristics to do that 2008-10-24 16:48 accuracy is always better 2008-10-24 16:49 Wine can still trace some code that was rewritten a lot of times and moved various things around 2008-10-24 16:51 git-blame even found some wine 0.0.2 code 2008-10-24 16:53 Q: what does bmap? 2008-10-24 16:53 the original Monotone relied even more heavily on heuristcs for rename 2008-10-24 16:53 (the readpage looks pretty mess in minix :|) 2008-10-24 16:53 there are always places where it causes strange behavior 2008-10-24 16:54 bmap queries the filesystem about the physical location of a given logical block of a file 2008-10-24 16:54 bmap is problematic for filesystems that do deferred allocation 2008-10-24 16:54 static sector_t minix_bmap(struct address_space *mapping, sector_t block) 2008-10-24 16:54 or online defrag, or shrink, or remap 2008-10-24 16:55 so I just need to properly construct the mapping and then ask for the bloks one by one, right? 2008-10-24 16:55 yes 2008-10-24 16:55 bmap doesn't always work, for the reasons above 2008-10-24 16:55 an inherently racy interface 2008-10-24 16:56 so I should make the readpage work instead? 
:D 2008-10-24 16:56 readpage isn't exposed to userspace 2008-10-24 16:56 that's the point of bmap 2008-10-24 16:56 whay of userspace, principally lilo, to find physical block locations 2008-10-24 16:57 using bmap inside kernel is unconscionable, but probably there are cases 2008-10-24 16:58 I'm going back to readpage :P 2008-10-24 16:58 right 2008-10-24 16:59 note that the readpage interface is also racy, relying on the cached physical block address as it does 2008-10-24 16:59 there is nothing to prevent the physical block address from moving before being read 2008-10-24 17:00 filesystem has to provide that locking 2008-10-24 17:02 so I guess I could easily end up in a bad state by calling stuff in an unexpected order... 2008-10-24 17:03 if your filesystem supports online defrag, shrink, remapping or delayed allocation 2008-10-24 17:51 sk8 oclock 2008-10-24 17:51 going to be a sunset skate 2008-10-24 17:54 enjoy :D 2008-10-24 18:16 ACTION is trying to find who should properly set the i_blkbits in an inode 2008-10-24 18:17 and the answer is: alloc_inode 2008-10-24 19:02 wow... alloc_page_buffers doesn't make a circular list of bh 2008-10-24 19:03 it just make a NULL-terminat list 2008-10-24 19:03 and create_empty_buffers finished the job 2008-10-24 19:18 (again thinking out loud) actually... I don't think I need to create the buffer links to cover the page 2008-10-24 19:18 I can use one 2008-10-24 19:19 get the real address using the get_block and then call the bread to get the data 2008-10-24 19:20 hmm... my __bread also allocated a bh 2008-10-24 19:24 http://lxr.linux.no/linux+v2.6.26/fs/minix/itree_common.c#L145 more complicated things are taking places... 2008-10-24 19:24 because of the indirect blocks 2008-10-24 19:25 back 2008-10-24 19:26 welcome! 
:D 2008-10-24 19:27 I'm just making some noise here :P 2008-10-24 19:27 razvanm, don't forget that the page buffers are used not just for once off IO but to cache the physical block address 2008-10-24 19:28 so it's probably best to fully populate the page 2008-10-24 19:28 some filesystems may make assumptions 2008-10-24 19:28 I will populate the page 2008-10-24 19:28 but I don't want to keep around the bhs 2008-10-24 19:28 do the filesystems expect them to be around a lot? 2008-10-24 19:29 what I'm implementing right now is block_read_full_page 2008-10-24 19:29 http://lxr.linux.no/linux+v2.6.26.6/fs/block_dev.c#L71 <- blkbits for the inode for a block device is set here 2008-10-24 19:29 probably bogus 2008-10-24 19:29 I solved that thing :P 2008-10-24 19:29 it was a bug in my alloc_inode :D 2008-10-24 19:31 I wonder what costs more, having an extra indirection in the inode to find the block shift, or bloating up all inodes by an extra word? 2008-10-24 19:31 let me try to formulate a coherent question: is ok if I fill implement block_read_full_page without maintaining the bh for it? 2008-10-24 19:31 I suspect the latter is more costly 2008-10-24 19:31 yes, certainly 2008-10-24 19:32 some filesystems do that 2008-10-24 19:32 yeah! romfs :P 2008-10-24 19:32 though they skip block_read_full_page entirely 2008-10-24 19:32 and just do what has to be done directly 2008-10-24 19:32 which I think is going to be the strategy for tux3 2008-10-24 19:32 great! :D 2008-10-24 19:33 slightly larger code in return for avoiding a huge pile of donkey dung 2008-10-24 19:33 minix_readpage is calling only one function: the block_read_full_page 2008-10-24 19:33 -!- ChanServ changed mode/#tux3 -> +o flips 2008-10-24 19:34 changing the title? 2008-10-24 19:34 http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 pm Pacific Time ~ Next session: is ->t_block really necessary? 
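(editor's sketch) The earlier question about where a page read gets its block size and start point comes down to arithmetic on the page index and the block shift. Below is a rough user-space illustration of filling a page directly, without keeping per-block buffer objects around. It assumes 4096-byte pages, a flat in-memory "device", and a mapping callback where 0 means a hole; all toy_* names are made up for the example and none of this is the kernel's actual code.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SHIFT 12            /* assumption: 4096-byte pages */

/* First logical block covered by the page at the given page index. */
unsigned long toy_first_block(unsigned long index, unsigned blkbits)
{
    assert(blkbits <= TOY_PAGE_SHIFT);
    return index << (TOY_PAGE_SHIFT - blkbits);
}

/* Number of blocks that fit in one page. */
unsigned toy_blocks_per_page(unsigned blkbits)
{
    assert(blkbits <= TOY_PAGE_SHIFT);
    return 1U << (TOY_PAGE_SHIFT - blkbits);
}

/* Hypothetical mapping callback: logical block -> physical block, 0 = hole. */
typedef unsigned long (*toy_map_fn)(unsigned long iblock);

/* Example mapping: logical b lives at physical b + 1, logical 2 is a hole. */
unsigned long toy_demo_map(unsigned long iblock)
{
    return iblock == 2 ? 0 : iblock + 1;
}

/* Fill one page block by block from the flat device; holes are zero-filled. */
void toy_read_page(unsigned char *page, unsigned long index, unsigned blkbits,
                   toy_map_fn map, const unsigned char *dev)
{
    unsigned long blocksize = 1UL << blkbits;
    unsigned long iblock = toy_first_block(index, blkbits);
    unsigned i, n = toy_blocks_per_page(blkbits);

    for (i = 0; i < n; i++, iblock++) {
        unsigned long phys = map(iblock);
        unsigned char *dst = page + i * blocksize;

        if (phys == 0)
            memset(dst, 0, blocksize);
        else
            memcpy(dst, dev + phys * blocksize, blocksize);
    }
}
```

The trade-off mentioned in the log shows up here: the physical address is looked up each time and never cached, so there is nothing to go stale if blocks are later remapped, at the cost of repeating the lookup on every read.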
2008-10-24 19:34 -!- flips changed topic to "http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 pm Pacific Time ~ Next session: is ->t_block really necessary?" 2008-10-24 19:34 attempting to 2008-10-24 19:34 it worked :D 2008-10-24 19:34 -!- flips changed topic to "http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 pm Pacific Time ~ Next session: is ->get_block really necessary?" 2008-10-24 19:35 -!- ChanServ changed mode/#tux3 -> -o flips 2008-10-24 19:35 I was about to ask who is t_block :P 2008-10-24 19:35 if the FS is implementing the readpage directly then there should be no need for get_block, right? 2008-10-24 19:36 at least on the read part 2008-10-24 19:36 (the only one I explored so far) 2008-10-24 19:40 ->get_block is called in lots of places besides readpage 2008-10-24 19:42 those places should be some sort of generic code that tries to simplify the FS, right? 2008-10-24 19:43 by systematically avoiding the generic code the need for get_block should go away 2008-10-24 19:43 not using the generic stuff would also help when porting to another OS :P 2008-10-24 20:05 great! my first call to get_block from minix really worked :D 2008-10-24 20:05 time to go home 2008-10-24 20:21 hey flips 2008-10-24 20:21 hi 2008-10-24 20:21 one more week until the big cabal 2008-10-24 20:21 how's it going with atomic commits ? 2008-10-24 20:21 thursday next 2008-10-24 20:22 on the 30th ? 2008-10-24 20:22 following the thread with matt dillon? 2008-10-24 20:22 yes 2008-10-24 20:22 no, ust he initial post 2008-10-24 20:23 flips: can't find an email link on your new home page 2008-10-24 20:23 you might like to fix that or point out that I'm wrong 2008-10-24 20:24 if it's hard to find it needs to be fixed 2008-10-24 20:24 there is is 2008-10-24 20:24 it is 2008-10-24 20:24 should also link it from the front page, agreed 2008-10-24 20:25 yeah, that would be best, having difficulty finding it 2008-10-24 20:28 do you have a link to the discussion btw ? 
2008-10-24 20:29 http://tux3.org/pipermail/tux3/ 2008-10-24 20:31 flips: matt is a great technical ally btw 2008-10-24 20:31 bh, check it out now 2008-10-24 20:31 that's putting it mildly 2008-10-24 20:31 imo, he's the best overall kernel engineer in all of the BSDs minus the original BSD/OS folks 2008-10-24 20:32 matt by the way is responsible for inspiring the linux 2.6 vm design 2008-10-24 20:32 flips: still don't see the link yet 2008-10-24 20:32 flips: yeah, I was behind the scenes in those dats 2008-10-24 20:32 days 2008-10-24 20:32 I saw the entire thing traspire 2008-10-24 20:32 transpire 2008-10-24 20:32 I'll be back later 2008-10-24 20:32 ok 2008-10-24 20:33 try shift-reload 2008-10-24 20:35 still nothing different 2008-10-24 20:42 what browser? 2008-10-24 20:42 something about how it handles frames? 2008-10-24 21:43 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-24 21:56 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-24 22:16 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-25 00:54 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-25 02:09 -!- Kirantpatil(~kiran@122.167.205.69) has joined #tux3 2008-10-25 02:09 -!- Kirantpatil(~kiran@122.167.205.69) has left #tux3 2008-10-25 02:24 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2008-10-25 02:34 -!- pgquiles(~pgquiles@156.Red-88-25-133.staticIP.rima-tde.net) has joined #tux3 2008-10-25 08:56 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-25 10:20 -!- nataliep(~nataliep@cpe-76-170-3-242.socal.res.rr.com) has joined #tux3 2008-10-25 10:56 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-25 14:50 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-25 16:39 -!- pgquiles(~pgquiles@29.Red-81-33-102.dynamicIP.rima-tde.net) has joined #tux3 2008-10-25 17:41 why does the minix_readdir always calls the 
filler with DT_UNKOWN? http://lxr.linux.no/linux+v2.6.26/fs/minix/dir.c#L135 2008-10-25 18:27 flks 2008-10-25 18:31 flks? 2008-10-25 18:38 aaaa I should be able to find if it's a directory using the inode->i_mode... 2008-10-25 19:00 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-25 19:20 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-25 20:28 ACTION is happy. He managed to read a whole minixfs. :P 2008-10-25 23:17 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-26 00:26 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-26 00:38 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-26 05:16 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-26 05:23 -!- MaZe1(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-26 08:59 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-26 12:00 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2008-10-26 13:03 -!- pgquiles(~pgquiles@19.Red-83-44-236.dynamicIP.rima-tde.net) has joined #tux3 2008-10-26 13:10 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-26 13:20 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-26 15:13 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2008-10-26 16:36 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-26 18:45 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-26 19:10 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-26 21:19 -!- Kirantpatil(~kiran@122.167.176.141) has joined #tux3 2008-10-26 21:19 -!- Kirantpatil(~kiran@122.167.176.141) has left #tux3 2008-10-26 22:27 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-26 22:30 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 
2008-10-27 02:58 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-10-27 03:16 -!- pgquiles__(~pgquiles@19.Red-83-44-236.dynamicIP.rima-tde.net) has joined #tux3 2008-10-27 04:27 -!- pgquiles(~pgquiles@19.Red-83-44-236.dynamicIP.rima-tde.net) has joined #tux3 2008-10-27 05:04 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-27 05:47 -!- flips(~phillips@phunq.net) has joined #tux3 2008-10-27 06:47 -!- FelipeS(~Felipe@lawn-128-61-122-225.lawn.gatech.edu) has joined #tux3 2008-10-27 07:06 -!- FelipeS(~Felipe@lawn-128-61-122-225.lawn.gatech.edu) has joined #tux3 2008-10-27 07:25 -!- FelipeS_(~Felipe@lawn-128-61-122-225.lawn.gatech.edu) has joined #tux3 2008-10-27 07:33 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-10-27 08:03 -!- mingming(~mingming@c-24-22-117-202.hsd1.or.comcast.net) has joined #tux3 2008-10-27 08:04 good morning mingming 2008-10-27 08:04 flips, good morning:) 2008-10-27 08:05 did you see the benchmarks posted to btrfs mailing list? 
2008-10-27 08:05 shows ext4 doing rather well 2008-10-27 08:07 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-27 08:08 for example: http://btrfs.boxacle.net/repository/single-disk/Initial-compare/Initial-Compare-Single_disk_Mail_server_simulation._num_threads=1.html 2008-10-27 08:08 morning tim_dimm 2008-10-27 08:08 morning flips 2008-10-27 08:09 yo 2008-10-27 09:44 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-10-27 10:15 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-27 10:45 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-27 11:02 -!- FelipeS(~Felipe@lawn-128-61-30-224.lawn.gatech.edu) has joined #tux3 2008-10-27 13:18 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-27 13:57 hey 2008-10-27 13:57 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-27 14:22 -!- FelipeS(~Felipe@lawn-128-61-126-196.lawn.gatech.edu) has joined #tux3 2008-10-27 16:35 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-27 17:38 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-27 19:29 quiet around here recently 2008-10-27 19:34 what happens when I'm working the swing shift 2008-10-27 19:45 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-10-27 19:59 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-27 20:00 shapor, ping 2008-10-27 20:03 here's a good place to start reading today: http://lxr.linux.no/linux+v2.6.27/+code=vfs_rename 2008-10-27 22:03 hey flips 2008-10-27 22:08 hi bh 2008-10-27 22:16 -!- RazvanM(~RazvanM@pool-151-196-13-39.balt.east.verizon.net) has joined #tux3 2008-10-27 22:26 how's it going ? 
2008-10-27 23:56 -!- ajonat(~ajonat@190.48.124.28) has joined #tux3 2008-10-28 00:12 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-28 03:22 -!- RazvanM(~RazvanM@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-10-28 04:27 -!- pgquiles(~pgquiles@19.Red-83-44-236.dynamicIP.rima-tde.net) has joined #tux3 2008-10-28 09:10 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-10-28 09:28 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-28 09:47 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-28 11:03 -!- flips(~phillips@phunq.net) has joined #tux3 2008-10-28 11:15 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-28 12:02 -!- mingming_(~mingming@32.97.110.55) has joined #tux3 2008-10-28 12:48 -!- pgquiles(~pgquiles@19.Red-83-44-236.dynamicIP.rima-tde.net) has joined #tux3 2008-10-28 14:37 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-10-28 14:39 -!- pgquiles_(~pgquiles@19.Red-83-44-236.dynamicIP.rima-tde.net) has joined #tux3 2008-10-28 15:18 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-28 15:39 -!- FelipeS(~Felipe@lawn-128-61-20-25.lawn.gatech.edu) has joined #tux3 2008-10-28 19:22 -!- RazvanM(~razvan@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-10-28 19:23 ACTION is macbookless this time :| 2008-10-28 19:39 hi 2008-10-28 19:43 -!- RalucaM(~ral@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-10-28 19:43 hi 2008-10-28 19:43 hi RalucaM 2008-10-28 19:49 folks 2008-10-28 19:50 hi 2008-10-28 19:56 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-28 20:08 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-28 20:09 miss anything? 2008-10-28 20:09 (my laptop just crashed on me twice in a row, I'm guessing because of a bad power supply... 2008-10-28 20:09 nope 2008-10-28 20:09 is today's class canceled? 2008-10-28 20:10 I would think not.... 
but maybe? 2008-10-28 20:10 flips: ping 2008-10-28 20:10 :) 2008-10-28 20:11 although to be fair with the latest round of kernel+nvidia+madwifi upgrades the machine seems much less stable then it was in the past... so maybe it wasn't the power supply just bad luck. Still twice in 10 minutes really takes the cake (the running tally is around thrice in the previous two weeks) 2008-10-28 20:12 and before that it was roughly once every 2-3 weeks 2008-10-28 20:12 oh 2008-10-28 20:12 what linux are you using? 2008-10-28 20:13 debian was always quite stable for me :P 2008-10-28 20:13 fc9 with kernel from koji, newest nvidia, madwifi from svn head 2008-10-28 20:14 I see 2008-10-28 20:14 it used to only lock up occasionally (rarely) during suspend to ram 2008-10-28 20:14 but now it's locked up a few times while I was typing on it... which is new 2008-10-28 20:14 maybe I'll leave it in memtest86 overnight 2008-10-28 20:14 hm... maybe it's a hw issue... 2008-10-28 20:15 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-28 20:15 well, the wireless driver is binary hal + opensource code, and it _is_ buggy 2008-10-28 20:15 2.6.27 supposedly fixes that 2008-10-28 20:15 with the ath9k driver 2008-10-28 20:15 but I'm still on 2.6.26.7 2008-10-28 20:16 I just wish I got kernel crashdumps or something out of this 2008-10-28 20:17 I should probably figure out how to use the kdump kernel stuff 2008-10-28 20:17 :) 2008-10-28 20:18 IIRC, /etc/sysconfig/kdump 2008-10-28 20:18 file not present 2008-10-28 20:18 probably have to install something first 2008-10-28 20:18 found an fc6 wiki 2008-10-28 20:35 blame timothy 2008-10-28 20:36 hi 2008-10-28 20:36 ok, shall we start a little late? 2008-10-28 20:37 folks? 
2008-10-28 20:37 ah it is ;-) 2008-10-28 20:37 ponk 2008-10-28 20:37 ok 2008-10-28 20:37 works for me 2008-10-28 20:37 later works for me too :D 2008-10-28 20:38 and here I was about ready to get synergy between laptop and home projector working 2008-10-28 20:38 so what would we rather look at: 1) the get_block question above 2) mysteries of rename? 2008-10-28 20:38 rename 2008-10-28 20:39 rename it is 2008-10-28 20:40 hey flips 2008-10-28 20:40 let's go find where it happens 2008-10-28 20:40 fs/namei.c 2008-10-28 20:40 maybe tux3 rock the world ;) 2008-10-28 20:41 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L2582 2008-10-28 20:41 beat me ;) 2008-10-28 20:41 by 2 seconds 2008-10-28 20:41 or sys_rename 2008-10-28 20:41 most of this is permission checking and locking 2008-10-28 20:42 may_create, may_delete, just checking perms 2008-10-28 20:43 http://lxr.linux.no/linux+v2.6.26.6/fs/namei.c#L2671 vfs_rename_other - everything but directories 2008-10-28 20:43 i_mutex is the synchronizer 2008-10-28 20:44 what's last_type ? 2008-10-28 20:44 type of the last segment in the path, i.e., the file 2008-10-28 20:45 oops, I missed a huge item in vfs_rename 2008-10-28 20:45 lock_rename 2008-10-28 20:45 I'm still parsing sys_renameat ;-) 2008-10-28 20:46 oh you started at the syscall 2008-10-28 20:46 not much happening there 2008-10-28 20:46 I'm wondering what that mutex_lock_nested is 2008-10-28 20:46 new one for me, let's see how far back it goes 2008-10-28 20:47 lockdep stuff 2008-10-28 20:47 it's just mutex_lock without lockdep warning 2008-10-28 20:47 ah, so it's just a mutex 2008-10-28 20:47 ok 2008-10-28 20:47 the dentry gets created in it 2008-10-28 20:48 don't think so 2008-10-28 20:49 I should have started at do_rename actually 2008-10-28 20:49 we are in do_rename now? 
2008-10-28 20:49 the first big event is the path_lookup 2008-10-28 20:49 ok 2008-10-28 20:49 yes, we went back 2008-10-28 20:49 to see where the dentries come from 2008-10-28 20:49 ugh, lock_rename 2008-10-28 20:49 I jumped to the run stuff first ;) 2008-10-28 20:50 ok, in linux whenever you have an open file you have a dentry for it 2008-10-28 20:50 right, but on rename the dest might not exist 2008-10-28 20:50 'might' ;-) 2008-10-28 20:50 so the vfs can return the dentry as a handle for a directory/file 2008-10-28 20:50 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L2625 2008-10-28 20:51 right, a dentry can exist without the underlying object existing 2008-10-28 20:51 #2679 2008-10-28 20:51 shall we go into path_lookup? 2008-10-28 20:51 buckle up your complexity belt 2008-10-28 20:51 http://lxr.linux.no/linux+v2.6.26.6/fs/namei.c#L1133 2008-10-28 20:52 you mean lookup_hash ? or something else 2008-10-28 20:52 oh, 2.6.26 2008-10-28 20:52 ah, sorry 2008-10-28 20:52 I'll move to 2.6.27 2008-10-28 20:52 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L1045 2008-10-28 20:53 like all the path functions, it eventually calls path_walk 2008-10-28 20:53 this code 2008-10-28 20:53 is totally different in 2.6.27 2008-10-28 20:53 refactored at least 2008-10-28 20:53 really? 2008-10-28 20:54 sys_renameat is huge 2008-10-28 20:54 there's no do_rename 2008-10-28 20:55 true 2008-10-28 20:55 it was killed by Al's cleanup 2008-10-28 20:55 yeah, it needed it 2008-10-28 20:56 ok, let's pick a version 2008-10-28 20:56 do_rename was close to renameat 2008-10-28 20:56 cause that's probably why this didn't make a lot of sense to me ;-) 2008-10-28 20:56 so it's been made renameat 2008-10-28 20:56 well we started a level below 2008-10-28 20:56 ok where were we 2008-10-28 20:57 ah, and there are new names for the walk functions 2008-10-28 20:57 lookup_hash 2008-10-28 20:57 do_path_lookup 2008-10-28 20:57 sys_renameat -> user_path_lookup -> do_path_lookup? 
2008-10-28 20:57 where do you see lookup_hash? 2008-10-28 20:57 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L2679 2008-10-28 20:58 hirofumi, yes 2008-10-28 20:58 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-28 20:58 ok, more refactoring 2008-10-28 20:59 this was previously done within the walk I suppose 2008-10-28 21:00 nameidata 2008-10-28 21:00 the struct that acts as scratchpad for this family of functions 2008-10-28 21:01 in some cases is abused as a substitute for function parameters 2008-10-28 21:01 anyway, the new factoring returns a nameidata instead of dentry 2008-10-28 21:01 and the lookup_hash converts the name in the nameidata to a dentry 2008-10-28 21:02 let's check out nameidata 2008-10-28 21:02 :-) 2008-10-28 21:03 this must be partially for inotify 2008-10-28 21:03 it was around long before inotify was 2008-10-28 21:03 http://lxr.linux.no/linux+v2.6.27/include/linux/namei.h#L18 2008-10-28 21:04 saved_names... never looked at that 2008-10-28 21:04 L27 though 2008-10-28 21:04 probably proc support 2008-10-28 21:04 probably is how symlink traversal was made nonrecursive 2008-10-28 21:05 so you can get meaningful dumps of fd's 2008-10-28 21:05 you get that by following parent links in dentries 2008-10-28 21:06 the "path" in the nameidata... a little oddly named 2008-10-28 21:06 http://lxr.linux.no/linux+v2.6.27/include/linux/path.h#L7 2008-10-28 21:06 it's a dentry/mount pair 2008-10-28 21:07 not immediately obviously what the mount part is for 2008-10-28 21:07 anyway, I wanted to do rename this time 2008-10-28 21:07 not path walk, which I need to review first 2008-10-28 21:08 it keeps changing and it was complex to begin with 2008-10-28 21:08 (side note) an auto-parser which would figure out and auto-annotate code, with a comment, containing where it gets called from and what it calls, what locks are held before and after and during would be useful.... 2008-10-28 21:08 have it ready by friday? 
:D 2008-10-28 21:09 ;-) 2008-10-28 21:09 anyway, suffice to say for now that path_walk scans off each segment of the / separated path and looks for a dentry 2008-10-28 21:09 if it doesn't find one, it calls the filesystem 2008-10-28 21:10 inode->i_op->lookup 2008-10-28 21:10 if the filesystem says it's a symlink, that yields a new path 2008-10-28 21:10 the details are wickedly complex 2008-10-28 21:10 for various reasons 2008-10-28 21:10 including nfs 2008-10-28 21:10 enough on path_walk for now 2008-10-28 21:10 yes 2008-10-28 21:11 let's get back to the rename locking 2008-10-28 21:12 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L2657 2008-10-28 21:13 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L1462 <- lock_rename 2008-10-28 21:13 pretty simple 2008-10-28 21:13 needs to lock in the order parent, child 2008-10-28 21:14 to avoid deadlock 2008-10-28 21:14 per fs sb lock 2008-10-28 21:14 yes that too 2008-10-28 21:15 so we have the per-sb lock, and we will take a lock on each directly 2008-10-28 21:15 directory 2008-10-28 21:15 first, walk the chain of dentries up from one directory to the root, looking for the second directory 2008-10-28 21:15 if we find it, we know the second is ancestor of the first, so take the second lock first 2008-10-28 21:16 otherwise, do the same in the other direction 2008-10-28 21:16 if neither is ancestor of the other, the order doesn't matter 2008-10-28 21:16 why do we still have parent/child locks if there's no parent/child relation?
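(editor's sketch) The lock ordering rule walked through above can be written down as a small user-space routine: walk up from one directory looking for the other, and whichever turns out to be the ancestor gets locked first. The toy_dir structure and function names are hypothetical, no real locks are taken, and treating the same-directory case as a single lock is an assumption of this sketch rather than a quote of lock_rename.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a directory dentry; the root has parent == NULL. */
struct toy_dir {
    struct toy_dir *parent;
    const char *name;
};

/* Walk from d up to the root; return 1 if anc is d itself or an ancestor. */
int toy_is_ancestor(const struct toy_dir *anc, const struct toy_dir *d)
{
    for (; d != NULL; d = d->parent)
        if (d == anc)
            return 1;
    return 0;
}

/*
 * Decide the locking order for a rename between directories a and b so
 * that an ancestor is always locked before its descendant. If the two are
 * unrelated the order is arbitrary and (a, b) is used. If they are the
 * same directory, only one lock is needed and *second is set to NULL.
 */
void toy_rename_lock_order(struct toy_dir *a, struct toy_dir *b,
                           struct toy_dir **first, struct toy_dir **second)
{
    assert(a != NULL && b != NULL);
    if (a == b) {
        *first = a;
        *second = NULL;
    } else if (toy_is_ancestor(b, a)) {   /* b is above a: lock b first */
        *first = b;
        *second = a;
    } else {                              /* a is above b, or unrelated */
        *first = a;
        *second = b;
    }
}
```

Because every caller derives the order from the tree rather than from argument position, two concurrent renames between the same pair of related directories ask for the locks in the same sequence, which is what rules out the deadlock.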
2008-10-28 21:16 though the _nested syntax doesn't make that obvious 2008-10-28 21:17 the answer to that will be found by looking at mutex_lock_nested 2008-10-28 21:18 the int subclass is only advisory 2008-10-28 21:18 oh the parameter is merely for run-time lock checking 2008-10-28 21:18 for automatically checking lock dependencies 2008-10-28 21:18 I suppose there should be a "no dependency" value, don't know why there isn't 2008-10-28 21:19 I think the PARENT CHILD are just arbitrary strings 2008-10-28 21:19 which usually happen to actually match what the locks labeled with them are used for 2008-10-28 21:19 NORMAL, PARENT, CHILD, XATTR, QUOTA 2008-10-28 21:20 # define mutex_acquire(l, s, t, i) lock_acquire(l, s, t, 0, 2, NULL, i) <- ooh, ugly 2008-10-28 21:21 I'm having a hard time believing that all that debugging stuff doesn't create runtime overhead when not used 2008-10-28 21:21 does mutex_lock sleep till it gets the lock? 2008-10-28 21:21 yes 2008-10-28 21:21 I'd guess it can get compiled out 2008-10-28 21:21 via the preprocessor or compiler opt 2008-10-28 21:22 I'll take that on faith 2008-10-28 21:22 I don't see the compiler removing all the extra parameters 2008-10-28 21:22 like the subclass 2008-10-28 21:22 anyway, I think I understand (un)lock_rename 2008-10-28 21:22 right 2008-10-28 21:22 we can look at lockdep another time 2008-10-28 21:23 useful facility that I have never used 2008-10-28 21:23 ACTION doesn't make locking errors 2008-10-28 21:23 usually 2008-10-28 21:24 ok, a couple more reality checks then into vfs_rename 2008-10-28 21:24 locking is just a matter of good design ;-) 2008-10-28 21:24 which is much easier when you're writing your own code from scratch 2008-10-28 21:25 I left off the smilely above 2008-10-28 21:25 everybody makes locking errors 2008-10-28 21:25 (since half the problem is knowing what the design is) 2008-10-28 21:26 vfs_rename is all just permission checks 2008-10-28 21:26 as we saw before 2008-10-28 21:26 the 
rename_other vs rename_directory 2008-10-28 21:26 why don't we abort here on trap != NULL ? 2008-10-28 21:26 back at 2657. 2008-10-28 21:26 oh nevermind 2008-10-28 21:27 we're still dealing with the dirs the stuff we're renaming are in, not the stuff we're renaming itself 2008-10-28 21:27 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L2567 <- we need to get yet another lock here 2008-10-28 21:27 the lock on a filesystem object we are about to implcitly unlink 2008-10-28 21:28 shouldn't this be a mutex_lock_nested call? 2008-10-28 21:28 doesn't need to be order with respect to source or dest dirs 2008-10-28 21:28 ordered 2008-10-28 21:29 on with respect to the sb rename mutex 2008-10-28 21:29 only 2008-10-28 21:30 anyway, we call the fs ->rename method and that's about all there is to that 2008-10-28 21:30 the fs can happiily work away knowing that all the needed locks are already taken 2008-10-28 21:31 oh can imagine bottlenecks here with mass renames 2008-10-28 21:31 but since one doesn't really see mass renames, it's not a big issue 2008-10-28 21:31 now rename_dir 2008-10-28 21:32 oh, d_move? 2008-10-28 21:32 and renames within the same directory don't use the per-fs-sb lock 2008-10-28 21:32 ah right 2008-10-28 21:32 the dentry cache has to be updated to reflect what the filesystem did to the backing store 2008-10-28 21:33 good observation 2008-10-28 21:34 _other and _dir are almost identical 2008-10-28 21:34 probably should be one function 2008-10-28 21:36 the new_dentry is preemptively unhashed for some reason 2008-10-28 21:37 something of a mystery why 2008-10-28 21:37 this code really suffers from being nearly devoid of comments 2008-10-28 21:38 rehashed later on 2008-10-28 21:38 without explanation 2008-10-28 21:39 homework 2008-10-28 21:39 maybe, comment of dentry_unhash 2008-10-28 21:39 homework: "wtf is the unhash/rehash in vfs_rename_dir all about?" 2008-10-28 21:39 I think this is how the target gets deleted? 
2008-10-28 21:40 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L2110 2008-10-28 21:40 see rename_other for a target also being unlinked 2008-10-28 21:40 after all a rename will move the source, but kill the destination 2008-10-28 21:40 hence S_DEAD 2008-10-28 21:42 #define S_DEAD 16 /* removed, but still open directory */ 2008-10-28 21:42 http://lxr.linux.no/linux+v2.6.27/include/linux/fs.h#L151 2008-10-28 21:43 why that's special to directories is not clear 2008-10-28 21:43 regular files can also be removed but still open 2008-10-28 21:43 if S_DEAD, we can't lookup anymore 2008-10-28 21:44 why is it even in the hash then? 2008-10-28 21:44 both good points 2008-10-28 21:45 may be just for d_move? 2008-10-28 21:45 reiserfs refuses to read xattrs for a dead dir 2008-10-28 21:46 ENOENT automatically on readdir 2008-10-28 21:46 S_DEAD isn't used by very many filesystems 2008-10-28 21:46 (that last was vfs) 2008-10-28 21:46 see IS_DEADDIR 2008-10-28 21:47 may_create false in dead dir etc 2008-10-28 21:47 yes 2008-10-28 21:48 can't create child dentry for dead dir (in lookup_hash) 2008-10-28 21:48 now, why is it still in the hash? 2008-10-28 21:49 laziness? 2008-10-28 21:50 somebody holds a count on it somehow? 2008-10-28 21:50 seems like a stretch 2008-10-28 21:50 well 2008-10-28 21:51 another day, another crufty bit of linux kernel 2008-10-28 21:52 rename is the ickiest of the vfs namespace functions 2008-10-28 21:52 the others will seem clear by comparison 2008-10-28 21:53 hmm 2008-10-28 21:53 this didn't seem that bad 2008-10-28 21:53 MaZe: could be worse, right? :D 2008-10-28 21:53 yup 2008-10-28 21:56 ACTION says thanks for the lesson! 2008-10-28 21:56 ACTION also goes to bed because he woke up very early today. 2008-10-28 21:56 yes, thanks! 2008-10-28 21:56 ok plug in power 2008-10-28 21:57 dentry_unhash in rename seems to be just for strange fs 2008-10-28 21:58 if it cannot handle the case of removing a directory that is still in use by something else..
2008-10-28 21:59 oh 2008-10-28 22:03 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-28 22:03 lol 2008-10-28 22:12 pushed the wrong button? 2008-10-28 22:13 nope 2008-10-28 22:14 ran out of battery power 2008-10-28 22:14 had to go find a socket and plug yourself in then 2008-10-28 22:14 nope 2008-10-28 22:14 instead of reaching for power cord 2008-10-28 22:14 I started typing 2008-10-28 22:14 and the system critical shut down 2008-10-28 22:15 since apparently the batter went from 30% to 2% in a couple seconds 2008-10-28 22:15 (clean shutdown though) 2008-10-28 23:14 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-28 23:47 are you already starting to implement atomic commit? 2008-10-29 00:01 -!- pgquiles__(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-10-29 00:12 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-29 00:19 -!- pgquiles_(~pgquiles@19.Red-83-44-236.dynamicIP.rima-tde.net) has joined #tux3 2008-10-29 01:19 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-29 04:44 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-29 06:15 -!- FelipeS(~Felipe@lawn-128-61-120-139.lawn.gatech.edu) has joined #tux3 2008-10-29 06:15 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-29 06:21 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-29 07:26 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-29 07:50 -!- pgquiles(~pgquiles@19.Red-83-44-236.dynamicIP.rima-tde.net) has joined #tux3 2008-10-29 08:39 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-29 08:47 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-29 08:56 -!- RzM|Away(~razvan@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-10-29 10:27 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-10-29 11:49 -!- MaZe(~MaZe@216-239-45-4.google.com) has 
joined #tux3 2008-10-29 11:57 hirofumi, yes 2008-10-29 11:58 oh, great 2008-10-29 11:58 I was thinking about it last week 2008-10-29 11:58 next commit will add date handling, after that it's all atomic commit work 2008-10-29 11:59 I'm writing a post to clarify a few details at the moment 2008-10-29 11:59 it's a fun thing to think about, first new approach to the problem in 15 years 2008-10-29 12:00 the method needs a name 2008-10-29 12:00 isn't it atomic commit? 2008-10-29 12:00 a new kind of atomic commit 2008-10-29 12:00 the first kind used in filesystems was journalling 2008-10-29 12:01 then came logging and tree-based copy on write 2008-10-29 12:01 recursive copy on write 2008-10-29 12:01 ah, yes. 2008-10-29 12:01 this is non-recursive copy on write 2008-10-29 12:01 but that would be a lame name 2008-10-29 12:02 something to think about over the next couple weeks 2008-10-29 12:02 yes, atomic commit. it seems too generic 2008-10-29 12:03 btw, in rollup, do we need to write out modified btree-index? 2008-10-29 12:05 rollup writes out previously modified btree nodes 2008-10-29 12:05 i see 2008-10-29 12:05 and may at the same time modify more btree nodes, which will be written in a future rollup 2008-10-29 12:06 it will be recursive way to root? 2008-10-29 12:07 it eventually goes to the root, yes, but does not create a new root 2008-10-29 12:07 i see. how do we handle root? 2008-10-29 12:07 so there are two essential differences from recursive tree copy on write: 1) the updates are spread out in time, they don't happen on each leaf write 2) does not generate new trees 2008-10-29 12:08 or rather, does not generate multiple trees 2008-10-29 12:08 we have a few fixed locations for root 2008-10-29 12:08 and a sequence number 2008-10-29 12:08 I need to write that in a design note 2008-10-29 12:09 root is modified very rarely 2008-10-29 12:09 oh, i see. tux3 can merge btree-index modification in multiple phase? 
2008-10-29 12:09 generally only when the inode table btree index needs an additional level 2008-10-29 12:10 it can 2008-10-29 12:10 um.. 2008-10-29 12:11 the question of whether the current tree state is represented via promises in commit blocks or actual written out index blocks is orthagonal to the phase mechanism 2008-10-29 12:12 orthagonal? 2008-10-29 12:12 ACTION my english skill is too poor 2008-10-29 12:13 "does not affect" 2008-10-29 12:13 "one can be changed without affecting the other" 2008-10-29 12:14 i see. 2008-10-29 12:14 your english skill is fine, I didn't even notice you're not a native speaker 2008-10-29 12:14 oh, it's surprise to me 2008-10-29 12:15 thanks. 2008-10-29 12:16 i'm still thinking about rollup stage... 2008-10-29 12:16 in "Cache state reconstruction" 2008-10-29 12:16 section 2008-10-29 12:17 it says parent blocks of rolled up, will via promises recorded 2008-10-29 12:19 it means parent block will be copy-on-write, then it will be written to new location as new block? 2008-10-29 12:22 yes 2008-10-29 12:22 i see 2008-10-29 12:23 in fact, there is not a copy on wirte 2008-10-29 12:23 because the buffer is in cache 2008-10-29 12:23 ah 2008-10-29 12:23 the block buffer is simply assigned to a new location 2008-10-29 12:23 and the new location becomes a promise 2008-10-29 12:24 actual copy on write of buffers does happen, but it is for a different purpose: to prevent stalls in writing by userspace programs 2008-10-29 12:25 um..., but new one is block on stable image + previous promise 2008-10-29 12:25 yes 2008-10-29 12:26 if I think stable image is original, and new one can be called copy-on-write? 
2008-10-29 12:26 except no copy is done 2008-10-29 12:26 so without a copy, it isn't copy on write 2008-10-29 12:27 a better term is redirect on write 2008-10-29 12:27 now that I think of it, copy on write is incorrect terminology for the algorithm used by btrfs 2008-10-29 12:27 well, let me think about that 2008-10-29 12:28 i see 2008-10-29 12:28 depends how they actually implement it 2008-10-29 12:29 physical remmapping is done in buffer cache? 2008-10-29 12:30 yes 2008-10-29 12:30 during normal operation what we do is make the normal modification to the cached image of index block just as it is implemented now, and at the same time, write a promise to modify the physical block into a commit block 2008-10-29 12:30 rollup does not apply promises, because they are already applied 2008-10-29 12:30 only recover does 2008-10-29 12:30 only recovery does 2008-10-29 12:31 however, rollup optimize(?) promises? 2008-10-29 12:32 i mean it will rewrite/merge dirty index blocks 2008-10-29 12:32 rollup writes out the dirty index block, making the promises no longer necessary, so they can be discarded 2008-10-29 12:32 yes 2008-10-29 12:32 i see 2008-10-29 12:32 what I realized a couple days ago is that promises can be retired out of order 2008-10-29 12:33 and so we need a way to know which promises don't need to be applied any more, because the index block they refer to was already written out 2008-10-29 12:34 um... 2008-10-29 12:34 we don't know in advance what order the index blocks will be written out, because it depends on the pattern of filesystem activity 2008-10-29 12:34 it doesn't have dependency? 2008-10-29 12:34 dependency on what? 2008-10-29 12:35 e.g. previous phase may have parent directory of current phase? 2008-10-29 12:35 did you mean the word "directory" ? 
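The promise lifecycle discussed above (apply the change to the cached index block and log a promise, let rollup write the block and drop its promises, and replay promises only on recovery) might be modeled roughly like this. It is a toy sketch under loud assumptions: an "index block" is just an integer, a promise is "add delta to block N", and the `retired` flag is a hypothetical stand-in for whatever mechanism tux3 ends up using to know which promises no longer need applying.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NBLOCKS 4

/* A promise: "toy index block `block` will have `delta` added to it". */
struct toy_promise {
    int block;
    int delta;
    int retired;   /* hypothetical: set once the block itself reached disk */
};

struct toy_fs {
    int cache[TOY_NBLOCKS];  /* in-memory image, always current */
    int disk[TOY_NBLOCKS];   /* on-disk image, may lag behind */
};

/* Normal operation: change the cached block and record a promise. */
static void toy_modify(struct toy_fs *fs, struct toy_promise *p, int block, int delta)
{
    assert(block >= 0 && block < TOY_NBLOCKS);
    fs->cache[block] += delta;
    p->block = block;
    p->delta = delta;
    p->retired = 0;
}

/* Rollup: write the dirty block out; its promises are retired, not replayed. */
static void toy_rollup(struct toy_fs *fs, struct toy_promise *log, size_t n, int block)
{
    assert(block >= 0 && block < TOY_NBLOCKS);
    fs->disk[block] = fs->cache[block];
    for (size_t i = 0; i < n; i++)
        if (log[i].block == block)
            log[i].retired = 1;
}

/* Recovery: rebuild the cache from disk plus the promises still outstanding. */
static void toy_recover(struct toy_fs *fs, const struct toy_promise *log, size_t n)
{
    for (int b = 0; b < TOY_NBLOCKS; b++)
        fs->cache[b] = fs->disk[b];
    for (size_t i = 0; i < n; i++)
        if (!log[i].retired)
            fs->cache[log[i].block] += log[i].delta;
}
```

Rolling up block 2 before block 1, even though block 1 was modified first, is the "retired out of order" case: recovery then replays only the promises for block 1.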
2008-10-29 12:36 directory entry 2008-10-29 12:36 a changed directory entry must be written out in the same phase as the changed inode table block 2008-10-29 12:37 that is a fule that guarantees atomicity 2008-10-29 12:37 a rule 2008-10-29 12:37 we don't actually analyze those dependencies 2008-10-29 12:37 um.. 2008-10-29 12:38 but instead, just see what buffers the filesystem operation changes 2008-10-29 12:38 and add those changed buffers to the current phase 2008-10-29 12:38 previous phase has "foo", and next one has "foo/bar"? 2008-10-29 12:39 there can be a commit between creating foo and foo/bar, that is ok 2008-10-29 12:39 yes 2008-10-29 12:40 but reverse order of phase, bar is orphaned entry? 2008-10-29 12:40 but that can't happen because the phases cannot be completed out of order 2008-10-29 12:41 ah, maybe i missread "retired out of order" 2008-10-29 12:42 right, it's just the promises that can be retired out of order 2008-10-29 12:42 i see 2008-10-29 12:42 the order in which promises can be retired (that is, ignored on recovery) depends on the order in which we add dirty index blocks to the active phase 2008-10-29 12:43 that is a key point I need to mention: we do not normally add dirty index blocks to the active phase 2008-10-29 12:45 umm.. hard to understand yet for me unfortunately 2008-10-29 12:45 but I think you are the closest to understanding 2008-10-29 12:46 thanks, I hope 2008-10-29 12:46 so 2008-10-29 12:46 the reason we don't add dirty index blocks to the active phase is, that would defeat the optimization we do with the promises 2008-10-29 12:46 so instead, we only add them on split or rollup 2008-10-29 12:48 um.. what means "don't add"? 2008-10-29 12:48 we don't dirty those? 2008-10-29 12:48 each phase has a list of buffers that belong to it, and will be written to disk in that phase 2008-10-29 12:48 ah 2008-10-29 12:49 delay? 2008-10-29 12:49 which delay? 2008-10-29 12:49 delay to add dirty index blocks? 
2008-10-29 12:50 there are dirty blocks not added to a phase, these are the blocks that need to be reconstructed from promises on recovery 2008-10-29 12:50 i see 2008-10-29 12:51 speaking of delay... a phase cannot begin to be written to disk until the next phase has started 2008-10-29 12:52 however, it can be by timeout? 2008-10-29 12:52 ah, yes 2008-10-29 12:52 I think a better trigger is, write queue on the underlying device nearly empty 2008-10-29 12:53 oh, i see 2008-10-29 12:53 so that when the device is not doing anything, at the start of an untar for example, the first phase will be very short 2008-10-29 12:54 sounds very good 2008-10-29 12:54 yes, good for throughput 2008-10-29 12:54 yes 2008-10-29 12:55 btw, in "Phase transition" section 2008-10-29 12:55 Starting a new phase requires incrementing the phase counter in the 2008-10-29 12:55 cached filesystem superblock and flushing all dirty inodes. 2008-10-29 12:55 in this section, "flushing all dirty inodes" means ->write_inode on linux 2008-10-29 12:55 ? 2008-10-29 12:56 but I meant also flushing any dirty blocks (pages in kernel) cached by the inode 2008-10-29 12:57 i see. actual write out... 2008-10-29 12:57 yes, I should have been more specific 2008-10-29 12:58 in starting a new phase, we need to flush dirty buffers? 2008-10-29 12:59 to start a new phase 2008-10-29 12:59 flush the buffers dirtied in the previous phase 2008-10-29 12:59 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-29 13:00 well, flush the _data_ buffers dirtied in the previous phase 2008-10-29 13:00 it's better to say, flush the dirty inode blocks 2008-10-29 13:00 ah, i see 2008-10-29 13:00 ordered-write mode? 
2008-10-29 13:01 this is more strict than ordered-write 2008-10-29 13:01 because we write the data blocks to new locations that don't overwrite data in a previous phase 2008-10-29 13:01 it's like data=journal 2008-10-29 13:02 has the same effect, but without writing twice 2008-10-29 13:02 i see 2008-10-29 13:03 I thought data buffers may be in place update 2008-10-29 13:05 thanks. I belive my understanding became more good 2008-10-29 13:09 we might add in place update later as an additional mode 2008-10-29 13:09 like ordered data 2008-10-29 13:11 the advantage is not in terms of speed, because in both cases we write the new blocks only once, but in reducing fragmentation because the data does not have to be relocated 2008-10-29 13:11 yes 2008-10-29 13:12 for a solid state disk, the advantage is very little 2008-10-29 13:12 so we will always want our strict mode for ssd I think 2008-10-29 13:13 ah, yes. it may be important in future 2008-10-29 13:13 I have an ssd now :) 2008-10-29 13:13 my eee 2008-10-29 13:13 oh, too fast :) 2008-10-29 13:14 we don't have good fs for it yet :) 2008-10-29 13:16 btw, are you already thinking about locking rules? 2008-10-29 13:16 yes 2008-10-29 13:16 in some depth 2008-10-29 13:16 oh, great 2008-10-29 13:17 I'm ignore about it for now 2008-10-29 13:17 that's reasonable 2008-10-29 13:17 we can start with a simple lock 2008-10-29 13:18 e.g. per btree big lock? 2008-10-29 13:18 yes 2008-10-29 13:18 i see 2008-10-29 13:18 per inode, the easist thing 2008-10-29 13:19 and one more for modifying the inode table 2008-10-29 13:19 and may be for bitmap? 2008-10-29 13:19 yes 2008-10-29 13:19 allocation lock 2008-10-29 13:19 i see 2008-10-29 13:20 ah 2008-10-29 13:20 in phase transision, we modify bitmap for commit blocks etc.? 2008-10-29 13:20 yes 2008-10-29 13:20 and bitmap change will be written to same phase? 
2008-10-29 13:20 commit blocks are marked as allocated to prevent them from being allocated for other purposes 2008-10-29 13:21 good question 2008-10-29 13:21 it's probably most efficient to write it to the same phase 2008-10-29 13:22 well 2008-10-29 13:22 good question :) 2008-10-29 13:22 i see. I thought it may be in phase commit or related blocks 2008-10-29 13:23 there will be multiple commit blocks per phase 2008-10-29 13:23 so phase commit points those blocks? 2008-10-29 13:24 each commit block points to some number of flushed blocks 2008-10-29 13:24 as many as will fit in the commit 2008-10-29 13:25 yes 2008-10-29 13:25 and all the flushed blocks, plus all the commit blocks, have to be completely written before the commit block for the phase is written 2008-10-29 13:25 it might be better to call those multiple commit blocks, log blocks 2008-10-29 13:25 and reserve the term commit block for the phase commit block 2008-10-29 13:26 sounds good 2008-10-29 13:26 I think that the dirty bitmaps for the log blocks have to be in the same phase, yes 2008-10-29 13:26 i see. how about commit block? 2008-10-29 13:27 commit block also allocate new block? 2008-10-29 13:27 yes 2008-10-29 13:28 I don't see a clear reason why it has to be in the allocation map of its own phase, or in the next phase 2008-10-29 13:29 um.. 2008-10-29 13:30 if crashed, we don't now free blocks until trace phases? 2008-10-29 13:30 now -> know 2008-10-29 13:30 we fall back to the last completed phase 2008-10-29 13:30 which means we found the commit block, and we know that it is allocated 2008-10-29 13:31 ah 2008-10-29 13:32 in recovery, we will mark those as allocated? 
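The write ordering described above (every flushed block and log block of a phase on disk before the phase commit block, with recovery falling back to the last phase whose commit block is found) can be sketched as a small model. The names `toy_phase`, `toy_write_commit` and `toy_last_completed` are invented for illustration, and counting log blocks stands in for real I/O completion.

```c
#include <assert.h>
#include <stddef.h>

struct toy_phase {
    unsigned seq;            /* phase sequence number */
    int log_blocks_total;    /* log blocks this phase must write */
    int log_blocks_written;  /* log blocks already on disk */
    int commit_written;      /* 1 once the phase commit block is on disk */
};

/* The commit block may only go out after everything else in the phase. */
static int toy_write_commit(struct toy_phase *p)
{
    assert(p != NULL);
    if (p->log_blocks_written < p->log_blocks_total)
        return -1;           /* too early: the phase is not fully on disk */
    p->commit_written = 1;
    return 0;
}

/* Recovery: pick the newest phase whose commit block made it to disk. */
static const struct toy_phase *toy_last_completed(const struct toy_phase *phases, size_t n)
{
    const struct toy_phase *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (phases[i].commit_written && (!best || phases[i].seq > best->seq))
            best = &phases[i];
    return best;
}
```

A phase interrupted by a crash never gets its commit block, so recovery simply ignores whatever partial writes it left behind.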
2008-10-29 13:33 yes 2008-10-29 13:33 part of reconstructing dirty metadata 2008-10-29 13:33 the phase commit block can be freed after the next phase completes 2008-10-29 13:34 one thing we could do in future, is allow more than one phase on disk 2008-10-29 13:34 yes 2008-10-29 13:34 this gives a very limited form of versioning 2008-10-29 13:35 at the expense of making allocation decisions more difficult 2008-10-29 13:35 it might be useful for something 2008-10-29 13:35 i see 2008-10-29 14:22 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-29 19:37 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-29 20:40 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-29 20:41 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3 2008-10-29 22:11 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-30 01:19 -!- vcgomes[away](~vcgomes@li17-238.members.linode.com) has joined #tux3 2008-10-30 03:54 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-10-30 04:01 -!- pgquiles__(~pgquiles@19.Red-83-44-236.dynamicIP.rima-tde.net) has joined #tux3 2008-10-30 07:27 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-30 09:29 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-10-30 12:08 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-30 12:15 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-30 12:56 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-30 12:57 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-30 13:12 flips: cabal is today right ? 2008-10-30 13:50 bh, postponed due to me having a cold 2008-10-30 13:51 ok 2008-10-30 13:51 when ? 2008-10-30 14:09 stay tuned for further developments 2008-10-30 14:11 we having the 8pm u today? 
2008-10-30 14:19 ok 2008-10-30 14:19 flips: I was going to drive up Friday possibly so that's why I asked 2008-10-30 14:27 oh, it will be next week 2008-10-30 14:37 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-30 14:41 ok 2008-10-30 14:41 I might be in the SF bay area by then 2008-10-30 16:42 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-30 16:53 -!- ajonat(~ajonat@190.48.117.217) has joined #tux3 2008-10-30 17:52 -!- ajonat(~ajonat@190.48.124.161) has joined #tux3 2008-10-30 18:33 -!- mlankhorst(~m@fw1.astro.rug.nl) has joined #tux3 2008-10-30 19:50 -!- RazvanM(~RazvanM@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-10-30 20:00 ping... 2008-10-30 20:00 -!- RalucaM(~ral@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-10-30 20:00 hi 2008-10-30 20:00 ACTION is warming up the lxr 2008-10-30 20:01 ACTION is also very sleepy because he has an uptime of about 17h 2008-10-30 20:02 ACTION looks around... 2008-10-30 20:02 ACTION is not particularly sleepy today because he had a decent maintenance downtime window this time around... 2008-10-30 20:02 flips is not around? 2008-10-30 20:03 ACTION searches around for flips 2008-10-30 20:03 ACTION is also back on macbook :P 2008-10-30 20:03 ACTION me on a macbook as well 2008-10-30 20:03 :D 2008-10-30 20:03 macbook is good... 2008-10-30 20:03 hi 2008-10-30 20:03 ups that wasn't a correct action 2008-10-30 20:03 hey 2008-10-30 20:03 on the other hand I did some progress on PPC while the macbook was away ;-) 2008-10-30 20:03 RazvanM: running Mac or linux on the macbook? 
2008-10-30 20:03 MaZe: Mac OS 2008-10-30 20:04 ok, let's look at the block io library today, a little more 2008-10-30 20:04 ah, well, I'm running linux, I like the hardware, but couldn't get used to the OS and switched to running progressing versions of fedora 2008-10-30 20:04 see what's there, and why we call it the library 2008-10-30 20:05 there are actually two layers of block library functions 2008-10-30 20:05 the top layer being the generic_* routines 2008-10-30 20:05 let's start with the which linux ver question 2008-10-30 20:05 2.6.27 2008-10-30 20:05 I slipped last time 2008-10-30 20:06 the generic_* are all called through hooks, which we have seen 2008-10-30 20:06 of the form "if this hook (e.g., ->write) is nonzero then call the function supplied by the filesystem" 2008-10-30 20:07 otherwise the vfs calls a generic function 2008-10-30 20:07 grep generic * | grep EXPORT | wc -l 2008-10-30 20:07 26 2008-10-30 20:07 all in buffer.c and filemap.c? 2008-10-30 20:07 (in fs/) 2008-10-30 20:07 why not have the hooks point to the default and not have to the the if check every time? 
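The hook pattern just described ("if this hook is nonzero, call the filesystem's function, otherwise call a generic one") looks roughly like the following in plain C. This is a user-space illustration only: `toy_file_ops`, `generic_write_toy`, `capped_write` and `vfs_write_toy` are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Toy operations table: a filesystem may leave the hook NULL. */
struct toy_file_ops {
    long (*write)(const char *buf, size_t len);
};

/* Fallback used when the filesystem supplies no hook of its own. */
static long generic_write_toy(const char *buf, size_t len)
{
    (void)buf;
    return (long)len;        /* pretend everything was written */
}

/* A made-up filesystem hook that writes at most 4 bytes per call. */
static long capped_write(const char *buf, size_t len)
{
    (void)buf;
    return len > 4 ? 4 : (long)len;
}

/* The dispatch: the "if" check happens on every single call. */
static long vfs_write_toy(const struct toy_file_ops *ops, const char *buf, size_t len)
{
    assert(ops != NULL);
    if (ops->write)
        return ops->write(buf, len);
    return generic_write_toy(buf, len);
}
```

The alternative raised in the question would be to fill every NULL hook with the generic function when the table is set up, so that the dispatch becomes an unconditional indirect call.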
2008-10-30 20:08 because we're lame 2008-10-30 20:08 grep generic * | grep EXPORT | cut -d ':' -f 1 | sort | uniq | xargs 2008-10-30 20:08 buffer.c fs-writeback.c inode.c libfs.c locks.c namei.c namespace.c open.c read_write.c splice.c stat.c super.c xattr.c 2008-10-30 20:08 changing it would require a big spam edit to dozens of filesystems, if you can show a benefit such a patch is sometimes accepted 2008-10-30 20:08 50/50 chance, assuming it actually deletes code or makes something more efficient 2008-10-30 20:09 razvanm, so generics are splattered all over the place 2008-10-30 20:09 in keeping with how they came to be: they all started life as specific code used by some filesystem, typically ext2 2008-10-30 20:10 really it would be better if they were all in libfs.c 2008-10-30 20:10 I wonder what's in locks.c 2008-10-30 20:11 ACTION looks 2008-10-30 20:11 libfs.c:EXPORT_SYMBOL_GPL(generic_fh_to_dentry); 2008-10-30 20:11 libfs.c:EXPORT_SYMBOL_GPL(generic_fh_to_parent); 2008-10-30 20:11 libfs.c:EXPORT_SYMBOL(generic_read_dir); 2008-10-30 20:11 locks.c:EXPORT_SYMBOL(generic_setlease); 2008-10-30 20:11 only one is in locks.c 2008-10-30 20:11 libfs.c would appear to be half an idea 2008-10-30 20:12 what do you mean - half an idea? 2008-10-30 20:12 having only those three minor generics in it doesn't fit the name well 2008-10-30 20:12 fh_to dentry is an nfs function by the way 2008-10-30 20:13 (beside the 26 in fs/ there are 10 more in mm/; 9 in mm/filemap.c and one in mm/page-writeback.c) 2008-10-30 20:13 I suppose we could look at how nfs works one session some time down the road 2008-10-30 20:13 ok, I see 2008-10-30 20:13 that will be scary 2008-10-30 20:13 fh - filehandle? 2008-10-30 20:13 so basically nfs cookie? 2008-10-30 20:13 yes 2008-10-30 20:14 we say opaque filehandle and reserve cookie to mean directory cookie 2008-10-30 20:14 how large is it? 2008-10-30 20:14 it? 2008-10-30 20:15 a fh? 
2008-10-30 20:15 yup 2008-10-30 20:15 lots of bytes 2008-10-30 20:15 64 I seem to recall 2008-10-30 20:15 don't quote me 2008-10-30 20:16 one way to find out is to look at the slab cache 2008-10-30 20:16 cat /proc/slabinfo 2008-10-30 20:16 another is to printk(... sizeof(struct fh)); 2008-10-30 20:16 in junkfs 2008-10-30 20:17 ok, we're not going to replace the top layer of fs library functions 2008-10-30 20:17 5*4 bytes it would seem (at least) 2008-10-30 20:17 assuming struct fid is what it is 2008-10-30 20:18 it's defined by the rfs of course 2008-10-30 20:18 rfc 2008-10-30 20:18 v2 fh is 32 bytes 2008-10-30 20:18 v3 is variable up to 64 2008-10-30 20:19 so I recalled sort of correctly 2008-10-30 20:19 anyway 2008-10-30 20:19 ;-) 2008-10-30 20:19 nfs is a whole huge messy topic 2008-10-30 20:19 and is interesting to tux3 only to the extent that we have to do a few things to make that mess work 2008-10-30 20:19 what the fs needs to obey in order to support nfs exporting is probably worth going over 2008-10-30 20:19 doesn't nfs work with any fs? 2008-10-30 20:19 oh, and is interesting to tux3 because one of the prime uses of tux3 will be exporting nfs 2008-10-30 20:20 the fs has to obey some rules in order to support nfs 2008-10-30 20:20 maze, yes it would be worth a homework assignment: "all the places nfs causes pain for a regular fs" 2008-10-30 20:20 isn't that more like a dissertation? 
2008-10-30 20:21 for example, there has to be a way to supply stable directory cookies across reboots that fit in 31 bits to support v2 2008-10-30 20:21 that relaxes to 64 bits in v3 2008-10-30 20:21 still painful 2008-10-30 20:21 maze, not really, it's mostly googling 2008-10-30 20:22 there are only a few places filesystems have to do something bizarre and unnatural 2008-10-30 20:22 the directory one is the one I'm directly familiar with because of the pain it caused in htree development 2008-10-30 20:23 btrfs guys are currently going through equivalent pain 2008-10-30 20:23 trying to get their directory scheme to work with nfs 2008-10-30 20:23 reiser never played well with nfs 2008-10-30 20:23 ok 2008-10-30 20:23 hi 2008-10-30 20:23 question: nfs2, is it still widely used? 2008-10-30 20:23 hi hirofumi 2008-10-30 20:24 (seeing as nfs4 is long out...) 2008-10-30 20:24 maze, I haven't seen one for a long time 2008-10-30 20:24 i wake up now 2008-10-30 20:24 and a brand new filesystem could possibly ignore it 2008-10-30 20:24 but since we know how to support it properly, why not? 
2008-10-30 20:25 well, 31 vs 64 bits is a bit of a difference 2008-10-30 20:25 the directory index planned for tux3 doesn't care 2008-10-30 20:25 the original htree would have benefitted 2008-10-30 20:26 ok, we could spend one session on just that: how ext3 dirops handle nfs v2/v3 2008-10-30 20:26 it's rather complex 2008-10-30 20:27 now, let's move on down to the second layer of fs library calls in the read/write path 2008-10-30 20:27 when we looked at generic read/write, we saw that the filesystem does all its work in the ->readpage/->writepage calls 2008-10-30 20:27 at least in the non-mpage forms 2008-10-30 20:28 today, more of the work, by volume, gets done in the unspeakably messy but faster mpage stuff 2008-10-30 20:28 we will look at both 2008-10-30 20:29 but let's consider the _2copy generic function first 2008-10-30 20:29 and follow it into ext3 2008-10-30 20:29 anybody got a url for the ->writepage call in _2copy? 2008-10-30 20:29 ACTION is searching 2008-10-30 20:30 (my standard trick when I need to get up and get my cup of tea for example) 2008-10-30 20:30 generic_perform_write_2copy? 2008-10-30 20:30 uhm it doesn't? 2008-10-30 20:31 that's the one 2008-10-30 20:31 http://lxr.linux.no/linux+v2.6.27/mm/filemap.c#L2216 2008-10-30 20:31 http://lxr.linux.no/linux+v2.6.27/mm/filemap.c#L2331 2008-10-30 20:32 http://lxr.linux.no/linux+v2.6.27/mm/filemap.c#L2345 ?
2008-10-30 20:32 2312 status = a_ops->prepare_write(file, page, offset, offset+bytes); 2008-10-30 20:32 there's commit_write a little bit down as well 2008-10-30 20:32 2345 status = a_ops->commit_write(file, page, offset, offset+bytes); 2008-10-30 20:32 right 2008-10-30 20:32 it's a two-prong plug 2008-10-30 20:32 for no good reason 2008-10-30 20:32 :-) 2008-10-30 20:33 hmm, not sure 2008-10-30 20:33 maybe it's needed for partial page writes 2008-10-30 20:33 so if all the filesystem does is supply those two functions, then standard buffered write will just magically work 2008-10-30 20:33 let's see how ext2 supplies them 2008-10-30 20:34 maze, no, there's no good reason 2008-10-30 20:34 as attested to by them being scheduled for eradication 2008-10-30 20:34 finally 2008-10-30 20:34 alway were just a messy wart 2008-10-30 20:35 -!- madhu(~chatzilla@122.252.226.161) has joined #tux3 2008-10-30 20:35 hey all 2008-10-30 20:35 hi bobby 2008-10-30 20:35 hirofumi, what time is in in japan? 2008-10-30 20:35 long time no see 2008-10-30 20:35 I think really early in the morning 2008-10-30 20:35 12:35 2008-10-30 20:35 oh 2008-10-30 20:36 its 9:05 in india :) 2008-10-30 20:36 http://lxr.linux.no/linux+v2.6.27/fs/ext2/inode.c#L783 2008-10-30 20:36 ok, that's fine 2008-10-30 20:36 11:36 in baltimore ;-) 2008-10-30 20:36 midnight is the best time to hack 2008-10-30 20:36 that is true! 2008-10-30 20:36 yes 2008-10-30 20:37 ext2 doesn't implement those functions 2008-10-30 20:37 yup ;-) 2008-10-30 20:37 exactly 2008-10-30 20:37 actually most fs don't 2008-10-30 20:37 RazvanM: when is the next tux3 U? 2008-10-30 20:37 because they are using the ->writepage interface instead 2008-10-30 20:37 looks like: cifs afs gfs2 ecryptfs impement prepare_write 2008-10-30 20:38 which does prepare_ ... 
comit_ 2008-10-30 20:38 ext2 uses ->write_begin and ->write_end 2008-10-30 20:38 bobby: it's right now right here 2008-10-30 20:39 ohk 2008-10-30 20:39 bobby: and flips is the teacher :D 2008-10-30 20:39 and the next one is on tuesday at 8 pm pdt 2008-10-30 20:39 (see topic) 2008-10-30 20:39 hmm, the timings are a bit difficult :( 2008-10-30 20:39 although really the next session should be updated 2008-10-30 20:39 hirofumi, yes, new things 2008-10-30 20:40 the replacement of prepare... commit 2008-10-30 20:40 yes 2008-10-30 20:40 wtill pointlessly a two-prong plug 2008-10-30 20:40 still 2008-10-30 20:40 bobby, the current time seems to work out fine 2008-10-30 20:40 means you have to get up early ;) 2008-10-30 20:40 those are introduced for bug fix 2008-10-30 20:40 flips: yeah :( 2008-10-30 20:41 bug fix? 2008-10-30 20:41 yes 2008-10-30 20:41 hirofumi, got a url? 2008-10-30 20:41 hirofumi: care to elaborate? 2008-10-30 20:41 $ grep prepare_write *.c | grep EXPORT 2008-10-30 20:41 buffer.c:EXPORT_SYMBOL(block_prepare_write); 2008-10-30 20:41 libfs.c:EXPORT_SYMBOL(simple_prepare_write); 2008-10-30 20:41 race condition iirc 2008-10-30 20:42 $ grep commit_write *.c | grep EXPORT 2008-10-30 20:42 buffer.c:EXPORT_SYMBOL(block_commit_write); 2008-10-30 20:44 fs: introduce write_begin, write_end, and perform_write aops 2008-10-30 20:44 2008-10-30 20:44 These are intended to replace prepare_write and commit_write with more 2008-10-30 20:44 flexible alternatives that are also able to avoid the buffered write 2008-10-30 20:44 deadlock problems efficiently (which prepare_write is unable to do). 
2008-10-30 20:44 commitid of git is afddba49d18f346e5cc2938b6ed7c512db18ca68 2008-10-30 20:45 so this is how stuff get added without the removing the old ones :P 2008-10-30 20:45 http://lxr.linux.no/linux+v2.6.27/mm/page-writeback.c#L974 <- ok, here is the generic writepage call, as opposed to the prepare/commit interface in _2copy 2008-10-30 20:46 http://lxr.linux.no/linux+v2.6.27/mm/page-writeback.c#L991 <- generic_writepages 2008-10-30 20:46 http://lxr.linux.no/linux+v2.6.27/mm/page-writeback.c#L866 <- write_cache_pages 2008-10-30 20:47 936 ret = (*writepage)(page, wbc, data); 2008-10-30 20:47 so... I retract the claim about _2copy, and now assert that if all you implement is ->writepage, that the vfs will use it via write_cache_pages to implement buffered file write 2008-10-30 20:48 so now let's look at how ext2 implements it 2008-10-30 20:49 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L2859 2008-10-30 20:49 http://lxr.linux.no/linux+v2.6.27/fs/ext2/inode.c#L707 <- ext2_writepage 2008-10-30 20:49 way ahead of me ;) 2008-10-30 20:50 so all ext2 does is pass its ext2_get_block back to the vfs block io library 2008-10-30 20:50 we have looked at that call before 2008-10-30 20:50 no need to again right now, correct? 2008-10-30 20:50 and block_write_full_page is part of the vfs block io library 2008-10-30 20:51 right 2008-10-30 20:51 it does prepare... commit 2008-10-30 20:51 maybe now has been changed to begin...end 2008-10-30 20:51 let's see 2008-10-30 20:52 well know 2008-10-30 20:52 it's basically prepare and commit grafted together 2008-10-30 20:52 with a call to get_block sandwiched in between 2008-10-30 20:53 arguably a candidate for conversion to use the new functions 2008-10-30 20:53 let's look at the ext2 part 2008-10-30 20:53 ext2_get_block 2008-10-30 20:53 we've looked at it briefly before, right? 
2008-10-30 20:54 http://lxr.linux.no/linux+v2.6.27/fs/ext2/inode.c#L694 2008-10-30 20:55 this is the one that traverses the file index starting at an inode and given a logical offset, to find a physical block number which is returned by filling in a b_blocknr field in a supplied buffer_head 2008-10-30 20:55 there are a few decorations on that interface, such as telling the caller that the block was newly allocated and thus the buffer needs to be zeroed by some callers 2008-10-30 20:56 now, this is where we want to do things differently in tux3 2008-10-30 20:56 how? 2008-10-30 20:57 are we going to have readpage/writepage? 2008-10-30 20:57 instead of calling back into the fs lib from ->block_write_full_page, tux3 will just go on to write out the page 2008-10-30 20:57 by calling submit_bio 2008-10-30 20:57 we will implement ->readpage/->writepage, but they won't call the library routines 2008-10-30 20:58 and we maybe won't have to write a tux3_getblock 2008-10-30 20:58 i see 2008-10-30 20:58 that's the interesting question I wanted to address today: can we get away with no tux3_getblock at all, but instead just initiate io ourselves where the vfs calls ->writepage and similar 2008-10-30 20:59 we'll not have a tux3_getblock at all or it will be exposed to the outside of tux3? 2008-10-30 20:59 I'd like to have none at all, though we need something similar to implement bmap 2008-10-30 21:00 (midnight) 2008-10-30 21:00 however that doesn't have to implement all the slightly wierd semantics of typical ->get_block 2008-10-30 21:00 razvanm, did you turn into a pumpkin? 2008-10-30 21:00 getblock is pretty weird, because what it could/should return may depend on the future and whether we want to read or write 2008-10-30 21:00 and where's raluca? 
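[editor's sketch] The get_block arrangement described above (the filesystem hands a mapping callback to a generic library routine, which fills in a block number and a "newly allocated" hint) can be modeled in userspace as follows. Everything here is an assumption made for illustration: `demo_buffer`, `demo_file_index`, `demo_get_block` and `demo_write_full_page` are hypothetical names, the flat array stands in for a real file index, and no I/O is submitted.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_INDEX_SLOTS 8

/* Stand-in for the few buffer_head fields mentioned in the discussion. */
struct demo_buffer {
    unsigned long blocknr; /* physical block, filled in by the callback */
    int mapped;            /* 1 if blocknr is valid */
    int is_new;            /* 1 if the block was allocated by this call */
};

/* Toy file index: logical block -> physical block, 0 meaning a hole. */
struct demo_file_index {
    unsigned long map[DEMO_INDEX_SLOTS];
    unsigned long next_free; /* trivially simple allocator */
};

typedef int (*demo_get_block_t)(struct demo_file_index *idx,
                                unsigned long logical,
                                struct demo_buffer *bh, int create);

/* Filesystem side: translate a logical offset to a physical block.
 * With create == 0 a hole is reported as unmapped; with create != 0 a
 * block is allocated and flagged new so the caller knows to zero it. */
int demo_get_block(struct demo_file_index *idx, unsigned long logical,
                   struct demo_buffer *bh, int create)
{
    if (logical >= DEMO_INDEX_SLOTS)
        return -1;
    bh->is_new = 0;
    if (idx->map[logical] == 0) {
        if (!create) {
            bh->mapped = 0;
            return 0;
        }
        idx->map[logical] = idx->next_free++;
        bh->is_new = 1;
    }
    bh->blocknr = idx->map[logical];
    bh->mapped = 1;
    return 0;
}

/* Library side: knows nothing about the index layout, only the callback.
 * Returns the physical block the page would go to, or a negative error. */
long demo_write_full_page(struct demo_file_index *idx, unsigned long logical,
                          demo_get_block_t get_block)
{
    struct demo_buffer bh = { 0, 0, 0 };
    int err = get_block(idx, logical, &bh, 1);
    if (err)
        return err;
    assert(bh.mapped);
    /* a real implementation would submit I/O for bh.blocknr here */
    return (long)bh.blocknr;
}
```

The `create` flag is one place where the answer depends on whether the caller intends to read or write, which is part of what makes the interface awkward for delayed allocation.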
2008-10-30 21:00 :D 2008-10-30 21:00 she's here :P 2008-10-30 21:01 actually, we have a small pumpkin made of plush ;-) 2008-10-30 21:01 maze, it's a braindamaged interface imho, not least because it works at cross purposes with delayed allocation 2008-10-30 21:01 we don't actually need to allocate physical disk blocks (besides reserving them) until we actually wish to flush to disk 2008-10-30 21:01 ;-) 2008-10-30 21:01 here and quiet :) 2008-10-30 21:01 [not reserving them - reserving space for them 'somewhere'] 2008-10-30 21:01 I was able to make use of the minix_get_block to fill a page ;-) 2008-10-30 21:02 maze, yes, and we always are going to send something to disk when we get a ->writepage call, though it may not be the page we got the ->writepage for 2008-10-30 21:02 right 2008-10-30 21:02 ->writepage also comes to us from deep in vm 2008-10-30 21:02 in shrink_caches 2008-10-30 21:02 wait 2008-10-30 21:02 why will we always send something to disk? 2008-10-30 21:03 we're doing write-through? 2008-10-30 21:03 because either the user or vmm told us we should 2008-10-30 21:03 not write-cache and flush on close 2008-10-30 21:03 ? 2008-10-30 21:03 it's bad behavior for a fs to cache write stuff for a long time 2008-10-30 21:04 generic behavior is to start the IO transfer inside sys_write 2008-10-30 21:04 yes, it's writethrough 2008-10-30 21:04 uhm, I'd argue that 2008-10-30 21:04 if the disk is idle - sure start writeing something 2008-10-30 21:04 otherwise... 
2008-10-30 21:05 ok, let's start next time by considering the question of where we do _not_ immediately initiate writeout in sys_write 2008-10-30 21:05 very useful exercise 2008-10-30 21:05 :-) 2008-10-30 21:06 how did we do today, what ground did we cover 2008-10-30 21:06 since there are huge benefits to clumping writes and reads up, we shouldn't write to aggressively 2008-10-30 21:06 ACTION this today's lesson was also short ;-) 2008-10-30 21:06 felt short, true 2008-10-30 21:06 it was an hour 2008-10-30 21:06 went by fast 2008-10-30 21:06 true :D 2008-10-30 21:08 I have to start implementing the write part for minix so today's lesson was informative for me 2008-10-30 21:09 also the area I'm working in at the moment, kind of 2008-10-30 21:09 minix? 2008-10-30 21:09 no, writeout 2008-10-30 21:09 for tux3 2008-10-30 21:10 ah, yes 2008-10-30 21:10 atomic commit, and just exactly what our cache behavior will be 2008-10-30 21:10 RalucaM, are you writing minix? 2008-10-30 21:10 flips, i see 2008-10-30 21:10 hirofumi: I'm trying to use the minix fs from macos ;-) 2008-10-30 21:10 nope 2008-10-30 21:11 i see 2008-10-30 21:11 razvanm, do you talk to the ext3cow guys? 2008-10-30 21:11 I've mostly been going through various primitives in the kernel and trying to familiarize myself with them 2008-10-30 21:11 (atomic ops, mutexes, etc...) 2008-10-30 21:11 flips: the guy graduated and left before I had a chance to benefit from his knowledge :( 2008-10-30 21:11 maze, if you try to do them all you will never do anything but ;) 2008-10-30 21:12 razvanm, it was just one guy? 
2008-10-30 21:12 no, not all - just the ones I run into 2008-10-30 21:12 flips: I think so :D 2008-10-30 21:12 ext2cow still seems to be an active project 2008-10-30 21:12 and it still seems to be centered in JHU 2008-10-30 21:13 http://www.ext3cow.com/Developers.html 2008-10-30 21:13 zach left 2008-10-30 21:13 Randal is the prof 2008-10-30 21:13 http://www.ext3cow.com/Blog/Blog.html 2008-10-30 21:13 I hope he'll sign my project when I'm done :D 2008-10-30 21:13 last entry is june 2008-10-30 21:14 indeed 2008-10-30 21:14 hmm, last patch is 2.6.20.3 2008-10-30 21:15 does seem like some loss of momentum 2008-10-30 21:15 I should email randal burns and see if there are plans 2008-10-30 21:16 do you meet him? 2008-10-30 21:16 never 2008-10-30 21:17 not so far 2008-10-30 21:17 but maybe eventually 2008-10-30 21:17 http://hssl.cs.jhu.edu/pipermail/ext3cow-devel/2008-October/000064.html 2008-10-30 21:17 last post is today 2008-10-30 21:17 :D so the world is not yet _that_ small 2008-10-30 21:17 nearly that small 2008-10-30 21:19 how working for ise inc 2008-10-30 21:19 yup 2008-10-30 21:19 a bunch of people are there now 2008-10-30 21:22 looks like ext3cow is dead in the water without zachary 2008-10-30 21:23 http://hssl.cs.jhu.edu/pipermail/ext3cow-devel/2008-July/000048.html <- somebody from france forward ported to 2.6.25.3 2008-10-30 21:23 did they have something more to add to it? 2008-10-30 21:23 deletion? 2008-10-30 21:23 kernel merge? 2008-10-30 21:24 that is, snapshot deletion 2008-10-30 21:24 deletion inside a snapshot? 2008-10-30 21:24 seems to me, as it is it will run and make snapshots until the volume is full 2008-10-30 21:24 then game over 2008-10-30 21:24 I will be amazed if they don't have a way to delete a snapshot :D 2008-10-30 21:25 that's what the thread is about 2008-10-30 21:26 > > There are some limitations : 2008-10-30 21:26 > > -The oldest version of a file cannot be deleted with this method. 
2008-10-30 21:26 > > -Old versions of directories cannot be deleted. 2008-10-30 21:26 > 2008-10-30 21:27 Nicolas ENG is doing it 2008-10-30 21:27 hmm... 2008-10-30 21:27 anyway I do not expect it is easy 2008-10-30 21:27 but it's not dead 2008-10-30 21:27 that's good 2008-10-30 21:27 I have a question 2008-10-30 21:28 will the economic downturn improve the amount of work in open source communities? 2008-10-30 21:28 or it will be the other way around 2008-10-30 21:30 probably won't change it much 2008-10-30 21:30 hasn't in the past 2008-10-30 21:30 what it does tend to do is accelerate adoption 2008-10-30 21:30 why is that? 2008-10-30 21:30 other factors are way more important 2008-10-30 21:30 like who happens to be inspired at the time 2008-10-30 21:30 and what kind of tools are used 2008-10-30 21:31 a lot of the work comes from universities 2008-10-30 21:31 which are sheltered from the economy pretty well 2008-10-30 21:31 are they? 2008-10-30 21:31 even for this one? :D 2008-10-30 21:32 as long as you can convince dad to keep sending money ;) 2008-10-30 21:32 brb 2008-10-30 21:33 -!- RazvanM(~RazvanM@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-10-30 21:33 back 2008-10-30 21:33 during economic booms, high tech companies tend to raid the universities 2008-10-30 21:33 call that cradle robbing 2008-10-30 21:34 that should cool down noticeably this year, in fact it already has 2008-10-30 21:34 lets students concentrate on a more well round education 2008-10-30 21:35 :-) 2008-10-30 21:35 enrollment in grad school will probably go up :) 2008-10-30 21:36 the undergrad enrollment was pretty low at JHU for some years 2008-10-30 21:37 pretty expensive isn't it? 2008-10-30 21:37 indeed... 
2008-10-30 21:37 >30K 2008-10-30 21:38 yeah thats rediculous 2008-10-30 21:38 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-10-30 21:39 wow 37700 according to their site 2008-10-30 21:40 the univ could be hit hard if people will not be able to afford this anymore 2008-10-30 21:40 per year? 2008-10-30 21:40 yes! 2008-10-30 21:40 thats over 150k for a 4 yr undergrad degree 2008-10-30 21:40 plus books 2008-10-30 21:40 as a business proposition... marginal 2008-10-30 21:40 unless it comes with scholarships 2008-10-30 21:41 which why the grad is attractive ;-) 2008-10-30 21:41 not so attractive when to watch the advisors writing proposal after proposal to get the funding :| 2008-10-30 21:41 to = you 2008-10-30 21:42 so how do you manage if you don't mind saying in public? 2008-10-30 21:43 I do my part as best as I can while the advisors are doing the same 2008-10-30 21:44 I come from Ro where research is a luxury that schools doesn't have 2008-10-30 21:44 so this is heaven from that perspective :P 2008-10-30 21:45 considering the super low number of American students this is something that doesn't look the same for them 2008-10-30 21:46 I certainly appreciated the opportunity to get to university 2008-10-30 21:46 almost didn't leave ;) 2008-10-30 21:46 how is that? 2008-10-30 21:46 got bitten by the computer hacking bug 2008-10-30 21:46 had lots of primary research going on around me 2008-10-30 21:47 very compelling environment 2008-10-30 21:47 :D 2008-10-30 21:48 i ran away from school as fast as i could after my undergrad 2008-10-30 21:49 easy choice those were the dot com days 2008-10-30 21:50 contributory cause of the dot bust? 
2008-10-30 21:51 acres of cubes full of dropouts learning by doing ;) 2008-10-30 21:51 sort of like this time round 2008-10-30 21:52 hah, actually i finished school in the middle of the dot bomb 2008-10-30 21:53 theres always plenty of jobs right after the 'oh shit we fired too many people' stage 2008-10-30 21:55 there's always plenty of jobs for anybody who can admin/hack their way out of a wet paper bag 2008-10-30 21:56 yeah that too 2008-10-30 22:27 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-30 23:12 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-10-31 04:01 folks 2008-10-31 08:30 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-31 10:09 -!- mingming(~mingming@c-24-22-117-202.hsd1.or.comcast.net) has joined #tux3 2008-10-31 10:12 FYI: prepare_write and commit_write was replaced completely in current tree 2008-10-31 11:06 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-10-31 11:09 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-10-31 11:57 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-10-31 11:57 lol 2008-10-31 11:58 moving target ;-) 2008-10-31 11:58 replaced with what? 
2008-10-31 11:58 ->write_begin and ->write_end 2008-10-31 12:00 it should be small issue 2008-10-31 12:03 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-10-31 14:13 folks 2008-10-31 16:47 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-10-31 21:47 -!- RazvanM(~RazvanM@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-10-31 22:41 -!- bobby(~chatzilla@122.252.226.161) has joined #tux3 2008-10-31 23:50 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-01 02:07 happy halloween 2008-11-01 04:12 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-01 09:30 -!- pgquiles(~pgquiles@158.Red-80-39-234.dynamicIP.rima-tde.net) has joined #tux3 2008-11-01 12:44 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-11-01 12:44 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-01 12:44 -!- pgquiles(~pgquiles@158.Red-80-39-234.dynamicIP.rima-tde.net) has joined #tux3 2008-11-01 12:44 -!- RazvanM(~RazvanM@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-01 12:44 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-11-01 12:44 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-11-01 12:44 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-11-01 12:44 -!- mlankhorst(~m@fw1.astro.rug.nl) has joined #tux3 2008-11-01 12:44 -!- vcgomes[away](~vcgomes@li17-238.members.linode.com) has joined #tux3 2008-11-01 12:44 -!- flips(~phillips@phunq.net) has joined #tux3 2008-11-01 12:44 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-11-01 12:51 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-01 13:21 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-01 13:58 -!- ajonat(~ajonat@190.48.108.125) has joined #tux3 2008-11-01 15:01 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-11-01 15:22 
-!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-01 17:02 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-01 18:15 -!- ajonat(~ajonat@110-74-17-190.fibertel.com.ar) has joined #tux3 2008-11-01 18:25 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-01 19:28 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-01 22:24 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-01 23:01 -!- Kirantpatil(~kiran@122.167.209.82) has joined #tux3 2008-11-01 23:01 -!- Kirantpatil(~kiran@122.167.209.82) has left #tux3 2008-11-02 08:27 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-02 09:00 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-02 09:56 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-02 11:33 -!- pgquiles(~pgquiles@228.Red-81-35-100.dynamicIP.rima-tde.net) has joined #tux3 2008-11-02 11:38 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-02 11:49 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-02 13:10 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-02 14:03 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-02 16:13 -!- ajonat(~ajonat@190.48.108.125) has joined #tux3 2008-11-02 18:30 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-02 19:12 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-02 20:05 -!- RazvanM(~RazvanM@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-02 20:42 folks 2008-11-02 22:37 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-03 00:20 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-11-03 00:20 hey a;; 2008-11-03 00:20 all* 
2008-11-03 01:02 flips: http://lkml.org/lkml/2007/7/28/186 2008-11-03 01:02 old post from me, not sure if you ever saw this 2008-11-03 07:25 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-03 08:15 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-03 08:22 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-03 08:53 -!- pgquiles(~pgquiles@228.Red-81-35-100.dynamicIP.rima-tde.net) has joined #tux3 2008-11-03 10:16 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-03 10:58 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-03 12:13 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-03 12:17 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-03 13:14 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-03 16:18 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-03 16:50 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-03 18:38 -!- ajonat(~ajonat@190.48.112.111) has joined #tux3 2008-11-03 19:48 -!- RazvanM(~RazvanM@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-03 21:33 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-03 23:44 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-04 06:15 -!- mlankhorst_(~m@fw1.astro.rug.nl) has joined #tux3 2008-11-04 06:49 -!- pgquiles(~pgquiles@228.Red-81-35-100.dynamicIP.rima-tde.net) has joined #tux3 2008-11-04 08:46 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-04 09:09 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-04 09:12 ACTION is going to be away for the rest of the week (due to a conference) 2008-11-04 10:14 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-04 12:02 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-04 14:31 -!- 
tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-04 19:56 -!- RalucaM(~ral@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-04 19:57 hi 2008-11-04 19:57 yo 2008-11-04 20:08 hi 2008-11-04 20:08 now that the election thing's done, should we probe the kernel? 2008-11-04 20:08 :-) 2008-11-04 20:08 oh is it? 2008-11-04 20:08 just 2008-11-04 20:09 some whooping and hollering outside 2008-11-04 20:09 nice and quiet over here 2008-11-04 20:10 looks like a huge win too 2008-11-04 20:10 projections as of yesterday ran from 330 to 350 2008-11-04 20:11 looks like it will be towards the high side 2008-11-04 20:12 http://lxr.linux.no/linux+v2.6.27/ 2008-11-04 20:12 let's take a look at iget 2008-11-04 20:13 if you're ready 2008-11-04 20:13 hi 2008-11-04 20:14 hi 2008-11-04 20:14 today is tux3 u? 2008-11-04 20:14 just starting 2008-11-04 20:14 now that the u.s. election is no longer in doubt 2008-11-04 20:15 iget, is that like wget but from apple? 2008-11-04 20:15 oh 2008-11-04 20:15 in that they both run a computer, yes 2008-11-04 20:15 lol 2008-11-04 20:16 hmm, I'm getting some indexing incorrectness from lxr 2008-11-04 20:16 it only finds one occurance of ext2_iget 2008-11-04 20:18 search iget? 
2008-11-04 20:18 grep in fs/ext2 finds a bunch 2008-11-04 20:18 I wonder if it is just this version that is messed up 2008-11-04 20:19 I noticed some other indexing errors with lxr a few days ago 2008-11-04 20:19 I get roughly (if not exactly) the same results with 2.6.26.7 2.6.27 and 2.6.27.4 2008-11-04 20:19 2 matches and 5 in freetext 2008-11-04 20:20 the matches are declaration and definition 2008-11-04 20:20 the freetext seem fine as well 2008-11-04 20:20 http://lxr.linux.no/linux+v2.6.27.4/fs/ext2/inode.c#L1184 2008-11-04 20:21 the original iget is long gone 2008-11-04 20:21 we now have iget_locked 2008-11-04 20:21 and iget5_locked 2008-11-04 20:22 the purpose is to return an inode given an inode number 2008-11-04 20:22 http://lxr.linux.no/linux+v2.6.27/fs/inode.c#L942 2008-11-04 20:23 this implies that the filesystem has inode numbers 2008-11-04 20:23 which is not a requirement in Linux 2008-11-04 20:23 vfat for example 2008-11-04 20:23 and ramfs 2008-11-04 20:24 the new interface is a little unfamiliar to me, it is broken into two parts 2008-11-04 20:25 iget(5)_locked, and the filesystem is actually called when the inode is unlocked 2008-11-04 20:25 as part of the unlock 2008-11-04 20:25 the two functions are almost identical 2008-11-04 20:26 iget5 takes a generic test function to be used in the hash search 2008-11-04 20:27 next stop is unlock_new_inode 2008-11-04 20:28 so basically more OO implemented in C... 2008-11-04 20:28 ersatz oo 2008-11-04 20:28 http://lxr.linux.no/linux+v2.6.27/fs/inode.c#L576 2008-11-04 20:29 I'm assuming the part in #ifdef is a no-op? 
2008-11-04 20:30 it is 2008-11-04 20:30 so I lied 2008-11-04 20:31 just for lockdep 2008-11-04 20:31 call into the filesystem is part of a iget_locked ->iget unlock_new_inode sandwich 2008-11-04 20:31 I don't think we're looking at the func we should be looking at 2008-11-04 20:32 I think new_inode and unlock_new_inode are paired, iget_locked should probably be unlocked elsewhere 2008-11-04 20:33 hmm 2008-11-04 20:33 or maybe I'm seeing things 2008-11-04 20:34 for example 2008-11-04 20:34 http://lxr.linux.no/linux+v2.6.27/fs/ext2/inode.c#L1184 2008-11-04 20:34 ok, I think get(5)_locked can potentially return a new_inode, but not always - only if the initial lookup fails 2008-11-04 20:34 iget is just a library function, called by the fs 2008-11-04 20:35 unlock_new_inode would appear to be a misnomer 2008-11-04 20:35 it always clears the I_NEW flag, which is perhaps the reason its named that way 2008-11-04 20:35 it gets called at the very end of ext2_iget 2008-11-04 20:36 I think it requires some conditions that are true for new inodes 2008-11-04 20:36 for the locking to be correct 2008-11-04 20:36 http://lxr.linux.no/linux+v2.6.27/fs/inode.c#L595 2008-11-04 20:39 ok, is it really true that ext2_iget returns the inode locked if is already in the hash and unlocked if it is new? 2008-11-04 20:39 yes I'm wondering about that 2008-11-04 20:39 seems ... weird ... 2008-11-04 20:41 does iget_locked really return a locked inode? 
2008-11-04 20:41 if iget_locked found in cache, it's already unlocked 2008-11-04 20:42 if not found, inode has I_NEW|I_LOCK 2008-11-04 20:42 yes 2008-11-04 20:43 ok, so the inode hash is just a service the vfs provides to a filesystem 2008-11-04 20:44 it'll increase ref count though 2008-11-04 20:44 I think I get it 2008-11-04 20:44 if it's in cache, then it's a valid inode, that others can access 2008-11-04 20:44 if it's not, then we allocate a new one 2008-11-04 20:44 and it's invalid, and thus has to be locked, so others don't get junk, until we fill it in and unlock it 2008-11-04 20:44 yes 2008-11-04 20:45 there is one user of iget* that doesn't treat it as a mere library call 2008-11-04 20:45 which is nfs 2008-11-04 20:46 care to elaborate? 2008-11-04 20:47 well 2008-11-04 20:47 it used to ;) 2008-11-04 20:47 now we have nfs-specific methods to resolve filehandles 2008-11-04 20:47 ah, nfsd 2008-11-04 20:47 moment 2008-11-04 20:48 are we talking about the nfs server or the client? 2008-11-04 20:48 server 2008-11-04 20:48 e.g., ext2_export_ops->ext2_fh_to_dentry 2008-11-04 20:49 so the server is a client of whatever filesystem it's exporting... it shouldn't be mucking around in that filesystems innards at all? 2008-11-04 20:49 right, the filessystem has to provide the export_operations interface and a couple other things 2008-11-04 20:50 and this makes iget* a proper library call 2008-11-04 20:50 not ever called by vfs 2008-11-04 20:51 the vfs way of getting an inode is to resolve a path 2008-11-04 20:51 so the iget* family of functions are just a library implementation of a inode cache? 2008-11-04 20:51 eventually calling ->lookup 2008-11-04 20:51 which filesystems may or may not use 2008-11-04 20:51 yes 2008-11-04 20:51 as far as I know, all inode using filesystems use it 2008-11-04 20:52 fatfs doesn't use it 2008-11-04 20:52 not inode - fs 2008-11-04 20:53 iget*_locked? 
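[editor's sketch] The lookup / fill-in / unlock sandwich worked out above can be shown with a single-threaded userspace model of an inode cache. It is a sketch, not the kernel code: `demo_inode`, `demo_iget_locked`, `demo_unlock_new_inode` and `demo_fs_iget` are hypothetical names, a fixed array replaces the hash, and the NEW flag only marks the "not filled in yet" state since there is no real lock and no second thread here.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_I_NEW 1u /* inode was just allocated and is not filled in yet */
#define DEMO_CACHE_SLOTS 8

struct demo_inode {
    unsigned long ino;
    unsigned state;
    unsigned count; /* reference count */
    unsigned mode;  /* example attribute loaded from "disk" */
    int in_use;
};

struct demo_inode demo_cache[DEMO_CACHE_SLOTS];

/* Library side: look up by inode number. A hit returns the valid inode
 * with its reference count raised; a miss returns a fresh inode flagged
 * NEW, which the caller must fill in before anyone else may use it. */
struct demo_inode *demo_iget_locked(unsigned long ino)
{
    struct demo_inode *free_slot = NULL;
    for (size_t i = 0; i < DEMO_CACHE_SLOTS; i++) {
        struct demo_inode *inode = &demo_cache[i];
        if (inode->in_use && inode->ino == ino) {
            inode->count++;
            return inode;
        }
        if (!inode->in_use && !free_slot)
            free_slot = inode;
    }
    if (!free_slot)
        return NULL; /* cache full; a real cache would evict */
    free_slot->in_use = 1;
    free_slot->ino = ino;
    free_slot->state = DEMO_I_NEW;
    free_slot->count = 1;
    free_slot->mode = 0;
    return free_slot;
}

/* Clears the NEW flag; only meaningful for an inode that still has it. */
void demo_unlock_new_inode(struct demo_inode *inode)
{
    assert(inode->state & DEMO_I_NEW);
    inode->state &= ~DEMO_I_NEW;
}

/* Filesystem side: the sandwich. */
struct demo_inode *demo_fs_iget(unsigned long ino)
{
    struct demo_inode *inode = demo_iget_locked(ino);
    if (!inode)
        return NULL;
    if (!(inode->state & DEMO_I_NEW))
        return inode; /* cache hit: already valid */
    inode->mode = 0100644; /* stand-in for decoding the on-disk inode */
    demo_unlock_new_inode(inode);
    return inode;
}
```

Either way the caller of `demo_fs_iget` ends up with a filled-in inode that no longer carries the NEW flag, matching the conclusion reached in the chat.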
2008-11-04 20:53 yes 2008-11-04 20:54 I wonder how far back the _locked variant goes 2008-11-04 20:54 I first knew these as iget() and iget4() 2008-11-04 20:55 what ext2 actually does between the iget and the unlock is important, but not that interesting 2008-11-04 20:55 fill in the cached inode 2008-11-04 20:56 by finding a block in the buffer cache, or reading it in if it's not there 2008-11-04 20:56 basically on-disk format to in-memory inode-cache conversion 2008-11-04 20:56 yes 2008-11-04 20:57 we have ext2_update_inode that writes out a changed inode 2008-11-04 20:57 create a new inode in cache, delete an inode, and those are the main inode operations 2008-11-04 20:58 now if I may, I'll talk about something specific to tux3 2008-11-04 20:59 I found that I have one big interaction between the caching layer and the disk update layer that I initially overlooked 2008-11-04 21:00 go on ... 2008-11-04 21:00 writing to a file is very nicely decoupled, all the inode can go into the page cache 2008-11-04 21:00 and then later, changes can be made to the on disk structures 2008-11-04 21:01 and the ondisk inode updated to reflect that, with pointers to the new data and updated attributes 2008-11-04 21:01 when we create a file, the change to the dirent block only needs to happen in cache 2008-11-04 21:02 but we need to have a inode number to make the directory link 2008-11-04 21:02 that requires accessing the on-disk filesystems 2008-11-04 21:02 looking for a free inode 2008-11-04 21:02 and updating the inode table block so that the same inode is not allocated again 2008-11-04 21:02 well theoretically inodes could just be the number of the operation on the fs 2008-11-04 21:03 you have to remember that number somehow 2008-11-04 21:03 because they have to be persistent 2008-11-04 21:03 superblock? 
2008-11-04 21:03 the inode number is forever, at least if you are supporting nfs on your filesystem 2008-11-04 21:04 the superblock stores the last # we allocated, playing back the log may increase that 2008-11-04 21:04 allocating an inode, involves taking the number, increasing it, and logging the increase 2008-11-04 21:04 and we use 64bits or something, so we don't have to worry about running out 2008-11-04 21:05 works if the value of the inode number doesn't matter 2008-11-04 21:05 why should it matteR? 2008-11-04 21:05 do we have size limits? 2008-11-04 21:05 usually it does matter, because you need to store related fs objects near each other 2008-11-04 21:06 in pracice, because more than one inode is stored on a block 2008-11-04 21:06 related in what sense? 2008-11-04 21:06 by being in the same directory for example 2008-11-04 21:06 can't you just use a hash tree for inode lookups though? 2008-11-04 21:07 meant b-tree ;-) 2008-11-04 21:07 hash keyed btree? 2008-11-04 21:07 no, hash was just a typo 2008-11-04 21:08 meant a b-tree indexed by inode # 2008-11-04 21:08 you could 2008-11-04 21:08 but physical proximity is pretty important 2008-11-04 21:08 you'll have physical proximity for files created close to each other timewise 2008-11-04 21:08 ACTION waits for Maze to suggest adding another layer of indirection 2008-11-04 21:09 not so, for files which are being updated etc 2008-11-04 21:09 ACTION thinks about the days of flash drives and physical location no longer mattering... 
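[editor's sketch] The never-reused inode number idea proposed above (superblock holds the last number allocated, each allocation is logged, log replay can only raise the mark) might look roughly like this in userspace. It is a sketch under assumptions, not tux3's design: `demo_fs`, `demo_mount`, `demo_alloc_ino` and `demo_checkpoint` are hypothetical names, the log is a small in-memory array, and the 64-bit counter is assumed never to wrap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_LOG_MAX 16

struct demo_fs {
    uint64_t sb_last_ino;       /* value stored in the "on-disk" superblock */
    uint64_t log[DEMO_LOG_MAX]; /* allocations logged since the superblock was written */
    size_t log_count;
    uint64_t last_ino;          /* in-memory high-water mark */
};

/* Mount: start from the superblock, then replay the log, which can only
 * raise the high-water mark. */
void demo_mount(struct demo_fs *fs)
{
    fs->last_ino = fs->sb_last_ino;
    for (size_t i = 0; i < fs->log_count; i++)
        if (fs->log[i] > fs->last_ino)
            fs->last_ino = fs->log[i];
}

/* Allocate: take the number, increase it, log the increase. Returns 0 if
 * the log is full and a checkpoint is needed first. No search for a free
 * number is needed because numbers are never reused. */
uint64_t demo_alloc_ino(struct demo_fs *fs)
{
    if (fs->log_count == DEMO_LOG_MAX)
        return 0;
    uint64_t ino = fs->last_ino + 1;
    assert(ino > fs->last_ino); /* the counter is assumed never to wrap */
    fs->log[fs->log_count++] = ino;
    fs->last_ino = ino;
    return ino;
}

/* Checkpoint: fold the logged allocations into the superblock. */
void demo_checkpoint(struct demo_fs *fs)
{
    fs->sb_last_ino = fs->last_ino;
    fs->log_count = 0;
}
```

The trade-off raised in the conversation still applies: numbers handed out this way reflect creation order, not directory membership, so they carry no physical locality of their own.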
2008-11-04 21:09 even for flash it matters, if you want to pack more than one inode onto a block 2008-11-04 21:10 yes, but not as much 2008-11-04 21:10 by a huge margin 2008-11-04 21:10 true 2008-11-04 21:10 but still enough to care I think 2008-11-04 21:10 and single inodes are probably going to be pretty beefy 2008-11-04 21:11 filesystems that have taken the simplifying assumption of making inodes block-granular have paid for it in performance 2008-11-04 21:11 ocfs2 is a good example 2008-11-04 21:11 they can't be block granular, that much is obvious 2008-11-04 21:11 besides 2008-11-04 21:11 with disks 2008-11-04 21:11 it's not just a matter of having many within one disk block 2008-11-04 21:12 sector granular and maybe you have a deal ;) 2008-11-04 21:12 but having them close to each other (say within the same 64k) also makes a big deal (readahead etc) 2008-11-04 21:12 you should still probably be capable of fitting more than 1 (maybe 2?) inodes in a sector 2008-11-04 21:13 although really depends 2008-11-04 21:13 how many can be packed on a block has a big effect on cache performance, mass delete is a good way to stress that 2008-11-04 21:13 performance of which cache? 2008-11-04 21:13 disk cache? 2008-11-04 21:13 page cache? 2008-11-04 21:13 bugfer cache? 2008-11-04 21:13 inode table block cache -> buffer cache 2008-11-04 21:14 the answer is "yes" to all three 2008-11-04 21:14 clearly - the bug cache is most important ;-) 2008-11-04 21:15 anyway, the question is similar to "why don't we just allocate data blocks sequentially" 2008-11-04 21:15 instead of going to the bother of having bitmaps etc 2008-11-04 21:15 yup 2008-11-04 21:15 well, no, not quite 2008-11-04 21:15 for one thing, you need to be able to find freed inodes 2008-11-04 21:15 because you've got deletes - that's why you need the bitmap 2008-11-04 21:16 both for blocks and inodes 2008-11-04 21:16 data and inodes I meant 2008-11-04 21:16 well... 
2008-11-04 21:16 you need to be able to reuse blocks 2008-11-04 21:16 you don't need to be able to reuse inodes 2008-11-04 21:16 and inodes 2008-11-04 21:16 ah, ok 2008-11-04 21:16 not reusing inodes probably even solves some problems (nfs) 2008-11-04 21:17 so you make the inode number extra big and never wrap 2008-11-04 21:17 right 2008-11-04 21:17 store them in a hash on disk 2008-11-04 21:17 well, some structure 2008-11-04 21:17 ok, a btree 2008-11-04 21:17 and the inode attributes... right there in the btree? 2008-11-04 21:18 yup 2008-11-04 21:18 and always add new inodes on the right of the btree 2008-11-04 21:18 yup 2008-11-04 21:19 another solution would be to have a pool of pre-allocated inodes that you can pull from if you need to create an inode 2008-11-04 21:19 but that would not have good physical locality either 2008-11-04 21:20 anyway, this was all by way of trying to avoid having to store a new inode number in the inode table block at file create time, right? 2008-11-04 21:21 well, you don't have to store it per say 2008-11-04 21:21 only log the intent 2008-11-04 21:21 I thought the problem was not so much the need to store it 2008-11-04 21:21 but the need to find a free inode number 2008-11-04 21:21 that's correct 2008-11-04 21:22 well, this gets rid of the need to find a free inode number 2008-11-04 21:22 and lets the fs pick the inode number before creating a dirent 2008-11-04 21:23 yup 2008-11-04 21:23 I'll ponder that 2008-11-04 21:23 in the mean time, let me go on about the messy interaction we get otherwise 2008-11-04 21:24 so, _assuming_ we have to look at the inode table structure to choose an inode at file create time 2008-11-04 21:24 [[this can also be extended lock-less to smp or even multi-kernel fs]] 2008-11-04 21:25 right, so a couple of solid advantages 2008-11-04 21:26 still, the notion of endless inode numbers is a little uncomfortable 2008-11-04 21:26 maybe it's just me 2008-11-04 21:26 why? 
2008-11-04 21:27 don't know 2008-11-04 21:27 hmm 2008-11-04 21:27 maybe I'm comfortable with this because it lies at the core of my approach to netfs and solving the 2008-11-04 21:27 'make it stateless but sane with caching and reboots' problem 2008-11-04 21:28 I doubt there is any such thing as stateless and sane 2008-11-04 21:28 (two core ideas, not reuse anything that can be not reused, i.e. inodes, second idea, make every operation either a read or reversible) 2008-11-04 21:30 (ie. a delete has to have all the information needed to run it in reverse as a create - although this doesn't require file content) 2008-11-04 21:30 anyway, what happens is: vfs creates dentry for new inode; calls fs; fs allocates inum, stores initial attributes in inode table block; write data to page cache 2008-11-04 21:30 repeat a bunch of times, then flush the cache to stable storage 2008-11-04 21:31 flushing requires assigning blocks to the cached data and making the inode reference those blocks 2008-11-04 21:31 meanwhile... another file create happens 2008-11-04 21:32 the file create has to wait for the flush to complete, or it will change a block that has to be written to disk as part of a consistent fs image 2008-11-04 21:33 alternatively you could treat files/inodes with 1 hardlink to be directly part of the directory file, and only with the instance of a second hardlink would it become promoted into a true inode, although the number wouldn't change, but it would deal with locality on a dir-level 2008-11-04 21:34 ok, there's your problem 2008-11-04 21:34 Can't that be solved with copy-on-write techniques? 
2008-11-04 21:35 not quite 2008-11-04 21:35 hmm 2008-11-04 21:35 we can "fork" a new inode table block to accommodate the create, but then the flush might change the original version 2008-11-04 21:35 I was actually thinking we never overwrite old blocks, always write out new ones, and only reuse old ones once they're fully freed 2008-11-04 21:36 (fully freed - not referenced from anywhere in the superblock + forward log chain) 2008-11-04 21:36 you have to distinguish between disk and cache when you say overwrite 2008-11-04 21:37 true 2008-11-04 21:37 well pages in flight to disk have to be treated as (or be) locked 2008-11-04 21:38 except the page isn't in flight yet, we're just setting up the cache to be transferred to disk 2008-11-04 21:38 yeah, that's why copy-on-write update semantics are so nice ;-) 2008-11-04 21:39 and don't work in this case 2008-11-04 21:39 because the flush changes the original copy, while the forked copy is also changed 2008-11-04 21:39 with no obvious way to merge the two sets of changes 2008-11-04 21:40 I think I see 2008-11-04 21:40 I think I need to write an email 2008-11-04 21:40 what's up with the get-together?
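A minimal sketch of why the cache "fork" discussed above does not help here, assuming a hypothetical in-memory block type. The names `struct cblock`, `block_fork` and the generation counter are illustrative only and are not tux3 code. A create forks the inode table block, the flush that is still being set up then updates the original, and each copy ends up holding a change the other lacks:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 64

/* Hypothetical cached block, tagged with the flush generation it belongs to. */
struct cblock {
    unsigned gen;
    unsigned char data[BLOCK_SIZE];
};

/*
 * Return a copy of the block that may be modified in generation `cur`.
 * A block that belongs to an older generation is copied ("forked") so the
 * older version stays untouched for writeout. Returns NULL if out of memory.
 */
static struct cblock *block_fork(struct cblock *b, unsigned cur)
{
    assert(b != NULL);
    if (b->gen == cur)
        return b; /* already private to the current generation */
    struct cblock *copy = malloc(sizeof *copy);
    if (!copy)
        return NULL;
    memcpy(copy->data, b->data, BLOCK_SIZE);
    copy->gen = cur;
    return copy;
}

/* Returns 1 when the two copies have diverged, -1 on allocation failure. */
static int fork_diverges(void)
{
    struct cblock orig = { .gen = 1 };
    struct cblock *copy = block_fork(&orig, 2);
    if (!copy)
        return -1;
    copy->data[0] = 0xC; /* the create: new attributes go into the fork */
    orig.data[1] = 0xF;  /* the late flush: a pointer lands in the original */
    int diverged = copy->data[0] != orig.data[0] &&
                   copy->data[1] != orig.data[1];
    free(copy);
    return diverged;
}
```

In this toy model neither copy is a superset of the other after the two updates, which is the merge problem described in the conversation.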
2008-11-04 21:41 seems unlikely, I just spent the weekend out with stomach flu 2008-11-04 21:41 still pretty shaky 2008-11-04 21:42 in that case let's put it off 2008-11-04 21:43 I kind of need to know by wed if I'm going somewhere on thu ;-) 2008-11-04 21:43 wisest course of action 2008-11-04 21:52 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-04 22:11 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-04 22:46 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-05 03:51 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-11-05 06:47 -!- pgquiles(~pgquiles@228.Red-81-35-100.dynamicIP.rima-tde.net) has joined #tux3 2008-11-05 07:02 -!- mingming(~mingming@c-24-22-117-202.hsd1.or.comcast.net) has joined #tux3 2008-11-05 07:59 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-05 08:20 -!- stargazr5(~gauravstt@59.95.34.238) has joined #tux3 2008-11-05 08:53 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-05 09:25 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-05 10:17 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-05 11:56 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-05 14:28 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-05 14:57 -!- MaZe1(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-05 15:52 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-05 17:54 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-05 18:25 maze, ping? 2008-11-05 18:25 pong 2008-11-05 18:26 re yesterday's proposition... 2008-11-05 18:26 that inode numbers can just be allocated sequentially 2008-11-05 18:26 the reasoning against goes like this... 
2008-11-05 18:27 1) you want to store some attributes at the location indexed by the inum 2008-11-05 18:27 2) you want to store some file data near the inode attributes 2008-11-05 18:28 3) sequential allocation is only one possible access pattern, others have to be supported 2008-11-05 18:28 file data? as in file contents? 2008-11-05 18:28 yes 2008-11-05 18:28 needs physical proximity to the inode attributes 2008-11-05 18:29 for tiny files, ie. symlinks... 2008-11-05 18:32 I'm not sure why the above matters? 2008-11-05 18:32 4) on update pattern that needs to be supported is "add a new file near some existing files" which sequential inum allocation cannot do without violating (1) or (2) 2008-11-05 18:34 s/on/one/ 2008-11-05 18:38 hmm, maybe, I'm not convinced that it can't be solved by preallocating inodes at the directory level 2008-11-05 18:39 (ie. directories have pools of preallocated unused inode numbers) 2008-11-05 18:39 that's inode number allocation thinly disguised 2008-11-05 18:40 ...true... 2008-11-05 19:12 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-05 19:12 hi tim_dimm 2008-11-05 19:13 just ponged u 2008-11-05 19:13 just got it now 2008-11-05 19:26 -!- ajonat(~ajonat@190.48.125.59) has joined #tux3 2008-11-05 19:33 hmm, I need a term other than flush that means "force allocation of disk data to back dirty data cache" 2008-11-05 19:33 because "flush" also means "write it out", usually 2008-11-05 19:35 fallac? 2008-11-05 19:35 woo catchy 2008-11-05 19:35 falloc_dd 2008-11-05 19:35 woosh?
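The two allocation policies argued over above can be sketched side by side. This is only an illustration under stated assumptions: `seq_alloc`, `goal_alloc` and the 64-entry bitmap are hypothetical names and sizes, not anything from tux3. The sequential allocator never reuses or searches, while the goal-directed one can place a new inode number next to an existing one, which is point (4) of the argument:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sequential allocator: numbers only grow and are never reused. */
struct seq_alloc {
    uint64_t next;
};

static uint64_t seq_alloc_inum(struct seq_alloc *a)
{
    assert(a->next != UINT64_MAX); /* "extra big and never wrap" */
    return a->next++;
}

/* Hypothetical goal-directed allocator over a tiny 64-entry bitmap. */
#define INUM_COUNT 64u

struct goal_alloc {
    uint64_t used; /* bit i set means inode number i is in use */
};

/* Take the first free number at or after `goal`, wrapping around once. */
static int goal_alloc_inum(struct goal_alloc *a, unsigned goal)
{
    for (unsigned d = 0; d < INUM_COUNT; d++) {
        unsigned i = (goal + d) % INUM_COUNT;
        if (!(a->used & (UINT64_C(1) << i))) {
            a->used |= UINT64_C(1) << i;
            return (int)i;
        }
    }
    return -1; /* table full */
}

static void goal_free_inum(struct goal_alloc *a, unsigned i)
{
    assert(i < INUM_COUNT);
    a->used &= ~(UINT64_C(1) << i);
}
```

The goal-directed version pays for locality with a search and a free map, which is the cost the sequential proposal was trying to avoid.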
2008-11-05 21:54 folks 2008-11-06 01:05 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-11-06 01:47 -!- pgquiles(~pgquiles@228.Red-81-35-100.dynamicIP.rima-tde.net) has joined #tux3 2008-11-06 02:13 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-06 08:42 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-06 09:08 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-06 11:21 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-06 12:51 -!- pgquiles(~pgquiles@228.Red-81-35-100.dynamicIP.rima-tde.net) has joined #tux3 2008-11-06 13:21 -!- ajonat(~ajonat@190.48.97.229) has joined #tux3 2008-11-06 14:06 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-06 15:22 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-06 15:45 -!- pgquiles__(~pgquiles@228.Red-81-35-100.dynamicIP.rima-tde.net) has joined #tux3 2008-11-06 15:50 hmm, I just hatched the bright idea of changing my "phase" terminology to "delta" 2008-11-06 15:50 so Tux3 syncs to disk by generating a series of deltas from one atomic state to another 2008-11-06 15:50 ah, this feels right 2008-11-06 15:52 hi 2008-11-06 15:52 sounds good 2008-11-06 15:54 thanks for the vote :-) 2008-11-06 15:54 it also gets me closer to giving the new atomic sync algorithm a name 2008-11-06 15:54 tentatively "delta sync" 2008-11-06 15:55 how difference with normal sync? 2008-11-06 15:56 normal sync provided by the vfs, or as implemented by various filesystems? 2008-11-06 15:56 the semantics of sync are defined by Posix 2008-11-06 15:57 so delta sync is what does it? 2008-11-06 15:57 atomicity is not a requirement of posix 2008-11-06 15:57 ah, i see 2008-11-06 15:57 a delta transitions from one consistent state of a filesystem to the next 2008-11-06 15:58 is there any progress to implement? 
2008-11-06 15:58 I'd like to do something 2008-11-06 15:58 I just solved a big design issue I think, and am describing the solution 2008-11-06 15:59 I also need to check in my timestamp code 2008-11-06 15:59 later tonight 2008-11-06 15:59 so... serious work in implementing atomic sync starts about now 2008-11-06 15:59 will it post to tux3-ml? 2008-11-06 15:59 yes 2008-11-06 15:59 let me talk about it now for a bit 2008-11-06 15:59 helps me write 2008-11-06 16:00 yes 2008-11-06 16:00 the issue is: I had assumed there is a nice clean separation between the cache that each filesystem operation updates, and what has to be transferred to disk for a delta 2008-11-06 16:01 and that all I needed to make the separation perfect was the "fork" operation 2008-11-06 16:01 that does a copy on write of a block that a filesystem operation wants to update, if the block is part of a delta not yet transferred to disk 2008-11-06 16:02 this turned out not to work with namespace operations such as file create, delete, rename 2008-11-06 16:02 the difficulty is with inode table blocks 2008-11-06 16:03 why is it difficult? 2008-11-06 16:03 the "top end" filesystem operation for a create needs to change two blocks: a dirent block and an inode table block 2008-11-06 16:04 yes 2008-11-06 16:04 the "flush" operation at a delta transition (used to be called phase transition) also changes the inode table block 2008-11-06 16:05 the question is: what happens if another file create comes in that wants to change the same inode table block? 2008-11-06 16:05 we can't just "fork" the inode table block in cache 2008-11-06 16:05 um.. 
2008-11-06 16:05 because the "flush" hasn't completed yet, it may not have stored pointers to the new data extents in the inode table block yet 2008-11-06 16:06 so therefore the file create needs to wait for any flush in progress to complete 2008-11-06 16:06 not to be transferred to disk, but to transfer all necessary information to the inode table block in cache 2008-11-06 16:08 there are two things I don't like about that: 1) it's extra complexity to implement that synchronization between top end namespace operations and the bottom end flush 2) it causes a "bump" while waiting for the flush step to complete 2008-11-06 16:08 now, to be fair, I doubt that this bump would be worse than existing filesystems, which wait on all kinds of things 2008-11-06 16:09 but I'd still rather get rid of it 2008-11-06 16:09 so what I'm proposing is to defer namespace operations in much the same way as we defer write allocations 2008-11-06 16:10 we just let the namespace operation be recorded in cache, as it already is in the stub tux3 kernel code 2008-11-06 16:10 wait... I left out an important detail 2008-11-06 16:11 what is it? 2008-11-06 16:11 we have to select an inode number before making a new entry in a dirent block 2008-11-06 16:12 well, we could possibly make the entry without an inode number, then patch in the inode number later, but that would be pretty messy 2008-11-06 16:12 yes 2008-11-06 16:13 so now, I will propose to just let a file create, create the file in the dentry cache, and the sys_open will just check to make sure the name does not already exist and return as soon as that is done 2008-11-06 16:14 no change to any inode table block 2008-11-06 16:14 what was stored to dentry->d_inode? 
2008-11-06 16:14 pointer to an inode as usual 2008-11-06 16:15 the inode does not have to have an inode number, as you can see from the fact that ramfs doesn't need it 2008-11-06 16:15 in the case of nfs, an inode number is required in order to resolve a filehandle 2008-11-06 16:16 actually unique number, not inode number? 2008-11-06 16:16 it has to be globally stable across reboots 2008-11-06 16:16 ah, yes 2008-11-06 16:16 so the only convenient way to get that is make it be the inode number 2008-11-06 16:17 anyway, any nfs filehandle operation has to wait for the "flush" to be completed before it can resolve the file handle 2008-11-06 16:17 but that is not a problem 2008-11-06 16:17 because nfs already has to wait for a sync when it creates a new file 2008-11-06 16:18 I'm not sure about "async" option, however maybe yes 2008-11-06 16:18 ok, so the way tux3 remembers what file it is supposed to create is to keep a pointer to the dentry 2008-11-06 16:19 indeed, async is an interesting question and I considered it... but don't immediately remember the answer except that I thought it would work fine ;) 2008-11-06 16:20 well, I'd like to think this today more 2008-11-06 16:20 yes, needs a lot of thought 2008-11-06 16:20 I have given it a lot of thought over the last week 2008-11-06 16:21 current issue is only this?
2008-11-06 16:21 yes 2008-11-06 16:21 great 2008-11-06 16:23 sys_open(CREATE) needs to do three things: check the name doesn't already exist in a dirent block; remember the dentry for later dirent creation; be sure that the later dirent creation will succeed 2008-11-06 16:23 if sys_open(CREATE) is deferred, then sys_unlink must also be deferred 2008-11-06 16:25 ah, I misspoke above, sys_open doesn't have to remember the dentry, but the inode 2008-11-06 16:25 very slight distinction, easier to implement 2008-11-06 16:25 hrm, no I was right the first time 2008-11-06 16:25 has to remember the dentry 2008-11-06 16:26 because it needs to know the new name linking to the inode 2008-11-06 16:26 sys_unlink similarly remembers the dentry 2008-11-06 16:26 what happen to readdir? 2008-11-06 16:26 oh yes 2008-11-06 16:26 thanks for reminding me 2008-11-06 16:27 looks like readdir is most complex 2008-11-06 16:27 we have to flush any pending creates and deletes before starting the readdir 2008-11-06 16:27 it's not a big problem I think 2008-11-06 16:28 just a "flush" (really really need better terminology) at the beginning of the readdir 2008-11-06 16:28 I think the result of doing this namespace deferring will be pretty nice 2008-11-06 16:29 eh, "flush" is not "write out"? 2008-11-06 16:29 no, it is setting up the blocks for writeout 2008-11-06 16:29 ah, it's like ->write_inode?
2008-11-06 16:30 assignment of physical extent locations, move of cached attributes into inode table blocks, and now adding cached namespace operations to dirent blocks and inode table blocks 2008-11-06 16:30 not even like a write_inode 2008-11-06 16:30 it moves things from one place to another in cache 2008-11-06 16:31 for any given block, it has to be completely set up in cache before we submit it 2008-11-06 16:32 i see 2008-11-06 16:33 I will go for my skate, and think up a new name for "flush" now 2008-11-06 16:33 thanks for following along with this, very accurately as usual 2008-11-06 16:33 as for the entry without the inode number, is it possible to handle like you do with forward logging? 2008-11-06 16:34 tim_dimm, except that we _must_ assign an inode number before we can deal with an nfs handle 2008-11-06 16:34 a solution I considered, to be sure 2008-11-06 16:34 so pre-assign it 2008-11-06 16:35 have the next inode number always ready 2008-11-06 16:35 tim_dimm, yes 2008-11-06 16:35 that works 2008-11-06 16:35 except what I'm proposing will be nicer 2008-11-06 16:36 the inode number gets assigned quite quickly, which will be nice for nfs 2008-11-06 16:36 and the top end filesystem operation returns to user very quickly, because the only real work it has to do is check to see the file doesn't exist in the case of create 2008-11-06 16:38 ok, I better get out for my skate 2008-11-06 16:38 enjoy 2008-11-06 16:38 gets dark really soon/fast these days 2008-11-06 16:38 see you 2008-11-06 16:38 see you 2008-11-06 18:04 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-06 19:50 -!- RalucaM(~ral@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-06 19:50 hi 2008-11-06 19:59 hi 2008-11-06 20:00 hi 2008-11-06 20:00 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-06 20:00 hirofumi, still here? 
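The deferred create proposed above can be sketched in user space as follows. Everything here is a hypothetical toy: `struct dir`, `deferred_create`, `apply_pending` and the fixed-size arrays stand in for dirent blocks and the dentry cache and are not the tux3 design itself. The create only checks for an existing name, reserves room so the later dirent creation cannot fail, and queues the name; inode numbers are assigned when the pending operations are applied, and a directory read applies them first:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MAX_ENTRIES 16
#define NAME_LEN 32

struct dirent_rec {
    char name[NAME_LEN];
    unsigned long inum;
};

struct dir {
    struct dirent_rec stored[MAX_ENTRIES]; /* stands in for dirent blocks */
    int nstored;
    char pending[MAX_ENTRIES][NAME_LEN];   /* deferred creates, names only */
    int npending;
    unsigned long next_inum;
};

static int name_exists(const struct dir *d, const char *name)
{
    for (int i = 0; i < d->nstored; i++)
        if (strcmp(d->stored[i].name, name) == 0)
            return 1;
    for (int i = 0; i < d->npending; i++)
        if (strcmp(d->pending[i], name) == 0)
            return 1;
    return 0;
}

/* Top end: return the right error code, then just remember the name. */
static int deferred_create(struct dir *d, const char *name)
{
    if (strlen(name) >= NAME_LEN)
        return -ENAMETOOLONG;
    if (name_exists(d, name))
        return -EEXIST;
    if (d->nstored + d->npending >= MAX_ENTRIES)
        return -ENOSPC; /* reserve now so the later step cannot fail */
    strcpy(d->pending[d->npending++], name);
    return 0;
}

/* Bottom end: assign inode numbers and move names into the dirent store. */
static void apply_pending(struct dir *d)
{
    for (int i = 0; i < d->npending; i++) {
        struct dirent_rec *r = &d->stored[d->nstored++];
        strcpy(r->name, d->pending[i]);
        r->inum = d->next_inum++;
    }
    d->npending = 0;
}

/* A directory read applies pending operations first; returns entry count. */
static int dir_read(struct dir *d)
{
    apply_pending(d);
    assert(d->npending == 0);
    return d->nstored;
}
```

Deferred unlink is left out of the sketch; per the discussion it would be queued the same way.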
2008-11-06 20:00 hi 2008-11-06 20:01 oh, right 2008-11-06 20:01 ok, first thing is to wrap up a little bit on tuesday's iget investigation 2008-11-06 20:02 first a confession: I was not functioning very well that day, as you may have noticed... just getting over a pretty severe stomach flu 2008-11-06 20:03 anyway, first thing I want to do is clear up the question of why iget5_ sometimes returns locked, sometimes not 2008-11-06 20:03 it returns locked only in the case where it just added the inode to the hash 2008-11-06 20:03 yes 2008-11-06 20:03 and that is just so that filesystems can proceed to fill in the inode with attributes from backing store 2008-11-06 20:03 that's the conclusion we came to , right? 2008-11-06 20:04 if so, good 2008-11-06 20:04 otherwise, the inode is returned unlocked, but with a reference count on it 2008-11-06 20:04 this is a meme in Linux kernel 2008-11-06 20:05 yes 2008-11-06 20:05 the object can't be operated on safely when it is not locked, but because of the reference count, the caller knows the object will not suddenly disappear 2008-11-06 20:05 so you can't modify it, but you can read it 2008-11-06 20:05 at that point, only two operations can be performed on the object: 1) lock it or 2) drop the reference count 2008-11-06 20:06 you can't read it either 2008-11-06 20:06 because somebody else might be modifying it 2008-11-06 20:06 well, depends on the locking semantics being used 2008-11-06 20:06 yes 2008-11-06 20:06 the situation I described is typical in linux 2008-11-06 20:06 you will find it used for a number of different kinds of objects 2008-11-06 20:07 you might say, there are not a lot of alternatives to this 2008-11-06 20:07 care to give an alternative example? 2008-11-06 20:08 an alternative to the refcount + lock object strategy?
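The refcount plus lock pattern described above can be sketched as a single-threaded toy, loosely modelled on the iget5_locked / unlock_new_inode / iput convention in the kernel. The names `struct obj`, `obj_get_locked`, `obj_unlock_new` and `obj_put` are hypothetical, and a real implementation would need actual locking around the table. A lookup always returns with a reference held, and returns the object locked only when it was just added, so the caller can fill it in:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TABLE_SIZE 8

struct obj {
    unsigned long key;
    int refcount;
    bool locked; /* stand-in for a real lock or "new" state bit */
    bool in_use;
};

static struct obj table[TABLE_SIZE];

/*
 * Look up `key`. An existing object comes back unlocked with its refcount
 * raised. A newly added object comes back locked with refcount 1 and
 * *is_new set, so the caller can initialize it before unlocking.
 */
static struct obj *obj_get_locked(unsigned long key, bool *is_new)
{
    struct obj *free_slot = NULL;
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (table[i].in_use && table[i].key == key) {
            table[i].refcount++;
            *is_new = false;
            return &table[i];
        }
        if (!table[i].in_use && !free_slot)
            free_slot = &table[i];
    }
    if (!free_slot)
        return NULL; /* table full */
    free_slot->key = key;
    free_slot->refcount = 1;
    free_slot->locked = true;
    free_slot->in_use = true;
    *is_new = true;
    return free_slot;
}

/* Caller has finished filling in a new object. */
static void obj_unlock_new(struct obj *o)
{
    assert(o->locked);
    o->locked = false;
}

/* Drop a reference; the slot becomes reusable when the last one goes. */
static void obj_put(struct obj *o)
{
    assert(o->refcount > 0);
    if (--o->refcount == 0)
        o->in_use = false;
}
```

Holding only a reference guarantees the object will not vanish, which is all the caller may rely on until it takes the lock.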
2008-11-06 20:08 yes 2008-11-06 20:08 a typical alternative is garbage collection, which makes writing new code easier, but isn't suitable for kernel 2008-11-06 20:09 is less efficient and sometimes has long lags 2008-11-06 20:09 ok, I was thinking something else usable in kernel ;-) 2008-11-06 20:10 the refcount could be done away with, and you have a special lock state instead, where the owner of the lock uses its own knowledge to determine if the object can be discarded 2008-11-06 20:10 anyway, the refcount strategy is pretty general, which tends to allow for unforeseen new applications 2008-11-06 20:10 and I'm going to take a look at one of those right now 2008-11-06 20:11 this is in the area I've been talking about today 2008-11-06 20:11 I'll call it deferred namespace operations for now 2008-11-06 20:12 the proposal is to handle sys_open(..., CREATE) as a deferred operation 2008-11-06 20:12 the file will just be created in dentry cache as with ramfs 2008-11-06 20:13 not actually placed in a dirent block 2008-11-06 20:13 so today I'd like to walk through sys_create and do a reality check on that 2008-11-06 20:13 so, here's a question to begin with: from POSIX semantics, when must such operations actually make it to disk? do they have to be ordered correctly with regard to other operations? 2008-11-06 20:14 answer: only on sync 2008-11-06 20:14 fsync of the fd? 2008-11-06 20:14 sync can be caused in a number of ways 2008-11-06 20:14 fsync, sync command, umount, O_SYNC 2008-11-06 20:15 what about file close? 2008-11-06 20:15 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L1503 <- vfs_create( 2008-11-06 20:15 not file close 2008-11-06 20:16 in fact, it is possible for a file to be created and deleted without ever touching disk 2008-11-06 20:16 and what about stuff like mkdir?
2008-11-06 20:16 ext* can't do that 2008-11-06 20:16 but it's possible 2008-11-06 20:16 also never needs to touch disk if it's immediately unlinked 2008-11-06 20:17 what we _must_ do to satisfy Posix is return the right error codes from sys_open 2008-11-06 20:17 EEXIST in particular 2008-11-06 20:17 so who takes care of the appropriate sync semantics? 2008-11-06 20:17 the vfs? 2008-11-06 20:17 how does it tell us when and what needs to be synced/flushed? 2008-11-06 20:17 vfs only takes care of it for dumb filesystems like Ext2 and VFAT 2008-11-06 20:18 you will see oddities in there like ->assoc_buffers 2008-11-06 20:18 a field in the inode that points at metadata associated with a particular inode, in ext2 those are the dirty index blocks 2008-11-06 20:19 this mechanism is pretty much useless for anything but a filesystem as dumb as ext2 2008-11-06 20:19 there is handler for fsync, iirc 2008-11-06 20:19 yes 2008-11-06 20:19 fsync will go and do a series of steps that all filesystems need 2008-11-06 20:20 sometimes more than some filesystems need 2008-11-06 20:20 well 2008-11-06 20:20 sync_super and functions like that 2008-11-06 20:20 we can go look there now instead of rename if you like 2008-11-06 20:20 let's take a detour 2008-11-06 20:21 I'd like to understand how this affects other operations - how do I cause a mkdir to get flushed to disk? 2008-11-06 20:21 must I sync the entire fs? 
2008-11-06 20:21 ACTION oops, I have to reboot 2008-11-06 20:21 fsync on the parent directory 2008-11-06 20:22 http://lxr.linux.no/linux+v2.6.27/fs/super.c#L465 ->sync_fs is a per-filesystem method 2008-11-06 20:23 a look, this is interesting 2008-11-06 20:24 249 void __fsync_super(struct super_block *sb) 2008-11-06 20:24 258 sb->s_op->sync_fs(sb, 1); 2008-11-06 20:24 so you can sync the whole filesystem by calling fsync on the superblock 2008-11-06 20:25 I wonder if that's exported to userspace somehow 2008-11-06 20:26 __fsync_super starts off by writing out the superblock 2008-11-06 20:26 not intuitive 2008-11-06 20:26 you'd expect that to be the last thing it does 2008-11-06 20:26 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-11-06 20:27 I don't think at the end you're guaranteed to have a no-dirty-buffers in memory situation 2008-11-06 20:27 no 2008-11-06 20:27 you're only guaranteed that everything that was dirty at the time of the call gets written 2008-11-06 20:27 thanks 2008-11-06 20:28 thanks 2008-11-06 20:28 brb 2008-11-06 20:28 246 * device. Takes the superblock lock.
Requires a second blkdev 2008-11-06 20:28 247 * flush by the caller to complete the operation. 2008-11-06 20:28 syncing has always been pretty messed up in Linux 2008-11-06 20:29 it's not obvious why two blkdev flushes should be required 2008-11-06 20:29 http://lxr.linux.no/linux+v2.6.27/fs/super.c#L249 2008-11-06 20:31 ok, let's determine if __fsync_super is a filesystem library call or whether it is called directly from vfs 2008-11-06 20:32 sigh, it's too bad I can't trust lxr to return all the uses, I wonder what happened to it 2008-11-06 20:33 surprisingly, __fsync_super and fsync_super are hardly used at all 2008-11-06 20:33 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-06 20:34 http://lxr.linux.no/linux+v2.6.27/fs/sync.c#L24? 2008-11-06 20:34 yes, the main use 2008-11-06 20:35 this particular nest of functions always leaves me with a headache 2008-11-06 20:37 ah, in fact it isn't necessary for fsync to write the superblock last the way an atomic committing filesystem has to 2008-11-06 20:37 because the only thing that is promised is that everything dirty is written 2008-11-06 20:37 not that the result is consistent 2008-11-06 20:38 well 2008-11-06 20:38 that's not quite the right statement 2008-11-06 20:39 it's not promised that the filesystem is clean 2008-11-06 20:40 only if it does not have journal or something?
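From user space, the earlier answer about getting a mkdir onto disk (fsync on the parent directory) can be sketched with standard POSIX calls. `mkdir_durable` is a hypothetical helper name, and whether the fsync really makes the new entry durable depends on the filesystem, so treat this as an illustration of the idea rather than a guarantee:

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create directory `name` inside `parent`, then fsync the parent directory
 * so the new directory entry is pushed toward stable storage.
 * Returns 0 on success, -1 on failure with errno set by the failing call.
 */
static int mkdir_durable(const char *parent, const char *name)
{
    assert(parent != NULL && name != NULL);

    int dirfd = open(parent, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;

    int err = mkdirat(dirfd, name, 0755);
    if (err == 0)
        err = fsync(dirfd); /* sync the directory that holds the new entry */

    close(dirfd);
    return err;
}
```

Only the parent directory is synced here, which matches the point made in the log that the whole filesystem does not need to be synced for this.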
2008-11-06 20:40 ;-) 2008-11-06 20:41 it's umount that will mark the filesystem clean 2008-11-06 20:52 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-06 20:52 -!- RalucaM(~ral@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-06 20:52 -!- pgquiles__(~pgquiles@228.Red-81-35-100.dynamicIP.rima-tde.net) has joined #tux3 2008-11-06 20:52 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-06 20:52 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-11-06 20:52 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-11-06 20:52 -!- mlankhorst_(~m@fw1.astro.rug.nl) has joined #tux3 2008-11-06 20:52 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-11-06 20:52 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-11-06 20:52 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-11-06 20:52 -!- vcgomes[away](~vcgomes@li17-238.members.linode.com) has joined #tux3 2008-11-06 20:52 -!- flips(~phillips@phunq.net) has joined #tux3 2008-11-06 20:52 262 * The whole writeout design is quite complex and fragile. <- there, somebody agrees with me 2008-11-06 20:53 heh 2008-11-06 20:53 hirofumi, what was the last line you saw? 2008-11-06 20:54 315 * how much sense this makes. Presumably I had a good 2008-11-06 20:54 316 * reasons for doing it this way, and I'd rather not 2008-11-06 20:54 317 * muck with it at present. 2008-11-06 20:54 318 */ 2008-11-06 20:55 which is last line? 2008-11-06 20:56 you split off the network a while ago 2008-11-06 20:56 yes 2008-11-06 20:56 not much to worry about 2008-11-06 20:56 http://lxr.linux.no/linux+v2.6.27/fs/fs-writeback.c#L285 2008-11-06 20:56 (08:41:28 PM) flips: it's umount that will mark the filesystem clean 2008-11-06 20:56 (08:42:03 PM) flips: anyway... 
sync_filesystems first calls a per-filesystem sync_fs method 2008-11-06 20:56 (08:42:20 PM) flips: then goes and syncs each dirty inode 2008-11-06 20:56 (08:42:35 PM) flips: a modern filesystem should do the whole job in its sync_fs method 2008-11-06 20:56 (08:42:49 PM) MaZe: everything that was dirty before the sync will be written out, but the end-state (even assuming no-one else mucks around) can be dirty 2008-11-06 20:56 (08:43:06 PM) flips: so let's take a look and see if the functionality provided by sync_inodes is optional 2008-11-06 20:56 (08:43:49 PM) flips: dirty or even inconsistent in the case of ext2, if there is parallel writing going on 2008-11-06 20:56 (08:44:34 PM) flips: http://lxr.linux.no/linux+v2.6.27/fs/fs-writeback.c#L648 2008-11-06 20:56 (08:44:45 PM) flips: __sync_inodes 2008-11-06 20:56 (08:45:26 PM) flips: if (sb->s_root) <- I wonder what that's for 2008-11-06 20:56 (08:46:04 PM) flips: this function does the blockdev sync on behalf of the filesystem 2008-11-06 20:56 (08:46:20 PM) flips: no choice about that, though I am not sure why a filesystem would want a choice 2008-11-06 20:56 (08:47:47 PM) hirofumi left the room (quit: charon.oftc.net larich.oftc.net). 2008-11-06 20:56 (08:47:47 PM) tux3bot left the room (quit: charon.oftc.net larich.oftc.net). 2008-11-06 20:56 (08:47:47 PM) shapor left the room (quit: charon.oftc.net larich.oftc.net). 2008-11-06 20:56 (08:47:47 PM) ceatinge left the room (quit: charon.oftc.net larich.oftc.net). 2008-11-06 20:56 (08:47:56 PM) flips: http://lxr.linux.no/linux+v2.6.27/fs/fs-writeback.c#L531 sync_sb_inodes 2008-11-06 20:56 (08:48:09 PM) flips: 442void generic_sync_sb_inodes(struct super_block *sb, 2008-11-06 20:56 (08:49:16 PM) flips: is generic_sync_sb_inodes interesting at all? 
2008-11-06 20:56 (08:49:33 PM) flips: I'm mainly interested in determining if there's a way to avoid using it at all 2008-11-06 20:56 (08:50:07 PM) MaZe: lol 2008-11-06 20:56 (08:51:20 PM) flips: 376__writeback_single_inode(struct inode *inode, struct writeback_control *wbc) 2008-11-06 20:56 (08:52:27 PM) flips: 269__sync_single_inode(struct inode *inode, struct writeback_control *wbc) 2008-11-06 20:56 (08:52:53 PM) hirofumi [~hirofumi@210.171.168.39] entered the room. 2008-11-06 20:56 (08:52:53 PM) shapor [~shapor@yzf.shapor.com] entered the room. 2008-11-06 20:56 (08:52:53 PM) tux3bot [~tux3bot@yzf.shapor.com] entered the room. 2008-11-06 20:56 (08:52:53 PM) ceatinge [~ceatinge@veryclever.net] entered the room.
2008-11-06 20:56 for tux3 irc logger 2008-11-06 20:57 since that's apparently the part it missed 2008-11-06 20:57 ah 2008-11-06 20:57 ok, into do_writepages 2008-11-06 20:58 at least we have the option of providing our own there instead of generic_writepages() 2008-11-06 20:58 so that answers my question, sort of 2008-11-06 20:59 "can we avoid this whole massive mess of a half thought through sync mechanism" 2008-11-06 20:59 I'm assuming if we need to override more stuff we can add additional records to the filesystems struct (or elsewhere)? 2008-11-06 20:59 the answer is: we can avoid the real work, which is the writepages 2008-11-06 20:59 filesystems struct? 2008-11-06 20:59 filesystem_operations or whatever it's called 2008-11-06 21:00 I didn't see a lot of overrides as we drilled down 2008-11-06 21:00 was looking for them 2008-11-06 21:00 but I could have missed something 2008-11-06 21:00 let's see what generic_writepages does 2008-11-06 21:01 that's what I meant by 'add' 2008-11-06 21:01 then see whether any journalling filesystems use it 2008-11-06 21:01 as in add-in the possibility for new overrides 2008-11-06 21:01 you mean fix the vfs?
2008-11-06 21:01 sure, can take a long time to convince akpm to take stuff like that 2008-11-06 21:02 if there's a way to make the old crap work, if only by just putting up with the unneeded things it does, that is generally the preferred route 2008-11-06 21:02 well, don't think I meant fix, more like add in one-or-two more hooks(hacks) 2008-11-06 21:02 still is an uphill battle, but one can always try 2008-11-06 21:02 although I guess that also leads to spaghetti code 2008-11-06 21:03 provide a core kernel patch, plus benchmarks that show how it makes your fs run better 2008-11-06 21:03 then get ready for everybody to challenge you to prove it also makes ext3 run better 2008-11-06 21:03 lol 2008-11-06 21:04 I would understand not wanting to make existing stuff run slower... but why require it to make existing stuff run faster? 2008-11-06 21:04 994 /* deal with chardevs and other special file */ <- gross 2008-11-06 21:05 which file? 2008-11-06 21:05 866int write_cache_pages(struct address_space *mapping, <- ok I'm here 2008-11-06 21:05 852 * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
2008-11-06 21:05 just a sec 2008-11-06 21:05 http://lxr.linux.no/linux+v2.6.27/mm/page-writeback.c#L852 2008-11-06 21:06 this calls ->writepage on each dirty page cache page 2008-11-06 21:07 I strongly suspect that we do not want to do things this way in tux3 2008-11-06 21:07 hey flips 2008-11-06 21:07 but we want to do our own page cache walk, like I do it in tux3/user/filemap.c 2008-11-06 21:07 ACTION scans the backlog 2008-11-06 21:08 hope there's enough for you ;) 2008-11-06 21:09 so, a quick scan through generic_writepages shows that most of what it worries about is irrelevant to tux3 2008-11-06 21:09 things like the possibility that the fs is actually tmpfs 2008-11-06 21:09 the write congestion model is also very suspect 2008-11-06 21:10 still, if we had to, we could let this function drive tux3 writeout 2008-11-06 21:10 yeah, it's always good to get a core person talking about stuff with clarity 2008-11-06 21:10 you can learn the most from folks like that 2008-11-06 21:10 basically by not really writing pages inside our tux3 ->writepage 2008-11-06 21:11 bh, to be honest, nobody can talk about this stuff with clarity, except to say it's kind of unnatural 2008-11-06 21:11 but, those will trigger some dela trasision? 2008-11-06 21:12 hirofumi, I was just wondering about that ;) 2008-11-06 21:12 let's see if we can answer that, and wind up for tonight 2008-11-06 21:14 there's two types of syncs: those done for data integrity reasons (someone called fsync) and those done to free memory (clear out dirty pages/buffers) 2008-11-06 21:14 there actually behaviour could potentially be vastly different 2008-11-06 21:14 umm... 
those have to block process to avoid fill all memory with dirty pages 2008-11-06 21:14 their :-( 2008-11-06 21:14 yes 2008-11-06 21:14 maze, vm should not require a big difference so long as some cache is cleaned 2008-11-06 21:15 free memory - write out biggest block of successive dirty pages to one region of disk that you can (potentially many such blobs) 2008-11-06 21:15 data integrity, needs to write out specific data in a specific order, potentially writing out additional data if it's convenient 2008-11-06 21:19 well, it looks like sync_inodes_sb is the last thing __fsync_super does, and generic_writepages is in turn the last thing that happens 2008-11-06 21:19 there is no final ->finish_up_the_sync method 2008-11-06 21:19 which is probably wrong 2008-11-06 21:19 so maze, yes, I think you need to save the world by adding a new method 2008-11-06 21:20 oh way 2008-11-06 21:20 let me see 2008-11-06 21:20 can we wrap this entire thing 2008-11-06 21:20 ah, dont'think so 2008-11-06 21:22 oh wait again, at least in sync_filesystems we can avoid the whole mess 2008-11-06 21:22 with our own ->sync_fs function 2008-11-06 21:23 ext3 provides its own 2008-11-06 21:25 flips: yeah, but you know the code so you have some kind of mental understanding pictured in your head. That's very helpful for folks that need pieces explained 2008-11-06 21:26 do dirty pages/buffers, etc, have some sort of time-of-first-and/or-last-dirtying, so that you can ask for a flush of everything dirtied before time x? 
2008-11-06 21:28 buffers used to be handled that way 2008-11-06 21:29 it wasn't too useful, plus hard to get right 2008-11-06 21:29 what did we call that, everbody used it 2008-11-06 21:30 35 if (unlikely(laptop_mode)) 2008-11-06 21:30 36 laptop_sync_completion(); <- this strange hook does allow for a cleanup at the end of do_sync 2008-11-06 21:31 but I think the conclusion is, do_sync is still pretty awful 2008-11-06 21:31 tries to do the job the same way for everybody 2008-11-06 21:33 I think what tux3 will do, is try to do the entire job inside sync_fs, and then just not have any dirty pages for do_sync to come along and bother about it later 2008-11-06 21:33 heh 2008-11-06 21:33 of course, nothing prevents new pages from being dirtied after the sync_fs 2008-11-06 21:33 but I think they can be ignored 2008-11-06 21:34 because if the ->sync_fs writes out everything dirty _at the time it was called_ then the semantics are satisfied 2008-11-06 21:34 am I correct in assuming, that - provided no other fs writes to us are happening, after tux3 we should have on-disk state and on-memory state be equivalent (although not necessarily equal)? 2008-11-06 21:34 not correct 2008-11-06 21:35 because of the promise/rollup arrangement 2008-11-06 21:35 thus making ->sync_fs be a synchronization point, and not just a write all dirty data 2008-11-06 21:35 well 2008-11-06 21:35 ok, correct 2008-11-06 21:35 didn't mean after tux3, meant after sync_fs 2008-11-06 21:35 ie. 
rip out drive after sync_fs completes, or poweroff machine, and you don't lose any modifications or any other state 2008-11-06 21:36 the filesystem will appear dirty on remount 2008-11-06 21:36 (I think this is stronger, then the current required semantics of sync_fs, since it only requires write-out of dirty stuff, but not off the stuff that that write-out will/may dirty) 2008-11-06 21:36 tux3 will always appear 'dirty' though 2008-11-06 21:37 since there's very little difference between clean and dirty (basically a bit, right?) 2008-11-06 21:37 that is of course a big problem with this generic sync 2008-11-06 21:37 that flushing pages can dirty metadata 2008-11-06 21:37 so the only thing it doesn't do is clear the dirty bit in the superblock (although it could actually do that as well, but then we'd have to dirty it again, so kind of pointless) 2008-11-06 21:38 that's about it 2008-11-06 21:38 [dirty it again, for any further modification to the fs, which are likely to happen] 2008-11-06 21:38 so basically sync_fs and umount should be pretty much equivalent - except one does and the other doesn't clear the fs is dirty superblock bit 2008-11-06 21:39 one wonders if ext3 will avoid a journal replay if shut down immediately after do_sync 2008-11-06 21:39 [not talking about tearing down all the in memory structures of course, which are effectively just caches] 2008-11-06 21:40 32 sync_inodes(wait); /* Mappings, inodes and blockdevs, again. */ <- this comment written without blushing, impressive 2008-11-06 21:41 I think the current sync is an implementation of 'sync;sync;sync' - there everything should be synced 2008-11-06 21:41 geez, we actually sync the filesystem about four times 2008-11-06 21:41 in do_sync 2008-11-06 21:42 this thing has always been deeply messed, and I suppose it is a kind of comfort that it still is ;) 2008-11-06 21:42 maze, I was thinking precisely the same thing 2008-11-06 21:43 ok,. 
family time for me 2008-11-06 21:43 we didn't get to the rename stuff I wanted to do 2008-11-06 21:43 it's a common opinion among sysadmins, that you need to run sync thrice to sync the system ;-) 2008-11-06 21:43 that is for next tuesday now 2008-11-06 21:44 yes, familiar with that one 2008-11-06 21:44 and linux is a new advanced os that does sync sync sync automatically for you 2008-11-06 21:44 indeed 2008-11-06 21:44 so by following the full prescription, you do 9 of those 2008-11-06 21:45 I mean some of this is understandable, since you can have multiple layers of fs/loop/blockdevs/raid/lvm/etc 2008-11-06 21:45 and you should sync bottom up 2008-11-06 21:45 but no-one actually implements the sync that way 2008-11-06 21:45 but it doesn't 2008-11-06 21:45 it syncs top down 2008-11-06 21:45 already deeply suspect 2008-11-06 21:45 I don't think it's top down 2008-11-06 21:45 it's more like random 2008-11-06 21:46 notice that we just sync fs's in linear order 2008-11-06 21:47 I think with the current system to be truly safe you have to run sync on the order of O(# mounted filesystems) + O(# of used block devices) 2008-11-06 21:47 times 2008-11-06 21:47 in parallel actually 2008-11-06 21:47 it's supposed to 2008-11-06 21:47 29 sync_supers(); /* Write the superblocks */ 2008-11-06 21:47 30 sync_filesystems(0); /* Start syncing the filesystems */ 2008-11-06 21:47 31 sync_filesystems(wait); /* Waitingly sync the filesystems */ 2008-11-06 21:47 32 sync_inodes(wait); /* Mappings, inodes and blockdevs, again. */ <- top down 2008-11-06 21:48 that just feels wrong 2008-11-06 21:48 and hope there's no loops in the dependency structure (you can get loops, but you'd have to be crazy to want to: ie. put a file-backed-loop device as a spare member of raid device storing that filesystem) 2008-11-06 21:49 no notion of dependency for that matter 2008-11-06 21:49 if you have a filesystem loopback mounted on a filesystem looback mounted on... 
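The "no notion of dependency" complaint can be made concrete with a toy ordering function. This is a sketch under assumptions: the layer names are made up, and the order it produces is "flush each layer before whatever stores its data", which is one reading of what a dependency-aware sync would need (which end of that order the chat calls "bottom" is left open here). It uses Python's standard `graphlib` module (3.9+).

```python
from graphlib import CycleError, TopologicalSorter


def sync_order(backed_by):
    """Order layers so each one is flushed before the layers that back it.

    backed_by maps a layer name to the list of layers that store its data.
    Raises ValueError if the layering contains a loop.
    """
    sorter = TopologicalSorter()
    for layer, backing in backed_by.items():
        sorter.add(layer)
        for lower in backing:
            # Predecessors are emitted first, so `layer` comes out before `lower`.
            sorter.add(lower, layer)
    try:
        return list(sorter.static_order())
    except CycleError as err:
        raise ValueError(f"dependency loop: {err.args[1]}") from err
```

For a stack such as a filesystem on a loop device whose backing file lives on a filesystem on a raid device, the result lists the loop-mounted filesystem first and the raw disks last; a spare raid member backed by a file on that same raid shows up as a `ValueError` instead of a silent wrong order.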
2008-11-06 21:49 yup, it's duct tape and wishful thinking 2008-11-06 21:49 then there is no code here to make sure that the sync is bottom up 2008-11-06 21:50 ok, family time for real 2008-11-06 21:51 tuesday: we go look at sys_rename again, this time concentrating on object lifetimes, locking etc 2008-11-06 21:52 ok, see you 2008-11-06 23:40 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-11-07 01:25 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-11-07 01:55 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-07 05:04 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-07 07:00 -!- mingming(~mingming@c-71-193-163-244.hsd1.or.comcast.net) has joined #tux3 2008-11-07 08:27 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-07 09:10 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-07 11:07 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-07 12:28 -!- ajonat(~ajonat@190.48.97.229) has joined #tux3 2008-11-07 12:47 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-07 13:33 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-07 13:50 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-07 15:12 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-07 15:50 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-07 16:25 hi 2008-11-07 16:26 just a question: 2008-11-07 16:26 the "flush" operation at a delta transition (used to be called phase transition) also changes the inode table block 2008-11-07 16:26 the question is: what happens if another file create comes in that wants to change the same inode table block? 2008-11-07 16:26 we can't just "fork" the inode table block in cache 2008-11-07 16:27 new file creation is a new delta? 
2008-11-07 16:27 in here 2008-11-07 16:29 If new creation was same delta, can't we modify "inode table block" buffer? 2008-11-07 17:39 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-11-07 18:18 hirofumi, yes, new file in a new delta 2008-11-07 18:19 if it's new delta, before starting new delta, we have to "flush"? 2008-11-07 18:19 yes 2008-11-07 18:20 if it is in the same delta (that is, the "active" delta") then the change can be made entirely in the dcache 2008-11-07 18:21 in the original model I had in mind, the change would also be made in a cached directory entry block 2008-11-07 18:21 but now I just want file operations to operate only on the dcache 2008-11-07 18:22 and the state of the dcache at the time of a delta transition is then put into directory entry blocks 2008-11-07 18:22 this will have the side effect of making the case where a file is created then immediately deleted a lot more efficient, for what that is worth 2008-11-07 18:22 in other words, temporary files get really cheap 2008-11-07 18:24 if new delta did "flush", we already has lastest state of buffers? 2008-11-07 18:25 can you clarify? 2008-11-07 18:26 maybe I don't understand what is the issue of that 2008-11-07 18:26 I'm thinking that issue is 2008-11-07 18:26 after the "flush" the state of the namespace is then represented in the cached directory entry blocks 2008-11-07 18:27 however, the top end may have gone on and made more changes 2008-11-07 18:27 so the only time it will exactly match is when there are no new namespace operations taking place 2008-11-07 18:29 um... 
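The idea described above, where file operations touch only the name cache and directory entry blocks are updated at a delta transition, can be sketched as a toy model. Everything here is hypothetical illustration rather than tux3 code: `DeferredNamespace`, `pending` and `dirent_blocks` are made-up names, and a plain dict stands in for the cached directory entry blocks.

```python
class DeferredNamespace:
    """Toy model: name operations touch only an in-memory cache until a flush."""

    def __init__(self):
        self.dirent_blocks = {}  # stands in for cached directory entry blocks
        self.pending = {}        # name -> inode, or None for a pending delete
        self.block_updates = 0   # how many times dirent_blocks was modified

    def create(self, name, inode):
        self.pending[name] = inode

    def unlink(self, name):
        if name in self.pending and name not in self.dirent_blocks:
            # Created and deleted within one delta: nothing ever reaches a block.
            del self.pending[name]
        else:
            self.pending[name] = None

    def lookup(self, name):
        if name in self.pending:
            return self.pending[name]
        return self.dirent_blocks.get(name)

    def flush(self):
        """Delta transition: apply the accumulated name changes to the blocks."""
        for name, inode in self.pending.items():
            if inode is None:
                self.dirent_blocks.pop(name, None)
            else:
                self.dirent_blocks[name] = inode
            self.block_updates += 1
        self.pending = {}
```

In this model a temporary file that is created and unlinked before the next `flush()` costs no block update at all, which is the "temporary files get really cheap" effect; a real implementation would also have to deal with the top end making further changes while the flush is in progress.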
2008-11-07 18:31 so if new delta is blocked for "flush" to start new delta, I think there is no new namespace operations 2008-11-07 18:32 all namespace operations would have to block too 2008-11-07 18:32 yes 2008-11-07 18:32 that is what I don't like 2008-11-07 18:33 ah, i see 2008-11-07 18:33 it doesn't really make sense to block top end operations just because the cache layer is busy preparing to _begin_ a writeout phase 2008-11-07 18:33 I thought all operations wait to start new delta in that case 2008-11-07 18:33 the only time it should block is when it can't get new memory for cache 2008-11-07 18:33 i see 2008-11-07 18:33 starting a new delta would not solve the problem 2008-11-07 18:34 yes 2008-11-07 18:34 so the approach I have in mind is for the top end file operations not to modify dirent or inode table blocks at all 2008-11-07 18:34 is it optimization thing? 2008-11-07 18:34 yes 2008-11-07 18:34 ah, i see 2008-11-07 18:35 just about avoiding stalls 2008-11-07 18:35 I thought those were "fix" 2008-11-07 18:35 also a correctness thing, if the top end and the back end are block modifying the same blocks then synchronization is required 2008-11-07 18:36 yes 2008-11-07 18:36 by rearranging things so that the top end does not modify any cached blocks except cached data blocks, that eliminates the need to do the synchronization 2008-11-07 18:36 so I think the implementation complexity is about the same either way 2008-11-07 18:37 but the "deferred namespace operations" approach will give beter behavior 2008-11-07 18:38 maybe block all thing is easy, but much slow although 2008-11-07 18:38 true 2008-11-07 18:39 we could have a rw semaphore that the back end takes when it starts to prepare a delta for writeout 2008-11-07 18:39 btw, inode number is asigned while deffering 2008-11-07 18:39 and file operations take a read lock when they want to create or delete a file 2008-11-07 18:39 that would work, and it would obviously be a bottleneck 2008-11-07 18:39 yes, inode 
number assignment will be deferred 2008-11-07 18:40 which means that nfs file handle resolution must wait until that is completed 2008-11-07 18:40 so that at least has to be synchronized 2008-11-07 18:40 and the other thing you noticed is that directory listing also has to wait until names are flushed into the directory entry blocks 2008-11-07 18:41 yes 2008-11-07 18:41 what happen to fstat()? 2008-11-07 18:41 it needs to return ino 2008-11-07 18:41 gets the attributes out of the cached inode 2008-11-07 18:41 ah 2008-11-07 18:41 yes 2008-11-07 18:42 too bad about taht 2008-11-07 18:42 that also require "flush"? 2008-11-07 18:42 I seem to recall that you're not supposed to use the inode field 2008-11-07 18:42 let me check 2008-11-07 18:43 btw, ramfs assigns ino (i.e. last_ino) 2008-11-07 18:46 have you got a file/line number? 2008-11-07 18:46 for fstat? 2008-11-07 18:46 for the ramfs ino assignment 2008-11-07 18:47 it's in new_inode() 2008-11-07 18:47 http://lxr.linux.no/linux+v2.6.27.5/fs/inode.c#L549 2008-11-07 18:48 I think assignment is 2008-11-07 18:48 http://lxr.linux.no/linux+v2.6.27.5/fs/inode.c#L567 2008-11-07 18:48 that's new 2008-11-07 18:49 oh 2008-11-07 18:49 seems like a bad hack 2008-11-07 18:49 I wonder what the justification was 2008-11-07 18:49 need to go crawling through git commits to find out 2008-11-07 18:50 might now take long actually 2008-11-07 18:50 justification of last_ino? 2008-11-07 18:50 for ramfs having inode numbers 2008-11-07 18:50 I bet it's connected to some other questionable design direction 2008-11-07 18:51 at least, iirc, hash needs i_ino 2008-11-07 18:52 which hash? 
2008-11-07 18:52 inode_hash 2008-11-07 18:53 that's a per-fs operation 2008-11-07 18:53 http://lxr.linux.no/linux+v2.6.27.5/include/linux/fs.h#L1843 2008-11-07 18:55 insert_inode_hash is not called by the vfs 2008-11-07 18:55 http://lxr.linux.no/linux+v2.6.27/+ident=18000161 2008-11-07 18:56 yes 2008-11-07 18:56 hmm, it's taking kernel.org forever to generate a history for ramfs/inode.c 2008-11-07 18:57 maybe ramfs want to use generic stuff and libraries 2008-11-07 18:57 it worked fine for years with no inode numbers 2008-11-07 18:57 um... 2008-11-07 18:59 in 2.4.0, new_inode was called get_empty_inode 2008-11-07 18:59 and get_empty_inode() uses last_ino with same way 2008-11-07 19:00 2.6.26.5 uses new_inode but not last_ino 2008-11-07 19:01 ok, I see what you mean 2008-11-07 19:01 you're right 2008-11-07 19:02 will/can tux3 work without i_ino assignment? 2008-11-07 19:02 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-07 19:03 good question 2008-11-07 19:03 I think so 2008-11-07 19:04 at least, inode number assignments can be deferred for a while 2008-11-07 19:04 it might be necessary to make fstat wait until inode number has been assigned 2008-11-07 19:04 i see 2008-11-07 19:05 this is still better than making file create wait until a delta has been prepared for writeout 2008-11-07 19:05 yes 2008-11-07 19:06 btw, we need to handle all physical remapping stuff like this? 2008-11-07 19:06 556 static unsigned int last_ino; <- ohmygod, barf 2008-11-07 19:06 that is disgusting 2008-11-07 19:07 which physical remapping? 2008-11-07 19:07 you mean, moving file blocks around? 2008-11-07 19:08 I'm not sure, maybe pointer to dtree root? 2008-11-07 19:08 and atree root 2008-11-07 19:09 ok, now obviously the inode number generated in new_inode is replaced later by any filesystem that actually has inode numbers 2008-11-07 19:09 so... 
there is something deeply broken here 2008-11-07 19:09 and inode number that suddenly changes is worse than no inode number at all 2008-11-07 19:10 yes, it's good old broken behavior of no actual inode number 2008-11-07 19:10 re dtree roots etc, that is handled entirely by the back end, we don't have a messy collision like with namespace ops 2008-11-07 19:11 great 2008-11-07 19:11 ok, I will be busy investigating inode number issues for a little while, thanks for bringing up the issue 2008-11-07 19:11 thanks for explaining those 2008-11-07 19:12 I need to get this post posted and get things moving 2008-11-07 19:12 by now, you know everything that is in the post 2008-11-07 19:12 but you may be the only one, so I have to share it with the rest 2008-11-07 19:13 yes, that's very good 2008-11-07 19:14 I want to read it too though :) 2008-11-07 19:16 one thing you _can_ always do before assigning an inode number is write to a file 2008-11-07 19:16 the case of create+write is very common, obviously 2008-11-07 19:17 yes 2008-11-07 19:17 maybe i_ino is needed at a few places 2008-11-07 19:18 anything that could access it before the filesystem make the real assignment is broken 2008-11-07 19:18 but we should look for specific cases 2008-11-07 19:19 last_ino is a private static, that means it starts again at zero each time the kernel boots 2008-11-07 19:19 and is shared by all filesystems, it probably only matters to about three of them 2008-11-07 19:20 yes 2008-11-07 19:20 and because it is limited to 2^32, there is a real danger it can wrap 2008-11-07 19:20 should not be hard to write a test case 2008-11-07 19:20 it's about the crappiest thing I've seen in kernel, ever 2008-11-07 19:21 2^32 is workaround for lfs (ia32e mode in x86_64) 2008-11-07 19:21 I noticed the comment 2008-11-07 19:24 um.. security stuff may use i_ino... 2008-11-07 19:34 http://lxr.linux.no/linux+v2.6.27.5/kernel/auditsc.c#L1792 2008-11-07 19:37 um.. 
audit seems to use i_ino to check it's interesting inode... 2008-11-07 20:22 -!- RazvanM(~RazvanM@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-07 20:32 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-08 03:41 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-08 05:24 -!- pranith(~Bobby@122.162.73.68) has joined #tux3 2008-11-08 05:34 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-08 05:44 MaZe, hello 2008-11-08 10:33 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-08 11:07 -!- ajonat(~ajonat@190.48.102.146) has joined #tux3 2008-11-08 11:16 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-08 12:01 -!- ajonat(~ajonat@190.48.102.146) has joined #tux3 2008-11-08 12:22 -!- pranith(~Bobby@122.162.67.24) has joined #tux3 2008-11-08 12:59 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-08 13:34 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-08 14:08 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-08 15:31 pranith: hey 2008-11-08 15:31 hmm, not here 2008-11-08 15:51 -!- pgquiles(~pgquiles@244.Red-81-36-194.dynamicIP.rima-tde.net) has joined #tux3 2008-11-08 15:54 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-08 17:01 -!- konrad(~konrad@maxk.cs.washington.edu) has joined #tux3 2008-11-08 17:03 -!- konrad(~konrad@maxk.cs.washington.edu) has joined #tux3 2008-11-09 07:54 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-11-09 07:55 hey all 2008-11-09 07:57 hi 2008-11-09 08:06 hirofumi: hello 2008-11-09 08:07 havent heard any tux3 news lately 2008-11-09 08:07 whats going on? 
2008-11-09 08:08 I think flips is writing about some issue of implementation recently 2008-11-09 08:09 writing email 2008-11-09 08:30 ohk 2008-11-09 09:03 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-09 09:08 -!- mib_9lu3j8(ca4ea502@webchat.mibbit.com) has joined #tux3 2008-11-09 09:08 hey mib_9lu3j8 2008-11-09 09:08 :) 2008-11-09 09:10 :) 2008-11-09 09:11 yoyo!!!! 2008-11-09 09:11 sup?????? 2008-11-09 09:12 em peeking? 2008-11-09 09:12 spu: tux3 ... 2008-11-09 09:40 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-09 09:57 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-09 11:34 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-09 11:54 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-09 12:17 -!- pgquiles(~pgquiles@18.Red-81-39-35.dynamicIP.rima-tde.net) has joined #tux3 2008-11-09 12:39 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-09 12:51 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-09 13:31 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-09 19:56 -!- ajonat(~ajonat@190.48.102.146) has joined #tux3 2008-11-09 20:17 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-09 20:39 -!- RazvanM(~RazvanM@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-09 21:32 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-10 00:18 -!- konrad(~konrad@maxk.cs.washington.edu) has joined #tux3 2008-11-10 02:20 sigh, that took way too long 2008-11-10 02:21 post on deferred namespace operations is now up 2008-11-10 02:21 enjoy 2008-11-10 02:22 hey flips 2008-11-10 02:22 hi 2008-11-10 02:22 drove through LA to SF today, I'm here now 2008-11-10 02:22 how's it going ? 
2008-11-10 02:22 ACTION notes that it's getting late 2008-11-10 02:22 indeed 2008-11-10 02:22 time for me to crash 2008-11-10 03:06 -!- konrad(~konrad@maxk.cs.washington.edu) has joined #tux3 2008-11-10 08:06 -!- mingming(~mingming@c-71-193-163-244.hsd1.or.comcast.net) has joined #tux3 2008-11-10 08:53 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-10 09:22 flips, nice post 2008-11-10 10:02 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-10 10:12 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-10 10:57 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-10 11:20 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-10 13:01 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-10 14:15 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-10 14:15 tim_dimm, thanks 2008-11-10 14:16 clear and concise 2008-11-10 14:16 and makes me want to know more 2008-11-10 14:16 :-) 2008-11-10 14:17 I think the main question is "when will it work" 2008-11-10 14:17 that's a good one 2008-11-10 14:18 will try to get at least one bit of functioning code checked in today 2008-11-10 14:18 the other is who's going to help? 
2008-11-10 14:20 right 2008-11-10 14:22 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-10 14:59 ok, there's the time bug 2008-11-10 14:59 it's my fault, storing time in 48 bits instead of 64 2008-11-10 14:59 and forgetting to compensate for that 2008-11-10 15:03 stat test/bar 2008-11-10 15:03 File: `test/bar' 2008-11-10 15:03 Size: 4 Blocks: 1226358191 IO Block: 4096 regular file 2008-11-10 15:03 Device: 12h/18d Inode: 14 Links: 0 2008-11-10 15:03 Access: (0666/-rw-rw-rw-) Uid: ( 0/ root) Gid: ( 0/ root) 2008-11-10 15:03 Access: 1969-12-31 16:00:00.000000000 -0800 2008-11-10 15:03 Modify: 1969-12-31 16:00:00.1075315904 -0800 2008-11-10 15:03 Change: 2008-11-10 15:03:11.-1077364536 080 2008-11-10 15:03 ok, ctime is right 2008-11-10 15:04 negative value? 2008-11-10 15:05 yes, that's odd 2008-11-10 15:05 that 080 is funny 2008-11-10 15:06 it seems '-' left out last '0' 2008-11-10 15:06 what is that field anyway? 2008-11-10 15:07 time zone? 2008-11-10 15:07 I thought so 2008-11-10 15:07 then how could it be wrong? 2008-11-10 15:08 maybe '-' was added, so it was go out? 2008-11-10 15:08 linux manpage for ls and stat don't document what the fields are 2008-11-10 15:08 wait a minute, I'll check it 2008-11-10 15:08 well, I will get atime and mtime working and check in the patch 2008-11-10 15:08 it's progress 2008-11-10 15:09 btw, can we see the timestamp patch? 
2008-11-10 15:10 yes, I will make mtime and atime work and post it 2008-11-10 15:10 or just check it in 2008-11-10 15:11 ok 2008-11-10 15:12 now what to do with atime 2008-11-10 15:12 make it be the same as ctime for now I suppose 2008-11-10 15:12 and leave it that way until an atime table is implemented, a low priority 2008-11-10 15:13 yes, atime is not important I think 2008-11-10 15:13 stat foo 2008-11-10 15:13 File: `foo' 2008-11-10 15:13 Size: 0 Blocks: 0 IO Block: 1024 regular empty file 2008-11-10 15:13 Device: 806h/2054d Inode: 3564722 Links: 1 2008-11-10 15:13 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) 2008-11-10 15:13 Access: 2008-11-10 15:05:00.000000000 -0800 2008-11-10 15:13 Modify: 2008-11-10 15:05:00.000000000 -0800 2008-11-10 15:13 Change: 2008-11-10 15:05:00.000000000 -0800 2008-11-10 15:14 ok, so, there is a stat bug 2008-11-10 15:14 if mtime is messed up, then ctime display is messed up, even if ctime is correct 2008-11-10 15:14 sloppy coding 2008-11-10 15:14 magically, atime appears correct 2008-11-10 15:15 even though I know it is not set properly by tux3 2008-11-10 15:15 strange assumptions at work in the library code 2008-11-10 15:15 ok, a little cleanup then checkin time 2008-11-10 15:16 in stat.c, buffer is "static". so it might be overflowed 2008-11-10 15:16 stat.c in which directory? 2008-11-10 15:16 stat command in coreutils 2008-11-10 15:16 you found that fast 2008-11-10 15:17 I just "apt-get source coreutils" 2008-11-10 15:17 oh, before that, "dpkg -S /usr/bin/stat" 2008-11-10 15:18 still fast 2008-11-10 15:19 maybe since I do this kind of thing often 2008-11-10 15:19 ok, I have no idea whether storing time attributes in 48 bits instead of 64 is a good idea, but it's not much code to accomplish it so I'll keep that for the time being 2008-11-10 15:21 what's for 14bits? 2008-11-10 15:21 14bits?
2008-11-10 15:21 oops, 16bits 2008-11-10 15:21 as I've implemented it, tux3 times are 32.32 fixed point 2008-11-10 15:22 heh 2008-11-10 15:22 and stored in inode table blocks as 32.16 2008-11-10 15:22 gives a precision of 1/(2**16) seconds 2008-11-10 15:23 um.. 2008-11-10 15:23 and I don't know if the two byte savings per time field is worth the lower precision, but this can be changed easily if we decide later 2008-11-10 15:24 what value was store to sb->s_time_gran? 2008-11-10 15:25 ah, in userspace there is no such field 2008-11-10 15:25 yes, userspace will work perfect 2008-11-10 15:27 I wonder if s_time_gran can be 2**32 ? 2008-11-10 15:27 I bet things will break 2008-11-10 15:27 let me check timespec_trunc() 2008-11-10 15:28 allowing non-binary granularity to represent fractions of seconds was always a very bad thing to do 2008-11-10 15:29 http://lxr.linux.no/linux+v2.6.27.5/kernel/time.c#L285 2008-11-10 15:29 beat me ;) 2008-11-10 15:30 ;) looks like work? 2008-11-10 15:30 289 * Currently current_kernel_time() never returns better than 2008-11-10 15:30 290 * jiffies resolution. Exploit that. 
<- doesn't sound good 2008-11-10 15:31 gran is just unsigned 2008-11-10 15:31 32 bits on 32 bit arch 2008-11-10 15:31 so 2**32 will be zero 2008-11-10 15:32 we can convert to messed up decimal format on attribute load and save 2008-11-10 15:32 costing some extra multiply and divides 2008-11-10 15:32 that probably do not matter compared to cost of packing/unpacking attributes 2008-11-10 15:32 ah 2008-11-10 15:33 anyway, it is a relatively small thing, if tux3 has to used messed up decimal time formats on disk like everybody else then it will 2008-11-10 15:33 for now we will use nice clean fixed point 2008-11-10 15:33 i see 2008-11-10 15:34 oh, another way to deal with the issue 2008-11-10 15:35 is to use 32.16 in inode time fields, then time_gran will work 2008-11-10 15:35 in fact 32.31 will probably work 2008-11-10 15:36 ok, so I don't think there is a problem, and we don't actually introduce any new overhead by doing this, because kernel already does multiply/divides on all time conversion 2008-11-10 15:37 yes 2008-11-10 15:39 btw, the comment of timespec_trunc() seems to be strange 2008-11-10 15:40 if the resolution was better than jiffies, it does nothing 2008-11-10 15:40 the doesn't handle higher resolution comment? 
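The 32.32 in-memory and 32.16 on-disk time formats discussed above can be worked through in a few lines. This is a sketch under assumptions, not the tux3 code: the helper names are made up, and truncation (rather than rounding) of the dropped fraction bits is assumed.

```python
NSEC_PER_SEC = 1_000_000_000


def to_fixed_32_32(sec, nsec):
    """Pack (seconds, nanoseconds) into a 64-bit 32.32 fixed-point value."""
    frac = (nsec << 32) // NSEC_PER_SEC  # binary fraction of a second
    return ((sec & 0xFFFFFFFF) << 32) | frac


def pack_32_16(fixed):
    """Drop the low 16 fraction bits, leaving a 48-bit 32.16 value for storage."""
    return fixed >> 16


def unpack_32_16(stored):
    """Shift a stored 32.16 value back into the 32.32 layout on load."""
    return stored << 16


def from_fixed_32_32(fixed):
    """Convert 32.32 fixed point back to (seconds, nanoseconds)."""
    sec = fixed >> 32
    nsec = ((fixed & 0xFFFFFFFF) * NSEC_PER_SEC) >> 32
    return sec, nsec


def as_u32(value):
    """Model storing a value in a 32-bit unsigned field."""
    return value & 0xFFFFFFFF
```

Half a second round-trips exactly through the 48-bit form because its fraction, 2**31, has no low bits to lose; other values come back up to about 1/(2**16) second early. The last helper shows the granularity remark: 2**32 does not fit in a 32-bit unsigned field and reads back as zero.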
2008-11-10 15:40 so it means resolution is xtime resolution 2008-11-10 15:41 but now, xtime can handle nanosec on some archs 2008-11-10 15:41 void update_xtime_cache(u64 nsec) 2008-11-10 15:41 { 2008-11-10 15:41 xtime_cache = xtime; 2008-11-10 15:41 timespec_add_ns(&xtime_cache, nsec); 2008-11-10 15:41 } 2008-11-10 15:42 you can teach me about kernel time handling as we go ;) 2008-11-10 15:42 xtime is current wall time 2008-11-10 15:43 it needs xtime_lock to read, so to avoid xtime_lock, kernel cached to xtime_cache 2008-11-10 15:44 well, xtime is current wall time 2008-11-10 15:44 struct timespec current_fs_time(struct super_block *sb) 2008-11-10 15:44 { 2008-11-10 15:44 struct timespec now = current_kernel_time(); 2008-11-10 15:44 return timespec_trunc(now, sb->s_time_gran); 2008-11-10 15:44 } 2008-11-10 15:45 so like you know, current_fs_time() is used to update timestamp 2008-11-10 15:45 I didn't know that 2008-11-10 15:46 have never really looked at timestamps in kernel 2008-11-10 15:46 if timespec_trunc() does nothing, I think timestamp resolution will be nanosec 2008-11-10 15:46 only I know that it's really messed up, two major different ways of doing things 2008-11-10 15:46 and both ways looking clumsy and complex 2008-11-10 15:47 current_fs_time() was introduced to avoid timestamp was go back 2008-11-10 15:47 that sure did need fixing 2008-11-10 15:48 but the fix would be far more easily verified if fixed point was used instead of decimal conversions 2008-11-10 15:48 I am sure that libc is part of the problem 2008-11-10 15:48 it uses decimal fractions and nothing will ever change that 2008-11-10 15:49 that is baked into posix 2008-11-10 15:49 sigh 2008-11-10 15:50 i see. 
I'm not sure about fixed point though 2008-11-10 15:50 neither am I 2008-11-10 15:50 but it is not hard to change if we have to 2008-11-10 15:50 right now I think we do not have to, because time_gran will work fine for us 2008-11-10 15:51 well, anyway, I think this is not big issue 2008-11-10 15:51 agreed 2008-11-10 15:51 a bigger issue is checking in the patch so we can move on to more important things 2008-11-10 15:52 so... about 15 minutes more cleanup and it goes in 2008-11-10 15:52 ok, good. even if it can't compile, we don't care. we can fix it 2008-11-10 16:17 yuck, what a mess, trace.h includes unistd.h for getpid, but unistd.h takes the magic _XOPEN defines to decide whether to expose the secret new functions like posix_memalign 2008-11-10 16:17 so the result is, posix_memalign doesn't get exposed if trace.h is included first 2008-11-10 16:18 the whole concept of C headers is deeply broken 2008-11-10 16:19 GNU_SOURCE doesn't work? 2008-11-10 16:19 didn't know about it 2008-11-10 16:19 select feature 2008-11-10 16:20 iirc, -D_GNU_SOURCE 2008-11-10 16:21 http://mail.gnome.org/archives/gnome-list/1998-May/msg00886.html 2008-11-10 16:22 that helps, I will get rid of the XOPEN defines 2008-11-10 16:24 i see what is problem now. define feature macro by -D seems to help 2008-11-10 16:34 http://tux3.org/tux3 2008-11-10 16:35 have not removed the XOPEN defines yet 2008-11-10 16:58 I noticed we have two files for fuse - tux3fs.c and tux3fuse.c 2008-11-10 17:05 hirofumi: yes one is using the fuse low level api 2008-11-10 17:05 contributed by 2 different peopel 2008-11-10 17:05 i see. I thought those are merge miss 2008-11-10 17:17 nope 2008-11-10 17:18 tux3fuse.c is the low level api one, and it is preferred, since the api more closely matches the kernel vfs interfaces, which means easier porting... 
maybe 2008-11-10 17:20 i see, thanks 2008-11-10 18:24 the tux3fs fuse version isn't fixed up for time attributes yet 2008-11-10 18:24 I thought that might be a good bit for konrad 2008-11-10 18:25 ACTION is summoned 2008-11-10 18:25 want to fix tux3fs.c to handle inode times? 2008-11-10 18:26 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-10 18:26 hm 2008-11-10 18:26 this patch made them work for tux3fuse.c: http://tux3.org/tux3?cs=5f175e3877b5 2008-11-10 18:26 ah 2008-11-10 18:26 let me take a look 2008-11-10 18:26 and I'm going to post to the email about it now 2008-11-10 18:35 df test 2008-11-10 18:35 Filesystem 1K-blocks Used Available Use% Mounted on 2008-11-10 18:35 df: `test': Function not implemented 2008-11-10 18:35 I wonder what function it means 2008-11-10 18:38 something's missing in the high level fuse bits 2008-11-10 18:38 that's right 2008-11-10 18:38 patch coming in a moment 2008-11-10 18:38 nothing is being saved for date, it's not just the wrong number 2008-11-10 18:38 or wrong date, even 2008-11-10 18:39 there's a mistake in the tux3fuse attribute code 2008-11-10 18:39 the stat struct was not fully initialized 2008-11-10 18:41 http://tux3.org/tux3?cs=a9f7f86e37d0 <- there 2008-11-10 18:42 ls -l does not give the right time yet 2008-11-10 18:42 but stat (1) does 2008-11-10 18:58 ok, ls -l apparently shows mtime in the time field 2008-11-10 18:58 not documented in man ls, a shame 2008-11-10 18:59 ctime is fine, mtime is coming through as zero 2008-11-10 18:59 that is because tux3 is not storing mtime at the moment 2008-11-10 19:00 and decode_attrs needs to set mtime to be ctime if the mtime wasn't stored 2008-11-10 19:00 It seems reasonable to avoid storing mtime any time it was not explicitly set, I think shapor looked into that 2008-11-10 19:01 tux3fs isn't storing mtime, ctime, or atime for some reason 2008-11-10 19:03 it's storing ctime 2008-11-10 19:03 try these commands: 2008-11-10 19:03 make && make mkfs 
&& make defuse 2008-11-10 19:04 then in another terminal, su; echo foo >/tmp/test/bar; ls -l /tmp/test 2008-11-10 19:04 sorry 2008-11-10 19:04 stat /tmp/test 2008-11-10 19:05 if you do ls -l, you will see an unset date, which I am fixing right now 2008-11-10 19:05 that is because ctime is set, mtime isn't 2008-11-10 19:06 ok, that works 2008-11-10 19:12 konrad, this fixes ls -l: http://tux3.org/tux3?cs=999d5660bc90 2008-11-10 19:13 flips: I think you're talking about tux3fuse, I'm talking about tux3fs 2008-11-10 19:14 konrad, ah, right, you said high level fuse is broken 2008-11-10 19:14 that may be, but at least tux3 isn't also broken, which it was a few minutes ago 2008-11-10 19:14 good ;) 2008-11-10 19:15 konrad, if you're sure, then how about emailing the fuse list? 2008-11-10 19:15 hm? 2008-11-10 19:15 did you mean that fuse itself is broken 2008-11-10 19:15 no 2008-11-10 19:15 or tux3fs.c 2008-11-10 19:15 ok 2008-11-10 19:15 tux3fs.c 2008-11-10 19:15 right, I didn't look at it 2008-11-10 19:16 I didn't really look at tux3fuse.c either, just enough to fix the dates 2008-11-10 19:16 mtimes seem to work fine on some other fuse fs's I have mounted 2008-11-10 19:16 yeah 2008-11-10 19:17 until a few minutes ago, mtime never worked in tux3 at all 2008-11-10 19:17 because we don't store mtime attributes at the moment 2008-11-10 19:17 now mtime is taken the same as ctime if it isn't given explicitly 2008-11-10 19:25 .st_atim = { 2008-11-10 19:25 .tv_sec = high32(inode->i_atime), 2008-11-10 19:25 .tv_nsec = millionths(inode->i_atime) * 1000, 2008-11-10 19:25 }, 2008-11-10 19:25 above seems to work for fine resolution 2008-11-10 19:27 thankyou 2008-11-10 19:28 cleaner is to define a billionths function 2008-11-10 19:31 I think time_sec/time_nsec may be more readable 2008-11-10 19:32 we don't define either _BSD_SOURCE or _SVID_SOURCE, so why do we see st_atim? 
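The `.st_atim` fragment above and the remark that a billionths function would be cleaner can be sketched as follows. This is only an illustration: it assumes the inode time is a 32.32 fixed-point value (whole seconds in the high 32 bits, a binary fraction of a second in the low 32 bits). The type name `fixed32` and the helpers `timespec_to_fixed32()` and `fixed32_to_timespec()` are hypothetical; only `high32` and `billionths` are names taken from the discussion, and their bodies here are guesses.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Assumed layout: seconds in the high 32 bits, binary fraction in the low 32. */
typedef uint64_t fixed32;

/* Whole seconds. */
static inline uint32_t high32(fixed32 val)
{
	return (uint32_t)(val >> 32);
}

/* Fractional part scaled to nanoseconds, truncating. */
static inline uint32_t billionths(fixed32 val)
{
	return (uint32_t)(((val & 0xffffffffULL) * 1000000000ULL) >> 32);
}

/* Hypothetical helper: struct timespec -> fixed point, truncating. */
static inline fixed32 timespec_to_fixed32(struct timespec ts)
{
	assert(ts.tv_sec >= 0 && (uint64_t)ts.tv_sec <= 0xffffffffULL);
	assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000L);
	return ((uint64_t)ts.tv_sec << 32) |
	       (((uint64_t)ts.tv_nsec << 32) / 1000000000ULL);
}

/* Hypothetical helper: fixed point -> struct timespec. */
static inline struct timespec fixed32_to_timespec(fixed32 val)
{
	struct timespec ts = { .tv_sec = high32(val), .tv_nsec = billionths(val) };
	return ts;
}
```

Both directions truncate, so a round trip can come back one nanosecond low when the nanosecond count has no exact binary fraction. Setting `.tv_nsec = billionths(inode->i_atime)` would replace the `millionths(...) * 1000` expression without changing the seconds field.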
2008-11-10 19:33 _GNU_SOURCE is including those 2008-11-10 19:34 ok, and do you use stat (1) to verify atime has high precision? 2008-11-10 19:35 atime is just epoch actually 2008-11-10 19:35 ctime seems fine 2008-11-10 19:37 echo foo >test/foo && stat test/foo 2008-11-10 19:37 File: `test/foo' 2008-11-10 19:37 Size: 4 Blocks: 0 IO Block: 4096 regular file 2008-11-10 19:37 Device: 12h/18d Inode: 14 Links: 0 2008-11-10 19:37 Access: (0666/-rw-rw-rw-) Uid: ( 0/ root) Gid: ( 0/ root) 2008-11-10 19:37 Access: 1969-12-31 16:00:00.000000000 -0800 2008-11-10 19:37 Modify: 2008-11-10 19:37:04.436890000 -0800 2008-11-10 19:37 Change: 2008-11-10 19:37:04.436890000 -0800 2008-11-10 19:37 high precision mtime and ctime 2008-11-10 19:38 my source may be a bit old 2008-11-10 19:39 "touch testfile" seems strange 2008-11-10 19:40 right, the date did not get set 2008-11-10 19:40 yes 2008-11-10 19:43 the bug is here: 2008-11-10 19:43 if (to_set & FUSE_SET_ATTR_MTIME) { 2008-11-10 19:43 printf("Setting mtime to %Lu\n", (L)attr->st_mtime); 2008-11-10 19:43 inode->i_mtime = attr->st_mtime; 2008-11-10 19:43 inode->present |= MTIME_BIT; 2008-11-10 19:43 } 2008-11-10 19:43 missing conversion between st time and tuxtime 2008-11-10 19:46 um... MTIME is including CTIME? 2008-11-10 19:46 including? 2008-11-10 19:46 I can't find FUSE_SET_ATTR_CTIME 2008-11-10 19:47 ctime can't be set by setattr 2008-11-10 19:47 ah, yes 2008-11-10 19:48 maybe "touch testfile" is confusebul, it can replace with ":> testfile" 2008-11-10 19:49 ":> testfile" also seems not to store ctime 2008-11-10 19:50 well tex3_setattr is definitely wrong 2008-11-10 19:50 I'll fix it first and see if touch still doesn't work 2008-11-10 19:50 ok 2008-11-10 19:55 -!- RazvanM(~RazvanM@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-10 20:00 ah, maybe i_size is 0, so ctime also present... 
2008-11-10 20:01 also missing I think you meant 2008-11-10 20:01 that would make sense 2008-11-10 20:01 hmm 2008-11-10 20:02 also ctime is not present 2008-11-10 20:02 right 2008-11-10 20:02 well, I think I made a conceptual error there 2008-11-10 20:03 we need to know ctime even if size is zero 2008-11-10 20:03 and optimizing for zero length files makes no sense 2008-11-10 20:03 yes, however looks like not big issue 2008-11-10 20:03 so we better always store the ctime/size attribute even for zero length files 2008-11-10 20:05 you are right 2008-11-10 20:05 inode->i_size = 0; 2008-11-10 20:05 inode->present = CTIME_SIZE_BIT|MODE_OWNER_BIT|DATA_BTREE_BIT; 2008-11-10 20:06 in make_inode, above seems to fix this 2008-11-10 20:06 ok, fixed the time conversion in tux3fuse.c and ctime is still not set as you pointed out 2008-11-10 20:09 inode->i_size = 0 is probably not required, since the new inode comes from new_inode() which zeros all fields initially 2008-11-10 20:09 i see 2008-11-10 20:10 ok, you are right 2008-11-10 20:10 it seems to work 2008-11-10 20:11 note that we have to get both time fields out of the inode in one atomic fetch if we want them to be consistent 2008-11-10 20:11 in kernel, where things are parallel 2008-11-10 20:11 I'm going to leave it as it is for now 2008-11-10 20:13 umm.. we can just "read from current_fs_time(), then set"? 2008-11-10 20:14 well, timestamp stuff seems to work now almost, I think we just need to teaks a bit 2008-11-10 20:15 s/teaks/tweaks/ 2008-11-10 20:16 http://tux3.org/tux3?cs=dab895e2e896 <- your fixes and mine 2008-11-10 20:18 hirofumi, but what lock prevents concurrent access to our inode time fields? 2008-11-10 20:18 ACTION will be out for a bit 2008-11-10 20:20 ah, I thought it about "set", not "read". so you're right, it is not consistent 2008-11-10 20:24 see you later 2008-11-10 20:25 see you 2008-11-10 22:41 ...back 2008-11-10 22:42 I posted some patches, could you check those? 
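The two decisions reached above (always store the ctime/size attribute, even for zero length files, and report ctime as mtime when no mtime was stored) can be sketched like this. The bit names `CTIME_SIZE_BIT`, `MTIME_BIT`, `MODE_OWNER_BIT` and `DATA_BTREE_BIT` come from the discussion; the bit values, `struct sketch_inode` and the three functions are hypothetical and are not the tux3 code.

```c
#include <assert.h>
#include <stdint.h>

/* Attribute presence bits; the numeric values are made up for this sketch. */
enum {
	CTIME_SIZE_BIT = 1 << 0,
	MTIME_BIT      = 1 << 1,
	MODE_OWNER_BIT = 1 << 2,
	DATA_BTREE_BIT = 1 << 3,
};

/* Hypothetical stand-in for the real inode. */
struct sketch_inode {
	unsigned present;
	uint64_t i_size, i_ctime, i_mtime;
};

/* New inodes always carry ctime/size, even when the size is zero. */
static void sketch_make_inode(struct sketch_inode *inode, uint64_t now)
{
	*inode = (struct sketch_inode){ 0 };
	inode->i_ctime = now;
	inode->present = CTIME_SIZE_BIT | MODE_OWNER_BIT | DATA_BTREE_BIT;
}

/* An explicit mtime (for example from setattr) marks the attribute present. */
static void sketch_set_mtime(struct sketch_inode *inode, uint64_t when)
{
	inode->i_mtime = when;
	inode->present |= MTIME_BIT;
}

/* If no mtime was stored, fall back to ctime. */
static uint64_t sketch_effective_mtime(const struct sketch_inode *inode)
{
	return (inode->present & MTIME_BIT) ? inode->i_mtime : inode->i_ctime;
}
```

This sketch ignores the concurrency point raised above: in kernel, reading two time fields consistently would need some lock or an atomic fetch.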
2008-11-10 22:42 sure 2008-11-10 22:47 hirofumi, do you have a server online that you could use to export mercurial? 2008-11-10 22:47 so I can pull? 2008-11-10 22:47 maybe I can use user.kernel.org 2008-11-10 22:47 that would work 2008-11-10 22:47 wait a minute 2008-11-10 22:48 another way would be for me to set up a repository you can push to 2008-11-10 22:50 btw, "hg import" can use for email 2008-11-10 22:52 right, that will take some of the pain away 2008-11-10 22:53 and give proper attribution 2008-11-10 22:53 could you try "http://userweb.kernel.org/~hirofumi/tux3/"? 2008-11-10 22:53 I've just copied 2008-11-10 22:54 hg incoming http://userweb.kernel.org/~hirofumi/tux3/ 2008-11-10 22:54 abort: 'http://userweb.kernel.org/~hirofumi/tux3/' does not appear to be an hg repository! 2008-11-10 22:55 um... maybe .hg was hidden 2008-11-10 22:56 I can see .hg via http 2008-11-10 22:57 um.. "tar czf tux3.tar.gz tux3" is not work for hg? 2008-11-10 22:57 it should 2008-11-10 22:58 can you clone from that url yourself? 2008-11-10 22:58 not work 2008-11-10 22:59 well when you get it working, just let me know, for now I can use hg import 2008-11-10 22:59 thanks 2008-11-10 23:01 where does CURRENT_TIME_SEC come from? 2008-11-10 23:02 in dir.c? 2008-11-10 23:03 dir.c:#define CURRENT_TIME_SEC 123 2008-11-10 23:03 right, it doesn't come from anywhere 2008-11-10 23:03 just something I did to get it to compile 2008-11-10 23:04 so since dir.c includes tux3.h, we should make it use tuxtime() 2008-11-10 23:05 I assumed 123 may be used for unit test 2008-11-10 23:05 sometimes you have to just assume I was being lazy, then forgot about it 2008-11-10 23:06 :) 2008-11-10 23:06 so should I merge it, then clean it up or wait for a new post? 2008-11-10 23:06 either way is ok 2008-11-10 23:06 I think new patch is prefer for clean history 2008-11-10 23:06 yes 2008-11-10 23:07 I see you continued my pattern of #if1 in tux3fuse.c 2008-11-10 23:07 yes, those should be removed? 
2008-11-10 23:07 so I will just merge that, but it's painful the way that code repeats 2008-11-10 23:08 not your fault 2008-11-10 23:08 it would be nice if that assignment only happened in one place 2008-11-10 23:08 but it's not a big deal 2008-11-10 23:09 ah, sounds nice. but it seems to be separated patch thing 2008-11-10 23:09 ah, and millionths is uniformly changed to billionths, that is fine 2008-11-10 23:09 yes 2008-11-10 23:09 I'll just wait for your update to the CURRENT_TIME then merge them with hg import 2008-11-10 23:12 oops, I have to update all patches 2008-11-10 23:20 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-10 23:21 done, could you try? 2008-11-10 23:25 yes 2008-11-10 23:25 hg import doesn't "just work" 2008-11-10 23:26 it can't figure out the required strip option for that patches 2008-11-10 23:28 um.. it seems to work in my tree... 2008-11-10 23:29 hg import /src/patches/hirofumi/* 2008-11-10 23:29 applying /src/patches/hirofumi/billionths 2008-11-10 23:29 patching file user/tux3.h 2008-11-10 23:29 Hunk #1 FAILED at 254. 2008-11-10 23:29 1 out of 1 hunk FAILED -- saving rejects to file user/tux3.h.rej 2008-11-10 23:29 patching file user/tux3fuse.c 2008-11-10 23:29 Hunk #1 FAILED at 87. 2008-11-10 23:29 Hunk #2 FAILED at 187. 2008-11-10 23:29 Hunk #3 succeeded at 239 (offset -30 lines). 2008-11-10 23:29 2 out of 3 hunks FAILED -- saving rejects to file user/tux3fuse.c.rej 2008-11-10 23:29 abort: patch command failed: exited with status 1 2008-11-10 23:29 partial success 2008-11-10 23:31 um.. "hg log -r tip" is 385:dab895e2e896? 2008-11-10 23:31 yes 2008-11-10 23:32 hirofumi@devron (tux3)$ ../../usr/bin/hg import ../a/* 2008-11-10 23:32 applying ../a/p1.email 2008-11-10 23:32 applying ../a/p2.email 2008-11-10 23:32 applying ../a/p3.email 2008-11-10 23:32 um... 2008-11-10 23:32 ah, order is wrong? 
2008-11-10 23:33 billionths is last patch 2008-11-10 23:33 that could be it 2008-11-10 23:34 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-10 23:35 yes, that was it 2008-11-10 23:35 ok, thanks 2008-11-10 23:35 that is easier than writing log comments by hand, for sure 2008-11-10 23:36 but being able to pull would be easier yet, and I can use the incoming command 2008-11-10 23:36 yes, I'm reading hg-repo setup doc 2008-11-10 23:39 shapor has some experience with this 2008-11-10 23:41 more easier way on userweb.kernel.org is tarball? 2008-11-10 23:45 ah, ok. for plain http, hg need "static-http://" 2008-11-10 23:45 hg clone static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-11-10 23:51 yeah, that is the simplest way 2008-11-10 23:53 yes, thanks. now, it seems to work 2008-11-10 23:56 -!- pranith(c05e2202@webchat.mibbit.com) has joined #tux3 2008-11-11 00:14 I thought I tried that 2008-11-11 00:16 hg clone static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-11-11 00:16 abort: HTTP Error 404: Not Found 2008-11-11 00:17 fixed. I re-created and forget to rename .hg to tux3 2008-11-11 00:17 I still get 404 2008-11-11 00:17 hg clone static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-11-11 00:17 abort: HTTP Error 404: Not Found 2008-11-11 00:18 yes... 2008-11-11 00:19 use of static- as part of the url is a mistake on matt's part I think 2008-11-11 00:19 I didn't need to rename. now it seems to work 2008-11-11 00:19 it should be hg clone --static http://userweb.kernel.org/~hirofumi/tux3/ 2008-11-11 00:19 not for me 2008-11-11 00:20 "hg clone static-http://userweb.kernel.org/~hirofumi/tux3/" doesn't work? 2008-11-11 00:20 hg clone static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-11-11 00:20 abort: HTTP Error 404: Not Found 2008-11-11 00:20 hmm 2008-11-11 00:20 different hg version? 2008-11-11 00:20 404 sounds strange 2008-11-11 00:20 http://userweb.kernel.org/~hirofumi/tux3/? 
2008-11-11 00:21 hg --version 2008-11-11 00:21 Mercurial Distributed SCM (version 0.9.1) 2008-11-11 00:21 Mercurial Distributed SCM (version 1.0.1) 2008-11-11 00:22 sounds suspicious 2008-11-11 00:22 let's see what backports.org has 2008-11-11 00:23 has 1.01 2008-11-11 00:23 sigh, I hate fiddling with pins 2008-11-11 00:24 um.. 2008-11-11 00:24 etch hg seems to work static-http 2008-11-11 00:24 http://groups.google.com/group/linux.debian.bugs.dist/browse_thread/thread/578a41096f9bb7c0/192b5874832a1dc6?lnk=raot 2008-11-11 00:25 I pull static-http from shapor often 2008-11-11 00:25 could you recheck url? 2008-11-11 00:26 ah, seems right one 2008-11-11 00:26 I pasted exactly the one you gave me 2008-11-11 00:26 but the link you found seems to suggest its an etch bug 2008-11-11 00:27 that seems lenny bug 2008-11-11 00:28 ah, hg may have incompatible change? 2008-11-11 00:28 ah right 2008-11-11 00:29 it's a mystery 2008-11-11 00:29 attempting to strace hg is an exercise in frustration 2008-11-11 00:29 maybe http://hg.intevation.org/mercurial/crew-stable/rev/98b6c3dde237? 2008-11-11 00:34 ok, it seems incompatible change 2008-11-11 00:34 which change? 2008-11-11 00:34 old one is http://hg.intevation.org/kolab/kuudecode/.hg/ 2008-11-11 00:35 new one is http://hg.intevation.org/argh/.hg/ 2008-11-11 00:35 ok 2008-11-11 00:35 I'll downgrade to 0.9.1 2008-11-11 00:35 worth a try 2008-11-11 00:43 ok, done 2008-11-11 00:46 it works 2008-11-11 00:47 ok, I'll use this at next time 2008-11-11 00:47 and I can backport hg to get 1.01 on etch, but there doesn't seem to be a strong reason to 2008-11-11 00:48 until I try to clone from somebody else ;) 2008-11-11 00:49 yeah, well, install to another place was easy ;) 2008-11-11 00:49 ACTION wished we lived in a world where everything just worked, but then... 
that world would probably not need tux3 2008-11-11 00:55 maybe it's good, but it may be boring world ;) 2008-11-11 00:56 in that world, windows would work perfectly, thus leaving no way to get rid of it 2008-11-11 00:58 ok, here's the question: how much more fixing does this user space code need before we go into kernel? 2008-11-11 00:58 I am thinking, not much 2008-11-11 00:58 it doesn't really make a lot of sense to debug the concurrency and atomic commit in user space 2008-11-11 00:59 too much effort to simulate the dcache, add threads and so on 2008-11-11 01:00 I think it's good if we have some prototype to share some design details 2008-11-11 01:00 and we haven't tested any big files yet 2008-11-11 01:00 that is good to do in user space 2008-11-11 01:01 personally I'd like to work for kernel as soon as possible though 2008-11-11 01:02 me too 2008-11-11 01:03 I'll think about it tonight, and maybe get started on the kernel port tomorrow 2008-11-11 01:03 not abandon the user space code, but start working on the kernel 2008-11-11 01:04 sounds good. 
I'm going to think about buffers tonight 2008-11-11 01:04 maybe reread docs and think about it 2008-11-11 01:55 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-11 02:13 -!- samlh(~sam@67.129.121.145) has joined #tux3 2008-11-11 07:30 -!- pgquiles__(~pgquiles@18.Red-81-39-35.dynamicIP.rima-tde.net) has joined #tux3 2008-11-11 08:26 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-11 08:26 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-11 09:13 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-11 09:46 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-11 10:05 -!- Bobby_(~Bobby@122.162.72.214) has joined #tux3 2008-11-11 11:51 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-11 12:12 -!- mingming(~mingming@32.97.110.55) has joined #tux3 2008-11-11 13:10 there's work to do on disk endian and alignment 2008-11-11 13:10 dleaf.c declares three structures with bitfields, which won't produce consistent results across architectures 2008-11-11 13:10 what to do 2008-11-11 13:13 yes, and we should take some care to change those bitfileds 2008-11-11 13:14 I'm just considering whether extent should remain as a struct 2008-11-11 13:14 but with just one 64 bit field 2008-11-11 13:14 or should it just be a typedef for u64 2008-11-11 13:16 and we uses pack_to_extent() and unpack_from_extent()? 
2008-11-11 13:16 we already have make_extent 2008-11-11 13:17 ah, yes 2008-11-11 13:17 and extent_count(extent) 2008-11-11 13:17 we need to extra care about bitfields 2008-11-11 13:17 so, ok, next thing is to make extent_block(extent) 2008-11-11 13:17 right, we can't use bitfields for structures on disk 2008-11-11 13:18 because we can't expect consistent layout across architectures 2008-11-11 13:18 bitfield layout is not defined in c99 2008-11-11 13:18 not precisely enough 2008-11-11 13:19 and it needs to additional instractions 2008-11-11 13:19 I don't know if gcc generates good quality code for bitfields 2008-11-11 13:20 e.g. extent->block++ and extent->count++ may be racy 2008-11-11 13:20 that at least is precisely defined 2008-11-11 13:22 ok, we obviously need extent_block(extent) 2008-11-11 13:23 maybe 2008-11-11 13:23 yes 2008-11-11 13:23 I like your name pack_extent more than make_extent 2008-11-11 13:23 we may not need unpack_extent 2008-11-11 13:24 sounds good 2008-11-11 13:24 btw, do you know about "sparse" to check endian bug? 2008-11-11 13:24 know about it, but have not used it 2008-11-11 13:25 we need to add endian attributes everywhere as I recall 2008-11-11 13:25 yes, but it can helps us more or less 2008-11-11 13:27 annotations like (__le16 *) 2008-11-11 13:27 really ugly 2008-11-11 13:27 but then, it makes it look more like kernel code ;) 2008-11-11 13:28 if we just use be_u* stuff 2008-11-11 13:28 we can add __bitwise attribute like "typedef __u16 __bitwise __le16;" 2008-11-11 13:28 right 2008-11-11 13:28 __attribute__((bitwise)) 2008-11-11 13:28 yes 2008-11-11 13:29 will work in userspace just as well as kernel 2008-11-11 13:29 yes 2008-11-11 13:29 now, does sparse make any assumption that would prevent it working on userspace code? 2008-11-11 13:30 like assuming kbuild? 
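A rough sketch of the direction discussed above: the extent becomes a single big endian 64 bit value with accessor functions instead of bitfields, and the type carries a sparse bitwise annotation. The names `pack_extent`, `extent_block`, `extent_count` and the `be_u64` naming style come from the discussion; the field widths (48 bit block, 6 bit count), the macro `tux_bitwise` and the helpers `to_be_u64()` and `from_be_u64()` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Only sparse defines __CHECKER__; a normal compiler sees a plain integer. */
#ifdef __CHECKER__
#define tux_bitwise __attribute__((bitwise))
#else
#define tux_bitwise
#endif

typedef uint64_t tux_bitwise be_u64; /* 64 bit value stored big endian */

/* Store most significant byte first, whatever the host byte order is. */
static inline be_u64 to_be_u64(uint64_t val)
{
	uint8_t bytes[8];
	be_u64 out;
	for (int i = 0; i < 8; i++)
		bytes[i] = (uint8_t)(val >> (56 - 8 * i));
	memcpy(&out, bytes, sizeof out);
	return out;
}

static inline uint64_t from_be_u64(be_u64 val)
{
	uint8_t bytes[8];
	uint64_t out = 0;
	memcpy(bytes, &val, sizeof bytes);
	for (int i = 0; i < 8; i++)
		out = (out << 8) | bytes[i];
	return out;
}

/* Assumed field widths, not taken from the log. */
enum { EXTENT_BLOCK_BITS = 48, EXTENT_COUNT_BITS = 6 };

static inline be_u64 pack_extent(uint64_t block, unsigned count)
{
	assert(block < (1ULL << EXTENT_BLOCK_BITS));
	assert(count < (1U << EXTENT_COUNT_BITS));
	return to_be_u64(((uint64_t)count << EXTENT_BLOCK_BITS) | block);
}

static inline uint64_t extent_block(be_u64 extent)
{
	return from_be_u64(extent) & ((1ULL << EXTENT_BLOCK_BITS) - 1);
}

static inline unsigned extent_count(be_u64 extent)
{
	return (unsigned)(from_be_u64(extent) >> EXTENT_BLOCK_BITS) &
	       ((1U << EXTENT_COUNT_BITS) - 1);
}
```

Because the layout is fixed by shifts and masks rather than by the compiler's bitfield rules, the bytes on disk are the same on every architecture. An increment such as `extent->count++` would become an unpack, modify, repack sequence.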
2008-11-11 13:30 no, we can use it in userspace 2008-11-11 13:30 ok, that's fine 2008-11-11 13:30 we should 2008-11-11 13:30 actually I'm using it for exfat test codes 2008-11-11 13:31 so maybe the rest is bitfields stuff 2008-11-11 13:32 will we use #ifdef like "struct iphdr"? 2008-11-11 13:33 do you have a url? 2008-11-11 13:34 ahah 2008-11-11 13:34 bitfields in a network byte order structure 2008-11-11 13:34 yes 2008-11-11 13:35 only within a byte though 2008-11-11 13:35 still, it's a precedent 2008-11-11 13:35 http://lxr.linux.no/linux+v2.6.27.5/include/linux/ip.h#L85 2008-11-11 13:39 hey flips 2008-11-11 13:39 hi bh 2008-11-11 13:49 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-11 14:00 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-11 14:02 hirofumi, by the way it is ok to write long lines in tux3 code, it is better than wrapping with lots of tabs 2008-11-11 14:02 if we have to wrap at kernel merge time, we will do it then 2008-11-11 14:03 of course when it can be written naturally to fit on a short line, that is best 2008-11-11 14:03 ok, tux3graph? 2008-11-11 14:03 yes. Not worth changing, but updates can use longer lines 2008-11-11 14:05 ok. well, I'm using 80 columns for emacs 2008-11-11 14:05 ;) 2008-11-11 14:05 need a special setting for tux3? 2008-11-11 14:06 your code does look nice and "kernally" 2008-11-11 14:06 not need. if I want to see long lines, I'll use fullscreen 2008-11-11 14:06 yes, 2008-11-11 14:07 my default setting is for kernel almost 2008-11-11 14:09 ack, my commit comment was wrong for the previous commit 2008-11-11 14:09 fixing commit messages is an unsolved problem with git-style rev control 2008-11-11 14:10 umm... we can't use "git reset --hard" for it? 2008-11-11 14:36 walk->mock.group.count++; <- this is where wrappers make things ugly 2008-11-11 14:37 um.. group_count(&walk->mock.group)? 
2008-11-11 14:38 needs to increment it 2008-11-11 14:38 ah 2008-11-11 14:38 static inline void inc_group_count(struct group *group) 2008-11-11 14:38 { 2008-11-11 14:38 ++group->count_field; 2008-11-11 14:38 } 2008-11-11 14:39 but it is copy of group, not buffer 2008-11-11 14:39 so, just use as is? 2008-11-11 14:40 true, it's in the mock structure 2008-11-11 14:40 it would be a pain to have mock be a lot different from the buffer struct 2008-11-11 14:41 well, for now I will wrap it in an inc and later we can edit to something more efficient if we want 2008-11-11 14:41 yes 2008-11-11 14:42 I did the extent wrappers working directoy on a struct and the group wrappers working on pointers, that is inconsistent, later has to be made consistent 2008-11-11 14:42 I don't know which is better, but the pointers avoid lots of *'s 2008-11-11 14:43 maybe I think, we are thinking about locking, we will be needed some modify 2008-11-11 15:08 yes, locking always messes things up 2008-11-11 15:09 especially spinlocks where you can't call most functions etc 2008-11-11 15:09 yes 2008-11-11 15:10 big diff comitted for dleaf group, still have to do struct entry 2008-11-11 15:10 and then struct dleaf 2008-11-11 15:10 after that, struct ileaf 2008-11-11 15:10 and btree index blocks 2008-11-11 15:10 ok, are you going to write sparse stuff? 
2008-11-11 15:11 if not, I'll do 2008-11-11 15:11 thanks 2008-11-11 15:11 I'm happy to share that 2008-11-11 15:12 ok, maybe I'll do it tomorrow or so 2008-11-11 15:12 sure, the mindless field wrapping should be done by sometime tomorrow 2008-11-11 15:12 at least a good chunk of it 2008-11-11 15:12 sk8 oclock 2008-11-11 15:12 oh, enjoy 2008-11-11 15:21 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-11-11 17:15 -!- pgquiles__(~pgquiles@18.Red-81-39-35.dynamicIP.rima-tde.net) has joined #tux3 2008-11-11 17:28 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-11 18:24 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-11 19:18 -!- RazvanM(~RazvanM@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-11 19:36 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-11 19:44 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-11 19:54 -!- RalucaM(~ral@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-11 19:54 hi 2008-11-11 19:54 hi 2008-11-11 19:56 hey 2008-11-11 19:57 hi all 2008-11-11 19:57 hey guys 2008-11-11 19:58 damn... I'm the only one that didn't say 'hi' :P 2008-11-11 20:01 ping... 2008-11-11 20:02 no tux3 tonight? 2008-11-11 20:03 I think it will be :) 2008-11-11 20:03 i pinged flips 2008-11-11 20:03 no pong back yet? :D 2008-11-11 20:03 hi 2008-11-11 20:03 latency... 
2008-11-11 20:04 :D 2008-11-11 20:04 I was having so much fun wrapping big endian fields I didn't notice the time :-/ 2008-11-11 20:04 ok 2008-11-11 20:04 pong ;-) 2008-11-11 20:04 :-) 2008-11-11 20:04 heh 2008-11-11 20:04 today let's think about deferred namespace operations 2008-11-11 20:05 this is something that nobody has tried in Linux before 2008-11-11 20:05 cool concept 2008-11-11 20:05 it looks like a big win 2008-11-11 20:05 and not too much code 2008-11-11 20:05 we have to worry first about file create, then delete 2008-11-11 20:06 then rename, unlink, and everything else that changes a vfs name 2008-11-11 20:06 what I am hoping is that the dentry cache is a help here, as I wrote in the post 2008-11-11 20:06 anybody not read the deferred namespace post? 2008-11-11 20:06 so I guess, we play devil's advocate? 2008-11-11 20:07 ACTION did not :-( 2008-11-11 20:07 ACTION did. 2008-11-11 20:07 razvanm, best would be to have a quick scan through right now 2008-11-11 20:07 maze, you will make a fine devil 2008-11-11 20:07 hirofumi is pretty good at that too 2008-11-11 20:08 I'm reading it now... 2008-11-11 20:08 so first, create 2008-11-11 20:08 vfs_create in fact 2008-11-11 20:08 this is the third time we've gone in to look at this one 2008-11-11 20:08 I think 2008-11-11 20:08 should be easy - basically you make it async... 2008-11-11 20:08 and each time, looking at it in a completely different way 2008-11-11 20:09 we have to worry about several things 2008-11-11 20:09 we still need to return EEXIST properly 2008-11-11 20:09 and any other of error that can happen, these cannot be deferred 2008-11-11 20:10 when we decide to defer, we must be cure that the dererred update will complete without error 2008-11-11 20:10 so we need to go far enough along that we know the create will be successfull... won't run out of disk space etc? 
2008-11-11 20:10 yes 2008-11-11 20:11 one thing I don't think we need to worry about is cache memory deadlock 2008-11-11 20:11 but we do have to make a good, conservative upperbound estimate of disk blocks that could be used 2008-11-11 20:11 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L1503 <- here we are 2008-11-11 20:12 another thing we need to worry about is communicating our intentions to the filesystem back end 2008-11-11 20:12 my plan is to link the dentry onto a list of deferred creates 2008-11-11 20:13 it would be nice if we could do that without allocating a new object just for the link 2008-11-11 20:13 so we want to keep eyes open for opportunities to use a link field that already exists 2008-11-11 20:13 you cannot use some link that is already in the dentry? 2008-11-11 20:14 I don't know whether that's possible or not 2008-11-11 20:14 would it be possible to allocate dentries slightly larger than normal? 2008-11-11 20:14 hopefully we can shed some light on that now 2008-11-11 20:14 ie. are we always the ones creating/destroying these dentries? 
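The constraints listed above for a deferred create (report EEXIST immediately, reserve a conservative upper bound of disk blocks so the later update cannot fail, and link the entry onto a list without allocating a separate link object) can be modelled in user space roughly as below. Everything here is hypothetical: `struct pending_create` stands in for a dentry with an embedded link field, `struct dir_state`, `MAX_BLOCKS_PER_CREATE`, `defer_create()` and `flush_creates()` are invented for the sketch, and a real check would also have to consult the on-disk directory, not just the pending list.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a dentry; the link is embedded so deferring allocates nothing. */
struct pending_create {
	const char *name;
	struct pending_create *next;
};

/* Per-directory state for this sketch. */
struct dir_state {
	struct pending_create *deferred; /* creates not yet written to disk */
	unsigned free_blocks;            /* blocks actually free on disk */
	unsigned reserved_blocks;        /* blocks promised to deferred creates */
};

/* Made-up conservative upper bound on blocks one create can consume. */
enum { MAX_BLOCKS_PER_CREATE = 4 };

static int name_is_pending(const struct dir_state *dir, const char *name)
{
	for (const struct pending_create *p = dir->deferred; p; p = p->next)
		if (!strcmp(p->name, name))
			return 1;
	return 0;
}

/* Every error that can happen is reported here; nothing can fail later. */
static int defer_create(struct dir_state *dir, struct pending_create *entry,
			const char *name)
{
	if (name_is_pending(dir, name))
		return -EEXIST;
	if (dir->free_blocks - dir->reserved_blocks < MAX_BLOCKS_PER_CREATE)
		return -ENOSPC;
	dir->reserved_blocks += MAX_BLOCKS_PER_CREATE;
	entry->name = name;
	entry->next = dir->deferred;
	dir->deferred = entry;
	return 0;
}

/* Apply all deferred creates; this sketch charges one block per entry. */
static unsigned flush_creates(struct dir_state *dir)
{
	unsigned flushed = 0;
	while (dir->deferred) {
		dir->deferred = dir->deferred->next;
		dir->reserved_blocks -= MAX_BLOCKS_PER_CREATE;
		dir->free_blocks -= 1;
		flushed++;
	}
	return flushed;
}
```

The reservation is deliberately pessimistic: the flush returns whatever the create did not use, which is the cost of being able to promise at defer time that the update will complete.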
2008-11-11 20:14 dentrie structs can't be customized per-filesystem like inodes can 2008-11-11 20:15 generalized dentry create/destroy doesn't exist _I think_ 2008-11-11 20:15 let's verify that but looking at the dcache interface 2008-11-11 20:15 although it has ->d_fsprivate, iirc 2008-11-11 20:15 ah 2008-11-11 20:16 it was ->d_fsdata 2008-11-11 20:16 well we'd still have to allocate an object to go in that field 2008-11-11 20:16 yes 2008-11-11 20:16 ->private is a linux meme that is pretty lame 2008-11-11 20:16 leads to lots of alloc/free activity 2008-11-11 20:17 theoretically that field could just be the next pointer 2008-11-11 20:17 incidentally, the per-fs customization that we now have for inodes was my contribution 2008-11-11 20:17 (historical note) 2008-11-11 20:18 this issue isn't the biggest issue in the world, but let's just keep it in the backs of our minds 2008-11-11 20:18 opportunistic optimization 2008-11-11 20:18 now what is the home of the dentry operations... dcache.c or dentry.c? 
2008-11-11 20:18 I always forget 2008-11-11 20:19 fs/dcache.c 2008-11-11 20:19 http://lxr.linux.no/linux+v2.6.27/fs/dcache.c 2008-11-11 20:19 lots of d_* :D 2008-11-11 20:19 it's ancient stuff 2008-11-11 20:20 back in the day, there were no source indexers do the d_ was needed to be able to grep 2008-11-11 20:20 there's still some of that going around, but you see most of the newer subsystems leave off that decoration on field names 2008-11-11 20:20 my, dcache.c has grown muchly 2008-11-11 20:21 it was shocking when buffer.c grew over 2K lines 2008-11-11 20:21 now everything is that big or bigger 2008-11-11 20:21 d_alloc is where we need to look right now 2008-11-11 20:22 http://lxr.linux.no/linux+v2.6.27/fs/dcache.c#L902 2008-11-11 20:22 I have it at 912 2008-11-11 20:22 ah 2008-11-11 20:22 you include the comment :) 2008-11-11 20:23 :-) 2008-11-11 20:23 ok, let's look for hooks 2008-11-11 20:23 hello tux3 u 2008-11-11 20:23 don't see any 2008-11-11 20:23 hi bh 2008-11-11 20:23 cute, you even pass in a constructor 2008-11-11 20:23 [not to be mistaken with a const struct] 2008-11-11 20:24 which is a rarely used marginal feature that accounts for a lot of code 2008-11-11 20:24 ? 2008-11-11 20:24 historical note #2: the person who gave us the notion of slab allocation is the same Jeff Bonwick who gave us ZFS 2008-11-11 20:24 or gave somebody ZFS 2008-11-11 20:25 maze, got a url? 2008-11-11 20:25 sorry for what? 2008-11-11 20:25 I'm at 912 2008-11-11 20:25 the constructor you referred to 2008-11-11 20:25 it was a joke... constructor = const struct 2008-11-11 20:25 ah, whoosh 2008-11-11 20:26 everybody know what a qstr is? 
2008-11-11 20:26 little gizmo for passing around strings in kernel 2008-11-11 20:27 interesting that we allocate dentry + extra memory for string if needed, instead of just allocating the appropriate amount of memory 2008-11-11 20:27 I assume this means the extra memory is very rarely needed 2008-11-11 20:27 makes it a little easier to handle strings as ptr/len 2008-11-11 20:27 variable size allocations in kernel quickly lead to bad fragmentation 2008-11-11 20:27 hash + len + ptr 2008-11-11 20:28 kmalloc in general is problematic for that reason 2008-11-11 20:28 but having 64 different sizes of strings, say, mixed together in one heap space would be nasty 2008-11-11 20:29 those qstr thingies also carry a hash value around, to avoid recomputing the string hash 2008-11-11 20:29 but don't we already have slab caches of various sizes... couldn't we simply use those instead of dedicated ones? 2008-11-11 20:29 which is likely as useless optimization 2008-11-11 20:29 I wonder if anybody ever checked 2008-11-11 20:30 maze, it might work out ok 2008-11-11 20:30 save x amount of space 2008-11-11 20:30 be a hero and write that patch 2008-11-11 20:30 or a bigger hero and write tux3 patches :) 2008-11-11 20:31 ok, nothing much to see in d_alloc 2008-11-11 20:31 one of the things we want to watch for, is where d_inode gets assigned 2008-11-11 20:31 allocated and assigned 2008-11-11 20:32 what's with the else of the if (parent) on 958 2008-11-11 20:32 deferred-anthing is mainly about lifetimes of objects 2008-11-11 20:32 why does that happen only without a parent...? 
2008-11-11 20:32 maybe that's the root 2008-11-11 20:33 if so, it's a crime there's no comment 2008-11-11 20:33 yes 2008-11-11 20:33 real root and mount point 2008-11-11 20:33 ah, later on there's the init for the parent exists case at 963 2008-11-11 20:33 961 spin_lock(&dcache_lock); <- I thought we were using rcu in dcache 2008-11-11 20:33 I thought wrong I guess 2008-11-11 20:34 probably was tried and found ineffective 2008-11-11 20:34 while we're looking at this 2008-11-11 20:34 we are using 2008-11-11 20:34 plus the spinlock? 2008-11-11 20:35 yes 2008-11-11 20:35 INIT_LIST_HEAD(&dentry->d_subdirs); <- worth noting this 2008-11-11 20:35 maybe for complex operations 2008-11-11 20:35 weird and wonderful locking as usual 2008-11-11 20:36 I'm not sure what the rule is for d_subdirs 2008-11-11 20:36 surely not all subdirs of every dir dentry are on the list 2008-11-11 20:36 only ones that were opened 2008-11-11 20:36 ok, all open subdirectories 2008-11-11 20:37 all open 2008-11-11 20:37 or all with dentries? 2008-11-11 20:37 d_alias... I used to know what that was for 2008-11-11 20:37 all cached subdirectories? 
2008-11-11 20:37 yes, all cached 2008-11-11 20:38 where a subdir must be cached to be open 2008-11-11 20:38 needed for some search, to check correctness of something 2008-11-11 20:38 I think the d_alias is a list with all the dentry that points to the same inode 2008-11-11 20:39 could be 2008-11-11 20:39 iirc, yes 2008-11-11 20:39 http://lxr.linux.no/linux+v2.6.27/include/linux/dcache.h#L82 <- declaration with comments 2008-11-11 20:40 a dentry is quite a bulky object 2008-11-11 20:40 and we can have millions of them cached 2008-11-11 20:41 let-s get back to the create 2008-11-11 20:41 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L1503 2008-11-11 20:41 let's find out where the inode gets created and assigned to the dentry 2008-11-11 20:42 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L1519 2008-11-11 20:43 1610static int __open_namei_create(struct nameidata *nd, struct path *path, 2008-11-11 20:44 http://lxr.linux.no/linux+v2.6.27/fs/ext2/namei.c#L106 2008-11-11 20:44 1616 if (!IS_POSIXACL(dir->d_inode)) <- inode is valid by here 2008-11-11 20:44 yes, that's it 2008-11-11 20:44 where are you? 2008-11-11 20:45 am I lost, or are you ;-) ? 2008-11-11 20:45 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L1616 2008-11-11 20:45 yes 2008-11-11 20:45 ah, no. it's dir (i.e. parent) 2008-11-11 20:45 thanks 2008-11-11 20:45 that resolves that 2008-11-11 20:45 so the fs allocates the inode and fills it in to the dentry in ->create 2008-11-11 20:45 how'd we get wherever we are ;-) ? 
2008-11-11 20:46 let's find where the inode is actually assigned to the dentry 2008-11-11 20:46 inode->i_op->create() 2008-11-11 20:47 http://lxr.linux.no/linux+v2.6.27/fs/ext2/namei.c#L39 2008-11-11 20:47 43 d_instantiate(dentry, inode); 2008-11-11 20:47 yes 2008-11-11 20:47 let's see what d_instantiate does 2008-11-11 20:48 http://lxr.linux.no/linux+v2.6.27/fs/dcache.c#L980 2008-11-11 20:48 adds it to the alias list 2008-11-11 20:48 everything else is decoration 2008-11-11 20:49 btw, those are inotify and selinux 2008-11-11 20:49 yes 2008-11-11 20:49 it's a shame it had to be two separate hooks 2008-11-11 20:49 not really beautiful 2008-11-11 20:49 ok, this is where the file becomes visible 2008-11-11 20:50 and it's a particularly important point for tux3 2008-11-11 20:50 because we are going to make the new name for the inode visible there without having an inode number yet 2008-11-11 20:50 or a dirent 2008-11-11 20:50 ok, question 2008-11-11 20:51 when we link it onto the list, now do we know there is not already a dentry with the same name on the list? 2008-11-11 20:51 very important question 2008-11-11 20:51 again affects tux3 defer strategy a lot 2008-11-11 20:52 to answer this, we have to go back and look at where vfs calls the filesystem to check for a name collision 2008-11-11 20:52 yes 2008-11-11 20:53 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L519 <- possibly here 2008-11-11 20:53 519 result = dir->i_op->lookup(dir, dentry, nd); 2008-11-11 20:53 in real_lookup 2008-11-11 20:53 483 * We get the directory semaphore, and after getting that we also 2008-11-11 20:53 484 * make sure that nobody added the entry to the dcache in the meantime.. 
2008-11-11 20:53 485 * SMP-safe 2008-11-11 20:53 486 */ 2008-11-11 20:54 so it is the directory semaphore that ensures the name is not added twice 2008-11-11 20:54 obviously, this is a huge bottleneck 2008-11-11 20:54 but that's something to think about in the future when we are so good at filesystem stuff that we can hack core vfs without breaking it 2008-11-11 20:55 anyway, it is good for tux3 deferring things 2008-11-11 20:55 you're saying the dir mutex is a problem? 2008-11-11 20:55 not the dcache lock? 2008-11-11 20:55 do_filp_open -> lock i_mutex of dir -> lookup_hash -> __lookup_hash -> lookup 2008-11-11 20:55 it must be 2008-11-11 20:55 so, i_mutex of dir? 2008-11-11 20:55 hirofumi, that was fast ;) 2008-11-11 20:55 yes 2008-11-11 20:55 so think about a directory with 10 million files 2008-11-11 20:56 and create/delete/rename activity all over it 2008-11-11 20:56 ACTION ?stack overlow 2008-11-11 20:56 it must suck very badly 2008-11-11 20:56 well such big directories are not uncommon any more 2008-11-11 20:57 maybe vm shirinks some of dentries? 2008-11-11 20:57 it evicts ones that aren't open 2008-11-11 20:57 using the dentry lru list 2008-11-11 20:57 but the problem here is just lock contention 2008-11-11 20:58 on lock on 10 millilon objects has to hurt 2008-11-11 20:58 some loads 2008-11-11 20:58 one lock I meant 2008-11-11 20:58 ok, now when we defer, our dirent update will take place outside the lock 2008-11-11 20:59 how do we know that is always ok? 2008-11-11 20:59 (this is a snap question to see who's awake) 2008-11-11 20:59 http://lxr.linux.no/linux+v2.6.27/include/linux/dcache.h#L181 2008-11-11 20:59 there's also a single rename lock 2008-11-11 20:59 yes, also bad 2008-11-11 20:59 but the directory mutex takes the prize I think 2008-11-11 21:00 if it was linked to defered list, it's ok? 2008-11-11 21:00 seqlock... sounds like a rcu thingy at least 2008-11-11 21:00 maze? 2008-11-11 21:00 hmm 2008-11-11 21:00 ? 
2008-11-11 21:01 the question is, how can we be sure that the deferred dirent add won't collide with an existing dentry 2008-11-11 21:01 the seqlock does seem like it only 'breaks' during many renames 2008-11-11 21:01 well... 2008-11-11 21:01 our fs checked its dirent blocks under the parent i_mutex, but by now that lock is dropped 2008-11-11 21:01 with a hash that's hard 2008-11-11 21:02 to answer this question, we have to think about how the dcache works 2008-11-11 21:02 before we dropped the lock, we exposed the new name via the dentry cache 2008-11-11 21:02 we could take a lock on the first place we would search for it in the hash 2008-11-11 21:02 there are _two_ places where the vfs can return EEXIST 2008-11-11 21:03 we could 2008-11-11 21:03 be we don't have to, according to my calculations 2008-11-11 21:03 but I meant 2008-11-11 21:03 let's find the other place 2008-11-11 21:03 um.. negetive dentry? well, let's find 2008-11-11 21:04 opposite of negative 2008-11-11 21:04 ah, yes 2008-11-11 21:04 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L1765 2008-11-11 21:04 if it already exists in dcache 2008-11-11 21:04 yes 2008-11-11 21:05 well 2008-11-11 21:05 it has to d_instantiated 2008-11-11 21:05 and that is the definition of "exists in dcache" I guess 2008-11-11 21:06 until then, it's just a loose object, not connected to anything 2008-11-11 21:06 not quite true 2008-11-11 21:06 it's on the hash list before being d_instantiated I think 2008-11-11 21:06 worth checking that 2008-11-11 21:07 yes 2008-11-11 21:07 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L1950 <- maybe this is the EEXIST we are looking for 2008-11-11 21:08 in vfs_mknod 2008-11-11 21:08 well no 2008-11-11 21:08 nope 2008-11-11 21:08 just during dev creation 2008-11-11 21:08 notice it's in mknod 2008-11-11 21:08 yes 2008-11-11 21:09 ah, vfs_create also does may_create 2008-11-11 21:09 may_create does this job 2008-11-11 21:10 so once the vfs has checked this, it holds the parent lock until the fs 
has d_instantiated the dentry 2008-11-11 21:11 now, what we do in tux3 is prevent the dentry from being evicted 2008-11-11 21:11 taking a count on the underlying inode should be enough 2008-11-11 21:11 dentries also have counts, but I think a pinned inode pins the dentry too 2008-11-11 21:11 again worth checking 2008-11-11 21:12 so as log as we prevent those dentries from being evicted, we can rely on no name collision 2008-11-11 21:12 of course, we still check for a collision when we make the entry 2008-11-11 21:13 because bugs happen 2008-11-11 21:13 ok, that's enough for today 2008-11-11 21:13 :-) 2008-11-11 21:13 I think, pretty deep stuff, no? 2008-11-11 21:13 yup 2008-11-11 21:13 for me yes 2008-11-11 21:14 yes, dcache stuff is complex and buggy 2008-11-11 21:14 so making the decision to take a run at the deferred namespace design required having some idea that the protection we were looking at works the way it does 2008-11-11 21:14 the dcache.c in my emulation is only 92 lines :P 2008-11-11 21:15 hirofumi, I wouldn't say buggy 2008-11-11 21:15 flips, are you sure the deferred namespace allocations don't violate any sus standards? 2008-11-11 21:15 fragile is a better word 2008-11-11 21:15 very easy to break 2008-11-11 21:15 that is why Linux only lets two or three people in the world touch the stuff 2008-11-11 21:15 I know some bugs around ->d_revalidate 2008-11-11 21:15 Linus I meant 2008-11-11 21:15 ah 2008-11-11 21:16 shapor, fairly sure 2008-11-11 21:16 for example, we must return EEXIST as if there were no defer, and ENOENT in the case of deferred delete 2008-11-11 21:17 yeah, i read around a bit, doesn't seem to mention anything specifically 2008-11-11 21:17 those are required 2008-11-11 21:17 right 2008-11-11 21:17 let me see, d_revalidate, I knew what it did a little while ago... 
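The errno rule stated above (EEXIST as if nothing were deferred, ENOENT for a deferred delete) can be sketched with a toy name cache. This is only an illustration, assuming every name of interest is pinned in the cache and appears at most once; `enum name_state`, `struct cached_name`, `check_create` and `check_lookup` are hypothetical names, not tux3 or VFS code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical states for a pinned, cached name (not tux3 structures). */
enum name_state { NAME_ON_DISK, NAME_DEFERRED_CREATE, NAME_DEFERRED_DELETE };

struct cached_name {
	const char *name;
	enum name_state state;
};

/* Find the cache entry for a name, or NULL if it is not cached. */
static const struct cached_name *find_name(const struct cached_name *cache,
	int count, const char *name)
{
	assert(name);
	for (int i = 0; i < count; i++)
		if (!strcmp(cache[i].name, name))
			return &cache[i];
	return NULL;
}

/* A create collides with any visible name, even if its dirent is not on disk yet. */
static int check_create(const struct cached_name *cache, int count, const char *name)
{
	const struct cached_name *found = find_name(cache, count, name);
	return found && found->state != NAME_DEFERRED_DELETE ? -EEXIST : 0;
}

/* A lookup must not see a name whose delete is deferred, even if the dirent is still on disk. */
static int check_lookup(const struct cached_name *cache, int count, const char *name)
{
	const struct cached_name *found = find_name(cache, count, name);
	return found && found->state != NAME_DEFERRED_DELETE ? 0 : -ENOENT;
}
```

In the real dcache the "not cached" case would fall through to the filesystem's own lookup; the sketch treats it as absent only to stay self-contained.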
2008-11-11 21:17 although if you do very little io 2008-11-11 21:17 ah 2008-11-11 21:18 it's directory traversal 2008-11-11 21:18 a create/delete during a sequence of getdents 2008-11-11 21:18 well, I think tux3 will not need it though 2008-11-11 21:18 doesn't it possibly increase the likelyhood of lost data on a hard reboot? 2008-11-11 21:19 hirofumi, at the moment it already has it because we're using ext2 dirops 2008-11-11 21:19 shapor, it's necessary to guarantee that can't happen 2008-11-11 21:19 we have the fallback to last completed delta for that 2008-11-11 21:20 we're allowed to "lose" andything that wasn't synced/fscyned 2008-11-11 21:20 yes i know we are allowed to 2008-11-11 21:20 but it seems like this would increase the likelyhood 2008-11-11 21:20 though of course we do not want to lose 10 seconds worth of adata 2008-11-11 21:20 yeah 2008-11-11 21:20 under light load 2008-11-11 21:20 people would notice that 2008-11-11 21:20 or perhaps a weeks worth 2008-11-11 21:20 under heavy load they probably don't care 2008-11-11 21:20 under a *very* light load 2008-11-11 21:21 people don't fsync because they don't need to 2008-11-11 21:21 taking a home server (like mine) for instance 2008-11-11 21:21 so the plan there is to have a theshold trigger from the underlying device 2008-11-11 21:21 i add a few photos a week maybe 2008-11-11 21:21 but if we're idle we can force a sync 2008-11-11 21:21 plenty to fit in cache for months 2008-11-11 21:21 when it's just about not busy, it causes anything remaining in cache to be flushed out 2008-11-11 21:21 right, we can't forget that 2008-11-11 21:22 how do we detect that? 
2008-11-11 21:22 maze, in fact I just provided a precise definition of "idle" 2008-11-11 21:22 same as forward logging 2008-11-11 21:22 shapor, good question 2008-11-11 21:22 I think we will make a core kernel patch for it 2008-11-11 21:22 and we will also have a work-around way of doing that without the core patch 2008-11-11 21:22 that needs to go on the todo list 2008-11-11 21:22 with forward logging we either have an up2date superblock - or one with a chain 2008-11-11 21:23 one of those hard things to do i think ;) 2008-11-11 21:23 yes it needs to 2008-11-11 21:23 who's maintaining the todo list? ;-) 2008-11-11 21:23 idle is when we have an up2date superblock 2008-11-11 21:24 I don't think that block device low theshold is hard, however there has been relatively unsuccessful attempts to do this nicely in the past 2008-11-11 21:24 and there is a sort-of mechanism for that now 2008-11-11 21:24 uglier than sin, and mainly just used internal to vm 2008-11-11 21:25 congestion stuff? 2008-11-11 21:25 it's a generic concept that should be part of the filesystem interface 2008-11-11 21:25 yes 2008-11-11 21:25 anyway, there is also a workaround 2008-11-11 21:25 not sure that matters 2008-11-11 21:26 which is, after _every_ filesystem operation, we polll our device to see how busy it is 2008-11-11 21:26 we should be scheduling writes with callbacks on complete 2008-11-11 21:26 if the callback doesn't have anything new to schedule to writeout 2008-11-11 21:26 then it should quiesce the fs 2008-11-11 21:26 right 2008-11-11 21:26 we also can check for when we need to initiate more flushing in the completion callbacks 2008-11-11 21:26 quiesce involves getting rid of all deferred stuff 2008-11-11 21:26 sncing superblock etc 2008-11-11 21:27 right 2008-11-11 21:27 though perfect quiesce is unimportant 2008-11-11 21:27 so we're always in one of 2 states 2008-11-11 21:27 writing out data 2008-11-11 21:27 or 2008-11-11 21:27 synced and quiesced 2008-11-11 21:27 we need to get to a 
state where the fs is _recoverable_ to the last commit 2008-11-11 21:27 not necessary to do anything beyond that 2008-11-11 21:27 ...maybe... 2008-11-11 21:28 just because we don't have to write something to disk, doesn't mean we shouldn't 2008-11-11 21:28 well, it's also necssary that we commit everything dirty in a timely manner 2008-11-11 21:28 (so long as we can) 2008-11-11 21:28 in line with my belief 2008-11-11 21:28 there is also laptop mode 2008-11-11 21:28 that will accumulate a _lot_ of dirty data and just hold it in memory 2008-11-11 21:29 right 2008-11-11 21:29 what we probably want to do is repurpose the laptop mode api to be a generic flush-now api 2008-11-11 21:30 and to avoid depending on that core-kernel patch, we implement a mechanism for now that checks whether we should flush at each bio completion, and on completion of each fs operation 2008-11-11 21:31 there are basically 2 knobs 2008-11-11 21:31 how much dirty data we're willing to keep in mem, and how old the eldest unwritten changes are allowed to get 2008-11-11 21:31 note 'willing' - not maximum 2008-11-11 21:31 without laptop mode, they're basically both 0 2008-11-11 21:32 we want to write everything out asap 2008-11-11 21:32 it really should be a generic kernel decision, not in the hands of the fs 2008-11-11 21:32 even though under load it won't happen 2008-11-11 21:32 for now, our strategy is to write everything as soon as we can, unless the block device is backed up 2008-11-11 21:33 I think we saw, the default for generic buffered write is to initiate bio transfer immediately in the write call 2008-11-11 21:33 so this is the expected behavior 2008-11-11 21:35 ok, thursday we will go look at deferred file deleting 2008-11-11 21:36 deferred rename could get interesting, locking wise 2008-11-11 21:38 ACTION says Good night! 
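The two knobs described above (how much dirty data we are willing to hold, and how old the eldest unwritten change may get) could drive a check made at each bio completion and after each fs operation. A minimal sketch, assuming hypothetical names (`struct flush_policy`, `struct flush_state`, `should_flush`) and millisecond timestamps; it is not the tux3 implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical policy knobs; both zero means "write everything asap". */
struct flush_policy {
	uint64_t max_dirty_bytes;	/* dirty data we are willing to hold in memory */
	uint64_t max_age_ms;		/* how old the eldest unwritten change may get */
};

/* Hypothetical snapshot of the filesystem's dirty state. */
struct flush_state {
	uint64_t dirty_bytes;		/* bytes not yet submitted for writeout */
	uint64_t oldest_dirty_ms;	/* time the eldest unwritten change was made */
	bool device_backed_up;		/* block device queue is already full */
};

/* Decide whether to start a flush now; now_ms must not be before oldest_dirty_ms. */
static bool should_flush(const struct flush_policy *policy,
	const struct flush_state *state, uint64_t now_ms)
{
	assert(policy && state);
	if (!state->dirty_bytes)
		return false;	/* nothing deferred: already quiesced */
	if (state->device_backed_up)
		return false;	/* wait for a completion callback instead */
	if (state->dirty_bytes > policy->max_dirty_bytes)
		return true;
	return now_ms - state->oldest_dirty_ms >= policy->max_age_ms;
}
```

With both knobs at zero any dirty data triggers a flush unless the device is backed up, which matches the "write everything as soon as we can" strategy; a laptop-mode style policy would simply raise both values.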
2008-11-11 21:59 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-11 22:08 struct entry { u32 limit_field:8, keylo_field:24; }; <- old style 2008-11-11 22:08 struct entry { char limit[1], keylo[3]; }; <- new style 2008-11-11 22:09 I think u8 is prefer 2008-11-11 22:09 it should not make a difference 2008-11-11 22:09 well 2008-11-11 22:09 ok, it means we can directly read from limit 2008-11-11 22:09 without a wrapper 2008-11-11 22:10 so it's u8 2008-11-11 22:10 i see 2008-11-11 22:10 I was originally going to wrap both of them 2008-11-11 22:10 btw, I'm writing sparse stuff now 2008-11-11 22:10 sparse seems to dislike c99 2008-11-11 22:10 ok, I have maybe half an hour to go on dleaf 2008-11-11 22:11 hmm 2008-11-11 22:11 well 2008-11-11 22:11 it should be taught to like it 2008-11-11 22:11 it's about time Linus moved on to the next century ;) 2008-11-11 22:11 c99 isn't even this century 2008-11-11 22:12 well, now it is starting to work for tux3 2008-11-11 22:13 ok, I'll finish up dleaf 2008-11-11 22:13 hard work 2008-11-11 22:20 static inline unsigned entry_keylo(struct entry *entry) 2008-11-11 22:20 { 2008-11-11 22:20 return from_be_u32(*(be_u32 *)entry) >> 8; 2008-11-11 22:20 } 2008-11-11 22:21 make_entry is trickier because it returns a struct 2008-11-11 22:21 hard to cast an integer to a struct 2008-11-11 22:22 um... 2008-11-11 22:22 make_entry(limit, keylo) 2008-11-11 22:22 { 2008-11-11 22:23 return (struct entry){...}; 2008-11-11 22:23 } 2008-11-11 22:23 ? 2008-11-11 22:23 the hard part is filling in the ... 2008-11-11 22:36 static inline struct entry make_entry(tuxkey_t keylo, unsigned limit) 2008-11-11 22:36 { 2008-11-11 22:36 return (struct entry){ to_be_u32(keylo) << 8 | limit }; 2008-11-11 22:36 } 2008-11-11 22:36 and the struct has a single 32 bit field that isn't referenced 2008-11-11 22:37 does seem to work 2008-11-11 22:37 um.. to_be_u32(keylo << 8 | limit)? 
2008-11-11 22:40 better 2008-11-11 22:40 well is it the same 2008-11-11 22:40 maybe it has difference 2008-11-11 22:40 limit is supposed to be the zeroth byte 2008-11-11 22:40 yes 2008-11-11 22:41 I think mine is wrong 2008-11-11 22:41 doesn't work on be arch 2008-11-11 22:41 yes 2008-11-11 22:42 yours is wrong too if we want the limit to be in the zeroth byte 2008-11-11 22:42 to_be_u32(keylo | (limit << 24)) 2008-11-11 22:43 ah, yes 2008-11-11 22:43 what a mess huh? 2008-11-11 22:43 it would be way better if they fixed gcc 2008-11-11 22:43 this is barbaric 2008-11-11 22:44 I think we need to use #ifdef __ENDIAN for it 2008-11-11 22:45 not in this case 2008-11-11 22:45 if we want to do like iphdr 2008-11-11 22:46 iphdr uses the bitfields 2008-11-11 22:46 I have no idea whether that works 2008-11-11 22:47 although I'm not sure, I think it may work 2008-11-11 22:48 well this is almost completed 2008-11-11 22:48 using something we're sure works 2008-11-11 22:49 yes 2008-11-11 22:50 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-11 22:52 static inline unsigned group_keyhi(struct group *group) 2008-11-11 22:52 { 2008-11-11 22:52 return from_be_u32(*(be_u32 *)group) & 0xffffff; 2008-11-11 22:52 } 2008-11-11 22:53 looks good 2008-11-11 22:54 on thing: with extents we might want to make the number of bits for the version vs extent block address variable 2008-11-11 22:54 which is impossible to do with bit fields 2008-11-11 22:55 umm... not sure, but may work 2008-11-11 22:55 there is no such thing as a variable offset or variable width bit field 2008-11-11 22:56 variable width? 
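One way to sidestep the shift-versus-byte-swap confusion above is to build the entry byte by byte, so the layout is the same on any host. A minimal sketch, assuming limit lives in byte zero and keylo in the following three bytes, most significant first; the byte-array struct and `entry_limit` are illustrative and not the code that was checked in.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed on-disk layout: byte 0 = limit, bytes 1..3 = keylo, big endian. */
struct entry { uint8_t bytes[4]; };

/* Pack a 24 bit keylo and an 8 bit limit without depending on host byte order. */
static inline struct entry make_entry(uint32_t keylo, unsigned limit)
{
	assert(keylo < (1u << 24) && limit < 256);
	return (struct entry){ .bytes = {
		(uint8_t)limit,
		(uint8_t)(keylo >> 16),
		(uint8_t)(keylo >> 8),
		(uint8_t)keylo,
	} };
}

/* limit is a single byte, so it can be read directly without a conversion. */
static inline unsigned entry_limit(const struct entry *entry)
{
	return entry->bytes[0];
}

/* Reassemble keylo from its three big endian bytes. */
static inline uint32_t entry_keylo(const struct entry *entry)
{
	return (uint32_t)entry->bytes[1] << 16 |
		(uint32_t)entry->bytes[2] << 8 |
		entry->bytes[3];
}
```

Because the field boundaries are plain shifts rather than bit field declarations, the split between the two fields could later be made variable, which bit fields cannot express.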
2008-11-11 22:56 yes, as in variable number of bits for the version 2008-11-11 22:56 it might be nice 2008-11-11 22:57 ah, it may be benefit of current way 2008-11-11 23:01 oh drat, now I'm paying for making extent_count and extent_block take a struct instead of *struct 2008-11-11 23:01 I probalby better change that now 2008-11-11 23:03 ah, yes 2008-11-11 23:04 btw, do you know "make balloctest" error? 2008-11-11 23:15 no 2008-11-11 23:15 indeed 2008-11-11 23:15 I guess it's not in the makefile test list 2008-11-11 23:15 yes 2008-11-11 23:15 have you found the bug? 2008-11-11 23:16 no, I found it when I tested my patches 2008-11-11 23:16 add to todo list 2008-11-11 23:31 ah, ok 2008-11-11 23:32 in balloc.c:322 2008-11-11 23:32 struct dev *dev = &(struct dev){ .bits = 3 }; 2008-11-11 23:32 dev->bits is 3 2008-11-11 23:32 so, buffer->data is 8bytes 2008-11-11 23:38 that's a little small 2008-11-11 23:38 any reason it should not work? 2008-11-11 23:39 one of it is in set_bits()? 2008-11-11 23:39 unsigned loff = start >> 3, roff = (limit >> 3) - 1; 2008-11-11 23:39 ? 2008-11-11 23:39 add "- 1" to roff 2008-11-11 23:39 that has worked pretty well in the past 2008-11-11 23:40 but yes it might have a bug 2008-11-11 23:40 I thought I saw lots, that turned out to be nonbugs 2008-11-11 23:40 in that case, roff == 8 2008-11-11 23:41 I'm still not sure where is bug... 2008-11-11 23:43 I think it is the new buffer paranoia that exposed this 2008-11-11 23:44 yes 2008-11-11 23:44 it allocates only 8bytes if ->bits == 3 2008-11-11 23:46 start = 49 count = 16 roff = 8 2008-11-11 23:46 yes, bitmap[roff] is out of buffer 2008-11-11 23:47 limit = 64 2008-11-11 23:47 that's out of range 2008-11-11 23:47 caller is wrong? 2008-11-11 23:48 49 + 16 is 55, not 64 2008-11-11 23:48 65? 
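The corner case being chased here can be reproduced in a few lines: when start + count lands exactly on the end of the bitmap, the right-hand byte offset is one past the buffer and the right-hand mask is zero. A minimal sketch, assuming bit n lives in byte n >> 3 at bit position n & 7 and adding an explicit size argument for the bounds check; it is a reconstruction for illustration, not the actual balloc.c code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Set bits [start, start + count) in a bitmap of `size` bytes (assumed LSB-first bit order). */
static void set_bits(uint8_t *bitmap, unsigned size, unsigned start, unsigned count)
{
	unsigned limit = start + count;
	assert(limit <= size * 8);
	if (!count)
		return;
	unsigned loff = start >> 3, roff = limit >> 3;
	uint8_t lmask = (uint8_t)(0xff << (start & 7));		/* bits from start to end of first byte */
	uint8_t rmask = (uint8_t)~(0xff << (limit & 7));	/* bits below limit in the last byte */
	if (loff == roff) {
		bitmap[loff] |= lmask & rmask;
		return;
	}
	bitmap[loff] |= lmask;
	memset(bitmap + loff + 1, 0xff, roff - loff - 1);	/* whole bytes in between, possibly none */
	if (rmask)	/* zero mask means limit is byte aligned and roff may be past the buffer */
		bitmap[roff] |= rmask;
}
```

With an 8 byte buffer, start = 63 and count = 1 give roff = 8 and rmask = 0, so the guarded store never touches the byte past the end.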
2008-11-11 23:48 oh 2008-11-11 23:48 sorry 2008-11-11 23:49 >>> limit = 64 start = 63 count = 1 roff = 8 2008-11-11 23:49 getting close 2008-11-11 23:49 yes 2008-11-11 23:50 this is not one of those pieces of code where you just want to make it work 2008-11-11 23:50 it has to be understood, and that's not really easy 2008-11-11 23:51 let's see what rmask is 2008-11-11 23:51 0, that's good 2008-11-11 23:52 at first, can we see "start" and "count"? 2008-11-11 23:52 see? 2008-11-11 23:52 oh 2008-11-11 23:52 start - start bit to set? 2008-11-11 23:52 >>> limit = 64 start = 63 count = 1 roff = 8 rmask = 0 2008-11-11 23:52 there they all are 2008-11-11 23:54 if (rmask) bitmap[roff] |= rmask; 2008-11-11 23:55 that fixes it 2008-11-11 23:55 might even be correct 2008-11-11 23:55 ah, i see 2008-11-11 23:56 if (roff != loff) memset(bitmap + loff + 1, -1, roff - loff - 1);? 2008-11-11 23:57 the memset is fine 2008-11-11 23:57 I'm not sure memset(,, 0) is work? 2008-11-11 23:57 or not 2008-11-11 23:57 yes, it must 2008-11-11 23:57 let's see what the man page says 2008-11-11 23:58 would be very broken if it didn't handle zero 2008-11-11 23:58 ok 2008-11-11 23:59 clean_bits needs if(rmask)? 2008-11-12 00:00 yes 2008-11-12 00:01 but not quite that test 2008-11-12 00:03 maybe if (~rmask) 2008-11-12 00:04 no 2008-11-12 00:05 yes, maybe if (rmask) 2008-11-12 00:05 maybe if (~rmask & 0xff) 2008-11-12 00:05 hmm 2008-11-12 00:05 yes 2008-11-12 00:05 you're right 2008-11-12 00:06 how what about all_set 2008-11-12 00:07 I'm also looking it 2008-11-12 00:07 yuck 2008-11-12 00:07 ok, I will look at the endian conversions some more 2008-11-12 00:07 yes 2008-11-12 00:57 flips, still here? 2008-11-12 00:57 yes 2008-11-12 00:57 sparse work done 2008-11-12 00:58 cool 2008-11-12 00:58 some of patches may be need to rethink 2008-11-12 00:58 my patches from earlier today? 2008-11-12 00:58 no, no. 
my patches 2008-11-12 00:59 some patches changes codes for sparse 2008-11-12 00:59 my changes to big endian in dleaf seem to be correct, but it exposed a bug in filemap.c 2008-11-12 00:59 filemap.c code is really hard to work on 2008-11-12 01:00 the bug is in writing to the last 6 bytes of a 1 exabyte file ;) 2008-11-12 01:01 actually, reading it back, the write seems to have worked fine 2008-11-12 01:02 at least for now, userspace is not needed to work perfetct I hope 2008-11-12 01:03 filemap.c better be very reliable 2008-11-12 01:03 the same bug will show in kernel 2008-11-12 01:04 oh, we need to work for it 2008-11-12 01:04 btw, I put sparse stuff to http://userweb.kernel.org/~hirofumi/tux3/ 2008-11-12 01:05 please check it later 2008-11-12 01:07 probably most bad patch is "Don't use the variable-length array" 2008-11-12 01:08 7 changesets 2008-11-12 01:08 yes 2008-11-12 01:09 some dleaf changes will conflict with mine 2008-11-12 01:10 oh wait 2008-11-12 01:10 that is my change :) 2008-11-12 01:10 well, don't care conflict, current one basically is for review 2008-11-12 01:11 I'll fix if there is conflicts 2008-11-12 01:13 I think, with those, we can check some endian bugs like 2008-11-12 01:14 warning: incorrect type in initializer (different base types) 2008-11-12 01:14 does sparse actually rely on having the symbol __bitwise__ or do we get to choose? 2008-11-12 01:14 attribute((bitwise))? 2008-11-12 01:15 bitwise is a made up attribute too isn't it? 2008-11-12 01:15 what does it mean, big endian or little endian? 2008-11-12 01:16 ah, it is actually not endian check 2008-11-12 01:17 sparse thinks it is special type 2008-11-12 01:17 so, we can't some assign to other type 2008-11-12 01:17 sparse doesn't understand typeof? 2008-11-12 01:18 without __force cast 2008-11-12 01:18 I see I think 2008-11-12 01:18 I don't know, maybe gcc typeof extenstion is not supported 2008-11-12 01:19 required std=qnu99 2008-11-12 01:19 gnu99 2008-11-12 01:19 it is gnu99 feature? 
2008-11-12 01:20 it's normally available in gcc unless you specify -std=c99 2008-11-12 01:20 if you want c99 functionality, like inline decls, and typeof, then you need -std=gnu99 2008-11-12 01:21 ah, yes 2008-11-12 01:21 i see 2008-11-12 01:22 I see you fixed the makefile quite a lot 2008-11-12 01:22 like making things depend on .o instead of .c 2008-11-12 01:22 yes 2008-11-12 01:22 to avoid add sparse rule everywhere 2008-11-12 01:23 -Dmain=notmain <- I wonder how that got into the makefile 2008-11-12 01:25 you added a new test for the corner case in balloc I see 2008-11-12 01:25 everything looks good 2008-11-12 01:25 when you you want it merged? 2008-11-12 01:26 maybe, after your modify is checked in 2008-11-12 01:26 the use of __ in sparse symbols is going beyond strange 2008-11-12 01:26 is that something we can choose, or does sparse need these exactly symbols? 2008-11-12 01:26 it's just #define 2008-11-12 01:26 it would be nice if the names were more meaningful 2008-11-12 01:27 just adding underbars doesn't say much 2008-11-12 01:27 any ideas? 
2008-11-12 01:27 no, because I don't know what they're for 2008-11-12 01:27 maybe ideas by tomorrow after I've read about sparse 2008-11-12 01:27 ok 2008-11-12 01:28 anyway, everybody else's sparse stuff looks just as strange, so it's nothing new 2008-11-12 01:28 :) 2008-11-12 01:29 yes 2008-11-12 01:29 Redefining main is evil though 2008-11-12 01:29 hi maarten 2008-11-12 01:29 it is 2008-11-12 01:29 hello 2008-11-12 01:29 I'm surprised it lasted this long 2008-11-12 01:30 it was just supposed to be a temporary hack to get code written quickly 2008-11-12 01:30 but it proved too useful 2008-11-12 01:30 Heheh 2008-11-12 01:30 It's a lot easier if you're not actually running the code 2008-11-12 01:31 the unit tests from the dozen main routines or so have caught dozens of bugs 2008-11-12 01:31 Nice 2008-11-12 01:32 just caught another really hard one, I think I'd better get some sleep first 2008-11-12 01:32 /usr/bin/hibernate.sh 2008-11-12 01:32 me too 2008-11-12 01:33 Although that's just a C program that writes 'disk' to /sys/power/state :) 2008-11-12 01:33 hirofumi, where are you? 
2008-11-12 01:33 in japan 2008-11-12 01:33 18:33 2008-11-12 01:34 At least in my case, I destroyed the original scripts that would do an awful lot of stuff that's not necessary any more with new kernels 2008-11-12 01:34 I'm not sleeping from yesterday ;) 2008-11-12 01:34 one day ahead 2008-11-12 01:35 Nah 2008-11-12 01:35 Even in california it should be 1:35 or so 2008-11-12 01:35 hirofumi is 2008-11-12 01:35 nearly 2008-11-12 01:35 well 2008-11-12 01:35 true 2008-11-12 01:35 it's already tomorrow 2008-11-12 01:39 good night 2008-11-12 01:40 good night 2008-11-12 01:58 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-12 06:19 -!- pgquiles__(~pgquiles@18.Red-81-39-35.dynamicIP.rima-tde.net) has joined #tux3 2008-11-12 06:48 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-12 07:32 -!- mingming(~mingming@c-71-193-163-244.hsd1.or.comcast.net) has joined #tux3 2008-11-12 08:53 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-12 09:13 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-12 09:44 -!- pgquiles__(~pgquiles@18.Red-81-39-35.dynamicIP.rima-tde.net) has joined #tux3 2008-11-12 10:17 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-12 10:42 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-12 12:08 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-12 13:13 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-12 14:05 found/fixed a bug in xattrs, was really gross 2008-11-12 14:06 was using the file with the xattr for the atom table instead of the proper atom table 2008-11-12 14:07 now, a bug to fix in filemap.c extent decoding 2008-11-12 14:07 exposed by the endian change 2008-11-12 14:08 tux3 now has a stacktrace function 2008-11-12 14:18 void stacktrace(void) 2008-11-12 14:18 { 2008-11-12 14:18 void *array[100]; 2008-11-12 14:18 size_t 
size = backtrace(array, 100); 2008-11-12 14:18 printf("_______stack ______\n"); 2008-11-12 14:18 backtrace_symbols_fd(array, size, 2); 2008-11-12 14:18 } 2008-11-12 14:46 the filemap bug is in not handling an extent that overlaps the end of the IO range 2008-11-12 14:47 will skate then fix 2008-11-12 16:16 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-12 18:36 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-12 19:18 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-12 19:52 hirofumi, there? 2008-11-12 19:53 hi 2008-11-12 19:53 I finished that round of debugging and checked in the dleaf endian conversions 2008-11-12 19:54 ok, I'll pull 2008-11-12 19:54 dleaf is all done except for the ->free and ->used fields, which I will leave in cpu endian because they are going to go away 2008-11-12 19:55 ok 2008-11-12 20:07 -!- RazvanM(~RazvanM@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-12 20:09 ok, merge is done 2008-11-12 20:09 ready for a pull? 2008-11-12 20:09 sparse seems to warn many lines 2008-11-12 20:09 yes 2008-11-12 20:09 http://userweb.kernel.org/~hirofumi/tux3/ 2008-11-12 20:11 done 2008-11-12 20:11 how do I run the sparse tests? 2008-11-12 20:12 make C=1 2008-11-12 20:13 if we didn't specify C=1, it doesn't run sparse 2008-11-12 20:14 sparse isn't in etch repo 2008-11-12 20:14 I will do something about that sometime, but for now, why not just post your sparse warnings to the mailing list? 
2008-11-12 20:15 ok, I'll post it 2008-11-12 20:16 I curious to see what sparse hates about my code :) 2008-11-12 20:21 I posted the logs for current codes 2008-11-12 20:21 it seems to be useful more or less 2008-11-12 20:22 I'll look at balloc.c first 2008-11-12 20:23 ok 2008-11-12 20:23 dleaf is big, so I'll see iattr.c 2008-11-12 20:24 s/big/many warnings/ 2008-11-12 20:24 it's right about disksuper->volblocks, endian conversion is missing 2008-11-12 20:25 yes, I think sparse is useful for this 2008-11-12 20:26 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-12 20:26 I'll install from upstream 2008-11-12 20:27 oh, sparse is introduced at lenny 2008-11-12 20:27 I'm installing it to ~/bin though 2008-11-12 20:28 I meant, I'll install from source 2008-11-12 20:28 ah 2008-11-12 20:30 if x86_64, I can send binary 2008-11-12 20:33 it's no problem to grab it and compile 2008-11-12 20:34 ok, there's a patch for one warning in dleaf.c 2008-11-12 20:36 whoops, I failed to check in the actual endia conversions for dleaf.c 2008-11-12 20:38 did sparse warns it? if so, I'm happy 2008-11-12 20:38 it did 2008-11-12 20:38 that's how I noticed 2008-11-12 20:39 btw, what is disksuper->magic format? 2008-11-12 20:39 I thought it was #defined 2008-11-12 20:39 yes 2008-11-12 20:40 it's supposed to be char[something] 2008-11-12 20:40 it is "char magic[]" 2008-11-12 20:40 yes 2008-11-12 20:40 in inode.c, from_be_u64(*(u64 *)disk->magic) 2008-11-12 20:40 big endian? 2008-11-12 20:40 yes 2008-11-12 20:40 u64 big endian? 2008-11-12 20:41 oh 2008-11-12 20:41 that matches char array order 2008-11-12 20:41 so... it can be changed to be_u64 2008-11-12 20:42 I think little and big is same in the case of it 2008-11-12 20:42 except for being backwards 2008-11-12 20:42 we will use big endian everywhere on disk 2008-11-12 20:42 umm.. 
it's u8 magic[8]; 2008-11-12 20:43 if it's u64 magic, I think it has difference 2008-11-12 20:44 ah, it is just the printf that treats it as a 64 bit word 2008-11-12 20:44 that was dumb of me 2008-11-12 20:45 so, u8 magic[8] is correct on disk format? 2008-11-12 20:45 sure 2008-11-12 20:45 I suppose the way I wrote it is the easiest way to show it in hex 2008-11-12 20:45 which is the way it makes most sense 2008-11-12 20:45 maybe not so dumb 2008-11-12 20:45 yes, however sparse warns about it 2008-11-12 20:46 even if you change it go (be_u64 *) ? 2008-11-12 20:46 if be_u64, it's ok 2008-11-12 20:46 ok, well that's correct 2008-11-12 20:46 it's ugly, but short 2008-11-12 20:51 #define SB_MAGIC (*(u64 *)(u8[]){ 't', 'u', 'x', '3', 0xdd, 0x08, 0x09, 0x06 }) 2008-11-12 20:52 this seems to work 2008-11-12 20:52 cpu endian #define though 2008-11-12 20:52 yes 2008-11-12 20:53 without the *(u64 *), what breaks? 2008-11-12 20:53 I checked in the struct extent endian conversions 2008-11-12 20:53 in dleaf.c, which should get rid of a few warnings 2008-11-12 20:53 it can't magic == SB_MAGIC? 2008-11-12 20:54 no, it uses memcmp 2008-11-12 20:54 anything wrong with that? 2008-11-12 20:55 if we use be_u64 magic, we need to convert to cpu endian? 2008-11-12 20:55 the be_u64 is only in a warning 2008-11-12 20:56 you meant, warn("invalid superblock [%Lx]", (L)from_be_u64(*(__force be_u64 *)disk->magic)); 2008-11-12 20:57 ? 2008-11-12 20:57 yes 2008-11-12 20:57 ok 2008-11-12 20:57 what is the __force for? 2008-11-12 20:57 suppress sparse warn 2008-11-12 20:57 I wonder why sparse warns about that 2008-11-12 20:58 normal cast is not enough for __bitwise type 2008-11-12 20:58 hmm, a debatable design decision but oh well 2008-11-12 20:59 we will allow sparse to __force us to do things it's way ;) 2008-11-12 21:00 is bitwise a synomym for endian? 
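For readers unfamiliar with the annotations under discussion: sparse defines `__CHECKER__` when it runs, and projects usually map `__bitwise` and `__force` to sparse attributes only in that case, so gcc sees plain integers. A minimal sketch of that arrangement, with byte-order-independent conversions and a helper for printing the magic as one hex word; `magic_as_u64` is a hypothetical helper, and the conversion bodies are illustrative rather than tux3's.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#ifdef __CHECKER__	/* defined by sparse, never by gcc */
#define __bitwise __attribute__((bitwise))
#define __force __attribute__((force))
#else
#define __bitwise
#define __force
#endif

/* A distinct type for sparse, an ordinary 64 bit integer for the compiler. */
typedef uint64_t __bitwise be_u64;

/* Interpret the stored bytes as big endian, whatever the host byte order. */
static inline uint64_t from_be_u64(be_u64 val)
{
	const uint8_t *p = (const uint8_t *)&val;
	uint64_t cpu = 0;
	for (int i = 0; i < 8; i++)
		cpu = cpu << 8 | p[i];
	return cpu;
}

/* Store a cpu value most significant byte first. */
static inline be_u64 to_be_u64(uint64_t cpu)
{
	be_u64 val;
	uint8_t *p = (uint8_t *)&val;
	for (int i = 7; i >= 0; i--, cpu >>= 8)
		p[i] = (uint8_t)cpu;
	return val;
}

/* View an 8 byte magic as one word for hex output; memcpy avoids alignment trouble. */
static inline uint64_t magic_as_u64(const uint8_t magic[8])
{
	be_u64 raw;
	assert(magic);
	memcpy(&raw, magic, sizeof raw);
	return from_be_u64(raw);
}
```

Reading the magic as big endian makes the printed hex digits appear in the same order as the bytes of the char array, which is why the big endian view is convenient in the warning message.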
2008-11-12 21:00 whoops, I was just wrong 2008-11-12 21:00 oh good :) 2008-11-12 21:00 (be_u64 *) is enough 2008-11-12 21:01 ah, it is pointer, so it's ok 2008-11-12 21:02 bitwise is not synomym for endian 2008-11-12 21:02 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-12 21:02 bitwise can use for another perpose 2008-11-12 21:02 e.g. gfp_t is __bitwise type, iirc 2008-11-12 21:03 with it, we can't pass "int" to it 2008-11-12 21:04 it seems like a fine thing, but the word itself is misleading 2008-11-12 21:04 does not seem to have anything to do with bits 2008-11-12 21:05 I don't care about work, because I'm not familiar to english :) 2008-11-12 21:05 you're lucky :) 2008-11-12 21:05 s/work/word/ 2008-11-12 21:05 ;) 2008-11-12 21:15 sparse likes to put itself into your home directory 2008-11-12 21:15 by default 2008-11-12 21:15 a bit of a surprise 2008-11-12 21:15 yes 2008-11-12 21:16 and it puts itself into about 6 different directories there 2008-11-12 21:16 with no uninstall 2008-11-12 21:16 actually I put symlink it from ~/bin/ 2008-11-12 21:17 I guess there are not many people using it 2008-11-12 21:17 oh, yes 2008-11-12 21:17 yes 2008-11-12 21:18 I think so 2008-11-12 21:18 and recently it has no update 2008-11-12 21:18 so somebody can adopt it and fix the "bitwise" terminology :) 2008-11-12 21:19 :) 2008-11-12 21:19 if you do sudo make install like most people do, it installs itself with root permission in your home directory 2008-11-12 21:20 oh 2008-11-12 21:23 well, I'm finding good static analyzer of opensource, unfortunately I can't find 2008-11-12 21:23 I'm sure sparse will be the one 2008-11-12 21:23 it's just a little rough right now 2008-11-12 21:24 yes 2008-11-12 21:24 if gcc allow plugin, I think it solves many things 2008-11-12 21:25 sudo make HOME=/usr install <- this works 2008-11-12 21:25 yes, that is the right way 2008-11-12 21:25 ok, sparse works for me now 2008-11-12 21:26 please let me know which files 
you would like to leave for me 2008-11-12 21:26 ACTION goes for dinner 2008-11-12 21:26 iattr.c and inode.c was done 2008-11-12 21:27 I'll work for xattr.c next 2008-11-12 22:27 I will do some more of dleaf.c then 2008-11-12 22:27 ok 2008-11-12 22:37 return (from_be_u64(*(be_u64 *)&extent >> 48) & 0x3f) + 1; <- I don't see anything wrong with this 2008-11-12 22:39 which file? 2008-11-12 22:39 dleaf.c 2008-11-12 22:39 94? 2008-11-12 22:40 91 actually, but same thing 2008-11-12 22:42 return ((from_be_u64(*(be_u64 *)&extent) >> 48) & 0x3f) + 1;? 2008-11-12 22:42 it shifts be value 2008-11-12 22:43 ah 2008-11-12 22:43 yes, sparse is right 2008-11-12 22:43 good ;) 2008-11-12 22:44 is balloc.c alread done? 2008-11-12 22:44 s/alread/already/ 2008-11-12 22:45 I did not do anything to it 2008-11-12 22:45 ok, I'll do 2008-11-12 22:48 sparse also complains about some things in libc headers 2008-11-12 22:48 what is it? 2008-11-12 22:50 I may modified sparse 2008-11-12 22:50 my sparse 2008-11-12 22:50 there is no warn about libc 2008-11-12 22:51 /usr/include/string.h:420:6: warning: undefined preprocessor identifier '__USE_FORTIFY_LEVEL' 2008-11-12 22:53 maybe new libc is not warned 2008-11-12 22:53 probably wasn't tested with this 2008-11-12 22:54 anyway it does not bother me 2008-11-12 22:54 yes, probaby bug of sparse 2008-11-12 22:54 warning: mixing declarations and code <- you turned this off with a patch? 2008-11-12 22:54 yes 2008-11-12 22:55 I think -Wno-declaration-after-statement is for it 2008-11-12 23:05 that turns the warning on or off? 
2008-11-12 23:06 -W"no" is off, -W is on
2008-11-12 23:06 ok, my gcc must not support it
2008-11-12 23:06 I patched sparse to turn it off
2008-11-12 23:06 eh
2008-11-12 23:07 it should be passed to sparse only
2008-11-12 23:07 didn't work for me, and in the sparse source I did not see any control over that option
2008-11-12 23:08 Wdeclarationafterstatement in sparse/libc.c is control it
2008-11-12 23:08 ah, my sparse is from git
2008-11-12 23:09 --- parse.c.old 2008-11-12 23:08:48.000000000 -0800
2008-11-12 23:09 +++ parse.c 2008-11-12 22:59:26.000000000 -0800
2008-11-12 23:09 @@ -1804,7 +1804,7 @@
2008-11-12 23:09 break;
2008-11-12 23:09 if (lookup_type(token)) {
2008-11-12 23:09 if (seen_statement) {
2008-11-12 23:09 - warning(token->pos, "mixing declarations and code");
2008-11-12 23:09 + //warning(token->pos, "mixing declarations and code");
2008-11-12 23:09 seen_statement = 0;
2008-11-12 23:09 }
2008-11-12 23:09 stmt = alloc_statement(token->pos, STMT_DECLARATION);
2008-11-12 23:09 seen_statement is controled by Wdeclarationafterstatement
2008-11-12 23:10 } else {
2008-11-12 23:10 seen_statement = Wdeclarationafterstatement;
2008-11-12 23:10 token = statement(token, &stmt);
2008-11-12 23:10 }
2008-11-12 23:10 I wonder why I still see the warnings then
2008-11-12 23:10 because you pass Wno-... in the Makefile
2008-11-12 23:10 if (lookup_type(token)) {
2008-11-12 23:10 if (seen_statement) {
2008-11-12 23:10 warning(token->pos, "mixing declarations and code");
2008-11-12 23:11 seen_statement = 0;
2008-11-12 23:11 }
2008-11-12 23:11 stmt = alloc_statement(token->pos, STMT_DECLARATION);
2008-11-12 23:11 token = external_declaration(token, &stmt->declaration);
2008-11-12 23:11 } else {
2008-11-12 23:11 seen_statement = Wdeclarationafterstatement;
2008-11-12 23:11 token = statement(token, &stmt);
2008-11-12 23:11 }
2008-11-12 23:11 ?
2008-11-12 23:11 not in this version
2008-11-12 23:11 ah, i see
2008-11-12 23:12 anyway, the next thing that breaks, I will get the git version
2008-11-12 23:12 linux/kernel/git/josh/sparse.git
2008-11-12 23:12 thanks
2008-11-12 23:12 git://git.kernel.org/pub/scm/linux/kernel/git/josh/sparse.git
2008-11-12 23:14 flips, you've got mail
2008-11-12 23:30 ok, I have sparse installed from git
2008-11-12 23:31 still gives warning: undefined preprocessor identifier '__USE_FORTIFY_LEVEL'
2008-11-12 23:31 probably because it doesn't know about the older libc headers
2008-11-12 23:31 does not bother me enough to fix it
2008-11-12 23:31 good
2008-11-12 23:33 ok, I suppose I should do btree.c
2008-11-12 23:33 and get rid of ->used and ->free, and then we will have nearly all the endian work done
2008-11-12 23:34 yes
2008-11-12 23:36 ah, also ileaf
2008-11-12 23:43 yes
2008-11-12 23:44 also atom tables and dirents
2008-11-12 23:44 ah
2008-11-12 23:44 ah, maybe you did atoms in xattr.c?
2008-11-12 23:44 probably not
2008-11-12 23:45 removed warns, but not checked
2008-11-12 23:45 it's need to use be_u* or not
2008-11-12 23:45 yes, be_*
2008-11-12 23:47 some of the atom stuff is done
2008-11-12 23:47 by you I think
2008-11-12 23:48 @@ -444,8 +444,8 @@ int main(int argc, char *argv[])
2008-11-12 23:48 struct inode *inode = &(struct inode){ .sb = sb,
2008-11-12 23:48 .map = map, .i_mode = S_IFDIR | 0x666,
2008-11-12 23:48 .present = abits, .i_uid = 0x12121212, .i_gid = 0x34343434,
2008-11-12 23:48 - .btree = { .root = { .block = 0xcaba1f00d, .depth = 3 } },
2008-11-12 23:48 - .i_ctime = 0xdec0debead, .i_mtime = 0xbadfaced00d };
2008-11-12 23:48 + .btree = { .root = { .block = 0xcaba1f00dULL, .depth = 3 } },
2008-11-12 23:48 + .i_ctime = 0xdec0debeadULL, .i_mtime = 0xbadfaced00dULL };
2008-11-12 23:48 map->inode = inode;
2008-11-12 23:48 sb->atable = inode;
2008-11-12 23:48
2008-11-12 23:48 basically only it for xattr.c
2008-11-12 23:48 all the atom stuff is in xattr.c, and it looks
done 2008-11-12 23:49 just tell me when to pull 2008-11-12 23:49 probaby, atom_t is be? 2008-11-12 23:49 currently it seems u32 2008-11-12 23:49 no, only on-disk structs are be 2008-11-12 23:49 well 2008-11-12 23:50 most uses of atom_t are cpu 2008-11-12 23:50 i see 2008-11-12 23:51 we have to decide whether the xcache is in be_ or not 2008-11-12 23:51 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-12 23:52 ok, the xcache is in cpu form 2008-11-12 23:53 and we convert in encode_attrs and decode_attrs 2008-11-12 23:53 just as the in-memory inode is in cpu form 2008-11-12 23:53 i see 2008-11-12 23:54 attrs = decode16(attrs, &atom); <- this is the endian decode for on-disk atoms 2008-11-12 23:55 so almost all the endian work is already done for inode attributes 2008-11-12 23:55 ok 2008-11-12 23:57 lorefs and hirefs in dump_atoms() may be be_u16... 2008-11-12 23:58 yes 2008-11-13 00:00 btw, I'll work for valgrind error of filemap.c before more endian work 2008-11-13 00:00 the lost data? 
2008-11-13 00:01 I don't know yet, I just did "make tests"
2008-11-13 00:01 tests: balloctest dleaftest ileaftest btreetest dirtest iattrtest xattrtest filemaptest inodetest
2008-11-13 00:01 with above change
2008-11-13 00:02 filemap_extent_io: need 8 data and 8 index bytes
2008-11-13 00:02 filemap_extent_io: need 16 bytes, 248 bytes free
2008-11-13 00:02 filemap_extent_io: pack 0x0 => 0/1
2008-11-13 00:02 filemap_extent_io: extent 0x0/1 => 0
2008-11-13 00:02 filemap_extent_io: block 0x0 => 0
2008-11-13 00:02 ==17083== Use of uninitialised value of size 8
2008-11-13 00:02 ==17083== at 0x4E69909: (within /lib/libc-2.7.so)
2008-11-13 00:02 ==17083== by 0x4E6C7DB: vfprintf (in /lib/libc-2.7.so)
2008-11-13 00:02 ah, that could easily introduce a flaw if the change isn't make everywhere
2008-11-13 00:02 ==17083== by 0x404C6F: logline (trace.h:15)
2008-11-13 00:02 ==17083== by 0x40D28B: filemap_extent_io (filemap.c:226)
2008-11-13 00:02 ==17083== by 0x40D4FE: filemap_block_write (filemap.c:264)
2008-11-13 00:02 ==17083== by 0x40338A: write_buffer_to (buffer.c:179)
2008-11-13 00:02 ==17083== by 0x4033B2: write_buffer (buffer.c:186)
2008-11-13 00:02 ==17083== by 0x403C1F: flush_buffers (buffer.c:396)
2008-11-13 00:02 ==17083== by 0x40D8B9: main (filemap.c:304)
2008-11-13 00:02 ==17083==
2008-11-13 00:03 good luck, and I apologize for the low quality of that code
2008-11-13 00:04 it seems good basically, so no need to sorry
2008-11-13 00:05 and it's just trance_on()
2008-11-13 00:20 ah
2008-11-13 00:20 + return (from_be_u64(*(be_u64 *)&extent >> 48) & 0x3f) + 1;
2008-11-13 00:20 I think you alread changed it?
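The warning sparse raised on the extent count line is about ordering: a big-endian field has to be converted to CPU order first, and only then shifted and masked. The sketch below illustrates that ordering with a byte-wise decoder so that it behaves the same on any host. The struct and function names are hypothetical, and the layout (block in the low 48 bits, count minus one in six bits starting at bit 48) is an assumption inferred from the `>> 48` and `& 0x3f` in the quoted line, not a statement of the real dleaf.c format.

```c
#include <stdint.h>

/* Hypothetical on-disk extent: eight raw bytes holding one big-endian 64-bit word. */
struct extent_sketch {
	uint8_t raw[8];
};

/* Decode big-endian bytes into a CPU-order value, independent of host byte order. */
static uint64_t load_be64(const uint8_t *bytes)
{
	uint64_t value = 0;
	for (int i = 0; i < 8; i++)
		value = value << 8 | bytes[i];
	return value;
}

/* Encode a CPU-order value as big-endian bytes. */
static void store_be64(uint8_t *bytes, uint64_t value)
{
	for (int i = 7; i >= 0; i--) {
		bytes[i] = value & 0xff;
		value >>= 8;
	}
}

/* Assumed layout: block in the low 48 bits, (count - 1) in 6 bits from bit 48. */
static struct extent_sketch make_extent_sketch(uint64_t block, unsigned count)
{
	struct extent_sketch extent;
	store_be64(extent.raw, (block & 0xffffffffffffULL) | (uint64_t)(count - 1) << 48);
	return extent;
}

static unsigned extent_count_sketch(struct extent_sketch extent)
{
	/* Convert first, then shift and mask the CPU-order value. */
	return ((load_be64(extent.raw) >> 48) & 0x3f) + 1;
}

static uint64_t extent_block_sketch(struct extent_sketch extent)
{
	return load_be64(extent.raw) & 0xffffffffffffULL;
}
```

Shifting the raw word before converting it would give the same answer on a big-endian host and a wrong one on x86, which is why a static check catches this kind of mistake more reliably than a test run on a single machine.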
2008-11-13 00:20 yes
2008-11-13 00:20 ok, I'd like to merge those
2008-11-13 00:21 I think it's already in the repo
2008-11-13 00:21 oh
2008-11-13 00:22 wait
2008-11-13 00:24 return ((from_be_u64(*(be_u64 *)&extent) >> 48) & 0x3f) + 1; <- current version
2008-11-13 00:24 ok
2008-11-13 00:24 sparse liked it
2008-11-13 00:31 static inline struct extent make_extent(block_t block, unsigned count)
2008-11-13 00:31 {
2008-11-13 00:31 return (struct extent){ to_be_u64(block << 16 | (u64)(count - 1) << 10) };
2008-11-13 00:31 }
2008-11-13 00:31 static inline unsigned extent_block(struct extent extent)
2008-11-13 00:31 {
2008-11-13 00:31 return from_be_u64(*(be_u64 *)&extent) & ~(-1LL << 16);
2008-11-13 00:31 }
2008-11-13 00:31 static inline unsigned extent_count(struct extent extent)
2008-11-13 00:31 {
2008-11-13 00:31 return ((from_be_u64(*(be_u64 *)&extent) >> 10) & 0x3f) + 1;
2008-11-13 00:31 }
2008-11-13 00:31 static inline unsigned extent_version(struct extent extent)
2008-11-13 00:31 {
2008-11-13 00:31 return from_be_u64(*(be_u64 *)&extent) & 0x3ff;
2008-11-13 00:31 }
2008-11-13 00:31 current dleaf.c is like this?
2008-11-13 00:32 no, that looks like from you
2008-11-13 00:32 yes
2008-11-13 00:33 valgrind seems to like it
2008-11-13 00:33 you are working on a 64 bit system?
2008-11-13 00:33 well, I'd like to see current version of those
2008-11-13 00:33 yes
2008-11-13 00:34 (u64)(count - 1) << 10) <- small think, don't need (u64) here
2008-11-13 00:34 small thing
2008-11-13 00:34 ah, yes
2008-11-13 00:35 it should be for (u64)block
2008-11-13 00:35 I thought putting the block at the low end of the word would make it a little easier to debug when looking at hexdumps
2008-11-13 00:36 ah, i see.
I just referenced old one 2008-11-13 00:37 probably, your version after sparse fix, I think it works for valgrind too 2008-11-13 00:37 good, I could see any 64 bit issues 2008-11-13 00:37 count not I mean 2008-11-13 00:39 ah, sparse one is already pushed 2008-11-13 00:40 ok, current hg is works fine 2008-11-13 00:40 good, time for me to sleep 2008-11-13 00:40 just email when you're ready for a pull 2008-11-13 00:40 ok, good night 2008-11-13 00:41 good night 2008-11-13 00:46 night 2008-11-13 00:46 good night 2008-11-13 01:03 -!- pgquiles__(~pgquiles@18.Red-81-39-35.dynamicIP.rima-tde.net) has joined #tux3 2008-11-13 01:14 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-11-13 02:34 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-13 02:58 hey flips 2008-11-13 02:59 how's progress on atomic commits ? 2008-11-13 03:32 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-13 07:25 -!- pranith(~Bobby@122.162.72.220) has joined #tux3 2008-11-13 08:15 -!- Bobby_(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 09:10 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-13 09:36 -!- pranith(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 09:51 -!- mingming(~mingming@32.97.110.51) has joined #tux3
2008-11-13 11:04 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-13 12:40 Endian conversions are nearly all done 2008-11-13 12:42 ileaf.c not done yet 2008-11-13 12:44 hey 2008-11-13 14:01 ileaf done 2008-11-13 14:02 hmm, is that all the conversions? 2008-11-13 14:24 What's the native endianness for tux3?
2008-11-13 14:43 big 2008-11-13 14:44 -!- ajonat(~ajonat@190.48.98.3) has joined #tux3 2008-11-13 14:44 Linux has traditionally used little endian for filesystems 2008-11-13 14:44 so this is a slight departure 2008-11-13 14:44 but there are advantages to big endian 2008-11-13 14:45 for one thing, the code actually gets tested on x86 2008-11-13 14:45 which nearly all developers use 2008-11-13 14:45 instead of being noops 2008-11-13 14:46 another advantage is, hexdumps are a lot easier to read in big endian 2008-11-13 14:46 ZFS went with the highly doubtful decision to use native endian 2008-11-13 14:47 which means that ZFS filesystem layout is different depending on whether you create the fs with a big or little endian machine 2008-11-13 14:47 I don't know what they were thinking 2008-11-13 14:47 s/thinking/smoking/ 2008-11-13 15:06 I thought a lot of filesystems were bigendian, ext3 is 2008-11-13 15:10 ext3 is little endian 2008-11-13 15:13 Weird 2008-11-13 15:15 little endian is weird period 2008-11-13 15:16 true 2008-11-13 15:17 in the stone age it made some kind of sense for processors that fetch data a byte at a time, the alu could start to work on the first byte before loading the second 2008-11-13 15:19 -!- samlh(~sam@67.129.121.145) has joined #tux3 2008-11-13 15:20 Yeah, but people are more interested with backwards compatibility than better instruction sets 2008-11-13 16:48 -!- rmull(~rmull@acsx01.bu.edu) has joined #tux3 2008-11-13 16:48 Hi guys 2008-11-13 16:49 I just attended a presentation by my school's Sun Microsystems campus ambassador and he reminded my how badly I want ZFS 2008-11-13 16:50 I have a 16 disk fileserver and RAIDZ2, automatic data correction, and the excellent snapshotting are just too cool to ignore 2008-11-13 16:51 I assume a RAIDZ-style implementation will never occur because that's supposed to be the job of md 2008-11-13 16:51 I recall reading on the tux3 announce on LKML that it was mentioned tux3 was supposed to be better than 
ZFS 2008-11-13 16:51 And my question is: How? 2008-11-13 18:13 rmull, the btrfs guys are putting raid into btrfs 2008-11-13 18:14 I believe that raid does not belong in a filesystem 2008-11-13 18:15 we are not at the benchmarking point yet, but indications are Tux3 will be faster than ZFS, and not suffer from excess memory use like ZFS does 2008-11-13 18:16 the ZFS snapshot vs clone model is clunky, tux3 will have no such restriction 2008-11-13 18:25 rmull, have you ever actually seen ZFS correct some data? 2008-11-13 18:27 he flips 2008-11-13 18:27 hey 2008-11-13 18:33 flips: I've seen demos and heard personal anecdotes from three different individuals 2008-11-13 18:34 The standard "dd urandom over a disk" test, then checksums match before and after 2008-11-13 18:38 But - I've never personally tried it. 2008-11-13 18:38 the ability to make remote incremental backups far outweighs the ability to construct data damaged by random dd's in my opinion 2008-11-13 18:39 Oh, that was another thing I wanted to ask - regarding zumastor 2008-11-13 18:39 the plan is to backport some of the tux3 mechanisms to zumastor 2008-11-13 18:40 to get a way more efficient volume level replicating snapshot 2008-11-13 18:40 Is it possible to donate some money to the tux3 project? 
2008-11-13 18:40 certainly 2008-11-13 18:42 you could warm up by donating beer :) 2008-11-13 18:45 The beer is an option, though I have not existed long enough on this celestial boulder to buy alchohol as decided by my democratically elected government 2008-11-13 18:45 -!- ajonat_(~ajonat@190.48.125.206) has joined #tux3 2008-11-13 18:47 ok, next thing to fix are permissions and owners I think 2008-11-13 18:47 ought to be pretty easy 2008-11-13 19:31 whoops, endian stuff broke btree probe 2008-11-13 19:31 maybe I have patch for it 2008-11-13 19:31 that would be nice 2008-11-13 19:32 looks like node->count got set to little endian 1 2008-11-13 19:32 yes, it was warned by sparse 2008-11-13 19:33 ok, just tell me when to pull 2008-11-13 19:33 I wonder how I missed it 2008-11-13 19:33 make clean was needed? 2008-11-13 19:34 maybe. I didn't compile everything under C=1 2008-11-13 19:34 http://userweb.kernel.org/~hirofumi/endian-conversion-fixes.patch 2008-11-13 19:34 for right now 2008-11-13 19:35 I'm reading xattr stuff I completely missed now 2008-11-13 19:36 I also broke tux3graph.c 2008-11-13 19:36 yes, the patch is including it too 2008-11-13 19:36 nice 2008-11-13 19:39 fuse does indeed run again 2008-11-13 19:40 yes, make tests was also passed 2008-11-13 19:42 what do you think about rename "inode" to "atable" for inode in xattr? 2008-11-13 19:43 good 2008-11-13 19:43 thanks, I'll convert it 2008-11-13 19:44 what do you think about changing C to CHECK? 2008-11-13 19:45 both is ok to me, I just used name what linux is using 2008-11-13 19:45 C=1 works for kernel compile? 
2008-11-13 19:45 yes 2008-11-13 19:45 gross ;) 2008-11-13 19:46 ;) C=1 and V=1 is available 2008-11-13 19:46 I understand that the price of bytes has gone up lately, but I think we can affort another 4 2008-11-13 19:47 yes 2008-11-13 19:48 oh and the target is .c.o 2008-11-13 19:48 somebody really likes to write obscure code 2008-11-13 19:49 I'm not sure, make may allow %.c as alternative 2008-11-13 19:50 it's ok 2008-11-13 19:53 why does tux3fuse need its own checker target? 2008-11-13 19:53 no target, build rule 2008-11-13 19:53 it needs $(pkg-config ...) addition 2008-11-13 19:54 IOW, it still depends on *.c 2008-11-13 19:55 makes sense 2008-11-13 19:56 -!- RazvanM(~RazvanM@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-13 19:56 Hi! 2008-11-13 19:56 hi 2008-11-13 19:56 hi 2008-11-13 19:56 I remembered this time :D 2008-11-13 19:57 ok, there is the patch for C -> CHECK 2008-11-13 19:58 I have two minutes to get ready 2008-11-13 20:00 can anybody think of a reason why we should keep tux3fs.c, the high level fuse interface? 2008-11-13 20:00 we haven't done anything with it for weeks 2008-11-13 20:00 and tux3fuse.c gets regularly maintained 2008-11-13 20:01 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-13 20:01 hi maze 2008-11-13 20:01 hey 2008-11-13 20:01 did I miss anything? 2008-11-13 20:01 can anybody think of a reason why we should keep tux3fs.c, the high level fuse interface? 
2008-11-13 20:01 -!- RalucaM(~ral@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-13 20:01 that was all the activity in the last minute 2008-11-13 20:01 hi 2008-11-13 20:01 hi ralucam 2008-11-13 20:02 ACTION is late 2008-11-13 20:02 last week we ended on an exciting note, having looked at vfs->create pretty closely 2008-11-13 20:02 thinking about how defer is going to work 2008-11-13 20:03 so let's think keeping about deferred namespace ops, and go look at the delete side 2008-11-13 20:04 the goal of this exercise is to convince ourselves we can maintain the illusion that a file exists when it is not actually backed by the filesystem 2008-11-13 20:04 did that make sense? 2008-11-13 20:04 yup 2008-11-13 20:04 besides creating and deleting, we also need to worry about rename 2008-11-13 20:04 and the implicit delete that can happen in a rename 2008-11-13 20:05 ok, let's have a url work where the vfs calls the file delete method 2008-11-13 20:05 (this is my standard trick to grab a minute for myself) 2008-11-13 20:05 :-) 2008-11-13 20:07 ok, it was a trick 2008-11-13 20:07 the vfs doesn't call delete, it calls ->unlink 2008-11-13 20:08 inode_operations->unlink 2008-11-13 20:09 http://lxr.linux.no/linux/fs/namei.c#L2216 2008-11-13 20:09 as usual, the fastest way to find out a detail like that is look at fs/ext2 2008-11-13 20:09 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L2234 2008-11-13 20:09 nearly a tie :) 2008-11-13 20:10 let's take a quick walk through it 2008-11-13 20:10 before calling unlink, the vfs will check for a negative dentry 2008-11-13 20:10 pretty simple what it does 2008-11-13 20:11 after locking the target directory, similar to create 2008-11-13 20:11 (I was referring to vfs_unlink) 2008-11-13 20:11 yes 2008-11-13 20:11 http://lxr.linux.no/linux+v2.6.27.6/fs/namei.c#L1394 2008-11-13 20:11 see the nfs wrinkle 2008-11-13 20:11 sillyrename 2008-11-13 20:12 this is needed because, unlike a local filesystem, and nfs server is expected 
to be able to reboot and not lose track of an unlinked file 2008-11-13 20:12 because the client may still have it open 2008-11-13 20:13 and NFS, being stateless (that is the _real_ sillyness) does not know that 2008-11-13 20:13 let's have a quick look at may_delete 2008-11-13 20:13 and verify that it checks for a negative dentry 2008-11-13 20:14 1398 if (!victim->d_inode) 2008-11-13 20:14 1399 return -ENOENT; <- maybe this is the check 2008-11-13 20:15 yes 2008-11-13 20:15 that's bad code, it should have a wrapper to show that it's the negative dentry condition 2008-11-13 20:16 ok, this is very important for us 2008-11-13 20:16 because tux3 will not actually remove the underlying dirent until the next delta transition arrives 2008-11-13 20:17 it will just clear the inode field, turning the dentry "negative", and save a pointer to the dentry 2008-11-13 20:17 i see. and we pin it? 2008-11-13 20:17 normally, the dentry gets a dput not too long after this, and tux3 needs to take its own reference count 2008-11-13 20:17 yes 2008-11-13 20:17 "pin it" for short 2008-11-13 20:18 I don't think there is a lot more to look at there 2008-11-13 20:19 Stephen Tweedie in his notes on Ext3, calls delete the hardest operation, by far 2008-11-13 20:19 so if that seemed simple, I guess we just don't understand ;) 2008-11-13 20:20 now let's take a look at rename 2008-11-13 20:20 if tux3 defers create and delete, then it must also defer any other namespace changes, like rename 2008-11-13 20:20 well, wait 2008-11-13 20:20 where did the delete actually happen? 2008-11-13 20:20 we haven't even gone into ext2 code... 
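The deferred-unlink plan described above (hide the name immediately, remove the underlying dirent only at the next delta transition) can be modeled in a few lines of ordinary userspace C. This is a toy sketch only: `toy_dir`, `toy_lookup`, `toy_unlink` and `toy_delta_transition` are hypothetical names, no kernel structures are involved, and pinning, locking and the create-delete-create case raised in the discussion are deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAMES 16

/*
 * Hypothetical toy model: a directory as a flat table of names, plus a list
 * of unlinks that were accepted but not yet applied to the "on-disk" table.
 */
struct toy_dir {
	const char *entries[MAX_NAMES];   /* stand-in for on-disk dirents */
	size_t entry_count;
	const char *deferred[MAX_NAMES];  /* names unlinked since the last delta */
	size_t deferred_count;
};

static bool name_in(const char *const *names, size_t count, const char *name)
{
	for (size_t i = 0; i < count; i++)
		if (names[i] && strcmp(names[i], name) == 0)
			return true;
	return false;
}

/* A name is visible only if it is on disk and not pending deletion. */
static bool toy_lookup(const struct toy_dir *dir, const char *name)
{
	return name_in(dir->entries, dir->entry_count, name) &&
	       !name_in(dir->deferred, dir->deferred_count, name);
}

/* Unlink records the name instead of touching the dirent table. */
static bool toy_unlink(struct toy_dir *dir, const char *name)
{
	if (!toy_lookup(dir, name) || dir->deferred_count == MAX_NAMES)
		return false;
	dir->deferred[dir->deferred_count++] = name;
	return true;
}

/* At a delta transition, apply every deferred unlink to the dirent table. */
static void toy_delta_transition(struct toy_dir *dir)
{
	for (size_t i = 0; i < dir->entry_count; i++)
		if (dir->entries[i] &&
		    name_in(dir->deferred, dir->deferred_count, dir->entries[i]))
			dir->entries[i] = NULL;
	dir->deferred_count = 0;
}

/* Count dirents still physically present, whether or not they are visible. */
static size_t toy_on_disk_count(const struct toy_dir *dir)
{
	size_t live = 0;
	for (size_t i = 0; i < dir->entry_count; i++)
		if (dir->entries[i])
			live++;
	return live;
}
```

The point of the model is the gap between the two views: after `toy_unlink` the name no longer resolves, yet `toy_on_disk_count` is unchanged until `toy_delta_transition` runs. The bounded `deferred` array also hints at why a delta transition may have to be forced when the list grows.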
2008-11-13 20:20 ah, let's do that 2008-11-13 20:21 we will find that the fs actually invalidates the dentry 2008-11-13 20:21 http://lxr.linux.no/linux+v2.6.27.6/fs/ext2/namei.c#L253 2008-11-13 20:22 thanks 2008-11-13 20:23 2.6.27.6 lxr seems broken 2008-11-13 20:23 http://lxr.linux.no/linux+v2.6.27/fs/ext2/dir.c#L566 2008-11-13 20:23 ext2_delete_entry 2008-11-13 20:24 oh wait 2008-11-13 20:24 no, the fs does not do this 2008-11-13 20:25 removed entry from dirents page, so entry was gone from readdir() 2008-11-13 20:26 2241 d_delete(dentry); 2008-11-13 20:26 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L2241 2008-11-13 20:27 we can still lookup entry via dcache..., it unhashed from dcache 2008-11-13 20:27 1513 * Turn the dentry into a negative dentry if possible, otherwise 2008-11-13 20:27 1514 * remove it from the hash queues so it can be deleted later 2008-11-13 20:27 1515 */ 2008-11-13 20:28 inode may still be opening 2008-11-13 20:28 be opened 2008-11-13 20:28 1527 dentry_iput(dentry); 2008-11-13 20:28 yes 2008-11-13 20:29 http://lxr.linux.no/linux+v2.6.27/fs/dcache.c#L102 <- dentry_iput 2008-11-13 20:29 looking for the place where the dentry inode gets set to null 2008-11-13 20:30 108 dentry->d_inode = NULL; 2008-11-13 20:30 that "if possible" above worries me 2008-11-13 20:30 we need to be able to rely on this 2008-11-13 20:31 109 list_del_init(&dentry->d_alias); <- notice this happens outside the spinlocks 2008-11-13 20:31 it is protected by the directory i_mutex, I believe 2008-11-13 20:32 there is dentry->d_lock and dcache_lock? 
2008-11-13 20:32 let's see where those got taken 2008-11-13 20:33 in d_delete 2008-11-13 20:33 http://lxr.linux.no/linux+v2.6.27/fs/dcache.c#L1517 2008-11-13 20:33 it's not pretty 2008-11-13 20:33 let's answer the question "when is it not possible to turn the dentry into a negative dentry" 2008-11-13 20:34 that is, set the d_inode to null 2008-11-13 20:34 the answer is: when the d_count is not 1 2008-11-13 20:35 so that is a pretty scary little corner of the kernel for our plan 2008-11-13 20:35 do you need to care d_drop()? 2008-11-13 20:35 I don't know 2008-11-13 20:36 well if our dentry gets d_dropped and we have not done the deferred delete yet, we are in trouble 2008-11-13 20:36 so we must convince ourselves that this can never happen 2008-11-13 20:37 [there's also the what happens when you create delete create the same entry... problem] 2008-11-13 20:37 "_drop will just make the cache lookup fail." 2008-11-13 20:38 maze, the vfs should turn the negative dentry into a non-negative directory then call us 2008-11-13 20:39 not hard, but it has to be handled 2008-11-13 20:39 193 * d_drop() is used mainly for stuff that wants to invalidate a dentry for some 2008-11-13 20:39 194 * reason (NFS timeouts or autofs deletes). 2008-11-13 20:40 we are _probably_ ok, but this does have to be investigated pretty closely 2008-11-13 20:40 so, we have opened inode without dentry? 
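The two outcomes of d_delete discussed above can be summarized in a small model: if the caller holds the only reference, the dentry turns negative; otherwise it is merely unhashed and keeps its inode. The sketch below is a hypothetical simplification for reasoning about the plan, not the kernel code: `toy_dentry` and `toy_d_delete` are invented names, and all locking is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for a dentry; not the kernel struct. */
struct toy_dentry {
	int count;     /* reference count, playing the role of d_count */
	void *inode;   /* NULL means the dentry is negative */
	bool hashed;   /* true while a cache lookup can still find it */
};

/* Models the two outcomes described in the log; locking is omitted entirely. */
static void toy_d_delete(struct toy_dentry *dentry)
{
	if (dentry->count == 1) {
		/* Sole user: drop the inode, leaving a negative dentry in the cache. */
		dentry->inode = NULL;
		return;
	}
	/* Other holders exist: hide it from lookups but keep the inode attached. */
	dentry->hashed = false;
}
```

Under this model, a filesystem that takes its own extra reference on the dentry before d_delete runs would land in the second branch, so it could not count on seeing a negative dentry afterwards. Whether the real dcache behaves this way in every path is exactly what the discussion leaves open for further reading.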
2008-11-13 20:40 after create delete create 2008-11-13 20:40 hirofumi, yes we can have that 2008-11-13 20:40 oh 2008-11-13 20:40 that is common already 2008-11-13 20:41 it's just an open, unliked file 2008-11-13 20:41 unlinked 2008-11-13 20:41 but same dentry is reused 2008-11-13 20:41 possilby even linking to the same inode 2008-11-13 20:42 but that is ok 2008-11-13 20:42 ah 2008-11-13 20:42 yes 2008-11-13 20:42 maybe there is same name unhashed dentry 2008-11-13 20:42 ok, well this d_drop code needs to be read very carefully 2008-11-13 20:43 I don't think it can be allowed to be the same inode 2008-11-13 20:43 I'm pretty sure a delete create has to get a new inode # 2008-11-13 20:43 yes, it should be different inode 2008-11-13 20:43 maze, it's not a delete, it's an unlink 2008-11-13 20:43 there's a difference? 2008-11-13 20:43 yes 2008-11-13 20:44 delete is what happens on the last unlink + last close 2008-11-13 20:44 isn't delete == unlink of something with 1 link? 2008-11-13 20:44 ah ok 2008-11-13 20:44 in that case you would indeed get a new inode in most cases 2008-11-13 20:44 I don't think there is any rule that requires that though 2008-11-13 20:44 well 2008-11-13 20:45 if you got the same inode number, the inode would be reinitialized, if that's what you meant 2008-11-13 20:46 unhashed and reused dentries should have different inodes 2008-11-13 20:47 hirorumi, what will break if that is not the case? 2008-11-13 20:47 possibly some sort of monitoring or backup programs? 2008-11-13 20:47 I don't think though that it can actually be guaranteed... 2008-11-13 20:48 reused is confusible 2008-11-13 20:48 if that is a posix requirement, I didn't know it 2008-11-13 20:48 reallocate and has same name 2008-11-13 20:48 maybe files opened over nfs, surviving deletion/creation of the file? 2008-11-13 20:49 so what guarantees that a different inode is used, in ext2 for example? 2008-11-13 20:49 inode is still live, so it shouldn't same inode? 
2008-11-13 20:49 oh, certainly, in that case a different inode will be used 2008-11-13 20:50 yes 2008-11-13 20:50 but if you unlink; fsync parent; create, I think you have a good chance of getting the same inode number again 2008-11-13 20:51 yes, if unlinked file is not open 2008-11-13 20:51 true 2008-11-13 20:51 sounds like a mis-feature 2008-11-13 20:51 I believe we all think the same thing now 2008-11-13 20:51 yes 2008-11-13 20:52 I'm still worried about that d_drop 2008-11-13 20:53 so if anything needs to be researched to determine the feasibility of deferred nameops, it is that 2008-11-13 20:53 yes, I think maybe it will also call by dcache shrinker 2008-11-13 20:53 a core patch might be needed to make it work 2008-11-13 20:54 well, I'm not sure where is call it 2008-11-13 20:54 now, things might still be ok 2008-11-13 20:54 because the vfs still has to call our fs to find out if an entry exists 2008-11-13 20:55 we should not be implementing part of the dentry cache in our fs, but if necessary we can 2008-11-13 20:55 keep a list of dentries that we are in process of deleting 2008-11-13 20:55 in fact, we will always trigger that code 2008-11-13 20:56 that bypasses the dentry_iput 2008-11-13 20:57 if it can care memory balance, it may be ok? 2008-11-13 20:58 can you rephrase that question? 2008-11-13 20:58 deffered entries list can be too big 2008-11-13 20:58 oh, we have control of that 2008-11-13 20:59 we can force a delta transition 2008-11-13 20:59 yes, I think we have to trigger to flush it by memory pressure 2008-11-13 21:00 it's easy to test 2008-11-13 21:00 just delete a million files 2008-11-13 21:01 I worry it may make complex the code 2008-11-13 21:01 s/code/deffered code/ 2008-11-13 21:02 that is a danger 2008-11-13 21:02 it would be worth trying an experiment with tux3/junkfs before commiting to the strategy 2008-11-13 21:03 make sure that the dentry cache does the right thing 2008-11-13 21:03 good idea 2008-11-13 21:04 ok, so for homework... 
let's know the dentry cache better by next tuesday 2008-11-13 21:04 :-) easy to say ;-) 2008-11-13 21:05 utlk and lxr are your friends 2008-11-13 21:05 on tuesday we can compare notes on this question again, and move on to rename 2008-11-13 21:06 yes 2008-11-13 21:06 I was thinking to myself, why does the vfs leave the instantiation of a dentry under control of the fs, but not the invalidation? 2008-11-13 21:06 I think the answer is: because the vfs is broken 2008-11-13 21:08 this would not be the first time this ever happened, however, the usual way to proceed is to find a disgusting workaround 2008-11-13 21:08 make the the fs work in spite of the awkward api 2008-11-13 21:08 maybe many fs depends on current behaviour, so to change it may become too hard 2008-11-13 21:09 then make a core patch and post it under "see how much nicer this makes this thing we are doing" 2008-11-13 21:09 hirofumi, we change stuff like that on a regular basis 2008-11-13 21:09 sometimes while keeping a legacy interface 2008-11-13 21:09 but more often, just edit everything 2008-11-13 21:10 Linus even sees that as a feature, he likes to break out of tree drivers 2008-11-13 21:10 yes, howver maybe dcache still have many legacy code 2008-11-13 21:10 in this case, a new hook would probably be enough 2008-11-13 21:11 for example, call the fs in the __d_drop 2008-11-13 21:11 and we make more complex it ;) 2008-11-13 21:12 well, probably yes 2008-11-13 21:12 if there is a good efficiency argument its ok 2008-11-13 21:12 yes 2008-11-13 21:13 the nfs-specific hack in vfs_unlink is pretty disgusting, if that could be moved entirely into nfs with the help of a hook that also gives us what we want, that would be an improvement 2008-11-13 21:13 and a simplification 2008-11-13 21:13 yes 2008-11-13 21:14 I have to make sure about dcache stuff though 2008-11-13 21:14 me too 2008-11-13 21:15 deferred nameops is an optimization I really want to do, but we have a way of avoiding it initially 2008-11-13 
21:15 ah, I think that's very good 2008-11-13 21:16 that makes it a lot easier to tie it to a core vfs patch 2008-11-13 21:16 we can compare how big win 2008-11-13 21:16 right 2008-11-13 21:16 so that settles that 2008-11-13 21:16 we need both, or we can't prove how cool it is 2008-11-13 21:16 yes 2008-11-13 21:17 ok, should we delete tux3fs.c now? 2008-11-13 21:17 hmm 2008-11-13 21:17 konrad? 2008-11-13 21:17 wasn't there an fs flag about ddrop? 2008-11-13 21:17 ? 2008-11-13 21:17 searching 2008-11-13 21:18 ok, I don't think we have a real problem with this anyway 2008-11-13 21:18 we just keep our own list of deferred deletes 2008-11-13 21:18 100#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() 2008-11-13 21:19 not quite d_drop ;-) 2008-11-13 21:19 and when the fs calls the ->lookup, we consult our deferred list to see if we should say "not there" 2008-11-13 21:19 not a lot of code 2008-11-13 21:19 right, d_move 2008-11-13 21:19 well it's nice to know that hook is there 2008-11-13 21:20 because when we look at rename, I am sure there will be issues 2008-11-13 21:20 all for it 2008-11-13 21:20 tux3fuse obsoletes tux3fs 2008-11-13 21:20 ok, done 2008-11-13 21:20 yes 2008-11-13 21:20 and we just wanted to be sure of that 2008-11-13 21:20 which we now are 2008-11-13 21:21 yes 2008-11-13 21:28 ok, the makefile tests that used to run tux3fs now run tux3fuse 2008-11-13 21:28 make testfs and make debug 2008-11-13 21:29 I suppose we could also do something like make testfs DEBUG=1 2008-11-13 21:29 to run the fusefs in the foreground 2008-11-13 21:30 now let's have proper permission handling 2008-11-13 21:34 um.. it seems xattr->size uses 2 for load/store... 
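The deferred delete list described above (keep our own list, consult it in ->lookup, force a flush when it grows too big) could look roughly like the following. This is a sketch under assumptions: every name here is hypothetical, a plain linked list stands in for whatever structure tux3 would really use, and the flush only drops the entries where a real filesystem would apply them to the directory during a delta transition.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; none of this is tux3 code. */
struct deferred_unlink {
	unsigned long dir_inum;		/* directory the name lived in */
	char *name;
	struct deferred_unlink *next;
};

struct deferred_list {
	struct deferred_unlink *head;
	unsigned count, limit;		/* flush once count reaches limit */
	unsigned flushes;		/* number of flushes so far, for the demo */
};

/* Stand-in for forcing a delta transition: apply and drop every entry. */
void deferred_flush(struct deferred_list *list)
{
	struct deferred_unlink *entry = list->head;

	while (entry) {
		struct deferred_unlink *next = entry->next;
		/* a real filesystem would delete the on-disk dirent here */
		free(entry->name);
		free(entry);
		entry = next;
	}
	list->head = NULL;
	list->count = 0;
	list->flushes++;
}

/* Record an unlink without touching directory blocks. Returns 0, or -1 if out of memory. */
int deferred_add(struct deferred_list *list, unsigned long dir_inum, const char *name)
{
	struct deferred_unlink *entry = malloc(sizeof *entry);
	size_t len = strlen(name) + 1;

	if (!entry)
		return -1;
	entry->name = malloc(len);
	if (!entry->name) {
		free(entry);
		return -1;
	}
	memcpy(entry->name, name, len);
	entry->dir_inum = dir_inum;
	entry->next = list->head;
	list->head = entry;
	if (++list->count >= list->limit)
		deferred_flush(list);	/* keeps memory use bounded */
	return 0;
}

/* Called first in lookup: nonzero means answer "not there" without reading the directory. */
int deferred_hides(const struct deferred_list *list, unsigned long dir_inum, const char *name)
{
	for (const struct deferred_unlink *entry = list->head; entry; entry = entry->next)
		if (entry->dir_inum == dir_inum && !strcmp(entry->name, name))
			return 1;
	return 0;
}
```

The sketch leaves out the create-after-unlink case, where the matching entry would have to be removed or overridden, and it has no locking.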
2008-11-13 21:39 xattr->atom is u16 and atom_t is u32, and compared those in xcache_lookup() if I'm missing somthing 2008-11-13 21:52 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-11-13 21:55 hirofumi, yes 2008-11-13 21:55 implying a 2**16 limit for atom numbers 2008-11-13 21:56 nothing is borken, but the limit might be a little small 2008-11-13 21:56 atom_t can simply be delared unsigned, because it is not used for on-disk structures 2008-11-13 21:56 my mistake 2008-11-13 21:58 we could possibly use a 16 bit on-disk form for system xattrs and 32 bits for user defined 2008-11-13 21:58 or just always use 32 form 2008-11-13 21:58 i see 2008-11-13 21:58 and add a size optimization later if it looks like it's worth it 2008-11-13 21:59 it seems xattr->atom should be u32 2008-11-13 21:59 ah, no. atom_t probably 2008-11-13 22:01 they should be PACKED 2008-11-13 22:01 xattr and xcache 2008-11-13 22:03 or use u32 for xattr->size too? 2008-11-13 22:03 oh crap, this is where you can see that mercurial needs a real rename 2008-11-13 22:04 no file has older annotation than the move from user/test to user 2008-11-13 22:04 hirofumi, my thinking is, if the xattr has size bigger than 2**16 we store it in an inode page cache, not in a kmalloc 2008-11-13 22:05 currently, xattrs bigger than 2**12 are not used on linux 2008-11-13 22:05 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-13 22:05 i see 2008-11-13 22:05 ACTION needs to see xattr.c more 2008-11-13 22:05 I think it is worth trying to keep these fields small 2008-11-13 22:06 the biggest use of xattrs is acls 2008-11-13 22:06 which can quickly bloat up the metadata if they are not compact 2008-11-13 22:07 atom number of 32 bits and size of 16 bits is probably right for now 2008-11-13 22:07 i see. also samba uses it? 
2008-11-13 22:07 for windows acls, yes 2008-11-13 22:07 but tridge has not been happy with xattr performance on linux 2008-11-13 22:08 so far, ext3 is the best and it is not very good 2008-11-13 22:08 I hope we will do better 2008-11-13 22:08 heh 2008-11-13 22:09 i see. I don't use xattr 2008-11-13 22:09 mercurial can't detect rename? 2008-11-13 22:10 xattr = selinux + extended acls + samba + potentially user stuff 2008-11-13 22:10 yes. I don't use all of those 2008-11-13 22:11 at least for now 2008-11-13 22:11 mercurial fakes the rename as a delete and an add 2008-11-13 22:12 sometimes that is not a very good imitation of a rename, it breaks annotate for one thing 2008-11-13 22:12 yes, but at least git can see history of it, iirc 2008-11-13 22:16 one point for git 2008-11-13 22:16 though I don't think git's metadata design actually supports taht 2008-11-13 22:16 it's a heuristic 2008-11-13 22:17 I'll get rid of this bogus PACKED's now 2008-11-13 22:18 probably 2008-11-13 22:18 it will removed from some places? 2008-11-13 22:19 should only be in tux3.h 2008-11-13 22:19 just getting rid of the attribute 2008-11-13 22:19 that will only cause slower code and has no benefit 2008-11-13 22:20 let's think about size of on-disk atoms for a day or so 2008-11-13 22:20 before changing 2008-11-13 22:20 xattr and xcache? yes 2008-11-13 22:21 I guess we are going to go with 32 bit on-disk atoms at first 2008-11-13 22:21 it's meaning dirent->inum? 2008-11-13 22:22 same size, yes 2008-11-13 22:23 inums are actually supposed to be 48 bits in tux3 2008-11-13 22:23 but ext2 dirents limit us to 32 bits 2008-11-13 22:23 we could change that if we care 2008-11-13 22:23 i see 2008-11-13 22:23 but the ext2 dirent code will be entirely replaced at some point 2008-11-13 22:24 maybe we will fix it after phtree? 
2008-11-13 22:24 yes, it will be fixed in phtree 2008-11-13 22:24 I don't think anybody will care about having more than 2**32 inodes for a long time 2008-11-13 22:25 yes 2008-11-13 22:28 I think we will keep the u16 halfwords in struct xattr 2008-11-13 22:28 with many cached xattrs is can make a difference to memory use 2008-11-13 22:29 we will return EINVAL for any xattr over some size, probably 2**12 2008-11-13 22:29 and allow bigger xattrs later 2008-11-13 22:30 I'm just not understanding the another u16 in dump_atoms is what for 2008-11-13 22:30 why is it 2**12? 2008-11-13 22:30 nobody uses xattrs bigger than that on linux 2008-11-13 22:30 ext3 has that limitation 2008-11-13 22:31 i see 2008-11-13 22:31 it packs xattrs into pages 2008-11-13 22:31 has to fit in a page 2008-11-13 22:31 ah, i see 2008-11-13 22:31 that's u16 * in dump_atoms 2008-11-13 22:32 that should be be_u16 2008-11-13 22:32 it's a disk block 2008-11-13 22:32 yes 2008-11-13 22:32 do you want to fix it, or me? 2008-11-13 22:32 and it has two part - hi and lo 2008-11-13 22:32 yes 2008-11-13 22:32 I thought that optimization would be worth it 2008-11-13 22:33 nearly always, the use inc/dec will not overlow into the other half 2008-11-13 22:33 I'm not understanding about those 2008-11-13 22:33 which means that when counts have to be flushed to disk, it will be half the data to transfer 2008-11-13 22:34 i see. I'm also not understanding format yet 2008-11-13 22:34 it's why tux3graph is dump those 2008-11-13 22:35 the count tables are stored at a high offeset in the atable 2008-11-13 22:35 I was missing it 2008-11-13 22:35 atable is a just normal file? 2008-11-13 22:36 yes 2008-11-13 22:36 the atom tables (counts and reverse map) are stored above i_size 2008-11-13 22:36 above i_size? 
2008-11-13 22:37 i_size is only important because ext2 dirops rely on it to know how many dirent blocks are in the file 2008-11-13 22:37 ah, it's directory 2008-11-13 22:37 yes 2008-11-13 22:37 I'm forgetting it 2008-11-13 22:37 directory code works very well for atom names 2008-11-13 22:38 incidentally, btrfs uses dirops to implement xattrs 2008-11-13 22:38 i see 2008-11-13 22:39 so both tux3 and btrfs will do a directory lookup on every xattr get or set 2008-11-13 22:39 pretty gross really 2008-11-13 22:39 no wonder xattrs aren't fast enough for tridge 2008-11-13 22:40 i see 2008-11-13 22:42 I see that the refcounts are be_u16 in use_atom 2008-11-13 22:42 it's just dump_atoms that needs fixing 2008-11-13 22:43 yes 2008-11-13 22:44 sb->freeatom has to be be_ too 2008-11-13 22:44 and I'm thinking I'd like to change, "be_u16 *good_name = buffer->data" or something 2008-11-13 22:44 oh sorry 2008-11-13 22:44 it's cache 2008-11-13 22:44 right 2008-11-13 22:44 disksuper is 2008-11-13 22:44 good_name? 
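The hi/lo refcount split discussed above can be sketched as follows. The names be_u16 and from_be_u16 appear in the log, but their definitions here are guesses, and adjust_refcount, read_refcount and the two-table layout are assumptions for illustration only. The point is that a +1 or -1 normally changes only the low half, so only the low table block needs to go to disk.

```c
#include <stdint.h>

/* Assumed definition: a 16 bit value kept in big endian byte order. */
typedef struct { uint8_t bytes[2]; } be_u16;

static uint16_t from_be_u16(be_u16 val)
{
	return (uint16_t)(val.bytes[0] << 8 | val.bytes[1]);
}

static be_u16 to_be_u16(uint16_t val)
{
	be_u16 out = { { (uint8_t)(val >> 8), (uint8_t)(val & 0xff) } };
	return out;
}

/*
 * Adjust one atom's refcount by use, which is expected to be +1 or -1.
 * Returns 1 if the high table had to be written as well, 0 if only the
 * low table changed.
 */
int adjust_refcount(be_u16 *low_table, be_u16 *high_table, unsigned atom, int use)
{
	int low = from_be_u16(low_table[atom]) + use;

	low_table[atom] = to_be_u16((uint16_t)low);	/* keeps the low 16 bits */
	if (low >= 0 && low <= 0xffff)
		return 0;	/* common case: high half untouched */

	/* carry (low went past 0xffff) or borrow (low went below 0) */
	int high = from_be_u16(high_table[atom]) + (low < 0 ? -1 : 1);
	high_table[atom] = to_be_u16((uint16_t)high);
	return 1;
}

/* Combine both halves into the full 32 bit count. */
uint32_t read_refcount(const be_u16 *low_table, const be_u16 *high_table, unsigned atom)
{
	return (uint32_t)from_be_u16(high_table[atom]) << 16 | from_be_u16(low_table[atom]);
}
```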
2008-11-13 22:45 currently - int low = from_be_u16(((be_u16 *)buffer->data)[offset]) + use 2008-11-13 22:45 oh sure 2008-11-13 22:45 I can't see buffer->data is what for 2008-11-13 22:46 buffer->data is char * I think 2008-11-13 22:46 void * 2008-11-13 22:46 in userspace, void * 2008-11-13 22:47 in kernel, right it's probably char * 2008-11-13 22:47 yes 2008-11-13 22:48 buffer_head was invented when there was no such thing as void * 2008-11-13 22:48 char * has to be explicitly cast, a pain 2008-11-13 22:49 we will probably just write (void *)buffer_head->data 2008-11-13 22:49 almost all pepole want void * 2008-11-13 22:50 (void *)buffer_head->b_data 2008-11-13 22:50 yes 2008-11-13 22:51 I think we should wrap all uses of buffer->data 2008-11-13 22:51 so we have buffer_data(buffer) 2008-11-13 22:51 then the change for kernel will be small 2008-11-13 22:52 maybe so, I'm not sure we will use buffer_head or not though 2008-11-13 22:54 we are stuck with it, for remembering block state in the page cache 2008-11-13 22:54 no practical alternative 2008-11-13 22:54 eventually, I will get time to work on replacing buffers by subpages, which will be much nicer 2008-11-13 22:55 it will make struct page smaller for one thing, no need for a list of buffers 2008-11-13 22:55 fsblock from nick? or I was thinking we may use something or it at past 2008-11-13 22:55 I don't know about fsblock 2008-11-13 22:55 but the name sounds like it could be similar to my plan 2008-11-13 22:56 oh 2008-11-13 22:56 looks very different 2008-11-13 22:56 much more heavyweight than what I had in mind 2008-11-13 22:56 got url? 
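The buffer_data(buffer) wrapper proposed above might be as small as this. The struct layout is a userspace stand-in invented for the sketch, and first_byte is a hypothetical caller; only the name buffer_data comes from the discussion.

```c
/* Userspace stand-in for the buffer type; the fields are assumptions. */
struct buffer {
	void *data;
	unsigned size;
};

/*
 * Single access point for block data. For a kernel port only this body
 * would change, for example to return (void *)bh->b_data for a
 * struct buffer_head, and the callers would stay as they are.
 */
static inline void *buffer_data(struct buffer *buffer)
{
	return buffer->data;
}

/* Example caller: void * converts implicitly, so no cast is needed. */
unsigned char first_byte(struct buffer *buffer)
{
	unsigned char *bytes = buffer_data(buffer);

	return bytes[0];
}
```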
2008-11-13 22:56 http://lwn.net/Articles/239621/ 2008-11-13 22:57 it seems to support lageblock too 2008-11-13 22:57 s/lage/large/ 2008-11-13 22:57 http://lkml.org/lkml/2007/6/25/408 2008-11-13 22:57 that was always the plan 2008-11-13 22:57 http://lkml.org/lkml/2007/6/23/252 2008-11-13 22:58 the patch was posted for review recently 2008-11-13 22:58 [*] About the furthest we could go is use the struct page for the 2008-11-13 22:58 information otherwise stored in the buffer_head, but this would be 2008-11-13 22:58 tricky and suboptimal for filesystems with non page sized blocks and 2008-11-13 22:58 would probably bloat the struct page as well. 2008-11-13 22:58 <- nick is wrong 2008-11-13 22:58 I should review it 2008-11-13 22:59 http://www.spinics.net/lists/linux-fsdevel/msg17327.html 2008-11-13 22:59 most of the recent improvements to vm have resulting in large amounts of bloat 2008-11-13 22:59 it's getting out of hand 2008-11-13 23:00 me too 2008-11-13 23:01 time management too for me 2008-11-13 23:01 well, one thing at a time 2008-11-13 23:01 for now, it's buffer_head or nothing 2008-11-13 23:02 it's gross, but it's not the grossest thing in kernel, far from it 2008-11-13 23:02 "we suck, but we suck fast" :-) 2008-11-13 23:03 well, I'm happy with buffer_head at least for now 2008-11-13 23:03 ext3 completely relies on it 2008-11-13 23:03 partly because of me :) 2008-11-13 23:04 the index code 2008-11-13 23:04 the journal code is also heaviliy dependent 2008-11-13 23:04 yes, jbh is 2008-11-13 23:07 let's see, permission setting in tux3fuse... 2008-11-13 23:07 it can't read/write permission? 
2008-11-13 23:08 there is a little work to do 2008-11-13 23:08 nothing big 2008-11-13 23:08 just enable loading/saving that attribute 2008-11-13 23:08 i see 2008-11-13 23:08 and assign it on create etc 2008-11-13 23:08 I'm not using it usually 2008-11-13 23:09 I'm going to try to make it so the fuse fs can be accessible to normal user 2008-11-13 23:09 then the test mount can be in the local directory instead of /tmp 2008-11-13 23:10 current fuse can do it? 2008-11-13 23:10 see tux3_create in tux3fuse.c 2008-11-13 23:11 ok 2008-11-13 23:12 oh, it doesn't store some attributes 2008-11-13 23:12 right, that looks like the main problem 2008-11-13 23:12 probably just need | XXX_BIT 2008-11-13 23:13 I'm just checking that mode_t matches properly 2008-11-13 23:14 it seems to just set iattr.isize, .uid, .gid 2008-11-13 23:14 line number? 2008-11-13 23:15 tux3_create -> tuxcreate -> make_inode -> inode.c:115 2008-11-13 23:16 yes, and iattr->mode isn't set in tux3fuse.c 2008-11-13 23:16 need to set it 2008-11-13 23:17 and "mode | 0666" looks odd 2008-11-13 23:17 that's broken 2008-11-13 23:18 iattr->mode is just unsigned 2008-11-13 23:18 need to get precise about whether it is supposed to be libc mode_t or not 2008-11-13 23:19 yes 2008-11-13 23:19 at least it wants setuid bit 2008-11-13 23:19 libc mode_t does not seem to be documented precisely 2008-11-13 23:20 I think mode_t is same with in kernel usually 2008-11-13 23:20 that would be nice 2008-11-13 23:20 ah, wait 2008-11-13 23:21 size of type? 
2008-11-13 23:22 static int xmp_create(const char *path, mode_t mode, struct fuse_file_info *fi) 2008-11-13 23:22 { 2008-11-13 23:22 int fd; 2008-11-13 23:22 fd = open(path, fi->flags, mode); 2008-11-13 23:23 typedef __kernel_mode_t mode_t; 2008-11-13 23:23 yes 2008-11-13 23:23 size varies by arch 2008-11-13 23:24 but what about bit layout 2008-11-13 23:24 yes 2008-11-13 23:24 I think it is same 2008-11-13 23:24 it looks that way 2008-11-13 23:25 I wonder why all the complexity then 2008-11-13 23:25 should just be unsigned mode_t in kernel 2008-11-13 23:26 now where do we get the uid/gid from 2008-11-13 23:26 on x86 uses unsigned short? 2008-11-13 23:26 in fuse 2008-11-13 23:26 a useless thing to do probably 2008-11-13 23:27 probably struct stat uses 2008-11-13 23:27 so, if changed, binary incompatible 2008-11-13 23:27 that should be declared with uintxx_t 2008-11-13 23:27 not mode_t etc 2008-11-13 23:27 if size has to be precisely defined 2008-11-13 23:28 obviously it does 2008-11-13 23:28 some archs standard may have difference? 2008-11-13 23:29 those apis are all defined per-arch 2008-11-13 23:29 yes 2008-11-13 23:30 fuse_get_context seems to be the way to get uid/gid 2008-11-13 23:31 i see 2008-11-13 23:31 I wonder why it isn't just passed with the create 2008-11-13 23:31 seems odd 2008-11-13 23:32 http://www.prism.uvsq.fr/users/ode/in115/references/fuse/fuse_8h.html#a20 2008-11-13 23:34 much odd 2008-11-13 23:34 usually it should be needed 2008-11-13 23:41 Is there a way to know the uid, gid or pid of the process performing 2008-11-13 23:41 -------------------------------------------------------------------- 2008-11-13 23:41 the operation? 2008-11-13 23:41 -------------- 2008-11-13 23:41 Yes: fuse_get_context()->uid, etc. 
2008-11-13 23:41 bleah, fuse_context()->pid return 0 for a create, not a good sign 2008-11-13 23:44 well maybe it would be better to start with chmod 2008-11-13 23:45 req->ctx.uid = in->uid; 2008-11-13 23:45 req->ctx.gid = in->gid; 2008-11-13 23:45 req->ctx.pid = in->pid; 2008-11-13 23:45 in libfuse 2008-11-13 23:45 ah, that looks promising 2008-11-13 23:45 in is from kernel 2008-11-13 23:45 looks like 2008-11-13 23:52 I get dereferencing pointer to incomplete type when I try to use the fuse_req_t req in tux3_create 2008-11-13 23:52 even with fuse/fuse_lowlevel.h included 2008-11-13 23:52 um... 2008-11-13 23:56 that header just has typedef struct fuse_req *fuse_req_t 2008-11-13 23:56 and no definition of fuse_req 2008-11-13 23:56 this is lame 2008-11-13 23:56 fuse_get_context()->uid is not work? 2008-11-13 23:57 ->pid was just zero 2008-11-13 23:57 ->uid too? 2008-11-13 23:57 yes, because running as root 2008-11-13 23:57 ah 2008-11-13 23:57 that's why starting with chmod would be easier 2008-11-13 23:58 but fuse seems like it has some half ideas in its interface 2008-11-13 23:59 I'm not sure this is a good use of time 2008-11-14 00:01 get permission denied from fuse when trying to do anything with the filesystem as non-root and the request doesn't even get to tux3fuse 2008-11-14 00:01 that seems broken to me 2008-11-14 00:01 yes 2008-11-14 00:01 well 2008-11-14 00:01 at least it is like kernel 2008-11-14 00:02 first tests the cached object 2008-11-14 00:02 in fact... 
it is kernel :) 2008-11-14 00:02 fuse seems to need some control 2008-11-14 00:02 well I think this is the kernel doing it in this case 2008-11-14 00:02 http://apps.sourceforge.net/mediawiki/fuse/index.php?title=Fuse.conf 2008-11-14 00:03 thanks 2008-11-14 00:06 that works, I can mount tux3fuse as a normal user 2008-11-14 00:06 ok 2008-11-14 00:07 also there are mount options 2008-11-14 00:07 tux3_statfs: not implemented <- ah nice, my earlier patch is useful 2008-11-14 00:11 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-14 00:12 ok, "./tux3fuse allow_other testdev mntpoint" may work 2008-11-14 00:12 no 2008-11-14 00:12 ok, "./tux3fuse -o allow_other testdev mntpoint" may work 2008-11-14 00:12 ok, that would be better than the config file 2008-11-14 00:13 ok, it seems work fine 2008-11-14 00:14 what's a shell command to print my own uid? 2008-11-14 00:14 besides grepping passwd 2008-11-14 00:14 id -a? 2008-11-14 00:15 thanks 2008-11-14 00:16 fuse_get_context()->uid returns 0 or garbage 2008-11-14 00:16 saw garbage once 2008-11-14 00:16 usually zero 2008-11-14 00:16 yes 2008-11-14 00:17 :( 2008-11-14 00:17 oh 2008-11-14 00:18 and created file is still root 2008-11-14 00:18 ah, no 2008-11-14 00:18 it may be tux3 problem 2008-11-14 00:18 we're not storing the attribute 2008-11-14 00:18 but 2008-11-14 00:18 what uid to store? 2008-11-14 00:19 &(struct iattr){ .mode = mode | 0666 } 2008-11-14 00:19 so, I think 0 2008-11-14 00:19 well that | is just wrong 2008-11-14 00:19 we need fuse to give us a uid to store and it's not being good about that 2008-11-14 00:19 mode... 2008-11-14 00:21 we may still more mount options... 2008-11-14 00:22 oh --help prints help 2008-11-14 00:23 and -d may help 2008-11-14 00:26 tux3_create(1, 'foo8', mode = 81a4) 2008-11-14 00:27 0x81a4 == 0100644 2008-11-14 00:27 looks good? 
2008-11-14 00:28 in octal: tux3_create(1, 'foo10', mode = 100644) 2008-11-14 00:28 yes 2008-11-14 00:28 0100000 is S_IFREG 2008-11-14 00:29 we may strip S_IFMT only 2008-11-14 00:29 strip? 2008-11-14 00:29 remove from mode 2008-11-14 00:30 no 2008-11-14 00:30 we can, or leave it 2008-11-14 00:30 i_mode should have it 2008-11-14 00:30 yes 2008-11-14 00:30 yes 2008-11-14 00:30 I don't think kernel uses an internal form for that 2008-11-14 00:30 well 2008-11-14 00:30 wait 2008-11-14 00:30 ext2 does 2008-11-14 00:30 it converts 2008-11-14 00:30 but vfs talks to ext2 in this format I think 2008-11-14 00:31 the bit layout in posix mode is a little stupid 2008-11-14 00:31 ok, so we just lose the | 666 2008-11-14 00:32 that was probably a slip by tero 2008-11-14 00:33 yes 2008-11-14 00:33 I was thinking S_IFREG is 0 2008-11-14 00:33 ls -l test 2008-11-14 00:33 total 0 2008-11-14 00:33 -rw-rw-rw- 0 root root 0 2008-11-14 00:08 foo 2008-11-14 00:33 -rw-rw-rw- 0 root root 0 2008-11-14 00:28 foo10 2008-11-14 00:33 -rw-r--r-- 0 root root 0 2008-11-14 00:33 foo11 2008-11-14 00:34 foo11 was iwth the | 666 removed 2008-11-14 00:34 looks like 644 to me 2008-11-14 00:34 looks good. I think you are using umask == 022 2008-11-14 00:35 but uid and gid is wrong 2008-11-14 00:35 I don't know how to get them from fuse 2008-11-14 00:35 yes 2008-11-14 00:36 kernel send those surely 2008-11-14 00:37 I would expect the uid of the tux3fuse process is always the same as who started it, regardless of who creates a file 2008-11-14 00:38 so it will not do any good to call getuid() 2008-11-14 00:38 isn't it root always? 2008-11-14 00:38 I'm running it as me now 2008-11-14 00:38 with users in /etc/fstab? 
2008-11-14 00:39 trivial patch to remove the | 666 posted 2008-11-14 00:39 using etc/fuse.conf 2008-11-14 00:39 i see 2008-11-14 00:39 but your command line way is better 2008-11-14 00:39 I have not tried it yet 2008-11-14 00:39 but anyway, we need fuse to tell us which user created the file 2008-11-14 00:39 it is the only one that knows, after the user calls into the kernel 2008-11-14 00:40 yes 2008-11-14 00:40 if (!f->got_init && in->opcode != FUSE_INIT) 2008-11-14 00:40 fuse_reply_err(req, EIO); 2008-11-14 00:40 else if (f->allow_root && in->uid != f->owner && in->uid != 0 && 2008-11-14 00:40 in->opcode != FUSE_INIT && in->opcode != FUSE_READ && 2008-11-14 00:40 in->opcode != FUSE_WRITE && in->opcode != FUSE_FSYNC && 2008-11-14 00:40 in->opcode != FUSE_RELEASE && in->opcode != FUSE_READDIR && 2008-11-14 00:40 in->opcode != FUSE_FSYNCDIR && in->opcode != FUSE_RELEASEDIR) { 2008-11-14 00:40 fuse_reply_err(req, EACCES); 2008-11-14 00:41 } else if (in->opcode >= FUSE_MAXOP || !fuse_ll_ops[in->opcode].func) 2008-11-14 00:41 fuse_reply_err(req, ENOSYS); 2008-11-14 00:41 else { 2008-11-14 00:41 odd codes here in libfuse 2008-11-14 00:41 f->allow_root must be 0? ... 2008-11-14 00:41 not very highly abstracted 2008-11-14 00:47 -o allow_other is only allowed if etc/fuse.conf exists 2008-11-14 00:47 what error was occured? 2008-11-14 00:47 I think I'm geting close 2008-11-14 00:47 ... 
2008-11-14 00:48 ./tux3fuse /tmp/testdev /tmp/test -o allow_other 2008-11-14 00:48 fusermount: option allow_other only allowed if 'user_allow_other' is set in /etc/fuse.conf 2008-11-14 00:48 I see 2008-11-14 00:50 -!- pgquiles__(~pgquiles@18.Red-81-39-35.dynamicIP.rima-tde.net) has joined #tux3 2008-11-14 00:50 ok, the makefile can create the /etc/fuse.conf file if it doesn't already exist 2008-11-14 00:51 with sudo 2008-11-14 00:51 after that, testing can be as a normal user 2008-11-14 00:58 oh, ============================ uid 1000, gid 1000, pid 25509 2008-11-14 00:58 that looks good 2008-11-14 00:59 what is the trick? 2008-11-14 00:59 + fprintf(stderr, "============================ uid %u, gid %u, pid %u\n", 2008-11-14 00:59 + req->ctx.uid, 2008-11-14 00:59 + req->ctx.gid, 2008-11-14 00:59 + req->ctx.pid); 2008-11-14 00:59 and copy fuse_req from libfuse 2008-11-14 00:59 ah 2008-11-14 00:59 so, req->ctx is valid 2008-11-14 00:59 nice 2008-11-14 00:59 that is the way it should be 2008-11-14 00:59 but fuse_get_context() is wrong 2008-11-14 00:59 the get_context idea is stupid 2008-11-14 01:00 we just want to access it in right way 2008-11-14 01:00 ok 2008-11-14 01:00 fuse_req_ctx() may be it 2008-11-14 01:01 that makes sense to me 2008-11-14 01:01 and let's just paste the definition into tux3fuse.c 2008-11-14 01:01 when they fix the headers then we can deal with that problem 2008-11-14 01:02 are you going to do that? 
2008-11-14 01:02 that's not needed 2008-11-14 01:02 const struct fuse_ctx *ctx = fuse_req_ctx(req); 2008-11-14 01:02 fprintf(stderr, "============================ uid %u, gid %u, pid %u\n", 2008-11-14 01:02 ctx->uid, 2008-11-14 01:02 ctx->gid, 2008-11-14 01:02 ctx->pid); 2008-11-14 01:02 ok, done 2008-11-14 01:02 ah, so there is a definition of fuse_cts 2008-11-14 01:02 nice 2008-11-14 01:03 looks fine 2008-11-14 01:03 yes, we can use fuse_ctx in tux3fuse.c 2008-11-14 01:03 sudo echo user_allow_other >/etc/fuse.conf <- this doesn't work 2008-11-14 01:03 because the > is not handled as superuser 2008-11-14 01:03 yes 2008-11-14 01:03 I have never figured out how to do an echo like that with sudo 2008-11-14 01:03 sudo "..." may work? 2008-11-14 01:04 no 2008-11-14 01:04 sudo "echo user_allow_other >/etc/fuse.conf" 2008-11-14 01:04 sudo: echo user_allow_other >/etc/fuse.conf: command not found 2008-11-14 01:04 ugh 2008-11-14 01:05 probably has to be a cp 2008-11-14 01:05 I've often had this problem 2008-11-14 01:06 sudo 'echo "echo user_allow_other > /etc/fuse.conf" | sh'? 2008-11-14 01:06 heh 2008-11-14 01:06 no 2008-11-14 01:06 it wouldn't be work 2008-11-14 01:07 still using '...' 2008-11-14 01:07 and sudo won't allow parens around the command 2008-11-14 01:07 sudo sh -c "echo ..." ? 2008-11-14 01:08 right 2008-11-14 01:08 good 2008-11-14 01:08 that's how I did it the last time it worked I think 2008-11-14 01:09 works, now the next challenge is escaping it for make 2008-11-14 01:09 $(shell sudo ...) ? 2008-11-14 01:10 ah 2008-11-14 01:10 doesn't need escaping it seems 2008-11-14 01:10 yes 2008-11-14 01:10 $ is just need to escape 2008-11-14 01:10 works 2008-11-14 01:11 now we can test things more on tux3fuse 2008-11-14 01:12 sudo sh -c "echo user_allow_other > /etc/fuse.conf" ? 2008-11-14 01:12 looks good to me 2008-11-14 01:12 eh 2008-11-14 01:12 what do you mean? 2008-11-14 01:13 sorry, missed your earlier reply... 
2008-11-14 01:13 ah, in makefile we can escape $ 2008-11-14 01:14 $$ for shell 2008-11-14 01:14 ok, it's checked in 2008-11-14 01:14 and time for me to sleep 2008-11-14 01:14 good night 2008-11-14 01:15 tomorrow I'll make the small change to run the tests in the local directory instead of /tmp 2008-11-14 01:15 a little easier to test, and make clean will clean up the test dev 2008-11-14 01:16 and we will be staring to use fuse_ctx 2008-11-14 01:16 s/staring/starting/ 2008-11-14 01:16 yes, and add the load/save of owner attribute 2008-11-14 01:16 ah 2008-11-14 01:17 I'll read xattr.c more 2008-11-14 01:17 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-14 01:18 that is in inode.c 2008-11-14 01:18 and this is a standard attr, not an xattr 2008-11-14 01:18 yes 2008-11-14 01:19 it meant I want to read to understand xattr.c more 2008-11-14 01:19 ah yes 2008-11-14 01:20 there's a design bug 2008-11-14 01:20 ok 2008-11-14 01:20 I should not have treated empty attribute as not existing 2008-11-14 01:20 it is allowed to have an xattr with empty value 2008-11-14 01:20 stupid, but legal 2008-11-14 01:21 i see 2008-11-14 01:21 store_attrs stores xattr always 2008-11-14 01:22 it looks like the owner attribute is already loaded/saved 2008-11-14 01:22 yes 2008-11-14 01:22 that's a bug 2008-11-14 01:22 it's because the xattr cache is created always 2008-11-14 01:22 it should only be created on first set_xattr, or when xattrs exist in the inode table block 2008-11-14 01:22 i see 2008-11-14 01:22 there's a fixme comment 2008-11-14 01:23 mode 0100644 uid 0 gid 0 root d:1 ctime 491d410770900000 size 4 <- mode is saved properly 2008-11-14 01:24 I guess when you do the ctx fix, owner will work right away 2008-11-14 01:24 yes, I think so too 2008-11-14 01:25 I think it might be time to start working on kernel tomorrow 2008-11-14 01:25 there are still lots of things that could be done in user space, bugs to fix even, but kernel could start now 2008-11-14 
01:26 if we can start, that's interesting for me 2008-11-14 01:26 that's a good enough reason for me 2008-11-14 01:27 thanks, let's enjoy 2008-11-14 01:43 open("./testdev", O_RDWR|O_LARGEFILE) = -1 ENOENT 2008-11-14 01:43 doesn't like the ./ 2008-11-14 01:43 I thought that was ok 2008-11-14 01:44 in main, tux3fuse chdir to mount point 2008-11-14 01:45 that sounds wrong 2008-11-14 01:45 they should use openat 2008-11-14 01:45 it may have reason for fuse 2008-11-14 01:46 -!- pgquiles__(~pgquiles@18.Red-81-39-35.dynamicIP.rima-tde.net) has joined #tux3 2008-11-14 01:46 a reason, no doubt, a valid reason is another question 2008-11-14 01:46 yes, maybe easy way is pass fullpass of testdev 2008-11-14 01:47 maybe $(PWD)/testdev? 2008-11-14 01:48 I'm in a symlinked directory 2008-11-14 01:48 and I think the chdir ends up in the real directory 2008-11-14 01:49 fuse just shouldn't do that 2008-11-14 01:49 it's bad 2008-11-14 01:49 this is the linux dcache working for us 2008-11-14 01:49 ah 2008-11-14 01:49 without the dcache, chdir to a symlinked directory would end up in the real directory 2008-11-14 01:50 but the illusion created by dcache isn't perfect, as we see here 2008-11-14 01:50 well 2008-11-14 01:50 I will just leave the testfile in /tmp 2008-11-14 01:50 well, it may work we just remove int fd = open(mountpoint, O_RDONLY); 2008-11-14 01:50 if we just remove 2008-11-14 01:51 which file? 2008-11-14 01:51 tux3file.c 2008-11-14 01:51 tux3fuse.c 2008-11-14 01:51 in main() 2008-11-14 01:52 oh, _we_ do the chdir 2008-11-14 01:52 we're bad :) 2008-11-14 01:52 yes 2008-11-14 01:53 and I thought it may be for fuse 2008-11-14 01:53 that's better 2008-11-14 01:53 well, the tracing output disappears 2008-11-14 01:55 not all the tracing output 2008-11-14 01:55 stderr output is still there 2008-11-14 01:57 sorry, stdout output is still there 2008-11-14 01:57 stderr output is gone 2008-11-14 01:58 without it, does tux3fuse work? 
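The ENOENT on ./testdev can be reproduced outside FUSE. The sketch below is an illustration under assumptions (the function name and the /tmp scratch directory are made up, and it assumes there is no /testdev in the root directory). It shows the full path workaround suggested above: resolve the name against the current directory before any chdir. The other option from the discussion is simply not to chdir at all.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if, after a chdir, the relative name
 * fails with ENOENT while the absolute path built beforehand still opens;
 * 0 if the outcome differs; -1 on a setup error.
 */
int absolute_path_survives_chdir(void)
{
	char dir[] = "/tmp/chdir-demo-XXXXXX";
	char cwd[PATH_MAX], abs_path[PATH_MAX + 16] = "";
	int fd, result = -1;
	int saved_cwd = open(".", O_RDONLY);	/* to restore the caller's cwd */

	if (saved_cwd < 0)
		return -1;
	if (!mkdtemp(dir) || chdir(dir) < 0 || !getcwd(cwd, sizeof cwd))
		goto out;

	/* Resolve the relative name while the cwd is still the right one. */
	snprintf(abs_path, sizeof abs_path, "%s/testdev", cwd);
	fd = open("./testdev", O_CREAT | O_EXCL | O_RDWR, 0600);
	if (fd < 0)
		goto out;
	close(fd);

	if (chdir("/") < 0)	/* stands in for the chdir to the mount point */
		goto out;

	fd = open("./testdev", O_RDWR);
	int relative_failed = fd < 0 && errno == ENOENT;
	if (fd >= 0)
		close(fd);

	fd = open(abs_path, O_RDWR);
	int absolute_worked = fd >= 0;
	if (fd >= 0)
		close(fd);

	result = relative_failed && absolute_worked;
out:
	if (abs_path[0])
		unlink(abs_path);
	rmdir(dir);
	if (fchdir(saved_cwd) < 0)
		result = -1;
	close(saved_cwd);
	return result;
}
```

getcwd returns the physical directory, so in a symlinked working directory the absolute path will name the real location rather than the symlinked one.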
2008-11-14 01:58 yes 2008-11-14 01:58 oh, we're bad :) 2008-11-14 02:09 well with the chdir gone file open works as expected 2008-11-14 02:09 good 2008-11-14 02:10 it will be committed? 2008-11-14 02:10 I think the log output is ok too, the funny thing I see is a lot of extra calls to tux3_getattr(1) only when working in the local directory 2008-11-14 02:11 it looks strange 2008-11-14 02:11 ok, I'll check this in 2008-11-14 02:11 and you can see if you like it 2008-11-14 02:11 ok 2008-11-14 02:12 the variable TESTDIR has the directory 2008-11-14 02:12 TESTDIR = /tmp makes it work like before 2008-11-14 02:15 I've put some patches 2008-11-14 02:15 http://userweb.kernel.org/~hirofumi/tux3/ 2008-11-14 02:15 it's including uid and gid for tux3fuse 2008-11-14 02:19 whoops 2008-11-14 02:19 we got an extra head 2008-11-14 02:19 ah, merged ok 2008-11-14 02:20 good 2008-11-14 02:21 looks like we got everything 2008-11-14 02:21 we were both working on Makefile and tux3fuse.c 2008-11-14 02:21 mercurial handled that nicely 2008-11-14 02:22 nice 2008-11-14 02:23 oh, and last commit has two parents. good 2008-11-14 02:24 it's pretty scary because if you get multiple heads in mercurial it can be very hard to get rid of them 2008-11-14 02:24 a flaw 2008-11-14 02:24 and I don't think rollback works on a pull 2008-11-14 02:24 if it's a public repositoriy, this can be a real mess 2008-11-14 02:25 I always pull to my private repo, then pull from there to the public one if everything works ok 2008-11-14 02:25 so if I have to, I can throw away my private repo and clone the public one 2008-11-14 02:27 two parents meant "Merged updates from Hirofumi" in http://tux3.org/tux3/ 2008-11-14 02:27 -!- pgquiles(~pgquiles@141.Red-83-40-80.dynamicIP.rima-tde.net) has joined #tux3 2008-11-14 02:27 yes, as it should be 2008-11-14 02:28 if you get two heads in your repo and I pull from that, it can be really hard to recover from 2008-11-14 02:29 hg doesn't work? 
2008-11-14 02:29 it won't merge the heads in that case 2008-11-14 02:30 I am not sure of the exact conditions 2008-11-14 02:30 lots of complaints about that 2008-11-14 02:30 i see 2008-11-14 02:30 another point for git 2008-11-14 02:30 in general, hg is still winning in productivy I think 2008-11-14 02:31 I'm not use scm heavily though, I just like gitk 2008-11-14 02:33 hg view is very similar 2008-11-14 02:33 oh 2008-11-14 02:33 try it :) 2008-11-14 02:34 hg: unknown command 'view' :) 2008-11-14 02:36 hmm 2008-11-14 02:36 works here 2008-11-14 02:36 maybe it needs some libraries 2008-11-14 02:37 maybe it is a plugin indeed 2008-11-14 02:40 it is also known as hgk 2008-11-14 02:41 ah, it seems to need .hgrc 2008-11-14 02:42 yes 2008-11-14 02:42 or /etc/mercurial/hgrc.d/hgext.rc I think 2008-11-14 02:42 I have that and not .hgrc 2008-11-14 02:42 curious 2008-11-14 02:45 ah, gitk 2008-11-14 02:45 looks like git 2008-11-14 02:45 gitk 2008-11-14 02:45 it sure does 2008-11-14 02:46 good 2008-11-14 02:46 -!- pgquiles(~pgquiles@141.Red-83-40-80.dynamicIP.rima-tde.net) has joined #tux3 2008-11-14 02:47 hola pgquiles 2008-11-14 02:47 hi flips 2008-11-14 02:48 glad to see tux3 is nearing the kernel port 2008-11-14 02:48 starts tomorrow ready or not :) 2008-11-14 02:49 :-) 2008-11-14 02:50 so, how long until it is production-ready? :-P 2008-11-14 02:50 nobody can know 2008-11-14 02:50 we can make a better guess about how long till benchmark-ready 2008-11-14 02:51 one of the things I've taken care to design out of the project is, dealock-prone daemons 2008-11-14 02:51 that consumed a great deal of time and energy in zumastor 2008-11-14 02:55 441 commits in tux3.org/tux3 2008-11-14 02:58 oh, hg seems to have bisect too 2008-11-14 02:58 and patchbomb 2008-11-14 02:59 very nice 2008-11-14 03:05 TESTDIR=. 
and TESTDIR=/tmp seems same result 2008-11-14 03:11 yes 2008-11-14 03:11 I saw some extra tux3_statfs calls working with the local directory, but my local directory is actually symlinked 2008-11-14 03:12 which may make a difference 2008-11-14 03:12 ah, I see
2008-11-14 22:58 ok, here we go 2008-11-14 22:59 first thing I find is the kernel has its own, sloppily defined bitops 2008-11-14 22:59 as usual 2008-11-14 22:59 kind of partial niceness (as in... the same names I used) mixed with screamingly awful stupidity 2008-11-14 22:59 what's a bitop? 2008-11-14 22:59 is set_bit 2008-11-14 23:00 working on linus's kernel means learning to like the smell and texture of his shit ;) 2008-11-14 23:00 still, it's a shock every time 2008-11-14 23:01 and usually elicits the same kind of whining from me 2008-11-14 23:02 so, upshot is, we have to reverse the order of all our bitop calls 2008-11-14 23:02 fortunately, not many of those 2008-11-14 23:03 example: linus has set_bit, and instead of the obvious get_bit, it's test_bit 2008-11-14 23:03 sheehs 2008-11-14 23:03 the whole damn kernel is like that 2008-11-14 23:03 detect a pattern at your peril 2008-11-14 23:09 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-11-14 23:56 -!- RazvanM(~RazvanM@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-15 00:50 yay, tux3.h compiles 2008-11-15 00:50 it's a start 2008-11-15 01:03 hi 2008-11-15 01:14 hi 2008-11-15 01:15 let me get a url for the git repo 2008-11-15 01:15 are you already started? 2008-11-15 01:15 tried compiling tux3.h just to see what would happen 2008-11-15 01:15 http://phunq.net/ddtree?p=tux3fs ?
2008-11-15 01:16 i see 2008-11-15 01:16 yes 2008-11-15 01:16 I think sync of kernel and userland is not easy 2008-11-15 01:17 no, exact sync is a waste of time 2008-11-15 01:17 but partial sync is worth it 2008-11-15 01:17 ah 2008-11-15 01:17 the kernel files are the most important ones of course 2008-11-15 01:18 ok 2008-11-15 01:18 so we will think kernel files is master? 2008-11-15 01:18 diff -u /src/tux3/tux3.h fs/tux3/tux3.h | wc 2008-11-15 01:18 194 607 4316 2008-11-15 01:19 yes 2008-11-15 01:19 we can make a few changes to reduce the diff with kernel 2008-11-15 01:19 shelling buffer->data to get void * is definitely worth it 2008-11-15 01:20 or just use ->b_data and use char *? 2008-11-15 01:21 the extra casts are annoying 2008-11-15 01:21 diff -u /src/tux3/tux3.h fs/tux3/tux3.h | diffstat 2008-11-15 01:21 tux3.h | 90 +++++++++++++++-------------------------------------------------- 2008-11-15 01:21 1 file changed, 22 insertions(+), 68 deletions(-) 2008-11-15 01:21 wc /src/tux3/tux3.h 2008-11-15 01:21 335 1040 7472 /src/tux3/tux3.h 2008-11-15 01:21 so this stayed about 80% the same 2008-11-15 01:21 I think tux3.h can be share 2008-11-15 01:23 if we move userland code to other file 2008-11-15 01:23 and dleaf.c, ileaf.c, btree.c, dir.c, xattr.c, balloc.c, filemap.c, and iattr.c 2008-11-15 01:23 most of it can be the same 2008-11-15 01:23 just with a few defines 2008-11-15 01:24 #ifdef __KERNEL__ stuff? 2008-11-15 01:24 if we have to 2008-11-15 01:24 #define printf printk 2008-11-15 01:25 in kernel 2008-11-15 01:25 um.. 2008-11-15 01:25 it's stupid that it's called printk anyway 2008-11-15 01:25 it is just a slightly different version of printf 2008-11-15 01:26 yes 2008-11-15 01:27 actuall conversion is happen later, or we use it always? 
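[editor's sketch] The "just a few defines" approach above might look roughly like this. It is a minimal sketch, assuming shared source files call printf and the kernel build maps that name onto printk, as quoted in the chat. The trace macro and the trace_enabled switch are illustrative assumptions, not the project's actual definitions, and the `##__VA_ARGS__` form is a GNU extension (accepted by gcc and clang).

```c
#ifdef __KERNEL__
#include <linux/kernel.h>
#define printf printk		/* the define quoted in the chat */
#else
#include <assert.h>
#include <stdio.h>
#endif

/* Hypothetical run-time switch for trace output. */
static int trace_enabled = 1;

static void set_trace(int on)
{
#ifndef __KERNEL__
	assert(on == 0 || on == 1);	/* userspace-only sanity check */
#endif
	trace_enabled = on;
}

/*
 * Hypothetical wrapper: prefixes the calling function's name and prints
 * only while tracing is enabled. Evaluates to the print call's return
 * value, or 0 when tracing is switched off.
 */
#define trace(fmt, ...) \
	(trace_enabled ? printf("%s: " fmt "\n", __func__, ##__VA_ARGS__) : 0)
```

With a wrapper of this shape, shared files such as btree.c could keep a single call site for both builds, and only the header would differ.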
2008-11-15 01:27 convert later 2008-11-15 01:27 i see 2008-11-15 01:27 it's good to leave some trival cleanup for when the patches are posted 2008-11-15 01:28 most of the printks will be converted to warn() and error() 2008-11-15 01:28 and trace() 2008-11-15 01:28 printfs I mean 2008-11-15 01:29 yes 2008-11-15 01:29 we use it in kernel always? 2008-11-15 01:29 yes, it's good to wrap those 2008-11-15 01:29 every filesystem does 2008-11-15 01:30 so you can turn them on, off, be controlled by a variable and so on 2008-11-15 01:30 ah 2008-11-15 01:30 did you have some changes, I thought you mentioned something yesterday 2008-11-15 01:31 patches for userspace? 2008-11-15 01:31 yes 2008-11-15 01:31 no 2008-11-15 01:31 ok 2008-11-15 01:31 I pushed already 2008-11-15 01:32 there is trivial patch though
2008-11-15 01:32 diff -puN user/tux3.h~inode-cleanup user/tux3.h
2008-11-15 01:32 --- tux3-dev/user/tux3.h~inode-cleanup 2008-11-14 18:43:07.000000000 +0900
2008-11-15 01:32 +++ tux3-dev-hirofumi/user/tux3.h 2008-11-14 18:43:07.000000000 +0900
2008-11-15 01:32 @@ -275,7 +275,7 @@ static inline u32 high32(fixed32 val)
2008-11-15 01:32 struct iattr {
2008-11-15 01:32 u64 isize, mtime, ctime, atime;
2008-11-15 01:32 unsigned mode, uid, gid, links;
2008-11-15 01:32 -} iattrs;
2008-11-15 01:32 +};
2008-11-15 01:32 :) 2008-11-15 01:32 I found the same 2008-11-15 01:32 and
2008-11-15 01:32 diff -puN user/inode.c~inode-cleanup user/inode.c
2008-11-15 01:32 --- tux3-dev/user/inode.c~inode-cleanup 2008-11-14 18:43:07.000000000 +0900
2008-11-15 01:32 +++ tux3-dev-hirofumi/user/inode.c 2008-11-14 18:43:07.000000000 +0900
2008-11-15 01:32 @@ -181,8 +181,6 @@ int save_inode(struct inode *inode)
2008-11-15 01:32 unsigned size;
2008-11-15 01:32 if (!(ileaf_lookup(&sb->itable, inode->inum, path[levels].buffer->data, &size)))
2008-11-15 01:33 return -EINVAL;
2008-11-15 01:33 - if (inode->i_size)
2008-11-15 01:33 - inode->present |= CTIME_SIZE_BIT;
2008-11-15 01:33 err = store_attrs(sb, path, inode);
2008-11-15 01:33 release_path(path, levels + 1);
2008-11-15 01:33 free_path(path);
2008-11-15 01:33 well, does not matter 2008-11-15 01:33 you're right though 2008-11-15 01:33 we're always going to store the size now 2008-11-15 01:33 yes 2008-11-15 01:34 inode should have it already 2008-11-15 01:34 where do we set CTIME_SIZE_BIT? 2008-11-15 01:35 make_inode 2008-11-15 01:35 ? 2008-11-15 01:35 yes 2008-11-15 01:36 also, atkind needs to be moved out of tux3.h 2008-11-15 01:36 not atkind 2008-11-15 01:36 atsize 2008-11-15 01:36 ah, yes 2008-11-15 01:37 and entirely we would need to add static 2008-11-15 01:37 to functions 2008-11-15 01:38 yes 2008-11-15 01:39 do we get rid of #include "*.c"? 2008-11-15 01:39 before merge for sure 2008-11-15 01:40 well 2008-11-15 01:40 the test code has to go out of all the above files 2008-11-15 01:40 that stays strictly in user space 2008-11-15 01:40 the kernel files will have no c file inludes 2008-11-15 01:40 includes 2008-11-15 01:41 ok 2008-11-15 01:41 personally, I do not see why people do not like c source includes 2008-11-15 01:41 it's a useful way to save time fiddling with header files 2008-11-15 01:41 it violates scode? 2008-11-15 01:42 scode? 2008-11-15 01:42 if we include *.c, static is not local 2008-11-15 01:42 local in file 2008-11-15 01:42 sure it is 2008-11-15 01:42 well why does that matter? 2008-11-15 01:42 we do it with headers 2008-11-15 01:43 you mean, when somebody sees a function they expect to be able to know if it's external or static by scanning from the beginning of the file 2008-11-15 01:43 that's a pretty weak reason 2008-11-15 01:44 I think it's mostly just habit 2008-11-15 01:44 um.. 2008-11-15 01:45 kernel has some source file includes already 2008-11-15 01:45 some really sick ones 2008-11-15 01:45 I think kmalloc is one 2008-11-15 01:45 but we won't 2008-11-15 01:45 .c?
2008-11-15 01:45 no need to attract criticism that way, even if there is no reason for it 2008-11-15 01:45 yes 2008-11-15 01:45 there's nothing wrong with it 2008-11-15 01:46 preferable to bloating up headers, which are at the breaking point in linux 2008-11-15 01:47 file says "I do those, and you can use this" 2008-11-15 01:47 my rule 2008-11-15 01:48 um... 2008-11-15 01:48 I can't say by english well 2008-11-15 01:49 maybe I think file is some sort of object 2008-11-15 01:49 and static is private 2008-11-15 01:49 extern is public 2008-11-15 01:49 but a file isn't an object 2008-11-15 01:49 in c 2008-11-15 01:49 yes 2008-11-15 01:49 no special status 2008-11-15 01:50 I'm thinking like some sort of private rule or something 2008-11-15 01:51 maybe s/I'm thinking/I hope/ 2008-11-15 01:51 it's just habit 2008-11-15 01:51 probaby yes 2008-11-15 01:51 but that doesn't mean it should be ignored 2008-11-15 01:52 yes, it can be ignored 2008-11-15 01:53 I think it is why many sources doesn't use #include "*.c" 2008-11-15 01:55 well, anyway, it does not matter I think 2008-11-15 01:55 I thought the system of redefining main would break when there were a lot of files 2008-11-15 01:55 but it didn't 2008-11-15 01:55 just found out how to do it so it doesn't break 2008-11-15 01:56 very useful for rapid code writing 2008-11-15 01:56 but gets less useful when the coding slows down 2008-11-15 01:56 so many c programs are very badly organized 2008-11-15 01:56 because it's too much work to change the c files _and_ the header files 2008-11-15 01:57 the worst programs to work with are the ones that declare everything in headers, then put the functions in random order in the c files 2008-11-15 01:57 yes, it's bad 2008-11-15 01:58 basically foo.c should have foo.h in my thinking 2008-11-15 01:59 and I try to minimize header files 2008-11-15 01:59 yes 2008-11-15 02:00 those should be included into .c? 
2008-11-15 02:00 I know what the thinking is, there is an attempt to see a c file as a module and the .h file as the definition 2008-11-15 02:00 yes 2008-11-15 02:00 c doesn't do that kind of "hiding" very well 2008-11-15 02:00 it's too limited 2008-11-15 02:01 so instead of "hiding" it ends up being bogus modularization 2008-11-15 02:01 often, when you put functions right beside each other, you see they are factored wrong 2008-11-15 02:01 but nobody notices that because they're in separate files 2008-11-15 02:01 and it's too much work to move them around because of the headers 2008-11-15 02:02 yes 2008-11-15 02:02 limitation of c? 2008-11-15 02:04 if only c++ and c had not diverged 2008-11-15 02:04 c is barbaric in many ways 2008-11-15 02:05 well, enough of that :) 2008-11-15 02:05 I will start to check in git stuff from tomorrow 2008-11-15 02:05 yes 2008-11-15 02:05 ok 2008-11-15 02:05 um.. what is check mean? 2008-11-15 02:06 "check in" is an idiom 2008-11-15 02:06 ah 2008-11-15 02:06 ok 2008-11-15 02:06 "check out" comes from shopping, taking stuff to the cash register 2008-11-15 02:06 it got used in source control 2008-11-15 02:07 and it seemed to make sense to say check in also 2008-11-15 02:07 yes 2008-11-15 02:07 I misread it "check" and "in git" 2008-11-15 02:07 :) 2008-11-15 02:08 I never thought about it until you mentioned it, but there was no term "check in" before revision control systems came along 2008-11-15 02:08 there was, but it did not mean that 2008-11-15 02:09 also, there is another completely different meaning of "check out", which means to take a close look at something 2008-11-15 02:09 english is strange 2008-11-15 02:09 japanese too :) 2008-11-15 02:09 three alphabets is too many :) 2008-11-15 02:10 yes 2008-11-15 02:11 btw, well, I also try userspace to kernel port 2008-11-15 02:11 sure 2008-11-15 02:11 I'll post my diff for tux3.h to the list 2008-11-15 02:11 it is by no means perfect 2008-11-15 02:12 thanks 2008-11-15 02:12 btw, we port 
it at first? 2008-11-15 02:13 we port it with atomic commit change? 2008-11-15 02:15 I'm assuming, if it's just porting, it's not so hard 2008-11-15 02:15 but maybe atomic commit is hard for me at least 2008-11-15 02:16 I think we port first, then develop the atomic commit, first in user space, and port those changes later 2008-11-15 02:16 yes 2008-11-15 02:16 it's tricky 2008-11-15 02:16 there are some easy bits 2008-11-15 02:16 the commit block code can be written pretty easily 2008-11-15 02:16 and we can then test the cache reconstruction in user space 2008-11-15 02:17 I think in kernel, we should first just get read working 2008-11-15 02:17 ok 2008-11-15 02:17 so we make a filesystem in userspace and read it in kernel 2008-11-15 02:17 good 2008-11-15 02:17 I hope we can do it in a few days 2008-11-15 02:18 I think that's all it will be 2008-11-15 02:18 a couple days 2008-11-15 02:18 see that nice bio code that maze helped write a month or so ago 2008-11-15 02:18 that's all we need for reading in from the filesystem at first 2008-11-15 02:19 oh 2008-11-15 02:19 that's for reading our metadata blocks 2008-11-15 02:19 I'm going to use sb_bread() 2008-11-15 02:19 I was going 2008-11-15 02:20 take a look at vecio, it's much nicer 2008-11-15 02:20 less fakery 2008-11-15 02:20 directly uses the bio interface 2008-11-15 02:20 vecio and syncio 2008-11-15 02:21 ah, it means for multiple buffer_heads? 2008-11-15 02:21 yes 2008-11-15 02:22 so we will do it same way even if one buffer_head? 
2008-11-15 02:22 yes 2008-11-15 02:22 i see 2008-11-15 02:22 it's perfectly efficient 2008-11-15 02:22 yes, sounds good 2008-11-15 02:23 junkfs_fill_super can go away of course 2008-11-15 02:23 ok 2008-11-15 02:24 that was just a joke for maze 2008-11-15 02:24 I think maybe bio stuff should be finished first 2008-11-15 02:25 s/I think/I should try/ 2008-11-15 02:27 it's using semaphore, but maybe we want to use wakeup or something 2008-11-15 02:27 right, there is a version that uses wakeup 2008-11-15 02:27 oh 2008-11-15 02:27 it really does not make much difference 2008-11-15 02:27 I can change it back without changing the interface 2008-11-15 02:28 yes, change should be simple 2008-11-15 02:28 http://phunq.net/ddtree?p=tux3fs;a=commitdiff;h=11bab0d727fcd10295330a3eeeb5d24fb4783d27 2008-11-15 02:29 ah, it's changed wakeup to semphore 2008-11-15 02:30 yes, that was just to try it out 2008-11-15 02:30 I'm not sure right now whether wakeup is enough or not 2008-11-15 02:30 it may be completion() 2008-11-15 02:32 it's ok to leave it just as it is for now 2008-11-15 02:32 the syncio/vecio interface is the one that matters 2008-11-15 02:33 ok 2008-11-15 02:35 the "syncio" interface can go in diskio.c I think 2008-11-15 02:35 and just edit the diskio calls to use it 2008-11-15 02:36 probably we can 2008-11-15 02:37 dev->fd obviously changes 2008-11-15 02:37 we need a wrapper 2008-11-15 02:37 sb_dev(struct tux3_sb *sb) 2008-11-15 02:38 something like that 2008-11-15 02:38 or set fd to buffer 2008-11-15 02:38 has to be dev 2008-11-15 02:38 ah 2008-11-15 02:39 oh right, buffers have devs too 2008-11-15 02:39 that's kind of dumb 2008-11-15 02:39 we can just pass buffer to it 2008-11-15 02:39 yes 2008-11-15 02:39 works in user space too 2008-11-15 02:40 looks like easy 2008-11-15 02:40 make a blockread and a blockwrite function 2008-11-15 02:40 ok 2008-11-15 02:41 btw, will we use wait_on_buffer()? 
2008-11-15 02:41 we will, but let's just start with syncio 2008-11-15 02:41 and think about what our completion model will be 2008-11-15 02:42 don't we need to buffer_head to bio? 2008-11-15 02:42 filemap should submit all its io asynchronously 2008-11-15 02:42 no 2008-11-15 02:42 buffer_head is actually completely useless for IO 2008-11-15 02:43 just adds an obfuscating layer 2008-11-15 02:43 we can wait on our own synchronizer 2008-11-15 02:43 the only reason we can't get rid of buffer_heads completely is, we need them as locking handles 2008-11-15 02:43 if we read buffer by two users 2008-11-15 02:44 for btree code for example 2008-11-15 02:44 we may pass two bio to device? 2008-11-15 02:44 that is prevented by locking the buffer 2008-11-15 02:44 when the second user acquires the lock, it will see the buffer is uptodate 2008-11-15 02:45 isn't it wait_on_buffer? 2008-11-15 02:45 yes :) 2008-11-15 02:45 ok :) 2008-11-15 02:48 I'm thinking, we might be able to have page->private point at something other than a list of buffers 2008-11-15 02:49 it could point at an array of locks 2008-11-15 02:49 I'm just wondering if that gets rid of the need for buffers 2008-11-15 02:50 i see 2008-11-15 02:52 going field by field 2008-11-15 02:52 we don't need ->data because we have that from the page cache page 2008-11-15 02:52 we don't need ->index for the same reason 2008-11-15 02:52 we do need state 2008-11-15 02:52 i see 2008-11-15 02:53 but not count, we can use the page count instead 2008-11-15 02:53 I'm not sure about ->index 2008-11-15 02:53 because the only purpose is to know when to destroy the object 2008-11-15 02:53 instead of a buffer, we point at our little vector of locks and state 2008-11-15 02:54 yes 2008-11-15 02:54 we get the low bits of the buffer index from the low bits of that address 2008-11-15 02:54 we align those objects to make it work 2008-11-15 02:54 which we want to do anyway 2008-11-15 02:55 i see 2008-11-15 02:55 don't need lru, the page has that
2008-11-15 02:55 we do need the dirty link, that might be another field on our little array 2008-11-15 02:55 we don't need the hashlink because the page has the radix tree 2008-11-15 02:56 and we don't need to map because page->mapping has that 2008-11-15 02:56 so... what do we need buffers for? 2008-11-15 02:56 backward compatible :) 2008-11-15 02:56 with? 2008-11-15 02:57 e.g. maybe external module 2008-11-15 02:57 we could just all our little lock/state/list struct a "buffer" :) 2008-11-15 02:57 because struct buffer is not used in kernel 2008-11-15 02:57 s/all/call/ 2008-11-15 02:57 this requires not using the block io library at all 2008-11-15 02:57 which uses the buffer lists 2008-11-15 02:58 but I think I was going to avoid that anyway 2008-11-15 02:58 it's ugly and buggy 2008-11-15 02:58 maybe has no bugs right now 2008-11-15 02:58 but gets them regularly 2008-11-15 02:58 something a little bit radical to think about 2008-11-15 02:58 for the moment we will use buffers :) 2008-11-15 02:59 ->b_blocknr? 2008-11-15 02:59 yes 2008-11-15 02:59 tux3 doesn't use that 2008-11-15 02:59 yes 2008-11-15 02:59 it's just a cache of the physical block number, which is not obviously useful 2008-11-15 02:59 yes 2008-11-15 03:00 maybe all fs should use delayed write? 
2008-11-15 03:00 no 2008-11-15 03:00 delayed allcation 2008-11-15 03:00 it's the trend 2008-11-15 03:00 no question 2008-11-15 03:00 yes 2008-11-15 03:01 the only possible win is in repeatedly writing the same block, which is really rare 2008-11-15 03:01 hard work, but sounds right way 2008-11-15 03:01 hard work if we have to convert all fs 2008-11-15 03:01 good thing we don't 2008-11-15 03:02 I'd like to try to do IO directly at the bio layer as much as possible 2008-11-15 03:02 yes 2008-11-15 03:02 probably all people think so 2008-11-15 03:02 Morning flips 2008-11-15 03:02 hi maarten 2008-11-15 03:02 Erm, I don't want to know why you're still awake 2008-11-15 03:03 right 2008-11-15 03:03 time to sleep 2008-11-15 03:03 my little girl told me I have to get up early to play with her 2008-11-15 03:03 lol 2008-11-15 03:03 So you stay awake the whole night? :P 2008-11-15 03:03 it's only 3 2008-11-15 03:03 but you're right 2008-11-15 03:03 you should sleep :) 2008-11-15 03:04 "early to sleep, early to rise, makes a man healty, wealthy and wise" 2008-11-15 03:04 good night 2008-11-15 03:04 night 2008-11-15 03:05 good night 2008-11-15 09:32 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-15 11:15 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-15 11:57 -!- pranith(~Bobby@122.162.67.35) has joined #tux3 2008-11-15 12:35 -!- ajonat(~ajonat@190.48.104.174) has joined #tux3 2008-11-15 13:08 -!- Bobby_(~Bobby@122.162.67.35) has joined #tux3 2008-11-15 15:21 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-15 16:48 -!- pgquiles(~pgquiles@166.Red-83-45-22.dynamicIP.rima-tde.net) has joined #tux3 2008-11-15 16:54 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-15 19:19 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-15 19:37 maze, ping 2008-11-15 19:37 . 
2008-11-15 19:37 , 2008-11-15 19:37 hey 2008-11-15 19:38 see the chat yesterday about buffers? 2008-11-15 19:38 right up your alley I think 2008-11-15 19:38 I don't think I've seen it 2008-11-15 19:39 the quick summary is that we think there is a simple, viable alternative to buffers 2008-11-15 19:39 which are used only for locking blocks 2008-11-15 19:39 care to give me a timestamp 2008-11-15 19:39 hmm, I don't have timestamps turned on 2008-11-15 19:40 it's the most recent stuff from 2008-11-15 19:40 just a couple pages back 2008-11-15 19:40 11-14 22:58 and on? 2008-11-15 19:40 right now most filesystems use page->private, which is a circular list of buffer_heads 2008-11-15 19:40 2008-11-14 22:59 first thing I find is the kernel has its own, sloppily defined bitops 2008-11-15 19:41 you're back too far 2008-11-15 19:41 I'll get a line of text 2008-11-15 19:41 take a look at vecio, it's much nicer 2008-11-15 19:42 buffer_head is actually completely useless for IO 2008-11-15 19:42 I'm thinking, we might be able to have page->private point at something other than a list of buffers 2008-11-15 19:43 so the proposal is to have page->private point at an array of locks + state instead of at buffers 2008-11-15 19:44 call each lock + state a "handle" 2008-11-15 19:44 so an array of handles 2008-11-15 19:44 from a handle, we need to be able to find the underlying page 2008-11-15 19:46 hmm 2008-11-15 19:46 a handle ends up being much like a buffer field, except for having fewer fields and being allocated in sets of size blocks-per-page 2008-11-15 19:46 still trying to parse that 2008-11-15 19:46 err 2008-11-15 19:46 a handle ends up being much like a buffer_head, except for having fewer fields and being allocated in sets of size blocks-per-page 2008-11-15 19:50 still reading 2008-11-15 19:50 if we use bit-spin locks, then the array becomes a bitmap 2008-11-15 19:51 bit-spin is a little misleading 2008-11-15 19:51 bit lock 2008-11-15 19:51 requires one bit and a wait queue 
2008-11-15 19:52 ok, done parsing 2008-11-15 19:52 however digestion is posing problems 2008-11-15 19:53 folks 2008-11-15 19:54 ok, I'm beginning to grok this 2008-11-15 19:54 the proposition is: page->private, which is currently a list of buffers, could easily be something else. The block IO library assumes it is a list of buffers, therefore could not be used. 2008-11-15 19:54 yup 2008-11-15 19:54 I've got that much 2008-11-15 19:54 if it means we break the block io library 2008-11-15 19:54 that sounds like an all around win 2008-11-15 19:54 I think so 2008-11-15 19:55 it's gross code 2008-11-15 19:55 trying to be general 2008-11-15 19:55 also it's a bug magnet 2008-11-15 19:55 it's also bottlenecky in that it forces page at a time access 2008-11-15 19:56 so, to fix that, there is the mpage library 2008-11-15 19:56 multi-page 2008-11-15 19:56 which is another huge mess 2008-11-15 19:56 so basically you need a slab cache of size (header + blocks per page * sizeof(struct handle) ) 2008-11-15 19:56 exactly 2008-11-15 19:56 and header includes a pointer to the page 2008-11-15 19:56 and alignment needs to be forced in such a way 2008-11-15 19:57 yes, and each handle has some way of getting back to the header 2008-11-15 19:57 that a handle pointer can find the beginning of the entire slab entry 2008-11-15 19:57 my first thought was alignment to find the beginning of the handle set 2008-11-15 19:57 that works actually 2008-11-15 19:57 except 2008-11-15 19:57 wastes space 2008-11-15 19:57 because there is slack space to enforce the alignment 2008-11-15 19:58 handle could store a byte saying which handle in entry it is 2008-11-15 19:58 that way you can always back up 2008-11-15 19:58 exactly 2008-11-15 19:58 or less than a byte, must a few bits from the state field 2008-11-15 19:58 must a few bits 2008-11-15 19:58 ack 2008-11-15 19:58 just a few bits 2008-11-15 19:58 well 2008-11-15 19:58 depends on size of block and page 2008-11-15 19:58 might be 512 bytes and 16 K?
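[editor's sketch] The handle-set layout described above can be sketched in plain C. This is an illustration only, assuming the 512-byte block and 16 K page case just mentioned, which gives 16384 / 512 = 32 handles per page and therefore 5 index bits. All names here (struct handle, struct handle_set, handle_to_set and so on) are hypothetical, and a void pointer stands in for the kernel's struct page pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES	16384u	/* assumed page size */
#define BLOCK_BYTES	512u	/* assumed block size */
#define HANDLES_PER_PAGE (PAGE_BYTES / BLOCK_BYTES)	/* 32 handles */
#define INDEX_BITS	5u	/* 2^5 = 32 positions */
#define INDEX_MASK	((1u << INDEX_BITS) - 1)

_Static_assert(HANDLES_PER_PAGE <= (1u << INDEX_BITS),
	       "index bits must cover every handle in a set");

/* One lock + state word per block; the low bits record the position. */
struct handle {
	uint32_t state;
};

/* Header plus one handle per block: what page->private would point at. */
struct handle_set {
	void *page;	/* stand-in for struct page * */
	struct handle handles[HANDLES_PER_PAGE];
};

static void handle_set_init(struct handle_set *set, void *page)
{
	set->page = page;
	for (unsigned i = 0; i < HANDLES_PER_PAGE; i++)
		set->handles[i].state = i;	/* index stored in the low bits */
}

static unsigned handle_index(const struct handle *handle)
{
	return handle->state & INDEX_MASK;
}

/* Back up from any handle to its header using the stored index. */
static struct handle_set *handle_to_set(struct handle *handle)
{
	struct handle *first = handle - handle_index(handle);

	assert(handle_index(first) == 0);
	return (struct handle_set *)((char *)first -
				     offsetof(struct handle_set, handles));
}

static void *handle_page(struct handle *handle)
{
	return handle_to_set(handle)->page;
}
```

Storing the index in the state word avoids the slack space that forced alignment would cost, at the price of reserving a few state bits per handle.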
2008-11-15 19:58 ? 2008-11-15 19:58 so 5 bits at least 2008-11-15 19:59 what size are blocks in tux3? 2008-11-15 19:59 5 bits is reasonable 2008-11-15 19:59 4k for practical purposes 2008-11-15 20:00 in which case this all devolves 2008-11-15 20:00 there are some 8k arches, but taking advantage of that to get 8k fs blocks leaves you stuck with a filesystem that can't be read on x86 2008-11-15 20:00 bad for testing 2008-11-15 20:00 in what sense? 2008-11-15 20:00 should use smaller block than page size 2008-11-15 20:00 of course 2008-11-15 20:00 otherwise 8k archs will be always broken 2008-11-15 20:01 one nice thing about using handle sets is, lots of loops go away 2008-11-15 20:02 in general it should be much easier to move around 2008-11-15 20:02 and find stuf 2008-11-15 20:02 this is only needed where block size objects have to be locked 2008-11-15 20:03 which includes everything in "buffer cache" (we need a better name for that) and a few things in page cache, for example a dir btree 2008-11-15 20:03 or an atom table refcount block 2008-11-15 20:04 regular file data doesn't need block locking at all, it can rely on the page lock 2008-11-15 20:05 hmm 2008-11-15 20:05 can it? or must it? rely on the page lock? 
2008-11-15 20:06 "can" 2008-11-15 20:06 must 2008-11-15 20:06 well we'd better resolve that 2008-11-15 20:06 I guess, because of permissions being steered at the page level 2008-11-15 20:06 pages don't have permissions 2008-11-15 20:06 an entire page of data has to have the same state 2008-11-15 20:06 actually it doesn't 2008-11-15 20:06 it is ok for one block to be uptodate and others on the same page, not 2008-11-15 20:07 right, but 2008-11-15 20:07 dirty bit will mark the whole page 2008-11-15 20:07 so unless due to partial write out - you'll never have parts of a page valid and others not 2008-11-15 20:07 yes, and if block-granular dirty bits are set, then the page dirty bit must be clear, I think that's the current rule 2008-11-15 20:08 partial page valid is a common state that has to be handled accurately 2008-11-15 20:08 doesn't get much testing these days, so I assume there are bugs 2008-11-15 20:09 there are very few 1k block ext3 filesystems around any more 2008-11-15 20:09 with the demise of floppies 2008-11-15 20:09 and even then, testing was light 2008-11-15 20:10 page dirtiness is now determined by a hierarchy 2008-11-15 20:10 starting with the page table entry state 2008-11-15 20:10 when we check page table for dirtiness, we clear that bit and set the page dirty bit 2008-11-15 20:11 that's what I meant by permissions above - whether a page is present, readable, writeable etc 2008-11-15 20:11 when we check a block for dirtiness, we clear the page dirty bit and set the block dirty bit 2008-11-15 20:11 there's kind of a pattern 2008-11-15 20:12 a further complication is the page writeback state 2008-11-15 20:12 when a page is submitted for IO, the dirty bit is cleared and the writeback bit set 2008-11-15 20:12 the reason for this is to detect a re-dirty during writeback 2008-11-15 20:13 I don't think the writeback state has any other purpose 2008-11-15 20:13 I could be wrong about that, these details are not exactly documented
2008-11-15 20:13 buffers do not have a writeback state, go figure about that 2008-11-15 20:14 I think the reason this can be omitted is, buffer operations always acquire the buffer lock before operating on the buffer 2008-11-15 20:14 these things are pretty obvious how they should be in order to work though ;-) 2008-11-15 20:14 not the case with pages, that can have dirty state change asynchronously 2008-11-15 20:14 obvious to you maybe 2008-11-15 20:14 not anyone else 2008-11-15 20:15 it's been very different in the past 2008-11-15 20:15 and there has been a decreasing number of bugs over the years, but no assurance they're all gone 2008-11-15 20:15 try to find a precise statement of the state rules I just mentioned, for example 2008-11-15 20:16 the closest you'll get are a few confident sounding posts from Linux 2008-11-15 20:16 linus 2008-11-15 20:16 where he sounds like he knows exactly what he's talking about, and lo, another bug connected with fundamental design issues will show up a few months later 2008-11-15 20:16 so my conclusion: nobody has got this entirely straight 2008-11-15 20:17 not linus, akpm, me or you ;) 2008-11-15 20:17 well, for example, if you want to act on something being dirty, you have to clear the dirty bit before doing anything else, so you can catch a re-dirtying while you were cleaning it all up 2008-11-15 20:17 however 2008-11-15 20:17 that would mean the page now seems clear to others 2008-11-15 20:18 and you also need to recognize the possibility/inevitability of asynchronous dirtying 2008-11-15 20:18 so you first have to mark the page as in writeout 2008-11-15 20:18 then clear the dirty bit 2008-11-15 20:18 then write it out 2008-11-15 20:18 then clear the in-writeout bit 2008-11-15 20:18 and then the page may or may not be dirty - whichever it is the dirty bit will tell you 2008-11-15 20:18 the writeout bit probably has to be a lock bit 2008-11-15 20:18 so what exactly is the writeback state used for?
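[editor's sketch] The four-step sequence just described (mark in writeout, clear dirty, write out, clear in-writeout) can be modelled as a toy in userspace C11. This is a sketch of the ordering argument only, not the kernel's page-flag code. The names toy_page, PG_DIRTY and PG_WRITEOUT are invented for illustration, and a counter stands in for the actual disk write.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { PG_DIRTY = 1u << 0, PG_WRITEOUT = 1u << 1 };	/* toy flag bits */

struct toy_page {
	atomic_uint flags;
	int writes;	/* counts completed writeouts, stands in for real IO */
};

static void toy_page_init(struct toy_page *page)
{
	atomic_init(&page->flags, 0);
	page->writes = 0;
}

/* May be called at any time, including while a writeout is in flight. */
static void toy_dirty(struct toy_page *page)
{
	atomic_fetch_or(&page->flags, PG_DIRTY);
}

static bool toy_is_dirty(struct toy_page *page)
{
	return atomic_load(&page->flags) & PG_DIRTY;
}

/*
 * Steps 1 and 2: take the writeout bit (acting as a lock bit), then clear
 * dirty. Returns false if a writeout is already in flight or there is
 * nothing to write.
 */
static bool toy_writeout_begin(struct toy_page *page)
{
	unsigned old = atomic_fetch_or(&page->flags, PG_WRITEOUT);

	if (old & PG_WRITEOUT)
		return false;	/* never two writes of one page in flight */
	old = atomic_fetch_and(&page->flags, ~(unsigned)PG_DIRTY);
	if (!(old & PG_DIRTY)) {
		atomic_fetch_and(&page->flags, ~(unsigned)PG_WRITEOUT);
		return false;	/* page was clean */
	}
	return true;
}

/* Steps 3 and 4: the write completes, then the in-writeout bit is cleared. */
static void toy_writeout_end(struct toy_page *page)
{
	assert(atomic_load(&page->flags) & PG_WRITEOUT);
	page->writes++;
	atomic_fetch_and(&page->flags, ~(unsigned)PG_WRITEOUT);
}
```

A page dirtied between begin and end comes out of toy_writeout_end() still dirty, so the next pass picks it up, which is the re-dirty case the chat is concerned with.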
2008-11-15 20:19 answer: to avoid submitting the page for writeout again 2008-11-15 20:19 well 2008-11-15 20:19 no 2008-11-15 20:19 to avoid submitting the page for reading 2008-11-15 20:19 I'd understand it to mean currently being written out 2008-11-15 20:19 yes, but why do we need to know? we already have the uptodate bit 2008-11-15 20:20 I think the only reason we need to know is to avoid submitting a non-uptodate page for reading 2008-11-15 20:20 you can't have the same page in write-to-disk flight twice at the same time 2008-11-15 20:20 or you have a race 2008-11-15 20:20 because the writes might happen in the reverse order 2008-11-15 20:20 but we should never have that if the dirty bit is cleared when the page is submitted 2008-11-15 20:20 that in itself is not the purpose of having a writeback state 2008-11-15 20:21 the bit might get redirtied 2008-11-15 20:21 and the page resubmitted 2008-11-15 20:21 ok, point 2008-11-15 20:21 precisely - hence the need for the write state 2008-11-15 20:21 one of the two reasons 2008-11-15 20:21 the other is to avoid submitting a not uptodate page for reading 2008-11-15 20:21 I'm not sure I buy that 2008-11-15 20:22 because? 2008-11-15 20:22 if the page is in memory - it's more uptodate than disk 2008-11-15 20:22 the only exception being a totally invalid page 2008-11-15 20:22 a new page is not uptodate 2008-11-15 20:22 and not dirty 2008-11-15 20:22 whether the page is dirty or being written out, doesn't matter 2008-11-15 20:22 no, but a new page is just plain invalid 2008-11-15 20:23 there's no explicit 'invalid' state? 
2008-11-15 20:23 invalid == !uptodate 2008-11-15 20:23 oh ;-) 2008-11-15 20:23 invalid == !uptodate && !dirty && !writeback 2008-11-15 20:23 it's kind of a mess 2008-11-15 20:24 these shouldn't be independent state bits 2008-11-15 20:24 no they shouldnt' 2008-11-15 20:24 they should be a single enum 2008-11-15 20:24 but 2008-11-15 20:24 as I did it in tux3/buffer.c 2008-11-15 20:24 because of the cpu being bits maybe forced on us for dirty 2008-11-15 20:24 although no 2008-11-15 20:24 that's only in the pte 2008-11-15 20:24 exactly 2008-11-15 20:24 it's pure braindamage 2008-11-15 20:25 sometimes linus's tries to see meaning in the independent bits, but there is none 2008-11-15 20:25 INVALID, SYNCED, DIRTY, WRITEOUT, DIRTY-WRITEOUT 2008-11-15 20:26 wondering if there's anything I've missed 2008-11-15 20:26 yes 2008-11-15 20:27 possibly a READ-IN if we want async readin as well 2008-11-15 20:27 actually readin seems useful period 2008-11-15 20:27 the one you missed is "dirty in a previous delta" 2008-11-15 20:27 right, but that's tux specific ;-) 2008-11-15 20:27 it's a general concept 2008-11-15 20:27 ext3 does something similar 2008-11-15 20:27 true 2008-11-15 20:28 I don't know why this isn't a enum, but: 2008-11-15 20:28 #define BUFFER_STATE_EMPTY 1 2008-11-15 20:28 #define BUFFER_STATE_CLEAN 2 2008-11-15 20:28 #define BUFFER_STATE_DIRTY 3 2008-11-15 20:28 #define BUFFER_STATE_JOURNALED 4 2008-11-15 20:28 I was still thinking of this as more of a block-dev cache then a fs related setting 2008-11-15 20:28 it's going to be an enum in a few minutes ;-) 2008-11-15 20:28 and there should also be #define BUFFER_STATE_INVAL 0 2008-11-15 20:29 although I guess INVALID could almost mean READ-IN 2008-11-15 20:29 depends on the semantics you might want 2008-11-15 20:29 it's good for robustness to have it separate, to detect complete garbage 2008-11-15 20:29 not much point to invalid pages that you aren't trying to read in 2008-11-15 20:29 or forgetting to inialize the 
state 2008-11-15 20:29 ...to (have) invalid... 2008-11-15 20:30 right, but it's cleaner to have both - agreed 2008-11-15 20:30 the buffer_state_journalled state is a holdover from ddsnap which actually has a journal 2008-11-15 20:31 in fact, just one of those states isn't enough in general 2008-11-15 20:31 it's natural to want to have more than two deltas in the pipeline in cache 2008-11-15 20:32 also, you don't want to have to go walking through lists and changing buffer states at delta transitions 2008-11-15 20:32 ok, so personally 2008-11-15 20:32 so what has to be done is, you do modular arithmetic on the journal/delta state 2008-11-15 20:32 I still think the entire multiple deltas in flight thing 2008-11-15 20:32 should be solved by having multiple pages with the data 2008-11-15 20:32 BUFFER_STATE_DELTA0/1/2/3 2008-11-15 20:32 with cow semantics 2008-11-15 20:33 you have to know when a cow is needed 2008-11-15 20:33 right 2008-11-15 20:33 that is the main purpose of the multiple delta states 2008-11-15 20:33 it might be enough to have just two 2008-11-15 20:33 so you need to have a state along the lines of WRITE-OUT, COPY-on-WRITE 2008-11-15 20:34 "in the active delta" and "in some other delta" 2008-11-15 20:34 but then the modular arithmentic breaks 2008-11-15 20:34 arithmetic 2008-11-15 20:34 ack 2008-11-15 20:34 not clear whether the pages can be relinked approprately though when we catch the write to a WRITEOUT COW page 2008-11-15 20:34 I think BUFFER_STATE_DELTA0/1/2/3 is the right idea 2008-11-15 20:35 then you test against the low bits of the delta counter 2008-11-15 20:35 to determine when cow is needed 2008-11-15 20:35 a page in read or write flight can't be touched because of dma etc 2008-11-15 20:35 you mean, does it work to atomically pull a page out of the page cache for cow? 
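The BUFFER_STATE_DELTA0/1/2/3 idea, where the buffer state is tested against the low bits of the delta counter to decide whether a copy-on-write is needed, might look roughly like the C sketch below. Everything here is an assumption for illustration: the `toy_*` names, the enum values (which differ from the `#define`s quoted earlier), and the choice of starting the dirty states at a multiple of four so the low two bits carry the delta number.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-enum buffer state with four per-delta dirty states. */
enum toy_buffer_state {
    TOY_STATE_INVAL = 0,  /* uninitialized or garbage */
    TOY_STATE_EMPTY,      /* allocated, not yet read in */
    TOY_STATE_CLEAN,      /* matches disk, may be evicted */
    TOY_STATE_DIRTY0 = 4, /* assumption: base aligned to 4 */
    TOY_STATE_DIRTY1,
    TOY_STATE_DIRTY2,
    TOY_STATE_DIRTY3,
};

#define TOY_DELTA_MASK 3u /* low two bits of the delta counter */

/* Dirty state that a buffer modified in the given delta would carry. */
static enum toy_buffer_state toy_dirty_state(unsigned delta)
{
    return (enum toy_buffer_state)(TOY_STATE_DIRTY0 + (delta & TOY_DELTA_MASK));
}

static bool toy_is_dirty(enum toy_buffer_state state)
{
    return state >= TOY_STATE_DIRTY0 && state <= TOY_STATE_DIRTY3;
}

/*
 * A write in the active delta needs a cow when the buffer is dirty in
 * some other delta. Clean buffers are simply dirtied in place.
 */
static bool toy_needs_cow(enum toy_buffer_state state, unsigned active_delta)
{
    assert(state != TOY_STATE_INVAL);
    return toy_is_dirty(state) && state != toy_dirty_state(active_delta);
}
```

Because only two bits are compared, this sketch can tell apart at most four deltas in flight, and no list walking is needed at a delta transition: bumping the counter changes which dirty state counts as "active".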
2008-11-15 20:35 I hope it does 2008-11-15 20:36 more like flip a WRITE-OUT, COW page into an CLEAN page which we just cloned 2008-11-15 20:36 dma code never goes fiddling into the radix tree 2008-11-15 20:36 but right 2008-11-15 20:36 you can't do that 2008-11-15 20:36 you have to copy the page, and insert the cope where the original was 2008-11-15 20:36 the copy 2008-11-15 20:36 right -> 'flip' 2008-11-15 20:37 ok 2008-11-15 20:37 I'll accept that meaning of flip ;) 2008-11-15 20:37 here meaning replace, and clone meaning copy into new page 2008-11-15 20:37 tux3 terminology is "fork" 2008-11-15 20:37 well, the flip was in the page cache or wherever, the fork was the entire process 2008-11-15 20:38 cloning doesn't capture the idea of inserting the copy into the index structure 2008-11-15 20:38 flip is decent terminology 2008-11-15 20:38 fork a page [ clone page [allocate new page, copy data], flip pointers] 2008-11-15 20:38 however I got there first with 'fork' ;) 2008-11-15 20:38 so for me fork = clone + flip, clone = allocate + copy 2008-11-15 20:38 sure 2008-11-15 20:39 which means a new state is needed 2008-11-15 20:39 WRITEOUT COW FORK IN PROGRESS 2008-11-15 20:39 not needed, the radix lock makes it atomic 2008-11-15 20:39 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-15 20:40 atomic in regards to the io completion as well? 
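The terminology settled on above (fork = clone + flip, clone = allocate + copy) can be written down as a small C sketch. The flat `toy_index` array merely stands in for the page cache index, and all `toy_*` names are hypothetical; locking is omitted, whereas the discussion relies on the radix tree lock to make the flip atomic.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_INDEX_SLOTS 16

struct toy_page {
    unsigned char data[TOY_PAGE_SIZE];
};

/* Stand-in for the page cache index (a radix tree in the kernel). */
struct toy_index {
    struct toy_page *slot[TOY_INDEX_SLOTS];
};

/* clone = allocate + copy */
static struct toy_page *toy_clone(const struct toy_page *orig)
{
    struct toy_page *copy = malloc(sizeof *copy);
    if (copy)
        memcpy(copy->data, orig->data, TOY_PAGE_SIZE);
    return copy;
}

/* flip = make the index point at the copy; returns the page it replaced */
static struct toy_page *toy_flip(struct toy_index *index, unsigned at,
                                 struct toy_page *copy)
{
    assert(at < TOY_INDEX_SLOTS);
    struct toy_page *orig = index->slot[at];
    index->slot[at] = copy;
    return orig;
}

/*
 * fork = clone + flip. Returns the original page, which is now out of
 * the index and belongs to whatever IO is still in flight, or NULL if
 * the allocation failed (the index is left untouched in that case).
 */
static struct toy_page *toy_fork(struct toy_index *index, unsigned at)
{
    assert(at < TOY_INDEX_SLOTS && index->slot[at]);
    struct toy_page *copy = toy_clone(index->slot[at]);
    if (!copy)
        return NULL;
    return toy_flip(index, at, copy);
}
```

After a fork, new writes land in the copy found through the index, while the original keeps the data that the in-flight write is transferring.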
2008-11-15 20:40 atomic in regards to anybody changing the data while copying/flipping 2008-11-15 20:40 note that this breaks for mmapped pages 2008-11-15 20:41 it's not the data that can change - it's the state of the original page 2008-11-15 20:41 also, it's not needed for mmapped pages, which is lucky 2008-11-15 20:41 it might transition from WRITEOUT COW to CLEAN 2008-11-15 20:41 ok, so that probably actually doesn't hurt us 2008-11-15 20:41 right 2008-11-15 20:41 although 2008-11-15 20:41 because the original is out of the page cache 2008-11-15 20:41 we could potentially have enough information there for the iio completion function to know it can free the page 2008-11-15 20:42 ok, here's a problem: when completion happens, the state of the _copy_ has to go to uptodate 2008-11-15 20:42 unless it's dirty 2008-11-15 20:42 you sure the copy wasn't uptodate to begin with? 2008-11-15 20:42 so that looks like a mess 2008-11-15 20:43 ah, I was wrong 2008-11-15 20:43 the copy was made precisely because it was re-dirtied 2008-11-15 20:43 so it stays dirty 2008-11-15 20:43 we get rid of the "writeback" state 2008-11-15 20:43 looks to me like the copy always ends up CLEAN, and ready for DIRTYing, but not actually dirty 2008-11-15 20:43 taht just becomes "some previous delta" 2008-11-15 20:44 it's not clean 2008-11-15 20:44 if it was clean it could be evicted 2008-11-15 20:44 yup 2008-11-15 20:44 it's "dirty2" or whatever 2008-11-15 20:44 it can 2008-11-15 20:44 so we will have dirty0/dirty1/dirty2/dirty3 most probably 2008-11-15 20:45 hmm 2008-11-15 20:45 and those states are treated modulo the delta counter 2008-11-15 20:45 this all sort of depends on whether we're looking at this more from a fs or block-dev centric perspective 2008-11-15 20:46 so for micro-niceness, we want the dirty0 value to have its lower two bits clear 2008-11-15 20:46 block-dev centric perspective - we're caching and writeing/reading specific physical blocks of data from device 2008-11-15 20:46 
fs-centric - we're caching and writeing/reading specific blocks of a file 2008-11-15 20:46 the two are widely different 2008-11-15 20:47 fs has to handle both 2008-11-15 20:47 no 2008-11-15 20:47 it has one bockdev cache and n file caches 2008-11-15 20:47 the block-dev interface should provide a nice generic functional implementation of the first 2008-11-15 20:47 while the fs should deal with the second 2008-11-15 20:47 it's illegal to have both block dev IO and fs IO going at the same time 2008-11-15 20:48 fs IO should be going through block dev io 2008-11-15 20:48 but the block dev isn't buffered 2008-11-15 20:48 fs IO goes direct to the device 2008-11-15 20:48 that's precisely the problem ;-) 2008-11-15 20:48 ? 2008-11-15 20:49 there should be a nice buffer on the block-dev 2008-11-15 20:49 that the fs should be using 2008-11-15 20:49 that's called the "buffer cache" 2008-11-15 20:49 we could make much better use of it than we do 2008-11-15 20:50 hmm, this of course gets quickly ridiculously complex, with multiple layered block devices (raid/lvm) 2008-11-15 20:51 [or rather it gets complex if you don't want to have multiple copies of everything in memory] 2008-11-15 20:51 there is a danger of double caching 2008-11-15 20:51 which we don't do much about currently 2008-11-15 20:51 which is probably why the decision was made not to cache at all, except at the fs level 2008-11-15 20:51 there was never any conscious decision ;) 2008-11-15 20:52 at some point you have to go to the device 2008-11-15 20:52 sure 2008-11-15 20:53 hmm 2008-11-15 20:53 if there was some sort of page-centric read/write mechanism 2008-11-15 20:53 I'm still introspecting a little about the idea of submitting bio traffic direclty within the sys->write 2008-11-15 20:53 I think that's right, but I'm not 100% sure 2008-11-15 20:53 that relied on state-flipping of pages + cow semantics to work 2008-11-15 20:54 ah, I was babbling 2008-11-15 20:54 we don't do that, we only submit the bio traffic 
after delta setup 2008-11-15 20:55 with proper semantics the same physical page could be the cache at all layers of the system 2008-11-15 20:55 I don't think stacked block devices should cache anything, unless they are also transforming it 2008-11-15 20:56 like raid, has to assemble the pieces of a page sometimes 2008-11-15 20:56 and compressing and encrypting block devices 2008-11-15 20:56 exactly 2008-11-15 20:56 so you can't share those pages, because they're different 2008-11-15 20:56 if data is already in memory 2008-11-15 20:56 we should be able to make use of it 2008-11-15 20:56 the data is in memory, sometimes, but it's not in the right place, usually not the right form either 2008-11-15 20:56 there should be very little reason for multiple copies of the same data in ram 2008-11-15 20:57 you need to be more specific 2008-11-15 20:57 not talking about encrypt or compress here 2008-11-15 20:57 are you talking about raid? 2008-11-15 20:57 just about lvm and raid 2008-11-15 20:57 ok, well you've got slightly the wrong idea about lvm 2008-11-15 20:57 encrypt and compress is different obviously 2008-11-15 20:57 lvm is more about redirecting block io 2008-11-15 20:57 it doesn't manipulate data as much as it manipulates handles for the data in the form of struct bios 2008-11-15 20:57 I know 2008-11-15 20:58 same thing with partitions 2008-11-15 20:58 (those should just be lvmed) 2008-11-15 20:58 handled by the same mechanism in fact 2008-11-15 20:58 they are 2008-11-15 20:58 essentially 2008-11-15 20:58 except current lvm is too clueless to know that 2008-11-15 20:58 ok, so besides raid, what else behaves partially transparently? 2008-11-15 20:58 at the bio level it's the same 2008-11-15 20:59 cryptoloop 2008-11-15 20:59 not really... 
there's crypto 2008-11-15 20:59 nbd 2008-11-15 20:59 but loop is probably a good point 2008-11-15 20:59 xxx-over-E 2008-11-15 20:59 loopback mount 2008-11-15 21:00 aren't those (nbd, xxx-over-E) just hw dev drivers like a normal sata hard disk? 2008-11-15 21:00 arguably, you could also say that the blockdev-to-blockdev layering is simply remote 2008-11-15 21:00 hmm, I'm beginning to think I'm over complicating things 2008-11-15 21:01 you could - except it isn't ;-) 2008-11-15 21:01 if it actually is remote, there's not much you can save via sharing ram 2008-11-15 21:01 isn't? 2008-11-15 21:01 if it's local, then you can potentially already have it in ram 2008-11-15 21:01 ah, let's forget this 2008-11-15 21:01 well I'm pointing out that there is no ram to share in most cases 2008-11-15 21:02 right 2008-11-15 21:02 loopback is the one that stands out 2008-11-15 21:02 I'm beginning to think this might have been a red herring 2008-11-15 21:02 besides raid and loop 2008-11-15 21:02 double caching in loopback is indeed a problem 2008-11-15 21:02 also NFS 2008-11-15 21:02 nfs local mount 2008-11-15 21:02 the client double caches with the fs backing the server 2008-11-15 21:03 right 2008-11-15 21:03 filthy problems that nobody wants to spend years fixing 2008-11-15 21:03 anyway, getting back to the fs, there is an opportunity to do very aggressive caching at the buffer layer 2008-11-15 21:03 I first thought that this would be the job of vfs 2008-11-15 21:04 but it now seems to me that the fs can take care of this perfectly well itself 2008-11-15 21:04 what I call "physical readahead" 2008-11-15 21:04 you speculatively read ranges of data into the buffer cache, then page table misses look first in the buffer cache before going to disk 2008-11-15 21:05 to make this work requires leaving additional state information in the buffer cache 2008-11-15 21:05 to know when a physical block is present in the page cache 2008-11-15 21:05 and therefore should not be speculatively read 
2008-11-15 21:06 (it doesn't hurt to have clean aliases of blocks in the buffer cache, except wasting memory and IO bandwidth) 2008-11-15 21:06 now I think this is something we could build into tux3 2008-11-15 21:06 and it would be a killer performance hack 2008-11-15 21:07 we might only need a single bit to say "this page is present in page cache" 2008-11-15 21:07 I don't think we need to be able to find _which_ page cache has the page 2008-11-15 21:08 it also doesn't hurt for the "is in page cache" bit to be slightly out of sync (I think) 2008-11-15 21:08 that would just mean a chance of some extra IO and a wasteful buffer cache alias 2008-11-15 21:13 sorry for zoning off ;-) someone else is bugging me 2008-11-15 21:17 flips, what % hit rate do you expect from the physical readahead alogorityhm? 2008-11-15 21:17 sp..? 2008-11-15 21:17 in some currently troublesome cases it will be spectacular 2008-11-15 21:17 what do you think is a troublesome case? 2008-11-15 21:18 what's the application? 2008-11-15 21:18 can see improvements by more than an order of magnitude, things like grepping a kernel tree 2008-11-15 21:18 the application is anything that has to read all the data in a bushy directory tree 2008-11-15 21:18 even a flat directory tree 2008-11-15 21:18 is this just the metadata, or the data? 
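The "physical readahead" idea sketched in the conversation (speculatively read physical ranges into the buffer cache, check the buffer cache first on a miss, and keep one bit per block saying the page cache already has it) can be modelled in a few lines of C. This is a toy model under stated assumptions: the `toy_*` names and the flat arrays are hypothetical, disk IO is reduced to a counter, and dropping the buffer cache alias on a hit is a design choice of this sketch rather than something stated in the log.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_BLOCKS 64

/* Hypothetical per-device buffer cache indexed by physical block. */
struct toy_bufcache {
    bool present[TOY_BLOCKS];       /* block data is held in the buffer cache */
    bool in_page_cache[TOY_BLOCKS]; /* hint: some file's page cache has it */
    unsigned disk_reads;            /* number of blocks fetched from disk */
};

/*
 * Speculatively read the physical range [start, start + count), skipping
 * blocks already cached at either level. Returns the number of blocks read.
 */
static unsigned toy_readahead(struct toy_bufcache *cache,
                              unsigned start, unsigned count)
{
    unsigned nread = 0;
    for (unsigned block = start;
         block < start + count && block < TOY_BLOCKS; block++) {
        if (cache->present[block] || cache->in_page_cache[block])
            continue;
        cache->present[block] = true;
        cache->disk_reads++;
        nread++;
    }
    return nread;
}

/*
 * A file page needs this physical block. Returns true when the buffer
 * cache supplied it without disk IO. Either way the block ends up in the
 * page cache, so the hint bit is set and any buffer cache alias dropped.
 */
static bool toy_fill_page(struct toy_bufcache *cache, unsigned block)
{
    assert(block < TOY_BLOCKS);
    bool hit = cache->present[block];
    if (hit)
        cache->present[block] = false;
    else
        cache->disk_reads++;
    cache->in_page_cache[block] = true;
    return hit;
}
```

If the hint bit is stale in this model, the only cost is an extra read and a redundant alias, which matches the remark that the bit may be slightly out of sync.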
2008-11-15 21:19 this kind of readahead is cross-file and cross-directory, as opposed to current readahead which is only within a file 2008-11-15 21:19 metadata and date 2008-11-15 21:19 data 2008-11-15 21:19 including data that is not in the file currently being read 2008-11-15 21:20 I was thinking about this from the hollywood production studio perspective 2008-11-15 21:20 the current, traditional unix readahead approach forces synchronous waiting on file and directory opens 2008-11-15 21:20 but more importantly, from david's perspective 2008-11-15 21:20 the sync waits turn into waiting for disk rotations 2008-11-15 21:20 which really stretches things out 2008-11-15 21:21 probably can construct a two-orders-of-magnitude case pretty easily 2008-11-15 21:21 in fact 2008-11-15 21:21 so while the disk rotates, walk the buffer cache 2008-11-15 21:21 I should demonstrate this with a loopback mount 2008-11-15 21:21 makes a shitload of sense to me 2008-11-15 21:22 it allows the fs to pick up data track-wise, without knowing exactly to which files the blocks belong 2008-11-15 21:22 known as a track cache, but without the double caching of a track cache 2008-11-15 21:23 sounds like a proximity thing 2008-11-15 21:23 never heard of track cache 2008-11-15 21:23 never heard of SMARTDRV? 
2008-11-15 21:24 yeah 2008-11-15 21:24 the fs does have to make the assumption that the data it speculatively reads will be needed soon 2008-11-15 21:24 that is usually not very hard to deduce 2008-11-15 21:24 read block 1, then 2, it's reasonable to guess that 3 will also be read 2008-11-15 21:25 sounds like the research going on at IIT 2008-11-15 21:26 I don't know of anybody taking a run at this specific thing 2008-11-15 21:26 prefetch 2008-11-15 21:26 sometimes it's talked about and somebody says they should try writing a dm block device to do track caching and see how it performs, but nobody ever got around to it 2008-11-15 21:32 hmm, ok where was I ;-) 2008-11-15 21:32 heh, now I'm buggin flips 2008-11-15 21:32 how you doing MaZe? 2008-11-15 21:34 fine, you? 2008-11-15 21:34 busy with my twins 2008-11-15 21:34 now I think this is something we could build into tux3 2008-11-15 21:34 and it would be a killer performance hack 2008-11-15 21:34 just got in touch with someone from atheros to work on finding a fix for ath9k + 2.6.27 => corrupt disk 2008-11-15 21:34 that was about using additional state in the buffer cache to implement physical readahead 2008-11-15 21:34 corrupt disk??? 2008-11-15 21:34 turns out he's in Mountain View as well ;-) wicked. 2008-11-15 21:35 yeah. Macbook Pro 3,1 + more than 3GB ram (mapped above the 32-bit physical barrier) + 2.6.27 + ath9K -> runs out of SW-IOMMU buffer state, ext3 aborts journal on write/read dma errors and corrupts fs 2008-11-15 21:36 almost as bad as the notebook bricking escapade 2008-11-15 21:36 why are these things coming out of the woodwork in packs? 2008-11-15 21:37 32-bit physical barrier? 
2008-11-15 21:37 oh 2008-11-15 21:37 4GB ram maps at 0-3G and at 4-5G 2008-11-15 21:37 in the 48 bit IO space 2008-11-15 21:37 needs bounce buffers 2008-11-15 21:37 linux I presume 2008-11-15 21:37 yes 2008-11-15 21:37 only linux has bounce buffers 2008-11-15 21:38 I'm pretty sure the bug is in ath9k somewhere 2008-11-15 21:38 nobody else was crazy enough to try 2008-11-15 21:38 but the ext3 behaviour is terrible 2008-11-15 21:38 sw-iommu is bounce buffers right? 2008-11-15 21:38 it's a typical bug for monolithic kernel design 2008-11-15 21:38 I don't think that's bounce buffers 2008-11-15 21:38 bounce buffers are only for disk data 2008-11-15 21:39 since it apparently manages to write out some but not all of the dirty data and probably does so in a random order to random disk locations ;-) 2008-11-15 21:39 I think you're talking about 48 bit IO space 2008-11-15 21:39 hmm 2008-11-15 21:39 well, all my ram is below 5G 2008-11-15 21:39 I think the problem is DMA32 not being able to talk to 4-5G area 2008-11-15 21:39 I haven't looked at SW-IOMMU 2008-11-15 21:39 and thus needing to copy the data into bounce buffers below the 4G mark 2008-11-15 21:40 it's possible it's implemented as ram that could collide with high-mapped cache pages 2008-11-15 21:40 ah, aborts on dma errors 2008-11-15 21:40 that is not data corruption 2008-11-15 21:40 that is IO space collision 2008-11-15 21:40 well, the disk doesn't mount afterwards ;-) 2008-11-15 21:41 and once I ended up with half the drive in lost+found 2008-11-15 21:41 smashing the dma registers would quickly corrupt data+metadata 2008-11-15 21:41 no I don't think this is corrupt dma registers 2008-11-15 21:41 how else would you get read/write dma errors? 2008-11-15 21:42 DMA: Out of SW-IOMMU space for 4224 bytes at device 0000:0b:00.0 2008-11-15 21:42 and henceforth read/writes fail 2008-11-15 21:42 sometimes 2008-11-15 21:43 which means ext3 writes to disk sometimes fail. 
2008-11-15 21:43 http://www.google.com/url?sa=U&start=1&q=http://forums.novell.com/novell-product-support-forums/suse-linux-enterprise-server-sles/sles-virtualization/310206-sles10sp1-xen-out-sw-iommu-kernel-problems.html&ei=6bIfSeibJqCSsQOw_6jMCA&usg=AFQjCNHJyPOJH7bKpEmGxszfoIWOiZvYZQ 2008-11-15 21:43 but apparently it's too stupid to remount r/o early enough to fail to write junk to disk 2008-11-15 21:43 maze, when did google stop putting direct urls on search pages? 2008-11-15 21:43 it's highly annoying/evil 2008-11-15 21:44 right up there with running ads on the search page 2008-11-15 21:44 I wonder how you got that 2008-11-15 21:44 it's what you always get for a good search now 2008-11-15 21:44 google search 2008-11-15 21:44 ACTION tests taht 2008-11-15 21:44 I believe that should happen only through javascript 2008-11-15 21:45 and cuting and pasting the link should not be affected 2008-11-15 21:45 ok, it doesn't happen on my other machine 2008-11-15 21:45 hmm 2008-11-15 21:45 must be some experiment 2008-11-15 21:46 well where do I send my flame 2008-11-15 21:47 somebody wants to experiment with "let's see how mad the users get when we annoy them" 2008-11-15 21:47 hmm 2008-11-15 21:47 actually it might be noscript 2008-11-15 21:48 do you have noscript enabled? 2008-11-15 21:49 ACTION deletes 13 google cookies 2008-11-15 21:49 noscript? 2008-11-15 21:49 plugin? 2008-11-15 21:49 no 2008-11-15 21:49 it's a firefox extension 2008-11-15 21:50 let's see what plugins I have 2008-11-15 21:50 removing the cookies got rid of it 2008-11-15 21:50 if I see it again I will remove the cookies one at a time 2008-11-15 21:51 somebody should be busted back to test monkey ;) 2008-11-15 21:51 stay away from code checkins 2008-11-15 22:47 -!- pranith(~bobby@122.162.73.97) has joined #tux3 2008-11-15 22:56 hi 2008-11-15 22:56 hi 2008-11-15 22:57 git repo seems to be broken 2008-11-15 22:57 how? 2008-11-15 22:57 can't clone? 
2008-11-15 22:57 info/refs doesn't point last commit 2008-11-15 22:57 I'm not running any git daemon, does git have anything like hg's static-http? 2008-11-15 22:57 oh 2008-11-15 22:58 but you can clone? 2008-11-15 22:58 yes 2008-11-15 22:58 but "git clone" is not work 2008-11-15 22:58 git log 2008-11-15 22:58 commit 1601d51c063ea1235d526ec304e1705d9096f645 2008-11-15 22:58 Author: Daniel Phillips 2008-11-15 22:58 Date: Tue Sep 30 11:31:57 2008 -0700 2008-11-15 22:58 Add mm.h header needed from some architectures 2008-11-15 22:58 it clones until info/refs point 2008-11-15 22:58 let me try it locally 2008-11-15 22:59 I cloned it by lftp 2008-11-15 22:59 git clone /src/2.6.26.5.tux3 2008-11-15 22:59 Initialized empty Git repository in /tmp/2.6.26.5.tux3/.git/ 2008-11-15 22:59 2008-11-15 22:59 now it's spinning 2008-11-15 22:59 understandable, it's a full kernel tree 2008-11-15 23:00 it's still just stting 2008-11-15 23:00 and not showing in top 2008-11-15 23:01 "git update-server-info" will be needed 2008-11-15 23:01 I am running a git-daemon 2008-11-15 23:01 let me try 2008-11-15 23:01 git's acting badly 2008-11-15 23:01 sitting and silently failing is not good 2008-11-15 23:02 probaby git version is old? 2008-11-15 23:02 git --version 2008-11-15 23:02 git version 1.5.4.2 2008-11-15 23:02 not old 2008-11-15 23:02 it didn't "git push"? 2008-11-15 23:03 I didn't try that 2008-11-15 23:03 well, "chmod +x .git/hooks/post-update" may solve it 2008-11-15 23:05 no change 2008-11-15 23:05 in general, I find that git wastes much more time than mercurial 2008-11-15 23:06 maybe this is a test of my "hacker qualifications" 2008-11-15 23:06 now... where is the conf file for git-daemon 2008-11-15 23:07 I don't know too 2008-11-15 23:08 documentation for git-daemon looks completely lacking 2008-11-15 23:09 is git clone supposed to be able to work without a git-daemon? 
2008-11-15 23:09 it was a long time ago I configured it 2008-11-15 23:10 yes 2008-11-15 23:10 http should work 2008-11-15 23:10 it slow though 2008-11-15 23:11 clone works on zumastor in that same directory 2008-11-15 23:11 trying ddtree 2008-11-15 23:11 info/refs in tux3fs is just wrong 2008-11-15 23:12 any such think as a git repair? 2008-11-15 23:12 ddtree clones fine 2008-11-15 23:12 cd tux3fs && GIT_DIR=. git update-server-info 2008-11-15 23:12 I'm assuming tux3fs is actually tux3fs.git in git manner 2008-11-15 23:13 IOW, it's not tux3fs/.git 2008-11-15 23:13 tux3fs is a symlink to the git repo 2008-11-15 23:14 it symlinks to the .git dir 2008-11-15 23:14 ok 2008-11-15 23:14 but I can't clone from the place it symlinks to either 2008-11-15 23:14 just sits stupidly and does nothing 2008-11-15 23:14 I'll strace 2008-11-15 23:14 "GIT_DIR=. git update-server-info" tried? 2008-11-15 23:15 is tried? 2008-11-15 23:15 it doesn't like the . 2008-11-15 23:15 but the second part runs without complaining 2008-11-15 23:15 GIT_DIR=/somewhere/tux3fs git update-server-info? 2008-11-15 23:16 oh now it's working 2008-11-15 23:16 ok, I saw info/refs is right from http 2008-11-15 23:16 git clone /src/2.6.26.5.tux3 2008-11-15 23:16 Initialized empty Git repository in /src/tmp/2.6.26.5.tux3/.git/ 2008-11-15 23:16 0 blocks 2008-11-15 23:16 Checking out files: 100% (24276/24276), done. 2008-11-15 23:16 all right, it was update-server-info 2008-11-15 23:17 whatever that is 2008-11-15 23:17 it's update info/refs 2008-11-15 23:18 it updates info/refs 2008-11-15 23:18 iirc 2008-11-15 23:18 you'd think git clone could figure that out itself 2008-11-15 23:19 I think clone does it 2008-11-15 23:19 what did refs used to have in it? 
2008-11-15 23:19 I don't have that any more 2008-11-15 23:19 but maybe http clone doesn't 2008-11-15 23:19 I'm not sure 2008-11-15 23:20 I don't see why http clone can't figure this out 2008-11-15 23:20 maybe for remote clone 2008-11-15 23:20 anyway 2008-11-15 23:20 it works 2008-11-15 23:20 yes 2008-11-15 23:20 and I'm more annoyed with git 2008-11-15 23:20 well 2008-11-15 23:20 last check in was a couple months ago 2008-11-15 23:21 maybe "git push" is needed 2008-11-15 23:21 yes 2008-11-15 23:21 now I will check in tux3.h, it will take me about an hour to fix it up 2008-11-15 23:21 what happen? 2008-11-15 23:21 I need to resolve the differences between the user space and the kernel version 2008-11-15 23:21 maybe just a few minutes 2008-11-15 23:22 wait a minute 2008-11-15 23:22 I have some patches 2008-11-15 23:22 I also made one change 2008-11-15 23:22 let me give it to you 2008-11-15 23:22 I also put those to somewhere 2008-11-15 23:22 do you make user/kernel/tux3.h ? 2008-11-15 23:23 I don't know user/kernel/tux3.h 2008-11-15 23:23 ok, so it is just patches to fs/tux3/tux3.h 2008-11-15 23:23 how about posting to mailing list? 2008-11-15 23:24 but those are still not compilable 2008-11-15 23:24 mine compiles 2008-11-15 23:24 it's pretty simple 2008-11-15 23:25 my patches copy all from userspace, then is trying to compile those 2008-11-15 23:25 ah 2008-11-15 23:25 there are some big differences 2008-11-15 23:25 yes 2008-11-15 23:25 patch is big 2008-11-15 23:25 but simple 2008-11-15 23:26 so what I plan is to split it into two files, user/tux3.h, which includes user/kernel/tux3.h 2008-11-15 23:26 and user/kernel/tux3.h is identical to the git repo 2008-11-15 23:26 and has #ifdef __KERNEL for the few things that kernel has to have and userspace doesn't 2008-11-15 23:26 what is "user/kernel" mean? 2008-11-15 23:27 is it kernel tree? 2008-11-15 23:27 it's a new directory I'm going to make in the mercurial repo 2008-11-15 23:27 for kernel? 
2008-11-15 23:28 for files that are identical to the git tux3 repo 2008-11-15 23:28 then we will try to make as many as possible of the files identical 2008-11-15 23:29 yes 2008-11-15 23:29 ok, best if for me just to do it 2008-11-15 23:29 but why isn't it linux/fs/tux3/* 2008-11-15 23:29 then I'll push both repos, and you can make other changes you have 2008-11-15 23:29 you mean, put the user space stuff as a subdir in the kernel? 2008-11-15 23:29 yes 2008-11-15 23:29 oh 2008-11-15 23:30 for one thing we'd have to switch from hg to git 2008-11-15 23:30 ah. no 2008-11-15 23:30 ok, why not just copy all files from kernel? 2008-11-15 23:30 we could 2008-11-15 23:30 it's the easiest thing to do I guess 2008-11-15 23:31 well, let's go with it, and see 2008-11-15 23:31 but the git repo will be "upstream" 2008-11-15 23:31 yes 2008-11-15 23:31 I'll just do something, they we'll fix it up ;) 2008-11-15 23:31 s/they/then/ 2008-11-15 23:32 ok :) 2008-11-16 00:03 ok, tux3.h diff reduced to zero, now to make it compile in both places 2008-11-16 00:04 ok 2008-11-16 00:04 tuxtimeval may have problem 2008-11-16 00:05 it's unsigned, but timespec is long 2008-11-16 00:05 ah 2008-11-16 00:05 well 2008-11-16 00:05 and 1000000000 doesn't fit to 32bit 2008-11-16 00:05 already changed that 2008-11-16 00:06 ok 2008-11-16 00:10 compiles in kernel 2008-11-16 00:11 good 2008-11-16 00:12 now I have to reduce the diff to zero again... 2008-11-16 00:13 sounds like bad things... 
2008-11-16 00:14 no, just normal 2008-11-16 00:14 I fix up minor things in the kernel, then need to move the changes to the user file 2008-11-16 00:14 in fact just applying the diff will do the trick 2008-11-16 00:15 there 2008-11-16 00:16 now to make it compile in user space 2008-11-16 00:16 if we can share it on both of userspace and kernel, at least prototype phase, it's good 2008-11-16 00:16 yes, it would get chaotic otherwise 2008-11-16 00:17 diff and patch sounds like not good 2008-11-16 00:19 a last resort 2008-11-16 00:19 I'm using diff to make the files the same now 2008-11-16 00:21 http://userweb.kernel.org/~hirofumi/patchset.tar.gz 2008-11-16 00:21 my crasy patches is 2008-11-16 00:29 to make sure, those don't need to merge at least for now 2008-11-16 00:37 user almost compiles 2008-11-16 00:37 then I'll have to reduce the diff again 2008-11-16 00:37 those are already pushed? 2008-11-16 00:38 to tux3.org 2008-11-16 00:39 not yes 2008-11-16 00:39 not yet 2008-11-16 00:40 which flavor of gettimeofday handles nanosecs? 2008-11-16 00:40 not sure 2008-11-16 00:40 get_time() or something? 2008-11-16 00:45 clock_gettime(clockid_t clk_id, struct timespec *tp); 2008-11-16 00:45 now, does that get it from hardware? 2008-11-16 00:45 not that we really care where it gets it from in user space 2008-11-16 00:46 NOTE 2008-11-16 00:46 Most systems require the program be linked with the librt library to use these functions. 2008-11-16 00:46 bleah 2008-11-16 00:47 ok, we will use the usec version in user space and nsec in kernel 2008-11-16 00:47 stupid libc 2008-11-16 00:47 ah 2008-11-16 00:47 but why doesn't use librt? 
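The choice being weighed here, `clock_gettime` with nanosecond resolution versus a microsecond fallback that avoids the extra library, could be sketched in userspace C as below. The `toy_*` names and the `TOY_USE_CLOCK_GETTIME` switch are hypothetical and this is not the actual `tuxtime` code; on glibc older than 2.17 the `clock_gettime` branch needs `-lrt` at link time, which is the dependency discussed in the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

/* Convert a microsecond-resolution timeval to nanoseconds since the epoch. */
static int64_t toy_timeval_to_nsec(const struct timeval *tv)
{
    assert(tv != NULL);
    return (int64_t)tv->tv_sec * 1000000000 + (int64_t)tv->tv_usec * 1000;
}

/*
 * Current time in nanoseconds since the epoch, or -1 on error.
 * Define the hypothetical TOY_USE_CLOCK_GETTIME to get true nanosecond
 * resolution; otherwise the value is microseconds scaled up by 1000.
 */
static int64_t toy_time_nsec(void)
{
#ifdef TOY_USE_CLOCK_GETTIME
    struct timespec ts;
    if (clock_gettime(CLOCK_REALTIME, &ts))
        return -1;
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
#else
    struct timeval tv;
    if (gettimeofday(&tv, NULL))
        return -1;
    return toy_timeval_to_nsec(&tv);
#endif
}
```

Keeping the result in a 64-bit nanosecond count means callers see the same format whichever clock source is compiled in.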
2008-11-16 00:47 librt is part of libc 2008-11-16 00:48 librt is part of libc package 2008-11-16 00:48 it's another dependency 2008-11-16 00:48 have to specify the lib 2008-11-16 00:48 it's not worth it 2008-11-16 00:48 ok 2008-11-16 00:49 maybe "most systems" does not include linux 2008-11-16 00:49 let me try it 2008-11-16 00:49 oh, clock_id 2008-11-16 00:49 forget that 2008-11-16 00:49 CLOCK_REALTIME? 2008-11-16 00:50 ok 2008-11-16 00:53 inode.o: In function `tuxtime': 2008-11-16 00:53 kernel/tux3.h:271: undefined reference to `clock_gettime' 2008-11-16 00:53 collect2: ld returned 1 exit status 2008-11-16 00:53 so "most systems" includes linux 2008-11-16 00:53 why does posix have to be such random garbage 2008-11-16 00:53 is it posix? 2008-11-16 00:54 I suppose libc could have been nice and included it 2008-11-16 00:54 thereby making it easier to move to nanosecond precision 2008-11-16 00:54 but these interfaces came from posix committees 2008-11-16 00:55 very little thinking ahead 2008-11-16 00:56 ok, it's not worth it, userspace will have usecs 2008-11-16 00:59 now userspace compiles 2008-11-16 01:00 now I have to reduce the diff again 2008-11-16 01:00 fortunately it gets smaller each time 2008-11-16 01:04 740 line diff for tux3/user 2008-11-16 01:05 and 363 line for kernel 2008-11-16 01:05 hard work 2008-11-16 01:05 and it's ugly 2008-11-16 01:05 but it's a start 2008-11-16 01:05 we'll make it nicer a little bit at a time 2008-11-16 01:05 can I see what is doing? 2008-11-16 01:06 ok, I'll post both patches before committing 2008-11-16 01:06 ok? 2008-11-16 01:06 post or commit is fine for me 2008-11-16 01:06 ok, committing is easier 2008-11-16 01:06 either is fine 2008-11-16 01:08 I do +#KBUILD_CFLAGS += $(call cc-option,-Wdeclaration-after-statement,) 2008-11-16 01:08 in the global makefile 2008-11-16 01:08 I suppose I should make that a separate commit 2008-11-16 01:08 EXTRA_CFLAGS is work for it 2008-11-16 01:09 in tux3/Makefile? 
2008-11-16 01:09 obj-$(CONFIG_TUX3) += tux3.o 2008-11-16 01:09 tux3-objs += balloc.o btree.o dir.o dleaf.o filemap.o hexdump.o iattr.o \ 2008-11-16 01:09 ileaf.o inode.o super.o xattr.o 2008-11-16 01:09 EXTRA_CFLAGS += -std=gnu99 -Wno-declaration-after-statement 2008-11-16 01:09 my kernel Makefile in fs/tux3/ 2008-11-16 01:09 better 2008-11-16 01:09 ok, I will put in the EXTRA 2008-11-16 01:09 but leave the others for your patch 2008-11-16 01:10 just do something you need 2008-11-16 01:10 If conflict, I'll fix my patches 2008-11-16 01:11 for it, i'm using quilt like scripts 2008-11-16 01:11 5 minutes to go, about 2008-11-16 01:11 yes, I should too 2008-11-16 01:11 scripts like quilt 2008-11-16 01:11 I noticed you can factor your patches more easily than I can 2008-11-16 01:11 and you put in nice small patches, unlike me 2008-11-16 01:12 have to wait for a whole kernel build because I changed the makefile 2008-11-16 01:12 pretty fast for uml though 2008-11-16 01:12 without modules 2008-11-16 01:12 my habit for linux 2008-11-16 01:13 good, I have a root filesystem all set up 2008-11-16 01:13 but it is 32 bit 2008-11-16 01:13 might cause problems for you on 64 bit host 2008-11-16 01:13 but you know how to set up your own rootfs 2008-11-16 01:14 yes 2008-11-16 01:14 recently, I'm using kvm for work like it 2008-11-16 01:14 for a long time, the biggest change you had to make was using devfs, jdike liked devfs 2008-11-16 01:14 and made it the default 2008-11-16 01:14 using anything non-default in uml used to be very painful 2008-11-16 01:14 now it is better 2008-11-16 01:15 ok, kernel builds, user builds, diff is zero...
2008-11-16 01:15 time to commit before it breaks again :) 2008-11-16 01:16 sounds good to me :) 2008-11-16 01:18 git commit done 2008-11-16 01:18 thanks 2008-11-16 01:19 hg commit done 2008-11-16 01:20 it's not really pretty, I used #ifdef __KERNEL__ in 4-5 places, sometimes with #else 2008-11-16 01:20 I think you are forgetting trace.h for kernel 2008-11-16 01:20 hmm 2008-11-16 01:20 forgetting to commit it? 2008-11-16 01:21 yes 2008-11-16 01:21 right 2008-11-16 01:21 have to tell git about it 2008-11-16 01:22 yes 2008-11-16 01:22 there 2008-11-16 01:22 every time I type git commit -a I think bad thoughts about somebody ;) 2008-11-16 01:23 "flawed masterpiece" 2008-11-16 01:24 I don't notice it, because I use patch from scripts to commit to git 2008-11-16 01:25 well it sure beats svn and really really beats cvs 2008-11-16 01:25 sure 2008-11-16 01:26 I just think hg and git useful 2008-11-16 01:26 both of 2008-11-16 01:26 ok, I'll look for your patches for the other files 2008-11-16 01:26 I was going to implement bdata(buffer) and bcount(buffer) in tux3/user 2008-11-16 01:26 thanks, those are crazy patches 2008-11-16 01:27 bufdata is ready 2008-11-16 01:27 ok 2008-11-16 01:27 bcount is using for bnode 2008-11-16 01:27 oops 2008-11-16 01:27 has to be bufcount then 2008-11-16 01:27 that's where C really misses not having classes 2008-11-16 01:28 I think b_count is not needed 2008-11-16 01:28 it should be needed in userspace only 2008-11-16 01:28 buffer->count 2008-11-16 01:28 maybe 2008-11-16 01:28 I think that's right 2008-11-16 01:28 it was used for debug 2008-11-16 01:28 ok 2008-11-16 01:29 well it's important to debug that in kernel too 2008-11-16 01:29 extra buffer count is a common error 2008-11-16 01:29 I think it's job for brelse 2008-11-16 01:29 bufcount isn't always zero in brelse 2008-11-16 01:30 but -1 is wrong 2008-11-16 01:30 yes, that should be an assert 2008-11-16 01:30 for kernel, 2008-11-16 01:30 the most common error is, buffer->count is more than 
zero after a commit 2008-11-16 01:31 +#define ATOMIC_UNDERFLOW_CHECK(v) do { \ 2008-11-16 01:31 i see 2008-11-16 01:31 we should keep the dumping code even in kernel 2008-11-16 01:32 it can dump into the kprint buffer, later we can convert it to dump over a pipe 2008-11-16 01:32 it's hexdump()? 2008-11-16 01:32 lots of dumping code 2008-11-16 01:32 like show_buffers 2008-11-16 01:32 ah, ok 2008-11-16 01:33 I'm impressed you got all those files to compile 2008-11-16 01:33 things like ->map instead of ->mapping 2008-11-16 01:33 and fields being very different 2008-11-16 01:33 in inode 2008-11-16 01:33 and mapping 2008-11-16 01:33 and inode and sb 2008-11-16 01:33 yes 2008-11-16 01:33 huge differences 2008-11-16 01:34 I had planned to reduce the differences a little before moving into kernel 2008-11-16 01:34 but it's ok doing it your way 2008-11-16 01:34 and ->dev 2008-11-16 01:34 nothing like kernel device 2008-11-16 01:34 yes 2008-11-16 01:35 I think plan is not big difference, my plan is ust reverce 2008-11-16 01:35 s/ust/just/ 2008-11-16 01:35 any plan is a good plan ;) 2008-11-16 01:35 good 2008-11-16 01:48 ACTION starts playing catch up 2008-11-16 01:49 you guys have been busy 2008-11-16 01:54 kernel port started today 2008-11-16 01:54 yeah, exciting 2008-11-16 01:54 just read your post on that 2008-11-16 01:55 i see you've got tux3.h updated and checked in to the git repo as well 2008-11-16 01:55 yes 2008-11-16 01:56 and hirofumi had another 5-6 files compiling in kernel, now he's merging with the tux3.h changes 2008-11-16 01:56 on sunday maybe we will get as far as reading 2008-11-16 01:57 it's possible 2008-11-16 01:57 hopefully there won't be any rolling blackouts in santa monica :) 2008-11-16 02:01 as long as it's less than 17 minutes I'm ok 2008-11-16 02:09 unfortunately, my patches is far from compilable 2008-11-16 02:11 well, just one file at a time then 2008-11-16 02:12 btw, info/refs of git seems to be not updated 2008-11-16 02:12 hmm 2008-11-16 02:12 
ok 2008-11-16 02:12 is it ok now? 2008-11-16 02:13 looks like good 2008-11-16 02:13 I should do the same thing I do with hg 2008-11-16 02:13 have a working repo and a public repo 2008-11-16 02:13 "chmod +x .git/hooks/post-update" may do automatically 2008-11-16 02:13 then the git pull should update the refs 2008-11-16 02:14 it's already +x 2008-11-16 02:14 oops 2008-11-16 02:15 anyway, it's good to have a private/public setup 2008-11-16 02:15 lets me make mistakes and not expose them 2008-11-16 02:15 yes 2008-11-16 02:16 anybody is not interesting temporary state 2008-11-16 02:17 which files have you not tried to compile? 2008-11-16 02:17 I could start working on something 2008-11-16 02:17 ok, it seems "git push" do update-server-info 2008-11-16 02:17 almost all files 2008-11-16 02:17 what about dleaf.c? 2008-11-16 02:18 iattr.c? 2008-11-16 02:18 I'm doing logical changes 2008-11-16 02:18 dleaf.o 2008-11-16 02:18 ileaf.o 2008-11-16 02:18 hexdump.o 2008-11-16 02:18 inode.o 2008-11-16 02:18 super.o 2008-11-16 02:18 can compile those 2008-11-16 02:19 ok, I can work on btree.c 2008-11-16 02:20 have idea what to do for bread? 2008-11-16 02:20 sb_bread 2008-11-16 02:20 ok 2008-11-16 02:20 we talked about other things, but that is easiest 2008-11-16 02:20 for filemap? 2008-11-16 02:20 um 2008-11-16 02:21 no, filemap should use syncio 2008-11-16 02:21 ok 2008-11-16 02:21 I'll think about that more carefully 2008-11-16 02:21 the way kernel does things is a little different than how I did it in user space 2008-11-16 02:21 bread calls a per-mapping method in user space 2008-11-16 02:22 yes 2008-11-16 02:22 in kernel...
hmm 2008-11-16 02:22 it actually tries to submit IO 2008-11-16 02:22 maybe it is big stuff 2008-11-16 02:22 yes, bread->sb_bread is not quite right 2008-11-16 02:23 yes, at least for filemap.c 2008-11-16 02:23 we need tux3_bread that calls a method, just like user space 2008-11-16 02:23 maybe, yes 2008-11-16 02:23 and more little thing is inode and sb 2008-11-16 02:24 yes, all the field names are different 2008-11-16 02:24 and in kernel, maybe we have tux_inode_info or something 2008-11-16 02:24 yes, TUX3I(inode) 2008-11-16 02:25 so, the issue is, we have to sync to userspace or not 2008-11-16 02:25 struct tux_inode 2008-11-16 02:25 we don't 2008-11-16 02:25 ok 2008-11-16 02:25 we can try to group the parts that didn't change much later 2008-11-16 02:25 I thought though that syncing tux3.h would help keep it from getting too far apart 2008-11-16 02:26 i see 2008-11-16 02:26 probably my patches helps to think it 2008-11-16 02:26 we don't have to rename bread, because it's not used in kernel any more 2008-11-16 02:27 let's just keep calling in bread instead of renaming as tux3_bread 2008-11-16 02:27 calling it bread I meant 2008-11-16 02:28 umm... I like tux3_bread or tux_bread in both of kernel and user 2008-11-16 02:28 well, I don't care much though 2008-11-16 02:28 how about blockread ? 2008-11-16 02:28 sounds good 2008-11-16 02:28 bread is confusible for me 2008-11-16 02:28 yes 2008-11-16 02:29 and getblk needs to be called something 2008-11-16 02:29 yes 2008-11-16 02:29 blockbuf? 2008-11-16 02:30 alloc_buffer? 2008-11-16 02:30 it does more than alloc 2008-11-16 02:30 returns from hash if it isn't there 2008-11-16 02:30 sorry 2008-11-16 02:30 if it is there 2008-11-16 02:30 blockget 2008-11-16 02:31 consistent with iget 2008-11-16 02:31 ah 2008-11-16 02:31 and things like that 2008-11-16 02:31 well, iget will also read 2008-11-16 02:31 sounds good, maybe 2008-11-16 02:31 blockget then? 
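The getblk/bread behavior described above (look the block up in a hash and allocate only if it is missing, then read through a per-mapping method) can be sketched roughly as follows. This is a self-contained user-space illustration, not the tux3 code: the struct layouts, the hash size, demo_blockread and the field names are all assumptions. Only the names blockget, blockread and brelse come from the discussion.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_BLOCK_SIZE 4096	/* assumed block size */
#define SKETCH_BUCKETS 64	/* assumed hash size */

struct map;

struct buffer {
	struct map *map;
	unsigned long block;	/* block number within the mapping */
	int count;		/* reference count */
	int uptodate;		/* set once data has been read in */
	struct buffer *next;	/* hash chain */
	unsigned char data[SKETCH_BLOCK_SIZE];
};

struct map {
	/* Per-mapping read method: each kind of mapping fills buffers its own way. */
	int (*blockread)(struct buffer *buffer);
	struct buffer *hash[SKETCH_BUCKETS];
};

/* Return the cached buffer if it is hashed, otherwise allocate and hash one. */
static struct buffer *blockget(struct map *map, unsigned long block)
{
	struct buffer **bucket = &map->hash[block % SKETCH_BUCKETS];
	for (struct buffer *b = *bucket; b; b = b->next)
		if (b->block == block) {
			b->count++;
			return b;
		}
	struct buffer *b = calloc(1, sizeof *b);
	if (!b)
		return NULL;
	b->map = map;
	b->block = block;
	b->count = 1;
	b->next = *bucket;
	*bucket = b;
	return b;
}

/* Drop a reference; going below zero is a bug, so assert on it. */
static void brelse(struct buffer *buffer)
{
	assert(buffer->count > 0);
	buffer->count--;
}

/* blockget, then fill through the mapping's method unless already up to date. */
static struct buffer *blockread(struct map *map, unsigned long block)
{
	struct buffer *b = blockget(map, block);
	if (b && !b->uptodate) {
		if (map->blockread(b)) {
			brelse(b);
			return NULL;
		}
		b->uptodate = 1;
	}
	return b;
}

/* Hypothetical read method for the demo: fill the block with its own number. */
static int demo_blockread(struct buffer *buffer)
{
	memset(buffer->data, (int)(buffer->block & 0xff), sizeof buffer->data);
	return 0;
}
```

Keeping the read behind a function pointer in the mapping is what lets one mapping read from a device and another from a file, which is the property the user-space code relied on for testing.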
2008-11-16 02:32 it's good enough for now 2008-11-16 02:32 getblk and blockget? 2008-11-16 02:32 yes 2008-11-16 02:32 getblk -> blockget 2008-11-16 02:32 sounds good 2008-11-16 02:32 getblk for device buffer 2008-11-16 02:32 bread -> blockread 2008-11-16 02:32 blockget for filemap 2008-11-16 02:32 yes 2008-11-16 02:32 ? 2008-11-16 02:32 sb_getblk 2008-11-16 02:32 ah, yes 2008-11-16 02:32 sounds good 2008-11-16 02:34 I will make those changes in user space now 2008-11-16 02:34 ok 2008-11-16 02:43 committed 2008-11-16 02:43 fast 2008-11-16 02:43 would have been faster with an editor that can make multifile changes 2008-11-16 02:44 haven't had that since nedit 2008-11-16 02:44 anyway, it was mindless ;) 2008-11-16 02:45 ah, it's still using ->map 2008-11-16 02:45 now, I'm thinking about what the advantage is of having blockread and friends having that per-mapping method 2008-11-16 02:45 yes 2008-11-16 02:46 it needs to take buffer 2008-11-16 02:46 ah, yes 2008-11-16 02:46 I wonder why I did ->map there 2008-11-16 02:46 oh, it does take buffer 2008-11-16 02:46 which one takes ->map ? 2008-11-16 02:47 blockread()? 2008-11-16 02:47 I wonder if we change that to ->mapping will that help reduce the diff with kernel? 2008-11-16 02:47 i_mapping actually 2008-11-16 02:48 gross 2008-11-16 02:48 well 2008-11-16 02:48 how about mapping(inode) ? 2008-11-16 02:48 or just take inode? 2008-11-16 02:48 that might work 2008-11-16 02:49 which functions/files? 2008-11-16 02:49 for tux3 inode and mapping are always one to one 2008-11-16 02:49 and in kernel they point at each other 2008-11-16 02:49 if it's blockdev 2008-11-16 02:50 ah, so we need mapping :) 2008-11-16 02:50 stupid, the field is called "host" in struct address_space 2008-11-16 02:51 yes 2008-11-16 02:51 which is always the inode... so... 
2008-11-16 02:51 it actually used to be void * 2008-11-16 02:51 I think I was the one who pointed out it's always inode 2008-11-16 02:52 wow, that was a long time ago 2008-11-16 02:52 yes, someone have big picture 2008-11-16 02:52 may have 2008-11-16 02:53 grep -I -- "->map" * | wc 2008-11-16 02:53 92 461 5273 2008-11-16 02:53 let's think about this :) 2008-11-16 02:54 btw, blockread(sb->devmap,) should be sb_bread? 2008-11-16 02:54 personally I don't care to change it 2008-11-16 02:54 sure, just leave it 2008-11-16 02:55 I don't know if that abstraction is even useful in kernel 2008-11-16 02:55 it was very useful in userspace 2008-11-16 02:55 allowed me to test dirops using a full volume as a single file 2008-11-16 02:55 i see 2008-11-16 02:56 it might be part of a better solution for loopback in kernel 2008-11-16 02:56 buffer->dev->map? 2008-11-16 02:56 right, there are two ways of getting to the device map 2008-11-16 02:56 that's redundant 2008-11-16 02:56 kernel is worse ;) 2008-11-16 02:56 devs are cached all over the place 2008-11-16 02:57 iirc 2008-11-16 02:58 ok, anyway blockget should take buffer, not map 2008-11-16 02:58 let me see how hard that is to change in userspace 2008-11-16 02:58 ah 2008-11-16 02:59 I think it should take mapping 2008-11-16 02:59 reasoning? 2008-11-16 02:59 oh 2008-11-16 02:59 map is per inode? 
2008-11-16 02:59 yes 2008-11-16 02:59 it has to, duh 2008-11-16 02:59 what was I thinking 2008-11-16 03:00 well, but I think sb->devmap is not 2008-11-16 03:01 that should be in the tux3-specific part of the inode 2008-11-16 03:01 that's a big thing 2008-11-16 03:01 we need to break up the inode into two parts 2008-11-16 03:01 generic and fs-specific 2008-11-16 03:01 sorry 2008-11-16 03:01 blockread(sb->devmap) 2008-11-16 03:02 is just sb_bread, you're right 2008-11-16 03:02 ok 2008-11-16 03:02 let me see 2008-11-16 03:02 thanks 2008-11-16 03:02 int dev_blockread(struct buffer *buffer) <- that's the function in user space 2008-11-16 03:03 we can use the same name in kernel, it will just be a call to sb_bread 2008-11-16 03:03 ok 2008-11-16 03:04 sb->s_bdev <- I can make that change in userspace 2008-11-16 03:04 to get closer to kernel 2008-11-16 03:05 um... what is it point? 2008-11-16 03:05 what does it point ? 2008-11-16 03:05 let me grep 2008-11-16 03:05 in kernel, bdev 2008-11-16 03:05 but in userspace? 2008-11-16 03:05 struct block_device in kernel 2008-11-16 03:06 just what we want for the bio field 2008-11-16 03:07 is it using in kernel only? 2008-11-16 03:08 block_device? 2008-11-16 03:08 that would be like my struct dev 2008-11-16 03:08 ah 2008-11-16 03:08 so we change sb->dev to sb->s_bdev? 
2008-11-16 03:08 so sb->devmap can be changed to sb->s_bdev 2008-11-16 03:08 ah, i see 2008-11-16 03:09 I'll do that now 2008-11-16 03:17 committed 2008-11-16 03:17 now tux3.h isn't the same as kernel any more ;) 2008-11-16 03:18 ok 2008-11-16 03:19 splitting up the inode fields into generic and specific is a big job 2008-11-16 03:20 there isn't a real reason for that in user space, because there is only one filesystem 2008-11-16 03:20 so only one kind of inode 2008-11-16 03:20 yes 2008-11-16 03:20 but it would really help the port if we do it in userspace anyway I think 2008-11-16 03:21 yes, we have to chose diff or simple 2008-11-16 03:21 diff is ok 2008-11-16 03:21 but some diffs are way to big 2008-11-16 03:21 way too big 2008-11-16 03:21 yes 2008-11-16 03:22 ok, you know how it goes, right? the specific inode is a container_of the vfs inode 2008-11-16 03:22 I always choose diff should simple in my case 2008-11-16 03:22 yes 2008-11-16 03:22 I did 2008-11-16 03:22 my patches is including it 2008-11-16 03:22 ok 2008-11-16 03:22 I won't bother you with that 2008-11-16 03:23 in fact, best thing for me is to sleep 2008-11-16 03:23 and not bother you :) 2008-11-16 03:23 :) 2008-11-16 03:23 good night 2008-11-16 03:23 good night 2008-11-16 03:24 I'll make some patches for tommrow or later 2008-11-16 03:24 :) 2008-11-16 07:48 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-16 08:02 -!- pranith(~Bobby@122.162.73.84) has joined #tux3 2008-11-16 08:31 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-16 08:58 -!- samlh(~sam@67.129.121.145) has joined #tux3 2008-11-16 12:13 424 * Writing is not so simple. 
-- http://lxr.linux.no/linux+v2.6.27/fs/mpage.c#L424 2008-11-16 12:13 sounds like an akpm comment :-) 2008-11-16 12:13 mpage.c is another of those functions that knows about buffers 2008-11-16 12:13 those files, I mean 2008-11-16 12:14 and thus one we want to avoid completely if we implement our own page->private block handles 2008-11-16 12:15 it's looking to me like that is what we want to do 2008-11-16 12:16 the whole block IO library including mpage looks like a futile attempt to generalize what can't be generalized that way, it's really just a part of ext3 moved into a library 2008-11-16 12:16 there's a slim chance I could be wrong about that, and there is actually deep value encapsulated in those interfaces ;) 2008-11-16 12:17 but bypassing them is looking... somewhat liberating 2008-11-16 12:17 umm... mpage is part of readahead stuff 2008-11-16 12:18 url? 2008-11-16 12:19 notice... tux3/filemap.c is doing its own readahead 2008-11-16 12:19 wait a minute, I'll grep kernel 2008-11-16 12:20 http://lxr.linux.no/linux+v2.6.27.5/mm/readahead.c#L85 2008-11-16 12:22 hirofumi, did you sleep at all?
2008-11-16 12:23 no, I slept several hours 2008-11-16 12:23 me too 2008-11-16 12:23 seems I'm thinking about this too much to sleep :) 2008-11-16 12:24 ok, ->readpages does not necessarily point to mpage.c stuff 2008-11-16 12:24 :) I waked up at 5:00, so maybe I'll sleep a bit again 2008-11-16 12:24 just like ->readpage does not necessarily point to the block IO library 2008-11-16 12:24 thank goodness for that consistency 2008-11-16 12:25 generally, anything that calls page_buffers _can't_ be called from generic vm/vfs 2008-11-16 12:25 otherwise page->private (the buffers) would be a lie 2008-11-16 12:25 I do believe that core kernel is consistent about that, and if not, it's a bug 2008-11-16 12:26 i see 2008-11-16 12:27 maybe fsblock patch tells what change is needed 2008-11-16 12:27 we can start by just setting our ->readpages to null, then everything will go through ->readpage 2008-11-16 12:27 I need to read that 2008-11-16 12:27 yes 2008-11-16 12:27 me too 2008-11-16 12:27 I'm not reading deeply 2008-11-16 12:28 I am skeptical that anybody will succeed in cleaning up core kernel buffer handling before at least one filesystem has succeeded in doing the job cleanly 2008-11-16 12:28 I'll have a read through 2008-11-16 12:29 on tuesday we can look at try_to_free_buffers and fsblock 2008-11-16 12:29 I'm making notes today 2008-11-16 12:29 i see 2008-11-16 12:30 -!- ChanServ changed mode/#tux3 -> +o flips 2008-11-16 12:31 http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 pm Pacific Time ~ Next session: the ->try_to_free_buffers hairball 2008-11-16 12:32 -!- flips changed topic to "http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 pm Pacific Time ~ Next session: the ->try_to_free_buffers hairball" 2008-11-16 12:32 -!- flips changed topic to "http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 pm Pacific Time ~ Next session: the try_to_free_buffers hairball" 2008-11-16 12:32 -!-
pgquiles(~pgquiles@109.Red-79-153-83.staticIP.rima-tde.net) has joined #tux3 2008-11-16 12:33 if buffers are busy when somebody calls try_to_free_buffers, it just doesn't do anything 2008-11-16 12:33 so it's more or less advisory 2008-11-16 12:34 that is deeply scary 2008-11-16 12:34 on the other hand, if a filesystem knows that nobody other than itself is allowed to touch the buffers (they are ->private after all) then it can accurately handle this 2008-11-16 12:38 @@ -2685,9 +2687,13 @@ int try_to_release_page(struct page *pag 2008-11-16 12:38 if (PageWriteback(page)) 2008-11-16 12:38 return 0; 2008-11-16 12:38 2008-11-16 12:38 + BUG_ON(!(PagePrivate(page) ^ PageBlocks(page))); 2008-11-16 12:38 if (mapping && mapping->a_ops->releasepage) 2008-11-16 12:38 return mapping->a_ops->releasepage(page, gfp_mask); 2008-11-16 12:38 - return try_to_free_buffers(page); 2008-11-16 12:38 + if (PagePrivate(page)) 2008-11-16 12:38 + return try_to_free_buffers(page); 2008-11-16 12:38 + else 2008-11-16 12:38 + return try_to_free_blocks(page); 2008-11-16 12:38 } 2008-11-16 12:38 fsblock seems set own bit to page, and escapes try_to_free_buffers() 2008-11-16 12:40 it should just require the user to set its own ->releasepage, then it would not have to add that extra hook 2008-11-16 12:40 it can go BUG somewhere else if the fsblock user forgot to set the release method 2008-11-16 12:41 i see 2008-11-16 12:41 I guess I need to jump into that thread and basically say "Nick, implement this for one filesystem first, if you can't do that this effort can't possibly succeed" 2008-11-16 12:42 it's totally wrong to generalize and add to core kernel a mechanism that has not been proven once 2008-11-16 12:43 funny, I don't find recent fsblock posts in my lkml folder 2008-11-16 12:43 "I've been doing some work on fsblock again lately, so in case anybody might" 2008-11-16 12:44 keyword for it 2008-11-16 12:44 comment on fsblock patch 2008-11-16 12:44 
http://www.spinics.net/lists/linux-fsdevel/msg17327.html 2008-11-16 12:44 ah 2008-11-16 12:45 not posted to lkml 2008-11-16 12:45 should have been 2008-11-16 12:45 linux-fsdevel only 2008-11-16 12:47 neil brown was the only one brave enough to comment 2008-11-16 12:48 "I think you need to include a design document as a patch to 2008-11-16 12:48 Documentation/something." 2008-11-16 12:48 ah, yes 2008-11-16 12:49 I think we should just go ahead and implement our kernel port without using the block library at all, and show how that a) does not require a lot of code and b) is way easier to analyze 2008-11-16 12:49 I'm looking deeper into how we can use our own page->private state vector 2008-11-16 12:50 it's looking very doable 2008-11-16 12:50 maybe good performance is good catchy 2008-11-16 12:50 and avoiding global locks 2008-11-16 12:50 we know that our buffers can't be touched by any other filesystem 2008-11-16 12:51 so obviously, taking a global lock in a path that is private to our filesystem is a big waste 2008-11-16 12:51 global lock is lock_buffer? 
2008-11-16 12:51 let me give an example 2008-11-16 12:52 thanks 2008-11-16 12:52 3129 spin_lock(&mapping->private_lock); <- ok, this is actually a counterexample 2008-11-16 12:52 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L3129 2008-11-16 12:53 takes a per-mapping lock 2008-11-16 12:53 well, that is bad if the mapping is for a petabyte-sized file 2008-11-16 12:53 being accessed by 50 processes at the same time, which is a real use case 2008-11-16 12:54 even an 8 cpu machine will choke 2008-11-16 12:54 i see 2008-11-16 12:54 an Altix will stop dead 2008-11-16 12:55 ACTION is googling Altix 2008-11-16 12:55 i see 2008-11-16 12:55 512 cpu machines 2008-11-16 12:55 now up to 1024 2008-11-16 12:55 too many 2008-11-16 12:55 :) 2008-11-16 12:55 not enough for some folks 2008-11-16 12:55 you will have a machine like that under your desk in 5 years 2008-11-16 12:56 oh 2008-11-16 12:56 maybe 32-64 processors by that time 2008-11-16 12:56 all on one chip 2008-11-16 12:56 maybe i would like to split those to some part 2008-11-16 12:58 ->private_lock seems to be used for two purposes 2008-11-16 12:59 for mapping->assoc_mapping and page->private 2008-11-16 12:59 yes, protecting some page dirty operations is the second 2008-11-16 13:00 assoc_mapping is for ext2-style metadata handling 2008-11-16 13:00 we won't use it 2008-11-16 13:00 yes 2008-11-16 13:00 ->private_lock is used to synchronize against __set_page_dirty_buffers 2008-11-16 13:01 I have not analyzed that path yet 2008-11-16 13:01 gfs2 and reiserfs is calling it 2008-11-16 13:01 and __set_page_dirty 2008-11-16 13:02 that gets very "interesting" 2008-11-16 13:02 if fs has a_ops->set_page_dirty, it will not be used 2008-11-16 13:02 mm/migrate.c is a file I haven't read at all 2008-11-16 13:03 good 2008-11-16 13:03 it's nice that set_page_dirty can be hooked 2008-11-16 13:03 quick check though 2008-11-16 13:03 those seems finally call __set_page_dirty 2008-11-16 13:04 and that path has to handle the complex interaction
between page dirty state and buffer dirty state 2008-11-16 13:04 again, it will be much better if we can restrict that analysis to only our filesystem 2008-11-16 13:05 sound good probably 2008-11-16 13:06 dirty state also interacts with writeback state as we were discussing yesterday 2008-11-16 13:06 yes 2008-11-16 13:06 how did your build attempt go? 2008-11-16 13:07 ->map and sb is remaining 2008-11-16 13:08 sb is also divided into generic and specific 2008-11-16 13:08 as you know obviously 2008-11-16 13:08 and filemap.c needs some functions on other files 2008-11-16 13:08 yes 2008-11-16 13:08 sb may be easy 2008-11-16 13:09 now I'm trying to remove #include "*.c" in filemap.c 2008-11-16 13:09 how about: static inline struct address_space *imap(struct inode *inode) ? 2008-11-16 13:10 and put that wrapper in the userspace code, to reduce the diff with kernel? 2008-11-16 13:10 yes 2008-11-16 13:10 and typedef struct address_space imap_t 2008-11-16 13:11 or rename map to address_space 2008-11-16 13:11 so: static inline struct imap_t *imap(struct inode *inode) ? 2008-11-16 13:11 address_space is such a misleading name 2008-11-16 13:11 at least no "struct" :) 2008-11-16 13:11 inode->mapping yielding address_space is just awful code 2008-11-16 13:12 well, looks good to me 2008-11-16 13:12 maybe even: static inline struct map_t *map(struct inode *inode) ? 
2008-11-16 13:12 if nobody has polluted the kernel namespace by taking "map" 2008-11-16 13:12 let's see 2008-11-16 13:13 I think imap is good 2008-11-16 13:14 since others use like buf* or something 2008-11-16 13:14 only, imap is a well known mail protocol 2008-11-16 13:14 which should not matter, but I think that will bother people 2008-11-16 13:14 ok, somebody took map 2008-11-16 13:14 that's sickening ;) 2008-11-16 13:14 "info" is also gone 2008-11-16 13:14 those were the days when nobody thought about pollution 2008-11-16 13:15 namespace pollution 2008-11-16 13:15 oh, it was jdike 2008-11-16 13:15 in uml 2008-11-16 13:15 bad 2008-11-16 13:16 arch/um/os-Linux/skas/mem.c:192 2008-11-16 13:16 however, map_t is ok 2008-11-16 13:16 yes 2008-11-16 13:17 imapping? 2008-11-16 13:18 to avoid for imap 2008-11-16 13:18 Hmm 2008-11-16 13:19 typedef struct address_space map_t; 2008-11-16 13:19 static inline map_t *mapping(struct inode *inode) 2008-11-16 13:19 { 2008-11-16 13:19 return inode->i_mapping; 2008-11-16 13:19 } 2008-11-16 13:19 ok? 2008-11-16 13:20 it looks good to me at least for now 2008-11-16 13:20 this will be very easy to edit mindlessly if we want to change it 2008-11-16 13:21 somebody needs to write a lkml post about global namespace pollution 2008-11-16 13:21 "info" is the worst offender I know of, part of the 8087 support 2008-11-16 13:21 no way that symbol should have escaped its local file 2008-11-16 13:22 anyway, our generic names like "mapping" will be static as they should be 2008-11-16 13:22 struct info of math_emu.h? 2008-11-16 13:22 yes 2008-11-16 13:23 i see 2008-11-16 13:23 ok, I will change inode->map to mapping(inode) in tux3/user 2008-11-16 13:24 ok 2008-11-16 13:31 http://userweb.kernel.org/~hirofumi/patches.tar.gz 2008-11-16 13:31 is current patches 2008-11-16 13:31 shoud I try them or just look? 
2008-11-16 13:32 please don't merge yet 2008-11-16 13:33 http://userweb.kernel.org/~hirofumi/errors 2008-11-16 13:33 what is the difference between patchset/ and tux3/ ? 2008-11-16 13:33 and compile errors 2008-11-16 13:33 patchset is patch 2008-11-16 13:33 tux3 is applied files 2008-11-16 13:33 copy of fs/tux3 2008-11-16 13:35 and filemap.c accesses inode via buffer->map->inode 2008-11-16 13:35 maybe it's not good for kernel 2008-11-16 13:36 ok, one big source of errors is, user/tux3 struct sb is fs/tux3 struct super_block 2008-11-16 13:36 yes 2008-11-16 13:36 so what we should do is have struct sb be the tux3 private sb 2008-11-16 13:36 I think... 2008-11-16 13:37 I think it's same with inode 2008-11-16 13:38 a little different, user/tux3 struct inode is a lot like the vfs struct inode 2008-11-16 13:38 I think we rename "struct sb" to "struct super_block", then in kernel, use TUX_SB() 2008-11-16 13:38 struct sb also has blockbits stuff 2008-11-16 13:38 no need to should ;) 2008-11-16 13:38 shout 2008-11-16 13:38 it can be tux_sb() 2008-11-16 13:38 especially as it is not a macro 2008-11-16 13:39 and TUX_I() to tux_i()? 2008-11-16 13:39 tux_inode ? 2008-11-16 13:39 um... 2008-11-16 13:40 probably tux_inode or tux_i is good 2008-11-16 13:40 tnode ;) 2008-11-16 13:40 however, I'm too familiar current kernel 2008-11-16 13:40 just joking 2008-11-16 13:41 well, that usage came from me ;-) 2008-11-16 13:41 so I have the moral authority to improve it 2008-11-16 13:42 I'm chicken :) 2008-11-16 13:42 shouting is generally considered bad taste, it should be regarded as a comment like "this macro should be changed to an inline and renamed without caps" 2008-11-16 13:42 well, let's go with tux_inode or tux_i 2008-11-16 13:42 http://marc.info/?l=linux-kernel&m=100977701311304&w=2 2008-11-16 13:43 bit of history 2008-11-16 13:43 if we had used tux_imap then tux_i would have been a grep collision 2008-11-16 13:43 maybe tux_map and tux_inode to start with? 
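The "specific inode is a container_of the vfs inode" arrangement, with a lowercase inline accessor instead of a shouting macro, might look roughly like this. It is a user-space sketch under stated assumptions: struct inode here is a one-field stand-in for the real VFS inode, and struct tux_inode_info with its present field is invented for illustration. Only the accessor name tux_inode is taken from the naming discussion above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the generic VFS inode; the real one lives in <linux/fs.h>. */
struct inode {
	unsigned long i_ino;
};

/* Hypothetical filesystem-specific inode that embeds the generic one. */
struct tux_inode_info {
	unsigned long long present;	/* illustrative private field */
	struct inode vfs_inode;		/* generic part, handed to the VFS */
};

/* Same idea as the kernel's container_of(), written out for user space. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the specific inode from a pointer to its embedded generic inode. */
static inline struct tux_inode_info *tux_inode(struct inode *inode)
{
	return container_of(inode, struct tux_inode_info, vfs_inode);
}
```

The same pattern would apply to a tux_sb() accessor for the superblock, with the private structure reached from the generic one rather than embedded in it if that is how the port ends up storing it.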
2008-11-16 13:44 the exact name isn't important 2008-11-16 13:44 as long as we know immediately what it is, and can grep easily 2008-11-16 13:44 hmm 2008-11-16 13:44 ok 2008-11-16 13:44 no, not tux_map 2008-11-16 13:44 it's not tux-specific 2008-11-16 13:44 ok, but tux_inode 2008-11-16 13:45 tux_mapping? 2008-11-16 13:45 ah 2008-11-16 13:45 except it's not tux3 specific 2008-11-16 13:45 imapping? 2008-11-16 13:45 it's not like sb and inode 2008-11-16 13:45 static inline map_t *mapping(struct inode *inode) <- userspace version 2008-11-16 13:46 whoops 2008-11-16 13:46 static inline map_t *mapping(struct inode *inode) <- kernel version 2008-11-16 13:46 ok 2008-11-16 13:46 static inline map_t *mapping(struct inode *inode) <- userspace version 2008-11-16 13:46 looks good 2008-11-16 13:47 now, how do we get a map_t given a buffer in kernel? 2008-11-16 13:48 that is another big source of compile errors 2008-11-16 13:48 I think we should pass super_block or inode to those functions 2008-11-16 13:48 as well as the buffer? 2008-11-16 13:49 e.g. filemap_extent_io()? 2008-11-16 13:50 if we don't use buffer_head and instead use our handles, we do handle_head(handle)->page->mapping 2008-11-16 13:50 where handle_head finds the header of the set of handles using arithmetic maze and I talked about yesterday 2008-11-16 13:51 struct page *b_page; <- duh 2008-11-16 13:51 buffer->b_page, and we wrap that 2008-11-16 13:52 maze, maze, where maze? 2008-11-16 13:52 ;-) 2008-11-16 13:52 so, static inline map_t *bufmap(buffer_t *buffer) <- something like that 2008-11-16 13:52 I think we are going to do that page->private thing, but we will try to make it compile without first 2008-11-16 13:53 why don't we just pass inode? 2008-11-16 13:53 example file/line number? 2008-11-16 13:54 e.g. filemap_extent_io() 2008-11-16 13:54 in user/tux3 ?
2008-11-16 13:54 yes 2008-11-16 13:55 lxr is very unreliable 2008-11-16 13:55 but too useful to abandon ;) 2008-11-16 13:56 the grep taking 30 seconds right now is a very good demonstration of why tux3 is needed 2008-11-16 13:56 readahead in linux is very badly broken 2008-11-16 13:56 oh, filemap_extent_io is not in mainline 2008-11-16 13:57 I would not hold up that style as a good example to follow ;) 2008-11-16 13:57 it is bad to pass inode and buffer when buffer already gives you inode 2008-11-16 13:59 if it's blockdev file, buffer->page->mapping->host is bdev inode 2008-11-16 13:59 man you guys have been busy ;-) 2008-11-16 13:59 is that ok? 2008-11-16 14:00 buffer->page should be valid even if it is not a blockdev mapping 2008-11-16 14:00 so we can always write buffer->b_page->mapping->host 2008-11-16 14:00 much better than passing an extra parameter 2008-11-16 14:00 but we want to wrap that 2008-11-16 14:01 if we don't touch ->host as tux_inode 2008-11-16 14:01 ah, if it's /dev/block, filemap_extent_io() should be called? 2008-11-16 14:02 shouldn't be called 2008-11-16 14:02 filemap_extent_io is an experimental patch from the btrfs guys 2008-11-16 14:02 I do not think it is a good design direction 2008-11-16 14:02 it gives us nothing that submit_bio does not give 2008-11-16 14:03 we will change user/tux3/filemap.c? 2008-11-16 14:03 we should not write buffer->b_page->mapping->host directly, but use a wrapper 2008-11-16 14:03 now, should the wrapper return a mapping or an inode? 
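A sketch of the wrappers being discussed, so that callers never spell out buffer->b_page->mapping->host themselves. The structs below are cut-down stand-ins holding only the fields the chain touches (the real definitions are in the kernel headers). mapping() and bufmap() follow the signatures quoted in the log; bufinode() is a hypothetical companion added here to show the inode-returning alternative.

```c
#include <assert.h>

/* Minimal stand-ins for the kernel structures; only the fields used here. */
struct inode;
struct address_space { struct inode *host; };		/* host is always the inode */
struct inode { struct address_space *i_mapping; };
struct page { struct address_space *mapping; };
struct buffer_head { struct page *b_page; };

typedef struct address_space map_t;
typedef struct buffer_head buffer_t;

/* Mapping of an inode, hiding the i_mapping field name. */
static inline map_t *mapping(struct inode *inode)
{
	return inode->i_mapping;
}

/* Mapping that a buffer belongs to, reached through its page. */
static inline map_t *bufmap(buffer_t *buffer)
{
	return buffer->b_page->mapping;
}

/* Hypothetical: owning inode of a buffer, for callers that want the inode. */
static inline struct inode *bufinode(buffer_t *buffer)
{
	return bufmap(buffer)->host;
}
```

With both wrappers available, the question of whether to return a mapping or an inode becomes a matter of which one each call site actually needs, and the field chain stays in one place if the buffer representation changes later.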
2008-11-16 14:04 oh 2008-11-16 14:04 sorry 2008-11-16 14:04 _our_ filemap_extent_io 2008-11-16 14:04 yes, yes 2008-11-16 14:04 ACTION backs up 2008-11-16 14:05 maze, jump in, the water's warm 2008-11-16 14:05 maze, the block handle idea is looking very doable 2008-11-16 14:05 hirofumi, we don't have to worry about blockdev io 2008-11-16 14:06 it is always a bug to have blockdev io and file io running at the same time on the same device 2008-11-16 14:06 no 2008-11-16 14:07 it means /dev/block 2008-11-16 14:07 special file 2008-11-16 14:07 ? 2008-11-16 14:07 mknod b x x 2008-11-16 14:07 in tux3 2008-11-16 14:08 ah, yes 2008-11-16 14:08 all we have to do is open that, we don't handle the io 2008-11-16 14:08 so, filemap_extent_io() shouldn't be called for /dev/block? 2008-11-16 14:09 no 2008-11-16 14:09 um.. 2008-11-16 14:09 it also is never called in tux3 for anything in the buffer cache 2008-11-16 14:10 yes 2008-11-16 14:10 if we do open("/dev/block"), which func is called? 2008-11-16 14:10 if we do open("/dev/block") and write(fd), which func is called? 2008-11-16 14:11 where the root fs is tux3? 2008-11-16 14:11 at least, /dev is? 2008-11-16 14:11 yes /dev/block is on tux3 2008-11-16 14:11 that calls a method through one of the _operations structs 2008-11-16 14:12 ok 2008-11-16 14:12 so, in that case, filemap_extent_io() shouldn't be called? 2008-11-16 14:13 correct 2008-11-16 14:13 ok, i see 2008-11-16 14:13 blockdev.c functions get called 2008-11-16 14:14 all we do is open the device by calling a blockdev library function 2008-11-16 14:14 ext2 is a good example of that 2008-11-16 14:14 yes 2008-11-16 14:15 exclude inode metadata 2008-11-16 14:15 ? 2008-11-16 14:15 extent io doesn't get called on that either 2008-11-16 14:16 because all metadata in tux3 is blocks 2008-11-16 14:16 metadata is so much smaller than file data, that that makes sense 2008-11-16 14:16 and it keeps this simple 2008-11-16 14:16 it meant e.g. 
inode->i_ctime 2008-11-16 14:17 it's not handled by blockdev.c 2008-11-16 14:17 ah I understand what you said now 2008-11-16 14:17 sorry 2008-11-16 14:18 that should be handled like any other inode 2008-11-16 14:18 I realized while I was sleeping that proper handling of io priorities and reads gets tricky... if a low io prio process triggers a disk read, and then later a higher io prio process needs to read the same data, then you need to _somehow_ increase the priority of the existing in-flight read. Theoretically with a read you could just have two in flight, but that's bound to be a bad decision... I'm pretty sure you should never have more then 1 out 2008-11-16 14:18 what this most likely means is that you need to keep track of the state 2008-11-16 14:18 maze, that's an elevator problem... in other words we do not ourselves have to feel embarrassed if it sucks ;) 2008-11-16 14:19 in some easy to reference way from the page 2008-11-16 14:19 so that you can cancel and reissue the io with higher priority 2008-11-16 14:19 haha :) 2008-11-16 14:19 or provide an interface to increase the priority of existing io 2008-11-16 14:20 you are _so far_ beyond the level of engineering that currently exists in kernel it's not funny 2008-11-16 14:20 fixing that stuff will cause major resistance from people who don't have a clue what you're talking about, you need to fight fights you can win 2008-11-16 14:21 cfq stuff? 
2008-11-16 14:21 yes, elevator, and with no usable api to tell it what to do 2008-11-16 14:21 just grin and bear it 2008-11-16 14:21 well, currently you can submit_bio 2008-11-16 14:21 can't fix the entire world at once 2008-11-16 14:21 and then basically it's out of your hands 2008-11-16 14:21 yes, then the elevator takes over 2008-11-16 14:21 and the whole elevator api is a disaster 2008-11-16 14:22 you could cancel_bio and resubmit with dif prio 2008-11-16 14:22 just close your eyes and treat it as a black box 2008-11-16 14:22 but that probably sucks as well, because you actually need to make sure the completion gets called correctly 2008-11-16 14:22 what is this cancel_bio of which you speak? 2008-11-16 14:22 I believe there is a way to cancel in-flight io 2008-11-16 14:23 surprise me, find a pointer 2008-11-16 14:24 you are entirely correct, there is major priority inversion in the block path 2008-11-16 14:24 but you can't fix everything all at once 2008-11-16 14:26 btw, i_*time stuff has different format 2008-11-16 14:26 hirofum, devbits(sb) wrapper will help the port I think 2008-11-16 14:27 different field names, or different format? 2008-11-16 14:27 format 2008-11-16 14:27 user is tuxtime, yes 2008-11-16 14:28 whereas generic inode is timespec 2008-11-16 14:28 u64 in userspace, and struct timespec in kernel 2008-11-16 14:28 indeed, I can't seem to find a 'cancel' - was sure I'd seen one... 
2008-11-16 14:29 so we have to go with timespec in generic inode 2008-11-16 14:29 no choice about that, however much it sucks 2008-11-16 14:29 and tuxtime is limited to our on-disk format 2008-11-16 14:29 yes 2008-11-16 14:29 maze, I'm not in the least surprised 2008-11-16 14:30 in general, linux is cancel-challenged 2008-11-16 14:30 and we can't pass i_*time to decode/encode directly 2008-11-16 14:30 it is not remotely close to the level of abstraction required to release the resources involved 2008-11-16 14:30 so we rely on "always run to completion" 2008-11-16 14:30 this is a huge deficiency 2008-11-16 14:31 big iron unixes treat stuff like that seriously, we penguins instead rely on "always run to completion" 2008-11-16 14:31 where the run to completion path can be pretty fast and unobstructed if something in the path sets an error flag 2008-11-16 14:33 hirofumi, we will pass tuxtimeval(...something...) to encode 2008-11-16 14:34 probably tuxtimeval(timespec) 2008-11-16 14:34 probably good 2008-11-16 14:34 it's currently tuxtimeval(unsigned sec, unsigned nsec), so that should be changed in user 2008-11-16 14:35 yes, I think user first is the easy way 2008-11-16 14:35 and a better name would be mktuxtime 2008-11-16 14:36 or make_* like others 2008-11-16 14:36 well, mktuxtime is good to me though 2008-11-16 14:36 it's similar to libc usage 2008-11-16 14:37 whatever the name is, it will take timespec 2008-11-16 14:37 let's hope the field names are compatible between user and kernel ;) 2008-11-16 14:37 looks good [time_t mktime(struct tm *tm)] 2008-11-16 14:38 they are 2008-11-16 14:38 it's similar [struct to time_t] 2008-11-16 14:38 stupidly named, but equally stupid in libc and kernel 2008-11-16 14:38 the fields are tv_ instead of ts_ 2008-11-16 14:39 no problem, struct tm and struct timespec/timeval are already very different 2008-11-16 14:40 we need a blockbits(sb) wrapper I think 2008-11-16 14:40 sounds better than devbits() 2008-11-16 14:41 it's easy to understand
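The time discussion above settles on a single-argument mktuxtime that takes a struct timespec, with tuxtime kept as the on-disk format. A minimal userspace sketch of that idea follows. The log only says tuxtime is a u64 in userspace, so the 32.32 fixed-point layout, the tuxtime_t typedef and the tuxtime_to_timespec name are assumptions made for illustration, not the tux3 definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Assumed on-disk time format: 32.32 fixed point packed into 64 bits.
 * The log only states that tuxtime is a u64; the layout is hypothetical. */
typedef uint64_t tuxtime_t;

#define NSEC_PER_SEC 1000000000ULL

/* Pack a timespec: seconds in the high 32 bits, binary fraction of a
 * second in the low 32 bits. Seconds beyond 32 bits are truncated. */
static inline tuxtime_t mktuxtime(struct timespec ts)
{
	assert(ts.tv_nsec >= 0 && (uint64_t)ts.tv_nsec < NSEC_PER_SEC);
	uint64_t frac = ((uint64_t)ts.tv_nsec << 32) / NSEC_PER_SEC;
	return ((uint64_t)ts.tv_sec << 32) | frac;
}

/* Unpack to a timespec (hypothetical helper name). Truncation in either
 * direction can lose up to about a nanosecond. */
static inline struct timespec tuxtime_to_timespec(tuxtime_t t)
{
	struct timespec ts;
	ts.tv_sec = (time_t)(t >> 32);
	ts.tv_nsec = (long)(((t & 0xffffffffULL) * NSEC_PER_SEC) >> 32);
	return ts;
}
```

Taking the whole timespec rather than separate sec and nsec arguments means the same call works against both the libc and kernel structs, since both use the tv_sec and tv_nsec field names.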
2008-11-16 14:42 what is SB going to be, a generic sb or specific? 2008-11-16 14:43 in kernel? 2008-11-16 14:43 yes 2008-11-16 14:43 I'm thinking to use SB as super_block 2008-11-16 14:44 and tux3 specific field is via tux_sb() 2008-11-16 14:44 yes 2008-11-16 14:45 and btree->sb is needed? 2008-11-16 14:45 if we pass sb to those functions, we don't need it 2008-11-16 14:45 that is just to save passing a parameter in a lot of places 2008-11-16 14:45 yes 2008-11-16 14:46 it's like having a sb field in inode 2008-11-16 14:46 you could pass sb everywhere too, but it's a pain 2008-11-16 14:46 ok 2008-11-16 14:46 next is, btree->sb is super_block or tux_sb? 2008-11-16 14:47 need to check the usage 2008-11-16 14:47 ok 2008-11-16 14:48 well, with those changes, I think kernel is compilable or near 2008-11-16 14:48 most usage of inode->sb is tux3 specific 2008-11-16 14:48 so it should be the specific inode 2008-11-16 14:49 inode->sb? 2008-11-16 14:49 sorry 2008-11-16 14:49 most usage of btree->sb is tux3 specific 2008-11-16 14:49 for example, btree->sb->entries_per_node 2008-11-16 14:50 in probe(), btree->sb->s_bdev 2008-11-16 14:50 yes, that's the only generic one I saw so far 2008-11-16 14:50 and btree->sb->blocksize 2008-11-16 14:51 seems sb is for those 2008-11-16 14:51 both would be handled by a btree_super(btree) wrapper 2008-11-16 14:52 well 2008-11-16 14:52 some wrapper ;) 2008-11-16 14:52 it's our struct, so maybe no need for a wrapper 2008-11-16 14:53 true 2008-11-16 14:53 I want to pull your changes at some point 2008-11-16 14:54 yes, I will need about an hour of quiet to get something worth pulling 2008-11-16 14:54 I'm editing at about 1% speed right now ;-) 2008-11-16 14:54 and my typo rate is through the roof :) 2008-11-16 14:54 :) 2008-11-16 14:55 did I scare maze away with my revelation about the state of io cancelling in linux?
2008-11-16 14:55 ok, I'll continue to fix filemap.c 2008-11-16 14:55 ok 2008-11-16 14:55 nope, just busy with work ;-( 2008-11-16 14:55 see you in an hour 2008-11-16 14:55 if I didn't scare you away, you just don't understand ;) 2008-11-16 14:56 anybody who really knows how much work there is left to do in linux would probably have chosen to go into law instead 2008-11-16 14:59 why does struct super_block have both ->dev_t (unsigned int) and ->s_bdev (struct block_device *) ??? 2008-11-16 15:00 dev_t is for userspace 2008-11-16 15:00 in kernel kdev_t in bdev 2008-11-16 15:00 but why in the super_block? 2008-11-16 15:00 it should be in block_device 2008-11-16 15:00 I forgot it why in super_block 2008-11-16 15:00 bogus alert 2008-11-16 15:01 dev_t bd_dev; /* not a kdev_t - it's a search key */ <- and we have this oddity 2008-11-16 15:01 in block_device 2008-11-16 15:01 oh 2008-11-16 15:01 well 2008-11-16 15:01 grin and bear it 2008-11-16 15:03 I think we should put our blockbits into our tux3 specific sb 2008-11-16 15:04 the generic usage is too weird 2008-11-16 15:04 why? 2008-11-16 15:04 set_blocksize for example 2008-11-16 15:05 um... sb_set_blocksize? 2008-11-16 15:05 it sets our blocksize 2008-11-16 15:05 can can be called from outside 2008-11-16 15:05 I think 2008-11-16 15:06 it sets bdev_inode->i_blkbits, iirc 2008-11-16 15:07 bdev->bd_block_size and bdev->bd_inode->i_blockbits from outside 2008-11-16 15:09 794 sb_set_blocksize(s, block_size(bdev)); 2008-11-16 15:10 http://lxr.linux.no/linux+v2.6.27/fs/super.c#L794 2008-11-16 15:10 in get_sb_bdev 2008-11-16 15:11 get_sb_bdev is misleading, because it's not opening a blockdev 2008-11-16 15:12 it's creating a sb for a filesystem mounted on a block device 2008-11-16 15:13 anyway, it thinks it knows something about the blocksize the filesystem wants, which is really wrong 2008-11-16 15:13 only the filesystem knows that, after reading the superblock 2008-11-16 15:14 so I think... 
because that usage is so crazy, and it is only us that ever uses that field, we should use our own blockbits field 2008-11-16 15:17 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-16 15:18 that means we don't have to wrap sb->blockbits 2008-11-16 15:25 it initializes super_block from bdev blocksize to super_block 2008-11-16 15:26 well, it's just garbage 2008-11-16 15:26 - s->s_old_blocksize = block_size(bdev); 2008-11-16 15:26 - sb_set_blocksize(s, s->s_old_blocksize); 2008-11-16 15:26 + sb_set_blocksize(s, block_size(bdev)); 2008-11-16 15:26 error = fill_super(s, data, flags & MS_VERBOSE ? 1 : 0); 2008-11-16 15:27 in past, vfs try to remember blocksize of bdev before mount 2008-11-16 15:27 however it was removed 2008-11-16 15:27 but, someone forgot to remove sb_set_blocksize 2008-11-16 15:27 and there as still bits lying around just to add confusion 2008-11-16 15:28 anyway 2008-11-16 15:28 it's easier for us to just keep using our fs specific field in struct sb 2008-11-16 15:28 ok 2008-11-16 15:46 http://userweb.kernel.org/~hirofumi/patches.tar.gz 2008-11-16 15:46 http://userweb.kernel.org/~hirofumi/errors 2008-11-16 15:47 well I am working on the ->map problem 2008-11-16 15:48 ok 2008-11-16 15:48 I hope the patch for that will be in a useful form 2008-11-16 15:48 after that, I'd like to sync with your state 2008-11-16 15:49 inode->sb should become our specific inode, which points at our specific sb 2008-11-16 15:50 so inode->sb becomes tux_inode(inode)->sb 2008-11-16 15:50 i see 2008-11-16 15:50 in userspace I think we can just do #define tux_inode(inode) inode 2008-11-16 15:51 an inline of course, but works like that 2008-11-16 15:51 and the fields won't collide 2008-11-16 15:51 simplest thing 2008-11-16 15:51 ok 2008-11-16 15:51 I'll go back to work on the mapping edit 2008-11-16 15:52 it's a big job 2008-11-16 15:52 sorry, I should have done this one long ago 2008-11-16 15:53 yes 2008-11-16 15:54 it big, so I didn't until 
now ;) 2008-11-16 15:57 static inline unsigned bufsize(struct buffer *buffer) 2008-11-16 15:57 { 2008-11-16 15:57 return 1 << buffer->map->inode->sb->blockbits; 2008-11-16 15:57 } 2008-11-16 15:57 I'm not sure I like that ;-) 2008-11-16 15:58 hi guys 2008-11-16 15:58 flips: whats wrong with that? 2008-11-16 15:58 bufsize is used for something 2008-11-16 15:58 well for one thing: dereferencing pointer to incomplete type 2008-11-16 15:58 inode is not defined when buffer.h is read 2008-11-16 15:58 that's a mess 2008-11-16 15:58 static inline size_t bufsize(struct buffer_head *buffer) 2008-11-16 15:58 { 2008-11-16 15:58 return buffer->b_size; 2008-11-16 15:58 } 2008-11-16 15:59 ah 2008-11-16 15:59 I'd like to avoid using b_size 2008-11-16 15:59 it was temporary hack 2008-11-16 15:59 b_size for every buffer? 2008-11-16 16:00 sounds wasteful 2008-11-16 16:00 yes 2008-11-16 16:00 and broken concept 2008-11-16 16:00 as if you could have different sizes for the same block 2008-11-16 16:01 aren't all buffers the same size? 2008-11-16 16:01 in practice, yes 2008-11-16 16:01 but ancient bsd interface was written as if they were not 2008-11-16 16:01 ah 2008-11-16 16:01 bogus generality 2008-11-16 16:02 well I need to do something about this inode forward reference... 
up till now, buffer.c did not need to know the specific structure of an inode 2008-11-16 16:02 good time for a cup of tea 2008-11-16 16:02 yes 2008-11-16 16:02 I solved that by having dev->bits 2008-11-16 16:03 but that isn't nice 2008-11-16 16:03 above one looks good 2008-11-16 16:05 without dereferencing some memory 2008-11-16 16:05 but hope those in dcache of cpu 2008-11-16 16:06 we're going to do disk IO at that point, so we can load a few cache lines 2008-11-16 16:06 without noticing 2008-11-16 16:06 it's the nasty forward reference to the inode type I need to resolve 2008-11-16 16:07 and buffer allocation 2008-11-16 16:08 init_buffers only takes dev to get the blocksize, we just pass blockbits instead 2008-11-16 16:08 ah 2008-11-16 16:09 static struct buffer_head *new_block(struct btree *btree) 2008-11-16 16:09 { 2008-11-16 16:09 block_t block = (btree->ops->balloc)(btree->sb); 2008-11-16 16:09 if (block == -1) 2008-11-16 16:09 return NULL; 2008-11-16 16:09 struct buffer_head *buffer = blockget(btree->sb->s_bdev, block); 2008-11-16 16:09 if (!buffer) 2008-11-16 16:09 return NULL; 2008-11-16 16:09 memset(bufdata(buffer), 0, bufsize(buffer)); 2008-11-16 16:09 set_buffer_dirty(buffer); 2008-11-16 16:09 return buffer; 2008-11-16 16:09 } 2008-11-16 16:09 I referenced in user/tux3/btree.c 2008-11-16 16:09 but memset() may be unnecessary 2008-11-16 16:10 yes, it's wrong 2008-11-16 16:10 and would be expensive if we leave it that way 2008-11-16 16:11 that is old code that came from ddsnap 2008-11-16 16:11 i see 2008-11-16 16:15 well I am going to keep dev->bits, but keep it strictly inside buffer.c 2008-11-16 16:15 sounds good 2008-11-16 16:16 I believe we are getting close the compilable 2008-11-16 16:16 I have another 20 minutes or so to go to come up with a mapping patch 2008-11-16 16:17 ok, it's no problem, I'm glad current state 2008-11-16 16:20 after this stuff, the kernel stuff should be more interesting 2008-11-16 16:21 there are some fun things coming up 
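The bufsize discussion above ends with the block size bits staying on the device structure, private to buffer.c, so the buffer code never has to see the inode or superblock layout. A rough userspace sketch of that arrangement follows. The log mentions map->dev and dev->bits; every other field, and the exact struct layout, is a hypothetical stand-in.

```c
#include <assert.h>

/* Hypothetical userspace stand-ins. Only map->dev and dev->bits come
 * from the log; the remaining fields are illustrative. */
struct dev { int fd; unsigned bits; };          /* bits = log2(block size) */
struct map { struct dev *dev; };
struct buffer { struct map *map; void *data; };

/* Block size derived from the device, with no reference to the inode
 * type, which avoids the forward-reference problem in buffer.h. */
static inline unsigned bufsize(struct buffer *buffer)
{
	unsigned bits = buffer->map->dev->bits;
	assert(bits < 32);	/* keep the shift well defined */
	return 1U << bits;
}
```

The extra pointer hops cost a few cache lines at most, which the log argues is negligible next to the disk IO that follows.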
2008-11-16 16:22 doing it all without buffer_head might be pretty easy 2008-11-16 16:22 oh 2008-11-16 16:24 glue code with linux operations is also fun 2008-11-16 16:24 in the "lots of work" sense 2008-11-16 16:24 and lots of lines of code 2008-11-16 16:25 yes, that's fun 2008-11-16 16:33 ACTION takes a break 2008-11-16 16:39 compiles 2008-11-16 16:39 now to fix the user mode breakage 2008-11-16 16:55 ok, all the tests ran 2008-11-16 16:55 now lets see what the patch looks like 2008-11-16 17:04 ok, let's see if it compiles in kernel 2008-11-16 17:09 ok 2008-11-16 17:10 which files? 2008-11-16 17:10 just tux3.h 2008-11-16 17:11 ok 2008-11-16 17:11 fs/tux3/tux3.h:209: error: expected specifier-qualifier-list before 'map_t' 2008-11-16 17:11 :) 2008-11-16 17:11 I hate random syntax error messages 2008-11-16 17:12 oh 2008-11-16 17:12 it's trivial 2008-11-16 17:13 but the gcc message is really stupid 2008-11-16 17:13 it actualy means some symbol was undefined 2008-11-16 17:13 how hard could it be to actually report that 2008-11-16 17:13 maybe gcc is not sure about c :) 2008-11-16 17:15 if (!specs->declspecs_seen_p) 2008-11-16 17:15 { 2008-11-16 17:15 c_parser_error (parser, "expected specifier-qualifier-list"); 2008-11-16 17:15 return NULL_TREE; 2008-11-16 17:15 } 2008-11-16 17:15 in gcc 2008-11-16 17:15 i see 2008-11-16 17:15 ok, compiles in kernel 2008-11-16 17:15 now... copy the file back to user, see if it still compiles 2008-11-16 17:15 and runs the tests... 2008-11-16 17:15 the name came from c spec 2008-11-16 17:15 ok 2008-11-16 17:16 and check it in 2008-11-16 17:16 ok 2008-11-16 17:16 ...compiles and runs the tests 2008-11-16 17:18 whoops, another few minutes of cleanup to do 2008-11-16 17:18 yes 2008-11-16 17:30 just one doubtful issue left 2008-11-16 17:30 did I make a stupid change to use s_bdev yesterday? 
2008-11-16 17:30 yes 2008-11-16 17:31 it is already in hg 2008-11-16 17:31 yes, but it doesn't really make sense 2008-11-16 17:31 ok 2008-11-16 17:31 blockget takes a mapping, not a device 2008-11-16 17:31 yes, but it should be no problem 2008-11-16 17:32 I'll check it in the way it is 2008-11-16 17:32 in that case, we will use sb_bread or something 2008-11-16 17:32 blockget will call sb_bread 2008-11-16 17:32 ah 2008-11-16 17:33 + struct buffer *buffer = blockget(btree->sb->s_bdev, block); <- this is wrong 2008-11-16 17:33 it works, but it's really misleading 2008-11-16 17:33 I was thinking it will be replaced with sb_bread 2008-11-16 17:34 but it still should make sense 2008-11-16 17:34 really, that patch was just a mistake 2008-11-16 17:35 I must have been too tired 2008-11-16 17:35 i see 2008-11-16 17:35 in that case, mapping->host is bdev inode 2008-11-16 17:36 anyway, it is public, so I will check in the mapping fix 2008-11-16 17:36 then worry about undoing that mistake after 2008-11-16 17:36 at least I don't care, but probably yes 2008-11-16 17:38 ok, committed 2008-11-16 17:39 ok, I'll try to merge my patches 2008-11-16 17:45 ok, I have the revert of the sb->devmap to sb->b_dev ready to commit 2008-11-16 17:45 ok 2008-11-16 17:46 what is we use for sb->devmap in kernel? 2008-11-16 17:47 the address_space of the buffer cache 2008-11-16 17:48 just make a redundant field in our fs-specfic sb for that 2008-11-16 17:48 well 2008-11-16 17:48 i see 2008-11-16 17:48 it's already there ;) 2008-11-16 17:48 a redundant assignment 2008-11-16 17:48 this is lazy and will get cleaned up but it will work 2008-11-16 17:49 so... 
vfs stores the inode of the block device for us 2008-11-16 17:49 we just store sb->sb_bdev->b_inode in our fs-specific sb 2008-11-16 17:50 let me see if I got the field names right 2008-11-16 17:50 a bit difference 2008-11-16 17:50 maybe sb->s_bdev->bd_inode 2008-11-16 17:51 sb->s_bdev->bd_inode 2008-11-16 17:51 yes, see "will die" 2008-11-16 17:51 but it's ok for today 2008-11-16 17:51 but inode? 2008-11-16 17:51 isn't it bdev? 2008-11-16 17:51 sb->s_bdev->bd_inode->i_mapping 2008-11-16 17:51 an address_space 2008-11-16 17:52 which we call a map_t 2008-11-16 17:52 we don't use sb_bread at all? 2008-11-16 17:52 blockread can call sb_bread 2008-11-16 17:52 sb_bread takes sb 2008-11-16 17:53 that's easy to get from a mapping 2008-11-16 17:53 let me think about it and try to make more sense ;) 2008-11-16 17:53 i_mapping->host->i_sb? 2008-11-16 17:54 you were just going to edit blockread(sb->devmap to sb_bread(sb right? 2008-11-16 17:54 yes, I was thinking so 2008-11-16 17:54 that will work fine 2008-11-16 17:55 yes 2008-11-16 17:55 but sounds like blockread should do for it 2008-11-16 17:56 yes, that is also trivial 2008-11-16 17:56 smaller change 2008-11-16 17:56 but, in that case, i don't know sb is where come from 2008-11-16 17:57 blockread calls sb_bread(sb->devmap->host->i_sb 2008-11-16 17:57 I think it is bdev_sb, not ours 2008-11-16 17:57 hm 2008-11-16 17:57 blockread calls sb_bread(map->host->i_sb 2008-11-16 17:57 :) 2008-11-16 17:57 :) 2008-11-16 17:58 we will use sb_bread or something? 
2008-11-16 17:58 yes 2008-11-16 17:59 you notice, blockread that reads a file only comes from user/inode.c 2008-11-16 17:59 so you can always just do sb_bread inside blockread 2008-11-16 18:00 and for file IO, we have some new kernel code to write 2008-11-16 18:01 my nice abstracting of blockread to work either on a block device or on a file has no use that I know of in kernel 2008-11-16 18:01 yes 2008-11-16 18:01 for userspace, it was very useful for testing 2008-11-16 18:01 for inode.c is yes 2008-11-16 18:01 I just worry about blockread(sb->devmap) 2008-11-16 18:02 because? 2008-11-16 18:02 we might accidentally use it on a file mapping? 2008-11-16 18:02 we can't get sb from sb->devmap 2008-11-16 18:02 our sb 2008-11-16 18:03 sure we can 2008-11-16 18:03 in userspace? 2008-11-16 18:03 in userspace we have map->inode->sb 2008-11-16 18:03 yes 2008-11-16 18:04 in kernel... 2008-11-16 18:04 yes, in kernel 2008-11-16 18:04 map->inode->i_sb 2008-11-16 18:04 um 2008-11-16 18:04 map->inode is bd_inode 2008-11-16 18:04 so, bd_inode->i_sb is not our sb 2008-11-16 18:05 map->host->i_sb 2008-11-16 18:05 damm, linus got me again ;) 2008-11-16 18:05 yes 2008-11-16 18:05 what sb is it? 2008-11-16 18:06 a bogus blockdev sb? 
2008-11-16 18:06 in fs/block_dev.c 2008-11-16 18:06 around 330 2008-11-16 18:06 static struct vfsmount *bd_mnt __read_mostly; 2008-11-16 18:07 and it's bd_mnt->mnt_sb 2008-11-16 18:07 idiotic 2008-11-16 18:08 bd_inode come from pesudo bdev fs 2008-11-16 18:08 yes, there's brain damage there 2008-11-16 18:08 tail wagging the dog 2008-11-16 18:08 maybe 2008-11-16 18:09 at least, there was some problem 2008-11-16 18:10 well just make the edit then 2008-11-16 18:11 as you originally planned, and have probably already done 2008-11-16 18:11 at least once ;) 2008-11-16 18:11 ok 2008-11-16 18:31 hirofumi, the only thing we actually need from the sb in blockread is the blocksize, which we could store in the block_device using set_blocksize if we really wanted to 2008-11-16 18:31 it doesn't matter though 2008-11-16 18:33 i see 2008-11-16 18:34 well, for us, sb_bread may be temporary usage for right now 2008-11-16 18:35 right 2008-11-16 18:39 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-16 18:53 http://userweb.kernel.org/~hirofumi/patchset.tar.gz 2008-11-16 18:54 http://userweb.kernel.org/~hirofumi/errors 2008-11-16 18:54 a few stuff is remaining though 2008-11-16 18:55 wgetting... 2008-11-16 19:00 what are /pc/ and /txt/ ? 2008-11-16 19:00 those are scripts helpers 2008-11-16 19:02 this is the database for your quilt-like thing :) 2008-11-16 19:02 yes, modified akpm-scripts before quilt 2008-11-16 19:05 so patchset/tux3/balloc.c is before applying any patches? 2008-11-16 19:05 no, 2008-11-16 19:05 tux3 is copy of fs/tux3 final state 2008-11-16 19:06 ok, so the ~ versions are with the named patches reversed? 2008-11-16 19:06 yes, xxx~yyy is before patch of yyy 2008-11-16 19:07 everything is not hidden 2008-11-16 19:07 it's dirty, but I think it's useful 2008-11-16 19:07 so, patchset/tux3/balloc.c is exactly the file you are trying to compile? 
2008-11-16 19:07 oh yes, you're fast with it 2008-11-16 19:07 yes 2008-11-16 19:08 so most of the errors in balloc.c are about trying to find the blocksize 2008-11-16 19:09 in bits 2008-11-16 19:09 and i_*time, and buffer->map 2008-11-16 19:10 I've reviewed my patch, it seems not bad 2008-11-16 19:11 it looks fine 2008-11-16 19:11 in balloc.c, I think the functions should take tux_inode's instead of vfs inode 2008-11-16 19:12 does that help us? 2008-11-16 19:12 which func? 2008-11-16 19:12 ah, ok 2008-11-16 19:13 count_range, bitmap_dump, balloc_extent_from_range, bfree_extent 2008-11-16 19:14 um... or pass our sb? 2008-11-16 19:14 looks good enough 2008-11-16 19:14 ah, not 2008-11-16 19:15 some funcs needs inode of bitmap 2008-11-16 19:15 ah, we have it in sb 2008-11-16 19:15 yes 2008-11-16 19:15 and for now, we do not need more than one bitmap file per volume 2008-11-16 19:15 i see 2008-11-16 19:16 sounds good 2008-11-16 19:17 next is, I should merge those to userspace...? 2008-11-16 19:18 if you feel like it 2008-11-16 19:18 if it makes it easier for you to work 2008-11-16 19:18 otherwise I can 2008-11-16 19:18 or we can wake up shapor ;) 2008-11-16 19:18 :) 2008-11-16 19:19 or other plain is we fix remaining errors in userspace first, then merge 2008-11-16 19:20 that might help 2008-11-16 19:20 yes 2008-11-16 19:20 let me look at some more errors 2008-11-16 19:20 ok 2008-11-16 19:21 mv linux/fs/tux3 linux/fs/tux3.orig; cp -a patchset/tux3 linux/fs/tux3 2008-11-16 19:21 struct file' has no member named 'f_inode <- there's an easy one 2008-11-16 19:21 should be work 2008-11-16 19:21 yes 2008-11-16 19:21 let me see 2008-11-16 19:21 it's file->f_dentry->d_inode 2008-11-16 19:22 well, hard to make that be the same as userspace 2008-11-16 19:22 so do we wrap it? 2008-11-16 19:22 file_inode(file) 2008-11-16 19:22 well, maybe we will need some change around it for operations 2008-11-16 19:22 ? 
2008-11-16 19:22 inode_operations 2008-11-16 19:23 inode->i_op->readdir 2008-11-16 19:23 we don't have it yet 2008-11-16 19:24 we may just set ext2_readdir to inode_operations though 2008-11-16 19:25 we could fix a lot of things with a wrapper that takes a generic inode and returns our specific sb 2008-11-16 19:25 more than half of the errors 2008-11-16 19:26 yes 2008-11-16 19:27 do we have a tux_sb yet? 2008-11-16 19:27 no, we don't have it yet 2008-11-16 19:27 it would take a generic sb and return a tux sb 2008-11-16 19:27 yes 2008-11-16 19:28 luckily, it seems we need it at a few places only 2008-11-16 19:29 we can use the one that goes from a generic inode to a specific sb in a lot of places 2008-11-16 19:29 if tux_inode()->sb is our sb, it may be ok 2008-11-16 19:30 it is remaining stuff 2008-11-16 19:30 right, we could be lazy and have a redundant pointer to the sb 2008-11-16 19:30 the generic inode already points to the generic sb 2008-11-16 19:30 yes 2008-11-16 19:31 static inline tux_sb *tux_sb(struct inode *inode) ? 2008-11-16 19:31 ok, I'll do it in inode-info.patch 2008-11-16 19:31 um.. or just tux_inode()->sb? 2008-11-16 19:32 well, that's if we want to have the redundant field in kernel 2008-11-16 19:32 yes 2008-11-16 19:33 I'm not sure which is good 2008-11-16 19:33 so, it was remaining 2008-11-16 19:33 static inline tux_sb *tux_sb_from_inode(struct inode *inode) ? 2008-11-16 19:34 ugly long name 2008-11-16 19:34 :) 2008-11-16 19:34 but otherwise it sounds like the conversion from generic sb to specific 2008-11-16 19:34 just tux_inode(inode)->sb? 2008-11-16 19:34 or tux_sb(inode->i_sb) 2008-11-16 19:35 or ... 2008-11-16 19:35 or tux_sb_from_inode(inode) 2008-11-16 19:35 or tux_sb(inode) 2008-11-16 19:35 tux_sb(inode->i_sb) <- let's do this one 2008-11-16 19:35 ok 2008-11-16 19:35 but in userspace?
2008-11-16 19:35 that means changing all ->sb to ->i_sb in user space 2008-11-16 19:36 ok 2008-11-16 19:36 good 2008-11-16 19:36 do it in userspace first and try to apply that patch? 2008-11-16 19:37 that's fine 2008-11-16 19:37 I'll merge those to my patches after that 2008-11-16 19:37 ok, should be about 15 minutes 2008-11-16 19:37 ok, let's work on userspace 2008-11-16 19:39 ok, it's going to be a little bit longer, because my wife just called me for dinner 2008-11-16 19:39 it's sushi ;) 2008-11-16 19:39 maybe you want to do it instead of me? 2008-11-16 19:40 otherwise it will be about 40 minutes 2008-11-16 19:43 oh, good dinner 2008-11-16 20:09 back 2008-11-16 20:09 ok 2008-11-16 20:10 it was hamachin and salmon nigiri 2008-11-16 20:10 with a nice sake 2008-11-16 20:11 oh, good 2008-11-16 20:13 ->i_sb is a lot uglier than ->sb 2008-11-16 20:13 sigh 2008-11-16 20:13 :) 2008-11-16 20:15 btw, are you using vi to view/edit? 2008-11-16 20:17 kate 2008-11-16 20:19 oh 2008-11-16 20:20 using kscope? 
2008-11-16 20:21 I'm always finding good environment for devlopment 2008-11-16 20:21 I should using something besides lxr 2008-11-16 20:22 but I just use lxr, so I can pass url's around 2008-11-16 20:22 costs me a lot of time and frustration 2008-11-16 20:23 i see 2008-11-16 20:23 ok, there is the sb -> i_sb change 2008-11-16 20:23 ACTION is thinking to add cscope to url convertion 2008-11-16 20:24 ok 2008-11-16 20:24 now inode->i_sb to tux_sb(inode->i_sb), making things more verbose 2008-11-16 20:24 ok 2008-11-16 20:25 sigh again 2008-11-16 20:25 :) 2008-11-16 20:38 ok, done 2008-11-16 20:39 I did not try to figure out where we need generic super_block 2008-11-16 20:39 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-16 20:39 I think generic super_block is only needed in a few places 2008-11-16 20:39 yes 2008-11-16 20:40 maybe SB is super_block 2008-11-16 20:40 maybe 2008-11-16 20:40 you decide ;) 2008-11-16 20:41 ok 2008-11-16 20:42 most of the uses I saw so far want fs specific sb 2008-11-16 20:42 I think SB needs to be specific sb 2008-11-16 20:43 ah, blockbits is in specific sb, so yes 2008-11-16 20:43 and btree roots, nextalloc 2008-11-16 20:44 map_t *map = new_map(sb->devmap->dev, &filemap_ops); <- first example I found of generic sb 2008-11-16 20:46 yes 2008-11-16 20:46 I didn't copy inode.c though 2008-11-16 20:47 right 2008-11-16 20:47 on the other hand, most uses of inode are generic I think 2008-11-16 20:47 anywhere they are not, we can wrap in tux_inode(inode) 2008-11-16 20:47 yes 2008-11-16 20:48 I will wait for a patch from you for that 2008-11-16 20:48 it's already including almost 2008-11-16 20:48 cool 2008-11-16 20:48 inode-info.patch 2008-11-16 20:48 well, I will wait for a new list of errors :) 2008-11-16 20:49 ah 2008-11-16 20:49 timespec? 
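The tux_sb(inode->i_sb) form chosen above can be sketched roughly as follows. The structs are cut-down stand-ins so the fragment compiles in userspace. Hanging the specific superblock off s_fs_info is the usual kernel convention but is an assumption here, since the log does not say how tux3 links the two, and inode_blocksize is a hypothetical helper added only to show a call site.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for the generic kernel structures, reduced to the
 * fields this sketch touches. */
struct super_block { void *s_fs_info; };
struct inode { struct super_block *i_sb; };

/* Hypothetical filesystem-specific superblock; the log places blockbits
 * here rather than in the generic super_block. */
struct tux_sb { unsigned blockbits; };

/* Generic sb to specific sb. Using s_fs_info is an assumption. */
static inline struct tux_sb *tux_sb(struct super_block *sb)
{
	assert(sb != NULL);
	return sb->s_fs_info;
}

/* Hypothetical call site: block size in bytes for an inode's volume. */
static inline unsigned inode_blocksize(struct inode *inode)
{
	return 1U << tux_sb(inode->i_sb)->blockbits;
}
```

Going through inode->i_sb avoids the redundant sb pointer in the specific inode that the log calls lazy, at the cost of the more verbose call sites complained about above.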
2008-11-16 20:49 yes 2008-11-16 20:49 it's remaining 2008-11-16 20:49 ok, I'll merge sb stuff 2008-11-16 20:50 ok, I will change the way we wrap times 2008-11-16 20:50 to take just one argument 2008-11-16 20:51 it seems we need fixed32 -> timespec, and timespec -> fixed32 2008-11-16 20:51 yes 2008-11-16 21:30 http://userweb.kernel.org/~hirofumi/ 2008-11-16 21:30 merged 2008-11-16 21:30 time patch is nearly done 2008-11-16 21:31 remaining is buffer->map 2008-11-16 21:31 how will we do this merge? 2008-11-16 21:31 I have a back patch here, and you have a big patch there 2008-11-16 21:31 maybe I should post the time patch to the list 2008-11-16 21:31 oh 2008-11-16 21:31 you're not changing user 2008-11-16 21:31 :) 2008-11-16 21:31 yes 2008-11-16 21:32 merge to userspace would be last one 2008-11-16 21:33 buffer->map 2008-11-16 21:33 well it seems times still work in fuse 2008-11-16 21:34 yes 2008-11-16 21:35 checked in 2008-11-16 21:35 I hope it's work after merge 2008-11-16 21:35 should I wget errors again? 2008-11-16 21:35 thanks 2008-11-16 21:36 it would be good 2008-11-16 21:36 errors was reduced 2008-11-16 21:37 ok, buffer_head -> map_t 2008-11-16 21:38 yes 2008-11-16 21:38 buffer->b_page->mapping 2008-11-16 21:39 works only for file buffers, which is what we have in filemap.c 2008-11-16 21:39 well, sorry, it works for buffer cache buffers too, but the mapping does not point back at our sb as you pointed out 2008-11-16 21:40 yes 2008-11-16 21:40 so we want a wrapper? 2008-11-16 21:40 bufmap(buffer) 2008-11-16 21:40 ? 2008-11-16 21:40 looks good to me 2008-11-16 21:41 ah, and sb->devmap 2008-11-16 21:41 static inline map_t *bufmap(struct buffer_head *buffer) { return buffer->b_page->mapping; } 2008-11-16 21:41 should I put that in tux3.h and commit? 2008-11-16 21:41 yes 2008-11-16 21:43 is it for kernel? 
well, anyway, please commit it 2008-11-16 21:45 kernel, but it's not quite right 2008-11-16 21:45 the uses in filemap.c are buffer->map->inode 2008-11-16 21:46 and kernel wants bufmap(buffer)->host 2008-11-16 21:46 wrapper or something is needed? 2008-11-16 21:46 yes 2008-11-16 21:46 stupid stupid fieldname 2008-11-16 21:46 #define host inode ;) 2008-11-16 21:47 buffer_inode 2008-11-16 21:47 buffer_inode(buffer) 2008-11-16 21:47 tempting to say bufnode(buffer) 2008-11-16 21:47 sounds nice, but rather misleading 2008-11-16 21:48 sounds good 2008-11-16 21:49 what about peekblk? 2008-11-16 21:50 I'm not sure 2008-11-16 21:50 just return null in kernel 2008-11-16 21:50 that means no readahead 2008-11-16 21:50 or extents longer than one block 2008-11-16 21:50 i see 2008-11-16 21:50 fix later 2008-11-16 21:51 I think that's enough for now 2008-11-16 21:52 we will just #define bufmap(map) 2008-11-16 21:52 as nothing 2008-11-16 21:52 and that way it won't even be typechecked 2008-11-16 21:52 what is passed? 2008-11-16 21:53 whoops, has to be defined as NULL 2008-11-16 21:53 map_t is passed 2008-11-16 21:53 ah, for userspace 2008-11-16 21:53 yes 2008-11-16 21:53 #define bufmap(map) NULL 2008-11-16 21:53 and kernel 2008-11-16 21:53 map_t is struct address_space 2008-11-16 21:54 I don't know how much sense that makes 2008-11-16 21:54 but we'll try it to start 2008-11-16 21:54 i see 2008-11-16 21:57 time stuff was merged 2008-11-16 21:57 was it easy? 2008-11-16 21:57 yes 2008-11-16 21:57 no conflict 2008-11-16 21:57 :) for once 2008-11-16 21:58 yes :) 2008-11-16 21:58 now I need to try compiling tux3.h in kernel 2008-11-16 21:59 then will commit with the new wrapper 2008-11-16 21:59 ok 2008-11-16 21:59 buffer->map is remaining? 2008-11-16 22:00 where? 2008-11-16 22:01 should only be in user space 2008-11-16 22:01 e.g. 
guess_extent() 2008-11-16 22:01 buffer->map->inode 2008-11-16 22:01 changed that to buffer_inode(buffer) 2008-11-16 22:01 i see 2008-11-16 22:09 ok, there is buffer_inode 2008-11-16 22:09 ok 2008-11-16 22:10 also fixes peekbuf and the remaining buffer->map with a macro that discards the parameter and returns null 2008-11-16 22:11 should I fix those? 2008-11-16 22:11 we also have a buffer->map in btree, in advance 2008-11-16 22:11 if you want to write a peek ;) 2008-11-16 22:12 well I think we can leave that for later 2008-11-16 22:12 ok 2008-11-16 22:12 NULL will make it stop right away and only transfer the block actually asked for 2008-11-16 22:12 no readahead, no forming bigger extents 2008-11-16 22:13 bufmap stuff was commited? 2008-11-16 22:13 yes 2008-11-16 22:13 not bufmap 2008-11-16 22:13 buf buffer_inode 2008-11-16 22:13 ok 2008-11-16 22:13 I see there is one use of buffer->map in btree.c that looks hard to deal with 2008-11-16 22:15 ah, btree is an example of where my buffer cache vs page cache abstraction would be useful in kernel 2008-11-16 22:15 though we have not done it yet, we would like to map a btree into a file for phtree 2008-11-16 22:15 well, I will worry about that later 2008-11-16 22:16 i see 2008-11-16 22:16 in btree.c, buffer->map is only used for the blockget in advance 2008-11-16 22:17 so that can just be edited to sb_bread, like in other places 2008-11-16 22:17 for now 2008-11-16 22:17 and the buffer->map can be edited to NULL 2008-11-16 22:17 ok? 2008-11-16 22:17 I'll leave as is for now 2008-11-16 22:18 should be almost out of compile errors 2008-11-16 22:18 yes 2008-11-16 22:18 ok, got to spend a little time with my girl 2008-11-16 22:18 ok 2008-11-16 22:29 -!- RazvanM(~RazvanM@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-16 23:51 -!- samlh(~sam@67.129.121.145) has joined #tux3 2008-11-17 02:43 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-17 03:33 flips: Ping? 
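Editor's sketch of the buffer_inode(buffer) wrapper agreed on above. It is only an illustration under stated assumptions: every struct here is a minimal mock, not the real kernel or tux3 definition, and the two flavours carry made-up names (user_buffer_inode, kernel_buffer_inode) so that both fit in one file. In the real tree a single buffer_inode() would presumably be selected by #ifdef __KERNEL__.

```c
#include <assert.h>
#include <stddef.h>

/* Mock types: just enough fields to show the two pointer chains (assumed layouts). */
struct inode { unsigned long inum; };

/* Userspace side: buffer -> map -> inode */
typedef struct map { struct inode *inode; } map_t;
struct buffer { map_t *map; };

/* Kernel side: buffer_head -> page -> address_space -> host inode */
struct address_space { struct inode *host; };
struct page { struct address_space *mapping; };
struct buffer_head { struct page *b_page; };

/* Userspace flavour: replaces open-coded buffer->map->inode. */
static inline struct inode *user_buffer_inode(struct buffer *buffer)
{
	return buffer->map->inode;
}

/* Kernel flavour: the inode field is called host there, hence the wrapper. */
static inline struct inode *kernel_buffer_inode(struct buffer_head *buffer)
{
	return buffer->b_page->mapping->host;
}
```

Callers such as guess_extent() can then say buffer_inode(buffer) and stay the same on both sides; as noted in the log, the kernel chain is only meaningful for file buffers whose mapping points back at the owning inode.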
2008-11-17 03:35 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-17 04:24 -!- pgquiles__(~pgquiles@109.Red-79-153-83.staticIP.rima-tde.net) has joined #tux3 2008-11-17 04:31 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-17 05:10 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-11-17 05:11 -!- pgquiles__(~pgquiles@109.Red-79-153-83.staticIP.rima-tde.net) has joined #tux3 2008-11-17 06:25 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-11-17 08:36 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-17 09:07 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-17 09:31 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-17 10:20 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-17 12:14 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-17 13:18 hirofumi, up yet? 2008-11-17 13:18 yes 2008-11-17 13:19 waked up right now 2008-11-17 13:19 should I wait for the small changes I mentioned on the ml? 2008-11-17 13:20 probably yes, well, I'm reading it now :) 2008-11-17 13:20 #define buffer buffer_head is admittedly ugly 2008-11-17 13:20 but won't cause any problem 2008-11-17 13:21 that's meaning, kernel uses buffer_head, and userspace users buffer? 2008-11-17 13:22 yes, and I wrote the define backwards in my message 2008-11-17 13:22 ah 2008-11-17 13:22 ok 2008-11-17 13:22 but, buffer is used for another way 2008-11-17 13:23 i.e. 
struct buffer *buffer -> struct buffer_head *buffer_head 2008-11-17 13:24 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-17 13:25 yes 2008-11-17 13:25 sorry 2008-11-17 13:25 that will break badly 2008-11-17 13:26 ok, typedef struct buffer_head buffer_t 2008-11-17 13:26 which I should have written first 2008-11-17 13:26 :p 2008-11-17 13:27 it would be nice to be able to do #typedef struct buffer_head struct buffer; 2008-11-17 13:28 ok, I'll think about it for several minutes, and maybe use it 2008-11-17 13:31 updated my post to correct the stupidity 2008-11-17 13:31 well after these patches it gets more interesting 2008-11-17 13:31 start to work on actual kernel code 2008-11-17 13:42 um.. probably, buffer_t 2008-11-17 13:43 yes 2008-11-17 13:44 maybe some stage of compile can do replace, unfortunately I can't find one 2008-11-17 13:45 do replace? 2008-11-17 13:46 are you having trouble changing that one patch? 2008-11-17 13:46 struct buffer structure -> struct buffer_head 2008-11-17 13:46 no problem 2008-11-17 13:46 oh 2008-11-17 13:46 -!- konrad(~konrad@garloff.cs.washington.edu) has joined #tux3 2008-11-17 13:46 in kernel/tux3.h, inside #ifdef __KERNEL__ 2008-11-17 13:48 about another patch, *leaf_create/destroy is memory stuff 2008-11-17 13:48 so, I think kernel will not use those 2008-11-17 13:49 these are supposed to be called by the btree code 2008-11-17 13:49 I have not hooked up those calls yet 2008-11-17 13:49 um... what is it for? 2008-11-17 13:50 we need static inline void free(void *mem) { kfree(mem); } 2008-11-17 13:50 when leaves are removed from a btree they have to be freed 2008-11-17 13:50 but it should be "struct buffer"? 2008-11-17 13:51 sorry, which are you talking about now, the buffer patch, or the function move patch? 2008-11-17 13:51 about function move 2008-11-17 13:52 so we need malloc to be kmalloc in kernel 2008-11-17 13:52 and free to be kfree 2008-11-17 13:52 dleaf is in buffer cache?
2008-11-17 13:52 yes 2008-11-17 13:52 ok, right 2008-11-17 13:52 those are completely wrong 2008-11-17 13:53 sometimes it takes me a while to get the message ;) 2008-11-17 13:53 it seems to be used for unit tests 2008-11-17 13:53 let's just keep your patch the way it is 2008-11-17 13:53 and about ext2_dump_entries() 2008-11-17 13:54 maybe we can use it in kernel 2008-11-17 13:54 yes 2008-11-17 13:54 but buffer->map->inode->inum is used 2008-11-17 13:55 we may need some changes 2008-11-17 13:55 like defining bufmap(buffer) 2008-11-17 13:55 let's just keep that patch as you have it 2008-11-17 13:55 it doesn't hurt 2008-11-17 13:56 ok, thanks 2008-11-17 13:56 let's keep buffer_head too :) 2008-11-17 13:56 I'll do buffer_t stuff 2008-11-17 13:56 ok, if you want 2008-11-17 13:56 oh 2008-11-17 13:56 I could pull right now if you like 2008-11-17 13:56 please pull 2008-11-17 13:57 ok 2008-11-17 13:57 and if you need, I'll fix buffer_t 2008-11-17 13:57 I don't need it, let's not do it 2008-11-17 13:57 ok 2008-11-17 13:57 this will give me one more reason to think about how to do it without buffers at all :) 2008-11-17 13:57 later 2008-11-17 13:58 see you 2008-11-17 13:58 oh 2008-11-17 13:58 i see 2008-11-17 13:59 yes, I meant, later do it without buffers 2008-11-17 13:59 I was reading vm and vfs code last night to see how practical that is 2008-11-17 13:59 i see 2008-11-17 14:01 http://www.kernel.org/pub/linux/kernel/v2.5/ChangeLog-2.5.17 <- see [PATCH] i_dirty_buffers locking fix 2008-11-17 14:03 yes 2008-11-17 14:03 have you see that before? 2008-11-17 14:03 seen 2008-11-17 14:04 pull is done 2008-11-17 14:04 ok, I will try compiling these files in kernel 2008-11-17 14:05 maybe seen, but I forgot it completely 2008-11-17 14:06 IOW, I have not seen :) 2008-11-17 14:06 so I looked at generic_osync_inode, which does in fact assume buffer_heads in address_space.private_list 2008-11-17 14:07 however, it simply exits if there are no assoc_mappings 2008-11-17 14:07 so...
we would not use assoc_mappings, which I never intended anyway 2008-11-17 14:08 http://lxr.linux.no/linux+v2.6.27/mm/truncate.c <- this code heavily assumes buffer_head 2008-11-17 14:08 19 #include <linux/buffer_head.h> /* grr. try_to_release_page, 2008-11-17 14:08 20 do_invalidatepage */ 2008-11-17 14:10 um... looks like it does not assume that 2008-11-17 14:12 hmm, why did I think that? 2008-11-17 14:12 quick look, I just find buffer_head 2008-11-17 14:16 try_to_release_page calls try_to_free_buffers, which calls page_buffers() which assumes page_private has buffer_heads 2008-11-17 14:17 ah, yes 2008-11-17 14:17 _however_ 2008-11-17 14:17 however, releasepage can be replaced? 2008-11-17 14:17 this can be prevented if the mapping ops define ->releasepage 2008-11-17 14:17 right 2008-11-17 14:18 if not, nobh mode seems not to work 2008-11-17 14:19 so I have not found a problem yet. I want to keep thinking about having page->private point at a vector of handles instead of list of buffers, but we will keep using buffers for now 2008-11-17 14:19 yes 2008-11-17 14:19 of course, I worry about extra code ending up in fs/tux3 due to using an alternative to buffers 2008-11-17 14:19 I hope that is not too much code 2008-11-17 14:20 the reward is: significantly less allocate/free activity for buffer_heads; easier to audit for correctness 2008-11-17 14:20 and reduced cache memory use 2008-11-17 14:21 a few percent 2008-11-17 14:21 yes 2008-11-17 14:22 against that we have probably a few hundred more lines of code 2008-11-17 14:22 probably yes 2008-11-17 14:22 oh, another benefit: everybody would like to write a filesystem without buffer_head, but nobody has except maybe xfs 2008-11-17 14:22 and the xfs usage was not very linux-like 2008-11-17 14:23 so if we go explore that territory, we are helping everybody 2008-11-17 14:23 ok, I am going to try compiling 2008-11-17 14:23 ok 2008-11-17 14:24 I'll go to food shop 2008-11-17 14:24 eating is good :) 2008-11-17 14:47 -!-
pgquiles__(~pgquiles@109.Red-79-153-83.staticIP.rima-tde.net) has joined #tux3 2008-11-17 15:09 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-17 15:33 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-17 15:45 back 2008-11-17 15:54 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-11-17 17:23 back too 2008-11-17 17:23 do you remember what difference is index_t and block_t? 2008-11-17 17:24 the same really 2008-11-17 17:24 ok 2008-11-17 17:24 oh 2008-11-17 17:24 I'll remove index_t 2008-11-17 17:24 on 32 bit arch, index_t is u32 2008-11-17 17:24 better not 2008-11-17 17:25 just remembered 2008-11-17 17:25 on 32 bit arch, index_t is u32 because of the page cache limitation 2008-11-17 17:25 ah, pgoff_t? 2008-11-17 17:26 yes 2008-11-17 17:26 index_t in guess_extent() is not right usage? 2008-11-17 17:28 maybe current index_t is used as block_t 2008-11-17 17:28 um.. but buffer->index is index_t 2008-11-17 17:31 well, anyway, currently index_t is remaining in kernel 2008-11-17 17:32 it doesn't hurt 2008-11-17 17:33 buffer->index? 2008-11-17 17:33 index_t in guess_extent looks right to me 2008-11-17 17:34 buffer->index is index_t when the buffer is for a file address_space 2008-11-17 17:34 i see 2008-11-17 17:34 unsigned ends[2] = { bufindex(buffer), bufindex(buffer) }; 2008-11-17 17:35 I thought bufindex() means b_blocknr 2008-11-17 17:35 right, should be index_t 2008-11-17 17:35 b_blocknr used to be a physical block address 2008-11-17 17:35 yes 2008-11-17 17:35 now, it is always a index into a page cache 2008-11-17 17:35 it seems I misread it 2008-11-17 17:36 by implication, 32 bit kernels can't deal with volumes bigger than 64 TB 2008-11-17 17:36 I don't think we will ever fix that 2008-11-17 17:37 yes 2008-11-17 17:37 except by allowing bigger buffers one day 2008-11-17 17:37 in that case, buffer_head is used like page? 
2008-11-17 17:38 it supports finer granularity than page and has no other reason to exist 2008-11-17 17:38 well, a minor reason is to cache b_blocknr 2008-11-17 17:38 and serve as an interface for fs->get_block 2008-11-17 17:38 yes 2008-11-17 17:39 the get_block interface seems to me to be something that should go away eventually 2008-11-17 17:39 it is a poor interface 2008-11-17 17:39 anyway, I think we will not use it, we should know for sure in a few days 2008-11-17 17:40 buffer->index is physical address on device in some case? 2008-11-17 17:40 yes 2008-11-17 17:40 in userspace 2008-11-17 17:40 yes 2008-11-17 17:40 for anything not in page cache 2008-11-17 17:40 yes 2008-11-17 17:41 that is, btree indexes, inode table blocks... 2008-11-17 17:41 if buffer->index is index_t, it's 32bit? 2008-11-17 17:41 commit blocks 2008-11-17 17:41 64 bit on 64 bit arch 2008-11-17 17:41 ah 2008-11-17 17:41 it can be typedef pgoff_t index_t 2008-11-17 17:42 or we can just edit it to be pgoff_t, but is less readable and misleading in userspace 2008-11-17 17:43 in new_buffer, we pass block_t, and set to buffer->index 2008-11-17 17:44 it confuse me 2008-11-17 17:44 me too 2008-11-17 17:44 should pass in index_t 2008-11-17 17:44 really, index_t and block_t are the same 2008-11-17 17:44 um.. but we have 48bit block address? 2008-11-17 17:45 we do, but we can't use it on 32 bit arch 2008-11-17 17:45 um... even if CONFIG_LBD? 2008-11-17 17:45 I think we need a feature flag in the superblock that says 'this volume exceeds 32 bits' 2008-11-17 17:46 user space can handle it 2008-11-17 17:46 then I thought I should pay attention to kernel restrictions 2008-11-17 17:46 even with CONFIG_LBD 2008-11-17 17:46 because the page cache can't handle it 2008-11-17 17:46 I think page cache can 2008-11-17 17:46 how? 
2008-11-17 17:47 that would be nice 2008-11-17 17:47 it ->index is PAGE_CACHE_SIZE base index 2008-11-17 17:47 sector_t is blocksize base index 2008-11-17 17:47 ah, but PAGE_CACHE_SIZE is always PAGE_SIZE 2008-11-17 17:47 yes 2008-11-17 17:47 there were patches, but never accepted 2008-11-17 17:47 or maybe I missed it 2008-11-17 17:47 and it was accepted? 2008-11-17 17:48 no 2008-11-17 17:48 PAGE_CACHE_SIZE is PAGE_SIZE 2008-11-17 17:48 ->index is PAGE_CACHE_SIZE base, so address space is 32bit + 12bit == 48bit? 2008-11-17 17:49 right. Linus never saw the wisdom of taking a patch for it when there really isn't such a thing as a 32 bit machine operating a volume bigger than 64TB 2008-11-17 17:49 not this year anyway 2008-11-17 17:49 and maybe never 2008-11-17 17:49 44 bit 2008-11-17 17:49 we were both wrong ;) 2008-11-17 17:49 it's only 16TB 2008-11-17 17:49 oops, 44bit 2008-11-17 17:50 but it's actually really hard to find a 32 bit processor these days 2008-11-17 17:50 except embedded 2008-11-17 17:51 I wonder how long before it gets hard to find a spinning disk 2008-11-17 17:51 more than 10 years I think 2008-11-17 17:51 maybe 2008-11-17 17:53 well, in new_block(), balloc() returns block_t, and pass it to blockread() 2008-11-17 17:53 new_block in btree.c 2008-11-17 17:54 well let's choose then, either all index_t or all block_t 2008-11-17 17:55 block_t is fine with me 2008-11-17 17:55 me too 2008-11-17 17:55 and I have patch for it :) 2008-11-17 17:55 ok, where do I pull? ;) 2008-11-17 17:55 not yet ;) 2008-11-17 17:56 I'm tackleing to hide sb->devmap 2008-11-17 17:56 use sb_bread() instead of blockread() in a few places 2008-11-17 17:58 I had added bufmap(buffer) but I didn't put that in for some reason 2008-11-17 17:58 shall we put in bufmap? 
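Editor's worked example of the limit the two arrive at above: a 32 bit page cache index counted in 4K pages (a 12 bit shift) addresses 32 + 12 = 44 bits, which is 16 TB rather than 64 TB. The helper names are made up for this sketch, and the 12 bit shift assumes 4K pages.

```c
#include <assert.h>
#include <stdint.h>

/* Number of address bits reachable with an index of index_bits bits
 * counting units of (1 << unit_shift) bytes. */
static inline unsigned addressable_bits(unsigned index_bits, unsigned unit_shift)
{
	return index_bits + unit_shift;
}

/* Same limit in bytes; only valid while the total stays below 64 bits. */
static inline uint64_t addressable_bytes(unsigned index_bits, unsigned unit_shift)
{
	return (uint64_t)1 << addressable_bits(index_bits, unit_shift);
}
```

For example, addressable_bytes(32, 12) is 1 << 44, that is 16 times 1 << 40, so a 48 bit block address cannot be fully used through a 32 bit page cache index.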
2008-11-17 17:59 well I think I will go work on tux3/super.c and read/decode the superblock 2008-11-17 17:59 sounds good 2008-11-17 18:00 I'm going to work on userspace for kernel more 2008-11-17 18:01 I'm not sure about bufmap yet, after some work, maybe we can see 2008-11-17 19:23 back in an hour or two 2008-11-17 19:59 -!- RazvanM(~RazvanM@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-17 21:04 -!- ajonat(~ajonat@190.48.113.103) has joined #tux3 2008-11-17 21:54 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-17 23:38 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-17 23:49 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-11-18 00:43 -!- pgquiles(~pgquiles@109.Red-79-153-83.staticIP.rima-tde.net) has joined #tux3 2008-11-18 01:03 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-18 01:10 -!- pgquiles__(~pgquiles@109.Red-79-153-83.staticIP.rima-tde.net) has joined #tux3 2008-11-18 01:42 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-11-18 03:28 folks 2008-11-18 04:51 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-18 05:09 -!- pgquiles_(~pgquiles@109.Red-79-153-83.staticIP.rima-tde.net) has joined #tux3 2008-11-18 07:10 -!- pgquiles__(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-18 07:59 -!- pgquiles_(~pgquiles@109.Red-79-153-83.staticIP.rima-tde.net) has joined #tux3 2008-11-18 08:59 -!- pgquiles_(~pgquiles@109.Red-79-153-83.staticIP.rima-tde.net) has joined #tux3 2008-11-18 09:10 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-18 09:46 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-18 12:06 -!- mingming(~mingming@32.97.110.55) has joined #tux3 2008-11-18 13:03 I am looking more into the idea of running tux3 without buffer_heads's 2008-11-18 13:03 struct handles { 2008-11-18 13:03 atomic_t count; 
2008-11-18 13:03 struct page *page; 2008-11-18 13:03 struct wait_queue_head wait; 2008-11-18 13:03 unsigned char state[]; 2008-11-18 13:03 }; 2008-11-18 13:04 this goes in page->private instead of a pointer to a circular list of buffers 2008-11-18 13:04 it provides the state of all blocks on the page 2008-11-18 13:04 which normally will be just one block, so there will be one byte of handles->state 2008-11-18 13:05 but if blocksize is 1K, there will be 4 bytes of handles->state 2008-11-18 13:05 all blocks use the same wait queue 2008-11-18 13:05 using the typical wait loop, that checks a condition then sleeps on a queue 2008-11-18 13:06 we use wake_up_all when there is an io completion on any of the blocks 2008-11-18 13:07 hey 2008-11-18 13:07 then each waiter... normally just one even with 1K blocks... checks the state flags to see if it should continue, if the wake was actually for a different block it just goes back to sleep 2008-11-18 13:08 benefits of this: the struct size is 20 bytes to handle 4 1K blocks, which would be 200 bytes otherwise 2008-11-18 13:08 bigger benefit is, we only have to handle conditions that might arise in our own filesystem, unlike buffer code that tries to handle everything in the world 2008-11-18 13:09 handles are needed on a page whenever a block is dirty 2008-11-18 13:10 or under IO 2008-11-18 13:10 so we in the handles count when any of the blocks becomes dirty, or when a block is submitted for reading 2008-11-18 13:10 s/in/inc/ 2008-11-18 13:11 and dec on IO completion, either read or write 2008-11-18 13:11 when the count goes to zero, we _always_ free the set of handles 2008-11-18 13:11 there is no point in keeping it around when there is no dirty state to record 2008-11-18 13:11 so that is another saving 2008-11-18 13:11 dirty state or IO state 2008-11-18 13:12 one "feature" of buffers we don't bother with is recording physical block number for a file page 2008-11-18 13:13 if we think that caching physical block number is good
for performance, we will do that some other way 2008-11-18 13:13 I don't think caching physical block number is very useful 2008-11-18 13:14 the only time it could save some lookup in a file index is when rewriting a file block, a rare event 2008-11-18 13:14 we will more typically redirect a write to a different block than rewrite the original block 2008-11-18 13:16 bh, so my proposal is to share a wait queue between possibly 4 metadata blocks on a single page 2008-11-18 13:16 I really doubt there will be much contention 2008-11-18 13:16 it's tempting to share the wait queue between even more blocks 2008-11-18 13:19 if they all share one wait queue, about 340 handles structs will fit on one page 2008-11-18 13:19 that would be 1360 blocks at 1K block size 2008-11-18 13:20 all sharing one wait queue 2008-11-18 13:20 I wonder if that would cause thundering herds 2008-11-18 13:20 anyway, even if the wait_queue is only shared by blocks on the same page, it is still very compact 2008-11-18 13:21 200 per page 2008-11-18 13:21 and this structure is only needed to represent pinned metadata and blocks in flight 2008-11-18 13:22 probably never need more than one page of these things 2008-11-18 13:26 let me see, a kernel tree has 26000 files these days 2008-11-18 13:27 if we untarr the whole thing and commit it as one delta... 2008-11-18 13:27 with a 1k block filesystem (just to be silly) 2008-11-18 13:28 that is how many pinned and in flight blocks? 
2008-11-18 13:30 about 1625 inode table blocks 2008-11-18 13:31 about 13,000 file data blocks 2008-11-18 13:32 plus dirent blocks and index blocks say, 16,000 blocks all together 2008-11-18 13:32 that would be 80 blocks needed for block handles 2008-11-18 13:33 80 pages 2008-11-18 13:33 320 k 2008-11-18 13:34 vs about 340 MB of IO 2008-11-18 13:34 that is fine 2008-11-18 13:35 probably not useful to commit an entire kernel untar as a single delta, but if we did, it's nice to know that block handles would not be an issue 2008-11-18 13:43 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-18 14:05 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-18 14:07 flipsout: hard to say 2008-11-18 14:12 what is? 2008-11-18 14:13 oh, the contention 2008-11-18 14:14 contending between 4 physically adjacent blocks is pretty obviously not a problem, between 300 looks like clearly too much contention 2008-11-18 14:15 it would be easy to tune somewhere between those extremes 2008-11-18 14:27 we have lockstat which I wrote the first revision for before peterz rewrote into lockdep 2008-11-18 14:27 so you can try it without guesswork 2008-11-18 14:27 to see if locks are being hit hard or something like that 2008-11-18 14:29 good idea 2008-11-18 14:29 the thing to do is start with the obviously uncontended case, and run it with lockdep to see if that's actually true 2008-11-18 14:31 probably, further reducing the block handle state when it has already been reduced from 200 bytes/page to 20 does not matter very much 2008-11-18 15:15 flipsout: lockdep is pretty low impact on the system 2008-11-18 15:15 so just try it first 2008-11-18 15:16 flipsout: are you working on a kernel port at the moment ? 2008-11-18 15:16 when the bufferless code is ready to try, I will wave it at you for lockdepping, that seems to be right up your alley 2008-11-18 15:16 yes 2008-11-18 15:16 isn't it obvious? 
2008-11-18 15:16 currently obsessing about how much new ground we will break 2008-11-18 15:16 I figured as much 2008-11-18 15:16 how's atomic commits going ? 2008-11-18 15:16 good 2008-11-18 15:17 all under control on paper 2008-11-18 15:17 you'll break it, I'm sure and do good things 2008-11-18 15:17 ACTION has complete faith in flips 2008-11-18 15:17 I am 99% convinced that we are going to go bufferless 2008-11-18 15:17 why ? 2008-11-18 15:17 with our own home grown replacement for buffers 2008-11-18 15:17 yes 2008-11-18 15:17 :) 2008-11-18 15:17 nice 2008-11-18 15:17 buffer code is a giant hairball 2008-11-18 15:17 and a bug farm 2008-11-18 15:18 because the VM system is too primitive for this kind of stuff 2008-11-18 15:18 getting tux3 working reliably with it would be a trial and error process, like for everybody else 2008-11-18 15:18 buffer caches can be treated differently and have things like tags or something to help with online disk checking 2008-11-18 15:18 by concentrating only on page state, plus our own home grown block state, we have a good chance of being able to prove things about object lifetimes etc 2008-11-18 15:19 see the try_to_free_buffers disaster 2008-11-18 15:19 we are going to avoid that 2008-11-18 15:19 nice 2008-11-18 15:19 simple concept: when we add our handles vector to page->private, inc the page count 2008-11-18 15:19 nice 2008-11-18 15:20 when we remove it, dec the page count, which frees it if it hits zero 2008-11-18 15:20 I knew that a file system would eventually press changes into the VM core 2008-11-18 15:20 so then, we concentrate on handling our blocks handle count accurately, which looks easy 2008-11-18 15:20 inc on dirty or initiate read, dec when IO on the block completes 2008-11-18 15:21 when it hits zero, free the handles and dec the page count as above 2008-11-18 15:21 what could be cleaner?
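Editor's sketch of the counting rule just described: attach the handle set and take a page reference on first use, inc the handles count on dirty or on starting a read, dec on IO completion, and when the count hits zero free the set and drop the page reference. This is a userspace illustration only. The struct page mock, plain int counters (atomic_t in the kernel) and the names handles_get and handles_put are all assumptions, and locking is left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Mock page with just the two fields the scheme touches (assumed layout). */
struct page { int count; void *private; };

struct handles {
	int count;		/* blocks that are dirty or under IO */
	struct page *page;
	unsigned char state[4];	/* one state byte per block, up to 4 x 1K blocks */
};

/* Called when a block becomes dirty or is submitted for reading. */
static struct handles *handles_get(struct page *page)
{
	struct handles *handles = page->private;

	if (!handles) {
		handles = calloc(1, sizeof *handles);
		if (!handles)
			return NULL;
		handles->page = page;
		page->private = handles;
		page->count++;	/* page stays pinned while handles are attached */
	}
	handles->count++;
	return handles;
}

/* Called on IO completion, read or write; caller must hold a count from handles_get. */
static void handles_put(struct page *page)
{
	struct handles *handles = page->private;

	if (--handles->count == 0) {
		page->private = NULL;
		page->count--;	/* drop the reference taken at attach time */
		free(handles);
	}
}
```

With this shape there is nothing like try_to_free_buffers to call: the set goes away by itself on the last handles_put, and the page reference guarantees the page outlives its handles.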
2008-11-18 15:21 no try_to_free_buffers madness 2008-11-18 15:21 the handles just free themselves at the right time 2008-11-18 15:21 even if the vm scanning code is fiddling with our pages 2008-11-18 15:22 which is stupid and should be stopped, but that is for later 2008-11-18 15:22 what it does mean is, we have to write replacements for everything like block_write/read/truncate_full_page 2008-11-18 15:23 but I think the replacements will be a fraction of the size of the originals, and possible to audit 2008-11-18 15:23 whereas the originals are insane 2008-11-18 15:23 skate oclock 2008-11-18 15:24 we're not going to do deferred namespace ops at first, that will eliminate some development uncertainty 2008-11-18 15:25 instead, just take the crude lock against the delta setup to provide exclusion between fs operations and disk sync 2008-11-18 15:26 -!- pgquiles__(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-18 15:26 the killer reason for this is: we need to do it the crude way in order to have something to measure against to show how cool the cool way is 2008-11-18 15:26 second killer reason is, we save a few weeks of development to get to working prototype 2008-11-18 15:27 on the other hand, avoiding buffers... 
I'm being a little inconsistent there because there is going to be a week or two extra development to replace the functionality of the block IO library 2008-11-18 15:28 against which we save the hassle of trying to work around the oddities of the block IO library 2008-11-18 15:28 and I bet, we save some bug hunts 2008-11-18 15:28 also, it seems obvious to me we want to banish buffers eventually 2008-11-18 15:28 so this gets work done that would have to be done anyway 2008-11-18 15:29 and most probably, makes our atomic commit implementation much cleaner 2008-11-18 15:29 so time saved there too 2008-11-18 15:32 -!- pgquiles_(~pgquiles@109.Red-79-153-83.staticIP.rima-tde.net) has joined #tux3 2008-11-18 15:45 flipsout: the block wait stuff is so that you can efficiently deal with io on extents ? 2008-11-18 15:46 versus trying to wake a shit load of threads or things blocking against a wait queue 2008-11-18 16:26 Hehehe 2008-11-18 16:26 That was fun 2008-11-18 16:26 Apparently you can do more from signal handlers than I thought 2008-11-18 16:27 As long as you change the pointers where it's supposed to return to :) 2008-11-18 16:40 Anyway, bedtime 2008-11-18 17:01 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-18 17:13 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-18 17:22 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-18 17:48 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-18 18:22 bh, the block wait stuff is so that we can correctly deal with IO and locking on metadata blocks 2008-11-18 18:23 versions having a shitload of buffer_heads hanging around and thousands of lines of dodgy block library code dealing with them 2008-11-18 18:23 s/versions/versus/ 2008-11-18 18:42 why is it wait_queue_t, instead of wait_bit stuff? 2008-11-18 18:43 where?
2008-11-18 18:44 in struct handles 2008-11-18 18:44 you have to have a wait queue somewhere, even with wait_bit 2008-11-18 18:44 unless I missed something 2008-11-18 18:44 yes 2008-11-18 18:44 but it's not per handles 2008-11-18 18:44 url? 2008-11-18 18:45 wait a bit 2008-11-18 18:45 also searching here 2008-11-18 18:46 bushman, were you serious about setting up a lxr? 2008-11-18 18:46 http://lxr.linux.no/linux+v2.6.27.5/kernel/wait.c#L222 2008-11-18 18:46 slowness of lxr.linux.no is a big problem 2008-11-18 18:46 i was, i have a large chunk of it done, the modperl portion is beyond borked tho 2008-11-18 18:46 and i dont know enough about modperl 2008-11-18 18:46 but shap does ;) 2008-11-18 18:47 let's see what bit_waitqueue is 2008-11-18 18:47 bushman, modperl is what defeated me 2008-11-18 18:47 that and the postgresql stuff 2008-11-18 18:48 postgres part works 2008-11-18 18:48 there is one section where they explain how to do it without modperl, just regular cgi 2008-11-18 18:49 hirofumi, so waitqueue_bit provides one hashed set of waitqueues for everybody who wants to wait on any kind of bit? 2008-11-18 18:49 or just on page bits? 2008-11-18 18:49 bit_waitqueue I mean 2008-11-18 18:50 I think shared global hash 2008-11-18 18:50 if it works, fine 2008-11-18 18:51 get rid of the wait queue 2008-11-18 18:51 make the block handles smaller by 8 bytes each 2008-11-18 18:51 24bytes? 2008-11-18 18:51 even 2008-11-18 18:52 let me see, wait queue is a list_head and a spinlock, 12 bytes 2008-11-18 18:52 on 64 bits... 24 bytes 2008-11-18 18:52 I think it should work, because buffer_head is already using it 2008-11-18 18:53 you're right, I never noticed buffer_head doesn't have its own wait queue 2008-11-18 18:53 so let's use the bit_wait stuff 2008-11-18 18:53 however, maybe we may tweak "unsigned char state[]" 2008-11-18 18:54 it may require "unsigned long" 2008-11-18 18:54 it might 2008-11-18 18:54 does bit wait require that?
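Editor's sketch of the "shared global hash" idea above: instead of embedding a wait queue in every handle set, waiters on any state word pick one queue out of a small fixed table by hashing the word's address together with the index being waited on. This is not the kernel's bit_waitqueue() code. The table size, the multiplicative hash constant and the name wait_bucket are assumptions chosen for illustration, and only the bucket selection is shown, not the sleeping and waking.

```c
#include <assert.h>
#include <stdint.h>

#define WAIT_TABLE_BITS 8			/* assumed: 256 shared wait queues */
#define WAIT_TABLE_SIZE (1u << WAIT_TABLE_BITS)

/* Map (address of a state word, index within that word) to a bucket number. */
static inline unsigned wait_bucket(const void *word, unsigned index)
{
	/* Combine address and index into one key; 3 bits cover 8 blocks per page. */
	uint64_t key = ((uint64_t)(uintptr_t)word << 3) | (index & 7);

	/* Multiplicative hash, keeping the top WAIT_TABLE_BITS bits. */
	return (unsigned)((key * 0x9E3779B97F4A7C15ULL) >> (64 - WAIT_TABLE_BITS));
}
```

Two unrelated blocks may land in the same bucket, so a woken waiter still has to recheck its own state and go back to sleep if the wake was meant for someone else, exactly as with the per-page queue discussed earlier.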
2008-11-18 18:54 also, we don't want to wait on bits really 2008-11-18 18:54 we want to wait on enumerated state 2008-11-18 18:55 that is, on a byte, probably most naturally 2008-11-18 18:55 it would be quite trivial to do our own hash waitqueue wait-on-byte-state 2008-11-18 18:56 we have to check 8bits? 2008-11-18 18:56 that would be nicest 2008-11-18 18:56 buffer states are actually scalar, not bits 2008-11-18 18:56 traditional linux use of bit states has always been kind of wrong 2008-11-18 18:57 empty, clean, dirty 2008-11-18 18:57 those aren't bits, they're scalar states 2008-11-18 18:57 lock is not enough for wait? 2008-11-18 18:57 lock -> a lock bit 2008-11-18 18:57 maybe 2008-11-18 18:58 maybe we are only ever waiting on one thing: io completion 2008-11-18 18:58 yes, "lock bit" is used for it 2008-11-18 18:59 on a word 2008-11-18 18:59 yes 2008-11-18 19:00 and we have to encode an offset to the beginning of the blocks struct in that word too 2008-11-18 19:00 that's just a few bits 2008-11-18 19:01 so with one block per page and 32 bit arch the struct would be 12 bytes 2008-11-18 19:02 or with four blocks/page, it would be 24 bytes 2008-11-18 19:02 as opposed to 192 bytes for ring of buffers as now 2008-11-18 19:02 struct handles { 2008-11-18 19:02 atomic_t count; 2008-11-18 19:02 struct page *page; 2008-11-18 19:03 unsigned long lock; 2008-11-18 19:03 unsigned char state[]; 2008-11-18 19:03 }; 2008-11-18 19:03 ? 2008-11-18 19:03 the state is the lock 2008-11-18 19:03 struct handles { 2008-11-18 19:03 atomic_t count; 2008-11-18 19:03 struct page *page; 2008-11-18 19:03 unsigned state[]; 2008-11-18 19:03 }; 2008-11-18 19:04 i see 2008-11-18 19:04 I think that works 2008-11-18 19:04 need to actually write the state transition code 2008-11-18 19:05 it's necessary to be able to wait on individual blocks 2008-11-18 19:05 for example in btree splitting 2008-11-18 19:07 unsigned long lock can wait individual blocks?
2008-11-18 19:07 it has to be able to wait on them separately, yes 2008-11-18 19:08 you can have completely unrelated blocks on the same page 2008-11-18 19:08 so one block might be for one btree, and another for a different btree 2008-11-18 19:08 this could be a real problem if the locks are tied together 2008-11-18 19:09 unsigned lock has at least 32 bits, so it can cover 32 individual blocks per page? 2008-11-18 19:10 you mean, use the unsigned long lock as a bit array of blocks 2008-11-18 19:10 yes 2008-11-18 19:10 in practice we only need up to 8 2008-11-18 19:10 so that is fine 2008-11-18 19:10 we also need a handle, a single address that is a handle for a block 2008-11-18 19:11 if our handles struct was a power of two in side, we could use the low 3 bits as the block index 2008-11-18 19:12 but that is really hard to arrange across all arches 2008-11-18 19:12 s/side/size/ 2008-11-18 19:13 so the only workable idea I have come up with is, have the handle point at something that encodes the offset to the beginning of the struct 2008-11-18 19:13 is the padding used for another purpose?
2008-11-18 19:13 we could pad to power of two, yes 2008-11-18 19:14 that would fix it 2008-11-18 19:14 and we might be able to encode both buffer state and lock bit in one 32 bit word 2008-11-18 19:14 3 bits of state and one of lock 2008-11-18 19:14 x 8 = 32 bits 2008-11-18 19:15 i see 2008-11-18 19:15 that is 24 bytes / struct handles 2008-11-18 19:15 um 2008-11-18 19:15 sorry 2008-11-18 19:15 12 bytes 2008-11-18 19:16 round out to 16 with a pad word 2008-11-18 19:16 it's very small :) 2008-11-18 19:16 good 2008-11-18 19:16 good enough 2008-11-18 19:16 I think slab will also be happy with it 2008-11-18 19:16 very happy 2008-11-18 19:17 struct handles { 2008-11-18 19:17 atomic_t count; 2008-11-18 19:17 struct page *page; 2008-11-18 19:17 unsigned state, pad; 2008-11-18 19:17 }; 2008-11-18 19:18 -!- RazvanM(~RazvanM@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-18 19:18 and I'm sure people will come up with yet better ways 2008-11-18 19:19 however if we wrap it nicely this will be easy to change 2008-11-18 19:19 it doesn't matter though, it may need "unsigned long state;" on 64bit 2008-11-18 19:19 yes 2008-11-18 19:19 and "unsigned long state, pad;" on 32bit 2008-11-18 19:20 I'm not sure whether wait_bit requires it or not 2008-11-18 19:20 I think that pad is needed on 64 bit too 2008-11-18 19:21 do we use ->pad for another purpose? 2008-11-18 19:21 only to make the structure be a power of two in size 2008-11-18 19:22 ah, 32bytes on 64bit cpu? 2008-11-18 19:22 there are two ways I know of to encode the block index in the handle: 1) make the handle set be a power of two in size and use the low bits of the handle address for the block index 2) encode the offset to the beginning of the handle set at the address pointed to by the handle 2008-11-18 19:23 yes, 32 bytes on 64 arch 2008-11-18 19:24 handle set size as power of two is attractive and simple 2008-11-18 19:24 and cache line efficient 2008-11-18 19:24 what's "block index" meaning here?
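The "3 bits of state and one of lock, x 8 = 32 bits" packing discussed above could look roughly like the sketch below. It is only an illustration under assumptions: the state names, macro names and helper functions are hypothetical (the chat only mentions empty, clean and dirty as scalar states), and the helpers are plain, non-atomic bit arithmetic, whereas real kernel code would need atomic updates and a wait mechanism.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar block states; the chat names empty, clean, dirty. */
enum block_state { BLOCK_EMPTY, BLOCK_CLEAN, BLOCK_DIRTY };

#define BLOCKS_PER_PAGE 8   /* "in practice we only need up to 8" */
#define FIELD_BITS      4   /* 3 state bits + 1 lock bit per block */
#define STATE_MASK      0x7u
#define LOCK_BIT        0x8u

/* Read the 3-bit state of block i from the packed word. */
static unsigned get_state(uint32_t word, unsigned i)
{
    assert(i < BLOCKS_PER_PAGE);
    return (word >> (i * FIELD_BITS)) & STATE_MASK;
}

/* Return a copy of word with block i's state replaced; lock bit untouched. */
static uint32_t set_state(uint32_t word, unsigned i, unsigned state)
{
    assert(i < BLOCKS_PER_PAGE && state <= STATE_MASK);
    unsigned shift = i * FIELD_BITS;
    return (word & ~((uint32_t)STATE_MASK << shift)) | ((uint32_t)state << shift);
}

/* Test the lock bit of block i. */
static int is_locked(uint32_t word, unsigned i)
{
    assert(i < BLOCKS_PER_PAGE);
    return (word >> (i * FIELD_BITS + 3)) & 1u;
}

/* Return a copy of word with block i's lock bit set or cleared. */
static uint32_t set_lock(uint32_t word, unsigned i, int locked)
{
    assert(i < BLOCKS_PER_PAGE);
    uint32_t bit = (uint32_t)LOCK_BIT << (i * FIELD_BITS);
    return locked ? (word | bit) : (word & ~bit);
}
```

Each block gets its own 4-bit field, so locking one block does not disturb an unrelated block that happens to share the page, which is the requirement raised for btree splitting.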
2008-11-18 19:25 something that tells us which of the possibly 8 different blocks on a page the handle is for 2008-11-18 19:26 so we write something like wait_on_block(handle); 2008-11-18 19:26 wait_on_block(handle, block_index)? 2008-11-18 19:27 I don't see how that would work 2008-11-18 19:27 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-18 19:27 we need to encode the block index in the handle 2008-11-18 19:27 handle is "struct handles"? 2008-11-18 19:28 it refers to one of the blocks controlled by struct handles 2008-11-18 19:28 ah, i see 2008-11-18 19:29 handle may have pointer to struct handles? 2008-11-18 19:29 yes 2008-11-18 19:29 i see 2008-11-18 19:30 page points a struct handles 2008-11-18 19:30 page->private? 2008-11-18 19:30 yes 2008-11-18 19:30 i see 2008-11-18 19:30 maybe wait_on_block(handle, block_index) will work 2008-11-18 19:30 need to think about it 2008-11-18 19:31 well 2008-11-18 19:31 normally it won't be a wiat_on_block, that is only for synchronous waiting 2008-11-18 19:31 it's more common that we will have "do something when all these events have completed" 2008-11-18 19:32 for readblock, yes it's just wait_on_block 2008-11-18 19:33 if (!uptodate(block)) { readblock(block); } 2008-11-18 19:33 yes 2008-11-18 19:33 btw, we have to rename ext2_* to something 2008-11-18 19:33 for the dir ops? 2008-11-18 19:34 because it actually collides doesn't it 2008-11-18 19:34 tux2_ ;) 2008-11-18 19:34 oh, 2? 2008-11-18 19:34 it's not ext2 any more because the dir fields are big endian 2008-11-18 19:34 ah 2008-11-18 19:34 tux2... kind of a joke 2008-11-18 19:34 it's certainly not the dir format we want 2008-11-18 19:35 n**2 performance 2008-11-18 19:35 yes 2008-11-18 19:35 currently, in kernel can't fail in link phase 2008-11-18 19:35 dir2_* ? 
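Option 1 from the discussion (a handle set that is a power of two in size, with the block index carried in the low bits of the handle address) might be sketched as follows. This is a userspace illustration, not the tux3 code: `handle_t`, `make_handle`, `handle_set` and `handle_index` are hypothetical names, `int` and `void *` stand in for `atomic_t` and `struct page *`, and the 32-byte size is an assumption forced here with C11 `alignas`.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define HANDLES_SIZE 32u   /* assumed power-of-two size and alignment */
#define INDEX_MASK   7u    /* low 3 bits: up to 8 blocks per page */

struct handles {
    alignas(HANDLES_SIZE) int count;  /* stand-in for atomic_t */
    void *page;                       /* stand-in for struct page * */
    unsigned state, pad;
};

/* Hypothetical handle type: address of the set with the block index in the low bits. */
typedef uintptr_t handle_t;

static handle_t make_handle(struct handles *set, unsigned index)
{
    /* The trick only works if the set is aligned to its power-of-two size. */
    assert(((uintptr_t)set & (HANDLES_SIZE - 1)) == 0);
    assert(index <= INDEX_MASK);
    return (uintptr_t)set | index;
}

/* Recover the enclosing handle set by masking off the low bits. */
static struct handles *handle_set(handle_t h)
{
    return (struct handles *)(h & ~(uintptr_t)(HANDLES_SIZE - 1));
}

/* Recover which block on the page this handle refers to. */
static unsigned handle_index(handle_t h)
{
    return (unsigned)(h & INDEX_MASK);
}
```

With this encoding a call such as `wait_on_block(handle)` needs only the one argument, at the cost of requiring the padding and alignment to hold on every architecture.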
2008-11-18 19:35 so, for now, I'll change ext2_* to tux_* 2008-11-18 19:35 ok 2008-11-18 19:36 fine 2008-11-18 19:36 short 2008-11-18 19:36 sensible :) 2008-11-18 19:36 and rest of work is sb stuff 2008-11-18 19:36 I started on that last night 2008-11-18 19:36 oh 2008-11-18 19:37 you started on it too? 2008-11-18 19:37 no 2008-11-18 19:37 load_sb, save_sb, dump_sb 2008-11-18 19:37 I just checked it what's need 2008-11-18 19:37 sb_bread requires super_block 2008-11-18 19:37 ah 2008-11-18 19:38 also things like store_attrs 2008-11-18 19:38 some things from tux3/inode.c 2008-11-18 19:38 I think we are talking difference things 2008-11-18 19:38 yes 2008-11-18 19:38 yes 2008-11-18 19:39 tux_sb() stuff for me 2008-11-18 19:39 right 2008-11-18 19:40 so, I was thinking name for it 2008-11-18 19:40 "struct sb" to "struct super_block", then use tux_sb() 2008-11-18 19:40 and I need to deal with store_attrs, make_inode, open_inode, save_inod, load_sb, save_sb 2008-11-18 19:40 or other 2008-11-18 19:40 struct sb is most commonly the tux3 sb 2008-11-18 19:41 yes, it's big rename 2008-11-18 19:41 so, I'm not doing it yet 2008-11-18 19:41 so for vfs superblock, struct super_block 2008-11-18 19:41 for tux superblock, struct sb 2008-11-18 19:42 maybe the issue is, superblock can point sb, but sb doesn't 2008-11-18 19:42 they better point at each other 2008-11-18 19:42 they're supposed to 2008-11-18 19:43 it's really sad we didn't do sb the same way as inode 2008-11-18 19:43 disagreement between me and viro over that 2008-11-18 19:43 ah 2008-11-18 19:43 anyway, since they don't have a container_of relationship, they have to point at each other 2008-11-18 19:43 it's not viro 2008-11-18 19:44 it was in fact 2008-11-18 19:44 do we introduce tux_superblock()? 2008-11-18 19:45 struct sb *tux_sb(struct super_block *super) 2008-11-18 19:45 yes 2008-11-18 19:45 that prototype is ok with you? 
2008-11-18 19:45 yes 2008-11-18 19:46 I think we have already 2008-11-18 19:46 then we have sb->super 2008-11-18 19:46 so a new field in tux sb that points at the vfs super_block 2008-11-18 19:46 ok? 2008-11-18 19:47 or we could write a wrapper 2008-11-18 19:47 then we need tux_superblock(struct sb) or something? 2008-11-18 19:47 fine either way 2008-11-18 19:47 we don't actualyl need a wrapper for sb->super 2008-11-18 19:47 why? 2008-11-18 19:47 in userspace, we just make that field point at itself 2008-11-18 19:47 well 2008-11-18 19:47 ah 2008-11-18 19:47 that would require a bunch of initialization 2008-11-18 19:47 so a wrapper is easier 2008-11-18 19:48 another issue is 2008-11-18 19:48 struct super_block *vfs_super(struct sb *sb) 2008-11-18 19:48 doesn't really work well in user space 2008-11-18 19:49 unless we #define super_block sb 2008-11-18 19:49 ahem 2008-11-18 19:49 in user space: struct sb *vfs_super(struct sb *sb) 2008-11-18 19:49 better 2008-11-18 19:49 much 2008-11-18 19:50 another issue is 2008-11-18 19:50 we use "struct sb" everywhere 2008-11-18 19:50 functions argument, etc. 2008-11-18 19:51 ah, we have pointer to super_block 2008-11-18 19:51 yes 2008-11-18 19:51 ok, I'll try what happen 2008-11-18 19:51 and super_block is needed only rarely 2008-11-18 19:51 yes 2008-11-18 19:51 so there should be only a few vfs_super(sb) 2008-11-18 19:51 yes, I hope so 2008-11-18 19:52 I'm pretty sure 2008-11-18 19:52 things like sb_bread 2008-11-18 19:52 I'm going to post the proposal for bufferless operation now 2008-11-18 19:53 to lkml, or tux3? 
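The two accessors agreed on above, `tux_sb()` and `vfs_super()`, can be sketched with mock types as below. Assumptions are stated loudly: the `struct super_block` here is a one-field mock (the real VFS structure does have an `s_fs_info` pointer for filesystem private data, but many other fields as well), the `blockbits` member of `struct sb` is a placeholder, and storing the tux3 sb in `s_fs_info` is a guess at how the two would be linked, not something the chat confirms.

```c
#include <assert.h>
#include <stddef.h>

struct sb;

/* Mock of the VFS superblock; only the private-data pointer is modelled. */
struct super_block {
    void *s_fs_info;
};

/* Mock of the tux3 superblock with the proposed back pointer. */
struct sb {
    struct super_block *super;  /* new field: points at the vfs super_block */
    unsigned blockbits;         /* placeholder member, for illustration only */
};

/* vfs super_block -> tux3 sb (prototype as given in the chat). */
static struct sb *tux_sb(struct super_block *super)
{
    return super->s_fs_info;
}

/* tux3 sb -> vfs super_block; a wrapper so callers never touch sb->super. */
static struct super_block *vfs_super(struct sb *sb)
{
    return sb->super;
}
```

Because the two structures have no container_of relationship, each has to hold a pointer to the other. In the userspace build described in the chat, `vfs_super()` would instead take and return `struct sb *`, so the wrapper hides the difference from callers.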
2008-11-18 19:53 tux3, and later bring it up on lkml 2008-11-18 19:53 i see 2008-11-18 19:53 I think I will send akpm a little patch to remove a small buffer dependency from truncate.c 2008-11-18 19:53 and open the discussion that way 2008-11-18 19:54 sounds good 2008-11-18 19:54 I don't think anybody else has written a bufferless fs 2008-11-18 19:54 at least not a disk backed one 2008-11-18 19:54 so that would be a valuable contribution in general 2008-11-18 19:54 -!- ajonat(~ajonat@190.48.123.99) has joined #tux3 2008-11-18 19:55 cool 2008-11-18 19:57 -!- RalucaM(~ral@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-18 19:57 hi 2008-11-18 20:00 hey 2008-11-18 20:00 so what's the topic tonight? 2008-11-18 20:00 try_to_free_buffers hairball??? 2008-11-18 20:11 oops 2008-11-18 20:11 11 minutes late ;) 2008-11-18 20:11 ok 2008-11-18 20:11 let's go do try_to_free_buffers 2008-11-18 20:12 the slimy underbelly of core kernel 2008-11-18 20:13 ACTION waits for lxr while it spins and spins 2008-11-18 20:15 ok, done 2008-11-18 20:15 oh, tux3 u 2008-11-18 20:15 it's that time 2008-11-18 20:15 I hope you will find the try_to_free_buffers tour interesting 2008-11-18 20:15 most probably you have looked at it 2008-11-18 20:16 seems it was firefox that was spinning 2008-11-18 20:17 not lxr 2008-11-18 20:17 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L3114 2008-11-18 20:17 this is part of the vm page reclaim mechanism 2008-11-18 20:17 vm does shrink_caches in vmscan.c 2008-11-18 20:18 tries to evict pages 2008-11-18 20:18 pages used for metadata by various filesystems will be sitting in the buffer or page cache with rings of buffers on them 2008-11-18 20:18 oh, we're starting? 
cool 2008-11-18 20:19 there may be IO in flight that was initiated via one of those buffers 2008-11-18 20:19 the page can't be reclaimed if so 2008-11-18 20:20 so try_to_free_buffers tries to determine whether there is activity on any of the buffers attached to the page 2008-11-18 20:20 to do so, it has to know about every lock in creation 2008-11-18 20:21 so it can take the right locks when it looks at the buffer states, and that buffer state will stay stable long enough to remove the buffers from the page 2008-11-18 20:21 there is no way to ensure this is always possible 2008-11-18 20:22 http://lxr.linux.no/linux+v2.6.27/mm/vmscan.c#L341 2008-11-18 20:22 the call from vmscan.c 2008-11-18 20:22 one of them 2008-11-18 20:22 a rare path 2008-11-18 20:24 here, if the page is dirty, vmscan tries to call the filesystem to write it out 2008-11-18 20:24 that will always be a bad idea with tux3 2008-11-18 20:24 with tux3, if a page is dirty it is always either a) pinned or b) going to be written out in a delta commit soon 2008-11-18 20:25 either way, it is useless for vmscan to stick it's nose in and get involved 2008-11-18 20:26 or we trigger to flush deltas? 2008-11-18 20:26 that's what I meant 2008-11-18 20:26 ah 2008-11-18 20:27 so... one of the things we want to fix after we have basic code running is, we want to go get a core kernel interface that says "don't ever do anything to our dirty pages" 2008-11-18 20:27 this is probably true of all journalling filesystems 2008-11-18 20:27 at least for metadata 2008-11-18 20:28 for now, I think we are just going to exit from ->writepage 2008-11-18 20:28 if the vfs calls us 2008-11-18 20:28 doesn't the kernel assume your pages contain buffers? 
2008-11-18 20:28 and maybe increment a counter to see how anxious the vm is getting 2008-11-18 20:28 we need to read that very closely, but no 2008-11-18 20:29 akpm tried to make it so that actual buffers are never assumed 2008-11-18 20:29 lets check that here 2008-11-18 20:29 we have 2008-11-18 20:29 340 if (PagePrivate(page)) { 2008-11-18 20:29 341 if (try_to_free_buffers(page)) { 2008-11-18 20:30 which looks like a smoking gun 2008-11-18 20:30 however, it is controlled by 335 if (!mapping) { 2008-11-18 20:31 so any page in our filesystem will have a mapping, right? 2008-11-18 20:31 obviously it will have to ;-) 2008-11-18 20:31 the !mapping case is some bizarre corner case, involving some other filesystem 2008-11-18 20:32 337 * Some data journaling orphaned pages can have 2008-11-18 20:32 338 * page->mapping == NULL while being dirty with clean buffers. 2008-11-18 20:32 339 */ 2008-11-18 20:32 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L3124 2008-11-18 20:32 in try_to_free_buffers 2008-11-18 20:32 rechecks mapping == NULL 2008-11-18 20:32 so it might? still get called with mapping != NULL 2008-11-18 20:33 if without it, in past kernel was crashed 2008-11-18 20:33 pretty scary, hmm? 2008-11-18 20:33 interesting 2008-11-18 20:33 is the mapping parameter to pageout 2008-11-18 20:33 truth is, nobody really knows all the corner cases in here 2008-11-18 20:33 page->mapping or something else? 2008-11-18 20:34 I can't remember the detail of it 2008-11-18 20:34 so, orphaned pages 2008-11-18 20:34 [I don't buy this, there should frickin' not be any corner cases... I mean for g'sake... this gets called all the time... argh.] 
2008-11-18 20:34 caused by truncating 2008-11-18 20:35 maze, and it's been this way for twenty years now 2008-11-18 20:35 keep repeating this to yourself "everybody else sucks even worse" 2008-11-18 20:35 ;-) 2008-11-18 20:35 let's keep poking around in here 2008-11-18 20:35 not quite twenty years 2008-11-18 20:35 I spent all last night debugging the ath9k wireless driver... don't tell me 2008-11-18 20:36 17 years 2008-11-18 20:36 still underage 2008-11-18 20:36 yes, has zits 2008-11-18 20:36 this is one 2008-11-18 20:36 how are we going to do it better in tux3? 2008-11-18 20:37 I very much want try_to_free_buffers never to be called on a tux3 page 2008-11-18 20:37 register our own 'try_to_free_some_memory()' function? 2008-11-18 20:37 so... if we have page->private set to struct handles, we will always have mapping set as well 2008-11-18 20:37 is there one of those? 2008-11-18 20:37 there should be 2008-11-18 20:38 if there isn't, there should be ;-) 2008-11-18 20:38 we should provide it 2008-11-18 20:38 but first we need a filesystem that understands it 2008-11-18 20:38 and there are only two possibilities: hack an existing one, or write one 2008-11-18 20:38 all existing filesystems, you will find are heavily dependent on these kinky facts about buffers etc 2008-11-18 20:39 we're on the right track then ;-) 2008-11-18 20:39 and not easily able to respond to a request like "evict some pages" 2008-11-18 20:39 here's the problem 2008-11-18 20:39 the vm knows which pages are at the cold end of the lru 2008-11-18 20:39 the filesystem knows which pages it can write out 2008-11-18 20:40 the vm has no interface for communicating the former to the latter 2008-11-18 20:40 so perhaps remove pages that can't be dropped from the vm's lru? 2008-11-18 20:40 so it tries to write pages out itself, without having any idea whether a page is pinned or not 2008-11-18 20:40 that's right 2008-11-18 20:40 only have clean pages ever in the vm's jurisdiction? 
2008-11-18 20:40 tux3 pages should never be on the vm lru list at all 2008-11-18 20:40 it was drop_buffers() (de7d5a3b6c9ff8429bf046c36b56d3192b75c3da) 2008-11-18 20:40 correct 2008-11-18 20:41 so that's where we want to go 2008-11-18 20:41 the only sensible thing to do 2008-11-18 20:41 but we need to be able to demonstrate efficacy 2008-11-18 20:42 and that means having a filesystem to demonstrate it on 2008-11-18 20:42 let's look at some more callers of try_to_free_buffers 2008-11-18 20:43 ->releasepage() handlers 2008-11-18 20:43 grow_dev_page() in fs/buffer.c 2008-11-18 20:43 the more common cases 2008-11-18 20:43 __mpage_writepage in fs/mpage.c 2008-11-18 20:43 unmap_and_move in mm/migrate.c 2008-11-18 20:43 try_to_release_page in mm/filemap.c 2008-11-18 20:44 and jbd stuff 2008-11-18 20:44 grow_dev_pages gets called when doing sb_bread on the blockdev mapping, right hirofumi? 2008-11-18 20:44 iirc, yes 2008-11-18 20:44 if buffer_head is not available 2008-11-18 20:44 lxr doesn't find it 2008-11-18 20:45 __getblk_slow -> grow_buffers -> grow_dev_page 2008-11-18 20:45 yes 2008-11-18 20:46 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L1119 < __getblk_slow 2008-11-18 20:46 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L1082 2008-11-18 20:46 :) 2008-11-18 20:47 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L1028 2008-11-18 20:47 grow_dev_page 2008-11-18 20:47 yes 2008-11-18 20:47 what saves us from having our page->private abused here? 
2008-11-18 20:47 short of avoiding sb_bread 2008-11-18 20:47 well 2008-11-18 20:48 avoid sb_bread I think 2008-11-18 20:48 which of course assumes buffer_head 2008-11-18 20:48 since it returns one 2008-11-18 20:48 so here we use our bio skills and submit the block IO ourself 2008-11-18 20:48 :-) 2008-11-18 20:48 code we already have thanks to a little burst of hacking from maze and I a couple of months ago 2008-11-18 20:48 which explains the smile I think 2008-11-18 20:49 ;-) 2008-11-18 20:49 let's look for more ->releasepages 2008-11-18 20:49 oops 2008-11-18 20:49 oh, that's right 2008-11-18 20:50 looking for any assumptions about buffer_heads in page->private 2008-11-18 20:50 bearing in mind that akpm has been in here years ago hunting/killing such things 2008-11-18 20:50 they'd have to be hidden pretty well to escape 2008-11-18 20:51 hirofumi, so far as I know, nobody has really taken advantage of this abstracting of page->private yet 2008-11-18 20:51 possible exception of tmpfs 2008-11-18 20:51 fsblock? 2008-11-18 20:51 that would qualify as "not yet" 2008-11-18 20:51 and I have still not read it 2008-11-18 20:52 but I instinctively react with fear at the idea of trying to add a core kernel solution to this 2008-11-18 20:52 without ever having tried/proved a specific filesystem solution 2008-11-18 20:52 and if PagePrivate() is not set, we use ->private as another perpose... 2008-11-18 20:52 where is that? 2008-11-18 20:53 can't remember, wait a bit ... 2008-11-18 20:53 meanwhile, the only user of releasepage outside a specific filesystem is in filemap.c 2008-11-18 20:53 set_page_private() is 2008-11-18 20:54 set_page_order() 2008-11-18 20:54 http://lxr.linux.no/linux+v2.6.27/mm/filemap.c#L2688 <- releasepage in filemap.c 2008-11-18 20:54 try_to_release_page 2008-11-18 20:54 how encouraging does that sound? 2008-11-18 20:54 so a quick question here... all file data doesn't go through buffers and goes direct to pages, right? 
so buffer only get used for metadata, directories, bitmaps, etc...? 2008-11-18 20:55 buffers sometimes get used on pages too 2008-11-18 20:55 when/why? 2008-11-18 20:55 due to falling back to block io library functions like block_read_full_page 2008-11-18 20:55 which assumes/uses buffers 2008-11-18 20:55 back 2008-11-18 20:56 this is done for example in mpage.c 2008-11-18 20:56 where the multipage algorithms give up and don't try to handle some corner cases 2008-11-18 20:56 but we use some sort off one-of-buffer_head then, don't wee? 2008-11-18 20:56 we as in tux3? 2008-11-18 20:56 or we as in try to make this big mess work? 2008-11-18 20:57 no as in those cases where we (other fs's) fall back to buffer-io for file data 2008-11-18 20:57 I remember there was some code were we were 'simulating' the existance of buffers to make the buffer_io functions usable 2008-11-18 20:57 not actually 2008-11-18 20:58 block_read/write_full_page will put buffers on a page that doesn't have them 2008-11-18 20:58 so it's not a buffer on the stack 2008-11-18 20:58 where they will be found later by nosy vm scan threads 2008-11-18 20:59 oh, so we actually potentially allocate extra memory in the block_read/write_full_page calls for the buffer structs? 2008-11-18 20:59 yes 2008-11-18 20:59 ugh 2008-11-18 20:59 and since those paths are rare these days 2008-11-18 20:59 the corner cases are exercised rarely 2008-11-18 20:59 create_empty_bufferes() does it 2008-11-18 21:00 it requires podigious skills to track down the problems when they happen 2008-11-18 21:00 one more such problem and Linus will pop a vein ;) 2008-11-18 21:00 last one was too hard even for akpm 2008-11-18 21:01 we're past the breaking point of complexity on this issue 2008-11-18 21:01 how was the last one fixed? 
2008-11-18 21:01 it's an epic tale 2008-11-18 21:01 google "bug hunt" 2008-11-18 21:01 hmm 2008-11-18 21:02 need another keyword 2008-11-18 21:03 I'll find a link later 2008-11-18 21:03 cool 2008-11-18 21:03 ok, when is ->releasepage useful for a tux3 filesystem? 2008-11-18 21:04 I think: never 2008-11-18 21:04 because we are going to remove our handles struct as soon as its job is finished 2008-11-18 21:04 that is, as soon as IO completes 2008-11-18 21:05 (we're talking about mapping->releasepage I assume) 2008-11-18 21:05 we place it on a page 1) when we start to read a block on the page or 2) when we dirty a block on the page 2008-11-18 21:05 the our page->private is always either busy, or not there 2008-11-18 21:05 hence ->releasepage is useless 2008-11-18 21:05 yes 2008-11-18 21:06 (mapping->a_ops->releasepage to be exact) 2008-11-18 21:06 what the vm does to try to free up some cache memory when it's exhausted 2008-11-18 21:06 works only on stupid filesystems 2008-11-18 21:06 or filesystems that like to leave buffer heads hanging around 2008-11-18 21:06 my guess is it works for non-jouirnaled fs 2008-11-18 21:06 were you can flush data in any order 2008-11-18 21:10 this is done in the possibly misguided believe that it is good to cache the physical block in the buffer b_blocknr field 2008-11-18 21:10 to reduce filesystem get_block calls 2008-11-18 21:10 so one thing we want to demonstrate is that we do not need generic caching of physical block numbers 2008-11-18 21:10 this being the only argument for leaving buffer heads wrapped onto pages 2008-11-18 21:10 hmm 2008-11-18 21:10 now mpage doesn't cache physical block numbers does it 2008-11-18 21:10 been a couple weeks since we looked at it 2008-11-18 21:10 I do not think it adds buffers to pages 2008-11-18 21:10 notice: before vmscan can try_to_free_buffers, it has to lock a page 2008-11-18 21:10 on a multi cpu machine, even trylocks for that speculative free have to hurt 2008-11-18 21:11 maze, correct 
2008-11-18 21:11 but why are we optimizing for dumb filesystems? 2008-11-18 21:11 when nobody uses them for anything serious? 2008-11-18 21:11 (except for one large and possibly misguided search engine company ;) 2008-11-18 21:12 now, now, now... 2008-11-18 21:12 if handles still used after io complition, what happen? 2008-11-18 21:12 if handles is still used 2008-11-18 21:16 e.g. we're trying to truncate ditry pages 2008-11-18 21:17 it is 2008-11-18 21:17 602 * possible for a page to have PageDirty set, but it is actually 2008-11-18 21:17 603 * clean (all its buffers are clean). 2008-11-18 21:17 so, the sum of buffer dirty state overrides the page dirty bit 2008-11-18 21:17 according to this 2008-11-18 21:17 block_read_full_page checks for all buffers clean I think, and clears the page dirty bit if so 2008-11-18 21:17 but apparently not every user of buffers does this 2008-11-18 21:17 I wonder why we don't add a printk warning there and hunt them all down 2008-11-18 21:17 ok, I think we did more than an hour 2008-11-18 21:17 on this one exciting topic 2008-11-18 21:18 hopefully, I managed to communicate some sense of how much we want to simplify this situation 2008-11-18 21:18 609 * Rarely, pages can have buffers and no ->mapping. These are 2008-11-18 21:18 610 * the pages which were not successfully invalidated in 2008-11-18 21:18 611 * truncate_complete_page(). 2008-11-18 21:18 on thursday we will look at truncate 2008-11-18 21:19 ok 2008-11-18 21:19 we will strip off our handles after all block IO has completed on a page 2008-11-18 21:19 but questions about truncation still remain 2008-11-18 21:20 yes, I was searching a such race 2008-11-18 21:20 so... homework: please read truncate.c before thursday 2008-11-18 21:20 :) 2008-11-18 21:20 it's a starting point for truncate 2008-11-18 21:20 the scariest operation in the kernel 2008-11-18 21:21 because of interactions with mmap? 
2008-11-18 21:21 oh 2008-11-18 21:21 just to set the mood: nothing prevents simultaneous truncation and extending write 2008-11-18 21:22 as in no locks? 2008-11-18 21:22 right 2008-11-18 21:22 eh, ->i_mutex? 2008-11-18 21:22 why? is it that performance critical? 2008-11-18 21:22 hirofumi, doesn't exclude write 2008-11-18 21:23 I'm not sure why this is good, it certainly makes life interesting 2008-11-18 21:23 ah.. 2008-11-18 21:23 probably because having write take i_mutex is just too expensive 2008-11-18 21:48 http://lxr.linux.no/linux+v2.6.27.6/lib/swiotlb.c#L88 2008-11-18 21:48 disk corruption at it's finest 2008-11-18 21:49 this is the softiommu you were talking about day before yesterday? 2008-11-18 21:50 something like that 2008-11-18 21:51 does in fact use bounce buffers 2008-11-18 21:55 yup 2008-11-18 21:56 yup, but my reference is to the 2008-11-18 21:56 if you run out of space, use fallback buffer 2008-11-18 21:56 a) the fallback buffer is 32KB and gets returned regardless of the size of the mapping you ask for 2008-11-18 21:57 b) there's no locking so multiple dma xfrs (in both directions) can be using it at the same time -> priceless 2008-11-18 21:58 end result: do some network xfr and disk io at the same time, and you'll end up xfr your disk and writing 'tcpdump' to your hdd 2008-11-18 21:59 so anyway, that easily explains the disk corruption 2008-11-18 22:16 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-18 22:26 whoops 2008-11-18 22:26 got a patch ready, maze? 2008-11-18 22:27 for? 2008-11-18 22:27 the sw-iommu? 2008-11-18 22:27 still parsing it 2008-11-18 22:27 yes 2008-11-18 22:27 thinking about dropping the fallback buffer in it's entirety 2008-11-18 22:27 see little point for it 2008-11-18 22:27 there's two solutions I can see 2008-11-18 22:28 a) drop it 2008-11-18 22:28 b) lock it on map and unlock on unmap 2008-11-18 22:28 88 * When the IOMMU overflows we return a fallback buffer. This sets the size. 
2008-11-18 22:28 89 */ 2008-11-18 22:28 90static unsigned long io_tlb_overflow = 32*1024; 2008-11-18 22:28 either using locking primitives (in which case the system could potentially hang), or on a panic-on-lock-of-locked, in which case we'll get a panic message immediately 2008-11-18 22:28 no locking, ugh 2008-11-18 22:29 how'd that slip by? 2008-11-18 22:29 no idea 2008-11-18 22:29 http://bugzilla.kernel.org/show_bug.cgi?id=11811 2008-11-18 22:29 panic on recursive lock sounds useful 2008-11-18 22:29 to demonstrate it can happen 2008-11-18 22:30 the thing is 2008-11-18 22:30 theoretically, if we just sleep and wait for the lock to be released, the system might function just fine 2008-11-18 22:30 problem is, there's no guarantee that will ever actually happen, since releasing the previous mapping may require getting a new one, which would then block 2008-11-18 22:32 the only benefit of the fallback buffer seems to be dma of <=32KB via pci_dma_map/unmap_single (which will use it) as opposed to the scatter-gather interface (which will never use the fallback buffer, and instead return an error to the driver) 2008-11-18 22:32 still... 
seems like just getting rid of it would be best 2008-11-18 22:32 if we can exhaust 64M of primary bounce buffers, than the last 32KB won't save us 2008-11-18 22:33 unless scatter-gather actually regularly burns through 64M, in which case we should see such corruption issues more often 2008-11-18 22:41 looks like this is a long standing bug 2008-11-18 22:41 found another guy seeing disk corruption in this manner 2008-11-18 22:41 http://fixunix.com/kernel/525473-re-2-6-25-dma-out-sw-iommu-space-asus-m2n32-amd-8gb-memory.html 2008-11-18 23:07 building 2008-11-19 01:01 -!- pgquiles__(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-19 08:14 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-19 08:36 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-19 09:51 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-19 10:37 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-19 11:53 -!- pgquiles_(~pgquiles@109.Red-79-153-83.staticIP.rima-tde.net) has joined #tux3 2008-11-19 12:18 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-11-19 12:35 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-19 12:44 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-19 13:10 hirofumi, there? 2008-11-19 14:08 hey flips 2008-11-19 14:08 hi 2008-11-19 14:54 hi 2008-11-19 14:58 hirorumi: (true/false?) If we use the block IO library then all our metadata transfers will be one block per bio 2008-11-19 14:58 I need time to think, wait a minute... 2008-11-19 14:59 block IO library means buffer_head stuff or bio stuff? 2008-11-19 15:03 If we use buffer_head, I think it is not one bio 2008-11-19 15:04 e.g. 
ll_rw_block() pass buffer_head to elevator as one bio 2008-11-19 15:04 then elevator will merge bio if possible 2008-11-19 15:04 merge requests 2008-11-19 15:05 not bio 2008-11-19 15:05 and each bio->bi_endio will be called separately 2008-11-19 15:05 once per buffer 2008-11-19 15:05 ah, yes 2008-11-19 15:05 yes 2008-11-19 15:09 -!- ajonat(~ajonat@190.48.123.99) has joined #tux3 2008-11-19 15:28 hirofumi, so this has to be bad for metadata intensive stuff like find 2008-11-19 15:28 or grep 2008-11-19 15:28 over lots of small files 2008-11-19 15:29 maybe 2008-11-19 15:29 could confirm by running some tests on ramfs 2008-11-19 15:29 oh 2008-11-19 15:30 some result is available? 2008-11-19 15:30 I'll try some tests some time 2008-11-19 15:31 get an idea how much cpu ext3/4 are eating when not limited by disk bandwidth 2008-11-19 15:31 great 2008-11-19 15:32 btw, we uses tux_ and tux3_ prefix 2008-11-19 15:33 which do you like? 2008-11-19 15:33 I'd like to hear before it is increased more 2008-11-19 15:33 tux_ 2008-11-19 15:33 it's shorter 2008-11-19 15:34 ok 2008-11-19 15:34 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-19 15:50 -!- pgquiles__(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-19 16:11 -!- pgquiles_(~pgquiles@109.Red-79-153-83.staticIP.rima-tde.net) has joined #tux3 2008-11-19 16:15 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-19 17:37 hirofumi, I couldn't actually find a valid reason why we need to try to represent handle pointer + block offset in one pointer, so your (struct handles *handles, unsigned i) looks correct 2008-11-19 17:37 in other words, no reason to be clever there 2008-11-19 17:37 we would have to add another parameter everywhere we currently use struct buffer * 2008-11-19 17:38 sounds good 2008-11-19 17:39 btw, where do we have to add? searching .... 
2008-11-19 17:40 many places 2008-11-19 17:40 like btree 2008-11-19 17:40 and now that I'm thinking of all those places... 2008-11-19 17:40 handle_t is starting to sound good again 2008-11-19 17:41 there's no killer reason to use it, but a lot of little reasons 2008-11-19 17:41 i see 2008-11-19 17:41 If performance is good, it's killer reason? 2008-11-19 17:41 I think so 2008-11-19 17:42 block handles instead of buffers looks like a pretty big win 2008-11-19 17:42 both efficiency, and more importantly, provability 2008-11-19 17:42 and the cache memory saving is significant 2008-11-19 17:44 yes. btw, what is provability? 2008-11-19 17:44 ability to show that the code is correct 2008-11-19 17:44 very difficult with buffers 2008-11-19 17:44 because the code is so crufty 2008-11-19 17:44 and spread out over many files 2008-11-19 17:44 i see 2008-11-19 17:45 we will make our buffer.c or handles.c, and probably all handles stuff is in it? 2008-11-19 17:45 we will define our object lifetime much more tightly 2008-11-19 17:46 basic operations, and the usage will be in two or three of our filesystem files 2008-11-19 17:46 sounds very good 2008-11-19 17:46 don't have to go outside the filesystem to prove things 2008-11-19 17:46 mostly 2008-11-19 17:47 lifetime of a handle covers only two things: IO in progress and pinned dirty metadata 2008-11-19 17:47 you can call it, dirty state and empty state 2008-11-19 17:47 yes 2008-11-19 17:48 are you tackling it recently? 
2008-11-19 17:48 yes 2008-11-19 17:48 wrote some code 2008-11-19 17:48 will post it soon 2008-11-19 17:48 it just seems to be a cleanup I really want to do, early 2008-11-19 17:49 good 2008-11-19 17:49 another big advantage is not having to have and endio per block 2008-11-19 17:50 we will submit big bios covering say, 64 metadata blocks 2008-11-19 17:50 i see 2008-11-19 17:50 and endio walks through the biovecs, using the page_size and page_offset to determine number and index of metadata blocks 2008-11-19 17:50 in fact 2008-11-19 17:50 we can mix data and metadata in one bio 2008-11-19 17:51 because only the metadata will have non-null page->private 2008-11-19 17:51 yes 2008-11-19 17:51 but, I'm not sure about some small things 2008-11-19 17:52 difference of buffer cache and page cache 2008-11-19 17:52 we can completely ignore, or not 2008-11-19 17:52 we can 2008-11-19 17:52 except for that one thing you pointed out last week 2008-11-19 17:52 ->host is not our sb for buffer cache 2008-11-19 17:53 ah 2008-11-19 17:53 the reason we _have_ to have block handles is, things like btree node splitting 2008-11-19 17:53 and later, btree index 2008-11-19 17:53 note: dirent blocks are in file page cache, but they are metadata 2008-11-19 17:53 they need handles 2008-11-19 17:54 yes 2008-11-19 17:54 um.. 2008-11-19 17:54 same with bitmaps, atom tables... 2008-11-19 17:55 we need handles for all? 2008-11-19 17:55 everything except file data, which only needs bio 2008-11-19 17:55 that is, user visible file data 2008-11-19 17:55 even if blocksize != page size? 2008-11-19 17:56 I think so 2008-11-19 17:56 i see 2008-11-19 17:56 might find something funny with truncate 2008-11-19 17:56 we're going to dig into that tomorrow 2008-11-19 17:56 i see 2008-11-19 17:57 well, I'd like to prove that in actual fs, i.e. 
on tux3 2008-11-19 17:57 let me see, the final page in a file, or a apge at either side of a hole can be only partially valid 2008-11-19 17:57 yes 2008-11-19 17:57 I think I'm close to disproving myself ;) 2008-11-19 17:58 :) 2008-11-19 17:58 btw, do you have any request to me? 2008-11-19 17:58 a hole is the best example 2008-11-19 17:59 thinking 2008-11-19 17:59 I should put aside my current patch in fs/tux3 and see how close your latest is to running 2008-11-19 18:00 and yes 2008-11-19 18:00 I'll post that patch to the mailing list 2008-11-19 18:00 in its broken state right now 2008-11-19 18:00 ok 2008-11-19 18:01 I was reading jbd code today 2008-11-19 18:01 oh, much complex stuff 2008-11-19 18:01 that is where a lot of the ext3 block synchronization code and comments are 2008-11-19 18:01 well we have similar problems 2008-11-19 18:01 probably, yes 2008-11-19 18:09 GFP_NOFS is a really interesting issue 2008-11-19 18:09 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-19 18:09 it's a broken idea, actually 2008-11-19 18:09 oh 2008-11-19 18:09 if there is dirty cache that can be flushed out to _some other_ fs, then NOFS prevents that 2008-11-19 18:10 NOFS is only a valid concept when there is just one fs mounted 2008-11-19 18:10 one filebacked fs 2008-11-19 18:11 ah, we have to tell it. yes 2008-11-19 18:11 what fs is actuall users of allocate 2008-11-19 18:12 good point, it avoids a lot memory-allocating code in filesystems when memory is tight 2008-11-19 18:12 but these days, a block device is almost as likely to allocate memory as a filesystem 2008-11-19 18:13 um... inode and dcache? 
2008-11-19 18:13 they use a lot of memory, but so do encryption and compression 2008-11-19 18:14 and fancy raid 2008-11-19 18:14 ah, sounds like really big memory 2008-11-19 18:14 those block devices deadlock if you push them hard 2008-11-19 18:15 we solved that in ddsnap by limiting the total traffic that could be in flight 2008-11-19 18:15 ah, good 2008-11-19 18:15 I remember lvm also did some hack for it 2008-11-19 18:16 which still deadlocks as of the last time we tested it 2008-11-19 18:16 oh 2008-11-19 18:16 lvm... years back had problems with deadlock in the user interface 2008-11-19 18:16 eventually solved by taking PF_MEMALLOC in dm-ioctl.c 2008-11-19 18:17 i see 2008-11-19 18:17 dm devices like snapshot lock up easily, still 2008-11-19 18:18 heh 2008-11-19 18:18 ZFS has got a reputation for locking up, you have to fiddle with the memory configuration "until it works" 2008-11-19 18:18 ok, back to NOFS 2008-11-19 18:19 we're really trying to say 'don't recurse' 2008-11-19 18:19 ok, I'm searching actual user of __GFP_FS 2008-11-19 18:19 yes 2008-11-19 18:20 we're also saying "somebody else take care of scanning for free pages please" 2008-11-19 18:20 yes 2008-11-19 18:22 I'm thinking that NOFS doesn't really do harm... as long as vmscan gets woken up 2008-11-19 18:23 it used to be, there would be a lot of dirty pages in the lru with mapped buffers that the vm could send straight to the block device 2008-11-19 18:23 so the difference between NOFS and NOIO actually meant something 2008-11-19 18:23 today they are effectively the same 2008-11-19 18:25 probably, difference is swap? 
2008-11-19 18:25 ah yes 2008-11-19 18:26 looks like so 2008-11-19 18:27 and interesting experment: set both GFP_USER and GFP_KERNEL to __NOFS and see if that affects performance 2008-11-19 18:28 easy experiment 2008-11-19 18:28 in other words, how much does it cost to have kswapd do all the work, and is kswapd always woken up when it needs to be 2008-11-19 18:29 and, are there costs that could be saved by avoiding having many tasks scanning in parallel, contending locks and polluting cache 2008-11-19 18:29 anyway 2008-11-19 18:29 different topic 2008-11-19 18:30 I'm copying in your latest fs/tux3 files now 2008-11-19 18:30 ok 2008-11-19 18:31 btw, I noticed we still add some patch to user/tux3 for inode.c/super.c 2008-11-19 18:32 ? 2008-11-19 18:32 we will add 2008-11-19 18:32 yes, I could not think of a common place to put that common code 2008-11-19 18:33 we could have sb.c, for load/save sb 2008-11-19 18:33 the inode code... 2008-11-19 18:33 put it in filemap.c I think 2008-11-19 18:34 it all inode table block handling 2008-11-19 18:34 ah 2008-11-19 18:34 or maybe itable.c 2008-11-19 18:34 it meant more simple thing 2008-11-19 18:35 we need some #define TUX_BITMAP_IO 0 2008-11-19 18:35 stuff 2008-11-19 18:36 and add more "extern" stuff to tux3.h 2008-11-19 18:36 well, more common files sounds good 2008-11-19 18:37 yes 2008-11-19 18:38 I was going to put some c source includes in the kernel to be lazy, somebody needs to stop me from doing that ;-) 2008-11-19 18:38 it's time to start maintaining proper global definitions 2008-11-19 18:39 :) oh, which file? 
2008-11-19 18:39 I was going to include filemap.c in super.c 2008-11-19 18:39 dumb idea 2008-11-19 18:40 kernel devs will scream 2008-11-19 18:40 ah 2008-11-19 18:42 ok, I'll cleanup global define stuff 2008-11-19 19:01 unsigned get_handle_state(handle_t handle) 2008-11-19 19:01 { 2008-11-19 19:01 struct handles *handles = (void *)(handle & ~7); 2008-11-19 19:01 unsigned shift = (handle & 7) << 2; 2008-11-19 19:01 return (handles->map >> shift) & 7; 2008-11-19 19:01 } 2008-11-19 19:01 void set_handle_state(handle_t handle, unsigned state) 2008-11-19 19:01 { 2008-11-19 19:01 struct handles *handles = (void *)(handle & ~7); 2008-11-19 19:01 unsigned shift = (handle & 7) << 2; 2008-11-19 19:01 handles->map = (handles->map & ~(7 << shift)) | (state << shift); 2008-11-19 19:01 } 2008-11-19 19:01 will always be under handle_lock 2008-11-19 19:02 but still, I wonder if some memory barrier is needed 2008-11-19 19:02 if there is spin_lock(handle_lock), it should already be including? 2008-11-19 19:03 the handle_lock will be a long life lock like a buffer lock 2008-11-19 19:03 and... 2008-11-19 19:03 I'm wondering if that is the right model 2008-11-19 19:04 ah 2008-11-19 19:04 but it's true, handle_lock isn't sufficient 2008-11-19 19:04 needs a spinlock if they are going to be scalar fields 2008-11-19 19:04 or else we got to bits, as traditional 2008-11-19 19:06 probably, I need a lot of thinking for it with full sources... 2008-11-19 19:06 I think memory barrier for buffer_head stuff is also complex 2008-11-19 19:35 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-19 20:02 compare and swap looks like a way to do it 2008-11-19 20:45 ok, I see how to do the state variable synchronization efficiently with cmpxchg, and initially we can just be lazy and protect it with a spinlock 2008-11-19 20:47 ah, state 2008-11-19 20:49 test_and_set_bit()? 2008-11-19 20:49 or we have to set multiple bits? 
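[editor's sketch] The cmpxchg idea above can be illustrated in portable user space C, assuming one scalar state per block packed as 4-bit fields in a single word, as in the get_handle_state/set_handle_state paste. The names blockmap_sketch, get_state and set_state are hypothetical, the state names are only an assumed encoding, and C11 atomic_compare_exchange_weak stands in for the kernel's cmpxchg. This is not the tux3 code, just one way the retry loop might look:

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumed encoding: one scalar state per block, not independent bits. */
enum block_state { BLOCK_EMPTY, BLOCK_CLEAN, BLOCK_DIRTY0, BLOCK_DIRTY1 };

/* Hypothetical per-page map: eight 4-bit fields packed into one word. */
struct blockmap_sketch {
	_Atomic unsigned map;
};

/* Read the state of block i on the page. */
static unsigned get_state(struct blockmap_sketch *m, unsigned i)
{
	assert(i < 8);
	return (atomic_load(&m->map) >> (i << 2)) & 7;
}

/*
 * Atomically set block i to the given state and return the state it had
 * before. If another thread changes the word in between, the compare and
 * swap fails, reloads the current value into old, and the loop retries.
 */
static unsigned set_state(struct blockmap_sketch *m, unsigned i, unsigned state)
{
	assert(i < 8 && state <= 7);
	unsigned shift = i << 2;
	unsigned old = atomic_load(&m->map);
	unsigned new;

	do {
		new = (old & ~(7u << shift)) | (state << shift);
	} while (!atomic_compare_exchange_weak(&m->map, &old, new));

	return (old >> shift) & 7;
}
```

Because the old state comes back from the same atomic operation, a caller can tell whether it was the one that made the block dirty without taking a separate lock. The unused fourth bit of each field is left free in this sketch.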
2008-11-19 21:03 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-19 21:37 -!- RazvanM(~RazvanM@pool-96-244-34-153.bltmmd.east.verizon.net) has joined #tux3 2008-11-19 22:23 flips: handle_state? 2008-11-19 22:35 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-20 01:14 hirofumi, block handle state is scalar: empty, clean, dirty0, dirty1 2008-11-20 01:14 not bits 2008-11-20 01:14 the bits are not independent 2008-11-20 02:50 ah, ok 2008-11-20 07:22 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-20 07:49 -!- pgquiles(~pgquiles@109.Red-79-153-83.staticIP.rima-tde.net) has joined #tux3 2008-11-20 08:36 -!- pgquiles_(~pgquiles@62.43.226.52) has joined #tux3 2008-11-20 08:41 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-20 09:05 flips: ping 2008-11-20 09:07 flips: you should comment on this: http://blog.karlitschek.de/2008/11/my-perfect-desktop-part-1-documents.html 2008-11-20 09:07 versioning, sync with other machines, etc are accomplished by tux3 :-) 2008-11-20 09:36 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-20 09:53 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-20 10:05 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-20 10:13 -!- pgquiles__(~pgquiles@109.Red-79-153-83.staticIP.rima-tde.net) has joined #tux3 2008-11-20 11:49 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-20 12:39 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-20 14:17 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-20 14:22 -!- ChanServ changed mode/#tux3 -> -o flips 2008-11-20 15:27 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-20 15:33 re block handles 2008-11-20 15:34 I think there are two different things being locked: block data and block state 
2008-11-20 15:34 one lock is not enough for both 2008-11-20 15:34 and for block state, a lock isn't necessarily the right model anyway 2008-11-20 15:36 state transitions are the right model, empty to clean, empty to dirty, clean to dirty, dirty to clean 2008-11-20 15:50 yes 2008-11-20 15:51 one is buffer data 2008-11-20 15:51 and one is state 2008-11-20 15:51 state is serialized by cmpxchg, and buffer data is block bit (wait_bit)? 2008-11-20 15:52 s/block bit/lock bit/ 2008-11-20 15:52 ok, here's a cool detail: when we want to partially rewrite a block, according to the model above we need two things: the block has to be clean or dirty and we need the block lock 2008-11-20 15:52 the cool thing is, we can get both in one atomic op 2008-11-20 15:52 cmpxchg 2008-11-20 15:53 I'll provide details how later 2008-11-20 15:53 ok 2008-11-20 15:53 one situation where we need to do this is, in parallel access to dirents, which is the model we will use for now before doing the deferred nameops 2008-11-20 15:54 on yesterday, I was thinking barrier for data sync with userland 2008-11-20 15:54 -!- ajonat(~ajonat@190.48.123.99) has joined #tux3 2008-11-20 15:54 but it was for handles->state 2008-11-20 15:55 I now call the handles struct a "struct blockmap" 2008-11-20 15:55 we will use the library cmpxchg, which I think is implemented for all arches 2008-11-20 15:55 too bad we used the stupid intel name for that 2008-11-20 15:56 computer science calls it compare and swap 2008-11-20 15:56 yes, in past it was actually x86 only, iirc 2008-11-20 15:57 blockmap, um... 2008-11-20 15:57 then it proved too useful for just x86 2008-11-20 15:57 well, the name can change 2008-11-20 15:57 it's the logic and design that are hard ;) 2008-11-20 15:57 I like buffer or something 2008-11-20 15:57 buffermap? 2008-11-20 15:57 bufmap? 2008-11-20 15:57 um... 
2008-11-20 15:58 a clean break with buffer ops is needed I think 2008-11-20 15:58 no confusion 2008-11-20 15:58 I have convinced myself that we are going to save time getting to 100% stable with this interface 2008-11-20 15:58 ah, i see 2008-11-20 15:59 however I thought it replacement of buffer 2008-11-20 15:59 it is 2008-11-20 15:59 but we don't want to leave any confusion about whether we actually access buffers or not 2008-11-20 15:59 the "buffer" is actually part of a page 2008-11-20 16:00 subpagemap ;) 2008-11-20 16:00 oh 2008-11-20 16:00 but, it may good idea 2008-11-20 16:01 it's looking pretty clean and it definitely saves memory 2008-11-20 16:01 I think it will save bugs too 2008-11-20 16:01 and a modest amount of cpu 2008-11-20 16:01 sk8 oclock 2008-11-20 16:02 ok. subpagemap or something may good idea 2008-11-20 16:02 well, enjoy 2008-11-20 16:03 structure name for it 2008-11-20 16:13 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-20 16:16 what did I think about blockmap 2008-11-20 16:16 I saw it is block mapping - mapping block number to another block number like partition layer 2008-11-20 17:52 hirofumi, I agree, that is confusing 2008-11-20 17:52 so it won't be struct blockmap 2008-11-20 17:52 struct blockbits is better 2008-11-20 17:53 still not very good ;) 2008-11-20 17:56 struct blocks is confusing too, because it sounds like the block data might be separate from the page 2008-11-20 17:58 maybe. although I thought about blocks too 2008-11-20 17:58 block_states? 
2008-11-20 17:59 well, personally I use "block" as block number though 2008-11-20 17:59 "block" - physical block number 2008-11-20 18:00 "iblock" - logical block number 2008-11-20 18:00 logical block number means a some sort of page->index in here 2008-11-20 18:01 "This new api lets our filesystem operate on subpage blocks like the 2008-11-20 18:01 traditional page->buffers (now page->private) but with 1/3 the cache 2008-11-20 18:01 memory use and exactly the state transitions needed by the filesystem. 2008-11-20 18:01 Object lifetime is precisely controlled so that no kernel system outside 2008-11-20 18:01 our filesystem has to guess what to do with a page that has non-null 2008-11-20 18:01 page->private" 2008-11-20 18:02 let's not worry too much about the name of the struct ;) 2008-11-20 18:02 very easy to change, struct block_whatever won't be seen outside of the helper functions 2008-11-20 18:02 ok :) 2008-11-20 18:02 it's handle_t that will be seen everywhere 2008-11-20 18:02 now you can worry about that name ;) 2008-11-20 18:04 we will have an unsigned biovec_to_handles(handles[8]) function 2008-11-20 18:04 or get_biovec_handles 2008-11-20 18:05 that uses the bi_offset and bi_length fields to return hands 2008-11-20 18:05 handles 2008-11-20 18:05 it means biovec to handle_t? 
2008-11-20 18:05 yes 2008-11-20 18:05 can return up to eight handles 2008-11-20 18:05 all on the same page 2008-11-20 18:05 well 2008-11-20 18:06 the object is to update the block states in endio 2008-11-20 18:06 ah, i see 2008-11-20 18:06 probably a slightly better way to do this, but that's the idea 2008-11-20 18:07 filemap.c will set up those biovecs 2008-11-20 18:07 using a list of handles that were dirtied in that phase 2008-11-20 18:07 probably yes 2008-11-20 18:07 we will keep our dirty handle list on a page I think, instead of making it a linked list 2008-11-20 18:08 so the rule is simple: in set_block_dirty, if the block was not already dirty, it gets added to the per-delta dirty block list 2008-11-20 18:08 yes 2008-11-20 18:09 that means set_block_dirty should return the previous state of the block 2008-11-20 18:09 so cmpxchg works really well for this 2008-11-20 18:09 what's for? 2008-11-20 18:09 ah, ok 2008-11-20 18:09 to set the new dirty state atomically and return the old state at the same time 2008-11-20 18:10 another nice thing: taking the block lock can be done in the same cmpxchg op 2008-11-20 18:10 and only do something for newly dirty block 2008-11-20 18:11 so we can have a set_dirty_and_lock_and_is_it_the_first_dirty operation 2008-11-20 18:11 with a single atomic instruction 2008-11-20 18:11 yes, newly dirty means "add to the delta dirty block list" 2008-11-20 18:11 yes 2008-11-20 18:12 also, sometimes we need to know "was it in the current delta when dirtied" 2008-11-20 18:12 so we can fork 2008-11-20 18:12 if not 2008-11-20 18:12 ah, yes 2008-11-20 18:14 we can also pick up bugs like if the delta setup tries to dirty a block that is dirty in a newer delta 2008-11-20 18:15 one slight annoyance: every time somebody wants to read the data on the block, they have to take the block lock 2008-11-20 18:15 I think that is the same for buffers 2008-11-20 18:16 it is an atomic operation for every block peek at the block data 2008-11-20 18:16 later... 
that might show on profiles 2008-11-20 18:16 not to worry for a few months though 2008-11-20 18:17 or copy stable data to another 2008-11-20 18:17 s/every block peek/every peek/ 2008-11-20 18:17 yes 2008-11-20 18:17 well, we would want actuall working fs in kernel 2008-11-20 18:18 right 2008-11-20 18:18 something more immediate... we want "crabbing" type locks for btrees 2008-11-20 18:18 where you hold a lock on the parent, take a lock on a child, and release the parent lock 2008-11-20 18:18 sorry, what's "crabbing"? 2008-11-20 18:18 (just explained I hope) 2008-11-20 18:19 it's a well known locking strategy for trees 2008-11-20 18:19 far better than a per-tree lock 2008-11-20 18:19 something like "crappy"? 2008-11-20 18:19 like a crab from the sea 2008-11-20 18:19 walks sides, moving one set of legs at a time 2008-11-20 18:20 sideways 2008-11-20 18:20 it's kind of a stupid name ;) 2008-11-20 18:20 i see 2008-11-20 18:20 anyway it's the common term 2008-11-20 18:20 http://books.google.com/books?id=S_yHERPRZScC&pg=PA867&lpg=PA867&dq=crabbing+b-tree&source=bl&ots=JJmyQRJvAl&sig=y4uFzidtcJE9m-kNKqZLv5-dUvA&hl=en&sa=X&oi=book_result&resnum=1&ct=result 2008-11-20 18:21 oh, fast 2008-11-20 18:21 we can do crabbing down the btree with our block locks 2008-11-20 18:21 ah, maybe I have this book in japanese 2008-11-20 18:21 just as we could do it with a lock_buffer 2008-11-20 18:21 :) 2008-11-20 18:22 but... 
I am thinking of an improved variant 2008-11-20 18:22 that takes a shared read lock on the parent 2008-11-20 18:22 now I wonder if there is any way to do that with our new block state 2008-11-20 18:23 something to think about 2008-11-20 18:23 yes, probably db needs it, but maybe there is fs can more good one 2008-11-20 18:23 is not important immediately, crabbing with exclusive locks is not too bad 2008-11-20 18:24 yes 2008-11-20 18:24 crabbing will get rid of most contention on our inode table btree 2008-11-20 18:24 i see 2008-11-20 18:25 ok, my goal is to post a prototype of this that runs in kernel tonight 2008-11-20 18:26 then we can see if it handles all our use cases 2008-11-20 18:26 ok 2008-11-20 18:26 and if we like it, convert user space to a matching api (though locking doesn't actually have to lock anything there) 2008-11-20 18:27 ok 2008-11-20 18:27 we already shelled most of the functions involved 2008-11-20 18:27 that was smart :) 2008-11-20 18:27 :) I'll try to make working fs in kernel for now 2008-11-20 18:27 good luck! 
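[editor's sketch] The "crabbing" descent described above (hold the parent lock, take the child lock, then release the parent) can be shown with plain pthread mutexes in user space. The node layout and the name crab_to_leaf are hypothetical and have nothing to do with the actual tux3 btree format. Only the exclusive-lock variant is shown, not the shared read lock variant being considered:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical in-memory btree node; a leaf has child[0] == NULL. */
struct node {
	pthread_mutex_t lock;
	unsigned count;        /* number of keys in use */
	int keys[3];
	struct node *child[4]; /* count + 1 children in an interior node */
};

/*
 * Walk from the root to the leaf that covers key. At most two locks are
 * held at any moment: the child is locked before the parent is released,
 * so no other walker can slip in between the two steps.
 * Returns the leaf with its lock held; the caller must unlock it.
 */
static struct node *crab_to_leaf(struct node *root, int key)
{
	struct node *node = root;

	pthread_mutex_lock(&node->lock);
	while (node->child[0]) {
		unsigned i = 0;

		while (i < node->count && key >= node->keys[i])
			i++;

		struct node *next = node->child[i];
		assert(next);
		pthread_mutex_lock(&next->lock);   /* grab the child first */
		pthread_mutex_unlock(&node->lock); /* then let go of the parent */
		node = next;
	}
	return node;
}
```

Compared with one lock per tree, two walkers only contend while they are on the same node, which is the reason given above for expecting less contention on the inode table btree.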
2008-11-20 18:28 I'll join you in that soon 2008-11-20 18:28 thanks :) 2008-11-20 18:28 and it's about time maze and shapor show some more of their skillz ;) 2008-11-20 18:28 good :) 2008-11-20 18:28 mlankhorst is a very smart dude too 2008-11-20 18:28 though mostly works on video stuff 2008-11-20 18:29 oh, video 2008-11-20 18:29 heh 2008-11-20 18:29 if you're interested, talk to him 2008-11-20 18:29 no 2008-11-20 18:29 he does really deep hacks 2008-11-20 18:29 probably, it's not most unfamilir stuff for me 2008-11-20 18:29 ;) 2008-11-20 18:30 probably, it's most unfamilir stuff for me 2008-11-20 18:30 it's way different all right 2008-11-20 18:30 hardware registers require a lot of patience 2008-11-20 18:31 I only tackled it in bios or near 2008-11-20 18:32 3d and more complex video stuff is hard to understand for me 2008-11-20 18:39 one hard thing at a time 2008-11-20 18:39 filesystems are at least as hard 2008-11-20 18:40 ok, here's a big point: somebody who wants to change the state of the block does _not_ have to take the block lock 2008-11-20 18:40 because state transitions themselves will be atomic with the help of cmpxchg 2008-11-20 18:41 probably, yes 2008-11-20 18:41 the two that happen in endio are empty -> clean and dirty -> clean, it would be impossible to take the block lock here 2008-11-20 18:41 and we don't want to have it held across an entire write operation 2008-11-20 18:41 because it's legal to read the block in that period 2008-11-20 18:42 we do need to lock for read I think 2008-11-20 18:42 beccause the block is invalid until the read is complete 2008-11-20 18:42 um... dirty -> clean should have block lock? 
2008-11-20 18:42 it doesn't need it, I think 2008-11-20 18:42 please check my reasoning 2008-11-20 18:43 ok 2008-11-20 18:43 changing of state is with the help of cmpxchg, which locks the bus 2008-11-20 18:43 it does another important thing 2008-11-20 18:43 it doesn't change the state if the state was already changed by somebody else 2008-11-20 18:43 on top of that 2008-11-20 18:44 only one single event, the bio endio is allowed to do the dirty -> clean transition 2008-11-20 18:44 ah, dirty -> clean happend in endio 2008-11-20 18:44 always 2008-11-20 18:45 I thought it is before submit_bio 2008-11-20 18:45 in truncate, the transition is dirty->empty 2008-11-20 18:45 that is for pages 2008-11-20 18:45 it is not allowed for our block the change while it is under IO 2008-11-20 18:45 that is allowed only for data pages 2008-11-20 18:45 for our bloock to change I meant 2008-11-20 18:46 i see, but it is before submit_bio 2008-11-20 18:46 so we don't need a block writeback state 2008-11-20 18:46 just like buffers don't have that now 2008-11-20 18:46 wait a bit... 2008-11-20 18:46 we can leave the buffer in dirty state until the bio finishes 2008-11-20 18:47 the block I mean 2008-11-20 18:48 another _very_ good thing about doing this subsytem: it forces us to actually understand the life cycle of traditional buffers 2008-11-20 18:48 http://lxr.linux.no/linux+v2.6.27.5/fs/buffer.c#L2997 2008-11-20 18:48 http://lxr.linux.no/linux+v2.6.27.5/fs/buffer.c#L3010 2008-11-20 18:49 test_clear_buffer_dirty(bh)? 
2008-11-20 18:49 that's trying to handle the case where somebody can asynchronously dirty the buffer 2008-11-20 18:49 like when a page dirty bit is pushed down into the buffers 2008-11-20 18:50 metadata blocks don't have that 2008-11-20 18:50 it's possible we might see that on the boundary page at i_size in a data file 2008-11-20 18:50 need to think about that 2008-11-20 18:51 I think that code relies on lock_buffer to keep everybody away from the buffer when the dirty bit is cleared 2008-11-20 18:52 http://lxr.linux.no/linux+v2.6.27.5/fs/mpage.c#L469 2008-11-20 18:53 i see 2008-11-20 18:53 well, I'm not sure 2008-11-20 18:53 mpage assumes buffers meaning we can't use it 2008-11-20 18:54 not connected with the above chat 2008-11-20 18:54 just observing that 2008-11-20 18:55 need to look at the code that pushes dirty bits down into the page buffers 2008-11-20 18:55 the pushes page dirty bit down 2008-11-20 18:55 and see if that's called from core, or per-fs 2008-11-20 18:55 yes 2008-11-20 18:56 probably, I'm not understanding those 2008-11-20 18:56 ok, mpage_readpages and similar are called only from filesystems 2008-11-20 18:56 they are part of the buffer oriented block library 2008-11-20 18:57 I think we can do this much more nicely 2008-11-20 18:57 not being able to use mpage.c will be no loss 2008-11-20 18:57 we're already doing a large part of that in filemap.c 2008-11-20 18:57 readpages is called from readahead stuff? 2008-11-20 18:57 let's check that 2008-11-20 18:58 __do_page_cache_readahead -> ->readpages()? 2008-11-20 18:59 that's ok 2008-11-20 18:59 readpages is a method our fs can implement 2008-11-20 18:59 yes 2008-11-20 18:59 if there was a core kernel call to mpage_readpages that would be a problem 2008-11-20 18:59 it's really nice that somebody has made this clean for us 2008-11-20 19:00 are there any users of that big cleanup at all? 2008-11-20 19:00 maybe... ramfs or something? 2008-11-20 19:00 even xfs is using traditional buffers... 
push their own buffer layer on top of that! 2008-11-20 19:01 s/push/plus/ 2008-11-20 19:01 probably 2008-11-20 19:01 probably btrfs is 2008-11-20 19:02 1204 * Yes, Virginia, this is indeed insane. 2008-11-20 19:02 which file? 2008-11-20 19:02 1204 * Yes, Virginia, this is indeed insane. 2008-11-20 19:02 whoops 2008-11-20 19:02 http://lxr.linux.no/linux+v2.6.27.5/mm/page-writeback.c#L1204 2008-11-20 19:03 ah, maybe this is for mkpage or something, iirc 2008-11-20 19:04 it's a result of the bug hunt two christmases ago 2008-11-20 19:04 took 3 months to find the corruption 2008-11-20 19:04 ah 2008-11-20 19:04 -!- RazvanM(~RazvanM@pool-96-244-34-153.bltmmd.east.verizon.net) has joined #tux3 2008-11-20 19:05 it was caused by "accurate dirty accounting" 2008-11-20 19:07 -!- tux3bot(~tux3bot@yzf.shapor.com) has joined #tux3 2008-11-20 19:07 well, I need full source to think those 2008-11-20 19:07 yes 2008-11-20 19:07 so that's why I'm writing a quick prototype 2008-11-20 19:07 good 2008-11-20 19:07 I'll post my tiny user space program pretty soon 2008-11-20 19:08 yes 2008-11-20 19:08 currently called foo.c ;) 2008-11-20 19:08 time of tux3 u? 2008-11-20 19:08 oh :) 2008-11-20 19:08 in 52 minutes from now 2008-11-20 19:08 ok 2008-11-20 19:09 ah, it was ->page_mkwrite 2008-11-20 19:10 -!- ChanServ changed mode/#tux3 -> +o flips 2008-11-20 19:10 -!- flips changed topic to "http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 pm Pacific Time ~ Next session: the page dirty hairball" 2008-11-20 19:10 -!- ChanServ changed mode/#tux3 -> -o flips 2008-11-20 19:11 ->page_mkwrite, nice to see its a method 2008-11-20 19:11 -!- ceatinge(~ceatinge@veryclever.net) has joined #tux3 2008-11-20 19:11 now is it asynchronous 2008-11-20 19:13 I try to remember why is that big comment introduced... 
2008-11-20 19:13 http://lxr.linux.no/linux+v2.6.27.5/fs/ext4/inode.c#L4809 <- wowo, ext4_patge_mkwrite 2008-11-20 19:13 that is scary 2008-11-20 19:14 anyway, it is obvious that ->mkwrite is not asynchronous, because ext4_page_mkwrite is taking mutex locks 2008-11-20 19:14 well 2008-11-20 19:14 does that follow 2008-11-20 19:15 yes 2008-11-20 19:15 hmm, no, it's only obvious its not called under spinlock 2008-11-20 19:15 that may be preparing for mmaped page 2008-11-20 19:16 http://lxr.linux.no/linux+v2.6.27.5/fs/ext4/inode.c#L4809 <- called from here in do_wp_page 2008-11-20 19:17 that case is for mmapped page 2008-11-20 19:17 yes, copy on write page to capture to write from userland 2008-11-20 19:18 ok, it seems likely that asynchronous subpage dirty happens only to file page cache 2008-11-20 19:18 s/likely/certain/ 2008-11-20 19:19 i see 2008-11-20 19:19 but we still have to worry about how to represent partial page valid at the ->i_size boundary 2008-11-20 19:19 well, it is allowed to take mutex here 2008-11-20 19:19 partial page? 2008-11-20 19:19 yes, at the end of a file 2008-11-20 19:20 it's zeroed after ->size? 2008-11-20 19:20 some blocks clean or dirty and some empty 2008-11-20 19:20 even if its zeroed, we can't leave the block there in "clean" state I think 2008-11-20 19:20 i see, block state 2008-11-20 19:20 above ->size 2008-11-20 19:21 I might be wrong about that 2008-11-20 19:21 if I'm wrong about that it will be easy to handle 2008-11-20 19:22 so... anything about i_size will not be read in from backing store 2008-11-20 19:22 except that I violate that in the ->atable inode ;) 2008-11-20 19:22 block_full_page_write handles about it (i.e. it reads ->i_size)? 
2008-11-20 19:23 right, and we won't use that, so don't ahve to worry about what it does 2008-11-20 19:23 hmm, I remember I said we were going to look at truncate 2008-11-20 19:23 like me fix the topic 2008-11-20 19:24 yes 2008-11-20 19:24 -!- ChanServ changed mode/#tux3 -> +o flips 2008-11-20 19:24 -!- flips changed topic to "http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 pm Pacific Time ~ Next session: the truncate hairball" 2008-11-20 19:25 -!- ChanServ changed mode/#tux3 -> -o flips 2008-11-20 19:25 another nice benefit of doing the block state work now is... we can post the patch to lkml and say "is this right?" 2008-11-20 19:26 yes, that would be nice 2008-11-20 19:27 anyway, I retract my claim that buffer page dirty is not cleared on submit_bh 2008-11-20 19:27 I think that we must not do that though 2008-11-20 19:28 i see 2008-11-20 19:28 because a) we have no chance of block re-dirty, which is what that strategy is supposed to handle and b) we need to know the dirty state across the whole transfer to know when to fork a block 2008-11-20 19:29 let's call a page that lies partly above and partly below i_size a "boundary page" 2008-11-20 19:30 ah, a) is 2008-11-20 19:30 ah, i see 2008-11-20 19:31 I misread it as "boundary block" 2008-11-20 19:32 I'm going to be worrying about boundary blocks for a while, because it seems we have to record the subpage state there 2008-11-20 19:33 maybe we don't 2008-11-20 19:33 um.. 
I think it is not needed 2008-11-20 19:34 it would be nice if it's not needed 2008-11-20 19:34 then we would not ever have page->private for a file page cache 2008-11-20 19:34 maybe it's just not break get_block() 2008-11-20 19:34 i see 2008-11-20 19:35 we will have page->private for directory page cache, but that can't be memory mapped and our code always controls i_size directly 2008-11-20 19:35 yes 2008-11-20 19:36 -!- RalucaM(~ral@pool-96-244-34-153.bltmmd.east.verizon.net) has joined #tux3 2008-11-20 19:36 hi 2008-11-20 19:36 same with sb->bitmap, sb->atable, etc 2008-11-20 19:36 hi 2008-11-20 19:36 but if we can avoid it, it would be clean 2008-11-20 19:37 well, I'm not sure 2008-11-20 19:37 can't avoid it with directories when we go to btree directories 2008-11-20 19:37 we need to handle block resolution then 2008-11-20 19:38 it meant "above i_size" 2008-11-20 19:39 yes 2008-11-20 19:39 i see 2008-11-20 19:40 I think we don't need it for file page cache, but it's important to know for sure 2008-11-20 19:40 it would also be nice to be sure that it works even for file page cache 2008-11-20 19:41 even if we do not intend to ever use it that way 2008-11-20 19:41 the reason we don't need it for file page cache is... we never need to handle individual blocks in a user-visible file 2008-11-20 19:42 it means "handle_t" here? 2008-11-20 19:42 "it" means 2008-11-20 19:42 yes 2008-11-20 19:42 I'm not sure my claim is correct 2008-11-20 19:42 ah, yes 2008-11-20 19:43 what do we do when we have a one-block hole in a file 2008-11-20 19:43 easy to create with a simple program 2008-11-20 19:43 now we need to transfer that one block from disk 2008-11-20 19:43 to disk 2008-11-20 19:43 I mean 2008-11-20 19:43 blocks exclude one block? 
2008-11-20 19:44 because somebody filled it in with a sys_write 2008-11-20 19:44 so 3 blocks on the page have valid data, one is a hole 2008-11-20 19:44 and somebody does a sys_write to the hole 2008-11-20 19:44 I think it's clear, we have to have page->private = block state there 2008-11-20 19:45 one block is also valid, but it is not transfer to disk 2008-11-20 19:45 I think it is same with other pages 2008-11-20 19:46 I think you're right 2008-11-20 19:46 just set up a bio that covers the one block 2008-11-20 19:46 yes, I think so 2008-11-20 19:47 filemap.c in that case has to work a little differently, it has to work at page resolution 2008-11-20 19:47 iirc, it was block resolution... 2008-11-20 19:47 it is in user space 2008-11-20 19:48 yes 2008-11-20 19:48 diskread/diskwrite does those by blocksize? 2008-11-20 19:49 ok, wait, I think we were too quick just above 2008-11-20 19:49 somebody does buffered sys_write to just one block... and we dirty the whole page? 2008-11-20 19:49 because without page->handles we have no other choice 2008-11-20 19:50 I'm not sure about it 2008-11-20 19:50 if we dirty a whole page it will cause extra writeout to a 1K block filesystem, do we care? 
2008-11-20 19:50 currently, ->write_begin() will read full page 2008-11-20 19:51 for a subpage write, yes 2008-11-20 19:52 then, if the filesystem is using the block IO library, only pages that are !uptodate will actually be read 2008-11-20 19:52 only buffers I mean 2008-11-20 19:52 if the filesystem is using the block IO library, only buffers that are !uptodate will actually be read 2008-11-20 19:53 yes 2008-11-20 19:53 on the write, the block IO library will put buffers on the page, then only write the ones needing writing 2008-11-20 19:53 yes 2008-11-20 19:53 so, we are probably going to do simialr 2008-11-20 19:54 i see 2008-11-20 19:54 it might be possible to avoid it at the expense of extra IO for 1K block volume 2008-11-20 19:54 that doesn't really sound nice though, even if it can work 2008-11-20 19:55 in 5 minutes I'm going to ask if everybody read truncate.c ;) 2008-11-20 19:55 well, probably similar will be required to simple mmap 2008-11-20 19:56 ok 2008-11-20 19:59 mmap is a slightly different case, when the hardware dirty bit is set we _must_ write all blocks on the page 2008-11-20 19:59 yes 2008-11-20 20:00 it also gets set asynchronously, possibly making things interesting 2008-11-20 20:00 why do we need to read full page 2008-11-20 20:00 by the way, ->mkwrite is changing the writable status of a mapped page, not pushing down the hardware dirty bit 2008-11-20 20:00 on page fault, the full page must be read 2008-11-20 20:01 it's not allowed to have a partically valid page exposed to user 2008-11-20 20:01 partically 2008-11-20 20:01 bleah 2008-11-20 20:01 partially 2008-11-20 20:01 yes, so also write() have to do it 2008-11-20 20:01 not syswrite 2008-11-20 20:02 not write 2008-11-20 20:02 write can leave partially valid pages, but then the subpage state _must_ be recorded 2008-11-20 20:02 so I think we have settled that one, probably, -> we have to record subpage state in file page cache too 2008-11-20 20:03 if user mapped partially valid page, what 
data is visible? 2008-11-20 20:03 that is never allowed, the invalid parts of the page have to be set to zero 2008-11-20 20:03 then the whole page is marked uptodate 2008-11-20 20:03 sorry 2008-11-20 20:03 dirty 2008-11-20 20:03 well 2008-11-20 20:04 I'm not sure ;) 2008-11-20 20:04 this is a murky area 2008-11-20 20:04 ah, I count that is valid part (i.e. zeroed area) 2008-11-20 20:05 ;-) 2008-11-20 20:07 consider two sys_writes of different 1K blocks on the same page, they obviously can't both zero the remainder of the page 2008-11-20 20:07 ok, so the first one, that instantiates the page can zero the remainder 2008-11-20 20:08 and mark the page dirty 2008-11-20 20:08 then the second write can overwrite the zeroes 2008-11-20 20:08 inefficient because of useless zeroing work, but it works 2008-11-20 20:09 ok, should we start truncating? 2008-11-20 20:09 probably 2008-11-20 20:09 let's jump into fs/truncate.c 2008-11-20 20:10 sorry 2008-11-20 20:10 vmtruncate in mm/memory.c? 2008-11-20 20:11 yes 2008-11-20 20:11 err sorry 2008-11-20 20:11 mm/truncate.c 2008-11-20 20:11 it's mm because it deals only with the page cache 2008-11-20 20:11 blurry area there 2008-11-20 20:11 http://lxr.linux.no/linux+v2.6.27.5/mm/memory.c#L2165 2008-11-20 20:12 http://lxr.linux.no/linux+v2.6.27/mm/truncate.c (let's start here) 2008-11-20 20:12 the memory.c part is also important 2008-11-20 20:12 and the fs-specific part 2008-11-20 20:12 19#include <linux/buffer_head.h> /* grr. try_to_release_page, 2008-11-20 20:12 20 do_invalidatepage */ 2008-11-20 20:13 truncate.c is in fact entirely independent of buffers 2008-11-20 20:13 except for one thing 2008-11-20 20:13 i see 2008-11-20 20:14 I knew that one thing just a short while ago ;) 2008-11-20 20:14 what is it? 2008-11-20 20:15 ok, easiest way to refresh my memory is to remove the include of buffer.h 2008-11-20 20:15 and recompile 2008-11-20 20:16 here goes 2008-11-20 20:16 there's a comment about what it's for?
2008-11-20 20:16 block_invalidate_page 2008-11-20 20:17 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L1516 2008-11-20 20:17 assumes buffers 2008-11-20 20:18 but see, this gets called only if the mapping does not supply its own invalidatepage method 2008-11-20 20:18 so we must do that 2008-11-20 20:18 yes 2008-11-20 20:18 to avoid extreme badness by having the block library try to operate on our page->handles 2008-11-20 20:19 I think I will send Andrew a patch to remove that last buffer dependency there 2008-11-20 20:20 by initializing a_ops->invalidatepage to block_invalidatepage if the filesystem did not fill in that method 2008-11-20 20:20 hirofumi, makes sense? 2008-11-20 20:20 umm.. 2008-11-20 20:20 then we can go filesystem by filesystem and make it fill in that method explicitly 2008-11-20 20:21 probably, changing a_ops is not good 2008-11-20 20:21 probably, changing a_ops is not good by vfs 2008-11-20 20:21 on mapping space creation though 2008-11-20 20:21 sorry 2008-11-20 20:21 on address_space creation 2008-11-20 20:22 vfs already assumes buffer operations here 2008-11-20 20:22 all we do is move that badness to a different place, which is progress towards getting rid of it entirely 2008-11-20 20:22 yes, maybe we need 2008-11-20 20:22 new_inode() 2008-11-20 20:22 yes 2008-11-20 20:23 each fs sets own ->a_ops = foo_aops 2008-11-20 20:23 after that, vfs needs to change that foo_aops 2008-11-20 20:23 yes, and lets look at exactly where it filles that in 2008-11-20 20:23 fills 2008-11-20 20:24 this would break out of key fs'es of course... but that's ok I guess (and odne all the time anyway), although probably still worth checking non-null in register filesystem 2008-11-20 20:24 maze, no it wouldn't 2008-11-20 20:24 meant out of tree 2008-11-20 20:24 understood 2008-11-20 20:24 not that I think it's bad to break out of tree stuff 2008-11-20 20:24 why wouldn't it? 2008-11-20 20:25 you'd no longer be checking for null right? 
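The two arrangements under discussion can be modelled in user space roughly as follows. This is a sketch under stated assumptions: the struct layouts and the `mock_*` functions are hypothetical, and only the idea (fall back to the buffer-based invalidate when the mapping supplies no method, versus requiring every filesystem to supply one) comes from the conversation.

```c
#include <assert.h>
#include <stddef.h>

struct page;
typedef void (*invalidatepage_t)(struct page *page, unsigned long offset);

/* Hypothetical cut-down ops table and page, just enough to show dispatch. */
struct address_space_ops {
	invalidatepage_t invalidatepage;
};

enum handler { NOBODY, BLOCK_LIBRARY, FILESYSTEM };

struct page {
	const struct address_space_ops *a_ops;
	enum handler invalidated_by; /* records which code touched page->private */
};

/* Stand-in for the buffer-based default, which assumes buffers on the page. */
static void mock_block_invalidatepage(struct page *page, unsigned long offset)
{
	(void)offset;
	page->invalidated_by = BLOCK_LIBRARY;
}

/* Stand-in for a filesystem's own method that understands its own state. */
static void mock_fs_invalidatepage(struct page *page, unsigned long offset)
{
	(void)offset;
	page->invalidated_by = FILESYSTEM;
}

/* Current arrangement: the caller checks for NULL and picks the default. */
static void invalidate_with_fallback(struct page *page, unsigned long offset)
{
	invalidatepage_t fn = page->a_ops->invalidatepage;
	if (!fn)
		fn = mock_block_invalidatepage;
	fn(page, offset);
}

/* "Provide invalidatepage always": no defaulting in the caller at all. */
static void invalidate_explicit(struct page *page, unsigned long offset)
{
	assert(page->a_ops->invalidatepage != NULL);
	page->a_ops->invalidatepage(page, offset);
}
```

A mapping that leaves the method NULL silently gets the buffer-based handler, which is the case a filesystem keeping something other than buffers in page->private has to avoid by filling the method in.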
2008-11-20 20:25 because this default already exists in truncate.c 2008-11-20 20:25 if the fs leaves it null, the vfs would fill it in 2008-11-20 20:25 now let's see there the fs supplies a_ops and see if that works 2008-11-20 20:26 that kind of defeats the purpose of having only the filesystems that use it fill it in 2008-11-20 20:27 why? 2008-11-20 20:28 you're just moving the null check to a different file? 2008-11-20 20:29 -!- ajonat_(~ajonat@190.48.127.205) has joined #tux3 2008-11-20 20:29 yes, but maybe to a place that is not acceptable 2008-11-20 20:29 as hirofumi first said 2008-11-20 20:29 hmm, ok 2008-11-20 20:29 lxr.linux.no/linux+v2.6.27/fs/ext2/inode.c#L1264 2008-11-20 20:30 that's where ext2 fills it in 2008-11-20 20:30 long after the place where vfs could default it transparently 2008-11-20 20:30 yes 2008-11-20 20:30 so this proposal may run around on that messed up interface 2008-11-20 20:30 really, a_ops should be supplied at new_inode time 2008-11-20 20:31 but it isn't 2008-11-20 20:31 and kernel filesystems are massively dependent on that 2008-11-20 20:31 probably, "provide invalidatepage always" would be acceptable 2008-11-20 20:32 yes, more work but much cleaner 2008-11-20 20:32 yes 2008-11-20 20:33 probably the best is to propose doing it the wrong way, and be told to do it the right way 2008-11-20 20:33 much better than just doing it the right way and being ignored ;) 2008-11-20 20:33 lol 2008-11-20 20:33 it's not even a joek 2008-11-20 20:33 joke 2008-11-20 20:34 well let's continue in truncate.c 2008-11-20 20:34 I see that as ego-stroking/management 2008-11-20 20:34 ;-) 2008-11-20 20:34 hmm, it works on even low-ego people 2008-11-20 20:35 it would work on me for example :) 2008-11-20 20:36 it's a matter of, you get invested in the issue after having stated how to solve it 2008-11-20 20:36 ok, truncate.c is full of truncate helpers 2008-11-20 20:36 it's hard to know what they're all for without looking at the high level 2008-11-20 20:37 
but let's look at some of them 2008-11-20 20:37 ok 2008-11-20 20:37 http://lxr.linux.no/linux+v2.6.27/mm/truncate.c#L120 <- invalidate_complete_page 2008-11-20 20:37 124 if (page->mapping != mapping) <- oddity 2008-11-20 20:38 tasks may race, both doing invalidates 2008-11-20 20:38 so the page may be already out of the mapping by the time we get here 2008-11-20 20:38 127 if (PagePrivate(page) && !try_to_release_page(page, 0)) <- this concerns us 2008-11-20 20:39 we are supposed to get rid of our page->handles there 2008-11-20 20:39 one of more of the blocks may be under IO 2008-11-20 20:40 only in the case of file page cache 2008-11-20 20:40 because buffer cache blocks will never be truncated 2008-11-20 20:41 let's see what remove_mapping does 2008-11-20 20:41 all stuff we can ignore 2008-11-20 20:42 worries about swap cache... none of our pages can go to swap cache 2008-11-20 20:42 yes 2008-11-20 20:43 we do have to be aware of the question: can any of our pages be removed from a mapping without us knowing? 2008-11-20 20:43 or in other words, what prevents that? 2008-11-20 20:43 let's look at truncate_inode_pages_range to get an idea 2008-11-20 20:43 ->releasepage hander? 2008-11-20 20:44 maybe 2008-11-20 20:44 http://lxr.linux.no/linux+v2.6.27/mm/truncate.c#L158 2008-11-20 20:44 mm may feel entitled to remove a clean page cache page 2008-11-20 20:45 and we should remove any block handles in our ->releasepage to allow it to do so 2008-11-20 20:45 but will we ever leave block handles on a clean page? 2008-11-20 20:45 I think we should not 2008-11-20 20:46 probably, yes 2008-11-20 20:46 dirty pages are different story... the vm really should never try to release one of our dirty pages 2008-11-20 20:46 but it does try to, by calling ->writepage 2008-11-20 20:46 this is a poor interface as I have said before 2008-11-20 20:46 and truncate? 
2008-11-20 20:47 yes, truncate too 2008-11-20 20:47 more commonly even 2008-11-20 20:47 yes 2008-11-20 20:47 jsut to restate: every one of our dirty pages is either already due to be written out or is pinned metadata... with the exception of truncated pages 2008-11-20 20:49 ok, truncate_inode_pages_range 2008-11-20 20:49 has this fancy new pagevec stuff 2008-11-20 20:49 http://lxr.linux.no/linux+v2.6.27/mm/swap.c#L471 2008-11-20 20:50 http://lxr.linux.no/linux+v2.6.27/mm/filemap.c#L759 <- find_get_pages 2008-11-20 20:51 so this interface gets a bunch of struct page pointers from the mapping 2008-11-20 20:51 111static inline int page_cache_get_speculative(struct page *page) 2008-11-20 20:51 fun 2008-11-20 20:52 141 VM_BUG_ON(PageTail(page)); 2008-11-20 20:52 more fun 2008-11-20 20:52 this is a case where the explanatory comment is _much_ longer than the code 2008-11-20 20:53 anybody ever looked at this before? 2008-11-20 20:53 it's part of lockless pagecache 2008-11-20 20:53 scary stuff 2008-11-20 20:54 I'm going to run away for now 2008-11-20 20:54 can't see actually what is doing 2008-11-20 20:54 I've heard a little about it 2008-11-20 20:54 I hope the efficiency improvement is well worth the mess ;) 2008-11-20 20:54 only improves things on multi-cpu machines I think 2008-11-20 20:55 it seems to including !CONFIG_SMP case 2008-11-20 20:55 well 2008-11-20 20:55 :p 2008-11-20 20:56 will look more deeply later 2008-11-20 20:56 yes 2008-11-20 20:56 ok, so we got a pagevec full of page structs 2008-11-20 20:56 177 pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { 2008-11-20 20:57 in order of page index 2008-11-20 20:57 yes 2008-11-20 20:58 if anybody has a page locked here, we give up on that page 2008-11-20 20:58 that would be a page locked above i_size 2008-11-20 20:59 pages above i_size can also be in writeback 2008-11-20 20:59 locked would most often be, for reading 2008-11-20 21:00 but can also be locked by vmscan 2008-11-20 21:00 kswapd 2008-11-20 21:00 196 
if (page_mapped(page)) { 2008-11-20 21:00 197 unmap_mapping_range(mapping, 2008-11-20 21:00 198 (loff_t)page_index<<PAGE_CACHE_SHIFT, 199 PAGE_CACHE_SIZE, 0); 2008-11-20 21:01 this looks inefficient 2008-11-20 21:01 unmap one page at a time, using an interface designed for multiple pages 2008-11-20 21:01 I wonder why that was left that way 2008-11-20 21:02 maybe too much work to do it otherwise, just to make truncate of a mmapped file work efficiently 2008-11-20 21:02 that should not happen in a properly working user application, I think 2008-11-20 21:02 i see, probably yes 2008-11-20 21:03 the we do the truncate complete page, where we will try to remove our block handles 2008-11-20 21:04 so... does truncate_complete_page tolerate blocking, waiting for a block to be in the right state? 2008-11-20 21:05 I think so 2008-11-20 21:05 it's under lock_page() 2008-11-20 21:05 86 * If truncate cannot remove the fs-private metadata from the page, the page 2008-11-20 21:05 87 * becomes orphaned. It will be left on the LRU and may even be mapped into 2008-11-20 21:05 88 * user pagetables if we're racing with filemap_fault(). 2008-11-20 21:06 we want to think about life cycles of orphaned pages too 2008-11-20 21:07 what does 'orphaned' mean? 2008-11-20 21:07 that's the definition, just above 2008-11-20 21:07 ah, ok, so mmapped into somebody userspace, but no longer part of the file 2008-11-20 21:07 a truncated page, removed for a page cache, that still has page->private state attached 2008-11-20 21:07 could be under IO, or in use by something 2008-11-20 21:08 maybe "page->mapping != mapping" page?
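The per-page steps walked through above (give up on locked pages, skip pages under writeback, unmap one page at a time, then truncate the page) can be modelled in user space as below. This is only a sketch: `struct mock_page` and `truncate_first_pass` are hypothetical names, the flags are plain booleans rather than real page flags, and pages that are skipped are simply left for whatever later handling applies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the page state the truncate loop looks at. */
struct mock_page {
	bool locked;    /* somebody else holds the page lock */
	bool writeback; /* page is currently being written out */
	bool mapped;    /* page is mapped into a user address space */
	bool in_cache;  /* page is still part of the mapping */
};

/* One non-blocking pass over a batch of pages, in index order.
 * Returns how many pages were removed from the cache. */
static size_t truncate_first_pass(struct mock_page *pages, size_t count)
{
	size_t removed = 0;
	for (size_t i = 0; i < count; i++) {
		struct mock_page *page = &pages[i];
		if (page->locked)
			continue;             /* could not take the lock: give up on it */
		if (page->writeback)
			continue;             /* under IO: leave it alone for now */
		if (page->mapped)
			page->mapped = false; /* unmap this single page */
		page->in_cache = false;       /* the "truncate complete page" step */
		removed++;
	}
	return removed;
}
```

In this model a locked page and a page under writeback both survive the pass, which is where the question of orphaned pages and their life cycle comes from.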
2008-11-20 21:08 right, mapping could be null 2008-11-20 21:08 i see 2008-11-20 21:09 from the way it's written, it sounds like the page could even be put into a different mapping 2008-11-20 21:09 which would be horrible 2008-11-20 21:09 but I think it's actually talking about swapper space mapping there 2008-11-20 21:09 well 2008-11-20 21:09 dinner time here 2008-11-20 21:09 we will continue with truncate on tuesday 2008-11-20 21:10 ok 2008-11-20 21:10 it's a big, obscure topic 2008-11-20 21:10 and the traditional source of most vfs bugs 2008-11-20 21:10 I didn't know about orphaned page until now 2008-11-20 21:11 there are also "morton pages" :) 2008-11-20 21:11 :) 2008-11-20 21:13 :-) 2008-11-20 21:37 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-11-20 22:08 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-11-20 22:09 -!- _ajonat(~ajonat@190.48.124.216) has joined #tux3 2008-11-21 00:13 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-21 00:15 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-11-21 00:15 hey all 2008-11-21 00:31 hi pranith 2008-11-21 00:42 flips: i see you started the kernel port :) 2008-11-21 00:42 congrats 2008-11-21 00:42 thought about your deduplication project? 2008-11-21 00:43 hmm, not thoroughly.. 2008-11-21 00:43 i think i'll need help 2008-11-21 00:43 some help always helps 2008-11-21 00:43 :) 2008-11-21 00:44 b back soon.. lunch time 2008-11-21 00:44 you'd get it working with the user space code I think 2008-11-21 01:18 hi, flips here? 2008-11-21 01:19 yes 2008-11-21 01:19 could you read email about path[]? 2008-11-21 01:19 flips: Don't you need sleep or anything? 
;) 2008-11-21 01:19 it's not late yes 2008-11-21 01:19 just read it now 2008-11-21 01:20 yes, path[] is slightly better than *path 2008-11-21 01:20 it's a small style thing, the first one is documentation that it really is an array 2008-11-21 01:20 ok, I'll change all to path[] 2008-11-21 01:20 the second one could be intended to be a field of a struct, a return value, or something like that 2008-11-21 01:20 then I'll pull? 2008-11-21 01:21 yes, please. probably, tommrow 2008-11-21 01:21 please pull after change all to path[] 2008-11-21 01:24 when I wake up tomorrow then 2008-11-21 01:25 thanks 2008-11-21 01:43 -!- samlh(~sam@67.129.121.145) has joined #tux3 2008-11-21 05:41 -!- ajonat(~ajonat@190.48.121.56) has joined #tux3 2008-11-21 08:52 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-21 08:55 -!- ajonat_(~ajonat@190.48.110.24) has joined #tux3 2008-11-21 09:17 -!- ajonat_(~ajonat@190.48.107.160) has joined #tux3 2008-11-21 09:22 -!- _ajonat(~ajonat@190.48.126.39) has joined #tux3 2008-11-21 09:29 -!- ajonat_(~ajonat@190.48.108.186) has joined #tux3 2008-11-21 10:13 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-21 10:51 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-21 11:42 -!- Bobby_(~Bobby@122.162.71.65) has joined #tux3 2008-11-21 14:09 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-21 15:37 -!- konrad(~konrad@garloff.cs.washington.edu) has joined #tux3 2008-11-21 19:41 who sets the a_ops for our filesystem's blockdev mapping? 2008-11-21 20:37 I am thinking that the s_bdev blockdev has address ops that assume buffers 2008-11-21 20:38 so set_page_dirty on a page in s_bdev would be a disaster if we are using page->private for something else 2008-11-21 20:39 I am thinking, why do we need to use s_bdev anyway, why don't we create our own address_space for our "buffer cache"? 
2008-11-21 20:39 just just the same gendisk 2008-11-21 20:52 I think it's blockdev.c itself 2008-11-21 20:52 blockdev.c setup bd_inode (block device's inode) 2008-11-21 20:53 and we borrow that bd_inode->i_mapping 2008-11-21 20:55 bd_inode manages their page cache 2008-11-21 20:57 we initialize special inode by init_special_inode() 2008-11-21 20:59 then def_blk_fops->open() (blkdev_open) initialize inode->i_mapping by bd_inode->i_mapping 2008-11-21 21:00 and read()/write() is handled by blockdev.c stuff 2008-11-21 21:13 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-11-21 21:39 ah, our "buffer cache" 2008-11-21 21:39 I think it's not needed 2008-11-21 21:42 if we didn't use bd_inode at all 2008-11-21 21:52 that's what I think 2008-11-21 21:53 if we set up our own mapping it will need an inode 2008-11-21 21:53 one of ours 2008-11-21 22:01 ah, for using ->private by handle_t 2008-11-21 22:02 yes 2008-11-21 22:02 it isn't going to work if we cache metadata in ->s_bdev 2008-11-21 22:02 because blockdev a_ops assume buffers 2008-11-21 22:02 yes 2008-11-21 22:03 if users access blockdev under tux3 directly, it breaks 2008-11-21 22:08 I don't think we have to do a lot more than get a new_inode() 2008-11-21 22:09 then the blockdev cache and our cache will be completely separate, which we don't care about 2008-11-21 22:10 there is no valid use of blockdev by an outsider except to copy it somewhere 2008-11-21 22:10 crude from of backup 2008-11-21 22:10 crude form 2008-11-21 22:10 and racy 2008-11-21 22:11 while copying the blockdev, a new delta might land 2008-11-21 22:11 yes, but... 2008-11-21 22:11 well, there are some operations 2008-11-21 22:11 like sync? 
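A small user-space model of what "completely separate" caches over one device means, and why copying the blockdev from outside is racy. Everything here is hypothetical (`struct device`, `struct cache`, and the `cache_*` helpers); it assumes a write-back cache per mapping with no coherency between them.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { DEV_BLOCKS = 4 };

/* The shared physical device: one int per block for simplicity. */
struct device {
	int block[DEV_BLOCKS];
};

/* A private write-back cache over the device (one per mapping). */
struct cache {
	struct device *dev;
	int data[DEV_BLOCKS];
	bool present[DEV_BLOCKS];
	bool dirty[DEV_BLOCKS];
};

static void cache_init(struct cache *c, struct device *dev)
{
	memset(c, 0, sizeof *c);
	c->dev = dev;
}

/* Read through the cache, filling from the device on a miss. */
static int cache_read(struct cache *c, int n)
{
	if (!c->present[n]) {
		c->data[n] = c->dev->block[n];
		c->present[n] = true;
	}
	return c->data[n];
}

/* Write into the cache only; the device is updated at flush time. */
static void cache_write(struct cache *c, int n, int value)
{
	c->data[n] = value;
	c->present[n] = true;
	c->dirty[n] = true;
}

/* Push every dirty block to the device, as a sync or freeze would. */
static void cache_flush(struct cache *c)
{
	for (int n = 0; n < DEV_BLOCKS; n++) {
		if (c->dirty[n]) {
			c->dev->block[n] = c->data[n];
			c->dirty[n] = false;
		}
	}
}
```

An outside reader going through its own cache sees the old block contents until the filesystem's cache is flushed, so a copy taken in between is only as consistent as the last flush.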
2008-11-21 22:12 maybe freeze_bdev() or something 2008-11-21 22:12 maybe 2008-11-21 22:12 it calls our filesystem sync 2008-11-21 22:12 our filesystem freeze 2008-11-21 22:12 and there we can flush our block device ourselves 2008-11-21 22:13 ah 2008-11-21 22:13 217 if (sb->s_op->write_super_lockfs) 2008-11-21 22:13 218 sb->s_op->write_super_lockfs(sb); 2008-11-21 22:14 the operation is called freeze in one place, lockfs in another :p 2008-11-21 22:14 useless operation, actually 2008-11-21 22:14 it would work, however maybe I'm not fan of new layer 2008-11-21 22:15 it's either have our own address_space for the blockdev or abandon the page->private = handles idea 2008-11-21 22:15 we are definitely doing our own sync 2008-11-21 22:16 core kernel sync algorithm is hopelessly inappropriate 2008-11-21 22:16 or merge handle to block_dev.c 2008-11-21 22:17 ? 2008-11-21 22:17 oh, change core 2008-11-21 22:17 yes 2008-11-21 22:17 maybe later 2008-11-21 22:18 yes, if so, I would be fan of it 2008-11-21 22:18 -!- RazvanM(~RazvanM@pool-96-244-34-153.bltmmd.east.verizon.net) has joined #tux3 2008-11-21 22:18 well I think handles would be a noticable improvement over buffer rings 2008-11-21 22:18 circular buffer lists 2008-11-21 22:19 yes, I think so 2008-11-21 22:19 I would not call using our own inode for the blockdev a new layer 2008-11-21 22:19 ah, or teach block_dev.c - "don't touch ->private" 2008-11-21 22:19 that's what I thought you meant 2008-11-21 22:20 like most other core files have been taught 2008-11-21 22:20 I wonder why andrew left that one alone 2008-11-21 22:20 well, it might be too hard 2008-11-21 22:21 uuummm.... 
but locks like it's not too hard 2008-11-21 22:21 s/locks/looks/ 2008-11-21 22:22 and not too hard to use our own inode either 2008-11-21 22:22 then mapping->host will be pointing where we want :) 2008-11-21 22:22 :) 2008-11-21 22:22 we would just use super->s_bdev to set up bio transfers 2008-11-21 22:23 that is, to set the bio->bi_dev 2008-11-21 22:23 well, it may not be generic if earch fs own ->mapping 2008-11-21 22:23 let me add a little bit of test code to junkfs to try that 2008-11-21 22:24 what is generic about the blockdev mapping now? I don't think there is any valid generic operation on it 2008-11-21 22:24 buffer_head? 2008-11-21 22:24 and block_read_* stuff 2008-11-21 22:25 which we will not use if we use handles 2008-11-21 22:25 yes 2008-11-21 22:25 that is library stuff used by filesystems 2008-11-21 22:25 maybe some ioctl on the blockdev is generic 2008-11-21 22:25 so, I thought we just overwrite the detail of block_* 2008-11-21 22:25 not ->mapping 2008-11-21 22:26 I thought that to until I found that some of the mapping ops are assume buffers for blockdevs 2008-11-21 22:27 I thought that too until I found that some of the mapping ops assume buffers for blockdevs <- corrected 2008-11-21 22:27 i see 2008-11-21 22:28 so... tux3/super.c junkfs already loads a superblock using just a bio transfer, no sb_bread 2008-11-21 22:29 now what should I try to prove the concept? 2008-11-21 22:29 create a new_inode(), transfer a page to/from the sb->s_bdev 2008-11-21 22:30 concept of own ->mapping? 2008-11-21 22:30 yes, we will be able to use the original idea of sb->devmap 2008-11-21 22:31 probably operations from outside fs? 2008-11-21 22:31 example? 
2008-11-21 22:31 I'm not sure 2008-11-21 22:31 most operations go through our super operations or inode operations 2008-11-21 22:32 a few go through mapping->a_ops 2008-11-21 22:32 set_page_dirty is one of those 2008-11-21 22:32 yes, if we have own mapping, we don't touch bdev from fs anymore 2008-11-21 22:33 we touch the physical device of course 2008-11-21 22:33 but we don't touch that cache 2008-11-21 22:33 yes, it would be work 2008-11-21 22:34 for fs obviously 2008-11-21 22:34 it just sits unused, not much wasted 2008-11-21 22:34 yes 2008-11-21 22:34 if we replace all of the block IO library I can't imagine it being more than a thousand lines 2008-11-21 22:35 yes 2008-11-21 22:36 outside fs can't do anything though 2008-11-21 22:38 well, I'm not sure 2008-11-21 22:39 so, I think "let's try with it" is good 2008-11-21 22:40 if it works with nfs, then it works ;) 2008-11-21 22:40 nfs certainly never accesses our blockdev 2008-11-21 22:40 good :) 2008-11-21 22:41 I posted my little handle.c file 2008-11-21 22:41 btw, now I can mount tux3 ;) 2008-11-21 22:41 it runs in user space and has some untried kernel code 2008-11-21 22:41 you can??? 2008-11-21 22:41 hehe :) 2008-11-21 22:41 tell me details :) 2008-11-21 22:42 well, I can just readdir() though 2008-11-21 22:42 that's a lot 2008-11-21 22:42 I'll put hackly patches to userweb.kernel.org 2008-11-21 22:43 worth putting in my git fs/tux3? 2008-11-21 22:44 I'm not sure, it may too hackish 2008-11-21 22:44 http://userweb.kernel.org/~hirofumi/tux3-hack.tar.gz 2008-11-21 22:44 well please post to the list with your url 2008-11-21 22:44 and we'll read it and talk about it 2008-11-21 22:45 ok 2008-11-21 22:45 well, it's simple - just add some operations 2008-11-21 22:45 yes 2008-11-21 22:46 but to read a directory... 
you're reading in at least the root dir btree root and leaf 2008-11-21 22:46 not to mention the inode table root and leaf 2008-11-21 22:47 ah, it may not read dirent yet 2008-11-21 22:47 because my rootdir is empty 2008-11-21 22:48 you can create a nonempty rootdir with the user space code 2008-11-21 22:48 yes 2008-11-21 22:48 I didn't try it yet 2008-11-21 22:49 it's highly likely to work 2008-11-21 22:49 the userspace readdir is the same interface as kernel 2008-11-21 22:49 but if we add blockget/blockread, it would work or near 2008-11-21 22:50 yes, we have readdir/lookup already 2008-11-21 22:51 ok 2008-11-21 22:51 I'll write email after those work 2008-11-21 22:52 I think it shouldn't be so hard 2008-11-21 22:54 I'm coding the tux3-private blockdev idea now 2008-11-21 22:54 ok 2008-11-21 22:54 we will try to make a decision if this is a good way to go, in the next day or so 2008-11-21 22:55 ok, I think it's ok at least for a long time though 2008-11-21 22:56 struct inode *devnode = new_inode(sb); <- here we go 2008-11-21 22:57 ok 2008-11-21 22:57 we just use it, or overwrite bd_inode->i_mapping? 2008-11-21 22:58 we just use it 2008-11-21 22:58 ok 2008-11-21 22:58 tux_inode->dev (address_space *) 2008-11-21 22:58 more mapping_t I think we have it now 2008-11-21 22:58 or mapping_t I meant 2008-11-21 22:59 I feel overwrite may be good right now 2008-11-21 22:59 overwrite what? 2008-11-21 23:00 oh 2008-11-21 23:00 sb->s_bdev->bd_inode->i_mapping = devnode->i_mapping 2008-11-21 23:00 ah 2008-11-21 23:01 umm.. 
may racey 2008-11-21 23:01 then I think we would break things if somebody runs dump on /dev/sdaX 2008-11-21 23:01 I don't think it will work 2008-11-21 23:01 hmm 2008-11-21 23:02 it might work 2008-11-21 23:02 hard to tell 2008-11-21 23:02 but I'm pretty sure a completely separate inode will work 2008-11-21 23:02 yes 2008-11-21 23:03 I'll read a bit of blockdev.c and see how many buffer dependencies there are 2008-11-21 23:03 if overwrite work, everything may work, well, I'm not sure 2008-11-21 23:04 separate inode would be enough for now 2008-11-21 23:33 uml still works, good :) 2008-11-21 23:34 it's been a couple months since I last mounted tux3 in kernel 2008-11-21 23:46 hirofumi, you added an inode slab I guess? 2008-11-21 23:47 yes 2008-11-21 23:47 which I'm doing right now too ;) 2008-11-21 23:47 it's pretty simple, a little verbose 2008-11-21 23:47 oh 2008-11-21 23:48 my inode slab patch? 2008-11-21 23:48 my patch is a little verbose? 2008-11-21 23:48 no, the interface for creating fs specific inodes 2008-11-21 23:48 ah 2008-11-21 23:48 my interface :p 2008-11-21 23:49 new_inode? 2008-11-21 23:49 and the alloc and destroy methods 2008-11-21 23:50 not a big deal 2008-11-21 23:50 ok 2008-11-22 00:29 -!- _ajonat(~ajonat@190.48.125.11) has joined #tux3 2008-11-22 00:35 ok, readdir seems to work 2008-11-22 00:35 really hack though 2008-11-22 01:20 what is the really hack part? 2008-11-22 01:20 most of linux is just a hack anyway 2008-11-22 01:20 blockread.patch 2008-11-22 01:20 :) 2008-11-22 01:20 url? 
2008-11-22 01:21 http://userweb.kernel.org/~hirofumi/tux3-hack.tar.gz 2008-11-22 01:21 in that tarball, patchset/patches/blockread.patch 2008-11-22 01:21 by the way, it took me until now just to add the inode slab allocation to the stub tux3 fs 2008-11-22 01:22 decided I didn't really need to init_once 2008-11-22 01:22 well 2008-11-22 01:22 if you don't, you ahve to do inode_init_once in tux3_alloc_inode 2008-11-22 01:23 and if you don't, you can spend a long time figuring out why it's segfaulting 2008-11-22 01:23 :p 2008-11-22 01:23 segfault? 2008-11-22 01:23 I wasn't doing the inode init_once 2008-11-22 01:23 wrong 2008-11-22 01:23 it's needed 2008-11-22 01:24 ah, spin_lock_init, etc.? 2008-11-22 01:24 when I added the tux3 inode slab 2008-11-22 01:24 even a simple thing like that has lots to go wrong 2008-11-22 01:26 I just copied from ext2 almost 2008-11-22 01:26 same here, and I thought I would "improve" it a little 2008-11-22 01:26 changed init_once to NULL ;) 2008-11-22 01:26 doesn't work 2008-11-22 01:27 ah 2008-11-22 01:27 if there never was an init_once feature in slab we'd probably be better off 2008-11-22 01:27 saves a microscopic amount of cpu 2008-11-22 01:28 requires you to re-init the fields on delete 2008-11-22 01:28 so probably doesn't save anything 2008-11-22 01:28 maybe it's only for spin_lock_init etc. 2008-11-22 01:28 this was inode.assoc_mapping 2008-11-22 01:29 if it was only spinlocks it would be ok 2008-11-22 01:29 i see 2008-11-22 01:30 the init_once feature makes stuff fragile and causes weird errors when an object is freed with garbage in some fields 2008-11-22 01:30 ok, back to the task ;) 2008-11-22 01:30 if you have time, could you review my patches? 2008-11-22 01:31 or later 2008-11-22 01:31 I'm looking right now 2008-11-22 01:31 thanks 2008-11-22 01:31 in between the complaining ;) 2008-11-22 01:31 :) 2008-11-22 01:32 now I'm trying to replace some blockread with address_space and libraries 2008-11-22 01:32 that's sane? 
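The init_once fragility complained about above can be illustrated with a toy user-space object cache. This is a sketch, not the kernel slab allocator: `struct mock_cache` holds a single object, and the only property modelled is that the constructor runs when the object's memory is first set up, not on every allocation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-down inode with one constructor-initialized field. */
struct mock_inode {
	int lock_ready;      /* stands in for spin_lock_init() and friends */
	void *assoc_mapping; /* expected to be NULL in a fresh inode */
};

/* Toy one-object cache with an optional constructor. */
struct mock_cache {
	struct mock_inode slot;
	int constructed;
	int in_use;
	void (*ctor)(struct mock_inode *);
};

static void mock_init_once(struct mock_inode *inode)
{
	inode->lock_ready = 1;
	inode->assoc_mapping = NULL;
}

/* Allocate: the constructor runs only the first time the slot is used. */
static struct mock_inode *mock_alloc(struct mock_cache *cache)
{
	if (cache->in_use)
		return NULL;
	if (!cache->constructed) {
		if (cache->ctor)
			cache->ctor(&cache->slot);
		cache->constructed = 1;
	}
	cache->in_use = 1;
	return &cache->slot;
}

/* Free: nothing is reset here, so the caller must restore the fields. */
static void mock_free(struct mock_cache *cache, struct mock_inode *inode)
{
	assert(inode == &cache->slot);
	cache->in_use = 0;
}
```

If an object goes back to the cache with a stale pointer in `assoc_mapping`, the next allocation returns it unchanged, which is the "freed with garbage in some fields" failure mode.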
2008-11-22 01:32 which one for example? 2008-11-22 01:33 add get_block, readpage, and readpages 2008-11-22 01:33 and use readpage from dir.c 2008-11-22 01:33 wow you used my dwalk_next api 2008-11-22 01:34 tux3_get_block is almost copy of filemap_extent_io 2008-11-22 01:34 well, that would be like many other linux filesystems 2008-11-22 01:35 yes 2008-11-22 01:35 I'm thinking, we will replace those, and compare 2008-11-22 01:36 or 2008-11-22 01:36 just start proper stuff 2008-11-22 01:36 bh_result->b_blocknr = iblock; <- what is this for? 2008-11-22 01:37 filemap_extent_io reads it 2008-11-22 01:37 it's a very good thing you've done 2008-11-22 01:37 bh_result->b_blocknr = iblock? 2008-11-22 01:38 this hack 2008-11-22 01:38 ah 2008-11-22 01:38 thanks 2008-11-22 01:38 getting it up and running with the block library 2008-11-22 01:38 ok 2008-11-22 01:39 I'll try next with aops stuff 2008-11-22 01:41 now I'm thinking blockread() by read_mapping_page(), um... that's sane? 2008-11-22 01:41 bh_result->b_blocknr = iblock; <- but it looks like this is always overwritten at the bottom of tux3_get_block 2008-11-22 01:41 yes 2008-11-22 01:41 if it's used in between, I missed it 2008-11-22 01:42 ah, I may removed it 2008-11-22 01:42 guess_extent was using it 2008-11-22 01:42 right, and guess_extent makes no sense here 2008-11-22 01:42 because this can only map one block at a time 2008-11-22 01:43 a big constriction, which is my the mpage stuff was made 2008-11-22 01:43 btw, it can map multiple blocks 2008-11-22 01:43 with only one bh_result? 2008-11-22 01:43 b_size tell it 2008-11-22 01:43 ah, send a fake buffer 2008-11-22 01:43 yes 2008-11-22 01:44 generic_* supports that? 2008-11-22 01:44 iirc, readpages and direct-io stuff 2008-11-22 01:44 fancy 2008-11-22 01:44 pushing the interface past the breaking point ;) 2008-11-22 01:44 how does it return more than one physical block? 2008-11-22 01:45 with only one bh->b_blocknr? 
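How a single result buffer can describe more than one block can be sketched as follows, assuming 1K blocks: b_blocknr carries the first physical block and b_size the length of the contiguous run, capped by what the caller asked for. `struct mock_bh`, `struct extent`, and `mock_get_block` are hypothetical user-space stand-ins, not the kernel's types or tux3's get_block.

```c
#include <assert.h>
#include <stddef.h>

enum { BLOCK_BITS = 10 }; /* 1K blocks */

/* Cut-down stand-in for the buffer_head fields get_block fills in. */
struct mock_bh {
	unsigned long b_blocknr; /* first physical block of the run */
	size_t b_size;           /* in: bytes wanted, out: bytes mapped */
	int mapped;
};

/* Hypothetical extent: count logical blocks mapped contiguously. */
struct extent {
	unsigned long logical, physical, count;
};

/* Map iblock and as many following blocks as are physically contiguous. */
static int mock_get_block(const struct extent *map, size_t extents,
			  unsigned long iblock, struct mock_bh *bh_result)
{
	size_t wanted = bh_result->b_size >> BLOCK_BITS;
	for (size_t i = 0; i < extents; i++) {
		const struct extent *e = &map[i];
		if (iblock >= e->logical && iblock < e->logical + e->count) {
			unsigned long offset = iblock - e->logical;
			unsigned long avail = e->count - offset;
			unsigned long blocks = avail < wanted ? avail : wanted;
			bh_result->b_blocknr = e->physical + offset;
			bh_result->b_size = (size_t)blocks << BLOCK_BITS;
			bh_result->mapped = 1;
			return 0;
		}
	}
	bh_result->mapped = 0; /* hole: nothing mapped at iblock */
	return 0;
}
```

Asking for four blocks starting at logical block 1 of a three-block extent at physical block 100 yields b_blocknr 101 and a b_size covering two blocks; the caller has to call again for whatever lies beyond the run.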
2008-11-22 01:45 it can return only contiguous blocks 2008-11-22 01:45 well that's useful 2008-11-22 01:45 when did that go in? 2008-11-22 01:46 iirc, it was with direct-io 2008-11-22 01:46 first, it was get_blocks 2008-11-22 01:46 later, get_blocks merged to get_block 2008-11-22 01:46 about 5 years ago 2008-11-22 01:46 maybe 2008-11-22 01:48 replace blockread() by read_mapping_page() in dir.c 2008-11-22 01:48 what do you think? 2008-11-22 01:48 it will depend on page, not buffer 2008-11-22 01:48 won't work for 1K fs 2008-11-22 01:48 no 2008-11-22 01:49 it calls ->readpage 2008-11-22 01:49 blockread should work, why replace it? 2008-11-22 01:49 to read via dir->mapping 2008-11-22 01:50 current blockread() reads buffer by sb_bread() 2008-11-22 01:50 blockread already reads via the mapping, it just goes through an extra function to do it 2008-11-22 01:50 you mean, read without creating buffers 2008-11-22 01:50 ah 2008-11-22 01:50 in my blockread.patch 2008-11-22 01:51 it doesn't create page cache 2008-11-22 01:51 it just use sb_bread 2008-11-22 01:51 ok, check out ext3_bread 2008-11-22 01:52 oh 2008-11-22 01:52 it seems to use sb_getblk() 2008-11-22 01:52 hmm, it bloated up a bit since I wrote it ;) 2008-11-22 01:53 ah, so ext3 uses buffer cache for dir 2008-11-22 01:53 and ext2 uses page cache 2008-11-22 01:53 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-22 01:54 the opposite 2008-11-22 01:54 well 2008-11-22 01:54 then both have the dir in file page cache 2008-11-22 01:54 ext3 accesses it as blocks 2008-11-22 01:54 ext2 accesses as pages 2008-11-22 01:54 ext3 has to do that because of htree and journalling 2008-11-22 01:55 ext3 uses sb_getblk()? 2008-11-22 01:55 something like it 2008-11-22 01:55 so, it's buffer cache? 2008-11-22 01:55 no, it's operating on a file page cache 2008-11-22 01:56 um... sb_getblk() doesn't use buffer cache? 
2008-11-22 01:56 ext3_getblk is confusing because of all the journal stuff attached 2008-11-22 01:57 let me check it... 2008-11-22 01:57 sb_getblk assumes the buffer cache, that is the blockdev mapping 2008-11-22 01:58 so it can't use that for the dir 2008-11-22 01:58 it could possibly use __getblk 2008-11-22 01:58 http://lxr.linux.no/linux+v2.6.27.5/fs/ext3/inode.c#L1012 2008-11-22 01:58 yes 2008-11-22 01:58 but it doesn't seem to 2008-11-22 01:58 instead, it looks like __getblk was cut and pasted into ext3 2008-11-22 01:58 and extended to multiple blocks 2008-11-22 01:59 ah, ext3 does use sb_getblk 2008-11-22 01:59 yes 2008-11-22 02:00 ext3 seems to use page cache only for file data 2008-11-22 02:01 how does that work 2008-11-22 02:01 see dir.c, it's reading blocks from an inode 2008-11-22 02:01 ext3/dir.c? 2008-11-22 02:02 struct inode *inode = filp->f_path.dentry->d_inode; 2008-11-22 02:02 yes 2008-11-22 02:02 yes, synced 2008-11-22 02:02 now I'm on ext3_readdir 2008-11-22 02:07 wow 2008-11-22 02:07 it seems to use ext3_bread -> ext3_getblk -> get_block -> sb_getblk 2008-11-22 02:07 it's chaining through the directory index 2008-11-22 02:08 and doing the directory lookup in the buffer cache 2008-11-22 02:08 just as you said 2008-11-22 02:08 that's braindamage 2008-11-22 02:08 yes 2008-11-22 02:08 I didn't realize this was done to htree ;) 2008-11-22 02:08 maybe jbd really depends on buffer_head for metadata 2008-11-22 02:08 doing all this in a regular file page cache would be lots faster 2008-11-22 02:09 I guess it does 2008-11-22 02:09 what a hack 2008-11-22 02:09 I really had no idea 2008-11-22 02:09 I would have preferred to fix jbd 2008-11-22 02:09 I mean, it can journal data 2008-11-22 02:09 ah, i see 2008-11-22 02:09 so it must be able to journal directory blocks from file page cache 2008-11-22 02:09 well 2008-11-22 02:10 I remember someone (maybe linus) claimed it 2008-11-22 02:10 claimed what? 
2008-11-22 02:10 it should use page cache 2008-11-22 02:10 it sure should 2008-11-22 02:10 I was kind of not involved in the community at the time 2008-11-22 02:11 well 2008-11-22 02:11 ext3 is obsolete ;) 2008-11-22 02:11 yes :) 2008-11-22 02:11 and I suppose ext4 has the same breakage 2008-11-22 02:11 sure 2008-11-22 02:11 no wonder ted didn't respond to me when I explained how much more efficient the file page cache is for directories 2008-11-22 02:12 ok 2008-11-22 02:12 well 2008-11-22 02:12 In think __getblk will work for file page cache 2008-11-22 02:12 try it 2008-11-22 02:12 ah, __getblk 2008-11-22 02:13 oh 2008-11-22 02:13 it woudln't be that hard to fix ext3 I think 2008-11-22 02:13 some time where we have time on our hands 2008-11-22 02:13 it's __getblk(bdev, block, size) 2008-11-22 02:16 __find_get_block looks ok 2008-11-22 02:16 only trivial use of bdev 2008-11-22 02:17 well 2008-11-22 02:17 it has lost track of the mapping 2008-11-22 02:17 and it's going to end up using the blockdev mapping, indeed 2008-11-22 02:17 yes 2008-11-22 02:18 it seems so 2008-11-22 02:18 let me see if I can find my original ext2_bread 2008-11-22 02:18 thanks 2008-11-22 02:31 http://lwn.net/2001/0503/a/directory-index.php3 2008-11-22 02:31 [RFC] Ext2 Directory Index in page cache 2008-11-22 02:31 ancient patch 2008-11-22 02:31 i see 2008-11-22 02:32 now I'm reading... 
2008-11-22 02:33 if (!buffer_uptodate(bh)) 2008-11-22 02:33 wait_on_buffer(bh); <- seems unnecessary 2008-11-22 02:36 I see that code also does ext2_get_block 2008-11-22 02:36 I think that's unnecssary 2008-11-22 02:38 if that work like sb_getblk(), ext2_get_block would be needed 2008-11-22 02:38 I think 2008-11-22 02:38 can just leave the block unmapped 2008-11-22 02:38 sorry 2008-11-22 02:38 leave the buffer unmapped 2008-11-22 02:38 and let ext2_bread map it 2008-11-22 02:39 so, this code is inefficient for creating new files 2008-11-22 02:39 well 2008-11-22 02:39 actually the get_block has to be done sometime 2008-11-22 02:39 yes 2008-11-22 02:41 ok, so should we make a tux3_getblk like that? 2008-11-22 02:41 um... I'm not sure 2008-11-22 02:41 or just work like file data? 2008-11-22 02:42 the dirops want blocks 2008-11-22 02:43 ext2 dirops can be hacked to work directly on pages, that's how ext2/dir.c works 2008-11-22 02:43 but it isn't pretty 2008-11-22 02:43 and it doesn't extend to btree directories 2008-11-22 02:43 i see 2008-11-22 02:43 the key parts of that ext2_getblk are: grab_page_cache; then put buffers on it 2008-11-22 02:44 I didn't really know what I was doing when I wrote it ;) 2008-11-22 02:44 that wait_on_buffer doesn't make any sense 2008-11-22 02:45 btw, btree directory means phtree? 2008-11-22 02:45 yes 2008-11-22 02:45 if phtree uses logical index, it can use page cache? 2008-11-22 02:46 the btree is mapped into the inode page cache 2008-11-22 02:46 so it has to work in blocks 2008-11-22 02:46 can't work directly on pages 2008-11-22 02:47 i imaged some node points to another node 2008-11-22 02:48 http://web.archive.org/web/20030604184516/people.nl.linux.org/~phillips/htree/dx.pcache-2.4.4-6 2008-11-22 02:48 this has ext2_bread 2008-11-22 02:48 let me try to read a bit 2008-11-22 02:50 see I improved it so ext2_getblk doesn't always do the get_block 2008-11-22 02:51 use page cache, but needs block resolution? 
2008-11-22 02:51 yes 2008-11-22 02:51 i see 2008-11-22 02:51 that's the big problem 2008-11-22 02:51 that's why I'm doing the handles thing, to make that hurt less 2008-11-22 02:52 ext2_getblk should not be calling ext2_get_block, actually 2008-11-22 02:52 it should just leaf the buffer unmapped 2008-11-22 02:53 and it will be mapped when it is written 2008-11-22 02:53 i see 2008-11-22 02:54 if the block already exists on disk, it will be mapped when ext2_bread tries to see if a file already exists 2008-11-22 02:54 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-22 02:54 but it will work the way it is 2008-11-22 02:54 did work, very well 2008-11-22 02:54 and eventually evolved into sb_getblk ;) 2008-11-22 02:55 though it's hardly recognizable now 2008-11-22 02:55 FWIW, read_mapping_page(), then adding buffer_head is also work? 2008-11-22 02:55 yes 2008-11-22 02:55 same idea 2008-11-22 02:55 i see 2008-11-22 02:56 I'm sure you can write this better than I did 2008-11-22 02:56 oh, I'm not sure :) 2008-11-22 02:56 well, I'll try 2008-11-22 02:58 ok, I'll go back to my fs-private blockdev mapping experiment 2008-11-22 02:58 anyway, maybe, I think I understand what did you think.. 2008-11-22 02:58 it's very simple 2008-11-22 02:58 ok 2008-11-22 02:59 just find the cache page, adding it if it's not already there, put buffers on if it does not already have them, then follow the list to the right buffer and return it 2008-11-22 02:59 the list -> page->private buffers list 2008-11-22 03:00 ok 2008-11-22 03:00 for right now, we will use buffer_head? 2008-11-22 03:00 in your hack 2008-11-22 03:00 ok 2008-11-22 03:00 and we will decide if we are going to use the new handles pretty soon 2008-11-22 03:00 maybe tomorrow 2008-11-22 03:01 oh 2008-11-22 03:01 ok 2008-11-22 03:01 I mean, we will decide then, not start using them then ;-) 2008-11-22 03:01 :) 2008-11-22 03:01 did you look at my handle.c? 
2008-11-22 03:01 it's very simple right now 2008-11-22 03:01 some funny shifts 2008-11-22 03:02 yes, code of userspace 2008-11-22 03:02 and some ifdefs 2008-11-22 03:02 right, untested locking code 2008-11-22 03:02 cut and paste from buffer.c 2008-11-22 03:03 I think that per se is very good 2008-11-22 03:03 I think issue is how can we fit current code 2008-11-22 03:03 by using it to define blockread() 2008-11-22 03:03 I think issue is how can we add current kernel 2008-11-22 03:03 and things like that 2008-11-22 03:04 we have wrappers for most of it, so we just use them 2008-11-22 03:04 I'll test this in kernel tomorrow 2008-11-22 03:04 after I sleep 2008-11-22 03:04 and write some more of the functions 2008-11-22 03:04 blockget and blockread 2008-11-22 03:04 i.e., getblk and bread 2008-11-22 03:05 ok 2008-11-22 03:05 like that ext2 code, modernized and changed from buffers to handles 2008-11-22 03:06 yes 2008-11-22 03:06 also slabs for the handles 2008-11-22 03:06 ok, time to sleep 2008-11-22 03:06 good night 2008-11-22 03:06 amazing patches :) 2008-11-22 03:07 bye 2008-11-22 08:31 -!- ajonat(~ajonat@190.48.125.11) has joined #tux3 2008-11-22 10:24 -!- ajonat(~ajonat@190.48.125.11) has joined #tux3 2008-11-22 14:32 folks 2008-11-22 14:51 ok, time for a new fs: hackfs 2008-11-22 14:51 a filesystem all in one file (hackfs.c) 2008-11-22 14:52 just to demo/develop the block handle technque 2008-11-22 15:01 nice 2008-11-22 15:16 ok, hackfs is up and running, one file, 259 lines long 2008-11-22 15:16 including the vecio support and test code 2008-11-22 15:17 and has a bug :) 2008-11-22 15:17 root@usermode:~# umount /mnt 2008-11-22 15:17 BUG: Dentry 09858898{i=95,n=foo} still in use (1) [unmount of hack ubdb] 2008-11-22 15:17 BUG: failure at fs/dcache.c:640/shrink_dcache_for_umount_subtree()! 2008-11-22 15:17 Kernel panic - not syncing: BUG! 
2008-11-22 15:17 EIP: 0073:[<400d355d>] CPU: 0 Not tainted ESP: 007b:bfa5c72c EFLAGS: 00200246 2008-11-22 15:17 Not tainted 2008-11-22 15:17 EAX: ffffffda EBX: 08054358 ECX: 0804fa27 EDX: 08054388 2008-11-22 15:17 ESI: 08054359 EDI: 0804fa20 EBP: 08054358 DS: 007b ES: 007b 2008-11-22 15:17 09d39d64: [<0806a177>] show_regs+0xb4/0xb9 2008-11-22 15:17 09d39d90: [<08059826>] panic_exit+0x25/0x3b 2008-11-22 15:18 09d39da4: [<08083932>] notifier_call_chain+0x27/0x53 2008-11-22 15:18 09d39dcc: [<08083975>] __atomic_notifier_call_chain+0x17/0x19 2008-11-22 15:18 09d39ddc: [<0808398c>] atomic_notifier_call_chain+0x15/0x17 2008-11-22 15:18 09d39df8: [<080700d2>] panic+0x52/0xdd 2008-11-22 15:18 09d39e18: [<080b9446>] shrink_dcache_for_umount_subtree+0x120/0x1bd 2008-11-22 15:18 09d39e38: [<080b9c5d>] shrink_dcache_for_umount+0x4e/0x5d 2008-11-22 15:18 09d39e44: [<080ac2c7>] generic_shutdown_super+0x17/0xb9 2008-11-22 15:18 09d39e58: [<080ac37c>] kill_block_super+0x13/0x27 2008-11-22 15:18 09d39e68: [<080ac425>] deactivate_super+0x4b/0x62 2008-11-22 15:18 this is because it's still using the ramfs inode cleanup, hirofumi already fixed that in fs/tux3 2008-11-22 15:18 but I must skate before fixing 2008-11-22 15:19 then we will have a nice, separate platform to develop block handles 2008-11-22 15:40 -!- tim_dimm(~timothyhu@cpe-76-90-122-49.socal.res.rr.com) has joined #tux3 2008-11-22 17:22 -!- tim_dimm(~timothyhu@cpe-76-90-122-49.socal.res.rr.com) has joined #tux3 2008-11-22 17:40 fixed 2008-11-22 17:41 hi 2008-11-22 17:41 needed d_genocide(sb->s_root); 2008-11-22 17:41 as usual, this vfs feature is very well documented :p 2008-11-22 17:41 hi hirofumi 2008-11-22 17:41 ok, I have hackfs just about ready to post 2008-11-22 17:41 use kill_litter_super? 
2008-11-22 17:42 also need to release the block device 2008-11-22 17:42 i see 2008-11-22 17:42 well 2008-11-22 17:42 it's a crazy hybrid of ramfs and a block fs 2008-11-22 17:42 just for testing 2008-11-22 17:43 for read side, do we need buffer_head or handle_t? 2008-11-22 17:43 these vfs functions are quite sensibly designed, but the names are stupid, there are no comments, and no documentation 2008-11-22 17:43 would get an 'f' in most compsci classes 2008-11-22 17:43 handle_t 2008-11-22 17:43 when we have handle support 2008-11-22 17:43 there will be no buffers at all 2008-11-22 17:44 did you mean for now, or later? 2008-11-22 17:44 now and later 2008-11-22 17:44 for now we do not need handle_t 2008-11-22 17:44 later we will not need struct buffer_head 2008-11-22 17:44 while writing page cache support for blockread() 2008-11-22 17:44 you mean, what can we write now to make the bit edit easier? 2008-11-22 17:45 like typedef struct buffer handle_t ? 2008-11-22 17:45 I feel we don't need to change state of buffer_head 2008-11-22 17:45 how about typedef struct buffer buffer_t ? 2008-11-22 17:45 no 2008-11-22 17:46 ok, you were asking something else I guess 2008-11-22 17:46 ah, at first, I'm assuming we need to lock page->private to add something 2008-11-22 17:47 to add something to page->private 2008-11-22 17:47 we do, if we want it to work on smp 2008-11-22 17:47 where there is no other lock 2008-11-22 17:47 for dirops, there is a lock 2008-11-22 17:47 the parent directory i_mutex 2008-11-22 17:48 so no lock needed on any directory block I think 2008-11-22 17:48 ah, i see 2008-11-22 17:48 bitmap blocks need locks for smp 2008-11-22 17:48 wait a bit 2008-11-22 17:48 and file index, btree index, btree leaves and dleafs 2008-11-22 17:49 vm shrinker is not shrink buffer_head or handle_t? 2008-11-22 17:49 doesn't shrink 2008-11-22 17:49 you are asking how we will shrink? 
2008-11-22 17:49 no 2008-11-22 17:49 oh right 2008-11-22 17:50 it shrinks buffer cache, yes 2008-11-22 17:50 but we will do it differently with block handles 2008-11-22 17:50 and always remove them as soon as they have no users 2008-11-22 17:51 we can give them back to slab, maybe 2008-11-22 17:51 so, I'm thinking to add page->private is not efficient on read side 2008-11-22 17:51 compared to what? 2008-11-22 17:51 s/I'm thinking/I wonder/ 2008-11-22 17:51 what kind of read? 2008-11-22 17:51 file data? 2008-11-22 17:51 page = grab_cache_page(mapping, index); 2008-11-22 17:51 if (!page) 2008-11-22 17:51 return NULL; 2008-11-22 17:51 if (!page_has_buffers(page)) 2008-11-22 17:51 create_empty_buffers(page, tux_sb(inode->i_sb)->blocksize, 0); 2008-11-22 17:51 bh = page_buffers(page); 2008-11-22 17:51 while (offset--) 2008-11-22 17:51 bh = bh->b_this_page; 2008-11-22 17:52 get_bh(bh); 2008-11-22 17:52 unlock_page(page); 2008-11-22 17:52 page_cache_release(page); 2008-11-22 17:52 you are completely right 2008-11-22 17:52 "read side" means users don't change state 2008-11-22 17:52 that is because as it is written, filemap.c works a block at a time 2008-11-22 17:52 it takes lock_page() 2008-11-22 17:53 right, in only needs to know that page->data remains valid, which it knows by holding a use count on the page 2008-11-22 17:53 but, I wonder we can fake it on stack? 2008-11-22 17:53 without lock_page 2008-11-22 17:54 what object would go on stack? 2008-11-22 17:54 handles? 2008-11-22 17:54 yes handles 2008-11-22 17:54 that's a cool idea 2008-11-22 17:55 I was thinking of other, messier ideas 2008-11-22 17:55 how are we sure that somebody won't add handles to the page while we are processing the fake handles? 2008-11-22 17:56 stack of user don't care about it, because doesn't tack other than page->data 2008-11-22 17:57 tack? 2008-11-22 17:57 users of stack 2008-11-22 17:57 touch? 
2008-11-22 17:57 yes 2008-11-22 17:57 s/tack/touch/ 2008-11-22 17:58 I think it works fine 2008-11-22 17:58 blockget() can't use stack, because it will change state 2008-11-22 17:58 yes 2008-11-22 17:59 yes, it may work 2008-11-22 17:59 what if somebody else reads the page before we do? 2008-11-22 18:00 what does "we do" means 2008-11-22 18:01 what do we do? 2008-11-22 18:02 our code gets a ->readpage call, and sets up a read bio without locking the page. At the same time, somebody else gets a ->readpage call and sets up a bio for the same page 2008-11-22 18:02 is it lock_page that normally prevents this 2008-11-22 18:02 yes 2008-11-22 18:03 so you are just trying to avoid taking lock_page and releasing four times for a 1K blocksize? 2008-11-22 18:03 to read page, page should be uptodate already 2008-11-22 18:03 yes 2008-11-22 18:03 I don't think we have to worry about that, nobody really cares about performance of 1K filesystems ;) 2008-11-22 18:04 oh 2008-11-22 18:04 and for a 4K filesystem, we have to take the page lock anyway 2008-11-22 18:04 page lock protects the page state, i.e., uptodate, as I understand it 2008-11-22 18:05 hmm 2008-11-22 18:05 no 2008-11-22 18:05 protects uptodate, but not dirty? 2008-11-22 18:05 and we have to add handles to page->private even if 4k filesystem 2008-11-22 18:05 yes 2008-11-22 18:06 we want to create our own version of mpage.c functions that don't add handles 2008-11-22 18:07 where the IO is on even page boundaries 2008-11-22 18:07 we just add handles at partial beginning and ending pages 2008-11-22 18:07 like mpage.c does not I think, it adds buffers there 2008-11-22 18:07 but, blockread() returns handles? 2008-11-22 18:07 by falling back to block_read/write_full_page 2008-11-22 18:08 yes 2008-11-22 18:08 returns a handle 2008-11-22 18:08 so, we have to add handle to page->private for read? 
2008-11-22 18:08 "like mpage.c does now" I meant 2008-11-22 18:09 s/read/blockread/ 2008-11-22 18:09 ah 2008-11-22 18:09 yes, if we are reading with blockread 2008-11-22 18:09 but we will optimize that 2008-11-22 18:09 to mostly read full pages 2008-11-22 18:09 yes 2008-11-22 18:09 and leave page->private NULL 2008-11-22 18:10 yes 2008-11-22 18:10 we can do that optimization later, it will be satisfying I think 2008-11-22 18:10 ok 2008-11-22 18:10 it will be like mpage.c, but not nearly as messy I hope 2008-11-22 18:11 and I also hope that our single page read/write code will be much cleaner than the buffer.c code 2008-11-22 18:12 i see 2008-11-22 18:12 our handles are 12 times more memory efficient than buffer rings 2008-11-22 18:12 for a 1K filesystem 2008-11-22 18:12 which nobody cares about ;-) 2008-11-22 18:12 yes 2008-11-22 18:13 24 times more efficient for 512 byte blocks, our smallest 2008-11-22 18:13 and 3 times more efficient for 4K blocks, people care about that 2008-11-22 18:13 way more efficient for L1 cache 2008-11-22 18:13 and if we can handle highmem with it 2008-11-22 18:14 in some case, it's much help 2008-11-22 18:14 yes, that will be a nice improvement 2008-11-22 18:14 so directories are not all in kernel memory 2008-11-22 18:15 but 32 bit machines are dying out 2008-11-22 18:15 new machines with huge disks are all 64 bit 2008-11-22 18:15 however, my machine is 32 bit :) 2008-11-22 18:15 yes 2008-11-22 18:16 pentium-m 2008-11-22 18:16 I wonder if the eee is 32 bit 2008-11-22 18:16 I expect so 2008-11-22 18:16 I think enterprise people don't want to change environment 2008-11-22 18:17 it's a celeron, whatever that means 2008-11-22 18:17 that is true 2008-11-22 18:17 it will be 5 years before 32 bit linux servers become rare in machine rooms 2008-11-22 18:17 anyway, it is easy to support 2008-11-22 18:18 it's a definite improvement for anybody who has that kind of machine 2008-11-22 18:18 yes 2008-11-22 18:18 also, we are going to be _far_ more 
cpu efficient than ext3 dirops, by using radix tree probes where ext3 calls ext3_get_block 2008-11-22 18:19 ah, i see 2008-11-22 18:19 should really fix that in ext3, it must be costly under some common loads 2008-11-22 18:19 there is other improvements of page cache? 2008-11-22 18:20 for dir 2008-11-22 18:20 radix tree lookups are much more efficient that get_block, which has to decode inode fields and index blocks 2008-11-22 18:20 _plus_ do radix tree lookups 2008-11-22 18:21 so, if you have a double indirect directory block 2008-11-22 18:21 get block has to do 3 radix tree probes on the blockdev, and decode two index blocks 2008-11-22 18:22 whereas if the lookup is in page cahce, it is just one radix tree probe 2008-11-22 18:22 get block is getblk in here? 2008-11-22 18:22 and that is in a much smaller radix tree than the blockdev 2008-11-22 18:22 ext3_get_block 2008-11-22 18:22 i see 2008-11-22 18:22 has to go looking in multiple blocks on the blockdev 2008-11-22 18:23 so the cpu advantage and reduced cache pressure is enormous by having the dir in file page cache instead of blockdev page cache 2008-11-22 18:24 ah, i see 2008-11-22 18:24 ok, the eee is 32 bit 2008-11-22 18:25 and it's likely to stay that way for a long time 2008-11-22 18:25 very hard to justify the extra heat of a 64 bit processor 2008-11-22 18:25 however, it also does not need highmem 2008-11-22 18:25 2 G max in the current generation 2008-11-22 18:26 so, in 1 1/2 years it will be 4 G max, and we need highmem for that 2008-11-22 18:26 so highmem is worth doing, because of the eee alone 2008-11-22 18:28 which cpu? pentium-m? 
2008-11-22 18:28 yes, and the new ones are atom 2008-11-22 18:28 I doubt there will be a 64 bit atom for a long time 2008-11-22 18:28 not compatible with long battery life 2008-11-22 18:29 i see 2008-11-22 18:29 maybe 32 bit machines have ten more years to live 2008-11-22 18:29 or maybe they will always be more power efficient than 64 bit, for the same work 2008-11-22 18:30 ah, so it may live always 2008-11-22 18:30 it might 2008-11-22 18:30 depends if 64 bit turns out to be more power efficient or not 2008-11-22 18:31 right now it isn't 2008-11-22 18:31 sounds like it hard to do 2008-11-22 18:31 but if we look back at history, 32 bit ended up more power efficient than 16 bit, because the 16 bit machines had to do a lot more instructions to process the same data 2008-11-22 18:32 i see 2008-11-22 18:32 anyway, we know what the right thing to do now is ;) 2008-11-22 18:32 :) 2008-11-22 18:41 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-11-22 18:51 hi bushman 2008-11-22 18:52 ok, time to clean up hackfs and get it ready for posting 2008-11-22 18:52 hello 2008-11-22 18:52 sounds like you had busy few days 2008-11-22 18:53 hirofumi was driving it 2008-11-22 18:53 got a FS-related question, if you got a minute 2008-11-22 18:53 he got tux3 partially working in kernel 2008-11-22 18:53 always 2008-11-22 18:53 http://wiki.archlinux.org/index.php/JFS#Deadline_I.2FO_Scheduler 2008-11-22 18:54 this says that JFS works best with the Deadline scheduler 2008-11-22 18:54 why would a certain FS work better with a certain scheduler? 
2008-11-22 18:54 common wisdom is that jfs is broken, regardless of scheduler 2008-11-22 18:54 yea, but this is just an example 2008-11-22 18:55 well, the linux scheduling design approach is essentially "random" 2008-11-22 18:55 so that causes random results with random combinations 2008-11-22 18:55 it's a nearly completely unrepresented area of kernel development 2008-11-22 18:56 IO scheduling I meant 2008-11-22 18:56 yes, but stats say that with truly random inputs and enough samples, everything should average out 2008-11-22 18:56 task scheduling gets a lot of attention 2008-11-22 18:56 io scheduling has one guy who ever hacks on it 2008-11-22 18:56 and it doesn't really seem to be his area 2008-11-22 18:57 so if people get different results, then it's not that random 2008-11-22 18:57 it's random with few sample points 2008-11-22 18:58 it's a mistake to confuse random with uniform ;) 2008-11-22 18:58 here, we have a radom "penquin distribution" 2008-11-22 18:58 but should there be a particular synergy, positive or negavite, between task scheduling and FS? 2008-11-22 18:59 there are also herring distributions 2008-11-22 18:59 (poisson, get it) 2008-11-22 18:59 ACTION this is getting exponentially geeky ;) 2008-11-22 18:59 task scheduling also matters, but it's IO scheduling you're talking about right? 2008-11-22 19:00 do'h, yes, i meant IO not task 2008-11-22 19:01 I don't know the details of either jfs or deadline scheduler, so I can't comment on just why they interact 2008-11-22 19:02 but it is not surprising at all that they do 2008-11-22 19:02 no no, i'm asking a much broader question, are there/should there be changes like that, and why or why not? 2008-11-22 19:02 if I had to guess... 2008-11-22 19:03 write backlogs can starve reads with a completely naive scheduler 2008-11-22 19:03 reads are generally synchronous, while writes are parallel 2008-11-22 19:03 are schedulers even aware of read to write ratios? 
2008-11-22 19:04 so, with some notition of realtime scheduling, those starved reads can be bumped to near the front of the queue 2008-11-22 19:04 instead of having to work their way all the way up 2008-11-22 19:04 there's some attempt in these schedulers, yes 2008-11-22 19:04 one could call it "barefoot tech" 2008-11-22 19:04 like lots of parts of Linux 2008-11-22 19:05 starts completely crappy, then gets improved over time 2008-11-22 19:05 this part of linux is barely out of the completely crappy phase 2008-11-22 19:06 you'd think that with the amount of servers running linux by now someone would have a serious need to fix this stuff up properly 2008-11-22 19:06 irix is miles ahead of linux in io sheduling 2008-11-22 19:07 linux servers suck for storage, nobody told you? 2008-11-22 19:07 linux servers don't suck for web serving 2008-11-22 19:07 so that's what most linux servers are 2008-11-22 19:07 heh, good luck telling dotcommers that they have to deal with solaris or something without gnu-utils 2008-11-22 19:08 a lot of them are switching 2008-11-22 19:08 i tried once, didnt end well 2008-11-22 19:08 dot commers don't really use storage servers, i.e., NAS 2008-11-22 19:08 in their colo farms anyway 2008-11-22 19:08 well 2008-11-22 19:09 yahoo does ;) 2008-11-22 19:09 big mistake 2008-11-22 19:09 one of many 2008-11-22 19:09 i wanna ask 'so how would you design a high throughput storage for a dotcom scenario' but i have a feeling that'd end in a 4hr conversation ;) 2008-11-22 19:10 I'd tear out the block layer and rewrite it 2008-11-22 19:10 not the low level part, that's pretty good 2008-11-22 19:10 scatter gather and hardware drivers 2008-11-22 19:11 but the upper level, the request queues, block devices, elevators... 
bleah 2008-11-22 19:11 really awful 2008-11-22 19:11 the concept of "run the io queues" is really broken 2008-11-22 19:11 likewise the plug/unplug idea 2008-11-22 19:12 (those are really the same problem) 2008-11-22 19:12 the request queue api is horrifying 2008-11-22 19:13 to round it out, lvm is beyond awful 2008-11-22 19:13 all of this has combined to create a decent market opportunity for sun 2008-11-22 19:13 we really had to work at that ;) 2008-11-22 19:13 ah, nice troll bushman ;) 2008-11-22 19:14 so why is sun is fireing people? :( 2008-11-22 19:14 past excesses 2008-11-22 19:15 their server biz is the one bright spot 2008-11-22 19:15 man you make my brain run around and change sentence structure half way through a sentence... better make some tea 2008-11-22 19:17 so would you currently do for running a dotcom-ish shared storage sort of thing, multiple webservers needing to access even more storage boxes 2008-11-22 19:17 slit my wrists 2008-11-22 19:42 flips, why don't you use linus tree for git? 2008-11-22 19:43 why not "git clone" of linus tree? 2008-11-22 19:43 suggest a url and I will use it 2008-11-22 19:43 well 2008-11-22 19:44 I thought it would help us to work on a stable tree for a while 2008-11-22 19:44 yes 2008-11-22 19:44 I think we should keep doing that, at least for another month 2008-11-22 19:44 then rebase to something 2008-11-22 19:44 I don't care if it's intent 2008-11-22 19:44 -!- tim_dimm(~timothyhu@cpe-76-90-122-49.socal.res.rr.com) has joined #tux3 2008-11-22 19:45 yes 2008-11-22 19:45 can we rebase to linus tree and preserve the commit history? 
2008-11-22 19:45 yes, I think so 2008-11-22 19:45 ah, no 2008-11-22 19:46 I didn't think so 2008-11-22 19:46 maybe current tree needs some trink to convert linus tree 2008-11-22 19:46 git doesn't seem to do that awfully well 2008-11-22 19:47 probably it can 2008-11-22 19:47 ok, well when we get close to that, let's figure out how 2008-11-22 19:47 but, maybe it become odd history 2008-11-22 19:47 keeping the history is kind of nice 2008-11-22 19:47 let's see how it looks now 2008-11-22 19:48 such commit messages as: "syncio is two lines shorter" 2008-11-22 19:48 :p 2008-11-22 19:48 let's worry about it a month from now 2008-11-22 19:49 ok 2008-11-22 19:50 I'm going to add a second head to the repository 2008-11-22 19:50 for hackfs 2008-11-22 19:50 I'd rather not create a new git tree for that 2008-11-22 19:50 branch? 2008-11-22 19:50 yes 2008-11-22 19:50 branch from the original commit 2008-11-22 19:50 sounds good 2008-11-22 19:51 ok, should be pretty soon 2008-11-22 19:51 it can't "git pull" from ../tux3fs/.git 2008-11-22 19:52 well, we'll need to apply change via patch in future 2008-11-22 19:55 I thought that problem was because of "update-server-info" before 2008-11-22 19:55 anway, I will set up git: 2008-11-22 19:56 and we can get a repo at kernel.org maybe 2008-11-22 19:56 ah, no 2008-11-22 19:56 it's no problem 2008-11-22 19:57 our tree is not based on linus tree 2008-11-22 19:57 so, sha1 of commit is differenct with linus tree 2008-11-22 19:57 "git pull" will try to merge base tree of linux too 2008-11-22 19:57 I see what you meant 2008-11-22 19:58 but, it's not big problem 2008-11-22 19:58 some git hacker knows how to rebase between trees with history ;) 2008-11-22 19:58 well, we can do merge via patch at least 2008-11-22 19:59 I think "git format-patch | git am" will do 2008-11-22 19:59 sure 2008-11-22 20:01 btw, I've put some patches for userland 2008-11-22 20:01 http://userweb.kernel.org/~hirofumi/tux3/ 2008-11-22 20:01 could you review later? 
2008-11-22 20:02 one cleanup patch, and others sync to kernel 2008-11-22 20:04 I'll look right now 2008-11-22 20:04 thanks 2008-11-22 20:11 folks 2008-11-22 20:11 hi 2008-11-22 20:59 hirofumi, everything looks fine 2008-11-22 21:00 thanks 2008-11-22 21:00 having to decode32 to a temporary is a little annoying 2008-11-22 21:00 but you are right, the type is different across arches 2008-11-22 21:00 yes 2008-11-22 21:01 I think the compiler will optimize the temporary store away though 2008-11-22 21:01 ah, maybe it does 2008-11-22 21:01 because decode32 and everything down to the word swap is inline 2008-11-22 21:01 so it just looks ugly, it's not actually ugly ;) 2008-11-22 21:01 I'll pull now 2008-11-22 21:02 thanks 2008-11-22 21:02 btw, xattr can't list currently? 2008-11-22 21:02 btw, xattr can't support "list" operation currently? 2008-11-22 21:02 true 2008-11-22 21:03 ok 2008-11-22 21:03 I'll add that to my todo list 2008-11-22 21:03 we will able to read file data with next patches 2008-11-22 21:04 :) 2008-11-22 21:04 probably, almost all read operations was done 2008-11-22 21:05 so generic_read will work maybe? 2008-11-22 21:05 yes 2008-11-22 21:05 and splice too :) 2008-11-22 21:06 :) 2008-11-22 21:06 that's just by making ->readpage work right? 2008-11-22 21:06 yes 2008-11-22 21:06 using the good old block IO lib 2008-11-22 21:06 I'm glad I didn't try to do handles first 2008-11-22 21:07 :) 2008-11-22 21:08 pulled 2008-11-22 21:09 ok, now to get hackfs ready 2008-11-22 21:09 ok 2008-11-22 21:12 now I got rid of fs/hackfs/Makefile 2008-11-22 21:12 so hackfs really is just one file 2008-11-22 21:13 make fs/hackfs/hackfs.o ? 
2008-11-22 21:14 exactly 2008-11-22 21:14 patch adds fs/hackfs.c, changes fs/Makefile and fs/Kconfig 2008-11-22 21:15 and gives a block filesystem that doesn't read or write backing store ;-) 2008-11-22 21:15 so the lkml post will be [ANNOUNCE] Hackfs: a useless filesystem 2008-11-22 21:15 ah, i see 2008-11-22 21:16 of course it is useful, but not for storing files 2008-11-22 21:16 it will be useful for getting feedback on the block handles idea, also having our own blockdev cache 2008-11-22 21:17 I'd really like to hear andrew's opinion on that 2008-11-22 21:17 sounds good 2008-11-22 21:18 yes, probably, akpm has some opinions 2008-11-22 21:47 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-11-22 23:13 git doesn't include a new file in git diff, even after git adding it 2008-11-22 23:13 that is "surprising" 2008-11-22 23:14 some option is needed 2008-11-22 23:15 --dont-be-stupid 2008-11-22 23:16 :) 2008-11-22 23:16 git diff HEAD 2008-11-22 23:17 that worked 2008-11-22 23:17 I would never have guessed it, ever 2008-11-22 23:17 $ git diff (1) 2008-11-22 23:17 $ git diff --cached (2) 2008-11-22 23:17 $ git diff HEAD (3) 2008-11-22 23:17 1. Changes in the working tree not yet staged for the next commit. 2008-11-22 23:17 2. Changes between the index and your last commit; what you would 2008-11-22 23:17 be committing if you run "git commit" without "-a" option. 2008-11-22 23:17 3. 
Changes in the working tree since your last commit; what you 2008-11-22 23:17 would be committing if you run "git commit -a" 2008-11-22 23:17 man says something 2008-11-22 23:18 3 should be the default 2008-11-22 23:18 well 2008-11-22 23:18 yes, it would be bug 2008-11-22 23:18 I did try git-diff -a 2008-11-22 23:19 or their have historical reason can't change it 2008-11-22 23:19 could have "legacy" and "sensible" mode 2008-11-22 23:19 after figuring out what people really hate 2008-11-22 23:20 this has to be one of those 2008-11-22 23:20 that and commit -a 2008-11-22 23:20 enough to drive people to mercurial 2008-11-22 23:20 which is sad, because it's really cosmetic 2008-11-22 23:20 well 2008-11-22 23:20 I'll save my complaining for #git 2008-11-22 23:21 well, I use patch, so I don't care :) 2008-11-22 23:21 I'm using patch to move hackfs from branch off master to branch off original commit 2008-11-22 23:21 git format-patch? 2008-11-22 23:22 ah 2008-11-22 23:23 I'm using patch for temporary work like quilt 2008-11-22 23:23 your system looks really cool 2008-11-22 23:23 I can read the database :) 2008-11-22 23:23 now that you explained it 2008-11-22 23:24 :) 2008-11-22 23:24 btw, somehow I broke Makefile dependency in userland 2008-11-22 23:25 *.o doesn't have dependency to #include "*c" 2008-11-22 23:26 we usually write the depedencies explicltly 2008-11-22 23:27 now, "touch filemap.c; make" doesn't update tux3 correctly 2008-11-22 23:27 ok 2008-11-22 23:27 should be easy to fix 2008-11-22 23:27 anyway, the source includes will go away pretty soon 2008-11-22 23:27 userland too? 
2008-11-22 23:28 yes in user 2008-11-22 23:28 oh 2008-11-22 23:28 but includes of kernel/*.c will go in 2008-11-22 23:28 so that all the files in fs/tux3 will also be in user/kernel 2008-11-22 23:29 and be #included from user 2008-11-22 23:29 i see 2008-11-22 23:29 so, we can drop __KERNEL__ on top and tail 2008-11-22 23:31 I'll fix Makefile after that if needed 2008-11-22 23:31 most __KERNEL__ will still be needed 2008-11-22 23:32 all this will do is make it easy to copy files between user/kernel and fs/tux3 2008-11-22 23:32 I imaged it 2008-11-22 23:32 user/balloc.c 2008-11-22 23:32 #include 2008-11-22 23:32 #include "kernel/balloc.c" 2008-11-22 23:32 int main() 2008-11-22 23:32 exactly 2008-11-22 23:33 so, we can drop __KERNEL__ in kernel/balloc.c 2008-11-22 23:33 e.g. for int main() 2008-11-22 23:33 oh yes, for that 2008-11-22 23:33 yes, a lot of __KERNEL__ will go 2008-11-22 23:33 yes 2008-11-22 23:33 that's good 2008-11-22 23:35 if you want to do it, go ahead 2008-11-22 23:35 otherwise I will do it tomorrow 2008-11-22 23:35 ok, if filemap optimization was done 2008-11-22 23:35 tux3_get_block 2008-11-22 23:36 at least, I'm going to add multiple blocks support 2008-11-22 23:37 that is using the buffer_head.b_size feature? 2008-11-22 23:38 yes 2008-11-22 23:39 readahead stuff will work more efficiency more or less 2008-11-22 23:39 ok, I will convert hackfs back from the semaphore synchronizer to a wait queue 2008-11-22 23:40 ok 2008-11-23 00:04 I really don't like the way wait_event quietly adds & to both its parameters 2008-11-23 00:04 would be fine in c++ 2008-11-23 00:04 in linux, it's different from all the atomic ops for example 2008-11-23 00:52 damn, I wanted hackfs to be on a branch in my public repository 2008-11-23 00:52 instead, when I pulled it, git merged it with HEAD 2008-11-23 00:53 why can you it? 2008-11-23 00:53 I think I would have had to create a branch in the public repostory, then pull 2008-11-23 00:54 now... 2008-11-23 00:54 um... 
git checkout -b xxx; git pull 2008-11-23 00:54 yes 2008-11-23 00:54 but now I don't like what I've got 2008-11-23 00:54 any such thing as a rollback in git? 2008-11-23 00:54 um... what happen? 2008-11-23 00:54 git reset --hard 2008-11-23 00:54 http://phunq.net/ddtree?p=tux3fs 2008-11-23 00:55 ah, that's what I want 2008-11-23 00:55 oh 2008-11-23 00:55 it seems you didn't create branch 2008-11-23 00:55 right 2008-11-23 00:58 git always hangs after printing counting objects... done 2008-11-23 00:58 on a pull 2008-11-23 00:58 and the pull is complete as far as I can tell 2008-11-23 00:58 then it just waits forever 2008-11-23 00:59 git gc? 2008-11-23 00:59 shows no activity in top 2008-11-23 00:59 you mean, I should run git gc? 2008-11-23 01:00 yes 2008-11-23 01:00 it will optimize history 2008-11-23 01:01 now, it gitweb, it only shows one branch 2008-11-23 01:01 the currently checked out one 2008-11-23 01:01 http://phunq.net/ddtree?p=tux3fs 2008-11-23 01:01 master and hackfs? 2008-11-23 01:01 ah 2008-11-23 01:02 ah, you want to click hackfs 2008-11-23 01:02 http://phunq.net/ddtree?p=tux3fs;a=shortlog;h=refs/heads/hackfs 2008-11-23 01:02 so to post the url I can give this 2008-11-23 01:03 looks good 2008-11-23 01:03 http://phunq.net/ddtree?p=tux3fs;a=blob;f=fs/hackfs/hackfs.c;h=75420a90b9a5ca39b0ad28754d2329c18a018302;hb=131c2d7a789998b2a7fbc2e6277e99c5002b7d35 2008-11-23 01:03 ugly url ;-) 2008-11-23 01:03 or post patches simpley 2008-11-23 01:04 yes 2008-11-23 01:04 yes 2008-11-23 01:04 like goolge 2008-11-23 01:04 I hate those ;) 2008-11-23 01:04 :) 2008-11-23 01:05 ok, that's it for tonight, only got old code refreshed 2008-11-23 01:05 fixed one bug in hackfs 2008-11-23 01:05 ah, two bugs 2008-11-23 01:05 then you add handles stuff? 
2008-11-23 01:05 added the inode allocation 2008-11-23 01:05 right, tomorrow it gets handles 2008-11-23 01:06 good 2008-11-23 01:06 and I will try to have it working enough to post for comment by monday or tuesday 2008-11-23 01:06 ok 2008-11-23 01:09 I have to read more on memory barriers 2008-11-23 01:09 look in hackfs.c, see the struct syncio.done ? 2008-11-23 01:09 probably, me too 2008-11-23 01:10 I think I need a barrier there, if not an atomic op 2008-11-23 01:10 in case the endio executes on a different cpu 2008-11-23 01:10 otherwise, there is a bunch of wasted locking, I think 2008-11-23 01:11 the spinlock on the wait queue doesn't do anything useful because there is only one reader and only one possible endio 2008-11-23 01:11 the wait queue also never has more than one element 2008-11-23 01:11 lost wake up problem? 2008-11-23 01:12 the spinlock in the wait queue is only to synchronize access to the wait queue 2008-11-23 01:12 and there is only one possible source of wake up 2008-11-23 01:12 so I think the synchronization on the wait queue is redudant in this case 2008-11-23 01:12 and there are many cases like this 2008-11-23 01:13 also, a list of only one element seems redudant 2008-11-23 01:13 that is, the wait queue only ever has one element on it 2008-11-23 01:13 well, it works fine :) 2008-11-23 01:14 and the semaphore is gone 2008-11-23 01:14 could not think of any way to get rid of the done variable 2008-11-23 01:14 it doesn't have lock_page/lock_buffer 2008-11-23 01:14 which doesn't? 2008-11-23 01:15 it waits after submit_bio 2008-11-23 01:15 but, we want to wait before submit_bio? 
2008-11-23 01:15 yes, there is no page cache page involved here 2008-11-23 01:15 this kmalloced reason that test_syncio does IO to is owned exclusively by itself 2008-11-23 01:16 blah 2008-11-23 01:16 the kmalloced region that test_syncio does IO to is owned exclusively by itself 2008-11-23 01:16 so, it hard to talk about barrier/lost wakeup for me 2008-11-23 01:16 it's a hard subject 2008-11-23 01:17 there's no chance of lost wakeup in this situation 2008-11-23 01:17 using wait_event 2008-11-23 01:17 wakeup may happen before wait_event 2008-11-23 01:17 that's ok 2008-11-23 01:18 oh 2008-11-23 01:18 because sync.done will be 1 2008-11-23 01:18 and the process won't go to sleep 2008-11-23 01:18 ah 2008-11-23 01:19 probably competion? 2008-11-23 01:19 completion 2008-11-23 01:19 that's how that race was fixed: first the task sets itself to TASK_IFORGET 2008-11-23 01:19 then it tests the condition (synd.done) 2008-11-23 01:20 then if the condition is false, it schedule()s 2008-11-23 01:20 otherwise it sets itself to TASK_RUNNING 2008-11-23 01:20 and continues 2008-11-23 01:20 yes, it seems wait_for_completion() 2008-11-23 01:21 TASK_IFORGET -> TASK_UNINTERRUPTABLE 2008-11-23 01:21 it's worth reading through these macros 2008-11-23 01:21 it's actually pretty simple, but seems complex 2008-11-23 01:22 mysterious 2008-11-23 01:22 and scary 2008-11-23 01:22 memory barriers are like that I expect 2008-11-23 01:22 there's a good writeup by dhowells in Documentation 2008-11-23 01:23 wait_for_completion() checks ->done in spin_lock 2008-11-23 01:23 wait_for_completion() checks ->done with spin_lock 2008-11-23 01:23 barrier is not needed 2008-11-23 01:24 memory-barriers.txt? 2008-11-23 01:24 yes, it's really good 2008-11-23 01:24 yes 2008-11-23 01:25 wow, sched.c is 9266 lines long now 2008-11-23 01:25 massive bloat up 2008-11-23 01:25 yes 2008-11-23 01:25 cfs? 
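The "wakeup may happen before wait_event, that's ok" point above can be sketched in userspace. This is only an illustration, assuming a pthread mutex and condition variable as stand-ins for the kernel's wait queue and its spinlock; the struct and function names mirror the kernel completion API, but the bodies here are hypothetical and are not the kernel implementation. The waiter tests done under the lock before sleeping, so a completion that fires first is never lost.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical userspace stand-in for the kernel's struct completion. */
struct completion {
	pthread_mutex_t lock;	/* plays the role of the wait queue spinlock */
	pthread_cond_t wait;	/* plays the role of the wait queue itself */
	unsigned done;		/* the "done" flag discussed in the chat */
};

static void init_completion(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->wait, NULL);
	c->done = 0;
}

/* The endio side: record the event under the lock, then wake the waiter. */
static void complete(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done++;
	pthread_cond_signal(&c->wait);
	pthread_mutex_unlock(&c->lock);
}

/* The submitter side: only sleeps while the event has not happened yet. */
static void wait_for_completion(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->wait, &c->lock);
	assert(c->done > 0);
	c->done--;		/* consume the event */
	pthread_mutex_unlock(&c->lock);
}

/* The case from the chat: the wakeup arrives before the wait starts. */
static unsigned demo_wakeup_before_wait(void)
{
	struct completion sync;

	init_completion(&sync);
	complete(&sync);		/* "endio" runs first */
	wait_for_completion(&sync);	/* sees done == 1, does not sleep */
	return sync.done;		/* event consumed, back to 0 */
}
```

In real use complete() would run on another thread or in interrupt context; the single-threaded demo only exercises the ordering that causes a lost wakeup when the flag test and the sleep are not tied together.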
2008-11-23 01:26 I forget 2008-11-23 01:26 that and many other things 2008-11-23 01:27 yes 2008-11-23 01:27 process migration 2008-11-23 01:28 ah, yes 2008-11-23 01:28 it really wants to be a scheduler directory 2008-11-23 01:28 few months ago, I saw a problem around it 2008-11-23 01:29 cpu affinity... 2008-11-23 01:30 yes 2008-11-23 01:31 a lot of rt stuff now 2008-11-23 01:33 probably, -rt tree may have more rt stuff for sched 2008-11-23 01:35 find_busiest_group is a huge function 2008-11-23 01:36 so... this thing is going to be in next year's cell phones ;) 2008-11-23 01:37 oh :) 2008-11-23 01:37 well, they may still use 2.4.x 2008-11-23 01:38 It looks like hackfs should be using completion all right 2008-11-23 01:39 yes 2008-11-23 01:39 well, those will be changed by handles stuff? 2008-11-23 01:40 yes 2008-11-23 01:40 the endio will use handle operations 2008-11-23 01:40 and completion might not be the right thing any more 2008-11-23 01:40 certainly won't be 2008-11-23 01:41 sure 2008-11-23 01:41 but that's fine, it's still good to learn about it 2008-11-23 01:42 yes, completion is useful to avoid lost-wakeup 2008-11-23 01:42 all these flavors of event waiting avoid that 2008-11-23 01:42 only when you role your own, you rediscover the classic race ;) 2008-11-23 01:44 yes :) 2008-11-23 01:44 well, drivers people I know discovered actually 2008-11-23 01:44 it's a pity this sched.c code comes without inline comments, at least on the exported functions 2008-11-23 01:45 everybody who ever wrote a server daemon rediscovers it 2008-11-23 01:45 and there are few tools to deal with in in user space 2008-11-23 01:46 oh 2008-11-23 01:46 getting the synchronizers right should be a fun lkml thread 2008-11-23 01:46 I will try my cmpxchg idea 2008-11-23 01:47 make it generic? 2008-11-23 01:48 that would be hard 2008-11-23 01:48 it will be swapping against an array of 4 bit states 2008-11-23 01:48 set_bit_mask_lock()? 
2008-11-23 01:48 I think the synchronizer will be very nongeneric 2008-11-23 01:49 there is a 3 bit scalar field 2008-11-23 01:49 that is the challenge 2008-11-23 01:49 8 scalar fields and 8 bits per word 2008-11-23 01:50 so the object is to set exactly one of the scalar fields to a new value 2008-11-23 01:50 the first thing you do is read the old value 2008-11-23 01:50 then make a copy of it and change the field to the desired value 2008-11-23 01:51 the cmpxchg will then put the new value into the field only if the field still matches the old value 2008-11-23 01:51 if it doesn't match, then another process changed it 2008-11-23 01:51 set_bit_mask_lock(long bit_array, int mask, int pos_of_mask)? 2008-11-23 01:51 yes 2008-11-23 01:51 so you start again, rereading the field 2008-11-23 01:52 and lock_bits will be excluded? 2008-11-23 01:52 set_bit_mask_lock is fine for the one bit locks 2008-11-23 01:52 yes, the lock_bits are excluded 2008-11-23 01:52 _however_ 2008-11-23 01:52 taht is not necessary, we can take a lock and set a state in one atomic operation if we want 2008-11-23 01:52 or release a lock and set a state in one operation 2008-11-23 01:53 we can also set multiple states and locks in one operation, if that is useful 2008-11-23 01:53 yes 2008-11-23 01:54 as far as I know, this instruction is just as efficient as any other bus locking instruction 2008-11-23 01:54 it's pretty powerful 2008-11-23 01:54 yes 2008-11-23 01:55 cmpxchg would be expensive more or less though 2008-11-23 01:55 as I understand it, memory barriers substitute for bus locks 2008-11-23 01:56 and are more efficient, not causing as big a stall 2008-11-23 01:56 as expensive as any other bus lock 2008-11-23 01:57 probably, on some arch may just use spin_lock for it 2008-11-23 01:57 probably 2008-11-23 01:57 let's see 2008-11-23 01:58 it's a very common synchronizer now 2008-11-23 01:58 has some academic research supporting it 2008-11-23 01:58 tas? 2008-11-23 01:58 test and set? 
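The read, modify, cmpxchg, retry loop described above might look roughly like this. It is a sketch using C11 atomics rather than the kernel's cmpxchg(), and the layout (eight 4-bit state fields packed in one 32-bit word) and the name set_field are assumptions made for illustration. Note that the compare covers the whole word, so a concurrent change to a neighbouring field also forces a retry.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define FIELD_BITS 4u	/* assumed width of one state field */
#define FIELD_MASK 0xfu
#define FIELDS_PER_WORD 8u

/*
 * Atomically set state field `index` of *word to `state`.
 * Returns the previous value of that field.
 */
static unsigned set_field(_Atomic uint32_t *word, unsigned index, unsigned state)
{
	unsigned shift = index * FIELD_BITS;
	uint32_t old, new;

	assert(index < FIELDS_PER_WORD);
	old = atomic_load(word);	/* first, read the old value */
	do {
		/* copy it and change only the one field */
		new = old & ~((uint32_t)FIELD_MASK << shift);
		new |= (uint32_t)(state & FIELD_MASK) << shift;
		/*
		 * Store `new` only if the word still equals `old`.
		 * On failure `old` is reloaded with the current value
		 * and the loop starts again.
		 */
	} while (!atomic_compare_exchange_weak(word, &old, new));

	return (old >> shift) & FIELD_MASK;
}
```

Because the new word is built from a full copy of the old one, the same loop can take a lock bit and change a state field in one atomic step, which is the property discussed in the chat.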
2008-11-23 01:59 works only on one bit if I recall correctly 2008-11-23 01:59 http://en.wikipedia.org/wiki/Compare_and_swap 2008-11-23 01:59 test_and_set_bit() is 2008-11-23 02:00 oh, cas 2008-11-23 02:00 http://en.wikipedia.org/wiki/Test-and-set 2008-11-23 02:02 right, intel instruction is just one bit at a time 2008-11-23 02:08 it seems almost all arch are supporting cmpxhg 2008-11-23 02:09 cmpxchg is emulated using irq disable/enable 2008-11-23 02:09 nosmp arch? 2008-11-23 02:09 yes 2008-11-23 02:10 I haven't heard of smp arms ;) 2008-11-23 02:10 in that case, it is just synchronizing against interrupts I guess 2008-11-23 02:11 old arm seems to do it 2008-11-23 02:12 http://lxr.linux.no/linux+v2.6.27/arch/powerpc/include/asm/system.h#L362 <- power pc code 2008-11-23 02:13 start/end op... 2008-11-23 02:13 right, just looking at it 2008-11-23 02:13 it's not exaclty one instruction... 2008-11-23 02:13 I wonder why ibm didn't follow intel's lead on that 2008-11-23 02:14 bus lock may be too expensive on risc? 2008-11-23 02:14 mips/sparc may have like that 2008-11-23 02:15 let's see what an lwsync instruction is 2008-11-23 02:16 I know it like op on mips, then I forget it :) 2008-11-23 02:16 http://en.wikipedia.org/wiki/Load-Link/Store-Conditional 2008-11-23 02:16 maybe it 2008-11-23 02:16 Morning 2008-11-23 02:17 moin 2008-11-23 02:20 http://www.ibm.com/developerworks/eserver/articles/powerpc.html <- very interesting page 2008-11-23 02:25 it seems memory model for mem I/O 2008-11-23 02:25 memory ordering model 2008-11-23 02:25 yes 2008-11-23 02:25 barriers 2008-11-23 02:26 yes, intel also have 2008-11-23 02:26 so they are emulating cmpxchg using barriers 2008-11-23 02:26 yes 2008-11-23 02:26 when I know enough about barriers to understand how that's done, I'll tell you ;) 2008-11-23 02:27 it seems like magic to me at this point 2008-11-23 02:27 thanks :) 2008-11-23 02:27 I have good book for it... 
2008-11-23 02:27 just mail it ;) 2008-11-23 02:28 there is a lot of material available online 2008-11-23 02:28 japanese version though :) 2008-11-23 02:28 oh, so I'd have to learn japanese first 2008-11-23 02:28 :) 2008-11-23 02:28 I think english version is available too... 2008-11-23 02:28 well, my plan is to do roughly what I think is right, and let people who really know what they're doing tell me how it should be done 2008-11-23 02:31 http://www.amazon.com/UNIX-Systems-Modern-Architectures-Multiprocessing/dp/0201633388/ref=sr_1_1/187-2052376-9735966?ie=UTF8&s=books&qid=1227436227&sr=1-1 2008-11-23 02:35 if it work, it's fine 2008-11-23 02:54 looks like a fine book 2008-11-23 02:54 yes, I learned about barrier from it almost all 2008-11-23 02:55 of course, my knowlege is not enough though :) 2008-11-23 03:10 -!- samlh(~sam@67.129.121.145) has joined #tux3 2008-11-23 07:05 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-23 09:07 -!- tim_dimm(~timothyhu@cpe-76-90-122-49.socal.res.rr.com) has joined #tux3 2008-11-23 09:39 -!- tim_dimm(~timothyhu@cpe-76-90-122-49.socal.res.rr.com) has joined #tux3 2008-11-23 11:17 -!- tim_dimm(~timothyhu@cpe-76-90-122-49.socal.res.rr.com) has joined #tux3 2008-11-23 12:45 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-23 12:53 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-11-23 13:38 http://developer.intel.com/design/pentiumii/manuals/243191.htm <- intel instruction set manual, includes cmpxchg details 2008-11-23 15:31 just about sk8 oclock 2008-11-23 15:48 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-11-23 16:23 -!- ajonat(~ajonat@190.48.125.11) has joined #tux3 2008-11-23 17:33 hirofumi, around? 2008-11-23 17:46 hi 2008-11-23 17:55 hi 2008-11-23 17:55 I'm working on ->readpage using handles 2008-11-23 17:55 in hackfs 2008-11-23 17:56 so... 
I think I'm not going to try to make a new library, using handles instead of buffer_heads 2008-11-23 17:56 but instead, just provide support so the fs can do those loops 2008-11-23 17:56 our fs will therefore be aware of sb->blocks_per_page 2008-11-23 17:57 um... 2008-11-23 17:57 so, one thing it needs to know, is how many discontiguous regions of mapped, empty blocks there are on a page so it can allocate a bio of the right size 2008-11-23 17:58 filemap.c is going to have this page-aware code 2008-11-23 17:58 and we will try to make it nice 2008-11-23 17:59 with some nice primitives and reduce the amount of looping over page blocks 2008-11-23 17:59 so, filemap.c already has a loop over extent blocks 2008-11-23 18:00 and that is where the bio building and submission will go 2008-11-23 18:00 um... I can't image what you does yet 2008-11-23 18:01 right, that's why I make hackfs 2008-11-23 18:01 to demonstrate 2008-11-23 18:01 well, probably it work 2008-11-23 18:01 hackfs will have a very simple block mapping scheme 2008-11-23 18:01 each file has 8 blocks on the disk ;) 2008-11-23 18:02 indexed directly by the inode number, which is allocated sequentially 2008-11-23 18:02 and all the files become invalid on unmount 2008-11-23 18:02 in the current code of hackfs? 2008-11-23 18:02 yes 2008-11-23 18:02 in the next version coming 2008-11-23 18:02 so that means that hackfs doesn't have any metadata 2008-11-23 18:03 which lets it concentrate just on the block IO transfers for now 2008-11-23 18:03 the demonstration should be about 500 lines or so 2008-11-23 18:03 maybe 600 2008-11-23 18:03 and it will be code that we can move into tux3 pretty directly 2008-11-23 18:04 i see 2008-11-23 18:04 it doesn't have any metadata? 
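Counting the discontiguous regions on a page, so that a bio with the right number of segments can be allocated, could be sketched as below. The bitmap representation (one bit per block on the page, set when the block needs IO), the names count_regions and hackfs_map, and the exact mapping formula are assumptions for illustration; the chat only says that each file gets 8 blocks indexed directly by inode number.

```c
#include <assert.h>

#define HACKFS_BLOCKS_PER_FILE 8u	/* "each file has 8 blocks on the disk" */

/* Hypothetical direct mapping: file block -> physical block, no metadata. */
static unsigned long hackfs_map(unsigned long inum, unsigned index)
{
	assert(index < HACKFS_BLOCKS_PER_FILE);
	return inum * HACKFS_BLOCKS_PER_FILE + index;
}

/*
 * Count runs of consecutive set bits in the low `blocks` bits of `map`.
 * Each run is one contiguous region, so one bio segment per run is enough.
 */
static unsigned count_regions(unsigned map, unsigned blocks)
{
	unsigned regions = 0, prev = 0;

	assert(blocks <= 32);
	for (unsigned i = 0; i < blocks; i++) {
		unsigned bit = (map >> i) & 1;

		if (bit && !prev)	/* a new run starts here */
			regions++;
		prev = bit;
	}
	return regions;
}
```

For example, the map 0xD9 (binary 11011001) over 8 blocks has runs at bits 0, 3-4 and 6-7, so three segments would be needed.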
2008-11-23 18:04 anyway, current subproject is to figure out how many biovecs need to be allocated for each page 2008-11-23 18:04 no, no metadata 2008-11-23 18:04 we throw away the inode attributes on umount 2008-11-23 18:05 and directly calculate the position of file blocks 2008-11-23 18:05 mapping is very simple 2008-11-23 18:05 on thing, it doesn't really exercise the physically discontiguous part well 2008-11-23 18:05 but it will exercise a lot of the rest of the interface 2008-11-23 18:06 i see 2008-11-23 18:06 and I can add discontiguousness to it later, or we can try merging it into tux3 and developing it there 2008-11-23 18:06 but first, there are lots of details to get working even without being able to test pysical discontiguous mapping 2008-11-23 18:07 so for example 2008-11-23 18:07 i see 2008-11-23 18:07 I can radomly generate a bitmap for each file that makes 3 of the 8 blocks unavailable 2008-11-23 18:08 and that willl exercise the discontig mapping algorithm 2008-11-23 18:08 probably 2008-11-23 18:10 btw, on yersterday, were you trying to replace cmpxhg with memory barrier? 2008-11-23 18:10 just trying to understand memory barriers better 2008-11-23 18:10 we don't need the absolute best synchronizer at the start 2008-11-23 18:10 i see 2008-11-23 18:11 I wanted to know how this works: 2008-11-23 18:11 void unlock_block(handle_t handle) 2008-11-23 18:11 { 2008-11-23 18:11 unsigned *state = &handles(handle)->statemap; 2008-11-23 18:11 unsigned bit = 1 << handle_lockbit(handle); 2008-11-23 18:11 smp_mb__before_clear_bit(); 2008-11-23 18:11 clear_bit(bit, state); 2008-11-23 18:11 smp_mb__after_clear_bit(); 2008-11-23 18:11 wake_up_bit(state, bit); 2008-11-23 18:11 } 2008-11-23 18:11 wow, it really annoys me that linux got the order of his bitops backwards ;) 2008-11-23 18:12 I'm overly sensitive about things like that 2008-11-23 18:12 s/linux/linus/ 2008-11-23 18:13 um... 
before and after sounds odd 2008-11-23 18:13 it's to cover the unique memory models across all arches 2008-11-23 18:14 sometimes one will be null, I think 2008-11-23 18:14 noop 2008-11-23 18:14 I didn't check, just assumed 2008-11-23 18:14 yes 2008-11-23 18:14 that makes it _really_ hard to test 2008-11-23 18:14 this kind of thing is really scary, and why we need to post code to lkml 2008-11-23 18:14 but, some arch may have both 2008-11-23 18:15 maybe 2008-11-23 18:15 I didn't survey 2008-11-23 18:15 what does read side do read side 2008-11-23 18:15 ? 2008-11-23 18:16 read side of what? 2008-11-23 18:16 wait_event side 2008-11-23 18:16 56#define smp_mb__before_clear_bit() smp_mb() 2008-11-23 18:16 57#define smp_mb__after_clear_bit() smp_mb() <- alpha 2008-11-23 18:17 just a sec 2008-11-23 18:17 void lock_block(handle_t handle) 2008-11-23 18:17 { 2008-11-23 18:17 unsigned *state = &handles(handle)->statemap; 2008-11-23 18:17 unsigned bit = 1 << handle_lockbit(handle); 2008-11-23 18:17 wait_on_bit_lock(state, bit, schedule, TASK_UNINTERRUPTIBLE); 2008-11-23 18:17 } 2008-11-23 18:17 I can't see why before_clear_bi is not useful in there 2008-11-23 18:17 I can't see why before_clear_bi is useful in there 2008-11-23 18:18 neither can I 2008-11-23 18:18 maybe nobody cares about alpha 2008-11-23 18:18 powerpc has the same 2008-11-23 18:18 no 2008-11-23 18:18 :p 2008-11-23 18:19 143#define smp_mb__before_clear_bit() barrier() 2008-11-23 18:19 144#define smp_mb__after_clear_bit() barrier() 2008-11-23 18:19 145 <- probably the same too 2008-11-23 18:19 well 2008-11-23 18:19 that's x86 2008-11-23 18:19 I thought it would be more clever 2008-11-23 18:19 after_clear_bit will commit memory before starting wake_up 2008-11-23 18:20 but I think before_clear_bit should unnesessary 2008-11-23 18:20 ok, something to remember: these barriers do two things: tell the compiler not to reorder code and tell the processor to do something to the memory bus 2008-11-23 18:21 yes 2008-11-23 
18:21 barrier() is for compiler 2008-11-23 18:21 I'm just checking 2008-11-23 18:21 void unlock_buffer(struct buffer_head *bh) 2008-11-23 18:21 { 2008-11-23 18:21 clear_bit_unlock(BH_Lock, &bh->b_state); 2008-11-23 18:21 smp_mb__after_clear_bit(); 2008-11-23 18:21 wake_up_bit(&bh->b_state, BH_Lock); 2008-11-23 18:21 } 2008-11-23 18:21 that's where I got it from 2008-11-23 18:22 it seems not to do before_clear_bit 2008-11-23 18:23 is __memory_barrier a gcc buildin? 2008-11-23 18:23 I don't know __memory_barrier 2008-11-23 18:24 whoops, did I cut and paste wrong? 2008-11-23 18:24 it seems for intel compiler? 2008-11-23 18:27 google can't find it 2008-11-23 18:27 http://lkml.indiana.edu/hypermail/linux/kernel/0504.1/0439.html 2008-11-23 18:27 someone says it is for intel compiler 2008-11-23 18:28 ok 2008-11-23 18:28 so what about gcc? 2008-11-23 18:28 #define barrier() __asm__ __volatile__("": : :"memory") 2008-11-23 18:28 ? 2008-11-23 18:29 where did you find that version of unlock_buffer above? it is different from the 2.6.27 version 2008-11-23 18:29 ok, that is a pure gcc message 2008-11-23 18:30 it is from git 2008-11-23 18:30 ah, somebody figured out that barrier before + after is overkill 2008-11-23 18:30 for unlock 2008-11-23 18:30 fs: buffer lock use lock bitops 2008-11-23 18:30 2008-11-23 18:30 trylock_buffer and unlock_buffer open and close a critical section. 2008-11-23 18:30 Hence, we can use the lock bitops to get the desired memory ordering. 
2008-11-23 18:30 2008-11-23 18:30 Signed-off-by: Nick Piggin 2008-11-23 18:30 well fine 2008-11-23 18:30 Signed-off-by: Andrew Morton 2008-11-23 18:30 Signed-off-by: Linus Torvalds 2008-11-23 18:30 --------------------------------- fs/buffer.c --------------------------------- 2008-11-23 18:30 index ac78d4c..6569fda 100644 2008-11-23 18:30 @@ -76,8 +76,7 @@ EXPORT_SYMBOL(__lock_buffer); 2008-11-23 18:30 2008-11-23 18:30 void unlock_buffer(struct buffer_head *bh) 2008-11-23 18:30 { 2008-11-23 18:30 - smp_mb__before_clear_bit(); 2008-11-23 18:30 - clear_buffer_locked(bh); 2008-11-23 18:31 + clear_bit_unlock(BH_Lock, &bh->b_state); 2008-11-23 18:31 smp_mb__after_clear_bit(); 2008-11-23 18:31 wake_up_bit(&bh->b_state, BH_Lock); 2008-11-23 18:31 oh, sure 2008-11-23 18:31 bad nick 2008-11-23 18:31 didn't post to lkml 2008-11-23 18:32 just quietly commited to -mm 2008-11-23 18:32 maybe 2008-11-23 18:33 I wonder after_clear_bit is nesessary or not though 2008-11-23 18:34 wake_up_bit uses spin_lock, so it may include barrier 2008-11-23 18:35 well, after_clear_bit will make sure it 2008-11-23 18:35 I was wondering about that 2008-11-23 18:35 it doesn't hurt 2008-11-23 18:35 becuase it's just a compiler directive 2008-11-23 18:36 well it not to reorder twice at the same point doesn't change anything 2008-11-23 18:36 s/well/telling/ 2008-11-23 18:36 after_clear_bit ? 2008-11-23 18:36 if there-s one after clear bit and before spinlock, it doesn't matter 2008-11-23 18:36 haven't checked yet... 
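The idea behind the patch quoted above, that lock and unlock open and close a critical section and so only need one-way ordering, can be sketched with C11 atomics. This is a userspace illustration, not the kernel bitops: trylock_bit and unlock_bit are hypothetical names, and the acquire and release orderings are the C11 analogues of what the kernel's lock bitops are meant to provide. The helpers take a bit number and build the mask internally; the kernel's clear_bit() and wait_on_bit_lock() also take a bit number rather than a mask, which may be worth checking against the unlock_block and lock_block drafts pasted earlier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Try to take lock bit `nr` in *state.
 * Acquire ordering: accesses inside the critical section cannot be
 * moved before the point where the lock is taken.
 */
static bool trylock_bit(atomic_uint *state, unsigned nr)
{
	unsigned mask;

	assert(nr < 32);
	mask = 1u << nr;
	return !(atomic_fetch_or_explicit(state, mask, memory_order_acquire) & mask);
}

/*
 * Drop lock bit `nr`.
 * Release ordering: accesses inside the critical section cannot be
 * moved after the point where the lock is dropped, so no separate
 * barrier is needed before the clear.
 */
static void unlock_bit(atomic_uint *state, unsigned nr)
{
	assert(nr < 32);
	atomic_fetch_and_explicit(state, ~(1u << nr), memory_order_release);
}
```

Waking a sleeping waiter after the unlock is a separate step (wake_up_bit in the kernel) and is left out of this sketch.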
2008-11-23 18:37 4585 spin_lock_irqsave(&q->lock, flags); 2008-11-23 18:37 http://lxr.linux.no/linux+v2.6.27/kernel/sched.c#L4580 2008-11-23 18:38 the irqsave lets it be used from interrupt context, I know you know that 2008-11-23 18:38 just for any observers ;) 2008-11-23 18:39 :) 2008-11-23 18:40 well, intent of those barriers are clear 2008-11-23 18:42 71static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock) 2008-11-23 18:42 http://lxr.linux.no/linux+v2.6.27/include/asm-x86/spinlock.h#L71 2008-11-23 18:42 yes 2008-11-23 18:43 76 LOCK_PREFIX "xaddw %w0, %1\n" 2008-11-23 18:43 I think LOCK_PREFIX is a processor memory barrier 2008-11-23 18:43 yes 2008-11-23 18:43 but is there a compiler barrier tere? 2008-11-23 18:43 yes 2008-11-23 18:44 which line? 2008-11-23 18:44 crappy inline asm syntax 2008-11-23 18:44 iirc, lock prefix and volatile inline asm doesn't reorder by compiler 2008-11-23 18:45 I was reading the string directives last week, forgot all of them because 1 character magic codes just plain suck 2008-11-23 18:45 I think the reordering is controlled by the assembler directives 2008-11-23 18:45 on the operands 2008-11-23 18:45 the operand descriptors 2008-11-23 18:46 reorder by compiler? 2008-11-23 18:46 gah, I hate this syntax 2008-11-23 18:46 need to go look up the x86 assembler directive codes 2008-11-23 18:47 I mean, operand codes 2008-11-23 18:47 demented syntax 2008-11-23 18:47 should have been thrown away many years ago 2008-11-23 18:48 last part of asm()? 
2008-11-23 18:48 yes 2008-11-23 18:48 "emory operand constraint" 2008-11-23 18:50 you know it's a bad syntax when all the google hits are on howtos and introductions, instead of the actual reference material 2008-11-23 18:50 ...and tutorials 2008-11-23 18:50 and pdfs 2008-11-23 18:50 anything but the actual reference 2008-11-23 18:51 and it is traditional to leave all the inline asm in linux undocumented 2008-11-23 18:51 uncommented 2008-11-23 18:51 yes 2008-11-23 18:51 it's called "job security for hackers" 2008-11-23 18:52 If your assembler instruction can alter the condition code register, 2008-11-23 18:52 add `cc' to the list of clobbered registers. GCC on some machines 2008-11-23 18:52 represents the condition codes as a specific hardware register; `cc' 2008-11-23 18:52 serves to name this register. On other machines, the condition code is 2008-11-23 18:52 handled differently, and specifying `cc' has no effect. But it is 2008-11-23 18:52 valid no matter what the machine. 2008-11-23 18:52 If your assembler instructions access memory in an unpredictable 2008-11-23 18:52 fashion, add `memory' to the list of clobbered registers. This will 2008-11-23 18:52 cause GCC to not keep memory values cached in registers across the 2008-11-23 18:52 assembler instruction and not optimize stores or loads to that memory. 2008-11-23 18:52 You will also want to add the `volatile' keyword if the memory affected 2008-11-23 18:52 is not listed in the inputs or outputs of the `asm', as the `memory' 2008-11-23 18:52 clobber does not count as a side-effect of the `asm'. If you know how 2008-11-23 18:52 large the accessed memory is, you can add it as input or output but if 2008-11-23 18:52 this is not known, you should add `memory'. As an example, if you 2008-11-23 18:52 access ten bytes of a string, you can use a memory input like: 2008-11-23 18:52 probably, this? 
2008-11-23 18:53 nothing about reordering there 2008-11-23 18:54 I found my way to the gcc asm reference before, and the x86 specific parameter codes and didn't bookmark it 2008-11-23 18:54 :p 2008-11-23 18:55 http://tigcc.ticalc.org/doc/gnuasm.html 2008-11-23 18:55 The `volatile' keyword indicates that the instruction has important 2008-11-23 18:55 side-effects. GCC will not delete a volatile `asm' if it is reachable. 2008-11-23 18:55 (The instruction can still be deleted if GCC can prove that 2008-11-23 18:55 control-flow will never reach the location of the instruction.) Note 2008-11-23 18:55 that even a volatile `asm' instruction can be moved relative to other 2008-11-23 18:55 code, including across jump instructions. For example, on many targets 2008-11-23 18:55 there is a system register which can be set to control the rounding 2008-11-23 18:55 mode of floating point operations. You might try setting it with a 2008-11-23 18:55 volatile `asm', like this PowerPC example: 2008-11-23 18:55 asm volatile("mtfsf 255,%0" : : "f" (fpenv)); 2008-11-23 18:55 sum = x + y; 2008-11-23 18:55 This will not work reliably, as the compiler may move the addition back 2008-11-23 18:55 before the volatile `asm'. To make it work you need to add an 2008-11-23 18:55 artificial dependency to the `asm' referencing a variable in the code 2008-11-23 18:55 you don't want moved, for example: 2008-11-23 18:55 asm volatile ("mtfsf 255,%1" : "=X"(sum): "f"(fpenv)); 2008-11-23 18:55 sum = x + y; 2008-11-23 18:55 and this? 
2008-11-23 18:55 to avoid reorder by compiler, it seems to use dependency of sum 2008-11-23 18:56 you need to know what the operand codes actually mean to draw any conclusion 2008-11-23 18:56 so, "memory" dependency will imply to avoid reorder by compiler 2008-11-23 18:57 if it doesn't have "=X(sum)", it may move by compiler 2008-11-23 18:58 to avoid it, below code was added "=X(sum)", so it is not moved to after "sum = x + y" 2008-11-23 18:59 likewise, "memory" dependency will imply to avoid reorder by compiler 2008-11-23 18:59 because it depends on any memory 2008-11-23 19:01 http://en.wikipedia.org/wiki/GNU_Assembler#Criticisms 2008-11-23 19:01 the documentation situation with inline gas is pathetic 2008-11-23 19:01 I've used my documentation hunting time for today 2008-11-23 19:02 sometime I'll find the actual documentation again 2008-11-23 19:02 "tips" and "tutorials" are usually a waste of time 2008-11-23 19:02 they don't tell you what's actually happening 2008-11-23 19:02 you need the defintion of X for example 2008-11-23 19:02 and Q 2008-11-23 19:03 http://sourceware.org/binutils/docs-2.18/as/index.html 2008-11-23 19:03 it is gcc.info 2008-11-23 19:03 info was a great disservice perpetrated upone the community 2008-11-23 19:04 -!- RazvanM(~RazvanM@pool-96-244-34-153.bltmmd.east.verizon.net) has joined #tux3 2008-11-23 19:05 umm... I'm not sure 2008-11-23 19:05 what should they use? 2008-11-23 19:05 it's very hard to understand why .info is better than .html 2008-11-23 19:06 ah 2008-11-23 19:06 most people prefer man pages when they have the choice 2008-11-23 19:06 yes 2008-11-23 19:06 fsf tried to take away that choice, and succeeded with many of the projects under their control 2008-11-23 19:07 maybe, info was before html? 
2008-11-23 19:07 converting man pages to info was not before html 2008-11-23 19:08 i see 2008-11-23 19:08 if so, info would be bad 2008-11-23 19:10 ok, no more time for hunting for documentation on memory operand contraints today 2008-11-23 19:10 it's amazing how the actual documentation is so obscure 2008-11-23 19:11 and the web is full of tutorials, tips, howtos, introductions... 2008-11-23 19:11 all barriers to proper analysis 2008-11-23 19:11 back to hackfs 2008-11-23 19:11 yes 2008-11-23 19:12 I'll delete a barrier from lock_handle 2008-11-23 19:12 like nick did 2008-11-23 19:12 without knowing fully why 2008-11-23 19:12 yes :) 2008-11-23 19:13 btw, if I copy fs/tux3 to user/kernel, which do we use as master? 2008-11-23 19:14 fs/tux3, for now, until we copy all the fs/tux3 files to /kernel 2008-11-23 19:14 you could do that now if you like, and I will pull 2008-11-23 19:14 yes 2008-11-23 19:14 I'll do that 2008-11-23 19:15 thanks, I'll write some more hackfs 2008-11-23 19:15 ok 2008-11-23 19:15 I'm an experience assembly hacker by the way 2008-11-23 19:16 but I've never gotten over the barrier of the awful gcc + bell labs asm syntax 2008-11-23 19:16 only started looking at it seriously last week 2008-11-23 19:18 yes 2008-11-23 19:18 gcc syntax is well undocmented at all 2008-11-23 19:19 I think bell syntax is simple, but gcc is... 2008-11-23 19:22 I did find the docs last week 2008-11-23 19:22 but didn't bookmark 2008-11-23 19:23 oh, good. but maybe it's not official? 
2008-11-23 19:24 s/official/official document/ 2008-11-23 19:25 it was official 2008-11-23 19:26 oh, I didn't know there is official doc 2008-11-23 19:27 http://gcc.gnu.org/onlinedocs/gcc/Constraints.html#Constraints <- here 2008-11-23 19:27 ah 2008-11-23 19:27 http://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints <- intel codes 2008-11-23 19:27 that is, x86 codes 2008-11-23 19:28 ah, maybe it's same with info 2008-11-23 19:28 likely 2008-11-23 19:29 info should just die and be replaced by html and all-in-one man pages 2008-11-23 19:32 it's good that nick is combing through all this page cache stuff 2008-11-23 19:32 bad that it's not in public 2008-11-23 19:33 barrier stuff for page cache data patches? 2008-11-23 19:33 yes 2008-11-23 19:33 lots of microoptimizations 2008-11-23 19:33 some major ones like making the page cache rcu 2008-11-23 19:34 we have high level brain damage in linux though 2008-11-23 19:34 i see 2008-11-23 19:34 things like... dirty pages should not even be on the lru 2008-11-23 19:35 for modern filesystems 2008-11-23 19:35 all that scanning is pure waste 2008-11-23 19:36 when I grep... it does really the wrong thing 2008-11-23 19:36 evicts all my executables 2008-11-23 19:36 even though each cache page of the grep is only looked at one 2008-11-23 19:36 once 2008-11-23 19:37 i see 2008-11-23 19:37 how do we solve it? 2008-11-23 19:37 "use once" 2008-11-23 19:38 when a page is read in, it is immediately queued for eviction 2008-11-23 19:38 unless referenced again within a short time 2008-11-23 19:38 current active_list and inactive_list does not work like it? 2008-11-23 19:38 when "within a short time" is not handled accurately, the result is strange random behavior 2008-11-23 19:39 no 2008-11-23 19:39 it's supposed to 2008-11-23 19:39 but it clearly doesn't 2008-11-23 19:39 a major design bug 2008-11-23 19:39 i see 2008-11-23 19:48 do we change license in user/kernel/*? 
2008-11-23 19:48 good question 2008-11-23 19:49 yes I think 2008-11-23 19:49 ok, I'll change it 2008-11-23 19:49 what should I put? 2008-11-23 19:49 just delete the license for now 2008-11-23 19:49 ok 2008-11-23 19:50 I'll leave it as is 2008-11-23 19:50 we can keep it in the user files that include the user/kernel fiels 2008-11-23 19:50 sure 2008-11-23 19:50 not pressing 2008-11-23 19:50 we're not distributing yet ;) 2008-11-23 19:50 :) 2008-11-23 19:57 void zero_block(handle_t handle, blocksize) 2008-11-23 19:57 { 2008-11-23 19:57 zero_user(handles(handle)->page, handle_i(handle) * blocksize, blocksize); 2008-11-23 19:57 set_block_state(handle_t handle, BLOCK_DIRTY); 2008-11-23 19:57 } 2008-11-23 19:58 example of the handle api 2008-11-23 19:58 i see 2008-11-23 19:58 readable? 2008-11-23 19:58 yes 2008-11-23 19:59 maybe, I'd like to set_block_state separately 2008-11-23 19:59 hmm, I don't have to pass blocksize 2008-11-23 19:59 it can come from sb 2008-11-23 19:59 the block will always be dirty after zero_user 2008-11-23 20:00 but, I may want to zeroed some blocks 2008-11-23 20:00 ok, except when above i_size 2008-11-23 20:00 true 2008-11-23 20:07 void zero_block(handle_t handle) 2008-11-23 20:07 { 2008-11-23 20:07 unsigned blocksize = tux_node(handles(handle)->page->mapping->host)->sb->blocksize; 2008-11-23 20:07 zero_user(handles(handle)->page, handle_i(handle) * blocksize, blocksize); 2008-11-23 20:07 } 2008-11-23 20:07 void clear_block(handle_t handle) 2008-11-23 20:07 { 2008-11-23 20:07 zero_block(handle); 2008-11-23 20:07 set_block_state(handle_t handle, BLOCK_DIRTY); 2008-11-23 20:07 } 2008-11-23 20:07 I really hate that blocksize expression 2008-11-23 20:08 and it assumes that the blockdev has our sb, which I think we are going to achieve the way we discussed 2008-11-23 20:08 ->host->i_blkbits? 
2008-11-23 20:08 oh right 2008-11-23 20:09 blkbits should really be in the address_space not the inode 2008-11-23 20:09 what meaning does blkbits have for a socket? 2008-11-23 20:09 I'm not sure 2008-11-23 20:09 maybe 0 2008-11-23 20:10 void zero_block(handle_t handle) 2008-11-23 20:10 { 2008-11-23 20:10 unsigned shift = handles(handle)->page->mapping->host->i_blkbits; 2008-11-23 20:10 zero_user(handles(handle)->page, handle_i(handle) << shift, 1 << shift); 2008-11-23 20:10 } 2008-11-23 20:11 looks good 2008-11-23 20:12 ok, socket uses get_sb_pseudo, and sb->s_blocksize and inode->i_blkbits is PAGE_SIZE 2008-11-23 20:13 so i_blkbits is wasted 2008-11-23 20:13 well, yes 2008-11-23 20:13 anyway... socket code is not usually a great example of anything ;) 2008-11-23 20:13 almost all of fs emulation 2008-11-23 20:14 see the confusion between struct socket and struct sock 2008-11-23 20:14 yes 2008-11-23 20:14 I'm a bit familiar to it 2008-11-23 20:14 it makes me want to poke my eyes out 2008-11-23 20:15 cut and paste madness 2008-11-23 20:15 ah, I should get back to work ;) 2008-11-23 20:15 :) 2008-11-23 20:15 speaking of cut and paste, I just cut and pasted all of handle.c into hackfs.c 2008-11-23 20:16 no turning back now 2008-11-23 20:19 if you were using quilt, you could turn back :) 2008-11-23 20:45 I should 2008-11-23 20:46 and I will, but not today ;) 2008-11-23 21:17 hirofumi, do you think I should make handles.statemap atomic_t? 2008-11-23 21:17 or is the new modern way to just use ordinary ints and wrap with barriers? 
2008-11-23 21:17 in which case we don't get help from the compiler 2008-11-23 21:18 we need something like bitwise attribute for cross-processor data 2008-11-23 21:36 atomic_t and long and long long, I think anything is fine 2008-11-23 21:37 get_block_state can't just read a plain long 2008-11-23 21:37 I think we don't touch handles.statemap directly 2008-11-23 21:37 it has to have barriers or something 2008-11-23 21:37 yes, get_handle_state does 2008-11-23 21:37 cmpxchg(&handle.statemap)? 2008-11-23 21:37 set_handle_state will use cmpxchg 2008-11-23 21:37 but cmpxchg is overkill for get_handle_state 2008-11-23 21:37 ah 2008-11-23 21:38 I think get_handle_state can just read it 2008-11-23 21:38 well, it can read same way with atomic_read 2008-11-23 21:38 probably, I think it just read 2008-11-23 21:39 not all arches just read atomic_t 2008-11-23 21:39 oh 2008-11-23 21:39 http://lxr.linux.no/linux+v2.6.27/arch/powerpc/include/asm/atomic.h#L18 2008-11-23 21:40 ah 2008-11-23 21:40 probably, to avoid read middle state in ll/sc pair 2008-11-23 21:41 ll/sc? 2008-11-23 21:41 ah, lwarx/stwcx on powerpc 2008-11-23 21:42 so, I feel atomic_cmpxchg() and atomic_read() is good 2008-11-23 21:43 ah, I didn't know atomic_cmpxchg 2008-11-23 21:43 fine 2008-11-23 21:44 me too 2008-11-23 21:44 it seems to add a bit silently 2008-11-23 21:44 and the bit locking has to rely on being able to fetch atomic_t->counter 2008-11-23 21:44 we will break on any arch that doesn't have that 2008-11-23 21:46 I'm assuming atomic_cmpxchg is available on all arch 2008-11-23 21:46 yes I think so 2008-11-23 21:46 but we also have bit locks working on the same word 2008-11-23 21:46 ah 2008-11-23 21:46 there are 8 lock bits and 8 3 bit scalar fields in the word 2008-11-23 21:47 if need, we can separate lock fields and state? 
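A minimal sketch of the packed word described above (8 lock bits plus 8 three-bit state fields in one 32-bit word). The log does not say which bits hold which fields, so the layout here (states in bits 0..23, locks in bits 24..31) is an assumption, and the helper names get_state, with_state and is_locked are hypothetical, not taken from handle.c.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout (not confirmed by the log): eight 3-bit state fields in
 * bits 0..23, eight lock bits in bits 24..31. */
enum { STATE_BITS = 3, STATE_MASK = 7, LOCK_BASE = 24, BLOCKS_PER_WORD = 8 };

/* Read the 3-bit state of block i out of the packed word. */
static unsigned get_state(uint32_t map, unsigned i)
{
	assert(i < BLOCKS_PER_WORD);
	return (map >> (i * STATE_BITS)) & STATE_MASK;
}

/* Return a copy of map with block i's state replaced; all other fields
 * (including the lock bits) are left untouched. */
static uint32_t with_state(uint32_t map, unsigned i, unsigned state)
{
	assert(i < BLOCKS_PER_WORD && state <= STATE_MASK);
	unsigned shift = i * STATE_BITS;
	return (map & ~((uint32_t)STATE_MASK << shift)) | ((uint32_t)state << shift);
}

/* Test block i's lock bit. */
static int is_locked(uint32_t map, unsigned i)
{
	assert(i < BLOCKS_PER_WORD);
	return (map >> (LOCK_BASE + i)) & 1;
}
```

These helpers are plain arithmetic on a copy of the word; any update to the shared word still has to go through an atomic operation, since the lock bits and state fields live side by side.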
2008-11-23 21:49 here's the strangest atomic_t: 2008-11-23 21:49 typedef struct { volatile __s32 counter; } atomic_t; 2008-11-23 21:49 typedef struct { volatile __s64 counter; } atomic64_t; 2008-11-23 21:49 aren't atomic_t's only guaranteed for 24 bits 2008-11-23 21:49 there's an unused word in struct handles, so we could 2008-11-23 21:50 I think I remember reading that somewhere... 2008-11-23 21:50 now, I think it's 32bit 2008-11-23 21:50 maze, it was true until sparc was changed to use a hashed spinlock 2008-11-23 21:50 (probably some obscure arch) 2008-11-23 21:50 on sparc, maybe it was 24bits 2008-11-23 21:50 instead of a spinlock in one byte of the counter 2008-11-23 21:50 ah 2008-11-23 21:50 obscure arch == sparc ;) 2008-11-23 21:51 don't tell sun 2008-11-23 21:51 they still have high hopes for niagara 2008-11-23 21:51 well, I'd consider mips obscure as well - even though I have a mips linux machine in my studio 2008-11-23 21:51 (just a router) 2008-11-23 21:51 mips is hardly obscure 2008-11-23 21:51 heavily used in embedded 2008-11-23 21:51 and china uses it because it's out of patent zone 2008-11-23 21:51 I thought arm was king in embedded? 2008-11-23 21:52 mips wins in a lot of routers and things 2008-11-23 21:52 and it wins in china 2008-11-23 21:52 ah, that explains mine then 2008-11-23 21:52 broadcom is heavily into mips 2008-11-23 21:53 maze, this block handle stuff ought to be right up your alley 2008-11-23 21:53 anyway.... sorry for diverting the conversation... 
going back to what I was doing before (hacking the ath9k driver and the acpi-cpufreq drivers - the first doesn't work on my laptop (compiling with potential fix now), the second on my desktop - hacking in more debugging code) 2008-11-23 21:53 hmm 2008-11-23 21:53 ok, go on then ;-) 2008-11-23 21:54 or maybe I should go back and read more than 10 lines 2008-11-23 21:55 see: http://phunq.net/ddtree?p=tux3fs;a=tree;f=fs/hackfs;hb=131c2d7a789998b2a7fbc2e6277e99c5002b7d35 2008-11-23 21:55 handle.c is being added to that right now 2008-11-23 21:55 to test as a replacement for buffer ops 2008-11-23 21:55 just thinking about your comments above about cmpxchg... anything requiring lock cycles on the bus is painful... 2008-11-23 21:56 so that's why we never use it... 2008-11-23 21:56 unfortunately, it's a sad fact that we have to synchronize between procesors 2008-11-23 21:57 yes, but anything that can be done per-cpu works wonders for performance on more cpu machines 2008-11-23 21:57 ah, yes, hackfs 2008-11-23 21:57 premature optimization 2008-11-23 21:57 first the buffers have to be working correctly, then we give them cpu affinity 2008-11-23 21:57 true 2008-11-23 21:57 s/buffers/block handles/ 2008-11-23 21:58 I thought hackfs was a little more dignified name than junkfs 2008-11-23 21:59 true 2008-11-23 21:59 it looks kind of like what I had planned for junkfs 2008-11-23 21:59 (side note: looks like I'll be full time kernel starting February or March) 2008-11-23 22:00 "atomic_cmpxchg requires explicit memory barriers around the operation." 2008-11-23 22:01 maze, the kernel team lameness factor will decrease 2008-11-23 22:01 you and jiaying 2008-11-23 22:01 lol 2008-11-23 22:02 nice to throw away that pager 2008-11-23 22:02 or do sre's have them? 
2008-11-23 22:03 ah right, it's eletric collars 2008-11-23 22:03 lol 2008-11-23 22:04 atomic_read() on powerpc may just be read 2008-11-23 22:04 http://pds.twi.tudelft.nl/vakken/in1200/labcourse/instruction-set/lwz.html 2008-11-23 22:04 so how much does hackfs do? it seems to have a superblock but no actual disk io besides tests 2008-11-23 22:05 hirofumi, thanks 2008-11-23 22:05 I won't get stuck on it 2008-11-23 22:05 make it atomic_t, access->counter where needed and let people tell me what I did wrong 2008-11-23 22:06 maze, it's just as you left it, but has fs-specific inodes now 2008-11-23 22:06 hmm 2008-11-23 22:06 yes I see 2008-11-23 22:06 I should change the slab allocs to kmalloc in keeping with it being useless 2008-11-23 22:06 it will be shorter 2008-11-23 22:07 backing_dev_info - interesting 2008-11-23 22:07 inherited from ramfs 2008-11-23 22:08 backing_dev_info is mostly a crock 2008-11-23 22:08 should just be part of struct block_device 2008-11-23 22:08 it might be useful though 2008-11-23 22:08 it's never needed separately 2008-11-23 22:08 the concept is useful 2008-11-23 22:08 current implementation though... 2008-11-23 22:09 this may be the way to get around auto-triggered-write-out 2008-11-23 22:09 well yes 2008-11-23 22:09 poll the bdi 2008-11-23 22:09 that was the plan 2008-11-23 22:11 the reason bdi is not part of struct block_device in my humble opinion is, to keep mm and block layer guys from being at each other's throats 2008-11-23 22:11 heh 2008-11-23 22:11 otherwise all the bloat in bdi would encounter resistance 2008-11-23 22:11 solution: pointer to bloat 2008-11-23 22:16 Documentation/atomic_ops.txt <- really blows 2008-11-23 22:16 wrt cmpxchg 2008-11-23 22:16 yes, there's also no cmpxchg double pointer 2008-11-23 22:17 even though so many cool things can be done with an atomic exchange of a { pointer, pointer } or { pointer, long } 2008-11-23 22:18 [ie. 
the cmpxchg8b instruction on 32-bit and cmpxchg16b instruction on 64-bit, even having it's own flag in the cpuid flags, since it was missing from the very first 64-bit x86_64 cpus - the oversight was quickly corrected] 2008-11-23 22:18 [cx8 and cx16] 2008-11-23 22:19 (funny thing is, of course, with the way caches work, this could be an entire cache line - all 32 or 64 bytes of it, but oh well... and it would be even more useful if the two pointers it exchanges didn't have to be next to each other in memory...) 2008-11-23 22:23 oh, it's been so long since I looked at the def of cmpxchg8b 2008-11-23 22:23 not just an extension of cmpxchg, hmm? 2008-11-23 22:23 nope, not really, although it does do the same thing, it uses different registers 2008-11-23 22:24 I believe it would be edx:eax and ecx:ebx 2008-11-23 22:24 and 8 bytes of destination memory 2008-11-23 22:25 cmpxchg16b is the 64-bit equivalent of cmpxchg8b - ie. it uses rdx:rax and rcx:rbx 2008-11-23 22:25 and 16 bytes of destination memory (2 * 64 bits vs cmpxchg8b's 2 * 32 bits) 2008-11-23 22:25 it's the 'dcas' = double compare and swap to cmpxchg's 'cas' = compare and swap 2008-11-23 22:26 allows cool lockless things to be done - although of course, you still end up locking the bus to perform the operation, so best used sparingly 2008-11-23 22:27 (better than a spinlock though) 2008-11-23 22:27 ok, in a couple minutes I'll have a cool thing for you to check out 2008-11-23 22:33 void set_block_state(handle_t handle, unsigned state) 2008-11-23 22:33 { 2008-11-23 22:33 struct handles *info = handles(handle); 2008-11-23 22:33 unsigned shift = handle_shift(handle); 2008-11-23 22:33 while (1) { 2008-11-23 22:33 unsigned oldmap = info->statemap.counter; 2008-11-23 22:33 unsigned newmap = (oldmap & ~(7 << shift)) | (state << shift); 2008-11-23 22:33 if (atomic_cmpxchg(&info->statemap, oldmap, newmap) == oldmap) 2008-11-23 22:33 break; 2008-11-23 22:33 } 2008-11-23 22:33 } 2008-11-23 22:33 so, we just keep storing 
the new state until we find that we got the old state back 2008-11-23 22:34 meaning nobody raced with us and changed one of the other fields 2008-11-23 22:39 info->statemap.counter; 2008-11-23 22:39 should be atomic_read() 2008-11-23 22:40 right :) 2008-11-23 22:41 while (1) { 2008-11-23 22:41 unsigned oldmap = atomic_read(&info->statemap); 2008-11-23 22:41 unsigned newmap = (oldmap & ~(7 << shift)) | (state << shift); 2008-11-23 22:41 if (atomic_cmpxchg(&info->statemap, oldmap, newmap) == oldmap) 2008-11-23 22:41 break; 2008-11-23 22:41 } 2008-11-23 22:41 and if powerpc's lwz is just read, we can simply use long or something 2008-11-23 22:42 well, looks good for me 2008-11-23 22:42 so it was ok before without the atomic_read, but it just looks better with it? 2008-11-23 22:43 I will try it now and see if a simple test a) doesn't hang and b) changes the value 2008-11-23 22:44 if we don't use atomic_*, it would be long 2008-11-23 22:44 maybe, it's ok if powerpc is using int for atomic_t 2008-11-23 22:45 everybody is use a random choice of int, long and unsigned 2008-11-23 22:45 one specified s32 2008-11-23 22:45 random mess 2008-11-23 22:45 yes 2008-11-23 22:45 but it wouldn't be linux without that 2008-11-23 22:46 well, I'd like to know why powerpc is using lwz 2008-11-23 22:47 root@usermode:~# mount -thackfs /dev/ubdb /mnt 2008-11-23 22:47 super = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 2008-11-23 22:47 handlemap 500060 2008-11-23 22:47 handle1 state = 6 2008-11-23 22:47 handle2 state = 5 2008-11-23 22:47 well it works... but does it synchronize? 2008-11-23 22:48 maze, that would be a good way to demonstrate your kernel skillz ;) 2008-11-23 22:48 ok, it seems lzw was just optimization 2008-11-23 22:48 hmm? 
2008-11-23 22:48 prove that the above actually forces exclusion 2008-11-23 22:48 not so easy 2008-11-23 22:48 you have to make a separate kernel thread 2008-11-23 22:48 use timers and things to set up a test 2008-11-23 22:49 hirofumi, how did you figure that out? 2008-11-23 22:49 -#define atomic_read(v) ((v)->counter) 2008-11-23 22:49 -#define atomic_set(v,i) (((v)->counter) = (i)) 2008-11-23 22:49 +static __inline__ int atomic_read(const atomic_t *v) 2008-11-23 22:49 +{ 2008-11-23 22:49 + int t; 2008-11-23 22:49 + 2008-11-23 22:49 + __asm__ __volatile__("lwz%U1%X1 %0,%1" : "=r"(t) : "m"(v->counter)); 2008-11-23 22:49 + 2008-11-23 22:49 + return t; 2008-11-23 22:49 +} 2008-11-23 22:49 + 2008-11-23 22:49 from git 2008-11-23 22:50 and comment is 2008-11-23 22:50 commit 9f0cbea0d8cc47801b853d3c61d0e17475b0cc89 2008-11-23 22:50 Author: Segher Boessenkool 2008-11-23 22:50 Date: Sat Aug 11 10:15:30 2007 +1000 2008-11-23 22:50 [POWERPC] Implement atomic{, 64}_{read, write}() without volatile 2008-11-23 22:50 2008-11-23 22:50 Instead, use asm() like all other atomic operations already do. 2008-11-23 22:50 2008-11-23 22:50 Also use inline functions instead of macros; this actually 2008-11-23 22:50 improves code generation (some code becomes a little smaller, 2008-11-23 22:50 probably because of improved alias information -- just a few 2008-11-23 22:50 hundred bytes total on a default kernel build, nothing shocking). 
2008-11-23 22:50 2008-11-23 22:50 Signed-off-by: Segher Boessenkool 2008-11-23 22:50 Signed-off-by: Paul Mackerras 2008-11-23 22:50 so, I think we can use simply long for ->statemap 2008-11-23 22:50 theoretically you shouldn't need the atomic read in the loop - it's provided by the cmpxchg on all but the first execution 2008-11-23 22:50 ooh, more inlines would be so nice 2008-11-23 22:51 and it's great to hear about improving code generation, I wonder why that is 2008-11-23 22:51 seems illogical 2008-11-23 22:51 -!- tim_dimm(~timothyhu@cpe-76-90-122-49.socal.res.rr.com) has joined #tux3 2008-11-23 22:51 maze, and atomic read doesn't actually do much 2008-11-23 22:52 right, but it's tighter - nicer ;-) 2008-11-23 22:52 it's basically just atomic_t.counter now 2008-11-23 22:52 it's not, it's bloatier 2008-11-23 22:53 not sure what we're talking about ;-) I was talking about moving atomic_read out of the while (1) loop 2008-11-23 22:53 anyway, I'd like to see atomic_t die, if only because linus called it atomic_read instead of atomic_get (the other is atomic_set) 2008-11-23 22:53 what would you like to see replace it? 2008-11-23 22:53 it can't go out of the loop 2008-11-23 22:53 sure it can 2008-11-23 22:54 replace it with tons of barriers everwhere 2008-11-23 22:54 that's what's happening anyway 2008-11-23 22:54 and use sparse somehow to figure out if the job is done right 2008-11-23 22:54 cmpxchg on failure returns the currently read value from memory (ie. the new oldmap) 2008-11-23 22:55 no it can't, is a set on a different state field races, we have to pick up the new value of the other field for the next loop iteration 2008-11-23 22:56 huh? 2008-11-23 22:56 tmpmap = cmpxchg(oldmap); if (tmpmap != oldmap) oldmap = tmpmap; you mean? 
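A minimal user-space sketch of the retry loop under discussion, written with C11 <stdatomic.h> instead of the kernel's atomic_cmpxchg so that it compiles outside the kernel (that substitution is an assumption of this example, as is the name set_state). The kernel call returns the value it found; the C11 call instead writes that value back into its "expected" argument on failure, which gives the same effect: the word is read once before the loop and never re-read explicitly.

```c
#include <assert.h>
#include <stdatomic.h>

/* Replace the 3-bit field at bit offset `shift` in *map, leaving every
 * other field as whatever a concurrent writer last stored there. */
static void set_state(atomic_uint *map, unsigned shift, unsigned state)
{
	assert(state <= 7);
	unsigned oldmap = atomic_load(map);	/* read once, before the loop */
	unsigned newmap;

	do {
		newmap = (oldmap & ~(7u << shift)) | (state << shift);
		/* On failure, oldmap is overwritten with the value actually
		 * found in memory, so the next pass recomputes newmap from
		 * fresh data. The weak form may also fail spuriously; the
		 * loop simply retries in that case. */
	} while (!atomic_compare_exchange_weak(map, &oldmap, newmap));
}
```

Only the contended path benefits from dropping the re-read; in the uncontended case the loop body runs exactly once either way.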
2008-11-23 22:56 yup 2008-11-23 22:56 and remove atomic_read() 2008-11-23 22:57 cmpxchg takes oldvalue, newvalue, pointer to memory, and returns the value it read from memory (which should be oldvalue if there was no race) 2008-11-23 22:57 maze, ok 2008-11-23 22:57 you're right 2008-11-23 22:57 ;-) 2008-11-23 22:57 didn't know the old value was returned on failure 2008-11-23 23:00 void set_block_state(handle_t handle, unsigned state) 2008-11-23 23:00 { 2008-11-23 23:00 struct handles *info = handles(handle); 2008-11-23 23:00 unsigned shift = handle_shift(handle); 2008-11-23 23:00 unsigned oldmap = atomic_read(&info->statemap); 2008-11-23 23:00 while (1) { 2008-11-23 23:00 unsigned newmap = (oldmap & ~(7 << shift)) | (state << shift); 2008-11-23 23:00 unsigned tmpmap = atomic_cmpxchg(&info->statemap, oldmap, newmap); 2008-11-23 23:00 if (tmpmap == oldmap) break; 2008-11-23 23:00 oldmap = tmpmap; 2008-11-23 23:00 } 2008-11-23 23:00 } 2008-11-23 23:02 exactly the same as what I just compiled ;) 2008-11-23 23:02 we optimized for the contended case 2008-11-23 23:02 heh 2008-11-23 23:02 probably will never recover the cpu seconds that it just cost to code that, across all the machines linux runs on, ever 2008-11-23 23:02 heh 2008-11-23 23:03 that's my problem - I like this sort of mental ....ation 2008-11-23 23:03 just keeping your hand in, so to speak 2008-11-23 23:04 of course, this type of stuff is even cooler if you can get rid of a lock by doing it 2008-11-23 23:04 why stop there, we could lift the shift of the new field out? 
;) 2008-11-23 23:04 I hope the compiler can also figure that out 2008-11-23 23:04 true 2008-11-23 23:04 both shifts could be lifted out 2008-11-23 23:05 void set_block_state(handle_t handle, unsigned state) 2008-11-23 23:05 { 2008-11-23 23:05 ups ;-) 2008-11-23 23:05 let's let the compiler do that for us 2008-11-23 23:05 if it can't, it's busted 2008-11-23 23:06 in this kind of situation, readability trumps code lifting by hand 2008-11-23 23:08 void set_block_state(handle_t handle, unsigned state) 2008-11-23 23:08 { 2008-11-23 23:08 struct handles *info = handles(handle); 2008-11-23 23:08 unsigned shift = handle_shift(handle); 2008-11-23 23:08 unsigned oldmap = atomic_read(&info->statemap); 2008-11-23 23:08 unsigned newmap, prevmap; 2008-11-23 23:08 do { 2008-11-23 23:08 newmap = (oldmap & ~(7 << shift)) | (state << shift); 2008-11-23 23:08 prevmap = oldmap; 2008-11-23 23:08 oldmap = atomic_cmpxchg(&info->statemap, oldmap, newmap); 2008-11-23 23:08 } while (oldmap != prevmap); 2008-11-23 23:08 } 2008-11-23 23:08 possibly nicer - depends on what you like 2008-11-23 23:08 I was thinking of doing that, but erred on the side of moving on to other stuff 2008-11-23 23:10 dana can make it through the first verse of "hate myself for lovin you" on guitar hero now 2008-11-23 23:10 not bad for a four year old 2008-11-23 23:10 makes it halfway through the refrain 2008-11-23 23:17 is that what your little one is called? nice name 2008-11-23 23:20 chosing names of kids is hard. We decided on a combination of "dan" and "anna" 2008-11-23 23:20 after trying out a bazillion other ideas 2008-11-23 23:21 flips: dana means donating in sanskrit :) 2008-11-23 23:21 hope she donates to free software as much as you... 
2008-11-23 23:25 pranith, nice, I'll tell her 2008-11-23 23:26 I did check out the meaning in hebrew, slavic, other languages 2008-11-23 23:26 :) 2008-11-23 23:26 but not sanskrit 2008-11-23 23:32 I wonder how far I am from writing ->readpage now 2008-11-23 23:32 time to start 2008-11-23 23:47 http://phunq.net/ddtree?p=tux3fs 2008-11-23 23:48 hackfs + handles 2008-11-23 23:48 now, hackfs_readpage and hackfs_writepage 2008-11-23 23:49 as emulation of mpage_read/writepage 2008-11-23 23:53 oops, there is no mpage_writepage 2008-11-23 23:58 ok, almost all merging work was done 2008-11-23 23:58 ready for a read through? 2008-11-23 23:59 you mean, merging between fs/tux3 and user/kernel I guess? 2008-11-23 23:59 yes 2008-11-23 23:59 merge and read file data 2008-11-24 00:00 some time I am going to have to fix the bugs in filemap.c before you get to rewrite ;) 2008-11-24 00:00 now, we can copy user/kernel/* to fs/tux3 2008-11-24 00:00 doesn't handle extents overlapping beginning and end of rewrite range properly 2008-11-24 00:00 ok, that's a big improvement 2008-11-24 00:01 and probably, hole is not handled properly 2008-11-24 00:01 ...I will scale down my ambitions to just emulating ->read/writepage for now 2008-11-24 00:01 hole is handled properly :) 2008-11-24 00:02 um.. 2008-11-24 00:02 well, I need to see how you interfaced it 2008-11-24 00:02 but there is no need for the vfs to zero the buffer 2008-11-24 00:02 the filesystem can do it just as well 2008-11-24 00:02 which it does 2008-11-24 00:03 ah, it may work 2008-11-24 00:03 I'm not sure dwalk_next() is work or not 2008-11-24 00:04 well, I'll need to push it at first 2008-11-24 00:04 it's not handled in your code 2008-11-24 00:05 current one? 2008-11-24 00:05 current tux3_get_block? 2008-11-24 00:05 yes 2008-11-24 00:05 hmm 2008-11-24 00:05 I think it's current, yes 2008-11-24 00:05 now we can return multiple blocks tux3_get_block() 2008-11-24 00:05 it's what I pulled from you and tested last night? 
2008-11-24 00:05 I changed it with merging work 2008-11-24 00:06 ok, I'll look when you check the new one in 2008-11-24 00:06 yes 2008-11-24 00:06 which interface wants multiple blocks? 2008-11-24 00:06 direct-io and readahead, iirc 2008-11-24 00:06 someone may be using though 2008-11-24 00:06 using ->read/write_pages? 2008-11-24 00:07 readpages -> mpage_readpages 2008-11-24 00:07 direct-io calls get_block directly 2008-11-24 00:08 and expects it to be able to handle a fake buffer with b_size larger than one block? 2008-11-24 00:08 iirc, no 2008-11-24 00:09 if not supported, b_size would just be blocksize 2008-11-24 00:09 map_bh() sets bh->b_size 2008-11-24 00:10 if support multile, fs overwrites after it 2008-11-24 00:11 + map_bh(bh_result, inode->i_sb, block); 2008-11-24 00:11 + bh_result->b_size = min(max_blocks, count) << sbi->blockbits; 2008-11-24 00:11 so map_bh sets bh->b_size as the requested number of blocks, and the filesystem sets it to indicate the actual number mapped? 2008-11-24 00:12 map_bh() does bh->b_size = sb->s_blocksize 2008-11-24 00:12 how does the filesystem know how many blocks it should map? 2008-11-24 00:13 caller sets bh->b_size for it 2008-11-24 00:13 + size_t max_blocks = bh_result->b_size >> inode->i_blkbits; 2008-11-24 00:13 fs users like this 2008-11-24 00:13 which file? 2008-11-24 00:13 fs uses like this 2008-11-24 00:13 in my patch :) 2008-11-24 00:13 ah :) 2008-11-24 00:14 ext2_get_block also does like it 2008-11-24 00:14 very nice 2008-11-24 00:14 http://lxr.linux.no/linux+v2.6.27.5/fs/ext2/inode.c#L694 2008-11-24 00:15 so the usual interface to tell the vfs to zero block is to leave the buffer unmapped 2008-11-24 00:16 yes 2008-11-24 00:16 that should be pretty easy, no? 2008-11-24 00:17 would be easy 2008-11-24 00:18 zero block means hole? 2008-11-24 00:18 block not present in the extent list 2008-11-24 00:19 yes 2008-11-24 00:19 but for old fs, if unmapped, caller thinks it one block 2008-11-24 00:21 old fs, you mean other fs? 
2008-11-24 00:22 iow, get_block can't tell multiple blocks hole 2008-11-24 00:22 old fs is meaning "not supporting multiple blocks" 2008-11-24 00:22 right, so your get_blocks has to exit after the first hole block 2008-11-24 00:22 ah 2008-11-24 00:23 ok, the vfs is going to want the hole block to be in state unmapped 2008-11-24 00:23 and it's going to zero that block 2008-11-24 00:23 not the vfs, the block library 2008-11-24 00:24 anyway, no sense to do it in the filesystem in this case 2008-11-24 00:24 yes 2008-11-24 00:25 if all fs supports multiple blocks hole, we will able to tell it via bh->b_size too 2008-11-24 00:26 ah, I see, I used extent with block number of zero to indicate a hole 2008-11-24 00:26 I forgot ;) 2008-11-24 00:27 I don't understand why the block library can't handle it now, from a filesystem that is aware of the interface 2008-11-24 00:28 because old fs doesn't update bh->b_size if unmapped (doesn't call map_bh) 2008-11-24 00:28 thanks for clearing that up 2008-11-24 00:28 I think you said it before ;) 2008-11-24 00:29 :) 2008-11-24 00:29 so, it's ok for our fs to return multiple mapped blocks, plus one hole? 2008-11-24 00:29 well, it's issue of my english skill 2008-11-24 00:29 yes 2008-11-24 00:30 that seems easy enough 2008-11-24 00:30 easy to change tux3_get_block? 2008-11-24 00:31 isn't it? 2008-11-24 00:31 yes, I already did 2008-11-24 00:31 ok, it sounds like everything is in order 2008-11-24 00:31 and yes, it's just language barrier 2008-11-24 00:31 or else me being dense ;) 2008-11-24 00:32 :) 2008-11-24 00:33 I'm writing comment for my 21 patches to push... 
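A rough user-space model of the get_block convention described above: the caller passes the number of blocks it wants in b_size, the filesystem maps as many contiguous blocks as it can and writes the result back into b_size, and a hole is reported by leaving the buffer unmapped without touching b_size, so the caller treats it as a single block. Everything here (struct extent, struct fake_bh, model_get_block) is hypothetical and only loosely mirrors the real buffer_head interface; it is not the tux3_get_block code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not kernel or tux3 types. */
struct extent { unsigned long logical, physical, count; };
struct fake_bh { size_t b_size; unsigned long b_blocknr; int mapped; };

/* Map logical block iblock. A block not covered by any extent is a hole. */
static void model_get_block(const struct extent *ext, size_t n,
			    unsigned long iblock, unsigned blockbits,
			    struct fake_bh *bh)
{
	size_t max_blocks = bh->b_size >> blockbits;	/* caller's request */

	bh->mapped = 0;
	for (size_t i = 0; i < n; i++) {
		if (iblock < ext[i].logical ||
		    iblock - ext[i].logical >= ext[i].count)
			continue;
		unsigned long offset = iblock - ext[i].logical;
		size_t count = ext[i].count - offset;	/* blocks left in extent */

		bh->mapped = 1;
		bh->b_blocknr = ext[i].physical + offset;
		bh->b_size = (count < max_blocks ? count : max_blocks) << blockbits;
		return;
	}
	/* Hole: stay unmapped and leave b_size alone; the caller zeroes one
	 * block and asks again for the next one. */
}
```

With extents { 0, 100, 4 } and { 6, 200, 2 }, a request for eight blocks at logical block 1 comes back mapped at physical block 101 with three blocks' worth of b_size, while a request at logical block 4 comes back unmapped.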
2008-11-24 00:35 I'm trying to write hackfs_readpage using handles 2008-11-24 00:40 +void unlock_block(handle_t handle) 2008-11-24 00:40 +{ 2008-11-24 00:40 + unsigned *state = &handles(handle)->statemap.counter; 2008-11-24 00:40 + unsigned bit = 1 << handle_lockbit(handle); 2008-11-24 00:40 + clear_bit(bit, state); 2008-11-24 00:40 + smp_mb__after_clear_bit(); 2008-11-24 00:40 + wake_up_bit(state, bit); 2008-11-24 00:40 +} 2008-11-24 00:41 + 2008-11-24 00:41 clear_bit() should be clear_bit_unlock()? 2008-11-24 00:41 I pasted that in from unlock_buffer 2008-11-24 00:42 void unlock_buffer(struct buffer_head *bh) 2008-11-24 00:42 { 2008-11-24 00:42 smp_mb__before_clear_bit(); 2008-11-24 00:42 clear_buffer_locked(bh); 2008-11-24 00:42 smp_mb__after_clear_bit(); 2008-11-24 00:42 wake_up_bit(&bh->b_state, BH_Lock); 2008-11-24 00:42 } 2008-11-24 00:42 ? 2008-11-24 00:43 pasted and broken apparently ;) 2008-11-24 00:43 and new one is 2008-11-24 00:43 void unlock_buffer(struct buffer_head *bh) 2008-11-24 00:43 { 2008-11-24 00:43 clear_bit_unlock(BH_Lock, &bh->b_state); 2008-11-24 00:43 smp_mb__after_clear_bit(); 2008-11-24 00:43 wake_up_bit(&bh->b_state, BH_Lock); 2008-11-24 00:43 } 2008-11-24 00:43 would be more proper 2008-11-24 00:44 ok 2008-11-24 00:45 fixed 2008-11-24 00:48 actualy I didn't make a cut and paste mistake 2008-11-24 00:48 2.6.27 has it the way I wrote 2008-11-24 00:49 but you got the new version from git 2008-11-24 00:49 oh 2008-11-24 00:49 something tells me this will be changing a lot in the near future 2008-11-24 00:49 it's still very messy 2008-11-24 00:49 sloppy 2008-11-24 00:50 smp_mb__before_clear_bit <- two underbars in a name is usually a bad sign 2008-11-24 00:50 and... what other kind of mb is there than an smp mb? 
2008-11-24 00:51 and why not just write the barrier there that is actually required 2008-11-24 00:51 instead of trying to imply it has something to do with bits 2008-11-24 00:51 being on the leading edge is fun :) 2008-11-24 01:29 done 2008-11-24 01:29 :) 2008-11-24 01:29 static-http://userweb.kernel.org/~hirofumi/ 2008-11-24 01:29 could you check it? 2008-11-24 01:29 I just remember a change that has to be made to set_buffer_state 2008-11-24 01:30 it has to return the old state 2008-11-24 01:30 so that for example, we can know we were the frist to dirty the block 2008-11-24 01:30 sorry, set_block_state 2008-11-24 01:31 we can already know it by atomic_read? 2008-11-24 01:32 we can only know it at the time it is set by cmpxchg 2008-11-24 01:33 it's like test-and-set 2008-11-24 01:33 um... 2008-11-24 01:34 ah, if we clear dirty, we can know it by return-valude of cmpxchg 2008-11-24 01:34 unsigned set_block_state(handle_t handle, unsigned state) 2008-11-24 01:34 { 2008-11-24 01:34 struct handles *info = handles(handle); 2008-11-24 01:34 unsigned shift = handle_shift(handle); 2008-11-24 01:34 unsigned oldmap = atomic_read(&info->statemap); 2008-11-24 01:34 while (1) { 2008-11-24 01:34 unsigned newmap = (oldmap & ~(7 << shift)) | (state << shift); 2008-11-24 01:34 unsigned latest = atomic_cmpxchg(&info->statemap, oldmap, newmap); 2008-11-24 01:34 if (oldmap == latest) 2008-11-24 01:34 break; 2008-11-24 01:34 oldmap = latest; 2008-11-24 01:34 } 2008-11-24 01:34 return (oldmap >> shift) & 7; 2008-11-24 01:34 } 2008-11-24 01:34 exactly 2008-11-24 01:34 a trivial change to what we had 2008-11-24 01:34 taking oldmap outside the loop made this easy 2008-11-24 01:35 it's pretty nice now 2008-11-24 01:35 oh, so it's test_and_set_block_state :) 2008-11-24 01:35 thanks to you and maze 2008-11-24 01:35 yes :) 2008-11-24 01:35 really useful 2008-11-24 01:35 yes 2008-11-24 01:36 we could call it change_block_state 2008-11-24 01:36 it's better than set_* obviously 2008-11-24 01:37 
running hg view now 2008-11-24 01:37 thanks 2008-11-24 01:38 well, we can further cleanup though 2008-11-24 01:39 this move is a big cleanup already 2008-11-24 01:39 thanks 2008-11-24 01:40 with it, I think we can make tux3 module in user/kernel 2008-11-24 01:42 that change for directory blocks in page cache is not trivial at all 2008-11-24 01:43 probably, I need to work for it more 2008-11-24 01:43 we have mapping_t for struct address_space, right? we could use it more 2008-11-24 01:44 for blockread()? 2008-11-24 01:44 a few places 2008-11-24 01:44 not a big deal 2008-11-24 01:44 probably, yes 2008-11-24 01:45 and also we can use same tux_load_sb with load_sb 2008-11-24 01:45 + .readpage = tux3_readpage, 2008-11-24 01:45 + .readpages = tux3_readpages, 2008-11-24 01:45 +// .writepage = ext4_da_writepage, 2008-11-24 01:45 +// .writepages = ext4_da_writepages, 2008-11-24 01:45 your cut and paste shows :) 2008-11-24 01:46 yes :) 2008-11-24 01:46 maybe copy from my exfat (new FAT-fs) 2008-11-24 01:47 beautify 2008-11-24 01:47 beautiful 2008-11-24 01:48 yes, some odd thing was needed though 2008-11-24 01:48 #include "tux3.h" /* include user/tux3.h, not user/kernel/tux3.h */ 2008-11-24 01:49 pulling 2008-11-24 01:49 kernel/foo.c try to include kernel/tux3.h 2008-11-24 01:49 oops 2008-11-24 01:49 thanks 2008-11-24 01:50 well we should change user/tux3.h go user/tux3user.h 2008-11-24 01:50 and have user/tux3.h just include kernel/tux3.h 2008-11-24 01:50 ugly enough for you? 2008-11-24 01:50 pulling 2008-11-24 01:51 whoops, a conflict on pull 2008-11-24 01:51 we will need -I$(TOP)/user or something 2008-11-24 01:51 wonder how that happened 2008-11-24 01:51 well, I don't care 2008-11-24 01:51 oh 2008-11-24 01:51 I cloned it newly from http://tux3.org/ 2008-11-24 01:54 the problem is inode.c 2008-11-24 01:54 the conflict happened on hg update 2008-11-24 01:55 do you have local change? 
2008-11-24 01:55 I don't think so 2008-11-24 01:55 I'll check again 2008-11-24 01:57 mv inode.c inode.c.borked 2008-11-24 01:57 daniel@moonbase:/src/tux3$ hg update 2008-11-24 01:57 1 files updated, 0 files merged, 0 files removed, 0 files unresolved 2008-11-24 01:58 :) 2008-11-24 01:58 hg bug? 2008-11-24 01:58 um.. 2008-11-24 01:58 I see to have it all, fine 2008-11-24 01:59 ok 2008-11-24 01:59 I effectively rm'd inode.c and the hg updated worked 2008-11-24 01:59 all that should not have happened 2008-11-24 01:59 I think we hit a mercurial bug 2008-11-24 02:00 oh 2008-11-24 02:00 either that or I edited inode.c in my sleep 2008-11-24 02:00 :) 2008-11-24 02:00 you can look at the conflicted version if you like, both old and new look like your code 2008-11-24 02:00 you may have that ability :) 2008-11-24 02:01 yes 2008-11-24 02:01 I can sleep while coding, but not code while sleeping ;) 2008-11-24 02:01 I'd like to see it 2008-11-24 02:01 :) 2008-11-24 02:02 actually... I do sometimes solve problem while sleeping 2008-11-24 02:02 go to sleep with a problem, wake up with a solution 2008-11-24 02:02 happened to me about 6 times or so 2008-11-24 02:02 yes, me too 2008-11-24 02:02 it's weird 2008-11-24 02:03 yes, human is already weird 2008-11-24 02:03 pulled to the public repo 2008-11-24 02:03 thanks 2008-11-24 02:04 so... we can _almost_ just copy user/kernel to fs/tux3, but not quite? 2008-11-24 02:04 just have to solve the header include problem? 2008-11-24 02:04 we can just copy user/kernel to fs/tux3 2008-11-24 02:04 user/kernel/* 2008-11-24 02:05 if we want, I think we can make tux3 module in user/kernel 2008-11-24 02:05 so file read/write basically works? 2008-11-24 02:06 except for my bug in filemap extents 2008-11-24 02:06 yes, it seems to work with 50M file 2008-11-24 02:06 ok 2008-11-24 02:06 rewrite will break under some situations 2008-11-24 02:06 I'll fix it later this week 2008-11-24 02:06 first time write should be fine 2008-11-24 02:07 btw, which bug? 
2008-11-24 02:07 writing an extent that overlaps an existing extent at the beginning or end does not work properly 2008-11-24 02:07 it's a messy little thing to fix 2008-11-24 02:08 ah 2008-11-24 02:08 well, kernel doesn't support write yet though 2008-11-24 02:08 current behaviour is to write some extra blocks, which actually might not break 2008-11-24 02:09 ok, next, I'll try to add write operations 2008-11-24 02:09 ok, here is the problem with having tux3 and hackfs in the same repo: right now I'm reluctant to checkout the tux3 branch because I have uncommitted work in hackfs 2008-11-24 02:10 oh 2008-11-24 02:10 I need to deal with this issue 2008-11-24 02:10 right now I'm going to copy it into another git tree 2008-11-24 02:11 for now, just leave master as is? 2008-11-24 02:12 git stash may work, but I'm not sure 2008-11-24 02:12 building 2008-11-24 02:12 ah 2008-11-24 02:12 thanks 2008-11-24 02:12 I guess that's what that's for 2008-11-24 02:13 maybe 2008-11-24 02:15 ok, git stash is a good thing :) 2008-11-24 02:15 how to use it? 2008-11-24 02:16 man git-stash 2008-11-24 02:16 already did you try it? 2008-11-24 02:16 no, I read the man page 2008-11-24 02:16 it's what I wanted just then 2008-11-24 02:16 yes, I'm reading too 2008-11-24 02:17 i see 2008-11-24 02:17 but I have already copied to a different repo and built, and getting ready to boot uml 2008-11-24 02:17 hey flips 2008-11-24 02:17 hi 2008-11-24 02:17 how's the kernel port going ? 2008-11-24 02:18 I'm sure it'll be announced loudly when you have it working 2008-11-24 02:19 mounted 2008-11-24 02:19 oh, i see, git stash is 2008-11-24 02:19 hirofumi has it working :) 2008-11-24 02:19 hirofumi basically did it without help from me 2008-11-24 02:19 coding machine 2008-11-24 02:20 :) 2008-11-24 02:20 mounting via the kernel now ? 
2008-11-24 02:20 ACTION has been completely disinterested in Linux kernel development these last few weeks :\ 2008-11-24 02:20 segfaults on write :) 2008-11-24 02:20 mount should work now 2008-11-24 02:20 not unexpected 2008-11-24 02:21 oh 2008-11-24 02:21 I expected -EIO 2008-11-24 02:21 I'll get a traceback 2008-11-24 02:21 well, currently ->writepage is NULL 2008-11-24 02:23 that would do it 2008-11-24 02:23 mandatory methods should segfault 2008-11-24 02:23 but... well EIO might be cleaner 2008-11-24 02:23 probably 2008-11-24 02:23 because some filesystems are r/o 2008-11-24 02:24 gdb -args ./linux ubd0=/src/zuma/root ubd1=/tmp/testdev 2008-11-24 02:24 probably, I shouldn't provide ->write and ->aio_write 2008-11-24 02:26 cho hello >/mnt/foobar 2008-11-24 02:26 blockread: ==> ino 13, block 0 2008-11-24 02:26 blockread: <=== b_blocknr 16 2008-11-24 02:26 -bash: /mnt/foobar: Permission denied 2008-11-24 02:26 -!- mlankhorst_(~m@fw1.astro.rug.nl) has joined #tux3 2008-11-24 02:26 echoing into an existing file segfaults 2008-11-24 02:26 oh 2008-11-24 02:27 2159 status = a_ops->prepare_write(file, page, offset, offset+bytes); 2008-11-24 02:27 ok 2008-11-24 02:27 makes sense 2008-11-24 02:28 i see, it wasmandatory 2008-11-24 02:28 it's major progress 2008-11-24 02:28 yes 2008-11-24 02:28 I guess a r/o filesystem is responsible for making sure file ops->write isn't filled in 2008-11-24 02:29 yes 2008-11-24 02:29 when the file operation is filled in, I think it's ok for the vfs to segfault here 2008-11-24 02:29 yes 2008-11-24 02:29 it is my fault 2008-11-24 02:29 we are close to an lkml [ANNOUNCE] 2008-11-24 02:29 about 5 more days? 2008-11-24 02:30 write methods? 
2008-11-24 02:31 right 2008-11-24 02:31 as soon as it can write 2008-11-24 02:31 announce 2008-11-24 02:31 good enough :) 2008-11-24 02:31 ok 2008-11-24 02:31 bugs are ok, ugly is ok 2008-11-24 02:31 missing major features is ok 2008-11-24 02:31 :) 2008-11-24 02:31 the point is, it does something 2008-11-24 02:31 well, I'll try tommorow 2008-11-24 02:32 probably, overwrite is easy 2008-11-24 02:33 it will likely work as it is 2008-11-24 02:33 extent size will need more work 2008-11-24 02:33 but... I still need to fix it, later 2008-11-24 02:33 the fix won't affect your work 2008-11-24 02:33 oh, if we're only doing one block per extent right now we won't hit it 2008-11-24 02:33 but we're doing more than that, right? 2008-11-24 02:34 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-24 02:34 hey maze 2008-11-24 02:34 hey, not here really 2008-11-24 02:34 I'm not sure about bug 2008-11-24 02:35 we will pretend not to notice 2008-11-24 02:35 the unit test showed the bug, and I did not have time to fix it then 2008-11-24 02:35 I will soon, but I want to get the handle stuff somewhat working 2008-11-24 02:36 yes 2008-11-24 02:36 it's testing against situations that would be unlikely to come up in normal use 2008-11-24 02:36 rewriting part of a file with a hole in it 2008-11-24 02:36 i see 2008-11-24 02:37 probably, in tux3_get_block, we can just ignore hole for now 2008-11-24 02:38 I guess you are unlikely to hit that bug in testing before I fix it ;) 2008-11-24 02:38 yes :) 2008-11-24 02:39 btw, with path to kernel/Makefile, we can make module in user/kernel/ 2008-11-24 02:39 do we need it? 
2008-11-24 02:40 #LINUX = /lib/modules/$(uname -r)/build/
2008-11-24 02:40 LINUX = /devel/linux/works/git/mercurial/tux3fs-build
2008-11-24 02:40 CONFIG_TUX3 = m
2008-11-24 02:40 ifeq ($(KERNELRELEASE),)
2008-11-24 02:40 all:
2008-11-24 02:40     make -C $(LINUX) M=`pwd` modules
2008-11-24 02:40 else
2008-11-24 02:40 obj-$(CONFIG_TUX3) += tux3.o
2008-11-24 02:40 tux3-objs += balloc.o btree.o dir.o dleaf.o filemap.o hexdump.o iattr.o \
2008-11-24 02:40     ileaf.o inode.o super.o xattr.o
2008-11-24 02:40 EXTRA_CFLAGS += -std=gnu99 -Wno-declaration-after-statement
2008-11-24 02:40 endif
2008-11-24 02:40 oh, that's cool
2008-11-24 02:40 some people like that
2008-11-24 02:41 right, so "make module"
2008-11-24 02:41 ok, I'll push it after some test
2008-11-24 02:41 cd user/kernel; make
2008-11-24 02:41 very nice
2008-11-24 02:41 ok
2008-11-24 02:41 right, and make module can do cd user/kernel; make ;)
2008-11-24 02:42 oh
2008-11-24 02:45 what does the ifeq do?
2008-11-24 02:45 oh
2008-11-24 02:45 figures out whether it's part of a kernel build, or a separate module
2008-11-24 02:46 currently is top make or it's via linux/Makefile
2008-11-24 02:46 yes
2008-11-24 02:46 got it
2008-11-24 02:46 nice hack
2008-11-24 02:47 it's major hack from kbuild people
2008-11-24 02:48 cute hack
2008-11-24 03:02 thanks
2008-11-24 03:02 it seems to work
2008-11-24 03:02 static-http://userweb.kernel.org/~hirofumi/tux3/
2008-11-24 03:02 please pull it
2008-11-24 03:02 write?
2008-11-24 03:03 no 2008-11-24 03:03 makefile 2008-11-24 03:03 right ;) 2008-11-24 03:03 you're fast, but not quite that fast 2008-11-24 03:04 CC [M] /src/tux3/kernel/dir.o 2008-11-24 03:04 make[2]: *** [/src/tux3/kernel/dir.o] Error 1 2008-11-24 03:04 well 2008-11-24 03:04 my kernel is too old 2008-11-24 03:04 anyway, I don't really want to insert a module into my workstation/server ;) 2008-11-24 03:05 you can specify by "make LINUX=/path-to-kernel" 2008-11-24 03:05 make[2]: *** [/src/tux3/kernel/dir.o] Error 1 2008-11-24 03:05 oh :) 2008-11-24 03:05 what? 2008-11-24 03:06 the error not pasting into xchat properly 2008-11-24 03:06 by default, it build for current using kernel 2008-11-24 03:06 well 2008-11-24 03:06 doesn't matter 2008-11-24 03:06 yes 2008-11-24 03:06 I have the right tree 2008-11-24 03:06 it can overwrite for test kernel 2008-11-24 03:06 just need to tell it the right one, how? 2008-11-24 03:06 btw, uml can load module? 2008-11-24 03:07 it can 2008-11-24 03:07 it's nearly impossible to debug though 2008-11-24 03:07 jdike can do it, I can't 2008-11-24 03:07 ACTION wish folks luck and speedy success to the porting effort 2008-11-24 03:07 night 2008-11-24 03:07 ah 2008-11-24 03:07 night bh 2008-11-24 03:08 well, it can change kernel path to build module 2008-11-24 03:09 ah, LINUX= make 2008-11-24 03:09 or make LINUX= 2008-11-24 03:09 also need ARCH=um 2008-11-24 03:10 ah 2008-11-24 03:10 it's doing it :) 2008-11-24 03:12 CC [M] /src/tux3/kernel/super.o 2008-11-24 03:12 make[2]: *** [/src/tux3/kernel/super.o] Error 1 2008-11-24 03:12 m 2008-11-24 03:12 what happened to cut n paste??? 2008-11-24 03:12 CONFIG_MODULE? 
2008-11-24 03:12 #/src/tux3/kernel/super.c:13:24: error: linux/tux3.h: No such file or directory
2008-11-24 03:12 oh
2008-11-24 03:13 it's lines beginning with /
2008-11-24 03:13 stupid xchat
2008-11-24 03:13 ah
2008-11-24 03:13 that's a bug
2008-11-24 03:13 linux/tux3.h
2008-11-24 03:13 it seems not to be right kernel
2008-11-24 03:16 WARNING: "__udivdi3" [/src/tux3/kernel/tux3.ko] undefined!
2008-11-24 03:17 oh
2008-11-24 03:17 but it made the module
2008-11-24 03:17 64bit div
2008-11-24 03:17 I don't know what happened to include/linux/tux3.h
2008-11-24 03:17 now where did I do that div?
2008-11-24 03:17 I remember thinking this is probably going to break
2008-11-24 03:18 I don't know
2008-11-24 03:18 it's in the time calculations
2008-11-24 03:19 ah
2008-11-24 03:19 return ((u64)time.tv_sec << 32) + ((u64)time.tv_nsec << 32) / 1000000000ULL;
2008-11-24 03:19 s/1000000000ULL/1000000000/
2008-11-24 03:20 return ((u64)time.tv_sec << 32) + ((u64)time.tv_nsec << 32) / 1000000000ULL;
2008-11-24 03:20 it's not actually a 64 bit div
2008-11-24 03:20 it's 64/32
2008-11-24 03:20 yes, but ULL
2008-11-24 03:20 pathetic gcc still can't generate inline code for that
2008-11-24 03:20 oh
2008-11-24 03:20 right
2008-11-24 03:20 my fault
2008-11-24 03:20 well it doesn't need to be ULL
2008-11-24 03:20 ;)
2008-11-24 03:20 return (((val & 0xffffffff) * 1000000000ULL) + 0x80000000) >> 32;
2008-11-24 03:20 and this
2008-11-24 03:21 that should be ok
2008-11-24 03:21 ah, but ULL is not needed
2008-11-24 03:25 oh right
2008-11-24 03:25 well
2008-11-24 03:25 that one has to be ull
2008-11-24 03:25 um... why?
2008-11-24 03:28 because it has to be 64 bit before it can be shifted right 32 bits
2008-11-24 03:28 it has to be ULL on 32 bit arch anyway
2008-11-24 03:29 should probably #define billion according to the word size
2008-11-24 03:29 val is 64bit, so it's ok?
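The two return statements quoted above pack a seconds/nanoseconds pair into a 32.32 fixed-point value and unpack the nanoseconds again. A minimal userspace sketch of that round trip follows, assuming plain C99 integer types; the names pack_time, unpack_sec, unpack_nsec and NSEC_PER_SEC_U32 are hypothetical and not taken from the tux3 source. The divisor is a 32-bit constant, so the division is the 64-by-32 case mentioned in the log (in-kernel code usually reaches for do_div() for that), and the multiplication is already 64 bits wide because val is.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC_U32 1000000000U	/* hypothetical name; the value fits in 32 bits */

/* Pack seconds and nanoseconds into 32.32 fixed point (fraction in the low word). */
static uint64_t pack_time(uint32_t sec, uint32_t nsec)
{
	assert(nsec < NSEC_PER_SEC_U32);
	/* 64-bit dividend, 32-bit divisor */
	return ((uint64_t)sec << 32) + (((uint64_t)nsec << 32) / NSEC_PER_SEC_U32);
}

/* Whole seconds live in the high word. */
static uint32_t unpack_sec(uint64_t val)
{
	return (uint32_t)(val >> 32);
}

/* Recover nanoseconds from the low word, rounding to nearest. */
static uint32_t unpack_nsec(uint64_t val)
{
	/* val is 64 bits wide, so the product is 64 bits before the shift */
	return (uint32_t)((((val & 0xffffffffULL) * NSEC_PER_SEC_U32) + 0x80000000ULL) >> 32);
}
```

The rounding term 0x80000000 is what lets unpack_nsec undo the truncating division in pack_time: the truncation loses less than a quarter of a nanosecond, and adding one half before the shift brings the result back to the original count.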
2008-11-24 03:29 ok, fine 2008-11-24 03:31 oh, I know what happened to tux3.h 2008-11-24 03:31 I'm still checked out on the hackfs branch ;) 2008-11-24 03:32 oh :) 2008-11-24 03:32 well, maybe we can remove 2008-11-24 03:32 I just touched include/linux/tux3.h and it built 2008-11-24 03:32 nothing in there yet, and why should there be? 2008-11-24 03:33 in kerne, there was it 2008-11-24 03:33 only reason to put stuff there is to export it to user space 2008-11-24 03:33 oh 2008-11-24 03:33 we should just delete that file 2008-11-24 03:33 actual include/linux/tux3.h 2008-11-24 03:34 #ifndef LINUX_TUX3_H 2008-11-24 03:34 #define LINUX_TUX3_H 2008-11-24 03:34 #endif 2008-11-24 03:34 :) 2008-11-24 03:35 I know 2008-11-24 03:35 it's traditional to have a file in include/linux 2008-11-24 03:35 but I can't think of any reason why it's needed 2008-11-24 03:36 yes, it is needed 2008-11-24 03:36 yes, it is not needed 2008-11-24 03:39 incidentally 2008-11-24 03:39 in english, we way "no, it is not needed" 2008-11-24 03:39 difference from asian languages 2008-11-24 03:39 ah 2008-11-24 03:40 thanks 2008-11-24 03:40 I don't know why we say that ;) 2008-11-24 03:40 the other way makes more sense 2008-11-24 03:40 I learned it, and I forget :) 2008-11-24 03:43 well I'd better sleep 2008-11-24 03:43 ok 2008-11-24 03:43 didn't get as much done on block handles as I'd hoped 2008-11-24 03:43 maybe I will get some done while I'm asleep ;) 2008-11-24 03:43 I may able to overwrite some data 2008-11-24 04:18 hirofumi, I put up a post on block state transitions 2008-11-24 04:18 which you might find interesting 2008-11-24 04:18 ok 2008-11-24 07:15 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-11-24 07:43 -!- mingming(~mingming@c-71-193-163-244.hsd1.or.comcast.net) has joined #tux3 2008-11-24 07:54 -!- mingming(~mingming@c-71-193-163-244.hsd1.or.comcast.net) has joined #tux3 2008-11-24 07:58 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-24 10:40 -!- 
mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-24 10:54 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-24 13:01 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-24 13:07 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-24 13:10 I'm trying to understand by block_read_full_page does it with the page locked, and the buffers also locked 2008-11-24 13:10 surely the page could be unlocked as soon as the buffers are locked? 2008-11-24 13:44 http://lxr.linux.no/linux+v2.6.27.5/include/linux/page-flags.h#L251 <- wow, nice documentation 2008-11-24 13:45 I wonder who added that? 2008-11-24 13:45 ACTION goes to check git blame 2008-11-24 13:49 hehe 2008-11-24 13:49 There is also git-praise 2008-11-24 13:49 It's less negative 2008-11-24 13:49 hmm, how do you get an annotated file from git-web? 2008-11-24 13:49 I think that is disabled 2008-11-24 13:50 It would be too cpu-intensive 2008-11-24 13:50 aw 2008-11-24 13:50 well 2008-11-24 13:50 git blame it is then 2008-11-24 13:50 git-praise :P 2008-11-24 13:50 I was having some fun with a compiler guy 2008-11-24 13:51 right, it's git-praise in this case 2008-11-24 13:51 I now got wine64 almost working 2008-11-24 13:51 more often -blame though 2008-11-24 13:51 Minus compiler bugs 2008-11-24 13:51 git-annotate would be the logical one 2008-11-24 13:51 both blame and praise are kinda funny... the first time 2008-11-24 13:52 it this the first time wine64 has worked? 2008-11-24 13:52 is that 64 as in 64 bit windows? 2008-11-24 13:52 or 64 bit linux host? 2008-11-24 13:52 or both? 2008-11-24 13:52 ACTION thinks both 2008-11-24 13:53 64-linux host 2008-11-24 13:53 64-windows binary 2008-11-24 13:53 so both 2008-11-24 13:53 would be tough otherwise 2008-11-24 13:53 does wine 32 run on linux 64 host? 
2008-11-24 13:53 Sure 2008-11-24 13:53 It's just a linux32 multilib app 2008-11-24 13:54 git-web idiotically puts spaces between segments of the filename path 2008-11-24 13:54 Of course no linux distro creates the .so symlinks for lib32 2008-11-24 13:54 So you have to make them first :/ 2008-11-24 13:55 surely the debian wine maintainer should do something? 2008-11-24 13:55 But wine64 has major issues 2008-11-24 13:56 The least of which is the infinite amount of foo = (int)ptr; ptr2 = (void *)foo; 2008-11-24 13:56 WHich is too common 2008-11-24 13:56 feh, my git tree has everything annocated as "Daniel Phillips" 2008-11-24 13:56 because I didn't clone it 2008-11-24 13:56 I think it's time for a new tree 2008-11-24 13:56 cloned from something 2008-11-24 13:57 somebody didn't here of (long) ? 2008-11-24 13:57 hear 2008-11-24 13:58 No, wine has been around since 1994 2008-11-24 13:58 Sso they haven't always been consistent 2008-11-24 13:58 long has always been defined as the int type that will hold an address 2008-11-24 13:58 so it's always been inconsistent 2008-11-24 13:58 Windows defines long as int :( 2008-11-24 13:59 same-size 2008-11-24 13:59 that's fucked 2008-11-24 13:59 no wonder they had problems with 64 bit windows 2008-11-24 13:59 windows also defines DWORD as long 2008-11-24 13:59 Wine has to define LONG as int for that reason 2008-11-24 13:59 DWORD is meaningly anyway 2008-11-24 14:00 oh 2008-11-24 14:00 That started the windows 64 incompatibility issues 2008-11-24 14:00 DWORD is a wine macro? 
2008-11-24 14:00 a windows type 2008-11-24 14:00 sounds bogus 2008-11-24 14:00 It also has DWORD64, and DWORD_PTR 2008-11-24 14:00 And LONG, LONG_PTR, LONG64 2008-11-24 14:00 INT, INT_PTR, INT64 :/ 2008-11-24 14:01 I'm sure there were stupid reasons for all of those 2008-11-24 14:01 getting some header to compile 2008-11-24 14:01 It's insane o.o 2008-11-24 14:01 and not having a big enough alligator pit though throw all the coders who do stuff like that 2008-11-24 14:01 Yeah >:( 2008-11-24 14:03 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-24 14:03 maybe github runs annotations 2008-11-24 14:06 dumb stuff on git-hubs web frontend, shows %2F as path separators in urls instead of / 2008-11-24 14:06 that's an alligator offense 2008-11-24 14:07 http://github.com/github/linux-2.6/tree/master/include%2Flinux%2Fpage-flags.h <- no obvious way to annotate 2008-11-24 14:07 hmm 2008-11-24 14:07 mercurial doesn't seem to choke on annotate 2008-11-24 14:08 maybe I just haven't tried it on a full linux tree 2008-11-24 14:08 but it's the fundamental git schema I guess 2008-11-24 14:08 it's basically delta based, not weave 2008-11-24 14:09 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-11-24 14:10 ok: synchronization problem of the day: how to be sure that when the last block on a page becomes uptodate, there is only one call to SetPageUptodate 2008-11-24 14:11 Anyway, bedtime 2008-11-24 14:11 guten nacht 2008-11-24 14:11 und schlaft gute 2008-11-24 14:12 tsusch 2008-11-24 14:12 ah, that's how you spell that 2008-11-24 14:12 ACTION isn't sure whether it makes sense to talk german to a dutchman 2008-11-24 14:13 Hehehe 2008-11-24 14:13 I speak as much german as you probably 2008-11-24 14:13 If that isn't much 2008-11-24 14:14 I found out the other day that yiddish is german, spelled funny 2008-11-24 14:14 got to use that here in hollywood zone ;) 2008-11-24 14:18 back to the synchronization problem of the day 2008-11-24 14:20 
first, is it possible for two read bios to be in flight for the same page, for different buffers? Answer: probably 2008-11-24 14:20 ah 2008-11-24 14:20 change_block_state might be the answer 2008-11-24 14:21 since it returns the old state 2008-11-24 14:21 only one bio completion can have filled in the final buffer 2008-11-24 14:24 the old buffer state & 0x7777 must be 0x2292 where the 1 is the empty state for the block that just completed and the 2's are clean state 2008-11-24 14:24 the 9 is empty state ored with block locked bit, I mean 2008-11-24 14:26 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-24 14:27 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-24 14:27 -!- ceatinge(~ceatinge@veryclever.net) has joined #tux3 2008-11-24 14:27 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-11-24 14:27 -!- rmull(~rmull@acsx01.bu.edu) has joined #tux3 2008-11-24 14:27 -!- shapor_(~shapor@yzf.shapor.com) has joined #tux3 2008-11-24 14:27 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-11-24 14:27 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-11-24 14:27 -!- samlh(~sam@67.129.121.145) has joined #tux3 2008-11-24 14:27 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-24 14:27 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-11-24 14:27 -!- mlankhorst(~m@fw1.astro.rug.nl) has joined #tux3 2008-11-24 14:29 the old statemap & 0x7777 (to get rid of the lock bits) must be 0x2212 where the 1 is the empty for the block that just completed and all other blocks are 2 (clean) 2008-11-24 14:29 so that transition can be detected 2008-11-24 14:30 I'm not sure what harm would actually come from setting the page uptodate twice though 2008-11-24 14:31 comments on end_buffer_async_read claims bad things will happen, but I don't see it 2008-11-24 14:32 oh 2008-11-24 14:32 double unlock of the page 2008-11-24 14:32 that's bad 2008-11-24 14:32 well, that goes away if 
we don't have the page locked during block IO, it's not clear why the page is locked anyway 2008-11-24 14:50 Hm 2008-11-24 14:51 I wonder why long long is still 8 bytes long in 64-bits mode 2008-11-24 14:51 because it's lame? 2008-11-24 14:51 I thought you were sleeping 2008-11-24 14:51 I am :) 2008-11-24 14:52 it's practical though 2008-11-24 14:52 tons of 32 bit code declares long long to get 64 bit stuff, and 128 bits is not needed in 64 bit mode 2008-11-24 14:52 true 2008-11-24 14:52 it's because "64 bits should be enoough for anyone" 2008-11-24 14:53 :> 2008-11-24 14:57 I'm guessing int(size)_t sort of alleviates that 2008-11-24 15:05 it's necessary for survival 2008-11-24 15:22 if page is not locked, e.g. mmap can't wait on page? 2008-11-24 15:23 well, do_no_page should wait on the buffers 2008-11-24 15:23 only thing is... do_no_page is gone 2008-11-24 15:23 what replaced it? 2008-11-24 15:24 anyway, I've got a solution to the "only unlock page once" problem, and it is better than the buffer.c solution 2008-11-24 15:24 I would just like to understand while page_lock _plus_ buffer lock is actually needed 2008-11-24 15:25 __do_fault()? 2008-11-24 15:25 calls what to read the page? 
2008-11-24 15:25 it will call ->readpage 2008-11-24 15:26 I don't see the call 2008-11-24 15:26 but it will have to lock_page before ->readpage 2008-11-24 15:26 I've seen it in earlier kernels of course 2008-11-24 15:27 __do_fault -> vm_ops->fault -> r->eadpage 2008-11-24 15:27 __do_fault -> vm_ops->fault -> ->readpage 2008-11-24 15:28 ah, vm_ops is new for me 2008-11-24 15:29 in that case, it should be ->fault == filemap_fault 2008-11-24 15:29 thanks, you're much faster than lxr ;) 2008-11-24 15:30 why the name had to be changed from do_no_page is not clear 2008-11-24 15:30 yes, emacs and cscope is fast 2008-11-24 15:30 iirc, ->fault is changed from something 2008-11-24 15:31 maybe before, ->no_page, iirc 2008-11-24 15:31 I will cscope my kernel and just use lxr for tux3 u ;) 2008-11-24 15:31 yes, I'm sure it's good :) 2008-11-24 15:32 ok, so this locks the page of course 2008-11-24 15:32 but we can unlock the page when we lock the buffer I think 2008-11-24 15:32 what could go wrong? ;) 2008-11-24 15:32 but, cscope is needed good fontend though 2008-11-24 15:32 or I should say, we can unlock the page when we lock the block 2008-11-24 15:32 because buffer layer expects the page to be locked for buffer read 2008-11-24 15:33 ah 2008-11-24 15:33 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-11-24 15:34 wait on buffer lock vs wait on page lock 2008-11-24 15:34 yes 2008-11-24 15:34 there are valid things that could be able to run in parallel then 2008-11-24 15:34 propbaby, we will need to page_lock to check buffer 2008-11-24 15:34 simultaneous reads on more than one block on the same page 2008-11-24 15:34 folks 2008-11-24 15:35 right, but that is a short lock 2008-11-24 15:35 this lock across read is a _really_ long lock 2008-11-24 15:35 oh 2008-11-24 15:35 yes 2008-11-24 15:35 anyway, it's something to try later 2008-11-24 15:36 for now we can unlock the page when the last block goes clean, just like the buffer library 2008-11-24 15:36 and I have a 
nice atomic operation for doing that 2008-11-24 15:36 thanks to you and maze ;) 2008-11-24 15:36 hmm? 2008-11-24 15:36 ah, just mentioning my name ;-) 2008-11-24 15:37 maze, a new use for change_block_state 2008-11-24 15:37 yes 2008-11-24 15:37 oh, cool, what is it? 2008-11-24 15:37 atomically discover when the last block on a page has gone uptodate 2008-11-24 15:37 to avoid unlocking the page twice in the endio 2008-11-24 15:38 the old statemap & 0x7777 (to get rid of the lock bits) must be 0x2212 where the 1 is the empty for the block that just completed and all other blocks are 2 (clean) 2008-11-24 15:42 truncate seems to assume lock_page 2008-11-24 15:43 there is a hook to make it stay way from a page with page->private non-null 2008-11-24 15:43 try_to_release_page 2008-11-24 15:44 yes, if it's not block_invalidatepage, lock_page is not needed probably 2008-11-24 15:44 would we then put our block into state "truncated" and complete the release when IO on the block completes (just an idea) 2008-11-24 15:45 um... 2008-11-24 15:46 so, it truncates related metadata too? 
2008-11-24 15:46 we should be able to defer that if we want 2008-11-24 15:47 it's not mandatory to do it in the vmtruncate 2008-11-24 15:47 410 bit_spin_lock(BH_Uptodate_Lock, &first->b_state); <- we can avoid this nasty extra lock on block read completion 2008-11-24 15:47 http://lxr.linux.no/linux+v2.6.27.5/fs/buffer.c#L410 2008-11-24 15:47 if we handle metadata somehow, maybe we can 2008-11-24 15:48 oh 2008-11-24 15:52 in our atomic commit model it is natural to perform the metadata truncate at delta transition 2008-11-24 15:52 no immediately on truncate 2008-11-24 15:53 this code has to be written still 2008-11-24 15:53 hmm, I have something 2008-11-24 15:53 that works 2008-11-24 15:53 need improving though 2008-11-24 15:54 I don't think it is extent aware 2008-11-24 15:56 maybe, we need some lock for truncate vs read 2008-11-24 15:57 I wonder 2008-11-24 15:57 well the state transition is done by a single process 2008-11-24 15:57 not in parallel 2008-11-24 15:57 that is, the delta transition, which I call "staging" by the way 2008-11-24 15:58 yes 2008-11-24 15:58 so staging ensures that metadata is updated without races 2008-11-24 15:58 you could say that this might be a bottleneck 2008-11-24 15:59 it will be good enough to start with though, and I have an idea how to run staging in parallel on multiple cpus 2008-11-24 16:00 we, at the time of the delta transition we can look at the cummulative effects of all the truncates and writes from the active delta 2008-11-24 16:00 s/we/so/ 2008-11-24 16:01 in the active delta, truncate and write only affect the page cache and the cached i_size 2008-11-24 16:01 this separation is really nice 2008-11-24 16:01 and efficient 2008-11-24 16:02 I would like to have the same for dirops, which we discussed 2008-11-24 16:02 we will eventually have it for dirops too 2008-11-24 16:03 umm... 
2008-11-24 16:05 sounds good, but I'm not sure at all 2008-11-24 16:05 I'll just keep posting new and better explanations until it's clear ;) 2008-11-24 16:05 and write code too 2008-11-24 16:06 but I'd like everybody to know what I'm writing 2008-11-24 16:06 yes, code is really helpful for me 2008-11-24 16:06 here's another nice thing we can do with our block model: change state from empty to clean and unlock in one operation 2008-11-24 16:06 of course, docs too though 2008-11-24 16:07 we get the previous state back, so we can verify that the block actually was locked, then do the wakeup 2008-11-24 16:07 it's set_clean_and_unlock in one cmpxchg 2008-11-24 16:07 yes 2008-11-24 16:08 or if we clear dirty before read, set_clean_and_lock? 2008-11-24 16:08 does we need to lock then? 2008-11-24 16:09 s/read/write/ 2008-11-24 16:09 yes, if we clear before the write 2008-11-24 16:09 but I don't think it is necessary to clear dirty before the write 2008-11-24 16:10 there are no asynchronous re-dirties to catch 2008-11-24 16:10 I guess there are other places to use this trick 2008-11-24 16:11 we can't dirty without lock? 2008-11-24 16:11 we can dirty without the block lock I think 2008-11-24 16:11 I'm pretty sure 2008-11-24 16:11 so, if we clear after write, re-dirty was lost? 2008-11-24 16:12 only the active delta is allowed to dirty a file page, and only the staging delta is allowed to dirty metadata 2008-11-24 16:12 there are no re-dirties 2008-11-24 16:12 s/file page/file block/ above 2008-11-24 16:13 if active delta was re-dirty file page, what happen? 2008-11-24 16:13 fork? 
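The "set_clean_and_unlock in one cmpxchg" idea mentioned above can be sketched in userspace with C11 atomics. This is only an illustration under stated assumptions: four bits per block (three state bits plus a lock bit in bit 3), empty = 1 and clean = 2 as quoted earlier in the log, and a hypothetical helper change_block_field that swaps the whole 4-bit field and returns its previous contents. None of these names come from the tux3 tree.

```c
#include <assert.h>
#include <stdatomic.h>

enum { BLOCK_EMPTY = 1, BLOCK_CLEAN = 2 };	/* values as quoted in the log at this point */
#define BLOCK_LOCKED 8u				/* assumed: bit 3 of each 4-bit field */

/*
 * Hypothetical helper: replace one block's 4-bit field (state plus lock
 * bit) and return the field's previous contents.
 */
static unsigned change_block_field(atomic_uint *statemap, unsigned block, unsigned field)
{
	unsigned shift = 4 * block;
	unsigned oldmap = atomic_load(statemap);
	unsigned newmap;

	assert(block < 8 && field < 16);
	do {
		newmap = (oldmap & ~(15u << shift)) | (field << shift);
		/* on failure, oldmap is refreshed with the latest value and we retry */
	} while (!atomic_compare_exchange_weak(statemap, &oldmap, newmap));
	return (oldmap >> shift) & 15;
}

/* Mark a block clean and drop its lock in one step; report whether it was locked. */
static int set_clean_and_unlock(atomic_uint *statemap, unsigned block)
{
	unsigned old = change_block_field(statemap, block, BLOCK_CLEAN);

	return (old & BLOCK_LOCKED) != 0;	/* a caller would wake waiters when this is set */
}
```

Because the old field comes back from the same atomic operation that installs the new one, the caller learns whether the block really was locked without a second read that could race with another updater.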
2008-11-24 16:13 forking isn't very useful for file data 2008-11-24 16:14 the page can be redirtied, the file block does not have to be 2008-11-24 16:14 until staging, then it is written to a different physical location 2008-11-24 16:14 um 2008-11-24 16:14 we _could_ write it to a different physical location 2008-11-24 16:14 but I haven't answered your question ;) 2008-11-24 16:16 dirty page 2008-11-24 16:16 write 2008-11-24 16:16 dirty page 2008-11-24 16:16 clear dirty 2008-11-24 16:16 ok, this only comes up if we have a mode like ordered data 2008-11-24 16:16 where we don't fork data blocks 2008-11-24 16:16 right and left are different writer 2008-11-24 16:17 the page will record the re-dirty fine 2008-11-24 16:17 now what about the blocks 2008-11-24 16:18 ah, blocks means handles? 2008-11-24 16:18 yes 2008-11-24 16:18 but I'm still thinking about the details 2008-11-24 16:19 ah 2008-11-24 16:19 and this is only an issue with "ordered data" mode 2008-11-24 16:19 where we don't enforce strict delta semantics on file data 2008-11-24 16:19 something btrs also has 2008-11-24 16:19 and ext3 2008-11-24 16:19 might be important to benchmarking ;) 2008-11-24 16:20 i see 2008-11-24 16:20 I'll think about it a moment 2008-11-24 16:22 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-24 16:24 btw, did you go to bed yesterday? 2008-11-24 16:24 for a short time ;) 2008-11-24 16:24 too short 2008-11-24 16:24 yes, it seems to be too short 2008-11-24 16:24 my atomic test for transition to "all uptodate" is not quite right 2008-11-24 16:25 because clean or dirty are both allowed 2008-11-24 16:25 the test has to be "no blocks empty" 2008-11-24 16:25 atomic test? 2008-11-24 16:26 test for cmpxchg stuff? 
2008-11-24 16:26 the one I described above to detect the last block coming uptodate on a page, in order to unlock the page only once in the endio 2008-11-24 16:26 yes 2008-11-24 16:26 i see 2008-11-24 16:27 so we want a nice little bit operation that tests for "no state zero except the one we just updated" 2008-11-24 16:29 I can't see what happen without code... 2008-11-24 16:29 well 2008-11-24 16:29 that is, (~oldmap & 0x7777) == 0x77e7 2008-11-24 16:29 I think 2008-11-24 16:29 easy to test 2008-11-24 16:30 where the 'e' is the inverted value of clean state 2008-11-24 16:30 sorry 2008-11-24 16:30 empty state 2008-11-24 16:30 for the block we just updated 2008-11-24 16:30 ah, this depends on the 'empty' state being zero 2008-11-24 16:30 whereas I currently have it 1 2008-11-24 16:31 this is a good enough reason to make it 0 2008-11-24 16:31 hack alert ;) 2008-11-24 16:31 :) 2008-11-24 16:32 macros helps it? 2008-11-24 16:32 surely 2008-11-24 16:32 bit hex numbers help see whether it works 2008-11-24 16:32 s/bit/but/ 2008-11-24 16:33 yes 2008-11-24 16:33 anyway, that's wrong 2008-11-24 16:33 above 2008-11-24 16:33 the right expression will be simple like that though 2008-11-24 16:35 the check for no block empty is !(~oldmap & 0x7777) 2008-11-24 16:35 7 is BLOCK_DIRTY3? 2008-11-24 16:36 the check for no block empty is except the one we just updated is: !(~(oldmap | 0x0020) & 0x7777) 2008-11-24 16:36 7 is the state field mask 2008-11-24 16:36 8 possible states 2008-11-24 16:36 and 1 lock bit to make it 4 bits/block 2008-11-24 16:36 ah, 4bit 2008-11-24 16:37 fit perfectly in 32 bits, giving up to 8 blocks/page and up to 8 block states 2008-11-24 16:37 nice or what? 2008-11-24 16:38 it seems nice 2008-11-24 16:38 test for 4blocks? 
2008-11-24 16:38 better check for no block empty except the one we just updated is: !(~oldmap & 0x7707) 2008-11-24 16:39 test four blocks at once, yes 2008-11-24 16:39 and it's atomic 2008-11-24 16:39 looking at the old value before our update 2008-11-24 16:39 atomic with respect to the transition 2008-11-24 16:42 so we write: oldmap = change_block_state(handle, 1, BLOCK_CLEAN); if (!(~oldmap & 0x7707)) { SetPageUptodate(page); unlock_page(page); } 2008-11-24 16:42 the 07707 will be calculated generically of course 2008-11-24 16:43 ~oldmap? 2008-11-24 16:43 this tests for transition to "no blocks empty" after completing read of block 1 2008-11-24 16:44 you're right 2008-11-24 16:44 so we write: oldmap = change_block_state(handle, 1, BLOCK_CLEAN); if (!(oldmap & 0x7707)) { SetPageUptodate(page); unlock_page(page); } 2008-11-24 16:44 shorter :) 2008-11-24 16:44 ah 2008-11-24 16:44 and wrong 2008-11-24 16:45 0 is empty? 2008-11-24 16:45 so we write: oldmap = change_block_state(handle, 1, BLOCK_CLEAN); if (!(oldmap & 0x7707) == 0x7707) { SetPageUptodate(page); unlock_page(page); } 2008-11-24 16:45 yes, 0 is empty 2008-11-24 16:46 !(oldmap & 0x7707)? 
2008-11-24 16:46 no empty 2008-11-24 16:46 I'm being stupid today ;) 2008-11-24 16:46 there has to be a closed form bitmap expression 2008-11-24 16:46 well, time to take a short break 2008-11-24 16:47 ok 2008-11-24 16:51 !((~oldmap & 0x7707) + 0x1111) & 0x8888) 2008-11-24 16:52 changes the zeros into sevens 2008-11-24 16:52 add one to each field, only the sevens carry into the bit 3 position 2008-11-24 16:52 then check for any bit threes 2008-11-24 16:52 sick 2008-11-24 16:56 when bushman sees this he will call it demo coding ;) 2008-11-24 16:57 saves a bit fat loop from endio for block_read_full_page 2008-11-24 16:57 and we are also going to be using the bio endio directly, not chaining up though a bh_endio the way block read does 2008-11-24 16:57 it's all going to be a lot less code 2008-11-24 16:59 oh 2008-11-24 16:59 what is above? 2008-11-24 16:59 um... 2008-11-24 17:00 it's the test for "no block empty except the one we just updated" 2008-11-24 17:00 in one closed form bit bashing expression instead of a loop 2008-11-24 17:00 across blocks 2008-11-24 17:01 umm... same with !(oldmap & 0x7707)? 2008-11-24 17:02 that test for "any block empty" 2008-11-24 17:04 ah 2008-11-24 17:06 it tests for "any block nonempty" actually, either way it's not enough 2008-11-24 17:12 (oldmap & 0x7707) >= 0x1101? 
2008-11-24 17:12 oops, wrong 2008-11-24 17:12 it's not as easy as it first seems 2008-11-24 17:13 I think my multiple carry strategy works 2008-11-24 17:13 it's a trick from signal processing 2008-11-24 17:14 i see 2008-11-24 17:14 if this code doesn't raise some eyebrows, nothing will ;) 2008-11-24 17:29 testing it now 2008-11-24 17:41 correct, except for a missing open paren: !(((~oldmap & 0x7707) + 0x1111) & 0x8888) 2008-11-24 17:42 "no blocks empty except for the one we just updated" 2008-11-24 17:50 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-24 17:50 good 2008-11-24 17:50 I've added overwite file data ability 2008-11-24 17:50 could you check it? 2008-11-24 17:51 ok 2008-11-24 17:51 and I found some issues 2008-11-24 17:51 set_buffer_dirty is mark_buffer_dirty in kernel 2008-11-24 17:52 and we are missing some set_buffer_dirty on some places 2008-11-24 17:52 right 2008-11-24 17:52 changing set_buffer_dirty to mark_buffer_dirty everywhere including user code is find, or wrap it with an inline 2008-11-24 17:52 either way 2008-11-24 17:53 I guess you did the second one 2008-11-24 17:53 after fixed it, chmod and chown should be easy 2008-11-24 17:53 anything special needed for mkdir? 2008-11-24 17:53 I don't check it yet 2008-11-24 17:54 maybe, it may work if we just add some helpers 2008-11-24 17:54 it might 2008-11-24 17:54 it's the same interface as ext2 2008-11-24 17:55 the ext2 caller of ext2_create_entry can be cut and pasted 2008-11-24 17:55 ah, it needs to create new inode 2008-11-24 17:57 ext2_mkdir 2008-11-24 17:57 should be able to cut and paste 2008-11-24 17:58 looks like it works 2008-11-24 17:58 :) 2008-11-24 17:58 I should have tried 2008-11-24 17:58 ext2_new_inode is need 2008-11-24 17:59 ah, and ext2_make_empty() 2008-11-24 18:00 well, I'll try to add to expand i_size before those 2008-11-24 18:02 this is the fun part 2008-11-24 18:02 hooking up interfaces 2008-11-24 18:03 for expand i_size? 
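The test confirmed at 17:41 can be checked against a plain loop. What follows is a minimal sketch, not tux3 code. It assumes the layout given in the chat: 4 bits per block, the low 3 bits holding the state with 0 meaning empty, bit 3 being the lock bit, and block 1 (bits 4-7) being the block whose read just completed. The function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Closed form from the chat: true when no block other than block 1 is empty. */
int others_nonempty_closed(uint32_t oldmap)
{
	/* ~oldmap turns an empty state (0) into 7; adding 1 to each field
	 * then carries into bit 3 only for those sevens. */
	return !(((~oldmap & 0x7707) + 0x1111) & 0x8888);
}

/* Reference loop over the same four fields, for comparison. */
int others_nonempty_loop(uint32_t oldmap)
{
	for (int block = 0; block < 4; block++) {
		if (block == 1)
			continue;	/* the block we just updated is ignored */
		if (((oldmap >> (4 * block)) & 7) == 0)
			return 0;
	}
	return 1;
}

/* Compare both forms over every 16-bit map; returns the number checked. */
unsigned check_all_maps(void)
{
	unsigned checked = 0;

	for (uint32_t map = 0; map <= 0xffff; map++, checked++)
		assert(others_nonempty_closed(map) == others_nonempty_loop(map));
	return checked;
}
```

Because each masked field is at most 7, adding 1 can never carry from one field into the next, so the four fields are tested independently in a single addition.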
2008-11-24 18:04 ah you're right, I didn't do that 2008-11-24 18:04 it was part of tuxio 2008-11-24 18:04 I thought vfs does that 2008-11-24 18:05 yes, write_end will do if we add block allocation to tux3_get_block 2008-11-24 18:06 yes 2008-11-24 18:08 btw, set_buffer_dirty conflicts with kernel's one 2008-11-24 18:09 so, mark_buffer_dirty or new name may be easy 2008-11-24 18:10 ok, let's change it in tux3/user too 2008-11-24 18:10 ok 2008-11-24 18:11 ACTION will take a break for a hour or two 2008-11-24 18:12 I guess we must only support ino greater than 32 bits on 64 bit arch 2008-11-24 18:12 that, and volume/file size > 16 TB 2008-11-24 18:12 see you 2008-11-24 18:14 + .direct_IO = tux3_direct_IO, :) 2008-11-24 18:16 oh, I still had max_grou_entries set to 8 ;) 2008-11-24 18:16 7 2008-11-24 18:19 hirofumi I ended up with two heads when I pulled 2008-11-24 18:19 hg doesn't like that 2008-11-24 18:20 oh 2008-11-24 18:20 um... 2008-11-24 18:21 if I just clone your repo I will break shapor's mirror 2008-11-24 18:21 this is probably the biggest problem with hg 2008-11-24 18:22 your repo only has one head 2008-11-24 18:22 well, I'll try to see what's happening 2008-11-24 18:22 I can pull a specific revision 2008-11-24 18:22 I'll try that 2008-11-24 18:24 um... 2008-11-24 18:24 abort: pull -r doesn't work for remote repositories yet 2008-11-24 18:25 but I cloned it, so... 2008-11-24 18:25 hg clone http://tux3.org/tux3/; hg pull static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-11-24 18:25 it seems to work 2008-11-24 18:25 I pull into my working repoo 2008-11-24 18:25 then pull from then into the public one 2008-11-24 18:25 the working repo seems somehow broken 2008-11-24 18:26 working repo has difference with public? 
2008-11-24 18:26 I guess it must 2008-11-24 18:26 it's broken, the public one isn't ;-) 2008-11-24 18:26 i see 2008-11-24 18:27 after a merge, it says it succeeded, but I still have two heads 2008-11-24 18:27 hg bug, or broken by design 2008-11-24 18:27 oh 2008-11-24 18:27 hg view can see two heads? 2008-11-24 18:27 hg heads 2008-11-24 18:27 changeset: 515:2cc3dca083de 2008-11-24 18:27 tag: tip 2008-11-24 18:27 user: OGAWA Hirofumi 2008-11-24 18:27 date: Tue Nov 25 10:45:41 2008 +0900 2008-11-24 18:27 summary: Use #define MAX_GROUP_ENTRIES 255 for kernel 2008-11-24 18:27 changeset: 507:674c296fdc85 2008-11-24 18:28 user: OGAWA Hirofumi 2008-11-24 18:28 date: Mon Nov 24 20:00:59 2008 +0900 2008-11-24 18:28 summary: Tweaks user/kernel/Makefile to build module 2008-11-24 18:28 yes 2008-11-24 18:28 hirofumi@devron (tux3)$ ../../usr/bin/hg heads 2008-11-24 18:28 changeset: 514:2cc3dca083de 2008-11-24 18:28 tag: tip 2008-11-24 18:28 user: OGAWA Hirofumi 2008-11-24 18:28 date: Tue Nov 25 10:45:41 2008 +0900 2008-11-24 18:28 summary: Use #define MAX_GROUP_ENTRIES 255 for kernel 2008-11-24 18:28 but not in the public repo 2008-11-24 18:29 well, I am going to clone the public repo, and we will see what that does to shapor's mirror ;) 2008-11-24 18:30 ah, http://tux3.org/tux3/ was mirror? 2008-11-24 18:30 no, shapor has another mirror 2008-11-24 18:30 that just polls tux3.org/tux3 and pulls from it 2008-11-24 18:31 oh 2008-11-24 18:32 ok, all fixed 2008-11-24 18:32 except for maybe shapor's mirror, let's see what happens 2008-11-24 18:33 ok 2008-11-24 18:33 and I'll ask mercurial folks if they've seen a bug like this 2008-11-24 18:33 pulled to the public repo 2008-11-24 18:34 thanks 2008-11-24 18:35 http://www.bitbucket.org/shapor/tux3/ 2008-11-24 18:35 is mirror? 
2008-11-24 18:36 yes 2008-11-24 18:36 seems to be happy 2008-11-24 18:36 yes 2008-11-24 18:36 that's good to know 2008-11-24 18:36 other ways of fixing this problem are really horrible 2008-11-24 18:37 I can't see why happened this... 2008-11-24 18:39 well, I'll continue a break 2008-11-24 18:46 damm, spelled your name wrong in the post :( 2008-11-24 18:51 no problem, I can rename my name :) 2008-11-24 18:58 hmm, I made one slightly excessive claim 2008-11-24 18:58 change state and unlock in one operation has a slight problem: we have to go into the innards of wake_up_bit to do the wakeup 2008-11-24 18:59 on no, wait 2008-11-24 18:59 it works 2008-11-24 18:59 :p 2008-11-24 18:59 because the bit clear is separate from the wake_up_bit, how useful 2008-11-24 19:57 -!- flips(~phillips@phunq.net) has joined #tux3 2008-11-24 20:59 write_inode seems to work 2008-11-24 21:05 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-24 21:07 hirofumi :) 2008-11-24 21:08 so, we can work for i_size 2008-11-24 21:08 hi 2008-11-24 21:09 write_inode is pretty complex in tux3 2008-11-24 21:09 I'm glad it's working 2008-11-24 21:09 entirely debugged in userspace 2008-11-24 21:10 eh, it's 5 lines, iirc 2008-11-24 21:10 int tux3_write_inode(struct inode *inode, int do_sync) 2008-11-24 21:10 { 2008-11-24 21:10 return save_inode(inode); 2008-11-24 21:10 } 2008-11-24 21:10 :) 2008-11-24 21:10 I meant save_inode then ;) 2008-11-24 21:10 calls all the attribute packing stuff 2008-11-24 21:10 yes 2008-11-24 21:11 hmm, I need to audit that 2008-11-24 21:11 I tweaked mark_buffer_dirty stuff though 2008-11-24 21:12 I'm not adding comment to patch yet... 
2008-11-24 21:13 oh, save_inode looks pretty good 2008-11-24 21:13 yes 2008-11-24 21:13 it's all just store_attrs 2008-11-24 21:13 which is doing the right thing, it looks like 2008-11-24 21:14 easy to forget how much work that was ;) 2008-11-24 21:14 but it became a nice model 2008-11-24 21:14 I need to bring the filemap stuff up to the same quality 2008-11-24 21:15 oh, it's nice 2008-11-24 21:17 seg[1000] and dwalk stuff seems to be complex 2008-11-24 21:17 set[1000] is obviously a placeholder 2008-11-24 21:18 it's a kmalloc now, right? 2008-11-24 21:18 it's seg[10] now 2008-11-24 21:18 and checking limit of seg 2008-11-24 21:19 :) 2008-11-24 21:19 I was lazy 2008-11-24 21:19 well 2008-11-24 21:19 what is the biggest object that's ok for on stack these days? 2008-11-24 21:20 even 80 bytes seems to attract attention 2008-11-24 21:20 limit is 4k or 8k 2008-11-24 21:20 iirc, by default is 4k 2008-11-24 21:21 so 80 bytes is probably ok 2008-11-24 21:21 -!- RazvanM(~RazvanM@pool-96-244-34-153.bltmmd.east.verizon.net) has joined #tux3 2008-11-24 21:21 but suppose we wanted it to be longer 2008-11-24 21:21 I suppose... have a small one on stack 2008-11-24 21:21 and kmalloc/kfree for the case where a big one is needed 2008-11-24 21:21 because then the cost of the kmalloc is not important 2008-11-24 21:21 yes 2008-11-24 21:22 writing block_read_endio now... it walks across a bio and unlocks each block 2008-11-24 21:23 with one atomic operation per block, but... 
it could actually be just one atomic operation for all the blocks on a page 2008-11-24 21:23 but it is probably not a good use of time to optimize for the 1K blocksize case 2008-11-24 21:24 i see 2008-11-24 21:24 sounds good 2008-11-24 21:26 unsigned checkstates = 0x77777777 & ~(7 << handle_shift(handle)); 2008-11-24 21:27 if (!(((~oldmap & checkstates) + 0x11111111) & 0x88888888)) 2008-11-24 21:27 ugly magic numbers ;) 2008-11-24 21:28 or 0x88888888 & ~(8 << shift) 2008-11-24 21:30 well 2008-11-24 21:30 don't need, because we masked away the state for the target handle 2008-11-24 21:30 masked away the inverted state 2008-11-24 21:31 so it can't overflow into the lock bit 2008-11-24 21:31 but you get the idea obviously 2008-11-24 21:34 if (!(((~(oldmap | (7 << shift)) & 0x77777777) + 0x11111111) & 0x88888888)) 2008-11-24 21:34 :) 2008-11-24 21:34 right :) 2008-11-24 21:35 so, mask = 0x77777777 >> (blockbits - 9) 2008-11-24 21:36 no 2008-11-24 21:36 so, mask = 0x77777777 >> ((blockbits - 9) * 4) 2008-11-24 21:36 9 ? 2008-11-24 21:37 ah 2008-11-24 21:37 based on 512blocks 2008-11-24 21:37 based on 512 blocksize 2008-11-24 21:38 you're right, we need to make sure zeros in nonexistent fields don't get through 2008-11-24 21:38 by fixing the 8888 mask maybe 2008-11-24 21:38 ah 2008-11-24 21:39 fixing it in the 77777 mask is better 2008-11-24 21:39 because that already needs special treatment to ignore the target block field 2008-11-24 21:43 hirofumi, what's broken in writing at the moment? 2008-11-24 21:43 handle->statemap = 0x77770000 2008-11-24 21:44 initialize of statemap 2008-11-24 21:44 :) 2008-11-24 21:44 0x7777 part is unused blocks 2008-11-24 21:44 sure 2008-11-24 21:44 that's a real hack :D 2008-11-24 21:44 :) 2008-11-24 21:48 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-11-24 21:49 could you check it? 2008-11-24 21:49 pulling 2008-11-24 21:49 what do you think about move mark_buffer_dirty() stuff 2008-11-24 21:50 in the latest changesets? 
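The generic form discussed between 21:26 and 21:39 might fit together roughly as below. This is a sketch under stated assumptions: C11 atomics stand in for the kernel's cmpxchg, `change_block_state` takes a pointer to the map and a block index rather than a handle, and `STATE_CLEAN = 1` is an invented value (only empty = 0 comes from the chat). Nonexistent fields are handled by narrowing the 0x7777... mask, one of the two options mentioned; the other is pre-filling those fields in the map itself. The mask is built here from the block count rather than from blockbits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

enum { STATE_EMPTY = 0, STATE_CLEAN = 1 };	/* STATE_CLEAN value is assumed */

/* Atomically set one block's 3-bit state; return the map as it was before. */
uint32_t change_block_state(_Atomic uint32_t *map, unsigned block, unsigned state)
{
	unsigned shift = 4 * block;
	uint32_t old = atomic_load(map), next;

	assert(block < 8 && state < 8);
	do {
		next = (old & ~((uint32_t)7 << shift)) | ((uint32_t)state << shift);
	} while (!atomic_compare_exchange_weak(map, &old, next));
	return old;
}

/* True if, in oldmap, every existing block except `block` was non-empty. */
int last_block_uptodate(uint32_t oldmap, unsigned block, unsigned blocks_per_page)
{
	/* State mask covering only the fields that exist on this page. */
	uint32_t fields = blocks_per_page >= 8 ? 0x77777777
		: (((uint32_t)1 << (4 * blocks_per_page)) - 1) & 0x77777777;
	/* Drop the target block so its old (empty) state is not counted. */
	uint32_t check = fields & ~((uint32_t)7 << (4 * block));

	return !(((~oldmap & check) + 0x11111111) & 0x88888888);
}
```

Testing the value returned by the state change, rather than re-reading the map, is what makes the "last block" decision race-free: exactly one completion sees a map in which all other blocks are already non-empty.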
2008-11-24 21:50 yes 2008-11-24 21:51 rename set to mark is fine 2008-11-24 21:51 "Call mark_buffer_dirty() after modification for xxxx" patches 2008-11-24 21:51 that's fine for now 2008-11-24 21:51 we will do it differently later 2008-11-24 21:51 set the state at create time 2008-11-24 21:52 one less atomic op per create 2008-11-24 21:52 oh 2008-11-24 21:52 new_leaf sets buffer as dirty? 2008-11-24 21:52 it should 2008-11-24 21:53 it doesn't now, and that's ok 2008-11-24 21:53 we can clean it up when the block handle stuff goes in 2008-11-24 21:53 um... 2008-11-24 21:54 so, write_inode was pretty simple :) 2008-11-24 21:54 yes 2008-11-24 21:54 and it seems to work fine 2008-11-24 21:54 now, we can chmod at least 2008-11-24 21:54 probably, chown too 2008-11-24 21:56 btw, why can handle stuff set dirty in new_leaf? 2008-11-24 21:58 any new block is created in state dirty 2008-11-24 21:58 I'm not sure what you're asking 2008-11-24 21:58 oh 2008-11-24 21:59 well, we have to return a handle for it 2008-11-24 21:59 so we will just init that handle to dirty 2008-11-24 21:59 ACTION thinks he still didn't answer the question 2008-11-24 21:59 ah, current problem is dirty bit clear by async flusher 2008-11-24 22:00 block IO library clear 2008-11-24 22:01 so, if we set it in new_leaf, flusher may write it out before modify 2008-11-24 22:01 I think 2008-11-24 22:02 the dirty bit is cleared by async flusher 2008-11-24 22:03 I see, well we better get rid of that flusher and replace it with our own that does the right thing ;) 2008-11-24 22:03 I'm looking forward to doing sync/flush properly 2008-11-24 22:03 the default scheme is so crazy 2008-11-24 22:03 ah, i see 2008-11-24 22:04 you probably want it to work properly, now ;) 2008-11-24 22:04 yes 2008-11-24 22:04 well, if you fix all the bugs, what will there be left for anybody else to do? ;) 2008-11-24 22:04 :) 2008-11-24 22:05 anyway, it's fine as you have it 2008-11-24 22:05 any reason not to pull? 
2008-11-24 22:05 and we will change it after atomic commit? 2008-11-24 22:05 yes 2008-11-24 22:05 i see 2008-11-24 22:05 could you pull it? 2008-11-24 22:05 yes 2008-11-24 22:05 for now, we are just trying to be like ext2 2008-11-24 22:06 no safety 2008-11-24 22:06 yes 2008-11-24 22:06 I'm thinking all should be fixed by atomic commit 2008-11-24 22:07 make broke on this pull 2008-11-24 22:07 dleaf.c 2008-11-24 22:08 oh 2008-11-24 22:08 /usr/lib/gcc/i486-linux-gnu/4.1.2/include/stddef.h:214: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'typedef' 2008-11-24 22:08 In file included from /usr/include/_G_config.h:44, 2008-11-24 22:08 from /usr/include/libio.h:32, 2008-11-24 22:08 from /usr/include/stdio.h:72, 2008-11-24 22:08 from dleaf.c:11: 2008-11-24 22:08 /usr/include/gconv.h:72: error: expected declaration specifiers or '...' before 'size_t' 2008-11-24 22:08 some header breakage 2008-11-24 22:08 ah, compile error 2008-11-24 22:08 right, hg was fine 2008-11-24 22:09 userland? 2008-11-24 22:09 it seems I can compile it 2008-11-24 22:09 I'll check more 2008-11-24 22:13 11 is the first line in dleaf.c 2008-11-24 22:13 oh, dleaf.c was not changed 2008-11-24 22:13 I know 2008-11-24 22:14 so... 
strange 2008-11-24 22:14 must be the header files then ;-) 2008-11-24 22:14 the header file is stdio.h 2008-11-24 22:16 it was me ;) 2008-11-24 22:16 jeez 2008-11-24 22:17 I typed an x as the very first character in the file 2008-11-24 22:17 oh 2008-11-24 22:17 got hundreds of lines of errors, none of which was the slightest bit helpful 2008-11-24 22:18 lol 2008-11-24 22:20 pulled 2008-11-24 22:20 thanks 2008-11-24 22:21 maze, this block handle stuff is mindbendingly complex and fun 2008-11-24 22:21 yup 2008-11-24 22:22 this _looong_ weekend I plan on taking another junkfs hacking spree 2008-11-24 22:22 my wireless is now working ;-) 2008-11-24 22:22 still have to patch a swiotlb patch to linux & co 2008-11-24 22:22 meant linus 2008-11-24 22:22 maybe cut n paste the block handles stuff into it and take it for a spin 2008-11-24 22:23 and rewrite it in half the size ;) 2008-11-24 22:23 I think we're on to something 2008-11-24 22:24 having the block read endio page unlock code shrink by a factor of eight is a good sign 2008-11-24 22:24 definitely, I'm liking the way this is looking 2008-11-24 22:24 still have to work a little bit more on understanding all these locking/atomic primitives etc 2008-11-24 22:24 you and everybody else 2008-11-24 22:25 there isn't anybody who has the complete picture any more 2008-11-24 22:25 and want to code up dcas cmpxchg (what should it be called?) 2008-11-24 22:25 it's gotten a little out of hand 2008-11-24 22:25 whose cmpxchg? 2008-11-24 22:26 double pointer version 2008-11-24 22:26 ok, I finally got a little clue about what the barriers are for 2008-11-24 22:26 found a cute 'bug?' in the wireless driver or somewhere... 
2008-11-24 22:27 no idea how the hell it happens 2008-11-24 22:27 when you put in a barrier before a state change, it's to ensure all the stuff you just set for the state is stored to all processor's memory before the state changes 2008-11-24 22:27 like, when you unlock, the thing you unlocked better have all its data current for all cpus 2008-11-24 22:27 ping -s 42912 remote machine works, any value between 42913-65507 doesn't (does work fine over wired) 2008-11-24 22:27 oh, that - of course ;-) 2008-11-24 22:28 so the barrier before the block clear is for the _stuff before the clear_ 2008-11-24 22:28 right and the one after the grab lock 2008-11-24 22:28 not the bit clear itself 2008-11-24 22:28 the barrier after makes the bit clear visible 2008-11-24 22:28 is to make sure you have the newest data fetched only after getting the lock 2008-11-24 22:28 right 2008-11-24 22:28 well, I didn't know that basic thing until today 2008-11-24 22:29 you need to grab_lock; barrier_to_prevent_reads_from_escaping_before_the_grab_lock; do something; barrier_to_prevent_writes_from_escaping_beyond_the_end_lock; end_lock 2008-11-24 22:30 it gets more complex on alpha though 2008-11-24 22:30 well the _compiler_ barrier prevents the write reordering 2008-11-24 22:31 compiler barrier? 2008-11-24 22:31 it also has to ensure that the state protected by the lock is written before the lock state is changed 2008-11-24 22:31 alpha: if you can make your multithreaded code run on it - it's correct ;-) 2008-11-24 22:31 compiler barrier... prevent gcc from reordering things 2008-11-24 22:31 right, thought you meant something in particular 2008-11-24 22:32 did I miss something in my g_l, b_t_p_r_f_e_b_t_g_l; d s; b_t_p_w_f_e_b_t_e_l; e_l above? 
2008-11-24 22:32 not sure about the compiler reordering 2008-11-24 22:32 I'm sure even without compiler reordering the barriers are needed 2008-11-24 22:32 since atomic ops are not barriers in and of themselves by themselves 2008-11-24 22:33 for many if not most cpus 2008-11-24 22:33 I can't believe you typed that acronym ;) 2008-11-24 22:33 ;-) 2008-11-24 22:33 yes, it's the same stated another way 2008-11-24 22:34 "writes escaping" -> "stores not forced to execute 2008-11-24 22:34 " 2008-11-24 22:34 you need to both prevent the compiler from reordering stuff, and or storing junk in registers which it has already fetched (ie. cached) and you need to make sure the cpu doesn't reorder it afterwards 2008-11-24 22:34 depending on the cpu 2008-11-24 22:34 well... really only the order of execution is important 2008-11-24 22:34 but yeah 2008-11-24 22:35 getting all writes to commit now is one way to implement the before end_lock barrier 2008-11-24 22:35 well I'll go back to non barrier stuff for now 2008-11-24 22:35 it gets more fun when you start having both read/write/both memory barriers and io barriers ;-) 2008-11-24 22:36 we do have that 2008-11-24 22:36 at the same time, for performance, you want to have the minimum barrier appropriate for the task 2008-11-24 22:36 "life on the edge" 2008-11-24 22:37 fine out what "less than the minimum barrier" actually is by getting floods of oopses on lkml shortly after distributing a kernel to 5 million people 2008-11-24 22:39 smp_read_barrier_depends(): forces subsequent operations that depend on prior operations to be ordered. This primitive is a no-op on all platforms except Alpha. 
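The grab_lock / barrier / do something / barrier / end_lock sequence from 22:29 can be illustrated in portable C. This sketch swaps in C11 acquire and release orderings for the kernel's barrier primitives, so it shows the idea rather than the kernel API; the lock variable and the protected pair are invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_flag demo_lock = ATOMIC_FLAG_INIT;
static int first, second;	/* invariant while unlocked: first == second */

void grab_lock(void)
{
	/* Acquire: loads and stores after this point cannot be
	 * reordered before the lock is taken. */
	while (atomic_flag_test_and_set_explicit(&demo_lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

void end_lock(void)
{
	/* Release: loads and stores before this point cannot be
	 * reordered after the lock is dropped. */
	atomic_flag_clear_explicit(&demo_lock, memory_order_release);
}

/* Update the protected pair and return its new value. */
int locked_increment(void)
{
	int value;

	grab_lock();
	assert(first == second);	/* protected data must be consistent here */
	first++;
	second++;
	value = first;
	end_lock();
	return value;
}
```

With these orderings the atomic operation and the barrier are a single step, so there is no separate barrier call to forget; accesses from outside the critical section may still move into it, but nothing inside may leak out.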
2008-11-24 22:41 :) 2008-11-24 22:41 let's break alpha and see if anyone notices 2008-11-24 22:41 something tells me, alphas will stick with their old kernels 2008-11-24 22:41 in fact, I wonder why it's even kept in mainline 2008-11-24 22:41 I believe alpha is used exclusively for lock testing ;-) 2008-11-24 22:41 :) 2008-11-24 22:41 if it works on alpha it's golden 2008-11-24 22:41 http://www.linuxjournal.com/article/8211 2008-11-24 22:42 take a look at the table of cpu peculiarities 2008-11-24 22:42 + alpha is just cool 2008-11-24 22:42 really arch characteristics 2008-11-24 22:42 nice article 2008-11-24 22:43 Alpha in effect can fetch the data pointed to before it fetches the pointer itself 2008-11-24 22:43 the alpha guys went on to give us k8 and hypertransport 2008-11-24 22:43 hmm 2008-11-24 22:43 that one is just golden ;-) 2008-11-24 22:43 sounds like magic 2008-11-24 22:43 if it's already fetched for some other reason I believe is the answer 2008-11-24 22:58 actually the barrier before end_lock shouldn't allow reads to escape past it either, since if it does we could end up reading no-longer locked data 2008-11-24 22:58 same thing with the barrier after grab_lock 2008-11-24 22:59 so they behave more like brackets which don't permit memory operations out of the bounds of the lock, but permit stuff to enter from outside into the locked area 2008-11-24 23:08 thinking about getting hackfs to compile ;) 2008-11-24 23:09 gets in the way of barrier thoughts 2008-11-24 23:09 a barrier to barriers 2008-11-24 23:10 heh 2008-11-24 23:10 just looking through the barriers doc 2008-11-24 23:10 some interesting comments there 2008-11-24 23:11 Aside: In the case of data dependencies, the compiler would be expected to 2008-11-24 23:11 issue the loads in the correct order (eg. 
`a[b]` would have to load the value 2008-11-24 23:11 of b before loading a[b]), however there is no guarantee in the C specification 2008-11-24 23:11 that the compiler may not speculate the value of b (eg. is equal to 1) and load 2008-11-24 23:11 a before b (eg. tmp = a[1]; if (b != 1) tmp = a[b]; ). There is also the 2008-11-24 23:11 problem of a compiler reloading b after having loaded a[b], thus having a newer 2008-11-24 23:11 copy of b than a[b]. A consensus has not yet been reached about these problems... 2008-11-24 23:12 heh 2008-11-24 23:12 murkiness abounds 2008-11-24 23:13 in general there are no good ways to really tell the compiler what you mean 2008-11-24 23:13 murkiness abounds 2008-11-24 23:13 whoops 2008-11-24 23:13 ie. the language spec really pretty much ignores multi-threading issues 2008-11-24 23:17 #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x)) 2008-11-24 23:17 ? 2008-11-24 23:23 just a macro to try to force the compiler to do the right thing 2008-11-24 23:23 basically abusing volatile 2008-11-24 23:31 block_read_endio is the first bio endio I ever wrote that doesn't need any bio->private 2008-11-24 23:31 because all the completion information is carried on the bio pages 2008-11-24 23:33 well, block_read_endio is mostly there 2008-11-24 23:34 hackfs_readpage has a long way to go 2008-11-24 23:35 http://lxr.linux.no/linux+v2.6.27.5/fs/buffer.c#L2151 <- something like this 2008-11-24 23:47 The Alpha defines the Linux kernel's memory barrier model. 2008-11-24 23:47 quote from memory-barriers.txt 2008-11-24 23:49 hey all 2008-11-24 23:51 well, without alpha we'd have no barrier model then, would we 2008-11-24 23:51 so we have to keep it 2008-11-24 23:51 it's a question of keeping our identity 2008-11-24 23:52 hi pranith 2008-11-24 23:53 flips: hello 2008-11-24 23:53 hows it going?
2008-11-24 23:53 lots of progress, largely because of hirofumi 2008-11-24 23:53 yeah, hirofumi seems to have kicked up a storm :) 2008-11-24 23:53 grinding out some potential nice new block handling stuff here 2008-11-24 23:54 if you want to emulate anyone as a kernel hacker, emulate him 2008-11-24 23:54 :) 2008-11-25 00:02 agreed - he rocks 2008-11-25 00:07 I'm inclined to always write bio_alloc(__GFP_NOFAIL, vecs) and not handle bio allocation failures 2008-11-25 00:07 there should be no way that bio allocation can fail 2008-11-25 00:08 I'll do that 2008-11-25 00:08 heh 2008-11-25 00:08 very inappropriate comment comes to mind 2008-11-25 00:08 just how inappropriate? 2008-11-25 00:09 some klucking sounds ;-) 2008-11-25 00:09 we basically say to the little bio "get out there and come back with that data or don't come back!" 2008-11-25 00:09 MaZe: i think you can say it... flips wont mind ;) 2008-11-25 00:09 but, yeah, I know - error handling is hard 2008-11-25 00:09 it's not that 2008-11-25 00:10 it's just that it's conceptually wrong for bio allocation to fail 2008-11-25 00:10 it would men our whole cache management model is wrong 2008-11-25 00:10 you're saying it isn't? 2008-11-25 00:10 it has been in the past 2008-11-25 00:10 well... 2008-11-25 00:11 but now you can count on the kernel making forward progress in terms of cleaning cache 2008-11-25 00:11 with that reasoning, we should never run out of ram in kernel... 2008-11-25 00:11 never stop making progress 2008-11-25 00:11 what if the cache is already empty? 2008-11-25 00:11 where progress is defined as "move that from here to there" 2008-11-25 00:11 that's a normal state in any unix 2008-11-25 00:12 then we evict clean pages and clean dirty ones 2008-11-25 00:12 it should never be unable to obtain cache short of a genuine memory leak 2008-11-25 00:12 or in-kernel oversubscription bug 2008-11-25 00:13 I'm assuming here we're not doing stupid stuff like file-backed swap? 
2008-11-25 00:13 on a journaled filesystem ;-) 2008-11-25 00:13 there's no reason that should not work 2008-11-25 00:13 except we're too lame to make it work 2008-11-25 00:13 there's no recursion 2008-11-25 00:14 what if the block device we could write stuff back to is currently unavailable? network issues? 2008-11-25 00:14 filesystems do not eat anon memory, the stuff that gets written to swap 2008-11-25 00:14 then you have an issue indeed 2008-11-25 00:14 that's exteme 2008-11-25 00:14 we ought to be able to force unmount 2008-11-25 00:15 it's just temporary... it could be a few dozen seconds blip as network connections reconverge 2008-11-25 00:15 there's nothing stopping us except lameness, big iron unixes implement it 2008-11-25 00:15 that has little to do with the cache reclaim cycle 2008-11-25 00:15 it's like a closed circulation system, it ought to keep circulating 2008-11-25 00:15 if it doesn't, it's a bug 2008-11-25 00:15 hmm 2008-11-25 00:16 very commonly misunderstood, this 2008-11-25 00:16 I'm a little worried that the way deltas work there's some room for ram issues, unless the appropriate amount of memory is kept preallocated or something 2008-11-25 00:17 there's the concept of keeping enough of a reserve so that the cache flushing mechanisms can work 2008-11-25 00:17 that's build into linux 2008-11-25 00:17 right 2008-11-25 00:18 but will tux3 ever have to make use of them? 
2008-11-25 00:18 hmm 2008-11-25 00:18 in tux3, we only worry about our reserve when we're called by the memory allocator to flush out cache 2008-11-25 00:18 I guess we just have to be careful elsewhere 2008-11-25 00:18 tux3 makes use of them in the above situation 2008-11-25 00:18 generally, the vm goes to work when cache hits zero, other than the reserve 2008-11-25 00:18 we want to be careful, that the amount of dirty state in memory doesn't grow faster than we're writing it out 2008-11-25 00:19 we don't ahve to be 2008-11-25 00:19 when we try to do a malloc, it will block 2008-11-25 00:19 that's what controlls that 2008-11-25 00:19 kmalloc 2008-11-25 00:19 or alloc_pages 2008-11-25 00:19 so long as that happens from user context - sure 2008-11-25 00:19 hmm 2008-11-25 00:19 I guess that does make sense 2008-11-25 00:19 all we worry about is that our block device queue doesn't grow enormously 2008-11-25 00:20 well 2008-11-25 00:20 we don't even worry about that 2008-11-25 00:20 generally, we try to dirty as much cache as we can, because that gets the job done fastest for the suer 2008-11-25 00:22 right 2008-11-25 00:31 "suer" is a funny typo 2008-11-25 00:31 if you sell software that doesn't work as advertised, user "users" become "suers" 2008-11-25 00:31 as microsoft is finding out right now 2008-11-25 00:32 heh 2008-11-25 00:32 well,I must sleep and I didn't get hackfs_readpage written once again 2008-11-25 00:32 maybe tomorrow 2008-11-25 01:41 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-25 08:05 well... 2008-11-25 08:05 today is the day to announce tux3 in kernel 2008-11-25 08:06 oh 2008-11-25 08:06 you're up ;) 2008-11-25 08:06 yes 2008-11-25 08:06 I thought it was too early? 2008-11-25 08:06 heyy 2008-11-25 08:06 :) 2008-11-25 08:06 or you're still up 2008-11-25 08:06 ? 2008-11-25 08:06 1:06 in japan 2008-11-25 08:06 flips: when is the post going to come? 
2008-11-25 08:06 hirofumi, if you want, we can wait another week or so 2008-11-25 08:07 create and write are almost done 2008-11-25 08:07 should we wait a couple days? 2008-11-25 08:07 or a few hours :) 2008-11-25 08:07 ok, I'll write the announcement so it's ready ;) 2008-11-25 08:08 lol 2008-11-25 08:08 during my sleep I did solve the problem of how to structure the new block interface 2008-11-25 08:08 and what linux's future block library should look like 2008-11-25 08:08 it's pretty obvious actually 2008-11-25 08:08 oh, good 2008-11-25 08:09 we can approach it in two steps 2008-11-25 08:09 the first step is not a big change from the current model 2008-11-25 08:09 the filesystem provides a "get_extents" interface 2008-11-25 08:09 instead of "get_block" 2008-11-25 08:10 vfs tells it the logical range to map and gives it an empty array to fill in with extents 2008-11-25 08:10 the filesystem fills in the array and says how many it filled in 2008-11-25 08:10 really simple, and just like the segs[] filemap.c has 2008-11-25 08:11 and it fits with the current vfs interface pretty well 2008-11-25 08:11 because read/write_page just ask for a logical range of 1 page to be filled in 2008-11-25 08:12 yes 2008-11-25 08:13 so the big win here is that the filesystem does a probe into the index structure once at the beginning of the range, and fills in the segs by efficient incremental iteration 2008-11-25 08:13 but win 2008-11-25 08:13 big win 2008-11-25 08:13 after that there are more gains to be had 2008-11-25 08:14 by letting the filesystem do the grab_cache_page/find_get_page instead of the vfs 2008-11-25 08:14 I'm still thinking about how that will look 2008-11-25 08:15 but for example, even with an improved get_extents interface, the interface from generic_write is still wrong for deferred allocation 2008-11-25 08:15 yes 2008-11-25 08:15 it forces the filesystem to basically just remember "vfs told me to allocate this logical range" and do it later 2008-11-25 08:16 ext4 
has some hack for it though 2008-11-25 08:16 right 2008-11-25 08:17 so we'd like to make it not a hack over the long run 2008-11-25 08:17 yes 2008-11-25 08:17 we will know what we really want very well, a few months from now 2008-11-25 08:17 for now, it's clear how to use the get_extents idea 2008-11-25 08:17 we'll have a prototype in hackfs very soon, we can start improving 2008-11-25 08:18 I'm not sure, after delalloc, whether we need get_extents or not 2008-11-25 08:19 well, get_block is a very constricting interface 2008-11-25 08:19 but right 2008-11-25 08:19 it's more useful for read 2008-11-25 08:19 ah 2008-11-25 08:19 because we can fix write, just by basically ignoring ->writepage 2008-11-25 08:20 read is way more common than write, get_extents will really show in the cpu side 2008-11-25 08:20 so, get_extents may not allocate blocks finally 2008-11-25 08:20 by the way, the symmetry between ->readpage and ->writepage is a little misleading 2008-11-25 08:21 because ->writepage can be ignored, but ->readpage can't 2008-11-25 08:21 ignore? 2008-11-25 08:21 get_extents is used to launch bio transfers, so it has to actually assign the physical locations 2008-11-25 08:22 yes, when we get a ->writepage we don't actually have to write anything, we only have to write when we get a sync 2008-11-25 08:23 but when we get a ->readpage, we _have_ to do the read, or tasks will block 2008-11-25 08:23 ah 2008-11-25 08:24 ->writepage just tell "please flush this page soon or later"? 
2008-11-25 08:24 yes, in a modern filesystem it's really advisory 2008-11-25 08:24 and ->sync_page is the real "writepage" 2008-11-25 08:24 i see 2008-11-25 08:25 we also see this asymmetry in buffer IO 2008-11-25 08:26 we have bread to read, which is synchronous, but writing a buffer is just mark_buffer_dirty and the write is asynchronous 2008-11-25 08:26 anyway, I'm getting a bit too philisophical 2008-11-25 08:27 the point is, ->writepage will hardly need to do anything at all, maybe nothing 2008-11-25 08:31 probably, issue is memory balance? 2008-11-25 08:31 right, ->writepage is a poor interface for communicating that 2008-11-25 08:31 I meant: but ->writepage is a poor interface for communicating that 2008-11-25 08:32 it's important to communicate, and we don't have a good interface 2008-11-25 08:32 we should have ->shrinkcache 2008-11-25 08:33 maybe, we just want to know global vm state in writepage? 2008-11-25 08:33 yes, which we can know, sort of 2008-11-25 08:33 and know block congestion 2008-11-25 08:33 but it doesn't make design sense to check that on every page write 2008-11-25 08:33 so it will work, but it's a poor interface 2008-11-25 08:33 ah 2008-11-25 08:34 there are other broken aspects to the current interface 2008-11-25 08:35 the vmm knows the lru order, but it doesn't know how pages on the lru are related to each other 2008-11-25 08:35 it makes very little sense for it to evict clean pages randomly 2008-11-25 08:35 because it's inefficient for the filesystem to read them back 2008-11-25 08:36 has to go all over the disk, probing lots of different infrastructures 2008-11-25 08:36 writepages has writeback_control... 
2008-11-25 08:36 right 2008-11-25 08:36 I'm talking about two different things, actually: 1) what we will do now 2) what we wish it was like 2008-11-25 08:37 the "wish" part is always more fun 2008-11-25 08:37 yes 2008-11-25 08:38 then there's the problem with dirty pages on the lru: each of our dirty pages is either already scheduled to be written, or it is pinned, so having dirty pages on the lru is basically useless 2008-11-25 08:38 and costly for scanning 2008-11-25 08:38 we should really lru the inodes 2008-11-25 08:39 and the filesystem can do that, the vmm does not have to 2008-11-25 08:39 the vmm is trying to be helpful, but a lot of the time it is just doing useless work 2008-11-25 08:40 well 2008-11-25 08:40 or separate it to write_ready_lru and prepare_list? 2008-11-25 08:40 it's too speculative for right now ;) 2008-11-25 08:41 something 2008-11-25 08:41 :) 2008-11-25 08:41 let's think about how it should be, we have an excellent chance to experiment 2008-11-25 08:41 once we get through basic optimization 2008-11-25 08:41 we can measure the effect of interface changes 2008-11-25 08:41 btw, I'm trying to understand filemap_extent_io 2008-11-25 08:42 ok 2008-11-25 08:42 now, current issue is I can't update dtree 2008-11-25 08:42 so 2008-11-25 08:42 dleaf? 2008-11-25 08:43 ah, yes 2008-11-25 08:43 the dwalk interface? 2008-11-25 08:43 we split dleaf, then update dtree in filemap_extent_io 2008-11-25 08:44 http://tux3.org/tux3?f=ad3bfa706437;file=user/kernel/filemap.c 2008-11-25 08:44 yes, I'm looking at it 2008-11-25 08:44 line 176 2008-11-25 08:44 I had the cursor on it ;) 2008-11-25 08:44 hg doesn't have link for line 2008-11-25 08:44 it should, indeed 2008-11-25 08:44 let's tell matt 2008-11-25 08:45 it may add new depth? 
2008-11-25 08:45 yes, bigger html, but really useful 2008-11-25 08:45 lets people pass around urls to code 2008-11-25 08:45 oh 2008-11-25 08:45 sorry 2008-11-25 08:46 you meant the dtree ;) 2008-11-25 08:46 yes it can 2008-11-25 08:46 split may add new depth to root? 2008-11-25 08:46 :) 2008-11-25 08:46 -!- pgquiles(~pgquiles@222.Red-88-0-139.dynamicIP.rima-tde.net) has joined #tux3 2008-11-25 08:46 I hope I updated the depth correctly ;) 2008-11-25 08:46 otherwise it's a bug 2008-11-25 08:46 after btree_leaf_split goto retry; 2008-11-25 08:47 but, we don't re-probe() 2008-11-25 08:47 is it ok? 2008-11-25 08:47 right, because we already know the leaf 2008-11-25 08:47 and the path should be updated 2008-11-25 08:47 if not, it's a bug 2008-11-25 08:47 i see 2008-11-25 08:47 not reprobing saves a lot of time on a split 2008-11-25 08:49 http://userweb.kernel.org/~hirofumi/kernel.tar.gz 2008-11-25 08:49 that whole thing from the "probe" to the end of the retry loop will be our tux_get_extents 2008-11-25 08:49 current source 2008-11-25 08:49 more current than what's checked into hg I guess? 2008-11-25 08:50 probably, no 2008-11-25 08:51 I patched to current hg to support write 2008-11-25 08:51 and I didn't update my hg 2008-11-25 08:52 so if you update your hg I can pull, then copy to my kernel and push to git? 2008-11-25 08:52 does that make sense? 2008-11-25 08:53 or should I start pulling from your git directly? 2008-11-25 08:53 no, no 2008-11-25 08:53 it is just temporary source 2008-11-25 08:54 ok, then always pulling from you into hg is the right thing to do for now 2008-11-25 08:54 I want you check it 2008-11-25 08:54 ok 2008-11-25 08:55 if you see where is tux3_get_block wrong, let me know 2008-11-25 08:56 with quick look 2008-11-25 08:56 ok 2008-11-25 08:57 thanks 2008-11-25 08:57 konqueror works very well for browsing your source 2008-11-25 08:57 firefox is pathetic ;) 2008-11-25 08:57 konqueror can read .tar.gz? 
2008-11-25 08:58 it can, but I just unpacked it first 2008-11-25 08:59 let me try browsing the tgz 2008-11-25 08:59 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-25 08:59 yes 2008-11-25 08:59 works perfectly, and fast :) 2008-11-25 08:59 next time I will not untar 2008-11-25 09:00 nice feature 2008-11-25 09:00 I would like it to open the files inside the browser instead of opening an editor though 2008-11-25 09:00 I'm sure it can do it 2008-11-25 09:01 probably, I should put only filemap.c 2008-11-25 09:01 http://userweb.kernel.org/~hirofumi/filemap.c 2008-11-25 09:01 sorry 2008-11-25 09:02 ok, the loop at the end of filemap_extent_io is not present in your get_block, because this only ever processes a single segment 2008-11-25 09:02 right? 2008-11-25 09:02 yes 2008-11-25 09:03 that makes perfect sense 2008-11-25 09:03 if it allocated new segment 2008-11-25 09:03 when I define the get_extents interface then we can reduce the redundancy, and use something more like the original 2008-11-25 09:04 but now "any way it works" is the right way 2008-11-25 09:04 yes 2008-11-25 09:05 ok, a minor detail 2008-11-25 09:05 the get_extents interface will always return all the physical segments needed to fill the logical region 2008-11-25 09:06 block_read_full_page can't always deal with all those segments 2008-11-25 09:07 because of the limitation of the buffer interface 2008-11-25 09:07 as you pointed out a couple days ago 2008-11-25 09:07 yes 2008-11-25 09:07 so what it will do, is just deal with the parts it can 2008-11-25 09:08 yes 2008-11-25 09:08 and block_read_full_page will ask for the logical region for the other parts later 2008-11-25 09:08 in fact, block_read_full_pages always asks for one block at a time 2008-11-25 09:08 hmmm 2008-11-25 09:08 I'm really talking about ->getpages 2008-11-25 09:08 block_read_full_page is 2008-11-25 09:08 Long live beer :) 2008-11-25 09:09 I misspoke ;) 2008-11-25 09:09 mlankhorts, you saw the beer 
on the page? 2008-11-25 09:13 ok, so I'm really talking about the way ->get_block is used from mpage_writepages 2008-11-25 09:13 writepages? 2008-11-25 09:13 yes 2008-11-25 09:13 write side is always one block, iirc 2008-11-25 09:13 and get_block is called from mpage_writepages 2008-11-25 09:13 oh, thanks for telling me 2008-11-25 09:13 I imagined it was a more efficient interface than that 2008-11-25 09:14 ok, let's talk about ->readpages then 2008-11-25 09:14 very different interface 2008-11-25 09:14 yes 2008-11-25 09:14 ugly ;) 2008-11-25 09:14 :) 2008-11-25 09:15 by the way, do you know why it has to take a file as well as an inode? 2008-11-25 09:15 multiple block allocate was not implemented 2008-11-25 09:15 I seem to recall there was an nfs reason for that 2008-11-25 09:15 some awful reason 2008-11-25 09:15 file has something, iirc, um... 2008-11-25 09:16 ah, maybe, readahead state 2008-11-25 09:16 bleah 2008-11-25 09:16 file->f_ra? 2008-11-25 09:16 that's no longer relevant then 2008-11-25 09:16 because we have it through bdi 2008-11-25 09:16 and if that is the only reason, then the file parameter should go away 2008-11-25 09:17 it should be per file? 2008-11-25 09:17 per inode 2008-11-25 09:17 anyway, minor wart 2008-11-25 09:17 but it bothers me every time I see it 2008-11-25 09:18 well, it can be NULL 2008-11-25 09:18 makes it even uglier ;) 2008-11-25 09:18 :) 2008-11-25 09:18 vast majority of users ignore it 2008-11-25 09:18 some time I'll look at which use it, and see if that's bogus 2008-11-25 09:19 anyway, back to mpage_readpages 2008-11-25 09:19 which you're supporting 2008-11-25 09:19 even though we don't really need to, to function :) 2008-11-25 09:20 :) 2008-11-25 09:21 ok, let's look at the ->get_block call from mpage_readpages 2008-11-25 09:22 ok 2008-11-25 09:22 does not try to cross page boundaries? 
2008-11-25 09:23 I think it's trying 2008-11-25 09:24 http://lxr.linux.no/linux+v2.6.27.5/fs/mpage.c#L372 2008-11-25 09:24 it loops for multiple pages 2008-11-25 09:25 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-25 09:25 and if something wrong was happened, stop, then submit_bio and fall back to block_read_full_page 2008-11-25 09:27 http://lxr.linux.no/linux+v2.6.27.5/fs/mpage.c#L232 2008-11-25 09:27 right 2008-11-25 09:27 ->get_block is 2008-11-25 09:27 it remembers its state 2008-11-25 09:28 yes 2008-11-25 09:28 and each get_block call into the fs can return one contiguous mapped region, or one hole, is that right? 2008-11-25 09:28 yes 2008-11-25 09:28 ok 2008-11-25 09:28 so it's clear what our cleaned up interface will look like 2008-11-25 09:29 filemap_extent_io will be refactored to fill in a "segs" vector 2008-11-25 09:29 oh, i see 2008-11-25 09:29 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-25 09:30 and some state? 2008-11-25 09:30 then, if it returns a mapped region _and_ a hole, our tux3_get_block will just place the mapped region into the bh_result 2008-11-25 09:30 and ignore the hole 2008-11-25 09:30 which mpage_readpages will ask for on the next iteration 2008-11-25 09:30 it's slightly wasteful, but the interface is clean 2008-11-25 09:31 yes 2008-11-25 09:31 it's not nearly as wasteful as the fact that we have to probe on every ->get_block 2008-11-25 09:31 yes 2008-11-25 09:31 so that way, we will be able to merge the userspace code and your code, a few days from now 2008-11-25 09:31 very easy I think 2008-11-25 09:32 I will prototype the interface in hackfs 2008-11-25 09:32 i see 2008-11-25 09:32 make sense? 
2008-11-25 09:32 sounds very good 2008-11-25 09:36 for now, you seem to have done exactly what is right 2008-11-25 09:36 ok 2008-11-25 09:37 thanks 2008-11-25 09:37 with 4K blocks, I doubt you will hit the bugs I have made ;) 2008-11-25 09:37 oh, I'm testing on blocksize==4k 2008-11-25 09:37 good :) 2008-11-25 09:38 I'd like to defer bug fixing for a few days and concentrate on making hackfs run the new block handles stuff 2008-11-25 09:38 what happen if I hit a bug? 2008-11-25 09:38 then we will fix it :) 2008-11-25 09:39 of course 2008-11-25 09:39 :) 2008-11-25 09:39 what is behaviour? 2008-11-25 09:39 I mean, fix the bugs in the corner cases, with long extents, rewrites, holes in files... 2008-11-25 09:39 defer that 2008-11-25 09:39 ah 2008-11-25 09:40 rewrites of files with holes is the bug 2008-11-25 09:40 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-25 09:40 not likely to be hit by casual testing 2008-11-25 09:40 current problem seems not to be that case 2008-11-25 09:40 what is the current problem? 2008-11-25 09:41 dtree was not updated 2008-11-25 09:41 ah 2008-11-25 09:41 which is why you were reading that code 2008-11-25 09:41 so let's concentrate on that 2008-11-25 09:42 http://userweb.kernel.org/~hirofumi/tux3.img.dot.png 2008-11-25 09:42 tux3graph after problem happened 2008-11-25 09:43 wow, that is cool 2008-11-25 09:43 don't delete that file, I'm going to link it from the lkml post ;) 2008-11-25 09:43 block == 0 in second extent on ino14_dtree 2008-11-25 09:43 it has bug :) 2008-11-25 09:43 fine 2008-11-25 09:43 that's the point 2008-11-25 09:44 it's a debugging tool 2008-11-25 09:44 ah 2008-11-25 09:44 yes 2008-11-25 09:44 not just a pretty picture 2008-11-25 09:44 with this, I'm thinking I should dump rest of stuff 2008-11-25 09:44 ok, block 0 is a hole 2008-11-25 09:45 there are other ways of getting block == 0 though 2008-11-25 09:45 but, actually it should be 17 2008-11-25 09:45 what was the test? 
2008-11-25 09:45 dd if=/dev/urandom of=/mnt/test.txt bs=4096 count=2 2008-11-25 09:45 I see that the rootdir allocated block 16 2008-11-25 09:46 creating the physical gap between 15 and 17 2008-11-25 09:46 ok, nice 2008-11-25 09:46 http://userweb.kernel.org/~hirofumi/serial.txt 2008-11-25 09:46 it's log of kernel 2008-11-25 09:47 ok 2008-11-25 09:47 log seems good? 2008-11-25 09:48 if so, it may not be the bug of tux3_get_block() 2008-11-25 09:48 492 tux3_get_block: <== inum 14, mapped 1, block 17, size 4096 2008-11-25 09:48 right? 2008-11-25 09:48 yes, it's result 2008-11-25 09:49 so the filemap appears to have done the right thing 2008-11-25 09:49 and I added dleaf_dump() before it 2008-11-25 09:49 yes 2008-11-25 09:49 but 17 is not stored 2008-11-25 09:49 right 2008-11-25 09:51 probably, it is not the bug of tux3_get_block? 2008-11-25 09:51 you dump your numbers in decimal, probably better to dump in hex 2008-11-25 09:51 to match the other debug output 2008-11-25 09:51 so we need to convert in our heads now ;) 2008-11-25 09:52 ah, yes :) 2008-11-25 09:52 probably it is my bug ;) 2008-11-25 09:52 :) 2008-11-25 09:52 so this is really physical block 0x11 2008-11-25 09:52 yes 2008-11-25 09:52 485 tux3_get_block: pack 0x1 => 11/1 2008-11-25 09:53 yes 2008-11-25 09:53 maybe dump the dleaf right there, did you do it? 2008-11-25 09:53 yes 2008-11-25 09:53 it is copy from filemap_extent_io 2008-11-25 09:53 491 0/2: 0 => f/1; 1 => 11/1; 2008-11-25 09:54 so it's there in the dleaf 2008-11-25 09:54 looks perfect 2008-11-25 09:54 yes, it is dleaf_dump 2008-11-25 09:54 we didn't write the dleaf to disk then 2008-11-25 09:54 ok, probably, I should debug writepage 2008-11-25 09:55 right, nothing flushed the dleaf 2008-11-25 09:55 the only other possibility is, we're working on the wrong buffer 2008-11-25 09:55 oh, but this is buffer cache? 
2008-11-25 09:55 it is 14 2008-11-25 09:56 so, it seems to be right buffer 2008-11-25 09:56 I think we're working on the right buffer 2008-11-25 09:56 so it's just a missing flush 2008-11-25 09:56 which should be done at least on umount 2008-11-25 09:56 it doesn't have to be in ->writepage 2008-11-25 09:56 yes 2008-11-25 09:57 dleaf is from sb_getblk/sb_bread? 2008-11-25 09:57 Afternoon 2008-11-25 09:59 moin ;) 2008-11-25 10:00 hirofumi, ? 2008-11-25 10:00 dleaf is buffer cache, not page cache? 2008-11-25 10:00 oh right 2008-11-25 10:00 yes 2008-11-25 10:00 I see 2008-11-25 10:01 thanks, I'll debug other part 2008-11-25 10:01 ext2 uses flush_assoc_mappings or something for this 2008-11-25 10:01 and that's why it didn't "just happen" 2008-11-25 10:02 umm... but, it should flush by pdflush 2008-11-25 10:02 but it's buffer cache 2008-11-25 10:02 pdflush does page cache 2008-11-25 10:02 it flush via bd_inode 2008-11-25 10:03 url? 2008-11-25 10:03 maybe the buffer wasn't marked dirty 2008-11-25 10:03 it is same with normal flush 2008-11-25 10:03 mark_buffer_dirty(path[levels].buffer); 2008-11-25 10:04 it should do it 2008-11-25 10:04 ext2 uses mark_buffer_dirty_inode or something 2008-11-25 10:06 hirofumi, you have got enough done for me to announce on lkml 2008-11-25 10:06 complete with bug 2008-11-25 10:06 I'm going to show your graphics as a debugging tool 2008-11-25 10:06 and show how we tracked it down from the log output 2008-11-25 10:06 and that will be today's Tux3 Report 2008-11-25 10:06 and you can think about sleeping ;) 2008-11-25 10:07 ok 2008-11-25 10:07 so today we will announce our bug 2008-11-25 10:07 with no bugs, it wouldn't be fun to join the project 2008-11-25 10:07 ok? 
2008-11-25 10:07 oh, ok 2008-11-25 10:08 next week we can announce something that works 2008-11-25 10:08 this week, it is something with bugs, and we show your structure diagram, our tracing logs etc 2008-11-25 10:08 I might edit your decimal numbers to hex ;) 2008-11-25 10:08 "doctored log" 2008-11-25 10:09 I should recapture it? 2008-11-25 10:09 if you like 2008-11-25 10:09 it would save me time editing 2008-11-25 10:09 and I'll pull from you if you're ready 2008-11-25 10:09 hirofumi, that's beautiful 2008-11-25 10:09 thanks 2008-11-25 10:09 beyond beautiful :) 2008-11-25 10:09 it's something nobody's done before on linux 2008-11-25 10:10 well, ok, I'll recapture log and put it on same url 2008-11-25 10:10 a) actually design something b) use power tools to implement it 2008-11-25 10:10 well, chris did something like that and the ext4 group 2008-11-25 10:11 but this is pushing the envelope 2008-11-25 10:12 well, it only works for small volume though 2008-11-25 10:13 but a lot of tricky corner cases can be constructed with very few blocks 2008-11-25 10:14 and somebody is sure to make a rediculously big diagram at some point ;) 2008-11-25 10:14 just for fun 2008-11-25 10:16 graphviz doesn't support big picture, unfortunately 2008-11-25 10:17 well, it's beautiful for small filesystems as you showed 2008-11-25 10:17 you were able to illustrate the problem quickly with it 2008-11-25 10:17 yes, somehow, it seems to be good than I thought 2008-11-25 10:18 ok, I'll start to write a post 2008-11-25 10:18 ok 2008-11-25 10:18 and not post it until you wake up later 2008-11-25 10:18 post about bug? 
2008-11-25 10:18 post about what you have accomplished 2008-11-25 10:18 and how the tools were used to locate this bug 2008-11-25 10:18 even though not finished 2008-11-25 10:19 it is a good story 2008-11-25 10:19 it shows the usefulness of the tool 2008-11-25 10:19 and shows that we are in kernel 2008-11-25 10:19 most importantly, it shows we're having fun ;) 2008-11-25 10:19 :) ok, I'll continute to debug tomorrow 2008-11-25 10:20 -!- persson(~persson@cocaine.bsnet.se) has joined #tux3 2008-11-25 10:23 ok, I've updated the serial.txt 2008-11-25 10:23 http://userweb.kernel.org/~hirofumi/serial.txt 2008-11-25 10:23 Subject: Tux3 Report: Now in kernel and the fun begins 2008-11-25 10:23 that should show hex number 2008-11-25 10:23 looks good 2008-11-25 10:24 people are going to hit my git repository and kill my server ;) 2008-11-25 10:24 I'll point to the hg instead 2008-11-25 10:24 and a patch 2008-11-25 10:24 yes 2008-11-25 10:25 nice 2008-11-25 10:26 I will describe how our user/kernel arrangement works 2008-11-25 10:26 which lets us compile for both user space under fuse, and real kernel 2008-11-25 10:26 oh, yes 2008-11-25 10:26 and also is the beginning of our fsck and mkfs 2008-11-25 10:27 i see 2008-11-25 10:28 btw, probably, I'll be silent more or less 2008-11-25 10:28 if I found next office 2008-11-25 11:03 hirofumi, you did the kernel port in ten days 2008-11-25 11:04 I think that must be a record of some kind 2008-11-25 12:01 hirofumi (when you wake up) I think the dleaf was actually synced to disk because it has the right number of extents, but it has a zero where 11 should be, even though the dleaf dump shows the 11 there 2008-11-25 13:33 tux3 kernel port announce almost ready 2008-11-25 13:46 git log is busted, it takes the liberty of outputing to less even if you don't want it to 2008-11-25 13:51 folks 2008-11-25 14:00 http://tux3.org/patches/ 2008-11-25 14:00 grand total of 4444 lines at this point 2008-11-25 14:01 and it isn't that far away from 
ext2 functionality 2008-11-25 14:01 will only be about another 1,000 lines to get from ext2 to ext3 functionality 2008-11-25 14:01 just guessing of course 2008-11-25 14:02 well, more fair to say ext4 2008-11-25 14:02 exabyte files and all 2008-11-25 14:02 hmm, and another 500 - 600 lines for phtree 2008-11-25 14:04 course we need to add block allocation strategy that doesn't suck 2008-11-25 14:05 still, it looks like the code base will be about the size of ext2 while providing functionality similar to ext4, then we move on and add versioning 2008-11-25 14:26 nice 2008-11-25 14:35 ok, should this be an [ANNOUNCE] or a Tux3 Report? 2008-11-25 14:36 maybe I'd better save the [ANNOUNCE] for hackfs 2008-11-25 14:36 an auspcious project if there ever was one 2008-11-25 14:39 if it's meant to be loud, it'll be loud without the announce 2008-11-25 14:39 otherwise best let it remain quiet ;-) 2008-11-25 14:40 flips: atomic commits are working now ? 2008-11-25 14:40 no 2008-11-25 14:41 settled, it's a Tux3 Report 2008-11-25 14:41 what about a disk integrity checker ? 2008-11-25 14:41 just curiosu 2008-11-25 14:41 curious 2008-11-25 14:41 closest thing is the tux3 command 2008-11-25 14:41 good start on it 2008-11-25 14:41 want to write it? 
2008-11-25 14:41 thinking about it 2008-11-25 14:42 wondering what it would take to do it 2008-11-25 14:44 there are already a number of structure checking functions in the various source files 2008-11-25 14:44 like dleaf_check, ileaf_check, etc 2008-11-25 14:44 just being able to run them from user space would be fine 2008-11-25 14:45 basically a btree print with checks at the leaves instead of format dumpers 2008-11-25 14:45 that would already be a big step 2008-11-25 14:45 without doing the multiple passes that e2fsck does 2008-11-25 14:46 later, that kind of checking can be added 2008-11-25 15:08 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-25 16:25 unsigned get_extents(mapping_t mapping, block_t start, unsigned blocks, sector_t loc[], unsigned len[]) <- prototype for the get_extents interface 2008-11-25 16:25 first cut anyway 2008-11-25 16:26 probably better take a maxextents parameter 2008-11-25 16:26 and maybe have a signed return 2008-11-25 16:26 and it needs to know read/vs write 2008-11-25 16:27 maybe it us ok to use 'blocks' as the size of the output arrays 2008-11-25 16:29 int get_extents(mapping_t mapping, block_t start, unsigned blocks, sector_t loc[], unsigned len[]) <- maybe have positive mean number of extents returned, negative is errcode 2008-11-25 16:30 and I guess it would be too filthy to have -blocks mean "read" 2008-11-25 16:30 vs write 2008-11-25 16:39 int get_extents(mapping_t mapping, block_t start, unsigned blocks, sector_t loc[], unsigned len[], unsigned flags); 2008-11-25 16:39 could combine the vectors into one vector of structs... 
2008-11-25 16:41 it can't be struct extent[] because extents are endian 2008-11-25 16:41 struct extent is 2008-11-25 16:55 >>> Tux3 U _might_ be late tonight, if so please start without me <<< 2008-11-25 16:58 hi 2008-11-25 16:59 I noticed test in userland, tux3 command has same behaviour 2008-11-25 16:59 dd if=/dev/zero bs=4096 count=128 | ./tux3 write tux3.img test.txt 2008-11-25 17:00 the result of command seems strange 2008-11-25 17:04 I broke userland possibly... 2008-11-25 17:29 whoops, sorry, it seems the bug of tux3graph 2008-11-25 17:30 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-25 17:55 -!- bushman_(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-11-25 17:56 -!- ajonat(~ajonat@190.48.119.169) has joined #tux3 2008-11-25 18:17 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-25 18:36 hirofumi, I was thinking that 2008-11-25 18:36 that is was most probably a tux3graph bug 2008-11-25 18:36 sorry. 
now, tux3graph seems work 2008-11-25 18:36 heh 2008-11-25 18:36 I'm going to have to change my post ;) 2008-11-25 18:37 I'll send you a copy now 2008-11-25 18:37 diff -up /devel/linux/works/git/mercurial/tux3/user/tux3graph.c\~tux3graph-dleaf-fix /devel/linux/works/git/mercurial/tux3/user/tux3graph.c 2008-11-25 18:37 --- /devel/linux/works/git/mercurial/tux3/user/tux3graph.c~tux3graph-dleaf-fix 2008-11-26 10:41:22.000000000 +0900 2008-11-25 18:37 +++ /devel/linux/works/git/mercurial/tux3/user/tux3graph.c 2008-11-26 10:41:33.000000000 +0900 2008-11-25 18:37 @@ -216,7 +216,7 @@ static inline struct extent *dleaf_exten 2008-11-25 18:37 } 2008-11-25 18:37 if (ent) { 2008-11-25 18:37 entries = dleaf_entries(dleaf, groups, i); 2008-11-25 18:37 - extents += entry_limit(dleaf_entry(entries, ent)); 2008-11-25 18:37 + extents += entry_limit(dleaf_entry(entries, ent - 1)); 2008-11-25 18:37 } 2008-11-25 18:37 2008-11-25 18:37 return extents; 2008-11-25 18:37 it was off by one error in tux3graph 2008-11-25 18:37 :) 2008-11-25 18:38 are you ready for a pull? 2008-11-25 18:38 this patch? 2008-11-25 18:38 and your write changes 2008-11-25 18:38 not yet 2008-11-25 18:38 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-25 18:38 I found another problem in tux3_get_block 2008-11-25 18:39 something hard? 2008-11-25 18:39 it seems not to be so hard 2008-11-25 18:39 for (struct extent *extent; (extent = dwalk_next(walk));) 2008-11-25 18:39 if (dwalk_index(walk) + extent_count(*extent) > start) { 2008-11-25 18:39 if (dwalk_index(walk) <= start) 2008-11-25 18:39 dwalk_back(walk); 2008-11-25 18:39 break; 2008-11-25 18:39 } 2008-11-25 18:40 does it create a problem that we can see in tux3graph? 
2008-11-25 18:40 no 2008-11-25 18:40 it is problem of read side 2008-11-25 18:41 that part of the code is tricky 2008-11-25 18:41 hard to get all cases working at the same time ;) 2008-11-25 18:41 in that part, dwalk_back seems to be backing entry, not extent 2008-11-25 18:41 right, it backs up one entry 2008-11-25 18:42 if so, next dwalk_next() will get below of start? 2008-11-25 18:42 but we should not have more than one extent per entry right now 2008-11-25 18:42 yes, < seems better than <= 2008-11-25 18:43 I wonder what I was thinking 2008-11-25 18:43 I need to make that more robust 2008-11-25 18:44 if dwalk_back() back one extent, it seems to work 2008-11-25 18:44 but, actually it is one entry 2008-11-25 18:44 back one entry and back one extent should be the same thing right now 2008-11-25 18:45 it is easy to set up a test case in the unit test code that tests exactly the failure case you have 2008-11-25 18:45 in last of dwalk_next, return walk->extent++; // also return key 2008-11-25 18:45 just jump out the leaf before the write, and the extent its trying to write 2008-11-25 18:45 just dump out I meant 2008-11-25 18:45 then the test case can easily be set up in the unit test 2008-11-25 18:46 using getblk, like I did the others 2008-11-25 18:46 current log is 2008-11-25 18:48 http://userweb.kernel.org/~hirofumi/serial2.txt 2008-11-25 18:48 which line shows the error? 2008-11-25 18:49 http://userweb.kernel.org/~hirofumi/filemap2.c 2008-11-25 18:49 log of source 2008-11-25 18:49 tux3_get_block: --- index 0, block 15, count 1 2008-11-25 18:50 oops 2008-11-25 18:50 it is not right log 2008-11-25 18:53 http://userweb.kernel.org/~hirofumi/serial2.txt 2008-11-25 18:53 this log is including problem 2008-11-25 18:53 if (dwalk_index(walk) <= start) <- I think this should be <, not <= 2008-11-25 18:53 dwalk_back(walk); 2008-11-25 18:54 hmm 2008-11-25 18:54 even if <, next dwalk_next returns index == 0? 
2008-11-25 18:55 first dwalk_next(): tux3_get_block: --- index 0, block 15, count 1 2008-11-25 18:55 well, I think what that code says is, it wants to read the same extent again, if it starts exactly at start, that seems right 2008-11-25 18:55 ok, I'll get the new log 2008-11-25 18:56 second dwalk_next(): tux3_get_block: --- index 1, block 17, count 1 2008-11-25 18:56 and start is 1 2008-11-25 18:57 http://userweb.kernel.org/~hirofumi/tux3.img.dot.png2 2008-11-25 18:57 this is where I wish shapor was in town 2008-11-25 18:57 btw, current on-disk data is it 2008-11-25 18:57 he's great with this stuff 2008-11-25 18:58 finds the weird corner cases in my strange ideas 2008-11-25 18:58 but we will find it :) 2008-11-25 18:58 well, problem is clear in this case 2008-11-25 18:59 it's better to name the file dot.2.png, otherwise firefox gets confused 2008-11-25 18:59 stupid firefox 2008-11-25 18:59 on second dwalk_next(), index == 1 and start == 1 2008-11-25 18:59 oh 2008-11-25 18:59 http://userweb.kernel.org/~hirofumi/tux3.img.dot2.png 2008-11-25 19:00 konqueror shows it as a hex file ;) 2008-11-25 19:00 :) 2008-11-25 19:00 ok, ino14? 2008-11-25 19:00 yes 2008-11-25 19:01 and we fail to read that properly? 2008-11-25 19:01 starting at which index? 2008-11-25 19:01 start == 1 2008-11-25 19:01 and instead it goes dwalk_back and reads index 0? 
2008-11-25 19:01 on second dwalk_next() in that part, index == 1 and start == 1 2008-11-25 19:01 yes 2008-11-25 19:03 has something to do with rewind 2008-11-25 19:03 I wonder, it may be "dwalk_index() > start" or something 2008-11-25 19:03 right 2008-11-25 19:03 if it's > then the next condition can't trigger 2008-11-25 19:03 has to be >= 2008-11-25 19:04 :p 2008-11-25 19:04 well 2008-11-25 19:04 no, it's + count > 2008-11-25 19:04 and count is always >= 1 2008-11-25 19:05 so next condition can trigger 2008-11-25 19:05 if (dwalk_index(walk) + extent_count(*extent) > start) { 2008-11-25 19:05 if (dwalk_index(walk) <= start) 2008-11-25 19:05 second one is + count >? 2008-11-25 19:06 well, I think we may not need dwalk_back here 2008-11-25 19:07 I'm gertting close 2008-11-25 19:07 we can just pass extent to below loops as next_extent? 2008-11-25 19:07 we can 2008-11-25 19:08 but then we need to dwalk_next under some conditions 2008-11-25 19:08 we have to handle if start is on hole? 2008-11-25 19:10 why does log2 refer to block 16, and the diagram refers to block 17 as second extent of ino 14? 
2008-11-25 19:10 right, we have to handle start in the middle of a hole 2008-11-25 19:11 maybe, offset is 1, so 15 +1 == 16 2008-11-25 19:14 the flaw is, after the dwalk_back, it should reread the same entry it just returned 2008-11-25 19:14 and it did not 2008-11-25 19:15 yes 2008-11-25 19:15 do dwalk_back is broken 2008-11-25 19:15 so I mean 2008-11-25 19:15 it back one entry, not one extent 2008-11-25 19:15 one entry and one extent should be the same thing right now 2008-11-25 19:16 I think there are two case, one is before read extent, one is after read extent 2008-11-25 19:18 because dwalk_next() is not updating the entry until next dwak_next() 2008-11-25 19:18 let's set up this as a unit test case in filemap.c 2008-11-25 19:19 ok 2008-11-25 19:19 we will dirty blocks f and 11, and flush the file, then read it back 2008-11-25 19:19 there is a similar test case at the bottom of the file 2008-11-25 19:21 brelse_dirty(blockget(mapping(inode), 0xf)); 2008-11-25 19:21 brelse_dirty(blockget(mapping(inode), 0x11)); 2008-11-25 19:21 printf("flush... %s\n", strerror(-flush_buffers(mapping(inode)))); 2008-11-25 19:21 filemap_extent_io(blockget(mapping(inode), 1), 0); 2008-11-25 19:21 something like that 2008-11-25 19:22 yes 2008-11-25 19:25 ---- extent 0x1/1 ---- 2008-11-25 19:25 1 entry groups: 2008-11-25 19:25 0/2: f => 2/1; 11 => 3/1; 2008-11-25 19:25 prior extents: 0x11 => 3/1; 2008-11-25 19:25 ---- rewind to 0xf => 3/1 ---- 2008-11-25 19:25 segs (offset = 0): 1 => 0/1; (1) 2008-11-25 19:25 filemap_extent_io: extent 0x1/1 => 0 2008-11-25 19:25 filemap_extent_io: block 0x1 => 0 2008-11-25 19:25 shows the bug 2008-11-25 19:26 um 2008-11-25 19:26 it's not the same test ;) 2008-11-25 19:26 let me fix it 2008-11-25 19:32 -!- RazvanM(~RazvanM@pool-96-244-34-153.bltmmd.east.verizon.net) has joined #tux3 2008-11-25 19:40 hirofumi, I have a test case I think 2008-11-25 19:40 above 4 lines? 
2008-11-25 19:41 no, how shall I deliver it 2008-11-25 19:41 #if 1 2008-11-25 19:41 sb->nextalloc = 0x10; 2008-11-25 19:41 balloc(sb); 2008-11-25 19:41 sb->nextalloc = 0xf; 2008-11-25 19:41 brelse_dirty(blockget(mapping(inode), 0x0)); 2008-11-25 19:41 printf("flush... %s\n", strerror(-flush_buffers(mapping(inode)))); 2008-11-25 19:41 brelse_dirty(blockget(mapping(inode), 0x1)); 2008-11-25 19:41 printf("flush... %s\n", strerror(-flush_buffers(mapping(inode)))); 2008-11-25 19:41 filemap_extent_io(blockget(mapping(inode), 1), 1); 2008-11-25 19:41 return 0; 2008-11-25 19:41 #endif 2008-11-25 19:41 goes in user/filemap.c 2008-11-25 19:42 writes a dleaf with two extents, one at 0 => 0xf and 1 => 0x11 2008-11-25 19:42 the same as the bug, right? 2008-11-25 19:42 i see 2008-11-25 19:42 wait a bit 2008-11-25 19:44 yes, it seems same case 2008-11-25 19:44 it works in the filemap.c test case 2008-11-25 19:45 ... I think 2008-11-25 19:46 it have two segments, even if start == 1... 2008-11-25 19:46 you're right, it shows the bug 2008-11-25 19:46 good 2008-11-25 19:46 that's what I wanted 2008-11-25 19:47 now we can fix it in isolation 2008-11-25 19:47 and we have a unit test for it 2008-11-25 19:47 ok 2008-11-25 19:47 #define trace trace_on <- you probably want to do this at the top of filemap.c 2008-11-25 19:47 yes 2008-11-25 19:48 ok, now we can kill that bug in user space ;) 2008-11-25 19:48 and in relative comfort 2008-11-25 19:49 ---- rewind to 0x0 => f/1 ---- 2008-11-25 19:49 <- that's already wrong 2008-11-25 19:49 show rewind to 1 2008-11-25 19:49 s/show/should/ 2008-11-25 19:49 -!- RalucaM(~ral@pool-96-244-34-153.bltmmd.east.verizon.net) has joined #tux3 2008-11-25 19:50 hi 2008-11-25 19:50 yes 2008-11-25 19:50 hi raluca 2008-11-25 19:50 tonight's tux3 U is going to be: "how to debug tux3" 2008-11-25 19:50 :) 2008-11-25 19:51 -!- ChanServ changed mode/#tux3 -> +o flips 2008-11-25 19:51 -!- flips changed topic to "http://tux3.org ~ Tux3 U, right here Tuesdays and 
Thursdays at 8 pm Pacific Time ~ Next session: How do debug Tux3!" 2008-11-25 19:51 -!- ChanServ changed mode/#tux3 -> -o flips 2008-11-25 19:53 ok, it should not have gone dwalk_back 2008-11-25 19:54 as you know 2008-11-25 19:54 and already told me ;) 2008-11-25 19:54 :) 2008-11-25 19:55 so let's find out exactly why it decided to back up 2008-11-25 19:56 ok 2008-11-25 19:57 it is "dwalk_index(walk) <= start"? 2008-11-25 19:58 I think so 2008-11-25 19:59 as we thought a long time ago 2008-11-25 19:59 but it's nice to have a unit test 2008-11-25 19:59 so does it break anything else? 2008-11-25 19:59 this is obviously the bug ;) 2008-11-25 20:00 you knew that a long time ago, why didn't you just tell me? :) 2008-11-25 20:00 tux3_get_block is assuming first segment is including start 2008-11-25 20:00 it should never back up at a segment that starts exactly at start 2008-11-25 20:00 well wait 2008-11-25 20:00 no, it should 2008-11-25 20:01 because it wants to reread that segment 2008-11-25 20:01 and it fails to reread it 2008-11-25 20:01 so let's keep finding out why 2008-11-25 20:01 as you said, it appears to back up too far 2008-11-25 20:01 1 hour 20 minutes later... 
back to square one ;-) 2008-11-25 20:01 it's not quite that simple though 2008-11-25 20:01 maze, no, now we have a unit test in user space 2008-11-25 20:02 hmm, that is indeed progress 2008-11-25 20:02 and we don't just have to reason about it, but can try stuff easily 2008-11-25 20:02 so we could make an even more direct unit test now 2008-11-25 20:03 go into dleaf, and set up a test for dwalk_back 2008-11-25 20:03 but the current test is good enough to find this 2008-11-25 20:03 ok, dwalk_next() is just incrementing walk->extent++ 2008-11-25 20:03 and next dwalk_next() checks extent is last or not 2008-11-25 20:03 but, dwalk_back just back entry 2008-11-25 20:04 maze, see, we have started on tonight's session :) 2008-11-25 20:04 we are debugging tux3 as promised 2008-11-25 20:05 so, dwalk_back should check walk->extent at first? 2008-11-25 20:05 dwalk_next always leaves walk->extent pointing at the next extent 2008-11-25 20:06 yes, and dwalk_back didn't check it 2008-11-25 20:06 why does dwalk_back do ++entry? 2008-11-25 20:06 ++walk->entry 2008-11-25 20:07 well 2008-11-25 20:07 that is back 2008-11-25 20:07 because we are upside down 2008-11-25 20:07 which you also told me ;) 2008-11-25 20:07 yes 2008-11-25 20:07 it is back entry always 2008-11-25 20:07 but that test doesn't trigger because we are not crossing a group 2008-11-25 20:07 so it's a simple situation 2008-11-25 20:08 we should back ->entry only if ->extent is start extent? 2008-11-25 20:09 yes 2008-11-25 20:09 but, dwalk_back() doesn't check it? 2008-11-25 20:09 also yes 2008-11-25 20:09 it's wrongly conceived 2008-11-25 20:10 but... 2008-11-25 20:10 it should also work even if it does go back one entry 2008-11-25 20:10 why?
2008-11-25 20:11 dwalk_next() - extent++ 2008-11-25 20:11 in that case, it should leave the walk in a state where if (walk->extent >= walk->exstop) will trigger right away 2008-11-25 20:11 dwalk_back() - ++walk->entry 2008-11-25 20:12 yes 2008-11-25 20:12 it's not symmetric 2008-11-25 20:13 so, "skip extents below start" part was wrong? 2008-11-25 20:13 first thing dwalk_back should do is --extent 2008-11-25 20:13 dwalk_back is wrong 2008-11-25 20:13 ok 2008-11-25 20:14 and therefore, skip extents is wrong as you said 2008-11-25 20:15 so, now "how do we fix it" state? 2008-11-25 20:16 yes 2008-11-25 20:16 walk->extent = walk->exbase + (walk->estop + group_count(walk->group) - 1 - walk->entry); <- for some reason, I thought this would work 2008-11-25 20:16 folks 2008-11-25 20:16 ACTION is back from aikido :) 2008-11-25 20:16 I think it is right if one entry was backed 2008-11-25 20:16 good you can help debug ;) 2008-11-25 20:17 :) 2008-11-25 20:17 got to learn the code first 2008-11-25 20:17 http://tux3.org/tux3?f=7d3a8652286b;file=user/kernel/dleaf.c <- start here 2008-11-25 20:19 hirofumi, well it seems to me that dwalk_back should always start off with --walk->extent 2008-11-25 20:19 I don't know why I didn't write it that way 2008-11-25 20:20 well, what do you think about adding dwalk_extent()? 2008-11-25 20:20 dwalk_extent() returns current extent 2008-11-25 20:20 and dwalk_next() just does next 2008-11-25 20:21 so we can avoid dwalk_back until I fix it? 2008-11-25 20:21 that is ok with me 2008-11-25 20:21 use the trusted component 2008-11-25 20:21 yes 2008-11-25 20:22 I don't think you need dwalk_extent() 2008-11-25 20:22 and in the filemap_extent_io case, we don't need dwalk_back() 2008-11-25 20:22 how does that affect the rewind model? 2008-11-25 20:22 breaks it, doesn't it?
2008-11-25 20:23 I don't think it breaks 2008-11-25 20:23 there's another, crude and robust way to do dwalk_back 2008-11-25 20:24 that is to store the last returned extent 2008-11-25 20:25 and set a flag in dwalk that means "return it again if flag is set" 2008-11-25 20:25 that's disgusting, but robust 2008-11-25 20:25 I think no big change in that 2008-11-25 20:25 rewind depends on being able to go back to the same walk state several times in a row 2008-11-25 20:26 there is no big change 2008-11-25 20:26 or we can just add dwalk_extent() 2008-11-25 20:26 it returns current extent that dwalk points 2008-11-25 20:27 ok, dwalk_extent() just returns *(walk->extent - 1) ? 2008-11-25 20:27 no 2008-11-25 20:27 it returns walk->extent 2008-11-25 20:27 the next one 2008-11-25 20:27 dwalk_extent() - extent 0 2008-11-25 20:27 dwalk_next() 2008-11-25 20:27 dwalk_extent() - extent 1 2008-11-25 20:28 current dwalk_next() returns walk->extent, then increment 2008-11-25 20:29 right 2008-11-25 20:29 so, we can add dwalk_extent to return current extent? 2008-11-25 20:30 peek at it without incrementing 2008-11-25 20:30 ok good 2008-11-25 20:30 dwalk_peek ? 2008-11-25 20:30 ah, sounds good 2008-11-25 20:30 ok, let's do it 2008-11-25 20:30 struct extent *dwalk_peek(dwalk) 2008-11-25 20:30 { 2008-11-25 20:31 return dwalk->extent; 2008-11-25 20:31 } 2008-11-25 20:31 you want to do it while I fix dwalk_back, or should I work on it with you? 2008-11-25 20:31 yes 2008-11-25 20:31 jsut like that 2008-11-25 20:31 well 2008-11-25 20:31 there's a problem 2008-11-25 20:31 that extent might belong to the next entry 2008-11-25 20:31 I'll add it and use it in filemap_extent_io 2008-11-25 20:32 oh 2008-11-25 20:32 and so the dwalk_index() won't match 2008-11-25 20:32 probably the best thing to do is fix dwalk_back 2008-11-25 20:32 but and even better thing... 2008-11-25 20:33 but an even better thing... 
2008-11-25 20:33 announce with the bug 2008-11-25 20:33 if it doesn't have bugs, then what's the point of asking for help? ;) 2008-11-25 20:33 :) 2008-11-25 20:34 ok, how about you add the test case I just wrote to your tree, and I'll pull? 2008-11-25 20:34 ok, I'll add test case 2008-11-25 20:35 sb->nextalloc = 0x10; 2008-11-25 20:35 balloc(sb); 2008-11-25 20:35 sb->nextalloc = 0xf; 2008-11-25 20:35 brelse_dirty(blockget(mapping(inode), 0x0)); 2008-11-25 20:35 printf("flush... %s\n", strerror(-flush_buffers(mapping(inode)))); 2008-11-25 20:35 brelse_dirty(blockget(mapping(inode), 0x1)); 2008-11-25 20:35 printf("flush... %s\n", strerror(-flush_buffers(mapping(inode)))); 2008-11-25 20:35 filemap_extent_io(blockget(mapping(inode), 1), 0); 2008-11-25 20:35 return 0; 2008-11-25 20:35 yes 2008-11-25 20:35 anybody learning anything tonight? :) 2008-11-25 20:36 hmm 2008-11-25 20:37 a bit 2008-11-25 20:37 something about unit testing maybe? 2008-11-25 20:37 something about stupid canadians? 2008-11-25 20:38 see, I played hockey when I was a kid, and I took a few pucks in the head... 2008-11-25 20:38 so where do you think I spent my childhood? 2008-11-25 20:39 stopping pucks with your nose? 2008-11-25 20:39 like me? 2008-11-25 20:39 no, didn't play much on ice, mostly played with tennis balls 2008-11-25 20:39 and hockey sticks 2008-11-25 20:39 good way to avoid brain damage 2008-11-25 20:40 tennis hockey took up significant parts of spring and autumn 2008-11-25 20:40 oh, valgrind points the bug out in that test case 2008-11-25 20:40 during winter I mostly kept to normal skating... 
wasn't all that good at hockey (and it's a tad too brutal for me to willingly want to play it) 2008-11-25 20:45 ACTION tries it 2008-11-25 20:46 hirofumi, valgrind is just complaining about the uninitialized buffer data 2008-11-25 20:46 I'll fix that 2008-11-25 20:47 sb->nextalloc = 0x10; 2008-11-25 20:47 balloc(sb); 2008-11-25 20:47 sb->nextalloc = 0xf; 2008-11-25 20:47 brelse_dirty(blockread(mapping(inode), 0x0)); 2008-11-25 20:47 printf("flush... %s\n", strerror(-flush_buffers(mapping(inode)))); 2008-11-25 20:47 ah, i see 2008-11-25 20:47 brelse_dirty(blockread(mapping(inode), 0x1)); 2008-11-25 20:47 printf("flush... %s\n", strerror(-flush_buffers(mapping(inode)))); 2008-11-25 20:47 filemap_extent_io(blockget(mapping(inode), 1), 0); 2008-11-25 20:47 return 0; 2008-11-25 20:47 blockread needed instead of blockget 2008-11-25 20:47 to zero the buffer 2008-11-25 20:48 ok 2008-11-25 20:48 ready to pull, complete with bug + test case? 2008-11-25 20:49 recreating with this blockread 2008-11-25 20:50 touch filemap.c && make && ./filemap testdev <- this is how I'm testing 2008-11-25 20:50 need to add the deps for kernel/* 2008-11-25 20:50 yes 2008-11-25 20:50 a little messy 2008-11-25 20:51 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-11-25 20:51 tux3graph fix and test case 2008-11-25 20:54 pulled 2008-11-25 20:54 sent the announce to you to preview 2008-11-25 20:54 the Tux3 Report 2008-11-25 20:56 could you use OGAWA Hirofumi or Hirofumi Ogawa? 2008-11-25 20:57 yes 2008-11-25 20:57 is the second ok? 2008-11-25 20:57 yes 2008-11-25 20:57 root@usermode:~# touch /mnt/foo 2008-11-25 20:57 touch: /mnt/foo: Permission denied 2008-11-25 20:57 yes 2008-11-25 20:58 what's the issue?
2008-11-25 20:58 we can't create, write yet 2008-11-25 20:58 still tux3_get_block 2008-11-25 20:58 ok, I need to fix the post some more 2008-11-25 20:58 sure 2008-11-25 20:58 we can create and write in user space, then read in kernel 2008-11-25 20:58 that's good enough for today 2008-11-25 20:58 ok 2008-11-25 20:59 will we update git? 2008-11-25 21:05 ah, it was saying about patch 2008-11-25 21:05 yes, I will update git 2008-11-25 21:05 and the patch 2008-11-25 21:05 "Even though the kernel port can't write to files as of today, we can already create files in user space, mount the volume in kernel and read them back." 2008-11-25 21:06 looks good to me 2008-11-25 21:09 everything looks good to me 2008-11-25 21:10 http://tux3.org/patches/tux3-2.6.26.5-0 <- how about this? 2008-11-25 21:13 look good to me 2008-11-25 21:13 it is same with current hg 2008-11-25 21:14 ok 2008-11-25 21:15 I didn't really mean to get the .gitignore diff in there, but I think it looks like I planned it ;) 2008-11-25 21:15 :) 2008-11-25 21:16 well, it should be ok 2008-11-25 21:22 posted 2008-11-25 21:22 ok, I'll fix dwalk_back 2008-11-25 21:22 sorry for giving you a broken one ;) 2008-11-25 21:22 no problem :) 2008-11-25 21:23 I also gave buggy tux3graph :) 2008-11-25 21:23 http://lkml.org/lkml/2008/11/26/13 2008-11-25 21:24 vger is running fast today 2008-11-25 21:24 looks like realtime 2008-11-25 21:26 Announcment: anybody who thinks Tux3 U was less fascinating than usual tonight can have their money back 2008-11-25 21:27 hacking on the user space code got slightly less convenient with the files split between user/ and kernel/ 2008-11-25 21:27 have to have two files open and do a lot more switching 2008-11-25 21:27 but it's not too bad 2008-11-25 21:28 yes 2008-11-25 21:28 the convenience of being able to have the same code in kernel and userspace is way more important 2008-11-25 21:29 emacs with cscope can jump to actual function 2008-11-25 21:30 :) 2008-11-25 21:30 and jump to each 
error? 2008-11-25 21:31 error() function? 2008-11-25 21:31 it can 2008-11-25 21:31 it can choose declar or define 2008-11-25 21:32 my editor (kate) is barbaric by comparison 2008-11-25 21:32 umm... users or actual function 2008-11-25 21:32 but I like a real gui 2008-11-25 21:32 I mean, there error that gcc reports on compile 2008-11-25 21:32 the error that gcc reports on compile 2008-11-25 21:34 ah, it also can 2008-11-25 21:34 jump to error or warninig or something file:line pair 2008-11-25 22:10 http://userweb.kernel.org/~hirofumi/tux3-kernel.png 2008-11-25 22:10 8.6M png 2008-11-25 22:10 now, it seems I can create/write on simple case 2008-11-25 22:11 with dirty hack for tux3_get_block 2008-11-25 22:26 :) 2008-11-25 22:26 I had to play some final fantasy xii with my girl before she went to sleep 2008-11-25 22:26 now working on dwalk_back 2008-11-25 22:26 adding test case to dleaf.c 2008-11-25 22:27 "The image ?http://userweb.kernel.org/~hirofumi/tux3-kernel.png? cannot be displayed, because it contains errors." 2008-11-25 22:27 -- mozilla 2008-11-25 22:28 firefox 3.0.4 seems to read it 2008-11-25 22:29 well, perhaps, you should hackfs, and I'll work for dwalk_back 2008-11-25 22:29 should work for 2008-11-25 22:30 FWIW, http://userweb.kernel.org/~hirofumi/tux3-kernel.png.bz2 2008-11-25 22:32 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-25 22:32 flips: firefox 2 reads it 2008-11-25 22:33 my firefox is lame, etch is lame ;) 2008-11-25 22:34 gimp reads it, but uses a massive amount of memory 2008-11-25 22:34 konq probably reads it too 2008-11-25 22:34 nice pictures :) 2008-11-25 22:35 one block per extent :) 2008-11-25 22:36 :) 2008-11-25 22:36 it's exercising lots of code 2008-11-25 22:36 you feel find with dwalk_back? 2008-11-25 22:36 fine? 
2008-11-25 22:37 probably 2008-11-25 22:37 well, I think I can do it sooner or later 2008-11-25 22:37 first line should be: walk->extent--; 2008-11-25 22:37 it needs to do dwalk_next backwards 2008-11-25 22:37 yes 2008-11-25 22:38 actually, it only needs to go back one 2008-11-25 22:38 and probably, it chehcks ->extent is start or not 2008-11-25 22:38 and it might be enough just to do that one decrement, except at the beginning 2008-11-25 22:39 oh, first line has to be a check for at the beginning 2008-11-25 22:39 yes 2008-11-25 22:39 oh wait 2008-11-25 22:39 ah, yes it can be at the beginning if there are no extents 2008-11-25 22:42 so just exit if the leaf is entirely empty (!groups) 2008-11-25 22:42 ah, yes 2008-11-25 22:43 otherwise, decrement extent pointer, should be done 2008-11-25 22:43 add some error checks later ;-) 2008-11-25 22:43 like extent already at extent base (for the entry) 2008-11-25 22:43 it was already done? 2008-11-25 22:44 I didn't do it 2008-11-25 22:45 I should do it? 2008-11-25 22:45 void dwalk_back(struct dwalk *walk) 2008-11-25 22:45 { 2008-11-25 22:45 trace("back one entry"); 2008-11-25 22:45 if (walk->groups) 2008-11-25 22:45 walk->extent++; 2008-11-25 22:45 } 2008-11-25 22:45 ok? 2008-11-25 22:46 completely untested 2008-11-25 22:46 and wrong 2008-11-25 22:46 ->extent-- ? 2008-11-25 22:46 void dwalk_back(struct dwalk *walk) 2008-11-25 22:46 { 2008-11-25 22:46 trace("back one entry"); 2008-11-25 22:46 if (walk->groups) 2008-11-25 22:46 --walk->extent; 2008-11-25 22:46 } 2008-11-25 22:46 predecrement just for good luck :) 2008-11-25 22:47 :) 2008-11-25 22:47 dleaftest ran 2008-11-25 22:49 void dwalk_back(struct dwalk *walk) 2008-11-25 22:49 { 2008-11-25 22:49 trace("back one entry"); 2008-11-25 22:49 if (walk->groups) { 2008-11-25 22:49 assert(walk->extent > walk->exbase); 2008-11-25 22:49 --walk->extent; 2008-11-25 22:49 } 2008-11-25 22:49 } 2008-11-25 22:49 firefox chokes on the graphic. Crashes Safari too. 
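The asymmetry discussed above (dwalk_next() returns the current extent and then advances, so stepping back has to undo exactly that one increment) can be sketched in isolation. This is a toy cursor over a flat array, not the real dleaf code: the `toy_` names and the flat layout are assumptions made for illustration, and the real walk also has to track entries and groups, which this ignores.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real tux3 dleaf structures. */
struct toy_extent { unsigned long block; unsigned count; };

struct toy_walk {
	struct toy_extent *exbase; /* first extent */
	struct toy_extent *extent; /* extent the next call to toy_walk_next() returns */
	struct toy_extent *exstop; /* one past the last extent */
};

static void toy_walk_init(struct toy_walk *walk, struct toy_extent *base, unsigned n)
{
	walk->exbase = walk->extent = base;
	walk->exstop = base + n;
}

/* Return the current extent and step forward, or NULL at the end. */
static struct toy_extent *toy_walk_next(struct toy_walk *walk)
{
	if (walk->extent >= walk->exstop)
		return NULL;
	return walk->extent++;
}

/* Look at what toy_walk_next() would return, without consuming it. */
static struct toy_extent *toy_walk_peek(struct toy_walk *walk)
{
	return walk->extent < walk->exstop ? walk->extent : NULL;
}

/* Undo exactly one successful toy_walk_next(): the mirror of its increment. */
static void toy_walk_back(struct toy_walk *walk)
{
	assert(walk->extent > walk->exbase);
	--walk->extent;
}
```

With this shape, calling toy_walk_back() after a successful toy_walk_next() makes the following toy_walk_next() return the same extent again, which is the property the rewind logic in the discussion depends on.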
2008-11-25 22:50 tim_dimm, mac? 2008-11-25 22:50 no problem in photoshop or preview 2008-11-25 22:50 yeah 2008-11-25 22:50 512MB of vram 2008-11-25 22:50 you n me both have outdated browsers 2008-11-25 22:50 that helps a bit 2008-11-25 22:50 still shouldn't crash of course 2008-11-25 22:50 shouldn't 2008-11-25 22:50 I had ~15 other pages open 2008-11-25 22:51 illustrator didn't like it 2008-11-25 22:51 complain to hirofumi ;) 2008-11-25 22:51 heh 2008-11-25 22:51 oh :) 2008-11-25 22:51 it's just too big I guess 2008-11-25 22:51 needs to check out bigtiff 2008-11-25 22:51 or the viewer that aperio put together for 100GB+ images 2008-11-25 22:52 loads only what's being displayed into vram 2008-11-25 22:52 very fast 2008-11-25 22:53 http://www.aperio.com/bigtiff/ 2008-11-25 22:53 http://images2.aperio.com/BigTIFF/BreastCancer225.tif/view.apml?returnurl=http://bigtiff.org/ 2008-11-25 22:53 that's a 143GB image that can be opened in a browser 2008-11-25 22:54 it does 2008-11-25 22:54 and my code up there is wrong 2008-11-25 22:54 void dwalk_back(struct dwalk *walk) 2008-11-25 22:54 { 2008-11-25 22:54 trace("back one entry"); 2008-11-25 22:54 if (dleaf_groups(walk->leaf)) { 2008-11-25 22:54 assert(walk->extent > walk->exbase); 2008-11-25 22:54 --walk->extent; 2008-11-25 22:55 } 2008-11-25 22:55 } 2008-11-25 22:55 need to fix the user/kernel dependencies 2008-11-25 22:55 Makefile? 2008-11-25 22:55 yes 2008-11-25 22:55 now, I'm looking it 2008-11-25 22:55 just trying to think of something clean 2008-11-25 22:56 -Wp,-MD,$(@D)/.$(@F).d 2008-11-25 22:56 I'm thinking it may able to use 2008-11-25 22:56 it creates dependency by gcc 2008-11-25 22:57 woo, that's scary 2008-11-25 22:57 if it work, make dependency will be created automatically 2008-11-25 22:58 that's gcc parameters? 2008-11-25 22:58 yes 2008-11-25 22:58 kbuild seems to use it 2008-11-25 22:58 $(@D) <- some make language? 2008-11-25 22:58 yes 2008-11-25 22:59 what is @? 
2008-11-25 22:59 it's directory part of something 2008-11-25 22:59 gnu make language 2008-11-25 22:59 -Wp,-MD,kernel/.$(@F).d <- is that the same for us? 2008-11-25 23:00 probably 2008-11-25 23:00 slightly less scary 2008-11-25 23:00 .c.o: 2008-11-25 23:00 $(CC) -Wp,-MD,$(@D)/.$(@F).d $(CFLAGS) -c -o $@ $< 2008-11-25 23:00 and the . makes them hidden 2008-11-25 23:01 power :) 2008-11-25 23:01 @D is from $< 2008-11-25 23:01 however, I'm not sure yet 2008-11-25 23:01 $< is? 2008-11-25 23:01 .c 2008-11-25 23:01 $@ is .o 2008-11-25 23:01 from .c.o rule 2008-11-25 23:01 :O 2008-11-25 23:01 whoever made up this language needs to go back to school ;) 2008-11-25 23:02 lol 2008-11-25 23:02 :) 2008-11-25 23:03 how does make know to check the hidden depedency files? 2008-11-25 23:03 -include $(depend_files) 2008-11-25 23:04 depend_files is including *.d files 2008-11-25 23:05 shall I wait for your patch? :) 2008-11-25 23:05 trouble with this kind of thing is, if it breaks, nobody can fix it 2008-11-25 23:05 well, many make files are like that 2008-11-25 23:05 and usually it doesn't break 2008-11-25 23:06 ah 2008-11-25 23:06 balloc.o: balloc. kernel/balloc.c 2008-11-25 23:06 will work 2008-11-25 23:07 that's much nicer! 2008-11-25 23:07 I was lazy to update it :) 2008-11-25 23:08 well, I'll try with simple one 2008-11-25 23:08 balloc.o: ... 
2008-11-25 23:09 so -MD is actually a preprocessor option, not a gcc option 2008-11-25 23:09 ah, yes 2008-11-25 23:17 ok, back to hackfs 2008-11-25 23:18 ok 2008-11-25 23:27 int get_extents(mapping_t mapping, block_t start, unsigned count, unsigned flags, struct extent extents[]); 2008-11-25 23:28 result is negative if error, otherwise count of extents found in the logical region start..start + count - 1 2008-11-25 23:28 count is the number of extents 2008-11-25 23:28 number of logical blocks 2008-11-25 23:28 ah 2008-11-25 23:29 and we can require the array to be at least as big as count to handle the worst case 2008-11-25 23:29 i see 2008-11-25 23:29 only thing is, we can't call it struct extent, because that is endian 2008-11-25 23:30 ah, yes 2008-11-25 23:30 int get_spans(mapping_t mapping, block_t start, unsigned count, unsigned flags, struct span spans[]); 2008-11-25 23:30 spans? 2008-11-25 23:30 a span is "from here to there" 2008-11-25 23:30 just like an extent 2008-11-25 23:31 synonym 2008-11-25 23:31 i see 2008-11-25 23:33 we make a distinction between hole and block == 0? 2008-11-25 23:35 yes 2008-11-25 23:35 I don't really like using !block to mean hole 2008-11-25 23:35 because a block may well be zero, just not in tux3 2008-11-25 23:36 yes 2008-11-25 23:36 I thought, maybe negative count for a hole, is that too ugly? 2008-11-25 23:37 I think it is too ugly 2008-11-25 23:37 yah 2008-11-25 23:37 it may still valid number as block number 2008-11-25 23:38 next obvious choice is block == -1 => hole 2008-11-25 23:39 if block is unsigned, it is still valid? 2008-11-25 23:39 yes 2008-11-25 23:39 tux3 is 48bit, so it's ok, but... 2008-11-25 23:39 ...but what about 32 bit...
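To make the span interface discussed above concrete, here is a rough sketch of a lookup that reports both mapped runs and holes for a logical region. Everything here is hypothetical: the `toy_` names, the in-memory map, and in particular the choice of a negative count to mark a hole, which is only one of the options floated in the discussion. Unlike the proposed get_spans(), this toy version has no error path and takes a plain array instead of a mapping.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t toy_block_t;

/* Hypothetical CPU-endian result record. */
struct toy_span {
	toy_block_t block; /* physical start; meaningless for a hole */
	int count;         /* > 0: mapped blocks, < 0: hole of -count blocks */
};

/* Hypothetical mapped run: count logical blocks starting at logical. */
struct toy_map { toy_block_t logical, physical; unsigned count; };

static int span_is_hole(const struct toy_span *span)
{
	return span->count < 0;
}

static unsigned span_blocks(const struct toy_span *span)
{
	return (unsigned)(span->count < 0 ? -span->count : span->count);
}

/*
 * Fill spans[] for logical blocks start .. start + count - 1 from a sorted,
 * non-overlapping map and return the number of spans written. Holes and
 * mapped runs alternate and each covers at least one block, so spans[]
 * needs room for at most count entries.
 */
static int toy_get_spans(const struct toy_map *map, unsigned maps,
	toy_block_t start, unsigned count, struct toy_span spans[])
{
	toy_block_t pos = start, end = start + count;
	int found = 0;

	for (unsigned i = 0; i < maps && pos < end; i++) {
		toy_block_t lo = map[i].logical, hi = lo + map[i].count;
		if (hi <= pos)
			continue; /* entirely below the region */
		if (lo >= end)
			break; /* entirely above the region */
		if (lo > pos) { /* gap before this run: emit a hole */
			spans[found++] = (struct toy_span){ 0, -(int)(lo - pos) };
			pos = lo;
		}
		toy_block_t stop = hi < end ? hi : end;
		spans[found++] = (struct toy_span){ map[i].physical + (pos - lo), (int)(stop - pos) };
		pos = stop;
	}
	if (pos < end) /* trailing hole */
		spans[found++] = (struct toy_span){ 0, -(int)(end - pos) };
	return found;
}
```

The caller has to test span_is_hole() before using the block field, which matches the point made in the discussion that holes need special handling anyway.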
2008-11-25 23:40 ah 2008-11-25 23:40 negative count is sounding not too bad then 2008-11-25 23:40 and the thing is, holes need special handling anyway 2008-11-25 23:40 so best not to ignore the check 2008-11-25 23:41 anyway 2008-11-25 23:41 I'll flip a coin and not get stuck on it 2008-11-25 23:42 ok 2008-11-26 00:33 I updated Makefile rules 2008-11-26 00:39 :) 2008-11-26 00:40 pull? 2008-11-26 00:40 please check it and pull 2008-11-26 00:40 I think it's ok 2008-11-26 00:44 it's easy to read 2008-11-26 00:44 should not break too much 2008-11-26 00:44 ok, please pull it 2008-11-26 00:44 on some test, it seems to work 2008-11-26 00:49 worked for everything I tried it on 2008-11-26 00:50 ok 2008-11-26 00:50 more efficient than having the compiler create dep files 2008-11-26 00:51 tux3 is small than kernel, so compiler deps may be too much 2008-11-26 00:52 well, dependency seems to be fixed 2008-11-26 02:19 -!- bhuey(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-11-26 03:25 -!- spsneo(~chatzilla@125.20.8.166) has joined #tux3 2008-11-26 03:25 hi 2008-11-26 03:25 i want to contribute to tux3 project 2008-11-26 03:25 how to get started? 2008-11-26 03:27 hi 2008-11-26 03:27 I think flips has ideas 2008-11-26 03:27 however, maybe he is sleeping now :) 2008-11-26 03:27 hirofumi: when will he be online? 2008-11-26 03:28 I think he is in PST timezone 2008-11-26 03:28 okk 2008-11-26 03:28 hirofumi: anyways are u also one of the devs? 2008-11-26 03:28 yes 2008-11-26 03:29 well, so tux3 is all set to be released? 2008-11-26 03:29 or still a lot of things are to be done before the release? 2008-11-26 03:30 I think we have to work a lot for tux3 2008-11-26 03:31 ACTION is excited then 2008-11-26 03:31 i would love to be one of the contributors 2008-11-26 03:31 now, I'm porting to kernel current userspace source 2008-11-26 03:31 what all else need to be done? 2008-11-26 03:32 a lot of work is needed 2008-11-26 03:32 as in? 
2008-11-26 03:32 now, in kernel it just can readir and read 2008-11-26 03:32 and there is no atomic commit etc. 2008-11-26 03:33 okk 2008-11-26 03:33 how to get started ? 2008-11-26 03:33 maybe... 2008-11-26 03:33 http://tux3.org/ 2008-11-26 03:33 is site of tux3 2008-11-26 03:34 http://tux3.org/tux3 is hg repo for userspace sources 2008-11-26 03:34 http://phunq.net/ddtree?p=tux3fs is git repo for kernel 2008-11-26 03:35 well, I think asking to flips would be good start 2008-11-26 03:38 he will appear to here 2008-11-26 03:40 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-26 03:41 -!- spsneo(~chatzilla@125.20.8.166) has joined #tux3 2008-11-26 03:46 http://phunq.net/ddtree?p=tux3fs is git repo for kernel 2008-11-26 03:46 spsneo, can you read this line? 2008-11-26 03:47 hirofumi: i can :) 2008-11-26 03:47 to make sure 2008-11-26 03:47 http://tux3.org/ 2008-11-26 03:47 is site of tux3 2008-11-26 03:47 http://tux3.org/tux3 is hg repo for userspace sources 2008-11-26 03:47 http://phunq.net/ddtree?p=tux3fs is git repo for kernel 2008-11-26 03:47 and 2008-11-26 03:47 spsneo: what do u do? 2008-11-26 03:48 pranith: me? 2008-11-26 03:49 I am a computer science student at iit guwahati 2008-11-26 03:50 well, in hg repo, tux3/user/* is userland source 2008-11-26 03:50 tux3/user/kernel/* is kernel 2008-11-26 03:51 tux3/user/kernel/* can copy to linux/fs/tux3/ of git repo 2008-11-26 03:53 pranith: are u there? 2008-11-26 03:53 !logs 2008-11-26 03:57 spsneo: yeah 2008-11-26 03:57 here 2008-11-26 03:58 spsneo: most of the work left now is related to kernel.. you have any experience in that? 
2008-11-26 04:03 -!- spsneo(~chatzilla@125.20.8.166) has joined #tux3 2008-11-26 04:05 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-26 06:48 -!- mingming(~mingming@c-71-193-163-244.hsd1.or.comcast.net) has joined #tux3 2008-11-26 08:06 -!- bushman(~bushman@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-11-26 08:34 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-26 09:13 -!- pgquiles(~pgquiles@222.Red-88-0-139.dynamicIP.rima-tde.net) has joined #tux3 2008-11-26 10:24 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-26 10:41 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-26 10:50 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-26 11:17 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-26 12:28 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2008-11-26 12:58 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-26 13:21 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-26 13:31 I'd better make a to do list 2008-11-26 13:31 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-26 13:34 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-26 13:59 damm, I forgot to cc the tux3 list 2008-11-26 14:01 there 2008-11-26 14:07 by the way, I think I broke fuse as a daemon 2008-11-26 14:07 works fine in foreground 2008-11-26 14:08 that is, make debug works, make testfs doesn't 2008-11-26 14:38 time to update the Wikipedia article 2008-11-26 15:24 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-26 15:39 I have to write two new posts today 2008-11-26 15:39 1) The delta transition "stage" operation 2008-11-26 15:40 2) Online filesystem check and repair 2008-11-26 15:40 need to show we aren't ignoring issues of how to run long term on multi terabyte volumes 2008-11-26 15:59 hey flips 2008-11-26 17:12 -!- 
tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-26 18:02 skating in the dark in the rain on oil and leaves is fun 2008-11-26 18:02 hi bhuey 2008-11-26 18:45 hirofumi, there? 2008-11-26 18:45 hi 2008-11-26 18:45 I'd like to do a minor change: struct tux_path -> struct cursor 2008-11-26 18:46 see the btree locking post on tux3 ml 2008-11-26 18:46 yes 2008-11-26 18:46 just want to check if you have anything not committed first 2008-11-26 18:47 I found around some bugs around path handling 2008-11-26 18:47 post them? 2008-11-26 18:47 or will you just fix? 2008-11-26 18:47 right now, I'm trying to fix it 2008-11-26 18:47 want help? 2008-11-26 18:48 um.. 2008-11-26 18:48 well 2008-11-26 18:48 one issue is 2008-11-26 18:48 alloc_path() allocates current depth 2008-11-26 18:48 path = alloc_path(levels + 1); 2008-11-26 18:49 but, if we call btree_leaf_split(), it adds new depth 2008-11-26 18:49 so, we have to reserve memory for it 2008-11-26 18:50 2008-11-26 18:50 and 2008-11-26 18:50 we copy depth like 2008-11-26 18:50 levels = sb->itable.root.depth 2008-11-26 18:50 but, if we call btree_leaf_split(), it will be changed 2008-11-26 18:51 we have to check, we are using new depth 2008-11-26 18:51 2008-11-26 18:51 and 2008-11-26 18:52 easiest thing is to change them all to "maxdepth" 2008-11-26 18:52 at least, tree_expand() and split calls release_path() at error point 2008-11-26 18:52 for the allocs 2008-11-26 18:52 i see 2008-11-26 18:53 for now, I'm just doing "alloc_path(levels + 2)" 2008-11-26 18:53 yes, a split can only add one level 2008-11-26 18:53 there are 3 users of split 2008-11-26 18:53 that's fine 2008-11-26 18:54 if we can define maxdepth, I feel we can use slab cache for path 2008-11-26 18:54 ok, let's think about that 2008-11-26 18:54 bit tree is likely to be the inode table 2008-11-26 18:54 biggest tree I meant 2008-11-26 18:55 branching factor of about 2**6 2008-11-26 18:55 or dtree for big file? 
2008-11-26 18:55 yes 2008-11-26 18:55 sorry 2008-11-26 18:55 branching factor of 2**8 for both, and 75% average fullness 2008-11-26 18:56 50-75% 2008-11-26 18:56 so worst case is 2**7 2008-11-26 18:56 now... how many maximum leaf elements will we support? 2008-11-26 18:56 for now? 2008-11-26 18:57 ok 2008-11-26 18:57 we can allocate paths from slab for up to 4 levels, say 2008-11-26 18:57 depth == 4? 2008-11-26 18:57 including the leaf, that is 2**21 btree leaves 2008-11-26 18:58 which gives millions of inodes 2008-11-26 18:58 well, maxdepth of 3, say, then with the leaf the path is a nice binary sized structure 2008-11-26 18:58 i see 2008-11-26 18:58 I think 2008-11-26 18:58 or maxdepth of 7 2008-11-26 18:58 no big deal 2008-11-26 18:59 then, for any depth greater, we kmalloc 2008-11-26 18:59 and don't worry about that for now 2008-11-26 18:59 maxdepth of 7 will last us for a very long time before it becomes a problem ;) 2008-11-26 19:00 i see 2008-11-26 19:01 updating the root.depth... where is this bug? 2008-11-26 19:01 I fixed some of that 2008-11-26 19:01 not all I guess 2008-11-26 19:02 for now, I know make_inode() 2008-11-26 19:02 it uses release_path(path, levels + 1); 2008-11-26 19:02 but store_attrs() may update root.depth 2008-11-26 19:03 and save_inode() is same way 2008-11-26 19:03 and maybe filemap_extent_io() 2008-11-26 19:03 so it needs to use the levels from the btree froot, not the local cached value 2008-11-26 19:03 I'm thinking this is all of users 2008-11-26 19:03 or the local variable has to be updated 2008-11-26 19:03 yes 2008-11-26 19:04 just get rid of the local variable 2008-11-26 19:04 ok 2008-11-26 19:05 well, about tux_path rename, it's ok to update now for me 2008-11-26 19:06 It was tasteless of me to call it "levels" when the struct name is "depth" 2008-11-26 19:06 I'll pull it and I'll fix based on it 2008-11-26 19:06 ok 2008-11-26 19:07 well, somehow, it doesn't confuses me 2008-11-26 19:07 does or doesn't? 
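The two depth problems raised above (a path sized for the current depth, and a locally cached depth going stale after a split) can be illustrated with a toy model. The `toy_` names and the depth-only btree are assumptions made for this sketch; the real alloc_path() and release_path() are not reproduced here. The capacity helper just restates the arithmetic from the discussion: at a worst-case fanout of 2**7 per index level, three index levels reach 2**21 leaves.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_WORST_FANOUT_BITS 7 /* 2**8 entries per node at 50% fullness */

/* Hypothetical btree root: depth counts index levels above the leaf. */
struct toy_btree { unsigned depth; };

/* Hypothetical path element: one per level, index levels plus the leaf. */
struct toy_level { void *buffer; void *next; };

/* Levels in use right now; always read from the btree, never a local copy. */
static unsigned toy_levels_in_use(const struct toy_btree *btree)
{
	return btree->depth + 1;
}

/* Room to allocate: current levels plus one, since a split adds at most one level. */
static unsigned toy_path_room(const struct toy_btree *btree)
{
	return btree->depth + 2;
}

static struct toy_level *toy_alloc_path(unsigned levels)
{
	return calloc(levels, sizeof(struct toy_level));
}

/* Stand-in for a leaf split that propagates all the way up and adds a root. */
static void toy_grow_root(struct toy_btree *btree)
{
	btree->depth++;
}

/* Leaves reachable in the worst case through the given number of index levels. */
static unsigned long long toy_worst_case_leaves(unsigned index_levels)
{
	return 1ULL << (TOY_WORST_FANOUT_BITS * index_levels);
}
```

A caller that saved depth in a local variable before the split would release one level too few afterwards; re-reading via toy_levels_in_use() avoids that, and the extra slot from toy_path_room() means the grown path still fits.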
2008-11-26 19:09 doesn't confuses 2008-11-26 19:09 I didn't like to collide with the kernel "path" 2008-11-26 19:10 ah, it is about "levels" and "depth" 2008-11-26 19:10 ah 2008-11-26 19:10 right 2008-11-26 19:10 didn't confuse me either which is why I didn't fix it I guess 2008-11-26 19:10 but consistent names are nice 2008-11-26 19:10 yes 2008-11-26 19:10 I'll change levels to depth in this edit too 2008-11-26 19:11 ok 2008-11-26 19:25 pushed to public 2008-11-26 19:25 thanks 2008-11-26 19:26 re btree locking, I first thought we could use i_mutex to start, but that will deadlock on lock recursion I think 2008-11-26 19:27 which btree? 2008-11-26 19:27 directory 2008-11-26 19:27 i_mutex is held across rename 2008-11-26 19:28 directory is already protected by i_mutex? 2008-11-26 19:28 ah, inode btree? 2008-11-26 19:28 maybe 2008-11-26 19:28 yes 2008-11-26 19:28 and file btrees 2008-11-26 19:28 have no protection right now 2008-11-26 19:29 file btree meas dtree? 2008-11-26 19:29 means 2008-11-26 19:29 yes 2008-11-26 19:29 data tree 2008-11-26 19:30 maybe, directory dtree is already protected by i_mutex 2008-11-26 19:30 maybe 2008-11-26 19:30 probably 2008-11-26 19:30 yes 2008-11-26 19:30 that has to be a horrible contention point 2008-11-26 19:31 yes 2008-11-26 19:31 anyway, all btrees need to be protected 2008-11-26 19:31 yes 2008-11-26 19:31 I don't think it will hurt to have redundant locking in directory dtree 2008-11-26 19:32 and we can try a vfs patch at some point to allow parallel dirops 2008-11-26 19:32 there must be common cases there this is not dangerous 2008-11-26 19:32 i see 2008-11-26 19:32 though the general case of directory rename is dangerous 2008-11-26 19:33 a project for later 2008-11-26 19:33 yes 2008-11-26 19:33 it's hard to think 2008-11-26 19:35 well, now, we think about another locking? 
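A quick check of the sizing arithmetic from the exchange above, as a sketch only. It assumes the worst-case fanout of 2**7 quoted in the chat and the interim `alloc_path(levels + 2)` fix (index levels, plus the leaf, plus one level of headroom because a split can add at most one level). `struct path_level`, `max_leaves`, `cursor_slots` and `alloc_cursor` are made-up names for illustration, not tux3 functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one cursor level: a buffer and a position in it. */
struct path_level { void *buffer; void *next; };

enum { SPLIT_HEADROOM = 1 };	/* a split can add at most one level */

/* Leaves reachable through `depth` index levels at 2**fanout_bits per node. */
uint64_t max_leaves(unsigned depth, unsigned fanout_bits)
{
	assert(depth * fanout_bits < 64);
	return (uint64_t)1 << (depth * fanout_bits);
}

/* Slots to reserve: index levels + the leaf + headroom for one split. */
unsigned cursor_slots(unsigned depth)
{
	return depth + 1 + SPLIT_HEADROOM;
}

/* Allocate a zeroed cursor big enough to survive one split. */
struct path_level *alloc_cursor(unsigned depth)
{
	return calloc(cursor_slots(depth), sizeof(struct path_level));
}
```

With three index levels at fanout 2**7 this gives the 2**21 leaves mentioned in the chat, and `cursor_slots(levels)` matches the `levels + 2` workaround.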
2008-11-26 19:35 some uses of unsigned depth = are ok 2008-11-26 19:35 probe and advance for example 2008-11-26 19:36 any place where it is not always safe, let's eliminate the local variable 2008-11-26 19:36 it's safe in tree_chop 2008-11-26 19:36 if so, we add depth to cursor? 2008-11-26 19:37 because the depth isn't changed until the last step 2008-11-26 19:37 we could 2008-11-26 19:38 um... it may make simple 2008-11-26 19:38 so we have struct cursor { unsigned depth; struct { buffer *; void *; } path[] }; ? 2008-11-26 19:38 or { BTREE; struct { buffer *; void *; } path[];} 2008-11-26 19:39 ? 2008-11-26 19:39 we will have multiple cursors on a single btree 2008-11-26 19:39 so I don't think we should put the depth in the cursor 2008-11-26 19:40 however, we will probably have a per-cursor lock... that is for later 2008-11-26 19:40 ah, i see 2008-11-26 19:40 we can change depth after allocate 2008-11-26 19:40 that also 2008-11-26 19:40 we can change depth after allocate path 2008-11-26 19:41 so it's fine how it is 2008-11-26 19:41 except for any bugs where we cache the depth 2008-11-26 19:41 but.. 2008-11-26 19:42 if so, read from btree sounds like aslo have same issue 2008-11-26 19:42 which issue? 2008-11-26 19:42 depth can change? 2008-11-26 19:42 yes 2008-11-26 19:42 true 2008-11-26 19:43 with crab locking 2008-11-26 19:43 the tree can get deeper before lock is acquired 2008-11-26 19:43 well we can do this: int *depth = &btree->root.depth; 2008-11-26 19:44 and reference via *depth 2008-11-26 19:44 ok? 2008-11-26 19:44 multiple cursors on single btree is not issue? 
2008-11-26 19:44 because we should protect it 2008-11-26 19:44 it will be a little tricky to implement, but a very nice optimization 2008-11-26 19:45 well, for me, it hard to think without locking 2008-11-26 19:45 the multiple cursors will be arranged in a tree, so that they can share read locks on their first N elements 2008-11-26 19:46 I'm pretty sure we are going to do something like this, but it does not have to be soon 2008-11-26 19:46 for now, we just read depth from btree? 2008-11-26 19:46 this method will eliminate a huge number of radix tree probes 2008-11-26 19:46 lets use unsigned *depth = &btree->root.depth; 2008-11-26 19:46 ok? 2008-11-26 19:47 where we have cached depth as a variable 2008-11-26 19:47 i see 2008-11-26 19:48 um.. looks like a bit tricky 2008-11-26 19:48 file/line number? 2008-11-26 19:48 why we don't just use btree->root.depth 2008-11-26 19:48 shortname? 2008-11-26 19:48 right 2008-11-26 19:49 i see 2008-11-26 19:49 where we already have depth as a variable, we don't have to think hard 2008-11-26 19:50 by the way, one of my stylistic preferences... I don't add "p" to pointer variables a lot of the time 2008-11-26 19:50 for example, unsigned *depth, not unsigned *depthp 2008-11-26 19:50 ah 2008-11-26 19:50 me too 2008-11-26 19:51 I feel that the "p" is obvious, both to the reader and the compiler 2008-11-26 19:51 good :) 2008-11-26 19:52 now... is *depth-- the same as *(depth)-- ? 2008-11-26 19:52 whoops 2008-11-26 19:52 is *depth-- the same as (*depth)-- ? 2008-11-26 19:52 I'm not sure 2008-11-26 19:52 ;) 2008-11-26 19:53 equal precedence, right to left ordering 2008-11-26 19:54 so needs parens 2008-11-26 19:54 --*depth is different? 2008-11-26 19:54 that's fine 2008-11-26 19:54 as long as it's not in an expression 2008-11-26 19:55 well, in my case, i use parens always in that case 2008-11-26 19:55 good idea 2008-11-26 19:56 the use of depth in insert_node looks ok 2008-11-26 19:57 where is there a problem?
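To make the two points above concrete, here is a small sketch. The structs are toy stand-ins that only mirror the `btree->root.depth` shape from the chat; `add_level`, `drop_level` and `cached_copy_is_stale` are hypothetical helpers. In C, `*depth--` parses as `*(depth--)`, which decrements the pointer rather than the value it points to, so the parentheses in `(*depth)--` are required.

```c
#include <assert.h>

struct root { unsigned depth; };	/* toy stand-in, not the tux3 struct */
struct btree { struct root root; };

/* Grow the tree by one level, as a split of the root would. */
void add_level(struct btree *btree)
{
	btree->root.depth++;
}

/* Drop one level through the alias. Without the parentheses this would
 * be *(depth--), which moves the pointer and leaves the depth alone. */
void drop_level(struct btree *btree)
{
	unsigned *depth = &btree->root.depth;
	assert(*depth > 0);
	(*depth)--;
}

/* Returns 1 if a by-value copy of the depth went stale across a split,
 * while the pointer alias kept tracking the live value. */
int cached_copy_is_stale(struct btree *btree)
{
	unsigned cached = btree->root.depth;	/* snapshot, can go stale */
	unsigned *depth = &btree->root.depth;	/* alias, always current */
	add_level(btree);
	return cached != *depth;
}
```

The alias costs nothing over the local copy and removes the need to remember which calls can change the depth.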
2008-11-26 19:57 what problem? 2008-11-26 19:57 cached depth 2008-11-26 19:57 ah 2008-11-26 19:57 make_inode I guess 2008-11-26 19:57 make_inode and save_inode 2008-11-26 19:57 and filemap_extent_io 2008-11-26 19:58 but, I didn't check in root.depth-- case 2008-11-26 19:58 I didn't notice it 2008-11-26 20:00 I don't see a problem in make_inode 2008-11-26 20:00 oh 2008-11-26 20:00 release_path 2008-11-26 20:01 have to update the depth from the tree root there 2008-11-26 20:01 yes 2008-11-26 20:01 just use the tree root 2008-11-26 20:01 *depth = &btree.root.depth is also work 2008-11-26 20:01 I like that more for this function 2008-11-26 20:02 some of the assert code will get shorter 2008-11-26 20:03 same thing for save_inode 2008-11-26 20:03 yes 2008-11-26 20:03 is that all the obvious problems with this? 2008-11-26 20:04 release_cursor() now is a somewhat funny name 2008-11-26 20:04 filemap_extent_io? 2008-11-26 20:05 *depth there too I think 2008-11-26 20:05 and there are some references to root.depth that can be changed to *depth 2008-11-26 20:06 one reference 2008-11-26 20:06 if filemap_extent_io was fixed, callers of split are all 2008-11-26 20:07 well, there is no problem though, caller of split is volume.c:main and btree.c:main 2008-11-26 20:07 I should remove volume.c 2008-11-26 20:07 since we are not going to have subvolumes 2008-11-26 20:07 ah 2008-11-26 20:07 I think subvolumes are useless 2008-11-26 20:07 I could not find any argument for having them 2008-11-26 20:08 I wonder why zfs people like it 2008-11-26 20:08 and btrfs? 
2008-11-26 20:08 yes 2008-11-26 20:08 I can't explain why normally sensible people like to have that bloat 2008-11-26 20:08 I don't know about subvolumes almost 2008-11-26 20:08 I think it's about admnistration 2008-11-26 20:09 but it's the wrong approach 2008-11-26 20:09 it is out of interest for me 2008-11-26 20:09 fine :) 2008-11-26 20:09 maybe, lvm or something is enough 2008-11-26 20:09 it should be 2008-11-26 20:10 lvm admin tools need improving 2008-11-26 20:10 i see 2008-11-26 20:10 and they will be improved... later 2008-11-26 20:10 anyway 2008-11-26 20:10 I think I will remove volume.c right now 2008-11-26 20:10 ok 2008-11-26 20:11 well, so, rest of root.depth changer is tree_chop? 2008-11-26 20:11 I'm grepping depth... 2008-11-26 20:12 volume.c is gone 2008-11-26 20:13 ok 2008-11-26 20:13 tree chop depth usage looks ok 2008-11-26 20:13 the depth is only changed in the final step 2008-11-26 20:13 caller is also ok? 2008-11-26 20:15 tux3.c and tux3fuse.c 2008-11-26 20:15 btw, "release cursor at point of error" rule may be bad 2008-11-26 20:15 yes, callers of tree_chop are ok 2008-11-26 20:16 ah, where does it break? 
2008-11-26 20:16 during grep "depth", 2008-11-26 20:16 btree_leaf_split(), release_cursor(cursor, btree->root.depth); 2008-11-26 20:17 it may be, release_cursor(cursor, btree->root.depth + 1); 2008-11-26 20:17 have to be very careful there 2008-11-26 20:18 not depth + 1 2008-11-26 20:18 oh 2008-11-26 20:18 because we just failed to read the leaf 2008-11-26 20:19 ah 2008-11-26 20:19 be back in a bit 2008-11-26 20:19 that one is worth a comment 2008-11-26 20:19 I think the "release at point of error" rule is good, but please prove me wrong if I am wrong 2008-11-26 20:20 wait a bit 2008-11-26 20:20 it seems obviously right to me: it guarantees the cursor buffers are only released once in the error path 2008-11-26 20:21 btree_leaf_split(), leafbuf = cursor[btree->root.depth].buffer; 2008-11-26 20:21 we brelse cursor[btree->root.depth].buffer 2008-11-26 20:21 release_cursor() is 2008-11-26 20:21 for (int i = 0; i < depth; i++) 2008-11-26 20:21 brelse(cursor[i].buffer); 2008-11-26 20:22 so, we should pass +1? 2008-11-26 20:22 we have to brelse cursor[btree->root.depth].buffer 2008-11-26 20:48 back 2008-11-26 20:48 ok 2008-11-26 20:48 which line? 2008-11-26 20:48 btree_leaf_split 2008-11-26 20:48 yes 2008-11-26 20:49 on the error path? 2008-11-26 20:49 yes, 460 2008-11-26 20:50 yes 2008-11-26 20:50 + 1 2008-11-26 20:50 ok 2008-11-26 20:50 sorry about saying otherwise earlier ;) 2008-11-26 20:50 no problem :) 2008-11-26 20:51 do you change it, or I send patch?
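The off-by-one settled above can be reproduced in a few lines. This is a userspace sketch under stated assumptions: `struct buffer` is a toy reference count, and `brelse_sketch`, `release_cursor_sketch` and `buffers_held` are hypothetical names. Only the loop shape is taken from the chat. With the leaf held at `cursor[depth]`, releasing `depth` entries leaves the leaf buffer pinned, while `depth + 1` releases all of them.

```c
#include <assert.h>

struct buffer { int count; };			/* toy reference count */
struct cursor_level { struct buffer *buffer; };

/* Drop one reference, standing in for the real brelse. */
void brelse_sketch(struct buffer *buffer)
{
	assert(buffer->count > 0);
	buffer->count--;
}

/* Same loop shape as quoted in the chat: releases slots 0 .. depth - 1. */
void release_cursor_sketch(struct cursor_level *cursor, int depth)
{
	for (int i = 0; i < depth; i++)
		brelse_sketch(cursor[i].buffer);
}

/* Count buffers that still have a reference outstanding. */
int buffers_held(const struct buffer *buffers, int n)
{
	int held = 0;
	for (int i = 0; i < n; i++)
		held += buffers[i].count > 0;
	return held;
}
```

For a tree of depth 2 the cursor has three live slots (two index nodes and the leaf), so the error path has to pass `depth + 1` once the leaf has been read into the cursor.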
2008-11-26 20:52 I'll change it 2008-11-26 20:52 just a sec 2008-11-26 20:53 pushed 2008-11-26 20:53 in practice, I do not think that buffer alloc can fail unless there is a leak 2008-11-26 20:54 yes 2008-11-26 20:54 when kernel is really oom, it does not matter whether we stay stuck in grow_buffers, or return to caller 2008-11-26 20:54 in that case, probably kernel will panic 2008-11-26 20:54 because caller will just be stuck too 2008-11-26 20:55 in the handle code I'm strongly tempted to define blockget as always succeeding 2008-11-26 20:55 well 2008-11-26 20:55 it can fail if the filesystem index is corrupt 2008-11-26 20:56 and then we need to know the actual error code 2008-11-26 20:56 returning null is inadequate 2008-11-26 20:56 so I think on the handle api, we should fix that by using ERR_PTR instead of returning null on error 2008-11-26 20:57 blockget? 2008-11-26 20:57 and blockread? 2008-11-26 20:57 yes 2008-11-26 20:57 yes, it's good 2008-11-26 20:58 ok, I will write a post on delta staging now 2008-11-26 20:58 this is the key operation in atomic commit 2008-11-26 20:59 ok, I'll prepare to push some patches include dwalk stuff 2008-11-26 20:59 sounds good 2008-11-26 21:56 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-26 22:39 hi 2008-11-26 22:39 could you check 2008-11-26 22:39 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-11-26 22:39 ? 2008-11-26 22:40 with this, dwalk_back will be fixed, and it seems we can "cp -a /bin/* /mnt/tux3" 2008-11-26 22:41 however, truncate is not implemented, and it looks like readdir has problem 2008-11-26 23:43 hirofumi, sorry I wasn't looking at the channel 2008-11-26 23:43 that's a big deal 2008-11-26 23:44 what is the readdir problem? 
2008-11-26 23:44 I'm not checking it yet 2008-11-26 23:44 cp -a /bin/* .; ls 2008-11-26 23:45 I'll try it 2008-11-26 23:45 it seems it is missing some entries 2008-11-26 23:45 hmm, it's the ext2 readdir, basically 2008-11-26 23:45 it can ->lookup 2008-11-26 23:45 ok, bug hunt 2008-11-26 23:45 so, it may be around of filldir 2008-11-26 23:46 meanwhile... I think I have a solution to the d_delete that will allow us to do the defered dirops 2008-11-26 23:46 oh 2008-11-26 23:46 http://lxr.linux.no/linux+v2.6.27/fs/dcache.c#L1532 <- the problem is here 2008-11-26 23:46 d_delete wants to unhash any dentry that has a use count > 1 2008-11-26 23:47 yes 2008-11-26 23:47 I don't know how that can happen, but we need it not to do that 2008-11-26 23:47 with it, dentry become unvisible from lookup 2008-11-26 23:47 so that the negative dentry on a delete stays around until we actually do the delete 2008-11-26 23:48 right, be we need negative dentries to stay 2008-11-26 23:48 so... we set D_UNHASHED in the dentry flags 2008-11-26 23:48 telling vfs the dentry is unhashed, when it is actually still on the hash list 2008-11-26 23:49 now... I _think_ that always works 2008-11-26 23:49 because the dcache doesn't really use this flag for much 2008-11-26 23:49 I'm checking now 2008-11-26 23:49 this would be a good one for maze to look into 2008-11-26 23:50 -!- RazvanM(~RazvanM@96.234.237.148) has joined #tux3 2008-11-26 23:50 I had a problem with d_invalidate() in fatfs... 2008-11-26 23:51 checking email... 2008-11-26 23:56 ah, maybe, my case is unrelated 2008-11-26 23:56 - int err = -ENOENT, depth = btree->sb->itable.root.depth; 2008-11-26 23:56 + int err = -ENOENT, depth = btree->root.depth; <- this must have been a bug 2008-11-26 23:56 oh 2008-11-26 23:57 btree->sb->itable.root.dept is right one? 2008-11-26 23:57 nice job on dwak_back. 
if we want, we can use the much simpler dwalk_back I showed yesterday 2008-11-26 23:58 Your version can actually walk the dleaf in reverse 2008-11-26 23:58 yes 2008-11-26 23:58 which turns out to be unnecessary 2008-11-26 23:58 I don't know if we ever do it 2008-11-26 23:58 but 2008-11-26 23:58 we can call dwalk_next again 2008-11-26 23:59 yes, we only ever dwalk_back once in a row, and it is always after a dwalk_next 2008-11-26 23:59 so we have a lovely primitive with no actual use ;) 2008-11-26 23:59 we'll keep it 2008-11-26 23:59 yes 2008-11-26 23:59 and maybe a use will show up 2008-11-26 23:59 dleaf test is using it 2008-11-27 00:00 well 2008-11-27 00:00 btree->sb->itable.root.dept is right one? 2008-11-27 00:02 it doesn't sound like it 2008-11-27 00:02 ok 2008-11-27 00:03 let me check again 2008-11-27 00:03 yes, the original version used the wrong depth 2008-11-27 00:03 you made it right 2008-11-27 00:03 that saved a bug hunt 2008-11-27 00:04 ok 2008-11-27 00:04 why move dwalk_check to the bottom of dleaf.c? 2008-11-27 00:04 I added comment for dwalk after it 2008-11-27 00:05 to add the above of dwalk stuff 2008-11-27 00:05 it's fine 2008-11-27 00:06 dwalk_check should probably be before _dump and _chop 2008-11-27 00:06 so we can use it in those if we want 2008-11-27 00:06 later... 2008-11-27 00:07 it's fine as it is 2008-11-27 00:07 dwalk_check? 2008-11-27 00:08 dleaf_check? 2008-11-27 00:09 right :) 2008-11-27 00:09 don't mind me :) 2008-11-27 00:09 very nice diagram 2008-11-27 00:09 :) 2008-11-27 00:09 thanks 2008-11-27 00:10 you found a release_cursor bug in the error path 2008-11-27 00:10 yes 2008-11-27 00:10 is it filemap_extent_io? 2008-11-27 00:11 yes 2008-11-27 00:12 so, I thought caller may be release cusor in both case of success and error 2008-11-27 00:16 d_delete wants to unhash any dentry that has a use count > 1 2008-11-27 00:16 if count == 1, it frees dentry immidately? 
2008-11-27 00:16 it calls our i_put function 2008-11-27 00:17 d_iput 2008-11-27 00:17 so, we avoid it? 2008-11-27 00:17 we want it to leave the dentry on the dentry hash 2008-11-27 00:18 as a negative dentry 2008-11-27 00:18 however... I just realized if we force it to do that, we may never get the d_iput call 2008-11-27 00:18 so I need to think more 2008-11-27 00:18 i see 2008-11-27 00:21 should we make our hidden inodes S_IFREG? 2008-11-27 00:21 I'm not sure, currently 0 is also working 2008-11-27 00:22 we will break atable with i_size = big 2008-11-27 00:22 so maybe it was a bad idea to depend on i_size there 2008-11-27 00:22 it's just for tx3_create_entry 2008-11-27 00:23 I'm thinking it work 2008-11-27 00:23 we call read/writepage directly from blockget/read 2008-11-27 00:24 ah 2008-11-27 00:24 and readpage won't read page i_size 2008-11-27 00:24 won't read past i_size 2008-11-27 00:24 so that's why you did that 2008-11-27 00:24 yes 2008-11-27 00:25 at least, block_read_full_page won't 2008-11-27 00:25 but create_entry of atable changes i_size? 2008-11-27 00:25 yes 2008-11-27 00:25 we don't have to fix this until we start testing xattrs 2008-11-27 00:25 yes 2008-11-27 00:25 and the xattr interface isn't enabled yet 2008-11-27 00:26 but it's a bug, perhaps a design bug of mine 2008-11-27 00:26 ah 2008-11-27 00:26 we can read page for atable always 2008-11-27 00:26 putting atom tables in the same file as the atom entries 2008-11-27 00:26 because i_size is big 2008-11-27 00:27 yes, but creating a dirent it broken now 2008-11-27 00:27 remember it's a bug 2008-11-27 00:27 and fix later 2008-11-27 00:27 ok 2008-11-27 00:29 um.. 
it may work 2008-11-27 00:29 blockread() returns unmapped buffer 2008-11-27 00:29 and tux_create_entry will dirty it 2008-11-27 00:30 and writepage will call get_block to map buffer 2008-11-27 00:31 it uses i_size to know what index to use for the blockread 2008-11-27 00:31 and also to know when to stop when searching for a dirent 2008-11-27 00:31 if zeroed buffer, it will stop? 2008-11-27 00:32 with an error 2008-11-27 00:32 is_deleted() entry 2008-11-27 00:32 and goto create 2008-11-27 00:33 well... :) 2008-11-27 00:33 ah 2008-11-27 00:33 it's a lot of work to call get_block just to get a zero to stop on 2008-11-27 00:34 we call "zero-length directory entry" part 2008-11-27 00:34 ? 2008-11-27 00:35 it seems not work, sorry 2008-11-27 00:36 it's probably a bad idea to store the big tables in the same inode 2008-11-27 00:36 we will just have another inode for that 2008-11-27 00:36 which will make the radix tree for the xattr lookup much smaller 2008-11-27 00:36 anyway 2008-11-27 00:36 i see 2008-11-27 00:37 your big patches are "allocation support" and tux3_create... reading 2008-11-27 00:38 that's a lot of work in one day! 2008-11-27 00:38 it's not one day 2008-11-27 00:38 maybe, 2 or 3 days 2008-11-27 00:39 still a lot of work 2008-11-27 00:39 tux_create_inode and tux_iget are similar to tux3.c I think 2008-11-27 00:39 yes 2008-11-27 00:39 I'm thinking we can share those code more or less 2008-11-27 00:41 probably, tux_iattr setup will be shared 2008-11-27 00:41 yes, that would be nice 2008-11-27 00:41 we're doing very well with sharing code between userspace and kernel 2008-11-27 00:42 tux3_create is all new 2008-11-27 00:42 taken from ext2? 2008-11-27 00:42 maybe, yes 2008-11-27 00:43 inum 32 bit overflow... 
2008-11-27 00:43 in tux_create_entry 2008-11-27 00:43 can only happen if a volume is mounted on a 64 bit machine, we must add that limit 2008-11-27 00:44 probably 2008-11-27 00:44 then set a flag in the superblock to make the filesystem only mountable on a 64 bit system 2008-11-27 00:44 probably, yes 2008-11-27 00:44 and we probably will need a mount option or something to disable 64 bit operation on a 64 bit system 2008-11-27 00:44 I don't know 2008-11-27 00:45 tough one 2008-11-27 00:45 ah 2008-11-27 00:45 maybe we should only allow inum about 2**32 if the volume is bigger than 16 TB 2008-11-27 00:45 yes 2008-11-27 00:45 inum above I meant 2008-11-27 00:45 I'll write it in the bugs list 2008-11-27 00:45 ok 2008-11-27 00:46 well, in kernel, it's not problem 2008-11-27 00:46 we have iget5_locked for it 2008-11-27 00:47 but when exporting it to 32bit userland, it apears problem 2008-11-27 00:48 I'm not checking yet reiserfs or ntfs or something how is it doing 2008-11-27 00:49 we still have seg[1000] on the stack in kernel? 2008-11-27 00:50 I think it's seg[10] 2008-11-27 00:50 ah, and it's seg[1000] in my code which isn't used 2008-11-27 00:50 yes, in kernel, it's seg[10] 2008-11-27 00:51 it looks great 2008-11-27 00:51 thanks 2008-11-27 00:51 pull to the public repo 2008-11-27 00:52 ok, thanks 2008-11-27 00:53 what are the main issues remaining that you see? 2008-11-27 00:53 besides atomic commit 2008-11-27 00:54 for now, bug hunting 2008-11-27 00:54 then, I'm going to add missing handlers 2008-11-27 00:55 ok, I must concentrate on the block handle proposal and atomic commit 2008-11-27 00:55 ok 2008-11-27 00:56 I think it's good 2008-11-27 00:56 um, you mean you will not see email and irc? 2008-11-27 00:56 I will be here 2008-11-27 00:56 and reading email 2008-11-27 00:57 ok :) 2008-11-27 00:57 a student asked for a task 2008-11-27 00:58 yesterday? 2008-11-27 00:58 yes 2008-11-27 00:58 spsneo? 
2008-11-27 00:58 yes 2008-11-27 00:58 yes 2008-11-27 00:58 one nice one to get started I thought of, is the endian conversion for bitmaps 2008-11-27 00:59 just a matter of making the bitops work in the other direction 2008-11-27 00:59 i see 2008-11-27 00:59 so we have be_get_bit 2008-11-27 00:59 and be_set_bit 2008-11-27 01:00 but, little engian and big endian is same in a byte? 2008-11-27 01:00 and the tricky set_big_range 2008-11-27 01:00 s/big/bit 2008-11-27 01:00 it's not the same if you want to be able to fetch four bytes from a bitmap with a word load 2008-11-27 01:01 and use bitops on the word 2008-11-27 01:01 ah, multiple byte bitops 2008-11-27 01:01 which I think we want to do eventually 2008-11-27 01:01 yes 2008-11-27 01:01 it's a big optimization 2008-11-27 01:01 i see 2008-11-27 01:01 8 bytes at a time is an optimization even on 32 bit 2008-11-27 01:02 at least, on x86 maybe yes 2008-11-27 01:03 changing set_bits and friends will be a little messy 2008-11-27 01:04 yes 2008-11-27 01:04 if you think of a better start project please suggest it 2008-11-27 01:04 this is something you and I can do in about 15 minutes ;) 2008-11-27 01:05 ok :) 2008-11-27 01:05 if I meat him here 2008-11-27 01:05 anyway he will be busy installing everything for a little while 2008-11-27 01:06 I will prepare my uml root filesystem and make it available 2008-11-27 01:06 and a nice project would be setting up a 64 bit root filesystem 2008-11-27 01:06 yes 2008-11-27 01:06 what do you use? 2008-11-27 01:07 I'm using kvm 2008-11-27 01:08 still need a root filesystem, don't you? 2008-11-27 01:08 another partition on your disk? 
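A rough sketch of the bitmap starter task described above. The names `be_get_bit` and `be_set_bit` come from the chat, but their exact semantics here are an assumption: bit 0 is taken to be the most significant bit of byte 0, so that a big-endian word load sees the same numbering. `be32_load` and `be_word_get_bit` are hypothetical helpers added to show the word-at-a-time case.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed convention: bit 0 is the top bit of byte 0. */
int be_get_bit(const uint8_t *bitmap, unsigned bit)
{
	return (bitmap[bit >> 3] >> (7 - (bit & 7))) & 1;
}

void be_set_bit(uint8_t *bitmap, unsigned bit)
{
	bitmap[bit >> 3] |= 0x80 >> (bit & 7);
}

/* Fetch four bitmap bytes as one big-endian word, on any host. */
uint32_t be32_load(const uint8_t *p)
{
	return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
	       (uint32_t)p[2] << 8 | (uint32_t)p[3];
}

/* Under the assumed numbering, bitmap bit n (0..31) is word bit 31 - n. */
int be_word_get_bit(uint32_t word, unsigned bit)
{
	assert(bit < 32);
	return (word >> (31 - bit)) & 1;
}
```

The point of the convention is that the byte-wise and word-wise views agree on which bit is which, so a range operation can switch to whole words in the middle of a run.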
2008-11-27 01:09 yes, well, I can just install distro from cd to normal file 2008-11-27 01:09 yes 2008-11-27 01:09 my root filesystem is 100 MB :) 2008-11-27 01:09 I'm using normal file for kvm 2008-11-27 01:09 oh 2008-11-27 01:09 I'm using 4g or 10g 2008-11-27 01:10 right, makes it a little hard to back it up with cp 2008-11-27 01:10 kvm/qemu has qcow2 feature 2008-11-27 01:10 ah 2008-11-27 01:10 I am so old fashioned 2008-11-27 01:11 well, almost I don't need to recover it 2008-11-27 01:12 hey all 2008-11-27 01:12 ctrl-c and rerun seems work 2008-11-27 01:12 hi 2008-11-27 01:12 hi pranith 2008-11-27 01:12 I know 2008-11-27 01:13 tux3 fsck, a nice starting project 2008-11-27 01:13 it doesn't have to check very much ;) 2008-11-27 01:13 ah, it's good 2008-11-27 01:13 flips: i see you got some projects for us :D 2008-11-27 01:13 would you like to do tux3 fsck? 2008-11-27 01:14 flips: sure ... 2008-11-27 01:14 just tell me how to get started... 2008-11-27 01:14 and hope you wont mind if i bug u each and every time im stuck 2008-11-27 01:14 first you write a tux3_dump that dumps the whole filesystem as one big tree 2008-11-27 01:14 :) 2008-11-27 01:14 you can also bug hirofumi ;) 2008-11-27 01:15 hmm, ok :) 2008-11-27 01:15 well, I can also bug you :) 2008-11-27 01:15 of course 2008-11-27 01:16 one big tree? 2008-11-27 01:16 yes 2008-11-27 01:16 it's one big tree 2008-11-27 01:16 in memory? 2008-11-27 01:17 just dump it to console 2008-11-27 01:17 in ascii 2008-11-27 01:17 ok 2008-11-27 01:17 using the xxx_dump functions 2008-11-27 01:17 then you add xxx_check functions to every node 2008-11-27 01:18 you can write it nonrecursively, using the btree advance functions 2008-11-27 01:18 hmm 2008-11-27 01:18 and example is btree_dump 2008-11-27 01:18 ACTION looking thru that 2008-11-27 01:18 most of the code you need to dump things is already written 2008-11-27 01:19 flips: ok 2008-11-27 01:20 you know how tux3 is organized, right? 
2008-11-27 01:20 it's basically one big tree that is an inode table 2008-11-27 01:20 yeah 2008-11-27 01:20 i saw the picture from hirofumi 2008-11-27 01:21 it was pretty clear... 2008-11-27 01:21 a picture is worth a thousand words 2008-11-27 01:21 yup.. thnks to hirofumi 2008-11-27 01:21 :) 2008-11-27 01:22 you can take advance and turn it into advance_and_dump that dumps each of the index nodes it loads 2008-11-27 01:23 then you change the dumps to checks 2008-11-27 01:23 write it in a new file? 2008-11-27 01:23 dump.c? 2008-11-27 01:24 fsck.c ? 2008-11-27 01:24 sure 2008-11-27 01:24 fsck.c is god 2008-11-27 01:24 good* 2008-11-27 01:25 can I change my mind? 2008-11-27 01:25 check.c 2008-11-27 01:25 we know it's a filesystem :) 2008-11-27 01:26 :) 2008-11-27 01:26 then we can have tux3 check 2008-11-27 01:26 and tux3 dump 2008-11-27 01:26 ok, will do 2008-11-27 01:27 if your dumper takes a struct of operations, like the btree ops, then you can pass one struct to make it check, and another to make it dump 2008-11-27 01:27 hmm, nice idea.. 2008-11-27 01:28 hg clone static-http is not working for me... only hg clone http 2008-11-27 01:28 if the second one works, you're ok 2008-11-27 01:29 so my post was wrong? 2008-11-27 01:29 hmm, not sure if its only me 2008-11-27 01:29 cd tux3, i have doc and user dirs 2008-11-27 01:29 is this the current one? 2008-11-27 01:30 I'll update the post 2008-11-27 01:30 then there is kernel dir in user dir 2008-11-27 01:31 yes 2008-11-27 01:33 so it should be cd tux3/user && make 2008-11-27 01:33 make tests 2008-11-27 01:33 yes 2008-11-27 01:33 :) 2008-11-27 01:34 i see lots of leakage... 2008-11-27 01:34 from make tests 2008-11-27 01:35 trivial stuff mostly 2008-11-27 01:35 ok.. 
2008-11-27 01:35 because of returning from main instead of exit() 2008-11-27 01:36 that should be cleaned up 2008-11-27 01:36 a simple change 2008-11-27 01:36 make valgrind happy, and maybe find a real leak 2008-11-27 02:30 changed return's to exit's 2008-11-27 02:30 how do i take a diff using hg? 2008-11-27 02:30 hg diff ;) 2008-11-27 02:30 hg diff -up? 2008-11-27 02:30 hg is pretty obvious 2008-11-27 02:30 the default is good 2008-11-27 02:30 cd .. && hg diff 2008-11-27 02:31 if you want it to have the traditional -p1 2008-11-27 02:32 did it fix the allocation complaints from valgrind? 2008-11-27 02:33 yup 2008-11-27 02:33 :) 2008-11-27 02:33 mostly did .. 2008-11-27 02:34 i checked for inode 2008-11-27 02:34 checking for others now... 2008-11-27 02:34 the remaining ones can be investigated 2008-11-27 02:34 ACTION slaps flips around with an alarm clock 2008-11-27 02:34 oh right 2008-11-27 02:34 ;) 2008-11-27 02:34 get a failed assertion in dwalk_back 2008-11-27 02:35 in which test? 2008-11-27 02:35 walk->extent > walk->exbase 2008-11-27 02:35 dleaf 2008-11-27 02:35 of course ;) 2008-11-27 02:35 post the details to the mailing list and let mlankhorst fix it ;) 2008-11-27 02:36 Hold on, what's a pointer again? 2008-11-27 02:36 mlankhorst: hello.. 2008-11-27 02:36 it's one of those curvy black things 2008-11-27 03:06 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-27 03:12 mlankhorst, you got off easy this time 2008-11-27 03:33 Of course 2008-11-27 03:35 -!- spsneo(~chatzilla@125.20.8.166) has joined #tux3 2008-11-27 03:35 pranith: are u there? 
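The "one walker, two structs of operations" idea from the fsck discussion earlier in this session could look roughly like this. Everything here is hypothetical: `struct node` is a toy binary tree rather than a tux3 btree, and the walk is recursive for brevity, whereas the chat suggests a nonrecursive walk built on the btree advance functions.

```c
#include <assert.h>
#include <stdio.h>

struct node { int key; struct node *child[2]; };	/* toy tree */

struct walk_ops {
	int (*visit)(struct node *node, int level, void *context);
};

/* Visit every node top down; stop at the first nonzero return. */
int walk(struct node *node, int level, struct walk_ops *ops, void *context)
{
	int err;

	if (!node)
		return 0;
	if ((err = ops->visit(node, level, context)))
		return err;
	for (int i = 0; i < 2; i++)
		if ((err = walk(node->child[i], level + 1, ops, context)))
			return err;
	return 0;
}

/* Dump flavor: print the key, indented by level. */
int dump_visit(struct node *node, int level, void *context)
{
	(void)context;
	printf("%*skey %d\n", 2 * level, "", node->key);
	return 0;
}

/* Check flavor: left key < node key < right key, counting nodes seen. */
int check_visit(struct node *node, int level, void *context)
{
	int *checked = context;

	(void)level;
	(*checked)++;
	if (node->child[0] && node->child[0]->key >= node->key)
		return -1;
	if (node->child[1] && node->child[1]->key <= node->key)
		return -1;
	return 0;
}

struct walk_ops dump_ops = { dump_visit };
struct walk_ops check_ops = { check_visit };
```

Passing `dump_ops` gives the dump behavior and passing `check_ops` gives the check behavior, with the traversal code shared between them.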
2008-11-27 03:47 -!- pgquiles__(~pgquiles@222.Red-88-0-139.dynamicIP.rima-tde.net) has joined #tux3 2008-11-27 04:02 -!- spsneo(~chatzilla@125.20.8.166) has joined #tux3 2008-11-27 07:59 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-27 08:57 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-27 09:12 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-27 09:36 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-27 09:55 -!- pgquiles_(~pgquiles@222.Red-88-0-139.dynamicIP.rima-tde.net) has joined #tux3 2008-11-27 10:09 -!- pranihome(7aa246ac@webchat.mibbit.com) has joined #tux3 2008-11-27 10:38 hrm dleaftest is broken 2008-11-27 10:40 current hg? 2008-11-27 10:41 dleaftest seems to work 2008-11-27 10:41 i thought so, let me double check 2008-11-27 10:43 ah i had another patch in my tree, clena tree works 2008-11-27 10:43 :) 2008-11-27 10:45 dwalk_pack is a bit scary looking 2008-11-27 10:45 yes 2008-11-27 10:45 all dwalk stuff is complex for me 2008-11-27 10:47 I think we may have to clean it up 2008-11-27 10:58 the probe stuff is ok to follow 2008-11-27 11:03 dwalk_probe? 2008-11-27 11:04 it is also unclear for me 2008-11-27 11:16 ok. it seems some bugs was fixed 2008-11-27 11:17 so, with next patches, we can 2008-11-27 11:17 $ cp -rL /bin/* . 2008-11-27 11:17 $ diff -urNp /bin . 
2008-11-27 11:17 I'll go to bed 2008-11-27 11:33 awesome 2008-11-27 11:43 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-27 11:45 shapor, this version of dwalk_back also works, since we only need dwalk_back to be able to "unread" the last extent, not walk backwards all the way to the beginning of the dleaf: 2008-11-27 11:45 void dwalk_back(struct dwalk *walk) 2008-11-27 11:45 { 2008-11-27 11:45 trace("back one entry"); 2008-11-27 11:45 if (dleaf_groups(walk->leaf)) { 2008-11-27 11:45 assert(walk->extent > walk->exbase); 2008-11-27 11:45 --walk->extent; 2008-11-27 11:45 } 2008-11-27 11:45 } 2008-11-27 11:46 shapor, however we thought we'd keep the fancy version around after hirofumi went to the trouble of making it work 2008-11-27 11:48 cp -r :) 2008-11-27 11:48 beyond awesome 2008-11-27 12:06 flips: will work on fsck tomorrow night... 2008-11-27 12:06 hope you will be around for me to bug 2008-11-27 12:06 :D 2008-11-27 12:06 cool 2008-11-27 12:06 I will 2008-11-27 12:06 thnx :) 2008-11-27 12:06 we can make the tux3 U session "how to write fsck" :) 2008-11-27 12:06 will try to complete it this weekend 2008-11-27 12:07 hmm 2008-11-27 12:07 sure, next tuesday u do that 2008-11-27 12:07 :D 2008-11-27 12:07 chalo then 2008-11-27 12:07 ACTION sleeping 2008-11-27 12:07 bye 2008-11-27 12:07 night 2008-11-27 14:30 folks 2008-11-27 15:42 second attempt on deferred unlink is posted 2008-11-27 15:42 bhuey, please try to find a hole in the analysis 2008-11-27 15:48 -!- ChanServ changed mode/#tux3 -> +o flips 2008-11-27 15:49 -!- flips changed topic to "http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 pm Pacific Time ~ Next session: How do debug Tux3!" 2008-11-27 15:49 -!- flips changed topic to "http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 pm Pacific Time ~ Next session: deferred delete, does it really work?" 
2008-11-27 15:49 -!- ChanServ changed mode/#tux3 -> -o flips 2008-11-27 15:49 -!- ChanServ changed mode/#tux3 -> +o flips 2008-11-27 15:50 -!- flips changed topic to "http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 pm Pacific Time ~ Next session: deferred unlink, does it really work?" 2008-11-27 15:50 -!- ChanServ changed mode/#tux3 -> -o flips 2008-11-27 15:50 my accuracy is not high today ;) 2008-11-27 18:23 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-27 19:32 hirofumi, around? 2008-11-27 19:39 flips: I'll look for the post, about to start eating 2008-11-27 19:46 hi 2008-11-27 19:46 hi, should I pull? 2008-11-27 19:47 I'm not write comment for those yet 2008-11-27 19:47 :) 2008-11-27 19:47 and I'm going to clean it up probably 2008-11-27 19:48 I'm trying out the deferred delete idea using ext2 2008-11-27 19:48 ah 2008-11-27 19:48 deferred unlink 2008-11-27 19:48 yes 2008-11-27 19:48 I hope my latest post made a little more sense 2008-11-27 19:49 but, I'm thinking why do we need d_hide 2008-11-27 19:50 I think we can just take refcount of dentry for deferred 2008-11-27 19:51 but then how does our filesystem ever get called for the unlink? 2008-11-27 19:51 it has to get called in d_delete 2008-11-27 19:52 i_op->unlink? 2008-11-27 19:53 yes 2008-11-27 19:53 then we must avoide d_delete 2008-11-27 19:54 in ->unlink, I'm thinking we can dget for derrered 2008-11-27 19:54 um... 
now, dentry should be ->d_count > 1 2008-11-27 19:54 yes, and let me see if we can avoid d_delete 2008-11-27 19:54 becase d_delete unconditionally unhashes any dentry with count > 1 2008-11-27 19:55 yes 2008-11-27 19:55 it is same with d_hide case 2008-11-27 19:56 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L2216 <- vfs_unlink unconditionally calls d_delete 2008-11-27 19:57 ah, it is || 2008-11-27 19:57 right 2008-11-27 19:57 d_hide is called if d_count > 1 2008-11-27 19:57 yes 2008-11-27 19:58 maybe it should also be called if d_count == 1 2008-11-27 20:01 int ext2_sync_dir(struct file *file, struct dentry *dir_dentry, int datasync) 2008-11-27 20:01 { 2008-11-27 20:01 struct ext2_inode_info *dir = EXT2_I(dir_dentry->d_inode); 2008-11-27 20:01 while (!list_empty(&dir->defer)) { 2008-11-27 20:01 struct ext2_inode_info *inode = list_entry(&dir->defer, struct ext2_inode_info, defer); 2008-11-27 20:01 while (!list_empty(&inode->defer)) { 2008-11-27 20:01 dput(list_entry(&dir->defer, struct dentry, d_alias)); 2008-11-27 20:01 list_del(&inode->defer); 2008-11-27 20:01 } 2008-11-27 20:01 list_del(&dir->defer); 2008-11-27 20:01 } 2008-11-27 20:01 return ext2_sync_file(NULL, dir_dentry, datasync); 2008-11-27 20:01 } 2008-11-27 20:02 the d_alias list isn't used for a negative dentry 2008-11-27 20:02 so I use it to link the deferred unlink dentries to the inode 2008-11-27 20:02 with is ugly, because normally the list is a list of inodes 2008-11-27 20:02 not a list of dentries 2008-11-27 20:03 which is ugly I meant 2008-11-27 20:03 in here negative dentry is hashed? 
2008-11-27 20:03 yes 2008-11-27 20:04 there is nothing wrong with letting the negative dentry stay forever 2008-11-27 20:04 if so, on create, we reuse the dentry 2008-11-27 20:04 on dcache shrink it will get recovered 2008-11-27 20:04 right, and then the list will be messed up 2008-11-27 20:04 yes 2008-11-27 20:04 problem :) 2008-11-27 20:04 ah, ok :) 2008-11-27 20:05 _but_ 2008-11-27 20:05 I think we get a d_delete call 2008-11-27 20:05 ->d_delete 2008-11-27 20:05 before it gets reused 2008-11-27 20:05 maybe 2008-11-27 20:07 in final dput 2008-11-27 20:07 I'm checking d_move_locked now 2008-11-27 20:08 probably, vfs can reuse before ->d_delete 2008-11-27 20:09 not being able to link through the dentry would make a messier patch 2008-11-27 20:09 but still doable 2008-11-27 20:10 ah, yes 2008-11-27 20:10 ok, we are officially in the ''deferred unlink, does it really work?" session 2008-11-27 20:10 anybody here? 2008-11-27 20:11 ACTION thinks all the yankees are having turkey 2008-11-27 20:14 in fact I have just been called to dinner 2008-11-27 20:14 ok 2008-11-27 20:14 later... 2008-11-27 20:46 back 2008-11-27 20:50 ok 2008-11-27 20:50 why do we want to that unhash dentry 2008-11-27 20:51 why do we want to avoid that unhash dentry 2008-11-27 20:51 ? 2008-11-27 20:51 because we want the negative dentry to make it seem like the dirent has been removed, before it actually has been 2008-11-27 20:52 but I think the vfs assumes that any hashed negative dentry has zero use count 2008-11-27 20:52 a problem for this idea 2008-11-27 20:53 we want the pretend that the dirent was deleted before we actually delete it 2008-11-27 20:53 on way to do that is to keep a list of deleted dentries in our fs 2008-11-27 20:53 unhash is also work as removed entry? 
2008-11-27 20:53 and we check the list when lookup is called 2008-11-27 20:54 if the dentry is unhashed then vfs will call the filesystem to see if it already exists 2008-11-27 20:55 the other was is to make the dentry negative 2008-11-27 20:55 the other way 2008-11-27 20:55 ah, we want to avoid ->lookup 2008-11-27 20:55 we would like to 2008-11-27 20:55 i see 2008-11-27 20:56 and we would like to avoid having to keep our own list of unlinked-but-dirent-still-exists dentries 2008-11-27 20:57 ok 2008-11-27 20:59 434 if (dentry && dentry->d_op && dentry->d_op->d_revalidate) 2008-11-27 20:59 435 dentry = do_revalidate(dentry, nd); <- interesting 2008-11-27 21:00 it is most bad handler of dcache stuff 2008-11-27 21:00 gives the most problems? 2008-11-27 21:02 if it return 0, vfs try to d_drop dentry even if it is not negative 2008-11-27 21:03 and that dentry can be referenced dentry from user or mount-point or something 2008-11-27 21:03 so the current scheme is not perfect 2008-11-27 21:03 yes 2008-11-27 21:03 it really should wait_on_* for the d_count to go to zero 2008-11-27 21:04 d_revalidate? 
2008-11-27 21:04 caller of d_revalidate 2008-11-27 21:04 cached_lookup 2008-11-27 21:05 I'm not sure 2008-11-27 21:05 and d_revalidate usage is unclear 2008-11-27 21:05 yes 2008-11-27 21:05 all of dcache is unclear 2008-11-27 21:05 and mostly undocumented 2008-11-27 21:05 :) 2008-11-27 21:05 and uncommented 2008-11-27 21:05 dcache is a "linux exclusive" 2008-11-27 21:06 meaning it is a matter of pride not to document it ;) 2008-11-27 21:06 :) 2008-11-27 21:08 just an idea 2008-11-27 21:08 we can allocate new negative dentry for deferred unlink 2008-11-27 21:08 might work 2008-11-27 21:08 and let the old one be unhashed 2008-11-27 21:09 yes 2008-11-27 21:10 well 2008-11-27 21:10 we create referenced negative dentry to avoid for deferred 2008-11-27 21:11 another way is, d_revalidate could let us take the dentry off the defer list and drop the reference count to zero, but we need to remember that, even though the dirent already exists, it has the wrong inode 2008-11-27 21:11 what do we do for rename? 2008-11-27 21:11 1) rename of negative dentry is not possible 2008-11-27 21:12 2) when a move overwrites a negative dentry it will try to reuse it 2008-11-27 21:12 2) when a move overwrites a non-negative dentry it will unlink it and reuse it 2008-11-27 21:13 3) that is 2008-11-27 21:13 it's messy 2008-11-27 21:13 let me think about it some more :) 2008-11-27 21:13 yes :) 2008-11-27 21:13 that is, let me think about the whole problem of deferred unlink 2008-11-27 21:14 our hit count on google is rising 2008-11-27 21:15 275,000 today 2008-11-27 21:15 well, I think I understand what the problems are 2008-11-27 21:15 I'm slowly starting to understand 2008-11-27 21:15 with your help 2008-11-27 21:16 if so, I'm happy 2008-11-27 21:18 if I'm helping you 2008-11-27 21:22 I'd like to push current patches, before I break something by new patches 2008-11-27 21:22 could you check it?
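One option raised in the session above was for the filesystem to keep its own list of unlinked-but-dirent-still-exists names and check that list when lookup is called (the bookkeeping the participants said they would rather avoid). The following is only a rough userspace model of that idea, not Tux3 or VFS code: `toy_dir`, `toy_unlink`, `toy_lookup` and `toy_sync` are hypothetical names invented for illustration, and real dentries, refcounts and locking are left out entirely.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_NAMES 16

/* Hypothetical toy directory: names "on disk" plus names whose unlink is deferred. */
struct toy_dir {
	const char *dirents[MAX_NAMES];   /* dirents still present in the backing store */
	int dirent_count;
	const char *deferred[MAX_NAMES];  /* unlinked, but the dirent is not removed yet */
	int deferred_count;
};

/* Return the index of name in names[0..count), or -1 if absent. */
static int find(const char *const *names, int count, const char *name)
{
	for (int i = 0; i < count; i++)
		if (!strcmp(names[i], name))
			return i;
	return -1;
}

/* Unlink only records the name; the backing dirent is left alone until sync. */
bool toy_unlink(struct toy_dir *dir, const char *name)
{
	if (find(dir->dirents, dir->dirent_count, name) < 0)
		return false;   /* no such entry */
	if (find(dir->deferred, dir->deferred_count, name) >= 0)
		return false;   /* already unlinked */
	if (dir->deferred_count == MAX_NAMES)
		return false;   /* toy limit reached */
	dir->deferred[dir->deferred_count++] = name;
	return true;
}

/* Lookup must consult the deferred list first, so the name appears gone. */
bool toy_lookup(const struct toy_dir *dir, const char *name)
{
	if (find(dir->deferred, dir->deferred_count, name) >= 0)
		return false;
	return find(dir->dirents, dir->dirent_count, name) >= 0;
}

/* Sync applies every deferred unlink to the backing dirents; returns how many. */
int toy_sync(struct toy_dir *dir)
{
	int removed = 0;
	for (int i = 0; i < dir->deferred_count; i++) {
		int at = find(dir->dirents, dir->dirent_count, dir->deferred[i]);
		if (at >= 0) {
			dir->dirents[at] = dir->dirents[--dir->dirent_count];
			removed++;
		}
	}
	dir->deferred_count = 0;
	return removed;
}
```

The extra check in `toy_lookup` is the cost the discussion was trying to dodge: keeping a hashed negative dentry in the dcache would let the VFS answer "not found" without calling into the filesystem at all.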
2008-11-27 22:07 hirofumi, checking 2008-11-27 22:08 thanks 2008-11-27 22:14 hirofumi, the user space version of set_buffer_uptodate is right, and the kernel is wrong ;) 2008-11-27 22:14 :) 2008-11-27 22:15 we should make a "set_block_dirty" that does the right thing in both places 2008-11-27 22:15 later 2008-11-27 22:15 some name like that 2008-11-27 22:16 ...writing the superblock certainly seems like a good idea 2008-11-27 22:16 ok, totally sensible 2008-11-27 22:18 I noticed maybe it is useful for debug 2008-11-27 22:18 pulled 2008-11-27 22:18 thanks 2008-11-27 22:18 now copying into git 2008-11-27 22:18 developers really know about buffer state or not 2008-11-27 22:18 :) 2008-11-27 22:22 now, I'm trying to share super.c with userspace 2008-11-27 22:23 load_sb/save_sb uses disk directly 2008-11-27 22:23 call diskread/diskwrite 2008-11-27 22:24 yes 2008-11-27 22:24 http://userweb.kernel.org/~hirofumi/share-super.patch 2008-11-27 22:24 because the superblock size is a fixed size, not the same as block size 2008-11-27 22:24 I think it is intent, or prefer 2008-11-27 22:24 I think it is intent, and prefer 2008-11-27 22:25 4096? 2008-11-27 22:25 sizeof(struct disksuper)? 2008-11-27 22:25 yes, 4096 2008-11-27 22:26 it used to be size of (struct disksuper) before you removed the "bogopad" ;) 2008-11-27 22:26 do we support blocksize > 4096 if possible? 2008-11-27 22:26 the superblock size never changes 2008-11-27 22:26 it is independent of filesystem blocksize 2008-11-27 22:26 yes 2008-11-27 22:26 bigger buffers... 
2008-11-27 22:26 one day 2008-11-27 22:26 i see 2008-11-27 22:26 maybe it can be done with the prototype block library 2008-11-27 22:27 sounds like so 2008-11-27 22:27 it's not very nice to have multiple discontiguous blocks in a buffer 2008-11-27 22:27 so we would want order 1 alloc I think 2008-11-27 22:27 which means it would not be reliable until active defrag works properly in kernel 2008-11-27 22:28 but it would still be useful as an experimental option 2008-11-27 22:28 to show the performance benefit 2008-11-27 22:28 which would provide an argument to merge more of the active defrag work 2008-11-27 22:28 and 2008-11-27 22:28 on powerpc, PAGE_SIZE can be 64k 2008-11-27 22:29 in save_sb I think you meant sb_getblk, not sb_bread 2008-11-27 22:29 ah 2008-11-27 22:29 well, so, I wouldn't like to use buffer and brelse_dirty for it 2008-11-27 22:29 because intent was gone 2008-11-27 22:30 yes 2008-11-27 22:30 we can use the vecio/syncio from hackfs 2008-11-27 22:30 it's a nice little interface directly to bio 2008-11-27 22:30 I think we should use it 2008-11-27 22:30 and I'm thinking we shouldn't share super.c for now, or not 2008-11-27 22:30 it will be very much like using diskread/diskwrite in userspace 2008-11-27 22:30 i see 2008-11-27 22:31 in super.c there are a few functions that are worth sharing 2008-11-27 22:31 and no other good file to put them in 2008-11-27 22:31 yes 2008-11-27 22:32 if we user handle or vecio, I think we can share cleanly 2008-11-27 22:32 sb save is synchronous, right? 2008-11-27 22:32 in kernel? 2008-11-27 22:32 yes 2008-11-27 22:32 no 2008-11-27 22:33 it just dirty buffer 2008-11-27 22:33 and there is another function to actually write the sb 2008-11-27 22:33 yes 2008-11-27 22:33 I think buffer cache flusher writes it 2008-11-27 22:34 bleah 2008-11-27 22:34 or, maybe ->sync_fs or ->write_super_lockfs can be synchronous 2008-11-27 22:35 how does generic kernel know the filesystem superblock size? 
2008-11-27 22:36 vfs don't know about it 2008-11-27 22:36 write_super() sets dirty proper buffer 2008-11-27 22:37 anyway, let's use your super.c patch 2008-11-27 22:37 then let's try to share super.c more 2008-11-27 22:37 http://userweb.kernel.org/~hirofumi/share-super.patch? 2008-11-27 22:37 rename my load/save_sb to pack/unpack_sb maybe 2008-11-27 22:37 this? 2008-11-27 22:37 yes 2008-11-27 22:37 this shares load_sb/save_sb 2008-11-27 22:38 yes 2008-11-27 22:38 it's a big improvement 2008-11-27 22:38 ah, then we use pack/unpack_sb? 2008-11-27 22:38 yes, those functions would be identical between kernel and userspace 2008-11-27 22:39 i see 2008-11-27 22:39 ok 2008-11-27 22:39 then only the way of getting them in and out of memory changes 2008-11-27 22:39 yes 2008-11-27 22:39 I'll take a break a bit 2008-11-27 22:40 I'll compile, run, and if I can write a file like you did, I will push to public git 2008-11-27 22:40 ls -l /mnt 2008-11-27 22:40 total 0 2008-11-27 22:40 -rw-r--r-- 1 root root 101 Nov 28 07:40 crypto 2008-11-27 22:40 -rw-r--r-- 1 root root 218 Nov 28 07:40 ddtest 2008-11-27 22:40 -rw-r--r-- 1 root root 401 Nov 28 07:40 ddtest2 2008-11-27 22:40 -rw-r--r-- 1 root root 1024 Nov 28 07:40 fakedev 2008-11-27 22:41 -rw-r--r-- 1 root root 65536 Nov 28 07:40 foodev 2008-11-27 22:41 -rw-r--r-- 1 root root 16 Nov 28 07:40 fsck 2008-11-27 22:41 -rw-r--r-- 1 root root 30 Nov 28 07:40 hackfs 2008-11-27 22:41 -rw-r--r-- 1 root root 6 Nov 28 07:40 hello 2008-11-27 22:41 -rw-r--r-- 1 root root 33 Nov 28 07:40 remount 2008-11-27 22:41 -rw-r--r-- 1 root root 37 Nov 28 07:40 test0 2008-11-27 22:41 -rwxr-xr-x 1 root root 123 Nov 28 07:40 test1 2008-11-27 22:41 -rwxr-xr-x 1 root root 131 Nov 28 07:40 test2 2008-11-27 22:41 -rwxr-xr-x 1 root root 19 Nov 28 07:40 test3 2008-11-27 22:41 -rw-r--r-- 1 root root 19 Nov 28 07:40 test4 2008-11-27 22:41 -rwxr-xr-x 1 root root 14405 Nov 28 07:40 test~ 2008-11-27 22:41 -rwxr-xr-x 1 root root 72 Nov 28 07:40 try 2008-11-27 22:41 
-rwxr-xr-x 1 root root 32 Nov 28 07:40 tux3 2008-11-27 22:41 -rw-r--r-- 1 root root 28 Nov 28 07:40 tuxtest 2008-11-27 22:41 :) 2008-11-27 22:41 mkdir /mnt/foo 2008-11-27 22:41 mkdir: cannot create directory `/mnt/foo': Operation not permitted 2008-11-27 22:41 any problem there? 2008-11-27 22:48 -!- RazvanM(~RazvanM@96.234.237.148) has joined #tux3 2008-11-27 22:49 http://tux3.org/ddtree?p=tux3fs 2008-11-27 22:50 broken style ;) 2008-11-27 22:50 uglier than sin 2008-11-27 22:50 but beautiful 2008-11-27 22:52 .create = tux3_create, 2008-11-27 22:52 .lookup = tux_lookup, 2008-11-27 22:52 / .link = ext3_link, 2008-11-27 22:52 / .unlink = ext3_unlink, 2008-11-27 22:52 / .symlink = ext3_symlink, 2008-11-27 22:52 / .mkdir = ext3_mkdir, 2008-11-27 22:52 / .rmdir = ext3_rmdir, 2008-11-27 22:52 / .mknod = ext3_mknod, 2008-11-27 22:52 / .rename = ext3_rename, 2008-11-27 22:52 / .setattr = ext3_setattr, 2008-11-27 22:52 / .setxattr = generic_setxattr, 2008-11-27 22:52 / .getxattr = generic_getxattr, 2008-11-27 22:52 / .listxattr = ext3_listxattr, 2008-11-27 22:52 / .removexattr = generic_removexattr, 2008-11-27 22:52 / .permission = ext3_permission, 2008-11-27 22:52 hirofumi (when you get back) some of the above might be nice projects for people who want to try 2008-11-27 22:54 probalby you can do them all in a couple days, but it might be fun to offer them as projects 2008-11-27 22:54 maze might do a couple ;) 2008-11-27 22:54 I have to fix tux3 xattr cache 2008-11-27 22:54 to allow empty xattrs 2008-11-27 22:55 I'll do that after sleeping 2008-11-27 22:55 it's a little messy 2008-11-27 23:00 Eyes bleed 2008-11-27 23:00 about? 2008-11-27 23:00 That ops struct 2008-11-27 23:00 it's totally normal 2008-11-27 23:01 the implementations will sometimes make your eyes bleed though 2008-11-27 23:20 hirofumi, I took the liberty of inviting everybody on our list to help hook up file and dir methods 2008-11-27 23:20 mlankhorst, ready? 
2008-11-27 23:21 ok, I'll leave as is for a while, for someone would have fun 2008-11-27 23:21 yes, it would be good project 2008-11-27 23:22 sync is the biggest inode operation left 2008-11-27 23:22 not exactly a starter project 2008-11-27 23:23 get/set_xattr should be easy, even before I fix the xcache 2008-11-27 23:23 sync is "sync" mount option? 2008-11-27 23:23 or reaction of sync(2) 2008-11-27 23:23 ? 2008-11-27 23:23 -!- stargazr5(~gauravstt@59.95.58.190) has joined #tux3 2008-11-27 23:23 anyway, yes 2008-11-27 23:23 as in file_operations.fsyc 2008-11-27 23:24 .fsync 2008-11-27 23:24 that's both fsync(2) and sync(2) I think 2008-11-27 23:24 there is easily way though 2008-11-27 23:24 ah 2008-11-27 23:25 are you getting tired of all that tracing output, or is it still helpful? 2008-11-27 23:25 ah 2008-11-27 23:25 I'm also thinking about it 2008-11-27 23:26 it can be turned off easily with #define trace trace_off 2008-11-27 23:26 personally, right now, I'm not using it 2008-11-27 23:26 yes 2008-11-27 23:26 and convert some printf() to trace() 2008-11-27 23:26 printf's need to be converted to trace() 2008-11-27 23:26 yes 2008-11-27 23:26 a _really_ easy start project 2008-11-27 23:27 mlankhorst, you could do it ;) 2008-11-27 23:27 (actually mlankhorst is a very serious hacker) 2008-11-27 23:27 probably, note is some printf() and trace() might be printf() or warn 2008-11-27 23:27 yes 2008-11-27 23:28 I was thinking of having a static variable to use as a flag to turn trace output off and on 2008-11-27 23:28 yes 2008-11-27 23:28 or pass sb to every trace, and have a flag in the sb 2008-11-27 23:28 ah, one minor thing 2008-11-27 23:29 I think I want to s/extent/diskextent/ 2008-11-27 23:29 everywhere 2008-11-27 23:29 i see 2008-11-27 23:29 so we can used struct extent to me a cpu-ordered extent 2008-11-27 23:29 bleah 2008-11-27 23:29 so we can use struct extent to mean a cpu-ordered extent 2008-11-27 23:29 yes 2008-11-27 23:30 ok, I'll do it right now 2008-11-27 
23:30 or we want to add prefix all on disk structure 2008-11-27 23:30 ? 2008-11-27 23:31 yes 2008-11-27 23:31 or we want to add prefix to all on-disk structure 2008-11-27 23:31 maybe 2008-11-27 23:31 :) 2008-11-27 23:31 I'm solving an immediate problem now, though 2008-11-27 23:31 ok 2008-11-27 23:31 which is that it sucks to store the segs[] array in disk extent format 2008-11-27 23:32 also, I want to use struct extent as the interface for get_extents 2008-11-27 23:32 but it's currently the disk form 2008-11-27 23:32 yes 2008-11-27 23:32 it would be nice to be consistent and put "disk" or "be" in front of more things 2008-11-27 23:33 I like "disk" because it doesn't say be or le, it just says be careful 2008-11-27 23:33 yes 2008-11-27 23:33 however, um... 2008-11-27 23:34 disksuper, diskextent, diskgroup, diskentry, diskdeaf, diskbnode... 2008-11-27 23:34 it reads ok 2008-11-27 23:34 except for diskbnode ;) 2008-11-27 23:35 disk_super? 2008-11-27 23:35 d_super, um... 2008-11-27 23:35 disksuper reads the best to me 2008-11-27 23:36 it takes me longer to read "disk_super" than "disksuper" for some reason 2008-11-27 23:36 maybe it's just me 2008-11-27 23:36 ok, anyway, it is good than nothing 2008-11-27 23:36 easy to change in a mass edit 2008-11-27 23:37 ok, I'll try to fix some bugs of core stuff I already know 2008-11-27 23:38 sounds good 2008-11-27 23:44 done 2008-11-27 23:44 oh, fast 2008-11-27 23:44 we will do more later I'll put it on the to do list 2008-11-27 23:44 ok 2008-11-27 23:45 ah 2008-11-27 23:45 now I can finish up my get_extents interface :) 2008-11-27 23:45 we don't have "." and ".." entries on directory? 
2008-11-27 23:46 I think it is going to be "negative count for hole" for now 2008-11-27 23:46 um 2008-11-27 23:46 we are supposed to 2008-11-27 23:46 I might have forgotten to store them for a new dir 2008-11-27 23:46 yes 2008-11-27 23:46 we don't actually need them 2008-11-27 23:46 according to ted 2008-11-27 23:47 yes 2008-11-27 23:47 it was only ever for backward compatibility 2008-11-27 23:47 and nfs 2008-11-27 23:47 and nfsd 2008-11-27 23:47 vfs supplies them for getdents automatically I think 2008-11-27 23:47 it calls get_parent() 2008-11-27 23:47 but vfs can do that without going to the fs 2008-11-27 23:47 yes, actually fat/exfat doesn't have on some directries 2008-11-27 23:48 lets see if anything breaks before we decide to actually store them on disk 2008-11-27 23:48 vfs can't do for nfsd 2008-11-27 23:48 ah 2008-11-27 23:48 nfsd should then 2008-11-27 23:48 I think it is only issue 2008-11-27 23:49 nfsd patch ;) 2008-11-27 23:49 it's easy to store them 2008-11-27 23:49 if we can do, it would be really good 2008-11-27 23:49 the nfsd patch? 2008-11-27 23:49 yes 2008-11-27 23:49 I know somebody who would be ideal to do it ;) 2008-11-27 23:49 one of the citi nfsd team 2008-11-27 23:50 eh, they are trying to remove ->get_parents? 2008-11-27 23:50 who is? 2008-11-27 23:51 one of citi nfsd team 2008-11-27 23:52 well, another issue of "." and ".." is on rename() 2008-11-27 23:52 url? 2008-11-27 23:52 ->get_parent? 2008-11-27 23:53 http://lxr.linux.no/linux+v2.6.27.5/fs/ext2/super.c#L356 2008-11-27 23:54 and 2008-11-27 23:54 http://lxr.linux.no/linux+v2.6.27.5/fs/ext2/namei.c#L73 2008-11-27 23:54 linus has expressed interest in getting "." and ".." 
off of physical disk 2008-11-27 23:54 it's lame to have them there in a new filesystem 2008-11-27 23:55 yes 2008-11-27 23:55 we can be lame for now, if it saves a little bit of time ;) 2008-11-27 23:55 I'll put it on the to do list 2008-11-27 23:55 ok 2008-11-27 23:56 I just noticed it, I'm thinking about->mkdir() 2008-11-27 23:56 I just noticed it, I'm thinking about ->mkdir() 2008-11-27 23:57 I'm thinking about->sleep ;) 2008-11-27 23:57 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-11-27 23:57 oh, sorry :) 2008-11-27 23:57 good night 2008-11-27 23:57 good night 2008-11-28 00:09 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-11-28 00:33 hirofumi, still there? 2008-11-28 00:34 yes 2008-11-28 00:34 a very small step forward with the defer project... 2008-11-28 00:34 ok 2008-11-28 00:34 dentry.d_subdirs is at least a list of dentries 2008-11-28 00:34 not as ugly as using a list that normally holds inodes 2008-11-28 00:35 and it better be empty when d_delete is called 2008-11-28 00:35 yes, if it is directory 2008-11-28 00:35 or file 2008-11-28 00:35 since files can have subdirs 2008-11-28 00:35 anyway, d_revalidate could be our friend 2008-11-28 00:36 at the time d_revalidate is called, we know the dentry is going to be reused 2008-11-28 00:36 so our revalidate can dput it 2008-11-28 00:36 the problem of what to do about references still on the dentry is unresolved 2008-11-28 00:36 but it's not our problem any more at that point 2008-11-28 00:37 our fs just needs to remember that the dirent now needs to be _updated_ not just changed 2008-11-28 00:37 delete and recreate counts as an update 2008-11-28 00:37 anyway, still more details to consider 2008-11-28 00:37 it's a mess 2008-11-28 00:37 i see 2008-11-28 00:39 do_revalidate appears to not be under spinlock 2008-11-28 00:39 yes 2008-11-28 00:39 and the place of revalidate is strace 2008-11-28 00:40 that is the reason for it? 
2008-11-28 00:40 and the place of revalidate is strange 2008-11-28 00:41 ah ;) 2008-11-28 00:41 one calls it under ->i_mutex 2008-11-28 00:41 and it's called from a few places 2008-11-28 00:41 one doesn't have ->i_mutex 2008-11-28 00:41 yes 2008-11-28 00:41 that's a heavily contended lock 2008-11-28 00:41 it shows up in my file system tests all of the time in lockstat 2008-11-28 00:42 oh, i see 2008-11-28 00:42 test is metadata operations? 2008-11-28 00:42 bhuey, how about the parent dir i_mutexes during create/unlink/rename? 2008-11-28 00:44 anyway, sleep->time for real 2008-11-28 00:44 ok :) 2008-11-28 00:45 not sure, imo, run with lockdep/stat compiled in to all kernels and watch the stats 2008-11-28 00:45 flips: at least two of the inode related locks were heavily contended against a "find" load 2008-11-28 00:45 flips: night 2008-11-28 00:48 if test tries to change/read directory entries, ->i_mutex would be contended 2008-11-28 00:49 because those take ->i_mutex of parent dir in almost all operations 2008-11-28 01:57 -!- pgquiles__(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-28 05:02 -!- pgquiles_(~pgquiles@222.Red-88-0-139.dynamicIP.rima-tde.net) has joined #tux3 2008-11-28 05:52 -!- bushman(~bushman@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-11-28 06:03 http://www.ibm.com/developerworks/linux/library/l-gcc-hacks/index.html?ca=dgr-lnxw09GCCKernel&S_Tact=105AGX59&S_CMP=GRsitelnxw09 GCC hacks for the kernel coders 2008-11-28 08:33 bushman, yet I have that one open for reading ;) 2008-11-28 08:33 yes I mean 2008-11-28 08:42 hey, wanna talk? i got all the time today 2008-11-28 08:43 sure 2008-11-28 08:43 subject?
2008-11-28 10:10 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-28 10:30 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-28 11:17 ok, time to fix xattrs to be posix compliant 2008-11-28 11:32 ACTION does make xattr && ./xattr foodev  2008-11-28 11:32 this is my "ide" 2008-11-28 11:50 Hmm, fixing xattr support to be posix compliant turned out to be just a matter of deleting some code: 2008-11-28 11:50 atom 777 => 0x8055310: 77 6f 72 6c 64 21 "world!" 2008-11-28 11:50 atom 111 => 0x805531a: 63 6c 61 73 73 "class" 2008-11-28 11:50 atom 666 => 2008-11-28 11:50 atom 222 => 0x8055327: 62 6f 6f 6f 79 61 68 "boooyah" 2008-11-28 11:52 hg diff | diffstat 2008-11-28 11:52 kernel/xattr.c | 58 +++++++++++++++++++++++---------------------------------- 2008-11-28 11:52 xattr.c | 2 - 2008-11-28 11:52 2 files changed, 25 insertions(+), 35 deletions(-) 2008-11-28 12:13 ok, done, xattrs should now be posix compliant 2008-11-28 12:13 now need to add a xattr remove function 2008-11-28 12:13 and xattr list 2008-11-28 14:34 ok, we have more incremental progress on deferred nameops 2008-11-28 14:35 the d_subdirs field turns out to be ok for linking the dentries together, even including non-negative dentries 2008-11-28 14:35 what makes it work is the fact that only directories have subdirs 2008-11-28 14:36 and in the case of a directory, there can be only one dentry 2008-11-28 14:36 so in the one case that the subdir field can't be used, the dentry can be referenced directly from the list field of the inode 2008-11-28 14:36 filthy... 
2008-11-28 14:49 ok, let's see how the kernel list xattrs interface is supposed to look 2008-11-28 15:04 wow, it's crapaciously ugly 2008-11-28 15:12 -!- dmoerner(~dmr@ppp-71-139-11-48.dsl.snfc21.pacbell.net) has joined #tux3 2008-11-28 15:22 xattr internal api, oh wow 2008-11-28 21:47 xattr support is mostly done now 2008-11-28 21:47 need to decide what to do with linux's weirdo xattr namespaces 2008-11-28 21:56 what is problem? 2008-11-28 21:56 namespaces... does it mean we should have four different atom tables? 2008-11-28 21:57 I don't know about it 2008-11-28 21:57 or should we store the whole xattr name including the prefix in the atom table? 2008-11-28 21:57 me neither 2008-11-28 21:58 security.foo.bar -> security./foo.bar ? 2008-11-28 21:58 the / doesn't really add anything 2008-11-28 21:58 if user want security.foo.bar, we get security./foo.bar ? 2008-11-28 21:59 yes 2008-11-28 21:59 it means just subdir 2008-11-28 21:59 anyway, we will store the whole thing 2008-11-28 21:59 in fact I don't think we have to interpret it at all 2008-11-28 21:59 let vfs do that 2008-11-28 21:59 stupid design 2008-11-28 22:00 ok, well deferred nameops is starting to look ok 2008-11-28 22:00 um.. 
vfs seems to just call i_op->setxattr 2008-11-28 22:01 yes, and generic_*_attr does some whacky stuff 2008-11-28 22:01 but we can skip it 2008-11-28 22:01 ah, yes 2008-11-28 22:01 it doesn't add anything except complexity, it looks like 2008-11-28 22:03 it's pretty dumb that both vfs permission and generic xattr ops scan the prefixes 2008-11-28 22:03 a patch maybe later 2008-11-28 22:03 yes 2008-11-28 22:04 anyway, for deferred nameops, we can use the dentry subdir list both for negative and non-negative deferred dentries I think 2008-11-28 22:05 the only time the subdir list is used for subdirs is for a directory 2008-11-28 22:05 and that can only have one link, so we can point to the dentry directly from the inode 2008-11-28 22:05 or we can just add a new dentry list field 2008-11-28 22:06 and d_revalidate looks like our friend 2008-11-28 22:06 we need to set a flag in the dentry if it gets turned from negative to positive 2008-11-28 22:07 to say that the non-negative dentry is _not_ backed by the filesystem yet 2008-11-28 22:07 actually, this is implied by being on the deferred list and being nonnegative 2008-11-28 22:08 -!- RazvanM(~RazvanM@96.234.237.148) has joined #tux3 2008-11-28 22:08 the bit we need is a bit that says the deferred dentry still exists on disk 2008-11-28 22:09 so if a negative deferred dentry is reused, it still has that bit that says the name exists on disk 2008-11-28 22:09 and therefore needs to be reused on disk, not create a new dirent 2008-11-28 22:11 it hard to understand for me at least for now 2008-11-28 22:11 I'll make a patch 2008-11-28 22:11 easier that way, I think I'm done researching 2008-11-28 22:12 if it doesn't work, then I will talk to you ;) 2008-11-28 22:12 ok :) 2008-11-28 22:12 I pushed a few patches 2008-11-28 22:12 you might want to merge 2008-11-28 22:12 xattr stuff ? 
2008-11-28 22:12 yes 2008-11-28 22:12 and a little bit of fiddling with tux_inode 2008-11-28 22:12 yes, I pulled 2008-11-28 22:13 ok 2008-11-28 22:13 ah 2008-11-28 22:13 about cursor 2008-11-28 22:13 well, I guess we are going to store the full xattr name including the prefix, since we only store it once in the atom table, that does not hurt much 2008-11-28 22:14 that means I can make xattr_list a little shorter 2008-11-28 22:14 it would be good 2008-11-28 22:15 I think we can rethink it later 2008-11-28 22:15 well, I think we want struct cursor_head { array_size, struct cursor { buffer, next } cursor[] } 2008-11-28 22:16 array_size is not current btree depth 2008-11-28 22:16 it means length of cursor array 2008-11-28 22:17 ok 2008-11-28 22:17 so, we can just pass cursor_head to release_cursor 2008-11-28 22:17 yes, I think it will be more simple 2008-11-28 22:18 and on read side, if it doesn't walk all btree, it doesn't need to sync with write side 2008-11-28 22:18 read side doesn't care about new depth 2008-11-28 22:19 probe(); 2008-11-28 22:19 buffer = cursor[btree->root.depth].buffer 2008-11-28 22:19 it means we need to serialize with write side 2008-11-28 22:20 you mean, for smp 2008-11-28 22:20 yes 2008-11-28 22:20 so, I thought we would want cursor_head->array_length 2008-11-28 22:21 the depth can actually change in the middle of a probe 2008-11-28 22:21 in theory 2008-11-28 22:21 if we don't lock the whole tree 2008-11-28 22:21 yes 2008-11-28 22:22 probe() requires lock 2008-11-28 22:22 anyway, I think we need struct cursor_head 2008-11-28 22:22 however... 2008-11-28 22:22 can we do it struct cursor { ... struct { buffer, next } path[] } ? 2008-11-28 22:22 it would need to change many files 2008-11-28 22:23 yes 2008-11-28 22:23 so, I want to ask before change 2008-11-28 22:23 well I will work on deferred nameops while you do that 2008-11-28 22:23 ok 2008-11-28 22:23 struct cursor { ... struct { buffer, next } path[] } 2008-11-28 22:23 ok?
2008-11-28 22:23 ok 2008-11-28 22:24 I'm not thinking about detail of structure yet 2008-11-28 22:24 it will change over time 2008-11-28 22:24 because, for example, the tree depth can change during a read, if we are using crab locking 2008-11-28 22:25 yes 2008-11-28 22:25 cursor->length just means current array size 2008-11-28 22:25 also, during crab locking, only two path elements are valid at once 2008-11-28 22:26 ok 2008-11-28 22:26 cursor->size, ok? 2008-11-28 22:26 ok 2008-11-28 22:26 after this change it will be easier to change in the future 2008-11-28 22:27 struct cursor { int size, struct { buffer, next } path[] }? 2008-11-28 22:27 looks good 2008-11-28 22:27 ok, I'll try with it 2008-11-28 22:27 ok, I'm pushing a change right now, then I will wait 2008-11-28 22:29 done 2008-11-28 22:29 probably, I need a bit long time to finish 2008-11-28 22:29 and I have a lot of work to do on deferring anyway 2008-11-28 22:29 I'm glad I got something done on xattrs 2008-11-28 22:30 it takes time probably 2008-11-28 22:30 yes 2008-11-28 22:30 so, you don't need to wait 2008-11-28 22:31 you don't need to wait for me 2008-11-28 22:35 well, anyway, I'll try to fix release_cursor problems 2008-11-28 22:35 so I'll be silent for a while 2008-11-29 02:25 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-11-29 02:55 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-11-29 02:58 flips: hey 2008-11-29 02:58 hi 2008-11-29 02:58 been reading the list 2008-11-29 02:58 anything interesting? 2008-11-29 02:59 yeah, all of it 2008-11-29 03:15 ACTION slaps flips around a bit with a pillow 2008-11-29 03:16 oh right 2008-11-29 03:39 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-11-29 06:25 -!- pgquiles(~pgquiles@109.Red-83-35-113.dynamicIP.rima-tde.net) has joined #tux3 2008-11-29 06:50 flips: there?
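The cursor layout agreed on in the 2008-11-28 session, `struct cursor { int size, struct { buffer, next } path[] }`, reads naturally as a C99 flexible array member. A minimal sketch under stated assumptions follows: `struct buffer` here is a hypothetical stand-in with only a reference count, `alloc_cursor` is an invented helper, and this `release_cursor` is a guess at what releasing every path level might mean, not the actual Tux3 function.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the real buffer type, which the log does not show. */
struct buffer {
	int refcount;
};

struct cursor {
	int size;                      /* number of slots in path[], not the btree depth */
	struct path_level {
		struct buffer *buffer; /* buffer held at this level, or NULL */
		void *next;            /* next entry to visit within that buffer */
	} path[];                      /* C99 flexible array member */
};

/* Allocate a zeroed cursor with room for `size` path levels (invented helper). */
struct cursor *alloc_cursor(int size)
{
	assert(size > 0);
	struct cursor *cursor = calloc(1, sizeof(*cursor) + size * sizeof(cursor->path[0]));
	if (cursor)
		cursor->size = size;
	return cursor;
}

/*
 * Drop every buffer reference still held in the path. Because the cursor
 * carries its own array size, the caller does not need to know the current
 * tree depth. Returns the number of references dropped.
 */
int release_cursor(struct cursor *cursor)
{
	int released = 0;
	for (int i = 0; i < cursor->size; i++) {
		if (cursor->path[i].buffer) {
			cursor->path[i].buffer->refcount--;
			cursor->path[i].buffer = NULL;
			released++;
		}
	}
	return released;
}
```

Storing the array length in the cursor itself is what lets release work without reading `btree->root.depth`, which, as noted in the discussion, could change while a probe is in progress.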
2008-11-29 08:24 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-29 13:47 http://www.phoronix.com/scan.php?page=news_item&px=Njg4NQ <- nice writeup on Phoronix 2008-11-29 13:48 flips: yeah saw that 2008-11-29 13:48 welcome back 2008-11-29 13:49 also a slashdot mention 2008-11-29 13:56 oh? 2008-11-29 13:57 oh wow, first article on slashdot 2008-11-29 13:58 http://www.kev009.com/wp/2008/11/on-file-systems/ 2008-11-29 14:00 from the comments on the /. post: "2) ext3+nfs simply sucks with very large amount of files. I used to routinely have directories with 500,000 files (very easy to reach such amounts with a cartesian multiplication of options). The result is simply downright appalling performance." 2008-11-29 14:01 i wonder why that is 2008-11-29 14:02 http://hardware.slashdot.org/comments.pl?sid=1045705&cid=25927779 2008-11-29 14:02 interesting comments, much in line with what we've been talking abuot 2008-11-29 14:04 we dodged a bullet, didn't actually get linked 2008-11-29 14:05 shapor. I wonder if he is talking before htree? 2008-11-29 14:06 not sure, but ted tso responded with "this is fixed in btrfs" which is.. intersting 2008-11-29 14:06 oh no, thats was re the other one 2008-11-29 14:07 ted's a big btrfs booster, but I think that Linux's next gen fs will actually be... 
ext4 2008-11-29 14:07 lacks snapshots is the only thing 2008-11-29 14:08 and could use versioned pointers for that without changing the model 2008-11-29 14:09 "You seem very knowledgeable regarding filesystems in general" -- slashdotter to tytso 2008-11-29 14:10 "You seem very knowledgeable regarding operating systems in general" -- if Linus shows up on slashdot 2008-11-29 14:10 "and you have a big beak" 2008-11-29 14:13 http://www.kev009.com/wp/2008/11/on-file-systems/ 2008-11-29 14:19 nice, google alerts didn't pick that one up for me 2008-11-29 14:20 it was /.-ed 2008-11-29 14:21 kind of hard to miss 2008-11-29 14:24 maybe googlebot is getting old and senile 2008-11-29 14:24 ;) 2008-11-29 14:25 flips: i got the alert right away 2008-11-29 14:26 hmm 2008-11-29 14:27 didn't get anything for kev009.com 2008-11-29 14:27 maybe googlebot just doesn't like *you* ;) 2008-11-29 14:27 ah yeah i meant /. 2008-11-29 14:27 which linked to that 2008-11-29 14:28 googlebot supposedly indexes slasdot 2008-11-29 14:28 doesn't see to rank it very high 2008-11-29 14:28 googlebot seems way more interested in my git tree ;) 2008-11-29 14:29 speaking of which, I should clone from linus's tree instead of the clone-from-tarball I currently have 2008-11-29 14:29 and get a home on git.kernel.org 2008-11-29 14:30 http://lxr.linux.no/linux+v2.6.27/fs/libfs.c#L128 <- here is the key to deferred nameops I think 2008-11-29 14:31 this shows how to traverse the cached children of a directory dirent 2008-11-29 14:32 which is just what is needed to flush the deferred nameops 2008-11-29 14:32 where "flush" means "sync to back end cache" 2008-11-29 14:39 ok, I think d_subdirs field in struct dentry is misnamed, it's actually not a list of subdirs, it's a list of all cached names in the directory 2008-11-29 14:40 at file sync time (on the directory) we will just traverse that list and do dput for any deferred negative dentry, and ->create (tux3_create) for any deferred nonnegative dentry 2008-11-29 
14:42 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-29 14:43 flips: they say that tux3 is going to take 3 year for it to be at the point of kernel inclusion 2008-11-29 14:43 no, mainstream use :) 2008-11-29 14:44 bhuey, who ways that? 2008-11-29 14:44 mainstream use... sure 2008-11-29 14:44 but I don't think so, really 2008-11-29 14:44 tux3 is not very complicated 2008-11-29 14:44 best way is just to show that 2008-11-29 14:45 the strategy of getting the low level bugs out in user space is really powerful 2008-11-29 14:45 flips: the article linked from /. kev009 2008-11-29 14:45 ah 2008-11-29 14:45 3 yrs isn't very long from design to mainstream use ;) 2008-11-29 14:46 to me, mainstream use is when people completely trust their data with the file system 2008-11-29 14:47 right, 3 years 2008-11-29 14:47 with replication we can trust it earlier on servers 2008-11-29 14:47 well 2008-11-29 14:48 anybody can ;) 2008-11-29 14:48 has to have zero crashes, we have a lot better chance of getting there quickly with a simpler code base 2008-11-29 14:48 also, online fsck 2008-11-29 14:48 design for that is now worked out in some detail 2008-11-29 15:00 how do you guys test new kernels? uml? vmware? qemu? 2008-11-29 15:00 I use uml, hirofumi uses kvm 2008-11-29 15:00 I will post my uml stuff to our mailing list 2008-11-29 15:01 great. 2008-11-29 15:01 I have a very cool 100 MB root filesystem for uml 2008-11-29 15:01 has nano for an editor in it, stuff like grep, but not gcc 2008-11-29 15:02 which doesn't matter, because it only takes 5 seconds to boot or so, and that is mainly due to the bogomips init loop 2008-11-29 15:02 which can be patched out 2008-11-29 15:03 did you build it yourself? 
2008-11-29 15:03 it started life as an ancient slackware image 2008-11-29 15:03 and eventually got a recent libc, which I ported by hand 2008-11-29 15:03 to conserve space 2008-11-29 15:04 the original came from jdike's uml page 2008-11-29 15:04 now nobody seems to care about small rootfs's except me ;) 2008-11-29 15:04 I really like being able to back up the whole thing with cp in 2 seconds 2008-11-29 15:04 and tar it up for email 2008-11-29 15:07 well, i care for them sometimes as i volunteer for installing some servers over ethernet 2008-11-29 15:09 well, i do the install over ethernet 2008-11-29 15:09 that's not strictly neccessary, but the alternative is opening 100 drives and putting a cdin 2008-11-29 15:11 having a small debian root around, just big enough to do apt-get update is really nice for install 2008-11-29 15:11 install with dd 2008-11-29 15:11 that's pretty much how it works 2008-11-29 15:13 ls -l tux3.root.tbz2 2008-11-29 15:13 -rw-r--r-- 1 daniel daniel 23884343 2008-11-29 15:11 tux3.root.tbz2 2008-11-29 15:13 4x compression 2008-11-29 15:15 http://lxr.linux.no/linux+v2.6.27/fs/libfs.c#L170 <- this code is really demented 2008-11-29 15:15 flips: right, it was a good decision to start in FUSE 2008-11-29 15:15 it was a contribution from outside 2008-11-29 15:15 konrad 2008-11-29 15:15 ? 2008-11-29 15:15 and tero made it work 2008-11-29 15:16 right 2008-11-29 15:16 it was a good decision 2008-11-29 15:16 konrad, you're the reason we have fuse ;) 2008-11-29 15:16 ah 2008-11-29 15:16 I think the person responsible for the low level fuse api started independently 2008-11-29 15:16 but ok :) 2008-11-29 15:16 it allowed you to work on core things first before working on kernel issues which are also hard but not core to the function of the file system 2008-11-29 15:16 "I take no responsibility for this" -- konrad 2008-11-29 15:17 it sucks to debug fiddling disk format code in kenrel 2008-11-29 15:17 flips: are you going to publish it? 
2008-11-29 15:17 and btree stuff, what a pain 2008-11-29 15:17 publish the root file? 2008-11-29 15:17 yes, I just need to clean out a bit of stuff from old projects 2008-11-29 15:18 this root has a lot of operational history ;) 2008-11-29 15:18 later today 2008-11-29 15:18 today has been over for 20 minutes :) 2008-11-29 15:19 looking forward to it, i'll be around 2008-11-29 15:20 $ mount /src/zuma/root 2008-11-29 15:20 $ ls /zuma 2008-11-29 15:20 bar boot dev floppy foobar initrd lost+found proc sbin sys test1 test3 tmp try~ try2~ var zot 2008-11-29 15:20 bin cdrom etc foo home lib mnt root src test test2 test4 try try2 usr x zumastor 2008-11-29 15:20 daniel@moonbase:/src/2.6.26.5.tux3/fs/ramfs$ 2008-11-29 15:21 $ umount /src/zuma/root 2008-11-29 15:21 it's listed as user in fstab 2008-11-29 15:21 makes it really convenient to hack it up 2008-11-29 15:21 I often forget to umount it before booting uml though ;) 2008-11-29 15:21 it has survived at least a 100 of those 2008-11-29 15:21 just turned it into an ext3 fs yesterday 2008-11-29 15:22 flips: when/if I do start tux3 development, it'll be bug fixing first or something like that 2008-11-29 15:22 now it doesn't spend time fscking 2008-11-29 15:22 bhuey, that would be awesome 2008-11-29 15:22 because dleaf is a bit overwhelming 2008-11-29 15:27 I don't really understand b-trees and stuff 2008-11-29 15:27 it's going to take a bit of time 2008-11-29 15:27 there are lots of other bits that are more obvious 2008-11-29 15:27 ACTION bails 2008-11-29 15:27 shapor has dleaf under control 2008-11-29 15:28 and hirofumi 2008-11-29 15:28 fixed dleaf_back ;) 2008-11-29 15:29 moonbase:/zuma/root# rm /bin/devfsd 2008-11-29 15:29 uml used to be very pushing about wanting devfs 2008-11-29 15:29 converted that to traditional devs long before the downfall of devfs 2008-11-29 15:29 was a lot more robust that way 2008-11-29 15:30 and I could use it on my real partitions 2008-11-29 15:30 only have to make a small change to inittab to 
use uml's ttys 2008-11-29 15:48 cleanup of tuxroot is done 2008-11-29 15:48 I'll post a link in a moment 2008-11-29 15:54 http://tux3.org/downloads/tuxroot.jbz2 2008-11-29 15:54 ./linux ubd0=tuxroot ubd1=testdev 2008-11-29 15:55 make defconfig ARCH=um && make linux ARCH=um 2008-11-29 15:55 that's the whole recipe 2008-11-29 15:55 in a slightly wrong order 2008-11-29 16:19 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-11-29 16:24 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-29 16:58 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-29 17:50 hey tim_dimm 2008-11-29 17:50 hey 2008-11-29 18:13 ok, I finally got my google alert for kev009 2008-11-29 18:35 turns out we have 3 possible deferred dentry states 2008-11-29 18:36 1) Positive dentry, but no dirent in itable block 2) Negative dentry, but dirent still exists in itable block 3) Positive dentry, but wrong dirent exists in itable block 2008-11-29 18:37 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-29 18:37 so we need 2 new bits of dcache state to represent that: a D_DIRENT_EXISTS bit and a D_DIRENT_WRONG bit 2008-11-29 18:38 this tells our back end flush whether it has to: 1) create a dirent 2) remove a dirent 3) find and update a dirent 2008-11-29 18:39 this is working out really well :) 2008-11-29 18:40 now I need to make the traverse over directory child dentries work, for the deferred dentry flush 2008-11-29 18:40 d_genocide might be a good place to get that from 2008-11-29 18:41 http://lxr.linux.no/linux+v2.6.27/fs/dcache.c#L2184 2008-11-29 19:07 #define DCACHE_BACKED 0x0020 /* FS has dirent */ 2008-11-29 19:07 #define DCACHE_WRONG 0x0040 /* FS has wrong dirent */ 2008-11-29 19:31 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-11-29 20:11 http://tux3.org/downloads/tuxroot.tar.bz2 2008-11-29 20:16 time to compile the latest user/kernel and make a new
patch 2008-11-29 20:47 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-29 20:48 hi 2008-11-29 20:48 hey dude 2008-11-29 20:48 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-11-29 20:48 is cursor cleanup 2008-11-29 20:48 pulling... 2008-11-29 20:49 it's still in progress though 2008-11-29 20:49 ah 2008-11-29 20:49 ok, I'll just read 2008-11-29 20:49 yes 2008-11-29 20:50 it may tell what do I want to do 2008-11-29 20:50 last 3 patches is main part 2008-11-29 20:50 http://lkml.org/lkml/2008/11/29/202 <- Developing Tux3 with UML 2008-11-29 20:52 looks like good 2008-11-29 20:52 xattr->size ? hexdump(xattr->body, xattr->size) : printf("\n"); <- didn't like that, hmm? 2008-11-29 20:53 result is different type 2008-11-29 20:53 maybe, int and void 2008-11-29 20:53 right 2008-11-29 20:53 it was pretty sloppy ;) 2008-11-29 20:53 sparse warns about it :) 2008-11-29 20:53 I was surprised it compiled 2008-11-29 20:54 it is right for c, but sparse don't handle it correctly 2008-11-29 20:54 hirofumi, it all looks fine 2008-11-29 20:54 especially the bug fix ;) 2008-11-29 20:55 thanks :) 2008-11-29 20:55 just tell me when you're ready 2008-11-29 20:55 ok, I'll start real cleanup stuff 2008-11-29 20:56 maybe, release_cursor(cursor, depth + 1) -> release_cursor(cursor) 2008-11-29 20:56 cursor->path[depth].buffer -> cursor_leafbuf(cursor) 2008-11-29 20:56 and simular stuff 2008-11-29 21:01 it will be nice 2008-11-29 21:09 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-29 22:09 .depth = iroot >> 48 <- hirofumi, I guess we better have wrappers for root_block, root_depth and make_root(block, depth) 2008-11-29 22:21 sounds good 2008-11-29 22:25 -!- pgquiles(~pgquiles@25.Red-83-34-135.dynamicIP.rima-tde.net) has joined #tux3 2008-11-29 22:26 a issue seems to be encode_attrs() 2008-11-29 22:27 make_root() will return be_u64 like others 2008-11-29 22:28 but, since encode64() changes endian by itself, it takes 
u64 2008-11-29 22:40 ah 2008-11-29 22:40 well I thought make_root should not be be_ 2008-11-29 22:40 what other make_ is be_? 2008-11-29 22:42 ah, make_group 2008-11-29 22:42 and make_entry 2008-11-29 22:42 and make_extent 2008-11-29 22:42 that's it 2008-11-29 22:43 I think the easiest is just to let that be inconsistent 2008-11-29 22:44 let make_root return cpu order 2008-11-29 22:57 or __make_root() for consistency? 2008-11-29 22:59 well I don't think it actually needs to be consistent 2008-11-29 22:59 it seems to have difference with others 2008-11-29 22:59 pack_root/unpack_root? 2008-11-29 22:59 there is make_atom that also returns cpu 2008-11-29 22:59 pack_root is fine 2008-11-29 22:59 both returns cpu order 2008-11-29 23:00 ok, pack_root? 2008-11-29 23:00 pack_root returns struct root 2008-11-29 23:00 right 2008-11-29 23:00 that's fine 2008-11-29 23:00 ah, unpack_root() 2008-11-29 23:01 root_block(root) and root_depth(root) I think 2008-11-29 23:01 it's needed? 2008-11-29 23:01 trying to be too consistent leads to high blood pressure ;) 2008-11-29 23:01 it's better than >> 48 2008-11-29 23:01 .root = { .block = v64 & (-1ULL >> 16), .depth = v64 >> 48 } }; 2008-11-29 23:01 .root = unpack_root(iroot); 2008-11-29 23:02 ? 2008-11-29 23:02 .root = unpack_root(v64); 2008-11-29 23:03 maybe, we need to pack to u64, and unpack to cache (i.e. struct root) 2008-11-29 23:04 hmm 2008-11-29 23:04 I think I was being too random ;) 2008-11-29 23:04 only the be_ form is packed 2008-11-29 23:05 pack_root can return u64 2008-11-29 23:05 yes 2008-11-29 23:05 pack to u64 does it 2008-11-29 23:06 now the unpack... should return struct root 2008-11-29 23:06 ok 2008-11-29 23:06 as you said? 
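The pack/unpack split being settled here can be sketched roughly as follows. This is only an illustration, not the actual tux3 code: the layout (block in the low 48 bits, depth in the high 16) comes from the `.block = v64 & (-1ULL >> 16), .depth = v64 >> 48` initializer quoted in the discussion, while the exact `struct root` definition and the `static inline` signatures are assumptions. Both helpers work in cpu order; any big endian conversion (for example inside encode64()) is assumed to happen elsewhere.

```c
#include <stdint.h>

/* Assumed in-memory ("unpacked") form of a btree root. */
struct root {
	uint64_t block;  /* root block number, low 48 bits when packed */
	unsigned depth;  /* tree depth, high 16 bits when packed */
};

/* Pack a root into a cpu-order u64 (not be_u64). */
static inline uint64_t pack_root(struct root root)
{
	return ((uint64_t)root.depth << 48) | (root.block & (~0ULL >> 16));
}

/* Unpack a cpu-order u64 into the cached struct root. */
static inline struct root unpack_root(uint64_t v64)
{
	return (struct root){
		.block = v64 & (~0ULL >> 16),
		.depth = (unsigned)(v64 >> 48),
	};
}
```

With helpers like these, an open-coded `iroot >> 48` becomes `unpack_root(iroot).depth`, which keeps the shift and mask in one place.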
2008-11-29 23:06 and it takes u64, not be_u64 2008-11-29 23:06 yes 2008-11-29 23:06 sure 2008-11-29 23:07 hah, the easiest things are sometimes the hardest ;) 2008-11-29 23:07 yes :) 2008-11-29 23:08 well, it's for beginner people 2008-11-29 23:08 it tells make_* is not pack_* 2008-11-29 23:08 sure, it's fine 2008-11-29 23:09 now I just need to convince ext2 to call my dir_sync function and see if defer works... 2008-11-29 23:09 I'm getting closer 2008-11-29 23:09 great 2008-11-29 23:14 -!- Bobby_(~Bobby@122.162.68.92) has joined #tux3 2008-11-29 23:17 flips, there? 2008-11-29 23:18 hi bobby_ 2008-11-29 23:18 ready to start on fsck? 2008-11-29 23:18 :) 2008-11-29 23:19 sure 2008-11-29 23:19 ok.. 2008-11-29 23:19 so, we walk through the tree and dump the info as a first task? 2008-11-29 23:19 yes 2008-11-29 23:19 info == data in each block 2008-11-29 23:20 using as many of the existing dump functions as possible 2008-11-29 23:20 ok, we have a structure of ops which is passed to our fsck to dump 2008-11-29 23:20 sure 2008-11-29 23:21 and another structure is used to check using the same functions... 2008-11-29 23:21 you can add that after the dump is working, to save some time 2008-11-29 23:21 ok, for this i think i first need to understand the tree structure.. hirofumi's image helps for this... 2008-11-29 23:21 yes 2008-11-29 23:22 I'll go through the existing functions which _might_ help in dump... 2008-11-29 23:22 superblock -> inode table -> index blocks -> ileaf -> data btree -> index blocks -> dleaf 2008-11-29 23:23 that's the main structure 2008-11-29 23:23 hmm 2008-11-29 23:23 ok 2008-11-29 23:23 simpler: superblock -> inode table index blocks -> inode table leaf -> data index blocks -> dleaf 2008-11-29 23:24 it's pretty simple actually 2008-11-29 23:24 hmm 2008-11-29 23:24 some of the files are special though, and can be dumped more 2008-11-29 23:24 flips the figure might really help with this structure in mind.. can you point me to it?
2008-11-29 23:24 like directories, bitmaps 2008-11-29 23:24 i seem to have lost the link... 2008-11-29 23:24 atom tables 2008-11-29 23:25 dumped more?? 2008-11-29 23:25 i thought i was to dump all the nodes in the tree? 2008-11-29 23:25 well you dump a normal file by just showing its disk extents (dleaf_show) 2008-11-29 23:26 but a directory file has structure, you can dump the dirents 2008-11-29 23:26 for the bitmap inode... there is a function to dump bitmap as a list of extents 2008-11-29 23:27 ok 2008-11-29 23:27 for the atom table... a function to dump the atom dictionary 2008-11-29 23:27 http://userweb.kernel.org/~hirofumi/tux3.img.dot.png 2008-11-29 23:28 I should delete the vtable 2008-11-29 23:28 we aren't going to use it 2008-11-29 23:28 on wait 2008-11-29 23:28 version table ;) 2008-11-29 23:28 yes we are 2008-11-29 23:29 ?? 2008-11-29 23:29 never mind ;) 2008-11-29 23:29 :) 2008-11-29 23:30 ACTION should really learn fast... 2008-11-29 23:30 well, I would have to add bitmap graph to tux3graph 2008-11-29 23:31 however, dirent would be after phtree 2008-11-29 23:31 hmm.. 2008-11-29 23:32 tux3graph dumps only, super, inode btree, and data btree 2008-11-29 23:32 hirofumi, when we add extent map, it would be useful to see the bitmap details I think 2008-11-29 23:33 extent map? it is dleaf? 2008-11-29 23:33 hirofumi, what else is there? 2008-11-29 23:33 extent map is for later, it is free extents 2008-11-29 23:33 bitmap data, dirents, and version 2008-11-29 23:33 ah 2008-11-29 23:34 replacement of bitmap data 2008-11-29 23:34 yes 2008-11-29 23:34 sometimes extents are much more efficient than bitmaps, sometimes bitmaps are more efficient 2008-11-29 23:34 looks like 2008-11-29 23:34 so for each 128 MB region, we will either represent with a bitmap, or an extent map 2008-11-29 23:35 oh 2008-11-29 23:35 an optimization for a few months from now 2008-11-29 23:35 it is like zfs? 2008-11-29 23:35 original for tux3 2008-11-29 23:35 zfs uses only free extents? 
2008-11-29 23:36 yes 2008-11-29 23:36 i see 2008-11-29 23:36 for highly fragmented filesystems, bitmaps are more efficient, and they were also much easier to start with, plus I already had code from zumastor 2008-11-29 23:37 i see 2008-11-29 23:37 ok, well, if I add some circle (without detail) for undumped stuff, it would helps 2008-11-29 23:37 yes 2008-11-29 23:37 it's already beautiful ;) 2008-11-29 23:38 and when I define the commit log structure, that will be a nice addition 2008-11-29 23:38 and very useful 2008-11-29 23:38 oh, yes 2008-11-29 23:38 to be able to see it 2008-11-29 23:39 ok, I'll add some circle later 2008-11-29 23:39 brb 2008-11-29 23:43 -!- Bobby_(~Bobby@122.162.71.188) has joined #tux3 2008-11-30 00:35 -!- Bobby__(~Bobby@122.162.67.201) has joined #tux3 2008-11-30 02:23 anybody know a bash command for doing fsync on a file/dir? 2008-11-30 02:24 I don't know, but, maybe, perl -e 'fsync("/foo/bar")' 2008-11-30 02:25 seems reasonable. It's funny there isn't a command utility way 2008-11-30 02:25 I just wrote my own fsync in c 2008-11-30 02:25 coreutils may have it 2008-11-30 02:26 dd seems have 2008-11-30 02:26 conv=fsync 2008-11-30 02:26 ah 2008-11-30 02:26 thanks 2008-11-30 02:26 also, fdatasync 2008-11-30 02:26 nice 2008-11-30 02:28 root@usermode:~# fsync /mnt 2008-11-30 02:28 >>> ext2_sync_dir 09851974 2008-11-30 02:28 >>> flush deferred dentry 09854c28 2008-11-30 02:28 >>> flush deferred dentry 09854a9c 2008-11-30 02:28 linux doesn't seem to do any fsync like thing on directories by default 2008-11-30 02:29 instead it just writes out the directory pages 2008-11-30 02:29 no hooks for doing a better sync, except sync_fs 2008-11-30 02:29 no hooks for per-inode sync 2008-11-30 02:30 I guess syncing to disk is a relatively new technology ;) 2008-11-30 02:30 I think it depends on i_op->fsync 2008-11-30 02:30 vfs flushes data pages 2008-11-30 02:31 I think fsync is only called if sys_fsync is called 2008-11-30 02:31 whoops, it is filp->f_op->fsync 
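A homegrown fsync helper of the kind mentioned above could look something like this. It is a minimal sketch, not the tool actually used in the log: the name `fsync_path` is made up for illustration, and it assumes a Linux-like system where a directory opened with O_RDONLY can be passed to fsync(2).

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a file or directory read-only and fsync(2) it.
 * Returns 0 on success, -1 on failure with errno set by open() or fsync().
 */
int fsync_path(const char *path)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	int err = fsync(fd);
	int saved = errno;   /* close() may overwrite errno */
	close(fd);
	errno = saved;
	return err < 0 ? -1 : 0;
}
```

Wrapped in a small main() that passes argv[1], this would give a command along the lines of the `fsync /mnt` invocation shown in the log. For regular files, `dd` with `conv=fsync` or `conv=fdatasync` is the coreutils route suggested above.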
2008-11-30 02:31 it is not called by sys_sync 2008-11-30 02:31 yes 2008-11-30 02:31 "fsync" command does fsync(2)? 2008-11-30 02:31 and I do not see why it is not 2008-11-30 02:32 yes 2008-11-30 02:32 fsync doesn't garantee to sync filesystem 2008-11-30 02:32 it just do for one file? 2008-11-30 02:32 no, but sys_sync should call ->fsync in my opinion 2008-11-30 02:32 anyway 2008-11-30 02:32 I don't have to make this work reliably in ext2 2008-11-30 02:33 I can just rely on my fsync /mnt/somedir command 2008-11-30 02:33 to prove the concept 2008-11-30 02:33 yes, on ext2, it is not reliable 2008-11-30 02:34 so my plan is, if I can make deferred creates and unlinks work reliably on ext2+fsync, then we can make it work properly in tux3 2008-11-30 02:35 vfs keeps a list of dirty inodes per sb 2008-11-30 02:35 so we don't need to invent a new list I think 2008-11-30 02:35 um... it would depend on how reliable 2008-11-30 02:35 I hope, very 2008-11-30 02:36 I'm trying to do a good analysis of it 2008-11-30 02:36 it is looking promising 2008-11-30 02:36 I hoped to know for sure by today, but it looks like it will be tomorrow 2008-11-30 02:36 if perfect, I think only way is full sync on ext2 2008-11-30 02:36 well, plus some vfs hack 2008-11-30 02:36 and block new operations 2008-11-30 02:37 because sys_sync does not do anything like ->fsync 2008-11-30 02:37 yes 2008-11-30 02:37 it doesn't flush the dentry cache or anything 2008-11-30 02:37 it try to flush all buffers 2008-11-30 02:37 there's no concept of flushing the dentry cache on linux at all right now 2008-11-30 02:37 vfs just throws it away 2008-11-30 02:37 yes, dcache is pure cache 2008-11-30 02:37 and I think it's more useful than that 2008-11-30 02:38 it can be a dirty cache as well, without many changes 2008-11-30 02:38 it's complex, messy code as you know 2008-11-30 02:38 so I could be wrong about that ;) 2008-11-30 02:38 yes, if we start to cache, readdir 2008-11-30 02:38 yes, if we start to cache readdir 
2008-11-30 02:39 I think the fact that ramfs works reliably proves the concept will work 2008-11-30 02:39 um... 2008-11-30 02:39 ramfs doesn't have backend storage 2008-11-30 02:40 ->sync_fs is enough of a hook to do our backend flush 2008-11-30 02:40 right, it has no backend, but it also never loses any dentry state 2008-11-30 02:40 tux3 will do 2008-11-30 02:40 I think ext2 is hard 2008-11-30 02:41 yes, I will only test with fsync on ext2 2008-11-30 02:41 not sys_sync 2008-11-30 02:41 ah 2008-11-30 02:41 if fsync works reliably on ext2, then sys_sync will work on tux3 2008-11-30 02:41 because we will have our own ->sync_fs 2008-11-30 02:41 like ext3 does 2008-11-30 02:42 actually, I can probably make this work on ext3 2008-11-30 02:42 but that would be a distraction 2008-11-30 02:42 yes 2008-11-30 02:42 one nice thing... I don't have to do any trickery with d_revalidate 2008-11-30 02:42 no need to use that 2008-11-30 02:43 I don't know posix guarantees about parent dir recursively 2008-11-30 02:43 on fsync 2008-11-30 02:43 none 2008-11-30 02:43 no guarantee 2008-11-30 02:43 if so, we can lose it? 2008-11-30 02:43 fsync'ed file 2008-11-30 02:44 I don't think we can, just posix doesn't make a guarantee 2008-11-30 02:44 posix doesn't have anything to say about journalling filesystems even 2008-11-30 02:44 yes 2008-11-30 02:44 it defines sync though 2008-11-30 02:44 kind of loosely 2008-11-30 02:45 this whole area is pretty loose 2008-11-30 02:45 well, so, test of fsync also doesn't guarantee about parent dir? 2008-11-30 02:45 my test?
2008-11-30 02:45 yes 2008-11-30 02:46 right, I don't care about that on ext2 2008-11-30 02:46 ok 2008-11-30 02:46 I just want to be sure that the dentry states are always correct 2008-11-30 02:46 if so, fsync test sounds good 2008-11-30 02:46 and that the fs always does the iput at the right time etc 2008-11-30 02:46 i see 2008-11-30 02:46 I found I needed two new bits of dentry state 2008-11-30 02:47 wrong and something mentioned on irc? 2008-11-30 02:47 #define DCACHE_BACKED 0x0020 /* FS has dirent */ 2008-11-30 02:47 #define DCACHE_WRONG 0x0040 /* FS has wrong dirent */ 2008-11-30 02:47 yes 2008-11-30 02:47 i see 2008-11-30 02:47 the "WRONG" state looks kind of funny ;) 2008-11-30 02:47 but it accurately describes the situation 2008-11-30 02:48 I don't understand those yet though 2008-11-30 02:48 I will make a state transition diagram 2008-11-30 02:48 it will help 2008-11-30 02:48 I hope you post some patch for deferred 2008-11-30 02:49 "WRONG" is what happens when user deletes a file then creates a new one of the same name, without a sync in between 2008-11-30 02:49 yes 2008-11-30 02:49 it's starting to look like a reasonable patch 2008-11-30 02:49 with it, I think I can try to understand it 2008-11-30 02:49 ok 2008-11-30 02:50 BACKED bit is set on the dentry whenever read_lookup succeeds 2008-11-30 02:50 sorry 2008-11-30 02:50 BACKED bit is set on the dentry whenever real_lookup succeeds 2008-11-30 02:50 and after a delta transition in tux3 2008-11-30 02:51 after a deferred create, the delta transition will do the real create, and set the BACKED bit in the dentry 2008-11-30 02:51 unlink just clears the BACKED bit 2008-11-30 02:52 and sets the dentry negative 2008-11-30 02:52 it doesn't even have to put the dentry on a list, like I thought 2008-11-30 02:52 negative?
2008-11-30 02:52 yes, unlink turns a BACKED dentry into a negative, !BACKED dentry 2008-11-30 02:52 sorry 2008-11-30 02:52 ah, unlink 2008-11-30 02:53 yes, unlink turns a BACKED dentry into a negative dentry that is still BACKED 2008-11-30 02:53 unlink just clears the BACKED bit <- that was wrong, it leaves the BACKED bit set 2008-11-30 02:53 in the case of "create", set BACKED is used? 2008-11-30 02:53 in the case of "create", same BACKED is used? 2008-11-30 02:54 create does not set the BACKED bit 2008-11-30 02:54 ah 2008-11-30 02:54 only real_lookup and delta transition do 2008-11-30 02:54 -!- Bobby_(~Bobby@122.162.67.201) has joined #tux3 2008-11-30 02:54 in delta staging, dentry cache is flushed to directory block cache 2008-11-30 02:54 a cache-to-cache flush 2008-11-30 02:55 I think I see what are you trying more or less 2008-11-30 02:55 I think I see more or less what are you trying 2008-11-30 02:55 good, because I didn't really see it before today ;) 2008-11-30 02:55 :) 2008-11-30 02:55 today I made a state transition diagram and it started to make sense 2008-11-30 02:56 time to sleep... 
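The dentry states described in this exchange can be summarized as a small decision function. This is a sketch under stated assumptions, not the patch discussed in the log: the two flag values are the ones quoted above, but `enum flush_action` and `deferred_action()` are hypothetical names, and the function assumes that a create over an unlinked but still BACKED name marks the dentry both BACKED and WRONG.

```c
#include <stdbool.h>

#define DCACHE_BACKED 0x0020 /* FS has dirent */
#define DCACHE_WRONG  0x0040 /* FS has wrong dirent */

/* Hypothetical: what the back end flush owes a single cached dentry. */
enum flush_action { FLUSH_NONE, FLUSH_CREATE, FLUSH_REMOVE, FLUSH_UPDATE };

/*
 * positive: the dentry currently names an inode.
 * flags:    the dentry's DCACHE_* bits.
 */
static enum flush_action deferred_action(bool positive, unsigned flags)
{
	bool backed = (flags & DCACHE_BACKED) != 0;
	bool wrong = (flags & DCACHE_WRONG) != 0;

	if (positive && !backed)
		return FLUSH_CREATE;  /* deferred create: no dirent yet */
	if (!positive && backed)
		return FLUSH_REMOVE;  /* deferred unlink: stale dirent remains */
	if (positive && wrong)
		return FLUSH_UPDATE;  /* unlink then create of the same name */
	return FLUSH_NONE;            /* cache and directory block agree */
}
```

A directory flush would then walk the cached children of the directory, call something like this on each one, and set or clear BACKED and WRONG after the real operation completes.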
2008-11-30 02:56 well, perhaps, I can read c than english :) 2008-11-30 02:56 good night 2008-11-30 03:18 folks 2008-11-30 03:22 -!- Bobby__(~Bobby@122.162.72.50) has joined #tux3 2008-11-30 04:30 -!- Bobby_(~Bobby@122.162.71.27) has joined #tux3 2008-11-30 06:45 -!- pgquiles_(~pgquiles@139.Red-81-38-97.dynamicIP.rima-tde.net) has joined #tux3 2008-11-30 07:22 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-30 08:34 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-30 11:04 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-30 11:33 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-30 12:35 -!- pgquiles(~pgquiles@78.Red-83-44-239.dynamicIP.rima-tde.net) has joined #tux3 2008-11-30 13:18 -!- ollebull(~olle@ip6-43.bon.riksnet.se) has joined #tux3 2008-11-30 13:18 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-30 13:34 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-30 13:55 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-11-30 13:57 http://userweb.kernel.org/~hirofumi/tux3-new.png <- sweet 2008-11-30 14:19 nice 2008-11-30 14:19 how was that generated 2008-11-30 14:20 ? 2008-11-30 14:25 /usr/bin/dot 2008-11-30 14:27 graphviz 2008-11-30 14:28 a hack by hirodumi to change the tux3 frontend to generate graphics 2008-11-30 15:12 ugh arrays in bash behaving badly 2008-11-30 15:13 bash arrayes are weird 2008-11-30 15:17 yeah, indeed 2008-11-30 15:24 http://tux3.org/ <- part of a quick start page? 2008-11-30 15:24 whoops 2008-11-30 15:24 damm frames 2008-11-30 15:24 http://tux3.org/pipermail/tux3/2008-November/000351.html 2008-11-30 15:39 flips: first time uml-user to say the truth. With your image, i get an OOM right away, and when I add mem=128M to the command line it just hangs. Any hints? 2008-11-30 15:40 running 64 bit? 
2008-11-30 15:40 yes 2008-11-30 15:40 doesn't work 2008-11-30 15:40 somebody needs to make a 64 bit root 2008-11-30 15:47 oh wow, yours is still potato-based. That was my first debian 2008-11-30 15:55 yes, it's ancient 2008-11-30 15:55 I wonder if I remembered to delte the bdflush stuff 2008-11-30 15:56 it's like a jalopy with a v8 2008-11-30 15:56 has a recent libc 2008-11-30 16:23 data, you don't need to use my root filesystem of course, you can use a real partition 2008-11-30 16:23 yeah, i know. i am already running debootstrap 2008-11-30 16:23 although on a file, no more space available in lvm 2008-11-30 16:24 proving that "disks are always full" 2008-11-30 16:24 it's important to handle fragmentation well in the 99% full state 2008-11-30 16:26 certainly is, as mine always are :) 2008-11-30 17:06 flips: http://github.com/shapor/bashcms/tree/master 2008-11-30 17:07 what do I do with it? 2008-11-30 17:07 read the docs? 2008-11-30 17:07 run it? 2008-11-30 17:09 if you want 2008-11-30 17:09 can I just admire it and wait for the result? 2008-11-30 17:10 sur 2008-11-30 17:10 e 2008-11-30 17:10 thats what i was thinking ;) 2008-11-30 17:24 :) 2008-11-30 17:46 oh that's some good bash hacks, i'm totally bumming them, shap 2008-11-30 17:47 you should look at zumastor ;) 2008-11-30 17:47 link me up scotty 2008-11-30 17:48 http://code.google.com/p/zumastor/source/browse/trunk/zumastor/bin/zumastor 2008-11-30 17:49 eww that got up to 1722 lines 2008-11-30 17:50 i'm gonna leave this one for later, i better get back to my own (lame) code 2008-11-30 22:19 flips: What does local -r do? 
2008-11-30 22:44 mlankhorts, bashes idea of a local variable 2008-11-30 22:45 man bash ;) 2008-11-30 22:46 -r is readonly 2008-11-30 22:47 Just looked it up 2008-11-30 23:05 -!- RazvanM(~RazvanM@96.234.237.148) has joined #tux3 2008-12-01 00:24 -!- Man_of_Wax(~wax@gualtiero.cs.unibo.it) has joined #tux3 2008-12-01 01:23 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-01 01:39 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-12-01 01:50 -!- pgquiles__(~pgquiles@78.Red-83-44-239.dynamicIP.rima-tde.net) has joined #tux3 2008-12-01 04:36 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-01 04:36 hirofumi: there? 2008-12-01 04:38 flips: yeah, i forgot to reply-to-list... 2008-12-01 04:40 just for fun I tried booting a 32 bit kernel on your 64 bit root 2008-12-01 04:40 VFS: Mounted root (ext2 filesystem) readonly. 2008-12-01 04:40 request_module: runaway loop modprobe binfmt-464c 2008-12-01 04:40 request_module: runaway loop modprobe binfmt-464c 2008-12-01 04:40 request_module: runaway loop modprobe binfmt-464c 2008-12-01 04:40 request_module: runaway loop modprobe binfmt-464c 2008-12-01 04:40 request_module: runaway loop modprobe binfmt-464c 2008-12-01 04:40 Kernel panic - not syncing: Out of memory and no killable processes... 2008-12-01 04:40 that's what i get when booting my 64-bit kernel on your 32-bit root 2008-12-01 04:40 something fragile there 2008-12-01 04:41 anyway, cool I think, I need to fire up a 64 bit machine and try it 2008-12-01 04:41 I'll loopback mount it 2008-12-01 04:43 data, I made a 1 MB journal in my 32 bit rootfs and mount ext3 on it now 2008-12-01 04:43 saves lots of fscking time 2008-12-01 04:45 hi 2008-12-01 04:45 data, amd64 is compatible with intel64 I think 2008-12-01 04:45 hi hirofumi 2008-12-01 04:45 somebody with a core-2 can try it 2008-12-01 04:46 flips: yes, it is. 
just a habit, as x86_64is called amd64 in gentoo 2008-12-01 04:46 i have a core-2, so 2008-12-01 04:46 elf is module? 2008-12-01 04:46 I'm ok with calling it amd64 ;) 2008-12-01 04:47 it seems to be modprobing for something 2008-12-01 04:47 some binfmt 2008-12-01 04:48 maybe, it's elf or elf32 or aout or something 2008-12-01 04:48 well, amazing thing is i getthe same message with your root, using a 64-bit kernel 2008-12-01 04:49 it's a very random error message 2008-12-01 04:49 messages like that usually indicate an error was not picked up and reported much earlier 2008-12-01 04:52 how do i update my tux3 using hg? hg update says no updates.. but im sure there are some... 2008-12-01 04:53 whg pull 2008-12-01 04:53 hg pull 2008-12-01 04:53 ok.. how do i see the differences im pulling in? 2008-12-01 04:53 hg incoming 2008-12-01 04:53 for i in ../../../tux3/user/kernel/*; do ln -sf $i `basename $i`; done; 2008-12-01 04:53 saves a little work 2008-12-01 04:53 thanks ... 2008-12-01 04:54 just execute it in fs/tux3/ 2008-12-01 04:54 ok, that is binfmt-"FL" 2008-12-01 04:54 it should be a part of EFL header 2008-12-01 04:54 data, nice 2008-12-01 04:55 I think it doesn't compiled into that kernel 2008-12-01 04:56 hirofumi: do you think it's an issue with the config? 2008-12-01 04:56 yes 2008-12-01 04:57 CONFIG_ELF or CONFIG_IA32_EMULATION 2008-12-01 04:57 oh, CONFIG_BINFMT_ELF 2008-12-01 04:57 hirofumi, maybe that's all that is needed to run the 32 bit root? 2008-12-01 04:58 data, how about we call yours tuxroot64? 2008-12-01 04:58 yes, if kernel is 64bit 2008-12-01 04:58 ok, i'll rename it in the tar 2008-12-01 04:59 CONFIG_IA32_EMULATION is needed to load ia32 binary on x86_64 2008-12-01 05:00 funny it's not the default 2008-12-01 05:02 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-12-01 05:02 do you have any plan to start phtree? 
2008-12-01 05:03 I've worked on the detailed design 2008-12-01 05:03 I think, after atomic commit and versioning 2008-12-01 05:03 oh, I thought it was already done 2008-12-01 05:03 not yet 2008-12-01 05:03 i see 2008-12-01 05:03 it will use the generic btree code 2008-12-01 05:03 which needs to be generalized a little me 2008-12-01 05:03 a little more 2008-12-01 05:04 i see 2008-12-01 05:04 I think it is something ext4 can use also 2008-12-01 05:05 I have deferred dentry delete working 2008-12-01 05:05 yes 2008-12-01 05:05 next I need to add deferred inode deleting 2008-12-01 05:05 and then deferred dentry create and inode create 2008-12-01 05:06 oh 2008-12-01 05:06 I will post a patch when the deferred inode delete works 2008-12-01 05:06 before doing the deferred creates 2008-12-01 05:06 to get some feedback 2008-12-01 05:06 I think my locking is all messed up ;) 2008-12-01 05:07 anyway 2008-12-01 05:07 I'd like to see it 2008-12-01 05:07 oh :) 2008-12-01 05:07 time to sleep :) 2008-12-01 05:07 ok :) 2008-12-01 05:07 good night 2008-12-01 05:08 good night 2008-12-01 05:09 night 2008-12-01 05:29 -!- pgquiles__(~pgquiles@78.Red-83-44-239.dynamicIP.rima-tde.net) has joined #tux3 2008-12-01 06:42 -!- mlankhorst(~m@fw1.astro.rug.nl) has left #tux3 2008-12-01 06:54 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-12-01 06:55 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-12-01 07:13 -!- bain(~bain@59.95.30.42) has joined #tux3 2008-12-01 07:13 hi can anyone tell be git clone url for latest kernel port of tux3? 2008-12-01 07:13 and for that matter latest hg clone url for main stuff 2008-12-01 07:13 the site is very confusing :p 2008-12-01 07:14 git clone http://phunq.net/ddtree does not work 2008-12-01 07:15 there is not published git-tree 2008-12-01 07:15 the bandwidth is somewhat limited 2008-12-01 07:15 ok so what can i do? 2008-12-01 07:15 to get started on hacking the latest tux3? 
2008-12-01 07:15 http://tux3.org/pipermail/tux3/2008-November/000380.html 2008-12-01 07:15 look into that 2008-12-01 07:16 there are a few hints as what is to do 2008-12-01 07:16 i am currently chasing a bug in the xattrs. it's not setting an atom correcly somewhere 2008-12-01 07:16 dunno where 2008-12-01 07:17 data: ok, thanks 2008-12-01 07:17 http://tux3.org/pipermail/tux3/2008-November/000378.html 2008-12-01 07:17 there it is 2008-12-01 07:19 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-12-01 07:21 hirofumi: i am planning on hacking kernel tux3... i read in daniel's post that lot of the operations are to be completed 2008-12-01 07:21 yes 2008-12-01 07:21 can you suggest one i can start with which doesn't conflict with what you are doing? 2008-12-01 07:21 hi hirofumi. i am looking into a problem with xattrs. I was trying to implement the listxattr interface, when I noticed that the get_xattr does not work anymore 2008-12-01 07:21 :) 2008-12-01 07:23 well, I'm leaving the rest handers as is for now 2008-12-01 07:23 funny thing is: i can't graph the situation on disk as tux3graph segfaults, so i'm suspecting set_xattr actually 2008-12-01 07:23 hirofumi: i was thinking about .rename = ext3_rename, 2008-12-01 07:23 hirofumi: oh ok 2008-12-01 07:23 oh 2008-12-01 07:23 rename would be most complex stuff 2008-12-01 07:24 data, which function? 2008-12-01 07:25 hirofumi: this is with fuse rightnow, but i don't think it is a fuse-problem 2008-12-01 07:25 hirofumi: :) ok ... i will take a look and figure out what to start with and drop you a line as well. 
2008-12-01 07:25 when i do an attr -s foo -v asd test/test 2008-12-01 07:25 and then an attr -g foo test/test 2008-12-01 07:26 data@desktop ~/programming/tux3/user $ attr -g foo test/test 2008-12-01 07:26 attr_get: No data available 2008-12-01 07:26 Could not get "foo" for test/test 2008-12-01 07:26 i get this 2008-12-01 07:26 bain, ok 2008-12-01 07:26 data, yes 2008-12-01 07:26 I think fuse also doesn't have xattr implement 2008-12-01 07:26 although debugging shows: 2008-12-01 07:26 atom 001 => 0x1509908: 61 73 64 0a "asd." 2008-12-01 07:27 hirofumi: i think it does, at least somethings there :) 2008-12-01 07:27 and it's not really more than a find_atom 2008-12-01 07:27 oh, it have "warn("not implemented");" :) 2008-12-01 07:28 not in my version? 2008-12-01 07:28 ah, getxattr and setxattr 2008-12-01 07:28 it was listxattr 2008-12-01 07:28 yeah. i was trying to implement that, when i noticed the problem with setxattr 2008-12-01 07:28 flips changed it recently 2008-12-01 07:29 so, he may break it 2008-12-01 07:29 well, if you don't know anything specific i'll keep looking 2008-12-01 07:29 yes, I don't know about fuse 2008-12-01 07:30 it's probably more related to xattr and on-disk-structure 2008-12-01 07:30 nothign to do with fuse, really 2008-12-01 07:30 yes 2008-12-01 07:30 it can't find the atom 2008-12-01 07:31 although it created it 2008-12-01 07:31 and tux3graph crashes 2008-12-01 07:31 maybe, related to get_xattr/set_xattr 2008-12-01 07:31 tux3graph crash? 2008-12-01 07:31 tux3graph doesn't read xattr 2008-12-01 07:32 how can I reproduce it? 
2008-12-01 07:32 with a segfault 2008-12-01 07:35 sorry, my internet connection is really buggy today 2008-12-01 07:35 so 2008-12-01 07:35 just do a make mkfs 2008-12-01 07:36 make debug 2008-12-01 07:36 touch test/test 2008-12-01 07:36 attr -s foo -V bar test/test 2008-12-01 07:36 attr -g foo test/test 2008-12-01 07:37 and inspect the fs 2008-12-01 07:37 it should be faulty at that point 2008-12-01 07:37 it seems like when the atom is read, the dirent->name isn't set 2008-12-01 07:37 $2 = {inum = 33554432, rec_len = 16, name_len = 8 '\b', type = 0 '\0', name = 0x1b76208 ""} 2008-12-01 07:37 it does a tux_matchon that with name 2008-12-01 07:38 i see 2008-12-01 07:49 any ideas? 2008-12-01 07:49 "tux3 set" also doesn't work 2008-12-01 07:49 atom directory seems strange 2008-12-01 08:01 ok, it seems not to write directory correctly 2008-12-01 08:02 I've put fixed tux3graph for inode has xattr 2008-12-01 08:02 http://userweb.kernel.org/~hirofumi/tux3graph.c 2008-12-01 08:03 url is temporary fix of tux3graph.c 2008-12-01 08:24 man, myinternet feels like i live in the third world 2008-12-01 08:25 :) 2008-12-01 08:25 regarding tux3graph.c: you have the free((char *)volname); there in the end 2008-12-01 08:25 *** glibc detected *** ./tux3graph: free(): invalid pointer: 0x00007fff51ebfe4d *** 2008-12-01 08:25 this happens on my system 2008-12-01 08:25 oh 2008-12-01 08:26 maybe, popt returns non-malloced memory 2008-12-01 08:26 tux3graph.c: In function ‘draw_cursor’: 2008-12-01 08:26 tux3graph.c:175: error: ‘struct cursor’ has no member named ‘path’ 2008-12-01 08:26 oops 2008-12-01 08:27 ;) 2008-12-01 08:27 I've created patch on my tree 2008-12-01 08:28 just copy around struct inode in draw_ileaf 2008-12-01 08:29 not surewhatyou mean 2008-12-01 08:29 ok, I'll create new patch 2008-12-01 08:30 thanks 2008-12-01 08:30 updated 2008-12-01 08:34 *** glibc detected *** ./tux3graph: free(): invalid pointer: 0x00007fff370bee6c *** 2008-12-01 08:34 still get that 2008-12-01 08:34 
just ignore it 2008-12-01 08:34 probably, difference of popt version 2008-12-01 08:34 commenting out works, though 2008-12-01 08:34 probably 2008-12-01 08:34 yes 2008-12-01 08:35 free() is just for valgrind 2008-12-01 08:35 btw, which version are you using? 2008-12-01 08:35 my popt is 1.14 2008-12-01 08:36 that version seems to work 2008-12-01 08:36 in that version, poptGetArg() returns malloc'ed memory always 2008-12-01 08:37 1.10.7 2008-12-01 08:37 a little older 2008-12-01 08:37 ok 2008-12-01 08:37 might be because of that 2008-12-01 08:37 it seems older version didn't 2008-12-01 08:37 well, i'll just ignore it 2008-12-01 08:37 thanks anyway 2008-12-01 08:37 I'll remove free() later 2008-12-01 08:50 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-01 08:50 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-12-01 08:50 data, still there? 2008-12-01 08:51 just a hint: directory entries seems to be broken 2008-12-01 08:51 good luck :) 2008-12-01 08:55 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-12-01 09:17 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-01 09:26 hirofumi: thanks 2008-12-01 09:27 girlfriend came over, but i send her away ;) 2008-12-01 09:41 hirofumi: do you know of any description of the on-disk-format or is it all in the sources? :) 2008-12-01 09:42 I know a bit 2008-12-01 09:42 atable is directory 2008-12-01 09:43 and another metadata is written beyond directory size 2008-12-01 09:45 well, where would be the best way to start debugging that problem? or the directory entries? I want to learn about it anyway :) 2008-12-01 09:46 gotta to a small filesystem for university, too, so it won't hurt 2008-12-01 09:46 do you know about ext2 directory format? 
2008-12-01 09:46 yeah, pretty much 2008-12-01 09:46 atable is also using it to store name and stom 2008-12-01 09:46 atom 2008-12-01 09:47 inum field is used for atom 2008-12-01 09:47 another metadata is original 2008-12-01 09:48 ok, so the atable is just treated a little differently, but it is build the same as any other directory? 2008-12-01 09:48 i guess the same goes for the other ?tables too 2008-12-01 09:48 http://lwn.net/Articles/300416/ 2008-12-01 09:49 I think it's atom doc 2008-12-01 09:49 yes 2008-12-01 09:49 it is 2008-12-01 09:49 those doesn't have S_IFDIR and i_size though 2008-12-01 09:49 ah, no 2008-12-01 09:50 those uses blocks beyond i_size 2008-12-01 09:51 atomref_base and unatom_base fileds are not initialized properly 2008-12-01 09:52 the cause may be it 2008-12-01 09:52 could be. well thanks for the first hints, i am reading the design docs (once more, forgot most of it) 2008-12-01 09:53 ok 2008-12-01 10:10 those two are initialized in super.c 2008-12-01 10:11 make_tux3()? 2008-12-01 10:11 no, yes 2008-12-01 10:11 yes 2008-12-01 10:11 sorry 2008-12-01 10:16 make_tux3() is used for mkfs only 2008-12-01 10:16 I think we should initialize it on others too 2008-12-01 10:16 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-12-01 10:19 isn't the sb saved by make_tux3 and thusly loaded by fuse/kernel/whatever as the sb? 2008-12-01 10:20 no 2008-12-01 10:20 those fields are not stored 2008-12-01 10:20 at least for now 2008-12-01 10:20 unpack_sb() loads disksuper 2008-12-01 10:21 so it should go in kernel/super.c? 
2008-12-01 10:21 probably 2008-12-01 10:22 I'm not sure about it 2008-12-01 10:22 well, it should be unpack_sb() for now 2008-12-01 10:23 works 2008-12-01 10:23 if i do that 2008-12-01 10:23 ok 2008-12-01 10:26 patch should go to flips :) 2008-12-01 10:27 will send it to him soon :) 2008-12-01 10:27 thanks for the help 2008-12-01 10:27 no problem 2008-12-01 10:27 was actually your fix :) 2008-12-01 10:27 don't care about it :) 2008-12-01 10:29 you fixed free() of tux3graph instead of it :) 2008-12-01 10:32 how did you create your patches? i usually use git 2008-12-01 10:32 I'm using my scripts 2008-12-01 10:32 for hg, I just use my hg tree via http 2008-12-01 10:35 btw, if you want to create to push, "hg export" works like "git format-patch" 2008-12-01 10:35 s/create/create the patch/ 2008-12-01 10:56 ok, thanks 2008-12-01 10:56 hope this is correct, but it looks like your earlier mails 2008-12-01 11:00 maybe, one problem 2008-12-01 11:00 thunderbird seems to be converting tab to space 2008-12-01 11:02 oh, that sucks... 2008-12-01 11:02 wtf... 2008-12-01 11:03 yes, many recent mailers do it 2008-12-01 11:07 for whatever reason... 2008-12-01 11:07 well 2008-12-01 11:09 i just noticed that isn't all 2008-12-01 11:09 balloc_extent_from_range: Failed assertion "run == blocks"! 
2008-12-01 11:09 that happens when i create a second xattr 2008-12-01 11:10 well, i'll post my patch for listxattr anyway and then look further into that 2008-12-01 11:10 it has nothing to do with my implementation 2008-12-01 11:11 sb->unatom_base = sb->unatom_base + (1 << (34 - sb->blockbits)); 2008-12-01 11:11 it seems strange 2008-12-01 11:11 I think sb->unatom_base == 0 here 2008-12-01 11:12 maybe, it should be, 2008-12-01 11:12 sb->unatom_base = sb->atomref_base + (1 << (34 - sb->blockbits)); 2008-12-01 11:13 hmm 2008-12-01 11:13 yes, it should be 2008-12-01 11:13 at least that's how i read the design document 2008-12-01 11:14 maybe, those paths were not tested until now 2008-12-01 11:14 and i think it's not thunderbird replacing the tabs but hg or my terminal 2008-12-01 11:14 copy and paste? 2008-12-01 11:15 ok, it's my terminal 2008-12-01 11:17 linux/Documentation/SubmittingPatches and linux/Documentation/email-clients.txt 2008-12-01 11:17 those are docs for those problems 2008-12-01 11:18 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-01 11:19 thanks 2008-12-01 11:19 guess i did it wrong again :) 2008-12-01 11:19 :) 2008-12-01 11:23 ok, amazing thing is: if i continue after the assert, everything works 2008-12-01 11:23 data@desktop ~/programming/tux3/user $ attr -l test/test 2008-12-01 11:23 Attribute "foo" has a 3 byte value for test/test 2008-12-01 11:23 Attribute "bla" has a 6 byte value for test/test 2008-12-01 11:24 data@desktop ~/programming/tux3/user $ touch test/test && attr -s foo -V bar test/test && attr -s bla -V blabla test/test 2008-12-01 11:24 for this 2008-12-01 11:24 oh 2008-12-01 11:24 balloc is block allocation 2008-12-01 11:24 i know 2008-12-01 11:25 it may allocate wrong block or just assert() is wrong 2008-12-01 11:25 if (++run < blocks) 2008-12-01 11:25 continue; 2008-12-01 11:25 assert(run == blocks); 2008-12-01 11:25 it's this code 2008-12-01 11:25 run would have to be greater than blocks 2008-12-01 11:26 
um... 2008-12-01 11:28 good morning 2008-12-01 11:28 oh 2008-12-01 11:28 morning 2008-12-01 11:28 oh 2008-12-01 11:28 found it 2008-12-01 11:28 run == blocks 2008-12-01 11:28 in the first run 2008-12-01 11:28 morning 2008-12-01 11:28 blocks == 0? 2008-12-01 11:29 yes 2008-12-01 11:29 i see 2008-12-01 11:29 should assert on blocks == 0 ? 2008-12-01 11:30 this is line 200 in kernel/balloc.c 2008-12-01 11:30 I think so 2008-12-01 11:30 caller would be wrong 2008-12-01 11:31 http://rafb.net/p/nFPOJE61.html 2008-12-01 11:32 this is the full debugging output 2008-12-01 11:32 flips: you're more knowledgeable in this area. in the patch i posted on the list: should it be sb->unatom_base = sb->atomref_base + (1 << (34 - sb->blockbits));? 2008-12-01 11:33 hirofumi suggested so 2008-12-01 11:33 line number? 2008-12-01 11:34 in xattr.c? 2008-12-01 11:34 ah 2008-12-01 11:34 I can find it 2008-12-01 11:34 it's in user/super.c : L96 2008-12-01 11:34 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-12-01 11:36 yes 2008-12-01 11:36 :p 2008-12-01 11:37 btw, we should store those? 2008-12-01 11:40 ok, maybe, it should check limit like tux3_get_block() 2008-12-01 11:40 store what? 2008-12-01 11:40 atomref_base and unatom_base 2008-12-01 11:40 as on disk metadata instead of assuming? 2008-12-01 11:41 we don't need to 2008-12-01 11:41 ah 2008-12-01 11:41 disksuper didn't have it 2008-12-01 11:41 we don't need to store it if it never changes 2008-12-01 11:42 yes 2008-12-01 11:43 btw, filemap_extent_io seems:128 2008-12-01 11:44 next_extent = NULL; 2008-12-01 11:44 continue; 2008-12-01 11:44 it seems it should have those lines to check limit? 2008-12-01 11:46 ? 2008-12-01 11:46 run == blocks problem 2008-12-01 11:46 where in filemap.c? 
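The blocks == 0 case behind the failed "run == blocks" assertion is easy to reproduce outside tux3. A minimal sketch, not the kernel/balloc.c code: find_free_run and its bitmap layout are invented here, and only the three-line run-counting idiom comes from the log. Rejecting a zero-length request before the loop is one possible reading of "should assert on blocks == 0":

```c
#include <stddef.h>

/* Hypothetical stand-in for a bitmap allocator scan: return the index of
 * the first run of `blocks` clear bits in map[0..nbits), or -1 if there
 * is none. A request for zero blocks is rejected up front, because with
 * blocks == 0 the first ++run already overshoots and "run == blocks"
 * can never hold. */
static long find_free_run(const unsigned char *map, size_t nbits, unsigned blocks)
{
	unsigned run = 0;

	if (blocks == 0)
		return -1; /* caller bug: nothing to allocate */

	for (size_t i = 0; i < nbits; i++) {
		int used = (map[i >> 3] >> (i & 7)) & 1;

		if (used) {
			run = 0;
			continue;
		}
		if (++run < blocks)
			continue;
		/* run == blocks here: run grows by one per free bit */
		return (long)(i + 1 - blocks);
	}
	return -1;
}
```

With the one-byte map 0x0b (bits 0, 1 and 3 in use), a request for 1 block lands on bit 2 and a request for 3 blocks on bit 4, while requests for 0 or 5 blocks both return -1.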
2008-12-01 11:46 line 128 2008-12-01 11:47 kernel/filemap.c 127 2008-12-01 11:48 the log says, index == 0, limit == 1 2008-12-01 11:48 but, it try to get the extent of index == 1 2008-12-01 11:49 so you want to set next_extent to NULL when limit is reached, ok? 2008-12-01 11:49 or? 2008-12-01 11:49 yes 2008-12-01 11:50 in the condition if (start > dwalk_index(walk)) ? 2008-12-01 11:50 makes sense to me 2008-12-01 11:50 no 2008-12-01 11:51 unsigned count = extent_count(*next_extent); 2008-12-01 11:51 if (start > dwalk_index(walk)) 2008-12-01 11:51 count -= start - dwalk_index(walk); 2008-12-01 11:51 index += count; 2008-12-01 11:51 next_extent = NULL; 2008-12-01 11:51 continue; 2008-12-01 11:51 } 2008-12-01 11:52 next_extent is always set, just below 2008-12-01 11:53 before it, with continue, we check "index < limit" 2008-12-01 11:53 I think your suggestion is right 2008-12-01 11:54 thanks 2008-12-01 11:54 it may fix "run == block" problem 2008-12-01 11:54 data, can you try it? 2008-12-01 11:55 and we should assert on blocks == 0 2008-12-01 11:55 yes 2008-12-01 11:56 btw, for now, we don't use sb->max_inodes_per_block 2008-12-01 11:56 it should be removed? 2008-12-01 11:57 I think sb->itable.entires_per_leaf does it? 2008-12-01 11:57 I wonder what is supposed to use it 2008-12-01 11:58 inode table leaf splitting 2008-12-01 11:58 have I hardwired that to 64? 2008-12-01 11:58 yes 2008-12-01 11:58 it should use the value from the sb instead of 64 2008-12-01 11:58 but, not used for now 2008-12-01 11:58 and we should init that to 64 2008-12-01 11:59 with a comment, to make it depend on the block size 2008-12-01 11:59 entries_per_leaf? 2008-12-01 11:59 right 2008-12-01 12:00 entries_per_leaf is initialized by unpack_sb() 2008-12-01 12:00 sb->max_inodes_per_block is not used for now 2008-12-01 12:01 and there's a comment there 2008-12-01 12:01 um.. which file? 2008-12-01 12:01 kernel/ileaf.c:164 2008-12-01 12:03 um.. 
it seems to use entries_per_leaf 2008-12-01 12:04 well, sb->max_inodes_per_block should be used for it? 2008-12-01 12:05 instead of (btree->sb->blocksize / 2)? 2008-12-01 12:05 hirofumi: you wereright, that fixed it 2008-12-01 12:05 sorry for the delay, eating dinner:) 2008-12-01 12:06 ok 2008-12-01 12:16 hirofum, yes 2008-12-01 12:16 it seems to work for write too 2008-12-01 12:16 can you send the patch to flips? 2008-12-01 12:16 i see 2008-12-01 12:17 I think you meant, to the list 2008-12-01 12:17 ah, yes 2008-12-01 12:17 probably, ->max_inodes_per_block should be initialized by unpack_sb() 2008-12-01 12:19 another fileds is sb->entries_per_node and sb->version 2008-12-01 12:19 if my net stays up for longer than 20 seconds... 2008-12-01 12:20 I can also create the patch, however it will be after btree work 2008-12-01 12:20 ok, did you mean i should send it to the list, or that flips should send it there? 2008-12-01 12:21 data, if you need/want it for next work, could you send to list? 2008-12-01 12:22 now, I'm working for another part 2008-12-01 12:22 at it 2008-12-01 12:27 one question: is the continue really needed? 
2008-12-01 12:27 or do you think it's better if there ever is another case included 2008-12-01 12:28 it is needed to check "index < limit" with updated index 2008-12-01 12:28 ah, right 2008-12-01 12:32 I'll be out for an hour 2008-12-01 12:32 data, I think the continue is ok 2008-12-01 12:32 it's similar to the way the rest of that loop works 2008-12-01 12:33 it's not really easy code to understand 2008-12-01 12:33 :-/ 2008-12-01 13:13 ok, finally i have vim as an editor for thunderbird ;) 2008-12-01 13:14 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-01 13:15 good, you are ready for lkml too :) 2008-12-01 13:16 well, i've always wanted to switch to mutt as I use a lot of cli-apps, but i haven't gotten around to it 2008-12-01 13:19 I've read the email with patch, thanks for that :) 2008-12-01 13:23 yeah i really like just attaching to my screen from anywhere and mutt is there 2008-12-01 13:26 when i got imap i kind of let it be 2008-12-01 13:31 flipsout: could you redownload my http://www.jonasfietz.de/tuxroot.amd64.tar.bz2 ? I updated it to include attr 2008-12-01 13:42 http://shapor.com/tux3/irclogs/chart.png 2008-12-01 13:42 #tux3 irc activity 2008-12-01 13:49 data: EPERM 2008-12-01 13:50 I got rid of the ugly frames: http://beta.tux3.org/ 2008-12-01 13:50 and we'll be moving the site to a real server :) 2008-12-01 13:50 so I can download it from there 2008-12-01 13:51 data: oops my bad, EPERM on my end not yours ;) 2008-12-01 13:52 23,967,640 860.97K/s from de->california 2008-12-01 13:53 http://beta.tux3.org/downloads/ 2008-12-01 13:54 oh, beautiful 2008-12-01 13:55 could you update to my name to "Hirofumi Ogawa" or "OGAWA Hirofumi" if possible? 2008-12-01 13:55 in About Us 2008-12-01 13:55 which do you prefer? 
2008-12-01 13:56 I prefer "OGAWA Hirofumi" 2008-12-01 13:56 I use it for email from 2008-12-01 13:56 ok 2008-12-01 13:57 " is not needed actually 2008-12-01 13:57 just OGAWA Hirofumi 2008-12-01 13:57 thanks 2008-12-01 14:01 updated, also added Jonas 2008-12-01 14:01 thanks 2008-12-01 14:09 hmm, any ideas where best to place the function tux3_setxattr for kernel-use? 2008-12-01 14:11 or should i name it differently to prevent a name clash with the one in userspace? 2008-12-01 14:11 I think new file is prefer 2008-12-01 14:12 i was thinking kernel/xattr.c as ext4 puts its listxattr there 2008-12-01 14:13 shapor: when you say moving to a real server, does that imply a git-repo? 2008-12-01 14:13 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-01 14:13 data: soon 2008-12-01 14:13 not yet 2008-12-01 14:13 i have plenty of bandwidth (100Mb) 2008-12-01 14:13 it's just that i keep typing git instead of hg :) 2008-12-01 14:13 merged the patches 2008-12-01 14:14 thanks 2008-12-01 14:14 ext4 handles xattr in xattr.c 2008-12-01 14:14 data, so next time maybe I will pull from your hg? 2008-12-01 14:14 hirofumi: that's what i meant 2008-12-01 14:14 our xattr.c also does 2008-12-01 14:14 flips: first i have to find out how best to publish it 2008-12-01 14:14 however, glue for vfs is xattr_*, iirc 2008-12-01 14:16 data, maybe we should create a login for you on tux3.org and you can push to your own hg repo there? 2008-12-01 14:17 data: you can put your repo in a http accessible dir 2008-12-01 14:17 that would work, too, and is less work for me ;) 2008-12-01 14:17 and then static-http://site/path/to/rpo 2008-12-01 14:17 will work 2008-12-01 14:17 thats how i set up mine 2008-12-01 14:17 data, right, if you have a web server we are all set 2008-12-01 14:18 i have one, let me see 2008-12-01 14:24 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-01 14:28 flips, do you remember delete_from_leaf() in tree_chop()? 
2008-12-01 14:28 remember? 2008-12-01 14:28 it seems to have bug 2008-12-01 14:29 I'm not surprised 2008-12-01 14:29 if (delete_from_leaf(btree, bufdata(leafbuf), info)) 2008-12-01 14:29 mark_buffer_dirty(leafbuf); 2008-12-01 14:29 delete_from_leaf() returns 0 always 2008-12-01 14:30 should it return 1 if it's modified? 2008-12-01 14:30 yes 2008-12-01 14:30 or just return only error? 2008-12-01 14:30 yes a sec 2008-12-01 14:31 yes, it should return 1 => modified 2008-12-01 14:31 wow, I was lazy ;) 2008-12-01 14:31 i see :) 2008-12-01 14:32 so, ->leaf_chop should return 1 or 0 or error? 2008-12-01 14:32 yes 2008-12-01 14:32 ok 2008-12-01 14:33 dleaf_chop does not actually return an error now, but it could in theory 2008-12-01 14:33 I'm trying to fix cursor stuff in tree_chop, then I try to test it by adding truncate() to tux3 2008-12-01 14:34 then, I found this :) 2008-12-01 14:34 yes 2008-12-01 14:34 btw, dleaf_chop doesn't update dleaf->free 2008-12-01 14:35 iirc, we are going to remove dleaf->free? 2008-12-01 14:36 hirofumi, truncate is really not tested at all, and the btree delete is not tested 2008-12-01 14:36 however, the same algorithm has heavily used and tested in zumastor 2008-12-01 14:36 but I have made many small changes without testing 2008-12-01 14:36 i see 2008-12-01 14:36 well, I'll fix those 2008-12-01 14:38 if we are going to remove dleaf->free, I think I can ignore it 2008-12-01 14:56 folks 2008-12-01 14:58 hirofumi, we do need ->free methods 2008-12-01 14:58 i see, ok 2008-12-01 15:00 we're not going to remove it, just maybe remove the internal ->free and ->used fields 2008-12-01 15:00 I haven't decided what is best 2008-12-01 15:01 what is internal meaning? 2008-12-01 15:01 cache? 2008-12-01 15:01 dleaf has two 16 bit fields, .free and .used 2008-12-01 15:02 yes 2008-12-01 15:02 I think it might be better just to use a loop to for the ->free() method 2008-12-01 15:02 however... 
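The return convention settled on above for ->leaf_chop (1 if the leaf was modified, 0 if not, an error otherwise) might look like this in miniature. Everything here is a stand-in: struct toy_leaf, toy_leaf_chop and chop_and_mark are invented for illustration, and the dirty flag only imitates what mark_buffer_dirty would do for a real buffer.

```c
#include <errno.h>

#define TOY_LEAF_MAX 8

/* Toy leaf: a sorted array of keys plus a dirty flag standing in for
 * the buffer state. This is not the tux3 dleaf layout. */
struct toy_leaf {
	unsigned count;
	unsigned long keys[TOY_LEAF_MAX];
	int dirty;
};

/* Drop every key >= chop. Returns 1 if the leaf was modified,
 * 0 if nothing changed, or a negative errno for a corrupt leaf. */
static int toy_leaf_chop(struct toy_leaf *leaf, unsigned long chop)
{
	unsigned keep = 0;

	if (leaf->count > TOY_LEAF_MAX)
		return -EINVAL;
	while (keep < leaf->count && leaf->keys[keep] < chop)
		keep++;
	if (keep == leaf->count)
		return 0;
	leaf->count = keep;
	return 1;
}

/* Caller side: only a positive return marks the leaf dirty, and an
 * error is passed up instead of being mistaken for "modified". */
static int chop_and_mark(struct toy_leaf *leaf, unsigned long chop)
{
	int ret = toy_leaf_chop(leaf, chop);

	if (ret < 0)
		return ret;
	if (ret)
		leaf->dirty = 1;
	return 0;
}
```

One consequence of a three-way return is that a bare `if (chop(...))` test treats a negative error like a modification, so the caller in this sketch checks the sign first.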
2008-12-01 15:03 i see 2008-12-01 15:03 dleaf will work without ->free and ->used if calc it? 2008-12-01 15:03 yes 2008-12-01 15:03 ok 2008-12-01 15:04 for now, I'll just try to fix dleaf_chop 2008-12-01 15:04 and only the dleaf_merge would need an additional loop 2008-12-01 15:04 i see 2008-12-01 15:04 so I think it will be a good change because it makes things more simple, but it isn't needed now and it is a little bit hard to do 2008-12-01 16:22 dleaf_chop seems to work partially 2008-12-01 16:22 ok, time to sleep 2008-12-01 16:22 good night 2008-12-01 16:41 good night! 2008-12-01 16:42 that was a really long one 2008-12-01 16:42 I hope to have something to post on deferred nameops when you wake up 2008-12-01 16:43 hirofumi, and dleaf_chop needs to be scrapped and replaced by the dleaf_walk + dleaf_chop_after interface 2008-12-01 16:50 whoops, I've fixed dleaf_chop partially 2008-12-01 16:50 but, yes, dwalk_probe() + dwalk_base()/dwalk_chop_after sounds good 2008-12-01 16:53 hirofumi, it's not wrong to fix dlead_chop 2008-12-01 16:53 it can work 2008-12-01 16:54 but the dleaf_walk model is what we need for tree_chop 2008-12-01 16:54 more importantly 2008-12-01 16:54 it's what we need when we add the versioning code 2008-12-01 16:55 it means we should change the interface of ->leaf_chop handler? 2008-12-01 16:55 anyway, I don't think you slept last night, is it true? 
2008-12-01 16:55 yes 2008-12-01 16:55 you should ;) 2008-12-01 16:56 :) 2008-12-01 16:56 we need leaf_chop for now, just to start benchmarking on non-versioning loads 2008-12-01 16:56 very important 2008-12-01 16:56 but the model has to be changed to support versioned data in the dleaf and ileaf nodes 2008-12-01 16:57 it will be based on the model of dleaf_walk + chop_after 2008-12-01 16:57 i see 2008-12-01 16:57 this was always planned, and even written in a design note somewhere, but it's buried way deep, sorry 2008-12-01 16:57 it will be _very_ efficient 2008-12-01 16:58 compared to existing, on-key-at-a-time filesystem update models 2008-12-01 16:58 i see. however, I can't see big difference for interface 2008-12-01 16:58 the change can be made incrementally 2008-12-01 16:58 but we want to support inodes that span more than one leaf eventually 2008-12-01 16:59 i see 2008-12-01 16:59 and dleaf blocks where the pointers for one logical address require more than one block 2008-12-01 16:59 so the _walk model is much better for that 2008-12-01 16:59 and then, the higher level code in tree_chop will be affected, a little 2008-12-01 17:00 but we can do this change later 2008-12-01 17:00 for now, with everything strictly per-block, the interface is ok 2008-12-01 17:00 we need to improve dwalk stuff for it? 2008-12-01 17:00 yes, the dwalk will know how to cross btree block boundaries 2008-12-01 17:01 so it will call advance() 2008-12-01 17:01 this issue needs a design note 2008-12-01 17:01 i see, dwalk_ stuff also on one block for now, but it should work like btree stuff? 
2008-12-01 17:02 it has to call an ->advance method I think 2008-12-01 17:03 we will only need this when we get versioning 2008-12-01 17:03 which is a couple months away 2008-12-01 17:03 i see 2008-12-01 17:03 designing it now will be good, and fixing the _chop will be good 2008-12-01 17:04 one way to fix dleaf_chop is by rewriting it using dleaf_walk 2008-12-01 17:04 yes, it seems we should merge dwalk and dleaf stuff 2008-12-01 17:06 and then the code will actually shrink a little bit 2008-12-01 17:06 the dleaf.c code 2008-12-01 17:07 probably, dleaf_dump, and dleaf_chop, maybe dleaf_check 2008-12-01 17:10 well, I'll leave it for now, until I finish the cursor stuff at least 2008-12-01 17:10 sounds good 2008-12-01 17:11 dleaf_chop fix was to test of tree_chop of cursor fix :) 2008-12-01 17:12 I should back to original work 2008-12-01 17:13 good night :) 2008-12-01 17:20 good night :) 2008-12-01 17:21 hirofumi, agreed, and we can propose easier projects than hooking up kernel ops methods, for start projects 2008-12-01 17:42 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-01 18:46 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-12-01 19:41 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-01 20:59 hmm, it just occurred to me, we should change all our breads from: if (!(buffer = bread(inode->map, block))) 2008-12-01 20:59 return -EIO; 2008-12-01 20:59 to something like: 2008-12-01 20:59 if (!(buffer = bread(inode->map, block, &err))) 2008-12-01 20:59 return err; 2008-12-01 21:07 ok, dirent creation seems to be deferred more or less in ext2 2008-12-01 21:07 err, dirent deletion 2008-12-01 21:07 unlink 2008-12-01 21:08 next step is to defer inode deletion 2008-12-01 21:08 ACTION takes aim at ext2_delete_inode 2008-12-01 21:25 wow, I can't believe that ext2/3/4 gets its own flag in i_flags to indicated directory indexed or not 2008-12-01 21:26 well, since I feel some 
ownership of that particular flag, and ext2 doesn't have indexed directories and never will (sorry google) then I will repurpose that one to mean "inode delete deferred" 2008-12-01 21:26 not that there is any shortage of inode flags 2008-12-01 21:27 it just feels good to take that one back ;) 2008-12-01 21:28 #define FS_INODE_DELETED FS_INDEX_FL 2008-12-01 21:56 if (atomic_dec_and_lock(&inode->i_count, &inode_lock)) { 2008-12-01 21:56 ext2_delete_deferred_inode(inode); 2008-12-01 21:56 WARN(">>> failed to delete inode, say your prayers now"); 2008-12-01 22:40 ahem, recycling FS_INDEX_FL might well be a good idea for tux3, but is a dumb idea when hacking ext2 ;) 2008-12-02 00:27 for data deduplication.. we need to write the block to disk and then check if its duplicated... 2008-12-02 00:28 if its duplicated then link to old block and delete the new block... 2008-12-02 00:29 and keep track of how many links there are to each block 2008-12-02 00:29 the main problem here to find if the block is duplicated... 2008-12-02 00:29 yep... 2008-12-02 00:29 so to find if the block is duplicated.. all i can think of is using a hash... 2008-12-02 00:30 but the effect of this on performance... 2008-12-02 00:30 that's the only way I know of 2008-12-02 00:30 im not sure.. 2008-12-02 00:30 well you don't want performance _and_ deduplication do you? ;) 2008-12-02 00:30 :) 2008-12-02 00:30 read performance should not change much 2008-12-02 00:31 yup.. write on the other hand might take a hit... 2008-12-02 00:31 thinking of ways to avoid the hit... 2008-12-02 00:31 will for sure 2008-12-02 00:32 you can't avoid hashing the contents of the block. 
At least use an efficient hash 2008-12-02 00:32 maybe, try to deduplicate only if the disk is x% full 2008-12-02 00:32 and an efficient way of storing the hash 2008-12-02 00:32 hmm 2008-12-02 00:32 another thing you might think about is getting it working before optimizing 2008-12-02 00:32 or run deduplication after disk is x% full 2008-12-02 00:32 :) 2008-12-02 00:32 that is true.. 2008-12-02 00:34 design your hash table first 2008-12-02 00:34 you will probably want to store it in a file 2008-12-02 00:35 hmm 2008-12-02 00:36 and yes, that's not a bad strategy 2008-12-02 00:36 run deduplication has a full volume pass every now and then 2008-12-02 00:36 s/has/as/ 2008-12-02 00:39 reading deduplication discussion in btrfs... 2008-12-02 00:39 val has some interesting points... 2008-12-02 00:40 subject line? 2008-12-02 00:40 Data-deduplication? 2008-12-02 00:41 right, think about your space reservation model 2008-12-02 00:42 goin to lunch... c u soon 2008-12-02 04:50 -!- kbingham_(~kbingham@82-46-4-172.cable.ubr03.aztw.blueyonder.co.uk) has joined #tux3 2008-12-02 05:27 -!- pgquiles(~pgquiles@78.Red-83-44-239.dynamicIP.rima-tde.net) has joined #tux3 2008-12-02 07:28 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-02 07:54 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-12-02 08:41 -!- pgquiles__(~pgquiles@78.Red-83-44-239.dynamicIP.rima-tde.net) has joined #tux3 2008-12-02 09:40 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-12-02 10:05 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-12-02 12:21 -!- mingming(~mingming@32.97.110.55) has joined #tux3 2008-12-02 12:47 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-12-02 13:32 -!- flipz(~phillips@phunq.net) has joined #tux3 2008-12-02 14:06 ok now its been 86400 sec since i dropped the ttl, time to switch the site over to the newone 2008-12-02 14:06 should propagate in < 5 min 2008-12-02 14:07 ˙slowpoke... 
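Going back to the deduplication idea from the start of the day (hash each block, link to the existing copy on a match, count the links), here is a minimal in-memory sketch. It is not tux3 code and nothing like it was posted in the channel: the table layout, the FNV-1a hash, the tiny block size and all names are assumptions chosen to keep the example short. A real design would store the table in a file, as suggested above, and would need a space reservation model.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_BLOCK_SIZE 64  /* tiny blocks keep the demo small */
#define DEMO_SLOTS 16       /* fixed-size open-addressed table */

struct dedup_entry {
	int used;
	uint32_t hash;
	unsigned refs; /* how many logical blocks share this copy */
	unsigned char data[DEMO_BLOCK_SIZE];
};

struct dedup_table {
	struct dedup_entry slot[DEMO_SLOTS];
};

/* 32-bit FNV-1a over one block. */
static uint32_t block_hash(const unsigned char *data)
{
	uint32_t hash = 2166136261u;

	for (size_t i = 0; i < DEMO_BLOCK_SIZE; i++) {
		hash ^= data[i];
		hash *= 16777619u;
	}
	return hash;
}

/* Store a block, sharing an existing copy when the contents match.
 * Returns the slot index holding the data, or -1 if the table is full. */
static int dedup_store(struct dedup_table *table, const unsigned char *data)
{
	uint32_t hash = block_hash(data);

	for (unsigned probe = 0; probe < DEMO_SLOTS; probe++) {
		unsigned i = (hash + probe) % DEMO_SLOTS;
		struct dedup_entry *entry = &table->slot[i];

		if (!entry->used) {
			entry->used = 1;
			entry->hash = hash;
			entry->refs = 1;
			memcpy(entry->data, data, DEMO_BLOCK_SIZE);
			return (int)i;
		}
		/* A hash match is only a hint; compare bytes before sharing. */
		if (entry->hash == hash && !memcmp(entry->data, data, DEMO_BLOCK_SIZE)) {
			entry->refs++;
			return (int)i;
		}
	}
	return -1;
}
```

Storing the same contents twice returns the same slot with refs raised to 2; the byte compare after the hash match is what keeps a hash collision from silently merging two different blocks.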
2008-12-02 14:07 ;-) 2008-12-02 14:12 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2008-12-02 14:14 An interesting article was posted on lwn.net on tux3 today 2008-12-02 14:50 sejeff, haven't seen it yet 2008-12-02 14:50 now I have :) 2008-12-02 14:54 the sejeff alert arrive before the google alert 2008-12-02 14:54 just a few seconds 2008-12-02 15:04 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-12-02 15:05 sk8 oclock 2008-12-02 15:07 flipz: site is live, let me know if you notice anything weird 2008-12-02 15:07 :) 2008-12-02 15:07 although i'm getting on a plane soon so I won't really be able to do anyhting about it 2008-12-02 15:43 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-12-02 16:39 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-02 16:49 Nice looking site :D 2008-12-02 17:07 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-02 17:13 rmul, tell shapor :) 2008-12-02 17:57 -!- camby(~chatzilla@116.204.35.246) has joined #tux3 2008-12-02 18:05 -!- camby(~chatzilla@116.204.35.246) has left #tux3 2008-12-02 18:14 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-12-02 18:30 -!- camby(~chatzilla@116.204.35.246) has joined #tux3 2008-12-02 18:30 msg #tux3 register 2008-12-02 18:50 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-02 19:23 j uml 2008-12-02 19:24 root@usermode:~# touch /mnt/foo 2008-12-02 19:24 root@usermode:~# rm /mnt/foo 2008-12-02 19:24 >>> deferred_unlink: 098895f8/1 28 "foo" 2008-12-02 19:24 >>> hide 098895f8 2008-12-02 19:24 defer delete: 0988be0c/0 0 2008-12-02 19:24 root@usermode:~# fsync /mnt 2008-12-02 19:24 >>> ext2_sync_dir 098572e0 "/" 2008-12-02 19:24 >>> dentry: 098895f8/1 28 "foo" 2008-12-02 19:24 >>> deferred delete: 098895f8/1 28 "foo" 2008-12-02 19:24 root@usermode:~# sync 2008-12-02 19:24 >>> delete deferred: 0988be0c/0 0 2008-12-02 19:24 >>> ext2_delete_inode 
2008-12-02 19:24 root@usermode:~# umount /mnt 2008-12-02 19:24 yay, no panic on unmount 2008-12-02 19:25 deferred unlink 2008-12-02 19:25 maze, ping 2008-12-02 19:25 hirofumi, there? 2008-12-02 19:26 hi 2008-12-02 19:26 hey, did my first deferred undelete without a kernel panic 2008-12-02 19:27 so now I will clean up the patch a little and post for discussion 2008-12-02 19:27 this is hard ;) 2008-12-02 19:27 oh, nice 2008-12-02 19:28 the first part of my cursor fixes was also done 2008-12-02 19:29 :) 2008-12-02 19:29 15 patches were queued up 2008-12-02 19:31 big one 2008-12-02 19:31 I'll go to lunch 2008-12-02 19:31 and all I have is this one little one ;) 2008-12-02 19:32 but it is a really hard one, and maybe a big optimization, even for other filesystems 2008-12-02 19:32 well, main part of my patches is little though :) 2008-12-02 19:32 good 2008-12-02 20:07 I've pushed cursor work 2008-12-02 20:07 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-02 20:08 please check it 2008-12-02 20:08 ok 2008-12-02 20:17 hirofumi, http://mailman.tux3.org/pipermail/tux3/2008-December/000399.html -- Deferred delete prototype for Ext2 2008-12-02 20:18 thanks, I'll read it 2008-12-02 20:18 it's pretty weird stuff 2008-12-02 20:18 :) 2008-12-02 20:20 hirofumi, your repo shows as having two heads 2008-12-02 20:20 so... I can pull just one and not break my repo 2008-12-02 20:20 whoops 2008-12-02 20:21 can you just do a merge on your side? 2008-12-02 20:21 yes 2008-12-02 20:21 let me check it 2008-12-02 20:21 I can also pull just one head, but if I accidentally pull both it gets hg into some wierd state where it can't merge 2008-12-02 20:22 I need to complain to matt about that ;) 2008-12-02 20:25 um.. 
2008-12-02 20:25 hg clone http://tux3.org/tux3/ 2008-12-02 20:25 hg pull static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-02 20:26 hg heads 2008-12-02 20:26 changeset: 579:84f78bf68220 2008-12-02 20:26 tag: tip 2008-12-02 20:26 user: OGAWA Hirofumi 2008-12-02 20:26 date: Wed Dec 03 12:59:43 2008 +0900 2008-12-02 20:26 summary: Fix make_inode()/save_inode() error path 2008-12-02 20:26 hirofumi@devron (tux3)$ 2008-12-02 20:26 it's a mystery ;) 2008-12-02 20:26 it doesn't have two heads 2008-12-02 20:27 I do here, I pulled into a repo that had "tree_chop fix" as the last change 2008-12-02 20:28 something strange 2008-12-02 20:28 I should be running on current hg 2008-12-02 20:29 it seems that hg can't handle distributed devlopment correctly 2008-12-02 20:29 strange :) 2008-12-02 20:30 let's see what happens when I pull into my working repo, in a few minutes 2008-12-02 20:30 if this breaks, I can alwasy reclone from the public repo 2008-12-02 20:30 that's my safety net 2008-12-02 20:31 yes 2008-12-02 20:34 a cursor_leaf(cursor) function would shorten some code ;) 2008-12-02 20:34 yes 2008-12-02 20:35 and it's safe 2008-12-02 20:35 and we can't need the full path to read leaf 2008-12-02 20:36 we don't need 2008-12-02 20:36 ah, kernel brelse allows null buffer? 
2008-12-02 20:36 I wonder if we should though 2008-12-02 20:36 yes 2008-12-02 20:36 it does not seem like very good coding style to allow null pointers in free functions 2008-12-02 20:37 it covers up bugs 2008-12-02 20:37 yes 2008-12-02 20:37 unfotunately, that kind of thing is done a lot in kernel 2008-12-02 20:37 I don't know why 2008-12-02 20:37 in kernel, too many users assumed it to check null somehow 2008-12-02 20:37 I'd just put a BUG there 2008-12-02 20:37 and they can fix it 2008-12-02 20:38 well 2008-12-02 20:38 let it seg fault, I think that's the usual suggestion 2008-12-02 20:38 tree chop is fixed :) 2008-12-02 20:38 ok 2008-12-02 20:39 no, I mean, you fixed it 2008-12-02 20:39 thankyou 2008-12-02 20:39 yes 2008-12-02 20:39 ok is, removing null check from brelse 2008-12-02 20:40 I think it's better 2008-12-02 20:40 we never pass a null buffer to it 2008-12-02 20:40 current user I know is only tree_chop() 2008-12-02 20:40 it's need to check on error path 2008-12-02 20:40 oh, of the null check? 2008-12-02 20:40 yes 2008-12-02 20:40 prev 2008-12-02 20:40 prev[] may have null 2008-12-02 20:41 if sb_bread returned null 2008-12-02 20:41 right, I think it's better to put the if (! in tree_chop, where we can see it 2008-12-02 20:41 hmm 2008-12-02 20:41 well 2008-12-02 20:41 ok, I'll queue patch for it 2008-12-02 20:42 and I should read the whole function, instead of the patch 2008-12-02 20:42 for brelse null check? 
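A small userspace sketch of the style agreed on above: the release function refuses a null buffer instead of quietly accepting it, and the one caller that can legitimately hold a null (an error path where a read failed partway through) does the check where it can be seen. `struct buffer`, `buffer_release` and `release_path` are hypothetical stand-ins, not the kernel's `brelse` or tux3's `tree_chop`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer with a reference count. */
struct buffer {
	int count;
};

/* Release one reference; passing NULL is treated as a caller bug. */
static void buffer_release(struct buffer *buffer)
{
	assert(buffer != NULL);
	assert(buffer->count > 0);
	buffer->count--;
}

/*
 * Error-path cleanup for an array that may be only partly filled,
 * in the spirit of prev[] after a failed read. The null check lives
 * in the caller, not in the release function.
 * Returns the number of buffers actually released.
 */
static int release_path(struct buffer *path[], int levels)
{
	int released = 0;

	for (int i = 0; i < levels; i++) {
		if (!path[i])
			continue; /* nothing was read at this level */
		buffer_release(path[i]);
		released++;
	}
	return released;
}
```

With this split, a stray null passed to `buffer_release` from anywhere else fails loudly instead of being masked.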
2008-12-02 20:42 the fixed tree_chop 2008-12-02 20:42 ah 2008-12-02 20:42 hard to see the changes from the patch 2008-12-02 20:43 yes, it was complex to read to fix 2008-12-02 20:45 well there is complex stuff in there, and a bunch of little bug fixes and new error handling added 2008-12-02 20:45 so it's a big patch, nice 2008-12-02 20:45 pulling 2008-12-02 20:45 thanks 2008-12-02 20:46 when I pulled into my working repo, I did not get two heads 2008-12-02 20:46 it was just when I pulled into the repo that I use for reading your changes 2008-12-02 20:46 working repo head was which rev? 2008-12-02 20:47 I noticed there is "hg parents" 2008-12-02 20:48 the temporary repo was at 3771e62c8184b32ac6fb96155439b0a482f830b7, tree_chop-fix 2008-12-02 20:49 um..., those are same head 2008-12-02 20:50 working repo was at 59e1bc7d8ff5228313f55da1cff13f9402a5444f, Fix a bug in filemap extent scanning 2008-12-02 20:50 ah, same with public repo 2008-12-02 20:51 what hg version are you running? 2008-12-02 20:52 0.9.1 2008-12-02 20:52 same here 2008-12-02 20:52 yes 2008-12-02 20:52 ok, I will just watch for more strangeness in the future 2008-12-02 20:52 wait 2008-12-02 20:52 tree_chop-fix? 2008-12-02 20:52 oh :) 2008-12-02 20:53 that explains it 2008-12-02 20:53 duh 2008-12-02 20:53 :) 2008-12-02 20:53 I pulled your work in progress 2008-12-02 20:53 yes 2008-12-02 20:53 :) 2008-12-02 20:53 ok, it seems to work 2008-12-02 20:54 pushed to public 2008-12-02 20:54 thanks 2008-12-02 20:54 shall we talk about the deferred stuff? 2008-12-02 20:54 ok 2008-12-02 20:54 I know I this is not really on the critical path to having a fully functioning tux3 2008-12-02 20:54 I'm not reading all though 2008-12-02 20:54 ok 2008-12-02 20:55 I will wait 2008-12-02 20:55 ok 2008-12-02 20:57 I am really not sure about the spinlocking in ext2_sync_dir, but the way I did it is similar to how other code does it... 
that does not mean it is right 2008-12-02 21:00 ok 2008-12-02 21:00 hirofumi, I will do a patch to change all SB to sb_t sb, because jon corbet complained about that in his lwn article today 2008-12-02 21:00 heh 2008-12-02 21:00 I will call it the "Jon Corbet" patch 2008-12-02 21:01 I don't know about it, maybe BTREE too? 2008-12-02 21:01 yes 2008-12-02 21:01 btree_t btree 2008-12-02 21:01 I like it more than struct btree btree everywhere 2008-12-02 21:01 well, I like just "struct sb *sb", and struct btree *btree 2008-12-02 21:02 oh 2008-12-02 21:02 ok 2008-12-02 21:02 I don't think we have a problem there like we do with inodes 2008-12-02 21:02 yes 2008-12-02 21:06 I've read that email, quick look though 2008-12-02 21:59 maybe, one issue I noticed is, it may not work if dentry is still opened? 2008-12-02 22:00 it should, where do you see a problem? 2008-12-02 22:00 it does ->d_hide(), then dentry_iput() 2008-12-02 22:01 which turns it into a negative dentry 2008-12-02 22:01 but, if it is still opened, dentry->d_inode should be available 2008-12-02 22:01 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-02 22:02 because user may call write(2)/etc 2008-12-02 22:02 which uses the dentry? 2008-12-02 22:02 I don't think write does 2008-12-02 22:02 fd -> filp -> filp->f_path.dentry -> dentry->d_inode? 2008-12-02 22:03 it could well be an issue 2008-12-02 22:03 yes 2008-12-02 22:03 more like "what do we do about it" than "just can't do it" I think 2008-12-02 22:05 yes, however I don't have idea for now 2008-12-02 22:06 supporting your concern... 
file does not reference inode directly 2008-12-02 22:06 for now, I'm just picking issues up 2008-12-02 22:06 so, unlink the dentry and files don't work any more, that's not good ;) 2008-12-02 22:07 yes, well, I beleive we can fix it 2008-12-02 22:09 one way is to use a flag to mark the dentry as negative instead of relying on null inode 2008-12-02 22:10 yes 2008-12-02 22:10 that doesn't work when somebody creates a new file with that dentry 2008-12-02 22:10 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-02 22:10 it seems to me that have struct file reference through the dentry is a bad idea, but that would be a big change to fix 2008-12-02 22:11 we can clear on ->create()? 2008-12-02 22:11 there might still be a file open on the old dentry 2008-12-02 22:11 well 2008-12-02 22:11 then we don't reused the dentry 2008-12-02 22:11 make it negative 2008-12-02 22:12 um 2008-12-02 22:12 unhash it, like now 2008-12-02 22:12 oh, then we want last close()? 2008-12-02 22:12 s/want/watch/ 2008-12-02 22:13 maybe 2008-12-02 22:13 sounds good 2008-12-02 22:13 we also have to handle the opened dentry on ext2_sync_dir() 2008-12-02 22:14 so, last close() or something may be good 2008-12-02 22:14 yes 2008-12-02 22:15 818#define f_dentry f_path.dentry 2008-12-02 22:15 819#define f_vfsmnt f_path.mnt <- hmm, this is a hack 2008-12-02 22:15 I didn't realize f_dentry is a macro 2008-12-02 22:15 oh 2008-12-02 22:16 that was done to avoid making a big edit? 
2008-12-02 22:16 I didn't know there is users of f_dentry 2008-12-02 22:17 I wonder what uses it, except to get the inode 2008-12-02 22:17 at least, getcwd(2), iirc 2008-12-02 22:18 well you can also find the dentry by following the alias list 2008-12-02 22:18 so maybe the right thing is to make all struct file point straight at the inode, which would be more efficient, and the few users of d_dentry can follow the alias list 2008-12-02 22:18 ah, getcwd(2) is not using f_path 2008-12-02 22:18 just an idea 2008-12-02 22:19 inode has multiple names if hard link 2008-12-02 22:19 right :( 2008-12-02 22:19 duh 2008-12-02 22:20 um.., so we need dentry, at least on current dcache modle 2008-12-02 22:20 s/modle/model/ 2008-12-02 22:20 so if somebody does an unlink, the file still thinks it has a path/name 2008-12-02 22:20 that can't be consistent in all cases 2008-12-02 22:21 working around the problem could get messy as we talked about 2008-12-02 22:21 yes 2008-12-02 22:21 must not set the dentry negative until count hits zero 2008-12-02 22:22 it sounds very good at least for now 2008-12-02 22:23 there is a hook in dput, dentry->d_op->d_delete 2008-12-02 22:23 ah, but we need to avoid ->lookup()... 2008-12-02 22:23 to catch the hitting zero case 2008-12-02 22:23 because? 2008-12-02 22:24 we unhash it, so ->lookup() reads current directory 2008-12-02 22:24 ->unlink() has to remove that name actually for unhash 2008-12-02 22:25 we would only unhash if somebody wants to reuse the name for a new create 2008-12-02 22:25 so, we want to avoid ->lookup() for it 2008-12-02 22:25 so, at the same time we unhash, we create a new dentry with the same name, so real lookup does not take place 2008-12-02 22:26 I guess this probably works, and is the right thing to do 2008-12-02 22:26 wait a bit 2008-12-02 22:26 I'll keep working on my jon corbet patch 2008-12-02 22:26 user call unlink() - 2008-12-02 22:26 ah 2008-12-02 22:26 then 2008-12-02 22:27 if d_count == 1, d_hide()? 
2008-12-02 22:27 yes, it's ok then 2008-12-02 22:27 if d_count > 1, what we should? 2008-12-02 22:27 if not... then we have to defer the defer ;) 2008-12-02 22:28 :) 2008-12-02 22:28 well, so idea is unhash it? 2008-12-02 22:28 maybe mark the dentry, and check for the mark in dput, dentry->d_op->d_delete 2008-12-02 22:28 yes 2008-12-02 22:28 we can't unhash until file users go away 2008-12-02 22:29 that is pretty easy to try 2008-12-02 22:29 but, it was unlinked? 2008-12-02 22:29 full example? 2008-12-02 22:29 user can lookup it via cached_lookup? 2008-12-02 22:29 yes 2008-12-02 22:29 I forgot to mention that in the post 2008-12-02 22:30 if (d_count > 1) unhash(dentry) 2008-12-02 22:30 that's the currrent behaviour in d_delete 2008-12-02 22:30 d_delete { if (d_count > 1) just_mask(dentry) } 2008-12-02 22:30 ? 2008-12-02 22:30 right 2008-12-02 22:30 exactly 2008-12-02 22:31 if dentry is not unhashed, user can lookup that dentry? 2008-12-02 22:31 if somebody tries to create one of the same name, _then_ we unhash 2008-12-02 22:31 right, that's what we want 2008-12-02 22:31 we try to always have a dentry for a name that hasn't been backed yet 2008-12-02 22:32 or is still backed, by an old name that neees to be deleted 2008-12-02 22:32 it is pretty easy to keep the dentries from disappearing, just by taking a reference count 2008-12-02 22:32 even if it's normal open(2), the user lookup masked dentry via dentry cache layer 2008-12-02 22:33 unless it isn't in cache, in which case it isn't deferred either 2008-12-02 22:33 exactly as you said 2008-12-02 22:33 but, unlinked dentry should be removed from namespace 2008-12-02 22:34 so, we need some trick for that case? 2008-12-02 22:34 what is the case again, exactly? 2008-12-02 22:35 lookup of a masked dentry? 2008-12-02 22:35 unlink(2) -> d_delete { if (d_count > 1) just_mask(dentry) } 2008-12-02 22:35 yes 2008-12-02 22:35 that's not hard I think 2008-12-02 22:35 in here, we don't unhash this dentry? 
2008-12-02 22:35 right 2008-12-02 22:35 and lookup should return ENOENT 2008-12-02 22:36 we check marked dentry on some list in ->lookup hander? 2008-12-02 22:36 I don't see why we need to 2008-12-02 22:36 maybe I'm being dumb ;) 2008-12-02 22:37 but if we have masked it, it's a kind of negative dentry and the lookup will fail 2008-12-02 22:37 hashed dentry can use without calling ->lookup hanlder 2008-12-02 22:37 is that good or bad? 2008-12-02 22:38 I think it's bad, because we can't check mark on detnry in ->lookup() handler 2008-12-02 22:38 it's another change needed in core, yes 2008-12-02 22:39 ah, ok 2008-12-02 22:39 the mark has to be treated as negative 2008-12-02 22:39 vfs check that mark, and if dentry has mark, it thinks as negative 2008-12-02 22:40 i see 2008-12-02 22:40 and yet another change is that when somebody does a create using the same name, if the negative dentry is still in use it has to be unhashed 2008-12-02 22:40 I think that is all the changes needed 2008-12-02 22:40 it will make the hard-to-understand dcache even harder to understand ;) 2008-12-02 22:40 well 2008-12-02 22:41 if nobody uses that feature then they can ignore the code 2008-12-02 22:41 yes 2008-12-02 22:41 I'm glad I posted this for review early 2008-12-02 22:42 it never occurred to me that an open file would be the reason for the extra dentry count at d_delete 2008-12-02 22:42 marked dentry should never be reused 2008-12-02 22:42 why not? 2008-12-02 22:42 it can be reused if count goes to zero 2008-12-02 22:42 that is, the file closes 2008-12-02 22:43 yes, if zero, it's already true negative dentry 2008-12-02 22:43 mark is not needed anymore 2008-12-02 22:43 right, we can unmark 2008-12-02 22:43 maybe have to clear the inode 2008-12-02 22:43 yes 2008-12-02 22:43 the d_inode 2008-12-02 22:43 I don't think anybody does that for us 2008-12-02 22:44 maybe, dentry_iput() does in dput()? 
2008-12-02 22:45 ACTION checks 2008-12-02 22:45 final dput() -> d_kill() -> dentry_iput() 2008-12-02 22:45 yes 2008-12-02 22:45 that's right 2008-12-02 22:45 so there is where we can clear the mask 2008-12-02 22:45 that flag does no harm though 2008-12-02 22:46 if the inode is zero 2008-12-02 22:46 yes 2008-12-02 22:46 it's just two different ways of saying the dentry is negative 2008-12-02 22:46 yes 2008-12-02 22:46 ok, I think I can fix this, now... is this all worth it? 2008-12-02 22:47 I think it really is 2008-12-02 22:47 can take a lot of contention away from i_mutex 2008-12-02 22:47 I think so too 2008-12-02 22:48 your review was great :) 2008-12-02 22:48 honestory, I'm not sure until complete this work 2008-12-02 22:48 spotted the issue right away 2008-12-02 22:48 thanks 2008-12-02 22:48 me neither 2008-12-02 22:48 I also need to do some work on tux3 basics 2008-12-02 22:48 so this will be a background project 2008-12-02 22:48 but, I can't create great design like you though 2008-12-02 22:49 ACTION blushes 2008-12-02 22:49 well, it's true 2008-12-02 22:55 ok, SB is gone and tested in userspace 2008-12-02 22:55 now test in kernel 2008-12-02 23:03 now to change BTREE 2008-12-02 23:10 ok, after that work, I'll cleanup cursor stuff, then also I'll try to think about deferred stuff more or less 2008-12-02 23:10 it's just about ready 2008-12-02 23:12 989 line patch :-/ 2008-12-02 23:14 ok, thanks 2008-12-03 00:00 -!- camby(~root@116.204.35.246) has joined #tux3 2008-12-03 00:16 if (!(dentry->d_flags & DCACHE_NEGATIVE) && !d_unhashed(dentry)) 2008-12-03 00:16 __d_drop(dentry); 2008-12-03 00:17 static int ext2_hide_dentry(struct dentry *dentry) 2008-12-03 00:17 { 2008-12-03 00:17 show_dentry("hide dentry", dentry); 2008-12-03 00:17 dentry->d_flags |= DCACHE_NEGATIVE; 2008-12-03 00:17 dget(dentry); 2008-12-03 00:17 return 1; 2008-12-03 00:17 } 2008-12-03 00:18 yes 2008-12-03 00:18 no 2008-12-03 00:18 return 0?
2008-12-03 00:20 no 2008-12-03 00:20 probably, we have to avoid dentry->d_inode = NULL in dentry_iput() 2008-12-03 00:33 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-12-03 00:37 or we just create new negative dentry? 2008-12-03 00:37 and we handle all deferred dentry as opened file 2008-12-03 00:51 hirofumi, actually, I don't think we will ever hit d_delete, because we took an extra ref count on the dentry 2008-12-03 00:52 ->d_delete() handler? 2008-12-03 00:54 ->d_delete in iput 2008-12-03 00:54 (bad name collision there) 2008-12-03 00:55 ok, it can hit ->d_delete, but only after we have done the deferred processing. The file might have close by then, but it probably does not matter. 2008-12-03 00:55 static int ext2_delete_dentry(struct dentry *dentry) 2008-12-03 00:55 { 2008-12-03 00:55 show_dentry("delete dentry", dentry); 2008-12-03 00:55 if ((dentry->d_flags &= ~DCACHE_NEGATIVE)) { 2008-12-03 00:55 dentry->d_flags &= ~DCACHE_NEGATIVE; 2008-12-03 00:55 dentry->d_inode = NULL; 2008-12-03 00:55 list_del_init(&dentry->d_alias); 2008-12-03 00:56 } 2008-12-03 00:56 return 0; /* do not unhash */ 2008-12-03 00:56 } 2008-12-03 00:56 bleah 2008-12-03 00:56 extra ~ 2008-12-03 00:56 if ((dentry->d_flags & DCACHE_NEGATIVE)) { 2008-12-03 00:56 http://lwn.net/Articles/309094/ 2008-12-03 00:56 flipz: can u get me a link to that.. dont have lwn subscription :( 2008-12-03 00:57 heh 2008-12-03 00:57 article abt tux3... 2008-12-03 00:59 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-12-03 00:59 ok, dont bother.. found a way ;) 2008-12-03 01:00 hey maze 2008-12-03 01:14 hey 2008-12-03 01:14 all the usual students were on vacation today 2008-12-03 01:14 so we will pick up on thursday... auditing the dentry paths 2008-12-03 01:15 it's mind numbingly twisty 2008-12-03 01:15 cool, I both forgot and have an excuse: comcast was down till 1 am 2008-12-03 01:15 so even if I had remembered... 
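The marked-dentry lifecycle worked out in the discussion above can be modeled in userspace C. This is a toy, not VFS code: `struct toy_dentry`, `TOY_NEGATIVE` and every function name are hypothetical, and locking, hash chains and the alias list are left out. It assumes the rules as they were stated in the conversation: an unlink on a dentry that is still in use only marks it, a marked dentry counts as negative for lookup, a create on a marked dentry that is still in use unhashes it, and the final put clears the inode and the mark.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NEGATIVE 1 /* hypothetical flag, playing the role of the "mask" */

struct toy_dentry {
	int count;   /* users of the dentry; an open file holds one */
	int flags;
	bool hashed; /* still findable by name */
	void *inode; /* NULL for a true negative dentry */
};

/* Two ways of saying "negative": no inode, or the mark is set. */
static bool toy_is_negative(const struct toy_dentry *d)
{
	return !d->inode || (d->flags & TOY_NEGATIVE);
}

/* unlink: the caller holds one reference of its own. */
static void toy_unlink(struct toy_dentry *d)
{
	if (d->count > 1)
		d->flags |= TOY_NEGATIVE; /* a file is still open: only mask it */
	else
		d->inode = NULL;          /* nobody else: make it truly negative */
}

/* Cached lookup: true only if the name resolves to a live file. */
static bool toy_lookup(const struct toy_dentry *d)
{
	return d->hashed && !toy_is_negative(d);
}

/*
 * Create using the same name. Returns true if the old dentry may be
 * reused; returns false after unhashing it, in which case the caller
 * would allocate a fresh dentry for the new file.
 */
static bool toy_create_reuses(struct toy_dentry *d)
{
	if ((d->flags & TOY_NEGATIVE) && d->count > 0) {
		d->hashed = false;
		return false;
	}
	return true;
}

/* Drop a reference; the last one clears the inode and the mark. */
static void toy_put(struct toy_dentry *d)
{
	assert(d->count > 0);
	if (--d->count == 0) {
		d->inode = NULL;
		d->flags &= ~TOY_NEGATIVE;
	}
}
```

The point of the model is the middle state: after `toy_unlink` on an open file, the name no longer resolves, but the open file can still reach its inode through the dentry until the last `toy_put`.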
2008-12-03 01:16 heh 2008-12-03 01:16 http://lwn.net/Articles/309094/ 2008-12-03 01:16 still too fresh for me 2008-12-03 01:16 jon corbet comments we have a small team, let's grow it 2008-12-03 01:17 get some of that "smart people" effect 2008-12-03 01:17 you don't really need or want a large team, you want a small, smart, good, dedicated team 2008-12-03 01:17 5 would do 2008-12-03 01:17 at the moment we're at about 3 2008-12-03 01:19 I did some hacking over the long break, although mostly it was trying to figure out why my machine can't seem to recognize disks - I have problems with port multiplier support 2008-12-03 01:19 printk debugging ;-) 2008-12-03 01:19 but so far no success 2008-12-03 01:19 however: I have two identical external enclosures... and one of them seems to consistently work better 2008-12-03 01:20 I'm beginning to think it's a firmware (or something) issue, but haven't yet figured out how to verify that 2008-12-03 01:20 hardcore 2008-12-03 01:21 I've been trying to get this media server functioning for over half a year now 2008-12-03 01:21 and I still have: a) disk issues, b) firewire issues, c) no cpu clock scaling 2008-12-03 01:21 really quite ridiculous 2008-12-03 01:22 cpu scaling is I believe caused by bad acpi bios tables, firewire is caused by something which causes IRQ's to get disabled occasionally 2008-12-03 01:22 and the port multiplier jazz is weird 2008-12-03 01:23 are you going to tough it through? 2008-12-03 01:24 I don't know - it's black magic 2008-12-03 01:25 the firewire I don't much care about - discovered that mostly by accident today 2008-12-03 01:25 I'm changing about 1 bazillion places where dentry.d_inode is treated as a logical value to an inline macro, I think my head will explode soon 2008-12-03 01:25 the cpu scaling is only mildly annoying - it burns more power - tough :-( 2008-12-03 01:25 bad the lack of access to my disks... that's more of an issue 2008-12-03 01:25 I'd say 2008-12-03 01:27 mind you ... 
I have a seagate 500GB external portable drive - connects over usb2.0/firewire400/firewire800 and it totally rocks 2008-12-03 01:28 from linux on my macbook pro I get 30.5/39.5/75 MB/s read speeds over USB2/F400/F800 2008-12-03 01:30 I take it windows runs flawlessly on this machine? 2008-12-03 01:57 hey flipz 2008-12-03 01:57 hi 2008-12-03 01:58 how's it going ? 2008-12-03 01:58 ACTION is too drunk to drive at the moment 2008-12-03 01:58 heh 2008-12-03 01:58 progress 2008-12-03 02:03 -!- dmitri(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-03 02:03 welcome, dmitri 2008-12-03 02:05 привет 2008-12-03 02:05 sry.. hello 2008-12-03 02:05 :) 2008-12-03 02:05 I presume that was Cyrillic 2008-12-03 02:06 russian ;) 2008-12-03 02:06 daniel is around? 2008-12-03 02:07 here 2008-12-03 02:07 hello again 2008-12-03 02:07 :) 2008-12-03 02:07 :) 2008-12-03 02:07 came to read a patch? 2008-12-03 02:07 ACTION has a big nasty one brewing 2008-12-03 02:08 sure.. 2008-12-03 02:08 hope its having comments 2008-12-03 02:08 :) 2008-12-03 02:08 -!- mlankhorst(~m@fw1.astro.rug.nl) has joined #tux3 2008-12-03 02:08 http://mailman.tux3.org/pipermail/tux3/2008-December/000399.html <- first version is here 2008-12-03 02:08 comments are in the mail 2008-12-03 02:09 new version coming in maybe 20 minutes 2008-12-03 02:09 much bigger, therefore needing more checking 2008-12-03 02:09 ok. let me go thru that 2008-12-03 02:12 downloading tux3... 
2008-12-03 02:13 flipz: Apparantly you're not known as a tight lipped developer ;) 2008-12-03 02:13 I think he was talking about the length of my posts 2008-12-03 02:14 I think most people appreciate really long ones, if you're the kind of person who reads that stuff you want the details 2008-12-03 02:24 root@usermode:~# ls /mnt/foo 2008-12-03 02:24 ls: /mnt/foo: No such file or directory 2008-12-03 02:24 root@usermode:~# ls /mnt 2008-12-03 02:24 d dir foo lost+found 2008-12-03 02:25 so that's an issue 2008-12-03 02:25 getdents reads from the filesystem, pretty much impossible to take account of the dirent cache 2008-12-03 02:25 so it has to flush the directory first 2008-12-03 02:26 ok... that should not be hard 2008-12-03 02:34 root@usermode:~# rm /mnt/foo 2008-12-03 02:34 >>> defer unlink: 0985263c/1 48 "foo" 2008-12-03 02:34 >>> hide dentry: 0985263c/1 48 "foo" 2008-12-03 02:34 >>> defer delete: 0988de0c/0 0 2008-12-03 02:34 root@usermode:~# ls /mnt 2008-12-03 02:34 >>> ext2_sync_dir 098533c8 "/" 2008-12-03 02:34 >>> dentry: 0985263c/1 68 "foo" 2008-12-03 02:34 >>> deferred delete: 0985263c/1 68 "foo" 2008-12-03 02:34 >>> delete dentry: 0985263c/0 28 "foo" 2008-12-03 02:34 >>> ext2_sync_dir 098533c8 "/" 2008-12-03 02:34 >>> dentry: 0985263c/0 8 "foo" 2008-12-03 02:34 d dir lost+found 2008-12-03 02:35 it was not hard 2008-12-03 02:35 crude solution though 2008-12-03 02:35 has to walk through all the dentries of a directory to see if any are deferred deletes 2008-12-03 02:35 I guess that could be fixed by putting the deferred deletes at the beginning of the list or something 2008-12-03 02:37 and store the number of deferred entries? 
2008-12-03 02:38 s/entries/deletes 2008-12-03 02:38 stop at the first non-deferred one 2008-12-03 02:38 stop flushing 2008-12-03 02:38 actually, put them at the end of the list, because normal lookup puts new ones at the front 2008-12-03 02:38 should work fine 2008-12-03 02:39 go over the entire list then 2008-12-03 02:39 yes, be lazy for now 2008-12-03 02:39 it's the principle that matters 2008-12-03 02:40 yes 2008-12-03 02:40 flipz: can you point to the atom post? subject may be.. 2008-12-03 02:41 just a sec 2008-12-03 02:41 atom smashing? 2008-12-03 02:41 ok 2008-12-03 02:42 http://lwn.net/Articles/300416/ 2008-12-03 02:42 thanq 2008-12-03 02:49 flipz: nice details in your mails 2008-12-03 02:50 thankyou 2008-12-03 02:50 http://kerneltrap.org/Linux/Tux3_Hierarchical_Structure 2008-12-03 02:51 right, that got simpler actually 2008-12-03 02:51 after removing volumes? 2008-12-03 02:52 yes 2008-12-03 02:52 all the structures under volume table go one level up 2008-12-03 02:52 ? 2008-12-03 02:52 http://userweb.kernel.org/~hirofumi/tux3.img.dot.png 2008-12-03 02:53 yes 2008-12-03 02:53 yeah, found this image on a blog 2008-12-03 03:33 http://lwn.net/Articles/309094/rss 2008-12-03 03:55 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-03 04:05 flipz: the code does need comments... 2008-12-03 04:09 it does 2008-12-03 04:09 post the first six places where you want comments to the list, and it shall be done 2008-12-03 04:13 on it... 2008-12-03 04:13 ok, I'll sleep and check it tomorrow 2008-12-03 04:15 one thing, where to start reading the code will help :) 2008-12-03 04:15 i picked up btree.c 2008-12-03 04:15 a README will help 2008-12-03 04:28 start in tux3.c I think 2008-12-03 04:29 that implements high level operations as a command line interface 2008-12-03 04:29 you can follow them down into the code 2008-12-03 04:30 ok, thnq 2008-12-03 04:41 the figure is neat and precise.. 
2008-12-03 05:31 flips: comments about the various fields in data structures will help... 2008-12-03 05:37 i am not able to subscribe to the list 2008-12-03 05:37 it is saying that i need to supply a valid email address which i am doing 2008-12-03 08:09 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-03 08:10 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-12-03 09:18 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-12-03 09:48 -!- dmitri(7aa33163@webchat.mibbit.com) has joined #tux3 2008-12-03 09:50 anyone here? 2008-12-03 10:01 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2008-12-03 10:01 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2008-12-03 10:08 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-12-03 12:14 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-12-03 13:45 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-03 15:43 -!- ajonat(~ajonat@190.48.116.118) has joined #tux3 2008-12-03 15:43 sk8 oclock 2008-12-03 17:31 later 2008-12-03 17:32 that was 2 hours ago :) 2008-12-03 17:32 good thing we have time stamps on irc text lines and stuff 2008-12-03 17:42 woohoo 2008-12-03 17:42 root@usermode:~# rm /mnt/foo 2008-12-03 17:42 >>> defer unlink: 0985365c/1 48 "foo" 2008-12-03 17:42 >>> hide dentry: 0985365c/1 48 "foo" 2008-12-03 17:42 root@usermode:~# ls /mnt/foo 2008-12-03 17:42 /mnt/foo 2008-12-03 17:42 root@usermode:~# ls /mnt 2008-12-03 17:42 >>> ext2_sync_dir 09851cac "/" 2008-12-03 17:42 >>> dentry: 0985365c/1 68 "foo" 2008-12-03 17:42 >>> drop hidden dentry: 0985365c/1 48 "foo" 2008-12-03 17:42 >>> deferred unlink: 0985365c/1 8 "foo" 2008-12-03 17:42 >>> defer delete: 0988ae0c/0 0 2008-12-03 17:42 >>> ext2_sync_dir 09851cac "/" 2008-12-03 17:42 >>> dentry: 0985365c/0 8 "foo" 2008-12-03 17:42 d dir lost+found 2008-12-03 17:42 root@usermode:~# umount /mnt 2008-12-03 17:42 >>> >>> 
delete deferred: 0988ae0c/0 0 2008-12-03 17:42 >>> ext2_delete_inode 2008-12-03 17:43 deferred unlink working for the file-is-open case 2008-12-03 17:50 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-03 18:56 -!- MaZe1(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-12-03 20:27 -!- dmitri(7aa33163@webchat.mibbit.com) has joined #tux3 2008-12-03 20:35 -!- RazvanM(~RazvanM@96.234.233.232) has joined #tux3 2008-12-03 21:00 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-03 21:23 -!- camby(~root@203.156.193.68) has joined #tux3 2008-12-03 21:43 hirofumi, there? 2008-12-03 21:43 hi 2008-12-03 21:46 now, I'm fixing cursor stuff 2008-12-03 21:46 nice 2008-12-03 21:47 hirocumi, do you know who I should ask to get a tux3 git tree at kernel.org? 2008-12-03 21:47 iirc, ftp-admin@vger.kernel.org 2008-12-03 21:47 wait a bit 2008-12-03 21:48 http://www.kernel.org/faq/#account 2008-12-03 21:48 ftpadmin@kernel.org 2008-12-03 21:48 thanks 2008-12-03 22:08 ah, I forgot to mention about important subject 2008-12-03 22:08 for ftpadmin, subject must have "[KORG]" prefix 2008-12-03 22:09 flips, please see the bottom of http://www.kernel.org/ 2008-12-03 22:10 right, [KORG] 2008-12-03 22:38 Did I just see flips not reading a faq? Shame :> 2008-12-03 22:39 ACTION will get mlankhorst to send the beg next time 2008-12-03 22:40 mlankhorst, were you going to help pranith out with the fsck hack? 2008-12-03 22:40 nah 2008-12-03 22:40 lamer ;) 2008-12-03 22:41 I'm weeding out gcc bugs 2008-12-03 22:41 More fun 2008-12-03 22:41 oh yes, lots more fun 2008-12-03 22:41 If there were no gcc bugs, my simple mingw-w64 app would probably work. 
2008-12-03 22:43 (Though you would never guess what the app looks like) 2008-12-03 22:46 Ironically, it's the printf("Hello, world\n"); line that exposed the bug 2008-12-03 22:47 early out tonight 2008-12-03 22:47 dcache hacking really wears one down 2008-12-04 00:36 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-12-04 02:10 folks 2008-12-04 02:10 ACTION must get a new dev motherboard it seems because of a failure 2008-12-04 03:31 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2008-12-04 04:05 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-12-04 04:06 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-12-04 04:29 -!- hub(~hub@pD9E94602.dip.t-dialin.net) has joined #tux3 2008-12-04 04:36 -!- hub(~hub@pD9E94602.dip.t-dialin.net) has left #tux3 2008-12-04 05:59 -!- mingming(~mingming@c-71-193-163-244.hsd1.or.comcast.net) has joined #tux3 2008-12-04 06:16 Where is the actually addressing of the harddrive made, which sectors is to be written etc? Is it up to the Linux VFS? 2008-12-04 07:02 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-12-04 07:02 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2008-12-04 07:04 -!- mlankhorst(~m@fw1.astro.rug.nl) has joined #tux3 2008-12-04 07:04 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-12-04 07:04 -!- Man_of_Wax(~wax@gualtiero.cs.unibo.it) has joined #tux3 2008-12-04 07:04 -!- ollebull(~olle@ip6-43.bon.riksnet.se) has joined #tux3 2008-12-04 07:05 -!- ChanServ changed topic to "http://tux3.org ~ Tux3 U, right here Tuesdays and Thursdays at 8 pm Pacific Time ~ Next session: deferred unlink, does it really work?" 
2008-12-04 07:09 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-12-04 08:02 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-04 08:26 ollebull, it is not up to the vfs, it is up to each filesystem 2008-12-04 08:27 ollebull, most filesystems make this decision in their ->get_block callback 2008-12-04 08:36 Ok flipzzz , is that how tux3 is handling it too, or supposed to handle it? 2008-12-04 08:37 -!- mlankhorst(~m@fw1.astro.rug.nl) has left #tux3 2008-12-04 08:38 tux3 is currently using ->get_block, but later will use a more efficient get_extents interface 2008-12-04 08:40 Ok, I'm just trying to get some things straighten out for my self 2008-12-04 08:42 Would be fun and interesting to participate in the developing of tux3, if there is any help needed, that is... 2008-12-04 08:42 sure, help is needed 2008-12-04 08:42 Think I have to read some more about the Linux filesystem-structure 2008-12-04 08:43 anything from testing, to checking patches, to heavy hacking and design stuff 2008-12-04 08:44 see the topic above for tux3 university, all about linux's filesystem design 2008-12-04 08:47 Yeah, I have been reading some of the previous irc-logs, gives some clarification to the subject 2008-12-04 08:49 Is there any particular piece of the projected that needs more attention? 2008-12-04 08:49 testing 2008-12-04 08:49 fsck 2008-12-04 08:50 Ok, is anyone working on fsck now? 2008-12-04 08:52 pranith is looking at it 2008-12-04 08:52 you could compare notes 2008-12-04 08:53 Compare which notes? 
2008-12-04 08:54 that's an idiom 2008-12-04 08:54 it means, see what he thinks about it, how he plans to approach, etc, and discuss your own ideas 2008-12-04 08:54 useful idiom 2008-12-04 08:54 Oh, right =) 2008-12-04 09:33 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-12-04 09:37 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-04 10:48 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-04 11:19 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-12-04 12:05 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-04 13:58 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-12-04 14:17 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-04 14:34 http://www.phoronix.com/scan.php?page=article&item=ext4_benchmarks 2008-12-04 14:37 those tests mostly seem stream-centric 2008-12-04 14:38 looking for the bottom line numbers... 2008-12-04 14:38 http://www.phoronix.com/scan.php?page=article&item=ext4_benchmarks&num=9 2008-12-04 14:38 http://www.phoronix-test-suite.com/ 2008-12-04 14:38 we'll eventually want to run that :) 2008-12-04 14:39 yes 2008-12-04 14:39 ext4 creates are doing well, deletes regressed, streaming read is much better 2008-12-04 14:39 I wonder why deletes regressed 2008-12-04 14:39 mingming, there? 2008-12-04 14:45 flips, yes 2008-12-04 14:46 ACTION looking at the benchmarks... 2008-12-04 14:46 mingming, do you have a theory about what happened to delete vs ext3? 2008-12-04 14:47 hmm. no, it surpise to see the regression 2008-12-04 14:48 are those small files? 2008-12-04 14:48 extents + delayed allocation in ext4 should help in reduce file delete times 2008-12-04 14:50 I didn't run the test 2008-12-04 14:50 right, configuration would be nice to know 2008-12-04 14:59 deleting random 4G file?what it is doing? 
sorry, haven't run bonnie for a long long time 2008-12-04 15:01 I guess it means deleting one 4G file 2008-12-04 15:15 flips, it seems too fast to delete 200 4G files per second, 2008-12-04 15:15 not for ext3:) 2008-12-04 15:16 maybe it's 4G of small files 2008-12-04 15:17 if I had used bonnie recently, I'd know :) 2008-12-04 15:29 maybe I will download that test and see what it's doing with bonnie++ 2008-12-04 16:29 -!- warthog9(~warthog9@c-67-164-30-157.hsd1.ca.comcast.net) has joined #tux3 2008-12-04 16:30 -!- warthog9(~warthog9@c-67-164-30-157.hsd1.ca.comcast.net) has left #tux3 2008-12-04 17:36 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-04 17:53 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-04 18:54 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-04 19:05 -!- inverse(~chatzilla@h80-net10.simres.netcampus.ca) has joined #tux3 2008-12-04 19:11 looks like I won't be able to make tux3 u today... unfortunately. 2008-12-04 20:02 aw 2008-12-04 20:02 well, who is here tonight? 2008-12-04 20:03 hi 2008-12-04 20:03 :) 2008-12-04 20:03 shall we go over the new deferred delete patch? 
 2008-12-04 20:03 I will commit it to git 2008-12-04 20:04 ok 2008-12-04 20:04 hi 2008-12-04 20:04 im new :) 2008-12-04 20:05 hi inverse 2008-12-04 20:05 hi 2008-12-04 20:05 just a moment while I git-stash my public repo and pull from private 2008-12-04 20:08 oh, changeset:583 breaks xattr stuff 2008-12-04 20:08 whoops 2008-12-04 20:08 ah, that's why it was needed :-/ 2008-12-04 20:09 ok, it goes back 2008-12-04 20:09 but not just now 2008-12-04 20:09 yes 2008-12-04 20:11 hmm, that took a while 2008-12-04 20:12 and I ended up based off the wrong head 2008-12-04 20:12 git isn't entirely intuitive 2008-12-04 20:12 http://phunq.net/ddtree?p=tux3fs;a=shortlog;h=refs/heads/nameops 2008-12-04 20:12 but good enough 2008-12-04 20:13 here is the diff: http://phunq.net/ddtree?p=tux3fs;a=commitdiff;h=544364862a82debfbc8b2b5000d325d4ac910ee9;hp=69296d9d4e065beb1e67bf71592519bcd5100a1e 2008-12-04 20:13 and here are the files: http://phunq.net/ddtree?p=tux3fs;a=tree;h=544364862a82debfbc8b2b5000d325d4ac910ee9;hb=544364862a82debfbc8b2b5000d325d4ac910ee9 2008-12-04 20:13 hmm, this all works much better with mercurial 2008-12-04 20:14 no gigantic sha1's in the urls 2008-12-04 20:14 this is dcache.c: http://phunq.net/ddtree?p=tux3fs;a=blob;f=fs/dcache.c;h=94b36566b87d92915df477d978f88f6226d0ee4b;hb=544364862a82debfbc8b2b5000d325d4ac910ee9 2008-12-04 20:15 ah, git really sucks for this ;) 2008-12-04 20:16 ok, let's use this patch from mailman instead: http://mailman.tux3.org/pipermail/tux3/attachments/20081203/bb90a9c3/attachment-0001.patch 2008-12-04 20:18 and please point your browser here: http://phunq.net/ddtree?p=tux3fs;a=tree;f=fs;h=44f76ed70ce730b8a27c896707e7ad220e696555;hb=544364862a82debfbc8b2b5000d325d4ac910ee9 2008-12-04 20:18 this is the top level of the git tree for the deferred namespace operations 2008-12-04 20:18 ok 2008-12-04 20:18 after that, I can just talk about filenames 2008-12-04 20:18 and this is the last time I try to do this with git ;) 2008-12-04 20:18 ok, 
let's start with dcache.c 2008-12-04 20:19 correction: the above url is the top level for linux/fs 2008-12-04 20:19 1388 void d_delete(struct dentry * dentry) 2008-12-04 20:20 d_delete is the one thing I had to hack to support deferred delete 2008-12-04 20:20 it now has a ->d_hide method 2008-12-04 20:20 (hirofumi knows this already) 2008-12-04 20:20 yes 2008-12-04 20:21 there are some changes from last time 2008-12-04 20:21 now, the return from ->hide is actually used 2008-12-04 20:21 before it was just a feature I thought somebody might need, a flag to say: I didn't hide it, just do everything the usual way 2008-12-04 20:21 then I found out I needed it :) 2008-12-04 20:22 it is the ->d_hide method that does the defer 2008-12-04 20:22 by taking a reference count on the dentry, to be sure it will not disappear 2008-12-04 20:23 and setting the "DCACHE_HIDDEN" flag on the dentry, which means that the dentry is hidden 2008-12-04 20:23 sorry, is negative 2008-12-04 20:23 even though it still has a ->d_inode 2008-12-04 20:24 so now it seems to lookup as if the file has been unlinked 2008-12-04 20:24 because there is a negative dentry in the dcache that can't be automatically pruned/shrunk away 2008-12-04 20:25 1399 if (atomic_read(&dentry->d_count) == 1 + hidden) { <- this is the condition that controls detaching the inode from the dentry 2008-12-04 20:25 normally, it does this when dentry count has reached 1 2008-12-04 20:25 meaning that the dentry cache itself holds the only reference 2008-12-04 20:26 so the dentry can be safely detached 2008-12-04 20:26 it can also be safely detached if the defer mechanism additionally holds a reference 2008-12-04 20:27 that is the meaning of "1 + hidden" 2008-12-04 20:27 yes 2008-12-04 20:27 the defer mechanism tracks the inode and dentry separately, which is why it can be detached 2008-12-04 20:27 the only case it can't be detached (I think) is when a file has the dentry open 2008-12-04 20:28 then the dcache normally unhashes the 
dentry, but that is not allowed with deferred delete 2008-12-04 20:28 that would make the unlinked file visible again, when a lookup goes straight to the filesystem 2008-12-04 20:29 so if the dentry was hidden, we skip the unhash, and the dentry is negative with an inode attached and an extra ref count 2008-12-04 20:30 next stop should be include/linux/dcache.h, but that's too inconvenient with this git tree 2008-12-04 20:30 +static inline int d_negative(struct dentry *dentry) 2008-12-04 20:30 +{ 2008-12-04 20:30 + return !dentry->d_inode || (dentry->d_flags & DCACHE_HIDDEN); 2008-12-04 20:30 +} 2008-12-04 20:30 so here we are 2008-12-04 20:30 cut n paste 2008-12-04 20:31 normally, people just do !dentry->d_inode to detect "negative dentry" 2008-12-04 20:31 that is sloppy coding style that has somehow made it this far 2008-12-04 20:32 I had to go wrap all those with this d_negative function, as they should have been already 2008-12-04 20:32 then d_negative gets an additional condition: (dentry->d_flags & DCACHE_HIDDEN) 2008-12-04 20:32 our new dcache flag, set by ->d_hide 2008-12-04 20:33 all these changes are to fs/namei.c 2008-12-04 20:33 they make cached_lookup return ENOENT for this dentry 2008-12-04 20:34 this is what creates the appearance that the dentry that had to be kept around because of the open file, has been unlinked 2008-12-04 20:34 next stop is where we do the actual unlink 2008-12-04 20:34 fs/ext2/dir.c 2008-12-04 20:35 ceatinge, did I lose you way back? 2008-12-04 20:36 I suspect so ;) 2008-12-04 20:36 this one happens to be pretty deep in core kernel hack land 2008-12-04 20:36 http://phunq.net/ddtree?p=tux3fs;a=blob;f=fs/ext2/namei.c;h=bdd640c279f8dea3a72e093cd63726c862c4651e;hb=544364862a82debfbc8b2b5000d325d4ac910ee9#l273 2008-12-04 20:36 here?
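For reference, the two tests walked through so far (the d_negative check and the "1 + hidden" detach condition in d_delete) can be modeled in user space. This is only a sketch under stated assumptions: the struct layouts, the model_ names and the flag value are hypothetical stand-ins and not the kernel's definitions, and locking is ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag value only; the real DCACHE_HIDDEN value is not given in the log. */
#define DCACHE_HIDDEN 0x1000u

/* Hypothetical, heavily reduced stand-ins for struct inode and struct dentry. */
struct model_inode { int nlink; };
struct model_dentry {
    int count;                  /* models d_count */
    unsigned flags;             /* models d_flags */
    struct model_inode *inode;  /* models d_inode */
};

/* A dentry looks negative if it has no inode OR it has been hidden. */
static bool model_d_negative(const struct model_dentry *d)
{
    assert(d != NULL);
    return !d->inode || (d->flags & DCACHE_HIDDEN);
}

/*
 * The inode may be detached when only the dcache holds a reference,
 * plus one more if the defer mechanism took its own reference.
 */
static bool model_can_detach(const struct model_dentry *d)
{
    assert(d != NULL);
    int hidden = (d->flags & DCACHE_HIDDEN) ? 1 : 0;
    return d->count == 1 + hidden;
}
```

With an open file holding a further reference, model_can_detach() is false while model_d_negative() stays true, which is the "negative with an inode attached" state described above.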
2008-12-04 20:37 276 int ext2_flush_dir(struct dentry *dir) 2008-12-04 20:37 yes 2008-12-04 20:37 this is a new function 2008-12-04 20:37 ext2 never had a flush_dir before 2008-12-04 20:37 the dentry we have hidden will be found on the list of child dentries of the directory dentry 2008-12-04 20:38 so we scan the list, and do special things to any that need deferred processing 2008-12-04 20:38 here, we can always remove the HIDDEN flag 2008-12-04 20:38 and d_delete the dentry 2008-12-04 20:39 because we will be done with it, no more defer 2008-12-04 20:39 if this negative dentry has the BACKED flag, then it still exists on the filesystem 2008-12-04 20:39 and has to be unlinked 2008-12-04 20:40 a simple function that just calls the low lowlevel directory unlink 2008-12-04 20:40 the d_delete is then called, to detach the dentry from the inode, if that is now possible 2008-12-04 20:40 however, we call spin_unlock(dcache_lock) here 2008-12-04 20:41 I might have got those wrong ;) 2008-12-04 20:41 didn't test on smp 2008-12-04 20:41 the dcache lock is supposed to protect the dentry lists 2008-12-04 20:41 including the child list 2008-12-04 20:41 so, probably it can race with cached_lookup() 2008-12-04 20:42 yes, probably so 2008-12-04 20:42 thankyou 2008-12-04 20:43 it has to remain hidden after the d_delete, which removes the d_inode 2008-12-04 20:43 yes 2008-12-04 20:44 you found a bug, making this entire session worth it ;) 2008-12-04 20:44 :) 2008-12-04 20:44 one more thing about this dir flush 2008-12-04 20:44 in tux3 we need some way of knowing which dirs need to be flushed 2008-12-04 20:45 I haven't thought about that yet, but it does not seem ahrd 2008-12-04 20:45 hard 2008-12-04 20:45 the other thing that happens here, is we clear the BACKED flag 2008-12-04 20:45 which similarly has to be done after the real unlink 2008-12-04 20:45 now fs/ext2/namei.c 2008-12-04 20:46 +static int ext2_hide_dentry(struct dentry *dentry) 2008-12-04 20:46 +{ 2008-12-04 20:46 + if 
(dentry->d_flags & DCACHE_BACKED) { 2008-12-04 20:46 + BUG_ON(!dentry->d_inode); 2008-12-04 20:46 + show_dentry("hide dentry", dentry); 2008-12-04 20:46 + dentry->d_flags |= DCACHE_HIDDEN; 2008-12-04 20:46 + dget(dentry); 2008-12-04 20:46 + return 1; 2008-12-04 20:46 + } 2008-12-04 20:46 + return 0; 2008-12-04 20:46 +} 2008-12-04 20:46 this hides the dentry 2008-12-04 20:47 the reason we can call it from the flush dir without hiding the dentry again is, we cleared the BACKED flag first 2008-12-04 20:47 this part is a little messy 2008-12-04 20:47 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-04 20:47 I would have had to expose some other dcache functions otherwise 2008-12-04 20:48 it's pretty obvious what that does, including taking the extra reference count on the dentry 2008-12-04 20:48 we also need to "manually" add dentry operations to every dentry ext2 creates 2008-12-04 20:49 there is no nice vfs mechanism for this, and there probably should be 2008-12-04 20:49 not many filesystems mess with the dentries, actually 2008-12-04 20:49 cluster filesystems, NFS and CIFS 2008-12-04 20:50 besides deferring the name unlink, the inode destroy also has to be deferred 2008-12-04 20:50 this after all is the reason tux3 wants this 2008-12-04 20:50 to avoid having to modify an inode table block until after delta transition 2008-12-04 20:51 deferring the inode delete is considerably easier 2008-12-04 20:51 we re-purpose the inode dirty list, and have a new list for 'deleted' inodes 2008-12-04 20:51 off our superblock 2008-12-04 20:53 this is done in what would normally be the filesystem's unlink method 2008-12-04 20:53 so when the filesystem is called to unlink the name, instead of doing that we just decrement the inode's link count 2008-12-04 20:54 +int ext2_unlink_deferred(struct inode *dir, struct dentry *dentry) <- this is called from the dir flush to do the "real unlink" 2008-12-04 20:54 notice how much smaller it got from the 
original ext2 unlink 2008-12-04 20:55 fs/ext2/super.c <- inode deferring is done here 2008-12-04 20:56 because we have unlinked the inode, the vfs may free the cached inode if its count drops to zero 2008-12-04 20:57 then we want to delete the inode on disk 2008-12-04 20:59 (had to take care of something for a minute there) 2008-12-04 21:00 in here, we remove the directory entry too? 2008-12-04 21:00 the directory entry is removed in the directory flush 2008-12-04 21:00 so that is done separately 2008-12-04 21:01 we have to do both, to be consistent, if the inode count drops to zero 2008-12-04 21:01 if user closed file before directory flush? 2008-12-04 21:01 we have to handle both cases 2008-12-04 21:01 closed or not closed 2008-12-04 21:02 if not closed, the inode becomes an orphan 2008-12-04 21:02 whoops, in that case is not i_nlink > 0 2008-12-04 21:02 if nlink remains > 0 we don't have to do anything here 2008-12-04 21:02 yes 2008-12-04 21:03 the vfs tells us when the inode's use count and link count dropped to zero, by the ->drop_inode super_operation 2008-12-04 21:04 (I don't really know why that is a super_operation and not an inode_operation) 2008-12-04 21:05 on a filesystem sync, we walk through the deferred delete list and call the real delete 2008-12-04 21:05 generic_delete_inode, which calls the filesystem 2008-12-04 21:05 well, that is the whole thing 2008-12-04 21:06 it's a pretty complex mechanism 2008-12-04 21:06 even though it is not much code 2008-12-04 21:06 and it had a race ;) 2008-12-04 21:06 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-04 21:07 in tux3, we will always do those directory flushes and sync_fs calls in the "staging" operation 2008-12-04 21:08 this means that the unlink and the inode delete will always be done together, in the same delta 2008-12-04 21:08 unless an orphan was created, because the file is open 2008-12-04 21:09 we need to detect that case somehow, I haven't covered that here 
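The inode half of the mechanism (park the inode on a per-superblock "deleted" list when the last user goes away with a zero link count, then do the real delete at sync) can be sketched in user space roughly as follows. This is a minimal model, not the patch: every m_ name and field is hypothetical, the on_disk flag stands in for the real on-disk delete that generic_delete_inode would trigger, and the orphan case for still-open files is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced inode: link count, user count, and a "still on disk" marker. */
struct m_inode {
    int nlink;
    int users;
    int on_disk;
    struct m_inode *next_deleted;   /* link in the deferred delete list */
};

/* Hypothetical superblock holding the list of inodes awaiting real deletion. */
struct m_super {
    struct m_inode *deleted;
};

/* Models ->drop_inode: called once the last user of the inode has gone away. */
static void m_drop_inode(struct m_super *sb, struct m_inode *inode)
{
    assert(sb != NULL && inode != NULL && inode->users == 0);
    if (inode->nlink > 0)
        return;                      /* still linked somewhere: nothing to defer */
    inode->next_deleted = sb->deleted;
    sb->deleted = inode;             /* defer: the disk is not touched yet */
}

/* Models the sync pass: walk the deferred list and do the real delete. */
static int m_sync_fs(struct m_super *sb)
{
    int deleted = 0;
    assert(sb != NULL);
    while (sb->deleted) {
        struct m_inode *inode = sb->deleted;
        sb->deleted = inode->next_deleted;
        inode->next_deleted = NULL;
        inode->on_disk = 0;          /* stand-in for the real on-disk delete */
        deleted++;
    }
    return deleted;
}
```

Keeping the list off the superblock means a single pass at staging time can handle all deferred deletes, which matches the stated goal of not modifying an inode table block until after delta transition.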
2008-12-04 21:09 yes 2008-12-04 21:10 ok, that was an hour of deep dcache hack, I need to fix the directory flush race now 2008-12-04 21:10 ok 2008-12-04 21:10 first I need to fix the xattrs 2008-12-04 21:10 sorry about that 2008-12-04 21:11 I will partly revert that patch 2008-12-04 21:11 ah, yes 2008-12-04 21:11 revert the whole patch 2008-12-04 21:11 git has a command for that? 2008-12-04 21:12 git? 2008-12-04 21:12 to revert a changeset 2008-12-04 21:12 remove nameops branch? 2008-12-04 21:12 I will remove that branch 2008-12-04 21:12 iirc, git branch -D nameops 2008-12-04 21:13 oh sorry 2008-12-04 21:13 I meant mercurial :) 2008-12-04 21:13 to revert changeset 583 2008-12-04 21:13 ah 2008-12-04 21:13 I don't know :) 2008-12-04 21:17 hg diff -r 1ade293577b0 -r bdae7b8b8f22 | patch -p1 -R 2008-12-04 21:17 :p 2008-12-04 21:20 yes 2008-12-04 21:24 hirofumi, did xattrtest pick up the problem before? 2008-12-04 21:24 what problem? 2008-12-04 21:24 the broken xattrs problem 2008-12-04 21:25 tux_create_entry() problem? 
2008-12-04 21:25 yes 2008-12-04 21:25 let me check ;) 2008-12-04 21:25 I don't know, I just tried to do same thing before :) 2008-12-04 21:26 main: ---- dump atom table ---- 2008-12-04 21:26 unatom: atom 2 reverse entry broken 2008-12-04 21:26 dump_atoms: atom name lookup failed 2008-12-04 21:26 show_buffers: (map 0x4173028) 2008-12-04 21:26 :) 2008-12-04 21:26 it did, it just didn't bail with an error exit code 2008-12-04 21:27 which I suppose we should add to the tests 2008-12-04 21:27 yes, assert() would be help 2008-12-04 21:27 that would do it 2008-12-04 21:29 ok, fixed 2008-12-04 21:29 and I made you nice kernel code a little uglier, sorry ;) 2008-12-04 21:29 :) 2008-12-04 21:29 we can clean up sometime 2008-12-04 21:29 ok, I'll take a break for a bit 2008-12-04 21:29 ok, see you 2008-12-04 21:42 backed 2008-12-04 21:42 hirofumi, http://www.kernel.org/pub/linux/kernel/people/daniel/ 2008-12-04 21:43 good 2008-12-04 21:43 and: http://mailman.tux3.org/pipermail/tux3/2008-December/000411.html -- Design note: Delta staging 2008-12-04 21:44 I guess I am just one design note away from having a complete design for atomic commit 2008-12-04 21:44 the last one to do is rollup 2008-12-04 21:45 well, some details on logging and log commit block format would be good 2008-12-04 21:45 yes 2008-12-04 21:46 btw, staging is replacement of transition? 2008-12-04 21:46 I forgot 2008-12-04 21:46 no, a replacement for "setup" 2008-12-04 21:47 staging happens at the time of delta transition 2008-12-04 21:47 delta transition is a very short thing, just increment a counter 2008-12-04 21:48 I need a note on that I guess 2008-12-04 21:48 delta transition -> delta staging -> delta transfer -> commit? 2008-12-04 21:48 rith 2008-12-04 21:48 delta transition -> delta staging -> delta transfer -> commit -> complete 2008-12-04 21:49 complete means "delta commit"? 
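The phase ordering just listed (transition, staging, transfer, commit, complete) can be written down as a tiny state machine. This is only an illustrative sketch: the enum and function names are hypothetical and do not come from the tux3 source, and it captures nothing beyond the ordering.

```c
#include <assert.h>

/* Hypothetical names for the delta phases, in the order given in the log. */
enum delta_phase {
    DELTA_TRANSITION,   /* very short: just increment a counter */
    DELTA_STAGING,
    DELTA_TRANSFER,
    DELTA_COMMIT,
    DELTA_COMPLETE
};

/* Advance to the next phase; phases may not be skipped or run backwards. */
static enum delta_phase delta_next(enum delta_phase phase)
{
    assert(phase != DELTA_COMPLETE);   /* nothing follows completion */
    return (enum delta_phase)(phase + 1);
}
```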
2008-12-04 21:50 depends on what we mean by "commit" 2008-12-04 21:50 kernel meaning is "start the transfer", in this case, the transfer of the commit block 2008-12-04 21:50 database meaning is "complete the operation" I think 2008-12-04 21:51 so we will have to choose one of those 2008-12-04 21:51 I think the kernel meaning is probably better, "commit to disk", i.e., submit a bio 2008-12-04 21:51 so, complete is when we get the endio 2008-12-04 21:51 i see 2008-12-04 21:52 the delta commit block can't be commited until the delta body transfer has completed 2008-12-04 21:52 (using the kernel terminology) 2008-12-04 21:52 yes 2008-12-04 21:53 we may use barrier (flush command of disk)? 2008-12-04 21:53 yes 2008-12-04 21:53 we can get clever 2008-12-04 21:54 I have something more aggressive in mind 2008-12-04 21:54 with checksums 2008-12-04 21:54 i see 2008-12-04 21:54 mount -o live-fast-die-young 2008-12-04 21:55 that is, a theoretically riskier strategy that is in practice perfectly safe, and goes significantly faster 2008-12-04 21:55 an option 2008-12-04 21:55 i see 2008-12-04 21:55 btw, about xattr 2008-12-04 21:56 ->unatom_base data size is unlimited? 2008-12-04 21:57 it is limited by the range of an atom, 2**32 2008-12-04 21:57 there is enough space for the largest, I think 2008-12-04 21:57 needs a comment 2008-12-04 21:57 well, I just though - if we put atomref_base and unatom_base data at top, and bottom is directory 2008-12-04 21:58 it may work without i_size problem 2008-12-04 21:58 you mean, put the directory at the top? 2008-12-04 21:58 that is a solution I did not think of :) 2008-12-04 21:58 directory is last posistion of atable file 2008-12-04 21:59 so the base of the directory is way up high? 2008-12-04 21:59 yes 2008-12-04 22:00 and xattr stuff does seek() for directory lookup or something 2008-12-04 22:00 that would work 2008-12-04 22:01 well, there is no file, so we have no pos 2008-12-04 22:01 ah 2008-12-04 22:01 but... 
2008-12-04 22:01 ah 2008-12-04 22:01 lookup doesn't have f_pos 2008-12-04 22:01 well, we would have to have a new parameter 2008-12-04 22:01 right 2008-12-04 22:01 so it would probably be less intrusive to pass the address of inode->i_size 2008-12-04 22:02 well, that doesn't work with get_inode_size, does it 2008-12-04 22:02 how about using i_blocks instead? 2008-12-04 22:03 i_blocks seems strange 2008-12-04 22:03 we can't use it for normal directory? 2008-12-04 22:03 we can 2008-12-04 22:03 directories are never sparse 2008-12-04 22:04 i_blocks will always be just as we calculate blocks from i_size now 2008-12-04 22:04 ah 2008-12-04 22:04 well... after the blocks have actually been allocated anyway 2008-12-04 22:04 which might be a problem 2008-12-04 22:05 probably the right thing to do is just add another parameter, the address of i_size 2008-12-04 22:05 but, normal directory has to update i_size too? 2008-12-04 22:05 or we can pass f_pos to lookup? 2008-12-04 22:06 the big problem with the i_blocks idea is that it doesn't get set correctly until the blocks are actually allocated, which will normally be deferred 2008-12-04 22:06 so forget that ;) 2008-12-04 22:06 yes, we can pass f_pos to lookup and put the directory up high 2008-12-04 22:06 that is fine with me 2008-12-04 22:07 and normal one pass 0 always 2008-12-04 22:07 loff_t *pos 2008-12-04 22:07 um 2008-12-04 22:07 sorry 2008-12-04 22:07 don't need to pass address 2008-12-04 22:07 yes 2008-12-04 22:07 easy enough 2008-12-04 22:08 and if we pass address, we can optimize lookup 2008-12-04 22:08 it can remember previous lookup pos 2008-12-04 22:08 next can start with it, and loop cyclicly 2008-12-04 22:09 it will fine for "ls -l" 2008-12-04 22:09 :) 2008-12-04 22:10 but if it has to wrap to the bottom, how does it know where the bottom is? 2008-12-04 22:10 i_size is limit? 2008-12-04 22:10 right, but where is the bottom? 
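The "remember the previous lookup position and loop cyclically" idea, together with the question of where the bottom is, can be made concrete with a small user-space sketch. It assumes a flat array of names rather than real directory blocks, and the function name and parameters are hypothetical. The point it illustrates is that a wrapping scan needs both bounds passed in explicitly, not only the top.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Search names[bottom..top) for name, starting at *pos and wrapping around
 * to bottom. On a hit, *pos is advanced to the slot after the match so that
 * the next lookup (for example the next file in "ls -l") starts there.
 * Returns the matching slot, or -1 if the name is not in the range.
 */
static long cyclic_lookup(const char *const *names, size_t bottom, size_t top,
                          size_t *pos, const char *name)
{
    assert(names != NULL && pos != NULL && name != NULL && bottom < top);
    size_t span = top - bottom;
    /* Fall back to the bottom if the remembered position is out of range. */
    size_t start = (*pos >= bottom && *pos < top) ? *pos : bottom;

    for (size_t i = 0; i < span; i++) {
        size_t slot = bottom + (start - bottom + i) % span;
        if (strcmp(names[slot], name) == 0) {
            *pos = bottom + (slot - bottom + 1) % span;
            return (long)slot;
        }
    }
    return -1;
}
```

For lookups issued in directory order, each call hits on its first probe; a miss still costs one full pass over the range.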
2008-12-04 22:10 ah 2008-12-04 22:11 right 2008-12-04 22:11 on a regular directory bottom is zero 2008-12-04 22:11 i_size is top 2008-12-04 22:11 ah 2008-12-04 22:11 yes 2008-12-04 22:11 this is sick ;) 2008-12-04 22:11 we can pass file * instead of inode * 2008-12-04 22:11 and we don't need an extra parameter ;) 2008-12-04 22:12 it's a little bogus for xattr lookup 2008-12-04 22:12 anyway, we will not have this problem with phtree 2008-12-04 22:12 because phtree doesn't need to rely on i_size 2008-12-04 22:12 good 2008-12-04 22:14 we don't have filp for ->lookup 2008-12-04 22:15 ah 2008-12-04 22:15 well, if this works, I think it may be enough for a while 2008-12-04 22:16 ok, how about this: we pass a unsigned *blocks pointer 2008-12-04 22:16 and we do all the i_size manipulation outside 2008-12-04 22:16 for ->lookup? 2008-12-04 22:17 readdir too? 2008-12-04 22:17 yes, for all the dir functions that need to know number of blocks 2008-12-04 22:17 readdir needs to know f_pos 2008-12-04 22:17 let me see how readdir is called 2008-12-04 22:17 yes, *blocks would be an additional parameter 2008-12-04 22:18 i see 2008-12-04 22:18 on another topic, I think I should clear the _HIDDEN flag in d_delete 2008-12-04 22:18 that will fix the race you noticed 2008-12-04 22:19 yes, ->d_delete 2008-12-04 22:20 it sounds work 2008-12-04 22:20 I will try to get it right this time ;) 2008-12-04 22:21 it's close 2008-12-04 22:22 close or deferred unlink? 2008-12-04 22:23 I mean, it's close to working 2008-12-04 22:24 btw, is there any way to do "git ORIG_HEAD.." in hg? 2008-12-04 22:24 what does that do? 2008-12-04 22:25 gitk 2008-12-04 22:25 hg view? 2008-12-04 22:25 gitk ORIG_HEAD.. 
shows changes from after previous pull 2008-12-04 22:27 hg view doesn't seem to have any parameters 2008-12-04 22:27 yes 2008-12-04 22:27 it would not be hard to write in python I'msure 2008-12-04 22:27 author probably never thought of it 2008-12-04 22:28 I think I see a race in dcache.c 2008-12-04 22:28 the original 2008-12-04 22:28 git has various way to view history, it's really powerful 2008-12-04 22:28 oh 2008-12-04 22:29 in dentry_iput 2008-12-04 22:29 what is problem? 2008-12-04 22:29 spin_unlock(&dentry->d_lock); 2008-12-04 22:29 spin_unlock(&dcache_lock); 2008-12-04 22:29 if (!inode->i_nlink) 2008-12-04 22:29 fsnotify_inoderemove(inode); 2008-12-04 22:29 if (dentry->d_op && dentry->d_op->d_iput) 2008-12-04 22:29 dentry->d_op->d_iput(dentry, inode); 2008-12-04 22:29 the d_iput is not called under a lock 2008-12-04 22:29 and therefore, dentry state may have changed 2008-12-04 22:30 well 2008-12-04 22:30 I suppose if it only does semantics of iput it is ok 2008-12-04 22:30 caller is only user 2008-12-04 22:31 so, state will not be changed 2008-12-04 22:31 true 2008-12-04 22:32 well, dcache is too complex 2008-12-04 22:34 the easiest place by far to clear the hidden flag is in dentry_iput 2008-12-04 22:34 I will do it there 2008-12-04 22:36 static void dentry_iput(struct dentry * dentry) 2008-12-04 22:36 __releases(dentry->d_lock) 2008-12-04 22:36 __releases(dcache_lock) 2008-12-04 22:36 { 2008-12-04 22:36 struct inode *inode = dentry->d_inode; 2008-12-04 22:36 if (inode) { 2008-12-04 22:36 dentry->d_inode = NULL; 2008-12-04 22:36 list_del_init(&dentry->d_alias); 2008-12-04 22:36 dentry->d_flags &= ~DCACHE_HIDDEN; 2008-12-04 22:36 spin_unlock(&dentry->d_lock); 2008-12-04 22:37 yes 2008-12-04 22:38 probably, ->d_delete will also work 2008-12-04 22:40 yes, if I want to have as little change to dcache.c as possible 2008-12-04 22:41 I think that the HIDDEN flag could be used instead of !dentry->d_inode 2008-12-04 22:43 in effect, ->d_inode is being used as a flag, 
it's a pretty big flag 2008-12-04 22:43 yes 2008-12-04 22:48 hirofumi, actually, the right thing to do is keep the HIDDEN flag until the dentry stops being negative 2008-12-04 22:49 the only way that happens is if somebody creates or moves to it, and we get a method call for both 2008-12-04 22:49 (I think) 2008-12-04 22:50 and you are right, ->d_delete will work 2008-12-04 22:50 well 2008-12-04 22:50 it is for reuse? 2008-12-04 22:51 d_delete won't work, because somebody could have reused the name by then 2008-12-04 22:51 maybe, HIDDEN dentry is unhashed, so it will not be reused? 2008-12-04 22:52 HIDDEN dentries are hashed 2008-12-04 22:52 that is why they have to be hidden 2008-12-04 22:52 ah 2008-12-04 22:53 I forget about it 2008-12-04 22:53 yes, it will be reused 2008-12-04 22:53 easiest is to do it in dentry_iput, it does no harm 2008-12-04 22:53 just a little extra overhead, and the d_negative is a little extra overhead too 2008-12-04 22:54 plus the check for ->hide method 2008-12-04 22:54 these will all be issues when it gets shown on lkml 2008-12-04 22:54 but, dentry_iput() and ->d_delete is same? 2008-12-04 22:54 has same problem? 2008-12-04 22:55 dentry_iput is ok, because the dentry is set negative there anyway 2008-12-04 22:55 so clearing the HIDDEN flag is always ok 2008-12-04 22:55 yes 2008-12-04 22:55 but, it still have the reuse problem 2008-12-04 22:56 what is that problem? 
2008-12-04 22:56 vfs thinks it is negative (d_inode == NULL) 2008-12-04 22:56 that's what we want 2008-12-04 22:56 so, vfs will reuse it 2008-12-04 22:57 reuse it for new create 2008-12-04 22:57 ah, and it will still be marked HIDDEN 2008-12-04 22:57 right 2008-12-04 22:57 so I have to complete the patch 2008-12-04 22:57 it's not just for delete 2008-12-04 22:57 it has to handle create as well, which will clear the HIDDEN flag 2008-12-04 22:58 and we get to do that in our filesystem 2008-12-04 22:58 just before d_instantiate 2008-12-04 22:59 ->d_delete can do it too, instead of dentry_iput() 2008-12-04 23:00 yes 2008-12-04 23:00 ok 2008-12-04 23:00 at that point, the dentry is being destroyed 2008-12-04 23:00 so it doesn't matter if we do it ;) 2008-12-04 23:01 ok, not if it is still hashed 2008-12-04 23:01 it becomes unused 2008-12-04 23:02 yes, and it will be true negative dentry 2008-12-04 23:02 if possible 2008-12-04 23:03 no 2008-12-04 23:03 becomes the negative or positive dentry cache 2008-12-04 23:03 and unused 2008-12-04 23:03 yes 2008-12-04 23:04 if it didn't work like that we would get a lot of hate mail from users ;) 2008-12-04 23:04 yes :) 2008-12-04 23:04 ok, "just for now" I am going to clear it in dentry_iput 2008-12-04 23:04 and later, when the create part is working, that will not be necessary 2008-12-04 23:04 yes 2008-12-04 23:11 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-04 23:11 I've pushed cursor stuff 2008-12-04 23:11 please check it 2008-12-04 23:12 ok 2008-12-04 23:13 ah, I_VERSION() 2008-12-04 23:13 I'm reading nfsd wishlist 2008-12-04 23:13 it may be about I_VERSION() which introduced recently 2008-12-04 23:16 yes, I need to answer bruce 2008-12-04 23:16 very nice post 2008-12-04 23:17 yes 2008-12-04 23:17 well, tux3 can store that version as an attribute 2008-12-04 23:17 it's good for that kind of thing 2008-12-04 23:17 if it is necessary to work properly with NFS then we should do it 2008-12-04 23:17 yes 2008-12-04 23:19 I 
like the explicit check for null buffer more, it helps understand the code 2008-12-04 23:19 yes, me too 2008-12-04 23:21 + mark_buffer_dirty(cursor_leafbuf(cursor)); 2008-12-04 23:21 + block_t block = bufindex(cursor_leafbuf(cursor)); <- this code reads really well 2008-12-04 23:21 and it can detect the bug more earlier 2008-12-04 23:21 yes 2008-12-04 23:21 yes 2008-12-04 23:22 cursor_bnode(cursor, level) also may helps 2008-12-04 23:22 cursor_bnodebuf(cursor, level) 2008-12-04 23:22 yes 2008-12-04 23:22 I think it can be just cursor_node 2008-12-04 23:23 and cursor_nodebuf 2008-12-04 23:23 what is node? 2008-12-04 23:23 node = !leaf 2008-12-04 23:23 ah, node == bnode? 2008-12-04 23:24 right 2008-12-04 23:24 i see 2008-12-04 23:24 we can keep struct bnode, but most other uses can just be node 2008-12-04 23:24 I like bnode, because node may be inode and others 2008-12-04 23:25 both is node 2008-12-04 23:25 but when you have cursor_ you always know it is a btree node 2008-12-04 23:25 otherwise you would write cursor_bleaf too 2008-12-04 23:25 ACTION hopes hirofumi doesn't like that 2008-12-04 23:26 yes, I don't like it :) 2008-12-04 23:26 :) 2008-12-04 23:26 nice cleanup on advance 2008-12-04 23:27 but, the begginer may not know it is node == bnode 2008-12-04 23:27 well, it doesn't matter 2008-12-04 23:27 I don't think they will confuse with inode 2008-12-04 23:27 inode is always inode 2008-12-04 23:27 oh, tuxnode_t 2008-12-04 23:28 it doesn't have i 2008-12-04 23:28 tnode? ;) 2008-12-04 23:28 it's pretty damm nice code 2008-12-04 23:28 however we spell it ;) 2008-12-04 23:29 well, I'm not sure which is good 2008-12-04 23:29 I like fewer abbreviations in a name if possible 2008-12-04 23:30 cursor->path[level].next++->block <- so you didn't like this, hmm? 
2008-12-04 23:30 I have to admit 2008-12-04 23:30 that gets close to some of the ugliest c I have seen 2008-12-04 23:30 another thing in there 2008-12-04 23:31 I thought if it fails, maybe we shouldn't update ->next 2008-12-04 23:31 ah 2008-12-04 23:31 subtle 2008-12-04 23:31 yes 2008-12-04 23:32 probably, it doesn't matter actually 2008-12-04 23:32 you got rid of my swap in bree_leaf_split, that is good 2008-12-04 23:33 the swap was too clever 2008-12-04 23:33 very hard to read 2008-12-04 23:33 thanks 2008-12-04 23:35 that sparse warning in dir.c was probably a real bug 2008-12-04 23:35 I tried to factor the bswap out of the loop, and did it wrong 2008-12-04 23:35 not that it matters, really 2008-12-04 23:35 ah 2008-12-04 23:36 ok, pulling 2008-12-04 23:37 you are right, we can avoid from_be_u32() 2008-12-04 23:37 you are just like me :) 2008-12-04 23:37 have to save every nanosecond 2008-12-04 23:38 yes 2008-12-04 23:38 ok, I will do the cursor_bnode respelling 2008-12-04 23:38 which one? 2008-12-04 23:39 cursor_bnode -> cursor_node, like cursor_leaf 2008-12-04 23:39 ah, ok 2008-12-04 23:40 a small thing, but I like it more, and bnode is eventually going to go away when the btree index gets generalized 2008-12-04 23:40 which is necessary in order to handle a btree mapped into a file 2008-12-04 23:40 oh 2008-12-04 23:40 right now it works because inode table index and data leaf index is exactly the same 2008-12-04 23:40 it won't always be 2008-12-04 23:41 there will be different accelerator bits stored in the unused bits of the index pointers 2008-12-04 23:43 um... 2008-12-04 23:46 /src/tux3/kernel/namei.c:16: error: implicit declaration of function 'ERR_CAST' 2008-12-04 23:49 we don't have ERR_CAST in userspace 2008-12-04 23:49 now we do 2008-12-04 23:50 ok, I restored tux_dir_is_empty() optimization 2008-12-04 23:55 hg repo seems some strange is happend 2008-12-05 00:24 what is strange? 
2008-12-05 00:24 whoops 2008-12-05 00:24 two heads 2008-12-05 00:24 and it seems not be merged 2008-12-05 00:24 this is hard to recover from 2008-12-05 00:25 hg rollback? 2008-12-05 00:25 yes 2008-12-05 00:27 now, when I commit changes I get two heads 2008-12-05 00:27 rollback goes back to one head 2008-12-05 00:27 ugh 2008-12-05 00:27 really 2008-12-05 00:27 we want "hg reset --hard" 2008-12-05 00:34 ok, fixed 2008-12-05 00:36 yes 2008-12-05 00:36 I did a rollback, saved my changes as a patch, checkout your revision, then commit 2008-12-05 00:36 this is a definite weakness in mercurial 2008-12-05 00:36 not being able to merge two heads 2008-12-05 00:36 ther is bogus rev though 2008-12-05 00:37 so there is 2008-12-05 00:37 and I have two heads 2008-12-05 00:37 damm 2008-12-05 00:37 I thought I had only one 2008-12-05 00:37 it seems there are 591 and 592 2008-12-05 00:37 oh wait 2008-12-05 00:38 I have only one head in my private repo 2008-12-05 00:38 hirofumi@devron (tux3)$ ../../usr/bin/hg log 2008-12-05 00:38 changeset: 592:14bb381c568f 2008-12-05 00:38 tag: tip 2008-12-05 00:38 parent: 590:a1a5e8fd56a1 2008-12-05 00:38 user: daniel@moonbase.phunq.net 2008-12-05 00:38 date: Fri Dec 05 00:34:16 2008 -0800 2008-12-05 00:38 summary: Respell cursor_bnode as cursor_node; finished restoring offset return from tux_create_entry; add ERR_CAST to userspace ERR_PTR header 2008-12-05 00:39 changeset: 591:aad4758a962f 2008-12-05 00:39 parent: 585:186d2d41b6b5 2008-12-05 00:39 user: daniel@moonbase.phunq.net 2008-12-05 00:39 ah, I know what happened 2008-12-05 00:39 date: Thu Dec 04 23:49:55 2008 -0800 2008-12-05 00:39 summary: Respell cursor_bnode as cursor_node; finished restoring offset return from tux_create_entry; add ERR_CAST to userspace ERR_PTR header 2008-12-05 00:39 I pulled to the public repo without doing a rollback first 2008-12-05 00:39 I will clone my private now 2008-12-05 00:42 seems to be ok now 2008-12-05 00:43 http://www.selenic.com/mercurial/bts/ 
2008-12-05 00:43 looks like good to me 2008-12-05 00:44 hg merge for this condition? 2008-12-05 00:44 that didn't work 2008-12-05 00:45 I did a rollback, removed my local changes, then did hg checkout on your latest revision 2008-12-05 00:45 after that it worked 2008-12-05 00:45 but that was only possible because the problem happend just on revision back 2008-12-05 00:46 if it was further back it would be a much harder problem 2008-12-05 00:46 I can't see why hg doesn't work for this 2008-12-05 00:47 bug? 2008-12-05 00:47 um... 2008-12-05 00:47 maybe, this is primary job for hg 2008-12-05 00:48 I would think 2008-12-05 00:48 or our version is too old 2008-12-05 00:49 "If the pull added two heads, it means you have both committed and uncommited changes in your repository. In this case, your changes will be as they were, and you may continue working in that state until you see fit to commit, and then hg merge to merge with the extra head." 2008-12-05 00:49 http://opensolaris.org/os/community/tools/scm/hg_teamware_transition/hg_workflow/ 2008-12-05 00:51 this change partially undoes some of your work: http://hg.tux3.org/tux3?cs=14bb381c568f 2008-12-05 00:51 I will manually reapply it 2008-12-05 00:51 oh 2008-12-05 01:00 ok, "hg merge" seems to solves this 2008-12-05 01:01 should I continue manually reapplying your changes? 
2008-12-05 01:01 not sure 2008-12-05 01:01 seems safest 2008-12-05 01:02 let me tell what I found 2008-12-05 01:02 if repo has two heads, 2008-12-05 01:02 hirofumi@devron (tux3)$ ../../usr/bin/hg heads 2008-12-05 01:02 changeset: 591:aad4758a962f 2008-12-05 01:02 tag: tip 2008-12-05 01:02 parent: 585:186d2d41b6b5 2008-12-05 01:02 user: daniel@moonbase.phunq.net 2008-12-05 01:02 date: Thu Dec 04 23:49:55 2008 -0800 2008-12-05 01:02 summary: Respell cursor_bnode as cursor_node; finished restoring offset return from tux_create_entry; add ERR_CAST to userspace ERR_PTR header 2008-12-05 01:02 changeset: 590:a1a5e8fd56a1 2008-12-05 01:02 user: OGAWA Hirofumi 2008-12-05 01:02 date: Fri Dec 05 16:09:34 2008 +0900 2008-12-05 01:02 summary: Fix sparse warnning in tux_dir_is_empty() 2008-12-05 01:02 hirofumi@devron (tux3)$ ../../usr/bin/hg parent 2008-12-05 01:02 changeset: 590:a1a5e8fd56a1 2008-12-05 01:02 user: OGAWA Hirofumi 2008-12-05 01:02 date: Fri Dec 05 16:09:34 2008 +0900 2008-12-05 01:02 summary: Fix sparse warnning in tux_dir_is_empty() 2008-12-05 01:02 and parent meas current working head 2008-12-05 01:03 if we want to merge 591 to current 590 2008-12-05 01:03 hg merge 591 2008-12-05 01:03 :) 2008-12-05 01:04 if there is conflict, vi or something will be used 2008-12-05 01:04 2008-12-05 01:04 <<<<<<< /devel/linux/works/git/mercurial/a/tux3/user/kernel/btree.c 2008-12-05 01:04 trace_off(printf("pop to level %i, block %Lx, %i of %i n 2008-12-05 01:04 odes\n", level, bufindex(cursor->path[level].buffer), cursor->path[level].next - 2008-12-05 01:04 cursor_bnode(cursor, level)->entries, bcount(cursor_bnode(cursor, level)));); 2008-12-05 01:04 ======= 2008-12-05 01:04 trace_off(printf("pop to level %i, block %Lx, %i of %i n 2008-12-05 01:04 odes\n", level, cursor->path[level].buffer->index, cursor->path[level].next - cu 2008-12-05 01:04 rsor_node(cursor, level)->entries, bcount(cursor_node(cursor, level)));); 2008-12-05 01:04 >>>>>>> /tmp/btree.c~other.r-lC-D 2008-12-05 
01:04 2008-12-05 01:04 like cvs conflict mark 2008-12-05 01:05 then we have to solve this conflict by hand 2008-12-05 01:06 and we will commit this change as new rev 2008-12-05 01:06 this seems the way to merge two heads 2008-12-05 01:10 right 2008-12-05 01:11 ok, I copied btree.c from the previous revision and redid the bnode change 2008-12-05 01:11 it builds 2008-12-05 01:11 your changes all look good 2008-12-05 01:12 current public repo? 2008-12-05 01:12 not yet 2008-12-05 01:12 just checking it buidls in kernel 2008-12-05 01:13 it seems fine 2008-12-05 01:13 what should the commit comment be? this should be interesting 2008-12-05 01:14 comment for "hg merge"? 2008-12-05 01:14 for my restore of your changes 2008-12-05 01:15 it will be depend on final history view 2008-12-05 01:15 Comment: Restore Hirofumi's improvements to btree.c after an interesting 'merge event' 2008-12-05 01:15 looks good 2008-12-05 01:17 pushed to public 2008-12-05 01:18 fs/tux3/inode.c:182: undefined reference to `cursor_leafbuf' 2008-12-05 01:18 :p 2008-12-05 01:19 (.text+0x52ff1): undefined reference to `cond_resched' 2008-12-05 01:20 looks strange 2008-12-05 01:20 ah I didn't copy in the latest 2008-12-05 01:21 ah no I did 2008-12-05 01:21 cd linux/fs && mv tux3 tux3.orig && ln -s ../../../tux3/usr/kernel tux3 2008-12-05 01:22 public repo seems ok in kernel and userland 2008-12-05 01:22 I didn't make that mistake, inode.c is actually funny 2008-12-05 01:24 probably, we have to re-clone from public repo 2008-12-05 01:24 struct buffer_head *cursor_leafbuf(struct cursor *cursor); <- kernel/tux3.h has this 2008-12-05 01:24 yes 2008-12-05 01:25 and the function is defined in btree.c 2008-12-05 01:25 and is not static 2008-12-05 01:25 so I don't know why the link error happened 2008-12-05 01:26 from error, it seems not recompiled inode.o 2008-12-05 01:26 yes, likely 2008-12-05 01:26 doing make clean 2008-12-05 01:27 or rm -f fs/tux3/* 2008-12-05 01:27 ah 2008-12-05 01:27 there is also that 
cond_resched weirdness 2008-12-05 01:28 it may be inline function 2008-12-05 01:28 maybe 2008-12-05 01:28 btw, are you building kernel on different directory with source? 2008-12-05 01:29 ? 2008-12-05 01:29 just a normal build 2008-12-05 01:29 i.e. $ cd tux3-build-dir && make -C ../linux-tux3 O=`pwd` 2008-12-05 01:29 no, just cd linux && make linux ARCH=um 2008-12-05 01:30 it will output *.o or something to tux3-build-dir 2008-12-05 01:30 yes, that is nice 2008-12-05 01:30 for modules 2008-12-05 01:30 I build it as a built-in in uml 2008-12-05 01:31 O= is for kernel 2008-12-05 01:31 for vmlinux 2008-12-05 01:33 it can keep the source tree clean 2008-12-05 01:33 ah 2008-12-05 01:33 and it can work for different .config 2008-12-05 01:33 so not just for moduels 2008-12-05 01:33 right 2008-12-05 01:33 handy with git 2008-12-05 01:34 ok, it built fine 2008-12-05 01:34 it was just some make breakage 2008-12-05 01:34 make was always a bad idea 2008-12-05 01:34 almost everything about it is wrong ;) 2008-12-05 01:35 anyway, we seem to be back to a stable state 2008-12-05 01:35 with O=, we can just rm -f fs/tux3/* 2008-12-05 01:35 yes 2008-12-05 01:36 now I am thinking that your tux3/user/kernel arrangement is a little funny, it's an extra level 2008-12-05 01:36 however, we may have to re-clone from public repo for clean history 2008-12-05 01:37 hg pull seems to remove old history 2008-12-05 01:37 well, there is no actual problem though 2008-12-05 01:38 hg pull seems to don't remove old history 2008-12-05 01:38 ok 2008-12-05 01:38 no problem is good 2008-12-05 01:38 now soon it will be time to make a git repo on kernel.org 2008-12-05 01:38 this time, it should be a clone from linus's tree 2008-12-05 01:38 yes 2008-12-05 01:38 and I am ok with losing the history 2008-12-05 01:38 me too 2008-12-05 01:39 we still have all the hg history 2008-12-05 01:39 yes 2008-12-05 01:39 and nobody really wants to know about junkfs ;) 2008-12-05 01:39 so next week some time 2008-12-05 01:39 
however, git repo is what's for? 2008-12-05 01:40 it's for the kernel code 2008-12-05 01:40 like this one: http://phunq.net/ddtree?p=tux3fs 2008-12-05 01:40 now, we have user/kernel in hg 2008-12-05 01:41 right, but other git users can't pull from that 2008-12-05 01:41 yes 2008-12-05 01:41 having a full tree also let's us put the deferred namespace ops on a branch 2008-12-05 01:42 however, I think we may want to copy user/kernel/* to branch 2008-12-05 01:43 branch of what? 2008-12-05 01:43 oh 2008-12-05 01:43 so they can be in the same branch? 2008-12-05 01:43 for more later, I think we can make good history user/kernel from hg 2008-12-05 01:43 for more later, I think we can make good history git:master from user/kernel of hg 2008-12-05 01:43 oh, you mean in hg 2008-12-05 01:43 oh 2008-12-05 01:43 that would be nice 2008-12-05 01:43 how do we do that? 2008-12-05 01:44 is there an importer? 2008-12-05 01:44 basically convert hg to git 2008-12-05 01:44 probably, it have to filter user/* out 2008-12-05 01:45 and import converted git to master 2008-12-05 01:45 the user/ directory isn't really doing anything for us now 2008-12-05 01:45 I wonder if it should all be lifted up another level in the tree? 2008-12-05 01:45 yes 2008-12-05 01:45 mv tux3/user/* tux3/ 2008-12-05 01:46 hg mv tux3/user/* tux3/ 2008-12-05 01:46 it's not enough probably 2008-12-05 01:46 the patch would be fs/tux3 2008-12-05 01:47 maybe, convert hg to git 2008-12-05 01:47 maybe ;) 2008-12-05 01:47 then git format-patch 2008-12-05 01:47 and change the patches for kernel (user/kernel to fs/tux3, etc.) 2008-12-05 01:48 maybe, it would make history 2008-12-05 01:48 well, hg is still kind of nice for the user space work 2008-12-05 01:49 maybe I will want to switch to git after using it more 2008-12-05 01:49 it seems right for a git tree to just be a clone of linus's tree, with tux3 added 2008-12-05 01:49 userspace too? 
2008-12-05 01:49 userspace in a separate repo 2008-12-05 01:50 ok 2008-12-05 01:50 I think it is right to have the kernel files in two places like we do 2008-12-05 01:50 but I could be convinced otherwise 2008-12-05 01:50 we have a week to think about it 2008-12-05 01:51 ok 2008-12-05 01:51 the move of the HIDDEN bit unmask, broke the defer code ;) 2008-12-05 01:51 need to figure out why 2008-12-05 01:52 moved to dentry_iput()? 2008-12-05 01:52 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-05 01:52 yes 2008-12-05 01:53 i see 2008-12-05 01:53 dput() was called somehow probably 2008-12-05 01:53 yes it was 2008-12-05 01:53 eh 2008-12-05 01:53 if d_count == 1 + hide, it will call dentry_iput()? 2008-12-05 01:54 ah, it's ok 2008-12-05 01:54 hide will be zero at the d_delete because _BACKED was unset 2008-12-05 01:55 hey flips 2008-12-05 01:55 hi 2008-12-05 01:58 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-05 01:59 ah, the d_delete unhashes the dentry, because the ref count is 2 2008-12-05 01:59 I've pushed to restore is_empty() optimization 2008-12-05 01:59 :) 2008-12-05 01:59 obsessive optimizer like me 2008-12-05 02:00 yes, if I noticed 2008-12-05 02:01 and you put in the FIXME for 32 bit inum 2008-12-05 02:01 yes 2008-12-05 02:01 which we have to handle with mount flags I think 2008-12-05 02:01 it is just reminder 2008-12-05 02:02 pulled 2008-12-05 02:02 thanks 2008-12-05 02:04 btw, is there any request for next work I tackle? 2008-12-05 02:08 more directory operations? 2008-12-05 02:09 we don't seem to be getting any volunteers 2008-12-05 02:09 I asked twice ;) 2008-12-05 02:10 some days ago, brain says he wants to try 2008-12-05 02:10 well, ok 2008-12-05 02:10 link and unlink should be pretty easy 2008-12-05 02:10 yes 2008-12-05 02:10 symlink... I need to do some work to support that 2008-12-05 02:11 store_attrs or page data 2008-12-05 02:11 store_attrs 2008-12-05 02:11 if it's big, we may use page data? 
2008-12-05 02:12 yes 2008-12-05 02:12 ok 2008-12-05 02:12 what is the maximum size of a symlink in linux? 4K? 2008-12-05 02:12 iirc, 4k 2008-12-05 02:13 so if it's that big it has to be file data 2008-12-05 02:13 vfs name is 4k, iirc 2008-12-05 02:13 we will implement smaller symlnks first 2008-12-05 02:13 yes 2008-12-05 02:14 maybe, it's not so hard 2008-12-05 02:14 ACTION looks in attr.c 2008-12-05 02:14 ah 2008-12-05 02:14 it's like an extended attribute 2008-12-05 02:14 so, not a big change 2008-12-05 02:15 probably 2008-12-05 02:15 it could be an extended attribute even 2008-12-05 02:16 and be cached in the xattr cache 2008-12-05 02:17 I'm not sure, it's good for vfs interface 2008-12-05 02:17 maybe not 2008-12-05 02:18 heh, you can set an xattr on a symlink 2008-12-05 02:18 it seems ok 2008-12-05 02:18 why not 2008-12-05 02:18 just create ->follow_link hander 2008-12-05 02:19 I'm just figuring out how that works now 2008-12-05 02:19 vfs will does other things 2008-12-05 02:21 like generic_readlink and page_follow_link_light 2008-12-05 02:21 but we have to get the link data into the page 2008-12-05 02:21 it is path of page data 2008-12-05 02:21 _fast_ is non page data 2008-12-05 02:22 yes, so page data is just read by block_read_full_page? 2008-12-05 02:23 yes, if symlink was stored to page data 2008-12-05 02:23 so it is just the fast link that needs treatment like xattr 2008-12-05 02:24 yes 2008-12-05 02:24 ok, well you can make symlink work with page data then, and I can work on the attribute version, how does that sound? 2008-12-05 02:25 if xattr already in cache 2008-12-05 02:25 sounds good 2008-12-05 02:25 as an xattr it should be nice, I wonder if we just want to reserve an atom number to be a symlink 2008-12-05 02:26 I think I intended to save the two byte xattr code for symlinks and immediate file data 2008-12-05 02:26 but, why do we need to lookup another blocks for fast link? 
2008-12-05 02:27 we don't 2008-12-05 02:27 xattr lookup on happens in the sys_getxattr etc paths 2008-12-05 02:27 s/on/only/ 2008-12-05 02:27 fast symlink is stored as xattr? 2008-12-05 02:28 maybe 2008-12-05 02:28 and we do getxattr to get fast symlink? 2008-12-05 02:28 if not, it will be almost the same as an xattr 2008-12-05 02:29 not getxattr, but xcache_lookup 2008-12-05 02:29 xcache_lookup is very fast, does not look up a name 2008-12-05 02:29 but xcache is needed to read xattr to setup? 2008-12-05 02:29 setup? 2008-12-05 02:30 oh 2008-12-05 02:30 data of xcache 2008-12-05 02:30 yes, it has to store the fast symlink somewhere 2008-12-05 02:30 ext2 stores it in the pointer region of its inode 2008-12-05 02:30 cache inode 2008-12-05 02:31 if we use xcache, we can make our inodes smaller 2008-12-05 02:31 maybe that doesn't matter 2008-12-05 02:31 and on-disk, it stores to inode itself 2008-12-05 02:31 yes 2008-12-05 02:31 just like an xattr 2008-12-05 02:32 on disk, ext2 also has it in the pointer fields of the inode 2008-12-05 02:32 if there were huge numbers of xattrs on files, then the linear search for the atom number might go slow 2008-12-05 02:33 I don't think there ever will be huge numbers of xattrs 2008-12-05 02:33 typically there are one or two at most 2008-12-05 02:33 acls 2008-12-05 02:34 if we store fast symlink like mtime, we don't need to read any block? 2008-12-05 02:34 any other block 2008-12-05 02:35 right 2008-12-05 02:35 however, xattr should be fast, so we will use it? 2008-12-05 02:36 xattr is fast to decode and encode 2008-12-05 02:36 and to do the atom lookup 2008-12-05 02:36 in the xcache 2008-12-05 02:36 resolving the name in the getxattr interface is not so fast, but that will not happen in the symlink path 2008-12-05 02:37 well I should sleep 2008-12-05 02:38 ah, it stores to xattr->body? 
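[editor's note] The idea above, finding a fast symlink by a linear search for a reserved atom number in the xattr cache with no name resolution, might look roughly like the sketch below. Everything in it is an assumption made for illustration: the SYMLINK_ATOM value, the fixed-size slots, the helper fast_symlink_read, and the signatures of xcache_lookup and xcache_add are guesses, not the real tux3 interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reserved atom number for fast symlinks; not a real tux3 constant. */
enum { SYMLINK_ATOM = 1, XCACHE_SLOTS = 4, XATTR_BODY_MAX = 60 };

/* One cached attribute, keyed by atom number rather than by name. */
struct xattr {
	uint16_t atom, size;
	char body[XATTR_BODY_MAX];
};

/* Simplified cache with fixed-size slots; a real cache would likely pack entries. */
struct xcache {
	int count;
	struct xattr xattrs[XCACHE_SLOTS];
};

/* Linear search by atom number; cheap while a file has only one or two xattrs. */
static struct xattr *xcache_lookup(struct xcache *xcache, unsigned atom)
{
	for (int i = 0; i < xcache->count; i++)
		if (xcache->xattrs[i].atom == atom)
			return &xcache->xattrs[i];
	return NULL;
}

/* Append an attribute; returns 0 on success, -1 if it does not fit. */
static int xcache_add(struct xcache *xcache, unsigned atom, const void *data, size_t size)
{
	assert(data || !size);
	if (xcache->count == XCACHE_SLOTS || size > XATTR_BODY_MAX)
		return -1;
	struct xattr *xattr = &xcache->xattrs[xcache->count++];
	xattr->atom = atom;
	xattr->size = size;
	memcpy(xattr->body, data, size);
	return 0;
}

/* Copy the symlink target into buf as a C string; returns its length or -1. */
static int fast_symlink_read(struct xcache *xcache, char *buf, size_t bufsize)
{
	struct xattr *xattr = xcache_lookup(xcache, SYMLINK_ATOM);

	if (!xattr || xattr->size >= bufsize)
		return -1;
	memcpy(buf, xattr->body, xattr->size);
	buf[xattr->size] = 0;
	return xattr->size;
}
```

The slow part mentioned above, resolving an attribute name through the getxattr interface, never appears on this path because the atom number is known in advance.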
2008-12-05 02:38 ok 2008-12-05 02:38 right 2008-12-05 02:38 i see, I thought it stores to atable 2008-12-05 02:38 fortunately, no 2008-12-05 02:38 ok, thanks 2008-12-05 02:39 so we can reserve a few atom codes and use them for special things like symlinks 2008-12-05 02:39 i see 2008-12-05 02:39 and there is the thing I talked about, a version link 2008-12-05 02:39 which is like a symlink, except it links a file in a different snapshot 2008-12-05 02:40 oh, i see 2008-12-05 02:40 people should be able to have fun with those :) 2008-12-05 02:41 yes :) 2008-12-05 02:41 good night 2008-12-05 02:41 yes, good night 2008-12-05 03:16 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2008-12-05 03:26 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-12-05 08:33 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-05 08:58 -!- Bobby_(~Bobby@122.162.67.26) has joined #tux3 2008-12-05 08:59 anyone here... 2008-12-05 08:59 I highly doubt flips is awake 2008-12-05 08:59 hmm.. 2008-12-05 08:59 not able to mount tux3 in uml 2008-12-05 08:59 :( 2008-12-05 08:59 bummer 2008-12-05 08:59 can u help? 2008-12-05 09:00 hah 2008-12-05 09:00 I'm just the resident tech evangelist 2008-12-05 09:00 :) 2008-12-05 09:00 ohk 2008-12-05 09:00 someone has to do that, right? 2008-12-05 09:01 yup 2008-12-05 09:01 good tht we have u :) 2008-12-05 09:01 :) 2008-12-05 09:06 what error? 2008-12-05 09:06 -!- pranihome(~Bobby@122.162.67.26) has joined #tux3 2008-12-05 09:06 pranihome, what error? 2008-12-05 09:06 hirofumi is here to save the day... 2008-12-05 09:07 mount /dev/ubdb /mnt 2008-12-05 09:07 says u need to mention the fs type 2008-12-05 09:08 /dev/ubdb is formatted device? 2008-12-05 09:08 so i gave.. 
mount -t tux3 /dev/ubdb /mnt 2008-12-05 09:08 -!- mingming(~mingming@c-71-193-163-244.hsd1.or.comcast.net) has joined #tux3 2008-12-05 09:08 maybe, tux3 mkfs /dev/ubdb 2008-12-05 09:08 lwn.net/articles/308950 2008-12-05 09:09 404 not found 2008-12-05 09:10 lwn.net/Articles/308950 2008-12-05 09:10 -!- pranihome(~Bobby@122.162.67.26) has joined #tux3 2008-12-05 09:10 # make a tux3 filesystem in a file 2008-12-05 09:10 dd if=/dev/zero of=testdev bs=1M count=1 2008-12-05 09:10 tux3 mkfs testdev 2008-12-05 09:10 yup 2008-12-05 09:10 did you run this part? 2008-12-05 09:11 hmm 2008-12-05 09:11 i ran that in userspace :) 2008-12-05 09:11 not in uml 2008-12-05 09:11 in userspace is ok 2008-12-05 09:11 -!- pranihome(~Bobby@122.162.67.26) has joined #tux3 2008-12-05 09:11 in userspace is ok 2008-12-05 09:12 tux3 there the tux3 formed in userspace, right? 2008-12-05 09:12 yes 2008-12-05 09:12 grep tux3 /proc/filesystems? 2008-12-05 09:13 i.e. uml has CONFIG_TUX3=y in .config? 2008-12-05 09:14 yeah 2008-12-05 09:14 oh, strange 2008-12-05 09:14 -!- pranihome(~Bobby@122.162.67.26) has joined #tux3 2008-12-05 09:15 i have tux3 in /proc/filesystems 2008-12-05 09:16 od -x /dev/ubdb | head 2008-12-05 09:17 0000000 0000 0000 0000 0000 0000 0000 0000 0000 2008-12-05 09:17 * 2008-12-05 09:17 0010000 7574 3378 08dd 0609 0000 0000 0000 0000 2008-12-05 09:18 it should have "tux3" like above 2008-12-05 09:18 i have the first two lines 2008-12-05 09:18 basically setup seems ok 2008-12-05 09:18 third line has 4000000 2008-12-05 09:19 it is dump of struct disksuper 2008-12-05 09:19 ah 2008-12-05 09:20 third line seems bogus 2008-12-05 09:20 hmm 2008-12-05 09:20 shud i start over? 2008-12-05 09:20 retry mkfs in userspace 2008-12-05 09:20 tux3 mkfs.. 2008-12-05 09:20 tux3 mkfs testdev 2008-12-05 09:21 this is in the tux3/user folder, right? 
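[editor's note] The od check above can be read as follows: offset 0010000 is octal for 4096, and on a little-endian host the 16-bit words 7574 3378 are the bytes 't' 'u' 'x' '3'. A minimal sketch of the same test in C is given here. It assumes only what the dump shows (an ASCII "tux3" at byte 4096); the rest of struct disksuper is not modelled, and has_tux3_magic is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { SB_OFFSET = 4096 };	/* od offset 0010000 is octal for 4096 */

/* Returns 1 if the image carries the "tux3" signature at the superblock offset. */
static int has_tux3_magic(const unsigned char *image, size_t size)
{
	static const char magic[4] = { 't', 'u', 'x', '3' };

	assert(image != NULL);
	if (size < SB_OFFSET + sizeof magic)
		return 0;	/* too short to hold a superblock at all */
	return memcmp(image + SB_OFFSET, magic, sizeof magic) == 0;
}
```

An image that dumps as all zeros, like the one reported above, fails this check, which is consistent with the "invalid superblock" message seen at mount time when uml is pointed at a file other than the one mkfs wrote.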
2008-12-05 09:21 yes 2008-12-05 09:21 ok 2008-12-05 09:21 lemme try 2008-12-05 09:21 and specify the patch of testdev which is using in uml 2008-12-05 09:21 s/patch/path/ 2008-12-05 09:22 -!- pranihome(~Bobby@122.162.67.26) has joined #tux3 2008-12-05 09:23 -!- bobby(~bobby@122.162.67.26) has joined #tux3 2008-12-05 09:25 deep login: ReiserFS: ubdb: warning: sh-2021: reiserfs_fill_super: can not find reiserfs on ubdb 2008-12-05 09:25 VFS: Can't find ext3 filesystem on dev ubdb. 2008-12-05 09:25 TUX3: invalid superblock [0]<4>ReiserFS: ubdb: warning: sh-2021: reiserfs_fill_super: can not find reiserfs on ubdb 2008-12-05 09:25 VFS: Can't find ext3 filesystem on dev ubdb. 2008-12-05 09:25 TUX3: invalid superblock [0]<4>ReiserFS: ubdb: warning: unknown mount option "force" 2008-12-05 09:25 VFS: Can't find ext3 filesystem on dev ubdb. 2008-12-05 09:25 TUX3: invalid superblock [0]<3>TUX3: invalid superblock [0]<3>TUX3: invalid superblock [0]<4>ReiserFS: ubdb: warning: sh-2021: reiserfs_fill_super: can not find reiserfs on ubdb 2008-12-05 09:25 VFS: Can't find ext3 filesystem on dev ubdb. 2008-12-05 09:25 TUX3: invalid superblock [0] 2008-12-05 09:26 it seems, od -x /dev/ubdb |head has still strange data 2008-12-05 09:26 yeah 2008-12-05 09:27 od -x is still the same 2008-12-05 09:27 did you specify the same path of testdev to uml? 2008-12-05 09:28 i think i followed the article to the dot... 2008-12-05 09:29 tux3 mkfs testdev and ./linux ubda=tuxroot ubdb=testdev 2008-12-05 09:29 the both of testdev must be same file 2008-12-05 09:29 yup, lemme try agian 2008-12-05 09:50 -!- pranihome(~bobby@122.162.67.26) has joined #tux3 2008-12-05 09:58 running uml in virtualbox is a pain :( 2008-12-05 09:59 I don't know about virtualbox though 2008-12-05 09:59 -!- pranihome(~bobby@122.162.67.26) has joined #tux3 2008-12-05 10:00 can't virtualbox run normal linux? 
2008-12-05 10:01 yeah 2008-12-05 10:01 im running normal linux in virtualbox 2008-12-05 10:01 and uml in that normal linux 2008-12-05 10:01 :) 2008-12-05 10:02 if so, creating new virtuanl environment for tux3 would be easy 2008-12-05 10:03 you can just run tux3 kernel as normal kernel 2008-12-05 10:03 hmm 2008-12-05 10:03 im running virtualbox on vista 2008-12-05 10:03 :) 2008-12-05 10:03 that's ok 2008-12-05 10:04 -!- pgquiles(~pgquiles@95.Red-88-23-241.staticIP.rima-tde.net) has joined #tux3 2008-12-05 10:04 you can just create new guest for tux3? 2008-12-05 10:04 hmm 2008-12-05 10:04 im not sure how... 2008-12-05 10:06 install normal distro to guest env 2008-12-05 10:06 yup, ubuntu 2008-12-05 10:06 then, replace kernel with tux3 kernel 2008-12-05 10:06 oh :) 2008-12-05 10:06 that way... yeah 2008-12-05 10:07 and specify new disk (maybe file) for tux3 device 2008-12-05 10:08 -!- pranihome(~bobby@122.162.67.26) has joined #tux3 2008-12-05 10:08 and specify new disk (maybe file) for tux3 device 2008-12-05 10:09 yippee.. tux3 uml FTW 2008-12-05 10:09 :) 2008-12-05 10:09 hmm 2008-12-05 10:09 uml worked? 2008-12-05 10:10 yup 2008-12-05 10:11 -!- Solaris(~satan@a83-132-82-62.cpe.netcabo.pt) has joined #tux3 2008-12-05 10:11 ok, good 2008-12-05 10:11 enjoy :) 2008-12-05 10:11 :) 2008-12-05 10:15 ok.. im not able to write to a file :) 2008-12-05 10:16 hirofumi, is write implemented? 2008-12-05 10:16 yes 2008-12-05 10:16 create and write 2008-12-05 10:16 hmm 2008-12-05 10:16 but, truncate is not implemented 2008-12-05 10:16 ohk 2008-12-05 10:16 that seems to be the problem 2008-12-05 10:16 i did touch hello 2008-12-05 10:16 then cat >> hello 2008-12-05 10:16 this seems to need truncate 2008-12-05 10:17 it would not be needed 2008-12-05 10:18 hmm 2008-12-05 10:18 vi hello and :wq needs truncate 2008-12-05 10:18 rm doesnt work 2008-12-05 10:18 yes 2008-12-05 10:18 rm needs unlink... 2008-12-05 10:19 yes, unlink or rmdir 2008-12-05 10:23 hirofumi, which op u working on? 
2008-12-05 10:23 current one is symlink 2008-12-05 10:24 -!- pranihome(~bobby@122.162.67.26) has joined #tux3 2008-12-05 10:24 btw, are you using the copy of tux3/user/kernel/*? 2008-12-05 10:24 or just using git tree? 2008-12-05 10:24 i using hg copy 2008-12-05 10:25 ok 2008-12-05 10:25 which op u working on? 2008-12-05 10:25 current one is symlink 2008-12-05 10:25 ohk 2008-12-05 10:25 u knw of any easy ones? 2008-12-05 10:25 ;) 2008-12-05 10:25 i want to give it a go 2008-12-05 10:26 it depends on which is you know 2008-12-05 10:27 hmm 2008-12-05 10:28 well, rename would be most complex 2008-12-05 10:29 it does create/delete/rename 2008-12-05 10:29 hmm 2008-12-05 10:29 mknod? 2008-12-05 10:30 it needs to implement device attributes 2008-12-05 10:31 well, unlink and mkdir would be easiler than others 2008-12-05 10:31 easier 2008-12-05 10:31 ok 2008-12-05 10:31 ill try mkdir then 2008-12-05 10:31 since u are doing unlink.. 2008-12-05 10:32 or u doin symlinkk 2008-12-05 10:32 yes, symlink for now 2008-12-05 10:32 ill see lxr for the correspondin ext3 implementation 2008-12-05 10:32 ok then 2008-12-05 10:32 c u soon 2008-12-05 10:32 ok 2008-12-05 10:37 -!- pranihome(~bobby@122.162.67.26) has joined #tux3 2008-12-05 10:50 -!- pranihome(~bobby@122.162.67.26) has joined #tux3 2008-12-05 11:03 -!- pranihome(~bobby@122.162.67.26) has joined #tux3 2008-12-05 11:10 -!- pranihome(~bobby@122.162.67.26) has joined #tux3 2008-12-05 11:23 -!- pranihome(~bobby@122.162.67.26) has joined #tux3 2008-12-05 11:36 -!- pranihome(~bobby@122.162.67.26) has joined #tux3 2008-12-05 12:17 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-12-05 12:17 http://www.informationweek.com/news/software/linux/showArticle.jhtml?articleID=212100714&pgno=3&queryText=&isPrev= 2008-12-05 12:23 http://www.pro-linux.de/news/2008/13527.html 2008-12-05 12:24 Oof, no pressure. 
2008-12-05 12:24 Expectations are starting to grow :P 2008-12-05 12:31 -!- bobby(~bobby@122.162.67.26) has joined #tux3 2008-12-05 12:38 hirofumi, here? 2008-12-05 12:38 anyone here? 2008-12-05 12:41 hi pranihome 2008-12-05 12:42 hey flips.. 2008-12-05 12:42 trying out mkdir... 2008-12-05 12:42 :) 2008-12-05 12:43 :) 2008-12-05 12:43 we have S_IFDIR? 2008-12-05 12:43 for mode? 2008-12-05 12:43 yes 2008-12-05 12:44 ok, checking ext2 and tryin to do something similar here 2008-12-05 12:58 good luck 2008-12-05 12:58 any questions, I will be here 2008-12-05 13:14 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-05 13:32 bushman, MaZe: http://osnews.pl/tag/tux3/ 2008-12-05 13:32 what are they saying? :) 2008-12-05 13:32 you're 3 days late 2008-12-05 13:32 i've sent it to flips 2008-12-05 13:33 have you set up some google alters for tux3 or something? 2008-12-05 13:33 i was busy freezing my ass off in chicago, missed the memo :) 2008-12-05 13:33 drinking Zyweic Porter, so ha! 2008-12-05 13:34 er Zywiec even 2008-12-05 13:34 ACTION is jealous 2008-12-05 13:38 -!- ajonat(~ajonat@190.48.107.136) has joined #tux3 2008-12-05 13:41 -!- bobby(~bobby@122.162.69.68) has joined #tux3 2008-12-05 14:19 we should take a look at smp locking around now 2008-12-05 14:20 ext2 has a toall of 12 spinlocks 2008-12-05 14:21 in other words, most locking is already done for the filesystem by the vfs 2008-12-05 14:21 locking of file indexes in ext2 is rather arcane 2008-12-05 14:22 that would be the part that our per-inode btree mutex fills in for now 2008-12-05 14:24 ext2_fill_super is gross ;) 2008-12-05 14:25 reservations seem to account for the majority of ext2's spinlocks 2008-12-05 15:12 hmm my push to bitbucket is failing due to: 2008-12-05 15:12 abort: push creates new remote branches! 2008-12-05 15:13 flips: did you do anything weird to the repo recently? 
2008-12-05 15:21 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-12-05 15:22 yes 2008-12-05 15:22 probably broke your mirror badly 2008-12-05 15:23 we had an "event" yesterday 2008-12-05 16:07 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-05 18:31 what was the event? 2008-12-05 18:40 shapor, it was a funny merge that got into a hard-to-recover situation on the public repo, and I recloned from the repaired private 2008-12-05 18:40 probably I overreacted a little. What happened to your mirror? 2008-12-05 18:40 can you reclone? 2008-12-05 18:41 hm probably 2008-12-05 18:41 i think recloning is... always overreacting? 2008-12-05 18:41 I'll be impressed if you can pull your way out of this one 2008-12-05 18:43 i think i did 2008-12-05 18:43 commit something 2008-12-05 18:44 certainly there something to commit 17 hrs after your last commit 2008-12-05 18:44 ;) 2008-12-05 18:51 flips: wow tux3 in uml works! 2008-12-05 18:51 better than fuse 2008-12-05 18:52 ACTION just got his dev environment ready for the weekend :) 2008-12-05 19:12 :) 2008-12-05 19:13 I just commit something to my private git, does that count? 2008-12-05 20:13 flips: nope, no good :) 2008-12-05 20:14 well there's a new commit just for you 2008-12-05 20:14 sweet http://www.bitbucket.org/shapor/tux3/ 2008-12-05 20:14 it updated no problem :) 2008-12-05 20:15 :) 2008-12-05 20:15 no clone goofiness necessary 2008-12-05 20:15 all i had to do was -f orce one push 2008-12-05 20:15 I m n0t as l33t 2008-12-05 20:16 my oneliner cronjob is working pretty well 2008-12-05 20:16 */5 * * * * cd /home/shapor/src/tuxxx && hg pull >/dev/null && hg push ssh://hg@bitbucket.org/shapor/tux3/ >/dev/null 2008-12-05 20:16 isn't there a place for one line bash hacks on the net? 2008-12-05 20:16 probably? 2008-12-05 20:17 i could post the original bashgal oneliner there 2008-12-05 20:18 when is bashboy coming? 
2008-12-05 20:19 bashguy 2008-12-05 20:23 just got a little earthquake 2008-12-05 20:23 seemed like... um 4? 2008-12-05 20:23 3.something? 2008-12-05 20:36 weird, i didn't feel anything here 2008-12-05 20:37 5.5 Friday, December 05, 2008 at 08:18:42 PM at epicenter 2008-12-05 20:37 (116 miles) ENE (63°) from Los Angeles 2008-12-05 20:38 could have been that 2008-12-05 20:43 bummer i never feel them here 2008-12-05 21:03 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-05 21:03 you want to? 2008-12-05 21:31 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-12-05 21:43 yeah, sure, why not 2008-12-05 21:48 hi 2008-12-05 21:49 http://tux3.org/ seems to be broken 2008-12-05 21:50 ah, sorry 2008-12-05 21:51 refresh cache fixed 2008-12-05 22:01 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-05 22:40 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-05 23:07 -!- bobby(~bobby@122.162.68.8) has joined #tux3 2008-12-06 01:26 deferred unlink and create are both kind of working now 2008-12-06 01:27 with glitches: a create followed by an unlink with no syncl leaves an extra count on the dentry 2008-12-06 01:28 i.e.: create, sync works; unlink, sync works; create, unlink, sync goes Boom on the extra dentry count 2008-12-06 01:29 thats good news :) 2008-12-06 01:29 i mean other than the boom thing... 2008-12-06 01:29 it's moving along ;) 2008-12-06 01:29 about directories... 2008-12-06 01:29 are we goin to use the same semantics as ext2? 2008-12-06 01:29 which semantics? 2008-12-06 01:30 directory in ext2 store the file names together with inode numbers 2008-12-06 01:30 you mean like "directories never shrink" ? 2008-12-06 01:30 in the same file... 2008-12-06 01:30 yes 2008-12-06 01:30 that's a good thing 2008-12-06 01:31 what alternative were you thinking of? 2008-12-06 01:31 we need a new type like ext2_dir_entry_2 then? 
2008-12-06 01:31 we have tux_dirent 2008-12-06 01:31 it's almost like ext2_dir_entry_2 except its big endian 2008-12-06 01:31 hmm, was going through ext2 implementation 2008-12-06 01:31 ok 2008-12-06 01:32 we're only keeping ext2-like directories until we have a proper indexed directory to replace it with 2008-12-06 01:32 tux_make_empty is the only thing to implement... 2008-12-06 01:32 yes 2008-12-06 01:32 if we even need it 2008-12-06 01:33 I think an empty file is a pretty good empty directory 2008-12-06 01:33 yeah, all the create code + a call to make_empty shoudl create our directory 2008-12-06 01:33 we should try it, and then see if the lack of explicit "." and ".." entries actually hurts anything 2008-12-06 01:33 we might not need make_empty 2008-12-06 01:34 well 2008-12-06 01:34 it would not hurt 2008-12-06 01:34 easy to hijack from ext2, just like I did all the other dir.c functions 2008-12-06 01:34 hmm... "." and ".." are not necessary? 2008-12-06 01:34 I think they aren't 2008-12-06 01:34 hirofumi worried about some corner cases 2008-12-06 01:34 we have to see 2008-12-06 01:35 stupid question but, how do u get to the parent directory when u are in a particular dir? 2008-12-06 01:35 we can try it with make_empty, get it working, then remove it and see if it still works 2008-12-06 01:36 dentry->d_parent 2008-12-06 01:36 why do you need to? 2008-12-06 01:37 some applications specifically look for ".." to get to parent dir... 2008-12-06 01:37 we might work around using d_parent then 2008-12-06 01:37 oh, the vfs parses that 2008-12-06 01:37 and does it 2008-12-06 01:37 hmm, ok :) 2008-12-06 01:37 see namei.c -> path functions 2008-12-06 01:38 there is an argument that actually storing . and .. helps fsck 2008-12-06 01:38 anyway, go ahead and copy make_empty from ext2 2008-12-06 01:38 okies... 
2008-12-06 01:51 ok, almost all op seems to work 2008-12-06 01:52 rest stuff would be mkdir, rmdir, mknod, and rename 2008-12-06 01:55 hey flips 2008-12-06 01:56 :) 2008-12-06 01:57 hirofumi, did you try xattr? 2008-12-06 01:57 ah 2008-12-06 01:57 ok, makes sense 2008-12-06 01:57 I'm forgetting about it 2008-12-06 01:57 data may be trying 2008-12-06 01:57 I think I have a marginal preference for adding loff_t *size to the parameter lists of dir functions 2008-12-06 01:58 and loff_t size when the size doesn't change 2008-12-06 02:00 i_size_read()/i_size_write() can't use 2008-12-06 02:00 do those matter on directories? 2008-12-06 02:00 I thought they're only to take care of parallel truncate 2008-12-06 02:00 directory wouldn't be needed 2008-12-06 02:01 ext2 doesn't use them 2008-12-06 02:01 I'm not sure about atom 2008-12-06 02:01 atom is ok, it's not truncated either 2008-12-06 02:01 expand and read has race 2008-12-06 02:01 race on size? 2008-12-06 02:02 ah, not protected by mutex 2008-12-06 02:02 yes 2008-12-06 02:02 we need some form of locking anyway 2008-12-06 02:02 maybe, on atom 2008-12-06 02:02 for the atom that 2008-12-06 02:02 for the atom table 2008-12-06 02:02 what should it be? 2008-12-06 02:03 there's no locking on getxattr at all, right? 
2008-12-06 02:03 if we have lock like ->i_mutex, I think it's ok 2008-12-06 02:03 it's fine to start 2008-12-06 02:03 maybe ->i_mutex 2008-12-06 02:03 and that will take care of the race on size 2008-12-06 02:03 right 2008-12-06 02:03 why not :) 2008-12-06 02:03 we have an inode 2008-12-06 02:04 same for allocations 2008-12-06 02:04 the bitmap i_mutex 2008-12-06 02:04 it will be a contention point we can fix later 2008-12-06 02:04 yes 2008-12-06 02:05 I'm considering whether I should add another dentry state bit to keep track of whether a dentry has a deferred op waiting, and therefore an extra reference that has to be dropped if the dentry is created then deleted before sync 2008-12-06 02:05 that might be the only case it's needed 2008-12-06 02:05 but it might be a good debug check too 2008-12-06 02:06 I'm not sure 2008-12-06 02:06 me neither, I will keep hacking on it 2008-12-06 02:06 it's getting there 2008-12-06 02:07 i see 2008-12-06 02:07 getting close to a pull? 2008-12-06 02:08 it would be tomorrow 2008-12-06 02:08 fine 2008-12-06 02:08 well, some related work would apper 2008-12-06 02:08 e.g. now, we don't store ->i_nlink 2008-12-06 02:08 whoops 2008-12-06 02:09 forgot :) 2008-12-06 02:09 it would be just add LINK_COUNT_ATTR 2008-12-06 02:09 yes 2008-12-06 02:09 easy 2008-12-06 02:09 but, we don't want to add it to all inodes 2008-12-06 02:09 directories don't need it 2008-12-06 02:09 what others? 
2008-12-06 02:09 internal inodes 2008-12-06 02:10 sure 2008-12-06 02:10 we can just say, any nodes with 0 links don't have it 2008-12-06 02:10 0 links is deleted inode 2008-12-06 02:11 an orphan 2008-12-06 02:11 a real deleted inode will have a zero size inode 2008-12-06 02:11 yes 2008-12-06 02:12 but, it will do on last close 2008-12-06 02:12 I don't think we will ever store it then 2008-12-06 02:12 unless it is an orphan 2008-12-06 02:13 anyway 2008-12-06 02:13 well, there are some related work 2008-12-06 02:13 marking orphans by having no links attribute is reasonable 2008-12-06 02:13 probably 2008-12-06 02:13 we will also see them in the log 2008-12-06 02:14 maybe, we can various optimization 2008-12-06 02:14 how many bytes is an inode now? 2008-12-06 02:14 no change 2008-12-06 02:14 I forget what it was ;) 2008-12-06 02:14 we have to add nlinks 2008-12-06 02:14 owner - 12 2008-12-06 02:15 ctime_size - 14 2008-12-06 02:15 mtime - 6 2008-12-06 02:15 btree - 8 2008-12-06 02:15 link - 4 2008-12-06 02:15 and xattr 2008-12-06 02:16 44 bytes 2008-12-06 02:16 resize inum 0xe at 0xa0 from 28 to 3a 2008-12-06 02:16 (from make inodetest) 2008-12-06 02:16 make_inode() doesn't store some attributes 2008-12-06 02:17 ah, that line I pasted is for storing an xattr in it 2008-12-06 02:17 so yes, 44 bytes 2008-12-06 02:17 and the nlinks attr will be... 6 more bytes? 2008-12-06 02:18 um 2008-12-06 02:18 link is nlink? 2008-12-06 02:18 yes 2008-12-06 02:18 you said 4 above 2008-12-06 02:18 yes 2008-12-06 02:18 6 more bytes? 2008-12-06 02:19 no, I was speaking nonsense 2008-12-06 02:19 but a 4 byte link attribute only allows 2**16 links 2008-12-06 02:19 oh 2008-12-06 02:20 so might need to have a variant attribute with 6 or 8 bytes 2008-12-06 02:20 where is limit of 2**16? 2008-12-06 02:20 2**32? 2008-12-06 02:21 ah, it is u32 for link count 2008-12-06 02:21 yes 2008-12-06 02:21 well, there are two extra bytes for each attribute 2008-12-06 02:22 we can support 2**32 links? 
2008-12-06 02:22 yes, currently 2008-12-06 02:22 ok 2008-12-06 02:23 so let me say something more sensible: it might be worth having a variant links attribute that goes up to 255 or 2**16 2008-12-06 02:23 just to save 2 or 3 bytes per inode 2008-12-06 02:23 we can just use decode16 for ->i_nlink? 2008-12-06 02:24 we could 2008-12-06 02:24 btw, ext2 max seems 32000 2008-12-06 02:24 but then we have to add a higher range variant 2008-12-06 02:24 sure 2008-12-06 02:24 in _theory_ we could have 2**48 links 2008-12-06 02:24 oh 2008-12-06 02:25 limiting it to 2**32 is probably ok :) 2008-12-06 02:25 it is easy, because ->i_nlink is unsigned int 2008-12-06 02:25 anyway, I think our total for an inode is 44 + 2 * 6 = 56 bytes 2008-12-06 02:26 2*6 is xattr? 2008-12-06 02:26 that is the two byte header on each attribute 2008-12-06 02:26 ah 2008-12-06 02:26 there is a code+version = 16 bits 2008-12-06 02:26 kind+version I mean 2008-12-06 02:27 yes 2008-12-06 02:27 now what about blocks count 2008-12-06 02:28 is blocks count posix? 
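The inode size arithmetic above (44 payload bytes plus a two-byte kind+version header per attribute) can be checked with a small sketch. This is only an illustration of the numbers quoted in the log, not tux3's actual encoding code: the struct, the function names, and the reading that the sixth header belongs to the xattr attribute are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Per the log: each attribute carries a 16-bit kind+version header. */
#define ATTR_HEADER_BYTES 2

/* Hypothetical table type; labels and sizes are the ones quoted in the chat. */
struct attr_size {
	const char *name;   /* label used in the log */
	unsigned payload;   /* payload bytes, excluding the 2-byte header */
};

static const struct attr_size listed_attrs[] = {
	{ "owner", 12 }, { "ctime_size", 14 }, { "mtime", 6 },
	{ "btree", 8 }, { "link", 4 },
};

/* Sum the payload bytes of a table of attributes. */
static unsigned payload_total(const struct attr_size *attrs, size_t count)
{
	unsigned sum = 0;
	for (size_t i = 0; i < count; i++)
		sum += attrs[i].payload;
	return sum;
}

/* Total inode bytes: payload plus one header per stored attribute. */
static unsigned inode_bytes(unsigned payload, unsigned header_count)
{
	return payload + ATTR_HEADER_BYTES * header_count;
}
```

With the five listed payloads the sum is 44, and six headers bring the total to 56, matching "44 + 2 * 6 = 56". Dropping an attribute that most inodes do not need saves its payload and its header, which is the reasoning behind the shorter links variant.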
2008-12-06 02:29 not sure, I'm looking kernel source 2008-12-06 02:29 i_blocks is used for stat->blocks 2008-12-06 02:30 and would be for quota 2008-12-06 02:30 it gets tricky with versions 2008-12-06 02:31 it's probably better not to use it for quota 2008-12-06 02:31 keep a separate quota table 2008-12-06 02:31 if we have a blocks count attribute in the inode, I think it will be for the total of all data blocks in the inode, for all versions 2008-12-06 02:32 and then I wonder what it is really used for 2008-12-06 02:32 I think it's block count of current ->i_size 2008-12-06 02:33 and we can just calc it from ->i_size 2008-12-06 02:33 I guess 2008-12-06 02:33 it normally does not include spare regions of a file 2008-12-06 02:33 ah 2008-12-06 02:33 but we could do that 2008-12-06 02:34 and probably nobody would mind 2008-12-06 02:34 if somebody does mind, then we fix it ;) 2008-12-06 02:34 ah 2008-12-06 02:34 it may be used for tar, cpio or something 2008-12-06 02:35 those may ask - this inode is sparse? 2008-12-06 02:35 what happens if you tar a sparse file? 
2008-12-06 02:35 nothing good I think 2008-12-06 02:35 it has option for sparse, iirc 2008-12-06 02:36 anyway, I think we can just leave it, and use your suggestion for stat 2008-12-06 02:36 ok 2008-12-06 02:36 and we wait for people to tell us if it is a problem 2008-12-06 02:36 yes 2008-12-06 02:37 we will eventually provide a nice utility for showing blocks usage across versions like we do in zumastor 2008-12-06 02:37 with versioning filesystems that share blocks, the blocks count has even less meaning 2008-12-06 02:38 i see 2008-12-06 02:39 in zumastor we provide a table that says, how many blocks a snapshot owns uniquely, how many are shared with one other snapshot, with two other snapshots, and so on 2008-12-06 02:40 this is used by the admin to decide which snapshot to delete, to recover space 2008-12-06 02:40 I remembered one thing 2008-12-06 02:41 that's good 2008-12-06 02:41 good df 2008-12-06 02:42 if i_blocks == 0, some commands think it is an all-hole file 2008-12-06 02:42 i.e. cp -a does it, iirc 2008-12-06 02:42 e.g. 2008-12-06 02:43 so if it is always (i_size + blocksize - 1) >> blockbits, does anything break? 
2008-12-06 02:43 maybe 2008-12-06 02:43 http://lgl.epfl.ch/teaching/case_tools/doc/tar-1.11.8/tar_85.html 2008-12-06 02:43 7.3.3 Archiving Sparse Files 2008-12-06 02:46 ah, and du will confuse 2008-12-06 02:46 yes 2008-12-06 02:46 so we have to fix it eventually 2008-12-06 02:46 it goes on the to.do list, not very important right now 2008-12-06 02:46 yes 2008-12-06 02:47 ah wait 2008-12-06 02:47 we can also count up the blocks ina stat ;) 2008-12-06 02:47 probalby not a good idea 2008-12-06 02:47 btw, cp is checking i_size/block < st_blocks 2008-12-06 02:47 yes 2008-12-06 02:47 :) 2008-12-06 02:48 > st_blocks 2008-12-06 02:48 oh 2008-12-06 02:48 right 2008-12-06 02:48 if (x->sparse_mode == SPARSE_AUTO && S_ISREG (src_open_sb.st_mode) 2008-12-06 02:48 && ST_NBLOCKS (src_open_sb) < src_open_sb.st_size / ST_NBLOCKSIZE) 2008-12-06 02:48 make_holes = true; 2008-12-06 02:49 how did you find that? 2008-12-06 02:49 I have coreutils git locally 2008-12-06 02:49 and grep -r st_blocks * 2008-12-06 02:49 so... cp creates a sparse file when you cp a sparse file?? 2008-12-06 02:50 that's a cool way to find stuff ;) 2008-12-06 02:50 yes, if I use --sparse=xxx option 2008-12-06 02:51 ah 2008-12-06 02:51 we store the blocks attribute only if the file has a hole 2008-12-06 02:51 :) 2008-12-06 02:51 duh 2008-12-06 02:52 :) 2008-12-06 02:52 so most files won't have it 2008-12-06 02:52 yes 2008-12-06 02:52 that is a nice use of variable attributes 2008-12-06 02:53 i see 2008-12-06 02:55 ok, I better sleep 2008-12-06 02:56 ok, good night 2008-12-06 02:56 good night 2008-12-06 02:56 can I say sayonara? 2008-12-06 02:56 sayonara is good bye 2008-12-06 02:56 good night is oyasumi :) 2008-12-06 02:56 and good night is? 2008-12-06 02:56 ok 2008-12-06 02:57 oyasumi :) 2008-12-06 02:57 oyasumi :) 2008-12-06 03:41 mkdir works... 2008-12-06 03:41 :) 2008-12-06 03:45 prani, congrats :) 2008-12-06 03:45 ACTION didn't quite get to sleep 2008-12-06 03:45 now... 
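The block count discussion above can be sketched in userspace C. The first function is the proposed fallback, (i_size + blocksize - 1) >> blockbits; the last mirrors the comparison in the quoted cp snippet. All function names are hypothetical, and the 512-byte unit passed to the sparse check is an assumption standing in for ST_NBLOCKSIZE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Blocks needed to hold `size` bytes when the block size is 1 << blockbits.
 * Rounds up; does not guard against overflow for sizes near UINT64_MAX. */
static uint64_t blocks_from_size(uint64_t size, unsigned blockbits)
{
	assert(blockbits < 64);
	uint64_t blocksize = UINT64_C(1) << blockbits;
	return (size + blocksize - 1) >> blockbits;
}

/* Convert filesystem blocks to 512-byte units, the usual st_blocks unit. */
static uint64_t to_sectors(uint64_t fs_blocks, unsigned blockbits)
{
	assert(blockbits >= 9);
	return fs_blocks << (blockbits - 9);
}

/* Same shape as the quoted cp test: the file looks sparse when it has
 * fewer allocated units than its size implies. */
static bool looks_sparse(uint64_t st_nblocks, uint64_t st_size, uint64_t unit)
{
	return st_nblocks < st_size / unit;
}
```

A block count derived from the size alone never trips the sparse check, so cp would copy holes as data and du would overreport, while a stored count of 0 on a non-empty file always trips it. That is consistent with the conclusion in the log: store a blocks attribute only for files that actually have holes.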
2008-12-06 03:46 thanks :) 2008-12-06 03:47 -!- camby(~root@58.100.248.215) has joined #tux3 2008-12-06 04:20 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-12-06 05:41 -!- rmull(~rmull@acsx01.bu.edu) has joined #tux3 2008-12-06 05:46 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2008-12-06 05:46 well, mkdir and create are totally same... except for the flag we pass... 2008-12-06 05:46 im not sure its correct.. but it works (tm) 2008-12-06 05:46 :) 2008-12-06 05:52 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-12-06 05:55 -!- camby(~root@58.100.248.215) has joined #tux3 2008-12-06 07:13 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-06 07:18 -!- bobby(~bobby@122.162.71.202) has joined #tux3 2008-12-06 07:26 I kind of have an OT question, but you guys might know 2008-12-06 07:27 How come, during a cp -rv, it goes alphabetically? It doesn't seem like that would be the fastest way to do things 2008-12-06 08:29 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-12-06 08:29 -!- Solaris(~satan@a83-132-82-62.cpe.netcabo.pt) has joined #tux3 2008-12-06 08:29 -!- ollebull(~olle@ip6-43.bon.riksnet.se) has joined #tux3 2008-12-06 08:29 -!- Man_of_Wax(~wax@gualtiero.cs.unibo.it) has joined #tux3 2008-12-06 08:29 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-12-06 08:29 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2008-12-06 11:08 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-06 11:27 rmull, you are correct. I can't think of any reason why cp should do that. 2008-12-06 11:28 but the expense of sorting the dir list is unlikely to be noticed in comparison to the copies 2008-12-06 11:31 cp -rv shouldn't do it 2008-12-06 11:31 maybe, you are doing cp -rv *? 2008-12-06 11:31 shell's glob may does 2008-12-06 11:41 hey all 2008-12-06 11:42 flipzzz, mkdir works? 2008-12-06 11:42 prnai, it's close 2008-12-06 11:42 any cases i need to handle? 
2008-12-06 11:42 yes, I'm writing a response to your post 2008-12-06 11:43 links count, which we haven't used at all yet 2008-12-06 11:43 is different for mkdir 2008-12-06 11:43 ok.. 2008-12-06 11:59 hirofumi, you are right, cp -rv . foo/ did an unordered copy 2008-12-06 13:08 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 13:14 -!- Solaris(~satan@a83-132-82-62.cpe.netcabo.pt) has left #tux3 2008-12-06 13:15 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 13:33 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 13:45 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 13:52 ok, now I have: rm foo; touch foo; umount /mnt; KABOOM 2008-12-06 13:53 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 13:53 next bug 2008-12-06 13:53 the "touch" leaves an extra dentry count on foo 2008-12-06 13:53 so it needs to know not to do that 2008-12-06 13:55 the state before the touch is: STALE | HIDDEN 2008-12-06 13:56 HIDDEN because the delete was deferred and STALE because there is still a dirent on the filesystem 2008-12-06 13:58 the state after the touch is: STALE, which tells the deferred sync that the old dirent must be removed and a new dirent created (or do this in one step if the fs is capable of it) 2008-12-06 14:01 HIDDEN in ->create()? 2008-12-06 14:01 so... the deferred version of ext2_add_nondir should not do its own dget if there is already a deferred rm in flight... which is indicated by the state STALE | HIDDEN 2008-12-06 14:01 the HIDDEN flag is set in d_delete->d_hide 2008-12-06 14:02 yes, and vfs passed the HIDDEN dentry to ->create()? 
2008-12-06 14:02 yes 2008-12-06 14:02 and ext2_add_nondir handles this 2008-12-06 14:02 now, dentry->d_inode may still live 2008-12-06 14:02 yes, it may have an inode or not 2008-12-06 14:02 if it has an inode, it must be HIDDEN 2008-12-06 14:03 um..., we may have to allocate new dentry 2008-12-06 14:04 if HIDDEN detnry is still opened by user 2008-12-06 14:04 yes, in that case the new dentry will not be HIDDEN 2008-12-06 14:04 right 2008-12-06 14:04 we will do a __d_drop in that case 2008-12-06 14:04 I have not done that yet 2008-12-06 14:04 just trying to get all combinations of rm and create working now 2008-12-06 14:05 then... handle rename, and it is done I think 2008-12-06 14:05 i see 2008-12-06 14:05 for this problem, I think the fix is: 2008-12-06 14:05 static int ext2_add_nondir(struct dentry *dentry, struct inode *inode) 2008-12-06 14:05 { 2008-12-06 14:05 show_dentry("defer create", dentry); 2008-12-06 14:05 if (dentry->d_flags & DCACHE_HIDDEN) 2008-12-06 14:05 dget(dentry); 2008-12-06 14:05 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 14:05 um 2008-12-06 14:05 if it's not opened, we may want to reuse it 2008-12-06 14:05 if (!(dentry->d_flags & DCACHE_HIDDEN)) 2008-12-06 14:06 yes, we will reuse the dentry normally 2008-12-06 14:06 unless it is opened 2008-12-06 14:06 with the current code 2008-12-06 14:06 it have to handle ->d_inode issue 2008-12-06 14:07 yes, the solution in that case will be __d_drop 2008-12-06 14:07 __d_drop is for open case? 
2008-12-06 14:07 yes 2008-12-06 14:08 in the case we have d_negative plus non-null inode 2008-12-06 14:08 yes 2008-12-06 14:08 ah, no 2008-12-06 14:09 ->create() can't allocate new dentry 2008-12-06 14:09 it's job of ->lookup() 2008-12-06 14:09 it has to allocate a new dentry if it does a __d_drop 2008-12-06 14:09 create does 2008-12-06 14:10 the rm; touch; umount case works now :) 2008-12-06 14:10 ok, time to think about the __d_drop 2008-12-06 14:10 this is all very confusing :) 2008-12-06 14:11 how did ->create return detnry? 2008-12-06 14:11 in the current patch? 2008-12-06 14:11 yes 2008-12-06 14:11 vfs supplies it 2008-12-06 14:12 vfs thinks it's a normal negative dentry 2008-12-06 14:12 yes, but passed dentry may be still using 2008-12-06 14:12 it might be a hidden, in-use dentry 2008-12-06 14:12 right 2008-12-06 14:12 so we have to __d_drop it there (unhash) 2008-12-06 14:12 in ->create 2008-12-06 14:13 and get a new negative dentry 2008-12-06 14:13 this has to be done under the dcache_lock and d_lock I think 2008-12-06 14:14 the result of ->create is new negative dentry? 2008-12-06 14:14 positive dentry 2008-12-06 14:14 and it is only a new dentry if it has to unhash an in-use one 2008-12-06 14:15 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 14:15 um... 2008-12-06 14:17 new dentry is set to nd.path.dentry? 2008-12-06 14:17 yes, it comes from there 2008-12-06 14:18 so... things will get confused if we don't update the nd.path? 2008-12-06 14:18 ACTION reads 2008-12-06 14:19 I'm not sure what current ->create in patch does 2008-12-06 14:19 I will repost in an hour or so 2008-12-06 14:19 ok 2008-12-06 14:20 ext2_create receives the *nd, and does not do anything with it 2008-12-06 14:20 I wonder what the nd parameter is for, NFS? 
2008-12-06 14:20 well, this is for record before I forget this issue 2008-12-06 14:21 mainly, it's for open intent, so yes, basically NFS 2008-12-06 14:23 __open_namei_create is strange, it has both a *path and a nd->path 2008-12-06 14:23 weird 2008-12-06 14:24 after the ->create it does: 2008-12-06 14:24 dput(nd->path.dentry); 2008-12-06 14:24 nd->path.dentry = path->dentry; 2008-12-06 14:24 without comments 2008-12-06 14:24 sick 2008-12-06 14:25 one for the parent in current namei, one for open() itself 2008-12-06 14:26 I'll make a note to submit a comment patch like that 2008-12-06 14:30 I'm thinking which is handling if last entry is symlink 2008-12-06 14:31 might be 2008-12-06 14:31 I remember trond hit a NFS bug here 2008-12-06 14:31 and symlink refers non exists entry 2008-12-06 14:31 and I had no idea what his patch was about 2008-12-06 14:31 at that time 2008-12-06 14:31 it's very confusing code 2008-12-06 14:32 about two path? 2008-12-06 14:32 yes 2008-12-06 14:32 ok 2008-12-06 14:32 I'm beginning to get it 2008-12-06 14:33 a couple of simple comments in that code would help a lot 2008-12-06 14:34 for __open_namei_create(), I think it can be just dentry 2008-12-06 14:34 that would be nice 2008-12-06 14:35 maybe, with trend of struct path, it replaced with path 2008-12-06 14:35 like I did with btree path, which now includes the leaf 2008-12-06 14:36 I think it is overkill for __open_namei_create() 2008-12-06 14:36 maybe a cleanup patch some time? 2008-12-06 14:37 yes 2008-12-06 14:37 mnt and dentry pair was replaced with path, iirc 2008-12-06 14:38 and some namei stuff takes struct path 2008-12-06 14:38 because linux has bind mount 2008-12-06 14:39 a few comments in the code about bind would be nice too 2008-12-06 14:39 ah, not only for bind mount 2008-12-06 14:40 walk on mount point 2008-12-06 14:40 which used to be handled by the nd struct? 
2008-12-06 14:40 __follow_mount() changes 2008-12-06 14:42 well, back to deffered nameops 2008-12-06 14:42 deferred 2008-12-06 14:42 yes 2008-12-06 14:44 more cases are working each time, and the code gets more logical each time, but it is slow progress 2008-12-06 14:46 yes 2008-12-06 14:47 it's enough hard to do 2008-12-06 14:47 at this point I am encouraged that it can work reliably and efficiently 2008-12-06 14:49 good 2008-12-06 14:50 I'm not sure about opened dentry and reuse dentry 2008-12-06 14:50 we can't reuse an opened dentry 2008-12-06 14:51 yes 2008-12-06 14:51 but we can make a copy of it, and unhash the open one 2008-12-06 14:51 I think 2008-12-06 14:51 yes, basically 2008-12-06 14:52 it has to be done atomically, so nobody can do a real lookup in between 2008-12-06 14:52 reuse dentry means non opened dentry 2008-12-06 14:52 yes 2008-12-06 14:53 but, it may have HIDDEN 2008-12-06 14:53 yes 2008-12-06 14:53 there are two HIDDEN case 2008-12-06 14:53 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 14:53 in ->create or ->lookup 2008-12-06 14:54 yes 2008-12-06 14:54 I think ->create can't handle it 2008-12-06 14:54 because it can't return new allocated dentry 2008-12-06 14:55 ->lookup for ->create already checked it is negative dentry 2008-12-06 14:56 vfs might have to know about it 2008-12-06 14:56 I haven't looked at how to handle this case deeply yet 2008-12-06 14:56 still fixing bugs in the rm + touch combinations 2008-12-06 14:56 ok 2008-12-06 14:57 just hit a new one ;) 2008-12-06 14:57 it is current my concern 2008-12-06 14:57 well, about vfs 2008-12-06 14:58 let's talk again after the new patch is up? 
2008-12-06 14:58 and after my skate ;) 2008-12-06 14:58 yes 2008-12-06 14:58 and I'll sleep :) 2008-12-06 14:58 :) 2008-12-06 15:00 oyasumi 2008-12-06 15:01 ...if it was still night in Japan 2008-12-06 15:02 oyasumi, now 8:00 though :) 2008-12-06 15:03 that used to be my normal time for sleeping when I lived in Berlin 2008-12-06 15:03 that is when the clubs close 2008-12-06 15:04 oh 2008-12-06 15:04 some clubs actually open at that time, for people who still need more party 2008-12-06 15:04 at least one famous kernel hacker still lives like that, and is very productive at the same time 2008-12-06 15:06 oh, not healthy :) 2008-12-06 15:06 and he's very healthy :) 2008-12-06 15:06 it must be the german beer 2008-12-06 15:07 :) 2008-12-06 15:27 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 16:15 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 17:16 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 17:31 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 18:04 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 18:38 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 18:55 http://neotactics.com/blog/technology/zfs-to-go-gpl/ :) 2008-12-06 18:59 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 19:03 http://en.wikipedia.org/wiki/Ceph <- interesting 2008-12-06 19:43 flips: 2008-12-06 19:43 hey 2008-12-06 19:43 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 19:52 -!- ajonat(~ajonat@190.48.122.78) has joined #tux3 2008-12-06 19:58 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 20:20 hi bh 2008-12-06 20:25 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 20:51 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 21:12 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 21:25 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 21:35 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 21:45 -!- prani(~bobby@122.162.71.202) 
has joined #tux3 2008-12-06 21:52 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 21:57 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 22:18 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-06 22:32 -!- prani(~bobby@122.162.71.202) has joined #tux3 2008-12-07 00:50 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-07 01:37 -!- pranith(~bobby@122.162.71.202) has joined #tux3 2008-12-07 02:16 hi 2008-12-07 02:16 flips, there? 2008-12-07 02:16 hi 2008-12-07 02:16 http://userweb.kernel.org/~hirofumi/tux3.png 2008-12-07 02:16 it is graph after create/unlink 2008-12-07 02:17 it is beautiful 2008-12-07 02:17 is it right? 2008-12-07 02:17 ileaf is leaved as empty 2008-12-07 02:17 ahah, so a bug 2008-12-07 02:17 ok 2008-12-07 02:18 I was not sure whether we leave it or not 2008-12-07 02:18 well 2008-12-07 02:18 yes, we need to destory the btree completely 2008-12-07 02:18 and I never did that 2008-12-07 02:18 because zumastor never destroy's its one btree ;) 2008-12-07 02:19 i see 2008-12-07 02:19 do you want me to do that? 2008-12-07 02:19 it seems issue of purge_inum() 2008-12-07 02:19 yes 2008-12-07 02:19 not needed 2008-12-07 02:19 it should be pretty easy 2008-12-07 02:19 I or you or someone will fix later 2008-12-07 02:19 oh 2008-12-07 02:19 you are welcome to :) 2008-12-07 02:20 I am still hacking the deferred nameops 2008-12-07 02:20 so can I talk about that a bit? 
2008-12-07 02:20 ok 2008-12-07 02:21 I think you noticed last night that ->create is going to have to be able to return a different dentry than it was passed 2008-12-07 02:21 like ->lookup 2008-12-07 02:21 I don't see any way around that 2008-12-07 02:21 so I did a hack to add a dentry return to ->lookup, just changed namei.c, ext2, ext3 and a few built-in filesystems 2008-12-07 02:22 it's not too bad 2008-12-07 02:22 I don't think it is really controversial either, because it makes create more like lookup 2008-12-07 02:22 the only thing is, it changes every filesystem, but we like to do that, don't we? 2008-12-07 02:22 trivial change 2008-12-07 02:23 well, it can 2008-12-07 02:23 but it would be strange 2008-12-07 02:23 maybe, ->lookup can do it 2008-12-07 02:23 not stranger than ->lookup returning a dentry 2008-12-07 02:23 that is pretty strange 2008-12-07 02:24 the disconnected dentry tree thing from NFS 2008-12-07 02:24 so this is the deferred namespace thing from tux3 ;) 2008-12-07 02:24 why don't it in ->lookup? 2008-12-07 02:25 maybe 2008-12-07 02:25 well 2008-12-07 02:25 because ->lookup doesn't get called 2008-12-07 02:25 ah 2008-12-07 02:25 there is a negative dentry that prevents the ->lookup 2008-12-07 02:25 which is what we want 2008-12-07 02:26 the only way I can see is to do this in ->create, but if you can see another way that would be fine 2008-12-07 02:26 once we have the dentry return from ->create, the rest isn't too hard I think 2008-12-07 02:27 when we find that the dentry is hidden, but still with an inode, then we have to clone the dentry and replace it in the hash 2008-12-07 02:27 the locking for this seems to work out ok, in d_instantiate 2008-12-07 02:28 vfs_create is also going to be changed? 
2008-12-07 02:28 I already changed it ;) 2008-12-07 02:29 actually, I created a namei_create, and kept the interface the same for vfs_create, which calls namei_create with the new interface 2008-12-07 02:29 so this change was small 2008-12-07 02:29 the ->create change is the big one 2008-12-07 02:29 and it's not that big 2008-12-07 02:29 just a few lines per filesystem 2008-12-07 02:31 d_instantiate is called without a lock, so it can do a racy check to see if it should allocate a new dentry, then take the dentry lock and the d_lock, and check again 2008-12-07 02:32 this is hard to think 2008-12-07 02:33 I need a time 2008-12-07 02:33 ok 2008-12-07 02:33 I'll post the updated patch 2008-12-07 02:33 with the change to ->create, but not the new d_instantiate handling 2008-12-07 02:33 well, probably ->create() would work though 2008-12-07 02:34 maybe, to avoid this situation, we may be able to allocate negative dentry int the case 2008-12-07 02:35 in that case 2008-12-07 02:36 ah 2008-12-07 02:36 that is a possibility 2008-12-07 02:36 hmm, why didn't I think of that :) 2008-12-07 02:37 :) 2008-12-07 02:37 well, it would introduce the another issue 2008-12-07 02:37 well, in that case we will _always_ allocate a new dentry whenever a file that is open is unlinked 2008-12-07 02:37 that other issue? 2008-12-07 02:38 maybe 2008-12-07 02:38 yes 2008-12-07 02:39 we may want to add some trick in that case 2008-12-07 02:39 following my initial idea, we only have to allocate a new dentry in the rare case that somebody recreates the same name after it is unlinked while it is open 2008-12-07 02:40 that will almost never happen, and when it does, the cost is only one dentry allocation 2008-12-07 02:40 yes 2008-12-07 02:41 well, we will change all creation 2008-12-07 02:41 mknod, mkdir, rename, etc. 
2008-12-07 02:42 yes 2008-12-07 02:42 I thought I would try it just with ->create 2008-12-07 02:42 and see how it works, and how messy the patch is 2008-12-07 02:43 and I haven't even thought about rename yet 2008-12-07 02:43 I can say that the patch is getting bigger than I thought it would be 2008-12-07 02:43 rename may be very complex 2008-12-07 02:43 it might 2008-12-07 02:43 I haven't thought about it at all 2008-12-07 02:44 it might be simple too 2008-12-07 02:44 old dentry may still be opened 2008-12-07 02:44 then we have to do the same trick as for create 2008-12-07 02:44 but the locking might be messier 2008-12-07 02:45 no 2008-12-07 02:45 old dentry is still live as new name 2008-12-07 02:45 that's easy 2008-12-07 02:45 just mark it HIDDEN 2008-12-07 02:46 I was thinking about the dentry that can get clobbered 2008-12-07 02:46 move on top of 2008-12-07 02:47 old dentry should be not HIDDEN 2008-12-07 02:47 by old dentry, you mean, the name the object used to have? 2008-12-07 02:48 ah 2008-12-07 02:48 ok, the old dentry has to be moved into the new hash 2008-12-07 02:48 old dentry may be just replaced the name 2008-12-07 02:48 right 2008-12-07 02:49 and a negative dentry has to be created to fill the old place 2008-12-07 02:49 so this is a little bit of extra messiness 2008-12-07 02:49 and if the move is on top of an existing name that is open, that dentry might have to be cloned 2008-12-07 02:50 so rename is messy, but sounds doable 2008-12-07 02:50 ah, old dentry shouldn't be HIDDEN already 2008-12-07 02:50 it is original name, so it should still live in namespace 2008-12-07 02:51 well, if it was open then the dentry has to go to the new hash, just as d_move does now 2008-12-07 02:51 but we have to fill in a negative dentry where the old name was 2008-12-07 02:51 because the rename will be deferred 2008-12-07 02:52 and we don't want a lookup to find the stale name on the fs 2008-12-07 02:53 dentry state may be complex 2008-12-07 02:53 I'd like to try to 
simple it 2008-12-07 02:54 it has the new HIDDEN, BACKED and STALE states 2008-12-07 02:54 yes 2008-12-07 02:54 and HIDDEN has open and non-open 2008-12-07 02:54 the obvious way to simplify it is to _always_ use the HIDDEN flag, and don't rely on d_inode == 0 2008-12-07 02:55 BACKED and STALE bits can't be made any simpler, I think 2008-12-07 02:55 yes 2008-12-07 02:56 so, if we can separate HIDDEN-open and pure HIDDEN, it sounds simple 2008-12-07 02:58 btw, I forget why HIDDEN is hashed... 2008-12-07 02:58 because the stale name may still exist on the filesystem 2008-12-07 02:58 it has been unlinked from the dentry cache, but not removed from the dirent block 2008-12-07 02:58 ah, it works like negative dentry 2008-12-07 02:58 yes 2008-12-07 02:59 in fact d_negative does HIDDEN || !->d_inode 2008-12-07 02:59 what will happen if we add new real negative dentry, um... 2008-12-07 02:59 can't ;) 2008-12-07 03:00 whoops 2008-12-07 03:00 um... why? 2008-12-07 03:00 because the HIDDEN one is already a negative dentry 2008-12-07 03:00 cached_lookup treats it as negative 2008-12-07 03:00 ah, it will unhash instead 2008-12-07 03:01 there's is only one place in the whole vfs where an in-use dentry is unhashed 2008-12-07 03:01 d_delete 2008-12-07 03:02 maybe d_invalidate and d_move 2008-12-07 03:02 yes 2008-12-07 03:03 d_invalidate is NFS only I hope 2008-12-07 03:03 so that leaves d_move, which is coming soon 2008-12-07 03:03 in the case of d_move, it's rehashed 2008-12-07 03:04 old is rehash, new is d_drop? 
2008-12-07 03:04 not quite d_drop 2008-12-07 03:04 __d_drop(target) 2008-12-07 03:05 just dput 2008-12-07 03:06 http://lxr.linux.no/linux+v2.6.27.5/fs/dcache.c#L1672 2008-12-07 03:06 ...and __d_drop 2008-12-07 03:06 as you said ;) 2008-12-07 03:06 ah, ok :) 2008-12-07 03:08 well, somehow, I think I need a time to think this 2008-12-07 03:08 me too 2008-12-07 03:08 I'll think there is a way to simple that 2008-12-07 03:09 if there is a way 2008-12-07 03:10 btw, now we are creating inode in make_inode() 2008-12-07 03:11 but, it would be deferred? 2008-12-07 03:12 yes 2008-12-07 03:12 see how the ext2 patch works 2008-12-07 03:12 ok, thanks 2008-12-07 03:12 static int ext2_add_nondir(struct dentry *dentry, struct inode *inode) 2008-12-07 03:13 { 2008-12-07 03:13 show_dentry("defer create", dentry); 2008-12-07 03:13 if (!(dentry->d_flags & DCACHE_HIDDEN)) 2008-12-07 03:13 dget(dentry); 2008-12-07 03:13 dentry->d_op = &ext2_dentry_operations; 2008-12-07 03:13 d_instantiate(dentry, inode); 2008-12-07 03:13 show_dentry("instantiated", dentry); 2008-12-07 03:13 return 0; 2008-12-07 03:13 } 2008-12-07 03:13 this is going to be really fast :) 2008-12-07 03:13 latency of a buffer create will go way down 2008-12-07 03:13 buffered create 2008-12-07 03:14 this is delayed dirent? 2008-12-07 03:15 we have to defer the inode too? 
2008-12-07 03:16 yes 2008-12-07 03:16 let me see, how does that work 2008-12-07 03:16 I'm thinking separate inode initialization from make_inode() 2008-12-07 03:17 thanks for reminding me, I have not deferred the inode create yet 2008-12-07 03:17 I deffered the inode delete,but not the create 2008-12-07 03:18 yes 2008-12-07 03:18 userspace tux3 also allocate 2008-12-07 03:18 static void defer_drop_inode(struct inode *inode) 2008-12-07 03:18 { 2008-12-07 03:18 if (inode->i_nlink) { 2008-12-07 03:18 generic_drop_inode(inode); 2008-12-07 03:18 return; 2008-12-07 03:18 } 2008-12-07 03:18 show_inode("defer inode delete", inode); 2008-12-07 03:19 inode->i_state |= I_DIRTY; 2008-12-07 03:19 list_move(&inode->i_list, &EXT2_SB(inode->i_sb)->delete); 2008-12-07 03:19 spin_unlock(&inode_lock); 2008-12-07 03:19 } 2008-12-07 03:19 defer_create_inode will be similar 2008-12-07 03:20 maybe 2008-12-07 03:20 I will do it tomorrow, and we will see 2008-12-07 03:20 I forgot completely about it 2008-12-07 03:23 http://mailman.tux3.org/pipermail/tux3/2008-December/000424.html 2008-12-07 03:24 ok 2008-12-07 03:25 I'll see it deeply tomorrow 2008-12-07 03:25 it's oyasumi time for me 2008-12-07 03:26 oyasumi 2008-12-07 03:26 I'll send pull request to list 2008-12-07 05:17 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-12-07 08:22 -!- pgquiles(~pgquiles@139.Red-81-38-97.dynamicIP.rima-tde.net) has joined #tux3 2008-12-07 10:35 -!- pranith(~Bobby@122.162.71.202) has joined #tux3 2008-12-07 10:36 hey all 2008-12-07 10:36 ok, back to sleep 2008-12-07 10:37 oyasumi 2008-12-07 10:37 :) 2008-12-07 18:31 folks 2008-12-07 20:06 deferred recreate of an unlinked, still open file is kinda working... 
2008-12-07 20:06 instead of ending up with two copies of the offending dirent, there are three 2008-12-07 22:35 flips: ping 2008-12-07 22:36 hi 2008-12-07 22:36 whats up with dwalk_chop 2008-12-07 22:36 should be the basis for dlead_chop 2008-12-07 22:37 dleaf_chop 2008-12-07 22:37 dwalk_chop_after I think it's called 2008-12-07 22:37 yeah, so its basically already written 2008-12-07 22:37 no? 2008-12-07 22:38 maybe 2008-12-07 22:38 needs to be put together and reality checked 2008-12-07 22:38 ok 2008-12-07 22:38 it could use some tests too 2008-12-07 22:38 yes 2008-12-07 22:40 it looks close 2008-12-07 22:40 could drop the _after 2008-12-07 22:40 156 * * Does not provide a generic mechanism that can be adapted to other 2008-12-07 22:40 157 * truncation tasks. 2008-12-07 22:41 what did you have in mind? 2008-12-07 22:41 using dwalk api 2008-12-07 22:41 then later, it can be elaborated 2008-12-07 22:42 to handle situations like versioning 2008-12-07 22:43 ah and perhaps hole punching 2008-12-07 22:43 although thats not really a truncation task 2008-12-07 22:44 and the concept of generilizing it to streams 2008-12-07 22:44 there is a read stream with dwalk, and a write stream with dleaf_pack 2008-12-07 22:44 to implement version delete 2008-12-07 22:44 chop a stream? 2008-12-07 22:45 chop is a simple case of a leaf edit 2008-12-07 22:45 we also have version delete to worry about 2008-12-07 22:46 chop will be a sort of example of how to use the walk api 2008-12-07 22:46 now... 
why is it better 2008-12-07 22:46 than the original 2008-12-07 22:46 besides, should be more readable 2008-12-07 22:46 all the messy stuff is tucked away in the walk and chop 2008-12-07 22:46 i've been using filemap_extent_io as my example ;) 2008-12-07 22:46 right 2008-12-07 22:47 so that is lame in a few respects 2008-12-07 22:47 one is, it always repacks the entire tail of the leaf 2008-12-07 22:47 could be costly to do all the unpacking and repacking, when a simple memmove would do 2008-12-07 22:48 I dunno 2008-12-07 22:48 somebody smarter than me needs to take charge of dleaf ;) 2008-12-07 22:51 to tell the truth, the case when extent_io is just making a small change to the beginning of a leaf is fairly rare 2008-12-07 22:51 the code duplication between get_block and extent_io makes me wonder if there is something missing from the api 2008-12-07 22:52 hirofumi is waiting for me to come up with a proper get_extents api 2008-12-07 22:52 to eliminate the duplication 2008-12-07 22:52 see the segs[] vector, it will fill that in and hand it back to caller, which will be get_block 2008-12-07 22:53 for now, the duplication is ok, it's a reminder to me to get off my butt and finish the get_extents api 2008-12-07 22:53 ah i see the discussion from 2008-11-25 2008-12-07 22:54 i like the one irc log 2008-12-07 22:54 renamed struct extent to diskextent so we could use struct extent in the api 2008-12-07 22:54 grep get_extent current.log :) 2008-12-07 22:54 then I got distracted by deferring 2008-12-07 22:54 hopefully not too much longer on that 2008-12-07 22:55 if you look at tree_chop, it uses a streaming approach to editing the tree...
it reads source extents and writes destination extents 2008-12-07 22:56 so what I would like to do is generalize that so we can define a source stream which can be either dleaf entries or ileaf attributes 2008-12-07 22:58 so it goes while (leaf_walk(...)) { if (...) leaf_pack(...); } 2008-12-07 22:58 and leaf_pack can pack into a different leaf than the one being read from 2008-12-07 22:59 why into a different leaf? 2008-12-07 22:59 this supports merging leafs in the tree_chop 2008-12-07 23:00 well, it is not too bad to have separate purge and merge steps 2008-12-07 23:00 -!- camby(~root@60.205.80.132) has joined #tux3 2008-12-07 23:01 and actually, it's already abstracted not too badly ;) 2008-12-07 23:02 ACTION should go back to deferred create involving a hidden active orphan 2008-12-07 23:41 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-08 00:29 hey 2008-12-08 01:28 hi, flips there? 2008-12-08 01:28 hi 2008-12-08 01:29 I thought a bit about defer 2008-12-08 01:29 :) 2008-12-08 01:29 then, I think allocate new negative dentry may be simple 2008-12-08 01:29 latest tux3 broke 64 bit uml? 2008-12-08 01:30 details to the list maybe? 2008-12-08 01:30 okies 2008-12-08 01:30 it allocates new negative dentry only if BACKED and still open 2008-12-08 01:30 I guess 2008-12-08 01:31 this is just before instantiating? 2008-12-08 01:31 or when?
2008-12-08 01:31 I think it does in d_hide() 2008-12-08 01:31 that might work 2008-12-08 01:31 BUG_ON(dentry->d_flags & DCACHE_DELETE); 2008-12-08 01:31 if (dentry->d_flags & DCACHE_CREATE) { 2008-12-08 01:31 BUG_ON(atomic_read(&dentry->d_count) <= 1); 2008-12-08 01:31 dentry->d_flags &= ~DCACHE_CREATE; 2008-12-08 01:31 dput(dentry); 2008-12-08 01:31 return 0; 2008-12-08 01:31 } else { 2008-12-08 01:31 struct dentry *new; 2008-12-08 01:31 if (atomic_read(&dentry->d_count) == 1) { 2008-12-08 01:31 dentry->d_flags |= DCACHE_DELETE; 2008-12-08 01:31 dget(dentry); 2008-12-08 01:31 return 1; 2008-12-08 01:32 } 2008-12-08 01:32 spin_unlock(&dentry->d_lock); 2008-12-08 01:32 spin_unlock(&dcache_lock); 2008-12-08 01:32 new = d_alloc(dentry->d_parent, &dentry->d_name); 2008-12-08 01:32 d_instantiate(new, NULL); 2008-12-08 01:32 d_rehash(new); 2008-12-08 01:32 spin_lock(&dentry->d_lock); 2008-12-08 01:32 spin_lock(&dcache_lock); 2008-12-08 01:32 return 0; 2008-12-08 01:32 } 2008-12-08 01:32 flags are not current one 2008-12-08 01:32 sorry, DCACHE_DELETE means it delayed delete 2008-12-08 01:32 DCACHE_CREATE means it delayed create 2008-12-08 01:32 yes, you've been working on this I can see 2008-12-08 01:33 I wrote something similar, except in d_instantiate 2008-12-08 01:33 very similar :) 2008-12-08 01:33 oh, good :) 2008-12-08 01:33 this seems we can avoid HIDDEN too 2008-12-08 01:34 the code above is in d_delete? 2008-12-08 01:35 oh 2008-12-08 01:35 yes, actually ->d_hide() 2008-12-08 01:35 right 2008-12-08 01:35 have you tried it? 
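Editor's sketch: the three branches of the d_hide() paste above can be modelled in plain userspace C to make the case analysis easier to follow. Everything here is a hypothetical stand-in (the `sk_` names, the flag values, the result names); the reference-count changes only imitate dget()/dput(), there is no locking, and the labels on the three outcomes are my own reading of the paste, not something the log states.

```c
#include <assert.h>

/* Hypothetical stand-ins for DCACHE_CREATE / DCACHE_DELETE; values are arbitrary. */
enum { SK_CREATE = 1 << 0, SK_DELETE = 1 << 1 };

/* Toy dentry: only a reference count and the two deferred-operation flags. */
struct sk_dentry {
	int count;
	unsigned flags;
};

/* Outcome names are illustrative; the pasted code returns 0, 1 and 0. */
enum sk_result {
	SK_CANCELLED_CREATE, /* create was still pending, so the pending flag is just dropped */
	SK_DEFERRED_DELETE,  /* sole user: flag the dentry for a deferred delete */
	SK_NEEDS_NEGATIVE,   /* other users remain: caller adds a new negative dentry */
};

enum sk_result sk_hide(struct sk_dentry *d)
{
	assert(!(d->flags & SK_DELETE)); /* mirrors BUG_ON(d_flags & DCACHE_DELETE) */

	if (d->flags & SK_CREATE) {
		assert(d->count > 1);    /* mirrors BUG_ON(d_count <= 1) */
		d->flags &= ~SK_CREATE;
		d->count--;              /* stands in for dput() */
		return SK_CANCELLED_CREATE;
	}
	if (d->count == 1) {
		d->flags |= SK_DELETE;
		d->count++;              /* stands in for dget() */
		return SK_DEFERRED_DELETE;
	}
	/* The paste drops both locks here, allocates and hashes a negative dentry, then relocks. */
	return SK_NEEDS_NEGATIVE;
}
```

Calling sk_hide() on a dentry with count 2 and SK_CREATE set clears the flag and leaves count 1; on a dentry with count 1 and no flags it sets SK_DELETE and raises count to 2; with count 3 and no flags it changes nothing and leaves the negative-dentry work to the caller.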
2008-12-08 01:36 no, I didn't try this 2008-12-08 01:36 I've noticed this right now 2008-12-08 01:36 that would simplify things 2008-12-08 01:36 indeed 2008-12-08 01:37 yes, I hope so 2008-12-08 01:37 so, at the point d_delete would have unhashed the busy dentry, we create a new, negative one instead 2008-12-08 01:37 yes, after that, all is normal state 2008-12-08 01:38 ah, no 2008-12-08 01:38 unfortunately, we keep negative dentry until delayed delete is completed 2008-12-08 01:39 so, more code is needed 2008-12-08 01:39 we need dput() for created new negative 2008-12-08 01:41 DCACHE_CREATE also has to be set somewhere 2008-12-08 01:41 in ext2_add_nondir I guess 2008-12-08 01:41 yes 2008-12-08 01:42 DCACHE_DELETE needs to know if a dirent existed 2008-12-08 01:42 well, I think those flags can be replaced by current flags 2008-12-08 01:42 I think the current flags are about right 2008-12-08 01:43 but that may be a better place to handle the unhashing 2008-12-08 01:43 DCACHE_DELETE is set only if a dirent existed 2008-12-08 01:44 if create -> delete pattern, dentry should have DCACHE_CREATE 2008-12-08 01:46 ah, so new negative dentry should have DCACHE_DELETE 2008-12-08 01:47 how do we know the dirent exists at that point? 2008-12-08 01:47 because ~DCACHE_CREATE?
2008-12-08 01:47 in my guess, DCACHE_DELETE/DCACHE_CREATE tells it 2008-12-08 01:48 I will try working it through with that flag logic and see what happens 2008-12-08 01:48 yes, it assumes backend clear it if backend flushed 2008-12-08 01:49 the reason we need a HIDDEN flag is so that we can make a dentry negative but still carry an inode in it 2008-12-08 01:49 I think there is no special case like it 2008-12-08 01:50 we need to dentry_iput when the dentry count goes to zero 2008-12-08 01:50 yes 2008-12-08 01:51 HIDDEN flag is needed for still open dentry 2008-12-08 01:52 in that code, the code unhashes it, and instead adds new negative dentry 2008-12-08 01:52 instead of it, the code adds new negative dentry 2008-12-08 01:53 and we can find the unhashed on because it is still on the child list, true 2008-12-08 01:53 so your approach may work and be simpler 2008-12-08 01:53 and if create -> delete and still open, we just unhash it 2008-12-08 01:54 because we don't need to create new dirent, we handle orphan though 2008-12-08 01:55 I will try it that way 2008-12-08 01:55 I'm getting pretty good at testing this mess :) 2008-12-08 01:55 oh, good :) 2008-12-08 01:56 so that also avoids the big change to ->create 2008-12-08 01:56 yes 2008-12-08 01:57 ok, I will clean up and post what I have, which add a dentry-cloning variant of d_instantiate, then try it your way 2008-12-08 01:57 I guess, rest issue is locking against backend 2008-12-08 01:57 ok, thanks 2008-12-08 01:57 the locking against back end is pretty simple 2008-12-08 01:58 the back end does what I have written as ext2_fsync, and ext2_sync_fs 2008-12-08 01:58 we make sure, fs is stable state on some point 2008-12-08 01:58 yes, this happens in staging 2008-12-08 01:58 I wrote the details a few days ago 2008-12-08 02:00 I'm not sure it is right - clear DCACHE_*, then unlock, and modify buffer 2008-12-08 02:01 http://mailman.tux3.org/pipermail/tux3/2008-December/000411.html - Delta staging 2008-12-08 02:01 in the 
fsync processing? 2008-12-08 02:02 yes 2008-12-08 02:02 well, it may be only me 2008-12-08 02:03 notice in my ext2_flush_dir I am relying on the directory i_mutex to keep the child list stable while walking it 2008-12-08 02:03 it just means I'm not thinking about it 2008-12-08 02:03 oh 2008-12-08 02:04 because the dcache_lock is dropped while doing the walk and picked back up again 2008-12-08 02:04 yes 2008-12-08 02:04 so I am not sure what dcache_lock actually protects 2008-12-08 02:04 ext2_flush_dir() doesn't seem to take ->i_mutex 2008-12-08 02:05 ah 2008-12-08 02:05 ->fsync() caller does 2008-12-08 02:05 yes, and our staging would have to 2008-12-08 02:05 sounds good 2008-12-08 02:05 I missed it completely 2008-12-08 02:05 anything that can add or remove a child has to have the same dir->i_mutex I think 2008-12-08 02:06 otherwise it would be really hard to walk a child list 2008-12-08 02:06 i see 2008-12-08 02:06 it sounds good 2008-12-08 02:06 well I'm not totally sure it's right ;) 2008-12-08 02:07 the only list dcache_lock would be protecting then would appear to be the lru 2008-12-08 02:07 I'll try to make race in my brain for testing 2008-12-08 02:07 ok 2008-12-08 02:07 however, it really sounds good 2008-12-08 02:07 oh good 2008-12-08 02:08 I will do a little more tonight then sleep 2008-12-08 02:08 ok, oyasumi 2008-12-08 02:08 oyasumi 2008-12-08 02:08 ok, my bad. the testdev was not clean.. 2008-12-08 02:09 works fine after building a new testdev 2008-12-08 02:09 :) 2008-12-08 02:09 i hit an assertion in probe during mount 2008-12-08 02:09 make mkfs makes a new testdev, I guess you know 2008-12-08 02:09 no :( 2008-12-08 02:09 i know now :) 2008-12-08 02:09 :) 2008-12-08 02:10 i used gdb to see what was wrong 2008-12-08 02:10 should be make testdev I suppose 2008-12-08 02:10 ileaf_sniff was returning 0... 2008-12-08 02:10 ah, fun 2008-12-08 02:10 yups..
gdb in emacs :D 2008-12-08 02:10 you can also do make debug, that will mount under fuse and you should hit the assert too 2008-12-08 02:11 hmm 2008-12-08 02:11 fuse is not installed on my office desktop 2008-12-08 02:11 :( 2008-12-08 02:12 well running under uml or a real machine works fine too ;) 2008-12-08 02:12 :) 2008-12-08 02:12 I guess it must be uml 2008-12-08 02:13 yup 2008-12-08 02:13 uml 2008-12-08 02:15 hirofumi, shall I check in michael pattricks's rename? 2008-12-08 02:16 I think, checking in something only marginally functional is better than nothing at all :) 2008-12-08 02:16 and it actually looks pretty close 2008-12-08 02:16 ah, I think I know some bugs in that code 2008-12-08 02:16 how about, check in with bugs, then fix? 2008-12-08 02:17 first thing is to lindent it ;) 2008-12-08 02:17 then fix would be enough 2008-12-08 02:17 yes 2008-12-08 02:17 coding style fix would be first :) 2008-12-08 02:17 ok, I'll do it now 2008-12-08 02:17 checkpatch.pl would work for it 2008-12-08 02:18 you do ? 2008-12-08 02:19 not normally 2008-12-08 02:19 I'll try it 2008-12-08 02:20 oh, "michael, please fix it" is good lazy maintainer :) 2008-12-08 02:20 save_inode: save inode 0xd lookup inode 0xd, 0 + d resize inum 0xd at 0x8a from 2e to 2e 2008-12-08 02:20 flips: inode is being resized here? 2008-12-08 02:21 size stays the same there 2008-12-08 02:22 hmm, but resize is called anyway.. 2008-12-08 02:22 yes, can be optimized 2008-12-08 02:23 ok.. anyone found what ext2 inc_link_count and dec_link_count are for?? 2008-12-08 02:23 which one? 2008-12-08 02:24 well, some inc/dec is for sync mode 2008-12-08 02:24 hmm, the one flips mentioned is his mail 2008-12-08 02:24 to michael 2008-12-08 02:24 ah 2008-12-08 02:25 it hard to mention until understand the other part 2008-12-08 02:25 right, and checking it in broken will help :) 2008-12-08 02:26 :) 2008-12-08 02:27 well, hint is it cares about crash 2008-12-08 02:29 e.g. 
inc_link, add_entry, crash 2008-12-08 02:29 ok 2008-12-08 02:29 it is safe to delete entry 2008-12-08 02:29 because it has extra link count 2008-12-08 02:29 patched, checked in and lindented 2008-12-08 02:30 hirofumi, but those links will never make it to disk in tux3 2008-12-08 02:30 sync mode does 2008-12-08 02:31 in sync mode, tux3 still needs to commit only fully consistent sets of blocks 2008-12-08 02:31 yes, tux3 doesn't need this trick at all 2008-12-08 02:31 so I don't think those inc/dec link does anything for us, I think it is to guard against asynchronous writeout in ext2 2008-12-08 02:31 right 2008-12-08 02:31 ext3 actuall doesn't 2008-12-08 02:32 makes sense 2008-12-08 02:32 so, I think we would like to do like ext3 2008-12-08 02:33 yes 2008-12-08 02:33 well, anyway, I'll cleanup rmdir to share with unlink 2008-12-08 02:34 and I leaves rename bugs for a while to Michael 2008-12-08 02:34 hirofumi: mind if i try? 2008-12-08 02:35 rmdir cleanup? 2008-12-08 02:35 yup 2008-12-08 02:35 if you want to try, please do it 2008-12-08 02:35 thanks :) 2008-12-08 02:36 maybe, it will create tux_del_dirent() or something 2008-12-08 02:37 and unlink will dec_link, and rmdir will clear_link/mark_inode_dirty 2008-12-08 02:37 well, good luck :) 2008-12-08 02:37 hmm 2008-12-08 02:37 i thought we dont need dec_link ? 2008-12-08 02:37 and inc_link 2008-12-08 02:38 it may be hardlink 2008-12-08 02:38 so, we can't just remove it if not directory 2008-12-08 02:38 oh, I noticed an issue 2008-12-08 02:39 for deferred rmdir 2008-12-08 02:39 we need to know the dir is empty 2008-12-08 02:39 but current practice is only to inc dir link count for subdirs, not files 2008-12-08 02:39 is that posix? or can we inc/dec for files too if we want? 2008-12-08 02:40 I'm not sure for it 2008-12-08 02:40 "." inc itself, ".." inc parent 2008-12-08 02:40 in practice 2008-12-08 02:42 if we want to count it, we can walk ->d_subdirs? 
2008-12-08 02:43 that's only the ones that happen to be in cache 2008-12-08 02:44 ah 2008-12-08 02:44 I suspect that counting all children in the dir link count is posix-correct 2008-12-08 02:44 and so is just counting subdirs 2008-12-08 02:45 maybe 2008-12-08 02:45 hah, somebody mentions that posix allows directory hard links 2008-12-08 02:46 http://www.opengroup.org/austin/mailarchives/ag/msg08496.html 2008-12-08 02:47 "The command "link" was originally provided 2008-12-08 02:47 in order to be able to create links to directories" 2008-12-08 02:48 scary 2008-12-08 02:49 have you thought about whether we need to store . and .. ? 2008-12-08 02:49 that mail also says posix does not require . and .. to be returned by getdents 2008-12-08 02:49 I'm not sure about nfsd 2008-12-08 02:49 I think it is just implementation issue 2008-12-08 02:49 let's try it without and see what breaks :) 2008-12-08 02:50 I also need to respond to bruce's very informative mail 2008-12-08 02:50 I'm nearly sure it is work 2008-12-08 02:50 yes 2008-12-08 02:51 ted ts'o said it was only there in ext3 for backward compatibility 2008-12-08 02:51 "." and ".."? 2008-12-08 02:52 if nfsd works, I think we don't need . and .. 
at all 2008-12-08 02:53 so, basically we don't need 2008-12-08 02:53 yes 2008-12-08 02:53 so some code can go way from the rename 2008-12-08 02:54 yes 2008-12-08 03:02 folks 2008-12-08 03:03 hi 2008-12-08 03:03 http://www.opengroup.org/onlinepubs/000095399/functions/link.html 2008-12-08 03:03 RATIONALE mentions about directory 2008-12-08 03:36 static int tux3_rmdir(struct inode *dir, struct dentry *dentry) { struct inode *inode = dentry->d_inode; int err = -ENOTEMPTY; struct buffer_head *buffer; tux_dirent *de; if (tux_dir_is_empty(inode)) { err = tux3_unlink(dir, dentry); if (!err) { inode->i_size = 0; inode_dec_link_count(dir); } } return err; } 2008-12-08 03:36 oops 2008-12-08 03:36 :) 2008-12-08 03:36 http://www.mibbit.com/pb/8j17rx 2008-12-08 03:36 :) 2008-12-08 03:37 hirofumi: ping 2008-12-08 03:37 hi 2008-12-08 03:37 hello 2008-12-08 03:37 above url.. mibbit one... can u check? 2008-12-08 03:37 ok 2008-12-08 03:39 hmm.. 2008-12-08 03:39 right now, it would be work 2008-12-08 03:39 one mistake :) 2008-12-08 03:39 however, I think mkdir should have ->i_nlink == 2 2008-12-08 03:39 i removed the inode_dec_link_count(inode) 2008-12-08 03:39 yeah 2008-12-08 03:39 so dec_link twice? 2008-12-08 03:40 clear_link() 2008-12-08 03:40 once in tux3_unlink and once in tux3_rmdir? 2008-12-08 03:40 hmm 2008-12-08 03:40 clear_nlink 2008-12-08 03:41 hmm.. new function? 2008-12-08 03:41 which clears 2 links? 2008-12-08 03:41 include/fs.h has it 2008-12-08 03:41 ok.. 2008-12-08 03:41 so, we don't share tux3_unlink entirely 2008-12-08 03:41 so instead of inode_dec_link_count(inode) i do a clear_link(inode) 2008-12-08 03:42 which inode_dec_link_count(inode) 2008-12-08 03:42 ? 2008-12-08 03:42 in tux3_unlink()? 2008-12-08 03:42 it should actually be in tux3_rmdir... 2008-12-08 03:42 right 2008-12-08 03:42 and mark_inode_dirty(inode) 2008-12-08 03:43 ok 2008-12-08 03:46 hirofumi: which inclue/fs.h? 
2008-12-08 03:46 s/inclue/include 2008-12-08 03:47 ah 2008-12-08 03:47 include/linux/fs.h 2008-12-08 03:47 :) 2008-12-08 03:47 ok 2008-12-08 03:49 http://www.mibbit.com/pb/lUNKlz 2008-12-08 03:50 basically 2008-12-08 03:50 but we can't share tux3_unlink entirely 2008-12-08 03:51 why is that? 2008-12-08 03:51 because it does dec_link 2008-12-08 03:51 yeah 2008-12-08 03:51 it will be decremented once there 2008-12-08 03:51 and mark_dirty_inode 2008-12-08 03:51 and then in clear_nlink to 0 2008-12-08 03:52 it works, however it's not efficient 2008-12-08 03:52 hmm 2008-12-08 03:52 let's just do mark_inode_dirty() once 2008-12-08 03:53 so, I thought tux_del_dirent() will be introduced 2008-12-08 03:54 hmm 2008-12-08 03:54 actually in rmdir, we dont need mark_inode_dirty 2008-12-08 03:54 why? 2008-12-08 03:54 because in tux_delete_entry we are anyhow marking it dirtl 2008-12-08 03:55 dirty* 2008-12-08 03:55 hmm, i may be wrong 2008-12-08 03:55 that is on dir.. 2008-12-08 03:55 it diry dirent 2008-12-08 03:55 dirty 2008-12-08 03:55 inode is not dirtyed 2008-12-08 03:56 hmm, ok 2008-12-08 03:58 pranith, http://userweb.kernel.org/~hirofumi/tux3.png 2008-12-08 03:59 tux_delete_entry marks root_dtree_data as dirty 2008-12-08 03:59 and dir inode 2008-12-08 04:00 hmm 2008-12-08 04:01 ok, now the inode which was assigned to dirent needs to be dirtied? 2008-12-08 04:01 rmdir() should mark the inode which is referenced by root_dtree_data 2008-12-08 04:02 deleted dirent 2008-12-08 04:02 yes 2008-12-08 04:02 tux_delete_entry marks parent of it 2008-12-08 04:03 ok, got it... 2008-12-08 04:03 http://www.mibbit.com/pb/fO6ejp 2008-12-08 04:04 we do that here.. 2008-12-08 04:04 yes 2008-12-08 04:04 so you want this to go in tux_del_dirent? 2008-12-08 04:04 no, I want to avoid mark_inode_dirty twice 2008-12-08 04:05 tux3_unlink does mark_inode_dirty, and tux3_rmdir does again 2008-12-08 04:06 inode_dec_link_count is inode->i_nlink-- and mark_inode_dirty() 2008-12-08 04:06 ok.. 
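Editor's sketch: the link-count bookkeeping argued over above can be tried out with a toy model in userspace C. All `sk_` names are hypothetical, the "dirtied" counter merely counts calls that stand in for mark_inode_dirty(), and a fresh directory is assumed to start with a link count of 2, as suggested in the chat. It compares an rmdir that reuses the whole unlink path with one that shares only the dirent removal.

```c
#include <assert.h>

/* Toy inode: a link count and a counter of how many times it was marked dirty. */
struct sk_inode {
	unsigned nlink;
	int dirtied;
};

/* Stands in for mark_inode_dirty(). */
void sk_mark_dirty(struct sk_inode *inode)
{
	inode->dirtied++;
}

/* As described in the chat: i_nlink-- followed by mark_inode_dirty(). */
void sk_dec_link_count(struct sk_inode *inode)
{
	inode->nlink--;
	sk_mark_dirty(inode);
}

/* Hypothetical shared helper: removes the name only, which dirties the parent. */
void sk_del_dirent(struct sk_inode *dir)
{
	sk_mark_dirty(dir);
}

/* unlink: drop the name, then drop one link on the target. */
void sk_unlink(struct sk_inode *dir, struct sk_inode *inode)
{
	sk_del_dirent(dir);
	sk_dec_link_count(inode);
}

/* rmdir sharing only the dirent removal: clear the count, dirty the child once. */
void sk_rmdir(struct sk_inode *dir, struct sk_inode *inode)
{
	sk_del_dirent(dir);
	inode->nlink = 0;        /* stands in for clear_nlink() */
	sk_mark_dirty(inode);
	sk_dec_link_count(dir);  /* parent loses the child's ".." reference */
}

/* rmdir built on top of unlink: same end state, but the child is dirtied twice. */
void sk_rmdir_via_unlink(struct sk_inode *dir, struct sk_inode *inode)
{
	sk_unlink(dir, inode);
	inode->nlink = 0;
	sk_mark_dirty(inode);
	sk_dec_link_count(dir);
}
```

Both rmdir variants leave the child with a link count of 0 and take one link off the parent; the only difference the model shows is the child's dirtied counter, 1 for sk_rmdir() and 2 for sk_rmdir_via_unlink().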
2008-12-08 04:10 http://www.mibbit.com/pb/y3pj0Z 2008-12-08 04:10 final one hopefully :) 2008-12-08 04:11 unfortunately, no :) 2008-12-08 04:12 we need to call mark_inode_dirty() after clear_nlink() 2008-12-08 04:12 :( 2008-12-08 04:12 that is called onced in tux3_unlink as u said 2008-12-08 04:13 because it may clear the dirty before clear_nlink() 2008-12-08 04:13 yes 2008-12-08 04:14 I think we can't avoid to introduce tux_del_dirent() 2008-12-08 04:14 :) 2008-12-08 04:14 my only worry is duplicating the code 2008-12-08 04:14 which will happen with this.. 2008-12-08 04:15 around of mark_inode_dirty() has difference on rmdir() and unlink() 2008-12-08 04:15 so, it can't share about it 2008-12-08 04:16 we can just share deleting directory entry part only 2008-12-08 04:16 rmdir() has to clear ->nlink, and unlink has to dec ->nlink 2008-12-08 04:17 those are different, so code should also different 2008-12-08 04:18 ok 2008-12-08 04:31 -!- inverse(~chatzilla@h80-net10.simres.netcampus.ca) has joined #tux3 2008-12-08 04:35 -!- inverse(~michael@h80-net10.simres.netcampus.ca) has joined #tux3 2008-12-08 04:37 http://www.mibbit.com/pb/HsZk54 2008-12-08 04:37 please check for rmdir... i 2008-12-08 04:38 ok, am starting on rename... 2008-12-08 04:39 inode_dec_link_count(dir) was lost in rmdir 2008-12-08 04:40 buggy me :( 2008-12-08 04:40 and let's move "inode->i_ctime = dir->i_ctime" in rmdir/unlink 2008-12-08 04:40 to rmdir/unlink 2008-12-08 04:40 ok 2008-12-08 04:42 and tux3_del_dirent() to tux_del_dirent() for consistency of tux_add_dirent() 2008-12-08 04:43 and mark_inode_dirty(dir) in rmdir, it should be mark_inode_dirty(inode) 2008-12-08 04:44 yeah, i changed that.. and rename will be done 2008-12-08 04:44 i _really_ have to review my own patches before posting :/ 2008-12-08 04:45 :) 2008-12-08 04:46 hirofumi: i_ctime needs to be updated in rmdir? 2008-12-08 04:46 i think only in unlink will do? 
2008-12-08 04:46 yes 2008-12-08 04:46 I think update would be better 2008-12-08 04:46 hmm 2008-12-08 04:47 because that directory may still open 2008-12-08 04:47 so, fstat() will return updated i_ctime 2008-12-08 04:48 ok.. understood 2008-12-08 04:48 btw, tux_dir_is_empty() was lost? 2008-12-08 04:49 in rmdir 2008-12-08 04:49 hmm.. seems so 2008-12-08 04:49 oh, it's needed 2008-12-08 04:50 yups, my bad ... 2008-12-08 04:51 repatching everything.. seems i did too many mistakes 2008-12-08 04:53 that would be ok, use this experience at next time 2008-12-08 04:53 :) 2008-12-08 05:25 hirofumi: inode_dec_link_count(inode) at rmdir time == clear + mark_dirty 2008-12-08 05:40 i seriously prefer the old one.. instead of adding two lines here :) 2008-12-08 05:52 currently mkdir is ->i_nlink == 1, so it happens 2008-12-08 05:53 it will do that way, if we use ->i_nlink == 2 in mkdir 2008-12-08 06:51 oh hi, I'm michael from the mailinglist 2008-12-08 06:58 hi 2008-12-08 07:28 -!- snitm(~snitm@64.2.136.226.ptr.us.xo.net) has joined #tux3 2008-12-08 07:54 question: why not include the dec_link_count logic in the tux_del_dirent function that pranith wrote? 2008-12-08 07:55 rmdir will do clear_nlink(), because mkdir will create inoe with nlink=2 2008-12-08 07:56 s/inoe/inode/ 2008-12-08 07:59 -!- mingming(~mingming@c-71-193-163-244.hsd1.or.comcast.net) has joined #tux3 2008-12-08 09:01 -!- prani(~bobby@122.163.48.222) has joined #tux3 2008-12-08 09:22 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-12-08 09:24 -!- mingming(~mingming@c-71-193-163-244.hsd1.or.comcast.net) has joined #tux3 2008-12-08 09:36 -!- prani(~bobby@122.163.48.222) has joined #tux3 2008-12-08 09:37 -!- snitm_(~snitm@64.2.136.226.ptr.us.xo.net) has joined #tux3 2008-12-08 09:51 -!- prani(~bobby@122.163.48.222) has joined #tux3 2008-12-08 10:02 -!- pranihome(7aa330de@webchat.mibbit.com) has joined #tux3 2008-12-08 10:02 hey all 2008-12-08 10:08 anyone here? 
2008-12-08 10:11 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-12-08 10:12 yes :) 2008-12-08 10:13 inverse, hello 2008-12-08 10:13 how did u find the tux3 U the other day? 2008-12-08 10:14 i mean, how was it? 2008-12-08 10:15 confusing yet informative 2008-12-08 10:44 -!- pranihome(~bobby@122.163.48.222) has joined #tux3 2008-12-08 11:12 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-12-08 11:19 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-12-08 15:55 tux3 u was very hard core last time 2008-12-08 15:55 me and hirofumi chasing weird dentry cache ideas 2008-12-08 15:56 since we have broken the back of that problem, I think, the next one can be more of a general tour of the dentry cache 2008-12-08 17:13 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-12-08 20:13 hirofumi, there? 2008-12-08 20:22 -!- pranihome(~bobby@122.162.72.155) has joined #tux3 2008-12-08 20:29 the "hirofumi mod" of deferred nameops passed tests 1, 2, 3 and 4 2008-12-08 20:29 probably works 2008-12-08 20:44 so apparently that rename function I wrote was a lot more buggy than I initially thought 2008-12-08 21:26 inverse, well lots of improvements could be made to it 2008-12-08 21:26 but before you sent it, there was no rename 2008-12-08 21:28 I really should learn more about how the vfs layer works 2008-12-08 21:28 that's a good way to start 2008-12-08 21:28 come on tuesday, we will take a look at the dentry cache 2008-12-08 21:29 -!- ChanServ changed mode/#tux3 -> +o flips 2008-12-08 21:30 -!- flips changed topic to "http://tux3.org ~ Tux3 University, right here Tuesdays and Thursdays at 8 pm Pacific Time ~ Next session: A cook's tour of the dentry cache" 2008-12-08 21:30 -!- ChanServ changed mode/#tux3 -> -o flips 2008-12-08 22:01 flips: your latest commit breaks namei.c :) 2008-12-08 22:01 any reason u replaced "goto out;" with "return err;"?? 2008-12-08 22:02 pranith, brain rot? 2008-12-08 22:02 which function?
2008-12-08 22:07 tux3_rename 2008-12-08 22:07 out: return err; 2008-12-08 22:07 err is undeclared in that scope 2008-12-08 22:07 just delete out: then? 2008-12-08 22:08 hmm, i actually prefer goto out :) but upto you 2008-12-08 22:09 it's not a big difference 2008-12-08 22:09 deleting should be fine... 2008-12-08 22:09 it's a couple lines shorter this way 2008-12-08 22:09 yups 2008-12-08 22:10 i think goto out is useful in cases where we need to clean up something 2008-12-08 22:10 we are just returning, better return from there itself 2008-12-08 22:10 nice 2008-12-08 22:10 right 2008-12-08 22:10 and let the compiler optimize the return if it wants to 2008-12-08 22:11 there's a mistake 2008-12-08 22:13 the buffer from find_entry is not released 2008-12-08 22:13 ACTION is looking 2008-12-08 22:15 yup 2008-12-08 22:15 seems so.. 2008-12-08 22:16 um..., why did you add kernel/namei.c to Makefile? 2008-12-08 22:17 deps were not working for kernel/namei.c 2008-12-08 22:17 kernel/namei.c is kernel only? 2008-12-08 22:19 yes 2008-12-08 22:19 it's the only file that is kernel-only I think 2008-12-08 22:20 so, userland is not needed to recompile? 2008-12-08 22:20 :) 2008-12-08 22:20 ok, I'm stupid 2008-12-08 22:26 that's why I didn't pick up the warning on out: 2008-12-08 22:28 pushed a fix 2008-12-08 22:28 http://hg.tux3.org/tux3?cs=6d49f259c4ac 2008-12-08 22:29 second hunk of user/inode.c 2008-12-08 22:29 + if (!IS_ERR(entry)) 2008-12-08 22:29 + return NULL; // should allow create of a file that already exists!!! 2008-12-08 22:29 I should actually compile it in kernel 2008-12-08 22:29 it should do brelse(buffer) 2008-12-08 22:29 yes 2008-12-08 22:30 in user/kernel/dir.c, we return -EINVAL 2008-12-08 22:31 it may conflict other error 2008-12-08 22:31 how about EIO? 2008-12-08 22:32 ok 2008-12-08 22:36 makefile fixed and brelse restored 2008-12-08 22:37 EIO is misleading too 2008-12-08 22:37 how about EBORKED? 2008-12-08 22:38 there is EBORKED? 
2008-12-08 22:38 joke 2008-12-08 22:38 :) 2008-12-08 22:38 would be a nice error to have 2008-12-08 22:40 yes, fs wants ECORRUPTFS 2008-12-08 22:41 I'm using EIO in that case for now 2008-12-08 22:42 we wouldn't have this problem if we just returned NULL for every error ;) 2008-12-08 22:42 :) 2008-12-08 22:46 can we define how long line is acceptable? 2008-12-08 23:15 hirofumi: u knw hiro nakamura? 2008-12-08 23:15 :D 2008-12-08 23:16 ? no 2008-12-08 23:18 hmm, ok.. he can break time-space continuum 2008-12-08 23:19 in the hit tv series heroes :D 2008-12-08 23:19 am talking rubbish.. pardon me :) 2008-12-08 23:20 ah, heroes 2008-12-08 23:20 I know a bit 2008-12-08 23:23 :) 2008-12-08 23:49 hirofumi, http://phunq.net/pipermail/tux3/2008-December/000440.html <- the Hirofumi Method 2008-12-08 23:49 hirofumi, there is no real definition of how long a line can be now 2008-12-08 23:50 I think it is: "whatever is most readable" 2008-12-08 23:50 it was getting silly the way simple expressions were folded over multiple lines, just because we're using long field names and 8 char tabs 2008-12-08 23:50 I think it is hard, and make crappy flamewar 2008-12-08 23:51 the flamewar is over, linus declared on it 2008-12-08 23:51 yes, linux is 80 columns 2008-12-08 23:51 but, we use more 2008-12-08 23:51 actually, linus declared it could be more, if the result is more readable 2008-12-08 23:52 oh 2008-12-08 23:52 anyway, if the last thing we have to do to get merged is fold some lines, that is ok with me :) 2008-12-08 23:52 akpm and others says please fix long line 2008-12-08 23:52 real problem is readable can not define 2008-12-08 23:53 ah, well we won't do anything tasteless 2008-12-08 23:53 some of our lines are way too long 2008-12-08 23:53 like debug stuff in btree.c 2008-12-08 23:53 a couple lines that go to 90 columns in namei.c will not bother anybody 2008-12-08 23:54 function declarations often use long lines in linux, when exported, so that the header only has one line per
function 2008-12-08 23:54 yes, actually I don't care much 2008-12-08 23:54 wrapped declarations in header files are seriously ugly 2008-12-08 23:55 I don't care awfully much, but when developing code rapidly, fewer lines is better 2008-12-08 23:55 now that some parts are getting stable, it's ok if it gets a little more awkward to work on 2008-12-08 23:55 yes, on some case 2008-12-08 23:55 anyway, your suggestion from yesterday worked out fine 2008-12-08 23:56 I forgot to mention... there's a race in it 2008-12-08 23:56 yes, maybe I noticed it too 2008-12-08 23:56 when the dentry locks are dropped and reacquired, there is a chance to end up with two negative dentries, only one of which will ever be removed 2008-12-08 23:56 yes, exactly 2008-12-08 23:57 code will become more complex 2008-12-08 23:57 this is easy to fix, it will just make it look ugly 2008-12-08 23:57 typical spinlock race fix 2008-12-08 23:57 yes 2008-12-08 23:57 it should be like d_move 2008-12-08 23:58 yes 2008-12-08 23:59 now rename 2008-12-09 00:00 I started to look at it last night 2008-12-09 00:00 d_move in fact 2008-12-09 00:00 it seems to be not complex 2008-12-09 00:01 probably, we need same way like d_delete 2008-12-09 00:02 we have a d_move override feature, that might be useful 2008-12-09 00:02 FS_RENAME_DOES_D_MOVE 2008-12-09 00:03 -!- hirofumi(~hirofumi@210.171.168.39) has left #tux3 2008-12-09 00:03 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-12-09 00:04 wb :) 2008-12-09 00:04 pressed the wrong button? 
2008-12-09 00:04 ACTION pushed wrong button :) 2008-12-09 00:04 ctrl-i 2008-12-09 00:04 i see 2008-12-09 00:04 /* Unhash the target: dput() will then get rid of it */ 2008-12-09 00:04 __d_drop(target); <- we need to do the clone here 2008-12-09 00:04 in d_move 2008-12-09 00:04 yes 2008-12-09 00:05 d_move _alway_ unhashes a replaced target 2008-12-09 00:05 that's a little simpler 2008-12-09 00:05 I think I wrote that last night 2008-12-09 00:06 it would be same with DCACHE_BACKED 2008-12-09 00:06 in d_delete 2008-12-09 00:06 if we don't care d_count == 1 2008-12-09 00:06 right, the cloned dentry is always going to be negative 2008-12-09 00:06 that means we can let the unhashed one take the inode with it 2008-12-09 00:07 this is pretty weird stuff ;) 2008-12-09 00:07 yes 2008-12-09 00:08 I don't think we have to override the standard d_move 2008-12-09 00:08 copy and modify? 2008-12-09 00:09 all the required processing can be done in our ->rename 2008-12-09 00:09 by the way, d_move is really clever and insane 2008-12-09 00:09 FWIW, difference is ->i_mutex 2008-12-09 00:10 what difference does it make to us? 
2008-12-09 00:11 I don't know 2008-12-09 00:11 just a difference of code 2008-12-09 00:11 for now, I think it doesn't matter 2008-12-09 00:17 btw, we want to update kernel version 2008-12-09 00:17 new one has __d_instantiate() I added :) 2008-12-09 00:18 we may want 2008-12-09 00:44 ah 2008-12-09 00:47 I had a d_attach_locked in the previous patch, a helper that can be used in lots of places 2008-12-09 00:48 yes 2008-12-09 00:48 static inline void d_attach_locked(struct dentry *dentry, struct inode *inode) 2008-12-09 00:48 { 2008-12-09 00:48 list_add(&dentry->d_alias, &inode->i_dentry); 2008-12-09 00:48 dentry->d_inode = inode; 2008-12-09 00:48 fsnotify_d_instantiate(dentry, inode); 2008-12-09 00:48 } 2008-12-09 00:49 recent version has simular one 2008-12-09 00:50 well we should port to a more recent kernel pretty soon 2008-12-09 00:50 at the same time as cloning from linus's tree and setting up a git repo 2008-12-09 00:50 probably, changes are minimal for us 2008-12-09 00:52 yes 2008-12-09 00:52 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-09 00:52 can you have time to review this temporary repo? 2008-12-09 00:52 right away 2008-12-09 00:53 I'm not sure it is very good or not 2008-12-09 00:54 I see I am improving my checking rate, mostly by fixing mistakes in previous checkins ;) 2008-12-09 00:54 points is symlink-support(wrong comment), ..., tux_new_inode-introduce 2008-12-09 00:55 oh another stupid mistake by me... 
PTR_ERR(old_de) 2008-12-09 00:55 :) fixes may be right 2008-12-09 00:56 symlink-support(wrong comment) ~ tux_new_inode-introduce - is new try 2008-12-09 00:57 symlink-support - it makes new_inode() like kernel 2008-12-09 00:57 so, restoring the traditional iget interface 2008-12-09 00:58 well this will fit better with deferred inode allocation 2008-12-09 00:58 make_inode-cleanup - new one takes inum as hint for inode allocation 2008-12-09 00:58 good 2008-12-09 00:58 and the block allocation functions should take the goal as a parameter, not form sb->nextalloc 2008-12-09 00:58 and I think we may need iget actually 2008-12-09 00:59 it lookup inode in cache 2008-12-09 01:00 maybe have a set_inum(inode, inum) wrapper? 2008-12-09 01:00 yes, sounds good 2008-12-09 01:00 and... we sould use i_ino 2008-12-09 01:01 yes, it may be removed entirely 2008-12-09 01:01 ok, from core? 2008-12-09 01:01 so ino because a property of the filesystem? 2008-12-09 01:01 or, no refererences to ->i_ino from vfs, as it should be 2008-12-09 01:01 inum replaced by i_ino, and 16TB check was added 2008-12-09 01:02 I think we should use i_ino 2008-12-09 01:02 wiht 16TB check 2008-12-09 01:02 on 32bit arch 2008-12-09 01:03 or check flag for max inode number 2008-12-09 01:03 yes 2008-12-09 01:03 so, I think we add limit, then remove ->inum 2008-12-09 01:04 ok, did you say i_inum might go away from core? 2008-12-09 01:04 well 2008-12-09 01:04 from vfs inode? 2008-12-09 01:04 no 2008-12-09 01:05 I think tux_inode(inode)->inum will be removed 2008-12-09 01:05 yes 2008-12-09 01:05 and we will use inode->i_ino always 2008-12-09 01:05 change the field name to i_ino in the user space inode struct 2008-12-09 01:05 yes 2008-12-09 01:05 yes 2008-12-09 01:06 unpresent-init - define the initial value of inode 2008-12-09 01:06 leave the variables as inum 2008-12-09 01:06 oh 2008-12-09 01:06 well maybe not, it's arbitrary 2008-12-09 01:06 who is use ->inum? 
2008-12-09 01:07 not the struct fields, just the temporary variables 2008-12-09 01:07 ah, i see 2008-12-09 01:07 yes 2008-12-09 01:07 ino is not clearly understandable, inum is, it's a marginal preference 2008-12-09 01:08 but using the kernel's field name is a good idea 2008-12-09 01:08 reminds me... deferred inum assignment means that fstat will have to wait on any inode that has not had inum assigned yet 2008-12-09 01:08 well, consistency may be good 2008-12-09 01:08 ino is shorter ;) 2008-12-09 01:09 :) 2008-12-09 01:09 "no" is actually an abbreviation for "number", but it's archaic 2008-12-09 01:09 yes, I think so 2008-12-09 01:10 it's also the opposed of "yes" and the first three letters of "inode" 2008-12-09 01:10 so it doesn't make for really readable code 2008-12-09 01:10 well, ino is word already for me 2008-12-09 01:10 ok, that's good enough for me 2008-12-09 01:11 global change, inum to ino then 2008-12-09 01:11 you or me? 2008-12-09 01:11 it would be later 2008-12-09 01:11 ok 2008-12-09 01:12 probably, it will be after adding limit 2008-12-09 01:12 ah, and you have already written _INO ;) 2008-12-09 01:12 :) 2008-12-09 01:13 I forgot it where come from 2008-12-09 01:14 ino? 2008-12-09 01:14 maybe I refer some code in tux3 2008-12-09 01:14 yes 2008-12-09 01:14 from linus 2008-12-09 01:14 let's see what bsd uses 2008-12-09 01:14 I remember I though about inum or ino 2008-12-09 01:15 I thought 2008-12-09 01:16 and somehow, I used ino :) 2008-12-09 01:17 VFS_VGET(MP, INO, FLAGS, VPP) 2008-12-09 01:18 bsd? 
2008-12-09 01:18 yes, freebsd 2008-12-09 01:18 likely where it came from then 2008-12-09 01:18 struct stat is ino_t st_ino 2008-12-09 01:19 anyway, the make_inode interface change looks right 2008-12-09 01:19 thanks 2008-12-09 01:19 well, so, unpresent-init - define the initial value of inode 2008-12-09 01:20 right, so I am thinking we should have a state bit what we can wait on with a bit wait 2008-12-09 01:20 for sys_fstat 2008-12-09 01:20 until the ino is assigned 2008-12-09 01:21 yes 2008-12-09 01:21 only inode, not dentry 2008-12-09 01:21 right 2008-12-09 01:22 it should be very rare it has to wait, and a race check on the bit is good enough 2008-12-09 01:22 once it is set, it will stay set 2008-12-09 01:22 so hardly any smp cost 2008-12-09 01:22 s/race/racy/ 2008-12-09 01:23 it is rare for fstat to follow create immediately 2008-12-09 01:24 probably 2008-12-09 01:24 if library didn't it 2008-12-09 01:25 a library that immediatley forgets what it just created? ;) 2008-12-09 01:25 :) 2008-12-09 01:25 well, it would be rare 2008-12-09 01:26 e.g. create() in caller, and pass fd to some library function 2008-12-09 01:27 so struct iattr goes away completely? 2008-12-09 01:27 I think it is needed to share code 2008-12-09 01:27 ok, and it's a handy way of initializing inodes 2008-12-09 01:27 fuse can't tell uid and gid 2008-12-09 01:28 you're talking about fstat or iattr? 2008-12-09 01:28 iattr 2008-12-09 01:28 right 2008-12-09 01:29 unpresent-init defined inital value 2008-12-09 01:29 -!- camby(~root@60.205.82.146) has joined #tux3 2008-12-09 01:29 so, next patch is tux_new_inode-introduce 2008-12-09 01:29 - if (make_inode(sb->rootdir, TUX_ROOTDIR_INO, &(struct tux_iattr){ .mode = S_IFDIR | 0755 })) 2008-12-09 01:29 + if (make_inode(sb->rootdir, TUX_ROOTDIR_INO)) <- where does i_mode for / get set now? 
2008-12-09 01:30 tux_new_inode() initialize inode corrently 2008-12-09 01:30 if inode is exsists, open_inode initialize inode correctly 2008-12-09 01:30 so, unpresent sets default 2008-12-09 01:31 ah, so make_inode does not have to 2008-12-09 01:31 tux_new_inode()/open_inode() overwrite it 2008-12-09 01:31 yes 2008-12-09 01:31 makes sense 2008-12-09 01:31 now, make_inode means allocate new inum 2008-12-09 01:31 yes 2008-12-09 01:32 we could call it choose_ino (inum?) 2008-12-09 01:32 well 2008-12-09 01:32 it also makes it 2008-12-09 01:32 by searching 2008-12-09 01:32 so make is fine for now 2008-12-09 01:32 I'm not sure this strategy is good or not in patches 2008-12-09 01:33 what is your doubt? 2008-12-09 01:33 it looks like an structural improvement 2008-12-09 01:34 I see mkdir got link support 2008-12-09 01:34 I'm just not sure than other patches in past 2008-12-09 01:34 it looks solid to me 2008-12-09 01:35 thanks 2008-12-09 01:35 and it's ok to change around structure a few times, I think that is a good development strategy 2008-12-09 01:35 if you never do that, then things rot 2008-12-09 01:35 yes 2008-12-09 01:36 well, this series has another purpose 2008-12-09 01:37 now, we make cached inode by tux_new_inode()/open_inode() 2008-12-09 01:37 it means we don't have to call make_inode() now 2008-12-09 01:38 yes, split that like in the ext2 deferred patch 2008-12-09 01:38 i.e. 
we don't have to allocate inum immediately 2008-12-09 01:38 right 2008-12-09 01:38 and, maybe we can 2008-12-09 01:39 store_inode() 2008-12-09 01:39 { 2008-12-09 01:39 if (inode->inum == TUX_INVALID_INO) 2008-12-09 01:39 make_inode() 2008-12-09 01:39 else 2008-12-09 01:39 save_inode() 2008-12-09 01:39 } 2008-12-09 01:39 I was thinking of something a little different 2008-12-09 01:40 oh, good 2008-12-09 01:40 another strategy 2008-12-09 01:40 so make_inode is supposed to happen only in delta staging 2008-12-09 01:40 which will happen sometime soon, no matter what happens 2008-12-09 01:41 it means store_inode is called in delta staging? 2008-12-09 01:41 yes 2008-12-09 01:41 all changes to inode table blocks in staging 2008-12-09 01:41 and also, all changes to dirent blocks 2008-12-09 01:41 yes 2008-12-09 01:42 in userlance, dirent is need big change 2008-12-09 01:42 in userland, dirent is needed big change 2008-12-09 01:42 what sort of change? 2008-12-09 01:42 userland doesn't have dentry 2008-12-09 01:42 oh right 2008-12-09 01:43 so I was thinking we might do this only in kernel 2008-12-09 01:43 yes 2008-12-09 01:43 just to avoid spending time emulating dentry cache 2008-12-09 01:43 so, my this series may not be good enough 2008-12-09 01:44 it looks like progress, just tell me when you're happy 2008-12-09 01:44 there are a lot of little fixes in it 2008-12-09 01:44 and the structural changes are progress I think 2008-12-09 01:45 so I would say, put it in and make further improvements from there 2008-12-09 01:45 ok 2008-12-09 01:45 I'll test this more, and cleanup 2008-12-09 01:45 then I'll push those 2008-12-09 01:46 meanwhile I guess I am maybe two days from having a pretty complete demonstration of deferred nameops in ext2 2008-12-09 01:46 good 2008-12-09 01:46 your suggestion made it sane 2008-12-09 01:47 thanks :) 2008-12-09 01:48 if we can use dentry cache like writeback cache, it would be good 2008-12-09 01:49 yes 2008-12-09 01:49 ramfs proves it can do that 
2008-12-09 01:49 but the devil is in the details 2008-12-09 01:50 yes 2008-12-09 01:53 I think the only racy case we need to worry about in ->d_hide() is when d_count goes to 1 because the dentry was closed when we dropped the spinlock 2008-12-09 01:55 yes 2008-12-09 01:55 ah 2008-12-09 01:55 and 2008-12-09 01:56 maybe, we need dentry->d_lock 2008-12-09 01:56 to read the count? 2008-12-09 01:56 because __d_lookup() is using rcu 2008-12-09 01:56 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-12-09 01:56 folks 2008-12-09 01:57 that lock make sure it is not unhashed 2008-12-09 01:58 we also need to make sure it doesn't disappear completely when we drop the locks 2008-12-09 01:59 we have refcount for it? 2008-12-09 01:59 not yet 2008-12-09 01:59 -!- samlh(~sam@67.129.121.145) has joined #tux3 2008-12-09 01:59 I think we need it, and I was thinking of using atomic_dec_and_lock to retake the dcache_lock 2008-12-09 02:00 well, if we take a refcount, atomic_dec_and_lock will not hit zero, so that is not quite right 2008-12-09 02:00 we have refcount for dentry for unlink? 2008-12-09 02:01 I guess somebody does 2008-12-09 02:01 ok, the dentry won't disappear 2008-12-09 02:01 but thecount may go to one 2008-12-09 02:01 but, it may be ==1 or > 1 2008-12-09 02:01 if the count is >1, then d_delete will unhash it, which is what we want 2008-12-09 02:02 yes 2008-12-09 02:02 if it is = 1... then d_delete will dentry_iput it, which I think is also right 2008-12-09 02:02 however, we should add new one atomicity 2008-12-09 02:02 yes 2008-12-09 02:05 or, if we drop locks, we just think dentry is > 1 2008-12-09 02:05 well, I am not sure what the race is, but it seems likely there is one 2008-12-09 02:06 ah 2008-12-09 02:06 it is complex - need source to see 2008-12-09 02:06 the race is, if d_delete does dentry_iput, it will not unhash it, that is bad 2008-12-09 02:06 um... which case? 
2008-12-09 02:07 in the case that the count of the old dentry dropped to 1 when we dropped the locks 2008-12-09 02:07 so the dentry is not in use any more 2008-12-09 02:07 is still hashed 2008-12-09 02:08 and if we let d_delete dentry_iput it, then we have two negative dentries hashed for the same name 2008-12-09 02:08 ah, yes 2008-12-09 02:08 fun to think about, I will sleep on it ;) 2008-12-09 02:09 the patch is posted 2008-12-09 02:09 ok :) 2008-12-09 02:09 yes 2008-12-09 02:09 oyasumi :) 2008-12-09 02:09 oyasumi :) 2008-12-09 02:14 well, I thought about it ;) 2008-12-09 02:14 when we retake the locks and find the old dentry has count = 1, we know that unlink is the only holder 2008-12-09 02:15 so we can drop the locks and dput our new dentry, the old dentry won't change because nobody else has a reference 2008-12-09 02:15 yes, and dput is after unlock 2008-12-09 02:16 because dput takes lock 2008-12-09 02:16 then we retake the locks and return, um, 1 I think 2008-12-09 02:17 we need to change d_delete 2008-12-09 02:17 we do? 2008-12-09 02:18 to make it more elegant maybe 2008-12-09 02:18 but the current one will work 2008-12-09 02:18 current as patched by me 2008-12-09 02:18 dput(new dentry) can't? 
2008-12-09 02:18 that's fine if we drop the locks first 2008-12-09 02:19 allocate new -> retack lock -> check d_count == 1 -> dput(new) 2008-12-09 02:19 last dput(new) can't, because we have retaked the lock 2008-12-09 02:19 allocate new -> retack lock -> check d_count == 1 -> drop locks -> dput(new) 2008-12-09 02:20 allocate new -> retack lock -> check d_count == 1 -> drop locks -> dput(new) -> retake locks -> return to d_delete 2008-12-09 02:20 if we drop locks, dentry can be > 1 2008-12-09 02:20 ah, because it's still hashed 2008-12-09 02:20 yes 2008-12-09 02:20 but not if we __d_drop it first 2008-12-09 02:21 if so, it is negative dentry anymore 2008-12-09 02:21 it is not negative 2008-12-09 02:21 we want hashed negative dentry in that point 2008-12-09 02:21 but we have a negative dentry at that point, so we are protected while we futz with the old one 2008-12-09 02:21 our new one is hashed then 2008-12-09 02:22 if we use new one, we don't need - retake lock -> check d_count == 1 2008-12-09 02:22 check is not needed 2008-12-09 02:23 so, we use new one alway if we retake locks 2008-12-09 02:23 it sounds work 2008-12-09 02:24 and who does the dentry_iput of the old dentry? 2008-12-09 02:24 if d_count == 1, d_delete() does immidiately 2008-12-09 02:24 if d_count > 1, last dput() does 2008-12-09 02:25 and turns it into a negative dentry, but we have to get rid of our new one 2008-12-09 02:25 no, 2008-12-09 02:25 old one already unhashed 2008-12-09 02:25 so, I think it makes unhashed negative dentry 2008-12-09 02:26 allocate new -> retake lock -> check d_count == 1 -> unhash old -> add new 2008-12-09 02:27 if it dentry_iputs it, it leaves it as a hashed negative dentry 2008-12-09 02:27 we checks d_count == 1, and we does unhash 2008-12-09 02:27 ah, that is ok 2008-12-09 02:27 maybe 2008-12-09 02:28 don't we have to unhash and dput in that case? 
2008-12-09 02:28 because the 1 is the dcache reference 2008-12-09 02:29 I think d_delete() does, dentry_iput() and caller of d_delete() does dput()? 2008-12-09 02:30 d_delete() - dentry_iput(unhashed old) 2008-12-09 02:30 that's right 2008-12-09 02:30 ok, that might do it 2008-12-09 02:30 yes, maybe 2008-12-09 02:32 so we unhash old and return 2 from ->hide, which therefore ignores it ;) 2008-12-09 02:32 in fact, we don't have to unhash, we just return 2 and d_delete unhashes it 2008-12-09 02:33 ah, right 2008-12-09 02:33 this is not a pretty interface, but it works ;) 2008-12-09 02:33 it sounds good :) 2008-12-09 02:34 also unhashes if we return 1 and count = 1 2008-12-09 02:35 so we: return atomic_read(..) == 1; 2008-12-09 02:35 now that is obscure 2008-12-09 02:35 um.. 2008-12-09 02:36 if we retaked locks, return 2 always? 2008-12-09 02:36 sounds risky 2008-12-09 02:37 if old dentry count == 3 on return to d_delete, then we end up with two hashed names 2008-12-09 02:38 that is, if we are being lazy and letting d_delete unhash for us 2008-12-09 02:38 -> retaked locks -> allocate new and hashed -> d_delete drop old -> unlocks 2008-12-09 02:39 yes, that is what we want 2008-12-09 02:39 so we just need to make sure d_delete always unhashes old 2008-12-09 02:39 yes, so return 2 is enough? 2008-12-09 02:40 that depends on what the count of old is at that time 2008-12-09 02:40 so we have to check it, because it could be anything > 1 2008-12-09 02:40 >= 1 2008-12-09 02:41 ah 2008-12-09 02:41 if (!d_unhashed(dentry) && !hidden) 2008-12-09 02:41 right 2008-12-09 02:41 I said the interface was not pretty ;) 2008-12-09 02:41 we don't need && !hidden anymore? 
2008-12-09 02:42 maybe not 2008-12-09 02:42 maybe for other cases 2008-12-09 02:42 we allocate new instead of hidden 2008-12-09 02:42 I think we don't need it 2008-12-09 02:43 ok 2008-12-09 02:43 looks like we have a solution 2008-12-09 02:44 yes 2008-12-09 02:45 return atomic_read(&dentry->d_count) == 1; // from ->hide 2008-12-09 02:45 that makes d_delete unhash it 2008-12-09 02:46 ugly ;) 2008-12-09 02:46 somebody else can make it pretty 2008-12-09 02:47 I think HIDDEN flag is not needed anymore 2008-12-09 02:47 that would be nice 2008-12-09 02:48 because we add new negative dentry in that case 2008-12-09 02:49 well, it would be later 2008-12-09 02:49 yes, I'm just checking ext2_flush_dir, and it does not use the inode of a negative dentry 2008-12-09 02:50 well, anyway, I think you are oyasumi time :) 2008-12-09 02:50 right 2008-12-09 02:50 oyasumi again 2008-12-09 02:50 oyasumi for healthy :) 2008-12-09 02:50 :) 2008-12-09 03:20 flipzzz: there? 2008-12-09 03:25 caught me ;) 2008-12-09 03:26 hmm 2008-12-09 03:26 build errors :) 2008-12-09 03:26 can u check? http://bitbucket.org/pranith/pranith 2008-12-09 03:30 ok 2008-12-09 03:30 i fixed it, pull if ok :) 2008-12-09 03:30 why did you need to restore int err? 2008-12-09 03:31 let me compile this 2008-12-09 03:31 return err at end 2008-12-09 03:31 either u delete that return 2008-12-09 03:31 or declare int err; 2008-12-09 03:31 err is not there in that scope 2008-12-09 03:32 bleah, I never compiled namei.c because I forgot that user space does not use it 2008-12-09 03:32 ok, pulling 2008-12-09 03:32 :) 2008-12-09 03:32 ok, thanks 2008-12-09 03:35 your new hg repo works fine 2008-12-09 03:36 :) 2008-12-09 03:37 still learning hg 2008-12-09 03:37 donno where to put the author name.. 2008-12-09 03:37 it took my user name from the laptop.. 
2008-12-09 03:38 looks good to me 2008-12-09 03:38 you can always check in with --user "whatever you want" 2008-12-09 03:38 that is, commit with 2008-12-09 03:38 you can probably also configure the default user 2008-12-09 03:38 hg commit --user "Pranith Kumar"? 2008-12-09 03:38 yes 2008-12-09 03:38 ok, will check it out 2008-12-09 03:39 u better get some sleep :D 2008-12-09 03:40 yes 2008-12-09 03:41 http://linux.die.net/man/5/hgrc 2008-12-09 03:41 ok, looking it up.. thanks 2008-12-09 04:35 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-12-09 05:17 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-12-09 05:59 -!- inverse(~michael@h80-net10.simres.netcampus.ca) has joined #tux3 2008-12-09 07:53 -!- nemysis(~nemysis@tor-irc.dnsbl.oftc.net) has joined #tux3 2008-12-09 08:05 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-09 09:15 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-09 10:01 -!- nemysis(~nemysis@tor-irc.dnsbl.oftc.net) has joined #tux3 2008-12-09 10:05 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-12-09 10:25 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-12-09 11:00 -!- nemysis(~nemysis@tor-irc.dnsbl.oftc.net) has joined #tux3 2008-12-09 11:21 -!- nemysis(~nemysis@tor-irc.dnsbl.oftc.net) has joined #tux3 2008-12-09 12:01 -!- nemysis(~nemysis@91-64-188-120-dynip.superkabel.de) has joined #tux3 2008-12-09 12:02 -!- pgquiles(~pgquiles@26.Red-79-144-194.staticIP.rima-tde.net) has joined #tux3 2008-12-09 12:11 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-09 14:53 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-09 16:30 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-09 16:54 -!- camby(~root@60.205.82.146) has joined #tux3 2008-12-09 17:18 -!- camby_(~root@60.205.81.133) has joined #tux3 
2008-12-09 18:04 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-12-09 19:31 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-09 20:19 -!- flips(~daniel@phunq.net) has joined #tux3 2008-12-09 20:19 whoops, out of the "office" at the moment 2008-12-09 20:20 Tux3 U is postponed till Thurs 2008-12-09 20:29 ...coming to you live from a screening of upcoming animation "Bolt" in deep Burbank 2008-12-09 22:22 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-09 23:49 -!- samlh(~sam@67.129.121.145) has joined #tux3 2008-12-10 01:00 it's pretty quiet here today :) 2008-12-10 01:17 yeah, hopefully development isn't quiet 2008-12-10 03:48 -!- pgquiles(~pgquiles@26.Red-79-144-194.staticIP.rima-tde.net) has joined #tux3 2008-12-10 04:04 flipzzz: mknod patch sent :) 2008-12-10 07:29 -!- pranith(~bobby@122.162.73.189) has joined #tux3 2008-12-10 07:29 hey all 2008-12-10 07:32 -!- pranith(~bobby@122.162.73.189) has joined #tux3 2008-12-10 07:52 -!- mingming(~mingming@c-71-193-163-244.hsd1.or.comcast.net) has joined #tux3 2008-12-10 07:54 -!- pranith(~bobby@122.162.73.189) has joined #tux3 2008-12-10 08:03 -!- pranith(~bobby@122.162.73.189) has joined #tux3 2008-12-10 08:14 -!- pranith(~bobby@122.162.73.189) has joined #tux3 2008-12-10 08:19 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-10 09:12 -!- pranith(~bobby@122.162.73.189) has joined #tux3 2008-12-10 09:17 pranith, hey 2008-12-10 09:17 -!- pranith(~bobby@122.162.73.189) has joined #tux3 2008-12-10 09:21 -!- pranith(~bobby@122.162.73.189) has joined #tux3 2008-12-10 09:25 pranith, there? 2008-12-10 09:26 hey flips 2008-12-10 09:26 here 2008-12-10 09:26 what issue with your mknod patch?
2008-12-10 09:26 seg fault :) 2008-12-10 09:26 when i do mknod -m 660 ./tty c 4 64 2008-12-10 09:27 seg faulting in tux_create_entry 2008-12-10 09:27 i tried in gdb 2008-12-10 09:27 at this point it's ok to apply first, then fix the seg fault 2008-12-10 09:28 I'll try it 2008-12-10 09:28 -!- pranith(~bobby@122.162.73.189) has joined #tux3 2008-12-10 09:28 let me see if i can fix it... 2008-12-10 09:28 ill get back to you 2008-12-10 09:28 ok 2008-12-10 09:38 open_inode: found inode 0xa 2008-12-10 09:38 new_xcache: realloc xcache to 4 2008-12-10 09:38 mode 0000000 uid 0 gid 0 root a:1 ctime 0 size 0 2008-12-10 09:38 Program received signal SIGSEGV, Segmentation fault. 2008-12-10 09:38 0x08101613 in tux3_lookup (dir=0x9893c64, dentry=0x988f2c8, nd=0x9ddbe74) at fs/tux3/namei.c:11 2008-12-10 09:38 11 inode = tux3_iget(dir->i_sb, from_be_u32(entry->inum)); 2008-12-10 09:38 (gdb) bt 2008-12-10 09:38 #0 0x08101613 in tux3_lookup (dir=0x9893c64, dentry=0x988f2c8, nd=0x9ddbe74) at fs/tux3/namei.c:11 2008-12-10 09:38 #1 0x080b0ec9 in __lookup_hash (name=0x9ddbe7c, base=, nd=0x9ddbe74) at fs/namei.c:1339
2008-12-10 09:39 #2 0x080b0f22 in lookup_hash (nd=0x9ddbe74) at fs/namei.c:1361 2008-12-10 09:39 #3 0x080b3a29 in do_filp_open (dfd=-100, pathname=0x9df2000 "/mnt/foo", open_flag=32833, mode=438) at fs/namei.c:1815 2008-12-10 09:39 ACTION has to fix that unneeded xcache alloc 2008-12-10 09:40 pranith, it was me who broke it 2008-12-10 09:42 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-10 09:44 tux_find_entry() bugs? 2008-12-10 09:45 yes, entry return is now ERR_PTR 2008-12-10 09:45 have to check for ENOENT, which is allowed 2008-12-10 09:45 ah, I know and I fixed those 2008-12-10 09:46 got a changeset already? 2008-12-10 09:46 the patches still in queue 2008-12-10 09:47 let me know when you're ready 2008-12-10 09:47 basically, yesterday's temporary repo 2008-12-10 09:48 in the case of find_entry, I wonder if we should return NULL instead of ENOENT? 2008-12-10 09:48 it's not really an error in that case 2008-12-10 09:49 -ENOENT is fine 2008-12-10 09:49 consistent anyway 2008-12-10 09:49 caller should check it 2008-12-10 09:50 http://userweb.kernel.org/~hirofumi/tux_find_entry-fix.patch 2008-12-10 09:51 it is patch in my queue 2008-12-10 09:52 -!- pranith(~bobby@122.162.73.189) has joined #tux3 2008-12-10 09:53 should I apply now?
2008-12-10 09:53 if possible, please don't 2008-12-10 09:53 ok 2008-12-10 09:54 adding the ERR_PTR was a good cleanup, even though it broke things 2008-12-10 09:54 ok, I'll flush my queue tomorrow 2008-12-10 09:54 yes, it is right 2008-12-10 09:54 got a moment to talk about ino? 2008-12-10 09:55 we should return right error to userland 2008-12-10 09:55 ok 2008-12-10 09:57 I should talk about ino? 2008-12-10 10:00 most users of ino look pretty safe 2008-12-10 10:00 /proc/locks... nobody should use the ino from there 2008-12-10 10:00 smaps... that should be just informative 2008-12-10 10:00 pipefs... I think it only has names because we did not have the concept of anon inodes then? 2008-12-10 10:00 sorry, I was writing in the wrong channel ;) 2008-12-10 10:00 :) 2008-12-10 10:01 probably, yes 2008-12-10 10:01 fs/nfsd/export.c <- we should do wait-on-bit for nfs export fns 2008-12-10 10:02 CONFIG_AUDIT may have to learn to deal with 0 ino 2008-12-10 10:02 it's good to hear selinux does not use ino, I was wondering about that 2008-12-10 10:02 struct export_operations, yes 2008-12-10 10:02 ->get_name, we have to wait 2008-12-10 10:03 I'm not sure about selinux plugin modules 2008-12-10 10:03 -!- pranith(~bobby@122.162.73.189) has joined #tux3 2008-12-10 10:03 and selinux may uses it with abnormal way 2008-12-10 10:03 I think we can just inform them, they have to handle 0 ino 2008-12-10 10:04 for readdir we will wait on phase transition 2008-12-10 10:04 getdents 2008-12-10 10:05 probably, yes 2008-12-10 10:05 we can have a bit in the directory inode for that 2008-12-10 10:05 inode and dentry 2008-12-10 10:05 it will be a fine grained wait, rarely taken, the only cost is checking the bit 2008-12-10 10:05 and dentry? 
2008-12-10 10:06 ah, about ino 2008-12-10 10:06 dentry is for name 2008-12-10 10:06 yes 2008-12-10 10:06 oh, another cost is wakeups 2008-12-10 10:06 releasing the bit lock 2008-12-10 10:07 yes 2008-12-10 10:08 and ->getattr() too 2008-12-10 10:08 -!- pranith(~bobby@122.162.73.189) has joined #tux3 2008-12-10 10:13 getattr can wait on inode->dentry->d_parent->d_inode->d_flags 2008-12-10 10:13 maybe 2008-12-10 10:13 instead of having a bit in each inode 2008-12-10 10:14 use the bit in the parent directory 2008-12-10 10:14 -!- pranith(~bobby@122.162.73.189) has joined #tux3 2008-12-10 10:14 to have fewer wakeups 2008-12-10 10:14 parent? 2008-12-10 10:14 flips, there? 2008-12-10 10:14 ok, i see u are here :) 2008-12-10 10:14 pranith, and the segfault was my fault, not yours 2008-12-10 10:14 hirofumi already fixed it 2008-12-10 10:14 tux3 seg faulting much before we reach tux3_mknod 2008-12-10 10:14 :D 2008-12-10 10:14 was just seeing that 2008-12-10 10:15 mknod? 2008-12-10 10:15 ok 2008-12-10 10:15 yes 2008-12-10 10:15 yup, what was the fix? 2008-12-10 10:15 I have the patch for it 2008-12-10 10:16 it have to change core more or less 2008-12-10 10:16 hirofumi, d_parent... use the parent of the dentry, which is a directory, for the wait instead of having a "ino is invalid" bit in each inode 2008-12-10 10:16 and we have to add RDEV_BIT or something 2008-12-10 10:16 RDEV_BIT? 2008-12-10 10:17 for special file, it has major and minor 2008-12-10 10:17 device file 2008-12-10 10:18 if ino is invalid, tux_inode(inode)->inum is invalid? 
2008-12-10 10:19 yes 2008-12-10 10:19 and we were going to get rid of tux_inode(inode)->inum 2008-12-10 10:20 ah, I think we use ->inum 2008-12-10 10:20 -!- pgquiles(~pgquiles@26.Red-79-144-194.staticIP.rima-tde.net) has joined #tux3 2008-12-10 10:20 -!- pranith(~bobby@122.162.73.189) has joined #tux3 2008-12-10 10:21 in the check of ->inum, I think we can use 64bit inode number in 32bit arch 2008-12-10 10:21 the only valid use of ino for us should be a real lookup 2008-12-10 10:21 with stat64 and getdents64, ino is using "unsigned long long" 2008-12-10 10:22 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-12-10 10:22 http://userweb.kernel.org/~hirofumi/inum-fixes.patch 2008-12-10 10:23 please see this patch 2008-12-10 10:24 hirofumi, this fixes mknod crash? 2008-12-10 10:24 not that one, pranith 2008-12-10 10:24 hmm, ok 2008-12-10 10:25 hirofumi, but maybe we should get rid of ->inum completely and use ->i_ino? 2008-12-10 10:26 after ->i_ino check, now I think we will use ->inum 2008-12-10 10:26 ->inum (64bit) keeps real inode number, ->i_ino (may 32bit) will be used as just hash value 2008-12-10 10:26 I see 2008-12-10 10:27 that inum check was also for this 2008-12-10 10:28 ah, so 64 bit inum will work on 32 bit arch 2008-12-10 10:28 as you said 2008-12-10 10:28 yes 2008-12-10 10:28 that is cool 2008-12-10 10:28 issue is same with deferred ino 2008-12-10 10:28 is it? 2008-12-10 10:29 in that case, we cannot make a valid has of the ino until we have assigned it 2008-12-10 10:29 for the 64bit inum, we can add ino limit for it 2008-12-10 10:29 yes 2008-12-10 10:30 64bit inum also can't set real inum to ->i_ino 2008-12-10 10:30 so, audit can't access real inum at same places 2008-12-10 10:31 the detail is different though 2008-12-10 10:32 busy for a moment 2008-12-10 10:32 ok 2008-12-10 10:32 I'll sleep 2008-12-10 10:36 hirofumi, find_entry fix? 
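A rough sketch of the ->inum versus ->i_ino split described above, assuming the approach is the usual one for 64-bit inode numbers on a 32-bit arch: the 32-bit value is only a hash, and the cache lookup compares the full 64-bit number. In the kernel this is the pattern served by iget5_locked() with its test and set callbacks; everything below (`struct inode`, `inum_hash`, `iget`, the bucket count) is a hypothetical userspace model, not the inum-fixes patch.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t inum_t;

struct inode {
	uint32_t i_ino;      /* may be truncated: used only as a hash value */
	inum_t inum;         /* the real 64-bit inode number */
	struct inode *next;  /* hash chain */
};

#define HASH_BUCKETS 16
static struct inode *buckets[HASH_BUCKETS];

/* Fold 64 bits into 32; distinct inums can collide here. */
static uint32_t inum_hash(inum_t inum)
{
	return (uint32_t)(inum ^ (inum >> 32));
}

/* Find or create the cached inode for inum. */
static struct inode *iget(inum_t inum)
{
	uint32_t hash = inum_hash(inum);
	struct inode **head = &buckets[hash % HASH_BUCKETS];
	struct inode *inode;

	/* "test" step: the hash only picks the chain, identity is
	 * decided by the full 64-bit inum. */
	for (inode = *head; inode; inode = inode->next)
		if (inode->inum == inum)
			return inode;

	inode = calloc(1, sizeof *inode);
	if (!inode)
		return NULL;
	/* "set" step: record the real number, expose only the hash. */
	inode->inum = inum;
	inode->i_ino = hash;
	inode->next = *head;
	*head = inode;
	return inode;
}
```

Two inums such as 0x100000001 and 0x200000002 fold to the same 32-bit value here, yet they still get separate cached inodes, which is why anything that reads only i_ino (audit, for example) cannot rely on it being the real inode number.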
2008-12-10 10:36 maybe 2008-12-10 10:36 -!- inverse(~michael@h80-net10.simres.netcampus.ca) has joined #tux3 2008-12-10 10:36 can u send it to me.. am eager to try mknod :) 2008-12-10 10:37 http://userweb.kernel.org/~hirofumi/tux_find_entry-fix.patch 2008-12-10 10:37 btw, I have the patch for mknod 2008-12-10 10:38 to support it, we have to change core stuff 2008-12-10 10:38 -!- inverse(~michael@h80-net10.simres.netcampus.ca) has joined #tux3 2008-12-10 10:41 ok.. 2008-12-10 10:41 so my mknod is not good 2008-12-10 10:43 -!- pranith(~bobby@122.162.73.189) has joined #tux3 2008-12-10 10:44 ok, your patch didn't save rdev 2008-12-10 10:44 hmm 2008-12-10 10:45 ok... 2008-12-10 10:45 and trivial thing though, error path uses drop_nlink 2008-12-10 10:45 we should remove inode 2008-12-10 10:46 ok.. since u already have the patch.. ill not bother with this 2008-12-10 10:46 :) 2008-12-10 10:47 sorry :) 2008-12-10 10:47 but im interested to see ur patch for mknod.. do let me know :) 2008-12-10 10:47 hey, no problem.. u beat me to it.. im still learning :) 2008-12-10 10:47 btw, the patch wrap the line at 80 columns? 2008-12-10 10:47 +static int tux3_mknod(struct inode *dir, struct dentry *dentry, int 2008-12-10 10:47 mode, dev_t rdev) 2008-12-10 10:47 +{ 2008-12-10 10:47 i think so 2008-12-10 10:47 :( 2008-12-10 10:48 http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-10 10:48 it is a bit old though 2008-12-10 10:48 i did a hg diff 2008-12-10 10:49 mailer may corrupt it 2008-12-10 10:49 that hg repo has mknod support, iirc 2008-12-10 10:49 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-10 10:50 if you want to see 2008-12-10 10:50 ok, going through 2008-12-10 10:50 it still not save rdev though 2008-12-10 10:50 preparation was done 2008-12-10 10:50 hmm 2008-12-10 10:50 yeah 2008-12-10 10:51 to save rdev, we have to decide the size of it 2008-12-10 10:51 tux_create_inode takes rdev? 
2008-12-10 10:51 yes, and initialize inode depend on i_mode 2008-12-10 10:52 then make_inode() saves inode attributes 2008-12-10 10:52 hmm 2008-12-10 10:52 at this point, encode_attrs() will save rdev 2008-12-10 10:52 u seem to have rewrote most the functions 2008-12-10 10:53 some interfaces change 2008-12-10 10:53 hmm 2008-12-10 10:53 mknod() seems to be used everywhere :) 2008-12-10 10:53 the comment may mention about the reason of it 2008-12-10 10:54 yes 2008-12-10 10:54 mknod() takes all parameters which is needed 2008-12-10 10:54 ok, this is nice... 2008-12-10 10:55 ok 2008-12-10 10:55 -!- inverse(~michael@h80-net10.simres.netcampus.ca) has joined #tux3 2008-12-10 10:55 ok hirofumi, a great job. hope to hack like that some day 2008-12-10 10:55 :) 2008-12-10 10:55 :) 2008-12-10 10:56 well, now I have 10 years experience or more 2008-12-10 10:56 :) 2008-12-10 10:56 i have 6 months almost now :D 2008-12-10 10:57 but not entirely on kernel... 2008-12-10 10:57 ok, sleep time here in india too :) 2008-12-10 10:57 oyasumi 2008-12-10 10:58 oyasumi 2008-12-10 11:01 back 2008-12-10 11:02 ok 2008-12-10 11:02 I will write another Tux3 Report now I think 2008-12-10 11:03 "Tux3 by Christmas?" 2008-12-10 11:03 oh 2008-12-10 11:03 will not promise it 2008-12-10 11:03 but just state that the goal is to implement atomic commit by christmas 2008-12-10 11:03 and not promise any stability ;) 2008-12-10 11:04 ooohh, atomic commit 2008-12-10 11:04 that will be the interesting bit 2008-12-10 11:04 yes 2008-12-10 11:05 I guessed it is ext2 like fs completion 2008-12-10 11:05 :) 2008-12-10 11:06 so we will try for ext3-like instead 2008-12-10 11:07 ok, big challenge 2008-12-10 11:07 but, it would be fun 2008-12-10 11:24 yes 2008-12-10 11:26 hirofumi, when will you be ready for a pull, for me to get the find_entry fixes? 
2008-12-10 11:26 yes 2008-12-10 11:26 those will be including that fix 2008-12-10 11:28 or I'll send the two patches for fix right now 2008-12-10 11:29 good idea 2008-12-10 11:29 wait a bit 2008-12-10 11:32 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-10 11:33 although, I'm not fully reviewing those yet 2008-12-10 11:33 it should fix the bugs 2008-12-10 11:34 reading 2008-12-10 11:40 it fixes the segfault, as expected 2008-12-10 11:41 ok 2008-12-10 11:41 I will use this checkin to post a patch to lkml today, "purely for developers" 2008-12-10 11:41 there is a lot of functionality working 2008-12-10 11:42 ok 2008-12-10 11:42 and it is nice that a number of new contributers added some patches lately 2008-12-10 11:42 well, next series will add ->inum support and mknod 2008-12-10 11:43 yes 2008-12-10 11:43 I will finish up my ext2 deferred name ops work in two days I think 2008-12-10 11:43 oh, good 2008-12-10 11:43 just do the deferred inode create, and rename 2008-12-10 11:43 then spend the rest of the time till christmas working on atomic commit 2008-12-10 11:44 btw, maybe, d_rehash() in d_hide has race 2008-12-10 11:44 ACTION looks 2008-12-10 11:44 you may know already though 2008-12-10 11:45 not that one 2008-12-10 11:47 maybe the rehash should be before instantiate? 2008-12-10 11:47 it adds negative with d_rehash() 2008-12-10 11:48 right 2008-12-10 11:48 but, we are not removing old dentry yet 2008-12-10 11:48 I think it should be atomicity 2008-12-10 11:48 I did notice that, but did not think it was a problem 2008-12-10 11:49 i.e. we should take the both of ->d_lock 2008-12-10 11:49 I forgot the detail of race, um... 2008-12-10 11:50 ah 2008-12-10 11:51 maybe, if we retake the locks, then check the d_count 2008-12-10 11:51 and if we drop the new negative, it has race 2008-12-10 11:52 drop the new negative in the flush? 
2008-12-10 11:52 because d_delete will not drop it 2008-12-10 11:53 yes, now we may not have the problem 2008-12-10 11:54 anyway, if we do see a problem I think the worst is, we might have to export _d_rehash to do it atomically 2008-12-10 11:54 but I don't see a problem right now 2008-12-10 11:54 maybe viro will ;) 2008-12-10 11:55 yes 2008-12-10 11:55 well, my point was cached_lookup can take temporary state 2008-12-10 11:55 in d_delete() if we did unlocks 2008-12-10 11:56 what state change is dangerous for us? 2008-12-10 11:57 temporary state meant, it can grab the negative or old 2008-12-10 11:57 new negative 2008-12-10 11:58 the both of dentries appear in hash list 2008-12-10 11:58 without lock 2008-12-10 11:58 yes, it seems risky 2008-12-10 11:59 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2008-12-10 11:59 -!- pgquiles(~pgquiles@26.Red-79-144-194.staticIP.rima-tde.net) has joined #tux3 2008-12-10 11:59 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-12-10 11:59 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-12-10 11:59 -!- ollebull(~olle@ip6-43.bon.riksnet.se) has joined #tux3 2008-12-10 11:59 -!- Man_of_Wax(~wax@gualtiero.cs.unibo.it) has joined #tux3 2008-12-10 11:59 in that point, I felt I found the race, however I forgot... 2008-12-10 11:59 ok, just post it if you remember 2008-12-10 12:00 -!- Man_of_Wax(~wax@gualtiero.cs.unibo.it) has joined #tux3 2008-12-10 12:00 -!- ollebull(~olle@ip6-43.bon.riksnet.se) has joined #tux3 2008-12-10 12:00 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-12-10 12:00 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 
2008-12-10 12:01 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2008-12-10 12:01 -!- pgquiles(~pgquiles@26.Red-79-144-194.staticIP.rima-tde.net) has joined #tux3 2008-12-10 12:01 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-12-10 12:01 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-12-10 12:01 -!- ollebull(~olle@ip6-43.bon.riksnet.se) has joined #tux3 2008-12-10 12:01 -!- Man_of_Wax(~wax@gualtiero.cs.unibo.it) has joined #tux3 2008-12-10 12:02 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2008-12-10 12:02 -!- pgquiles(~pgquiles@26.Red-79-144-194.staticIP.rima-tde.net) has joined #tux3 2008-12-10 12:02 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-12-10 12:02 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-12-10 12:02 -!- ollebull(~olle@ip6-43.bon.riksnet.se) has joined #tux3 2008-12-10 12:02 -!- Man_of_Wax(~wax@gualtiero.cs.unibo.it) has joined #tux3 2008-12-10 12:02 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-12-10 12:02 hirofumi, there? 
2008-12-10 12:02 irc server seems to be having a problem 2008-12-10 12:02 yes 2008-12-10 12:03 reconnected 2008-12-10 12:03 well it is a race, but so is d_delete 2008-12-10 12:03 the dentry will appear positive, then appear negative at some random time 2008-12-10 12:04 yes, it may be possible 2008-12-10 12:04 so, when both dentries are hashed, it is random which one the user will see 2008-12-10 12:04 but if they see the positive, then it was positive before, and will turn negative when d_delete unhashes the old 2008-12-10 12:05 if they see the negative, then it will stay negative 2008-12-10 12:05 so I don't think it hurts, if you find otherwise then please give the details 2008-12-10 12:06 yes, I try to recall it 2008-12-10 12:07 I think that if Al does not hate the whole idea, then he will probably suggest a better approach anyway 2008-12-10 12:08 btw, now, in that patch, if we retake the locks, d_delete() will d_drop the old dentry always? 2008-12-10 12:08 always 2008-12-10 12:08 that needs a comment 2008-12-10 12:08 ok, it's key point of race I think 2008-12-10 12:09 probably, the ->hide interface is not the best one, however it is good enough to demonstrate the technique 2008-12-10 12:09 yes 2008-12-10 12:10 I think, maybe vfs should do those 2008-12-10 12:11 yes, if latency results are good then it is an argument for making it general 2008-12-10 12:11 yes 2008-12-10 12:12 getting late in the land of the rising sun? 2008-12-10 12:12 ah 2008-12-10 12:12 no 2008-12-10 12:12 eh 2008-12-10 12:13 in japan? 2008-12-10 12:13 yes 2008-12-10 12:13 very late ;) 2008-12-10 12:13 yes :) 2008-12-10 12:14 not sure yet, however... 2008-12-10 12:15 if user which is opening the old dentry try to reopen it... 2008-12-10 12:16 I thought that would be ok, that is already possible 2008-12-10 12:16 it may see strange 2008-12-10 12:16 um... 
2008-12-10 12:16 d_delete has not changed anything by the time ->hide is called 2008-12-10 12:17 but the both dentry is still live 2008-12-10 12:17 right 2008-12-10 12:18 however, it does not matter which one is visible at that time: 1) if the positive is visible, then that just continues the state that existed before d_delete was called 2) if the negative is visible, then that state will continue 2008-12-10 12:19 so the flip between positive and negative happens at some random time, but it does not flip back and forth 2008-12-10 12:20 it does feel strange ;) 2008-12-10 12:21 by the way, I do not think that the unhashed dentry actually needs a name, so a more clever version of this could swap the d_name pointers for a long name instead of allocating a new name 2008-12-10 12:21 if user open two times, it may random behavior? 2008-12-10 12:21 I think that is ok 2008-12-10 12:21 why? 2008-12-10 12:22 if they get the old dentry, it is just "more busy" 2008-12-10 12:23 if they open twice, get a positive at one time and a negative after, their application is racy 2008-12-10 12:23 first try return -ENOENT, second try returns 0 2008-12-10 12:23 you mean, get the negative first, then the positive? 
2008-12-10 12:23 I don't think that can happen 2008-12-10 12:23 yes 2008-12-10 12:24 d_rehash() was done, and before retack locks 2008-12-10 12:25 ah 2008-12-10 12:25 I don't think they can see negative, then positive in that situation 2008-12-10 12:26 if dcache_lookup is walking the middle of list 2008-12-10 12:26 it may be able to do 2008-12-10 12:26 oh :) 2008-12-10 12:26 because it is rcu 2008-12-10 12:26 :) 2008-12-10 12:27 I am tempted to just show the racy version to lkml and ask for suggestions ;) 2008-12-10 12:27 ok :) 2008-12-10 12:28 well, maybe I thought like that race before 2008-12-10 12:30 well, I'll sleep 2008-12-10 12:30 oyasumi 2008-12-10 12:37 oyasumi :) 2008-12-10 13:26 flips: I turn 40 today 2008-12-10 13:26 happy birthday, young'un 2008-12-10 13:36 getting up there 2008-12-10 14:16 -!- inverse(~michael@h80-net10.simres.netcampus.ca) has joined #tux3 2008-12-10 14:36 http://lkml.org/lkml/2008/12/10/358 <- Tux3 by Christmas? 2008-12-10 14:36 you think ? 2008-12-10 14:37 flips: keep it up 2008-12-10 14:37 good work 2008-12-10 14:37 I'll be joining you folks in a few months 2008-12-10 14:41 we'll try to leave something for you to do ;) 2008-12-10 14:41 real time support should be hard enough 2008-12-10 14:44 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-12-10 14:47 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-10 15:18 -!- ajonat(~ajonat@190.48.121.30) has joined #tux3 2008-12-10 15:41 I also plan on contributing a bit more once I'm done with finals. 
hopefully my patches wont embed too many bugs into tux3 ;) 2008-12-10 17:14 hg log | grep hirofumi | wc 2008-12-10 17:14 176 2008-12-10 17:39 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-12-10 17:43 -!- konrad_(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-12-10 17:56 -!- pgquiles(~pgquiles@26.Red-79-144-194.staticIP.rima-tde.net) has joined #tux3 2008-12-10 18:37 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-10 18:47 -!- ajonat_(~ajonat@190.48.120.75) has joined #tux3 2008-12-10 18:49 -!- _ajonat(~ajonat@190.48.126.76) has joined #tux3 2008-12-10 19:46 http://lwn.net/Articles/310718/ <- Tux3 report: Tux3 by Christmas? 2008-12-10 20:16 that was fast 2008-12-10 20:16 in under the wire 2008-12-10 20:17 the wire? 2008-12-10 20:18 idiom 2008-12-10 20:18 means just in time 2008-12-10 20:18 i know 2008-12-10 20:18 but what is time sensitive about an email being mirrored? 2008-12-10 20:19 oh, lwn weekly edition 2008-12-10 20:19 ohh 2008-12-10 20:19 means Jon was watching lkml like a hawk and picked it up moments after posting 2008-12-10 20:19 he does thaty 2008-12-10 20:21 or he has a google alert for tux3 ;) 2008-12-10 20:22 although that didn't even pick it up that quickly 2008-12-10 20:32 right, that's a corbet alert 2008-12-10 20:32 I'm pretty sure he eyeballs every message 2008-12-10 21:06 -!- RazvanM(~RazvanM@96.234.238.110) has joined #tux3 2008-12-10 21:27 flips: nice post 2008-12-10 21:27 thx 2008-12-10 21:46 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-10 22:39 hey, I just had a little idea 2008-12-10 22:40 what if we store the i_size mod blocksize in the dleaf? 2008-12-10 22:40 thus not having to update the inode table block separately after a write? 2008-12-10 22:44 ? 2008-12-10 22:44 wouldn't you have to update it anyway? 
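The "i_size mod blocksize in the dleaf" idea above comes down to a little arithmetic, sketched here under loud assumptions: the file ends in a mapped block, the index of that last block is known from the extent map, and a stored tail of 0 means the last block is completely full. An empty file and a file ending in a hole are not representable in this simple form. The function names and parameters are hypothetical and nothing here is taken from the tux3 dleaf format.

```c
#include <stdint.h>

/* Hypothetical: the value that would be stored next to the extents,
 * i.e. i_size modulo the block size. */
static uint32_t tail_from_size(uint64_t size, unsigned blockbits)
{
	return (uint32_t)(size & ((1ULL << blockbits) - 1));
}

/* Hypothetical: rebuild i_size from the index of the last mapped
 * logical block plus the stored tail. tail == 0 is read as "the
 * last block is full", so size is a whole number of blocks. */
static uint64_t size_from_tail(uint64_t last_block, unsigned blockbits,
			       uint32_t tail)
{
	uint64_t blocksize = 1ULL << blockbits;

	return (last_block << blockbits) + (tail ? tail : blocksize);
}
```

With 4096-byte blocks (blockbits = 12), a file of 8197 bytes has last block 2 and tail 5, and 2 * 4096 + 5 gives the size back without reading the inode table block.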
2008-12-10 22:46 to update the ctime 2008-12-10 22:47 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-10 23:23 maybe 2008-12-10 23:23 or maybe put that all in the log commit block 2008-12-10 23:23 and update the inode table block rarely, hopefully 2008-12-10 23:54 sure, i always assumed the metadata would be in the commit log too 2008-12-11 00:42 ah, andrew morton on tux3... 2008-12-11 00:43 yes, nice 2008-12-11 00:43 basically says to us, clean up and merge, damn the features 2008-12-11 00:43 :) 2008-12-11 00:43 well I think atomic commit is pretty much essential 2008-12-11 00:43 but the features are what makes us distinct 2008-12-11 00:43 but deferred namespace ops isn't, I can put that on hold 2008-12-11 00:44 well, being very clean is a feature too 2008-12-11 00:44 ok 2008-12-11 00:44 right now the kernel code is about 5,000 lines 2008-12-11 00:44 and atomic commit is unlikely to bring it over 6,000 2008-12-11 00:44 when do u think atomic commit will be done? 2008-12-11 00:45 by Christmas :) 2008-12-11 00:45 :) 2008-12-11 00:45 or I'll eat a christmas tree ornament 2008-12-11 00:45 :) 2008-12-11 00:46 good that u reminded.. time for me to eat 2008-12-11 00:46 my lunch that is :D 2008-12-11 01:41 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-12-11 02:55 hey flips 2008-12-11 02:55 hi 2008-12-11 02:55 what's cracking ? 2008-12-11 02:57 doing a little hacking before crashing 2008-12-11 03:07 good 2008-12-11 03:07 how's atomic commits and the kernel port ? good ? 2008-12-11 03:07 much like your update earlier ? 
2008-12-11 03:07 btw, you're doing a good job on outreah 2008-12-11 03:07 outreach 2008-12-11 03:16 I guess I start on atomic commit seriously tomorrow 2008-12-11 04:17 -!- yanzheng_(~yanzheng@124.42.72.21) has joined #tux3 2008-12-11 04:23 -!- inverse(~michael@h80-net10.simres.netcampus.ca) has joined #tux3 2008-12-11 04:55 -!- yanzheng(~zhyan@124.42.72.21) has joined #tux3 2008-12-11 05:32 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-12-11 07:12 -!- pranith(~bobby@122.162.72.53) has joined #tux3 2008-12-11 07:24 -!- pranith(~bobby@122.162.72.53) has joined #tux3 2008-12-11 07:31 -!- pranith(~bobby@122.162.72.53) has joined #tux3 2008-12-11 08:09 -!- pranith(~bobby@122.162.72.53) has joined #tux3 2008-12-11 08:24 -!- pranith(~bobby@122.162.72.53) has joined #tux3 2008-12-11 08:30 -!- pranith(~bobby@122.162.72.53) has joined #tux3 2008-12-11 08:42 -!- pranith(~bobby@122.162.72.53) has joined #tux3 2008-12-11 08:49 -!- pranith(~bobby@122.162.72.53) has joined #tux3 2008-12-11 09:06 -!- pranith(~bobby@122.162.72.53) has joined #tux3 2008-12-11 09:37 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-12-11 10:22 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-11 10:44 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-11 10:55 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-11 11:14 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-11 12:58 -!- ajonat(~ajonat@190.48.121.52) has joined #tux3 2008-12-11 14:35 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-12-11 16:04 -!- inverse(~michael@h80-net10.simres.netcampus.ca) has joined #tux3 2008-12-11 16:12 -!- tone(~tone@adsl-70-142-37-178.dsl.tul2ok.sbcglobal.net) has joined #tux3 2008-12-11 16:39 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-12-11 16:57 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-11 17:06 -!- 
MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-12-11 17:34 flips: ping 2008-12-11 17:34 the most recent patch in tux3.org/patches is borked 2008-12-11 17:47 back 2008-12-11 17:47 which one? 2008-12-11 17:47 oh 2008-12-11 17:48 forgot to pull to public 2008-12-11 17:49 hirofumi, reading patches now 2008-12-11 17:51 it references namei in the makefile, which is missing 2008-12-11 17:51 name of patch? 2008-12-11 17:51 also other things are broken 2008-12-11 17:51 oh 2008-12-11 17:51 http://tux3.org/patches/tux3-2.6.26.5-2 2008-12-11 17:51 ok 2008-12-11 17:51 like tux_dir_ops 2008-12-11 17:51 extern, but never declared 2008-12-11 17:52 its just all broken 2008-12-11 17:52 we need a real git repo 2008-12-11 17:52 got patch ready to send? 2008-12-11 17:52 yes 2008-12-11 17:52 ready :) 2008-12-11 17:52 well, not to fix the patch 2008-12-11 17:52 right 2008-12-11 17:52 to add setattr 2008-12-11 17:52 just to fix the repo 2008-12-11 17:52 so mtimes work 2008-12-11 17:52 then I will remake the patch 2008-12-11 17:52 and quietly overwrite the borked one :) 2008-12-11 17:53 how about a patch to fix the broken things? 2008-12-11 17:53 i dont know what going on, it seems like shit is missing 2008-12-11 17:53 you rolled it wrong or something, and i cant clone your git repo 2008-12-11 17:53 ok I'll take a look 2008-12-11 17:53 so its hard to fixc 2008-12-11 17:54 is there really no git repo to pull from ? 2008-12-11 17:54 oh 2008-12-11 17:54 git rolled it wrong 2008-12-11 17:54 stupid stupid git 2008-12-11 17:54 likes to leave new files out of its diff 2008-12-11 17:55 like namei 2008-12-11 17:55 you have to say something magic to include them 2008-12-11 17:55 flips: is there a git i can pull from? 2008-12-11 17:55 git diff HEAD 5023112d97ace1a7363ab4b0da2701a21f6e3ffd >/var/www/tux3/patches/tux3-2.6.26.5-2 2008-12-11 17:56 I thought this one was pullable 2008-12-11 17:56 what breaks? 2008-12-11 17:56 hrm let me try again 2008-12-11 17:56 whats the path? 
2008-12-11 17:56 I'll try cloning it 2008-12-11 17:56 http://phunq.net/ddtree?p=tux3fs 2008-12-11 17:57 oh right 2008-12-11 17:57 -!- ajonat(~ajonat@190.48.114.25) has joined #tux3 2008-12-11 17:57 git update-server-info 2008-12-11 17:57 git is actualy pretty crappy, usability wise 2008-12-11 17:58 yeah hg has grown on me now 2008-12-11 17:58 it is not without warts either 2008-12-11 17:58 still doesn't work 2008-12-11 17:58 well, send the patch against mercurial, not git 2008-12-11 17:58 I mean, let me pull from you 2008-12-11 18:01 i dont even have a git repo 2008-12-11 18:01 since i wasn't able to clone 2008-12-11 18:01 so i guess i can just mail a patch 2008-12-11 18:01 well forget git for now 2008-12-11 18:01 let me pull the fixes from mercurial 2008-12-11 18:01 anyway 2008-12-11 18:02 try the current patch 2008-12-11 18:02 I haven't uploaded it to tux3.org yet 2008-12-11 18:04 shapor, somewhat fixed patch is uploaded 2008-12-11 18:16 flips: patch sent 2008-12-11 18:17 the Right Way is just to do the change in user/kernel 2008-12-11 18:17 I'll apply the patch to user/kernel 2008-12-11 18:17 shapor, or you could probably beat me to that 2008-12-11 18:23 ah 2008-12-11 18:23 ok i'll do that 2008-12-11 18:23 :) 2008-12-11 18:24 bbiaf 2008-12-11 18:29 flips: static-http://shapor.com/tux3/shapor-tux3 2008-12-11 18:34 I guess the course of wisdom would be to pull hirofumi's first then see if yours still pulls 2008-12-11 18:37 hirofumi, sorry for the delay reading 2008-12-11 18:37 I am up to introduce tux_new_inode now, everything looks good 2008-12-11 18:38 check_present, cool idea 2008-12-11 18:41 iget5 stuff is very cool 2008-12-11 18:42 64 bit ino on 32 bit arch 2008-12-11 18:58 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-11 19:01 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-12-11 19:05 shapor, when I pull from you an old changeset from way back keep waking up like a zombie 2008-12-11 19:05 shapor, and 
you reclone the public repo and add your patch to it? 2008-12-11 19:05 can you I mean 2008-12-11 19:06 and that way a conflict with one of hirofumi's changes will also be fixed 2008-12-11 19:17 hrm my tree was up to date 2008-12-11 19:17 and i recloned not long ago 2008-12-11 19:17 i will reclone now 2008-12-11 19:22 flips: updated 2008-12-11 19:23 recloned/reapplied 2008-12-11 19:23 kay I'll try it 2008-12-11 19:24 shapor, got just one head this time :) 2008-12-11 19:58 maze, ping 2008-12-11 19:58 razvanm, here? 2008-12-11 19:58 here 2008-12-11 19:58 greets 2008-12-11 19:58 :-) 2008-12-11 19:59 now let's see, who hasn't carved their name in the tux3 credits yet? 2008-12-11 19:59 I think we should have a CREDITS file in fs/tux3 2008-12-11 19:59 along with COPYING 2008-12-11 20:01 ok, where shall we start with the dentry cache? 2008-12-11 20:01 dcache.c? 2008-12-11 20:01 too obvious 2008-12-11 20:02 let's start in fs/namei.c 2008-12-11 20:02 cached_lookup 2008-12-11 20:03 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L424 2008-12-11 20:04 ACTION pulls up a chair 2008-12-11 20:04 got your browser going? 2008-12-11 20:04 yup 2008-12-11 20:04 cached_lookup is static, so let's see how namei.c users it 2008-12-11 20:05 namei -> "name/inode translations" 2008-12-11 20:05 ah! 2008-12-11 20:06 1196static struct dentry *__lookup_hash(struct qstr *name, 2008-12-11 20:06 1244 static struct dentry *lookup_hash(struct nameidata *nd) 2008-12-11 20:06 the only caller 2008-12-11 20:07 ACTION wishes firefox was unlame enough to leave a space between fields when clipping multiple fields 2008-12-11 20:08 1722 path.dentry = lookup_hash(&nd); <- opening a file 2008-12-11 20:09 why is "open" "flip" 2008-12-11 20:09 ACTION guesses the problem isn't quite with firefox... since it probably cuts rich text 2008-12-11 20:09 somewhere in the cut n paste hairball 2008-12-11 20:10 filp = file pointer? 
2008-12-11 20:10 let's see how the cached lookup in do_filp_open works 2008-12-11 20:10 yes, file would be a much better name than filp in 100% of cases 2008-12-11 20:11 so first we try to fill in the entry at the end of the path with a cached lookup 2008-12-11 20:11 hrm filp_open just a wrapper 2008-12-11 20:11 i.e., from the dentry cache 2008-12-11 20:11 wrappers on wrappers ;) 2008-12-11 20:11 doesn't always make sense 2008-12-11 20:12 there's probably also an openat 2008-12-11 20:12 the dentry we get from the lookup can come back positive or negative, the latter is when the name is known not to exist in the directory searched 2008-12-11 20:12 so it probably does make sense - in this case 2008-12-11 20:13 yeah, calls it at AT_FDCWD 2008-12-11 20:14 you'll note do_filp_open isn't static - probably used from open.c 2008-12-11 20:14 MaZe: yup 2008-12-11 20:14 do_sys_open 2008-12-11 20:15 let's see where the real lookup to the filesystem happens 2008-12-11 20:15 that is, handle a dcache miss 2008-12-11 20:15 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L1848 - brilliant 2008-12-11 20:16 real_lookup? 
2008-12-11 20:16 :) 2008-12-11 20:16 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L479 2008-12-11 20:16 it was the lustre guys, I cannot tell a lie :) 2008-12-11 20:17 in do_filp_open, we have two paths, one for start opens, and open for creates 2008-12-11 20:17 the first does its work in path_lookup_open 2008-12-11 20:18 the second in path_lookup_create ;-) 2008-12-11 20:18 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L1045 <- gets to do_path_lookup eventually 2008-12-11 20:19 which calls path_walk 2008-12-11 20:19 which calls link_path_walk 2008-12-11 20:19 :p 2008-12-11 20:20 __link_path_walk 2008-12-11 20:20 finally we get some action 2008-12-11 20:20 926 err = do_lookup(nd, &this, &next); 2008-12-11 20:21 808 static int do_lookup(struct nameidata *nd, struct qstr *name, 2008-12-11 20:21 so we are here, finally 2008-12-11 20:21 this is where we see the dcache lookup, followed by the real lookup if the first misses 2008-12-11 20:22 __d_lookup heads into dcache.c, we will look at it later 2008-12-11 20:23 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L556 2008-12-11 20:23 we can get back, a pos, neg, or no dentry 2008-12-11 20:23 just for nfs? 2008-12-11 20:23 null is a cache miss 2008-12-11 20:23 a negative dentry is not a miss 2008-12-11 20:23 revalidate... nfs 2008-12-11 20:23 whats a negative dentry? 2008-12-11 20:24 we considered repurposing/abusing it for deferred name ops ;) 2008-12-11 20:24 negative dentry... 
says "this name does not exist" 2008-12-11 20:24 in order to avoid having to ask the fs that 2008-12-11 20:24 which probably just told us it didn't exist 2008-12-11 20:24 yeah 2008-12-11 20:24 makes sense 2008-12-11 20:24 it's pretty cool 2008-12-11 20:25 we used it for something _very_ cool with the deferred nameops 2008-12-11 20:25 so exactly what it sounds like 2008-12-11 20:25 negative cache 2008-12-11 20:25 can't do that for nfs 2008-12-11 20:25 we use those negative dentries to make a name appear not to exist, when the fs still has it in its directory 2008-12-11 20:25 i guess you could getattr the parent dir 2008-12-11 20:25 nfs uses this too, on client and in nfsd 2008-12-11 20:26 but in client you dont know if your neg dentry cache is good 2008-12-11 20:26 you need to check with the server to be sure no one else has created 2008-12-11 20:26 hence revalidate 2008-12-11 20:27 there is a bunch of other stuff to poke into on that path walk path, but today we're just looking at the cache structure 2008-12-11 20:27 nfs really fucks with everything in the kernel 2008-12-11 20:27 so back to do_filp_open 2008-12-11 20:27 it does ;) 2008-12-11 20:27 was a bad idea from the start 2008-12-11 20:28 stateful... 
should have been obviously necessasry 2008-12-11 20:28 1695 error = path_lookup_open(dfd, pathname, lookup_flags(flag), 2008-12-11 20:29 1705 error = path_lookup_create(dfd, pathname, LOOKUP_PARENT, 2008-12-11 20:30 not really interested in the difference between _open and _create at this point 2008-12-11 20:30 well 2008-12-11 20:30 if we're creating it, first thing is to know it doesn't already exist 2008-12-11 20:31 theoretically we've already checked at this point 2008-12-11 20:31 but it probably still has to protect against a race 2008-12-11 20:31 ugh, ignore that ;-) 2008-12-11 20:31 hm this MAY_APPEND reminds me we dont support that stuff do we 2008-12-11 20:32 there's a goto 2008-12-11 20:32 instead of an else 2008-12-11 20:32 acl stuff is separate from xattrs or no? 2008-12-11 20:32 1722 path.dentry = lookup_hash(&nd); <- that check is here 2008-12-11 20:33 acls are stored in xattrs 2008-12-11 20:33 shapor, yes and no 2008-12-11 20:33 if a fs wants to store acls in a custom way it can 2008-12-11 20:33 but there is a generic mechanism for storing in xattrs 2008-12-11 20:34 we're drifting away from dcache 2008-12-11 20:34 1737 /* Negative dentry, just create the file */ 2008-12-11 20:34 1738 if (!path.dentry->d_inode) { <- this is the test for negative dentry 2008-12-11 20:34 very badly wants a d_negative(dentry) wrapper 2008-12-11 20:35 if inode is NULL, dentry is negative 2008-12-11 20:36 so the last thing to notice about do_filp_open is, there is no way a real filesystem lookup can happen if there's a dentry in cache 2008-12-11 20:36 there always is a dentry after the lookup 2008-12-11 20:37 the dentry will have a ref count on it, to keep it alive through the do_filp_open 2008-12-11 20:37 unless it gets evicted? 
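The three outcomes of a cached lookup discussed above (positive dentry, negative dentry, no dentry at all) and the wished-for `d_negative(dentry)` wrapper can be modeled in a few lines. This is a toy userspace model: `struct dentry` here has only a name and a `d_inode` pointer, and `cache_lookup`, `classify` and the enum are hypothetical names for illustration, not kernel functions.

```c
#include <stddef.h>
#include <string.h>

struct inode { unsigned long ino; };

struct dentry {
	const char *name;
	struct inode *d_inode;  /* NULL means "name known not to exist" */
};

/* The wrapper the log asks for: a dentry with no inode is negative. */
static int d_negative(const struct dentry *dentry)
{
	return dentry->d_inode == NULL;
}

enum lookup_result { CACHE_MISS, CACHE_NEGATIVE, CACHE_POSITIVE };

/* Toy cache: "foo" exists, "gone" was looked up before and not found. */
static struct inode foo_inode = { 42 };
static struct dentry cache[] = {
	{ "foo", &foo_inode },
	{ "gone", NULL },
};

/* Returns the cached dentry, or NULL on a cache miss. */
static struct dentry *cache_lookup(const char *name)
{
	for (size_t i = 0; i < sizeof cache / sizeof cache[0]; i++)
		if (strcmp(cache[i].name, name) == 0)
			return &cache[i];
	return NULL;
}

static enum lookup_result classify(const char *name)
{
	struct dentry *dentry = cache_lookup(name);

	if (!dentry)
		return CACHE_MISS;  /* only now is the filesystem asked */
	return d_negative(dentry) ? CACHE_NEGATIVE : CACHE_POSITIVE;
}
```

Only the CACHE_MISS case would go on to a real filesystem lookup; a negative hit answers "does not exist" straight from the cache.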
2008-12-11 20:37 oh 2008-12-11 20:38 yes, you're right with me 2008-12-11 20:38 particularly important for positive dentries 2008-12-11 20:38 but enforced for negative as well, which is very useful for us with deferred namespace ops 2008-12-11 20:38 and of course there must be a lock on the dentry to permit atomicity of operations (ie. creation of a file) 2008-12-11 20:39 two locks 2008-12-11 20:39 ? 2008-12-11 20:39 on on dentry lists and on on the dentry flags 2008-12-11 20:39 if ->lookup -ENOENT, fs can avoid negative dentry 2008-12-11 20:40 hirofumi, url? 2008-12-11 20:40 e.g. xfs_vn_ci_lookup() 2008-12-11 20:40 ah, no 2008-12-11 20:41 just return NULL without d_instantiate() 2008-12-11 20:41 let's see if we can find where the ref on the negative dentry is dropped 2008-12-11 20:41 in do_filp_open 2008-12-11 20:41 http://lxr.linux.no/linux+v2.6.27.8/fs/xfs/linux-2.6/xfs_iops.c#L324 2008-12-11 20:41 http://lxr.linux.no/linux+v2.6.27.5/fs/xfs/linux-2.6/xfs_iops.c#L324 2008-12-11 20:41 oh 2008-12-11 20:41 yes 2008-12-11 20:41 nice oddity maze ;) 2008-12-11 20:42 what do you mean? 2008-12-11 20:42 oh 2008-12-11 20:42 sorry 2008-12-11 20:42 it was from hirofumi 2008-12-11 20:42 ok, so what happens to the negative dentry in do_filp_open? 2008-12-11 20:42 it gets turned into a positive dentry 2008-12-11 20:43 and the ref count becomes owned by a file 2008-12-11 20:43 it thinks as negative dentry 2008-12-11 20:43 and dput() frees it 2008-12-11 20:43 we only drop the dentry ref if something goes wrong in this process 2008-12-11 20:44 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L377 2008-12-11 20:44 there is the dput on error 2008-12-11 20:44 create task need to care 2008-12-11 20:44 and on success 2008-12-11 20:45 which means that the file has to take its own ref count somewhere 2008-12-11 20:45 fs? 
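The reference accounting being traced here can be modelled in user space. A minimal sketch, assuming a toy dentry with a plain counter: `toy_dget`, `toy_dput`, `toy_open` and `toy_close` are hypothetical names, and where the real kernel takes the file's reference is exactly what the channel is still looking for at this point.

```c
#include <stddef.h>

/* Toy dentry: only the reference count matters for this sketch. */
struct toy_dentry {
	int d_count;     /* number of holders keeping the dentry pinned */
	int reclaimable; /* set when nobody holds a reference any more */
};

struct toy_file {
	struct toy_dentry *f_dentry;
};

static struct toy_dentry *toy_dget(struct toy_dentry *d)
{
	d->d_count++;
	d->reclaimable = 0;
	return d;
}

static void toy_dput(struct toy_dentry *d)
{
	if (--d->d_count == 0)
		d->reclaimable = 1;
}

/*
 * Open path: the lookup pins the dentry, the file takes its own
 * reference, and the lookup's reference is dropped on both the
 * error path and the success path.
 */
static int toy_open(struct toy_dentry *d, struct toy_file *filp, int fail)
{
	toy_dget(d);                  /* reference held by the path walk */
	if (fail) {
		toy_dput(d);          /* dput on error */
		return -1;
	}
	filp->f_dentry = toy_dget(d); /* the file's own reference */
	toy_dput(d);                  /* dput on success as well */
	return 0;
}

static void toy_close(struct toy_file *filp)
{
	toy_dput(filp->f_dentry);
	filp->f_dentry = NULL;
}
```

After a successful `toy_open` the count is 1 and owned by the file; it only returns to 0 at `toy_close`, which is the lifetime rule the dget/dput pairs are meant to enforce.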
2008-12-11 20:45 the filp ;) 2008-12-11 20:45 we finished with the fs a while ago on this path 2008-12-11 20:46 ok, let's see if we can find where the dentry ref is taken 2008-12-11 20:47 this is pretty much the most important aspect of understanding the dentry cache 2008-12-11 20:47 following the dget/dputs 2008-12-11 20:47 which define the lifetime of the cache objects 2008-12-11 20:48 1610static int __open_namei_create(struct nameidata *nd, struct path *path, <- looks like a good place to check first 2008-12-11 20:48 well, we have a dput there... 2008-12-11 20:48 so yet another dget must match that 2008-12-11 20:49 no dgets in vfs_create 2008-12-11 20:49 must be in the filessystem... 2008-12-11 20:50 let's look at ext2_create 2008-12-11 20:50 http://lxr.linux.no/linux+v2.6.27.8/+code=ext2_add_nondir 2008-12-11 20:51 43 d_instantiate(dentry, inode); 2008-12-11 20:51 no dget there 2008-12-11 20:51 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-11 20:51 I lost count somewhere ;) 2008-12-11 20:51 dput does nd->path? 2008-12-11 20:52 for nd->path.dentry, not path->dentry 2008-12-11 20:52 473int ext2_add_link (struct dentry *dentry, struct inode *inode) 2008-12-11 20:52 hirofumi, thanks 2008-12-11 20:53 ok, that's about an hour on dentry 2008-12-11 20:53 how was that sink-or-swim intro? 2008-12-11 20:53 sink-y :-) 2008-12-11 20:54 but still worthwhile 2008-12-11 20:54 next time we can look at dentry lifetimes on another path 2008-12-11 20:54 follow the *puts 2008-12-11 20:54 yeah, these tux3 u sessions are far more remedial when hirofumi talks less :) 2008-12-11 20:54 speaking of next time - I've got a flight to Europe Tuesday evening... so if I'll be here it'll be from the airport 2008-12-11 20:54 the dcache internals are insteresting too 2008-12-11 20:55 help to understand how it works 2008-12-11 20:55 but we can also treat that as a black box 2008-12-11 20:55 maze, when back? 2008-12-11 20:55 after xmas? 
2008-12-11 20:55 (and afterwards 8pm is going to be 5 am so I probably won't be attending till 10th jan) 2008-12-11 20:56 you'll have a tough time getting your patch in ;) 2008-12-11 20:56 what patch? 2008-12-11 20:56 to get the glory of being in fs/tux3/CREDITS by christmas 2008-12-11 20:56 I don't know what patch yet ;) 2008-12-11 20:57 ah 2008-12-11 20:57 nearly all the namespace hooksups are done now, anything left? 2008-12-11 20:57 ->rename()? 2008-12-11 20:57 and xattr stuff 2008-12-11 20:57 static const struct inode_operations tux_file_iops = { 2008-12-11 20:57 .truncate = tux3_truncate, 2008-12-11 20:57 / .permission = ext4_permission, 2008-12-11 20:58 hirofumi, I will propose a patch tonight to handle the i_size issue 2008-12-11 20:58 for xattrs 2008-12-11 20:58 ah 2008-12-11 20:58 ok 2008-12-11 20:58 what is special about .permission? 2008-12-11 20:58 it is for acl 2008-12-11 20:59 maze, how about ioctl support? 2008-12-11 20:59 I'll finish up the option parsing stuff --- side note: if anyone is bored, here's a cute issue I ran into today: diff <(echo a) <(echo b) fails on a machine which has been up for too long 2008-12-11 20:59 maze, cool 2008-12-11 20:59 64 bit kernel, 32 bit diff binary 2008-12-11 21:00 ooh 2008-12-11 21:00 very nice deep bug you found 2008-12-11 21:00 reason is something related to the inode of pipefs going past 32 bits, and the stat64 call done by libc in diff failing to fit in the stat library call done by the diff source code 2008-12-11 21:00 I haven't quite figured out yet what the exact issue is... 
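The working theory above is that a 64-bit inode number cannot be represented in the 32-bit field of the non-LFS stat structure. A sketch of that narrowing step, assuming the library refuses values that do not fit; `toy_stat64`, `toy_stat32` and `toy_narrow_stat` are hypothetical names, not glibc's code.

```c
#include <errno.h>
#include <stdint.h>

/* What the 64-bit kernel reports (only the field that matters here). */
struct toy_stat64 {
	uint64_t st_ino;
};

/* What a 32-bit non-LFS caller asked for. */
struct toy_stat32 {
	uint32_t st_ino;
};

/* Copy the inode number down to 32 bits, refusing values that do not fit. */
static int toy_narrow_stat(const struct toy_stat64 *in, struct toy_stat32 *out)
{
	if (in->st_ino > UINT32_MAX)
		return -EOVERFLOW; /* "Value too large for defined data type" */
	out->st_ino = (uint32_t)in->st_ino;
	return 0;
}
```

Under this model the kernel call itself can succeed while the caller still sees an error, because the failure happens in the conversion and not in the system call.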
2008-12-11 21:01 maze, I was heaping abuse on that very thing a couple weeks bad 2008-12-11 21:01 called it an abomination I think 2008-12-11 21:01 and was going to make a test case 2008-12-11 21:01 the worst thing is, that bogus inum generator is shared by _all_ filesystems 2008-12-11 21:01 so it can wrap pretty fast 2008-12-11 21:01 yeah, I think that's precisely what's happening 2008-12-11 21:02 this issue is the ino variable is static 2008-12-11 21:02 ino generating variable 2008-12-11 21:02 it's really bogus 2008-12-11 21:02 it's not quite clear to me exactly how it matters, since strace doesn't provide enough info about the stat64 call, but guessing it 2008-12-11 21:02 's the inode >32bit thing which breaks it 2008-12-11 21:02 details would be fun 2008-12-11 21:03 cleanup badly needed there, the bogus static counter should go away completely 2008-12-11 21:03 should be a variable in the fs-private superblock for filesystems that need it 2008-12-11 21:04 here's the example on an 'infected' machine: 2008-12-11 21:04 # stat <(echo a) 2008-12-11 21:04 File: "/dev/fd/63" -> "pipe:[5682879846]" 2008-12-11 21:04 Size: 64 Blocks: 2 Symbolic Link 2008-12-11 21:04 Access: (0500/lr-x------) Uid: ( 0/ root) Gid: ( 0/ root) 2008-12-11 21:04 Device: 3 Inode: 427327551 Links: 1 2008-12-11 21:04 :) 2008-12-11 21:05 I don't understand why the pipe:[#] and Inode: # fields don't match 2008-12-11 21:05 stat is 32bit binary? 2008-12-11 21:06 it is, but the two numbers don't seem to be related in the truncated sense either 2008-12-11 21:06 ah 2008-12-11 21:07 it is /dev/fd/63 ino 2008-12-11 21:07 that's what I thought as well 2008-12-11 21:07 but then shouldn't they be pretty close to each other? 2008-12-11 21:08 # echo $[5682879846-(1<<32)-427327551] 2008-12-11 21:08 960584998 2008-12-11 21:08 isn't exactly close... 
2008-12-11 21:08 it is inode number of symlink (/dev/fd/63 itself) 2008-12-11 21:09 pipe:[] is virtual ino of pipefs 2008-12-11 21:10 got it 2008-12-11 21:10 # stat -l <(echo a) 2008-12-11 21:10 /dev/fd/63: Value too large for defined data type 2008-12-11 21:10 statting the wrong thing - you're right 2008-12-11 21:11 it seems, stat() or something returned -EOVERFLOW 2008-12-11 21:11 it is not completed with LFS 2008-12-11 21:11 oh, it gets better 2008-12-11 21:11 not compiled 2008-12-11 21:12 strace: 2008-12-11 21:12 stat64(0xffb10bc8, 0xffb0eaa0) = 0 2008-12-11 21:12 write(2, "/dev/fd/63: Value too large for "..., 50/dev/fd/63: Value too large for defined data type 2008-12-11 21:12 ) = 50 2008-12-11 21:12 so the kernel call succeeds and libc throws the error 2008-12-11 21:13 yeah, so I guess it can only happen in a 32-bit non-LFS userspace binary with 64-bit kernel... 2008-12-11 21:13 still, it's really odd that you can run into these failures with frickin' pipes 2008-12-11 21:14 busted 2008-12-11 21:14 stat (not stat64) ->ino is 32bit on 32bit arch 2008-12-11 21:14 maze, how'd you notice that one? 2008-12-11 21:14 my scripts were randomly - but persistently on some machines - failing 2008-12-11 21:15 btw, which kernel version? 2008-12-11 21:16 2.6.18 2008-12-11 21:17 but I doubt it matters 2008-12-11 21:17 probably, recent kernel was fixed it 2008-12-11 21:18 eh, recent kernels also break tons of other things... like the scheduler. 2008-12-11 21:18 struct inode *new_inode(struct super_block *sb) 2008-12-11 21:18 { 2008-12-11 21:18 - static unsigned long last_ino; 2008-12-11 21:18 + /* 2008-12-11 21:18 + * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW 2008-12-11 21:18 + * error if st_ino won't fit in target struct field. Use 32bit counter 2008-12-11 21:18 + * here to attempt to avoid that. 
2008-12-11 21:18 + */ 2008-12-11 21:18 + static unsigned int last_ino; 2008-12-11 21:18 struct inode * inode; 2008-12-11 21:18 this may fix it 2008-12-11 21:18 ok, that does look relevant 2008-12-11 21:19 any idea when it was added? 2008-12-11 21:19 commit 866b04fccbf125cd39f2bdbcfeaa611d39a061a8 2008-12-11 21:19 pull linus's tree and run blame on it 2008-12-11 21:19 Author: Jeff Layton 2008-12-11 21:19 Date: Tue May 8 00:32:29 2007 -0700 2008-12-11 21:19 hirofumi, fast 2008-12-11 21:20 thanks 2008-12-11 21:20 yeah, I still have to get a git dev environment set up 2008-12-11 21:20 git is very useful for this work 2008-12-11 21:20 time to set up the tux3 kernel.org tree very soon 2008-12-11 21:21 that static counter has to go 2008-12-11 21:21 2.6.22 2008-12-11 21:21 it's really bogus that i_ino is not zero after new_inode 2008-12-11 21:22 and it's not in 2.6.21.7 2008-12-11 21:55 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-11 22:18 -!- RazvanM(~raz@96.234.235.9) has joined #tux3 2008-12-11 22:42 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-11 23:04 hirofumi, there? 2008-12-11 23:35 hi 2008-12-11 23:49 hi 2008-12-11 23:49 just posted a patch idea for xattrs 2008-12-11 23:50 ok 2008-12-11 23:50 besides that... 2008-12-11 23:50 the defition of "acceptable performance" for merge 2008-12-11 23:51 I think if we have to probe the btree on every ->readpage and ->writepage, it will be a little too slow 2008-12-11 23:51 I could be wrong about that 2008-12-11 23:52 it means the issue of tux3_get_block? 2008-12-11 23:52 yes 2008-12-11 23:52 we will find out how costly it is pretty soon :) 2008-12-11 23:52 could find out right now even 2008-12-11 23:52 yes 2008-12-11 23:53 I am expecting, a little too costly, and since we are working on atomic commit anyway, we can do something about it at the same time 2008-12-11 23:53 I think read side is not big problem 2008-12-11 23:53 because? 
2008-12-11 23:53 because it uses readahead almost all case 2008-12-11 23:54 and readahead uses b_size of up to how big? 2008-12-11 23:54 if so, it uses multiple blocks feature 2008-12-11 23:54 configuarable, now it's 128k 2008-12-11 23:54 that will help a lot 2008-12-11 23:54 yes 2008-12-11 23:55 just concentrate on write then 2008-12-11 23:55 write side may be bad 2008-12-11 23:55 the other issue is, if we call the block io library from ->writepage, then the front end is change dleafs 2008-12-11 23:55 we want those changes to happen in the delta staging instead 2008-12-11 23:56 we can make the atomic commit work, even the way it is, with brutal locking 2008-12-11 23:56 ah, yes 2008-12-11 23:56 for example, every write operation has to take a read lock on the staging_lock 2008-12-11 23:57 but I think I would rather tackle the deferred write, right now, as part of the atomic commit effort 2008-12-11 23:57 for right now, maybe we can bh_delay() 2008-12-11 23:57 don't know about it 2008-12-11 23:57 we can use bh_dealy() 2008-12-11 23:57 I also don't know well 2008-12-11 23:58 but, maybe it is not hard to do 2008-12-11 23:58 we can also use a simple technique from my deferred name ops patch 2008-12-11 23:58 for writepage? 2008-12-11 23:59 actually, it was for inodes, we put the inode on a list on the superblock for later processing 2008-12-11 23:59 and use the inode dirty link for that, which works well 2008-12-11 23:59 when we know the dirty inode, we can walk the dirty pages, and write them out like mpage does 2008-12-12 00:00 but without the model of filemap.c 2008-12-12 00:00 yes 2008-12-12 00:00 vm already does the right thing for us, and puts all the dirty page cache pages on a list we can walk 2008-12-12 00:00 I think we can anyway, I think everything we need is exported 2008-12-12 00:01 yes 2008-12-12 00:01 ok, I just wanted to mention this 2008-12-12 00:01 ok 2008-12-12 00:01 the xattr i_size issue... 
2008-12-12 00:01 shall we just do it that way and get it done for now? 2008-12-12 00:01 I'll see bh_delay() 2008-12-12 00:01 ok 2008-12-12 00:02 I should write the rest of the xattr i_size patch 2008-12-12 00:02 I'm reading email... 2008-12-12 00:05 yes, it would be work 2008-12-12 00:05 I'll finish it now 2008-12-12 00:05 we need to have atable->i_size anywhere 2008-12-12 00:05 right, in the superblock 2008-12-12 00:06 and unpack_sb reads it from disk 2008-12-12 00:06 unpack_sb()? 2008-12-12 00:07 atable itself has i_size attribute 2008-12-12 00:07 right, but that has to be set way high 2008-12-12 00:07 to let the tables be read/written 2008-12-12 00:08 it can have real i_size 2008-12-12 00:08 well, it would be need more changes 2008-12-12 00:08 yes 2008-12-12 00:08 this is a pretty small change 2008-12-12 00:08 yes, it is good for now 2008-12-12 00:08 right, we can make it nicer a couple of months from now 2008-12-12 00:09 so, this sounds good for now 2008-12-12 00:10 k 2008-12-12 00:12 ah, I noticed another way 2008-12-12 00:12 we can copy block_write_full_page() and modify :) 2008-12-12 00:12 yes, exactly 2008-12-12 00:13 we will essentially do that when we change to the block handles interface 2008-12-12 00:13 later 2008-12-12 00:13 ah, yes 2008-12-12 00:16 bh_delay() seems to do basically 2008-12-12 00:17 ->write_begin() sets the delay flag with set_buffer_delay(), and block is not allocated 2008-12-12 00:17 and how does the filesystem know it has to allocate that block? 2008-12-12 00:18 has to remember in ->write_begin? 
2008-12-12 00:18 buffer_delay() means it is not allocated yet 2008-12-12 00:18 next get_block() clears it, and allocate block actually 2008-12-12 00:19 and buffer_unwritten() seems helper for something 2008-12-12 00:20 we can compare approaches after I decribe details of how the staging algorithm is supposed to work 2008-12-12 00:20 right after this xattr patch 2008-12-12 00:20 ok 2008-12-12 00:25 well, xattrtest still runs :) 2008-12-12 00:25 I'll check it in 2008-12-12 00:25 ok 2008-12-12 00:25 must not be too broken ;) 2008-12-12 00:32 ah, I will need another patch to make it compile in kernel 2008-12-12 00:32 should have tested it first :p 2008-12-12 00:33 I am going to have to start making the first line of my commit comment shorter 2008-12-12 00:35 oh, MITME_BIT 2008-12-12 00:35 we don't need it now 2008-12-12 00:36 because we always store it? 2008-12-12 00:36 yes 2008-12-12 00:36 tux_new_inode() sets it 2008-12-12 00:36 but, setattr itself will be used for acl 2008-12-12 00:37 no reason to keep it around then 2008-12-12 00:37 oh, I need to update the super magic 2008-12-12 00:39 flips: what is this k&r bottom up position? 2008-12-12 00:39 writing c code in bottom up style... lower level functions at the top of the file 2008-12-12 00:40 no forward references unless recursion is required 2008-12-12 00:40 oh, ok 2008-12-12 00:47 pranith, there is a discussion of deduplication on the btrfs list you might find interesting 2008-12-12 00:47 subject? 
2008-12-12 00:48 http://www.usenix.org/events/osdi08/tech/full_papers/gunawi/gunawi_html/index.html 2008-12-12 00:48 interesting apporch of fsck 2008-12-12 00:48 pranith, http://kerneltrap.org/mailarchive/linux-btrfs/2008/12/10/4389214 2008-12-12 00:49 I was thinking about it, and just walking our filesystem tree will already be a good start on fsck 2008-12-12 00:50 wow, ambitious 2008-12-12 00:50 hey 2008-12-12 00:50 then, the tree walk can be ammended for checking no cycles in the tree, and no blocks referenced more than once 2008-12-12 00:50 with the help of a big bitmap 2008-12-12 00:50 really big 2008-12-12 00:51 hmm 2008-12-12 00:51 let me see, the bitmap is 32 meg/terabyte 2008-12-12 00:51 not really really big 2008-12-12 00:51 just really big 2008-12-12 00:51 :) 2008-12-12 00:52 ok, ill start looking at it.. 2008-12-12 00:52 fsck ftw!! :) 2008-12-12 00:52 and when the walk is done, everything not marked off in the tree walk bitmap better be marked free in the allocation bitmap 2008-12-12 00:52 so it's a big part of fsck already 2008-12-12 00:55 ok, next thing is to write a design note on atomic commit 2008-12-12 00:56 with the assumption that we will be doing our own style of delayed allocation 2008-12-12 00:59 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-12-12 02:32 I will check in the i_size_read/write() version 2008-12-12 02:33 i_size_read/write takes inode as parameter 2008-12-12 02:33 so, we can't do 2008-12-12 02:33 did I goof? 2008-12-12 02:33 no 2008-12-12 02:34 we have race, but it will fix later 2008-12-12 02:34 with phtree 2008-12-12 02:34 race window is really small 2008-12-12 02:34 loff_t size = i_size_read(dir); <- there is still a race here? 2008-12-12 02:35 no 2008-12-12 02:35 but, we have to read from ->dictsize? 
2008-12-12 02:35 that does not have any smp contention 2008-12-12 02:35 like i_size does 2008-12-12 02:36 ah 2008-12-12 02:36 oh, and I don't think we have contention for directories either 2008-12-12 02:36 it's only regular files 2008-12-12 02:36 so maybe we don't need i_size_read/write 2008-12-12 02:36 but, write side? 2008-12-12 02:36 even then, what can it race with? 2008-12-12 02:37 block_write_full_page() read ->i_size 2008-12-12 02:37 the dirops are protected by dir->i_mutex 2008-12-12 02:37 block_write_full_page is not allowed to operation on a directory 2008-12-12 02:37 um 2008-12-12 02:37 maybe :) 2008-12-12 02:37 we use 2008-12-12 02:37 right 2008-12-12 02:37 well I still don't think it is racy 2008-12-12 02:38 it happens in the same task 2008-12-12 02:38 it has race with tux_create_entry vs writepage 2008-12-12 02:38 writepage is async 2008-12-12 02:38 yes 2008-12-12 02:38 but not in tux3 2008-12-12 02:39 we will flush out the directory block with synchronization 2008-12-12 02:39 and not let the vm do it for us 2008-12-12 02:39 synchronaize, with tux_create_entry? 
2008-12-12 02:39 tux_create entry is synchronized against delta staging 2008-12-12 02:39 it has to be 2008-12-12 02:40 we will have a rwsem to start with 2008-12-12 02:40 I need to write a post 2008-12-12 02:40 ah 2008-12-12 02:40 so you know what I'm thinking 2008-12-12 02:40 with deferred name ops 2008-12-12 02:40 I started to draft it already 2008-12-12 02:40 deferred name ops will have different synchronization, but still synchronized 2008-12-12 02:41 even ext3 has to synchronize like this, with its journal transactions 2008-12-12 02:41 I don't think ext3 is synchronize 2008-12-12 02:42 it journals directory blocks 2008-12-12 02:42 even in ordered data mode 2008-12-12 02:42 yes 2008-12-12 02:42 but, i_size is not atomic 2008-12-12 02:42 true 2008-12-12 02:42 and read is not blocked 2008-12-12 02:43 I could be wrong about this, but I think in the case of directories, other synchronization around i_size is sufficient 2008-12-12 02:43 in ext3 2008-12-12 02:43 and I'm pretty sure about that in tux3... where I've thought about it more 2008-12-12 02:44 if we didn't use block I/O library at all, I think it is true 2008-12-12 02:44 anyway, now I have a nice patch to add i_size_read/write, and maybe it's useless ;) 2008-12-12 02:45 how did you use i_size_write()? 2008-12-12 02:45 like in the post 2008-12-12 02:45 loff_t size = i_size_read(dir); 2008-12-12 02:45 int err = _tux_create_entry(dir, name, len, inum, mode, &size); 2008-12-12 02:45 i_size_write(dir, size); 2008-12-12 02:45 return err; 2008-12-12 02:45 ah 2008-12-12 02:45 we have to redirty after i_size change 2008-12-12 02:46 if we have async writeout 2008-12-12 02:46 otherwise, create_entry did it 2008-12-12 02:46 yes 2008-12-12 02:46 right now... we have async writeout 2008-12-12 02:46 yes 2008-12-12 02:46 so we might as well redo the dirty 2008-12-12 02:47 well, this race can be ignored for now 2008-12-12 02:47 race window is very small 2008-12-12 02:47 shall I check in the wrapper above? 
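The wrapper quoted above can be tried out in user space. A sketch under stated assumptions: `toy_dir`, `toy_create_entry_inner` and the one-block growth are hypothetical stand-ins for the tux3 inode and `_tux_create_entry`, and the redirty step follows the remark that the directory has to be redirtied after an i_size change while writeout is asynchronous.

```c
typedef long long toy_loff_t;

enum { TOY_BLOCK_SIZE = 4096 };

/* Toy directory inode: just a size and a dirty flag. */
struct toy_dir {
	toy_loff_t i_size;
	int dirty;
};

static toy_loff_t toy_i_size_read(const struct toy_dir *dir)
{
	return dir->i_size;
}

static void toy_i_size_write(struct toy_dir *dir, toy_loff_t size)
{
	dir->i_size = size;
}

/*
 * Hypothetical inner helper: works on a caller-supplied size and never
 * touches dir->i_size itself. Here it pretends every new entry needs a
 * fresh block.
 */
static int toy_create_entry_inner(struct toy_dir *dir, const char *name,
				  toy_loff_t *size)
{
	(void)dir;
	if (!name || !*name)
		return -1;
	*size += TOY_BLOCK_SIZE;
	return 0;
}

/* Wrapper: read the size once, let the helper update a local copy, write it back. */
static int toy_create_entry(struct toy_dir *dir, const char *name)
{
	toy_loff_t size = toy_i_size_read(dir);
	int err = toy_create_entry_inner(dir, name, &size);

	if (size != toy_i_size_read(dir)) {
		toy_i_size_write(dir, size);
		dir->dirty = 1; /* redirty after the size change */
	}
	return err;
}
```

The point of the shape is that the inner function only ever sees a local `size`, so all reads and writes of the shared field go through one pair of accessors in the wrapper.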
2008-12-12 02:47 or just put it aside? 2008-12-12 02:48 I think, just ignore is enough 2008-12-12 02:49 ok :) 2008-12-12 02:49 ah, ext3/ext4 uses buffer cache for directory 2008-12-12 02:49 so, it doesn't have this race at all 2008-12-12 02:50 right 2008-12-12 02:51 even in page cache, it would not let the mm write a directory page asynchronously 2008-12-12 02:51 and ext3/ext4 dir should be in page cache 2008-12-12 02:51 there's no reason for it to be in buffer cache 2008-12-12 02:52 one day when we have time, we should fix that a post a benchmark to show why 2008-12-12 02:52 jbd depends on buffer cache, so it still is not converted 2008-12-12 02:52 but ext3 can journal data blocks too 2008-12-12 02:52 yes 2008-12-12 02:52 but, anybody didn't do it 2008-12-12 02:53 right, it's work, and it's not immediately obvious why page cache is better 2008-12-12 02:54 probably, with htree, it become not very important? 2008-12-12 02:55 I wrote a post about it 2008-12-12 02:55 looking for it now 2008-12-12 02:58 https://kerneltrap.org/mailarchive/linux-ext4/2008/8/20/3016754 2008-12-12 02:58 this one got buried pretty deep ;) 2008-12-12 03:00 the big reason for mapping a btree into a file is few radix tree probes, the second reason is higher index fanout 2008-12-12 03:01 i see 2008-12-12 03:02 it's oyasumi time 2008-12-12 03:02 tomorrow's project is to complete a detailed post on atomic commit, then start on it 2008-12-12 03:03 ok, oyasumi 2008-12-12 03:05 oh, that's also the thread where Ted says dot and dotdot are only for backward compatibility 2008-12-12 03:05 and for sanity check 2008-12-12 03:06 yes, we need that 2008-12-12 03:06 just as it says in the thread 2008-12-12 03:06 not just in one dirent block though, in every one 2008-12-12 03:06 oh 2008-12-12 03:06 so you can reassemble a directory without the help of index blocks 2008-12-12 03:06 i see 2008-12-12 03:07 ah 2008-12-12 03:07 i see 2008-12-12 03:45 -!- tone(~tone@adsl-70-142-37-178.dsl.tul2ok.sbcglobal.net) has 
left #tux3 2008-12-12 07:15 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-12 08:02 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-12 08:52 -!- inverse(~michael@h80-net10.simres.netcampus.ca) has joined #tux3 2008-12-12 09:48 -!- inverse(~michael@h80-net10.simres.netcampus.ca) has joined #tux3 2008-12-12 10:24 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-12-12 10:43 -!- mingming(~mingming@c-71-193-163-244.hsd1.or.comcast.net) has joined #tux3 2008-12-12 10:52 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-12 11:57 -!- pranihome(~bobby@122.162.67.42) has joined #tux3 2008-12-12 12:03 -!- pranihome(~bobby@122.162.67.42) has joined #tux3 2008-12-12 12:08 -!- pranihome(~bobby@122.162.67.42) has joined #tux3 2008-12-12 12:09 http://www.usenix.org/events/osdi08/tech/full_papers/gunawi/gunawi_html/index.html <- this fsck paper is well worth reading 2008-12-12 12:10 yes 2008-12-12 12:11 thanks for that 2008-12-12 12:11 could we make a section on the website of 'start reading here'? 2008-12-12 12:11 start reading about the design? 2008-12-12 12:11 ok, background 2008-12-12 12:12 design, operation, theory, gotchas, other implementations and ideas... 2008-12-12 12:12 a filesystem encyclopedia 2008-12-12 12:12 wikipedia is pretty weak actually 2008-12-12 12:12 could take a while ;) 2008-12-12 12:12 some general filesystem links would be good 2008-12-12 12:12 -!- pranihome(~bobby@122.162.67.42) has joined #tux3 2008-12-12 12:13 you could come up 'top 5 papers' or something that would get it going 2008-12-12 12:13 pointers to sct's talks for example 2008-12-12 12:13 then the restwill come naturally 2008-12-12 12:13 sct? 2008-12-12 12:13 tweedy? 
2008-12-12 12:15 http://www.linux.genova.it/sections/04_cosa_facciamo/02_corsi_linux/98_novembre_2002/03_DesignImplementationEXT3.pdf 2008-12-12 12:15 yes 2008-12-12 12:15 ok, here's the start: 2008-12-12 12:16 http://www.usenix.org/events/osdi08/tech/full_papers/gunawi/gunawi_html/index.html 2008-12-12 12:16 http://www.linux.genova.it/sections/04_cosa_facciamo/02_corsi_linux/98_novembre_2002/03_DesignImplementationEXT3.pdf 2008-12-12 12:16 personally i could use some simpler stuff, cuz tux3 university is way beyond my abilities 2008-12-12 12:16 and when shapor gets around to the design doc it will be awesome 2008-12-12 12:17 bushman, this is right for you: http://oreilly.com/catalog/9780596005658/ 2008-12-12 12:17 -!- pranihome(~bobby@122.162.67.42) has joined #tux3 2008-12-12 12:19 bushman, my design notes are about right for you I hope 2008-12-12 12:19 today's is on the atomic commit 2008-12-12 12:20 parts of them are, there are parts that make my head spin very easy. you apparently dont believe in intellectual foreplay, just go full speed without a warmup ;) 2008-12-12 12:20 I'm not a tease 2008-12-12 12:21 http://lwn.net/Articles/248180/ <- ValH's take on fsck 2008-12-12 12:22 folks 2008-12-12 12:24 prani, a little light on technical detail 2008-12-12 12:25 yup.. for starters like me :) 2008-12-12 12:25 hey, there's nothing wrong with a little overview, gotta start somewhere 2008-12-12 12:25 -!- RazvanM(~raz@dazzler.isi.jhu.edu) has joined #tux3 2008-12-12 12:25 I wonder where her project is at this point, it's been years now 2008-12-12 12:25 bh, which proj? 
2008-12-12 12:25 the chunkfs guy works on ext4 now 2008-12-12 12:26 he seems bright 2008-12-12 12:27 http://lkml.org/lkml/2008/1/13/146 <- here I went through the e2fsck passes 2008-12-12 12:27 got that by reading the e2fsck source 2008-12-12 12:28 flips, you wrote a really cool thing few months back on bio...something, been a long time now, but that's what got me interested in this whole project 2008-12-12 12:28 ACTION forgets what that was 2008-12-12 12:28 i'll find it 2008-12-12 12:53 has anyone started working on a mkfs.tux3 yet? 2008-12-12 12:54 konrad, there is "tux3 mkfs" 2008-12-12 12:54 so yes 2008-12-12 12:54 right 2008-12-12 12:54 ok 2008-12-12 12:54 it got a new feature today, clearing out other people's superblocks 2008-12-12 12:55 though there is some question of how much of the tail of a volume needs clearing 2008-12-12 13:55 design notes on fsck posted 2008-12-12 13:55 now... design note on atomic commit 2008-12-12 13:55 the christmas project 2008-12-12 14:52 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-12 15:23 -!- tim_dimm_(~mobile@166.190.213.112) has joined #tux3 2008-12-12 19:08 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-12 19:57 -!- camby(~root@60.205.80.45) has joined #tux3 2008-12-12 20:20 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-12 21:59 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-12 22:33 -!- pranihome(~bobby@122.162.71.11) has joined #tux3 2008-12-12 22:49 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-12-13 00:25 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2008-12-13 00:52 -!- RazvanM(~RazvanM@96.234.235.9) has joined #tux3 2008-12-13 06:09 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-13 06:38 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-13 07:59 
http://lkml.org/lkml/2008/12/10/410 2008-12-13 08:00 The server is taking too long to respond; please wait a minute or 2 and try again. 2008-12-13 08:00 im not sure whats goin on there. the main page is loading, but this is not 2008-12-13 08:15 hmm, i get that error too 2008-12-13 09:49 -!- RazvanM_(~RazvanM@96.234.232.67) has joined #tux3 2008-12-13 12:58 ACTION hopes lkml.org recovers soon from whatever ails it 2008-12-13 12:59 ok, time to have a little think about ->rename for deferred name ops 2008-12-13 13:00 even though it's not needed for immediate work 2008-12-13 13:00 if it's easy to get a prototype running today, then I will, and then put the deferred namespace ops project aside for now 2008-12-13 13:02 so, fs/namei.c does ->rename followed by d_move 2008-12-13 13:02 and what we need to do is always leave a negative dentry in the position vacated by the d_move 2008-12-13 13:03 the old position is always unhashed, so that means we always clone a new negative dentry to fill that position, and hash it before the d_move takes place 2008-12-13 13:05 we also need to worry about what happens when a destination dirent is overwritten... if there was a deferred create in flight for it, then it needs to be cancelled 2008-12-13 13:10 hi 2008-12-13 13:10 hi 2008-12-13 13:10 we have to care older kernel? 2008-12-13 13:11 probably not 2008-12-13 13:11 it was just interesting that we can very nearly compile on it 2008-12-13 13:11 so I played a little bit 2008-12-13 13:11 2.6.24.3 2008-12-13 13:12 I never saw the point of ERR_CAST anyway 2008-12-13 13:12 seems like a useless decoration 2008-12-13 13:12 on 2.6.28 or near 2008-12-13 13:12 some interface was changed 2008-12-13 13:12 e.g. 
slab stuff 2008-12-13 13:13 right, it makes a lot of sense to focus on exacty one kernel 2008-12-13 13:13 stay with the kernel we are on until we have atomic commit I think 2008-12-13 13:13 just to save a little time 2008-12-13 13:14 ok 2008-12-13 13:14 going out for a bit 2008-12-13 13:14 back in two hours I think 2008-12-13 13:14 ok 2008-12-13 13:14 I'll sleep 2008-12-13 13:15 now, I'm thinking about dleaf format 2008-12-13 13:15 good plan :) 2008-12-13 13:15 and dleaf codes 2008-12-13 13:15 ok, I'm thinking about the obvious thing: however we do atomic commit, we need to handle the bio endio ourselves 2008-12-13 13:16 probably, yes 2008-12-13 13:16 which block library does not give us a way of doing, if we use the block library then we have to wait on bits and things like that 2008-12-13 13:16 which is going to be just as complex as submitting our own bio I think 2008-12-13 13:17 anyway, I will have a more specific plan when you wake up :) 2008-12-13 14:34 -!- inverse(~michael@h80-net10.simres.netcampus.ca) has joined #tux3 2008-12-13 14:41 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-13 16:35 -!- inverse_(~michael@h80-net10.simres.netcampus.ca) has joined #tux3 2008-12-13 16:57 -!- inverse(~michael@h80-net10.simres.netcampus.ca) has joined #tux3 2008-12-13 16:57 back 2008-12-13 16:58 windy down there on the boardwalk 2008-12-13 16:58 have to doge the flying palm fronds 2008-12-13 16:58 and evade the sand drifts on the bike path 2008-12-13 18:42 -!- pranihome(~bobby@122.162.71.11) has joined #tux3 2008-12-13 20:14 -!- Aks(~ankitsriv@123.237.71.123) has joined #tux3 2008-12-13 21:48 -!- Aks(~ankitsriv@123.237.71.123) has left #tux3 2008-12-13 23:57 -!- pranihome(~bobby@122.162.71.11) has joined #tux3 2008-12-14 00:26 -!- RazvanM(~RazvanM@96.234.232.67) has joined #tux3 2008-12-14 02:12 if libc gets ENOENT from sys_rename (because of a bug in rename) then it does something very bizarre: it does the rename "by hand", but creating the 
destination, copying the data and unlinking the original 2008-12-14 02:12 rather than reporting the obvious error 2008-12-14 02:12 this is demented 2008-12-14 02:13 folks 2008-12-14 02:16 hi bh 2008-12-14 02:16 I presume this is proof there is life after 40 2008-12-14 02:17 kind of 2008-12-14 02:17 I had an aikido rank test today 2008-12-14 02:18 and? 2008-12-14 02:18 woke up at about 8, only got about 3-4 hours of partial sleep since my legs were throbbing for some reason 2008-12-14 02:18 it went fine 2008-12-14 02:18 I thought you were going to say some lil punk kicked your butt 2008-12-14 02:18 the instructor wouldn't select you for testing unless he was pretty confident that you'd pass 2008-12-14 02:19 so what's your rank now? 2008-12-14 02:19 6th kyu 2008-12-14 02:20 the lower beginner's rank 2008-12-14 02:20 lower=lowst 2008-12-14 02:20 lowest 2008-12-14 02:20 there was very little hard stuff on the test 2008-12-14 02:21 basic rolls, two basis techniques and variantions on it 2008-12-14 02:21 freestyle grabs 2008-12-14 02:21 and throws 2008-12-14 02:21 nothing hard per se 2008-12-14 02:21 sounds like judo 2008-12-14 02:22 yeah, similiar 2008-12-14 02:22 used to do that a long time ago 2008-12-14 02:23 I'm pretty interested in it now, mainly because it's a really cool art if it's taught well within a good school 2008-12-14 02:23 it's something other than kernel programming to take my mind off of things 2008-12-14 02:24 never mind it's fun to know you can kick butt when necessary 2008-12-14 02:25 how's tux3 development going ? 2008-12-14 02:25 moving along 2008-12-14 03:18 -!- RazvanM(~RazvanM@96.234.232.67) has joined #tux3 2008-12-14 04:03 -!- bushman_(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-12-14 04:36 flips, there? 2008-12-14 04:36 hi 2008-12-14 04:36 hello 2008-12-14 04:36 how did you like the fsck post? 
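The bitmap-assisted tree walk discussed on 2008-12-12 can be sketched as below. Assumptions: one bit per block, 4096-byte blocks (which is what makes the quoted 32 meg per terabyte come out), and hypothetical helper names throughout; none of this is the tux3 code.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes of bitmap needed at one bit per block. */
static uint64_t toy_bitmap_bytes(uint64_t volume_bytes, uint32_t block_size)
{
	uint64_t blocks = volume_bytes / block_size;
	return (blocks + 7) / 8;
}

static int toy_test_bit(const uint8_t *map, uint64_t n)
{
	return (map[n >> 3] >> (n & 7)) & 1;
}

static void toy_set_bit(uint8_t *map, uint64_t n)
{
	map[n >> 3] |= (uint8_t)(1u << (n & 7));
}

/*
 * Called for every block the tree walk reaches. Returns -1 if the block
 * was already reached, which covers both a doubly referenced block and
 * a cycle leading back to a visited block.
 */
static int toy_mark_block(uint8_t *seen, uint64_t block)
{
	if (toy_test_bit(seen, block))
		return -1;
	toy_set_bit(seen, block);
	return 0;
}

/*
 * After the walk: count blocks where the walk bitmap and the allocation
 * bitmap disagree. The chat only states that unmarked blocks must be
 * free; checking the other direction too is an addition of this sketch.
 */
static uint64_t toy_count_mismatches(const uint8_t *seen,
				     const uint8_t *allocated, uint64_t blocks)
{
	uint64_t bad = 0;
	for (uint64_t i = 0; i < blocks; i++)
		if (toy_test_bit(seen, i) != toy_test_bit(allocated, i))
			bad++;
	return bad;
}
```

For a 1 TiB volume, `toy_bitmap_bytes(1ULL << 40, 4096)` gives 2^28 blocks, hence 2^25 bytes, which is the 32 MiB figure.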
2008-12-14 04:36 need some help :) 2008-12-14 04:37 yeah, very helpful 2008-12-14 04:37 i was following that to write the check... 2008-12-14 04:38 and hirofumi's draw_graph... 2008-12-14 04:38 it has the basic structure to traverse the graph i think.. 2008-12-14 04:39 i wrote a basic skeleton.. need you to give your suggestions 2008-12-14 04:39 any time 2008-12-14 04:40 http://rafb.net/p/F4V3DV18.html 2008-12-14 04:40 its minimalist, far from complete and just beginning 2008-12-14 04:41 im trying to do it for itable first 2008-12-14 04:41 check_advance is cut n paste from advance 2008-12-14 04:42 good 2008-12-14 04:42 do{} while in check_tree is actually not necessary... 2008-12-14 04:43 a simple while(); would have sufficed .. 2008-12-14 04:43 many changes.. 2008-12-14 04:43 it looks fine 2008-12-14 04:44 minor things: check_advance should be above check_tree in the file (k&r style) 2008-12-14 04:44 could of missing spaces 2008-12-14 04:44 couple of missing spaces 2008-12-14 04:44 "start_check" doesn't really have a reason to exist 2008-12-14 04:45 it will take each tree and call check_tree on that 2008-12-14 04:45 right now, we are doing only for itable 2008-12-14 04:45 but later.. may be other trees? 2008-12-14 04:45 it's just one big tree 2008-12-14 04:45 tree of trees 2008-12-14 04:46 hmm 2008-12-14 04:46 just have another advance loop inside the main one, to handle file index trees 2008-12-14 04:47 hmm 2008-12-14 04:55 hmm, return (struct btree){ }; <- in new_btree 2008-12-14 04:57 then I suppose the btree is filled in when inode is opened 2008-12-14 04:59 pranihome, anyway, I was going to say... you should not open a file to get a btree to traverse, you chould construct a struct btree "by hand" 2008-12-14 04:59 like decode_attrs does 2008-12-14 04:59 ok, i was opening the volume and traversing the tree... 2008-12-14 04:59 i should not be doing that? 
2008-12-14 05:00 no, you can't be sure that open will work until you have checked the structure of the volume 2008-12-14 05:00 hmm 2008-12-14 05:01 so you don't need those open_inodes and igets 2008-12-14 05:01 ok 2008-12-14 05:01 ACTION updating.. 2008-12-14 05:02 I better sleep ;) 2008-12-14 05:02 :), see u 2morrow with something more useful :D 2008-12-14 05:35 -!- pranihome(~bobby@122.162.71.11) has joined #tux3 2008-12-14 07:38 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-14 08:15 -!- pranihome(~bobby@122.162.71.11) has joined #tux3 2008-12-14 10:08 -!- pgquiles(~pgquiles@91.Red-88-12-135.dynamicIP.rima-tde.net) has joined #tux3 2008-12-14 10:17 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-14 11:30 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-14 11:38 -!- pranihome(~bobby@122.162.71.11) has joined #tux3 2008-12-14 12:01 anyone here? 2008-12-14 12:21 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-14 13:07 -!- kbingham_(~kbingham@82-46-4-172.cable.ubr03.aztw.blueyonder.co.uk) has joined #tux3 2008-12-14 13:14 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-14 14:26 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-14 16:51 -!- yanzheng(~zhyan@inet-sc10-o.oracle.com) has joined #tux3 2008-12-14 16:52 -!- yanzheng(~zhyan@inet-sc10-o.oracle.com) has left #tux3 2008-12-14 16:57 -!- mlankhorst(~m@fw1.astro.rug.nl) has joined #tux3 2008-12-14 18:14 -!- pranihome(~bobby@122.162.71.11) has joined #tux3 2008-12-14 19:35 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-12-14 20:55 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-14 21:22 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-14 21:58 -!- RazvanM(~RazvanM@96.234.232.67) has joined #tux3 2008-12-14 23:47 -!- 
pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-15 00:46 -!- camby(~root@58.82.180.206) has joined #tux3 2008-12-15 00:58 pranihome, now 2008-12-15 01:03 mlankhorst, congrats on the slashdotting 2008-12-15 01:35 flips: Morning :) 2008-12-15 02:28 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-12-15 02:48 -!- pgquiles__(~pgquiles@91.Red-88-12-135.dynamicIP.rima-tde.net) has joined #tux3 2008-12-15 04:47 -!- pgquiles_(~pgquiles@91.Red-88-12-135.dynamicIP.rima-tde.net) has joined #tux3 2008-12-15 05:37 flips: there? 2008-12-15 07:02 flips, when you wake up, gimme a holler, i got some good news for ya 2008-12-15 07:54 -!- pranith(~bobby@122.162.71.199) has joined #tux3 2008-12-15 08:00 -!- pranith(~bobby@122.162.71.199) has joined #tux3 2008-12-15 08:11 -!- pranith(~bobby@122.162.71.199) has joined #tux3 2008-12-15 08:20 -!- pranith(~bobby@122.162.71.199) has joined #tux3 2008-12-15 08:26 -!- pranith(~bobby@122.162.71.199) has joined #tux3 2008-12-15 08:33 -!- pranith(~bobby@122.162.71.199) has joined #tux3 2008-12-15 08:38 -!- pranith(~bobby@122.162.71.199) has joined #tux3 2008-12-15 08:43 -!- pranith(~bobby@122.162.71.199) has joined #tux3 2008-12-15 08:47 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-15 08:52 -!- pranith(~bobby@122.162.71.199) has joined #tux3 2008-12-15 09:02 -!- Aks(~ankitsriv@123.237.67.163) has joined #tux3 2008-12-15 09:02 -!- Aks(~ankitsriv@123.237.67.163) has left #tux3 2008-12-15 09:03 -!- pranith(~bobby@122.162.71.199) has joined #tux3 2008-12-15 09:09 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-15 09:09 -!- pranith(~bobby@122.162.71.199) has joined #tux3 2008-12-15 10:30 bushman, hey 2008-12-15 11:28 -!- inverse(~michael@h80-net10.simres.netcampus.ca) has joined #tux3 2008-12-15 11:29 woot, Just finished my final exam 2008-12-15 11:29 that's a woot indeed 2008-12-15 11:29 now I have a whole two weeks to play iwth tux3 2008-12-15 
11:29 :) 2008-12-15 11:29 which school? 2008-12-15 11:30 http://uoit.ca/ 2008-12-15 11:32 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-12-15 11:32 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-15 11:40 hi 2008-12-15 11:44 http://userweb.kernel.org/~hirofumi/dleaf-cleanup-fix.patch 2008-12-15 11:44 this is test of dwalk_probe() 2008-12-15 11:44 we have the issue on it 2008-12-15 11:45 the dwalk_probe() behavior is a bit random 2008-12-15 11:45 data[] is the data of dleaf 2008-12-15 11:46 hi hirofumi 2008-12-15 11:46 the second entry: logical index start is 0x3001000002ULL, number of blocks is 5, physical block is 2 2008-12-15 11:46 hi 2008-12-15 11:46 + ret = dwalk_probe(leaf1, sb->blocksize, walk1, 0x3001000003ULL); 2008-12-15 11:46 +// assert(ret); 2008-12-15 11:46 + dwalk_probe_check(walk1, data[1].index, &data[1].ex); 2008-12-15 11:47 the test is probe of 0x3001000003ULL 2008-12-15 11:47 I think the corrent result should hit 0x3001000002ULL 2008-12-15 11:48 but current hits 0x3001000011ULL 2008-12-15 11:48 because probe doesn't check extent_count() 2008-12-15 11:49 which keys lie between 3...3 and 3..11? 
2008-12-15 11:50 probe key is 0x3001000003ULL 2008-12-15 11:50 + { 0x3001000002ULL, make_extent(0x2, 5), }, 2008-12-15 11:50 + { 0x3001000011ULL, make_extent(0x8, 1), }, 2008-12-15 11:51 data is 2~6, 11~11 2008-12-15 11:51 looking at the patch now 2008-12-15 11:52 ah 2008-12-15 11:52 the problem is, dwalk_probe does not know anything about extents 2008-12-15 11:52 yes 2008-12-15 11:53 so what I do is probe to 64 blocks less than the extent address 2008-12-15 11:53 and maybe, the real problem is, we can't know walk all extents on dleaf->entry actually 2008-12-15 11:53 exactly 2008-12-15 11:54 to know that, we would have to put an additional flag in the index block 2008-12-15 11:54 I can fix this, but it is not efficient 2008-12-15 11:54 that indicates "some extents overlap the next leaf block" 2008-12-15 11:54 probing to 64 less than the target address is good enough for now I think 2008-12-15 11:55 I think this may be more complex 2008-12-15 11:56 if there is two entries, 2~7, 9~10 2008-12-15 11:56 and if probes 8 2008-12-15 11:57 probably, dwalk_probe() should point 9~10 extent 2008-12-15 11:57 yes 2008-12-15 11:57 but, we don't know read all extents on 2~7 2008-12-15 11:57 we don't know until we read all extents 2008-12-15 11:58 in this case, both entries are in the same leaf? 2008-12-15 11:58 yes 2008-12-15 11:59 dleaf->entry (keylo is 2) 2008-12-15 11:59 we don't know total block count until read all extent 2008-12-15 12:00 if it has 7 blocks, it is 2~8, so probe() should hit it 2008-12-15 12:00 ah, ~7 means count is 7? 2008-12-15 12:00 no 2008-12-15 12:00 ok, means least significant part of logaddr is 7 2008-12-15 12:00 it meant entry has logical index 2~7 2008-12-15 12:01 keylo == 2, total extent_count() == 6 2008-12-15 12:01 but isn't this also solved by problem to 64 blocks lower than the target? 2008-12-15 12:02 sorry 2008-12-15 12:02 but isn't this also solved by probing to 64 blocks lower than the target? 
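The "probe 64 blocks lower, then walk forward" idea discussed above can be sketched in plain C. This is only an illustration under stated assumptions: `struct ext`, `probe_key`, `find_extent` and `MAX_EXTENT` are hypothetical names, the extents live in a flat sorted array rather than a real dleaf, and the key-only probe stands in for a probe that knows nothing about extent counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_EXTENT 64 /* assumed upper bound on blocks per extent */

/* Hypothetical in-memory extent: logical start plus block count. */
struct ext { uint64_t start; unsigned count; };

/* Key-only probe: index of the first extent whose start is >= key.
 * Like the probe in the chat, it never looks at extent counts. */
static size_t probe_key(const struct ext *exts, size_t n, uint64_t key)
{
	size_t lo = 0, hi = n;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (exts[mid].start < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Seek below the target by the maximum extent size, then walk forward.
 * Returns the index of the extent covering target, or -1 for a hole. */
static long find_extent(const struct ext *exts, size_t n, uint64_t target)
{
	uint64_t low = target > MAX_EXTENT ? target - MAX_EXTENT : 0;
	for (size_t i = probe_key(exts, n, low); i < n && exts[i].start <= target; i++) {
		assert(exts[i].count <= MAX_EXTENT); /* the technique relies on this limit */
		if (target - exts[i].start < exts[i].count)
			return (long)i;
	}
	return -1;
}
```

With the two test extents from the patch (start 0x3001000002 with 5 blocks, start 0x3001000011 with 1 block), probing 0x3001000003 lands on the first extent. The cost is a slightly longer forward walk, which matches the "slightly slower lookup" trade-off mentioned in the discussion.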
2008-12-15 12:02 ah 2008-12-15 12:03 entry only have one extent? 2008-12-15 12:04 for now, yes 2008-12-15 12:04 versioning will add more 2008-12-15 12:05 and we can also save some dict space by introducing the rule that multiple extents in one entry are logically end-to-end 2008-12-15 12:05 and dleaf will get more complex 2008-12-15 12:05 yes 2008-12-15 12:05 right now I think we need to try to avoid making dleaf more complex ;) 2008-12-15 12:05 I think actuall problem is complexicy 2008-12-15 12:06 at the expense of slightly slower lookup 2008-12-15 12:06 right, it's already complex 2008-12-15 12:06 yes 2008-12-15 12:06 I try to remove complexicy, but not success 2008-12-15 12:07 yes, I think it's about as complex as it needs to be to implement that two level scheme 2008-12-15 12:07 the question is, is the space savings worth the complexity? 2008-12-15 12:07 I think it is 2008-12-15 12:07 yes 2008-12-15 12:08 if file is fragmented, current one may have win 2008-12-15 12:08 dleaf_insert was the most complex function, and it was removed when dwalk_* arrived 2008-12-15 12:08 if not fragmented, it is not good 2008-12-15 12:08 i see 2008-12-15 12:09 removing the ->free and ->used in-leaf accounting will be a slight reduction in complexity 2008-12-15 12:10 dleaf_chop is going to be rewritten using _walk and _chop_after 2008-12-15 12:10 yes 2008-12-15 12:10 I was going to try it 2008-12-15 12:10 shapor is also interested 2008-12-15 12:10 shapor, around? 2008-12-15 12:10 oh 2008-12-15 12:11 shapor is pretty good with this stuff, and only you really know the vfs 2008-12-15 12:12 well, my first thought is tux3_get_block and sys_read/sys_write stuff more reliable 2008-12-15 12:12 yes, I'm reading that right now too 2008-12-15 12:12 e.g. I thought it should pass fsx-linux before atomic commit 2008-12-15 12:13 and the problem is truncate 2008-12-15 12:13 fsx-linux? 
2008-12-15 12:13 good stress tool 2008-12-15 12:13 ah, so you want _chop working better now 2008-12-15 12:13 random read/write/mmap/truncate/expand size etc. 2008-12-15 12:14 yes 2008-12-15 12:14 and it can use chop_after + free blocks 2008-12-15 12:14 make sense, and with rewrites, the filemap handling for overlapping extents has to be workiing properly 2008-12-15 12:15 right 2008-12-15 12:15 I hope so 2008-12-15 12:15 that was the intention 2008-12-15 12:15 well, second one is hole handlining in tux3_get_block/filemap_extent_io 2008-12-15 12:16 it would have some problems 2008-12-15 12:16 have we discussed these before? 2008-12-15 12:17 maybe, didn't discuss 2008-12-15 12:17 what problem then? 2008-12-15 12:17 maybe, it fills all hole if it found 2008-12-15 12:18 on write? 2008-12-15 12:18 yes 2008-12-15 12:18 ok, that's bad 2008-12-15 12:18 but, I'm not sure though 2008-12-15 12:19 I think it would help to add the get_extents interface I talked about 2008-12-15 12:19 yes 2008-12-15 12:19 ok, how about I do that, and you do the _chop? 2008-12-15 12:19 this work is also for it 2008-12-15 12:20 some work you have in patches now? 2008-12-15 12:20 I rewrited the dwalk_probe() 2008-12-15 12:20 but, it has the mentioned issue 2008-12-15 12:22 ok, and it is not an issue if we use the technique of probing lower than the target by the maximum extent size 2008-12-15 12:22 and I think we should have dwalk_extent()/dwalk_end() 2008-12-15 12:22 um... 2008-12-15 12:22 I would like to see your probe rewrite :) 2008-12-15 12:22 dwalk_extent makes a lot of sense 2008-12-15 12:22 ok 2008-12-15 12:22 what does walk_end do? 
2008-12-15 12:23 it tells end of extent in dleaf 2008-12-15 12:23 actually, dwalk_next can just be improved to be aware of position within the extent 2008-12-15 12:23 we don't need too functions 2008-12-15 12:23 yes 2008-12-15 12:24 well, it can use in dwalk_next() too 2008-12-15 12:24 dwalk_next() needs to check it 2008-12-15 12:25 right now, dwalk_next just gives you an extent, and the higher level code is responsible for remembering the position in it 2008-12-15 12:25 (mumbling to myself) 2008-12-15 12:25 yes 2008-12-15 12:26 in my think, dwalk_next should just change cursor 2008-12-15 12:26 I think you are right 2008-12-15 12:26 that will simplify filemap.c 2008-12-15 12:26 yes, and we don't need dwalk_back() for now 2008-12-15 12:27 because we can walk to an exact position 2008-12-15 12:27 also right :) 2008-12-15 12:27 if we want to walk extent, 2008-12-15 12:28 for (has_next = probe(); has_next; next()) { get_extent()/get_index()} 2008-12-15 12:28 for (has_next = probe(); has_next; has_next = next()) { get_extent()/get_index()} 2008-12-15 12:28 in my think 2008-12-15 12:29 and if we want, it can macro 2008-12-15 12:29 yes 2008-12-15 12:29 for_each_extent(key, index, extent) {} 2008-12-15 12:29 instead of returning the extent struct 2008-12-15 12:30 yes 2008-12-15 12:30 that makes it more like the btree advance 2008-12-15 12:30 ok, it is better 2008-12-15 12:30 and, I tried dwalk_probe() at first 2008-12-15 12:31 and found that issue 2008-12-15 12:31 get_index/get_count, where count is the count to the end of the extent maybe 2008-12-15 12:32 ok, well let's use the seek-below technique for now in probe 2008-12-15 12:32 then walk to where we actually want to be 2008-12-15 12:33 um..., I'm not sure if it removes the issue 2008-12-15 12:33 well, I'll put current patch 2008-12-15 12:33 http://userweb.kernel.org/~hirofumi/dleaf-cleanup-fix.patch 2008-12-15 12:34 this is what did I try 2008-12-15 12:35 reading 2008-12-15 12:35 ok, I'm thinking about seek-below 
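The loop shape proposed above, where probe and next only move a cursor and separate accessors read the current extent, might look roughly like this. All names here (`struct walk`, `walk_probe`, `walk_next`, `walk_index`, `walk_count`, `blocks_from`) are hypothetical stand-ins, the extents are assumed to be a sorted, non-overlapping array, and the linear probe is chosen for brevity, not speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ext { uint64_t index; unsigned count; };

/* Hypothetical cursor: just a position in a sorted extent array. */
struct walk { const struct ext *exts; size_t n, pos; };

/* Position the cursor at the first extent that ends after key.
 * Returns 1 if such an extent exists, 0 otherwise. */
static int walk_probe(struct walk *w, const struct ext *exts, size_t n, uint64_t key)
{
	assert(exts || !n);
	w->exts = exts;
	w->n = n;
	for (w->pos = 0; w->pos < n; w->pos++)
		if (exts[w->pos].index + exts[w->pos].count > key)
			return 1;
	return 0;
}

/* Only moves the cursor; returns 1 while it still points at an extent. */
static int walk_next(struct walk *w)
{
	if (w->pos < w->n)
		w->pos++;
	return w->pos < w->n;
}

/* Accessors read the extent under the cursor. */
static uint64_t walk_index(const struct walk *w) { return w->exts[w->pos].index; }
static unsigned walk_count(const struct walk *w) { return w->exts[w->pos].count; }

/* Count mapped blocks at or after key, using the loop shape from the chat. */
static unsigned blocks_from(const struct ext *exts, size_t n, uint64_t key)
{
	struct walk w;
	unsigned total = 0;
	for (int has_next = walk_probe(&w, exts, n, key); has_next; has_next = walk_next(&w)) {
		uint64_t index = walk_index(&w);
		unsigned count = walk_count(&w);
		if (index < key) /* key falls inside this extent: skip the part before it */
			count -= (unsigned)(key - index);
		total += count;
	}
	return total;
}
```

Because the cursor can sit on an exact position and the caller reads it through accessors, nothing in this loop needs to step backwards.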
2008-12-15 12:39 it try to define the some states explicitly 2008-12-15 12:39 the case - !dleaf_groups() 2008-12-15 12:39 the case - searched all extents 2008-12-15 12:40 your code is easier to understand because it exits early on degenerate cases 2008-12-15 12:40 for _probe 2008-12-15 12:40 yes, it is for the above reason 2008-12-15 12:41 and we should be able to see the current cursor position from dwalk 2008-12-15 12:42 even if !dleaf_groups() , searched all extents, etc 2008-12-15 12:42 yes, better 2008-12-15 12:42 cleaner interface 2008-12-15 12:43 I try to do it too 2008-12-15 12:43 but, we can't return the middle of extent 2008-12-15 12:43 middle of entry 2008-12-15 12:44 because, caller wants to know current extent_count 2008-12-15 12:44 it is why original didn't search extents 2008-12-15 12:44 for the entire extent, and not just to the end of the current one? 2008-12-15 12:45 if entry has multiple extents 2008-12-15 12:45 the cursor should be the first extent of entry 2008-12-15 12:46 we don't have that problem yet, but it is good to plan for it 2008-12-15 12:46 because dwalk_index() returns the index of first extent 2008-12-15 12:46 yes 2008-12-15 12:47 when we have multiple extents per entry then the reason is either that there are multiple versions or there are end-to-end extents for the same version 2008-12-15 12:48 processing can be pretty complicated 2008-12-15 12:48 yes 2008-12-15 12:49 I think we should just leave that for later, and rely on one extent per entry for now 2008-12-15 12:50 or just remove entry for now? 2008-12-15 12:50 remove entry? 2008-12-15 12:50 we don't need entry if there is no multiple extent? 2008-12-15 12:51 you mean, remove the whole layer from the dleaf indexing? 
2008-12-15 12:51 no 2008-12-15 12:51 :) 2008-12-15 12:52 maybe, merge group and entry, or entry and extent 2008-12-15 12:52 it would be simple 2008-12-15 12:53 ...thinking 2008-12-15 12:54 well, then we would have to re-add it when we get to versioning 2008-12-15 12:54 so I think we better keep it 2008-12-15 12:54 i see 2008-12-15 12:55 just not starting coding to handle versions at this time 2008-12-15 12:56 if we ignore multiple extent, it may make code simple 2008-12-15 12:57 just write a comment where this is ignored 2008-12-15 12:57 yes 2008-12-15 12:57 your probe is a big improvement in readability/provability 2008-12-15 12:58 thanks 2008-12-15 12:58 the dwalk_back at the end is only needed to handle multiple extents/entry I think 2008-12-15 12:58 so this can be a little simpler now, right? 2008-12-15 12:58 probe is also needed 2008-12-15 12:59 last extent probe will be removed in dwalk_probe() 2008-12-15 13:01 ? 2008-12-15 13:01 we don't need total count in entry 2008-12-15 13:02 oh right 2008-12-15 13:02 we can see it to read one extent 2008-12-15 13:02 is the logical flag return still needed? 2008-12-15 13:03 logical flag? 2008-12-15 13:03 return 1 vs 0 from probe, it used to always return 0 2008-12-15 13:03 ah 2008-12-15 13:03 I think it is useful for caller 2008-12-15 13:04 caller can call dwalk_next() or not 2008-12-15 13:04 and the sense is, found an exact match? 2008-12-15 13:05 ACTION reads the comment 2008-12-15 13:05 um..., it sounds good, but it can know by dwalk_index and dwalk_extent 2008-12-15 13:06 and caller will do it after dwalk_probe() 2008-12-15 13:07 ok, I should wait to see a complete patch set 2008-12-15 13:08 I think this will include changes to dleaf and filemap, right? 2008-12-15 13:08 ok, I'll try with no multiple extent 2008-12-15 13:08 yes 2008-12-15 13:08 probaby, many of dleaf stuff 2008-12-15 13:09 and I'm thinking about hideing the dleaf internal from others 2008-12-15 13:09 e.g. 
filemap.c knows about dleaf internal 2008-12-15 13:10 if we can hide it, maybe we can change dleaf format easily 2008-12-15 13:10 e.g. pass struct extent to dleaf (it is not diskextent) 2008-12-15 13:11 however, it can be overkill 2008-12-15 13:13 yes, we need get_extents now 2008-12-15 13:13 I will return to that 2008-12-15 13:13 I am done with deferred namespace for now 2008-12-15 13:13 oh 2008-12-15 13:14 everything is working except is_empty for directories 2008-12-15 13:15 and that is not very interesting, performance tests can run even if that is broken 2008-12-15 13:15 I'm not reading recent deferred nameop stuff yet, because I'm tackling dleaf stuff 2008-12-15 13:15 that's fine :) 2008-12-15 13:15 your contribution was big 2008-12-15 13:16 thanks 2008-12-15 13:16 the same technique of cloning a negative dentry worked fine for rename 2008-12-15 13:16 I have some bug fixes 2008-12-15 13:16 good 2008-12-15 13:17 I found some bugs while reading the dleaf stuff and others 2008-12-15 13:17 ready to pull? 2008-12-15 13:18 wait a bit 2008-12-15 13:18 anyway, on deferred namespace... 
some time in january maybe the deferred namespace demo patch for ext2 can be posted for comment along with latency measurements 2008-12-15 13:18 -!- inverse(~chatzilla@h80-net10.simres.netcampus.ca) has joined #tux3 2008-12-15 13:19 and then we can get wider opinions on things like where problems might come with deferred assignement of ino 2008-12-15 13:19 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-15 13:20 i see 2008-12-15 13:20 sounds good 2008-12-15 13:20 and I wonder if it is ok to make directory link count be equal to number of entries in the directory + 1 instead of number of subdirectories + 2 2008-12-15 13:21 or maybe number of entries in the directory + 2 is safer 2008-12-15 13:21 some tools may depend on it, I'm not sure 2008-12-15 13:21 ah 2008-12-15 13:21 there is some talk about optimizations libc tries to do, using that count 2008-12-15 13:21 and how the optimizations do not work well 2008-12-15 13:21 at least, find checks it is ->i_nlink == 2 2008-12-15 13:22 if == 2, it skips, iirc 2008-12-15 13:22 i see 2008-12-15 13:23 anyway it is hard to imagine a utility failing because the count was higher than the number of subdirs, but maybe somebody out there on lkml will know 2008-12-15 13:23 anyway, that side project is finished until later 2008-12-15 13:24 or, search with www.google.com/codesearch 2008-12-15 13:24 ah, I googled, but not with codesearch 2008-12-15 13:24 well, we can add new attribute for it, instead 2008-12-15 13:25 yes, I was just seeing if we could avoid that, but it is not a big cost 2008-12-15 13:26 yes 2008-12-15 13:26 and a new directory index format can include count of entries, then we track the change in the in-memory inode 2008-12-15 13:26 good 2008-12-15 13:27 ah, btw, we will return fake "." and ".." in readdir()? 2008-12-15 13:27 I guess we better, nothing else does that for us 2008-12-15 13:28 yes 2008-12-15 13:28 need to interpret dir->f_pos = 0 as, return "." and = 1 means "return ".." 
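The readdir convention just described (position 0 reports ".", position 1 reports "..", stored entries follow) can be sketched as a small lookup. This is a simplified model, not the tux3 code: `dirent_at` is a hypothetical helper, and it treats directory positions as plain entry numbers, which a real directory format need not do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the name to report at directory position pos.
 * Positions 0 and 1 are the synthesized "." and ".."; stored entry i
 * is reported at position i + 2. Returns NULL past the last entry. */
static const char *dirent_at(const char *const *stored, size_t count, size_t pos)
{
	assert(stored || !count);
	if (pos == 0)
		return ".";
	if (pos == 1)
		return "..";
	return pos - 2 < count ? stored[pos - 2] : NULL;
}
```

Under this numbering the first stored entry sits at position 2, so an empty directory still yields exactly two names.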
2008-12-15 13:28 well 2008-12-15 13:28 yes 2008-12-15 13:29 FWIW, it is in my todo list 2008-12-15 13:29 and return 2 as the pos for telldir for the zeroth entry 2008-12-15 13:30 yes 2008-12-15 13:31 if (filp->f_pos < 2) { filp->f_pos + 1} 2008-12-15 13:54 ok 2008-12-15 13:55 oops, wrong channel ;) 2008-12-15 13:56 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-15 13:57 flips, please review it later 2008-12-15 13:57 I'll sleep 2008-12-15 13:57 reading now 2008-12-15 13:57 ok 2008-12-15 13:58 7 am in the morning over there! 2008-12-15 13:58 hehehe 2008-12-15 13:58 good morning flips 2008-12-15 13:58 Another all nighter? 2008-12-15 13:59 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-15 13:59 good morning maarten 2008-12-15 13:59 congrats on the slashdotting 2008-12-15 14:00 thanks :) 2008-12-15 14:00 I'm working on some assembly tests now. 2008-12-15 14:02 Some guy from valgrind is interested in helping me get wine64 under valgrind 2008-12-15 14:02 then windows devs can develop on linux :) 2008-12-15 14:04 sort of yeah 2008-12-15 14:04 so we should look around for the least sucky wiki incaration 2008-12-15 14:04 there are a lot of them 2008-12-15 14:04 whoops 2008-12-15 14:05 :) 2008-12-15 14:06 Did you mean: sucky wiki incarnation? 2008-12-15 14:07 yes 2008-12-15 14:07 evil google autocomplete ;) 2008-12-15 14:07 thinking about having a wiki on tux3.org 2008-12-15 14:08 but somehow not have the site run php => don't get 0wN3d 2008-12-15 14:08 Write it in shell script ;) 2008-12-15 14:08 shapor finds a friend in you 2008-12-15 14:09 :) 2008-12-15 14:09 mlankhorst: you read my mind http://github.com/shapor/bashcms 2008-12-15 14:09 http://infomesh.net/pwyky/ 2008-12-15 14:09 lol 2008-12-15 14:11 did anybody ever write a bash jit? 2008-12-15 14:11 Be the first! 
2008-12-15 14:11 Then write bash to bootstrap itself 2008-12-15 14:12 it's hard to think of a filthier project 2008-12-15 14:12 I think the only way to do that properly is to make GCC output bash code 2008-12-15 14:12 to be even filthier you mean 2008-12-15 14:13 :> 2008-12-15 14:13 pussies use llvm 2008-12-15 14:13 a bash jit would probably have a huge following... take all those completely sucky bash scripts and make your server run, oh, 5% faster 2008-12-15 14:19 hirofumi completely rewrote tux_del_dirent, one line survived 2008-12-15 14:22 tux3_rename I mean 2008-12-15 14:24 I saw the new version, atleast I was going in the right direction :) 2008-12-15 14:25 you looked at hirofumi's repo? 2008-12-15 14:31 static-http://userweb.kernel.org/~hirofumi/tux3/ <- just pull/clone this and run hg view 2008-12-15 14:36 yes, its too bad exams got in the way or I would have kept on updating it ;) 2008-12-15 14:37 no problem 2008-12-15 14:37 you're now on the way to becoming a vfs expert ;) 2008-12-15 14:38 and actually, some of your original survived with minor respellings 2008-12-15 14:39 obvious fixes, like checking if a dir is empty only if the inode is a dir 2008-12-15 14:46 just checked if hg rollback rolls back all the changes from a pull... it does 2008-12-15 14:46 thanks for that :) 2008-12-15 14:48 lol 2008-12-15 15:06 root@usermode:~# sh test 2008-12-15 15:06 VFS: Can't find ext3 filesystem on dev ubdb. 2008-12-15 15:06 invalid superblock [74757833dd080906] 2008-12-15 15:06 that is nice 2008-12-15 15:07 not only tells me it is a tux3 superblock, but the date of the incompatible disk format 2008-12-15 15:07 sept 6 2008-12-15 15:25 isnt the tux3 magic number dd 08 12 12 2008-12-15 15:26 but ya, I had the same thign happen to me earlier. 
I assumed that a modification that I made somehow overwrote the superblock magic 2008-12-15 15:29 by 'same thing' I mean: 2008-12-15 15:29 tux3:~# ./mnt.sh 2008-12-15 15:29 invalid superblock [74757833dd080906] 2008-12-15 15:54 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-12-15 16:37 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-15 18:15 -!- ajonat(~ajonat@190.48.93.57) has joined #tux3 2008-12-15 18:16 inverse, dd 08 12 12 is the date part of the magic, in bcd 2008-12-15 18:17 date of last incompatible disk format change 2008-12-15 18:17 the cure is "tux3 mkfs" 2008-12-15 18:18 we will not attempt to be compatible with old disk formats for now 2008-12-15 18:21 oh I see 2008-12-15 18:22 super.c is supposed to have a comment that lists what the changes were, just below the magic 2008-12-15 18:22 though our repository will also give that information 2008-12-15 18:53 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-15 22:08 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-12-15 22:10 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-15 23:50 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-16 00:48 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-12-16 01:01 hirofumi, there? 2008-12-16 01:48 -!- RazvanM(~RazvanM@96.234.232.67) has joined #tux3 2008-12-16 01:53 hi 2008-12-16 02:21 hi 2008-12-16 02:22 well, I improved filemap.c I think, but not completely correctly 2008-12-16 02:22 my interpretation of buffer_new is wrong 2008-12-16 02:22 yes 2008-12-16 02:23 I think ->get_block should leave the buffer !uptodate for a hole 2008-12-16 02:23 I'll see after dleaf stuff 2008-12-16 02:23 um 2008-12-16 02:23 no 2008-12-16 02:23 !mapped 2008-12-16 02:24 yes, !mapped and !uptodate 2008-12-16 02:24 and if allocated new block, new and mapped 2008-12-16 02:25 or is it !mapped and uptodate? 
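Going back to the superblock magic: the bytes in the "invalid superblock [74757833dd080906]" message can be decoded by hand, and the sketch below does the same in C. It assumes the layout implied by the chat (four ASCII bytes "tux3", one byte 0xdd, then year, month, day in BCD) and assumes two-digit years mean 20xx. `decode_magic` and `struct magic_date` are hypothetical names, and the 0xdd byte is skipped rather than interpreted because the chat does not explain its role.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct magic_date { unsigned year, month, day; };

/* Convert one packed BCD byte (two decimal digits) to its value. */
static unsigned bcd(uint8_t b)
{
	return (b >> 4) * 10 + (b & 15);
}

/* Returns 1 and fills *date if the 8 bytes start with "tux3".
 * Byte 4 (0xdd in the chat) is skipped; bytes 5..7 are yy mm dd in BCD. */
static int decode_magic(const uint8_t m[8], struct magic_date *date)
{
	assert(m && date);
	if (memcmp(m, "tux3", 4) != 0)
		return 0;
	date->year = 2000 + bcd(m[5]); /* assumption: two-digit years are 20xx */
	date->month = bcd(m[6]);
	date->day = bcd(m[7]);
	return 1;
}
```

Fed the bytes from the error message, this gives 2008-09-06, which agrees with the "sept 6" reading in the chat; the dd 08 12 12 value decodes to 2008-12-12.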
2008-12-16 02:25 no 2008-12-16 02:26 mapped means it has physical block 2008-12-16 02:26 http://lkml.indiana.edu/hypermail/linux/kernel/0009.2/0425.html 2008-12-16 02:26 so, if block was allocated, I think it should be mapped 2008-12-16 02:27 !mapped and uptodate is hole 2008-12-16 02:27 right 2008-12-16 02:27 write never return hole 2008-12-16 02:27 read does 2008-12-16 02:27 yes 2008-12-16 02:28 and there, we want set_buffer_uptodate, not set_buffer_new 2008-12-16 02:28 I think set_buffer_new is for any newly allocated block 2008-12-16 02:29 yes 2008-12-16 02:29 so that the caller will try to invalidate any buffer cache block at that address 2008-12-16 02:30 anyway, the current code is not quite right, but I think it is a lot easier to work with than earlier today :) 2008-12-16 02:30 yes, if it's new 2008-12-16 02:30 good :) 2008-12-16 02:31 it's oyasumi time 2008-12-16 02:31 ok 2008-12-16 02:31 well, maybe, write path and read path would be separated 2008-12-16 02:32 or partly separated 2008-12-16 02:32 the code is easier to factor now 2008-12-16 02:33 yes 2008-12-16 02:43 I see we don't actually respect the 64 block extent limit in the dwalk_pack loop 2008-12-16 02:44 dwalk_pack() checks it, iirc 2008-12-16 02:44 :) 2008-12-16 02:46 I don't see a check 2008-12-16 02:47 MAX_GROUP_ENTRIES? 2008-12-16 02:47 ah 2008-12-16 02:47 it is not 2008-12-16 02:47 we should assert 2008-12-16 02:47 guess_extent does? 
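On the buffer states agreed above (mapped means a physical block exists, a read of a hole stays unmapped, a write never returns a hole, a freshly allocated block is both new and mapped), a userspace model may help keep the cases straight. This is not kernel code: `BUF_MAPPED` and `BUF_NEW` are stand-ins for the buffer_head bits, `map_block` and its allocator cursor are hypothetical, physical block 0 is assumed to mean "hole", and the uptodate bit is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the mapped/new buffer bits; not the kernel definitions. */
enum { BUF_MAPPED = 1, BUF_NEW = 2 };

struct map_result { unsigned flags; uint64_t physical; };

/* Model of a get_block-style lookup for one logical block.
 * physical == 0 means the block is a hole (assumption of this sketch);
 * *next_free is a hypothetical allocator cursor used only when create is set. */
static struct map_result map_block(uint64_t physical, int create, uint64_t *next_free)
{
	struct map_result r = { 0, 0 };
	assert(next_free && *next_free != 0);
	if (physical) {
		/* Already allocated: mapped, for reads and writes alike. */
		r.flags = BUF_MAPPED;
		r.physical = physical;
	} else if (create) {
		/* Write into a hole: allocate, so the result is new and mapped. */
		r.flags = BUF_MAPPED | BUF_NEW;
		r.physical = (*next_free)++;
	}
	/* Read of a hole: left unmapped, with no physical block. */
	return r;
}
```

Only the write path can produce `BUF_NEW`, which fits the suggestion in the chat that the read and write paths could be at least partly separated.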
2008-12-16 02:47 right 2008-12-16 02:48 there is not really a reason for guess_extent to limit itself to 64 blocks 2008-12-16 02:48 well, I was thinking we may want to add some code for it 2008-12-16 02:48 I am not very proud of this code :) 2008-12-16 02:49 but it is going to be getting much more attention over the next few days 2008-12-16 02:49 it is why I think I may want to add extent interface (not diskextent) 2008-12-16 02:49 yes, that was my idea with get_segs 2008-12-16 02:50 it is getting close to being an extent interface 2008-12-16 02:50 yes, get_segs is extent array 2008-12-16 02:50 and I thought, we want to pass a extent to dleaf stuff 2008-12-16 02:50 we may want to pass 2008-12-16 02:51 yes 2008-12-16 02:51 guess_extent was just a hack to let me test extents 2008-12-16 02:51 instead changing dleaf internal directly 2008-12-16 02:52 i see 2008-12-16 02:53 ah, you meant you want to make dleaf walk etc more extent aware 2008-12-16 02:53 right 2008-12-16 02:53 yes, may want 2008-12-16 02:54 but, it may also be overkill 2008-12-16 02:54 it might. 
I was just thinking what dwalk_next would do 2008-12-16 02:54 yes 2008-12-16 02:55 and dwalk_index may also be 2008-12-16 02:55 and dwalk_pack too 2008-12-16 02:55 it may be right the way it is 2008-12-16 02:56 yes 2008-12-16 02:56 probably, it needs big change, so it would be later 2008-12-16 02:56 yes 2008-12-16 02:57 ok, goodnight 2008-12-16 02:57 good night 2008-12-16 05:54 -!- pgquiles__(~pgquiles@143.Red-79-154-136.staticIP.rima-tde.net) has joined #tux3 2008-12-16 08:06 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 08:17 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 08:24 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-16 08:27 -!- RazvanM(~RazvanM@96.234.232.67) has joined #tux3 2008-12-16 08:27 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-12-16 08:27 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-12-16 08:27 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 08:40 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 09:05 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 09:13 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 09:17 -!- bushman_(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-12-16 09:18 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 09:38 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-16 09:39 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 09:51 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 10:00 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 10:01 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-12-16 10:11 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 10:32 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 10:34 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 10:43 -!- pranith(~bobby@122.162.69.154) has joined #tux3 
2008-12-16 10:54 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 11:02 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 11:08 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 11:16 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 11:27 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 11:57 hirofumi, there? 2008-12-16 11:57 hi 2008-12-16 11:57 hi 2008-12-16 11:57 some more work on filemap today 2008-12-16 11:57 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 11:57 and a design note 2008-12-16 11:57 good 2008-12-16 11:57 I thought I'd talk about it a bit 2008-12-16 11:57 a couple of minor changes 2008-12-16 11:57 dwalk_next/back seems good now 2008-12-16 11:58 I like it :) 2008-12-16 11:58 ok, the big thing is... get_segs will have to sometimes redirect block writes to new locations 2008-12-16 11:59 if we are doing the equivalent of ext3 data=journal 2008-12-16 11:59 yes 2008-12-16 11:59 to do that, it needs to know the state of the pages/buffers in page cache 2008-12-16 12:00 yes 2008-12-16 12:00 it has to redirect if a buffer is: clean or dirty in a previous delta 2008-12-16 12:00 so the get_segs interface by itself can't know that 2008-12-16 12:00 because it does not see the buffer states 2008-12-16 12:01 yes 2008-12-16 12:01 ah, but the get_segs interface can be told by a flag maybe 2008-12-16 12:01 Did I miss academy? :P 2008-12-16 12:01 :) 2008-12-16 12:01 or just separate write pass? 2008-12-16 12:01 mlankhorst, tux3 u at the normal time tonight 2008-12-16 12:02 or just separate write path? 2008-12-16 12:02 ah 2008-12-16 12:02 separating the write path by itself does not fix it 2008-12-16 12:02 filemap_extent_io check state and allocate block 2008-12-16 12:03 then it will tell result to dleaf stuff? 
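The redirect rule stated above (redirect when the buffer is clean, or dirty in a previous delta) reduces to a small predicate. A minimal sketch, assuming each dirty buffer remembers the number of the delta that dirtied it; `must_redirect`, `enum buf_state` and the delta counters are hypothetical names, not the tux3 interface.

```c
#include <assert.h>

enum buf_state { BUF_CLEAN, BUF_DIRTY };

/* Decide whether a write must go to a new physical location.
 * dirty_delta is the delta in which the buffer was last dirtied and is
 * only meaningful when state is BUF_DIRTY. */
static int must_redirect(enum buf_state state, unsigned dirty_delta, unsigned current_delta)
{
	assert(state == BUF_CLEAN || dirty_delta <= current_delta);
	if (state == BUF_CLEAN)
		return 1; /* on-disk copy belongs to an earlier committed state */
	return dirty_delta < current_delta; /* dirtied in a previous delta */
}
```

A buffer already dirty in the current delta can be overwritten in place, which is why the caller, which can see buffer state, would pass this answer down as a flag.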
2008-12-16 12:03 ACTION forgot what time that is in holland, probably at a time he's asleep 2008-12-16 12:03 oh I see what you mean 2008-12-16 12:03 yes, the kernel get_block interface really is not good for this 2008-12-16 12:03 and the write has to take a different path 2008-12-16 12:04 I was just about to go read the ext3 write path 2008-12-16 12:04 yes 2008-12-16 12:04 ext3 is using jbd stuff? 2008-12-16 12:04 yes 2008-12-16 12:04 it's a complex path to follow 2008-12-16 12:05 I have looked at it a number of times, but without the knowledge I picked up over the last few months 2008-12-16 12:05 allocate shadow buffer, then it writes data to it? 2008-12-16 12:05 something like that 2008-12-16 12:05 i see 2008-12-16 12:05 ok, lxr time :) 2008-12-16 12:06 aha 2008-12-16 12:07 http://lxr.linux.no/linux+v2.6.27/fs/ext3/inode.c#L1138 <- ext3_write_begin 2008-12-16 12:07 write data path 2008-12-16 12:07 ext3 takes over some of what generic_file_write used to do 2008-12-16 12:07 exactly for the reason I'm talking about now 2008-12-16 12:08 does its own grab_cache_page, and checks buffer states 2008-12-16 12:08 yes 2008-12-16 12:08 so if we want to keep using the generic_* functions for now, we will do the same 2008-12-16 12:09 ext3 can use get_block() 2008-12-16 12:09 -!- ajonat(~ajonat@190.48.127.40) has joined #tux3 2008-12-16 12:09 because it doesn't change final position and doesn't use delayed alloc 2008-12-16 12:09 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 12:09 exactly 2008-12-16 12:10 but write_begin/write_end is the interface for us 2008-12-16 12:10 so, it adds buffer to transaction list 2008-12-16 12:10 yes 2008-12-16 12:10 right, and we need to remember the buffers too 2008-12-16 12:11 there is difference 2008-12-16 12:11 because we need to know when writeout is completed on them 2008-12-16 12:11 if we use delayed alloc 2008-12-16 12:11 yes 2008-12-16 12:12 now, we will think this with delayed alloc? or without it? 
2008-12-16 12:12 I think we should do without delayed alloc first 2008-12-16 12:13 ok 2008-12-16 12:13 so, write_begin() allocates the block, and adds those to transaction list 2008-12-16 12:14 yes 2008-12-16 12:15 http://lxr.linux.no/linux+v2.6.27/fs/jbd/transaction.c#L531 2008-12-16 12:15 just checking the relationship between ->writepage and ->write_begin 2008-12-16 12:15 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 12:15 ok 2008-12-16 12:16 right, that is our "fork" 2008-12-16 12:16 write_begin -> do_journal_get_write_access -> do_get_write_access 2008-12-16 12:17 similar to our path 2008-12-16 12:17 ok 2008-12-16 12:17 it checks buffer state 2008-12-16 12:18 and allocate new jbd 2008-12-16 12:19 and, next is ->writepage? 2008-12-16 12:20 ->writepage must be defined in terms of _begin/end 2008-12-16 12:21 ACTION checks 2008-12-16 12:21 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 12:22 it isn't 2008-12-16 12:22 but for us it can be 2008-12-16 12:22 I mean, it isn't in ext3 2008-12-16 12:23 the relationship between ->writepage and ->write_begin is a little complex 2008-12-16 12:24 yes 2008-12-16 12:25 ok, so back to the get_segs interface... 2008-12-16 12:25 yes 2008-12-16 12:25 it can either be told, one logical range at a time, whether it should redirect or overwrite 2008-12-16 12:26 or it can examine buffer states and figure that out for itself 2008-12-16 12:26 the advantage of the second one is, it can return a longer range of segments 2008-12-16 12:26 -!- pranith(~bobby@122.162.69.154) has joined #tux3 2008-12-16 12:26 the advantage of the first is, it is probably easier to integrate with the existing vfs interface 2008-12-16 12:28 vfs interface means get_block()? 
2008-12-16 12:28 get_block for one 2008-12-16 12:28 I think we will find that it is convenient to move away from get_block for writing 2008-12-16 12:28 it is not hard to set up our own bio 2008-12-16 12:29 then we will be handling the ->writepage and mpage stuff ourselves 2008-12-16 12:30 but for now we are using get_block, because it works :) 2008-12-16 12:30 I was thinking writepage will just trigger the transaction to flush 2008-12-16 12:31 and then where do we do the get_segs? 2008-12-16 12:31 in our delta flush? 2008-12-16 12:32 yes 2008-12-16 12:32 I was thinking the same 2008-12-16 12:32 in the ext3 case, maybe, it writes a page to journal 2008-12-16 12:33 I guess, because it already has space in journal 2008-12-16 12:33 the logic is very similar 2008-12-16 12:33 but, we don't have final position in that case? 2008-12-16 12:33 with the main difference that ext3 does not remap 2008-12-16 12:34 in which case? 2008-12-16 12:34 ->writepage 2008-12-16 12:34 right 2008-12-16 12:35 so, probably, we can't write a page like ext3 2008-12-16 12:35 in ->writepage 2008-12-16 12:35 if we do the mapping inside our ->writepage, we can 2008-12-16 12:36 and then we can convert to the delayed allocation style 2008-12-16 12:36 we will need to know the buffer flags accurately 2008-12-16 12:36 that is, our private flags that tell us which delta the buffer was dirty in 2008-12-16 12:36 ah 2008-12-16 12:37 so that ->get_block paths in the vfs have to be audited to make sure we always see the flags we set 2008-12-16 12:37 yes, probably 2008-12-16 12:37 if those flags get lost, then we can't go ->writepage that way 2008-12-16 12:37 can't do I mean 2008-12-16 12:38 ok, the chat has been helpful 2008-12-16 12:38 tonight's tux3 u, we will follow those paths 2008-12-16 12:38 and if you are going to be there, you need to sleep how ;) 2008-12-16 12:38 now 2008-12-16 12:39 yes :) 2008-12-16 12:39 I will fiddle with get_segs a little more today 2008-12-16 12:39 and fix the 
set_buffer_uptodate for holes thing 2008-12-16 12:40 http://userweb.kernel.org/~hirofumi/dleaf/ 2008-12-16 12:41 FWIW, it is dleaf stuff in progress 2008-12-16 12:41 it's worth a lot :) 2008-12-16 12:42 thanks :) 2008-12-16 12:42 ah 2008-12-16 12:42 why I think we need to walk buffers for write 2008-12-16 12:43 we want to make a bigger extent_count 2008-12-16 12:43 exactly 2008-12-16 12:43 so, I thought it is necessary 2008-12-16 12:43 I think get_segs can't do it 2008-12-16 12:44 we want to walk buffers before get_segs? 2008-12-16 12:44 that was that I was saying about the flag to tell it to remap 2008-12-16 12:44 ah, yes 2008-12-16 12:45 if we have a buffer/page walk that determines which logical region needs to be remapped, it can call get_segs for that region 2008-12-16 12:45 I think ext4's delayed alloc will be doing it 2008-12-16 12:46 and if it walked buffers, it should already know about flags 2008-12-16 12:46 yes 2008-12-16 12:46 and get_segs should be simple? 2008-12-16 12:46 yes, a lot like it is now 2008-12-16 12:46 i see 2008-12-16 12:46 but now it does not redirect 2008-12-16 12:47 yes 2008-12-16 12:47 so there will be some change inside, but the interface will be similar 2008-12-16 12:47 one small change I will make, return negative count for a hole 2008-12-16 12:47 it works out ok 2008-12-16 12:47 and it is kind of a strong way of making sure that the caller takes care of the hole 2008-12-16 12:48 or block == 0? 2008-12-16 12:48 that is the current interface 2008-12-16 12:48 negative count? 
2008-12-16 12:48 it bothers me, because maybe somebody wants to write block zero with this interface 2008-12-16 12:49 so even if it does not affect us right now, allowing block zero makes me feel the interface is cleaner 2008-12-16 12:49 this is a feeling thing ;) 2008-12-16 12:49 I gues, get_segs is not generic interface 2008-12-16 12:49 negative count for the segment that is a hole 2008-12-16 12:50 because fs need to call it directly from walker 2008-12-16 12:50 it might become a generic interface, if proves to be useful and flexible 2008-12-16 12:50 but for now, it is our internal interface 2008-12-16 12:50 right 2008-12-16 12:50 it would be a library interface 2008-12-16 12:50 like the block library 2008-12-16 12:51 and improvement over ->writepages, maybe 2008-12-16 12:51 but, who does it use? 2008-12-16 12:51 only us 2008-12-16 12:51 now 2008-12-16 12:51 later? 2008-12-16 12:51 maybe, someone want to use own interface? 2008-12-16 12:52 later if the interface is good we can offer a library call that plugs into ->writepages and uses the get_segs interface 2008-12-16 12:52 right 2008-12-16 12:52 anyway, it is all internal to us for now 2008-12-16 12:52 yes 2008-12-16 12:53 well, it is later 2008-12-16 12:53 yes 2008-12-16 12:53 now it just has to work ;) 2008-12-16 12:54 negative count may not be good... 2008-12-16 12:54 um... 2008-12-16 12:55 I will just post the negative count patch to the list 2008-12-16 12:55 and you can tell me if you hate it then :) 2008-12-16 12:55 ok :) 2008-12-16 12:56 seg[segs++] = (struct seg){ -gap }; 2008-12-16 12:56 simple :) 2008-12-16 12:56 user of it is? 
2008-12-16 12:56 the handling outside is simple too 2008-12-16 12:56 tux3_get_block and filemap_extent_io 2008-12-16 12:57 if (seg->count < 0) -seg->count else seg->count 2008-12-16 12:57 like that 2008-12-16 12:58 or if (seg->count < 0) seg->count = -seg->count; use it 2008-12-16 12:58 int count = seg[i].count, hole = count < 0; 2008-12-16 12:58 if (hole) 2008-12-16 12:58 count = -count; 2008-12-16 12:58 in filemap_extent_io 2008-12-16 12:58 i see 2008-12-16 12:58 if (count < 0) 2008-12-16 12:58 set_buffer_uptodate(bh_result); 2008-12-16 12:58 else { 2008-12-16 12:58 in tux3_get_block 2008-12-16 12:59 count must be signed 2008-12-16 12:59 it is now 2008-12-16 12:59 ah 2008-12-16 12:59 it doesn't have to be size_t 2008-12-16 13:00 struct seg also use signed? 2008-12-16 13:00 yes 2008-12-16 13:00 i see 2008-12-16 13:05 valgrind found a mistake, should be: seg[segs++] = (struct seg){ .count = -gap }; 2008-12-16 13:06 good 2008-12-16 13:06 ACTION wants to more test case with assert() 2008-12-16 13:07 yes, more asserts are better 2008-12-16 13:08 yes 2008-12-16 13:08 e.g. filemap doesn't use assert at all 2008-12-16 13:09 if someone try to do it, it would be good, and it would also help to know about tux3 2008-12-16 13:15 yes 2008-12-16 13:16 it is nice the way lots of people have been able to put in patches 2008-12-16 13:16 more of that is better 2008-12-16 13:16 yes 2008-12-16 13:17 ok, dleaf looks good 2008-12-16 13:17 pull? 2008-12-16 13:17 negative count patch is up 2008-12-16 13:17 far from it :) 2008-12-16 13:17 ok 2008-12-16 13:18 now, probe/next/back uses new cursor position 2008-12-16 13:19 next is I should fix the users of those 2008-12-16 13:23 new cursor position in what sense? 
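The negative-count convention for holes discussed above can be sketched as follows. This assumes a hypothetical struct seg whose first member is the block number; the helper names are illustrative and not taken from the tux3 source. With such a layout, the positional form (struct seg){ -gap } would initialize block instead of count, which may be the mistake valgrind flagged.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical segment layout: a start block followed by a signed count. */
struct seg {
	uint64_t block;	/* physical start block; unused for a hole */
	int count;	/* > 0: mapped blocks, < 0: hole of -count blocks */
};

/* Append a hole of `gap` blocks; returns the new number of segments. */
static int add_hole(struct seg *seg, int segs, int gap)
{
	assert(gap > 0);
	/* The designated initializer matters here: (struct seg){ -gap }
	 * would set the first member, block, and leave count zero. */
	seg[segs++] = (struct seg){ .count = -gap };
	return segs;
}

/* A segment is a hole when its count is negative. */
static int seg_is_hole(const struct seg *seg)
{
	return seg->count < 0;
}

/* Number of logical blocks covered, whether mapped or hole. */
static int seg_blocks(const struct seg *seg)
{
	return seg_is_hole(seg) ? -seg->count : seg->count;
}
```

Because holes are recognized by the sign of count rather than by block == 0, block zero remains usable as a real mapping, which is the cleanliness argument made earlier in the conversation.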
2008-12-16 13:23 old one was using next position of extent 2008-12-16 13:23 new one uses current extent 2008-12-16 13:24 and dwalk_next() will move position to next extent 2008-12-16 13:24 right 2008-12-16 13:24 it was done 2008-12-16 13:24 with dump and assert() 2008-12-16 13:25 I will read the patch in more detail while you sleep ;) 2008-12-16 13:25 ok :) 2008-12-16 13:25 I'll sleep 2008-12-16 13:25 oyasumi :) 2008-12-16 13:26 oyasumi :) 2008-12-16 13:51 I found another benefit 2008-12-16 13:52 we don't need to check !dleaf_groups(leaf) 2008-12-16 13:52 in some case 2008-12-16 13:52 because dwalk_first() and dwalk_end() is true in that case 2008-12-16 13:52 the patch was updated 2008-12-16 13:52 that is a good sign of a good approach 2008-12-16 13:53 dleaf/dleaf-cleanup-fix.patch? 2008-12-16 13:53 dleaf/dwalk_next-back-cleanup.patch 2008-12-16 13:54 !dleaf_groups was removed from next/back 2008-12-16 13:54 and maybe, removed from dwalk_chop 2008-12-16 13:55 and the amount of code is about the same, except for the new checks 2008-12-16 13:55 maybe a little less 2008-12-16 13:58 it looks like dwalk_probe was replaced by dwalk_probe_check 2008-12-16 14:02 dwalk_probe_check is just for sanity check 2008-12-16 14:03 well, code size would be same with old one 2008-12-16 14:04 I need to apply the patch and look at the result 2008-12-16 14:04 or you could check it in over there 2008-12-16 14:04 commit 2008-12-16 14:04 to a temporary repo 2008-12-16 14:05 http://userweb.kernel.org/~hirofumi/dleaf/tux3.tar.gz 2008-12-16 14:05 quilt :) 2008-12-16 14:05 I put the working dir 2008-12-16 14:06 yes, but it is whole files 2008-12-16 14:06 that was easy 2008-12-16 14:07 ark, used by firefox is a little lame 2008-12-16 14:07 no text search 2008-12-16 14:08 archive viewer? 
2008-12-16 14:08 yes 2008-12-16 14:09 ah, maybe old firefox 2008-12-16 14:10 firefox shows the dialog box for me 2008-12-16 14:10 choose viewer or download 2008-12-16 14:11 well, I'll sleep :) 2008-12-16 14:11 for real :) 2008-12-16 14:11 dleaf.c looks good 2008-12-16 14:11 easier to read your way 2008-12-16 14:19 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-12-16 14:29 ah, writing posts about computer code is good 2008-12-16 14:29 I noticed a monster bug in filemap.c 2008-12-16 14:31 get_segs can't fit its new list of segments in the dleaf on write, it splits the leaf and repeats the whole process of scanning extents and allocating new blocks 2008-12-16 14:31 not so good 2008-12-16 14:31 it just lost all the blocks allocated on the failed attempt 2008-12-16 14:32 it needs to split the leaf and insert the list of extents it already has 2008-12-16 15:01 -!- Stonekeeper(~lea@81.168.113.60) has joined #tux3 2008-12-16 15:02 hi. i only found out about tux3 about 10 mins ago so sorry if this is a dumb question: will tux3 have quota capability and if so, will older versions of files be counted against that quota? MAny thanks. 2008-12-16 15:26 yes, quota is planned 2008-12-16 15:27 all versions better be counted against quota 2008-12-16 15:28 quotas on subdirectories would be nice too, if we can figure out what that means for hard links 2008-12-16 15:42 the problem i have with that is that people can reach quota without really realising it. They will say "I have 2x100M files but my gig quota is full" 2008-12-16 15:43 it also makes it really really hard to try to determine a fair quota for users 2008-12-16 15:43 unless they have a really really easy way of killing previous versions 2008-12-16 16:07 stonekeeper, why don't you post the question to the tux3 mailing list? 2008-12-16 16:07 it's interesting 2008-12-16 16:08 and I will think about it while I go for my skate 2008-12-16 16:13 :) 2008-12-16 16:13 meanwhile i'm off to bed. 
nn :) 2008-12-16 16:13 -!- Stonekeeper(~lea@81.168.113.60) has left #tux3 2008-12-16 16:48 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-16 16:54 -!- inverse(~michael@d141-25-32.home.cgocable.net) has joined #tux3 2008-12-16 17:09 i think that would be fine. the files would only count towards their quota more than once for the data which has changed between snapshots 2008-12-16 17:09 so if they change 10 mb of data after a snapshot is set, they will be using 210 mb 2008-12-16 17:10 it would be nice to be able to exclude directories from snapshotting as well 2008-12-16 17:11 such a model of accounting would create a disconnect between what you have and what you can get 2008-12-16 17:12 you can ask for 410mb of files while you're storing 210 total 2008-12-16 17:12 so then do you report 210 to the user and 410 to the admin? 2008-12-16 17:17 huh? 2008-12-16 17:17 why 410? 2008-12-16 17:17 you already have a disconnect between reported size and quota usage 2008-12-16 17:17 such as sparse files 2008-12-16 17:18 true 2008-12-16 17:19 but then how do you estimate what you need for backup? 2008-12-16 17:19 you can look at the block count 2008-12-16 17:19 for sparse files 2008-12-16 17:19 and what happens for true versioning? 
2008-12-16 17:20 well since each snapshot has its own namespace, it will just report what exists in the current namespace i guess 2008-12-16 17:20 i think other snapshotting filesystems just don't care about quota usage in snapshots 2008-12-16 17:21 although i could be remembering wrong 2008-12-16 17:21 i suppose it could be configurable, file-system wide, or current-snapshot-only quotas 2008-12-16 17:22 estimation of backup space is kinda important, unless we'd invent a type format that's tux3 compatible and be versioning aware 2008-12-16 17:23 well it would be a good estimate, assuming you are backing up just one snapshot 2008-12-16 17:24 with zumastor/ddsnap, quota only applies to the latest version 2008-12-16 17:24 so do we have multiple strategies for backing up the current 'view' of things, vs block level copy of the partition? 2008-12-16 17:24 depends how you plan on backing up i suppose 2008-12-16 17:25 most likely i would say you take a snapshot, and make a copy of it on to some other media 2008-12-16 17:25 differential and incremental models change quite a bit if you got snapshots going 2008-12-16 17:26 depends what you're trying to do i guess, just makes it easier really 2008-12-16 17:26 what scenario are you thinking 2008-12-16 17:26 and what if we're not using versioning for temporal versioning, but let's say doing polyinstantiation? 2008-12-16 17:26 i'm thinking i need to stop reading tux3 irc and get back to my real work ;) 2008-12-16 17:27 at 8pm ? 
2008-12-16 17:27 let's just say deadlines start with 'dead' for a reason ;) 2008-12-16 17:28 haha 2008-12-16 17:45 back 2008-12-16 17:45 was quiet on the boardwalk 2008-12-16 17:47 not caring about quota use on snapshots means a user can just keep rewriting their own data and fill up the volume 2008-12-16 17:47 in Zumastor, we used snapshot autodelete to solve a similar problem 2008-12-16 17:48 I think it is applicable here 2008-12-16 17:48 flips, dont think just backups, think other applications of versioning 2008-12-16 17:49 yes 2008-12-16 17:50 in Zumastor/ddsnap we gave the user a settable snapshot priority 2008-12-16 17:50 so the user can say what versions are important, and which should not be deleted (therefore should count towards quota) 2008-12-16 17:52 i can tell you in my environments autodeleting anything is not an option 2008-12-16 17:53 then you can set it so it doesn't happen :) 2008-12-16 17:54 well, snapshots held for replication are naturally auto-deletable 2008-12-16 17:54 please make defaults sane ;) 2008-12-16 17:54 "never delete anything except snapshots for replication" 2008-12-16 17:54 better ;) 2008-12-16 17:54 probably not a sane default, because it leads to volumes going enospc too much 2008-12-16 17:56 well, versioning adds a whole dimension (literally, if you use it for temporal versioning) of complexity--you're gonna seriously have to write a set of guidelines of how to deal with the new fs without killing your systems 2008-12-16 17:56 autodeleting snapshots sounded like a really crazy idea when we first started doing it in zumastor, but it proved to be very natural 2008-12-16 17:57 well yea, that makes sense the moment you explained what you autodeleted 2008-12-16 18:10 shapor, to make quota even more interesting, we are going to have snapshots of subdirectories 2008-12-16 18:28 fun 2008-12-16 18:29 we should probably brainstorm about use cases and interfaces on the mailing list 2008-12-16 18:29 get input from all the lurkers who probably 
care a lot about it 2008-12-16 18:29 but were afraid to ask :P 2008-12-16 19:04 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-12-16 19:45 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-16 19:56 -!- rmull(~rmull@busw043-0b01-dhcp81.bu.edu) has joined #tux3 2008-12-16 20:00 hi 2008-12-16 20:01 -!- ChanServ changed mode/#tux3 -> +o flips 2008-12-16 20:01 howdy 2008-12-16 20:01 tux3U tonight? 2008-12-16 20:02 -!- flips changed topic to "http://tux3.org ~ Tux3 University, right here Tuesdays and Thursdays at 8 pm Pacific Time ~ Next session: Chasing the elusive buffer flag" 2008-12-16 20:02 maybe :) 2008-12-16 20:02 depends who's here 2008-12-16 20:02 heh 2008-12-16 20:02 Pamina's cooking 2008-12-16 20:02 :) 2008-12-16 20:02 ACTION is still at school :-( 2008-12-16 20:03 well I will just start mumbling 2008-12-16 20:03 and maybe we will return to the topic above on thursday 2008-12-16 20:03 so, the hot tux3 issue is atomic commit 2008-12-16 20:03 and we are down there in the nitty gritty building pieces 2008-12-16 20:03 that started seriously a few days ago 2008-12-16 20:04 looks like it 2008-12-16 20:04 with me hacking on filemap.c and Hirofumi cleaning up the dwalk api 2008-12-16 20:04 there is a post up on the mailing list that is highly relevant 2008-12-16 20:05 http://mailman.tux3.org/pipermail/tux3/2008-December/000505.html 2008-12-16 20:05 The get_segs interface 2008-12-16 20:05 in there I talk about redirect on write, and how we don't actually need it right now 2008-12-16 20:05 and actually, that is wrong 2008-12-16 20:06 we do need it, for directories 2008-12-16 20:06 directories are metadata, and we can't use loose semantics for getting directory contents onto disk 2008-12-16 20:07 so that means, we need to be looking at the states of buffer_heads when writing out 2008-12-16 20:08 to see if we need to fork a buffer (if the buffer is dirty in a previous delta) 2008-12-16 20:09 to do that, we need to keep 
our own filesystem-private flags in the buffer_heads, and we need to be sure that the vfs will respect those flags, not throw them away, and provide them to us at the right time when it calls ->get_block 2008-12-16 20:09 I thought that would be a good tux3 u project 2008-12-16 20:10 to go chasing through lxr and see what it does with those buffers 2008-12-16 20:10 and bitmap blocks? 2008-12-16 20:10 yes 2008-12-16 20:10 not versioned, but need atomic commit 2008-12-16 20:11 and eventually, we will hit an interesting recursion there, where we write out a bitmap block atomically, and need to reallocate it, which changes a bitmap block 2008-12-16 20:11 this will be a fun one 2008-12-16 20:12 ah, yes 2008-12-16 20:12 we really need to have the details of the write algorithm before knowing how exactly to solve the recursion 2008-12-16 20:13 buffer forking may solve the problem already 2008-12-16 20:13 hirofumi, you didn't sleep much ;) 2008-12-16 20:14 yes :) 2008-12-16 20:14 I put an hour into chasing details of md superblock positioning 2008-12-16 20:14 what a mess that code is 2008-12-16 20:15 please let that not be us :) 2008-12-16 20:15 ah, no in english 2008-12-16 20:15 right 2008-12-16 20:15 so, I'll sleep again :) 2008-12-16 20:15 good plan 2008-12-16 20:16 we can try this again on thursday 2008-12-16 20:18 so, we clear first 4kb and last 128kb 2008-12-16 20:18 ? 2008-12-16 20:20 if we want to clear the 0.9 md sb 2008-12-16 20:21 if we want to do 1.0 sb, only last 12 k 2008-12-16 20:21 and 1.1, 1.2, nothing at the top 2008-12-16 20:21 I don't think we really need to care about 0.9 md 2008-12-16 20:22 but I'm willing to be corrected :) 2008-12-16 20:22 but, maybe we want to clear EFI partition? 2008-12-16 20:22 ah, I don't know anything about that. Pointer? 
2008-12-16 20:22 wait a bit 2008-12-16 20:23 http://en.wikipedia.org/wiki/EFI_System_Partition 2008-12-16 20:23 ah, yes 2008-12-16 20:23 http://developer.intel.com/technology/efi/efi.htm 2008-12-16 20:24 wikipedia is getting to be a nice code resource 2008-12-16 20:25 yes 2008-12-16 20:25 I don't see how it invades our partition 2008-12-16 20:26 ah, somebody might think we are an efi partition 2008-12-16 20:26 yes 2008-12-16 20:26 even if we used whole disk 2008-12-16 20:27 http://en.wikipedia.org/wiki/GUID_Partition_Table 2008-12-16 20:27 count on big companies to do stupid things 2008-12-16 20:29 after v1.02, it seems to require the mbr 2008-12-16 20:30 so, it's not problem 2008-12-16 20:30 good 2008-12-16 20:30 but, before version, clear mbr is not enough 2008-12-16 20:31 in linux, if user used "force_gpt" option, it skip mbr check 2008-12-16 20:31 mkfs.tux3 --clear=full :-) 2008-12-16 20:31 yes :) 2008-12-16 20:32 well, it would be --bad-block-check 2008-12-16 20:32 that too 2008-12-16 20:32 mkfs with surface scan 2008-12-16 20:33 one nice little feature we can provide is, mount -o mkfs 2008-12-16 20:33 because our filesystem create code is tiny, it can go in kernel 2008-12-16 20:34 ah 2008-12-16 20:34 yes 2008-12-16 21:20 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-12-16 21:42 time to fiddle with get_segs a little more 2008-12-16 21:42 ah 2008-12-16 21:42 need to fix the big bug 2008-12-16 21:42 we can't retry writes like that 2008-12-16 21:57 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-16 22:06 hirofumi, there? 
2008-12-16 23:02 -!- RazvanM(~RazvanM@96.234.232.67) has joined #tux3 2008-12-16 23:50 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-17 02:40 -!- kbingham_(~kbingham@82-46-4-172.cable.ubr03.aztw.blueyonder.co.uk) has joined #tux3 2008-12-17 02:52 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2008-12-17 03:20 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2008-12-17 04:03 -!- ceatinge(~ceatinge@veryclever.net) has joined #tux3 2008-12-17 04:03 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2008-12-17 04:03 -!- rmull(~rmull@busw043-0b01-dhcp81.bu.edu) has joined #tux3 2008-12-17 04:03 -!- samlh(~sam@67.129.121.145) has joined #tux3 2008-12-17 04:03 -!- Man_of_Wax(~wax@gualtiero.cs.unibo.it) has joined #tux3 2008-12-17 04:03 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-12-17 04:03 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-12-17 04:03 -!- mlankhorst(~m@fw1.astro.rug.nl) has joined #tux3 2008-12-17 04:03 -!- pgquiles__(~pgquiles@143.Red-79-154-136.staticIP.rima-tde.net) has joined #tux3 2008-12-17 04:03 -!- kbingham_(~kbingham@82-46-4-172.cable.ubr03.aztw.blueyonder.co.uk) has joined #tux3 2008-12-17 04:03 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2008-12-17 04:03 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-17 04:03 -!- RazvanM(~RazvanM@96.234.232.67) has joined #tux3 2008-12-17 04:03 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-12-17 04:03 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-12-17 04:45 -!- Man_of_W1x(~wax@gualtiero.cs.unibo.it) has joined #tux3 2008-12-17 04:59 -!- Man_of_Wax(~wax@gualtiero.cs.unibo.it) has joined #tux3 2008-12-17 05:30 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-17 05:58 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-12-17 07:13 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-17 07:36 -!- 
tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-17 09:51 -!- mingming(~mingming@c-71-193-163-244.hsd1.or.comcast.net) has joined #tux3 2008-12-17 10:13 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-17 10:49 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-17 11:15 hirofumi, there? 2008-12-17 11:15 hi 2008-12-17 11:16 I was thinking, if we make cursor point at its btree, some code cleanups are possible 2008-12-17 11:16 the btree depth code gets clean, and we can reduce some parameters 2008-12-17 11:17 well, it can 2008-12-17 11:17 however, ->len != btree->depth 2008-12-17 11:18 but we can have cursor->btree.depth 2008-12-17 11:18 e.g. if we added new depth, it can be not equal 2008-12-17 11:19 len is just the allocation size of the cursor, right? 2008-12-17 11:19 it is current stack length 2008-12-17 11:20 maxlen is allocation size 2008-12-17 11:20 oops 2008-12-17 11:20 and maxlen is only used for debugging 2008-12-17 11:21 still, what is the problem with having a btree field in the cursor? 2008-12-17 11:21 I don't get it yet 2008-12-17 11:21 however, I don't have any objection to add btree to cursor 2008-12-17 11:21 I also thought it 2008-12-17 11:23 5 functions in btree.c take a btree and a cursor 2008-12-17 11:24 there is a lot of code that maintains depth separately from the btree 2008-12-17 11:24 maintains depth? 2008-12-17 11:24 keeps a separate depth variable 2008-12-17 11:25 on dwalk_probe... 2008-12-17 11:26 there is this code in filemap.c: 2008-12-17 11:26 for (struct diskextent *extent; (extent = dwalk_next(walk));) 2008-12-17 11:26 if (dwalk_index(walk) + extent_count(*extent) > key) { 2008-12-17 11:26 if (dwalk_index(walk) <= key) 2008-12-17 11:26 dwalk_back(walk); 2008-12-17 11:26 break; 2008-12-17 11:26 } 2008-12-17 11:26 I thought it too, reading todo list 2008-12-17 11:26 ... 
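The loop quoted above can be read as a seek: find the first extent that ends past key, then check whether key actually lies inside it or in a gap before it. A minimal sketch of that idea over a flat sorted array, assuming a hypothetical struct extent and helper names rather than the real dwalk API or dleaf layout:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flat extent list; the real dleaf groups extents differently. */
struct extent {
	unsigned long index;	/* first logical block covered */
	unsigned count;		/* number of blocks covered */
};

/* Return the position of the first extent that ends after key,
 * or n if every extent ends at or before key. */
static size_t extent_probe(const struct extent *ext, size_t n, unsigned long key)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (ext[i].index + ext[i].count > key)
			break;
	return i;
}

/* True if key falls inside the extent, false if it lies in a gap before it. */
static int extent_covers(const struct extent *ext, unsigned long key)
{
	return ext->index <= key && key < ext->index + ext->count;
}
```

With a cursor that already rests on the current extent, as in the new dwalk position described earlier, a caller of this kind of probe would not need a separate step back.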
2008-12-17 11:27 all that code does is a dwalk_probe, and goes back one if it lands in the middle of an extent 2008-12-17 11:27 is this related to cursor->btree? 2008-12-17 11:28 not at all 2008-12-17 11:28 yes 2008-12-17 11:28 just two things that came up while I was working on filemap.c 2008-12-17 11:28 it does back 2008-12-17 11:28 ah 2008-12-17 11:29 back will be fixed with new dwalk position 2008-12-17 11:29 that code should just be something like dwalk_probe(...); if (some test) dwalk_back(..); 2008-12-17 11:29 maybe, we don't need to back 2008-12-17 11:29 -!- pranihome(~bobby@122.162.71.226) has joined #tux3 2008-12-17 11:30 flips, there? 2008-12-17 11:30 yes 2008-12-17 11:30 seems there is a bug in current tux3 2008-12-17 11:30 details? 2008-12-17 11:31 mount /dev/ubdb /mnt 2008-12-17 11:31 cd /mnt 2008-12-17 11:31 vim hello.c 2008-12-17 11:31 write something, then do :wq 2008-12-17 11:31 first.c E667: Fsync failed 2008-12-17 11:31 Warning: Original file may be lost or damaged 2008-12-17 11:32 probably, it is tux3_get_block() 2008-12-17 11:32 don't quit the editor until the file is succesfully written! 2008-12-17 11:32 there should be tons of tracing output in you message log 2008-12-17 11:32 hmm 2008-12-17 11:32 is it reproducible there? 2008-12-17 11:33 probably 2008-12-17 11:33 what do you see in /var/log/messages? 2008-12-17 11:33 :wq gives the above error, but somehow saves the file... 2008-12-17 11:33 :q!, ls shows hello.c 2008-12-17 11:34 trying in uml 2008-12-17 11:34 just a min.. ill paste /var/log/messages 2008-12-17 11:34 yes, in uml 2008-12-17 11:37 bah, copy paste is a waste in virtual console :( 2008-12-17 11:38 pranihome, same here, error message from vi while file is written successfully 2008-12-17 11:39 http://rafb.net/p/NZmjH121.html 2008-12-17 11:39 oh, ok 2008-12-17 11:39 no need of the paste then :) 2008-12-17 11:39 still useful 2008-12-17 11:39 flips, can you walk me through the log... 2008-12-17 11:40 once? 
2008-12-17 11:40 :) 2008-12-17 11:40 that sounds useful 2008-12-17 11:40 which? 2008-12-17 11:41 line number? 2008-12-17 11:41 :( 2008-12-17 11:41 well the first word on each tracing line is the function that produced the output 2008-12-17 11:41 so you go to the source to see what it means 2008-12-17 11:41 ok 2008-12-17 11:41 tux3_fill_super first 2008-12-17 11:42 ok, super.c:196 2008-12-17 11:43 hirofumi, and reason for printing out the itable.ops? 2008-12-17 11:43 just for debug 2008-12-17 11:43 it can read and initialized properly? 2008-12-17 11:44 we don't need it anymore 2008-12-17 11:44 probably not 2008-12-17 11:44 ok, lookup_inode 2008-12-17 11:45 ino = 0 2008-12-17 11:45 ok fill_super initializes a super block and all the trees... 2008-12-17 11:46 lookup inode is at ileaf_lookup:114 2008-12-17 11:47 so let's look at our special inodes 2008-12-17 11:47 ok 2008-12-17 11:48 in tux3.h 2008-12-17 11:48 special as in? 2008-12-17 11:48 e.g., TUX_BITMAP_INO 2008-12-17 11:48 okkk 2008-12-17 11:48 0, 1, 2, 10, 13 2008-12-17 11:48 so it's opening the allocation bitmap 2008-12-17 11:49 oh.. nice 2008-12-17 11:49 tracing line 8, new_xcache: realloc xcache to 4, is something that should be fixed 2008-12-17 11:50 there is no xattr, so should not create an xattr cache for the inode 2008-12-17 11:50 there is no xattr for bitmap inode? 2008-12-17 11:50 no 2008-12-17 11:50 hmm 2008-12-17 11:50 ok.. 2008-12-17 11:50 lazy programmer (me) always creates the xcache whether need or not 2008-12-17 11:51 :) 2008-12-17 11:51 next are the attributes found for the inode 2008-12-17 11:51 ok 2008-12-17 11:51 now its looking up inode 4 2008-12-17 11:51 0xd 2008-12-17 11:51 see dump_attrs in iattr.c 2008-12-17 11:52 then a bunch more inodes 2008-12-17 11:52 just opens them whether needed or not... more cleanup needed there 2008-12-17 11:52 should only open when needed 2008-12-17 11:52 hmm 2008-12-17 11:53 what is the loop in dump_attrs? 2008-12-17 11:53 k 1 to 32? 
2008-12-17 11:53 0 to 32* 2008-12-17 11:53 it loops over the attribute present bits 2008-12-17 11:53 we don't necessary store every attribute in an inode 2008-12-17 11:53 variable sized inodes 2008-12-17 11:54 yup, that is not yet implemented right? 2008-12-17 11:54 variable size inode? 2008-12-17 11:54 or is it? 2008-12-17 11:54 ACTION is not sure 2008-12-17 11:54 the dwalk_probe/next/back calls are very detailed dumps of the walking through a file btree leaf 2008-12-17 11:54 it is implemented 2008-12-17 11:54 works great 2008-12-17 11:55 that's what the attr present bits are for 2008-12-17 11:55 :) ok 2008-12-17 11:55 hmm 2008-12-17 11:56 looking at dleaf.c for dtree ops.. 2008-12-17 11:56 tux3_get_block calls are the main interface between vfs and tux3 files 2008-12-17 11:58 we can turn off most of this tracing output now 2008-12-17 11:59 hmm 2008-12-17 11:59 the walk_ tracing anyway 2008-12-17 11:59 only needs to be on for debugging 2008-12-17 11:59 about that.. 2008-12-17 11:59 i was not able to debug properly using gdb 2008-12-17 12:00 dont we need a -g flag for proper debugging? 2008-12-17 12:00 in our makefiles? 2008-12-17 12:00 should be in the makefile by default 2008-12-17 12:00 for uml 2008-12-17 12:00 DEBUG=1 needs to be passed? 2008-12-17 12:00 hmm 2008-12-17 12:00 ok 2008-12-17 12:01 just gdb -args ./linux ... 2008-12-17 12:01 then cont 2008-12-17 12:01 a bunch of times 2008-12-17 12:01 yeah, there were some places where variables were not accessible .. 2008-12-17 12:01 because of inlines 2008-12-17 12:01 oh.. 2008-12-17 12:02 hirofumi, how do we compile kernel with -O0 ? 2008-12-17 12:02 CFLAGS= ? 2008-12-17 12:02 we can set it in our local makefile? 2008-12-17 12:03 oh 2008-12-17 12:03 make CFLAGS=O0 ? 2008-12-17 12:03 maybe 2008-12-17 12:03 however, some codes may require some optimize level 2008-12-17 12:04 I think akpm compiles with O0 when he debugs 2008-12-17 12:04 um... 
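The "attribute present bits" idea above (only attributes whose bit is set are stored, so inodes vary in size) can be sketched roughly as follows. This is a toy illustration, not the code in iattr.c: `attr_present`, `inode_attr_bytes`, `MAX_ATTR_KINDS` and the `sizes` table are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: one bit per attribute kind, matching the 0..31 loop in the chat. */
enum { MAX_ATTR_KINDS = 32 };

/* Nonzero if the inode stores an attribute of this kind. */
static int attr_present(uint32_t present, unsigned kind)
{
    assert(kind < MAX_ATTR_KINDS);
    return (present >> kind) & 1;
}

/* Sum the encoded sizes of only those attributes that are present,
 * which is why the inode ends up variable sized. */
static unsigned inode_attr_bytes(uint32_t present,
                                 const unsigned sizes[MAX_ATTR_KINDS])
{
    unsigned total = 0;
    for (unsigned kind = 0; kind < MAX_ATTR_KINDS; kind++)
        if (attr_present(present, kind))
            total += sizes[kind];
    return total;
}
```

With a present mask of 0xd, only kinds 0, 2 and 3 contribute to the inode size; every other kind costs nothing.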
2008-12-17 12:04 at some time, the kernel would break if not O2 2008-12-17 12:04 well, I don't use debugger at all for kernel 2008-12-17 12:05 :O 2008-12-17 12:05 only trace? 2008-12-17 12:05 yes 2008-12-17 12:05 so, need to find out where an error code is escaping 2008-12-17 12:05 well, mainly read the source, then add trace 2008-12-17 12:06 first is strace or something 2008-12-17 12:06 to find start point 2008-12-17 12:06 i tried strace with vim :D 2008-12-17 12:06 then, read from start 2008-12-17 12:07 and think about problem :) 2008-12-17 12:07 strace craps out too much output 2008-12-17 12:07 got an error code from fsync 2008-12-17 12:07 yup 2008-12-17 12:07 667 2008-12-17 12:07 :) 2008-12-17 12:07 667? 2008-12-17 12:07 yes 2008-12-17 12:07 E667 2008-12-17 12:07 fsync failed 2008-12-17 12:08 very popular according to google :) 2008-12-17 12:09 http://lxr.linux.no/linux+v2.6.27/fs/sync.c#L84 2008-12-17 12:10 so I will naturally do b do_fsync 2008-12-17 12:10 we have a do_fsync? 2008-12-17 12:10 I can't see why it is 667 2008-12-17 12:10 uml bug? 2008-12-17 12:11 hirofumi, no 2008-12-17 12:11 it was working fine a week ago 2008-12-17 12:11 this started happening 2 days ago afair 2008-12-17 12:11 strace really work? 2008-12-17 12:11 2-3 2008-12-17 12:11 hirofumi, probably not a uml bug 2008-12-17 12:11 uml depends on ptrace 2008-12-17 12:11 it is not very strace friendly 2008-12-17 12:12 well, anyway, I think this problem is tux3_get_block() 2008-12-17 12:12 try gdb -args ./linux 2008-12-17 12:12 it's fun :) 2008-12-17 12:13 could be 2008-12-17 12:13 better, try gdb in emacs :D 2008-12-17 12:13 Breakpoint 1, do_fsync (file=0x9d71540, datasync=0) at fs/sync.c:84 2008-12-17 12:13 84 if (!file->f_op || !file->f_op->fsync) { 2008-12-17 12:13 yes, I did it some times 2008-12-17 12:13 in past 2008-12-17 12:13 :) 2008-12-17 12:13 kvm also can do it 2008-12-17 12:14 somehow, I don't use it now 2008-12-17 12:14 uml runs that code?
2008-12-17 12:14 -EINVAL 2008-12-17 12:16 yippee, i too got that 2008-12-17 12:16 :D 2008-12-17 12:16 (gdb) p ret $1 = 2008-12-17 12:16 :( 2008-12-17 12:17 gdb sucks 2008-12-17 12:17 no 2008-12-17 12:17 it's why we need to compile with O0 2008-12-17 12:17 -9 is for -EINVAL? 2008-12-17 12:17 it is gcc problem 2008-12-17 12:18 where are the error codes? linux/kernel.h? 2008-12-17 12:18 asm-generic/errno.h 2008-12-17 12:18 ok.. 2008-12-17 12:18 -9 is EBADF 2008-12-17 12:19 yup :) 2008-12-17 12:19 and the file is asm-generic/errno-base.h 2008-12-17 12:19 now, what is a bad file number??? :) 2008-12-17 12:20 -9 is from where code? 2008-12-17 12:21 __do_fsync 2008-12-17 12:21 sync.c:108 2008-12-17 12:21 it is after do_fsync(), or before? 2008-12-17 12:21 if (file) { ret = do_sync();} 2008-12-17 12:21 hmm, i did a next from do_fsync.. 2008-12-17 12:22 it came to __do_fsync... 2008-12-17 12:22 then i did a "p ret" 2008-12-17 12:22 it depends on optimized code 2008-12-17 12:23 (gdb) n 2008-12-17 12:23 __do_fsync (fd=, datasync=0) at fs/sync.c:116 2008-12-17 12:23 (gdb) p ret 2008-12-17 12:23 $3 = -9 2008-12-17 12:23 (gdb) up 2008-12-17 12:23 #1 0x000000006008d0f9 in sys_fsync (fd=1641645696) at fs/sync.c:123 2008-12-17 12:23 (gdb) p fd 2008-12-17 12:23 $4 = 1641645696 2008-12-17 12:23 hmm, need to do something about this optimization.. trying -O0 in our makefile 2008-12-17 12:23 disass 2008-12-17 12:24 ? 2008-12-17 12:24 gdb command 2008-12-17 12:24 shud i give that? 2008-12-17 12:26 check it 2008-12-17 12:26 disassemble do_fsync 2008-12-17 12:26 info regs 2008-12-17 12:27 http://osdir.com/ml/kernel.crash-dump.crash-utility/2006-09/msg00041.html 2008-12-17 12:27 re O0 2008-12-17 12:27 http://rafb.net/p/Q6GelD71.html 2008-12-17 12:29 it is current code 2008-12-17 12:29 info registers? 
2008-12-17 12:29 0x000000006008d074 : callq 0x60178170 2008-12-17 12:30 0x000000006008d079 : mov %r13,%rdi 2008-12-17 12:30 0x000000006008d07c : callq 0x600509b4 2008-12-17 12:30 0x000000006008d081 : test %ebx,%ebx 2008-12-17 12:30 0x000000006008d083 : jne 0x6008d08e 2008-12-17 12:30 0x000000006008d085 : mov %eax,%ebx 2008-12-17 12:30 0x000000006008d087 : jmp 0x6008d08e 2008-12-17 12:30 0x000000006008d089 : mov $0xffffffea,%ebx 2008-12-17 12:30 0x000000006008d08e : movslq %ebx,%rax 2008-12-17 12:30 this is disassemble do_fsync 2008-12-17 12:30 (gdb) info registers 2008-12-17 12:30 rax 0xffffffffffffffea -22 2008-12-17 12:30 rbx 0x61e79b40 1642568512 2008-12-17 12:30 rcx 0x1 1 2008-12-17 12:30 rdx 0x1 1 2008-12-17 12:31 hmm, rax has -22 2008-12-17 12:31 that is -EINVAL 2008-12-17 12:31 rip? 2008-12-17 12:31 well, it seems -EINVAL 2008-12-17 12:31 rip 0x6008d0c8 0x6008d0c8 <__do_fsync+46> 2008-12-17 12:32 after return from do_fsyn 2008-12-17 12:32 ok, so do_fsync returned -EINVAL 2008-12-17 12:32 fsckin gdb, says it -9 :( 2008-12-17 12:32 yes 2008-12-17 12:32 gdb is right 2008-12-17 12:33 it is still on register 2008-12-17 12:33 : mov $0xffffffea,%ebx 2008-12-17 12:33 0x000000006008d08e : movslq %ebx,%rax 2008-12-17 12:33 this is disassemble do_fsync 2008-12-17 12:33 not stored to memory 2008-12-17 12:33 well, so kernel will return -EINVAL 2008-12-17 12:33 this is -22 in %ebx->%rax 2008-12-17 12:33 yes 2008-12-17 12:33 -EINVAL it is :) 2008-12-17 12:34 other thing would be userland and uml 2008-12-17 12:34 now the question is why -EINVAL... 2008-12-17 12:34 userland? 2008-12-17 12:34 im not a chicken to try userland.. i do everything in kernel :D 2008-12-17 12:34 hehe 2008-12-17 12:34 strace or glibc or vim 2008-12-17 12:34 uml is usually right about bugs like this 2008-12-17 12:36 strace on vim is horrible... 2008-12-17 12:36 too much outputs.. breaks my terminal 2008-12-17 12:37 well, this is not fsync problem 2008-12-17 12:37 no error return from do_fsync? 
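The register reading above (`mov $0xffffffea,%ebx` followed by `movslq %ebx,%rax`, giving rax = -22) can be checked by hand. A minimal sketch, assuming Linux errno numbering; `sign_extend_32` and `errno_from_return` are helper names made up for this illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Sign-extend a 32-bit value to 64 bits, as "movslq %ebx,%rax" does. */
static int64_t sign_extend_32(uint32_t value)
{
    if (value & 0x80000000u)
        return (int64_t)value - ((int64_t)1 << 32);
    return (int64_t)value;
}

/* Turn a negative kernel-style return value into a positive errno number.
 * The kernel reserves -1..-4095 for error codes. */
static int errno_from_return(int64_t ret)
{
    assert(ret < 0 && ret >= -4095);
    return (int)-ret;
}
```

Here 0xffffffea sign-extends to -22, which is -EINVAL, while the -9 that gdb printed for `ret` corresponds to -EBADF.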
hirofumi, ?? 2008-12-17 12:37 it returns, however -EINVAL is the right behavior 2008-12-17 12:38 for now 2008-12-17 12:38 but why -EINVAL? 2008-12-17 12:38 reason? 2008-12-17 12:38 we don't implement it yet 2008-12-17 12:38 we are not implementing 2008-12-17 12:38 fsync? 2008-12-17 12:39 yes 2008-12-17 12:39 hmm 2008-12-17 12:39 if (!file->f_op || !file->f_op->fsync) { 2008-12-17 12:39 /* Why? We can still call filemap_fdatawrite */ 2008-12-17 12:39 ret = -EINVAL; 2008-12-17 12:39 goto out; 2008-12-17 12:39 } 2008-12-17 12:39 :) 2008-12-17 12:39 yes 2008-12-17 12:40 write error is a tux3_get_block() bug 2008-12-17 12:40 if I was able to compile kernel without inlines and optimization, I would have seen it within 2 minutes 2008-12-17 12:40 :) 2008-12-17 12:40 so... got to figure out how to do it 2008-12-17 12:41 check gcc 2008-12-17 12:41 ? 2008-12-17 12:41 how come this was working earlier? 2008-12-17 12:41 which one does it use, the first option or the second? 2008-12-17 12:42 if it uses last -O, fs/tux3/Makefile can do it 2008-12-17 12:42 are you talking about inlines/optimization? 2008-12-17 12:42 ah 2008-12-17 12:42 just about -O 2008-12-17 12:43 Makefile:KBUILD_CFLAGS += -O2 2008-12-17 12:43 I'll build with that off, see if it boots 2008-12-17 12:46 ok, KCFLAGS=-O0 would work 2008-12-17 12:47 and then turning off inlining 2008-12-17 12:47 and CONFIG_DEBUG_INFO is recommended 2008-12-17 12:47 hirofumi, where do we pass this flag? during make? 2008-12-17 12:47 make KCFLAGS=-O0 2008-12-17 12:48 ok 2008-12-17 12:48 I can't remember if dwarf can handle inline or not 2008-12-17 12:48 flips, what about that wiki?? it will really be helpful 2008-12-17 12:48 I'll bug shapor :) 2008-12-17 12:48 :) 2008-12-17 12:48 ok 2008-12-17 12:49 how do we know it really compiled with O0? 2008-12-17 12:49 make KCFLAGS=-O0 V=1 2008-12-17 12:49 :) 2008-12-17 12:49 ok, so make KCFLAGS=-O0 linux ARCH=um CONFIG_TUX3=y CONFIG_DEBUG_INFO=y 2008-12-17 12:50 this should do for uml?
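The do_fsync check quoted above explains the E667 message: a filesystem that has no fsync operation gets -EINVAL back before anything else runs. A user-space mock of just that check, assuming nothing about the rest of fs/sync.c; every `mock_` name is invented for this sketch:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct mock_file;

/* Stand-in for the one file operation this sketch cares about. */
struct mock_file_operations {
    int (*fsync)(struct mock_file *file, int datasync);
};

struct mock_file {
    const struct mock_file_operations *f_op;
};

/* A trivial fsync that always succeeds, for comparison. */
static int mock_fsync_ok(struct mock_file *file, int datasync)
{
    (void)file;
    (void)datasync;
    return 0;
}

/* Same guard as the quoted kernel snippet: no fsync op means -EINVAL. */
static int mock_do_fsync(struct mock_file *file, int datasync)
{
    assert(file != NULL);
    if (!file->f_op || !file->f_op->fsync)
        return -EINVAL;
    return file->f_op->fsync(file, datasync);
}
```

Supplying any fsync method, even a stub like `mock_fsync_ok`, is enough to get past this guard, which fits the observation that the file itself was written correctly.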
:) 2008-12-17 12:50 TUX3 and DEBUG_INFO can be in .config 2008-12-17 12:51 wow, V=1 craps a lot 2008-12-17 12:51 ok 2008-12-17 12:51 well, I don't recommend to use debugger though :) 2008-12-17 12:52 ok 2008-12-17 12:53 time to sleep.. good night guys.. debugging is fun.. :) 2008-12-17 12:53 however, it is just what do you like 2008-12-17 12:53 :) 2008-12-17 12:53 sayanora hirofumi 2008-12-17 12:54 oyasumi 2008-12-17 12:54 sry, oyasumi 2008-12-17 12:54 :) 2008-12-17 13:12 hirofumi, btree split in filemap.c is broken, the retry idea is bad, it loses allocated blocks... I have rewritten it and need to write a test driver 2008-12-17 13:12 sounds good 2008-12-17 13:13 btw, back to depth talk 2008-12-17 13:14 right, I don't understand why cursor can't point at btree, is because sometimes we discard the cursor? 2008-12-17 13:15 there is no reason 2008-12-17 13:15 we can do it 2008-12-17 13:16 however, I think ->len can't replaced with it 2008-12-17 13:16 true 2008-12-17 13:16 that is fine 2008-12-17 13:16 and maintain depth separately 2008-12-17 13:16 depth will be in the btree 2008-12-17 13:16 e.g. advance()? 2008-12-17 13:17 it reads btree->root.depth 2008-12-17 13:17 then inc/dec "depth" value 2008-12-17 13:17 it looks strange? 
2008-12-17 13:17 right, and it would be cleaner if it always inc/decs btree->root.depth directly 2008-12-17 13:17 and not make a separate variable 2008-12-17 13:18 let the compiler optimize it 2008-12-17 13:18 yes, there was it in my todo list 2008-12-17 13:18 but, it is not btree->root.depth 2008-12-17 13:18 depth is tree depth 2008-12-17 13:19 btree->root.depth is tree depth 2008-12-17 13:19 "depth" is current level in cursor 2008-12-17 13:19 so, I thought we can use cursor->len 2008-12-17 13:19 it is "depth" exactly 2008-12-17 13:20 also I think it should be renamed from ->len to ->level 2008-12-17 13:20 fine 2008-12-17 13:20 and may add cursor_level() to read it 2008-12-17 13:21 level_pop/level_push changes current level 2008-12-17 13:21 cursor_level() read current level 2008-12-17 13:21 good 2008-12-17 13:21 I will be gone for 2 hours now 2008-12-17 13:21 ok 2008-12-17 14:06 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-17 15:33 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2008-12-17 15:57 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-17 16:57 back 2008-12-17 17:07 evening flips :) 2008-12-17 17:17 http://userweb.kernel.org/~hirofumi/dleaf/tux3.tar.gz 2008-12-17 17:18 dwalk_next/dwalk_back cleanup is almost done 2008-12-17 17:32 looking 2008-12-17 17:33 hi mlankhorst 2008-12-17 17:33 I'll go to food shop and gone for 1 hour 2008-12-17 17:33 see you 2008-12-17 17:34 see you, and let's talk about get_segs() stuff and around 2008-12-17 17:34 happily 2008-12-17 18:30 backed 2008-12-17 18:32 current problem of get_segs() for get_block() is, caller can't know if which seg was allocated 2008-12-17 18:33 if blocks is allocated newly, get_block() set buffer_new to buffer to zeroed it 2008-12-17 18:35 maybe, one of other issues is dwalk_mock/dwalk_pack needs to read all extents? 2008-12-17 20:15 back 2008-12-17 20:15 hirofumi, still around? 
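The distinction drawn above between the tree depth (a property of the btree) and the cursor's current level can be sketched as below. The accessor names `cursor_level`, `level_push` and `level_pop` come from the chat; the struct layouts and `MAX_LEVELS` are assumptions for illustration only, not the real tux3 definitions.

```c
#include <assert.h>

enum { MAX_LEVELS = 8 };        /* hypothetical bound on cursor path length */

struct toy_btree {
    int depth;                  /* depth of the whole tree */
};

struct toy_cursor {
    struct toy_btree *btree;    /* cursor points at its btree */
    int level;                  /* current level, formerly "len" */
};

/* Read the current level without touching the field directly. */
static int cursor_level(const struct toy_cursor *cursor)
{
    return cursor->level;
}

/* Descend one level along the path. */
static void level_push(struct toy_cursor *cursor)
{
    assert(cursor->level < MAX_LEVELS);
    cursor->level++;
}

/* Go back up one level. */
static void level_pop(struct toy_cursor *cursor)
{
    assert(cursor->level > 0);
    cursor->level--;
}
```

Pushing and popping only changes the cursor; `btree->depth` stays whatever the tree says it is.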
2008-12-17 20:15 ye 2008-12-17 20:15 yes 2008-12-17 20:15 right 2008-12-17 20:15 it doesn't have enough information by itself 2008-12-17 20:16 so I was thinking, it would be called from a loop that examines buffer heads an cache pages 2008-12-17 20:16 yes 2008-12-17 20:17 and make to be called function? 2008-12-17 20:17 make function to be called from it 2008-12-17 20:17 just a moment 2008-12-17 20:17 ok 2008-12-17 20:19 ok, let's think about two things get_segs does not know just from looking at dleaf 2008-12-17 20:19 one is whether a buffer has to be forked 2008-12-17 20:20 ah, but that is not the issue right now 2008-12-17 20:20 yes 2008-12-17 20:20 buffer_new is not for zeroing 2008-12-17 20:20 what's for? 2008-12-17 20:20 it is for unmapping metadata the first time a block is assigned as data 2008-12-17 20:21 so get_segs does not have a way of indicating that now 2008-12-17 20:21 yes 2008-12-17 20:22 however, buffer_new is needed for zeroing? 2008-12-17 20:22 no 2008-12-17 20:22 um... 2008-12-17 20:23 if caller is used getblk(), buffer_new is not needed? 2008-12-17 20:23 we may not need the vfs to unmap metadata for us, we probably always know 2008-12-17 20:24 get_segs does have a way of indicating when a buffer should be zeroed, negative count 2008-12-17 20:24 later? or right now? 
2008-12-17 20:24 later 2008-12-17 20:24 yes 2008-12-17 20:25 for get_block() with buffer_head, buffer_new is needed 2008-12-17 20:25 I think the issue vfs is trying to prevent is, reallocate metadata as data, then later as metadata again, and finding the stale metadata buffer in cache 2008-12-17 20:26 we can add a flags field to struct seg 2008-12-17 20:26 I think it is one of issue 2008-12-17 20:26 and set that for a each new allocation 2008-12-17 20:27 ok, let's do this, it is most like current filesystems right now 2008-12-17 20:27 if not buffer_new, library doesn't try to read 2008-12-17 20:27 if not buffer_new, library try to read 2008-12-17 20:27 right 2008-12-17 20:27 ok, that's the important issue 2008-12-17 20:29 so, get_block() needs to know it is exsistant area or unexsistant area 2008-12-17 20:29 on write 2008-12-17 20:30 it is true for read too? 2008-12-17 20:30 we already tell it that, by telling it the holes 2008-12-17 20:30 read never allocations 2008-12-17 20:30 yes 2008-12-17 20:30 so can never set buffer_new 2008-12-17 20:31 so, I think get_segs() is right for it 2008-12-17 20:31 for write... I don't think vfs ever reads a block 2008-12-17 20:31 if it does, it sets create = 0 2008-12-17 20:31 no 2008-12-17 20:31 example? 2008-12-17 20:31 overwrite is also create=1 2008-12-17 20:32 actually, the flags is write=1 2008-12-17 20:32 not it is 2008-12-17 20:32 sorry 2008-12-17 20:32 now it is 2008-12-17 20:32 yes, get_block() interface does it 2008-12-17 20:33 get_segs() tells exsistant area 2008-12-17 20:33 we need function to handle unexsistant area 2008-12-17 20:33 let's look at buffer.c 2008-12-17 20:33 lxr time :) 2008-12-17 20:33 ok 2008-12-17 20:34 look write=1? 
2008-12-17 20:34 block_write_full_page 2008-12-17 20:34 yes 2008-12-17 20:34 see what it does with a partial page 2008-12-17 20:34 partial page write 2008-12-17 20:34 where it has to read first 2008-12-17 20:35 it would be __block_prepare_write() 2008-12-17 20:35 right 2008-12-17 20:35 http://lxr.linux.no/linux+v2.6.27.5/fs/buffer.c#L1849 2008-12-17 20:36 1885 err = get_block(inode, block, bh, 1); <- ok, this is actually a read, but called with create flag 2008-12-17 20:36 so you are right 2008-12-17 20:36 read or create? 2008-12-17 20:37 it can be create 2008-12-17 20:37 it is going to read this block, but it wants the fs to create it 2008-12-17 20:37 because it knows it is going to write it too 2008-12-17 20:38 but write area may not be full page 2008-12-17 20:39 if it partial overwrite, it will read rest of area 2008-12-17 20:39 yes 2008-12-17 20:40 if page is not existant, it will be allocated 2008-12-17 20:41 page is not existant on disk 2008-12-17 20:43 anyway, so we decided that get_segs needs to tell get_block when it first created a block 2008-12-17 20:43 per-block 2008-12-17 20:44 I think it is first point 2008-12-17 20:44 get_segs() will allocate blocks? or another function will allocate blocks()? 2008-12-17 20:44 get_segs will 2008-12-17 20:44 why? 2008-12-17 20:45 only reason is, it has done the probe and wants to keep the dleaf around 2008-12-17 20:45 for doing the allocation 2008-12-17 20:46 yes 2008-12-17 20:46 we can break it apart into two pieces for write (create) = 1, but it is not a natural break 2008-12-17 20:47 um... 
2008-12-17 20:47 I'm not sure, it is good or not 2008-12-17 20:47 just a idea 2008-12-17 20:47 it is not too complex a function now 2008-12-17 20:48 read current extents on interesting area 2008-12-17 20:48 then, caller knows current extents state 2008-12-17 20:48 and decide what does it want to do 2008-12-17 20:49 and tell the result to some function 2008-12-17 20:49 and we have to make sure that each seg (extent) has the same state for all blocks 2008-12-17 20:49 each seg? 2008-12-17 20:50 each seg is either created new, or already existed 2008-12-17 20:51 read (get_segs) tells already existed segs 2008-12-17 20:51 one idea I was thinking of, is make the first part of get_segs the same for read and write... any seg that nees to be allocated for write is returned as a hole 2008-12-17 20:51 then we have another function (what is currently called inside update_dtree) that does the write allocation 2008-12-17 20:52 so get_block can call the first function, and know from the holes, when to set buffer_new, then call the second function to do the allocations 2008-12-17 20:52 and the cursor has to exist across the two calls 2008-12-17 20:53 yes 2008-12-17 20:53 so a slightly more complex api for get_block, but gives it all the information it wants 2008-12-17 20:53 the above 4 lines meant it 2008-12-17 20:53 ok, we're thinking the same thing 2008-12-17 20:53 yes 2008-12-17 20:54 so we still want get_segs to do the probe? 2008-12-17 20:54 however, I thought read current area will be called as get_segs() 2008-12-17 20:54 current? 2008-12-17 20:54 existed extent on interesting area 2008-12-17 20:55 I thought get_segs() will read existed extents to tell it 2008-12-17 20:55 so... create = 0 would just call get_segs, but create = 1 would call get_gets then create_segs 2008-12-17 20:56 yes 2008-12-17 20:56 is that what you were saying? 2008-12-17 20:56 yes 2008-12-17 20:57 now, should get_segs to the btree probe, or should we make that a separate function? 
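The two-phase idea agreed on above (a read pass that reports holes, then a separate create pass, so get_block knows when a block is new) might look like this in miniature. It is a per-block toy, not the extent-based tux3 code: `toy_read_seg`, `toy_create_seg`, `toy_get_block` and the flat `map` array are all hypothetical.

```c
#include <assert.h>

enum { TOY_BLOCKS = 8 };

/* Toy file: map[i] is the physical block for logical block i, 0 means hole. */
struct toy_file {
    unsigned long map[TOY_BLOCKS];
    unsigned long next_free;    /* next physical block to hand out */
};

/* Phase 1: look up what exists; never allocates. Returns 0 for a hole. */
static unsigned long toy_read_seg(const struct toy_file *file, unsigned index)
{
    assert(index < TOY_BLOCKS);
    return file->map[index];
}

/* Phase 2: allocate a physical block for a hole and record it. */
static unsigned long toy_create_seg(struct toy_file *file, unsigned index)
{
    assert(index < TOY_BLOCKS);
    file->map[index] = file->next_free++;
    return file->map[index];
}

/* Caller side: create = 0 only reads; create = 1 reads, then fills holes.
 * *is_new plays the role of buffer_new for the caller. */
static unsigned long toy_get_block(struct toy_file *file, unsigned index,
                                   int create, int *is_new)
{
    unsigned long block = toy_read_seg(file, index);

    *is_new = 0;
    if (!block && create) {
        block = toy_create_seg(file, index);
        *is_new = 1;
    }
    return block;
}
```

Because the read phase is identical for reads and writes, the caller learns "this was a hole" before any allocation happens and can flag exactly the blocks that were newly created.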
2008-12-17 20:57 if get_segs() calls read_segs and create_segs(), I don't care 2008-12-17 20:57 read_segs() means get_segs() I said 2008-12-17 20:58 right, get_segs has the create part removed from it, so that is your read_segs 2008-12-17 20:58 I am going to check in my updates to filemap.c right now 2008-12-17 20:58 ok 2008-12-17 20:58 even though I have not properly tested 2008-12-17 20:59 makes it easier to move on 2008-12-17 20:59 and allocate_cursor() is used by caller of read_segs()? 2008-12-17 20:59 yes, and we pass a cursor to get_segs 2008-12-17 20:59 ok 2008-12-17 21:00 we thought same things, however function was different 2008-12-17 21:00 function name 2008-12-17 21:00 :) 2008-12-17 21:00 :) 2008-12-17 21:02 my filemap.c changes are in public 2008-12-17 21:02 ok 2008-12-17 21:02 now I must write a test case in user/filemap.c to cause a btree split 2008-12-17 21:03 do you want to refactor get_segs or should I? 2008-12-17 21:03 I can do it pretty quickly 2008-12-17 21:03 ok 2008-12-17 21:04 I will apply dwalk_next cleanup to it 2008-12-17 21:09 get_segs uses its inode parameter only to get the btree 2008-12-17 21:09 yes 2008-12-17 21:10 I think now is a good time to add btree to struct cursor, then we will pass only the cursor 2008-12-17 21:10 not make those changes everywhere, but just in filemap.c for now 2008-12-17 21:12 let's change it before change this 2008-12-17 21:12 ok, it's a bigger change 2008-12-17 21:12 so let's change the plan a little 2008-12-17 21:12 you want to add the struct btree to cursor? 2008-12-17 21:13 I will see what happen with it 2008-12-17 21:13 I can spend some time writing my test case for btree split, to check the retry bug fix 2008-12-17 21:13 ok 2008-12-17 21:13 ah 2008-12-17 21:14 btw, allocation path has bug 2008-12-17 21:14 dwalk_mock/dwalk_pack needs to read all extents 2008-12-17 21:14 but, now it doesn't read 2008-12-17 21:15 read from the beginning? 2008-12-17 21:15 or read to the all to the end? 
2008-12-17 21:15 after write position 2008-12-17 21:16 if (write) { 2008-12-17 21:16 while (next_extent) { 2008-12-17 21:16 trace("save tail"); 2008-12-17 21:16 seg[segs++] = *next_extent; 2008-12-17 21:16 next_extent = dwalk_next(walk); 2008-12-17 21:16 } 2008-12-17 21:16 } 2008-12-17 21:16 old filemap_extent_io has it 2008-12-17 21:16 ah right 2008-12-17 21:17 I don't like this though 2008-12-17 21:17 it is needless work, true 2008-12-17 21:17 but tricky to avoid the work, so I thought it would be less bugs for now 2008-12-17 21:18 yes 2008-12-17 21:18 ok, need to restore that code 2008-12-17 21:18 to do it, we have to allocate big memory in kernel 2008-12-17 21:18 yes 2008-12-17 21:18 stack can't have it 2008-12-17 21:18 so it has to be done properly pretty soon 2008-12-17 21:19 yes 2008-12-17 21:19 sometimes we will need extra memory to work in 2008-12-17 21:19 hey flips 2008-12-17 21:19 maybe 2008-12-17 21:20 on new extent could be inserted near the beginning of a dleaf 2008-12-17 21:20 one new extent 2008-12-17 21:21 yes 2008-12-17 21:21 so then we have to reorganize the tail of the dleaf 2008-12-17 21:21 doing that with no working memory is tricky 2008-12-17 21:22 what we could do is kmalloc a working buffer for segs and attach it to the cursor 2008-12-17 21:22 memmove don't work? 2008-12-17 21:22 also have to update the leaf dict 2008-12-17 21:22 but yes 2008-12-17 21:23 that's what we want to do 2008-12-17 21:23 memmove, then a clever update of the dict 2008-12-17 21:23 yes 2008-12-17 21:23 but I don't feel too clever right now, so why don't we just allocate some working memory for segs and point to it from the cursor? 2008-12-17 21:24 and we may want to merge some extent 2008-12-17 21:24 yes 2008-12-17 21:24 it is not right now 2008-12-17 21:24 so, working memory is not problem 2008-12-17 21:25 did you mean "not right" or "right now" ? 
2008-12-17 21:26 it will be not done right now 2008-12-17 21:26 it will not be done right now 2008-12-17 21:27 dwalk_mock/pack change will be changed later 2008-12-17 21:27 but we do need to handle the tail of the leaf now, it's a big bug 2008-12-17 21:27 so... more working memory for that? It can be around 4K 2008-12-17 21:28 yes 2008-12-17 21:28 we can't use mock/pack without it 2008-12-17 21:30 we want to write that code so it works in both user space and kernel 2008-12-17 21:31 so that it always gets tested in our unit tests 2008-12-17 21:32 so... #define malloc(bytes) kmalloc(bytes, GFP_KERNEL); ? 2008-12-17 21:32 yes, I think so 2008-12-17 21:33 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-17 21:36 if (update_dtree) { 2008-12-17 21:36 /* Update dtree by new extents */ 2008-12-17 21:36 + while (next_extent) { 2008-12-17 21:36 + trace("save tail"); 2008-12-17 21:36 + seg[segs++] = (struct seg){ extent_block(*next_extent), extent_count(*next_extent) }; 2008-12-17 21:36 + next_extent = dwalk_next(walk); 2008-12-17 21:36 + } 2008-12-17 21:36 *walk = rewind; 2008-12-17 21:36 should fix that? 2008-12-17 21:37 looks good to me 2008-12-17 21:37 commit coming... 2008-12-17 21:51 http://userweb.kernel.org/~hirofumi/dleaf/cursor-btree.patch 2008-12-17 21:51 balloc extent -> [16/1] 2008-12-17 21:51 get_segs: --------- split leaf --------- 2008-12-17 21:51 split 0x4176800 into 0x4177000 <- user test case was easy to write, should have done it long ago 2008-12-17 21:52 http://userweb.kernel.org/~hirofumi/dleaf/get_block-warn-fix.patch 2008-12-17 21:52 cursor->btree was done 2008-12-17 21:52 reading 2008-12-17 21:55 it's a nice cleanup 2008-12-17 21:56 and more cleanups can be done with this change 2008-12-17 21:56 trivial cleanup can be later 2008-12-17 21:56 pull it? 
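The working-memory idea above (one allocation call that builds in both user space and kernel, used to save the tail of a leaf) can be sketched as follows. `work_alloc`, `work_free`, `alloc_seg_buffer`, `save_tail` and `struct toy_seg` are names invented here; only `kmalloc(size, GFP_KERNEL)`, `kfree`, `malloc` and `free` are real. Note that the macro bodies carry no trailing semicolon, so they stay usable inside expressions.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __KERNEL__
#include <linux/slab.h>
#define work_alloc(bytes) kmalloc(bytes, GFP_KERNEL)
#define work_free(ptr) kfree(ptr)
#else
#include <stdlib.h>
#define work_alloc(bytes) malloc(bytes)
#define work_free(ptr) free(ptr)
#endif

/* Toy extent: starting physical block and block count. */
struct toy_seg {
    unsigned long long block;
    unsigned count;
};

/* Allocate a working buffer for up to max_segs extents; may return NULL. */
static struct toy_seg *alloc_seg_buffer(size_t max_segs)
{
    assert(max_segs > 0);
    return work_alloc(max_segs * sizeof(struct toy_seg));
}

/* Copy every extent from position "from" to the end of the leaf into the
 * working buffer and return how many were saved. */
static size_t save_tail(const struct toy_seg *extents, size_t count,
                        size_t from, struct toy_seg *out)
{
    size_t saved = 0;

    for (size_t i = from; i < count; i++)
        out[saved++] = extents[i];
    return saved;
}
```

Keeping the buffer on the heap rather than the stack matches the concern raised in the chat about kernel stack size.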
2008-12-17 21:57 maybe, apply patch is more fast 2008-12-17 21:57 ok 2008-12-17 21:58 sorry, I didn't write comment for it 2008-12-17 21:58 I'll make one up 2008-12-17 21:59 --user = hirofumi 2008-12-17 21:59 thanks 2008-12-17 21:59 three different names for you messes up your hg churn stats, and hg churn alias file doesn't really work well ;) 2008-12-17 22:00 oh 2008-12-17 22:00 well, if we really want to fix it, probably, we can do it 2008-12-17 22:01 no doubt 2008-12-17 22:02 -!- RazvanM(~RazvanM@96.234.237.45) has joined #tux3 2008-12-17 22:02 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-17 22:04 pushed to public 2008-12-17 22:05 thanks 2008-12-17 22:05 now to check the results of my retry fix more closely 2008-12-17 22:05 it seemed to work on the first try, which always worries me ;) 2008-12-17 22:05 need to print out the resulting btree 2008-12-17 22:06 ah, test is without assert 2008-12-17 22:06 it's not much of a test yet 2008-12-17 22:07 save tail test is 2008-12-17 22:07 it's not testing save tail yet 2008-12-17 22:07 which test are you talking about? 2008-12-17 22:07 http://userweb.kernel.org/~hirofumi/dleaf/try-open.patch 2008-12-17 22:08 new test case for save tail 2008-12-17 22:08 the patch will call tuxopen before tuxcreate() 2008-12-17 22:08 yes 2008-12-17 22:08 a rewrite 2008-12-17 22:08 dd if=/dev/zero bs=1M count=1 | ../tux3/user/tux3 --seek=$((1024*1024*17 * 4096)) write ./tux3.img test.txt > /dev/null 2008-12-17 22:08 wget and patch? 
2008-12-17 22:08 dd if=/dev/zero bs=1M count=1 | ../tux3/user/tux3 --seek=$((1024*1024*8 * 4096)) write ./tux3.img test.txt > /dev/null 2008-12-17 22:09 will test save tail 2008-12-17 22:09 just info 2008-12-17 22:09 write at bigger offset 2008-12-17 22:09 then write at small offset 2008-12-17 22:10 good test 2008-12-17 22:10 the result can be see with tux3graph 2008-12-17 22:10 I also like to test these things at a low level, that is, call get_segs directly 2008-12-17 22:11 also show_tree_range 2008-12-17 22:11 good 2008-12-17 22:12 real test will read it, and compare with expected result 2008-12-17 22:13 well when I print the btree I see it didn't work properly 2008-12-17 22:14 so now I feel better :) 2008-12-17 22:14 there should be bugs sometimes 2008-12-17 22:14 I'll check in, and you can run the test 2008-12-17 22:14 very easy to work on the filemap code this way 2008-12-17 22:15 ok 2008-12-17 22:15 pushed 2008-12-17 22:15 next is to chase this 2008-12-17 22:16 one extent ended up in the wrong leaf 2008-12-17 22:18 inserted into the wrong leaf after split 2008-12-17 22:18 extent 0x28 2008-12-17 22:18 extent 0x28/1 2008-12-17 22:21 because I just used the same leaf after the split 2008-12-17 22:21 need to look in the cursor to get the leaf 2008-12-17 22:22 yes 2008-12-17 22:22 fixed 2008-12-17 22:22 that was an easy case 2008-12-17 22:24 this bug would not have happened if we used the macro to get the leaf from the cursor 2008-12-17 22:25 and let the compiler do the optimization 2008-12-17 22:25 so that goes on the cleanup list 2008-12-17 22:29 next is to run the test backwards 2008-12-17 22:30 macro? 2008-12-17 22:30 cursor_leafbuf(cursor) 2008-12-17 22:30 I think of it as a macro 2008-12-17 22:30 wrapper 2008-12-17 22:31 what is difference? 2008-12-17 22:31 I think of inlines as macros... 
that's just me 2008-12-17 22:32 I should be more precise 2008-12-17 22:33 running the test backwards fails instantly 2008-12-17 22:33 so it's really nice to have get_segs factored out for proper testing 2008-12-17 22:34 now creating the segments in reverse logical order, to exercise the tail copies and things 2008-12-17 22:34 --- a/user/filemap.c Wed Dec 17 22:24:31 2008 -0800 2008-12-17 22:34 +++ b/user/filemap.c Wed Dec 17 22:34:40 2008 -0800 2008-12-17 22:34 @@ -74,7 +74,7 @@ 2008-12-17 22:34 inode->map->inode = inode; 2008-12-17 22:34 inode = inode; 2008-12-17 22:34 - for (int i = 0; i < 30; i++) { 2008-12-17 22:34 + for (int i = 30; --i;) { 2008-12-17 22:34 struct seg seg; 2008-12-17 22:34 get_segs(inode, 2*i, 2*i + 1, &seg, 1, 1); 2008-12-17 22:35 } 2008-12-17 22:35 ah, good 2008-12-17 22:35 it didn't write 0 2008-12-17 22:35 i == 0 2008-12-17 22:35 my test isn't perfect ;) 2008-12-17 22:35 :) 2008-12-17 22:35 but it fails way before it gets there 2008-12-17 22:36 valgrind stops with an assert 2008-12-17 22:38 get_segs: --------- split leaf --------- 2008-12-17 22:38 ==23418== Invalid write of size 4 2008-12-17 22:38 ==23418== at 0x804C919: dleaf_init (dleaf.c:75) 2008-12-17 22:39 very messed up 2008-12-17 22:39 extents inserted in wrong order 2008-12-17 22:39 lots of fixing to do 2008-12-17 22:41 for (int i = 2; i--;) { <- already shows a bug 2008-12-17 22:42 logical index in wrong order in dleaf 2008-12-17 22:42 1 entry groups: 2008-12-17 22:42 0/2: 2 => 2/1; 0 => 3/1; 2008-12-17 22:43 which test? 2008-12-17 22:43 ah, i=2 2008-12-17 22:49 rewind would be wrong 2008-12-17 22:54 which line? 2008-12-17 22:55 maybe, 128 2008-12-17 23:05 physical extent 2/1 should have been put into the seg[] list, but was not 2008-12-17 23:06 trace("segs:"); 2008-12-17 23:06 for (i = 0; i < segs; i++) 2008-12-17 23:06 trace("%Lx/%x", seg[i].block, seg[i].count); 2008-12-17 23:07 where is it added? 2008-12-17 23:08 after save tail? 
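The "it didn't write 0" remark about the reversed test loop comes down to pre-decrement versus post-decrement in the loop condition. A small sketch of the two idioms; the function names are made up for this illustration:

```c
#include <assert.h>

/* for (i = n; --i;) decrements before testing, so the body never sees 0.
 * Returns the number of visits and stores the lowest index visited. */
static int visit_predecrement(int n, int *lowest)
{
    int visits = 0;

    assert(n >= 1);
    for (int i = n; --i;) {
        *lowest = i;
        visits++;
    }
    return visits;
}

/* for (i = n; i--;) tests first and then decrements, so the body sees
 * n-1 down to 0 inclusive. */
static int visit_postdecrement(int n, int *lowest)
{
    int visits = 0;

    assert(n >= 1);
    for (int i = n; i--;) {
        *lowest = i;
        visits++;
    }
    return visits;
}
```

With n = 30 the first form visits 29 down to 1 (29 iterations) and the second visits 29 down to 0 (30 iterations), which is why the later `for (int i = 2; i--;)` variant does reach index 0.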
2008-12-17 23:09 walk should be pointing at it after the loop beginning at line 56 2008-12-17 23:10 yes 2008-12-17 23:10 ah 2008-12-17 23:11 there used to be dwalk_back after that loop 2008-12-17 23:11 and not now 2008-12-17 23:11 we are supposed to use next_extent 2008-12-17 23:12 actually we do 2008-12-17 23:13 the tracing output for the gaps-finding loop is not the greatest 2008-12-17 23:17 ok, stupid bug 2008-12-17 23:18 - if (dwalk_index(walk) <= key) 2008-12-17 23:18 + if (dwalk_index(walk) >= key) 2008-12-17 23:18 :p 2008-12-17 23:19 ah 2008-12-17 23:19 on to the next one 2008-12-17 23:20 we may want to sync dwalk stuff 2008-12-17 23:20 any time 2008-12-17 23:21 get_segs: --------- split leaf --------- 2008-12-17 23:21 btree_leaf_split: split leaf 2008-12-17 23:21 balloc_extent_from_range: balloc 1 blocks from [10/fff0] 2008-12-17 23:21 balloc: balloc -> [10] 2008-12-17 23:21 valgrind: m_mallocfree.c:178 (mk_plain_bszB): Assertion 'bszB != 0' failed. 2008-12-17 23:21 next one 2008-12-17 23:21 for (int i=2; i--;) was fixed? 2008-12-17 23:21 yes 2008-12-17 23:21 by the above diff 2008-12-17 23:22 if you are ready to send dwalk stuff, I will back out my debug work and pull 2008-12-17 23:22 http://userweb.kernel.org/~hirofumi/dleaf/tux3.tar.gz 2008-12-17 23:22 now, just for review 2008-12-17 23:23 it may add new bug 2008-12-17 23:23 that's ok, we're debugging right now ;) 2008-12-17 23:23 I will just have a quick look 2008-12-17 23:23 for (int i=2; i--;) was really fixed? 
2008-12-17 23:23 1 entry groups: 2008-12-17 23:23 0/2: 2 => 2/1; 0 => 3/1; 2008-12-17 23:24 ah 2008-12-17 23:24 forgot to change 2008-12-17 23:24 oh, fixed 2008-12-17 23:25 I introduced that 2008-12-17 23:25 by not thinking 2008-12-17 23:25 next bug is probably equally stupid 2008-12-17 23:25 it's great to have a nice unit test that allows directly testing the complicated code 2008-12-17 23:29 my patch adds new bug :( 2008-12-17 23:29 it gets rid of the chatting tracing output from dwalk_* 2008-12-17 23:29 chatty 2008-12-17 23:30 that tracing output was getting annoying anyway 2008-12-17 23:30 if we need it again, we can add it back 2008-12-17 23:30 yes 2008-12-17 23:30 want to chase your bug for a while? 2008-12-17 23:30 I can chase the split bug 2008-12-17 23:31 I'll see my bug 2008-12-17 23:31 ok, I will look for the split bug 2008-12-17 23:31 well, from review I think split has bug 2008-12-17 23:32 it certainly does 2008-12-17 23:32 it saves current tail 2008-12-17 23:32 but, after split, tail should be different 2008-12-17 23:32 also, dwalk state should be different 2008-12-17 23:33 right... 
that was why I did retry the way I did originally 2008-12-17 23:33 time to think 2008-12-17 23:33 it resets the dwalk state, so at least that is not a bug 2008-12-17 23:33 ah, yes 2008-12-17 23:33 but it should copy out the tail _after_ the split 2008-12-17 23:34 yes 2008-12-17 23:34 so the algorithm is not beyond repair 2008-12-17 23:34 yes 2008-12-17 23:35 but, it needs some info 2008-12-17 23:35 yes 2008-12-17 23:35 it needs to find the start of the tail 2008-12-17 23:36 which would be given by next_extent 2008-12-17 23:36 yes 2008-12-17 23:36 so, dleaf_probe to there 2008-12-17 23:39 actually, to the logical index of next_extent 2008-12-17 23:40 in other words, limit 2008-12-17 23:40 dleaf_probe to limit 2008-12-17 23:40 I don't think so 2008-12-17 23:41 split can be middle of seg[] 2008-12-17 23:42 yes, we already don't handle overlapping segments at beginning and end of logical range, so that would not be a new bug 2008-12-17 23:43 i see 2008-12-17 23:43 ok 2008-12-17 23:44 my bug is the change of dwalk state 2008-12-17 23:44 mock/pack don't know about it 2008-12-17 23:44 which dwalk state? 2008-12-17 23:45 new rule of ->extent 2008-12-17 23:45 it sounds like you want to work on it a little more? 2008-12-17 23:45 and ->group and ->entry stuff is possible 2008-12-17 23:45 yes 2008-12-17 23:46 and later we can think about a clean way to handle the overlapping segments problem that I keep deferring 2008-12-17 23:46 to fix it, I'll create new write segs functions 2008-12-17 23:47 to fix your bug, or the overlapping bug? 
2008-12-17 23:47 without save tail 2008-12-17 23:47 new one should fix both 2008-12-17 23:47 I hope all 2008-12-17 23:47 that would be nice :) 2008-12-17 23:47 :) 2008-12-17 23:47 ok, I will get something at least a little more correct for the split 2008-12-17 23:48 yes 2008-12-17 23:48 this code did not get tested properly originally because it was buried inside the file writing code 2008-12-17 23:49 now that it is exposed, it is much easier to test and improve it 2008-12-17 23:49 yes 2008-12-17 23:49 well, that change was expected 2008-12-17 23:50 and dleaf_chop was also main target 2008-12-17 23:50 I was forgetting it :) 2008-12-17 23:51 and dleaf stuff, I hope truncate and write works properly 2008-12-17 23:51 and after dleaf stuff 2008-12-17 23:52 we can also do unit tests for that in user/filemap.c 2008-12-17 23:53 if there is test cases, it would help me much 2008-12-17 23:53 logs or test cases is remaining? 2008-12-17 23:54 you meant logs of? 2008-12-17 23:54 no 2008-12-17 23:54 logs, or test case 2008-12-17 23:54 logs? 2008-12-17 23:54 ah, logs is not helpful 2008-12-17 23:54 the result of test 2008-12-17 23:55 yes, for automated regression testing 2008-12-17 23:55 yes, I want it 2008-12-17 23:56 things need to settle down a little 2008-12-17 23:56 I'm adding it for dleaf stuff 2008-12-17 23:56 written all in c, or bash style? 2008-12-17 23:56 c 2008-12-17 23:56 good 2008-12-17 23:57 ok, that will be cool 2008-12-17 23:57 e.g. 
main() in user/dleaf.c 2008-12-17 23:57 if your style of regression testing is easy to use, it would be a good project for some of our project members 2008-12-17 23:57 to do it in more places 2008-12-17 23:58 yes, exactly 2008-12-17 23:58 user/dleaf.c 132 line 2008-12-17 23:59 it is test of dwalk_next/back 2008-12-18 00:00 stupid ark program doesn't show line numbers or let me choose the file viewer, I need to try a different archive viewer 2008-12-18 00:00 it is in current repo 2008-12-18 00:01 oh, hg has like of line 2008-12-18 00:01 http://hg.tux3.org/tux3/file/7a3d508796bb/user/dleaf.c#l132 2008-12-18 00:01 link of line 2008-12-18 00:01 that's really nice :) 2008-12-18 00:02 ark is more useful when called from konqueror, lets you choose the viewer 2008-12-18 00:02 the default viewer is still lame 2008-12-18 00:02 but fast 2008-12-18 00:03 firefox 2.0.x? 2008-12-18 00:04 yes 2008-12-18 00:04 yes, that assert style is good 2008-12-18 00:04 brings everything to a halt if an invariant fails 2008-12-18 00:04 yes 2008-12-18 00:05 and it is noticeable with make tests 2008-12-18 00:05 yes 2008-12-18 00:05 perfect for now 2008-12-18 00:06 yes 2008-12-18 00:07 and maybe that asm("int3") should go 2008-12-18 00:08 it means int3 should be removed? 2008-12-18 00:09 "go" is what mean, in this case? 2008-12-18 00:09 yes 2008-12-18 00:09 it's something only I use 2008-12-18 00:10 and it only gets back in when I forget to remove it after debugging 2008-12-18 00:10 it makes no sense on non x86 arch 2008-12-18 00:10 so, we call abort()? 
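[editor's note] The idea of halting on a failed invariant with abort() rather than asm("int3") could look roughly like this. It is a sketch under assumptions: `check()` and `sum_counts()` are hypothetical names, not the tux3 assert, and the "counts are nonzero" invariant is only an example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical invariant check: report and abort() instead of asm("int3"),
 * so it is portable to non-x86 and a failure is hard to miss in a test run. */
#define check(cond) do { \
	if (!(cond)) { \
		fprintf(stderr, "%s:%d: check failed: %s\n", \
			__FILE__, __LINE__, #cond); \
		abort(); \
	} \
} while (0)

/* Example only: sum extent counts, insisting that none of them is zero. */
static unsigned sum_counts(const unsigned *count, unsigned n)
{
	unsigned total = 0;

	for (unsigned i = 0; i < n; i++) {
		check(count[i] != 0);
		total += count[i];
	}
	return total;
}
```

abort() raises SIGABRT, so a failing unit test terminates abnormally and a debugger attached to the process still stops at the failure, which covers what int3 was being used for.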
2008-12-18 00:11 or exit(1) 2008-12-18 00:12 I like abort() or something for now 2008-12-18 00:12 it is easy to notice 2008-12-18 00:12 fine 2008-12-18 00:43 for (int i = 30; i-- > 28;) { <- next bug is here 2008-12-18 00:44 result is: 1 entry groups: 2008-12-18 00:44 0/3: 38 => 3/1; 39 => 2/1; 3a => 0/1; 2008-12-18 00:44 should be: 2008-12-18 00:45 0/3: 38 => 3/1; 3a => 2/1; 2008-12-18 00:46 sorry, 0/2: 38 => 3/1; 3a => 2/1; 2008-12-18 00:49 strange 2008-12-18 00:49 just another stupid bug 2008-12-18 00:50 due to me probably 2008-12-18 00:50 getting close to it now 2008-12-18 02:05 stupid me 2008-12-18 02:06 using seg[] to copy out the tail of the leaf is stupid 2008-12-18 02:06 that's the result vector for get_segs 2008-12-18 02:44 -!- flips(~phillips@phunq.net) has joined #tux3 2008-12-18 02:47 -!- dheckman(~david@c-76-112-81-42.hsd1.mi.comcast.net) has joined #tux3 2008-12-18 02:48 -!- dheckman(~david@c-76-112-81-42.hsd1.mi.comcast.net) has left #tux3 2008-12-18 03:16 Morning 2008-12-18 03:16 good morning 2008-12-18 03:29 Another conceptual error in dleaf editing 2008-12-18 03:30 the seg[] list is assumed to be densely packed with segments, no gaps between them 2008-12-18 03:30 using the seg[] list to copy out the tail does not work, because the tail of the dleaf make have gaps between the extents 2008-12-18 03:30 stupid me, once again 2008-12-18 03:31 time to consider a proper, accurate approach 2008-12-18 05:48 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-18 08:30 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-18 09:54 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-12-18 10:52 -!- pranihome(~bobby@122.162.71.130) has joined #tux3 2008-12-18 11:05 -!- pranihome(~bobby@122.162.71.130) has joined #tux3 2008-12-18 11:18 -!- pranihome(~bobby@122.162.71.130) has joined #tux3 2008-12-18 11:34 -!- pranihome(~bobby@122.162.71.130) has joined #tux3 2008-12-18 11:42 -!- 
pranihome(~bobby@122.162.71.130) has joined #tux3 2008-12-18 15:33 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-12-18 15:42 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-12-18 15:43 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-12-18 16:15 -!- bushman(~bushman@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-12-18 16:39 hirofumi, there? 2008-12-18 16:39 hi 2008-12-18 16:39 hey, last night was a long one 2008-12-18 16:39 did you sleep at all? 2008-12-18 16:39 yes 2008-12-18 16:40 good :) 2008-12-18 16:40 however, at strange time 2008-12-18 16:40 20:00 or so 2008-12-18 16:40 ah, not strange for some people 2008-12-18 16:40 no you can be a morning person for a while 2008-12-18 16:41 I sometimes get on the other side of the clock like that, and after a few days slip back into the normal night owl thing 2008-12-18 16:41 yes, basically I'm night person 2008-12-18 16:41 well, I picked up another couple of bugs last night, then I realize the model of copying the leaf tail into the segs vector is just wrong 2008-12-18 16:42 it doesn't work at all 2008-12-18 16:42 maybe 2008-12-18 16:42 the segs vector has block, len, but extents have index, block,len 2008-12-18 16:42 I'm going to create, dwalk_update, dwalk_insert, dwalk_merge 2008-12-18 16:42 in other words, the extents can be sparse 2008-12-18 16:42 yes 2008-12-18 16:43 but, it can tell hole 2008-12-18 16:43 segs has negative count 2008-12-18 16:43 right, but it doesn't make sense to copy in the tail by inserting a bunch of artificial holes 2008-12-18 16:43 I have a better model in mind 2008-12-18 16:44 yes 2008-12-18 16:44 actually, copying in the tail is something like a leaf merge 2008-12-18 16:44 and moving the tail out of the way so we can pack in new extents is something like a split 2008-12-18 16:45 um... 
maybe 2008-12-18 16:45 good answer :) 2008-12-18 16:45 so, I am going to try having a nice big buffer, one block, and use the actual leaf split code to put the tail in it 2008-12-18 16:45 I'm thinking it, caller should help some jobs 2008-12-18 16:45 also a possibility 2008-12-18 16:46 current format has limit of extent->count 2008-12-18 16:46 it can be complex 2008-12-18 16:46 but, caller can handle it more easily 2008-12-18 16:47 right, the segs[] will still be returned to the caller just like now 2008-12-18 16:47 and it is easy to edit the beginning and end of that list to be eactly right 2008-12-18 16:47 and it makes sense to return the holes, even for write, and let the caller fill them in by doing allocation 2008-12-18 16:48 all that is pretty nice 2008-12-18 16:48 it's just handling the leaf tail that didn't work well 2008-12-18 16:48 and caller should also help merge jobs 2008-12-18 16:48 merge? 2008-12-18 16:49 ah 2008-12-18 16:49 extend extent->count 2008-12-18 16:49 yes, editing the leaf 2008-12-18 16:49 yes 2008-12-18 16:49 exactly 2008-12-18 16:49 and merge two extents 2008-12-18 16:49 there is just a little bit of messy work to do, adjusting ->count at the beginning and end of the logical range 2008-12-18 16:49 i.e. 
fill hole 2008-12-18 16:50 right 2008-12-18 16:50 ok 2008-12-18 16:50 that's what you meant 2008-12-18 16:50 yes, it should handle by caller 2008-12-18 16:50 and also decide whether to overwrite without allocating new extents, or to redirect 2008-12-18 16:50 yes 2008-12-18 16:51 so, maybe caller will call dwalk_merge() 2008-12-18 16:51 maybe 2008-12-18 16:51 dwalk_merge() shirinks entry/extent 2008-12-18 16:52 oh yes 2008-12-18 16:52 caller should tell range and detail of extents 2008-12-18 16:52 yes, and it can have information about the buffer state to help it 2008-12-18 16:52 it's the right way to divide it 2008-12-18 16:53 so get_segs will just return the existing extents and holes 2008-12-18 16:53 yes 2008-12-18 16:53 and there is no difference for read or write 2008-12-18 16:53 and the cursor is allocated outside, and the probe is done outside 2008-12-18 16:53 so it gets to be a much nicer size of function 2008-12-18 16:53 -!- inverse(~michael@d141-25-32.home.cgocable.net) has joined #tux3 2008-12-18 16:54 yes 2008-12-18 16:54 ok, back to my tail handling 2008-12-18 16:54 I'm trying to create dwalk_update() now 2008-12-18 16:54 what does it do? 2008-12-18 16:54 it updates existed extent and insert new extent 2008-12-18 16:55 ok, let's think about my split/merge idea a little more 2008-12-18 16:55 maybe, with it, caller can add extents 2008-12-18 16:55 ok 2008-12-18 16:55 in 99% of cases, the tail will be zero length 2008-12-18 16:56 because rewrite is a rare case 2008-12-18 16:56 but we do need to handle those rare cases really well 2008-12-18 16:56 what is meaning "rewrite"? 2008-12-18 16:56 update extent->count? 2008-12-18 16:57 write a file without truncating it first 2008-12-18 16:57 usual meaning of rewrite 2008-12-18 16:57 rewrite is rare? 
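[editor's note] The division of labour agreed above (get_segs returns existing extents and holes, the caller fills the holes by allocating) can be sketched as follows. Everything here is hypothetical: the log mentions holes being marked by a negative count, while this sketch uses an explicit `hole` flag for readability, and `toy_balloc` is a linear stand-in for the real allocator.

```c
#include <assert.h>

/* Hypothetical seg: `hole` set means nothing is mapped for these blocks. */
struct seg { unsigned long block; unsigned count; int hole; };

/* Toy allocator standing in for balloc: hands out blocks linearly. */
static unsigned long toy_balloc(unsigned long *next, unsigned count)
{
	unsigned long block = *next;

	*next += count;
	return block;
}

/* Caller-side step for write: replace each hole with a new allocation.
 * Returns the number of holes filled. */
static int fill_holes(struct seg *segs, int nsegs, unsigned long *next)
{
	int filled = 0;

	for (int i = 0; i < nsegs; i++) {
		if (!segs[i].hole)
			continue;
		assert(segs[i].count > 0);
		segs[i].block = toy_balloc(next, segs[i].count);
		segs[i].hole = 0;
		filled++;
	}
	return filled;
}
```

A read path would use the same vector and simply treat holes as zero-filled, which is why the lookup itself does not need to know whether it is serving a read or a write.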
2008-12-18 16:57 very rare actually 2008-12-18 16:57 yes, the only common use is database 2008-12-18 16:57 yes 2008-12-18 16:58 or fs image 2008-12-18 16:58 and databases will most likely just want to allocate once, then handle their own atomic update, snapshot, etc 2008-12-18 16:58 right 2008-12-18 16:58 well, some sort of database 2008-12-18 16:59 the fs image case is important actually, it is the xen workload 2008-12-18 16:59 yes 2008-12-18 16:59 anyway, it's still true that 99% or more of writes are truncate/writes 2008-12-18 17:00 so in those cases, there is no tail to copy out 2008-12-18 17:00 yes 2008-12-18 17:00 and even when there is, split and merge are efficient operations, just a couple of memcopies and some dict adjustments 2008-12-18 17:01 best, it uses existing code that has to work perfectly anyway 2008-12-18 17:01 and that can be tested in isolation, already has been tested pretty well 2008-12-18 17:02 for the working buffer, a single block kmalloced and pointed to by the cursor 2008-12-18 17:02 the cursor has to be locked for exclusive anyway 2008-12-18 17:03 why is it in cursor? 2008-12-18 17:04 because the cursor has to be locked exclusive anyway 2008-12-18 17:04 for this edit operation 2008-12-18 17:04 not every cursor needs it 2008-12-18 17:04 working buffer is not needed at all? 2008-12-18 17:04 in some cases not 2008-12-18 17:04 but in some cases it's really hard to avoid 2008-12-18 17:05 um... 
2008-12-18 17:05 when a leaf is split, the new leaf can be used as a working buffer 2008-12-18 17:05 that's what I did in htree 2008-12-18 17:05 it's clever, efficient code 2008-12-18 17:05 but complex 2008-12-18 17:05 and that was for a pretty simple situation, no versioning, no extents, no fancy two level dict 2008-12-18 17:06 if it's actual split, it needs lock 2008-12-18 17:06 it is normal to modify leaf 2008-12-18 17:06 right, and any edit to the leaf needs an exclusive lock 2008-12-18 17:07 I thought the working buffer means temporary buffer 2008-12-18 17:07 yes 2008-12-18 17:07 if it temporary buffer, it doesn't need lock? 2008-12-18 17:07 of course, orignal leaf is needed 2008-12-18 17:08 It's not so great to do a kmalloc/kfree on every write op 2008-12-18 17:09 so I thought the working space would be allocated once and attached to the cursor, then the cursor will be kept around as a resource 2008-12-18 17:09 right now, it's allocated and freed on every op, but that is easy to change 2008-12-18 17:10 how do we do it? 2008-12-18 17:10 cursor as a resource? 2008-12-18 17:10 without allocate and freed on every op 2008-12-18 17:11 for example, have a list of cursors, take a spin lock, grab a cursor from the list, if there is none, allocate one, when done, leave it on the list 2008-12-18 17:11 simple way to start 2008-12-18 17:11 sounds like it is jobs of slab 2008-12-18 17:12 taking on an off a list is more efficient than slab... 
but that is not the only point 2008-12-18 17:13 what we want after that, is to no release the btree blocks in the cursor path 2008-12-18 17:13 not release 2008-12-18 17:13 so a bunch of operations to the same btree can keep using the same cursor, and avoid btree probe in many cases 2008-12-18 17:14 nice optimization for later 2008-12-18 17:14 we have to manage cursor cache 2008-12-18 17:14 yes 2008-12-18 17:14 I am thinking this will be a very worthwhile optimization, especially for the inode table 2008-12-18 17:15 and for database, and virtual fs images, also very nice 2008-12-18 17:15 well, it is another story 2008-12-18 17:16 yes, it is future optimization 2008-12-18 17:16 for now we can use slab or kmalloc (which is slab inside...) 2008-12-18 17:16 why do we keep temporary buffer? 2008-12-18 17:16 a place to put the leaf tail when editing the leaf 2008-12-18 17:17 split and merge are pretty nice, they do very little unpacking and repacking of the dict 2008-12-18 17:17 ah, why do we cache it? 2008-12-18 17:17 cache buffer? we don't have to 2008-12-18 17:18 it's just convenient to 2008-12-18 17:18 if we add temporary buffer to cursor, it will be cache 2008-12-18 17:18 it will be cached 2008-12-18 17:18 yes 2008-12-18 17:19 I can't see, the point to cache temporary buffer 2008-12-18 17:19 just kmalloc/kfree? 
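[editor's note] The "cursor as a resource" idea described above (take a spin lock, grab a cursor from a list, allocate one if the list is empty, leave it on the list when done) might be sketched in userspace like this. All names are hypothetical; a C11 `atomic_flag` stands in for a kernel spinlock and `malloc` for `kmalloc`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical cursor: only the list link and a working buffer matter here. */
struct cursor { struct cursor *next; void *workbuf; };

static atomic_flag pool_lock = ATOMIC_FLAG_INIT;	/* spinlock stand-in */
static struct cursor *pool;				/* list of idle cursors */

static void pool_spin_lock(void)
{
	while (atomic_flag_test_and_set(&pool_lock))
		;	/* spin until the flag was previously clear */
}

static void pool_spin_unlock(void)
{
	atomic_flag_clear(&pool_lock);
}

/* Grab a cursor from the list; if there is none, allocate one. */
static struct cursor *get_cursor(size_t blocksize)
{
	pool_spin_lock();
	struct cursor *cursor = pool;
	if (cursor)
		pool = cursor->next;
	pool_spin_unlock();
	if (cursor)
		return cursor;
	if (!(cursor = malloc(sizeof *cursor)))
		return NULL;
	if (!(cursor->workbuf = malloc(blocksize))) {
		free(cursor);
		return NULL;
	}
	cursor->next = NULL;
	return cursor;
}

/* When done, leave the cursor and its working buffer on the list. */
static void put_cursor(struct cursor *cursor)
{
	assert(cursor);
	pool_spin_lock();
	cursor->next = pool;
	pool = cursor;
	pool_spin_unlock();
}
```

The sketch never shrinks the list and assumes one block size for every cursor. Whether this beats plain kmalloc/kfree is exactly the open question in the discussion, so it is best read as an illustration of the proposal rather than a recommendation.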
2008-12-18 17:19 yes 2008-12-18 17:19 ok, and pass it to the function as a pointer in the cursor :) 2008-12-18 17:20 or the function kmallocs and kfrees 2008-12-18 17:20 fine 2008-12-18 17:20 or just pass it as parameter 2008-12-18 17:21 anyway that gets some working space is fine at the moment 2008-12-18 17:21 yes 2008-12-18 17:22 for this, my point is, I'd like to keep cursor simple 2008-12-18 17:22 it's your cursor ;) 2008-12-18 17:22 :) 2008-12-18 17:22 I already contribute it :) 2008-12-18 17:22 over time I want to make the cursor smarter, but right now I just want to handle leaf tails accurately 2008-12-18 17:23 yes 2008-12-18 17:23 it will have flags or something to know locking state 2008-12-18 17:23 yes 2008-12-18 17:24 at first we have just one semaphore per btree I think 2008-12-18 17:24 yes 2008-12-18 17:24 we don't want to use i_mutex for that 2008-12-18 17:24 should be a new mutex or rwsem 2008-12-18 17:24 maybe 2008-12-18 17:25 one rwsem per btree, in the struct btree would be a good start 2008-12-18 17:25 and need i_mutex to assign the btree to an inode, maybe 2008-12-18 17:26 after that, crabbing locking in the cursor path 2008-12-18 17:26 well 2008-12-18 17:26 maybe it really temporary, it may be able to i_lock 2008-12-18 17:27 yes, to assign the btree 2008-12-18 17:27 yes 2008-12-18 17:28 ok, I need to try the split/merge idea 2008-12-18 17:28 maybe, lock for crabbing locking is in the header of buffer? 2008-12-18 17:29 exactly 2008-12-18 17:29 and cursor points that lock? 
2008-12-18 17:29 that's why I said "well" above ;) 2008-12-18 17:29 :) 2008-12-18 17:30 and we can do better than that later, by having the cursor keep its reference on a block buffer, then the user locks the cursor and gets direct access to the block buffer without probing 2008-12-18 17:30 fancier and fancier, faster and faster 2008-12-18 17:31 normal progression of locking strategy 2008-12-18 17:31 good 2008-12-18 17:32 cursor is an idea I got from matt dillon by the way 2008-12-18 17:32 name? 2008-12-18 17:33 you know about matt dillon and hammer? 2008-12-18 17:33 I know a bit 2008-12-18 17:33 lots of fine design work 2008-12-18 17:33 linux -> freebsd -> dragonfly bsd 2008-12-18 17:33 yes, we got our 2.6 vm design from matt, indirectly 2008-12-18 17:34 oh, 2.6 2008-12-18 17:34 2.2 or 2.4, matt was working for it? 2008-12-18 17:34 he was working on freebsd then 2008-12-18 17:35 but explained the ideas of reverse mapping to rik van riel 2008-12-18 17:35 ah 2008-12-18 17:35 i see 2008-12-18 17:35 later, that idea was combined with andrea arcangelli's lru design 2008-12-18 17:36 first time we ever had a vm that worked 2008-12-18 17:36 i see 2008-12-18 17:42 hack time 2008-12-18 17:42 I need to get to a point where we can merge again 2008-12-18 17:42 ok 2008-12-18 17:42 not drift too far apart 2008-12-18 17:43 I think merge is not needed for a while 2008-12-18 17:43 I'll change dwalk stuff 2008-12-18 17:43 but, maybe you don't need to change it? 
2008-12-18 17:43 dwalk is working well for me 2008-12-18 17:44 new position is not working yet 2008-12-18 17:44 dwalk_probe()/dwalk_next() cleanup 2008-12-18 17:45 right 2008-12-18 17:45 but the current version works well enough for this work 2008-12-18 17:45 every time I think it has a bug, the bug is actually somewhere else 2008-12-18 17:45 so, I think merge is not needed for a while 2008-12-18 17:45 yes 2008-12-18 17:46 well, actually dwalk_probe() is buggy 2008-12-18 17:46 current dwalk_probe() 2008-12-18 17:46 but, we don't use it at all for now 2008-12-18 17:47 we pass 0 to it always 2008-12-18 18:50 yes 2008-12-18 18:51 lucky :) 2008-12-18 19:15 -!- ajonat(~ajonat@190.48.102.173) has joined #tux3 2008-12-18 20:43 -!- bushman(~bushman@c-76-23-106-132.hsd1.sc.comcast.net) has left #tux3 2008-12-18 23:15 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-19 00:43 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-19 02:58 folks 2008-12-19 02:58 bbl 2008-12-19 03:24 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-12-19 06:02 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-12-19 08:42 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-19 10:15 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-19 11:45 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-12-19 11:50 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-19 13:00 back into filemap.c 2008-12-19 13:01 hopefully, this is the last stretch of work to get something usable, until we get to versions 2008-12-19 13:03 so, the plan is, first we look in a dleaf and return a vector of segs - block/count - and holes for a given logical region 2008-12-19 13:04 (still waving my hands at extent overlap at the ends of the region) 2008-12-19 13:08 for write, extent allocations are then done and the segs list is 
updated, replacing the holes with new segments or redirecting existing segments to new locations (the segs list can become longer or shorter) 2008-12-19 13:09 then its time to insert the segs list back into the dleaf, this is the fun part 2008-12-19 13:10 from the initial scan, we should have a pointer to the first entry above the logical region 2008-12-19 13:11 we use the new dleaf_split_at to split out the tail of the leaf into a temporary leaf 2008-12-19 13:11 then use dwalk_pack to pack extents into the leaf, until the leaf fills 2008-12-19 13:12 then we create a new leaf by splitting the btree (don't have to split the leaf, we will be biased in favor of linear insertion and let the leaves be unbalanced for now) 2008-12-19 13:14 continue that until all the segs fit in the leaf (we break up long segs into our limited-size extents here, so we could generate more than one new leaf) 2008-12-19 13:15 when done, check if the leaf tail will fit, if not, split 2008-12-19 13:16 then merge the tail in using the existing dleaf_merge (also used for btree deletion, i.e., version delete or maybe later, hole punch) 2008-12-19 13:17 this is all pretty efficient... the larger logical regions we can work with, the more efficient 2008-12-19 13:17 now, I'm thinking split/merge is not good way 2008-12-19 13:18 now is the time to decide that 2008-12-19 13:18 split needs to copy 4 times 2008-12-19 13:18 good morning :) 2008-12-19 13:18 hi 2008-12-19 13:18 :) 2008-12-19 13:18 and merge also needs to copy 4 times 2008-12-19 13:19 I think we have to copy 8 times totally 2008-12-19 13:19 four times? 2008-12-19 13:19 (copy means memcpy()) 2008-12-19 13:20 in split, just two copies 2008-12-19 13:20 split is, extent, entry, group, and kill gap of entry and group 2008-12-19 13:21 usually, killing the gap is not needed 2008-12-19 13:21 if so, we need to change dwalk_pack? 
2008-12-19 13:21 I think it assuming there is no gap 2008-12-19 13:22 there will be no gap when it is packing 2008-12-19 13:22 usually, number of groups will not change 2008-12-19 13:22 it will just be one group 2008-12-19 13:23 when it is more groups, it is a special case where the small extra overhead does not matter 2008-12-19 13:23 however, it can be 2008-12-19 13:23 yes 2008-12-19 13:23 but rare 2008-12-19 13:23 and we have to handle 2008-12-19 13:23 and it is a small overhead anyway 2008-12-19 13:23 yes 2008-12-19 13:23 the split handles it 2008-12-19 13:23 these functions are pretty efficient 2008-12-19 13:24 more efficient than looping over _pack to handle the tail 2008-12-19 13:24 yes 2008-12-19 13:24 but, we can create dwalk_insert 2008-12-19 13:24 yes, and we can view that as an opportunistic optimization 2008-12-19 13:25 when we hit a case that is not easily handled by dwalk_insert, we go to the split/merge 2008-12-19 13:25 the split/merge is inherently robust, because those operations _must_ be reliable, or the btree will be corrupt 2008-12-19 13:26 but, we make new dleaf_split_at? 2008-12-19 13:26 already done 2008-12-19 13:26 it was factored out of dleaf_split 2008-12-19 13:26 it was clean 2008-12-19 13:27 um... 2008-12-19 13:28 dwalk_insert will replace dwalk_pack, so I don't see extra complexicy 2008-12-19 13:28 how close are you with that? 2008-12-19 13:29 still need some work 2008-12-19 13:30 how does it handle the case where the leaf tail has to be moved? 2008-12-19 13:31 it means we have to split leaf in that case? 2008-12-19 13:31 or save tail issue? 
2008-12-19 13:31 yes, that is the issue I am solving now 2008-12-19 13:32 For save tail issue, I'm going to use memmove() 2008-12-19 13:33 it's a little complex because the arrangement of groups can change 2008-12-19 13:33 yes 2008-12-19 13:34 so I am thinking, we handle the general case with split/merge, then optimize it with techniques like you are developing now 2008-12-19 13:34 however, it also use memmove() 2008-12-19 13:35 list, split moves things around, but it is efficient, and in many cases we will be able to avoid the split/merge completely 2008-12-19 13:35 expecially for linear writing to a new file 2008-12-19 13:36 if it is linear, insert will just add extent and entry 2008-12-19 13:36 yes 2008-12-19 13:36 ah 2008-12-19 13:36 yes 2008-12-19 13:37 in general, any case where dwalk_insert is easy to implement and reliable, we use it instead of split/merging the tail 2008-12-19 13:37 this gives us a robust strategy 2008-12-19 13:37 split/merge has another issue 2008-12-19 13:37 we may can merge after added new extents 2008-12-19 13:38 yes 2008-12-19 13:38 can't did you mean? 2008-12-19 13:38 how do we solve? 2008-12-19 13:38 I mentioned that above 2008-12-19 13:38 it is pretty easy 2008-12-19 13:38 ah, sorry, can't 2008-12-19 13:38 it is the same test as needed for btree coalesce 2008-12-19 13:39 make a conservative test that the tail will fit in the free space of the leaf, split if not 2008-12-19 13:39 split btree 2008-12-19 13:40 split two times? 
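[editor's note] The plan discussed above (split the leaf tail out into a temporary leaf, pack the new extents, then merge the tail back after a conservative fit test) can be sketched on a toy leaf. This is not the dleaf format: the flat `table[]`, `LEAF_MAX` and all function names are hypothetical, there is no two-level dict, and overlap at the ends of the region is ignored, as it is in the discussion.

```c
#include <assert.h>
#include <string.h>

enum { LEAF_MAX = 8 };	/* toy capacity, not a real dleaf size */

struct extent { unsigned index; unsigned long block; unsigned count; };
struct leaf { unsigned n; struct extent table[LEAF_MAX]; };

/* Move every extent at or above logical index `limit` into `tail`. */
static void leaf_split_at(struct leaf *leaf, unsigned limit, struct leaf *tail)
{
	unsigned keep = 0;

	while (keep < leaf->n && leaf->table[keep].index < limit)
		keep++;
	tail->n = leaf->n - keep;
	memcpy(tail->table, leaf->table + keep, tail->n * sizeof(struct extent));
	leaf->n = keep;
}

/* Append one extent in index order; returns 0 when the leaf is full. */
static int leaf_pack(struct leaf *leaf, struct extent extent)
{
	if (leaf->n == LEAF_MAX)
		return 0;
	assert(!leaf->n || leaf->table[leaf->n - 1].index < extent.index);
	leaf->table[leaf->n++] = extent;
	return 1;
}

/* Conservative fit test, then merge the tail back in; returns 0 if the
 * tail does not fit and the caller has to split instead. */
static int leaf_merge(struct leaf *leaf, const struct leaf *tail)
{
	if (leaf->n + tail->n > LEAF_MAX)
		return 0;
	memcpy(leaf->table + leaf->n, tail->table, tail->n * sizeof(struct extent));
	leaf->n += tail->n;
	return 1;
}

/* Rewrite logical region [start, limit) with nsegs new extents. */
static int leaf_rewrite(struct leaf *leaf, unsigned start, unsigned limit,
			const struct extent *segs, unsigned nsegs)
{
	struct leaf tail;

	leaf_split_at(leaf, limit, &tail);	/* move the tail out of the way */
	while (leaf->n && leaf->table[leaf->n - 1].index >= start)
		leaf->n--;			/* drop the extents being replaced */
	for (unsigned i = 0; i < nsegs; i++)
		if (!leaf_pack(leaf, segs[i]))
			return 0;	/* real code would add a new leaf here */
	return leaf_merge(leaf, &tail);
}
```

In the common truncate-then-write case the tail is empty, so the split and merge steps reduce to zero-length copies. On failure the sketch just returns 0 with the leaf half edited; a real implementation would split the btree at that point rather than give up.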
2008-12-19 13:40 the split algorithm works pretty well, all the splitting ends up in the same place, splitting on dwalK_pack/dwalk_merge, and splitting on merging the tail 2008-12-19 13:40 there will be no extra splits 2008-12-19 13:41 ah 2008-12-19 13:41 no extra btree splits 2008-12-19 13:42 if we can't merge temporary buffer, we do real split 2008-12-19 13:42 yes 2008-12-19 13:42 that part works out pretty well 2008-12-19 13:43 I went through something like this with the inode table too, first I tried to have a classic btree split in the middle of the leaf 2008-12-19 13:43 that produced pretty bad behaviour 2008-12-19 13:43 and it was hard to think about the algorithm 2008-12-19 13:44 then it changed to creating a new, empty leaf in the btree when a table block fills up 2008-12-19 13:44 the create algorithm worked out better and the leaf block usage improved a lot 2008-12-19 13:45 there will be cases where we want a more balanced split, for example, random writes to a db file 2008-12-19 13:46 and then we will add a heuristic, for now we won't worry about it 2008-12-19 13:46 ok... I keep waving my hands at the extent overlap issue 2008-12-19 13:47 thinking that the main algorithm should work reliably first in the case that extents do not overlap, then add the cleanup for the ends of the logical region 2008-12-19 13:48 but really, it is better to have a plan for the handling of the ends now 2008-12-19 13:48 yes 2008-12-19 13:48 in the !create case, we just make the fixup in the segs vector 2008-12-19 13:48 that is easy, but not done for the top end yet 2008-12-19 13:50 the overlapping segments will be entered into the seg vector, so that the top segment is not part of the tail 2008-12-19 13:50 (I think) 2008-12-19 13:50 top extent I mean 2008-12-19 13:51 and the bottom overlapping extent will be in the segs list 2008-12-19 13:51 overlapping means 2~6, and 6~10 logcal region? 2008-12-19 13:52 that means 6 blocks and 10 blocks? 
2008-12-19 13:52 start 2:size 4, start 6:size 5 2008-12-19 13:53 that does not overlap 2008-12-19 13:53 but your first example does 2008-12-19 13:54 existing extent: 2/6, create region 6/10 2008-12-19 13:54 overlap is not meaning, extents can be contiguous 2008-12-19 13:55 2/6 is index-2:count-6? 2008-12-19 13:55 in my example, extent 2/6 may have to end up as 2/4 if the physical extent is not contiguous, or may stay as 2/6, or get longer if a new physical extent above it is contiguous 2008-12-19 13:56 6 is count of blocks? 2008-12-19 13:56 yes 2008-12-19 13:56 so, I think we shouldn't make that case 2008-12-19 13:57 details? 2008-12-19 13:57 you think it will not happen, or we should not allow it? 2008-12-19 13:57 caller (e.g. filemap) should know about extent 2/6 2008-12-19 13:58 both is true 2008-12-19 13:58 well, we should not allow it 2008-12-19 13:58 yes, the caller that is doing the block allocations worries about this 2008-12-19 13:58 but get_segs has to tell it about the overlap 2008-12-19 13:59 because the segs[] vector does not represent that 2008-12-19 13:59 for this, I though we should use read_seg/write_seg 2008-12-19 13:59 I thought 2008-12-19 13:59 read_seg tells current extents 2008-12-19 14:00 user of read_seg, check it, then it decides what it want to do 2008-12-19 14:01 yes 2008-12-19 14:01 in the above case, user will try to create 7/9 (and if possible, contiguous with 2/6) 2008-12-19 14:01 yes 2008-12-19 14:02 and tell result of allocation to dleaf stuff 2008-12-19 14:02 I think 2008-12-19 14:02 right 2008-12-19 14:03 and I think dleaf should provide some function to do it 2008-12-19 14:03 I am thinking that caller gets a segs list from read_segs, does allocations and makes changes to the segs list, then calls write_segs to insert into the dtree 2008-12-19 14:03 add/insert/merge/update/truncate 2008-12-19 14:04 it can 2008-12-19 14:04 the walk_ api was supposed to be for that 2008-12-19 14:04 so the details remain exposed to the caller, which 
is going to do much more complex processing over time 2008-12-19 14:04 while still being a pretty simple api 2008-12-19 14:05 however, I guessed the packing those info and tell to another is not overkill 2008-12-19 14:05 add/insert/merge/update/truncate is operations that caller is needed 2008-12-19 14:05 I guess 2008-12-19 14:06 when we are deleting versioned segments, we can't really write that as a function in dleaf 2008-12-19 14:06 unless we want dleaf.c to know all about version details 2008-12-19 14:06 which is not a good factoring, I think 2008-12-19 14:06 i see 2008-12-19 14:07 that is one of the two reasons for the walk api 2008-12-19 14:07 dleaf can do it with versioning helper? 2008-12-19 14:07 the other reason is, it is efficient for multiple extents per operation 2008-12-19 14:07 that helper would be gigantic 2008-12-19 14:07 have a look at version.c 2008-12-19 14:08 but, in this case, dleaf stuff is including dwalk 2008-12-19 14:08 btw, in this case 2008-12-19 14:08 yes 2008-12-19 14:08 ok 2008-12-19 14:08 so what I wanted is to have the versioning code happening at as high a level as possible 2008-12-19 14:08 not in a helper 2008-12-19 14:09 it has to happen inside get_block of course 2008-12-19 14:09 ok 2008-12-19 14:09 but that is just for now until we have some kind of extent-oriented call from the vfs 2008-12-19 14:09 yes 2008-12-19 14:09 or we write our own handler for the mpage-type logic 2008-12-19 14:10 yes, I think we'll do it 2008-12-19 14:10 should give a measurable performance increase 2008-12-19 14:10 yes 2008-12-19 14:11 however, it is later 2008-12-19 14:11 yes 2008-12-19 14:11 did you fix the bug in the original version of dwalk_probe? 2008-12-19 14:11 no 2008-12-19 14:12 ok, well your new stuff, is it nearly solid except for dwalk_merge? 2008-12-19 14:12 dleaf_merge? 2008-12-19 14:12 I thought you were writing a dwalk_merge 2008-12-19 14:13 did I read that wrong? 
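[editor's note] The overlap example discussed earlier (existing extent 2/6, create region 6/10, with index/count notation) can be checked mechanically. A small sketch with hypothetical names; it covers only the logical ranges, not the question of whether the physical extents are contiguous.

```c
#include <assert.h>

/* index/count in blocks, as in the "2/6" notation used in the discussion. */
struct range { unsigned index, count; };

/* Do two logical ranges share at least one block? */
static int ranges_overlap(struct range a, struct range b)
{
	return a.index < b.index + b.count && b.index < a.index + a.count;
}

/* Trim `old` so that it ends where `region` starts. */
static struct range trim_below(struct range old, struct range region)
{
	assert(old.index < region.index);
	if (old.index + old.count > region.index)
		old.count = region.index - old.index;
	return old;
}
```

Extent 2/6 covers blocks 2 through 7 and region 6/10 starts at block 6, so they overlap by two blocks; trimming the existing extent gives 2/4, matching the "may have to end up as 2/4" case. By contrast 2/4 and 6/5 are merely adjacent.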
2008-12-19 14:13 ah 2008-12-19 14:13 now, I'm writing dwalk_insert 2008-12-19 14:13 sorry 2008-12-19 14:13 I'll try dwalk_merge after it 2008-12-19 14:13 :) 2008-12-19 14:13 ok, well your new stuff, is it nearly solid except for dwalk_insert? 2008-12-19 14:13 and may dwalk_update 2008-12-19 14:14 I think so 2008-12-19 14:14 however, it will break dwalk_pack/dwalk_mock 2008-12-19 14:14 dwalk_pack works differently, or does not work at all? 2008-12-19 14:15 it doesn't work at all 2008-12-19 14:15 because the detail of struct dwalk was changed 2008-12-19 14:15 it would be nice if it worked :) 2008-12-19 14:15 yes 2008-12-19 14:15 dwalk_insert was replacement of it 2008-12-19 14:16 I would like to have both for now, and replace dwalk_pack over time 2008-12-19 14:16 and for now, I just need dwalk_pack to get filemap.c working reliably 2008-12-19 14:17 maybe, for it, I will need to adjust the detail of dwalk_pack 2008-12-19 14:18 if you could do that, and let me pull even with dwalk_insert not working, that would be nice 2008-12-19 14:19 dwalk_pack is assuming, leaf doesn't have entry/extent after target 2008-12-19 14:19 yes, that is perfect for me now 2008-12-19 14:19 ok 2008-12-19 14:20 I'll try 2008-12-19 14:20 ah 2008-12-19 14:21 btw, I'm thinking caller has another additional work 2008-12-19 14:21 caller should check maximum extent->count, and split it to two segements 2008-12-19 14:22 agreed, I wrote that up above 2008-12-19 14:22 in fact, it could be a lot of segments 2008-12-19 14:22 good 2008-12-19 14:23 ah, and pack takes one extent 2008-12-19 14:23 yes 2008-12-19 14:23 but, I think it should take multiple extents 2008-12-19 14:24 dwalk_pack should take multiple extents and try to merge them? 2008-12-19 14:24 no 2008-12-19 14:24 add multiple segemnts at once time 2008-12-19 14:25 to reduce overhead? 
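[editor's note] The point that the caller should respect the maximum extent->count and break a long segment into several can be sketched as below. `split_seg` is a hypothetical helper and the maximum is passed in as a parameter, since the log does not state the actual limit.

```c
#include <assert.h>

struct seg { unsigned long block; unsigned count; };

/* Break one long seg into pieces of at most `max` blocks each.
 * Stores up to `room` pieces in `out` and returns the number needed. */
static unsigned split_seg(struct seg seg, unsigned max,
			  struct seg *out, unsigned room)
{
	unsigned pieces = 0;

	assert(max > 0);
	while (seg.count) {
		unsigned count = seg.count < max ? seg.count : max;

		if (pieces < room)
			out[pieces] = (struct seg){ seg.block, count };
		seg.block += count;
		seg.count -= count;
		pieces++;
	}
	return pieces;
}
```

Because the return value is the number of pieces needed rather than the number stored, a caller can detect that its output array was too small, which matters given the remark that one long segment "could be a lot of segments".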
2008-12-19 14:25 yes 2008-12-19 14:25 well, maybe, it would be later 2008-12-19 14:26 yes 2008-12-19 14:27 because passed segments is contiguous, it shouldn't be complex to add multiple segments 2008-12-19 14:27 contiguous logical region 2008-12-19 14:32 btw, dwalk_mock seems strange 2008-12-19 14:34 where is initializing the walk->mock? 2008-12-19 14:38 the struct dwalk has the right state 2008-12-19 14:38 walk->mock.used and walk->mock.free 2008-12-19 14:38 it doesn't seem to be initialized 2008-12-19 14:43 initialized to zero 2008-12-19 14:43 and they accumulate the change 2008-12-19 14:43 free and used should have same value with current dleaf? 2008-12-19 14:44 no, they have the difference from the current dleaf values 2008-12-19 14:44 initial is also not same? 2008-12-19 14:45 if so, used - free is really strange 2008-12-19 14:45 right, they are initally zero, they could be initialized to the same as the dleaf, then they would be the same 2008-12-19 14:46 and dwalk_pack may not add entry 2008-12-19 14:46 and dwalk_pack may not add new entry 2008-12-19 14:46 right, it might just add a new extent 2008-12-19 14:46 I think it shouldn't happen for now 2008-12-19 14:46 because we don't allow multiple extents on one entry 2008-12-19 14:47 -!- andrem(~andre@201-1-148-5.dsl.telesp.net.br) has joined #tux3 2008-12-19 14:47 that is ok 2008-12-19 14:47 if it is allowed, we have to handle it by dwalk_probe() too 2008-12-19 14:47 and dwalk_index() 2008-12-19 14:48 those will always seek to an exact entry 2008-12-19 14:49 and we will handle the details of what is inside the entry in higher level code 2008-12-19 14:49 um... 2008-12-19 14:49 I can't see why doesn't it need to change entry_limit 2008-12-19 14:51 maybe it does 2008-12-19 14:52 um... 2008-12-19 14:52 it does change it, in the mock field 2008-12-19 14:53 oh 2008-12-19 14:53 inc_entry_limit(&walk->mock.entry, 1); 2008-12-19 14:54 it changes dleaf before split? 2008-12-19 14:54 what does? 
2008-12-19 14:54 no 2008-12-19 14:54 inc_entry_limit() changes entry->limit? 2008-12-19 14:55 in the mock, no 2008-12-19 14:55 and mock is before btree_leaf_split 2008-12-19 14:55 yes 2008-12-19 14:55 it is just to find out of the new entries fit 2008-12-19 14:56 ah 2008-12-19 14:56 inc_entry_limit() in pack 2008-12-19 14:56 I'm missing it 2008-12-19 14:56 yes 2008-12-19 14:56 ok, in your new code 2008-12-19 14:57 I will be out for an hour 2008-12-19 14:58 ok, well, I'll try 2008-12-19 15:27 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-19 16:39 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-19 16:49 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-19 16:59 back 2008-12-19 17:00 I remembered dwalk_pack() is working 2008-12-19 17:00 nice 2008-12-19 17:00 maybe 2008-12-19 17:00 :) 2008-12-19 17:00 well, it will get some testing from me 2008-12-19 17:00 and I don't need _mock right now 2008-12-19 17:01 good 2008-12-19 17:01 pull? 2008-12-19 17:01 chop_after is needed? 2008-12-19 17:01 ACTION thinks 2008-12-19 17:01 it is also broken 2008-12-19 17:01 no 2008-12-19 17:01 that's ok 2008-12-19 17:01 it's being replaced with dleaf split/merge for the tail 2008-12-19 17:01 the whole thing is getting much nicer 2008-12-19 17:01 dwalk_pack is strange, however it will work 2008-12-19 17:02 strange? 2008-12-19 17:02 oh, funny to code 2008-12-19 17:02 e.g. walk->entry == walk->estop 2008-12-19 17:02 yes 2008-12-19 17:02 things being upside down helps make it strange 2008-12-19 17:02 and one byte pointers... 2008-12-19 17:03 I take full blame 2008-12-19 17:03 well, dwalk_pack doesn't update ->estop, but test it 2008-12-19 17:04 well, I'll review it 2008-12-19 17:04 the original did not test estop either 2008-12-19 17:05 did not update I mean 2008-12-19 17:05 bug? 
2008-12-19 17:06 now, dwalk_pack also doesn't update ->estop 2008-12-19 17:06 can be bug 2008-12-19 17:06 however, it is assuming "add" operation 2008-12-19 17:06 so, maybe it works 2008-12-19 17:07 I don't see how ;) 2008-12-19 17:07 if we remove chop* and mock, it is very easy to work 2008-12-19 17:08 you can leave them in, just write a /* this is broken */ comment 2008-12-19 17:08 or assert(0) 2008-12-19 17:08 that will do 2008-12-19 17:08 dwalk_pack() also test dwalk_index(walk) != index, it will match always 2008-12-19 17:08 yes, so it is a bug that I covered up 2008-12-19 17:08 and you noticed, good 2008-12-19 17:09 easy to fix 2008-12-19 17:09 well, I'll review dwalk_pack fully 2008-12-19 17:09 why is it even testing estop? 2008-12-19 17:09 for other stuff, I can add "#if 0"? 2008-12-19 17:09 good 2008-12-19 17:10 ah, it meant - can I add "#if 0"? 2008-12-19 17:10 yes 2008-12-19 17:10 ok, good 2008-12-19 17:10 and I am thinking, that test of estop is bogus 2008-12-19 17:10 remove it to fix :) 2008-12-19 17:10 unit test will be also removed 2008-12-19 17:11 because it doesn't work? 2008-12-19 17:11 yes 2008-12-19 17:11 oh, one that depends on mock 2008-12-19 17:11 well, remove is "#if 0" actually 2008-12-19 17:11 sure 2008-12-19 17:11 I'll fix it to just do the _pack test 2008-12-19 17:11 mock? 2008-12-19 17:12 the unit test needs to be #ifdef 0 because _chop is broken? 
2008-12-19 17:12 yes 2008-12-19 17:12 fine 2008-12-19 17:12 _chop and _chop_after and _mock 2008-12-19 17:13 well, it may be work 2008-12-19 17:13 but, I can't mark it as ok 2008-12-19 17:13 that is fine for now 2008-12-19 17:13 and I know the bug of _chop and _chop_after 2008-12-19 17:14 I don't want to fix those if we don't need 2008-12-19 17:14 right, not a good use of time 2008-12-19 17:14 static int get_segs(struct inode *inode, block_t start, unsigned limit, 2008-12-19 17:14 struct seg seg[], unsigned max_segs, unsigned *below, unsigned *above) <- revised get_segs interface 2008-12-19 17:14 this is the one you called read_segs 2008-12-19 17:15 so, I think we can merge now 2008-12-19 17:15 ok 2008-12-19 17:15 below, above? 2008-12-19 17:15 yes, the size of overlap below and above, needed only for create = 1 2008-12-19 17:16 ah, I assumed creater will adjust start and limit 2008-12-19 17:17 start - 1, and limit + 1 2008-12-19 17:17 caller can adjust them 2008-12-19 17:17 if it wants 2008-12-19 17:17 yes 2008-12-19 17:17 get_segs should say exactly what lies in the io region 2008-12-19 17:18 pull and review now? 2008-12-19 17:18 I'll check my patches 2008-12-19 17:18 I'll work on an email to describe the filemap.c strategy 2008-12-19 17:19 ok 2008-12-19 17:49 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-19 18:05 ugh, I've noticed dwalk_pack didn't work 2008-12-19 18:06 and I noticed why does current code work 2008-12-19 18:07 aha 2008-12-19 18:07 hard thing? 2008-12-19 18:08 dwalk_probe and others has different dwalk state entriely 2008-12-19 18:08 and dwalk_pack depends on it 2008-12-19 18:08 right, _probe is supposed to set up the right state for _pack 2008-12-19 18:09 as well as _next 2008-12-19 18:09 but, current _probe dones't do it 2008-12-19 18:09 whoops 2008-12-19 18:09 now, we use _back 2008-12-19 18:10 so, it works 2008-12-19 18:10 maybe, it is not hard to fix 2008-12-19 18:11 however, new filemap is how work? 
2008-12-19 18:11 it still wants a _back 2008-12-19 18:11 bug it does not need _chop or _mock 2008-12-19 18:11 but 2008-12-19 18:11 I meant 2008-12-19 18:12 it's ok 2008-12-19 18:12 it needs _probe, _next, _back, _pack 2008-12-19 18:12 now, new _back has same state to _probe 2008-12-19 18:12 good 2008-12-19 18:12 _pack depends assumes old state 2008-12-19 18:12 bad ;) 2008-12-19 18:12 so, I have to fix _pack 2008-12-19 18:12 ok 2008-12-19 18:13 however, why new filemap doesn't need _chop_after? 2008-12-19 18:13 ah, it does 2008-12-19 18:14 i see 2008-12-19 18:14 do make the dleaf state consistent after the _packs are done 2008-12-19 18:14 so, I have to fix chop_after 2008-12-19 18:14 but 2008-12-19 18:14 it can be very simple 2008-12-19 18:15 well 2008-12-19 18:15 no 2008-12-19 18:15 however, maybe it is chop, not after? 2008-12-19 18:15 it is called before the packs 2008-12-19 18:15 yes, it's just chop 2008-12-19 18:15 chop, current extent and after? 2008-12-19 18:16 it chops current extent and after? 2008-12-19 18:16 yes 2008-12-19 18:16 ok 2008-12-19 18:16 I'll do it 2008-12-19 18:17 ok, I'll go back to work on filemap 2008-12-19 18:17 ok 2008-12-19 18:17 this is the most important part of tux3, the filemap and dwalk 2008-12-19 18:18 yes 2008-12-19 18:18 and it's getting close to usable, so I'm happy 2008-12-19 18:19 and later, we will do benchmark and optimize? 
2008-12-19 18:20 yes 2008-12-19 18:20 good 2008-12-19 18:20 big files first, because having a full btree for even a small file doesn't look so good ;) 2008-12-19 18:22 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-12-19 18:22 hi konrad 2008-12-19 18:23 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-12-19 19:14 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-19 20:17 get_segs/new_segs break up is now done 2008-12-19 20:17 each is a reasonable 75 lines or so 2008-12-19 20:17 now to make it work again 2008-12-19 20:18 good 2008-12-19 20:18 dwalk_chop was done 2008-12-19 20:18 ACTION goes to work on new_segs 2008-12-19 20:18 after all, old dwalk_chop seems to had bugs 2008-12-19 20:18 good, that is a reason for rewriting it ;) 2008-12-19 20:19 and I think, position on current extent is good 2008-12-19 20:19 instead of next 2008-12-19 20:19 yes 2008-12-19 20:20 those may still have bug, however I think we can merge those 2008-12-19 20:21 and start working with them to find any bugs 2008-12-19 20:21 now, filemap seems to have same behavior 2008-12-19 20:21 I will take your word for it, because my filemap is now changed a lot 2008-12-19 20:22 whoops, public repo has new changeset 2008-12-19 20:22 yes, sorry 2008-12-19 20:22 I'll pull it 2008-12-19 20:22 it should merge easily 2008-12-19 20:23 I don't think you touched _split 2008-12-19 20:23 yes 2008-12-19 20:23 ah, no 2008-12-19 20:23 :) 2008-12-19 20:23 um... 
2008-12-19 20:23 I did't touched 2008-12-19 20:24 :) 2008-12-19 20:24 you can write "no (english)" 2008-12-19 20:24 ;) 2008-12-19 20:24 ok 2008-12-19 20:24 so, no, I didn't 2008-12-19 20:25 just kidding, I knew what you meant 2008-12-19 20:25 :) 2008-12-19 20:25 in korea "ney" is slang for yes and in germany it is slang for "no" 2008-12-19 20:26 they both say "ney ney" and it means the opposite 2008-12-19 20:26 :) 2008-12-19 20:26 well, maybe I can say "hai" or "iie" 2008-12-19 20:27 it is yes or no in japanese 2008-12-19 20:27 :) 2008-12-19 20:27 that must lead to lots of confusion 2008-12-19 20:27 :) 2008-12-19 20:27 yes 2008-12-19 20:28 cantonese has a slightly different problem, they put "m" in front of anything to mean the "not" 2008-12-19 20:28 but the "m" is very quiet 2008-12-19 20:28 and is often lost 2008-12-19 20:29 I had to add an array of two struct dwalks to the get_segs interface 2008-12-19 20:29 ok 2008-12-19 20:29 to get the dwalk position at start and end of IO region 2008-12-19 20:30 btw, -Werror bothered you? 2008-12-19 20:31 I was getting errors on unused local parameters 2008-12-19 20:31 while testing 2008-12-19 20:31 oh 2008-12-19 20:31 some slight improvement to the way we use Werror will be good 2008-12-19 20:32 always use on make all and make tests 2008-12-19 20:32 -Wno-unused-parameter didn't work for it? 2008-12-19 20:32 it would have 2008-12-19 20:32 ah, local variables? 
2008-12-19 20:33 right 2008-12-19 20:33 ah, ok 2008-12-19 20:33 we can put Werror back, or adjust it a little to not get in the way 2008-12-19 20:34 it does a lot of good, pointer warnings really should be errors 2008-12-19 20:35 well, it still warns, so it is not problem 2008-12-19 20:36 static int get_segs(struct cursor *cursor, block_t start, unsigned limit, 2008-12-19 20:36 struct seg seg[], unsigned max_segs, struct dwalk seek[2], unsigned overlap[2]) <- how get_segs looks now 2008-12-19 20:36 not really pretty 2008-12-19 20:36 btw, it changed filemap more or less 2008-12-19 20:36 what changed it? 2008-12-19 20:37 dwalk_next() 2008-12-19 20:37 now, it doesn't return extent 2008-12-19 20:37 ah 2008-12-19 20:37 so I will have a slightly interesting merge 2008-12-19 20:38 if needed, I'll apply that change to your version 2008-12-19 20:38 I can do it on this side 2008-12-19 20:39 my version is massively changed 2008-12-19 20:39 of filemap 2008-12-19 20:39 ok 2008-12-19 20:40 I will put my changes aside, pull yours, then do the merge with "patch" I think 2008-12-19 20:41 hg merge is much more powerful, but I am used to patch 2008-12-19 20:43 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-19 20:45 basically, my filemap should change it logically 2008-12-19 20:46 so, you can overwrite it 2008-12-19 20:46 and if needed, I'll re-apply dwalk_next change 2008-12-19 20:46 well, anyway, please check it 2008-12-19 20:49 whoops 2008-12-19 20:49 basically, my filemap should not change it logically 2008-12-19 20:57 reading 2008-12-19 21:00 hg view needs a text search 2008-12-19 21:03 Find? 
2008-12-19 21:03 oh :) 2008-12-19 21:03 ah, but it didn't search the patch 2008-12-19 21:03 right 2008-12-19 21:04 gitk can it, hg has to update hgk 2008-12-19 21:05 and I would like it to be qt :) 2008-12-19 21:05 well, it's very nice as it is 2008-12-19 21:05 ah, maybe hg has qt fontend 2008-12-19 21:07 anyway, I will just copy my filemap.c over yours and fix up as you did 2008-12-19 21:07 yes 2008-12-19 21:23 pushed to public, now to integrate with my filemap.c 2008-12-19 21:23 ok 2008-12-19 21:44 compiles without warnings 2008-12-19 21:45 good 2008-12-19 21:46 it seems like dwalk_next could easily return the pointer to next extent instead of a flag if we want 2008-12-19 21:46 but it makes little difference to the code, just a few lines 2008-12-19 21:48 need to check if next extent is valid 2008-12-19 21:49 ah, it can 2008-12-19 21:49 right, you just return 0 now when it is not valid 2008-12-19 21:49 anyway 2008-12-19 21:50 I'm leaving it as it is 2008-12-19 21:50 and concentrate on making new_segs work 2008-12-19 21:50 ok 2008-12-19 21:50 when did you sleep last, just out of interest? 2008-12-19 21:51 um.., maybe, 0:00 ~ 6:00 or so 2008-12-19 21:52 ACTION tries to parse that 2008-12-19 22:38 -!- RazvanM(~RazvanM@96.234.237.45) has joined #tux3 2008-12-19 23:39 folks 2008-12-19 23:47 hi bh 2008-12-19 23:55 hi all 2008-12-20 00:20 hirofumi: ping 2008-12-20 00:21 flips: ping 2008-12-20 00:22 hi 2008-12-20 00:22 wassup? 2008-12-20 00:22 wonder why we started storing mtimes always 2008-12-20 00:22 (rather than just when it is different than ctime) 2008-12-20 00:22 ah 2008-12-20 00:23 change that back? 2008-12-20 00:23 http://hg.tux3.org/tux3/rev/adeebbeef7c5 2008-12-20 00:24 hi 2008-12-20 00:24 hi 2008-12-20 00:25 hirofumi, I forget what the reasoning was about mtime 2008-12-20 00:25 i assume the reason is just to keep it simpler? 2008-12-20 00:25 if inode doesn't have MTIME_BIT, save_inode will strip it? 
2008-12-20 00:25 yeah 2008-12-20 00:26 so, normal inode should have it 2008-12-20 00:26 in practice most files dont need it 2008-12-20 00:26 i_mtime? 2008-12-20 00:26 right because it is usually the same as i_ctime 2008-12-20 00:26 we can set the mtime in the inode without storing it on disk 2008-12-20 00:26 so we dont need to store it 2008-12-20 00:27 so we assume it is = ctime 2008-12-20 00:27 ah 2008-12-20 00:27 i see 2008-12-20 00:27 I think what happened was, the wrong mtime was observed in a real kernel, and that was the fix, but there is a more elegant fix 2008-12-20 00:28 that takes advantage of our variable inode attributes 2008-12-20 00:28 I think we can optimize save_inode/make_inode and tux_new_inode/open_inode 2008-12-20 00:29 yeah, why can't save_inode just deal with it 2008-12-20 00:30 save_inode/make_inode stores attributes 2008-12-20 00:30 so, it will strip if ctime and mtime 2008-12-20 00:30 yeah i dont think setattr is quite the right place 2008-12-20 00:30 yes 2008-12-20 00:30 open_inode can set mtime to ctime if mtime is not present 2008-12-20 00:31 yes 2008-12-20 00:31 and we should rename open_inode as load_inode 2008-12-20 00:31 both should be done at same time 2008-12-20 00:31 or legaci read_inode? 2008-12-20 00:31 legacy 2008-12-20 00:32 well, make_inode will also be renamed 2008-12-20 00:32 when I see "save_inode" I naturally look around for a "oad_inode" 2008-12-20 00:32 load_inode 2008-12-20 00:33 i see 2008-12-20 00:33 well, load_inode is fine for me 2008-12-20 00:34 I am the one who gave it the stupid name in the first place ;) 2008-12-20 00:34 perhaps it would be nice to have commit messages emailed to the mailing list 2008-12-20 00:34 w/ diffs if they aren't too long 2008-12-20 00:34 that would bump the traffic up 2008-12-20 00:34 maybe only daily though? 
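The mtime idea above (store mtime on disk only when it differs from ctime, and fill it back in from ctime on load) might be sketched like this. Only `MTIME_BIT`, `save_inode` and `load_inode` are names from the log; `CTIME_BIT`, both struct layouts and the `_times` helper names are hypothetical, and the real tux3 attribute encoding is certainly different.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical "attribute present" flags; only MTIME_BIT is named in the log. */
enum { CTIME_BIT = 1 << 0, MTIME_BIT = 1 << 1 };

/* Hypothetical in-memory and on-disk views of the two timestamps. */
struct mem_inode { uint64_t ctime, mtime; };
struct disk_inode { unsigned present; uint64_t ctime, mtime; };

/* Save side: strip mtime whenever it carries no extra information. */
void save_inode_times(const struct mem_inode *in, struct disk_inode *out)
{
	out->present = CTIME_BIT;
	out->ctime = in->ctime;
	out->mtime = 0;
	if (in->mtime != in->ctime) {
		out->present |= MTIME_BIT;
		out->mtime = in->mtime;
	}
}

/* Load side: a missing mtime attribute means "same as ctime". */
void load_inode_times(const struct disk_inode *in, struct mem_inode *out)
{
	assert(in->present & CTIME_BIT);
	out->ctime = in->ctime;
	out->mtime = (in->present & MTIME_BIT) ? in->mtime : in->ctime;
}
```

Keeping both halves in the save/load pair, rather than in setattr, means the in-memory inode always has a valid mtime and the elision stays a pure storage detail.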
2008-12-20 00:34 diffs are pretty long these days, and the repo should be just a click away 2008-12-20 00:34 if needed, I'd like to different mailing-list 2008-12-20 00:35 if needed, I'd like to have different mailing-list 2008-12-20 00:35 tux3-commit or something 2008-12-20 00:35 it's not too much traffic now 2008-12-20 00:35 hey flips 2008-12-20 00:35 i dont think flips wants to setup another mailing list 2008-12-20 00:35 heh 2008-12-20 00:35 flips: up for fighting with mailman :P 2008-12-20 00:35 really not :) 2008-12-20 00:35 :) 2008-12-20 00:35 and commit messages going to the list is nice 2008-12-20 00:35 could be a daily digest 2008-12-20 00:35 it's always topical, and somebody might spot something 2008-12-20 00:36 is that a good compromise? 2008-12-20 00:36 all of them when they happen is fine with me 2008-12-20 00:36 we're far from having too much traffic 2008-12-20 00:36 true 2008-12-20 00:37 hirofumi: i can set the subject or a header to something you can easily procmail out if you think it will be too much ;) 2008-12-20 00:37 if it has subject, I'm happy :) 2008-12-20 00:38 if possible, header is more happy 2008-12-20 00:39 because commit is too late to review 2008-12-20 00:40 after commit 2008-12-20 00:40 but not too late to spot a problem, and email to the list about it 2008-12-20 00:40 yes, for developers 2008-12-20 00:41 however, users will see problems 2008-12-20 00:41 right, I don't think we have a user yet :) 2008-12-20 00:41 :) 2008-12-20 00:41 when we have some, then we can separate the list 2008-12-20 00:41 yes, it's ok for now 2008-12-20 00:42 if so, that is good for me 2008-12-20 00:44 well, hg has rss/atom though 2008-12-20 00:46 shapor has a pretty good history of good taste in such things 2008-12-20 00:46 great 2008-12-20 01:15 why on earth does C assume unsigned for $x? It's like... signed hexadecimal, ooh scary 2008-12-20 01:15 for %x I mean 2008-12-20 01:18 because it is %u in decimal? 2008-12-20 01:19 :) 2008-12-20 01:19 because... 
it just grew 2008-12-20 03:21 :o 2008-12-20 07:41 -!- pranith(~bobby@122.162.70.237) has joined #tux3 2008-12-20 08:20 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-20 08:29 -!- pranith(~bobby@122.162.70.237) has joined #tux3 2008-12-20 09:07 -!- pranith(~bobby@122.162.70.237) has joined #tux3 2008-12-20 09:58 -!- pranihome(~bobby@122.162.73.236) has joined #tux3 2008-12-20 10:29 ok, a new push for fsck 2008-12-20 10:39 -O0 working for anyone? 2008-12-20 10:42 -!- pranith(~bobby@122.162.73.236) has joined #tux3 2008-12-20 11:40 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-20 11:48 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-20 12:04 -!- bushman(~bushman@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-12-20 13:43 -!- bushman(~bushman@c-76-23-106-132.hsd1.sc.comcast.net) has left #tux3 2008-12-20 15:48 sk8 oclock 2008-12-20 16:49 hey flips 2008-12-20 18:32 -!- flips(~phillips@phunq.net) has joined #tux3 2008-12-20 20:35 -!- bushman_(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-12-20 20:48 hi, flips there? 
2008-12-20 20:48 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-20 20:48 I found dwalk_chop bug 2008-12-20 20:48 please pull it 2008-12-20 23:12 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-20 23:32 -!- MaZe(~MaZe@abls23.neoplus.adsl.tpnet.pl) has joined #tux3 2008-12-21 00:14 -!- RazvanM(~RazvanM@96.234.237.45) has joined #tux3 2008-12-21 00:15 hirofumi ok 2008-12-21 01:52 well a plausible new_segs is written 2008-12-21 02:38 -!- MaZe(~MaZe@aaho211.neoplus.adsl.tpnet.pl) has joined #tux3 2008-12-21 06:46 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-21 08:45 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-21 09:08 -!- pranith(~bobby@122.163.51.185) has joined #tux3 2008-12-21 10:32 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-21 12:32 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-21 13:53 starting to test the new filemap.c 2008-12-21 14:31 -!- pgquiles(~pgquiles@91.Red-88-18-198.staticIP.rima-tde.net) has joined #tux3 2008-12-21 17:21 ...and starting to get correct results 2008-12-21 17:22 a couple simple cases running, using the leaf split/merge statregy for handling tails, mumbled about here 2008-12-21 17:22 more complex cases can wait until after dinner 2008-12-21 17:23 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-21 17:23 please check it later 2008-12-21 17:23 most doutable one is balloc change 2008-12-21 17:28 ok 2008-12-21 17:28 so far, your rewritten dwalk has functioned fine 2008-12-21 17:29 all I did was put tracing output back in it 2008-12-21 17:29 this is very useful for debugging the higher level 2008-12-21 17:30 good 2008-12-21 17:41 folks 2008-12-21 17:41 hey flips 2008-12-21 17:41 it's raining here like crazy, coming down your way 2008-12-21 18:50 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-21 
19:36 flips, there? 2008-12-21 19:36 now, I'm thinking about extent merge 2008-12-21 21:09 -!- ajonat(~ajonat@190.48.101.107) has joined #tux3 2008-12-21 21:45 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-21 22:09 -!- RazvanM(~RazvanM@96.234.233.172) has joined #tux3 2008-12-21 22:10 -!- MaZe(~MaZe@aaho211.neoplus.adsl.tpnet.pl) has joined #tux3 2008-12-21 22:57 hirofumi, I'm ok with the balloc change, it's minor 2008-12-21 22:57 it moves balloc/bfree_extent version 2008-12-21 22:58 removes 2008-12-21 22:58 sure 2008-12-21 22:58 it's natural 2008-12-21 22:58 ok 2008-12-21 22:58 good 2008-12-21 22:58 I think balloc should always take a goal parameter 2008-12-21 22:59 but since we don't have any allocation policy yet, it's not needed right now 2008-12-21 22:59 yes, maybe 2008-12-21 23:00 my reason is, we should be thinking carefully about the allocation policy for every call 2008-12-21 23:00 yes 2008-12-21 23:00 but not yet ;) 2008-12-21 23:01 there are enough other things to think about 2008-12-21 23:01 sure 2008-12-21 23:01 ready for a pull? 2008-12-21 23:01 I've updated the filemap test 2008-12-21 23:01 so, not yet 2008-12-21 23:02 ok 2008-12-21 23:03 can you leave it as dwalk_pack for now? 
make merging a little easier for me 2008-12-21 23:04 hmm 2008-12-21 23:04 affects a bunch of stuff 2008-12-21 23:04 so it is ok either way 2008-12-21 23:04 I'll push to your tree if possile 2008-12-21 23:05 I'll merge to your tree 2008-12-21 23:05 not to current public repo 2008-12-21 23:06 I think you will be ready before me 2008-12-21 23:06 I just started testing it in user space, have not integrated with kernel 2008-12-21 23:06 I think breakage is ok 2008-12-21 23:07 because current repo is also not working 2008-12-21 23:07 I can produce some :) 2008-12-21 23:07 no problem 2008-12-21 23:07 ok, I will do a quick kernel hookup then, and push to public, easier than exposing another repo 2008-12-21 23:08 yes 2008-12-21 23:53 well, I can still echo text into a file 2008-12-21 23:54 good 2008-12-21 23:56 new filemap already handles extent merge? 2008-12-21 23:57 yes, but not well tested 2008-12-21 23:57 oh 2008-12-21 23:57 simple cases seem to work 2008-12-21 23:57 it deletes extent? 2008-12-21 23:58 if filled hole and extents become contiguous 2008-12-21 23:58 fills in gaps between, as before 2008-12-21 23:58 but it is now easier to make it replace the extent 2008-12-21 23:58 most of the improvement is the way the leaf tail is handled 2008-12-21 23:58 and the broken retry loop is replaced with something sensible 2008-12-21 23:59 I think your _merge can we added as a fast path 2008-12-21 23:59 and that the split/merge approach is for the general case 2008-12-21 23:59 yes, this question is for that work 2008-12-22 00:00 I thought the merge has 3 cases 2008-12-22 00:00 back merge, front merge, fill hole 2008-12-22 00:01 back merge and front merge can do by dwalk_update() 2008-12-22 00:01 for fill hole, now I'm writing dwalk_delete() 2008-12-22 00:02 that is when your are adding one new extent to replace a range of existing extents? 
2008-12-22 00:02 or I don't understand what you said 2008-12-22 00:03 those are adding one new extent 2008-12-22 00:03 ok, and what is the defintion of back merge? 2008-12-22 00:03 and new extent is merging with existing extents 2008-12-22 00:04 new extent can merge with next extent 2008-12-22 00:04 next existing extent 2008-12-22 00:05 right 2008-12-22 00:05 in that case, dwalk_update needs to update existing entry and extent 2008-12-22 00:05 sometimes merge is possible, sometimes not 2008-12-22 00:05 yes 2008-12-22 00:06 ok 2008-12-22 00:06 I get it 2008-12-22 00:07 A description of what dwalk_merge does on the mailing list would be helpful 2008-12-22 00:07 ok 2008-12-22 00:08 well, I'm going to add more low level operation, instead of merge 2008-12-22 00:08 dwalk_update() and dwalk_delete() 2008-12-22 00:09 to use it, caller have to check the type of merge 2008-12-22 00:10 well, we don't need those right now 2008-12-22 00:11 ok, cleaned up enough to push 2008-12-22 00:11 ok 2008-12-22 00:13 ok, it's in public 2008-12-22 00:13 not as tidy as your changesets 2008-12-22 00:14 I don't know if the find_segs/fill_segs factoring is useful at all 2008-12-22 00:14 get_segs seems good though 2008-12-22 00:14 and the only real problem we had with that interface is get_block knowing when to seg buffer_new 2008-12-22 00:14 hey flips 2008-12-22 00:14 hi bh 2008-12-22 00:14 oh 2008-12-22 00:14 I goofed 2008-12-22 00:15 another quick push coming up 2008-12-22 00:17 wait, I didn't goof 2008-12-22 00:17 get_block should always pass 1 as max_segs 2008-12-22 00:21 ok, I'll merge my changes 2008-12-22 00:22 some functions was exported, it is only for testing? 
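A rough sketch of how a caller might "check the type of merge" for the three cases named above. The log does not pin down which neighbour "back merge" and "front merge" refer to, so this sketch uses the neutral, hypothetical names `MERGE_WITH_PREV` and `MERGE_WITH_NEXT`; `struct extent` and `classify_merge` are also assumptions, not tux3 code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t block_t;

/* Hypothetical extent: logical start, physical start, length in blocks. */
struct extent {
	block_t logical;
	block_t physical;
	unsigned count;
};

enum merge_kind { MERGE_NONE, MERGE_WITH_PREV, MERGE_WITH_NEXT, MERGE_FILL_HOLE };

/* True if b starts exactly where a ends, both logically and physically. */
static int contiguous(const struct extent *a, const struct extent *b)
{
	return a->logical + a->count == b->logical &&
	       a->physical + a->count == b->physical;
}

/*
 * Decide how a new extent relates to its neighbours. prev and next may be
 * NULL when there is no neighbour on that side.
 */
enum merge_kind classify_merge(const struct extent *prev, const struct extent *add,
			       const struct extent *next)
{
	int with_prev, with_next;

	assert(add && add->count > 0);
	with_prev = prev && contiguous(prev, add);
	with_next = next && contiguous(add, next);

	if (with_prev && with_next)
		return MERGE_FILL_HOLE; /* new extent joins both neighbours into one */
	if (with_prev)
		return MERGE_WITH_PREV;
	if (with_next)
		return MERGE_WITH_NEXT;
	return MERGE_NONE;
}
```

Going by the plan in the log, the two single-neighbour cases would be handled with dwalk_update() and the fill-hole case would additionally need dwalk_delete(). A real version would presumably also refuse a merge whose combined count exceeds the maximum extent count.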
2008-12-22 00:23 filemap.c uses them now 2008-12-22 00:23 oh 2008-12-22 00:24 it doesn't use btree_leaf_split 2008-12-22 00:24 I have written a post about it 2008-12-22 00:25 but when writing about the new API, I thought I should try it first and see if it makes sense 2008-12-22 00:25 so I implemented it first instead 2008-12-22 00:25 and now I can revise the email and post it 2008-12-22 00:26 my test case in user/filemap.c fails if I run it in reverse 2008-12-22 00:26 I need to fix that 2008-12-22 00:26 ok 2008-12-22 00:26 but you don't need to wait for me 2008-12-22 00:26 it will be a small fix probably 2008-12-22 00:26 well, for dleaf_init, maybe we should use ops->leaf_init() 2008-12-22 00:28 filemap.c will only ever operate on a data btree 2008-12-22 00:28 I know it is attractive to try to hide the specific methods behind an oo type interface 2008-12-22 00:29 yes 2008-12-22 00:29 if we actually had an oo compiler instead of fake oo in C, then the compiler would optimize your version to the same as mine 2008-12-22 00:30 but C will do the function lookup at runtime 2008-12-22 00:30 probably 2008-12-22 00:30 it's a small thing 2008-12-22 00:32 or just move segs stuff to dleaf 2008-12-22 00:33 let me see 2008-12-22 00:33 well, not needed now 2008-12-22 00:34 my hope is that filemap will use primitives from dleaf, btree and version 2008-12-22 00:34 I think it needs to be its own thing 2008-12-22 00:34 ok 2008-12-22 00:35 I am not too happy with the find_segs/fill_segs split 2008-12-22 00:35 it didn't factor very cleanly 2008-12-22 00:36 but it functions for now 2008-12-22 00:36 too share dleaf with filemap, we may want to add dleaf.h later 2008-12-22 00:36 and maybe it will improve, or we can put them back together again 2008-12-22 00:36 we might 2008-12-22 00:36 -!- MaZe(~MaZe@abmc162.neoplus.adsl.tpnet.pl) has joined #tux3 2008-12-22 00:37 I know a lot of people like to have one header file for every c file 2008-12-22 00:37 I don't like that in a small program, 
because it generates a lot of noise files 2008-12-22 00:38 i see 2008-12-22 00:38 it's a small point though 2008-12-22 00:47 ok, now, I'll start to test 2008-12-22 00:49 segs = get_segs(inode, 4, 5, segvec, 1, 1); show_segs(segvec, segs); 2008-12-22 00:49 segs = get_segs(inode, 2, 3, segvec, 1, 1); show_segs(segvec, segs); <- this is the bug I'm working on 2008-12-22 00:49 1 entry groups: 2008-12-22 00:49 0/2: 4 => 3/1; 2 =>; <- results in this (wrong) 2008-12-22 00:50 looks like dleaf bug? 2008-12-22 00:50 now, it still using dleaf_chop_after 2008-12-22 00:50 dwalk_chop_after 2008-12-22 00:50 it has bug 2008-12-22 00:50 yes 2008-12-22 00:51 I'll try to merge some patches 2008-12-22 00:51 and see if it fixes 2008-12-22 00:51 well, should it really be chop_after 2008-12-22 00:51 or just chop 2008-12-22 00:51 I think, just chop 2008-12-22 00:52 yes, my patches removes old chop and chop_after 2008-12-22 00:52 ah, so maybe it will fix this 2008-12-22 00:52 it might, and there is also a bug with calculating the new "above" variable 2008-12-22 00:53 should I check in my broken test case? 2008-12-22 00:53 wait a bit 2008-12-22 00:53 ok 2008-12-22 00:56 -!- pgquiles(~pgquiles@91.Red-88-18-198.staticIP.rima-tde.net) has joined #tux3 2008-12-22 00:57 0/2: 2 => 2/1; 4 => 3/1; 2008-12-22 00:57 1 level btree at 0: 2008-12-22 00:57 1 entry groups: 2008-12-22 00:57 0/2: 2 => 2/1; 4 => 3/1; 2008-12-22 00:57 that looks good 2008-12-22 00:57 ok 2008-12-22 00:57 looks like dwak_chop bug 2008-12-22 00:58 0/2: 2 => 3/1; 4 => 2/1; <- maybe it should be this? 2008-12-22 00:59 I don't know 2008-12-22 00:59 I just ran filemaptest after merge 2008-12-22 00:59 well it seems to be an improvement 2008-12-22 01:00 http://userweb.kernel.org/~hirofumi/dleaf/dleaf_chop-fix.patch 2008-12-22 01:01 for right now, dwalk_chop2 will do right thing 2008-12-22 01:02 ok, I will try it 2008-12-22 01:02 please copy dwalk_chop2 2008-12-22 01:02 copy? 
2008-12-22 01:02 oh 2008-12-22 01:02 just copy in from the patch 2008-12-22 01:02 ok 2008-12-22 01:02 yes 2008-12-22 01:03 ugh 2008-12-22 01:03 I may sent wrong patch 2008-12-22 01:04 I will wait a moment 2008-12-22 01:07 1 level btree at 0: 2008-12-22 01:07 1 entry groups: 2008-12-22 01:07 0/2: 2 => 2/1; 4 => 3/1; 2008-12-22 01:07 1 segs: 3/1 2008-12-22 01:07 ah 2008-12-22 01:07 dwalk_chop is right already 2008-12-22 01:08 0/2: 4 => 3/1; 2 =>; 2008-12-22 01:09 how can this reproduce? 2008-12-22 01:09 how can I this reproduce? 2008-12-22 01:09 comment out the add lf "above" in find_segs 2008-12-22 01:10 above is calculated incorrectly 2008-12-22 01:10 comment out the add of "above" in find_segs, I meant 2008-12-22 01:11 test is second one? 2008-12-22 01:12 segs = get_segs(inode, 4, 5, segvec, 1, 1); show_segs(segvec, segs); 2008-12-22 01:12 segs = get_segs(inode, 2, 3, segvec, 1, 1); show_segs(segvec, segs); 2008-12-22 01:12 ok 2008-12-22 01:12 0/2: 4 => 0/1; 2 =>; 2008-12-22 01:13 0/1: 2 => 0/1; 2008-12-22 01:13 just a sec 2008-12-22 01:14 +// seg[segs - 1].count -= above; 2008-12-22 01:16 0/1: 2 => 3/1; 2008-12-22 01:16 same as I have now 2008-12-22 01:17 with dwalk_chop 2008-12-22 01:17 yes 2008-12-22 01:17 thinking... 2008-12-22 01:17 it seems right 2008-12-22 01:18 right, it should split out the tail 2008-12-22 01:18 it is show_segs? 2008-12-22 01:18 so, it shows 2? 
2008-12-22 01:19 that is dump_tree 2008-12-22 01:19 show_segs does not show the logical address 2008-12-22 01:19 oh 2008-12-22 01:19 only physical and count 2008-12-22 01:20 merge tail seems not merge tail 2008-12-22 01:22 right 2008-12-22 01:22 just checking what happens at the split 2008-12-22 01:24 the split does not seem right 2008-12-22 01:24 valgrind shows some problems 2008-12-22 01:26 not this bug though 2008-12-22 01:26 that's an upcoming bug ;) 2008-12-22 01:26 ok :) 2008-12-22 01:27 I believe you agreed to me merging it broken ;) 2008-12-22 01:28 :) however, this merge will help me 2008-12-22 01:29 yes, we were getting too far apart 2008-12-22 01:30 fill_segs: leaf... 2008-12-22 01:30 1 entry groups: 2008-12-22 01:30 0/1: 4 => 2/1; 2008-12-22 01:30 split 0x805e400 into 0x805e508 2008-12-22 01:30 split 1 entries at group 1, entry 0 2008-12-22 01:30 split extents at 1 2008-12-22 01:30 fill_segs: leaf... 2008-12-22 01:30 1 entry groups: 2008-12-22 01:30 0/1: 4 => 2/1; 2008-12-22 01:30 fill_segs: tail... 2008-12-22 01:30 0 entry groups: 2008-12-22 01:30 so if it is really splitting at entry 0, the 4 => should have ended up in the tail 2008-12-22 01:31 yes 2008-12-22 01:32 ah, the trace output is misleading 2008-12-22 01:34 split 0x805e400 into 0x805e508 at 1 2008-12-22 01:34 so the split is right 2008-12-22 01:34 sounds strange 2008-12-22 01:34 the position chosen for the split is wrong 2008-12-22 01:34 ah 2008-12-22 01:35 dwalk_dump() will help it 2008-12-22 01:35 seek[1].entry 2008-12-22 01:35 yes 2008-12-22 01:35 seek[1], valgrind warned it 2008-12-22 01:35 ?
2008-12-22 01:36 (gdb) call dwalk_dump(seek[1].entry) 2008-12-22 01:36 dwalk_dump: end of extent 2008-12-22 01:36 I wrote a show_dwalk, it is not very good though 2008-12-22 01:36 ok, I missed that 2008-12-22 01:37 ah 2008-12-22 01:37 current code does dwalk_next() 2008-12-22 01:37 and it loops with next_extent 2008-12-22 01:37 it is not human friendly 2008-12-22 01:38 and probably buggy 2008-12-22 01:39 I'll apply previous series 2008-12-22 01:42 your dwalk_dump is much better than my show_dwalk, we can delete mine 2008-12-22 01:42 I'll remove it 2008-12-22 01:45 split_at is wrong, it is off by one 2008-12-22 01:45 should not be entries - entry 2008-12-22 01:47 probably 2008-12-22 01:48 just writing entries - 1 - entry make it fail on an empty leaf 2008-12-22 01:49 so some thinking would be good 2008-12-22 01:49 fail is crash? 2008-12-22 01:49 hits the assert 2008-12-22 01:49 at least I got that right ;) 2008-12-22 01:50 if so, it's good 2008-12-22 01:50 caller is wrong sounds fine 2008-12-22 01:55 below is old offset 2008-12-22 01:55 "above" is what is meaning? 2008-12-22 01:55 above is the part of the extent above the io region 2008-12-22 01:56 i see 2008-12-22 01:56 it is wrong now? 2008-12-22 01:56 very wrong 2008-12-22 01:56 I just commented it out 2008-12-22 01:56 ok 2008-12-22 01:56 it is not being used yet 2008-12-22 01:57 or, a test case has not been written yet 2008-12-22 01:57 so... it is an error to split an empty leaf I think 2008-12-22 01:57 seek[1] in find_segs would be wrong? 2008-12-22 01:57 because we cannot return a valid tuxkey 2008-12-22 01:58 it looks like dleaf_split_at is wrong 2008-12-22 01:58 seek[1] looks right according to your dump 2008-12-22 01:58 end of dleaf? 2008-12-22 01:59 should be at zero, which it is 2008-12-22 01:59 1 entry groups: 2008-12-22 01:59 0/2: 2 => 3/1; 4 => 2/1; <- correct 2008-12-22 02:00 already fixed? 
2008-12-22 02:00 - unsigned encount = entries - enbase, split = entries - entry; 2008-12-22 02:00 + unsigned encount = entries - enbase, split = entries - 1 - entry; 2008-12-22 02:00 + printf("split %p into %p at %x\n", leaf, dest, split); 2008-12-22 02:00 + if (split == -1) 2008-12-22 02:00 + return 0; 2008-12-22 02:00 assert(split <= encount); 2008-12-22 02:01 well I don't like my fix 2008-12-22 02:02 the idea of dleaf_split is, it returns the key where it did the split to be used for the btree leaf insert 2008-12-22 02:03 splitting an empty dleaf is invalid, because it will return garbage for a key 2008-12-22 02:04 so I am not quite happy with the regularity of the split/merge primitives 2008-12-22 02:04 (gdb) call dwalk_dump(seek[1]) 2008-12-22 02:04 dwalk_dump: end of extent 2008-12-22 02:04 um.. 2008-12-22 02:05 I didn't see that in my trace output 2008-12-22 02:05 segs = get_segs(inode, 2, 3, segvec, 1, 1); show_segs(segvec, segs); 2008-12-22 02:05 yes 2008-12-22 02:05 it is gdb 2008-12-22 02:06 and it calls fill_segs() 2008-12-22 02:06 then, "call dwalk_dump(seek[1])" 2008-12-22 02:06 seek[1] points end of extent 2008-12-22 02:06 it should point first extent? 2008-12-22 02:07 did the leaf change under it? 2008-12-22 02:07 the walk looks fine to me according to your dump 2008-12-22 02:07 segs = get_segs(inode, 4, 5, segvec, 1, 1); show_segs(segvec, segs); 2008-12-22 02:07 segs = get_segs(inode, 2, 3, segvec, 1, 1); show_segs(segvec, segs); 2008-12-22 02:07 this test? 2008-12-22 02:07 yes 2008-12-22 02:08 works now, with the diff above 2008-12-22 02:08 I am just worrying about how empty dleaf should be handled for split 2008-12-22 02:08 sounds strange 2008-12-22 02:10 does this sound right? ... 
it makes sense to be able to merge a nonempty leaf with an empty leaf, but merging two empty leaves sounds like it should always be an error, similarly, splitting an empty leaf should be an error 2008-12-22 02:11 ah, strange is not against it 2008-12-22 02:11 ah, strange is not against to it 2008-12-22 02:12 why do I see end of extent 2008-12-22 02:14 only in gdb? 2008-12-22 02:14 ok 2008-12-22 02:15 it seems gdb only 2008-12-22 02:15 makes sense 2008-12-22 02:15 the leaf was changed, so the walk is no longer valid 2008-12-22 02:16 invalid after split_at? 2008-12-22 02:16 right, and then I use it later... that sounds bad 2008-12-22 02:17 no problem 2008-12-22 02:18 right 2008-12-22 02:18 it split at seek[1] 2008-12-22 02:18 and chop at seek[0]? 2008-12-22 02:18 yes, and seek[0] is logically below the split 2008-12-22 02:18 and is unchanged 2008-12-22 02:18 risky, but not wrong 2008-12-22 02:19 if split_at garantees to split at seek[1], it is right 2008-12-22 02:19 it's nice that it works 2008-12-22 02:19 anyway, with this checkin you see what I have in mind for the general case of dleaf editing 2008-12-22 02:20 it needs a day of debugging 2008-12-22 02:20 however, why do we need chop? 2008-12-22 02:20 we are rewriting a region of the original leaf 2008-12-22 02:21 split_at should chop already? 2008-12-22 02:21 it splits at the top of the region 2008-12-22 02:21 the chop chops at the bottom of the region 2008-12-22 02:21 ah 2008-12-22 02:22 within the region we are going to make arbitrary changes 2008-12-22 02:22 so we don't try to preserve those extents at all, but rebuild using the segs vector 2008-12-22 02:23 in many case we will be able to optimize that 2008-12-22 02:23 yes 2008-12-22 02:23 but I am worried about handling the general case now, efficiency is not the first concern 2008-12-22 02:24 well, anyway, in this case, we can just copy to tail? 
2008-12-22 02:24 btw, if dwalk_first(seek[0]), copy leaf to tail 2008-12-22 02:24 yes, an example of an optimization we could do 2008-12-22 02:25 well, it can say optimize, but it is also bug fix 2008-12-22 02:25 which bug? 2008-12-22 02:25 we don't call if leaf or tail is empty 2008-12-22 02:25 ah 2008-12-22 02:26 probably the right fix 2008-12-22 02:26 maybe, if dwalk_first(seek[0]) and dwalk_end(seek[1]) 2008-12-22 02:28 so there is no tail, and chop will empty the leaf 2008-12-22 02:28 yes 2008-12-22 02:29 chop works even if it points end of extent 2008-12-22 02:29 it is always nice when the corner cases work 2008-12-22 02:29 and for that reason, I try not to cover up corner cases with special tests at a high level 2008-12-22 02:30 yes, it should handle all case even if leaf is empty 2008-12-22 02:30 right 2008-12-22 02:30 I think new one already does 2008-12-22 02:31 the only question I have is about the validity of splitting an empty leaf, which returns a key, which can't be valid 2008-12-22 02:31 so this is just a small point, but it is the kind of thing I obsess about 2008-12-22 02:33 ok I just sleep 2008-12-22 02:33 I can't see why split_at returns key 2008-12-22 02:33 unlike you, I seem to need to sleep sometimes 2008-12-22 02:33 good point 2008-12-22 02:34 dleaf_split can return a key, not split_at 2008-12-22 02:34 yes 2008-12-22 02:35 so, just add small wrapper to check some state 2008-12-22 02:36 just move the get_index to dleaf_split 2008-12-22 02:37 split_at does not check for any errors, it should probably do some checking 2008-12-22 02:38 and split read group and entry in dest? 2008-12-22 02:38 read from into? 2008-12-22 02:39 I'm trying to parse that 2008-12-22 02:39 dleaf_split have to calc "destgroups -1" position?
2008-12-22 02:41 deleaf_split does not know anything about dest 2008-12-22 02:42 I think tuxkey from split_at, it is first index in "into" leaf 2008-12-22 02:42 ah now I understand 2008-12-22 02:43 so yes, it is inconvenient for the caller to calculate that 2008-12-22 02:44 so, split_at just return error, and split calc it? 2008-12-22 02:44 something like that :) 2008-12-22 02:45 ok :) 2008-12-22 02:45 well I should sleep, that's for helping me with the debugging I already should have done 2008-12-22 02:46 thanks I meant 2008-12-22 02:46 ok, oyasumi 2008-12-22 02:46 in my dreams, the perfect fix for split_at will come to me :) 2008-12-22 02:46 oyasumi 2008-12-22 02:46 good :) 2008-12-22 07:54 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2008-12-22 08:15 -!- MaZe(~MaZe@84.233.213.113) has joined #tux3 2008-12-22 09:21 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-22 11:08 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-22 12:12 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-22 12:22 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-22 13:44 -!- MaZe(~MaZe@abmc162.neoplus.adsl.tpnet.pl) has joined #tux3 2008-12-22 14:39 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-22 14:40 -!- RazvanM_(~RazvanM@128.220.251.228) has joined #tux3 2008-12-22 14:43 -!- flips(~phillips@phunq.net) has joined #tux3 2008-12-22 14:43 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-12-22 14:43 -!- ceatinge(~ceatinge@veryclever.net) has joined #tux3 2008-12-22 14:51 dleaf_split_at does not return split key any more, dleaf_split calculates it 2008-12-22 14:52 now, split_at should take its split point as an unsigned instead of a fragile pointer 2008-12-22 14:54 ok 2008-12-22 14:56 I hope the diff isn't too hard to merge 2008-12-22 14:56 it touched functions you weren't working on 
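A minimal sketch of the division of labour settled on here: split_at takes an unsigned split position and only reports success or failure, while the split wrapper rejects an empty leaf and works out the key for the btree index insert. This is not the tux3 code. The toy_leaf type and both function names are hypothetical stand-ins, and a real dleaf stores groups, entries and extents, not a flat key array.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX 16

/* Hypothetical stand-in for a dleaf: a sorted array of logical keys. */
struct toy_leaf {
	unsigned count;
	unsigned long key[TOY_MAX];
};

/*
 * Move entries [at, count) from "from" into the empty leaf "into".
 * The split point is an unsigned entry number, not a pointer into the leaf.
 * Returns the number of entries moved, or -1 for an invalid request.
 */
static int toy_split_at(struct toy_leaf *from, struct toy_leaf *into, unsigned at)
{
	if (at > from->count || into->count)
		return -1;
	unsigned moved = from->count - at;
	memcpy(into->key, from->key + at, moved * sizeof(from->key[0]));
	into->count = moved;
	from->count = at;
	return (int)moved;
}

/*
 * Split in the middle and report the key to insert in the parent index.
 * An empty leaf has no key to report, so it is rejected before splitting.
 */
static int toy_split(struct toy_leaf *from, struct toy_leaf *into, unsigned long *newkey)
{
	if (!from->count)
		return -1;
	int moved = toy_split_at(from, into, from->count / 2);
	if (moved <= 0)
		return -1;
	*newkey = into->key[0]; /* first key that landed in the new leaf */
	return 0;
}
```

With this shape a caller like fill_segs can call the split_at form directly with its own position, and only the wrapper needs to worry about producing a valid key.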
2008-12-22 14:57 well, I'll do 2008-12-22 15:09 I wonder why I put gdict in struct dwalk and not edict 2008-12-22 15:09 gdict need to know blocksize 2008-12-22 15:09 right, and edict nees to bswap the groups count 2008-12-22 15:10 blocksize is not known at all of course 2008-12-22 15:10 however, dwalk does not use it almost 2008-12-22 15:10 edict 2008-12-22 15:11 I'm writing dwalk_at, that returns the current walk position in terms of entries 2008-12-22 15:11 for split_at 2008-12-22 15:11 anyway, edict changes as groups count changes 2008-12-22 15:12 sounds like strange 2008-12-22 15:12 yes, sounds strange 2008-12-22 15:12 but it is just what split_at wants 2008-12-22 15:12 entry pointer sounds better 2008-12-22 15:12 it's easier to leave it as it is 2008-12-22 15:12 so I will 2008-12-22 15:13 and get back to debugging filemap 2008-12-22 15:13 now, my tests is passing 2008-12-22 15:14 good, I'm going to write some much harder tests in user/filemap.c 2008-12-22 15:14 next thing is to fix the "above" bug 2008-12-22 15:14 good, well, I added some tests 2008-12-22 15:15 I'll pull to get your tests whenever you are ready 2008-12-22 15:15 those are a bit big 2008-12-22 15:16 post to the list? 2008-12-22 15:16 those are including unrelated stuff 2008-12-22 15:16 I think post is not needed 2008-12-22 15:16 ok 2008-12-22 15:16 almost all is cleanup and fix 2008-12-22 15:24 the above bug was fixed by removing a line 2008-12-22 15:26 now that actual above and below overlap handling 2008-12-22 15:27 in overwrite mode, which is the only thing fill_segs implements now, the overlap extents could be passed through unchanged 2008-12-22 15:27 in redirect mode, the extent that overlaps below has to be shortened 2008-12-22 15:28 and a segment that overlaps above has to be shorted and physical block pointer set higher 2008-12-22 15:30 for now, I'll remove "above" code 2008-12-22 15:32 I think current loop is hard to work 2008-12-22 15:32 did you find a bug in it? 
2008-12-22 15:33 no 2008-12-22 15:33 just is confusing? 2008-12-22 15:34 because "above" is not used now 2008-12-22 15:34 it's going to be soon 2008-12-22 15:34 new loop should can know it easily 2008-12-22 15:35 I think 2008-12-22 15:35 sure, anyway I checked in the bug fix for it 2008-12-22 15:36 it has bug? 2008-12-22 15:36 it had bug 2008-12-22 15:36 yes 2008-12-22 15:36 I thought, it had bug, so we commented it out 2008-12-22 15:36 http://hg.tux3.org/tux3/rev/3106b9fb9ffd 2008-12-22 15:37 the bug is fixed 2008-12-22 15:37 I think bug is there 2008-12-22 15:38 test case? 2008-12-22 15:38 iirc, read was wrong 2008-12-22 15:38 write and read back 2008-12-22 15:39 yes, because the overlapping extent will be written too short 2008-12-22 15:39 will be recorded too short 2008-12-22 15:40 just hole case too 2008-12-22 15:40 above should be zero if the region ends with a hole 2008-12-22 15:40 let me test that 2008-12-22 15:41 that works 2008-12-22 15:42 no it doesn't, the hole is not filled 2008-12-22 15:42 segs = get_segs(inode, 4, 5, segvec, 1, 1); show_segs(segvec, segs); 2008-12-22 15:42 segs = get_segs(inode, 2, 7, segvec, 1, 1); show_segs(segvec, segs); 2008-12-22 15:43 there should be a hole of size 1 at index 6 filled 2008-12-22 15:48 sorry, a hole of size 2 at index 5 2008-12-22 15:49 3 segs: 0/-2 2/1 0/-2 <- correct 2008-12-22 15:49 but fill_segs produces the wrong result 2008-12-22 15:49 1 entry groups: 2008-12-22 15:49 0/4: 2 => 3/2; 4 => 2/1; 5 => 5/2; 4 => 2/1; 2008-12-22 15:50 fill_segs: tail... 
2008-12-22 15:50 1 entry groups: 2008-12-22 15:50 0/1: 4 => 2/1; 2008-12-22 15:50 <- this is wrong 2008-12-22 15:51 the tail should be empty 2008-12-22 15:56 it is because the leaf pointer does not give the correct result for split_at when walk is at the end 2008-12-22 15:57 so maybe it is a good idea to pass an unsigned split position to split_at instead of a pointer 2008-12-22 15:58 hirofumi, exstop is supposed to be the end of extents for a single entry, not the end of extents for the entire dleaf 2008-12-22 15:59 yes 2008-12-22 15:59 and it seems to be right in dwalk_next 2008-12-22 16:00 but how does dwalk_end work? 2008-12-22 16:00 exstop == extent is only happened if there is not next group/entry/extent 2008-12-22 16:00 ah fine 2008-12-22 16:01 an empty entry is treated as end of leaf 2008-12-22 16:02 end and first 2008-12-22 16:08 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-22 16:08 this is current state of me 2008-12-22 16:08 reading 2008-12-22 16:10 - if (walk->group == walk->gstop) 2008-12-22 16:10 + if (walk->group == walk->gstop) { 2008-12-22 16:10 + walk->entry--; 2008-12-22 16:10 return 0; 2008-12-22 16:10 + } <- this change to walk_next makes walk->entry have the right value for split_at, at end of leaf 2008-12-22 16:11 3 segs: 3/2 2/1 5/2 <- result for the test case above is then correct 2008-12-22 16:11 1 entry groups: 2008-12-22 16:11 0/3: 2 => 3/2; 4 => 2/1; 5 => 5/2; 2008-12-22 16:12 umm...., it seems abuse of dwalk 2008-12-22 16:13 we could test the end of leaf condition and treat it as a special case 2008-12-22 16:13 it breaks dwalk_back 2008-12-22 16:13 probably 2008-12-22 16:14 that why I pasted it here and didn't check it in 2008-12-22 16:14 what is problem? 
2008-12-22 16:14 I think dwalk can do it 2008-12-22 16:14 walk->entry reliably gives the right position for split_at everywhere except at the end of the leaf 2008-12-22 16:15 caller have to do it 2008-12-22 16:16 I will make it so 2008-12-22 16:16 if not, it breaks multiple extent per entry completely 2008-12-22 16:16 that sounds bad 2008-12-22 16:17 well, that is the reason for writing a walk_at function 2008-12-22 16:18 where is it needed? 2008-12-22 16:18 so that the caller does not need to go mess with pointers 2008-12-22 16:18 for dleaf_split_at 2008-12-22 16:18 caller is fill_segs? 2008-12-22 16:19 yes 2008-12-22 16:19 it already checks dwalk_end(seek[1]) 2008-12-22 16:19 it could just test _end and pass a different pointer to split_at 2008-12-22 16:19 because dwalk_index(seek[1]) is also invalid 2008-12-22 16:20 I thought split at end is wrong 2008-12-22 16:20 correct :) 2008-12-22 16:20 ok, let's avoid the split entirely when the tail will be empty 2008-12-22 16:20 yes 2008-12-22 16:21 first works fine though 2008-12-22 16:21 first?
2008-12-22 16:21 split at first entry 2008-12-22 16:21 yes 2008-12-22 16:21 we may want to just copy to dest 2008-12-22 16:22 as an optimization 2008-12-22 16:22 yes 2008-12-22 16:22 and it is readable 2008-12-22 16:23 please find_segs() loop 2008-12-22 16:23 please see 2008-12-22 16:23 our tailkey should just be "limit" 2008-12-22 16:24 I think, it isn't 2008-12-22 16:24 if next extent has gap 2008-12-22 16:25 it is ok to have the tailkey below the first extent in the leaf 2008-12-22 16:25 ah 2008-12-22 16:25 as long as it is above any extent in the leaf below 2008-12-22 16:26 it will work, however it looks strange 2008-12-22 16:26 it is pretty normal for btree stuff 2008-12-22 16:26 yes 2008-12-22 16:27 normal to look strange 2008-12-22 16:27 but, probe will hit it 2008-12-22 16:27 if it does not hit that leaf, it will hit the leaf below 2008-12-22 16:27 so nothing is saved by placing the key right at the first extent in the leaf 2008-12-22 16:27 bnode key will be "limit" 2008-12-22 16:28 yes 2008-12-22 16:28 that is fine 2008-12-22 16:28 so, read will hit? 2008-12-22 16:28 of course, it works 2008-12-22 16:28 probes always go all the way to the leaf 2008-12-22 16:28 however, it is not needed to check 2008-12-22 16:29 ah 2008-12-22 16:29 actually, it is better to include the gap in the leaf above 2008-12-22 16:29 if somebody writes into the gap, the new extent will go into the new leaf 2008-12-22 16:29 why? 
2008-12-22 16:29 instead of the leaf below, which is full 2008-12-22 16:29 eh 2008-12-22 16:30 I'm talking rubbish ;) 2008-12-22 16:30 when we merge the leaves, the gap will disappear 2008-12-22 16:30 yes 2008-12-22 16:31 so I guess we can use limit 2008-12-22 16:31 instead of tailkey 2008-12-22 16:32 you prepared 19 changesets since yesterday 2008-12-22 16:32 maybe 2008-12-22 16:32 no 2008-12-22 16:32 I use patch 2008-12-22 16:32 so, date will be lost 2008-12-22 16:34 several patches in a few days ago 2008-12-22 16:34 and patch order was changed some times 2008-12-22 16:36 ok, you did the fix to filemap that we just talked about 2008-12-22 16:36 good, I was starting to do it 2008-12-22 16:37 and now we can just remove tailkey 2008-12-22 16:38 if limit is 2008-12-22 16:38 limit == 3, extent 2/5 2008-12-22 16:39 we need tailkey? 2008-12-22 16:39 find_segs will change the extent to 2/1 2008-12-22 16:39 and set above = 4 2008-12-22 16:40 so, tailkey is needed? 2008-12-22 16:41 I don't think so 2008-12-22 16:42 limit is still correct 2008-12-22 16:42 next bnode key is 3? 2008-12-22 16:42 yes 2008-12-22 16:43 we split at 2? 2008-12-22 16:43 only if the tail will not fit back into the current leaf 2008-12-22 16:43 sounds unnatural 2008-12-22 16:43 2/5 is not tail actually 2008-12-22 16:44 why would be split at 2? 2008-12-22 16:44 because next bnode key is 3 2008-12-22 16:44 ah 2008-12-22 16:44 3 is still wrong 2008-12-22 16:44 we split at 3 :) 2008-12-22 16:44 oh? 2008-12-22 16:45 if we split at 2 2008-12-22 16:45 we split at 3 2008-12-22 16:45 first leaf will have key 2~7 2008-12-22 16:45 what does ~ mean? 2008-12-22 16:45 both is index 2008-12-22 16:45 from 2 to 7 2008-12-22 16:46 oh 2008-12-22 16:46 from 2 to 6 2008-12-22 16:46 are you talking about my latest test? 2008-12-22 16:46 no 2008-12-22 16:47 about tailkey 2008-12-22 16:47 let me ask again 2008-12-22 16:48 limit == 3, extent 2/5 (index == 2, count == 5) 2008-12-22 16:48 we split at 2, and set bnode key == 2? 
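The tailkey argument can be checked with a toy index node: a separator key may sit below the first extent of its leaf, as long as it is above every extent in the leaf to its left. The sketch below is an illustration under that assumption, not the tux3 btree; toy_bnode and toy_probe are hypothetical names. The left leaf is taken to hold extents ending before logical 3 and the right leaf to start at logical 5.

```c
#include <assert.h>

/* Hypothetical index node: child i covers keys in [key[i], key[i + 1]). */
struct toy_bnode {
	unsigned count;
	unsigned long key[8];
};

/* Return the number of the child that a probe for "target" descends into. */
static unsigned toy_probe(const struct toy_bnode *node, unsigned long target)
{
	unsigned i = 0;
	while (i + 1 < node->count && node->key[i + 1] <= target)
		i++;
	return i;
}
```

With separator 3 (the "limit" choice) or separator 5 (the first extent in the right leaf), probes for 2 and for 5 reach the same leaves. Only a probe into the gap, such as 4, changes leaf, and it finds a hole either way.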
2008-12-22 16:48 we split at 2, and set bnode key == 3? 2008-12-22 16:48 3 means limit 2008-12-22 16:48 split at 3 and set bnode key == 3 2008-12-22 16:48 extent is 2/5 2008-12-22 16:49 right, find_segs will see that the extent overlaps above and record a seg of length /1 2008-12-22 16:50 leaving the dwalk position above the 2/5 2008-12-22 16:50 so that tail will be null 2008-12-22 16:50 so, we can't use limit as tailkey? 2008-12-22 16:51 we can 2008-12-22 16:51 it is limit + above? 2008-12-22 16:51 or limit? 2008-12-22 16:51 it could be limit + above, but limit will also work 2008-12-22 16:52 well 2008-12-22 16:52 no 2008-12-22 16:52 it has to be limit 2008-12-22 16:52 so the missing piece is the above handling 2008-12-22 16:52 my case is how works? 2008-12-22 16:53 find_segs reduces the size of the extent 2/5 to 2/1, and positions the dwalk _after_ 2/5 2008-12-22 16:53 so the extent 2/5 is now lost 2008-12-22 16:54 lost in seg? 2008-12-22 16:54 lost on seg array? 2008-12-22 16:54 however, dleaf still have it? 2008-12-22 16:55 we have 2/1 in the seg, and 3/4 is now lost, this is where the above handling has to be, to fix it up 2008-12-22 16:55 we rewrite that extent? 2008-12-22 16:56 write, the above handling will dwalk_add an extent 3/4 2008-12-22 16:56 right 2008-12-22 16:56 ok 2008-12-22 16:56 so let me add... 2008-12-22 16:57 this could be done without this complicated way of splitting the extent and re-adding part of it later, in the overwrite case 2008-12-22 16:57 but in the redirect case, in general, we need to split those overlapping extents 2008-12-22 16:57 the redirect case is the one that everybody cares about 2008-12-22 16:58 ok 2008-12-22 16:58 next is we do it now?
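The limit == 3, extent 2/5 case can be worked through as arithmetic. The sketch below is an assumption-laden illustration and not find_segs itself: toy_extent, toy_trim and toy_trim_extent are hypothetical, the extent is written as logical index, physical block and count, and the physical block 10 used in the example is made up. It computes "below" and "above", the seg that is left inside the region, and the remainder that the above handling would have to add back.

```c
#include <assert.h>

typedef unsigned long long block_t;

/* Hypothetical extent record: logical index, physical block, length. */
struct toy_extent {
	block_t index, block;
	unsigned count;
};

struct toy_trim {
	block_t below, above;   /* overlap outside the region at each end */
	struct toy_extent seg;  /* the part inside [start, limit) */
	struct toy_extent rest; /* the part above the region, count 0 if none */
};

/* Trim one extent to the region [start, limit). Assumes they overlap. */
static struct toy_trim toy_trim_extent(struct toy_extent ext, block_t start, block_t limit)
{
	block_t end = ext.index + ext.count;
	struct toy_trim t = { 0 };

	t.below = ext.index < start ? start - ext.index : 0;
	t.above = end > limit ? end - limit : 0;
	t.seg.index = ext.index + t.below;
	t.seg.block = ext.block + t.below;
	t.seg.count = (unsigned)(ext.count - t.below - t.above);
	/* The remainder keeps its data, so its physical start moves up too. */
	t.rest.index = limit;
	t.rest.block = ext.block + (limit - ext.index);
	t.rest.count = (unsigned)t.above;
	return t;
}
```

For the extent at index 2 with count 5 and a region ending at limit 3, this gives above = 4, a seg of count 1, and a remainder at index 3 with count 4, which matches the 2/1 and 3/4 in the discussion.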
2008-12-22 16:58 yes, and get that messing little problem done 2008-12-22 16:58 messy 2008-12-22 16:58 ok 2008-12-22 16:58 I deferred it so long, because I knew it would be a mess 2008-12-22 16:59 if so, I'd like to test well on current state 2008-12-22 16:59 and now, we have cleaned up a lot of things and have some a nice api, so it is not so bad 2008-12-22 16:59 ok, reading more patches 2008-12-22 17:00 at "replace leaf_chop" 2008-12-22 17:00 and they are fine so far 2008-12-22 17:00 yes 2008-12-22 17:00 I was my first goal :) 2008-12-22 17:00 it was 2008-12-22 17:01 some of the code that you commented "userland only" will be used in btree leaf_chop 2008-12-22 17:01 if so, just remove comment 2008-12-22 17:01 with real change 2008-12-22 17:02 * Reasons this dleaf truncator sucks <- it is nice to see this comment disappear 2008-12-22 17:02 ok 2008-12-22 17:03 it is already invalid? 2008-12-22 17:03 no, the comment is disappearing because you made it not suck 2008-12-22 17:05 um.., I just moved that comment to dleaf_chop2 2008-12-22 17:05 then, I renamed dleaf_chop2 to dleaf_chop 2008-12-22 17:05 :) 2008-12-22 17:07 ok, the only one I have my doubts about is the find_segs-cleanup patch 2008-12-22 17:07 ok 2008-12-22 17:07 good 2008-12-22 17:07 well, I believe it is human friendly way 2008-12-22 17:08 I will pull it and worry about it later 2008-12-22 17:08 it needs the above/below stuff 2008-12-22 17:08 wait a bit 2008-12-22 17:09 some patch doesn't have comment yet 2008-12-22 17:09 I'll added it 2008-12-22 17:09 ok 2008-12-22 17:09 I'll add it 2008-12-22 17:09 for example, filemap.c~fill_segs-fix2 2008-12-22 17:09 yes 2008-12-22 17:09 I noticed, but I thought it looked kind of cool :) 2008-12-22 17:10 :) I'm thinking below/above may not be needed 2008-12-22 17:10 let's talk about it after I pull 2008-12-22 17:10 ok 2008-12-22 17:12 I see I forgot a rollback for "Fix off by one bug in dleaf_split_at" 2008-12-22 17:13 oh, I thought it is intent 2008-12-22 17:13 
because "." was removed 2008-12-22 17:14 I broke the kernel compile by using dwalk_dump without a declaration in tux3.h, found out after the local commit, fixed it, forgot to rollback before comitting again 2008-12-22 17:17 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-22 17:17 I added simple comment 2008-12-22 17:17 and removed dwalk_delete for now 2008-12-22 17:19 pulling 2008-12-22 17:41 hirofumi, there? 2008-12-22 17:41 yes 2008-12-22 17:41 that is the second time I have done that... pulled when I have local changes 2008-12-22 17:41 oh 2008-12-22 17:41 which turns an easy pull into a big mess when I do hg update 2008-12-22 17:42 I fixed it up, but there must be an easier way 2008-12-22 17:42 the conflict is noted at hg update time, and conflict markers are put in my files 2008-12-22 17:42 i see 2008-12-22 17:42 what I want at that point, is the whole pull to go away, so I can do hg diff and find out what my local changes were 2008-12-22 17:43 rollback rolls back the repository, but leaves the local files in the state left by the failed update 2008-12-22 17:44 I guess we can remove in progress merge job 2008-12-22 17:44 I think I want to be told before my update that I have local changes, not after 2008-12-22 17:45 anyway 2008-12-22 17:45 pushed to public 2008-12-22 17:45 now about above/below 2008-12-22 17:45 hg is wrong, or we can add config 2008-12-22 17:46 you mean, the way hg does update is wrong? 2008-12-22 17:46 yes 2008-12-22 17:46 it should tell working dir has local change 2008-12-22 17:47 it can make a big mess, quickly 2008-12-22 17:47 yes 2008-12-22 17:48 ok, above/below... 
this is needed mainly for redirect on write 2008-12-22 17:48 and it will lost right commit 2008-12-22 17:48 yes 2008-12-22 17:48 it seems to have the right commit now, because I did not have to do a merge 2008-12-22 17:49 good 2008-12-22 17:49 if we need merge work, we will lost pure local change 2008-12-22 17:49 yes, that is the annoying thing 2008-12-22 17:49 it's very easy to forget 2008-12-22 17:50 yes 2008-12-22 17:50 well, above/below 2008-12-22 17:50 I imaged, find_segs may just tell current extent to caller 2008-12-22 17:51 then, caller does below/above job 2008-12-22 17:52 this is my mean of avoid below/above 2008-12-22 17:52 find_segs also positions the walk cursors 2008-12-22 17:52 yes 2008-12-22 17:52 walk has index 2008-12-22 17:53 and asked range is contiguous 2008-12-22 17:53 well, for !create, it is right for find_segs to trim the overlap away 2008-12-22 17:54 yes 2008-12-22 17:54 if !create, caller will do in my case (i.e. get_segs) 2008-12-22 17:55 ok, well a small point is, it is more work for the caller 2008-12-22 17:55 yes 2008-12-22 17:56 in the overwrite case, the overlapping extents do not have to be altered, but in the redirect case they do 2008-12-22 17:56 however, I guess we will have all info for create 2008-12-22 17:57 yes 2008-12-22 17:57 so the algorithm is designed to handle the redirect case, which is tricky, and to work correctly for the overwrite case 2008-12-22 17:58 for redirect, fill_segs will need to shorten the extent that overlaps below 2008-12-22 17:59 read and write are different 2008-12-22 17:59 read is not needed that info at all 2008-12-22 17:59 write is need to redirect it? 
2008-12-22 17:59 yes, this is about right 2008-12-22 17:59 write needs 2008-12-22 17:59 about write 2008-12-22 18:00 so, find_segs tells all extent in range 2008-12-22 18:00 yes, that is the idea 2008-12-22 18:00 and when versioning arrives, it will be for only one version 2008-12-22 18:00 yes 2008-12-22 18:01 I thought find_segs just tell all info 2008-12-22 18:01 without shirink 2008-12-22 18:01 shrink? 2008-12-22 18:01 um.., remove first and end 2008-12-22 18:01 without leaving any out? 2008-12-22 18:01 oh 2008-12-22 18:02 and also tell the caller about above/below? 2008-12-22 18:02 no 2008-12-22 18:03 in read case, get_segs will remove first and end 2008-12-22 18:03 in write case, just pass it to fill_segs 2008-12-22 18:03 it is just my guess though 2008-12-22 18:03 and that way, fill_segs can avoid having to reconstruct the overlapping extents 2008-12-22 18:04 but fill_segs is the easy case 2008-12-22 18:04 or rather 2008-12-22 18:04 overwrite is the easy case 2008-12-22 18:04 for redirect, we can't avoid changing the overlapping extents 2008-12-22 18:04 yes 2008-12-22 18:05 so, info has all info 2008-12-22 18:05 well, current also has all info 2008-12-22 18:06 ok, it's a good argument 2008-12-22 18:07 what I was thinking is 2008-12-22 18:07 I thought fill_segs() will loop all extents related to range 2008-12-22 18:08 and if there is out of range, it will do something 2008-12-22 18:08 logical index is not recorded per seg 2008-12-22 18:09 yes 2008-12-22 18:09 seek[] has info 2008-12-22 18:09 and range is contiguous 2008-12-22 18:09 I guess 2008-12-22 18:10 and it can know the overlap above, after it has looped over the entire range 2008-12-22 18:10 yes 2008-12-22 18:11 well, however this idea can be just wrong 2008-12-22 18:12 we could each write our own solution and choose the best 2008-12-22 18:12 yes 2008-12-22 18:13 let's talk a little bit more about the redirect case 2008-12-22 18:13 yes 2008-12-22 18:13 we will normally replace all the extents in the 
range with a single new extent, sometimes more than one if the region is large or disk is fragmented 2008-12-22 18:14 yes 2008-12-22 18:14 and we have to change the overlapping extents 2008-12-22 18:14 yes 2008-12-22 18:14 wait a bit 2008-12-22 18:15 new one can be many extents than existed extents? 2008-12-22 18:15 yes, for example there can be zero extents there and we add one extent 2008-12-22 18:15 sure 2008-12-22 18:16 ok 2008-12-22 18:16 or there can be one extent, and fragmentation forces it to be replaced by two 2008-12-22 18:16 yes 2008-12-22 18:17 or the region can be big 2008-12-22 18:17 and one big extent will replace by 3 extent if overwrite it at middle of big extent 2008-12-22 18:17 will be replaced 2008-12-22 18:18 right, that is a case where the same extent overlaps above and below 2008-12-22 18:18 so, there is many/little/same extents 2008-12-22 18:18 right, a lot of variations 2008-12-22 18:18 ok 2008-12-22 18:19 the case where one extent overlaps the IO region at both ends is a good one 2008-12-22 18:20 yes 2008-12-22 18:20 the current approach, with the walk positions above and below, and the above/below variables handles that 2008-12-22 18:20 in that case, we can just update extent 2008-12-22 18:20 except for the missing code to generate the altered extents 2008-12-22 18:21 we could 2008-12-22 18:21 hey flips 2008-12-22 18:21 so that is the main possible variant in approaching this I think, do we: a) change extents in place or b) reconstruct them 2008-12-22 18:22 c) change extent in place, and delete unneeded extents 2008-12-22 18:22 replace many extents by one extent 2008-12-22 18:23 replaced extents need processing also, the determine which blocks become free 2008-12-22 18:24 yes, it is ture in all case? 
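The "one big extent replaced by 3 extents" case can be sketched as follows, assuming a redirect of a region that lies wholly inside a single old extent. Everything here is hypothetical illustration (toy_extent, toy_redirect, the block numbers), not fill_segs. It also reports the old physical range that was replaced, which is only a candidate for freeing, since whether it really becomes free depends on the delta and snapshot rules discussed in the channel.

```c
#include <assert.h>

typedef unsigned long long block_t;

/* Hypothetical extent record: logical index, physical block, length. */
struct toy_extent {
	block_t index, block;
	unsigned count;
};

/*
 * Redirect the region [start, limit) of one old extent to newblock.
 * Writes up to three extents to out[] and the replaced physical range
 * to *replaced. Returns the number of extents written.
 * Assumes the region lies inside the old extent.
 */
static int toy_redirect(struct toy_extent old, block_t start, block_t limit,
	block_t newblock, struct toy_extent out[3], struct toy_extent *replaced)
{
	int n = 0;
	block_t end = old.index + old.count;

	if (old.index < start) /* piece that overlaps below, shortened */
		out[n++] = (struct toy_extent){ old.index, old.block, (unsigned)(start - old.index) };
	out[n++] = (struct toy_extent){ start, newblock, (unsigned)(limit - start) };
	if (limit < end) /* piece that overlaps above, physical start moved up */
		out[n++] = (struct toy_extent){ limit, old.block + (limit - old.index), (unsigned)(end - limit) };
	*replaced = (struct toy_extent){ start, old.block + (start - old.index), (unsigned)(limit - start) };
	return n;
}
```

When the region covers the whole old extent the function returns a single extent, which is the "replace many extents by one" direction of the same operation.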
2008-12-22 18:25 in the all replaced case 2008-12-22 18:25 only for redirect, and only for blocks not written twice in the same delta 2008-12-22 18:25 yes 2008-12-22 18:29 with snapshots, if we support a data=ordered kind of mode, then we may overwrite or redirect depending on whether the blocks belong to the current snapshot or not 2008-12-22 18:30 i see 2008-12-22 18:30 yes 2008-12-22 18:30 write twice in the same delta... when we change away from doing the allocations inside ->get_block, then we will never write the same block twice in the same delta I think 2008-12-22 18:30 right now we will 2008-12-22 18:32 well I have done a lot more talking than coding today 2008-12-22 18:32 thanks 2008-12-22 18:35 I wonder if my backwards running extent generating loop works now 2008-12-22 18:36 it is filemap.c:137? 2008-12-22 18:37 yes, I will run it a few more iterations 2008-12-22 18:38 those tests look much nicer than my hacks 2008-12-22 18:39 well, it needs to improve more though 2008-12-22 18:39 it is assuming get_segs(create) returns right seg 2008-12-22 18:40 0/10: 0 => b/1; 2 => a/1; 4 => 9/1; 6 => 8/1; 8 => 7/1; a => 6/1; c => 5/1; e => 4/1; 10 => 3/1; 12 => 2/1; 2008-12-22 18:40 dwalk_probe: probe for 0x2 2008-12-22 18:40 find_segs: emit a/1 2008-12-22 18:40 dwalk_next: 2008-12-22 18:40 find_segs: 2008-12-22 18:40 main: Failed assertion "segs[j].block == seg.block"! 2008-12-22 18:40 it got pretty far 2008-12-22 18:41 what loop? 2008-12-22 18:41 for (int i = 10, j = 0; i--; j++) { 2008-12-22 18:41 ok 2008-12-22 18:42 I just broke your test I think 2008-12-22 18:42 it works fine when I run both loops to 10 2008-12-22 18:42 from 10 2008-12-22 18:42 yes 2008-12-22 18:43 however, we will need to check the result for now 2008-12-22 18:44 test wouldn't check all case 2008-12-22 18:45 overlapping case is handled for overwrite now? 2008-12-22 18:45 or not? 
2008-12-22 18:45 I think it will work 2008-12-22 18:46 so now we need replace 2008-12-22 18:46 ah 2008-12-22 18:46 but it will return wrong seg[] 2008-12-22 18:46 right 2008-12-22 18:46 and that is what we have been talking about 2008-12-22 18:47 yes 2008-12-22 18:47 ok, anyway, let's re-add above for now 2008-12-22 18:47 probably best in order to progress ;) 2008-12-22 18:48 I am ok with not having the most perfect algorithm in the world just now 2008-12-22 18:48 after find_segs() loop, index is next index of extent 2008-12-22 18:49 so, limit - index would be above 2008-12-22 18:49 right, I had a dwalk_back in there 2008-12-22 18:50 back is needed? 2008-12-22 18:50 so when it hits the first extent above the region, it goes back, unless the extent overlaps 2008-12-22 18:50 let me think 2008-12-22 18:50 no 2008-12-22 18:50 maybe not 2008-12-22 18:50 it will be positioned on the first extent above the region, which is what we want 2008-12-22 18:51 I think back is bad sign of bad code 2008-12-22 18:51 I think back is sign of bad code 2008-12-22 18:51 could be 2008-12-22 18:52 yes, if last extent was including limit, seek[1] points next extent of it 2008-12-22 18:55 ok, overlap-below detection has to be re-added to your find_segs 2008-12-22 18:56 your code is nice 2008-12-22 18:56 thanks 2008-12-22 18:56 btw, index - limit? 
2008-12-22 18:56 the above code was wrong 2008-12-22 18:58 right, that is where the above check is needed 2008-12-22 18:58 @@ -77,12 +77,13 @@ static int find_segs(struct cursor *curs 2008-12-22 18:58 seek[1] = *walk; 2008-12-22 18:58 if (segs) { 2008-12-22 18:58 block_t below = start - seg_start; 2008-12-22 18:58 + block_t above = index - limit; 2008-12-22 18:58 seg[0].block += below; 2008-12-22 18:58 seg[0].count -= below; 2008-12-22 18:58 -// seg[segs - 1].count -= above; 2008-12-22 18:59 + seg[segs - 1].count -= above; 2008-12-22 18:59 if (overlap) { 2008-12-22 18:59 overlap[0] = below; 2008-12-22 18:59 -// overlap[1] = above; 2008-12-22 18:59 + overlap[1] = above; 2008-12-22 18:59 } 2008-12-22 18:59 } 2008-12-22 18:59 return segs; 2008-12-22 18:59 this? 2008-12-22 19:00 seems reasonable 2008-12-22 19:02 one issue 2008-12-22 19:03 dwalk_probe now seeks to exactly start, but it needs to seek to below start 2008-12-22 19:03 why do we need it? 2008-12-22 19:04 in case an extent overlaps from below 2008-12-22 19:04 if extent is including start, dwalk_probe will hit it 2008-12-22 19:05 and it will check by 2008-12-22 19:05 ok, good for now 2008-12-22 19:05 if (!dwalk_end(walk) && dwalk_index(walk) < start) 2008-12-22 19:05 seg_start = dwalk_index(walk); 2008-12-22 19:07 this gets more complicated with versioning 2008-12-22 19:08 because we can have a bunch of extents starting at any entry, with different lengths 2008-12-22 19:09 dwalk_probe will hit only one version? 
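[Editor's sketch] The diff quoted above trims the gathered segment vector at both ends so that it covers only the requested range. A minimal stand-alone model of that arithmetic follows. The `struct seg` layout, the `trim_segs` name and the unsigned 64-bit `block_t` are assumptions made for illustration; this is not the actual `find_segs` code.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t block_t;   /* assumed: an unsigned 64-bit block number */

struct seg { block_t block; unsigned count; };   /* hypothetical layout */

/*
 * The vector seg[0..segs) covers logical blocks [seg_start, index).
 * Trim it so that it covers only [start, limit).
 * Assumes segs > 0, seg_start <= start and limit <= index.
 * overlap[0] and overlap[1] receive the amounts trimmed below and above.
 */
static void trim_segs(struct seg *seg, int segs, block_t seg_start, block_t index,
                      block_t start, block_t limit, block_t overlap[2])
{
    block_t below = start - seg_start;   /* blocks before the region */
    block_t above = index - limit;       /* blocks past the region */

    seg[0].block += below;               /* skip the leading overlap */
    seg[0].count -= below;
    seg[segs - 1].count -= above;        /* drop the trailing overlap */
    overlap[0] = below;
    overlap[1] = above;
}
```

With a single segment both adjustments land on the same entry, which is the case where one extent overlaps the region at both ends.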
2008-12-22 19:09 extents can be shared between versions 2008-12-22 19:09 i see 2008-12-22 19:10 so it has to find the entry that is at or below all extents that overlap the start index 2008-12-22 19:10 but that is for later 2008-12-22 19:10 just mentioning it now 2008-12-22 19:11 for now, we can rely on the simplification we get from knowing that extents can never overlap 2008-12-22 19:11 yes 2008-12-22 19:12 I will commit your diff above 2008-12-22 19:12 ok 2008-12-22 19:19 starting to think about ENOSPC handling? 2008-12-22 19:20 I think not need right now 2008-12-22 19:20 well, I'll test much, and cleanup, and think about error handling 2008-12-22 19:25 ok, now sparse work 2008-12-22 19:26 and now to make the overlap work in fill_segs, then make a redirect version of fill_segs to show why we went to all that trouble 2008-12-22 19:31 if it has below/above, we can just update it? 2008-12-22 19:32 btw, adjust before fill_segs 2008-12-22 19:48 first thing is to write a simple example that doesn't work 2008-12-22 19:49 n = get_segs(inode, 2, 7, segs, 10, 1); show_segs(segs, n); 2008-12-22 19:49 n = get_segs(inode, 4, 5, segs, 10, 1); show_segs(segs, n); 2008-12-22 19:50 yes 2008-12-22 19:50 0/1: 4 => 4/1; <- loses the 2 => 2/5 2008-12-22 20:14 I'll sleep, oyasumi 2008-12-22 20:17 oyasumi 2008-12-22 20:58 1 entry groups: 2008-12-22 20:58 0/3: 2 => 2/2; 4 => 4/1; 5 => 3/2; <- correct result 2008-12-22 20:58 code not realy pretty 2008-12-22 21:55 why tailkey is needed, because we only know it from seek[1] 2008-12-22 21:55 range is 2/10, and tail is 30/5 2008-12-22 21:56 sleep really 2008-12-22 22:37 -!- MaZe(~MaZe@aahl212.neoplus.adsl.tpnet.pl) has joined #tux3 2008-12-22 22:54 -!- flips(~phillips@phunq.net) has joined #tux3 2008-12-22 22:54 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2008-12-22 22:54 -!- ceatinge(~ceatinge@veryclever.net) has joined #tux3 2008-12-22 22:54 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 
2008-12-23 00:58 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-12-23 00:59 -!- pgquiles(~pgquiles@91.Red-88-18-198.staticIP.rima-tde.net) has joined #tux3 2008-12-23 01:11 -!- RazvanM(~RazvanM@96.234.233.172) has joined #tux3 2008-12-23 02:28 0/3: 2 => 2/2; 4 => 4/1; 5 => 5/2; <- actual correct result this time 2008-12-23 04:00 -!- mlankhorst(~m@fw1.astro.rug.nl) has joined #tux3 2008-12-23 04:22 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-12-23 05:03 ah, sorry, I asked same question for tailkey 2008-12-23 07:27 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-23 08:00 I found the problem of fill_segs() 2008-12-23 08:00 if seek[0] == seek[1], it doesn't work 2008-12-23 08:01 because it splits leaf at seek[1], so seek[0] is also invalid 2008-12-23 08:01 so, how do we fix? 2008-12-23 08:02 I thought dleaf_split() should be build based on dwalk like dleaf_chop() 2008-12-23 08:02 e.g. dleaf_split() will use dwalk_copy() and dwalk_chop() 2008-12-23 08:03 dwalk_copy() copy the extents to another leaf 2008-12-23 08:03 dwalk_chop() chops the extents 2008-12-23 08:04 then, fill_segs() will use dwalk_copy() and dwalk_chop() directly 2008-12-23 08:04 it sounds simple 2008-12-23 09:26 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-23 09:57 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-23 10:22 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2008-12-23 12:18 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-23 12:21 hirofumi, good morning 2008-12-23 12:22 I like your plan for splitting up dleaf split 2008-12-23 12:23 ACTION thinks about the bug 2008-12-23 12:24 so in that case, segs[] is just a single hole 2008-12-23 12:25 demo case? 
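[Editor's sketch] The corrected result quoted above (`2 => 2/2; 4 => 4/1; 5 => 5/2`) is what overwriting the middle of one extent should produce: a head, the new extent, and a tail whose physical start is advanced by the same offset as its logical start. The following is a rough model of that split. The `struct extent` layout and the `overwrite_extent` name are hypothetical, and the new physical block is simply passed in rather than allocated.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t block_t;   /* assumed: an unsigned 64-bit block number */

/* One mapped run: `count` logical blocks starting at `logical` live at `block`. */
struct extent { block_t logical, block; unsigned count; };

/*
 * Overwrite [start, start + count) with a new run at physical `newblock`.
 * Assumes the range lies entirely inside `old`.
 * Writes one to three extents into out[] and returns how many.
 */
static int overwrite_extent(struct extent old, block_t start, unsigned count,
                            block_t newblock, struct extent out[3])
{
    int n = 0;
    block_t below = start - old.logical;        /* part of old before the range */
    block_t end = start + count;
    block_t old_end = old.logical + old.count;

    if (below)
        out[n++] = (struct extent){ old.logical, old.block, (unsigned)below };
    out[n++] = (struct extent){ start, newblock, count };
    if (end < old_end)                          /* part of old after the range */
        out[n++] = (struct extent){ end, old.block + (end - old.logical),
                                    (unsigned)(old_end - end) };
    return n;
}
```

Feeding it the extent `2 => 2/5` and an overwrite of logical block 4 with physical block 4 gives the three extents from the log. The earlier "correct result" with `5 => 3/2` corresponds to a tail whose physical block was not advanced correctly.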
2008-12-23 12:56 ret = get_segs(inode, 0x1100000, 0x1100000+5, segs, 10, 1); 2008-12-23 12:56 ret = get_segs(inode, 0x800000, 0x800000+5, segs, 10, 1); 2008-12-23 12:56 ret = get_segs(inode, 0x800005, 0x800005+5, segs, 10, 1); 2008-12-23 12:56 maybe this 2008-12-23 12:56 ok 2008-12-23 12:57 I was just checking in a patch to make our unit tests in filemap.c more similar 2008-12-23 12:57 and will add that test 2008-12-23 12:58 one small thing I should do, I think: change from start, limit to start, count in the get_segs interface 2008-12-23 12:58 keep using start, limit inside get_segs 2008-12-23 12:58 but the external interface is better with count 2008-12-23 12:59 yes, it has contistents 2008-12-23 12:59 so, right after doing something about the seek issue 2008-12-23 13:00 btw, can we have the config for hg web? 2008-12-23 13:00 [diff] 2008-12-23 13:00 showfunc = True 2008-12-23 13:00 ok, just a sec 2008-12-23 13:00 I don't know this affect to web too 2008-12-23 13:04 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-23 13:05 it will be good though 2008-12-23 13:08 it does affect the web diffs, good 2008-12-23 13:09 oh, good 2008-12-23 13:11 I respelled segs as segvec in user/filemap.c, probably affecting lots of your tests, sorry 2008-12-23 13:12 actually it is no problem 2008-12-23 13:12 dwalk_check: Failed assertion "walk->exbase <= walk->extent"! <- ok, reproduced 2008-12-23 13:12 yes 2008-12-23 13:13 it try to do dwalk_chop on splited leaf 2008-12-23 13:13 dwalk_check catched it luckly 2008-12-23 13:14 that was konrad's contribution? 2008-12-23 13:14 dwalk_check? 2008-12-23 13:15 I think he put in the part to check ordering 2008-12-23 13:15 dwalk_check is from me 2008-12-23 13:15 oh no 2008-12-23 13:15 you did 2008-12-23 13:15 :) 2008-12-23 13:15 it checks dwalk state only 2008-12-23 13:15 right 2008-12-23 13:16 is it important that the entry groups be separate for this test? 
2008-12-23 13:16 ACTION checks 2008-12-23 13:18 ah 2008-12-23 13:19 ok, does not seem to matter 2008-12-23 13:19 dleaf_split_at() will change seek[0] position 2008-12-23 13:19 yes 2008-12-23 13:19 so, dwalk_chop confuses 2008-12-23 13:19 dwalk_chop was failed 2008-12-23 13:22 segs = get_segs(inode, 0x2, 0x3, segvec, 10, 1); 2008-12-23 13:22 segs = get_segs(inode, 0x6, 0x7, segvec, 10, 1); 2008-12-23 13:22 segs = get_segs(inode, 0x4, 0x5, segvec, 10, 1); <- simplified test 2008-12-23 13:24 probably 2008-12-23 13:28 it has a bit different 2008-12-23 13:29 dd if=/dev/zero bs=$((128*4096)) count=1 | ../tux3/user/tux3 --seek=$((1024*1024*17 * 4096)) write ./tux3.img test.txt > /dev/null 2008-12-23 13:30 dd if=/dev/zero bs=$((128*4096)) count=1 | ../tux3/user/tux3 --seek=$((1024*1024*8 * 4096)) write ./tux3.img test.txt > /dev/null 2008-12-23 13:30 actuall test was this 2008-12-23 13:30 in this case, dwalk_chop() try to use dwalk_back() 2008-12-23 13:30 it may not be important for this bug though 2008-12-23 13:31 my simpler test sill hits the bug 2008-12-23 13:31 yes 2008-12-23 13:32 it also uses truncated fields on dleaf 2008-12-23 13:32 my test abort on dwalk_back() 2008-12-23 13:32 simpler aborts on dwalk_chop() 2008-12-23 13:33 ah 2008-12-23 13:33 both will be fixed at same time though 2008-12-23 13:33 next step for me is, read the tracing output and add more 2008-12-23 13:34 try to catch up with yhou 2008-12-23 13:40 ok, so I will just say what you already told me 2008-12-23 13:40 ok 2008-12-23 13:40 the split_at changes the leaf dict, so seek[0] is not valid 2008-12-23 13:41 yes 2008-12-23 13:42 so, maybe it is wrong to use all those pointers as a seek position 2008-12-23 13:42 and we should use unsigned entry position 2008-12-23 13:43 the struct dwalk is still very useful 2008-12-23 13:43 but across leaf edit operations, we need a more stable cursor 2008-12-23 13:43 it is what is for? 
2008-12-23 13:44 the struct dwalk was to optimize repeated serial operations 2008-12-23 13:44 it is good at that 2008-12-23 13:44 yes 2008-12-23 13:44 trying to use it as a cursor also is not entirely successful 2008-12-23 13:45 if operations is dwalk aware, I thought it can 2008-12-23 13:45 so, dwalk_copy and dwalk_chop 2008-12-23 13:45 you mean, fix split to update the walk? 2008-12-23 13:45 we don't need split 2008-12-23 13:45 probably true 2008-12-23 13:46 dwalk_copy and dwalk_chop replaces it 2008-12-23 13:46 dwalk_copy just copy, so dwalk state is still valid 2008-12-23 13:46 dwalk_chop is also valid after it 2008-12-23 13:47 have you written dwalk_copy? 2008-12-23 13:47 I'm still writing it now 2008-12-23 13:47 I think half was done 2008-12-23 13:47 it is a good approach 2008-12-23 13:48 if seek breaks again later, then we can replace it with an unsigned pos 2008-12-23 13:49 maybe I'm not much fan of unsigned pos 2008-12-23 13:49 because it can point wrong pos silently 2008-12-23 13:50 dwalk stuff can check it by some way 2008-12-23 13:50 if we need it, we should do it though 2008-12-23 14:54 flips, there? 2008-12-23 14:54 yes 2008-12-23 14:54 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-23 14:55 I added dwalk_copy 2008-12-23 14:55 it is not used at all now though 2008-12-23 14:55 please check it 2008-12-23 14:56 reading 2008-12-23 14:58 it looks good 2008-12-23 14:58 thanks 2008-12-23 15:00 so, split as copy + chop is a clean design, and efficiency will be about the same 2008-12-23 15:00 yes, hopefully 2008-12-23 15:01 are you going to try it in filemap now? 
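[Editor's sketch] The "split as copy + chop" idea can be shown on a flat array of entries: copying the tail leaves the source untouched, so any position held in the source stays valid until the chop, and the chop only shortens the source. This is a simplified model, assuming a fixed-size table. The real dleaf has entry groups and a dictionary, and the real `dwalk_copy` and `dwalk_chop` signatures are not shown in the log, so every name below is hypothetical.

```c
#include <assert.h>
#include <string.h>

#define LEAF_MAX 16

struct entry { unsigned long key, block; unsigned count; };
struct leaf { unsigned used; struct entry table[LEAF_MAX]; };

/* Copy entries [pos, used) of src to the end of dst; src is not modified. */
static int leaf_copy_tail(const struct leaf *src, unsigned pos, struct leaf *dst)
{
    if (pos > src->used)
        return -1;
    unsigned n = src->used - pos;
    if (dst->used + n > LEAF_MAX)
        return -1;
    memcpy(dst->table + dst->used, src->table + pos, n * sizeof(struct entry));
    dst->used += n;
    return 0;
}

/* Drop entries [pos, used) from the leaf; entries below pos do not move. */
static void leaf_chop(struct leaf *leaf, unsigned pos)
{
    if (pos < leaf->used)
        leaf->used = pos;
}

/* Split = copy the tail out, then chop it off the source. */
static int leaf_split(struct leaf *from, unsigned pos, struct leaf *into)
{
    if (leaf_copy_tail(from, pos, into))
        return -1;
    leaf_chop(from, pos);
    return 0;
}
```

Because neither step moves the entries below `pos`, a cursor pointing at or below the split point is not invalidated, which is the property the old split lacked.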
2008-12-23 15:01 yes 2008-12-23 15:01 ok, I will get started on atomic commit 2008-12-23 15:02 well, maybe, it just replace split_at in fill_segs though 2008-12-23 15:02 good 2008-12-23 15:02 I think filemap has become flexible and fairly robust with the last few days work 2008-12-23 15:02 yes 2008-12-23 15:02 I think it was improved much 2008-12-23 15:03 tux3_get_block() doesn't work now though 2008-12-23 15:03 that's not so good 2008-12-23 15:03 what broke? 2008-12-23 15:03 it doesn't know block is new or not 2008-12-23 15:03 ok, that is not hard to fix 2008-12-23 15:03 yes 2008-12-23 15:04 essentially, our negative count is a flag bit 2008-12-23 15:04 we keep the negative counts in fill_segs instead of turning them positive 2008-12-23 15:04 oh 2008-12-23 15:04 ah 2008-12-23 15:05 maybe we should actually make it a flag bit instead of negative count 2008-12-23 15:05 so, block!=0 and -count is newly allocated blocks? 2008-12-23 15:05 yes 2008-12-23 15:05 well 2008-12-23 15:05 not so clean :) 2008-12-23 15:05 it is very temporary? 2008-12-23 15:05 fine 2008-12-23 16:24 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-23 16:25 I've found my stupid bug in dwalk_chop 2008-12-23 16:25 and used dwalk_copy for fill_segs 2008-12-23 16:25 please pull it 2008-12-23 16:26 I'm starting to think I have to test dwalk stuff more 2008-12-23 16:27 so, I'll test those with fsx-linux 2008-12-23 16:36 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-23 17:02 pulling 2008-12-23 17:12 http://www.chineselinuxuniversity.net/courses/kernel/articles/20744.shtml <- interesting 2008-12-23 17:12 some time I should update that post 2008-12-23 17:12 actual structure of tux3 is simpler 2008-12-23 17:14 what is chinese linux university? 
2008-12-23 17:20 maybe collecter of linux info 2008-12-23 17:49 it seems to pass fsx-linux test basically 2008-12-23 17:50 yes, interesting 2008-12-23 17:51 however, some time fail 2008-12-23 17:51 it may race condition 2008-12-23 17:51 ah 2008-12-23 17:51 we don't have any locking 2008-12-23 17:51 that would do it 2008-12-23 17:51 yes, well, detail is unknown for now 2008-12-23 17:52 try with uniprocessor config? 2008-12-23 17:52 yes 2008-12-23 17:52 already tried? 2008-12-23 17:52 uniprocessor and one process 2008-12-23 17:53 yes, I'm trying it now 2008-12-23 17:56 writing down the notes on atomic commit that I had from my skate with timothy 2008-12-23 17:56 skating works very well for organizing thoughts 2008-12-23 17:57 good 2008-12-23 18:14 at least, it seems there is bug related to truncate() 2008-12-23 18:48 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2008-12-23 19:07 -!- Man_of_Wax(~wax@gualtiero.cs.unibo.it) has joined #tux3 2008-12-23 19:33 in current kernel code, all we do is mark_buffer_dirty for metadata and sync_blockdev takes care of writing them out 2008-12-23 19:33 we don't mark_buffer_dirty_inode, so vfs won't help us with fsync 2008-12-23 19:33 mark_buffer_dirty_inode is not important 2008-12-23 19:33 and there is not much point in trying to add that functionality 2008-12-23 19:34 it insert buffer cache to inode to know inode related buffer cache 2008-12-23 19:34 it is not inserted, bdev_inode will write it 2008-12-23 19:35 right, anyway there is no way to extend that mechanism to a proper delta commit 2008-12-23 19:36 so I'm thinking about where to start with kernel code 2008-12-23 19:36 one possibility is to try managing our own buffer cache as a separate address_space 2008-12-23 19:37 I don't think bdev actually does anything useful for us, except give us an early filesystem demo 2008-12-23 19:37 ah 2008-12-23 19:38 we have to manage buffer head 2008-12-23 19:38 in ext3 case, it added jbd header 2008-12-23 19:38 keep our own lists 
2008-12-23 19:39 to each buffer? 2008-12-23 19:39 yes 2008-12-23 19:39 jbh 2008-12-23 19:39 buffer_head already seems to have enough fields for us 2008-12-23 19:39 so, we have to think our way 2008-12-23 19:40 if it has dirty bit, bdev will write it 2008-12-23 19:40 so, I guess jbh has own dirty bit 2008-12-23 19:40 yes, they disable the bdev writing somehow 2008-12-23 19:40 it's worth a read 2008-12-23 19:41 ok, I just thought I would mention I'm thinking about that now, and I will continue with my design note on exactly what we need to write, when 2008-12-23 19:42 mention it right now? 2008-12-23 19:42 yes 2008-12-23 19:42 I'd like to continue bug fix for now 2008-12-23 19:42 good 2008-12-23 19:42 getting close 2008-12-23 19:43 I'll continue with my design note 2008-12-23 19:43 it really strange 2008-12-23 19:43 ok 2008-12-23 19:43 ah 2008-12-23 19:43 tell me about it? 2008-12-23 19:44 it seems to write corrent dleaf 2008-12-23 19:44 however, when it read back that dleaf, dleaf seems not correct 2008-12-23 19:45 sounds strange all right 2008-12-23 19:46 cache thing? 2008-12-23 19:46 ah, maybe the leaf was not marked dirty? 2008-12-23 19:47 but you say it wrote it 2008-12-23 19:47 dirty seems right 2008-12-23 19:47 what is seen on disk? 2008-12-23 19:47 basically it seems right 2008-12-23 19:48 ok, I agree it sounds strange 2008-12-23 19:48 maybe, it is corrupted 2008-12-23 19:48 and I should not bother you now 2008-12-23 19:49 how big is the filesystem for this test? 2008-12-23 19:49 it is 4gb 2008-12-23 19:49 however, test data is 256mb 2008-12-23 19:49 still really big 2008-12-23 19:50 if it was complessed, it can be small 2008-12-23 19:50 not so easy to dump metadata now, to investigate corruption 2008-12-23 19:51 I can put svg if needed 2008-12-23 19:51 that would be gigantic! 
2008-12-23 19:51 it is 51kb 2008-12-23 19:52 oh, not very big 2008-12-23 19:52 because fsx-linux does random truncate 2008-12-23 19:53 if you do put it in svg, I should link it from my next lkml post 2008-12-23 19:54 I think I should write an update to the Structure of Tux3 post to show how the structure got simpler 2008-12-23 19:54 it seems image is also not big 2008-12-23 19:54 link to corrupted image? 2008-12-23 19:55 maybe an uncorrupted image would be better :) 2008-12-23 19:55 but corrupted image would be fine too 2008-12-23 19:55 I guess our extents are keeping the metadata small 2008-12-23 19:56 well, anyway, I'll put example 2008-12-23 19:56 above 64,000 data blocks 2008-12-23 19:56 sorry 2008-12-23 19:56 about 64,000 data blocks 2008-12-23 19:56 how many files? 2008-12-23 19:56 for now, svg 2008-12-23 19:57 if non corrupt image is needed, it can create easily 2008-12-23 19:57 I meant, how many files in the fsx-linux? 2008-12-23 19:58 http://userweb.kernel.org/~hirofumi/20081224/ 2008-12-23 19:58 ah 2008-12-23 19:58 it uses 1 file 2008-12-23 19:58 if corruption was found, it create more 2 files 2008-12-23 20:04 http://userweb.kernel.org/~hirofumi/20081224/tux3.svg 2008-12-23 20:04 it is correct filesystem 2008-12-23 20:05 firefox fails to display it and inkspace is wierd 2008-12-23 20:06 maybe, it need to download 2008-12-23 20:06 an, the new one is much better 2008-12-23 20:06 inkscape shows something sensible 2008-12-23 20:07 ok, I'll remove old one 2008-12-23 20:08 that is really beautiful 2008-12-23 20:08 ok, I must write an updated structure post, and link that 2008-12-23 20:08 yes, it helps to write dleaf internal 2008-12-23 20:09 dwalk internal 2008-12-23 20:11 my daughter wants it to have some pink ;) 2008-12-23 20:11 ok, I will try to present it :) 2008-12-23 20:14 http://userweb.kernel.org/~hirofumi/20081224/tux3.svg 2008-12-23 20:14 no good pink 2008-12-23 20:14 another pink is needed 2008-12-23 20:15 it is a fine pink 2008-12-23 20:16 
http://www.graphviz.org/doc/info/colors.html 2008-12-23 20:16 we can chice color from it 2008-12-23 20:17 deeppink 2008-12-23 20:17 ok 2008-12-23 20:18 http://userweb.kernel.org/~hirofumi/20081224/tux3.svg 2008-12-23 20:19 she asks "how do you do that?" 2008-12-23 20:19 I say, it is like tuxpaint 2008-12-23 20:19 she's 4 2008-12-23 20:20 getting pretty good with tuxpaint 2008-12-23 20:20 tuxpaint? 2008-12-23 20:20 paint program for kids 2008-12-23 20:20 well, we can edit tux3.svg by editor 2008-12-23 20:20 recommended 2008-12-23 20:20 yes 2008-12-23 20:20 I have it open in inkscape 2008-12-23 20:21 editor means text editor 2008-12-23 20:21 I just replaced "pink" to "deeppink" 2008-12-23 20:21 I changed the color in inkscape 2008-12-23 20:22 ah 2008-12-23 20:22 it is good way 2008-12-23 20:26 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-23 20:29 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-23 20:33 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-23 20:34 struct journal_head is scary 2008-12-23 20:34 yes 2008-12-23 20:34 it seems to be complex 2008-12-23 20:35 I'm understanding it completely 2008-12-23 20:35 I'm not understanding it completely 2008-12-23 20:35 I think the standard trick for making the vfs stay away from metadata buffers is to leave the buffer marked clean, with a reference count 2008-12-23 20:36 so the buffer can't be evicted, but will not be written 2008-12-23 20:36 and the filesystem keeps its own dirty flag 2008-12-23 20:36 ext3 seems to do it 2008-12-23 20:37 I will first try to do that too, before deciding we need a separate address_space 2008-12-23 20:37 however, I think you are trying to add new buffer layer with block handles 2008-12-23 20:38 later 2008-12-23 20:38 i see 2008-12-23 20:38 for now we use buffer_head, but maybe not blockdev 2008-12-23 20:39 new buffer layer sounds like a very good way to not get merged for a long 
time 2008-12-23 20:39 merge first, then try a new buffer layer 2008-12-23 20:39 ah 2008-12-23 20:39 what does it have dirty state? 2008-12-23 20:40 filesystem can have a few private flags in the buffer state, we will use that for dirty 2008-12-23 20:41 ok 2008-12-23 20:44 BH_PrivateStart, 17 bits available 2008-12-23 20:45 yes 2008-12-23 20:45 clear side would need some trick 2008-12-23 20:46 I hope those bits are cleared in grow_buffers 2008-12-23 20:46 iirc, private is not cleared 2008-12-23 20:50 hey flips 2008-12-23 20:50 you're right 2008-12-23 20:51 that blows 2008-12-23 20:51 there is no excuse for not initializing buffer state in init_buffer 2008-12-23 20:52 maybe 2008-12-23 20:56 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L940 bh->b_state = 0; (for page buffers) 2008-12-23 20:56 ok, bug is found 2008-12-23 20:56 it was too long 2008-12-23 20:57 what was? 2008-12-23 20:57 after allocated, yes 2008-12-23 20:57 problem may be reuse 2008-12-23 20:58 now, we are passing max_seg is 1 2008-12-23 20:58 however, find_seg is assuming seg[] has all seg in range 2008-12-23 20:59 at least, it is one of problem 2008-12-23 20:59 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-23 21:00 that is a stupid change by me 2008-12-23 21:01 well 2008-12-23 21:01 is it right to have max_seg as 1? 2008-12-23 21:01 maybe, even if it is 1, I think caller can work 2008-12-23 21:02 range is from 0x6c to 0xdc 2008-12-23 21:02 but can block library ever handle more than one result seg? 
2008-12-23 21:03 no 2008-12-23 21:03 so, it pass seg with max_seg == 1 2008-12-23 21:03 so that seems right 2008-12-23 21:04 however, find_segs is assuming seg[] can have all extents 2008-12-23 21:04 I think the wrong is it 2008-12-23 21:05 it breaks with segs < max_segs 2008-12-23 21:05 however, "above" is index - limit 2008-12-23 21:05 seg does not have extent of limit 2008-12-23 21:06 it stop loops at segs < max_segs 2008-12-23 21:06 oh 2008-12-23 21:07 right 2008-12-23 21:07 it may not reach limit 2008-12-23 21:07 duh 2008-12-23 21:07 yes 2008-12-23 21:08 however, well, it can be caller is wrong 2008-12-23 21:15 ok, quick fix is 2008-12-23 21:15 block_t above = index - min(index, limit); 2008-12-23 21:15 not pretty, but seems sensible 2008-12-23 21:16 with it, get_segs will be limited by max_segs 2008-12-23 21:16 and fill_segs doesn't use limit, so it is no problem 2008-12-23 21:17 and caller looks ok? 2008-12-23 21:17 yes 2008-12-23 21:17 it just loops for returned segs 2008-12-23 21:36 hirofumi still awake? 2008-12-23 21:37 yes 2008-12-23 21:37 ah, 2 pm 2008-12-23 21:37 2:37 2008-12-23 21:37 yes, and I slept yesterday 2008-12-23 21:38 test was passed one time 2008-12-23 21:38 second time was failed 2008-12-23 21:38 :) 2008-12-23 21:38 :( 2008-12-23 21:38 it seems to be race condition 2008-12-23 21:39 trace seems to be mixed 2008-12-23 21:39 with uniprocessor? 2008-12-23 21:39 tux3_get_block: tux3_get_block: ==> inum 14, iblock 11589, b_size 4096, create 1==> inum 14, iblock 8056, b_size 4096, create 1 2008-12-23 21:39 yes 2008-12-23 21:39 preemption off? 2008-12-23 21:39 (just checking) 2008-12-23 21:39 iirc, preempt is on 2008-12-23 21:39 that will produce the effect 2008-12-23 21:40 yes 2008-12-23 21:40 however, mmap can be race 2008-12-23 21:40 ah, no 2008-12-23 21:40 how? 
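[Editor's sketch] The quick fix quoted above, `above = index - min(index, limit)`, matters because the gathering loop can stop at `max_segs` before it reaches `limit`. With an unsigned block type, a plain `index - limit` would then wrap to a huge value instead of going negative. The small model below shows the clamp. The unsigned 64-bit `block_t` and the helper names are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t block_t;   /* assumed: an unsigned 64-bit block number */

static block_t min_block(block_t a, block_t b)
{
    return a < b ? a : b;
}

/*
 * Blocks of the last gathered segment that lie at or above `limit`.
 * `index` is the logical block just past the last gathered segment.
 * Returns zero when gathering stopped early, that is when index < limit.
 */
static block_t overlap_above(block_t index, block_t limit)
{
    return index - min_block(index, limit);
}
```

For the range mentioned in the log (0x6c to 0xdc) with `max_segs` of 1, a first segment ending at, say, 0x70 yields an overlap of zero rather than a wrapped difference; the 0x70 is a made-up value.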
2008-12-23 21:40 :) 2008-12-23 21:40 :) 2008-12-23 21:41 ok, so blockdev buffer handling just seems odd and not useful to us 2008-12-23 21:41 while normal page cache buffer handling is pretty sensible and useful 2008-12-23 21:41 yes 2008-12-23 21:42 so could you think about any reasons why we need to use blockdev at all? 2008-12-23 21:42 I am also looking at it 2008-12-23 21:42 um... 2008-12-23 21:43 blockdev for fs? 2008-12-23 21:43 and for us 2008-12-23 21:46 I meant s_bdev 2008-12-23 21:46 ok 2008-12-23 21:48 s_bdev is used to get the info of backing store 2008-12-23 21:48 and buffer cache 2008-12-23 21:49 and for sb_bread 2008-12-23 21:49 yes, it is buffer cache 2008-12-23 21:50 so, if we don't use buffer cache, I think we don't need s_bdev basically 2008-12-23 21:52 then we write out own _bread and initialize the flags properly 2008-12-23 21:52 I thought it is new layer of buffer handles 2008-12-23 21:53 we use new layer, but we use buffer_head instead of handles? 2008-12-23 21:53 it isn't even a new layer 2008-12-23 21:54 page of buffer_head is where come from? 2008-12-23 21:54 from a page cache, just like now 2008-12-23 21:55 now, page cache is on s_bdev->bd_inode 2008-12-23 21:55 buffer cache is 2008-12-23 21:55 yes 2008-12-23 21:55 and that is only because we access it by sb_bread 2008-12-23 21:56 so if we use tux_bread, we can have our own metadata page cache 2008-12-23 21:56 page cache is linked to which inode? 2008-12-23 21:57 new inode 2008-12-23 21:57 we allocate 2008-12-23 21:57 ah, I called it as new layer 2008-12-23 21:57 I meant to try this with junkfs 2008-12-23 21:57 ok 2008-12-23 21:58 I think we are fine with it for now 2008-12-23 21:59 um.. 2008-12-23 21:59 why did we new inode? 2008-12-23 22:00 to avoid confusion by write from /dev/xxx? 
2008-12-23 22:00 so page->mapping->host is our inode for one thing 2008-12-23 22:01 and because vfs like to go mucking with s_bdev, for sync 2008-12-23 22:01 and vfs can't really do anything useful for us by playing with s_bdev 2008-12-23 22:01 only cause trouble 2008-12-23 22:02 if we have buffer_heads on pages in our own page cache, then we can use the assoc_list field for exampel 2008-12-23 22:02 differently from how vfs uses it 2008-12-23 22:02 ah, ok 2008-12-23 22:02 we need own ->writepage, etc. 2008-12-23 22:03 ->writepage is for file page cache, we will keep using it 2008-12-23 22:03 and don't write pages by vfs 2008-12-23 22:03 instead of it, we use bio directly? 2008-12-23 22:04 for new layer 2008-12-23 22:04 we will eventually, but not before christmas 2008-12-23 22:04 the block IO library should work for us 2008-12-23 22:04 we just do metadata differently 2008-12-23 22:05 yes 2008-12-23 22:05 e.g. bnode will on new layer? 2008-12-23 22:05 christmas being two days away :) 2008-12-23 22:05 yes, bnode will go in tuxcache :) 2008-12-23 22:05 and tuxcache will write by bio? 2008-12-23 22:06 will be written 2008-12-23 22:06 yes 2008-12-23 22:06 ok 2008-12-23 22:06 and we can get the bio completions that way 2008-12-23 22:06 yes 2008-12-23 22:07 getting completions for file data is a problem, block io library does not give us anything for that 2008-12-23 22:07 we can set end_io? 2008-12-23 22:07 block io library sets it 2008-12-23 22:08 ah, yes 2008-12-23 22:09 well, we have to handle it ourself? 2008-12-23 22:09 we have to know when a certain set of dirty file data has arrived on disk 2008-12-23 22:09 because we have to avoid to write by vfs 2008-12-23 22:09 especially for directories 2008-12-23 22:10 and do our own ->write_page 2008-12-23 22:10 that is a possibility 2008-12-23 22:11 maybe, we can't use block I/O library almost? 
2008-12-23 22:11 probably 2008-12-23 22:12 I thought I should look for a way to handle it with the block IO library, but not try too hard 2008-12-23 22:12 block library should work for reading 2008-12-23 22:12 just not writing 2008-12-23 22:12 i see 2008-12-23 22:15 let's see how ext3 ordered data write works 2008-12-23 22:16 ordered data is using block I/O library 2008-12-23 22:18 yes, it would be nice if we can use block io library for file data and our own for metadata, for now 2008-12-23 22:19 maybe, we need to redirect somehow by get_block 2008-12-23 22:20 oh, that is an issue 2008-12-23 22:20 http://lxr.linux.no/linux+v2.6.27.5/fs/ext3/inode.c#L1474 2008-12-23 22:20 if we set buffer_mapped, the block library may write to the same block again, without calling ->get_block 2008-12-23 22:21 thankyou 2008-12-23 22:21 even if it's not mapped 2008-12-23 22:22 or if we need to redirect the block, that is bad 2008-12-23 22:22 I think that is a killer reason why the block IO library can't be used for anything except an ordered data mode 2008-12-23 22:22 let's think about write path 2008-12-23 22:23 sys_write will prepare page 2008-12-23 22:23 ->write_begin will be called 2008-12-23 22:23 with it, page will be prepared 2008-12-23 22:24 and ->writepage will be called 2008-12-23 22:24 in ->writepage, we have to redirect the page 2008-12-23 22:24 yes 2008-12-23 22:25 and ->write_begin and ->writepage is needed to seriarize 2008-12-23 22:25 or ->write_begin will fork 2008-12-23 22:26 yes 2008-12-23 22:26 so, delayed allocation is necessary 2008-12-23 22:27 I don't see how you reached that conclusion from the above (but I agree...) 
2008-12-23 22:27 ah, I skipped get_block 2008-12-23 22:27 ->write_begin calls get_block(create) 2008-12-23 22:28 and dirty page 2008-12-23 22:29 and ->writepage is called with page is allocated blocks 2008-12-23 22:29 ah 2008-12-23 22:30 we may not need delayed allocation if we free blocks in ->writepage 2008-12-23 22:30 there are only two problems I see there 1) we leave the buffer in buffer_mapped state, so our get_block(create) does not get called in the future when we need it 2) no obvious way to know when the IO has completed 2008-12-23 22:31 yes 2008-12-23 22:31 I think 1) can be solved by delayed allocation 2008-12-23 22:33 yes 2008-12-23 22:33 to solve 2), we need to own library, or we need to wait completion 2008-12-23 22:33 I am just trying to convince myself there is no good way around 2 2008-12-23 22:33 so I thought I should see what ext3 does in ordered data mode 2008-12-23 22:34 once convinced, we can get down to the work of building a suitable mechanism 2008-12-23 22:34 ok 2008-12-23 22:34 another topic: locking 2008-12-23 22:35 I assume that the fsx test that fails is really because of a race we don't handle 2008-12-23 22:35 shall we guess what it is? 2008-12-23 22:36 trace is needed 2008-12-23 22:36 I guess it is in our metadata handling, which has no synchronization at all 2008-12-23 22:37 it seems to be both is inum 14 2008-12-23 22:37 race allocating the inum? 2008-12-23 22:37 no 2008-12-23 22:38 tux3_get_block: tux3_get_block: ==> inum 14, iblock 11589, b_size 4096, create 1==> inum 14, iblock 8056, b_size 4096, create 1 2008-12-23 22:38 it seems tux3_get_block is called two times 2008-12-23 22:38 one is iblock 11589 2008-12-23 22:38 one is iblock 8056 2008-12-23 22:39 both seems create == 1 2008-12-23 22:39 and tracing for these calls is interleaved? 2008-12-23 22:39 yes 2008-12-23 22:40 well, nothing in vfs prevents that, we better add a synchronizer 2008-12-23 22:40 yes 2008-12-23 22:40 ah 2008-12-23 22:40 btree->mutex, to start? 
2008-12-23 22:40 maybe, one is flusher, and one is write 2008-12-23 22:40 could be 2008-12-23 22:41 shrink_caches 2008-12-23 22:41 or pdflush 2008-12-23 22:41 yes 2008-12-23 22:41 I'll try to get the detail of it 2008-12-23 22:41 and let's add a crude mutex and see how much it sucks 2008-12-23 22:42 yes 2008-12-23 22:43 after some cleanup, I think we can do some benchmark work 2008-12-23 22:43 a little thing I would like to do right now: change get_segs to use start, count 2008-12-23 22:43 yes 2008-12-23 22:44 ok, should be just a few minutes 2008-12-23 22:45 well, we need some cleanup about it 2008-12-23 22:45 at least, bug fix 2008-12-23 22:47 yes, I will do this one then talk about the next one 2008-12-23 22:47 ok 2008-12-23 22:48 ugh, it seems timing was changed 2008-12-23 22:48 another bug is happened 2008-12-23 22:50 ah, ok 2008-12-23 22:50 merge ffff81001d57c120 into ffff81001cada000 2008-12-23 22:50 caller is __mpage_writepage+0x21d/0x56c 2008-12-23 22:50 tux3_get_block: tux3_get_block: <== inum 14, mapped 1, new 1, block 523, size 4096==> inum 14, iblock 10993, b_size 4096, create 1 2008-12-23 22:50 caller is __block_prepare_write+0x1cf/0x3d8 2008-12-23 22:50 balloc extent -> [20c/1] 2008-12-23 22:50 tux3_get_block: ==> inum 14, iblock 9518, b_size 4096, create 1merge ffff81001d57c120 into ffff81001cada000 2008-12-23 22:51 it maybe __block_prepare_write and __mpage_writepage 2008-12-23 22:51 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-23 22:51 so, maybe, we need to lock btree 2008-12-23 22:52 we expect those bugs with preempt and no mutex 2008-12-23 22:52 ok 2008-12-23 22:52 mutex or rwsem? 
2008-12-23 22:52 might was well go with rwsem 2008-12-23 22:52 and read will not suck so much 2008-12-23 22:54 ok, rwsem 2008-12-23 22:54 LD .tmp_vmlinux1 2008-12-23 22:54 fs/built-in.o: In function `alloc_cursor': 2008-12-23 22:54 (.text+0x4cdb1): undefined reference to `cond_resched' 2008-12-23 22:54 I wonder what this is 2008-12-23 22:54 it will be added later of some change 2008-12-23 22:54 rebuild? 2008-12-23 22:54 I think so 2008-12-23 22:54 cond_resched is inline 2008-12-23 22:55 however, I wonder why I don't see it 2008-12-23 22:56 make clean on just fs/tux3 fixed it 2008-12-23 22:56 good 2008-12-23 22:57 are you changing some files? 2008-12-23 22:58 user+kernel filemap.c 2008-12-23 22:58 checkin in a minute or so 2008-12-23 22:58 ok 2008-12-23 22:58 I tried a trivial test in kernel 2008-12-23 22:58 just checking my arithmetic 2008-12-23 22:58 big, trivial change 2008-12-23 23:00 pushed 2008-12-23 23:01 ok, another trivial thing... I think our cleanup in filemap was successful, but the factoring into read_segs, fill_segs is not useful 2008-12-23 23:01 so I would like to put those back together in get_segs 2008-12-23 23:02 but I can wait 2008-12-23 23:02 what is read_segs? 2008-12-23 23:02 sorry 2008-12-23 23:02 find_segs 2008-12-23 23:03 no need to do that right now 2008-12-23 23:03 ah, merge find_segs and fill_segs? 2008-12-23 23:03 yes 2008-12-23 23:03 i see 2008-12-23 23:03 it is the handling of the segs vector that we improved 2008-12-23 23:04 breaking into separate functions maybe helped focus that work, but structurally it is awkward 2008-12-23 23:04 anyway, it is not useful to do that right now 2008-12-23 23:04 disruptive 2008-12-23 23:05 ok 2008-12-23 23:06 ok, btree->lock ? 
2008-12-23 23:06 let me check current repo 2008-12-23 23:06 and my working stuff 2008-12-23 23:08 ok 2008-12-23 23:08 we have to tell allocated blocks somehow 2008-12-23 23:08 now, I'm hacking it by -count 2008-12-23 23:08 yes, it would be cleaner to have a flag bit 2008-12-23 23:09 sounds good 2008-12-23 23:09 good use of bit fields 2008-12-23 23:16 we remove -count? 2008-12-23 23:17 yes 2008-12-23 23:18 instead of it, we also add SEG_HOLE? 2008-12-23 23:18 == 0? 2008-12-23 23:18 1 2008-12-23 23:20 you mean, have two flag bits, one for hole and one for new? 2008-12-23 23:20 yes, if we remove -count 2008-12-23 23:20 yes 2008-12-23 23:20 this change is not urgent 2008-12-23 23:20 we should try the btree->lock 2008-12-23 23:20 well, new is needed 2008-12-23 23:22 struct seg { block_t block; unsigned count:30, hole:1, new:1; }; 2008-12-23 23:22 we want to save 4 bytes? 2008-12-23 23:23 it does not matter much 2008-12-23 23:23 struct seg { block_t block; unsigned count; unsigned state; }; 2008-12-23 23:24 I already started with it, please change it if needed 2008-12-23 23:24 ok 2008-12-23 23:24 fine 2008-12-23 23:35 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-23 23:35 done 2008-12-23 23:35 please check it 2008-12-23 23:38 ok 2008-12-23 23:41 SEG_NEW would be a little clearer than SEG_ALLOCATED 2008-12-23 23:41 oh 2008-12-23 23:41 I'll change it 2008-12-23 23:41 the second one is ambiguous, I saw the meaning from reading 2008-12-23 23:42 I will pull when you're ready 2008-12-23 23:45 ok, done 2008-12-23 23:45 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-23 23:47 pulled 2008-12-23 23:47 -!- RazvanM(~RazvanM@96.234.233.172) has joined #tux3 2008-12-23 23:48 ok, lock? 2008-12-23 23:50 almost have a trial patch 2008-12-23 23:51 ok 2008-12-23 23:54 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-23 23:57 hirofumi, how about I post the trial patch on the list?
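[Editor's sketch] The change discussed above replaces the "-count" hack with explicit flag bits in struct seg. A minimal user-space sketch of that layout follows, using the three-field form quoted in the log. The bit values chosen for SEG_HOLE and SEG_NEW, the 64-bit block_t, and the helpers make_seg, seg_is_hole and seg_is_new are assumptions made for illustration; they are not taken from the tux3 source.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t block_t;	/* assumption: 64-bit block numbers */

/* Assumed bit assignments; the log only names the flags. */
enum seg_state {
	SEG_HOLE = 1 << 0,	/* segment covers unallocated (sparse) blocks */
	SEG_NEW  = 1 << 1,	/* segment was allocated by this call */
};

/* The three-field form quoted in the log. */
struct seg {
	block_t block;
	unsigned count;
	unsigned state;
};

/* Hypothetical helper: build a segment with the given flags. */
static struct seg make_seg(block_t block, unsigned count, unsigned state)
{
	struct seg seg = { .block = block, .count = count, .state = state };
	assert(count > 0);	/* the sign of count no longer carries meaning */
	return seg;
}

/* Hypothetical helpers: test one flag without touching count. */
static int seg_is_hole(const struct seg *seg)
{
	return (seg->state & SEG_HOLE) != 0;
}

static int seg_is_new(const struct seg *seg)
{
	return (seg->state & SEG_NEW) != 0;
}
```

With a 64-bit block_t on a typical LP64 target, the bit-field variant and this variant would likely both pad out to 16 bytes, which fits the "it does not matter much" remark, though that depends on the ABI.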
2008-12-23 23:58 that's fine for me 2008-12-24 00:04 I feel, it is good to share what we do with the list before we do it, when possible 2008-12-24 00:04 of course, it is nice to write something about it too, which I did not 2008-12-24 00:05 yes 2008-12-24 00:05 it is good 2008-12-24 00:06 let's see if it affects the fsx test 2008-12-24 00:06 i'll push it 2008-12-24 00:06 I also try to post I can possible 2008-12-24 00:08 pushed 2008-12-24 00:08 new_btree is called if inode is new inode 2008-12-24 00:08 ah 2008-12-24 00:08 I think tux3_inode_init_once or tux3_allocate_inode is good 2008-12-24 00:10 and my test case worked because it was a new file 2008-12-24 00:13 static void tux3_inode_init_once(struct kmem_cache *cachep, void *mem) 2008-12-24 00:13 { 2008-12-24 00:13 inode_init_once(&((tuxnode_t *)mem)->vfs_inode); 2008-12-24 00:13 + init_rwsem(&btree.lock); 2008-12-24 00:13 } 2008-12-24 00:13 I will push it 2008-12-24 00:13 wait a bit 2008-12-24 00:13 ok 2008-12-24 00:13 tux3_alloc_inode() will clear it 2008-12-24 00:14 anyway, it's wrong ;) 2008-12-24 00:14 tuxi->btree = (struct btree){}; 2008-12-24 00:14 tuxi->btree initialization is just for debug, iirc 2008-12-24 00:16 ah, new_btree is also clear it 2008-12-24 00:16 in other words, slow down and think ;) 2008-12-24 00:17 btree = (struct btree){ xxx } 2008-12-24 00:17 return (struct btree){ }; <- not so good 2008-12-24 00:17 yes 2008-12-24 00:17 I think we should pass btree itself 2008-12-24 00:18 and initialize for each fields 2008-12-24 00:18 new_btree doesn't clear it 2008-12-24 00:18 and return err? 
2008-12-24 00:18 only clears the root 2008-12-24 00:18 you might be right, let me look 2008-12-24 00:19 it returns 2008-12-24 00:19 struct btree btree = { .sb = sb, .ops = ops }; 2008-12-24 00:22 tux_setup_inode 2008-12-24 00:22 for now, it would be ok 2008-12-24 00:24 ah, no 2008-12-24 00:24 new_btree will clear it 2008-12-24 00:25 this is not the cleanest thing in the world 2008-12-24 00:27 it is because we don't have a specific init_btree 2008-12-24 00:28 I thought init is (struct btree){} 2008-12-24 00:29 well, init_btree() is fine 2008-12-24 00:30 inodes will not always have btrees, is worth remembering 2008-12-24 00:31 yes 2008-12-24 00:31 it is why I clear btree in alloc_inode 2008-12-24 00:31 if someone try to use it, it should die 2008-12-24 00:31 new_btree is a misleading name, it should be make_btree, it makes the persistent object 2008-12-24 00:32 if so, we would like to change new_* 2008-12-24 00:33 ugh 2008-12-24 00:33 well, everywhere we assign inode.sb, we should init the rwsem 2008-12-24 00:34 btree is embeded to inode 2008-12-24 00:34 so, I think tux3_inode_init_once is fine 2008-12-24 00:35 I thought we shouldn't clear it 2008-12-24 00:38 inode_init_once(&((tuxnode_t *)mem)->vfs_inode); 2008-12-24 00:38 + init_rwsem(&((tuxnode_t *)mem)->btree.lock); 2008-12-24 00:38 ugly, but 2008-12-24 00:38 yes, more change is needed 2008-12-24 00:39 new_btree and tux3_alloc_inode 2008-12-24 00:42 or for now, we can just use inode->btree_lock 2008-12-24 00:43 I guess initializing it in more places is preferable 2008-12-24 00:43 and leave comments there 2008-12-24 00:45 structure assignment worries me for the semaphore 2008-12-24 00:45 sometimes those initializations assume the structure address will not change 2008-12-24 00:46 um..., example? 2008-12-24 00:47 a list_head for example 2008-12-24 00:47 points at itself, would be messed up by a structure assignment 2008-12-24 00:48 so, btree = (struct btree){} is wrong? 
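[Editor's sketch] The worry above is that a lock whose initializer stores the address of the structure itself (a list_head pointing at itself, for example) is silently broken by structure assignment. The sketch below models that in user space. The types lock_model and btree_model and the function init_btree_model are hypothetical stand-ins, not the kernel rwsem or the tux3 btree; they only show why initializing in place through a pointer, and returning an error code, avoids the problem.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular list head in the style of the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *list)
{
	list->next = list->prev = list;	/* empty list points at itself */
}

static int list_empty(const struct list_head *list)
{
	return list->next == list;
}

/* Hypothetical stand-in for a lock that embeds a wait list. */
struct lock_model {
	struct list_head waiters;
};

/* Hypothetical stand-in for a btree embedded in an inode. */
struct btree_model {
	struct lock_model lock;
	unsigned root_block;
	unsigned root_depth;
};

/* The hazard: initialize a temporary, then copy it by assignment. */
static int copy_keeps_list_valid(void)
{
	struct btree_model tmp = { .root_block = 1, .root_depth = 0 };
	struct btree_model dst;

	init_list_head(&tmp.lock.waiters);
	dst = tmp;	/* dst.lock.waiters.next still points into tmp */
	return list_empty(&dst.lock.waiters);
}

/* The alternative: take a pointer, set only the needed fields, report errors. */
static int init_btree_model(struct btree_model *btree, unsigned root_block)
{
	if (!btree)
		return -1;
	init_list_head(&btree->lock.waiters);
	btree->root_block = root_block;
	btree->root_depth = 0;
	return 0;
}
```

In this model copy_keeps_list_valid() returns 0, because the copied list head no longer points at itself, while a btree_model set up through init_btree_model() has a valid empty wait list.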
2008-12-24 00:48 it smells wrong 2008-12-24 00:48 yes 2008-12-24 00:49 it was pretty sloppy stuff I wrote as a prototype 2008-12-24 00:49 we shouldn't tuch btree_lock at all 2008-12-24 00:49 if we initialize it in alloc_inode 2008-12-24 00:52 how about this 2008-12-24 00:52 new_btree should take the btree as a pointer, and only initialize the parts it needs to 2008-12-24 00:53 yes, it's fine 2008-12-24 00:53 new_btree have to return error 2008-12-24 00:53 so, btree as pointer is needed by another reason 2008-12-24 00:54 and we can return a proper error code 2008-12-24 00:54 yes 2008-12-24 01:03 btw, I've tested with hacked inode->lock 2008-12-24 01:03 it seems to work 2008-12-24 01:03 fsx-linux test was passed several times 2008-12-24 01:03 good, that is the important thing :) 2008-12-24 01:03 we also need synchronization for balloc 2008-12-24 01:03 I think 2008-12-24 01:04 ah 2008-12-24 01:04 i see 2008-12-24 01:04 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-12-24 01:04 we have to find test case 2008-12-24 01:06 -!- RazvanM(~RazvanM@96.234.233.172) has joined #tux3 2008-12-24 01:06 ah 2008-12-24 01:06 it is hard to test 2008-12-24 01:06 nextalloc is prevent it almost 2008-12-24 01:07 ext2 doesn't have synchronization in balloc 2008-12-24 01:08 ah, and inode allocation 2008-12-24 01:08 right, we have the btree lock, we just need to use it 2008-12-24 01:08 inode table access in general 2008-12-24 01:09 rwsem is good 2008-12-24 01:09 yes 2008-12-24 01:09 nearly ready with a patch for new_btree 2008-12-24 01:10 ext2 is using set_bit atomicity and sb_bgl_lock() 2008-12-24 01:10 I think 2008-12-24 01:10 clever 2008-12-24 01:11 ext2 get_block locking is hard to see also 2008-12-24 01:11 it's there, but hidden 2008-12-24 01:11 yes 2008-12-24 01:11 it seems deep 2008-12-24 01:13 maybe, fsx-linux allocate block by write_begin only 2008-12-24 01:13 so, it seems stable more or less 2008-12-24 01:13 there was one bug, however, test passed many times 2008-12-24 
01:14 it is something to celebrate 2008-12-24 01:14 very big thing you did today 2008-12-24 01:14 yes 2008-12-24 01:14 really 2008-12-24 01:14 ok, a patch to look at 2008-12-24 01:14 I'll post to the list 2008-12-24 01:14 I guess tux3 root may work 2008-12-24 01:14 tux3 as rootfs 2008-12-24 01:16 another bug was happened 2008-12-24 01:17 it seems to hang with I/O 2008-12-24 01:17 strange 2008-12-24 01:20 ah, no 2008-12-24 01:20 it is us 2008-12-24 01:20 -!- flips(~phillips@phunq.net) has joined #tux3 2008-12-24 01:20 it is us 2008-12-24 01:20 lost keyboard input 2008-12-24 01:20 third time this week 2008-12-24 01:20 never had it happen before 2008-12-24 01:21 I missed the last bit 2008-12-24 01:21 usb? 2008-12-24 01:21 usb or ps2 2008-12-24 01:21 tried both 2008-12-24 01:21 oh 2008-12-24 01:21 i845 chipset 2008-12-24 01:21 don't have ideas 2008-12-24 01:21 sounds like ps2 driver or hardware 2008-12-24 01:22 sounds like hardware, but nothing in dmesg 2008-12-24 01:22 ah, it can be X 2008-12-24 01:22 it can, but never had it happen before, and no recent updates 2008-12-24 01:22 oh 2008-12-24 01:23 sounds like hardware 2008-12-24 01:23 it does 2008-12-24 01:24 there is one bug somewhere 2008-12-24 01:24 it was failed at dleaf.h:403 2008-12-24 01:24 assert(walk->entry >= walk->estop); 2008-12-24 01:25 dleaf_chop failed 2008-12-24 01:25 truncate again 2008-12-24 01:25 are we locked against it? 
2008-12-24 01:25 I don't think so 2008-12-24 01:26 ah 2008-12-24 01:27 right 2008-12-24 01:27 another write can change btree 2008-12-24 01:27 there is race 2008-12-24 01:27 ok 2008-12-24 01:29 I'll go to shop 2008-12-24 01:29 ok, I'll post the new_btree patch and sleep 2008-12-24 01:29 ok 2008-12-24 01:29 oyasumi 2008-12-24 01:29 oyasumi 2008-12-24 02:55 -!- pgquiles(~pgquiles@91.Red-88-18-198.staticIP.rima-tde.net) has joined #tux3 2008-12-24 03:05 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-12-24 04:13 -!- mlankhorst(~m@fw1.astro.rug.nl) has joined #tux3 2008-12-24 06:02 -!- mlankhorst(~m@fw1.astro.rug.nl) has joined #tux3 2008-12-24 06:02 -!- pgquiles(~pgquiles@91.Red-88-18-198.staticIP.rima-tde.net) has joined #tux3 2008-12-24 06:02 -!- Man_of_Wax(~wax@gualtiero.cs.unibo.it) has joined #tux3 2008-12-24 06:02 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-12-24 06:03 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2008-12-24 08:19 -!- RazvanM(~RazvanM@96.234.233.172) has joined #tux3 2008-12-24 12:07 folks 2008-12-24 12:57 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-24 14:16 -!- pgquiles(~pgquiles@146.Red-81-34-4.dynamicIP.rima-tde.net) has joined #tux3 2008-12-24 16:01 -!- flips(~phillips@phunq.net) has joined #tux3 2008-12-24 19:58 hirofrumi, there? 2008-12-24 20:02 hi 2008-12-24 20:05 I'll push the inode init fix 2008-12-24 20:05 just thinking about the other smp locking 2008-12-24 20:06 ok 2008-12-24 20:06 open_inode takes read lock on itable, save_inode takes write lock 2008-12-24 20:07 bitmap tree... no use for read lock I think 2008-12-24 20:08 maybe, I need to check for it 2008-12-24 20:08 maybe, yes 2008-12-24 20:08 pack_sb... maybe already protected by a kernel lock? 2008-12-24 20:08 it is lock_sb() 2008-12-24 20:09 I can't think of anything else that needs locking right now... 
xattrs, but it's not enabled yet 2008-12-24 20:10 maybe, yes 2008-12-24 20:11 itable is 2008-12-24 20:11 purge_inum() 2008-12-24 20:11 open_inode() 2008-12-24 20:12 make_inode() 2008-12-24 20:12 save_inode() 2008-12-24 20:13 ugh, bitmap is not grep friendly 2008-12-24 20:13 how exactly? 2008-12-24 20:13 bitmap is used for sb->bitmap 2008-12-24 20:14 and another use is for bitmap data 2008-12-24 20:15 sb->bitmap seems to be used only from balloc/bfree 2008-12-24 20:15 so, yes 2008-12-24 20:16 disksuper is protected by lock_super() 2008-12-24 20:17 almost all btree operations need to be locked 2008-12-24 20:18 we will do it with this work, and if needed, we can assert it on btree.c 2008-12-24 20:19 directory operation is already locked by ->i_mutex, and will be locked by btree->lock 2008-12-24 20:19 dleaf is... 2008-12-24 20:19 dleaf is data btree, so btree->lock 2008-12-24 20:20 with quick look, I think you are right 2008-12-24 20:22 of course, we have to lock data btree more though 2008-12-24 20:26 I found the bug in fill_segs() 2008-12-24 20:26 it may call insert_node multiple times 2008-12-24 20:27 however, insert_node doesn't add new root with correct next pointer 2008-12-24 20:28 so, if it reads next, it is wrong 2008-12-24 20:32 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-12-24 20:32 -!- kushal(~kushal@117.195.33.57) has joined #tux3 2008-12-24 20:51 looking at insert_node 2008-12-24 20:58 hirofumi, right, insert_node was written for a situation where there was only one insert per probe 2008-12-24 20:58 now it needs to update the cursor 2008-12-24 21:10 also, it may not be updating cursor 2008-12-24 21:26 insert_node only used the cursor to free the path 2008-12-24 21:38 - level_root_add(cursor, newbuf, NULL); // .next = ???
2008-12-24 21:38 + level_root_add(cursor, newbuf, newroot->entries + 2); 2008-12-24 21:57 -!- kushal(~kushal@117.195.33.57) has joined #tux3 2008-12-24 21:58 @flips is the current implementation of tux3fuse.c broken for writes to actual disk partitions?? 2008-12-24 22:01 insert_node inserts node to current next 2008-12-24 22:01 insert_node(btree, index, bufindex(newbuf), cursor); 2008-12-24 22:01 insert_node(btree, index, bufindex(newbuf), cursor); 2008-12-24 22:01 2008-12-24 22:01 2008-12-24 22:01 insert_node(btree, index, bufindex(newbuf), cursor); 2008-12-24 22:01 insert_node(btree, index + 1, bufindex(newbuf), cursor); 2008-12-24 22:01 it would wrong 2008-12-24 22:02 getting the following error for disk partition (working fine for ddsnap images)... 2008-12-24 22:03 blockread: failed to read block 0 (Input/output error) balloc_from_range: balloc 1 blocks from [0/d] 2008-12-24 22:09 kushal, I never tried it on a real disk partition 2008-12-24 22:11 used to work before...seems to have broken in the last few days... 2008-12-24 22:13 actually we're back to our deduplication project after a long hiatus... 2008-12-24 22:13 working on the design 2008-12-24 22:16 ddsnap image? 2008-12-24 22:16 you mean, dd to a file? 2008-12-24 22:17 yes 2008-12-24 22:18 strace tux3fs -f 2008-12-24 22:18 see if a syscall returned error 2008-12-24 22:21 hirofumi, because insert_node takes a random address and cursor assumes a current position 2008-12-24 22:21 so insert_node is not general 2008-12-24 22:22 random address? 2008-12-24 22:22 you specify the index where to insert 2008-12-24 22:22 actually, it would be better if insert_node took that from the cursor 2008-12-24 22:23 hmm 2008-12-24 22:23 umm? 
2008-12-24 22:23 good answer ;) 2008-12-24 22:23 brb 2008-12-24 22:24 I'm talking nonsense 2008-12-24 22:24 filemap will use it like the above 2008-12-24 22:24 -!- kushal(~kushal@117.195.33.57) has joined #tux3 2008-12-24 22:24 and insert_node will insert to current path[].next 2008-12-24 22:25 yes, so it seems to have the right api 2008-12-24 22:25 btw, "failed to read block 0" seems bug of filemap.c 2008-12-24 22:26 of course, it is because of the changes in the last few days 2008-12-24 22:26 and never tested the user space version 2008-12-24 22:26 pread64(4, "", 4096, 9019845267161088) = 0 2008-12-24 22:26 write(1, "blockread: failed to read block "..., 55blockread: failed to read block 0 (Input/output error) 2008-12-24 22:26 ) = 55 2008-12-24 22:26 using tux3fuse btw 2008-12-24 22:27 filemap.c meant filemap_extent_io() actually 2008-12-24 22:27 hole handling maybe 2008-12-24 22:27 i.e. for (int i = 0, index = start; !err && index < limit; i++) { 2008-12-24 22:27 it loops from start to limit 2008-12-24 22:27 it should be i < segs 2008-12-24 22:28 yes 2008-12-24 22:28 it was missing it 2008-12-24 22:28 I was missing it 2008-12-24 22:29 for (int i = 0, index = start; !err && i < segs; i++) { 2008-12-24 22:29 for (int i = 0, index = start; i < segs; !err && index < limit; i++) { 2008-12-24 22:29 probably the index < limit test is not doing anything useful now 2008-12-24 22:29 yes 2008-12-24 22:29 hole handling looks ok 2008-12-24 22:30 it seems not to do brelse() 2008-12-24 22:30 maybe, it is needed 2008-12-24 22:30 for (int i = 0, index = start; !err && i < segs; i++) { 2008-12-24 22:30 ah, no 2008-12-24 22:30 who is free it? 2008-12-24 22:31 ah, brelse is needed 2008-12-24 22:32 where? 
2008-12-24 22:32 if (hole) { 2008-12-24 22:33 it uses "continue" 2008-12-24 22:33 stupid me 2008-12-24 22:33 if (hole) { 2008-12-24 22:33 trace("zero fill buffer"); 2008-12-24 22:33 memset(bufdata(buffer), 0, sb->blocksize); 2008-12-24 22:33 } else 2008-12-24 22:33 err = diskread(dev->fd, bufdata(buffer), sb->blocksize, block << dev->bits); 2008-12-24 22:34 maybe 2008-12-24 22:34 it's not kernel, so uptodate does not mean hole here 2008-12-24 22:35 -!- stargazr5(~gauravstt@117.195.33.57) has joined #tux3 2008-12-24 22:36 yes 2008-12-24 22:36 I just worry someone may be assuming refcount 2008-12-24 22:37 maybe, not assuming though 2008-12-24 22:37 show_buffers is used to check that in user space 2008-12-24 22:37 all should have zero refcount at the end 2008-12-24 22:38 inodetest shows that 2008-12-24 22:38 if so, it's fine 2008-12-24 22:38 I will push those small fixes now 2008-12-24 22:39 yes 2008-12-24 22:39 10 is would be good 2008-12-24 22:39 int segs = get_segs(inode, start, limit - start, segvec, 10, write); 2008-12-24 22:40 ARRAY_SIZE 2008-12-24 22:40 more good 2008-12-24 22:41 btw...just observed something..only fails for first write after mkfs 2008-12-24 22:41 works on subsequent writes 2008-12-24 22:43 kushai, pull to get a couple of bug fixes 2008-12-24 22:44 ok... 2008-12-24 22:45 / leave empty if error ??? <- and this comment needs to be addressed 2008-12-24 22:46 anyway, insert_node 2008-12-24 22:46 working fine now.. 2008-12-24 22:47 thank hirofumi 2008-12-24 22:47 ACTION thanks hirofumi 2008-12-24 22:47 no problem :) 2008-12-24 22:48 it was bug of me :) 2008-12-24 22:48 I changed get_segs(), however I missed to change it 2008-12-24 22:48 coming back to deduplication... 2008-12-24 22:48 I think it was my bug actually 2008-12-24 22:49 kushal, you saw my post about it a couple days ago? 2008-12-24 22:49 yes 2008-12-24 22:49 we 2008-12-24 22:50 we are thinking about block level de-duplication... 2008-12-24 22:50 below the filesystem? 
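[Editor's sketch] The fix above changes the extent loop to iterate over the returned segments (i < segs) rather than over the block range, and to zero-fill holes instead of reading block 0 from disk. The sketch below models that shape in user space. The flat in-memory "disk", the tiny BLOCKSIZE, the simplified struct seg and the function read_segs are all assumptions for illustration; this is not filemap_extent_io.

```c
#include <assert.h>
#include <string.h>

enum { BLOCKSIZE = 8, SEG_HOLE = 1 };	/* assumed values, small for testing */

/* Simplified segment: physical start, length in blocks, flag bits. */
struct seg {
	unsigned block, count, state;
};

/*
 * Copy the blocks described by seg[0..segs) into out, one block at a time.
 * Holes are zero filled and never touch the disk image. Returns 0 on
 * success or -1 if a segment points past the end of the disk image.
 */
static int read_segs(const unsigned char *disk, unsigned disk_blocks,
		     const struct seg *seg, int segs, unsigned char *out)
{
	int err = 0;

	/* Loop over segments, not over the logical block range. */
	for (int i = 0; !err && i < segs; i++) {
		for (unsigned j = 0; j < seg[i].count; j++, out += BLOCKSIZE) {
			if (seg[i].state & SEG_HOLE) {
				memset(out, 0, BLOCKSIZE);	/* zero fill buffer */
				continue;
			}
			unsigned block = seg[i].block + j;
			if (block >= disk_blocks) {
				err = -1;	/* stands in for a failed diskread */
				break;
			}
			memcpy(out, disk + block * BLOCKSIZE, BLOCKSIZE);
		}
	}
	return err;
}
```

Bounding the loop by the segment count means a short result from the segment lookup cannot cause reads of blocks that were never mapped, which seems to be what produced the "failed to read block 0" report.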
2008-12-24 22:50 anyway, insert_node would be fine, if it insert new to current next, then update next 2008-12-24 22:50 that is how filemap uses it 2008-12-24 22:51 in the fs implementation itself... 2008-12-24 22:51 ok fine 2008-12-24 22:51 in fs deduplication? 2008-12-24 22:51 interesting 2008-12-24 22:52 do you suggest implementation in user space and then later moving to kernel... 2008-12-24 22:52 or 2008-12-24 22:52 kushal, definitely 2008-12-24 22:52 start with the kernel code itself... 2008-12-24 22:52 you will get in done in 1/4 the time, including the kernel version 2008-12-24 22:52 ok 2008-12-24 22:53 kernel work is really slow & tedious, right hirofumi? 2008-12-24 22:53 especially trying new algorithms 2008-12-24 22:53 most of kernel work is about using apis 2008-12-24 22:54 it can be 2008-12-24 22:54 here's a very crude design of what we are thinking... 2008-12-24 22:54 if it is really new things 2008-12-24 22:55 we plan to add an abstraction of a bucket... 2008-12-24 22:56 add_child(parent, cursor->path[depth].next++, childblock, childkey); 2008-12-24 22:56 makes sense? 2008-12-24 22:56 I guess update after add_child() 2008-12-24 22:56 a bucket is just a collection of SHA-1 hash values of blocks and corresponding block numbers... 2008-12-24 22:57 ah 2008-12-24 22:57 a btree stores the mapping between the hash values and the bucket numbers... 2008-12-24 22:58 the hash values are used as keys in the btree... 2008-12-24 22:59 kushal, right 2008-12-24 22:59 another hash tree 2008-12-24 23:00 the buckets would maintain the locality by entries corresponding to blocks of the same locale 2008-12-24 23:00 hirofumi, yes, I was just suggesting a fix 2008-12-24 23:00 not a clean fix ;) 2008-12-24 23:00 or same file... 
2008-12-24 23:00 maybe, it works 2008-12-24 23:01 well, I'm not reading this fully yet 2008-12-24 23:02 insert_node is a mess, sorry 2008-12-24 23:02 at least it is a fairly short mess 2008-12-24 23:02 on a new file write, the hash values of blocks will be computed and only entries for the new blocks will be added... 2008-12-24 23:03 kusahl, sounds right 2008-12-24 23:04 kuahsl, ok so you have an extra indexing layer on top of the hash buckets, to try to avoid thrashing cache 2008-12-24 23:04 ordered mode is not deduplication compatible? 2008-12-24 23:05 yes 2008-12-24 23:05 i see 2008-12-24 23:05 to avoid thrashing cache... 2008-12-24 23:05 ACTION thinks about ordered mode 2008-12-24 23:05 so, it will fork buffer always? 2008-12-24 23:05 it will do "fork" 2008-12-24 23:06 we need to figure out a way to avoid making this update expensive... 2008-12-24 23:07 it would be need to work on vm for mmap 2008-12-24 23:07 it meant, e.g. ->page_mkwrite 2008-12-24 23:08 that doesn't affect filesystem level 2008-12-24 23:08 mmap? 2008-12-24 23:09 right, it eventually flushes to disk and the dedup happens there 2008-12-24 23:09 yes... 2008-12-24 23:09 so, buffer to write to disk is separeted from mmap 2008-12-24 23:10 yes 2008-12-24 23:10 so, it will do "fork" always? 2008-12-24 23:10 I think you mean what I call redirect 2008-12-24 23:10 redirect means assigning a new physical block location 2008-12-24 23:11 no 2008-12-24 23:11 ok, so you mean making a copy of a page cache page? 
2008-12-24 23:11 mmap can change data in middle of write 2008-12-24 23:11 yes 2008-12-24 23:12 ok, yes, good point 2008-12-24 23:12 but, deduplication needs stable data 2008-12-24 23:12 so yes, it has to fork always 2008-12-24 23:12 sometimes it takes me a while to understand :) 2008-12-24 23:13 the alternative is to make the page r/o 2008-12-24 23:13 much harder 2008-12-24 23:13 maybe worth doing, 3 years from now 2008-12-24 23:13 yes, it is ->page_mkwrie 2008-12-24 23:13 page_mkwrite 2008-12-24 23:14 I don't think that copying a page out of page cache is a big expense,compared to doing the btree lookups and updates 2008-12-24 23:15 maybe, lookup would be fast 2008-12-24 23:16 because it would not dirty memory cache 2008-12-24 23:16 probably, update is yes 2008-12-24 23:17 I would say, don't worry too much about performance at first 2008-12-24 23:17 yes 2008-12-24 23:17 get correct results and measure compression ratios 2008-12-24 23:17 ok 2008-12-24 23:17 and for the same reason, keep your hash table as simple as possible 2008-12-24 23:18 and I think ordered mode should be ignored for now 2008-12-24 23:18 currently we are planning on using btree with hash values as keys... 2008-12-24 23:18 any other suggestions?/ 2008-12-24 23:19 the general term is content addressable memory 2008-12-24 23:19 there is a lot written about it 2008-12-24 23:20 -!- RazvanM(~RazvanM@96.234.233.172) has joined #tux3 2008-12-24 23:23 isn't CAM a separate hardware by itself?/ 2008-12-24 23:30 btw... Merry Christmas 2008-12-24 23:38 thankyou 2008-12-24 23:38 and a very merry Christmas to you 2008-12-24 23:39 thanks... 2008-12-24 23:39 hirofumi, I a thinking of offering a patch set on Christmas day, bugs and all 2008-12-24 23:42 patch set? 
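[Editor's sketch] The advice above is to keep the hash table as simple as possible, get correct results first, and measure compression ratios later. The sketch below is one very small way to do that in user space: a fixed-size open-addressed table keyed by a content hash. FNV-1a is used only as a stand-in so the example is self-contained; the design discussed in the log uses SHA-1 and a btree of buckets. The names struct dedup and dedup_write, and all sizes, are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { BLOCKSIZE = 16, TABLE_SLOTS = 64, MAX_BLOCKS = 32 };	/* toy sizes */

struct dedup {
	unsigned char store[MAX_BLOCKS][BLOCKSIZE];	/* "physical" blocks */
	unsigned stored;				/* blocks in use */
	struct {
		uint64_t hash;
		unsigned block;
		int used;
	} slot[TABLE_SLOTS];
};

/* FNV-1a, a stand-in for the SHA-1 mentioned in the discussion. */
static uint64_t fnv1a(const unsigned char *data, size_t len)
{
	uint64_t hash = 14695981039346656037ULL;
	for (size_t i = 0; i < len; i++) {
		hash ^= data[i];
		hash *= 1099511628211ULL;
	}
	return hash;
}

/*
 * Store one block, or find an identical block already stored.
 * Returns the physical block number, or -1 if the table or store is full.
 * The data is copied in, so later changes by the caller cannot alter it.
 */
static int dedup_write(struct dedup *d, const unsigned char *data)
{
	uint64_t hash = fnv1a(data, BLOCKSIZE);

	for (unsigned probe = 0; probe < TABLE_SLOTS; probe++) {
		unsigned i = (unsigned)((hash + probe) % TABLE_SLOTS);
		if (!d->slot[i].used) {
			if (d->stored == MAX_BLOCKS)
				return -1;
			memcpy(d->store[d->stored], data, BLOCKSIZE);
			d->slot[i].hash = hash;
			d->slot[i].block = d->stored;
			d->slot[i].used = 1;
			return (int)d->stored++;
		}
		/* Compare contents too, since a weak hash can collide. */
		if (d->slot[i].hash == hash &&
		    !memcmp(d->store[d->slot[i].block], data, BLOCKSIZE))
			return (int)d->slot[i].block;
	}
	return -1;
}
```

The copy into the store plays the role of the "fork" discussed above: the table hashes and compares a stable copy rather than a page that mmap could still be changing.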
2008-12-24 23:42 tux3 kernel 2008-12-24 23:42 ah 2008-12-24 23:42 sounds good 2008-12-24 23:43 I know it is not ready, but it is doing something interesting 2008-12-24 23:43 making it a long way through fsx 2008-12-24 23:43 no atomic commit, but hey, there has to be something fun left to do 2008-12-24 23:43 we'll post a design doc on the mailing list once we have a concrete design ready... 2008-12-24 23:44 kushal, sure, "post early, post often" 2008-12-24 23:44 good 2008-12-24 23:44 ok... thanks for the help... 2008-12-24 23:44 hirofumi, it can't be all one patch, so... some logical way of breaking it up 2008-12-24 23:45 see u later... 2008-12-24 23:45 see you 2008-12-24 23:45 one patch per file? 2008-12-24 23:45 post to lkml? 2008-12-24 23:45 yes 2008-12-24 23:46 put the big patch somewhere, and write url of it? 2008-12-24 23:46 have done that before 2008-12-24 23:46 yes 2008-12-24 23:46 ok 2008-12-24 23:46 good patch set is in repo 2008-12-24 23:47 so, maybe a big patch is useful for someone 2008-12-24 23:48 we should start review as soon as atomic commit works, I think 2008-12-24 23:48 then we want to post it as a patch set 2008-12-24 23:48 patch as url is not popular for review 2008-12-24 23:50 if it is for review, I think it should be separated logically 2008-12-24 23:51 separated with logical change? 2008-12-24 23:52 ah, full tux3 review? 
2008-12-24 23:53 I was thinking about it 2008-12-24 23:53 I know there is a lot of stuff still to do 2008-12-24 23:53 for me, one big patch is still good 2008-12-24 23:54 well, I don't review sparated earh file 2008-12-24 23:54 sure, a review comment can be "break it up" 2008-12-24 23:54 and one big patch too 2008-12-24 23:55 if it is big, I just pull from repo 2008-12-24 23:55 but, it is in my case 2008-12-24 23:55 the patch is about 150k 2008-12-24 23:56 well, separeted each file is most bad to review for me 2008-12-24 23:56 right, it makes it harder to search 2008-12-24 23:56 yes 2008-12-24 23:57 I forget lkml size limit of email 2008-12-24 23:57 I was just thinking about that 2008-12-24 23:59 SubmittingPatches says 40kb 2008-12-25 00:00 Please keep within the 100000 character limit. 2008-12-25 00:00 ok, that settles that 2008-12-25 00:00 url to patch 2008-12-25 00:01 with description of what to expect there 2008-12-25 00:01 yes 2008-12-25 00:01 entire filesystems have been posted to lkml in the past 2008-12-25 00:01 but if somebody wants that, they can ask for it 2008-12-25 00:02 we have snapshot? 2008-12-25 00:02 nightly 2008-12-25 00:03 http://tux3.org/downloads/snapshots/ 2008-12-25 00:03 it may be good if someone don't want to pull 2008-12-25 00:03 http://tux3.org/patches/ 2008-12-25 00:03 so tomorrow will be -3 2008-12-25 00:04 I think it is good 2008-12-25 00:12 possible patch for the .next problem on tux3 ml 2008-12-25 00:12 very lightly tested, did you write a test case? 2008-12-25 00:12 for multiple insertions? 
2008-12-25 00:12 no 2008-12-25 00:13 I wrote cursor_check() 2008-12-25 00:13 however, it seems to check well 2008-12-25 00:13 however, it seems not to check well 2008-12-25 00:13 good, well btree needs some unit tests 2008-12-25 00:13 yes 2008-12-25 00:14 it has a little bit now 2008-12-25 00:14 the uleaf stuff 2008-12-25 00:14 yes 2008-12-25 00:14 it tests tree_expand and probe 2008-12-25 00:15 there is no filemap usage 2008-12-25 00:16 filemap tests can also exercise btree, a little more indirectly 2008-12-25 00:16 yes 2008-12-25 00:16 the drawback of uleaf is, it is something else to maintain 2008-12-25 00:16 and the advantage is, it allows tests to exercise the btree more directly 2008-12-25 00:18 wait 2008-12-25 00:18 find_segs() can read only one dleaf 2008-12-25 00:19 and result segs is passed to fill_segs() 2008-12-25 00:19 do we call insert_node twice? 2008-12-25 00:20 it is possible 2008-12-25 00:20 if insert_node is called, it allocates new_leaf() 2008-12-25 00:20 but not with current get_block 2008-12-25 00:20 so, all segs can save? 2008-12-25 00:21 yes 2008-12-25 00:21 so, insert_node is not called twice? 2008-12-25 00:22 not until we add code to generate multiple extents for large writes 2008-12-25 00:22 but it's true 2008-12-25 00:22 it is hard to call insert_node twice 2008-12-25 00:22 so your bug was something else? 2008-12-25 00:22 my bug? 2008-12-25 00:22 I thought you hit a but in insert_node 2008-12-25 00:23 that bug is from review 2008-12-25 00:23 a bug 2008-12-25 00:23 ah 2008-12-25 00:23 theoretical bug 2008-12-25 00:23 yes 2008-12-25 00:24 it is better to fix, however it can not be right now 2008-12-25 00:24 we can be sure right now that if a new leaf is created in the segment write loop, that the tail can fit into that leaf without needing another new leaf 2008-12-25 00:24 well, I'll see insert_node more though 2008-12-25 00:25 ah 2008-12-25 00:25 does newroot->entries + 2 look right? 
2008-12-25 00:26 I'm not reading the around of it yet 2008-12-25 00:26 um.. 2008-12-25 00:27 if we create new leaf, it should run on merge path 2008-12-25 00:27 run on merge path? 2008-12-25 00:28 it should call dleaf_merge 2008-12-25 00:28 ah 2008-12-25 00:28 not insert_node 2008-12-25 00:28 yes 2008-12-25 00:29 it is why entries+2? 2008-12-25 00:29 points after old root and new block 2008-12-25 00:29 I guess it can be ->entries+1 2008-12-25 00:30 newroot->entries[1].block <- should point after this 2008-12-25 00:31 we have no way to test the correctness of this right now 2008-12-25 00:31 childkey is changed if bnode split was happened? 2008-12-25 00:33 in fill_segs? 2008-12-25 00:33 in insert_node 2008-12-25 00:34 childkey = newkey; 2008-12-25 00:34 yes 2008-12-25 00:35 inseresting cursor->path[] can be on btree->root.block side? 2008-12-25 00:35 ? 2008-12-25 00:36 ah 2008-12-25 00:36 I can't see why next should point entries+2 2008-12-25 00:36 it should exactly satisfy level_finished 2008-12-25 00:37 level_finished(cursor, 0) should return 1 2008-12-25 00:37 why? 2008-12-25 00:38 theoretically, you could insert a node, then advance to the next 2008-12-25 00:38 yes 2008-12-25 00:39 say the new node was inserted on the extreme right of the tree, then every level should be in level_finished state 2008-12-25 00:39 so, I thought if caller may still on entries[0] side 2008-12-25 00:39 if it was left? 2008-12-25 00:40 if it is positioned on the extreme left? 2008-12-25 00:40 yes 2008-12-25 00:40 I don't think it can be 2008-12-25 00:40 even in tree_chop 2008-12-25 00:41 i see 2008-12-25 00:41 I feel like you are going to ask a lot of questions, then rewrite all the btree operations just like dwalk ;) 2008-12-25 00:41 :) 2008-12-25 00:42 the big design decision is whether to leave .next positioned after the current element or on it 2008-12-25 00:42 well, btree stuff seems insert_node only 2008-12-25 00:42 on it 2008-12-25 00:42 ? 
2008-12-25 00:43 it should be next always? 2008-12-25 00:43 whether .next should point at the current element in the path, or at the one after, I chose after 2008-12-25 00:43 yes 2008-12-25 00:43 there is a lot of code that depends on that 2008-12-25 00:43 ah 2008-12-25 00:43 yes 2008-12-25 00:44 whether that choice was right or wrong is a good question 2008-12-25 00:44 insert_node seems only one of vaiolation 2008-12-25 00:44 so, does + 2 still seem wrong? 2008-12-25 00:45 sorry, I'm not sure 2008-12-25 00:45 ok, well here is a way to think about it 2008-12-25 00:45 the cursor should be the same as if you did a new probe in the tree 2008-12-25 00:45 in fact,that is a good way to check it 2008-12-25 00:46 I can't still see why cursor is on entries[0] side 2008-12-25 00:46 i.e. next is entries+1 2008-12-25 00:46 yes 2008-12-25 00:47 maybe, it is problem of me 2008-12-25 00:48 well I think we can write a self-check: after doing insert_node(..key...), do a probe with a new cursor and compare the paths 2008-12-25 00:48 this is a very good self check I think 2008-12-25 00:49 ah 2008-12-25 00:49 good test 2008-12-25 01:00 ok 2008-12-25 01:01 merry christmas 2008-12-25 01:01 merry chrismas 2008-12-25 01:02 if next is small than parent->entries+half, it can be entries+1? 2008-12-25 01:06 I'll try to write test case 2008-12-25 01:27 hirofumi, I think I finally understood your question 2008-12-25 01:27 I've wrote the test of it 2008-12-25 01:27 it is about the possible results of splits in filemap 2008-12-25 01:27 and posted to tux3-ml 2008-12-25 01:29 that is with my patch applied? 2008-12-25 01:29 e.g. talk of dleaf_merge? 
2008-12-25 01:29 not yet 2008-12-25 01:29 the test looks fine 2008-12-25 01:30 I think the test will fail about entries+2 2008-12-25 01:30 and then I will be happy :) 2008-12-25 01:30 :) 2008-12-25 01:30 because it is tested and not just theoretical 2008-12-25 01:31 so far we have two kinds of btree, there will be four kinds after a little while 2008-12-25 01:31 each with different patterns of splitting, etc 2008-12-25 01:32 i see 2008-12-25 01:32 main: Failed assertion "cursor->path[i].next == cursor2->path[i].next"! 2008-12-25 01:32 it failed 2008-12-25 01:32 :) 2008-12-25 01:32 well, good 2008-12-25 01:33 it would be nice to know I 2008-12-25 01:33 i 2008-12-25 01:34 i==0 2008-12-25 01:34 root bnode has (key, pos), (0, 0) (100, 1) (101, 2) 2008-12-25 01:35 entries_per_node == 3 2008-12-25 01:35 it tries to insert key==1 after (0,0) 2008-12-25 01:35 ok 2008-12-25 01:37 btw, cursor_check() is crap 2008-12-25 01:39 where is it? 2008-12-25 01:39 it is in test patch 2008-12-25 01:40 in my test patch 2008-12-25 01:40 I tried to write something before, however it didn't help 2008-12-25 01:42 I'll remove it when I post this unit test actually 2008-12-25 01:43 this is a really powerful self check 2008-12-25 01:43 however, it can be wrong itself 2008-12-25 01:43 :) 2008-12-25 01:43 is it? 
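The self-check discussed above (after an insert, probe with a fresh cursor and compare positions) can be sketched on a toy, single-node model. This is only an illustration under stated assumptions: toy_node, toy_cursor, toy_probe and toy_insert are hypothetical names, not Tux3's bnode or insert_node code. The only things carried over from the chat are the convention that .next points after the current entry and the scenario of keys 0, 100, 101 with key 1 inserted after key 0.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical toy node: sorted keys in a fixed array (not Tux3's bnode). */
enum { TOY_MAX = 16 };
struct toy_node { int count; unsigned keys[TOY_MAX]; };

/* Toy cursor: next indexes the entry AFTER the current one. */
struct toy_cursor { struct toy_node *node; int next; };

/* Position the cursor on the last entry whose key is <= key. */
static void toy_probe(struct toy_cursor *cursor, struct toy_node *node, unsigned key)
{
	int i = 0;

	while (i < node->count && node->keys[i] <= key)
		i++;
	cursor->node = node;
	cursor->next = i;
}

/* Insert key right after the current entry, then move next forward by step. */
static void toy_insert(struct toy_cursor *cursor, unsigned key, int step)
{
	struct toy_node *node = cursor->node;
	int at = cursor->next;

	assert(node->count < TOY_MAX);
	memmove(&node->keys[at + 1], &node->keys[at],
		(node->count - at) * sizeof(node->keys[0]));
	node->keys[at] = key;
	node->count++;
	cursor->next = at + step;
}

/* The self-check: the cursor must equal a fresh probe for the same key. */
static int toy_cursor_matches_probe(const struct toy_cursor *cursor, unsigned key)
{
	struct toy_cursor fresh;

	toy_probe(&fresh, cursor->node, key);
	return fresh.next == cursor->next;
}

/* Keys 0, 100, 101; insert key 1 after key 0; report whether the check passes. */
static int toy_selfcheck(int step)
{
	struct toy_node node = { .count = 3, .keys = { 0, 100, 101 } };
	struct toy_cursor cursor;

	toy_probe(&cursor, &node, 1);	/* current entry is key 0, next == 1 */
	toy_insert(&cursor, 1, step);
	return toy_cursor_matches_probe(&cursor, 1);
}
```

In this toy, toy_selfcheck(1) passes and toy_selfcheck(2) fails, so an off-by-one in the cursor position after an insert shows up as a tested failure rather than a theoretical one.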
2008-12-25 01:43 I'm trying it now 2008-12-25 01:43 cursor_check itself can be wrong 2008-12-25 01:46 oh, cursor_check may work 2008-12-25 01:46 it detected the insert_node problem 2008-12-25 01:52 for (int i = 0; i < cursor->btree->root.depth; i++) { 2008-12-25 01:52 instead of cursor->len 2008-12-25 01:52 however, it may not have all depth 2008-12-25 01:53 yet 2008-12-25 01:53 it will after any insert_node 2008-12-25 01:54 after insert_node, yes 2008-12-25 01:55 ok 2008-12-25 01:55 so, add assert(cursor->len == depth + 1) 2008-12-25 01:57 static void cursor_check(struct cursor *cursor) 2008-12-25 01:57 { 2008-12-25 01:57 struct btree *btree = cursor->btree; 2008-12-25 01:57 block_t block = btree->root.block; 2008-12-25 01:57 tuxkey_t key = 0; 2008-12-25 01:57 assert(cursor->len == btree->depth + 1); 2008-12-25 01:57 for (int i = 0; i < btree->depth; i++) { 2008-12-25 01:57 assert(bufindex(cursor->path[i].buffer) == block); 2008-12-25 01:57 assert(from_be_u64((cursor->path[i].next - 1)->key) >= key); 2008-12-25 01:57 block = from_be_u64((cursor->path[i].next - 1)->block); 2008-12-25 01:57 key = from_be_u64((cursor->path[i].next - 1)->key); 2008-12-25 01:57 } 2008-12-25 01:57 } 2008-12-25 01:57 um.. 2008-12-25 01:58 ok 2008-12-25 01:58 2008-12-25 01:58 /* cursor should have all depth */ 2008-12-25 01:59 so len is depth + 1 ? 2008-12-25 01:59 after a probe? 2008-12-25 01:59 yes 2008-12-25 01:59 it is including leaf 2008-12-25 02:00 do we ever have len < depth + 1, except in probe? 2008-12-25 02:01 there may be posiible 2008-12-25 02:01 for (int i = 0; i < cursor->btree->root.depth; i++) { 2008-12-25 02:01 I thought this meat 2008-12-25 02:01 /* cursor should have all depth */ 2008-12-25 02:04 ah 2008-12-25 02:06 btw, tree_chop can be len < depth + 1 2008-12-25 02:08 ah, no 2008-12-25 02:08 tree_chop is special 2008-12-25 02:09 um.. 
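The invariants that cursor_check() walks (each level's buffer is the block named by the parent's entry at next - 1, keys never decrease on the way down, and len is depth + 1) can be tried out on a small in-memory model. The sketch below is an assumption-laden toy: toy_bnode, toy_path and toy_path_ok are hypothetical, blocks are plain array indexes, and the final leaf comparison is an addition that the posted cursor_check() does not make.

```c
#include <assert.h>

enum { TOY_FANOUT = 4, TOY_DEPTH = 2 };

struct toy_entry { unsigned key; int block; };
struct toy_bnode { int count; struct toy_entry entries[TOY_FANOUT]; };
struct toy_level { int block; int next; };	/* next: one past the current entry */
struct toy_path { int len; struct toy_level level[TOY_DEPTH + 1]; };

/* Return 1 if the path is consistent with the index nodes, else 0. */
static int toy_path_ok(const struct toy_bnode *nodes, int root,
		       const struct toy_path *path)
{
	int block = root;
	unsigned key = 0;

	if (path->len != TOY_DEPTH + 1)	/* path must include the leaf level */
		return 0;
	for (int i = 0; i < TOY_DEPTH; i++) {
		const struct toy_level *at = &path->level[i];

		if (at->block != block)
			return 0;
		if (at->next < 1 || at->next > nodes[block].count)
			return 0;
		const struct toy_entry *entry = &nodes[block].entries[at->next - 1];
		if (entry->key < key)
			return 0;
		block = entry->block;
		key = entry->key;
	}
	/* Extra step (not in the original check): the leaf must match too. */
	return path->level[TOY_DEPTH].block == block;
}

/* Build a two-level index and a path that ends at the given leaf block. */
static int toy_path_demo(int leaf)
{
	/* block 0 is the root, blocks 1 and 2 are index nodes, 10..13 are leaves */
	struct toy_bnode nodes[3] = {
		{ 2, { { 0, 1 }, { 100, 2 } } },
		{ 2, { { 0, 10 }, { 50, 11 } } },
		{ 2, { { 100, 12 }, { 150, 13 } } },
	};
	struct toy_path path = { 3, { { 0, 2 }, { 2, 1 }, { leaf, 0 } } };

	return toy_path_ok(nodes, 0, &path);
}
```

With this data, a path ending at leaf 12 is accepted and one ending at leaf 13 is rejected, which is the kind of stale-leaf mismatch a leaf-level comparison would catch.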
2008-12-25 02:11 after advance, it may have cursor->len == 0 2008-12-25 02:12 only after the last advance 2008-12-25 02:12 yes 2008-12-25 02:12 probably want to get rid of ->len eventually 2008-12-25 02:12 and just use depth 2008-12-25 02:13 why? 2008-12-25 02:13 reduce redundant fields where possible 2008-12-25 02:13 um.. 2008-12-25 02:14 I thought we are going to use ->len more 2008-12-25 02:14 rename ->len to ->level 2008-12-25 02:14 right 2008-12-25 02:14 well 2008-12-25 02:14 when we do cursor based locking 2008-12-25 02:14 we need to know more than just depth 2008-12-25 02:14 yes 2008-12-25 02:16 forget I said that ;) 2008-12-25 02:16 it works 2008-12-25 02:16 insert_node()? 2008-12-25 02:17 ->len 2008-12-25 02:17 yes 2008-12-25 02:17 can be ->depth any time we feel like a big spelling patch 2008-12-25 02:17 no 2008-12-25 02:17 it is current level 2008-12-25 02:18 I misspoke 2008-12-25 02:18 can be ->level any time we feel like a big spelling patch 2008-12-25 02:18 yes 2008-12-25 02:19 um.. 2008-12-25 02:20 ok 2008-12-25 02:20 yes 2008-12-25 02:20 level_root_add(cursor, newbuf, newroot->entries + 1); <- passes your test 2008-12-25 02:20 for (int i = 0; i < cursor->len; i++) { <- but this asserts with buffers not equal on i = 2 2008-12-25 02:21 I think we have to check that cursor is on which node 2008-12-25 02:21 right, the first probe has to be to an existing key 2008-12-25 02:21 and the second to the same key 2008-12-25 02:22 for (int i = 0; i < cursor->len; i++) { 2008-12-25 02:22 if (!cursor->path[i].next) 2008-12-25 02:22 break; 2008-12-25 02:22 ? 2008-12-25 02:22 this? 2008-12-25 02:23 newroot->entries + 1, I think it can be +2 or +1 2008-12-25 02:23 not what I meant 2008-12-25 02:24 true 2008-12-25 02:24 .next for new root has to come from the original root 2008-12-25 02:24 i == 2 is leaf? 
2008-12-25 02:25 !next is assuming it is leaf 2008-12-25 02:25 yes 2008-12-25 02:29 ah 2008-12-25 02:29 we have to change buffer too 2008-12-25 02:30 duh 2008-12-25 02:30 this sounds like we need more time 2008-12-25 02:31 well we do change the buffer 2008-12-25 02:31 the root 2008-12-25 02:31 but not... 2008-12-25 02:31 yes 2008-12-25 02:31 the leaf? 2008-12-25 02:31 and bnode 2008-12-25 02:31 leaf will be changed by caller 2008-12-25 02:33 ah, no 2008-12-25 02:33 we don't need... 2008-12-25 02:33 ah 2008-12-25 02:34 we need to change buffer 2008-12-25 02:34 if (at->next > parent->entries + half) { 2008-12-25 02:34 it set parent = newnode 2008-12-25 02:34 cursor is on newnode side... 2008-12-25 02:44 we should probably pass buffer, not bufindex(buffer) to insert_node 2008-12-25 02:45 what is it for? 2008-12-25 02:45 more regular 2008-12-25 02:46 childblock = bufindex(newbuf); <- this suggests childblock variable is bogus 2008-12-25 02:47 I was puzzling over why your test asserts only on the leaves not being equal 2008-12-25 02:47 it is because insert_node does not update the leaf in the path, as you said earlier I think 2008-12-25 02:48 and the reason it does not, is we don't pass it the leaf, just the index of the leaf buffer 2008-12-25 02:48 i see 2008-12-25 02:48 I think it is always right to pass the buffer instead, and then maybe that hard to read code can get a little simpler 2008-12-25 02:48 it is a small change 2008-12-25 02:49 btree_leaf_split() 2008-12-25 02:49 it alread have leaf in cursor 2008-12-25 02:50 it is why does't change leaf in cursor? 2008-12-25 02:50 I guess 2008-12-25 02:52 I found "insert_node test" bug in user/btree.c 2008-12-25 02:52 it is calling release_cursor twice 2008-12-25 02:53 line 204 2008-12-25 02:53 it is not needed 2008-12-25 02:54 brelse_free: free block 8 still in use! <- this bug? 2008-12-25 02:54 maybe, no 2008-12-25 02:55 it was double free 2008-12-25 02:55 is it which file? 
2008-12-25 02:55 caller 2008-12-25 02:56 user/btree.c, I'm playing with it 2008-12-25 02:57 ah 2008-12-25 02:57 maybe, it is tree_chop 2008-12-25 02:57 yes 2008-12-25 02:57 buffer of new_leaf() is not released 2008-12-25 02:57 another bug 2008-12-25 02:58 it needs 2008-12-25 02:58 level_push 2008-12-25 02:58 too buggy 2008-12-25 02:59 well you are as obsessed with cleaning code as I am 2008-12-25 02:59 I must sleep now because my daughter will want to unwrap presents soon 2008-12-25 03:00 yes, good 2008-12-25 03:01 merry christmas 2008-12-25 03:02 glad tidings, seasons greetings, yuletide wishes, all that 2008-12-25 03:02 oyasumi 2008-12-25 03:02 oyasumi 2008-12-25 09:38 -!- pgquiles(~pgquiles@7.Red-217-125-199.dynamicIP.rima-tde.net) has joined #tux3 2008-12-25 12:32 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-12-25 13:13 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-12-25 13:54 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-25 15:49 going down because of keyboard issues... 2008-12-25 15:53 -!- flips(~phillips@phunq.net) has joined #tux3 2008-12-25 17:05 trying out my build instructions for tux3 2008-12-25 17:05 it's Tux3 for Christmas time 2008-12-25 17:13 mount -ttux3 /dev/ubdb /mnt 2008-12-25 17:13 <3>TUX3: invalid superblock [74757833dd081212]mount: wrong fs type, bad option, bad superblock on /dev/ubdb 2008-12-25 17:14 magic looks right 2008-12-25 17:24 huh, it was just an old build 2008-12-25 17:24 tux3 mounted on a 256G partition 2008-12-25 17:54 hi 2008-12-25 17:54 flips, there? 2008-12-25 17:54 hi 2008-12-25 17:54 in btree.c, there is free_block 2008-12-25 17:55 do you remember what is it? 2008-12-25 17:55 just a moment 2008-12-25 17:55 it is for tree_chop 2008-12-25 17:56 yes 2008-12-25 17:56 there is no reason not to implment it 2008-12-25 17:56 as bfree 2008-12-25 17:56 just call ops->btree()? 
2008-12-25 17:56 ops->bfree() 2008-12-25 17:56 yes 2008-12-25 17:56 ok 2008-12-25 17:57 nice leak ;) 2008-12-25 18:05 hirofumi, somethings I think about moving tux3/user/* up into tux3/*, then I think: 1) we often say user/ to be clear about which file we are talking about; 2) moving files around is disruptive; 3) it is not really broken 2008-12-25 18:05 s/somethings/sometimes/ 2008-12-25 18:06 hg rename? 2008-12-25 18:07 I was thinking about it 2008-12-25 18:07 but then I thought, /user/* is not a big problem 2008-12-25 18:07 no real urgency about this 2008-12-25 18:07 yes 2008-12-25 18:08 I think if we change user to src, it is normal 2008-12-25 18:08 ok, running through my tux3 build, boot and mount instructions 2008-12-25 18:08 ah, sure 2008-12-25 18:09 and we can put binaries in tux3/user 2008-12-25 18:09 that way I can write my instructions now, and they will not break 2008-12-25 18:10 actually, just the main binaries like tux3 and tux3fs 2008-12-25 18:10 and tux3graph 2008-12-25 18:11 s/tux3fs/tux3fuse/ 2008-12-25 18:11 just copy those from current tux3/src to tux3/usr? 2008-12-25 18:12 not current 2008-12-25 18:12 sure 2008-12-25 18:12 yes, it can do by Makefile 2008-12-25 18:15 made a tux3 filesystem on a real partition for the first time ever 2008-12-25 18:15 I know you are doing that all the time 2008-12-25 18:15 for me it was scary ;) 2008-12-25 18:16 I'm using file for kvm 2008-12-25 18:16 but, well, I have machine and partition for testing fs 2008-12-25 18:17 256 GB 2008-12-25 18:17 worked fine 2008-12-25 18:17 it is nice to have no surprises 2008-12-25 18:21 hirofumi, what is fair to say? something like: Tux3 now runs fsx.linux somewhat reliably, some smp locking still missing... 2008-12-25 18:21 fsx-linux 2008-12-25 18:21 fsx-linux is still not runing reliably yet 2008-12-25 18:22 maybe, truncate has problem of locking 2008-12-25 18:22 hirofumi, what is fair to say? 
something like: Tux3 now runs fsx.linux somewhat unreliably 2008-12-25 18:22 whoops 2008-12-25 18:22 "Tux3 now runs fsx.linux somewhat unreliably" 2008-12-25 18:22 that is better actually 2008-12-25 18:23 yes 2008-12-25 18:23 well, to add locking to truncate() is easy though 2008-12-25 18:24 right, and it will be added soon, but it is still good to know there is debugging left to do... so I can invite people to come do some 2008-12-25 18:24 i see 2008-12-25 18:25 hacking should not be a spectator sport 2008-12-25 18:26 build instructions... 2008-12-25 18:26 # Get a kernel tree: 2008-12-25 18:26 wget http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.26.5.tar.bz2 2008-12-25 18:26 tar -xjf linux-2.6.26.5.tar.bz2 2008-12-25 18:26 cd linux-2.6.26.5 2008-12-25 18:26 # Get the Christmas tux3 patch and patch the kernel: 2008-12-25 18:26 wget http://tux3.org/patches/tux3-2.6.26.5-3 2008-12-25 18:26 patch # Build linux with tux3: 2008-12-25 18:26 make defconfig 2008-12-25 18:26 make CONFIG_TUX3=y 2008-12-25 18:26 sudo make install 2008-12-25 18:26 # Get the Christmas tux3 userspace snapshot: 2008-12-25 18:26 wget http://tux3.org/downloads/snapshots/tux3-20081225.tar.gz 2008-12-25 18:26 tar -xzf tux3-20081225.tar.gz 2008-12-25 18:26 cd tux3/user 2008-12-25 18:26 make 2008-12-25 18:26 # make a tux3 filesystem 2008-12-25 18:26 sudo ./tux3 mkfs /dev/ 2008-12-25 18:26 Boot and mount! 
2008-12-25 18:26 cat /proc/filesystems | grep tux3 && mount /dev/ /mnt 2008-12-25 18:28 looks good 2008-12-25 18:28 now to make a post out of it 2008-12-25 18:28 should take about an hour 2008-12-25 18:28 was lots of work to write and try out those instructions 2008-12-25 18:28 btw, I think it is good now, insert_node() takes leafbuf and it does push/pop 2008-12-25 18:28 yes, I was thinking about that 2008-12-25 18:29 and split and work a little differently 2008-12-25 18:29 do the split, then advance is key is in higher block 2008-12-25 18:29 yes, it is things you said yesterday 2008-12-25 18:29 I say lots of things, and many of them are nonsense ;) 2008-12-25 18:30 that one seems to make sense 2008-12-25 18:30 yes 2008-12-25 18:30 the design gets more regular 2008-12-25 18:31 split needs to get_bh, if it already has leafbuf in cursor 2008-12-25 18:31 though 2008-12-25 18:32 with it, we can call insert_node as btree_insert_leafbug 2008-12-25 18:32 leafbuf 2008-12-25 18:37 or just btree_insert_leaf 2008-12-25 18:38 right now split is not used 2008-12-25 18:38 sorry, used in inode 2008-12-25 18:39 yes 2008-12-25 18:41 it is really understandable 2008-12-25 18:41 we don't need to pass btree 2008-12-25 18:41 it already has cursor 2008-12-25 18:41 right 2008-12-25 18:41 there are a lot of api cleanups we can do because of cursor->btree 2008-12-25 18:41 so, it will be btree_insert_leaf(cursor, key, leafbuf) 2008-12-25 18:41 :) 2008-12-25 18:42 high level 2008-12-25 18:42 looks like really good 2008-12-25 18:45 well, it looks like make_inode got broken some time back and will run out of inodes after one table block 2008-12-25 18:45 64 inodes 2008-12-25 18:46 entries_per_leaf==64? 
2008-12-25 18:47 no call to btree_leaf_split 2008-12-25 18:47 just runs out at the end of the first leaf 2008-12-25 18:48 store_attrs() calls tree_expand 2008-12-25 18:49 :) 2008-12-25 18:49 wow 2008-12-25 18:49 ok, that is right 2008-12-25 18:49 I forgot completely doing it that way 2008-12-25 18:49 needs a big fat comment 2008-12-25 18:50 it's pretty cool how it works 2008-12-25 18:50 it knows from a leaf not being there, that the inode number is available 2008-12-25 18:51 instead of having to scan the leaf 2008-12-25 18:51 it only scans a leaf when the leaf exists 2008-12-25 18:51 yes 2008-12-25 18:51 so in this sense, it is potentially more efficient than Ext2 in some cases, which scans a bitmap to find a free inode 2008-12-25 18:52 though ext2 will certain be beating us for a while yet 2008-12-25 18:52 because of our heavyweight semaphores and btree probes 2008-12-25 18:52 one issue is next_key() 2008-12-25 18:53 it returns -1 2008-12-25 18:53 if level_finished() is true 2008-12-25 18:54 however, maximum inum would not be -1 2008-12-25 18:54 true 2008-12-25 18:55 later, I think it would be limited by option 2008-12-25 18:55 needs a comment 2008-12-25 18:56 there is a little comment, FIXME 2008-12-25 18:57 good enough for today 2008-12-25 19:04 ok, insert_node fix is almost done 2008-12-25 19:46 hirofumi, when tux3 fails in fsx-linux, is it always an assert? 2008-12-25 19:47 or stranger? 
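The inode allocation idea described above (a missing leaf already proves the inode number is free, so a leaf is only scanned when it exists) can be sketched as follows. This is a minimal model, not make_inode: toy_ileaf, the 64-entries-per-leaf bitmask and toy_find_free_inum are all hypothetical stand-ins for the real inode table btree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TOY_PER_LEAF = 64, TOY_LEAVES = 4 };

/* Hypothetical leaf: bit i set means inum (leaf base + i) is in use. */
struct toy_ileaf { uint64_t used; };

/* Return the first free inum >= goal (goal >= 0), or -1 if none is left. */
static long toy_find_free_inum(struct toy_ileaf **leaves, long goal)
{
	for (long inum = goal; inum < (long)TOY_PER_LEAF * TOY_LEAVES; inum++) {
		const struct toy_ileaf *leaf = leaves[inum / TOY_PER_LEAF];

		if (!leaf)
			return inum;	/* no leaf here: free without any scan */
		if (!(leaf->used & (UINT64_C(1) << (inum % TOY_PER_LEAF))))
			return inum;	/* leaf exists: found a clear bit */
	}
	return -1;
}

/* Leaf 0 is completely full and leaf 1 does not exist yet. */
static long toy_demo_absent_leaf(void)
{
	struct toy_ileaf full = { ~UINT64_C(0) };
	struct toy_ileaf *leaves[TOY_LEAVES] = { &full, NULL, NULL, NULL };

	return toy_find_free_inum(leaves, 0);
}

/* Leaf 0 exists with inums 0..2 in use. */
static long toy_demo_partial_leaf(void)
{
	struct toy_ileaf partial = { 0x7 };
	struct toy_ileaf *leaves[TOY_LEAVES] = { &partial, NULL, NULL, NULL };

	return toy_find_free_inum(leaves, 0);
}
```

Returning -1 for "nothing free" mirrors the next_key() concern raised above: a sentinel like that only works if it can never be a valid inode number.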
2008-12-25 19:52 for now, it is assert 2008-12-25 19:53 that is pretty cool in itself 2008-12-25 19:53 yes 2008-12-25 19:53 I'd like to keep assert for a while 2008-12-25 19:53 at least, until after atomic commit 2008-12-25 19:53 Always keep, and only disable as a compile option 2008-12-25 19:54 convert some to run-time checks that are always in, to pick up corrupted disk data 2008-12-25 19:54 ext2/3 is loaded with those 2008-12-25 19:54 yes 2008-12-25 19:54 part of the reason for its reputation for reliability 2008-12-25 19:55 some sort of paranoia check like cursor_check 2008-12-25 19:55 compile option maybe good 2008-12-25 19:56 yes 2008-12-25 20:55 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-25 21:11 posted 2008-12-25 21:11 got to see some christmas lights with my kid now 2008-12-25 21:13 http://lkml.org/lkml/2008/12/26/1 <- Tux3 report: Tux3 for Christmas 2008-12-25 21:18 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-25 21:57 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-12-25 22:03 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-12-25 23:06 -!- RazvanM(~RazvanM@96.234.233.172) has joined #tux3 2008-12-25 23:15 forgot to link hirofumi's graphic 2008-12-25 23:16 well, that means another post 2008-12-25 23:16 need to update the Structure of Tux3 post 2008-12-25 23:57 hirofumi, there? 
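The split suggested above, between asserts that can be disabled at compile time and run-time checks that always stay in to catch corrupted disk data, might look roughly like this. It is a sketch under assumptions: TOY_PARANOIA, toy_assert, the magic value and both functions are made up for illustration and are not Tux3 or ext2/3 code.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical build switch: compile with -DTOY_PARANOIA=0 to drop the asserts. */
#ifndef TOY_PARANOIA
#define TOY_PARANOIA 1
#endif

#if TOY_PARANOIA
#define toy_assert(expr) do { \
	if (!(expr)) { \
		fprintf(stderr, "Failed assertion \"%s\"!\n", #expr); \
		abort(); \
	} } while (0)
#else
#define toy_assert(expr) ((void)0)
#endif

enum { TOY_MAGIC = 0x1eaf, TOY_ENTRIES_MAX = 64 };

/* Always-on check of data read from disk: report an error, never abort. */
static int toy_leaf_sane(unsigned magic, unsigned count)
{
	if (magic != TOY_MAGIC || count > TOY_ENTRIES_MAX)
		return -EIO;
	return 0;
}

/* In-memory invariant: a violation is a programming error, so assert. */
static unsigned toy_leaf_space(unsigned count)
{
	toy_assert(count <= TOY_ENTRIES_MAX);
	return TOY_ENTRIES_MAX - count;
}
```

The design choice is that bad on-disk contents produce an error code the caller can handle, while broken in-memory invariants stop the program during development.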
2008-12-26 03:42 hmm, I have to go change the license of all the tux3/kernel files to gpl v2 2008-12-26 03:42 around now 2008-12-26 03:42 but too late for today's patch, on the other hand anybody who would distribute a kernel with tux3 in it today would be certifiably crazy 2008-12-26 05:51 -!- war(war@liquidswords.org) has joined #tux3 2008-12-26 08:41 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-26 10:15 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-26 11:58 -!- RazvanM(~RazvanM@96.234.233.172) has joined #tux3 2008-12-26 12:01 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-12-26 14:05 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-26 14:10 -!- flips(~phillips@phunq.net) has joined #tux3 2008-12-26 14:31 there is an issue with make_inode: it finds an empty inode all right, but it does not reserve it 2008-12-26 14:32 reserve? 2008-12-26 14:32 sorry, store attribtues reserves it 2008-12-26 14:32 called from make_inode 2008-12-26 14:32 no problem 2008-12-26 14:32 good morning 2008-12-26 14:32 good morning 2008-12-26 14:33 maybe, I found tree_expand problem 2008-12-26 14:33 solved or not? 2008-12-26 14:33 not yet 2008-12-26 14:39 ok, it seems btree_leaf_split 2008-12-26 14:42 new, improved one, or old one? 2008-12-26 14:43 return at < leaf->count ? to_uleaf(into)->entries[0].key : key; 2008-12-26 14:43 uleaf_split 2008-12-26 14:43 it returns key depends on at 2008-12-26 14:43 oh :p 2008-12-26 14:43 it is pretty crude 2008-12-26 14:43 maybe, it should return dest always 2008-12-26 14:44 ACTION looks 2008-12-26 14:49 how will leaf_split know which leaf to leave in the path then? 2008-12-26 14:49 it is job of caller 2008-12-26 14:50 how will caller know which key dest starts with? 
2008-12-26 14:51 ah 2008-12-26 14:51 because ->leaf_split always returns it 2008-12-26 14:51 yes 2008-12-26 14:51 you are right 2008-12-26 14:51 I goofed 2008-12-26 14:52 btw, after btree_leaf_split(), cursor can be invalid 2008-12-26 14:53 if (key < newkey), interesting path is not next leaf 2008-12-26 14:56 I thought that was handled by if (key >= newkey) { level_pop... 2008-12-26 14:56 if (key >= newkey) is ok 2008-12-26 14:57 but if (key < newkey), leaf is old leaf 2008-12-26 14:57 -!- mlankhorst(~m@fw1.astro.rug.nl) has left #tux3 2008-12-26 14:57 however, insert_node will set path for new leaf 2008-12-26 14:58 we can say path is invalid after btree_leaf_split() though 2008-12-26 14:58 I thought we were going to fix that by having insert_node always set the path to old leaf, then we advance if key >= newkey 2008-12-26 14:59 path should remain valid, we might do further processing after the expand 2008-12-26 14:59 so, at->next++ patch is not used? 2008-12-26 15:00 I don't know if that is quite right 2008-12-26 15:00 that patch set path for new leaf 2008-12-26 15:01 right 2008-12-26 15:02 and we set path for new leaf in parents also 2008-12-26 15:02 with that patch 2008-12-26 15:02 yes 2008-12-26 15:02 but maybe the ++ is not needed if we leave path at old leaf and advance if needed 2008-12-26 15:02 and caller from filemap does an explicit advance 2008-12-26 15:03 I don't know what is best, I think probably explicit advance is better 2008-12-26 15:03 ah 2008-12-26 15:03 I haven't been thinking about this as much as yhou 2008-12-26 15:04 putting in more inode locking right now 2008-12-26 15:04 ok 2008-12-26 15:05 thinking would be good, I will post my inode locking patch, then think 2008-12-26 15:05 ok 2008-12-26 15:45 make tests caught a stupid oversight in my patch 2008-12-26 15:45 make tests == good 2008-12-26 15:49 hirofumi, inode locking patch on tux3 ml 2008-12-26 15:49 ok 2008-12-26 16:01 with quick look, it seems good 2008-12-26 16:02 one issue is 
release_cursor() 2008-12-26 16:02 if error on some funcs, it is already released 2008-12-26 16:03 it confuses me 2008-12-26 16:26 ok, I made patchset for tree_expand() test 2008-12-26 16:26 if inode lock patch was pushed to public, I'll merge the patchset 2008-12-26 16:35 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-12-26 17:17 -!- pgquiles(~pgquiles@7.Red-217-125-199.dynamicIP.rima-tde.net) has joined #tux3 2008-12-26 17:26 back 2008-12-26 17:26 yes 2008-12-26 17:26 I added keep flag to insert_node for now 2008-12-26 17:26 if keep == 1, it doesn't change current position 2008-12-26 17:26 hirofumi, right, releasing cursor needs clearer rules 2008-12-26 17:27 ok 2008-12-26 17:27 I'll add it to my todo 2008-12-26 17:28 for now, I have made all the errors that don't release the cursor branch to release: 2008-12-26 17:28 yes 2008-12-26 17:28 it is good 2008-12-26 17:28 for now 2008-12-26 17:30 a keep flag is better than having a bug 2008-12-26 17:30 thanks 2008-12-26 17:30 I'll push it 2008-12-26 17:31 an explanation of the bug in your commit comment? 2008-12-26 17:31 for keep? 2008-12-26 17:31 yes 2008-12-26 17:32 writing a post on smp locking now 2008-12-26 17:33 I hope with it, fsx-linux would pass all 2008-12-26 17:33 I'll commit the itable locking patch then 2008-12-26 17:34 I should wait it? 
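The split convention settled on above (the leaf split always returns the first key of the new leaf, and the caller advances only if key >= newkey) can be illustrated with a toy sorted-array leaf. toy_leaf, toy_leaf_split and toy_split_and_pick are hypothetical names, and the halving policy is an assumption; the real uleaf_split and insert_node differ in detail.

```c
#include <assert.h>
#include <string.h>

enum { TOY_LEAF_KEYS = 8 };
struct toy_leaf { int count; unsigned keys[TOY_LEAF_KEYS]; };

/* Move the upper half of from into into; always return into's first key. */
static unsigned toy_leaf_split(struct toy_leaf *from, struct toy_leaf *into)
{
	int keep = from->count / 2;

	assert(from->count >= 2);
	into->count = from->count - keep;
	memcpy(into->keys, from->keys + keep, into->count * sizeof(from->keys[0]));
	from->count = keep;
	return into->keys[0];
}

/* The caller, not the split, decides which leaf the key belongs to. */
static struct toy_leaf *toy_split_and_pick(struct toy_leaf *from,
					   struct toy_leaf *into, unsigned key)
{
	unsigned newkey = toy_leaf_split(from, into);

	return key >= newkey ? into : from;	/* "advance" only when needed */
}

/* Split { 10, 20, 30, 40 }; return 1 if key lands in the new leaf, else 0. */
static int toy_split_demo(unsigned key)
{
	struct toy_leaf from = { 4, { 10, 20, 30, 40 } };
	struct toy_leaf into = { 0, { 0 } };

	return toy_split_and_pick(&from, &into, key) == &into;
}
```

Here key 25 stays with the old leaf and key 35 moves to the new one, because the split reports 30 as the first key of the new leaf regardless of where the insert position was.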
2008-12-26 17:34 I'll do it right now 2008-12-26 17:34 ok 2008-12-26 17:36 pushed 2008-12-26 17:36 ok, thanks 2008-12-26 17:39 I think directory operations are all serialized by vfs using parent->i_mutex 2008-12-26 17:39 directory flushing will call get_segs, which is serialized 2008-12-26 17:40 yes 2008-12-26 17:41 xattrs need some locking I think, both the xcache and the atomtable 2008-12-26 17:43 yes 2008-12-26 17:43 ah 2008-12-26 17:43 it depends on caller 2008-12-26 17:44 maybe, yes 2008-12-26 17:50 static-http://userweb.kernel.org/~hirofumi/tux3/ 2008-12-26 17:51 email is posted 2008-12-26 17:51 reading 2008-12-26 17:54 lots of changesets 2008-12-26 17:55 yes 2008-12-26 17:55 add cursor_check() 2008-12-26 17:55 ah, that is what distclean is for 2008-12-26 17:56 sorry, I use *.orig for develop 2008-12-26 17:57 and to pass cursor_check(), change advance() 2008-12-26 17:57 then, fix insert_node() and related stuff 2008-12-26 18:00 cursor_check looks good 2008-12-26 18:01 related to it, we may want to btree_check() 2008-12-26 18:01 it would check bnode corruption 2008-12-26 18:01 yes 2008-12-26 18:02 however, it may be too slow 2008-12-26 18:02 maybe bnode_check 2008-12-26 18:03 I imaged, btree_check is order of key on across bnode 2008-12-26 18:03 so, maybe, bnode_check and btree_check 2008-12-26 18:04 with uleaf_split bug, key in two bnode was not right order 2008-12-26 18:05 right 2008-12-26 18:06 btree_check can be written as a modification of advance 2008-12-26 18:06 yes 2008-12-26 18:07 show_range helped me well though 2008-12-26 18:10 newbuf->count++; <- how does this work? 
2008-12-26 18:10 I thought buffer_head is b_count 2008-12-26 18:10 yes 2008-12-26 18:11 kernel has get_bh already 2008-12-26 18:11 ah 2008-12-26 18:11 next patch will fix it 2008-12-26 18:11 ok 2008-12-26 18:16 looking at the bimap leak 2008-12-26 18:19 yes 2008-12-26 18:19 /* +8 for new depth */ <- because depth can increase more than one 2008-12-26 18:19 yes 2008-12-26 18:19 and 8 is assumed the maximum depth? 2008-12-26 18:19 8 is big depth than needed 2008-12-26 18:19 yes 2008-12-26 18:20 just a random value 2008-12-26 18:20 ah, it's in user/btree 2008-12-26 18:20 ok :) 2008-12-26 18:20 :) 2008-12-26 18:25 pulling 2008-12-26 18:25 thanks 2008-12-26 18:28 whoops 2008-12-26 18:29 I corrected one posisiion to position, and then immediately noticed the other one ;) 2008-12-26 18:29 whoops 2008-12-26 18:30 I don't care about spelling on comments, they don't oops :) 2008-12-26 18:30 my pos is random, posision or position 2008-12-26 18:31 it should realy be posishen 2008-12-26 18:31 english is stupid ;) 2008-12-26 18:31 :) 2008-12-26 18:35 diff -puN user/kernel/btree.c~position user/kernel/btree.c 2008-12-26 18:35 --- tux3/user/kernel/btree.c~position 2008-12-27 11:34:39.000000000 +0900 2008-12-26 18:35 +++ tux3-hirofumi/user/kernel/btree.c 2008-12-27 11:34:53.000000000 +0900 2008-12-26 18:35 @@ -521,7 +521,7 @@ static void add_child(struct bnode *node 2008-12-26 18:35 2008-12-26 18:35 /* 2008-12-26 18:35 * Insert new leaf to next position of cursor. 2008-12-26 18:35 - * keep == 1: keep current cursor posision. 2008-12-26 18:35 + * keep == 1: keep current cursor position. 2008-12-26 18:35 * keep == 0, set cursor position to new leaf. 
2008-12-26 18:35 */ 2008-12-26 18:35 static int insert_leaf(struct cursor *cursor, tuxkey_t childkey, struct buffer_head *leafbuf, int keep) 2008-12-26 18:35 @@ -601,7 +601,7 @@ eek: 2008-12-26 18:35 return -ENOMEM; 2008-12-26 18:35 } 2008-12-26 18:35 2008-12-26 18:35 -/* Insert new leaf to next posision of cursor, then set cursor to new leaf */ 2008-12-26 18:35 +/* Insert new leaf to next position of cursor, then set cursor to new leaf */ 2008-12-26 18:35 int btree_insert_leaf(struct cursor *cursor, tuxkey_t key, struct buffer_head *leafbuf) 2008-12-26 18:35 { 2008-12-26 18:35 return insert_leaf(cursor, key, leafbuf, 0); 2008-12-26 18:35 there were another two posisions 2008-12-26 18:35 could you fix those too? 2008-12-26 18:36 ok 2008-12-26 18:36 thanks 2008-12-26 18:36 in insert_leaf, why not: insert(...0); if (key >= newkey) advance 2008-12-26 18:37 it was already there 2008-12-26 18:37 and it releases buffer at once, then re-read buffer again 2008-12-26 18:38 good reason 2008-12-26 18:38 ok 2008-12-26 18:39 and duh 2008-12-26 18:41 my commit count is improving with all the spelling corrections :) 2008-12-26 18:41 white space patches rule 2008-12-26 18:42 :) 2008-12-26 18:43 hg has strip space option? 2008-12-26 18:43 don't know 2008-12-26 18:44 looks like import doesn't have it 2008-12-26 18:45 btw, my inode table locking is completely untested 2008-12-26 18:45 ok 2008-12-26 18:46 I'll run fsx-linux several times 2008-12-26 18:47 ah, fix-linux doesn't use it 2008-12-26 18:47 for allocation... maybe bitmap->i_mutex? 2008-12-26 18:48 sounds good 2008-12-26 18:52 fsx-linux doesn't have parallel file creates? 
2008-12-26 18:52 yes 2008-12-26 18:52 it tests file data 2008-12-26 18:52 truncate/read/write/mmap/dio/aio 2008-12-26 18:52 lustre guys had a filesystem test called racer 2008-12-26 18:53 yes 2008-12-26 18:54 fsx-linux is most good test tool, because it logs all operations, and tells corruption related operations 2008-12-26 18:54 I should start using it 2008-12-26 18:54 I want tool for metadata operations 2008-12-26 18:54 like fsx-linux 2008-12-26 18:55 there is several stress tools, but it doesn't check corruption and doesn't log operations 2008-12-26 18:55 :( 2008-12-26 18:57 it would check for corruption using readdir? 2008-12-26 18:57 fsx-linux checks data only 2008-12-26 18:58 tools for metadata? 2008-12-26 18:58 I don't know the tool checks it 2008-12-26 18:59 I was thinking, if there was a tool like that for metadata, it would check by doing a lot of renames etc, then get a directory listing and see if it matches its own version of the directory 2008-12-26 18:59 so it would keep a directory in its own memory, and compare that to the filesystem 2008-12-26 18:59 yes, it's good 2008-12-26 19:00 also, it can check timestamp etc 2008-12-26 19:00 maybe somebody on our mailing list would like to do that project 2008-12-26 19:00 extent fsx-linux to do metadata 2008-12-26 19:00 it's great 2008-12-26 19:02 http://osdir.com/ml/ltp/2004-04/msg00000.html 2008-12-26 19:03 yes 2008-12-26 19:03 racer is just scripts 2008-12-26 19:03 it doesn't check result at all 2008-12-26 19:03 I'm using it though 2008-12-26 19:05 Subject: Suggested project... 
extend fsx-linux to test metadata 2008-12-26 19:05 good 2008-12-26 19:05 strange things was happened 2008-12-26 19:06 pdflush stopped with D state 2008-12-26 19:06 hmm 2008-12-26 19:06 it may be bug of kvm 2008-12-26 19:07 I need to test on real machine 2008-12-26 19:07 IO wait 2008-12-26 19:07 yes 2008-12-26 19:07 balloc -> blockread -> want_on_page -> sync_page -> io_schedule 2008-12-26 19:08 does not sound like our bug 2008-12-26 19:08 or it may be bug of blockread/blockget 2008-12-26 19:09 ah 2008-12-26 19:09 if we don't set buffer states exactly as expected, yes 2008-12-26 19:09 that whole path through the kernel is really too fragile 2008-12-26 19:10 it is likely 2008-12-26 19:10 well, I'll test on real machine after sleeping 2008-12-26 19:11 with random seed == 1, current repo passed test several times 2008-12-26 19:11 it seems there is no new breakage 2008-12-26 19:12 I will do a patch for balloc locking 2008-12-26 19:12 yes 2008-12-26 19:12 truncate? 2008-12-26 19:12 will you do truncate? 2008-12-26 19:13 tree_chop 2008-12-26 19:14 ah 2008-12-26 19:14 that is probably what fsx hit before 2008-12-26 19:14 yes, probably 2008-12-26 19:16 FIXME: must fix expand size? 2008-12-26 19:17 we will need to review for expand size case 2008-12-26 19:17 it may work 2008-12-26 19:18 block_truncate_page should be ok 2008-12-26 19:19 so the lock can go in tree_chop 2008-12-26 19:19 yes 2008-12-26 19:21 by the way, I am tempted to just BUG on any kmalloc failures 2008-12-26 19:21 something to think about 2008-12-26 19:21 __GFP_NOFAIL on everything, unless it is an optional allocation 2008-12-26 19:23 um... 2008-12-26 19:25 if it is readahead, it would want to fail 2008-12-26 19:25 if not, ... 2008-12-26 19:28 patch for truncated is pushed 2008-12-26 19:29 ok, I'll test it 2008-12-26 19:33 NOFAIL seems to use emergency memory 2008-12-26 19:37 there are some flags around it 2008-12-26 19:38 um.. 
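The metadata checker proposed earlier in the log (do creates and renames, keep the tool's own copy of the directory in memory, then compare it with a directory listing) could start from something like the sketch below. Everything here is hypothetical: the toy_ names, the f<i> file naming, and the /tmp scratch directory are assumptions for illustration, and this is not an extension of the real fsx-linux.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

enum { TOY_NAMES = 8 };

/* In-memory model of one directory: present[i] means "f<i>" should exist. */
struct toy_model { char dir[32]; int present[TOY_NAMES]; };

static void toy_name(const struct toy_model *m, int i, char *buf, size_t size)
{
	snprintf(buf, size, "%s/f%d", m->dir, i);
}

static int toy_create(struct toy_model *m, int i)
{
	char path[64];

	toy_name(m, i, path, sizeof(path));
	int fd = open(path, O_CREAT | O_WRONLY, 0600);
	if (fd < 0)
		return -1;
	close(fd);
	m->present[i] = 1;
	return 0;
}

static int toy_rename(struct toy_model *m, int from, int to)
{
	char a[64], b[64];

	toy_name(m, from, a, sizeof(a));
	toy_name(m, to, b, sizeof(b));
	if (rename(a, b))
		return -1;
	m->present[from] = 0;
	m->present[to] = 1;
	return 0;
}

/* Count mismatches between readdir() and the model; -1 if opendir fails. */
static int toy_verify(const struct toy_model *m)
{
	int seen[TOY_NAMES] = { 0 }, bad = 0, i;
	struct dirent *ent;
	DIR *dir = opendir(m->dir);

	if (!dir)
		return -1;
	while ((ent = readdir(dir))) {
		if (!strcmp(ent->d_name, ".") || !strcmp(ent->d_name, ".."))
			continue;
		if (sscanf(ent->d_name, "f%d", &i) != 1 || i < 0 || i >= TOY_NAMES)
			bad++;	/* a name the model never created */
		else
			seen[i] = 1;
	}
	closedir(dir);
	for (i = 0; i < TOY_NAMES; i++)
		bad += seen[i] != m->present[i];
	return bad;
}

/* Run a few operations; if cheat is set, unlink a file behind the model's back. */
static int toy_metadata_run(int cheat)
{
	struct toy_model m = { "/tmp/toymetaXXXXXX", { 0 } };
	char path[64];
	int bad, i;

	if (!mkdtemp(m.dir))
		return -1;
	toy_create(&m, 0);
	toy_create(&m, 1);
	toy_rename(&m, 1, 5);
	if (cheat) {
		toy_name(&m, 0, path, sizeof(path));
		unlink(path);
	}
	bad = toy_verify(&m);
	for (i = 0; i < TOY_NAMES; i++) {	/* clean up whatever is left */
		toy_name(&m, i, path, sizeof(path));
		unlink(path);
	}
	rmdir(m.dir);
	return bad;
}
```

A real tool along these lines would also log every operation, as fsx-linux does for data, so that a mismatch can be traced back to the operations that caused it; timestamps and other attributes could be added to the model in the same way.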
2008-12-26 19:48 it doesn't use emergency memory 2008-12-26 19:48 it just does not return to caller until it has memory 2008-12-26 19:48 if that is ever the case then kernel is really oom and dead 2008-12-26 19:49 at least for order 0 2008-12-26 19:51 http://lxr.linux.no/linux+v2.6.28/mm/page_alloc.c#L1543 2008-12-26 19:51 ah 2008-12-26 19:52 only for MEMALLOC 2008-12-26 19:52 ah you're right 2008-12-26 19:53 it is GFP_WAIT I was thinking of 2008-12-26 19:53 if this is memory allocation path, MEMALLOC is true 2008-12-26 19:54 ok, so, if someone allocates memory, and vm calls our fs, it's true 2008-12-26 19:54 um... 2008-12-26 20:06 now, fsx-linux passed several times with random seed 2008-12-26 20:06 no oops 2008-12-26 20:07 inode table locking improved things? 2008-12-26 20:07 probably, truncate 2008-12-26 20:07 right 2008-12-26 20:07 for now, problem is I/O freeze only 2008-12-26 20:08 inode->i_lock spinlock to protect xcache maybe... but need to drop it to resize 2008-12-26 20:08 I wonder if xattr path has any locking at all 2008-12-26 20:12 getxattr seems no lock 2008-12-26 20:14 removexattr takes i_mutex for some reason 2008-12-26 20:14 seems inconsistent 2008-12-26 20:15 setxattr also takes i_mutex 2008-12-26 20:16 so it seems if we take i_mutex in our getxattr, it is complete 2008-12-26 20:16 well 2008-12-26 20:16 I wonder if somebody could already have it 2008-12-26 20:16 listxattr seems no lock 2008-12-26 20:17 so, write is i_mutex 2008-12-26 20:17 i_mutex in getxattr and listxattr then 2008-12-26 20:17 read is no lock 2008-12-26 20:17 there has to be a lock 2008-12-26 20:17 yes 2008-12-26 20:17 maybe the idea is, only operations that modify take the mutex 2008-12-26 20:18 and other locking is left to the fs 2008-12-26 20:18 would be nice to see a comment on that 2008-12-26 20:18 yes, ext* seems to use ->xattr_sem rwsem 2008-12-26 20:19 ah, comment may be Document/filesystems/locking 2008-12-26 20:19 Documentation/filesystems/Locking 2008-12-26 20:20 that 
document is not good 2008-12-26 20:20 just says whether a function takes a lock or not, but needs to say what is protected by a lock 2008-12-26 20:21 well, Locking says it 2008-12-26 20:26 ext* shares xattrs between inodes 2008-12-26 20:26 we don't, but need to lock when they access that atable, which is shared 2008-12-26 20:26 the atable I meant 2008-12-26 20:27 oh 2008-12-26 20:27 btw, our xattr performance will suck until it gets a hash table sitting in front of the atable 2008-12-26 20:27 this is a small project that is good to leave for later 2008-12-26 20:28 when it gets a atom hash, it will suddenly go from sucking to ruling 2008-12-26 20:30 i see 2008-12-26 20:31 ah, so ext* is using global spinlock 2008-12-26 20:32 at least it is rw 2008-12-26 20:32 it protects per inode 2008-12-26 20:32 and cache is protected by global spinlock 2008-12-26 20:33 ext2_xattr_cache_insert -> mb_cache_entry_insert -> spin_lock(&mb_cache_spinlock) 2008-12-26 20:33 we will have similar, only for the atom cache 2008-12-26 20:34 atom cache? 2008-12-26 20:34 xcache? 2008-12-26 20:34 no, global cache to resolve xattr names to atoms 2008-12-26 20:34 oh 2008-12-26 20:35 right now it will do a directory operation for each xattr call, that has to suck 2008-12-26 20:35 i see 2008-12-26 20:35 but good enough to test correctness 2008-12-26 20:44 ah 2008-12-26 20:45 one issue is bitmap doesn't update i_size 2008-12-26 20:45 um..., no problem 2008-12-26 20:46 because? 
2008-12-26 20:46 i_size is max always 2008-12-26 20:47 it looks like sparse file 2008-12-26 20:47 right, you did that a month or so ago 2008-12-26 20:47 yes 2008-12-26 20:48 to make sure, I need to test on real machine 2008-12-26 20:48 it can be kvm problem actually 2008-12-26 20:48 maybe unlikely though 2008-12-26 20:51 it's probably us 2008-12-26 20:51 yes 2008-12-26 20:53 vmtruncate_range takes i_mutex and I_alloc_sem but vmtruncated doesn't take any locks that I can see 2008-12-26 20:53 I wonder why the difference 2008-12-26 20:54 vmtruncate() have to be called under i_mutex 2008-12-26 21:10 balloc() may lock recersively 2008-12-26 21:11 write bitmap data page -> balloc() for the page -> lock page 2008-12-26 21:12 yes 2008-12-26 21:13 blockread does 2008-12-26 21:48 folks 2008-12-26 21:50 hirofumi, balloc only updates cache 2008-12-26 21:50 yes, but blockread uses lock_page() 2008-12-26 21:52 ah 2008-12-26 21:52 D state problem seem not this problem 2008-12-26 21:52 just a sec, I have thought about this before, need to wake up those old thoughts 2008-12-26 21:53 however, maybe it is true 2008-12-26 21:53 oh 2008-12-26 21:55 blockread is not a write, I don't see the recursion yet 2008-12-26 21:55 pdflush try to writes bitmap page 2008-12-26 21:56 and if block of that page is not allocated yet, get_segs() calls balloc() 2008-12-26 21:56 and balloc() may call blockread() 2008-12-26 21:58 actually, blockread() doesn't need lock_page(), however it is used to get buffer_head now 2008-12-26 21:58 because that page should be uptodate 2008-12-26 21:59 because it already did prepare_write? 
2008-12-26 21:59 yes 2008-12-26 22:01 blockread nees to lock the page in general in case it is already being read 2008-12-26 22:01 yes 2008-12-26 22:02 if page is not uptodate, it needs lock page 2008-12-26 22:03 yes, the page should be uptodate/dirty for the write 2008-12-26 22:04 and balloc() is called from get_seg(create) 2008-12-26 22:05 so, maybe recersive is only if page is uptodate 2008-12-26 22:05 recursive 2008-12-26 22:06 um.., take lock_page twice by same thread 2008-12-26 22:06 for atomic update we need to fork the buffer when this happens, because the bitmap state has to be for the delta being written 2008-12-26 22:07 however, read doesn't do fork? 2008-12-26 22:07 balloc should 2008-12-26 22:07 anyway, that is later 2008-12-26 22:08 forking will be a big fun mess :) 2008-12-26 22:08 ext3 does it 2008-12-26 22:08 i see 2008-12-26 22:08 ok, do you see a recursion right now? 2008-12-26 22:09 yes 2008-12-26 22:09 reproduce? 2008-12-26 22:09 pdflush try to writes bitmap page 2008-12-26 22:09 and if block of that page is not allocated yet, get_segs() calls balloc() 2008-12-26 22:09 and balloc() may call blockread() <- this one, right? 2008-12-26 22:09 yes 2008-12-26 22:09 balloc calls blockread() 2008-12-26 22:10 "fork" is happen before balloc()? 
2008-12-26 22:11 inside balloc, when it wants to update a bitmap block 2008-12-26 22:11 it should check to see if the block is dirty in a previous delta 2008-12-26 22:11 yes 2008-12-26 22:11 now I'm thinking is 2008-12-26 22:12 normal sys_write(2) -> balloc() -> allocate block on page(3) 2008-12-26 22:12 page(3) means bitmap data page (page number 3) 2008-12-26 22:13 understood 2008-12-26 22:13 pdflush try to write page(3) -> get_segs -> balloc() 2008-12-26 22:13 and balloc() read pages to search free bit 2008-12-26 22:14 and it may call blockread(page(3)) 2008-12-26 22:18 this is a danger of having the allocation map in a regular file 2008-12-26 22:18 recursion danger 2008-12-26 22:18 yes, it can be 2008-12-26 22:19 I don't see a recursive lock_page yet 2008-12-26 22:19 I'll adds lock_page 2008-12-26 22:19 pdflush try to write page(3) -> lock_page(3) -> writepage -> get_segs -> balloc() 2008-12-26 22:20 and balloc() read pages to search free bit 2008-12-26 22:20 and it may call blockread(page(3)) -> blockread() -> lock_page(3) 2008-12-26 22:22 ok, page lock is help across the ->get_block call 2008-12-26 22:22 held 2008-12-26 22:22 yes 2008-12-26 22:23 to keep other readers and writers from trying to get_block on the same page 2008-12-26 22:24 yes, and probably truncate 2008-12-26 22:24 we are going to stop using block_write_full_page pretty soon, but we need an answer for this case now 2008-12-26 22:25 if we don't use block_*_full_page(), this may not be happen 2008-12-26 22:26 because we don't need to do lock_page() 2008-12-26 22:26 we need lock_buffer() or something actually 2008-12-26 22:26 yes 2008-12-26 22:27 however, it needs to get buffer_head or something from page without lock_page 2008-12-26 22:27 well, and buffer forking solves this 2008-12-26 22:28 we will remove the page from buffer cache 2008-12-26 22:28 and it is ok for it to be locked 2008-12-26 22:28 where point do we do fork? 
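The recursion traced above (writepage holds the page lock, then get_segs -> balloc -> blockread tries to lock the same bitmap page) can be modelled in a few lines of plain C. This is only a sketch under stated assumptions: `model_page`, `model_lock_page` and the other `model_*` names are hypothetical stand-ins, not kernel or tux3 functions, and a second lock attempt is reported as `-EDEADLK` instead of sleeping forever. The `skip_lock_if_uptodate` flag models the observation that blockread should not need the lock when the page is already uptodate.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page cache page; not the kernel's struct page. */
struct model_page {
	bool locked;    /* models PG_locked */
	bool uptodate;  /* models PG_uptodate */
};

/* Non-recursive lock: a second attempt by the same (only) thread would
 * sleep forever, so the model reports it as -EDEADLK instead. */
static int model_lock_page(struct model_page *page)
{
	if (page->locked)
		return -EDEADLK;
	page->locked = true;
	return 0;
}

static void model_unlock_page(struct model_page *page)
{
	page->locked = false;
}

/* blockread: takes the page lock unless allowed to trust an uptodate page. */
static int model_blockread(struct model_page *page, bool skip_lock_if_uptodate)
{
	if (skip_lock_if_uptodate && page->uptodate)
		return 0;
	int err = model_lock_page(page);
	if (err)
		return err;
	page->uptodate = true;
	model_unlock_page(page);
	return 0;
}

/* balloc: scanning for a free bit may read the very page being written. */
static int model_balloc(struct model_page *bitmap_page, bool skip)
{
	return model_blockread(bitmap_page, skip);
}

/* writepage: holds the page lock across the get_segs -> balloc call. */
static int model_writepage(struct model_page *bitmap_page, bool skip)
{
	int err = model_lock_page(bitmap_page);
	if (err)
		return err;
	err = model_balloc(bitmap_page, skip);
	model_unlock_page(bitmap_page);
	return err;
}
```

With the flag off, writing an uptodate bitmap page reports the self-deadlock; with it on, the nested read never touches the lock.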
2008-12-26 22:28 put a new one in its place 2008-12-26 22:28 in balloc, when it wants to update a bitmap block 2008-12-26 22:29 it should check flags in its buffer to see if it should fork 2008-12-26 22:29 this path may not update that page 2008-12-26 22:29 it just read 2008-12-26 22:29 true 2008-12-26 22:29 it should not fork until it knows it has to 2008-12-26 22:30 ok 2008-12-26 22:31 ...still thinking 2008-12-26 22:31 well, let me check in my xattr locking patch 2008-12-26 22:31 or post it 2008-12-26 22:32 well, maybe, bdev should be handling this, so I think we can it for this 2008-12-26 22:32 it got a little longer than expected, I changed some xcache_lookup to ERR_PTR 2008-12-26 22:32 ok 2008-12-26 22:33 ah, bdev doesn't use block_*_full_page() for bread(), so it's ok 2008-12-26 22:33 *sigh of relief* 2008-12-26 22:33 sb_bread, right? 2008-12-26 22:33 yes 2008-12-26 22:34 and to get buffer_head, it uses hash, iirc 2008-12-26 22:35 and it uses mapping->private_lock 2008-12-26 22:35 I made a note of the recursion for when we take over that path 2008-12-26 22:35 read side is quite different 2008-12-26 22:36 thanks 2008-12-26 22:36 probably, read side is lock_page less 2008-12-26 22:37 we take the lock in blockread because it is needed to put buffers on the page iirc 2008-12-26 22:38 yes 2008-12-26 22:38 bdev is also page on bd_inode 2008-12-26 22:39 but, sb_bread() uses lock_buffer() and submit_bh() 2008-12-26 22:39 I guess it is why it work 2008-12-26 22:40 that is cool use of ->write_begin in blockget, by the way 2008-12-26 22:41 well, balloc() is not using blockget(), so it may not be problem 2008-12-26 22:42 I have to understand sb_bread() locking stuff 2008-12-26 22:43 and uses new one for bitmap 2008-12-26 22:45 maybe, it is using mapping->private_lock to get buffer_head 2008-12-26 22:46 and to add new page, it uses lock_page 2008-12-26 22:47 I need to copy some codes from grow_dev_page() 2008-12-26 22:48 interesting locking rule 2008-12-26 22:49 
->private_lock? 2008-12-26 22:49 yes 2008-12-26 22:49 it's bizzare 2008-12-26 22:50 yes 2008-12-26 22:52 to walking buffer_head list on page, vfs seems to be using ->private_lock 2008-12-26 22:58 http://mailman.tux3.org/pipermail/tux3/2008-December/000527.html <- xattr smp locking 2008-12-26 22:58 please review 2008-12-26 22:59 ok 2008-12-26 23:01 I did not make even a slight attempt to optimize the locking 2008-12-26 23:02 in del_xattr(), mutex_unlock() seems to be missing 2008-12-26 23:03 and set_xattr() error path 2008-12-26 23:05 get_xattr() too 2008-12-26 23:06 maybe, email is more helpful 2008-12-26 23:13 it's useful 2008-12-26 23:13 ok, fixing 2008-12-26 23:15 yes, del_xattr as returns without unlock 2008-12-26 23:15 s/as/has/ 2008-12-26 23:16 I post commented by email 2008-12-26 23:16 thanks 2008-12-26 23:17 there are some bugs, however I think we are getting closer to atomic commit 2008-12-26 23:21 I would much rather be working on atomic commit than smp locking ;) 2008-12-26 23:21 but smp locking is the immediate bug 2008-12-26 23:21 yes 2008-12-26 23:22 I'll try to fix D state problem, and blockread problem 2008-12-26 23:22 nice review 2008-12-26 23:22 thanks 2008-12-26 23:23 I'll sleep before 2008-12-26 23:24 I'll sleep before trying it 2008-12-26 23:24 oyasumi 2008-12-26 23:24 oyasumi 2008-12-26 23:24 sleep well 2008-12-26 23:24 thanks 2008-12-26 23:27 -!- RazvanM(~RazvanM@96.234.233.172) has joined #tux3 2008-12-27 01:04 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-12-27 01:31 -!- xiaofeng(~xiaofeng@125.34.27.183) has joined #tux3 2008-12-27 01:33 -!- xiaofeng(~xiaofeng@125.34.27.183) has left #tux3 2008-12-27 03:57 -!- xiaofeng(~xiaofeng@125.34.27.183) has joined #tux3 2008-12-27 03:57 -!- xiaofeng(~xiaofeng@125.34.27.183) has left #tux3 2008-12-27 09:24 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-27 10:04 -!- kushal(~kushal@121.246.35.254) has joined #tux3 2008-12-27 10:48 hey flips 2008-12-27 
10:48 -!- stargazr5(~gauravstt@59.95.4.53) has joined #tux3 2008-12-27 10:49 looking at the extent code... 2008-12-27 10:49 i generated a graph using hirofumi's tux3graph 2008-12-27 10:50 it shows only single block extents.. 2008-12-27 10:50 so i'm a little confused... 2008-12-27 11:14 kushal, the extents pointers are packed together in a dleaf 2008-12-27 11:25 yes..the counts of all the entries in the dleaf are 1... 2008-12-27 11:25 how big is the file? 2008-12-27 11:25 2.6 mb 2008-12-27 11:27 some of the entries in the dleaf are 2008-12-27 11:27 version 0x000 count 1 block 17 2008-12-27 11:27 (extent 0) 2008-12-27 11:27 version 0x000 count 1 block 18 2008-12-27 11:28 (extent 0) 2008-12-27 11:28 that file will have 10 extents 2008-12-27 11:28 yes..thats what i expected... 2008-12-27 11:29 what is the group count of the dleaf? 2008-12-27 11:29 (group 0) etc 2008-12-27 11:29 2 2008-12-27 11:30 can you put your svg/png online? 2008-12-27 11:31 yes..uploading it....give me a moment... 2008-12-27 11:37 i'm having a little trouble uploading it...can i send it to your mail?? 2008-12-27 11:37 phillips@phunq.net 2008-12-27 11:37 ok...sending now... 2008-12-27 11:42 not arrived yet 2008-12-27 11:43 sent now.. 2008-12-27 11:43 there it is 2008-12-27 11:48 it shows 339 1 block extents in inode 14 2008-12-27 11:48 yes... 2008-12-27 11:49 ok, single blocks instead of long extents 2008-12-27 11:49 yes, our bug 2008-12-27 11:50 I suppose you generated the file with dd? 2008-12-27 11:50 i am using a disk partition 2008-12-27 11:50 I mean, inode 14 2008-12-27 11:50 how did you create it? cp? 2008-12-27 11:50 dd? 2008-12-27 11:51 cp 2008-12-27 11:51 ok, we should merge the single blocks into multiblock extents 2008-12-27 11:53 yes... 2008-12-27 11:57 btw, is there a design doc for extents in tux3... 
2008-12-27 11:58 there are some notes and ascii art by hirofumi in dleaf.c 2008-12-27 11:59 ok...thanks 2008-12-27 12:00 the code that generates the single block extents is in filemap.c, get_segs 2008-12-27 12:19 ok...i'm off...thanx for the help... 2008-12-27 12:20 you're welcome 2008-12-27 12:21 and we have a little coding project... merge contiguous blocks 2008-12-27 12:21 yes...i guess we do... 2008-12-27 12:34 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-27 13:59 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-27 14:11 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-27 14:13 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-27 15:07 -!- RalucaME(~ral@londo.cnds.jhu.edu) has joined #tux3 2008-12-27 16:10 sk8 oclock 2008-12-27 16:10 will be a sunset skate 2008-12-27 16:19 -!- RalucaME(~ral@londo.cnds.jhu.edu) has joined #tux3 2008-12-27 17:27 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-27 17:53 back 2008-12-27 17:53 one of the most lovely sunsets I have seen... everything orange 2008-12-27 17:53 no photo today though :( 2008-12-27 17:53 hirofumi, there? 
2008-12-27 17:57 http://blogs.sun.com/bill/entry/zfs_and_the_all_singing <- very nice blog to read 2008-12-27 18:02 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-27 18:03 hiyah tim_dimm 2008-12-27 18:03 howdee 2008-12-27 18:19 resize inum 0xd at 0xa2 from 36 to 36 <- hmm, about time to optimize that 2008-12-27 18:36 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-12-27 18:54 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-27 20:33 -!- RalucaME(~ral@londo.cnds.jhu.edu) has left #tux3 2008-12-27 21:03 -!- zhaozhou(~zhaozhou@h-68-118.A199.priv.bahnhof.se) has joined #tux3 2008-12-27 21:13 OK, next big thing to think about is replacing some block library code with our own code capable of atomic commit 2008-12-27 21:17 returning to a little code reading project I put aside earlier: how does ext3 know when all data writing is completed in data=ordered write mode? 2008-12-27 21:26 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-12-27 21:27 -!- kushal(~kushal@121.246.34.181) has joined #tux3 2008-12-27 21:31 http://lxr.linux.no/linux+v2.6.27/fs/jbd/journal.c#L527 <- log_wait_commit 2008-12-27 21:32 546 wait_event(journal->j_wait_done_commit, 2008-12-27 21:32 tux3 U stuff, really 2008-12-27 21:58 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-27 22:02 ok, it is stuff like this: http://lxr.linux.no/linux+v2.6.27/mm/filemap.c#L291 <- wait_on_page_writeback(page); 2008-12-27 22:02 in a loop 2008-12-27 22:02 :p 2008-12-27 22:02 enough of that 2008-12-27 22:03 we will do this better I hope, using a single wait on delta completion 2008-12-27 22:03 instead of waiting on individual pages, buffers etc 2008-12-27 22:40 -!- cdk(~cdk@121.246.34.181) has joined #tux3 2008-12-28 00:04 the blog looks like good for test 2008-12-28 00:05 I'm thinking the review process is more important though 2008-12-28 00:06 if we 
can have auto test process, it is really good 2008-12-28 00:12 the blog? 2008-12-28 00:12 http://blogs.sun.com/bill/entry/zfs_and_the_all_singing <- very nice blog to read 2008-12-28 00:12 oh right 2008-12-28 00:13 yes, we will continue towards review first 2008-12-28 00:13 anyway, I was waiting for your comments before checking in the get_segs patch 2008-12-28 00:14 oh 2008-12-28 00:14 ok 2008-12-28 00:16 that patch looks good to me 2008-12-28 00:16 kay, here it comes 2008-12-28 00:17 btw, if we call multiple insert_node(), alloc_cursor() may need more depth 2008-12-28 00:18 yes 2008-12-28 00:18 we might end up making the path a pointer so we can realloc it, while keeping a stable pointer to the cursor 2008-12-28 00:19 or just allocate max depth 2008-12-28 00:19 easier for now 2008-12-28 00:19 and no significant cost 2008-12-28 00:20 yes 2008-12-28 00:20 balloc locking patch coming too 2008-12-28 00:20 well, I guess 1 may be enough 2008-12-28 00:20 if not we will fix it :) 2008-12-28 00:21 good :) 2008-12-28 00:22 I had to really think about deadlock in the balloc patch 2008-12-28 00:23 btw, kushal noticed he was getting all extents as single blocks 2008-12-28 00:24 get_block()? 2008-12-28 00:24 yes 2008-12-28 00:24 yes 2008-12-28 00:24 so I wondered, how did you get the multiblock extents? 2008-12-28 00:24 he was just doing cp 2008-12-28 00:24 multiblock is read only 2008-12-28 00:24 read side 2008-12-28 00:25 write side is 1 always 2008-12-28 00:25 so you generated the file with multi block extents using tux3 userspace? 2008-12-28 00:25 ah, maybe it's userland 2008-12-28 00:25 tux3 write test.txt 2008-12-28 00:25 ok, makes perfect sense 2008-12-28 00:25 in kernel, we need delalloc 2008-12-28 00:26 so generic_*_write can never trigger the multiblock ->get_block interface? 
2008-12-28 00:26 yes 2008-12-28 00:27 anyway, we are nearly finished with ->get_block for writing 2008-12-28 00:27 however, merging phyisical extents is still worth doing 2008-12-28 00:28 we will always get cases where writes are broken up logically, but contiguous physically 2008-12-28 00:28 well, I think it's not simple 2008-12-28 00:28 no, it's not 2008-12-28 00:28 I was thinking about it a little 2008-12-28 00:28 it is where dwalk_back might come in handy 2008-12-28 00:29 I think it's not right way 2008-12-28 00:29 dwalk_back 2008-12-28 00:29 I thought we should pass start-1 and limit+1 to find_segs() 2008-12-28 00:29 we need to check the extent before, anyway 2008-12-28 00:29 yes 2008-12-28 00:29 also good 2008-12-28 00:30 and merge above is not as important 2008-12-28 00:30 just doing merge below will take care of the serial write case 2008-12-28 00:30 maybe 2008-12-28 00:31 the actual merging would happen at the seg[segs++] point 2008-12-28 00:31 however, to get proper goal, avobe would be needed 2008-12-28 00:31 proper goal? 2008-12-28 00:31 block number to pass balloc() 2008-12-28 00:32 yes 2008-12-28 00:32 and we should add the goal parameter to balloc pretty soon 2008-12-28 00:32 maybe 2008-12-28 00:33 I was thinking it after atomic commit 2008-12-28 00:33 anyway, single block extents are fine for testing right now 2008-12-28 00:33 yes 2008-12-28 00:33 yes 2008-12-28 00:33 ok, so all that is for after atomic commit 2008-12-28 00:33 yes 2008-12-28 00:33 I thought I would mention it now, to think about in the background 2008-12-28 00:33 i see 2008-12-28 00:34 ok, back to atomic commit 2008-12-28 00:34 ok 2008-12-28 00:34 so we need to replace block_write_full_page, and ? 2008-12-28 00:34 btw, I'll offline from 12/30 or 12/31 to 1/3 or 1/4 or 1/5 2008-12-28 00:35 holiday some place warm? 
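The "merge below" idea for the seg[segs++] point can be sketched as follows. It is a minimal illustration, not the tux3 code: `seg_sketch` and `seg_append_merge` are hypothetical names, entries are assumed to arrive in logical order (the serial write case), and any per-extent count limit of the real dleaf format is ignored.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t block_t; /* assumption: 64-bit block numbers */

/* Hypothetical extent record for illustration; the real struct may differ. */
struct seg_sketch {
	block_t block;   /* first physical block */
	unsigned count;  /* number of blocks */
};

/*
 * Append one extent to vec[], merging it into the previous entry when it
 * starts exactly where that entry ends. Returns the new number of entries,
 * or -1 if the vector is full.
 */
static int seg_append_merge(struct seg_sketch vec[], int segs, int max_segs,
			    block_t block, unsigned count)
{
	if (segs > 0) {
		struct seg_sketch *prev = &vec[segs - 1];
		if (prev->block + prev->count == block) {
			prev->count += count; /* physically contiguous: extend */
			return segs;
		}
	}
	if (segs >= max_segs)
		return -1;
	vec[segs].block = block;
	vec[segs].count = count;
	return segs + 1;
}
```

Feeding it the single-block allocations 17, 18, 19 yields one extent of count 3 instead of three extents of count 1.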
2008-12-28 00:36 yes, new years holidays 2008-12-28 00:36 in japan 2008-12-28 00:36 syogatsu holiday 2008-12-28 00:37 http://web.mit.edu/jpnet/holidays/Jan/syogatsu.shtml 2008-12-28 00:37 exactly 2008-12-28 00:38 sounds very enjoyable 2008-12-28 00:38 yes 2008-12-28 00:39 now, which aops can we leave null 2008-12-28 00:39 I think... writepages 2008-12-28 00:39 what about write_begin/write_end? 2008-12-28 00:40 are those emulated by ->writepage? 2008-12-28 00:40 which aops? 2008-12-28 00:40 for normal file? 2008-12-28 00:40 we have to consider both 2008-12-28 00:40 ok 2008-12-28 00:40 tux_aops and tux_blk_aops 2008-12-28 00:41 I think write_begin/end is needed to use generic_file_aio_write() 2008-12-28 00:41 we have some metadata in regular page cache 2008-12-28 00:41 yes 2008-12-28 00:41 so we have three methods to implement, for each of two kinds of address space 2008-12-28 00:42 it would be nice to do that incrementally 2008-12-28 00:42 for blk_apos, those are not needed 2008-12-28 00:42 because? 
2008-12-28 00:42 oh 2008-12-28 00:42 it doesn't need to copy from userland 2008-12-28 00:42 so just ->writepage 2008-12-28 00:42 for blk 2008-12-28 00:42 yes 2008-12-28 00:43 four methods to implement 2008-12-28 00:43 and if we write those ourself, ->writepage would also not be needed 2008-12-28 00:44 const struct address_space_operations tux_blk_aops = { 2008-12-28 00:44 .readpage = tux3_blk_readpage, 2008-12-28 00:44 .writepage = tux3_blk_writepage, 2008-12-28 00:44 .writepages = tux3_writepages, 2008-12-28 00:44 .sync_page = block_sync_page, 2008-12-28 00:44 .write_begin = tux3_write_begin, 2008-12-28 00:44 .bmap = tux3_bmap, 2008-12-28 00:44 }; 2008-12-28 00:44 current tux_blk_aops 2008-12-28 00:45 readpage is for blockread() to use read_mapping_page() 2008-12-28 00:45 need to choose which to attack first, blk or data 2008-12-28 00:45 I think blk, because it is only one method 2008-12-28 00:45 yes 2008-12-28 00:47 tux3_blk_writepage and tux3_writepage are the same at the moment 2008-12-28 00:47 actually, those doesn't need aops if we don't use block io library 2008-12-28 00:47 right 2008-12-28 00:47 we just handle with bio 2008-12-28 00:47 ok 2008-12-28 00:48 ok, that I can do while you are on holiday :) 2008-12-28 00:48 oh, good :) 2008-12-28 00:49 directory and bitmap are the same...
they are in data cache, but we control all transfers 2008-12-28 00:50 yes 2008-12-28 00:50 for data, we can do filemap_fdatawait per delta, to get an ordered write kind of consistency 2008-12-28 00:50 difference is directory has access pattern 2008-12-28 00:50 yes 2008-12-28 00:51 but we are not worrying about any allocation efficiency right now 2008-12-28 00:51 I thought about read pattern 2008-12-28 00:51 well, anyway, it can be later 2008-12-28 00:52 it's a very interesting subject 2008-12-28 00:52 optimizing for rotating media 2008-12-28 00:52 yes 2008-12-28 00:52 basically, it may be readahead like normal file 2008-12-28 00:53 that would be very nice 2008-12-28 00:53 ext2 is doing it actually, iirc 2008-12-28 00:53 I have in mind a very aggressive readahead optimization for later 2008-12-28 00:53 good 2008-12-28 00:54 where we read blocks into the buffer cache, before we know if they belong to a file 2008-12-28 00:54 and when a file page cache accesses misses, we look in the buffer cache first, and if the page is there, move it or copy it tothe file page cache 2008-12-28 00:55 we have two copies of data? 2008-12-28 00:55 just one 2008-12-28 00:56 the copy is if blocks are unaligned wrt pages 2008-12-28 00:56 and there are no metadata blocks on the page 2008-12-28 00:56 read physical contiguous? 
2008-12-28 00:56 yes 2008-12-28 00:56 i see 2008-12-28 00:56 something I thought about trying to do for generic kernel 2008-12-28 00:57 but it is easier to approach just within one filesystem 2008-12-28 00:57 i see 2008-12-28 00:57 we need to have some state information in the buffer cache, to know which blocks are mapped into files 2008-12-28 00:58 ah, we can have the same page mapped into block buffer cache and page cache I think 2008-12-28 00:59 to handle the mixes data/metadata and unalighed blocks case 2008-12-28 00:59 anyway, it promises to be a strange hack 2008-12-28 00:59 :) 2008-12-28 00:59 disk physical alignment 2008-12-28 01:00 block should be same with page cache? 2008-12-28 01:00 block size 2008-12-28 01:01 I don't think it is reasonable to force physical alignment of blocks on disk 2008-12-28 01:01 so for multiple blocks per page, the logical alignment of a file can be unaligned with a block in buffer cache 2008-12-28 01:01 yes 2008-12-28 01:02 to avoid copy it is needed? 2008-12-28 01:02 yes 2008-12-28 01:02 i see 2008-12-28 01:02 even with a copy it will be a big win 2008-12-28 01:02 copy and invalidate? 
2008-12-28 01:03 yes 2008-12-28 01:03 i see 2008-12-28 01:03 we need per-block state for the buffer cache, something like the block handle patch 2008-12-28 01:03 yes 2008-12-28 01:03 that state is referenced from page->private, so to use that a page would be mapped into both buffer and file page cache 2008-12-28 01:04 alternatively, maybe use some bits in the radix tree 2008-12-28 01:05 one detail: if a page is mapped into both caches, vm will not be able to evict it 2008-12-28 01:05 because of elevated ref count 2008-12-28 01:05 anyway, again something to think about 2008-12-28 01:05 for about 6 months from now 2008-12-28 01:05 yes 2008-12-28 01:05 http://userweb.kernel.org/~hirofumi/find-buffer.patch 2008-12-28 01:06 btw, dirty hack of recursive lock_page 2008-12-28 01:07 reading 2008-12-28 01:07 it wouldn't work under memory pressure 2008-12-28 01:07 ah, no 2008-12-28 01:07 it may work 2008-12-28 01:08 yes kushal was using tux3fuse to mount it.. 2008-12-28 01:08 userland as u said 2008-12-28 01:09 actually, we were talking about hirofumi's write from userland, which does multiblock extents when done from the tux3 command 2008-12-28 01:09 but good to know about kushal's fuse use 2008-12-28 01:10 tux3fuse should be able to write good extents 2008-12-28 01:10 it should 2008-12-28 01:10 btw, with this patch, fsx-linux passed 50 times 2008-12-28 01:10 D state problem might be recursive lock_page 2008-12-28 01:11 nice :) 2008-12-28 01:11 you would see that for sure with SysRq-t, no?
2008-12-28 01:11 looks like lock_page 2008-12-28 01:11 however, it's not very sure 2008-12-28 01:12 right, it was io wait 2008-12-28 01:12 yes 2008-12-28 01:12 and path seems balloc path 2008-12-28 01:12 however, I can just say it looks like that 2008-12-28 01:13 I'll run fsx-linux several hours 2008-12-28 01:14 and try to make real patch 2008-12-28 01:14 ok, so your patch works by avoiding the ->readpage from read_mapping_page 2008-12-28 01:14 when there is already a buffer 2008-12-28 01:14 yes 2008-12-28 01:15 if there is page already, we would have to create buffer_head 2008-12-28 01:15 without lock_page 2008-12-28 01:15 right 2008-12-28 01:16 it is deep magic 2008-12-28 01:16 yes 2008-12-28 01:16 post to the ml with a short description of the deadlock and resolution? 2008-12-28 01:16 ok 2008-12-28 01:16 I'll post with patch when I was done 2008-12-28 01:17 I have a couple more minor cleanups to do in filemap 2008-12-28 01:18 ok 2008-12-28 01:18 and a post on the smp locking strategy half finished 2008-12-28 01:18 and then, fully concentrate on replacing the block device block IO 2008-12-28 01:18 sounds good 2008-12-28 01:18 we will rely on your deadlock fix for a while 2008-12-28 01:19 ok 2008-12-28 01:19 passing fsx-linux 50 times is a big event 2008-12-28 01:19 dead 2008-12-28 01:19 :) 2008-12-28 01:19 dleaf.c:406 2008-12-28 01:20 oops in dwalk_check 2008-12-28 01:20 fun 2008-12-28 01:20 I saw a few times before 2008-12-28 01:21 need an isolation strategy 2008-12-28 01:21 got the traceback? 2008-12-28 01:21 yes 2008-12-28 01:21 http://userweb.kernel.org/~hirofumi/oops 2008-12-28 01:21 and trace output? 2008-12-28 01:21 no 2008-12-28 01:22 I've ran with dmesg -n3 2008-12-28 01:22 I'll try with full log 2008-12-28 01:24 ok 2008-12-28 01:24 dwalk_check() should dump dleaf 2008-12-28 01:24 yes 2008-12-28 01:25 does it take 50 fsx runs to trigger? 
2008-12-28 01:25 yes, 50 or more 2008-12-28 01:26 it sounds like a race 2008-12-28 01:27 may be 2008-12-28 01:27 same block allocated to more than one dtree? 2008-12-28 01:27 I think it is not happen 2008-12-28 01:28 you were running without smp locking for balloc, right? 2008-12-28 01:28 not yet 2008-12-28 01:28 I'll pull last 2 patches 2008-12-28 01:29 I think, with the balloc patch smp coverage is complete 2008-12-28 01:29 maybe 2008-12-28 01:29 good answer :) 2008-12-28 01:30 well, I'll review in holidays 2008-12-28 01:31 we can check that the dleaf is in order when writing it 2008-12-28 01:32 dleaf_check 2008-12-28 01:32 ah, good 2008-12-28 01:39 } 2008-12-28 01:39 + assert(!dleaf_check(btree, leaf)); 2008-12-28 01:39 if (tail) { 2008-12-28 01:39 - if (dleaf_need(btree, tail) < dleaf_free(btree, leaf)) 2008-12-28 01:39 + if (dleaf_need(btree, tail) < dleaf_free(btree, leaf)) { 2008-12-28 01:39 dleaf_merge(btree, leaf, tail); 2008-12-28 01:39 + assert(!dleaf_check(btree, leaf)); 2008-12-28 01:39 + } 2008-12-28 01:39 else { 2008-12-28 01:39 mark_buffer_dirty(cursor_leafbuf(cursor)); 2008-12-28 01:39 not tried 2008-12-28 01:39 filemap.c 2008-12-28 01:41 filemap.c leaks cursor now 2008-12-28 01:41 whoops 2008-12-28 01:42 patch? 2008-12-28 01:42 not yet 2008-12-28 01:43 above patch boots/mounts/writes hello >foo 2008-12-28 01:43 looking for the cursor leak 2008-12-28 01:44 how did you spot the leak? 2008-12-28 01:44 oh 2008-12-28 01:44 valgrind complains 2008-12-28 01:44 many places 2008-12-28 01:44 we may want one more depth for function 2008-12-28 01:45 old one, get_segs does alloc_cursor() and down_*() 2008-12-28 01:45 now, one function 2008-12-28 01:45 all return may be wrong 2008-12-28 01:46 sure 2008-12-28 01:47 however, return seems two places 2008-12-28 01:47 right :P 2008-12-28 01:48 should do the return before allocating the cursor 2008-12-28 01:48 when btree is empty 2008-12-28 01:48 why do we check it? 2008-12-28 01:49 depth == 0 is bug? 
2008-12-28 01:49 balloc does bread before the bitmap btree is allocated 2008-12-28 01:49 does not happen in kernel 2008-12-28 01:51 tux_new_inode will allocate new_btree() 2008-12-28 01:51 ah 2008-12-28 01:52 anyway, first thing is to make it work like the old version 2008-12-28 01:52 add one level? 2008-12-28 01:53 so it should goto out_unlock 2008-12-28 01:53 and add one level 2008-12-28 01:54 - int err; 2008-12-28 01:54 + int err, segs = 0; 2008-12-28 01:54 if (!btree->root.depth) 2008-12-28 01:54 - return 0; 2008-12-28 01:54 + goto out_unlock; 2008-12-28 01:55 and I would have caught this if Werror was in ;) 2008-12-28 01:55 so it should go back 2008-12-28 01:56 and find a different way to avoid annoying about unused locals when building single files 2008-12-28 01:59 well I have two patches mixed together now, dleaf_check and filemap bugfix 2008-12-28 01:59 I'll add dleaf_check() more low level 2008-12-28 02:00 all dleaf modification 2008-12-28 02:00 ok 2008-12-28 02:02 checking in the cursor_release bug fix 2008-12-28 02:03 pushed 2008-12-28 02:04 ok 2008-12-28 02:05 I would like to have Werror only on make tests 2008-12-28 02:05 what is for? 2008-12-28 02:06 on make tests, compile already was done 2008-12-28 02:06 stop on make on warning, except when I'm hacking 2008-12-28 02:06 oh right 2008-12-28 02:07 then make all 2008-12-28 02:08 make UCFLAGS=-Wno-error 2008-12-28 02:08 that will do :) 2008-12-28 02:09 so I will put Werror back in CFLAGS 2008-12-28 02:09 ok 2008-12-28 02:16 actually, that would not have caught the memory leak, but it's better anyway 2008-12-28 02:19 yes, valgrind told it 2008-12-28 02:24 another problem was found 2008-12-28 02:28 ? 
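The cursor leak fix above follows the usual single-exit cleanup shape: every early return becomes a `goto` to one label that releases the cursor. A small sketch of that shape, with a counter to show nothing leaks, might look like this. All names here (`cursor_sketch`, `get_segs_sketch`, `live_cursors`) are hypothetical and stand in for the real tux3 functions, and locking is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static int live_cursors; /* demo-only count of cursors not yet freed */

struct cursor_sketch { int maxdepth; };

static struct cursor_sketch *alloc_cursor_sketch(int maxdepth)
{
	struct cursor_sketch *cursor = malloc(sizeof(*cursor));
	if (cursor) {
		cursor->maxdepth = maxdepth;
		live_cursors++;
	}
	return cursor;
}

static void free_cursor_sketch(struct cursor_sketch *cursor)
{
	if (cursor) {
		free(cursor);
		live_cursors--;
	}
}

/* Returns a segment count, or a negative errno-style value on failure. */
static int get_segs_sketch(int root_depth, int probe_err)
{
	struct cursor_sketch *cursor = NULL;
	int segs = 0;

	if (!root_depth)
		goto out;               /* empty btree: leave before allocating */
	cursor = alloc_cursor_sketch(root_depth + 1); /* one spare level */
	if (!cursor) {
		segs = -ENOMEM;
		goto out;
	}
	if (probe_err) {
		segs = probe_err;       /* error path shares the same cleanup */
		goto out;
	}
	segs = 1;                       /* pretend one segment was mapped */
out:
	free_cursor_sketch(cursor);
	return segs;
}
```

Whichever path is taken, the cursor is released exactly once, which is the property valgrind was checking.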
2008-12-28 02:28 for (int i = -!!below; i < segs + !!above; i++) { 2008-12-28 02:29 for (int i = -!!below, index = start; i < segs + !!above; i++) { 2008-12-28 02:29 here is index is "int" 2008-12-28 02:29 whoops 2008-12-28 02:30 - for (int i = -!!below, index = start; i < segs + !!above; i++) { 2008-12-28 02:30 + index = start; 2008-12-28 02:30 + for (int i = -!!below; i < segs + !!above; i++) { 2008-12-28 02:30 ok? 2008-12-28 02:30 yes 2008-12-28 02:34 - block_t gap = ex_index - index; 2008-12-28 02:34 + unsigned gap = ex_index - index; <- I think this is correct 2008-12-28 02:34 no 2008-12-28 02:34 because we will not ask for an IO range bigger than unsigned 2008-12-28 02:35 hole can be too big 2008-12-28 02:35 -static int get_segs(struct inode *inode, block_t start, block_t count, struct seg seg[], unsigned max_segs, int create) 2008-12-28 02:35 +static int get_segs(struct inode *inode, block_t start, unsigned count, struct seg seg[], unsigned max_segs, int create) <- I was thinking, this 2008-12-28 02:35 yes 2008-12-28 02:36 I think we cannot get a count bigger than unsigned 2008-12-28 02:36 yes 2008-12-28 02:37 however, hole can be bigger than unsigned 2008-12-28 02:37 so, we have to split to some extents 2008-12-28 02:37 how can it be? 2008-12-28 02:37 hole can't be bitter than limit - start, where limit = start + count 2008-12-28 02:37 s/bitter/bigger/ 2008-12-28 02:38 yes 2008-12-28 02:38 we find next extent 2008-12-28 02:38 it can be 0x1000000000 2008-12-28 02:39 ah 2008-12-28 02:39 ex_index was changed 2008-12-28 02:39 so, yes 2008-12-28 02:39 it is unsigned :) 2008-12-28 02:39 ok 2008-12-28 02:39 patch coming 2008-12-28 02:41 ok 2008-12-28 02:41 btw, why do we have alloc_cursor(2)? 
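The `index` problem spotted above comes from the for-init clause: `for (int i = ..., index = start; ...)` declares a new `int index` that shadows the wider outer variable, so a large block number is narrowed. The sketch below reproduces that shape in isolation. The function names are hypothetical, `block_t` is assumed to be 64-bit, and the exact narrowed value is implementation-defined in C (on common compilers the high bits are simply dropped).

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t block_t; /* assumption: 64-bit block numbers */

/* Buggy shape: the for-init declares a second, int-typed index that
 * shadows the outer one. Returns the value the loop body would see. */
static block_t index_seen_buggy(block_t start)
{
	block_t index = 0;
	block_t seen = 0;
	for (int i = 0, index = start; i < 1; i++)
		seen = (block_t)index;  /* inner int index, already narrowed */
	(void)index;                    /* outer index was never assigned */
	return seen;
}

/* Fixed shape: assign the outer index first, declare only i in the loop. */
static block_t index_seen_fixed(block_t start)
{
	block_t index = start;
	block_t seen = 0;
	for (int i = 0; i < 1; i++)
		seen = index;
	return seen;
}

/* The -!!below idiom: start the loop at -1 when below is set, else at 0. */
static int first_i(int below)
{
	return -!!below;
}
```

A start value such as 0x1000000000, the hole size mentioned in the discussion, survives the fixed version and is mangled by the buggy one.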
2008-12-28 02:42 I thought that is what you were saying before 2008-12-28 02:42 allow for depth increase of 2 2008-12-28 02:42 ah, sorry 2008-12-28 02:42 I meant possibility on furture 2008-12-28 02:43 future 2008-12-28 02:43 now, I think 1 is enough 2008-12-28 02:43 ok, and it will assert if not 2008-12-28 02:43 yes, on level_push 2008-12-28 02:44 ok, now a spelling patch... seg[] to segveg[] 2008-12-28 02:45 sorry 2008-12-28 02:45 segvec[] 2008-12-28 02:45 or just vec[] 2008-12-28 02:45 ok 2008-12-28 02:45 anything but seg[] ;) 2008-12-28 02:45 that came from my earliest hack 2008-12-28 02:46 many users of seg 2008-12-28 02:49 an easy patch though 2008-12-28 02:49 next is to get rid of seek[2] 2008-12-28 02:50 if separate patch, it would help review 2008-12-28 02:50 it is separate 2008-12-28 02:50 good 2008-12-28 02:50 the respelling is already pushed 2008-12-28 02:51 btw, dwalk assert is not reproduced yet 2008-12-28 02:51 how many runs? 2008-12-28 02:51 I don't know 2008-12-28 02:51 infinite loop 2008-12-28 02:51 well that is good 2008-12-28 02:52 if it doesn't happen, it would be balloc locking 2008-12-28 02:52 reason is ... 2008-12-28 02:55 it looks like unlikely 2008-12-28 02:57 I will do the seek patch in 2 steps 2008-12-28 02:58 ok 2008-12-28 02:58 all seek -> &seek[0] 2008-12-28 02:58 -#define SEG_NEW (1 << 1) 2008-12-28 02:58 +#define SEG_NEW (1 << 1) 2008-12-28 02:58 2008-12-28 02:58 if ((err = probe(btree, start, cursor))) { 2008-12-28 02:58 - free_cursor(cursor); 2008-12-28 02:58 - return err; 2008-12-28 02:58 + segs = err; 2008-12-28 02:58 + goto out_unlock; 2008-12-28 02:58 then seek[0] ->walk0 and seek[1] -> walk1 2008-12-28 02:58 2008-12-28 02:58 goto out_unlock; 2008-12-28 02:58 } 2008-12-28 02:58 -; 2008-12-28 02:58 } 2008-12-28 02:58 } 2008-12-28 02:58 mark_buffer_dirty(cursor_leafbuf(cursor)); 2008-12-28 02:59 I talked in the middle of your patch ;) 2008-12-28 03:00 what is the SEG_NEW change for? 
2008-12-28 03:00 SEG_NEW(1 << 1) -> SEG_NEW(1 << 1) 2008-12-28 03:01 I can't see why I did it 2008-12-28 03:01 before 2008-12-28 03:01 just a indent 2008-12-28 03:01 done 2008-12-28 03:01 undented ;) 2008-12-28 03:01 ok 2008-12-28 03:02 wstart, wtail? 2008-12-28 03:03 dead without assert 2008-12-28 03:03 whoops 2008-12-28 03:03 sounds like dwalk_probe bug 2008-12-28 03:04 headwalk/tailwalk? 2008-12-28 03:04 looks like good, a bit long though :) 2008-12-28 03:05 spelling is easy to change, getting rid of the arrays needs care though 2008-12-28 03:05 yes 2008-12-28 03:11 seek[1] is just walk 2008-12-28 03:12 ah, yes 2008-12-28 03:13 so now we only need one new name, how about "head"? 2008-12-28 03:13 I like headwalk though 2008-12-28 03:13 ok 2008-12-28 03:13 done 2008-12-28 03:17 looks good like until 751 2008-12-28 03:17 751? 2008-12-28 03:17 current tip rev of public repo 2008-12-28 03:18 where did I goof? 2008-12-28 03:19 eh 2008-12-28 03:19 looks good 2008-12-28 03:19 ah 2008-12-28 03:19 looks like good 2008-12-28 03:19 good 2008-12-28 03:21 ok, now if we want to respell again later it is easy 2008-12-28 03:21 so I have finished messing with filemap for a whiel 2008-12-28 03:21 while 2008-12-28 03:21 the hang is way more important 2008-12-28 03:22 I'm testing 2008-12-28 03:23 flips: hey 2008-12-28 03:24 hi bh 2008-12-28 03:24 how's it going ? any new announcements ? 2008-12-28 03:26 http://kerneltrap.org/mailarchive/linux-fsdevel/2008/12/26/4493214/thread 2008-12-28 03:32 ACTION goes to read 2008-12-28 03:34 hirofumi, what did the backtrace look like that time? 2008-12-28 03:35 http://userweb.kernel.org/~hirofumi/oops 2008-12-28 03:35 this? 2008-12-28 03:36 same one then 2008-12-28 03:36 and this is with the dleaf_checks in? 2008-12-28 03:37 no 2008-12-28 03:37 after dleaf_check, it was same 2008-12-28 03:37 ok, so we know the leaf was ok when it got written 2008-12-28 03:37 and not ok when read 2008-12-28 03:37 right? 
2008-12-28 03:38 not perfectly sure 2008-12-28 03:38 dleaf_check is not perfect 2008-12-28 03:38 right 2008-12-28 03:38 now, I'm trying to dump dwalk 2008-12-28 03:39 yes 2008-12-28 03:39 most likely is the bug of dwalk_probe 2008-12-28 03:44 looking at dwalk line 536 2008-12-28 03:44 the assumption is, there always is a next entry in the group at that point? 2008-12-28 03:45 no 2008-12-28 03:45 if this extent is last of this group, next would be next group 2008-12-28 03:52 so the assert says that entry is positioned past the end of a group 2008-12-28 03:53 in other words, not positioned on a valid entry, and not at the end 2008-12-28 03:53 while (walk->entry > walk->estop) { 2008-12-28 03:53 if (entry_keylo(walk->entry - 1) > keylo) 2008-12-28 03:53 break; 2008-12-28 03:53 walk->entry--; 2008-12-28 03:53 walk->extent = walk->exstop; 2008-12-28 03:53 walk->exstop = walk->exbase + entry_limit(walk->entry); 2008-12-28 03:53 } 2008-12-28 03:53 this loop? 2008-12-28 03:55 the dwalk_next that asserted is at line 536, no?
2008-12-28 03:55 if (walk->extent == walk->exstop) { 2008-12-28 03:55 if (walk->entry == walk->estop) { 2008-12-28 03:55 if (walk->group == walk->gstop) 2008-12-28 03:55 return 0; 2008-12-28 03:55 if next is last entry, it should at estop 2008-12-28 03:56 yes 2008-12-28 03:56 ah 2008-12-28 03:56 yes 2008-12-28 03:56 just trying to catch up with you ;) 2008-12-28 03:56 that assert is line 536 2008-12-28 03:56 I don't think I will get there tonight 2008-12-28 03:57 yes 2008-12-28 03:57 I 2008-12-28 03:57 ok 2008-12-28 03:58 this can't reproduce easily, so it would need time more or less 2008-12-28 03:59 sleep time for me 2008-12-28 03:59 yes, oyasumi 2008-12-28 03:59 it is the kind of thing we can share with the community 2008-12-28 03:59 we don't have to be bug free this week :) 2008-12-28 03:59 ok :) 2008-12-28 03:59 actually, it is way more stable than a project like this usually is at this point 2008-12-28 04:00 something has been done right 2008-12-28 04:01 yes, I want some stable for atomic commit 2008-12-28 04:01 because, bug with atomic commit, it should be more complex 2008-12-28 04:02 if we can't believe base repo at all, I thought debug is really hard 2008-12-28 04:02 true, that 2008-12-28 04:03 ah, btw, git.tux3.org seems to be missing css 2008-12-28 04:04 I didn't even know shapor made that ;) 2008-12-28 04:04 :) 2008-12-28 04:04 we will have a real git tree on kernel.org 2008-12-28 04:04 my current tree is a throwaway 2008-12-28 04:04 ok 2008-12-28 04:05 anyway, I will still complain to shapor 2008-12-28 04:05 he's a kickass sysadmin by the way 2008-12-28 04:06 great 2008-12-28 04:06 move tux3.org off my server here, so now it's partly on his server and partly on mine, with hardly any breakage 2008-12-28 04:06 s/move/he moved/ 2008-12-28 04:06 i see 2008-12-28 04:30 -!- cydork(~vihang@59.184.9.128) has joined #tux3 2008-12-28 04:58 -!- stargazr5(~gauravstt@121.246.35.148) has joined #tux3 2008-12-28 08:16 -!- 
tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-28 08:51 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-28 09:34 -!- cydork(~vihang@59.184.38.148) has joined #tux3 2008-12-28 10:33 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-28 11:05 -!- RazvanM(~RazvanM@96.234.233.172) has joined #tux3 2008-12-28 11:39 -!- kushal(~kushal@121.246.32.247) has joined #tux3 2008-12-28 12:07 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-12-28 14:28 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-28 14:55 -!- kbingham_(~kbingham@82-46-4-172.cable.ubr03.aztw.blueyonder.co.uk) has joined #tux3 2008-12-28 15:29 time to finalize details of buffer forking 2008-12-28 15:29 a good skate should do it 2008-12-28 17:30 ok, time to take a trip into radix.c 2008-12-28 17:31 btw, bug was found 2008-12-28 17:32 :) 2008-12-28 17:32 what was it? 2008-12-28 17:32 it was dleaf_merge() 2008-12-28 17:32 dleaf_merge() didn't check MAX_GROUP_ENTRIES 2008-12-28 17:32 ACTION looks 2008-12-28 17:33 oh :) 2008-12-28 17:33 my mistake 2008-12-28 17:34 fsx-linux is working for a hour or so now 2008-12-28 17:34 awesome 2008-12-28 17:34 ok, buffer forking 2008-12-28 17:34 in kernel, is page forking 2008-12-28 17:35 remove page from radix tree, along with attached buffers, and insert a copy in its place, ideally under the radix tree lock 2008-12-28 17:35 but I do not think the radix tree lock is exported, just checking now 2008-12-28 17:35 however if it is not, that is not a real problem at least for metadata 2008-12-28 17:36 because our fs will be the only code accessing that page cache 2008-12-28 17:36 and we can use our own lock 2008-12-28 17:37 dleaf_merge... only the "possibly merge" group is affected, right? 
i.e., uncut 2008-12-28 17:37 I'm not sure 2008-12-28 17:37 so the fix would be, another condition in int uncut = 2008-12-28 17:38 but you already did a fix 2008-12-28 17:38 obviously 2008-12-28 17:38 I did all prepare before uncut 2008-12-28 17:39 yes, it would be uncut case 2008-12-28 17:39 there is 3 cases 2008-12-28 17:39 there is no room to copy at all 2008-12-28 17:40 there is room all entry/extent to copy 2008-12-28 17:40 there is room partially 2008-12-28 17:40 well, forking 2008-12-28 17:41 why is it removing from radix tree? 2008-12-28 17:41 the case is, we are dirtying a buffer that is already dirty in a previous delta 2008-12-28 17:42 however, it may be still using from userland? 2008-12-28 17:42 we want to take the buffer out of the page cache and put a copy there 2008-12-28 17:42 this is our metadata 2008-12-28 17:42 ah, metadata 2008-12-28 17:42 ok 2008-12-28 17:42 we don't actually have to do forking for user-accessible pages, that would be stronger-than-posix semantics 2008-12-28 17:43 It would be nice to do it eventually 2008-12-28 17:43 yes 2008-12-28 17:43 well, for data=journal, it is needed? 2008-12-28 17:44 yes 2008-12-28 17:44 but we don't have to do that first 2008-12-28 17:44 i see 2008-12-28 17:44 aim at data=ordered semantics, which is what almost everybody uses 2008-12-28 17:45 which defers some tricky core kernel issues 2008-12-28 17:45 and reduces the amount of physical block redirect we have to do 2008-12-28 17:45 i see 2008-12-28 17:45 anyway, first thing to get right is metadata 2008-12-28 17:46 ok 2008-12-28 17:47 I just checked the exported radix tree api in case somebody already wrote radix_tree_fork 2008-12-28 17:48 but no 2008-12-28 17:48 and no surprise 2008-12-28 17:48 so later, we will 2008-12-28 17:48 and for now, do the radix tree remove-copy-insert under our own lock 2008-12-28 17:49 our btree lock to be precise 2008-12-28 17:50 remove-copy-insert? 
2008-12-28 17:50 a page fork 2008-12-28 17:50 remove page1, copy from page1 to page2, insert page1 and page2? 2008-12-28 17:50 yes 2008-12-28 17:50 page1 buffers remain attached to page1 2008-12-28 17:52 um... 2008-12-28 17:52 then we do get_segs on dirty buffers on page1 and attach page1 to a bio for writeout, using the buffer state to know which parts to write transfer 2008-12-28 17:54 for page1 2008-12-28 17:54 remove page1 from radix, copy from page1 data to page2 data, insert page1 to radix? 2008-12-28 17:54 yes 2008-12-28 17:55 both radix is same? 2008-12-28 17:55 yes 2008-12-28 17:55 it's our block device page cache 2008-12-28 17:56 page2 is inserted to what radix? 2008-12-28 17:56 to which index? or which tree? 2008-12-28 17:56 which tree? 2008-12-28 17:56 same tree 2008-12-28 17:56 ah 2008-12-28 17:56 index is different? 2008-12-28 17:56 and same index 2008-12-28 17:57 page1 is different index? 2008-12-28 17:57 same index 2008-12-28 17:57 there is no redirect going on here 2008-12-28 17:57 that happens in get_segs 2008-12-28 17:57 after the fork 2008-12-28 17:58 on insert page2 point, it is already forked? 2008-12-28 17:58 forked and redirected? 2008-12-28 17:59 not redirected yet 2008-12-28 17:59 page1 and page2 is same index, and insert those to same radix? 2008-12-28 17:59 yes 2008-12-28 18:00 um..., same index is valid? 2008-12-28 18:00 yes, we are replacing the radix tree page at that index 2008-12-28 18:01 page1 and page2 are inserted to same radix with same index? 2008-12-28 18:02 page1 is removed, page2 is inserted in its place 2008-12-28 18:02 ah, ok 2008-12-28 18:03 so, page1 is different index when is it inserted to radix tree? 2008-12-28 18:03 http://lxr.linux.no/linux+v2.6.27/include/linux/radix-tree.h#L152 <- radix_tree_replace_slot 2008-12-28 18:04 149 * For use with radix_tree_lookup_slot(). Caller must hold tree write locked 2008-12-28 18:04 150 * across slot lookup and replacement. 
2008-12-28 18:05 page1 position will be replaced by page2? 2008-12-28 18:05 yes 2008-12-28 18:06 page1 is inserted to where? 2008-12-28 18:06 same radix? 2008-12-28 18:07 and same index 2008-12-28 18:07 same tree, same index 2008-12-28 18:08 that place will be replaced again by page1 2008-12-28 18:08 ? 2008-12-28 18:08 no, after page1 is taken out of the radix tree it will be released after IO is competed 2008-12-28 18:09 this is a pretty simple situation, because only our code uses that page, and it will not use the page except for the transfer after the page is forked 2008-12-28 18:09 page1 is not inserted to anywhere? 2008-12-28 18:10 inserted into a bio 2008-12-28 18:10 yes 2008-12-28 18:10 I think the bio can hold the only reference to the page 2008-12-28 18:10 ok 2008-12-28 18:10 page1 is not inserted into any radix tree? 2008-12-28 18:10 no 2008-12-28 18:11 except it might be kind of cool to sort the metadata pages for a delta that way 2008-12-28 18:11 maybe 2008-12-28 18:12 it would be nice if a single bio could handle more than one physically contiguous metadata block 2008-12-28 18:12 slot = radix_tree_lookup_slot(page_index(page1)) 2008-12-28 18:13 radix_tree_replace_slot(slot, page2) 2008-12-28 18:13 and radix_tree_replace_slot(slot, page1) 2008-12-28 18:13 ? 2008-12-28 18:13 it seems that is all exported 2008-12-28 18:13 and: spin_lock_irq(&mapping->tree_lock); 2008-12-28 18:14 so maybe we can write fork_page using the exported radix tree api 2008-12-28 18:14 page2 position was replaced by page1 again? 2008-12-28 18:15 sorry, no, only one replace_slot 2008-12-28 18:15 slot = radix_tree_lookup_slot(page_index(page1)) 2008-12-28 18:15 radix_tree_replace_slot(slot, page2) 2008-12-28 18:15 ? 2008-12-28 18:15 yes 2008-12-28 18:16 so, page1 is inserted into which position? 
2008-12-28 18:16 slot 2008-12-28 18:16 sorry 2008-12-28 18:17 no, page1 is not inserted anyway 2008-12-28 18:17 ah 2008-12-28 18:17 ok 2008-12-28 18:17 it now has now mapping 2008-12-28 18:17 it now has no mapping 2008-12-28 18:17 i see 2008-12-28 18:18 need to code a trial version of this 2008-12-28 18:18 just as you wrote it above 2008-12-28 18:18 seems too simple :) 2008-12-28 18:18 why do we replace page1 by page2? 2008-12-28 18:19 I imaged, lock -> copy page1 to page2 -> use page2 2008-12-28 18:19 because page1 may already be under write IO 2008-12-28 18:19 in flight 2008-12-28 18:20 it seems there is no problem 2008-12-28 18:20 it looks good to me 2008-12-28 18:20 if it's not "read" 2008-12-28 18:21 I thought, if we removed page1 from radix, someone try to read page1 index 2008-12-28 18:21 someone may try to read 2008-12-28 18:22 that is why it has to be under the tree_lock 2008-12-28 18:22 i see 2008-12-28 18:23 removing page1 is unnecessary? 2008-12-28 18:24 it is necessary, because the caller that is doing the fork will modify the page, and so will other callers later 2008-12-28 18:25 will modify the _logical_ page 2008-12-28 18:25 that is, via the radix tree 2008-12-28 18:26 ok, this has to be considered: blockread in parallel with blockfork 2008-12-28 18:26 yes 2008-12-28 18:27 alloc page2 -> radix tree lock -> copy page1 to page2 -> use page2 2008-12-28 18:27 it may work? 2008-12-28 18:28 it is assuming page1 has stable data 2008-12-28 18:28 kthat does not handle the case where the page cache page is already attached to a bio, under write IO 2008-12-28 18:28 we are not allowed to modify the page then 2008-12-28 18:28 which is the reason the fork is needed 2008-12-28 18:29 if page1 under write IO, what is problem? 
2008-12-28 18:29 right, you said "use page2" above, well page2 has to be accessible via the radix tree 2008-12-28 18:29 maybe that is what you meant by "use page2" 2008-12-28 18:30 use page2 means we will use it for new write I/O 2008-12-28 18:30 so, page1 is in radix tree 2008-12-28 18:31 ok, so what I said is correct, the problem with that suggestion is write IO that is already in flight 2008-12-28 18:32 yes, page1 may be write IO in flight 2008-12-28 18:32 however, what is problem if so? 2008-12-28 18:33 page1 is currently mapped into page cache and under write IO, now somebody wants to change a block on page1 2008-12-28 18:33 this is not allowed 2008-12-28 18:33 but we can replace page1 in the page cache, then the caller can go ahead and change it 2008-12-28 18:34 ah, i see 2008-12-28 18:34 ok 2008-12-28 18:35 page1 can't modify anymore in here? 2008-12-28 18:35 correct 2008-12-28 18:36 now, details of blockread and blockfork in parallel 2008-12-28 18:36 and we have to make sure it doesn't have any reference? 2008-12-28 18:36 s/it/someone/ 2008-12-28 18:36 I'm just looking at that question now 2008-12-28 18:37 i see 2008-12-28 18:37 let me see, we may not want read_mapping_page in blockread, but the variant that reads the page locked 2008-12-28 18:38 it can be 2008-12-28 18:39 read_mapping_page() may not be good for this 2008-12-28 18:39 rather it can be bad 2008-12-28 18:40 we can always check the mapping after getting the page, and read again if we got a page without a mapping 2008-12-28 18:40 but there may be a tidier way 2008-12-28 18:40 that is, we could read_mapping_page(); lock_page() if (!mapping) ...try again 2008-12-28 18:41 lock_page() after read_mapping_page() would be invalid 2008-12-28 18:42 um.. 
2008-12-28 18:42 you wrote that :) 2008-12-28 18:42 well, we have to consider recursive lock_page 2008-12-28 18:42 yes 2008-12-28 18:42 and recursive lock was found 2008-12-28 18:42 because of block IO library 2008-12-28 18:44 well, we have to care about lock_page 2008-12-28 18:44 we can take it only when new page 2008-12-28 18:44 read_mapping_page() releases lock_page after read 2008-12-28 18:45 lock_page() shouldn't be released for us 2008-12-28 18:45 right, if there is no variant that keeps the lock when our code gets a little more complicate 2008-12-28 18:45 compilcated 2008-12-28 18:46 well I don't think any variant keeps the lock 2008-12-28 18:46 the read can block 2008-12-28 18:46 so we need to take the lock after and see if the page is still in the mapping 2008-12-28 18:47 blockread needs to do that, because nothing stops a blockform from happening at the same time 2008-12-28 18:47 blockfork I mean 2008-12-28 18:48 yes, it seems there is no variant 2008-12-28 18:50 or, we can just call submit_bh(READ) 2008-12-28 18:51 or submit_bio 2008-12-28 18:51 yes 2008-12-28 18:51 however, 2 hours was passed 2008-12-28 18:51 promising 2008-12-28 18:51 ah 2008-12-28 18:51 ? 
2008-12-28 18:51 we are locking blockread by some locks 2008-12-28 18:52 so, find_buffer() is no problem 2008-12-28 18:52 true, our btree->lock 2008-12-28 18:52 and bitmap->i_mutex 2008-12-28 18:52 so we can worry about finer granularity smp later 2008-12-28 18:53 so, there is no race between blockread and blockread 2008-12-28 18:53 blockfork 2008-12-28 18:53 it meant blockread 2008-12-28 18:53 oh, you meant blockread and blockread 2008-12-28 18:53 right 2008-12-28 18:53 maybe, it is why fsx-linux is working for now 2008-12-28 18:54 I don't see a race even without the higher level locking 2008-12-28 18:54 I added find_buffer() before read_mapping_page() 2008-12-28 18:54 the race comes from blockio library, does lock_page, then calls get_segs which does lock_page 2008-12-28 18:54 right 2008-12-28 18:55 sorry, not the race, the deadlock 2008-12-28 18:55 that is a really deep hack you did to avoid it ;) 2008-12-28 18:55 yes 2008-12-28 18:55 and soon we will not have to do that hack 2008-12-28 18:55 good 2008-12-28 18:56 so, maybe, we have to modify blockread() more or less 2008-12-28 18:57 and remove find_buffer() 2008-12-28 18:57 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-28 18:57 that will be nice 2008-12-28 18:58 maybe, do read_mapping_page() by hand, and use ->private_lock or lock_page 2008-12-28 19:01 or use page directly 2008-12-28 19:02 block IO library will not be involved in our metadata IO at that point 2008-12-28 19:04 block IO library is used to prepare page? 2008-12-28 19:04 e.g. bitmap 2008-12-28 19:06 as in ->write_begin in blockget? 2008-12-28 19:07 and ->readpage? 2008-12-28 19:07 that just calls into our code, so we can just call our code directly 2008-12-28 19:08 ah 2008-12-28 19:09 starting to sound nice :) 2008-12-28 19:09 yes 2008-12-28 19:09 by the way, I think we have a golden copy 2008-12-28 19:09 use get_segs() for blockread? 2008-12-28 19:10 golden copy? 
2008-12-28 19:10 golden copy => a special copy because it passes fsx-linux 2008-12-28 19:10 you have some changes for these tests? 2008-12-28 19:11 your deadlock hack? 2008-12-28 19:11 yes 2008-12-28 19:11 I should merge, make a new patch, post, and invite people to find bugs 2008-12-28 19:12 a nice way to end the year for you 2008-12-28 19:12 as a hero 2008-12-28 19:12 with find_buffer() hack? 2008-12-28 19:12 yes, with 2008-12-28 19:12 ok 2008-12-28 19:12 it's gross, but it makes fsx-linux run 2008-12-28 19:12 yes 2008-12-28 19:12 ok, I'll prepare to merge 2008-12-28 19:42 done 2008-12-28 19:56 pull? 2008-12-28 20:03 dleaf_merge was completely rewritten :) 2008-12-28 20:04 it has a big new conditional, but looks perfectly efficient 2008-12-28 20:34 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-28 20:38 ah, that was wrong 2008-12-28 20:38 what was? 2008-12-28 20:38 I was going to don't modify vfrom 2008-12-28 20:38 however, current patch modify vfrom 2008-12-28 20:39 there is not actual problem now, but... 2008-12-28 20:39 in dleaf_merge? 2008-12-28 20:39 yes 2008-12-28 20:39 I will pull again when you are ready 2008-12-28 20:40 ok 2008-12-28 20:43 ah, set_group_count(group2, gcount2 - room); 2008-12-28 20:43 yes, and entry2 2008-12-28 20:43 -!- kushal(~kushal@121.246.33.4) has joined #tux3 2008-12-28 20:44 gcount2 -= room I guess 2008-12-28 20:44 and entry2 -= limit_adjust2 2008-12-28 20:45 to avoid to modify vfrom? 
2008-12-28 20:45 I thought it was what you would do 2008-12-28 20:46 yes, current code does 2008-12-28 20:47 I thought I shouldn't modify vfrom first, but I wrote the current code somehow 2008-12-28 20:47 and it passed the test 2008-12-28 20:47 yes 2008-12-28 20:48 vfrom will be throwed away 2008-12-28 20:49 it will, but still 2008-12-28 20:51 ok, I have hackfs ready to try out the fork idea 2008-12-28 21:18 spin_lock_irq(&mapping->tree_lock); 2008-12-28 21:18 memcpy(page_address(page2), page_address(page), PAGE_CACHE_SIZE); 2008-12-28 21:18 slot = radix_tree_lookup_slot(&page->mapping->page_tree, page_index(page)); 2008-12-28 21:18 radix_tree_replace_slot(slot, page2); 2008-12-28 21:20 now what else does it need to be a forkblock 2008-12-28 21:20 buffer_heads 2008-12-28 21:21 the page replaced still has count 2 2008-12-28 21:53 dleaf_merge fix was done 2008-12-28 21:53 and restarted the fsx-linux challenge 2008-12-28 21:53 reading 2008-12-28 21:55 a big change 2008-12-28 21:56 pulling, I have faith 2008-12-28 21:57 thanks 2008-12-28 21:58 with this, vfrom shouldn't be modified 2008-12-28 22:00 can_merge_group = 0; <- an extra line! 2008-12-28 22:01 rare that I see one from you 2008-12-28 22:01 worth remarking on 2008-12-28 22:01 extra line? 2008-12-28 22:01 can_merge_group is already zero 2008-12-28 22:01 oh 2008-12-28 22:02 that was a really big rewrite you did in a few minutes 2008-12-28 22:02 I should remove it 2008-12-28 22:02 old one was also had it 2008-12-28 22:07 done 2008-12-28 22:07 please pull again 2008-12-28 22:07 done 2008-12-28 22:08 I'll run fsx-linux on unchanged version 2008-12-28 22:08 continue to run 2008-12-28 22:08 probably, for 8 hours 2008-12-28 22:10 page count 2 is mapping and has buffer_heads? 
2008-12-28 22:11 I think, lru and mapping 2008-12-28 22:11 it has been a while since I played in vm 2008-12-28 22:12 I don't think we take a count for buffer_heads 2008-12-28 22:12 I just saw comment of migrate_page_move_mapping() 2008-12-28 22:12 !!page->private counts as a count 2008-12-28 22:13 yes 2008-12-28 22:16 so now... simultaneous blockread and blockfork... can it happen with our heavy locking? 2008-12-28 22:17 have to think about specific kinds of blocks 2008-12-28 22:18 I don't think it can happen 2008-12-28 22:18 -!- RazvanM(~RazvanM@96.234.233.172) has joined #tux3 2008-12-28 22:19 blockfork is called from get_segs()? 2008-12-28 22:19 it will be part of blockdirty 2008-12-28 22:19 which we have to do _before_ changing the buffer 2008-12-28 22:19 i see 2008-12-28 22:20 so anywhere we have mark_buffer_dirty, we replace by a blockdirty(buffer) before changing anything 2008-12-28 22:20 there are 45 or so such places, not too many 2008-12-28 22:21 however, it is before? 2008-12-28 22:21 blockdirty has to go before the change 2008-12-28 22:21 to do the fork 2008-12-28 22:21 if necessary 2008-12-28 22:22 so, mark_buffer_dirty() already changed 2008-12-28 22:22 mark_buffer_dirty() is called after modify 2008-12-28 22:22 we will remove the existing mark_buffer_dirty calls 2008-12-28 22:23 start change and end change is needed? 
2008-12-28 22:23 I don't think so 2008-12-28 22:24 I will try to explain 2008-12-28 22:24 then I will add that to an email 2008-12-28 22:24 these changes are made the file operations 2008-12-28 22:25 after each file operations, completes, it checks to see if a delta transition should be done 2008-12-28 22:25 to make the delta transition, it just increases the delta counter 2008-12-28 22:25 yes 2008-12-28 22:26 when the final file operation in the previous delta completes, the delta is staged 2008-12-28 22:27 that is the only way a dirty metadata block can become clean 2008-12-28 22:27 when the delta is written 2008-12-28 22:27 hey flips...how/where are the block numbers assigned to new blocks?? 2008-12-28 22:27 yes 2008-12-28 22:28 my explanation is not so great... yet 2008-12-28 22:28 kushai, in filemap.c get_segs(... 1) 2008-12-28 22:29 ok, so the short, accurate eplanation is that the filesystem operation will always complete before the block becomes clean 2008-12-28 22:30 so, we need to read delta counter, and check delta count (start and end) 2008-12-28 22:30 yes 2008-12-28 22:30 ? 2008-12-28 22:30 in userspace?? 2008-12-28 22:30 kushal, also in userspace 2008-12-28 22:31 ok, it means transaction start and end? 2008-12-28 22:31 yes 2008-12-28 22:31 ok 2008-12-28 22:33 transaction start has to verify enough space available to write metadata for the transation and increase the active transaction count 2008-12-28 22:34 ok...in guess_extent... 
2008-12-28 22:35 I need to respell that to guess_region 2008-12-28 22:35 extent is confusing 2008-12-28 22:35 ok 2008-12-28 22:37 done 2008-12-28 22:38 i see 2008-12-28 22:40 one oddity about block fork 2008-12-28 22:40 there may be multiple blocks on the page 2008-12-28 22:40 when we fork one, we have to fork them all 2008-12-28 22:40 by the way, fork == cow 2008-12-28 22:41 I just don't like to think "mooo" when I work on this code, and don't like acronyms much 2008-12-28 22:41 so for tux3 it is fork 2008-12-28 22:41 why is it needed? 2008-12-28 22:41 we can't avoid it 2008-12-28 22:41 page cache has page granularity 2008-12-28 22:42 yes 2008-12-28 22:42 however, can we copy only interesting block? 2008-12-28 22:42 true, it can work 2008-12-28 22:43 or can it 2008-12-28 22:43 hmm 2008-12-28 22:43 well, I don't care it much though 2008-12-28 22:43 just out of interest 2008-12-28 22:44 if you have four blocks on a page and blockdirty() them one at a time, then you would end up with four pages copied 2008-12-28 22:44 each with its own set of buffers 2008-12-28 22:44 it's very interesting to me :) 2008-12-28 22:45 because it has to work, and it is pretty new stuff. Ext3 does something like this, but it is very complex 2008-12-28 22:46 yes, iirc, ext3 is using kmalloc() or such 2008-12-28 22:46 can we check existing delta page for blockdiryt()? 2008-12-28 22:47 actually it is ok to fork all dirty buffers at the same time, then another fork will never be needed on the same page 2008-12-28 22:47 yes 2008-12-28 22:47 each dirty buffer becomes a member of the active delta, and so will clean blocks if they are dirtied later 2008-12-28 22:47 so it is an oddity, not a prolem 2008-12-28 22:48 problem 2008-12-28 22:48 but we will have two pages? 2008-12-28 22:48 we will 2008-12-28 22:48 and the original page will be freed when the bio completes 2008-12-28 22:49 in this case, I think IO can't start 2008-12-28 22:49 why? 
2008-12-28 22:49 becase two copy means delta is not completed? 2008-12-28 22:49 two copy on same delta 2008-12-28 22:50 one belongs to a previous delta, and the other (the new page, now in the page cache) belongs to the active delta 2008-12-28 22:50 there are never two copies in the same delta 2008-12-28 22:50 fork twice? 2008-12-28 22:51 um... 2008-12-28 22:51 why am I think we will have two pages? :) 2008-12-28 22:51 why do I think 2008-12-28 22:52 fork can happen only once on the same logical page, after that there are two pages, one in flight and the other in page cache 2008-12-28 22:52 yes 2008-12-28 22:53 ah, it is if we copy only interesting block 2008-12-28 22:53 right, I assume we will copy all dirty blocks 2008-12-28 22:53 it is a slight additional copying expense, but doesn't really matter 2008-12-28 22:54 will just stupidly copy the whole page for now 2008-12-28 22:54 yes 2008-12-28 22:55 copy is for fork, so we have to copy all on the page? 2008-12-28 22:56 because we have to prepare for read too 2008-12-28 22:56 yes 2008-12-28 22:56 of course 2008-12-28 22:56 so, we will copy the whole page always 2008-12-28 22:57 when buffers exist on the page, we could possibly save some copying 2008-12-28 22:57 with some messy code 2008-12-28 22:57 buffers and non-uptodate? 2008-12-28 22:57 right 2008-12-28 22:57 i see 2008-12-28 22:58 some blocks might not be metadata 2008-12-28 22:59 or they might be free on the fs 2008-12-28 22:59 on page cache? 
2008-12-28 22:59 yes, they could be mapped to file page cache 2008-12-28 23:00 above, I am thinking about metadata only, but actually, this approach works for file page cache as well, for example our directories and bitmap 2008-12-28 23:01 yes 2008-12-28 23:01 I was thinking that as bitmap 2008-12-28 23:04 ah, so it can be free 2008-12-28 23:44 -!- pranith(ca4bcee2@webchat.mibbit.com) has joined #tux3 2008-12-29 03:32 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-12-29 05:13 -!- kushal(~kushal@121.246.36.176) has joined #tux3 2008-12-29 08:05 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2008-12-29 08:46 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-29 09:16 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-29 09:25 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-29 09:53 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-29 10:08 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 10:16 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 10:24 -!- fqh(~fqh@218.13.187.233) has joined #tux3 2008-12-29 10:26 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 10:26 -!- fqh(~fqh@218.13.187.233) has left #tux3 2008-12-29 10:30 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 10:33 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-29 10:37 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 10:43 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 10:54 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 11:00 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 11:01 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-29 11:03 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-12-29 11:16 -!- pranith(~bobby@122.162.71.247) has joined #tux3 
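[editor's sketch] The block fork discussion of 2008-12-28 (17:34 to 23:04) can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions, not the tux3 code: a flat slot array stands in for the page cache radix tree, there is no locking (in kernel the copy and slot replacement would sit under the tree lock or the btree lock, as discussed), and the names page_model, cache_model, fork_page and the blockdirty signature are hypothetical. It shows the rule from the chat: blockdirty runs before the change, forks only when the page is already dirty in a previous delta, copies the whole page, and leaves page1 unmapped so that only the in-flight write holds it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE_MODEL 4096	/* stand-in for PAGE_CACHE_SIZE */
#define CACHE_SLOTS 16		/* toy "radix tree": a flat array of slots */

/* Hypothetical model of a page cache page, not a kernel struct page. */
struct page_model {
	unsigned char data[PAGE_SIZE_MODEL];
	int mapped;		/* 1 while reachable through its cache slot */
	int dirty;		/* dirtied and not yet written out */
	unsigned delta;		/* delta in which the page was last dirtied */
};

struct cache_model {
	struct page_model *slot[CACHE_SLOTS];
};

/*
 * Fork: copy page1 into a fresh page2 and put page2 in the same slot.
 * page1 is not reinserted anywhere; it simply loses its mapping and is
 * assumed to be owned by the write that is already in flight.
 */
static struct page_model *fork_page(struct cache_model *cache, unsigned index)
{
	struct page_model *page1 = cache->slot[index];
	struct page_model *page2 = malloc(sizeof *page2);

	if (!page2)
		return NULL;
	/* "stupidly copy the whole page", as in the log */
	memcpy(page2->data, page1->data, PAGE_SIZE_MODEL);
	page2->mapped = 1;
	page2->dirty = 0;
	page2->delta = page1->delta;
	cache->slot[index] = page2;	/* same tree, same index */
	page1->mapped = 0;		/* page1 now has no mapping */
	return page2;
}

/*
 * Called before modifying a block. Returns the page the caller may
 * change: the cached page itself, or its fork if the cached page is
 * still dirty from an earlier delta.
 */
static struct page_model *blockdirty(struct cache_model *cache,
				     unsigned index, unsigned active_delta)
{
	struct page_model *page = cache->slot[index];

	if (page->dirty && page->delta != active_delta) {
		page = fork_page(cache, index);
		if (!page)
			return NULL;
	}
	page->dirty = 1;
	page->delta = active_delta;
	return page;
}
```

In this model a second blockdirty in the same delta returns the same page, which matches the remark that a fork can happen only once per logical page per delta.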
2008-12-29 11:19 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 11:31 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 11:41 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 12:04 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 12:09 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 12:18 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 13:04 -!- kbingham_(~kbingham@82-46-4-172.cable.ubr03.aztw.blueyonder.co.uk) has joined #tux3 2008-12-29 14:21 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 14:26 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 14:35 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 14:40 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 14:49 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 15:02 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 15:12 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 15:16 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 15:21 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 15:29 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 15:35 -!- pranith(~bobby@122.162.71.247) has joined #tux3 2008-12-29 16:13 design note on delta transition coming 2008-12-29 16:14 this is a key one, need to go for a skate to clarify 2008-12-29 16:51 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-29 17:24 flips: what about online checking ? 
2008-12-29 17:43 bh, atomic commit is the current priority 2008-12-29 17:44 and offline checking is still not done 2008-12-29 17:44 ok, good skate, I got a nice improvement to the cache layering model out of it 2008-12-29 17:45 design note coming, Cache Layering Revisited 2008-12-29 17:46 the only conflict I had before between the frontend and backend is when frontend wants to create a file and update an inode table block 2008-12-29 17:47 it has to update the inode table block mainly so that a later create will not choose the same inode number 2008-12-29 17:48 so the alternative approach is to choose the inode number just as now, by searching the itable leaves, but to defer updating the itable leaf 2008-12-29 17:48 keep a list of inodes for which itable update is deferred 2008-12-29 17:48 and check that list when choosing a new inode number 2008-12-29 17:50 this takes place in make_inode 2008-12-29 19:16 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-29 20:09 -!- RazvanM(~RazvanM@96.234.233.172) has joined #tux3 2008-12-29 21:34 -!- kushal(~kushal@121.246.36.176) has joined #tux3 2008-12-29 21:38 name change coming: get_segs => map_region, it more clearly states the purpose 2008-12-29 21:57 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-29 22:46 -!- RazvanM_(~RazvanM@96.234.237.155) has joined #tux3 2008-12-29 23:49 my simple design note on the simple improvement to inode creation blew up into a big long design note on many aspects of atomic commit 2008-12-29 23:49 oh well 2008-12-29 23:50 had to be done sometime 2008-12-29 23:58 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-12-30 03:05 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-12-30 03:28 -!- kbingham_(~kbingham@82-46-4-172.cable.ubr03.aztw.blueyonder.co.uk) has joined #tux3 2008-12-30 04:36 -!- fqh(~fqh@219.131.240.192) has joined #tux3 2008-12-30 04:50 -!- RazvanM(~RazvanM@96.234.237.46) has joined #tux3 
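The deferred itable update described at 17:48 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the tux3 code: the names choose_inum, deferred_inums and itable_stub are hypothetical, the itable leaf search is replaced by a tiny in-memory array, and the starting point for the search (called goal here) is an assumed parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t inum_t;

#define MAX_DEFERRED 64

/* Hypothetical list of inode numbers already handed out whose
 * itable leaf update has been deferred. */
struct deferred_inums {
	inum_t inum[MAX_DEFERRED];
	size_t count;
};

/* Stand-in for searching the itable leaves: a plain array of
 * inode numbers that are already recorded in the itable. */
struct itable_stub {
	const inum_t *used;
	size_t count;
};

static bool in_itable(const struct itable_stub *itable, inum_t inum)
{
	for (size_t i = 0; i < itable->count; i++)
		if (itable->used[i] == inum)
			return true;
	return false;
}

static bool is_deferred(const struct deferred_inums *deferred, inum_t inum)
{
	for (size_t i = 0; i < deferred->count; i++)
		if (deferred->inum[i] == inum)
			return true;
	return false;
}

/* Pick the lowest inode number >= goal that is neither in the itable
 * nor on the deferred list, then remember it on the deferred list so
 * a later create cannot choose it again. Returns false if the deferred
 * list is full (a real implementation would flush instead). */
static bool choose_inum(const struct itable_stub *itable,
			struct deferred_inums *deferred,
			inum_t goal, inum_t *result)
{
	assert(result != NULL);
	if (deferred->count == MAX_DEFERRED)
		return false;
	inum_t inum = goal;
	while (in_itable(itable, inum) || is_deferred(deferred, inum))
		inum++;
	deferred->inum[deferred->count++] = inum;
	*result = inum;
	return true;
}
```

The point of the design is that the frontend never dirties an itable block at create time; the backend can apply the deferred entries to the itable leaves later and then empty the list.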
2008-12-30 05:34 -!- cydork_(~cydoork@122.169.100.164) has joined #tux3 2008-12-30 05:59 -!- kushal(~kushal@121.246.32.76) has joined #tux3 2008-12-30 06:00 hey flips...are u performing any sort of tree balancing for the btrees...? 2008-12-30 06:00 as in to avoid skews? 2008-12-30 06:11 -!- cydork_(~cydoork@122.169.100.164) has joined #tux3 2008-12-30 06:16 -!- cydork_(~cydoork@122.169.100.164) has joined #tux3 2008-12-30 06:42 -!- RazvanM_(~RazvanM@96.234.235.129) has joined #tux3 2008-12-30 07:29 -!- kushal(~kushal@121.246.32.76) has joined #tux3 2008-12-30 08:00 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-30 08:16 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-30 08:44 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-30 11:29 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-30 12:14 -!- kushal(~kushal@121.246.32.76) has joined #tux3 2008-12-30 13:30 -!- kushal(~kushal@121.246.32.76) has left #tux3 2008-12-30 13:46 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-30 14:17 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-30 15:32 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-30 15:55 hirofumi, design note up you might be interested in 2008-12-30 15:56 some skeleton code to check out for atomic filesystem changes 2008-12-30 15:56 need to post a golden patch today... 
passes fsx-linux 2008-12-30 15:57 which the currently posted patch does not, so better announce a better one now 2008-12-30 15:57 sk8 oclock 2008-12-30 16:18 hey flips 2008-12-30 16:19 when you have stuff on online checking written up, I'd like to know how you plan to do it 2008-12-30 17:06 -!- orgthingy(~orgthingy@62.150.159.249) has joined #tux3 2008-12-30 17:07 long time no see 2008-12-30 17:19 hi orgthingy 2008-12-30 17:20 bh, we can have a little chat right now 2008-12-30 17:21 I have a few design notes related to atomic commit in the queue, ahead of online check 2008-12-30 17:23 I see 2008-12-30 17:23 something to think about... are we checking the state as currently recorded on disk? (which will change at the next delta) or are we checking the state as defined by the filesystem tree partially in cache? (which is continuously changing) 2008-12-30 17:23 oh flips, /me missed you :p 2008-12-30 17:24 I see 2008-12-30 17:24 orgthingy, real name? 2008-12-30 17:25 flips : you dont remember me? 2008-12-30 17:25 flips : i didnt tell you my real name 2008-12-30 17:25 i just told you to call me "orgthingy" 2008-12-30 17:25 I was the one from frenode and #C ? 2008-12-30 17:25 temporarily forgot ;) 2008-12-30 17:25 ah 2008-12-30 17:26 http://tux3.org/irclogs/2008-09-28.txt flips 2008-12-30 17:26 dimly remember 2008-12-30 17:26 right 2008-12-30 17:26 first time i joined, i was clueles about filesystems..etc :p 2008-12-30 17:27 and now? 2008-12-30 17:27 flips : better I guess 2008-12-30 17:27 Im now testing Lenny (Debian) 2008-12-30 17:27 I have to say best Debian release, great improvement 2008-12-30 17:29 orgthingy, you can now run tux3 on a real kernel and expect it to work pretty reliably... though no recovery on crash yet 2008-12-30 17:29 at least, when the new patch is posted, in an our or so 2008-12-30 17:30 flips : how about opening Usenet newsgroup instead of mailing groups.etc? 
2008-12-30 17:30 you can get free account from my friend @ #albasani - freenode or other free providers 2008-12-30 17:31 (if you're interested) 2008-12-30 17:31 flips : no patents problems in tux3.. let's hope :P ! 2008-12-30 17:31 I used to hang out on usenet... don't any more 2008-12-30 17:31 mailing lists are more efficient 2008-12-30 17:34 I see 2008-12-30 17:36 anyway, still interested in helping? 2008-12-30 17:37 http://kerneltrap.org/mailarchive/linux-fsdevel/2008/12/26/4493214/thread <- download/build insteructions here 2008-12-30 17:39 building tux3? 2008-12-30 17:39 yes 2008-12-30 17:39 hmm 2008-12-30 17:39 no crash recovery :p ? 2008-12-30 17:39 none 2008-12-30 17:40 flips : I'll gladly help if crash recovery is "enabled" 2008-12-30 17:40 too risky man, too risky :p 2008-12-30 17:40 flips : sorry 2008-12-30 17:40 i will, just later 2008-12-30 17:47 crash recovery is easy: tux3 mkfs 2008-12-30 18:13 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-30 18:36 trying out my instructions for installing our golden copy now... 2008-12-30 18:50 (.text+0x76c2d): undefined reference to `__udivdi3' <- hmm 2008-12-30 18:50 gcc too old? 2008-12-30 19:25 ok, quick hack done to tuxtime to compile without the 64 bit division 2008-12-30 19:25 result is less than full precision for nanoseconds 2008-12-30 19:26 but can be rewritten as a multiply, giving full precision, just takes a little more time than I want to give it just now 2008-12-30 19:26 why does it work? 2008-12-30 19:26 btw, I'll be offline soon 2008-12-30 19:36 hirofumi, ok, I made the latest tux3 repost just in time then 2008-12-30 19:36 http://lkml.org/lkml/2008/12/30/302 2008-12-30 19:36 Tux3 report: A Golden Copy 2008-12-30 19:37 hirofumi, best wishes for the new year 2008-12-30 19:38 good 2008-12-30 19:38 btw, __udivdi3 seems 64bit / 64bit? 
2008-12-30 19:38 yes 2008-12-30 19:38 it's really goofy that gcc does not generate inline code for that 2008-12-30 19:39 but I don't see 64bit / 64bit in tux3time 2008-12-30 19:39 ((u64)time.tv_nsec << 32) / 1000000000 2008-12-30 19:40 1000000000 is 32bit? 2008-12-30 19:40 promotion rules change it to 64 bit 2008-12-30 19:40 it's stupid 2008-12-30 19:40 oh 2008-12-30 19:41 div64()? 2008-12-30 19:41 is that a library call in gcc and kernel? 2008-12-30 19:41 it is library in kernel 2008-12-30 19:42 but not in gcc? 2008-12-30 19:42 yes 2008-12-30 19:42 div_u64() 2008-12-30 19:43 it seems 64bit / 32bit 2008-12-30 19:43 anyway, I can write it as a 32 x 32 = 64 multiply 2008-12-30 19:43 it's easy, but I wanted to make the post 2008-12-30 19:43 oh, good 2008-12-30 19:44 so I wrote something with less precision that I didn't have to test ;) 2008-12-30 19:44 could still have got it wrong of course 2008-12-30 19:44 -!- RalucaM(~ral@londo.cnds.jhu.edu) has joined #tux3 2008-12-30 19:44 but at this point, as long as it doesn't divide by zero it's fine 2008-12-30 19:44 the result of that patch is same with old? 2008-12-30 19:45 time result? 2008-12-30 19:45 yes 2008-12-30 19:45 it will have less precision 2008-12-30 19:45 nanoseconds less exact 2008-12-30 19:45 i see 2008-12-30 19:45 but still pretty high precision 2008-12-30 19:46 and it will be fixed later? 2008-12-30 19:46 yes 2008-12-30 19:46 ok 2008-12-30 19:46 tonight probably 2008-12-30 19:46 I worry that was final version 2008-12-30 19:47 it is not exactly the version you tested 2008-12-30 19:47 we made several commits after that 2008-12-30 19:47 yes 2008-12-30 19:47 tuxtime() 2008-12-30 19:47 some warnings and respellings, and changed a divide to a multiply 2008-12-30 19:47 I did try it in uml 2008-12-30 19:48 without high precision time display 2008-12-30 19:48 times looked good 2008-12-30 19:48 and tuxtime() will be fixed, so I'm happy with it 2008-12-30 19:48 what utility can we use to display high precision times? 
2008-12-30 19:48 right 2008-12-30 19:49 stat will show it 2008-12-30 19:49 iirc 2008-12-30 19:49 I need to add stat to tuxroot 2008-12-30 19:50 or strace may can 2008-12-30 19:50 if the seconds are good, I am happy for today 2008-12-30 19:52 ls --full-time? 2008-12-30 19:52 mode 0100644 uid 0 gid 0 root d:1 ctime 495aec2d00000000 size 6 links 1 mtime 495aec2d00000000 2008-12-30 19:52 so I killed all the precision 2008-12-30 19:52 well I will fix it 2008-12-30 19:53 yes 2008-12-30 19:53 --full-time may show 2008-12-30 19:54 ls -l --full-time /mnt/foo 2008-12-30 19:54 -rw-r--r-- 1 root root 6 Wed Dec 31 04:51:09 2008 /mnt/foo 2008-12-30 19:54 in uml 2008-12-30 19:54 oh 2008-12-30 19:55 utility is too old 2008-12-30 19:55 yes 2008-12-30 19:55 drwxr-xr-x 2 hirofumi hirofumi 4096 2008-12-12 06:54:37.000000000 +0900 doc 2008-12-30 19:55 fuse will show it 2008-12-30 19:56 it would be enough now 2008-12-30 19:58 next golden copy we will be sure to post exactly the version that passes the test 2008-12-30 19:58 this one is special though 2008-12-30 19:58 its the first golden copy 2008-12-30 19:58 and it's just before you go on holiday 2008-12-30 20:00 when you come back I hope to have fork working, plus new versions of blockread and blockget, and working on our own inode instead of blkdev 2008-12-30 20:00 I think we will do the first version of atomic commit without delalloc 2008-12-30 20:01 but with the block library routines replaced 2008-12-30 20:02 at least, the block library writes 2008-12-30 20:03 yes 2008-12-30 20:04 blockget can be written like blockread is now, but using grab_cache_page 2008-12-30 20:04 and blockread can be written in terms of blockget, just like bread is written in terms of getblk 2008-12-30 20:05 we will do a bio transfer for the blockread 2008-12-30 20:05 syncio, from hackfs/junkfs 2008-12-30 20:05 maybe, I will also think about it on holiday 2008-12-30 20:06 if you are like me, after two days of holiday I want to be back coding 2008-12-30 20:06 
drives my family crazy at times 2008-12-30 20:16 trace_on("ns = %lx", time.tv_nsec); 2008-12-30 20:16 tuxtime: ns = 0 <- my host system is not giving high precision for some reason 2008-12-30 20:16 fuse 2008-12-30 20:19 2.6.24.3 is too old maybe 2008-12-30 20:21 tuxtime() issue? 2008-12-30 20:22 that is a "nothing in tv_nsec" issue 2008-12-30 20:22 I will test the math in isolation now 2008-12-30 20:22 it's the best way 2008-12-30 20:35 Man, you guys are really something. I _have_ to try this fs. Great work, really. 2008-12-30 20:35 :) 2008-12-30 20:36 it is far from functional 2008-12-30 20:36 functioning, but not functional 2008-12-30 20:39 I'll be offline, see you 2008-12-30 20:39 sayonara 2008-12-30 20:39 happy new year 2008-12-30 20:39 Based on your hg-logs (no fan of git) I'm sure you'll be there and beyond. And I like trying new stuff out. 2008-12-30 20:40 Happy new year, hirofumi. :-) 2008-12-30 20:41 Oh, and, get a 'donate'-link. ;-) 2008-12-30 20:41 http://tux3.org/contribute.html 2008-12-30 20:42 a $$$ donate should be done 2008-12-30 20:42 beer is possible now :) 2008-12-30 20:44 Ha, yeah. Well I'm not going to send anything to that address, the shipping would be greater than my monthly income. And I'm not touching that 'beer of the month'-site, it's horrible. 2008-12-30 20:45 heh 2008-12-30 20:45 <: 2008-12-30 20:46 for what it's worth, they ship it from some address stateside 2008-12-30 20:47 now, what is the libc version of gettimeofday that handles timespec instead of timeval? 
2008-12-30 20:51 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2008-12-30 21:01 ((u64)time.tv_sec << 32) + ((time.tv_nsec * ((1ULL << 63) / 1000000000ULL)) >> 31) 2008-12-30 21:01 :p 2008-12-30 21:04 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-30 21:08 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-30 21:21 -!- RalucaM(~ral@londo.cnds.jhu.edu) has joined #tux3 2008-12-30 21:47 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-30 21:47 ACTION returns from the dead 2008-12-30 21:52 don't bit me please 2008-12-30 21:53 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-12-30 23:14 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-30 23:17 -!- RazvanM(~RazvanM@96.234.235.129) has joined #tux3 2008-12-31 04:45 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2008-12-31 05:07 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2008-12-31 07:59 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2008-12-31 10:48 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-12-31 10:51 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2008-12-31 10:52 Woah, Tux3 University! :D That's cool. 2008-12-31 12:04 when will tux3 be ported to .28 ? 2008-12-31 13:05 Day changed to 01 Jan 2009 2008-12-31 13:05 bye! 
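The multiply-based conversion posted at 21:01 can be tested in isolation, as suggested earlier in the discussion. The sketch below is a userspace check, not the tux3 source: the function names tuxtime_sketch and frac_by_division are made up for illustration, and the seconds and nanoseconds are passed as plain integers rather than a struct timespec. It assumes the result is a 32.32 fixed-point time, seconds in the high 32 bits and the fraction of a second in the low 32 bits.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* 2^63 / 10^9, folded at compile time, so no 64-bit division
 * (and no __udivdi3 call) is needed at run time. */
#define NSEC_SCALE ((1ULL << 63) / NSEC_PER_SEC)

/* Same arithmetic as the expression posted in the channel:
 * seconds in the high 32 bits, nsec * 2^32 / 10^9 in the low 32 bits. */
static uint64_t tuxtime_sketch(uint64_t sec, uint32_t nsec)
{
	assert(nsec < NSEC_PER_SEC);
	return (sec << 32) + (((uint64_t)nsec * NSEC_SCALE) >> 31);
}

/* Reference value computed with the 64-bit division that the kernel
 * build could not link; fine in a userspace test. */
static uint32_t frac_by_division(uint32_t nsec)
{
	return (uint32_t)(((uint64_t)nsec << 32) / NSEC_PER_SEC);
}
```

Because the scale constant is rounded down, the multiply can come out one unit of 2^-32 seconds below the divided value, but never above it and never further off, which is far finer than a nanosecond.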
2008-12-31 13:56 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-12-31 16:06 -!- tux3bot(~tux3bot@yzf.shapor.com) has joined #tux3 2008-12-31 16:08 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2008-12-31 16:17 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2008-12-31 17:41 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-12-31 21:27 -!- tim_dimm(~timothyhu@cpe-66-27-75-196.san.res.rr.com) has joined #tux3 2008-12-31 23:51 check this out http://blogs.sun.com/bmc/entry/catching_disk_latency_in_the 2008-12-31 23:57 -!- RazvanM(~RazvanM@96.234.235.129) has joined #tux3 2009-01-01 00:56 edt, it works on tux3, with on small patch that is on the list 2009-01-01 05:07 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2009-01-01 08:09 -!- tim_dimm(~timothyhu@cpe-66-27-75-196.san.res.rr.com) has joined #tux3 2009-01-01 08:56 http://lkml.org/lkml/2008/12/31/208 2009-01-01 08:56 heh 2009-01-01 08:56 happy new year tux3 2009-01-01 08:57 thankyou 2009-01-01 08:57 you still up coding? 2009-01-01 08:57 woke up 2009-01-01 08:57 will sleep again 2009-01-01 08:57 got my head full of atomic commit algorithms 2009-01-01 08:57 yeah, you need the beauty sleep 2009-01-01 08:57 ;-) 2009-01-01 08:58 I got my head full of a head cold 2009-01-01 08:59 serves you right for having fun at new years instead of grinding code ;) 2009-01-01 08:59 :-) 2009-01-01 08:59 anyway, if sleep won't make me beautiful, nothing will 2009-01-01 09:00 better beautiful code... 
2009-01-01 13:10 inspired: "I take no responsibility for the following command:" 2009-01-01 13:10 :D 2009-01-01 14:30 konrad, yes that was about you :) 2009-01-01 15:02 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-01 15:16 ok, got all my ducks in a row for a basic atomic commit, I think 2009-01-01 15:17 as many corners cut as possible but still pretty cool 2009-01-01 15:20 concept of rollup stays, promises stay, deltas stay, but delayed allocation, block forking, delta staging and forward logging all deferred for later 2009-01-01 15:21 this will be atomic commit in "immediate mode"... new locations for metadata blocks are assigned immediately on write and write transfer starts immediately 2009-01-01 15:24 we rewrite the block IO library write methods, which could possibly be avoided, but it is needed some time and isn't too hard, just start with ->writepage for now and that will do until after review starts 2009-01-01 15:26 next thing to do is add a new file, commit.c, entirely shared between user and kernel, that sets up log and commit blocks, and does replay 2009-01-01 15:47 sk8 oclock 2009-01-01 20:00 -!- tim_dimm(~timothyhu@cpe-66-27-75-196.san.res.rr.com) has joined #tux3 2009-01-01 20:52 -!- tim_dimm(~timothyhu@cpe-66-27-75-196.san.res.rr.com) has joined #tux3 2009-01-01 21:04 -!- tim_dimm(~timothyhu@cpe-66-27-75-196.san.res.rr.com) has joined #tux3 2009-01-01 21:23 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-01 21:23 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2009-01-01 21:23 -!- bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-01-01 21:23 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-01-01 21:23 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-01-01 21:23 -!- ceatinge(~ceatinge@veryclever.net) has joined #tux3 2009-01-01 21:29 -!- bushman_(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-01-01 21:38 -!- 
tim_dimm(~timothyhu@cpe-66-27-75-196.san.res.rr.com) has joined #tux3 2009-01-01 22:52 -!- RazvanM(~RazvanM@96.234.235.129) has joined #tux3 2009-01-01 23:02 -!- flips(~phillips@phunq.net) has joined #tux3 2009-01-01 23:13 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2009-01-02 00:02 -!- zhaozhou(~zhaozhou@h-68-118.A199.priv.bahnhof.se) has joined #tux3 2009-01-02 00:58 -!- cydork_(~cydoork@122.169.100.164) has joined #tux3 2009-01-02 05:07 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2009-01-02 05:11 -!- ajonat(~ajonat@190.48.109.30) has joined #tux3 2009-01-02 06:45 -!- kushal(~kushal@121.246.34.123) has joined #tux3 2009-01-02 06:46 -!- tim_dimm(~timothyhu@cpe-66-27-75-196.san.res.rr.com) has joined #tux3 2009-01-02 06:51 hey flips 2009-01-02 08:18 -!- kushal(~kushal@121.246.34.123) has joined #tux3 2009-01-02 08:42 -!- kushal(~kushal@121.246.34.123) has left #tux3 2009-01-02 08:42 -!- kushal(~kushal@121.246.34.123) has joined #tux3 2009-01-02 08:55 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-02 10:21 -!- tim_dimm(~timothyhu@cpe-66-27-75-196.san.res.rr.com) has joined #tux3 2009-01-02 11:33 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2009-01-02 11:33 hi kushal 2009-01-02 11:38 happy new year... 2009-01-02 11:38 had a few questions... 2009-01-02 11:39 is there any sort of btree rebalancing thats done in tux3?? 
2009-01-02 11:48 -!- tim_dimm(~timothyhu@cpe-66-27-75-196.san.res.rr.com) has joined #tux3 2009-01-02 12:28 -!- tim_dimm(~timothyhu@cpe-66-27-75-196.san.res.rr.com) has joined #tux3 2009-01-02 13:51 -!- tim_dimm(~timothyhu@cpe-66-27-75-196.san.res.rr.com) has joined #tux3 2009-01-02 15:32 -!- data(~data@echo489.server4you.de) has joined #tux3 2009-01-02 20:11 -!- macan(~chatzilla@xbl.dnsbl.oftc.net) has joined #tux3 2009-01-02 20:17 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-02 22:12 -!- pranith(~bobby@122.163.49.172) has joined #tux3 2009-01-02 22:38 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2009-01-03 00:04 -!- RazvanM(~RazvanM@96.234.235.129) has joined #tux3 2009-01-03 00:12 -!- stargazr5(~gauravstt@117.195.40.146) has joined #tux3 2009-01-03 00:40 hey flips 2009-01-03 00:40 hi 2009-01-03 00:51 hey flips 2009-01-03 00:51 happy new year 2009-01-03 01:00 how is the new inode number allotted? 2009-01-03 01:01 -!- cdk(~cdk@117.195.40.146) has joined #tux3 2009-01-03 01:11 -!- kushal(~kushal@117.195.40.146) has joined #tux3 2009-01-03 01:40 hi stargazr5 2009-01-03 01:40 in make_inode 2009-01-03 01:40 inode.c:168 2009-01-03 01:41 find_empty_inode 2009-01-03 01:44 is nextalloc taken as new inode number? 2009-01-03 01:44 it is the goal 2009-01-03 01:44 if that goal is occupied, the next higher unoppupied inum will be used 2009-01-03 01:46 why use next free block number as goal for new inode instead of maintaining free inode number list 2009-01-03 01:47 there can be 2**48 inodes, initializing that list would take forever and need 4 petabytes of storage 2009-01-03 01:47 ok.. 2009-01-03 01:49 another thing..is there any btree rebalancing performed 2009-01-03 01:50 no rebalancing, btrees are always balanced 2009-01-03 01:50 but leaves can be merged 2009-01-03 01:51 so no skew occurs? 
2009-01-03 01:51 it is possible 2009-01-03 01:52 tree_chop (sorry, bad name, it should be tree_delete or something) merges leaves that have had entries deleted 2009-01-03 01:52 so what about the performance hit due to skew? 2009-01-03 01:52 so, slight imbalance between leaves is possible 2009-01-03 01:52 it can be marginally affected 2009-01-03 01:52 ok.. 2009-01-03 01:53 by achieving higher fullness ratios of leaves, less metadata needs to be read 2009-01-03 01:53 that is the purpose of such balancing algorithms as b+tree and b*tree 2009-01-03 01:54 to get around minimum 66%, average 83% fullness instead of minimum 50%, average 75% fullness with ordinary splitting 2009-01-03 01:55 it is a lot of work for a small improvement, one day when that is the only optimization left to do, we will do it 2009-01-03 01:55 :) 2009-01-03 01:55 ok 2009-01-03 01:57 notice that purge_inum, used in delete_inode, only removes the inode from the leaf 2009-01-03 01:58 it does not merge leaves as it should 2009-01-03 02:00 hey flips... 2009-01-03 02:00 hi kushal 2009-01-03 02:05 -!- kushal_(~kushal@websorbs.dnsbl.oftc.net) has joined #tux3 2009-01-03 02:05 -!- cdk_(~cdk@websorbs.dnsbl.oftc.net) has joined #tux3 2009-01-03 02:05 we plan to use the generic btree of tux3 to store the SHA-1 hash values of the blocks for data dedup... 2009-01-03 02:06 yes, that is a reasonable alternative to a hash table 2009-01-03 02:06 so the hash values might fall in very close vicinity which would lead to skew 2009-01-03 02:06 -!- stargazr51(~gauravstt@websorbs.dnsbl.oftc.net) has joined #tux3 2009-01-03 02:06 so any suggestions for handling that?? 2009-01-03 02:06 btrees tolerate that kind of imbalance well 2009-01-03 02:07 I assume you will use only 64 bits of the sha-1 hash as the key? 2009-01-03 02:09 so any suggestions for handling that?? 2009-01-03 02:09 btrees tolerate that kind of imbalance well 2009-01-03 02:09 I assume you will use only 64 bits of the sha-1 hash as the key? 
2009-01-03 02:31 -!- pranith(~bobby@122.163.49.172) has joined #tux3 2009-01-03 02:31 -!- cdk(~cdk@117.195.40.226) has joined #tux3 2009-01-03 02:34 -!- kushal(~kushal@117.195.40.226) has joined #tux3 2009-01-03 02:34 sorry for the delay..lost connection... 2009-01-03 02:34 what was the last thing you saw? 2009-01-03 02:35 u asked whether we plan to use the 64 bits of the sha-1 hash... 2009-01-03 02:36 for that...still working out the best solution...we might use the entire 160 bits also for the prototype... 2009-01-03 02:36 yes 2009-01-03 02:36 tux3 btree only supports a 64 bit key 2009-01-03 02:37 If I may make a suggestion: use a smaller hash, of only 64 bits, do not use sha1 because it is relatively expensive to compute 2009-01-03 02:38 allow multiple entries for the same key in your btree, in case a 64 bit hash collides 2009-01-03 02:38 but then the probability of collisions will increase... 2009-01-03 02:38 ok... 2009-01-03 02:39 but our plan is to compare the hash values of 2 blocks to check if the blocks are same... 2009-01-03 02:39 if there is a collision..this check will fail... 2009-01-03 02:40 so we need to reduce collisions as much as possible.... 2009-01-03 02:40 ok, reasonable argument 2009-01-03 02:40 -!- pranith(~bobby@122.163.49.172) has joined #tux3 2009-01-03 02:40 so you will use 64 bits of the sha1 as a key, and store the full 20 byte sha1 in the btree entry 2009-01-03 02:41 yes... 2009-01-03 02:41 along with the 6 byte block pointer 2009-01-03 02:41 yes.. 2009-01-03 02:42 actually, you only have to store 12 bytes of the sha1 in the btree entry, you already know the other 8 bytes 2009-01-03 02:44 also a count of number of references to the block, 32 bits should be enough 2009-01-03 02:46 on the storage of only 12 bytes...the entries which occur only in the leaves won't have any valid 64 bits... 2009-01-03 02:47 the 32 bit ref count is a nice suggestion... 
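A dedup index entry along the lines discussed above might look like the sketch below. It is only one possible layout, not anything from the tux3 tree: the struct dedup_entry, its field order, the big-endian byte encoding and the helper names are all assumptions made for illustration. It keeps the full 160-bit (20 byte) SHA-1 in the leaf, takes the first 8 bytes as the 64 bit btree key, and adds the 6 byte block pointer and 32 bit reference count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-disk leaf entry; byte arrays avoid padding,
 * so the entry is exactly 20 + 6 + 4 = 30 bytes. */
struct dedup_entry {
	uint8_t sha1[20];    /* full digest, first 8 bytes double as the key */
	uint8_t block[6];    /* 48 bit physical block number, big endian */
	uint8_t refcount[4]; /* 32 bit reference count, big endian */
};

/* Derive the 64 bit btree key from the first 8 bytes of the digest. */
static uint64_t dedup_key(const uint8_t sha1[20])
{
	uint64_t key = 0;
	for (int i = 0; i < 8; i++)
		key = (key << 8) | sha1[i];
	return key;
}

static void set_block(struct dedup_entry *entry, uint64_t block)
{
	assert(block < (1ULL << 48));
	for (int i = 5; i >= 0; i--, block >>= 8)
		entry->block[i] = (uint8_t)block;
}

static uint64_t get_block(const struct dedup_entry *entry)
{
	uint64_t block = 0;
	for (int i = 0; i < 6; i++)
		block = (block << 8) | entry->block[i];
	return block;
}

/* Two entries with the same 64 bit key are duplicates only if the
 * whole digest matches; otherwise the key merely collided. */
static bool same_digest(const struct dedup_entry *entry, const uint8_t sha1[20])
{
	return memcmp(entry->sha1, sha1, 20) == 0;
}
```

With 30 byte entries, a 4096 byte leaf would hold a little over 130 of them before any leaf header is counted.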
2009-01-03 02:47 yes of course you are right :) 2009-01-03 02:48 there has to be a way to find the entry in the leaf, which requires storing the key in the leaf 2009-01-03 02:48 it's getting later here ;) 2009-01-03 02:49 so we still face the skew problem in our hash index as mentioned by stargazr5 2009-01-03 02:49 it should not be a problem 2009-01-03 02:50 tree_chop rebalances, it does need a little bit of work to do that properly 2009-01-03 02:50 ok... 2009-01-03 02:50 and you will not get significant skew on insert 2009-01-03 02:51 ok... 2009-01-03 02:51 we have to fix tree_chop anyway, you don't have to do that 2009-01-03 02:51 thanks... :) 2009-01-03 02:51 it was orginally written to walk through an entire btree, removing selected entries, so it never had to deal with deleting just one entry, starting somewhere in the middle 2009-01-03 02:52 now it needs to be adjusted to handle that 2009-01-03 02:53 ok...another thing... 2009-01-03 02:54 the block level dedup we plan to implement will need to insert entries for each new block.. 2009-01-03 02:54 that will disrupt the extent mechanism 2009-01-03 02:55 you will write your own btree leaf methods I assume 2009-01-03 02:55 and not use the dleaf methods 2009-01-03 02:55 is that your question? 2009-01-03 02:55 um 2009-01-03 02:55 right 2009-01-03 02:56 you will fragment extents 2009-01-03 02:56 that is your question 2009-01-03 02:56 yes...thats my question...fragmentation of extents... 2009-01-03 02:56 extents of 1 block work fine 2009-01-03 02:56 that is all ext3 ever had 2009-01-03 02:58 later you may be able to optimize so that when no blocks in an extent have duplicates, you allow larger extents 2009-01-03 02:58 ok...but wont the disk arm movement increase when we insert an entry into the hash index and then make changes to the dleaf? 
2009-01-03 03:00 it will 2009-01-03 03:00 I think it is best just to ignore that at first 2009-01-03 03:01 later, you will be able to use logging technques like I am writing a design note for right now, to avoid writing your dedup btree to disk every time a data block is written 2009-01-03 03:02 ok...looking forward to that design... 2009-01-03 03:02 I think your approach is pretty good, by the way 2009-01-03 03:02 I will post in a few minutes 2009-01-03 03:03 thanks :) 2009-01-03 03:06 inserting a new entry in the hash tree for every non-duplicate block is an overhead for every write and appears to be a performance bottleneck... 2009-01-03 03:09 also, as the btree gets large, it will be randomly accessed and there will be few opportunities to save writes that land in the same leaf blocks 2009-01-03 03:10 we plan to implement a "bucket" abstraction which stores hashes of logically contiguous blocks together... 2009-01-03 03:10 so the randomness is reduced... 2009-01-03 03:11 yes, you mentioned you would do that before, and I did not know why, now I do 2009-01-03 03:11 let me see 2009-01-03 03:14 -!- pranith(~bobby@122.163.49.172) has joined #tux3 2009-01-03 03:15 ah, logging your hash instead of writing out the btree node immediately will save something 2009-01-03 03:16 when you do an episode of hash tree updating, then at least seeks will be reduced 2009-01-03 03:17 something similar to batch updates?? 2009-01-03 03:17 something similar to what I am describing in my post now, for allocation bitmap updates 2009-01-03 03:18 ok... 2009-01-03 03:18 you update your hash btree in cache, and log the update information in a delta commit log 2009-01-03 03:18 so the hash btree, which may have leaves stored far from where you are writing, is not updated on every write 2009-01-03 03:19 just the data and the delta commit log block(s) 2009-01-03 03:19 so we use the underlying the tux3 logging mechanism? 
2009-01-03 03:19 when the number of dirty leaves of the hash btree rises above some threshold, you flush all of them to disk 2009-01-03 03:20 yes 2009-01-03 03:20 you can start with no optimization, and see how badly it hurts 2009-01-03 03:20 and then improve it with logging 2009-01-03 03:20 ok... 2009-01-03 03:20 I am thinking about your buckets abstraction, I don't see how it works yet 2009-01-03 03:21 one note: on solid state disk, where dedup is really important because it is expensive, you don't have to worry much about the update cost 2009-01-03 03:22 i'm sending you an image which kind of represents our design... 2009-01-03 03:22 on the solid state disk issue, aren't the random writes expensive?? 2009-01-03 03:22 only random reads are fast... 2009-01-03 03:23 not very expensive 2009-01-03 03:24 not nearly as expensive as disk seeks 2009-01-03 03:25 ok...we looked into the solid state disk performance but didn't consider the disk seek time while comparing... 2009-01-03 03:25 u got the image?? 2009-01-03 03:26 i've sent it on your mail... 2009-01-03 03:26 yes 2009-01-03 03:27 where are the buckets stored? 2009-01-03 03:28 buckets are nothing but disk blocks... 2009-01-03 03:30 how does storing hashes of blocks that are logically contiguous help? 2009-01-03 03:30 -!- pranith(~bobby@122.163.49.172) has joined #tux3 2009-01-03 03:30 how does storing _together_ hashes of blocks that are logically contiguous help? 2009-01-03 03:31 we assume that mostly there will be a run of continuous blocks that are duplicate... 2009-01-03 03:31 like all zero for example? 2009-01-03 03:31 there are not many other cases where you can expect successive blocks to be duplicates 2009-01-03 03:32 i mean...continuous blocks of two similar files... 2009-01-03 03:33 ah, like when somebody has stored the same pdf in two different directories 2009-01-03 03:33 yes... 
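The logging idea from 03:15 to 03:19 can be modelled with a toy counter: each insert updates the hash btree only in cache and appends a record to the delta commit log, and the dirty leaves go to disk together once their number passes a threshold. Everything in this sketch is hypothetical (the struct hash_index, the modulo mapping from key to leaf, the threshold of 4); it only counts writes and does no real I/O.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NLEAVES 1024
#define DIRTY_THRESHOLD 4

/* Toy model of a cached hash btree: only bookkeeping, no real blocks. */
struct hash_index {
	bool dirty[NLEAVES];
	unsigned dirty_count;
	unsigned log_records; /* records appended to the delta commit log */
	unsigned leaf_writes; /* leaf blocks actually written to disk */
};

/* Stand-in for the btree probe that finds the leaf holding a key. */
static unsigned leaf_of(uint64_t key)
{
	return (unsigned)(key % NLEAVES);
}

/* Write every dirty leaf in one episode, so the seeks are batched. */
static void flush_leaves(struct hash_index *index)
{
	for (unsigned i = 0; i < NLEAVES; i++) {
		if (index->dirty[i]) {
			index->dirty[i] = false;
			index->leaf_writes++;
		}
	}
	index->dirty_count = 0;
}

/* Insert a hash: update the leaf in cache, log the change, and flush
 * only when the number of dirty leaves rises above the threshold. */
static void index_insert(struct hash_index *index, uint64_t key)
{
	unsigned leaf = leaf_of(key);
	assert(leaf < NLEAVES);
	index->log_records++;
	if (!index->dirty[leaf]) {
		index->dirty[leaf] = true;
		index->dirty_count++;
	}
	if (index->dirty_count > DIRTY_THRESHOLD)
		flush_leaves(index);
}
```

In this model a data write costs one log record instead of one leaf write, and a leaf that is hit several times between flushes is written only once.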
2009-01-03 03:33 right, optimizes the case of detecting identical files of multiple blocks 2009-01-03 03:34 yes...that is the intent... 2009-01-03 03:35 ok, well updating your hash tree is going to be the hardest cost to reduce as you have noticed, making the entries in it small will help a lot 2009-01-03 03:36 the hash tree will always start thrashing at some point, when it gets too big for cache 2009-01-03 03:36 the smaller the entries are, the bigger it can get before it starts thrashing 2009-01-03 03:36 yes...that is our main concern... 2009-01-03 03:37 well everybody who does dedup must have the same problem 2009-01-03 03:38 yes i guess 2009-01-03 03:45 any different hash index structure which might reduce this problem? 2009-01-03 03:45 I'm thinking about it 2009-01-03 03:46 I don't see one, the keys will be essentially randomly accessed 2009-01-03 03:46 -!- pranith(~bobby@122.163.49.172) has joined #tux3 2009-01-03 03:47 so optimizing is mainly about trying to keep your hash tree entries small 2009-01-03 03:47 yes... 2009-01-03 03:48 we reached the same dead end... 2009-01-03 03:49 say you have 2**7 = 128 entries per hash leaf, then with 1 GB to store hash leaves, you can fully index 128 GB entirely in cache, then cache hit rate will start to drop 2009-01-03 03:49 it's not a dead end if everybody else hits it 2009-01-03 03:50 that's called a "level playing field" 2009-01-03 03:50 :) 2009-01-03 03:50 yes... 2009-01-03 03:51 so what you do is just recognize that at a certain point, every access is going to be hitting disk, and just try to reduce the number of times it hits disk 2009-01-03 03:52 ok... 2009-01-03 03:52 and the logging idea at least will reduce seeks 2009-01-03 03:52 yes... 2009-01-03 03:53 your tweak for fully duplicated files is good 2009-01-03 03:53 and of course, zero should be treated specially 2009-01-03 03:53 thanks... :) 2009-01-03 03:57 we'll do that... 2009-01-03 03:57 thanks for the time...
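The 03:49 estimate (128 entries per hash leaf, 1 GB of cached leaves, 128 GB indexed) can be checked with a few lines of arithmetic. The sketch below assumes 4 KiB hash leaves and 4 KiB data blocks, which the conversation does not state at this point; the function names are hypothetical.

```python
GIB = 1 << 30

def indexable_bytes(cache_bytes, entries_per_leaf=128, leaf_bytes=4096, block_bytes=4096):
    """Bytes of file data whose block hashes fit in a leaf cache of cache_bytes.

    Assumption: one index entry per data block, leaves and data blocks both 4 KiB.
    """
    leaves = cache_bytes // leaf_bytes          # leaf blocks that fit in the cache
    return leaves * entries_per_leaf * block_bytes

def entry_bytes(entries_per_leaf=128, leaf_bytes=4096):
    """Size of one index entry implied by the leaf size and entries per leaf."""
    return leaf_bytes // entries_per_leaf

if __name__ == "__main__":
    print(indexable_bytes(GIB) // GIB, "GiB indexed by 1 GiB of cached leaves")
    print(entry_bytes(), "bytes per entry")
```

Whenever the leaf size equals the data block size, the result is simply the cache size times the entries per leaf, so halving the entry size (256 entries per leaf) doubles how much data can be indexed before the cache hit rate starts to drop.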
2009-01-03 03:58 it was a really fruitful discussion... 2009-01-03 03:58 I'm happy 2009-01-03 03:58 :) 2009-01-03 03:59 bye 2009-01-03 03:59 bye 2009-01-03 04:00 I will post my bitmap logging note tomorrow, it's late here 2009-01-03 04:00 ok... 2009-01-03 04:26 -!- pranith(~bobby@122.163.49.172) has joined #tux3 2009-01-03 04:40 -!- pranith(~bobby@122.163.49.172) has joined #tux3 2009-01-03 04:58 -!- pranith(~bobby@122.163.49.172) has joined #tux3 2009-01-03 05:07 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2009-01-03 05:08 -!- pranith(~bobby@122.163.49.172) has joined #tux3 2009-01-03 05:41 -!- pranith(~bobby@122.163.49.172) has joined #tux3 2009-01-03 06:22 -!- stargazr5(~gauravstt@117.195.40.226) has joined #tux3 2009-01-03 06:27 -!- pranith(~bobby@122.163.49.172) has joined #tux3 2009-01-03 06:45 -!- pranith(~bobby@122.163.49.172) has joined #tux3 2009-01-03 06:46 -!- RazvanM(~RazvanM@96.234.235.129) has joined #tux3 2009-01-03 07:05 hey flips 2009-01-03 07:17 -!- pranith(~bobby@122.163.49.172) has joined #tux3 2009-01-03 07:32 -!- dcg(~dcg@xbl.dnsbl.oftc.net) has joined #tux3 2009-01-03 07:51 -!- pranith(~bobby@122.163.49.172) has joined #tux3 2009-01-03 08:12 -!- pranith(~bobby@122.163.49.172) has joined #tux3 2009-01-03 08:25 -!- pranith(~bobby@122.163.49.172) has joined #tux3 2009-01-03 08:45 -!- cdk(~cdk@117.195.34.108) has joined #tux3 2009-01-03 09:15 -!- kushal(~kushal@117.195.34.108) has joined #tux3 2009-01-03 09:48 -!- dcg(~dcg@167.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-01-03 10:00 ssh desktop 2009-01-03 10:17 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-01-03 11:21 hi flips 2009-01-03 11:57 hi kushal 2009-01-03 12:00 where do we map the logical block numbers into physical block numbers? 2009-01-03 12:01 in filemap.c 2009-01-03 12:01 -!- cdk(~cdk@117.195.34.108) has joined #tux3 2009-01-03 12:01 map_region 2009-01-03 12:07 ok... 2009-01-03 12:09 thanks...bye 2009-01-03 12:09 bye 2009-01-03 12:09 goodnight?
2009-01-03 12:09 yes... 2009-01-03 12:10 13 hours ahead of you :) 2009-01-03 13:44 -!- pgquiles(~pgquiles@8.Red-81-37-107.dynamicIP.rima-tde.net) has joined #tux3 2009-01-03 14:34 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-03 20:38 -!- kushal(~kushal@117.195.37.114) has joined #tux3 2009-01-03 20:56 hi 2009-01-03 20:56 backed 2009-01-03 21:28 hi! 2009-01-03 21:33 hi 2009-01-03 21:35 got to restart x 2009-01-03 21:37 -!- flips(~phillips@phunq.net) has joined #tux3 2009-01-03 21:38 still losing keyboard every day or two, restarting X fixes it 2009-01-03 21:38 I can't remember changing X at all 2009-01-03 21:39 um.. 2009-01-03 21:39 xorg.conf? 2009-01-03 21:39 didn't touch it 2009-01-03 21:39 i see 2009-01-03 21:40 kernel and x imcompatible? 2009-01-03 21:40 didn't change the kernel 2009-01-03 21:41 at least I know it's X now and don't have to do full reboot 2009-01-03 21:41 -!- amey(~amey@117.195.37.114) has joined #tux3 2009-01-03 21:41 maybe, it seems x lose sync with keyboard state 2009-01-03 21:42 it seems 2009-01-03 21:43 keyboards are pretty simple things 2009-01-03 21:43 I have written keyboard drivers 2009-01-03 21:43 you have to work at it to get it wrong 2009-01-03 21:43 oh well, eventually I will find it and it's not too bad 2009-01-03 21:46 I think x is not so simple 2009-01-03 21:46 that's the problem no doubt 2009-01-03 21:46 iirc, old one grabs keyboard control from kernel 2009-01-03 21:46 maybe mine is old 2009-01-03 21:47 I'm not sure though 2009-01-03 21:47 7.1.0 2009-01-03 21:47 doesn't sound old 2009-01-03 21:47 yes 2009-01-03 21:48 is there some logs? /var/log/Xorg.log.0? 2009-01-03 21:49 let's see 2009-01-03 21:51 don't see any errors 2009-01-03 21:52 Xorg.0.log.old too? 2009-01-03 21:52 looks normal 2009-01-03 21:53 um.. 2009-01-03 21:54 checking with xev and interrupt at next time may helps something 2009-01-03 21:57 or gpm? 
2009-01-03 21:58 ah, but it's keyboard 2009-01-03 21:58 no gpm 2009-01-03 21:58 right 2009-01-03 21:58 yes 2009-01-03 21:59 updating kernel would be good 2009-01-03 21:59 yes 2009-01-03 21:59 etch is all up to date 2009-01-03 22:00 ok, well the most important thing now is atomic update 2009-01-03 22:00 well, I'm using kernel based on linus git tree 2009-01-03 22:00 I can handle a few X reboots ;) 2009-01-03 22:00 I'm 2.6.24.3 2009-01-03 22:00 only upgraded to fix the vmsplice hole 2009-01-03 22:01 yes 2009-01-03 22:01 I can't remember any big input bug recently 2009-01-03 22:01 I added a file, commit.c to kernel/ 2009-01-03 22:01 not much in it yet 2009-01-03 22:02 I have a partly finished patch to log bitmap updates 2009-01-03 22:02 and a design note that I should finish 2009-01-03 22:02 good 2009-01-03 22:02 put a lot of work into the idea of buffer forking 2009-01-03 22:02 without that, some things are really hard 2009-01-03 22:02 ah 2009-01-03 22:03 like flushing the bitmap file 2009-01-03 22:03 which causes allocations, and the allocations change bitmap blocks 2009-01-03 22:03 I remembered about inode number allocation delay 2009-01-03 22:03 yes, that is a pretty big change, I don't think we need to do it 2009-01-03 22:03 I found a simpler way 2009-01-03 22:04 there is a post about it 2009-01-03 22:04 yes, I read it on holiday 2009-01-03 22:04 make a list of inodes that have been allocated, but not entered into the inode table 2009-01-03 22:04 so that make_inode does not need to do store_attrs 2009-01-03 22:05 with that, ileaf updater became backend only? 2009-01-03 22:05 but that is for when we do a layered-style atomic commit, I am working on a simpler version 2009-01-03 22:05 using the current, immediate update style 2009-01-03 22:06 I think there are a few small things we can do to get atomic commit working with the current code, which writes file blocks immediately 2009-01-03 22:06 I call it immediate mode 2009-01-03 22:07 file blocks is file data? 
2009-01-03 22:07 yes 2009-01-03 22:07 so whatever we do, we need a log 2009-01-03 22:07 immediate is direct io? 2009-01-03 22:07 immediate is just how it is now 2009-01-03 22:08 sys_write causes immediate mapping of data and disk transfer starts immediately 2009-01-03 22:08 inside sys_write 2009-01-03 22:08 no delalloc 2009-01-03 22:08 sync mode? 2009-01-03 22:08 not even sync mode, just how it is now 2009-01-03 22:09 ah, yes 2009-01-03 22:09 like sync mode? 2009-01-03 22:09 i.e. flush before return syscall? 2009-01-03 22:09 that would be slow 2009-01-03 22:09 more like, on sync_fs, do an atomic update 2009-01-03 22:10 a good way to start 2009-01-03 22:10 that can be improved to, do an atomic update after some number of fs operations 2009-01-03 22:11 the place to start is, logging allocations 2009-01-03 22:11 I have a post, needs a little more work 2009-01-03 22:11 I got involved in deep analysis of the buffer forking 2009-01-03 22:11 and it took a while to convince myself we really need that now 2009-01-03 22:11 where does it do sync_fs? 2009-01-03 22:12 it doesn't now 2009-01-03 22:12 we can define the super op 2009-01-03 22:12 immediate mode 2009-01-03 22:12 and then it will do it on sync(2) 2009-01-03 22:12 I imagined it does sync_fs like op for atomic commit 2009-01-03 22:12 immediate mode == !delalloc 2009-01-03 22:13 in other words, the way ext2 and ext3 work 2009-01-03 22:13 however, it does redirect? 2009-01-03 22:13 yes, we need that 2009-01-03 22:14 we redirect, log allocations, and a couple of small things, and it is atomic commit 2009-01-03 22:14 flushing the bitmap file is tricky 2009-01-03 22:14 so, we need to point to redirect after copying user data? 2009-01-03 22:15 we will redirect in get_block -> map_region 2009-01-03 22:15 without delalloc, it is before copy?
2009-01-03 22:17 after copy, the page has been allocated, copy is done, block_write_full_page does the mapping and starts the transfer 2009-01-03 22:17 ->write_begin() doesn't call get_block()? 2009-01-03 22:19 ->write_begin called form? 2009-01-03 22:19 form? 2009-01-03 22:19 from? 2009-01-03 22:19 :) 2009-01-03 22:19 curret one is, sys_write -> ->write_begin -> ->write_end -> pdflush -> ->writepage 2009-01-03 22:21 write_begin can call get_block 2009-01-03 22:21 so, it's before copy 2009-01-03 22:21 right 2009-01-03 22:22 that is fine 2009-01-03 22:22 the buffer does not get redirected, just the physical mapping 2009-01-03 22:22 there is no buffer fork in this path 2009-01-03 22:24 rdirect just means we free the old disk block the logical block was mapped to and allocate a new one 2009-01-03 22:24 that is already most of atomic commit 2009-01-03 22:25 dinner time here 2009-01-03 22:26 ok 2009-01-03 22:29 ah, ok 2009-01-03 22:29 ->write_begin will allocate block, then ->writepage will free block and redirect block 2009-01-03 22:36 simpler 2009-01-03 22:37 previous ->write_begin will allocate block, current ->write_begin will free the old block and allocate new 2009-01-03 22:38 the dleaf is also freed and reallocated 2009-01-03 22:39 currently, the dleaf is just updated 2009-01-03 22:40 write_begin() will not call if page is uptodate 2009-01-03 22:40 will not call get_block() 2009-01-03 22:40 eh 2009-01-03 22:40 that's ok 2009-01-03 22:41 allocation of write_begin() would be just overhead 2009-01-03 22:41 no problem for now 2009-01-03 22:43 one thing we need to do: wrap writepage and set buffers to !mapped 2009-01-03 22:43 because we need to have tux3_get_block called on every write 2009-01-03 22:43 or just call get_block without block library? 
2009-01-03 22:43 yes 2009-01-03 22:44 even better 2009-01-03 22:44 yes 2009-01-03 22:44 and we have to launch our on bio then 2009-01-03 22:44 yes 2009-01-03 22:44 which is ok, I was just trying to be lazy at first 2009-01-03 22:45 it have to check !mapped state is ok or not 2009-01-03 22:45 good idea 2009-01-03 22:45 maybe, it is complex 2009-01-03 22:46 and maybe writing our own ->writepage is better 2009-01-03 22:46 yes 2009-01-03 22:46 ok, let's 2009-01-03 22:46 ext3 does 2009-01-03 22:46 probably 2009-01-03 22:47 we also need to start IO on the dleaf immediately 2009-01-03 22:48 dleaf change? 2009-01-03 22:49 the dleaf will change because of the free/alloc 2009-01-03 22:49 or even just the alloc 2009-01-03 22:49 if it's not a rewrite 2009-01-03 22:49 so right now, vfs flushes the blockdev 2009-01-03 22:49 for file data? 2009-01-03 22:50 yes 2009-01-03 22:50 we have to make the vfs not flush the blockdev 2009-01-03 22:50 yes 2009-01-03 22:50 ah 2009-01-03 22:50 I think the easiest way to do that is not to use the blockdev, but allocate our own inode for buffer cache 2009-01-03 22:50 probably, yes 2009-01-03 22:51 it would be same? 2009-01-03 22:51 we have to handle dirty bit somehow? 2009-01-03 22:52 it is same problem for blockdev and inode? 
2009-01-03 22:52 I think ext3 keeps the vfs away from its buffers by never setting the dirty bit 2009-01-03 22:52 yes 2009-01-03 22:52 it has jbh_dirty() 2009-01-03 22:52 I think it is more clean to set the dirty bit, and use a different mapping 2009-01-03 22:53 vfs should stay away from our buffers, and we become responsible for freeing the buffers 2009-01-03 22:53 if we use mark_buffer_dirty(), it dirties inode 2009-01-03 22:53 inode too 2009-01-03 22:53 so, pdflush calls ->writepage 2009-01-03 22:53 that is pretty easy to deal with 2009-01-03 22:53 we give it a null method 2009-01-03 22:54 later, we can do optimized metadata IO there 2009-01-03 22:54 ok 2009-01-03 22:54 it's not too pretty 2009-01-03 22:54 inode to avoid blockdev->writepage 2009-01-03 22:55 sounds good 2009-01-03 22:55 we also don't need mark_buffer_dirty 2009-01-03 22:55 that's an easier way 2009-01-03 22:56 we are going to be handling our own buffer writeout 2009-01-03 22:56 I am thinking, we should use the b_assoc_buffers list, as our delta list 2009-01-03 22:57 I don't think vfs touches it 2009-01-03 22:57 which inode? 2009-01-03 22:57 our fake blockdev 2009-01-03 22:58 um... 2009-01-03 22:58 I am also ok with working with the existing blockdev as far as we can go 2009-01-03 22:58 in other words, we don't call mark_buffer_dirty 2009-01-03 23:00 we call get_bh() instead of dirty? 2009-01-03 23:01 we put it on our delta list instead of marking dirty 2009-01-03 23:01 with get_bh()? 2009-01-03 23:01 yes\ 2009-01-03 23:01 take a refcount 2009-01-03 23:02 it seems to work except dirty balance 2009-01-03 23:03 ah right 2009-01-03 23:03 and buffer freeing 2009-01-03 23:04 free will be done if refcont==0? 
2009-01-03 23:05 checking put_bh 2009-01-03 23:06 create_empty_buffers avoids the grow_buffers path 2009-01-03 23:06 that is good 2009-01-03 23:07 yes, create_empty_buffers is for page cache 2009-01-03 23:07 non blockdev page cache 2009-01-03 23:07 which is what we're moving towards 2009-01-03 23:07 my theory: page cache and buffer cache should be more similar 2009-01-03 23:08 I think that the page cache works perfectly well as a buffer cache, with only a small amount of code to support it 2009-01-03 23:08 we can test that theory 2009-01-03 23:08 yes, similar 2009-01-03 23:08 actually, it would be same almost 2009-01-03 23:09 but, blockdev path is special 2009-01-03 23:09 grow_buffers() will be called from getblk() path only? 2009-01-03 23:09 brelse and put_bh do not free buffers 2009-01-03 23:09 yes 2009-01-03 23:09 yes, grow_buffers is blockdev only 2009-01-03 23:10 shrink_buffers()? 2009-01-03 23:10 shrink_buffers is also for blockdev 2009-01-03 23:10 I think try_to_free_buffers is the only one for page cache 2009-01-03 23:11 ah, yes 2009-01-03 23:12 http://lxr.linux.no/linux+v2.6.27/mm/filemap.c#L2680 2009-01-03 23:12 try_to_release_page 2009-01-03 23:13 we can override that with our own ->releasepage 2009-01-03 23:14 so try_to_free_buffers will work fine in page cache too 2009-01-03 23:15 yes 2009-01-03 23:16 own ->releasepage is what does for now? 2009-01-03 23:16 we don't need it now I think 2009-01-03 23:16 we would need it with the block handles patch 2009-01-03 23:16 ah, ok 2009-01-03 23:16 yes 2009-01-03 23:17 can we talk about buffer forking for a moment? 
2009-01-03 23:17 btw, I'm thinking delalloc may make simple atomic commit 2009-01-03 23:18 ok 2009-01-03 23:18 it simplifies some things 2009-01-03 23:18 ok, forking 2009-01-03 23:18 without forking, flushing bitmaps to disk is very tricky 2009-01-03 23:18 because bits get set in the bitmap blocks during the flush 2009-01-03 23:19 yes 2009-01-03 23:19 it's possible to hack around this, but is pretty complicated 2009-01-03 23:19 wait a bit 2009-01-03 23:19 with forking, we can change a bitmap block, but not change the data that was in the buffer at the beginning of the flush 2009-01-03 23:19 before that 2009-01-03 23:19 ok 2009-01-03 23:20 frontend path modify the bitmap? 2009-01-03 23:20 for now, yes 2009-01-03 23:20 later, no 2009-01-03 23:20 backend will do all allocation 2009-01-03 23:21 writepage is one of frontend? 2009-01-03 23:21 yes 2009-01-03 23:22 so let's think about a sync 2009-01-03 23:22 that will be our atomic update for now 2009-01-03 23:22 ->sync_fs 2009-01-03 23:22 sys_sync -> ->sync_fs 2009-01-03 23:22 ok 2009-01-03 23:23 assume we let get_block work the way it does now 2009-01-03 23:23 yes 2009-01-03 23:23 except we will be doing free/alloc instead of just overwrite 2009-01-03 23:23 bitmap buffers will be changed as we allocate 2009-01-03 23:24 but frees will just be logged, we can't free immediately 2009-01-03 23:24 current? 
2009-01-03 23:24 after the commit, the frees are entered into the bitmap 2009-01-03 23:24 ok 2009-01-03 23:24 the reason for this is to avoid corrupting the previous atomic commit 2009-01-03 23:25 bitmaps are not written out as part of the sync 2009-01-03 23:25 instead, we log the allocs and frees 2009-01-03 23:25 so the bitmap buffers will be current, the bitmap on disk will be old 2009-01-03 23:26 ok 2009-01-03 23:27 replay is easy, on restart we walk the log and enter the allocs and frees into the bitmap buffers 2009-01-03 23:27 after a while, we have to flush the bitmaps to disk or the log will get too long 2009-01-03 23:27 this is where buffer forking is useful 2009-01-03 23:28 yes 2009-01-03 23:28 flushing a bitmap block requires map_region, which will fork the bitmap buffer before setting bits to allocate 2009-01-03 23:29 the old version of the bitmap block is what we want to flush 2009-01-03 23:29 we need to get all dirty bitmaps at the time of the flush onto disk, exactly as they are 2009-01-03 23:30 um.. 2009-01-03 23:30 we need to get all dirty bitmaps at the time of the flush onto disk, exactly as they are when the bitmap flush begins 2009-01-03 23:30 -!- RazvanM(~RazvanM@96.234.235.129) has joined #tux3 2009-01-03 23:31 before flushing, it may be forked? 
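The scheme described from 23:23 to 23:27 (allocations change the cached bitmap at once, frees are only logged and entered after the commit, the on-disk bitmap stays old, and replay walks the log) can be modelled in a few lines. This is a minimal sketch with hypothetical names and a list of booleans standing in for the bitmap. It deliberately covers only the easy case where the bitmap is flushed between deltas, so it says nothing about the buffer forking needed when allocation continues during the flush.

```python
class LoggedBitmap:
    """Toy allocation bitmap whose changes are logged instead of written out."""

    def __init__(self, nblocks):
        self.disk = [False] * nblocks    # last bitmap written to disk (may be old)
        self.cache = [False] * nblocks   # current in-memory bitmap
        self.log = []                    # committed ("alloc" | "free", block) records
        self.pending = []                # records belonging to the delta being built

    def alloc(self, block):
        assert not self.cache[block], "block already in use"
        self.cache[block] = True                 # allocations take effect immediately
        self.pending.append(("alloc", block))

    def free(self, block):
        assert self.cache[block], "block not allocated"
        # The bit stays set until commit, so the block cannot be reused while
        # the previous atomic commit may still reference it.
        self.pending.append(("free", block))

    def commit(self):
        self.log.extend(self.pending)            # the delta's records reach the log
        for op, block in self.pending:
            if op == "free":
                self.cache[block] = False        # frees are entered after the commit
        self.pending = []

    def flush_bitmap(self):
        # Simplification: only flush between deltas, with nothing pending.
        assert not self.pending
        self.disk = list(self.cache)
        self.log = []                            # old log records are no longer needed


def replay(disk_bitmap, log):
    """Rebuild the current bitmap from the stale on-disk copy plus the log."""
    bitmap = list(disk_bitmap)
    for op, block in log:
        bitmap[block] = (op == "alloc")
    return bitmap
```

A redirect in this model is `free(old)` followed by `alloc(new)` in the same delta. After `commit()`, `replay(bm.disk, bm.log)` equals `bm.cache` even though `bm.disk` has never been rewritten.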
2009-01-03 23:31 it won't we 2009-01-03 23:31 ah, re-fork 2009-01-03 23:31 umm 2009-01-03 23:31 we will fork against the previous flush 2009-01-03 23:32 so we will have a flush counter, our rollup counter, that is like the delta counter 2009-01-03 23:32 question: if you have free logs for bitmap, why not just abandon bitmaps 2009-01-03 23:33 because you need to be able to set and free allocation per block, rapidly 2009-01-03 23:33 and the log is not infinitely long 2009-01-03 23:34 you need to be able to find the allocation state of adjacent blocks rapidly 2009-01-03 23:34 i.e., scan the bitmap 2009-01-03 23:34 ACTION is still waiting for hirofumi's umm 2009-01-03 23:34 we can keep an extent tree in memory, all the free logs will go to disk in a delta, then modify the memory tree 2009-01-03 23:35 the bitmaps are our version of the extent maps 2009-01-03 23:35 it's also the stable representation 2009-01-03 23:36 fairly simple, and optimal for some important loads such as high fragmentation 2009-01-03 23:37 macan, the most important reason for why not just use the log is, the log is not very long 2009-01-03 23:38 and if it was long, startup would take a long time because the whole log has to be read to find out what the current allocation state is 2009-01-03 23:38 bitmap will be forked forcely after delta counter change? 2009-01-03 23:39 not the delta counter, but an additional counter for rollups 2009-01-03 23:39 yes, but we can periodic clean the logs, and in memory extent tree will supply a more compact log presentation 2009-01-03 23:39 flushing bitmaps using the delta counter does not work too well, for reasons that are a bit to complicated to write in the irc channel 2009-01-03 23:40 our rollup == clean the logs 2009-01-03 23:41 macan, but where is your stable representation after the log is cleaned? 
2009-01-03 23:43 we write the log to disk first, then change the log pointer in superblock 2009-01-03 23:44 sorry, write the new compact logs to disk first 2009-01-03 23:44 bitmaps are about as compact as you can get 2009-01-03 23:45 anyway, with the buffer forking operation in place, then we just put the bitmap inode on the dirty inodes list and sync_inodes_sb will do the rest... I think 2009-01-03 23:46 that flushes the bitmaps 2009-01-03 23:46 two or three lines 2009-01-03 23:46 however, buffer forking is not two or three lines, it is hard 2009-01-03 23:46 because there are multiple buffers sharing a page 2009-01-03 23:47 and we have to change the page, changing the data for other buffers on the page that may be allocated to completely different metadata structures for example 2009-01-03 23:47 well, at least in the case of bitmaps it is in the page cache and all the buffers on the page are bitmap buffers, so that is not as bad 2009-01-03 23:48 anyway, buffer forking is possible with with multiple independent buffers on a page 2009-01-03 23:48 ah, compact logs 2009-01-03 23:49 it can be multiple GB on some case? 2009-01-03 23:49 I was thinking, not more than a meg or so 2009-01-03 23:49 not multiple GB 2009-01-03 23:49 maybe a few meg 2009-01-03 23:49 big disk and fragmented 2009-01-03 23:49 ? 
2009-01-03 23:50 our logs will be rolled up frequently, that frees the old log blocks 2009-01-03 23:50 yes 2009-01-03 23:50 compact logs strategy 2009-01-03 23:50 yes, maybe useful at some point 2009-01-03 23:50 if our logs turn out to need compacting 2009-01-03 23:50 hopefully they don't 2009-01-03 23:51 it meant compact logs strategy can be multiple GB 2009-01-03 23:51 so for example, logging allocs and frees, it is by extent, it should just be 8 or 9 bytes per extent 2009-01-03 23:51 ah 2009-01-03 23:51 yes 2009-01-03 23:52 an extent can be up to 256K, so with big files it will be 9 bytes in the log per 256K 2009-01-03 23:53 36 bytes per meg 2009-01-03 23:53 36K per gig of writeout, for logging allocs 2009-01-03 23:54 best case? 2009-01-03 23:54 yes 2009-01-03 23:54 yes 2009-01-03 23:54 worst is... 2009-01-03 23:55 9 bytes per 4K alloc 2009-01-03 23:55 9bytes per 2bits? 2009-01-03 23:55 2bits? 2009-01-03 23:55 bitmap 2009-01-03 23:55 data 2009-01-03 23:55 bitmap data 2009-01-03 23:55 yes 2009-01-03 23:56 good example of how bitmaps beat extents for free maps in many cases 2009-01-03 23:56 yes 2009-01-03 23:56 well, bitmap forking 2009-01-03 23:57 anyway, buffer forking is a pretty scary thing 2009-01-03 23:57 the buffers are all in different states, can be clean, dirty in different deltas or unmapped 2009-01-03 23:57 I thought, we want to stable buffers immediately after delta count 2009-01-03 23:57 we want stable buffers 2009-01-03 23:57 ah true, for now that will be the case 2009-01-03 23:57 delta counter change 2009-01-03 23:58 I'm thinking in the general case 2009-01-03 23:58 maybe I'm being too general 2009-01-03 23:58 but anyway, it seems to work in the general case 2009-01-03 23:58 I can't see why bitmap is using rollup counter 2009-01-03 23:59 using the delta counter, and thus flushing them every delta, causes problems 2009-01-03 23:59 can't use the delta counter and flush less than once per delta, because there are only a limited number of dirty states
that can be represented in the buffer flags 2009-01-04 00:00 why is it only bitmap? 2009-01-04 00:00 because bitmap handles allocation 2009-01-04 00:00 and therefore, bits in the buffers can change during allocation 2009-01-04 00:01 yes 2009-01-04 00:01 I imaged 2009-01-04 00:02 it is a write converge process ? 2009-01-04 00:02 after delta counter change, fork bitmap 2009-01-04 00:03 ACTION thinks 2009-01-04 00:03 and next delta's frontend and previous backend modify new buffers 2009-01-04 00:03 fork every bitmap? 2009-01-04 00:03 maybe, only dirty 2009-01-04 00:04 and the forked bitmaps will be written in the next delta, it might work 2009-01-04 00:04 and backend logs allocation for previous delta 2009-01-04 00:05 I think that works 2009-01-04 00:05 well, it was I just imaged 2009-01-04 00:05 rollup counter is what's for? 2009-01-04 00:05 maybe nothing if your suggestion works 2009-01-04 00:06 nothing now 2009-01-04 00:06 well, I would like to flush bitmaps much less than once per delta 2009-01-04 00:06 I would like delta to be NFS writing 2009-01-04 00:06 yes 2009-01-04 00:07 sorry 2009-01-04 00:07 I would like deltas to work well for sync mounts 2009-01-04 00:07 so that each data block written only requires a few metadata blocks 2009-01-04 00:08 yes 2009-01-04 00:08 so even writing out one bitmap block would be nice to avoid 2009-01-04 00:08 eventually, I would like it to be just one data block and one commit block 2009-01-04 00:08 all metadata changes logged in the commit block 2009-01-04 00:09 i see 2009-01-04 00:09 it seems not have big difference 2009-01-04 00:09 no 2009-01-04 00:09 it logs bitmap change instead of dirty 2009-01-04 00:09 I mean, I agree 2009-01-04 00:09 the "no" in that case means "I agree" ;) 2009-01-04 00:09 :) 2009-01-04 00:10 the issue is backend change next delta 2009-01-04 00:10 the issue is backend change next delta buffers 2009-01-04 00:11 ok, buffer forking, if we stop all filesystem transactions and flush bitmaps then our fork is 
pretty easy 2009-01-04 00:11 right, doing everything asynchronously is more challenging 2009-01-04 00:11 right now, we just want it atomic, with performance that does not completely suck 2009-01-04 00:11 and then start review 2009-01-04 00:12 sounds good 2009-01-04 00:12 ok, so forking a page with multiple buffers... 2009-01-04 00:12 all the dirty buffers on the page are forked at the same time 2009-01-04 00:12 -!- RazvanM_(~RazvanM@96.234.232.218) has joined #tux3 2009-01-04 00:12 all the buffer dirty states are set to the current delta 2009-01-04 00:13 yes 2009-01-04 00:13 we remove the buffer heads from the old page, put them on the copy 2009-01-04 00:13 and allocate new buffer heads to put on the old page, that is just for convenience 2009-01-04 00:14 if we have reads in flight on any of the buffers, which we will not in this simple situation, things get interesting 2009-01-04 00:15 the read will complete on the old page, but the buffer is now pointing at the new page 2009-01-04 00:15 the endio has to copy from the old page to the new page 2009-01-04 00:15 this is in the fully general case where we are forking with async buffer modifications going on 2009-01-04 00:15 that is for later, but worth thinking about now 2009-01-04 00:16 mmap can do it easily? 
2009-01-04 00:16 buffer forking won't be used for inodes that can be mmapped 2009-01-04 00:16 ah 2009-01-04 00:16 it will be used for dirents, bitmaps, buffer cache 2009-01-04 00:17 doing this on the blockdev seems not likely to work 2009-01-04 00:17 yes 2009-01-04 00:17 but if we use our own inode for a fake blockdev, it can work 2009-01-04 00:17 it will be really cool 2009-01-04 00:17 yes 2009-01-04 00:18 the user of the buffer has to be aware that the address of the data can change when it does blockdirty(buffer) 2009-01-04 00:18 which we will do before making any changes 2009-01-04 00:18 there are about 40 places we need to check, and maybe 10 that need to be fixed 2009-01-04 00:18 one example is struct dwalk 2009-01-04 00:18 where we use it for seeking 2009-01-04 00:18 it has poitners 2009-01-04 00:19 and after the blockdirty, the pointers will point at the wrong page 2009-01-04 00:19 ah 2009-01-04 00:19 another example is balloc... when it finds a free region, it does blockdirty() and then it has to adjust its pointer to point onto the new page, 2009-01-04 00:19 small changes 2009-01-04 00:20 so anyway, I thought I would start with logging frees and allocs, the patch is partly done 2009-01-04 00:20 and I have a long design note on that, that I should finish and post 2009-01-04 00:20 it's nearly done 2009-01-04 00:21 I was just worrying about the details of flushing the bitmap 2009-01-04 00:22 i see 2009-01-04 00:28 -!- kushal(~kushal@117.195.33.149) has joined #tux3 2009-01-04 00:31 oh, another thing we want forking to do... maintain list links for the buffer heads 2009-01-04 00:32 we will use the b_assoc_buffers to link buffers to a delta I think 2009-01-04 00:32 ah 2009-01-04 00:32 i see 2009-01-04 00:32 to avoid to change buffer_head? 
2009-01-04 00:32 when we fork a buffer, we want the old buffer head to be removed from that delta list, and the new one, which points at the old data, to be put on the list 2009-01-04 00:33 just to keep track of which buffers belong to the delta 2009-01-04 00:34 when a buffer is forked, where did the old buffer go? 2009-01-04 00:34 something has to point at it 2009-01-04 00:34 -!- kushal_(~kushal@117.195.33.149) has joined #tux3 2009-01-04 00:34 I think delta list 2009-01-04 00:34 right 2009-01-04 00:34 simple idea 2009-01-04 00:35 has some delta 2009-01-04 00:35 and deltas state will be changed 2009-01-04 00:35 before, I was going to kmalloc list links to keep track of metadata buffers for a delta, it is much nicer to use the list link in the buffer_head 2009-01-04 00:36 deltas state will be changed? 2009-01-04 00:36 I imaged we allocate delta structure 2009-01-04 00:37 and buffer_head will be linked to it 2009-01-04 00:37 yes 2009-01-04 00:37 -!- stargazr5(~stargazr@117.195.33.149) has joined #tux3 2009-01-04 00:37 -!- amey(~amey@117.195.33.149) has joined #tux3 2009-01-04 00:37 and if delta counter was changed, delta state will be changed to staging(?) 
2009-01-04 00:37 an array of delta headers will do, we access modulo the low bit of the delta counter 2009-01-04 00:37 and new delta will be allocated to be linked new buffer_head 2009-01-04 00:37 yes 2009-01-04 00:38 we could do that 2009-01-04 00:38 I like an array though 2009-01-04 00:38 yes 2009-01-04 00:38 array is much good 2009-01-04 00:38 when we have a fancier pipeline, it can be an array of four deltas 2009-01-04 00:38 yes 2009-01-04 00:39 so, we doesn't have to remove buffer_head from delta until io complition 2009-01-04 00:41 yes 2009-01-04 00:41 so I am thinking we will use a very simple strategy for knowing when all blocks are on disk 2009-01-04 00:42 just walk the delta buffer list and do wait_on_buffer, for everything except file data 2009-01-04 00:42 really crude 2009-01-04 00:43 it sounds like good start 2009-01-04 00:44 btw, I'm still not sure about redirect of btree buffer 2009-01-04 00:45 because? 2009-01-04 00:45 if it was redirected, cursor->path and btree.c has invalid block address 2009-01-04 00:45 forked, you mean 2009-01-04 00:46 yes, that is another place that has to be fixed 2009-01-04 00:46 I think fork and redirect 2009-01-04 00:46 well 2009-01-04 00:46 actually, no, we will not fork there 2009-01-04 00:46 redirect changes the physical mapping 2009-01-04 00:46 yes 2009-01-04 00:46 ah 2009-01-04 00:46 and moves the buffer to a new place in the buffer cache, yes 2009-01-04 00:47 we can either update the poitner in the path, or we can change the path pointers to offsets 2009-01-04 00:47 fork is 2009-01-04 00:47 no 2009-01-04 00:48 entries->block is changed by redirect? 2009-01-04 00:48 yes 2009-01-04 00:48 and we will log that change as a "promise" 2009-01-04 00:48 I'm not sure how do we handle it 2009-01-04 00:49 entries->block will be changed by rollup? 
2009-01-04 00:49 on buffer 2009-01-04 00:49 entries->block will always be current 2009-01-04 00:49 we just do not write out the dirty metadata block 2009-01-04 00:50 replay is able to reconstruct that block from the last saved version, and the promises 2009-01-04 00:50 so this is a key point: our cache is always current 2009-01-04 00:51 whether we fork, redirect, make promises or whatever, the cache always has the current tree structure, just as it is now 2009-01-04 00:51 current means on disk btree data + promises? 2009-01-04 00:51 yes 2009-01-04 00:52 that's a nice equation: disk + promises = filesystem image 2009-01-04 00:52 e.g. after mount, entries->block == 1 on buffer cache 2009-01-04 00:52 and block 1 was changed 2009-01-04 00:53 after that, we redirect block 1 to somewhere? 2009-01-04 00:53 yes, say 9 2009-01-04 00:53 ok 2009-01-04 00:53 so, block 1 was redirected to block 9 2009-01-04 00:53 so entries->block in memory is 9, on disk it is 1 2009-01-04 00:53 ok 2009-01-04 00:54 and there is a promise to make it 9 2009-01-04 00:54 who changes entries->block in memory? 2009-01-04 00:54 the same code that does now 2009-01-04 00:55 we doesn't fork that buffer? 2009-01-04 00:55 it's not necessary 2009-01-04 00:55 i see 2009-01-04 00:55 eventually, we will do a rollup 2009-01-04 00:56 and the cache is already pointing at the right place 2009-01-04 00:56 so the rollup could just write out the cached block 2009-01-04 00:56 i see 2009-01-04 00:56 except that would corrupt the stable image 2009-01-04 00:56 so we redirect the block at that point, and make a promise to update its parent 2009-01-04 00:57 when we redirect, we write the block out 2009-01-04 00:58 writing out a block fulfills all promises for that block 2009-01-04 00:59 i see 2009-01-04 00:59 so, it needs promises 2009-01-04 00:59 not just redirect 2009-01-04 01:00 um..
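The "disk + promises = filesystem image" equation and the block 1 to block 9 example can be written down as a toy model. Everything here is hypothetical: index nodes are plain dicts from key to child block number, a promise is a `(parent_block, key, new_child)` tuple, and the record format in a real log is not described in this conversation.

```python
def redirect_child(cache_nodes, promises, parent, key, new_child):
    """Point parent[key] at new_child in cache and log a promise instead of
    writing the parent node out."""
    cache_nodes[parent][key] = new_child          # the cache is always current
    promises.append((parent, key, new_child))     # the on-disk parent stays untouched


def apply_promises(disk_nodes, promises):
    """Replay: rebuild current nodes from their last saved versions plus promises."""
    nodes = {block: dict(entries) for block, entries in disk_nodes.items()}
    for parent, key, new_child in promises:
        nodes[parent][key] = new_child
    return nodes


if __name__ == "__main__":
    # Parent index node 0 is on disk, pointing at child blocks 1 and 2.
    disk = {0: {"a": 1, "b": 2}}
    cache = {0: dict(disk[0])}
    promises = []
    # Child block 1 is redirected to block 9, as in the example above.
    redirect_child(cache, promises, parent=0, key="a", new_child=9)
    print(disk[0]["a"], cache[0]["a"], apply_promises(disk, promises) == cache)
```

In this model the parent is never rewritten for the redirect, so the copy does not ripple up toward the root. Writing the parent out later would let its promises be dropped from the log.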
2009-01-04 01:00 promises are a way of avoiding updating parents, and thus having to move them 2009-01-04 01:01 this avoids recursive copy on write that goes all the way to the root of the filesystem tree 2009-01-04 01:01 ah 2009-01-04 01:01 that could typically be 6-8 levels 2009-01-04 01:02 if parent bnode (i.e. entries->block change) was also redirected, it is recursive 2009-01-04 01:02 so, for a small sync update, it avoids writing 6-8 metadata blocks 2009-01-04 01:03 so, entries->block change is promises? 2009-01-04 01:03 we describe that change by a promise 2009-01-04 01:03 and therefore don't have to change the parent 2009-01-04 01:04 changes in memory, and on disk is promises? 2009-01-04 01:04 exactly 2009-01-04 01:04 i see 2009-01-04 01:04 I think this is about as efficient as it is possible to be 2009-01-04 01:05 good idea 2009-01-04 01:05 thanks 2009-01-04 01:06 so, btree.c would not be like others? 2009-01-04 01:06 in what sense? 2009-01-04 01:06 other files 2009-01-04 01:07 other filesystems? 2009-01-04 01:07 it will not be handled by redirect and fork 2009-01-04 01:07 other files like directory.c 2009-01-04 01:07 we still need to redirect btree nodes when we do rollup, or split 2009-01-04 01:07 it is not like filemap 2009-01-04 01:07 it would not use blockfork(), instead it will have promises code 2009-01-04 01:08 ? 2009-01-04 01:08 blockfork is only needed when we do fancy async stuff later 2009-01-04 01:08 for btree.c 2009-01-04 01:08 ah 2009-01-04 01:09 fork affects the in-memory buffer only 2009-01-04 01:09 yes 2009-01-04 01:09 promises are not related to avoiding fork, they avoid redirect 2009-01-04 01:09 fork is to get stable buffer, and another job is async stuff? 2009-01-04 01:10 yes 2009-01-04 01:10 ok 2009-01-04 01:10 fork will let us do a nice pipeline with no wait between stages 2009-01-04 01:10 btree.c just want async stuff? 2009-01-04 01:11 a bit isn't 2009-01-04 01:11 um.. 2009-01-04 01:11 ? 
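The "disk + promises = filesystem image" idea above can be sketched in a few lines of C. This is only an illustration under assumptions: `struct bnode`, `struct promise`, `NODE_ENTRIES` and `replay_promises` are hypothetical names and layouts, not the tux3 structures. It shows replay taking the last saved copy of a parent node and applying the logged promises, so the redirected child (block 1 moved to block 9 in the example from the discussion) becomes visible without the parent having been rewritten on disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NODE_ENTRIES 4 /* hypothetical fanout, kept tiny for illustration */

/* Hypothetical, simplified index node: each entry maps a key to a child block. */
struct bnode {
	uint64_t key[NODE_ENTRIES];
	uint64_t block[NODE_ENTRIES];
};

/* Hypothetical promise: "in node `parent`, the entry for `key` must point at `child`". */
struct promise {
	uint64_t parent; /* disk address of the parent node */
	uint64_t key;    /* key of the entry to update */
	uint64_t child;  /* new, redirected child address */
};

/*
 * Apply every logged promise that targets the node stored at parent_addr
 * to the last saved copy of that node. Returns how many were applied.
 */
static int replay_promises(struct bnode *node, uint64_t parent_addr,
			   const struct promise *log, size_t count)
{
	int applied = 0;

	assert(node != NULL && (log != NULL || count == 0));
	for (size_t i = 0; i < count; i++) {
		if (log[i].parent != parent_addr)
			continue; /* promise is for some other node */
		for (int j = 0; j < NODE_ENTRIES; j++) {
			if (node->key[j] == log[i].key) {
				node->block[j] = log[i].child;
				applied++;
				break;
			}
		}
	}
	return applied;
}
```

The parent node is never written just because a child moved; it only needs to go to disk at rollup, on a split, or to recover cache memory, which is what avoids the recursive copy to the root.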
2009-01-04 01:11 somehow, I'm thinking fork is a part of atomic 2009-01-04 01:11 it's useful for flushing the bitmaps 2009-01-04 01:12 that's all we need it for right now 2009-01-04 01:13 i see 2009-01-04 01:16 ah, fork is redirect in memory? 2009-01-04 01:17 sort of, it's more like a copy break in page tables 2009-01-04 01:17 because, with entries->block change, buffer also on new position 2009-01-04 01:17 cow break 2009-01-04 01:18 fork is not for atomic commit, it is to avoid stalls 2009-01-04 01:18 it lets us change a buffer without destroying the old contents, which has not be written to disk yet 2009-01-04 01:18 however, it changes buffer to new position in radix tree 2009-01-04 01:18 ? 2009-01-04 01:19 fork does not move the buffer in the radix tree 2009-01-04 01:19 ah, I was fiddling with that around christmas 2009-01-04 01:20 we swap pages at a radix tree slot 2009-01-04 01:20 do not move to a new slot 2009-01-04 01:20 ah, yes 2009-01-04 01:21 I imaged directory data page redirect 2009-01-04 01:21 however, it didn't change bnode, it changes dleaf 2009-01-04 01:21 yes 2009-01-04 01:21 dleaf is redirect? 2009-01-04 01:21 or promises? 2009-01-04 01:21 yes, for now 2009-01-04 01:22 we can promise to the dleaf too 2009-01-04 01:22 and that will be good 2009-01-04 01:22 or we can redirect it, and promise to the bnode 2009-01-04 01:22 we redirect directory data page 2009-01-04 01:22 yes 2009-01-04 01:22 and changes dleaf 2009-01-04 01:23 yes 2009-01-04 01:23 dleaf points new address 2009-01-04 01:23 right 2009-01-04 01:23 buffer should be new address in radix tree? 
2009-01-04 01:24 the dleaf has to move in the buffer cache, yes 2009-01-04 01:24 in this case, buffer meant directory data page 2009-01-04 01:25 ah, that stays in the same place because the page cache is logically indexed 2009-01-04 01:25 ah 2009-01-04 01:25 we only have to change the pointer in the dleaf 2009-01-04 01:25 I was forgetting it 2009-01-04 01:25 this is a nice advantage of putting metadata in the page cache 2009-01-04 01:26 if metadata was pointed by logical index 2009-01-04 01:26 yes 2009-01-04 01:27 I was even trying to think of how to put the inode table in the page cache, but that was too crazy 2009-01-04 01:27 btw, it was why too crazy? 2009-01-04 01:28 there is not advantage? 2009-01-04 01:28 there is no advantage? 2009-01-04 01:28 there probably is 2009-01-04 01:28 and maybe it isn't crazy 2009-01-04 01:28 but it hurt my head 2009-01-04 01:28 so I stopped thinkin about it :) 2009-01-04 01:28 maybe later 2009-01-04 01:28 i see 2009-01-04 01:29 phtree is going to go in the page cache 2009-01-04 01:29 sounds good 2009-01-04 01:30 if not, it sounds like it breaks current strategy 2009-01-04 01:31 if we had fixed sized inodes it would be natural to put the inode table in the page cache 2009-01-04 01:31 ah, well, it may be able to use promises 2009-01-04 01:31 i see 2009-01-04 01:32 so, inode table is also redirect + data btree strategy? 2009-01-04 01:33 inode table also needs redirect 2009-01-04 01:34 it means redirect is needed without data btree? 2009-01-04 01:34 yes 2009-01-04 01:34 i see 2009-01-04 01:34 ileaf is needed redirect, and bnode will be handled by promises? 2009-01-04 01:35 yes 2009-01-04 01:35 I think that is easiest to start with 2009-01-04 01:35 because, ileaf logs is complex? 
2009-01-04 01:35 and then later we can add promises for ileaf changes, so only the table block needs redirect 2009-01-04 01:36 no, it's not very complex 2009-01-04 01:36 only a little more 2009-01-04 01:36 i see 2009-01-04 01:36 but redirecting it is even easier, and is not as bad as recursive copy all the way to the root 2009-01-04 01:37 i see 2009-01-04 01:37 redirect for inode table goes in make_inode and save_inode 2009-01-04 01:37 and good way is depends on operations? 2009-01-04 01:38 s/operations/loads/ 2009-01-04 01:38 good way to do what? 2009-01-04 01:38 efficient way 2009-01-04 01:38 promises is the most efficient way 2009-01-04 01:38 always 2009-01-04 01:39 if user creates many inodes, one redirect may be efficient? 2009-01-04 01:39 yes 2009-01-04 01:39 you are right :) 2009-01-04 01:39 oh :) 2009-01-04 01:39 not always 2009-01-04 01:39 that is, promises are not always the most efficient 2009-01-04 01:40 i see 2009-01-04 01:40 let's do the redirect of ileaf for now 2009-01-04 01:41 yes 2009-01-04 01:41 and at least, the create many files benchmark should be fast :) 2009-01-04 01:41 ok :) 2009-01-04 01:43 I'll go to food shop 2009-01-04 01:43 ok, I will work on the log prototype 2009-01-04 01:43 ok, thanks 2009-01-04 02:10 -!- amey(~amey@117.195.33.149) has left #tux3 2009-01-04 03:02 -!- cdk(~cdk@117.195.33.149) has joined #tux3 2009-01-04 03:53 got some basic logging code written 2009-01-04 03:54 tomorrow: replay 2009-01-04 03:54 then commit the prototype, just userspace now 2009-01-04 03:57 -!- stargazr5(~stargazr5@117.195.33.149) has joined #tux3 2009-01-04 04:35 hi flips 2009-01-04 05:07 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2009-01-04 06:54 -!- amey(~amey@xbl.dnsbl.oftc.net) has joined #tux3 2009-01-04 07:06 -!- dcg(~dcg@39.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-01-04 07:12 -!- kushal(~kushal@xbl.dnsbl.oftc.net) has joined #tux3 2009-01-04 07:12 -!- stargazr5(~stargazr5@xbl.dnsbl.oftc.net) has joined #tux3 2009-01-04 
09:03 -!- stargazr5(~stargazr5@117.195.42.45) has joined #tux3 2009-01-04 09:05 -!- kushal(~kushal@117.195.42.45) has joined #tux3 2009-01-04 10:44 -!- kushal(~kushal@117.195.42.45) has joined #tux3 2009-01-04 10:46 hi flips... 2009-01-04 11:18 -!- stargazr5_(~stargazr5@xbl.dnsbl.oftc.net) has joined #tux3 2009-01-04 11:35 -!- kushal(~kushal@xbl.dnsbl.oftc.net) has joined #tux3 2009-01-04 11:53 -!- kushal(~kushal@xbl.dnsbl.oftc.net) has left #tux3 2009-01-04 12:51 -!- stargazr5_(~stargazr5@xbl.dnsbl.oftc.net) has joined #tux3 2009-01-04 14:31 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-04 14:51 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2009-01-04 16:08 -!- dagle(~weechat@host162-104.bornet.net) has joined #tux3 2009-01-04 17:15 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2009-01-04 19:16 -!- amey(~amey@116.73.35.180) has joined #tux3 2009-01-04 19:29 -!- amey(~amey@116.73.35.180) has left #tux3 2009-01-04 20:42 -!- amey(~amey@116.73.35.180) has joined #tux3 2009-01-04 21:14 -!- kushal(~kushal@websorbs.dnsbl.oftc.net) has joined #tux3 2009-01-04 21:15 hi flips 2009-01-04 21:15 hi kushal 2009-01-04 21:16 a small doubt...for freeing deleted blocks where is bfree called? 2009-01-04 21:22 in dleaf_chop 2009-01-04 21:22 it's supposed to be ;) 2009-01-04 21:23 some work to do there 2009-01-04 21:24 as in? 2009-01-04 21:25 actually, it looks right 2009-01-04 21:25 in dleaf 2009-01-04 21:25 -!- stargazr5(~stargazr5@websorbs.dnsbl.oftc.net) has joined #tux3 2009-01-04 21:25 ok...another thing... 2009-01-04 21:26 there are also bfrees in btree 2009-01-04 21:26 ok...for our dedup design one concern is the block deletes... 2009-01-04 21:27 we maintain the ref count in the buckets 2009-01-04 21:28 so on a block deletion we have to calculate the hash and traverse the hash tree to find appropriate bucket 2009-01-04 21:28 where the ref count would be reduced 2009-01-04 21:28 this is an overhead... 
2009-01-04 21:29 logging refcount changes will help a little 2009-01-04 21:29 it might help to keep the refount in a separate tree index by block number 2009-01-04 21:30 with the help of logging, that might be efficient 2009-01-04 21:30 but wont the metadata be too much then? 2009-01-04 21:30 for the log? 2009-01-04 21:30 or the refcounts? 2009-01-04 21:31 no for the separate tree for ref counts... 2009-01-04 21:31 it's not a lot of metadata 2009-01-04 21:31 tree indexes are pretty small 2009-01-04 21:32 it is extra metadata to update 2009-01-04 21:32 that is where the log helps 2009-01-04 21:32 48 bit block no + ref count for each block 2009-01-04 21:33 and the block numbers are already present in our hash tree 2009-01-04 21:33 since it is blocks you can expect many of them to be contiguous and only store the block number once per block 2009-01-04 21:33 or it can be a direct map 2009-01-04 21:33 not a btree 2009-01-04 21:34 just map the recounts into a file, like the xattr refcounts 2009-01-04 21:34 that file will be 1GB per terabyte of blocks 2009-01-04 21:35 if will have pretty good locality 2009-01-04 21:35 s/if/it/ 2009-01-04 21:36 this will reduce the size of your big hash table a little and help to reduce thrashing 2009-01-04 21:39 but wont we have to pre-allocate the 1GB for every TB or am i missing something here?? 2009-01-04 21:40 any block that has never been written is absent from the tree and will be read as zeros 2009-01-04 21:40 so you only get refcount blocks allocated in regions with nonzero refcount 2009-01-04 21:41 you can further improve that by only setting the refcount if greater than one 2009-01-04 21:41 most blocks will be singly referenced, probably 2009-01-04 21:42 in a tree index? 
2009-01-04 21:43 the tree index is compact, and the data blocks are completely absent if they have never been written 2009-01-04 21:43 ok 2009-01-04 21:45 something you might think about is having only a few bits of the sha1 in your main hash tree, and an entry lists all blocks that have those bits of the hash 2009-01-04 21:45 and...currently the bfree in btree and dleaf.c are just printf statements? 2009-01-04 21:46 .bfree = bfree, 2009-01-04 21:46 something like a trie? 2009-01-04 21:46 in dtree 2009-01-04 21:47 is that like a trie? 2009-01-04 21:47 anyway, the main thing is... don't get too worried about your performance before you have something basic running 2009-01-04 21:47 a design can always be refactored for more performance 2009-01-04 21:48 achieving base functionality is the big step 2009-01-04 21:50 ok...but i still couldn't exactly understand your earlier suggestion of keeping only a few bits of the sha1? 2009-01-04 21:52 when a block is written, your goal is to determine if it is equal to some other block, and the hash is an accelerator to avoid searching the entire volume 2009-01-04 21:52 yes... 2009-01-04 21:52 so you can extend that idea, to just using a few bits of the hash, to search for the real hash 2009-01-04 21:53 if you store the full hash indexed by block number 2009-01-04 21:53 -!- amey(~amey@websorbs.dnsbl.oftc.net) has joined #tux3 2009-01-04 21:53 and you store a list of blocks in your main hash tree that have that hash 2009-01-04 21:53 or _might_ have that hash 2009-01-04 21:54 that have the first 32 bits (for example) of the hash 2009-01-04 21:55 ok... 2009-01-04 21:55 then you examine a small set of blocks, usually just one, to find the full hash 2009-01-04 21:55 you can extend this idea further 2009-01-04 21:55 ok... 
2009-01-04 21:55 don't compute a full sha1 2009-01-04 21:56 compute a much smaller hash, and keep a list of blocks that have the same hash 2009-01-04 21:56 to determine equality when there are collisions, actually read the block and compare the data 2009-01-04 21:57 you still save most of the actual block reads, and the hash tree gets much smaller 2009-01-04 21:57 I imagine that this is a win 2009-01-04 21:58 because the hash tree is randomly accessed, and the only way to keep it from falling out of cache is make it small 2009-01-04 21:58 you also win in cpu by computing a more efficient hash 2009-01-04 22:00 we had thought of this but it was looking expensive because detection of duplicate blocks would require a byte by byte comparison with a lot of blocks 2009-01-04 22:01 how does that happen? 2009-01-04 22:01 you only compare to blocks with the same hash 2009-01-04 22:01 and that will not be many 2009-01-04 22:02 if there are a lot of collisions, there is yet another trick you can use 2009-01-04 22:03 have a "collision tree" that handles hashes with a lot of collisions, in that tree you store larger hashes 2009-01-04 22:03 what is that...coz we were worried about the increasing number of collisions 2009-01-04 22:04 dont know much about a collision tree...will look into it.... 2009-01-04 22:04 I just made that name up 2009-01-04 22:05 ok... 2009-01-04 22:05 so how will it work? 2009-01-04 22:05 just another btree, in which you only store hashes that receive a lot of collisions 2009-01-04 22:06 if your primary hash tree has a list of blocks with that hash, then when the list gets long you just store a "look in the collision tree" special value there 2009-01-04 22:06 the collision tree stores much larger hashes, maybe full sha1 2009-01-04 22:06 this design works because there will not be many entries in the collision tree 2009-01-04 22:07 ok...nice idea... 
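A minimal sketch of the "small hash, then compare the data" suggestion, under stated assumptions: the index here is a flat in-memory array scanned linearly rather than a btree, the hash is 32-bit FNV-1a standing in for "a much smaller hash", and `dedup_index`, `dedup_lookup` and `dedup_insert` are hypothetical names. The point is only that the full block comparison runs solely for candidates whose small hash already matches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096 /* assumed block size */
#define MAX_BLOCKS 64   /* tiny fixed capacity, for illustration only */

/* 32-bit FNV-1a, used here as a cheap stand-in for a truncated hash. */
static uint32_t small_hash(const unsigned char *data, size_t len)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++) {
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

/* Hypothetical index: one small hash per known block. */
struct dedup_index {
	uint32_t hash[MAX_BLOCKS];
	const unsigned char *data[MAX_BLOCKS]; /* stands in for reading the block from disk */
	int count;
};

/* Return the index of an identical block, or -1 if none is known. */
static int dedup_lookup(const struct dedup_index *ix, const unsigned char *block)
{
	uint32_t h = small_hash(block, BLOCK_SIZE);

	for (int i = 0; i < ix->count; i++) {
		if (ix->hash[i] != h)
			continue; /* different hash: cannot be equal, skip the read */
		if (memcmp(ix->data[i], block, BLOCK_SIZE) == 0)
			return i; /* same hash and same bytes: a real duplicate */
	}
	return -1;
}

/* Insert unless a duplicate exists; returns the index now representing the data. */
static int dedup_insert(struct dedup_index *ix, const unsigned char *block)
{
	int found = dedup_lookup(ix, block);

	if (found >= 0)
		return found;
	assert(ix->count < MAX_BLOCKS);
	ix->hash[ix->count] = small_hash(block, BLOCK_SIZE);
	ix->data[ix->count] = block;
	return ix->count++;
}
```

With a longer hash for heavily colliding values (the "collision tree" idea), only the comparison loop would change; the caller-facing behavior stays the same.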
2009-01-04 22:08 also, if you keep a list of blocks that collide, you don't need refcounts 2009-01-04 22:10 ah, you don't actually need a separate tree for collisions... you can store a variable number of bits of hash in your main tree 2009-01-04 22:11 there is room for some creative design here :) 2009-01-04 22:13 yes... :) but i'm still not clear about the ref count handling... 2009-01-04 22:14 ah true, you still need refcounts 2009-01-04 22:14 because you have multiple pointers referencing the same block 2009-01-04 22:15 yes... 2009-01-04 22:15 well I think a separate refcount tree is probably the best idea, only set an entry in it if refcount goes above 1 2009-01-04 22:16 you can use a similar trick 2009-01-04 22:16 use a 1 byte refcount 2009-01-04 22:16 ok... 2009-01-04 22:16 then if that overflows, store 255 in the byte and store the actual refcount in a btree 2009-01-04 22:17 ok... 2009-01-04 22:17 so then your refcount table can be at most 256 MB per terabyte 2009-01-04 22:17 and will be mostly sparse at start 2009-01-04 22:17 yes... 2009-01-04 22:18 and you use the logging technique to avoid updating it on every commit 2009-01-04 22:18 this will be fine 2009-01-04 22:19 I think this is better than storing the refcount with the hash 2009-01-04 22:19 yes...i think it should work... 2009-01-04 22:19 will work on this design and get back to you... 2009-01-04 22:20 was fun so far 2009-01-04 22:20 btw..you mentioned two update methods in your design doc... 2009-01-04 22:20 and I don't use one of them at all 2009-01-04 22:20 which one?? 2009-01-04 22:21 only the redirect method will be used 2009-01-04 22:21 no in-place? 2009-01-04 22:21 the other method, journalling style overwrite, could be used for updating the superblock 2009-01-04 22:21 but isn't planned right now 2009-01-04 22:22 i think we will need the same method for our bucket handling... 
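The one-byte refcount with an escape value can be sketched as follows. Assumptions, all hypothetical: the map is a small in-memory array, the overflow store is a linear table standing in for the btree, and the names `refmap`, `ref_inc`, `ref_get` and `refmap_bytes` are made up for illustration. The size helper reproduces the figures quoted in the discussion if a 4096-byte block size is assumed: one byte per block gives 256 MB per terabyte, four bytes per block gives 1 GB per terabyte.

```c
#include <assert.h>
#include <stdint.h>

#define NBLOCKS  1024 /* tiny volume, for illustration only */
#define OVERFLOW 255  /* escape value: the real count lives in the overflow table */
#define MAX_BIG  16

struct big_count {
	uint64_t block;
	uint32_t count;
};

/* Hypothetical refcount map: one byte per block plus an overflow table. */
struct refmap {
	uint8_t small[NBLOCKS];
	struct big_count big[MAX_BIG]; /* stands in for the overflow btree */
	int nbig;
};

static uint32_t ref_get(const struct refmap *m, uint64_t block)
{
	assert(block < NBLOCKS);
	if (m->small[block] != OVERFLOW)
		return m->small[block];
	for (int i = 0; i < m->nbig; i++)
		if (m->big[i].block == block)
			return m->big[i].count;
	return 0; /* unreachable while the map is consistent */
}

static void ref_inc(struct refmap *m, uint64_t block)
{
	assert(block < NBLOCKS);
	if (m->small[block] < OVERFLOW - 1) {
		m->small[block]++; /* common case: count still fits in the byte */
		return;
	}
	if (m->small[block] == OVERFLOW - 1) {
		/* 254 -> 255: store the escape value and move the count out */
		assert(m->nbig < MAX_BIG);
		m->small[block] = OVERFLOW;
		m->big[m->nbig].block = block;
		m->big[m->nbig].count = OVERFLOW;
		m->nbig++;
		return;
	}
	for (int i = 0; i < m->nbig; i++) {
		if (m->big[i].block == block) {
			m->big[i].count++;
			return;
		}
	}
}

/* Size of a direct-mapped refcount file for a volume. */
static uint64_t refmap_bytes(uint64_t volume_bytes, uint32_t block_size,
			     uint32_t bytes_per_count)
{
	return volume_bytes / block_size * bytes_per_count;
}
```

Because never-written regions of the map file read back as zeros, the file stays sparse until counts are actually recorded, as noted in the discussion.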
2009-01-04 22:22 there isn't a lot of use for trying to keep a block in exactly the same place 2009-01-04 22:22 atomic update should be handled for you automatically 2009-01-04 22:23 ok... 2009-01-04 22:23 so i guess the other trees being maintained in the same fashion ? 2009-01-04 22:24 I think we will use get_region(... 2) to mean "redirect these blocks" 2009-01-04 22:24 yes 2009-01-04 22:25 ok 2009-01-04 22:25 ok...then 2009-01-04 22:25 thanks for your inputs... 2009-01-04 22:26 they were very helpful as usual... 2009-01-04 22:26 :) 2009-01-04 22:57 -!- kushal_(~kushal@xbl.dnsbl.oftc.net) has joined #tux3 2009-01-04 23:12 time to commit some code 2009-01-04 23:16 -!- amey(~amey@xbl.dnsbl.oftc.net) has joined #tux3 2009-01-05 01:05 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2009-01-05 02:35 -!- flips(~phillips@phunq.net) has joined #tux3 2009-01-05 03:03 -!- flips(~phillips@phunq.net) has joined #tux3 2009-01-05 03:52 -!- RazvanM(~RazvanM@96.234.232.218) has joined #tux3 2009-01-05 05:08 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2009-01-05 05:16 -!- stargazr5(~stargazr5@xbl.dnsbl.oftc.net) has joined #tux3 2009-01-05 05:54 -!- stargazr5_(~stargazr5@xbl.dnsbl.oftc.net) has joined #tux3 2009-01-05 07:07 -!- inverse(~michael@nat028.dc-uoit.net) has joined #tux3 2009-01-05 07:08 happy new year everyone :) 2009-01-05 07:47 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-01-05 07:54 -!- inverse(~michael@h80-net10.simres.netcampus.ca) has joined #tux3 2009-01-05 08:08 -!- pgquiles(~pgquiles@240.Red-88-22-55.staticIP.rima-tde.net) has joined #tux3 2009-01-05 08:13 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2009-01-05 08:52 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-05 09:34 -!- inverse(~michael@nat026.dc-uoit.net) has joined #tux3 2009-01-05 10:23 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-05 11:18 -!- inverse(~michael@h80-net10.simres.netcampus.ca) has joined 
#tux3 2009-01-05 13:05 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-05 13:07 and a very happy new year to you, too, inverse 2009-01-05 14:18 http://kerneltrap.org/mailarchive/tux3/2008/12/30/4546684 <- there's a gross mistake in end_change 2009-01-05 14:19 none of the thousand eyeballs have caught it yet 2009-01-05 15:36 Well we saved that one for you. 2009-01-05 15:36 thanks for that 2009-01-05 15:38 Patch: Filesystem change brackets <- big yummy post 2009-01-05 15:39 I would appreciate if everybody who can, takes a look at it 2009-01-05 15:39 another big piece of atomic update 2009-01-05 15:40 typo in the first paragraph :( 2009-01-05 15:52 -!- kbingham_(~kbingham@82-46-4-172.cable.ubr03.aztw.blueyonder.co.uk) has joined #tux3 2009-01-05 16:00 hirofumi, around? 2009-01-05 16:20 sk8 oclock 2009-01-05 17:29 hi flips 2009-01-05 17:35 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-05 17:43 macan, hi 2009-01-05 17:43 the begin_change and end_change pair will serialize all the file system operations 2009-01-05 17:44 if there are synchronous writes in end_change, it should be time-consuming 2009-01-05 17:44 so, can we group some filesystem changes together as one delta? 2009-01-05 17:44 yes we do 2009-01-05 17:44 deltas will be large, normally having a few hundred operations 2009-01-05 17:45 that would be a load like untar 2009-01-05 17:45 ok 2009-01-05 17:46 at least the serialization is mainly read locks 2009-01-05 17:47 write lock is only taken when it's time to do a delta transition 2009-01-05 17:47 yes 2009-01-05 17:47 in other words, ops are only serialized against delta transition 2009-01-05 17:48 except for the overhead of the read lock itself, it will be pretty smooth 2009-01-05 17:48 except at transitions, and we will try to make them rare under load. 
Later we will do more clever serialization 2009-01-05 17:51 only a couple more bits to prototype, and then we will have at least a crude prototype of all the atomic op stuff 2009-01-05 17:51 if the delta transition can accept more fs changes, it should be well. and there would exists more deltas, some of which is syncing, and some of others can accept new changes 2009-01-05 17:53 to start with we will just have one delta in the pipeline at a time 2009-01-05 17:54 and pretty soon after that, improve it to have three: active, staging, committing 2009-01-05 17:54 the pipeline will be very interesting 2009-01-05 17:55 I think tux3 will already move along pretty well on single node systems, even with the crude serialization 2009-01-05 17:55 because it doesn't write a lot of metadata 2009-01-05 17:57 yes, but what i consider most is the bitmap file, if there are many frees, bitmap file will touch many different locations, then a lot of dirty blocks will be forked 2009-01-05 17:58 then a lot of logs ? 
2009-01-05 17:58 forking is strictly an in-memory operation 2009-01-05 17:58 the logs are compact 2009-01-05 17:58 8 bytes per extent alloc 2009-01-05 17:58 ok 2009-01-05 17:59 we don't have to flush the bitmap every delta either 2009-01-05 17:59 can flush it only when the log gets too long 2009-01-05 18:00 ah, i missed this point 2009-01-05 18:00 at first we will flush it every delta, changing that to less than once per delta is pretty easy 2009-01-05 18:00 a delta can be long too, so I don't think this is a big worry even at the beginning 2009-01-05 18:02 thanks for your explanation:) 2009-01-05 18:02 any time 2009-01-05 18:43 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2009-01-05 19:10 flips: hey 2009-01-05 19:10 happy new year, bh 2009-01-05 19:10 same to you 2009-01-05 19:49 http://mailman.tux3.org/pipermail/tux3/2009-January/000592.html <- Design note: Block Redirect 2009-01-05 20:11 hi 2009-01-05 20:11 hi 2009-01-05 20:11 the change brackets patch was just for you ;) 2009-01-05 20:11 btw, I noticed begin_change/end_change problem 2009-01-05 20:11 ah 2009-01-05 20:12 however, I didn't care about it 2009-01-05 20:12 because I thought it is for explaining it 2009-01-05 20:12 try it 2009-01-05 20:14 btw, write_lock was released before commit 2009-01-05 20:14 ACTION looks 2009-01-05 20:15 ah, up_write() before commit_delta() 2009-01-05 20:15 should be ok 2009-01-05 20:15 yes 2009-01-05 20:15 it's ok to release the write lock 2009-01-05 20:16 just don't trigger a new delta 2009-01-05 20:16 however, it means we need blockfork() now? 2009-01-05 20:16 hmm, maybe :) 2009-01-05 20:16 ok :) 2009-01-05 20:17 I better move that write_unlock 2009-01-05 20:17 yes, maybe for now 2009-01-05 20:18 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-05 20:18 by the way, the keyboard issue is solved 2009-01-05 20:18 what was problem? 
2009-01-05 20:19 it is actually a misdesigned X feature 2009-01-05 20:19 "slow keys" 2009-01-05 20:19 can be enabled by a "gesture" 2009-01-05 20:19 gesture? 2009-01-05 20:19 I don't know it 2009-01-05 20:19 and since there is no indication, except one cryptic popup that never comes back, it's a recipe for problems 2009-01-05 20:20 gesture -> moving the mouse in a certain way 2009-01-05 20:20 it's a very bad feature interaction, you could call it a design bug 2009-01-05 20:20 many people hit it 2009-01-05 20:20 mouse gesture feature by x? 2009-01-05 20:20 and maintainers can't even figure out what it is 2009-01-05 20:20 even xorg maintainers 2009-01-05 20:21 yes, a xorg feature 2009-01-05 20:21 oh 2009-01-05 20:21 misfeature 2009-01-05 20:21 I thought it has only applications 2009-01-05 20:21 like firefox 2009-01-05 20:21 it can now be magically enabled by mouse gestures 2009-01-05 20:22 that is, slow keys can be enabled by a mouse gesture 2009-01-05 20:22 that is a really, really bad idea 2009-01-05 20:22 especially with no indication on the screen that the system is in that state 2009-01-05 20:23 so, if you want to use the shift key, you have to hold it down for one second, then type the shifted key 2009-01-05 20:23 people including me tend to perceive this as a keyboard failure 2009-01-05 20:23 kde feature? 2009-01-05 20:24 I think gestures is kde, slow keys is X 2009-01-05 20:24 i see 2009-01-05 20:24 ah, gestures is a xorg feature too 2009-01-05 20:24 kde is blameless, except for not providing an on-screen indication 2009-01-05 20:25 i see 2009-01-05 20:25 anyway, what was the problem with begin/end_change? 2009-01-05 20:25 the lock? 
2009-01-05 20:26 yes 2009-01-05 20:26 up_write() 2009-01-05 20:27 if (need_delta(sb)) { 2009-01-05 20:27 unsigned delta = atomic_read(&sb->delta); 2009-01-05 20:27 up_read(&sb->delta_lock); 2009-01-05 20:27 down_write(&sb->delta_lock); 2009-01-05 20:27 if (atomic_read(&sb->delta) == delta) { 2009-01-05 20:27 atomic_inc(&sb->delta); 2009-01-05 20:27 stage_delta(sb, delta); 2009-01-05 20:27 commit_delta(sb, delta); 2009-01-05 20:27 } 2009-01-05 20:27 up_write(&sb->delta_lock); 2009-01-05 20:27 } else 2009-01-05 20:27 up_read(&sb->delta_lock); 2009-01-05 20:28 yes 2009-01-05 20:28 with it, maybe we don't need blockfork() for now except bitmap 2009-01-05 20:29 yes 2009-01-05 20:29 it's useful for bitmap, and isn't so hard to implement when it does not have to deal with async access to the different buffers 2009-01-05 20:30 probably 2009-01-05 20:30 and the write lock will be held until all the delta data is on disk 2009-01-05 20:30 we can wait on each delta buffer 2009-01-05 20:30 very crude 2009-01-05 20:31 but if it atomic commits, it will still be fine 2009-01-05 20:31 a nice base to work from 2009-01-05 20:31 -!- macana(~macan@websorbs.dnsbl.oftc.net) has joined #tux3 2009-01-05 20:31 the change brackets for write operations are tricky and messy 2009-01-05 20:32 I was preparing to fix that up by reading ext3 code 2009-01-05 20:32 this is very scary 2009-01-05 20:32 change brackets? 2009-01-05 20:32 begin/end_change 2009-01-05 20:33 what is current problem? 
2009-01-05 20:33 in the patch I posted, they are wrongly paced in map_region 2009-01-05 20:33 yes 2009-01-05 20:33 s/paced/placed/ 2009-01-05 20:33 they have to go in write_begin/end and writepage 2009-01-05 20:33 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-05 20:34 I have not had time to find out exactly where they should go 2009-01-05 20:34 ok 2009-01-05 20:35 we can start with one function 2009-01-05 20:36 I'll find all modifiers later 2009-01-05 20:36 we can start with some name operations then 2009-01-05 20:37 ok, so directory shouldn't be corrupted 2009-01-05 20:37 shall we call them change locks or change brackets? 2009-01-05 20:37 need to call them something, we will be talking about them a lot 2009-01-05 20:38 transaction? 2009-01-05 20:38 I forget term of this 2009-01-05 20:38 sometimes called transaction, but it's not really a transaction 2009-01-05 20:39 transaction == change anyway 2009-01-05 20:39 upper layer - delta 2009-01-05 20:39 sometimes called transaction markers 2009-01-05 20:39 it has logs 2009-01-05 20:39 I think we talked before about delta terms 2009-01-05 20:40 a delta has multiple changes in it 2009-01-05 20:40 yes 2009-01-05 20:40 old, it had two commit terms 2009-01-05 20:40 last commit and fake commit 2009-01-05 20:41 last commit is delta commit, iirc 2009-01-05 20:41 fake commit is called commit, maybe 2009-01-05 20:41 later, delta commit is called commit 2009-01-05 20:41 what is a fake commit? 2009-01-05 20:41 and commit was.. 2009-01-05 20:42 just log record? 2009-01-05 20:42 ah 2009-01-05 20:42 well, speaking of commit, I will commit the changes patch 2009-01-05 20:42 it is all no-ops 2009-01-05 20:43 ok 2009-01-05 20:49 ok, log blocks 2009-01-05 20:49 right 2009-01-05 20:49 and commit block 2009-01-05 20:49 so, prepare log? 
2009-01-05 20:50 commit block will just be a log block with a commit entry at the end of it 2009-01-05 20:50 maybe 2009-01-05 20:50 well, last block of delta 2009-01-05 20:50 yes 2009-01-05 20:50 we can flush all the log blocks at the same time, in any order 2009-01-05 20:51 yes 2009-01-05 20:51 when _all_ of them have completed, then the commit block must have completed too, and we enter a pointer in the root 2009-01-05 20:51 s/root/disksuper/ 2009-01-05 20:51 later we will do something better than changing the superblock 2009-01-05 20:52 i see 2009-01-05 20:52 our superblock updating code is a little messy right now, I think we should leave it alone, it will be the last thing to fix 2009-01-05 20:53 ok 2009-01-05 20:53 I am thinking about something called a "metablock", there is an array of them listed in the disksuper 2009-01-05 20:54 they are distributed across the disk 2009-01-05 20:54 and we update the closest one that was not updated in the previous delta, to point at the delta commit 2009-01-05 20:54 pretty simple 2009-01-05 20:54 -!- RazvanM(~RazvanM@96.234.232.218) has joined #tux3 2009-01-05 20:55 so we put all those fields with comments "shouldn't really be here" in the metablock 2009-01-05 20:55 all the variable data 2009-01-05 20:55 so the superblock only has fixed data 2009-01-05 20:56 who points metablock? 2009-01-05 20:56 disksuper has an array of pointers to metablocks 2009-01-05 20:56 on start, we check all of them to find the most recent 2009-01-05 20:56 it is fixed location? 2009-01-05 20:56 yes 2009-01-05 20:56 i see 2009-01-05 20:56 a few fixed locations 2009-01-05 20:56 scattered across the disk 2009-01-05 20:58 anyway, log 2009-01-05 20:59 "log old free, new alloc and promise to update parent on disk" <- how many bytes is that? 2009-01-05 20:59 8 + 8 + length of promise 2009-01-05 21:00 old free, and new alloc? 2009-01-05 21:00 free old log? 
2009-01-05 21:01 [UPDATE:8, child:48, parent:48, key:48] (19 bytes) Implies balloc, bfree 2009-01-05 21:01 fields of log record? 2009-01-05 21:01 yes 2009-01-05 21:01 a promise 2009-01-05 21:01 i see 2009-01-05 21:02 but... 2009-01-05 21:02 the free old is not in there 2009-01-05 21:02 it is in the parent though 2009-01-05 21:02 on replay, we can read the parent to find out which block to free 2009-01-05 21:03 what is parent block? 2009-01-05 21:04 a bnode 2009-01-05 21:04 ah, this is promise 2009-01-05 21:04 yes 2009-01-05 21:05 btw, this is logical logging 2009-01-05 21:05 yes 2009-01-05 21:05 um.. 2009-01-05 21:05 I forget work in english 2009-01-05 21:05 logical logging is a normal term 2009-01-05 21:06 we avoid one typical problem of logical logging, by always knowing that our change has not yet been applied to the target 2009-01-05 21:07 Idempodent? 2009-01-05 21:07 !idempotent 2009-01-05 21:07 what is the opposite of idempotent? 2009-01-05 21:08 the UPDATE is idempotent, but the INSERT and DELETE are not 2009-01-05 21:08 idempotent 2009-01-05 21:08 before, I worry about idempotent of loggical logging 2009-01-05 21:09 right, but we avoid that problem 2009-01-05 21:09 i see 2009-01-05 21:09 because we can guarantee to apply the change only once 2009-01-05 21:10 so we can use the logical log for some other nice optimizations 2009-01-05 21:10 for example, we can write all the attributes of a new inode to the logical log instead of updating the itable block 2009-01-05 21:11 we can also log atimes to the log, atom refcount changes... lots of things 2009-01-05 21:11 quota change 2009-01-05 21:11 how do we avoid apply log twice or more? 2009-01-05 21:13 when we replay, we have a log of changes, and each change refers to some block that is read-only since the last delta 2009-01-05 21:13 in other words, we never overwrite a block that is referred to by a log entry 2009-01-05 21:13 i see 2009-01-05 21:15 and initial start has dirty buffers 2009-01-05 21:16 we ? 
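The record layout quoted above, [UPDATE:8, child:48, parent:48, key:48], adds up to 8 + 3 × 48 = 152 bits, which is the 19 bytes mentioned. A minimal encoding sketch, with assumptions labeled: the opcode value `LOG_UPDATE`, the big-endian byte order, and the names `encode_update` and `decode_update` are all hypothetical and are not taken from the tux3 log code.

```c
#include <assert.h>
#include <stdint.h>

enum { LOG_UPDATE = 1 }; /* hypothetical opcode value */
#define UPDATE_RECORD_SIZE 19 /* 1 opcode byte + three 48-bit fields */

struct update {
	uint64_t child;  /* new address of the redirected block */
	uint64_t parent; /* address of the bnode that must be updated */
	uint64_t key;    /* key of the entry within the parent */
};

/* Store the low 48 bits of v, most significant byte first (assumed order). */
static void put48(unsigned char *p, uint64_t v)
{
	assert(v < (1ULL << 48));
	for (int i = 0; i < 6; i++)
		p[i] = (unsigned char)(v >> (40 - 8 * i));
}

static uint64_t get48(const unsigned char *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 6; i++)
		v = (v << 8) | p[i];
	return v;
}

static void encode_update(unsigned char out[UPDATE_RECORD_SIZE],
			  const struct update *u)
{
	out[0] = LOG_UPDATE;
	put48(out + 1, u->child);
	put48(out + 7, u->parent);
	put48(out + 13, u->key);
}

/* Returns 0 on success, -1 if the record is not an UPDATE. */
static int decode_update(const unsigned char in[UPDATE_RECORD_SIZE],
			 struct update *u)
{
	if (in[0] != LOG_UPDATE)
		return -1;
	u->child = get48(in + 1);
	u->parent = get48(in + 7);
	u->key = get48(in + 13);
	return 0;
}
```

The old child address is deliberately absent from the record: as explained above, replay can read it from the parent, which is read-only since the last delta.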
2009-01-05 21:16 and dirty buffers will be written with next delta? 2009-01-05 21:16 ah 2009-01-05 21:16 initial means "after mount" 2009-01-05 21:16 they will be written just as if the fielsystem had never shut down 2009-01-05 21:16 yes, I understand now 2009-01-05 21:17 so, a dirty bitmap block might written out a few deltas later 2009-01-05 21:17 i see 2009-01-05 21:18 dirty bnodes only need to be written out to recover cache memory 2009-01-05 21:18 or if they are split 2009-01-05 21:18 there is not clean state, however there is no exception? 2009-01-05 21:18 there is no clean state, however there is no exception? 2009-01-05 21:18 right, no clean state 2009-01-05 21:18 we could add one if we want 2009-01-05 21:19 ah, yes 2009-01-05 21:19 but there is no need to have it now, and it is better to always start dirty, that way we test the recovery 2009-01-05 21:20 yes 2009-01-05 21:21 the code in user/commit.c needs to be tweaked to work in kernel 2009-01-05 21:22 in user space it is easy to just allocate a map and use it 2009-01-05 21:22 I don't know if it is easy to allocate an address_space 2009-01-05 21:24 I think address_space only is abnormal at least 2009-01-05 21:24 yes 2009-01-05 21:24 so easiest is just to change the prototype to allocate an inode instead of a mapping 2009-01-05 21:25 swapper seems to do it 2009-01-05 21:25 swapper_space 2009-01-05 21:26 file/line#? 
2009-01-05 21:26 http://lxr.linux.no/linux+v2.6.27.5/mm/swap_state.c#L40 2009-01-05 21:28 uses radix tree ops directly to insert pages 2009-01-05 21:29 probably 2009-01-05 21:29 it seems abnormal too 2009-01-05 21:29 251 251 page = find_get_page(&swapper_space, entry.val); 2009-01-05 21:30 yes 2009-01-05 21:30 and xfs does something 2009-01-05 21:30 if (!pag->pag_ici_init) { 2009-01-05 21:30 rwlock_init(&pag->pag_ici_lock); 2009-01-05 21:30 INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC); 2009-01-05 21:30 pag->pag_ici_init = 1; 2009-01-05 21:30 } 2009-01-05 21:30 error = radix_tree_insert(&pag->pag_ici_root, agino, ip); 2009-01-05 21:31 all I am doing with the mapping in user/commit.c is inserting pages 2009-01-05 21:32 anyway, we know an inode will work 2009-01-05 21:32 yes 2009-01-05 21:32 it would be easy 2009-01-05 21:32 easy is good 2009-01-05 21:33 yes, and we don't have strong reason to use address_space directly at least for now 2009-01-05 21:34 we can do it if we want, I think it is enough for now 2009-01-05 21:36 yes 2009-01-05 21:37 ok, I will change the prototype to use an inode 2009-01-05 21:37 ok 2009-01-05 21:37 first thing for log is balloc/bfree? 2009-01-05 21:38 I thought it would be good to start there 2009-01-05 21:38 yes 2009-01-05 21:38 the UPDATE above looks like it is pretty good, time to implement that too 2009-01-05 21:39 we need the promises even in the intial version, otherwise we must do recursive redirect all the way to the filesystem root 2009-01-05 21:39 on every delta, for every changed file 2009-01-05 21:39 :p 2009-01-05 21:39 btrfs and zfs do this 2009-01-05 21:39 wafl too 2009-01-05 21:39 if first target is tux3_mknod(), tux3_mknod() -> make_inode() -> balloc() 2009-01-05 21:40 i see 2009-01-05 21:40 tux3_mknod is a good first target 2009-01-05 21:41 and balloc() calls blockread() 2009-01-05 21:42 buffer of blockread() must not be dirty 2009-01-05 21:42 so, first log is... 
2009-01-05 21:42 ok, balloc() log 2009-01-05 21:43 balloc will be called, and we will log too 2009-01-05 21:43 by blockread() in balloc()? 2009-01-05 21:43 so that the bitmap buffer in cache is current, but we do not write it out until after the delta 2009-01-05 21:44 the blockread in balloc will get the read/only copy from disk, or zeroes if it is the first bit on that block 2009-01-05 21:44 yes 2009-01-05 21:44 ah, it can be dirty if second usage in same delta 2009-01-05 21:46 first log is balloc(), and... 2009-01-05 21:46 we have to delay inode allocation? 2009-01-05 21:47 it is not needed 2009-01-05 21:48 don't need to delay inode allocation I think 2009-01-05 21:48 we don't allow next delta until current delta is completed 2009-01-05 21:48 yes 2009-01-05 21:48 this simplifies a few things 2009-01-05 21:48 i see 2009-01-05 21:48 a good place to start 2009-01-05 21:48 and for big deltas, it will still be pretty efficient 2009-01-05 21:49 so our untar test should not suck too badly 2009-01-05 21:49 i see 2009-01-05 21:50 so, next log is btree ops 2009-01-05 21:50 and ileaf 2009-01-05 21:51 yes 2009-01-05 21:51 the hardest one is the first one 2009-01-05 21:51 and directory btree, then directory data 2009-01-05 21:51 ah 2009-01-05 21:52 we can do file data and directory data at the same time 2009-01-05 21:52 i see 2009-01-05 21:52 block redirect is the same for them 2009-01-05 21:52 we redirect in map_region 2009-01-05 21:52 and it handles both dirents and regular files 2009-01-05 21:53 yes, and if new directory, we have to change parent inode count 2009-01-05 21:53 logs is done 2009-01-05 21:53 :) 2009-01-05 21:53 we can have map_region(... , 2) <- redirect instead of overwrite 2009-01-05 21:53 yes 2009-01-05 21:53 and ileaf too? 2009-01-05 21:53 ileaf is not in page cache 2009-01-05 21:54 ah, yes 2009-01-05 21:54 however, it is needed to redirect 2009-01-05 21:55 ah, so page cache for inode table may be good 2009-01-05 21:56 um... 
2009-01-05 21:56 something to think about 2009-01-05 21:56 but right now, inode index is very similar to file index 2009-01-05 21:56 so index update code will be similar 2009-01-05 21:58 maybe 2009-01-05 22:00 the index update code will be generic in btree.c 2009-01-05 22:00 including index leaf 2009-01-05 22:01 ah 2009-01-05 22:01 yes 2009-01-05 22:03 dleaf is also redirected? 2009-01-05 22:04 yes 2009-01-05 22:04 later we can add a promise for the pointer from dleaf to new extent 2009-01-05 22:05 so that the dleaf is not written on every delta 2009-01-05 22:06 i see 2009-01-05 22:06 btw, bh_delay() style delalloc helps atomic commit? 2009-01-05 22:07 I'm not sure about it 2009-01-05 22:07 if it helps, I'm thinking to try ti 2009-01-05 22:08 bh_delay? 2009-01-05 22:08 current ext4/xfs is using it for delalloc 2009-01-05 22:08 ah, buffer_delay() 2009-01-05 22:09 bh_delay looks like xfs 2009-01-05 22:10 void log_update(struct sb *sb, block_t child, block_t parent, tuxkey_t key) 2009-01-05 22:10 { 2009-01-05 22:10 unsigned char *data = log_need(sb, 19); 2009-01-05 22:10 *data++ = LOG_UPDATE; 2009-01-05 22:10 data = encode48(data, child); 2009-01-05 22:10 data = encode48(data, parent); 2009-01-05 22:10 sb->logpos = encode48(data, key); 2009-01-05 22:10 } 2009-01-05 22:15 looks like good 2009-01-05 22:25 each log block can handle 140 update promises, that is 575K at worst (1 block/extent) or 36MB at best (64 blocks/extent) 2009-01-05 22:26 i see 2009-01-05 22:26 that is written data per log block 2009-01-05 22:26 btw, I'd like to remove 64blocks limit later 2009-01-05 22:27 yes 2009-01-05 22:27 have a special format for large extents 2009-01-05 22:27 special format? 2009-01-05 22:27 not current diskextent? 2009-01-05 22:27 we have 10 bits reserved for versioning 2009-01-05 22:28 yes 2009-01-05 22:28 that fields is used for other purpose? 
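The 19-byte UPDATE record pasted above (one opcode byte plus three 48-bit fields) can be tried out in user space with a small sketch. Everything beyond the log_update body is an assumption made for illustration: the simplified struct sb holding a single in-memory log block, the opcode value, the big-endian byte order in encode48, the decode48 helper, and a log_need that only checks for space instead of starting a new log block.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LOG_UPDATE = 1 };                    /* opcode value is an assumption */
enum { LOG_BLOCK_SIZE = 4096, UPDATE_RECORD_SIZE = 19 };

typedef uint64_t block_t;
typedef uint64_t tuxkey_t;

/* Hypothetical stand-in for the chat's struct sb: one in-memory log block. */
struct sb {
	unsigned char logbuf[LOG_BLOCK_SIZE];
	unsigned char *logpos;              /* next free byte in logbuf */
};

static void log_init(struct sb *sb)
{
	memset(sb->logbuf, 0, sizeof sb->logbuf);
	sb->logpos = sb->logbuf;
}

/* Store the low 48 bits of val, most significant byte first (assumed order). */
static unsigned char *encode48(unsigned char *p, uint64_t val)
{
	for (int shift = 40; shift >= 0; shift -= 8)
		*p++ = (unsigned char)(val >> shift);
	return p;
}

/* Inverse of encode48, used here only to check the record contents. */
static uint64_t decode48(const unsigned char *p)
{
	uint64_t val = 0;
	for (int i = 0; i < 6; i++)
		val = (val << 8) | p[i];
	return val;
}

/* Return a pointer to `bytes` of free log space. A real logger would move on
 * to a fresh log block when the current one is full; this sketch asserts. */
static unsigned char *log_need(struct sb *sb, unsigned bytes)
{
	assert(sb->logpos + bytes <= sb->logbuf + LOG_BLOCK_SIZE);
	return sb->logpos;
}

/* Same shape as the log_update pasted in the chat: 1 + 6 + 6 + 6 = 19 bytes. */
static void log_update(struct sb *sb, block_t child, block_t parent, tuxkey_t key)
{
	unsigned char *data = log_need(sb, UPDATE_RECORD_SIZE);
	*data++ = LOG_UPDATE;
	data = encode48(data, child);
	data = encode48(data, parent);
	sb->logpos = encode48(data, key);
}
```

Calling log_init once and then log_update(&sb, child, parent, key) advances sb.logpos by exactly 19 bytes per promise, which is the figure the record layout in the chat implies.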
2009-01-05 22:28 so to have longer extents + 48 bits block + 6 bits count, we need a variant extent format 2009-01-05 22:29 I thought about it 2009-01-05 22:29 we could use 1 bit of the extent to mean "larger count", version same as previous extent" 2009-01-05 22:29 ah 2009-01-05 22:29 just some more tricky code to write, otherwise no problem ;) 2009-01-05 22:30 I thought about reserve some bits on extent->count 2009-01-05 22:30 e.g. 1 bit flag + 5bits count 2009-01-05 22:31 if flags == 1, count means shift bit 2009-01-05 22:32 so, max is 1 << 32 blocks 2009-01-05 22:32 well, it is later 2009-01-05 22:32 nice idea 2009-01-05 22:33 better than having a block full of 64 block extents 2009-01-05 22:33 maybe, yes 2009-01-05 22:35 that's a 16TB extent 2009-01-05 22:35 I guess we need that :) 2009-01-05 22:36 issue of it is it can present only log2 2009-01-05 22:36 so you need log2 of those extents to represent an exact size, that is not bad 2009-01-05 22:37 yes, if extent->count is big, it may help 2009-01-05 22:37 at most 27 extents for any exact size up to 16 TB 2009-01-05 22:38 your math is fast :) 2009-01-05 22:38 but is it accurate? 
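The "1 bit flag + 5 bits count" idea above can be sketched as follows. The exact encoding is an assumption, since the chat does not fix it: here a clear flag stores a plain count as count - 1 (1 to 32 blocks), and a set flag stores a shift as shift - 1, so the field covers 2 to 1 << 32 blocks. The names COUNT_FLAG, COUNT_MASK, extent_blocks and split_extents are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 6-bit count field: bit 5 is the flag, bits 0..4 the value. */
enum { COUNT_FLAG = 0x20, COUNT_MASK = 0x1f };

/* Decode a count field into a number of blocks.
 *   flag clear: plain count, stored as count - 1  -> 1..32 blocks
 *   flag set:   power of two, 1 << (value + 1)    -> 2..2^32 blocks */
static uint64_t extent_blocks(unsigned field)
{
	unsigned value = field & COUNT_MASK;

	if (field & COUNT_FLAG)
		return (uint64_t)1 << (value + 1);
	return value + 1;
}

/* Greedily split `blocks` into extents, largest power of two first, with one
 * plain extent for a tail of 32 blocks or fewer. Writes count fields to out[]
 * and returns how many were written, or -1 if out[] is too small. */
static int split_extents(uint64_t blocks, unsigned char *out, int max)
{
	int n = 0;

	while (blocks) {
		unsigned field;

		if (n == max)
			return -1;
		if (blocks <= 32) {
			field = (unsigned)(blocks - 1);
		} else {
			unsigned shift = 32;
			while (((uint64_t)1 << shift) > blocks)
				shift--;
			field = COUNT_FLAG | (shift - 1);
		}
		out[n++] = (unsigned char)field;
		blocks -= extent_blocks(field);
	}
	return n;
}
```

Under these particular assumptions the greedy split needs at most 28 extents for any length up to 1 << 32 blocks, which is in the same range as the estimate given in the chat; a different choice of encoding would shift that bound slightly.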
2009-01-05 22:39 probably 2009-01-05 22:40 if blocksize is 12bit 2009-01-05 22:42 if we introduce metadata extents then we could use smaller blocksize, maybe 512 2009-01-05 22:42 which still has max vol size of 128TB 2009-01-05 22:43 and better space use 2009-01-05 22:43 i see 2009-01-05 22:43 so there will be work to do far into the future :) 2009-01-05 22:43 sorry 2009-01-05 22:43 128 PB 2009-01-05 22:46 i see 2009-01-05 23:05 commit.c now uses an inode instead of mapping 2009-01-05 23:06 maybe that is the only user space dependency 2009-01-05 23:06 let's see what happens when I compile it in kernel 2009-01-05 23:14 compiles, just needed to change buffer->data to bufdata() and add decls 2009-01-05 23:21 ok, logging is in kernel now, it just needs a little initialization 2009-01-05 23:30 now it's initialized, should be all ready to log in kernel now 2009-01-05 23:31 btw, why logbuf is not sb_bread()? 2009-01-05 23:32 you mean, not have the log buffers be physically mapped in the blockdev? 2009-01-05 23:32 sorry 2009-01-05 23:32 you mean, why not have the log buffers be physically mapped in the blockdev? 2009-01-05 23:33 yes 2009-01-05 23:33 my idea is, you don't have to allocate a physical block for the buffer until it is time to write it out 2009-01-05 23:33 you can't do that with a blockdev buffer 2009-01-05 23:33 yes 2009-01-05 23:33 ah 2009-01-05 23:34 it's also to develop some experience operating our own inode, so maybe we can stop using the blockdev 2009-01-05 23:35 ah, logmap may replace blockdev? 
2009-01-05 23:36 and bnode may also use it 2009-01-05 23:36 well another inode like logmap 2009-01-05 23:37 because we also could not have an unindexed block in our own volume map 2009-01-05 23:37 I thought logbuf can be on another inode 2009-01-05 23:38 ah 2009-01-05 23:38 the only problem with mixing logically addressed and physically addressed buffers is, they might collide 2009-01-05 23:39 I think inodes are cheap 2009-01-05 23:39 i see 2009-01-05 23:39 it doesn't hurt to have a couple of extra 2009-01-05 23:40 ideally, the exact same code would handle buffers in file page cache and volume buffer cache 2009-01-05 23:40 the vfs really should be this way now 2009-01-05 23:40 but if it works for tux3 then that is a powerful argument for doing it in other filesystems 2009-01-05 23:40 I think it could be a big cleanup, but we have to try it to know 2009-01-05 23:41 ok 2009-01-05 23:41 anyway, we don't take any risk by doing the logmap this way 2009-01-05 23:41 one issue is underly_metadata? 2009-01-05 23:42 yes, I've thought about that 2009-01-05 23:42 killing buffer cache aliases 2009-01-05 23:42 yes 2009-01-05 23:43 it's not a lot of extra cruft to reproduce if we just copy it, but maybe it can be done better 2009-01-05 23:44 or, we can just use alloc_pages() for logbuf? 2009-01-05 23:44 if we don't need to cache it 2009-01-05 23:46 we could 2009-01-05 23:47 but using bread in an inode is less code in kernel, just like it is less in user space 2009-01-05 23:47 and blockread will load it 2009-01-05 23:48 however, it needs to special ->readpage? 2009-01-05 23:48 however, it needs special ->readpage? 2009-01-05 23:49 no 2009-01-05 23:49 because it checks !btree->depth? 
2009-01-05 23:50 maybe :) 2009-01-05 23:51 I'm just trying this 2009-01-05 23:51 if it turns out to be awkward, we will change it 2009-01-05 23:51 I don't think it needs it's own readpage 2009-01-05 23:52 it will use blockread->grab_cache_page, not ->readpage 2009-01-05 23:52 blockread() try to read if it is not mapped 2009-01-05 23:52 and this will be a good test case for an improved blockread 2009-01-05 23:53 what do we use blockread for in kernel now? 2009-01-05 23:53 it calls read_mapping_page 2009-01-05 23:53 reading dir blocks? 2009-01-05 23:53 and xattr, when we do that? and bitmaps? 2009-01-05 23:53 yes 2009-01-05 23:54 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2009-01-05 23:54 e.g. reading directory data blocks 2009-01-05 23:54 so we can write a blockread2, that works a little differently 2009-01-05 23:54 first write blockget2, that uses grab_cache_page 2009-01-05 23:55 then blockread2 is written like bread: call blockget; if not update, read 2009-01-05 23:55 well, we can add special case for !btree->depth 2009-01-05 23:55 we could 2009-01-05 23:55 but the classic factoring is cleaner 2009-01-05 23:56 or just use i_size=0 2009-01-05 23:56 so in this case, read means hackfs :: syncio, I think. Remember that? 2009-01-05 23:56 syncio, yes 2009-01-05 23:57 so we have blockread2 -> blockget2 -> syncio -> reads using a bio 2009-01-05 23:57 it's a really tight call chain 2009-01-05 23:57 and I think it's optimal for this case 2009-01-05 23:58 for blockdev? 2009-01-05 23:58 blockdev replacement 2009-01-05 23:58 I'm thinking, try it on the log, get it working, then use it for blockdev replacement 2009-01-05 23:59 so there is one question I did not answer above... how does blockread2 know where to read the logmap data from? 
2009-01-05 23:59 it can't know 2009-01-06 00:00 me neither, I will try to get an answer when I sleep tonight 2009-01-06 00:00 :) 2009-01-06 00:00 so Ithought it's i_size=0 or !btree->depth 2009-01-06 00:00 I thought 2009-01-06 00:00 ah 2009-01-06 00:00 I am not sure what you have in mind, but I have some confidence you can make it work :) 2009-01-06 00:00 it needs to read for replay 2009-01-06 00:01 for replay, we need flavor of syncio where we just tell it where to read each block from 2009-01-06 00:01 ok 2009-01-06 00:02 if so, readpage() with i_size=0 will just produce zeroed page 2009-01-06 00:03 after we have read the log we don't care about it any more, we will never write to it, the only thing we will do to it is free the physical blocks on a later rollup 2009-01-06 00:03 yes 2009-01-06 00:04 logmap inode just provide radix tree and as page allocater 2009-01-06 00:05 blockdev replacement is not 2009-01-06 00:05 or can be like logmap inode 2009-01-06 00:06 right 2009-01-06 00:06 it's really just an excuse to try out these ideas ;) 2009-01-06 00:06 the way I always thought it should be 2009-01-06 00:06 the way I actually thought it was :) 2009-01-06 00:07 ok :) 2009-01-06 00:07 if ((err = syncio(READ, sb->s_bdev, 0, 1, 2009-01-06 00:07 &(struct bio_vec){ 2009-01-06 00:07 .bv_page = virt_to_page(buf), 2009-01-06 00:07 .bv_offset = offset_in_page(buf), 2009-01-06 00:07 .bv_len = blocksize }))) 2009-01-06 00:07 goto out; 2009-01-06 00:08 so that may look a little ugly with all the }))) 2009-01-06 00:08 blockdev replacement will be like file inode (->readpage), or with syncio? 
2009-01-06 00:08 but it's clear what it does, it's really easy to see how to read a block with it 2009-01-06 00:09 blockdev replacement will only use syncio for bread 2009-01-06 00:09 sorry 2009-01-06 00:09 blockread 2009-01-06 00:09 blockread does the job of sb_bread 2009-01-06 00:09 ok, well, syncio() needs lock like lock_buffer() 2009-01-06 00:09 for blockdev replacement 2009-01-06 00:09 yes 2009-01-06 00:09 logbuf may be not needed 2009-01-06 00:10 logmap you mean? 2009-01-06 00:10 yes 2009-01-06 00:11 because it doesn't read 2009-01-06 00:11 it does, on replay 2009-01-06 00:11 ah, yes 2009-01-06 00:11 however, it doesn't need to serialize 2009-01-06 00:11 right, then it doesn't 2009-01-06 00:11 which is nice 2009-01-06 00:11 we can play 2009-01-06 00:12 there isn't much serialization in writing to it either 2009-01-06 00:12 just need to take a lock in log_next 2009-01-06 00:12 and spinlock in the fast path 2009-01-06 00:13 yes 2009-01-06 00:13 which brings up a question 2009-01-06 00:13 what about parallel logging? 2009-01-06 00:13 there can be parallel redirects happening 2009-01-06 00:13 I don't think it matters which order the log entries are logged 2009-01-06 00:14 yes 2009-01-06 00:14 I think order is needed for commit block 2009-01-06 00:14 only 2009-01-06 00:14 ok, that's another thing to think about tonight... spinlocking for the log 2009-01-06 00:14 right 2009-01-06 00:15 ok 2009-01-06 00:15 well, if we make a promise about a block and the block is later written out, the promise _must_ appear in the log before the block write entry goes into the log 2009-01-06 00:15 I call the block write log entry STORE... 
it invalidates promises for that block 2009-01-06 00:16 so that must appear after any UPDATE, that is probably easy but it has to be thought about 2009-01-06 00:16 sorry 2009-01-06 00:16 after any UPDATE for that block 2009-01-06 00:16 yes, if we process logs straightly 2009-01-06 00:17 don't process twice or more 2009-01-06 00:18 ACTION thinks about it 2009-01-06 00:18 ah, this protection if provided by the btree lock 2009-01-06 00:18 s/if/is/ 2009-01-06 00:19 the btree lock has to be held to make the promise, it also has to be held for any operation that can do a STORE 2009-01-06 00:19 so these are serialized with respect to one btree, and can be interleaved with activity from other btrees, and that is ok 2009-01-06 00:20 this is pretty cool 2009-01-06 00:20 when we have fancier locking for the btree we have to think about this again, for now it just works 2009-01-06 00:21 ok 2009-01-06 00:21 probably, the btree lock will end up being used _only_ for logging, and we will use more parallel locking to operate on the btree itself 2009-01-06 00:23 I really like the idea of the log entries being randomly interleaved, but still ordered with respect to the btrees they operate on 2009-01-06 00:23 for bitmaps now... 2009-01-06 00:24 add log serial number? 2009-01-06 00:24 we could 2009-01-06 00:24 um... 2009-01-06 00:25 STORE means write all block data? 2009-01-06 00:25 I'm just thinking about whether it is possible for the same block to be allocated/freed/allocated within a delta 2009-01-06 00:25 yes 2009-01-06 00:26 STORE after UPDATE is possible if we don't have btree->lock? 2009-01-06 00:26 STORE after UPDATE is ok, it's UPDATE after STORE that is not ok 2009-01-06 00:26 ah 2009-01-06 00:26 yes, UPDATE after STORE 2009-01-06 00:27 yes, then it is possible to get in the wrong order without the btree lock 2009-01-06 00:27 when do we do STORE? 
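The ordering rule discussed above (a STORE for a block must come after every UPDATE promise made against that block) can be illustrated with a toy replay loop. This is one reading of the discussion, not the tux3 replay code: the struct logent layout, the opcode names as an enum, and the replay function are all hypothetical, and "applying" a promise is reduced to keeping it in a pending list.

```c
#include <assert.h>
#include <stdint.h>

enum logop { LOG_UPDATE, LOG_STORE };

/* Hypothetical in-memory form of a log entry. */
struct logent {
	enum logop op;
	uint64_t block;   /* UPDATE: parent bnode; STORE: block that was written */
	uint64_t child;   /* UPDATE only: new child pointer promised to parent */
};

/* Walk the log in order. An UPDATE adds a pending promise against its parent.
 * A STORE drops every earlier promise against the stored block, because the
 * block as written already contains those changes. Whatever is left in
 * pending[] still has to be applied. Returns the number of pending promises. */
static int replay(const struct logent *log, int count,
		  struct logent *pending, int max)
{
	int n = 0;

	for (int i = 0; i < count; i++) {
		if (log[i].op == LOG_UPDATE) {
			assert(n < max);
			pending[n++] = log[i];
		} else {
			int keep = 0;
			for (int j = 0; j < n; j++)
				if (pending[j].block != log[i].block)
					pending[keep++] = pending[j];
			n = keep;
		}
	}
	return n;
}
```

With the entries logged as UPDATE(100), UPDATE(200), STORE(100), only the promise against block 200 stays pending. If the same events were logged as STORE(100), UPDATE(100), UPDATE(200), the promise against block 100 would survive the replay and be applied a second time to a block that already contains it, which is the "UPDATE after STORE" case the btree lock is said to rule out.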
2009-01-06 00:27 we could solve that with a serial number, as you said, but that would make the entries a lot bigger 2009-01-06 00:28 we do store for a bitmap block in the bitmap flush 2009-01-06 00:28 so that is nicely serialized 2009-01-06 00:28 no problem 2009-01-06 00:29 bitmap flush for now will happen after waiting for the delta to commit 2009-01-06 00:30 um.. 2009-01-06 00:30 now I am thinking about the PROMISE, PROMISE, PROMISE... STORE sequence 2009-01-06 00:30 for when we are updating a bnode, then the bnode splits 2009-01-06 00:31 (actually, I used the symbol DEADMAP for a bitmap store, and STORE is supposed to be for btree nodes) 2009-01-06 00:33 um.. I can't imagine we are logging wrong order 2009-01-06 00:33 so, we don't need to sequecial number without for sanity check 2009-01-06 00:34 um.. 2009-01-06 00:34 the btree lock prevents getting out of order 2009-01-06 00:34 sequence would be fine for sanity check 2009-01-06 00:34 we can have a debug field 2009-01-06 00:34 if we don't have btree lock, it is possible? 2009-01-06 00:34 possible, and it will happen instantly I think 2009-01-06 00:36 PROMISE, STORE, PROMISE? 2009-01-06 00:38 ah, if we change btree before split was completed? 2009-01-06 00:38 ah 2009-01-06 00:39 split result was recorded when flush 2009-01-06 00:39 um.. 2009-01-06 00:41 it does have to be thought about 2009-01-06 00:41 I think it works 2009-01-06 00:41 I'm pretty sure it works 2009-01-06 00:41 the btree lock is a powerful serializer, per btree 2009-01-06 00:42 serializes against mixed truncates and writes too 2009-01-06 00:42 yes 2009-01-06 00:43 split and merge would be serialized? 
2009-01-06 00:43 yes, also 2009-01-06 00:44 even if we use more fine granularity 2009-01-06 00:45 if we use find granularity it gets interesting 2009-01-06 00:45 I assumed it would be serialized 2009-01-06 00:46 most parent to change will be locked 2009-01-06 00:46 if so, log order may be right order 2009-01-06 00:47 same reason of btree->lock 2009-01-06 00:47 I think so 2009-01-06 00:47 that's cool 2009-01-06 00:47 serialized by the parent lock 2009-01-06 00:47 starts to hurt the lead a little 2009-01-06 00:47 hurt the head 2009-01-06 00:48 yes 2009-01-06 00:49 well, right lock have to produce right log records 2009-01-06 00:49 should produce 2009-01-06 00:49 it is enough for now to me, thanks 2009-01-06 01:00 ok, so the logger needs a spinlock or something against parallel access from different btrees, I will look at that now 2009-01-06 01:01 log_need will return with the lock held 2009-01-06 01:03 yes 2009-01-06 01:03 ah 2009-01-06 01:04 only log_need() needs lock, it's good 2009-01-06 01:12 oh, I didn't know queue already have non-rotational flag 2009-01-06 01:14 non-rotational flag? 2009-01-06 01:14 SSD like device 2009-01-06 01:14 device like SSD 2009-01-06 01:14 :) 2009-01-06 01:15 which queue? 2009-01-06 01:15 mmc/nbd/ide/ata 2009-01-06 01:15 for now 2009-01-06 01:16 the log_need locking with a spinlock is messy compared to a mutex 2009-01-06 01:16 has to be a loop 2009-01-06 01:17 why? 2009-01-06 01:17 I thought unlock -> blockget -> lock 2009-01-06 01:17 somebody might some and fill the whole block while you wait to get the spinlock 2009-01-06 01:17 ah 2009-01-06 01:18 blockget should be completely smp safe, no? 2009-01-06 01:18 we haven't written it yet... 
2009-01-06 01:18 well 2009-01-06 01:18 let me check 2009-01-06 01:18 I think page allocation can be sleep 2009-01-06 01:19 doesn't look smp safe 2009-01-06 01:19 anyway, it's time to write blockget2, using grab_cache_page 2009-01-06 01:19 smp safe, but it may sleep 2009-01-06 01:20 write_begin doesn't return with a lock, does it? 2009-01-06 01:20 if it doesn't, them if (!page_has_buffers(page)) is racy 2009-01-06 01:21 ah 2009-01-06 01:21 page lock is held 2009-01-06 01:21 ok 2009-01-06 01:21 ah, yes 2009-01-06 01:34 too many weird puzzles to solve using a spinlock for the logging, I'm adding a mutex and getting on with life ;) 2009-01-06 01:34 we can return to this puzzle later 2009-01-06 01:34 _in theory_ it can be a spinlock 2009-01-06 01:36 ok 2009-01-06 01:40 for blockdev replacement, we can find_get_page() and ->private_lock, and find_or_create_page() and ->private_lock 2009-01-06 01:41 and if it's read, caller have to handle io 2009-01-06 01:41 i.e. blockread() 2009-01-06 01:42 exactly 2009-01-06 01:43 grab_cache_page == find_or_create_page 2009-01-06 01:43 just supplies default gfp flags 2009-01-06 01:43 ah, yes 2009-01-06 01:43 happens to be the right ones for us, but cleaner code would be find_or_create 2009-01-06 01:44 grab_ is just an inline 2009-01-06 01:44 grab_cache_page() would be fine 2009-01-06 01:44 somebody seems to have gone in and done a lot of cleaning in filemap.c 2009-01-06 01:44 used to drive me crazy 2009-01-06 01:45 it was so disorganized, with apis really random 2009-01-06 01:46 and logmap is... 2009-01-06 01:47 my code is disorganized too, but it's mine you see ;) 2009-01-06 01:47 it can be use i_size=0 to tell it is logmap 2009-01-06 01:47 ? 2009-01-06 01:47 blockread2() can use for logmap too 2009-01-06 01:47 yes 2009-01-06 01:48 I don't think we should just ignore i_size 2009-01-06 01:48 if caller wants a transfer above i_size, that should be ok 2009-01-06 01:49 transfer is for read? 2009-01-06 01:49 um.. 
2009-01-06 01:49 write 2009-01-06 01:49 then we have to worry about it 2009-01-06 01:49 both read and write 2009-01-06 01:49 yes 2009-01-06 01:49 but I don't think that is the function that should worry 2009-01-06 01:49 blockread will take index 2009-01-06 01:50 yes, block index 2009-01-06 01:50 for blockdev, it is physical address 2009-01-06 01:50 for logmap, it is logical index (just memory pos) 2009-01-06 01:50 ok, well for the log... we don't know for sure how it will work 2009-01-06 01:50 yes 2009-01-06 01:51 blockget will work fine for the logdev 2009-01-06 01:51 so, if index is bigger than i_size, we can handle it like hole 2009-01-06 01:51 ah, yes 2009-01-06 01:52 now, blockread on a logically mapped mapping... it should call our get_block 2009-01-06 01:52 ok, so index bigger than i_size is invalid 2009-01-06 01:52 that's what I did in the original htree 2009-01-06 01:52 worked fine 2009-01-06 01:53 why treat i_size specially at all? 2009-01-06 01:53 let index be anything the caller wants 2009-01-06 01:53 all blockget is supposed to do is instantiate a buffer 2009-01-06 01:54 for blockdev replacement, outside of i_size is invalid address for device 2009-01-06 01:54 ah yes, but the submit_bio will report that 2009-01-06 01:55 yes 2009-01-06 01:55 just to get bug more early 2009-01-06 01:55 yes 2009-01-06 01:55 good point 2009-01-06 01:56 i_size can be a fraction of a block 2009-01-06 01:56 so it is a little messy 2009-01-06 01:59 um.. 
2009-01-06 01:59 blocksize = sb_min_blocksize(sb, BLOCK_SIZE); 2009-01-06 01:59 if (!blocksize) { 2009-01-06 01:59 if (!silent) 2009-01-06 01:59 printk(KERN_ERR "TUX3: unable to set blocksize\n"); <- we can lose this after getting rid of blockdev :) 2009-01-06 02:00 oh, yes :) 2009-01-06 02:00 and we can set i_size volblocks << blockbits 2009-01-06 02:00 not whole partition 2009-01-06 02:01 the idea result would be, less code without using the blockdev 2009-01-06 02:01 ideal result 2009-01-06 02:02 it will probably be a few hundred lines more though, maybe 200 2009-01-06 02:02 but really worth it 2009-01-06 02:02 maybe, yes. code would be almost for io 2009-01-06 02:03 page cache handling would be simple 2009-01-06 02:04 and blockget2/blockread2() will replace sb_* 2009-01-06 02:05 logmap is.. 2009-01-06 02:05 and it gets easier to think about implementing ->writepages with an extent-oriented interface 2009-01-06 02:06 that's probably the biggest win 2009-01-06 02:07 write side is... 2009-01-06 02:09 write side is handled by own ->writepage()? 2009-01-06 02:09 yes 2009-01-06 02:10 write side for logmap? 
2009-01-06 02:10 currently I'm thinking about others of logmap 2009-01-06 02:11 ah, I think write is just vecio 2009-01-06 02:11 and if it happens to be similar to something else, we can share an io function 2009-01-06 02:11 probably, yes 2009-01-06 02:11 write io is only vecio 2009-01-06 02:12 static int vecio(int rw, struct block_device *dev, sector_t sector, 2009-01-06 02:12 bio_end_io_t endio, void *data, unsigned vecs, struct bio_vec *vec) 2009-01-06 02:12 we want to supply our custom endio 2009-01-06 02:12 yes 2009-01-06 02:12 that can count outstanding blocks and wake us up when done 2009-01-06 02:13 logmap may still be special 2009-01-06 02:13 normal map->writepage() may trigger delta 2009-01-06 02:13 ah, logmap may be same 2009-01-06 02:14 logmap doesn't need a writepage though 2009-01-06 02:14 maybe 2009-01-06 02:15 however, maybe vm may call 2009-01-06 02:15 ah, no 2009-01-06 02:15 it better not write out our log block :) 2009-01-06 02:15 yes 2009-01-06 02:15 I would like to take all our dirty pages off the vm's lru, but that is another matter 2009-01-06 02:15 btw, I'm thinking about apos for logmap 2009-01-06 02:16 after merge I will discuss that with akpm 2009-01-06 02:16 apos? 2009-01-06 02:16 yes 2009-01-06 02:17 what is that? 
2009-01-06 02:17 to get vm callback 2009-01-06 02:17 oh, aops 2009-01-06 02:17 well, it dones't need though 2009-01-06 02:17 yes 2009-01-06 02:17 right, it might not need any 2009-01-06 02:18 vm is never allowed to flush 2009-01-06 02:18 yes 2009-01-06 02:18 if we want vm to shrink, we just put the page 2009-01-06 02:18 we have to take care of removing buffer heads 2009-01-06 02:19 I think we should always free buffer heads on last release on the page, but that is an open question 2009-01-06 02:19 for blockdev, it would be complex a little 2009-01-06 02:20 right, I didn't mean blockdev 2009-01-06 02:20 for normal map, it can be 2009-01-06 02:20 directory cache and our private blockdev 2009-01-06 02:20 ah blockdev meant our private blockdev 2009-01-06 02:20 yes 2009-01-06 02:21 what shall we call it? volume cache? 2009-01-06 02:21 sounds good 2009-01-06 02:21 volume 2009-01-06 02:21 sure 2009-01-06 02:21 we have volblocks 2009-01-06 02:21 yes 2009-01-06 02:22 so, for volume, it would be complex a little 2009-01-06 02:22 directory cache can be it 2009-01-06 02:23 I think 2009-01-06 02:23 if directory reads full page 2009-01-06 02:23 like current code 2009-01-06 02:25 maybe, logmap would work with blockread/blockget, not blockread2/blockget2 2009-01-06 02:25 phtree will not work well with full page transfers, and blocksize smaller than page size 2009-01-06 02:25 logmap only needs blockget, not blockread 2009-01-06 02:25 blockget is very simple 2009-01-06 02:26 yes 2009-01-06 02:26 however, not current blockget() 2009-01-06 02:26 right, but we know how to fix that 2009-01-06 02:26 ah 2009-01-06 02:26 no, it would work with current blockget() 2009-01-06 02:27 ah good 2009-01-06 02:30 ah, it wouldn't work 2009-01-06 02:31 it clears outside of interesting block 2009-01-06 02:31 if it's not mapped 2009-01-06 02:31 right 2009-01-06 02:32 it tries to be too helpful 2009-01-06 02:33 anyway, using write_begin for blockget is weird :) 2009-01-06 02:33 creative 2009-01-06 02:33 
btw, lognext is incremented always 2009-01-06 02:33 yes, is that bad? 2009-01-06 02:34 ah 2009-01-06 02:34 it's intent to overflow 2009-01-06 02:36 we can wrap below zero, and it should work too 2009-01-06 02:36 it will make a maximally deep radix tree 2009-01-06 02:36 i see 2009-01-06 02:36 useless, but interesting 2009-01-06 02:37 maze suggested using that for something 2009-01-06 02:37 storing xattrs at negative offects in a file page cache 2009-01-06 02:37 ugly or what? :) 2009-01-06 02:38 if we have limit, it would be more efficient 2009-01-06 02:38 because we don't need to allocate radix tree slot 2009-01-06 02:38 exactly 2009-01-06 02:39 ok 2009-01-06 02:39 blockget() and map_bh() will work 2009-01-06 02:40 and write side will free page 2009-01-06 02:41 or, just use volume blockget2() with mapping 2009-01-06 02:42 logmap? 2009-01-06 02:42 yes 2009-01-06 02:43 logmap seems a bit special 2009-01-06 02:43 it can say like normal map, but it can say like volume map 2009-01-06 02:43 but it also can 2009-01-06 02:44 buffer is like volume map, however it has delalloc like normal map 2009-01-06 02:45 it seems natural to me 2009-01-06 02:46 it is natural for logmap? 2009-01-06 02:46 the only thing is, we should be able to define a ->get_region aop 2009-01-06 02:47 ah 2009-01-06 02:47 anyway, it does not have to be perfect 2009-01-06 02:47 I meant ->map_region 2009-01-06 02:47 maybe a few years from now 2009-01-06 02:47 um.. 2009-01-06 02:47 it has same issue with current one 2009-01-06 02:48 logmap has to have own ->map_region? 
2009-01-06 02:48 I was just speculating 2009-01-06 02:48 I just want to use normal or volume ->map_region 2009-01-06 02:49 however, ->map_region sounds like good idea 2009-01-06 02:49 it will be fine with blockget to create a log block, syncio to read it and vecio to write it 2009-01-06 02:50 yes 2009-01-06 02:54 http://hg.tux3.org/tux3/rev/f34c99f0fdd6 2009-01-06 02:55 so I have written begin_change on one day, and log_begin on the next 2009-01-06 02:55 I need to be more consistent 2009-01-06 02:56 maybe change_begin and change_end is better 2009-01-06 02:57 yes, I was thinking about term list 2009-01-06 02:58 it's not really consistent 2009-01-06 02:58 it's not horrible though 2009-01-06 02:59 well, and I'd like to use term in list for code too 2009-01-06 02:59 e.g. delta may be able to be used as structure name 2009-01-06 03:00 yes, struct delta { we need that pretty soon 2009-01-06 03:01 yes, and beginners can see term list to help to understand it 2009-01-06 03:02 ah, I see what you mean 2009-01-06 03:02 a glossary 2009-01-06 03:02 yes 2009-01-06 03:02 ah, glossary 2009-01-06 03:02 we can ask for a volunteer 2009-01-06 03:03 yes, it would help beginners and consistency of code 2009-01-06 03:06 probably, I'll also play with blockget/blockread 2009-01-06 03:11 it's oyasumi time for me 2009-01-06 03:11 yes, oyasumi 2009-01-06 04:53 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-01-06 05:08 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2009-01-06 05:40 interesting feature was introduced to ptrace() 2009-01-06 05:41 it can use BTS feature of cpu 2009-01-06 05:42 Branch Trace Store 2009-01-06 05:43 it can trace log of branch (from ip to ip), it would help debug well 2009-01-06 05:45 test program is http://userweb.kernel.org/~hirofumi/bts/ 2009-01-06 05:45 I guess real tracer is in systemtap or something 2009-01-06 06:29 hirofumi: which version of fsx do you use? 
2009-01-06 06:30 iirc, basically, ltp version 2009-01-06 06:32 ltp + aio/dio patch 2009-01-06 06:32 ok, thanks. i was looking into porting the zeroed-page-check from the dfly variant 2009-01-06 06:35 it doesn't have zeroed page check? 2009-01-06 06:49 http://userweb.kernel.org/~hirofumi/fsx-linux/ 2009-01-06 06:49 I've put my fsx-linux 2009-01-06 06:49 btw, it has original codes 2009-01-06 07:15 -!- dcg(~dcg@220.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-01-06 07:18 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-06 09:05 -!- dcg(~dcg@21.pool80-103-2.dynamic.orange.es) has joined #tux3 2009-01-06 09:56 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-06 10:12 -!- macan_(~chatzilla@xbl.dnsbl.oftc.net) has joined #tux3 2009-01-06 10:12 -!- kushal(~kushal@115.109.13.235) has joined #tux3 2009-01-06 10:15 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-06 10:22 -!- stephan(~stephan@w0547.dip.tu-dresden.de) has joined #tux3 2009-01-06 12:34 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-06 13:26 -!- dcg(~dcg@133.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-01-06 14:22 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-06 15:03 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-06 17:02 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2009-01-06 21:15 -!- tim_dimm_(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-06 21:16 -!- tim_dimm__(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-06 21:34 -!- kushal(~kushal@115.109.13.235) has joined #tux3 2009-01-06 21:36 hi flips 2009-01-06 21:58 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-06 21:58 hi kushal 2009-01-06 22:00 -!- tim_dimm_(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-06 22:01 -!- 
tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-06 22:02 hm no tux3 tonight 2009-01-06 22:02 tux3u 2009-01-06 22:03 apparently not 2009-01-06 22:03 we could try for thursday 2009-01-06 22:03 if the usual suspects are back online 2009-01-06 22:17 -!- kushal(~kushal@115.109.10.112) has joined #tux3 2009-01-06 22:19 hi flips...sorry lost connection... 2009-01-06 22:19 no problem 2009-01-06 22:20 was just working with tux3...and had a small doubt... 2009-01-06 22:21 in userspace...the rm command doesn't seem to deallocate any blocks... 2009-01-06 22:21 it doesn't show any changes in the dleaf in the tux3graph 2009-01-06 22:21 after an rm... 2009-01-06 22:22 that's pretty good way of finding out :) 2009-01-06 22:22 let me see what tux3 does there 2009-01-06 22:22 ok... 2009-01-06 22:28 kusahl, tux3 delete does a tree_chop, that is supposed to free blocks 2009-01-06 22:28 should call dleaf_chop 2009-01-06 22:29 which calls ops->bfree 2009-01-06 22:29 so try this: gdb -args tux3 delete 2009-01-06 22:30 and set a break: b dleaf_chop 2009-01-06 22:30 and run, then step through dleaf_chop and see what it does 2009-01-06 22:31 ok...will do and then get back to you... 2009-01-06 22:31 anyway, I already see the problem 2009-01-06 22:31 but it's worth doing that anyway 2009-01-06 22:32 should I tell, or do you want to find it yourself? 2009-01-06 22:32 i'll try to find it...will ask if i dont get it... 2009-01-06 22:32 ok fine, and when you find it, send a patch 2009-01-06 22:33 ok.. 2009-01-06 22:34 if you need a hint, just ask 2009-01-06 22:34 sure...\ 2009-01-06 23:04 -!- RazvanM(~RazvanM@96.234.232.218) has joined #tux3 2009-01-06 23:12 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2009-01-06 23:12 -!- cdk(~chinmay@115.109.15.69) has joined #tux3 2009-01-06 23:28 is the problem in clear_bits? 2009-01-06 23:29 the "loff-roff-1" should be "loff-roff" 2009-01-06 23:33 flips, u there? 
2009-01-06 23:34 hi 2009-01-06 23:34 no, it's just that the file isn't flushed after the delete 2009-01-06 23:34 but maybe you found a bug in clear_bits? 2009-01-06 23:35 after making the change in clear_bits...the tux3graph is actually showing the blocks removed... 2009-01-06 23:36 ah 2009-01-06 23:36 i think the flush is also missing... 2009-01-06 23:37 we should add self-check code to balloc 2009-01-06 23:37 anyway, post the patch 2009-01-06 23:37 and we must write a test case 2009-01-06 23:39 the patch with only the clear_bits change? 2009-01-06 23:40 sure, and one for the missing sync separately? 2009-01-06 23:43 so a tuxsync(sb->bitmap) in dleaf_chop? 2009-01-06 23:55 just in tux3.c 2009-01-06 23:56 like the other syncs in there 2009-01-06 23:56 dleaf_chop is shared with kernel, we don't want to be doing a sync on every file delete, and tuxsync is only userspace anyway 2009-01-06 23:57 oh yes... 2009-01-07 00:29 hey flips... 2009-01-07 00:30 i seem to have made some mistake... 2009-01-07 00:31 even after adding a tuxsync(inode) for delete in tux3.c 2009-01-07 00:32 the blocks of the deleted file are released only after another file is copied/written 2009-01-07 00:32 even after that...the inode remains allocated... 2009-01-07 00:33 i'm telling you all these results after looking at the tux3graph... 2009-01-07 00:36 ah 2009-01-07 00:38 also needs a sync_super, to write out the inode table 2009-01-07 00:38 ok..yes 2009-01-07 00:41 plus the change i pointed out in clear_bits is not required :) 2009-01-07 00:47 -!- macan(~macan@xbl.dnsbl.oftc.net) has joined #tux3 2009-01-07 00:49 tried adding a sync_super also...still no change... 2009-01-07 00:52 i'm off for lunch..will be back in a while... 
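The off-by-one question raised above about clear_bits (whether a span should be "loff-roff-1" or "loff-roff"), together with the remark that balloc should get self-check code and a test case, can be illustrated with a small sketch. This is not tux3's clear_bits: the name clear_bit_range, the LSB-first bit order and the selfcheck_clear_bit_range helper are all assumptions made for illustration. It clears a bit range using a head mask, a run of whole bytes and a tail mask, then compares the result against a bit-at-a-time reference for every range in a small bitmap.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not tux3's clear_bits): clear `count` bits starting
 * at bit `start`. Bit i lives in bitmap[i >> 3] at mask 1 << (i & 7).
 */
static void clear_bit_range(uint8_t *bitmap, unsigned start, unsigned count)
{
	if (!count)
		return;
	unsigned end = start + count;            /* one past the last bit cleared */
	unsigned loff = start >> 3;              /* byte holding the first bit */
	unsigned roff = end >> 3;                /* byte holding bit `end` */
	uint8_t lmask = (uint8_t)(0xff << (start & 7));  /* bits >= start in byte loff */
	uint8_t rmask = (uint8_t)~(0xff << (end & 7));   /* bits < end in byte roff */

	if (loff == roff) {                      /* range sits inside one byte */
		bitmap[loff] &= (uint8_t)~(lmask & rmask);
		return;
	}
	bitmap[loff] &= (uint8_t)~lmask;         /* partial first byte */
	memset(bitmap + loff + 1, 0, roff - loff - 1);   /* whole bytes in between */
	if (rmask)                               /* skip when end is byte aligned */
		bitmap[roff] &= (uint8_t)~rmask;
}

/*
 * Self-check: compare against a one-bit-at-a-time reference for every
 * (start, count) in a 32 bit map. The fifth byte is a guard that must
 * never be modified. Returns the number of mismatching ranges.
 */
static int selfcheck_clear_bit_range(void)
{
	int errors = 0;
	for (unsigned start = 0; start <= 32; start++) {
		for (unsigned count = 0; start + count <= 32; count++) {
			uint8_t fast[5], slow[5];
			memset(fast, 0xff, sizeof fast);
			memset(slow, 0xff, sizeof slow);
			clear_bit_range(fast, start, count);
			for (unsigned bit = start; bit < start + count; bit++)
				slow[bit >> 3] &= (uint8_t)~(1 << (bit & 7));
			if (memcmp(fast, slow, sizeof fast))
				errors++;
		}
	}
	return errors;
}
```

An exhaustive comparison like this is cheap for small bitmaps and would settle an off-by-one question directly, independent of whether a later sync makes the result visible in tux3graph.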
2009-01-07 02:08 hey flips 2009-01-07 02:08 hi bh 2009-01-07 02:08 sped through your area of the woods at about 80mph coming back down to San Diego from San Francisco 2009-01-07 02:08 I thought I felt a gust 2009-01-07 02:16 yeah, it's hard to sense that with my german made aerodynamics when driving by at autobahn speeds down the 405 2009-01-07 03:27 -!- kushal(~kushal@115.109.12.63) has joined #tux3 2009-01-07 03:36 -!- macan_(~macan@xbl.dnsbl.oftc.net) has joined #tux3 2009-01-07 03:38 -!- macan_(~macan@xbl.dnsbl.oftc.net) has left #tux3 2009-01-07 03:43 -!- macan(~macan@xbl.dnsbl.oftc.net) has joined #tux3 2009-01-07 04:21 -!- cdk(~chinmay@115.109.12.63) has joined #tux3 2009-01-07 04:56 -!- fqh(~fqh@219.131.240.95) has joined #tux3 2009-01-07 05:05 -!- macan(~macan@xbl.dnsbl.oftc.net) has joined #tux3 2009-01-07 05:07 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2009-01-07 07:52 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-07 07:53 -!- tim_dimm_(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-07 08:34 -!- stargazr(~stargazr@59.95.8.199) has joined #tux3 2009-01-07 08:57 -!- stargazr(~stargazr@xbl.dnsbl.oftc.net) has joined #tux3 2009-01-07 09:50 -!- stargazr(~stargazr@59.95.30.21) has joined #tux3 2009-01-07 09:59 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-07 10:43 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-07 11:26 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-07 12:51 finishing touches on block forking design note... 2009-01-07 12:53 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-01-07 12:54 hirofumi, up late? 
2009-01-07 12:54 hi 2009-01-07 12:55 thinking about atomic commit, snapshot and logging on a cluster is fun 2009-01-07 12:55 don't want to get distracted of course 2009-01-07 12:56 but I think we can do a cluster tux3 not too long after versioning 2009-01-07 12:56 the logging techniques will be helpful 2009-01-07 12:56 and the delta pipeline is useful on a cluster 2009-01-07 12:56 solves some locking problems that other cluster fs's had 2009-01-07 12:57 i see 2009-01-07 12:57 I don't know about cluster issues 2009-01-07 12:58 I only know about distributed lock a very little 2009-01-07 12:58 the way we think about caching applies to clusters 2009-01-07 12:59 when we do fine-grained btree locks, the shared locks higher up in the tree can be shared across nodes 2009-01-07 12:59 and copies of the btree nodes under shared locks can be shared with them 2009-01-07 12:59 ocfs2 had a problem sharing the inode table, which they solved by making each inode 4K 2009-01-07 13:00 so that each entry in the inode table could be owned by one node 2009-01-07 13:01 but we could use logging to solve that, only the inode lock is owned by a node, when it wants to update the inode it logs the update 2009-01-07 13:02 and later, the rollup process takes logged changes from the different nodes and updates the inode btree 2009-01-07 13:02 anway, it's not for now 2009-01-07 13:02 just mentioning that some of the things we are doing make clustering easier 2009-01-07 13:02 i see 2009-01-07 13:03 fine granularity delta replication? 
2009-01-07 13:03 replication is the model hammer uses for clustering 2009-01-07 13:04 I like the shared cache model more 2009-01-07 13:04 it's more like smp 2009-01-07 13:04 hard to implement, like smp :) 2009-01-07 13:04 i see 2009-01-07 13:04 each cluster node would make its own log chain 2009-01-07 13:05 for updates to objects that it holds exclusive lock on 2009-01-07 13:05 a cluster snapshot means, all nodes have to agree that some particular delta belongs to the snapshot 2009-01-07 13:06 so a cluster snapshot is a set of delta numbers, one per node 2009-01-07 13:06 just an idea 2009-01-07 13:06 FWIW, after I heared the above, if we can save tux3 delta and apply, it's interesting 2009-01-07 13:06 oh yes, very interesting, the delta is a kind of snapshot by itself 2009-01-07 13:07 yes 2009-01-07 13:07 and the delta could be used for exact volume replication 2009-01-07 13:07 yes, maybe 2009-01-07 13:07 I am not sure what extra functionality that would add, but it is interesting 2009-01-07 13:08 it could give us an early snapshot mechanism, before versioning is ready 2009-01-07 13:08 i see 2009-01-07 13:09 or... it can give us different kind of snapshot... infinite per-delta snapshots, like hammer 2009-01-07 13:09 or it can just be fun to think about 2009-01-07 13:10 logging the deltas is what you were suggesting I think 2009-01-07 13:10 ah 2009-01-07 13:10 I just thought to export deltas to userland 2009-01-07 13:11 that can be useful for an external indexing program 2009-01-07 13:11 or security monitoring 2009-01-07 13:11 I need to respond to jamie lockier's comments 2009-01-07 13:12 selinux? 2009-01-07 13:14 yes 2009-01-07 13:14 anyway, I need to concentrate on atomic commit, when that is working we can start review 2009-01-07 13:14 it's close 2009-01-07 13:15 we need prototypes of block redirect, block fork, basic logging, the operation begin/end... 
2009-01-07 13:16 per-delta rollup, that is just bitmap flush and btree node flush 2009-01-07 13:16 ok 2009-01-07 13:16 superblock update 2009-01-07 13:17 the only one we haven't talked about at all is flushing dirty btree nodes 2009-01-07 13:18 right now I'm doing pseudocode for block forking 2009-01-07 13:18 after that, userspace implementation of delta transition 2009-01-07 13:18 yes 2009-01-07 13:19 in userspace would be good 2009-01-07 13:19 and I think kernel breakage is not problem for now 2009-01-07 13:34 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-01-07 14:36 -!- data(~data@echo489.server4you.de) has joined #tux3 2009-01-07 15:30 -!- pworrall(~pworrall@resnet-nat060.lancs.ac.uk) has joined #tux3 2009-01-07 17:26 -!- pworrall(~pworrall@resnet-nat060.lancs.ac.uk) has joined #tux3 2009-01-07 17:30 can someone clear something you may find trivial up, the btrees which hold the file system meta data, can they grow across the entire disk? 2009-01-07 17:36 yes 2009-01-07 18:27 -!- macana(~macana@211.136.73.102) has joined #tux3 2009-01-07 18:38 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-07 18:51 -!- pworrall(~pworrall@resnet-nat060.lancs.ac.uk) has joined #tux3 2009-01-07 19:34 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2009-01-07 19:35 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3 2009-01-07 20:14 -!- pcacjr(~root@189.81.137.54) has joined #tux3 2009-01-07 20:15 -!- pcacjr(~root@189.81.137.54) has left #tux3 2009-01-07 21:15 -!- kushal(~kushal@115.109.12.63) has joined #tux3 2009-01-07 21:46 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-07 21:59 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-07 22:34 -!- RazvanM(~RazvanM@96.234.240.153) has joined #tux3 2009-01-07 23:17 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2009-01-07 23:58 the latest tux3 makes 
VFS complain, and there should be a "iput(sbi->logmap);" in tux3_put_super 2009-01-08 00:24 complain about? 2009-01-08 00:25 when umount 2009-01-08 00:25 VFS: Busy inodes after unmount of sdb1. Self-destruct in 5 seconds. Have a nice day... 2009-01-08 00:25 ah, that is because of the missing iput all right 2009-01-08 00:26 sorry :) 2009-01-08 00:26 I'll put it in, and the patch credit goes to you 2009-01-08 00:26 unless you want to email a patch 2009-01-08 00:27 ok 2009-01-08 00:27 it's already in 2009-01-08 00:28 just booting it now 2009-01-08 00:29 vfs is happy now 2009-01-08 00:29 yep 2009-01-08 00:30 http://hg.tux3.org/tux3/rev/1893408a58a7 2009-01-08 01:05 -!- kushal(~kushal@115.109.12.63) has joined #tux3 2009-01-08 01:14 hey flips 2009-01-08 01:14 hi bh 2009-01-08 01:26 hirofumi, around? 2009-01-08 02:22 hi 2009-01-08 02:22 hi 2009-01-08 02:22 I'm working on buffer forking races, just trying to do a reality check to see if it can work asynchronously 2009-01-08 02:23 what race? 2009-01-08 02:23 fork vs read? 2009-01-08 02:23 we don't need it to be asyncrhonous for our simplified atomic commit, but I would like to convince myself it will work when operations on all buffers on the page ar asynchronous 2009-01-08 02:23 yes 2009-01-08 02:23 fork vs read is nasty 2009-01-08 02:24 the dma will be to the old page, and has to be copied to the new page 2009-01-08 02:24 the bio endio has to do this 2009-01-08 02:25 endio? 2009-01-08 02:25 don't we copy on fork? 2009-01-08 02:25 this is when there is another buffer on the same page that is under read io 2009-01-08 02:27 "delete" don't need brelse()? 2009-01-08 02:28 sorry, other topic 2009-01-08 02:28 the brelse is done a few lines below 2009-01-08 02:28 ah, tux_delete_entry does 2009-01-08 02:28 ok 2009-01-08 02:28 right 2009-01-08 02:28 funny interface, I didn't invent it ;) 2009-01-08 02:28 yes, ext2 does 2009-01-08 02:29 well, read io 2009-01-08 02:29 is there any difference? 
 2009-01-08 02:29 ok, when we fork a buffer and another buffer on the page is !uptodate, then it could be under read IO 2009-01-08 02:29 buffork(buffer) <- this buffer is under read io 2009-01-08 02:30 i see 2009-01-08 02:30 um... 2009-01-08 02:30 that would be a bug 2009-01-08 02:30 forking a buffer that is !uptodate is not allowed 2009-01-08 02:30 yes, probably 2009-01-08 02:30 but a different buffer on the same page could be !uptodate 2009-01-08 02:31 yes 2009-01-08 02:31 ah 2009-01-08 02:31 so, there is race 2009-01-08 02:31 messy, hmm? 2009-01-08 02:31 asynchronous readers will be doing lock_buffer to get access to the buffer 2009-01-08 02:32 the !uptodate buffer 2009-01-08 02:32 if they want to change it, they have to do blockdirty(buffer) 2009-01-08 02:32 in blockdirty, we can add synchronization to fix this 2009-01-08 02:33 lock_page is enough 2009-01-08 02:33 but it's kind of annoying to have to do a lock_page in blockdirty 2009-01-08 02:34 synchronization does copy from forked buffer to new buffer? 2009-01-08 02:34 that is possible 2009-01-08 02:35 or does read full page strategy 2009-01-08 02:35 ? 2009-01-08 02:35 read full page always 2009-01-08 02:36 so, if it dirty buffer, page would be uptodate 2009-01-08 02:36 the blocks may be completely independent 2009-01-08 02:36 that doesn't work for metadata, only file data 2009-01-08 02:36 it waits page, not buffer 2009-01-08 02:37 what waits for the page? 2009-01-08 02:37 first reader of page 2009-01-08 02:37 the forker waits for the page lock 2009-01-08 02:37 readers if page is not uptodate 2009-01-08 02:38 that is a good idea 2009-01-08 02:38 can't hold the page lock during the read, but can require the reader to take it before initiating the read 2009-01-08 02:39 now what is the locking order... page lock then buffer lock? 
 2009-01-08 02:39 yes 2009-01-08 02:41 like block_read_full_page() 2009-01-08 02:41 and end_buffer_async_read() 2009-01-08 02:41 reader takes the buffer lock, submits buffer io, the lock is released asynchronously, then the reader could take the page lock before returning, that would fix the fork race 2009-01-08 02:42 lock_page() -> lock_buffer for each buffers -> last callback unlock_page() 2009-01-08 02:43 above, I meant just take the lock_page and drop it immediately 2009-01-08 02:43 we can't hold the page lock across the read 2009-01-08 02:44 why can't we hold? 2009-01-08 02:44 the application may have to read two blocks on the same page 2009-01-08 02:44 the second would deadlock 2009-01-08 02:44 I think no problem 2009-01-08 02:44 first read is blocked until page is uptodate 2009-01-08 02:45 until the buffer is uptodate you mean 2009-01-08 02:46 if until buffer, we can't hold lock_page() 2009-01-08 02:46 until page, I think we can 2009-01-08 02:46 the page may never become uptodate, one of the blocks on it might not even be allocated 2009-01-08 02:47 for volume map? 2009-01-08 02:47 yes 2009-01-08 02:48 what is "be allocated" meaning in here? 2009-01-08 02:48 it might be free in the allocation map 2009-01-08 02:48 I thought volume map is like blockdev 2009-01-08 02:49 it is 2009-01-08 02:49 so, there is no allocation? 2009-01-08 02:49 to bring a page uptodate, every block on it has to have valid data 2009-01-08 02:49 yes 2009-01-08 02:50 I felt we can read physical contiguous range on the page 2009-01-08 02:50 yes, that helps 2009-01-08 02:51 I guess I was thinking of a file cache 2009-01-08 02:51 and if we read whole page data at once, there is no partial uptodate 2009-01-08 02:51 which also has to work 2009-01-08 02:51 you are right, for the volume map 2009-01-08 02:51 now, a file map, such as a directory 2009-01-08 02:51 ok 2009-01-08 02:51 um... 2009-01-08 02:52 ok, that works too 2009-01-08 02:52 if it is not allocated, we can think it as hole? 
2009-01-08 02:52 every block is either mapped or unmapped 2009-01-08 02:52 yes 2009-01-08 02:52 i.e. zero clear 2009-01-08 02:52 so fork can require the page to be uptodate 2009-01-08 02:53 I feel better now :) 2009-01-08 02:53 ok 2009-01-08 02:53 good :) 2009-01-08 02:53 thanks 2009-01-08 02:53 issue is we have to read whole page even if we want only one buffer 2009-01-08 02:53 that doesn't bother me 2009-01-08 02:53 yes, probably 2009-01-08 02:54 multiple blocks per page just has to work, it doesn't have to be perfectly efficient 2009-01-08 02:54 i see 2009-01-08 02:55 if you don't like that philosphy, then you can think of it as readahead ;) 2009-01-08 02:55 yes, I can :) 2009-01-08 02:56 well, those blocks can be unused from fs though 2009-01-08 02:56 how? 2009-01-08 02:56 on volume map 2009-01-08 02:56 it may unallocated block on higher layer 2009-01-08 02:57 each block will be mapped separately, there is no requirement to be physically contiguous 2009-01-08 02:57 for volume map 2009-01-08 02:57 ah, we are back to the volume map 2009-01-08 02:57 it is physically contiguous 2009-01-08 02:58 I think it is why sb_bread() doesn't read whole page 2009-01-08 02:58 yes, if we read unallocated space it is a slight waste 2009-01-08 02:58 only fork needs the page uptodate 2009-01-08 02:58 regular bread doesn't, which is more common 2009-01-08 03:00 if bechmark shows that as issue, probably we would be able to optimize it 2009-01-08 03:01 yes 2009-01-08 03:01 I'm happy with the idea 2009-01-08 03:01 ok 2009-01-08 03:01 it's a big simplification over what I was trying to do 2009-01-08 03:02 yes 2009-01-08 03:02 I just want to be sure fork will work before committing the design to it, and now I am I think 2009-01-08 03:02 async read sounds like complex 2009-01-08 03:02 very 2009-01-08 03:03 ext3 what does... 
2009-01-08 03:03 maybe, it use lock_buffer() to copy 2009-01-08 03:03 wait any io 2009-01-08 03:04 I did find something like this fork somewhere in ext3, I forget where now 2009-01-08 03:04 there is some buffer copying 2009-01-08 03:04 but I don't think it changes the page out from under the buffer head like I plan to do 2009-01-08 03:06 do_get_write_access() seems to copy from bh->b_data to ->frozen_data 2009-01-08 03:07 that's a little different 2009-01-08 03:07 with lock_buffer() and jbd_lock_bh_state() 2009-01-08 03:07 the buffer can't be under write IO in that case 2009-01-08 03:07 yes 2009-01-08 03:08 ext3 seem synchronous against io 2009-01-08 03:08 I'm pretty sure this works now 2009-01-08 03:09 bring the page uptodate at the start of the fork... I didn't think of that 2009-01-08 03:09 after thinking about this for days 2009-01-08 03:09 I was trying to get too fancy I think 2009-01-08 03:10 however, it would be fast 2009-01-08 03:10 can be faster 2009-01-08 03:11 lock_buffer() was released after some state checking 2009-01-08 03:12 ah, it wait io, so it shouldn't have io anymore 2009-01-08 03:12 ok, probably ext3 waits io, then copy 2009-01-08 03:12 yes 2009-01-08 03:13 fork can be faster than ext3 even if we read whole page 2009-01-08 03:13 that's what I think 2009-01-08 03:14 great 2009-01-08 03:14 another matter... I have a very nice simplification to the rollup idea 2009-01-08 03:15 rollup will apply some logged changes, and free some old log buffers 2009-01-08 03:15 we have to know the physical block of each log buffer somehow 2009-01-08 03:15 apply to disk, and free on-disk logs? 
2009-01-08 03:16 apply to cache and free log blocks 2009-01-08 03:16 sorry 2009-01-08 03:16 you are right 2009-01-08 03:16 ok 2009-01-08 03:17 write out some metadata blocks, when the delta completes, the log blocks can be freed 2009-01-08 03:17 so, the physical address of each log block is held in the next log block 2009-01-08 03:17 that makes the backwards chain 2009-01-08 03:17 i see 2009-01-08 03:17 and that is how we remember the physical address to free the log block 2009-01-08 03:18 a small thing 2009-01-08 03:18 but it means we don't have to allocate some structure to remember that 2009-01-08 03:18 we just hold last? 2009-01-08 03:18 so as we run, the index of the logmap just keeps increasing 2009-01-08 03:19 i see 2009-01-08 03:19 it is kind of annoying to keep full log blocks pinned in cache, just to remember the physical address, isn't it? 2009-01-08 03:20 we only have to pin the ones that are not completed to disk yet 2009-01-08 03:20 good 2009-01-08 03:20 ah, ok 2009-01-08 03:20 do we have forward chain? 2009-01-08 03:21 not yet 2009-01-08 03:21 ah, ok 2009-01-08 03:21 not for our initial atomic commit 2009-01-08 03:21 that will be a nice improvement 2009-01-08 03:22 anyway, sb->lognext can just keep increasing forever, there will be N valid log blocks, going backwards from the most recent log block at any time 2009-01-08 03:22 we will record N in very delta commit 2009-01-08 03:22 we shorten the log chain by storing a smaller N 2009-01-08 03:23 at that time, blocks are freed 2009-01-08 03:23 how do we know log blocks for replay 2009-01-08 03:23 ? 
2009-01-08 03:23 there is a backwards chain starting from the most recent 2009-01-08 03:24 the most recent says how many blocks to go back 2009-01-08 03:24 i see 2009-01-08 03:24 I imaged we have forward chain from sb or something 2009-01-08 03:25 yes, the pointer to the most recent is stored in the sb for now 2009-01-08 03:25 i see 2009-01-08 03:25 we can make that a forward log without too much work 2009-01-08 03:25 but it is extra work 2009-01-08 03:25 so something a little simpler for now 2009-01-08 03:25 i see, ok, we use backward chain always 2009-01-08 03:25 yes, for ow 2009-01-08 03:25 for now 2009-01-08 04:16 http://mailman.tux3.org/pipermail/tux3/2009-January/000625.html <- Design note: Buffer forking 2009-01-08 04:16 that took way too long to write 2009-01-08 05:03 -!- kushal(~kushal@115.109.12.63) has joined #tux3 2009-01-08 05:08 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2009-01-08 05:14 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-01-08 06:00 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2009-01-08 06:17 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2009-01-08 08:09 -!- pworrall(~pworrall@resnet-nat060.lancs.ac.uk) has joined #tux3 2009-01-08 08:44 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-08 10:16 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-08 10:28 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-08 13:44 time to code buffer fork 2009-01-08 17:03 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-01-08 17:50 hirofumi, there? 
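The backward log chain discussed above (each log block records the physical address of the one before it, the newest block says how many blocks to go back, and the chain is shortened by storing a smaller count) might be sketched roughly as follows. This is only an illustration under stated assumptions, not tux3's on-disk format: struct logblock, its field names, the in-memory volume array and the functions log_commit and log_chain are all hypothetical.

```c
#include <stdint.h>

#define VOLUME_BLOCKS 16   /* size of the toy in-memory "volume" (assumption) */

/* Hypothetical log block header, not the tux3 on-disk layout. */
struct logblock {
	uint64_t prev;      /* physical address of the previous log block */
	unsigned logcount;  /* valid log blocks in the chain, including this one */
};

static struct logblock volume[VOLUME_BLOCKS];  /* stand-in for the device */

/* Write a log block at `block` that chains back to `prev`. */
static void log_commit(uint64_t block, uint64_t prev, unsigned logcount)
{
	volume[block].prev = prev;
	volume[block].logcount = logcount;
}

/*
 * Starting from the newest log block (the address a superblock would hold),
 * walk backwards and fill `chain` in oldest-first order, which is the order
 * replay would apply the blocks. Returns the number of blocks found, or 0
 * if the chain does not fit in `max` entries or points outside the volume.
 */
static unsigned log_chain(uint64_t newest, uint64_t *chain, unsigned max)
{
	if (newest >= VOLUME_BLOCKS)
		return 0;
	unsigned count = volume[newest].logcount;
	if (count > max)
		return 0;
	uint64_t block = newest;
	for (unsigned i = count; i-- > 0;) {
		if (block >= VOLUME_BLOCKS)
			return 0;
		chain[i] = block;
		block = volume[block].prev;
	}
	return count;
}
```

In this sketch, committing a new block with a smaller logcount is what drops older blocks from the chain: they are simply never reached by the walk, so their physical addresses can be handed back to the allocator without keeping a separate structure to remember them.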
2009-01-08 18:00 hi 2009-01-08 18:12 I think it is time to add our own volmap 2009-01-08 18:12 ok 2009-01-08 18:13 however, I'm thinking why do we need it 2009-01-08 18:13 good question, let's see if we can answer it 2009-01-08 18:14 small thing: it would be nice to have volmap->mapping->host->i_sb be our superblock 2009-01-08 18:14 before, I thought the reason is because we can't dirty buffer_head 2009-01-08 18:14 that's another thing, I'm also worried about vfs touching our buffers 2009-01-08 18:15 yes 2009-01-08 18:15 it would be nice to have just one blockread for block volmap and file inodes 2009-01-08 18:15 however, volmap has same issue 2009-01-08 18:16 because vfs will want to flush the volmap dirty pages? 2009-01-08 18:16 yes 2009-01-08 18:16 that is easy to prevent 2009-01-08 18:16 how do we do it? 2009-01-08 18:17 set inode dirty, then put the inode on whatever list we want 2009-01-08 18:17 vfs never moves an inode to the sb dirty list if it is already marked dirty 2009-01-08 18:17 I found that out when I was working on the deferred namespace patch 2009-01-08 18:18 mark_inode_dirty? 2009-01-08 18:19 let's see 2009-01-08 18:20 yes 2009-01-08 18:22 ok, our volmap (if we have it) will be set dirty right when we allocate 2009-01-08 18:22 then vfs will never flush anything on it 2009-01-08 18:22 how do we do the same thing for blockdev? 2009-01-08 18:22 it can't 2009-01-08 18:22 ok, that's reason 2009-01-08 18:23 and host->i_sb is also a reason I think 2009-01-08 18:23 it must allow some cleanups 2009-01-08 18:23 well, always dirty or don't i_hash it 2009-01-08 18:23 ah, I didn't consider that way 2009-01-08 18:24 what are reasons against our own volmap? 
2009-01-08 18:24 1) we have to do our own unmap underlying metadata 2009-01-08 18:25 I think, the right thing to do there is remove the metadata buffers from a page as soon as all have zero count 2009-01-08 18:26 I don't know why it is currently done when a file maps the block 2009-01-08 18:26 that seems strangely complex, with no clear win 2009-01-08 18:27 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L1603 2009-01-08 18:27 I think we don't need it at all 2009-01-08 18:28 because we control dirty buffer ourself 2009-01-08 18:28 all it does is a brelse 2009-01-08 18:28 yes 2009-01-08 18:28 the only reason to do it is free some buffer heads 2009-01-08 18:29 it doesn't free buffer_head 2009-01-08 18:29 no 2009-01-08 18:29 just brelse and waits for shrink_caches to clean up 2009-01-08 18:29 that's not nice 2009-01-08 18:30 I would like to be using our own aops for the volume map, not blockdev aops 2009-01-08 18:30 maybe 2009-01-08 18:31 let's see what blockdev aops ->writepage is 2009-01-08 18:31 it is just use block_* 2009-01-08 18:32 let's see if there is anything in block_dev.c that actually helps us 2009-01-08 18:33 def_blk_ops is what we use now? 2009-01-08 18:33 for sb_bread? 
2009-01-08 18:33 yes 2009-01-08 18:33 no 2009-01-08 18:33 it will use for /dev/* 2009-01-08 18:34 open(/dev/*) directly 2009-01-08 18:34 ah, the block device decides\ 2009-01-08 18:34 sb_bread() will do ll_rw_block() 2009-01-08 18:35 but, ->writepage will be used to flush buffers 2009-01-08 18:35 and open/release are used for mount() 2009-01-08 18:35 it's never useful to have ->writepage called on our volume map 2009-01-08 18:35 yes 2009-01-08 18:36 volume map will just be like normal file 2009-01-08 18:36 not like blockdev 2009-01-08 18:36 it cover whole blocks though 2009-01-08 18:36 I think, the main thing is, blockdev seems to be a whole huge thing that does nothing useful, and introduces risks 2009-01-08 18:37 block_fsync would be instant corruption for tux3 2009-01-08 18:38 yes 2009-01-08 18:38 if users touch block device under fs, it is user's fault 2009-01-08 18:39 it is still true, even if volume map 2009-01-08 18:40 only fsync() may be ok though 2009-01-08 18:40 this is under the category of "what value does block_dev.c add" 2009-01-08 18:40 it registers with sysfs 2009-01-08 18:40 but that has nothing to do with caching for us, and it will continue to do that 2009-01-08 18:41 wow, there is a ton of cruft, to do a very simple thing with sysfs 2009-01-08 18:41 in block_dev.c 2009-01-08 18:42 I think block_dev.c has the wrong idea that it should be a cache and be involved in filesystem operations, it should really just be managing block devices, period 2009-01-08 18:42 no caching 2009-01-08 18:43 if there is no fs, it can't? 2009-01-08 18:44 there is no fs which mounting that block device 2009-01-08 18:44 ? 
2009-01-08 18:45 blockdev can't call fs operations 2009-01-08 18:45 because it is not mounted yet 2009-01-08 18:45 oh, you're saying what if somebody wants to use the blockdev as a file 2009-01-08 18:46 yes 2009-01-08 18:46 yes, then it needs its own cache 2009-01-08 18:46 yes 2009-01-08 18:46 it doesn't have to impose itself on filesystems as a cache though 2009-01-08 18:46 I don't see any advantage 2009-01-08 18:46 looking at block ioctls now, to see if there's anything useful 2009-01-08 18:49 it looks like there is no big things for fs 2009-01-08 18:49 fsync_bdev will be done by things like lockfs 2009-01-08 18:50 that would corrupt us 2009-01-08 18:50 have is that avoided? 2009-01-08 18:50 yes, I think so 2009-01-08 18:50 that is a big reason for not using the blockdev then 2009-01-08 18:50 so there are three reasons for not using it, and probably one for using: it already works 2009-01-08 18:51 well, however corruption stuff can be done by many way 2009-01-08 18:51 just write via /dev/* 2009-01-08 18:52 I'm thinking all is user fault 2009-01-08 18:52 another small reason for not using: get rid of set block size 2009-01-08 18:52 ah, but users thing lockfs is a high level api for snapshotting 2009-01-08 18:52 when actually it is a high level api for corrupting our fs 2009-01-08 18:53 ah 2009-01-08 18:54 well, it would avoid by __fsync_super() 2009-01-08 18:55 it calls ->sync_fs() 2009-01-08 18:56 so, I think to don't flush dirty buffer automatically is primary reason 2009-01-08 18:57 true 2009-01-08 18:57 having just one version of blockread is a better reason 2009-01-08 18:57 and blockdirty, blockfork 2009-01-08 18:57 yes 2009-01-08 18:59 oh, bad_inode is not hashed 2009-01-08 18:59 make_bad_inode() remove inode from hash 2009-01-08 18:59 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L210 <- unconditionally calls sync_blockdev 2009-01-08 18:59 yes 2009-01-08 19:00 however, it calls __fsync_super() before it 2009-01-08 19:00 so we can try to clean our blocks 
ourselves? 2009-01-08 19:00 hi 2009-01-08 19:00 yes 2009-01-08 19:01 hi shapor 2009-01-08 19:01 flips: did you break the repo? 2009-01-08 19:01 because __fsync_super() calls ->sync_fs 2009-01-08 19:01 my bitbucket mirror broke: abort: push creates new remote branches! 2009-01-08 19:01 shapor, yesterday I did a rollback on the public repo 2009-01-08 19:01 ah ok 2009-01-08 19:01 shapor, partly to see what happens to your mirror ;) 2009-01-08 19:01 it broke? 2009-01-08 19:01 how? 2009-01-08 19:01 ah 2009-01-08 19:01 well i do a pull from you and a push to them 2009-01-08 19:01 so now we know 2009-01-08 19:01 the push to them is breaking 2009-01-08 19:02 kinda expected i think 2009-01-08 19:02 yes 2009-01-08 19:02 I'll try not to do it again 2009-01-08 19:02 not hard to fix, i could even automate it i support 2009-01-08 19:02 just gotta clean out my local tree and repull/push i think 2009-01-08 19:03 did it before 2009-01-08 19:03 hirofumi, so if we use the existing blockdev, we have to add code to keep it away from our filesystem, and we have two versions of a several block operations 2009-01-08 19:03 compared to what advantage? 2009-01-08 19:04 clean out means? 2009-01-08 19:05 if we use dirty buffer_head on blockdev, we can't control flush anymore 2009-01-08 19:05 whats a volmap? 2009-01-08 19:05 it's our prosed inode that will map the whole volume, our buffer cache 2009-01-08 19:06 hirofumi is making me justify it 2009-01-08 19:06 map what to what? 2009-01-08 19:06 caches metadata blocks 2009-01-08 19:07 metadata blocks mapped one to one to the device, that is, physical blocks 2009-01-08 19:07 it is replacement of sb_bread/sb_getblk() 2009-01-08 19:08 so, if we use dirty buffer_head, we have to use volmap 2009-01-08 19:08 that's enough reason for me 2009-01-08 19:09 yes 2009-01-08 19:09 shall we start? I'll add the volmap inode 2009-01-08 19:09 ok 2009-01-08 19:10 ah, it can do on blockdev. but too complex 2009-01-08 19:10 e.g. 
ext3 like jbh 2009-01-08 19:10 oh yes, that is scary 2009-01-08 19:11 all of jbd is scary 2009-01-08 19:11 yes 2009-01-08 19:19 ok, there it is: http://hg.tux3.org/tux3/rev/78bcf649e655 2009-01-08 19:20 so, I think we should try to replace sb_bread/getblk with blockread2/blockget2 2009-01-08 19:20 and we can break it a little for a while, that is ok 2009-01-08 19:21 we can use blockread/blockget for volmap? 2009-01-08 19:21 probably 2009-01-08 19:21 ok, anyway, I'll try to do something for volmap 2009-01-08 19:22 and remove find_buffer() hack 2009-01-08 19:22 that is worth it, just by itself 2009-01-08 19:32 will we ever call mark_inode_dirty on volmap? I don't think so 2009-01-08 19:33 I think it wouldn't be called directly at least 2009-01-08 19:33 or indirectly 2009-01-08 19:33 however, it may be called via mark_buffer_dirty()? 2009-01-08 19:33 volmap->i_state |= I_DIRTY; <- this will prevent harm 2009-01-08 19:34 maybe add that now and remove if it's not necessary? 2009-01-08 19:35 mark_inode_dirty() is called, but it is ignored? 2009-01-08 19:35 yes 2009-01-08 19:36 well, now, we are not inserting volmap to inode hash 2009-01-08 19:36 so, it will not be inserted into sb->s_dirty 2009-01-08 19:37 154 if (hlist_unhashed(&inode->i_hash)) <- ah, here 2009-01-08 19:37 yes 2009-01-08 19:51 earthquake, moderate one 2009-01-08 19:52 oh, scary 2009-01-08 19:52 there is many small or big earthquake in japan 2009-01-08 19:55 http://quake.wr.usgs.gov/recenteqs/Quakes/ci10370141.htm <- magnitude 5.0 2009-01-08 19:55 big enough for me 2009-01-08 19:55 u just felt one? 2009-01-08 19:55 you didn't? 2009-01-08 19:56 nope 2009-01-08 19:56 we're in the sediment, you're on the rockl 2009-01-08 19:56 rock 2009-01-08 19:56 you'll feel it more if its coming from the valley 2009-01-08 19:57 hirofumi, where r u? 2009-01-08 19:57 I'm in japan 2009-01-08 19:57 what part? 
2009-01-08 19:57 in Tokyo 2009-01-08 19:57 k 2009-01-08 19:58 what was the city that had the huge earthquake about 10 yrs ago? 2009-01-08 19:58 a friend of mine lived there, and his house was destroyed 2009-01-08 19:58 probably, in Koge and Awajishima 2009-01-08 19:58 kobe 2009-01-08 19:58 more recently Nigata 2009-01-08 19:59 the ring of fire 2009-01-08 19:59 yes 2009-01-08 19:59 I lived in Osaka about 10 years ago 2009-01-08 19:59 all the cool cities are on the ring ;-) 2009-01-08 19:59 yes 2009-01-08 19:59 it was really scary 2009-01-08 20:00 when are you coming to LA? 2009-01-08 20:00 indeed, vancouver gets to say its on the ring, without having to put up with the earthquakes 2009-01-08 20:00 gets the warm water, not the shakes 2009-01-08 20:01 many building was crashed, and many house was fired 2009-01-08 20:02 we get it easy, so far 2009-01-08 20:03 http://www.emergency.com/jpnquake.htm <- this one? 2009-01-08 20:03 that was it 2009-01-08 20:04 if we have a big one here, I'm fucked 2009-01-08 20:04 we're on a liquefaction zone 2009-01-08 20:04 and I'm only 8 ft above sea level 2009-01-08 20:11 hey, I didn't mean to interrupt the dev work :-P 2009-01-08 20:22 removing devmap from user now, replacing with volmap 2009-01-08 20:47 I might found blockfork issue on directory page cache 2009-01-08 20:48 it uses blockget() to allocate new buffer 2009-01-08 20:48 however, it leaves rest of buffers as not uptodate 2009-01-08 20:51 read_mapping_page was supposed to bring them uptodate 2009-01-08 20:51 blockget() doesn't use read_mapping_page() 2009-01-08 20:51 yes 2009-01-08 20:52 we may want to use blockread() in _tux_create_entry() 2009-01-08 20:53 but in that case, blockget should not leave the page uptodate 2009-01-08 20:53 right? 
2009-01-08 20:53 if there are any !uptodate buffers, the page should not be uptodate 2009-01-08 20:54 yes 2009-01-08 20:55 it may not have problem for blockfork() 2009-01-08 20:55 I was thinking to add assert(PageUptodate(page)) in blockfork() 2009-01-08 20:56 in the blockget() case, it would not be true 2009-01-08 20:57 there could be a debug check that loops over the buffers 2009-01-08 20:57 that would be a very good thing 2009-01-08 20:57 core kernel needs that too 2009-01-08 20:58 um... 2009-01-08 20:59 maybe, we need to use blockread() instead of blockget() 2009-01-08 20:59 if blockget() is last buffer on the page, read io can be there for first buffer 2009-01-08 21:01 blockread should not do read_mapping_page 2009-01-08 21:01 why? 2009-01-08 21:02 ah, I'm not sure I'm correct 2009-01-08 21:02 anyway, it isn't the issue 2009-01-08 21:03 you were talking about blockget in directory ops, right? 2009-01-08 21:03 yes, in _tux_create_entry() 2009-01-08 21:03 it is only user of blockget() except logmap 2009-01-08 21:04 usually, blockread would be written in terms of blockget, but I don't have a solid argument why that needs to be 2009-01-08 21:05 if we use blockread in create, it will zero the block for us, that seems good 2009-01-08 21:05 yes, and blockfork() requires page is uptodate 2009-01-08 21:06 I thought blockget() will violate it 2009-01-08 21:07 read_mapping_page is supposed to bring the page uptodate, and blockget should not bring the page uptodate unless all buffers are uptodate, right? 
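The debug check suggested at 20:57, a loop over a page's buffers confirming that an uptodate page carries no !uptodate buffer, might look like the sketch below. It is a userspace model only: `model_page`, `model_buffer`, `BUFFERS_PER_PAGE` and the helper names are hypothetical stand-ins, not the kernel's `struct page` or `struct buffer_head` API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model, not the kernel's page/buffer_head structures. */
#define BUFFERS_PER_PAGE 4

struct model_buffer {
	bool uptodate;
};

struct model_page {
	bool uptodate;
	struct model_buffer buffers[BUFFERS_PER_PAGE];
};

/* True when every buffer attached to the page is uptodate. */
static bool all_buffers_uptodate(const struct model_page *page)
{
	for (size_t i = 0; i < BUFFERS_PER_PAGE; i++)
		if (!page->buffers[i].uptodate)
			return false;
	return true;
}

/*
 * Debug check: an uptodate page must not carry any !uptodate buffer.
 * A page that is not uptodate may hold any mix of buffer states.
 */
static bool page_state_consistent(const struct model_page *page)
{
	return !page->uptodate || all_buffers_uptodate(page);
}

/*
 * Model of a caller that gets one buffer and fills it: the buffer becomes
 * uptodate, the page state is deliberately left untouched.
 */
static void model_fill_buffer(struct model_page *page, size_t index)
{
	assert(index < BUFFERS_PER_PAGE);
	page->buffers[index].uptodate = true;
}
```

In this model, filling a single buffer on a page that is not uptodate keeps the check satisfied; only marking the page uptodate while some buffer is still stale trips it.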
2009-01-08 21:08 yes 2009-01-08 21:08 well, blockget() will not touch page state 2009-01-08 21:09 not change page state 2009-01-08 21:09 when buffers are added to an uptodate page, they are all supposed to be set uptodate 2009-01-08 21:09 yes 2009-01-08 21:09 it is true in both cases 2009-01-08 21:10 so blockget does not need to touch page state, it should be right 2009-01-08 21:10 yes 2009-01-08 21:10 if page is not uptodate, blockget() will leave page state as is 2009-01-08 21:11 right, it has no effect on state of data 2009-01-08 21:11 so, after that, we may see non uptodate page in blockfork() 2009-01-08 21:11 and read_mapping_page should bring it uptodate 2009-01-08 21:12 we do read_mapping_page() in blockfork()? 2009-01-08 21:12 I thought that was our big conclusion last night 2009-01-08 21:12 ah 2009-01-08 21:13 it prevents a race with block read 2009-01-08 21:13 I thought read side will guarantee page is uptodate 2009-01-08 21:13 maybe there is no big difference between the two, though 2009-01-08 21:14 I thought in this case uptodate is guaranteed by blockget() side 2009-01-08 21:14 what do you mean by read side?
2009-01-08 21:14 ah 2009-01-08 21:14 first page user side 2009-01-08 21:14 ah, page allocator side 2009-01-08 21:15 well putting read_mapping_page in the block fork does not seem wrong 2009-01-08 21:15 yes 2009-01-08 21:15 it's easy to see correctness then 2009-01-08 21:16 I was not thinking about it 2009-01-08 21:16 because current blockread() guarantees page is uptodate 2009-01-08 21:17 however, maybe we can do it in blockfork() 2009-01-08 21:19 blockread always reading a full page does not seem right for file page cache 2009-01-08 21:19 a block might be a hole 2009-01-08 21:19 it is cleared to zero, and left unmapped 2009-01-08 21:19 it might work 2009-01-08 21:20 it is required for mmap() 2009-01-08 21:20 mmap always reads a full page, not a block 2009-01-08 21:20 yes 2009-01-08 21:21 anyway, I can't prove there is any flaw yet 2009-01-08 21:21 so we will learn by doing 2009-01-08 21:22 so, we do read in blockfor()? 2009-01-08 21:22 blockfork() 2009-01-08 21:22 do read_mapping_page() in blockfork() 2009-01-08 21:23 it makes sense to me 2009-01-08 21:23 yes, it is good 2009-01-08 21:24 and we don't need to guarantee page is uptodate in blockread() 2009-01-08 21:24 yes 2009-01-08 21:25 well, I rewrote blockread() so that the page is uptodate 2009-01-08 21:25 it is not a big difference from the current one 2009-01-08 21:26 we can optimize blockread/blockget again 2009-01-08 21:26 yes 2009-01-08 21:27 one block at a time is good enough for now 2009-01-08 21:29 tux3 mkfs does bad things if you mkfs on a zero size volume 2009-01-08 21:45 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-01-08 21:45 these are the patches in my tree 2009-01-08 21:45 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2009-01-08 21:46 reading 2009-01-08 22:26 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-01-08 22:26 this added tux_new_volmap 2009-01-08 22:26 blockread/blockget(sb->volmap) may work 2009-01-08 22:27 hirofumi, tuxtime-original?
2009-01-08 22:27 it just reminder for me 2009-01-08 22:27 because the new one is now impossible to read I suppose 2009-01-08 22:28 and there is off-by-one difference with old 2009-01-08 22:28 lowest bit? 2009-01-08 22:29 maybe, yes 2009-01-08 22:29 because of my fancy new rounding 2009-01-08 22:30 I just compared against old result 2009-01-08 22:30 I didn't check the detail of it 2009-01-08 22:31 I didn't know you would regression test it, I could have left out the rounding ;) 2009-01-08 22:31 new one converts more accurately 2009-01-08 22:31 oh, new one is right, good :) 2009-01-08 22:35 revisit-blockread is the big one 2009-01-08 22:36 however, it would just optimize slightly 2009-01-08 22:36 don't get lock_page again 2009-01-08 22:37 and return uptodate page only 2009-01-08 22:38 we can optimize it again to allow partial uptodate 2009-01-08 22:38 and current one may be used partially for blockfork() 2009-01-08 22:38 yes 2009-01-08 22:39 the fsuid patch... 2009-01-08 22:39 if we back that out, it won't build on recent kernel 2009-01-08 22:39 after it, the patches is for new kernel 2009-01-08 22:40 I don't really want a compat.h because we will just delete the file a little later 2009-01-08 22:40 yes 2009-01-08 22:40 that patch will removes old one 2009-01-08 22:41 well, anyway, those patch is for future 2009-01-08 22:41 did the nfsd race actually happen? 2009-01-08 22:42 well some should be merged now I think 2009-01-08 22:42 I don't know it does or not 2009-01-08 22:42 maybe, Al found the race 2009-01-08 22:44 I think mergable patches is current static-http://... 
2009-01-08 22:44 mkdir-error-fix.patch, revisit-blockread.patch, and support-volmap.patch 2009-01-08 22:44 I don't see support-volmap in that set 2009-01-08 22:45 yes, I added after that repo 2009-01-08 22:45 current repo has it 2009-01-08 22:45 ah, I got 3 more changesets 2009-01-08 22:47 ah, that was dumb of me not to initialize the volmap first 2009-01-08 22:48 well, I want to remove all blockdev usage 2009-01-08 22:48 does it work? 2009-01-08 22:48 but, it can't read sb without it 2009-01-08 22:48 current repo doesn't have any users 2009-01-08 22:49 right, sb IO needs a cleanup 2009-01-08 22:49 but not right away 2009-01-08 22:49 well, we can add block_write_full_page to tux3_vol_writepage() 2009-01-08 22:49 and replace all sb_bread/sb_getblk() to volmap 2009-01-08 22:50 and insert_inode_hash(sb->volmap) 2009-01-08 22:50 want to merge it without uses yet, just to clean out your queue a little? 2009-01-08 22:50 so, we can test volmap actually work or not 2009-01-08 22:51 either is ok for me 2009-01-08 22:51 if merge, I'll prepare 2009-01-08 22:52 merge is fine 2009-01-08 22:52 ok 2009-01-08 22:53 I'll add with block_write_full_page() 2009-01-08 22:53 for now 2009-01-08 22:53 ok 2009-01-08 22:54 your factoring looks tight, as usual 2009-01-08 22:55 I have sometimes tried to improve your factoring, without success ;) 2009-01-08 22:56 tux3_vol_writepage <- why not just leave NULL and let it segfault? 2009-01-08 22:56 maybe a BUG is better for now 2009-01-08 22:58 to test it, we need block_write_full_page() 2009-01-08 22:58 without it, buffer would be flushed 2009-01-08 22:59 would not be? 
2009-01-08 22:59 ah, would not be 2009-01-08 22:59 makes sense 2009-01-08 23:06 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-01-08 23:06 I added block_write_full_page and comment 2009-01-08 23:17 still boots, still mounts 2009-01-08 23:17 yes 2009-01-08 23:18 I ran fsx-linux slightly before volmap 2009-01-08 23:19 my patch to remove devmap and use volmap in user is almost done... but tux3fuse broke 2009-01-08 23:19 tracking that down 2009-01-08 23:19 we can add get_block handler to tux_inode 2009-01-08 23:20 get_block handler? 2009-01-08 23:20 just an idea 2009-01-08 23:20 yes, tux_inode()->get_block 2009-01-08 23:20 what I thought you meant 2009-01-08 23:20 so, we can merge some apos 2009-01-08 23:20 we may be able to merge 2009-01-08 23:21 like I did in user space, slightly differently 2009-01-08 23:22 so get_block for volmap just sets physical block to the buffer index? 2009-01-08 23:22 yes 2009-01-08 23:23 it sounds like a good idea 2009-01-08 23:24 I wonder if it has any other uses 2009-01-08 23:24 ah 2009-01-08 23:24 well, it would be later 2009-01-08 23:24 maybe, after handles 2009-01-08 23:25 however, we may need only one apos with it 2009-01-08 23:25 let's see how exactly they differ 2009-01-08 23:25 -!- weechat2(~weechat@host162-104.bornet.net) has joined #tux3 2009-01-08 23:25 ah 2009-01-08 23:26 file map is using mpage_readpage() 2009-01-08 23:26 it may not attach buffer_heads 2009-01-08 23:26 it may not be good 2009-01-08 23:26 right, we don't want to be putting buffer heads on file data 2009-01-08 23:27 it may work for metadata too, but I'm not sure for now 2009-01-08 23:27 buffers should only go on metadata in file block cache, i.e., dirents and bitmaps 2009-01-08 23:27 mpage for metadata?
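The volmap mapping agreed on at 23:22, where the physical block is simply the buffer index, can be sketched as a mapping callback. This is a userspace illustration under assumptions: `block_t`, `volmap_info` and `volmap_map_block` are hypothetical names, and the signature is not the kernel's `get_block_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t block_t;

/* Hypothetical per-volume state: only the volume size in blocks is needed. */
struct volmap_info {
	block_t volblocks;
};

/*
 * Identity mapping for the volume map: logical index == physical block.
 * Returns 0 on success, -1 when the index lies beyond the end of the volume.
 */
static int volmap_map_block(const struct volmap_info *vol, block_t index,
			    block_t *physical)
{
	assert(vol != NULL && physical != NULL);
	if (index >= vol->volblocks)
		return -1;
	*physical = index;
	return 0;
}
```

A file map would need a btree lookup at this point; the volume map needs only the bounds check, which is why it is a cheap replacement for device-level block reads.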
2009-01-08 23:27 maybe, if only one apos 2009-01-08 23:28 mpage only comes from sys_read, I thought 2009-01-08 23:29 ah, yes 2009-01-08 23:29 mpage_readpages() 2009-01-08 23:29 however, there is mpage_readpage() 2009-01-08 23:31 it also use bio to read page 2009-01-08 23:32 we won't use mpage forever 2009-01-08 23:32 it makes lots of tiny calls to map_region 2009-01-08 23:32 so, maybe, file map and others apos 2009-01-08 23:33 it is fallback version of mpage_readpages() 2009-01-08 23:34 tux3fuse works now, I didn't change anything 2009-01-08 23:34 mpage 2009-01-08 23:34 good 2009-01-08 23:35 ah 2009-01-08 23:35 later, we have to replace get_block calls 2009-01-08 23:36 yes, so that's a reason not to use that interface 2009-01-08 23:36 that settles that 2009-01-08 23:36 but there will be a different way to abstract it 2009-01-08 23:37 i see 2009-01-08 23:37 getting rid of get_block will be very satisfying, it is not that far away 2009-01-08 23:37 yes 2009-01-08 23:38 so, we will revisit to filemap.c with handles and it 2009-01-08 23:39 one note on handles: for now we will walk lists of buffers, when we have handles they don't have list links, but we can walk pages instead 2009-01-08 23:39 lists of pages 2009-01-08 23:39 hmm 2009-01-08 23:39 what if one page is in one delta, and another page is in a different delta 2009-01-08 23:41 sorry 2009-01-08 23:41 what if one block on a page is in one delta, and another block on the same page in a different delta 2009-01-08 23:42 ah 2009-01-08 23:42 fork 2009-01-08 23:42 right 2009-01-08 23:42 yes, it can 2009-01-08 23:42 yes 2009-01-08 23:42 um... 2009-01-08 23:43 it is bug? 2009-01-08 23:43 no 2009-01-08 23:43 it is nice :) 2009-01-08 23:44 it can be, and just fork? 2009-01-08 23:44 it is nice that fork allows using the page list link, otherwise we would have to kmalloc something to keep track of which pages are in a delta 2009-01-08 23:45 page list link? 
2009-01-08 23:46 yes, a page has a list for the purpose of linking dirty pages to an inode 2009-01-08 23:46 ah 2009-01-08 23:46 I would like to use that field for linking pages to a delta 2009-01-08 23:46 it will require some analysis 2009-01-08 23:46 I think it works 2009-01-08 23:47 I thought we use buffer_head->b_assoc_list 2009-01-08 23:47 for now, yes 2009-01-08 23:47 I'm thinking forward to handles 2009-01-08 23:47 and handles also have replacement of that 2009-01-08 23:48 I thought 2009-01-08 23:48 actually, they don't, I am too cheap with memory to put a list link for each block in the handle 2009-01-08 23:48 they could have 2009-01-08 23:48 but it would be nicer if things work out with lists of pages 2009-01-08 23:49 if it pages, we can't have the above state 2009-01-08 23:49 in current delta, blockfork() will fork page? 2009-01-08 23:49 that's what I'm thinking 2009-01-08 23:49 i see 2009-01-08 23:49 that could work out very well 2009-01-08 23:50 it has to see all handles? 2009-01-08 23:50 to checks delta counter in handles 2009-01-08 23:51 yes 2009-01-08 23:51 i see 2009-01-08 23:51 it will walk the delta list by walking the page list and checking each handle state 2009-01-08 23:52 it would be nice if the inode dirty page list was sorted 2009-01-08 23:52 I don't think anything gaurantees that now 2009-01-08 23:52 but it tends to be sorted by having pages added in order 2009-01-08 23:53 I think it is sorted by dirty time 2009-01-08 23:53 is it? 2009-01-08 23:53 s_dirty list 2009-01-08 23:53 ah, inodes 2009-01-08 23:53 ah 2009-01-08 23:53 page 2009-01-08 23:53 right, the inode dirty page list 2009-01-08 23:54 if that is sorted, the walking it to make big regions for map_region will work 2009-01-08 23:55 what list do we use for page? 2009-01-08 23:55 ok, two devmaps to remove, in tux3graph 2009-01-08 23:55 you mean, where do we put the list head? 
2009-01-08 23:56 it meant page doesn't have ->list 2009-01-08 23:58 hmm, where did that page list go :) 2009-01-09 00:01 :) 2009-01-09 00:05 ah, it's find_get_pages now 2009-01-09 00:05 ok, well there is the lru link 2009-01-09 00:06 a page does not need to be on the lru if it is dirty 2009-01-09 00:06 or rather, 2009-01-09 00:06 if we have already decided to write it out 2009-01-09 00:07 so the lru link becomes avaiable, and we would rather not have those pages being scanned anyway 2009-01-09 00:07 ah, it may be able to do 2009-01-09 00:08 no 2009-01-09 00:08 if it is not forked, lru is still needed? 2009-01-09 00:08 then it could be some use 2009-01-09 00:08 in theory 2009-01-09 00:08 in practice, it is probably still useless 2009-01-09 00:09 if it clean, vm decides evict page from it? 2009-01-09 00:09 if page is clean 2009-01-09 00:09 well, even if dirty, it will write out 2009-01-09 00:10 shrink_caches now tries to do ->writepage on dirty lru pages 2009-01-09 00:10 that is really a bad idea 2009-01-09 00:10 that should be gotten rid of completely 2009-01-09 00:11 but, if pages was clean? 2009-01-09 00:11 yes, then it is useful 2009-01-09 00:11 we will never put a clean page on a delta list 2009-01-09 00:11 after write out, it become clean? 
2009-01-09 00:12 then it will be put on the clean list, which is fine 2009-01-09 00:12 sorry 2009-01-09 00:12 not it won't 2009-01-09 00:12 there is no clean list ;) 2009-01-09 00:12 yes :) 2009-01-09 00:12 but we should put it back on the lru on write completion 2009-01-09 00:13 there was an attempt to have a clean list many years ago, and it ended up with lots of dirty pages on it 2009-01-09 00:13 yes, so, lru order may not right 2009-01-09 00:13 true 2009-01-09 00:13 good argument 2009-01-09 00:13 ah, yes 2009-01-09 00:13 iirc, hash and some list to manage pages 2009-01-09 00:13 it probably doesn't matter much though, inode lru order is more important 2009-01-09 00:15 so the question is, what happens if we put a page we have just cleaned on the cold end of the lru 2009-01-09 00:15 probably nothing bad 2009-01-09 00:16 if the page is actualy hot, it will be referenced before being evicted 2009-01-09 00:17 um... 2009-01-09 00:18 if under the heavy loads, after write out, it can be evicted? 2009-01-09 00:19 um..., e.g. writing pages cyclic 2009-01-09 00:19 on some range 2009-01-09 00:20 thre's supposed to be a grace period for a "rescue" 2009-01-09 00:20 I haven't looked at that for a while 2009-01-09 00:21 ok, where are my instructions for using tux3graph? 2009-01-09 00:22 found it 2009-01-09 00:33 -!- RazvanM(~RazvanM@96.234.240.153) has joined #tux3 2009-01-09 00:36 hirofumi, we put the page on the hot end of the inactive list 2009-01-09 00:36 and it has to go all the way to the cold end before being reclaimed 2009-01-09 00:37 if the inactive list is so short that pages are immediately reclaimed, that is a vmscan bug 2009-01-09 00:39 um... 
2009-01-09 00:46 it would work, but it is not perfect lru 2009-01-09 00:47 our lru has a reputation for being worse than random, so everything is relative ;) 2009-01-09 00:47 :) 2009-01-09 00:49 my funny bug with tux3fuse went away after I did nothing and came back with tux3fuse 2009-01-09 00:49 it would be ok, issue is not benchmark easily 2009-01-09 00:49 it's because of my devmap removal patch 2009-01-09 00:49 make clean? 2009-01-09 00:49 didn't help 2009-01-09 00:50 oh 2009-01-09 00:50 I wonder if something didn't get cleaned 2009-01-09 00:50 everything looks clean 2009-01-09 00:50 should I post the patch? 2009-01-09 00:51 or commit to repo 2009-01-09 00:51 sure 2009-01-09 00:51 ah, makefile has depends on tux3graph.o 2009-01-09 00:51 it does 2009-01-09 00:51 but .o's are removed by make clean 2009-01-09 00:52 ah, mistake. makefile has dependency of tux3use.o 2009-01-09 00:52 however it would be not used 2009-01-09 00:52 well, no problem though 2009-01-09 00:52 is is just unneeded 2009-01-09 00:53 commit a broken patch? 2009-01-09 00:53 probably will not stay broken too long 2009-01-09 00:53 I think it is ok 2009-01-09 00:54 for now 2009-01-09 00:54 here comes 2009-01-09 00:56 done 2009-01-09 00:58 also fails on a known-good filesystem image, so the bug is not in tux3 mkfs 2009-01-09 01:01 "volmap = new_inode()" would be bad 2009-01-09 01:02 reason? 
2009-01-09 01:02 volmap->map->ops should be devmap->ops 2009-01-09 01:02 :p 2009-01-09 01:02 of course 2009-01-09 01:04 ok, easiest thing to do is shell new_inode 2009-01-09 01:06 maybe, we can do it in tux_setup_inode() 2009-01-09 01:06 I'm not sure my little abstraction there is a good one anyway 2009-01-09 01:07 it's easy enough to shell it 2009-01-09 01:07 almost done 2009-01-09 01:07 don't let this cruft escape into a real program ;) 2009-01-09 01:07 well, it is not escape though 2009-01-09 01:08 userland tux_setup_inode() 2009-01-09 01:08 sure 2009-01-09 01:08 it will calls from new_inode 2009-01-09 01:08 we can check ->i_mode for checking it 2009-01-09 01:09 devmap ops has to be exported from buffer.c 2009-01-09 01:09 and tux_new_volmap()/tux_new_inode() will be shared from both 2009-01-09 01:09 ah 2009-01-09 01:09 um 2009-01-09 01:09 yes it does 2009-01-09 01:09 ah, new_map(NULL) does 2009-01-09 01:10 ah, that was me working around this before ;) 2009-01-09 01:11 works 2009-01-09 01:11 ok, good enough 2009-01-09 01:11 ok 2009-01-09 01:11 userland structures are getting a little crufty 2009-01-09 01:11 dev is now stored in two places 2009-01-09 01:11 sb and inode 2009-01-09 01:11 inode->map->dev 2009-01-09 01:12 how did you spot that so fast? 2009-01-09 01:12 tux_setup_inode()? 
2009-01-09 01:13 the bogus new_inode 2009-01-09 01:13 ah 2009-01-09 01:13 just reviewd 2009-01-09 01:14 normally, it should look like tux_new_inode() or tux_new_volmap() 2009-01-09 01:14 however, it was new_inode() 2009-01-09 01:14 I felt something strage 2009-01-09 01:14 right 2009-01-09 01:14 well good instinct 2009-01-09 01:14 you already fixed that for a similar reason in kernel 2009-01-09 01:14 yes 2009-01-09 01:15 inode setup is tux_new_inode or open_inode 2009-01-09 01:16 both calls tux_setup_inode() 2009-01-09 01:16 tux_new_inode() is delalloc stuff 2009-01-09 01:17 we can move the userspace stuff a little closer to kernel over time 2009-01-09 01:17 maybe 2009-01-09 01:18 ok, fix is pushed 2009-01-09 01:18 fuse stuff makes some exception though 2009-01-09 01:18 it can't get uid every where 2009-01-09 01:21 ah, new_inode_ops(), ok 2009-01-09 01:25 lazy fix 2009-01-09 01:25 I don't know if that bread abstraction actually does anything 2009-01-09 01:25 you were thinking of something very similar yourself, for kernel 2009-01-09 01:26 treat it as an idea on probation 2009-01-09 01:27 blockread() for everywhere? 
2009-01-09 01:27 blockread == bread 2009-01-09 01:28 we can just rename sb_bread() to vol_bread() for example 2009-01-09 01:28 yes 2009-01-09 01:28 and blockread(mapping(sb->volmap)) 2009-01-09 01:28 it would work 2009-01-09 01:29 and it can be shared by both 2009-01-09 01:30 I am also not sure whether there is any reason for blockread to take a mapping instead of an inode 2009-01-09 01:30 anyway, these issues are not very important right now 2009-01-09 01:31 more important is making lists of blocks 2009-01-09 01:31 yes 2009-01-09 01:32 so I can return to my changes to buffer.c, to implement multiple dirty lists, and we will do the same in kernel after it is prototyped 2009-01-09 01:32 I am thinking, in user space there will be one list of buffers for each buffer state 2009-01-09 01:32 a simple idea 2009-01-09 01:32 there is only one buffer state, EMPTY, that doesn't obviously need buffers on a list 2009-01-09 01:33 only needs dirty list? 2009-01-09 01:33 dirty per delta 2009-01-09 01:33 ah, yes 2009-01-09 01:33 we need two deltas for now, running and committing 2009-01-09 01:34 so we can fork bitmap blocks 2009-01-09 01:34 i see 2009-01-09 01:34 and btree nodes, to support the promises idea 2009-01-09 01:35 there should be a prototype by tomorrow that shows this so we can discuss it 2009-01-09 01:35 just for bitmaps 2009-01-09 01:35 ok 2009-01-09 01:35 I am feeling that atomic commit is not very far away 2009-01-09 01:36 good 2009-01-09 01:36 I can't imagine the whole of it yet though 2009-01-09 01:37 however, maybe not so far 2009-01-09 01:37 that's why I'm writing design notes every day and prototyping 2009-01-09 01:37 I can see it all pretty clearly now 2009-01-09 01:37 yes, it helps me a lot 2009-01-09 01:37 and it follows the original model, simplified 2009-01-09 01:38 this irc also helps a lot 2009-01-09 01:38 it does 2009-01-09 01:38 sometimes just talking about things gets the ideas clear 2009-01-09 01:39 and those talks help me too 2009-01-09 01:39 you saw how bitmap
flush has to be two passes, right? 2009-01-09 01:39 probably 2009-01-09 01:39 I think that there is no option, any time the allocation map is dynamically allocated 2009-01-09 01:40 which is probably all recent filesystems 2009-01-09 01:40 one pass to do the allocations, another pass to start the writeout 2009-01-09 01:41 it is interest to see zfs, btrfs, hammer strategy 2009-01-09 01:41 hammer avoided allocation entirely 2009-01-09 01:41 oh 2009-01-09 01:41 it logs, and it has a follow up reblocking pass 2009-01-09 01:42 reblocking frees up big chunks, something like 8 MB 2009-01-09 01:42 which are used for later logging 2009-01-09 01:42 I'm not sure what btrfs does 2009-01-09 01:42 or zfs 2009-01-09 01:42 or reiser4 2009-01-09 01:42 it moves data blocks on reblocking pass? 2009-01-09 01:43 yes 2009-01-09 01:43 oh, interesting strategy 2009-01-09 01:43 you would think that would be costly, but it isn't really 2009-01-09 01:43 why? 2009-01-09 01:43 because the reblocking does not need to do many seeks 2009-01-09 01:44 it costs bandwidth 2009-01-09 01:44 ah, it assuming rotating storage? 2009-01-09 01:44 yes 2009-01-09 01:44 if rewrite is slow, it is slow? 
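The per-delta dirty lists described at 01:32 to 01:34 (one dirty list per delta, with a running and a committing delta so that blocks can be forked) can be modelled in userspace roughly as below. All names here are hypothetical, and the fork copies the block data into a new buffer that stays on the earlier delta's list, which is one possible choice rather than the tux3 implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 64
#define DELTAS 2	/* running and committing */
#define CLEAN (-1)

struct buffer {
	unsigned char *data;
	int delta;	/* CLEAN, or index of the dirty list holding this buffer */
	struct buffer *prev, *next;
};

struct dirty_lists {
	struct buffer heads[DELTAS];	/* circular list heads, one per delta */
};

static void lists_init(struct dirty_lists *lists)
{
	for (int i = 0; i < DELTAS; i++) {
		lists->heads[i].data = NULL;
		lists->heads[i].delta = CLEAN;
		lists->heads[i].next = lists->heads[i].prev = &lists->heads[i];
	}
}

static void list_del(struct buffer *b)
{
	b->prev->next = b->next;
	b->next->prev = b->prev;
	b->next = b->prev = b;
}

static void list_add(struct buffer *b, struct buffer *head)
{
	b->next = head->next;
	b->prev = head;
	head->next->prev = b;
	head->next = b;
}

static int list_count(const struct buffer *head)
{
	int n = 0;
	for (const struct buffer *b = head->next; b != head; b = b->next)
		n++;
	return n;
}

static struct buffer *new_buffer(void)
{
	struct buffer *b = calloc(1, sizeof *b);
	if (!b)
		return NULL;
	b->data = calloc(1, BLOCK_SIZE);
	if (!b->data) {
		free(b);
		return NULL;
	}
	b->delta = CLEAN;
	b->next = b->prev = b;
	return b;
}

/* Dirty a buffer in the given delta, forking it if an earlier delta holds it. */
static int blockdirty(struct dirty_lists *lists, struct buffer *buffer, int delta)
{
	assert(delta >= 0 && delta < DELTAS);
	if (buffer->delta == delta)
		return 0;	/* already dirty in this delta */
	if (buffer->delta != CLEAN) {
		/* Fork: a frozen copy takes this buffer's place in the earlier delta. */
		struct buffer *frozen = new_buffer();
		if (!frozen)
			return -1;
		memcpy(frozen->data, buffer->data, BLOCK_SIZE);
		frozen->delta = buffer->delta;
		list_del(buffer);
		list_add(frozen, &lists->heads[frozen->delta]);
	}
	buffer->delta = delta;
	list_add(buffer, &lists->heads[delta]);
	return 0;
}
```

The caller keeps using the same `buffer` pointer after a fork, so later writes land only in the current delta while the committing delta still sees the frozen contents.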
2009-01-09 01:44 it also never rewrites 2009-01-09 01:45 only adds new deltas 2009-01-09 01:45 I'm not sure what happens when the volume fills up 2009-01-09 01:45 i see 2009-01-09 01:45 it does delete at some point 2009-01-09 01:45 sounds close to the logfs strategy 2009-01-09 01:45 it does 2009-01-09 01:46 it seems to be doing pretty well 2009-01-09 01:46 written by one person in a year or so 2009-01-09 01:46 great 2009-01-09 01:47 I can't see how he can work on so many parts of the kernel 2009-01-09 01:48 pretty special guy 2009-01-09 01:48 he may not be sleeping :) 2009-01-09 01:49 that's what I've been doing wrong 2009-01-09 01:49 I sleep almost every day 2009-01-09 01:49 me too :) 2009-01-09 01:50 int blockdirty(struct buffer_head *buffer) 2009-01-09 01:50 { 2009-01-09 01:50 if the buffer is dirty in an earlier delta { 2009-01-09 01:50 struct buffer_head *newbuffer = new_buffer(map); 2009-01-09 01:50 if (IS_ERR(newbuffer)) 2009-01-09 01:50 return PTR_ERR(newbuffer); 2009-01-09 01:50 void *data = buffer->data; 2009-01-09 01:50 buffer->data = newbuffer->data; 2009-01-09 01:50 newbuffer->data = data; 2009-01-09 01:50 remove the old buffer from the earlier delta list 2009-01-09 01:50 insert the new buffer on the earlier delta list 2009-01-09 01:50 } 2009-01-09 01:50 set the buffer dirty in the current delta 2009-01-09 01:50 insert the buffer on the current delta list 2009-01-09 01:50 } 2009-01-09 01:50 prototype of blockdirty, including fork 2009-01-09 01:50 pseudocode 2009-01-09 01:51 got to the point where I have to change lists, and started rewriting buffer code to support that 2009-01-09 01:53 looks good 2009-01-09 01:54 we make sure read side is in write lock range 2009-01-09 01:55 write lock range? 2009-01-09 01:55 oh 2009-01-09 01:55 right, now we can take the page lock 2009-01-09 01:55 page lock or something 2009-01-09 01:55 ah, yes 2009-01-09 01:56 um... 2009-01-09 01:58 and the write's effect has to propagate to the reader side 2009-01-09 01:59 i.e.
reader should be on new buffer if it has to read new buffer 2009-01-09 02:00 ? 2009-01-09 02:00 ah 2009-01-09 02:00 right 2009-01-09 02:00 well, e.g. we have to lock btree 2009-01-09 02:01 reader, as in reading buffer->data memory? 2009-01-09 02:01 yes 2009-01-09 02:01 and pointer to buffer->data 2009-01-09 02:01 right, some code has to be changed there 2009-01-09 02:02 I am thinking, maybe it would be good to make cursor->path.next an offset instead of pointer 2009-01-09 02:02 because when it is a pointer, we will have to go and fix it up on every blockdirty 2009-01-09 02:02 i see 2009-01-09 02:03 for now, it is not needed because of btree->lock? 2009-01-09 02:03 not needed because of no forking 2009-01-09 02:04 bitmap does? 2009-01-09 02:04 ah, btree 2009-01-09 02:04 right 2009-01-09 02:04 the path 2009-01-09 02:04 we will be forking btree nodes 2009-01-09 02:05 let me see if I can explain why 2009-01-09 02:05 if btree was protected by rw_sem, it is not needed? 2009-01-09 02:05 no 2009-01-09 02:05 it is not protected, probably 2009-01-09 02:06 promises need to be fulfilled eventually 2009-01-09 02:07 ah, I wonder if our simplified atomic commit gets rid of the need to fork btree nodes 2009-01-09 02:07 anyway, we certainly have to redirect 2009-01-09 02:07 yes 2009-01-09 02:07 that happens for example in insert_child 2009-01-09 02:07 and redirecting moves the buffer, changing buffer->data, and requiring .next to be updated 2009-01-09 02:09 later, we will also have btrees mapped into files, and then the buffer will be logically mapped and not have to be moved 2009-01-09 02:09 ah, write side itself 2009-01-09 02:46 oyasumi 2009-01-09 02:46 oyasumi 2009-01-09 04:19 -!- kushal(~kushal@115.109.13.48) has joined #tux3 2009-01-09 04:33 -!- pgquiles(~pgquiles@240.Red-88-22-55.staticIP.rima-tde.net) has joined #tux3 2009-01-09 05:09 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2009-01-09 05:50 interesting commits 2009-01-09 05:50
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commit;h=87d8fe1ee6b8d2f95076142d58c440dba4e7bdc2 2009-01-09 05:50 http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commit;h=6b082b531228c43d454c082fc0f969da1695b060 2009-01-09 05:51 it looks like ext* leaked jbh stuff on some case 2009-01-09 05:51 -!- cdk(~cdk@115.109.13.48) has joined #tux3 2009-01-09 05:51 we shouldn't this mistake again 2009-01-09 05:52 probably, don't use jbh like stuff 2009-01-09 08:26 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-09 09:26 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-01-09 09:42 start fsx-linux challenge on volmap 2009-01-09 10:27 I found one thing for blockdev 2009-01-09 10:36 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-09 10:36 -!- kushal(~kushal@115.109.13.48) has joined #tux3 2009-01-09 10:38 ah, no 2009-01-09 10:39 I thought if backing storage was ram, it may be able to share page caches 2009-01-09 10:57 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-09 12:04 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-09 12:38 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-09 12:42 -!- kushal(~kushal@115.109.13.48) has joined #tux3 2009-01-09 12:42 hey flips... 2009-01-09 12:42 hi kushal 2009-01-09 12:43 just wanted some help...if i have an inode...how do i get all the buffers in its map one by one...? 2009-01-09 12:44 as in buffer=blockget(mapping(inode),??) 2009-01-09 12:45 in kernel or in userspace? 2009-01-09 12:45 you do it like that 2009-01-09 12:45 userspace... 2009-01-09 12:45 the same in kernel and userspace 2009-01-09 12:45 so... you want to know which buffers are in its map? 2009-01-09 12:45 then you have to traverse the hash, that is the only way 2009-01-09 12:46 ok... 
2009-01-09 12:46 usually, if you need to know this, your design has a problem 2009-01-09 12:46 what is it you are trying to do? 2009-01-09 12:47 when there is a write of multiple blocks...i'm trying to find out the hash value for the buffers before there is a balloc... 2009-01-09 12:48 for the dedup? 2009-01-09 12:48 yes... 2009-01-09 12:48 ok, doing some high level optimization across buffers 2009-01-09 12:49 in userspace and kernel, traversing the hash is very different 2009-01-09 12:49 it's a radix tree in kernel 2009-01-09 12:49 ok... 2009-01-09 12:49 and in userspace? 2009-01-09 12:49 a hash table, see buffer.c 2009-01-09 12:50 ok... 2009-01-09 12:50 there is a peekblk function that will tell you if a buffer exists or not 2009-01-09 12:51 but that is not practical for walking all buffers, it is ok for checking neighbours though 2009-01-09 12:51 it is not implemented in kernel 2009-01-09 12:52 ok... 2009-01-09 12:52 -!- cdk(~chinmay@115.109.13.48) has joined #tux3 2009-01-09 12:53 why do you need to know the hash value before balloc? 2009-01-09 12:54 so that if there are duplicates, then we do not physically allocate any blocks for those duplicates... 2009-01-09 12:54 so .. if i want to get the logical block "index" i can just get it using blockget(mapping(inode),index) right ??? 2009-01-09 12:55 that gives you a buffer, I think that's what you mean 2009-01-09 12:55 it creates the buffer if it doesn't exist in the hash 2009-01-09 12:55 k 2009-01-09 12:56 sometime we should put that comment in the code :-/ 2009-01-09 12:57 kushal, but why not compute the hash just before the balloc? 2009-01-09 12:58 yes...we're doing it there only...in map_region...before the balloc for holes... 
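A minimal sketch of the distinction drawn above between blockget (creates the buffer if it is not in the hash) and peekblk (only reports whether it exists), using a toy per-map hash table. All toy_ names, the bucket count and the struct layouts are hypothetical and are not taken from tux3's buffer.c; only the create-on-miss versus look-only behaviour comes from the discussion.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_BUCKETS 16

/* Toy buffer: a logical block index plus a hash chain link (hypothetical layout). */
struct toy_buffer {
    unsigned long index;
    struct toy_buffer *next;
};

/* Toy map: one small hash table per inode mapping, plus a count of cached buffers. */
struct toy_map {
    struct toy_buffer *hash[TOY_BUCKETS];
    unsigned count;
};

/* Look only: return the buffer if it is already cached, else NULL. Never allocates. */
static struct toy_buffer *toy_peekblk(struct toy_map *map, unsigned long index)
{
    for (struct toy_buffer *b = map->hash[index % TOY_BUCKETS]; b; b = b->next)
        if (b->index == index)
            return b;
    return NULL;
}

/* Look, and create on a miss: returns a buffer, or NULL only if allocation fails. */
static struct toy_buffer *toy_blockget(struct toy_map *map, unsigned long index)
{
    struct toy_buffer *b = toy_peekblk(map, index);
    if (b)
        return b;
    b = malloc(sizeof *b);
    if (!b)
        return NULL;
    b->index = index;
    b->next = map->hash[index % TOY_BUCKETS];
    map->hash[index % TOY_BUCKETS] = b;
    map->count++;
    return b;
}
```

In this toy, probing neighbours with toy_peekblk leaves the count unchanged, while probing them with toy_blockget would populate the table as a side effect, which is presumably why a look-only call suits the "checking neighbours" case mentioned above.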
2009-01-09 13:00 good, that's the right place 2009-01-09 13:02 cdk, the peekblk function was created for looking at buffers without creating new ones 2009-01-09 13:03 we have to do some higher level optimization involving looking at groups of dirty buffers, actually pages, in kernel, but the design for that is not set yet 2009-01-09 13:03 yes...that will work better ... instead of blockget 2009-01-09 13:03 cdk, and it won't work in kernel 2009-01-09 13:04 if you find you really need it, we can implement it in kernel 2009-01-09 13:04 make a case for why you need it, on the mailing list 2009-01-09 13:04 ok.. 2009-01-09 13:05 will see if it works for us and then put up a post appropriately 2009-01-09 13:37 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-01-09 13:46 hi flips .. another thing... 2009-01-09 13:46 maze, happy new year 2009-01-09 13:46 hey 2009-01-09 13:47 just got back ;-) 2009-01-09 13:47 Happy New Year to everyone at tux3 as well! 2009-01-09 13:47 what is your new year's resolution, to carve your name in the tux3 hall of fame? 2009-01-09 13:47 junkfs didn't quite get you on the credits list ;) 2009-01-09 13:47 heh 2009-01-09 13:48 I ended up with no internet access and a broken laptop for a little over 2 weeks 2009-01-09 13:48 let me see, what issues are worthy of your considerable intellect 2009-01-09 13:48 still - it was a very nice change of pace 2009-01-09 13:48 ah, sucks 2009-01-09 13:48 I highly recommend eee... 
it's so cheap you don't care if it breaks 2009-01-09 13:48 you just get another one 2009-01-09 13:48 and you get used to running with less cpu 2009-01-09 13:48 hmm, small screen though 2009-01-09 13:49 10 inch isn't bad 2009-01-09 13:49 the size way makes up for it 2009-01-09 13:49 and everybody coos about it 2009-01-09 13:49 I've been stopped a bunch of times, people asking me if it's a real computer 2009-01-09 13:50 it's sitting on top of my cd burner, it's way smaller than the cd burner 2009-01-09 13:50 anyway, let's see, what is interesting 2009-01-09 13:51 hirofumi got us liberated from the whole block_dev.c hairball 2009-01-09 13:51 we're just about positioned to do some groovy things with cache 2009-01-09 13:52 buffer fork is a cool thing 2009-01-09 13:52 sorry, a quick run down of what the hairball in block_dev.c was? 2009-01-09 13:52 metadata blocks are handled by a special page cache... it's the recent incarnation of the buffer cache\ 2009-01-09 13:53 address_space->host normally points at an inode that belongs to the filesystem superblock 2009-01-09 13:54 but not in the case of the filesystem's buffer cache, for some reason it points at a specially allocated inode belonging to the block device 2009-01-09 13:54 I suspected that was bogus, it's certainly inconvenient when trying to write generic block access functions 2009-01-09 13:55 so... now tux3 is running fsx-linux stable, with the buffer cache in one of our own inodes 2009-01-09 13:55 pretty much proving that block_dev.c, which implements the cache operations for buffer cache, is completely bogus 2009-01-09 13:55 it's such a hairball that that isn't easy to see 2009-01-09 13:57 hopefully we will lose pretty much all of buffer.c over the next few months too 2009-01-09 13:57 with pretty compact code to replace it 2009-01-09 14:01 flips: hey 2009-01-09 14:01 hi bh 2009-01-09 14:02 cool! 
2009-01-09 14:02 less code is always a good thing 2009-01-09 14:03 and stripping away layers of cruft 2009-01-09 14:03 brb 2009-01-09 14:03 flips: while writing a 2 block file shouldn't map region do all the work in a single call ? 2009-01-09 14:04 cdk, it should, and it doesn't because of the way the block IO library calls our mapping code 2009-01-09 14:05 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-01-09 14:05 that is a big thing we are going to change 2009-01-09 14:05 not have the block library call our get_block one block at a time the way it does 2009-01-09 14:05 but instead, handle ->writepages with our own loop 2009-01-09 14:06 it will call map_region once for a whole set of logically contiguous dirty pages 2009-01-09 14:06 so i guess in userspace i am stuck with 1 block at a time ?? 2009-01-09 14:07 in userspace you're not really using blocks 2009-01-09 14:07 not at all, user space already implements region oriented transfers, see guess_region 2009-01-09 14:08 we don't implement ->get_block in userspace 2009-01-09 14:08 that's a really bad interface, well past the end of its useful life 2009-01-09 14:08 but deeply entrenched in kernel 2009-01-09 14:10 it makes sense only for simple fs's 2009-01-09 14:11 so while writing a 2 block file i should get a segment in map[] with map[].count =2 before the balloc call in map_region ?? 
2009-01-09 14:12 maze, it doesn't even make sense for simple fs's, it really obscures things 2009-01-09 14:12 hmm 2009-01-09 14:12 like the bh_NEW flag 2009-01-09 14:12 completely bogus 2009-01-09 14:12 I'd have thought it would make sense for fs'es along the lines of ext2 or fat 2009-01-09 14:13 if it doesn't it's probably just the interface/api that's broken - the idea itself should have merit 2009-01-09 14:13 using a buffer head as the interface to find out where physical blocks are mapped is perverse 2009-01-09 14:13 it wasn't meant for that 2009-01-09 14:13 and it's wrong to still be doing this, 30 years after that hack was first done 2009-01-09 14:15 the block_read/write_full_page part is good in spirit, it lets a block-oriented filesystem get up and running basically just by writing a ->get_block 2009-01-09 14:16 cdk, yes you should, in userspace 2009-01-09 14:16 if not, we broke something 2009-01-09 14:17 i am getting segments with count =1 even for files with multpile blocks.. 2009-01-09 14:17 cdk, check what is happening in guess_extent then 2009-01-09 14:18 soory 2009-01-09 14:18 sorry 2009-01-09 14:18 guess_region 2009-01-09 14:29 for a 3 block file i am getting the op after guess_region as ---- extent 0x0/1 --- 2009-01-09 14:31 ok, next thing to check is inode.c, tuxio() 2009-01-09 14:32 if that is being called one block at a time, then guess_region will map one block at a time 2009-01-09 14:32 hmm, no, not true 2009-01-09 14:34 it is tuxflush that drives filemap_extent_io 2009-01-09 14:34 yes 2009-01-09 14:35 so check and see why guess_region does not find the other contiguous dirty buffers 2009-01-09 14:40 putting a show_buffers(buffer->map) at start of map_region would be helpful 2009-01-09 14:41 ok 2009-01-09 14:49 ok, it's about time to stop committing trivial cleanups to buffer.c and do something fundamental that moves towards atomic commit 2009-01-09 14:49 that is, we need a number of different dirty lists 2009-01-09 14:49 its showing the 
buffers but none of them seem to be dirty 2009-01-09 14:49 I'm thinking, for now, one list for each buffer state 2009-01-09 14:50 cdk, you are calling from tux3, or tux3fuse? 2009-01-09 14:50 tux3fuse 2009-01-09 14:53 cdk, the problem is, tux3fuse is doing a sync_super on every tux3_write call 2009-01-09 14:54 just take that out, and hopefully sync_super is called from somewhere else 2009-01-09 14:54 if it works, send a patch :) 2009-01-09 14:54 ok...will do checking now 2009-01-09 14:55 sync_super is not called from anywhere else, and should be 2009-01-09 14:55 and it's called from one place without checking error 2009-01-09 14:57 check the trace and see if fuse calls tux3_flush and/or tux3_fsync, you can put sync_super there 2009-01-09 14:57 it's overkill, but it's still better than doing it on every write 2009-01-09 15:01 removing sync_super does not seem to solve it 2009-01-09 15:01 sync_super has to be called from somewhere else in fuse, and currently is not 2009-01-09 15:02 otherwise, the inode buffers will not be flushed 2009-01-09 15:02 so look at your trace output from fuse... run tux3fuse with make debug 2009-01-09 15:02 that runs fuse in the foreground so you can see the trace output 2009-01-09 15:02 every unimplemented function will show in the trace 2009-01-09 15:03 see iff tux3_flush is being called, that would be a good place to put a sync_super 2009-01-09 15:03 yes.....looking at it now 2009-01-09 15:04 I think this code was written in a hurry, and the author just wanted it to work, not work efficiently 2009-01-09 15:04 it should not be too hard to improve it 2009-01-09 15:05 -!- dcg(~dcg@235.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-01-09 15:10 moved sync_super to tux3_flush....still does not solve the regions problem 2009-01-09 15:13 when mounted in foreground ... should tux3_write(inode no.) be called just once for a file write?? 
2009-01-09 15:14 I don't know, I have not worked with fuse 2009-01-09 15:14 i am getting 3 calls for this file.. 2009-01-09 15:14 right 2009-01-09 15:14 and maybe the inode is being closed on each call 2009-01-09 15:15 if the inode is not being closed, then buffers should be dirty after the tuxwrite call 2009-01-09 15:16 you can put in a show_buffers(inode->map) to find out 2009-01-09 15:18 tux3_getxattr(e, 'security.capability') is being called after 2 tux3_write() calls. 2009-01-09 15:19 if (ino != FUSE_ROOT_ID) 2009-01-09 15:19 tuxclose(inode) 2009-01-09 15:20 tuxclose is also called on every lookup 2009-01-09 15:21 anyway, at the end of the second tux3_write call, you should see two buffers dirty 2009-01-09 15:22 unless something closed the inode between the two calls 2009-01-09 15:22 so, try putting a trace() into tuxclose 2009-01-09 15:23 there is also a stacktrace(), you can put that in to find out exactly where it is called from 2009-01-09 15:26 will do.... 2009-01-09 15:27 it's getting really late here...will get back to u.... 2009-01-09 15:29 cdk, good start 2009-01-09 15:29 sleep well 2009-01-09 15:29 thanks 2009-01-09 15:30 I'm happy somebody is looking at the fuse code 2009-01-09 15:30 it's a nice start, but could be much better 2009-01-09 15:32 flips, thanks for the help... 2009-01-09 15:32 :) 2009-01-09 15:32 any time 2009-01-09 15:46 maze, there? 
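The debugging thread above turns on how a region gets grown across contiguous dirty buffers before map_region is called. A rough sketch of that idea follows, assuming a plain array of dirty flags stands in for the buffer hash; toy_extent and toy_guess_region are hypothetical names and this is not the tux3 guess_region code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical extent: first logical block and number of blocks. */
struct toy_extent {
    size_t start;
    size_t count;
};

/*
 * Grow an extent around block `index` across neighbouring dirty blocks.
 * dirty[i] says whether logical block i is dirty; nblocks is the array length.
 * If `index` is out of range or clean, the returned extent has count 0.
 */
static struct toy_extent toy_guess_region(const bool *dirty, size_t nblocks, size_t index)
{
    struct toy_extent extent = { index, 0 };
    size_t first, last;

    assert(dirty != NULL);
    if (index >= nblocks || !dirty[index])
        return extent;

    first = last = index;
    while (first > 0 && dirty[first - 1])       /* extend towards lower blocks */
        first--;
    while (last + 1 < nblocks && dirty[last + 1])   /* extend towards higher blocks */
        last++;

    extent.start = first;
    extent.count = last - first + 1;
    return extent;
}
```

With flags {clean, dirty, dirty, dirty, clean, dirty} and a starting index of 2, the sketch yields start 1 and count 3. If every neighbour has already been cleaned by the time the region is grown, it can only ever yield count 1, which would match the "extent 0x0/1" symptom reported for the 3 block file; whether that is the actual cause is left open in the log.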
2009-01-09 16:54 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-09 17:34 ok, next phase of buffer.c changes: where we actually make it useful for atomic commit prototyping 2009-01-09 17:38 hmm, I think I committed buffer.c with trace_on 2009-01-09 17:39 that's going to make tux3 very chatty for a while 2009-01-09 18:17 int blockdirty(struct buffer_head *buffer) 2009-01-09 18:17 { 2009-01-09 18:17 if buffer is dirty in some earlier delta { 2009-01-09 18:17 struct buffer_head *newbuffer = new_buffer(buffer->map); 2009-01-09 18:17 if (IS_ERR(newbuffer)) 2009-01-09 18:17 return PTR_ERR(newbuffer); 2009-01-09 18:17 void *data = buffer->data; 2009-01-09 18:18 buffer->data = newbuffer->data; 2009-01-09 18:18 newbuffer->data = data; 2009-01-09 18:18 set new buffer state to buffer state 2009-01-09 18:18 move new buffer to earlier delta list 2009-01-09 18:18 } 2009-01-09 18:18 set buffer dirty in current delta 2009-01-09 18:18 move buffer to buffer map dirty list 2009-01-09 18:18 return 0; 2009-01-09 18:18 } 2009-01-09 18:18 the concept is, a buffer is always on two lists: the lru list and some other list 2009-01-09 18:19 the "some other" list might be free, clean, on the dirty list for a mapping, or on a dirty list for a delta 2009-01-09 18:20 if it's on the dirty list for a mapping, its dirty state will be the low couple of bits of the delta counter 2009-01-09 18:23 conversely, if its dirty state is not the current delta, then it will be on a global dirty list for the delta 2009-01-09 18:28 in kernel we will use the buffer b_assoc_buffers field for the list link 2009-01-09 18:28 and no lru 2009-01-09 18:31 instead of a buffer lru, we define a ->releasepage method, which is called by try_to_release_page, driven by vmscan.c to shrink caches 2009-01-09 18:33 probably a bad interface because it doesn't know whether our buffers are freeable or not 2009-01-09 18:33 but oh well 2009-01-09 18:33 it's what the kernel currently does, we can't fix everything in a 
day 2009-01-09 19:36 -!- kushal_(~kushal@121.246.34.33) has joined #tux3 2009-01-09 19:53 -!- cdk(~chinmay@121.246.34.33) has joined #tux3 2009-01-09 20:16 -!- Man_of_Wax(~wax@gualtiero.cs.unibo.it) has joined #tux3 2009-01-09 20:57 my four year old plays snakeball on "super pro" 2009-01-09 20:57 and sets the difficulty lower when she lets me play 2009-01-09 20:57 and sets it to super easy if mom plays 2009-01-09 20:58 let's see, how is buffer.c doing... seems to work, needs unit tests 2009-01-09 21:37 -!- aks(~project_t@123.237.65.159) has joined #tux3 2009-01-09 21:41 -!- elvyn(~ankit@123.237.65.159) has joined #tux3 2009-01-09 21:44 -!- aks(~project_t@123.237.65.159) has left #tux3 2009-01-09 22:29 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-09 23:06 -!- fqh(~fqh@218.13.196.58) has joined #tux3 2009-01-09 23:13 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-09 23:25 -!- RazvanM(~RazvanM@96.234.240.153) has joined #tux3 2009-01-10 01:48 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2009-01-10 01:59 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2009-01-10 02:11 now how to deal with this big fat patch I have made 2009-01-10 02:12 maze, awake? 2009-01-10 02:12 just falling asleep ;-) 2009-01-10 02:12 got a problem that might interest you next time you're awake 2009-01-10 02:12 you sleep at your 'puter? 2009-01-10 02:13 no more, like started falling asleep in front of it ;-) 2009-01-10 02:28 flips, there? 2009-01-10 02:28 hi 2009-01-10 02:29 current lru_buffers is what is meaning 2009-01-10 02:29 um... 2009-01-10 02:29 if buffer is on free_buffers, buffer shouldn't be on free_buffers? 2009-01-10 02:29 if buffer is on free_buffers, buffer shouldn't be on lru_buffers? 
2009-01-10 02:30 I thought about doing that 2009-01-10 02:30 but decided to leave lru_buffers alone 2009-01-10 02:30 but share lists for freed, empty, clean, and four kinds of dirty 2009-01-10 02:31 there is another patch coming in a few minutes so it will hopefully make more sense 2009-01-10 02:31 ok 2009-01-10 02:31 next patch coming gets rid of the hexdump in kernel/ileaf.c 2009-01-10 02:32 good 2009-01-10 02:32 filemap.c is also verbose for me though 2009-01-10 02:32 side effect is, gives kernel it's own hexdump with printk instead of printf 2009-01-10 02:33 http://hg.tux3.org/tux3/rev/0ac6527602a5 2009-01-10 02:34 this patch cleans up a bunch of things, I'm not as good as separating the changes as you 2009-01-10 02:34 ah, adds list_move to list.h, that is an unrelated change, sorry it escaped 2009-01-10 02:36 why don't we share hexdump() anymore? 2009-01-10 02:37 it required an include tux3.h, which messed up a bunch of includes in userspace 2009-01-10 02:37 it doesn't really need tux3.h 2009-01-10 02:37 it needs include/kernel.h though 2009-01-10 02:37 and it is better to have printk than printf in kernel code 2009-01-10 02:38 it's a small function 2009-01-10 02:38 just let it fork 2009-01-10 02:38 in fact, it turned out that some files in user where relying on the include tux3.h in hexdump.c, without knowing it 2009-01-10 02:39 just make a small change, like removing the include "hexdump.c" and things broke 2009-01-10 02:40 it is still true for all other files 2009-01-10 02:40 it is 2009-01-10 02:40 but it's much better than it was 2009-01-10 02:40 filemap.c was the worst, and now it is pretty robust 2009-01-10 02:41 ok, here comes the main change to buffer.c 2009-01-10 02:41 we can always revert if this buffer.c work is going in the wrong direction 2009-01-10 02:42 more good way would be compile kernel/* as library 2009-01-10 02:42 maybe that's what it's trying to be 2009-01-10 02:42 but it's not broken just now, or it's not the most broken thing 2009-01-10 
02:43 I like bottom up compiling 2009-01-10 02:43 making a library with a header, where everything can be defined in random order eliminates the good organinzing force that comes from strict bottom up 2009-01-10 02:43 you get that from source includes, not from separate compilation units with header files 2009-01-10 02:44 ah 2009-01-10 02:44 anyway, I'll commit the buffer.c change and we can discuss it 2009-01-10 02:45 ok 2009-01-10 02:46 buffer.c | 184 +++++++++++++++++++++++++------------------------------------- 2009-01-10 02:46 buffer.h | 1 2009-01-10 02:46 filemap.c | 1 2009-01-10 02:46 3 files changed, 77 insertions(+), 109 deletions(-) 2009-01-10 02:46 got smaller, a little 2009-01-10 02:47 moving hexdump() to kernel/hexdump() is just work 2009-01-10 02:47 just work? 2009-01-10 02:48 can compile without any error 2009-01-10 02:48 yes, it's robust 2009-01-10 02:48 ah, we can share hexdump() again 2009-01-10 02:49 remove hexdump() in kernel/hexdump.c, then move hexdump() in hexdump.c to kernel/hexdump.c 2009-01-10 02:49 we don't need tux3.h 2009-01-10 02:50 ok :) 2009-01-10 02:50 so, my concern was resolved 2009-01-10 02:50 :) 2009-01-10 02:51 that patch can commit to repo? 2009-01-10 02:51 which patch? 
2009-01-10 02:52 your buffer.c 2009-01-10 02:52 just now 2009-01-10 02:52 ok, the biggest change is, set_buffer_uptodate now asserts if the buffer was already uptodate 2009-01-10 02:52 it doesn't have to do that, it could be more lenient 2009-01-10 02:53 I think it is good thing 2009-01-10 02:53 I think so 2009-01-10 02:53 we should be precise about that 2009-01-10 02:53 now, each time the buffer state changes, in moves to a list 2009-01-10 02:53 there may be too many moves, but this is just the use space code 2009-01-10 02:53 we are not that worried about efficiency 2009-01-10 02:54 it also sounds good 2009-01-10 02:54 the point is to have lists of buffers in different states, and we are primarily interested in the buffers that are in different dirty states 2009-01-10 02:55 btw, we don't have hlist, how about convert buffer.c to it 2009-01-10 02:55 ? 2009-01-10 02:55 yes we could 2009-01-10 02:55 ok 2009-01-10 02:55 but I'm not really thinking about optimizing the userspace 2009-01-10 02:56 my hash table per inode is very wasteful 2009-01-10 02:56 my perpose is readability though 2009-01-10 02:56 for example 2009-01-10 02:56 ah, hlist has better type safety? 
2009-01-10 02:56 maybe, there is no big change 2009-01-10 02:57 anyway, the hashing code in buffer.c has never given any problems 2009-01-10 02:57 yes, it seems 2009-01-10 02:57 now, we are changing it 2009-01-10 02:57 besides the 7 per-state lists, a buffer can also be on an inode dirty list 2009-01-10 02:58 so, I'd like to change to hlist if possible 2009-01-10 02:58 it's fine with me 2009-01-10 02:58 not the most interesting project at the moment though 2009-01-10 02:59 ok, I am hoping we can do the same thing with dirty lists in kernel as in userspace 2009-01-10 02:59 we want to try it in userspace, see if the idea works 2009-01-10 02:59 ok 2009-01-10 02:59 I also played with the idea of getting rid of the buffer lru, and only doing lru on the clean list 2009-01-10 03:00 as we discussed last night 2009-01-10 03:00 I decided it is better not to experiment with that now 2009-01-10 03:00 by the way, I didn't do a lot of testing, sorry 2009-01-10 03:00 only added unit test to buffer.c yesterday 2009-01-10 03:01 well, no problem 2009-01-10 03:01 I think that the BUFFER_FREED state does not do anything useful 2009-01-10 03:01 there are other ways we can check for double free, and use of free buffer 2009-01-10 03:02 anyway, it doesn't hurt right now, just costs a couple of extra list moves 2009-01-10 03:03 it would be useful for preallocate in buffer.c 2009-01-10 03:03 I just noticed I left buffer trace on 2009-01-10 03:03 actually, state is not useful, but list_head is useful 2009-01-10 03:03 I need to commit a change to turn it off 2009-01-10 03:04 right 2009-01-10 03:09 in blockread(), we don't set buffer_uptodate? 
2009-01-10 03:10 no, the read function does that 2009-01-10 03:10 similar to kernel 2009-01-10 03:10 oh 2009-01-10 03:10 sorry, io function 2009-01-10 03:10 in kernel, a buffer goes uptodate in bh_end_io 2009-01-10 03:11 yes 2009-01-10 03:12 dev_blockread and dev_blockwrite are identical except for one word 2009-01-10 03:13 the interface from blockread/blockwrite should probably pass a rw flag, as it used to do in the past 2009-01-10 03:13 this is a very local change 2009-01-10 03:14 both is ok for me 2009-01-10 03:15 if it was separated, and if buffer can merge those, we can do in buffer.c localy 2009-01-10 03:15 ok, well the big point is, we now have an array of lists for dirty state, which we can use as our delta buffer lists 2009-01-10 03:15 if globaly can merge, it sounds like interface can merge 2009-01-10 03:16 this change is easy to make, it takes about 5 minutes, I've done it before ;) 2009-01-10 03:16 it's better for filemap.c 2009-01-10 03:16 which already takes a flag 2009-01-10 03:16 code will get shorter 2009-01-10 03:16 ah, yes 2009-01-10 03:19 in review, set_buffer_empty is using set_buffer_uptodate 2009-01-10 03:20 so, assert(!buffer_uptodate()) may not be true 2009-01-10 03:20 not true? 2009-01-10 03:20 ah 2009-01-10 03:20 set_buffer_empty is probably never called 2009-01-10 03:20 maybe, we can make empty from dirty? 2009-01-10 03:21 oh 2009-01-10 03:21 well, iirc, it is called from block free functions 2009-01-10 03:23 a wart to clean up 2009-01-10 03:24 in new_buffer(), eviction is only from buffer_uptodate 2009-01-10 03:25 is it !buffer_dirty and !buffer_free? 2009-01-10 03:25 looking 2009-01-10 03:26 my feeling is, only uptodate buffers can be evicted 2009-01-10 03:27 if we have a buffer_empty buffer with 0 use count, it is a bug 2009-01-10 03:27 um... 2009-01-10 03:27 and evicting a free buffer is obviously a bug 2009-01-10 03:27 free buffer on lru list 2009-01-10 03:28 if we call set_buffer_empty, after that, what do we do? 
2009-01-10 03:28 brelse() changes state? 2009-01-10 03:28 let me see where set_buffer_empty is called 2009-01-10 03:29 btree.c 2009-01-10 03:30 (and need a buffer free state) <- who wrote that comment, me? 2009-01-10 03:30 I don't know 2009-01-10 03:31 it was there 2009-01-10 03:31 I think we just want evict_buffer there 2009-01-10 03:31 well, it is kernel code 2009-01-10 03:31 so need to think 2009-01-10 03:32 what is set_buffer_empty in kernel? 2009-01-10 03:32 there is no empty 2009-01-10 03:33 it would be happened only blockdev 2009-01-10 03:33 ok, so some fuzz to clean up 2009-01-10 03:33 however, blockdev buffer can share with /dev/* 2009-01-10 03:33 or just for next getblk() 2009-01-10 03:33 how close are we to getting away from blockdev? 2009-01-10 03:34 volmap can't share anywhere from others 2009-01-10 03:35 I think eviction or leave are ok for us 2009-01-10 03:35 is our only tie to blockdev our super read/write now? 2009-01-10 03:35 current repo does't use volmap 2009-01-10 03:36 ah right, it was a demo patch 2009-01-10 03:36 in demo, it uses blockdev only first read of superblock 2009-01-10 03:36 only 2009-01-10 03:37 are you happy with that patch? 
2009-01-10 03:37 I'm not sure yet 2009-01-10 03:37 well, it seems to work 2009-01-10 03:37 at least, fsx-linux worked for 10 hours or so 2009-01-10 03:38 I think I would like to commit it early, and get it reviewed on lkml 2009-01-10 03:38 even before we start full review 2009-01-10 03:38 just that aspect of it 2009-01-10 03:38 I'd like to hear akpm's take on it 2009-01-10 03:38 and others 2009-01-10 03:38 good 2009-01-10 03:39 well, maybe, I think anybody don't care it though 2009-01-10 03:39 it feels like a really big cleanup to me 2009-01-10 03:39 if everybody doesn't care, that's fine 2009-01-10 03:39 I guess btrfs and xfs does like it 2009-01-10 03:39 but if there's a subtle problem, I'd like to know early 2009-01-10 03:40 yes, it's good 2009-01-10 03:40 with ->releasepage we should obey cache shrinking ok 2009-01-10 03:41 yes 2009-01-10 03:42 in demo, default try_to_free_buffer() just does it 2009-01-10 03:43 ah, because it really is a ring of buffers 2009-01-10 03:43 yes, it is everything normal 2009-01-10 03:44 it is just big file covering whole volme 2009-01-10 03:45 it turned out to be a very small amount of code to do that, didn't it? 2009-01-10 03:46 yes, we were having already almost all codes 2009-01-10 03:47 see, my map_ops in userspace only has one function 2009-01-10 03:47 I think that is all it ever will have, so it can be simplified 2009-01-10 03:48 instead of passing the ops struct, we can just pass the function for inode initialization 2009-01-10 03:48 it is possible that the generalization you were trying to do yesteday can be done like that 2009-01-10 03:49 we just add a field to tuxnode_t that has the ->blockio method 2009-01-10 03:50 tuxnode_t? 2009-01-10 03:50 tux_inode 2009-01-10 03:50 map_t? 
2009-01-10 03:51 we can't put anything in the kernel mapping, that is, address_space, we don't own it 2009-01-10 03:51 we don't own the definition 2009-01-10 03:51 ah, yes 2009-01-10 03:52 I thought for userland 2009-01-10 03:52 I was thinking of a possible way to use the same blockread in kernel for volmap and file page cache 2009-01-10 03:53 well, it already have ->readpage and ->write_begin 2009-01-10 03:53 that won't do when we get to phtree 2009-01-10 03:53 and doesn't work for the atom table 2009-01-10 03:54 why? 2009-01-10 03:54 atom table doesn't work in pages 2009-01-10 03:54 it's blocks 2009-01-10 03:55 in future? 2009-01-10 03:55 even now 2009-01-10 03:55 the phtree index is a clearer case 2009-01-10 03:55 it has to have independent blocks 2009-01-10 03:56 it might lock a parent block and two children, and they might all be on the same page, or different pages 2009-01-10 03:56 I don't know, maybe it works 2009-01-10 03:57 is it having own radix tree, or use volmap? 2009-01-10 03:58 own radix tree 2009-01-10 03:58 ok 2009-01-10 03:59 I think ->readpage would work 2009-01-10 03:59 however, it would not be efficient 2009-01-10 03:59 so, we will optimize it as per blocks 2009-01-10 03:59 maybe 2009-01-10 04:01 the first thing we will atomically commit is bitmaps 2009-01-10 04:01 yes 2009-01-10 04:01 we can work out the block api with that 2009-01-10 04:01 maybe, it is same state of current volmap 2009-01-10 04:02 bitmaps are in files 2009-01-10 04:02 not volmap 2009-01-10 04:02 so this is a good chance to make the blockread work for both 2009-01-10 04:02 blockget/blockread 2009-01-10 04:03 it meant phtree and volmap 2009-01-10 04:03 what about phtree and volmap? 2009-01-10 04:04 both needs block, not page? 2009-01-10 04:04 yes 2009-01-10 04:04 I would prefer to work with blocks with the bitmap too 2009-01-10 04:04 and both can read physically contiguous range to page? 
2009-01-10 04:05 the blocks for phtree won't be physically contiguous 2009-01-10 04:05 so, page is just wasted? 2009-01-10 04:06 they are logically contiguous, so cache pages will normally be full 2009-01-10 04:06 the directory file is not spare 2009-01-10 04:06 not sparse 2009-01-10 04:07 it sounds like ->readpage 2009-01-10 04:07 except it is not read as a page 2009-01-10 04:08 the different blocks on a page will typically be read at different times 2009-01-10 04:08 for optimization? 2009-01-10 04:08 just because of the way the probe works for example 2009-01-10 04:09 you might have 4 blocks at different tree levels on the same page 2009-01-10 04:09 probe will pick them up one at a time 2009-01-10 04:10 what is problem with strage readahead like ->readpage? 2009-01-10 04:10 what is the high level driver for the ->readpage? 2009-01-10 04:11 driver? blockread? 2009-01-10 04:11 ah, I must be misunderstanding then 2009-01-10 04:11 you think blockread should call tux3_readpage? 2009-01-10 04:12 maybe 2009-01-10 04:12 then how does it get a buffer? 2009-01-10 04:12 I'm just trying to delay optimizing 2009-01-10 04:12 it needs a buffer for locking 2009-01-10 04:12 it also needs a buffer for block dirtying, forking and redirect 2009-01-10 04:13 lock_page() -> lock_buffer for each buffer -> last callback unlock_page 2009-01-10 04:13 maybe, I'm not understanding the issue 2009-01-10 04:14 what is current issue? 2009-01-10 04:14 maybe I'm not understanding the proposal 2009-01-10 04:14 anyway, it is good we have a simple case to start with 2009-01-10 04:14 yes 2009-01-10 04:14 bitmap IO is pretty simple 2009-01-10 04:14 ok 2009-01-10 04:14 I'm looking at btree.c, line 378 2009-01-10 04:15 need to fix that, nice spotting 2009-01-10 04:15 that's really a bforget 2009-01-10 04:15 in brelse_free()? 2009-01-10 04:15 ok 2009-01-10 04:22 kernel bforget is not useful for us 2009-01-10 04:22 for delta? 
2009-01-10 04:22 it just does funny things with assoc mapping 2009-01-10 04:23 yes 2009-01-10 04:23 it removes all dirty state 2009-01-10 04:23 removes preparation of flush 2009-01-10 04:24 ok, set_buffer_empty is just a stub in kernel, that's not so bad, shrink caches will eventually get buffers back 2009-01-10 04:24 in userspace it's a bug as you pointed out 2009-01-10 04:24 so lets just comment it out 2009-01-10 04:25 ah 2009-01-10 04:25 the set_buffer_uptodate is the bug 2009-01-10 04:25 you're way ahead of me ;) 2009-01-10 04:26 :) 2009-01-10 04:26 6 struct buffer_head *set_buffer_empty(struct buffer_head *buffer) 2009-01-10 04:26 7 { 2009-01-10 04:26 8 - set_buffer_uptodate(buffer); // to remove from dirty list 2009-01-10 04:26 9 buffer->state = BUFFER_EMPTY; 2009-01-10 04:26 10 + list_move_tail(&buffer->link, buffers + BUFFER_EMPTY); 2009-01-10 04:26 11 return buffer; 2009-01-10 04:26 12 } 2009-01-10 04:27 ok? 2009-01-10 04:28 looks good 2009-01-10 04:28 dev_blockio patch is done too, I'll commit both 2009-01-10 04:29 ok 2009-01-10 04:33 I would like to change all the set_buffer_uptodate to set_buffer_clean 2009-01-10 04:34 what about in kernel? 2009-01-10 04:34 that's what I meant 2009-01-10 04:34 I don't really have a good excuse ;) 2009-01-10 04:34 I just said I would like to 2009-01-10 04:35 ah :) 2009-01-10 04:35 I'll leave it alone 2009-01-10 04:35 we warn't ever going to use mark_buffer_dirty 2009-01-10 04:36 we will always set 3 bits to set a buffer dirty 2009-01-10 04:36 well, I noticed that is ok 2009-01-10 04:36 the dirty bit, and two bits to indicate the delta 2009-01-10 04:36 yes 2009-01-10 04:37 ah 2009-01-10 04:37 however, mark_buffer_dirty may be good to tell vm 2009-01-10 04:38 or may not be 2009-01-10 04:38 time to read it now 2009-01-10 04:39 1195 smp_mb(); <- scary 2009-01-10 04:39 how did we run so many years without that? only needed on some arch? 2009-01-10 04:39 which file? 
2009-01-10 04:39 http://fxr.watson.org/fxr/source/fs/buffer.c?v=linux-2.6;im=bigexcerpts#L1195 2009-01-10 04:41 I don't think we care about setting the page dirty 2009-01-10 04:41 oh yes, there is the dirty page accounting bogosity 2009-01-10 04:42 it may be useful to get DIRTY_TAG? 2009-01-10 04:42 smp_mb() seems to be for optimization 2009-01-10 04:42 it syncs dcache in cpu, after that it make sure it has dirty state 2009-01-10 04:43 -!- cdk(~chinmay@121.246.34.33) has joined #tux3 2009-01-10 04:44 ah, the comment is good 2009-01-10 04:45 yes 2009-01-10 04:46 we don't care about marking the inode dirty, or the radix tree dirty 2009-01-10 04:46 so that is most of __set_page_dirty 2009-01-10 04:47 we can't race with truncate 2009-01-10 04:47 I thought radix tree tag would be useful to get dirty range 2009-01-10 04:48 not useful for metadata 2009-01-10 04:48 and we won't use buffers for file IO 2009-01-10 04:49 well, maybe it is useful for bitmaps, to know a range of dirty bitmaps 2009-01-10 04:49 ah, yes 2009-01-10 04:50 we will be fine doing bitmap IO a block at a time for now 2009-01-10 04:50 let the disk elevator sort that out 2009-01-10 04:51 that's ok 2009-01-10 04:51 the only slightly interesting thing is the dirty accounting 2009-01-10 04:51 well, I imagined extent 2009-01-10 04:51 one day far in the future we will have metadata extents ;) 2009-01-10 04:51 ah, yes 2009-01-10 04:52 for map_region 2009-01-10 04:52 that meant 2009-01-10 04:52 well, we have to pin page 2009-01-10 04:53 I think refcnt and dirty can pin page 2009-01-10 04:53 refcnt or dirty 2009-01-10 04:53 yes 2009-01-10 04:54 well, I am ready to finish the blockdirty/blockfork prototype now 2009-01-10 04:54 oh, good 2009-01-10 04:54 messing with buffer.c took longer than I expected 2009-01-10 05:08 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2009-01-10 05:12 oyasumi 2009-01-10 05:12 oyasumi 2009-01-10 05:55 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 
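For reference, the dirty-state encoding mentioned in this session (one dirty bit plus two bits naming the delta) could look roughly like the sketch below. The bit positions and the helper names buf_set_dirty, buf_set_clean, buf_delta and buf_dirty_in are assumptions made for illustration, not tux3 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout, for illustration only: bit 0 is the dirty bit,
 * bits 1-2 hold the low two bits of the delta that dirtied the buffer. */
enum {
	BUF_DIRTY = 1u << 0,
	BUF_DELTA_SHIFT = 1,
	BUF_DELTA_MASK = 3u << BUF_DELTA_SHIFT,
};

/* Set all three bits at once: dirty plus the two delta bits. */
static inline uint32_t buf_set_dirty(uint32_t state, unsigned delta)
{
	state &= ~(uint32_t)(BUF_DIRTY | BUF_DELTA_MASK);
	return state | BUF_DIRTY | ((delta & 3u) << BUF_DELTA_SHIFT);
}

/* Clear the dirty bit and the delta bits together. */
static inline uint32_t buf_set_clean(uint32_t state)
{
	return state & ~(uint32_t)(BUF_DIRTY | BUF_DELTA_MASK);
}

/* Which delta (modulo 4) dirtied this buffer; only valid when dirty. */
static inline unsigned buf_delta(uint32_t state)
{
	assert(state & BUF_DIRTY);
	return (state & BUF_DELTA_MASK) >> BUF_DELTA_SHIFT;
}

/* True if the buffer is dirty in the given delta (compared modulo 4). */
static inline bool buf_dirty_in(uint32_t state, unsigned delta)
{
	return (state & BUF_DIRTY) && buf_delta(state) == (delta & 3u);
}
```

With only two delta bits, deltas are distinguished modulo 4, so a scheme like this relies on at most a few deltas being in flight at any time.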
2009-01-10 06:14 -!- macan(~macan@xbl.dnsbl.oftc.net) has joined #tux3 2009-01-10 07:53 -!- dcg(~dcg@48.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-01-10 09:16 -!- fqh(~fqh@218.13.196.58) has joined #tux3 2009-01-10 12:04 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-01-10 13:07 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-10 15:08 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-10 15:20 ok, time to write blockdirty/fork, really 2009-01-10 15:21 hi 2009-01-10 15:22 you're up still up, or up early? 2009-01-10 15:22 I'm going to sleep soon 2009-01-10 15:22 still up :) 2009-01-10 15:22 yes :) 2009-01-10 15:22 I stopped around 5:30 am this morning 2009-01-10 15:22 I was searching zfs allocation strategy 2009-01-10 15:23 I found interesting blog 2009-01-10 15:23 http://blogs.sun.com/bonwick/entry/zfs_block_allocation 2009-01-10 15:23 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-10 15:24 I haven't read that one before 2009-01-10 15:24 maybe, I think we are doing similar thing? 
2009-01-10 15:25 there is no metaslab 2009-01-10 15:25 just assuming a linear device works pretty well 2009-01-10 15:25 however, logging allocation 2009-01-10 15:25 let the device spread things out for bandwidth 2009-01-10 15:26 "the point being that it's done on the fly by the filesystem, rather than at configuration time by the administrator" <- the truth is, the problem with linux LVM is it is unsuitable to being used from within a filesystem 2009-01-10 15:26 zfs twists the "lvm has no internal interface" argument into an argument that the filesystem has to implement lvm 2009-01-10 15:27 anyway 2009-01-10 15:27 we have to implement an allocation policy of some sort 2009-01-10 15:27 yes 2009-01-10 15:28 allocation groups, based on the number of blocks covered by one bitmap block 2009-01-10 15:28 128 MB for 4K blocks 2009-01-10 15:28 interesting thing was allocation extents and metaslab 2009-01-10 15:29 metaslabs sound like allocation groups, no? 2009-01-10 15:29 and logging allocation extents like logfs 2009-01-10 15:30 maybe, it is allocation groups 2009-01-10 15:30 I like the concept of weighting scheme 2009-01-10 15:30 i see 2009-01-10 15:31 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2009-01-10 15:31 logging allocation extents? 2009-01-10 15:32 ok, so what I am thinking about allocation groups... 
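The "128 MB for 4K blocks" figure follows from one bit per block: a 4096-byte bitmap block holds 32768 bits, and 32768 blocks of 4096 bytes is 128 MB. A small sketch of that arithmetic, with hypothetical helper names (group_blocks, group_bytes, block_to_group) that do not come from the tux3 source:

```c
#include <assert.h>
#include <stdint.h>

/* Blocks covered by one bitmap block: 8 bits per byte, one bit per block. */
static inline uint64_t group_blocks(unsigned blockbits)
{
	assert(blockbits < 30);	/* keep the shifts well inside 64 bits */
	return (uint64_t)8 << blockbits;
}

/* Bytes of volume covered by one bitmap block. */
static inline uint64_t group_bytes(unsigned blockbits)
{
	return group_blocks(blockbits) << blockbits;
}

/* Allocation group that a given block number falls into. */
static inline uint64_t block_to_group(uint64_t block, unsigned blockbits)
{
	assert(blockbits < 30);
	return block >> (blockbits + 3);
}
```

For blockbits = 12 this gives 32768 blocks and 134217728 bytes per group, matching the figure quoted in the discussion.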
2009-01-10 15:32 this is our unit of deciding whether to do bitmap allocation or extent allocation 2009-01-10 15:32 ah, I might point wrong url 2009-01-10 15:33 sorry 2009-01-10 15:33 http://blogs.sun.com/bonwick/en_US/ 2009-01-10 15:33 and search "Friday Sep 14, 2007" 2009-01-10 15:33 Space Maps 2009-01-10 15:33 found 2009-01-10 15:33 http://blogs.sun.com/bonwick/en_US/entry/space_maps 2009-01-10 15:33 got it 2009-01-10 15:35 we are also logging allocation 2009-01-10 15:35 sounds similar 2009-01-10 15:36 they seem to never roll up the allocation log 2009-01-10 15:36 yes 2009-01-10 15:36 maybe 2009-01-10 15:37 that must result in long startup time 2009-01-10 15:37 it seems optimize log 2009-01-10 15:37 we will get the same effect he talks about, using tree allocation 2009-01-10 15:38 that is, when all space fills up, we end up with a block of all ones 2009-01-10 15:38 that can be dropped and represented as a single extent 2009-01-10 15:38 yes 2009-01-10 15:38 so, we are logging allocation 2009-01-10 15:39 yes, I think the strategy is reasonable 2009-01-10 15:39 and I forget (maybe, not understand completey) why we need blockfork for bitmap 2009-01-10 15:39 one thing that has to be kept in mind about zfs, it runs about half the speed of ext3 on equivalent hardware 2009-01-10 15:39 blockfork forcely 2009-01-10 15:40 so the blog arguments have to be taken with some degree of skepticism 2009-01-10 15:40 i see 2009-01-10 15:40 we need fork for bitmap because allocating blocks for the bitmap on initial flush or redirect changes the buffered bitmaps 2009-01-10 15:41 without blockfork, this issue is hard to solve for all the corner cases 2009-01-10 15:41 btrfs must have this issue too 2009-01-10 15:41 because it comes up for any design where the allocation map is not preallocated 2009-01-10 15:42 however, we are using logging, so we don't touch bitmap data buffer? 
2009-01-10 15:42 not until we roll up the log 2009-01-10 15:42 yes 2009-01-10 15:42 then the recursion re-appears 2009-01-10 15:43 rollup write current bitmap before logging? 2009-01-10 15:43 block fork is a nice solution because it also will be very useful to support the asynchronous frontend/backend optimization 2009-01-10 15:43 yes 2009-01-10 15:43 it will write the bitmap buffers exactly as they are at time of delta transition 2009-01-10 15:44 I worry about forcely blockfork 2009-01-10 15:44 worried 2009-01-10 15:44 worried or just a question 2009-01-10 15:45 ah, yes 2009-01-10 15:45 I worried about that too, but your suggestion about doing the read_mapping_page solved my worries 2009-01-10 15:45 i see 2009-01-10 15:46 a kernel prototype can start tomorrow, after the user prototype arrives 2009-01-10 15:46 and we can check it for races 2009-01-10 15:46 my worry is just for effient or simpily 2009-01-10 15:47 ah, efficiency is not a problem with bitmap logging, the corner cases are rare 2009-01-10 15:47 but they are hard 2009-01-10 15:47 sorry, bitmap rollup 2009-01-10 15:47 i see 2009-01-10 15:48 in the case of async frontend/backend, the fork copy on write will always avoid a stall 2009-01-10 15:48 with it, I expect we may be able to use normal blockfork() stragety for bitmap 2009-01-10 15:48 I think the forks will be relatively rare in the async case too, but it's hard to completely analyze 2009-01-10 15:49 i see 2009-01-10 15:50 one thing about bitmap rollup, it is necessary to have two passes: 1) assign or redirect physical blocks 2) initiate IO 2009-01-10 15:50 hmm 2009-01-10 15:50 why do I think that? 
2009-01-10 15:50 I don't know 2009-01-10 15:51 I expected logging solves it 2009-01-10 15:51 maybe that is wrong, and we can just flush the bitmap inode 2009-01-10 15:51 oh, great 2009-01-10 15:52 I think we will leave all allocation strategy for after review begins 2009-01-10 15:52 it's pure optimization 2009-01-10 15:52 yes 2009-01-10 15:52 no correctness element, and some cases will run ok without any allocation policy 2009-01-10 15:52 untar of kernel tree for example 2009-01-10 15:52 i see 2009-01-10 15:53 also gives us something to focus on in review 2009-01-10 15:53 maybe, I don't have good knowlege of allocation policy 2009-01-10 15:54 it's a well known subject, in every respect except the effect of redirect 2009-01-10 15:54 marcin is a mathematicion who hangs on this list, and wants to do some modelling of those effects for us 2009-01-10 15:54 mathematicion 2009-01-10 15:55 i see 2009-01-10 15:56 a more important issue to address is ENOSPC 2009-01-10 15:56 ah, yes 2009-01-10 15:56 anyway, sk8 oclock for me 2009-01-10 15:56 ok 2009-01-10 15:56 and oyasumi time for you I think 2009-01-10 15:57 yes, oyasumi 2009-01-10 15:57 oyasumi 2009-01-10 17:41 maze, awake yet? 
2009-01-10 17:42 I'm awake, but I'm oncall atm, and very busy ;-( 2009-01-10 17:42 let us know when the fire is out 2009-01-10 17:42 or at least not spreading as fast 2009-01-10 18:06 folks 2009-01-10 21:54 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-10 23:18 ok, here comes, a prototype fork 2009-01-10 23:19 blockdirty works like mark_buffer_dirty as long as the delta doesn't change 2009-01-10 23:19 when the delta changes, it gets more interesting 2009-01-11 00:09 -!- RazvanM(~RazvanM@96.234.240.153) has joined #tux3 2009-01-11 00:11 for our sb->delta counter, how about we protect it with memory barriers instead of atomic_* 2009-01-11 03:58 -!- cdk(~chinmay@121.246.34.33) has joined #tux3 2009-01-11 03:59 hi hirofumi 2009-01-11 03:59 hi cdk 2009-01-11 03:59 thought I'd get in one last hack before zzz's 2009-01-11 03:59 ohh....yeah...must be quite late there 2009-01-11 04:00 just wanted to ask hirofumi something related to tux3graph 2009-01-11 04:00 kay, I'll go do what I should be doing 2009-01-11 04:02 btw...copied 2 files having similar blocks ... able to detect them...just wanted to see in graph that the physical mapping is done right.. 2009-01-11 04:02 shows up fine in the traces 2009-01-11 04:03 detect and point them to existing ones 2009-01-11 04:03 -!- kushal(~kushal@121.246.34.33) has joined #tux3 2009-01-11 04:03 congratulations 2009-01-11 04:04 thanks :) 2009-01-11 04:04 -!- gaurav(~gaurav@121.246.34.33) has joined #tux3 2009-01-11 04:04 it's worth a list post to update everybody 2009-01-11 04:05 yes ... gaurav and kushal working on it now...will send later 2009-01-11 04:06 -!- amey(~amey@121.246.34.33) has joined #tux3 2009-01-11 04:30 hi 2009-01-11 04:31 hi 2009-01-11 04:31 tux3graph bugs? 2009-01-11 04:31 no bugs 2009-01-11 04:31 need a small addition 2009-01-11 04:31 it sounds good 2009-01-11 04:32 want to print dtree for 2 files 2009-01-11 04:32 how do i do that ? 
2009-01-11 04:32 2 special files like bitmap/atable? 2009-01-11 04:32 not special 2009-01-11 04:33 2 general data files 2009-01-11 04:33 draw_file() is handler for S_IFREG 2009-01-11 04:34 now, it is called with btree 2009-01-11 04:35 yes 2009-01-11 04:35 so, with map_region(), it may be able to read data blocks 2009-01-11 04:35 currently it prints only complete dtree of inode 14 ... i want to print complete dtree of another file 2009-01-11 04:36 ah, tux3graph -v? 2009-01-11 04:36 ok... 2009-01-11 04:36 -v dumps all inodes 2009-01-11 04:36 ok....did not notice that 2009-01-11 04:36 will do 2009-01-11 04:36 ok 2009-01-11 04:37 done...thanks 2009-01-11 04:38 good 2009-01-11 05:13 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2009-01-11 06:16 -!- kushal(~kushal@115.109.15.208) has joined #tux3 2009-01-11 06:17 hi flips... 2009-01-11 06:41 -!- gaurav(~gaurav@115.109.15.208) has joined #tux3 2009-01-11 08:27 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-11 08:47 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-11 10:53 hey flips... 2009-01-11 10:53 -!- cdk(~chinmay@115.109.15.208) has joined #tux3 2009-01-11 11:05 -!- data(~data@echo489.server4you.de) has joined #tux3 2009-01-11 11:36 -!- amey(~amey@116.73.35.180) has joined #tux3 2009-01-11 12:09 hi flips 2009-01-11 12:53 flips, u there? 2009-01-11 12:56 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-11 13:03 kusah, here 2009-01-11 13:04 me and kushal together here ...we made changes to tux3.c and tux3fuse.c 2009-01-11 13:04 to remove inode entry from ileaf after delete 2009-01-11 13:04 want to submit patch 2009-01-11 13:05 post it to the mailing list? 2009-01-11 13:05 as u say .. 2009-01-11 13:05 do you have a mercurial repository on the web? 
2009-01-11 13:06 we created a .patch using hg diff on the current changeset 2009-01-11 13:06 -!- kushal(~kushal@115.109.15.208) has joined #tux3 2009-01-11 13:07 send it to you ? 2009-01-11 13:07 ah, good, just post the patch to tux3 mailing list for discussion 2009-01-11 13:08 ok....its a very minor change ... basically the tux3_unlink and delete tux3.c werent calling purge_inum added that 2009-01-11 13:09 also added sync_super to tux3_unlink 2009-01-11 13:09 ah, fine, it's still good to post to the list though 2009-01-11 13:09 big things and small, all good 2009-01-11 13:09 ok....sending it 2009-01-11 13:09 :) 2009-01-11 13:17 sent 2009-01-11 13:18 hey flips...another thing... 2009-01-11 13:18 replying on the list... 2009-01-11 13:19 ok... 2009-01-11 13:22 kushal, what was it? 2009-01-11 13:24 after versioning is implemented...what will happen to the older versions of the blocks being deleted..? 2009-01-11 13:25 if a new version of a file is deleted...will the blocks related to the older version be deleted with it... 2009-01-11 13:28 only if no version references the block 2009-01-11 13:29 this is important for deduplication of course 2009-01-11 13:29 yes...when will all versions of the file be deleted? 2009-01-11 13:29 we find out if any version references a block by examining all the versioned pointers in a dleaf 2009-01-11 13:30 ok... 2009-01-11 13:30 a version of a file can be deleted either by deleting it from the version, or deleting the entire version 2009-01-11 13:30 ok... 2009-01-11 13:31 on the patch...sorry for the mistakes...and we did attach the .patch file... 2009-01-11 13:32 ah, I didn't notice your attachment, and those were not mistakes 2009-01-11 13:32 :) 2009-01-11 13:32 just typical ways of doing things 2009-01-11 13:32 ok...we need to learn these fast... 2009-01-11 13:32 another thing.... currently all block updates are happening inplace ... ?? 2009-01-11 13:33 inplace? 2009-01-11 13:33 on the same physical block...no redirect .. 
and ofcourse in userspace 2009-01-11 13:34 yes 2009-01-11 13:34 we're just preparing to change that now 2009-01-11 13:34 in the next few days 2009-01-11 13:34 yes.u talked abt block redirect 2009-01-11 13:35 should i resend the patch? 2009-01-11 13:35 kushal, yes 2009-01-11 13:35 ok...sending 2009-01-11 13:35 if you send inline, and wrapping is ok, there is no need to also attach 2009-01-11 13:35 ok 2009-01-11 13:41 sent... 2009-01-11 13:42 this version looks unwrapped in our mail accounts, but still seems wrapped on the mailing list archive... 2009-01-11 13:43 i guess its better to stick to attachments :) 2009-01-11 13:43 some mailing lists prefer inline 2009-01-11 13:43 I am ok either way 2009-01-11 13:43 ok... 2009-01-11 13:44 it's still wrapped 2009-01-11 13:44 don't bother resending 2009-01-11 13:44 I'll unwrap it with an editor 2009-01-11 13:44 but try mailing to yourself to confirm it doesn't wrap 2009-01-11 13:45 already tested that...but i guess something messed up... 2009-01-11 13:46 sorry for the inconvenience... 2009-01-11 13:47 no problem, the changes looks right 2009-01-11 13:47 ok...thanks... 2009-01-11 14:06 i'm off...thanks for your time... 2009-01-11 14:23 ok, here goes, try putting together block forking, allocation logging, delta transition and bitmap flush... 2009-01-11 14:27 balloc_from_range: balloc 1 blocks from [0/0] 2009-01-11 14:27 change_end: ----- delta ------ 2009-01-11 14:27 balloc_from_range: balloc 1 blocks from [0/0] 2009-01-11 14:53 flips .. read ur reply to the patch mail... 2009-01-11 14:53 will keep that in mind.... 2009-01-11 14:53 make tests is nice and easy 2009-01-11 14:54 I usually also do a kernel build, unless I'm completely sure I didn't touch any kernel code 2009-01-11 14:54 thats what surprises me .. i did make and even used tux3fuse before sending the patch 2009-01-11 14:54 ah 2009-01-11 14:54 checked again after reading ur mail.. 2009-01-11 14:54 make there is a makefil bug? 2009-01-11 14:54 works fine here.. 
2009-01-11 14:55 mabe there is a makefile bug? 2009-01-11 14:55 no...i have made a few changes there...guess that let it slip through 2009-01-11 14:56 a patch to construct the inode in tux3fuse the same way as tux3.c would be welcome 2009-01-11 14:56 in theory, the compiler should generate the same code in either case 2009-01-11 14:56 ok.. 2009-01-11 14:56 in practice it's probably a little different, and it doesn't matter, details like that don't matter much in user space 2009-01-11 14:58 and abt the word wrapping part...seems gmail's web client wraps every thing :( ....setting up hg email now...wont make the same mistake again 2009-01-11 14:58 gmail sucks ;) 2009-01-11 14:58 :) 2009-01-11 14:59 just send as attachement only, it's fine 2009-01-11 14:59 if you have to use gmail 2009-01-11 14:59 ok 2009-01-11 15:00 in my email client, attachments show up inline, just like they were in the email text 2009-01-11 15:00 kmail rules 2009-01-11 15:00 the only difference is how I save the patch 2009-01-11 15:01 ah, also it is easier to reply to an inline patch with comments, because the patch will be quoted 2009-01-11 15:01 yes... 2009-01-11 15:04 gmail uses seem to have adapted by including a "warning: gmail wraps lines, so I have also attached" comment in their mails 2009-01-11 15:04 anyways...its very late here now...i am off...feels good to finally contribute :) 2009-01-11 15:04 good night :) 2009-01-11 15:04 you got your name in the hall of fame 2009-01-11 15:05 thanks for that.. 2009-01-11 15:05 well, kushal did, you will soon too 2009-01-11 15:06 well...we are working together on this thing....so yes ... i guess i will... 2009-01-11 15:06 one day .. 
2009-01-11 15:07 bye for now 2009-01-11 15:07 bye 2009-01-11 15:34 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-11 15:45 balloc extent -> [9/1] 2009-01-11 15:45 change_end: commit delta 0 2009-01-11 15:45 balloc_from_range: balloc 1 blocks from [a/5a] 2009-01-11 15:45 blockdirty: ---- fork buffer 0x8061f30 ---- 2009-01-11 15:45 balloc extent -> [a/1] 2009-01-11 15:45 replay: set 0x0/1 2009-01-11 15:45 replay: set 0x1/1 2009-01-11 15:46 things are starting to work 2009-01-11 15:46 ...out for a bit 2009-01-11 16:48 -!- macan(~macan@159.226.41.129) has joined #tux3 2009-01-11 21:29 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-11 21:47 -!- kushal(~kushal@115.109.12.19) has joined #tux3 2009-01-11 22:02 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2009-01-11 22:48 -!- RazvanM(~RazvanM@96.234.240.153) has joined #tux3 2009-01-11 23:50 well, there is an additional complication with flushing bitmaps 2009-01-11 23:50 the version of the buffer we want to write is the one before the fork 2009-01-11 23:51 which is no longer the data in the buffer by the time diskio is called 2009-01-11 23:51 so for this reason, just a straight flush_buffers on the bitmap table is not quite right 2009-01-11 23:53 the buffer we actually want to write is the one in that fork put on the dirty list for the delta 2009-01-11 23:56 why do we dirty bitmap data in balloc()? 
2009-01-11 23:56 filemap does it when it allocates a physical block for the bitmap block 2009-01-11 23:58 I thought about blockdirty() in balloc() 2009-01-11 23:58 good :) 2009-01-11 23:58 ok :) 2009-01-11 23:59 ah, it is blockdirty in balloc that does the fork 2009-01-11 23:59 this balloc came from filemap 2009-01-12 00:00 it forks just fine, but we are going to write out the wrong data 2009-01-12 00:00 I thought we are logging bitmap buffer change, instead of dirty 2009-01-12 00:00 at delta transition, only data in the earlier delta should be written 2009-01-12 00:01 we have to dirty the in-cache buffer 2009-01-12 00:01 to keep track of what is allocated 2009-01-12 00:01 we change buffer, however we don't need to dirty? 2009-01-12 00:02 we do need to fork 2009-01-12 00:02 ok, there's one more piece here 2009-01-12 00:03 we also redirect when the bitmap is flushed 2009-01-12 00:03 ok 2009-01-12 00:04 just checking, to see if that helps, by accident 2009-01-12 00:04 i see, I think I understand more or less what is issue 2009-01-12 00:05 I will check in my revised commit.c unit test 2009-01-12 00:05 which is mostly working pretty well 2009-01-12 00:05 there is one nice thing I noticed 2009-01-12 00:05 about forking in kernel 2009-01-12 00:06 one of the issues is, write endio has to go update buffer states on a page 2009-01-12 00:06 to do that, it should take the page lock 2009-01-12 00:06 but that is not possible in bio endio, which is in an interrupt, and page lock sleeps 2009-01-12 00:07 however, if we have a list of all the metadata blocks that were written out, we can walk the list updating buffer states, after the writes complete 2009-01-12 00:09 we need to wait on buffer clean, for each buffer, as an alternative to counting completions in the endio 2009-01-12 00:11 don't we use writeback? 2009-01-12 00:12 what about our metadata? 
2009-01-12 00:12 PG_writeback 2009-01-12 00:12 so far, not 2009-01-12 00:12 I doubt that breaks anything 2009-01-12 00:12 like block_write_full_page() 2009-01-12 00:12 if it's only for metadata 2009-01-12 00:13 PG_writeback is mainly to handle re-dirtying 2009-01-12 00:13 we don't allow re-dirtying of metadata in previous deltas, that's always a bug 2009-01-12 00:14 we will be handling regular file writeout with normal writeback mechanism for now 2009-01-12 00:14 so that will use PG_writeback 2009-01-12 00:19 ok, the commit prototype is updated 2009-01-12 00:19 make commit && ./commit foodev 2009-01-12 00:21 stage_delta? 2009-01-12 00:22 yes 2009-01-12 00:22 I should have removed the hexdumps 2009-01-12 00:22 it's sloppy to hexdump like that and not brelse 2009-01-12 00:25 ok, here is one resolution 2009-01-12 00:29 1) walk the bitmap dirty list, for each dirty buffer, map it to disk, then if it is dirty in earlier delta, move it to the earlier delta list, otherwise it is dirty in current delta, and fork must have already moved the correct data to the earlier delta list; 2) initiate IO on all the buffers in previous delta list 2009-01-12 00:29 I think that is what I had in mind when I said two passes would be needed 2009-01-12 00:30 i see 2009-01-12 00:31 I wonder how endio_bh_async gets away with modifying buffer dirty state without holding page lock 2009-01-12 00:31 in my dump mind 2009-01-12 00:32 I just logging bitmap data until rollback 2009-01-12 00:32 rollup you mean 2009-01-12 00:32 ah, yes 2009-01-12 00:32 this is the rollup 2009-01-12 00:32 we're writing the rollup now :) 2009-01-12 00:32 this is probably the trickiest puzzle in the whole scheme 2009-01-12 00:33 it meant blockfork in balloc() is part of rollup? 
2009-01-12 00:33 just for now, I thought it would be ok to do rollup on ever delta 2009-01-12 00:33 yes, blockfork for bitmap is part of the rollup 2009-01-12 00:34 i see 2009-01-12 00:34 we will use a different sequence of delta numbers for rollup, when there is less than one per delta 2009-01-12 00:34 there will be sb->delta and sb->rollup 2009-01-12 00:35 that is because we have a limited supply of positions in the dirty state 2009-01-12 00:35 anyway, when we are doing one rollup per delta, we can use the delta counter for rollup 2009-01-12 00:36 ah, initially bitmap buffer is not dirty 2009-01-12 00:36 when the filesystem is first created, yes, and after each rollup except if a bitmap block was allocated in rollup 2009-01-12 00:37 it is why do we see modified data? 2009-01-12 00:37 what is why? 2009-01-12 00:38 indeed, initially the bitmap buffer is not dirty, because it is mapped to a hole 2009-01-12 00:38 well, I expected first stage_delta dump is "00 00 00 00 00..." 2009-01-12 00:39 I think you're right, it should be 2009-01-12 00:39 so I have written some bugs :) 2009-01-12 00:40 if we allocated buffer insert to next delta, does it work? 2009-01-12 00:40 allocated buffer insert? 2009-01-12 00:40 newly dirty buffer 2009-01-12 00:40 in blockdirty? 2009-01-12 00:40 hole buffer 2009-01-12 00:41 -!- Man_of_W1x(~wax@gualtiero.cs.unibo.it) has joined #tux3 2009-01-12 00:41 -!- data`(~data@echo489.server4you.de) has joined #tux3 2009-01-12 00:41 ah, if it's not dirty, insert to next delta 2009-01-12 00:41 wait 2009-01-12 00:41 I think it is right, what is shown in dirty list 0 2009-01-12 00:42 we are flushing the _current_ bitmap state as at the _end_ of delta 0 2009-01-12 00:42 this retires any log allocations 2009-01-12 00:43 logged allocations 2009-01-12 00:43 and there is no efficiency advantage to logging in this case, if we flush once per delta 2009-01-12 00:45 ah, it already have right delta (dirty buffer)? 
2009-01-12 00:46 first stage_delta has no dirty buffers 2009-01-12 00:46 yes, it all seems right, except for the blocks that are forked during bitmap flushing 2009-01-12 00:47 if I add a delta transition right at the beginning, it should have no dirty buffers 2009-01-12 00:47 forked bitmap flushing sounds like right thing 2009-01-12 00:47 yes, it's very close to correct 2009-01-12 00:48 forked bitmap during bitmap flushing is good for memory pressure 2009-01-12 00:48 why? 2009-01-12 00:48 because we can free bitmap pages in previous delta 2009-01-12 00:48 true 2009-01-12 00:49 if we didn't write bitmap pages in delta, we can't free until some point 2009-01-12 00:49 exactly 2009-01-12 00:49 so, current commit.c seems to do right thing :) 2009-01-12 00:50 that is one of two reasons for rollup: 1) to keep the log from getting infinitely long 2) to avoid having memory full of pinned metadata 2009-01-12 00:51 i see. yes 2009-01-12 00:51 yes, I think it is nearly right. It needs to separate the mapping from the writing of bitmap blocks 2009-01-12 00:52 mapping? 2009-01-12 00:52 map_region? 2009-01-12 00:52 -!- ilan(ilan@captain.fonz.net) has joined #tux3 2009-01-12 00:52 yes 2009-01-12 00:53 mapping to disk, actually map_region is overkill 2009-01-12 00:53 just balloc is all we need 2009-01-12 00:53 um 2009-01-12 00:53 no, we need to set up dleaves 2009-01-12 00:53 so it's not overkill, it's just right 2009-01-12 00:55 which stage do we call map_region for bitmap... 
2009-01-12 00:56 1) walk the bitmap dirty list, for each dirty buffer, map it to disk (map_region), then if it is dirty in earlier delta, move it to the earlier delta list, otherwise it is dirty in current delta, and fork must have already moved the correct data to the earlier delta list; 2) initiate IO on all the buffers in previous delta list 2009-01-12 00:56 it would be stage_delta 2009-01-12 00:56 yes 2009-01-12 00:56 later, in rollup when we separate rollup from delta transition 2009-01-12 00:57 current commit.c does it already? 2009-01-12 00:57 yes 2009-01-12 00:58 it's almost exactly right, except for the issue of writing the wrong data for forked bitmap blocks 2009-01-12 00:59 so what I am proposing, is to drive all bitmap writeout from the dirty list for the delta 2009-01-12 00:59 write? in user/commit.c? 2009-01-12 00:59 ah, yes 2009-01-12 01:00 I was missing flush_buffers() 2009-01-12 01:03 in kernel, we have a handy place to cache the physical mapping of a logically mapped buffer, the buffer block 2009-01-12 01:04 in userspace, that treatment is a little different, we don't cache physical mappings 2009-01-12 01:04 so that is another logistical issue to sort out if mapping is separated from writing 2009-01-12 01:11 ok 2009-01-12 01:11 maybe, flush_buffers(mapping(sb->bitmap)) is wrong? 2009-01-12 01:11 ah, yes. you already know 2009-01-12 01:11 um... 
2009-01-12 01:13 at the point of writing out the wrong data, filemap_extent_io can see that it is about to do something wrong: the buffer it is writing is for the next delta 2009-01-12 01:14 yes 2009-01-12 01:19 ah, another fix is to get a pointer to the buffer data before mapping the buffer to disk, we could use a special map->io function for this 2009-01-12 01:20 that isn't too painful 2009-01-12 01:20 just use that map->io for the bitmap inode 2009-01-12 01:21 in kernel, if we add the per-inode diskio function like userspace map->io, we can use the same approach 2009-01-12 01:21 it's very easy to write if it only handles a block at a time 2009-01-12 01:41 I was thinking about to use "sb->delta + 1" for bitmap durning go to shop 2009-01-12 01:42 well, you can put your excellent mind to work on a better solution... I have a workable solution I have just coded 2009-01-12 01:42 might as well check in right now, it's just in the commit prototype 2009-01-12 01:43 if it was already done, it is great 2009-01-12 01:44 http://hg.tux3.org/tux3/rev/2045baebc405 2009-01-12 01:44 yes, there are a lot of other things to take care of 2009-01-12 01:44 can't get stuck on this one thing 2009-01-12 01:45 so, I am thinking that per-inode io methods make a lot of sense in kernel too 2009-01-12 01:46 where io = map + launch io 2009-01-12 01:46 it would be async io in kernel of course 2009-01-12 01:47 maybe ->mapio is a better name for the field 2009-01-12 01:47 just blockwrite()? 2009-01-12 01:48 all IO, actually 2009-01-12 01:48 i see 2009-01-12 01:51 for the volume mapping, we can't do opportunistic write of dirty buffers, but for directory data we can, that is, find neighbours that are also dirty and can be written in the same call 2009-01-12 01:52 write is flush_buffers() in userland 2009-01-12 01:52 we can call it blockwrite()? 2009-01-12 01:53 I guess 2009-01-12 01:53 so, filemap_extent_io -> blockwrite? 
2009-01-12 01:53 no 2009-01-12 01:53 userland is use ->io 2009-01-12 01:54 kernel is... 2009-01-12 01:54 maybe, generic part and ->writepage()? 2009-01-12 01:54 for now 2009-01-12 01:54 ah 2009-01-12 01:56 so in kernel, we will have two separate sets of ops, one for volume and one for file data 2009-01-12 01:56 they are almost the same 2009-01-12 01:56 so I think we should write both sets of operations, and then try to merge them by introducing a method to handle just the part that is different 2009-01-12 01:56 like currently apos? 2009-01-12 01:56 yes 2009-01-12 01:56 i see 2009-01-12 01:57 aops ;) 2009-01-12 01:57 address operations 2009-01-12 01:57 oh :) 2009-01-12 01:57 address_space operations, actually 2009-01-12 01:57 dumb name 2009-01-12 01:58 should be mapping operations 2009-01-12 01:58 anyway, my solution is still flawed 2009-01-12 01:59 what is problem? 2009-01-12 01:59 it only works if the bitmap being written is the one forked 2009-01-12 01:59 or if the bitmap being forked has already been written 2009-01-12 01:59 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2009-01-12 01:59 i see 2009-01-12 02:00 if the flush has not gotten to the bitmap that will be forked yet, flush will still write the wrong data 2009-01-12 02:00 so it was a nice try :) 2009-01-12 02:01 issus is buffer->data in bitmap_io? 2009-01-12 02:01 yes 2009-01-12 02:01 i see 2009-01-12 02:02 bitmap_io tried to solve the problem by storing the buffer->data before mapping, that only handles some cases 2009-01-12 02:03 what would work is: first save _all_ the buffer_data pointers for dirty bitmaps, then map all the dirty bitmaps, then write all the bitmaps 2009-01-12 02:04 it's a little more of a mess than I wanted, but it will work 2009-01-12 02:04 um... 2009-01-12 02:05 this issue is complex for me, I need more time 2009-01-12 02:05 if bitmap_io works, it is good 2009-01-12 02:05 bitmap_io doesn't work 2009-01-12 02:06 um... 
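The three-step idea proposed here (save all the data pointers for dirty bitmaps, then map them all, then write them all) might be sketched as below. This is a toy model under stated assumptions: struct buffer, flush_bitmaps and the map/submit callbacks are hypothetical stand-ins, not the tux3 buffer layer, and ownership of forked copies is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

enum { MAXDIRTY = 16 };

/* Toy buffer: just the fields this sketch needs. */
struct buffer {
	unsigned char *data;	/* current in-cache contents, may be swapped by a fork */
	long physical;		/* disk address, negative until mapped */
};

/* Hypothetical callbacks. map() assigns a disk block and, because it
 * allocates, may fork any dirty bitmap buffer by replacing its data. */
typedef long (*map_fn)(struct buffer *buffer, void *ctx);
typedef void (*submit_fn)(long physical, const unsigned char *data, void *ctx);

static int flush_bitmaps(struct buffer **dirty, int count,
	map_fn map, submit_fn submit, void *ctx)
{
	const unsigned char *stable[MAXDIRTY];

	assert(map && submit);
	if (count < 0 || count > MAXDIRTY)
		return -1;

	/* Phase 1: remember the pre-fork data of every dirty bitmap. */
	for (int i = 0; i < count; i++)
		stable[i] = dirty[i]->data;

	/* Phase 2: map everything; forks may now change buffer->data. */
	for (int i = 0; i < count; i++) {
		long physical = map(dirty[i], ctx);
		if (physical < 0)
			return -1;
		dirty[i]->physical = physical;
	}

	/* Phase 3: write the saved data, not whatever is in the buffer now. */
	for (int i = 0; i < count; i++)
		submit(dirty[i]->physical, stable[i], ctx);
	return 0;
}
```

Saving every pointer before any mapping is what covers the case described above, where mapping one bitmap block forks a different bitmap block that the flush has not reached yet.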
2009-01-12 02:06 it just fixed some cases 2009-01-12 02:06 i see 2009-01-12 02:06 block forking is working as designed, I think it is right 2009-01-12 02:07 but it only provides part of the solution 2009-01-12 02:07 I think we can leave me with this problem, and move on to block redirect 2009-01-12 02:07 I'm expecting we can use fork to solve bitmap issue 2009-01-12 02:08 good 2009-01-12 02:08 yes, fork will be part of the solution, no question 2009-01-12 02:08 I'm thinking fork can be use to get stable data 2009-01-12 02:08 can be used 2009-01-12 02:08 what it succeeds in doing is leaving the current state of the bitmap acessible via the bitmap inode, while creating read-only copies where necessary 2009-01-12 02:09 yes 2009-01-12 02:09 so, I'm thinking what is problem... 2009-01-12 02:09 we should be able to get stable data for bitmap too 2009-01-12 02:10 there is difference, but I think we can get 2009-01-12 02:10 there is difference with other buffers 2009-01-12 02:11 we are getting stable data, I'm just having a hard time doing the bookkeeping :) 2009-01-12 02:11 ah, yes 2009-01-12 02:11 we have stable data 2009-01-12 02:11 all other kinds of inodes are easier 2009-01-12 02:12 logging state and stable data is on different stage 2009-01-12 02:12 logging data 2009-01-12 02:12 yes 2009-01-12 02:13 so, I thought we may use "sb->delta + 1" 2009-01-12 02:13 but, it seems not enough 2009-01-12 02:13 just shifts the problem by one I think 2009-01-12 02:14 ah 2009-01-12 02:14 I expected the above different stabe issue can be solved 2009-01-12 02:14 maybe I left the wrong buffer on the mapping dirty list 2009-01-12 02:14 yes 2009-01-12 02:14 it can 2009-01-12 02:15 just leave me with it :) 2009-01-12 02:15 the big step was getting a unit test going, that includes most of atomic commit 2009-01-12 02:15 the biggest missing piece is redirect 2009-01-12 02:15 yes 2009-01-12 02:16 dleaf stuff? 
2009-01-12 02:16 not just dleaf 2009-01-12 02:16 also has to be done when splitting btree nodes 2009-01-12 02:16 ah 2009-01-12 02:16 also bitmap have to be redirected 2009-01-12 02:17 bitmap is also like dleaf? 2009-01-12 02:17 dleaf also has to be redirected 2009-01-12 02:18 start is map_region? 2009-01-12 02:18 we could start there 2009-01-12 02:19 it's a good place to start 2009-01-12 02:19 ok 2009-01-12 02:19 start with redirecting data blocks 2009-01-12 02:19 so I am thinking, if map_region is called with create = 2, then redirect 2009-01-12 02:20 create == 1 is what's for? 2009-01-12 02:20 um... 2009-01-12 02:20 for an ordered data writing mode 2009-01-12 02:21 i see 2009-01-12 02:21 we overwrite existing data, no redirect 2009-01-12 02:21 yes 2009-01-12 02:22 map_region redirect seems easy 2009-01-12 02:22 for (int i = 0; i < segs; i++) { <- this is where the redirect goes 2009-01-12 02:22 yes 2009-01-12 02:23 issue is old blocks 2009-01-12 02:23 yes, we will log frees for them, but not update the bitmap immediately 2009-01-12 02:24 i see 2009-01-12 02:26 the frees can be added into the bitmaps as soon as the delta has completed 2009-01-12 02:26 they can't be added in before, because it write IO could be launched on top of data in the previous stable image 2009-01-12 02:29 if (create == 2 && map[i].state != SEG_HOLE) { 2009-01-12 02:29 /* logging */ 2009-01-12 02:29 trace("block %Lx, count %x", map[i].block, map[i].count); 2009-01-12 02:29 } 2009-01-12 02:29 if (create == 2 || map[i].state == SEG_HOLE) { 2009-01-12 02:30 :) 2009-01-12 02:31 log_alloc(sb, seg->block, seg->count, 0); <- for a free 2009-01-12 02:32 yes 2009-01-12 02:32 now log_* has build issue 2009-01-12 02:33 log_alloc(sb, seg->block, seg->count, 1); <- for a filled in hole 2009-01-12 02:33 build issue? 
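The redirect loop discussed from 02:19 to 02:33 can be sketched roughly as follows, assuming `create == 1` only fills holes and `create == 2` also redirects already mapped segments. `SEG_HOLE` and the `(block, count, alloc)` argument order of `log_alloc` come from the chat. The `toy_` types, the `SEG_MAPPED` state, the enum values and the bump allocator are assumptions made for illustration. Whether redirect allocations are logged exactly like filled holes is also an assumption.

```c
#include <assert.h>

enum { SEG_HOLE = 1, SEG_MAPPED = 2 };	/* values are hypothetical */
enum { MAX_LOG = 16 };

struct toy_seg { long block; unsigned count; int state; };
struct toy_log_entry { long block; unsigned count; int alloc; };

struct toy_sb {
	struct toy_log_entry log[MAX_LOG];
	int logged;
	long next_free;		/* trivial bump allocator */
};

/* alloc == 1 records an allocation, alloc == 0 records a free. */
static void toy_log_alloc(struct toy_sb *sb, long block, unsigned count, int alloc)
{
	assert(sb->logged < MAX_LOG);
	sb->log[sb->logged++] = (struct toy_log_entry){ block, count, alloc };
}

static long toy_balloc(struct toy_sb *sb, unsigned count)
{
	long block = sb->next_free;
	sb->next_free += count;
	return block;
}

static void toy_map_segs(struct toy_sb *sb, struct toy_seg *map, int segs, int create)
{
	for (int i = 0; i < segs; i++) {
		int hole = map[i].state == SEG_HOLE;

		/* Redirect: log the free of the old extent. The bitmap itself
		 * is only updated once the delta has completed. */
		if (create == 2 && !hole)
			toy_log_alloc(sb, map[i].block, map[i].count, 0);

		/* Allocate for filled holes, and for every segment on redirect. */
		if (create == 2 || (create && hole)) {
			map[i].block = toy_balloc(sb, map[i].count);
			map[i].state = SEG_MAPPED;
			toy_log_alloc(sb, map[i].block, map[i].count, 1);
		}
	}
}
```

With `create == 1`, a mapped segment is left alone and overwritten in place, which matches the ordered data mode described at 02:20.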
2009-01-12 02:33 oh right 2009-01-12 02:33 it should be defined earlier 2009-01-12 02:33 log.c 2009-01-12 02:33 for a nice bottom up layering 2009-01-12 02:34 just add kernel/log.c and move the logging primitives there 2009-01-12 02:34 commit.c should be a very high level file 2009-01-12 02:34 i see 2009-01-12 02:35 shall I do that? 2009-01-12 02:35 otherwise, user space build will be a mess 2009-01-12 02:35 yes, please 2009-01-12 02:47 ok, it's in 2009-01-12 02:47 thanks 2009-01-12 02:48 user/Makefile does not know about it yet 2009-01-12 02:48 kernel builds :) 2009-01-12 02:48 actually, that was the first time I ever built it in kernel and it worked the first time 2009-01-12 02:49 ok 2009-01-12 02:50 btw, I was thinking about branch for atomic commit 2009-01-12 02:51 I was thinking about that too, but I do so many fixups to the existing code while I'm working on it 2009-01-12 02:51 on the other hand, I need experience working on a branch, which I have never done because I mainly used subversion up till now 2009-01-12 02:52 an alternative would be #ifdef ATOMIC_COMMIT 2009-01-12 02:52 maybe, #ifdef ATOMIC_COMMIT is hard to work with 2009-01-12 02:54 either that or a branch 2009-01-12 02:54 or don't care 2009-01-12 02:54 that's a third way 2009-01-12 02:55 break it, and if anybody complains, they have volunteered to help 2009-01-12 02:55 I think that is the best way 2009-01-12 02:55 least work for us 2009-01-12 02:55 yes 2009-01-12 03:32 http://userweb.kernel.org/~hirofumi/logging/logging.patch 2009-01-12 03:33 reading 2009-01-12 03:35 block %Lx, %x -> extent %Lx/%x, consistent with other output 2009-01-12 03:35 the patch depends on my local change though 2009-01-12 03:35 ok 2009-01-12 03:36 inode.c isn't the only caller of map_region, all callers have to have the logging set up now 2009-01-12 03:38 well, tests seem to pass 2009-01-12 03:38 could have create = 1 + !!sb->logbuf 2009-01-12 03:39 where is the log opened for tux3.c?
2009-01-12 03:39 create ==2 is called only commit.c 2009-01-12 03:39 ah, there are no unit tests for tux3.c or tux3fuse ;) 2009-01-12 03:40 ah, yes 2009-01-12 03:40 ah 2009-01-12 03:40 if (create == 2 && map[i].state != SEG_HOLE) 2009-01-12 03:40 log_alloc(sb, map[i].block, map[i].count, 0); 2009-01-12 03:40 this would be better 2009-01-12 03:41 tux3.c fix and logging free before balloc 2009-01-12 03:41 it looks good 2009-01-12 03:42 ah, tux3 has no problem 2009-01-12 03:42 how does it escape? 2009-01-12 03:42 it should be state == SEG_HOLE 2009-01-12 03:44 you can just do one big alloc for the whole region for create = 2 2009-01-12 03:45 then inside the loop, only handle the non-holes, by logging frees 2009-01-12 03:45 better than lots of little allocs 2009-01-12 03:46 easier in the other order 2009-01-12 03:47 ah, yes 2009-01-12 03:47 loop through, logging frees for the non-holes, then after the loop, alloc one bit extent for the whole region, store it, seg segs to 1 2009-01-12 03:48 s/seg/set/ 2009-01-12 03:50 it's oyasumi time 2009-01-12 03:51 oyasumi 2009-01-12 03:51 when I wake up I will have a solution to bitmap flushing :) 2009-01-12 03:51 great :) 2009-01-12 03:51 bye 2009-01-12 05:13 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2009-01-12 06:21 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-12 07:04 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-12 07:18 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-12 07:47 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-01-12 08:01 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-01-12 10:09 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-01-12 10:22 -!- kushal(~kushal@115.109.12.19) has joined #tux3 2009-01-12 10:22 -!- gaurav(~gaurav@59.95.2.48) has joined #tux3 2009-01-12 10:23 -!- cdk(~chinmay@115.109.15.10) has joined 
#tux3 2009-01-12 10:25 hi flips 2009-01-12 10:25 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-12 10:38 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-12 10:50 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-12 10:58 -!- tim_dimm_(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-12 11:06 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-12 11:28 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-01-12 11:44 -!- gaurav(~gaurav@59.95.5.166) has joined #tux3 2009-01-12 11:54 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-12 14:00 -!- ^LuPo^(~^LuPo^@adsl-ull-114-80.47-151.net24.it) has joined #tux3 2009-01-12 14:00 <^LuPo^> Hello everyone... ;) 2009-01-12 14:00 -!- ^LuPo^(~^LuPo^@adsl-ull-114-80.47-151.net24.it) has left #tux3 2009-01-12 15:06 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-12 15:53 -!- MaZe(~MaZe@12.176.154.6) has joined #tux3 2009-01-12 18:59 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-12 21:03 ok, next thing is to prototype redirect of a volume buffer 2009-01-12 21:04 - if buffer not dirty in current delta: 2009-01-12 21:04 - balloc new block 2009-01-12 21:04 - use cursor to update pointer in parent 2009-01-12 21:04 - log old free, new alloc and promise to update parent on disk 2009-01-12 21:04 - set buffer dirty in current delta 2009-01-12 21:04 - add old block to free-at-end-of-delta list 2009-01-12 21:04 - if physically mapped: 2009-01-12 21:04 - blockget new block 2009-01-12 21:04 - copy old buffer to new 2009-01-12 21:04 - replace buffer and next in cursor 2009-01-12 21:58 -!- kushal(~kushal@115.109.12.19) has joined #tux3 2009-01-12 22:06 -!- RazvanM(~RazvanM@96.234.240.153) has joined #tux3 2009-01-12 22:19 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3
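The volume buffer redirect steps listed at 21:04 can be sketched as below. This loosely follows the list and models only the "physically mapped" case, where a second buffer has to be obtained for the new block. The structures, the log record types and the caller-supplied `fresh` buffer (standing in for blockget) are all hypothetical and are not taken from the tux3 sources.

```c
#include <assert.h>
#include <string.h>

enum { BLOCK_SIZE = 16, MAX_LOG = 8, MAX_DEFREE = 8 };
enum toy_log_type { LOG_FREE, LOG_ALLOC, LOG_UPDATE };

struct toy_buffer {
	long block;		/* physical block this buffer caches */
	int dirty_delta;	/* delta it was dirtied in, -1 if clean */
	unsigned char data[BLOCK_SIZE];
};

struct toy_sb {
	int delta;
	long next_free;		/* trivial bump allocator */
	struct { enum toy_log_type type; long block; } log[MAX_LOG];
	int logged;
	long defree[MAX_DEFREE];	/* blocks to free at end of delta */
	int defrees;
};

static void toy_log(struct toy_sb *sb, enum toy_log_type type, long block)
{
	assert(sb->logged < MAX_LOG);
	sb->log[sb->logged].type = type;
	sb->log[sb->logged].block = block;
	sb->logged++;
}

/* Returns the buffer the caller (the cursor) should use from now on. */
static struct toy_buffer *toy_redirect(struct toy_sb *sb, struct toy_buffer *buffer,
	long *parent_ptr, struct toy_buffer *fresh)
{
	/* Already dirty in the current delta: nothing to redirect. */
	if (buffer->dirty_delta == sb->delta)
		return buffer;

	long oldblock = buffer->block;
	long newblock = sb->next_free++;	/* balloc new block */

	*parent_ptr = newblock;			/* update pointer in parent */
	toy_log(sb, LOG_FREE, oldblock);	/* log old free */
	toy_log(sb, LOG_ALLOC, newblock);	/* log new alloc */
	toy_log(sb, LOG_UPDATE, newblock);	/* promise to update parent on disk */

	assert(sb->defrees < MAX_DEFREE);
	sb->defree[sb->defrees++] = oldblock;	/* free at end of delta */

	/* Physically mapped: copy into the buffer for the new block. */
	fresh->block = newblock;
	memcpy(fresh->data, buffer->data, BLOCK_SIZE);
	fresh->dirty_delta = sb->delta;		/* dirty in current delta */
	return fresh;				/* replaces buffer in the cursor */
}
```

The old buffer is left untouched, so the previous stable image can still be read until the delta completes and the deferred free is applied.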
2009-01-12 22:27 -!- tim_dimm_(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-12 23:22 hey flips 2009-01-12 23:23 hi bh, wassup? 2009-01-12 23:23 just reading through some of my old code and about to redo something that I removed in the first place because multiple folks asked to simply it 2009-01-12 23:24 it doesn't generate the quality of result of my old code so I'm redoing it against their advice 2009-01-12 23:24 you don't know these things until you try it 2009-01-12 23:24 the dogs watching television effect 2009-01-12 23:24 what's that ? 2009-01-12 23:25 they look, they see, but they don't know what they see 2009-01-12 23:25 ok 2009-01-12 23:25 well it's close to that 2009-01-12 23:26 they din't like the complexity of my annotations in the code, but it's really the only way of guaranteeing that the results mean something in the measurement 2009-01-12 23:26 I'm putting it back and tell them that their suggestion didn't work well 2009-01-12 23:29 after this, it's going to be a crude EDF policy 2009-01-12 23:29 for a scheduler that's got higher static ordering than any of the real time priority or SCHED_OTHER 2009-01-12 23:29 that'll be interesting 2009-01-12 23:31 prototype of volume block redirect just about ready to check in 2009-01-12 23:31 will check in, because there is no user yet 2009-01-12 23:32 hirofumi should have fun with it 2009-01-12 23:33 how's development going gradually getting new stuff in ? 2009-01-12 23:33 sparse likes it 2009-01-12 23:33 you could say that 2009-01-12 23:33 good 2009-01-12 23:33 working out kernel issues and stuff ? 2009-01-12 23:34 yes, stuff like that 2009-01-12 23:34 making it awesome 2009-01-12 23:35 yeah, I've been noticing the content of the discussions have been of that nature 2009-01-12 23:35 have you figured out how to deal with snapshots yet ? and atomic commits ? 
2009-01-12 23:35 I mean, that's a lot right there 2009-01-12 23:36 yes 2009-01-12 23:36 I know you said that ext3 has goodies to help with that, but I can't imagine that it's sufficient for your purposes 2009-01-12 23:43 ext3's approach to atomic commit is completely scary 2009-01-12 23:43 it's amazing it works and a testament to the sheer determination of all those involved 2009-01-12 23:59 ACTION is afriad of thinking about that code 2009-01-13 00:00 afraid 2009-01-13 00:09 -!- data(~data@echo489.server4you.de) has joined #tux3 2009-01-13 00:37 -!- aks(~project_t@59.90.32.1) has joined #tux3 2009-01-13 00:38 -!- aks(~project_t@59.90.32.1) has left #tux3 2009-01-13 00:38 hi 2009-01-13 00:38 hi 2009-01-13 00:39 got prototype code for btree redirect, just about to post 2009-01-13 00:39 current commit.c is woking version? 2009-01-13 00:39 good 2009-01-13 00:39 current commit.c works 2009-01-13 00:39 bitmap flushing works 2009-01-13 00:39 I think log data is not current 2009-01-13 00:39 like I thought, the problem was solved while I slept 2009-01-13 00:40 or more truthfully, while I should have been sleeping ;) 2009-01-13 00:40 log data? 2009-01-13 00:40 :) 2009-01-13 00:40 map_region() will logs changes for next delta 2009-01-13 00:41 isn't that in your repository, not mine? 2009-01-13 00:41 map_region does in fact log changes for the next delta 2009-01-13 00:41 I think both have 2009-01-13 00:42 ACTION looks 2009-01-13 00:42 no log_ in my kernel/filemap.c 2009-01-13 00:42 ah, yes 2009-01-13 00:43 however, balloc() will logs it? 
2009-01-13 00:43 yes 2009-01-13 00:43 for the next delta 2009-01-13 00:43 sorry, current sources doesn't though 2009-01-13 00:43 let me see 2009-01-13 00:43 ah 2009-01-13 00:44 that is because I do the allocation logging in user/commit.c 2009-01-13 00:44 yes 2009-01-13 00:44 so it is about time to move that into balloc.c 2009-01-13 00:44 map_region() will allocate blocks for next delta 2009-01-13 00:45 it will launch IO immediately 2009-01-13 00:45 so the blocks it allocates are in the current delta 2009-01-13 00:45 however, it will not write out the bitmaps 2009-01-13 00:45 so it logs instead 2009-01-13 00:46 in write_bittmap()? 2009-01-13 00:47 write_bitmap just the bitmap flush, which we are doing every delta for now... but needs to be less than once per delta 2009-01-13 00:47 it's a little confusing, sorry 2009-01-13 00:47 write_bitmap() -> map_region() -> balloc() <= this balloc() is for next delta 2009-01-13 00:47 write_bitmap is a rollup 2009-01-13 00:47 oh 2009-01-13 00:48 we don't write bitmap data pages for each delta? 2009-01-13 00:49 we do for now, but very soon after we will only do it on rollup 2009-01-13 00:49 i see 2009-01-13 00:49 writing the bitmaps every delta will produce poor seeking behaviour 2009-01-13 00:49 I was thinking we will write initial state of bitmap data and btree 2009-01-13 00:50 ah, we should do that, the initial state will be created by tux3 mkfs 2009-01-13 00:51 and I thought every delta also does 2009-01-13 00:51 i.e. data + log is in delta 2009-01-13 00:51 that is true 2009-01-13 00:51 i see 2009-01-13 00:52 so, write_bitmap() -> map_region() creates data for next delta? 2009-01-13 00:52 at the end of a delta, for now, we will also have bitmap images that are nearly current, except for any new bitmap block allocations 2009-01-13 00:53 yes 2009-01-13 00:53 so, the images is for next delta? 
2009-01-13 00:53 next delta or current delta after increment 2009-01-13 00:53 they are correct for the next delta, except for the missing bits for the new bitmap table blocks 2009-01-13 00:54 that is only for now 2009-01-13 00:54 well 2009-01-13 00:54 actually, it is always true just after a bitmap flush 2009-01-13 00:54 yes 2009-01-13 00:55 to flush the bitmaps, we are basically dumping them in as part of the active delta 2009-01-13 00:55 so, map_region() -> balloc() <= this balloc is wrong or is needed some trick? 2009-01-13 00:56 it's correct 2009-01-13 00:56 that logs is flushing logmap soon? 2009-01-13 00:57 the balloc needs a log entry at the same place 2009-01-13 00:57 map_region() -> logs in balloc() 2009-01-13 00:57 I mispoke, balloc should not log 2009-01-13 00:57 the logging should go in filemap.c 2009-01-13 00:58 it may be possible to put it in balloc.c, we should see how the code looks first 2009-01-13 00:58 it will be more clear in filemap.c 2009-01-13 00:59 your patch last night looked basically right 2009-01-13 00:59 ah, that's is logs for free? 
2009-01-13 00:59 yes, you can put the log entries for allocs in the same place 2009-01-13 00:59 this logs is for balloc() 2009-01-13 01:00 right 2009-01-13 01:00 goes in the same place 2009-01-13 01:00 I think that was your question 2009-01-13 01:00 and it is for the allocations in the current delta 2009-01-13 01:00 blockdirty(buffer, tux_sb(inode->i_sb)->delta); 2009-01-13 01:00 set_bits(bufdata(buffer), found & mapmask, run); 2009-01-13 01:00 brelse(buffer); 2009-01-13 01:00 in balloc() 2009-01-13 01:00 sorry, balloc_from_range() 2009-01-13 01:00 right, there should be no logging there 2009-01-13 01:00 oh 2009-01-13 01:01 mainly because it's too confusing 2009-01-13 01:01 I was thinking we will log for allocation 2009-01-13 01:01 we can 2009-01-13 01:01 yes 2009-01-13 01:02 they way I would like to do it, is put the log entries at the site of the balloc caller 2009-01-13 01:02 so, I thought the above issue is appeared 2009-01-13 01:02 which issue? 2009-01-13 01:02 ACTION is slow sometimes 2009-01-13 01:02 map_region() -> balloc() -> log_alloc() 2009-01-13 01:02 oh I get it :) 2009-01-13 01:02 sorry, it would be my english problem 2009-01-13 01:03 why is it an issue? 2009-01-13 01:03 ah 2009-01-13 01:03 it is logging for next delta 2009-01-13 01:03 why is it the next delta? 2009-01-13 01:03 thinking about delta 0 2009-01-13 01:03 think about delta 0 2009-01-13 01:03 yes 2009-01-13 01:04 we changed some thing 2009-01-13 01:04 it logs to logmap 2009-01-13 01:04 then sb->delta++ for stage_delta() 2009-01-13 01:04 now, delta 1 2009-01-13 01:04 right, that map belongs to the current delta, not the next 2009-01-13 01:05 we call write_bitmap() -> map_region() -> log_alloc() 2009-01-13 01:05 this log_alloc() is not log to flush soon 2009-01-13 01:06 yes, but the delta counter is not used in that path, is it? 2009-01-13 01:06 it is only used for fork 2009-01-13 01:07 map_region()? 
2009-01-13 01:07 ah, yes 2009-01-13 01:08 ah, did you notice I stopped using map_region for writing bitmap blocks? 2009-01-13 01:08 no 2009-01-13 01:09 http://mailman.tux3.org/pipermail/tux3/2009-January/000649.html 2009-01-13 01:09 http://hg.tux3.org/tux3/rev/42fe4237d34b 2009-01-13 01:09 it is now I am seeing 2009-01-13 01:09 ah, it works well 2009-01-13 01:10 problem solved 2009-01-13 01:10 it is using map_region() to fork, and redirect and change btree 2009-01-13 01:11 oh sorry 2009-01-13 01:11 you are right 2009-01-13 01:11 it uses map_region indeed 2009-01-13 01:11 yes 2009-01-13 01:11 btw, if (buffer->state == sb->delta) 2009-01-13 01:12 should be ((buffer->state - BUFFER_DIRTY) == (sb->delta & (...))) or something 2009-01-13 01:12 it should 2009-01-13 01:13 oops :) 2009-01-13 01:13 well, maybe I see what are you trying to do 2009-01-13 01:13 and maybe it is right 2009-01-13 01:13 ok, let's go back to the issue you were describing 2009-01-13 01:14 however, maybe it has implementation issue 2009-01-13 01:14 ok 2009-01-13 01:14 at first, maybe we should have per delta logmap or something? 2009-01-13 01:15 what we need is sb->logbase 2009-01-13 01:15 ah, i see 2009-01-13 01:15 that tells us the index of the first log block in the delta 2009-01-13 01:15 then we can just keep incrementing forever, ever wrap to the beginning of the mapping 2009-01-13 01:15 s/ever/even/ 2009-01-13 01:15 ->logbase is per delta? 2009-01-13 01:16 yes 2009-01-13 01:16 so the number of log blocks in the delta will be sb->lognext - sb->logbase 2009-01-13 01:16 i see 2009-01-13 01:16 and that number will be stored in the delta commit entry in the final log block 2009-01-13 01:17 so we know how many log blocks to chain backwards, on replay 2009-01-13 01:17 wait 2009-01-13 01:17 it is assuming next log is not happen before flush? 
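The `sb->logbase` idea from 01:15 can be illustrated with a small sketch: the number of log blocks in a delta is `lognext - logbase`, and unsigned arithmetic keeps that difference correct even after the counters wrap. The field names come from the chat. The 32-bit width, the `toy_sb` type and the commit helper are assumptions.

```c
#include <assert.h>
#include <stdint.h>

struct toy_sb {
	uint32_t logbase;	/* index of the first log block in this delta */
	uint32_t lognext;	/* index of the next log block to be written */
};

/* Number of log blocks written so far in the current delta.
 * Unsigned subtraction is modular, so this survives counter wraparound. */
static uint32_t toy_log_blocks_in_delta(const struct toy_sb *sb)
{
	return sb->lognext - sb->logbase;
}

/* At delta commit the count would be stored in the commit entry so that
 * replay knows how many log blocks to chain backwards; the next delta
 * then starts where this one ended. */
static uint32_t toy_commit_delta(struct toy_sb *sb)
{
	uint32_t count = toy_log_blocks_in_delta(sb);
	sb->logbase = sb->lognext;
	return count;
}
```

This relies on log blocks of different deltas never being interleaved, as stated at 01:20.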
2009-01-13 01:18 log for next delta 2009-01-13 01:18 for now, not, because we only have one delta in flight at a time 2009-01-13 01:18 when we have multiple deltas, each has its own logbase 2009-01-13 01:19 logbase may not be enough 2009-01-13 01:19 log for delta 1 -> log for delta 0 -> log for delta 1 2009-01-13 01:20 if the above was happened, it seems not work 2009-01-13 01:20 oh, we will never interleave log blocks for different deltas 2009-01-13 01:20 i see 2009-01-13 01:21 ok, back to balloc() 2009-01-13 01:21 when we have separation between frontend and backend, all logging will be done by the backend, in staging 2009-01-13 01:21 i see 2009-01-13 01:21 back to balloc 2009-01-13 01:22 map_region() -> balloc() -> log_alloc() <- this 2009-01-13 01:22 now, we are logging to sb->logmap 2009-01-13 01:23 we should change ->logbase before that? 2009-01-13 01:23 write_bitmap() -> map_region() -> balloc() -> log_alloc() 2009-01-13 01:25 all that logging is part of the delta that starts at logbase 2009-01-13 01:25 i see 2009-01-13 01:25 we have incremented the delta counter to cause forking during staging, but not changed any of the other delta setup 2009-01-13 01:26 if (buffer->state - BUFFER_DIRTY == (sb->delta & (BUFFER_DIRTY_STATES - 1))) 2009-01-13 01:26 return -EAGAIN; 2009-01-13 01:26 this obviously needs a helper function 2009-01-13 01:26 yes 2009-01-13 01:27 well, I think sb->delta don't need to increment always 2009-01-13 01:27 um..., sb->delta = sb->delta % maxdelta 2009-01-13 01:27 I meant 2009-01-13 01:27 http://hg.tux3.org/tux3/rev/3279337c3d7d <- bug fix 2009-01-13 01:28 looks good to me 2009-01-13 01:28 sb->delta needs to increment so that bitmap buffers will fork 2009-01-13 01:28 yes 2009-01-13 01:28 ah 2009-01-13 01:28 right 2009-01-13 01:28 so we need to handle wrapping 2009-01-13 01:28 should be a fixme comment 2009-01-13 01:29 I think my expression above handles wrapping 2009-01-13 01:29 yes 2009-01-13 01:30 well, so if ->logbase was changed with 
delta 2009-01-13 01:31 bitmap seems to logging the logs to right position 2009-01-13 01:31 however, other inodes seems not right position 2009-01-13 01:32 example? 2009-01-13 01:32 map_region() for bitmap would be need to sb->delta logmap 2009-01-13 01:32 however, map_region() for other inode would be need to sb->delta - 1 logmap 2009-01-13 01:33 because the changes is for before delta increment 2009-01-13 01:33 the map_region for the bitmap also is for sb->delta - 1 2009-01-13 01:34 sb->delta++, sb->delta -1 in stage_delta()? 2009-01-13 01:34 we do sb->delta++, and it is sb->delta -1 after ++? 2009-01-13 01:35 I meant, sb->delta -1 means the logs is for flushing soon? 2009-01-13 01:35 yes 2009-01-13 01:35 the log belong to sb->delta - 1 after the increment 2009-01-13 01:36 write_bitmap() -> map_region() -> balloc() -> log_alloc() 2009-01-13 01:36 but, this log is for sb->delta? 2009-01-13 01:36 it's for delta - 1 2009-01-13 01:36 um... 2009-01-13 01:37 we are mapping bitmap data pages for sb->delta 2009-01-13 01:37 um.. 2009-01-13 01:37 ah, no 2009-01-13 01:38 we are changing btree for sb->delta 2009-01-13 01:38 ah 2009-01-13 01:39 we are though away new allocation? 2009-01-13 01:39 which new allocation? 
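The dirty-state test quoted at 01:26, which "obviously needs a helper function", might be wrapped as below. The expression itself is from the chat. The helper names and the numeric values of `BUFFER_DIRTY` and `BUFFER_DIRTY_STATES` are assumptions, and the mask only works if `BUFFER_DIRTY_STATES` is a power of two.

```c
#include <assert.h>

/* Values are assumptions; BUFFER_DIRTY_STATES must be a power of two. */
enum { BUFFER_DIRTY = 2, BUFFER_DIRTY_STATES = 4 };

/* Buffer state that marks "dirty in this delta". */
static unsigned toy_dirty_state(unsigned delta)
{
	return BUFFER_DIRTY + (delta & (BUFFER_DIRTY_STATES - 1));
}

/* True if the buffer was dirtied in the given delta. Masking the delta
 * counter makes the comparison keep working as the counter grows. */
static int toy_buffer_dirty_in_delta(unsigned state, unsigned delta)
{
	return state - BUFFER_DIRTY == (delta & (BUFFER_DIRTY_STATES - 1));
}
```

Because only the low bits of the delta are kept, two deltas that differ by a multiple of `BUFFER_DIRTY_STATES` are indistinguishable. That is harmless as long as fewer than that many deltas are in flight at once.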
2009-01-13 01:39 in stage_delta() -> write_bitmap() -> map_region() -> balloc() 2009-01-13 01:39 we log it, and the cached bitmap will be current 2009-01-13 01:40 so that allocation is in two places 2009-01-13 01:40 flips: night 2009-01-13 01:40 the cached bitmap version belongs to the next delta 2009-01-13 01:40 goodnight bh 2009-01-13 01:40 I'll be reading the logs and stuff 2009-01-13 01:40 we will try to make the interesting 2009-01-13 01:40 it already is 2009-01-13 01:41 it's just too bad that Linux blows at this kind of stuff and the core infrastructure isn't there 2009-01-13 01:41 I can imagine substantial changes are needed to the cache so that online checking can work 2009-01-13 01:41 it will work fine as it is 2009-01-13 01:44 well, so 2009-01-13 01:44 we are logging twice? 2009-01-13 01:45 stage_delta -> write_bitmap -> map_region -> balloc -> log_alloc 2009-01-13 01:45 you are right 2009-01-13 01:45 a mistake 2009-01-13 01:46 i see 2009-01-13 01:46 well, except that I didn't log at all 2009-01-13 01:46 not in my patch 2009-01-13 01:47 yes 2009-01-13 01:48 well, it was what I noticed 2009-01-13 01:50 ok, I will post my prototype for btree node redirect now, without writing much about it 2009-01-13 01:50 ok, thanks 2009-01-13 01:51 http://mailman.tux3.org/pipermail/tux3/2009-January/000650.html 2009-01-13 01:51 hmm, long lines in the mail 2009-01-13 01:52 I should have forced it to wrap 2009-01-13 01:53 well, it is better than patch is wrapping 2009-01-13 01:59 see the issue with redirecting the btree root? 2009-01-13 02:00 basically 0 is root? 2009-01-13 02:00 yes 2009-01-13 02:00 and we need to update level -1 2009-01-13 02:01 ah, i see 2009-01-13 02:01 in the case of a file btree, a tux_inode->btree.root needs to be updated, and eventually flushed to an inode table block 2009-01-13 02:02 we might want to use a promise there 2009-01-13 02:03 I need to see current code 2009-01-13 02:03 you mean, apply the patch? 
2009-01-13 02:03 no 2009-01-13 02:03 I want to search root changer 2009-01-13 02:03 insert_leaf and tree_chop 2009-01-13 02:03 good start 2009-01-13 02:04 all we do there is update the cached inode, vfs eventually flushes it out 2009-01-13 02:07 i see 2009-01-13 02:08 we may want to change root separately 2009-01-13 02:08 ah, no 2009-01-13 02:08 um... 2009-01-13 02:10 we need to flush the inode at delta transition, and store_attrs will do redirect() instead of mark_buffer_dirty 2009-01-13 02:11 or we might use blockdirty -> fork there 2009-01-13 02:11 i see 2009-01-13 02:11 hmm 2009-01-13 02:11 better to do the redirect for now 2009-01-13 02:12 ok, so the cursor does know where to find the pointer to the root, it is in the cached inode 2009-01-13 02:13 the pointer to the root of the itable is in the cached disksuper 2009-01-13 02:14 cursor->btree->root is not enough? 2009-01-13 02:15 the pointer to the root needs to be updated on disk 2009-01-13 02:15 ah 2009-01-13 02:15 if the root is redirected to a new location 2009-01-13 02:20 i see, interesting issue 2009-01-13 02:20 redirect is going to be a new btree op 2009-01-13 02:21 ah, i see 2009-01-13 02:21 do itable and file btree have their own redirect method, which shares common code for the nodes and handles root specially 2009-01-13 02:21 s/do/so/ 2009-01-13 02:22 in fact, we do not have btree ops right now 2009-01-13 02:22 only leaf operations 2009-01-13 02:26 ah, this talk teach me current problem 2009-01-13 02:26 map_region() -> btree_insert_leaf() may change btree->root 2009-01-13 02:26 however, it doesn't dirty inode 2009-01-13 02:27 ah 2009-01-13 02:27 we're just lucky it never triggered corruption then 2009-01-13 02:27 yes 2009-01-13 02:27 maybe, timestamp would be dirty inode 2009-01-13 02:29 cursor may want to point inode? 
2009-01-13 02:29 maybe 2009-01-13 02:30 i see 2009-01-13 02:30 or the btree can point at the inode 2009-01-13 02:30 ah, i see 2009-01-13 02:31 remove btree->sb and add btree->inode maybe 2009-01-13 02:32 ah 2009-01-13 02:32 itable doesn't have inode 2009-01-13 02:32 :) 2009-01-13 02:32 volmap? 2009-01-13 02:32 let its inode be the volmap? 2009-01-13 02:33 i see 2009-01-13 02:33 just an idea 2009-01-13 02:33 it may clean 2009-01-13 02:33 I need to check code 2009-01-13 02:34 we have lots of btree->sb->blocksize in ileaf.c, a wrapper would be good 2009-01-13 02:35 and that will become btree->inode->sb->blocksize 2009-01-13 02:35 btree can point at the tux_inode 2009-01-13 02:36 same with all the usage in dleaf.c 2009-01-13 02:36 or if we have inode for btree always, we may pass inode instead of btree 2009-01-13 02:36 also a good idea 2009-01-13 02:37 maybe a better idea 2009-01-13 02:38 if we pass inode everywhere then we can get rid of btree->sb 2009-01-13 02:38 yes 2009-01-13 02:39 I think it is the right thing to do 2009-01-13 02:39 big patch :) 2009-01-13 02:41 itable in volmap may can be confusable, or may not be... 2009-01-13 02:41 um... 2009-01-13 02:41 it seems to make sense to me 2009-01-13 02:43 just for cleaness, volmap and itable inode may be clean, but .... 2009-01-13 02:43 so, flush(volmap) and flush(itable) are both is right 2009-01-13 02:43 we never do those though 2009-01-13 02:44 our flush(volmap/itable) is a delta transition 2009-01-13 02:44 yes 2009-01-13 02:44 however, itable doesn't have buffer, volmap is for it? 2009-01-13 02:45 buffer? 2009-01-13 02:45 ileaf buffer 2009-01-13 02:45 the root? 
2009-01-13 02:45 ah 2009-01-13 02:45 it will be rooted in a metablock 2009-01-13 02:45 for now, rooted in the superblock 2009-01-13 02:46 and we should put the superblock in a buffer 2009-01-13 02:46 it's a bit messy the way it is now (my fault) 2009-01-13 02:46 I think that is the laste thing we will do, to finish atomic commit 2009-01-13 02:46 i see 2009-01-13 02:47 ah 2009-01-13 02:47 we may actually use itable for inode? 2009-01-13 02:47 because itable can be file 2009-01-13 02:47 that's an interesting idea 2009-01-13 02:47 in future 2009-01-13 02:49 how would you find that inode? 2009-01-13 02:49 inode is virtual inode like logmap 2009-01-13 02:50 did you mean, put the itable root in an inode? 2009-01-13 02:50 yes 2009-01-13 02:51 an on-disk inode, or cached? 2009-01-13 02:51 however, special handling is needed 2009-01-13 02:51 just cached 2009-01-13 02:51 fine 2009-01-13 02:51 it's good 2009-01-13 02:52 so, I thought ileaf may can be in page cache of itable inode 2009-01-13 02:53 so the bottom layer of the inode table index is mapped in page cache? 
2009-01-13 02:53 well, volmap->btree also should be ok for now 2009-01-13 02:53 yes, maybe 2009-01-13 02:54 find_get_page(inum & (PAGE_CACHE_SIZE - 1)) 2009-01-13 02:54 no 2009-01-13 02:54 it would require looking up in two btrees to find an inode, first looking would be in a file btree, second would be a tree mapping the ileaf cache 2009-01-13 02:54 the itable has to be a btree because of variable sized inodes 2009-01-13 02:55 it's an interesting and scary idea ;) 2009-01-13 02:55 current itable become like data btree for file 2009-01-13 02:56 and ileaf is in file data pages 2009-01-13 02:56 right, I understand the suggestion 2009-01-13 02:56 there also has to be an additional btree, to find the right ileaf 2009-01-13 02:56 like htree 2009-01-13 02:57 which originally was mapped into a file 2009-01-13 02:57 before it got ported to ext3 2009-01-13 02:57 i see 2009-01-13 02:57 there are advantages: most lookups will be in the shallower btree mapped into page cache 2009-01-13 02:58 however, I think it is more complex 2009-01-13 02:58 if complex, it is too bad 2009-01-13 02:59 we might consider using the page cache as an inode lookup accelerator 2009-01-13 02:59 btw, why is it needed two btree? 
2009-01-13 02:59 because the ileaf blocks contain a variable number of inodes 2009-01-13 03:00 you can't directly index by inum 2009-01-13 03:00 ah 2009-01-13 03:00 i see 2009-01-13 03:00 I'm crazy 2009-01-13 03:00 not completely ;) 2009-01-13 03:00 thanks :) 2009-01-13 03:00 I seriously considered your suggestion a couple of weeks ago 2009-01-13 03:01 including the complexity of having a secondary index 2009-01-13 03:01 i see 2009-01-13 03:02 what we could do is use a page cache as a volatile cache to map inum -> ileaf block address 2009-01-13 03:02 i see 2009-01-13 03:02 there would be 2^9 addresses/block 2009-01-13 03:03 so mapping 2^48 inums would require 2^39 page cache index range 2009-01-13 03:03 out of range for 32 bit systems 2009-01-13 03:04 to be useful, it would depend on inums being tightly clustered together 2009-01-13 03:04 i see 2009-01-13 03:04 probably a bad idea right now 2009-01-13 03:05 fancy cursors is a cheaper inode lookup accelerator approach 2009-01-13 03:06 anyway 2009-01-13 03:06 are we going to change btree * to inode * in all the btree functions? 2009-01-13 03:07 um... 2009-01-13 03:07 maybe, we can delay to do it 2009-01-13 03:07 good 2009-01-13 03:08 block forking in kernel is a more interesting question 2009-01-13 03:08 yes 2009-01-13 03:11 your logging change to filemap.c is welcome any time 2009-01-13 03:11 I think we should put it in, and make that part work 2009-01-13 03:12 ok 2009-01-13 03:12 that is, create = 2 2009-01-13 03:12 yes 2009-01-13 03:13 we need to consider the question of where begin_change goes for filemap 2009-01-13 03:14 ->write_begin? 2009-01-13 03:14 ah 2009-01-13 03:15 userland is where... 2009-01-13 03:16 let me see 2009-01-13 03:16 not anywhere yet 2009-01-13 03:16 yes 2009-01-13 03:16 tuxwrite or map_region? 
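The index range arithmetic from 03:02 and 03:03 can be checked with a short sketch. The figures of 2^9 addresses per block and 2^48 inums are from the chat. Deriving the 2^9 from 4096-byte blocks holding 8-byte addresses is an assumption, as are all the names.

```c
#include <assert.h>
#include <stdint.h>

enum {
	INUM_BITS = 48,		/* 2^48 inode numbers */
	BLOCK_BITS = 12,	/* assumed 4096-byte cache block */
	ADDRESS_BITS = 3,	/* assumed 8-byte ileaf block address */
	PER_BLOCK_BITS = BLOCK_BITS - ADDRESS_BITS	/* 2^9 addresses/block */
};

/* Bits of page cache index needed to cover every inum. */
static unsigned toy_index_bits(void)
{
	return INUM_BITS - PER_BLOCK_BITS;
}

/* Page cache index holding the address entry for a given inum. */
static uint64_t toy_cache_index(uint64_t inum)
{
	return inum >> PER_BLOCK_BITS;
}
```

Under these assumptions a 39-bit index is needed, which does not fit in the 32-bit page index of a 32-bit system. That is the objection raised at 03:03.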
2009-01-13 03:17 tuxwrite I think 2009-01-13 03:17 map_region does not work in kernel 2009-01-13 03:18 because the data transfer has not been started when change_end is called 2009-01-13 03:18 yes 2009-01-13 03:19 userland too 2009-01-13 03:19 actually, the right place for change_begin/end in userland is tuxio 2009-01-13 03:19 well, tuxwrite is fine too 2009-01-13 03:19 let's put it there 2009-01-13 03:19 easier 2009-01-13 03:20 ok 2009-01-13 03:21 so, in kernel, I think ->write_begin() and ->write_end for file apos 2009-01-13 03:22 it is for sys_write() 2009-01-13 03:23 just looking at all the uses of ->write_begin, there are: 2009-01-13 03:23 http://lxr.linux.no/linux+v2.6.27/fs/namei.c 2009-01-13 03:23 http://lxr.linux.no/linux+v2.6.27/mm/filemap.c 2009-01-13 03:23 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c 2009-01-13 03:24 I think any data copy for file data 2009-01-13 03:24 symlink data too 2009-01-13 03:25 http://lxr.linux.no/linux+v2.6.27/mm/filemap.c#L2429 <- most write traffic goes through here, right? 2009-01-13 03:26 for sys_write(), yes 2009-01-13 03:26 write family 2009-01-13 03:27 e.g. loop dev, splice, or symlink would be via page_cache_write() 2009-01-13 03:28 http://lxr.linux.no/linux+v2.6.27/mm/filemap.c#L2013 2009-01-13 03:28 sorry, pagecache_write_begin 2009-01-13 03:31 probably, it may be same points with ext3_journal_start()? 2009-01-13 03:31 and ext3_journal_stop() 2009-01-13 03:31 similar anyway 2009-01-13 03:32 grep ext3_journal_start * -I | wc 2009-01-13 03:32 36 2009-01-13 03:32 seems excessive 2009-01-13 03:34 ext3 does journal_start in ext3_get_block, that seems wrong 2009-01-13 03:34 it seems for direct io 2009-01-13 03:34 oh good 2009-01-13 03:35 also has journal_start in ->writepage 2009-01-13 03:35 because ext3 is not delalloc? 
2009-01-13 03:36 and neither is tux3 right now 2009-01-13 03:36 yes 2009-01-13 03:36 then there is a big mess in setattr 2009-01-13 03:37 mmap pages would need it in writepage 2009-01-13 03:37 it's a really stupid idea to use setattr to for vmtruncate 2009-01-13 03:38 yes 2009-01-13 03:38 if separated, it would be clean 2009-01-13 03:38 I wonder why it has not been, after 17 years 2009-01-13 03:39 for inode change at a time? 2009-01-13 03:39 for change inode at a time 2009-01-13 03:39 inode times? 2009-01-13 03:39 still seems like a bad idea 2009-01-13 03:39 truncate may strip suid bit 2009-01-13 03:42 anyway, I think it can be separated 2009-01-13 03:46 my repo has 10 other patches of filemap 2009-01-13 03:47 filemap redirect 2009-01-13 03:47 time to send 2009-01-13 03:47 I should send filemap redirect only? 2009-01-13 03:47 or all? 2009-01-13 03:47 as you wish 2009-01-13 03:47 I am ok with breaking things a little 2009-01-13 03:48 ok, thanks 2009-01-13 03:48 I'll flush my queue 2009-01-13 03:48 btw, do you have interest to locking debug patch? 2009-01-13 03:48 it emulate kernel lock slightly 2009-01-13 03:48 I guess we need it pretty soon 2009-01-13 03:49 oh, for user space 2009-01-13 03:49 yes 2009-01-13 03:49 yes, it would be cool 2009-01-13 03:49 ok 2009-01-13 03:49 better than my stubs 2009-01-13 03:50 I'll post my repo change this midnight 2009-01-13 03:50 I think I will start on a kernel prototype of buffer fork 2009-01-13 03:50 and otherwise wait for your push 2009-01-13 03:50 ok 2009-01-13 03:51 kernel/* change is almost filemap.c only for redirect 2009-01-13 03:52 it's good to get that change in 2009-01-13 03:53 what decides whether to do create = 1 or create = 2? 2009-01-13 03:53 create == 1 is mapping, create == 2 is redirect 2009-01-13 03:54 right, so for now it is always called with 1? 
2009-01-13 03:54 for now, write_bitmap() and test code only 2009-01-13 03:55 write_bitmap() somehow calls it with 2 2009-01-13 03:55 ah, called from user space with create = 2, good 2009-01-13 03:55 yes 2009-01-13 04:08 I may write a new Tux3 Report tomorrow, about how people are using your graphical dump for debugging 2009-01-13 04:08 I think that is one of the coolest hacks ever 2009-01-13 04:08 ok 2009-01-13 04:10 I breaks kernel with my patches 2009-01-13 04:10 because of log_alloc() 2009-01-13 04:11 how does it break? 2009-01-13 04:11 build? run out of log blocks? 2009-01-13 04:11 kernel doesn't have logmap 2009-01-13 04:12 ah, that's ok 2009-01-13 04:12 it has the sb->logmap, but not initialized 2009-01-13 04:12 kernel don't call it with create == 2 2009-01-13 04:12 right 2009-01-13 04:12 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-01-13 04:13 I'll post it 2009-01-13 04:13 ok, fsx-linux is still running 2009-01-13 04:14 good. We didn't change much since the golden copy 2009-01-13 04:14 yes 2009-01-13 04:14 in english: "no" :) 2009-01-13 04:14 ah :) 2009-01-13 04:15 why can't I learn it 2009-01-13 04:16 it's not very important 2009-01-13 04:16 thanks 2009-01-13 04:16 all asian languages seem to have the same difference with all european 2009-01-13 04:17 probably, yes. I don't know about chinese though 2009-01-13 04:18 I must find out :) 2009-01-13 04:18 actually, I know 2009-01-13 04:19 great :) 2009-01-13 04:19 they make it unambiguous 2009-01-13 04:20 so: A: We didn't change much since the golden copy B: did not change. 2009-01-13 04:20 always answered with "verb" or "not verb" 2009-01-13 04:21 yes, so, I have to use "no" 2009-01-13 04:21 but if you just say no, they will think you disagree 2009-01-13 04:22 disagree? 2009-01-13 04:22 "no, it didn't" is right? 
2009-01-13 04:25 I'm getting the authoritative answer from my wife :) 2009-01-13 04:25 I'll know tomorrow 2009-01-13 04:25 oh 2009-01-13 04:27 well, in japanese, "iie" ("no" in english) means "it is not right" 2009-01-13 04:28 so, somehow, in my brain I convert "iie" to "no" 2009-01-13 04:28 in english, "aye" means "right" 2009-01-13 04:28 oh, good 2009-01-13 04:30 how did I miss that down_read in change_end? :) 2009-01-13 04:31 I must have cut & pasted the change_begin function 2009-01-13 04:31 probaby, copied from old email 2009-01-13 04:31 exactly 2009-01-13 04:34 what is the advantage in compiling .o files for everything? 2009-01-13 04:34 rule can be simple 2009-01-13 04:35 the compile takes almost twice as long though 2009-01-13 04:35 it should be no change 2009-01-13 04:36 probably true 2009-01-13 04:36 I think this is about .c.o: rule 2009-01-13 04:36 ? 2009-01-13 04:36 yes 2009-01-13 04:36 and commit: *.o 2009-01-13 04:36 yes, commit: *.o defines dependency 2009-01-13 04:36 .c.o defines solution for it 2009-01-13 04:36 we have had the behavior for a while, but not the clever make syntax 2009-01-13 04:37 ok pulling 2009-01-13 04:37 ah 2009-01-13 04:37 $(testbin): is 2009-01-13 04:38 I thought for something 2009-01-13 04:38 however, it was not used 2009-01-13 04:38 um... 
2009-01-13 04:38 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2009-01-13 04:38 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2009-01-13 04:38 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-01-13 04:38 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2009-01-13 04:39 -!- RazvanM(~RazvanM@96.234.240.153) has joined #tux3 2009-01-13 04:39 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-01-13 04:39 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-01-13 04:39 -!- ceatinge(~ceatinge@veryclever.net) has joined #tux3 2009-01-13 04:39 fs/tux3/filemap.c: In function 'map_region': 2009-01-13 04:39 fs/tux3/filemap.c:115: error: implicit declaration of function 'log_alloc' 2009-01-13 04:39 yes 2009-01-13 04:39 however, build is finished? 2009-01-13 04:40 not the kernel build 2009-01-13 04:40 user space built 2009-01-13 04:40 oh, sorry 2009-01-13 04:40 um.. 2009-01-13 04:40 I can build with error 2009-01-13 04:41 because error is from sparse 2009-01-13 04:42 +/* log.c */ 2009-01-13 04:42 +void log_alloc(struct sb *sb, block_t block, unsigned count, unsigned alloc); 2009-01-13 04:42 +void log_update(struct sb *sb, block_t child, block_t parent, tuxkey_t key); 2009-01-13 04:42 + 2009-01-13 04:43 builds 2009-01-13 04:43 thanks 2009-01-13 04:43 /devel/linux/works/git/mercurial/tux3fs/fs/tux3/filemap.c: In function 'map_region': 2009-01-13 04:43 /devel/linux/works/git/mercurial/tux3fs/fs/tux3/filemap.c:115: warning: implicit declaration of function 'log_alloc' 2009-01-13 04:43 I just get this 2009-01-13 04:43 sorry for it 2009-01-13 04:44 no problem 2009-01-13 04:44 a small thing 2009-01-13 04:45 CC fs/tux3/filemap.o 2009-01-13 04:45 fs/tux3/filemap.c: In function 'tux3_get_block': 2009-01-13 04:45 fs/tux3/filemap.c:106: warning: 'below_block' may be used uninitialized in this function 2009-01-13 04:45 fs/tux3/filemap.c:106: warning: 'above_block' may be used uninitialized in this 
function 2009-01-13 04:45 anyway, oyasumi time 2009-01-13 04:45 old gcc-4.1 is buggy about uninitalized warning 2009-01-13 04:45 perhaps while I am asleep I will write a kernel block fork ;-) 2009-01-13 04:45 :) 2009-01-13 04:46 well, I'll fix those warnings 2009-01-13 04:47 oyasumi 2009-01-13 04:47 oyasumi 2009-01-13 05:08 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2009-01-13 07:49 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-13 08:13 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-13 08:49 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-13 10:45 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-13 10:46 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-01-13 10:59 -!- pradeeps(~weechat@122.167.98.61) has joined #tux3 2009-01-13 11:02 If i take a look at http://mailman.tux3.org/pipermail/tux3/2008-November/000380.html, I find that this step ... 2009-01-13 11:02 # make a tux3 filesystem in a file 2009-01-13 11:02 dd if=/dev/zero of=testdev bs=1M count=1 2009-01-13 11:02 tux3 mkfs testdev 2009-01-13 11:02 looks odd when it comes to tux3 mkfs testdev 2009-01-13 11:03 Sorry for my naive question but how can I use tux3 here beforehand? 2009-01-13 11:03 Is tux3 a userspace utility or something? 2009-01-13 11:06 Oh looks like there is hg repo for tux3 tools :) 2009-01-13 11:06 I hope thats where I should be looking at. 2009-01-13 11:06 Oh crap! compile failed. *Warning treated as errors*?? 
Hmmmm 2009-01-13 11:08 yes, tux3 is command 2009-01-13 11:08 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-01-13 11:09 hirofumi: but it does not compile clean from hg repo on my 64 bit machine 2009-01-13 11:09 this repo have warning fixes 2009-01-13 11:09 for now 2009-01-13 11:10 old gcc-4.x warned some uninitialized variable 2009-01-13 11:10 oh great, let me clone repo from you hirofumi 2009-01-13 11:10 yes, flips will pull it tomorrow 2009-01-13 11:11 Okay 2009-01-13 11:12 hirofumi: can i do a hg(git) clone from your repo? 2009-01-13 11:12 yes 2009-01-13 11:12 great, thanks a lot. 2009-01-13 11:12 static-... is hg repo for hg 1.0.x 2009-01-13 11:12 ok 2009-01-13 11:13 if your hg version is not 1.0.x series, I'll just put fix as patch 2009-01-13 11:15 I have 1.1.1 2009-01-13 11:15 Thats ok i guess. 2009-01-13 11:16 maybe, well, hg does incompatible change for static-.. repo some time 2009-01-13 11:20 hirofumi: loads of errors now :) 2009-01-13 11:20 Okay, let me put this in a better way. 2009-01-13 11:20 hg errors? 2009-01-13 11:20 How do I compile this tux3 binary? 2009-01-13 11:20 No compile errors :) 2009-01-13 11:21 ok 2009-01-13 11:21 dependencies? 2009-01-13 11:21 cd tux3/user 2009-01-13 11:21 yes thats what i did 2009-01-13 11:21 make 2009-01-13 11:21 ok 2009-01-13 11:21 yes thats what i did too 2009-01-13 11:21 do you have popt-dev package or something? 2009-01-13 11:22 tux3 is using popt package 2009-01-13 11:22 s/package/library/ 2009-01-13 11:23 Okay 2009-01-13 11:26 done 2009-01-13 11:26 thanks hirofumi for kind explanation :) 2009-01-13 11:26 enjoy :) 2009-01-13 11:30 Yes, I enjoyed a panic just now. 2009-01-13 11:30 UML crashed. heh 2009-01-13 11:32 ah, uml may not be work for now 2009-01-13 11:32 now, the change for atomic commit is in progress 2009-01-13 11:33 oh, no 2009-01-13 11:33 uml is not work?
2009-01-13 11:34 uml crashed with error with classic unknown block device error 2009-01-13 11:34 I meant - uml crashed with classic unknown block device error 2009-01-13 11:35 um.. 2009-01-13 11:35 rootfs couldn't be mounted? 2009-01-13 11:35 yes because a stat call on tuxroot fails 2009-01-13 11:36 me ponders ... sigh 2009-01-13 11:37 tuxroot64? 2009-01-13 11:38 Yes 2009-01-13 11:38 and error is ENOENT 2009-01-13 11:39 empty path/mangled path?? in tuxroot ? :D 2009-01-13 11:40 hirofumi: can i use a 32 bit tux3 patched kernel safely in my 64 bit environment? 2009-01-13 11:40 maybe 2009-01-13 11:40 well, I don't use uml 2009-01-13 11:41 KVM/Xen? 2009-01-13 11:41 kvm 2009-01-13 11:42 Okay, let me try something out and get back. 2009-01-13 11:42 thanks for the help. 2009-01-13 11:43 well, SUBARCH=i386 may build 32bit uml 2009-01-13 11:43 Okay will try that too 2009-01-13 11:54 Done :) 2009-01-13 11:54 In both Vmware and UML 2009-01-13 11:54 good 2009-01-13 11:54 Thanks hirofumi for all the pointers and help. 2009-01-13 11:54 you are welcome 2009-01-13 11:55 Is there a TODO somwhere for tux3 , hirofumi? 2009-01-13 11:55 any tux3 wiki? 2009-01-13 11:55 there is no wiki yet 2009-01-13 11:55 Ok 2009-01-13 11:56 flips would help it 2009-01-13 11:56 Any TODO or bugzilla etc? 2009-01-13 11:56 there is no bugzilla, and public todo 2009-01-13 11:57 Okay, got it. 2009-01-13 11:57 thanks 2009-01-13 11:57 however, flips report may help 2009-01-13 11:57 Thats nice. 
2009-01-13 11:57 there is a howto for uml 2009-01-13 11:58 Design note: is design report 2009-01-13 11:58 yes 2009-01-13 11:58 http://lwn.net/Articles/308950/ 2009-01-13 11:58 oh, hi 2009-01-13 11:59 ./linux ubda=/src/zuma/root ubdb=testdev 2009-01-13 11:59 hi 2009-01-13 11:59 I pushed your warning fixes 2009-01-13 11:59 thanks 2009-01-13 11:59 64bit uml may have problem 2009-01-13 12:00 ah, and I promised to update my uml root fs, earlier 2009-01-13 12:00 and forget the reason 2009-01-13 12:00 have not updated it 2009-01-13 12:00 need to 2009-01-13 12:01 oh, it is to run under a different emulation 2009-01-13 12:03 it seems to work 2009-01-13 12:03 64bit uml 2009-01-13 12:10 bye gtg to sleep. 2009-01-13 12:11 Good night and good morning 2009-01-13 12:11 bye 2009-01-13 12:11 -!- pradeeps(~weechat@122.167.98.61) has left #tux3 2009-01-13 12:11 -!- gaurav(~gaurav@117.195.44.163) has joined #tux3 2009-01-13 12:14 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-13 12:25 starting to draft the kernel block fork 2009-01-13 12:56 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2009-01-13 13:27 -!- kushal_(~kushal@117.195.33.39) has joined #tux3 2009-01-13 13:50 kernel block fork is starting to look like real code 2009-01-13 14:20 first compile attempt 2009-01-13 14:31 hello 2009-01-13 14:32 shapor: got your linked invitation a while back but I didn't know that it was you until I looked at the profile 2009-01-13 14:32 ah, i forgot about that 2009-01-13 14:33 hi all 2009-01-13 14:33 I get a fair number of recruiters trying to add me to their network and stuff and I'm a bit leary about revealing who I know and stuff 2009-01-13 14:33 ACTION gets social networking because of his Facebook addiction now 2009-01-13 14:34 -!- ilan(ilan@captain.fonz.net) has left #tux3 2009-01-13 17:28 -!- macan(~macan@159.226.41.129) has joined #tux3 2009-01-13 17:33 -!- macan(~macan@159.226.41.129) has joined #tux3 2009-01-13 17:34 -!- macan(~macan@159.226.41.129) has 
left #tux3 2009-01-13 17:40 -!- macan(~macan@159.226.41.129) has joined #tux3 2009-01-13 18:01 -!- macan(~macan@159.226.41.129) has joined #tux3 2009-01-13 20:03 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2009-01-13 21:17 hey maze, ready for some fun? 2009-01-13 23:49 -!- RazvanM(~RazvanM@96.234.240.153) has joined #tux3 2009-01-14 04:36 hirofumi, there? 2009-01-14 05:08 -!- edt(~Ed@112-78.162.dsl.aei.ca) has joined #tux3 2009-01-14 05:40 -!- padraig(~padraig@84.203.137.218) has joined #tux3 2009-01-14 08:48 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-01-14 09:42 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-14 09:49 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-01-14 10:00 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-14 10:37 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-01-14 15:56 found a bug in blockget 2009-01-14 16:04 what is a bug? 2009-01-14 16:08 just a sec 2009-01-14 16:08 offset = iblock & ((PAGE_CACHE_SHIFT - inode->i_blkbits) - 1); 2009-01-14 16:09 should be ((1 << (PAGE_CACHE_SHIFT - inode->i_blkbits)) - 1) 2009-01-14 16:11 it is not a mask 2009-01-14 16:12 iblock & <- it's a mask 2009-01-14 16:13 ah 2009-01-14 16:13 we only use it in dir.c now 2009-01-14 16:13 iblock & (the above line)? 
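The blockget mask bug quoted just above can be reproduced outside the kernel. This is a minimal sketch that only compares the two expressions from the chat; the function names and the plain `unsigned` types are assumptions for illustration, and how blockget consumes the result is not shown in the log, so it is not modeled here.

```c
/* Offset as computed by the quoted buggy line: the shift difference is
 * used directly as if it were a block count, then decremented. */
static unsigned buggy_offset(unsigned iblock, unsigned page_shift, unsigned blkbits)
{
    return iblock & ((page_shift - blkbits) - 1);
}

/* Offset with the corrected mask: (blocks per page) - 1 */
static unsigned fixed_offset(unsigned iblock, unsigned page_shift, unsigned blkbits)
{
    return iblock & ((1u << (page_shift - blkbits)) - 1);
}
```

With a page shift of 12 and 1K blocks (`blkbits` of 10) there are four blocks per page, so the correct mask is 3 while the buggy expression yields 1. For `iblock` 6 the buggy version returns 0 and the fixed version returns 2, which is the 1K-block case the chat says would have exposed the bug.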
2009-01-14 16:13 yes 2009-01-14 16:13 sorry 2009-01-14 16:13 but, it is used by blockread too 2009-01-14 16:14 not yet 2009-01-14 16:14 oh 2009-01-14 16:14 right 2009-01-14 16:14 same expression 2009-01-14 16:14 yes 2009-01-14 16:14 it is because our blocksize == page size 2009-01-14 16:15 we would have hit it on 1K blocks 2009-01-14 16:15 oh, right 2009-01-14 16:19 btw, I'm trying delalloc for atomic commit 2009-01-14 16:20 it seems easy if not efficient way 2009-01-14 16:21 I will be interested to see it 2009-01-14 16:21 I am unit testing blockget and fork_buffer in uml 2009-01-14 16:22 blockget2 is written 2009-01-14 16:22 maybe we should use ERR_PTR for it 2009-01-14 16:22 blockget2() is vol_getblk()? 2009-01-14 16:22 it should work for both volume and file cache 2009-01-14 16:23 yes 2009-01-14 16:24 and I noticed a reason why it is a good idea to have our own volume map 2009-01-14 16:24 if a user does cp /dev/tuxvolume /somewhere, which I do with ext3, they will get badly corrupted data for tux3 2009-01-14 16:25 because our dirty cache in the volume does not match the on-disk blocks 2009-01-14 16:25 it is user fault 2009-01-14 16:25 I am guilty then :) 2009-01-14 16:25 :) 2009-01-14 16:25 I often use it to copy a running ext2/3 volume 2009-01-14 16:26 maybe people do 2009-01-14 16:26 we all know it's bad 2009-01-14 16:26 but it works 2009-01-14 16:26 anyway, if we have a separate volume cache, and do freeze fs first, it is actually correct 2009-01-14 16:26 I think it is also bad for ext3 2009-01-14 16:26 with ext3 you have to do a forced fsck after 2009-01-14 16:26 it always works 2009-01-14 16:27 that is why ext3 is linux's standard fs 2009-01-14 16:27 it is a bit surprise if work 2009-01-14 16:27 I've done it dozens of times 2009-01-14 16:27 it can get corrupt journal and fs 2009-01-14 16:28 if fs has dirty buffer and journal 2009-01-14 16:28 it can, so I don't replay the journal, do fsck -f instead 2009-01-14 16:28 ah 2009-01-14 16:28 ext2/3 can do that
because it is a very simple fs with a very good fsck 2009-01-14 16:29 well, however, I still think it's user fault if it didn't work 2009-01-14 16:30 delalloc seems to work on simple case 2009-01-14 16:30 kernel is linus's git though 2009-01-14 16:30 delalloc would in some ways be simpler for us 2009-01-14 16:31 so if you would like to start with delalloc, that is fine with me 2009-01-14 16:31 current delalloc is just to make atomic commit simple 2009-01-14 16:32 good 2009-01-14 16:32 with it, allocation and dleaf would be same state with userland 2009-01-14 16:32 I hope 2009-01-14 16:32 if not, we will make it so 2009-01-14 16:32 http://userweb.kernel.org/~hirofumi/delalloc/simple-delalloc.patch 2009-01-14 16:36 create = 3 :) 2009-01-14 16:36 yes :) 2009-01-14 16:38 basically, it is just set buffer_delay on write_begin 2009-01-14 16:39 ah, and clear and set block on writepage 2009-01-14 16:39 clear delay 2009-01-14 16:39 ok, so it avoids buffer.c -> write_begin 2009-01-14 16:40 where is the actual write transfer? 2009-01-14 16:41 it's writepage 2009-01-14 16:41 differece is just block allocation 2009-01-14 16:42 of course 2009-01-14 16:42 buffer_head is handled like hole on write_begin 2009-01-14 16:43 and writepage allocates blocks and write it out 2009-01-14 16:43 I should have asked, where is the actual map_region(... 1/2) ? 2009-01-14 16:43 writepage -> tux3_get_block -> map_region 2009-01-14 16:43 block_write_full_page(page, tux3_get_block, wbc); 2009-01-14 16:43 right 2009-01-14 16:44 and I see you are thinking about using writepages 2009-01-14 16:44 yes 2009-01-14 16:44 library doesn't handle buffer_delay even on current git 2009-01-14 16:45 fs drivers have to handle ownself 2009-01-14 16:45 we want to handle it anyway 2009-01-14 16:46 yes 2009-01-14 16:46 however, we can delay it 2009-01-14 16:46 if we want delay 2009-01-14 16:48 what checks buffer_delay when we set it? 
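The delayed-allocation flow described here (mark the buffer delayed in write_begin, then allocate, clear the delay flag and record the block in writepage) can be sketched as a toy state machine. This is not the code from simple-delalloc.patch; the struct, flag names, allocator and starting block number are all hypothetical stand-ins for buffer_head state and balloc().

```c
#include <stdint.h>

/* Hypothetical flags standing in for buffer_delay and buffer_mapped */
enum { BUF_DELAY = 1 << 0, BUF_MAPPED = 1 << 1 };

struct toy_buffer {
    unsigned state;   /* BUF_* flags */
    uint64_t block;   /* physical block, meaningful only when BUF_MAPPED */
};

/* Hypothetical allocator cursor; a real filesystem would consult its bitmap */
static uint64_t toy_next_free = 100;

static uint64_t toy_alloc_block(void)
{
    return toy_next_free++;
}

/* write_begin side: an unmapped buffer is treated like a hole, so nothing
 * is allocated yet and the buffer is only marked as delayed. */
static void toy_write_begin(struct toy_buffer *buf)
{
    if (!(buf->state & BUF_MAPPED))
        buf->state |= BUF_DELAY;
}

/* writepage side: allocation happens here, then the delay flag is cleared
 * and the buffer is marked mapped to the new block. */
static void toy_writepage(struct toy_buffer *buf)
{
    if (buf->state & BUF_DELAY) {
        buf->block = toy_alloc_block();
        buf->state &= ~(unsigned)BUF_DELAY;
        buf->state |= BUF_MAPPED;
    }
}
```

The point of the split is that every allocation is pushed to writepage time, which is what the chat hopes will keep allocation and dleaf state in step with the userland code.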
2009-01-14 16:48 because, the purpose of this patch is just for atomic commit 2009-01-14 16:48 yes 2009-01-14 16:49 on current git, block io library and __tux3_get_block 2009-01-14 16:50 block_* on current git knows about buffer_delay 2009-01-14 16:50 I should look at it 2009-01-14 16:51 __block_write_full_page and __block_prepare_write and block_truncate_page 2009-01-14 16:52 right, we probably want to stop using those eventually 2009-01-14 16:52 but for now, use them 2009-01-14 16:52 and just have atomic commit in git-latest? 2009-01-14 16:52 move the repo to sync up with git? 2009-01-14 16:53 or patch to kernel in our git 2009-01-14 16:53 it is time to sync up with linus anyway 2009-01-14 16:55 well, either ways are fine for me 2009-01-14 16:56 I'm going to use hg for now 2009-01-14 17:36 http://userweb.kernel.org/~hirofumi/delalloc/simple-delalloc.patch 2009-01-14 17:37 found delalloc bug 2009-01-14 17:37 old one was missing to run map_bh path 2009-01-14 17:37 I'll sleep, oyasumi 2009-01-14 18:01 oyasumi 2009-01-14 18:01 :) 2009-01-14 19:22 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2009-01-14 21:56 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2009-01-14 23:49 -!- RazvanM(~RazvanM@96.234.240.153) has joined #tux3 2009-01-15 04:41 -!- vomjom(~vomjom@99-157-248-71.lightspeed.stlsmo.sbcglobal.net) has joined #tux3 2009-01-15 05:59 \who 2009-01-15 05:59 bother 2009-01-15 09:58 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-15 11:35 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-01-15 12:49 hey flips 2009-01-15 13:20 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-15 15:57 hi ceatinge :) 2009-01-15 15:57 hey bh 2009-01-15 16:38 most of the tux3 source code says that it's GPLv3. will this license cause problems when merging with the linux kernel? 
2009-01-15 18:07 vomjom, I need to put in the patch to change the kernel code to gpl v2 2009-01-15 18:43 vomjom: http://hg.tux3.org/tux3/rev/0a694d3212ee <- updated to gpl v2 2009-01-15 20:04 -!- samlh(~sam@67.129.121.145) has joined #tux3 2009-01-15 21:28 hirofumi, there? 2009-01-15 21:28 hi 2009-01-15 21:29 yersterday, I found interesting issues 2009-01-15 21:29 memory allocation and locking orders 2009-01-15 21:29 ah 2009-01-15 21:29 where? 2009-01-15 21:29 bitmap inode 2009-01-15 21:30 involving the blockio library? 2009-01-15 21:30 http://userweb.kernel.org/~hirofumi/delalloc/locking-doc.patch 2009-01-15 21:30 this patch docs of locking orders 2009-01-15 21:31 well, just reminder to me 2009-01-15 21:31 nice docs! 2009-01-15 21:31 thanks 2009-01-15 21:32 bottom hierarchy of that 2009-01-15 21:32 bitmap->i_mutex -> down_read() is one of it 2009-01-15 21:32 down_write -> bitmap->i_mutex is one of it 2009-01-15 21:32 AB BA deadlock 2009-01-15 21:34 checking it 2009-01-15 21:34 oh 2009-01-15 21:34 I mean, thinking about it 2009-01-15 21:34 ah 2009-01-15 21:35 those lock_pages come from the block library, right? 2009-01-15 21:36 block libray or vfs 2009-01-15 21:36 this meant caller of ->writepage etc. 
2009-01-15 21:36 vfs doesn't call map_region under the page lock I think, only indirectly via the block library 2009-01-15 21:36 ah 2009-01-15 21:38 checking the usage in vmscan.c 2009-01-15 21:39 it locks the page by trylock 2009-01-15 21:39 yes, maybe 2009-01-15 21:40 we don't have to call map_region in our writepage though 2009-01-15 21:41 maybe delayed allocation is easier for locking 2009-01-15 21:41 that docs already delayed allocation 2009-01-15 21:42 so, we have to call balloc() for write buffers 2009-01-15 21:43 ah, redirect also call though 2009-01-15 21:44 so, that doc ignores blockget -> balloc 2009-01-15 21:44 and ->write_begin -> balloc 2009-01-15 21:45 let's see how the specialized bitmap block write locks in kernel 2009-01-15 21:45 that doc is not enough? 2009-01-15 21:46 takes me a while to make the connection 2009-01-15 21:46 yes, ok 2009-01-15 21:47 I thought about map_region() for bitmap data pages 2009-01-15 21:47 someone -> map_region(sb->bitmap) 2009-01-15 21:48 so, we take down_write(sb->bitmap->btree->lock) 2009-01-15 21:48 then we call balloc() to allocate blocks in map_region() 2009-01-15 21:49 balloc() takes bitmap->i_mutex, and call blockread() to read bitmap data pages 2009-01-15 21:50 blockread() calls map_region(sb->bitmap) to get block address 2009-01-15 21:50 down_read(sb->bitmap->btree->lock) 2009-01-15 21:51 um..., source may be good than my explanation 2009-01-15 21:51 :) 2009-01-15 21:51 I'm walking though your explanation 2009-01-15 21:52 well, so bitmap locking order is 2009-01-15 21:53 map_region() -> down_write(sb->bitmap->tree->lock) -> balloc() -> mutex_lock(bitmap->i_mutex) -> blockread() -> map_region() -> down_read(sb->bitmap->btree->lock) 2009-01-15 21:54 one is down_write() -> mutex_lock(bitmap->i_mutex) 2009-01-15 21:54 sure, it's a deadlock 2009-01-15 21:54 another one is mutex_lock(bitmap->i_mutex) -> down_read() 2009-01-15 21:55 yes 2009-01-15 21:55 I noticed it yersterday, then slept 2009-01-15 21:56 maybe 
per-block lock is a fix 2009-01-15 21:56 it replaces which lock? 2009-01-15 21:57 btree->lock? 2009-01-15 21:57 bitmap->tree->lock 2009-01-15 21:57 yes 2009-01-15 21:57 recursive lock :) 2009-01-15 21:58 :) 2009-01-15 21:58 so, this is a "map_region in map_region" problem 2009-01-15 21:58 yes 2009-01-15 21:59 or, we can remove btree->lock for bitmap 2009-01-15 21:59 we can remove it because? 2009-01-15 21:59 we have bitmap->i_mutex to lock bitmap 2009-01-15 22:00 so, down_read is redundant 2009-01-15 22:00 for write, we have to add bitmap->i_mutex though 2009-01-15 22:00 or we can drop the btree lock before the balloc 2009-01-15 22:01 it may corrupt cursor 2009-01-15 22:01 s/can/will/ 2009-01-15 22:02 we never hit this because bitmaps are always in cache I guess 2009-01-15 22:03 memory reclaim can drop bitmap pages 2009-01-15 22:03 if we didn't pin it 2009-01-15 22:03 that's what I meant 2009-01-15 22:04 we pin all bitmap pages? 2009-01-15 22:04 no, we won't, I was just saying why we did not see the lockup in practice 2009-01-15 22:04 it will only show up under memory pressure 2009-01-15 22:04 or on startup 2009-01-15 22:04 ah, yes 2009-01-15 22:05 I run more stress under memory pressure 2009-01-15 22:05 ran 2009-01-15 22:08 for write, we have to add bitmap->i_mutex though <- which write? 2009-01-15 22:09 writer (caller of map_region) for bitmap pages 2009-01-15 22:09 only spotted problems in bitmap so far? 2009-01-15 22:10 yes 2009-01-15 22:10 another one is different issue 2009-01-15 22:10 not locking issue 2009-01-15 22:11 i_mutex is too crude for bitmap I guess 2009-01-15 22:11 does a per bitmap block (buffer lock) fix this? 2009-01-15 22:12 we use the cursor to walk to next bitmap block 2009-01-15 22:12 I don't think it will fix 2009-01-15 22:13 bitmap block meant bitmap data pages? 2009-01-15 22:13 yes 2009-01-15 22:13 ok 2009-01-15 22:14 we have to lock btree for bitmap even if it is per block?
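The "map_region in map_region" chain spelled out earlier (map_region, down_write on the bitmap btree lock, balloc, bitmap->i_mutex, blockread, map_region, down_read on the same btree lock) can be traced with a toy lock tracker. Everything here is a hypothetical user-space model: it does not block, it only reports where one task would try to take a lock it already holds.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the bitmap btree rwsem and bitmap->i_mutex */
enum toy_lock { BTREE_LOCK, BITMAP_MUTEX };

struct toy_task {
    enum toy_lock held[8];   /* locks held by this task, in acquisition order */
    size_t depth;
};

/* Returns 0 on success, -1 if the task already holds the lock, which is
 * where a non-recursive kernel lock would block on itself. */
static int toy_acquire(struct toy_task *t, enum toy_lock lock)
{
    for (size_t i = 0; i < t->depth; i++)
        if (t->held[i] == lock)
            return -1;
    if (t->depth == sizeof(t->held) / sizeof(t->held[0]))
        return -1;
    t->held[t->depth++] = lock;
    return 0;
}

/* Releases the most recently acquired lock */
static void toy_release(struct toy_task *t)
{
    if (t->depth)
        t->depth--;
}

/* Inner map_region, reached through blockread: btree lock for read */
static int toy_map_region_inner(struct toy_task *t)
{
    if (toy_acquire(t, BTREE_LOCK))
        return -1;
    toy_release(t);
    return 0;
}

/* balloc: takes the bitmap mutex, then reads a bitmap page */
static int toy_balloc(struct toy_task *t)
{
    int err;

    if (toy_acquire(t, BITMAP_MUTEX))
        return -1;
    err = toy_map_region_inner(t);
    toy_release(t);
    return err;
}

/* Outer map_region on the bitmap: btree lock for write, then allocate */
static int toy_map_region_outer(struct toy_task *t)
{
    int err;

    if (toy_acquire(t, BTREE_LOCK))
        return -1;
    err = toy_balloc(t);
    toy_release(t);
    return err;
}
```

Calling `toy_map_region_outer` on a fresh task returns -1 at the nested btree lock, while `toy_balloc` on its own succeeds. The model covers only the single-task recursion; the AB-BA ordering between two tasks that the chat also mentions is not simulated.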
2009-01-15 22:15 ah, even if we removed bitmap->i_mutex, bitmap->btree->lock is still recursive 2009-01-15 22:15 down_write(bitmap->btree->lock) -> bitmap->i_mutex -> down_read(bitmap->btree->lock) 2009-01-15 22:16 ok, I need to think about it very slowly, the way I do 2009-01-15 22:16 me too 2009-01-15 22:18 and it's just the bitmap flush, right? 2009-01-15 22:19 in future, yes 2009-01-15 22:19 but now? 2009-01-15 22:19 now, ->writepage flush bitmap 2009-01-15 22:19 or any buffers 2009-01-15 22:20 ah right, we are fixing that soon 2009-01-15 22:20 that isn't allowed, ever 2009-01-15 22:20 and now we uses __GFP_FS to allocate page 2009-01-15 22:21 right, needs a patch 2009-01-15 22:21 it means any pages allocation can be recursive 2009-01-15 22:21 every filesystem has to do that 2009-01-15 22:22 I noticed it is why blockdev uses GFP_NOFS 2009-01-15 22:22 well, I have the patch for memory allocation 2009-01-15 22:23 a write lock can be downgraded to a read lock 2009-01-15 22:23 which would prevent cursor corruption, also prevent us from modifying the btree there 2009-01-15 22:24 anyway, my suggestions tonight will be stupid ;) 2009-01-15 22:24 better ideas tomorrow 2009-01-15 22:24 ok :) 2009-01-15 22:25 I was thinking about doing the btree -> inode parameter change now 2009-01-15 22:25 well, I'm not thinking about solution yet 2009-01-15 22:25 I think it's good 2009-01-15 22:25 parameter? 2009-01-15 22:25 I thought we can just cursor->inode 2009-01-15 22:25 well, both is ok though 2009-01-15 22:26 I'm thinking inode is good anyway 2009-01-15 22:26 do we always have a cursor to pass? 
2009-01-15 22:27 we may not have cursor in some place 2009-01-15 22:28 ah 2009-01-15 22:28 we don't need (struct inode *) everywhere 2009-01-15 22:28 it meant btree->inode 2009-01-15 22:28 right 2009-01-15 22:28 not btree->sb 2009-01-15 22:28 understood 2009-01-15 22:28 ok, that is easier 2009-01-15 22:28 yes, just for laziness 2009-01-15 22:29 that's fine 2009-01-15 22:29 the locking recursion is more important 2009-01-15 22:29 and concentrate to atomic commit 2009-01-15 22:29 I thought I was finished with bitmap recursions ;) 2009-01-15 22:29 yes 2009-01-15 22:29 yes 2009-01-15 22:29 :) 2009-01-15 22:31 balloc() for bitmap somehow seems more complex 2009-01-15 22:31 than guess 2009-01-15 22:31 again, any filesystem with a dynamically allocated allocation map has this problem 2009-01-15 22:32 yes 2009-01-15 22:33 btw, do you know any fs trying this? 2009-01-15 22:33 I presume btrfs does 2009-01-15 22:33 ext4 doesn't... smart 2009-01-15 22:33 i see 2009-01-15 22:34 well, you have to have a dynamically allocated allocation map in a versioning fs 2009-01-15 22:34 no choice 2009-01-15 22:34 oh, i see 2009-01-15 22:34 so we will solve it 2009-01-15 22:35 a recursive mutex would solve it 2009-01-15 22:35 why don't we like that? 2009-01-15 22:35 leads to lazy programmers? 2009-01-15 22:35 yes 2009-01-15 22:35 and can be expensive 2009-01-15 22:35 needlessly 2009-01-15 22:35 but... maybe it's the right tool for the job in some cases 2009-01-15 22:35 anyway, our locking is very expensive now 2009-01-15 22:36 yes 2009-01-15 22:36 if we can solve this, and lighten the bitmap locking at the same time, it is good progress 2009-01-15 22:36 ok, here is an idea 2009-01-15 22:36 well, there is quick solution though 2009-01-15 22:37 which is? 2009-01-15 22:37 e.g.
set flags to current - we have bitmap lock already 2009-01-15 22:38 i.e., get the effect of a recursive lock 2009-01-15 22:38 some sort of 2009-01-15 22:39 a hack that lets us keep concentrating on atomic commit 2009-01-15 22:40 ok 2009-01-15 22:41 it doesn't even have to be a process flag, if there is only one bitmap flusher 2009-01-15 22:41 just a "we are flushing" flag 2009-01-15 22:42 yes, if delta is only one? 2009-01-15 22:43 we will certainly have a better solution by the time it is more than one 2009-01-15 22:43 I guess we will probably have a better idea by tomorrow anyway 2009-01-15 22:43 ok 2009-01-15 22:44 I thought a lot about blockdirty -> fork_block 2009-01-15 22:44 in kernel, with smp 2009-01-15 22:44 yes 2009-01-15 22:45 it seems to work out ok, anyway we can use it now for the simple case of bitmap flushing 2009-01-15 22:45 cool 2009-01-15 22:45 I will start testing it, did you see the hackfs patch? 2009-01-15 22:46 ok, ml? 2009-01-15 22:46 yes 2009-01-15 22:46 it's a nice little environment for trying out crazy ideas 2009-01-15 22:47 change buffer->b_state without any lock? 
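The quick fix floated here, a "we are flushing" flag that gives the effect of a recursive lock, might look roughly like the following. This is a single-threaded sketch under the chat's own restriction of exactly one bitmap flusher; the struct and function names are hypothetical and plain integers stand in for the real rwsem.

```c
/* Hypothetical superblock state: ints stand in for the btree rwsem and
 * for the proposed "we are flushing" flag. */
struct toy_sb {
    int btree_locked;   /* 1 while the bitmap btree lock is held */
    int flushing;       /* set only by the single bitmap flusher */
};

/* Nested lookup (the inner map_region). Returns -1 where a real
 * non-recursive lock would block on itself. */
static int toy_nested_lookup(struct toy_sb *sb)
{
    if (sb->flushing)
        return 0;            /* flusher already holds the lock: skip it */
    if (sb->btree_locked)
        return -1;           /* would self-deadlock */
    sb->btree_locked = 1;
    /* ... btree lookup would happen here ... */
    sb->btree_locked = 0;
    return 0;
}

/* Bitmap flush (the outer path): takes the lock, optionally raises the
 * flag, then triggers the nested lookup. */
static int toy_flush_bitmap(struct toy_sb *sb, int use_flag)
{
    int err;

    sb->btree_locked = 1;
    sb->flushing = use_flag;
    err = toy_nested_lookup(sb);
    sb->flushing = 0;
    sb->btree_locked = 0;
    return err;
}
```

Without the flag the nested lookup reports the self-deadlock; with it the flush completes. As the chat notes, this stops being safe once more than one delta can flush at a time, since a bare flag does not say which task owns the lock.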
2009-01-15 22:48 needs a lock 2009-01-15 22:48 I was thinking, maybe bitspin 2009-01-15 22:48 i see 2009-01-15 22:48 block library uses buffer lock 2009-01-15 22:48 locking is missing there :) 2009-01-15 22:48 it is just a prototype, never run yet 2009-01-15 22:49 ok, maybe it has to atomic op 2009-01-15 22:49 um..., it meant set_bit and family 2009-01-15 22:49 there isn't an atomic op for a bit field 2009-01-15 22:49 actually, I wrote one for the handle patch 2009-01-15 22:49 using cmpxchg 2009-01-15 22:49 any ways is ok for handles 2009-01-15 22:50 but when we change state, we also change lists, so just atomic bit field change is not enough 2009-01-15 22:50 however, buffer_head has to sync with vfs 2009-01-15 22:50 yes 2009-01-15 22:51 vfs uses bitops 2009-01-15 22:51 yes 2009-01-15 22:51 i guess that has to sync with it 2009-01-15 22:51 vfs doesn't know about the delta field though, we we don't have to sync with it 2009-01-15 22:52 or, buffer_dirty implies delta field is valid 2009-01-15 22:52 it does read-modify-write 2009-01-15 22:52 and it changes a list 2009-01-15 22:53 I guess, read-modify-write should be atomic 2009-01-15 22:53 well, for now there is no smp issue because the bitmap flush is done by a single process that is the only user of fork 2009-01-15 22:54 easiest is just some spinlock for changing the buffer delta 2009-01-15 22:55 and when the flush is happening, file operations are stopped 2009-01-15 22:55 if anyone didn't call set_bit(buffer)? 2009-01-15 22:55 if nobody call set_bit(buffer) 2009-01-15 22:56 set_bit(buffer)? 2009-01-15 22:56 buffer->state? 2009-01-15 22:56 set_buffer_dirty(buffer) etc. 
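The cmpxchg idea mentioned above (an atomic update of a multi-bit field inside a state word) can be sketched in userspace C11. This is only an illustration under stated assumptions, not the code from the handle patch: `set_field_atomic` and `DELTA_MASK` are hypothetical names, and the two-bit field layout is assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the low two bits of the state word hold a small "delta" field. */
#define DELTA_MASK 0x3u

/*
 * Atomically replace the bits selected by mask with value, leaving all
 * other bits untouched. Returns the word as it was before the update.
 */
static uint32_t set_field_atomic(_Atomic uint32_t *word, uint32_t mask, uint32_t value)
{
	assert(word != NULL);
	uint32_t old = atomic_load(word);
	uint32_t new;
	do {
		/* Recompute from the freshest value; on failure the
		 * compare-exchange reloads old for us. */
		new = (old & ~mask) | (value & mask);
	} while (!atomic_compare_exchange_weak(word, &old, new));
	return old;
}
```

As the discussion points out, this covers only the bit field itself. When a state change also moves the buffer between lists, the field update and the list move still need a common lock.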
2009-01-15 22:57 vfs doesn't know about these buffers 2009-01-15 22:57 I think :) 2009-01-15 22:57 right now it does 2009-01-15 22:57 ok 2009-01-15 22:57 if nobody use set_bit, I think it should work
2009-01-15 22:58 + lock_page(oldpage);
2009-01-15 22:58 + while (!PageUptodate(oldpage)) {
2009-01-15 22:58 +     unlock_page(oldpage);
2009-01-15 22:58 +     oldpage = read_mapping_page(mapping, oldpage->index, NULL);
2009-01-15 22:58 +     lock_page(oldpage);
2009-01-15 22:58 + }
2009-01-15 22:58 what is solving with this loop? 2009-01-15 22:59 good question 2009-01-15 22:59 I thought we can if (!PageUptodate) 2009-01-15 22:59 with no loop? 2009-01-15 22:59 yes 2009-01-15 22:59 probably 2009-01-15 23:00 well, if we don't use read_mapping_page(), we don't need to call lock_page() again 2009-01-15 23:01 this is low quality code :) 2009-01-15 23:02 well, it is not big issue 2009-01-15 23:02 alloc_pages() would be page_cache_alloc() with mapping 2009-01-15 23:04 ah, has a per-inode gff mask 2009-01-15 23:04 gfp 2009-01-15 23:04 yes 2009-01-15 23:05 we will never use this with HIGHMEM 2009-01-15 23:05 why? 2009-01-15 23:05 heh 2009-01-15 23:05 we might :) 2009-01-15 23:06 :) 2009-01-15 23:06 but fork doesn't work perfectly for data pages without pte tricks 2009-01-15 23:06 ah 2009-01-15 23:07 well, we can use HIGHMEM for any buffer in future 2009-01-15 23:07 I'm trying to remember what I was thinking with the loop on read_mapping_page
2009-01-15 23:12 while (!PageUptodate(oldpage))
2009-01-15 23:12     oldpage = read_mapping_page(mapping, oldpage->index, NULL);
2009-01-15 23:12 lock_page(oldpage);
2009-01-15 23:13 I can't still understand why read is loop 2009-01-15 23:13 um...
2009-01-15 23:14 it access page->private directly, page_buffers(page) would be better 2009-01-15 23:15 page->private is not private to this function 2009-01-15 23:16 sorry 2009-01-15 23:16 page->private is known to be buffers by this function 2009-01-15 23:16 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2009-01-15 23:16 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-15 23:16 I don't think there is a set_page_buffers() 2009-01-15 23:17 I think create_emtpry_buffers() does it 2009-01-15 23:18 we swap the page buffers in fork_buffer 2009-01-15 23:18 that means we have to know everything about page->private for this page 2009-01-15 23:19 yes 2009-01-15 23:20 all page_buffers() users knows about it 2009-01-15 23:21 ah 2009-01-15 23:21 ok, it touches ->private deep low level 2009-01-15 23:21 yes 2009-01-15 23:22 I convinced myself it has to swap the page buffers 2009-01-15 23:22 between the two pages 2009-01-15 23:22 and I better write an email about that 2009-01-15 23:22 it's not obvious 2009-01-15 23:22 well, still we can use set_page_private() and page_private 2009-01-15 23:24 those are all is small thing 2009-01-15 23:24 yes 2009-01-15 23:24 I wonder what the purpose of these shell functions is 2009-01-15 23:25 maybe, abstruction 2009-01-15 23:25 maybe somebody was thinking about making page private a hash or something 2009-01-15 23:25 and unfinished idea... the people who were squeezing the struct page size down seem to have finished doing that 2009-01-15 23:26 probably because 32 bit machines are dying out 2009-01-15 23:29 it changes b_page without lock 2009-01-15 23:30 I think it may be under mapping->tree_lock 2009-01-15 23:31 because read can grab that page middle of it 2009-01-15 23:31 um..., however there is no problem... 2009-01-15 23:35 right, read doesn't care which page it gets 2009-01-15 23:36 we have to exclude other blockdirty 2009-01-15 23:36 maybe, fs level is simple, however, vm is... 
2009-01-15 23:36 and we have to synchronize with list handling for writeout, which isn't handled yet 2009-01-15 23:37 ah, yes 2009-01-15 23:38 for that reason, the state changes need to be protected with spinlocks 2009-01-15 23:38 we have a long time to think about this, because the use case for bitmap writeout is far simpler 2009-01-15 23:38 i see 2009-01-15 23:39 ok, when we enter fork_buffer it is from pagedirty, which I have not written here 2009-01-15 23:40 pagedirty? 2009-01-15 23:40 sorry 2009-01-15 23:40 blockdirty 2009-01-15 23:40 ACTION is getting tired 2009-01-15 23:40 ah 2009-01-15 23:40 so we have a reference on a buffer 2009-01-15 23:40 which will prevent the page from being evicted (we will refuse ->releasepage) 2009-01-15 23:41 so the look on read_mapping_page was stupid :) 2009-01-15 23:41 the loop 2009-01-15 23:41 there is no way for the page to become !uptodate after read_mapping_page 2009-01-15 23:42 yes 2009-01-15 23:42 but an error in read_mapping_page is possible, which isn't handled 2009-01-15 23:42 yes 2009-01-15 23:42 it would be similar to blockread in current repo 2009-01-15 23:43 except buffer_head handling 2009-01-15 23:43 returns ERR_PTR, good 2009-01-15 23:43 yes 2009-01-15 23:46 I need to read kernel to think fork_buffer(), need more time
2009-01-15 23:47 if (!PageUptodate(oldpage)) {
2009-01-15 23:47     oldpage = read_mapping_page(mapping, oldpage->index, NULL);
2009-01-15 23:47     if (IS_ERR(oldpage))
2009-01-15 23:47         return oldpage;
2009-01-15 23:47 }
2009-01-15 23:47 probably, basically, for race 2009-01-15 23:47 looks good to me 2009-01-15 23:47 if lock_page() was taked after it 2009-01-15 23:47 it is 2009-01-15 23:48 and not before 2009-01-15 23:48 yes, already changed 2009-01-15 23:49 how to spinlock the list moves and state changes is the biggest issue 2009-01-15 23:49 and we have to call page_cache_release() after read
2009-01-15 23:51 #define page_cache_release(page) put_page(page)
2009-01-15 23:52 yes 2009-01-15 23:52 I think
it used to mean something special at one time 2009-01-15 23:53 it is all for page_cache 2009-01-15 23:53 well, the above means read and uptodate have different reference count 2009-01-15 23:54 read_mapping_page() takes reference count of page 2009-01-15 23:54 if the page is PageUptodate(), we don't take reference count of page 2009-01-15 23:54 it seems wrong 2009-01-15 23:55 right, because read_cache_page takes a reference count 2009-01-15 23:55 stupid of me :) 2009-01-15 23:56 we don't change page state in fork_buffer 2009-01-15 23:56 ? 2009-01-15 23:57 it may be problem 2009-01-15 23:57 because we change b_page without any lock 2009-01-15 23:58 user can see newpage state middle of fork_buffer 2009-01-15 23:58 -!- macan(~macan@159.226.41.129) has joined #tux3 2009-01-16 00:00 I was thinking that stage_delta will pull pages off the delta list under a spinlock that will protect the list and b_page 2009-01-16 00:01 and put it on a different list after launching the bio 2009-01-16 00:01 it syncs with read? 2009-01-16 00:01 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2009-01-16 00:01 read doesn't care which pages it reads from 2009-01-16 00:02 sorry 2009-01-16 00:02 read doesn't care which page it reads from 2009-01-16 00:02 I think it would care page state 2009-01-16 00:02 it just does blockread 2009-01-16 00:03 blockread cares about the page state 2009-01-16 00:03 it only cares about uptodate 2009-01-16 00:03 and blockread() reads page again? 2009-01-16 00:03 ah cares 2009-01-16 00:03 ah 2009-01-16 00:05 ok, test case may be fork_buffer() twice asynchronously 2009-01-16 00:06 ah, it takes lock_page(), but... 2009-01-16 00:07 that one works 2009-01-16 00:07 the first fork_buffer sets all buffers to the current delta 2009-01-16 00:07 second one can see newpage 2009-01-16 00:08 right, that is what we want 2009-01-16 00:08 so, we have to change page state? 
2009-01-16 00:09 yes, in set_bufdelta 2009-01-16 00:09 I think before ->b_page change 2009-01-16 00:10 the page lock excludes fork_buffer from fork_buffer 2009-01-16 00:11 it doesn't protect read_mapping_page() 2009-01-16 00:11 how can that cause trouble? 2009-01-16 00:11 if lock_page before read_mapping_page was removed 2009-01-16 00:12 it will try to read from disk for newpage 2009-01-16 00:12 ah :) 2009-01-16 00:12 newpage must be set uptodate 2009-01-16 00:12 yes, copy from oldpage 2009-01-16 00:13 that is a bug 2009-01-16 00:13 kernel may have function for it 2009-01-16 00:15 memcpy(page_address(newpage), page_address(oldpage), PAGE_CACHE_SIZE); 2009-01-16 00:15 SetPageUptodate(newpage); 2009-01-16 00:15 and I should change all the // to /* */ before I annoy somebody 2009-01-16 00:16 and maybe more state bits 2009-01-16 00:16 for vm 2009-01-16 00:16 more state bits? 2009-01-16 00:16 yes, I thought e.g. PageReferenced() 2009-01-16 00:17 maybe copy the page flags 2009-01-16 00:17 maybe 2009-01-16 00:17 migrate_page_copy() seems to do similar 2009-01-16 00:18 but it thinks about the page for swap 2009-01-16 00:18 ah, I should look more closely at migrate 2009-01-16 00:19 PG_locked should not be copied 2009-01-16 00:19 PG_dirty... unclear what that means for a page that has buffers 2009-01-16 00:20 PG_writeback... good question 2009-01-16 00:20 PG_writeback should not be copied 2009-01-16 00:21 I think we don't use it? 2009-01-16 00:21 right 2009-01-16 00:22 PG_lru doesn't seem to be used 2009-01-16 00:22 buffer_migrate_page() seems it 2009-01-16 00:22 except waiting io 2009-01-16 00:24 ah, and it removes oldpage from mapping 2009-01-16 00:25 right, so it is a good question: can the oldpage still point at the mapping? 
2009-01-16 00:25 I think page->mapping is wrong logically 2009-01-16 00:25 because we need to be able to find the mapping to submit the page for write 2009-01-16 00:26 ah 2009-01-16 00:26 it is offensive, yes :) 2009-01-16 00:26 now let's see if it breaks things 2009-01-16 00:26 probably, only vmscan.c cares about it 2009-01-16 00:28 probably, yes. after atomic commit 2009-01-16 00:29 I'm meaning endio for write is not using page->mapping 2009-01-16 00:29 my theory is that incrementing the page count keeps vmscan from looking at the mapping 2009-01-16 00:29 I'm testing that theory now 2009-01-16 00:30 I don't think it needs to 2009-01-16 00:30 yes, probably after atomic commit 2009-01-16 00:30 after or middle of 2009-01-16 00:31 so, I think we can set page->mapping to some inode->mapping or NULL 2009-01-16 00:33 the question is whether is whether mapping can be set when the page is not in a mapping 2009-01-16 00:34 remove_from_page_cache()? 2009-01-16 00:44 I'll go to shop with this 2009-01-16 00:47 ok 2009-01-16 00:47 It seems, if page count is elevated, vmscan doesn't care about page->mapping 2009-01-16 00:48 could easily have missed something 2009-01-16 00:50 time to sleep 2009-01-16 01:06 oyasumi 2009-01-16 01:30 not quite gone yet :) 2009-01-16 01:30 my dev_blockio fails to set the buffer uptodate (will fix) 2009-01-16 01:32 ok, now oyasumi 2009-01-16 02:02 hirofumi, there? 2009-01-16 02:02 hi 2009-01-16 02:03 we can create a balloc_nolock, and that can be the balloc method for the bitmap btree 2009-01-16 02:03 yes, exactly 2009-01-16 02:03 you already said that before? 
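The balloc_nolock proposal can be pictured with a toy userspace model. Everything below is an assumption made for illustration: the `struct sb` and `struct btree` layouts, the spinlock built from `atomic_flag`, the 64-bit "bitmap", and `flush_bitmap` are hypothetical stand-ins, not Tux3 code. The point is only that the bitmap btree is given an allocation method that does not retake a lock the flusher already holds.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NBLOCKS 64

struct sb {
	atomic_flag lock;   /* toy spinlock standing in for the bitmap lock */
	uint64_t bitmap;    /* toy allocation map, one bit per block */
};

static void bitmap_lock(struct sb *sb)
{
	while (atomic_flag_test_and_set(&sb->lock))
		; /* spin until the flag was previously clear */
}

static void bitmap_unlock(struct sb *sb)
{
	atomic_flag_clear(&sb->lock);
}

/* Caller must already hold the bitmap lock. Returns a block number or -1. */
static int balloc_nolock(struct sb *sb)
{
	for (int i = 0; i < NBLOCKS; i++) {
		uint64_t bit = UINT64_C(1) << i;
		if (!(sb->bitmap & bit)) {
			sb->bitmap |= bit;
			return i;
		}
	}
	return -1;
}

/* Normal entry point: takes the lock around the search. */
static int balloc(struct sb *sb)
{
	bitmap_lock(sb);
	int block = balloc_nolock(sb);
	bitmap_unlock(sb);
	return block;
}

/* Each btree carries its own allocation method. */
struct btree {
	struct sb *sb;
	int (*balloc)(struct sb *sb);
};

/*
 * The flusher holds the bitmap lock while it grows the bitmap btree, so
 * that btree must use the nolock method; calling balloc() here would spin
 * forever on the lock we already hold.
 */
static int flush_bitmap(struct btree *bitmap_btree)
{
	struct sb *sb = bitmap_btree->sb;
	assert(bitmap_btree->balloc == balloc_nolock);
	bitmap_lock(sb);
	int block = bitmap_btree->balloc(sb);
	bitmap_unlock(sb);
	return block;
}
```

A file btree would be set up with `balloc` and the bitmap btree with `balloc_nolock`, which gives the effect of a recursive lock without using one.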
2009-01-16 02:03 no 2009-01-16 02:03 ok 2009-01-16 02:03 it's not even tomorrow yet ;) 2009-01-16 02:04 I'm thinking about it now 2009-01-16 02:04 ok, oyasumi, really :) 2009-01-16 02:04 probably, we can remove btree->lock for bitmap 2009-01-16 02:04 :) oyasumi 2009-01-16 07:34 -!- RazvanM(~RazvanM@96.234.240.153) has joined #tux3 2009-01-16 08:35 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-16 09:24 -!- tim_dimm_(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-16 09:51 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-01-16 10:23 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-16 10:28 -!- gila(~gila@5ED41295.cable.ziggo.nl) has joined #tux3 2009-01-16 10:53 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-16 10:56 hirofumi, still awake? 2009-01-16 10:56 yes 2009-01-16 10:57 you are right, the bitmap btree does not need a lock 2009-01-16 10:57 yes 2009-01-16 10:57 however, I found another issue 2009-01-16 10:57 on bitmap pages flush, we change btree for bitmap 2009-01-16 10:58 but, we need to allocate block for that btree 2009-01-16 10:59 but allocation can happen middle of btree change 2009-01-16 10:59 e.g. insert_leaf() allocates new_node() 2009-01-16 11:00 it calls balloc(), but bitmap btree root is not allocated yet 2009-01-16 11:00 balloc is only supposed to touch cache 2009-01-16 11:01 it reads btree to read bitmap pages 2009-01-16 11:01 that is, page cache 2009-01-16 11:01 yes 2009-01-16 11:01 but we are chaging btree for bitmap 2009-01-16 11:02 when we are flushing bitmap pages 2009-01-16 11:02 yes I see it 2009-01-16 11:03 that is another issue 2009-01-16 11:05 we have to be sure that the bitmap btree in volume cache is always valid when balloc is called 2009-01-16 11:06 it always does seem to be valid 2009-01-16 11:07 if btree is in cache, it still have issue 2009-01-16 11:07 what is the exact path that shows the issue? 
2009-01-16 11:07 insert_leaf() is exact case I think 2009-01-16 11:08 we are flushing bitmap pages now 2009-01-16 11:08 so, we calls map_region(sb->bitmap) 2009-01-16 11:09 we allocate blocks for bitmap pages 2009-01-16 11:09 there is no problem 2009-01-16 11:09 then we calls btree_insert_leaf() 2009-01-16 11:09 btree_insert_leaf() -> insert_leaf() 2009-01-16 11:10 now we are trying to add new dleaf for bitmap pages 2009-01-16 11:11 and if split for root was happened, we allocate new block for new root 2009-01-16 11:11 it calls balloc() 2009-01-16 11:11 so, it is problem 2009-01-16 11:11 map_region -> insert_leaf ->split root -> balloc 2009-01-16 11:12 is that it? 2009-01-16 11:12 btree for bitmap change is middle of change 2009-01-16 11:12 right, and why is it a problem to be in the middle of a change? 2009-01-16 11:13 the cached tree seems to be valid 2009-01-16 11:13 well 2009-01-16 11:13 balloc() try to read bitmap pages for allocation 2009-01-16 11:13 is the new node inserted yet 2009-01-16 11:13 yes 2009-01-16 11:13 so, balloc can't search all bitmap pages 2009-01-16 11:13 ok, so the problem statement is: we try to read the btree before a new node is inserted 2009-01-16 11:14 yes 2009-01-16 11:14 ah, some btree designs leave one node entry free to handle a problem like this 2009-01-16 11:15 oh 2009-01-16 11:15 it's not a very attractive solution 2009-01-16 11:15 and it's not clear it is a solution 2009-01-16 11:16 well, I was think to solve this, delaying block allocation for bitmap 2009-01-16 11:16 we are delaying it already 2009-01-16 11:16 I was thinking about to solve this 2009-01-16 11:17 completely delay 2009-01-16 11:17 yes 2009-01-16 11:17 it's good 2009-01-16 11:17 well, i.e. delay this new_node()'s balloc allocation 2009-01-16 11:18 well when we do the delayed allocation the problem comes back 2009-01-16 11:18 it meant more tricky allocation 2009-01-16 11:19 just a idea though 2009-01-16 11:19 e.g. 
2009-01-16 11:19 balloc() for bitmap saves allocation bit save to memory 2009-01-16 11:20 and balloc checks saved memory 2009-01-16 11:20 on some point, saved memory writes to bitmap pages 2009-01-16 11:21 um... 2009-01-16 11:21 one point: the block we are trying to insert is guaranteed to be in cache, we only need to worry about being able to read other blocks 2009-01-16 11:22 yes 2009-01-16 11:23 and even before we complete the insert_leaf, those other blocks should be accessible via the existing btree 2009-01-16 11:24 the important thing is that the bitmap btree should be a valid tree at the point balloc is called, and I think it is 2009-01-16 11:24 yes, current code can't though 2009-01-16 11:24 ah, where does it break? 2009-01-16 11:24 splited child is not accessible yet 2009-01-16 11:25 because we don't have new root yet 2009-01-16 11:25 we should be calling blockdirty before the split for each of the new children, which causes fork 2009-01-16 11:26 so the old version of the block is still accessible in the cache 2009-01-16 11:27 the old version of the block is still accessible from its parent 2009-01-16 11:28 I need time to imagine it 2009-01-16 11:28 it's a good sign that you did not find a flaw immediately :) 2009-01-16 11:29 I think fork saves us this time 2009-01-16 11:29 well, fork may solve, but... 2009-01-16 11:30 ah, it is fork forcely? 2009-01-16 11:30 ? 2009-01-16 11:31 it meant the fork should be fork buffer, because delta shouldn't be same? 2009-01-16 11:31 exactly 2009-01-16 11:31 or we do fork forcely? 
2009-01-16 11:31 by defintion 2009-01-16 11:31 ok 2009-01-16 11:32 because we flush the btree only at a delta transition 2009-01-16 11:32 yes 2009-01-16 11:32 ok, here is another fun thing to think about: replay needs to be two pass 2009-01-16 11:33 1) reconstruct the pinned, dirty btree nodes 2) replay the ballocs 2009-01-16 11:33 the reason is, we cannot access the bitmap btree before we have reconstructed its file index, which requires reconstructing the inode index 2009-01-16 11:34 it's easy to do a two pass replay 2009-01-16 11:34 in fact, it will probably be three pass: 0) make a list of all the places in the log where promises are retired 2009-01-16 11:36 um... 2009-01-16 11:41 ah, replay 2009-01-16 11:41 I was thinking about rollup 2009-01-16 11:42 right, replay runs in an easier environment: filesystem not started yet 2009-01-16 11:42 yes 2009-01-16 11:42 we can do things like kmalloc list links without worrying about it 2009-01-16 11:43 i see 2009-01-16 11:45 if (create && inode != sb->bitmap) 2009-01-16 11:45 down_write_nested(&cursor->btree->lock, inode == sb->bitmap); <- lazy fix to the balloc btree recursion? 2009-01-16 11:46 maybe, down_read is needed too 2009-01-16 11:47 is not needed 2009-01-16 11:47 it's not, but I was trying to be _very_ lazy 2009-01-16 11:47 it's an ugly solution either way 2009-01-16 11:47 and since it's just on flush, we don't care about taking an extra read lock 2009-01-16 11:48 if (inode != sb->bitmap) { if (create) ..? 2009-01-16 11:49 right, it's a little less ugly, I just tried to hide the hack a little ;) 2009-01-16 11:49 ok fine 2009-01-16 11:49 ah 2009-01-16 11:50 down_write would be needed 2009-01-16 11:50 ah, um... 
2009-01-16 11:50 we change btree in map_region 2009-01-16 11:50 and another balloc() reads it 2009-01-16 11:50 only one task should ever change the bitmap btree 2009-01-16 11:50 ah 2009-01-16 11:51 well, I was thinking remove bitmap->i_mutex 2009-01-16 11:51 I thought it can be atomic operations 2009-01-16 11:51 that is easy as we saw 2009-01-16 11:51 oh 2009-01-16 11:52 well, it can lock a single block 2009-01-16 11:52 atomic operation on a bit range is hard 2009-01-16 11:53 just use spin_lock()? 2009-01-16 11:53 for now 2009-01-16 11:53 yes 2009-01-16 11:54 well, lock_buffer() may also work 2009-01-16 11:54 spinlock is even cheaper, but a litle more work 2009-01-16 11:55 lock_buffer will be fine 2009-01-16 11:55 by the way, my proposed blockread does not do lock_buffer, and it should 2009-01-16 11:55 ah, yes 2009-01-16 11:55 otherwise multiple bio reads will be launched for the same block 2009-01-16 11:56 yes 2009-01-16 11:56 it ends up wanting something like block_read_full_page 2009-01-16 11:56 which we want to write anyway, if we are going to change to block handles some time 2009-01-16 11:57 quick way would be ll_rw_block() 2009-01-16 11:57 well, bitmap, lock_buffer() would be good 2009-01-16 11:57 and lock_buffer in blockread is good enough too 2009-01-16 11:58 yes 2009-01-16 11:58 I would like to start switching away from block library, where we can 2009-01-16 11:58 we still use it for fork_buffer -> read_mapping_page 2009-01-16 11:59 fork_buffer -> read_mapping_page -> readpage -> tux3_readpage 2009-01-16 11:59 if so, it would be care about buffer/page state and locking 2009-01-16 11:59 it needs to care 2009-01-16 12:00 which function needs to care? 
2009-01-16 12:00 new blockread and io 2009-01-16 12:00 and blockget 2009-01-16 12:02 I am thinkint lock_buffer might be enough 2009-01-16 12:02 checking now 2009-01-16 12:03 block_read_full_page does lock_buffer on pages it is going to read 2009-01-16 12:04 yes 2009-01-16 12:05 well, we have to care like it
2009-01-16 12:05 lock_buffer(buffer);
2009-01-16 12:05 int err = ((blockio_t *)mapping->host->i_private)(buffer, READ);
2009-01-16 12:05 if (err) {
2009-01-16 12:05     unlock_buffer(buffer);
2009-01-16 12:05     brelse(buffer);
2009-01-16 12:05     return ERR_PTR(err);
2009-01-16 12:05 }
2009-01-16 12:05 unlock_buffer(buffer);
2009-01-16 12:06 unlock_buffer() looks strange 2009-01-16 12:06 it should be endio? 2009-01-16 12:06 why in endio? 2009-01-16 12:07 because it is end of io 2009-01-16 12:07 bit it is syncio 2009-01-16 12:07 sorry 2009-01-16 12:07 but it is syncio 2009-01-16 12:08 who wakes up? 2009-01-16 12:09 syncio sets up its own wakeup 2009-01-16 12:09 own?
2009-01-16 12:09 static int syncio(int rw, struct block_device *dev, sector_t sector, unsigned vecs, struct bio_vec *vec)
2009-01-16 12:09 {
2009-01-16 12:09     struct biosync sync = { .wait = __WAIT_QUEUE_HEAD_INITIALIZER(sync.wait) };
2009-01-16 12:09     if (!(sync.err = vecio(rw, dev, sector, biosync_endio, &sync, vecs, vec)))
2009-01-16 12:09         wait_event(sync.wait, sync.done);
2009-01-16 12:09     return sync.err;
2009-01-16 12:09 }
2009-01-16 12:10 it wakes up syncio itself 2009-01-16 12:10 I meant wakeup for lock_buffer 2009-01-16 12:10 ah 2009-01-16 12:10 it is unlock_buffer 2009-01-16 12:10 right, unlock_buffer wakes up other waiters 2009-01-16 12:13 um... 2009-01-16 12:13 who update page state? 2009-01-16 12:14 well, anyway, it would work finally 2009-01-16 12:14 if a page has buffers, anybody who wants to set set the page uptodate has to walk the buffers first 2009-01-16 12:15 so, blockread will update?
2009-01-16 12:16 blockread does not set the page uptodate, it could but it does not need to I think 2009-01-16 12:17 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L383 <- block_read_full_page sets the page uptodate 2009-01-16 12:18 yes 2009-01-16 12:18 however, I'm not sure it is ok or not 2009-01-16 12:19 what is the concern? 2009-01-16 12:19 all users of PageUptodate() 2009-01-16 12:19 every user has to check the buffer states before setting the page uptodate 2009-01-16 12:21 before checking? 2009-01-16 12:22 um..., can't beliave PageUptodate()? 2009-01-16 12:22 PageUptodate does not require checking buffer states 2009-01-16 12:23 only SetPageUptodate 2009-01-16 12:23 sorry 2009-01-16 12:23 only caller of SetPageUptodate 2009-01-16 12:25 um... 2009-01-16 12:25 I wonder why endio try to set uptodate to page 2009-01-16 12:26 it is to optimize the case where several blocks are read at the same time 2009-01-16 12:26 I don't think it is an important optimization 2009-01-16 12:27 if users of buffer set uptodate, it sounds like overhead rather 2009-01-16 12:27 true 2009-01-16 12:28 there may be a lot of funny things hiding in the block IO library 2009-01-16 12:28 it's very complex code to read 2009-01-16 12:28 the buffer_endio path is very strange 2009-01-16 12:29 yes, it is why I didn't change current library rules for now 2009-01-16 12:29 I emulate current rule 2009-01-16 12:29 that is wise 2009-01-16 12:30 well, for now, it seems to work 2009-01-16 12:30 I need time to see more 2009-01-16 12:31 I will test block_fork today, and add the blockdirty wrapper 2009-01-16 12:31 not worry too much about async block_fork 2009-01-16 12:33 so, I'm thinking about bitmap 2009-01-16 12:34 well, if we fork btree node for bitmap, how do we access to it? 2009-01-16 12:35 I'm thinking we have to read old buffers 2009-01-16 12:35 true, fork leaves the new buffer in the cache 2009-01-16 12:35 yes 2009-01-16 12:36 and it is too early to do a redirect 2009-01-16 12:36 or is it? 
2009-01-16 12:37 maybe redirect is what we want 2009-01-16 12:37 we split bnodes on memory 2009-01-16 12:37 it makes btree is not accessible 2009-01-16 12:38 we can split a copy 2009-01-16 12:38 we have to redirect there anyway 2009-01-16 12:39 reader side will read which buffer? before split or after split 2009-01-16 12:39 before split 2009-01-16 12:39 yes 2009-01-16 12:40 but, before split doesn't in radix tree anymore? 2009-01-16 12:40 is not in radix tree 2009-01-16 12:41 the old block is still in the radix tree (volume cache) 2009-01-16 12:41 ah 2009-01-16 12:41 i see 2009-01-16 12:43 so, btree->root should be changed atomicly? 2009-01-16 12:43 root.block and root.depth are changed same time 2009-01-16 12:43 yes 2009-01-16 12:44 I am thinking about that 2009-01-16 12:44 i see 2009-01-16 12:45 split means copy-on-write like behavior? 2009-01-16 12:45 split root 2009-01-16 12:45 you mean, copy-on-write of the inode table block? 2009-01-16 12:45 or the cache object? 2009-01-16 12:46 for both btree 2009-01-16 12:46 we have to access old and new btree 2009-01-16 12:46 on disk 2009-01-16 12:47 um... 2009-01-16 12:47 ok, I am thinking about whether we should log all changes to inode attributes 2009-01-16 12:47 dirty block may not be on disk 2009-01-16 12:47 which dirty block? 2009-01-16 12:47 bnodes 2009-01-16 12:48 if we split bnodes on memory, we fork buffer 2009-01-16 12:49 new buffer is not on disk 2009-01-16 12:49 right, we log the change 2009-01-16 12:49 old buffer can't be on disk if it's not flushing 2009-01-16 12:50 log? 2009-01-16 12:50 it meant split bnodes is also logical logging? 
2009-01-16 12:50 when we split, we need to write out block the split blocks in that delta 2009-01-16 12:50 well 2009-01-16 12:51 ah, ok 2009-01-16 12:51 let me introduce another issue at this point 2009-01-16 12:51 oh 2009-01-16 12:52 our rollup needs to run less than once per delta to be efficient 2009-01-16 12:52 i see 2009-01-16 12:52 that means there will be two counters, a sb->delta and a sb->rollup, say 2009-01-16 12:54 i see 2009-01-16 12:54 now, blockfork is not going to work properly if one block on a page uses the ->delta counter and another uses ->rollup 2009-01-16 12:54 so this suggests that _every_ block on the volume inode should be on the ->rollup counter, not the delta counter 2009-01-16 12:55 rollup is going to use for blockfork? 2009-01-16 12:55 yes, blockdirty() -> fork_buffer 2009-01-16 12:55 there are four kinds of blocks in the volume cache: root and leaf of itable; root and leaf of bitmap 2009-01-16 12:55 can you think of any more? 2009-01-16 12:55 I can't 2009-01-16 12:56 log blocks are in a different cache 2009-01-16 12:57 root and leaf of dleaf? 2009-01-16 12:57 sorry 2009-01-16 12:57 there are four kinds of blocks in the volume cache: root and leaf of itable; root and leaf of file btree 2009-01-16 12:58 and for now superblock 2009-01-16 12:58 right, but no real need for superblock to be there 2009-01-16 12:58 I was thinking sb_bread() is to grep this 2009-01-16 12:58 we can leave superblock handling for last 2009-01-16 12:58 yes 2009-01-16 12:58 let's 2009-01-16 12:59 btree and super.c 2009-01-16 12:59 that's it 2009-01-16 12:59 and leaf 2009-01-16 13:00 leaf? 
2009-01-16 13:00 dleaf and ileaf 2009-01-16 13:00 right, I mentioned those above 2009-01-16 13:00 yes 2009-01-16 13:00 only four kinds of blocks in volume table 2009-01-16 13:00 I think that is a very nice thing if true 2009-01-16 13:01 I think it is true 2009-01-16 13:01 at least for now 2009-01-16 13:01 ok, and we should log all changes to all four kinds of blocks 2009-01-16 13:01 at least for current format 2009-01-16 13:01 i see 2009-01-16 13:02 that simplifies some analysis, and is good for efficiency 2009-01-16 13:02 ileaf and dleaf need special consideration for replay 2009-01-16 13:02 but it is not too hard 2009-01-16 13:03 we can start by logging all attributes of a changed inode, then improve it by logging only changed attributes 2009-01-16 13:03 if we changed many index of bnodes, redirect is more efficient? 2009-01-16 13:03 it is 2009-01-16 13:03 i see 2009-01-16 13:05 changed attributes sounds hard to do 2009-01-16 13:05 writing a full block retires all promises for that block, so when we know there are a lot of promises to make we just write the block instead 2009-01-16 13:05 may not be hard actually, need vfs change 2009-01-16 13:05 well 2009-01-16 13:05 it's pretty easy :) 2009-01-16 13:06 yes 2009-01-16 13:06 we just compare the present masks at log time 2009-01-16 13:06 but changed attributes is just an optimization 2009-01-16 13:06 we can write full inodes to the log for now 2009-01-16 13:06 ah, i see 2009-01-16 13:08 I think that the simplification of knowing that we will log every change to every block in the volume cache makes it easier to think about redirect and split 2009-01-16 13:09 so every block in the volume cache is on the rollup cycle, not the delta cycle 2009-01-16 13:10 now maybe return to the question of updating the btree root 2009-01-16 13:11 here is log every change means counter or something? 
2009-01-16 13:11 just log as we already log 2009-01-16 13:11 we just need more kinds of log entries, and more code in replay 2009-01-16 13:12 the result is cleaner high level design and faster filesystem 2009-01-16 13:12 um..., I'm not understanding yet why we need this log 2009-01-16 13:13 why we need to log every change to a block in volume cache? 2009-01-16 13:13 yes 2009-01-16 13:13 it is because having a separate cycle counter for rollup relies on log replay 2009-01-16 13:14 and it is necessary to have a separate cycle counter for rollup to get any efficiency advantage from rollup 2009-01-16 13:18 What are the requirements for compiling the nightly snapshots from the website? I'm missing popt.h 2009-01-16 13:19 install the popt.h devel package 2009-01-16 13:20 Figures; but are there anymore? :) 2009-01-16 13:20 I'll go by try and error :) 2009-01-16 13:20 that's the current system :) 2009-01-16 13:20 we should improve it 2009-01-16 13:21 ah, I was thinking about removing popt dependency 2009-01-16 13:21 thats only required for the user space version right? 2009-01-16 13:22 kg-config --cflags libpopt-dev 2009-01-16 13:22 Package libpopt-dev was not found in the pkg-config search path. 2009-01-16 13:22 hirofumi, how, by writing our own options parser? 2009-01-16 13:22 I was thinking getopt or getopt_long 2009-01-16 13:23 it is in any posix or glibc 2009-01-16 13:23 we make very simple use of options, so it is probably ok 2009-01-16 13:23 I forget why popt was preferred 2009-01-16 13:23 but I am sure we will remember as soon as we change ;) 2009-01-16 13:23 :) 2009-01-16 13:23 I was following your discussion btw; and I was wondering compared to btrfs -- is tux3 more complicated? 2009-01-16 13:24 gila, it's currently less than 6,000 lines 2009-01-16 13:24 I think, less complicated 2009-01-16 13:26 whoops - Did i made him angry? 
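For the idea of dropping the popt dependency, a small getopt_long parser might look roughly like this. It is a sketch under assumptions: the `--blocksize` and `--verbose` options, `struct opts`, and `parse_opts` are hypothetical and not taken from the Tux3 tools. Note that getopt itself is POSIX, while getopt_long is an extension provided by glibc and the BSDs.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical option set, for illustration only. */
struct opts {
	unsigned long blocksize;
	int verbose;
	const char *volume;   /* first non-option argument, or NULL */
};

/* Returns 0 on success, -1 on an unknown option or missing argument. */
static int parse_opts(int argc, char *argv[], struct opts *o)
{
	static const struct option longopts[] = {
		{ "blocksize", required_argument, NULL, 'b' },
		{ "verbose",   no_argument,       NULL, 'v' },
		{ NULL, 0, NULL, 0 },
	};
	int c;

	assert(o != NULL);
	optind = 1; /* restart scanning from the first argument */
	while ((c = getopt_long(argc, argv, "b:v", longopts, NULL)) != -1) {
		switch (c) {
		case 'b':
			o->blocksize = strtoul(optarg, NULL, 0);
			break;
		case 'v':
			o->verbose = 1;
			break;
		default:
			return -1;
		}
	}
	o->volume = optind < argc ? argv[optind] : NULL;
	return 0;
}
```

For a tool that makes only simple use of options, as noted in the chat, this is about all the parsing code that is needed.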
2009-01-16 13:26 -!- flips(~phillips@phunq.net) has joined #tux3 2009-01-16 13:26 pused the wrong button ;) 2009-01-16 13:26 :) 2009-01-16 13:27 hehe *phew* I thought I made you angry :-) 2009-01-16 13:27 haha, not, it's not so easy to do that 2009-01-16 13:27 ok :) 2009-01-16 13:27 you have to say "your code sucks and you are ugly" to do that 2009-01-16 13:27 Pardon if the net question is stupid; but how does tux3 compare to CAS based systems 2009-01-16 13:28 just "your code sucks" is not enough by itself 2009-01-16 13:28 haha 2009-01-16 13:28 CAS? 2009-01-16 13:28 Content Addressed Storage 2009-01-16 13:29 you want a wiki with that? 2009-01-16 13:29 http://en.wikipedia.org/wiki/Content-addressable_storage 2009-01-16 13:29 there is no content addressing in Tux3, though a university group is working on block deduplication 2009-01-16 13:29 I thought that's what you meant, was just checking 2009-01-16 13:29 Okay okay 2009-01-16 13:30 Let me clarify some bit. I've just graduated and working now at a storage intergrator.. They use CAS for storing a lot of data and well.. 
Since I'm working there i have developed an interrest (or obession) for filesystems 2009-01-16 13:31 So from walf ->zfs -> btrfs -> tux3 2009-01-16 13:31 thanks to google 2009-01-16 13:32 The nice thing about CAS (or at least the caringo implementation of it) is that it scales not only in size but also performance since it is distributed in nature 2009-01-16 13:33 with upcomming cloud computing etc this is a big advantage 2009-01-16 13:33 it seems like block deduplication is a big start on that, though it is only within one filesystem 2009-01-16 13:34 well,I'n my point of view dedup should always just work on one specific drive 2009-01-16 13:35 there is this company in germany that developed ZFS+ -- it has real-time online dedup 2009-01-16 13:35 Sun is working on it to (at least they told me) 2009-01-16 13:37 but the CAS system is able to store one file (so afther dedeping) on distributed storage 2009-01-16 13:37 So if I save tux3.tar.gz on my share 2009-01-16 13:37 isn't it a layer about the filesystem? 
2009-01-16 13:37 it is saved one time in the netherlands and is distributed for data recovery to the US for instance 2009-01-16 13:38 well what they told me is that it does not use a file system 2009-01-16 13:38 i.e 2009-01-16 13:38 it just writes the bits to disc 2009-01-16 13:38 and stores metadata where to find it 2009-01-16 13:39 in a database 2009-01-16 13:39 it's hard to see why they would not use a fileystem 2009-01-16 13:39 even oracle stores their databases on a filesystem now 2009-01-16 13:39 I have no clue -- you are the expert :) 2009-01-16 13:40 that's my expert opinion ;) 2009-01-16 13:40 hehe 2009-01-16 13:40 I'll have to dig further in that particular system but wikipedia says that git is a user space CAS system 2009-01-16 13:41 it is, but you get every object in every repository 2009-01-16 13:41 if they implemented a way of going out to other repository to get objects, then it would be distributed CAS, it would also fail to work when the internet connection breaks 2009-01-16 13:42 Somehow you can set file attibutes determine on how many different nodes the file has to be saved. It is done by the same metadata 2009-01-16 13:43 they store the metadata in sqlite, and write the data as blob 2009-01-16 13:44 that makes sense 2009-01-16 13:44 however If you change just one bit in a file 2009-01-16 13:44 the whole file is stored again 2009-01-16 13:45 since the hash is different 2009-01-16 13:45 -!- cdk(~chinmay@117.195.32.222) has joined #tux3 2009-01-16 13:46 gila, cdk is one of those working on deduplication for tux3 2009-01-16 13:46 hi flips 2009-01-16 13:46 Hi cdk 2009-01-16 13:46 posting design to mailing list in a while 2009-01-16 13:47 hi gila 2009-01-16 13:47 hi cdk 2009-01-16 13:47 flips: I'm a big fs noob so I think i'm more of a burden then anything else :) 2009-01-16 13:52 Would it be possible to have tux3 distributed (afther dedup) (or sync) inodes with a specific flag? 2009-01-16 13:52 distributed? 
2009-01-16 13:53 Maybe that is a bad choice of words 2009-01-16 13:53 replicate 2009-01-16 13:54 tux3 is to have replication, yes 2009-01-16 13:54 Okay 2009-01-16 13:56 Is there a manual on how to use the fuse implementation? 2009-01-16 14:00 the fuse version of tux3? 2009-01-16 14:01 yes 2009-01-16 14:01 http://hg.tux3.org/tux3/file/0a694d3212ee/user/tux3fuse.c 2009-01-16 14:01 instructions at the beginning of the file 2009-01-16 14:02 file needs to be updated to say tux3fuse instead of fuse-tux3 2009-01-16 14:04 I've already complied and used tux3fuse to create a volume and mointpoint 2009-01-16 14:04 But i cant cd to it 2009-01-16 14:05 if you are using a partition, you need super user privileges. 2009-01-16 14:07 Ah okay; Or use dd to create a file and tux3fuse it right? 2009-01-16 14:08 yes 2009-01-16 14:09 http://hg.tux3.org/tux3/rev/c2305995f834 2009-01-16 14:10 thx 2009-01-16 14:10 flips , about the design .. 2009-01-16 14:12 we are currently maintaining the ref counts for physical blocks in the buckets .. which are pointed to by the hash tree leaves 2009-01-16 14:12 -!- amey(~amey@117.195.32.222) has joined #tux3 2009-01-16 14:14 however when a particular block in a file is edited .. its hash will now have different value and so we wont be able to travel to the same leaf again...to reduce the ref counts 2009-01-16 14:14 problem 2009-01-16 14:16 yes....so as with deletes we require some another structure that we can traverse using block numbers .. 2009-01-16 14:16 here we can maintain only physical block numbers and ref counts 2009-01-16 14:17 need to store the previous hash with the block pointer? 2009-01-16 14:25 hmm....need to think this through.. 2009-01-16 14:25 it's just an idea... would require a change to dleaf.c 2009-01-16 14:26 and introduce more redundant metadata.. 2009-01-16 14:27 free_map: Failed assertion "list_empty(&map->dirty)"! 2009-01-16 14:27 gila, looks like a bug :) 2009-01-16 14:28 gila, can you reproduce it? 
2009-01-16 14:28 yeah just run it again :-) 2009-01-16 14:29 dd if=/dev/zero of=test.img bs=512 count=100 2009-01-16 14:29 ./tux3 make test.img 2009-01-16 14:29 ok, could you mail your bug report to the list? 2009-01-16 14:29 sure 2009-01-16 14:42 it worked for me 2009-01-16 14:43 but a different bug 2009-01-16 14:43 sudo umount ./test || true 2009-01-16 14:43 umount: /more/src/hg/tux3/user/test: device is busy 2009-01-16 14:43 sorry for the funny path :) 2009-01-16 14:45 __destroy_buffers: dirty buffer leak, or list corruption? 2009-01-16 14:45 map [0x8061510] 3/0* 2009-01-16 14:45 __destroy_buffers: Failed assertion "list_empty(&lru_buffers)"! 2009-01-16 14:46 on breaking out of tux3fuse 2009-01-16 14:46 blockdirty() didn't flushed 2009-01-16 14:47 I meant forked buffer was not flushed 2009-01-16 14:47 because I interrupted it 2009-01-16 14:47 is there any way to know whether the change to the file is a new write or an edit ? 2009-01-16 14:48 hirofumi, I am debugging fork_buffer now, radix tree gets confused: 2009-01-16 14:48 (gdb) bt 2009-01-16 14:48 #0 0x08130d76 in radix_tree_gang_lookup (root=0x9848218, results=0x9d4fcec, first_index=0, max_items=14) 2009-01-16 14:48 at lib/radix-tree.c:236 2009-01-16 14:48 #1 0x080928af in find_get_pages (mapping=0x9848214, start=0, nr_pages=14, pages=0x9d4fcec) at mm/filemap.c:754 2009-01-16 14:48 #2 0x080975a1 in pagevec_lookup (pvec=0x9d4fce4, mapping=0x9848214, start=0, nr_pages=14) at mm/swap.c:473 2009-01-16 14:48 #3 0x08097c67 in truncate_inode_pages_range (mapping=0x9848214, lstart=0, lend=-1) at mm/truncate.c:222 2009-01-16 14:48 #4 0x08097d50 in truncate_inode_pages (mapping=0x9848214, lstart=0) at mm/truncate.c:264 2009-01-16 14:48 #5 0x080baf82 in generic_delete_inode (inode=0x984817c) at fs/inode.c:1059 2009-01-16 14:48 #6 0x080ba6bc in iput (inode=0x984817c) at fs/inode.c:1137 2009-01-16 14:49 what was happened after this trace? 
2009-01-16 14:50 loops forever 2009-01-16 14:50 oh 2009-01-16 14:50 it's hard to see the reason why with all the inlining and optimization 2009-01-16 14:50 page->index is not copied? 2009-01-16 14:51 newpage->index = oldpage->index; 2009-01-16 14:51 ok 2009-01-16 14:51 is there a way to compile without inlining? I tried it once and didn't succeed 2009-01-16 14:52 actually, I tried O0 2009-01-16 14:52 compile kernel 2009-01-16 14:52 hey flips 2009-01-16 14:53 CONFIG_OPTIMIZE_INLINING would be related to it 2009-01-16 14:53 its getting later here...need to think about this problem .. will post the design once we get any idea on this... 2009-01-16 14:54 cdk, I will talk with you tomorrow about it 2009-01-16 14:54 ah, ok 2009-01-16 14:54 ok 2009-01-16 14:54 gila, 512*100 is too small for now 2009-01-16 14:55 Okay I'll try something different 2009-01-16 14:55 hirofumi, I can I suppose my problem was a mishandled error 2009-01-16 14:56 I suppose my problem was a mishandled error 2009-01-16 14:56 make_tux3: eek, No space left on device 2009-01-16 14:57 no, still device busy 2009-01-16 14:57 hirofumi: that comes right afther the assertion 2009-01-16 14:57 dd if=/dev/zero of=test.img bs=1M count=32 2009-01-16 14:58 gila, it works? 2009-01-16 15:01 hirofumi: lik a charm 2009-01-16 15:08 cya guys, i'm off 2009-01-16 15:12 ok 2009-01-16 15:13 On 512*100, balloc() seems to return -ENOSPC 2009-01-16 15:13 so, dirty buffer was not flushed 2009-01-16 15:14 it is 12 blocks on 4096 blocksize 2009-01-16 15:18 flips, loops is truncate? 
2009-01-16 17:30 hirofumi, yes 2009-01-16 17:31 it was that I forgot to set newpage->mapping 2009-01-16 17:32 radix tree just loops forever when it can't find the page in the mapping it is supposed to be in 2009-01-16 17:32 because of rcu 2009-01-16 17:32 kind of fragile 2009-01-16 19:21 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-16 19:37 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-16 19:50 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-16 20:36 our unit tests are wonderful things 2009-01-16 21:37 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2009-01-16 22:55 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2009-01-16 23:40 -!- RazvanM(~RazvanM@96.234.240.153) has joined #tux3 2009-01-17 02:04 yet another exercise in deathless geek prose on its way 2009-01-17 02:05 For all you cheapass, err sorry, thrifty loyal readers who do not subscribe to lwn.net, a tux3 post got lead quote of the week on the lwn kernel page: http://lwn.net/Articles/313927/ 2009-01-17 06:46 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-17 07:17 -!- gentoo-user(~kvirc@pD953B8CA.dip.t-dialin.net) has joined #tux3 2009-01-17 07:18 hi 2009-01-17 07:20 i have a question about the tux3 roadmap. is there a date, when the first stable release is planned to be released? (english is not my native language - sorry) 2009-01-17 07:23 if there is a date ...
it would be nice to find this date on tux3.org 2009-01-17 07:23 cu 2009-01-17 07:24 -!- gentoo-user(~kvirc@pD953B8CA.dip.t-dialin.net) has left #tux3 2009-01-17 07:33 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-17 08:25 -!- kushal(~kushal@117.195.35.155) has joined #tux3 2009-01-17 08:25 -!- amey(~amey@117.195.35.155) has joined #tux3 2009-01-17 08:26 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-17 08:43 hi flips 2009-01-17 08:43 -!- gaurav(~gaurav@117.195.35.155) has joined #tux3 2009-01-17 09:10 -!- gaurav_(~gaurav@117.195.32.87) has joined #tux3 2009-01-17 09:10 -!- kushal_(~kushal@117.195.32.87) has joined #tux3 2009-01-17 09:12 -!- amey_(~amey@117.195.32.87) has joined #tux3 2009-01-17 11:59 hirofumi, there? 2009-01-17 12:00 hi 2009-01-17 12:00 I had more locking thoughts 2009-01-17 12:00 yes 2009-01-17 12:00 for the btree, eventually we will use a locking style that just locks the btree blocks that we modify 2009-01-17 12:01 so, just one index block at a time usually, or for a split, a parent and two children 2009-01-17 12:01 for right now? 2009-01-17 12:01 for right now we use the method we have 2009-01-17 12:01 with the hack for bitmaps, it should work fine 2009-01-17 12:02 the above is for future? 
2009-01-17 12:02 anyway, with local locking in the btree, the bitmap btree recursion goes away 2009-01-17 12:02 yes, for the future 2009-01-17 12:02 i see 2009-01-17 12:03 what we do is: first balloc the blocks we need, then lock the 1 or 3 blocks we will modify 2009-01-17 12:03 so balloc is not called under the block lock 2009-01-17 12:03 no lock recursion 2009-01-17 12:03 I just like to know there will eventually be a clean solution :) 2009-01-17 12:04 good :) 2009-01-17 12:05 block_fork seems to be functional, and I am returning to the logging and redirect 2009-01-17 12:05 I was thinking bitmap pages would become atomic lock 2009-01-17 12:05 so, we care only btree 2009-01-17 12:05 I agree 2009-01-17 12:05 yes 2009-01-17 12:05 so a nice solution to both in future, and we have something good enough for now 2009-01-17 12:06 yes 2009-01-17 12:06 I'm going to remove bitmap->i_mutex for now 2009-01-17 12:06 to solve recursive lock 2009-01-17 12:06 good 2009-01-17 12:06 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-17 12:07 and balloc_nolock() would not take btree->lock 2009-01-17 12:08 after all, on bitmap flush, top level caller just locks btree->lock 2009-01-17 12:08 yes 2009-01-17 12:09 ah, for now, bitmap->i_mutex is needed to lock pages 2009-01-17 12:09 until atomic lock 2009-01-17 12:10 well, the hack like something it 2009-01-17 12:11 any hack right now if fine with me 2009-01-17 12:11 ok 2009-01-17 12:12 btw, I'm confusing a bit recent mark_dirty_buffer change 2009-01-17 12:12 it is change for future or just cleanup? 
2009-01-17 12:13 -!- gaurav(~gaurav@117.195.32.87) has joined #tux3 2009-01-17 12:13 just cleanup, and to protect against accidental confusion between clean state and kernel uptodate state, which mean different things 2009-01-17 12:14 i see 2009-01-17 12:14 there almost no sharing of those state functions between userspace and kernel, which makes sense because it is in low level IO 2009-01-17 12:14 -!- cdk(~chinmay@117.195.32.87) has joined #tux3 2009-01-17 12:14 -!- kushal(~kushal@117.195.32.87) has joined #tux3 2009-01-17 12:14 -!- amey(~amey@117.195.32.87) has joined #tux3 2009-01-17 12:14 hi flips 2009-01-17 12:14 hi cdk 2009-01-17 12:15 continuing yesterdays discussion abt edit 2009-01-17 12:15 ok 2009-01-17 12:15 btw, this is not matter, for current async flushing, mark_buffer_dirty() is needed to call after modiy 2009-01-17 12:16 matter though 2009-01-17 12:16 we are thinking about keeping the same design....we will check if the edited block is in the bucket already ... if yes we can change the ref count easily there itself 2009-01-17 12:16 hirofumi, why after modify? 2009-01-17 12:16 because flusher is async 2009-01-17 12:17 dirty -> async flush -> modify can be occured 2009-01-17 12:17 if it is not in the bucket .. we read the physical block which will already be there in the map[] and calculate its hash...using which we can get the bucket 2009-01-17 12:17 cdk, it seems reasonable 2009-01-17 12:18 cdk, and keep in mind that it is pretty easy to refactor your design after you have something working 2009-01-17 12:18 the big step is to get something working 2009-01-17 12:18 assuming that the bucket contains logically continuous blocks...we wont have to read physical blocks every time 2009-01-17 12:18 cdk, yes, it's a good philosphy 2009-01-17 12:19 and we can use the same logic for deletes as well.. 2009-01-17 12:19 considering typical application of dedup in backup and archival storage.. 
there won't be many edits and deletes 2009-01-17 12:20 so this won't be much of an overhead 2009-01-17 12:20 hirofumi, we won't be using mark_buffer_dirty, we will use blockdirty, which includes fork, to avoid re-dirtying a block that will be flushed 2009-01-17 12:21 yes, for future 2009-01-17 12:21 cdk, gaurav, I agree 2009-01-17 12:21 so, I asked just cleanup or not 2009-01-17 12:21 hirofumi, right, async is future 2009-01-17 12:21 just cleanup 2009-01-17 12:22 actually cleanup for atomic commit? 2009-01-17 12:22 it makes me happy for buffer_clean to match the BUFFER_CLEAN state :) 2009-01-17 12:22 it would be good 2009-01-17 12:22 ah, for atomic commit now, we have to change all the mark_buffer_dirty after change to blockdirty before change 2009-01-17 12:22 there are not many of those 2009-01-17 12:23 yes 2009-01-17 12:24 essentially, we create a snapshot of cache and flush the snapshot 2009-01-17 12:25 and for dedup, we need data pages too? 2009-01-17 12:28 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-17 12:29 data pages ? 2009-01-17 12:30 not metadata 2009-01-17 12:35 ah, you are asking if we need to fork data pages, to create a snapshot to flush? 2009-01-17 12:36 yes, for dedup 2009-01-17 12:36 I am not sure :) 2009-01-17 12:38 there is also the question of dedupping memory mapped data 2009-01-17 12:39 yes 2009-01-17 12:43 for mmap, I'm thinking ->page_mkwrite may solve optimisticlly 2009-01-17 12:44 I think so 2009-01-17 12:45 and it will allow us to offer "strong" file data integrity semantics 2009-01-17 12:45 beyond posix 2009-01-17 12:45 yes 2009-01-17 12:46 hi flips, we currently plan not to de-duplicate the bitmap, atable and vtable data blocks... 2009-01-17 12:46 as they'll be changed very frequently 2009-01-17 12:46 kushal, good :-) 2009-01-17 12:47 and because you don't need to 2009-01-17 12:47 and they will seldom be identical 2009-01-17 12:47 yes... 2009-01-17 12:49 also about the sync_super call in tux3fuse .. 
instead of calling it after each write will it be better to call it in tux3_flush and after the fuse loop ends ?? 2009-01-17 12:52 cdk, yes 2009-01-17 12:52 the tux3fuse code will get atomic commit pretty soon, which will be more efficient than calling sync_super after every operation 2009-01-17 12:53 you don't have to worry about that 2009-01-17 12:53 ok 2009-01-17 13:14 -!- kushal_(~kushal@117.195.32.55) has joined #tux3 2009-01-17 13:15 -!- gaurav_(~gaurav@117.195.32.55) has joined #tux3 2009-01-17 13:16 -!- gaurav_(~gaurav@117.195.32.55) has joined #tux3 2009-01-17 13:16 -!- kushal_(~kushal@117.195.32.55) has joined #tux3 2009-01-17 13:16 posted the design to the mailing list. 2009-01-17 13:16 thankyou, I will review and comment later today, my time 2009-01-17 13:17 -!- amey_(~amey@117.195.32.55) has joined #tux3 2009-01-17 13:17 ok 2009-01-17 13:17 It is clearly written 2009-01-17 13:18 -!- kushal_(~kushal@117.195.32.55) has joined #tux3 2009-01-17 13:18 thanks 2009-01-17 13:23 -!- gaurav(~gaurav@117.195.32.55) has joined #tux3 2009-01-17 13:24 -!- cdk(~chinmay@117.195.32.55) has joined #tux3 2009-01-17 13:24 -!- kushal(~kushal@117.195.32.55) has joined #tux3 2009-01-17 13:24 -!- amey(~amey@117.195.32.55) has joined #tux3 2009-01-17 14:14 -!- gila(~gila@5ED41295.cable.ziggo.nl) has joined #tux3 2009-01-17 16:00 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-17 17:16 -!- kushal(~kushal@117.195.32.57) has joined #tux3 2009-01-17 17:19 -!- cdk(~chinmay@117.195.32.57) has joined #tux3 2009-01-17 17:53 -!- cdk(~chinmay@117.195.34.136) has joined #tux3 2009-01-17 17:53 -!- kushal(~kushal@117.195.34.136) has joined #tux3 2009-01-17 19:34 Yummy new design note posted 2009-01-17 20:59 -!- cdk(~chinmay@117.195.32.12) has joined #tux3 2009-01-17 23:24 -!- kushal(~kushal@117.195.42.131) has joined #tux3 2009-01-18 01:45 -!- RazvanM(~RazvanM@96.234.240.153) has joined #tux3 2009-01-18 04:16 -!- hirofumi(~hirofumi@210.171.168.39) has 
joined #tux3 2009-01-18 04:23 -!- gila(~gila@5ED41295.cable.ziggo.nl) has joined #tux3 2009-01-18 06:59 -!- gila(~gila@5ED41295.cable.ziggo.nl) has joined #tux3 2009-01-18 07:08 What kind of data does the btree hold? 2009-01-18 07:08 Inodes? 2009-01-18 07:16 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-18 07:41 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-18 08:26 is the root_dtree a btree on itself? 2009-01-18 10:12 -!- kushal(~kushal@121.246.34.211) has joined #tux3 2009-01-18 10:13 -!- amey(~amey@121.246.34.211) has joined #tux3 2009-01-18 11:54 gila, inodes are in a btree 2009-01-18 11:54 and each file has a btree index 2009-01-18 12:21 flips: Okay; I'm trying to understand how the filesystem is build up 2009-01-18 12:22 gila, http://markmail.org/message/gj7jyjwky4ws5dcc 2009-01-18 12:23 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-18 12:25 Okay; the puzzles me though. I thought that inodes contained the information on where to find the file on the disc 2009-01-18 12:26 So the inode points to a file which is a bree which contains the actual blocks? 2009-01-18 12:26 I'm getting close? 2009-01-18 12:35 apperently no 2009-01-18 12:36 I will study the mails some more 2009-01-18 12:38 Any chance that there is sequence diagram that shows the steps involved when I open a file? i.e open("foo.txt") --> how does tuxfs finds the right data structures and blocks it needs to access en open the file on disk? 2009-01-18 13:36 -!- cdk(~chinmay@121.246.34.211) has joined #tux3 2009-01-18 13:36 gila, the inode points to a btree that points to the file data blocks, yes 2009-01-18 13:37 gila, later we will optimize so that for small files, the inode points directly at the data blocks 2009-01-18 14:17 flips: Okay; thanks that clears things up 2009-01-18 14:18 hi flips.. 
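Editorial sketch of the answer above, that the inode points to a btree which in turn points to the file data blocks. Every name here (`sketch_inode`, `sketch_node`, `sketch_leaf`, `sketch_lookup`) is hypothetical, and a fixed two-level index stands in for a real btree with a variable number of levels; it is not the Tux3 on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the Tux3 on-disk format. */
#define SKETCH_FANOUT 4

struct sketch_leaf {              /* leaf: logical block -> physical block */
    unsigned count;
    uint64_t logical[SKETCH_FANOUT];
    uint64_t physical[SKETCH_FANOUT];
};

struct sketch_node {              /* single internal level above the leaves */
    unsigned count;
    uint64_t first_key[SKETCH_FANOUT];   /* lowest logical block in child i */
    struct sketch_leaf *child[SKETCH_FANOUT];
};

struct sketch_inode {
    struct sketch_node *root;     /* the inode points at the index, not at the data */
};

/* Return the physical block for a logical block, or -1 if it is a hole. */
static int64_t sketch_lookup(const struct sketch_inode *inode, uint64_t logical)
{
    assert(inode != NULL);
    const struct sketch_node *node = inode->root;
    if (!node || node->count == 0 || logical < node->first_key[0])
        return -1;

    /* Pick the last child whose first key is <= the logical block. */
    unsigned i = 0;
    while (i + 1 < node->count && node->first_key[i + 1] <= logical)
        i++;

    /* Scan the leaf for an exact match. */
    const struct sketch_leaf *leaf = node->child[i];
    for (unsigned j = 0; j < leaf->count; j++)
        if (leaf->logical[j] == logical)
            return (int64_t)leaf->physical[j];
    return -1;
}
```

The small-file optimization mentioned above would skip the index and let the inode refer to its data blocks directly; the sketch does not attempt that.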
2009-01-18 14:19 How many internal nodes then does the btree with inodes have 2009-01-18 14:19 I'm assuming that the leaf nodes contain the inode structures themself 2009-01-18 14:27 hi cdk 2009-01-18 14:28 just got dedup to work using hash tree.. need to do some tests now 2009-01-18 14:29 gila, the number of internal nodes is variable 2009-01-18 14:29 got dedup to work :) 2009-01-18 14:29 works for small files. got a graph which shows same block numbers 2009-01-18 14:29 that was fast 2009-01-18 14:30 not done yet. something is breaking for bigger files :) 2009-01-18 14:38 fork_buffer forked a buffer in hackfs 2009-01-18 14:39 there were a couple more bugs 2009-01-18 14:39 will get back to you when we get a more stable code running 2009-01-18 14:39 thanks for the update 2009-01-18 14:40 got to go now. bye 2009-01-18 14:40 bye 2009-01-18 14:40 late over there 2009-01-18 14:40 yes :) 2009-01-18 15:37 -!- gila(~gila@5ED41295.cable.ziggo.nl) has joined #tux3 2009-01-18 15:56 ah, fork_buffer doesn't need to do read_mapping_page to bring the page uptodate, it can do lock_buffer for each buffer on the page instead, which exclusdes asychonous blockread 2009-01-18 15:58 that means it does not have to read unknown blocks from the volume, or worry about blocks past the end of volume 2009-01-18 16:17 flips, there? 2009-01-18 16:17 well, I've got the idea for pointer to root (btree->inode) 2009-01-18 16:18 we don't need btree->inode at all 2009-01-18 16:18 we can just use container_of(btree, struct inode, btree) 2009-01-18 16:19 so, inode parametor to all functions is also not needed 2009-01-18 16:34 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-18 18:12 hirofumi, here now 2009-01-18 18:13 hirofumi, it's good 2009-01-18 20:32 hirofumi, there? 
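Editorial sketch of the `container_of(btree, struct inode, btree)` idea above: when the btree is embedded in its inode, the inode can be recovered from the btree pointer, so neither a stored back pointer nor an extra inode parameter is needed. The struct layouts and the `demo_` names are assumptions for illustration, and the macro is a plain `offsetof` version rather than the kernel's own definition.

```c
#include <assert.h>
#include <stddef.h>

/* Same idea as the kernel's container_of(), written with plain offsetof(). */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_btree {               /* hypothetical fields */
    unsigned long root_block;
    unsigned levels;
};

struct demo_inode {
    unsigned long inum;
    struct demo_btree btree;      /* embedded, so no back pointer is stored */
};

/* Recover the owning inode from a pointer to its embedded btree. */
static struct demo_inode *demo_btree_inode(struct demo_btree *btree)
{
    return demo_container_of(btree, struct demo_inode, btree);
}

/* A btree function can take only the btree and still reach the inode. */
static unsigned long demo_btree_owner_inum(struct demo_btree *btree)
{
    assert(btree != NULL);
    return demo_btree_inode(btree)->inum;
}
```

This only works for a btree that really is a member of the inode struct; a btree that lives anywhere else would need different handling.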
2009-01-18 22:25 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2009-01-18 23:27 -!- RazvanM(~RazvanM@96.234.240.153) has joined #tux3 2009-01-19 01:15 -!- Gila(~jeffry.mo@62-177-200-122.dsl.bbeyond.nl) has joined #tux3 2009-01-19 03:03 -!- Gila(~jeffry.mo@62-177-200-122.dsl.bbeyond.nl) has joined #tux3 2009-01-19 07:27 -!- Gila(~jeffry.mo@62-177-200-122.dsl.bbeyond.nl) has joined #tux3 2009-01-19 07:45 flips, I get the overall picture now on how the filesystem is built up. More so to a URL in the irc logs: http://www.complang.tuwien.ac.at/papers/czezatke%26ertl00/reading.gif 2009-01-19 07:49 The only thing that I don't know is what extents are 2009-01-19 08:11 Is it the same concept as within ext4? 2009-01-19 08:43 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-19 08:44 -!- gaurav(~gaurav@59.95.39.214) has joined #tux3 2009-01-19 09:09 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-19 09:46 Just so that I know that I'm understanding everything, the nightly snapshot is the userspace tux3 code, the git "ddtree" is the entire FS code all in git, and the official repo is all of the FS code under hg? 2009-01-19 09:50 Oh, they seem to be the same thing. :) 2009-01-19 09:59 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-19 11:21 -!- Gila(Gila@195-240-214-27.ip.telfort.nl) has joined #tux3 2009-01-19 12:01 kspaans, correct 2009-01-19 12:01 kspaans, and the kernel code is included in the userspace tree, in the kernel directory 2009-01-19 12:18 hirofumi, there? 2009-01-19 12:19 hi 2009-01-19 12:19 :) 2009-01-19 12:19 the btree->inode solution is good 2009-01-19 12:19 why didn't I see that? :) 2009-01-19 12:19 btree_inode()?
2009-01-19 12:19 ah, yes 2009-01-19 12:19 container_of(btree, struct inode, btree) 2009-01-19 12:20 yes 2009-01-19 12:20 one issue is itable 2009-01-19 12:20 use the volume inode to hold the itable btree 2009-01-19 12:20 not issue, a point to consider 2009-01-19 12:20 it can 2009-01-19 12:20 I think it's reasonable 2009-01-19 12:21 and also we can use current sb->itable 2009-01-19 12:21 I like always being able to do the container_of 2009-01-19 12:21 anyway, either way is fine 2009-01-19 12:22 yes 2009-01-19 12:23 probably, we will use btree_inode() to mark to flush pointer of root 2009-01-19 12:23 question: what guarantees that, after read maps physical data and starts IO, the data is not redirected and freed? 2009-01-19 12:24 um... 2009-01-19 12:24 freed means bfree()? 2009-01-19 12:24 or brelse()? 2009-01-19 12:25 it means, after the delta has committed, the blocks of the previous stable image are freed 2009-01-19 12:25 bfree 2009-01-19 12:25 ok 2009-01-19 12:26 I think: nothing guarantees that data will remain available 2009-01-19 12:26 bfree() would be rcu like things can guarantee? 2009-01-19 12:26 or, remain valid 2009-01-19 12:26 yes, like rcu with delta transition as the "freeze" 2009-01-19 12:27 freeze? 2009-01-19 12:27 the rcu "every task must sleep" step 2009-01-19 12:28 flips: Ahh, I see it there. But tux3 isn't in the mainline kernel tree yet is it? 2009-01-19 12:28 ah, yes 2009-01-19 12:29 ah, rcu terminology is "synchronize" 2009-01-19 12:29 kspaans, not yet 2009-01-19 12:30 yes 2009-01-19 12:30 so, bfree is not the big issue? 2009-01-19 12:30 um... 2009-01-19 12:30 I think bfree is the issue 2009-01-19 12:31 otherwise, we could just let the read run asynchronously and only worry about truncate 2009-01-19 12:31 now we have to worry about whether the data it is reading is still allocated 2009-01-19 12:32 where do we do bfree is issue 2009-01-19 12:33 my first question is, who can read the bfree()ed block?
2009-01-19 12:34 an async read could be started, and be delayed for a long time for some reason 2009-01-19 12:35 any read, actually 2009-01-19 12:35 because there is no synchronizer on read completion 2009-01-19 12:35 async read is why reading bfreed blocks? 2009-01-19 12:35 it could 2009-01-19 12:36 because we allow old buffer (that forked)? 2009-01-19 12:36 allow to read old buffer 2009-01-19 12:36 not even that 2009-01-19 12:37 sys_read calls map_region, then lauches bio 2009-01-19 12:37 yes 2009-01-19 12:37 nothing guarantees that the physical blocks of the mapped region will remain allocated 2009-01-19 12:38 who can bfree it? 2009-01-19 12:38 a write can 2009-01-19 12:38 by redirecting it 2009-01-19 12:38 ah, redirect 2009-01-19 12:39 yes 2009-01-19 12:39 well, for now, lock_page will prevent it for write 2009-01-19 12:40 I actually noticed this problem when I was thinking about rewriting ->readpage without a page lock 2009-01-19 12:41 if there is no lock_page(), we would need to another lock 2009-01-19 12:41 btw, now I'm assume this is file data 2009-01-19 12:41 and what about direct IO, where there is no lock_page? 2009-01-19 12:42 it is issue 2009-01-19 12:42 well, read/write will protect by ->i_mutex, iirc 2009-01-19 12:42 ah 2009-01-19 12:46 it was not true, reader side seems to downgrade to ->i_alloc_sem 2009-01-19 12:48 I am considering the question of whether reader needs begin/end_change 2009-01-19 12:49 if needed, it sounds strange 2009-01-19 12:49 very 2009-01-19 12:49 yes 2009-01-19 12:50 it can possible by implementation issue though 2009-01-19 12:50 similar problems must come up in ext4 with online file shrink, where physical data will move 2009-01-19 12:51 file shrink? 2009-01-19 12:51 fs shrink? 
2009-01-19 12:52 ah, online fs resize 2009-01-19 12:52 yes, fs shrink 2009-01-19 12:54 it may only be expand 2009-01-19 12:54 well 2009-01-19 12:55 lock_page() sounds good 2009-01-19 12:56 lock_page for file data, and probably metadata can be other lock 2009-01-19 12:56 lock_page does seem like the synchronizer we rely on 2009-01-19 12:57 it may similar of current truncate vs read race 2009-01-19 12:57 and it would be protected by lock_page() 2009-01-19 12:59 http://lxr.linux.no/linux+v2.6.27/fs/ext4/inode.c#L4809 <- ext4_page_mkwrite, I wonder what it is for 2009-01-19 12:59 mingming can tell us, no doubt 2009-01-19 12:59 iirc, it reserves journal 2009-01-19 12:59 or free blocks 2009-01-19 13:03 http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2e9ee850355593e311d9a26542290fe51e152f74 2009-01-19 13:05 that comment really needs to be in the source ;) 2009-01-19 13:05 somewhat less than obvious 2009-01-19 13:06 hehhe 2009-01-19 13:06 yes 2009-01-19 13:07 ok, next thing is delalloc 2009-01-19 13:08 ok 2009-01-19 13:08 before I continue the atomic commit prototype, I would like to settle whether we are doing delalloc right from the beginning 2009-01-19 13:09 i see 2009-01-19 13:09 I was working for it 2009-01-19 13:09 and on testing, some problems was found 2009-01-19 13:10 bitmap recurvie, memory reclaim recurvie, btree_inode 2009-01-19 13:10 remaining problem is bitmap recursive lock 2009-01-19 13:11 well, either way is ok to me though 2009-01-19 13:12 if you want to delay delalloc, it's ok 2009-01-19 13:12 if you want delalloc more early, I'll try :) 2009-01-19 13:13 if we do not do delaloc, then we redirect inside ->write_begin, and that may cause mutliple redirects on the same data in the same delta 2009-01-19 13:13 yes 2009-01-19 13:14 it should work, but it doesn't seem nice 2009-01-19 13:14 however, only file data though 2009-01-19 13:14 however, it is also a rare pattern 2009-01-19 13:14 oh 2009-01-19 13:15 so, we need delalloc 
before atomic commit? 2009-01-19 13:15 I am asking that question 2009-01-19 13:16 sure, I think delalloc before atomic commit save our time 2009-01-19 13:17 I need to understand the existing kernel framework better then 2009-01-19 13:17 what is atomic commit 2009-01-19 13:19 atomic commit makes fs reliable like journal on flips's design 2009-01-19 13:19 http://userweb.kernel.org/~hirofumi/delalloc/simple-delalloc.patch 2009-01-19 13:19 this is current my delalloc 2009-01-19 13:20 can I help to understand it? 2009-01-19 13:20 short :) 2009-01-19 13:20 yes 2009-01-19 13:20 it would make more slow current fs 2009-01-19 13:20 however, it would delay allocations 2009-01-19 13:22 where is buffer delay flag set? 2009-01-19 13:23 ->write_begin will set it 2009-01-19 13:23 ->write_begin -> get_block -> map_region => SEG_HOME -> set_buffer_delay() 2009-01-19 13:23 and where is it acted on? 2009-01-19 13:24 ->writepage will allocate blocks with get_block() 2009-01-19 13:24 ->writepage() -> get_block() and clear_buffer_delay() -> balloc() 2009-01-19 13:25 ugh 2009-01-19 13:25 ->writepage() and clear_buffer_delay() -> get_block() -> balloc() 2009-01-19 13:26 http://lxr.linux.no/linux+v2.6.28.1/fs/buffer.c#L1693 2009-01-19 13:26 however, it is in recent version though 2009-01-19 13:27 and block io library checks buffer_delay() to avoid overwrite data by read 2009-01-19 13:27 how did ext4 do delaloc before that? 2009-01-19 13:28 ext4 delalloc was implemented by based on buffer_delay() 2009-01-19 13:29 actually, xfs introduced it 2009-01-19 13:29 and ext4 makes generic it 2009-01-19 13:29 made 2009-01-19 13:32 however, I'm thinking it would be hack until true generic version 2009-01-19 13:34 ok, in your patch, if the ->writepage will be delalloc, then you call map_region with create = 0 2009-01-19 13:35 if the block(s) exist, then block_write_full_page will write to them normally, otherwise... what? 
2009-01-19 13:36 it calls get_block(create == 1) to allocate/get block address 2009-01-19 13:37 if it was mmap, buffer_head should be !buffer_mapped() 2009-01-19 13:37 if it was write, buffer_head may has buffer_delay() 2009-01-19 13:37 where does the get_block(create == 1) come from? 2009-01-19 13:38 block_write_full_page() will call it 2009-01-19 13:38 http://lxr.linux.no/linux+v2.6.28.1/fs/buffer.c#L1693 2009-01-19 13:38 yes, look at it 2009-01-19 13:38 still have not quite understood the full path 2009-01-19 13:38 get_block(inode, block, bh, 1); 2009-01-19 13:39 this allocates block 2009-01-19 13:39 and write data page 2009-01-19 13:39 ->write_begin() is changed one 2009-01-19 13:40 in patch, ->write_begin uses tux3_da_get_block() 2009-01-19 13:40 ok 2009-01-19 13:40 very helpful :) 2009-01-19 13:40 good :) 2009-01-19 13:41 with that change, ->write_begin() doesn't allocate blocks anymore 2009-01-19 13:41 for now, it will get mapped or delay buffers 2009-01-19 13:42 so the ->writepage(s) do the actual allocation and writeout, write_begin/end just dirty the page cache 2009-01-19 13:42 yes 2009-01-19 13:42 where doe the ->writepage(s) come from? 2009-01-19 13:42 for now, there is no change 2009-01-19 13:43 pdflush or memory reclaim will call it 2009-01-19 13:44 if we don't want to write with ->writepage(), maybe we can redirty page instead 2009-01-19 13:44 and mpage_writepages is the fast path? 2009-01-19 13:44 mpage_writepages() doesn't use buffer_head 2009-01-19 13:44 it calls get_block directly, and use bio to write 2009-01-19 13:45 so, buffer_delay will not be checked and cleared 2009-01-19 13:45 and it's very complicated 2009-01-19 13:45 maybe 2009-01-19 13:45 ah, so mpage_writepages does not do delalloc now? 2009-01-19 13:46 yes 2009-01-19 13:46 and maybe never do it 2009-01-19 13:46 "no, it doesn't" ;-) 2009-01-19 13:46 :) 2009-01-19 13:46 so the real delalloc path is pdflush -> what? 2009-01-19 13:46 pdflush -> writepage? 
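[note] The delalloc flow walked through above (->write_begin only marks the buffer as delayed, ->writepage does the allocation later) can be modelled in user space. This is a toy sketch under stated assumptions, not tux3 or kernel code: toy_buffer, toy_write_begin, toy_writepage and toy_balloc are hypothetical names, and the two flags only imitate buffer_mapped() and buffer_delay().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a buffer_head: only the states the chat discusses. */
struct toy_buffer {
    bool mapped;   /* has a physical block, like buffer_mapped() */
    bool delay;    /* allocation postponed, like buffer_delay() */
    long block;    /* physical block number, valid only when mapped */
};

static long toy_next_free = 100;  /* hypothetical allocator cursor */

/* Hypothetical allocator, standing in for balloc(). */
static long toy_balloc(void)
{
    return toy_next_free++;
}

/* write_begin side: never allocates, only marks unmapped buffers delayed. */
static void toy_write_begin(struct toy_buffer *bh)
{
    if (!bh->mapped)
        bh->delay = true;
}

/*
 * writepage side: any buffer without a block (delayed by a write, or
 * simply unmapped as in the mmap case) gets its block here.
 * Returns the block the data would be written to.
 */
static long toy_writepage(struct toy_buffer *bh)
{
    if (!bh->mapped) {
        bh->delay = false;
        bh->block = toy_balloc();
        bh->mapped = true;
    }
    return bh->block;
}
```

Calling toy_write_begin() on a fresh buffer leaves the allocator untouched; only the first toy_writepage() consumes a block, and a second toy_writepage() on the same buffer reuses it.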
2009-01-19 13:47 page-writeback.c 2009-01-19 13:48 yes 2009-01-19 13:49 which is driven by flushing inodes 2009-01-19 13:49 which is starting to sound like something tux3 knows something about 2009-01-19 13:50 inode dirty is checked to flush 2009-01-19 13:52 write_cache_pages is the heart of it 2009-01-19 13:53 we could use our own ->writepage that does not use block_write_full_page but calls map_region directly 2009-01-19 13:53 it would have a structure similar to block_write_full_page 2009-01-19 13:55 it was ->writepages 2009-01-19 13:55 I was assuming generic_sync_sb_inodes() was per fs handler 2009-01-19 13:56 well 2009-01-19 13:57 pdflush -> for each sb -> for each dirty inode -> do_writepages() -> generic_writepages() or ->writepages() -> write_cache_pages() -> ->writepage() 2009-01-19 13:57 yes 2009-01-19 13:58 very nice explanation :) 2009-01-19 13:59 http://lxr.linux.no/linux+v2.6.28.1/fs/ext4/inode.c#L2281 2009-01-19 13:59 this is writepage of ext4 2009-01-19 13:59 ok, so you can ask: what problem does writing our own tux3_write_full_page solve? 2009-01-19 14:00 it just delays the write 2009-01-19 14:00 so, io can be handled on our own 2009-01-19 14:00 ah, that is a little strange 2009-01-19 14:01 so writeback is delayed twice? 2009-01-19 14:01 delayed once in sys_write, and delayed again when pdflush calls writepage?
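[note] The call chain spelled out at 13:57 (do_writepages picks ->writepages if the filesystem supplies one, otherwise the generic walk over dirty pages that ends in ->writepage) can be sketched as plain function-pointer dispatch. All toy_ names below are hypothetical and the page model is deliberately trivial; this shows the shape of the dispatch only, not the kernel's locking or writeback_control handling.

```c
#include <assert.h>
#include <stddef.h>

struct toy_page { int dirty; };

struct toy_aops {
    int (*writepage)(struct toy_page *page);
    int (*writepages)(const struct toy_aops *ops,
                      struct toy_page *pages, size_t n);
};

/* Like write_cache_pages(): hand every dirty page to the worker callback. */
static int toy_write_cache_pages(struct toy_page *pages, size_t n,
                                 int (*worker)(struct toy_page *page))
{
    int written = 0;
    for (size_t i = 0; i < n; i++) {
        if (!pages[i].dirty)
            continue;
        if (worker(&pages[i]) != 0)
            return -1;          /* stop at the first failing page */
        pages[i].dirty = 0;
        written++;
    }
    return written;
}

/* Like do_writepages(): prefer ->writepages, else the generic per-page walk. */
static int toy_do_writepages(const struct toy_aops *ops,
                             struct toy_page *pages, size_t n)
{
    if (ops->writepages)
        return ops->writepages(ops, pages, n);
    return toy_write_cache_pages(pages, n, ops->writepage);
}

static int toy_writepage_calls;  /* counts worker invocations for the demo */

/* Minimal ->writepage: pretend the io succeeded. */
static int toy_writepage(struct toy_page *page)
{
    (void)page;
    toy_writepage_calls++;
    return 0;
}
```

A filesystem that wants to batch its own io would fill in the writepages slot; one that only wants "to get it working" leaves it NULL and is driven one page at a time.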
2009-01-19 14:01 no 2009-01-19 14:01 delayed allocation in sys_write 2009-01-19 14:02 vm's ->writepage will delay until our write 2009-01-19 14:03 pdflush calls ->writepages before ->writepage 2009-01-19 14:03 we have to handle it with atomic commit 2009-01-19 14:04 or disable pdflush itself by some trick 2009-01-19 14:04 ah, after all, delayed twice :) 2009-01-19 14:06 tux3_writepage will actually launch a bio for now 2009-01-19 14:06 and we will have lots of little bios 2009-01-19 14:06 yes 2009-01-19 14:06 optimization without atomic commit can be by ->writepages() 2009-01-19 14:07 so the atomic commit goes in tux3_writepage 2009-01-19 14:07 it can collects dirty pages, and write at a once 2009-01-19 14:07 tux3_writepage -> end_change -> commit_delta 2009-01-19 14:08 now... the locking 2009-01-19 14:08 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has left #tux3 2009-01-19 14:08 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-19 14:08 ah 2009-01-19 14:08 what locks are held above ->writepage? 
2009-01-19 14:09 or write it by our functions 2009-01-19 14:09 just ignore ->writepage() 2009-01-19 14:09 double delay :) 2009-01-19 14:09 yes 2009-01-19 14:09 I think we can do that as an incremental improvement 2009-01-19 14:10 yes 2009-01-19 14:10 well, I just thought our function can be more easy 2009-01-19 14:10 either way is ok 2009-01-19 14:10 I would like to drive it from write_cache_pages, which has a generic interface where we can supply our own worker function 2009-01-19 14:11 mpage_writepages works this way, but it is very complex 2009-01-19 14:11 yes 2009-01-19 14:11 ->writepages is for it 2009-01-19 14:11 _s_ 2009-01-19 14:11 ah, but it is per inode 2009-01-19 14:12 generic_writepages is a very thin shell on write_cache_pages 2009-01-19 14:12 yes 2009-01-19 14:12 anyway, I am talking about how we optimize it, to get it working we just implement ->writepage 2009-01-19 14:13 and it will not win every speed contest :) 2009-01-19 14:13 which you already know obviously 2009-01-19 14:14 I was not thinking to commit on writepage() in past 2009-01-19 14:14 where then? 2009-01-19 14:15 I was thinking we will do own central function 2009-01-19 14:15 just image though 2009-01-19 14:15 commit can just launch IO 2009-01-19 14:15 and we can wait for it somewhere else 2009-01-19 14:15 but to start, we do the wait in the commit 2009-01-19 14:16 yes 2009-01-19 14:16 so some random ->writepage can take a very long time to complete 2009-01-19 14:16 and lots of locks will be taken inside it 2009-01-19 14:17 our btree locks, i_mutex on different files, page_locks 2009-01-19 14:17 my image was stage_delta() will work for all dirty buffers 2009-01-19 14:17 yes, that is as I described it originally 2009-01-19 14:18 so, I thought ->writepage() shouldn't do it 2009-01-19 14:18 actually... 2009-01-19 14:18 writepage -> change_end -> { stage_delta(); commit_delta() } <- when I intended for initial version 2009-01-19 14:19 caller of writepage is where? 
2009-01-19 14:19 pdflush 2009-01-19 14:19 ah 2009-01-19 14:20 it is a very simple idea 2009-01-19 14:20 it may work 2009-01-19 14:21 it will cause a big stall on delta transition, but still will be pretty efficient for some loads 2009-01-19 14:21 with big delta 2009-01-19 14:21 like untarring a kernel tree 2009-01-19 14:22 and I found a problem 2009-01-19 14:22 ah 2009-01-19 14:22 if sys_write() writes multiple pages, it would not work 2009-01-19 14:23 because? 2009-01-19 14:23 writepage will be called multiple times 2009-01-19 14:23 with one change_begin() 2009-01-19 14:23 why only one change_begin? 2009-01-19 14:24 because sys_write is only one 2009-01-19 14:24 change_begin can be in tux3_writepage 2009-01-19 14:24 ah 2009-01-19 14:24 this satisfies posix 2009-01-19 14:24 which is not very strict 2009-01-19 14:25 result is one page delta? 2009-01-19 14:26 ah 2009-01-19 14:26 not every change_begin/end causes a delta transition 2009-01-19 14:26 ah 2009-01-19 14:26 it normally just takes and releases a rw sem 2009-01-19 14:26 and checks to see if there should be a delta 2009-01-19 14:27 so, my crude hack just counts to 10 then does a delta 2009-01-19 14:27 ah, yes 2009-01-19 14:27 and we can start with something mindless like that 2009-01-19 14:27 then make it dependent on block congestion etc 2009-01-19 14:27 with a clever, efficient and accurate algorithm :) 2009-01-19 14:28 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-19 14:29 hirofumi, thanks a lot, I think I am keeping you up past your oyasumi time 2009-01-19 14:29 I noticed why I wondered about it 2009-01-19 14:29 no problem 2009-01-19 14:29 and besides, it's sk8 o'clock 2009-01-19 14:29 :) 2009-01-19 14:30 hirofumi, we'll have to get you out on the santa monica beach path one day 2009-01-19 14:30 -!- gila(~gila@195-240-214-27.ip.telfort.nl) has joined #tux3 2009-01-19 14:30 I was thinking flusher will down(sb->delta_lock); stage_delta(); commit_delta(); up() 2009-01-19 14:31
yes, exactly 2009-01-19 14:31 well, that can happen in any writepage 2009-01-19 14:31 according to me 2009-01-19 14:31 so, I wonder why change_begin() 2009-01-19 14:32 tim_dimm, thanks :) 2009-01-19 14:32 however, it actully was no problem 2009-01-19 14:33 it does the above, however it just increment delta counter needlessly 2009-01-19 14:33 tim_dimm, do you have a url for the scale presentation? 2009-01-19 14:33 I think now I understand what the above meant 2009-01-19 14:34 and it makes sense, to start? 2009-01-19 14:34 yes, exactly 2009-01-19 14:34 it is also compatible with delalloc 2009-01-19 14:34 looking... 2009-01-19 14:34 i see 2009-01-19 14:35 wait, which scale presentation? 2009-01-19 14:36 on february 18 2009-01-19 14:36 nope 2009-01-19 14:37 http://scale7x.socallinuxexpo.org/conference-info/schedules 2009-01-19 14:37 http://scale7x.socallinuxexpo.org/conference-info/speakers/daniel-phillips <- hirofumi, it will be partly about you 2009-01-19 14:38 and it won't be versioning at that point 2009-01-19 14:38 oh 2009-01-19 14:38 but most probably atomic committing, and in initial review 2009-01-19 14:38 I go to? 2009-01-19 14:39 if you can :) 2009-01-19 14:39 but email a photo if you can't? 2009-01-19 14:40 wow, I have to fix that abstract, it sucks ;) 2009-01-19 14:41 I have not gone abroad until now, and can't speak english at all 2009-01-19 14:42 heh 2009-01-19 14:42 you write it very well 2009-01-19 14:42 I learned english for email and for computer :) 2009-01-19 14:43 nobody in los angeles speaks english :) 2009-01-19 14:43 we all speak "dude" 2009-01-19 14:44 if you don't know someone's name, call them dude 2009-01-19 14:44 ;-) 2009-01-19 14:44 :) 2009-01-19 14:45 who is the nerd on the pic ? 
;-) 2009-01-19 14:45 ancient photo of me 2009-01-19 14:45 heheh I knew that 2009-01-19 14:48 oh, you will make a speech about tux3 on that 2009-01-19 14:49 this will be the first presentation of the tux3 design 2009-01-19 14:49 are you guys working on the kernel files or the user files in tux3 btw 2009-01-19 14:49 flips, Could you film it? 2009-01-19 14:49 gila, we work on both, most of the kernel files are also used in userspace 2009-01-19 14:49 some things are not completely clear 2009-01-19 14:50 flips, Okay -- the snapshot still up to date? 2009-01-19 14:50 nightly snapshot is nightly :) 2009-01-19 14:50 hirofumi, I would appreciate it if you would tell me what is broken about my block ops in hackfs 2009-01-19 14:50 gettimeofday() 2009-01-19 14:51 ok, if I can 2009-01-19 14:53 it is about fork_buffer() in tux3ml? 2009-01-19 14:55 and blockread and blockdirty 2009-01-19 14:55 it doesn't work for now? 2009-01-19 14:56 it seems to work 2009-01-19 14:56 and blockget 2009-01-19 14:56 I will post a current patch 2009-01-19 14:56 ok, I'll see after sleep and bitmap work 2009-01-19 14:58 well, I guess the issue is locking only 2009-01-19 14:59 current patch: http://mailman.tux3.org/pipermail/tux3/2009-January/000674.html 2009-01-19 15:00 I will say oyasumi 2009-01-19 15:00 oyasumi 2009-01-19 15:03 blockread() just calls blockget() 2009-01-19 15:03 yes 2009-01-19 15:03 I think it has the same problem in current ->readpage 2009-01-19 15:03 lock_page() -> writepage() -> balloc() -> blockread() 2009-01-19 15:03 -> lock_page() 2009-01-19 15:04 maybe use a different blockread for bitmap 2009-01-19 15:04 i see 2009-01-19 15:05 or we may be able to use find_get_page() 2009-01-19 15:05 this is why I was asking questions about lock_page today, which protects the file -> disk mapping 2009-01-19 15:06 it checks PageUptodate(), and if not, it will take lock_page 2009-01-19 15:07 I now understand that lock_page is essential for ->readpage 2009-01-19 15:08 well, maybe I think we can replace
lock_page() with lock_buffer() though 2009-01-19 15:08 hirofumi, blockget drops the page lock 2009-01-19 15:08 oh 2009-01-19 15:08 ah 2009-01-19 15:08 yes 2009-01-19 15:09 dropping the page lock can't solve the problem 2009-01-19 15:09 it is a double lock 2009-01-19 15:09 what is the path? 2009-01-19 15:09 lock_page() -> writepage() -> balloc() -> blockread() -> blockget() -> lock_page() 2009-01-19 15:10 it is current one though 2009-01-19 15:10 yes, ok that is tomorrow's problem 2009-01-19 15:12 lock_buffer() and find_get_page() may solve it 2009-01-19 15:12 um... 2009-01-19 15:13 where does the lock_page() -> writepage() come from? 2009-01-19 15:13 caller of writepage() must have lock_page() 2009-01-19 15:14 right, but this ->writepage is on the bitmap inode, isn't it? 2009-01-19 15:14 yes 2009-01-19 15:14 and just in bitmap flush 2009-01-19 15:14 currently it is flushed by pdflush 2009-01-19 15:15 so, it takes lock_page() 2009-01-19 15:15 right, so we will not let pdflush flush it 2009-01-19 15:15 we will control that by taking the bitmap inode off the sb dirty list 2009-01-19 15:15 ok?
2009-01-19 15:15 or add redirty ->writepage 2009-01-19 15:16 yes, or that 2009-01-19 15:16 well, it sounds work 2009-01-19 15:16 either way, we will walk our own dirty buffer list afterwards 2009-01-19 15:16 yes 2009-01-19 15:16 that is the list created by fork_buffer 2009-01-19 15:16 or, that is part of the list 2009-01-19 15:17 yes 2009-01-19 15:17 ok, it is sk8 oclock now, oyasumi 2009-01-19 15:18 oyasumi 2009-01-19 15:38 why is tux3_write_super() reading instead of; what the name implies -- writing 2009-01-19 17:44 gila, slight misname for historical reasons, it means "write cached superblock to the disk superblock" 2009-01-19 17:45 it reads the old image so it doesn't have to repack the read-only fields 2009-01-19 17:46 superblock handling is a little crufty, we will sort it out after atomic commit is working, it's not a very critical design element 2009-01-19 17:50 "Requires a second blkdev flush by the caller to complete the operation." http://lxr.linux.no/linux+v2.6.27/fs/super.c#L246 2009-01-19 17:52 we will avoid that whole crazy vfs flush path, it is design to support ancient primitive filesystems like ext2 2009-01-19 21:55 -!- kushal(~kushal@121.246.34.211) has joined #tux3 2009-01-19 22:44 hi flips 2009-01-19 22:45 -!- cdk(~cdk@121.246.34.211) has joined #tux3 2009-01-19 22:46 -!- amey(~amey@121.246.34.211) has joined #tux3 2009-01-19 23:02 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2009-01-19 23:22 -!- RazvanM(~RazvanM@96.234.240.153) has joined #tux3 2009-01-19 23:27 kushal, hi 2009-01-19 23:31 in our dedup solution, we're getting an assert fail for cursor->len < cursor->maxlen... 2009-01-19 23:31 is there any limit on the cursor size? 2009-01-19 23:32 this occurs only for big files 2009-01-19 23:32 it is set to allow the btree to become one level deeper in an operaton 2009-01-19 23:33 if it becomes more than one level deeper, it would hit the assert 2009-01-19 23:33 how big is the big file? 
2009-01-19 23:34 24 mb is getting copied successfully...but nothing larger than that... 2009-01-19 23:35 int maxlevel = btree->root.depth + 1 + extra; <- it happens here 2009-01-19 23:36 try extra + 6 :) 2009-01-19 23:36 btree.c 2009-01-19 23:37 -!- cdk(~cdk@121.246.34.211) has joined #tux3 2009-01-19 23:38 -!- kushal(~kushal@121.246.34.211) has joined #tux3 2009-01-19 23:38 -!- amey_(~amey@121.246.34.211) has joined #tux3 2009-01-19 23:38 sorry lost connection...might have missed out after my last msg... 2009-01-19 23:40 int maxlevel = btree->root.depth + 1 + extra; <- it happens here 2009-01-19 23:40 try extra + 6 :) 2009-01-19 23:40 btree.c 2009-01-19 23:40 we will consider this more carefully later 2009-01-19 23:41 or extra + 10 2009-01-19 23:41 the extra is not really significant 2009-01-19 23:41 but without dedup it gives no such faults...? so we were concerned whether we messed up something 2009-01-19 23:41 the extra memory I mean 2009-01-19 23:42 Don't make the mistake of assuming we never make mistakes ;) 2009-01-19 23:42 :) 2009-01-19 23:44 we should add some stort of "tux3 info " to tell you something about a file, for example the depth of its btree 2009-01-19 23:44 we should add a command to dump the btree index of a file 2009-01-19 23:44 ya...that would be good... 2009-01-19 23:44 we will add some utilities like that, as we go 2009-01-19 23:45 this is running under fuse? 2009-01-19 23:45 yes... 2009-01-19 23:46 I will give you a variant of "show_tree" that just shows the index, I will put it on my to-do list 2009-01-19 23:47 and you can use it in your code for debugging purposes 2009-01-19 23:47 ok...but i think its not a problem with the file btree...it is with our hash tree... 2009-01-19 23:48 home many entries does it have, and how large is an entry? 2009-01-19 23:48 how many I mean 2009-01-19 23:49 a leaf has 256 entries..each of 128bits... 
2009-01-19 23:50 how many entries in total in your btree, I meant 2009-01-19 23:51 one entry per non-duplicate file data block... 2009-01-19 23:51 and how many do you think there are now, before it stops? 2009-01-19 23:52 approx 6000 entries... 2009-01-19 23:53 that would be a 5 level btree 2009-01-19 23:54 sorry 2009-01-19 23:54 jsut a 2 level tree 2009-01-19 23:55 just a min..trying to get a graph... 2009-01-20 00:00 yes..it would be a 2 level tree...then i guess the node split is not working 2009-01-20 00:00 as it is we're getting the assert failure in level_add_root 2009-01-20 00:01 print a message saying how many levels it has 2009-01-20 00:01 first, set your maxlevel higher 2009-01-20 00:01 and see how deep your btree is 2009-01-20 00:01 not split not working would be a common problem 2009-01-20 00:01 sorry 2009-01-20 00:01 node split not working would be a common problem 2009-01-20 00:02 also, you might want to take a look at user/btree.c 2009-01-20 00:02 ok... 2009-01-20 00:02 it sets up a simple btree test using "uleaf", which is just a test format 2009-01-20 00:02 you can test your btree in isolation, similarly 2009-01-20 00:03 it is a very good idea to write a unit test 2009-01-20 00:03 ok...will do that... 2009-01-20 00:03 just load your tree with thousands or millions of random entries and see what happens 2009-01-20 00:04 ok... 2009-01-20 00:04 you can just copy user/btree.c and make it your unit test file 2009-01-20 00:06 yes... 
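[note] The "2 level tree" estimate above follows from simple arithmetic: about 6000 entries at 256 entries per leaf need 24 leaves, which fit under a single index node. A small helper makes the estimate repeatable. The index-node fanout is an assumption here (the chat never states it), and btree_levels is a hypothetical name, not a tux3 function; "levels" counts the leaf level too.

```c
#include <assert.h>

/*
 * Estimate how many levels a btree needs, counting the leaf level.
 * per_leaf is entries per leaf block; fanout is children per index node
 * (an assumed value, not taken from the tux3 source).
 */
static int btree_levels(unsigned long entries, unsigned per_leaf,
                        unsigned fanout)
{
    assert(per_leaf > 0 && fanout > 1);

    /* Leaf blocks needed, rounded up; an empty tree still has one leaf. */
    unsigned long blocks = entries ? (entries + per_leaf - 1) / per_leaf : 1;
    int levels = 1;

    /* Each pass adds one index level above the current one. */
    while (blocks > 1) {
        blocks = (blocks + fanout - 1) / fanout;
        levels++;
    }
    return levels;
}
```

With per_leaf = 256 and an assumed fanout of 256, 6000 entries give 2 levels, so an assert on cursor length at that size points at something other than legitimate tree growth.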
2009-01-20 00:08 ok, goodnight, it sounds like your progress is good 2009-01-20 00:08 I will respond to your design pretty soon 2009-01-20 00:08 early night today :) 2009-01-20 00:09 looking forward to your response 2009-01-20 00:58 hey flips 2009-01-20 01:52 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-01-20 02:54 -!- amey(~amey@121.246.34.211) has joined #tux3 2009-01-20 03:10 -!- kushal_(~kushal@121.246.34.120) has joined #tux3 2009-01-20 03:47 -!- gaurav(~gaurav@121.246.34.120) has joined #tux3 2009-01-20 08:21 -!- kushal(~kushal@121.246.34.120) has joined #tux3 2009-01-20 08:34 -!- RazvanM(~RazvanM@96.234.239.248) has joined #tux3 2009-01-20 09:59 -!- gaurav(~gaurav@121.246.34.120) has joined #tux3 2009-01-20 10:03 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-01-20 10:09 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-20 10:15 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-20 10:29 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-20 10:32 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-20 10:59 -!- amey(~amey@121.246.34.120) has joined #tux3 2009-01-20 11:27 hi flips 2009-01-20 11:59 -!- tim_dimm(~timothyhu@rrcs-64-183-50-58.west.biz.rr.com) has joined #tux3 2009-01-20 12:04 -!- cdk(~chinmay@121.246.34.120) has joined #tux3 2009-01-20 13:57 -!- tim_dimm(~timothyhu@rrcs-64-183-50-58.west.biz.rr.com) has joined #tux3 2009-01-20 15:56 hi all 2009-01-20 15:56 flips: ping 2009-01-20 16:23 shapor, pong 2009-01-20 17:03 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-20 17:52 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-20 18:29 hirofumi, there? 
2009-01-20 18:29 hi 2009-01-20 18:29 static int tux3_writepage(struct page *page, struct writeback_control *wbc) 2009-01-20 18:29 { 2009-01-20 18:29 struct sb *sb = tux_sb(page->mapping->host->i_sb); 2009-01-20 18:29 change_begin(sb); 2009-01-20 18:29 int err = block_write_full_page(page, tux3_get_block, wbc); 2009-01-20 18:29 change_end(sb); 2009-01-20 18:29 return err; 2009-01-20 18:29 } 2009-01-20 18:29 this depends on using delalloc 2009-01-20 18:30 otherwise, a map_region would be done from begin_write, outside the change brackets 2009-01-20 18:30 does it make sense? 2009-01-20 18:31 I thought about it yesterday 2009-01-20 18:31 maybe, we have to do stage_delta forcely? 2009-01-20 18:32 forcibly in what sense? 2009-01-20 18:33 otherwise, allocation would be done outside of stage_delta? 2009-01-20 18:34 we don't need a state_delta for every change_end 2009-01-20 18:34 sorry 2009-01-20 18:34 don't need a stage_delta for every change_end 2009-01-20 18:34 yes 2009-01-20 18:34 did I answer the right question? 2009-01-20 18:35 I'm not sure, map_region(create == 1) is done outside of stage_delta, or not 2009-01-20 18:36 currently it is done in write_begin and writepage 2009-01-20 18:36 yes 2009-01-20 18:36 and after delalloc, it would be writepage 2009-01-20 18:37 and block_write_full_page write data out 2009-01-20 18:38 write is allowed before stage_delta? 
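[note] The point that a stage_delta is not needed for every change_end ties back to the "count to 10 then do a delta" hack mentioned the day before. A single-threaded toy model of that policy is below. Everything prefixed toy_ is hypothetical; the real change_begin/change_end take and release a rw semaphore, which this sketch replaces with a plain counter, and the stage and commit steps are empty placeholders.

```c
#include <assert.h>

#define TOY_DELTA_INTERVAL 10  /* the "count to 10" policy from the chat */

struct toy_sb {
    int changes;  /* change_end calls since the last delta */
    int delta;    /* number of deltas committed so far */
    int active;   /* changes currently between begin and end */
};

/* Placeholder: a real stage would gather the dirty buffers of the delta. */
static void toy_stage_delta(struct toy_sb *sb) { (void)sb; }

/* Placeholder: a real commit would write the delta and wait for the io. */
static void toy_commit_delta(struct toy_sb *sb) { sb->delta++; }

static void toy_change_begin(struct toy_sb *sb)
{
    sb->active++;  /* real code would take the semaphore for read here */
}

/* Returns 1 if this call triggered a delta transition, else 0. */
static int toy_change_end(struct toy_sb *sb)
{
    assert(sb->active > 0);
    sb->active--;  /* real code would release the semaphore here */

    if (++sb->changes < TOY_DELTA_INTERVAL)
        return 0;  /* the common case: no delta, just bookkeeping */

    sb->changes = 0;
    toy_stage_delta(sb);
    toy_commit_delta(sb);
    return 1;
}
```

Wrapping 25 page writes in begin/end pairs yields two delta transitions, which is why most ->writepage calls stay cheap and only an occasional one pays for the commit.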
2009-01-20 18:38 yes 2009-01-20 18:38 oh 2009-01-20 18:38 i see 2009-01-20 18:38 we only require that filesystem change be atomic, in this case just mapping a page to disk 2009-01-20 18:39 block_write_full_page is async 2009-01-20 18:39 mapping and write out 2009-01-20 18:39 and we need all the in-flight writes to complete before returning from commit_delta, just for now 2009-01-20 18:40 in ordered data mode that is done by sync_inode_pages(1) 2009-01-20 18:41 calling sync_inode_pages inside ->writepage does not seem like a good idea 2009-01-20 18:42 somehow I was thinking we don't allow to write data out before stage_delta() 2009-01-20 18:42 it can be done either way 2009-01-20 18:42 however, it was not right, so that looks good to me 2009-01-20 18:44 ok, we need a way to wait for all in-flight page writes to complete 2009-01-20 18:45 one issue I noticed now, writepage is taking lock_page 2009-01-20 18:46 it may cause of problem in stage_delta or commit_delta 2009-01-20 18:46 ah, may not have 2009-01-20 18:46 I hope not :) 2009-01-20 18:47 we take just about every lock in stage_delta, except page lock I think 2009-01-20 18:48 block_write_full_page drops lock_page, so we can wait passed page itself too 2009-01-20 18:48 so, it would be not problem 2009-01-20 18:49 possible one is memory allocation path 2009-01-20 18:49 but, it can be prevented by ~__GFP_FS 2009-01-20 18:49 I hope 2009-01-20 18:49 I think that is right 2009-01-20 18:49 it seems to work for now 2009-01-20 18:52 ok, can generic_sync_sb_inodes be called from inside ->writepage? 
2009-01-20 18:53 ah, I think it can not 2009-01-20 18:54 because it waits for inodes to sync 2009-01-20 18:54 but, now we are blocking one inode 2009-01-20 18:55 if we just wait on pages, not the inode, I think we can 2009-01-20 18:56 we just want to wait on pages 2009-01-20 18:56 filemap_fdatawait(mapping); 2009-01-20 18:56 yes 2009-01-20 18:57 ok, I think I see the whole picture 2009-01-20 18:57 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2009-01-20 18:57 pdflush may flush the inode data, is it ok for atomic commit? 2009-01-20 18:57 we will control the inode table writeback completely 2009-01-20 18:58 without pdflush driving it 2009-01-20 18:58 vfs will dirty inode by timestamp change 2009-01-20 18:59 we can keep the vfs from flushing our inodes 2009-01-20 19:00 it is easy, just __mark_inode_dirty, then move the inode to our own list 2009-01-20 19:00 vfs may never flush anything on our volume 2009-01-20 19:01 you were talking about other inodes besides the volume inode... 2009-01-20 19:01 yes, with some trick 2009-01-20 19:03 in our stage/commit, we may walk our own dirty inode list to do save_inode and filemap_fdatawait 2009-01-20 19:03 maybe we don't have to do save_inode 2009-01-20 19:04 maybe, we can, however maybe we need some trick 2009-01-20 19:04 we have to know dirty inode 2009-01-20 19:05 for it, we may use sb->s_op->dirty_inode() hook 2009-01-20 19:07 I was just looking at that 2009-01-20 19:07 and if we don't provide ->write_inode() handler, vfs will ignore writing inode 2009-01-20 19:07 just an idea 2009-01-20 19:07 a good idea 2009-01-20 19:09 I will post my to.do list for atomic commit 2009-01-20 19:09 ok 2009-01-20 19:10 and ->dirty_inode would have to take inode refcnt to prevent freeing the inode 2009-01-20 19:12 I will add "dirty inode handling" as a task 2009-01-20 19:12 good 2009-01-20 19:29 I might have noticed one issue related to dirty inode 2009-01-20 19:30 vfs will update timestamp asynchronously with our save_inode 2009-01-20 19:31
so, maybe inode timestamp is not right exactly 2009-01-20 19:34 if that is the worst problem it is not a bad one 2009-01-20 19:35 yes, I guess all fs have this problem on linux 2009-01-20 19:37 btw, will we have per delta state flag? 2009-01-20 19:38 e.g. DELTA_STATE_INITIAL -> *_FLUSH_DATA -> *_FLUSH_INODE -> *_FLUSH_BITMAP or something 2009-01-20 19:39 well, I thought it can be used to solve bitmap recursive lock 2009-01-20 20:00 I thought we already had a solution for the bitmap recursion 2009-01-20 20:00 ...vfs does not flush the bitmap 2009-01-20 20:00 we may not need those state flags 2009-01-20 20:01 let's complete the pieces and see how it looks 2009-01-20 20:02 deferred free is one little thing... redirect does not immediately call balloc, but puts the block address on the free list 2009-01-20 20:02 the free list will be a vector of pages 2009-01-20 20:02 either pagevec, or custom coded 2009-01-20 20:02 a small detail, I need to decide and code 2009-01-20 20:03 cursor redirect exists in prototype (posted) 2009-01-20 20:03 except for saving the deferred free 2009-01-20 20:07 I think it is about lock_page() recursive 2009-01-20 20:07 it meant btree->lock recursive 2009-01-20 20:08 temporary solution for a while 2009-01-20 20:33 -!- kushal(~kushal@121.246.34.120) has joined #tux3 2009-01-20 20:33 hi flips 2009-01-20 21:21 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-20 21:39 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2009-01-20 22:04 hi kushal 2009-01-20 22:06 actually we figured out what the problem seems to be... the max buffers limit is being reached... 2009-01-20 22:06 so if we increase the value of max buffers...it works properly for larger files... 2009-01-20 22:07 ah, sounds like a bug in buffer.c 2009-01-20 22:08 how did that show up as a cursor depth exceeded? 2009-01-20 22:08 max_buffers... 
10,000, that seems like a lot 2009-01-20 22:09 yes...so the problem is not with buffer.c 2009-01-20 22:09 rather the buffers are not being evicted... 2009-01-20 22:09 right, buffer count > 0 2009-01-20 22:09 bug 2009-01-20 22:10 buffer count leak 2009-01-20 22:10 ok... 2009-01-20 22:12 show_active_buffers(map) <- show all buffers with count > 0 2009-01-20 22:12 we did that as well as printed the dirty buffers... 2009-01-20 22:13 they are not even close to 10000 2009-01-20 22:13 with you see lots of buffers with count > 0? 2009-01-20 22:13 shall i paste the trace? 2009-01-20 22:13 ah 2009-01-20 22:13 please 2009-01-20 22:13 just a min... 2009-01-20 22:16 you can do: trace_on("try to evict buffers"); 2009-01-20 22:16 and see if it is trying to evict 2009-01-20 22:17 yes...it tries to evict... 2009-01-20 22:17 and it doesn't find any to evict because? 2009-01-20 22:20 just a min...pasting the trace... 2009-01-20 22:23 -!- cdk(~chinmay@121.246.34.120) has joined #tux3 2009-01-20 22:24 here is the trace 2009-01-20 22:24 new_buffer: try to evict buffers 2009-01-20 22:24 show_buffers: (map 0x9d20418) 2009-01-20 22:24 [3] 10d0/100 2009-01-20 22:24 [6] 1b1f/1 2587/100 2009-01-20 22:24 [12] a04/100 2009-01-20 22:24 [18] c6c/100 2009-01-20 22:24 [39] 2327/100 53c/100 2009-01-20 22:24 [45] 7a4/100 2009-01-20 22:24 [57] 158f/100 2009-01-20 22:24 [63] 17f7/100 2009-01-20 22:24 [69] 1a8/100 2009-01-20 22:24 [78] 18c7/100 2009-01-20 22:24 [93] 1ecb/100 2009-01-20 22:24 [99] 1bff/100 2009-01-20 22:24 [126] 1b1e/1 2009-01-20 22:24 [135] 106b/100 2009-01-20 22:24 [138] 2522/100 2009-01-20 22:24 [141] d9f/100 2009-01-20 22:24 [144] ed3/100 99f/100 2009-01-20 22:24 [162] 66f/100 2009-01-20 22:24 [171] 22c2/100 4d7/100 2009-01-20 22:24 [180] 16c2/100 33f/100 2009-01-20 22:24 [189] 152a/100 2009-01-20 22:24 [192] 20c6/100 2009-01-20 22:25 [222] 1d32/100 12ca/100 2009-01-20 22:25 [225] 1e66/100 2009-01-20 22:25 [243] b9a/100 2009-01-20 22:25 [270] 24bd/100 2009-01-20 22:25 [273] 273e/1 
2009-01-20 22:25 [276] e6e/100 2009-01-20 22:25 [303] 225d/100 472/100 2009-01-20 22:25 [318] e/9997 2009-01-20 22:25 [321] 14c5/100 2009-01-20 22:25 [336] 1ac9/100 2009-01-20 22:25 [354] 1ccd/100 1265/100 2009-01-20 22:25 [357] 271c/94 2009-01-20 22:25 [375] b35/100 2009-01-20 22:25 [387] 1005/100 2009-01-20 22:25 [396] 939/100 2009-01-20 22:25 [414] 609/100 2009-01-20 22:25 [432] 2d9/100 2009-01-20 22:25 [435] 21f8/100 2009-01-20 22:25 [441] 141/100 2009-01-20 22:25 [444] 2060/100 2009-01-20 22:25 [453] 1994/100 1460/100 2009-01-20 22:25 [471] 1b98/100 2009-01-20 22:25 [477] 1e00/100 2009-01-20 22:25 [486] 1200/100 2009-01-20 22:25 [489] 26b7/100 2009-01-20 22:25 [513] d38/100 2009-01-20 22:25 [519] fa0/100 2009-01-20 22:25 [522] 2457/100 2009-01-20 22:26 [528] 8d4/100 2009-01-20 22:26 [549] 6d8/100 2009-01-20 22:26 [555] 178f/100 40c/100 2009-01-20 22:26 [576] 1ffb/100 2009-01-20 22:26 [588] 1a63/100 2009-01-20 22:26 [621] 2652/100 2009-01-20 22:26 [639] 3/0* 2009-01-20 22:26 [645] cd3/100 2009-01-20 22:26 [651] f3b/100 2009-01-20 22:26 [654] 23f2/100 2009-01-20 22:26 [660] 86f/100 2009-01-20 22:26 [666] 5a3/100 2009-01-20 22:26 [672] 165a/100 2009-01-20 22:26 [684] 273/100 2009-01-20 22:26 [687] 3a7/100 2009-01-20 22:26 [693] db/100 2009-01-20 22:26 [705] 13fa/100 2009-01-20 22:26 [708] 1f96/100 2009-01-20 22:26 [723] 1b32/100 2009-01-20 22:26 [738] 119a/100 2009-01-20 22:26 [747] ace/100 2009-01-20 22:26 [750] c02/100 2009-01-20 22:26 [753] 25ed/100 2009-01-20 22:26 [768] e06/100 2009-01-20 22:26 [786] 238d/100 2009-01-20 22:26 [786] 238d/100 2009-01-20 22:26 [804] 15f5/100 2009-01-20 22:26 [807] 2191/100 2009-01-20 22:27 [810] 185d/100 2009-01-20 22:27 [816] 20e/100 2009-01-20 22:27 [825] 76/100 2009-01-20 22:27 [837] 1395/100 2009-01-20 22:27 [840] 1f31/100 19fd/100 2009-01-20 22:27 [846] 1c65/100 2009-01-20 22:27 [849] 1d99/100 2009-01-20 22:27 [870] 1135/100 2009-01-20 22:27 [879] a69/100 2009-01-20 22:27 [912] 809/100 2009-01-20 22:27 [927] 1728/100 
2009-01-20 22:27 [939] 212c/100 2009-01-20 22:27 [945] 192c/100 2009-01-20 22:27 [969] 1330/100 2009-01-20 22:27 show_dirty_buffers: 1 dirty buffers: 2009-01-20 22:27 3/0* show_dirty_buffers: end 2009-01-20 22:27 new_buffer: expand buffer pool 2009-01-20 22:27 new_buffer: Maximum buffer count exceeded (10000) 2009-01-20 22:27 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-20 22:27 the show_buffers() and show_dirty_buffers() called before the evict buffers loop 2009-01-20 22:34 the /100 is a count of 100 2009-01-20 22:36 is that a partial list above? 2009-01-20 22:36 oh 2009-01-20 22:36 ah 2009-01-20 22:36 I see 2009-01-20 22:37 it always starts at the same place, and gives up after looking at 100 buffers 2009-01-20 22:37 broken algorithm 2009-01-20 22:38 then there is a leak, and it only has to leak 100 buffers, then it can't evict any buffers after that 2009-01-20 22:38 ok... 2009-01-20 22:39 first thing is to find where the leak comes from 2009-01-20 22:39 at the end of each filesystem operation, there should be no buffers with count > 0 2009-01-20 22:40 yes... 2009-01-20 22:40 so run a very small operation that leaves 1 buffer with count > 0 2009-01-20 22:41 then go put show_active_buffers everywhere to find out where that buffer first appears 2009-01-20 22:43 ok 2009-01-20 22:47 hirofumi, still there? 2009-01-20 22:47 yes 2009-01-20 22:48 I thought we were just going to skip the btree lock for the bitmap 2009-01-20 22:48 yes 2009-01-20 22:48 I was thinking safe way to skip 2009-01-20 22:48 and AB BA deadlock 2009-01-20 22:49 well, now new patch seems to work 2009-01-20 22:50 I am going to write a little but important piece now... a place to save the deferred frees 2009-01-20 22:50 deferred free blocks? 
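[note] The diagnosis at 22:37 (the eviction scan always starts at the same place and gives up after looking at 100 buffers, so 100 leaked references are enough to block eviction for good) can be reproduced with a toy scan. This is a guess at the failure mode for illustration only; toy_buf and toy_find_victim are hypothetical and do not mirror the actual buffer.c code.

```c
#include <assert.h>
#include <stddef.h>

struct toy_buf {
    int count;  /* reference count; > 0 means pinned, cannot be evicted */
};

/*
 * Scan from `start`, wrapping around, examining at most `limit` buffers.
 * Returns the index of the first buffer with count == 0, or -1 if the
 * scan gave up without finding one.
 */
static int toy_find_victim(const struct toy_buf *bufs, size_t n,
                           size_t start, size_t limit)
{
    for (size_t i = 0; i < n && i < limit; i++) {
        size_t at = (start + i) % n;
        if (bufs[at].count == 0)
            return (int)at;
    }
    return -1;
}
```

With 100 pinned buffers at the front of a 150-buffer pool, a scan from index 0 limited to 100 probes finds nothing even though 50 buffers are free; lifting the limit, or starting where the previous scan stopped, finds index 100. Either change only hides the leak, which is why the advice in the chat is to find the leaked reference first.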
2009-01-20 22:50 yes 2009-01-20 22:50 ok 2009-01-20 22:50 when redirect frees a block, the free has to be held on a list until after the delta has completed 2009-01-20 22:51 btw, I think the list was missing inode delayed save 2009-01-20 22:51 I will make a followup post 2009-01-20 22:51 actually it's not missing 2009-01-20 22:52 it's just implied 2009-01-20 22:52 i see 2009-01-20 22:52 "log inode update" and "replay inode update" 2009-01-20 22:53 then, inode is an ileaf... it just gets retired in rollup, like a dleaf or a btree node 2009-01-20 22:53 "retired" means, the dirty block is written out, retiring any promises against it 2009-01-20 22:53 does it include inode number allocation without save_inode()? 2009-01-20 22:54 I'm thinking, it might be nice to do that now 2009-01-20 22:55 keep a list of allocated inodes that are not yet entered in the inode table 2009-01-20 22:55 yes 2009-01-20 22:56 we can also handle the same issue with a log entry 2009-01-20 22:57 is it not necessary? 2009-01-20 22:57 I don't think we actually have to do anything different there 2009-01-20 22:58 i see 2009-01-20 22:58 but let's put some of the redirect and logging code in, then we can analyze that better 2009-01-20 22:58 yes 2009-01-20 22:58 the cursor redirect prototype will be complete as soon as there is a way to record the deferred free 2009-01-20 22:58 if unnecessary, I'd like to delay it 2009-01-20 22:59 yes, delay it 2009-01-20 22:59 there are lots of small things to do before that is the biggest issue 2009-01-20 22:59 yes, I was thinking that was necessary 2009-01-20 23:00 if not, it's good 2009-01-20 23:00 we can work on other parts 2009-01-20 23:00 the nice thing is, every item on the to.do list is a small thing 2009-01-20 23:00 ah, I remember a task I forgot to list 2009-01-20 23:02 "find retired physical/logical blocks on replay" 2009-01-20 23:02 "consult the retired list before replaying each entry" 2009-01-20 23:03 i see 2009-01-20 23:04 this rule for replay gives 
us a very nice property: we can actually retire any dirty volume block we want to, whenever we want to 2009-01-20 23:04 we are not restricted to retiring all dirty metadata on a rollup 2009-01-20 23:04 well, unfortunately, my thinking is not reaching to replay yet at all 2009-01-20 23:06 right, you are thinking about dealocks 2009-01-20 23:06 deadlocks 2009-01-20 23:06 replay is the entire reason for atomic commit 2009-01-20 23:06 yes 2009-01-20 23:06 if replay works, and we aren't dealocking, then we have succeeded 2009-01-20 23:07 I'm thinking atomic commit, but it still is for logging and redirect 2009-01-20 23:07 that's fine 2009-01-20 23:07 I will write the replay 2009-01-20 23:07 yes 2009-01-20 23:08 maybe deadlock seems to solved 2009-01-20 23:09 it seems to be to be solved 2009-01-20 23:09 fsx-linux under memory pressure was passed 1 hour 2009-01-20 23:09 ah 2009-01-20 23:09 you hit it in fsx-linux before? 2009-01-20 23:09 yes 2009-01-20 23:09 it hit the bitmap deadlock 2009-01-20 23:09 good :) 2009-01-20 23:09 yes 2009-01-20 23:10 how do you create the pressure? 
2009-01-20 23:10 test on big file 2009-01-20 23:10 memory is 256M, and test file is 256M 2009-01-20 23:11 fsx-linux saves test result to memory, probably it will be same size with test file (256M) 2009-01-20 23:11 by the way, a server company is sending me a big machine to test on 2009-01-20 23:12 8 cores, 12 disks 2009-01-20 23:12 oh, great 2009-01-20 23:12 so I will be running fsx-linux stress tests also, starting in a week or so 2009-01-20 23:12 for now, I will stay with uml 2009-01-20 23:12 yes 2009-01-20 23:13 usually, small environment may be enough 2009-01-20 23:13 until benchmarkable 2009-01-20 23:13 not good for smp testing, but finds a lot of other problems 2009-01-20 23:14 int defer_free(struct sb *sb, block_t block) 2009-01-20 23:14 { 2009-01-20 23:14 2009-01-20 23:14 } <- now I will fill this function in 2009-01-20 23:14 ok 2009-01-20 23:15 it will be a list of pages, each page holds 512 8 byte deferred extent frees 2009-01-20 23:15 for userspace, I will define struct page 2009-01-20 23:15 and the same code will work in userspace 2009-01-20 23:16 if we want, we can use radix tree to hole 2009-01-20 23:16 I considered that 2009-01-20 23:16 oh 2009-01-20 23:17 allocate an address_space 2009-01-20 23:17 then I thought, it is a little bit heavyweight 2009-01-20 23:17 i see 2009-01-20 23:17 and I thought of a very simple way to do it with just a list of pages 2009-01-20 23:17 it will be cute 2009-01-20 23:18 and short, and efficient 2009-01-20 23:18 we use ->lru? 2009-01-20 23:18 page->private // circular list of pages 2009-01-20 23:18 lru is used by vm 2009-01-20 23:19 I would like to use it, but that would not fit with other usage 2009-01-20 23:19 page is not for buffer? 2009-01-20 23:19 no buffers on this page 2009-01-20 23:19 i see 2009-01-20 23:20 get_page will keep shrink_caches from trying to evict it 2009-01-20 23:20 page is page cache? 
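The layout described above (a list of pages, each holding 512 8-byte deferred extent frees) can be sketched in userspace as follows. This is a minimal sketch under assumptions: toy_page, toy_defree, toy_defer_free and toy_defree_release are hypothetical names, the fill counter location is a guess, and the link is stored next to the data here whereas the log keeps it in page->private.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096
#define FREES_PER_PAGE (TOY_PAGE_SIZE / sizeof(uint64_t))	/* 512 entries */

/* Hypothetical stand-in for one page of deferred frees. */
struct toy_page {
	struct toy_page *older;		/* plays the role of the page->private link */
	uint64_t frees[FREES_PER_PAGE];	/* one page worth of 8 byte entries */
};

struct toy_defree {
	struct toy_page *newest;	/* page currently being filled */
	unsigned fill;			/* entries used in the newest page */
	unsigned pages;			/* pages allocated so far */
};

/* Record one deferred free; a new page is added only when the current one is full. */
static int toy_defer_free(struct toy_defree *defree, uint64_t extent)
{
	if (!defree->newest || defree->fill == FREES_PER_PAGE) {
		struct toy_page *page = malloc(sizeof *page);
		if (!page)
			return -1;
		page->older = defree->newest;
		defree->newest = page;
		defree->fill = 0;
		defree->pages++;
	}
	defree->newest->frees[defree->fill++] = extent;
	return 0;
}

/* Drop every page, as would happen once the delta has completed. */
static void toy_defree_release(struct toy_defree *defree)
{
	while (defree->newest) {
		struct toy_page *page = defree->newest;
		defree->newest = page->older;
		free(page);
	}
	defree->fill = defree->pages = 0;
}
```

Appending an entry touches only the newest page, which is the cache friendliness argument made in the log: the entries are a plain vector, and the list is walked only once per page.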
2009-01-20 23:21 just an alloc_pages() page 2009-01-20 23:21 not in page cache 2009-01-20 23:21 so, we have refcnt alway until free_pages 2009-01-20 23:21 yes 2009-01-20 23:21 it's our page 2009-01-20 23:22 we hold a count on it 2009-01-20 23:22 get_page would not be needed 2009-01-20 23:23 alloc_pages() returns refcnt=1 page 2009-01-20 23:23 true 2009-01-20 23:26 if we need header for it, slab may be more efficient 2009-01-20 23:26 for cpu cache 2009-01-20 23:28 it might 2009-01-20 23:28 the only efficiency that really matters here is, overhead of putting an entry onto the page 2009-01-20 23:29 well, that doesn't matter much either 2009-01-20 23:29 the big thing is, using a vector instead of a linked list is cache friendly 2009-01-20 23:30 yes, probably 2009-01-20 23:30 if you want to change my algorithm later, you are welcome :) 2009-01-20 23:30 I just want it short now 2009-01-20 23:30 :) 2009-01-20 23:30 well, I thought about page data cache line is alway same 2009-01-20 23:31 this is very cache line friendly 2009-01-20 23:31 8 extents per cache line 2009-01-20 23:31 if we touch only one page 2009-01-20 23:31 we only touch one page 2009-01-20 23:32 :) 2009-01-20 23:32 it's good 2009-01-20 23:38 struct circ { struct circ *circ; }; 2009-01-20 23:39 definition of a single linked circular list 2009-01-20 23:39 is it used for page? 2009-01-20 23:40 yes 2009-01-20 23:40 with container_of 2009-01-20 23:40 in kernel? 2009-01-20 23:41 yes 2009-01-20 23:42 um... 2009-01-20 23:42 I can't image how do we use it 2009-01-20 23:42 I will know if it works pretty soon, if so the definition is cute 2009-01-20 23:43 flips...sorry to interrupt.. 2009-01-20 23:43 made a few changes.. 2009-01-20 23:43 this is a small and unimportant part of tux3, but I would still like it nice 2009-01-20 23:43 hi cdk 2009-01-20 23:43 hi 2009-01-20 23:44 now the show_buffers is showing only 7 entries each with count == 1 ... 
still exceeding max buffer limit 2009-01-20 23:45 hmm, our evict algorithm is really broken 2009-01-20 23:45 but where do the count == 1 come from? 2009-01-20 23:45 hey flips 2009-01-20 23:46 i mean what was e/100 before is now e/1 2009-01-20 23:46 ah 2009-01-20 23:46 nice 2009-01-20 23:46 so you found a leak 2009-01-20 23:46 yes 2009-01-20 23:46 and there is another leak to find 2009-01-20 23:46 maybe in my code :) 2009-01-20 23:47 or hirofumi's 2009-01-20 23:47 but probably in yours :) 2009-01-20 23:48 most probably in mine 2009-01-20 23:48 well, I checked it before with BUFFER_PARANOIA_DEBUG, the result was no leak 2009-01-20 23:48 and I have found and fixed many leaks in my own code in the past 2009-01-20 23:48 it was a bit surprise for me 2009-01-20 23:49 :) 2009-01-20 23:50 I started the fsx-linux with delalloc 2009-01-20 23:50 if it works, I think my preparation is almost done 2009-01-20 23:50 cool 2009-01-20 23:52 ah, I was forgetting to complete btree_inode patch 2009-01-20 23:52 but, it is easy patch 2009-01-20 23:52 what does that one do? 2009-01-20 23:54 with it, we can know the pointer to btree root 2009-01-20 23:54 container_of(btree, struct inode, btree) basically 2009-01-20 23:54 oh yes 2009-01-20 23:54 good 2009-01-20 23:55 yes 2009-01-20 23:56 maybe, with it, I'll move itable to volmap 2009-01-20 23:56 or makes itable to inode, and use as volmap 2009-01-20 23:59 let's keep the name volmap 2009-01-20 23:59 ok 2009-01-20 23:59 well, I was thinking which is not confusible 2009-01-21 00:00 volmap also has dleaf blocks, so calling that mapping itable would be confusing 2009-01-21 00:00 volmap->btree is itable, or itable buffer is whole volume 2009-01-21 00:00 ah, yes 2009-01-21 00:01 I was thinking volmap->btree is not notisable as itable 2009-01-21 00:01 itable(sb) ? 
2009-01-21 00:02 however, I agree dleaf blocks are also confusing 2009-01-21 00:02 yes 2009-01-21 00:02 it may help 2009-01-21 00:03 same way, volmap can be vol_bread(sb) 2009-01-21 00:03 well, there would be no matter 2009-01-21 00:03 I like volmap :) 2009-01-21 00:03 ok 2009-01-21 00:04 it will help us when we need to explain how it is our buffer cache 2009-01-21 00:04 I'll move itable to volmap, and add some comments to it, and may use itable(sb) (not sure yet) 2009-01-21 00:04 yes 2009-01-21 00:06 it sure is nice to be able to have structure returns, like itable(sb) 2009-01-21 00:06 when I first started in linux, linus and some others did not trust structure returns 2009-01-21 00:06 or structure assignment 2009-01-21 00:07 I'm also not believing it 2009-01-21 00:08 structure assignment is now heavily used in kernel 2009-01-21 00:08 it's just better code 2009-01-21 00:08 initializes all fields without bloat 2009-01-21 00:08 ah, yes 2009-01-21 00:08 it is good 2009-01-21 00:09 struct bar func(struct foo bar) 2009-01-21 00:09 not passing call by reference 2009-01-21 00:09 and returns structure 2009-01-21 00:10 I'm not sure gcc will optimize it perfectly 2009-01-21 00:10 then we need to give them a reason to optimize it :) 2009-01-21 00:10 it is not very important that our structure parameters and returns be perfectly efficient, but that they be reliable and clear 2009-01-21 00:11 then if it generates terrible code, we post it on the gcc devel list :) 2009-01-21 00:11 :) 2009-01-21 00:11 to tell the truth, I have never disassembled one 2009-01-21 00:12 well, however, there is some reason gcc doesn't optimize it perfectly 2009-01-21 00:12 there is? 
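The structure return and structure assignment style discussed above looks roughly like this in C99. This is a sketch only: toy_extent, toy_make_extent and toy_grow are hypothetical names and have nothing to do with the real itable(sb).

```c
#include <assert.h>

/* Hypothetical small structure, cheap enough to pass and return by value. */
struct toy_extent { unsigned long block; unsigned count; };

/*
 * Structure return: the compound literal initializes every field, and any
 * field that is not named is set to zero.
 */
static struct toy_extent toy_make_extent(unsigned long block, unsigned count)
{
	return (struct toy_extent){ .block = block, .count = count };
}

/*
 * Structure parameter: the callee works on its own copy, so the caller's
 * structure is left untouched.
 */
static struct toy_extent toy_grow(struct toy_extent extent, unsigned more)
{
	extent.count += more;
	return extent;
}
```

Because toy_grow receives a copy, the compiler has to prove the copy is unnecessary before it can drop it, which is the optimization concern raised in the log. Whether a given gcc version does so is best checked by reading the generated code.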
2009-01-21 00:12 the function is assuming it was passed a copy of the structure 2009-01-21 00:13 function may be assuming 2009-01-21 00:13 I guess gcc has to detect it 2009-01-21 00:14 gcc is doing global optimization now 2009-01-21 00:14 it used to only do peephole 2009-01-21 00:15 gcc may be smarter than I think 2009-01-21 00:15 it got a lot smarter around 3.0 2009-01-21 00:15 I think 4.0 2009-01-21 00:15 and... with the new optimizing infrastructure, it generated worse code :) 2009-01-21 00:15 I'm sure 3.0 is not smarter 2009-01-21 00:16 no, 3.0 introduced the new intermediate format if I recall 2009-01-21 00:16 I meant "yes" in japanese 2009-01-21 00:16 it was not very smart 2009-01-21 00:16 :) 2009-01-21 00:16 just new 2009-01-21 00:16 yes 2009-01-21 00:16 then by 4.0, it started actually optimizing 2009-01-21 00:17 yes, iirc, introduced SSA 2009-01-21 00:17 4.0 had a really big improvement 2009-01-21 00:18 I haven't tested 4.0 as much as 3.0 yet, so I'm not sure yet 2009-01-21 00:18 but as you say, there is no question that passing pointers generates faster and smaller code than passing structures 2009-01-21 00:19 itable(sb) could return btree * 2009-01-21 00:19 s/count/should/ 2009-01-21 00:20 s/could/should/ 2009-01-21 00:20 however, with inline function passing structure, gcc 4.3 seems to optimize it 2009-01-21 00:20 fine 2009-01-21 00:21 yes 2009-01-21 00:21 I tested only a few functions though 2009-01-21 00:22 I should re-benchmark some of my old structure-using code 2009-01-21 00:22 I wrote two copies of a vector library, one using structs and one using pointers 2009-01-21 00:22 for parameters and returns 2009-01-21 00:23 the struct code was slower and bigger than the pointer code, but way easier to write correctly 2009-01-21 00:23 I should benchmark on 4.3 now and see if there is less difference 2009-01-21 00:23 good 2009-01-21 00:25 I have some experience on 3.x, optimizing by hand was faster than gcc 2009-01-21 00:25 optimize means just inline tweaks 2009-01-21 
00:25 4.0 was able to unroll loops better than me 2009-01-21 00:25 and I am good at it :) 2009-01-21 00:26 good :) 2009-01-21 00:34 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-01-21 00:37 -!- RazvanM(~RazvanM@96.234.239.248) has joined #tux3 2009-01-21 01:06 struct page *page = alloc_page(); 2009-01-21 01:06 struct circ *circ = sb->defree; 2009-01-21 01:06 page->private = circ->circ; 2009-01-21 01:06 circ->circ = page_circ(page); 2009-01-21 01:06 sb->defree = sb->defree->circ; <- this code succeeds at inserting a new page at the end of my circular list 2009-01-21 01:06 will finish it tomorrow 2009-01-21 01:09 i see 2009-01-21 01:09 struct circ *page_circ(struct page *page) 2009-01-21 01:09 { 2009-01-21 01:09 return (void *)&page->private; 2009-01-21 01:09 } 2009-01-21 01:09 anyway, not very important 2009-01-21 01:10 but it will be one more thing done 2009-01-21 01:19 I would like to do it a function like list_add() 2009-01-21 01:20 yes, it is suitable for that 2009-01-21 01:20 it would be circ_add_tail 2009-01-21 01:21 good 2009-01-21 01:21 it is the same idea as kernel double linked lists: the link points directly at the next link field instead of at the object 2009-01-21 01:21 which makes it possible to be generic 2009-01-21 01:22 it also loses type safety, but you can't have everything 2009-01-21 01:31 type safe version may be able to add based on it if we want to 2009-01-21 01:32 maybe, we don't need it, because we should touch that list on few places 2009-01-21 01:33 agreed 2009-01-21 01:34 flips: fixing a leak ? 2009-01-21 01:34 why didn't you use another kind of b-tree btw ? 2009-01-21 01:34 because this one functions 2009-01-21 01:36 circ_add((struct circ **)&page->private, sb->defree); <- that cast is really ugly 2009-01-21 01:36 why didn't it use page_circ()? 
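The tail insertion shown above (link the new page after the tail, then advance the tail pointer to it) can be tried in userspace with a generic single linked circular list. This is a sketch under assumptions: the link field is called next here for readability, circ_init and toy_page are hypothetical, and toy_page_of plays the role the log assigns to container_of.

```c
#include <assert.h>
#include <stddef.h>

/* Single linked circular list: each link points at the next link field. */
struct circ { struct circ *next; };

/* A one element list points at itself; that is what makes it circular. */
static void circ_init(struct circ *node)
{
	node->next = node;
}

/* Plain single linked insert: put node right after pos. */
static void circ_add(struct circ *node, struct circ *pos)
{
	node->next = pos->next;
	pos->next = node;
}

/* Insert after the current tail and return the new tail. */
static struct circ *circ_add_tail(struct circ *node, struct circ *tail)
{
	circ_add(node, tail);
	return node;
}

/* Hypothetical object carrying a link, standing in for struct page. */
struct toy_page { int id; struct circ link; };

/* Recover the containing object from its link field. */
static struct toy_page *toy_page_of(struct circ *link)
{
	return (struct toy_page *)((char *)link - offsetof(struct toy_page, link));
}
```

Keeping a pointer to the tail rather than the head gives both ends in one pointer: the head is always tail->next, so appending needs no list walk.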
2009-01-21 01:37 because the page->private field needs to be changed, so circ * is not good enough, it has to be circ ** 2009-01-21 01:38 ah, I think I can fix it 2009-01-21 01:38 needs just a little more clear thinking 2009-01-21 01:40 ah, because I declared page_circ wrong 2009-01-21 01:40 it should return circ ** 2009-01-21 01:41 struct circ { struct circ **next; }? 2009-01-21 01:43 maybe 2009-01-21 01:43 not very much like list_head then 2009-01-21 01:45 but, for now, I can't see why we need **... 2009-01-21 01:46 maybe, if I try to write it, I notice it 2009-01-21 01:47 anyway, it is now struct circ { struct circ *next; } 2009-01-21 01:47 that is more like list_head 2009-01-21 01:47 looks good 2009-01-21 01:50 btw, test hit the assertion in cursor_check() 2009-01-21 01:50 I'll see it 2009-01-21 01:51 cursor depth exceeded? 2009-01-21 01:51 no 2009-01-21 01:51 btree.c:183 2009-01-21 01:51 assert(bufindex(cursor->path[i].buffer) == block); 2009-01-21 01:52 right 2009-01-21 01:52 could be anything 2009-01-21 01:52 good luck :) 2009-01-21 01:52 :) 2009-01-21 01:53 well, it can be the bug of debug code 2009-01-21 01:53 or delalloc, or before it 2009-01-21 01:53 I'll try 2009-01-21 02:15 void circ_add(struct circ *node, struct circ *list) 2009-01-21 02:15 { 2009-01-21 02:15 node->next = list->next; 2009-01-21 02:15 list->next = node; 2009-01-21 02:15 } 2009-01-21 02:15 circ_add(page_circ(page), sb->defree); 2009-01-21 02:16 in other words, circ_add is just single_linked_list_add 2009-01-21 02:16 it is only the initialization of the list to point at itself that make it circular 2009-01-21 02:16 kind of pretty 2009-01-21 02:17 looks good and clean 2009-01-21 02:19 sb->defree always points at the tail of the list 2009-01-21 02:20 to get the head of the list, it is sb->defree->next 2009-01-21 02:20 so the new element is inserted after the tail, then sb->defree = sb->defree->next; makes it the new tail 2009-01-21 02:23 I used the field name defree, because that is what I 
called it in tux2, the deferred free list, it will be the only part of tux2 that becomes part of tux3 2009-01-21 02:25 i see 2009-01-21 02:33 http://lkml.indiana.edu/hypermail/linux/kernel/0208.2/0417.html <- ha, I forgot about this post 2009-01-21 02:34 the only difference is, the new one is inline instead of macro 2009-01-21 02:37 I think new one is more clean and more harder to misuse 2009-01-21 02:37 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-01-21 07:02 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-21 07:34 -!- tux3bot(~tux3bot@yzf.shapor.com) has joined #tux3 2009-01-21 07:34 -!- RazvanM(~RazvanM@96.234.239.248) has joined #tux3 2009-01-21 07:39 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-01-21 08:37 -!- shapor_(~shapor@yzf.shapor.com) has joined #tux3 2009-01-21 09:01 -!- gaurav(~gaurav@59.95.9.212) has joined #tux3 2009-01-21 09:05 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-21 09:43 -!- pgquiles(~pgquiles@88.0.158.31) has joined #tux3 2009-01-21 09:49 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-21 10:10 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-21 11:28 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-01-21 12:50 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-21 12:57 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-21 12:58 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-21 15:47 -!- tim_dimm(~timothyhu@rrcs-64-183-50-58.west.biz.rr.com) has joined #tux3 2009-01-21 15:59 -!- tim_dimm(~timothyhu@rrcs-64-183-50-58.west.biz.rr.com) has joined #tux3 2009-01-21 16:26 -!- tim_dimm(~timothyhu@rrcs-64-183-50-58.west.biz.rr.com) has joined #tux3 2009-01-21 19:18 -!- gaurav(~gaurav@59.95.27.168) has joined #tux3 2009-01-21 19:34 well I spent too much time fiddling with deferred free logging 2009-01-21 19:34 
I will put this code in now, and get on to the next thing. 2009-01-21 20:12 compiles in userspace 2009-01-21 21:08 ok I'm going to mark down some progress on atomic commit 2009-01-21 21:09 checked in lots of untried code :) 2009-01-21 21:12 hmm, no more primitives to develop 2009-01-21 21:12 something got finished 2009-01-21 21:37 -!- kushal_(~kushal@115.109.9.123) has joined #tux3 2009-01-21 21:53 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-21 22:56 -!- RazvanM(~RazvanM@96.234.239.248) has joined #tux3 2009-01-22 00:39 ok, I found the cause of the bug 2009-01-22 00:40 btree.c uses bufindex() to get physical address 2009-01-22 00:41 so, volmap must guarantee buffer is buffer_mapped() 2009-01-22 00:41 but, current code (blockread) doesn't guarantee it 2009-01-22 00:42 if buffer is uptodate, it doesn't call get_block() 2009-01-22 00:43 and if page is PageUptodate, buffers are created without buffer_mapped() 2009-01-22 00:43 it is more efficient than blockdev buffers 2009-01-22 00:44 so, volmap needs additional code 2009-01-22 00:44 hi hirofumi 2009-01-22 00:44 hi 2009-01-22 00:45 how did you find it? 
2009-01-22 00:46 by assertion of cursor_check() 2009-01-22 00:46 self checking code :) 2009-01-22 00:46 I checked the detail of it 2009-01-22 00:46 buffer was !buffer_mapped(), so bufindex() was -1 2009-01-22 00:47 well, with this, I think I learned about buffer state more 2009-01-22 00:47 ok, well we can get the bufindex without calling get_block 2009-01-22 00:48 it can be calculated from the page->index 2009-01-22 00:48 yes 2009-01-22 00:49 however, it would be strange 2009-01-22 00:49 (page->index << page_block_shift) + ((buffer->b_data - page_address(page)) >> sb->blockbits) 2009-01-22 00:49 it's not really strange 2009-01-22 00:50 hey flips 2009-01-22 00:50 ah 2009-01-22 00:50 hi folks in general :) 2009-01-22 00:50 hi bh 2009-01-22 00:51 ACTION is happy to have a non-fuckhead for a president 2009-01-22 00:51 maybe I should ask the question publically 2009-01-22 00:51 however, I'd like to have buffer_mapped() for volmap 2009-01-22 00:52 how long should it take me to get up to speed with tux3 and surrounding the vm system ? 2009-01-22 00:52 best idea is to ignore the surrounding vm system 2009-01-22 00:52 couple of weeks ? yeah, I don't know much about b-trees per se 2009-01-22 00:52 yeah, eventually, I'll have to hack on it 2009-01-22 00:53 hirofumi, I would like to get rid of buffer_mapped entirely if possible 2009-01-22 00:54 I meant I'd like to delay it after atomic commit 2009-01-22 00:54 let's see how heaviliy used bufindex is 2009-01-22 00:54 sure, a hack is fine for now 2009-01-22 00:54 when we allocate a buffer for volmap we can set b_blocknr 2009-01-22 00:54 yes 2009-01-22 00:55 I think it is not hard to do 2009-01-22 00:55 you set it using the above expression (with corrections) 2009-01-22 00:55 without volmap patch, it worked for 7.5 hours 2009-01-22 00:56 so after we moved to our own volume it only worked for 1 hour or so? 2009-01-22 00:56 yes 2009-01-22 00:56 my volmap patch has that bug 2009-01-22 00:57 what happened after 7.5 hours before? 
it was still running? 2009-01-22 00:58 yes 2009-01-22 00:58 I just send SIGINT to stop it 2009-01-22 00:59 let's see if my new blockread has that bug 2009-01-22 00:59 I guess it does 2009-01-22 00:59 there is no ->b_blocknr 2009-01-22 00:59 the right thing to do is rewrite bufindex 2009-01-22 01:00 or I was thinking bufindex()less 2009-01-22 01:00 getting rid of bufindex? 2009-01-22 01:00 yes 2009-01-22 01:00 if possible 2009-01-22 01:01 it seems possible from the reading in btree.c just now 2009-01-22 01:01 need to think a little more 2009-01-22 01:02 ah, the cursor will not be happy 2009-01-22 01:02 cursor_check()? 2009-01-22 01:03 changing the path for splits 2009-01-22 01:03 and advance will not be happy without bufindex 2009-01-22 01:04 advance is not using bufindex() 2009-01-22 01:05 ah, true 2009-01-22 01:05 basically, I think bnode -> (parent->next - 1).block is it 2009-01-22 01:06 right 2009-01-22 01:07 for all other usage in btree.c, efficiency does not really matter 2009-01-22 01:08 btree is the only user 2009-01-22 01:08 yes 2009-01-22 01:08 well, I think the right thing is to implement the expression up above 2009-01-22 01:08 ok, I'm thinking to start to use your bufindex() 2009-01-22 01:08 that is what buffer.c -> block library does 2009-01-22 01:08 with asserttion()? 2009-01-22 01:09 assert(volmap == page->mapping->host) 2009-01-22 01:09 the expression is correct for logical mappings too 2009-01-22 01:09 but, it is logical address 2009-01-22 01:09 yes, and tux3 always uses it that way 2009-01-22 01:10 yes 2009-01-22 01:10 but, in kernel it doesn't use 2009-01-22 01:10 vfs better not ever use b_blocknr instead of calling our get_block, or it will get a stale block 2009-01-22 01:11 anyway, you can assert for volmap :) 2009-01-22 01:11 ok, thanks :) 2009-01-22 01:13 with it, I think delalloc with volmap would be work 2009-01-22 01:13 with what? 
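The block index expression proposed above, (page->index << shift) + (offset within page >> blockbits), is easy to check with plain integers. This is a sketch under assumptions: toy_bufindex and TOY_PAGE_SHIFT are hypothetical, a 4096 byte page is assumed, and page index and byte offset are passed in directly instead of being read from a page and a buffer.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12	/* assumes 4096 byte pages */

/*
 * Block index of a buffer from the index of its page and the byte offset
 * of the buffer data inside that page, for blocks of (1 << blockbits) bytes.
 */
static unsigned long toy_bufindex(unsigned long page_index, size_t offset, unsigned blockbits)
{
	assert(blockbits <= TOY_PAGE_SHIFT);
	assert(offset < ((size_t)1 << TOY_PAGE_SHIFT));

	unsigned subshift = TOY_PAGE_SHIFT - blockbits;	/* log2 of blocks per page */

	return (page_index << subshift) + (offset >> blockbits);
}
```

With 1K blocks (blockbits 10) there are four blocks per page, so the buffer at byte offset 2048 of page 3 is block (3 << 2) + (2048 >> 10) = 14. No b_blocknr or get_block call is involved, which is the point made in the log.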
2009-01-22 01:13 with bufinex() 2009-01-22 01:13 bufindex() change 2009-01-22 01:13 that sounds good 2009-01-22 01:13 I'm just looking to see if I have already written that code 2009-01-22 01:14 it seems familiar 2009-01-22 01:23 unsigned bufindex(struct buffer_head *buffer) 2009-01-22 01:23 { 2009-01-22 01:23 struct page *page = buffer->b_page; 2009-01-22 01:23 unsigned blockbits = page->mapping->host->i_blkbits; 2009-01-22 01:23 unsigned subshift = PAGE_CACHE_SHIFT - blockbits; 2009-01-22 01:23 unsigned offset = (void *)buffer->b_data - page_address(page); 2009-01-22 01:23 return (page->index << subshift) + (offset >> blockbits); 2009-01-22 01:23 } 2009-01-22 01:23 not tested 2009-01-22 01:23 thanks 2009-01-22 01:29 tested... it worked once :) 2009-01-22 01:32 fsx-linux was started 2009-01-22 01:34 some of the atomic commit to.do is starting to be done 2009-01-22 01:34 testing cursor_redirect is next I think 2009-01-22 01:35 good 2009-01-22 01:35 I must say oyasumi now 2009-01-22 01:35 I think I'd like to add INIT_LINK() or something 2009-01-22 01:35 oh, oyasumi 2009-01-22 01:36 INIT_LINK? 2009-01-22 01:36 oh 2009-01-22 01:36 right 2009-01-22 01:36 init_link() please :) 2009-01-22 01:36 ok 2009-01-22 01:36 :) 2009-01-22 01:36 new inits are lower case 2009-01-22 01:37 btw, how about the slink instead of link? 2009-01-22 01:37 we use "list" for double link, so I think "link" for single link is ok 2009-01-22 01:38 "slick" means "to sneak" :) 2009-01-22 01:38 slink 2009-01-22 01:38 oh 2009-01-22 01:38 :) 2009-01-22 01:38 these are worth offering to lkml, after a little more improvement 2009-01-22 01:39 they are just as generic as list.h lists, and save a link field 2009-01-22 01:39 yes 2009-01-22 01:40 ah, list, hlist. so slist? 2009-01-22 01:40 hlist is hash list, right? 
2009-01-22 01:40 yes 2009-01-22 01:40 and both is in list.h 2009-01-22 01:41 this list is probably a better hashlist than hlist 2009-01-22 01:41 but that doesn't really matter 2009-01-22 01:41 yes 2009-01-22 01:41 ah, no in english :) 2009-01-22 01:41 no, that doesn't 2009-01-22 01:41 :) 2009-01-22 01:43 see, bill irwin already wrote it 7 years ago: http://lkml.indiana.edu/hypermail/linux/kernel/0208.2/0527.html 2009-01-22 01:43 slist_add == link_add 2009-01-22 01:44 and I didn't really get it then 2009-01-22 01:44 oh 2009-01-22 01:45 http://lkml.indiana.edu/hypermail/linux/kernel/0208.2/0417.html <- the ugly original 2009-01-22 01:46 by the way, I found Bill's post after I wrote link_add 2009-01-22 01:47 now, I think link_* is useful on restricted one 2009-01-22 01:47 like page->private 2009-01-22 01:47 yes 2009-01-22 01:48 and don't need flexible link_del 2009-01-22 01:48 we didn't really need it for the case here, because we could have used the lru field... the page is not on the lru 2009-01-22 01:48 but in other cases, maybe page->private is the only list field available 2009-01-22 01:49 it is very nice to be able to put pages on lists 2009-01-22 01:49 yes 2009-01-22 01:49 pagevec.h kind of gives a replacement, but it is not pretty 2009-01-22 01:50 and there would be other cases only have one ->private 2009-01-22 01:50 dentry, inode, file, and many others 2009-01-22 01:50 yes 2009-01-22 01:53 ok, oyasumi, relaly 2009-01-22 01:53 really 2009-01-22 01:53 oyasumi 2009-01-22 02:40 -!- data(~data@echo489.server4you.de) has joined #tux3 2009-01-22 07:08 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-22 07:39 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-22 07:46 -!- amey(~amey@117.195.33.62) has joined #tux3 2009-01-22 07:48 -!- samlh(~sam@67.129.121.145) has joined #tux3 2009-01-22 08:27 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-01-22 09:37 -!- kushal(~kushal@117.195.33.62) has joined 
#tux3 2009-01-22 09:50 -!- cdk(~chinmay@117.195.33.62) has joined #tux3 2009-01-22 10:01 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-22 10:49 hi flips 2009-01-22 10:53 -!- gaurav(~gaurav@117.195.33.62) has joined #tux3 2009-01-22 10:58 hi kushal 2009-01-22 10:59 have been trying to copy back files from tux3 to ext3... 2009-01-22 10:59 copy aborted each time at about 39mb...which is approx 10000 buffers 2009-01-22 11:00 in tuxio... 2009-01-22 11:00 the buffer is always released using brelse_dirty 2009-01-22 11:00 shouldn't it be brelse for reads and brelse_dirty for writes? 2009-01-22 11:00 so all buffers are dirty? 2009-01-22 11:01 yes 2009-01-22 11:01 we're working on changeset tux3 11 jan... 2009-01-22 11:02 whoops :) 2009-01-22 11:04 should this be patched ?? 2009-01-22 11:05 - if (write) 2009-01-22 11:05 + if (write) { 2009-01-22 11:05 + mark_buffer_dirty(buffer); 2009-01-22 11:05 memcpy(bufdata(buffer) + from, data, some); 2009-01-22 11:05 - else 2009-01-22 11:05 + } else 2009-01-22 11:05 memcpy(data, bufdata(buffer) + from, some); 2009-01-22 11:05 printf("transfer %u bytes, block 0x%Lx, buffer %p\n", some, (L)bufindex(buffer), buffer); 2009-01-22 11:05 hexdump(bufdata(buffer) + from, some); 2009-01-22 11:05 - brelse_dirty(buffer); 2009-01-22 11:05 + brelse(buffer); 2009-01-22 11:05 how does it look? 2009-01-22 11:06 good..this should work... 2009-01-22 11:06 can we send a patch :) 2009-01-22 11:08 cdk, please 2009-01-22 11:09 hopefully wont mess it up this time ... will attach the patch to be on safer side 2009-01-22 11:15 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-01-22 11:17 kushal, cdk, you probably want to turn the printf into a trace_off and comment out the hexdump 2009-01-22 11:18 ok 2009-01-22 11:26 -!- cdk(~chinmay@117.195.33.62) has joined #tux3 2009-01-22 11:43 -!- cdk(~chinmay@117.195.33.62) has joined #tux3 2009-01-22 11:57 bye flips... 
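The effect of the patch above (dirty the buffer only on the write path, release with a plain brelse on both paths) can be modelled in a few lines. This is a sketch only: toy_buffer, toy_brelse, toy_mark_dirty and toy_transfer are hypothetical stand-ins, not the tuxio code or the real buffer library.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical one block buffer with a reference count and a dirty flag. */
struct toy_buffer { unsigned char data[1024]; int count; int dirty; };

static void toy_mark_dirty(struct toy_buffer *buffer)
{
	buffer->dirty = 1;
}

/* Drop a reference without touching the dirty state. */
static void toy_brelse(struct toy_buffer *buffer)
{
	assert(buffer->count > 0);
	buffer->count--;
}

/* Copy `some` bytes to or from the buffer at offset `from`. */
static void toy_transfer(struct toy_buffer *buffer, void *data, unsigned from, unsigned some, int write)
{
	assert(from + some <= sizeof buffer->data);
	buffer->count++;	/* stands in for the reference taken when the block is read */
	if (write) {
		toy_mark_dirty(buffer);
		memcpy(buffer->data + from, data, some);
	} else
		memcpy(data, buffer->data + from, some);
	toy_brelse(buffer);
}
```

A read now leaves the buffer clean, so copying a large file out of the filesystem no longer fills the pool with dirty buffers that cannot be evicted.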
2009-01-22 11:57 bye 2009-01-22 12:49 -!- gila(~gila@5ED41295.cable.ziggo.nl) has joined #tux3 2009-01-22 12:59 time to make metadata redirect work 2009-01-22 13:05 need to handle redirect of btree root now 2009-01-22 13:07 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-01-22 13:07 current my tree 2009-01-22 13:07 including temporary patches though 2009-01-22 13:07 I will read 2009-01-22 13:07 before btree_inode(), it ran fsx-linux for 11 hours 2009-01-22 13:08 thanks 2009-01-22 13:09 the improvement to cursor_redirect to handle redirect of root relies on btree_inode 2009-01-22 13:09 I'll prepare to submit those patch tomorrow if current runing fsx-linux was good 2009-01-22 13:09 yes 2009-01-22 13:10 that changes is for like it 2009-01-22 13:10 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-22 13:10 thanks for the __free_page fix :) 2009-01-22 13:10 :) 2009-01-22 13:10 there is no such thing as free_page() right? 2009-01-22 13:11 free_page takes address of page 2009-01-22 13:11 __free_page takes page structure 2009-01-22 13:11 ah, a naming mistake from way back 2009-01-22 13:11 yes 2009-01-22 13:13 I forgot entirely that page->private is ulong 2009-01-22 13:14 well, that didn't made any change 2009-01-22 13:14 no, because I had to cast it anyway 2009-01-22 13:15 and it is for future change 2009-01-22 13:15 don't forget the cast or proper wrapper 2009-01-22 13:16 we actually have two kinds of single linked list ops: ones that assume circular such as _empty and init_, and ones that work on either circular or noncircular lists, link_add and _del_next 2009-01-22 13:16 what is benefit of noncircular? 2009-01-22 13:17 empty state is easier 2009-01-22 13:17 no header or special checks needed 2009-01-22 13:18 most single linked lists are noncircular 2009-01-22 13:18 ah, yes 2009-01-22 13:19 nice cleanups 2009-01-22 13:19 however, with that change, I think almost all case needs header 2009-01-22 13:19 which change? 
2009-01-22 13:19 link_empty() and related stuff 2009-01-22 13:20 I'm thinking, the circular-specific functions should be circ_ 2009-01-22 13:20 there are only two 2009-01-22 13:20 anyway, it is not important to getting atomic commit working 2009-01-22 13:20 circular was not waste memeory 2009-01-22 13:20 circular is cool :) 2009-01-22 13:21 :) 2009-01-22 13:22 and if we want to FIFO order, I think we can add ->tail pointer to header 2009-01-22 13:22 with it, we need actuall header 2009-01-22 13:23 yes, it can be done in a clever way with a header, I wrote about it in the post 2009-01-22 13:23 there is a tail pointer and a head element just as you say 2009-01-22 13:24 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-01-22 13:24 and making that into one structure sounds like a good way to make it easy to initialize and use 2009-01-22 13:24 yes 2009-01-22 13:25 so, 8 bytes header and 4 bytes/link to implement an efficient queue, it sounds good 2009-01-22 13:25 with no tests in the add or del 2009-01-22 13:25 add now? 2009-01-22 13:25 add it now? 2009-01-22 13:26 I think, not unless we actually need it 2009-01-22 13:26 add it to the userspace program maybe 2009-01-22 13:26 just for fun 2009-01-22 13:27 it's good, I actually try to do it, however I was not sure if we need, so I didn't it 2009-01-22 13:27 it will make a fun lkml thread sometime 2009-01-22 13:28 yes 2009-01-22 13:29 someone may disklike it, because it is fragile than list_head stuff 2009-01-22 13:29 but I think it is useful actually 2009-01-22 13:29 it saves memory 2009-01-22 13:29 yes 2009-01-22 13:29 there are probably a hundred places in kernel that could be improved with it 2009-01-22 13:30 and it can use with one private pointer on any structure 2009-01-22 13:30 yes 2009-01-22 13:30 I like it much 2009-01-22 13:36 current->journal_info = sb->bitmap; <- this is temporary, right? 
2009-01-22 13:36 yes 2009-01-22 13:36 however, it is for a while 2009-01-22 13:37 strange-cleanup -> change-cleanup 2009-01-22 13:37 until bitmap recursive lock was solved 2009-01-22 13:37 yes 2009-01-22 13:37 comment may be wrong almost 2009-01-22 13:38 I'm not updating it yet, I'm going to update it tomorrow 2009-01-22 13:40 unpack_sb can't find the iroot from the sb? 2009-01-22 13:41 because it's not initialized yet I guess 2009-01-22 13:41 in unpack_sb, sb->volmap is not allocated yet 2009-01-22 13:42 because inode needs initialized sb 2009-01-22 13:42 it makes sense 2009-01-22 13:43 in userspace, load_sb() does it, because it doesn't need initialized sb 2009-01-22 13:44 +int unpack_sb(struct sb *sb, struct disksuper *super, struct root *iroot, 2009-01-22 13:44 + int silent) <- this is a good place to break the 80 column rule 2009-01-22 13:45 yes 2009-01-22 13:45 well, people will complain about it on review 2009-01-22 13:46 then we will fix it and the complainer will be happy 2009-01-22 13:46 but probably they won't complain if it's not gross 2009-01-22 13:46 linus has pronounced on it 2009-01-22 13:46 more readable -> use more columns 2009-01-22 13:46 yes 2009-01-22 13:47 however, akpm and some people are more strict about that rule 2009-01-22 13:47 + if (itable_btree(btree->sb) != btree) <- I read it more easily in the other order 2009-01-22 13:47 anyway, I'll change it tomorrow 2009-01-22 13:48 ok 2009-01-22 13:48 I think it will remove soon 2009-01-22 13:49 after we write inodes and sb with our flusher 2009-01-22 13:49 yes 2009-01-22 13:50 ok, both was fixed 2009-01-22 13:51 I will be happy to see the volmap patch merged 2009-01-22 13:51 ok 2009-01-22 13:51 good 2009-01-22 13:52 now, with it, fsx-linux worked for 11 hours 2009-01-22 13:53 :) 2009-01-22 13:53 that internal will be changed later, however, state handling became more surely than before 2009-01-22 13:54 what did I break that your reverted with revert-mark_buffer_dirty? 
2009-01-22 13:54 that is temporary patch to work for current flusher 2009-01-22 13:55 dirty needs after change 2009-01-22 13:55 a short explanation in the commit would be nice 2009-01-22 13:55 I'm not going to submit it 2009-01-22 13:56 truncate-silent.patch 2009-01-22 13:56 btree-set-uptodate.patch 2009-01-22 13:56 revert-mark_buffer_dirty.patch 2009-01-22 13:56 those patches will be removed before submit 2009-01-22 13:57 nfsd race fix had to be made for every fs in linux? 2009-01-22 13:58 yes, I think so 2009-01-22 13:58 if fs is using iget stuff 2009-01-22 13:59 it looks good, I will be really happy to merge this set 2009-01-22 14:00 thanks 2009-01-22 14:00 it is a step towards elimination of buffer_head 2009-01-22 14:00 I'll cleanup those if needed, and add comments to those tomorrow 2009-01-22 14:00 yes 2009-01-22 14:01 in our fs, we don't need ->b_blocknr at least 2009-01-22 14:01 going to block handle will be 16 bytes / page versus 192 for buffer heads, for a 1K block filesystem 2009-01-22 14:01 folks 2009-01-22 14:01 I think it means buffer_head can be !buffer_mapped() 2009-01-22 14:02 yes 2009-01-22 14:02 the block IO library expects to see buffer_mapped, but we will stop using it at some point 2009-01-22 14:02 if not, it will call our get_block 2009-01-22 14:03 I think it checks for buffer_mapped on return at some point 2009-01-22 14:03 anyway, obviously if we have block handles we don't use anything in buffer.c 2009-01-22 14:03 on the race with page shrink and readpage, actually buffer_head can be !buffer_mapped() 2009-01-22 14:04 yes 2009-01-22 14:05 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L299 <- this race I guess 2009-01-22 14:06 no 2009-01-22 14:06 maybe it is bug 2009-01-22 14:06 I think race is more usual 2009-01-22 14:07 page shrink does, 1) remove buffer_head, 2) then try to free page 2009-01-22 14:07 if readpage was called, between 1) and 2) 2009-01-22 14:08 readpage will create buffer_head on PageUptodate() page 2009-01-22 14:08 so, 
readpage don't need to call get_block, because all buffer_head can be buffer_uptodate() 2009-01-22 14:09 it makes !buffer_mapped() and buffer_uptodate() buffers 2009-01-22 14:10 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L1712 <- something to worry about? 2009-01-22 14:10 1712 if (!buffer_mapped(bh)) 2009-01-22 14:10 1713 continue; // __block_write_full_page 2009-01-22 14:11 on this path, if !buffer_mappd(), it will call get_block before it 2009-01-22 14:13 buffer_head in removed under lock_page(), so after create_empty_buffer(), those are not removed 2009-01-22 14:14 our get_block returns with !buffer_mapped, always? 2009-01-22 14:14 no 2009-01-22 14:14 get_block will return buffer_mapped if create==1 2009-01-22 14:15 I don't see where we set_buffer_mapped 2009-01-22 14:15 map_bh() is it 2009-01-22 14:15 ah :) 2009-01-22 14:16 well, on read path, that seems more efficient than I thought 2009-01-22 14:16 block library 2009-01-22 14:17 it may still be not enough though 2009-01-22 14:18 so if we leave a buffer_mapped buffer on a page after a get_block, how does the block library know it has to call ->get_block again next time it writes out that block? 
2009-01-22 14:19 it will not call get_block again 2009-01-22 14:20 if we need to call it, we have to call it ourself 2009-01-22 14:20 that will not be a problem when we are using ordered data mode, but it will be a problem later 2009-01-22 14:20 by that time, I think we will not be using the block library 2009-01-22 14:20 yes 2009-01-22 14:20 if we want to redirect, we have to use own handler or something 2009-01-22 14:21 I think the right order is: 1) stop using block library 2) redirect regular file writes 2009-01-22 14:22 we do not have to do this soon 2009-01-22 14:22 we can start review with just ordered data mode 2009-01-22 14:22 yes, well or hack it :) 2009-01-22 14:22 or hack it :) 2009-01-22 14:22 maybe, I think we can remove buffer_mapped on writepage 2009-01-22 14:23 I was thinking of exactly that hack 2009-01-22 14:23 readpage will just see buffer_uptodate or PageUptodate 2009-01-22 14:24 getblk is... 2009-01-22 14:24 um... 2009-01-22 14:24 well, later :) 2009-01-22 14:24 yes, we don't need redirect for regular files now 2009-01-22 14:24 but it is good to have the code for it started 2009-01-22 14:24 we do need to redirect dirent blocks 2009-01-22 14:24 ah, maybe, buffer_delay() would do it 2009-01-22 14:25 always use buffer_delay for write 2009-01-22 14:25 write_begin 2009-01-22 14:26 dirent would be not hard to do 2009-01-22 14:27 don't use buffer_mapped() would be work 2009-01-22 14:27 good 2009-01-22 14:27 well, I guess we can control with blockread()/blockget() 2009-01-22 14:28 yes 2009-01-22 14:29 so, I guess next work for me is redirect for pages 2009-01-22 14:29 with hack 2009-01-22 14:31 and I will continue with the cursor_redirect 2009-01-22 14:31 ok 2009-01-22 14:42 ah, bufferdirty() may solves it 2009-01-22 14:43 that would be nice 2009-01-22 14:43 blockdirty I think it was 2009-01-22 14:43 ah, yes 2009-01-22 14:44 it may clears mapped or don't copy mapped 2009-01-22 14:45 so, new dirty buffer is not mapped, and will call get_block on 
writepage 2009-01-22 14:47 submit_bh does BUG_ON(!buffer_mapped 2009-01-22 14:47 but clearing _mapped in writepage is ok 2009-01-22 14:47 whoops 2009-01-22 14:48 I don't think it is a big worry, we just can't return from get_block with !buffer_mapped 2009-01-22 14:48 yes 2009-01-22 14:49 well, if it can't solved with hack, we can write actual code 2009-01-22 14:49 it only matters for dirent blocks 2009-01-22 14:49 and I think we should do this without help from the block library 2009-01-22 14:51 it would be not hard, just become code long 2009-01-22 14:55 btw, dirent buffers will use blockdirty? 2009-01-22 14:55 yes 2009-01-22 14:55 ok 2009-01-22 14:56 I will think with it 2009-01-22 14:56 for dirent, we do not need ->read/writepage 2009-01-22 14:57 so it is just blockread, and our own flush 2009-01-22 14:57 so I think it is not hard to do it without any help from block library 2009-01-22 14:57 not much code 2009-01-22 14:57 there is no truncate 2009-01-22 14:58 no async truncate 2009-01-22 14:59 anyway, it's not hard to do 2009-01-22 15:00 also, no smp issues 2009-01-22 15:01 read and write can be race? 2009-01-22 15:01 I thought vfs holds i_mutex 2009-01-22 15:02 for read 2009-01-22 15:02 let me see 2009-01-22 15:02 read takes i_mutex 2009-01-22 15:02 but, write (flush) doesn't take i_mutex 2009-01-22 15:03 we won't let vfs/vm flush our dirs 2009-01-22 15:03 so, we take i_mutex on flush? 2009-01-22 15:03 and maybe do not need that, I'm thinking about it 2009-01-22 15:03 it is easy, to start 2009-01-22 15:04 it would not be needed to start 2009-01-22 15:04 I think we have change_begin/end 2009-01-22 15:05 anyway, vfs/vm doesn't flush our dirs? 
2009-01-22 15:05 right, and I think blockdirty -> fork_buffer takes care of synchronize for delta writeout 2009-01-22 15:05 right, vfs/vm are not allowed to flush our dirs 2009-01-22 15:05 that would create an inconsistent image 2009-01-22 15:06 if so, I don't need to care about buffer_mapped() 2009-01-22 15:06 dirents have to match itable 2009-01-22 15:06 yes 2009-01-22 15:06 all is own 2009-01-22 15:06 yes <- japanese version 2009-01-22 15:07 ah :) 2009-01-22 15:08 I need to care buffer_mapped()? 2009-01-22 15:09 no, because we will use our own blockread 2009-01-22 15:09 ah, and we will use a version that does not use block library calls 2009-01-22 15:09 and write is also own function? 2009-01-22 15:09 yes 2009-01-22 15:10 ok 2009-01-22 15:10 those functions are prototyped and partly tested in hackfs.c 2009-01-22 15:10 so, hack is not needed 2009-01-22 15:10 right 2009-01-22 15:11 I was thinking those are more later 2009-01-22 15:13 anyway, we have to make worked version with those 2009-01-22 15:13 I'll try with it 2009-01-22 15:13 :) 2009-01-22 15:13 they are pretty simple 2009-01-22 15:14 yes, for now 2009-01-22 15:14 ah, I see I used get_bh in blockget 2009-01-22 15:14 that is because I copied it from yours 2009-01-22 15:15 oh sorry 2009-01-22 15:15 get_bh != map_bh :) 2009-01-22 15:15 yes :) 2009-01-22 15:15 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-22 15:18 these use ERR_PTR 2009-01-22 15:18 yes 2009-01-22 15:24 -!- tim_dimm_(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-22 15:25 that blockread has the mapping->host->i_private hack to get a per-inode block io function 2009-01-22 15:25 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-22 15:26 well, first thing is write 2009-01-22 19:21 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-22 19:35 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2009-01-22 23:16 -!- 
pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-01-23 00:11 -!- RazvanM(~RazvanM@96.234.239.248) has joined #tux3 2009-01-23 08:43 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-23 09:03 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-01-23 09:19 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-23 09:26 -!- kushal(~kushal@115.109.15.140) has joined #tux3 2009-01-23 09:59 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-23 10:04 -!- tim_dimm_(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-23 10:06 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-23 10:29 -!- amey(~amey@115.109.15.140) has joined #tux3 2009-01-23 10:39 -!- gaurav(~gaurav@115.109.15.140) has joined #tux3 2009-01-23 10:39 -!- kushal(~kushal@115.109.15.140) has joined #tux3 2009-01-23 11:31 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-23 11:41 -!- dcg(~dcg@125.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-01-23 12:35 -!- kushal_(~kushal@115.109.10.125) has joined #tux3 2009-01-23 13:00 -!- kushal_(~kushal@115.109.10.125) has joined #tux3 2009-01-23 13:22 hirofumi, there? 2009-01-23 13:22 yes 2009-01-23 13:22 I will generalize defer_free to have a struct header instead of all fields in sb 2009-01-23 13:23 because we will have two defer lists, I think 2009-01-23 13:23 it will probably conflict with your struct link cleanup 2009-01-23 13:23 so maybe just put that patch aside for now? 2009-01-23 13:23 no problem, it's very good 2009-01-23 13:23 I was also thinking about it 2009-01-23 13:24 and removing extent_t 2009-01-23 13:24 yes 2009-01-23 13:24 struct defer { struct link *tail; u64 *pos, top; }; 2009-01-23 13:24 for now, we can make it even more general maybe... 
but that's all we need 2009-01-23 13:25 I think general extent would be struct seg 2009-01-23 13:25 yes 2009-01-23 13:25 it's shorter to type too :) 2009-01-23 13:25 and I think extent_t is just defer-free local structure 2009-01-23 13:25 yes 2009-01-23 13:26 so, I was thinking remove (actually move to local place) it 2009-01-23 13:27 it's not actually used in any code yet 2009-01-23 13:27 but it will be soon 2009-01-23 13:27 so I think it's best to keep it in the tree, it helps us move a little faster 2009-01-23 13:27 Iirc, retire_defree is using 2009-01-23 13:28 right, and retire_defree isn't used yet 2009-01-23 13:28 but soon 2009-01-23 13:29 it was defree, it is using extent_t to store it 2009-01-23 13:29 ah, and retire_defree too 2009-01-23 13:30 the next revision will use u64 instead of extent_t 2009-01-23 13:31 ah, yes 2009-01-23 13:31 I guess actually diskextent? 2009-01-23 13:31 diskextent is be_ 2009-01-23 13:31 yes 2009-01-23 13:31 ah, it's not store to disk 2009-01-23 13:32 right 2009-01-23 13:33 well, u64 or something structure 2009-01-23 13:33 I see three uses for this now: defer frees per delta; defer frees per rollup (metdata redirect); on replay... bookkeeping between the replay passes 2009-01-23 13:34 i see 2009-01-23 13:34 well, I guess exporting defree to other place is not good idea 2009-01-23 13:35 exporting defree internal 2009-01-23 13:35 yes, just internal 2009-01-23 13:35 good 2009-01-23 13:35 it's a highly specific task... keep a list of u64's 2009-01-23 13:36 yes, sounds good 2009-01-23 13:36 -!- gila(~gila@5ED41295.cable.ziggo.nl) has joined #tux3 2009-01-23 13:36 I was thinking about hackfs's io functions 2009-01-23 13:37 what is main purpose for current task? 2009-01-23 13:39 why use them? 2009-01-23 13:39 yes, almost 2009-01-23 13:39 to get them tested... they move towards no block library, which moves towards no buffers 2009-01-23 13:40 and vecio/syncio... 
moves towards a direct path from vfs IO to bio transfer 2009-01-23 13:40 yes, I guess it's good itself 2009-01-23 13:40 with several layers removed, and get rid of the get_block bottleneck 2009-01-23 13:40 but, I thought we can do after atomic commit 2009-01-23 13:40 possibly 2009-01-23 13:41 i see 2009-01-23 13:41 ok 2009-01-23 13:41 well, I'm going to think about it more 2009-01-23 13:41 for now 2009-01-23 13:41 good 2009-01-23 13:42 I agree with the principle of doing only what we need to do 2009-01-23 13:42 now I found is fork_buffer() can be race on some arch (weak memory order) 2009-01-23 13:43 where is the race? 2009-01-23 13:43 and for write, we will need async and sync version 2009-01-23 13:44 data copy and state is no locking 2009-01-23 13:44 actually, it is using lock_buffer(), but it's to sync with io 2009-01-23 13:45 I think state can be leak to another cpu before data copy 2009-01-23 13:45 I left the locking out because I was not sure exactly what was needed, and our initial usage is synchronous 2009-01-23 13:46 so it could be mb() or spinlock 2009-01-23 13:48 on current locking (e.g. btree->lock), I guess we want smp_wmb() or spinlock logically 2009-01-23 13:48 well, it would be rare, and would not happen on x86 2009-01-23 13:49 asynchronous use of fork_buffer will be some time in the future, so there is time to think about it 2009-01-23 13:49 synchronous use is needed very soon 2009-01-23 13:50 by synchronous, I mean only one task 2009-01-23 13:50 I think read and buffer_fork are not serialized 2009-01-23 13:50 it's protected by delta_lock for now 2009-01-23 13:51 this is very crude 2009-01-23 13:51 well, what I want to say here, I want to share about issue 2009-01-23 13:51 the both of fork_buffer and read have down_read()? 2009-01-23 13:52 for now, fork_buffer will have down_write 2009-01-23 13:53 because it is only used to flush the bitmaps 2009-01-23 13:53 and dirent pages? 
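The fork_buffer concern raised above (13:42 to 13:48) is that a state change may become visible to another CPU before the data copy. The following userspace sketch uses C11 atomics as an analogy. It is not kernel code and does not use `smp_wmb()`. `struct buf`, `buf_publish`, `buf_read` and the state names are hypothetical. The release store plays roughly the role of a write barrier between the copy and the state update, and the acquire load is the matching step on the reader side.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

enum { BUF_EMPTY, BUF_READY };	/* hypothetical buffer states */

struct buf {
	char data[64];
	atomic_int state;
};

/* Writer: copy the data first, then publish the state with release ordering. */
static void buf_publish(struct buf *dst, const char *src, size_t len)
{
	memcpy(dst->data, src, len);	/* caller keeps len <= sizeof dst->data */
	atomic_store_explicit(&dst->state, BUF_READY, memory_order_release);
}

/* Reader: check the state with acquire ordering before touching the data. */
static int buf_read(struct buf *src, char *out, size_t len)
{
	if (atomic_load_explicit(&src->state, memory_order_acquire) != BUF_READY)
		return 0;		/* not published yet */
	memcpy(out, src->data, len);
	return 1;
}
```

A spinlock or mutex around both the copy and the state change would give the same guarantee at a higher cost, which is the trade-off mentioned in the discussion.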
2009-01-23 13:53 ah 2009-01-23 13:54 give me 5 minutes to think :) 2009-01-23 13:54 ok :) 2009-01-23 13:58 I think you're right 2009-01-23 13:59 ok 2009-01-23 13:59 ah well, that is a good reason to get the SMP issue correct now :) 2009-01-23 14:00 ah, you are tring to delay about it 2009-01-23 14:00 more like, let it improve over time 2009-01-23 14:00 yes 2009-01-23 14:01 directory change has to be inside change_begin/end, so maybe it is protected by delta_lock 2009-01-23 14:01 but still, it does not hurt to make it fully SMP safe 2009-01-23 14:01 well, smp issue is a bit hard to improve incrementally in my mind 2009-01-23 14:02 I imagined the initial usage would be protected by delta_lock 2009-01-23 14:02 i see 2009-01-23 14:02 I think we can down_write always if we want 2009-01-23 14:03 or mutex instead of rwsem 2009-01-23 14:03 yes 2009-01-23 14:04 well, it would be a bit later 2009-01-23 14:04 I meant I'll think about hackfs for now 2009-01-23 14:05 smp improvement is welcome 2009-01-23 14:06 ok 2009-01-23 14:06 it would be nice to show the fork_buffer idea on lkml 2009-01-23 14:06 yes, I think it's good idea and can be generarize for all fs 2009-01-23 14:07 to avoid needless io wait 2009-01-23 14:08 any time you are ready for a pull, I am ready, I will merge my defer free changes with your link improvements after pull 2009-01-23 14:09 ok 2009-01-23 14:10 I'll prepare 3 patches for link 2009-01-23 14:13 ACTION will be back in 1 hour 2009-01-23 14:13 ok, I'll post those until you back 2009-01-23 15:13 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-01-23 15:14 not only 3 patches though 2009-01-23 15:34 reading 2009-01-23 15:40 - link_add(page_link(page), sb->defree); 2009-01-23 15:40 - sb->defree = sb->defree->next; 2009-01-23 15:40 + link_add(page_link(page), &sb->defree); <- nice :) 2009-01-23 15:40 thanks 2009-01-23 15:40 it is main purpose of link change 2009-01-23 15:41 it's exactly equivalent, which I didn't see 2009-01-23 15:42 well is it 
2009-01-23 15:42 it puts sb.defree on the list as a member 2009-01-23 15:43 yes 2009-01-23 15:44 doesn't that build the queue in the reverse order? 2009-01-23 15:44 with it, it can remove special case 2009-01-23 15:44 -!- gila(~gila@5ED41295.cable.ziggo.nl) has joined #tux3 2009-01-23 15:44 yes 2009-01-23 15:44 it doesn't have the pointer to back 2009-01-23 15:45 one of my goals in this application was to replay the queue in the same order as it is created 2009-01-23 15:46 it doesn't matter for deferred free, but it does matter for the other use I see in replay 2009-01-23 15:46 to get it, I think we have to add the pointer to tail, and link_add_tail would be used for it 2009-01-23 15:46 incidentally, in tux2 I did the same thing as you have written here: stored lists on pages, replayed the pages in reverse order 2009-01-23 15:47 yes 2009-01-23 15:47 or walk until tail to add 2009-01-23 15:48 I think I would like to merge the part of this that doesn't touch log.c for now 2009-01-23 15:49 eh, I thought you said you want it 2009-01-23 15:50 btw, why? 
2009-01-23 15:50 both is LIFO order 2009-01-23 15:51 the current one is fifo 2009-01-23 15:51 according to my test 2009-01-23 15:51 oh, let me check it 2009-01-23 15:52 fifo queue with requirement that fifo never has less than one element, which is taken care of by init 2009-01-23 15:53 I can imagine a fifo that removes that special case by putting a dedicated tail element on the list 2009-01-23 15:53 but it needs some fiddling 2009-01-23 15:57 -void change_begin(struct sb *sb) { }; 2009-01-23 15:57 -void change_end(struct sb *sb) { }; 2009-01-23 15:57 +void change_begin(struct sb *sb) { } 2009-01-23 15:57 +void change_end(struct sb *sb) { } 2009-01-23 15:57 <- thanks :) 2009-01-23 15:58 ok, I've got 2009-01-23 15:58 I think we can it with link_add() 2009-01-23 16:03 the way I thought about doing it is, have a dedicated head element, and access the list through a pointer to tail 2009-01-23 16:05 then, insert new element by list_add(tail) and remove from head by list_del_next(tail->next) 2009-01-23 16:06 sorry 2009-01-23 16:06 link_ 2009-01-23 16:09 yes 2009-01-23 16:09 well, so, those patches don't need to merge for now 2009-01-23 16:10 it's fine, I can pull them all 2009-01-23 16:10 as they are 2009-01-23 16:11 and then fiddle with log.c 2009-01-23 16:11 I have a patch in progress to allow two deferred free lists 2009-01-23 16:13 link patch is just broken, if you are ok, please pull 2009-01-23 16:13 I'm ok with it 2009-01-23 16:21 pulled 2009-01-23 16:22 pushed to public 2009-01-23 16:30 hirofumi, it feels like another golden copy 2009-01-23 16:30 well, with delalloc and vol_bread 2009-01-23 16:30 it needs some patches though 2009-01-23 16:31 right 2009-01-23 16:31 any issues? 2009-01-23 16:31 needs patch to kernel 2009-01-23 16:31 that's an issue, what patch? 
2009-01-23 16:31 or newer kernel version 2009-01-23 16:31 I vote for newer kernel version 2009-01-23 16:32 yes, it would be more easy 2009-01-23 16:32 ah, and needs temporary patches 2009-01-23 16:32 like mark_buffer_dirty and set_buffer_uptodate 2009-01-23 16:33 temporary patches are ok 2009-01-23 16:33 how do we make things easy for our students doing the dedup work? 2009-01-23 16:33 it is 5:30 am there, so I can't ask them if new kernel is ok, I think it probabably is though 2009-01-23 16:33 it is kernel, so there is not needed those patches 2009-01-23 16:34 ah right 2009-01-23 16:34 ah, if we merge those 2009-01-23 16:34 and I think they will be ok working with more recent kernel 2009-01-23 16:34 when they start working with kernel 2009-01-23 16:34 yes 2009-01-23 16:34 it is probably more interesting for them anyway 2009-01-23 16:35 well, I don't recommend to merge temporary patches like mark_buffer_dirty() and set_buffer_uptodate 2009-01-23 16:35 it just makes happy current flusher 2009-01-23 16:35 for testing purpose 2009-01-23 16:36 happy in what sense? 2009-01-23 16:36 buffer state management for current flusher 2009-01-23 16:37 e.g. mark_buffer_dirty() requires buffer_uptodate state 2009-01-23 16:37 and dirty after modify 2009-01-23 16:38 ok, it sounds good 2009-01-23 16:39 it means reverting your some change for atomic commit 2009-01-23 16:39 with one? 2009-01-23 16:39 the clean vs uptodate? 2009-01-23 16:40 http://userweb.kernel.org/~hirofumi/revert-mark_buffer_dirty.patch 2009-01-23 16:40 the patch is this 2009-01-23 16:40 reintroduce mark_buffer_dirty() after blockdirty() 2009-01-23 16:40 and... 
2009-01-23 16:41 don't assume new_block() buffer is dirty 2009-01-23 16:41 that is fine, it's a small local change 2009-01-23 16:41 if those are ok, it would work for new kernel 2009-01-23 16:42 they are ok 2009-01-23 16:43 ok, I'll prepare to merge those, with delalloc and vol_bread 2009-01-23 16:43 yes, here are my three most exciting things right now: delalloc; vol_bread; new kernel 2009-01-23 16:44 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-23 16:44 and the 26 hour survival is fine :) 2009-01-23 16:44 ah, btw, without delalloc and vol_bread is necessary to survive 2009-01-23 16:45 requirement is bitmap lock fix and memory allocation change 2009-01-23 16:45 and temporary patche 2009-01-23 16:45 and temporary patches 2009-01-23 16:45 that is ok, it is just interesting to know it survived that long with some minor differences 2009-01-23 16:45 well, probably, any convination of patches 2009-01-23 16:46 it is more interesting to work on stabilizing with delalloc and vol_bread now, those are the things that move us forward 2009-01-23 16:46 the patch depending newer kernel is only delalloc 2009-01-23 16:46 oh :) 2009-01-23 16:47 right, it's just the buffer_delay check 2009-01-23 16:47 well, so we can choise new kernel with delalloc, or current kernel without delalloc 2009-01-23 16:47 yes 2009-01-23 16:47 I vote for new kernel with delalloc 2009-01-23 16:48 ok 2009-01-23 16:48 I'll prepare 2009-01-23 16:48 after sleep 2009-01-23 16:48 ACTION was getting tired of 2.6.26.5 anyway 2009-01-23 16:48 ok 2009-01-23 16:49 I will hack log.c to allow multiple defer_free lists, and start running cursor_redirect today 2009-01-23 16:49 btw, if we use symlink to tux3/user/kernel, we don't need full tree for now 2009-01-23 16:50 full tree? 2009-01-23 16:50 git tree from linus 2009-01-23 16:50 it's it getting huge? 
2009-01-23 16:50 it may just be lazy 2009-01-23 16:51 lazy to prepare tree 2009-01-23 16:51 if we merge those patches, current repo our git tree will not work with new hg tree 2009-01-23 16:52 current our git tree 2009-01-23 16:52 so, I thought we may be lazy to prepare git tree for it 2009-01-23 16:53 if not, there is no problem 2009-01-23 16:53 you mean the 2.6.26.5 git tree? 2009-01-23 16:53 yes 2009-01-23 16:53 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-23 16:53 I am ready to say goodbyte to 2.6.26.5 completely 2009-01-23 16:53 delalloc requires more newler version 2009-01-23 16:53 yes, it is time to make the move to current linus, and set up on kernel.org 2009-01-23 16:54 say sayonara to 2.6.26.5 2009-01-23 16:54 ok 2009-01-23 16:54 I think we are maybe one week away from atomic commit in userspace and kernel, on current linus tree 2009-01-23 16:55 it is temporary? or will use for long time 2009-01-23 16:55 ? 2009-01-23 16:55 that will be a milestone 2009-01-23 16:55 the linus tree? 2009-01-23 16:55 yes 2009-01-23 16:55 only until we can get a spot in linux-next :) 2009-01-23 16:56 ah, I thought about to make good history on git 2009-01-23 16:56 it would be nice 2009-01-23 16:56 but not necessary 2009-01-23 16:56 if somebody has time to spend on it, it could be interesting 2009-01-23 16:56 convert from hg to git, however it would be lazy to do now 2009-01-23 16:57 the hg history will remain available, and the userspace development will be there 2009-01-23 16:57 yes 2009-01-23 16:58 the question is, will we start committing new changes to both git and hg? 
2009-01-23 16:58 for kernel, I think probably we want the history in git 2009-01-23 16:58 I think we will use hg for now 2009-01-23 16:58 there must be an import 2009-01-23 16:59 yes 2009-01-23 16:59 without user/* 2009-01-23 16:59 we will need some patches like current 2.6.26.5 2009-01-23 16:59 like fs/Kconfig and fs/Makefile 2009-01-23 16:59 it is good to import the history, so people who did the work get credit 2009-01-23 17:00 yes 2009-01-23 17:00 we import all our mistakes too :) 2009-01-23 17:00 however, it would be lazy to filter out user/* (may be) 2009-01-23 17:01 so, I thought we may want to delay to import 2009-01-23 17:02 for now I will clone linus's tree and do my kernel builds with that instead of 2.6.26.5 2009-01-23 17:02 good 2009-01-23 17:02 however, fs/Kconfig and fs/Makefile change is needed 2009-01-23 17:03 yes of course 2009-01-23 17:03 so it will have a tux3 branch 2009-01-23 17:03 it's good for making patches 2009-01-23 17:03 ah, ok 2009-01-23 17:03 it's good 2009-01-23 17:04 so, we can make actual history to master 2009-01-23 17:05 ah 2009-01-23 17:05 I can do some trial history imports in that too 2009-01-23 17:05 ah, yes 2009-01-23 17:07 well, what I want to say, maybe I want to try to make history on some point 2009-01-23 17:07 and git tree for it 2009-01-23 17:07 good 2009-01-23 17:08 I think you will do a better job than me :) 2009-01-23 17:08 :) 2009-01-23 17:09 I'm not sure at all now, I have not been tried it before 2009-01-23 17:10 however, at least, I'm thinking it is possible 2009-01-23 17:11 I have heard git is good at that 2009-01-23 17:11 and also allows rewriting history 2009-01-23 17:12 oh, good 2009-01-23 17:12 which might be nice for cleaning up some old things, like using multiple different user names for the same person 2009-01-23 17:12 ah, good 2009-01-23 17:15 you need to sleep sometime ;) 2009-01-23 17:16 thanks :) oyasumi 2009-01-23 17:16 oyasumi 2009-01-23 17:16 and sorry about link change 2009-01-23 17:22 it's no 
problem whatsoever 2009-01-23 19:10 http://userweb.kernel.org/~hirofumi/add-fifo.patch 2009-01-23 19:10 how about this? it is untested yet though 2009-01-23 19:10 checking 2009-01-23 19:11 and name can be changed 2009-01-23 19:17 that is the style with explicit checks for empty and fully condition, and you have concealed the checks very expertly :) 2009-01-23 19:17 thanks :) 2009-01-23 19:20 I'll try to submit it after test after sleep 2009-01-23 19:23 struct defer { struct link *tail; u64 *pos, *top; }; <- I am working on a patch that does this 2009-01-23 19:24 could you merge after that one goes in? 2009-01-23 19:24 yes, of course 2009-01-23 19:24 your strategy is correct, and it avoids initializing in two places 2009-01-23 19:27 it is about init_defree? 2009-01-23 19:35 I was allocating and inserting a page in two places 2009-01-23 19:35 extra code :) 2009-01-23 19:37 and you have made it about as efficient as it can be: there is no sentinel element, and there is only a single pointer to the fifo 2009-01-23 19:39 yes, it is you did 2009-01-23 19:41 I just converted those to inline function 2009-01-23 19:43 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-23 19:44 earthquake 2009-01-23 19:44 btw, this algorithm is where com from? 2009-01-23 19:44 yup 2009-01-23 19:44 oh 2009-01-23 19:44 big one 2009-01-23 19:44 ish 2009-01-23 19:44 short and strong 2009-01-23 19:44 over 5.0 2009-01-23 19:44 but still not close to hirofumi's 2009-01-23 19:44 in osaka 2009-01-23 19:45 hirofumi, single linked circular lists are ancient, knuth talks about them 2009-01-23 19:45 well, Kobe was too big 2009-01-23 19:46 the linux innovation is to use container_of 2009-01-23 19:46 flips: damn felt pretty strong here 2009-01-23 19:46 yes, however, I didn't know fifo trick 2009-01-23 19:46 must have been close 2009-01-23 19:46 since it was so short 2009-01-23 19:47 shapor, no link yet? 
2009-01-23 19:47 3.4 2009-01-23 19:47 1 mile from marina del ray 2009-01-23 19:47 wnw 2009-01-23 19:47 that's my house 2009-01-23 19:47 yeah 2009-01-23 19:47 seems like venice 2009-01-23 19:47 a little one, right under your toes 2009-01-23 19:47 http://earthquake.usgs.gov/eqcenter/recenteqsus/Maps/US2/33.35.-119.-117.php 2009-01-23 19:48 http://earthquake.usgs.gov/eqcenter/recenteqsus/Quakes/ci10373093.php 2009-01-23 19:48 http://quake.wr.usgs.gov/recenteqs/Quakes/ci10373093.html 2009-01-23 19:48 this time, the site didn't get knocked off the web like last time 2009-01-23 19:48 http://maps.google.com/maps?q=33.9841+-118.4701(M3.4+-+GREATER+LOS+ANGELES+AREA%2C+CALIFORNIA+-+2009+January+24++03%3A42%3A44+UTC)&ll=33.9841,-118.4701&spn=2,2&f=d&t=h&hl=e 2009-01-23 19:49 right by tim 2009-01-23 19:50 that's a block from switch studios 2009-01-23 19:50 wonder if matt was home 2009-01-23 19:53 opened a cupboard door here 2009-01-23 19:59 matt sez: whoomp 2009-01-23 19:59 straight up and down 2009-01-23 20:04 twice 2009-01-23 20:04 whoomp de whoomp 2009-01-23 20:04 shake and bake 2009-01-23 20:43 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-23 21:13 tim_dimm: how was it by you? 2009-01-23 21:13 it was the same here 2009-01-23 21:14 less than 3 secs probably 2009-01-23 21:14 up and down twice 2009-01-24 02:31 -!- RazvanM(~RazvanM@96.234.239.248) has joined #tux3 2009-01-24 02:48 hey flips 2009-01-24 02:48 hi bh 2009-01-24 02:50 how's it going this late at night ? 
2009-01-24 02:50 getting late 2009-01-24 03:19 yeah 2009-01-24 08:02 -!- edt(~Ed@dsl-61-49.aei.ca) has joined #tux3 2009-01-24 08:29 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-24 10:05 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-24 10:07 -!- cdk(~chinmay@121.246.33.150) has joined #tux3 2009-01-24 10:07 hi flips 2009-01-24 10:08 -!- gaurav(~gaurav@59.95.12.249) has joined #tux3 2009-01-24 10:11 -!- cdk(~chinmay@121.246.33.150) has joined #tux3 2009-01-24 10:16 -!- gila(~gila@5ED41295.cable.ziggo.nl) has joined #tux3 2009-01-24 11:40 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-24 11:46 hi cdk 2009-01-24 11:46 hi 2009-01-24 11:47 was using the jan 11 source again...after the tuxio changes. 2009-01-24 11:47 seems the changes broke something 2009-01-24 11:47 the buffer dirty change? 2009-01-24 11:48 yes 2009-01-24 11:48 on very unmount i am getting 2009-01-24 11:48 __destroy_buffers: dirty buffer leak, or list corruption? 2009-01-24 11:48 map [0x8ec74c0] 3/0* 2009-01-24 11:48 __destroy_buffers: Failed assertion "list_empty(&lru_buffers)"! 2009-01-24 11:48 Trace/breakpoint trap 2009-01-24 11:48 tux3fuse again.. 2009-01-24 11:48 trying to find the problem...no luck so far :( 2009-01-24 11:49 you can #undef BUFFER_PARANOIA_DEBUG 2009-01-24 11:50 ok.... 2009-01-24 11:50 then on shutdown, see if there really are any leaked buffers 2009-01-24 11:50 ok 2009-01-24 11:51 will do....also this is without any dedup code...just tux3 2009-01-24 11:51 put a show_active_buffers at the end of tux3fuse main maybe 2009-01-24 11:51 yes... 2009-01-24 11:51 doing it now 2009-01-24 11:52 also getting this on certain reads 2009-01-24 11:52 set_buffer_uptodate: Failed assertion "!buffer_uptodate(buffer)"! 
2009-01-24 11:52 Trace/breakpoint trap 2009-01-24 11:53 ah, it would be nice to find the source of that 2009-01-24 11:53 one way to fix is to remove the assert 2009-01-24 11:53 but it is better to find out why a buffer is set uptodate more than once 2009-01-24 11:54 yes..... 2009-01-24 11:58 have you tried running tux3fuse under gdb, to find out where the assert comes from? 2009-01-24 11:58 gdb -args tux3fuse -f ... 2009-01-24 11:59 show_active_buffers(sb->volmap->map) in tux3fuse is giving me compilation errors...implicit declaration...Makefile changes?? 2009-01-24 11:59 let me check buffer.h 2009-01-24 12:00 it was never exported 2009-01-24 12:01 show_buffers only shows buffers with count > 1 2009-01-24 12:01 oh sorry 2009-01-24 12:01 no 2009-01-24 12:01 need to export show_active_buffers 2009-01-24 12:04 cdk, those functions are in buffer.h now 2009-01-24 12:04 you probably already put them there? 2009-01-24 12:04 show_active_buffers and show_dirty_buffers 2009-01-24 12:05 no was trying gdb...will export now...gdb later.. 2009-01-24 12:08 show_active_buffers is just showing buckets...no buffers .. will do a show_buffers 2009-01-24 12:09 [39] 8/0 2009-01-24 12:09 [318] e/0 2009-01-24 12:09 [399] 5/0 2009-01-24 12:09 [438] d/0 2009-01-24 12:09 [519] 4/0 2009-01-24 12:09 [639] 3/0* 2009-01-24 12:09 [759] 2/0 2009-01-24 12:09 [918] 9/0 2009-01-24 12:09 with show_buffers after the fuse_unmount() call 2009-01-24 12:10 so there really is a dirty buffer left 2009-01-24 12:10 yes 2009-01-24 12:11 next thing is to find out where it came from 2009-01-24 12:11 and why it was not flushed 2009-01-24 12:11 yes 2009-01-24 12:11 recompile with #define buftrace trace_on ? 2009-01-24 12:12 ok 2009-01-24 12:12 is it a short test that causes that? 2009-01-24 12:13 yes...just copied a single file .. 
having 2 blocks 2009-01-24 12:13 and unmounted 2009-01-24 12:13 the buftrace may not show the buffer being dirtied 2009-01-24 12:13 but we will change it so it does 2009-01-24 12:14 ah it does already show the dirty 2009-01-24 12:14 uh..i put a printf in tuxclose ... brelse: Release buffer 3, count = 1, state = 3 2009-01-24 12:14 brelse: Free buffer 3 2009-01-24 12:14 brelse: Release buffer 2, count = 1, state = 2 2009-01-24 12:14 brelse: Free buffer 2 2009-01-24 12:14 In tux close 2009-01-24 12:14 show_buffers: (map 0xa0274c8) 2009-01-24 12:14 [39] 8/0 2009-01-24 12:14 [318] e/0 2009-01-24 12:14 [399] 5/0 2009-01-24 12:14 [438] d/0 2009-01-24 12:15 [519] 4/0 2009-01-24 12:15 [639] 3/0* 2009-01-24 12:15 [759] 2/0 2009-01-24 12:15 [918] 9/0 2009-01-24 12:16 tuxsync should have written the buffer, setting it clean 2009-01-24 12:17 um 2009-01-24 12:17 sorry 2009-01-24 12:17 tuxclose after sync super?? 2009-01-24 12:18 sync super is the one that should have set the buffer clean 2009-01-24 12:18 well 2009-01-24 12:18 it is because of a recursive dirty of the bitmap, most probably 2009-01-24 12:19 is the dump above from show_buffers(sb->volmap->map); ? 2009-01-24 12:20 yes 2009-01-24 12:21 how about adding stacktrace() to mark_buffer dirty? 
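A stack dump in the "binary(symbol+offset)[address]" format that appears later in this log can be produced with glibc's backtrace facilities. What follows is a minimal sketch, assuming glibc's execinfo.h is available; stacktrace_sketch and the 32-frame limit are illustrative choices, not the tux3 stacktrace() from hexdump.c, which is not shown here.

```c
#include <execinfo.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the tux3 stacktrace()): dump the current call
 * chain to stderr and return the number of frames captured.
 */
static int stacktrace_sketch(void)
{
	void *frames[32];
	int count = backtrace(frames, 32);

	fprintf(stderr, "_______stack ______\n");
	fflush(stderr);
	/* Writes one line per frame directly to the file descriptor. */
	backtrace_symbols_fd(frames, count, STDERR_FILENO);
	return count;
}
```

With glibc, symbol names are only printed for symbols visible in the dynamic symbol table (for example when linking with -rdynamic); static functions show up as a bare address, which may explain lines such as "./tux3fuse[0x805a6ec]" in the traces.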
2009-01-24 12:22 ok 2009-01-24 12:22 stacktrace is not included in buffer.c, so just cut an paste it from hexdump.c 2009-01-24 12:22 ok 2009-01-24 12:22 needs 2009-01-24 12:24 i also tried show_buffers in save_inode 2009-01-24 12:24 before and after the release_cursor 2009-01-24 12:25 before cursor release 3/2* 2009-01-24 12:26 sorry...before relse 3/1* and after 3/0* 2009-01-24 12:28 thats correct i guess....doing the stacktrace() now 2009-01-24 12:31 the stacktrace will reveal the culprit 2009-01-24 12:32 mark_buffer_dirty: set_buffer_dirty 3 state = 2 2009-01-24 12:32 _______stack ______ 2009-01-24 12:32 ./tux3fuse(mark_buffer_dirty+0x99)[0x804afbf] 2009-01-24 12:32 ./tux3fuse[0x805a6ec] 2009-01-24 12:32 ./tux3fuse[0x805abe3] 2009-01-24 12:32 ./tux3fuse(tuxsync+0x2d)[0x805b4e6] 2009-01-24 12:32 ./tux3fuse(tuxclose+0x11)[0x805b4ff] 2009-01-24 12:32 ./tux3fuse[0x805e047] 2009-01-24 12:32 /lib/libfuse.so.2[0xb7ed4603] 2009-01-24 12:32 /lib/libfuse.so.2[0xb7ed2fa9] 2009-01-24 12:32 /lib/libfuse.so.2(fuse_session_process+0x26)[0xb7ed5b06] 2009-01-24 12:32 /lib/libfuse.so.2(fuse_session_loop+0x9f)[0xb7ed173f] 2009-01-24 12:32 ./tux3fuse(main+0x12e)[0x805e60b] 2009-01-24 12:32 /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe5)[0xb7d7c685] 2009-01-24 12:32 ./tux3fuse[0x804ac01] 2009-01-24 12:32 brelse: Release buffer 3, count = 1, state = 3 2009-01-24 12:33 so the dirty happened in the flush 2009-01-24 12:33 and we think it is the flush of the bitmap inode 2009-01-24 12:34 we can confirm that, but it seems pretty clear 2009-01-24 12:35 to confrim, printf(..., inode->inum tuxflush and see if it is the bitmap inode 2009-01-24 12:36 ok 2009-01-24 12:36 I am pretty sure it is, but it is still worth running the test just to be sure 2009-01-24 12:40 inode ::e save_inode: save inode 0xe 2009-01-24 12:40 probe: probe level 0, 1 of 1 2009-01-24 12:40 lookup inode 0xe, 0 + e 2009-01-24 12:40 resize inum 0xe at 0xd8 from 36 to 36 2009-01-24 12:40 mark_buffer_dirty: 
set_buffer_dirty 3 state = 2 2009-01-24 12:40 _______stack ______ 2009-01-24 12:40 oh, it's not the bitmap inode 2009-01-24 12:40 no 2009-01-24 12:40 but I see the problem 2009-01-24 12:41 in sync_super, the bitmap is flushed first, and it should be flushed last 2009-01-24 12:41 ok.. 2009-01-24 12:42 but before the volmap? 2009-01-24 12:42 exactly 2009-01-24 12:42 --- a/user/super.c Sat Jan 24 12:04:12 2009 -0800 2009-01-24 12:42 +++ b/user/super.c Sat Jan 24 12:42:32 2009 -0800 2009-01-24 12:42 @@ -40,14 +40,14 @@ int sync_super(struct sb *sb) 2009-01-24 12:42 int sync_super(struct sb *sb) 2009-01-24 12:42 { 2009-01-24 12:42 int err; 2009-01-24 12:42 - printf("sync bitmap\n"); 2009-01-24 12:43 - if ((err = tuxsync(sb->bitmap))) 2009-01-24 12:43 - return err; 2009-01-24 12:43 printf("sync rootdir\n"); 2009-01-24 12:43 if ((err = tuxsync(sb->rootdir))) 2009-01-24 12:43 return err; 2009-01-24 12:43 printf("sync atom table\n"); 2009-01-24 12:43 if ((err = tuxsync(sb->atable))) 2009-01-24 12:43 + return err; 2009-01-24 12:43 + printf("sync bitmap\n"); 2009-01-24 12:43 + if ((err = tuxsync(sb->bitmap))) 2009-01-24 12:43 return err; 2009-01-24 12:43 printf("sync volmap\n"); 2009-01-24 12:43 if ((err = flush_buffers(sb->volmap->map))) 2009-01-24 12:43 i still have a dirty buffer 2009-01-24 12:44 hmm 2009-01-24 12:45 still e? 
2009-01-24 12:45 yes...and hitting the assert again because of the dirty buffer 2009-01-24 12:47 save_inode for e is being called after save_inode for 0 2009-01-24 12:48 that's a good clue :) 2009-01-24 12:49 save_inode: save inode 0x0 2009-01-24 12:49 probe: probe level 0, 1 of 1 2009-01-24 12:49 lookup inode 0x0, 0 + 0 2009-01-24 12:49 resize inum 0x0 at 0x0 from 36 to 36 2009-01-24 12:49 mark_buffer_dirty: set_buffer_dirty 3 state = 3 2009-01-24 12:49 brelse: Release buffer 3, count = 1, state = 3 2009-01-24 12:49 brelse: Free buffer 3 2009-01-24 12:49 brelse: Release buffer 2, count = 1, state = 2 2009-01-24 12:49 brelse: Free buffer 2 2009-01-24 12:49 sync volmap 2009-01-24 12:49 and after the sync super 2009-01-24 12:49 tux3_flush: not implemented 2009-01-24 12:49 right, there must be a tux3fuse reason for that 2009-01-24 12:49 inode ::e save_inode: save inode 0xe 2009-01-24 12:49 probe: probe level 0, 1 of 1 2009-01-24 12:49 lookup inode 0xe, 0 + e 2009-01-24 12:51 yes 2009-01-24 12:52 but a backtrace in tuxclose maybe? 
2009-01-24 12:52 ok 2009-01-24 12:53 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-24 12:55 _______stack ______ 2009-01-24 12:55 ./tux3fuse(tuxclose+0x1f)[0x805b91a] 2009-01-24 12:55 ./tux3fuse[0x805e499] 2009-01-24 12:55 /lib/libfuse.so.2[0xb80a0603] 2009-01-24 12:55 /lib/libfuse.so.2[0xb809efa9] 2009-01-24 12:55 /lib/libfuse.so.2(fuse_session_process+0x26)[0xb80a1b06] 2009-01-24 12:55 /lib/libfuse.so.2(fuse_session_loop+0x9f)[0xb809d73f] 2009-01-24 12:55 ./tux3fuse(main+0x12e)[0x805ea5d] 2009-01-24 12:55 /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe5)[0xb7f48685] 2009-01-24 12:55 ./tux3fuse[0x804ac51] 2009-01-24 12:55 that's not too helpful 2009-01-24 12:55 yeah 2009-01-24 12:56 the second tux3fuse, I wonder if that is from main 2009-01-24 12:57 don't see anything 2009-01-24 12:57 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-24 12:58 thats odd 2009-01-24 13:06 how do we sort this out ? 2009-01-24 13:06 thinking about the next move 2009-01-24 13:07 find out where the call comes from in tux3fuse.c 2009-01-24 13:07 I thought the backtrace would do that 2009-01-24 13:07 it's like tux3fuse is compiled without debugging symbols 2009-01-24 13:08 it is possible to break into the debugger and get a better backtrace 2009-01-24 13:09 and there is a fprintf at the beginning of every fuse call 2009-01-24 13:09 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-24 13:09 so which one did the tuxclose? 2009-01-24 13:10 tux3_flush 2009-01-24 13:11 tux3_flush doesn't do anything 2009-01-24 13:12 sync super 2009-01-24 13:12 tux3_flush: not implemented 2009-01-24 13:12 _______stack ______ 2009-01-24 13:12 ./tux3fuse(tuxclose+0x1f)[0x805b91a] 2009-01-24 13:12 ./tux3fuse[0x805e499] 2009-01-24 13:14 is any of the fprintf output being printed? 
2009-01-24 13:16 the last one i see is tux3_write(e) 2009-01-24 13:16 I'm making a patch to convert those fprintfs to trace 2009-01-24 13:17 ok 2009-01-24 13:20 by the way, I get the same assert here 2009-01-24 13:21 simple test 2009-01-24 13:21 cdk, getting a little late where you are? 2009-01-24 13:22 yes... 2009-01-24 13:22 continue tomorrow? 2009-01-24 13:22 but i am here for a while...if u are ok.. 2009-01-24 13:22 ok 2009-01-24 13:22 I am ok 2009-01-24 13:22 its 0300 here...i can last for another hour :) 2009-01-24 13:22 our dedup work is stuck on this problem and the read one... 2009-01-24 13:23 checked in a patch to convert tux3fuse printfs to traces, we should see the fuse call that causees the tuxclose on inode e now 2009-01-24 13:23 ok 2009-01-24 13:23 pulling it now 2009-01-24 13:28 tux3_getattr: tux3_getattr(1) 2009-01-24 13:28 save_inode: save inode 0xe 2009-01-24 13:28 odd, just as you say 2009-01-24 13:30 where does that save_inode come from 2009-01-24 13:30 it's not in the tux3_getattr 2009-01-24 13:30 no....it comes from tuxsync 2009-01-24 13:31 which one? 2009-01-24 13:33 just a min... 2009-01-24 13:37 ah, it must be tux3_release 2009-01-24 13:37 doesn't have a trace for some reason 2009-01-24 13:38 indeed it is 2009-01-24 13:38 that clears that up 2009-01-24 13:38 :) 2009-01-24 13:39 now... a sloppy fix coming 2009-01-24 13:40 should tux3_release be calling sync_super ?? 2009-01-24 13:40 yes 2009-01-24 13:40 because we do it like that in other places 2009-01-24 13:40 it is wrong, but that is what we do 2009-01-24 13:44 hmm... 2009-01-24 13:44 thats the problem solved for now 2009-01-24 13:44 want to send the patch? 
2009-01-24 13:45 ok :) 2009-01-24 13:45 tux3_release needs something like this: trace("release (%Lx)", (L)ino); 2009-01-24 13:45 to be consistent with others 2009-01-24 13:45 will add that as well 2009-01-24 13:46 I wish I had more time to work on the fuse code 2009-01-24 13:46 we should not be flushing the super on every write, but lets wait for atomic commit to arrive before fixing that 2009-01-24 13:47 ok 2009-01-24 13:47 it is actually pretty easy to work on 2009-01-24 13:48 yes... 2009-01-24 13:49 { 2009-01-24 13:49 + trace("release (%Lx)", (L)ino); 2009-01-24 13:49 if (ino != FUSE_ROOT_ID) { 2009-01-24 13:49 struct inode *inode = (struct inode *)(unsigned long)fi->fh; 2009-01-24 13:49 tuxclose(inode); 2009-01-24 13:49 + sync_super(inode->i_sb); 2009-01-24 13:49 ok? 2009-01-24 13:50 need a // error??? 2009-01-24 13:50 comment 2009-01-24 13:50 every time you leave out an error check 2009-01-24 13:50 it's ok to be lazy, but you have to document it ;) 2009-01-24 13:51 + if ((errno = -sync_super(sb))) { 2009-01-24 13:51 + fuse_reply_err(req, errno); 2009-01-24 13:51 + return; 2009-01-24 13:51 + } <- this is what I did 2009-01-24 13:51 :) 2009-01-24 13:51 it's also sloppy 2009-01-24 13:51 but at least it reports the error 2009-01-24 13:52 after I have thought about it a little, we will replace all the sync_supers in tux3fuse with one sync_super on exit 2009-01-24 13:54 i did that.. 2009-01-24 13:54 i mean .. before.. 2009-01-24 13:54 but i thought it was not a good thing to do sync_super() in main for tux3fuse.c 2009-01-24 13:57 will it cause any problems with buffers ? 2009-01-24 14:03 patch sent 2009-01-24 14:10 thanks cdk, and sleep well 2009-01-24 14:10 actually.. 2009-01-24 14:10 not yet... 2009-01-24 14:10 :) 2009-01-24 14:11 i am still looking at the read problem...just a min... 2009-01-24 14:15 cdk I wrote "wrong and just for now" on the commit... 
it doesn't mean your patch is wrong 2009-01-24 14:15 no problem :) 2009-01-24 14:27 flips, about the second problem 2009-01-24 14:27 i am copying a 200 mb mkv file on the tux3 volume.. 2009-01-24 14:27 cp is fine .. no errors 2009-01-24 14:28 but whenever i try to read it i get -- set_buffer_uptodate: Failed assertion "!buffer_uptodate(buffer)"! 2009-01-24 14:34 ok 2009-01-24 14:35 hmm 2009-01-24 14:36 there should be no set_buffer_uptodates, only set_buffer_clean 2009-01-24 14:39 here is the stack trace 2009-01-24 14:39 _______stack ______ 2009-01-24 14:39 ./tux3fuse(set_buffer_uptodate+0x1f)[0x804b005] 2009-01-24 14:39 ./tux3fuse(filemap_extent_io+0x56c)[0x8059d65] 2009-01-24 14:39 ./tux3fuse(blockread+0x5b)[0x804ba8a] 2009-01-24 14:39 ./tux3fuse(tuxio+0x2ec)[0x805b14c] 2009-01-24 14:39 ./tux3fuse(tuxread+0x27)[0x805b2d6] 2009-01-24 14:39 ./tux3fuse[0x805cb5b] 2009-01-24 14:39 /lib/libfuse.so.2[0xb804477a] 2009-01-24 14:39 /lib/libfuse.so.2[0xb8042fa9] 2009-01-24 14:39 /lib/libfuse.so.2(fuse_session_process+0x26)[0xb8045b06] 2009-01-24 14:39 /lib/libfuse.so.2(fuse_session_loop+0x9f)[0xb804173f] 2009-01-24 14:39 ./tux3fuse(main+0x12e)[0x805e70e] 2009-01-24 14:39 /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe5)[0xb7eec685] 2009-01-24 14:39 ./tux3fuse[0x804abf1] 2009-01-24 14:39 set_buffer_uptodate: Failed assertion "!buffer_uptodate(buffer)"! 2009-01-24 14:40 this is not current hg repo? 2009-01-24 14:41 no the Jan 11 one 2009-01-24 14:41 i mean Jan 11 snapshot. 2009-01-24 14:42 I think that bug may already be fixed in current repo 2009-01-24 14:42 ok...will check 2009-01-24 14:45 yes its changed to set_buffer_clean 2009-01-24 14:46 using the current repo -- 2009-01-24 14:46 set_buffer_clean: Failed assertion "!buffer_clean(buffer)"! 
2009-01-24 14:46 Trace/breakpoint trap 2009-01-24 14:47 just a moment 2009-01-24 14:49 cdk, could you try this patch: 2009-01-24 14:49 --- a/user/filemap.c Sat Jan 24 14:14:43 2009 -0800 2009-01-24 14:49 +++ b/user/filemap.c Sat Jan 24 14:49:35 2009 -0800 2009-01-24 14:49 @@ -51,6 +51,7 @@ void guess_region(struct buffer_head *bu 2009-01-24 14:49 } 2009-01-24 14:49 *start = ends[0]; 2009-01-24 14:49 *count = ends[1] + 1 - ends[0]; 2009-01-24 14:50 + *count = 1; 2009-01-24 14:50 } 2009-01-24 14:50 int filemap_extent_io(struct buffer_head *buffer, int write) 2009-01-24 14:50 doing now 2009-01-24 14:51 I see the bug 2009-01-24 14:51 or do I 2009-01-24 14:51 maybe not 2009-01-24 14:55 ah, I see it 2009-01-24 14:56 try this patch: 2009-01-24 14:56 --- a/user/filemap.c Sat Jan 24 14:14:43 2009 -0800 2009-01-24 14:56 +++ b/user/filemap.c Sat Jan 24 14:55:44 2009 -0800 2009-01-24 14:56 @@ -41,7 +41,7 @@ void guess_region(struct buffer_head *bu 2009-01-24 14:56 if (next > inode->i_size >> tux_sb(inode->i_sb)->blockbits) 2009-01-24 14:56 break; 2009-01-24 14:56 } else { 2009-01-24 14:56 - unsigned stop = write ? !buffer_dirty(nextbuf) : buffer_empty(nextbuf); 2009-01-24 14:56 + unsigned stop = write ? !buffer_dirty(nextbuf) : !buffer_empty(nextbuf); 2009-01-24 14:56 brelse(nextbuf); 2009-01-24 14:56 if (stop) 2009-01-24 14:56 break; 2009-01-24 14:57 its giving me a fail.. 2009-01-24 14:57 one moment 2009-01-24 15:03 yes....that seems to have solved the problem 2009-01-24 15:03 thanks 2009-01-24 15:03 sorry for creating the problem ;) 2009-01-24 15:03 ok, I'll commit that 2009-01-24 15:04 ok 2009-01-24 15:05 that assert in set_buffer_clean is a good thing 2009-01-24 15:06 hirofumi, there? 
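The one-character fix above changes when a read region stops growing: a read should cover only empty (not yet loaded) buffers, and a write only dirty ones. The sketch below illustrates that stop condition on a toy array. It makes several assumptions: the toy_* names and the three-state enum are hypothetical, and it only extends forward, whereas the real guess_region also tracks a start bound.

```c
#include <stddef.h>

/* Toy buffer states; the real tux3 buffer states are not shown in this log. */
enum toy_state { TOY_EMPTY, TOY_CLEAN, TOY_DIRTY };

/*
 * Count how many consecutive buffers, starting at `start`, belong in one
 * transfer. A write region covers dirty buffers only; a read region covers
 * empty buffers only, so an already loaded buffer is never read again.
 */
static size_t toy_region_count(const enum toy_state *state, size_t total,
			       size_t start, int write)
{
	size_t count = 0;

	for (size_t i = start; i < total; i++) {
		int stop = write ? state[i] != TOY_DIRTY : state[i] != TOY_EMPTY;
		if (stop)
			break;
		count++;
	}
	return count;
}
```

With the inverted test (stopping a read at an empty buffer), the region would run across buffers that are already clean, which is consistent with the set_buffer_clean assertion firing on read.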
2009-01-24 15:06 btw...waiting for your response to the dedup design :) 2009-01-24 15:07 it will be there when you get up :) 2009-01-24 15:07 thanks 2009-01-24 15:07 hi 2009-01-24 15:09 I will put in atomic commit now, for btree 2009-01-24 15:09 I just wanted to talk a bit first 2009-01-24 15:09 ok 2009-01-24 15:10 the tricky part remaining is log replay 2009-01-24 15:10 and I think it just got simpler 2009-01-24 15:11 I talked about not replaying promises for retired blocks 2009-01-24 15:11 well, that will never happen for volmap blocks 2009-01-24 15:12 with the simplification that rollup always flushes all dirty volmap blocks 2009-01-24 15:12 yes 2009-01-24 15:14 the only promises against logically mapped blocks are bitmaps and inode table block updates, which are also flushed per rollup 2009-01-24 15:14 inode table? it is about future? 2009-01-24 15:15 it is about soon... I think we will do all inode table updates by writing the attributes to the log 2009-01-24 15:15 I forget why I thought that was needed 2009-01-24 15:15 but it is pretty easy to implement 2009-01-24 15:16 log is no problem, but now it is not logically mapped? 2009-01-24 15:16 sorry 2009-01-24 15:16 it is not logically mapped, but it is a logical update 2009-01-24 15:16 ah, ok 2009-01-24 15:16 because we have to seek in the inode table btree to find the right inode 2009-01-24 15:18 i see 2009-01-24 15:21 I have some more thinking to do about inode attribute updates 2009-01-24 15:23 yes 2009-01-24 15:24 just about done... 
2009-01-24 15:25 ok, it is simple 2009-01-24 15:25 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-24 15:25 inode attribute updates can reference an inode table block physically 2009-01-24 15:27 yes 2009-01-24 15:30 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-24 15:32 ok, and replay needs to be able to replay splits 2009-01-24 15:33 so there will be a split log entry 2009-01-24 15:34 that says exactly where in the btree node, ileaf or dleaf to do the split, and the physical address of the new block 2009-01-24 15:34 do replaying a split always produces the same results as the original split in cache 2009-01-24 15:34 s/do/so/ 2009-01-24 15:35 originally, I was planning to flush split blocks to disk as part of the delta 2009-01-24 15:35 but just logging the splits will be more efficient and easier to implement 2009-01-24 15:36 ok, I will try to make that work in prototype today 2009-01-24 15:37 it means splited blocks is not write to disk? 2009-01-24 15:38 not until rollup 2009-01-24 15:38 I was originally planning to write them to disk right away 2009-01-24 15:38 thinking that split is fairly rare 2009-01-24 15:39 and that this would simplify replay 2009-01-24 15:39 but actually, it makes things more complicated 2009-01-24 15:39 and it's less efficient than logging 2009-01-24 15:40 um... 2009-01-24 15:40 it also means bnode split is not forked? 2009-01-24 15:41 ah, no 2009-01-24 15:41 um.. 2009-01-24 15:41 redirect? 2009-01-24 15:41 for now there is no forking for volmap blocks 2009-01-24 15:41 yes, it is redirected 2009-01-24 15:41 redirect happens in cache 2009-01-24 15:41 but, redirect is logging? 
2009-01-24 15:42 the redirect is logged 2009-01-24 15:42 if the original block was clean, then both blocks are redirected 2009-01-24 15:42 well 2009-01-24 15:42 sorry 2009-01-24 15:42 I meant, if the original block was clean, it is redirected 2009-01-24 15:43 dirty blocks do not need to be redirected 2009-01-24 15:43 if same dirty delta counter? 2009-01-24 15:44 essentially 2009-01-24 15:45 because we flush all dirty volume blocks on a rollup, a delta counter is not needed for now 2009-01-24 15:45 well, I'm thinking the above means split would like physical (or phylogical) logging 2009-01-24 15:46 my thinking is, replay can reproduce the result of the split exactly 2009-01-24 15:47 that is, the same data in the same volmap buffers 2009-01-24 15:47 and we can add a paranoia checksum to verify if we want 2009-01-24 15:47 just a log the split key? 2009-01-24 15:48 it could be the split key, or it could be the offset in the block 2009-01-24 15:48 offset in block is easier at the physical level 2009-01-24 15:48 well\ 2009-01-24 15:48 for bnode 2009-01-24 15:48 yes 2009-01-24 15:49 for ileaf and dleaf, key is good 2009-01-24 15:49 i see 2009-01-24 15:50 ah, or split on replay arbitrary 2009-01-24 15:51 um... 2009-01-24 15:51 yes 2009-01-24 15:51 in filemap we don't actually split 2009-01-24 15:52 we do a repack 2009-01-24 15:52 repack? 
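The idea of a split log entry that records exactly where the split happened, so that replay rebuilds the same contents in the same blocks, can be sketched on a toy node. Everything here is an assumption for illustration: the toy_* types, the field names, and the fixed-size key array are hypothetical, and the actual tux3 log format is not shown in this log.

```c
#include <stdint.h>
#include <string.h>

#define TOY_MAX_KEYS 8

/* Toy stand-in for a btree node: a counted array of keys. */
struct toy_node {
	unsigned count;
	uint64_t keys[TOY_MAX_KEYS];
};

/* Hypothetical split record: which block split, where, and into which block. */
struct toy_split_log {
	uint64_t old_block;	/* physical address of the node that was split */
	uint64_t new_block;	/* physical address chosen for the new node */
	unsigned at;		/* index of the first key moved to the new node */
};

/*
 * Replay a logged split. Because the split point comes from the log and not
 * from a fresh decision, the result matches the original split in cache.
 * In a real replay, old_block and new_block would select the buffers.
 */
static int toy_replay_split(struct toy_node *old, struct toy_node *new,
			    const struct toy_split_log *log)
{
	if (log->at > old->count)
		return -1;
	new->count = old->count - log->at;
	memcpy(new->keys, old->keys + log->at,
	       new->count * sizeof(old->keys[0]));
	old->count = log->at;
	return 0;
}
```

Recording the new block's address in the entry matters for the reason given later in the discussion: during replay the bitmap is not yet available, so replay cannot allocate and has to be told exactly which block to use.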
2009-01-24 15:52 dwalk_add in a loop 2009-01-24 15:52 ah 2009-01-24 15:54 just a idea for future though 2009-01-24 15:55 replay may not be able to exact same result 2009-01-24 15:55 that's what I'm thinking about right now 2009-01-24 15:55 i see 2009-01-24 15:55 and that's why I originally planned to write the dleaf out when it is changed 2009-01-24 15:57 replay can be optimize image without using it 2009-01-24 15:57 without users of it 2009-01-24 15:58 so, it may be able to do more aggressive optimize than runtime 2009-01-24 15:59 but things will get confused if it does not end up using the same physical blocks 2009-01-24 15:59 replay can move physical location? 2009-01-24 16:00 because there is no users 2009-01-24 16:00 on replay we have not reconstructed the bitmap btree yet, so can't do allocation 2009-01-24 16:01 replay must be told exactly what to do 2009-01-24 16:02 i see 2009-01-24 16:02 well, it would be far future 2009-01-24 16:03 if we want, I guess we can do it 2009-01-24 16:04 anyway, a dleaf better be written out in the delta that changes it, not deferred to rollup 2009-01-24 16:04 ileaf is reasonable to defer 2009-01-24 16:05 btw, why ileaf is defer? 2009-01-24 16:05 I thought it would be more efficient and simpler 2009-01-24 16:06 ileaf has a simple structure with split/merge that can be expected to repeat exactly 2009-01-24 16:08 many inodes creation case is more efficient than logging? 2009-01-24 16:09 btw, this is just a question 2009-01-24 16:10 it's a good point 2009-01-24 16:11 anyway, I am done fiddling with deferred free and list links 2009-01-24 16:11 for now, I guess we will use most simple strategy 2009-01-24 16:11 no more patches that collide with yours ;) 2009-01-24 16:11 yes 2009-01-24 16:11 ok :) 2009-01-24 16:12 and I think the best thing to do is write out ileaf and dleaf in the delta that changes them 2009-01-24 16:12 I will go have a skate and think... 
2009-01-24 16:12 and then implement 2009-01-24 16:12 btw, I'd like to change stash_free to defer_free 2009-01-24 16:12 ok 2009-01-24 16:12 defer_free, yes 2009-01-24 16:12 ok 2009-01-24 16:14 so there will be no promises for ileaf and dleaf 2009-01-24 16:14 we can do that later as an optimization maybe 2009-01-24 16:14 i see, sounds good 2009-01-24 17:44 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-24 18:42 -!- edt(~Ed@dsl-61-49.aei.ca) has joined #tux3 2009-01-24 22:13 -!- kushal(~kushal@115.109.10.118) has joined #tux3 2009-01-24 23:26 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-24 23:40 -!- edt(~Ed@dsl-61-49.aei.ca) has joined #tux3 2009-01-24 23:59 -!- cdk(~chinmay@115.109.10.24) has joined #tux3 2009-01-25 01:18 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2009-01-25 01:21 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2009-01-25 01:22 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2009-01-25 04:26 -!- kushal_(~kushal@121.246.32.157) has joined #tux3 2009-01-25 04:29 -!- RazvanM(~RazvanM@96.234.239.248) has joined #tux3 2009-01-25 04:58 hirofumi: hi? 2009-01-25 05:00 for class, i have to extend the fat-fs (supposed to be easy) to be more energy efficient. This shall be achieved by having a harddisk and a flash disk. All the data is mirrored between both devices, but it suffices if the metadata is only on the flash disk. 
2009-01-25 05:00 so when the hard disk is spun down, the data shall be queried from the flash disk 2009-01-25 05:01 i just noticed that you did some work on the vfat-system, too 2009-01-25 05:02 so i was wondering if you could give me a few pointers as of where to start 2009-01-25 05:54 -!- amey(~amey@121.246.32.157) has joined #tux3 2009-01-25 06:38 -!- gila(~gila@5ED41295.cable.ziggo.nl) has joined #tux3 2009-01-25 07:06 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-25 07:16 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-25 07:21 http://www.microsoft.com/whdc/system/platform/firmware/fatgen.mspx 2009-01-25 07:21 this doc would help to understand fatfs format if you didn't know it 2009-01-25 07:24 and on current git, namei_*.c is handling about filename, and other files are managing the disk data 2009-01-25 07:25 and fatent.c is handling the detail of FileAllocationTable 2009-01-25 07:26 I hope those help to start 2009-01-25 07:49 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-25 08:08 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-25 08:51 hirofumi: thanks 2009-01-25 08:51 i restarted by having a look at "an introduction to the linux kernel" again :) 2009-01-25 09:12 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-25 10:02 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-25 10:50 -!- gila_(~gila@5ED41295.cable.ziggo.nl) has joined #tux3 2009-01-25 10:51 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-25 11:31 -!- cdk(~chinmay@121.246.32.157) has joined #tux3 2009-01-25 11:32 hi flips 2009-01-25 11:32 hi cdk 2009-01-25 11:32 the sync super problem also exists in release directory 2009-01-25 11:33 got a patch? 2009-01-25 11:33 not quite...
2009-01-25 11:33 i just added a sync_super now at the end of the loop 2009-01-25 11:33 very wrong 2009-01-25 11:33 loop? 2009-01-25 11:34 after err = fuse_session_loop(fs); 2009-01-25 11:34 why is that wrong? 2009-01-25 11:35 i mean wrong as u said about the patch yesterday 2009-01-25 11:35 that is the right way 2009-01-25 11:35 my way was wrong 2009-01-25 11:35 oh.. 2009-01-25 11:36 we should remove all sync_super, and put one after the session loop, if (!err) ... 2009-01-25 11:37 btw, why would the offset value in the tux3_read call be > i_size ? 2009-01-25 11:37 anyway, should be: errno = -fuse_session_loop(fs); 2009-01-25 11:38 original author was being paranoid... that test can be removed 2009-01-25 11:39 it's wrong actually 2009-01-25 11:39 should just return a short read 2009-01-25 11:39 which tuxio will take care of 2009-01-25 11:40 a patch for that one would be fine 2009-01-25 11:42 actually...it sort of helped me.. 2009-01-25 11:43 i copied the same file twice on a tux3 partition with our code. 2009-01-25 11:43 video file again 2009-01-25 11:43 the first copy runs fine .. the second copy however ends at that check. 2009-01-25 11:44 checked the mappings they logical to physical mappings .. they are same for both the files 2009-01-25 11:47 the data blocks are over written.. 2009-01-25 11:48 sorry not overwritten 2009-01-25 12:04 ok, cursor redirect is actually a loop 2009-01-25 12:05 loops up from a btree leaf all the way to the root, or the first dirty node 2009-01-25 12:06 fortunately, this will not happen often 2009-01-25 12:20 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-25 12:24 hirofumi, there? 2009-01-25 12:24 yes 2009-01-25 12:25 how close are you to a pull for the volmap and delalloc patches? 2009-01-25 12:26 if target is linus git, it would be done with adding comment 2009-01-25 12:26 let's make linus git the target then 2009-01-25 12:26 cdk, there? 
2009-01-25 12:27 yes 2009-01-25 12:27 we're planning to break things on trunk for a while 2009-01-25 12:27 and rebase to current linus tree 2009-01-25 12:28 I am thinking that you might want to continue working with the current version, to avoid disruption 2009-01-25 12:28 yes 2009-01-25 12:29 since you're working all in user space now, it won't be too bad 2009-01-25 12:29 yes 2009-01-25 12:30 anyway, hirofumi's new patches will not change userspace in any way that affects you 2009-01-25 12:30 ok 2009-01-25 12:31 you probably want to get the fix to tux3fuse in today 2009-01-25 12:31 removing all sync_supers and adding them directly at the end? 2009-01-25 12:32 I think that is the right thing to do 2009-01-25 12:32 ok 2009-01-25 13:36 flips, putting the sync_super after the loop does not seem to help for a few cases. 2009-01-25 13:37 you mean, some tuxcloses still happen after the sync super? 2009-01-25 13:37 if i write a file and delete it .. 2009-01-25 13:37 without umount ? 2009-01-25 13:39 until the sync_super, the disk image will not be consistent, if that is what you mean 2009-01-25 13:40 yes 2009-01-25 13:40 just a min 2009-01-25 13:46 i am getting a failed assertion for dleaf_merge dleaf_group (leaf) >= 1 2009-01-25 13:47 when i copy a file and delete it without umount 2009-01-25 13:47 and you didn't get that without the sync_super change? 2009-01-25 13:48 i think .. let me check again 2009-01-25 13:50 -!- amey(~amey@121.246.32.157) has joined #tux3 2009-01-25 13:51 same problem without the change ... 2009-01-25 13:51 good :) 2009-01-25 13:52 ok, then just send in the change and then we will work on the delete problem 2009-01-25 13:52 interesting bug 2009-01-25 13:52 it seems hit empty leaf 2009-01-25 13:52 um... 2009-01-25 13:53 happens on any delete? 2009-01-25 13:53 no only for bigger files...didnt cause a problem for a single block file 2009-01-25 13:54 ok, so it not a completely embarrassing bug ;) 2009-01-25 13:54 how big is the files? 
2009-01-25 13:55 2 mb 2009-01-25 13:55 oh, small enough :) 2009-01-25 13:55 i am copying /boot/vmlinuz 2009-01-25 13:55 it sounds like userland only problem 2009-01-25 13:56 I'm testing with 256M file usually 2009-01-25 13:56 the userspace code did not get much testing until chinmay's group started working 2009-01-25 13:56 yes 2009-01-25 13:56 :) 2009-01-25 13:57 well, I used tux3 command for testing before though 2009-01-25 13:58 cdk, does it work if you umount before the delete? 2009-01-25 13:58 no 2009-01-25 13:58 does "tux3 delete" work? 2009-01-25 14:00 um..., it may related to buffer invalidation 2009-01-25 14:00 in other news: cursor_redirect, which now loops across the whole cursor, ran without segfaulting 2009-01-25 14:00 same error 2009-01-25 14:01 oh 2009-01-25 14:02 well that makes it easy to chase 2009-01-25 14:26 flips, should i send the patch or wait for delete problem? 2009-01-25 14:27 amey, send in, it doesn't affect the delete problem 2009-01-25 14:27 ok 2009-01-25 14:27 will send in few mins.. 2009-01-25 14:48 static-http://userweb.kernel.org/~hirofumi/tux3/ <- reading 2009-01-25 14:48 thanks, for now it just for review 2009-01-25 14:48 actually, it can be pulled 2009-01-25 14:49 however, we may want to linus git before (I don't need though) 2009-01-25 14:49 if ((err = -fuse_session_loop(fs))) 2009-01-25 14:49 goto eek; 2009-01-25 14:49 if ((errno = -sync_super(sb))) 2009-01-25 14:49 goto eek; 2009-01-25 14:49 this ok? 2009-01-25 14:51 should be errno in both places 2009-01-25 14:51 and return !errno 2009-01-25 14:52 k 2009-01-25 14:55 - return err ? 1 : 0; 2009-01-25 14:55 + eek: 2009-01-25 14:55 + warn("Eek! %s", strerror(errno)); 2009-01-25 14:55 + return !errno; 2009-01-25 14:55 this ok? 
2009-01-25 14:55 good 2009-01-25 14:56 I feel like fuse is getting some care and attention now :) 2009-01-25 14:56 :) 2009-01-25 14:56 :) 2009-01-25 15:00 Docuemnting for current locking <- hirofumi, a typo 2009-01-25 15:00 thanks 2009-01-25 15:00 typical english would be: Document current locking 2009-01-25 15:02 i see, fixed 2009-01-25 15:03 hirofumi, with volmap what flushes the volume now, sync_inode_pages ? 2009-01-25 15:04 yes 2009-01-25 15:04 that's kind of cool 2009-01-25 15:04 and directly ->writepage 2009-01-25 15:04 right 2009-01-25 15:05 btw, -fuse_session_loop()? <- "-" 2009-01-25 15:05 I'm not sure it is from our fs, or fuse returns it 2009-01-25 15:05 though 2009-01-25 15:05 I read in the fuse docs they return -err 2009-01-25 15:06 i see 2009-01-25 15:06 like we do, down low 2009-01-25 15:06 sounds like really strange 2009-01-25 15:06 I think, near the top level of a libc-style program, it is ok to use errno 2009-01-25 15:06 what is really strange, using -err? 2009-01-25 15:06 yes 2009-01-25 15:07 fuse errno for replay, and -errno for session_loop 2009-01-25 15:07 fuse uses 2009-01-25 15:07 errno for replay? 2009-01-25 15:08 e.g. 
fuse_reply_err(req, ENOSYS); 2009-01-25 15:08 arguably, ERR_PTR should take positive err and make it negative 2009-01-25 15:08 there is some sort of rule 2009-01-25 15:08 it's kind of fuzzy 2009-01-25 15:09 err number is basically positive, except when passed by a function 2009-01-25 15:09 returned by a function I meant 2009-01-25 15:09 yes 2009-01-25 15:09 but, in kernel, it uses always -errno 2009-01-25 15:10 fuse mixed it 2009-01-25 15:10 I think, even in kernel if an err number is passed to a function, it should be passed positive 2009-01-25 15:10 but ERR_PTR violates that 2009-01-25 15:11 there is a lot of confusion in kernel about that 2009-01-25 15:11 somebody did an audit a while ago and found lots of sign problems in error paths 2009-01-25 15:12 lots of broken error paths that have obviously never been tested 2009-01-25 15:12 yes 2009-01-25 15:12 but, it is just bug 2009-01-25 15:12 if mixed it, I guess we will have more bugs 2009-01-25 15:13 so for example, one writes fuse_reply_err(req, PTR_ERR(pointer)), and it is correct 2009-01-25 15:13 PTR_ERR returns positive error (I hope) 2009-01-25 15:13 let's see 2009-01-25 15:13 PTR_ERR returns negative 2009-01-25 15:13 patch sent 2009-01-25 15:14 hirofumi, ok you are right 2009-01-25 15:14 so it has to be fuse_reply_err(req, -PTR_ERR(pointer)) 2009-01-25 15:15 yes 2009-01-25 15:15 but I think the problem is with ERR_PTR, not fuse 2009-01-25 15:15 yes, it is our code 2009-01-25 15:15 copied from kernel 2009-01-25 15:15 no problem (or just our problem) 2009-01-25 15:16 however, fuse_session_loop() uses negative error 2009-01-25 15:16 I'm assuming it is fuse error code 2009-01-25 15:16 returns a negative error 2009-01-25 15:16 ah right 2009-01-25 15:16 let's see if it is 2009-01-25 15:17 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-25 15:18 Returns: 2009-01-25 15:18 0 on success, -1 on error 2009-01-25 15:18 i see 2009-01-25 15:18 
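The sign conventions argued over above can be sketched in C. This is only an illustration, assuming kernel-style ERR_PTR/PTR_ERR helpers like the ones the log says were copied from the kernel; reply_err(), lookup_fail() and reply_for() are hypothetical names, with reply_err() standing in for a reply call that wants a positive errno.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Kernel-style helpers: an error travels inside a pointer as a NEGATIVE errno. */
static inline void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
	/* The top 4095 addresses are reserved for encoded errors. */
	return (uintptr_t)ptr >= (uintptr_t)-4095;
}

/* Hypothetical stand-in for a reply function that expects a POSITIVE errno. */
static int reply_err(int err)
{
	assert(err >= 0);
	return err;
}

/* Hypothetical lookup that fails kernel-style with -ENOENT. */
static void *lookup_fail(void)
{
	return ERR_PTR(-ENOENT);
}

/* Convert at the boundary: negate PTR_ERR before handing it to the reply. */
static int reply_for(void *ptr)
{
	if (IS_ERR(ptr))
		return reply_err((int)-PTR_ERR(ptr));
	return reply_err(0);
}
```

Keeping the negation in one place (reply_for here) is one way to avoid the mixed-sign bugs mentioned in the discussion; a call that only reports 0 or -1 would need a plain "< 0" check instead.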
http://fuse.sourcearchive.com/documentation/2.7.4/fuse__lowlevel_8h_5f1e538aa3287e251afbe985438c4249.html 2009-01-25 15:18 and is errno set? 2009-01-25 15:18 probably 2009-01-25 15:18 it seems work like sysccall 2009-01-25 15:20 it seems to lose errno 2009-01-25 15:20 http://fuse.sourcearchive.com/documentation/2.7.4/fuse__loop_8c-source.html 2009-01-25 15:21 or leave errno as is 2009-01-25 15:22 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-25 15:25 it's a confused interface 2009-01-25 15:26 yes 2009-01-25 15:26 anyway, -fuse_seesion_loop is incorrect 2009-01-25 15:27 yes, not completely though 2009-01-25 15:27 it would be if (fuse_session_loop() < 0) 2009-01-25 15:27 then what? 2009-01-25 15:27 or if ((err = fuse_session_loop())) 2009-01-25 15:28 I like the first one more 2009-01-25 15:28 f (fuse_session_loop() < 0) errno = ENOIDEA :) 2009-01-25 15:28 -!- amey(~amey@121.246.32.157) has joined #tux3 2009-01-25 15:28 then, it will run termination 2009-01-25 15:29 ah, so if error, skip sync_super() 2009-01-25 15:30 because error may be occured before mount 2009-01-25 15:30 error can be occured 2009-01-25 15:31 -!- amey(~amey@121.246.32.157) has joined #tux3 2009-01-25 15:31 anyway, it should be int err = 1; 2009-01-25 15:32 and err = fuse_session_loop(fs) < 0; 2009-01-25 15:32 yes 2009-01-25 15:33 I found tux3_destroy() 2009-01-25 15:34 I am ready to pull your latest, I think it is time 2009-01-25 15:35 ok, I'll push fixed repo 2009-01-25 15:35 btw, tux3_destory() may be able to use to call sync_super() 2009-01-25 15:36 and don't care about fuse_session_loop 2009-01-25 15:37 ah 2009-01-25 15:38 destroy does not sound like a sync :) 2009-01-25 15:38 yes :) 2009-01-25 15:39 however, maybe better off than after fuse_session_loop() 2009-01-25 15:39 it also could be good to make valgrind happy if we free some resources there 2009-01-25 15:40 ready for a pull? 
2009-01-25 15:40 yes, I've pushed 2009-01-25 15:40 yes 2009-01-25 15:42 and I won't try building on 2.6.26.5, my next kernel buiild will be on linus current 2009-01-25 15:43 I feel like this is the more important pull so far 2009-01-25 15:44 yes, and it will break older kernel 2009-01-25 15:44 I will make a list post about that 2009-01-25 15:44 thanks 2009-01-25 15:46 fuse seems not to send request to flush buffers to userland 2009-01-25 15:47 so, userland will need some king of flusher 2009-01-25 15:47 probably, with atomic commit 2009-01-25 15:48 yes 2009-01-25 15:48 exactly 2009-01-25 15:50 cursor_redirect: redirect block 59 to 5c 2009-01-25 15:50 cursor_redirect: update parent 2009-01-25 15:50 balloc: -> 5c 2009-01-25 15:50 cursor_redirect: redirect block 5a to 5d 2009-01-25 15:50 cursor_redirect: update parent 2009-01-25 15:50 balloc: -> 5d 2009-01-25 15:50 cursor_redirect: redirect block 5b to 5e 2009-01-25 15:51 cursor_redirect: redirect root 2009-01-25 15:51 hmm, why does the balloc message come out after the redirect 2009-01-25 15:52 it calls balloc the above order? 
2009-01-25 15:53 5c should have been allocated before the first redirect message 2009-01-25 15:53 sure 2009-01-25 15:54 maybe, balloc: is not from balloc 2009-01-25 15:54 no, it's balloc-dummy 2009-01-25 15:54 ah 2009-01-25 15:54 I mean "yes" (japanese) 2009-01-25 15:55 well, dummy also have balloc 2009-01-25 15:55 it's not quite right 2009-01-25 15:56 it should return the block before incrementing 2009-01-25 15:56 I probably wrote that ;) 2009-01-25 15:57 yes, I never write that as online :) 2009-01-25 15:58 one line 2009-01-25 15:58 balloc: -> b7 2009-01-25 15:58 cursor_redirect: redirect block b1 to b7 2009-01-25 15:58 cursor_redirect: update parent 2009-01-25 15:58 balloc: -> b9 2009-01-25 15:58 cursor_redirect: redirect block b3 to b9 2009-01-25 15:58 cursor_redirect: update parent 2009-01-25 15:58 balloc: -> bb 2009-01-25 15:58 cursor_redirect: redirect block b5 to bb 2009-01-25 15:58 cursor_redirect: redirect root 2009-01-25 15:58 I will post this patch pretty soon 2009-01-25 15:58 after a skate 2009-01-25 15:58 ok 2009-01-25 15:58 or now 2009-01-25 15:59 just a sec 2009-01-25 16:05 some sloppy things in that patch 2009-01-25 16:05 but it is the heart of atomic commit, more or less working 2009-01-25 16:05 ok 2009-01-25 16:15 it's really "redirect_leaf" 2009-01-25 16:16 it's interesting how this code designs itself 2009-01-25 16:17 the btree cursor basically defined the algorithm 2009-01-25 16:50 i see 2009-01-25 16:50 looks good to me, still quick look though 2009-01-25 16:50 *block = sb->nextalloc += blocks; 2009-01-25 16:51 except this :) 2009-01-25 16:51 I thought I changed that 2009-01-25 16:51 oh 2009-01-25 16:51 :) 2009-01-25 16:51 mistake 2009-01-25 16:52 btw, why is clone buffer dirty? 2009-01-25 16:53 so that a redirect applied to it will not redirect again 2009-01-25 16:54 until it is cleaned 2009-01-25 16:54 i see 2009-01-25 16:54 the principle is: all clean volmap blocks that will be changed must be redirected 2009-01-25 16:57 it is update part? 
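The balloc-dummy one-liner quoted above (*block = sb->nextalloc += blocks;) hands back the address after the bump rather than the first free block. A minimal sketch of the difference, assuming a stripped-down superblock with only an allocation cursor; struct sb_sketch and both function names are hypothetical, not the real tux3 signatures.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t block_t;

/* Hypothetical superblock: just the next-free-block cursor. */
struct sb_sketch {
	block_t nextalloc;
};

/* Form quoted in the log: the caller receives the value AFTER the increment. */
static int balloc_buggy(struct sb_sketch *sb, unsigned blocks, block_t *block)
{
	*block = sb->nextalloc += blocks;
	return 0;
}

/* Return the first free block, then advance the cursor past the extent. */
static int balloc_fixed(struct sb_sketch *sb, unsigned blocks, block_t *block)
{
	*block = sb->nextalloc;
	sb->nextalloc += blocks;
	return 0;
}
```

Writing it as two statements, as hirofumi prefers, makes the order of the read and the increment explicit.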
2009-01-25 16:57 um... 2009-01-25 16:57 because caller will modify level buffer 2009-01-25 16:58 buffer of passed level 2009-01-25 16:58 I think the level parameter may not be needed 2009-01-25 16:58 it always redirects from leaf 2009-01-25 16:58 tree_chop has one special case 2009-01-25 16:58 it modifies a key higher up in the tree 2009-01-25 16:59 I think that is ok, it's handled 2009-01-25 16:59 because cursor_redirect makes every block in the path dirty 2009-01-25 17:00 this handles splits too, by redirecting the source block, and the new block for the split does not need to be redirected 2009-01-25 17:04 the only thing that has to be added to the index node split/merge code is a log record 2009-01-25 17:05 ACTION will be out for a while 2009-01-25 18:08 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-01-25 19:14 I think I noticed why I had felt strage 2009-01-25 19:14 I guess balloc() and blockget() and mark_buffer_dirty() should be new_block() 2009-01-25 19:16 and buffer initialization should be done by caller 2009-01-25 19:52 if not, I think it would mean it makes the special buffer 2009-01-25 19:57 anyway, cleanup patch is in static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-01-25 20:02 and redirect is after modify, is it right? 2009-01-25 20:03 in update parent, modify entry->block, then redirect it 2009-01-25 20:15 hirofumi, still there? 
2009-01-25 20:16 you're right 2009-01-25 20:16 it's just new_block 2009-01-25 20:17 new_block does not need initialization, true 2009-01-25 20:44 redirect is before modify 2009-01-25 20:45 only in that one case, the modify is before direct 2009-01-25 20:45 it works, but it's a little ugly 2009-01-25 20:46 in that case, the buffer is modified but left clean 2009-01-25 21:40 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-25 23:40 -!- edt(~Ed@dsl-61-49.aei.ca) has joined #tux3 2009-01-26 00:05 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-26 00:38 -!- RazvanM(~RazvanM@96.234.239.248) has joined #tux3 2009-01-26 02:46 -!- pgquiles(~pgquiles@54.Red-79-148-70.dynamicIP.rima-tde.net) has joined #tux3 2009-01-26 08:04 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-01-26 08:07 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-26 09:09 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-26 10:21 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-26 10:36 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-26 11:03 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-26 11:04 um..., if bnode was modified before redirect, both of orignal and cloned buffers points modified buffers 2009-01-26 11:20 -!- cdk(~chinmay@121.246.35.224) has joined #tux3 2009-01-26 11:22 hi flips 2009-01-26 11:33 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-26 11:34 hi cdk 2009-01-26 11:35 trying to get mkdir to work in fuse 2009-01-26 11:35 hirofumi, a bnode is only redirected if it is clean 2009-01-26 11:36 yes 2009-01-26 11:37 -!- gaurav(~gaurav@121.246.35.224) has joined #tux3 2009-01-26 11:37 well, anyway, if modify before redirect, I think it doesn't save same data when buffer was clean 2009-01-26 11:37 hi flips 2009-01-26 
11:38 hirofumi, I don't think it happens 2009-01-26 11:39 gaurav, hi 2009-01-26 11:39 can you please review if our approach is right 2009-01-26 11:39 http://paste2.org/p/136311 2009-01-26 11:39 in that patch, the code does, A level clone -> modify A-1 level parent -> A-1 level clone 2009-01-26 11:40 or should i put the code here itself 2009-01-26 11:41 "modify A-1 level parent" meant modify A-1 level buffer 2009-01-26 11:41 hirofumi, are you worried about this: * Note: this may change a clean buffer which is then copied 2009-01-26 11:43 gaurav, it looks good to me, does it work? 2009-01-26 11:43 ah, yes 2009-01-26 11:43 probably 2009-01-26 11:44 no :) 2009-01-26 11:44 gaurav, shapor will be tux3 fuse maintainer 2009-01-26 11:45 to send patches to and get programming advice 2009-01-26 11:45 ok. 2009-01-26 11:45 so please post the patch, describe the problem, and cc shapor 2009-01-26 11:46 ok. will do. 2009-01-26 11:46 btw, if it can share with tux3 command, it would be good 2009-01-26 11:47 well, if it's not needed to share, this is just noise :) 2009-01-26 11:48 hirofumi, very soon 2009-01-26 11:48 it will be in tux3 command 2009-01-26 11:49 good, btw, we have nice feature from shareing kernel 2009-01-26 11:49 tux_new_inode() cares about ->i_nlink already 2009-01-26 11:50 well, caller should care about parent ->i_nlink though 2009-01-26 11:51 anyway, the buffer that is modified before copy is left clean and discarded 2009-01-26 11:52 that is, it is not marked dirty 2009-01-26 11:52 it is sloppy, but harmless 2009-01-26 11:52 the loop could be written better to avoid this 2009-01-26 11:52 I though it is harmness 2009-01-26 11:52 I think it is harmless 2009-01-26 11:53 I think bitmap referencing old buffers from old root 2009-01-26 11:53 because balloc() is called on middle of btree modify 2009-01-26 11:55 so, if it was already modified, bitmap btree looks like strange 2009-01-26 11:55 I'm not sure it has problem or not though 2009-01-26 11:55 -!- 
tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-26 11:59 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-26 12:02 I don't think it's a problem, and if it is, it is not hard to fix 2009-01-26 12:04 in cursor_redirect, it's not a problem immidiately 2009-01-26 12:04 but, I was thinking about btree split 2009-01-26 12:05 I don't see a problem in btree split... yet 2009-01-26 12:05 it also calls balloc(), and it splits bnode 2009-01-26 12:05 are you thinking about if the btree is the bitmap btree? 2009-01-26 12:06 yes 2009-01-26 12:06 modifying btree is bitmap btree 2009-01-26 12:06 the balloc will do a fork_buffer in cache 2009-01-26 12:07 and btree is cursor redirect? 2009-01-26 12:07 yes 2009-01-26 12:07 so, I think the problem occurs 2009-01-26 12:08 in split and merge path 2009-01-26 12:08 ah, merge may not be occured for bitmap 2009-01-26 12:08 well, so, split path 2009-01-26 12:09 ok 2009-01-26 12:09 in stage_delta? 
2009-01-26 12:09 yes 2009-01-26 12:10 it would not be cursor_redirect() problem actually 2009-01-26 12:10 however, I was expecting cursor_redirect() would solve this problem 2009-01-26 12:11 I don't see the problem yet 2009-01-26 12:12 ok 2009-01-26 12:12 please see insert_leaf() 2009-01-26 12:12 it splits bnode to insert new leaf 2009-01-26 12:13 yes 2009-01-26 12:13 and it calls balloc() via new_node() 2009-01-26 12:13 yes 2009-01-26 12:13 that balloc will not change the bitmap btree, only the bitmap block in page cache 2009-01-26 12:13 if balloc() was called, bnode may already be splited 2009-01-26 12:13 yes 2009-01-26 12:14 but, it uses btree to read buffers 2009-01-26 12:14 ah 2009-01-26 12:14 :) 2009-01-26 12:15 you are right, I did not think about that 2009-01-26 12:15 so, I was expecting cursor_redirect() solves this problem cleanly 2009-01-26 12:16 It should, but maybe there is a problem if I change the parent before copy, like you said 2009-01-26 12:16 yes 2009-01-26 12:16 probably 2009-01-26 12:17 well, new_block() and this are points I noticed on review 2009-01-26 12:19 -!- gila(~gila@5ED41295.cable.ziggo.nl) has joined #tux3 2009-01-26 12:19 I will think about the recursion while I make my coffee 2009-01-26 12:19 ok, thanks 2009-01-26 12:21 ah, btw, I've noticed we can use level_replace_brelse() to replace buffer in cursor_redirect() 2009-01-26 12:25 I should commit it pretty soon so you can do patches 2009-01-26 12:25 ok 2009-01-26 12:30 I wonder if I should try to add it so it does not actually do anything 2009-01-26 12:30 ok 2009-01-26 12:30 I will just fix up the unit test a little, and commit 2009-01-26 12:30 it currently is not used 2009-01-26 12:31 yes, good 2009-01-26 12:33 -!- tim_dimm_(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-26 12:37 ok, pushed 2009-01-26 12:37 makes my diff small again 2009-01-26 12:38 ok 2009-01-26 12:42 getting linus's tree now 2009-01-26 12:43 goot 2009-01-26 12:43 verified we broke 
2.6.26.5 2009-01-26 12:43 fs/tux3/filemap.c: In function 'blockget': 2009-01-26 12:43 fs/tux3/filemap.c:471: error: 'AOP_FLAG_NOFS' undeclared (first use in this function) 2009-01-26 12:43 "git -s -l" would help 2009-01-26 12:43 yes 2009-01-26 12:44 it was "git clone -s -l" 2009-01-26 12:45 figured that out, now reading the man page 2009-01-26 12:45 oh, I'm cloning over the net right now 2009-01-26 12:46 oh, it would slow 2009-01-26 12:46 -l... "This is now the default when the source repository is specified with /path/to/repo syntax" 2009-01-26 12:46 yes 2009-01-26 12:50 git clone --bare -s -l /pub/scm/linux/kernel/git/torvalds/linux-2.6.git tux3-2.6.git 2009-01-26 12:50 it would be done without any copy 2009-01-26 12:50 oh, you mean for the kernel.org repo 2009-01-26 12:50 yes 2009-01-26 12:50 I'm just doing local for now 2009-01-26 12:50 ah 2009-01-26 12:50 but I will save the command 2009-01-26 13:08 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-26 14:02 linus/.git is now 293 MB 2009-01-26 14:02 less than one full checked out and built tree 2009-01-26 14:04 make defconfig ARCH=um && make linux ARCH=um 2009-01-26 14:04 I will post a howto for moving to linus current 2009-01-26 14:04 yes, and full .git is needed only one per machine 2009-01-26 14:05 actually, I'm using 3 or 4 source tree from one parent 2009-01-26 14:06 it's nice 2009-01-26 14:07 btw, - cursor->path[level].next += bufdata(clone) - bufdata(buffer); 2009-01-26 14:08 it may be bug, because next is not (void *) 2009-01-26 14:08 I'll fix it 2009-01-26 14:08 whoops 2009-01-26 14:08 it probably works, but it is fragile 2009-01-26 14:09 depends on buffer->data alignment 2009-01-26 14:09 oh no 2009-01-26 14:09 it's really broken 2009-01-26 14:09 how did it work? 
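Why the "next += bufdata(clone) - bufdata(buffer)" adjustment above is fragile: in C, adding an integer to a typed pointer scales it by the element size, so a byte difference added to a non-void pointer moves it too far. A sketch of a byte-offset approach, assuming a hypothetical 16-byte struct entry and a hypothetical helper rebase(); the real tux3 types and helpers may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical index entry; stands in for whatever `next` points at. */
struct entry {
	unsigned long long key;
	unsigned long long block;
};

/*
 * Copy a node into `clone` and return the position in the copy that
 * corresponds to `next` in the original. The offset is computed and
 * applied in bytes, so it does not depend on sizeof(struct entry).
 */
static struct entry *rebase(void *clone, const void *orig, size_t size,
			    const struct entry *next)
{
	size_t offset = (size_t)((const char *)next - (const char *)orig);

	assert(offset <= size);
	memcpy(clone, orig, size);
	return (struct entry *)((char *)clone + offset);
}
```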
2009-01-26 14:09 I didn't try it, I found before testing 2009-01-26 14:09 +static void level_redirect(struct cursor *cursor, int level, struct buffer_head *clone) 2009-01-26 14:09 +{ 2009-01-26 14:09 + struct buffer_head *buffer = cursor->path[level].buffer; 2009-01-26 14:09 + unsigned offset = (void *)cursor->path[level].next - bufdata(buffer); 2009-01-26 14:09 + memcpy(bufdata(clone), bufdata(buffer), bufsize(clone)); 2009-01-26 14:09 (void *)cursor->path[level].next += bufdata(clone) - bufdata(buffer); 2009-01-26 14:09 + level_replace_brelse(cursor, level, clone, bufdata(clone) + offset); 2009-01-26 14:09 +} 2009-01-26 14:09 + 2009-01-26 14:10 sure 2009-01-26 14:10 well, I finally tried the above 2009-01-26 14:10 it seems it can be cursor operation like the above 2009-01-26 14:11 it makes it easier to understand 2009-01-26 14:11 but it only will be called from one place 2009-01-26 14:11 probably, yes 2009-01-26 14:12 anyway, easier to understand is good 2009-01-26 14:12 hope newer gcc optimize it if needed 2009-01-26 14:13 well, we can use inline though 2009-01-26 14:13 and performance isn't really critical here 2009-01-26 14:14 yes 2009-01-26 14:14 but, it seems you are good for it 2009-01-26 14:14 it looks pretty good 2009-01-26 14:15 I was surprised about fifo order link 2009-01-26 14:15 our big win is doing copy on write without reading from the disk in most cases 2009-01-26 14:15 btrfs submits a bio to read for the COW 2009-01-26 14:15 it was not haveing any branch insn before it 2009-01-26 14:15 insm? 2009-01-26 14:15 instruction 2009-01-26 14:16 heh 2009-01-26 14:17 the cost of the spinlock will totally dominate the fifo link, with or without a branch ;) 2009-01-26 14:17 maybe it is nice for a cell phone 2009-01-26 14:18 well, yes :) 2009-01-26 14:18 however, I couldn't find any other way without branch insn at least for me 2009-01-26 14:19 ok, the next round of changes will put the cursor_redirect in the four places I mentioned... 
we probably want to enable that on an ifdef for now 2009-01-26 14:20 good 2009-01-26 14:22 initial testing will verify that we get the right blocks on the delta write list, and the right log entries 2009-01-26 14:22 speaking of delta write list... we own b_assoc_buffers now, right? I want to use that to link together buffers for delta writeout 2009-01-26 14:23 yes 2009-01-26 14:24 all buffer should be free to use that list_head 2009-01-26 14:24 all our buffers 2009-01-26 14:27 fs/built-in.o: In function `debugfs_create_size_t': 2009-01-26 14:27 include/linux/debugfs.h:168: multiple definition of `debugfs_create_size_t' 2009-01-26 14:27 kernel/built-in.o:include/linux/debugfs.h:168: first defined here 2009-01-26 14:27 make: *** [vmlinux.o] Error 1 <- uml build is broken in linus git 2009-01-26 14:28 I will try building from a tagged version 2009-01-26 14:28 it can be offten 2009-01-26 14:32 uml was built for me 2009-01-26 14:33 which revision? 2009-01-26 14:33 I think lastest 2009-01-26 14:34 f3b8436ad9a8ad36b3c9fa1fe030c7f38e5d3d0b 2009-01-26 14:34 make defconfig ARCH=um && make linux ARCH=um ? 2009-01-26 14:34 make ARCH=um defconfig && make ARCH=um 2009-01-26 14:34 I'm trying SUBARCH=i386 now 2009-01-26 14:35 right, same 2009-01-26 14:35 we get to debug tux3 and linus git 2009-01-26 14:36 just like our dedup developers get to debug dedup and tux3 2009-01-26 14:36 yes 2009-01-26 14:37 well, if we have git tree, we don't need to update often 2009-01-26 14:41 it got further this time: 2009-01-26 14:41 LD .tmp_vmlinux1 2009-01-26 14:41 arch/um/sys-i386/built-in.o: In function `sys_call_table': 2009-01-26 14:41 (.rodata+0x308): undefined reference to `sys_sigprocmask' 2009-01-26 14:41 collect2: ld returned 1 exit status 2009-01-26 14:41 maybe try a distclean and compile again 2009-01-26 14:44 if that doesn't work I will try the 2.6.28 tag 2009-01-26 14:44 it seems recent syscall wrapper breaks something 2009-01-26 14:45 buffer delay is in 2.6.28, right? 
2009-01-26 14:45 iirc, 2.6.28 is ok 2009-01-26 14:45 that's close enought to tip for me 2009-01-26 14:54 built, now with tux3 2009-01-26 14:55 rsync -t /src/tux3/kernel/* fs/tux3/ && make linux ARCH=um CONFIG_TUX3=y 2009-01-26 14:56 ok, workaround is #define sys_sigprocmask sys_kernel_sigprocmask 2009-01-26 14:57 add it to arch/um/sys-i386/sys_call_table.S 2009-01-26 14:57 thanks for the patch :) 2009-01-26 14:58 uml and syscall wrapper does strage things 2009-01-26 14:59 well, uml is, -Dsigprocmask=kernel_sigprocmask for kernel 2009-01-26 15:02 maybe 2.6.28 does not have AOP_FLAG_NOFS? 2009-01-26 15:04 came in at 2.6.28.1 2009-01-26 15:05 yes 2009-01-26 15:05 it was introduced after 2.6.29-rc1 2009-01-26 15:11 hey flips 2009-01-26 15:11 hi bh 2009-01-26 15:12 sk8 oclock 2009-01-26 15:36 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2009-01-26 15:41 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2009-01-26 18:17 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-26 18:21 linus git v2.6.29-rc1 is his only tagged version that has AOP_FLAG_NOFS and does not have the missing UML sigprocmask symbol 2009-01-26 22:30 cursor redirect in tree_chop has a conceptual difficulty: we don't know if a particular leaf should be redirected until calling the leaf chop method, and by that time the leaf has been changed 2009-01-26 22:30 leaf methods also don't know anything about btrees or cursors 2009-01-26 22:32 and as a third difficulty, multiple leaf nodes might need to be redirected by the chop, so there has to be something like an 'advance and redirect" 2009-01-26 22:35 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-01-26 22:37 can't we redirect whole for chop? 2009-01-26 22:38 probably can 2009-01-26 22:38 i.e. optimize it with interface rethink 2009-01-26 22:39 interface rethink is needed 2009-01-26 22:40 it should be before atomic commit, or after? 
2009-01-26 22:42 it can be after 2009-01-26 22:42 I think it's good 2009-01-26 22:42 I guess interface can rethink with atomic commit 2009-01-26 22:42 can rethink including atomic commit 2009-01-26 22:42 ok, now we needs something like make ATOMIC=1 2009-01-26 22:43 i see 2009-01-26 22:43 ok 2009-01-26 22:43 which appends -DATOMIC to CFLAGS 2009-01-26 22:44 and I write a make conditional outside a make rule? 2009-01-26 22:44 s/and/can/ 2009-01-26 22:44 yes 2009-01-26 22:44 ok, easy then 2009-01-26 22:44 or we can just use UCFLAGS=-DATOMIC 2009-01-26 22:45 let's try that 2009-01-26 22:46 works fine 2009-01-26 22:46 good 2009-01-26 22:47 ok, I will check in the cursor_redirects so far... it is disabled by #ifdef ATOMIC 2009-01-26 22:47 or maybe I should post first 2009-01-26 22:48 lets post first 2009-01-26 22:54 I'm going to think about atomic commit for bitmap more or less 2009-01-26 22:55 it's a good test case 2009-01-26 22:55 I guess it is most complex file 2009-01-26 22:55 initially, metadata blocks will always all be clean after a flush 2009-01-26 22:56 that is, when we only have one delta in the pipeline at a time 2009-01-26 22:56 ah, probably yes 2009-01-26 22:56 I'm not sure about bitmap yet though 2009-01-26 22:57 I guess it can make dirty by flush 2009-01-26 22:59 because bitmap would prepare for next delta 2009-01-26 22:59 yes, that is ok 2009-01-26 22:59 and expected 2009-01-26 22:59 yes 2009-01-26 23:00 ok, patch is posted for comment 2009-01-26 23:03 we already comment on it ;) 2009-01-26 23:07 ah 2009-01-26 23:07 #ifdef ATOMIC 2009-01-26 23:07 cursor_redirect 2009-01-26 23:07 #endif 2009-01-26 23:07 I guess 2009-01-26 23:07 cursor_redirect() { #ifdef ... #endif return 0; } is easy 2009-01-26 23:08 ok 2009-01-26 23:09 int cursor_redirect(struct cursor *cursor) 2009-01-26 23:09 { 2009-01-26 23:09 #ifndef ATOMIC 2009-01-26 23:09 return 0; 2009-01-26 23:09 #endif 2009-01-26 23:09 ok? 
2009-01-26 23:09 ah, I meant 2009-01-26 23:09 real cursor_redirect() 2009-01-26 23:09 { 2009-01-26 23:10 #ifdef ATOMIC 2009-01-26 23:10 2009-01-26 23:10 #endif 2009-01-26 23:10 return 0; 2009-01-26 23:10 } 2009-01-26 23:10 ugh 2009-01-26 23:10 #ifdef ATOMIC 2009-01-26 23:10 #else 2009-01-26 23:10 return 0; 2009-01-26 23:10 #endif 2009-01-26 23:10 so, it shares cursor_redirect define 2009-01-26 23:11 slightly less ugly 2009-01-26 23:11 ah 2009-01-26 23:11 the above is top of real cursor_redirect 2009-01-26 23:11 I mean, yours is slightly less ugly 2009-01-26 23:11 new patch coming in a minute 2009-01-26 23:12 I misunderstand the above is now function for non-ATOMIC 2009-01-26 23:12 new function 2009-01-26 23:12 it's better to have fewer #ifdefs 2009-01-26 23:12 yes, #ifndef ATOMIC looks fine 2009-01-26 23:14 yes, that's better because it compiles the code 2009-01-26 23:14 i see, good 2009-01-26 23:18 the patch also fixes the tree_expand parameter list, there are a few other places we can get rid of a btree parameter 2009-01-26 23:18 advance for example 2009-01-26 23:18 separate patch 2009-01-26 23:18 I will check this one in 2009-01-26 23:19 yes, good 2009-01-26 23:21 the leaf truncate/delete methods really need to be able to call cursor_redirect at the point they decide to dirty the leaf 2009-01-26 23:22 the only real reason these functions do not take cursor parameters is so they can be unit tested without compiling all the btree code 2009-01-26 23:24 tree_chop? 
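The "#ifndef ATOMIC ... return 0;" guard agreed on above keeps the body of the function visible to the compiler even when the feature is switched off, so it cannot silently rot. A rough sketch of that pattern only; cursor_redirect_sketch() and redirect_calls are hypothetical, and the counter stands in for the real redirect work.

```c
#include <assert.h>

/* Hypothetical counter standing in for the real redirect work. */
static int redirect_calls;

static int cursor_redirect_sketch(void)
{
#ifndef ATOMIC
	/* Feature disabled: bail out early, but the code below still compiles. */
	return 0;
#endif
	redirect_calls++;
	return 0;
}
```

Building with -DATOMIC in the compiler flags (the log uses UCFLAGS=-DATOMIC) removes the early return and enables the body.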
2009-01-26 23:24 yes 2009-01-26 23:25 that's the culprit 2009-01-26 23:25 _chop is not really a good name for it 2009-01-26 23:25 it is capable of walking a full btree, removing any entries that match some condition 2009-01-26 23:26 currently, the condition is "above some logical address" 2009-01-26 23:26 yes 2009-01-26 23:26 but it could also be "is a given version" 2009-01-26 23:26 just something to think about, we don't need to support that now 2009-01-26 23:26 well, tree_chop can take cursor 2009-01-26 23:27 separate it to two part 2009-01-26 23:27 it allocates the cursor inside 2009-01-26 23:27 it's a full tree operation 2009-01-26 23:27 probe start point, and start shop 2009-01-26 23:27 as it does now 2009-01-26 23:27 start chop 2009-01-26 23:28 yes 2009-01-26 23:28 the caller does "probe start point" for other operation 2009-01-26 23:29 it might be good to do that for other reasons 2009-01-26 23:29 but right now, the problem is the interface to ops->leaf_chop 2009-01-26 23:29 maybe 2009-01-26 23:29 yes 2009-01-26 23:30 currently takes a btree, should take a cursor 2009-01-26 23:30 ok, so it seems easy now 2009-01-26 23:30 well, for now, I guess leaf_chop should be dirty always 2009-01-26 23:31 is it? 2009-01-26 23:31 you're right 2009-01-26 23:31 because probe points right position 2009-01-26 23:31 yes 2009-01-26 23:32 so we can save a little time by not changing the leaf_chop interface 2009-01-26 23:32 and fixing it goes on the list of things to do for versioning 2009-01-26 23:32 yes, and later it will be rewrite more or less 2009-01-26 23:35 what does a +1 return from dleaf_chop mean? 2009-01-26 23:35 +1 ? 
2009-01-26 23:35 ah 2009-01-26 23:35 it means it was dirty 2009-01-26 23:36 yes, 1 means changed, 0 means no change, negative means error 2009-01-26 23:36 dirtied 2009-01-26 23:38 @@ -450,8 +450,10 @@ int tree_chop(struct btree *btree, struc 2009-01-26 23:38 while (1) { 2009-01-26 23:38 ret = (ops->leaf_chop)(btree, info->key, bufdata(leafbuf)); 2009-01-26 23:38 if (ret) { 2009-01-26 23:38 + if (ret < 0) 2009-01-26 23:38 + goto error_leaf_chop; 2009-01-26 23:38 mark_buffer_dirty(leafbuf); 2009-01-26 23:38 - if (ret < 0) 2009-01-26 23:38 + if ((ret = cursor_redirect(cursor))) 2009-01-26 23:38 goto error_leaf_chop; 2009-01-26 23:38 } 2009-01-26 23:39 redirect after change? 2009-01-26 23:39 sorry :) 2009-01-26 23:39 wasn't thinking 2009-01-26 23:40 -!- edt(~Ed@dsl-61-49.aei.ca) has joined #tux3 2009-01-26 23:40 ah 2009-01-26 23:41 btw, tree_chop cursor is abnormal 2009-01-26 23:41 cursor doesn't have leaf level 2009-01-26 23:41 ? 2009-01-26 23:41 does not always have the full path? 2009-01-26 23:42 it does probe() -> level_pop() 2009-01-26 23:42 then goes to loop 2009-01-26 23:42 yes 2009-01-26 23:42 should fix that 2009-01-26 23:42 the leaf can stay in the cursor 2009-01-26 23:42 tree_chop is handling leaf with leafbuf 2009-01-26 23:42 yes 2009-01-26 23:42 it should handle it with the cursor 2009-01-26 23:43 I hope to merge the two levels in leaf_chop some day ;) 2009-01-26 23:43 they are almost the same 2009-01-26 23:44 ok, checked in cursor_redirect for leaf_chop 2009-01-26 23:44 that was too easy ;) 2009-01-26 23:49 diff -puN user/kernel/btree.c~tree_chop-push-pop user/kernel/btree.c 2009-01-26 23:49 --- tux3/user/kernel/btree.c~tree_chop-push-pop 2009-01-27 16:48:10.000000000 +0900 2009-01-26 23:49 +++ tux3-hirofumi/user/kernel/btree.c 2009-01-27 16:48:27.000000000 +0900 2009-01-26 23:49 @@ -444,12 +444,12 @@ int tree_chop(struct btree *btree, struc 2009-01-26 23:49 2009-01-26 23:49 down_write(&btree->lock); 2009-01-26 23:49 probe(btree, info->resume, cursor); 
2009-01-26 23:49 - leafbuf = level_pop(cursor); 2009-01-26 23:49 2009-01-26 23:49 /* leaf walk */ 2009-01-26 23:49 while (1) { 2009-01-26 23:49 if ((ret = cursor_redirect(cursor))) 2009-01-26 23:49 goto error_leaf_chop; 2009-01-26 23:49 + leafbuf = level_pop(cursor); 2009-01-26 23:49 ret = (ops->leaf_chop)(btree, info->key, bufdata(leafbuf)); 2009-01-26 23:49 if (ret) { 2009-01-26 23:49 if (ret < 0) 2009-01-26 23:49 @@ -549,6 +549,7 @@ keep_prev_node: 2009-01-26 23:49 ret = -EIO; 2009-01-26 23:49 goto out; 2009-01-26 23:49 } 2009-01-26 23:49 + level_push(cursor, leafbuf, NULL); 2009-01-26 23:50 } 2009-01-26 23:50 2009-01-26 23:50 error_leaf_chop: 2009-01-26 23:50 _ 2009-01-26 23:50 this will insert leaf to cursor 2009-01-26 23:51 and probably, in cursor_redirect() "unsigned level = btree->root.depth" 2009-01-26 23:51 it is "unsigned level = btree->root.depth + 1"? 2009-01-26 23:51 ah, no 2009-01-26 23:56 I guess cursor_redirect is broken until this is fixed 2009-01-26 23:58 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-01-26 23:58 I pushed tree_chop fix 2009-01-26 23:59 I think original cursor_redirect is right 2009-01-27 00:00 if depth == 1, it handles 0 and 1 levels 2009-01-27 00:00 it is right 2009-01-27 00:03 there was another bug in cursor_redirect 2009-01-27 00:03 assert(oldblock == from_be_u64(sb->super.iroot)); 2009-01-27 00:03 iroot is including depth 2009-01-27 00:04 whoops 2009-01-27 00:04 btw, why btree->root.root can be used? 2009-01-27 00:04 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-01-27 00:05 something funny about the from_ macro 2009-01-27 00:06 hmm 2009-01-27 00:06 actually, disksuper.iroot doesn't have depth 2009-01-27 00:07 I wonder where we get the depth from 2009-01-27 00:07 iroot is packed struct root? 2009-01-27 00:07 must be 2009-01-27 00:08 our type saftey sucks there 2009-01-27 00:08 so, it is (u64)root->depth << 48 | root->block;? 
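The packing guessed at above, (u64)root->depth << 48 | root->block, can be sketched as a pair of helpers. This assumes the depth lives in the top 16 bits and the block address in the low 48; only the shift of 48 comes from the log, and struct root_sketch and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ROOT_BLOCK_BITS 48
#define ROOT_BLOCK_MASK ((UINT64_C(1) << ROOT_BLOCK_BITS) - 1)

/* Hypothetical in-memory root: tree depth plus root block address. */
struct root_sketch {
	unsigned depth;
	uint64_t block;
};

/* Pack depth into the high bits and the block address into the low 48. */
static uint64_t pack_root_sketch(const struct root_sketch *root)
{
	return (uint64_t)root->depth << ROOT_BLOCK_BITS |
	       (root->block & ROOT_BLOCK_MASK);
}

/* Split a packed value back into depth and block. */
static struct root_sketch unpack_root_sketch(uint64_t packed)
{
	struct root_sketch root = {
		.depth = (unsigned)(packed >> ROOT_BLOCK_BITS),
		.block = packed & ROOT_BLOCK_MASK,
	};
	return root;
}
```

A packed value is therefore never equal to a bare block number once depth is nonzero, which is why comparing a block against the packed field is the wrong check.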
2009-01-27 00:08 something like that 2009-01-27 00:08 yes 2009-01-27 00:08 so, depth was got from iroot 2009-01-27 00:10 we have unpack_root(iroot_val); 2009-01-27 00:10 yes 2009-01-27 00:10 and pack_root(struct root *root) 2009-01-27 00:10 ah, in userland disksuper was not copied to sb->super 2009-01-27 00:11 no 2009-01-27 00:11 it was copied 2009-01-27 00:11 yes 2009-01-27 00:12 (gdb) p/x from_be_u64(sb->super.iroot) 2009-01-27 00:12 $1 = 0x1000000000002 2009-01-27 00:12 looks like right value 2009-01-27 00:12 it is including depth though 2009-01-27 00:14 anyway, it's the wrong thing to assert against 2009-01-27 00:14 yes 2009-01-27 00:14 it should just be checking against the btree root 2009-01-27 00:14 ok 2009-01-27 00:15 but, it should be update the btree->root.block? 2009-01-27 00:15 it shouldn't be 2009-01-27 00:15 it shouldn't update 2009-01-27 00:16 it should update the btree root 2009-01-27 00:16 in both cases 2009-01-27 00:16 ok 2009-01-27 00:16 so the cases aren't really different 2009-01-27 00:16 and we don't have to log the dleaf root change 2009-01-27 00:16 only the itable root change 2009-01-27 00:17 so that was a mess ;) 2009-01-27 00:17 @@ -349,15 +349,10 @@ int cursor_redirect(struct cursor *curso 2009-01-27 00:17 if (!level--) { 2009-01-27 00:17 trace("redirect root"); 2009-01-27 00:17 - if (btree != itable_btree(sb)) { 2009-01-27 00:17 - assert(oldblock == btree->root.block); 2009-01-27 00:17 - btree->root.block = newblock; 2009-01-27 00:18 - log_droot(sb, newblock, oldblock, tux_inode(btree_inode(btree))->inum); 2009-01-27 00:18 - return 0; 2009-01-27 00:18 - } 2009-01-27 00:18 - 2009-01-27 00:18 - assert(oldblock == from_be_u64(sb->super.iroot)); 2009-01-27 00:18 - log_iroot(sb, newblock, oldblock); 2009-01-27 00:18 + assert(oldblock == btree->root.block); 2009-01-27 00:18 + btree->root.block = newblock; 2009-01-27 00:18 + if (btree == itable_btree(sb)) 2009-01-27 00:18 + log_iroot(sb, newblock, oldblock); 2009-01-27 00:18 return 0; 
2009-01-27 00:18 } 2009-01-27 00:18 yes, exactly 2009-01-27 00:19 with it, make tests seems to be passed with some warnings 2009-01-27 00:20 warnings are expected more or less 2009-01-27 00:23 checked in 2009-01-27 00:25 - " | aroot %llu | blockbits %u (size %u) | volblocks %llu" 2009-01-27 00:25 + " | blockbits %u (size %u) | volblocks %llu" 2009-01-27 00:25 - (L)from_be_u64(txsb->aroot), sb->blockbits, sb->blocksize, 2009-01-27 00:25 + sb->blockbits, sb->blocksize, 2009-01-27 00:25 (aroot is not longer used) 2009-01-27 00:27 s/not/no/ 2009-01-27 00:30 good 2009-01-27 01:11 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-01-27 01:12 next step is to flush out the redirected metadata blocks and the log blocks 2009-01-27 01:44 -!- RazvanM(~RazvanM@96.234.239.248) has joined #tux3 2009-01-27 05:19 -!- pgquiles(~pgquiles@45.Red-83-40-80.dynamicIP.rima-tde.net) has joined #tux3 2009-01-27 05:45 -!- pgquiles_(~pgquiles@137.Red-88-0-158.dynamicIP.rima-tde.net) has joined #tux3 2009-01-27 05:50 -!- Chip_M(stefanc@apollo.orakel.ntnu.no) has joined #tux3 2009-01-27 08:08 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-27 08:39 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-01-27 08:55 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-27 09:16 -!- kushal(~kushal@117.195.38.192) has joined #tux3 2009-01-27 09:17 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-27 09:22 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-27 10:09 -!- amey(~amey@117.195.38.192) has joined #tux3 2009-01-27 10:34 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-27 11:02 -!- tim_dimm_(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-27 11:29 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-27 12:31 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 
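[editor's note] The packed iroot value discussed above (depth in the top 16 bits, block number in the low 48) can be sketched as follows. This is a minimal illustration, not the tux3 source: the struct layout, the `ROOT_BLOCK_BITS` and `ROOT_BLOCK_MASK` names, and these signatures for `pack_root`/`unpack_root` are assumptions based only on the "(u64)root->depth << 48 | root->block" line in the chat. The big-endian conversion (`from_be_u64`) is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory root: a block number plus the tree depth. */
struct root {
	uint64_t block;   /* assumed to fit in the low 48 bits */
	unsigned depth;   /* assumed to fit in the top 16 bits */
};

#define ROOT_BLOCK_BITS 48
#define ROOT_BLOCK_MASK ((UINT64_C(1) << ROOT_BLOCK_BITS) - 1)

/* Combine depth and block into one 64-bit value (CPU byte order). */
static uint64_t pack_root(const struct root *root)
{
	assert(root->block <= ROOT_BLOCK_MASK);
	assert(root->depth <= 0xffff);
	return (uint64_t)root->depth << ROOT_BLOCK_BITS | root->block;
}

/* Split a packed value back into its block and depth fields. */
static struct root unpack_root(uint64_t packed)
{
	return (struct root){
		.block = packed & ROOT_BLOCK_MASK,
		.depth = (unsigned)(packed >> ROOT_BLOCK_BITS),
	};
}
```

With this layout the gdb value 0x1000000000002 unpacks to depth 1, block 2, which is why comparing oldblock against the raw iroot value could not work: the assert has to compare against the unpacked block number (btree->root.block).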
2009-01-27 14:49 how does #ifdef build_inode work in user/inode.c? 2009-01-27 14:49 it must be defined somewhere, but I don't see it 2009-01-27 14:51 ah, in Makefile 2009-01-27 14:52 -Dbuild_$(<:.c=) 2009-01-27 14:54 probably can just be build_main 2009-01-27 14:55 it's only once per top level compile 2009-01-27 14:55 hmm, no 2009-01-27 14:56 it has to be build_$binary all right 2009-01-27 14:56 hirofumi, nice hack 2009-01-27 16:05 sk8 oclock 2009-01-27 19:10 sync error: Resource temporarily unavailable <- this is because sync_super can't just flush the bitmap any more 2009-01-27 19:11 I suppose sync_super doesn't have any use with atomic commit 2009-01-27 19:51 -!- amey(~amey@117.195.38.192) has joined #tux3 2009-01-27 20:15 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-27 23:41 -!- edt(~Ed@dsl-61-49.aei.ca) has joined #tux3 2009-01-27 23:56 -!- cdk(~chinmay@117.195.35.97) has joined #tux3 2009-01-28 00:10 hirofumi, there? 2009-01-28 02:10 -!- RazvanM(~RazvanM@96.234.239.248) has joined #tux3 2009-01-28 04:28 -!- cdk(~chinmay@117.195.35.97) has joined #tux3 2009-01-28 09:47 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-28 09:48 -!- gila(~gila@5ED41295.cable.ziggo.nl) has joined #tux3 2009-01-28 10:35 -!- pranith(~bobby@122.162.69.20) has joined #tux3 2009-01-28 11:25 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-28 13:12 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-28 13:41 -!- cdk(~chinmay@117.195.34.82) has joined #tux3 2009-01-28 13:55 -!- dagle(~weechat@host162-104.bornet.net) has joined #tux3 2009-01-28 14:10 -!- dcg(~dcg@64.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-01-28 15:01 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-28 15:35 hirofumi, there? 
2009-01-28 15:35 yes 2009-01-28 15:35 hi 2009-01-28 15:35 hi 2009-01-28 15:35 I broke valgrind yesterday 2009-01-28 15:35 I'll fix it later today 2009-01-28 15:35 one really small thing I want to do: change all the *test to test_* in the makefile 2009-01-28 15:35 so for example, test_commit 2009-01-28 15:36 yes 2009-01-28 15:36 ok, patch on the way 2009-01-28 15:36 since I use only tests, so I don't care it 2009-01-28 15:36 the other thing is, what's in test_commit 2009-01-28 15:37 so today and tomorrow I will put the whole atomic commit pipeline together 2009-01-28 15:37 good 2009-01-28 15:37 I wrote a post about it last night, but didn't quite finish all the details, it will be posted in a couple of hours 2009-01-28 15:38 btw, before, I was thinking to add prefix or postfix for tests 2009-01-28 15:38 yes, we have similar ideas about things like that 2009-01-28 15:38 yes 2009-01-28 15:39 another thought of log I have 2009-01-28 15:39 I was thinking about log buffer changes to stash 2009-01-28 15:40 I'm not sure it is good or not 2009-01-28 15:41 well, the intent is it can use same infrastructure, and it can be per delta list 2009-01-28 15:41 two logs? 
2009-01-28 15:41 no, log list per delta 2009-01-28 15:42 that seems to make sense 2009-01-28 15:42 well, I'm not sure it is needed or not yet 2009-01-28 15:43 let's see how the user/commit prototype turns out 2009-01-28 15:43 if we want to add logs for next delta, I thought it would help 2009-01-28 15:44 I was thinking about how to do that 2009-01-28 15:44 and this seems like a reasonable way 2009-01-28 15:44 possibly the best way 2009-01-28 15:45 good 2009-01-28 15:47 ok, so it is now make test_ instead of test 2009-01-28 15:47 ok 2009-01-28 15:48 test_commit is pretty cool, it makes a mountable filesystem 2009-01-28 15:48 yes 2009-01-28 15:48 this will be much faster to hack on than running the full filesystem for commit tests 2009-01-28 15:49 basically, all I had to do was store the magic number in sb->disksuper 2009-01-28 15:50 we will have two cycle counters: sb->delta and sb->flush 2009-01-28 15:50 the flush counter takes care of the rollup, where rollup flushes btree nodes and bitmap blocks to disk 2009-01-28 15:51 i see 2009-01-28 15:51 during that flush, btree nodes will be redirected and bitmap blocks can be forked 2009-01-28 15:51 with make_tux3() in commit.c, I've noticed init_btree() for itable_btree is unnecessary 2009-01-28 15:52 ah 2009-01-28 15:52 got a patch? 2009-01-28 15:52 just remove it? 
2009-01-28 15:52 I'll make a patch for it 2009-01-28 15:52 yes 2009-01-28 15:53 next thing I have to do is fix the valgrind complaint about test_commit so make tests runs 2009-01-28 15:53 good 2009-01-28 15:54 anyway, the rollup flush is settling down into a simple form: it just adds blocks to the dirty list for the next delta 2009-01-28 15:55 and it adds its deferred frees to the deferred frees for the next delta 2009-01-28 15:55 well, I will prototype it a post code 2009-01-28 15:55 it's the best way to explain 2009-01-28 15:56 sk8 oclock 2009-01-28 15:56 ok 2009-01-28 17:22 back 2009-01-28 18:36 the fix for valgrind complaint was easy, new directory blocks were unintialized so just clear them 2009-01-28 18:36 ext2/3 etc don't do this and should 2009-01-28 18:37 otherwise random kernel memory becomes visible on disk 2009-01-28 18:39 interesting how it took so long to catch that, I guess it is because we never ran ./tux3 or fuse under valgrind 2009-01-28 18:42 I think, at least ext2 does it 2009-01-28 18:42 it uses ->readpage for new page too 2009-01-28 18:43 with it, since the page is outside of i_size, so it is zeroed by ->readpage() 2009-01-28 18:45 ah 2009-01-28 18:47 ==28655== definitely lost: 248 bytes in 2 blocks. 2009-01-28 18:47 ==28655== indirectly lost: 8,032 bytes in 2 blocks. 2009-01-28 18:47 I suppose I should find out why 2009-01-28 18:48 124 bytes per create 2009-01-28 19:00 "tux3 write" also hit same bug 2009-01-28 19:00 I wonder why didn't hit before 2009-01-28 19:00 you mean leaking memory? 2009-01-28 19:00 no, uninitialized block 2009-01-28 19:01 we used to initialize the buffers 2009-01-28 19:01 until a couple of weeks ago 2009-01-28 19:01 who was initializing it? 
2009-01-28 19:02 they where filled with "dd" 2009-01-28 19:02 //memset(data_pool, 0xdd, max_buffers*bufsize); /* first time init to deadly data */ 2009-01-28 19:03 I guess this bug didn't happen with BUFFER_PARANOIA_DEBUG 2009-01-28 19:08 sizeof(struct inode) = 124, and we lose 124 bytes/create 2009-01-28 19:08 in the user/commit.c test 2009-01-28 19:09 aj 2009-01-28 19:09 ah 2009-01-28 19:09 duh 2009-01-28 19:10 no more leak 2009-01-28 19:21 oh, uninitialized buffer was long standing bug 2009-01-28 19:22 I knew it, however I was thinking valgrind doesn't warn it 2009-01-28 19:23 but, valgrind warned it for "tux3 write" after BUFFER_PARANOIA_DEBUG actually 2009-01-28 19:34 -!- pranith(~bobby@122.162.69.20) has joined #tux3 2009-01-28 19:39 still? 2009-01-28 19:41 no, it was fixed with your patch 2009-01-28 19:41 right, just tried it 2009-01-28 19:42 printf("No text to write\n"); 2009-01-28 19:42 return 1; <- this loses resources 2009-01-28 19:44 yes 2009-01-28 19:44 well, there are many places like ti 2009-01-28 19:44 it 2009-01-28 19:45 in this case it is because there was no flush 2009-01-28 19:45 on error path, we are not freeing resources yet 2009-01-28 19:45 char text[2 << 16]; :p 2009-01-28 19:46 I'm sure we meant 1 << 16 2009-01-28 19:46 well, we can remove it though 2009-01-28 19:46 ah, no 2009-01-28 19:47 it was using 2009-01-28 19:47 right, it's the transfer buffer 2009-01-28 19:48 it's better to provide some default text to write I think, otherwise running under gdb is hard 2009-01-28 19:48 I think it is "blocksize * MAX_EXTENT_BLOCKS" 2009-01-28 19:49 the exact size doesn't really matter 2009-01-28 19:49 strcpy(text, "data") doesn't work? 
2009-01-28 19:49 sure it does 2009-01-28 19:49 there used to be default text, then the error exit was added 2009-01-28 19:50 I probably did that 2009-01-28 19:50 ah, guess_region() limits it 2009-01-28 19:50 right 2009-01-28 19:50 we can just remove #if 1 part (checking S_ISCHR) 2009-01-28 19:51 I guess it would help debug 2009-01-28 19:51 right. There isn't really a right thing to do 2009-01-28 19:51 how would you run gdb, to take input from stdin? 2009-01-28 19:52 I forget it 2009-01-28 19:53 http://www.cygwin.com/ml/cygwin/1999-04/msg00304.html 2009-01-28 19:54 ah, yes 2009-01-28 20:00 if (S_ISCHR(stat.st_mode)) { 2009-01-28 20:00 - printf("No text to write\n"); 2009-01-28 20:00 - return 1; 2009-01-28 20:00 + warn("No text to write"); 2009-01-28 20:00 + } else { 2009-01-28 20:01 btw, why doesn't it allow chrdev? 2009-01-28 20:01 so we have a way to write provide text with gdb, now we can always just use stdin 2009-01-28 20:01 it's trying to detect end of file on input 2009-01-28 20:01 do you know another way? 2009-01-28 20:01 ctrl-d? 2009-01-28 20:01 that is, trying to detect nothing connected to stdin 2009-01-28 20:02 isttry? 2009-01-28 20:02 istty 2009-01-28 20:02 probably better 2009-01-28 20:02 I think right now it's detecting "not a pipe" 2009-01-28 20:03 so it should really be !ISFIFO 2009-01-28 20:04 I think I cut and pasted that from somewhere 2009-01-28 20:04 bad taste has a way of propagating 2009-01-28 20:04 um... 2009-01-28 20:04 I think it works on any file 2009-01-28 20:04 what does? 2009-01-28 20:04 fifo also can tell eof by write close 2009-01-28 20:05 chrdev/blockdev is working like file 2009-01-28 20:05 socket is also working like fifo 2009-01-28 20:05 tty can tell eof with ctrl-d 2009-01-28 20:06 eof is ok actually 2009-01-28 20:06 it's just trying to detect nothing connected at all 2009-01-28 20:06 0 is closed? 
2009-01-28 20:06 and actually, it should just write an empty file in that case 2009-01-28 20:07 fo probably we don't need that at all 2009-01-28 20:08 yes, I guess so 2009-01-28 20:08 I think it needs to check error from read 2009-01-28 20:09 that would be a good idea 2009-01-28 20:11 right now it will loop forever if read returns an error 2009-01-28 20:11 sorry about that ;) 2009-01-28 20:11 :) 2009-01-28 20:12 -!- cdk(~chinmay@117.195.32.98) has joined #tux3 2009-01-28 20:16 well, just removing the ISFIFO check makes it read text from console 2009-01-28 20:16 which is normal unixy behavior 2009-01-28 20:16 always makes you think the program is stuck 2009-01-28 20:17 I guess that is enough tux3 command fixing for now 2009-01-28 20:19 sounds good 2009-01-28 20:19 ok, back to commit.c 2009-01-28 20:24 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-01-28 20:24 just some cleanups 2009-01-28 20:28 reading 2009-01-28 20:29 I reclone every time from your hg, otherwise I get multiple heads 2009-01-28 20:29 I wonder if there is another way 2009-01-28 20:30 there are temporary changes? 2009-01-28 20:31 probably 2009-01-28 20:32 I will mention it next time it comes up 2009-01-28 20:33 git seems to work well for it 2009-01-28 20:35 hg also have branch 2009-01-28 20:35 it doesn't work for this? 2009-01-28 20:35 I will wait for the next time it happens, then we can look at the case 2009-01-28 20:35 ok 2009-01-28 20:38 your changes look good 2009-01-28 20:38 the only one that isn't obvious is the filemap.c change for cursor_redirect 2009-01-28 20:40 and it's clear when I read the file instead of the patch 2009-01-28 20:40 first thing is to clean error path up 2009-01-28 20:40 yes, and it move the redirect to where the dleaf changes start 2009-01-28 20:40 yes 2009-01-28 20:42 pushed to public 2009-01-28 20:42 thanks 2009-01-28 20:42 btw, if error was happened, do we do when it? 2009-01-28 20:43 ? 2009-01-28 20:43 in map_region? 
2009-01-28 20:43 entirely 2009-01-28 20:44 e.g. balloc() already did, but after that, error was happen 2009-01-28 20:44 ah right 2009-01-28 20:44 I did think about that before 2009-01-28 20:45 well, it would not be fixed right now, but we need to do later 2009-01-28 20:45 I'll take a look right now 2009-01-28 20:46 for example, an error from cursor_redirect 2009-01-28 20:46 that could be EIO or ENOMEM 2009-01-28 20:47 on other fs before, I tried to revert change, and if revert can't be done, give up that change with message 2009-01-28 20:47 yes 2009-01-28 20:47 all we need to revert there are the ballocs 2009-01-28 20:47 and -ENOSPC 2009-01-28 20:47 and they are in a vector 2009-01-28 20:48 -ENOSPC, we have to continue and write metadata 2009-01-28 20:48 EIO and ENOMEM are bad conditions 2009-01-28 20:48 EIO might come from somebody removing a removable disk? 2009-01-28 20:48 it would be usual case 2009-01-28 20:49 ext3 will make the fs ro in that case 2009-01-28 20:49 yes 2009-01-28 20:49 I guess error handling of ext3 is not good 2009-01-28 20:50 so, EIO is different from ENOSPC, because we will probably not be able to complete writing the metadata 2009-01-28 20:51 yes 2009-01-28 20:51 if we are ZFS, might might keep trying 2009-01-28 20:51 using different devices 2009-01-28 20:51 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-28 20:51 that really scares me 2009-01-28 20:52 yes 2009-01-28 20:52 there is two cases on EIO, read and write 2009-01-28 20:52 well I agree, we should try to bfree the ballocs, then exit with the error 2009-01-28 20:52 ok 2009-01-28 20:53 thanks, for now it's enough for me 2009-01-28 20:54 btw, ENOSPC must not be happened on write path? 2009-01-28 20:54 ? 2009-01-28 20:54 on other hand, we have to reserve space for write? 
2009-01-28 20:54 we need to reserve space for metadata, but not for the write itself 2009-01-28 20:55 ah, write path is flush path 2009-01-28 20:55 we can return a short write for ENOSPC, but we have to be able to complete the delta 2009-01-28 20:55 yes 2009-01-28 20:55 so begin_change has to check that there is enough space for metadata 2009-01-28 20:56 worst case estimate 2009-01-28 20:56 yes 2009-01-28 20:56 it is data pages too? 2009-01-28 20:56 not data pages, just metadata 2009-01-28 20:56 because we can return a short write 2009-01-28 20:56 but we can't leave the metadata inconsistent 2009-01-28 20:57 but, if delalloc for data pages was ENOSPC, metadata will points invalid data blocks? 2009-01-28 20:58 uninitialized data blocks 2009-01-28 20:58 we allocate the data blocks and set up the metadata in map_region 2009-01-28 20:58 I don't think delalloc changes that 2009-01-28 20:59 if not delalloc (e.g. ext2), blocks allocated on sys_write() path 2009-01-28 21:01 we just balloc as many as we can in map_region, then when we get ENOSPC, stop allocating, set the number of segs to as far as we got, then update the metadata to point to the data blocks that were successfully allocated 2009-01-28 21:01 well 2009-01-28 21:01 balloc has to give ENOSPC before it runs out... 
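[editor's note] The "allocate as many as we can, then stop at ENOSPC" idea above, with balloc refusing before free space is really exhausted, might look roughly like this. It is a toy sketch under stated assumptions: `struct space`, `balloc_reserved` and `alloc_segs` are hypothetical names, the "next block" policy is a placeholder for a real bitmap search, and how the reserve would be sized is not decided in the chat.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical allocator state; field names are illustrative only. */
struct space {
	uint64_t freeblocks;   /* blocks currently free */
	uint64_t reserve;      /* blocks held back so metadata can always be written */
	uint64_t next;         /* next block number to hand out (toy policy) */
};

/* Allocate one data block, refusing to dip into the metadata reserve. */
static int balloc_reserved(struct space *sp, uint64_t *block)
{
	if (sp->freeblocks <= sp->reserve)
		return -ENOSPC;   /* fail early: the rest is for metadata */
	sp->freeblocks--;
	*block = sp->next++;
	return 0;
}

/*
 * Allocate up to `want` blocks. Returns how many were allocated, so the
 * caller can map just that many and report a short write, or -ENOSPC if
 * not even one block could be allocated.
 */
static int alloc_segs(struct space *sp, uint64_t *blocks, int want)
{
	int got = 0;

	assert(want >= 0);
	if (want == 0)
		return 0;
	while (got < want && balloc_reserved(sp, &blocks[got]) == 0)
		got++;
	return got ? got : -ENOSPC;
}
```

The point of the reserve is that a short data allocation still leaves enough blocks to finish the metadata update for the delta, so the write can return early without leaving metadata inconsistent.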
2009-01-28 21:01 ah 2009-01-28 21:03 well, maybe I'd like to reserve for data pages too 2009-01-28 21:03 it might be a nice option 2009-01-28 21:03 because I want to know ENOSPC by sys_write() 2009-01-28 21:03 ok 2009-01-28 21:04 I think, if somebody writes a 10 GB file, and it goes ENOSPC halfway through, they expect the first half to be written 2009-01-28 21:05 in other words, it's not expected to try to make a transaction out of a write 2009-01-28 21:05 we need to check sb->freeblocks in map_region before balloc 2009-01-28 21:05 if sys_write(10GB memory), I think somebody assumes that sys_write succeeded 2009-01-28 21:06 if it is below the metadata reservation, we will give ENOSPC 2009-01-28 21:06 and complete the metadata update 2009-01-28 21:06 but, it can be half? 2009-01-28 21:07 yes, a file can be half written if the write returns ENOSPC 2009-01-28 21:07 that is posix 2009-01-28 21:07 yes, if the write returns ENOSPC 2009-01-28 21:08 to return ENOSPC on the write, I assumed we need to reserve space for data pages 2009-01-28 21:08 I don't think we do 2009-01-28 21:08 but convince me :) 2009-01-28 21:09 ok :) 2009-01-28 21:09 users call sys_write(50M), then sys_write(10M) 2009-01-28 21:09 users will assume both are written 2009-01-28 21:10 with sync() 2009-01-28 21:10 ok, you convinced me 2009-01-28 21:11 well, half write may be useful, I didn't think of it before though 2009-01-28 21:11 it allows overcommit memory like behavior 2009-01-28 21:11 I think 2009-01-28 21:12 if it removes files before flush, users can create a biiiig file 2009-01-28 21:12 may be useful 2009-01-28 21:13 anyway, it is not hard to calculate the space requirement for data 2009-01-28 21:13 yes 2009-01-28 21:16 whoops, I introduced a stupid bug in dir.c 2009-01-28 21:17 what bug? 
2009-01-28 21:17 I removed name_len = 0; 2009-01-28 21:18 for no good reason 2009-01-28 21:18 just was not thinking 2009-01-28 21:18 oh :) 2009-01-28 21:18 I didn't notice it 2009-01-28 21:19 gcc noticed 2009-01-28 21:19 thank you gcc 2009-01-28 21:20 it seems to be a bug in gcc 2009-01-28 21:20 gcc-4.3 doesn't warn about it, and it seems to be unused 2009-01-28 21:22 tux_dirent *newent = (tux_dirent *)((char *)entry + name_len); 2009-01-28 21:22 it is used, but only if !is_deleted(entry) 2009-01-28 21:22 this code is not pretty 2009-01-28 21:22 hard to verify 2009-01-28 21:22 it's straight from ext2 2009-01-28 21:22 yes, on blockget path, it is is_deleted() always 2009-01-28 21:23 but how can gcc know that? 2009-01-28 21:23 *entry = {} clears entry->name_len 2009-01-28 21:24 *entry = (tux_dirent){ .rec_len ... }; 2009-01-28 21:24 so, entry->name_len == 0 on that path 2009-01-28 21:24 right, but it uses the name_len variable separately 2009-01-28 21:25 that is the problem with this code, it caches variables unnecessarily 2009-01-28 21:25 but I don't really want to fix it 2009-01-28 21:25 hopefully we will have a proper directory index not too far in the future 2009-01-28 21:26 gcc_unsued() was introduced for this possible false warning 2009-01-28 21:27 it was uninitialized_var() 2009-01-28 21:27 as far as I can see, name_len really was uninitialized 2009-01-28 21:27 is_deleted() checks !entry->name_len 2009-01-28 21:27 ah 2009-01-28 21:27 well 2009-01-28 21:28 but it isn't entry->name_len it complains about, but the name_len variable 2009-01-28 21:28 yes, old gcc can't detect running path always for unused val 2009-01-28 21:29 can't use the detected unused path for unused val 2009-01-28 21:30 um.., old gcc warns unused val for all paths, even if one path is never used 2009-01-28 21:31 this was known bad behavior of gcc-4.0 or gcc-4.1 or so 2009-01-28 21:31 so, uninitialized_var() was introduced to shut up gcc 2009-01-28 21:31 #define uninitialized_var(x) x = x 2009-01-28 21:32 this 
result is no-op 2009-01-28 21:34 let's see if etch has a newer gcc for me 2009-01-28 21:35 yes, it has 4.2 2009-01-28 21:36 4.2.4-6? 2009-01-28 21:37 4.2-20070627-1 2009-01-28 21:37 whatever that means 2009-01-28 21:38 it seems a bit older 2009-01-28 21:38 well, however, this gcc may have the fix 2009-01-28 21:50 the bug seems 2009-01-28 21:50 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=20644 2009-01-28 21:53 I think in this case gcc correctly warned 2009-01-28 21:53 so I wonder why 4.2 did not warn 2009-01-28 21:55 why in this case name_len is uninitialized? 2009-01-28 21:55 because I removed name_len = 0 in an earlier change 2009-01-28 21:56 gcc warned, and so I put it back 2009-01-28 21:56 in that path, if (!is_deleted(entry)) { AAA; }, I think AAA is never used 2009-01-28 21:58 ok, you're right :) 2009-01-28 21:58 :) 2009-01-28 21:58 sometimes you only have to tell me something 3 or 4 times before I get it 2009-01-28 21:59 so, since gcc is so smart now, it must also be smart enough to remove the unneeded name_len = 0 2009-01-28 21:59 :) I should learn english and talking tech more more 2009-01-28 22:00 no, it was just me not noticing that initializing the structure controlled that path 2009-01-28 22:00 thanks 2009-01-28 22:00 but, name_len = 0 would not be removed 2009-01-28 22:01 I guess 2009-01-28 22:05 hey flips 2009-01-28 22:05 hi bh 2009-01-28 22:05 ACTION reads the backlog 2009-01-28 22:06 oh, it seems to have removed name_len=0 2009-01-28 22:06 you checked :) 2009-01-28 22:07 yes, I can't believe it always does it, however it's great enough 2009-01-28 22:07 that's pretty smart 2009-01-28 22:07 yes 2009-01-28 22:07 btw, of course, it needs -O option 2009-01-28 22:23 um..., on stage_delta(), new_block() is not dirty state for sb->delta? 2009-01-28 22:23 ? 2009-01-28 22:24 (new_block() doesn't link to map->dirty) 2009-01-28 22:24 I'm thinking about flushing directory data 2009-01-28 22:24 mark_buffer_dirty puts the block in the map dirty list 2009-01-28 22:24 in kernel... 
2009-01-28 22:24 yes 2009-01-28 22:24 we have to do something different 2009-01-28 22:25 I mean, mark_buffer dirty does not do that in kernel 2009-01-28 22:25 well, now in userland 2009-01-28 22:25 yes 2009-01-28 22:25 only worrying about userland for now 2009-01-28 22:25 directory data changes marks buffer dirty as delta[1] 2009-01-28 22:25 it's assuming now sb->delta==1 2009-01-28 22:26 in my unit test? 2009-01-28 22:26 no 2009-01-28 22:26 just in my mind 2009-01-28 22:27 well, so, in stage_delta() 2009-01-28 22:27 try to flush directory data buffers 2009-01-28 22:27 we should flush the directory before incrementing the delta counter 2009-01-28 22:27 now, we are incremented delta to sb->delta==2 2009-01-28 22:28 oh 2009-01-28 22:28 I have a post about that written, I should post it 2009-01-28 22:28 ah, maybe I heared it before 2009-01-28 22:29 it blocks almost all fs operation 2009-01-28 22:29 ? 2009-01-28 22:29 yes, it will for now 2009-01-28 22:29 later it can be changed? 2009-01-28 22:29 but deltas can be allowed to grow large 2009-01-28 22:29 yes 2009-01-28 22:30 and even now, it will perform ok on some loads 2009-01-28 22:30 how can we change it without delta change? 2009-01-28 22:30 change what without delta change? 2009-01-28 22:31 avoid to wait flush before delta incremnt 2009-01-28 22:33 right, what I call overlapping deltas 2009-01-28 22:34 i see 2009-01-28 22:37 let me remember what the plan was 2009-01-28 22:38 the goal is to let a new file operation start in a new delta, but not change any blocks in the previous delta 2009-01-28 22:39 yes 2009-01-28 22:40 I guess it can be done by incrementing delta counter 2009-01-28 22:51 increment delta counter before flush, and use delta-1 for btree? 
2009-01-28 22:54 yes 2009-01-28 22:54 and the details are complex 2009-01-28 22:55 i see 2009-01-28 22:55 which is why I thought we should start with non-overlapped deltas 2009-01-28 22:55 i see, thanks 2009-01-28 22:55 let me think about that and say something intelligent tomorrow 2009-01-28 22:55 instead of something dumb now ;) 2009-01-28 22:55 :) 2009-01-28 23:04 http://mailman.tux3.org/pipermail/tux3/2009-January/000701.html <- includes algorithms for staging and rollup 2009-01-28 23:05 please find flaws, and in the mean time, I will get to work implementing 2009-01-28 23:06 ok 2009-01-28 23:41 -!- edt(~Ed@dsl-61-49.aei.ca) has joined #tux3 2009-01-29 00:49 -!- RazvanM(~RazvanM@96.234.239.248) has joined #tux3 2009-01-29 06:29 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-01-29 06:42 hi. what's the best way to disable optimizations for the kernel compile? 2009-01-29 09:36 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-29 10:04 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-01-29 10:43 -!- rautelap(~weechat@122.167.66.58) has joined #tux3 2009-01-29 10:43 tuxroot suddenly doesnot works with UML linux binary. 2009-01-29 10:43 Why? 2009-01-29 10:45 :( 2009-01-29 12:09 -!- flips(~phillips@phunq.net) has joined #tux3 2009-01-29 12:27 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-29 17:32 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-29 18:17 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-01-29 18:18 hirofumi, there? 2009-01-29 18:18 yes 2009-01-29 18:19 about overlapped deltas... 
2009-01-29 18:19 running stage_delta in parallel is nasty 2009-01-29 18:19 there would be interaction between vfs and backend locks 2009-01-29 18:21 i see 2009-01-29 18:22 for example, A does change_begin, then delta increments, then B does change_begin and takes i_mutex on directory Q, then wants to change a dirent block, but has to wait for previous stage_delta to finish, now A tries to take the i_mutex for directory Q, and the previous stage_delta will never complete 2009-01-29 18:22 and many similar deadlocks 2009-01-29 18:23 why B have wait previous stage_delta? 2009-01-29 18:24 because it can't modify the same dirent block as previous delta until previous delta has finished staging 2009-01-29 18:25 fortunately, it is not very important to run stage_deltas in parallel 2009-01-29 18:25 um..., but delta counter was incremented already 2009-01-29 18:25 so, change of B will do blockfork? 2009-01-29 18:26 but previous delta may not have made its updates to the direct block yet, so B can't fork yet 2009-01-29 18:27 anyway, buffered writes with delalloc do not have to wait for the delta read lock 2009-01-29 18:27 only namespace operations will stall 2009-01-29 18:28 and stage_delta will be a cache-to-cache operation after just a little more optimization 2009-01-29 18:28 it meant dirent change was delayed? 2009-01-29 18:28 if we had delayed namespace operations, they would not have to wait on the delta lock either 2009-01-29 18:28 but that is far in the future 2009-01-29 18:29 yes 2009-01-29 18:29 so, I thought previous delta has updated data already 2009-01-29 18:29 which data? 2009-01-29 18:30 namespace changes 2009-01-29 18:30 directory buffers 2009-01-29 18:30 but we need a boundary between filesystem operations that belong to different deltas 2009-01-29 18:32 it means someone may still be using same delta counter with stage_delta() 2009-01-29 18:32 ? 
2009-01-29 18:33 for example, a rename may change two directories, and they need to be changed in the same delta 2009-01-29 18:33 yes 2009-01-29 18:34 ext3/4 must have the same problem 2009-01-29 18:34 what is problem? 2009-01-29 18:34 closing a journal transaction 2009-01-29 18:35 delta counter can be changed on middle of rename? 2009-01-29 18:35 we don't allow that now 2009-01-29 18:35 in the prototype change_end 2009-01-29 18:35 yes 2009-01-29 18:35 allowing it seems very complex 2009-01-29 18:35 so, it is in same delta? 2009-01-29 18:36 what is in the same delta? 2009-01-29 18:36 two directories that rename changed is in same delta? 2009-01-29 18:36 yes 2009-01-29 18:37 and the dleaf and ileaf changes 2009-01-29 18:37 we will not wait for IO to complete in stage_delta 2009-01-29 18:38 yes 2009-01-29 18:38 except in our first prototype 2009-01-29 18:38 um... 2009-01-29 18:39 rename change is in same delta, and directory changes is done by frontend 2009-01-29 18:40 right 2009-01-29 18:40 sorry, again. why B have to wait previous stage_delta? 2009-01-29 18:40 because A may change the same dirent block, but has not done it yet 2009-01-29 18:40 so B can't fork 2009-01-29 18:41 in this case, A is in change_end()? 2009-01-29 18:42 A is not yet in change_end 2009-01-29 18:42 A maybe got delayed for some reason, like reading in a metadata block 2009-01-29 18:43 change_end increments delta counter? 2009-01-29 18:43 yes 2009-01-29 18:43 B is in change_end? 2009-01-29 18:44 B is not in change_end in my example, it is trying to change a dirent block 2009-01-29 18:44 C was changed delta counter? 2009-01-29 18:45 and A and B is trying to change directory? 2009-01-29 18:45 yes 2009-01-29 18:45 exactly 2009-01-29 18:45 i see 2009-01-29 18:45 it's a hard problem 2009-01-29 18:45 in this case, in my mind, C should wait A and B 2009-01-29 18:46 because those are using same delta yet 2009-01-29 18:46 put A and B in the same delta is possible, but what if there are also C, D, E.... 
2009-01-29 18:46 when do you close the delta? 2009-01-29 18:47 it can be reference counter 2009-01-29 18:47 last unref can be wakeup 2009-01-29 18:47 what if new tasks keep taking new refs? 2009-01-29 18:48 in this case, C is, increments delta counter, then wait preivous delta 2009-01-29 18:49 will A and B be in delta before increment, or delta after? 2009-01-29 18:50 in the above example, A would be before, and B would be after 2009-01-29 18:51 A call change_begin, then C change_end and increment, and B change_begin 2009-01-29 18:52 ok, now B wants to change the dirent block, but A has not made its change yet 2009-01-29 18:52 e.g. A get delta (1), C stage_delta for (1) and incremnt to (2), B get (2) 2009-01-29 18:52 how does B wait for A? 2009-01-29 18:52 ah 2009-01-29 18:52 i see 2009-01-29 18:53 ok, so that is the reason for the write lock in stage delta 2009-01-29 18:53 new delta can't start until delta close 2009-01-29 18:53 right 2009-01-29 18:54 however, it doesn't need to wait stage_delta? 2009-01-29 18:54 this is ok for delalloc buffered writes, the write can complete in cache and return to caller 2009-01-29 18:54 it has to wait for state_delta to release the delta write lock 2009-01-29 18:56 it is possible that the only thing stage_delta has to do under the write lock is increment the delta counter 2009-01-29 18:56 for now it will do a lot more 2009-01-29 18:56 ah, yes 2009-01-29 18:57 um..., no 2009-01-29 18:58 C has to wait A, and B has to wait increment of C 2009-01-29 18:58 not stage_delta 2009-01-29 18:59 C wait stage_delta, then increment 2009-01-29 18:59 C wait to start stage_delta 2009-01-29 18:59 yes 2009-01-29 18:59 i see 2009-01-29 18:59 ok, let's get some code to play with 2009-01-29 18:59 so, actually, stage_delta can be outside of down_write? 
2009-01-29 19:00 I think so 2009-01-29 19:00 yes 2009-01-29 19:00 and for now, by some reason, it has to be in down_write 2009-01-29 19:01 I remembered original one was outside of down_write 2009-01-29 19:01 then by some reason, it move to inside of down_write 2009-01-29 19:02 you suggested it ;) 2009-01-29 19:02 yes :) 2009-01-29 19:02 I forget it was why 2009-01-29 19:03 it's easier to get it to run without bugs 2009-01-29 19:03 maybe, we were not going to do blockfork completely 2009-01-29 19:03 at first 2009-01-29 19:03 yes 2009-01-29 19:03 I think we should still do it that way 2009-01-29 19:03 yes 2009-01-29 19:03 and then improve it 2009-01-29 19:03 exactly 2009-01-29 19:04 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-29 19:29 btw, is there any thought/patch for dirty inodes list? 2009-01-29 19:30 http://userweb.kernel.org/~hirofumi/temp/inode-dirty-list.patch 2009-01-29 19:30 I was thinking like the above or something 2009-01-29 19:31 no dirty inodes patch yet 2009-01-29 19:31 ok 2009-01-29 19:32 yes, just like that :) 2009-01-29 19:32 good :) 2009-01-29 19:32 pull? 2009-01-29 19:32 not yet 2009-01-29 19:33 ok, when you're ready 2009-01-29 19:33 yes 2009-01-29 19:43 ah, namespace changes are serialized by i_mutex already 2009-01-29 19:53 yes 2009-01-29 19:55 so, we can optimize stage_delta stall in future 2009-01-29 19:55 we may able to optimize 2009-01-29 19:55 A and B are exclusive 2009-01-29 19:57 yes, there are good opportunities to optimize, and stage_delta does not have to wait on writeout 2009-01-29 19:57 that is the biggest optimization 2009-01-29 19:58 yes 2009-01-29 21:04 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-29 21:44 flips, there? 
2009-01-29 21:44 i'm talking to him on the phone...so, no, he's not there :) 2009-01-29 21:45 :) 2009-01-29 22:47 well, I was thinking about atomic commit for the bitmap 2009-01-29 22:48 so, I think changing the bitmap is backend and under the down_write(sb->delta_lock) 2009-01-29 22:49 if so, it is meaning only one user changes the bitmap buffers 2009-01-29 22:50 so, I guess the bitmap locking and atomic commit is much simpler 2009-01-29 22:50 there is no lock and there is no recursive 2009-01-29 23:39 hirofumi, exactly 2009-01-29 23:41 -!- edt(~Ed@dsl-61-49.aei.ca) has joined #tux3 2009-01-30 00:08 I noticed the one possible user, the writepage of normal file can be using balloc() to writeout 2009-01-30 00:08 however, we can disallow it, I'm not sure whether it is good 2009-01-30 00:14 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-01-30 00:15 well, if balloc was called only under delta_lock, it solves recursive dirty cleanly 2009-01-30 00:16 it can modify previous delta, because there is no other user to change it 2009-01-30 00:17 -!- kedars(~kedars@socks.wantstofly.org) has left #tux3 2009-01-30 00:17 maybe, it can be just bitmap delta is "sb->delta - is_bitmap_write()" 2009-01-30 00:18 hmm 2009-01-30 00:19 what is for? 2009-01-30 00:19 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-01-30 00:19 I wasn't actually planning on doing all file flushing under the delta lock 2009-01-30 00:19 we are changing bitmap buffer to map bitmap buffer itself 2009-01-30 00:20 yes 2009-01-30 00:20 that is solved I think 2009-01-30 00:20 well, delta_lock does not matter 2009-01-30 00:20 it means only one user can change bitmap buffer 2009-01-30 00:21 why only one user? 2009-01-30 00:21 if stage_delta is serialized with stage_delta for other delta, I think it is work 2009-01-30 00:22 if there is other users, we can't change "sb->delta -1" 2009-01-30 00:23 what I'm thinking is 2009-01-30 00:24 e.g. 
bitmap buffer has delta(2) 2009-01-30 00:24 and now we are mapping it 2009-01-30 00:24 we call balloc() to allocate block for the buffer 2009-01-30 00:24 it may changes that buffer again 2009-01-30 00:25 we will do blockfork() here 2009-01-30 00:25 because sb->delta == 3 2009-01-30 00:26 but, this change is for delta == 2 2009-01-30 00:29 btw, I think current one has this problem... 2009-01-30 00:31 ah, no 2009-01-30 00:31 it doesn't have this problem 2009-01-30 00:31 blockfork() is ok there 2009-01-30 00:41 yes, it seems to work 2009-01-30 00:41 that was the plan 2009-01-30 00:45 actually, the bitmap flush runs on a different cycle than sb->delta 2009-01-30 00:45 it will be sb->flush 2009-01-30 00:46 because there can be several deltas per bitmap flush 2009-01-30 00:47 i see 2009-01-30 00:47 for now, blockdirty is only used by bitmap 2009-01-30 00:47 yes 2009-01-30 00:47 so we will just make it use sb->flush 2009-01-30 00:49 cursor_redirect will use sb->delta for leaf blocks, and sb->flush for index blocks 2009-01-30 00:50 because leaf blocks are flushed per delta 2009-01-30 00:50 but index blocks are only flushed on rollup 2009-01-30 00:50 both counter should be atomic? 2009-01-30 00:51 maybe 2009-01-30 00:51 probably 2009-01-30 00:51 yes 2009-01-30 00:52 with the current heavy locking, the semaphores probably provide the necessary barriers 2009-01-30 00:52 yes 2009-01-30 01:08 um... 2009-01-30 01:09 if blockget() in user/filemap.c:filemap_extent_io():97 2009-01-30 01:10 we can get wanted delta buffer, we don't need write_bitmap()? 2009-01-30 01:13 umm... 2009-01-30 01:14 bitmap of current commit.c will flush by two stage? 2009-01-30 01:14 one is from stage_delta, and another one is commit_delta? 2009-01-30 01:15 bitmap will be flushed by two stages? 
2009-01-30 01:17 in write_bitmap(), if we hit the -EAGAIN, the buffer is not wrote in write_bitmap() 2009-01-30 01:17 because it was forked in map_region() 2009-01-30 01:17 so, old buffer will be flushed in commit_delta 2009-01-30 01:18 but, it will call map_region() again in commit_delta? 2009-01-30 01:19 right 2009-01-30 01:19 um... 2009-01-30 01:19 a bitmap block that is forked goes on a list and has to be written separately 2009-01-30 01:20 "- initiate writeout for delta dirty list blocks." <- this step 2009-01-30 01:20 it means one buffer can be mapping twice? 2009-01-30 01:21 I hope not 2009-01-30 01:21 um.. 2009-01-30 01:22 map_region() is called create==2 2009-01-30 01:22 so, it seems to redirect 2009-01-30 01:22 it sounds like modifying dleaf twice 2009-01-30 01:22 where is the other modify? 2009-01-30 01:23 one is from stage_delta (-EAGAIN) 2009-01-30 01:23 another one is from commit_delta 2009-01-30 01:24 ah 2009-01-30 01:24 yes, we call map_region for the EAGAIN block, with create == 0, because it must already be mapped 2009-01-30 01:25 next stage_delta should be wait previous commit_delta? 2009-01-30 01:25 yes 2009-01-30 01:26 um... 2009-01-30 01:26 for now, all writeout will complete in the stage delta and next does not have to wait 2009-01-30 01:26 yes 2009-01-30 01:27 if blockget() can take delta parameter, it will be solved? 2009-01-30 01:27 just a idea 2009-01-30 01:27 blockget does not dirty a buffer 2009-01-30 01:27 yes 2009-01-30 01:28 e.g. in filemap_extent_io() 2009-01-30 01:28 it calls map_region() 2009-01-30 01:29 then blockget() and writeout the buffer get from it 2009-01-30 01:29 if this blockget() can take delta parameter, this problem seems not to be happened 2009-01-30 01:33 what would blockget do with the delta parameter? 
2009-01-30 01:33 delta parameter is used to get interesting buffer (has dirty delta counter) 2009-01-30 01:34 I have a slightly different solution 2009-01-30 01:34 simpler I think 2009-01-30 01:34 good 2009-01-30 01:35 that is, the cloned buffer remembers its mapping and index 2009-01-30 01:35 which is used later to walk the delta dirty list, look up the physical address and write it out 2009-01-30 01:36 yes, it's good 2009-01-30 01:36 I was thinking it for other purpose 2009-01-30 01:36 I checked that we can do the same thing in kernel 2009-01-30 01:36 that is, have a page that is removed from a mapping, but still points at the mapping 2009-01-30 01:37 well, maybe somehow I think I don't want to write bitmap buffer two places 2009-01-30 01:37 otherwise, it would be harder to keep track of where the block should be written 2009-01-30 01:37 it isn't written in two places 2009-01-30 01:38 stage_delta and commit_delta for now 2009-01-30 01:39 btw, why we need mapping? 2009-01-30 01:39 the buffer is only written in one place, either in the bitmap flush, or later in the delta dirty block writeout... but it is passed to map_region twice, the first time with create = 2, and later with create = 0 2009-01-30 01:39 mapping? 
2009-01-30 01:40 remembers its mapping and index 2009-01-30 01:40 this "mapping" 2009-01-30 01:40 because otherwise we have no way to know what physical address the forked block should be written to 2009-01-30 01:41 I thought index is meaning physical address 2009-01-30 01:41 for a logical block, it is the logical address 2009-01-30 01:41 at the time it is forked, it may not even have a physical address 2009-01-30 01:42 yes 2009-01-30 01:42 so, I was thinking to cache physical address 2009-01-30 01:42 until writeout 2009-01-30 01:42 that is one way 2009-01-30 01:42 and buffer_head provides a field for it, b_blocknr 2009-01-30 01:42 yes 2009-01-30 01:42 and I was trying to avoid using that field 2009-01-30 01:43 i see 2009-01-30 01:43 with those, it will read from btree? 2009-01-30 01:43 with the mapping field? 2009-01-30 01:43 yes, mapping and index 2009-01-30 01:43 yes, it will look up in the btree 2009-01-30 01:44 i see 2009-01-30 01:44 the btree will be complete then, and because it is create = 0, it will not change 2009-01-30 01:44 yes 2009-01-30 01:44 I'm sorry I did not explain this 2009-01-30 01:44 I had forgotten it myself 2009-01-30 01:45 it is not obvious 2009-01-30 01:45 I thought if we have temporary b_blocknr 2009-01-30 01:45 we could have a temporary b_blocknr to use as an error check 2009-01-30 01:45 if it has, I thought next stage_delta doesn't need to wait previous commit_delta 2009-01-30 01:46 it doesn't have to wait in any case 2009-01-30 01:46 if commit_delta reads btree, btree can't be changed 2009-01-30 01:47 ah, true 2009-01-30 01:47 ok 2009-01-30 01:48 maybe you found a flaw in my plan 2009-01-30 01:48 well, it is still just imaine, I guess we have very good parallerithm 2009-01-30 01:50 we may still be ok 2009-01-30 01:50 bitmap is not flushed per delta, it is flushed per rollup 2009-01-30 01:51 there will never be two rollups in progress at the same time 2009-01-30 01:51 yes 2009-01-30 01:51 i see 2009-01-30 01:52 I think maybe we can 
bitmap also can do some sort of parallel 2009-01-30 01:52 maybe, it is simpler than I thought 2009-01-30 01:53 it would be harder if bitmap was flushed per delta 2009-01-30 01:53 and slower too 2009-01-30 01:54 i see 2009-01-30 01:54 then I think we would need to remember the physical mapping as you suggested 2009-01-30 01:54 i see 2009-01-30 01:55 oh, sorry, time to go 2009-01-30 01:55 thanks for explan 2009-01-30 01:55 ok 2009-01-30 01:55 I will try to have code tomorrow 2009-01-30 01:56 bitmap? 2009-01-30 01:56 stage_delta and rollup flush, including bitmap 2009-01-30 01:56 i see, ok 2009-01-30 01:57 I'm going to change for diry inodes 2009-01-30 01:57 it would be including userland namespace cleanup 2009-01-30 01:57 namespace operations 2009-01-30 01:59 I would appreciate that 2009-01-30 02:00 I planned to write the dirty inode code just like you wrote it, but you already did it 2009-01-30 02:01 yes, I started to write that from your plan 2009-01-30 04:07 -!- pgquiles__(~pgquiles@134.Red-79-146-250.dynamicIP.rima-tde.net) has joined #tux3 2009-01-30 04:07 -!- pgquiles__(~pgquiles@134.Red-79-146-250.dynamicIP.rima-tde.net) has joined #tux3 2009-01-30 05:22 -!- RazvanM(~RazvanM@96.234.239.248) has joined #tux3 2009-01-30 10:47 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-30 14:08 -!- bh_(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-01-30 14:32 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-30 15:38 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-30 18:24 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-30 20:22 -!- vomjom(~vomjom@99-157-248-71.lightspeed.stlsmo.sbcglobal.net) has joined #tux3 2009-01-30 20:55 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-30 21:08 -!- RazvanM(~RazvanM@96.234.239.248) has joined #tux3 2009-01-30 21:16 hey flips 2009-01-30 21:16 ACTION reads the 
backlog 2009-01-30 22:45 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-30 23:09 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-31 01:58 -!- fqh(~fqh@219.131.240.229) has joined #tux3 2009-01-31 06:41 -!- Chip_M(stefanc@apollo.orakel.ntnu.no) has joined #tux3 2009-01-31 08:38 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-31 09:47 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-31 10:17 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-31 13:38 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-31 14:30 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-31 15:24 -!- pgquiles(~pgquiles@81.Red-217-125-197.dynamicIP.rima-tde.net) has joined #tux3 2009-01-31 15:25 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-01-31 19:24 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-01-31 22:16 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-01-31 22:52 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-01 01:49 -!- paola(~paola@ppp-228-19.20-151.libero.it) has joined #tux3 2009-02-01 01:50 -!- paola(~paola@ppp-228-19.20-151.libero.it) has left #tux3 2009-02-01 05:29 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2009-02-01 07:11 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-01 07:22 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-01 07:42 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-01 08:44 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-01 10:02 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-01 13:36 -!- 
pgquiles(~pgquiles@88.Red-83-53-120.dynamicIP.rima-tde.net) has joined #tux3 2009-02-01 14:28 flips: you there ? 2009-02-01 14:29 do you recommend that I really understand how a b-tree works inside out before getting into tux3 development ? 2009-02-01 14:37 bh, it's not necessary, there are lots of bits besides the btree 2009-02-01 14:50 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2009-02-01 15:25 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3 2009-02-01 17:25 rollup and delta commit compiles... now to make it work 2009-02-01 17:37 ok 2009-02-01 17:37 well, I'm done with the first part of the first phase of my scheduler project 2009-02-01 17:38 how many phases are there? 2009-02-01 17:38 time to move on to other things in that project 2009-02-01 17:38 like or 2-3 2009-02-01 17:39 there's a drive that's a fd based timer that I need to glue to the schedule 2009-02-01 17:39 schduler 2009-02-01 17:39 scheduler 2009-02-01 17:39 bah 2009-02-01 17:39 there's a driver that's a fd based timer that I need to glue to the scheduler 2009-02-01 17:39 what does it schedule? replacement for current 2.6 scheduler? 2009-02-01 17:39 no, an EDF policy that's statically order higher than any of the other policies, it's to implement bandwidth allocation 2009-02-01 17:40 the problem with that is that the current scheduler, SCHED_OTHER, isn't able to track time that other policies use 2009-02-01 17:40 so it could be unfair to a certain degree since it's not tracking all fo teh used time properly 2009-02-01 17:41 afte that's done, I'm either going to do EDF based sensitive priority inheritance implementation under -rt or not 2009-02-01 17:41 depends on how I feel and what's in the scope of things that I think are releveant 2009-02-01 17:41 relevant 2009-02-01 17:41 edf? 
2009-02-01 17:41 earliest deadline first 2009-02-01 17:42 rebalancing EDF based policies is very complicated, probably not easily computable at runtime 2009-02-01 17:43 so the use of that has to be restricted, possibly bound to a specific CPU 2009-02-01 17:43 so your realtime tasks run on a specific cpu? 2009-02-01 17:44 it should because rebalancing with determinancy or some kind of bin-pack algorithm is very complicated 2009-02-01 17:45 generally suitable only for precomputed systems 2009-02-01 17:45 yeah, you can use a dynamic algorithm, but you're going to increase the jitter greaterly 2009-02-01 17:45 greatly 2009-02-01 17:45 after that... 2009-02-01 17:46 link the fd timer code to the EDF system, having the fd timer driver control or drive the scheduler, I need to possibly work on the /dev/rtc driver for that or write a new replacement for it that's accurance per cpu aware 2009-02-01 17:46 that's...acccurate and per cpu aware 2009-02-01 17:47 because the current rtc implementation is glue to legacy hardware 2009-02-01 17:49 the final result being the first ever hard realtime scheduler for Linux? And it will also perform fine for traditional scheduling? 2009-02-01 17:49 depending on how complicated it is, I'll either continuing with rtc developement or not 2009-02-01 17:50 flips: -rt is hard real time if you avoid fork bombing the system 2009-02-01 17:50 so an incremental improvement on -rt? 2009-02-01 17:50 with less jitter? 2009-02-01 17:51 no, it's really design aggressive, I'm going to fuck some folks up 2009-02-01 17:51 whee 2009-02-01 17:51 blow some people's minds away and cut the balls of MS at the same time 2009-02-01 17:51 what will it do better? 2009-02-01 17:51 the problem with Linux has always been the "what do we do with this ?" 
after you get much of the kernel working 2009-02-01 17:52 I'm going to introduce bandwidth control for fair novel purpose 2009-02-01 17:52 ressurrecting some old SGI ideas that have been forgotten 2009-02-01 17:53 after I get EDF, a new rtc driver, linking the two so that they can report cycle overruns I have an api to use in usersapce 2009-02-01 17:54 sgi has lots of solid ideas that need ressurecting 2009-02-01 17:55 like realtime disk scheduling 2009-02-01 17:55 then you'll get another final piece when I glue to a OpenGL driver and get Quake to use it 2009-02-01 17:55 it's a lot of work, a dream of mine since 2002 2009-02-01 17:55 but I've been hopping around job to job in the valley and got little or nothing of it done 2009-02-01 17:55 yeah 2009-02-01 17:55 XFS has a lot of that 2009-02-01 17:55 so the idea is to run quake with realtime response? 2009-02-01 17:55 getting a properly allocator helps with the parallel reads 2009-02-01 17:56 yes, because it's the best test program for that purpose that I can think of at this time 2009-02-01 17:56 it's relevant to me (quake) 2009-02-01 17:56 when folks see it and realize it'll never drop a frame compared to either OS X or Win32 2009-02-01 17:56 not matter what the f-ing load 2009-02-01 17:56 NFS load 2009-02-01 17:56 linux is about the worst frame dropper going 2009-02-01 17:56 CPU load 2009-02-01 17:57 yeah, that's because of the scheduler 2009-02-01 17:57 and the lack of sensitivity for these issues 2009-02-01 17:57 and what Con was working on 2009-02-01 17:57 typical game, you drag the window, it stops refreshing 2009-02-01 17:58 right 2009-02-01 17:58 that's partially an X windows issue, but that shouldn't effect OpenGL rendering, it's another system entirely 2009-02-01 17:59 fixing it probably requires rooting out braindamage on several levels 2009-02-01 17:59 like arjan is doing with fast boot 2009-02-01 17:59 right, that's exactly what I'm going to fix and more 2009-02-01 17:59 it'll be the end all for time 
control in a game environment after I get all of this done 2009-02-01 18:00 should send linus a copy of oblivion and get him addicted 2009-02-01 18:00 it won't fix the problems with OpenGL lack of development, it's stalled, but it will make headwaves in the gaming community 2009-02-01 18:00 and a core goal of mmLinux 2009-02-01 18:00 once he actually cares about frame rates and skipping audio, it might get better 2009-02-01 18:01 what's Oblivion ? 2009-02-01 18:01 addictive roleplaying game 2009-02-01 18:01 Linus doesn't understand the technical problem with this kind of stuff 2009-02-01 18:01 game of the year for 2006 2009-02-01 18:01 so somebody has to step up 2009-02-01 18:01 linus is perfectly capabile of understanding, he just doesn't care 2009-02-01 18:01 his minions don't understand it as well 2009-02-01 18:02 na, he doesn't understand, nor does the rest of the regular kernel community 2009-02-01 18:02 you have to do it and show it to them before they start understanding this 2009-02-01 18:03 after all of that's done, it's tux3 development hard core 2009-02-01 18:03 but that's not going to be easy 2009-02-01 18:03 I intend to mod the VM system heavily for this purpose 2009-02-01 18:04 because the Linux VM basically blows 2009-02-01 18:04 overall 2009-02-01 18:04 and the semantics for file systems and VM needs to be separated, they are too different 2009-02-01 18:05 best of luck to you 2009-02-01 18:05 and learning all of that stuff needed for the mods is heavy 2009-02-01 18:05 yeah, I need it, it's a lot of work 2009-02-01 18:06 you never know wtf you'll run into doing this kind of stuff 2009-02-01 18:06 vm doesn't have a decent way of telling filesystems to shrink cache 2009-02-01 18:06 it needs to be multi-queue 2009-02-01 18:06 I'll talk to riel, you and others about it 2009-02-01 18:06 certainly peterz 2009-02-01 18:07 vm doesn't have a way of to a filesystem how much cache is available 2009-02-01 18:07 otherwise, online checking performance is going to 
blow 2009-02-01 18:07 yes, and many other things 2009-02-01 18:07 those are pretty basic things 2009-02-01 18:08 linux has never had them, it's just, implement all the behaviours out on the peripheral components of the os and hope they interact in a stable way 2009-02-01 18:09 yeah, it'll take time to understand the issues 2009-02-01 18:11 if/when I get this -rt work done, it'll take a number of months before folks figure it out 2009-02-01 18:11 and how valuable it is 2009-02-01 18:12 maybe I'll get picked up by Canonical or something like that 2009-02-01 18:16 let's hope somebody recognizes the falue 2009-02-01 18:17 they failed with Con 2009-02-01 18:17 a demo will help 2009-02-01 18:20 I'm not counting on them 2009-02-01 18:20 I flamed the crap out of them nearly 2 years ago regarding Con 2009-02-01 18:21 something budged within tem 2009-02-01 18:21 them in the process, I was just flaming pissed about it and it wasn't even me that was rolled over by that community 2009-02-01 18:58 sounds like superbowl is heating up out there 2009-02-01 20:09 Is there any way to initialize list_head for compound literals? 2009-02-01 20:10 list_head needs the address itself 2009-02-01 20:11 but, compound literal doesn't have name 2009-02-01 20:11 I'm introduced some nice bugs that way 2009-02-01 20:11 I have, I mean 2009-02-01 20:11 how does it do? 
2009-02-01 20:12 actually, it was not in literals, but something like (struct foo){ LIST_INIT(name) } 2009-02-01 20:13 yes, if it has name already, it's no problem 2009-02-01 20:13 struct something *foo = &(struct foo){ .list = LIST_HEAD_INIT(foo->list) } 2009-02-01 20:14 that breaks 2009-02-01 20:14 yes 2009-02-01 20:14 foo is not initialized yet 2009-02-01 20:14 it compiles, but is broken 2009-02-01 20:14 yes 2009-02-01 20:15 it is using uninitialized foo 2009-02-01 20:15 so, for a literal there is no name, and we are saved from such crazy ideas ;) 2009-02-01 20:16 it is not problem usually 2009-02-01 20:16 but, there is no lazy way for &(struct foo) 2009-02-01 20:17 well, for now, I'm finding for sb and inode 2009-02-01 20:18 actually, it is not needed to use &(struct foo) 2009-02-01 20:18 however, workaround seems a bit dirty 2009-02-01 20:19 the usual way is to malloc 2009-02-01 20:19 struct inode __inode = {}; struct inode *inode = &__inode; 2009-02-01 20:19 yes 2009-02-01 20:19 what you wrote is exactly what I do in the user space code in a number of places 2009-02-01 20:20 I try not to let things like that get into production code, it's too obscure 2009-02-01 20:20 I mean, the = {} 2009-02-01 20:20 so in our production code, we are always using our new_xxx functions 2009-02-01 20:20 struct inode __inode = {}; struct inode *inode = &__inode; 2009-02-01 20:20 this? 2009-02-01 20:21 well, yes 2009-02-01 20:22 however, if there is any way for unit test, it would be good 2009-02-01 20:22 for unit tests, being clever and lazy is good 2009-02-01 20:22 the less code to write for a unit test, the more likely it is somebody will write one 2009-02-01 20:22 yes 2009-02-01 20:23 however, I can't find good way for it yet 2009-02-01 20:23 struct inode *inode = ({ struct inode __inode = { ...}; &__inode; }) 2009-02-01 20:24 it can 2009-02-01 20:24 but, not good 2009-02-01 20:24 struct inode *inode = &(struct inode){ ...other fields... 
}; ; init_list_head(&inode->list); 2009-02-01 20:25 doing it as a pure declaration is too hard for me ;) 2009-02-01 20:25 :) 2009-02-01 20:25 my current best way is 2009-02-01 20:25 DEFINE_INODE(name, ...) 2009-02-01 20:26 struct inode *name = ({ struct __ ## name = { ... }; &__##name}) 2009-02-01 20:26 well it looks like typical linux 2009-02-01 20:27 yes 2009-02-01 20:27 it will break if you do &(struct foo){ DEFINE_INODE(...) } 2009-02-01 20:27 which of course I would do 2009-02-01 20:27 it shouldn't be needed 2009-02-01 20:27 because DEFINE_INODE() is 2009-02-01 20:27 it's used at the top level 2009-02-01 20:28 #define DEFINE_INODE(name, ...) ({ struct name = { INIT_INODE()}; }) 2009-02-01 20:28 you're testing the dirty inode list code? 2009-02-01 20:28 actually 2009-02-01 20:28 yes 2009-02-01 20:28 I have written almost all the rest of the flush code except the inode flushing 2009-02-01 20:28 in the above case, it will use INIT_INODE() 2009-02-01 20:29 good 2009-02-01 20:29 INIT_INODE looks good to me 2009-02-01 20:29 ok 2009-02-01 20:29 at least, until found good way, I'll do DEFINE_* stuff 2009-02-01 20:30 we don't actually have to flush every inode on every delta 2009-02-01 20:30 (by the way) 2009-02-01 20:30 why? 2009-02-01 20:31 I think it is including ->i_size 2009-02-01 20:31 if the delta was not caused by a sync, then the user does not care whether the data is flushed or not 2009-02-01 20:32 when it is flushed, the i_size has to be consistent with the written data of course 2009-02-01 20:32 anyway, we are going to flush every inode on every delta for now 2009-02-01 20:32 ah, i see 2009-02-01 20:32 it will be like a sync on every delta 2009-02-01 20:32 ok 2009-02-01 20:34 ok, I will try to post a patch for flushing in about an hour 2009-02-01 20:34 ok 2009-02-01 20:53 hirofumi, why did we have a problem in kernel with mark_buffer_dirty in new_block? 
2009-02-01 20:54 mark_buffer_dirty requires buffer is uptodate 2009-02-01 20:54 ah, no 2009-02-01 20:54 it is just unnecessary for pdflush 2009-02-01 20:55 btw, blockget garantee buffer is uptodate 2009-02-01 20:55 buffer is marked as uptodate 2009-02-01 20:55 even if it's not uptodate actually 2009-02-01 20:56 blockget should leave the buffer uptodate only if it found an uptodate buffer 2009-02-01 20:56 yes 2009-02-01 20:57 but, we don't set buffer as uptodate 2009-02-01 20:57 if we marks it as mark_buffer_dirty 2009-02-01 20:58 we should define our own set_buffer_dirty that does both mark_buffer_dirty and set_buffer_uptodate 2009-02-01 20:58 well, it can 2009-02-01 20:58 but, it should be more carefully 2009-02-01 20:59 set_buffer_uptodate needs memory barrier actually 2009-02-01 20:59 ah 2009-02-01 20:59 to interact properly with buffer_endio I suppose 2009-02-01 21:00 I think anywhere is checking buffer_uptodate 2009-02-01 21:01 cpu1 cpu2 2009-02-01 21:01 init buffer 2009-02-01 21:01 buffer_uptodate() 2009-02-01 21:01 read buffer 2009-02-01 21:01 set_buffer_uptodate() 2009-02-01 21:01 ugh 2009-02-01 21:01 cpu1 cpu2 2009-02-01 21:01 init buffer 2009-02-01 21:01 buffer_uptodate() 2009-02-01 21:01 read buffer 2009-02-01 21:01 set_buffer_uptodate() 2009-02-01 21:02 if set_buffer_uptodate() bit was leaking to cpu2 before "init buffer" was completed 2009-02-01 21:02 cpu2 will see uninitialized data 2009-02-01 21:03 and why current one is ok? 2009-02-01 21:03 nice example 2009-02-01 21:04 we are locking the blockget() buffer until pointer to the buffer was completed 2009-02-01 21:04 e.g. 
if it was directory, it is locked by i_mutex 2009-02-01 21:05 there is no any reader until i_mutex was unlocked 2009-02-01 21:05 for now 2009-02-01 21:05 yes 2009-02-01 21:05 and it is true for btree buffers 2009-02-01 21:05 I think lock is btree->lock 2009-02-01 21:06 yes 2009-02-01 21:06 well, anyway, we should rethink with memory barrier later 2009-02-01 21:07 why does pdflush not care if our new block is dirty? 2009-02-01 21:07 it uses lock_page to write 2009-02-01 21:08 it just sees a dirty page then, and writes it? 2009-02-01 21:08 ah 2009-02-01 21:08 if new block is dirty, pdflush will write it 2009-02-01 21:11 anyway, we only use new_block for new_leaf and new_node, and we mark those blocks dirty right away 2009-02-01 21:11 so, mark_buffer_dirty should always be ok in new_block 2009-02-01 21:12 yes 2009-02-01 21:13 all mark_buffer_dirty will be rethinked for atomic commit 2009-02-01 21:13 current mark_buffer_dirty is for pdflush almost 2009-02-01 21:14 ok, I will just leave the ifdef KERNEL there for now 2009-02-01 21:14 ifndef KERNEL 2009-02-01 21:14 I was thinking if we start to flush by atomic commit, it will be changed 2009-02-01 21:15 yes 2009-02-01 21:15 I think we started 2009-02-01 21:15 I think so :) 2009-02-01 21:15 so, I think change is ok 2009-02-01 21:16 well, change at a time would be good though 2009-02-01 21:16 change all mark_buffer_dirty 2009-02-01 21:16 I'm leaving it right now 2009-02-01 21:16 ok 2009-02-01 21:41 #define rapid_sb(dev, init_defs...) 
({ \ 2009-02-01 21:41 struct sb *__sb = &(struct sb){}; \ 2009-02-01 21:41 *__sb = (struct sb){ \ 2009-02-01 21:41 INIT_SB(*__sb, dev), \ 2009-02-01 21:41 init_defs \ 2009-02-01 21:41 }; \ 2009-02-01 21:41 __sb; \ 2009-02-01 21:41 }); 2009-02-01 21:41 I can't see why this works 2009-02-01 21:41 but, it seems to work 2009-02-01 21:42 scope of "&(struct sb){}" is unclear 2009-02-01 21:42 good thing lkml doesn't get to review it ;) 2009-02-01 21:43 :) 2009-02-01 21:48 this is what I have come up with: the volmap->dirty list will only have btree leaf nodes on it, which are put there by cursor_redirect and new_leaf 2009-02-01 21:49 cursor_redirect puts dirty btree index nodes on the sb->pinned list 2009-02-01 21:49 the leaf nodes are flushed per delta, and the index nodes are flushed per rollup 2009-02-01 21:50 i see 2009-02-01 21:50 we have sb->pinned that will be flushed per rollup and sb->commit that is flushed per delta 2009-02-01 21:51 no arrays of list heads at the moment 2009-02-01 21:51 I originally intended to use arrays of four list heads, for list of blocks belonging to different deltas, for the flush pipeline 2009-02-01 21:52 but with only one delta or flush active at a time, that is just confusing 2009-02-01 21:52 ok 2009-02-01 21:52 so now there is just sb->pinned and sb->commit, and they can be generalized to arrays later 2009-02-01 21:53 yes 2009-02-01 21:53 btw, I was thinking to remove inode->map->dirty 2009-02-01 21:53 it may be able to delta dirty state list 2009-02-01 21:54 it really needs to be inode->dirty 2009-02-01 21:54 I am also using the per-inode list of dirty blocks 2009-02-01 21:54 until map_region was called? 
2009-02-01 21:55 the inode dirty list would only be for metadata blocks 2009-02-01 21:55 not data blocks 2009-02-01 21:55 using it for data blocks would require buffer heads on all the data blocks, which we don't want 2009-02-01 21:56 but for directory inodes, the bitmap inode, and volmap, we need a list of dirty blocks 2009-02-01 21:57 i see 2009-02-01 21:57 well, I thought it may become simple the buffer list 2009-02-01 21:57 forked buffer is on delta dirty state list 2009-02-01 21:57 just on buffer list for all dirty buffers? 2009-02-01 21:58 non-forked buffer is on map->dirty 2009-02-01 21:58 yes, that is the case in my prototype code 2009-02-01 21:58 it can be changed later? 2009-02-01 21:59 of course 2009-02-01 21:59 it compiles, I should post it ;) 2009-02-01 21:59 it still has some mistakes and omissions, but I will post it anyway 2009-02-01 22:00 well, this is just a idea with physical address cache 2009-02-01 22:00 I am not sure what you meant by "remove inode->map->dirty" 2009-02-01 22:01 if we cached the physical address to write temporary, maybe we can make tree radix tree indexed by physical address 2009-02-01 22:01 I was thinking to make this tree 2009-02-01 22:01 for write buffers 2009-02-01 22:01 ah, I am still trying to avoid caching the physical address 2009-02-01 22:02 yes, it is just a possibility in future 2009-02-01 22:02 but a tree might be good for this writeout, we can use it to sort the buffers to be written, then make few calls to balloc and submit_bio 2009-02-01 22:02 yes, exactly 2009-02-01 22:03 I was thinking it 2009-02-01 22:03 I thought about the same thing 2009-02-01 22:03 it seems like a useful optimization, and a lot of new code 2009-02-01 22:04 yes, maybe radix tree can be used for it 2009-02-01 22:04 and lookup_gang will get buffers (or pages) 2009-02-01 22:04 that seems more appealling than having a special purpose structure 2009-02-01 22:05 yes 2009-02-01 22:05 another useful technique is a list sort 2009-02-01 22:06 list 
sort? 2009-02-01 22:06 merge sort on a list, very efficient 2009-02-01 22:06 oh, i see 2009-02-01 22:07 however, list should be sorted before it? 2009-02-01 22:07 sorting a random list can be efficient 2009-02-01 22:07 for some reason, you don't often see that used 2009-02-01 22:08 i see 2009-02-01 22:08 more efficient that insertion sort into a tree 2009-02-01 22:08 oh 2009-02-01 22:08 both are n * log(n), but sorting the list can be done with just on elock 2009-02-01 22:08 lock 2009-02-01 22:09 it can be used with list_head? 2009-02-01 22:09 somebody has to write the sort 2009-02-01 22:09 I have written such a sort before 2009-02-01 22:09 i see 2009-02-01 22:10 it is documented in knouth? 2009-02-01 22:11 Knuth 2009-02-01 22:11 I haven't seen it 2009-02-01 22:12 you take two elements off the list, make one element lists out of them by setting tail, then merge the two lists, put the result on a stack 2009-02-01 22:12 then do it again, now you have two short, sorted lists on the stack, so merge them 2009-02-01 22:13 put back on the stack. Repeat, every time you have two lists the same length, merge them 2009-02-01 22:16 i see, merge sort 2009-02-01 22:16 yes 2009-02-01 22:17 the last step is to merge all the lists on the stack, regardless of size 2009-02-01 22:17 you would think this would be a common algorithm, but I have not seen an example in C 2009-02-01 22:17 I wrote that in Forth 2009-02-01 22:18 insert to list until merge (after map_region?), then pull from list, merge them? 
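[Editor's sketch] The stack-based list merge sort described above can be written in C roughly as follows. This is a minimal sketch under stated assumptions, not code from tux3: `struct node`, `merge` and `list_sort` are hypothetical names, the list is singly linked and NULL-terminated, and elements are pushed one at a time rather than in pairs, which gives the same "merge whenever two lists have the same length" behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical singly linked, NULL-terminated node; not a tux3 structure. */
struct node {
	struct node *next;
	int key;
};

/* Merge two sorted lists. Ties are taken from 'a', so the sort is stable. */
static struct node *merge(struct node *a, struct node *b)
{
	struct node *head = NULL, **tail = &head;

	while (a && b) {
		struct node **pick = (b->key < a->key) ? &b : &a;
		*tail = *pick;			/* append the smaller head */
		tail = &(*pick)->next;
		*pick = (*pick)->next;		/* advance that input list */
	}
	*tail = a ? a : b;			/* append the leftover run */
	return head;
}

#define MAXSTACK 64	/* run lengths double, so this is ample */

struct run {
	struct node *list;
	size_t len;
};

static struct node *list_sort(struct node *list)
{
	struct run stack[MAXSTACK];
	int top = 0;

	while (list) {
		/* Detach the first element as a one-element sorted list. */
		struct node *one = list;
		list = list->next;
		one->next = NULL;
		stack[top++] = (struct run){ one, 1 };

		/* Merge while the two topmost runs have the same length. */
		while (top > 1 && stack[top - 2].len == stack[top - 1].len) {
			stack[top - 2].list = merge(stack[top - 2].list,
						    stack[top - 1].list);
			stack[top - 2].len += stack[top - 1].len;
			top--;
		}
	}

	/* Last step: merge everything left on the stack, regardless of size. */
	while (top > 1) {
		stack[top - 2].list = merge(stack[top - 2].list,
					    stack[top - 1].list);
		top--;
	}
	return top ? stack[0].list : NULL;
}
```

Because the runs on the stack are always sorted and their lengths shrink toward the top, the whole sort touches only the list being sorted, which is what makes it possible to do under a single lock.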
2009-02-01 22:20 I was thought the sort would be useful for metadata, but I have not thought about it much 2009-02-01 22:20 i see 2009-02-01 22:20 for now, I just try to process things in order, and not scramble the lists 2009-02-01 22:20 well, when we tackle it, I'll benchmark if needed 2009-02-01 22:21 it seems pure optimization process 2009-02-01 22:21 yes, it's like relaxation 2009-02-01 22:21 what I am doing right now is not relaxing at all 2009-02-01 22:22 I will just check my diff once more and post a draft 2009-02-01 22:22 ok 2009-02-01 22:25 inode dirty list seems to be starting to work 2009-02-01 22:36 draft is posted 2009-02-01 22:36 ok 2009-02-01 22:47 I'm not sure whether this is good or not 2009-02-01 22:47 need more time to read 2009-02-01 22:47 however, I think it's ok to push to public 2009-02-01 22:50 in stage_delta(), loop is checking the sb->lognext 2009-02-01 22:51 however, it seems not to update sb->lognext 2009-02-01 22:52 sb->prevlog? 2009-02-01 22:54 ok, I will work with it a little more before pushing 2009-02-01 22:55 ok 2009-02-01 22:55 and it seems sb->logthis is not used yet 2009-02-01 22:55 logthis is not setted yet, but it is used 2009-02-01 22:56 sb->lognext is updated by log_begin 2009-02-01 22:56 well 2009-02-01 22:56 it does have to start a new log block, yes 2009-02-01 22:56 at delta transition 2009-02-01 22:57 need to set logthis 2009-02-01 22:58 + for (unsigned logblock = sb->logthis; sb->lognext; logblock++) { 2009-02-01 22:58 in stage_delta 2009-02-01 22:58 right, logthis needs to be set to lognext after the delta transition, and on replay 2009-02-01 22:59 this sb->lognext is right? 
2009-02-01 23:00 no :) 2009-02-01 23:00 ok :) 2009-02-01 23:02 for (unsigned index = sb->logthis; index < sb->lognext; index++) { 2009-02-01 23:02 i see 2009-02-01 23:02 and: blockget(mapping(sb->logmap), index); 2009-02-01 23:02 , block) was wrong 2009-02-01 23:03 ah, i see 2009-02-01 23:03 also, need log_finish() 2009-02-01 23:07 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-01 23:11 it needs list initializer 2009-02-01 23:11 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-02-01 23:12 so, this would be useful 2009-02-01 23:12 inode dirty list 2009-02-01 23:12 and namespace op cleanup 2009-02-01 23:12 reading 2009-02-01 23:16 ah, we never had expanding truncate before 2009-02-01 23:17 yes 2009-02-01 23:17 well, since kernel is working, it may work 2009-02-01 23:19 rapid_* is tricky 2009-02-01 23:20 and mark_inode_dirty was introduced, but it is not using for now except dir.c 2009-02-01 23:20 and I can try it in the commit unit test 2009-02-01 23:21 yes, list can be initialized by chaning INIT_* 2009-02-01 23:21 it is including new_inode() 2009-02-01 23:22 single linked lists with null termination are easier to initialize 2009-02-01 23:22 yes 2009-02-01 23:23 we might change to that later 2009-02-01 23:23 well 2009-02-01 23:23 it's not a queue 2009-02-01 23:23 and therefor gets the entries backwards 2009-02-01 23:24 making it circular, i.e., flink, makes it messy to initialize again 2009-02-01 23:24 so some things are just messy 2009-02-01 23:26 just add FLINK_HEAD_INIT to INIT_SB? 2009-02-01 23:28 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-02-01 23:31 I am ok with double linked lists for now 2009-02-01 23:32 ah, inode dirty list? 
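[Editor's sketch] The corrected `stage_delta` loop quoted earlier in this exchange treats `logthis` and `lognext` as a half-open range of log blocks. The sketch below illustrates only that bookkeeping. `struct logstate` is a hypothetical stand-in for the two `struct sb` fields, and the bodies of `log_begin` and `stage_delta` are assumptions pieced together from the chat ("lognext is updated by log_begin", "logthis needs to be set to lognext after the delta transition"), not the real functions.

```c
#include <assert.h>

/* Hypothetical stand-in for the two log fields of struct sb. */
struct logstate {
	unsigned logthis;	/* first log block of the current delta */
	unsigned lognext;	/* one past the last log block in use */
};

/* Assumption: starting a new log block simply advances lognext. */
static unsigned log_begin(struct logstate *sb)
{
	return sb->lognext++;
}

/*
 * Visit the log blocks [logthis, lognext) of this delta, then open the
 * next delta's window. Returns how many blocks were visited.
 */
static unsigned stage_delta(struct logstate *sb, void (*visit)(unsigned index))
{
	unsigned count = 0;

	/* The bound must be 'index < lognext', not just 'lognext'. */
	for (unsigned index = sb->logthis; index < sb->lognext; index++) {
		if (visit)
			visit(index);
		count++;
	}
	sb->logthis = sb->lognext;	/* delta transition */
	return count;
}
```

With the original condition `sb->lognext`, the loop would run for as long as `lognext` is non-zero instead of stopping at the end of the window.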
2009-02-01 23:34 yes 2009-02-01 23:34 assoc_buffers is a double linked list 2009-02-01 23:34 which we will use for out dirty metadata block list 2009-02-01 23:35 yes 2009-02-01 23:38 -!- pgquiles_(~pgquiles@88.Red-83-53-120.dynamicIP.rima-tde.net) has joined #tux3 2009-02-01 23:50 hirofumi, it looks good for a pull 2009-02-01 23:50 thanks 2009-02-01 23:53 pushed to public 2009-02-01 23:57 thanks 2009-02-01 23:58 inode dirty list should use link_*? 2009-02-01 23:59 maybe later 2009-02-01 23:59 if it's prefer, I'll convert it with next patch 2009-02-01 23:59 it is fine as it is 2009-02-01 23:59 ok 2009-02-01 23:59 the commit/replay work is most important 2009-02-01 23:59 yes 2009-02-02 00:00 the commit is close to functioning I think 2009-02-02 00:00 well, it is just s/list/link/ 2009-02-02 00:00 :) 2009-02-02 00:00 if you want to, fine 2009-02-02 00:00 is it that simple? 2009-02-02 00:01 yes 2009-02-02 00:01 it would need link_del_init though 2009-02-02 00:03 I think it would be better to leave that change for later, even though small 2009-02-02 00:03 ok 2009-02-02 01:59 user/commit runs without segfaulting 2009-02-02 02:00 tomorrow, need to see what it does right and wrong 2009-02-02 02:11 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-02-02 07:49 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-02-02 07:51 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-02 10:04 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-02 11:32 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-02 13:41 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-02 14:35 hey flips 2009-02-02 14:35 hi 2009-02-02 14:35 how's atomic commits going ? 
2009-02-02 14:35 ACTION was reading about b-tree rebalancing yesterday 2009-02-02 14:36 http://mailman.tux3.org/pipermail/tux3/2009-February/000703.html 2009-02-02 14:36 http://hg.tux3.org/tux3/rev/135a952822ad 2009-02-02 14:37 ACTION reads 2009-02-02 14:40 nice 2009-02-02 14:40 have you decided in your head as to what gets logged ? definitely some metadata, all b-tree changes ? 2009-02-02 14:41 allocation map 2009-02-02 14:52 yes, in detail 2009-02-02 14:52 all allocations and changes to btree index nodes are logged 2009-02-02 14:54 the log is immediately written to the disk ? 2009-02-02 14:54 do you have the log format yet ? 2009-02-02 14:54 I would think so 2009-02-02 14:54 yes 2009-02-02 14:54 see kernel/log.c 2009-02-02 14:54 ok 2009-02-02 14:54 the log is written to disk per delta 2009-02-02 14:55 what about log replaying ? 2009-02-02 14:55 there's a prototype in user/commit.c 2009-02-02 14:57 ok 2009-02-02 15:16 I'd imagine that this need to hook up with the mount code and needs to validate that portion of the log before committing 2009-02-02 15:16 the replay is purely in kernel 2009-02-02 15:16 mount in fact is pure kernel 2009-02-02 15:17 so no userspace testing then ? is that what you're trying to say ? 
2009-02-02 20:16 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-02 22:11 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-02 22:35 bh, no, not what I meant 2009-02-02 22:35 what I meant is, when running in kernel, the mount also runs in kernel 2009-02-03 00:08 -!- flips(~phillips@phunq.net) has joined #tux3 2009-02-03 03:15 flips: ok 2009-02-03 07:33 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-03 07:52 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-02-03 08:15 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-03 08:53 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-03 08:58 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-02-03 09:54 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-03 17:19 flips, there? 2009-02-03 17:24 http://userweb.kernel.org/~hirofumi/io/bio-queue.patch 2009-02-03 17:24 how about this idea? 2009-02-03 17:24 this uses the pointer to data instead of buffer_head 2009-02-03 17:25 because we are interesting only data to write, not buffer 2009-02-03 17:25 so, it uses bio to correct data to write, then write it out 2009-02-03 17:26 and buffer_head will be freed later on sb->flush or delta dirty state was written 2009-02-03 17:27 I hope map->io can be generic with this 2009-02-03 17:27 and it can make the one bio per extent 2009-02-03 17:30 anyway, if this works, I guess we don't need much care about dirty state 2009-02-03 18:06 hirofumi, hi 2009-02-03 18:06 hi 2009-02-03 18:06 what do you think this patch? 
2009-02-03 18:07 I think it is not so bad 2009-02-03 18:07 yes, good 2009-02-03 18:07 reading again 2009-02-03 18:08 I think this type of write_bitmap can be used all inodes except volmap 2009-02-03 18:09 one issue is the passed buffer must be stable against bufferfork until bufdata(buffer) 2009-02-03 18:10 probably, vecio can be used instead of bio_alloc/bio_add_data 2009-02-03 18:10 i.e. between grabing buffer from map->dirty and bufdata(buffer) 2009-02-03 18:11 probably, yes 2009-02-03 18:11 I guess there is no big difference 2009-02-03 18:11 and we can easily write a vecio for userspace 2009-02-03 18:12 to make more of the code portable between userspace and kernel 2009-02-03 18:12 except it will not be async in userspace 2009-02-03 18:13 unless we want to do a lot of fiddling with aio 2009-02-03 18:13 ok, the stable against fork issue 2009-02-03 18:14 yes 2009-02-03 18:14 in my mind, it was including a bit about io tree 2009-02-03 18:14 or list or something 2009-02-03 18:14 bio has physical address 2009-02-03 18:15 we can make sorted list or something with it 2009-02-03 18:15 well, it doesn't matter at least for now 2009-02-03 18:16 vecio() should work well 2009-02-03 18:16 so you have ported the bio layer to user space :) 2009-02-03 18:17 :) 2009-02-03 18:17 it's a pretty simple thing, except for the async part 2009-02-03 18:17 well, the intent of this is to forget about the detail of dirty state on io path 2009-02-03 18:18 and make map->io generic 2009-02-03 18:18 we own bi_next? 2009-02-03 18:19 I think so 2009-02-03 18:19 seems to be last time I looked, it was only trivially used in the block layer 2009-02-03 18:19 until submit_bio 2009-02-03 18:19 yes 2009-02-03 18:19 we own it for sure till then 2009-02-03 18:19 yes 2009-02-03 18:19 and every other field 2009-02-03 18:20 almost 2009-02-03 18:20 ok, so the idea is... write_bitmap just links up a lot of bios, then they are submitted later? 
2009-02-03 18:20 yes 2009-02-03 18:21 well, delay submit seems not important 2009-02-03 18:21 what is the advantage of not submitting them right away 2009-02-03 18:21 right 2009-02-03 18:22 yes 2009-02-03 18:22 so we only have to worry about asynchronous fork between the blockget and bufdata -> attach to bio 2009-02-03 18:22 yes 2009-02-03 18:23 lock_buffer I think 2009-02-03 18:23 I guess it is true on another way 2009-02-03 18:24 I used lock_buffer in fork_buffer, it seems to be natural 2009-02-03 18:25 for bitmap, we may not call balloc() and bfree() asynchronously 2009-02-03 18:25 I guess it can be only ->writepage 2009-02-03 18:25 not now 2009-02-03 18:25 ah 2009-02-03 18:26 I mean, yes, we do not call asynchronously 2009-02-03 18:26 maybe, lock_page or lock_buffer 2009-02-03 18:26 ok, I've misread 2009-02-03 18:27 it becomes asynchronous when we separate map_region from submit_bio 2009-02-03 18:27 and for now, I guess we use ->i_mutex for other inodes? 2009-02-03 18:28 right, it is very un-asynchronous right now 2009-02-03 18:28 which is good 2009-02-03 18:28 ok 2009-02-03 18:28 so, for now, I think we don't have the above fork issue 2009-02-03 18:28 true 2009-02-03 18:29 if it become asynchronous, we can use lock_page or lock_buffer 2009-02-03 18:29 lock_buffer seems to be the right now 2009-02-03 18:29 lock_page is only used to find the buffers 2009-02-03 18:30 iirc, lock_page is in blockdirty 2009-02-03 18:32 ah, lock_buffer seems good 2009-02-03 18:32 blockdirty takes it just to check the dirty state 2009-02-03 18:32 then drops it to call fork_buffer, which seems sloppy 2009-02-03 18:33 yes 2009-02-03 18:33 fork_buffer immediately locks it again 2009-02-03 18:33 yes 2009-02-03 18:33 the two functions are not really separate 2009-02-03 18:34 well, and I think this atomic state check is needed for any other ways 2009-02-03 18:34 then the lock_page is help for the full walk across the page buffers, which seems necessary 2009-02-03 18:34 yes 2009-02-03 18:34 
lock_buffer is done inside lock_page, consistent with other kernel usage 2009-02-03 18:35 good 2009-02-03 18:36 and there will be some spinlock nested inside lock_buffer for list operations 2009-02-03 18:36 I didn't know exactly what lists there would be, so I didn't write that 2009-02-03 18:37 anyway, it doesn't even need lock_buffer right now 2009-02-03 18:37 but it doesn't hurt 2009-02-03 18:37 yes 2009-02-03 18:37 I found the bug of my patch 2009-02-03 18:37 I'm forgetting about forked buffer 2009-02-03 18:38 after the walk across the dirty bitmap list, there will be a few forked buffers on the sb->flush list 2009-02-03 18:38 it can not be happen for now though 2009-02-03 18:38 sure it can 2009-02-03 18:39 it happens in my simplest test, on the very first bitmap block flushed 2009-02-03 18:39 ah, yes it can, however, that patch writes all buffers 2009-02-03 18:39 ah, no problem 2009-02-03 18:40 so that is the only purpose of fork for now 2009-02-03 18:40 yes 2009-02-03 18:40 balloc is called only write_bitmap path 2009-02-03 18:41 so, I think there is no unexcepted fork 2009-02-03 18:41 all fork should be after bufdata(buffer) 2009-02-03 18:42 no 2009-02-03 18:42 it can happen 2009-02-03 18:43 um... 2009-02-03 18:43 we are writing a snapshot of the bitmap data 2009-02-03 18:43 so we call bufdata before map_region 2009-02-03 18:43 and even if the buffer changes because of fork, the bufdata stays the same 2009-02-03 18:43 that is the point of fork 2009-02-03 18:43 yes 2009-02-03 18:44 however, we are walking map->dirty list 2009-02-03 18:44 I think this list is not fork safe 2009-02-03 18:44 as in, the list may change while we walk it? 
2009-02-03 18:44 yes 2009-02-03 18:45 it would be better to remove the entire list as the first step 2009-02-03 18:45 ah, i see 2009-02-03 18:47 in kernel, we will have to check for completions 2009-02-03 18:47 so I was thinking, we remove elements from the dirty list and add to completion list 2009-02-03 18:48 good 2009-02-03 18:50 anyway, there are two small things needed to complete the delta writeout: 1) handle btree node splits 2) update the superblock 2009-02-03 18:51 i see 2009-02-03 18:51 for btree node splits, we log the position of the split, and replay will redo the split 2009-02-03 18:51 i see 2009-02-03 18:51 the alternative to this is to write out the split blocks in full, which seems harder to prove correct and less efficient 2009-02-03 18:52 the only reason I didn't code the superblock update is, our superblock code is a little messy 2009-02-03 18:53 it's time to decide whether the disksuper should be kept around in a buffer, or kept separately kmalloced 2009-02-03 18:53 I remember you said it will be replaced by some fixed blocks 2009-02-03 18:53 yes 2009-02-03 18:53 it will 2009-02-03 18:53 but we need to update it now, because it's what we have 2009-02-03 18:54 I would like to present for review before improving the way we store the pointer to the log 2009-02-03 18:54 i see 2009-02-03 18:55 so for now, we will just do a lot of stores into the superblock with no cleverness or special protection 2009-02-03 18:55 it means before some fixed blocks? 2009-02-03 18:55 and nothing bad will happen :) 2009-02-03 18:55 ok :) 2009-02-03 18:56 -!- cam(~cam@203-219-255-75.tpgi.com.au) has joined #tux3 2009-02-03 18:56 or always in log? 
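[Editor's sketch] The suggestion above to remove the entire dirty list as the first step, so that a fork cannot change the list while it is being walked, might look like this. It is a sketch only: the list type imitates the kernel's circular `struct list_head`, but `list_init`, `list_is_empty`, `list_take_all` and `list_count` are hypothetical helpers written for this example, not kernel or tux3 functions.

```c
#include <assert.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

static int list_is_empty(const struct list_head *head)
{
	return head->next == head;
}

static void list_add_tail(struct list_head *item, struct list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Move every entry of 'from' onto 'to' in order, leaving 'from' empty. */
static void list_take_all(struct list_head *from, struct list_head *to)
{
	list_init(to);
	if (list_is_empty(from))
		return;
	to->next = from->next;
	to->prev = from->prev;
	to->next->prev = to;
	to->prev->next = to;
	list_init(from);
}

static unsigned list_count(const struct list_head *head)
{
	unsigned n = 0;

	for (const struct list_head *p = head->next; p != head; p = p->next)
		n++;
	return n;
}
```

After `list_take_all`, the writer walks its private batch while any buffer dirtied in the meantime lands on the now-empty shared list and is picked up by the next flush.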
2009-02-03 18:57 we need a pointer to the latest log commit, from the disksuper 2009-02-03 18:57 ah :) 2009-02-03 18:57 I will add that right now 2009-02-03 18:57 ok 2009-02-03 18:59 be_u64 logchain; /* Most recent delta commit block */ 2009-02-03 18:59 i see 2009-02-03 19:00 my patch has bug 2009-02-03 19:01 about sb->pinned 2009-02-03 19:05 ok, right now disksuper is a block in the volmap, and is flushed by sync_inode_pages 2009-02-03 19:05 ok 2009-02-03 19:06 you wrote that 2009-02-03 19:06 ah, yes 2009-02-03 19:07 probably, we now want to use syncio to transfer it, and it doesn't really matter if it's mapped into any mapping 2009-02-03 19:08 it can 2009-02-03 19:09 it's a little funny to see vol_bread in write_super 2009-02-03 19:10 yes 2009-02-03 19:10 we have sb_bread in tux_load_sb and vol_bread in tux_write_super 2009-02-03 19:10 I guess we have not paid much attention to this :) 2009-02-03 19:10 it was my lazyness 2009-02-03 19:10 yes, this isn't the most exciting part 2009-02-03 19:11 I didn't decide it is pinned or not 2009-02-03 19:11 I think it's entirely owned by commit_delta 2009-02-03 19:12 yes 2009-02-03 19:12 probably, it's cleaner to initialize the magic in pack_sb 2009-02-03 19:12 right now, the magic is only ever initialized by mkfs 2009-02-03 19:13 yes, well it was copied from original :) 2009-02-03 19:13 anyway, we should have a function that is the same in userspace ane kernel for writing the original 2009-02-03 19:14 for writing the superblock I meant 2009-02-03 19:14 -!- cam(~cam@203-219-255-75.tpgi.com.au) has joined #tux3 2009-02-03 19:15 tux_save_sb isn't used yet 2009-02-03 19:15 sync_super? 
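[Editor's sketch] The `be_u64 logchain` field added above is stored big-endian on disk. As an illustration of what packing and unpacking such a field involves, here is a byte-wise sketch. The type `be_u64_sketch` and the helpers `pack_be64` and `unpack_be64` are hypothetical and are not claimed to match the tux3 definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical big-endian 64-bit field, stored as raw bytes. */
typedef struct {
	unsigned char bytes[8];
} be_u64_sketch;

/* Store the most significant byte first, whatever the host byte order. */
static be_u64_sketch pack_be64(uint64_t value)
{
	be_u64_sketch out;

	for (int i = 0; i < 8; i++)
		out.bytes[i] = (unsigned char)(value >> (56 - 8 * i));
	return out;
}

static uint64_t unpack_be64(be_u64_sketch in)
{
	uint64_t value = 0;

	for (int i = 0; i < 8; i++)
		value = (value << 8) | in.bytes[i];
	return value;
}

/* Hypothetical one-field super, only to show where logchain would live. */
struct disksuper_sketch {
	be_u64_sketch logchain;	/* most recent delta commit block */
};
```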
2009-02-03 19:15 in sync_super 2009-02-03 19:15 ah right 2009-02-03 19:15 good 2009-02-03 19:15 the traditional name 2009-02-03 19:17 well, anyway, we need to clean sb->disksuper up 2009-02-03 19:17 yes 2009-02-03 19:17 the immediate need is to have a sync_super that works both in kernel and userspace 2009-02-03 19:18 I was thinking before, I trying to use buffer_head or page instead of sb->super 2009-02-03 19:18 sync_super? 2009-02-03 19:18 save_sb? 2009-02-03 19:18 either one 2009-02-03 19:18 memcpy(&tux_sb(sb)->super, bufdata(bh), sizeof(tux_sb(sb)->super)); <- let's do something about this 2009-02-03 19:19 or just keep bh until unmount? 2009-02-03 19:19 so, we currently have a disksuper nested inside struct sb, which is only ever accessed by pack_sb and unpack_sb 2009-02-03 19:19 I think 2009-02-03 19:20 yes 2009-02-03 19:20 do we ever use sb->super? 2009-02-03 19:20 no 2009-02-03 19:21 it is used only for testing and debug 2009-02-03 19:21 so the first obvious cleanup is, remove it from struct sb 2009-02-03 19:22 next thing is, there is no reason for struct super to be in a buffer 2009-02-03 19:22 not that I know of 2009-02-03 19:22 yes 2009-02-03 19:22 the only reason for that ever is to use sb_bread, and even that is bogus because the size of the super is not necessarily the filesystem blocksize 2009-02-03 19:23 well, many fs does overwrite 2009-02-03 19:23 so, I think what we really want is struct sb { struct disksuper *super; 2009-02-03 19:23 so, buffer_head is convinience 2009-02-03 19:24 yes 2009-02-03 19:24 we don't need it 2009-02-03 19:24 I think having the disksuper in a buffer just adds confusion 2009-02-03 19:24 it isn't really a filesystem block 2009-02-03 19:25 well, it wouldn't matter 2009-02-03 19:26 not very much, which is why it is hard to fix :) 2009-02-03 19:26 maybe, we allocate in write or keep memory 2009-02-03 19:26 yes, kmalloc disksuper in fill_super and free in shutdown 2009-02-03 19:27 alloc_page? 
2009-02-03 19:27 sure 2009-02-03 19:28 ok, so we keep page instead of ->super 2009-02-03 19:28 and initialize it same way 2009-02-03 19:28 write_super will update the changable fileds only 2009-02-03 19:29 yes, even though that does not matter much 2009-02-03 19:30 yes, just to avoid some constants values to be introduced 2009-02-03 19:30 tux_load_sb will do unpack_sb(tux_sb(sb), sb->disksuper, iroot, silent); 2009-02-03 19:31 yes, something like it 2009-02-03 19:32 tux3_write_super can actually write it, synchronously for now 2009-02-03 19:33 with syncio 2009-02-03 19:33 yes 2009-02-03 19:36 actually, it can just call tux_save_sb(sb), which we can also have for userspace 2009-02-03 19:36 so there we are: the immediate need is just to write tux_save_sb for user and kernel 2009-02-03 19:37 yes 2009-02-03 19:38 well, I guess most easy way would be keeping buffer_head on both of user and kernel 2009-02-03 19:39 so, difference is just the detail of io 2009-02-03 19:39 I was thinking, the most easy way is to just make the size sb.super a multiple of sector size 2009-02-03 19:39 and then it is automatically allocated both in userspace and kernel 2009-02-03 19:40 bio wouldn't need page? 2009-02-03 19:40 it doesn't, no 2009-02-03 19:40 there's a handy function to return the page given a kernel address 2009-02-03 19:41 I think the address does have to be sector aligned 2009-02-03 19:41 kzalloc makes that true 2009-02-03 19:44 sector aligned? 2009-02-03 19:44 virt_to_page(p) 2009-02-03 19:45 yes 2009-02-03 19:45 kzalloc doesn't garantee 2009-02-03 19:45 it would be 16 or 8 (actually, depending on arch though) 2009-02-03 19:46 however, I guess the sector aligned is not required for disk dma 2009-02-03 19:47 -!- cam(~cam@203-219-255-75.tpgi.com.au) has joined #tux3 2009-02-03 19:49 it isn't? 2009-02-03 19:50 sector aligned means memory is 512 or greater aligned? 
2009-02-03 19:50 surely something in the io path must be bothered by sub-sector io 2009-02-03 19:51 ah 2009-02-03 19:51 io size? 2009-02-03 19:52 biovec lets us specifiy any source address we want, whether unaligned works or not is another question 2009-02-03 19:53 ah, yes 2009-02-03 19:54 struct sb { 2009-02-03 19:54 struct disksuper super; 2009-02-03 19:54 char pad[-sizeof(struct disksuper) & 0x1ff]; 2009-02-03 19:54 but, well, bio_page + bio_offset? 2009-02-03 19:54 sure, tell bio to do it, but does it work? 2009-02-03 19:54 probably, almost all disk 2009-02-03 19:55 kernel only ever uses block aligned bio_offset 2009-02-03 19:55 yes 2009-02-03 19:55 anyway, padding is just one line 2009-02-03 19:57 union? 2009-02-03 19:57 union { disksuper, char buffer[size];} 2009-02-03 19:57 anonymous union 2009-02-03 19:57 sure 2009-02-03 19:58 PAGE_SIZE :) 2009-02-03 19:59 actually 2009-02-03 19:59 why PAGE_SIZE? 2009-02-03 20:00 it's wrong 2009-02-03 20:00 and so is sb_bread(sb, SB_LOC >> sb->s_blocksize_bits); 2009-02-03 20:00 we need SB_LEN back 2009-02-03 20:01 or just SECTOR_SIZE 2009-02-03 20:02 if that's the size of our disksuper 2009-02-03 20:02 well PAGE_SIZE would be good 2009-02-03 20:02 @@ -220,6 +220,7 @@ static inline void flink_last_del(struct 2009-02-03 20:02 #define MAX_FILESIZE (1LL << MAX_FILESIZE_BITS) 2009-02-03 20:02 #define MAX_EXTENT (1 << 6) 2009-02-03 20:02 #define SB_LOC (1 << 12) 2009-02-03 20:02 +#define SB_LEN (1 << 12) 2009-02-03 20:02 /* Special inode numbers */ 2009-02-03 20:02 #define TUX_BITMAP_INO 0 2009-02-03 20:02 it allows maximum hard sector size 2009-02-03 20:02 @@ -297,7 +298,8 @@ struct stash { struct flink_head head; u 2009-02-03 20:02 /* Tux3-specific sb is a handle for the entire volume state */ 2009-02-03 20:02 struct sb { 2009-02-03 20:02 - struct disksuper super; 2009-02-03 20:02 + union { disksuper, char thisbig[SB_LEN]; } 2009-02-03 20:02 + char pad[-sizeof(struct disksuper) & 0x1ff]; 2009-02-03 20:02 struct inode *volmap; 
/* Volume metadata cache (like blockdev). 2009-02-03 20:02 * Note, ->btree is the btree for itable. */ 2009-02-03 20:03 struct inode *bitmap; /* allocation bitmap special file */ 2009-02-03 20:03 whoops 2009-02-03 20:03 struct sb { 2009-02-03 20:03 - struct disksuper super; 2009-02-03 20:03 + union { disksuper, char thisbig[SB_LEN]; } 2009-02-03 20:03 struct inode *volmap; /* Volume metadata cache (like blockdev). 2009-02-03 20:03 actually, union { struct disksuper super; char thisbig[SB_LEN]; } 2009-02-03 20:03 actually, union { struct disksuper super; char thisbig[SB_LEN]; }; 2009-02-03 20:04 SB_LEN would be PAGE_CACHE_SIZE? 2009-02-03 20:04 it doesn't really matter 2009-02-03 20:04 as long as its at least as big as our disksuper will ever be 2009-02-03 20:04 ext2/3/4 use 1K super 2009-02-03 20:05 well, but if hard sector size is bigger than 1k, it wouldn't work 2009-02-03 20:05 I don't know it without special raid though 2009-02-03 20:06 um... 2009-02-03 20:06 640M MO has 2k hard sector size, iirc 2009-02-03 20:07 the union breaks rapid_sb 2009-02-03 20:07 oh 2009-02-03 20:11 um... 2009-02-03 20:11 well, those can be replaced with sb.volblocks 2009-02-03 20:12 maybe just write it without the union, it's not hard 2009-02-03 20:13 patch coming 2009-02-03 20:13 it can 2009-02-03 20:13 however, union seems right way 2009-02-03 20:14 you can improve it :) 2009-02-03 20:14 ok :) 2009-02-03 20:14 you already wrote? 
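[Editor's sketch] Two ways of sizing the in-memory super were floated above: an explicit pad using `-sizeof(struct disksuper) & 0x1ff`, and an anonymous union with a `char thisbig[SB_LEN]` member. The sketch below shows what each one does to the layout. The shortened `struct disksuper` and the names `sb_padded`, `sb_union` and `pad_to_sector` are assumptions for illustration; only `SB_LEN` and the two expressions come from the chat.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SB_LEN (1 << 12)	/* value from the quoted patch */

/* Hypothetical, shortened on-disk super; the real one has more fields. */
struct disksuper {
	char magic[16];
	uint64_t volblocks;
	uint64_t logchain;
};

/* -size & 0x1ff is the byte count needed to reach the next multiple of 512. */
static size_t pad_to_sector(size_t size)
{
	return -size & 0x1ff;
}

/* Variant 1: explicit pad, rounding the super up to a 512-byte multiple. */
struct sb_padded {
	struct disksuper super;
	char pad[-sizeof(struct disksuper) & 0x1ff];
	void *volmap;
};

/* Variant 2: anonymous union (C11), making the super area SB_LEN bytes. */
struct sb_union {
	union {
		struct disksuper super;
		char thisbig[SB_LEN];
	};
	void *volmap;
};
```

One caveat with the first variant: if the super ever becomes an exact multiple of 512 bytes, the pad expression evaluates to zero, and a zero-length array is a GNU extension rather than standard C.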
2009-02-03 20:15 it wasn't hard 2009-02-03 20:15 everything compiles 2009-02-03 20:15 ok 2009-02-03 20:15 @@ -220,6 +220,7 @@ static inline void flink_last_del(struct 2009-02-03 20:15 #define MAX_FILESIZE (1LL << MAX_FILESIZE_BITS) 2009-02-03 20:15 #define MAX_EXTENT (1 << 6) 2009-02-03 20:15 #define SB_LOC (1 << 12) 2009-02-03 20:15 +#define SB_LEN (1 << 12) 2009-02-03 20:15 /* Special inode numbers */ 2009-02-03 20:15 #define TUX_BITMAP_INO 0 2009-02-03 20:15 I've find the .super users 2009-02-03 20:15 @@ -298,6 +299,7 @@ struct stash { struct flink_head head; u 2009-02-03 20:15 struct sb { 2009-02-03 20:16 struct disksuper super; 2009-02-03 20:16 + char pad[SB_LEN - sizeof(struct disksuper)]; 2009-02-03 20:16 struct inode *volmap; /* Volume metadata cache (like blockdev). 2009-02-03 20:16 * Note, ->btree is the btree for itable. */ 2009-02-03 20:16 struct inode *bitmap; /* allocation bitmap special file */ 2009-02-03 20:16 will commit, then you can play :) 2009-02-03 20:17 ok, good 2009-02-03 20:18 next thing is to directly read/write it, no need to involve buffers 2009-02-03 20:22 btw, we have to make sure any in progress overlapping io with this 2009-02-03 20:22 it should not be happen though 2009-02-03 20:23 -!- RazvanM(~RazvanM@96.234.239.248) has joined #tux3 2009-02-03 20:24 for now, not 2009-02-03 20:25 good 2009-02-03 20:25 later we will use async transfer to update the log start, and wake_up something in the endio 2009-02-03 20:25 overlapping io is not allowed by block layer for now 2009-02-03 20:26 right, this is synchronized by change_end 2009-02-03 20:26 if it's not overlapping, there is no problem 2009-02-03 20:26 oh 2009-02-03 20:26 you mean, overlapping address 2009-02-03 20:26 it won't happen 2009-02-03 20:26 overlapping physical address 2009-02-03 20:27 overwrite flying io with another io 2009-02-03 20:28 that would be a bug 2009-02-03 20:28 it can be happened with discard command for now 2009-02-03 20:29 discard command? 
2009-02-03 20:29 erase command for flash based disk 2009-02-03 20:29 disk based on flash 2009-02-03 20:30 scsi is trim, ata is discard, iirc 2009-02-03 20:31 and axboe said overlapping will not work before 2009-02-03 20:33 well, anyway, io outside lock can be the cause of this problem 2009-02-03 20:34 I don't think it can happen to our disksuper 2009-02-03 20:34 so, we shouldn't it, we would not though 2009-02-03 20:34 yes 2009-02-03 20:35 block should already be invalidated 2009-02-03 20:35 if it is reused 2009-02-03 20:39 it's our bug if we ever generate overlapping IO 2009-02-03 20:39 yes, it's good 2009-02-03 20:40 ok, it is about time to start using vecio 2009-02-03 20:40 you can try it for a while, and if you prefer your form, change it 2009-02-03 20:41 vecio? 2009-02-03 20:41 static int vecio(int rw, struct block_device *dev, sector_t sector, 2009-02-03 20:41 bio_end_io_t endio, void *data, unsigned vecs, struct bio_vec *vec) 2009-02-03 20:41 { 2009-02-03 20:42 single function call to initialize and submit a bio, instead of lots of repetitive initializations 2009-02-03 20:43 for now, we will always submit IO at the same time as initializing the bio 2009-02-03 20:43 int syncio(int rw, struct block_device *dev, sector_t sector, unsigned vecs, struct bio_vec *vec) 2009-02-03 20:44 on stack vec would be big 2009-02-03 20:44 a single vec is not big 2009-02-03 20:44 yes 2009-02-03 20:45 but, write_bitmap can be one extent 2009-02-03 20:45 well, for right now, it would be ok 2009-02-03 20:46 it passes *vec anyway 2009-02-03 20:46 so if the vec is big it doesn't need to be on the stack 2009-02-03 20:46 if so, it will allocate vec twice 2009-02-03 20:47 that is true, but it doesn't matter 2009-02-03 20:47 it's only 8 bytes 2009-02-03 20:47 compared to a disk transfer 2009-02-03 20:48 if 32bit arch and 1 vec 2009-02-03 20:48 twice as big still doesn't matter 2009-02-03 20:48 and if there are lots of vecs, it still doesn't matter 2009-02-03 20:48 because its a big 
transfer 2009-02-03 20:48 well, yes 2009-02-03 20:49 but, needless 2009-02-03 20:49 not as bad as the usual verbose bio code though 2009-02-03 20:49 yes 2009-02-03 20:50 in any case where it matters, write out by hand 2009-02-03 20:50 I doubt there is such a case 2009-02-03 20:50 big vec? 2009-02-03 20:51 well, I imagine guess_extent() 2009-02-03 20:51 the overhead of copying one bvec is still nothing compared to setting up the scatter/gather in the driver 2009-02-03 20:51 I don't think we have timers sensitive enought measure that nanosecond ;) 2009-02-03 20:52 ah, I'm thinking memory allocation is more long time 2009-02-03 20:52 under the memory pressure 2009-02-03 20:53 bio_alloc is the big cost 2009-02-03 20:54 yes, it is not avoidable 2009-02-03 20:54 vecio actually does 2009-02-03 20:55 in internal 2009-02-03 20:55 well, anyway, this is pure optimization 2009-02-03 20:55 yes 2009-02-03 20:56 so, I think current vecio works for us 2009-02-03 20:56 ok, it is about time to start using it 2009-02-03 20:57 ok 2009-02-03 20:57 by the way, bio doesn't really need bi_size 2009-02-03 20:58 I mean, it's extra 2009-02-03 20:58 should just get rid of it sometime 2009-02-03 20:58 it seems to be needed for dataless io 2009-02-03 20:59 the bi_size is just the sum of the bv_lens 2009-02-03 20:59 dataless doesn't have bio_vec at all 2009-02-03 20:59 dataless means? 2009-02-03 20:59 barrier? 2009-02-03 20:59 invalidate? 2009-02-03 20:59 yes, barrier and discard 2009-02-03 21:00 probably the problem is with the barrier design 2009-02-03 21:00 ah 2009-02-03 21:00 it's not worth worrying about 2009-02-03 21:02 bi_size seems to be only used by discard command 2009-02-03 21:04 ok, we don't have any file in kernel for general support like vecio 2009-02-03 21:04 just put it in kernel/filemap.c? 2009-02-03 21:04 yes 2009-02-03 21:05 I guess commit.c? 
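[Editor's sketch] To make the `syncio` idea concrete without depending on the block layer, here is a toy userspace version that transfers a vector of memory segments to or from a byte array standing in for the device. Everything in it is hypothetical (`struct memvec`, `struct memdev`, `vec_bytes`, `mem_syncio`, the `READ_OP`/`WRITE_OP` constants); it mirrors only the shape of the call discussed above and the observation that the total transfer size is the sum of the segment lengths.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SECTOR_BITS 9

enum { READ_OP, WRITE_OP };	/* hypothetical direction flags */

/* Hypothetical userspace stand-in for struct bio_vec. */
struct memvec {
	void *base;
	size_t len;
};

/* Hypothetical "device": just a byte array. */
struct memdev {
	unsigned char *data;
	size_t size;
};

/* The total transfer size is simply the sum of the segment lengths. */
static size_t vec_bytes(unsigned vecs, const struct memvec *vec)
{
	size_t total = 0;

	for (unsigned i = 0; i < vecs; i++)
		total += vec[i].len;
	return total;
}

/* Synchronous gather-write or scatter-read at a sector; 0 on success. */
static int mem_syncio(int rw, struct memdev *dev, size_t sector,
		      unsigned vecs, const struct memvec *vec)
{
	size_t pos = sector << SECTOR_BITS;

	if (pos > dev->size || vec_bytes(vecs, vec) > dev->size - pos)
		return -1;	/* transfer would run off the device */
	for (unsigned i = 0; i < vecs; i++) {
		if (rw == WRITE_OP)
			memcpy(dev->data + pos, vec[i].base, vec[i].len);
		else
			memcpy(vec[i].base, dev->data + pos, vec[i].len);
		pos += vec[i].len;
	}
	return 0;
}
```

As noted in the chat, a userspace version like this is not asynchronous; it only keeps the calling code the same shape on both sides.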
2009-02-03 21:05 good 2009-02-03 21:07 well, commit.c is included from user 2009-02-03 21:08 so, not so good 2009-02-03 21:08 maybe utility.c 2009-02-03 21:09 and put hexdump in there too 2009-02-03 21:09 yes 2009-02-03 21:10 another choise seems only super.c 2009-02-03 21:10 we will transfer log blocks with it too, I think 2009-02-03 21:11 yes 2009-02-03 21:18 looks like I broke the kernel compile a couple days ago 2009-02-03 21:19 CC fs/tux3/btree.o 2009-02-03 21:19 fs/tux3/btree.c: In function 'cursor_redirect': 2009-02-03 21:19 fs/tux3/btree.c:353: error: 'struct buffer_head' has no member named 'link' 2009-02-03 21:19 ah 2009-02-03 21:19 list_move_tail 2009-02-03 21:20 well, at least I don't care at all for now 2009-02-03 21:21 that is the only breakage 2009-02-03 21:21 maybe 2009-02-03 21:21 only compile breakage 2009-02-03 21:22 I also tried kernel recently 2009-02-03 21:22 it was ok when that time 2009-02-03 21:22 well, anyway, I don't care 2009-02-03 21:22 renaming hexdump.c to utility.c 2009-02-03 21:23 ok 2009-02-03 21:26 ok, vecio and syncio have arrived 2009-02-03 21:27 now I should fix btree.c somehow 2009-02-03 21:28 well, it will be fixed with ATOMIC 2009-02-03 21:29 good 2009-02-03 21:29 the issue is, redirected btree blocks have to go on different dirty lists depending on whether they are leaf or node 2009-02-03 21:30 because leaf blocks are flushed per delta, nodes are per rollup 2009-02-03 21:31 if leaf is special, we may be able to handle outside cursor_redirect 2009-02-03 21:35 the problem is just that we don't have a way of referring to the kernel buffer_head link field 2009-02-03 21:35 b_assoc_buffers 2009-02-03 21:35 i see 2009-02-03 21:35 easiest thing is just to change userspace buffer.link to buffer.b_assoc_buffers 2009-02-03 21:36 it's ugly, but then... 
buffers are ugly 2009-02-03 21:36 I guess it will be changed with mark_buffer_dirty change 2009-02-03 21:36 helper functions would help 2009-02-03 21:37 yes, we want 2009-02-03 21:38 the only way a block gets onto sb->pinned is by new_block 2009-02-03 21:38 so we might want to pass a list parameter to new_block 2009-02-03 21:39 I think we need pinned way the buffer 2009-02-03 21:39 way? 2009-02-03 21:39 the way pining buffer 2009-02-03 21:39 way to pin buffer 2009-02-03 21:39 :) 2009-02-03 21:39 in kernel 2009-02-03 21:40 :) 2009-02-03 21:40 so, I think list_move_tail will become some function 2009-02-03 21:40 pin buffer, then move to sb->pinned 2009-02-03 21:41 I was thinking, something like: new_block(btree, level < btree->root.depth ? &sb->pinned : &sb->commit); 2009-02-03 21:42 it can 2009-02-03 21:43 it would be better, 2009-02-03 21:43 let me see, on the pinned list we have btree index nodes and forked bitmap blocks, is that all? 2009-02-03 21:43 if (level < depth) new_leaf else new_block 2009-02-03 21:43 whoops 2009-02-03 21:43 if (level < depth) new_node else new_leaf 2009-02-03 21:44 new_node and new_leaf do unneed initialization 2009-02-03 21:44 unneeded 2009-02-03 21:44 ah 2009-02-03 21:45 think later would be good 2009-02-03 21:46 because we don't all convert all btree 2009-02-03 21:46 dleaf is not handled yet, iirc 2009-02-03 21:47 new_btree needs a log message 2009-02-03 21:47 because we never actually write out the new node 2009-02-03 21:48 i see 2009-02-03 21:51 ok, dleaf 2009-02-03 21:51 let me think 2009-02-03 21:52 yes 2009-02-03 21:52 I'm still thinking about flush_log 2009-02-03 21:52 a new dleaf needs to go on sb->commit 2009-02-03 21:54 i see, and after insert_leaf, we may need to call cursor_redirect 2009-02-03 21:55 cursor_redirect before insert_leaf I think 2009-02-03 21:55 the new leaf does not have to be redirected because it does not overwrite any existing block 2009-02-03 21:56 after insert_leaf, path can be different 2009-02-03 21:57 
example? 2009-02-03 21:57 add child, and incremnt to next? 2009-02-03 21:57 increment 2009-02-03 21:59 in filemap.c? 2009-02-03 21:59 yes 2009-02-03 21:59 btree_insert_leaf -> insert_leaf -> at->next++? 2009-02-03 22:00 changing the path does not require a redirect 2009-02-03 22:01 ah, i see 2009-02-03 22:05 um... 2009-02-03 22:06 if we have delta dirty list per inode, the code become simple 2009-02-03 22:06 the code seems to become simpler 2009-02-03 22:06 but, it will waste memory more or less 2009-02-03 22:06 um... 2009-02-03 22:09 I don't think more than one dirty list head per inode is ever needed 2009-02-03 22:09 simple code is good :) 2009-02-03 22:09 e.g. I'm thinking about the following 2009-02-03 22:10 mark_buffer_dirty() does buffer dirty and save delta counter 2009-02-03 22:10 and in stage_delta we will just write out the interesting delta 2009-02-03 22:11 list is per delta, so it is stable list like buffer data 2009-02-03 22:11 I guess we don't need to care about async fork 2009-02-03 22:12 and just loop until empty the interesting delta dirty list 2009-02-03 22:12 current map->dirty is not stable against delta counter 2009-02-03 22:13 it may have previous or next 2009-02-03 22:23 maybe it can be two list, frontend dirty and backend dirty 2009-02-03 22:24 frontend and backend are changed by "delta counter % 2" 2009-02-03 22:25 so, in blockdirty(), new buffer will be inserted into frontend list 2009-02-03 22:25 and stage_delta will be worked for backend list 2009-02-03 22:39 in stage_delta we will just write out the interesting delta <- I considered that 2009-02-03 22:40 yes 2009-02-03 22:40 but, there is no list for it 2009-02-03 22:40 map->dirty should only have dirty from one delta 2009-02-03 22:40 if it is otherwise, it's a bug 2009-02-03 22:41 if blockfork is happened, it has no interesting buffer 2009-02-03 22:41 volmap is an exception 2009-02-03 22:41 yes 2009-02-03 22:43 I think it is true that map->dirty only has dirty state == sb->delta, or 
for bitmap, state == sb->flush 2009-02-03 22:45 especially, bitmap is redirty after incremented sb->flush 2009-02-03 22:45 and redirtyed buffer is inserted into map->dirty 2009-02-03 22:46 right, so we should start the flush by moving the whole dirty list to a temporary list head 2009-02-03 22:47 yes, and need to flush sb->pinned after flush 2009-02-03 22:48 it may work to walk the dirty list once, and at the end of the walk, only redirtied blocks are on the list, but it is hard to understand 2009-02-03 22:48 yes 2009-02-03 22:48 yes, and it would be hard to do 2009-02-03 22:49 because we don't know which buffer is redirtyed 2009-02-03 22:50 right, it would only work if redirtied block is always inserted behind the list traversal position 2009-02-03 22:50 I don't like to write code that is so tricky 2009-02-03 22:50 yes 2009-02-03 22:51 and this issue is also true if we unlock delta_log before stage_delta 2009-02-03 22:51 delta_lock 2009-02-03 22:51 for all inodes 2009-02-03 22:53 in a fully pipelined version, stage_delta can be running at the same time as commit_delta 2009-02-03 22:53 all IO is started by commit_delta 2009-02-03 22:55 yes 2009-02-03 22:55 if the blockdev queue is busy, then there will be no pause 2009-02-03 22:56 yes, so the list will become more problem 2009-02-03 22:58 if linked by the buffer head, a fork can remove a buffer from the delta commit list and put the clone buffer there 2009-02-03 22:59 yes 2009-02-03 22:59 I don't like that behavior very much 2009-02-03 23:00 I thought about using the page lru link for that list, because of that issue 2009-02-03 23:00 anyway, there is lots of time to think about it 2009-02-03 23:01 can we return buffer from blockdirty 2009-02-03 23:01 I considered that and convinced myself it would not work 2009-02-03 23:01 whoops 2009-02-03 23:01 and I really need to write down the reason 2009-02-03 23:03 just to be clear, were you suggesting that fork should not put the new buffer on a delta list, but should 
return it instead? 2009-02-03 23:03 ah, no 2009-02-03 23:04 return clone, and don't touch old buffer 2009-02-03 23:04 ah, yes 2009-02-03 23:05 ok, I understood correctly 2009-02-03 23:06 in that case, parallel users of buffers on the same page would never find out about the state change 2009-02-03 23:07 (something like that) 2009-02-03 23:09 um... 2009-02-03 23:09 frontend state change is only blockdirty? 2009-02-03 23:10 that's not quite the right problem 2009-02-03 23:10 ah, truncate 2009-02-03 23:11 returning a clone buffer to the dirtier leaves other users of buffers on the same page with the old buffer 2009-02-03 23:11 it can truncate only current delta 2009-02-03 23:13 ah 2009-02-03 23:14 um... 2009-02-03 23:14 blockdirty changes all buffer_head on page? 2009-02-03 23:14 and it calls blockdirty before change 2009-02-03 23:15 actually, it changes the page and does not change the buffer_heads 2009-02-03 23:15 however, bh->b_page is changed? 2009-02-03 23:16 yes 2009-02-03 23:16 that is what I meant by "it changes the page" 2009-02-03 23:16 ah, yes 2009-02-03 23:19 um... 2009-02-03 23:20 ah, i see 2009-02-03 23:21 buffer state is used by fronend 2009-02-03 23:21 my reason for thinking this works is: 1) parallel readers don't care whether they get the data from the old or new page, it is the same 2) a reader that becomes a writer has to call bufdata again 3) after calling blockdirty, there cannot be any parallel fork, because all buffers on the page now belong to the current delta 2009-02-03 23:22 1), because meta is locked? 
2009-02-03 23:22 metadata 2009-02-03 23:23 because that buffer is clean, a copy is made of the clean data 2009-02-03 23:23 (let's see if that is what I actually implemented) 2009-02-03 23:24 if (buffer_uptodate(oldbuf)) 2009-02-03 23:24 memcpy(newdata, olddata, blocksize); 2009-02-03 23:24 per buffer 2009-02-03 23:24 it is not meaning clean 2009-02-03 23:24 it can be dirty 2009-02-03 23:24 right, dirty in previous delta 2009-02-03 23:25 yes 2009-02-03 23:25 not clean, but read-only 2009-02-03 23:25 so, reader may read previous delta 2009-02-03 23:25 yes 2009-02-03 23:25 sorry, I was imprecise 2009-02-03 23:26 all the "clean" above should be "read-only" 2009-02-03 23:26 and if it's not locked, current delta might change it 2009-02-03 23:26 current delta is not allowed to change data on a block belonging to an earlier delta 2009-02-03 23:27 it has to change a copy 2009-02-03 23:27 yes 2009-02-03 23:27 reader might be reading to change it 2009-02-03 23:27 so, it needs lastest data 2009-02-03 23:27 the reader will keep reading the old copy 2009-02-03 23:28 ah 2009-02-03 23:28 the reader will always read the latest data 2009-02-03 23:29 there can be no write in parallel on the same block 2009-02-03 23:29 I mean, it is not allowed 2009-02-03 23:29 yes 2009-02-03 23:29 so, reader might change is also serialized 2009-02-03 23:29 yes 2009-02-03 23:30 so, the lock with set_bit is not enough 2009-02-03 23:30 and serialized against parallel operations on other blocks by lock_page and lock_buffer 2009-02-03 23:31 i see, bus lock can not be used 2009-02-03 23:31 bus lock? 2009-02-03 23:31 oh 2009-02-03 23:31 e.g., atomic_t 2009-02-03 23:31 set_bit or something 2009-02-03 23:31 yes 2009-02-03 23:32 I think that's true 2009-02-03 23:33 hopefully, lock_buffer is usually just a trylock with a bus lock, I think the real cost is the bus lock in either case 2009-02-03 23:35 ok, what inodes will actually have parallel forking? 
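The fork rule stated above (the current delta may not change a block that is still dirty from an earlier delta, so it changes a copy, and the copy is filled only if the old buffer is up to date) can be sketched in plain userspace C. This is a minimal sketch, not the tux3 implementation: `struct block`, `blockdirty_sketch` and the `slot` pointer (standing in for the page cache slot that gets repointed at the copy) are hypothetical names, and all locking (`lock_page`, `lock_buffer`) is left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { BLOCKSIZE = 16 };        /* tiny block size, for illustration only */

struct block {                  /* hypothetical stand-in for a buffer */
	unsigned delta;             /* delta in which the block was last dirtied */
	int dirty;                  /* nonzero until written out */
	int uptodate;               /* nonzero if data holds valid contents */
	unsigned char data[BLOCKSIZE];
};

/*
 * Make the block in *slot writable in delta `now`.
 * If it is still dirty from an earlier delta it is read-only, so the slot
 * is repointed at a copy and the old block is handed back through *old
 * for the backend to flush.
 * Returns 1 if a fork happened, 0 if not, -1 if the copy could not be
 * allocated (in which case nothing is changed).
 */
int blockdirty_sketch(struct block **slot, unsigned now, struct block **old)
{
	struct block *cur;

	assert(slot && *slot && old);
	cur = *slot;
	*old = NULL;

	if (cur->dirty && cur->delta != now) {
		struct block *clone = calloc(1, sizeof *clone);
		if (!clone)
			return -1;
		if (cur->uptodate) {    /* same test as in the log above */
			memcpy(clone->data, cur->data, BLOCKSIZE);
			clone->uptodate = 1;
		}
		*slot = clone;          /* later lookups see the copy */
		*old = cur;             /* earlier delta keeps the original */
		cur = clone;
	}
	cur->dirty = 1;
	cur->delta = now;
	return *old != NULL;
}
```

A caller that held a pointer to the old block before the call has to look the block up again through the slot before writing, which is the "a reader that becomes a writer has to call bufdata again" point.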
2009-02-03 23:35 if unlock delta_lock before stage_delta, I think all inodes can be parallel 2009-02-03 23:36 regular data files will not have buffers 2009-02-03 23:36 volmap does redirect, not fork 2009-02-03 23:36 that leaves directories and bitmaps 2009-02-03 23:37 yes, and atable 2009-02-03 23:37 thanks :) 2009-02-03 23:37 btw, regular file doesn't fork? 2009-02-03 23:37 if it's not ordered write 2009-02-03 23:38 I'm not really thinking past ordered write, right now ;) 2009-02-03 23:38 ok :) 2009-02-03 23:38 but let's 2009-02-03 23:38 it's fun 2009-02-03 23:38 yes 2009-02-03 23:39 ignoring the mmapped case for now... 2009-02-03 23:39 ok 2009-02-03 23:39 so, ->write_begin is... 2009-02-03 23:39 we want to try to do everything in a data inode with pages 2009-02-03 23:40 -!- macan(~macan@159.226.41.137) has joined #tux3 2009-02-03 23:40 i see 2009-02-03 23:40 staging means starting IO on all the dirty pages of the file, and then we will not allow those pages to be changed 2009-02-03 23:40 yes 2009-02-03 23:40 so, we would do a page fork in write_begin 2009-02-03 23:41 old page removed from the mapping, insert copy in its place 2009-02-03 23:41 good 2009-02-03 23:42 mmap means we have to fiddle with pte bits 2009-02-03 23:42 sounds horrible 2009-02-03 23:42 page_mkwrite will do it 2009-02-03 23:43 right, but we are stuck with write protecting every page that is put under writeout 2009-02-03 23:44 ah 2009-02-03 23:44 easist thing is just to let users break the semantics via mmap 2009-02-03 23:45 how bad would that be? 
2009-02-03 23:45 just data is fully stable 2009-02-03 23:45 just data is not 2009-02-03 23:45 libc sometimes implements cp with mmap 2009-02-03 23:45 so, it would be surprising to break the consistency with a cp 2009-02-03 23:46 yes 2009-02-03 23:46 data may not be stable like timestamp 2009-02-03 23:46 the range may have old state 2009-02-03 23:47 a similar problem if we want to set a snapshot without flushing the block device 2009-02-03 23:48 actually, that is not so bad, we could go and write protect very mmap on the filesystem, this is ok because snapshot isn't expected to be completely free 2009-02-03 23:49 anyway, this is all "beyond posix" 2009-02-03 23:49 i see 2009-02-03 23:49 going back to the cases we actually have to handle... 2009-02-03 23:50 yes 2009-02-03 23:50 directories are protected by parent i_mutex 2009-02-03 23:50 I don't think parallel fork is possible 2009-02-03 23:50 in stage_delta? 2009-02-03 23:51 stage_delta will not change dirent blocks 2009-02-03 23:51 yes 2009-02-03 23:51 but, frontend can be blockdirty 2009-02-03 23:52 you are right 2009-02-03 23:52 that's why we need fork, so I am happy :) 2009-02-03 23:52 :) 2009-02-03 23:53 and the fork does not have a lot of SMP probelms to worry about for directories 2009-02-03 23:53 the only SMP problem is... changing lists has to be synchronized with staging 2009-02-03 23:54 yes 2009-02-03 23:55 bitmaps are the last case, I think fully SMP forking is possible 2009-02-03 23:55 when we change from i_mutex to lock_buffer for bitmap update 2009-02-03 23:55 read/update 2009-02-03 23:56 or if we didn't allow partial flush, I guess stage_delta is only flusher 2009-02-03 23:57 partial flush is actually the natural linux behaviour 2009-02-03 23:57 yes 2009-02-03 23:57 flush of only some data files 2009-02-03 23:57 also flush of only some directories? 
2009-02-03 23:57 ah, it meant flush by vm 2009-02-03 23:57 yes 2009-02-03 23:58 so tux3's behavior of always flushing everything is not traditional linux behavior 2009-02-03 23:58 good 2009-02-03 23:59 and probalby not very efficient, we will want to selectively flush files at some point 2009-02-03 23:59 writepage call from vm will not flush actually? 2009-02-03 23:59 now? 2009-02-03 23:59 always 2009-02-04 00:00 ah, current code is flushing 2009-02-04 00:00 after atomic commit 2009-02-04 00:00 after atomic commit, we will flush page by vm's writepage? 2009-02-04 00:00 I think so 2009-02-04 00:01 it's kind of the wrong interface for vm to be using 2009-02-04 00:02 vm indicates a file should be flushed by writing one of its pages... not very natural 2009-02-04 00:03 maybe, it can just be hint from vm? 2009-02-04 00:03 vm tells the passed page seems unused page 2009-02-04 00:03 it's been a while since I looked at pdflush, I'm doing a refresh right now 2009-02-04 00:04 ah, that's srhink_caches 2009-02-04 00:04 yes 2009-02-04 00:07 and pdflush can be hooked with writepages 2009-02-04 00:07 if we add it, I guess writepage will be used only from vm 2009-02-04 00:08 pdflush has become more abstract since I last looked at it 2009-02-04 00:09 i see 2009-02-04 00:09 pdflush.c doesn't seen to contain any actualyl flushing code 2009-02-04 00:09 yes 2009-02-04 00:10 it depends on __mark_inode_dirty though 2009-02-04 00:17 pdflush seems to have two main places where it actually works: background_writeout and wb_kupdate 2009-02-04 00:18 yes 2009-02-04 00:18 wb_kupdate is to flush periodically 2009-02-04 00:18 background_writeout is to flush too many dirty buffer 2009-02-04 00:19 it's unfortunate that the names of the related functions are completely different 2009-02-04 00:20 both of them call writeback_inodes (another completely different name for the same functionality) 2009-02-04 00:21 yes 2009-02-04 00:21 kupdate is pdflush only 2009-02-04 00:22 background_writeout can be 
from vm 2009-02-04 00:22 can be related to vm 2009-02-04 00:23 it's really the same thing 2009-02-04 00:23 "write stuff back and don't let it get too old" 2009-02-04 00:24 this code is, ahem, a little ad hoc 2009-02-04 00:24 ACTION wonders if hirofumi knows latin 2009-02-04 00:24 I have the dictionaries :) 2009-02-04 00:25 anyway, it calls generic_sync_sb_inodes, which eventually starts throwing pages at us 2009-02-04 00:25 yes 2009-02-04 00:25 a pretty awaful way for the vm to tell us it things we should flush an inode 2009-02-04 00:25 I think, we we know what is idea, we might want to put an override in that path 2009-02-04 00:26 so instead of doing writepage on every inode page, it just calls our hook, which tells us to do that 2009-02-04 00:26 yes 2009-02-04 00:27 redirty page seems usual way for it 2009-02-04 00:27 that's pretty crude 2009-02-04 00:28 I think vm is not working very well in this area 2009-02-04 00:28 linux spends rather more time churning pages than it should 2009-02-04 00:28 I don't have figures to prove that though 2009-02-04 00:28 only the feeling of my computer 2009-02-04 00:29 i see 2009-02-04 00:30 __write_single_inode calls do_writepages, then write_inode 2009-02-04 00:30 yes 2009-02-04 00:31 there is our hook 2009-02-04 00:31 ret = mapping->a_ops->writepages(mapping, wbc); 2009-02-04 00:31 so we don't have to use ->writepage as our writeout hint, we can use ->writepages 2009-02-04 00:32 btw, iirc, some people try to overwrite sync_sb_inodes 2009-02-04 00:32 overwrite? 2009-02-04 00:32 um.. replace with hook 2009-02-04 00:33 yes, well that part seems pretty reasonable 2009-02-04 00:34 I wonder how it decides which inode should be writte out next 2009-02-04 00:35 dirtyed order 2009-02-04 00:35 order of first dirty of the inode? 
2009-02-04 00:35 old inode is first write 2009-02-04 00:35 yes 2009-02-04 00:36 reasonable 2009-02-04 00:38 by the way, you meant override above, rather than overwrite 2009-02-04 00:38 ah, yes 2009-02-04 00:39 there is no hook, so I assume you mean hack in a hook 2009-02-04 00:39 yes, I thought it was introduced, but it seems not to be happened 2009-02-04 00:40 it's not immediately obvious how to do it better 2009-02-04 00:40 sb->s_op->sync_sb_inodes() 2009-02-04 00:40 ah 2009-02-04 00:41 we may want this hook later 2009-02-04 00:45 I don't see sync_sb_inodes 2009-02-04 00:45 super_operations.sync_sb_inodes 2009-02-04 00:46 yes 2009-02-04 00:46 it is not introduced for now 2009-02-04 00:46 we may want to add it 2009-02-04 00:47 it would be good to experiment with 2009-02-04 00:48 yes 2009-02-04 00:49 ->writepages will be our main interface, that is the place where we can do a nice walk of the radix tree to use map_region efficiently 2009-02-04 00:50 yes, exactly 2009-02-04 00:51 well I must sleep and gather my strength to make the flush work the rest of the way 2009-02-04 00:51 ok 2009-02-04 00:51 oyasumi 2009-02-04 00:51 oyasumi 2009-02-04 00:53 -!- RazvanM_(~RazvanM@96.234.232.151) has joined #tux3 2009-02-04 01:24 hey flips 2009-02-04 01:24 ACTION reads the backlog 2009-02-04 07:19 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-04 07:51 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-04 08:01 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-04 08:56 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-04 09:38 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-04 09:42 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-04 10:06 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-04 10:25 hirofumi made top 3 hottest message on lkml this morning :) 2009-02-04 12:01 
Short and sweet 2009-02-04 14:03 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-02-04 14:29 -!- cam(~cam@203-219-255-75.tpgi.com.au) has joined #tux3 2009-02-04 14:56 hey 2009-02-04 16:36 -!- macan(~macan@159.226.41.137) has joined #tux3 2009-02-04 16:52 -!- yosi(~chatzilla@96.232.26.79) has joined #tux3 2009-02-04 17:14 hirofumi, there? 2009-02-04 18:48 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-04 18:54 mounted tux3 with syncio for sb read 2009-02-04 19:09 dwalk_next: 2009-02-04 19:09 free_map: Failed assertion "list_empty(&map->dirty)"! 2009-02-04 19:09 running fuse? 2009-02-04 19:09 yeah 2009-02-04 19:09 ah, its volume full 2009-02-04 19:10 oh good 2009-02-04 19:10 would be nice to get ENOSPC 2009-02-04 19:10 yeah... 2009-02-04 19:10 i'll look a little more at it 2009-02-04 19:20 now syncio works for sb write too 2009-02-04 19:20 easy to use 2009-02-04 19:21 arguably works better because the sb write is immediate now, not just a mark dirty 2009-02-04 19:38 sometimes uml just hangs on boot 2009-02-04 19:39 more frequently in 2.6.29-rc2 2009-02-04 19:39 sorry, 2.6.29-rc1 2009-02-04 20:03 hi 2009-02-04 20:04 hi, I was going to apologize for suggesting we use the volmap inode for the itable btree 2009-02-04 20:04 you made it work, but it's not natural 2009-02-04 20:04 it requires the volmap to be initialized before itable btree is initialized 2009-02-04 20:05 what is problem? 
2009-02-04 20:06 well, I think both is not so natural 2009-02-04 20:06 I was trying to move things around in kernel/super.c, and that tie made it hard 2009-02-04 20:07 trying to have the same IO functions for userspace and kernel for super load/save 2009-02-04 20:08 userspace can use kernel like initialization 2009-02-04 20:08 initialization like kernel 2009-02-04 20:09 yes, it will be more similar 2009-02-04 20:09 I made this wrapper for syncio: int devio(int io, struct block_device *dev, void *buf, unsigned len, loff_t loc); 2009-02-04 20:09 similar to pread/write 2009-02-04 20:10 loc? 2009-02-04 20:10 location 2009-02-04 20:10 block_t? 2009-02-04 20:10 I should write "offset" 2009-02-04 20:10 I moved the sector shift into vecio 2009-02-04 20:11 it doesn't really make sense for every caller to do the sector shift 2009-02-04 20:11 why devio doesn't do it? 2009-02-04 20:12 no killer reason 2009-02-04 20:12 oh 2009-02-04 20:12 well, it makes it more similar to userspace 2009-02-04 20:12 yes, current diskwrite 2009-02-04 20:13 but, usual diskwrite can be replaced by devio 2009-02-04 20:13 yes 2009-02-04 20:13 and the only funny parameter is struct block_device *dev 2009-02-04 20:14 yes 2009-02-04 20:14 userspace can be: int devio(int io, struct dev *dev, void *buf, unsigned len, loff_t loc); 2009-02-04 20:15 yes 2009-02-04 20:15 I think both can be use block_t 2009-02-04 20:15 instead of loff_t 2009-02-04 20:15 our superblock isn't necessarily aligned on a block 2009-02-04 20:15 if we support 8K blocks sometime, it would break 2009-02-04 20:16 if it's 8k block, superblock should be on block_t 0 2009-02-04 20:17 I suppose transfers other than the superblock will be to block boundaries 2009-02-04 20:17 but none of them will be syncrhonous I think 2009-02-04 20:18 I don't know if devio will ever be used in more than one place, to read/write the superblock 2009-02-04 20:18 well, for the immediate prototype, we can use syncio for the log blocks 2009-02-04 20:18 hmm 
2009-02-04 20:18 for journal replay 2009-02-04 20:19 I mean, log replay 2009-02-04 20:19 yes 2009-02-04 20:19 well, I think fixed offset is not good itself 2009-02-04 20:20 fixed offser for superblock? 2009-02-04 20:20 yes 2009-02-04 20:20 I think fixed block address is better 2009-02-04 20:22 well, it doesn't so matter though 2009-02-04 20:27 oh sb->itable again 2009-02-04 20:31 it looks good at least for now 2009-02-04 20:32 btw, I was thinking about blockdirty and fork 2009-02-04 20:33 to get stable list and parallel forks 2009-02-04 20:33 there is some ideas, but all are not so good 2009-02-04 20:34 the issue is buffer_head is shared by frontend and backend if it's not forked yet 2009-02-04 20:35 so, backend has to take care to touch buffer_head fields 2009-02-04 20:35 probably, list and dirty state 2009-02-04 20:36 if buffer is not forked, and if write-io was completed, we would need to clean dirty 2009-02-04 20:37 the state change would be needed to serialize with fork 2009-02-04 20:39 stable structres are buffer_head for frontend, and page for backend 2009-02-04 20:40 so, I was thinking to remove buffer_head from backend 2009-02-04 20:40 it's not hard, but I guess additional memory is needed 2009-02-04 20:40 for now 2009-02-04 20:42 back 2009-02-04 20:43 my girl is sick today, have to take care 2009-02-04 20:43 bad 2009-02-04 20:44 you should take care 2009-02-04 21:26 folks 2009-02-04 22:24 -!- RazvanM(~RazvanM@96.234.232.151) has joined #tux3 2009-02-04 23:36 hirofumi, still there? 
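For the userspace side of the `devio` wrapper discussed above ("similar to pread/write"), a rough sketch might look like the following. The signature follows the one quoted in the log, with `off_t` in place of `loff_t`. Everything else is an assumption made for illustration: the `struct dev` layout, the `DEVIO_READ`/`DEVIO_WRITE` constants, the `dev_open` helper, and the choice to return a negative errno value.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

enum { DEVIO_READ = 0, DEVIO_WRITE = 1 };   /* assumed stand-ins for READ/WRITE */

struct dev { int fd; };                     /* hypothetical userspace device handle */

/* Hypothetical helper: open (or create) a file to act as the block device. */
int dev_open(struct dev *dev, const char *path)
{
	dev->fd = open(path, O_RDWR | O_CREAT, 0600);
	return dev->fd < 0 ? -errno : 0;
}

/*
 * Synchronously transfer len bytes between buf and byte offset loc.
 * loc is a byte offset, not a block number, so a superblock that is not
 * aligned to the filesystem block size can still be addressed.
 * Returns 0 on success or a negative errno value.
 */
int devio(int io, struct dev *dev, void *buf, unsigned len, off_t loc)
{
	char *p = buf;

	assert(dev && dev->fd >= 0);
	while (len) {
		ssize_t n = io == DEVIO_WRITE
			? pwrite(dev->fd, p, len, loc)
			: pread(dev->fd, p, len, loc);
		if (n < 0) {
			if (errno == EINTR)
				continue;       /* interrupted, retry the same range */
			return -errno;
		}
		if (n == 0)
			return -EIO;        /* ran off the end of the device */
		p += n;                 /* short transfer: continue with the rest */
		loc += n;
		len -= n;
	}
	return 0;
}
```

Taking a byte offset keeps the caller free of sector shifts, in line with the remark that it does not make sense for every caller to do the shift.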
2009-02-05 00:48 hi 2009-02-05 00:49 I think it may be possible to set buffers clean in a foreground task after all writes for a delta complete 2009-02-05 00:49 if that is possible, then lock_buffer protects the state 2009-02-05 00:50 I have to go for a while 2009-02-05 00:50 ok 2009-02-05 00:59 i see 2009-02-05 00:59 it may be good 2009-02-05 01:14 it may have refcnt issue 2009-02-05 01:15 if page is under io, fork will not take refcnt 2009-02-05 01:15 if page is before io, fork will take refcnt 2009-02-05 01:15 however, this may be able to be solved some trick 2009-02-05 01:15 not sure 2009-02-05 01:32 fork does leave a refcount on the original page, that has to be dropped some time 2009-02-05 01:33 yes, drop timing is problem 2009-02-05 01:33 I guess 2009-02-05 01:33 yes, it is a problem in traditional buffer IO also 2009-02-05 01:34 if forked, page refcount is inherited 2009-02-05 01:34 if not, page refcount is not inherited 2009-02-05 01:34 I think 2009-02-05 01:35 we could take one refcount on a page for every buffer that will be written 2009-02-05 01:36 it would work 2009-02-05 01:37 however, for starting io, it will need to serialize with fork to take refcnt 2009-02-05 01:37 here is another issue: a buffer may be forked when IO on the original, dirty page has already completed, now we get a forked page that does not need to be written 2009-02-05 01:38 ah, yes 2009-02-05 01:38 that's not quite an accurate description 2009-02-05 01:38 we get a cloned buffer that does not need to be written 2009-02-05 01:38 I think it can be solved with page state 2009-02-05 01:39 e.g. 
SetPageChecked before start io 2009-02-05 01:39 fork should check it 2009-02-05 01:39 maybe 2009-02-05 01:40 well, I thought yesterday about adding header by fork 2009-02-05 01:40 it actually add by dirty 2009-02-05 01:40 fork is just part of dirty 2009-02-05 01:40 if buffer become dirty, it adds header for backend 2009-02-05 01:41 yes 2009-02-05 01:41 this idea is adding header even if buffer is clean 2009-02-05 01:41 yes, I'm thinking about it 2009-02-05 01:41 I guess it would be easy 2009-02-05 01:42 but, I guess it would be not so efficient 2009-02-05 01:42 because it may add the needless header 2009-02-05 01:43 e.g. if buffer is truncated before io, I guess header is needless completely 2009-02-05 01:44 I think the strongest idea is to do the cleaning in a foreground task, after all delta writeout has completed 2009-02-05 01:44 forked blocks will be fairly rare 2009-02-05 01:44 so it is more important to be robust than efficient 2009-02-05 01:45 robust and efficient would be nice of course 2009-02-05 01:45 i see 2009-02-05 01:45 I'm thinking to add header with dirty is robust 2009-02-05 01:46 good, we have two robust ideas ;) 2009-02-05 01:47 foreground task clean solves parallel fork issue? 2009-02-05 01:47 if so, I think it is more efficient than adding header 2009-02-05 01:47 I think it does, details need to be considered 2009-02-05 01:48 i see 2009-02-05 01:48 good 2009-02-05 01:48 well, FWIW, adding header would be simple, let me tell a bit 2009-02-05 01:48 ah, you may know already 2009-02-05 01:49 please tell 2009-02-05 01:49 well, not completed though 2009-02-05 01:49 it adds new header for backend when buffer become dirty 2009-02-05 01:50 if fork is happened, this header will inherit to clone page 2009-02-05 01:51 with this, owner of buffer_head is frontend, and new header is backend 2009-02-05 01:51 backend will just work for new header 2009-02-05 01:51 it doesn't touch by frontend, so it will provide stable list too 2009-02-05 01:52 e.g. 
2009-02-05 01:52 yes, I see it 2009-02-05 01:52 blockdirty() adds new header, and insert to dirty list to it for background 2009-02-05 01:53 header will stay always if it was not truncated 2009-02-05 01:53 even if page is forked, header is not modified by frontend 2009-02-05 01:53 and 2009-02-05 01:54 new header takes refcnt of page 2009-02-05 01:54 so, if io was completed, it will free page refcnt and header 2009-02-05 01:54 it is also true, even if page was forked 2009-02-05 01:55 because, before and after has new header 2009-02-05 01:55 and 2009-02-05 01:56 if page is under io, it can tell with page state like SetPageChecked 2009-02-05 01:56 this is what I thought yesterday 2009-02-05 01:57 I'm not so sure yet though 2009-02-05 01:57 it sounds workable, and the cost is extra buffer allocations 2009-02-05 01:57 yes 2009-02-05 01:58 the idea I am working on is, two lists of buffers for writeout, one is only original buffers and the other is only cloned 2009-02-05 01:58 a fork may take place at any time before all the buffers are cleaned, and the effect will be to move a buffer from the original list to the cloned list 2009-02-05 01:59 after IO completes, we walk the list of original buffers, setting them all clean 2009-02-05 01:59 when that walk is finished, no more buffers can be forked because there are no more original, dirty buffers 2009-02-05 02:00 at that point, the list of forked buffers is stable 2009-02-05 02:00 so it can be walked, releasing each page 2009-02-05 02:01 moving a buffer from the original to the forked list is covered by a spinlock 2009-02-05 02:01 I did not describe how to submit the writeout... 2009-02-05 02:02 because I have not thought clearly about it yet 2009-02-05 02:02 i see 2009-02-05 02:02 cloned == forked 2009-02-05 02:03 origianl buffers is non-forked buffers? 2009-02-05 02:03 yes 2009-02-05 02:03 non-forked, dirty in the commit delta 2009-02-05 02:03 i.e. it shares with frontend? 
2009-02-05 02:03 yes 2009-02-05 02:03 i see 2009-02-05 02:04 submitting all the buffers for writeout is a little tricky, because they can change lists in the middle of the submit loop 2009-02-05 02:04 yes 2009-02-05 02:05 the point of extra header would be it 2009-02-05 02:06 yes 2009-02-05 02:06 but, I wonder whether the extra header is overkill for it 2009-02-05 02:06 so, I am thinking about an algorithm that can submit each dirty block exactly once 2009-02-05 02:07 i see 2009-02-05 02:07 it's nice to have a robust algorithm like yours to fall back on if necessary 2009-02-05 02:07 ok 2009-02-05 02:08 I'll try to improve it a bit 2009-02-05 02:08 a "submitted" bit in the buffer flags might help me 2009-02-05 02:08 btw, buffer_head has some unused fileds 2009-02-05 02:08 yes 2009-02-05 02:08 b_private and b_end_io 2009-02-05 02:09 b_private can be used for us completely 2009-02-05 02:09 I think 2009-02-05 02:09 I think so too 2009-02-05 02:10 ideally, we can eventually change to block handles, so try not to use extra buffer fields if there is an alternative 2009-02-05 02:10 pragmatically, we will use any field that makes life easier 2009-02-05 02:10 yes, it's good 2009-02-05 02:11 I think the "submitted" bit makes the writeout submission robust 2009-02-05 02:12 BH_Req 2009-02-05 02:12 it doesn't clear though 2009-02-05 02:13 well, I guess similar 2009-02-05 02:14 walk the original list by removing an element from it and put it on a submitted list. A fork can remove a buffer either from the original or submitted list, and put a copy on the cloned list. When the original list is empty, do the same to the forked list, only submitting buffers that are not marked submitted. 2009-02-05 02:14 something like that 2009-02-05 02:15 lru list scanning works something like that... 
always removes a page from the beginning of the lru list 2009-02-05 02:16 trying to walk the original list with for_each_safe doesn't work, because the "next" element might be deleted 2009-02-05 02:16 next delta buffer can be added to original list? 2009-02-05 02:16 probably don't actually need a "cloned" list 2009-02-05 02:16 well 2009-02-05 02:16 yes, need it 2009-02-05 02:17 otherwise the walk of the original list may not terminate 2009-02-05 02:17 yes 2009-02-05 02:18 it would be solved with per delta list per inode 2009-02-05 02:18 actually, two original list 2009-02-05 02:19 it means next delta buffer will be added to lists[sb->detla % 2] 2009-02-05 02:19 why per inode? 2009-02-05 02:20 we may want to walk for one inode 2009-02-05 02:20 to make good contiguous extent 2009-02-05 02:20 yes 2009-02-05 02:20 well, blocks only matter for metadata inodes 2009-02-05 02:21 like directories 2009-02-05 02:21 yes 2009-02-05 02:22 we can think about pages for regular data files 2009-02-05 02:22 it's simpler when there is only one set of state to worry about 2009-02-05 02:22 for data page, we will allow partial write? 2009-02-05 02:23 yes 2009-02-05 02:23 it also means partial dirty 2009-02-05 02:23 mpage_writepage handles that by giving up, doesn't it 2009-02-05 02:23 and it lets writepage handle it, which may put buffers on the page 2009-02-05 02:24 yes 2009-02-05 02:24 so... yes, I guess we have the same issue 2009-02-05 02:24 probably 2009-02-05 02:25 and have to put buffers on the page any time a partial page transfer is required, like at the last page of a file 2009-02-05 02:26 or don't allow partial dirty 2009-02-05 02:26 is it possible? 
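The writeout walk described a little earlier (always take the first element off the list instead of iterating with a saved "next" pointer, and keep one dirty list per delta parity so the walk terminates) can be sketched in userspace C as below. This is only an illustration under assumptions: `struct sb_sketch`, `mark_dirty`, `stage_delta` and `steal_head` are hypothetical names, the list helpers are a cut-down imitation of the kernel's `list_head`, and the spinlock that would cover list moves is omitted.

```c
#include <assert.h>
#include <stddef.h>

struct list { struct list *prev, *next; };

void list_init(struct list *h) { h->prev = h->next = h; }
int list_empty(const struct list *h) { return h->next == h; }

void list_del(struct list *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->prev = e->next = e;
}

void list_add_tail(struct list *e, struct list *h)
{
	e->prev = h->prev;
	e->next = h;
	h->prev->next = e;
	h->prev = e;
}

struct buf {
	struct list link;       /* first member, so a list pointer casts back to buf */
	unsigned delta;         /* delta in which the buffer was dirtied */
	int submitted;          /* the "submitted" bit from the discussion */
};

struct sb_sketch {
	unsigned delta;         /* current frontend delta */
	struct list dirty[2];   /* dirty[delta % 2]: frontend and backend lists */
	struct list submitted;
};

void sb_init(struct sb_sketch *sb)
{
	sb->delta = 0;
	list_init(&sb->dirty[0]);
	list_init(&sb->dirty[1]);
	list_init(&sb->submitted);
}

void mark_dirty(struct sb_sketch *sb, struct buf *b)
{
	b->delta = sb->delta;
	b->submitted = 0;
	list_add_tail(&b->link, &sb->dirty[sb->delta % 2]);
}

/*
 * Submit every buffer dirtied in `delta`. Because the loop re-reads the
 * list head each time, submit() may unlink other buffers (as a fork would)
 * without invalidating the walk. Returns the number of buffers submitted.
 */
unsigned stage_delta(struct sb_sketch *sb, unsigned delta,
		     void (*submit)(struct sb_sketch *, struct buf *))
{
	struct list *head = &sb->dirty[delta % 2];
	unsigned count = 0;

	while (!list_empty(head)) {
		struct buf *b = (struct buf *)head->next;
		list_del(&b->link);
		list_add_tail(&b->link, &sb->submitted);
		b->submitted = 1;
		if (submit)
			submit(sb, b);
		count++;
	}
	return count;
}

/* Stand-in for a concurrent fork: unlink whatever is now first on the list. */
void steal_head(struct sb_sketch *sb, struct buf *b)
{
	struct list *head = &sb->dirty[b->delta % 2];

	if (!list_empty(head))
		list_del(head->next);
}
```

Buffers dirtied after the delta counter advances land on the other list, so the loop over the staged delta cannot be fed new work while it runs.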
2009-02-05 02:26 if we write non-dirty buffer too 2009-02-05 02:27 on same page 2009-02-05 02:27 I guess it is possible 2009-02-05 02:27 last page doesn't have block on dleaf, so I guess it can know 2009-02-05 02:28 at least, if we think blocksize of file is PAGE_CACHE_SIZE, I guess it is possible 2009-02-05 02:29 I think, possible even with sub-page blocks 2009-02-05 02:29 without buffer_head? 2009-02-05 02:29 I think so 2009-02-05 02:30 if we submit the writeout while walking dleaf 2009-02-05 02:30 who remembers dirty state of buffer? 2009-02-05 02:30 good question 2009-02-05 02:32 maybe, we are always completely cleaning a page when we write any part of it out 2009-02-05 02:32 I have not thought about this very much 2009-02-05 02:33 there is one big problem sub-page blocks without buffer 2009-02-05 02:33 if there is a one block hole on page 2009-02-05 02:34 we can't know users want to fill it or not 2009-02-05 02:34 without per block dirty state 2009-02-05 02:35 you mean, a sparse file, or an unwritten part of a page? 2009-02-05 02:35 it meant a sparse file 2009-02-05 02:36 and user dirtied the exsisted block on the same page 2009-02-05 02:37 of course, we can fill it with zeroed data 2009-02-05 02:38 however, it means blocksize is similar to PAGE_CACHE_SIZE more or less 2009-02-05 02:43 anyway, it will not be so bad if we just put buffers on some file pages 2009-02-05 02:44 as is done now 2009-02-05 02:47 yes 2009-02-05 02:47 at least, first version would be good with it 2009-02-05 02:48 I'm hopeing it will be done by using existed code 2009-02-05 02:49 yes 2009-02-05 02:49 we will just focus on metadata now 2009-02-05 02:49 yes 2009-02-05 02:50 btw, your girl is ok without you? 
2009-02-05 02:50 my wife is with her 2009-02-05 02:50 good 2009-02-05 02:51 but it is time to sleep 2009-02-05 02:51 I will return to userspace tomorrow 2009-02-05 02:51 yes, oyasumi 2009-02-05 02:51 oyasumi 2009-02-05 03:13 -!- cam(~cam@203-219-255-75.tpgi.com.au) has joined #tux3 2009-02-05 03:17 he guys, just trying to test tux3 and get an understanding before I try to help, but i am getting errors on compilation. Can anyone confirm I am taking the right steps, I followed the article here - http://lwn.net/Articles/308950/ and up to the part of creating a tux3 filesystem, so i got the latest mecurial and ran make but i get a compilation error 2009-02-05 03:18 what error? 2009-02-05 03:18 well its long but the first error is tux3.c:12:18: error: popt.h: No such file or directory 2009-02-05 03:19 you need to install libpopt-dev 2009-02-05 03:19 cool, thanks 2009-02-05 03:20 no problem, and good night 2009-02-05 04:15 ah, flips is asleep now 2009-02-05 04:15 bah 2009-02-05 04:15 ACTION is still wide awake 2009-02-05 08:08 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-05 09:50 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-05 11:10 -!- ff31b3pre(~chatzilla@ANice-151-1-31-125.w83-197.abo.wanadoo.fr) has joined #tux3 2009-02-05 11:25 -!- ff31b3pre(~chatzilla@ANice-151-1-31-125.w83-197.abo.wanadoo.fr) has left #tux3 2009-02-05 12:12 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-05 14:13 -!- dcg(~dcg@157.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-02-05 16:46 -!- macan(~macan@159.226.41.137) has joined #tux3 2009-02-05 17:38 -!- cam(~cam@203-219-255-75.tpgi.com.au) has joined #tux3 2009-02-05 19:59 -!- cam(~cam@203-219-255-75.tpgi.com.au) has joined #tux3 2009-02-05 20:13 -!- Man_of_Wax(~wax@gualtiero.cs.unibo.it) has joined #tux3 2009-02-05 20:20 hirofumi, there? 2009-02-05 20:20 hi 2009-02-05 20:21 been following the commits? 
2009-02-05 20:22 last commit was not seen 2009-02-05 20:23 ah, that is the first one from me in the last few days that makes actual progress 2009-02-05 20:24 now we have a way of sharing some of the kernel/super.c code with userspace 2009-02-05 20:25 I think rename dev to block_device would be helpful 2009-02-05 20:25 that would be fine with me 2009-02-05 20:26 that way the device can be in a local variable 2009-02-05 20:26 or we can always just use sb_dev(), which is efficient 2009-02-05 20:27 and the compiler will take care of the optimization 2009-02-05 20:27 maybe, rename will be able to share more for us 2009-02-05 20:28 well, I'm not sure though 2009-02-05 20:28 just a idea 2009-02-05 20:28 btw, I have some cleanup and iroot thought 2009-02-05 20:31 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-02-05 20:31 I found the bugs in userspace 2009-02-05 20:31 related to blocksize 2009-02-05 20:31 and it changed my mind about itable initialization 2009-02-05 20:51 I would like to see it 2009-02-05 20:51 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-02-05 20:58 one thing we should do when we have time: make the inode style more like kernel 2009-02-05 20:58 by using container_of just like kernel 2009-02-05 20:59 it is what's for? 
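For readers unfamiliar with the idiom mentioned above, here is a minimal sketch of container_of: the generic inode is embedded inside a filesystem-private struct, and the private struct is recovered from a pointer to the embedded member. The macro below is a simplified portable form (the kernel's version adds type checking), and `struct tux_inode_sketch` and its fields are hypothetical names, not the actual tux3 layout.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of: recover the enclosing struct from a member
 * pointer. The kernel macro adds type checking on top of this. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical generic inode, standing in for the VFS struct inode. */
struct inode {
	unsigned long i_ino;
};

/* Hypothetical filesystem-private inode that embeds the generic one. */
struct tux_inode_sketch {
	int present;
	struct inode vfs_inode;
};

/* Map a generic inode pointer back to its private container. */
static struct tux_inode_sketch *tux_inode_of(struct inode *inode)
{
	return container_of(inode, struct tux_inode_sketch, vfs_inode);
}
```

With this arrangement, code that receives only a `struct inode *` can reach the private fields without a separate pointer, which is the style kernel filesystems typically use.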
2009-02-05 21:00 that way the userspace and kernel inode code can be closer 2009-02-05 21:00 for now, inode seems to be abstracted cleanly 2009-02-05 21:00 our way of doing things is not too bad, I am just thinking of small refinements 2009-02-05 21:00 yes 2009-02-05 21:01 I didn't see expect small problems 2009-02-05 21:01 but when we start review, any unnecessary abstraction may be criticized 2009-02-05 21:01 i see 2009-02-05 21:01 we can make the change then of course 2009-02-05 21:01 in fact, it's best to wait and do it when somebody asks 2009-02-05 21:01 agreed, it has not caused problems 2009-02-05 21:02 i see 2009-02-05 21:02 the whole kernel/userspace arrangement works very well 2009-02-05 21:02 yes 2009-02-05 21:02 if we separated kernel and userspace, I thought it will be removed 2009-02-05 21:02 and I was thinking it is likely 2009-02-05 21:03 and now I think we will never separate user and kernel 2009-02-05 21:03 it is too useful to have both with the same code 2009-02-05 21:04 for now 2009-02-05 21:04 I also originally planned to separate them at some point 2009-02-05 21:04 now, I see that most new developments will be done in userspace 2009-02-05 21:04 I think if kernel works fine at long future, we may not want to maintain userspace code like tux3 command 2009-02-05 21:05 probably, the tux3 command will just become more like tune2fs 2009-02-05 21:05 an interface for tweaking the fs, debugging, tuning 2009-02-05 21:06 probably 2009-02-05 21:06 emergency repair 2009-02-05 21:06 I think that a command driven interface will always be better for reassembling a badly damaged filesystem than any automatic system 2009-02-05 21:06 so, I thought to maintain shared code would be the cause of lazy 2009-02-05 21:07 it is more work to maintain the sharing, but it saves time in adding features 2009-02-05 21:07 and it allows more rapid changes, to improve the code 2009-02-05 21:08 so in the current work I was doing, I had to build both kernel and userspace at each 
step 2009-02-05 21:08 that is a little more work 2009-02-05 21:08 yes, it is long future 2009-02-05 21:09 I'm thinking in future the code will become different more 2009-02-05 21:09 if there is not much different, share is good, of course 2009-02-05 21:09 + complete(&sync->completion); <- nice 2009-02-05 21:10 thanks :) 2009-02-05 21:10 thisbig is a nice field name :) 2009-02-05 21:11 :) 2009-02-05 21:11 ok, it's nice to see popt go 2009-02-05 21:12 it is unsure stuff 2009-02-05 21:12 I guess we will end up with our own parser, with a few more features than you implemented 2009-02-05 21:12 because we need kernel options parsing 2009-02-05 21:12 it may depend glibc instead of popt 2009-02-05 21:12 sure 2009-02-05 21:12 we can make something nice for kernel that also works for userspace 2009-02-05 21:13 we don't use just parse stuff in kernel? 2009-02-05 21:13 well, it doesn't matter 2009-02-05 21:13 I was going to get rid of fd_t also 2009-02-05 21:13 so thanks 2009-02-05 21:14 there is no example of a nice options parser in a filesystem in kernel 2009-02-05 21:14 they are all horrible 2009-02-05 21:14 kernel hackers and parsing don't seem to mix well 2009-02-05 21:14 they used to be much worse, not even reentrant 2009-02-05 21:15 yes, it is not so good 2009-02-05 21:16 your "separate itable intialize" patch might conflict with mine 2009-02-05 21:16 but, parser is not interesting part 2009-02-05 21:16 no problem, I'll update it 2009-02-05 21:16 no, but it is a big part of most super.c files 2009-02-05 21:17 but, uninteresting? 2009-02-05 21:17 well 2009-02-05 21:17 yes, uninteresting 2009-02-05 21:17 it's necessary work 2009-02-05 21:17 new parser? 
2009-02-05 21:18 any form of options parser 2009-02-05 21:18 since it is such a large part of the filesystem, it should also be a nice part 2009-02-05 21:18 well, intially we will not have many options 2009-02-05 21:18 options start to arrive over time 2009-02-05 21:18 as people add things 2009-02-05 21:19 well, match_table_t stuff is easy to use for me 2009-02-05 21:19 everything looks good 2009-02-05 21:19 it's not so good though 2009-02-05 21:19 I did not look too closely at the fix blocksize patch 2009-02-05 21:20 blocksize one is 2009-02-05 21:20 we initialize dev->bits from command arguemnt 2009-02-05 21:20 but, if it's not make_tux3, it should read from super block 2009-02-05 21:21 right 2009-02-05 21:21 current one is using dev->bits for init_buffers before load_sb 2009-02-05 21:21 oops 2009-02-05 21:22 tux3.c just grew, it was never designed 2009-02-05 21:22 or planned in any way 2009-02-05 21:22 so, tux3 userspace was not working until that patch if blockbits != 12 2009-02-05 21:22 ah 2009-02-05 21:23 well, itself is no problem, it's just a bug 2009-02-05 21:23 fixed 2009-02-05 21:23 so, initialization order become like kernel 2009-02-05 21:23 init dev from sb, and start others 2009-02-05 21:24 yes 2009-02-05 21:24 we will try to share more of the init code as we progress 2009-02-05 21:24 good 2009-02-05 21:25 load/save_sb now work in userspace 2009-02-05 21:25 yes 2009-02-05 21:25 and they can replace some code in tux3(fuse/graph) 2009-02-05 21:25 already done? 2009-02-05 21:25 yes 2009-02-05 21:25 ok 2009-02-05 21:25 in commit.c 2009-02-05 21:25 I'll merge those patches after it 2009-02-05 21:26 a funny place for it, but that file is shared by user and kernel 2009-02-05 21:26 so I put shared code that really belongs to super.c in there 2009-02-05 21:26 it has some relation to commit 2009-02-05 21:27 place was changed? 
2009-02-05 21:27 I like having kernel/utility.c now, it looks much better in a listing than hexdump.c 2009-02-05 21:27 the load/save_sb code was moved from kernel/super.c to kernel/commit.c 2009-02-05 21:27 now, I see tux_load_sb/tux_save_sb in commit.c 2009-02-05 21:27 ah, yes 2009-02-05 21:28 and those can be used from userspace too 2009-02-05 21:28 hmm, where did the check for MAGIC go? 2009-02-05 21:28 MAGIC is in unpack_sb 2009-02-05 21:29 right 2009-02-05 21:29 I think that should be ENOENT 2009-02-05 21:29 do you have already convert patch for tux_save/load_sb? 2009-02-05 21:29 no 2009-02-05 21:29 not yet 2009-02-05 21:29 ENOENT will confuse mount command 2009-02-05 21:30 so EINVAL is expected for unrecognized superblock? 2009-02-05 21:30 yes 2009-02-05 21:31 ok, fine 2009-02-05 21:31 bad choice of errno 2009-02-05 21:31 there are so many other things than can give EINVAL on a mount 2009-02-05 21:31 iirc, mount syscall is already using ENOENT 2009-02-05 21:32 yes 2009-02-05 21:32 so, mount command will output 2009-02-05 21:32 EINVAL source had an invalid superblock. 2009-02-05 21:32 wrong fs, or option, or bad superblock, or something 2009-02-05 21:32 ENOENT A pathname was empty or had a nonexistent component. 
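A minimal sketch of the magic check discussed above, returning -EINVAL for an unrecognized superblock. The magic string, the on-disk layout and the `_sketch` names are invented for illustration and are not the real tux3 definitions; the sketch also derives the block size from the superblock, in line with the "init dev from sb" ordering mentioned earlier.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical magic and on-disk layout, for illustration only. */
#define SKETCH_MAGIC "sketchfs"
#define SKETCH_MAGIC_LEN 8

struct disksuper_sketch {
	char magic[SKETCH_MAGIC_LEN];
	unsigned char blockbits;
};

struct sb_sketch {
	unsigned blockbits;
	unsigned blocksize;
};

/*
 * Validate the magic before trusting any other field. An unrecognized
 * superblock yields -EINVAL, the code mount(2) documents for an invalid
 * superblock, rather than -ENOENT, which mount(2) uses for path errors.
 */
static int unpack_sb_sketch(struct sb_sketch *sb,
			    const struct disksuper_sketch *super)
{
	if (memcmp(super->magic, SKETCH_MAGIC, SKETCH_MAGIC_LEN))
		return -EINVAL;
	sb->blockbits = super->blockbits;
	sb->blocksize = 1U << super->blockbits;
	return 0;
}
```

On failure the in-memory superblock is left untouched, so a caller can safely try another filesystem type.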
2009-02-05 21:32 really bad decisions on those error codes 2009-02-05 21:33 they're exactly backwards 2009-02-05 21:33 I wonder if that comes from posix, or us 2009-02-05 21:33 linux-specific 2009-02-05 21:33 iirc, us 2009-02-05 21:33 it came from us 2009-02-05 21:33 I think mount is not posix 2009-02-05 21:34 when you are ready for a pull, please say 2009-02-05 21:35 ok 2009-02-05 21:39 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-02-05 21:39 done 2009-02-05 21:39 it is including convert of load_sb to tux_load_sb 2009-02-05 21:42 this is an example of where pulling from you into the same temporary repo produced two heads 2009-02-05 21:42 it will be ok when I pull into the real repo 2009-02-05 21:42 but just for interest, what were the hg steps that caused this? 2009-02-05 21:43 temporary repo has what patches? 2009-02-05 21:43 ah 2009-02-05 21:43 you pull from my repo twice? 2009-02-05 21:43 yes 2009-02-05 21:43 just to see what happens 2009-02-05 21:43 I expected two heads 2009-02-05 21:44 yes, it's the problem of my work style 2009-02-05 21:44 I'm just interested in what you did in your repo to cause that 2009-02-05 21:44 it's not a problem, I don't pull until you say you are ready 2009-02-05 21:44 I'm working by the patches 2009-02-05 21:45 and to push it, I'm using "hg import", then send to server it 2009-02-05 21:45 so how do you revert your hg back to some branch point? 
2009-02-05 21:45 ah 2009-02-05 21:45 hg will produce new sha1 for each push 2009-02-05 21:45 I see 2009-02-05 21:46 so it doesn't automatically create a branch, it just relies on the sha1 to determine the parent 2009-02-05 21:46 yes 2009-02-05 21:46 that is actually a weakness of git, mercurial and monotone 2009-02-05 21:47 one of the few weaknesses 2009-02-05 21:47 yes, well but it has different timestamp 2009-02-05 21:47 it produces the expected result most of the time 2009-02-05 21:48 yes 2009-02-05 21:48 anyway, now I know that hg import is the cause, I will be able to sleep peacefully tonight :) 2009-02-05 21:48 well, my style is changing history 2009-02-05 21:48 :) 2009-02-05 21:49 changing history is a good thing 2009-02-05 21:49 except for bankrupt hedge funds 2009-02-05 21:49 yes, it's very good thing until real commit 2009-02-05 21:49 even then, it would be nice to be able to commit a "change history" commit 2009-02-05 21:49 because people make mistakes 2009-02-05 21:50 it's stupid to force them to exist forever 2009-02-05 21:50 yes 2009-02-05 21:50 hg seems it can't completely 2009-02-05 21:50 git is more flexible 2009-02-05 21:51 yes 2009-02-05 21:51 I think it is partly the philosophy of the designer 2009-02-05 21:51 probably 2009-02-05 21:52 but hg has rollback, so it may just not be implemented 2009-02-05 21:52 well, I'll go to shop for food 2009-02-05 21:52 by the way, I like load/save_sb more than tux_load/save_sb 2009-02-05 21:52 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-05 21:52 as you originally had it 2009-02-05 21:53 yes 2009-02-05 21:53 which one should we use? 2009-02-05 21:53 I felt it too 2009-02-05 21:53 ok, want to change the patch before I pull then? 
2009-02-05 21:53 in past, I'm going to use prefix in kernel always 2009-02-05 21:54 I'll change it now 2009-02-05 21:54 it doesn't make sense to use a prefix on anything that is already obvious from the call chain 2009-02-05 21:54 if that made sense 2009-02-05 21:54 it is just to avoid name conflict from other kernel part 2009-02-05 21:55 right 2009-02-05 21:55 if we export it only internally, can we tell linker to keep it private? 2009-02-05 21:56 iirc, it can with some trick 2009-02-05 21:56 but, I can't remember, and it was easy or not 2009-02-05 21:56 well, load/save_sb will be exported internally, so they will be non-static, we probably don't collide but we could 2009-02-05 21:57 yes 2009-02-05 21:57 so whatever you want 2009-02-05 21:57 I can pull now or wait a few minutes and pull a revised patch 2009-02-05 21:57 I also think non-prefix is good now 2009-02-05 21:57 ok, good 2009-02-05 21:58 and we will figure out the linker trick if we have to 2009-02-05 21:58 I think it is "strip" 2009-02-05 21:58 there's also a "weak symbol" idea 2009-02-05 21:59 so I wait for a revised patch? 
2009-02-05 21:59 yes 2009-02-05 21:59 I'll push soon 2009-02-05 22:00 by the way, my girl is fine 2009-02-05 22:00 but last night was very scary 2009-02-05 22:00 oh, good 2009-02-05 22:02 child is our future (hard to say well with english for me :)) 2009-02-05 22:07 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-02-05 22:07 ok, done 2009-02-05 22:07 remove tux_ prefix from tux_load_sb/save_sb/load_itable 2009-02-05 22:09 especially my child, according to me :) 2009-02-05 22:15 pushed to public 2009-02-06 00:35 -!- RazvanM(~RazvanM@96.234.232.151) has joined #tux3 2009-02-06 01:19 hey flips 2009-02-06 01:19 ACTION reads the backlog like normal 2009-02-06 03:00 -!- cam(~cam@203-219-255-75.tpgi.com.au) has joined #tux3 2009-02-06 06:30 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-06 07:28 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-06 08:02 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-06 09:31 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-06 10:23 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-06 11:38 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-06 11:53 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-06 14:16 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-06 18:59 -!- macan(~macan@159.226.41.137) has joined #tux3 2009-02-06 20:59 ok, time to make atomic commit do more 2009-02-06 20:59 starting with our nice new superblock IO stuff 2009-02-06 21:34 two more biggish bits left to add for atomic commit: 1) btree node split logging 2) load log blocks and replay to reconstruct pinned metadata 2009-02-06 21:34 then comes the for-real debugging 2009-02-06 21:34 we're close 2009-02-06 21:36 ACTION assumes the lotus position to focus on btree node splits 2009-02-06 23:09 another idea for robust 
stage_delta 2009-02-06 23:09 blockdirty() can return buffer_head 2009-02-06 23:10 if page is already forked, it returns -EAGAIN 2009-02-06 23:10 if page is not forked yet, it forks 2009-02-06 23:11 I guess -EAGAIN can happen only in a small window 2009-02-06 23:11 well, so if EAGAIN happens, caller should restart from blockread() 2009-02-06 23:12 I looked into that strategy, and found a difficult issue 2009-02-06 23:12 but I didn't write it down 2009-02-06 23:12 :) 2009-02-06 23:12 what is issue? 2009-02-06 23:13 I will have to think 2009-02-06 23:13 I didn't not consider the EAGAIN return 2009-02-06 23:13 my key point is using the page bit to mark that the page was forked 2009-02-06 23:13 I mean, I did not consider the EAGAIN return 2009-02-06 23:14 ah, that is indeed a robust idea 2009-02-06 23:14 probably, I guess this would work 2009-02-06 23:15 It sounds promising 2009-02-06 23:15 but, not sure performance is good or not 2009-02-06 23:15 what is the performance issue? 2009-02-06 23:15 -EAGAIN will run again from blockread 2009-02-06 23:15 restart from read? 2009-02-06 23:15 yes 2009-02-06 23:15 it sounds rare 2009-02-06 23:16 probably 2009-02-06 23:16 just not sure 2009-02-06 23:16 so the second blockread will retrieve the copied page from page cache? 2009-02-06 23:17 yes 2009-02-06 23:17 meaning we do not have to change the page out from under the buffer 2009-02-06 23:18 the forked page would need to carry a (2 bit) delta number 2009-02-06 23:19 why 2bit number is needed? 2009-02-06 23:19 because a page forked into the current delta will belong to the committing delta after phase transition 2009-02-06 23:19 it would have to be forked again if dirtied in current delta 2009-02-06 23:21 forked twice? 
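One possible reading of the 2-bit delta number, as a hedged sketch: each page remembers the low two bits of the delta in which it was last dirtied, and a dirty page whose recorded delta differs from the current one must be forked before it is modified. The struct, field and function names are hypothetical, and the sketch assumes few enough deltas are in flight that two bits never alias.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-page state: dirty flag plus the low two bits of the
 * delta in which the page was last dirtied. */
struct page_sketch {
	bool dirty;
	unsigned delta;	/* always stored masked to two bits */
};

/*
 * A page dirtied in an earlier delta may still belong to the committing
 * delta, so dirtying it again in a later delta requires a fork.
 * A clean page, or one already dirty in the current delta, does not.
 */
static bool need_fork(const struct page_sketch *page, unsigned current_delta)
{
	return page->dirty && page->delta != (current_delta & 3);
}

/* Record a dirty in the current delta (after any required fork). */
static void mark_dirty(struct page_sketch *page, unsigned current_delta)
{
	page->dirty = true;
	page->delta = current_delta & 3;
}
```

Under this reading, a page dirtied in three successive deltas is forked twice, once at each delta transition after the first dirty.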
2009-02-06 23:22 yes, if dirtied in three successive deltas 2009-02-06 23:22 I think if it's three, the page is 2009-02-06 23:23 first => second => third 2009-02-06 23:23 clean -> dirty in current -> delta transition -> dirty in committing -> dirty -> dirty in current -> delta transition -> dirty in committing -> dirty -> dirty in current 2009-02-06 23:23 ah, that's right 2009-02-06 23:23 :) 2009-02-06 23:23 good 2009-02-06 23:24 I was thinking about the wrong object, I was thinking about the buffer when I should have thought about the page 2009-02-06 23:24 it seems robust indeed 2009-02-06 23:24 less radical 2009-02-06 23:26 and buffer will also be inherit to forked page 2009-02-06 23:27 and I guess another benefit is frontend would not be needed buffer_head 2009-02-06 23:27 buffer_head is just used to remember dirty buffer 2009-02-06 23:28 yes 2009-02-06 23:28 and pointer to data may be enough for frontend 2009-02-06 23:29 well, it's another story 2009-02-06 23:30 good, well we have several weeks to analyze this 2009-02-06 23:30 yes 2009-02-06 23:30 I'll play with this for a while 2009-02-06 23:31 I will keep making incrmental steps towards atomic commit demo 2009-02-06 23:31 ok 2009-02-06 23:32 your work is main for atomic commit 2009-02-06 23:39 -!- RazvanM(~RazvanM@96.234.232.151) has joined #tux3 2009-02-06 23:45 the diskwrites in user/commit.c need to be replaced with async IO, that is sync in userspace 2009-02-06 23:45 in order to work in kernel 2009-02-06 23:45 hey flips 2009-02-06 23:46 ACTION reads the backlog 2009-02-06 23:46 hi bh 2009-02-06 23:46 it's good 2009-02-06 23:46 just sped through your area to get to San Francisco, made it 2009-02-06 23:46 hand made aio or aio thread 2009-02-06 23:49 aio is so much easier in kernel than usespace 2009-02-06 23:49 anyway, it doesn't need to be async in userspace 2009-02-06 23:49 it can just complete synchronously in the submission, and the completion code does nothing 2009-02-06 23:50 ah, I guessed you want it 
2009-02-06 23:50 optimizing user space is not a high priority 2009-02-06 23:50 having it run similar code to kernel is the important thing 2009-02-06 23:50 yes, it's test purpose 2009-02-06 23:51 so, something like syncio 2009-02-06 23:52 maybe using user space definition of biovec, like you showed in your patch 2009-02-06 23:52 or maybe, abstracting the interface in a different way 2009-02-06 23:52 hiding the biovecs 2009-02-06 23:52 pass an endio callback to something like devio 2009-02-06 23:53 yes 2009-02-06 23:53 I guess that would be the simplest 2009-02-06 23:53 except, some control struct is needed 2009-02-06 23:53 well, if we need aio stuff, I guess it's not hard if simple one 2009-02-06 23:53 which in kernel is bio 2009-02-06 23:54 yes 2009-02-06 23:54 I found one issue with page bit strategy 2009-02-06 23:54 anyway, or usespace endio can just be a no-op 2009-02-06 23:54 our usespace endio can just be a no-op 2009-02-06 23:54 yes 2009-02-06 23:55 it drops ability to get forked page via buffer_head which current one does 2009-02-06 23:56 "it" means what? 
2009-02-06 23:56 page bit strategy 2009-02-06 23:56 page bit strategy with buffer_head inheriting 2009-02-06 23:57 stable buffer_head for backend by page bit 2009-02-06 23:57 right 2009-02-06 23:57 thinking 2009-02-06 23:57 it's really not important how we trace the back end page, we just need to track it somehow 2009-02-06 23:58 putting a buffer head on it was one way, would could also wrap it with a bio 2009-02-06 23:58 which is probably sensible 2009-02-06 23:58 we have to wrap it with a bio some time anyway 2009-02-06 23:59 ah, too many typos 2009-02-06 23:59 it's really not important how we >track< the back end page, we just need to track it somehow 2009-02-06 23:59 putting a buffer head on it was one way, >we< could also wrap it with a bio 2009-02-07 00:00 cpu1 cpu2 2009-02-07 00:00 blockread() 2009-02-07 00:00 buffer_head-0x001 2009-02-07 00:00 blockread() 2009-02-07 00:00 buffer-0x001 2009-02-07 00:00 blockdirty() old page 2009-02-07 00:00 fork page 2009-02-07 00:00 return new buffer-0x002 2009-02-07 00:00 modify buffer-0x002 2009-02-07 00:00 blockdirty() 2009-02-07 00:00 return -EAGAIN 2009-02-07 00:00 blockread() 2009-02-07 00:00 ugh 2009-02-07 00:00 2009-02-07 00:00 2009-02-07 00:00 2009-02-07 00:00 cpu1 cpu2 2009-02-07 00:00 blockread() 2009-02-07 00:00 buffer_head-0x001 2009-02-07 00:00 blockread() 2009-02-07 00:00 buffer-0x001 2009-02-07 00:00 blockdirty() old page 2009-02-07 00:00 fork page 2009-02-07 00:00 return new buffer-0x002 2009-02-07 00:00 modify buffer-0x002 2009-02-07 00:00 blockdirty() 2009-02-07 00:00 return -EAGAIN 2009-02-07 00:00 blockread() 2009-02-07 00:01 in this example, cpu2 is reading old page 2009-02-07 00:01 it means this blockdirty() is not right 2009-02-07 00:02 marking the page is still a strong idea 2009-02-07 00:02 that I did not consider 2009-02-07 00:02 blockdirty can be spurious 2009-02-07 00:03 probably 2009-02-07 00:04 cpu1 cpu2 2009-02-07 00:04 blockread() 2009-02-07 00:04 buffer_head-0x001 2009-02-07 00:04 blockread() 
2009-02-07 00:04 mutex_lock() <- 2009-02-07 00:04 mutex_lock() <- 2009-02-07 00:04 buffer-0x001 2009-02-07 00:04 blockdirty() old page 2009-02-07 00:04 fork page 2009-02-07 00:04 return new buffer-0x002 2009-02-07 00:04 modify buffer-0x002 2009-02-07 00:04 mutex_unlock() <- 2009-02-07 00:04 blockdirty() 2009-02-07 00:05 return -EAGAIN 2009-02-07 00:05 blockread() 2009-02-07 00:05 mutex_unlock() <- 2009-02-07 00:05 current one with those locks, fork guarantees the data is latest 2009-02-07 00:05 but, -EAGAIN doesn't guarantee 2009-02-07 00:06 I think this means we may want check_forked() function to check whether page is latest or not 2009-02-07 00:06 I assumed that was your plan 2009-02-07 00:07 check_forked() can be used after mutex_lock() 2009-02-07 00:07 blockdirty is a kind of check_forked, where else do we need check_forked? 2009-02-07 00:07 in the above, "old page" point can not be latest data 2009-02-07 00:08 with -EAGAIN strategy 2009-02-07 00:08 e.g. it may be checking old bitmap data 2009-02-07 00:09 so, this blockdirty() can be spurious 2009-02-07 00:09 or needed blockdirty can not be called 2009-02-07 00:10 right, the problem is that only the dirty that causes the fork gets EAGAIN 2009-02-07 00:12 yes 2009-02-07 00:12 ah, one idea 2009-02-07 00:13 we may be able to mark buffer too as forked 2009-02-07 00:13 so, check_forked() can be very cheap 2009-02-07 00:15 if page was forked, blockdirty marks buffers and page 2009-02-07 00:15 which page is supposed to have the forked bit set, the original or the clone? 2009-02-07 00:15 original page 2009-02-07 00:15 ok, and the original must be removed from page cache, right? 2009-02-07 00:16 yes 2009-02-07 00:16 so, blockdirty on cpu should see the forked bit and therefore know it must look up the new page in the cache 2009-02-07 00:17 yes 2009-02-07 00:17 there is still a problem? 
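The forked-bit protocol described above can be sketched as a single-threaded userspace model: the forked bit is set on the original page, the cache is switched to the copy, and a caller holding the stale page gets -EAGAIN and restarts from the read. Every name here (`page_sketch`, `cache_sketch`, the `_sketch` functions, the one-slot cache) is a hypothetical stand-in, and the sketch deliberately leaves out the locking and the check_forked() question raised in the discussion.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define SKETCH_BLOCKSIZE 16

/* Hypothetical page: "forked" is set on the original once a copy replaces it. */
struct page_sketch {
	int forked;
	int under_io;	/* backend is writing this page out */
	int dirty;
	char data[SKETCH_BLOCKSIZE];
};

/* Hypothetical one-slot page cache. */
struct cache_sketch {
	struct page_sketch *current;
	struct page_sketch spare;	/* storage for the forked copy */
};

static struct page_sketch *blockread_sketch(struct cache_sketch *cache)
{
	return cache->current;
}

/*
 * Returns 0 and sets *result to the page that is safe to modify, or
 * -EAGAIN if the caller holds a page somebody else already forked.
 */
static int blockdirty_sketch(struct cache_sketch *cache,
			     struct page_sketch *page,
			     struct page_sketch **result)
{
	if (page->forked)
		return -EAGAIN;	/* stale: caller must redo blockread */
	if (page->under_io) {
		struct page_sketch *copy = &cache->spare;

		memcpy(copy->data, page->data, sizeof(copy->data));
		copy->forked = copy->under_io = 0;
		cache->current = copy;	/* cache now returns the copy */
		page->forked = 1;	/* mark the original, not the clone */
		page = copy;
	}
	page->dirty = 1;
	*result = page;
	return 0;
}

/* Caller-side retry loop: restart from blockread on -EAGAIN. */
static struct page_sketch *get_writable(struct cache_sketch *cache)
{
	struct page_sketch *page, *result;

	do
		page = blockread_sketch(cache);
	while (blockdirty_sketch(cache, page, &result) == -EAGAIN);
	return result;
}
```

The model shows only the return-value protocol. It does not prevent a second CPU from reading stale data between its blockread and its blockdirty, which is the gap the conversation goes on to examine.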
2009-02-07 00:18 if check_forked was introduced, there is no big problem 2009-02-07 00:18 It seems good to me 2009-02-07 00:18 I though, check_forked() may take lock_page() 2009-02-07 00:18 old page keeps original buffer heads 2009-02-07 00:18 yes 2009-02-07 00:19 but, in the above example, lock_page was not needed 2009-02-07 00:19 lock_page protects the buffer list, and we aren't changing the buffer list with this strategy 2009-02-07 00:20 yes, but if there is no mutex_lock(), check_forked() should be serialized with fork 2009-02-07 00:20 it looks like this is a strong strategy, about the same efficiency as my proposal, but more robust 2009-02-07 00:21 lock_page might be better than mutex_lock 2009-02-07 00:21 it's depending caller 2009-02-07 00:22 I meant mutex_lock is caller's lock 2009-02-07 00:22 ah, already has it 2009-02-07 00:22 in this example, bitmap->i_mutex 2009-02-07 00:22 ok 2009-02-07 00:23 yes, and improved fine grained locking would be doing lock_buffer instead of i_mutex 2009-02-07 00:23 well 2009-02-07 00:23 lock_buffer can't protect a page bit 2009-02-07 00:23 lock_buffer is also ok for check_forked() 2009-02-07 00:24 because lock_buffer() is protect blockdirty entirely 2009-02-07 00:24 yes, it prevents the fork from completing 2009-02-07 00:24 at which point exactly does the forked bit get set? 2009-02-07 00:25 what makes the page copy + set forked bit atomic? 
2009-02-07 00:25 it would be lock_page or any lock 2009-02-07 00:25 at least, current blockdirty() seems lock_page() 2009-02-07 00:25 sure 2009-02-07 00:26 it's ok 2009-02-07 00:26 well, forked bit itself there is no need any lock 2009-02-07 00:27 it can be test_and_set_bit() 2009-02-07 00:27 yes 2009-02-07 00:27 but, to guarantee blockread will see new page, it would need some lock 2009-02-07 00:28 setting the forked bit can even be protected by a barrier I think 2009-02-07 00:28 because it is not a problem to think that a page has not yet been forked, when it actually is already forked 2009-02-07 00:29 if cpu2 decides to fork a page that is already forked, it will take the page lock, then find out the page is forked 2009-02-07 00:29 yes 2009-02-07 00:30 ok, I think the new strategy is less fragile than my proposal 2009-02-07 00:30 yes 2009-02-07 00:31 however, if it works without check_forked(), it would be good 2009-02-07 00:31 because it introduces some complexity 2009-02-07 00:31 check_forked would be called where? 2009-02-07 00:32 in the above example, cpu2 may want to call after mutex_lock() immediately 2009-02-07 00:32 if cpu2 needs to see latest data 2009-02-07 00:34 e.g. concurrent balloc/bfree may want to see latest data 2009-02-07 00:34 to use freed block surely 2009-02-07 00:35 I think the lock will normally be before the blockread 2009-02-07 00:36 balloc? 
2009-02-07 00:36 it is not allowed that data on a block can change after blockread and before brelse 2009-02-07 00:36 yes 2009-02-07 00:37 if lock before blockread() was fine, I guess it will see latest data without check_forked() 2009-02-07 00:37 I think so 2009-02-07 00:38 I don't think we need to allow parallel read and modify on any block, even bitmap block 2009-02-07 00:38 maybe, I think we want to more fine granularity 2009-02-07 00:38 I thought that was what you thought :) 2009-02-07 00:39 :) 2009-02-07 00:39 eventually, frontend will not access bitmap at all, only backend 2009-02-07 00:40 and most of the parallelism will be on the front end 2009-02-07 00:40 yes, it's very good thing for this 2009-02-07 00:40 making the back end parallel too will be an interesting challenge, with a smaller benefit 2009-02-07 00:41 stage_delta vs stage_delta? 2009-02-07 00:41 we can achieve back end parallelism with a different strategy if we want 2009-02-07 00:41 right, multiple stage_delta threads in parallel -> hard, but possible 2009-02-07 00:41 oh, i see 2009-02-07 00:41 maybe, just one stage_delta with multiple threads 2009-02-07 00:42 ah 2009-02-07 00:42 we can divide the threads by logical address range 2009-02-07 00:42 for a big file 2009-02-07 00:42 each bitmap block covers 128 MB, for example, and is owned by one backend thread 2009-02-07 00:43 just an idea 2009-02-07 00:43 backend wait may not be a big problem 2009-02-07 00:43 we will already be efficient with even a single threaded backend 2009-02-07 00:44 yes, btw, wait -> not fine granularity 2009-02-07 00:44 ? 
2009-02-07 00:44 if backend doesn't have fine granularity, it may not be a big problem 2009-02-07 00:44 that is what I think 2009-02-07 00:44 yes 2009-02-07 00:45 the nice thing is letting the frontend run without waiting on writeout 2009-02-07 00:45 it will still wait on sync reads 2009-02-07 00:45 yes 2009-02-07 00:47 ok, that strategy seems not bad at least 2009-02-07 00:47 I'll think about check_forked() a bit 2009-02-07 00:48 because it introduces a complexity to locking rule 2009-02-07 00:48 but, if there is no alternative, it seems ok 2009-02-07 00:49 actually, it's not complexity, probably inflexible 2009-02-07 00:50 prototype code would help think about it 2009-02-07 00:50 if nonfunctional code 2009-02-07 00:50 even nonfunctional code 2009-02-07 00:50 yes 2009-02-07 00:51 probably, I'll play with hackfs code and userspace 2009-02-07 00:55 just an idea: int devaio(int rw, struct block_device *dev, loff_t offset, void *data, unsigned len, endio_t endio, void *info) 2009-02-07 00:56 endio(bio, info) is called on completion 2009-02-07 00:56 just a shell for vecio that hides the bvec 2009-02-07 00:58 so, for our delta commit, the endio can just decrement a count of total bios submitted and wake up commit_delta when it hits zero 2009-02-07 01:00 the endio can put completed bios on a list instead of freeing them, as is typical for endio 2009-02-07 01:00 then commit_delta can walk the list, setting buffer state clean and freeing the bio, in foreground 2009-02-07 01:00 it seems so easy :) 2009-02-07 01:01 ah, devaio is a stupid idea 2009-02-07 01:02 oh wait, it is ok 2009-02-07 01:02 the bvec can be hidden because it is just an internal interface 2009-02-07 02:37 bio counter is good idea 2009-02-07 02:38 and setting clean state in foreground is a reasonable thing to do, I think 2009-02-07 02:38 start_submit_bio will take one count, end_submit_bio will release one count 2009-02-07 02:39 if foreground can clean, we don't need to clean it? 
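The in-flight counter idea can be sketched as follows, using the userspace model described earlier in which I/O completes synchronously inside the submission. One extra count is held across the submit loop so that an early completion cannot drop the counter to zero while blocks remain to be submitted. All names are hypothetical; kernel code would use atomic operations and a real wakeup rather than plain ints and a flag.

```c
#include <assert.h>

/* Hypothetical commit state: in-flight I/O count plus a flag that
 * stands in for waking commit_delta. */
struct commit_sketch {
	int inflight;
	int completed;	/* I/Os whose endio has run */
	int woken;	/* set when the count drops to zero */
};

static void put_inflight(struct commit_sketch *c)
{
	if (--c->inflight == 0)
		c->woken = 1;	/* kernel code would wake the waiter here */
}

/* Completion callback: in userspace it runs inside the submission. */
static void endio_sketch(struct commit_sketch *c)
{
	c->completed++;
	put_inflight(c);
}

/* Synchronous stand-in for an async submit: complete immediately. */
static void submit_sketch(struct commit_sketch *c)
{
	c->inflight++;
	endio_sketch(c);
}

/*
 * Hold one extra count for the whole submit loop so that an early
 * completion cannot drive the count to zero while I/O remains to submit.
 */
static void submit_all_sketch(struct commit_sketch *c, int blocks)
{
	c->inflight++;			/* like start_submit_bio: take the bias */
	for (int i = 0; i < blocks; i++) {
		submit_sketch(c);
		assert(!c->woken);	/* never woken mid-loop */
	}
	put_inflight(c);		/* like end_submit_bio: drop the bias */
}
```

Because every completion here runs before the next submission, the extra count is the only thing keeping the counter above zero during the loop, which makes the synchronous model a convenient worst case for testing the scheme.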
2009-02-07 02:39 things get a little interesting if inflight count hits zero while there is still IO to submit 2009-02-07 02:40 right 2009-02-07 02:40 yes, start/end will prevent it 2009-02-07 02:40 setting buffers clean in endio is fragile, with limited synchronization options 2009-02-07 02:40 yes 2009-02-07 02:41 I'm not sure though, page bit may also work for it 2009-02-07 02:42 it gets complex when blocks in different states are on the same page 2009-02-07 02:42 well, it has been done before of course, in buffer.c 2009-02-07 02:42 I thought we may just want to know whether page is under io or not 2009-02-07 02:43 if under io, it will fork 2009-02-07 02:43 if not, it will clean 2009-02-07 02:43 not sure at all though 2009-02-07 02:44 ...thinking about whether page dirty is enough 2009-02-07 02:44 intuition says it is not enough, but I can't prove that yet 2009-02-07 02:44 there is buffers 2009-02-07 02:45 dirty bits are in buffers here 2009-02-07 02:45 right, and if the buffers are now stable because of your proposal, then they can be traversed in endio just as end buffer aio now works in buffer.c 2009-02-07 02:46 well, page_lock is used to protect buffers for a traversal normally, but can't be used in interrupt 2009-02-07 02:46 yes 2009-02-07 02:46 page->private_lock is used instead, with some reasoning that I do not understand 2009-02-07 02:46 so, it may be able to delay until next blockdirty 2009-02-07 02:47 we avoid that issue with a foreground clean 2009-02-07 02:48 yes, I felt actually we may not need clean itself 2009-02-07 02:49 clean prevent unneed fork 2009-02-07 02:49 yes 2009-02-07 02:49 I imaged the page bit may prevent it too 2009-02-07 02:50 and clean will allow eviction 2009-02-07 02:50 page bit? 2009-02-07 02:50 yes 2009-02-07 02:50 page forked bit? 2009-02-07 02:50 fork is only needed if page is under io 2009-02-07 02:50 hmm, no 2009-02-07 02:50 right 2009-02-07 02:50 which page bit? 
2009-02-07 02:50 oh 2009-02-07 02:50 maybe, writeback 2009-02-07 02:50 "some" page bit 2009-02-07 02:51 yes 2009-02-07 02:51 a page with writeback bit set can't be evicted 2009-02-07 02:52 yes, so endio will clear it 2009-02-07 02:52 ok 2009-02-07 02:52 but, not sure at all 2009-02-07 02:52 what about ->releasepage, how does it know if buffers can be removed? 2009-02-07 02:53 it will see page count 2009-02-07 02:53 page refcount 2009-02-07 02:54 how will it know if the page is in a mapping or not? 2009-02-07 02:54 page_mapped? 2009-02-07 02:54 I guess there is no difference with current kernel 2009-02-07 02:55 true, if we do fork your way 2009-02-07 02:55 http://lxr.linux.no/linux+v2.6.28.4/fs/buffer.c#L403 2009-02-07 02:57 I think we don't need this if we submit the page with bio 2009-02-07 02:57 that would be a nice result 2009-02-07 02:57 yes 2009-02-07 02:58 ah, read 2009-02-07 02:58 it can be possible for the same page to be inflight in more than one parallel bio 2009-02-07 02:58 different blocks 2009-02-07 02:59 say, a btree node being read while a bitmap block is written 2009-02-07 02:59 i see 2009-02-07 03:00 in that case, the bio should specify which block is being transferred, this can be determined from the biovec offset 2009-02-07 03:00 an alternative is to store the buffer_head in the bio private field 2009-02-07 03:01 this implies only one block per bio, which is not a big problem 2009-02-07 03:02 btree node being read while a bitmap block is written <- this example is incorrect 2009-02-07 03:02 btree node being read while another btree node on the same page is written <- better example 2009-02-07 03:03 well, btree is volmap, so there is no blockdirty() 2009-02-07 03:03 :) 2009-02-07 03:03 you got me 2009-02-07 03:03 it helps 2009-02-07 03:04 well, page bit (writeback in here) may be just a hack 2009-02-07 03:04 dirent block being read while another dirent block on the same page is written <- vfs locking prevents this 2009-02-07 03:05 yes 2009-02-07 
03:06 it will be happen on bitmap pages 2009-02-07 03:06 if blockread() reads one block 2009-02-07 03:06 yes 2009-02-07 03:06 my example was lame :) 2009-02-07 03:07 :) 2009-02-07 03:07 well, if there is clean way to clean, it would be good 2009-02-07 03:07 well, however it works out in the end, it will be simpler than the current end_buffer strategy 2009-02-07 03:08 good 2009-02-07 03:09 and we will avoid using the b_end_io field, another step towards using block handles 2009-02-07 03:09 i see 2009-02-07 03:10 btw, I guess buffer_head will be hidden from frontend almost with blockdirty change 2009-02-07 03:10 that would be nice 2009-02-07 03:10 it will be there actually, but it will just use b_data probably 2009-02-07 03:11 right, bufdata() 2009-02-07 03:11 yes 2009-02-07 03:11 if it's true, I guess it can be replaced with other header 2009-02-07 03:12 well, it's later though 2009-02-07 03:12 ah 2009-02-07 03:13 btw, I'm thinking to store buffer_heads for cloned page to store to stable buffer_head->b_private 2009-02-07 03:13 so, check_forked() can be return new buffer_head 2009-02-07 03:14 instead of -EAGAIN 2009-02-07 03:14 whatever works right now is good 2009-02-07 03:14 yes 2009-02-07 03:15 well, with it, I thought we don't need to release mutex_lock() in example 2009-02-07 03:15 repeating the hash lookup is not a bad thing either 2009-02-07 03:15 ah 2009-02-07 03:16 just a idea, however if there is real example actually, it may help 2009-02-07 03:17 real usage 2009-02-07 03:17 yes 2009-02-07 03:17 well the easiest way to make that happen is, make user/commit.c compile for kernel 2009-02-07 03:18 yes 2009-02-07 03:18 we can start creating log blocks in the kernel code, and just clear the log on commit without necessarily having the right contents in the log 2009-02-07 03:18 that will exercise a lot of the mechanism 2009-02-07 03:18 i see 2009-02-07 03:19 ok, so we I remove the userspace dependencies in user/commit.c, then we can work in parallel 2009-02-07 
03:19 you can work on transfers, and I can work on correct logging 2009-02-07 03:20 ok 2009-02-07 03:21 the quickest way to do that is, write the log blocks synchronously with devio 2009-02-07 03:21 this will obviously suck, but we know what to do about it 2009-02-07 03:21 well, maybe, I'll play on userspace a bit, then will play on hackfs 2009-02-07 03:21 fine 2009-02-07 03:22 so, I will try to find bad things with that 2009-02-07 03:22 and, I will 2009-02-07 03:25 - if ((err = diskwrite(sb->dev->fd, buffer->data, sb->blocksize, block << sb->blockbits))) { 2009-02-07 03:25 + if ((err = devio(WRITE, sb_dev(sb), block << sb->blockbits, buffer->data, sb->blocksize))) { 2009-02-07 03:25 now let me see what part of user/commit.c breaks in kernel 2009-02-07 03:27 btw, for now, commit.c is not needed hurry for me 2009-02-07 03:28 yes, behavior will change a lot if change_end is moved to kernel 2009-02-07 03:28 it is a little bit too early for that 2009-02-07 03:29 ah, yes 2009-02-07 03:29 but maybe I will just compile most of user/commit.c in kernel, without using it 2009-02-07 03:29 or maybe just keep it where it is 2009-02-07 03:29 ok 2009-02-07 03:30 it looks like it has very few problems in kernel, but it will not move today 2009-02-07 03:30 ok 2009-02-07 03:32 I'll go to shop for food 2009-02-07 03:32 ok, next thing is to load the log blocks in replay 2009-02-07 03:32 I will sleep 2009-02-07 03:32 yes 2009-02-07 03:32 later... 
2009-02-07 03:32 oyasumi 2009-02-07 03:32 it's not quite right for me to say oyasumi back :) 2009-02-07 03:33 since it not night there yet 2009-02-07 03:33 so, by for now 2009-02-07 08:55 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-07 09:44 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-07 10:06 -!- dcg(~dcg@114.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-02-07 10:07 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2009-02-07 10:12 -!- dcg(~dcg@114.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-02-07 10:15 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3 2009-02-07 10:42 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-07 14:00 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-07 15:03 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-07 15:10 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-07 19:16 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-07 23:00 -!- RazvanM(~RazvanM@96.234.232.151) has joined #tux3 2009-02-07 23:39 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-08 00:41 -!- cam(~cam@203-219-255-75.tpgi.com.au) has joined #tux3 2009-02-08 01:21 lockdebug.h found a locking bug 2009-02-08 01:21 in the commit unit test 2009-02-08 03:16 hey 2009-02-08 08:08 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-08 08:23 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-08 09:11 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-08 11:06 -!- dcg(~dcg@161.pool80-103-2.dynamic.orange.es) has joined #tux3 2009-02-08 11:22 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-08 12:36 -!- 
dcg(~dcg@74.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-02-08 14:45 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-08 15:10 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-08 15:52 now... various replay mistakes getting squashed 2009-02-08 15:57 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-02-08 15:57 userspace test for block fork 2009-02-08 15:57 good morning 2009-02-08 15:57 hi 2009-02-08 15:58 it looks like simple and good 2009-02-08 15:58 however, there is "clean page/buffer" issues 2009-02-08 15:58 it is not new issues by those patches though 2009-02-08 15:59 the issue is? 2009-02-08 16:01 job to free has difference with forked or not forked 2009-02-08 16:01 ah 2009-02-08 16:01 right 2009-02-08 16:02 now I'm thinking about it 2009-02-08 16:02 well, there are various strategy with some lock 2009-02-08 16:03 yes, I also was leaving that as the last thing, because it is not clear now to handle it best until other decisions are made 2009-02-08 16:03 I'm finding lockless way, or with small lock window 2009-02-08 16:03 ah, yes 2009-02-08 16:07 btw, flush_log for bitmap seems simple with those patches 2009-02-08 16:07 reading in a minute 2009-02-08 16:07 it seems list_for_each_entry_safe() is enough 2009-02-08 16:07 I am committing some improvements to replay right now 2009-02-08 16:08 ok 2009-02-08 16:08 those patches are in progress patches, not yet mergable 2009-02-08 16:12 reading 2009-02-08 16:18 oh, "Add log block loading to replay prototype" patch is changing disksuper without changing magic 2009-02-08 16:18 replay: load 1 logblocks 2009-02-08 16:18 replay: log magic 10ad 2009-02-08 16:18 replay: child = 0xf, parent = 0x2, key = 0x0 2009-02-08 16:18 replay: child = 0x24, parent = 0x8, key = 0x0 2009-02-08 16:18 log replay is doing something 2009-02-08 16:20 to be sure it works, I think we will checksum the dirty volmap blocks at delta commit time, log the checksum, and check the 
checksum against reconstructed volmap dirty blocks on replay 2009-02-08 16:21 for debug? 2009-02-08 16:22 yes 2009-02-08 16:22 self check 2009-02-08 16:22 sounds good for now 2009-02-08 16:22 log replay has many chances for mistakes 2009-02-08 16:22 btree is not changing durning checksum? 2009-02-08 16:23 the checksum has to be computed at exactly the right point 2009-02-08 16:23 the principle is: replay is supposed to reconstruct all the dirty, pinned metadata 2009-02-08 16:24 yes 2009-02-08 16:24 all btree leaf nodes should be clean at the point of the checksum 2009-02-08 16:25 bitmap blocks can also be checksummed for the same reason 2009-02-08 16:25 checksum bitmap blocks before flushing the bitmap 2009-02-08 16:26 per delta checksum for bitmap may not be easy 2009-02-08 16:26 ah, and after too 2009-02-08 16:26 actually, checksum after the flush is better, and check the checksum after replaying log 2009-02-08 16:27 I don't see anything hard about doing the bitmap checksum per delta 2009-02-08 16:27 buffer can be modified for next delta 2009-02-08 16:28 yes, and the modification is logged 2009-02-08 16:28 the replay is supposed to reconstruct exactly the modified bitmap 2009-02-08 16:29 I don't see a change to fork in this patch set 2009-02-08 16:29 blockdirty returns new buffer 2009-02-08 16:30 ah 2009-02-08 16:30 I thought the change would be bigger :) 2009-02-08 16:30 ah, yes 2009-02-08 16:30 :) 2009-02-08 16:30 yes, it was simple 2009-02-08 16:30 I made a similar change to log_finish in my latest commit 2009-02-08 16:31 but I test for null logbuf in the replay code 2009-02-08 16:31 no problem 2009-02-08 16:31 those patches are just for review for now 2009-02-08 16:31 review and testing purpose 2009-02-08 16:32 well, so the result is not slight 2009-02-08 16:32 it seems good, because the list is stable, there is not new fork buffer anymore 2009-02-08 16:33 I have not yet tested a fork in the replay code 2009-02-08 16:33 I just did the first real replay today 
2009-02-08 16:33 redirect logging seems to work well 2009-02-08 16:33 good 2009-02-08 16:34 with those patch, I guess bitmap will also be work 2009-02-08 16:34 write_bitmap is really simple 2009-02-08 16:34 the commit-transfer change shows the effect of the new fork strategy, right?\ 2009-02-08 16:34 yes, right 2009-02-08 16:35 there is no -EAGAIN from write_bitmap anymore 2009-02-08 16:35 and there is no buffer in sb->pinned 2009-02-08 16:36 however, there is free issue here 2009-02-08 16:36 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-08 16:36 now, in userspace, buffer will be set_buffer_clean() 2009-02-08 16:38 if (IS_ERR(clone)) 2009-02-08 16:38 - return PTR_ERR(clone); 2009-02-08 16:38 + return clone; 2009-02-08 16:38 yes 2009-02-08 16:38 ah 2009-02-08 16:38 right 2009-02-08 16:39 + buffer = clone; <- this is the key change 2009-02-08 16:39 yes 2009-02-08 16:40 and the buffer is rehashed 2009-02-08 16:40 unhashed 2009-02-08 16:40 yes, and clone is hashed instead 2009-02-08 16:40 btw, userspace also have free issue 2009-02-08 16:41 at least, unclear for now 2009-02-08 16:41 the the original buffer is the one that must be written, how does that happen? 2009-02-08 16:41 it is listed in map->dirty 2009-02-08 16:42 ok, good 2009-02-08 16:42 by previous delta's dirty 2009-02-08 16:42 and after the commit, the clone must be moved to the dirty list 2009-02-08 16:42 correct? 
2009-02-08 16:42 no 2009-02-08 16:43 there is no needed 2009-02-08 16:43 yes, I see, duh 2009-02-08 16:43 the clone is read only 2009-02-08 16:43 ok, this is better 2009-02-08 16:43 yes, really simple 2009-02-08 16:43 yes, really simple and needs a very clear explanation 2009-02-08 16:44 better than my approach 2009-02-08 16:44 yes, at least for backend 2009-02-08 16:45 I think there are two points on those strategy 2009-02-08 16:45 dirty buffer is stable for backend 2009-02-08 16:46 it's inserted into map->dirty when it become dirty 2009-02-08 16:46 there is no fork yet 2009-02-08 16:46 then, delta counter is incremented 2009-02-08 16:46 now, fork can be happened, and now map->dirty can be taked new delta dirty buffer 2009-02-08 16:47 so, it is using list_splice_init() to clean map->dirty 2009-02-08 16:47 and now we are get stable dirty buffer list in io_buffers list 2009-02-08 16:48 fork never touch those buffers 2009-02-08 16:48 backend can write buffers on io_buffers list with normal way 2009-02-08 16:48 the io_buffers list is not implemented yet? 
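The stable dirty list described above can be sketched roughly as follows. This is a userspace model, not the patch under review: the list helpers are small re-implementations modeled on the Linux list.h helpers of the same names, and `toy_map`, `toy_sb`, `begin_flush` and `list_count` are hypothetical names introduced only for illustration. The point it tries to show is that once the dirty list is spliced onto io_buffers, buffers dirtied for the next delta land on the now-empty map->dirty and never disturb the list being written.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *head) { head->next = head->prev = head; }
static int list_empty(const struct list_head *head) { return head->next == head; }

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Move every entry of 'from' to the front of 'to', leaving 'from' empty. */
static void list_splice_init(struct list_head *from, struct list_head *to)
{
	assert(from != to);
	if (list_empty(from))
		return;
	struct list_head *first = from->next, *last = from->prev;
	first->prev = to;
	last->next = to->next;
	to->next->prev = last;
	to->next = first;
	list_init(from);
}

static size_t list_count(const struct list_head *head)
{
	size_t n = 0;
	for (const struct list_head *p = head->next; p != head; p = p->next)
		n++;
	return n;
}

struct toy_map { struct list_head dirty; };	/* per-inode dirty buffers */
struct toy_sb { unsigned delta; };		/* hypothetical delta counter */

/* Advance the delta, then capture the previous delta's dirty buffers. */
static void begin_flush(struct toy_sb *sb, struct toy_map *map, struct list_head *io_buffers)
{
	sb->delta++;
	list_splice_init(&map->dirty, io_buffers);
}
```

In the real code the counter increment and the splice would need to be serialized against block fork; this sketch is single threaded and leaves that out.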
2009-02-08 16:49 oh yes 2009-02-08 16:49 there is it in flush_log() before list_for_each_entry_safe() 2009-02-08 16:49 in commit-transfer 2009-02-08 16:49 yes 2009-02-08 16:49 sb->flush++ and it should be serialized with block fork 2009-02-08 16:51 if we have two lists for dirty, we may not need io_buffers 2009-02-08 16:51 good, very nice 2009-02-08 16:52 io_buffers is ok 2009-02-08 16:52 in userspace, yes 2009-02-08 16:53 in kernel, we may want two lists for optimization 2009-02-08 16:53 the second list does not need to be per-inode 2009-02-08 16:53 well 2009-02-08 16:53 if it is per inode, then we don't need to remember the mapping in the original buffer 2009-02-08 16:54 which makes it easier to use a list of pages instead of list of buffers 2009-02-08 16:54 yes 2009-02-08 16:55 list of pages means taking of the page->lru link 2009-02-08 16:55 which may or may not be safe 2009-02-08 16:55 yes 2009-02-08 16:55 or two dirty radix-tree 2009-02-08 16:55 or wait 2009-02-08 16:55 the page->private link should be ok 2009-02-08 16:56 page->private is using for buffer_head link 2009-02-08 16:56 if page is not forked 2009-02-08 16:56 at that point we may have a bio attached 2009-02-08 16:56 and we can use the bio to link the pages together 2009-02-08 16:57 where is bio allocated? 2009-02-08 16:57 we have not decided that yet 2009-02-08 16:57 we would not use vecio for this 2009-02-08 16:58 so we can allocate the bio when we want to 2009-02-08 16:58 yes 2009-02-08 16:58 I think it will be one bio per metadata block 2009-02-08 16:59 it is complicated to do multiple metadata blocks per bio, and because there are few metadata blocks, not a big benefit 2009-02-08 16:59 um... 
not sure 2009-02-08 17:00 if it's bitmap page, we may can use bio per page 2009-02-08 17:01 actually, the bio per range 2009-02-08 17:03 well, I'm thinking the benefit of two lists is, we don't need to walk dirty inode under delta_lock 2009-02-08 17:03 I expect bitmaps will never be physically contiguous, so must have one bio per block 2009-02-08 17:04 yes, locking benefit is likely 2009-02-08 17:04 oh, i see 2009-02-08 17:05 I was thinking it will be written like guess_range() 2009-02-08 17:05 when we implement allocation goals, we will try to put each bitmap near the blocks it covers 2009-02-08 17:06 like block groups in ext2/3/4 2009-02-08 17:06 that means the bitmaps will be about 128 MB apart 2009-02-08 17:07 i see 2009-02-08 17:08 we can develop an optimized version of map_region that maps a single block more efficiently 2009-02-08 17:08 later... 2009-02-08 17:09 directory blocks are more likely to be physically contiguous 2009-02-08 17:09 ah, btw, the benefit of new strategy is, I guess there is no special case for all types of inode 2009-02-08 17:09 i see 2009-02-08 17:09 the new strategy is basically stronger 2009-02-08 17:09 and less code too 2009-02-08 17:09 yes 2009-02-08 17:11 directory blocks... making them contiguous is good for readdir 2009-02-08 17:11 so, I guess we can apply allocation policy flexible 2009-02-08 17:11 yes 2009-02-08 17:11 but making too many directory blocks contiguous means that the dirent will be far from the inode 2009-02-08 17:12 i see 2009-02-08 17:12 it seems best depends on storage technorogy 2009-02-08 17:12 anyway, I plan to not do any allocation optimization at all until after we start review 2009-02-08 17:13 yes 2009-02-08 17:13 which means it will be easy to generate test cases that do not perform well 2009-02-08 17:13 but that is ok I think 2009-02-08 17:13 yes 2009-02-08 17:14 one way of looking at it... 
allocation optimization is something that many people can participate in 2009-02-08 17:14 I'm guessing the allocation policy is incremental and huristic more or less 2009-02-08 17:14 yes 2009-02-08 17:15 anyway, I'll think about free issue, and kernel code for this strategy 2009-02-08 17:16 next is 2009-02-08 17:17 you probably have to merge your latest patches against my changes to user/commit.c 2009-02-08 17:17 yes, well it's easy 2009-02-08 17:17 btw, I should those merge before kernel become sure? 2009-02-08 17:18 I shoule merge those 2009-02-08 17:18 yes 2009-02-08 17:18 any time 2009-02-08 17:18 ok, next is, I'll merge those 2009-02-08 17:18 ok 2009-02-08 17:18 ah 2009-02-08 17:18 there is free issue in userspace too 2009-02-08 17:19 buffer reclaim was not sure yet 2009-02-08 17:19 for now, buffer->lru is remaining 2009-02-08 17:19 I'm not sure yet 2009-02-08 17:19 you can use the buffer->lru field if you like 2009-02-08 17:20 well 2009-02-08 17:20 there are many options 2009-02-08 17:20 just try something :) 2009-02-08 17:20 ok :) 2009-02-08 17:20 next, I will see if multiple log block commit/replay works 2009-02-08 17:20 ok 2009-02-08 17:21 when I get your new changes, I will start to test bitmap block replay 2009-02-08 17:21 currently allocation logging is disabled 2009-02-08 17:22 ok, however, probably, it's not very soon due to free issue 2009-02-08 17:22 well I have other things I can do also, like implement the checksum strategy 2009-02-08 17:22 and btree node split logging and replay 2009-02-08 17:23 ok 2009-02-08 17:36 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-08 18:02 mysterious bug when setting blocksize low and running more commits was just entries_per_node set wrong 2009-02-08 18:02 sb->entries_per_node = (sb->blocksize - sizeof(struct bnode)) / sizeof(struct index_entry); 2009-02-08 18:02 that's all we need 2009-02-08 18:03 in commit.c, it's .entries_per_node = 20 2009-02-08 18:04 yes 2009-02-08 18:04 well we should do 
that somewhere in initialization 2009-02-08 18:04 maybe have a tux3_init 2009-02-08 18:04 it's not clear what that should do 2009-02-08 18:06 for now, load_sb is initializing it 2009-02-08 18:06 load_sb() is also initialize sb itself 2009-02-08 18:06 I will at least put the proper expression in load_sb 2009-02-08 18:07 yes 2009-02-08 18:07 well, I was thinking =20 is for debug 2009-02-08 18:07 because it'll split btree more often 2009-02-08 18:07 yes 2009-02-08 18:08 we can always set it lower for debugging after initializing sb 2009-02-08 18:09 ah, yes, lower than max 2009-02-08 18:09 and sb->max_inodes_per_block = sb->blocksize / 64; 2009-02-08 18:09 64 is about the size of an inode 2009-02-08 18:10 yes, however, it's not used for now 2009-02-08 20:02 -!- inverse(~michael@h67-net10.simres.netcampus.ca) has joined #tux3 2009-02-08 20:10 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2009-02-08 20:26 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3 2009-02-08 20:41 back 2009-02-08 20:49 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-08 20:59 just ran the first rollup flush cycle 2009-02-08 22:31 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-08 22:39 -!- RazvanM(~RazvanM@96.234.232.151) has joined #tux3 2009-02-08 23:25 it's not so easy, getting more than one log block per delta 2009-02-08 23:26 just because we try not to generate a lot of log blocks 2009-02-08 23:26 and redirect only happens once per btree leaf per delta 2009-02-08 23:27 so I think I need to disable some optimizations to get more than one log block 2009-02-08 23:27 always redirect, even if already dirty, and only log one entry per log block 2009-02-08 23:27 just use small page? 2009-02-08 23:27 already using a small page 2009-02-08 23:28 e.g. 
32bytes page 2009-02-08 23:28 then the check for minimum block size asserts 2009-02-08 23:29 ah, log is based on block 2009-02-08 23:30 yes, I will make the log top smaller 2009-02-08 23:30 ah, yes 2009-02-08 23:30 it sounds good 2009-02-08 23:31 if (1 || sb->logpos + bytes > sb->logtop) { 2009-02-08 23:31 (start a new log block every time) 2009-02-08 23:32 or "sb->logtop = bufdata(sb->logbuf) + sb->blocksize" <- use small size instead of sb->blocksize 2009-02-08 23:32 yes 2009-02-08 23:33 resize inum 0x80 at 0x0 from 0 to 36 2009-02-08 23:33 replay: load 2 logblocks 2009-02-08 23:33 replay: log magic 10ad 2009-02-08 23:33 replay: log magic 10ad 2009-02-08 23:33 replay: child = 0x2f, parent = 0x20, key = 0xc 2009-02-08 23:33 replay: child = 0x4b, parent = 0x26, key = 0x0 2009-02-08 23:33 I guess multiple log blocks works 2009-02-08 23:34 looks like good 2009-02-09 00:55 With this patch I get a segfault in cursor_redirect: 2009-02-09 00:55 --- a/user/kernel/btree.c Sun Feb 08 23:49:37 2009 -0800 2009-02-09 00:55 +++ b/user/kernel/btree.c Mon Feb 09 00:55:28 2009 -0800 2009-02-09 00:55 @@ -337,8 +337,8 @@ int cursor_redirect(struct cursor *curso 2009-02-09 00:55 struct sb *sb = btree->sb; 2009-02-09 00:55 while (1) { 2009-02-09 00:55 struct buffer_head *buffer = cursor->path[level].buffer; 2009-02-09 00:56 - if (buffer_dirty(buffer)) 2009-02-09 00:56 - return 0; 2009-02-09 00:56 +// if (buffer_dirty(buffer)) 2009-02-09 00:56 +// return 0; 2009-02-09 00:56 struct buffer_head *clone = new_block(btree); 2009-02-09 00:56 if (IS_ERR(clone)) 2009-02-09 00:56 the patch is just supposed to redirect on every leaf change instead of first dirty 2009-02-09 00:56 probably something simple, but I have to sleep now 2009-02-09 00:57 make clean && make UCFLAGS=-DATOMIC commit && ./commit foodev 2009-02-09 00:57 for test? 
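A small sketch of why shrinking the log top produces more log blocks, as suggested above. It models the log position as byte offsets instead of pointers into a buffer, and `toy_log`, `toy_log_reserve` and the `limit` field are hypothetical names, not the tux3 log code. A new block is started whenever the next entry would run past the top, so a limit equal to one entry gives one entry per block.

```c
#include <assert.h>

/* Hypothetical model of log block filling; offsets stand in for pointers. */
struct toy_log {
	unsigned pos;		/* next free byte in the current log block */
	unsigned top;		/* end of usable space in the current log block */
	unsigned limit;		/* usable bytes per log block (test knob) */
	unsigned blocks;	/* log blocks started so far */
};

/* Reserve space for one entry, starting a new log block if it will not fit. */
static void toy_log_reserve(struct toy_log *log, unsigned bytes)
{
	assert(bytes <= log->limit);
	if (log->blocks == 0 || log->pos + bytes > log->top) {
		log->blocks++;
		log->pos = 0;
		log->top = log->limit;
	}
	log->pos += bytes;
}
```

With `limit` set to a full block, many entries share one log block; with `limit` set to the entry size, every entry starts a new block, which is the effect the test setup is after.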
2009-02-09 00:57 yes 2009-02-09 00:57 this test found a bug 2009-02-09 00:58 anyway, oyasumi 2009-02-09 00:58 oyasumi 2009-02-09 03:24 -!- kunir(~kvirc@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-02-09 04:08 what is the git url ? 2009-02-09 04:09 i want to pull out ddtree 2009-02-09 07:47 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-09 08:01 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-02-09 10:03 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-09 10:33 -!- pgquiles(~pgquiles@40.Red-88-25-132.staticIP.rima-tde.net) has joined #tux3 2009-02-09 11:38 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-09 12:26 -!- pgquiles(~pgquiles@55.Red-88-16-38.dynamicIP.rima-tde.net) has joined #tux3 2009-02-09 13:05 -!- dcg(~dcg@109.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-02-09 15:26 -!- mingming_(~mingming@bi-02pt2.bluebird.ibm.com) has joined #tux3 2009-02-09 15:29 -!- mingming__(~mingming@32.97.110.51) has joined #tux3 2009-02-09 15:48 -!- mingming_(~mingming@bi-02pt2.bluebird.ibm.com) has joined #tux3 2009-02-09 15:51 -!- mingming__(~mingming@32.97.110.51) has joined #tux3 2009-02-09 16:55 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-02-09 21:59 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-09 22:25 hmm, maybe I should check in the mod that makes the commit unit test segfault 2009-02-09 22:25 either that or find/fix it 2009-02-09 22:25 tough call 2009-02-09 22:38 segfault? 
2009-02-09 23:12 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-02-09 23:39 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-02-09 23:55 -!- RazvanM(~RazvanM@96.234.232.151) has joined #tux3 2009-02-10 00:12 -!- kunir(~kvirc@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-02-10 00:37 -!- RazvanM(~RazvanM@96.234.232.151) has joined #tux3 2009-02-10 00:48 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-02-10 00:53 Program received signal SIGSEGV, Segmentation fault. 2009-02-10 00:53 0x0804ce49 in __list_add (new=0x8064514, prev=0x0, next=0xbfb6f86c) at list.h:26 2009-02-10 00:53 26 prev->next = new; 2009-02-10 00:53 (gdb) bt 2009-02-10 00:53 #0 0x0804ce49 in __list_add (new=0x8064514, prev=0x0, next=0xbfb6f86c) at list.h:26 2009-02-10 00:53 #1 0x0804ce24 in list_add_tail (new=0x8064514, head=0xbfb6f86c) at list.h:36 2009-02-10 00:53 #2 0x08053ad7 in list_move_tail (list=0x8064514, head=0xbfb6f86c) at list.h:67 2009-02-10 00:53 #3 0x0805390c in cursor_redirect (cursor=0x8064178) at kernel/btree.c:354 2009-02-10 00:54 happens if I do: 2009-02-10 00:54 / if (buffer_dirty(buffer)) 2009-02-10 00:54 / return 0; 2009-02-10 00:54 // if (buffer_dirty(buffer)) 2009-02-10 00:54 // return 0; 2009-02-10 00:55 that is, force redirect on every leaf dirty 2009-02-10 01:10 found it 2009-02-10 01:10 sb->pinned list not initialized :p 2009-02-10 01:11 hmm, but it was initialized 2009-02-10 01:11 let's find out where that changes 2009-02-10 01:11 ah 2009-02-10 01:12 initialized too late 2009-02-10 01:12 works 2009-02-10 01:13 debug lists can be really tedious 2009-02-10 01:13 debugging lists I mean 2009-02-10 01:19 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-02-10 01:19 created 39 log blocks 2009-02-10 01:19 that's lots 2009-02-10 01:19 replay: log block 39 2009-02-10 01:19 replay: child = 0xc8, parent = 0xc1, key = 0x0 2009-02-10 01:20 one promise per block 2009-02-10 01:20 good 
think we can actually fit more than that 2009-02-10 01:37 -!- MaZe(~MaZe@c-67-169-182-160.hsd1.ca.comcast.net) has joined #tux3 2009-02-10 02:56 hey flips 2009-02-10 02:56 hi bh 2009-02-10 02:57 how's it going ? 2009-02-10 02:57 not bad 2009-02-10 03:07 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-02-10 03:12 -!- kunir(~ashishraj@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-02-10 03:29 -!- kunir(~ashishraj@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-02-10 03:57 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-02-10 04:30 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-02-10 04:42 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-02-10 05:08 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-02-10 05:37 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-02-10 06:44 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has left #tux3 2009-02-10 06:44 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-02-10 06:44 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has left #tux3 2009-02-10 07:14 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-10 07:19 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-02-10 07:38 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-10 08:18 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-10 08:28 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-10 09:07 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-10 09:43 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-10 09:59 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-10 12:09 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-10 12:31 -!- 
MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-02-10 12:53 -!- dcg(~dcg@184.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-02-10 13:33 new design note coming 2009-02-10 13:34 practice for the SCALE talk on the 20th 2009-02-10 13:34 rather close 2009-02-10 13:39 good luck with that 2009-02-10 13:52 thanks 2009-02-10 13:53 hirofumi, when you get up... an idea for tux3graph: when recover is implemented, tux3graph can draw any dirty metadata blocks with dotted outline, because they don't really exist on disk, they are reconstructed 2009-02-10 13:54 most likely, that will only be bitmap blocks, which are not really shown separately 2009-02-10 13:54 so need some way of showing that 2009-02-10 13:55 we have to create more than 70 or so inodes before a "virtual" inode btree node appears 2009-02-10 13:55 that will be a pretty bushy graph 2009-02-10 15:55 my prof glanced over B+ trees :) 2009-02-10 17:20 free issue seems also complex 2009-02-10 18:08 hirofumi, hi 2009-02-10 18:08 konrad? 2009-02-10 18:13 hirofumi, maybe write a post to the mailing list about the free issue? 2009-02-10 18:14 flips: just thought of tux3 when he did 2009-02-10 18:14 hirofumi, maybe you are talking about the issue of when a flushed btree node block can be freed 2009-02-10 18:14 that is pretty clear I think 2009-02-10 18:15 and I think the prototype already has that 2009-02-10 18:15 konrad, yes somebody could add b+tree to tux3 for a Master's thesis, most probably 2009-02-10 18:15 maybe bachelor 2009-02-10 18:16 b+tree is an incremental optimization on vanilla btree from my mind 2009-02-10 18:16 introduce considerable extra complexity to increase density slightly 2009-02-10 18:17 ACTION will be out for about 3 hours 2009-02-10 18:42 b+tree is just meaning data is in leaf? (i.e. 
separate index node and leaf node) 2009-02-10 18:42 well 2009-02-10 18:42 free issue means the race of block fork and block free 2009-02-10 18:43 frontend may reference the page which is trying to free 2009-02-10 18:44 there is some solutions, but all is not clean (the change for it is not very small) 2009-02-10 18:55 -!- MaZe(~MaZe@c-67-169-182-160.hsd1.ca.comcast.net) has joined #tux3 2009-02-10 21:04 -!- flips(~phillips@phunq.net) has joined #tux3 2009-02-10 21:25 um... 2009-02-10 21:26 hopefully, normal page cache reclaimer can free forked page and buffers 2009-02-10 21:26 clear page->mapping and leaves ->lru as is 2009-02-10 21:27 so, reclaimer free buffers and page 2009-02-10 21:27 actually, we want to free forked page and buffers as soon as possible 2009-02-10 21:28 however, to do it, I guess we would need to check it for each put_bh() and put_page() 2009-02-10 21:29 so, hopefully, reclaimer does it insteaf of us 2009-02-10 22:21 hirofumi, b+tree just means that instead of splitting one leaf into two for 50% fullness, you split two leaves into three for 2/3rds fullness each 2009-02-10 22:22 oh, I was thinking 2/3 is b*tree 2009-02-10 22:22 hirofumi, fs frees buffers in ->releasepage 2009-02-10 22:23 yes 2009-02-10 22:23 but, forked page is not normal 2009-02-10 22:24 how not normal? because mapping is set? 
2009-02-10 22:24 sure 2009-02-10 22:24 because it is not in radix tree 2009-02-10 22:24 we should free those manually 2009-02-10 22:25 take an extra count to keep vmscan away from our forked page 2009-02-10 22:25 explicit free after Io completion 2009-02-10 22:25 manual free is complex, so I thought reclaimer does it 2009-02-10 22:26 we can be sure that only our IO code has a reference 2009-02-10 22:26 the point on io completion has race with block fork 2009-02-10 22:26 it can be non-fork page until io completed 2009-02-10 22:26 free in foreground is a possibility 2009-02-10 22:26 probably 2009-02-10 22:27 but, finally, I thought reclaimer is more simple 2009-02-10 22:27 normal reclaimer 2009-02-10 22:28 foreground free has to search forked page with lookup, I think 2009-02-10 22:28 "n a B+ tree, in contrast to a B-tree, all records are stored at the leaf level of the tree; only keys are stored in interior nodes." 2009-02-10 22:28 well by that description, tux3 uses b+tree 2009-02-10 22:28 I have never used any other kind of tree 2009-02-10 22:28 yes 2009-02-10 22:28 storing keys in internal nodes doesn't make sense to me 2009-02-10 22:28 for a btree 2009-02-10 22:29 to me, that is just a btree 2009-02-10 22:29 non-internal nodes can be optimization btree for space 2009-02-10 22:30 but, lookup is slow instead, I think 2009-02-10 22:30 non-key internal nodes 2009-02-10 22:30 the higher the fan out the less that matters 2009-02-10 22:31 probably 2009-02-10 22:32 I'm not expert, but I like b+tree more or less, it's simple 2009-02-10 22:32 -!- RazvanM(~RazvanM@96.234.232.151) has joined #tux3 2009-02-10 22:49 -!- flips(~phillips@phunq.net) has joined #tux3 2009-02-11 00:22 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-02-11 00:31 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-02-11 00:38 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-02-11 00:39 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) 
has joined #tux3 2009-02-11 00:42 -!- |kunir|(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-02-11 00:46 -!- |kunir|(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-02-11 00:54 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-02-11 00:54 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has left #tux3 2009-02-11 00:55 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-02-11 00:56 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-02-11 01:11 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-02-11 01:12 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-02-11 03:03 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-02-11 04:31 -!- dagle(~dagle@host162-104.bornet.net) has joined #tux3 2009-02-11 07:30 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-11 08:02 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-11 08:17 I'm going to give the SCALE presentation using my ancient vaio 2009-02-11 08:17 with Tux3 as the root fs hopefully 2009-02-11 08:21 now I have to reassemble the vaio 2009-02-11 08:21 which has been lying in parts for some months as a result of having its second hard disk die 2009-02-11 10:01 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-11 10:10 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-11 11:46 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-11 11:49 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-02-11 13:20 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-11 14:13 flips: I should have include tux3 in the graph 2009-02-11 14:43 oh that was your article? 
2009-02-11 14:56 yup ;-) 2009-02-11 14:59 :) 2009-02-11 14:59 getting close to sk8 oclock 2009-02-11 14:59 after that it will be hack oclock 2009-02-11 14:59 get some more atomic commit working 2009-02-11 15:54 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-11 19:26 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-11 19:51 I've posted new blockdirty() strategy for userspace 2009-02-11 19:52 I'm thinking synchronise_delta() for free issue 2009-02-11 19:52 it will do, 2009-02-11 19:52 void synchronise_delta() 2009-02-11 19:52 { 2009-02-11 19:53 down_write(sb->delta_lock); 2009-02-11 19:53 up_write(sb->delta_lock); 2009-02-11 19:53 } 2009-02-11 19:54 with this, I think any users of previous delta dirty buffer will be gone 2009-02-11 19:54 so, after synchronise_delta(), we can free forked buffers immidiately 2009-02-11 19:55 I'll work blockdirty() for kernel 2009-02-11 20:08 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-11 20:09 hirofumi, reading 2009-02-11 20:09 thanks 2009-02-11 20:35 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-11 20:55 hirofumi, shall we merge it and work with it? 2009-02-11 20:55 I think, yes 2009-02-11 20:55 if you are ok with new strategy 2009-02-11 20:55 ok 2009-02-11 20:56 we discussed it before 2009-02-11 20:56 yes 2009-02-11 20:56 and you have refined it 2009-02-11 20:56 let's try it 2009-02-11 20:56 btw, it seems to work for flush_log() 2009-02-11 20:58 good 2009-02-11 20:58 pulled 2009-02-11 20:58 I will not do much tonight 2009-02-11 20:58 tomorrow will try to do more 2009-02-11 20:58 thanks. 
ok 2009-02-11 20:59 I am writing a new lkml post :) 2009-02-11 20:59 which will be similar to my talk at scale 2009-02-11 20:59 I am trying to explain all the concepts in few words 2009-02-11 20:59 i see 2009-02-11 20:59 good 2009-02-11 20:59 then the next thing is, I want to be run tux3 as rootfs 2009-02-11 20:59 I know it is not ready 2009-02-11 20:59 but just for fun 2009-02-11 20:59 yes 2009-02-11 21:00 with atomic commit? 2009-02-11 21:00 without most probably 2009-02-11 21:00 i see 2009-02-11 21:00 I think, do as much as possible 2009-02-11 21:01 yes, probably 2009-02-11 21:01 there is a little more than a week to go 2009-02-11 21:01 that is a long time 2009-02-11 21:02 if you need help for something by me, please let me know 2009-02-11 21:04 I will most certainly :) 2009-02-11 21:05 ok :) 2009-02-11 21:05 oyasumi 2009-02-11 21:05 oh, oyasumi 2009-02-11 21:05 oyasumi 2009-02-11 21:05 oyasumi 2009-02-11 21:19 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-11 21:46 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-11 22:04 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-11 22:15 -!- RazvanM(~RazvanM@96.234.232.151) has joined #tux3 2009-02-11 23:05 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-12 02:53 -!- RazvanM(~RazvanM@96.234.233.219) has joined #tux3 2009-02-12 03:03 -!- MaZe(~MaZe@c-67-169-182-160.hsd1.ca.comcast.net) has joined #tux3 2009-02-12 08:07 -!- dcg(~dcg@242.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-02-12 09:16 -!- dcg(~dcg@242.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-02-12 10:03 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-12 12:00 -!- dcg(~dcg@10.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-02-12 12:42 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-12 14:22 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined 
#tux3 2009-02-12 16:17 oh, current hg repo has two heads 2009-02-12 16:17 hg clone http://tux3.org/tux3/ 2009-02-12 16:17 hg merge 2009-02-12 16:17 hg commit 2009-02-12 16:18 the above seems to solve it 2009-02-12 18:53 hirofumi: yeah thats why the bitbucket mirror is broken 2009-02-12 19:11 oh, yes 2009-02-12 19:12 luckly, it seems easy to be fixed 2009-02-12 20:04 hirofumi, checking 2009-02-12 20:06 sorry about that, fixed 2009-02-12 20:06 I have no idea what caused that 2009-02-12 20:08 probably, tree is not synced to public? 2009-02-12 20:08 well 2009-02-12 20:13 hmm 2009-02-12 20:13 I pulled to the public tree, but then I have to repeat the merge 2009-02-12 20:13 strange 2009-02-12 20:14 the "merge heads" checking shows in the public repo, but there are still two heads 2009-02-12 20:14 seems like mercurial strangeness 2009-02-12 20:15 I don't know "merge heads" 2009-02-12 20:15 maybe just "hg merge"? 2009-02-12 20:15 I mean, the commit comment was "merge heads" 2009-02-12 20:15 I see that comment on one of the heads 2009-02-12 20:16 the other head is "Initialize commit list fields before creating filesystem" 2009-02-12 20:17 looks like more strange repo 2009-02-12 20:17 "Test delta with multiple log blocks by using one entry per log block" 2009-02-12 20:17 it has 3 heads 2009-02-12 20:18 ah, no 2009-02-12 20:18 it's normal 2009-02-12 20:18 ok 2009-02-12 20:18 it looks ok with hgweb 2009-02-12 20:20 ok, what I think I did was, after the last pull from you I did not do hg update 2009-02-12 20:20 instead I edited a file to fix a bug 2009-02-12 20:20 then committed 2009-02-12 20:20 that created a new head for some reason 2009-02-12 20:20 ah, yes 2009-02-12 20:20 merge comment seems "Merge heads" 2009-02-12 20:21 yes 2009-02-12 20:21 whatever that means :) 2009-02-12 20:21 well, that comment can be changed with "hg commit -m " 2009-02-12 20:21 well, now, it looks good 2009-02-12 20:22 ah, I can change commit comments after commit? 
2009-02-12 20:23 probably, hg can't 2009-02-12 20:25 well it's happy now I hope 2009-02-12 20:25 yes 2009-02-12 20:26 btw, I wrote the draft blockdirty() code for kernel 2009-02-12 20:32 I would like to see it 2009-02-12 20:35 yes, wait a bit 2009-02-12 20:37 http://userweb.kernel.org/~hirofumi/fork.patch 2009-02-12 20:37 here is 2009-02-12 20:37 can complie, however untested at all 2009-02-12 20:37 yet 2009-02-12 20:41 reading 2009-02-12 20:48 bufdelta, nice wrapper 2009-02-12 20:48 yes, it is from your hackfs :) 2009-02-12 20:48 heh 2009-02-12 20:48 I recognize good taste when I see it :) 2009-02-12 20:49 ok, there is a cmpxchg based atomic state set from the handles prototype 2009-02-12 20:49 block handles 2009-02-12 20:49 let's see if I can find that, we don't need it immediately of course 2009-02-12 20:50 yes 2009-02-12 20:50 so lru_cache_add is not exported, but everything to write it is exported? 2009-02-12 20:51 exported only pagevec version 2009-02-12 20:52 it seems percpu pagevec version is not exported 2009-02-12 20:52 hmm 2009-02-12 20:54 I think I would prefer inode->dirty is the list of dirty buffers, and inode->link is the link field for dirty inodes 2009-02-12 20:54 a small thing 2009-02-12 20:55 ok 2009-02-12 20:55 looking at the actual function now 2009-02-12 20:59 all lru_cache_add does is put a page on the lru list? 2009-02-12 20:59 it has become very complex 2009-02-12 20:59 all lru_cache_add? 2009-02-12 20:59 __lru_cache_add just puts a page on the lru list, right? 2009-02-12 21:00 yes 2009-02-12 21:00 some type of list 2009-02-12 21:00 it has become very complex 2009-02-12 21:00 yes 2009-02-12 21:00 inline documentation is completely missing 2009-02-12 21:00 this is where I need to find somebody to blame :) 2009-02-12 21:00 some pages are splited from inactive/active list 2009-02-12 21:01 :) 2009-02-12 21:01 I wonder if vm sucks less now? 
2009-02-12 21:01 :) 2009-02-12 21:02 well, it seems the list is too long on big machine 2009-02-12 21:02 ok, so the main thing in blockdirty->fork is still replace_slot 2009-02-12 21:02 ah now I know who to blame 2009-02-12 21:02 christoph 2009-02-12 21:02 maybe 2009-02-12 21:03 I hope it also makes 4 core cpu better 2009-02-12 21:03 but the missing documentation is unforgivable 2009-02-12 21:03 for something like this 2009-02-12 21:03 (never mind I have not written any for tux3) 2009-02-12 21:03 :) 2009-02-12 21:03 unforgivable is a relative term 2009-02-12 21:04 :) 2009-02-12 21:05 void synchronize_delta(struct sb *sb) 2009-02-12 21:05 { 2009-02-12 21:05 down_write(&sb->delta_lock); 2009-02-12 21:05 up_write(&sb->delta_lock); 2009-02-12 21:05 } 2009-02-12 21:05 static void free_forked_page(struct page *page) 2009-02-12 21:05 why do we always have !PageUptodate for oldpage? 2009-02-12 21:05 { 2009-02-12 21:05 assert(PageForked(page)); 2009-02-12 21:05 assert(page_count(page) == 3); 2009-02-12 21:05 int ret = try_to_free_buffers(page); 2009-02-12 21:05 assert(ret); 2009-02-12 21:05 page->mapping = NULL; 2009-02-12 21:05 page_cache_release(page); 2009-02-12 21:05 /* Drop the final reference */ 2009-02-12 21:05 page_cache_release(page); 2009-02-12 21:05 } 2009-02-12 21:05 because now we are using ->readpage 2009-02-12 21:06 whoops 2009-02-12 21:06 it meant assert(PageUptodate(page)) 2009-02-12 21:06 it meant assert(PageUptodate(oldpage)) 2009-02-12 21:06 oh good 2009-02-12 21:07 I think that will always be true 2009-02-12 21:07 we can't fork a page that is not up to date 2009-02-12 21:07 in tux3 teminology, we can't fork an empty page 2009-02-12 21:07 yes 2009-02-12 21:08 but, later, we may allow partial uptodate page 2009-02-12 21:08 yes 2009-02-12 21:08 I took the buffer lock in the earlier prototype to handle that 2009-02-12 21:08 yes 2009-02-12 21:09 so now there is no loop over buffers at all 2009-02-12 21:09 well, only page_buffer() to get new buffer 2009-02-12 
21:09 how do we exclude against parallel blockread? 2009-02-12 21:10 parallel blockread? 2009-02-12 21:10 parallel blockread of a different buffer? 2009-02-12 21:10 blockread vs blockread? 2009-02-12 21:10 blockread vs fork 2009-02-12 21:10 ah 2009-02-12 21:10 ok 2009-02-12 21:10 I see 2009-02-12 21:10 nice :) 2009-02-12 21:10 block-fork takes lock_page 2009-02-12 21:11 but block read does not hold the page lock 2009-02-12 21:11 yes, if the page is uptodate 2009-02-12 21:12 ah, yes. it doesn't hold 2009-02-12 21:12 I think that a block read should only hold the buffer lock, not the page lock 2009-02-12 21:12 right 2009-02-12 21:12 so lock_buffer is necessary to exclude against blockread 2009-02-12 21:12 I think 2009-02-12 21:13 which case is lock_buffer needed? 2009-02-12 21:13 blockread garantee the page is uptodate 2009-02-12 21:13 when a read is in flight. We need to let the read finish before we swap the cache page 2009-02-12 21:14 yes 2009-02-12 21:14 otherwise the data that was just read will be lsot 2009-02-12 21:14 lost 2009-02-12 21:14 readpage is lock_page -> io -> unlock_page 2009-02-12 21:14 but blockread does not do lock_page 2009-02-12 21:14 current blockread does 2009-02-12 21:14 I see 2009-02-12 21:15 maybe it is ok then 2009-02-12 21:15 yes, for now 2009-02-12 21:15 when we allow it, we need to fix that assertion 2009-02-12 21:16 well your blockdirty looks very believable 2009-02-12 21:16 it is more complete than mine 2009-02-12 21:16 and more robust in theory 2009-02-12 21:16 as we discussed yesterday 2009-02-12 21:16 yes, hopefully 2009-02-12 21:17 it even has the page accounting 2009-02-12 21:17 yes 2009-02-12 21:17 so, I expect to be able to get back to work on the atomic commit prototype tomorrow 2009-02-12 21:17 I also want to try tux3 on root, without atomic commit 2009-02-12 21:17 I set up a machine to try it on yesterday 2009-02-12 21:18 ok 2009-02-12 21:18 well, I need to do a little more work on that machine 2009-02-12 21:18 disk as bad 
2009-02-12 21:18 disk was bad 2009-02-12 21:18 btw, the plan is, we set writeback and clear dirty to writeout 2009-02-12 21:19 then bio callback clears writeback bit 2009-02-12 21:19 that is good and sensible 2009-02-12 21:20 and standard 2009-02-12 21:20 yes 2009-02-12 21:20 so, after clear writeback bit, blockdirty() can fork that page 2009-02-12 21:20 even if bufferdelta() != newdelta 2009-02-12 21:21 did you ever try tux3 on root? 2009-02-12 21:21 no 2009-02-12 21:21 bmap should work ok 2009-02-12 21:21 so lilo should work ok 2009-02-12 21:21 well 2009-02-12 21:21 I should try it in uml first 2009-02-12 21:22 I hope it's not so hard 2009-02-12 21:22 I can make a very small root partition 2009-02-12 21:22 probably, mount tux3, rsync / tux3 2009-02-12 21:22 and boot from tux3 2009-02-12 21:22 yes 2009-02-12 21:22 or cp -a 2009-02-12 21:23 yes 2009-02-12 21:23 I forgot the limitation of current tux3 2009-02-12 21:23 um... 2009-02-12 21:23 ah, inode->i_blocks 2009-02-12 21:24 not important I hope 2009-02-12 21:24 ah 2009-02-12 21:24 might break cp -a ? 2009-02-12 21:24 no 2009-02-12 21:24 I guess it's not big problem 2009-02-12 21:24 i_blocks is a stupid idea anyway ;) 2009-02-12 21:25 create/delete cycle leaks some blocks 2009-02-12 21:25 this is also not big problem 2009-02-12 21:25 should not be a problem for right now 2009-02-12 21:25 do you know which blocks? 2009-02-12 21:26 dleaf? 2009-02-12 21:26 at least, btree blocks 2009-02-12 21:26 root and last leaf 2009-02-12 21:26 ah right 2009-02-12 21:26 because ddsnap never deleted those :) 2009-02-12 21:26 :) 2009-02-12 21:28 for now, I know it only except atable 2009-02-12 21:28 except atable? 2009-02-12 21:29 if inode was deleted, I guess atable entry should be deleted 2009-02-12 21:29 you mean itable? 2009-02-12 21:29 it's atable 2009-02-12 21:30 if inode is using xattr 2009-02-12 21:32 ah right 2009-02-12 21:32 if the atable entry hits zero 2009-02-12 21:32 yes 2009-02-12 22:04 maze, ping? 
2009-02-12 22:07 git blame tells me that christoph is indeed to be blamed about the new improved lru_cache_add 2009-02-12 22:07 ;) 2009-02-12 22:28 oh, which revision? 2009-02-12 22:45 hey flips 2009-02-12 22:53 -!- RazvanM(~RazvanM@96.234.233.219) has joined #tux3 2009-02-12 22:59 flips: pong 2009-02-12 23:07 maze, still there? 2009-02-12 23:07 indeed 2009-02-12 23:54 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-02-13 02:00 -!- samlh(~sam@67.129.121.145) has joined #tux3 2009-02-13 03:26 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-02-13 05:36 Another issue of block-fork is, 2009-02-13 05:36 cpu1 cpu2 2009-02-13 05:36 blockread() 2009-02-13 05:36 blockget() 2009-02-13 05:36 return buffer-1:page-1 2009-02-13 05:36 2009-02-13 05:36 blockdirty() 2009-02-13 05:36 2009-02-13 05:36 return new buffer-5:page-2 2009-02-13 05:36 2009-02-13 05:36 modify buffer-5:page-2 2009-02-13 05:36 return buffer-2:page-1 2009-02-13 05:36 2009-02-13 05:36 blockdirty() 2009-02-13 05:36 return -EAGAIN 2009-02-13 05:36 cpu2 will get -EAGAIN and will restart from blockread(). Then, 2009-02-13 05:36 re-blockread() will get forked page-2. But page-2 doesn't have the 2009-02-13 05:36 up-to-date data from buffer-2:page-1. 
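Editor's sketch of the race just described, offered as a rough single-threaded model and not as the tux3 code. One way to keep the second CPU's freshly read data from landing in the stale page is to recheck, once the read lock is held and before the data is filled in, whether the page that was looked up has since been forked, and to look the cache slot up again if so. Every name here (model_page, model_slot, model_fork, model_read_into) is hypothetical, the lock itself is not modelled, and the "page" is only 16 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_BYTES 16           /* tiny "page" so the model stays readable */

/* Hypothetical stand-in for a cache page; not the kernel's struct page. */
struct model_page {
    bool forked;                /* set once the cache slot points elsewhere */
    char data[PAGE_BYTES];
};

/* Hypothetical stand-in for the page cache slot of one file offset. */
struct model_slot {
    struct model_page *current;
};

/* Fork: copy the old page, publish the copy, then mark the old page forked. */
static void model_fork(struct model_slot *slot, struct model_page *copy)
{
    struct model_page *old = slot->current;

    memcpy(copy->data, old->data, PAGE_BYTES);
    copy->forked = false;
    slot->current = copy;
    old->forked = true;
}

/*
 * Reader side. Assumes the caller already holds whatever lock covers the
 * read (not modelled). If the page seen at lookup time was forked in the
 * meantime, switch to the page now in the slot before filling in data.
 * Copies PAGE_BYTES from bytes, stores the page actually used in *used,
 * and returns how many times the lookup had to be repeated.
 */
static int model_read_into(struct model_slot *slot, struct model_page *seen,
                           const char *bytes, struct model_page **used)
{
    int retries = 0;

    while (seen->forked) {
        seen = slot->current;
        retries++;
    }
    memcpy(seen->data, bytes, PAGE_BYTES);
    *used = seen;
    return retries;
}
```

In this model the late reader ends up writing into page-2, so the data it read is not stranded in page-1. Whether the real check belongs in blockread, readpage, or both is left open here.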
2009-02-13 05:36 so, we will need read_mapping_page() like initial version 2009-02-13 05:41 btw, lock_buffer() or lock_page() doesn't prevent it 2009-02-13 05:42 because new strategy doesn't change buffer->b_data 2009-02-13 05:45 another solution is to use check_forked() on cpu2 side under lock_page or lock_buffer 2009-02-13 05:50 probably, this is big disadvantage of new strategy 2009-02-13 07:20 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-02-13 07:28 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-13 07:37 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-02-13 07:50 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-13 08:02 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-13 09:46 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-13 09:50 -!- dcg(~dcg@125.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-02-13 09:55 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-13 10:13 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-13 10:21 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-02-13 10:52 -!- dcg(~dcg@125.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-02-13 11:35 -!- dcg(~dcg@131.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-02-13 11:41 however, I guess new strategy still has advantage more 2009-02-13 11:43 if not, maybe old strategy is hard to solve the backend vs block-fork issue without locking frontend during map_region and block allocation 2009-02-13 11:44 so, I think check_forked() is solution for this problem 2009-02-13 11:45 it would be used after taking lock for read io, and before submit 2009-02-13 11:45 I guess it solves the above race 2009-02-13 11:47 modification seems blockread and/or readpage only for now 2009-02-13 11:57 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-13 12:14 
-!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-13 12:40 -!- dcg(~dcg@7.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-02-13 13:45 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-13 15:46 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-13 19:37 bringing a shuttle xpc online for testing 2009-02-13 20:10 apt-get upgrade is over 1 GB 2009-02-13 23:15 -!- RazvanM(~RazvanM@96.234.233.219) has joined #tux3 2009-02-14 07:02 -!- dcg(~dcg@138.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-02-14 07:38 -!- dcg_(~dcg@164.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-02-14 13:49 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-15 00:21 -!- RazvanM(~RazvanM@96.234.233.219) has joined #tux3 2009-02-15 21:16 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-02-15 21:44 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-02-15 22:23 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-02-15 22:40 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2009-02-15 23:45 -!- RazvanM(~RazvanM@96.234.233.219) has joined #tux3 2009-02-16 06:40 -!- flips(~phillips@phunq.net) has joined #tux3 2009-02-16 06:40 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-02-16 06:40 -!- vomjom(~vomjom@99-157-248-71.lightspeed.stlsmo.sbcglobal.net) has joined #tux3 2009-02-16 07:03 http://userweb.kernel.org/~hirofumi/hackfs.tar.gz 2009-02-16 07:03 hackfs for blockdirty test 2009-02-16 07:04 it seems to work, but it quite not clean 2009-02-16 07:05 at least, it would have to rethink freeing page strategy 2009-02-16 07:05 it frees forked page by free_forked_page() 2009-02-16 07:06 however, we may want to free page by normal page reclaim process 2009-02-16 07:06 with ->releasepage() handler 2009-02-16 07:06 um... 
2009-02-16 07:08 well, anyway, this is what I thought for now 2009-02-16 07:08 still need rethink more or less though 2009-02-16 07:11 ah, btw, we would have to add ->invalidatepage() handler to handle truncate 2009-02-16 07:12 if this is used for normal file data 2009-02-16 07:13 and ->page_mkwrite() 2009-02-16 07:18 btw, free issues is 2009-02-16 07:18 one is buffer->b_assoc_mapping 2009-02-16 07:19 ah, one is buffer->b_assoc_buffers 2009-02-16 07:20 we don't add it to inode->i_data.private_list 2009-02-16 07:23 so vfs will not remove it automatically (and it is prefer, because we would want to use different locking rule for ->b_assoc_buffers) 2009-02-16 07:23 and second one is 2009-02-16 07:23 page reclaim process may try to free via page->lru list at any point 2009-02-16 07:24 so, we have to protect the page and buffers from it 2009-02-16 07:24 note, page->count doesn't protect to free buffers 2009-02-16 07:25 s/protect/prevent/ 2009-02-16 07:58 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-02-16 08:52 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-16 11:54 hirofumi, freeing forked pages explicitly will be ok 2009-02-16 11:55 after writing they are not used again (I think) 2009-02-16 12:54 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-16 14:32 yes 2009-02-16 14:33 the problem is that forked page is needed to special cleanup before freeing it 2009-02-16 14:34 buffer->b_assoc_buffers and page->mapping 2009-02-16 14:35 since page is still on ->lru list, vm stuff can touch forked page 2009-02-16 14:43 ->releasepage then 2009-02-16 14:43 if necessary, we can mark the page with a flag 2009-02-16 14:44 in ->releasepage, if the page is marked as forked, just return 2009-02-16 14:44 then cleanup of the page is done after bio completion 2009-02-16 14:45 yes, it can 2009-02-16 14:46 probably 2009-02-16 14:47 a forked page could have a special endio, then in the endio, link the 
page onto a list of completed pages using the ->private field 2009-02-16 14:47 just an idea 2009-02-16 14:47 block-fork can be happen after submit io 2009-02-16 14:48 right 2009-02-16 14:48 and after the endio 2009-02-16 14:48 after the endio is not allowed for now 2009-02-16 14:48 I'm using writeback bit on page for it 2009-02-16 14:49 submit set bit, and endio clear it 2009-02-16 14:49 good 2009-02-16 14:49 so, I'm using bio->bi_private as list for now 2009-02-16 14:50 current hackfs 2009-02-16 14:50 that makes sense 2009-02-16 14:51 suppose we remove the forked page from lru? 2009-02-16 14:51 no 2009-02-16 14:52 what happens that is bad? 2009-02-16 14:52 for now, I'm getting refcount of page 2009-02-16 14:52 and cleans buffer->b_assoc_buffers up before submit 2009-02-16 14:53 I was using bio list, so buffers seems not needed after submit 2009-02-16 14:53 do you like hackfs as a way to experiment? 2009-02-16 14:54 yes and no 2009-02-16 14:54 well, yes 2009-02-16 14:54 what is the "no" part? 2009-02-16 14:54 some improvement would be needed 2009-02-16 14:54 as fs 2009-02-16 14:54 now, it works as memory backend storage 2009-02-16 14:55 yes, and is ok for experimenting with transfer methods 2009-02-16 14:55 yes 2009-02-16 14:55 except it, it looks good 2009-02-16 14:55 except race check 2009-02-16 14:56 prefetchw 2009-02-16 14:56 e.g. vm accout, and dirty state has difference 2009-02-16 14:57 it just copied from kernel 2009-02-16 14:57 flush_one_page() and flush_endio() are assuming there is no partial write 2009-02-16 14:57 parital dirty page 2009-02-16 15:00 where should I begin my reading in hackfs? 
2009-02-16 15:01 start point is still test() 2009-02-16 15:01 blockget and blockread are the same has kernel/filemap 2009-02-16 15:01 yes 2009-02-16 15:01 please don't care for now 2009-02-16 15:02 change_end has a fork test 2009-02-16 15:02 yes 2009-02-16 15:02 it redirty buffer by next delta 2009-02-16 15:02 for fork debug 2009-02-16 15:03 the test() is 2009-02-16 15:03 1K blocks 2009-02-16 15:03 yes 2009-02-16 15:04 but, now it writes whole page 2009-02-16 15:04 and blockread doesn't handle race 2009-02-16 15:05 you decided to use EAGAIN 2009-02-16 15:05 yes 2009-02-16 15:05 for now 2009-02-16 15:06 the race can only happen with blockget() 2009-02-16 15:06 current hackfs is not usiing it, so race handling was ignored 2009-02-16 15:07 let me see how it submits the forked page 2009-02-16 15:08 http://userweb.kernel.org/~hirofumi/fork3.patch 2009-02-16 15:08 btw, blockread() vs blockdirty() race handling draft 2009-02-16 15:08 well 2009-02-16 15:08 blockdirty() adds buffer to inode->dirty 2009-02-16 15:09 and stage_delta() walks the inode->dirty 2009-02-16 15:09 actually, splice inode->dirty to io_buffers 2009-02-16 15:10 note of it, list_splice_init() has to do before releasing sb->delta_lock 2009-02-16 15:11 because frontend can be insert new dirty buffer to inode->dirty 2009-02-16 15:12 and it calls flush_one_page() 2009-02-16 15:12 flush_one_page() removes all buffers on the page from io_buffers 2009-02-16 15:12 and submit bio 2009-02-16 15:15 ok, flush_one_page is a good place to read 2009-02-16 15:17 io_buffers is the list of buffers to submit for the delta, stage delta starts by moving the inode dirty buffers to the io_buffers list 2009-02-16 15:18 yes 2009-02-16 15:19 btw, page_cache_get() is point to free the page in flush_one_page() 2009-02-16 15:30 ok, flush_wait_io waits on io count to reach zero 2009-02-16 15:30 sorry for the slow reading, my daughter is learning inkscape 2009-02-16 15:31 and she gets help from me 2009-02-16 15:31 no problem 2009-02-16 
15:31 no hurry 2009-02-16 15:32 it is still needed to rethink more or less 2009-02-16 15:32 about race 2009-02-16 15:32 so flush_wait_io is at the beginning of commit, to wait for the previous commit 2009-02-16 15:32 yes 2009-02-16 15:33 before commit_delta, there is synchronize_delta() 2009-02-16 15:33 it is another point to prevent race 2009-02-16 15:33 synchronize_delta() take/release sb->delta_lock 2009-02-16 15:34 io_count counts bios 2009-02-16 15:34 yes 2009-02-16 15:34 let me see how synchronize_ works 2009-02-16 15:35 ah, probably I was find the bug 2009-02-16 15:35 briefly takes write lock on delta_lock 2009-02-16 15:35 probably, it should be after flush_wait_io() 2009-02-16 15:35 well 2009-02-16 15:35 flush_wait_io() waits all bios 2009-02-16 15:36 after that, writeback bit was cleared on pages 2009-02-16 15:36 so, at this point, there is no fork anymore 2009-02-16 15:36 delta_lock is never held for a long time 2009-02-16 15:37 so your synchronization strategy is aggressive 2009-02-16 15:38 and if we call synchronize_delta(), all referencer of forked-pages will be gone 2009-02-16 15:40 yes 2009-02-16 15:40 without it, frontend can be on the middle of fork 2009-02-16 15:41 because fork and clear writeback bit has race 2009-02-16 15:41 well, so to prevent it, there is synchronize_delta() 2009-02-16 15:42 so the only stall will be on waiting for the delta write lock 2009-02-16 15:43 yes 2009-02-16 15:45 I'm looking at the flush_one_page loop now 2009-02-16 15:45 it is similar to block_write_full_page 2009-02-16 15:46 buffer = head = page_buffers(page); 2009-02-16 15:46 do { 2009-02-16 15:46 list_del_init(&buffer->b_assoc_buffers); 2009-02-16 15:46 buffer = buffer->b_this_page; 2009-02-16 15:46 } while (buffer != head); 2009-02-16 15:47 does this assume that every buffer on the page is on the io_buffers list? 2009-02-16 15:47 there is no dirty bit on page here 2009-02-16 15:47 ah, we know this page is dirty because it was on the io_buffers list? 
2009-02-16 15:47 yes 2009-02-16 15:48 no 2009-02-16 15:48 it is just buggy 2009-02-16 15:48 we have to fix it 2009-02-16 15:49 it has to check bufdelta() for each buffer 2009-02-16 15:49 yes 2009-02-16 15:49 similar to my first fork_buffer prototype 2009-02-16 15:49 it is just lazyness 2009-02-16 15:50 lazy is not a word I would use to describe you :-) 2009-02-16 15:50 :) 2009-02-16 15:51 so, it should become flush_one_buffer? 2009-02-16 15:51 probably, it would still be flush_one_page() 2009-02-16 15:52 or flush_one_range() 2009-02-16 15:52 and make one bio for each buffer? 2009-02-16 15:52 try to optimize later? 2009-02-16 15:52 make one bio for each contigunous buffers 2009-02-16 15:52 yes, however for metadata IO it will be good enough at first to do one bio per buffer 2009-02-16 15:53 multiple buffers per bio is a pretty big hack 2009-02-16 15:53 well, commit_delta want to walk the page to free it 2009-02-16 15:53 we can link bios together 2009-02-16 15:54 allocate a bio->private struct 2009-02-16 15:54 and use it for our link 2009-02-16 15:54 so, if there is the page is appered multiple times, it would be supprise 2009-02-16 15:54 if we want 2009-02-16 15:54 yes 2009-02-16 15:54 there are cases that cannot be handled by a single bio per page 2009-02-16 15:54 for example, if the first and last 1K block are dirty 2009-02-16 15:55 set flag for fist only to free it 2009-02-16 15:55 e.g. bi_private |= 1 2009-02-16 15:55 well, take a page count for each submitted buffer 2009-02-16 15:56 and do put_page in endio 2009-02-16 15:56 forked page has to call put_page() 3 times 2009-02-16 15:57 the three calls are? 
2009-02-16 15:57 radix-tree and try_to_release_buffers() and own refcount 2009-02-16 15:58 that is ok 2009-02-16 15:58 and we have to set page->mapping = NULL without radix_delete 2009-02-16 15:58 the alternative is to walk the buffer list in every endio 2009-02-16 15:58 which is worse I think 2009-02-16 15:59 yes 2009-02-16 16:00 well, I guess it is not hard 2009-02-16 16:00 lazy is good 2009-02-16 16:00 to solve 2009-02-16 16:00 at this point 2009-02-16 16:01 just add first bio on the page to free list 2009-02-16 16:01 bio on the page? 2009-02-16 16:02 if we need multiple bios for one page 2009-02-16 16:02 link the page to the list of submitted bios? 2009-02-16 16:03 add only first bio to link 2009-02-16 16:03 page->private? 2009-02-16 16:03 first 1k and last 1k on page was dirty 2009-02-16 16:03 we will submit two bios 2009-02-16 16:04 bio->bi_sector = 0 and bio->bi_sector = 3k>>9 2009-02-16 16:04 so, first bio (bio->bi_sector==0) will be added to bio->bi_private list 2009-02-16 16:05 so, bio->bi_private list will be only one for each page 2009-02-16 16:07 it can work 2009-02-16 16:07 yes 2009-02-16 16:07 it is nice not to have to do an extra allocation for bio->bi_private 2009-02-16 16:08 yes 2009-02-16 16:08 ok, it is a pretty good design 2009-02-16 16:08 and buffers can be freed more early 2009-02-16 16:08 let's compare to current buffer.c strategy 2009-02-16 16:09 end_buffer_sync_write()? 
2009-02-16 16:09 yes 2009-02-16 16:10 ah 2009-02-16 16:10 the buffer.c strategy relies on the page->private buffer ring 2009-02-16 16:10 end_page_writeback() has to call only at last callback :/ 2009-02-16 16:10 we replace that by a list of bios 2009-02-16 16:11 the buffer.c handling for buffer endio is complex 2009-02-16 16:11 as you know 2009-02-16 16:11 locking is complex, race prevention is subtle 2009-02-16 16:11 yes 2009-02-16 16:11 there are three kinds of objects involved: page, buffer, bio 2009-02-16 16:12 we would aslo need to know last callback 2009-02-16 16:12 so we reduce that to page, bio 2009-02-16 16:12 I was forgetting about it 2009-02-16 16:12 since the last callback is always the same function, we do not need a generic mechanism 2009-02-16 16:13 ah 2009-02-16 16:13 we need to count down to known when the page is completed 2009-02-16 16:13 yes 2009-02-16 16:14 let's see if any fields are available 2009-02-16 16:14 yes 2009-02-16 16:15 what about page->_mapcount? 2009-02-16 16:15 page can be non-forked page 2009-02-16 16:15 I was thinking, only for forked pages 2009-02-16 16:16 so we have a fork_endio 2009-02-16 16:16 well, it can 2009-02-16 16:16 but, we have to handle same issue on non-forked page 2009-02-16 16:17 how many references can a forked page have? 2009-02-16 16:17 it is depends on vm 2009-02-16 16:18 right, vmscan.c might be looking at it 2009-02-16 16:18 if vm is trying to free it, it will have refcount for it 2009-02-16 16:18 yes 2009-02-16 16:18 and another refcount can be percpu lru_add_pvecs 2009-02-16 16:19 so forget that 2009-02-16 16:20 well, the bio can tell us 2009-02-16 16:20 bi_idx 2009-02-16 16:20 all bios on page completed -> page completed 2009-02-16 16:20 how does it work? 
2009-02-16 16:20 similar to how buffer.c determines all buffers completed, by looking at each buffer 2009-02-16 16:22 if I recall, generic endio handling increments the bi_idx according to parameters passed to the ->endio 2009-02-16 16:23 sorry, that is no correct, the low level block driver will increment the bi_idx 2009-02-16 16:24 i see 2009-02-16 16:24 http://lxr.linux.no/linux+v2.6.28/fs/bio.c#L1214 2009-02-16 16:24 bio_endio 2009-02-16 16:26 http://lxr.linux.no/linux+v2.6.28/block/blk-core.c#L125 req_bio_endio 2009-02-16 16:26 yes 2009-02-16 16:27 actually, it is bi_size that determines when the bio has completed 2009-02-16 16:27 there is some redundancy in struct bio fields 2009-02-16 16:29 if all bi_size are zero, we know all are completed 2009-02-16 16:29 it is only true for one bio? 2009-02-16 16:30 for one bio? 2009-02-16 16:30 bio can't be used for scatter-gather io 2009-02-16 16:30 so, bio is for each range 2009-02-16 16:31 and for each range, bio->bi_size become 0? 2009-02-16 16:31 the bio is one physical range, and each bio will have ->bi_size = zero when completed 2009-02-16 16:31 however, there is a difficult race to handle 2009-02-16 16:32 well 2009-02-16 16:32 maybe not so difficult 2009-02-16 16:32 so, we can't know last bio on the page? 2009-02-16 16:32 at least one of the ->endio handlers will see all bi_size zero, but we have a problem if more than one ->endio sees all zero 2009-02-16 16:33 endio is called only if bi_size is zero? 
2009-02-16 16:33 that question is fuzzy 2009-02-16 16:33 in general it is true 2009-02-16 16:34 it is true for all low level drivers I know of 2009-02-16 16:34 and the only place that violates it is submit_bio, I think 2009-02-16 16:34 and there does not seem to be a good reason for this 2009-02-16 16:34 just a wart 2009-02-16 16:35 well, so bio->bi_size is zero on callback 2009-02-16 16:35 anyway, we could use an atomic bit in the page flags to determine final completion: check all bi_size on the list, if all zero then test_and_set the page completed bit 2009-02-16 16:36 there is still a race, the list might be destroyed while we are checking it 2009-02-16 16:36 so we would need to take a spinlock for the check 2009-02-16 16:36 list? 2009-02-16 16:37 the list of bios for the page that you proposed above 2009-02-16 16:38 list is all bios 2009-02-16 16:38 for commit 2009-02-16 16:39 ah, I thought it was list of bios for one inflight page 2009-02-16 16:39 list of all bios for commit is also a workable approach 2009-02-16 16:39 that list is using to walk all pages on all bios 2009-02-16 16:40 ok, let me adjust my thinking to this idea 2009-02-16 16:40 by the way, it was my original idea 2009-02-16 16:41 ah, walk bio list on one page 2009-02-16 16:41 that was the idea I was exploring above 2009-02-16 16:41 it is not such a bad idea 2009-02-16 16:41 yes, it can 2009-02-16 16:42 if we want a list of all pages in flight, we can re-create that list at endio time, using the same link field 2009-02-16 16:42 it can be racy 2009-02-16 16:42 all these ideas can be racy, only the difficulty of the race changes ;) 2009-02-16 16:42 it may still working for submit 2009-02-16 16:43 :) 2009-02-16 16:43 well, submiter has no lock almost for now 2009-02-16 16:43 it can work with a lightweight spinlock to submit, to complete, and to clean up 2009-02-16 16:44 well, yes 2009-02-16 16:44 you are trying to avoid the spinlock? 
2009-02-16 16:44 very hard I think 2009-02-16 16:44 yes 2009-02-16 16:44 backend one thread 2009-02-16 16:45 so, I thought we may be able to avoid locking 2009-02-16 16:45 one thread for now 2009-02-16 16:46 asych endio makes it hard to avoid a lock, even for submit 2009-02-16 16:47 if we rely on any list walking 2009-02-16 16:47 yes 2009-02-16 16:47 then a lock is needed 2009-02-16 16:47 if we can do it just with atomic counters, maybe no lock, however I do not see any field available for this 2009-02-16 16:47 list can be stable at endio 2009-02-16 16:47 avoiding an allocation is a bigger when than avoiding a spinlock, if we have to choose 2009-02-16 16:48 s/when/win/ 2009-02-16 16:48 if we prepare bios for one page, then submit those 2009-02-16 16:48 the list may not be stable at endio if there are multiple bios on the same page 2009-02-16 16:48 so, the list can be stable for one page 2009-02-16 16:48 ah 2009-02-16 16:48 and clean up the list in foreground 2009-02-16 16:49 true 2009-02-16 16:49 clean up is backend after wait all bios 2009-02-16 16:49 however, then endio cannot modify the list 2009-02-16 16:49 so that is a choice to make 2009-02-16 16:49 yes 2009-02-16 16:50 I like the idea of clean up after wait 2009-02-16 16:50 so, endio does not have to determine the last completion for a page 2009-02-16 16:51 it just counts down all bios in flight, as you have written 2009-02-16 16:51 yes, except end_page_writeback 2009-02-16 16:53 first time I looked at it 2009-02-16 16:53 it does the wakeup for lock_page 2009-02-16 16:54 err\ 2009-02-16 16:55 well, if there is no good alternative, we can use buffers like end_buffer_async_write() 2009-02-16 16:55 I think we have two workable alternatives already 2009-02-16 16:56 both simpler than the buffer_async path 2009-02-16 16:56 good 2009-02-16 16:57 or maybe we can page_buffer()->b_private 2009-02-16 16:57 ACTION <- busy for a few minutes 2009-02-16 16:57 ok 2009-02-16 17:03 back 2009-02-16 17:04 ok, let's give names 
to the two strategies 2009-02-16 17:04 let's call one the single io list, and the other the per page list strategy 2009-02-16 17:05 the single io list would rely on page completion in foreground, after all endios for the delta complete 2009-02-16 17:06 the per page list strategy and complete pages asynchronously 2009-02-16 17:06 the single io list seems easier and more robust to start with 2009-02-16 17:07 single io list? 2009-02-16 17:07 correction: the per page list strategy >can< complete pages asynchronously 2009-02-16 17:07 single io list is what you have now in hackfs I think 2009-02-16 17:07 ah, ok 2009-02-16 17:07 the io_buffers list 2009-02-16 17:08 btw, set_bit(&(unsigned long)page_buffers()->b_private) seem easy 2009-02-16 17:09 btw, what happened to the page private lock, I don't see it in struct page 2009-02-16 17:09 a bit lock? 2009-02-16 17:09 set_bit() on submit for each buffer pos 2009-02-16 17:09 and clear_bit() at endio 2009-02-16 17:10 then check it if zero at endio 2009-02-16 17:10 that is reasonable 2009-02-16 17:12 page private lock seems gone for long time ago 2009-02-16 17:51 ah, it became mapping->private_lock 2009-02-16 17:56 hirofumi, let's call your approach the "io_buffers" strategy, and we will start to document it 2009-02-16 19:19 -!- tux3bot(~tux3bot@yzf.shapor.com) has joined #tux3 2009-02-16 19:21 ok 2009-02-16 19:21 we should use mapping->private_lock for something 2009-02-16 19:22 well 2009-02-16 19:22 yes, it should be good for something 2009-02-16 19:23 FWIW, I'm writing about blockdirty() (about race) 2009-02-16 19:23 it's worth a lot 2009-02-16 19:24 yes, submit io is also complex point 2009-02-16 19:25 actually, the code of it would be simple 2009-02-16 19:25 submit, endio, cleanup, blockread, blockdirty are all sensitive 2009-02-16 19:25 yes 2009-02-16 19:26 and release, truncate... 
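A minimal userspace sketch of the set_bit()-on-submit, clear_bit()-at-endio idea from 17:07 to 17:10 above. It assumes one word of in-flight bits per page, which the log suggests keeping in page_buffers()->b_private. The names page_io_state, mark_inflight and complete_buffer are hypothetical, and C11 atomics stand in for the kernel bit operations. One atomic read-modify-write both clears the bit and reports the previous mask, so only one completion can see the mask go to zero, provided every bit for the page is set before the first bio is submitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the word kept in page_buffers(page)->b_private. */
struct page_io_state {
	atomic_ulong inflight;	/* bit n set: buffer n of this page is under IO */
};

/* Submit side: set the bits for every buffer that gets a bio,
 * before any of those bios is actually submitted. */
static void mark_inflight(struct page_io_state *state, unsigned long mask)
{
	atomic_fetch_or(&state->inflight, mask);
}

/* Endio side: clear our bit. Returns true only for the completion that
 * empties the mask, that is, when our bit was the last one still set. */
static bool complete_buffer(struct page_io_state *state, unsigned pos)
{
	unsigned long bit = 1UL << pos;
	unsigned long old = atomic_fetch_and(&state->inflight, ~bit);

	return old == bit;
}
```

With 1K blocks on a 4K page, the "first 1k and last 1k dirty" case from the log corresponds to positions 0 and 3: whichever completion runs second is the one that would call end_page_writeback().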
2009-02-16 19:26 and page reclaim and truncate has race (adds complecity) 2009-02-16 19:26 yes 2009-02-16 19:27 7 tricky operations 2009-02-16 19:27 yes 2009-02-16 19:27 x 2 for data vs metadata 2009-02-16 19:28 and with variations for physical vs logically mapped metadata 2009-02-16 19:28 physical mapped metadata shouldn't use blockdirty() 2009-02-16 19:28 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-02-16 19:28 yes, that is a help 2009-02-16 19:29 it feels good to be tackling this set of operations from a fresh point of view 2009-02-16 19:29 data adds truncate and mmap complexcity 2009-02-16 19:29 it is long overdue that this area should be cleaned up 2009-02-16 19:29 yes 2009-02-16 19:29 and logically mapped metadata adds fork complexity 2009-02-16 19:29 it's fun :) 2009-02-16 19:30 :) 2009-02-16 19:30 well, buffers (handles), pages, and bios has different main owner 2009-02-16 19:30 I guess it solves those complexity 2009-02-16 19:31 buffers is used by stage_delta() 2009-02-16 19:31 pages is just data and flags for data 2009-02-16 19:31 bios is used for commit_delta 2009-02-16 19:31 and frontend copy the buffers to cloned page 2009-02-16 19:32 almost main owner is different 2009-02-16 19:33 well, most racy one is page reclaim 2009-02-16 19:34 and with ->releasepage(), I guess some complexity can be avoided 2009-02-16 19:35 it may be able to garantee the page and buffers never free until we allow it 2009-02-16 19:36 2009-02-16 19:36 btw, http://userweb.kernel.org/~hirofumi/fork-buffer.note 2009-02-16 19:36 this is current note for fork race 2009-02-16 19:36 just for blockdirty() though 2009-02-16 19:36 yet 2009-02-16 19:36 implementation note 2009-02-16 19:38 ah, btw, if we can remove the forked-page from lru list, it would be forked-page simple 2009-02-16 19:39 if I'm not missing something, lru is only reference path from others of us 2009-02-16 19:40 it makes a lot of sense to free a forked page explicitly, not under control of vmscan 2009-02-16 
19:41 reading the note 2009-02-16 19:41 yes 2009-02-16 19:41 however, current kernel seems to remove page from lru when page->count become zero 2009-02-16 19:46 return buffer-1 ? 2009-02-16 19:47 on, buffer1 2009-02-16 19:47 oh, buffer1 2009-02-16 19:49 buffer-1 means the pointer to buffer, and buffer- is different pointer 2009-02-16 20:07 why do we take the mutex after the blockread? 2009-02-16 20:07 I thought it is normally taken before 2009-02-16 20:08 blockread will be read btree 2009-02-16 20:08 we are trying to avoid holding the mutex during read I/O? 2009-02-16 20:09 yes, and read btree 2009-02-16 20:10 and it has to be a mutex because lock_page in blockdirty can sleep 2009-02-16 20:10 yes 2009-02-16 20:11 so we will abandon my crude i_mutex locking with this approach\ 2009-02-16 20:12 i_mutex? 2009-02-16 20:12 existing smp locking 2009-02-16 20:12 using ->i_mutex 2009-02-16 20:13 for balloc/bfree? 2009-02-16 20:13 ACTION checks 2009-02-16 20:14 ah, existing locking takes mutex after blockread also 2009-02-16 20:14 yes 2009-02-16 20:15 that was a fix from you 2009-02-16 20:15 well, mutex after blockread is for deadlock for bitmap 2009-02-16 20:15 yes 2009-02-16 20:15 to solve for deadlock 2009-02-16 20:16 the deadlock that we did not hit initially because it only happens if the bitmap block has to be read from disk 2009-02-16 20:16 yes 2009-02-16 20:16 and did happen when you ran tests under memory load 2009-02-16 20:16 exactly 2009-02-16 20:16 also could happen on startup 2009-02-16 20:16 yes 2009-02-16 20:17 well, we may not need locking for backend structures if backend is one thread 2009-02-16 20:18 at least, it may be able to be simpler 2009-02-16 20:19 it is good 2009-02-16 20:20 and even if we need the backend to be parallel, we can divide the threads by inode number for example 2009-02-16 20:20 or for allocation, threads can be divided by logical address 2009-02-16 20:21 unlike the frontend, which cannot be divided naturally by object 2009-02-16 
20:21 so I think this is a strong approach 2009-02-16 20:21 yes 2009-02-16 20:21 backend vs backend locking should be simple 2009-02-16 20:21 fontend vs backend is complex becase of block-fork 2009-02-16 20:22 I guess 2009-02-16 20:26 also because user pattern vs cpus is hard to predict 2009-02-16 20:27 it is not natural to pin a part of the fs to a given cpu 2009-02-16 20:28 yes 2009-02-16 20:28 well, backend can be kernel thread? 2009-02-16 20:29 it could 2009-02-16 20:29 I would like to be able to operate without any kernel daemons by default, and only introduce daemons for optimization 2009-02-16 20:29 i see 2009-02-16 20:30 this is for robustness 2009-02-16 20:30 it sounds good 2009-02-16 20:30 daemons deadlock on memory writeout without careful treatment 2009-02-16 20:31 ah, io error handling is also pending 2009-02-16 20:34 btw, next is, document for fork, and work on userland 2009-02-16 20:34 document means implementation note 2009-02-16 20:36 I'm still reading your note, very slowly 2009-02-16 20:36 ok, I guess submitting io would have more note 2009-02-16 20:37 how does it prevent freeing buffers/pages from lru 2009-02-16 20:37 yes, an easy thing but needs a note 2009-02-16 20:37 yes 2009-02-16 20:38 well, easy for forked blocks, which have only two references: 1) from the io list 2) from the bio 2009-02-16 20:38 unforked blocks are more complex 2009-02-16 20:39 forked blocks will also be referenced from lru list 2009-02-16 20:39 3) from the lru ;) 2009-02-16 20:39 :) 2009-02-16 20:39 lru is most racy one 2009-02-16 20:39 unforked blocks that are forked during writeout are also interesting 2009-02-16 20:40 yes 2009-02-16 20:41 for now, hackfs solves it by adding no special case for forked or non-forked, until flush_wait_io 2009-02-16 20:41 flush_wait_io() and synchronize_delta() 2009-02-16 20:41 yes, I was just thinking the same 2009-02-16 20:42 good 2009-02-16 20:42 this is where keeping the same buffer on the page with your approach makes it more sane 
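A rough userspace model of the "io_buffers" strategy as described in the log: endio only counts down the bios in flight, and the backend walks the in-flight list to clean up after the wait has finished. This is a sketch under assumptions, not the hackfs code. The struct and function names are hypothetical, the next pointer plays the role of the bi_private link, and model_flush_done() only polls where flush_wait_io() would sleep.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct model_bio {			/* hypothetical stand-in for struct bio */
	struct model_bio *next;		/* plays the role of the bi_private link */
	unsigned long sector;
};

struct model_sb {
	atomic_int io_count;		/* bios in flight, touched by submit and endio */
	struct model_bio *io;		/* in-flight list, touched only by the backend */
};

/* Backend: count the bio before it could possibly complete, then link it. */
static void model_submit(struct model_sb *sb, struct model_bio *bio)
{
	atomic_fetch_add(&sb->io_count, 1);
	bio->next = sb->io;
	sb->io = bio;
}

/* Completion: count down only, never walk or modify the list. */
static void model_endio(struct model_sb *sb)
{
	atomic_fetch_sub(&sb->io_count, 1);
}

/* Poll for "all bios completed"; the real flush_wait_io() would sleep. */
static bool model_flush_done(struct model_sb *sb)
{
	return atomic_load(&sb->io_count) == 0;
}

/* Backend, after the wait: the list is stable, so no lock is needed.
 * Returns the number of bios unlinked. */
static int model_cleanup(struct model_sb *sb)
{
	int cleaned = 0;

	assert(model_flush_done(sb));
	while (sb->io) {
		struct model_bio *bio = sb->io;

		sb->io = bio->next;
		bio->next = NULL;
		cleaned++;
	}
	return cleaned;
}
```

Because model_endio() never touches the list, the single backend thread can build and tear it down without a spinlock. The price, as noted in the discussion, is that pages are only completed in the backend after every endio for the delta has run.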
2009-02-16 20:44 thanks 2009-02-16 20:44 so, forked and unforked metadata blocks will be on io_buffers list 2009-02-16 20:44 yes 2009-02-16 20:45 and fontend shouldn't modify those buffers, include truncate (not implemented yet) 2009-02-16 20:47 and handling is the same until after delta completion 2009-02-16 20:47 yes 2009-02-16 20:48 then the only difference is, strip buffers from pages that are no longer in mappings 2009-02-16 20:48 and free the pages 2009-02-16 20:49 yes 2009-02-16 20:56 ah, found another race point 2009-02-16 20:56 free_forked_page() is calling try_to_free_buffers() 2009-02-16 20:56 without checking PagePrivate() 2009-02-16 20:57 page reclaimer can free buffers before it 2009-02-16 20:57 ACTION looks 2009-02-16 20:57 didn't we plan to protect against that by ->releasepage? 2009-02-16 20:57 it can 2009-02-16 20:58 well, current code doesn't require it 2009-02-16 20:58 if there is not truncate 2009-02-16 20:58 if there is no truncate 2009-02-16 20:58 right, there isn't 2009-02-16 20:59 so, with ->releasepage(), the code can be more cleaned up 2009-02-16 20:59 well, basically, buffers can be free at any point 2009-02-16 21:00 it is why commit_delta is using bios list 2009-02-16 21:01 actually, buffers is not needed after end_page_writeback() was cleared 2009-02-16 21:01 I think that when we take a page out of page cache, we should then take responsibility for removing buffers, and prevent vmscan from removing them 2009-02-16 21:01 we use a list link field on the buffer, right? 2009-02-16 21:02 yes, until submit 2009-02-16 21:02 and then reference only from the bio? 
2009-02-16 21:02 yes, bio and page 2009-02-16 21:02 except free_forked_page() 2009-02-16 21:03 to prevent to free buffers, the page should have dirty bit 2009-02-16 21:04 that will do 2009-02-16 21:04 if dirty bit was cleared, vm can be free buffers if buffers is not busy 2009-02-16 21:04 we don't make buffer busy 2009-02-16 21:05 so, the hackfs does't use after dirty bit was cleared 2009-02-16 21:05 with this, I guess we don't need to mark buffer dirty, and locking buffer 2009-02-16 21:06 hackfs doesn't use the buffers after the page dirty bit was cleared 2009-02-16 21:07 ok, the io_buffers list, linked through _assoc_ is only used until submit in stage_delta 2009-02-16 21:08 yes 2009-02-16 21:09 and it garantee that frontend forks those if it wants to modify 2009-02-16 21:09 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-16 21:09 then we link submitted IO via bio->private 2009-02-16 21:09 yes, after removing b_assoc_buffers 2009-02-16 21:09 so, fontend can be reuse b_assoc_buffers after io comple 2009-02-16 21:10 if non-forked page 2009-02-16 21:10 sb->io is the inflight bio list 2009-02-16 21:10 yes 2009-02-16 21:10 it's good :) 2009-02-16 21:10 thanks 2009-02-16 21:11 it is quite generic too, not tied to tux3 2009-02-16 21:11 a very direct path for metadata IO 2009-02-16 21:11 I was not thinking about others of tux3, however maybe it can 2009-02-16 21:12 maybe after maturing 2009-02-16 21:12 yes 2009-02-16 21:13 a simple measure of worth is, if it submits IO with few slab allocs, fewer locks and less cpu 2009-02-16 21:13 supporting the idea of cache layering is a bonus 2009-02-16 21:14 well, all ideas is from cache layering 2009-02-16 21:14 it shifting datas by, buffers -> page -> bio 2009-02-16 21:15 it is normal though, the list is also shifting 2009-02-16 21:16 actually, it is shifting pages, that are first referenced by buffers, then referenced by bios 2009-02-16 21:16 yes 2009-02-16 21:17 page is data 2009-02-16 21:17 
and page is linked by buffers, then bios 2009-02-16 21:17 well, hackfs may have many bugs, however it's not so bad 2009-02-16 21:17 and this is good, because buffers are best at holding block state, while bios are best at interfacing to hardware 2009-02-16 21:18 yes 2009-02-16 21:18 so, I guess buffers can still be replaced by handles cleanly 2009-02-16 21:19 it is getting closer with your new work 2009-02-16 21:19 we might end up needing more than four state bits, but that is ok 2009-02-16 21:19 probably 2009-02-16 21:19 yes 2009-02-16 21:20 we can easily allow 8 state bits, using a single u64 2009-02-16 21:20 to handle 512 byte blocks on a 4K page 2009-02-16 21:20 i see 2009-02-16 21:21 so, the one issue I worried about with block handles is, how do we make lists of them 2009-02-16 21:21 because they do not have any link fields that can be used by staging 2009-02-16 21:22 well, list itself is still having some issue 2009-02-16 21:22 which issue? 2009-02-16 21:22 list can be spliced before delta_lock 2009-02-16 21:22 and we may want to sorted list by logical index 2009-02-16 21:22 and we may want the sorted list by logical index 2009-02-16 21:23 s/before delta_lock/before unlocking delta_lock/ 2009-02-16 21:24 yes, some time later we might want to sort it 2009-02-16 21:24 yes, later 2009-02-16 21:24 and release delta_lock can aslo be later 2009-02-16 21:24 ok, let's look at the splice issue 2009-02-16 21:25 in stage_delta? 
2009-02-16 21:25 yes 2009-02-16 21:25 list_splice_init() 2009-02-16 21:25 ah, you are using mapping->private_lock 2009-02-16 21:25 it is not enough at all 2009-02-16 21:26 but per-delta lists might be enough 2009-02-16 21:26 yes 2009-02-16 21:26 ok, that was always the plan 2009-02-16 21:26 I think it does not need to be per-delta, per-inode 2009-02-16 21:26 just a per-delta array in sb 2009-02-16 21:27 it can 2009-02-16 21:27 but, we may want to walk for one inode 2009-02-16 21:27 to make big extent 2009-02-16 21:27 contigunous extent 2009-02-16 21:28 e.g. for file data 2009-02-16 21:28 isn't that just local processing? 2009-02-16 21:28 local processing? 2009-02-16 21:29 process a full inode, then move on to the next 2009-02-16 21:29 if it per-delta array in sb, dirty buffers of multiple inodes is in it? 2009-02-16 21:30 if it's under delta_lock, it's ok 2009-02-16 21:30 that is the idea 2009-02-16 21:30 I was thinking about unlock delta_lock more early 2009-02-16 21:31 well, if per-delta list, multiple inodes would be on it 2009-02-16 21:31 I was thinking, only the dirty blocks in current delta need to be listed from the inode 2009-02-16 21:31 yes 2009-02-16 21:31 dirty blocks in earlier deltas are all under IO 2009-02-16 21:32 we have to remove dirty buffers from on the inode->dirty at some point 2009-02-16 21:32 yes, in staging as now 2009-02-16 21:33 in staging, remove the dirty buffers from inode, sort them, make big biovecs 2009-02-16 21:33 and until it was removed, next delta can't be started 2009-02-16 21:33 true\ 2009-02-16 21:34 so, if it's before unlocking delta_lock, it's ok 2009-02-16 21:34 but, if before unlock, it needs some trick 2009-02-16 21:34 ok I see your point 2009-02-16 21:34 yes 2009-02-16 21:34 just for optimization 2009-02-16 21:35 so a per-inode delta dirty array is a simple approach 2009-02-16 21:35 yes, probably 2009-02-16 21:35 it would just be two or four list heads 2009-02-16 21:35 say, 32 bytes per inode 2009-02-16 21:35 yes, it 
would be two list heads 2009-02-16 21:36 one is for fontend, and another one is for backend 2009-02-16 21:36 and we only allow stage_delta to overlap two deltas 2009-02-16 21:36 yes 2009-02-16 21:36 yes 2009-02-16 21:36 it's good 2009-02-16 21:37 with some trick it could be one list per inode, but it probably doesn't matter 2009-02-16 21:37 yes 2009-02-16 21:37 probably 2009-02-16 21:37 ok, fine 2009-02-16 21:37 that is our plan 2009-02-16 21:38 ok 2009-02-16 21:38 it's optimization process though 2009-02-16 21:38 next we have to ask linus is we can have spinning rw locks ;) 2009-02-16 21:38 maybe we can talk about that for a moment 2009-02-16 21:39 spining rw locks? 2009-02-16 21:39 eventually the delta_lock will be a bottleneck on multicpu 2009-02-16 21:39 yes 2009-02-16 21:39 probably not spinning rw locks, but a combination of spinlock and some other synchronizer 2009-02-16 21:39 so, adaptive lock? 2009-02-16 21:40 something better than rw semaphore 2009-02-16 21:40 i see 2009-02-16 21:40 probably some per-cpu thing 2009-02-16 21:41 so that begin/end_change is very efficient 2009-02-16 21:41 at the expense of delta transition 2009-02-16 21:41 don't block frontend 2009-02-16 21:41 um... 
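The per-inode delta dirty lists agreed on above (two list heads, one for the frontend and one for the backend) could be modelled roughly as follows. This is a hedged sketch: model_inode, model_buffer and both functions are hypothetical names, the slot is assumed to be chosen by delta & 1, and singly linked lists are used for brevity where the kernel would use struct list_head.

```c
#include <assert.h>
#include <stddef.h>

struct model_buffer {			/* hypothetical stand-in for a dirty buffer */
	struct model_buffer *next;
	unsigned long index;		/* logical block index */
};

struct model_inode {
	/* One list per overlapping delta: slot (delta & 1).
	 * The frontend fills the slot of the current delta while the
	 * backend drains the slot of the delta being staged. */
	struct model_buffer *dirty[2];
};

/* Frontend: put a buffer on the dirty list of the delta it was dirtied in. */
static void model_mark_dirty(struct model_inode *inode,
			     struct model_buffer *buffer, unsigned delta)
{
	struct model_buffer **head = &inode->dirty[delta & 1];

	buffer->next = *head;
	*head = buffer;
}

/* Backend: detach the whole list of the delta being staged, leaving the
 * slot empty for reuse two deltas later. */
static struct model_buffer *model_stage(struct model_inode *inode, unsigned delta)
{
	struct model_buffer *list = inode->dirty[delta & 1];

	inode->dirty[delta & 1] = NULL;
	return list;
}
```

The slot for a delta must be drained before delta + 2 starts using it, which corresponds to the rule in the log that stage_delta may overlap only two deltas.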
2009-02-16 21:42 I still have not thought of a clever way of allowing frontend to continue on every cpu 2009-02-16 21:42 at time of delta transition 2009-02-16 21:42 we was talking about it more or less 2009-02-16 21:42 however, I can't remember the detail of it 2009-02-16 21:43 ah, ok 2009-02-16 21:43 previous delta blocks the next delta to start 2009-02-16 21:43 backend doesn't need to block frontend actually 2009-02-16 21:44 yes 2009-02-16 21:44 well, read and write can overlap delta staging I think 2009-02-16 21:45 with some work 2009-02-16 21:45 yes 2009-02-16 21:45 that is, if we decide that some inode will not be flushed in this delta, then IO can continue in parallel with staging 2009-02-16 21:46 something similar might be possible with directories 2009-02-16 21:46 directories is simple for not at least 2009-02-16 21:46 read and write is blocking 2009-02-16 21:46 by i_mutex 2009-02-16 21:46 yes 2009-02-16 21:46 which is really not much more costly than staging 2009-02-16 21:47 well 2009-02-16 21:47 the rw lock for staging will be costly :) 2009-02-16 21:47 but it will not cost as much as we will save with only a few forks 2009-02-16 21:48 i see 2009-02-16 21:48 that is my theory 2009-02-16 21:48 ok, I had not thought of those methods of overlapping deltas before, but it seems obvious now 2009-02-16 21:48 if it read (sys_read, sys_readdir), maybe we can allow it 2009-02-16 21:49 it follows from the fact that we are not required to flush every inode at every delta 2009-02-16 21:49 yes, reads should always overlap 2009-02-16 21:49 only writes may stall, and that is always true anyway for one reason or another 2009-02-16 21:50 only exception is atime 2009-02-16 21:50 ah, a separate problem 2009-02-16 21:50 ok 2009-02-16 21:51 we can leave atime out completely to start and people will not complain much 2009-02-16 21:51 per inode delta dirty? 2009-02-16 21:51 per inode delta dirty state? 2009-02-16 21:51 um.., inode has delta dirty state? 
2009-02-16 21:52 so, if delta is different, it blocks 2009-02-16 21:52 maybe 2009-02-16 21:52 I think it works 2009-02-16 21:53 just as it works for rollup flush 2009-02-16 21:54 well, delta_lock seems not to require all modify at least 2009-02-16 21:55 it just can't re-dirty 2009-02-16 21:55 probably 2009-02-16 21:56 if the inode has a delta dirty array it can re-dirty 2009-02-16 21:56 essentially, it would be equivalent to taking a snapshot of the inode dirty state 2009-02-16 21:57 yes 2009-02-16 21:57 and transfer the snapshot to media asynchronously 2009-02-16 21:57 it is a useful model, I think 2009-02-16 21:57 yes 2009-02-16 21:57 stable buffer for inode with per-delta 2009-02-16 21:59 mmap is an issue 2009-02-16 21:59 mmap modify buffer 2009-02-16 21:59 one way to handle that is, exact state of disk is undefined for unsynchronized mmap 2009-02-16 21:59 so, it would be blockdirty() 2009-02-16 22:00 well that is fine except we would need wp_page 2009-02-16 22:00 and that wiould be costly 2009-02-16 22:00 yes 2009-02-16 22:00 so it is better just to require the user to use fsync when they want an exact state on disk for a mmapped file 2009-02-16 22:00 I think that works ok 2009-02-16 22:00 if order mode, it should be ok 2009-02-16 22:01 if data=ordered 2009-02-16 22:01 even if a stronger mode, we just allow the user to break the model with mmap 2009-02-16 22:01 sorry, I meant msync 2009-02-16 22:02 but, we have to fork buffer and relocate it? 2009-02-16 22:02 just at msync time 2009-02-16 22:02 I have not thought deeply about this 2009-02-16 22:03 there would be no buffers involved 2009-02-16 22:03 maybe, msync would have to modify pte if we fork buffer 2009-02-16 22:03 just pages 2009-02-16 22:03 yes 2009-02-16 22:05 hmm, what would msync do that fsync would not do? 
2009-02-16 22:05 ah 2009-02-16 22:05 search for dirty ptes to flush to disk 2009-02-16 22:05 after that, it is just fsync 2009-02-16 22:05 yes 2009-02-16 22:06 I don't think we need to do any forking 2009-02-16 22:06 it write page by in place? 2009-02-16 22:07 it can even redirect 2009-02-16 22:07 ah 2009-02-16 22:07 but, wait io? 2009-02-16 22:08 the only thing we do not do, is try to guarantee that the msync is an exact snapshot of the file at any particular time 2009-02-16 22:08 yes, msync/fsync must wait for the delta to complete 2009-02-16 22:09 it also have to wait previous delta? 2009-02-16 22:09 yes 2009-02-16 22:09 (actually, the page) 2009-02-16 22:09 ah, yes 2009-02-16 22:09 well, waiting on the current delta implies waiting on the previous delta 2009-02-16 22:10 if there is fork, it doesn't need to wait io 2009-02-16 22:10 well 2009-02-16 22:10 msync/fsync flush forcely, so there is no problem? 2009-02-16 22:10 there is no actual disadvantage 2009-02-16 22:11 um... 2009-02-16 22:11 I think that is correct 2009-02-16 22:11 I don't yet see a purpose for fork of a regular file block, except maybe snapshot 2009-02-16 22:11 i see 2009-02-16 22:11 I don't think anybody really tries to include mmap in a snapshot currently 2009-02-16 22:12 but, it works like transactional file? 2009-02-16 22:12 "it" ? 2009-02-16 22:12 snapshot? 
2009-02-16 22:12 fork of regular file block 2009-02-16 22:13 fork of a regular file block would be for the purpose of committing a snapshot of the file 2009-02-16 22:13 it is easy to see how that can be useful 2009-02-16 22:13 yes 2009-02-16 22:13 with fork, we are not forced to drain the block device to make the snapshot 2009-02-16 22:14 I guess ext4 may be trying mmap too 2009-02-16 22:14 there are cases where that can be very valuable, even if we do not handle mmap perfectly 2009-02-16 22:14 heh, then mingming and friends can get the wp_page part working ;) 2009-02-16 22:14 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-16 22:14 I guess it is page_mkwrite 2009-02-16 22:15 and clear dirty makes the page read-only 2009-02-16 22:15 we are already using that trick for dirty page accounting? 2009-02-16 22:15 however, it is only the page for in-place 2009-02-16 22:16 I guess it can 2009-02-16 22:16 with modify flush_one_page() 2009-02-16 22:16 http://www.redhat.com/archives/cluster-devel/2008-January/msg00212.html 2009-02-16 22:16 flush_one_page() calls cancel_dirty_page() 2009-02-16 22:18 yes, probably like it 2009-02-16 22:18 ok, well that will become important when we do snapshots, or clustering 2009-02-16 22:19 well, so, if we use clear_dirty_page_for_io(), it will make pte as read-only 2009-02-16 22:19 still some time in the future for both 2009-02-16 22:19 yes 2009-02-16 22:19 I thought linus used to hate pte tricks like taht 2009-02-16 22:19 and now we use it heavily 2009-02-16 22:19 what changed? 
2009-02-16 22:20 maybe, someone had good example for it 2009-02-16 22:20 maybe, network fs 2009-02-16 22:20 maybe we flush the ptes more lazily 2009-02-16 22:20 I don't know 2009-02-16 22:20 something changed ;)\ 2009-02-16 22:20 it maps shared-writable as read-only 2009-02-16 22:21 and, user modifies that read-only page, page fault is happened 2009-02-16 22:21 and fault handler will call ->page_mkwrite 2009-02-16 22:22 then clear_dirty_page_for_io() will makes page as read-only again 2009-02-16 22:22 iirc, it is basic trick of this 2009-02-16 22:27 yes, but I thought LInus always used to say that the cost of flushing the page table made the pte write protect useless 2009-02-16 22:27 maybe recent intel and amd cpus have a lower cost 2009-02-16 22:27 that seems likely 2009-02-16 22:28 probably 2009-02-16 22:28 well, network fs might be necessary to support page_mkwrite 2009-02-16 22:28 ah, it must be an improved invalidate page instruction\ 2009-02-16 22:28 I forget first page_mkwrite intent 2009-02-16 22:28 invlpg or something like that 2009-02-16 22:29 per-pte page invalidate 2009-02-16 22:29 maybe 2009-02-16 22:29 as an alternative to reloading cr3 2009-02-16 22:29 right, invlpg 2009-02-16 22:30 it seems from nfs/afs with cachefs 2009-02-16 22:30 and fuse 2009-02-16 22:31 what is? 
2009-02-16 22:31 9637a5efd4fbe36164c5ce7f6a0ee68b2bf22b7f 2009-02-16 22:33 http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=9637a5efd4fbe36164c5ce7f6a0ee68b2bf22b7f 2009-02-16 22:34 ah, invlpg 2009-02-16 22:38 -!- cdk(~chinmay@115.109.15.177) has joined #tux3 2009-02-16 23:00 ACTION <- busy for a while 2009-02-16 23:01 ok 2009-02-16 23:11 -!- RazvanM(~RazvanM@96.234.233.219) has joined #tux3 2009-02-17 05:23 -!- tux3bot(~tux3bot@yzf.shapor.com) has joined #tux3 2009-02-17 05:32 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-17 07:53 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-17 08:22 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-17 08:46 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-17 09:18 -!- dcg(~dcg@201.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-02-17 09:49 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-17 11:20 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-17 15:23 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-17 16:24 sk8 oclock 2009-02-17 16:28 since I'm skiing tomorrow, no sk8 for me 2009-02-17 16:30 you lucky stiff 2009-02-17 17:30 tim_dimm: where you skiing? 2009-02-17 17:30 baldy? 2009-02-17 17:34 yup 2009-02-17 20:47 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-02-17 20:47 hi hirofumi 2009-02-17 20:47 hhi 2009-02-17 20:49 http://scale7x.socallinuxexpo.org/conference-info/schedules 2009-02-17 20:49 Tux3 presentation on Sunday, Feb 22, 11:30 2009-02-17 20:49 present the design 2009-02-17 20:50 and hopefully have the non-atomic-commit version running on the laptop 2009-02-17 20:50 preparation was done already? 
2009-02-17 20:50 I had to replace and reinitialize the laptop hard disk, which took some extra time 2009-02-17 20:51 so I have not yet tried it 2009-02-17 20:51 i see 2009-02-17 20:51 by tomorrow I will 2009-02-17 20:51 shapor also promised to try it ;) 2009-02-17 20:51 :) 2009-02-17 20:51 I think the only thing we haven't actually tried is bmap 2009-02-17 20:51 so that lilo will work 2009-02-17 20:51 I don't think there are other issues 2009-02-17 20:52 so soon... 2009-02-17 20:52 i see 2009-02-17 20:52 let me see 2009-02-17 20:52 ah, real hardware? 2009-02-17 20:52 not uml 2009-02-17 20:52 yes 2009-02-17 20:52 i see 2009-02-17 20:52 well you have already been running on real hardware 2009-02-17 20:52 but not me 2009-02-17 20:52 I'll try it on kvm 2009-02-17 20:52 ok, now... kernel version 2009-02-17 20:53 we are at 2.6.29-rc1 2009-02-17 20:53 ok 2009-02-17 20:53 I suppose that is what I should put on the laptop 2009-02-17 20:55 I think that our latest kernel changes did not change much 2009-02-17 20:55 the anti-deadlock hack is still there 2009-02-17 20:55 for bitmap read 2009-02-17 20:55 ACTION looks 2009-02-17 20:56 By the way, the five good D of software development: design develop debug document deploy 2009-02-17 20:57 it was just something that entered my mind today as I was preparing the notes for the scale presentation 2009-02-17 20:57 the 5D plan 2009-02-17 20:58 deploy? 2009-02-17 20:59 deploy means distribute? 2009-02-17 21:00 well, now coping from / to tux3 2009-02-17 21:01 btw, I guess "grub" is more easier than "lilo" 2009-02-17 21:02 ah, one problem was found 2009-02-17 21:03 #define TUX_LINK_MAX 64 /* just for debug for now */ 2009-02-17 21:03 we are now limiting link count 2009-02-17 21:03 yes, it could be distribute instead of deploy 2009-02-17 21:03 to give a more open source feeling 2009-02-17 21:03 i see 2009-02-17 21:04 is there anything that has more links than that in a normal linux distribution? 
2009-02-17 21:05 it is inode->i_link count, so it is limiting directory entries 2009-02-17 21:06 ah, well let's change it 2009-02-17 21:06 yes 2009-02-17 21:07 suitable value? 2009-02-17 21:07 2**32 ? 2009-02-17 21:07 32000 is ext2 is using 2009-02-17 21:08 2**15? 2009-02-17 21:08 ok, good enough for today 2009-02-17 21:08 less thinking :) 2009-02-17 21:08 probably, it's good 2009-02-17 21:08 maybe, old 32bit userspace may be using u16 link count 2009-02-17 21:09 -#define TUX_LINK_MAX 64 /* just for debug for now */ 2009-02-17 21:09 +#define TUX_LINK_MAX (1 << 15) /* arbitrary limit, increase it */ 2009-02-17 21:09 yes 2009-02-17 21:11 done 2009-02-17 21:11 I have set up a nice little shuttle PC for testing 2009-02-17 21:12 btw, pc are installed some distro? 2009-02-17 21:12 already 2009-02-17 21:12 I normally use debian 2009-02-17 21:12 should I try asianus? ;) 2009-02-17 21:12 asianux 2009-02-17 21:12 :) I don't recommend it 2009-02-17 21:12 probably keep using debian 2009-02-17 21:12 it is usually the best for development 2009-02-17 21:12 well, it's redhat base 2009-02-17 21:13 sure 2009-02-17 21:13 there is partition for tux3? 2009-02-17 21:13 not yet 2009-02-17 21:13 ah, there is disk space for tux3? 2009-02-17 21:13 let's see what I have available 2009-02-17 21:13 yes 2009-02-17 21:13 I have a lot of disks here 2009-02-17 21:14 i see 2009-02-17 21:14 at least one 500 GB I'm not using 2009-02-17 21:14 with grub, tux3 root may be easy 2009-02-17 21:14 it's the backup disk for the other 500 GB 2009-02-17 21:14 I thought with grub I have to compile tux3 support into grub itself 2009-02-17 21:15 ah 2009-02-17 21:15 lilo is easy for me to understand 2009-02-17 21:15 bmap just has to work 2009-02-17 21:15 ah, no 2009-02-17 21:15 and the bmap you implemented looks very simple 2009-02-17 21:15 probably, grub is not needed to support tux3 if kernel is on ext3 or something 2009-02-17 21:16 but grub has to include tux3 if the root fs is tux3 2009-02-17 21:16 right? 
2009-02-17 21:16 rootfs can pass by root=/dev/ 2009-02-17 21:16 well let me see 2009-02-17 21:17 well, it is actually passing to kernel though 2009-02-17 21:17 grub has to load kernel, but it can be from another partition 2009-02-17 21:17 ok, I see, only the kernel image has to be on an ext3 partition 2009-02-17 21:17 yes 2009-02-17 21:17 the root fs can be tux3 without tux3 support in grub 2009-02-17 21:17 now, why is that easier than lilo? 2009-02-17 21:18 because there is no need to work with mbr 2009-02-17 21:18 ah, but I am used to that 2009-02-17 21:18 used lilo for ten years 2009-02-17 21:18 still using it on a couple machines 2009-02-17 21:18 including the laptop 2009-02-17 21:18 ah, i see 2009-02-17 21:19 most others have grub 2009-02-17 21:19 so I use both 2009-02-17 21:19 grub would be more easier for this work 2009-02-17 21:19 well, if lilo is work, it's good enough 2009-02-17 21:19 if I have time I will try both 2009-02-17 21:19 I was forget about lilo already 2009-02-17 21:20 I had to do a lilo rescue last week 2009-02-17 21:20 I prepared the laptop hard disk without initializing the mbr 2009-02-17 21:20 and put it into the laptop that way 2009-02-17 21:20 I did not way to remove the disk again 2009-02-17 21:20 it is a lot of work 2009-02-17 21:21 so I booted knoppix and did a rescue 2009-02-17 21:21 installed lilo, and set up the kernel boot 2009-02-17 21:21 the thing to remember to make this easy is chroot 2009-02-17 21:22 you boot from the cd, mount the hd, chroot to the mountpoint and run lilo 2009-02-17 21:22 that's it, problem fixed 2009-02-17 21:22 yes 2009-02-17 21:22 and I had forgotten everything about lilo before I did that 2009-02-17 21:22 just like you ;) 2009-02-17 21:22 man pages rock, google rocks 2009-02-17 21:24 :) 2009-02-17 21:24 well, grub should be boot directly hd from cd 2009-02-17 21:24 with grub command line 2009-02-17 21:28 yes, I have done rescue with grub before too 2009-02-17 21:28 it has a useful feature to scan for 
partitions 2009-02-17 21:28 tricky to use 2009-02-17 21:28 yes 2009-02-17 21:28 but if you make a bad mistake, sometimes it is the only choice 2009-02-17 21:29 lilo? 2009-02-17 21:29 I overwrite the first 64 MB of my workstation hd once 2009-02-17 21:29 grub 2009-02-17 21:29 yes 2009-02-17 21:29 I rescued several times with grub 2009-02-17 21:29 and harddisk replace too 2009-02-17 21:30 I was able to repair that using gparted 2009-02-17 21:30 sorry, not grub 2009-02-17 21:30 it is gparted that has the scanning feature 2009-02-17 21:30 oh 2009-02-17 21:30 well, grub also can scan partitions and disks 2009-02-17 21:31 http://www.linux.com/feature/57748 2009-02-17 21:31 I think the maintainers of grub and gparted may be the same 2009-02-17 21:31 both are fsf projects 2009-02-17 21:32 grub maintainer is OKUJI-san 2009-02-17 21:32 I overwrite my boot partition while doing benchmarking 2009-02-17 21:32 did a dd to the wrong device 2009-02-17 21:32 easy to do 2009-02-17 21:32 :) 2009-02-17 21:33 ah, so, scan without partition table 2009-02-17 21:34 yes, I knew the machine was unbootable 2009-02-17 21:34 yes 2009-02-17 21:34 copy was done 2009-02-17 21:34 and I managed to install gparted, recreate the partition table, and reboot successfully 2009-02-17 21:35 stephen tweedie was advising me on irc :) 2009-02-17 21:35 oh, good :) 2009-02-17 21:35 always nice to have a scottish PhD around 2009-02-17 21:36 with chroot to tux3, it seems to work and usual 2009-02-17 21:36 ok, just installed git on the shuttle 2009-02-17 21:36 that is promising :) 2009-02-17 21:37 btw, I'm using 2.6.29-rc2 2009-02-17 21:37 it doesn't matter 2009-02-17 21:37 sure 2009-02-17 21:38 the only difference is, uml compiles with -rc1 2009-02-17 21:38 a small problem 2009-02-17 21:38 probably fixed by now 2009-02-17 21:41 it seems not to be fixed yet, however I have a patch for it 2009-02-17 21:42 post to the mailing list maybe? 
2009-02-17 21:43 lkml know it already, however patch is not in Linus tree yet 2009-02-17 21:45 ok, it seems to boot with tux3 root 2009-02-17 21:45 anyway, I don't have a reason to change from -rc1 yet 2009-02-17 21:45 :) 2009-02-17 21:45 kvm with direct kernel load though 2009-02-17 21:45 load kernel by kvm 2009-02-17 21:46 I ah 2009-02-17 21:46 ah 2009-02-17 21:46 magic 2009-02-17 21:46 yes 2009-02-17 21:46 however, there is no big difference with grub 2009-02-17 21:46 it is very likely to work just the same on a real machine 2009-02-17 21:46 I am still installing git, and you are almost done ;) 2009-02-17 21:47 it is just avoid the kernel to copy to guest disk 2009-02-17 21:47 yes 2009-02-17 21:50 kernel message is too many 2009-02-17 21:52 it would be cause of syslog full 2009-02-17 21:52 kernel output message, and syslog write to disk, so kernel output message again 2009-02-17 21:52 ah, let's turn it off 2009-02-17 21:52 easiest way is to hack trace.h 2009-02-17 21:53 ah, yes 2009-02-17 21:53 what I want to do is use a static variable to control trace.h 2009-02-17 21:54 it would be easy with module_parm() 2009-02-17 21:54 so we can have tracing enabled or not as runtime, but not have to add a new parameter to every trace call 2009-02-17 21:54 maybe you know a smarter way to do that 2009-02-17 21:54 module parm? 2009-02-17 21:54 yes 2009-02-17 21:55 if module is built into kernel, module_parm() become boot parametor 2009-02-17 21:55 and it can change via sysfs at runtime 2009-02-17 21:55 http://www.faqs.org/docs/kernel/x350.html 2009-02-17 21:55 ok, good 2009-02-17 21:56 it is not per-sb though 2009-02-17 21:57 it would be like "module_param(verbose_trace, int, 0644);" 2009-02-17 21:58 MODULE_PARM() is old way only for module parametor 2009-02-17 21:58 not per-sb is ok 2009-02-17 21:59 ok, well right now I can just disable it in trace.h 2009-02-17 21:59 using a static variable 2009-02-17 22:00 yes 2009-02-17 22:00 #define logline(caller, fmt, args...) 
if (tux3_trace) { \ 2009-02-17 22:02 map_region() calls dleaf_dump() 2009-02-17 22:03 and the problem is? 2009-02-17 22:03 dleaf_dump() is dumpping dleaf unconditionaly 2009-02-17 22:03 ok, let's fix that 2009-02-17 22:03 it is right though, message is too many 2009-02-17 22:04 - dleaf_dump(btree, leaf); 2009-02-17 22:04 yes 2009-02-17 22:08 the hard part is declaring tux3_trace 2009-02-17 22:08 super.c? 2009-02-17 22:08 have to put it somewhere for both userspace and kernel 2009-02-17 22:08 super.c isn't always compiled 2009-02-17 22:08 what is problem? 2009-02-17 22:09 super.c is compiled in kernel always 2009-02-17 22:09 need to pick a file for the variable definition that is always compiled in userspace 2009-02-17 22:09 well 2009-02-17 22:09 userspace doesn't really need this option 2009-02-17 22:09 ah, trace.h is shared 2009-02-17 22:09 yes 2009-02-17 22:09 easy to fix that 2009-02-17 22:09 but maybe think about a way to make it work in userspace first 2009-02-17 22:09 so, #define tux3_trace 1 in user/trace.h 2009-02-17 22:10 :) 2009-02-17 22:12 fs/tux3/balloc.c:232: error: too few arguments to function 'blockdirty' 2009-02-17 22:12 yes 2009-02-17 22:12 I have the patches for it 2009-02-17 22:13 ready to pull? 
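The runtime trace switch discussed above can be sketched roughly as follows. This is a minimal userspace sketch, not the actual tux3 trace.h: trace_emit(), trace_buf, trace_calls and map_region_demo() are hypothetical names added for illustration, and the variable here is a plain static int, whereas in the kernel it would be the one exposed through module_param() and userspace could simply use "#define tux3_trace 1" as suggested in the log.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Runtime switch. In the kernel this would be the module parameter;
 * here it is an ordinary variable so the sketch stays self-contained. */
static int tux3_trace = 1;

/* Hypothetical sink: keeps the last line so the behaviour can be checked. */
static char trace_buf[256];
static int trace_calls;

static void trace_emit(const char *caller, const char *fmt, ...)
{
	va_list args;
	int len;

	assert(caller && fmt);
	len = snprintf(trace_buf, sizeof(trace_buf), "%s: ", caller);
	if (len < 0 || (size_t)len >= sizeof(trace_buf))
		len = 0;	/* caller name too long: drop the prefix */
	va_start(args, fmt);
	vsnprintf(trace_buf + len, sizeof(trace_buf) - (size_t)len, fmt, args);
	va_end(args);
	trace_calls++;
	puts(trace_buf);	/* the newline is always supplied here */
}

/* The check lives in the macro, so no trace call needs an extra parameter. */
#define logline(caller, ...) \
	do { if (tux3_trace) trace_emit(caller, __VA_ARGS__); } while (0)
#define trace(...) logline(__func__, __VA_ARGS__)

/* Hypothetical caller, standing in for something like map_region(). */
static void map_region_demo(int blocks)
{
	trace("mapping %d blocks", blocks);
}
```

Because the test sits inside the macro, setting tux3_trace to 0 silences every call site at once, and the format arguments are not evaluated when tracing is off.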
2009-02-17 22:14 I'm not adding the comments to those yet 2009-02-17 22:26 ok, I hacked blockdirty to compile, just to see if the trace variable is right 2009-02-17 22:27 I will remove the hack now and check in the trace patch 2009-02-17 22:30 ok 2009-02-17 22:30 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-02-17 22:30 some patches are done 2009-02-17 22:31 reading 2009-02-17 22:34 started git clone on the test machine 2009-02-17 22:35 I'm not smart enought to clone it from my other machine ;) 2009-02-17 22:35 so getting from git.kernel.org 2009-02-17 22:35 well 2009-02-17 22:35 more like to lazy to configure the git daemon 2009-02-17 22:35 it can just be ssh or rsync 2009-02-17 22:35 ah 2009-02-17 22:36 well I am only using 32KB/sec of kernel.org bandwidth 2009-02-17 22:36 I wonder why it is so slow 2009-02-17 22:37 slow mirror? 2009-02-17 22:37 it's kernel.org 2009-02-17 22:37 probably I should use a mirror 2009-02-17 22:37 well, tux3 root seems to freeze at somewhere 2009-02-17 22:37 hmm 2009-02-17 22:39 I handled the sizeof bnode issue a little differently 2009-02-17 22:39 just make the struct definition shared 2009-02-17 22:39 ok 2009-02-17 22:39 but will pull and change after I think 2009-02-17 22:39 ok 2009-02-17 22:39 it makes sense to share struct bnode 2009-02-17 22:40 i see. I thought it shouldn't export bnode 2009-02-17 22:40 well 2009-02-17 22:40 ah, and you added tux3_trace 2009-02-17 22:41 yes 2009-02-17 22:41 just a hack though 2009-02-17 22:41 sure 2009-02-17 22:41 we will improve it 2009-02-17 22:41 yes 2009-02-17 22:42 pulling 2009-02-17 22:42 -!- RazvanM(~RazvanM@96.234.233.219) has joined #tux3 2009-02-17 22:48 hirofumi, ok, the freeze on boot 2009-02-17 22:48 this is where I really really wish we had kdb in kernel 2009-02-17 22:48 probably, it was my fault 2009-02-17 22:49 maybe, /dev/ doesn't have needed devices 2009-02-17 22:49 ah 2009-02-17 22:49 e.g. /dev/null, /dev/full 2009-02-17 22:49 full? 
2009-02-17 22:49 wow 2009-02-17 22:49 tehre is one ;) 2009-02-17 22:49 :) 2009-02-17 22:50 so why do we have that and not /dev/fail? 2009-02-17 22:50 sh MAKEDEV 2009-02-17 22:51 if you still have one 2009-02-17 22:51 distros are starting to drop that 2009-02-17 22:51 yes 2009-02-17 22:51 I have it in /sbin now 2009-02-17 22:51 MAKEDEV needs parametor 2009-02-17 22:51 it used to be in /dev I think 2009-02-17 22:51 yes, some stupid parameter 2009-02-17 22:51 an "improvement" 2009-02-17 22:52 iirc, there is bind-mount /dev/ before udev 2009-02-17 22:52 I forgot where is on 2009-02-17 22:55 old /dev seems to be gone anymore 2009-02-17 22:55 debian MAKEDEV is now a huge script, with no help option 2009-02-17 22:56 ok, it seems "MAKEDEV std" 2009-02-17 22:56 in debian it is "generic" 2009-02-17 22:56 thanks for consistency guys ;) 2009-02-17 22:57 I'm also debian 2009-02-17 22:57 ah, generic seems right one 2009-02-17 22:57 man MAKEDEV is helpful 2009-02-17 22:59 well git is only 12% done, I guess I better do something smarter 2009-02-17 22:59 like copy my git tree 2009-02-17 22:59 probably, rsync would be faster 2009-02-17 23:00 tar/scp will be ok 2009-02-17 23:00 much faster than clone 2009-02-17 23:00 yes 2009-02-17 23:04 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-17 23:06 untarring 2009-02-17 23:06 tar scales very well 2009-02-17 23:06 untar is already done, on a 6 year old machine 2009-02-17 23:08 um..., it still freeze by rsyslogd 2009-02-17 23:10 rsyslogd? 2009-02-17 23:10 yes 2009-02-17 23:10 new syslog daemon via boot process 2009-02-17 23:11 ah, "Warning: unable to open an initial console." 
2009-02-17 23:12 it seems there is no /dev/console 2009-02-17 23:13 ah, ok 2009-02-17 23:13 we are not storing i_rdev yet 2009-02-17 23:13 ah 2009-02-17 23:15 our detection of fuse install is imperfect 2009-02-17 23:16 ok tux3/user built on test machine 2009-02-17 23:24 building 2009-02-17 23:25 ok, setting up a root fs 2009-02-17 23:26 need to have a tux3 to do that 2009-02-17 23:26 I don't think fuse can handle it right now 2009-02-18 00:17 http://userweb.kernel.org/~hirofumi/rdev-support.patch 2009-02-18 00:17 this seems to work for now 2009-02-18 00:17 dirty hack though 2009-02-18 00:18 checking 2009-02-18 00:18 just made a tux3 partition on my test machine 2009-02-18 00:19 magic number was changed with it 2009-02-18 00:19 yes, good 2009-02-18 00:19 we made a few changes where we should have updated the magic I think 2009-02-18 00:19 mostly adding fields 2009-02-18 00:20 ah, you removed the unused 2009-02-18 00:21 yes, well, probably we should improve it before commit to pulibc 2009-02-18 00:21 so, temporary patch, maybe 2009-02-18 00:21 anyway, now, it seems to boot 2009-02-18 00:21 "seems to" :-) 2009-02-18 00:21 fsck returns error 2009-02-18 00:22 that makes sense 2009-02-18 00:22 a small thing: I think we don't need the _BIT defines for attributes 2009-02-18 00:22 we can just write (1 << ) 2009-02-18 00:22 each is only used in one place I think 2009-02-18 00:23 maybe, I'm not checking it at all 2009-02-18 00:23 too small to waste time on 2009-02-18 00:23 ah, /etc/fstab is not changed on my tux3 root 2009-02-18 00:24 ah, so you got a panic? 2009-02-18 00:24 not panic, but read-only root 2009-02-18 00:24 that's nice 2009-02-18 00:26 so rdev is only allowed for IFCHR and IFBLK 2009-02-18 00:26 why is the assert disabled? 2009-02-18 00:26 rdev can be 0 for /dev/null or something 2009-02-18 00:27 ah, and you have if (rdev) 2009-02-18 00:27 yes 2009-02-18 00:27 maybe if (is a device) 2009-02-18 00:27 ? 
2009-02-18 00:27 it can 2009-02-18 00:28 dirty hack is fine now 2009-02-18 00:28 how did you notice rdev is missing? 2009-02-18 00:28 ls -l /dev/console 2009-02-18 00:28 it didn't have 5,1 2009-02-18 00:29 so we had to import the device number macros 2009-02-18 00:29 for userspace 2009-02-18 00:30 yes 2009-02-18 00:30 nice hack :) 2009-02-18 00:30 :) 2009-02-18 00:30 it's not even a hack really 2009-02-18 00:30 except for the if (rdev) 2009-02-18 00:31 + [RDEV_ATTR] = 8 <- 64 bit device number? 2009-02-18 00:31 and, I don't like to touch inode->i_rdev directy from iattr.c 2009-02-18 00:31 yes 2009-02-18 00:32 high-32bit is not used anywhere yet though 2009-02-18 00:32 it's fine 2009-02-18 00:32 well, just not thinking deeply 2009-02-18 00:32 what is wrong with touching device in iattr.c? 2009-02-18 00:32 um 2009-02-18 00:32 right 2009-02-18 00:33 somehow, I'd like to initialize inode->i_rdev by init_special_ 2009-02-18 00:33 well I like the style you have written here more 2009-02-18 00:33 and corrupted attr can be store it unexpectly 2009-02-18 00:34 it would be the same for imit_special_ 2009-02-18 00:34 well, more error checking could be done by a dedicated routine 2009-02-18 00:34 between decode_attr and init, we can check corruption if needed 2009-02-18 00:35 it is not worse than what we do in init_btree 2009-02-18 00:35 we just don't want to expose a broken inode 2009-02-18 00:35 well, yes 2009-02-18 00:36 I think, better to decode the inode, check for corruption, then clear the inode to an error state if something is wrong 2009-02-18 00:36 nobody can see it 2009-02-18 00:36 yet 2009-02-18 00:36 pulling 2009-02-18 00:36 oops 2009-02-18 00:36 nothing to pull yet :) 2009-02-18 00:37 well, it's good 2009-02-18 00:38 init initialize i_rdev, we know it 2009-02-18 00:38 however, someone may change i_rdev internal in future 2009-02-18 00:39 interface seems to intent it 2009-02-18 00:39 ? 
2009-02-18 00:39 init_special_inode() stores rdev parameter to inode->i_rdev 2009-02-18 00:40 that patch is using i_rdev directly in decode_attr() 2009-02-18 00:43 ah 2009-02-18 00:43 ah, another problem is fsync 2009-02-18 00:43 we are not supporting fsync yet 2009-02-18 00:44 what breaks? 2009-02-18 00:44 for now, nvi checks error code from fsync() 2009-02-18 00:44 so, it output error message without any problem 2009-02-18 00:45 right, this was reported by one of our users earlier 2009-02-18 00:45 editing a file on a fuse mount I think 2009-02-18 00:45 ok, booted as tux3 root 2009-02-18 00:45 well fsync can wait for atomic commit 2009-02-18 00:46 yes 2009-02-18 00:46 congratulations 2009-02-18 00:46 you made history 2009-02-18 00:46 it would not be big problem for now 2009-02-18 00:46 the reporters will be at your home in the morning 2009-02-18 00:46 :) 2009-02-18 00:46 :) 2009-02-18 00:47 ok, now how should I make my tux3 root partition 2009-02-18 00:47 ok, so, /dev, and /etc/fstab, and that patch seems to boot tux3 root 2009-02-18 00:47 I just copied it from another root 2009-02-18 00:48 cd /mnt && rsync -avHS -x / . 2009-02-18 00:48 /dev will be ok on most systems 2009-02-18 00:48 cd /mnt/dev 2009-02-18 00:48 ../sbin/MAKEDEV generic 2009-02-18 00:48 I can do cp -a 2009-02-18 00:48 vi /mnt/etc/fstab 2009-02-18 00:48 yes 2009-02-18 00:49 now let me see, I don't want it to cross to other partitions 2009-02-18 00:49 it would be needed exclude /proc /sys, etc. 2009-02-18 00:49 -x option is it 2009-02-18 00:50 cp is ... 2009-02-18 00:50 another thing I have done in the past is boot to my uml root fs 2009-02-18 00:50 it's nice to be able to boot to a 100MB linux system 2009-02-18 00:50 just need to fiddle with inittab 2009-02-18 00:51 yes 2009-02-18 00:54 hmm, I'm getting tracing output 2009-02-18 00:54 I thought we turned that off 2009-02-18 00:54 lookup inode... or something? 2009-02-18 00:54 resize inum... 
2009-02-18 00:55 yes 2009-02-18 00:55 printf 2009-02-18 00:55 yes 2009-02-18 00:55 ok, I will change those now 2009-02-18 00:55 ok 2009-02-18 00:55 or did you already do it? 2009-02-18 00:55 no 2009-02-18 00:55 I just ignore it 2009-02-18 00:56 btw, /sys/module/tux3/parameters/tux3_trace seems to work 2009-02-18 00:56 :) 2009-02-18 00:57 let me write down the time of your historic first boot 2009-02-18 00:57 do you have a timestamp? 2009-02-18 00:57 boot time? 2009-02-18 00:58 or now on JST? 2009-02-18 00:58 boot time 2009-02-18 00:58 JST 2009-02-18 00:59 17:59, 02/18, wed 2009-02-18 01:00 Wed Feb 18 18:00:06 JST 2009 2009-02-18 01:00 it goes in the next Tux3 Report 2009-02-18 01:00 good 2009-02-18 01:05 oops, I broke the kernel build 2009-02-18 01:05 ileaf.c doesn't know about trace 2009-02-18 01:09 ah 2009-02-18 01:09 #ifndef trace stuff 2009-02-18 01:09 #ifndef trace 2009-02-18 01:09 #define trace trace_on 2009-02-18 01:09 #endif 2009-02-18 01:09 <- in tux3.h 2009-02-18 01:09 good 2009-02-18 01:09 ah 2009-02-18 01:09 I will have to remove some explicit defines 2009-02-18 01:10 it may confuse userspace 2009-02-18 01:10 only for kernel 2009-02-18 01:10 it works 2009-02-18 01:10 ah, in tux3.h 2009-02-18 01:10 both build now 2009-02-18 01:13 booting to new improved tux3 2009-02-18 01:22 I kept rebooting to the wrong kernel, then I realized it is because I forgot to run lilo ;) 2009-02-18 01:22 grub is better in that way 2009-02-18 01:22 :) 2009-02-18 01:23 I was really forgetting about lilo, yes, it was reinstall with kernel change 2009-02-18 01:23 it needs to reinstall 2009-02-18 01:25 there are a few more traces to remove 2009-02-18 01:25 btree splits 2009-02-18 01:25 it is nice that they are rare 2009-02-18 01:26 and dump_attrs() too 2009-02-18 01:26 yes 2009-02-18 01:29 and I forgot to remove some \n in the last patch 2009-02-18 01:29 ah 2009-02-18 01:31 so dump_attrs is a problem 2009-02-18 01:31 because it doesn't want to write newlines 2009-02-18 01:32 it is 
same with leaf_dump()? 2009-02-18 01:32 we should probably not call it at all when trace is not trace_on 2009-02-18 01:32 if (tux3_trace) return; 2009-02-18 01:32 yes 2009-02-18 01:32 ah 2009-02-18 01:32 if (!tux3_trace) return; 2009-02-18 01:32 yes 2009-02-18 01:34 if (tux3_trace) 2009-02-18 01:34 dump_attrs(inode); 2009-02-18 01:34 will do for now 2009-02-18 01:34 ok 2009-02-18 01:38 I have never come up with a good solution for trace output without a newline 2009-02-18 01:38 I think it is good that trace supplies a newline by default 2009-02-18 01:39 but it would be nice to be able to suppress that sometimes 2009-02-18 01:39 well 2009-02-18 01:39 but not essential 2009-02-18 01:39 too bad cpp macro can't test for trace == trace_on 2009-02-18 01:43 ok, tracing is all off except for s_blocksize printout on mount 2009-02-18 01:43 I think that is ok 2009-02-18 01:43 it's nice to see something :) 2009-02-18 01:45 what is problem with trace()? 2009-02-18 01:45 what is problem of newline 2009-02-18 01:45 it has no way of generating output without a newline 2009-02-18 01:45 I never thought of a nice way to do that 2009-02-18 01:45 and it is unimportant, actually 2009-02-18 01:46 ah, and without line version 2009-02-18 01:46 right 2009-02-18 01:47 want without line version 2009-02-18 01:47 well, if you are doing that, you probably should just use printf 2009-02-18 01:47 trace does not have to be perfect :) 2009-02-18 01:48 well, if we are needing trace in future for kernel, probably, we want to use trace infrastructure 2009-02-18 01:48 ok, I'm doing time cp on /usr 2009-02-18 01:49 which trace infrastructure is that? 2009-02-18 01:49 linux/kernel/trace/* 2009-02-18 01:49 ah 2009-02-18 01:49 like ltt? 2009-02-18 01:49 or relayfs? 
2009-02-18 01:49 like ltt with relayfs 2009-02-18 01:50 it would be much faster than printk 2009-02-18 01:55 oops, we have a memory leak 2009-02-18 01:55 I oomed on the cp -ax /usr /mnt 2009-02-18 01:55 you must have much more memory than I do 2009-02-18 01:56 how many memory? 2009-02-18 01:56 it may be unlikely 2009-02-18 01:56 1 GB 2009-02-18 01:56 too many 2009-02-18 01:56 kvm has only 512M 2009-02-18 01:57 hmm 2009-02-18 01:57 hald and hald-runner were killed 2009-02-18 01:58 is there oom message? 2009-02-18 01:58 it has scrolled off 2009-02-18 01:59 yes 2009-02-18 01:59 kill process (hald) 2009-02-18 01:59 /var/log/kern.log? 2009-02-18 01:59 I have to reboot 2009-02-18 02:00 and of course I must remake the tux3 filesystem 2009-02-18 02:00 but this is a good test 2009-02-18 02:00 I will repeat the cp -a and watch the memory use 2009-02-18 02:00 yes 2009-02-18 02:00 I also try with cp -a 2009-02-18 02:01 what was your copy method? 2009-02-18 02:01 and how much did you copy? 2009-02-18 02:01 rsync -avHS -x 2009-02-18 02:01 1.2GB 2009-02-18 02:02 2.7 gb attempted here 2009-02-18 02:03 I will run the test once before sleeping 2009-02-18 02:04 running 2009-02-18 02:05 memfree has dropped to zero, expected 2009-02-18 02:05 102 MB in buffers 2009-02-18 02:06 I'll also try with 1GB memory 2009-02-18 02:07 I suppose I should also check slab usage 2009-02-18 02:07 just tux3 slab 2009-02-18 02:09 ah, inode creation may leak 2009-02-18 02:10 it tested a bit 2009-02-18 02:10 stoped in D state 2009-02-18 02:10 it is not tested much 2009-02-18 02:10 congestion_wait 2009-02-18 02:11 that piece of crap 2009-02-18 02:11 well 2009-02-18 02:11 sync 2009-02-18 02:11 ? 2009-02-18 02:11 then, echo 3 > /proc/sys/vm/drop_caches 2009-02-18 02:11 it will free caches 2009-02-18 02:12 system is hung 2009-02-18 02:13 when drop_caches? 2009-02-18 02:13 no, when switching console windows 2009-02-18 02:13 I guess it is oom 2009-02-18 02:13 running the cp as root 2009-02-18 02:13 um.. 
2009-02-18 02:13 I don't think root gets any more memory than other id 2009-02-18 02:14 proc/meminfo did not show the oom 2009-02-18 02:14 it seems to think it has lots of "cached" memory 2009-02-18 02:14 maybe we have leaked a use count 2009-02-18 02:14 well 2009-02-18 02:15 I will do a smaller cp 2009-02-18 02:15 and then the drop_caches 2009-02-18 02:16 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2009-02-18 02:16 we should call our inode cache tux3_inode_cache 2009-02-18 02:17 I will make that change now before I forget 2009-02-18 02:17 ah, yes 2009-02-18 02:18 cp -ax was done without problem 2009-02-18 02:19 I wonder what it is here 2009-02-18 02:20 kernel version? 2009-02-18 02:20 well, maybe, tux3 has bugs 2009-02-18 02:21 ah, big difference is arch, I'm using x86_64 2009-02-18 02:21 kernel version is 2.6.29-rc1 2009-02-18 02:21 ah 2009-02-18 02:22 btw, there is swap partition? 2009-02-18 02:22 there is 2009-02-18 02:23 swap size? 2009-02-18 02:23 just one GB 2009-02-18 02:23 for some reason 2009-02-18 02:23 ok 2009-02-18 02:23 maybe, I should try on i386 2009-02-18 02:24 my swap partition is actually 1.9 GB, odd 2009-02-18 02:25 ok 2009-02-18 02:26 I wonder if it is a highmem issue 2009-02-18 02:27 boot with mem=512M? 2009-02-18 02:27 good idea 2009-02-18 02:27 or compile with no highmem 2009-02-18 02:28 well in the last run I did not display all of proc/meminfo 2009-02-18 02:28 yes 2009-02-18 02:28 was using watch 2009-02-18 02:28 so I did not see the slab statistic 2009-02-18 02:29 I will try again 2009-02-18 02:29 without watch 2009-02-18 02:29 is there any debug message in kern.log? 2009-02-18 02:30 I will check 2009-02-18 02:30 slab is staying steady 2009-02-18 02:31 unreclaimable is staying steady, whatever that is 2009-02-18 02:31 dirty stays low 2009-02-18 02:31 very low actually 2009-02-18 02:32 it doesn't really look like a memory leak 2009-02-18 02:32 if kernel hang, any key doesn't have any responce? 
2009-02-18 02:32 slab at 67 MB 2009-02-18 02:33 magic sysrq works 2009-02-18 02:33 oh 2009-02-18 02:33 that is how I saw the congestion_wait 2009-02-18 02:33 disk problem? 2009-02-18 02:33 slab may be increasing 2009-02-18 02:34 up to 91 MB now 2009-02-18 02:39 cp -ax was done 2009-02-18 02:39 and 2009-02-18 02:39 free is 14560, buff is 136, cache is 1932, after drop_caches 2009-02-18 02:39 free is too small? 2009-02-18 02:40 ? 2009-02-18 02:40 folks 2009-02-18 02:40 those are output of vmstat 2009-02-18 02:40 a kernel panic in an unusual place 2009-02-18 02:41 which one? 2009-02-18 02:41 ah, I should use vmstat 2009-02-18 02:41 kmem_cache_alloc 2009-02-18 02:41 I wonder if it is a hardware issue 2009-02-18 02:41 seems unlikely 2009-02-18 02:41 can be tux3 bug 2009-02-18 02:41 this was after a rm -r 2009-02-18 02:42 the call trace does not make sense 2009-02-18 02:42 truncate may have race 2009-02-18 02:43 that should show up in fsx-linux 2009-02-18 02:43 there is another truncate point in delete_inode() 2009-02-18 02:45 one more try before sleeping 2009-02-18 02:46 ext3 recovered nicely, as I have come to expect :) 2009-02-18 02:46 oh 2009-02-18 02:46 there is no btree->lock in purge_inum() 2009-02-18 02:47 ah, that could be a problem 2009-02-18 02:47 but it does not explain the cp -a oom 2009-02-18 02:47 yes 2009-02-18 02:47 it would be different problem 2009-02-18 02:52 I don't see anythng unusual in meminfo 2009-02-18 02:52 oom was happened multiple times? 
2009-02-18 02:53 yes, 3 times 2009-02-18 02:53 I think it is oom 2009-02-18 02:53 it does not look like oom according to meminfo 2009-02-18 02:53 if it's oom, the kernel should be dump global memory info 2009-02-18 02:54 it did, the first time 2009-02-18 02:54 and hald was oom-killed 2009-02-18 02:54 well, maybe, purge_inum() fix will clear the problem more or less 2009-02-18 02:54 ah, i see 2009-02-18 02:55 http://kerneltrap.org/index.php?q=mailarchive/linux-fsdevel/2009/1/20/4774034 2009-02-18 02:55 I'm not running nfs on this machine 2009-02-18 02:56 http://userweb.kernel.org/~hirofumi/purge_inum-fix.patch 2009-02-18 02:56 but I did see init crash in credential code 2009-02-18 02:56 purge_inum fix 2009-02-18 02:56 looks good 2009-02-18 02:57 should I apply the patch or pull? 2009-02-18 02:57 please apply for now 2009-02-18 02:57 I'll push it after test 2009-02-18 02:57 ok, I will just apply on the test machine 2009-02-18 03:04 and one page leak is found 2009-02-18 03:06 where was that? 2009-02-18 03:07 in get_buffer() 2009-02-18 03:07 on filemap.c 2009-02-18 03:07 if !PageUptodate(), it is missing to page_cache_release() 2009-02-18 03:08 yes 2009-02-18 03:09 http://userweb.kernel.org/~hirofumi/page-leak-fix.patch 2009-02-18 03:09 this is for it 2009-02-18 03:10 building 2009-02-18 03:11 ok 2009-02-18 03:11 it seems to be the cause memory leak 2009-02-18 03:11 now vmstat seems sane 2009-02-18 03:12 I wonder why I did not notice the leaked pages 2009-02-18 03:13 it was temporary hack, so maybe it was less reviewed 2009-02-18 03:14 I mean, I wonder why I did not notice that in meminfo 2009-02-18 03:14 ah 2009-02-18 03:15 it can be unclear, at least, without drop_caches 2009-02-18 03:16 ok, cache is full now, I could interrupt and drop the caches 2009-02-18 03:16 I will do that 2009-02-18 03:16 or maybe... 
wait for this one to complete 2009-02-18 03:16 since it is late 2009-02-18 03:17 two fixes are now in static- 2009-02-18 03:17 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-02-18 03:17 please pull later 2009-02-18 03:17 now is fine 2009-02-18 03:17 they are both clearly right 2009-02-18 03:18 yes 2009-02-18 03:20 still running 2009-02-18 03:20 probably, I should test more more bigger on partition 2009-02-18 03:21 vmscan could maybe keep a statistic on how many pages have high use count 2009-02-18 03:21 at least as debug 2009-02-18 03:22 completed :) 2009-02-18 03:22 there are some memory debug in /proc by compile options 2009-02-18 03:22 that was it 2009-02-18 03:22 good :) 2009-02-18 03:22 fucking nice work :) 2009-02-18 03:23 thanks :) 2009-02-18 03:23 it seems there is /proc/kpagecount 2009-02-18 03:23 I don't know what is it though 2009-02-18 03:24 it is good that I am finally running tux3 on a real machine 2009-02-18 03:24 great! :) 2009-02-18 03:26 actually, a really simple statistic would be the difference between get_page and put_page 2009-02-18 03:26 if that climbs, we have trouble 2009-02-18 03:28 ah, yes 2009-02-18 03:28 oopsed on rm -r 2009-02-18 03:29 segfault 2009-02-18 03:29 no wait 2009-02-18 03:29 our assert 2009-02-18 03:29 ah, WARN_ON? 
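The get_page/put_page statistic suggested above can be sketched in userspace. This is only a toy model under stated assumptions: `toy_page`, `toy_get_page`, `toy_put_page` and `toy_page_balance` are hypothetical names, not kernel or tux3 code, and the "leaky" lookup merely imitates the shape of the bug described in the log (a reference taken, then an early return on the not-uptodate path without releasing it).

```c
#include <assert.h>

/* Hypothetical stand-in for a page; not the kernel's struct page. */
struct toy_page {
	int count;	/* reference count */
	int uptodate;	/* nonzero once the contents are valid */
};

/* Running difference between gets and puts; if it climbs, something leaks. */
static long toy_page_balance;

static void toy_get_page(struct toy_page *page)
{
	page->count++;
	toy_page_balance++;
}

static void toy_put_page(struct toy_page *page)
{
	assert(page->count > 0);
	page->count--;
	toy_page_balance--;
}

/* Leaky: the early return on a not-uptodate page keeps the reference. */
static int toy_lookup_leaky(struct toy_page *page)
{
	toy_get_page(page);
	if (!page->uptodate)
		return -1;	/* bug: missing toy_put_page() */
	toy_put_page(page);
	return 0;
}

/* Fixed: every exit path drops the reference it took. */
static int toy_lookup_fixed(struct toy_page *page)
{
	toy_get_page(page);
	if (!page->uptodate) {
		toy_put_page(page);
		return -1;
	}
	toy_put_page(page);
	return 0;
}

/* Run a lookup repeatedly on a never-uptodate page and report the balance. */
static long toy_run(int (*lookup)(struct toy_page *), int rounds)
{
	struct toy_page page = { 0, 0 };

	toy_page_balance = 0;
	for (int i = 0; i < rounds; i++)
		lookup(&page);
	return toy_page_balance;
}
```

With the leaky lookup the balance grows by one per failed call, while the fixed version stays at zero, which is the "if that climbs, we have trouble" signal in miniature.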
2009-02-18 03:29 ah, no 2009-02-18 03:29 no, it's ours 2009-02-18 03:30 "dleaf_groups(leaf) >= 1" 2009-02-18 03:30 dleaf_merge 2009-02-18 03:31 ah, I have also seen it before 2009-02-18 03:31 there is a bug in tree_chop or related stuff 2009-02-18 03:32 well that is for another day 2009-02-18 03:32 we are not supposed to be bug free now anyway, Andrew said :) 2009-02-18 03:32 performance is just supposed to be decent, which it is 2009-02-18 03:32 yes 2009-02-18 03:33 actually, performance seems fine considering we have not done any optimization at all 2009-02-18 03:33 no measurements to back that up of course 2009-02-18 03:34 yes 2009-02-18 03:34 there is no atomic commit though 2009-02-18 03:35 true 2009-02-18 03:35 we perform well now just as tux2 does ;) 2009-02-18 03:35 btw, is the above bug in the tree_chop path? 2009-02-18 03:35 yes 2009-02-18 03:36 ok 2009-02-18 03:36 tomorrow I will check if it repeats at the same place 2009-02-18 03:36 maybe, tree_chop is not tested well at all 2009-02-18 03:37 well, my system is usable still 2009-02-18 03:37 so maybe I will run another test 2009-02-18 03:37 well, umount is not possible 2009-02-18 03:37 yes 2009-02-18 03:39 ah 2009-02-18 03:39 bug is simple 2009-02-18 03:39 truncate(0) can remove all entries on a leaf 2009-02-18 03:40 and it becomes leafprev 2009-02-18 03:40 then we merge empty leafprev and leafbuf 2009-02-18 03:40 and hit assert() 2009-02-18 03:40 ah 2009-02-18 03:41 so, if vinto is empty, we can just copy it? 2009-02-18 03:41 so maybe not do the merge of empty leaf 2009-02-18 03:41 maybe 2009-02-18 03:41 yes 2009-02-18 03:41 it is the empty destination that causes the assert, right? 
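The "empty destination, so just copy" idea can be illustrated with a toy leaf. This is a sketch under assumptions, not the real dleaf layout: `toy_leaf`, `TOY_LEAF_MAX` and `toy_leaf_merge` are hypothetical, and the only thing carried over from the discussion is the shape of the fix, where an empty destination is handled by a whole-block copy before the general merge path asserts that it has something to merge into.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TOY_LEAF_MAX = 8 };

/* Hypothetical flat leaf: a count followed by fixed-size entries. */
struct toy_leaf {
	int count;
	int entry[TOY_LEAF_MAX];
};

/* Merge 'from' into 'into'. Returns 1 on success, 0 if it does not fit. */
static int toy_leaf_merge(struct toy_leaf *into, const struct toy_leaf *from)
{
	/* Empty destination: nothing to merge with, so copy the whole leaf. */
	if (into->count == 0) {
		memcpy(into, from, sizeof *into);
		return 1;
	}

	/* General path may now rely on a non-empty destination. */
	assert(into->count >= 1);
	if (into->count + from->count > TOY_LEAF_MAX)
		return 0;
	memcpy(into->entry + into->count, from->entry,
	       (size_t)from->count * sizeof from->entry[0]);
	into->count += from->count;
	return 1;
}

/* Merge a three-entry leaf into a destination holding 'start' entries. */
static int toy_merge_demo(int start)
{
	struct toy_leaf into = { 0, { 0 } };
	struct toy_leaf from = { 3, { 10, 11, 12 } };

	into.count = start;
	toy_leaf_merge(&into, &from);
	return into.count;
}
```

The early copy keeps the invariant check in the general path intact instead of weakening it to accept empty leaves.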
2009-02-18 03:42 yes 2009-02-18 03:43 yes, copy is right 2009-02-18 03:44 it is a small chance to optimize at a higher level, but not worth the extra code path 2009-02-18 03:44 yes 2009-02-18 03:44 and it may be not so easy 2009-02-18 03:44 because the buffer may have reference 2009-02-18 03:47 if (dleaf_groups(leaf) == 0) { 2009-02-18 03:47 memcpy(leaf, from, btree->sb->blocksize); 2009-02-18 03:47 return; 2009-02-18 03:47 } 2009-02-18 03:47 ? 2009-02-18 03:47 ah, maybe 2009-02-18 03:47 ->used and ->free 2009-02-18 03:47 it is lazy and can be optimized 2009-02-18 03:47 yes 2009-02-18 03:47 we can fix later 2009-02-18 03:50 I will check it in, I am sure you will change it use used/free ;) 2009-02-18 03:51 :) 2009-02-18 03:54 I guess we are not ever going to get rid of leaf used/free 2009-02-18 03:54 they are too useful 2009-02-18 03:54 yes 2009-02-18 03:54 probably 2009-02-18 03:56 I wonder how fsx-linux did not turn up those problems 2009-02-18 03:57 it can be only on big btree 2009-02-18 03:58 and may be truncate to 0 2009-02-18 03:59 probably, fsx-linux can hit with more bigger file 2009-02-18 03:59 it is nice that the complex code like tree_chop is pretty reliable 2009-02-18 03:59 that is because it was worked on for years 2009-02-18 03:59 in ddsnap 2009-02-18 03:59 actually, it never had bugs that I can remember 2009-02-18 04:00 after the initial implementation 2009-02-18 04:00 it is really scary code 2009-02-18 04:00 dleaf internal was changed 2009-02-18 04:00 yes 2009-02-18 04:00 so, I added the several bugs 2009-02-18 04:00 that doesn't count as a tree_chop bug 2009-02-18 04:00 but assertion was helped it luckly 2009-02-18 04:01 the assertions have been very good for us 2009-02-18 04:01 yes 2009-02-18 04:01 that reminds, I think I will change the output format slightly 2009-02-18 04:02 I adds assertion whether my thought is really right or not 2009-02-18 04:02 #define assert(expr) do { if (!(expr)) error("Failed assert(%s)", #expr); } while (0) 2009-02-18 
04:02 so, if my assumption was wrong, assertion should found it and tell me it 2009-02-18 04:03 yes, as I use it 2009-02-18 04:03 looks good 2009-02-18 04:03 I remember at my first linux kernel summit, I only spoke to Linus once 2009-02-18 04:03 he hated assert at that time 2009-02-18 04:04 I also hate it in production code 2009-02-18 04:04 I asked him if he thought that BUG_ON(!cond) is the same as assert(cond) 2009-02-18 04:04 he agreed it was, and after that did not object to asserts 2009-02-18 04:04 ah, yes 2009-02-18 04:05 what he objects to is, including part of the essential processing in the assert 2009-02-18 04:05 but, BUG_ON clear for me 2009-02-18 04:05 in other words, you should be able to define the assert as a noop and the code should still work 2009-02-18 04:05 a lot of asserted code does not 2009-02-18 04:05 yes 2009-02-18 04:06 the cp -a ran to completion again 2009-02-18 04:06 howver, on big product, some people will write complex assertion with noop 2009-02-18 04:06 so, I hate it on production code 2009-02-18 04:07 usually, I remove it before production with cleanup 2009-02-18 04:07 without noop 2009-02-18 04:08 e.g. assertion needs temporary value sometimes 2009-02-18 04:08 well, it doesn't matter 2009-02-18 04:08 assertion is much helpful 2009-02-18 04:09 it is more important 2009-02-18 04:10 hit the same assertion on rm 2009-02-18 04:10 um... 2009-02-18 04:10 I am pretty sure I ran with the right kernel 2009-02-18 04:10 bogus 2009-02-18 04:10 checking everything now 2009-02-18 04:10 assertion was removed? 
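The point about being able to define the assert as a noop can be shown with a small sketch. The macros below are hypothetical stand-ins: `toy_assert_on` is modeled on the macro quoted in the log, with `fprintf` substituted for `error()`, and `toy_assert_off` is the no-op form. The fragile function puts essential work inside the assertion, so that work vanishes when the check is compiled out.

```c
#include <assert.h>
#include <stdio.h>

/* Checking form, modeled on the macro quoted above; fprintf replaces error(). */
#define toy_assert_on(expr) \
	do { if (!(expr)) fprintf(stderr, "Failed assert(%s)\n", #expr); } while (0)

/* No-op form: the expression is never evaluated. */
#define toy_assert_off(expr) do { } while (0)

/* Fragile: the increment lives inside the assert and disappears with it. */
static int toy_count_fragile(void)
{
	int used = 0;

	toy_assert_off(++used <= 8);
	return used;
}

/* Robust: do the work unconditionally, then check the result. */
static int toy_count_robust(void)
{
	int used = 0;

	++used;
	toy_assert_off(used <= 8);
	return used;
}

/* Same fragile expression with checking enabled: it only works by accident. */
static int toy_count_checked(void)
{
	int used = 0;

	toy_assert_on(++used <= 8);
	return used;
}
```

The fragile version returns a different value depending on whether assertions are compiled in, which is the kind of asserted code that "does not still work" as a noop.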
2009-02-18 04:11 :) 2009-02-18 04:11 no 2009-02-18 04:11 :) 2009-02-18 04:11 ACTION wacks self 2009-02-18 04:11 it's late 2009-02-18 04:27 dleaf_merge() optimization 2009-02-18 04:27 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-02-18 04:27 please pull later 2009-02-18 04:27 probably, tomorrow 2009-02-18 04:27 today :) 2009-02-18 04:27 :) 2009-02-18 04:37 cpu usage for the cp -a is about 6-7% 2009-02-18 04:37 about 3-7% 2009-02-18 04:38 it sounds good 2009-02-18 04:38 I will compare to ext2 and ext3 later 2009-02-18 04:38 atomic commit looks like it will be pretty efficient 2009-02-18 04:38 maybe it is including reading source? 2009-02-18 04:38 maybe 2009-02-18 04:42 your improvement to merge can be 14 characters shorter ;) 2009-02-18 04:42 memcpy(vinto + used, vfrom + used, btree->sb->blocksize - used); 2009-02-18 04:43 thus fitting in 80 columns 2009-02-18 04:44 otherwise is exactly as I would have written it 2009-02-18 04:44 even the local variable, and the name of the local variable 2009-02-18 04:44 yes 2009-02-18 04:45 well, I'd like to use void, instead of vleaf 2009-02-18 04:45 vleaf tell me it is abstruction, and to_dleaf() convert it 2009-02-18 04:46 fine :) 2009-02-18 04:46 vinto is an ugly parameter name anyway 2009-02-18 04:46 maybe 2009-02-18 04:47 well, it clear enough though 2009-02-18 04:48 vleaf can be overkill 2009-02-18 04:49 not sure 2009-02-18 05:05 rm -r runs to completion 2009-02-18 05:06 good 2009-02-18 05:06 maybe, I also should run "racer" on big partition 2009-02-18 05:08 umount is taking an unusually long time 2009-02-18 05:08 120 seconds so far 2009-02-18 05:08 disk light is on 2009-02-18 05:08 not seeking much 2009-02-18 05:08 yes, it would be normal for ext2 like fs 2009-02-18 05:09 more than 2 minutes to flush 800 MB of cache seems too long 2009-02-18 05:10 ah 2009-02-18 05:11 looks like a loop 2009-02-18 05:12 it finished 2009-02-18 05:12 wow 2009-02-18 05:12 that was just wrong :) 2009-02-18 05:12 what was happened? 
2009-02-18 05:12 4 minutes to umount 2009-02-18 05:12 about 2009-02-18 05:13 um... 2009-02-18 05:13 it was in sync_inodes 2009-02-18 05:13 __writeback_single_inode 2009-02-18 05:13 it seems normal flush process 2009-02-18 05:14 ah, flush with sync mode? 2009-02-18 05:14 sync mode? 2009-02-18 05:14 hmm 2009-02-18 05:14 SYNC_* 2009-02-18 05:14 not an sync mount 2009-02-18 05:14 yes 2009-02-18 05:15 something is broken, I think it's not tux3 2009-02-18 05:15 however, maybe, vfs can be force sync 2009-02-18 05:15 for umount 2009-02-18 05:15 the flush algorithm is likely broken 2009-02-18 05:16 well 2009-02-18 05:16 we will not use that for very much longer 2009-02-18 05:17 it can be 2009-02-18 05:17 umount -> __fsync_super -> sync_inodes_sb(sb, 1) 2009-02-18 05:17 then sync_sb_inodes() with WB_SYNC_ALL 2009-02-18 05:18 we don't support write_inode with sync 2009-02-18 05:18 however, the pages will wait per inode 2009-02-18 05:18 Ext2 uses 2.5 - 4% cpu for the copy 2009-02-18 05:19 the pages will be waited per inode 2009-02-18 05:19 ah, and there are many inodes 2009-02-18 05:19 yes, it can be 2009-02-18 05:21 ah, ext2 sys time increases a little as it runs 2009-02-18 05:21 is now not much less than tux3 2009-02-18 05:21 interesting 2009-02-18 05:21 it is now 7% cpu 2009-02-18 05:22 now it is about the same as tux3 2009-02-18 05:22 that makes me happy :) 2009-02-18 05:22 :) 2009-02-18 05:23 I'm not tracking ext2 much, however it is interesting 2009-02-18 05:23 getting anywhere close to tux2 cpu usage is good 2009-02-18 05:23 now ext2 cpu has fallen again, and rose again 2009-02-18 05:23 funny 2009-02-18 05:24 it may be hitting seek artifacts 2009-02-18 05:24 that lower the cpu usage because of more seeking 2009-02-18 05:25 err, I meant close to ext2 cpu usage 2009-02-18 05:25 ah, and ext2 was only slightly faster than tux3 2009-02-18 05:26 i see 2009-02-18 05:26 well, it sounds good for now 2009-02-18 05:26 yes, it is fine 2009-02-18 05:27 it's good if we have 
regression test environment for atomic commit 2009-02-18 05:27 ext2 delete is dramatically faster 2009-02-18 05:27 oh 2009-02-18 05:27 and umount is only .7 seconds 2009-02-18 05:27 so we have something broken 2009-02-18 05:28 um.. 2009-02-18 05:28 delalloc may change it 2009-02-18 05:28 disabling delalloc 2009-02-18 05:29 we will find it 2009-02-18 05:29 the important thing is, Tux3 does the cp in 8m35s, Ext2 does it in 7m59s 2009-02-18 05:29 that is pretty close 2009-02-18 05:29 and we are doing some stupid things 2009-02-18 05:30 like creating a btree root and leaf for every file 2009-02-18 05:30 I am surprised it is that close 2009-02-18 05:30 yes 2009-02-18 05:30 tux3_get_block() should be slow than ext2 2009-02-18 05:30 yes, I thought it would be a lot slower 2009-02-18 05:30 but it isn't 2009-02-18 05:31 and we haven't optimized it 2009-02-18 05:31 yes 2009-02-18 05:31 it is not using writepages, and one block per extent 2009-02-18 05:31 the delete issue is a bug 2009-02-18 05:31 we will find it 2009-02-18 05:32 delete is too slow? 2009-02-18 05:32 ext2 deletes many times faster 2009-02-18 05:32 and umounts 200 times faster 2009-02-18 05:32 i see 2009-02-18 05:33 well it is not a big concern, we do need to share some debugging with the community :) 2009-02-18 05:33 why should we have all the fun? 
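To put "pretty close" into numbers, the arithmetic can be worked through in a few lines. The helper names are hypothetical; the inputs are the figures quoted in the log (cp in 8m35s versus 7m59s, umount of about 4 minutes versus .7 seconds), and the umount figure is only as good as that "about".

```c
#include <assert.h>

/* Convert a minutes:seconds reading to seconds. */
static long toy_seconds(long minutes, long seconds)
{
	return minutes * 60 + seconds;
}

/* How much slower 'slow' is than 'fast', in tenths of a percent, truncated. */
static long toy_slowdown_permille(long slow, long fast)
{
	assert(fast > 0);
	return (slow - fast) * 1000 / fast;
}

/* cp timings from the log: Tux3 8m35s (515 s), Ext2 7m59s (479 s). */
static long toy_cp_slowdown_permille(void)
{
	return toy_slowdown_permille(toy_seconds(8, 35), toy_seconds(7, 59));
}

/* umount: roughly 4 minutes against 0.7 s, computed as 2400 / 7, truncated. */
static long toy_umount_ratio(void)
{
	return toy_seconds(4, 0) * 10 / 7;
}
```

By these figures the copy is about 7.5% slower on Tux3, while the umount gap works out to roughly 340 times if the 4 minute estimate is taken at face value, so the "200 times" in the log is, if anything, conservative.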
2009-02-18 05:33 :) 2009-02-18 05:38 oyasumi 2009-02-18 05:38 oyasumi 2009-02-18 06:06 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-18 06:10 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-02-18 10:50 -!- RazvanM_(~RazvanM@96.234.239.183) has joined #tux3 2009-02-18 11:11 -!- RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3 2009-02-18 11:21 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-18 12:58 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-18 14:16 flips: http://bsd.slashdot.org/article.pl?sid=09/02/18/2036229 2009-02-18 14:29 hi bh 2009-02-18 15:28 flips: it's going to really boost dfbsd 2009-02-18 15:28 matt's probably the best engineer out of the entire group of bsds on the planet 2009-02-18 15:28 well it should 2009-02-18 15:28 I should tag onto the slashdot thread 2009-02-18 15:28 ignore /. 2009-02-18 15:28 it's useless 2009-02-18 15:28 Not really 2009-02-18 15:29 you're better off just getting atomic commits fully working and stuff 2009-02-18 15:29 let others advocate this technology 2009-02-18 15:29 The last post I made to slashdot resulted in 10,000 hits on tux3.org the same day 2009-02-18 15:29 oh really ? 2009-02-18 15:29 really 2009-02-18 15:29 well, folks know about tux3 already 2009-02-18 15:29 10,000 didn't 2009-02-18 15:30 well, what do you estimate it will be this time ? 
2009-02-18 15:30 by the way, Hirofumi booted tux3 on root yesterday 2009-02-18 15:30 nice 2009-02-18 15:30 I will hopefully today (still preparing my root fs) 2009-02-18 15:30 I haven't been tracking tux3 development, been coding up EDF 2009-02-18 15:31 it's rather complicated with changes needed to the process model under priority inheritence 2009-02-18 15:31 The only reason I would take onto the df article is to help support matt 2009-02-18 15:31 tag onto I mean 2009-02-18 15:32 might be at least a couple of months of work there since I'm relatively inexperienced doing this kind of work 2009-02-18 15:32 ok 2009-02-18 15:32 ok, good 2009-02-18 15:32 yeah, the EDF work is coming out pretty smoothly, I rarely get to code up something new these days so the skillset is a bit rusty 2009-02-18 15:33 the good thing about this is that I've really takening the time to understand as much detail in the scheduler code as possibly and I'm much more familiar and confident with it 2009-02-18 15:33 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2009-02-18 15:33 I'm pretty comfortable with it and the corner cases is has to fix now 2009-02-18 15:33 including the wake up code, requeuing code, hrtimers, etc... 
2009-02-18 15:34 nice, dfbsd is maintained with git now instead of CVS 2009-02-18 15:35 there was too much politics in the BSDs regarding CVS commits 2009-02-18 15:35 that changes things to a more Linux like distributed model 2009-02-18 15:51 so Linus helped BSD, it usually goes the other way 2009-02-18 15:51 plus, Git kicks the tail of CVS in every way 2009-02-18 15:52 Fortunately, cvs is just a bad memory for me now 2009-02-18 15:52 It's sad that sourceware.org, red hat's attempt at a community site, still uses cvs 2009-02-18 15:53 and you say, going out of your way to become irrelevant 2009-02-18 15:58 well I wonder if systemtap would be a good thing for further tux3 debugging 2009-02-18 15:58 like the umount issue from yesterday 2009-02-18 15:58 3 minutes to umount 2009-02-18 15:58 at least it did it :) 2009-02-18 15:59 sigh, I'd better connect a crossover cable 2009-02-18 16:08 yeah, I agree 2009-02-18 16:08 CVS must die 2009-02-18 17:33 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2009-02-18 18:02 mounted root 2009-02-18 18:02 on tux3 2009-02-18 18:02 need to get the rdev fix 2009-02-18 19:06 have to make the rootfs again, this time I made a script 2009-02-18 19:06 which will be posted 2009-02-18 19:06 everybody with the energy to get linux from linus's git tree, please raise your hand 2009-02-18 19:06 ...thought so 2009-02-18 19:07 well I will make detailed instructions for booting tux3 on a real machine 2009-02-18 19:07 as root 2009-02-18 19:07 rootfs 2009-02-18 19:25 takes ten minutes to copy my rootfs over to tux3 each try 2009-02-18 19:25 and I need a few more tries than Hirofumi 2009-02-18 19:26 the super magic works really well 2009-02-18 19:26 just caught a mismatch between kernel and user/tux3 mkfs 2009-02-18 19:36 mounted tux3 as rootfs 2009-02-18 19:36 :) 2009-02-18 19:37 ACTION hands hirofumi a large glass of sake 2009-02-18 19:37 ACTION goes to get a large glass of sake for himself 2009-02-18 19:38 booted X 2009-02-18 19:38 pinged 
yahoo 2009-02-18 19:39 booted kde 2009-02-18 19:41 apt-get install fails because msync returns EINVAL 2009-02-18 19:41 it should be ok to make it return success, but not do anything 2009-02-18 19:52 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-18 19:53 hey tim_dimm 2009-02-18 19:54 yo 2009-02-18 19:54 how's it going? 2009-02-18 19:55 -!- tux3root(~root@phunq.net) has joined #tux3 2009-02-18 19:55 ACTION <- running tux3 as rootfs 2009-02-18 19:55 nice 2009-02-18 19:56 time for that large glass of sake 2009-02-18 19:56 right on 2009-02-18 20:01 also did apt-get install of xchat onto tux3 rootfs 2009-02-18 20:01 apt-get is picky about msync/fsync not returning error 2009-02-18 20:01 I lied to it though, and apt believed me 2009-02-18 20:10 128,292 files in /usr 2009-02-18 20:15 grepping for foo in /usr 2009-02-18 20:16 -!- amey(~amey@117.195.32.51) has joined #tux3 2009-02-18 20:28 grep seemed to work 2009-02-18 20:29 so, next thing is to see if we can boot from tux3 partition 2009-02-18 20:47 ok, lilo can create a bootable kernel hosted by a tux3 filesystem 2009-02-18 20:54 -!- tux3(~root@phunq.net) has joined #tux3 2009-02-18 20:55 ok, that worked 2009-02-18 20:55 booted tux3 without help from an ext* boot partition 2009-02-18 20:55 lilo 2009-02-18 20:56 doing this with grub would be a little harder 2009-02-18 20:56 time for a new Tux3 Report 2009-02-18 21:25 flips: good to hear that a tux3 in running on a live file system 2009-02-18 21:25 root file system 2009-02-18 21:25 indeed 2009-02-18 22:07 http://lkml.org/lkml/2009/2/19/13 <- latest tux3 report 2009-02-18 22:10 -!- macan(~macan@159.226.41.137) has joined #tux3 2009-02-18 22:44 -!- RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3 2009-02-19 00:14 -!- gebi(~gebi@84-119-42-89.dynamic.xdsl-line.inode.at) has joined #tux3 2009-02-19 00:14 morning :) 2009-02-19 00:14 hi 2009-02-19 00:14 or should I say, guten Morgen 2009-02-19 00:18 both is ok ;) 
2009-02-19 00:31 Ich sage jetzt, zzzz 2009-02-19 02:37 flips: good job 2009-02-19 02:37 just read the post 2009-02-19 03:24 -!- cdk(~chinmay@115.109.14.217) has joined #tux3 2009-02-19 07:46 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-19 09:01 -!- amey(~amey@117.195.42.58) has joined #tux3 2009-02-19 09:10 -!- cdk(~chinmay@115.109.14.217) has joined #tux3 2009-02-19 09:11 -!- gaurav(~gaurav@59.95.30.190) has joined #tux3 2009-02-19 09:15 hi flips 2009-02-19 09:25 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-19 10:12 -!- kushal(~kushal@115.109.12.97) has joined #tux3 2009-02-19 11:21 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-19 13:10 -!- dcg(~dcg@104.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-02-19 13:44 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-19 18:38 hi 2009-02-19 19:22 hi hirofumi 2009-02-19 19:22 still there? 2009-02-19 19:23 hi 2009-02-19 19:23 I'm getting my ancient vaio ready to run tux3 2009-02-19 19:23 this is fun 2009-02-19 19:23 -!- macana(~macan@159.226.41.137) has joined #tux3 2009-02-19 19:23 oh 2009-02-19 19:24 pcg-f540 2009-02-19 19:24 let's see how old that is 2009-02-19 19:25 1999 2009-02-19 19:25 ten years old 2009-02-19 19:25 tux3 has not crashed on me yet, with about 6 hours of usage including running KDE 2009-02-19 19:25 warns on truncate 2009-02-19 19:26 yes 2009-02-19 19:27 good 2009-02-19 19:28 flips: you probably need a exercising swuite 2009-02-19 19:28 suite 2009-02-19 19:28 and exercising users to run the suite 2009-02-19 19:28 right 2009-02-19 19:28 hirofumi uses fsx-linux 2009-02-19 19:29 nfs traffic should be interesting 2009-02-19 19:30 yes, tux3 should be very good at sync writes when atomic commit is working 2009-02-19 19:30 it's already doing prett well as is 2009-02-19 19:31 just about as fast as ext2 according to your post 2009-02-19 19:31 which is great, just wait until extents come into 
play 2009-02-19 19:31 all of that will be contiguous 2009-02-19 19:33 hah, the f540 has no built-in ethernet 2009-02-19 19:33 this is going to be interesting 2009-02-19 19:33 have to find the old adapter 2009-02-19 19:33 pcmcia 2009-02-19 19:33 gas prices are going up again, cheap drives to LA aren't that cheap any more 2009-02-19 21:30 hey, something fun to think about 2009-02-19 21:30 for an alpha geek 2009-02-19 21:31 consider a tux3 file used as a volume 2009-02-19 21:31 that is, mount -oloop 2009-02-19 21:31 and consider that we would like to have a bio submitted to that loopback mounted file proceed through our filesystem as smoothly as possible 2009-02-19 21:32 what do we need to do? 2009-02-19 21:32 this is a very important and common situation 2009-02-19 21:32 it would be more common if it worked well 2009-02-19 21:32 but as it is, it tends to be somewhat inefficient, and very deadlock prone 2009-02-19 22:01 flips: can you get tux3 to be faster than ext2 after some further development ? 2009-02-19 22:02 or is ext2 pretty much da bomb when it comes to this kind of file io ? 2009-02-19 22:04 bh, I think so 2009-02-19 22:05 tux3 does not need to keep seeking to far away places to update bitmaps 2009-02-19 22:05 it can log the allocations at the write point instead 2009-02-19 22:07 iirc, /dev/loop is using ->readpage, ->write_begin, and ->write_end 2009-02-19 22:07 so, I guess it doesn't have big problem 2009-02-19 22:09 yes, it seems to use ->write_begin and ->write_end in do_lo_send_aops() 2009-02-19 22:09 I will read that code 2009-02-19 22:10 will btrfs be fast in this way ? 
2009-02-19 22:10 stupid noob question :) 2009-02-19 22:10 ah, I see 2009-02-19 22:11 bh, btrfs is intelligently designed in that regard 2009-02-19 22:11 it benchmarks pretty well 2009-02-19 22:11 so does Ext4 2009-02-19 22:11 but tux3 should get close to optimum 2009-02-19 22:11 eventually, when everything is working 2009-02-19 22:12 it is already writing at a very reasonable speed in the few tests I have done 2009-02-19 22:12 oh, read side is interesting, it's using splice 2009-02-19 22:15 wow 2009-02-19 22:15 high tech 2009-02-19 22:15 that must be clostly 2009-02-19 22:16 last time I looked, splice used a very low bandwidth interface 2009-02-19 22:16 page at a time 2009-02-19 22:16 whoops, I forgot to run lilo after updated the kernel file again ;) 2009-02-19 22:16 grub does have advantages 2009-02-19 22:17 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-19 22:24 flips: too to hear 2009-02-19 22:24 -!- macana(~macan@159.226.41.137) has joined #tux3 2009-02-19 22:31 umm... fork may have race with splice's ->steal() 2009-02-19 22:32 however, ->steal() user is not found 2009-02-19 22:32 well, probably, we have to write own splice handlers 2009-02-19 22:35 splice is a new thing, need to study it more 2009-02-19 22:35 ah, ->steal users was removed with some reasom 2009-02-19 22:35 steal sounds scary just from its name :) 2009-02-19 22:35 :) 2009-02-19 22:36 well, some things that matter more for the next few days... 
2009-02-19 22:36 iirc, ->steal is grab page from one radix-tree, and insert it to another one 2009-02-19 22:36 I will be spending almost all the time getting ready for the tux3 presentation at scale 2009-02-19 22:36 ah 2009-02-19 22:36 something similar is also used for page migration 2009-02-19 22:37 yes 2009-02-19 22:40 some idiot really broke libc 2009-02-19 22:40 now requires kernel 2.6 2009-02-19 22:40 and debian will not complete an update unless libc6 is installed 2009-02-19 22:41 lots of people putting making their own stupid contributions to one big stupid result 2009-02-19 22:41 ACTION is annoyed that the vaio upgrade has failed because of those things 2009-02-19 22:41 flips: good to hear I meant 2009-02-19 22:41 they aren't really stupid, to tell the truth 2009-02-19 22:42 just idiotic 2009-02-19 22:42 ACTION can't wait to get at tux3 development 2009-02-19 22:42 it's fun 2009-02-19 22:42 it'll be months before I can though for reasons I outlined before 2009-02-19 22:42 yeah, I know. 
I'm busting out with some wicked scheduler code 2009-02-19 22:43 now I suppose if I want to make any progress I have to pull the hard disk a put it in a machine that can get on the network 2009-02-19 22:43 bleah 2009-02-19 22:43 relatively small but it's going to be powerful, the KVM folks should use htis 2009-02-19 22:43 this 2009-02-19 22:43 oh, and pcmcia is one big piece of idiocy :) 2009-02-19 22:43 bh, cool 2009-02-19 22:44 this laptop is so old it has a built in modem but not an ethernet 2009-02-19 22:46 ok, I will now test my skill at apt-get update in chroot 2009-02-19 22:46 I wonder if it works 2009-02-19 22:47 it should work with chroot /proc, /sys, and some /dev/* 2009-02-19 22:48 my plan is to mount the laptop hd on /mnt, chroot /mnt, then attempt to apt-get install 2009-02-19 22:56 this time I will not screw the hard disk in again, in case I have to repeat the process a few times 2009-02-19 22:57 getting the harddrive out a vaio is pretty tricky 2009-02-19 22:58 it isn't really meant to be user servicable 2009-02-19 23:00 ok, it's out and in the shuttle 2009-02-19 23:00 much faster than the first time earlier today 2009-02-19 23:03 I saw 60 MB/sec copy between two sata disks in the old shuttle, earlier today 2009-02-19 23:03 dd with bs=1M 2009-02-19 23:03 I was impressed 2009-02-19 23:03 not far from media speed 2009-02-19 23:05 ok, apt-get install is working in the chroot 2009-02-19 23:05 nice 2009-02-19 23:05 didn't know you could do that 2009-02-19 23:19 ok, back in the vaio it goes 2009-02-19 23:20 I must say, this is the most challenging computer rescue I've done 2009-02-19 23:20 part of the process involved pliers 2009-02-19 23:23 damm, forget to run lilo again ;) 2009-02-19 23:23 well that does not require a network 2009-02-19 23:23 the hard disk can stay in 2009-02-19 23:26 no it can't, I don't have a bootable 2.6 kernel 2009-02-19 23:26 out it comes 2009-02-19 23:27 :p 2009-02-19 23:28 lilo done, back into the vaio 2009-02-19 23:28 very easy 
to bend pata pins while doing this 2009-02-19 23:28 such ancient stuff 2009-02-19 23:29 this laptop is so old the rubber feet have turned back into petroleum 2009-02-19 23:41 -!- RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3 2009-02-19 23:47 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-20 00:02 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-02-20 00:02 -!- flips(~phillips@phunq.net) has joined #tux3 2009-02-20 00:02 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-02-20 00:02 -!- vomjom(~vomjom@99-157-248-71.lightspeed.stlsmo.sbcglobal.net) has joined #tux3 2009-02-20 00:37 googlebot indexing the mail archives now 2009-02-20 00:38 my download speed drops by 2/3rds when there is a bot snooping around 2009-02-20 00:45 Tux3 Report is the 3rd most popular lkml message on lkml.org: http://lkml.org/ 2009-02-20 00:46 I suppose that mainly means that people like to read about filesystems 2009-02-20 00:46 probably has something to do with everything in unix is a filesystem 2009-02-20 00:57 oh 2009-02-20 00:57 I found one article 2009-02-20 00:57 http://www.linux-magazin.de/meldung/34306 2009-02-20 00:57 heh 2009-02-20 00:57 those guys know me 2009-02-20 00:57 from when I lived in germany 2009-02-20 00:58 one of the guys I worked with is now an editor for linux magazine 2009-02-20 00:58 oh 2009-02-20 00:58 it says, phillips and hirofumi booted linux on a tux3 filesystem in the last few days 2009-02-20 00:59 you made news in germany :) 2009-02-20 00:59 is it the first time? 
probably not, because they are also interested in fatfs there 2009-02-20 00:59 :) 2009-02-20 01:00 they tell the story in detail, how the root filesystem was copied 2009-02-20 01:00 that's why I like Germans 2009-02-20 01:00 they are interested in every detail 2009-02-20 01:01 i see 2009-02-20 01:02 it says, after a little fiddling phillips was able to run apt-get install 2009-02-20 01:02 and it says, for now the computer ran on tux3 without a problem 2009-02-20 01:03 it seems that google translate is not so bad 2009-02-20 01:03 it tells about the upcoming SCALE talk, and it notes that tux3 has no crash recovery right now 2009-02-20 01:03 :) 2009-02-20 01:04 and it says how we will implement versioning during the review process 2009-02-20 01:04 well, I hope my translation was as good as google's ;) 2009-02-20 01:05 I guess it is much better than google's :) 2009-02-20 01:29 -!- dagle1(~dagle@host162-104.bornet.net) has joined #tux3 2009-02-20 02:00 flips: what about the /dev directory ? is that fine as well ? 2009-02-20 02:21 bh, /dev is fine, thanks to hirofumi's patch from two days ago 2009-02-20 02:21 this path is not in the repository 2009-02-20 02:21 this patch I mean 2009-02-20 02:49 I tink it nice 2009-02-20 02:49 nice 2009-02-20 02:49 bah 2009-02-20 02:49 garbage on the entry line 2009-02-20 02:49 well, that's a huge step to have it be a live file system 2009-02-20 02:49 hopefully, more developers will come on board because of it 2009-02-20 02:50 it is quite likely 2009-02-20 02:50 yeah, less of an appearance of tux3 being vaporware 2009-02-20 02:50 which is always important when trying to attract help 2009-02-20 02:54 this vaporware connected to irc yesterday ;) 2009-02-20 02:55 hm.. every new distro uses tmpfs for /dev 2009-02-20 02:55 nice :0 2009-02-20 02:55 :) 2009-02-20 02:56 the more exposure you get the better the chance to discover and fix bugs. 
I'm pretty impressed with how low the bug count is at this moment, there aren't that many wierd corner cases hit so far which could be either a good or bad sign :) 2009-02-20 02:57 running it as a root file system will definitely help 2009-02-20 03:32 the bug count has in fact been unusually low 2009-02-20 03:32 this is for a number of reasons, hirofumi is one of the reasons 2009-02-20 03:32 and the method of developing mainly in user space is a big reason 2009-02-20 03:33 also, the structure is fairly simple, where there is complexity is is largely local 2009-02-20 03:34 the unit tests are probably the biggest reason for the low bug count 2009-02-20 06:46 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-20 07:37 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-20 08:13 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-20 09:18 -!- gaurav(~gaurav@59.95.7.96) has joined #tux3 2009-02-20 09:44 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-20 11:37 flips: unit testing is critical 2009-02-20 11:38 there's no reason to write a chunk of software like this without it 2009-02-20 12:25 -!- cdk(~chinmay@115.109.14.217) has joined #tux3 2009-02-20 13:43 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-02-20 15:42 it just occurred to me that a variant of advance that always advances to the next block and returns, as opposed to advancing to the next leaf block, would be useful for writing some utilities 2009-02-20 15:42 that is, it would return after doing a single push, if a push is required 2009-02-20 15:42 for symmetry, return after a single pop 2009-02-20 15:44 return a result code: 1 = pushed, but not at leaf; 0 = at leaf; -1 = popped; -2 = finished (something like that) 2009-02-20 15:53 how difficult would it be to write a program that i'd invoke with 'superstat file1' and it would spit out number of all the blocks that it occupies? 
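The single-step advance proposed above, with its result codes, might look something like the sketch below. Everything here is hypothetical: the `toy_node` tree, the `toy_cursor` path stack and `toy_step` are illustrative and are not the tux3 cursor code. Only the result codes (1 = pushed, 0 = at leaf, -1 = popped, -2 = finished) follow the message in the log, and the root is assumed to be an interior node.

```c
#include <assert.h>

enum { TOY_FANOUT = 4, TOY_MAX_DEPTH = 8 };
enum { TOY_PUSHED = 1, TOY_AT_LEAF = 0, TOY_POPPED = -1, TOY_FINISHED = -2 };

/* Hypothetical tree node: count == 0 marks a leaf. */
struct toy_node {
	int count;
	struct toy_node *child[TOY_FANOUT];
};

/* One level of the cursor path: a node and the next child to visit. */
struct toy_level {
	struct toy_node *node;
	int next;
};

struct toy_cursor {
	struct toy_level path[TOY_MAX_DEPTH];
	int depth;	/* number of levels currently on the path */
};

static void toy_cursor_init(struct toy_cursor *cursor, struct toy_node *root)
{
	cursor->path[0] = (struct toy_level){ root, 0 };
	cursor->depth = 1;
}

/* Do exactly one push or one pop, then return a result code. */
static int toy_step(struct toy_cursor *cursor)
{
	if (cursor->depth == 0)
		return TOY_FINISHED;

	struct toy_level *top = &cursor->path[cursor->depth - 1];

	if (top->next < top->node->count) {
		struct toy_node *child = top->node->child[top->next++];

		assert(cursor->depth < TOY_MAX_DEPTH);
		cursor->path[cursor->depth++] = (struct toy_level){ child, 0 };
		return child->count ? TOY_PUSHED : TOY_AT_LEAF;
	}
	cursor->depth--;
	return cursor->depth ? TOY_POPPED : TOY_FINISHED;
}

/* Walk a small tree (root -> {a -> {l1, l2}, l3}) and count the leaves. */
static int toy_demo_leaf_count(void)
{
	struct toy_node l1 = { 0, { 0 } }, l2 = { 0, { 0 } }, l3 = { 0, { 0 } };
	struct toy_node a = { 2, { &l1, &l2 } };
	struct toy_node root = { 2, { &a, &l3 } };
	struct toy_cursor cursor;
	int leaves = 0, code;

	toy_cursor_init(&cursor, &root);
	while ((code = toy_step(&cursor)) != TOY_FINISHED)
		if (code == TOY_AT_LEAF)
			leaves++;
	return leaves;
}
```

Because every call does at most one push or pop, a utility that wants to visit each block of a file, rather than only the leaves, could act on every return code instead of only on the at-leaf one.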
2009-02-20 15:54 not difficult 2009-02-20 15:55 easiest is to add it to tux3.c as an additional command 2009-02-20 15:55 i.e. "tux3 marcindump " 2009-02-20 16:28 flips: another test would be to fill the file system near capacity to see how the allocation and speed goes 2009-02-20 16:28 fitting all of those blocks and stuff would then be harder 2009-02-20 16:32 I already know how it will go, it will suck 2009-02-20 16:32 tux3 has no allocation policy yet 2009-02-20 16:36 yea i know that's why i wanna approach it right at the beginning 2009-02-20 16:38 right after scale for me 2009-02-20 16:38 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-20 16:39 sk8 oclock 2009-02-20 16:39 ok 2009-02-20 17:38 -!- samlh(~sam@67.129.121.145) has joined #tux3 2009-02-20 18:15 -!- samlh(~sam@67.129.121.145) has joined #tux3 2009-02-20 18:53 http://www.experts-exchange.com/Storage/Hard_Drives/Q_24159694.html <- in spite of intel's denial 2009-02-20 19:05 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-02-20 19:09 ACTION hands marcin a trophy for being the 3rd person in history to run tux3 on a real machine 2009-02-20 19:23 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-20 19:42 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-02-20 19:49 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-02-20 19:52 hirofumi, there? 2009-02-20 19:54 ok so how would we go about tracing what blew it up?
2009-02-20 19:54 a lockup without any oops, assert or panic can be tricky 2009-02-20 19:54 best is to see how to reproduce it 2009-02-20 19:55 and then we try to reproduce 2009-02-20 19:55 ok if my irc freezes again, that means i was sucessful ;) 2009-02-20 19:55 then characterize the sequence of events and figure out how to narrow it down 2009-02-20 19:55 well, and think about a dedicated test box 2009-02-20 19:55 surely you have one sitting around 2009-02-20 19:55 an old junker 2009-02-20 19:55 i do have a server laying around, real scsi goodness 2009-02-20 19:56 but that thing is LOUD 2009-02-20 19:56 art is hard ;) 2009-02-20 19:56 i only turn it on if i absolutely have to, or just buy bigger speakers ;) 2009-02-20 19:56 let it run in the garage 2009-02-20 19:57 I have one of those by the way 2009-02-20 19:57 well another way to go, is put out the call to your friends for an old windows box that now gathers dust because latest windows can't handle it 2009-02-20 19:58 ok, time to put the upgraded hg back in the ancient vaio 2009-02-20 19:58 everything new and spiffy now 2009-02-20 19:58 oh, first I should verify I can compile the kernel in the chroot 2009-02-20 20:00 i've been reading about all kinds of wacky container mechanisms in linux lately 2009-02-20 20:01 i think it was on ibm linux dev's site 2009-02-20 20:01 chroot is working great for me 2009-02-20 20:02 actually how would you set up a test box for something that blows up kernels? is chroot enough? wouldnt you want like a vm server? 
2009-02-20 20:02 apt-get upgraded a sid system that had already fallen into disrepair 3 years ago 2009-02-20 20:02 kvm is good 2009-02-20 20:03 hirofumi uses that 2009-02-20 20:03 qemu also 2009-02-20 20:03 or uml 2009-02-20 20:03 actually, somebody posted changes to our root filesystem to support that, and I haven't incorporated them as promised 2009-02-20 20:03 i've been playing with vmserver2.0 lately, poor man's esx but with much better hwd support 2009-02-20 20:04 need to get off my butt 2009-02-20 20:04 hmm, vmserver, an oracle fork of some other vm project? 2009-02-20 20:05 vmware server 2009-02-20 20:05 2 is VERY different from 1 2009-02-20 20:05 ah yes, also good 2009-02-20 20:05 esx? 2009-02-20 20:06 esx is the enterprise vmware 2009-02-20 20:06 full hypervisor, remote machine migration, storage on nas/san, etc 2009-02-20 20:07 uberleet 2009-02-20 20:08 hey, i just remounted the tux3 partition that crashed, and the files are still there 2009-02-20 20:08 no need to remake stuff 2009-02-20 20:09 nope, i take that back 2009-02-20 20:09 i ran du -sh on it and it segfaulted 2009-02-20 20:09 I'd be very surprised 2009-02-20 20:09 if it survived 2009-02-20 20:10 well this is where we start thinking about tux3fsck 2009-02-20 20:10 or tux3 fsck 2009-02-20 20:10 hm, now i cannot unmount it :/ 2009-02-20 20:11 also no surprise 2009-02-20 20:11 reboot 2009-02-20 20:11 and do tux3 mkfs 2009-02-20 20:11 linux is like that, everything has to run to completion or things get borked 2009-02-20 20:12 3 years from now it will probably be more tolerant of broken kernel code 2009-02-20 20:12 well 2009-02-20 20:12 another thing you can do is just hide the mount point 2009-02-20 20:12 marcin /mnt $ lsof +D tux3/ 2009-02-20 20:12 lsof: WARNING: can't lstat(tux3/linux-2.6.28.7/block): No child processes 2009-02-20 20:12 there's a umount option for that 2009-02-20 20:12 lazy umount 2009-02-20 20:12 does that mean anything to you?
2009-02-20 20:12 no 2009-02-20 20:12 I'll google 2009-02-20 20:13 lsof +D searches for processes using the path 2009-02-20 20:13 but i've never seen it spit out this warning 2009-02-20 20:15 anyway, umount -l 2009-02-20 20:26 -!- marcin_(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-02-20 21:18 -!- samlh(~sam@67.129.121.145) has joined #tux3 2009-02-20 21:40 hi 2009-02-20 22:41 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-21 00:02 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-02-21 00:28 hi hirofumi 2009-02-21 01:46 flips: got some funny crashes yet ? i was reading the backlog 2009-02-21 01:47 you folks should get a bugzilla going 2009-02-21 01:47 one lockup reported 2009-02-21 01:47 bh, you are hereby nominated as the bugmaster 2009-02-21 01:47 uh 2009-02-21 01:48 give me root to the site and I'll install the packages 2009-02-21 01:48 don't mind being a sysadmin bitch 2009-02-21 01:49 flips: this scheduler work is mushrooming like crazy, I took another look at it and I'm a bit amazed or scared if it'll work at all and if it has the right API abstractions 2009-02-21 01:49 the scope and complexity basically exploded from it being just a simple deadline mechanism 2009-02-21 01:49 and turning into an ad hoc kind of CPU reservations scheduler 2009-02-21 01:50 off topic conversation, sorry :) 2009-02-21 01:52 hmm, a mere copy should not make a linux system unresponsive, yet it does, with the very latest kernel 2009-02-21 01:53 we still suck 2009-02-21 01:53 ACTION looks at axboe 2009-02-21 01:53 oh, not here 2009-02-21 01:53 maybe he should be 2009-02-21 01:53 or is it mingo 2009-02-21 01:53 it's either scheduler or block IO, or both 2009-02-21 01:55 OS/2 on 286 hardware was smooth as silk with disk IO going full blast 2009-02-21 02:01 FreeBSD as well since the beginning of time 2009-02-21 02:01 I almost went to that community because of the VM interactivity issues 2009-02-21 02:12 it's not vm, it's scheduler
and/or block IO 2009-02-21 03:27 -!- persson(~persson@nescafe.bsnet.se) has joined #tux3 2009-02-21 06:39 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-21 06:46 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-21 07:16 got tux3 partition to blow up upon untarring, got a kernel panic 2009-02-21 07:20 can you reproduce it? 2009-02-21 07:20 yesterday i blew it up with a mkdir 2009-02-21 07:21 so there's something with making directories 2009-02-21 07:21 i see 2009-02-21 07:21 i should make a test where i untar pure files, no directories, see if it makes any difference 2009-02-21 07:21 it is happened on clean tux3? 2009-02-21 07:22 yes, this is right after tux mkfs 2009-02-21 07:22 clean means there is no panic before it 2009-02-21 07:22 i see 2009-02-21 07:23 i still got the panic on my console, do you need anything off it? 2009-02-21 07:23 if you have backtrace, it may help to find bug 2009-02-21 07:23 how would i get that? 2009-02-21 07:24 kernel was freezed? 2009-02-21 07:24 if not, the backtrace may be in the syslog 2009-02-21 07:25 yea, i was runing my tests through ssh, and it lost connection, so i looked at the console and there's a panic 2009-02-21 07:26 can you paste the top of panic message? 2009-02-21 07:26 not paste, i could retype it, which pieces are you interested in? 2009-02-21 07:26 top few lines and eip/rip line would be interested 2009-02-21 07:27 e.g. "Failed assert(...)" would be helpful 2009-02-21 07:28 dont see that, i think that scrolled :( 2009-02-21 07:28 top of the screen starts with 'modules linked in' 2009-02-21 07:28 shift-pageup can't see it? 2009-02-21 07:28 nope, frozen hard 2009-02-21 07:29 i see 2009-02-21 07:29 is there functions trace? 2009-02-21 07:29 it's got all the registers, stack and call trace, you want that? 
2009-02-21 07:29 yes 2009-02-21 07:30 i'm gonna take a picture of my screen, this is way too error prone to type ;) 2009-02-21 07:31 good 2009-02-21 07:31 please post to tux3-ml 2009-02-21 07:47 it's coming, slowly, needed batteries and a usb cable ;) 2009-02-21 07:47 :) 2009-02-21 07:53 ok, gimme an address to mail it to 2009-02-21 07:55 tux3@tux3.org 2009-02-21 07:58 sent 2009-02-21 07:58 thanks 2009-02-21 07:58 crap, picassa sent a smaller picture than it showed me, grrr 2009-02-21 08:00 if it's hard to read, i can send you a 1600x1200 version 2009-02-21 08:03 picture is ok 2009-02-21 08:03 picassa doesnt tell you or give you an option to send bigger pics through email 2009-02-21 08:04 well, it seems to die on __end_that_request_first+0x154 2009-02-21 08:04 what does that do? 2009-02-21 08:05 it is end of I/O request 2009-02-21 08:05 um... 2009-02-21 08:07 well, it is not fs driver bug usually, umm... 2009-02-21 08:08 do you need anything else out of that panic screen? i wanna reboot ;) 2009-02-21 08:08 it's enough for now 2009-02-21 08:08 reboot is ok :) 2009-02-21 08:09 i'm gonna set up an experiemental tux3 box in a VM, i dont wanna keep crashing my server ;) 2009-02-21 08:10 yes, it's good :) 2009-02-21 08:15 ok, any particular tests you want me to run on this? 2009-02-21 08:15 i got time for like one more thing, then i gotta clean the house :/ 2009-02-21 08:20 there is no test for now 2009-02-21 08:20 if you can reproduce this, please let us know 2009-02-21 08:20 I can't see the cause of this for now 2009-02-21 08:21 sure will, i just upped the vesa mode resolution, so next time we'll catch more of the kernel panic 2009-02-21 08:21 thanks 2009-02-21 08:21 btw, tux3 is on sata or scsi? 
2009-02-21 08:21 sata 2009-02-21 08:21 ok 2009-02-21 08:21 well, now, I can see from it 2009-02-21 08:22 i have some scsi's sitting around if you want me to run it on that 2009-02-21 08:22 tux3 -> block layer -> scsi-mid -> sata -> block layer (end io) -> panic 2009-02-21 08:23 well, it may be the bug of tux3 likely 2009-02-21 08:23 this machine has been quite stable, and i use it all the time 2009-02-21 08:24 yes 2009-02-21 08:24 if you can test on vm or something, I recommend 2.6.29-rc1 or later 2009-02-21 08:25 sure, i'll try to build up a vm later 2009-02-21 08:25 because, I didn't test well on 2.6.28 2009-02-21 08:27 btw, echo y > /sys/module/tux3/parameters/tux3_trace 2009-02-21 08:27 it will output verbose trace of tux3 2009-02-21 08:27 cool 2009-02-21 08:28 well, so, if you can get some trace, it would also be helpful 2009-02-21 09:04 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-21 09:04 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 09:07 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 09:10 -!- tim_dimm_(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-21 09:23 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 09:32 are there some small projects someone new to the code could do? 2009-02-21 09:32 there was the suggestion by daniel to create the special inodes only when they are first written 2009-02-21 09:32 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 09:34 um... 
2009-02-21 09:35 it may depend on someone likes which part 2009-02-21 09:39 adding function to tux3 command or fuse might be interested 2009-02-21 09:39 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 09:39 i don't really have an overview of the state of affairs right now 2009-02-21 09:40 daniel's status messages have gotten less ;) 2009-02-21 09:41 now, we are implementing the atomic commit 2009-02-21 09:42 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 09:42 flips is adding logging and replay, and I'm trying to add blockdirty() to fork in kernel 2009-02-21 09:42 that's a little to involved irght now 2009-02-21 09:42 for me :) 2009-02-21 09:43 yes, it is complex function 2009-02-21 09:43 well, for me too :) 2009-02-21 09:44 i guess it's learning by doing. but you are somewhat more advanced than me regarding pretty much everything kernel-like 2009-02-21 09:44 how complete is the fuse-implementation compared to kernel right now? 2009-02-21 09:46 iirc, fuse is not implemented basic functionality yet 2009-02-21 09:46 yes 2009-02-21 09:46 e.g. 
it doesn't have mkdir() 2009-02-21 09:46 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 09:47 readlink, rmdir too 2009-02-21 09:47 there are many ENOSYS 2009-02-21 09:48 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 09:51 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 09:57 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 10:06 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 10:07 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 10:10 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 10:14 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 10:21 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 10:28 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 10:40 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 10:56 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 10:58 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 11:12 -!- pranith(~bobby@122.162.67.162) has joined #tux3 2009-02-21 11:32 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-21 12:59 um, it seems there are several bugs on metadata operations 2009-02-21 13:00 I've ran the fsstress, then oops was happened 2009-02-21 13:00 it seems to be race of something 2009-02-21 13:01 well, I found truncate bug and tux3_iget bug 2009-02-21 13:01 well, I'll see those a bit more 2009-02-21 13:04 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-21 13:40 hirofumi: i still haven't figured out a reasonable work flow. i read somewhere you use kvm to test 2009-02-21 13:41 do you use boot per pxe or how do you get the most recent kernel? 2009-02-21 13:41 most recent kernel? 2009-02-21 13:41 git? 2009-02-21 13:41 well, the one you are programming on 2009-02-21 13:42 2.6.29-rc1 or later 2009-02-21 13:42 no, i mean while you are programming. how do you test? 
2009-02-21 13:42 if it's kernel, I'm using kvm almost 2009-02-21 13:43 if it's userland, it's normal, trace or gdb 2009-02-21 13:43 do you copy the kernel by hand to your kvm-instance? or do you have it automated in some way? 2009-02-21 13:43 kvm can load bzImage directly 2009-02-21 13:44 qemu-system-x86_64 -usb -soundhw es1370 -m 512 -serial file:serial.txt -kernel /devel/linux/works/git/mercurial/tux3fs-2.6-build/arch/x86/boot/bzImage -append root=/dev/sda1 console=ttyS0,115200 console=tty0 -hda /devel/works/qemu/image/debian-amd64.qcow2 -hdb /devel/works/qemu/image/debian-swap.base -hdc /devel/linux/works/git/mercurial/qemu-tux3/tux3.img 2009-02-21 13:44 oh ok. didn't know about that 2009-02-21 13:44 thanks 2009-02-21 13:44 this is command line of me 2009-02-21 13:45 ah, it's -append "root=/dev/sda1 console=ttyS0,115200 console=tty0" 2009-02-21 13:46 well, it means the options to pass kernel 2009-02-21 13:46 figured that :) 2009-02-21 13:46 there's no public git-tree yet, right>? 2009-02-21 13:47 there is, but not updated yet 2009-02-21 13:47 linus tree + tux3 hg repo is latest one 2009-02-21 13:47 ok, so i should just use the mercurial-tree and copy iot over? 2009-02-21 13:47 ok 2009-02-21 13:48 for kernel 2009-02-21 14:23 data, there? 2009-02-21 14:24 flips: yes, i am 2009-02-21 14:25 so... interesting projects 2009-02-21 14:25 a simple fsck would be very helpful 2009-02-21 14:25 just as a consistency checker? 2009-02-21 14:25 yes 2009-02-21 14:26 hmm, not sure if i am up to that. but that sounds really interesting 2009-02-21 14:26 it can start very simple 2009-02-21 14:26 just another command in tux3, fsck 2009-02-21 14:26 and be a walk of the tree, like tree dump 2009-02-21 14:27 except when it gets to an inode table leaf, it walks all the file btrees in it 2009-02-21 14:27 calling the leaf_check method for each 2009-02-21 14:27 just a leaf checker to start would already be useful 2009-02-21 14:28 ok, that sounds doable 2009-02-21 14:30 now... 
to try tux3 on the laptop 2009-02-21 14:30 yes, there are bugs 2009-02-21 14:30 I hope it stays up long enough to give the talk 2009-02-21 14:30 ;) 2009-02-21 14:31 bug seems make_inode() 2009-02-21 14:32 it allocates twice or more same inum 2009-02-21 14:35 I was looking at that last week 2009-02-21 14:35 it didn't look right 2009-02-21 14:35 but did not have time to really think about it 2009-02-21 14:35 well, now 2009-02-21 14:36 ok, test case is 2009-02-21 14:37 make_inode(, 0x480); make_inode(, 0x4c0); make_inode(, 0x4c0) 2009-02-21 14:37 maybe, 64 is point 2009-02-21 14:38 setting up a test 2009-02-21 14:50 ah, no 2009-02-21 14:50 it seems to work 2009-02-21 14:50 ah 2009-02-21 14:51 just got my test set up ;-) 2009-02-21 14:51 oh well 2009-02-21 14:51 it is a useful unit test 2009-02-21 14:51 I will add it later 2009-02-21 14:51 find_empty_inode: base 480 2009-02-21 14:51 find_empty_inode: result inum is 4bf 2009-02-21 14:51 make_inode: created 4bf 2009-02-21 14:51 make_inode: buffer 4ad 2009-02-21 14:51 find_empty_inode: base 480 2009-02-21 14:51 find_empty_inode: result inum is 4c0 2009-02-21 14:51 make_inode: created 4c0 2009-02-21 14:51 make_inode: buffer 4ad 2009-02-21 14:51 find_empty_inode: base 480 2009-02-21 14:51 find_empty_inode: result inum is 4c0 2009-02-21 14:51 make_inode: no more inode space here, advance 1 2009-02-21 14:51 make_inode: buffer 4ad 2009-02-21 14:51 find_empty_inode: base 480 2009-02-21 14:51 find_empty_inode: result inum is 4c0 2009-02-21 14:51 make_inode: created 4c0 2009-02-21 14:51 make_inode: buffer 4ad 2009-02-21 14:51 find_empty_inode: base 480 2009-02-21 14:51 find_empty_inode: result inum is 4c0 2009-02-21 14:51 make_inode: no more inode space here, advance 1 2009-02-21 14:51 make_inode: buffer 4ad 2009-02-21 14:52 find_empty_inode: base 480 2009-02-21 14:52 find_empty_inode: result inum is 4c0 2009-02-21 14:52 make_inode: created 4c0 2009-02-21 14:52 well, this is log of wrong case 2009-02-21 14:54 ah, it did the 
advance() 2009-02-21 14:55 but base was 0x480 again 2009-02-21 14:55 yes, that logic looks funny 2009-02-21 14:57 diff -r ebc1b8b3d3bc user/inode.c 2009-02-21 14:57 --- a/user/inode.c Wed Feb 18 21:23:03 2009 +0900 2009-02-21 14:57 +++ b/user/inode.c Sat Feb 21 14:57:08 2009 -0800 2009-02-21 14:57 @@ -317,6 +317,16 @@ int main(int argc, char *argv[]) 2009-02-21 14:57 sb->volmap = rapid_open_inode(sb, NULL, 0); 2009-02-21 14:57 sb->logmap = rapid_open_inode(sb, NULL, 0); 2009-02-21 14:57 2009-02-21 14:57 +if (1) { 2009-02-21 14:57 + struct inode *inode = new_inode(sb); 2009-02-21 14:57 + assert(inode); 2009-02-21 14:57 + assert(!make_tux3(sb)); 2009-02-21 14:57 + make_inode(inode, 0x480); 2009-02-21 14:57 + make_inode(inode, 0x4C0); 2009-02-21 14:57 + make_inode(inode, 0x4C0); 2009-02-21 14:57 + exit(0); 2009-02-21 14:57 +} 2009-02-21 14:57 + 2009-02-21 14:57 trace("make tux3 filesystem on %s (0x%Lx bytes)", name, (L)size); 2009-02-21 14:57 if ((errno = -make_tux3(sb))) 2009-02-21 14:57 goto eek; 2009-02-21 14:57 ah, extra line 2009-02-21 14:58 oh no, it's fine 2009-02-21 14:58 it might be bug of ileaf_split() 2009-02-21 14:58 ibase seems to set by it only 2009-02-21 14:59 ah, no 2009-02-21 14:59 advance did, but same buffer->index 2009-02-21 15:00 ok, that test case creates 0x4c0 twice, just as you say 2009-02-21 15:00 new_inode doesn't have ->present 2009-02-21 15:00 so, inode size is 0 2009-02-21 15:00 heh 2009-02-21 15:00 wrong is it in this case 2009-02-21 15:01 ok, I better set present 2009-02-21 15:01 that is kind of a dangerous thing 2009-02-21 15:01 we should probably assert(...present) 2009-02-21 15:02 yes, maybe, store_attrs() should it 2009-02-21 15:02 probably assert(size) 2009-02-21 15:04 um..., 2009-02-21 15:04 this seems to need a bit more time to see 2009-02-21 15:04 I'll sleep before finding more 2009-02-21 15:05 btw, reproduce test is 2009-02-21 15:05 fsstress -X -p 10 -n 100 -l 100 -d . 
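The diagnosis reached at 15:00-15:02 above (an inode with no ->present bits stores zero bytes of attributes, so its slot still looks empty and the same inum is handed out again) can be modelled in a few lines. This is a toy sketch under stated assumptions and not the tux3 implementation: `toy_table`, `encoded_size`, `toy_find_empty`, `toy_make_inode` and `toy_make_inode_checked` are hypothetical, and the "size 0 means free" rule is a simplification of what the chat describes.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SLOTS 256

/* Toy inode table (hypothetical): a slot whose stored attribute
 * size is 0 is treated as free. */
struct toy_table {
	uint8_t attr_size[TABLE_SLOTS];
};

/* Hypothetical encoder: one byte per attribute group flagged in
 * "present", so present == 0 encodes to zero bytes. */
static unsigned encoded_size(unsigned present)
{
	unsigned size = 0;

	for (; present; present >>= 1)
		size += present & 1;
	return size;
}

/* First free slot at or after goal, or -1 if there is none. */
static int toy_find_empty(const struct toy_table *table, int goal)
{
	for (int inum = goal; inum < TABLE_SLOTS; inum++)
		if (!table->attr_size[inum])
			return inum;
	return -1;
}

/* Unchecked create: with present == 0 the slot stays "free". */
static int toy_make_inode(struct toy_table *table, int goal, unsigned present)
{
	int inum = toy_find_empty(table, goal);

	if (inum < 0)
		return -1;
	table->attr_size[inum] = encoded_size(present);
	return inum;
}

/* Guarded create: refuse to store an inode that would encode to
 * nothing. The chat proposes an assert here; this sketch returns -1
 * instead so that a caller can observe the refusal. */
static int toy_make_inode_checked(struct toy_table *table, int goal, unsigned present)
{
	if (!encoded_size(present))
		return -1;
	return toy_make_inode(table, goal, present);
}
```

In this model, two unchecked creates with present == 0 and the same goal return the same inum, which mirrors the duplicate 0x4c0 in the log above; once an attribute bit is set, the second create moves on to the next slot.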
2009-02-21 15:06 ok, oyasumi 2009-02-21 15:06 oyasumi 2009-02-21 15:06 I must work on slides 2009-02-21 15:06 for tomorrow 2009-02-21 15:06 yes 2009-02-21 15:06 it's good 2009-02-21 15:09 ileaf_resize: resize inum 0x4c1 at 0xe from 0 to e 2009-02-21 15:09 fixed the missing attribute issue 2009-02-21 15:09 diff -r ebc1b8b3d3bc user/inode.c 2009-02-21 15:09 --- a/user/inode.c Wed Feb 18 21:23:03 2009 +0900 2009-02-21 15:09 +++ b/user/inode.c Sat Feb 21 15:09:47 2009 -0800 2009-02-21 15:09 @@ -317,6 +317,17 @@ int main(int argc, char *argv[]) 2009-02-21 15:09 sb->volmap = rapid_open_inode(sb, NULL, 0); 2009-02-21 15:09 sb->logmap = rapid_open_inode(sb, NULL, 0); 2009-02-21 15:09 2009-02-21 15:09 +if (1) { 2009-02-21 15:09 + struct inode *inode = new_inode(sb); 2009-02-21 15:10 + assert(inode); 2009-02-21 15:11 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-21 15:12 + tux_inode(inode)->present |= MODE_OWNER_BIT; 2009-02-21 15:12 + assert(!make_tux3(sb)); 2009-02-21 15:12 + make_inode(inode, 0x480); 2009-02-21 15:12 + make_inode(inode, 0x4C0); 2009-02-21 15:12 + make_inode(inode, 0x4C0); 2009-02-21 15:12 + exit(0); 2009-02-21 15:12 +} 2009-02-21 15:12 + 2009-02-21 15:12 trace("make tux3 filesystem on %s (0x%Lx bytes)", name, (L)size); 2009-02-21 15:12 if ((errno = -make_tux3(sb))) 2009-02-21 15:12 goto eek; 2009-02-21 15:12 so that test does not create a duplicate inum 2009-02-21 15:15 openoffice 3 is up and running on the ancient laptop 2009-02-21 15:15 under xfce 2009-02-21 15:15 which is necessary because both kde and gnome are broken in debian unstable right now 2009-02-21 15:24 hmm, ooffice comes up full screen under xfce 2009-02-21 15:24 no window decorations 2009-02-21 15:25 rude of it 2009-02-21 17:14 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-21 17:19 flips: i just started looking at the code. a fsck would pretty much mimick the traversal by tux3graph, right? 
2009-02-21 17:23 data, yes 2009-02-21 17:23 and also show_tree 2009-02-21 17:23 flips, would that be similar to the marcindump we discussed yesterday? 2009-02-21 17:24 it would 2009-02-21 17:25 in fact, you could write a generic traverser that can either check or dump 2009-02-21 17:25 or start with something that does just one job and abstract it after 2009-02-21 17:25 tux3graph is a good model 2009-02-21 17:25 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-21 17:25 it is generic to some extent 2009-02-21 17:39 folks 2009-02-21 17:50 ok, the ten year old vaio is copying over a tux3 root fs at 3 MB/sec 2009-02-21 17:50 that is not too bad at all 2009-02-21 17:50 about have what the P4 with 3.5" sata does 2009-02-21 17:50 about half 2009-02-21 17:51 cpu is 9-20% 2009-02-21 17:51 again, not bad 2009-02-21 17:51 500 Mhz processor 2009-02-21 17:57 ah, closer to 5 MB/sec 2009-02-21 18:28 dd with 4M blocksize only goes twice as fast 2009-02-21 18:29 ACTION makes a copy of the tux3 partition for recovery from the expected crash 2009-02-21 18:30 flips, i got my vm going, any particular tests to run, or just stress the shit out of? 
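A rough sketch of the "generic traverser that can either check or dump" idea from 17:25 above. Everything here is hypothetical: `struct tnode`, `walk`, `check_leaf` and `dump_leaf` are illustration names, the negative entry count standing in for corruption is an invented convention, and the real tux3 tree code, tux3graph and show_tree are not consulted. The point is only that one walk can serve both jobs by taking the per-leaf action as a callback.

```c
#include <stdio.h>

#define MAX_FANOUT 4

/* Toy tree node (hypothetical): nchildren == 0 marks a leaf.
 * nentries is leaf payload; a negative value stands in for
 * "corrupt" in this sketch. */
struct tnode {
	int nchildren;
	struct tnode *child[MAX_FANOUT];
	int nentries;
};

/* Per-leaf action; returns the number of problems found. */
typedef int (*leaf_fn)(const struct tnode *leaf, void *data);

/* Walk the whole tree, apply fn to every leaf, sum the results. */
static int walk(const struct tnode *node, leaf_fn fn, void *data)
{
	int problems = 0;

	if (!node->nchildren)
		return fn(node, data);
	for (int i = 0; i < node->nchildren; i++)
		problems += walk(node->child[i], fn, data);
	return problems;
}

/* fsck-style action: report one problem per bad leaf. */
static int check_leaf(const struct tnode *leaf, void *data)
{
	(void)data;
	return leaf->nentries < 0;
}

/* dump-style action: print the leaf to the FILE passed as data. */
static int dump_leaf(const struct tnode *leaf, void *data)
{
	fprintf((FILE *)data, "leaf with %d entries\n", leaf->nentries);
	return 0;
}
```

Calling `walk(root, check_leaf, NULL)` gives a checker and `walk(root, dump_leaf, stdout)` gives a dumper, which matches the suggestion to start with one job and abstract afterwards.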
2009-02-21 18:30 see if you can get it to lock up again 2009-02-21 18:30 that would be much useful 2009-02-21 18:31 I guess that means stress the shit out 2009-02-21 18:53 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-21 19:07 ok, I'm up and running with the ooffice slide show on the ancient laptop, on tux3 2009-02-21 19:07 it all feels very scary ;) 2009-02-21 19:08 pft, you werent freezing your server at least ;) 2009-02-21 19:09 not yet 2009-02-21 19:16 marcin, you don't have to stand in front of a crowd in danger of disaster tomorrow ;) 2009-02-21 19:16 I will be sure to have a copy of the presentation on the ext3 partition as well 2009-02-21 19:16 eh, BillG did it and crashed win98 ;) 2009-02-21 19:17 the lilo boot label for tux3 is "danger" 2009-02-21 19:17 he did recover nicely, he said 'and this is why it's in beta' smooth operator that bill 2009-02-21 19:17 heh, I wonder if billg ever tried it once before the show 2009-02-21 19:17 well he's done that more than once 2009-02-21 19:18 well I know it's tux3 running because of the warning tracebacks on truncate 2009-02-21 19:18 otherwise I wouldn't know 2009-02-21 19:19 wifi is up 2009-02-21 19:19 installing xchat 2009-02-21 19:19 soon tux3 will come over for a visit 2009-02-21 19:19 just about time to dd the tux3 partition again 2009-02-21 19:20 don't want to have to do this all over again 2009-02-21 19:21 there's a bug in the neomagic graphics that makes xor blits not work 2009-02-21 19:21 causes some very ugly effects on the xfce desktop, but ooffice doesn't hit it fortunately 2009-02-21 19:21 don't have time to chase it down 2009-02-21 19:23 -!- tux3(~root@phunq.net) has joined #tux3 2009-02-21 19:24 hey 2009-02-21 19:24 it's ALIVE! 
2009-02-21 19:24 flips: my fsck is going to be very much like the draw_* 2009-02-21 19:24 getting slowly on its feet 2009-02-21 19:24 I mean 2009-02-21 19:24 but hirofumi is just too good :) 2009-02-21 19:24 I'm getting up :) 2009-02-21 19:24 data, that about sums it up 2009-02-21 19:25 data, good choice 2009-02-21 19:25 ok, back to making slides 2009-02-21 19:25 actually, if I had nothing more to show than cat /proc/mounts, it would be enough 2009-02-21 19:28 rootfs / rootfs rw 0 0 2009-02-21 19:28 /dev/root / tux3 rw 0 0 2009-02-21 19:28 tmpfs /lib/init/rw tmpfs rw,nosuid,mode=755 0 0 2009-02-21 19:28 proc /proc proc rw,nosuid,nodev,noexec 0 0 2009-02-21 19:28 sysfs /sys sysfs rw,nosuid,nodev,noexec 0 0 2009-02-21 19:28 tmpfs /dev tmpfs rw,size=10240k,mode=755 0 0 2009-02-21 19:28 tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0 2009-02-21 19:28 devpts /dev/pts devpts rw,nosuid,noexec,gid=5,mode=620 0 0 2009-02-21 19:31 how about grep tux3 /proc/slabinfo ? :D 2009-02-21 19:31 sure 2009-02-21 19:32 and a df :P 2009-02-21 19:33 I booted back to ext3 2009-02-21 19:33 to do my partition backup 2009-02-21 19:33 slabinfo just has tux3_inode_cache, with some inodes 2009-02-21 19:33 well I suppose that's interesting 2009-02-21 19:33 right now, 11 inodes 2009-02-21 19:35 another thing that might be cool to do: find / -type f -exec md5sum '{}' ';' 2009-02-21 19:36 a separate checksum for each file? 2009-02-21 19:36 yeah 2009-02-21 19:36 ah, that's another thing we might find useful 2009-02-21 19:37 a checksum of all the logical data in a tux3 fs 2009-02-21 19:38 anyway, I will be trying to avoid too many stress tests just now 2009-02-21 19:38 I need to give my slide show on this tomorrow morning 2009-02-21 19:38 and still have lots of work to do on the slides 2009-02-21 19:38 good luck!
2009-02-21 19:38 ACTION actually likes doing slides :P 2009-02-21 19:38 so I don't want to be spending a lot of time repairing broken tux3 partitions 2009-02-21 19:39 it's more fun when you have enough time ;) 2009-02-21 19:39 true 2009-02-21 19:39 ooffice is a little less than intuitive 2009-02-21 19:39 http://www.slideshare.net/razvanm/koala 2009-02-21 19:39 can spend 5-10 minutes just figuring out how to make a text box 2009-02-21 19:40 an example of too much fun 2009-02-21 19:40 I did in keynote :D 2009-02-21 19:40 but I only use simple-simple stuff 2009-02-21 19:40 oh i gave a huge demo about linux 'office capabilities' and everyone was superimpressed until i got to the slides, and the OO's powerpoint sucked hard 2009-02-21 19:42 but hirofumi is just too good :) 2009-02-21 19:43 ah, sorry 2009-02-21 19:43 gotta hate the paste on middle-button :) 2009-02-21 19:43 which i always miss in windows 2009-02-21 19:43 I find it convenient ;-) 2009-02-21 19:44 it is pretty much every time the mouse is not over my irssi-window :) 2009-02-21 19:55 my 4 year old wants me to teach her how to use openoffice now 2009-02-21 19:55 she thinks the stuff I'm doing with it is cool 2009-02-21 19:55 she _will_ learn to use it to 2009-02-21 19:55 she's already figured out large parts of inkscape 2009-02-21 19:55 including how to control the stroke transparency 2009-02-21 19:59 nice... i am trying to ruin something in my tux3-partition by randomly changing bits and fsck shows nothing. so i go to mount it, and it still works :) 2009-02-21 19:59 what should our mission statement be? 2009-02-21 19:59 'ive seen a million faces and i've rocked them all!' 
2009-02-21 19:59 thank you bon jovi ;) 2009-02-21 19:59 oh maybe 2009-02-21 20:01 "Tux3: a filesystem lover's filesystem" ;) 2009-02-21 20:01 tux3: in case you've had too much free time 2009-02-21 20:01 "Tux3: It sucks less and faster" 2009-02-21 20:02 I'm putting that on the slide for now 2009-02-21 20:02 sombody better think of something better ;) 2009-02-21 20:02 tux3: organically produced, low bit alternative filesystem 2009-02-21 20:02 running light without overbyte 2009-02-21 20:03 course nobody ever used that one before ;) 2009-02-21 20:04 well, what do you want to convey? 2009-02-21 20:04 lack of uselessness? 2009-02-21 20:05 well, that's a temporary condition 2009-02-21 20:05 ok, what's the current situation? 2009-02-21 20:06 1) ext2/3/4 starting to look creaky 2009-02-21 20:06 and lacking snapshots 2009-02-21 20:06 2) sun kicking our butts 2009-02-21 20:06 3) Tux3 riding to the rescue on a jet scooter that may explode in a blinding flash at any instant 2009-02-21 20:07 tux3: 21st century FS without 128bit overhead and bad licenses? 2009-02-21 20:07 4) oracle building, building, building away on something btr 2009-02-21 20:08 "Tux3: we only boil a puddle, not an ocean" 2009-02-21 20:08 ok, i have a very rough first draft, but i'd rather have you tear it down now. 
2009-02-21 20:08 http://jonasfietz.de/hgwebdir.cgi/tux3/rev/f5989090b59f 2009-02-21 20:08 (keep in mind it's 5am here:)) 2009-02-21 20:09 this is largely compiled checked only 2009-02-21 20:10 and keep in mind I have about 2 minutes to look at it ;) 2009-02-21 20:10 no problem :) better prep for your talk 2009-02-21 20:10 it's reasonable to punt the sb check just for now 2009-02-21 20:11 it will change a lot and probably never break anyway 2009-02-21 20:11 well, i punted pretty much every check but {i,d}leaf_check 2009-02-21 20:11 but this way it's easy to add more 2009-02-21 20:11 as someone more skilled thinks of them :) 2009-02-21 20:11 please use "warn" instead of printf 2009-02-21 20:12 for warnings 2009-02-21 20:13 data, it looks good 2009-02-21 20:13 thanks, i'll correct the coding style and rearrange the code somewhat 2009-02-21 20:14 it's a nascent fsck, indeed 2009-02-21 20:14 first pass will only check physical structure 2009-02-21 20:15 once you know physical structure is good, then you can access the allocation bitmaps 2009-02-21 20:15 simplest check for the second pass is probably allocation bitmap 2009-02-21 20:15 right :0 2009-02-21 20:15 yes :) 2009-02-21 20:15 allocation bitmap sounds interesting ;) 2009-02-21 20:15 tell me more 2009-02-21 20:17 so... UFS is the filesystem every unix filesystem descended from? 2009-02-21 20:17 or is there a more ancient one? 2009-02-21 20:17 what is preferable: to keep the typedefs, functions and everything ordered by type or by logical block? 2009-02-21 20:18 however you want, so long as it compiles without forward refs 2009-02-21 20:18 it can always be sorted later 2009-02-21 20:19 whatever way makes it easiest to work 2009-02-21 20:20 wikipedia refers to an original filesystem used by system v but does not give it a name 2009-02-21 20:20 version 7 unix 2009-02-21 20:21 FFS 2009-02-21 20:21 the sources should be available somewhere... 
2009-02-21 20:22 in the german edition it says that ext2 stems from ufs 2009-02-21 20:22 it is still being worked on 2009-02-21 20:22 by mckusick 2009-02-21 20:22 http://minnie.tuhs.org/UnixTree/V7/ 2009-02-21 20:22 and the original one was just called FS 2009-02-21 20:22 data, that seems safe to put in the slides 2009-02-21 20:22 ah 2009-02-21 20:27 now what shall I say about btrfs 2009-02-21 20:27 it's on the way, but... 2009-02-21 20:27 * some folks like volume management done by a volume manager 2009-02-21 20:28 * a complicated code base may take a long time to stabilize 2009-02-21 20:28 * it's not invented here ;) 2009-02-21 20:32 http://cm.bell-labs.com/7thEdMan/v7vol1.pdf 2009-02-21 20:33 indeed the filesystem doesn't have a name 2009-02-21 20:33 is just refered as 'file system' 2009-02-21 20:33 imagine, a filesystem without a name 2009-02-21 20:33 how primitive ;) 2009-02-21 20:33 in those days they used wooden computers and booted them by throwing rocks 2009-02-21 20:34 how ego-free ;) 2009-02-21 20:34 a modern filesystem is 10% ego 2009-02-21 20:34 anything i ever produced is 90% annoyance 2009-02-21 20:35 well... linux creator is linus ;-) 2009-02-21 20:35 right, the other 90% is annoyance 2009-02-21 20:37 oh, mission statement is "A next generation filesystem for Linux" 2009-02-21 20:38 you might wanna mention that while others are evolutionary, you dont mind to start a revolution 2009-02-21 20:38 burn baby bytes, burn 2009-02-21 20:38 how about "Our filesystem for Linux" :P 2009-02-21 20:39 'our filesystem can beat up your filesystem' 2009-02-21 20:39 "from us to you with love and kisses" 2009-02-21 20:39 or "A modern filesystem for Linux" 2009-02-21 20:40 :-) 2009-02-21 20:40 of filesystem has a longer, ahem, um,... _reach_ than your filesystem 2009-02-21 20:40 our filesystem has a longer, ahem, um,... _reach_ than your filesystem 2009-02-21 20:40 it's not how many bits you got, it's how you allocate them! 
2009-02-21 20:42 "a btre in the hand is worth two in the bush" 2009-02-21 20:43 ahem, a btre is a new form of tree structure with branching factor of one, for those who do not know 2009-02-21 20:43 something for the first front page: http://www.sxc.hu/photo/808985 :D 2009-02-21 20:44 http://www.sxc.hu/photo/506581 this would be good to introduce the other filesystems 2009-02-21 20:44 oh yeah 2009-02-21 20:46 a bear barrel would be nice as a logo ;-) 2009-02-21 20:46 I think I may use those photos 2009-02-21 20:46 if it' 2009-02-21 20:47 just temporarily of course 2009-02-21 20:47 I will give them back when done 2009-02-21 20:47 :-) 2009-02-21 20:47 they are free :P 2009-02-21 20:50 free? there must be a catch ;) 2009-02-21 20:52 now I need a photo of a pack of wolves to represent non-linux filesystems 2009-02-21 20:52 http://www.sxc.hu/photo/306745 2009-02-21 20:52 not a big pack though 2009-02-21 21:04 good enough 2009-02-21 21:05 http://www.firstpeople.us/FP-Html-Pictures/wolves_pg3.html#Wolf_Photographs_5 2009-02-21 21:07 http://www.firstpeople.us/pictures/wolves/1024x768/Gray-Wolf-Pup-1024x768.html :D 2009-02-21 21:09 well, if we didn't have tux... 2009-02-21 21:14 how about this: http://groovyadventures.com/adventures/articles/shark.jpg 2009-02-21 21:15 :-) 2009-02-21 21:15 scary 2009-02-21 21:16 that's ZFS, from the point of view of a penguin 2009-02-21 21:18 so that's tux? http://4.bp.blogspot.com/_tNRAa-BYU7M/Ru7zpQTKSYI/AAAAAAAAADw/nz9gMP7T3zU/s1600/ninja_tux.jpg ? 2009-02-21 21:19 i am going to bed, see you tomorrow 2009-02-21 21:19 see you 2009-02-21 21:19 nice stuff 2009-02-21 21:19 nice picture! 
:D 2009-02-21 21:19 yah 2009-02-21 21:19 captures the essence :P 2009-02-21 21:54 how about "Tux3: A shiny new filesystem for Linux" 2009-02-21 22:25 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-21 22:47 -!- RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3 2009-02-22 00:02 ok, make_inode() bug was found 2009-02-22 00:08 goom morning hirofumi 2009-02-22 00:08 good even 2009-02-22 00:09 cool 2009-02-22 00:09 what was it? 2009-02-22 00:10 it was simple 2009-02-22 00:10 in make_inode, leafbuf is updated outside loop 2009-02-22 00:10 but, leafbuf should be updated after advance() 2009-02-22 00:11 diff -puN user/kernel/inode.c~make_inode-fix user/kernel/inode.c 2009-02-22 00:11 --- tux3/user/kernel/inode.c~make_inode-fix 2009-02-22 17:01:08.000000000 +0900 2009-02-22 00:11 +++ tux3-hirofumi/user/kernel/inode.c 2009-02-22 17:01:16.000000000 +0900 2009-02-22 00:11 @@ -189,13 +189,13 @@ static int make_inode(struct inode *inod 2009-02-22 00:11 down_write(&cursor->btree->lock); 2009-02-22 00:11 if ((err = probe(cursor, goal))) 2009-02-22 00:11 goto out; 2009-02-22 00:11 - struct buffer_head *leafbuf = cursor_leafbuf(cursor); 2009-02-22 00:11 2009-02-22 00:11 /* FIXME: inum allocation should check min and max */ 2009-02-22 00:11 trace("create inode 0x%Lx", (L)goal); 2009-02-22 00:11 assert(!tux_inode(inode)->btree.root.depth); 2009-02-22 00:11 assert(goal < next_key(cursor, depth)); 2009-02-22 00:11 while (1) { 2009-02-22 00:11 + struct buffer_head *leafbuf = cursor_leafbuf(cursor); 2009-02-22 00:11 trace_off("find empty inode in [%Lx] base %Lx", (L)bufindex(leafbuf), (L)ibase(leaf)); 2009-02-22 00:11 goal = find_empty_inode(itable, bufdata(leafbuf), goal); 2009-02-22 00:11 trace("result inum is %Lx, limit is %Lx", (L)goal, (L)next_key(cursor, depth)); 2009-02-22 00:11 _ 2009-02-22 00:12 duh 2009-02-22 00:13 we should not cache the leafbuf 2009-02-22 00:13 use the access function instead 2009-02-22 00:13 sure 
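[editor's sketch, not part of the log] The make_inode() bug discussed here comes down to fetching the leaf buffer once before the loop and still using it after the cursor has advanced. Below is a minimal C model of that pattern, assuming a toy cursor over an array of fixed-size leaves. Every name in it (toy_leaf, toy_cursor, toy_find_empty, toy_alloc_inum, demo_alloc) is hypothetical and only loosely modelled on the functions in the diff. The rule that an inode number outside a leaf's range "looks empty" is likewise an assumption made for illustration, not a statement about how find_empty_inode() behaves.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PER_LEAF 4   /* assumed toy leaf capacity, not the tux3 value */

/* Hypothetical stand-ins for an inode table leaf and a btree cursor. */
struct toy_leaf { long base; int used[TOY_PER_LEAF]; };
struct toy_cursor { struct toy_leaf *leaves; size_t count, pos; };

/* Access function: always returns the leaf the cursor points at right now. */
static struct toy_leaf *toy_cursor_leaf(struct toy_cursor *c)
{
    return &c->leaves[c->pos];
}

/* First inode number that lies past the current leaf. */
static long toy_next_key(struct toy_cursor *c)
{
    return toy_cursor_leaf(c)->base + TOY_PER_LEAF;
}

/* Move to the next leaf; returns 0 when there is none. */
static int toy_advance(struct toy_cursor *c)
{
    if (c->pos + 1 >= c->count)
        return 0;
    c->pos++;
    return 1;
}

/* First inum >= goal that this leaf does not mark as used.
 * Assumption: anything outside the leaf's range looks empty. */
static long toy_find_empty(const struct toy_leaf *leaf, long goal)
{
    while (goal >= leaf->base && goal < leaf->base + TOY_PER_LEAF &&
           leaf->used[goal - leaf->base])
        goal++;
    return goal;
}

/* refetch == 0 models the bug (leaf cached before the loop),
 * refetch == 1 models the fix (leaf looked up on every pass). */
static long toy_alloc_inum(struct toy_cursor *c, long goal, int refetch)
{
    struct toy_leaf *leaf = toy_cursor_leaf(c);   /* cached once */

    for (;;) {
        if (refetch)
            leaf = toy_cursor_leaf(c);            /* follow the cursor */
        goal = toy_find_empty(leaf, goal);
        if (goal < toy_next_key(c))
            return goal;
        if (!toy_advance(c))
            return -1;
    }
}

/* Leaf 0 (inums 0..3) is full; leaf 1 (inums 4..7) has 4 and 5 in use. */
static long demo_alloc(int refetch)
{
    struct toy_leaf leaves[2] = {
        { 0, { 1, 1, 1, 1 } },
        { 4, { 1, 1, 0, 0 } },
    };
    struct toy_cursor c = { leaves, 2, 0 };

    return toy_alloc_inum(&c, 0, refetch);
}
```

In this model demo_alloc(0) hands out inum 4, which leaf 1 already marks as used, while demo_alloc(1) returns the genuinely free inum 6. That is one plausible way the same inum could be allocated twice, under the assumptions stated above.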
2009-02-22 00:13 should I wait for your patch, or just code it? 2009-02-22 00:14 waiting for a patch would be nice 2009-02-22 00:14 I'm still working on slides 2009-02-22 00:14 ok 2009-02-22 00:14 they are looking ok now 2009-02-22 00:14 but there are still a lot of things to add 2009-02-22 00:14 ok, I'll prepare the patches 2009-02-22 00:15 how did you notice that the same inum was allocated twice? 2009-02-22 00:15 fsstress test with some trace 2009-02-22 00:16 this time, a real patch posted to the list would be nice 2009-02-22 00:16 fsstress is including create/delete test 2009-02-22 00:16 not hg repo url? 2009-02-22 00:16 that's fine 2009-02-22 00:17 ok 2009-02-22 00:17 I can pick up a patch from there 2009-02-22 00:19 I wonder why it was so rare 2009-02-22 00:19 probably, it is not so rare 2009-02-22 00:20 fs may have same inode 2009-02-22 00:20 I guess I will rebuild the root partition for the talk tomorrow then 2009-02-22 00:20 so much is working though 2009-02-22 00:20 no segfaults 2009-02-22 00:21 yes 2009-02-22 00:22 find -printf '%i %p\n' 2009-02-22 00:23 this seems can print inode number and path 2009-02-22 00:24 fsstress seems to live more with the above path 2009-02-22 00:24 17 + trace_off("find empty inode in [%Lx] base %Lx", (L)bufindex(cursor_leafbuf(cursor)), (L)ibase(leaf)); 2009-02-22 00:24 18 + goal = find_empty_inode(itable, bufdata(cursor_leafbuf(cursor)), goal); 2009-02-22 00:24 yes 2009-02-22 00:24 ok, well I will make my patch here 2009-02-22 00:25 leafbuf was removed? 2009-02-22 00:25 yes 2009-02-22 00:25 also, 2009-02-22 00:25 it looks good 2009-02-22 00:25 + struct ileaf *ileaf = to_ileaf(bufdata(cursor_leafbuf(cursor))); 2009-02-22 00:26 purge_inum()? 
2009-02-22 00:26 yes 2009-02-22 00:27 it isn't necessary to fix 2009-02-22 00:27 there is no bug 2009-02-22 00:27 yes 2009-02-22 00:27 but it's better not to cache that when we have an access function 2009-02-22 00:27 the compiler can cache it for us 2009-02-22 00:27 yes 2009-02-22 00:27 well, there is no loop in purge_inum 2009-02-22 00:28 did I mention that I have tux3 running as rootfs on the old vaio laptop? 2009-02-22 00:28 it runs xfce -> openoffice presenter -> tux3 slide show 2009-02-22 00:28 yes, I seen it from irc log 2009-02-22 00:28 good 2009-02-22 00:28 well I might not remake that partition 2009-02-22 00:29 it takes a long time, not just the copy, but then setting up little details 2009-02-22 00:29 so I will apply the patch, update the kernel, and hope the fs is not badly damaged already 2009-02-22 00:29 if it didn't remove exsisted inode, I guess it is ok 2009-02-22 00:30 it did, just a few times 2009-02-22 00:30 well, the inode allocation goal keeps increasing 2009-02-22 00:30 good, conflict should be not many 2009-02-22 00:30 that is why it is rare 2009-02-22 00:30 ah 2009-02-22 00:30 yes 2009-02-22 00:30 now, why does it happen at all? 2009-02-22 00:31 ah 2009-02-22 00:31 there is no incrememnt, I guess 2009-02-22 00:31 it creates concurrently with some processes 2009-02-22 00:31 yes 2009-02-22 00:31 and apt-get is very unconcurrent 2009-02-22 00:32 yes 2009-02-22 00:32 I am probably ok 2009-02-22 00:32 and maybe, it will use balloc() very operations before make_inode(), I guess 2009-02-22 00:32 for each operations 2009-02-22 00:33 so, ->nextalloc may be incremented 2009-02-22 00:33 yes 2009-02-22 00:35 btw, if you are ok, please commit your patch instead 2009-02-22 00:35 ok 2009-02-22 00:36 btw, another founded bug is 2009-02-22 00:36 tux3_iget() race 2009-02-22 00:37 pushed to public 2009-02-22 00:37 ah 2009-02-22 00:37 iget5_locked() can return uninitialized inode 2009-02-22 00:37 oops 2009-02-22 00:37 smp bug? 
2009-02-22 00:37 by grab inode from clear_inode() 2009-02-22 00:37 yes 2009-02-22 00:38 it will not run free and alloc path 2009-02-22 00:38 I'm not sure, there are actual uninitialized fileds with it 2009-02-22 00:38 but it may be possible 2009-02-22 00:38 got a patch? 2009-02-22 00:38 no 2009-02-22 00:38 ok, I haven't hit it yet 2009-02-22 00:39 I'm not sure yet which fields is needed to initialized actually 2009-02-22 00:39 yes 2009-02-22 00:39 Marcin hit something, he can try the fixes tomorrow maybe 2009-02-22 00:39 it should be rare 2009-02-22 00:39 it has been very stable 2009-02-22 00:39 for this early in its life 2009-02-22 00:39 yes 2009-02-22 00:39 this is the first real world use 2009-02-22 00:39 in the last 3 days 2009-02-22 00:40 well, it may not have the uninitialized fields actually 2009-02-22 00:40 because we add present forcely 2009-02-22 00:40 present bits 2009-02-22 00:40 so, open_inode() will initialize inode almost all 2009-02-22 00:41 for now 2009-02-22 00:45 ileaf_split was die 2009-02-22 00:47 was die? 2009-02-22 00:48 kernel panic 2009-02-22 00:48 just now? 2009-02-22 00:48 yes 2009-02-22 00:48 new bug? 2009-02-22 00:48 after passed fsstress 5 times 2009-02-22 00:48 yes 2009-02-22 00:50 0xffffffff80356325 : rep movsb %ds:(%rsi),%es:(%rdi) 2009-02-22 00:50 rdi seems to wrong pointer 2009-02-22 00:51 looks like a negative number 2009-02-22 00:51 well, a messed up number 2009-02-22 00:51 it may be pointer to kernel pages 2009-02-22 00:51 kernel address 2009-02-22 00:52 the ff's fill the high half of the address 2009-02-22 00:52 memcpy(dest->table, leaf->table + split, free - split); 2009-02-22 00:52 it could be a 64/32 bug 2009-02-22 00:53 most probably is 2009-02-22 00:53 it seems it 2009-02-22 00:54 so, I'm compiling linus's latest kernel on a 500 Mhz celeron with 256 MB ram 2009-02-22 00:54 oh 2009-02-22 00:54 it takes about 40 minutes after I reduce the options 2009-02-22 00:54 it is for test for tux3? 
2009-02-22 00:55 now that I have got the machine on the network, I could actually compile on a different machine 2009-02-22 00:55 yes 2009-02-22 00:55 I worked all night last night, to get the wireless configured 2009-02-22 00:55 it is the machine I will give the slide presentation with 2009-02-22 00:55 running on tux3 2009-02-22 00:55 good 2009-02-22 00:55 the main tux3 test machine is a shuttle xpc 2009-02-22 00:55 with a P4 2009-02-22 00:56 somebody has said they will send an 8 core machine to me 2009-02-22 00:56 a high end 2U server 2009-02-22 00:56 that should help a lot 2009-02-22 00:57 presentation on tux3 can be dangerous 2009-02-22 00:57 well, it seems to work almost luckly 2009-02-22 01:06 yes, it's dangerous 2009-02-22 01:06 I have the presentation also on an ext3 partition 2009-02-22 01:06 so far, it did not crash when I did the presentation for my wife ;) 2009-02-22 01:06 good :) 2009-02-22 01:07 the worst issue last night was dealing with the new feature of adding -dirty to the kernel release if it was changed vs linus's release 2009-02-22 01:08 module loading doesn't work properly with a -dirty kernel 2009-02-22 01:08 I suppose LInus never has a -dirty kernel 2009-02-22 01:08 and everybody else runs only git kernels 2009-02-22 01:08 this is a bug that should have been noticed 2009-02-22 01:09 I guess you didn't run modules_install 2009-02-22 01:09 I did in fact 2009-02-22 01:09 did not fix the issue 2009-02-22 01:09 um.. 
2009-02-22 01:09 I started to investigate, spend a couple of hours digging 2009-02-22 01:10 I don't really want to know about kbuild, but now I do ;) 2009-02-22 01:10 -dirty is just tag from git 2009-02-22 01:10 with CONFIG_LOCAL_VERSION or something 2009-02-22 01:10 yes 2009-02-22 01:10 well, I will see if it happens again 2009-02-22 01:11 I turned off the automatic localversion 2009-02-22 01:11 yes 2009-02-22 01:11 I'm also not using it 2009-02-22 01:12 well I have 20 slides and I probably need another 5 or so 2009-02-22 01:13 sounds good 2009-02-22 01:14 is it available after scale7? 2009-02-22 01:14 btw, I found the bug of truncate in userland 2009-02-22 01:15 yes it will be 2009-02-22 01:15 ah 2009-02-22 01:15 it was my stupid bug 2009-02-22 01:17 that is rare ;) 2009-02-22 01:17 yes, anybody doesn't use it probably :) 2009-02-22 01:23 ah, performance test are a bit untrue 2009-02-22 01:24 we are using 64 inodes per ileaf 2009-02-22 01:24 it would be increase blocks for inode 2009-02-22 01:24 well, it's clearly good for testing 2009-02-22 01:28 64 inodes per ileaf is accurate 2009-02-22 01:29 inodes are a nearly 64 bytes in size 2009-02-22 01:29 the main thing that hurts now is two extra blocks for every file, for the btree 2009-02-22 01:29 very few files actually need a btree 2009-02-22 01:50 ah 2009-02-22 01:50 i see 2009-02-22 01:50 I've got the log of bug 2009-02-22 01:50 ileaf_split: dest ffff880019c3f000, leaf ffff880019d54000, inum 4d382, ibase 4d340, icount 0, at 0, split 0, free 67e0 2009-02-22 01:50 it seems free is too big 2009-02-22 02:08 ah 2009-02-22 02:08 ACTION is still making slides 2009-02-22 02:14 yes, I'm trying to get the detail of it 2009-02-22 02:30 ah, icount == 0, it seem to be buffer is wrong 2009-02-22 02:31 it is possible to be CONFIG_DEBUG_PAGEALLOC 2009-02-22 02:31 not sure whether config is buggy or not 2009-02-22 02:31 well, disable it 2009-02-22 02:53 ok, that slides look good enough 2009-02-22 02:53 couple more checks 
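[editor's sketch, not part of the log] The ileaf_split trace shows split 0 and free 67e0, so memcpy(dest->table, leaf->table + split, free - split) would be asked to copy 0x67e0 bytes. The sketch below shows a guard that rejects such offsets before copying. It assumes a 4096-byte block, which the log does not state, and it uses a hypothetical toy_ileaf type rather than the real structure. It is not the fix adopted in tux3; later in the log the crash is traced to ileaf_resize on an empty leaf.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLOCK_SIZE 4096   /* assumed block size, not taken from the log */

/* Hypothetical leaf: just a table of bytes filling one block. */
struct toy_ileaf { unsigned char table[TOY_BLOCK_SIZE]; };

/* Copy table bytes [split, free_off) from leaf to the start of dest.
 * free_off plays the role of "free" in the logged memcpy.
 * Returns the number of bytes copied, or -1 if the offsets cannot be valid. */
static long toy_split_tail(struct toy_ileaf *dest, const struct toy_ileaf *leaf,
                           size_t split, size_t free_off)
{
    if (free_off > sizeof leaf->table || split > free_off)
        return -1;                         /* refuse instead of overrunning */
    memcpy(dest->table, leaf->table + split, free_off - split);
    return (long)(free_off - split);
}

/* Small driver: fills a source leaf with a pattern and checks the copy. */
static long demo_split(size_t split, size_t free_off)
{
    static struct toy_ileaf src, dst;

    for (size_t i = 0; i < TOY_BLOCK_SIZE; i++)
        src.table[i] = (unsigned char)i;

    long n = toy_split_tail(&dst, &src, split, free_off);
    if (n > 0)
        assert(memcmp(dst.table, src.table + split, (size_t)n) == 0);
    return n;
}
```

With the logged values, demo_split(0, 0x67e0) is rejected, while a plausible pair such as demo_split(16, 48) copies 32 bytes.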
2009-02-22 02:53 I'm going to be kind of tired 2009-02-22 02:53 but at least it's not like last time when I stayed up all night 2009-02-22 02:53 finishing the paper 2009-02-22 02:53 ...and didn't even need a paper 2009-02-22 02:54 this year is the work-smarter year 2009-02-22 02:54 slides for a presentation 2009-02-22 02:54 not a paper 2009-02-22 03:07 booting to xfce under tux3... 2009-02-22 03:08 (this takes a while on the old machine) 2009-02-22 03:09 starting openoffice with the tux3 slide show 2009-02-22 03:09 it's up 2009-02-22 03:09 slide show is started 2009-02-22 03:10 it's fast enough to be usable 2009-02-22 03:16 ok, it worked fine 2009-02-22 03:16 I won't try anything different at the talk ;) 2009-02-22 03:50 hey 2009-02-22 03:50 "Tux3: to go where ZFS hasn't gone before and never will" 2009-02-22 03:51 heh 2009-02-22 03:51 use that 2009-02-22 03:51 ACTION just read the backlog 2009-02-22 03:51 the title for the talk is "Tux3: A Shiny new Filesystem for Linux" 2009-02-22 03:51 whatever 2009-02-22 03:52 slides are all finished, machine is shut down and won't be started again until moments before the talk 2009-02-22 03:53 at which time I will once again boot to tux3 and hold my breath 2009-02-22 03:53 how are the metadata crashes going? 2009-02-22 03:53 several known bugs 2009-02-22 03:53 fixed them yet? 2009-02-22 03:53 not all 2009-02-22 03:53 haven't hit any bugs on the vaio yet, though hirofumi fixed a real one today 2009-02-22 03:54 marcin hit a couple yesterday 2009-02-22 03:54 pretty low bug count, really 2009-02-22 03:54 yeah, I was reading, it's good seeing that. Yeah, but it's too early 2009-02-22 03:54 you have to wait a bit and get more testers that need your file system 2009-02-22 03:55 and they can fix them too 2009-02-22 03:55 you're tracking them somehow at least, right? like under an email list thread?
2009-02-22 03:55 we'll send out a "how to fix tux3" email 2009-02-22 03:55 oh god 2009-02-22 03:55 no 2009-02-22 03:55 that's got to be mindblow for some folks with little FS experience 2009-02-22 03:55 fixing broken software is half the fun 2009-02-22 03:56 we wouldn't keep that just for ourselves 2009-02-22 03:57 sleep time 2009-02-22 03:57 precious little of it left 2009-02-22 03:57 all the best 2009-02-22 04:45 -!- kushal(~kushal@121.246.32.94) has joined #tux3 2009-02-22 05:34 -!- cydork(~vihang@59.184.15.155) has joined #tux3 2009-02-22 07:38 -!- cydork(~vihang@59.184.58.82) has joined #tux3 2009-02-22 08:38 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-22 10:39 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-22 12:47 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-22 13:33 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-22 13:36 -!- paola(~paola@151.59.41.35) has joined #tux3 2009-02-22 13:38 -!- paola(~paola@151.59.41.35) has left #tux3 2009-02-22 13:48 -!- cydork(~vihang@59.184.54.226) has joined #tux3 2009-02-22 15:12 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-02-22 15:41 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-22 16:48 -!- flips(~phillips@phunq.net) has joined #tux3 2009-02-22 17:06 -!- flips(~phillips@phunq.net) has joined #tux3 2009-02-22 17:46 flips: how was the presentation? 2009-02-22 17:48 are we awaiting the triumphant return of our fearless leader? 2009-02-22 17:48 :-) 2009-02-22 18:16 razvanm, it went very well 2009-02-22 18:16 tux3 did not crash 2009-02-22 18:17 there was huge applause when I did cat /proc/mounts at the end of the talk 2009-02-22 18:17 awesome 2009-02-22 18:29 and marcin, shapor has a special surprise for you 2009-02-22 18:29 ACTION hides 2009-02-22 18:29 so i hear... 2009-02-22 18:33 hirofumi, here? 
2009-02-22 18:33 now that scale is done, I have in mind a fine hack to add to tux3, to help with debugging the really hard stuff 2009-02-22 18:33 should be able to hack it up quick, like tomorrow 2009-02-22 18:35 and maybe write up some simple instructions for the kernel-debugging-techniques-challenged 2009-02-22 18:35 and I think we should start kernel review this week 2009-02-22 18:35 even without atomic commit fully working 2009-02-22 18:35 it's working enough to know how it impacts the fs structure 2009-02-22 18:35 and efficiency is more than decent 2009-02-22 18:35 yes 2009-02-22 18:35 well 2009-02-22 18:35 with the upcoming patch ;-) 2009-02-22 18:35 I'll write instructions for installing as root next 2009-02-22 18:35 then generic instructions for connecting kgdb 2009-02-22 18:35 that's the best way for really tough stuff 2009-02-22 18:35 well 2009-02-22 18:36 kdb would be even better 2009-02-22 18:36 but linus is still confused about that 2009-02-22 18:36 oh, and vmware is _great_ for this 2009-02-22 18:36 marcin, you should try to get it to hang in vmware 2009-02-22 18:36 if you can, it's gold 2009-02-22 18:37 i'll try, even though doing anything in vm is friggin painful 2009-02-22 18:37 so slow 2009-02-22 18:37 well, in about 12hrs i'll be in rhode island for a week so we'll see if i can do anything before i get back on friday 2009-02-22 18:38 -!- cydork(~vihang@59.184.23.120) has joined #tux3 2009-02-22 18:53 ah 2009-02-22 18:53 well then, kgdb instructions are coming 2009-02-22 18:53 my laptop has a serial port :) 2009-02-22 18:53 a nice quiet machine to debug with 2009-02-22 18:53 i thought you were going to dinner with shap 2009-02-22 18:53 unfortunately, no smp 2009-02-22 18:53 however there is a noisy 8 way on its way 2009-02-22 18:54 yes 2009-02-22 18:54 couple minutes 2009-02-22 18:54 pit stop with the family 2009-02-22 19:12 8 way? :D 2009-02-22 19:12 sweet 2009-02-22 19:12 4 cpu with 2 cores?
2009-02-22 19:13 btw: flips, would you mind uploading the slides to slideshare? :D 2009-02-22 19:15 2 sockets x 4 core 2009-02-22 19:15 the slides will be on tux3.org later tonight 2009-02-22 19:30 great! 2009-02-22 20:12 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-22 20:50 hi 2009-02-22 20:50 hey man 2009-02-22 20:50 hi 2009-02-22 20:51 your slide looked great today at Daniel's talk 2009-02-22 20:51 your graphic, rather 2009-02-22 20:51 oh 2009-02-22 20:51 tux3graph? 2009-02-22 20:51 yeah 2009-02-22 20:51 good :) 2009-02-22 20:51 ~50 people I'd guess 2009-02-22 20:51 maybe more 2009-02-22 20:52 the session wasn't long enough for Daniel to go into much detail 2009-02-22 20:52 i see 2009-02-22 20:53 the versioned pointers explanation too long 2009-02-22 20:53 but that's required info to understand before the rest of the concepts really make sense 2009-02-22 20:53 sure 2009-02-22 20:54 we should work on that explanation for presentation 2009-02-22 20:54 the more people that understand it, hopefully the more good developers we can attract to the project 2009-02-22 20:54 yes 2009-02-22 20:55 There was great interest in the ability to scale from 16k to 1EB 2009-02-22 20:56 i see 2009-02-22 20:57 btw, hg.tux3.org seems to down 2009-02-22 20:57 hmm 2009-02-22 20:57 that's on shapor's server, right? 2009-02-22 20:57 probably, flips's 2009-02-22 20:58 that would be a bad thing 2009-02-22 20:58 ;) 2009-02-22 20:58 yes :) 2009-02-22 20:58 people can't get source 2009-02-22 20:59 its slow or dead? 2009-02-22 20:59 it seems to be dead 2009-02-22 21:00 just sms'd flips 2009-02-22 21:00 he's with shapor having a beer I believe 2009-02-22 21:00 ah, i see :) 2009-02-22 21:01 he'll tell shapor who will attempt a fix from his android phone 2009-02-22 21:01 oh, android 2009-02-22 21:02 :) 2009-02-22 21:02 he give me a hard time about my iphone 2009-02-22 21:03 :) 2009-02-22 21:04 well, I also have small news 2009-02-22 21:04 oh? 
2009-02-22 21:04 batchannel? 2009-02-22 21:04 fsstress test was passed for several hours 2009-02-22 21:04 :) 2009-02-22 21:05 that's awesome 2009-02-22 21:05 yesterday, it was dead few minutes 2009-02-22 21:05 flips' server is down 2009-02-22 21:05 can't ssh in 2009-02-22 21:05 yes 2009-02-22 21:06 ping too 2009-02-22 21:06 ah, ping is working 2009-02-22 21:07 time to migrate to something more robust 2009-02-22 21:07 between scale and lkml, I'm sure he's getting more traffic than ever 2009-02-22 21:07 i see 2009-02-22 21:12 he is going home to fix it soon 2009-02-22 21:13 ACTION is on the G1 at the bar 2009-02-22 21:13 theory is that router is unplugged from modem 2009-02-22 21:15 i see 2009-02-22 21:16 bad night to be unplugged 2009-02-22 21:27 traffic on tux3.org seems normal 2009-02-22 21:27 no news sites linking 2009-02-22 21:28 unless its linking to hg which seems unlikely 2009-02-22 22:28 -!- flips(~phillips@phunq.net) has joined #tux3 2009-02-22 22:38 -!- flips(~phillips@phunq.net) has joined #tux3 2009-02-22 22:38 -!- RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3 2009-02-22 22:38 ok, hg.tux3.org should be back 2009-02-22 22:52 http://virkaz.blogspot.com/2009/02/backup-and-versioning-service-uses.html 2009-02-22 22:56 whoops, I resent the email again 2009-02-22 22:57 btw, now, fsstress is running for 9.5 hours 2009-02-22 23:02 hirofumi, reading the changes 2009-02-22 23:03 thats pretty awesome 2009-02-22 23:04 it is 2009-02-22 23:05 wll hopefully hirofumi is still awake 2009-02-22 23:20 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-02-22 23:38 hi hirofumi 2009-02-22 23:38 hi 2009-02-22 23:38 ok, checking out all the bug fixes 2009-02-22 23:38 tux3 on root was a big hit at scale today 2009-02-22 23:38 didn't crash ;) 2009-02-22 23:38 good :) 2009-02-22 23:39 min/max, good 2009-02-22 23:40 BUG_ON, fine 2009-02-22 23:42 is_expand = size > inode->i_size; <- better 2009-02-22 23:43 ah :) 2009-02-22 23:45 btw, after 
9.5 hours, it seems there is no leak 2009-02-22 23:45 :) 2009-02-22 23:45 with simple free_btree() 2009-02-22 23:46 ileaf_resize on empty leaf triggered on a test? 2009-02-22 23:46 yes 2009-02-22 23:46 it become the cause of crash 2009-02-22 23:46 fsstress create and delete inodes 2009-02-22 23:47 if all inodes on the ileaf was removed, we are leaving it as is 2009-02-22 23:47 ileaf_trim can be shortened by two lines ;) 2009-02-22 23:47 good 2009-02-22 23:48 maybesomething like atdict_nonzero is easier to read than __atdict 2009-02-22 23:49 good paranoia check 2009-02-22 23:50 hirofumi, no problems in the patch set, a couple minor nits 2009-02-22 23:50 looks ready to pull 2009-02-22 23:51 ok 2009-02-22 23:51 however, I'll review myself tonight 2009-02-22 23:51 after that, I'll resend request of pull 2009-02-22 23:52 sure 2009-02-22 23:52 I'll get the slides ready to post 2009-02-22 23:52 I think they're ok 2009-02-22 23:53 good 2009-02-23 00:35 http://tux3.org/docs/tux3.scale.7x.pdf 2009-02-23 00:57 great! :) 2009-02-23 01:08 will there support for internal replication in tux3? 2009-02-23 01:13 internal replication? 
2009-02-23 01:18 e.g. i can overwrite the first 2GB of a 10GB zfs and there is no structural damage besides maybe a bit of data loss due to the default data replication count of 1 2009-02-23 01:21 oh, you mean redundant storage 2009-02-23 01:21 tux3 will rely on the volume manager to do that, however we intend to improve the volume manager 2009-02-23 01:21 nice :) 2009-02-23 01:21 because this is the biggest problem nowadays 2009-02-23 01:21 it is a big problem, yes 2009-02-23 01:21 there is not even the time to format a 16TB raid under linux 2009-02-23 01:22 not necessarily the biggest problem 2009-02-23 01:22 yes 2009-02-23 01:22 we will considerably elaborate the scope of what raid can do 2009-02-23 01:22 so you can only use xfs, but you just need _WAY_ too much ram to fsck a 16TB xfs *g* 2009-02-23 01:23 what zfs does is simply raid, they do tend to hype it to make it seem like more than that 2009-02-23 01:23 ack 2009-02-23 01:24 so, what we intend is to extend the raid interfaces (bio) to allow the filesystem to inquire and specify the raid geometry, per logical extent 2009-02-23 01:24 much better than building that into the filesystem 2009-02-23 01:25 in terms of factoring 2009-02-23 01:25 and this ability can then be used by any filesystem 2009-02-23 01:25 that would be nice 2009-02-23 01:25 it's a few months in the future, when work begins on that 2009-02-23 01:25 so not the whole hd needs to be synced in case of an error on the device?
2009-02-23 01:25 for now, we will gather requirements, and maybe work on the base patches 2009-02-23 01:26 yes, that is the kind of control we wish to expose 2009-02-23 01:26 to the filesytem, and to the admin 2009-02-23 01:27 it would be nice if nothing except the block which gaves an error needs to be synced 2009-02-23 01:31 yes, that's what we want 2009-02-23 01:32 fine grained persistent dirty mapping for md 2009-02-23 01:33 well 2009-02-23 01:33 your request is considerably simpler 2009-02-23 01:33 cool :) 2009-02-23 01:33 it seems straightforward 2009-02-23 02:12 good 2009-02-23 02:12 you folks rock 2009-02-23 02:12 ACTION just finished reading the backlog 2009-02-23 02:30 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-02-23 03:05 good 900 GB in the shuttle now, one pata and one sata 2009-02-23 03:06 s/good/got/ 2009-02-23 04:14 tux3 was faster than ext3? nice :) 2009-02-23 05:01 and I just asked akp;m for a merge ;) 2009-02-23 05:01 http://lkml.org/lkml/2009/2/23/133 2009-02-23 05:05 oh, early 2009-02-23 05:05 I was thinking about after atomic commit 2009-02-23 05:05 sure 2009-02-23 05:05 before or after 2009-02-23 05:05 either way 2009-02-23 05:06 we see how it's going to look now 2009-02-23 05:06 yes 2009-02-23 05:06 worst that can happen is, it's a "no" 2009-02-23 05:06 and that doesn't change anything 2009-02-23 05:07 maybe, it can be mergable 2009-02-23 05:08 let's see what happens 2009-02-23 05:08 we can do a few cleanups, get the git tree set up 2009-02-23 05:08 yes 2009-02-23 05:08 I'm going to add a patch for enhanced tracing I think 2009-02-23 05:08 using the ddlink method 2009-02-23 05:08 well 2009-02-23 05:08 I'll post it on the list 2009-02-23 05:08 in a couple days 2009-02-23 05:09 however, git and hg repo may be thinkable 2009-02-23 05:09 thinkable? 
2009-02-23 05:09 need to rethink 2009-02-23 05:09 well, it's time to set up a kernel tree, with just the kernel code 2009-02-23 05:10 but yes 2009-02-23 05:10 yes 2009-02-23 05:10 let's think about the work flow 2009-02-23 05:10 yes 2009-02-23 05:10 well, it would be later 2009-02-23 05:10 a little later 2009-02-23 05:11 maybe something like: changes made while hacking the kernel go into the git tree, changes made while hacking userspace go into the mercurial tree 2009-02-23 05:11 maybe 2009-02-23 05:12 and we just rsync between them 2009-02-23 05:12 well 2009-02-23 05:12 seems off ;) 2009-02-23 05:12 we can use something better than rsync 2009-02-23 05:12 for getting changes between them 2009-02-23 05:12 it does have to be two repositories 2009-02-23 05:12 or both on git if sync is easy 2009-02-23 05:12 the only question is, should it be two gits, or one git and one mercurial 2009-02-23 05:13 well, first priority I want is atomic commit for now 2009-02-23 05:13 yes 2009-02-23 05:13 we want to know the real performance 2009-02-23 05:14 and we want to be able to use it for real work 2009-02-23 05:14 yes 2009-02-23 05:14 I should start running my dev workstation on tux3 in a couple weeks 2009-02-23 05:14 if everything goes well 2009-02-23 05:15 good 2009-02-23 05:17 if we want to merge to upstream, we would want to good history on git 2009-02-23 05:17 yes 2009-02-23 05:17 it is time to create that 2009-02-23 05:17 and it would be for some days 2009-02-23 05:19 the thing that encouraged me to start the merge review now is, performance is so good 2009-02-23 05:19 much better than I expected 2009-02-23 05:20 I thought the inode table btree would cost a lot 2009-02-23 05:20 and need to be optimized 2009-02-23 05:20 but it doesn't 2009-02-23 05:20 yes 2009-02-23 05:29 ok, well, let's see what happen 2009-02-23 05:29 right 2009-02-23 05:29 I guess I better sleep 2009-02-23 05:29 well, I'd like to concentrate to atomic commit for now, if possible 2009-02-23 05:29 ok 
2009-02-23 05:30 oyasumi 2009-02-23 05:30 oyasumi 2009-02-23 05:30 I will concentrate on atomic commit too 2009-02-23 05:30 but also on getting the merge started 2009-02-23 05:30 ok 2009-02-23 06:04 regarding workflow, you might just do that through git and hg-hooks 2009-02-23 06:47 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-23 06:54 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-23 06:58 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-23 07:56 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-02-23 09:39 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-23 09:40 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-23 11:13 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-23 12:33 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-23 12:47 http://www.phoronix.com/scan.php?page=news_item&px=NzA4Mw 2009-02-23 12:48 data, I will look into hg-hooks 2009-02-23 12:55 hi 2009-02-23 12:56 probe(cursor, info->resume); 2009-02-23 12:56 tree_chop():btree.c is using info->resume 2009-02-23 12:57 it seems strange, well, it wouldn't have actual problem 2009-02-23 12:57 but strange 2009-02-23 13:00 ACTION looks 2009-02-23 13:01 tree_chop is supposed to be able to run a truncate incrementally 2009-02-23 13:01 which explains the into->resume field 2009-02-23 13:01 I think that is a good thing, for very large truncates 2009-02-23 13:01 but we can re-examine that 2009-02-23 13:02 what is for? 
2009-02-23 13:02 well, anyway, initial one would be info->key 2009-02-23 13:02 and re-probe() can use info->resume 2009-02-23 13:02 in ddsnap, we would sometimes have to remove a version from a very large tree of metadata 2009-02-23 13:02 that requires traversing the entire metadata tree 2009-02-23 13:03 it is not good to make other tree operations wait for the entire traversal 2009-02-23 13:03 ah, big one btree with lock? 2009-02-23 13:03 well, the idea is, tree_chop updates info->key 2009-02-23 13:03 yes 2009-02-23 13:03 we will have the same issue with tux3 when versioning is added 2009-02-23 13:04 it's a background full tree edit 2009-02-23 13:04 by the way, tree_chop was designed only for full tree edits, which is the only kind of delete that ddsnap does 2009-02-23 13:05 It may not be quite right for a random delete 2009-02-23 13:05 or for a truncate, I think it may fail to merge some nodes to the left of the truncate point 2009-02-23 13:05 but, why doesn't it do cond_resched_lock() on proper point in tree_chop()? 2009-02-23 13:06 I think we want it to return to caller, not just schedule 2009-02-23 13:06 because a full tree pass might be interrupted by a umount/remount 2009-02-23 13:06 i see 2009-02-23 13:06 in this case, cooperative parallelism seems to be the right approach 2009-02-23 13:07 um..., I can't see exactly case on fs 2009-02-23 13:07 but please examine this idea crtitically 2009-02-23 13:08 ok 2009-02-23 13:08 well, suppose we have a multi-terabyte volume that is heavily versioned, and we delete some version, this might take a few minutes 2009-02-23 13:08 then the user decides to shut down the machine 2009-02-23 13:08 we don't want the user to wait minutes for the shutdown 2009-02-23 13:08 um... 
2009-02-23 13:09 user asked to truncate that file 2009-02-23 13:09 instead, we should just write the index key of the incomplete traverse to the metablock (superblock for now) and shut down immediately 2009-02-23 13:09 this example is not a truncate, it is a version delete 2009-02-23 13:09 but truncate may have similar issues 2009-02-23 13:09 it can be 2009-02-23 13:10 but, it sounds like user may sync version delete was done? 2009-02-23 13:10 user may think 2009-02-23 13:10 well, it's later 2009-02-23 13:10 current issue is tree_chop() is starting at 0 always 2009-02-23 13:12 so, I guess update info->key may be good 2009-02-23 13:13 well, tux3 will report that the version delete was done immediately 2009-02-23 13:13 but it will continue processing the version delete in the background 2009-02-23 13:14 um.., it means shutdown is blocked by someone? 2009-02-23 13:15 we don't want to block shutdown for example 2009-02-23 13:15 is that what you mean? 2009-02-23 13:16 version delete in the background 2009-02-23 13:16 yes 2009-02-23 13:16 but, shutdown is in progress? 2009-02-23 13:16 or umount 2009-02-23 13:17 well it can be implemented as a task maybe 2009-02-23 13:17 without the "info" 2009-02-23 13:17 we should use the best approach 2009-02-23 13:17 it means the background keep going until poweroff? 2009-02-23 13:18 hmm 2009-02-23 13:18 I think we want to stop the background immediately in unmount 2009-02-23 13:18 yes 2009-02-23 13:19 user may unplug storage after umount 2009-02-23 13:19 however, what is in background? 2009-02-23 13:19 we will record the progress of the background delete in a metablock 2009-02-23 13:20 so the progress will be consistent with the last committed delta 2009-02-23 13:20 ah 2009-02-23 13:20 ok, I got 2009-02-23 13:20 real delete does on idle? 
2009-02-23 13:20 yes 2009-02-23 13:20 i see 2009-02-23 13:20 this is an idea from ddsnap 2009-02-23 13:20 ddsnap only has one btree 2009-02-23 13:21 we need to think about the implications of multiple btree deletes in progress at the same time 2009-02-23 13:21 do we do them one at a time, or in parallel? 2009-02-23 13:22 I guess either is good 2009-02-23 13:22 well, it seems gc 2009-02-23 13:22 yes, I think one at a time is a good way to start 2009-02-23 13:22 yes 2009-02-23 13:22 the key is, we tell the user the delete is done as soon as we commit the delta, the actual delete may not have been completed yet 2009-02-23 13:23 yes, it can 2009-02-23 13:23 so we need a mechanism for saying "this inode is truncated to this point", before we actually do the truncate 2009-02-23 13:23 we could use the size attribute for that, maybe 2009-02-23 13:24 one issue is free space 2009-02-23 13:24 yes 2009-02-23 13:24 well, it is not a big issue until the fs is nearly full 2009-02-23 13:24 yes 2009-02-23 13:24 then we can make new allocations wait on pending deletes 2009-02-23 13:24 and can be option if user dislike it 2009-02-23 13:25 I think all users will like a delete that completes immediately 2009-02-23 13:25 but of course we must not return a false ENOSPC 2009-02-23 13:25 ah 2009-02-23 13:26 so, allocation waits on pending deletes if freespace is low 2009-02-23 13:26 well, it may depend on usage 2009-02-23 13:26 however, it sounds like good 2009-02-23 13:27 it needs to be written in a design note 2009-02-23 13:27 probably 2009-02-23 13:27 well, my initial question was simple :) 2009-02-23 13:28 probe(cursor, info->resume); -> probe(cursor, info->key); 2009-02-23 13:28 is ok? 
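A minimal sketch of the incremental chop discussed above, not the tux3 code. A sorted array stands in for the btree, and `toy_tree`, `chop_info`, `chop_start`, and `chop_step` are hypothetical names. It illustrates the first probe starting at the chop key, later probes starting at the saved resume key, and the function returning to its caller after a bounded amount of work, so the caller could record progress (for example in a metablock) and stop at unmount.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a btree: sorted keys plus a live flag per entry. */
struct toy_tree {
	unsigned long *keys;   /* sorted ascending */
	unsigned char *live;   /* 1 while the entry still exists */
	size_t count;
};

/* Hypothetical progress record, loosely mirroring info->key and info->resume. */
struct chop_info {
	unsigned long key;     /* delete everything at or above this key */
	unsigned long resume;  /* key at which the next pass starts probing */
};

/* The initial probe position is the chop key itself. */
void chop_start(struct chop_info *info, unsigned long key)
{
	info->key = key;
	info->resume = key;
}

/*
 * Delete at most `budget` entries, then return to the caller.
 * Returns 1 if work remains (info->resume says where to continue),
 * 0 when the chop is complete.
 */
int chop_step(struct toy_tree *tree, struct chop_info *info, size_t budget)
{
	size_t i = 0;

	assert(info->resume >= info->key);

	/* Stand-in for probe(cursor, info->resume): first key >= resume. */
	while (i < tree->count && tree->keys[i] < info->resume)
		i++;

	for (; i < tree->count; i++) {
		if (!budget) {
			/* Out of budget: remember where to pick up again. */
			info->resume = tree->keys[i];
			return 1;
		}
		if (tree->live[i]) {
			tree->live[i] = 0;
			budget--;
		}
	}
	return 0;
}
```

A caller would loop on `chop_step` while it returns 1, saving `info->resume` between calls. How the saved key would be tied to a committed delta is left out of this sketch.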
2009-02-23 13:28 for now 2009-02-23 13:29 yes 2009-02-23 13:29 ok 2009-02-23 13:29 I'll include this change too for next patchset 2009-02-23 13:30 ok, 14 patches 2009-02-23 13:32 :) 2009-02-23 13:33 and another two bugs are possible 2009-02-23 13:33 inode resuse problem mentioned before 2009-02-23 13:33 and alloc_cursor() problem 2009-02-23 13:33 we are using alloc_cursor() without lock 2009-02-23 13:34 but, alloc_cursor() reads btree->depth 2009-02-23 13:34 ah 2009-02-23 13:34 so, it would need the locking 2009-02-23 13:34 two bugs would be rare case, so it is still pending 2009-02-23 13:35 well I will do some more work on atomic commit 2009-02-23 13:35 good 2009-02-23 13:36 btw, with 14 patches, fsstress seems to work for 9.5 hours or more 2009-02-23 13:36 I just shutdown machine at 9.5 hours 2009-02-23 13:37 so, maybe tux3 is became more hard to crash 2009-02-23 13:39 and hopefully, the atomic commit debugging become easy more or less 2009-02-23 13:40 it's a wonderful result 2009-02-23 13:44 yes 2009-02-23 13:44 the rest of my usual stress test is only "racer" 2009-02-23 13:45 marcin will start testing soon too 2009-02-23 13:45 well, maybe there are bugs, however it seems to good to start new things 2009-02-23 13:45 continue testing 2009-02-23 13:45 yes 2009-02-23 13:45 ileaf stuff can do memory corruption 2009-02-23 13:45 it is a reason for starting the merge review soon 2009-02-23 13:46 how can ileaf corrupt memory? 2009-02-23 13:46 it reads *(dict - 0) 2009-02-23 13:46 ah 2009-02-23 13:47 and it uses the result of it 2009-02-23 13:47 my sloppy code ;) 2009-02-23 13:47 both is wrong 2009-02-23 13:47 :) 2009-02-23 13:47 well, it may be changed situation 2009-02-23 13:48 probably just me missing a case 2009-02-23 13:48 which funtion? 2009-02-23 13:48 well, ileaf assumion was changed 2009-02-23 13:48 ileaf_resize() is that 2009-02-23 13:48 I actually hitted it 2009-02-23 13:49 with your new check? 
2009-02-23 13:49 ugh 2009-02-23 13:49 sorry 2009-02-23 13:49 it was ileaf_split() 2009-02-23 13:49 free = *(dict - icount()) 2009-02-23 13:49 right, split used to split in middle 2009-02-23 13:49 but, icount can be 0 2009-02-23 13:50 now it can split at zero or count as well 2009-02-23 13:50 current public repo? 2009-02-23 13:51 way back 2009-02-23 13:51 it was changed in september or so 2009-02-23 13:52 current one still have free = *(dict - icount()) 2009-02-23 13:52 atdict should work there 2009-02-23 13:52 yes 2009-02-23 13:53 atdict should fix it 2009-02-23 13:53 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-02-23 13:53 14 patches 2009-02-23 13:53 I'll also post as email 2009-02-23 13:54 reading 2009-02-23 13:56 it is similar to the patch set from yesterday 2009-02-23 13:56 I thought it all looked good then 2009-02-23 13:56 thanks 2009-02-23 13:56 implement free_btree is new 2009-02-23 13:57 some patches are slightly modified 2009-02-23 13:57 anything major 2009-02-23 13:57 and free_btree() was there yesterday? 2009-02-23 13:58 last two patches may be new 2009-02-23 14:02 well, so, next is I'll back to document of block fork 2009-02-23 14:02 the point of buffer/page reclaim, and io submit 2009-02-23 14:03 yes 2009-02-23 14:03 is your latest hackfs version still at the same url? 2009-02-23 14:04 yes 2009-02-23 14:04 http://userweb.kernel.org/~hirofumi/hackfs.tar.gz 2009-02-23 14:04 maybe this? 2009-02-23 14:19 yes 2009-02-23 14:19 do you have a userspace version? 
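A minimal sketch of the problem with `*(dict - icount())` when the count is 0, and of a guarded accessor in the spirit of the helper the chat calls atdict. The name `dict_at` and the layout are assumptions for illustration: `dict` is taken to point one past the last dictionary slot, with slot `at` (1-based) stored at `dict[-at]`. Under that layout, `*(dict - 0)` reads one element past the dictionary. On-disk byte order is ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical guarded dictionary read.
 * Slot 0 does not exist, so it is treated as offset 0 rather than
 * dereferencing dict itself, which lies past the end of the dictionary.
 */
static inline unsigned dict_at(const uint16_t *dict, unsigned at)
{
	assert(dict != NULL);
	return at ? dict[-(int)at] : 0;
}
```

With this shape, a split at position 0 gets offset 0 instead of whatever happens to sit after the dictionary.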
2009-02-23 14:20 yes 2009-02-23 14:20 we should merge that now 2009-02-23 14:21 ah, userspace was already merged 2009-02-23 14:21 oh sorry 2009-02-23 14:22 ok, so I can continue the userspace atomic commit with that model 2009-02-23 14:22 yes 2009-02-23 14:22 it is a very small change for the bitmap flush 2009-02-23 14:22 ok 2009-02-23 14:23 btw, simple test of bitmap flush seems to work with fork change 2009-02-23 14:23 after changed fork 2009-02-23 14:23 good 2009-02-23 14:26 9~9~ 2009-02-23 15:03 -!- tim_dimm_(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-23 17:19 hirofumi, ready to pull? 2009-02-23 17:19 ACTION thinks yes 2009-02-23 20:48 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-23 21:15 yes 2009-02-23 22:01 -!- kushal(~kushal@121.246.34.225) has joined #tux3 2009-02-23 22:12 pulled 2009-02-23 22:12 sleeping 2009-02-23 22:13 screw sleeping, i just got off work :/ 2009-02-23 22:13 hi flips 2009-02-23 22:13 -!- cdk(~chinmay@121.246.34.225) has joined #tux3 2009-02-23 22:14 hi kushal 2009-02-23 22:14 marcin, great idea 2009-02-23 22:15 it worked for me 3 days this week, why not try for four? 2009-02-23 22:15 done with the first prototype for dedup in userspace 2009-02-23 22:15 bitbucket repository available here http://bitbucket.org/cdkamat/tux3/ 2009-02-23 22:19 kushal, what kind of results do you see? 2009-02-23 22:20 just a min....will give you the accurate figures... 2009-02-23 22:20 I wonder why it says "4 weeks old" 2009-02-23 22:21 coz it is ... been testing since then .. and have been away for 2 weeks :( 2009-02-23 22:21 we've been busy in that time 2009-02-23 22:22 yes .. trying to catchup 2009-02-23 22:22 the repo has the entire code checked in 2009-02-23 22:23 that is, it does not have the main history 2009-02-23 22:23 so it is hard to see what you changed 2009-02-23 22:24 can you create a bitbucket repository as a clone of some other repository, for example, mine? 
2009-02-23 22:26 its changes to the snapshot before you changed to the git mainline... 2009-02-23 22:26 doing the clone now.. 2009-02-23 22:26 just a min 2009-02-23 22:27 also our changes have a comment /* DREAMZ */ 2009-02-23 22:27 nice name 2009-02-23 22:28 well, you could make a changeset by diffing against your original code and post it to the mailing list 2009-02-23 22:28 how about that? 2009-02-23 22:30 creating the repo.... it'll be done in a while 2009-02-23 22:31 hi all 2009-02-23 22:31 hi shaporj 2009-02-23 22:31 hi shap 2009-02-23 22:31 hi shapor 2009-02-23 22:31 flips: did you already review/merge hirofumi's patch? 2009-02-23 22:32 i'm just catching up on mail now 2009-02-23 22:32 I just did 2009-02-23 22:32 marcin, hirofumi fixed 3 or 4 bugs 2009-02-23 22:32 sounds like it survived quite a test 2009-02-23 22:32 one of them _might_ have been your lockup 2009-02-23 22:33 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-23 22:36 the results are something like this 2009-02-23 22:36 on copying the same 100mb file 3 times 2009-02-23 22:37 without dedup - each copy takes around 3.8 secs with a total of 76803 blocks 2009-02-23 22:37 i'll see what i can do with my gui vm through ssh from a hotel tomorrow, now i'm gonna go pass out, i spent all day in airports 2009-02-23 22:37 with dedup - first copy takes 4.18 and the later 2 take 2.5 secs 2009-02-23 22:38 with a total of 26229 blocks 2009-02-23 22:38 marcin, I will see you in dreamland 2009-02-23 22:40 kushal, I would call that a success 2009-02-23 22:40 i think so...giving you another result... 2009-02-23 22:41 Copying the daily snapshots of tux3 repo for 16 days... 2009-02-23 22:41 without dedup - 3327 blocks 2009-02-23 22:41 with dedup - 1968 blocks 2009-02-23 22:42 wow, nice 2009-02-23 22:42 :) 2009-02-23 22:42 kushal, how long is your project supposed to last? 
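The block counts quoted above work out to roughly 65.8% fewer blocks for the three 100mb copies (76803 down to 26229) and roughly 40.8% fewer for the 16 daily snapshots (3327 down to 1968). A small sketch of that arithmetic follows. The function name `dedup_saved_permille` is made up for illustration, and it uses integer math that rounds down.

```c
#include <assert.h>

/*
 * Blocks saved by dedup, in tenths of a percent of the baseline.
 * Returns 0 if dedup used more blocks than the baseline.
 */
unsigned dedup_saved_permille(unsigned long without, unsigned long with)
{
	assert(without > 0);
	if (with > without)
		return 0;
	return (unsigned)((without - with) * 1000 / without);
}
```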
2009-02-23 22:43 another 3 weeks....currently working on collision handling and kernel porting 2009-02-23 22:44 what are you plans when finished? 2009-02-23 22:44 will have final yr exams which will take close to a month... 2009-02-23 22:44 abort: repository static-http://userweb.kernel.org/~hirofumi/tux3/.hg not found! 2009-02-23 22:45 what was the workaround for that 2009-02-23 22:45 upgrade hg? 2009-02-23 22:45 then would like to contribute more to tux3 2009-02-23 22:45 :) 2009-02-23 22:45 kushal, then graduating with bachelor degrees? 2009-02-23 22:45 yes 2009-02-23 22:45 shapor, that's one workaround I think 2009-02-23 22:45 there's also a patch somewhere 2009-02-23 22:46 I'd be hard pressed to find it 2009-02-23 22:46 i know this worked on this box before though 2009-02-23 22:46 upgrade 2009-02-23 22:46 i ended up blowing away all my installs this past weekend 2009-02-23 22:46 yes, it's a version mismatch 2009-02-23 22:46 so i dont know whats what 2009-02-23 22:46 fscking hg 2009-02-23 22:46 I'm running 1.0.1 2009-02-23 22:46 that has the bug fix 2009-02-23 22:46 yeah 2009-02-23 22:46 thats what happened 2009-02-23 22:47 my "upgrade" blew away my manually installed hg 2009-02-23 22:47 woo 2009-02-23 22:53 time to sleep 2009-02-23 22:53 bitbucket repo http://bitbucket.org/kushal/tux3/ 2009-02-23 22:53 checked against Jan 26 snapshot 2009-02-23 22:54 -!- gaurav(~gaurav@121.246.34.225) has joined #tux3 2009-02-23 22:55 -!- amey(~amey@121.246.34.225) has joined #tux3 2009-02-23 22:55 you seem to have added some binary files to your repo 2009-02-23 22:56 shapor, could you check out kushal's repo and make some suggestions? 
2009-02-23 22:56 -!- RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3 2009-02-23 22:56 some way to set up the repo so that their changes are obvious 2009-02-23 22:57 yeah i'm looking at it 2009-02-23 22:57 theres really no version history 2009-02-23 22:58 just the jan 26 base + 2 commits 2009-02-23 22:58 so its probably easiest just to rebase it against a jan 26 hg clone 2009-02-23 23:00 what I thought 2009-02-23 23:00 is it possible to make clones at bitbucket? 2009-02-23 23:01 yes 2009-02-23 23:01 you "fork a project" 2009-02-23 23:01 that's what they should do then 2009-02-23 23:01 but i dont know if you can do it against a point in the past 2009-02-23 23:01 i think so 2009-02-23 23:01 you can clone, then make a branch against a particular version and develop on it 2009-02-23 23:02 unfortunately my bitbucket repo is a bit broken 2009-02-23 23:02 due to that branch i can't merge down due to needing to reinitialize the repo 2009-02-23 23:02 i was going to email them about that 2009-02-23 23:03 good night 2009-02-23 23:03 flips: k, get sleep 2009-02-23 23:03 oh btw before you go 2009-02-23 23:03 http://www.phoronix.com/scan.php?page=news_item&px=NzA4Mw 2009-02-23 23:04 "Kernel patches for the Tux3 file-system are supposed to be in a Git tree within the next few days." 
2009-02-23 23:04 so I'd better do that 2009-02-23 23:04 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-23 23:04 I'll make the repo tomorrow 2009-02-23 23:04 ok :) 2009-02-23 23:04 and we can see if there's a way to put history in it 2009-02-23 23:04 let me know if you want to set it up on my server 2009-02-23 23:05 I've got an account on kernel.org 2009-02-23 23:05 ah 2009-02-23 23:05 right 2009-02-23 23:05 much mo betta 2009-02-23 23:05 tim_dimm: now its a party 2009-02-23 23:05 we will set it up on phunq.net, git it right, the copy it to kernel.org 2009-02-23 23:05 get it right I mean ;) 2009-02-23 23:05 ;) 2009-02-23 23:05 sleep 2009-02-23 23:05 yah 2009-02-23 23:05 night 2009-02-23 23:05 what's a party? 2009-02-23 23:06 now that you're here 2009-02-23 23:06 only momentarily 2009-02-23 23:06 I'll be the one passed out on the floor of the chat room 2009-02-23 23:06 shapor, would do you suggest? 2009-02-23 23:07 what do you suggest? 2009-02-23 23:08 kushal: clone the official repo, then make a branch at the revision you make your changes on 2009-02-23 23:08 and copy your files over 2009-02-23 23:08 then you should be able to do a commit 2009-02-23 23:08 and get the full revision history + your changes 2009-02-23 23:09 official repo being http://hg.tux3.org/tux3 2009-02-23 23:09 ok...only the files that we have made changes to...rather than what i did now :( 2009-02-23 23:09 yeah its just important to put your changes on top of a repo rather than a tarball or whatever 2009-02-23 23:09 then after that if your feeling lucky you can try to merge it and see if it still works ;) 2009-02-23 23:10 ok 2009-02-23 23:10 will do it now 2009-02-23 23:12 i was trying to look at the code in the bitbucket repo you created, looks like something might be missing? 
http://bitbucket.org/kushal/tux3/src/tip/user/dedup.c 2009-02-23 23:12 87 bytes sounds pretty small 2009-02-23 23:12 even for a demo ;) 2009-02-23 23:13 its only for the consistency...the actual code is in kernel/dedup.c 2009-02-23 23:14 ah 2009-02-23 23:24 seems simple enough 2009-02-23 23:25 yes it is... but it works... :) 2009-02-23 23:25 yeah simple is good :) 2009-02-23 23:29 what happens if a bucket overflows? 2009-02-23 23:30 currently its capacity is 100 entries....after that we allocate a new bucket 2009-02-23 23:30 we need to update the capacity to the max possible 2009-02-23 23:31 so after the current write bucket gets full...a new bucket is allocated as the current write bucket 2009-02-23 23:56 shapor : i need to branch it at this revision http://hg.tux3.org/tux3/rev/58e077f83dc1 2009-02-23 23:56 so hg branch 58e077f83dc1 2009-02-23 23:56 ? 2009-02-24 00:03 i think you just "hg checkout 58e077f83dc1" 2009-02-24 00:03 same as "update" 2009-02-24 00:04 to switch to that revision 2009-02-24 00:04 ok 2009-02-24 00:04 and you work on it as a branch 2009-02-24 00:04 but i'm not an hg expert by any stretch of the imagination :) 2009-02-24 00:05 ok...will do it and upload to the same repo /kushal/tux3 2009-02-24 00:05 bye 2009-02-24 00:05 cdk btw i was able to generate a patch 2009-02-24 00:05 based off the repo you had uploaded 2009-02-24 00:05 thanks 2009-02-24 00:06 just did "hg clone http://bitbucket.org/kushal/tux3" 2009-02-24 00:06 then "hg diff 0" 2009-02-24 00:06 should be sufficient for a preliminary review 2009-02-24 00:06 cdk: want me to post it to the list, or will you? 2009-02-24 00:07 will do it 2009-02-24 00:07 awesome, thanks! 
:) 2009-02-24 03:48 -!- chesse(~eworm@dslb-084-062-190-157.pools.arcor-ip.net) has joined #tux3 2009-02-24 03:50 -!- amey(~amey@121.246.34.225) has joined #tux3 2009-02-24 08:08 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-24 08:27 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-24 08:43 -!- kushal(~kushal@121.246.34.225) has joined #tux3 2009-02-24 08:57 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-24 09:17 -!- kushal(~kushal@115.109.9.213) has joined #tux3 2009-02-24 09:47 -!- cdk(~chinmay@115.109.9.213) has joined #tux3 2009-02-24 09:58 -!- cdk(~chinmay@115.109.10.217) has joined #tux3 2009-02-24 10:06 -!- kushal(~kushal@115.109.10.217) has joined #tux3 2009-02-24 10:43 -!- amey(~amey@115.109.10.217) has joined #tux3 2009-02-24 10:56 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-24 13:23 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-02-24 13:34 hi konrad 2009-02-24 13:35 hi 2009-02-24 13:35 sending you a link regarding the openssl licensing 2009-02-24 13:35 http://www.openssl.org/support/faq.html#LEGAL2 2009-02-24 13:36 kushal: I think that applies to userland programs 2009-02-24 13:36 the kernel is different 2009-02-24 13:37 yes...we will be implementing the SHA-1 algo before the kernel port 2009-02-24 13:37 ok :) 2009-02-24 13:37 there is no problem then :) 2009-02-24 13:37 this was just to get things up and running :) 2009-02-24 13:38 ok, cool 2009-02-24 14:07 hehe 2009-02-24 14:07 oups... 
wrong channel 2009-02-24 16:21 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-24 18:43 hi 2009-02-24 18:44 today's project: add a tux3 ddlink 2009-02-24 18:44 ddlink is a cool way to do user kernel connections 2009-02-24 19:06 -!- chesse_(~eworm@dslb-084-062-166-102.pools.arcor-ip.net) has joined #tux3 2009-02-24 20:05 hey marcin 2009-02-24 20:06 -!- chesse(~eworm@dslb-084-062-145-174.pools.arcor-ip.net) has joined #tux3 2009-02-24 20:40 hey 2009-02-24 20:41 hi 2009-02-24 20:41 you're going to like ddlink 2009-02-24 20:42 perfect way for getting traces of tux3 activity 2009-02-24 20:42 never mind security logs ;) 2009-02-24 20:42 oh? do tell 2009-02-24 20:43 I'm just sniffing around for my most recent patch 2009-02-24 20:43 and I'll roll it for tux3 2009-02-24 21:00 -!- cdk(~chinmay@115.109.13.18) has joined #tux3 2009-02-24 21:06 -!- cdk_(~chinmay@115.109.9.5) has joined #tux3 2009-02-24 21:24 -!- flips(~daniel@phunq.net) has joined #tux3 2009-02-24 21:46 -!- flipz(~phillips@phunq.net) has joined #tux3 2009-02-24 23:07 -!- RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3 2009-02-25 01:49 hi hirofumi 2009-02-25 01:49 hi 2009-02-25 01:49 I'm preparing an experimental patch to use ddlink 2009-02-25 01:50 yes 2009-02-25 01:50 to help monitor filesystem behavior 2009-02-25 01:50 http://lwn.net/Articles/271805/ 2009-02-25 01:51 probably, trace stuff may be trend 2009-02-25 01:55 well it is not just for tracing 2009-02-25 01:55 but later to control snapshots 2009-02-25 01:56 and also to get extended logging information from the fs 2009-02-25 01:56 logging is log of trace? 2009-02-25 01:56 logging might be for security audit 2009-02-25 01:56 ah 2009-02-25 01:57 well, my concerned point is 2009-02-25 01:57 some of lkml people may dislike it... 
2009-02-25 01:58 possibly 2009-02-25 01:58 and more, it may prevent to merge tux3 to upstream 2009-02-25 01:58 can keep it as an out of tree patch until at least akpm and linus like it 2009-02-25 01:59 it sounds good 2009-02-25 01:59 also, it is good for replication 2009-02-25 01:59 yes 2009-02-25 02:00 I just want to it or something become generic interface in kernel 2009-02-25 02:00 I want that it/something become generic/common interface 2009-02-25 02:00 not tux3 specific 2009-02-25 02:00 yes, and it will have a better chance if we use it, and therefore improve it 2009-02-25 02:01 yes, it has many uses 2009-02-25 02:01 the original use case was nfs 2009-02-25 02:01 rpc_pipefs 2009-02-25 02:01 which was the model 2009-02-25 02:01 yes 2009-02-25 02:01 but ddlink is better, and could be used to transparently improve rpc_pipefs 2009-02-25 02:02 sounds good 2009-02-25 02:02 however, it may be different story with tux3 2009-02-25 02:03 (I meant it can submit without tux3) 2009-02-25 02:04 however, we will say, we want this interface or something that 2009-02-25 02:04 it will not be part of our initial submission 2009-02-25 02:04 sure 2009-02-25 02:04 it will arrive as part of versioning, possibly after merge 2009-02-25 02:05 it sounds good 2009-02-25 02:09 time to make a tux3_ioctl 2009-02-25 02:10 the way userspace obtains a ddlink is, ioctl(tux3fd, 0xdd); 2009-02-25 02:14 hirofumi, do you suppose we should stop supplying the default \n in tracing functions? 
2009-02-25 02:14 that is, add \n to ever tracing format string 2009-02-25 02:15 I'm not sure which is good 2009-02-25 02:16 it is not the most urgent thing 2009-02-25 02:16 if we want non-one-liner message, default \n would bother us 2009-02-25 02:16 however, it would be rare 2009-02-25 02:17 it is rare 2009-02-25 02:17 but when it is wanted, there is no way to do it 2009-02-25 02:19 well, personally, I will not use trace almost 2009-02-25 02:20 because, I will want to trace of exactly bug case 2009-02-25 02:20 however, log would be slow to get bug 2009-02-25 02:21 and, new trace infrastructure would need help of userspace, I guess 2009-02-25 02:21 I rely on trace for my own development style 2009-02-25 02:21 well I am not proposing to change the existing trace 2009-02-25 02:22 that is, it can continue to go through kprint 2009-02-25 02:22 yes 2009-02-25 02:22 change that would be a lot of work for no obvious benefit 2009-02-25 02:22 probably 2009-02-25 02:23 however, maybe, it's good for optimization 2009-02-25 02:23 trace infrastructure may can trace the bottoleneck point 2009-02-25 02:24 well, so, I'm just not sure :) 2009-02-25 02:26 well we should change kernel/trace.h to use whatever infrastructure is available 2009-02-25 02:26 and keep our trace calls compatible with userspace 2009-02-25 02:27 probably 2009-02-25 02:31 ok, ddlink compiles 2009-02-25 02:31 damm, how do I turn off spell checking in xchat ;) 2009-02-25 02:33 it seems input box configuration 2009-02-25 02:33 ? 
2009-02-25 02:33 oh 2009-02-25 02:34 xchat 2009-02-25 02:34 thanks 2009-02-25 02:40 -!- gila(~gila@62-177-200-122.dsl.bbeyond.nl) has joined #tux3 2009-02-25 02:52 -!- samlh(~sam@67.129.121.145) has joined #tux3 2009-02-25 03:10 well I successfully created a ddlink by ioctling the mount point of a tux3 filesystem 2009-02-25 03:10 tomorrow I will do something with that ddlink 2009-02-25 03:11 good 2009-02-25 03:11 I'm still writing doc for block fork 2009-02-25 03:12 I'm obviously slow to write doc :) 2009-02-25 03:24 :) 2009-02-25 03:31 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2009-02-25 05:17 -!- cdk(~chinmay@115.109.12.17) has joined #tux3 2009-02-25 06:19 man i missed all the good conversations about both file allocations and hash issues :/ 2009-02-25 06:49 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-25 07:57 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-25 08:34 -!- tim_dimm_(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-25 08:39 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-25 09:38 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-25 09:56 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-25 11:23 hi flips 2009-02-25 11:24 -!- gaurav(~gaurav@59.95.23.124) has joined #tux3 2009-02-25 12:37 hi cdk 2009-02-25 14:36 -!- dcg(~dcg@7.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-02-25 15:27 do we actually need CONFIG_COMPAT in tux3? 
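A minimal userspace sketch of obtaining a ddlink by ioctl on the mount point, as described above. Only the request number 0xdd comes from the chat. The names `open_ddlink` and `TUX3_DDLINK_IOCTL` are hypothetical, and the sketch assumes that a successful ioctl returns a new file descriptor for the link, which the chat does not spell out.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define TUX3_DDLINK_IOCTL 0xdd /* request number quoted in the chat */

/*
 * Open the mount point and ask for a ddlink.
 * Assumption: on success the ioctl return value is the ddlink descriptor.
 * Returns that descriptor, or -1 on any failure.
 */
int open_ddlink(const char *mountpoint)
{
	int fd, link;

	assert(mountpoint != NULL);
	fd = open(mountpoint, O_RDONLY);
	if (fd < 0)
		return -1;
	link = ioctl(fd, TUX3_DDLINK_IOCTL);
	close(fd); /* the mount point descriptor is no longer needed here */
	return link < 0 ? -1 : link;
}
```

On a filesystem that does not implement the request, the ioctl is expected to fail and the function returns -1.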
2009-02-25 15:48 ok, ddlink demo is working 2009-02-25 15:48 patch and writeup coming soon 2009-02-25 16:28 microsoft has declared open war on linux 2009-02-25 16:29 http://lwn.net/Articles/320737/#Comments 2009-02-25 16:29 "Microsoft sues TomTom" 2009-02-25 16:29 in part, suing over vfat 2009-02-25 18:50 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2009-02-25 19:18 good evening 2009-02-25 19:18 hi 2009-02-25 19:19 just about got ddlink ready for posting 2009-02-25 19:19 just writing stories about it now 2009-02-25 19:22 cool, looking forward to reading it 2009-02-25 19:22 you wanna talk hashes later? 2009-02-25 19:23 yes 2009-02-25 19:23 on the channel please 2009-02-25 19:23 the chat on the mailing list has been good 2009-02-25 19:24 i know, that's why i wanna talk more 2009-02-25 19:25 actually lemme read through it again, it got long and convoluted 2009-02-25 19:40 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-25 19:49 A better approach would be to design simple, robust kernel interfaces 2009-02-25 19:49 which make sense and which aren't made all complex by putting the user 2009-02-25 19:49 interface in kernel space. And to maintain corresponding userspace 2009-02-25 19:49 tools which manipulate and present the IO from those kernel interfaces. 2009-02-25 19:49 But we don't do that, because userspace is hard, because we don't have 2009-02-25 19:49 a delivery process. But nobody has even tried! 2009-02-25 19:49 -- akpm 2009-02-25 19:49 sounds like he's describing ddlink 2009-02-25 19:51 ACTION puts on security hat 2009-02-25 19:52 where'd you get the akpm quote? 2009-02-25 19:52 why would you want any user interface to kernel space? havnet driver exploits taught us anything? 
2009-02-25 20:00 http://lwn.net/Articles/320656/ 2009-02-25 20:01 akpm is making the rather obvious point that parsing and text formatting do not belong in kernel, which some kernel devs seem not to understand 2009-02-25 20:05 absolutely, format string exploits are easy 2009-02-25 20:06 printf in kernel would open gates of new bullshit 2009-02-25 20:06 there should be NO direct access to data from user to ring0 2009-02-25 20:07 read on the new TXT intel released, it's got some interesting registeres that cna be only accessed by the means of crypto ops, no direct access 2009-02-25 20:07 _VERY_ smart 2009-02-25 20:08 Joanna of course 0wned it already 2009-02-25 20:08 well it's not just the security issue but the code / program text blaot 2009-02-25 20:08 bloat 2009-02-25 20:08 TXT ? 2009-02-25 20:08 yes, but bloat is aesthetic, security is functional 2009-02-25 20:09 trusted execution something 2009-02-25 20:09 it's more about the maintainability 2009-02-25 20:09 more code = more bugs 2009-02-25 20:09 yes 2009-02-25 20:09 big bloated interfaces have proven to be bountiful sources of smp bugs 2009-02-25 20:09 but it doesnt provide you wiht a new framework of exploitation ;) 2009-02-25 20:10 oh yes it does 2009-02-25 20:10 well, not directly at least 2009-02-25 20:10 "directly" means? 2009-02-25 20:10 one thing security and kernel people agree on is that bloat==0wnage 2009-02-25 20:10 oh, you mean like /proc/kmem 2009-02-25 20:10 that's for babies 2009-02-25 20:11 bloat MIGHT (highly likely) introduce bugs, thus exploits 2009-02-25 20:11 a clearly defined interface of dumping data into kenrel from userspace, is , quite literally, highway to hell 2009-02-25 20:13 it's a fortress with a nice tunnel built for the enemy to dump flaming shit through 2009-02-25 20:16 so, besides /proc/kmem, what else is there? 
2009-02-25 20:17 i dont understand your stuff yet, but you sound like you want a pipe to throw commands at, and that pipe talks to kernel 2009-02-25 20:18 yes 2009-02-25 20:18 and the kernel implementation is responsible for enforcing security 2009-02-25 20:18 just like every kernel interface 2009-02-25 20:19 i'm extra sensitive to this direct kernel access after johnny cache owned macs, some other wifi cars, and i think some bluetooth 2009-02-25 20:19 for example, with tux3 we want to check that that the user has write access to the filesystem device before allowing filesystem-destroying commands 2009-02-25 20:19 so what's your security model for the ddlink? 2009-02-25 20:20 in other words, if the user can prove they have a license to lay waste anyway, we let them do what they want 2009-02-25 20:20 within the scope of what the interface provides 2009-02-25 20:20 which can be very limited 2009-02-25 20:20 how is that authentication done? pam? 2009-02-25 20:20 unix 2009-02-25 20:20 you open it, you got it 2009-02-25 20:20 yuck 2009-02-25 20:20 well, actually we will check after open 2009-02-25 20:20 because anybody can open a ddlink 2009-02-25 20:21 well actually 2009-02-25 20:21 we set a flag in the ddlink created, setting the ability of the opener 2009-02-25 20:21 ok, can you give me an example of how this would work? 2009-02-25 20:21 it's up to the particular ddlink implementor 2009-02-25 20:21 tux3 is a good example 2009-02-25 20:22 suppose we have a "remove an inode regardless of permission" command accessible via ddlink, for the purpose of filesystem repair 2009-02-25 20:22 fsck? 2009-02-25 20:22 more targetted than fsck 2009-02-25 20:22 precision repair 2009-02-25 20:22 driven by the admin 2009-02-25 20:22 ok, fine granular fsck 2009-02-25 20:23 more like tune2fs 2009-02-25 20:23 ok, so a param that deals with the entire fs? 
2009-02-25 20:23 anyway we will not allow that command to execute unless the user has write access to the device the filesystem is mounted on 2009-02-25 20:23 which means they can obviously do what they want with or without ddlink's help 2009-02-25 20:24 we can make it as finegrained as we want 2009-02-25 20:24 so it sounds like you let the regular perms/ownership do its job? 2009-02-25 20:24 it's not defined by ddlink itself, but by the implementor of a particular ddlink instance 2009-02-25 20:24 well in this case, regular perm mechanism does not automagically cover it 2009-02-25 20:25 when the ddlink is created, the module implementor must implement a special check for write access 2009-02-25 20:25 that is, if they want anybody besides root to be able to use this interface 2009-02-25 20:25 traditionally, most such interfaces have been restricted to root 2009-02-25 20:25 but we can do better 2009-02-25 20:26 that sounds like you're shipping off security to other, user written modules? 2009-02-25 20:26 you're not giving me any warm'n'fuzzies here :/ 2009-02-25 20:28 incorrect interpretation 2009-02-25 20:28 it is the responsibility of the module implementor, who is using ddlink as a transport, to implement appropriate security 2009-02-25 20:28 i sure hope so ;) 2009-02-25 20:29 this is not different from any other interface 2009-02-25 20:29 same is true of ioctl 2009-02-25 20:31 ok, so how do you see the balance of pretty printing and parsing that the lkml article mentioned? 2009-02-25 20:31 you dotn want raw data, and you dont want full parsers 2009-02-25 20:31 so where is the happy middle? 
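[Editor's sketch] The policy described in this exchange (record the opener's ability in a flag when the link is created, then refuse destructive commands unless the flag is set) might look roughly like the following. This is a userspace simulation only, not ddlink or tux3 code; `link_state`, `link_create`, `link_check` and the command codes are all hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical command codes; not taken from tux3 or ddlink. */
enum cmd { CMD_STAT = 1, CMD_REMOVE_INODE = 2 };

/* Hypothetical per-link state, filled in once when the link is created. */
struct link_state {
    bool may_destroy; /* opener is root or had write access to the device */
};

/* Record the opener's ability at creation time. */
static struct link_state link_create(bool is_root, bool can_write_device)
{
    struct link_state link = { .may_destroy = is_root || can_write_device };
    return link;
}

/* Return 0 if the command may run, -1 if it must be refused. */
static int link_check(const struct link_state *link, enum cmd cmd)
{
    switch (cmd) {
    case CMD_STAT:
        return 0;                          /* harmless, always allowed */
    case CMD_REMOVE_INODE:
        return link->may_destroy ? 0 : -1; /* destructive, needs the flag */
    }
    return -1;                             /* unknown commands are refused */
}
```

The check lives with the implementor of the particular link, which matches the point made in the log that the transport itself defines no policy.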
2009-02-25 20:33 ddlink sends well defined structs over the pipe-like object 2009-02-25 20:33 binary objects 2009-02-25 20:33 no bullshit ascii text ;) 2009-02-25 20:33 i like it less as bins 2009-02-25 20:34 ascii is at least constrained and it ends with nul 2009-02-25 20:34 binaries are a free-for-all 2009-02-25 20:34 ascii leads to bloated parsers with copious buffer overruns 2009-02-25 20:34 exploit farm 2009-02-25 20:34 and sending entire payloads as a binary blob is ok? 2009-02-25 20:34 all ioctl interfaces are binary free for alls then 2009-02-25 20:34 it's not a binary blob 2009-02-25 20:35 it's a well defined structure 2009-02-25 20:35 fields have to be checkd of course 2009-02-25 20:35 structures in C arent delimited, they're just a linear chunk of memory, easily overwritten 2009-02-25 20:36 well then our kernel can't possibly work 2009-02-25 20:36 because the kernel is full of them 2009-02-25 20:37 i didnt say it doesnt work, i'm just saying there's some fundamental approaches that make my iris pucker ;) 2009-02-25 20:37 you're saying we should convert the entire kernel to ascii 2009-02-25 20:38 oh no, then exploits just need one more mapping exercise and we're back to square 1 2009-02-25 20:38 it's a bandaid on gangrene 2009-02-25 20:38 ok, ddlink + ddlink library is 1800 bytes of kernel code 2009-02-25 20:39 is it online yet? 2009-02-25 20:39 getting close 2009-02-25 20:39 I'm just doing some measurements 2009-02-25 20:39 and finishing up the writeup 2009-02-25 20:40 ok, i'm just thinking i'm not getting a crucial part here, missing a good bit 2009-02-25 20:40 but in general, user2kernel transitions in traditional unix environments are scary to me 2009-02-25 20:42 well you need to look at the ioctl mess this will replace 2009-02-25 20:42 and also the ascii clusterfuck that is sysfs 2009-02-25 20:42 i am reading about it as we speak 2009-02-25 20:49 so what do the openbsd nazis do for userland2kernelspace comms? 
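[Editor's sketch] The "well defined structure, fields have to be checked" position can be illustrated with a fixed-layout message and a validating parser. The log does not show the real ddlink message format, so the layout, the opcodes and `msg_parse` below are assumptions made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MSG_MAX_PAYLOAD 32

/* Hypothetical fixed-layout message; not the actual ddlink format. */
struct msg {
    uint32_t opcode;
    uint32_t length;                  /* bytes of payload actually used */
    uint8_t payload[MSG_MAX_PAYLOAD];
};

/* Hypothetical opcodes. */
enum { OP_PING = 1, OP_SNAPSHOT = 2, OP_LAST = OP_SNAPSHOT };

/* Copy an untrusted buffer into *out and validate every field.
 * Returns 0 on success, -1 if anything is out of range. */
static int msg_parse(const void *buf, size_t size, struct msg *out)
{
    if (size != sizeof *out)
        return -1;                    /* wrong size: reject before copying */
    memcpy(out, buf, sizeof *out);
    if (out->opcode < OP_PING || out->opcode > OP_LAST)
        return -1;                    /* unknown command */
    if (out->length > MSG_MAX_PAYLOAD)
        return -1;                    /* length would overrun the payload */
    return 0;
}
```

Because the size is fixed and every field is range-checked before use, there is no parser state to get wrong, which is the contrast with ascii interfaces drawn in the discussion.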
2009-02-25 21:16 marcin, posted 2009-02-25 21:16 hmm 2009-02-25 21:16 bsd still lives pretty much in ioctl land 2009-02-25 21:16 they like the smell of their own shit ;) 2009-02-25 21:47 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-02-25 21:55 flipz: nice, did you forget to cc lkml? 2009-02-25 22:10 no 2009-02-25 22:10 not until we're happy with it 2009-02-25 23:07 flipz, there? 2009-02-25 23:07 hi hirofumi 2009-02-25 23:07 hi 2009-02-25 23:08 I read a bit ddlink 2009-02-25 23:08 I expect you will tear it apart ;) 2009-02-25 23:09 :) 2009-02-25 23:10 ddlink() seems sys_open() as functionality 2009-02-25 23:10 it is similar, but the ddlink does not have a name 2009-02-25 23:10 it is also similar to an anonymous inode 2009-02-25 23:10 yes 2009-02-25 23:11 but returns a different sized inode 2009-02-25 23:11 I was thinking it will have the name 2009-02-25 23:11 well 2009-02-25 23:11 it is jumping the vfs 2009-02-25 23:11 that is a nice thing about it, it doesn't have a name 2009-02-25 23:11 the ddlink is returned to the user by any convenient mechanism 2009-02-25 23:12 looking up by name has issues 2009-02-25 23:12 such as namespace 2009-02-25 23:12 creating the namespace 2009-02-25 23:12 additional complexity 2009-02-25 23:12 with no purpsoe 2009-02-25 23:12 purpsoe 2009-02-25 23:12 ACTION can't type 2009-02-25 23:12 well, yes 2009-02-25 23:13 but, it jumps permission check too 2009-02-25 23:13 yes, the implementation has to do the permission check 2009-02-25 23:14 in the case of tux3, I think we want to check the user's permission against the device the file is mounted on 2009-02-25 23:14 assuming that the ddlink allows operations like snapshot 2009-02-25 23:16 um.. 2009-02-25 23:17 it creates fd by per-driver(ddlink user) security policy 2009-02-25 23:17 that is the proposal 2009-02-25 23:18 i see 2009-02-25 23:19 user have ddlink fd, but tux3 was umounted 2009-02-25 23:19 ah yes 2009-02-25 23:20 it may have many corner cases... 
2009-02-25 23:20 that is a nice one :) 2009-02-25 23:21 nice one? 2009-02-25 23:21 probably, we want the ddlink create to do a mnt_get, and mnt_put on close 2009-02-25 23:21 nice one -> very good question to raise 2009-02-25 23:21 ok :) 2009-02-25 23:21 no so easy to answer 2009-02-25 23:21 I think, mnt_get/mnt_put is the answer 2009-02-25 23:24 maybe 2009-02-25 23:26 well, it seems the way to extend ioctl (in the case of tux3) 2009-02-25 23:26 that means, umount would report the filesystem is busy on attempt to umount when a ddlink is still open 2009-02-25 23:26 yes 2009-02-25 23:27 I guess it will work like file is openning of tux3 2009-02-25 23:29 another one is ddlink_error() 2009-02-25 23:29 it is going to be used for error report? 2009-02-25 23:29 yes, it is supposed to be the main method 2009-02-25 23:30 read/write/ioctl/poll also return error codes 2009-02-25 23:30 it is the decision of the implementation whether to also push an error string 2009-02-25 23:31 but, it is queueing 2009-02-25 23:31 errors are not queued 2009-02-25 23:31 they are pushed to the front of the queue 2009-02-25 23:32 so if there are other items queue, the error is seen first 2009-02-25 23:32 but, user may not read it? 2009-02-25 23:32 user should read it if they get an error from a write 2009-02-25 23:33 it means that fd can't use by thread? 
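[Editor's sketch] The proposed answer to the "user still holds the fd after umount" corner case is for link creation to take a reference on the mount and for close to drop it, so that umount reports busy while a link is open. The following is a plain userspace simulation of that reference counting; `mount_sim`, `link_open`, `link_close` and `try_umount` are hypothetical names and do not correspond to kernel functions.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a mounted filesystem. */
struct mount_sim {
    int pins;     /* number of open links holding the mount */
    int mounted;  /* 1 while mounted, 0 after a successful unmount */
};

/* Creating a link takes a reference on the mount. */
static void link_open(struct mount_sim *m)
{
    m->pins++;
}

/* Closing the link drops the reference again. */
static void link_close(struct mount_sim *m)
{
    assert(m->pins > 0);
    m->pins--;
}

/* Refuse to unmount while any link is still open. */
static int try_umount(struct mount_sim *m)
{
    if (m->pins > 0)
        return -EBUSY;
    m->mounted = 0;
    return 0;
}
```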
2009-02-25 23:33 if the error code is not ENOMEM, then it would read the ddlink to get the error 2009-02-25 23:33 the idea is, each thread would get its own ddlink 2009-02-25 23:34 it would seem to be hard to have multiple threads reading from the same ddlink, even without errors 2009-02-25 23:35 I think if you wanted to do that, you would provide two ddlinks, one that is only for messages, and the other that is only for errors 2009-02-25 23:36 but it is not really intended to have multiple readers per ddlink 2009-02-25 23:36 just like you would not do that with a pipe 2009-02-25 23:37 multiple readers on pipe (fifo) would be normal 2009-02-25 23:37 if it is using atomic size 2009-02-25 23:37 well 2009-02-25 23:37 error message is for logging message? 2009-02-25 23:38 yes, only for errors 2009-02-25 23:38 or it is part of user<->kernel interface? 2009-02-25 23:38 i see 2009-02-25 23:38 it is a ddlink library function 2009-02-25 23:38 not used by the main interface 2009-02-25 23:38 available for applications 2009-02-25 23:38 it works very well for the ddsetup I wrote, that controls device mapper 2009-02-25 23:38 what is wrong with printk? 2009-02-25 23:39 it doesn't come back to the application 2009-02-25 23:39 the application will not normally see a printk 2009-02-25 23:39 yes 2009-02-25 23:39 so the idea is, the application can report an accurate error to the user 2009-02-25 23:39 ah 2009-02-25 23:39 ok 2009-02-25 23:40 it is extension of printk? 
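[Editor's sketch] The error delivery rule discussed here (ordinary items are appended, but an error is pushed to the front so the reader sees it first) is easy to show with a small ring buffer. This is only an illustration of the ordering; the queue type and function names are hypothetical and the real ddlink queue is not shown in the log.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define QUEUE_SLOTS 8

/* Hypothetical item queue: a ring buffer of string pointers. */
struct item_queue {
    const char *slot[QUEUE_SLOTS];
    size_t head;   /* index of the next item to read */
    size_t count;  /* number of items currently queued */
};

/* Normal messages go to the back. */
static int push_back(struct item_queue *q, const char *item)
{
    if (q->count == QUEUE_SLOTS)
        return -1;
    q->slot[(q->head + q->count) % QUEUE_SLOTS] = item;
    q->count++;
    return 0;
}

/* Errors go to the front, ahead of anything already queued. */
static int push_front(struct item_queue *q, const char *item)
{
    if (q->count == QUEUE_SLOTS)
        return -1;
    q->head = (q->head + QUEUE_SLOTS - 1) % QUEUE_SLOTS;
    q->slot[q->head] = item;
    q->count++;
    return 0;
}

/* Read the next item, or NULL if the queue is empty. */
static const char *pop(struct item_queue *q)
{
    if (q->count == 0)
        return NULL;
    const char *item = q->slot[q->head];
    q->head = (q->head + 1) % QUEUE_SLOTS;
    q->count--;
    return item;
}
```

With a single reader per link, as the log recommends, a failed write followed by one read always returns the error text rather than an older queued reply.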
2009-02-25 23:40 the error can be formatted too, like a printk 2009-02-25 23:40 it uses some common part of printk 2009-02-25 23:40 but not the buffer 2009-02-25 23:40 yes, and per-ddlink-fd message 2009-02-25 23:40 it uses vsnprintf 2009-02-25 23:40 also used by printk 2009-02-25 23:41 well, I thought the ddlink_error() message become ABI itself 2009-02-25 23:41 this means the changing message can breaks ABI 2009-02-25 23:42 the application should rely on the error code, not the message 2009-02-25 23:42 but the message should provide accurate information to the user 2009-02-25 23:42 if we want, we can add a private error code field 2009-02-25 23:42 yes, if so, it would be ok 2009-02-25 23:42 we probably should 2009-02-25 23:43 it just logging, user should depend on message itself 2009-02-25 23:43 yes 2009-02-25 23:43 user shouldn't depend 2009-02-25 23:45 if the application really wants to report extended error information, it can push an additional item before the error 2009-02-25 23:46 the userspace application will know about that, and read two items in case of an error, the first is just the error text, the second has more detailed information 2009-02-25 23:46 but I have not found a use for such detail yet 2009-02-25 23:46 usually it is enough just to know there was an error, and to have some text to output 2009-02-25 23:46 yes 2009-02-25 23:47 well define error code is also ok for simple case 2009-02-25 23:47 well definedd error code 2009-02-25 23:50 I should post the ddsetup.c code, for a nontrivial example of ddlink usage 2009-02-25 23:50 btw, I'm attacking ddlink purposely 2009-02-25 23:50 thankyou :) 2009-02-25 23:50 :) 2009-02-25 23:51 ok 2009-02-26 00:26 ok, ddsetup.c example is posted 2009-02-26 00:42 -!- chesse(~eworm@dslb-084-062-172-253.pools.arcor-ip.net) has joined #tux3 2009-02-26 00:53 ACTION reading 2009-02-26 00:53 btw, ddlink can use anon_inode_getfd()? 
2009-02-26 01:41 -!- gebi_(~gebi@84-119-61-74.dynamic.xdsl-line.inode.at) has joined #tux3 2009-02-26 01:57 ACTION looks 2009-02-26 02:00 anon_inode_getfd() version was done, untested though 2009-02-26 02:00 it may clear the point of ddlink 2009-02-26 02:01 you are trying it? 2009-02-26 02:01 yes 2009-02-26 02:01 good luck :) 2009-02-26 02:01 just compiled though 2009-02-26 02:01 you will kmalloc the file->private? 2009-02-26 02:01 yes 2009-02-26 02:02 I'll post it soon 2009-02-26 02:02 anything that makes it smaller is good 2009-02-26 02:02 there is no big difference, however, fd allocation code was gone 2009-02-26 02:03 -!- Mark__T(~Mark__T@jabber.freenet.de) has joined #tux3 2009-02-26 02:03 saves an inode 2009-02-26 02:03 yes 2009-02-26 02:05 where does the name come from? 2009-02-26 02:05 anon_inode_getfd()? 2009-02-26 02:05 yes 2009-02-26 02:06 fs/anon_inodes.c? 2009-02-26 02:06 the "class" idea seems a little bogus 2009-02-26 02:07 existing ddlink uses an empty dentry 2009-02-26 02:08 fd can read from /proc/fd/ 2009-02-26 02:08 no name would not be good 2009-02-26 02:08 true 2009-02-26 02:11 I've posted that patch to ml 2009-02-26 02:11 let me see what shows up in /proc/fd with existing code 2009-02-26 02:14 about the Data dedub discussion: why not use 2 different easy to calculate small hashes to reduce the risc of collision instead of using one big complicated hash? 
2009-02-26 02:15 I am generally in favor of the small hash approach, no more than 64 bits 2009-02-26 02:15 but I would like to see how the discussion goes 2009-02-26 02:15 my feeling is, an exact match is rare, therefore it is ok to test against the actual data if a small hash matches 2009-02-26 02:16 if you use 2 small instead of one, you could even calculate them parallel 2009-02-26 02:16 I suspect that a 32 bit hash would be even better because you can cache more of them 2009-02-26 02:16 2 small is the same as one, conceptually 2009-02-26 02:17 well, it is just a hash, not crypto 2009-02-26 02:18 yes 2009-02-26 02:18 and it's not perfect hash, so, all needs good balance? 2009-02-26 02:19 hash size, and hash calc overhead, etc, etc.. 2009-02-26 02:19 I think I will suggest they try a smaller hash, plus a compare against actual data on match, and see if it performs better 2009-02-26 02:19 as long as nothing is 'dedubbed' that shouldn't :) 2009-02-26 02:20 hirofumi, how about struct ddinfo instead of ddctl 2009-02-26 02:21 not a big difference ;) 2009-02-26 02:21 ddctl is already using, iirc 2009-02-26 02:21 ddctl is "file -> ddfile", and ddinfo is "ddfile -> private" 2009-02-26 02:22 ah 2009-02-26 02:22 well 2009-02-26 02:22 let's get rid of ddctl and use ddinfo if we can ;) 2009-02-26 02:22 later 2009-02-26 02:23 it sounds good, current ddctl is just a overkill 2009-02-26 02:23 ddctl() 2009-02-26 02:24 there is no container_of() anymore 2009-02-26 02:24 well I think your version is an improvement 2009-02-26 02:25 thanks 2009-02-26 02:25 it is a little shorter I think, did you do a diffstat? 
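[Editor's sketch] The suggestion made in this part of the log is to use a small hash as a cheap filter and compare the actual data whenever the hash matches, so that a collision can never cause two different blocks to be merged. A toy in-memory version follows. The block size, the table and the choice of FNV-1a are assumptions for illustration only; the log names no particular hash function, and a real implementation would read candidate blocks from disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16   /* tiny blocks keep the example short */
#define MAX_BLOCKS 8

/* Hypothetical in-memory "disk" plus a table of 32-bit hashes. */
struct store {
    uint8_t data[MAX_BLOCKS][BLOCK_SIZE];
    uint32_t hash[MAX_BLOCKS];
    int used;
};

/* 32-bit FNV-1a, chosen here only because it is short. */
static uint32_t hash32(const uint8_t *p, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Store a block. If an identical block already exists, return its index
 * instead of storing a copy. Returns -1 when the store is full. */
static int store_block(struct store *s, const uint8_t *block)
{
    uint32_t h = hash32(block, BLOCK_SIZE);
    for (int i = 0; i < s->used; i++) {
        /* Cheap hash test first, full data compare only on a hash match. */
        if (s->hash[i] == h && memcmp(s->data[i], block, BLOCK_SIZE) == 0)
            return i;
    }
    if (s->used == MAX_BLOCKS)
        return -1;
    memcpy(s->data[s->used], block, BLOCK_SIZE);
    s->hash[s->used] = h;
    return s->used++;
}
```

The `memcmp` is what makes a short hash safe: a false hash match costs one extra compare and nothing more.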
2009-02-26 02:26 5 files changed, 397 insertions(+), 2 deletions(-) <- for my version 2009-02-26 02:27 5 files changed, 342 insertions(+), 2 deletions(-) 2009-02-26 02:27 a bit shoter 2009-02-26 02:27 shorter 2009-02-26 02:28 55 lines shorter, that's a lot 2009-02-26 02:29 I've posted the one patch version 2009-02-26 02:30 -!- Mark__T(~Mark__T@jabber.freenet.de) has left #tux3 2009-02-26 02:32 http://userweb.kernel.org/~hirofumi/fork-buffer.note 2009-02-26 02:32 btw, this is current note of block fork 2009-02-26 02:33 and I'm thinking about freeing pages after flushing 2009-02-26 02:33 finding good way to free pages 2009-02-26 02:34 current one pin the pages until free 2009-02-26 02:34 is that the note you were working on yesterday? 2009-02-26 02:34 yes 2009-02-26 02:37 iirc, we are talked about freeing a bit 2009-02-26 02:38 however, I forgot the detail of it 2009-02-26 02:38 this is what we see with my patch: /proc/295/task/295/fd/4 -> ddlink (deleted) 2009-02-26 02:38 oh 2009-02-26 02:39 I wonder ddlink where come from 2009-02-26 02:42 ok, freeing forked pages, right? 2009-02-26 02:42 yes 2009-02-26 02:44 i see, unhashed dentry is " (deleted)" 2009-02-26 02:44 well 2009-02-26 02:44 current one makes list of bio 2009-02-26 02:44 and walk the pages on bio, and free the pages 2009-02-26 02:45 that seems reasonable to me 2009-02-26 02:45 there is not a big win from freeing earlier 2009-02-26 02:45 but, after endio(), the pages on bio can be freed from reclaim stuff 2009-02-26 02:46 so, current one is getting the reference count of the page to submit 2009-02-26 02:46 I'd like to avoid this reference count 2009-02-26 02:46 why? 
2009-02-26 02:47 because, it is unnecessary for non-forked page at all 2009-02-26 02:47 but it is not harmful 2009-02-26 02:48 yes 2009-02-26 02:48 ok, well I see what you are worrying about 2009-02-26 02:48 well, this is just a some sort of optimization 2009-02-26 02:48 let's see where endio decrements the page count 2009-02-26 02:49 endio doesn't decrements the page count 2009-02-26 02:49 it calls end_page_writeback() to clear PG_writeback 2009-02-26 02:49 so why can't a page be freed before endio? 2009-02-26 02:50 1117static void bio_release_pages(struct bio *bio) 2009-02-26 02:50 ah 2009-02-26 02:50 not used in endio path 2009-02-26 02:50 yes 2009-02-26 02:51 well, forked page have to be used the special free function 2009-02-26 02:51 to clear page->mapping 2009-02-26 02:51 yes 2009-02-26 02:52 ah, it is the writeback bit that keeps the page from being freed 2009-02-26 02:52 yes 2009-02-26 02:52 and complex one is 2009-02-26 02:53 after clearing writeback bit, frontend may be on the middle of fork 2009-02-26 02:53 still 2009-02-26 02:53 well, the end_page_writeback can be done in foreground if we want 2009-02-26 02:54 when clears writeback bit? 2009-02-26 02:54 endio just wakes us up when all transfers are completed, then walk the list doing end_page_writeback 2009-02-26 02:55 and walker is fontend? 2009-02-26 02:55 frontend 2009-02-26 02:56 I was thinking backend, it is the delta completion 2009-02-26 02:56 but since it is a cycle, is there really a difference? 2009-02-26 02:56 ah, foreground means not endio? 
2009-02-26 02:56 yes 2009-02-26 02:56 ok 2009-02-26 02:56 not in interrupt mode 2009-02-26 02:56 i see 2009-02-26 02:56 easier to control races, locking is easier 2009-02-26 02:56 yes, current one is like that 2009-02-26 02:57 and there is little difference in lifecycle of the freeable pages 2009-02-26 02:57 well I think that is good 2009-02-26 02:57 but maybe the current one does not clear writeback in foreground 2009-02-26 02:57 but in endio instead 2009-02-26 02:58 yes, it clears writeback bit outside endio 2009-02-26 02:58 current one clears writeback bit in endio, and instead get the page count 2009-02-26 02:58 ah, so writeback is already cleared in foreground? 2009-02-26 02:58 ok 2009-02-26 02:58 fine, so it seems like clearing writeback in foreground solves your issue above? 2009-02-26 02:59 my dream for this is, it doesn't touch non-forked page 2009-02-26 03:00 it will need to touch non-forked page too 2009-02-26 03:00 I think that dream can be reached ;) 2009-02-26 03:00 good :) 2009-02-26 03:00 by having a separate endio for forked pages 2009-02-26 03:01 fork or non-forked is not sure until endio was done 2009-02-26 03:01 yes 2009-02-26 03:01 (actually, endio was done, then frontend released that page) 2009-02-26 03:01 yes 2009-02-26 03:02 fork is possible until delta dirty is cleared 2009-02-26 03:02 yes 2009-02-26 03:02 well, pages that are dirty in current delta cannot be forked 2009-02-26 03:02 that will be most 2009-02-26 03:02 hmm, is that right? 
2009-02-26 03:03 yes 2009-02-26 03:03 however, they are not under IO 2009-02-26 03:03 so it may be right, but it does not matter ;) 2009-02-26 03:03 yes, well it can 2009-02-26 03:03 however, unnecessary almost 2009-02-26 03:05 my original idea was, forking puts the page on a "forked" list 2009-02-26 03:05 i see 2009-02-26 03:05 and the forked list is walked after delta completion 2009-02-26 03:06 i see 2009-02-26 03:06 then, do the end_writeback in the endio 2009-02-26 03:06 and get_page in the fork 2009-02-26 03:06 does that make sense? 2009-02-26 03:07 yes 2009-02-26 03:07 that solves it? 2009-02-26 03:07 of did I miss more issues? ;) 2009-02-26 03:07 the issue is the field for list 2009-02-26 03:07 oh yes 2009-02-26 03:07 well 2009-02-26 03:07 -!- cdk(~chinmay@115.109.15.139) has joined #tux3 2009-02-26 03:08 do we have page->private available? 2009-02-26 03:08 that still has buffers on it? 2009-02-26 03:08 um 2009-02-26 03:08 hi flips 2009-02-26 03:08 hi cdk 2009-02-26 03:08 the patch ok ? 2009-02-26 03:08 you have created a lot of discussion of your patch ;) 2009-02-26 03:08 buffers can be truncated after endio 2009-02-26 03:08 yes that we have ;) 2009-02-26 03:08 well I think your results are entirely good enough for your project 2009-02-26 03:09 and the patch will be changed a lot before it becomes part of tux3 2009-02-26 03:09 a process that will take several months 2009-02-26 03:10 I think the general opinion is, a smaller hash together with compare against actual data on match would be faster 2009-02-26 03:10 are you interested in trying that as part of your project? 2009-02-26 03:10 but that would mean we have to read the block each time. 2009-02-26 03:10 not each time 2009-02-26 03:10 only if the hash matches 2009-02-26 03:11 yes...each time the hash matches 2009-02-26 03:11 well that will be rare 2009-02-26 03:11 and every time it matches, you are going to save a lot, so a read is ok 2009-02-26 03:13 hirofumi, can truncate touch our buffers? 
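[Editor's sketch] The "forked list" idea from this exchange (forking takes a page reference and puts the page on a list, and the list is walked after delta completion) can be simulated in userspace as below. The point of the design is that pages which were never forked are not touched at all. `page_sim`, `delta_sim` and both functions are hypothetical stand-ins, not kernel structures, and the question raised in the log of which field can hold the list link is sidestepped by giving the simulated page its own pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page; only the fields the discussion needs. */
struct page_sim {
    int count;                  /* reference count */
    struct page_sim *next_fork; /* link on the forked list */
};

/* Hypothetical per-delta state. */
struct delta_sim {
    struct page_sim *forked;    /* pages forked during this delta */
};

/* Forking takes an extra reference and remembers the page. */
static void fork_page(struct delta_sim *d, struct page_sim *p)
{
    p->count++;
    p->next_fork = d->forked;
    d->forked = p;
}

/* After delta completion, walk the list and drop the extra references.
 * Returns the number of pages released. */
static int delta_complete(struct delta_sim *d)
{
    int released = 0;
    while (d->forked) {
        struct page_sim *p = d->forked;
        d->forked = p->next_fork;
        p->next_fork = NULL;
        p->count--;
        released++;
    }
    return released;
}
```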
2009-02-26 03:13 hmm....seems so... 2009-02-26 03:13 well, if you feel you have time it would be nice to try and compare the performance 2009-02-26 03:14 hirofumi, I thought truncate has to do ->releasepage 2009-02-26 03:15 well, it would be whether we alllow the possiblity of corruption of data 2009-02-26 03:15 yes 2009-02-26 03:15 yes. i guess we can try it. i dont think it should take much time. 2009-02-26 03:15 so if we have a count on the page, buffers should not be removed, and we can use a buffer field for linking 2009-02-26 03:16 cdk, and it would improve your final report I think 2009-02-26 03:16 flips, maybe, I guess we can keep buffers more long time if we want 2009-02-26 03:16 cdk, probably more important than porting to kernel 2009-02-26 03:16 hirofumi, right, it would not cost much 2009-02-26 03:17 and anyway, I think your current method is find for now, with an extra get_page for every IO page 2009-02-26 03:17 in practice it probably does not matter at all 2009-02-26 03:17 probably 2009-02-26 03:18 it is just finding more good strategy, or like this 2009-02-26 03:18 well I think the better strategy is possible 2009-02-26 03:18 i guess we can compare and check which works better. 2009-02-26 03:18 but that it is not necessary to implement right now 2009-02-26 03:19 ok 2009-02-26 03:19 _any_ strategy that makes fork work is a good strategy ;) 2009-02-26 03:19 :) 2009-02-26 03:19 ok, I'll try to note with current one 2009-02-26 03:19 however , unless we have a test data set large enough that will give us collisions i doubt we will actually see the difference/advantage 2009-02-26 03:19 or will we? 2009-02-26 03:20 hirofumi, are you satisfied with your ddlink patch? 
I can download and try it 2009-02-26 03:20 yes, it seems good for now 2009-02-26 03:20 cdk, I will predict that you will see lower cpu and higher performance with a smaller hash 2009-02-26 03:21 savings should be the same 2009-02-26 03:21 ok, time to sleep now 2009-02-26 03:21 just wanted to ask you one other thing 2009-02-26 03:22 tomorrow (later today) I will try the new, improved ddlink, then get back to work on the atomic commit prototype 2009-02-26 03:22 cdk, iirc, that patch saved some blocks already? 2009-02-26 03:22 and it means there are hash collisions 2009-02-26 03:22 hirofumi, so now ddlink is a solution, and it needs a problem ;-) 2009-02-26 03:22 :) 2009-02-26 03:22 their reported performce results are quite impressive 2009-02-26 03:23 in terms of savings 2009-02-26 03:23 and the overhead is not too bad 2009-02-26 03:23 yes 2009-02-26 03:23 less than a factor of two 2009-02-26 03:23 but I think it can be improved 2009-02-26 03:24 do we get included on the about us page for this ;-) 2009-02-26 03:24 hirofumi, your patch is against my patch? 2009-02-26 03:24 yes, so, if we can see the result of "compare data if collisions", it would be good 2009-02-26 03:24 yes 2009-02-26 03:24 also, I posted the one patch version too 2009-02-26 03:25 hirofumi, i guess we will get it going and compare the results... 2009-02-26 03:25 good 2009-02-26 03:26 any if not part of our project, separately for tux3 code 2009-02-26 03:27 hirofumi, is it possible to make your patch against the current hg repo? 2009-02-26 03:27 well 2009-02-26 03:27 I can handle that 2009-02-26 03:27 I think it is againt to current hg repo 2009-02-26 03:28 1006:3118107954f1 is latest revision? 
2009-02-26 03:28 ok, so the single patch version does not have the whitespace cleanups against my patch, good 2009-02-26 03:28 yes fine 2009-02-26 03:29 yes, the single patch version is including those patches to your patch 2009-02-26 03:30 and the patch is against current repo 2009-02-26 03:32 ok, the new improved ddlink runs the ddfork.c test the same as the old one 2009-02-26 03:32 good 2009-02-26 03:33 what was your testing method? 2009-02-26 03:33 it was not untested at all 2009-02-26 03:33 I compiled it though 2009-02-26 03:33 :) 2009-02-26 03:33 flips, if we do get hash collisions .. on blocks not same .. then referring to them using the tree / bucket entries might be a problem 2009-02-26 03:33 very nice 2009-02-26 03:34 cdk, let's think about that 2009-02-26 03:34 which sequence of operations creates a problem? 2009-02-26 03:35 the bucket will have entries with same hash value , different block numbers 2009-02-26 03:35 yes 2009-02-26 03:35 so writing a new block - > collision .. we will have to read all the blocks having same hash in the bucket one by one , single byte at a time 2009-02-26 03:36 but collisions will be very rare 2009-02-26 03:36 after doing that if the block does not match any...we just wasted all those disk reads and compares 2009-02-26 03:36 and not matching the block after matching the hash will also be very rare 2009-02-26 03:36 even for 64 bits when we consider 2**48 blocks? 2009-02-26 03:37 yes 2009-02-26 03:37 for one thing, you will not have 2**48 blocks in practice 2009-02-26 03:37 yes 2009-02-26 03:37 2**32 is already a lot 2009-02-26 03:37 that is 16 TB 2009-02-26 03:38 -!- gebi(~gebi@84-119-55-35.dynamic.xdsl-line.inode.at) has joined #tux3 2009-02-26 03:38 that means, with uniform distribution, your change of a false positive is 1/2**32 2009-02-26 03:38 a very small number 2009-02-26 03:38 yes 2009-02-26 03:40 well...i guess there is no harm in trying .. 
if it works well we will get performance benefits 2009-02-26 03:44 flips, i never sent the mkdir patch for userspace .. will do that tonight cc to shapor right ? 2009-02-26 03:44 yes please 2009-02-26 03:45 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-02-26 03:45 flips, what do we do to get on the Tux 3 'Hall of Fame' ? ;-) 2009-02-26 03:46 send me all your names in an email 2009-02-26 03:46 :) 2009-02-26 03:46 it can be private 2009-02-26 03:46 right now? 2009-02-26 03:47 any time 2009-02-26 03:47 phillips@phunq.net 2009-02-26 03:48 in a couple of weeks I will report your work to the lkml list 2009-02-26 03:48 or you can 2009-02-26 03:48 either way 2009-02-26 03:48 you probably prefer that I do it ;) 2009-02-26 03:48 yes ;) 2009-02-26 03:49 so, please say a little bit about your project and your school in the email 2009-02-26 03:49 as you have done before, but this will put it all in one place 2009-02-26 03:49 and I can use that in the report to lkml 2009-02-26 03:49 you might say something about your supervisor 2009-02-26 03:50 ok .. 
doing it now 2009-02-26 03:52 our guide / supervisor is actually an alumni of our college itself sending his info too 2009-02-26 03:54 ok 2009-02-26 03:54 time for me to sleep 2009-02-26 03:54 oyasumi 2009-02-26 03:54 hall of fame will be updated pretty soon 2009-02-26 03:54 good night 2009-02-26 03:54 oyasumi 2009-02-26 03:54 thanks 2009-02-26 07:45 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-26 10:18 -!- amey(~amey@117.195.33.154) has joined #tux3 2009-02-26 10:26 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-26 10:27 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-26 10:30 -!- amey__(~amey@117.195.33.154) has joined #tux3 2009-02-26 10:34 -!- amey(~amey@117.195.33.154) has joined #tux3 2009-02-26 10:34 flips, might want to change the topic 2009-02-26 10:35 we haven't had Tux3 U in a while 2009-02-26 14:05 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-26 14:56 -!- ChanServ changed mode/#tux3 -> +o flips 2009-02-26 15:05 -!- flips changed topic to "http://tux3.org ~ Tux3 University on winter recess ~ Tux3 runs as root fs! ~ http://www.phoronix.com/scan.php?page=news_item&px=NzA4Mw" 2009-02-26 15:20 -!- flips changed topic to "http://tux3.org ~ Tux3 University on winter recess ~ Tux3 boots as root fs! 
~ http://www.phoronix.com/scan.php?page=news_item&px=NzA4Mw" 2009-02-26 16:34 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-26 16:48 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-26 17:09 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-26 19:06 -!- chesse_(~eworm@dslb-084-062-166-107.pools.arcor-ip.net) has joined #tux3 2009-02-26 19:20 -!- urbano(~thiago@201.88.228.162) has joined #tux3 2009-02-26 19:20 -!- gaurav(~gaurav@59.95.22.172) has joined #tux3 2009-02-26 19:29 -!- urbano(~thiago@201.88.228.162) has left #tux3 2009-02-26 20:06 -!- chesse(~eworm@dslb-084-062-146-028.pools.arcor-ip.net) has joined #tux3 2009-02-26 20:29 ok, I have an idea what to do with the shiny new ddlink 2009-02-26 20:30 we have all that structure dumping code in the kernel 2009-02-26 20:30 so the tux3 command should get a new command to make the kernel dump a structure over the ddlink 2009-02-26 20:33 um.. what kind of structures? 2009-02-26 20:33 what is this ddlink thing anyway? google doesn't turn up much... 2009-02-26 20:34 marcin, for example btree metadata 2009-02-26 20:34 http://lwn.net/Articles/271805/ 2009-02-26 20:34 so we could make it into a wrapper that would dump the fragmentation data i've been wanting? :) 2009-02-26 20:35 exactly 2009-02-26 20:35 you hack tux3.c to get it from the ddlink 2009-02-26 20:35 and hack something in kernel to generate it 2009-02-26 20:35 fun? 2009-02-26 20:36 you've peaqued my interest, keep talking :) 2009-02-26 20:36 caveat: there are other kernel mechanisms that are supposed to handle stuff like this 2009-02-26 20:36 i think i just need a tree walking struct reader/parser 2009-02-26 20:36 for example, relayfs 2009-02-26 20:36 however you can be pretty sure they will be less convenient to use 2009-02-26 20:37 ranging from clunky to eye gouging 2009-02-26 20:37 ddlink is terse and nice 2009-02-26 20:37 could you mkae some sort of skeleton example? 
2009-02-26 20:37 I already did, yesterday 2009-02-26 20:38 ok, i'll look at it 2009-02-26 20:38 i passed out yesterday at midnight 2009-02-26 20:38 http://mailman.tux3.org/pipermail/tux3/2009-February/000739.html 2009-02-26 20:39 hmm, I wish hypermail knew how to wrap lines of archived mails 2009-02-26 20:39 I need to avoid kmail word wrap I think 2009-02-26 20:42 so i just follow the ddtest.c? 2009-02-26 20:43 it's an example 2009-02-26 20:43 also, the kernel implementation of the ddlink 2009-02-26 20:44 so in which struct would i find the fragmenation info? 2009-02-26 20:45 see see the code in the ddlink patch for tux3_ioctl 2009-02-26 20:45 for example of setting up a ddlink on kernel side 2009-02-26 20:45 fragmentation info would be gleaned from the dleaf nodes 2009-02-26 20:45 see dleaf.c 2009-02-26 20:46 for example, dleaf_dump 2009-02-26 20:46 you would write a tree walker that walks a file btree and outputs all extents over the ddlink, for example 2009-02-26 20:47 ddlink_alloc_inode? 2009-02-26 20:47 a walker like that is easy to write 2009-02-26 20:47 examples can be cut n pasted from hirofumi's tux3graph.c 2009-02-26 20:48 ddlink_alloc_inode is from the ddlink library itself, the code that has to be written to set up a specific ddlink is in user/kernel/inode.c in the patch 2009-02-26 20:48 I should post a new patch with hirofumi's improved ddlink code 2009-02-26 20:48 way too many unknowns for me :9 2009-02-26 20:48 :( 2009-02-26 20:49 well you start somewhere, grab a thread and pull on it 2009-02-26 20:49 pretty soon you unravel the whole sweater 2009-02-26 20:49 see? 2009-02-26 20:49 there will be a bunch more ddlink examples coming 2009-02-26 20:49 and things will be more obvious when ddlink is checked into the repo 2009-02-26 20:50 can you start with writing one that returns block numbers that a given file allocates? 
:) 2009-02-26 20:50 now, we may not make ddlink part of our kernel submission for review 2009-02-26 20:50 so I would check it in "just for now" and remove it a while later when we start review 2009-02-26 20:50 i'm good with wrappers, experimental interfaces to kernel code...not so much 2009-02-26 20:50 marcin, sure 2009-02-26 20:50 that's a typical use of ddlink 2009-02-26 20:51 i think that'd be a good example to have 2009-02-26 20:51 we could actually use the "stash" mechanism to whole block number traces like that 2009-02-26 20:51 this would make internals readable 2009-02-26 20:51 extents actually 2009-02-26 20:52 you will get the idea of this experimental interface pretty quickly 2009-02-26 20:52 actually, tha'ts a quesiton i had earlier today, how does the whole allocation approach changes with extents? should i just put more weight on the contigous space? 2009-02-26 20:52 it's very straightforward to write the kernel code for a ddlink 2009-02-26 20:52 kernel code is never straightforward ;) 2009-02-26 20:53 I have no idea what the implications of extents are ;) 2009-02-26 20:53 it's my job to create implications, your job to divine the meaning 2009-02-26 20:53 good, then i have some real motivation to make my allocator algo more tunable 2009-02-26 20:53 ddlink implementation is in fact quite straightforward 2009-02-26 20:54 i've ventured into kernel code once, in minix, and frankly, that was enough for me 2009-02-26 20:54 probably will check in the base ddlink support, then posted patches will only have the kernel usage part, which is pretty simple 2009-02-26 20:54 well tux3 kernel code is not like typical kernel code 2009-02-26 20:54 most of it runs in userspace too, for one thing 2009-02-26 20:55 ddlink doesn't though 2009-02-26 20:55 we have the curious challenge ahead of us, of porting ddlink to userspace 2009-02-26 20:55 you just give me some nice clean ddlink interfaces, and i'll make utils 2009-02-26 20:55 i like making tools 2009-02-26 20:55 well, 
I'll give you some basic kernel interface you can fiddle with 2009-02-26 20:55 fiddling with is easier than implementing from scratch 2009-02-26 20:56 "there is nothing to fear but fear itself" 2009-02-26 20:56 ddlink is a tool, on both ends 2009-02-26 20:56 pft, said by the man that never stared at the abyss of kernel code 2009-02-26 20:56 a tool for scraping useful info from kernel activity 2009-02-26 20:56 that's why i'm interested, we've talked before about 'tooling up' 2009-02-26 20:57 well, for example, kernel/iattr.c is kernel code, have a look, it's perfectly readable 2009-02-26 20:57 we might not do it for the whole kernel, but at least it would be nice to do it for our little territory 2009-02-26 20:58 http://hg.tux3.org/tux3/file/3118107954f1/user/kernel/iattr.c 2009-02-26 20:58 this implements the variable inode attributes 2009-02-26 20:59 so you want to convert the dump_attrs to ddlink? 2009-02-26 20:59 do you see it more as a conversion, or just an augmentation? 2009-02-26 20:59 actually, I think I will just dump a stream of extents for you 2009-02-26 20:59 as an example 2009-02-26 21:00 very good example to start with 2009-02-26 21:01 stream of extents..you mean the metadata, not hte actual extents contents right? 2009-02-26 21:27 I mean the actual extents 2009-02-26 21:27 why would i want these? 2009-02-26 21:27 they tell you the physical locations in use 2009-02-26 21:27 i.e., fragmentation 2009-02-26 21:28 hm, i apparently dont know what you mean by extent :/ 2009-02-26 21:28 we also want to know to which logical object each extent belongs 2009-02-26 21:29 and extent is just a start block and count of blocks that map a particular logical address of a file 2009-02-26 21:29 each extent has a logical address, a physical address and a count of blocks 2009-02-26 21:29 here's a function call i'm imagining: listofextents= gimmeextents(fd) 2009-02-26 21:29 sure 2009-02-26 21:29 like that 2009-02-26 21:30 what what would each extent data gimme exactly? 
2009-02-26 21:31 well we have to decide 2009-02-26 21:31 internally, we pack physical address and count into 64 bits 2009-02-26 21:31 if there are no holes, then a simple row of those would do 2009-02-26 21:32 do i care for physical or just logical for fragmentation's sake? 2009-02-26 21:32 so probably we want something like: logical address (64 bits), count of contiguous extents (32 bits), bunch of 64 bit physical extents 2009-02-26 21:32 repeating over and over 2009-02-26 21:33 you care about both logical and physical 2009-02-26 21:33 "this logical object is represented by which physical extents" 2009-02-26 21:33 contiguous extents? if multiple extents were contiguous, wouldn't it be better for them to be one big extent instead? 2009-02-26 21:33 and "this cloud of extents for this object is located where in relation to this other cloud of extents" 2009-02-26 21:34 true 2009-02-26 21:34 we probably just want: logical, physical, count 2009-02-26 21:34 e.g., logical, 6 bytes, physical, 6 bytes, count, 4 bytes 2009-02-26 21:35 = 16 bytes 2009-02-26 21:35 then the user space code needs to be leet enough to decode 6 byte fields 2009-02-26 21:35 logical and physical...these would be locations right?
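The 16-byte record proposed above (6-byte logical address, 6-byte physical address, 4-byte count) can be decoded in user space roughly as follows. This is a minimal sketch, not the ddlink format itself: the names `Extent` and `decode_extents` are hypothetical, and the byte order is an assumption, since the discussion does not specify one.

```python
from typing import Iterator, NamedTuple


class Extent(NamedTuple):
    logical: int   # logical block address within the file (48 bits)
    physical: int  # physical block address on the volume (48 bits)
    count: int     # number of contiguous blocks (32 bits)


RECORD_SIZE = 16  # 6 + 6 + 4 bytes, as proposed in the discussion


def decode_extents(buf: bytes, byteorder: str = "big") -> Iterator[Extent]:
    """Yield Extent tuples from a buffer of packed 16-byte records.

    The byte order is an assumption; the chat does not name one.
    """
    if len(buf) % RECORD_SIZE:
        raise ValueError("buffer length is not a multiple of 16 bytes")
    for off in range(0, len(buf), RECORD_SIZE):
        rec = buf[off:off + RECORD_SIZE]
        yield Extent(
            logical=int.from_bytes(rec[0:6], byteorder),
            physical=int.from_bytes(rec[6:12], byteorder),
            count=int.from_bytes(rec[12:16], byteorder),
        )
```

Slicing six bytes and handing them to `int.from_bytes` avoids any need for a native 48-bit integer type, which is the only awkward part of the layout.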
2009-02-26 21:35 that is where we'd reject an intern at google if they can't do it ;) 2009-02-26 21:35 in fact I should have used that for an interview question 2009-02-26 21:35 -!- amey(~amey@socks.wantstofly.org) has joined #tux3 2009-02-26 21:35 oy, tough crowd ;) 2009-02-26 21:36 that's softball ;) 2009-02-26 21:36 ACTION is not a coder 2009-02-26 21:36 ok, you get a waiver 2009-02-26 21:36 first thing i'll do is dump it into matlab 2009-02-26 21:36 no 6 hours standing and sweating in front of a whiteboard for you 2009-02-26 21:36 the logical operations in matlab are fantastic 2009-02-26 21:37 all kinds of addressing and filtering tricks, which we need for this 2009-02-26 21:37 i know you could you it in c, but i'll leave that exercise for the end 2009-02-26 21:37 true 2009-02-26 21:38 the sort of stuff that is completely dire in C 2009-02-26 21:38 amen brother 2009-02-26 21:38 do you have numerical recipes in C? 2009-02-26 21:39 i've been thinking of doing some allocation stuff based on distances, but that's clustering, which starts with distance matrix, which is n^2, so wiht lots of files is gonna get big real quick 2009-02-26 21:39 i might, havent looked at striaght c since undergrad 2009-02-26 21:40 i'm kinda weird, i look at asm, but i write .net or sed/awk ;) 2009-02-26 21:40 well, the point is, the C code in that book is really gross 2009-02-26 21:41 speaking of weird...so exactly how much time do you have to make a allocation decision? 
2009-02-26 21:41 even though it's a seminal work 2009-02-26 21:41 very little 2009-02-26 21:41 it's a common operation 2009-02-26 21:41 microseconds would be way too much 2009-02-26 21:41 ok, so clustering is out 2009-02-26 21:41 well, maybe for a defragger 2009-02-26 21:41 then we got time to think 2009-02-26 21:42 yep, the tricks have to be cheap tricks 2009-02-26 21:42 that does not mean unsophisticated 2009-02-26 21:42 oh i know 2009-02-26 21:42 fuzzy k-means clustering would be really good for this, but it'd be a performance nightmare 2009-02-26 21:44 well the thoughts I have about it are clustery thoughts 2009-02-26 21:44 here's the basic idea: for one thing, versioning makes the problem way harder 2009-02-26 21:44 yea no shit ;) 2009-02-26 21:44 however it does give us more info about file usage 2009-02-26 21:45 because you have multiple different versions logically overlaid 2009-02-26 21:45 it's clearly impossible to lay out completely linearly 2009-02-26 21:45 so i think we might be able to turn vice into virtue again 2009-02-26 21:45 maybe 2009-02-26 21:45 anyway, so base concepts are useful 2009-02-26 21:45 well if we at least group it, hopefully the drive/controller will figure out a sane way to do it 2009-02-26 21:46 obvious one: try to make physical layout monotonically increasing 2009-02-26 21:46 - try to make physical allocation contiguous 2009-02-26 21:47 how closely related are the physical and logical addressing? 2009-02-26 21:47 - when there is a gap between two regions of relatively dense/contiguous allocation, it doesn't matter much how big the gap is 2009-02-26 21:47 and what happens for example if we...raid the damn volume?
2009-02-26 21:47 so, try to arrange things in globs, with arbitrary gaps between them 2009-02-26 21:47 - worry about raid later 2009-02-26 21:47 (if it works well on a single disk you're doing a good job) 2009-02-26 21:48 - single spindle is the worst case 2009-02-26 21:48 duly noted ;) 2009-02-26 21:48 ok, so those are really simple rules that everybody knows 2009-02-26 21:48 yea but they're also probably responsible for 85% of speed 2009-02-26 21:49 here's a not so obvious one: try to make globs of allocation that are somewhat more than a track in size 2009-02-26 21:49 a track these days has 1-3 MB 2009-02-26 21:49 maybe it's more now 2009-02-26 21:49 a track in size? wouldn't that be drive dependent? 2009-02-26 21:49 it's easy to derive 2009-02-26 21:49 how would a fs know that? 2009-02-26 21:49 take your MB/sec and divide by rotations/sec 2009-02-26 21:49 not really, not with changeable density of info per cylinder 2009-02-26 21:50 gives MB/rotation, i.e., track size 2009-02-26 21:50 ah, that doesn't affect it much 2009-02-26 21:50 factor of two at most 2009-02-26 21:50 how come? 2009-02-26 21:50 look at a disk 2009-02-26 21:50 the writeable area is a band 2009-02-26 21:50 it doesn't go right into the middle 2009-02-26 21:51 the difference between path length on inside vs outside has to be at least 3x 2009-02-26 21:51 granted, still the same magnitude... 2009-02-26 21:51 hmm...so why did you say 'somewhat more than a track'? 2009-02-26 21:53 it's not 3X 2009-02-26 21:53 less than 2x 2009-02-26 21:53 well, when you seek off to some other place because of a discontinuity in allocation, you want to try to land in the middle of data 2009-02-26 21:53 i got a dead drive at home, i'll measure the damn thing ;) 2009-02-26 21:54 if your cluster size is much less than a track, you will mostly land in empty space 2009-02-26 21:54 the cost of that gets extremely high 2009-02-26 21:54 so you want some sort of probabilistic model of where you land after a track shift?
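The track-size rule of thumb above (sustained MB/sec divided by rotations/sec gives MB per rotation) is simple enough to write down. The function name and the 120 MB/s figure below are illustrative assumptions, not measurements from the discussion.

```python
def track_size_mb(throughput_mb_per_s: float, rpm: float) -> float:
    """Estimate track size as MB transferred during one platter rotation."""
    rotations_per_s = rpm / 60.0
    return throughput_mb_per_s / rotations_per_s


# Assumed example: a 7200 rpm disk sustaining 120 MB/s makes 120 rotations
# per second, so one rotation passes about 1 MB under the head.
example = track_size_mb(120.0, 7200.0)
```

Because sustained throughput differs between inner and outer tracks, the result is only good to within the "factor of two at most" mentioned above, which is enough for sizing allocation globs.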
2009-02-26 21:55 i can't think of anything right now to come up with that... 2009-02-26 21:55 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-26 21:55 let me see, I think a revolution of a 7200 rpm disk is 8 1/3 ms 2009-02-26 21:56 sounds right 2009-02-26 21:56 i know 6000rpm is 10ms (car computer hacking trivia) :) 2009-02-26 21:58 wouldn't that be the sort of stuff best left to controllers? 2009-02-26 21:58 i hate when two algorithms are trying to 'game' each other 2009-02-26 21:58 say you pick up one 4K block, that takes about 64 microseconds I think 2009-02-26 21:58 so, it costs you up to 8 1/3 ms to get 64 usec worth of data 2009-02-26 21:59 if you are maximally fragmented 2009-02-26 21:59 that ignores the cost of the seek itself 2009-02-26 21:59 that's just rotational latency 2009-02-26 21:59 that is the biggest performance killer 2009-02-26 21:59 much more so than track to track seek 2009-02-26 21:59 which one, the latency or the seek? 2009-02-26 21:59 latency then 2009-02-26 21:59 rotational latency is a much bigger problem than track to track seek latency 2009-02-26 22:00 so you want to cluster data 3 tracks wide, so no matter which way you seek, you land in the middle of some data 2009-02-26 22:01 am i on the right path here? 2009-02-26 22:01 -!- cdk(~chinmay@115.109.15.139) has joined #tux3 2009-02-26 22:02 and of course all this stuff goes out the window on solid memory 2009-02-26 22:04 hi flips 2009-02-26 22:05 he's been in and out all night, give him few mins, he'll show back up 2009-02-26 22:05 ok :) 2009-02-26 22:07 hi cdk 2009-02-26 22:07 did you send your list of famous people?
2009-02-26 22:07 sending you the information that you requested 2009-02-26 22:08 sent 2009-02-26 22:08 you should write the names in the order you would like them to appear in the hall of fame 2009-02-26 22:08 that is, most famous first 2009-02-26 22:09 :) i doubt that matters to us....we have all contributed equally 2009-02-26 22:11 flips, i think you gave me a new idea. so far i've been thinking of making my objective function in the units of blocks, a difference between good (contig, grouped), vs bad (fragmented, loners) blocks. but i'm starting to think that a more heuristic based one (the few good rules you've mentioned) with some sort of arbitrary scoring system 2009-02-26 22:14 ok, well here is my idea, roughly 2009-02-26 22:14 we do allocation something like a quadratic hash 2009-02-26 22:14 -!- kushal(~kushal@115.109.10.141) has joined #tux3 2009-02-26 22:15 if there is not room at the "home" location, we jump to a new location, using a generating function to determine the length of the jump 2009-02-26 22:15 so, if a bunch of stuff gets bounced away from "home" it gets bounced to the same place 2009-02-26 22:15 where hopefully it will end up relatively monotonic in layout 2009-02-26 22:15 so, that place may fill too 2009-02-26 22:16 take another jump, based on the generating function, a little further away 2009-02-26 22:16 see, i want my objective function to create a score, so whatever scenario you come up with can be objectively evaluated 2009-02-26 22:16 now, as a complication, you want to try to have data belonging to a given version mostly bounced the same number of bounces 2009-02-26 22:16 define bounce? 2009-02-26 22:17 store at some place other than home 2009-02-26 22:17 some predictable place 2009-02-26 22:17 what is home? current head position?
2009-02-26 22:17 like a quadratic hash 2009-02-26 22:17 "home" is more or less where the file was first allocated 2009-02-26 22:18 that will normally be, near where its immediate predecessor was allocated 2009-02-26 22:18 is that how you use your 'log'? 2-stage write? 2009-02-26 22:18 log? 2009-02-26 22:18 oh 2009-02-26 22:18 you told me long time ago to say log not journal ;) 2009-02-26 22:18 the log block will normally get laid down inline with the data blocks 2009-02-26 22:18 bunch of data blocks, a log block, bunch of data blocks, etc 2009-02-26 22:18 wait, so log is for metadata only? 2009-02-26 22:19 yes 2009-02-26 22:19 so data does not get a 2-stage treatment? 2009-02-26 22:21 no 2009-02-26 22:22 better put it down roughly where it belongs on the first try 2009-02-26 22:22 so, some form of weighting function can be used to determine a good position for the initial block of a file 2009-02-26 22:22 i was thinking dump it into some staging area for now, and when having idle io, realloc based on a much smarter decision 2009-02-26 22:23 after that, the weighting function becomes 99% "right after the last block" 2009-02-26 22:23 then when that becomes impossible, or highly inefficient because of fragmentation, the generating function driven jump kicks in 2009-02-26 22:23 call that "teleporting" 2009-02-26 22:24 you just teleported right over my head.. 2009-02-26 22:24 you know about quadratic hash? 2009-02-26 22:24 what generating function?
2009-02-26 22:24 no, i'm reading about it right now 2009-02-26 22:25 http://www.chilton-computing.org.uk/acl/literature/reports/p012.htm 2009-02-26 22:25 so, the quadratic part of it is simply one of many possible generating functions 2009-02-26 22:26 generating a repeatable sequence of distances 2009-02-26 22:26 i was just reading that, clear as mud :/ 2009-02-26 22:26 like a pseudorandom generator 2009-02-26 22:26 except, the sequence has known, useful distance properties 2009-02-26 22:27 usually steadily increasing, sometimes flipping negative 2009-02-26 22:27 like towers of hanoi sort of useful sequence? 2009-02-26 22:27 like fractal compression sort of useful 2009-02-26 22:28 you know what the generating function is going to do, given some set of parameters 2009-02-26 22:28 like rand() seeded with the same thing 2009-02-26 22:28 so, the generating function can actually vary depending on where it starts 2009-02-26 22:28 to try to avoid really nasty effects of interaction, if you always use the same generating function 2009-02-26 22:28 yes, like rand 2009-02-26 22:29 except the sequence is not random 2009-02-26 22:29 it has useful spatial properties 2009-02-26 22:29 which can be pretty simple, such as "increasingly far away, by an increasing delta" 2009-02-26 22:30 weird..how is that applicable to spinning disks?
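The "quadratic hash" allocation idea described above can be sketched as follows. This is a toy model under stated assumptions, not tux3 code: the volume is divided into fixed-size regions, the generating function is the plain quadratic sequence 0, 1, 4, 9, ... (so each jump is further away by an increasing delta), and `jump_targets`, `find_run` and `allocate` are hypothetical names. The variants mentioned in the discussion (negative jumps, per-home generating functions) are left out.

```python
from typing import Callable, Iterator, Optional


def jump_targets(home: int, region_blocks: int, volume_blocks: int,
                 max_jumps: int = 16) -> Iterator[int]:
    """Yield region start blocks: home first, then quadratically further away."""
    for i in range(max_jumps):
        yield (home + i * i * region_blocks) % volume_blocks


def find_run(start: int, end: int, count: int,
             is_free: Callable[[int], bool]) -> Optional[int]:
    """Return the first block of a free run of `count` blocks in [start, end)."""
    run = 0
    for block in range(start, end):
        run = run + 1 if is_free(block) else 0
        if run == count:
            return block - count + 1
    return None


def allocate(home: int, count: int, is_free: Callable[[int], bool],
             region_blocks: int, volume_blocks: int) -> Optional[int]:
    """Try home first; if it is full, follow the repeatable jump sequence."""
    for region_start in jump_targets(home, region_blocks, volume_blocks):
        region_end = min(region_start + region_blocks, volume_blocks)
        hit = find_run(region_start, region_end, count, is_free)
        if hit is not None:
            return hit
    return None
```

Because the sequence depends only on the home location, everything bounced away from the same home lands in the same overflow region and fills it in increasing order, which is the "bounced to the same place" property described above.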
2009-02-26 22:34 I thought it was obvious ;) 2009-02-26 22:34 well I guess not 2009-02-26 22:36 just because we can easily generate the next 'decent' place to put a file, i don't think that's a great allocation algorithm 2009-02-26 22:36 the thing is, it's obvious that you can't store all versions of a file at the same place 2009-02-26 22:37 so, natural remedy is just to choose the next available location when space runs out 2009-02-26 22:37 marcin: we can't rely on waiting for the disk to become idle 2009-02-26 22:37 this doesn't work very well 2009-02-26 22:37 you have to take a stab at where it needs to be 2009-02-26 22:37 ok, bad phrasing, not idle, but lighter io load...we need to be able to do online defrag 2009-02-26 22:38 that's a fallback 2009-02-26 22:38 yeah, it's really a crutch 2009-02-26 22:38 try to not need defrag in the first place 2009-02-26 22:38 if you say you don't need legs cause you have crutches your arms will get sore ;) 2009-02-26 22:39 yea but you're gonna get a good allocation by simply going off some magical sequence number, are you?
2009-02-26 22:39 we're going to try 2009-02-26 22:39 there are also loads which dont get idle 2009-02-26 22:39 agument that with some context sensitivity 2009-02-26 22:39 i guess you have a lot more faith in numerology than i do ;) 2009-02-26 22:40 i was thinking constant monitoring and adjusting 2009-02-26 22:40 even if you could always get idle, people don't like their disks trundling away for minutes after they finish using them 2009-02-26 22:40 full feedback system 2009-02-26 22:40 monitoring and adjusting = good 2009-02-26 22:40 but adjusting does not mean moving things 2009-02-26 22:40 well yes, you want to put in some sort of 'stops' so the system doesnt keep itself busy forever 2009-02-26 22:40 ext3 gets away without online defrag 2009-02-26 22:41 and it doesn't suck that horribly 2009-02-26 22:41 moving stuff is a desparation measume 2009-02-26 22:41 measure 2009-02-26 22:41 wow, you're really busting up my vision here ;) 2009-02-26 22:41 welcome to "what works" ;) 2009-02-26 22:42 now the thing is, we don't know "what works" with versioning 2009-02-26 22:42 nobody has really investigated that 2009-02-26 22:42 netapp is famous for handling it badly 2009-02-26 22:42 although i can certainly imagine cases that are "-o bursty" 2009-02-26 22:42 well, we'll have to play scientists 2009-02-26 22:42 where you want to just blast data down quickly 2009-02-26 22:42 and reorder that 2009-02-26 22:42 when its idle 2009-02-26 22:42 tool up, instrument the system, run some load, and then oogle logs for days 2009-02-26 22:43 trying to solve undefined problems is hard, some many different workloads 2009-02-26 22:43 s/some/so/ 2009-02-26 22:43 if we had 5 years to develop it, that would make sense 2009-02-26 22:43 we certainly can improve it later, but we need something fairly good, early on 2009-02-26 22:43 that's why i want my fragmentation walker, we gotta have some initial idea of what we're dealing with 2009-02-26 22:44 no point to chasing problems that arent real 
problems 2009-02-26 22:44 and you will get your fragmentation dump 2009-02-26 22:44 marcin: have you ever played with blktrace? 2009-02-26 22:44 we could certainly instrument with that 2009-02-26 22:44 no, what is that? 2009-02-26 22:44 oh man 2009-02-26 22:44 share, you ghetto care bear ;) 2009-02-26 22:44 http://oss.oracle.com/~mason/seekwatcher/ 2009-02-26 22:45 movies of allocation? oh man, i think this is my new pr0n! 2009-02-26 22:46 don't lie, its not exotic enough :P 2009-02-26 22:47 i'm so pleading the FIF! 2009-02-26 22:50 marcin, ok so you can generate your intial data with blktrace and analyze it with seekwatcher 2009-02-26 22:50 when that strategy peters out, then we can do something ddlinkish 2009-02-26 22:51 will do 2009-02-26 22:51 for now i gotta pack and sleep, got a flight home in 8hrs 2009-02-27 01:44 -!- RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3 2009-02-27 02:45 -!- amey_(~amey@115.109.8.33) has joined #tux3 2009-02-27 04:17 http://userweb.kernel.org/~hirofumi/fork-buffer.note 2009-02-27 04:17 this is the note for block fork 2009-02-27 04:18 this is not quite perfect, but I'll back to atomic commit 2009-02-27 04:19 and if someone had the question of it, maybe I'll update it again at the time 2009-02-27 04:19 well, anyway, I'm tired to write doc :) 2009-02-27 09:09 -!- gebi(~gebi@84-119-55-220.dynamic.xdsl-line.inode.at) has joined #tux3 2009-02-27 09:39 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-02-27 10:47 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-27 11:54 -!- cdk(~chinmay@121.246.33.35) has joined #tux3 2009-02-27 12:57 -!- dcg(~dcg@83.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-02-27 14:06 hirofumi, there? 
2009-02-27 14:06 yes 2009-02-27 14:07 well I think I should check in ddlink, even if we are not going to make it part of the initial review 2009-02-27 14:07 ok 2009-02-27 14:07 maybe do this on a branch 2009-02-27 14:07 I'm not sure 2009-02-27 14:07 easiest is just do it on trunk, then fix it up later the way we want it 2009-02-27 14:07 another thing: I think maybe we should move all of tux3/user up into tux3 2009-02-27 14:08 well, maybe I'll play the atomic commit until it's done more or less 2009-02-27 14:08 I think it's not good 2009-02-27 14:08 the ddlink stuff is part of working on atomic commit for me 2009-02-27 14:08 what is not good? 2009-02-27 14:08 move user to tux3? 2009-02-27 14:09 I guess it confuses hg more or less 2009-02-27 14:09 ok, let's just leave it 2009-02-27 14:09 actually, hg confuses us 2009-02-27 14:09 "for historical reasons" 2009-02-27 14:09 so, it would be later 2009-02-27 14:09 ok fine 2009-02-27 14:09 setting up kernel.org git is more important 2009-02-27 14:10 so, I will make an initial tree on tux3.org for later today 2009-02-27 14:10 with complete history? 2009-02-27 14:10 and we can experiment with importing history 2009-02-27 14:10 ah, tux3.org 2009-02-27 14:10 I won't try to import history today 2009-02-27 14:10 probably 2009-02-27 14:11 well it could be phunq.net, but tux3.org is better I think, the point is, it's not kernel.org until the tree is right 2009-02-27 14:11 btw, what is doing on ddlink for atomic commit? 2009-02-27 14:11 I want to monitor the state of the log 2009-02-27 14:11 and be able to trigger deltas 2009-02-27 14:11 from externa 2009-02-27 14:12 from external source 2009-02-27 14:12 for debugging and testing? 2009-02-27 14:12 yes 2009-02-27 14:12 i see 2009-02-27 14:12 for rapid tuning 2009-02-27 14:12 also, start to use the ddlink to do btree range dumps 2009-02-27 14:12 start to work on online fsck and repair 2009-02-27 14:13 um... 
2009-02-27 14:13 my idea is, we can let the user attempt some repairs, that are controlled by ddlink 2009-02-27 14:13 another idea: implement block read over a ddlink 2009-02-27 14:13 so that user space code can read metadata 2009-02-27 14:13 obviously there are locking issues 2009-02-27 14:13 yes 2009-02-27 14:13 and maybe, cache alias 2009-02-27 14:14 but there is a big advantage too: reading over the ddlink, we get the kernel's cached version of a block 2009-02-27 14:14 no cache alias 2009-02-27 14:14 we read from cache 2009-02-27 14:14 well 2009-02-27 14:14 user address space is different with kernel 2009-02-27 14:14 reading from a block that is supposed to be in page cache will create an alias 2009-02-27 14:14 but block number space is the same 2009-02-27 14:14 well 2009-02-27 14:14 yes, that's correct 2009-02-27 14:15 just an idea 2009-02-27 14:15 it could be a cool hack 2009-02-27 14:15 I hope it doesn't bother the normal code 2009-02-27 14:15 no 2009-02-27 14:15 it's strictly userspace 2009-02-27 14:15 with very loose locking 2009-02-27 14:16 we can think about the locking issues 2009-02-27 14:16 later 2009-02-27 14:16 but, helper function is in kernel? 2009-02-27 14:16 for now it will be, kernel stops changing the fs, then userspace can examine it 2009-02-27 14:16 helper function is just a ddlink method 2009-02-27 14:16 so we can remove all of that when we want 2009-02-27 14:16 for kernel review 2009-02-27 14:17 well, for example, to feed data to ddlink, we may modify kernel/btree.c? 2009-02-27 14:17 I don't think that's necessary 2009-02-27 14:18 i see 2009-02-27 14:18 we just use the cursor functionality 2009-02-27 14:18 probe, advance, read out the block over the ddlink 2009-02-27 14:19 with copy of btree.c functions? 
2009-02-27 14:19 some sort of copy 2009-02-27 14:19 using the btree.c primitives 2009-02-27 14:19 the copy is part of the ddlink method 2009-02-27 14:19 so no change to btree.c 2009-02-27 14:19 the ddlink interface can live in a separate file maybe 2009-02-27 14:19 i see 2009-02-27 14:19 currently it is in inode.c 2009-02-27 14:19 but it can be somewhere else 2009-02-27 14:20 hooks.c or something 2009-02-27 14:20 interface.c 2009-02-27 14:20 userland.c 2009-02-27 14:20 interface.c is a pretty good name 2009-02-27 14:21 control.c 2009-02-27 14:21 I like the last one 2009-02-27 14:21 _.c? 2009-02-27 14:21 when we actually know what the functionality should be, maybe 2009-02-27 14:22 yes 2009-02-27 14:22 so I am thinking, there will the three new files, all of which we can strip for review: ddlink.c, ddlink.h and control.c 2009-02-27 14:22 well, I'm going to try to complete userland atomic commit 2009-02-27 14:22 :) 2009-02-27 14:23 ok, I will be here to work with you on it 2009-02-27 14:23 are there open questions? 
2009-02-27 14:23 I have just been lazy :-/ 2009-02-27 14:23 it should have been written by now 2009-02-27 14:23 well, I'm not starting it yet, so there is no question yet 2009-02-27 14:24 maybe I should just talk about it a bit 2009-02-27 14:24 thanks 2009-02-27 14:24 the main missing bit is the promise/replay logic 2009-02-27 14:24 yes 2009-02-27 14:24 so, this needs to be written in detail as a design note 2009-02-27 14:24 I will just try to describe some main points now 2009-02-27 14:25 there are two good places to start with: bitmaps and btree nodes 2009-02-27 14:25 well, I'd like to forget about replay at first 2009-02-27 14:25 yes 2009-02-27 14:25 it is necessary to thinkg about replay in order to know what to log 2009-02-27 14:25 but the replay does not actually have to work 2009-02-27 14:25 yes, of course 2009-02-27 14:26 yes 2009-02-27 14:26 forget about reconstruct 2009-02-27 14:26 so, bitmaps have one interesting rule for replay: the physical pinned blocks have to be reconstructed first, before bitmap replay can be done 2009-02-27 14:26 well 2009-02-27 14:26 here comes an interesting point 2009-02-27 14:27 physical pinned blocks? 2009-02-27 14:27 when replaying btree nodes, the promise will be made against the _redirected_ block 2009-02-27 14:27 but replay will only have the _original_ block 2009-02-27 14:27 that means, we have to log the redirect 2009-02-27 14:27 yes 2009-02-27 14:28 so that replay can follow the physical redirect 2009-02-27 14:28 um... 2009-02-27 14:28 physical pinned block = dirty btree node in volmap that we do not normally flush out in each delta 2009-02-27 14:29 ah 2009-02-27 14:29 we have physicall pinned blocks and logical pinned blocks, that is, bitmaps 2009-02-27 14:29 ok, I think I have covered the main points 2009-02-27 14:29 everything else we have at least a prototype implementation of 2009-02-27 14:29 well, pinned blocks for bitmap can just be forget? 2009-02-27 14:30 forget? 2009-02-27 14:30 ignore? 
2009-02-27 14:30 sure 2009-02-27 14:30 we need to implement it before doing more benchmarks though 2009-02-27 14:30 updating bitmap blocks will be a major write bottleneck 2009-02-27 14:31 i see 2009-02-27 14:31 however, it is optimization stuff? 2009-02-27 14:31 for now, we can flush bitmaps per delta 2009-02-27 14:31 I think it is optimization 2009-02-27 14:31 ok 2009-02-27 14:31 ok, one other thing 2009-02-27 14:31 wait a bit 2009-02-27 14:31 ok 2009-02-27 14:32 about btree redirect and logs 2009-02-27 14:32 we have to care about it? 2009-02-27 14:32 yes, it's fundamental 2009-02-27 14:32 if delta was commited, we have all of informations simpley? 2009-02-27 14:33 not unless we flush the btree nodes, and that would force us to do that recursively all the way to the root 2009-02-27 14:33 it is essential that we not do that 2009-02-27 14:33 we need to use the "promise" method, otherwise we fail to avoid recursive copy on write 2009-02-27 14:34 yes 2009-02-27 14:34 however, what is the point on replay? 2009-02-27 14:34 promises for btree node redirect is not much code 2009-02-27 14:35 replay of that is pretty siimple 2009-02-27 14:35 ok 2009-02-27 14:35 the above means, we have to store log of bnode redirect? 2009-02-27 14:35 replay sees the redirect log record, loads the original, "clean", metadata node, and copies it to the new, redirected location 2009-02-27 14:36 after that, parent update promises are applied to the redirected block 2009-02-27 14:36 yes 2009-02-27 14:36 we have to log the btree node redirect, maybe I already did that? 2009-02-27 14:36 ACTION looks 2009-02-27 14:37 log_redirect(sb, oldblock, newblock); 2009-02-27 14:37 yes 2009-02-27 14:37 so there the only bit missing there is the replay, I can code that while you are doing other things 2009-02-27 14:38 if you like 2009-02-27 14:38 let's see if I coded the log reload 2009-02-27 14:39 um... 
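The redirect replay described above (load the original clean node, copy it to the redirected location, then apply the parent update promises there) can be modelled in a few lines. This is a toy sketch only: the record shapes, the `replay` function and the dictionary standing in for the volmap cache are all assumptions made for illustration, and only the `log_redirect(sb, oldblock, newblock)` call quoted in the chat comes from the actual code.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical record shapes:
#   ("redirect", old_block, new_block)
#   ("promise", block, apply)  where apply mutates the cached block in place
Record = Tuple


def replay(log: Iterable[Record],
           read_block: Callable[[int], bytes]) -> Dict[int, bytearray]:
    """Rebuild dirty, pinned btree nodes in a cache keyed by block number."""
    cache: Dict[int, bytearray] = {}
    for rec in log:
        kind = rec[0]
        if kind == "redirect":
            _, old_block, new_block = rec
            # Copy the clean on-disk node into the cache at its new location.
            cache[new_block] = bytearray(read_block(old_block))
        elif kind == "promise":
            _, block, apply = rec
            # Promises were made against the redirected block, so it must
            # already be in the cache when they are applied.
            apply(cache[block])
    return cache
```

The original block is never modified or cached here, which matches the point made above that replay only has the original block on disk while the promises refer to the redirected one.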
2009-02-27 14:39 ah, yes I did 2009-02-27 14:39 tested it a couple of times, even 2009-02-27 14:39 what is log_redirect() for? 2009-02-27 14:39 why do we need redirect? 2009-02-27 14:39 just a question 2009-02-27 14:39 that logs the redirect record I described above, that allows the replay to reconstruct redirected btree nodes 2009-02-27 14:40 why do we need to redirect itself? 2009-02-27 14:40 redirect is to avoid overwriting blocks that are needed to be able to reconstruct dirty, pinned metadata, from the state of the previous flush cycle 2009-02-27 14:41 we don't overwrite until btree flush 2009-02-27 14:41 true, but we redirect at the time the btree node is made dirty 2009-02-27 14:42 why do we need this redirect? 2009-02-27 14:42 now why is that... (I have a reason) 2009-02-27 14:42 I had a reason 2009-02-27 14:42 I forget it just now 2009-02-27 14:42 :) 2009-02-27 14:42 and maybe the reason was wrong 2009-02-27 14:42 but let's think carefully about that 2009-02-27 14:42 yes 2009-02-27 14:42 well, later we don't know what we have to redirect 2009-02-27 14:42 maybe that's it 2009-02-27 14:43 I hit the reason when I coded the redirect 2009-02-27 14:43 well, I thought what do we do for log_redirect's log 2009-02-27 14:43 ok, that is our thinking project for today :) 2009-02-27 14:43 :) 2009-02-27 14:43 log_redirect is very simple 2009-02-27 14:43 on replay 2009-02-27 14:44 just the source and destination physical block address 2009-02-27 14:44 on replay is also simple 2009-02-27 14:44 just load the block from source into the cache at destination 2009-02-27 14:44 that is, the volmap cache 2009-02-27 14:44 so, maybe, we will just throw away the original block 2009-02-27 14:45 yes, that's fine 2009-02-27 14:45 the original cache block 2009-02-27 14:45 well, so, I though why do we store throw away log 2009-02-27 14:45 I thought 2009-02-27 14:45 well 2009-02-27 14:45 we need the block in cache at the redirected location, where we will apply promises to it 2009-02-27
14:46 on replay 2009-02-27 14:46 yes 2009-02-27 14:46 however, on replay, redirect wouldn't necessary 2009-02-27 14:47 maybe, it can change original block on the cache 2009-02-27 14:47 well, it is what I felt 2009-02-27 14:49 let's think about that 2009-02-27 14:49 for one day :) 2009-02-27 14:49 :) 2009-02-27 14:49 I have a strong sense the redirect is essential 2009-02-27 14:49 well, for now, those are a bit far for me 2009-02-27 14:49 but I can't find my nodes on that right now :) 2009-02-27 14:50 well, that is the entire description I think 2009-02-27 14:50 :) 2009-02-27 14:50 well, so, I'll try to store bitmap completely 2009-02-27 14:50 maybe it will be like fork, and my approach is more complicated than necessary 2009-02-27 14:50 then, I'll do next thing 2009-02-27 14:51 or maybe I am right this time 2009-02-27 14:52 current one (commit.c) seems to work almost 2009-02-27 14:52 yes, it was close 2009-02-27 14:52 but, it's not a part of fs 2009-02-27 14:52 right 2009-02-27 14:52 there are a few minor things that have to be changed for kernel 2009-02-27 14:52 some wrappers 2009-02-27 14:52 so, I'm going to merge it to fs stuff 2009-02-27 14:52 yes 2009-02-27 14:52 good start 2009-02-27 14:53 and I'm imagineing the kernel port is final one 2009-02-27 14:53 after working on the userland more or less 2009-02-27 14:54 final before review? 
2009-02-27 14:54 I'm not sure about it 2009-02-27 14:55 I think it is ok to start review when atomic commit/replay still does not work perfectly 2009-02-27 14:55 as long as the basic mechanism exists 2009-02-27 14:55 maybe 2009-02-27 14:55 of course it is better if it works :) 2009-02-27 14:56 yes 2009-02-27 14:56 at least, it should not crash the fs 2009-02-27 14:56 or corrupt it 2009-02-27 14:56 that is, corrupt it in normal running, or remount 2009-02-27 14:56 it is ok if it corrupts on replay 2009-02-27 14:56 and I guess it would get attention if it have atomic commit 2009-02-27 14:56 we will just turn off replay in that case 2009-02-27 14:56 well yes 2009-02-27 14:57 but it will also get attention if we _develop_ the replay in public 2009-02-27 14:57 yes 2009-02-27 14:57 I think that is more interesting for the linux observers than seeing something merged that works perfectly 2009-02-27 14:57 yes 2009-02-27 14:57 we are developing in public here of course, but it is a small corner of the net 2009-02-27 14:57 there are two things 2009-02-27 14:58 for the users/testers, for the fs devlopers 2009-02-27 14:58 yes 2009-02-27 14:58 for the users/testers, I guess with atomic commit would be better 2009-02-27 14:58 and for us, I think it is enjoyable to develop with more observers 2009-02-27 14:58 right 2009-02-27 14:59 we should merge as tux3-dev 2009-02-27 14:59 as was proposed for btrfs 2009-02-27 14:59 otherwise, without would be better if someone want to join tux3 2009-02-27 14:59 yes 2009-02-27 15:00 recently, I'm feeling kernel devlopers are small 2009-02-27 15:00 um... 
2009-02-27 15:00 there are many developers actually 2009-02-27 15:01 however, great devlopers and are having time is small 2009-02-27 15:01 yes 2009-02-27 15:01 they tend to have jobs ;) 2009-02-27 15:01 and bosses 2009-02-27 15:01 yes 2009-02-27 15:01 but there are some who are able to help 2009-02-27 15:01 probably 2009-02-27 15:01 and who will not help until we are at least in -mm 2009-02-27 15:02 yes 2009-02-27 15:02 for example, akpm ;) 2009-02-27 15:02 the greatest of the great 2009-02-27 15:02 however, I'm expecting the many users/testers are there 2009-02-27 15:02 yes 2009-02-27 15:03 right, and many of those users _will_ be able to hack our userspace code 2009-02-27 15:03 he is kind 2009-02-27 15:03 :) 2009-02-27 15:03 probably 2009-02-27 15:03 we can also offer up the possibility of more student projects 2009-02-27 15:03 like the dedup 2009-02-27 15:03 yes 2009-02-27 15:04 and our dedup team can pass the kernel implementation part of their project to the next group of students 2009-02-27 15:04 oh 2009-02-27 15:04 I think I will encourage them to keep dedup in user space for now, and improve it some more there 2009-02-27 15:04 rather than beginning kernel port right now 2009-02-27 15:04 good 2009-02-27 15:04 they only have a few more weeks left in their term 2009-02-27 15:04 so there should not be stress for them, or us 2009-02-27 15:05 good 2009-02-27 15:05 well, there woudl be some interesting optimization for dedup 2009-02-27 15:06 well, so, I'm thinking atomic commit is basic functionality 2009-02-27 15:06 so, I thought with atomic commit, it would be good for people 2009-02-27 15:06 true 2009-02-27 15:07 when we have atomic commit, it is actually an ext3 replacement, or very close to it 2009-02-27 15:07 if someone does have interest to atomic commit 2009-02-27 15:07 atomic commit and fsck 2009-02-27 15:07 yes 2009-02-27 15:08 well, anyway, I'm not much care with or without 2009-02-27 15:08 I just care tux3 is including to linus tree 2009-02-27 15:09 
and without crappy trouble 2009-02-27 15:09 yes 2009-02-27 15:10 like reiserfs4 2009-02-27 15:10 that means no ddlink at submission ;) 2009-02-27 15:10 :) 2009-02-27 15:13 um... akpm would like which one, without atomic commit as eairly, or with atomic commit 2009-02-27 15:13 he did not specify 2009-02-27 15:13 with atomic commit was my idea 2009-02-27 15:14 I think akpm wanted it without actually 2009-02-27 15:14 booting as root fs is good 2009-02-27 15:14 yes 2009-02-27 15:14 not crashing very much is good 2009-02-27 15:14 performing well is good 2009-02-27 15:14 I think we are performing above expectations at this point 2009-02-27 15:15 probably 2009-02-27 15:15 thanks largely to you :) 2009-02-27 15:15 however, it is without key function 2009-02-27 15:15 thanks :) 2009-02-27 15:15 it is, but it is also with writing a log of extra data 2009-02-27 15:16 and also, the standard kernel flush causes extra seeking that we will avoid via promises for bitmaps and btree nodes 2009-02-27 15:16 because right now we keep seeking back to those locations to update them 2009-02-27 15:16 so I think it will run faster with atomic commit + logging 2009-02-27 15:17 we will soon see 2009-02-27 15:17 yes 2009-02-27 15:17 it is also with writing a _lot_ of extra data <- correction 2009-02-27 15:17 that is, 8k extra for every file 2009-02-27 15:17 ah, yes 2009-02-27 15:18 seeking back to the btree node blocks and bitmap blocks is probably a bigger loss of performance 2009-02-27 15:18 when we have logging working, our initial kernel untar will be nearly a perfectly linear write 2009-02-27 15:19 we will only be pausing for commit blocks 2009-02-27 15:19 and we will fix that later, too 2009-02-27 15:20 yes 2009-02-27 15:20 there is no reason I can see why we should not get very close to platter speed for a kernel untar 2009-02-27 15:20 flips, you're making me drool 2009-02-27 15:22 many small files might be slow than platter speed 2009-02-27 15:24 maybe it depends on metadata write speed 
2009-02-27 15:26 we will see about the small flies ;) 2009-02-27 15:26 I think we will be able to optimize that very well 2009-02-27 15:27 with large deltas 2009-02-27 15:27 yes 2009-02-27 15:27 the directory blocks will just be written inline with the data 2009-02-27 15:28 if we have three deltas in the pipeline, we can organize the directory blocks to be _at the beginningj_ of the inode table and data blocks for those files 2009-02-27 15:28 which is best for reading 2009-02-27 15:28 so we get perfect optimization for both writing and reading 2009-02-27 15:28 I guess I never stated that before ;) 2009-02-27 15:29 this is enabled by the design of the commit pipeline 2009-02-27 15:29 a fundamental advantage, I think 2009-02-27 15:31 write the data to ileaf? 2009-02-27 15:31 the ileaf blocks can be written inline two 2009-02-27 15:31 by the same principle 2009-02-27 15:31 ah 2009-02-27 15:31 even without writing the data to ileaf 2009-02-27 15:32 the ileaf optimization will only save bandwidth 2009-02-27 15:32 not platter speed 2009-02-27 15:32 our atomic commit strategy saves seeks and rotational latency, the big performance killers 2009-02-27 15:32 bandwidth is not as important 2009-02-27 15:32 but we will later optimize the bandwidth as well, with the immediate data attribute optimization 2009-02-27 15:33 where file data is handled like an xattr 2009-02-27 15:34 hirofumi, can we delete fat_compat_dir_ioctl? 2009-02-27 15:34 probably 2009-02-27 15:35 plugging the delta pipeline? 2009-02-27 15:35 ? 2009-02-27 15:35 plug/unplug like io-schedular 2009-02-27 15:37 organize means, write it as liner? 2009-02-27 15:38 write those deltas at a time 2009-02-27 15:38 ? 
2009-02-27 15:38 ah 2009-02-27 15:38 well 2009-02-27 15:39 I think we will do something better than plug 2009-02-27 15:39 plug has always been a little less than nice ;-) 2009-02-27 15:39 i see 2009-02-27 15:39 it is responsible for major loss of throughput in some loads 2009-02-27 15:39 well, so, compat_ioctl 2009-02-27 15:39 yes 2009-02-27 15:39 where the disk is doing nothing, even though lots of data is dirty 2009-02-27 15:40 yes 2009-02-27 15:41 there is also the mysterious iowait performance bug that has been lurking for a few years 2009-02-27 15:41 I think we have a better chance for finding/fixing that one than anybody 2009-02-27 15:41 we will just keep removing bogus interfaces until it disappears ;) 2009-02-27 15:41 :) 2009-02-27 15:42 well, it might be io-sched stuff 2009-02-27 15:42 and submit command stuff (e.g. WRITE_SYNC) 2009-02-27 15:43 compat_ioctl, if our ioctl is compatible on 32bit/64bit, it is not needed 2009-02-27 15:44 root@usermode:~# tree /proc/295/task/295/fd 2009-02-27 15:44 /proc/295/task/295/fd 2009-02-27 15:44 |-- 0 -> /dev/tty0 2009-02-27 15:44 |-- 1 -> /dev/tty0 2009-02-27 15:44 |-- 2 -> /dev/tty0 2009-02-27 15:44 `-- 4 -> anon_inode:[ddlink] 2009-02-27 15:44 how does it look? 2009-02-27 15:44 nice 2009-02-27 15:45 yes, it seems good 2009-02-27 15:45 hrm theres no identifier though 2009-02-27 15:45 maybe there should be? 
2009-02-27 15:45 like sockets have 2009-02-27 15:45 lrwx------ 1 shapor eng 64 Feb 26 13:45 3 -> socket:[334344] 2009-02-27 15:47 I guess it would not have identifier 2009-02-27 15:47 http://mailman.tux3.org/pipermail/tux3/2009-February/000749.html <- Patch: Example ddlink control interface 2009-02-27 15:48 well it is attached to some state 2009-02-27 15:48 right, I don't really like how the proc output looks, but at least it looks as intended 2009-02-27 15:48 (however much bad taste that might be) 2009-02-27 15:48 might be useful for troubleshooting 2009-02-27 15:48 we can fix it later 2009-02-27 15:48 by fixing anon_inode 2009-02-27 15:49 add such a feature when/if its needed 2009-02-27 15:49 yes 2009-02-27 15:49 at least everybody will know what it is 2009-02-27 15:49 well 2009-02-27 15:49 and we aren't submitting ddlink ;) 2009-02-27 15:49 ddlink is for us to start developing interfaces we will need for replication 2009-02-27 15:49 etc 2009-02-27 15:49 online repair 2009-02-27 15:50 online interactive repair 2009-02-27 15:50 that will be really cool 2009-02-27 15:50 tell the filesystem where to look for broken data, and what to do when it finds it 2009-02-27 15:50 there is no example_ddlink() 2009-02-27 15:50 we can also implement logical dump/restore 2009-02-27 15:50 whoops 2009-02-27 15:50 I should have compiled? ;-) 2009-02-27 15:51 I thought I did 2009-02-27 15:51 yes 2009-02-27 15:51 oh 2009-02-27 15:51 it is in header 2009-02-27 15:51 maybe not the last time 2009-02-27 15:51 ah 2009-02-27 15:52 fixed 2009-02-27 15:52 I moved the ioctl function into control.c, from its former home in inode.c 2009-02-27 15:52 I think it belongs there 2009-02-27 15:52 other filesystems would call this file ioctl.c 2009-02-27 15:52 it would have some problems 2009-02-27 15:52 ? 2009-02-27 15:52 just for record 2009-02-27 15:53 security? 
2009-02-27 15:53 no 2009-02-27 15:53 but it does have security problems as we discussed, fixable though 2009-02-27 15:53 we want the better define of DDLINK 2009-02-27 15:53 ah, I actually checked that very carefully 2009-02-27 15:53 respell maybe 2009-02-27 15:53 DDLINK_IOCTL 2009-02-27 15:54 _IO(0, ) <- 0, someone may be using already? 2009-02-27 15:54 no, I did a survey of all users 2009-02-27 15:54 it's ours to grab 2009-02-27 15:54 oh 2009-02-27 15:54 there is one obscure user that does not conflict 2009-02-27 15:54 it's in the ioctl list somewhere 2009-02-27 15:55 and that user uses it with some subfunction 2009-02-27 15:55 but please recheck my check ;) 2009-02-27 15:55 ioctl numbers is in Documentation/ioctl/ioctl-number.txt 2009-02-27 15:56 yes 2009-02-27 15:56 it seems "Code 0x00" is used for some placed 2009-02-27 15:56 some places 2009-02-27 15:57 0xDD 00-3F ZFCP device driver see drivers/s390/scsi/ 2009-02-27 15:57 and seq# is different? 2009-02-27 15:57 0xdd is seq# for DDLINK? 2009-02-27 15:58 remembering now ;) 2009-02-27 15:59 well, it doesn' matter right now though 2009-02-27 15:59 another one is __WAIT_QUEUE_HEAD_INITIALIZER() 2009-02-27 15:59 ok, yes 2009-02-27 15:59 iirc, it is not lockdep compatible 2009-02-27 15:59 ddlink is code 0, sequence dd 2009-02-27 15:59 yes 2009-02-27 15:59 no conflict 2009-02-27 16:00 yes 2009-02-27 16:00 that makes us better than the other users of code zero ;) 2009-02-27 16:00 :) 2009-02-27 16:01 64 0x00 00-1F linux/fs.h conflict! 2009-02-27 16:01 65 0x00 00-1F scsi/scsi_ioctl.h conflict! 2009-02-27 16:01 66 0x00 00-1F linux/fb.h conflict! 2009-02-27 16:01 67 0x00 00-1F linux/wavefront.h conflict!
2009-02-27 16:01 68 0x02 all linux/fd.h 2009-02-27 16:01 2009-02-27 16:01 however, our code number would be better 2009-02-27 16:01 well, we can use code dd, sequence dd 2009-02-27 16:01 that would be kind of cool 2009-02-27 16:01 ah, "dd" 2009-02-27 16:02 it looks good, except some nits 2009-02-27 16:03 like ioctl number, __WAIT_* 2009-02-27 16:04 can you post review comments to the list? it's sk8 oclock 2009-02-27 16:04 meeting tim on the boardwalk 2009-02-27 16:04 ok 2009-02-27 16:05 well, I'll sleep after that 2009-02-27 16:05 oyasumi then 2009-02-27 16:06 oyasumi 2009-02-27 16:08 -!- flipsout changed mode/#tux3 -> -o flipsout 2009-02-27 16:34 hirofumi: still there? I have a little trouble understanding one thing about the disk-layout 2009-02-27 16:35 yes 2009-02-27 16:35 in ileaf_lookup the size of the ileaf is returned as an unsigned 2009-02-27 16:35 whilst in ileaf_dump the validity is checked by comparing size >= 0 2009-02-27 16:35 i am asking as i tried injecting errors in my disk-layout for the fsck and tux3 segfaults when decoding the xsize 2009-02-27 16:36 and ileaf_dump correctly prints out 0xd: 2009-02-27 16:36 are those different things? 2009-02-27 16:38 those?
2009-02-27 16:38 the sizes that are being checked 2009-02-27 16:38 one being unsigned, the other one being checked against >=0 to detect errors 2009-02-27 16:38 negative value is wrong always 2009-02-27 16:39 the size is inode attributes size 2009-02-27 16:39 right, but as it is unsigned anyway, it should not be comparable 2009-02-27 16:39 so, it never be bigger than blocksize 2009-02-27 16:39 ah ok, one could check for that 2009-02-27 16:40 well, ileaf_lookup() is assuming it is not corrupted 2009-02-27 16:40 right, that's why i wanted to check for that in open_inode 2009-02-27 16:40 at least for now 2009-02-27 16:41 I'm not sure, "int" is ok or not 2009-02-27 16:41 however, probably "int" would be prefer for size 2009-02-27 16:42 open_inode: found inode 0xd 2009-02-27 16:42 open_inode: Failed assert(size <= sb->blocksize)! 2009-02-27 16:42 Trace/breakpoint trap 2009-02-27 16:42 ok, it fails the assert now 2009-02-27 16:43 if it's just for assert/check for now, you can just do "(int)size < 0" 2009-02-27 16:44 i am not involved enough to decide which is correct here 2009-02-27 16:44 both is ok 2009-02-27 16:45 there is no big difference, if sb->blocksize is already verified 2009-02-27 16:46 "(int)size < 0" would be work without trusted sb->blocksize 2009-02-27 16:46 hmm, that's certainly true 2009-02-27 16:46 yes 2009-02-27 16:49 ok, size <= sb->blocksize would be better 2009-02-27 16:49 because, it checks a bit strictly 2009-02-27 16:54 so attrs are always size-restricted? 2009-02-27 16:55 yes 2009-02-27 16:55 I think so 2009-02-27 16:55 512bytes inode is enough big 2009-02-27 16:56 and in the future it's quite likely we will be moving to 4k or even 128k blocks, as in intel SSDs 2009-02-27 16:56 one more question :) is there some place where the highest inode-number is stored? 
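The two checks weighed above differ mainly in how much they trust the superblock. A minimal sketch of both follows, assuming hypothetical helper names (`attr_size_sane`, `attr_size_nonnegative`, `check_attr_size`) that are not taken from the tux3 source; only the compared expressions come from the log.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helpers for illustration; none of them exists in tux3.
 * "size" stands for the unsigned inode attribute size discussed above.
 */

/* Stricter check: needs a block size that has already been verified. */
static bool attr_size_sane(unsigned size, unsigned blocksize)
{
	/* A "negative" size stored in an unsigned field shows up as a huge
	 * value, so this single comparison rejects it as well. */
	return size <= blocksize;
}

/* Weaker check that works without a trusted block size.  Converting an
 * out-of-range unsigned value to int is implementation-defined in C; on
 * the usual two's complement targets it yields a negative number. */
static bool attr_size_nonnegative(unsigned size)
{
	return (int)size >= 0;
}

/* Mirrors the assert that fired in open_inode in the log above. */
static void check_attr_size(unsigned size, unsigned blocksize)
{
	assert(attr_size_sane(size, blocksize));
}
```

With a 4096 byte block, a corrupted size of `(unsigned)-1` fails both tests, while a size of 4097 is caught only by the stricter one, which is presumably what "it checks a bit strictly" refers to.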
2009-02-27 16:57 there is no highest inode-number for now 2009-02-27 16:57 ok 2009-02-27 16:58 currently, 48bits full numbers will be used always for now 2009-02-27 16:58 later, we may add option to limit the number for some reason 2009-02-27 16:58 but, not sure yet 2009-02-27 16:58 i was thinking about creating a bitmap for the fsck to check for duplicate inodes. but that is a clear point against that idea 2009-02-27 16:59 guess i have to do an extra tree pass for that 2009-02-27 17:00 tree pass? 2009-02-27 17:00 tree walk 2009-02-27 17:00 to get the highest given out or something 2009-02-27 17:00 or work with dynamic structures to record used inodes :/ 2009-02-27 17:01 I'm not familar to do it though 2009-02-27 17:01 just a idea, simple one is using the file? 2009-02-27 17:01 of course, sparse file 2009-02-27 17:02 create sparse file of 1<<48, then mmap it, and just use it 2009-02-27 17:02 yeah. could work. 2009-02-27 17:03 yes 2009-02-27 17:03 i am not sure if i have to scan through it later on though, for some reason i can't think of yet 2009-02-27 17:03 or would that still be possible? 2009-02-27 17:03 and work on small memory system without swap 2009-02-27 17:05 create file, unlink it, and don't close at end of process? 2009-02-27 17:05 so, you can scan it via that fd? 
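The sparse-file idea suggested above (create a file, unlink it, mmap it, use it as a bitmap) could look roughly like this in userspace. This is only a sketch under assumptions: the struct, the function names and the `/tmp` location are hypothetical, and nothing here is taken from the tux3 fsck code.

```c
#define _POSIX_C_SOURCE 200809L	/* for mkstemp() and ftruncate() */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical names throughout; this is not tux3 or fsck code. */
struct sparse_bitmap {
	int fd;			/* kept open so the file can be scanned later */
	unsigned char *bits;	/* shared mapping of the whole file */
	size_t bytes;		/* mapping length in bytes */
};

/* Map a bitmap of 2**bits_log2 bits backed by an unlinked sparse file. */
static int sparse_bitmap_open(struct sparse_bitmap *bm, unsigned bits_log2)
{
	char path[] = "/tmp/inode-bitmap-XXXXXX";	/* assumed location */

	assert(bits_log2 >= 3 && bits_log2 - 3 < sizeof(size_t) * 8);
	bm->bytes = (size_t)1 << (bits_log2 - 3);
	bm->fd = mkstemp(path);
	if (bm->fd < 0)
		return -1;
	unlink(path);	/* storage is released once the last reference goes */
	if (ftruncate(bm->fd, (off_t)bm->bytes) < 0)
		goto fail;
	bm->bits = mmap(NULL, bm->bytes, PROT_READ | PROT_WRITE,
			MAP_SHARED, bm->fd, 0);
	if (bm->bits == MAP_FAILED)
		goto fail;
	return 0;
fail:
	close(bm->fd);
	return -1;
}

/* Set bit n; return 1 if it was already set (a duplicate inode number). */
static int sparse_bitmap_test_and_set(struct sparse_bitmap *bm, uint64_t n)
{
	unsigned char mask = (unsigned char)(1u << (n & 7));
	unsigned char *byte = bm->bits + (n >> 3);
	int was_set = (*byte & mask) != 0;

	*byte |= mask;
	return was_set;
}

/* Drop the mapping and the file descriptor. */
static void sparse_bitmap_close(struct sparse_bitmap *bm)
{
	munmap(bm->bits, bm->bytes);
	close(bm->fd);
}
```

A bitmap for the full 48 bit inode space would need 2**45 bytes (32 TiB) of file offset and address space, which cannot be mapped at all on a 32 bit machine and may exceed the maximum file size of the host filesystem, so a real checker would likely need a smaller window or a different structure.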
2009-02-27 17:06 what i mean is, that the file is huge, although it contains mostly zeros and doesn't really take space 2009-02-27 17:07 any kind of operation like that would probably take forever 2009-02-27 17:07 right now i am thinking about creating some kind of generic tree walker for tux3 to make the whole task somewhat easier to handle 2009-02-27 17:08 yes, tree would work 2009-02-27 17:09 i don't mean a tree for keeping the information, but a tree walker that gets the maximum inode number and then creates a bitmap accordingly 2009-02-27 17:09 ah 2009-02-27 17:09 i have to think about it some more i guess 2009-02-27 17:10 a generic tree walker would be handy anyway 2009-02-27 17:10 yes 2009-02-27 17:13 btw, maximum inode number can get from itable btree 2009-02-27 17:14 most right node is it 2009-02-27 17:16 ah ok, thanks 2009-02-27 17:17 but i just found a design note by flips about his general ideas for a fsck, and i noticed that my current method of walking the tree is far more complicated than it needs to be 2009-02-27 17:17 so i'll rewrite that first 2009-02-27 17:19 I'm not reading flips's fsck note, well, I guess flips back soon 2009-02-27 17:22 btw, tux3 is on multiple btrees 2009-02-27 17:28 i know 2009-02-27 17:28 when i say tree walker, i mean structure-walker 2009-02-27 17:29 actually now, i am looping over the ileafs using the cursor, and over the dleafs in an inner loop 2009-02-27 17:30 good 2009-02-27 18:08 data, still there? 
2009-02-27 18:08 ACTION wonders if hirofumi is still awake 2009-02-27 18:30 yes, i am 2009-02-27 18:30 http://www.jonasfietz.de/hgwebdir.cgi/tux3/rev/afa14c5923ae 2009-02-27 18:30 this is how i did it now 2009-02-27 18:31 i have a hard time figuring out how to do the hooks correctly 2009-02-27 18:31 don't know if you remember the recursive version 2009-02-27 18:31 it was very closely mapped on hirofumi's tuxgraph 2009-02-27 18:32 maybe you have some idea what kind of signature the *_ops should have 2009-02-27 18:32 ACTION highlights flips 2009-02-27 18:33 hi data 2009-02-27 18:33 hi 2009-02-27 18:34 ok, the unsigned struct field is fine, it is actually more efficient for a range test 2009-02-27 18:34 since negative is a very large unsigned number, larger than any block size 2009-02-27 18:34 second thing is, a bitmap is just the right thing for fsck, use a mapping as hirofumi suggested so the bitmap is sparse 2009-02-27 18:35 that way you can cover at least 2**47 inodes on a 32 bit machine 2009-02-27 18:35 you mean his temp-file solution? 2009-02-27 18:35 not a file, but a mapping 2009-02-27 18:35 allocate a mapping just like we do for the log blocks 2009-02-27 18:36 k, have to look into that, but ok 2009-02-27 18:36 and access it like bitmap, in face using the same access method 2009-02-27 18:36 s/face/fact/ 2009-02-27 18:36 if we have to change the parameter to the bitmap access from inode * to mapping *, we can do that 2009-02-27 18:36 ACTION hasn't looked at it for a while 2009-02-27 18:37 third thing is, either using advance directly or walking with the cursor are both ok 2009-02-27 18:37 just suit yourself 2009-02-27 18:39 i knew there was an advance, but haven't looked into it. 
just did, and i think that's the better way to go 2009-02-27 18:39 sure 2009-02-27 18:39 there's also a cursor_advance I think 2009-02-27 18:40 hmm, maybe not 2009-02-27 18:40 something like that 2009-02-27 18:40 int advance(struct cursor *cursor) 2009-02-27 18:40 yes :) 2009-02-27 18:40 do you mean that? 2009-02-27 18:40 ok, that settles that 2009-02-27 18:40 use a cursor 2009-02-27 18:41 just did that 2009-02-27 18:41 it will handle your block locking for you also 2009-02-27 18:41 http://www.jonasfietz.de/hgwebdir.cgi/tux3/rev/2e5d8b59662e 2009-02-27 18:41 so it moves in the direction of online fsck 2009-02-27 18:41 right 2009-02-27 18:42 maybe we should make a generic advance that takes a function to apply 2009-02-27 18:42 to do the check_bnode 2009-02-27 18:42 but anyway 2009-02-27 18:42 does it make sense to try to make a generic walk? 2009-02-27 18:42 it's not necessary at this point 2009-02-27 18:42 it does 2009-02-27 18:42 but play with it first 2009-02-27 18:42 generic walk is not the immediate goal 2009-02-27 18:42 i am not sure what kind of use-cases there are, so i just fit this one to my use-ace :) 2009-02-27 18:42 case even 2009-02-27 18:42 right 2009-02-27 18:43 just make it fit your use 2009-02-27 18:43 it can be refactored to be more useful. shouldn't be to hard 2009-02-27 18:43 that is the proper way to make something generic 2009-02-27 18:43 don't try to make it generic first, it will usually not fit the use case that way 2009-02-27 18:43 right 2009-02-27 18:43 pretty much all of linux was built that way 2009-02-27 18:43 first, write something fast and non-generic 2009-02-27 18:43 then extract the generic bits and parameterize them, one at a time 2009-02-27 18:44 that's the way the vfs was developedc 2009-02-27 18:44 ok, cool, your diff looks good 2009-02-27 18:48 ok. i'll look into the mapping now. How much of your logging code is in the repository so far? 
2009-02-27 18:49 lxr is way too slow 2009-02-27 18:54 lxr is a bottleneck, yes 2009-02-27 18:54 marcin promised to make a fast one for us 2009-02-27 18:54 well 2009-02-27 18:55 we will have a very fast server for tux3.org 2009-02-27 18:55 maybe 2009-02-27 18:59 ok, i just reread the parts about address_spaces and mappings. I am not quite sure how to work with them. Could you explain a little more? All I might need is a radix-tree, actually, but i am not in kernel, so i don't have an implementation 2009-02-27 19:05 -!- chesse_(~eworm@dslb-084-062-160-000.pools.arcor-ip.net) has joined #tux3 2009-02-27 19:29 data, still there? 2009-02-27 19:30 yes 2009-02-27 19:30 map_t *new_map(struct dev *dev, blockio_t *io) 2009-02-27 19:30 this is what you meant, right? 2009-02-27 19:31 right 2009-02-27 19:32 ok, didn't see it. i was grepping for mapping, and that only turned up the address_space-objects, but skipping through about 200 pages of "understanding the linux kernel" brought me to buffer.c 2009-02-27 19:32 sure 2009-02-27 19:33 tux3 maps are kernel mappings, which are pages, or in userspace, they are my port of buffer (page) cache to userspace 2009-02-27 19:34 we support the blockget operation on a kernel or a usespace mapping object 2009-02-27 19:34 that's about what i figured 2009-02-27 19:34 very handy 2009-02-27 19:34 yes, it is 2009-02-27 19:34 it's like a flex array of blocks 2009-02-27 19:34 so, you can optimize by keeping a pointer to the "current" bufdata(buffer) 2009-02-27 19:35 when your target falls outside the range of that block, brelse that buffer and blockget a new one 2009-02-27 19:36 once you have a pointer to the block data you can use set_bits etc to operate on it 2009-02-27 19:36 which is handy for marking an extent as "allocated" 2009-02-27 19:39 hmm. if i wanted to create a map to register the taken inodes, which device would i use as a backing for it? new_map(sb->bdev, NULL)? or some new dev with a customized .bits? 
2009-02-27 19:40 the maximum range of a bitmap represented as a mapping on a 32 bit arch is is 1<<(32 pageindexshift + 12 blockshift + 3 byteshift) = 47 bits 2009-02-27 19:41 2**47 2009-02-27 19:41 as opposed to 2**48 that tux3 handles at maximum 2009-02-27 19:41 ok... 2009-02-27 19:41 I think that, on 32 bit arch our effective limit is therefore 2**47 blocks per filesystem 2009-02-27 19:41 i am somewhat unsure how to use it, yet. or rather how to initialize? 2009-02-27 19:42 if somebody makes a bigger filesystem than that (only possible on 64 bit arch) then it will not mount on a 32 bit arch 2009-02-27 19:42 struct dev *dev = &(struct dev) { .bits = XX}; 2009-02-27 19:42 map_t inode_map = new_map(dev, NULL); 2009-02-27 19:42 init_buffers(dev, 1 << 47, 0); 2009-02-27 19:42 and nobody will care ;) 2009-02-27 19:42 guess so 2009-02-27 19:42 well 2009-02-27 19:42 things will die if you init_buffers with that large 2009-02-27 19:42 thought so 2009-02-27 19:42 that's the amount of userspace memory to dedicate to the buffer pool 2009-02-27 19:43 ah ok 2009-02-27 19:43 2**24 should be plenty 2009-02-27 19:43 that's 16 MB 2009-02-27 19:44 call new_mapping to init 2009-02-27 19:44 don't use our hack stuff ;-) 2009-02-27 19:44 you are writing real code, not a unit test 2009-02-27 19:44 egrep -R new_mapping * 2009-02-27 19:44 doesn't return anything 2009-02-27 19:44 new_map 2009-02-27 19:44 ok 2009-02-27 19:45 04:42:20 < data> map_t inode_map = new_map(dev, NULL); 2009-02-27 19:45 let me see how it works in kernel 2009-02-27 19:45 ah 2009-02-27 19:45 make_inode 2009-02-27 19:45 because kernel does not have a new_map that is really usable 2009-02-27 19:46 kernel is a little messy there 2009-02-27 19:46 so, just do like we do to initialize the log 2009-02-27 19:46 ACTION looks for that 2009-02-27 19:47 sbi->logmap = tux_new_inode(sbi->rootdir, &iattr, 0); 2009-02-27 19:48 I don't really like tux_new_inode for that purpose, but that's how we do it 2009-02-27 19:48 well
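The 47 bit figure quoted above is just the sum of three shifts, and the same shifts split a bit number into a block index and an offset inside that block. A small sketch of the arithmetic, using hypothetical helper names that do not come from the tux3 source:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers; the names are not taken from the tux3 source. */

/* Number of address bits a bitmap held in one mapping can cover:
 * page index bits + block shift + 3 (eight bits per byte). */
static unsigned bitmap_range_bits(unsigned pageindex_bits, unsigned blockshift)
{
	return pageindex_bits + blockshift + 3;
}

/* Index of the bitmap block that holds a given bit. */
static uint64_t bitmap_block_index(uint64_t bit, unsigned blockshift)
{
	assert(blockshift + 3 < 64);
	return bit >> (blockshift + 3);
}

/* Position of that bit inside its block, counted in bits. */
static unsigned bitmap_bit_offset(uint64_t bit, unsigned blockshift)
{
	assert(blockshift + 3 < 64);
	return (unsigned)(bit & ((UINT64_C(1) << (blockshift + 3)) - 1));
}
```

With 32 page index bits and a block shift of 12 this gives 32 + 12 + 3 = 47, and the last representable bit, 2**47 - 1, lands in block 2**32 - 1. The buffer pool size mentioned in the log works out the same way: 2**24 bytes is 16 MB.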
2009-02-27 19:49 that works? :) So I can just use tux_new_inode(sb->rootdir, &iattr, 0);? 2009-02-27 19:49 I think we just call it tux_new_inode because new_inode is already take by the vfs, and does not fully initialize the inode 2009-02-27 19:49 yes 2009-02-27 19:51 I think we should probably refactor this a little bit so we have a new_mapnode or something like that, which creates a new inode that is not backed by disk 2009-02-27 19:52 right. i was trying to find out where that happens :) 2009-02-27 19:52 instead of using the indirect path that eventually does tux_setup_inode 2009-02-27 19:52 yes, it's more confusing than necessary 2009-02-27 19:52 anyway, it works as it is 2009-02-27 19:52 which is the most important thing :) 2009-02-27 19:54 but when i want to set a bit, i just do blockget(map, value) and then use bufdata() to get the actual memory address, which then I can modify by set_bit? 2009-02-27 19:54 or am i still missing something? 2009-02-27 19:54 that's correct 2009-02-27 19:55 and I was suggesting that you cache the bufdata pointer and avoid doing blockget every time, assuming that several references in a row will fall in the same block 2009-02-27 19:55 right 2009-02-27 19:58 not sure about the implications yet: as this is not really device-backed, or shouldn't be, there is nothing that will happen when I do a brelse? 
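The access pattern confirmed above (blockget on the map, bufdata for the memory, set a bit, and keep the current block cached between calls) can be sketched without the tux3 buffer code. Everything named `toy_*` below, and `mark_inode` itself, is a hypothetical stand-in written for this illustration; the real `blockget`, `bufdata` and `brelse` have their own signatures and eviction behaviour.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_BLOCKSHIFT 12			/* 4K blocks, as in the log */
#define TOY_BLOCKSIZE (1u << TOY_BLOCKSHIFT)
#define TOY_BLOCKBITS (TOY_BLOCKSHIFT + 3)	/* log2 of bits per block */
#define TOY_MAXBLOCKS 64			/* arbitrary limit for the sketch */

/* Toy stand-ins for a tux3 buffer and mapping; all names are hypothetical. */
struct toy_buffer {
	uint64_t index;
	unsigned char data[TOY_BLOCKSIZE];
};

struct toy_map {
	struct toy_buffer *blocks[TOY_MAXBLOCKS];
	unsigned count;
	struct toy_buffer *current;	/* cached "current" block */
};

/* Find the block with this index, creating it zero-filled on first use. */
static struct toy_buffer *toy_blockget(struct toy_map *map, uint64_t index)
{
	struct toy_buffer *buffer;

	for (unsigned i = 0; i < map->count; i++)
		if (map->blocks[i]->index == index)
			return map->blocks[i];
	if (map->count == TOY_MAXBLOCKS || !(buffer = calloc(1, sizeof *buffer)))
		return NULL;
	buffer->index = index;
	return map->blocks[map->count++] = buffer;
}

/* Releasing is a no-op here: the toy map never evicts anything. */
static void toy_brelse(struct toy_buffer *buffer)
{
	(void)buffer;
}

/* Mark inode number inum; return 1 if already marked, 0 if new, -1 on error. */
static int mark_inode(struct toy_map *map, uint64_t inum)
{
	uint64_t index = inum >> TOY_BLOCKBITS;
	unsigned bit = (unsigned)(inum & ((1u << TOY_BLOCKBITS) - 1));
	unsigned char mask = (unsigned char)(1u << (bit & 7));
	unsigned char *byte;
	int was_set;

	assert(map);
	/* Only look up a new block when the target leaves the cached one. */
	if (!map->current || map->current->index != index) {
		if (map->current)
			toy_brelse(map->current);
		map->current = toy_blockget(map, index);
		if (!map->current)
			return -1;
	}
	byte = &map->current->data[bit >> 3];
	was_set = (*byte & mask) != 0;
	*byte |= mask;
	return was_set;
}
```

Only blocks that are actually touched get allocated, which is what keeps the bitmap sparse; a second call with the same inode number returns 1, which is the duplicate an fsck pass would want to report.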
2009-02-27 19:59 they will not be evicted 2009-02-27 20:00 and the correct operation would probably be brelse_dirty 2009-02-27 20:05 -!- chesse(~eworm@dslb-084-062-154-233.pools.arcor-ip.net) has joined #tux3 2009-02-27 20:23 i get a tux3.c:303: error: invalid initializer 2009-02-27 20:23 for struct inode inode_map = tux_new_inode(sb->rootdir, &iattr, 0); 2009-02-27 20:24 ...back 2009-02-27 20:24 should be *inode_map 2009-02-27 20:25 ah 2009-02-27 20:25 if you do a brelse your bitmap can be evicted unless it is dirty 2009-02-27 20:25 well 2009-02-27 20:25 that's a challenge ;) 2009-02-27 20:25 don't worry about it for now 2009-02-27 20:26 it's probably not a good idea to back your bitmap mapping by the filesystem your're checking 2009-02-27 20:26 right, should i just create a "virtual" device? 2009-02-27 20:27 struct dev *dev = &(struct dev) { .bits = XX}; 2009-02-27 20:27 like that? 2009-02-27 20:27 suggestions gladly accepted on this point, for now just ignore it 2009-02-27 20:27 ok 2009-02-27 20:27 isn't it possible in kernel to use addresses so high that they can't be backed by the device? 2009-02-27 20:27 sure 2009-02-27 20:27 we'll deal with that later 2009-02-27 20:28 the limitation is not the size of the address, but how many blocks are dirty in cache 2009-02-27 20:28 we will do something pretty clever 2009-02-27 20:28 later 2009-02-27 20:28 well, i am not going to finish it now. my train leaves in 2.5 hours, and I haven't slept yet :) but thanks for all the hints. I'll gladly incorporate them when I come back on sunday 2009-02-27 20:28 for now, just assume you can hold infinite bitmap blocks in memory 2009-02-27 20:29 train to?
2009-02-27 20:29 muenster 2009-02-27 20:29 visiting some friends 2009-02-27 20:29 ah, schlaft gute denn 2009-02-27 20:29 danke :) 2009-02-27 20:29 err 2009-02-27 20:29 gute fiertag 2009-02-27 20:29 something like that 2009-02-27 20:30 shoense fiertag 2009-02-27 20:30 my deutsch is getting crappy ;) 2009-02-27 20:30 it was never good at the best of times 2009-02-27 20:30 feiertag, but that's more like holidays such as christmas :) 2009-02-27 20:30 fierabendt 2009-02-27 20:30 feierabend, right. :) 2009-02-27 20:31 nachtzub? 2009-02-27 20:31 nachtzug? 2009-02-27 20:32 nope, it's 5:30 AM here 2009-02-27 20:32 ooh 2009-02-27 20:32 that's late 2009-02-27 20:33 i hope i can get a little rest on the train, but when one stands up at around 12 pm, it's not that bad. 2009-02-27 20:34 sure 2009-02-27 20:34 I know the drill 2009-02-27 20:35 lived in berlin for 5 years 2009-02-27 20:41 sorry, i am cleaning up around here so i don't come back to a total mess 2009-02-27 20:41 no more cups on the table :) 2009-02-27 20:41 :) 2009-02-27 20:41 a couple of red bulls will take care of you 2009-02-27 20:42 nothing left, but you might know club mate :) 2009-02-27 20:43 you'd have to drink a whole carton 2009-02-27 21:13 probably, but the cold shower just now helped a lot, too 2009-02-27 22:13 -!- ned(~ned@c-76-19-208-96.hsd1.ma.comcast.net) has joined #tux3 2009-02-27 23:32 -!- RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3 2009-02-28 00:47 -!- gaurav(~gaurav@59.96.71.18) has joined #tux3 2009-02-28 00:49 -!- gaurav(~gaurav@59.96.71.18) has joined #tux3 2009-02-28 07:01 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-28 07:24 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-02-28 11:00 flips, hirofumi i got some new crash logs for ya ;) 2009-02-28 11:24 oh, another one, and this time with a kernel panic as a bonus 2009-02-28 11:49 at this time, are you applying the my new patches? 
2009-02-28 11:50 i did the whole hg clone today 2009-02-28 11:50 I've found the a few bugs in ileaf stuff 2009-02-28 11:50 hg clone static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-02-28 11:50 if not, please try this hg repo 2009-02-28 11:51 hold on, lemme pack up and send you the stuff i found first, then we'll have a go with your new patches 2009-02-28 11:51 thanks 2009-02-28 11:52 you want me to send it to the mailing list or your private mail? 2009-02-28 11:52 either is ok for me 2009-02-28 11:53 hirofumi@mail.parknet.co.jp is my email address 2009-02-28 11:54 oh in the meantime, please tell me what do you want me to gather for a crash? so far i got a screenshot of the VM with the kernel panic, and a kernel log of the crash (first time around tux3 crashed but the kernel didnt panic). what else could be useful? 2009-02-28 11:56 screenshot may be useful 2009-02-28 11:56 well, however, ileaf bug can be the cause of memory corruption 2009-02-28 11:57 so, it's not so clear 2009-02-28 11:57 there seemed to be something about freeing already free memory 2009-02-28 11:57 in the logs, you'll see soon enough 2009-02-28 11:57 ok, maybe, it's ileaf bug 2009-02-28 11:58 ileaf was allocating the same inode number, so it can be the cause of double free 2009-02-28 12:09 sent 2009-02-28 12:17 hirofumi, i'm running the same test with code from your repo, so far no crashes, but it's been only few mintutes 2009-02-28 12:17 oh 2009-02-28 12:18 the other crashes were within the first 30-50 seconds of running the test though, so it's a good sign 2009-02-28 12:18 yes, it sounds good 2009-02-28 12:19 did you look at the log/screenshots yet? 
2009-02-28 12:20 ACTION reading 2009-02-28 12:20 the tux3crashdump.txt seems to be ileaf bug 2009-02-28 12:21 it is the path related to delete_inode() 2009-02-28 12:21 TUX3-32bit_2.6.29rc6-Capture.PNG also seems the same stuff 2009-02-28 12:22 that's nice :) 2009-02-28 12:22 it should, it's from the same crash ;) 2009-02-28 12:22 :) 2009-02-28 12:23 TUX3-32bit_2.6.29rc6-Capture2.PNG also seems to related that bug 2009-02-28 12:23 because I saw like that backtrace when I was debugging 2009-02-28 12:24 *Capture2-panic.PNG is unknown 2009-02-28 12:24 yea that was weird 2009-02-28 12:24 however, it can be related to the same bug 2009-02-28 12:24 it didnt happen right away 2009-02-28 12:25 it seems double or triple fault 2009-02-28 12:25 i saw the tux3 flake out, so i take a screen shot 2009-02-28 12:25 then i go to grab the logs and THEN it panics 2009-02-28 12:26 marcin, and now you're got it tucked away inside vmware 2009-02-28 12:26 hey, what was that flag you showed me a while back to trace tux3? 2009-02-28 12:26 better than dropping off the net 2009-02-28 12:26 yea, i got both 32 and 64bit versions going 2009-02-28 12:26 maybe, tux3 touched the freed inode because of ileaf bug 2009-02-28 12:29 ok both stress and bonnie++ ran on the tux3 partition for a while, no problems so far 2009-02-28 12:29 good 2009-02-28 12:29 that flag? 2009-02-28 12:30 something inside /sys 2009-02-28 12:30 ah 2009-02-28 12:31 echo y > /sys/modules/.../tux3_trace? 2009-02-28 12:31 /sys/modules/tux3/parameters/tux3_trace 2009-02-28 12:32 heh, a lot more output 2009-02-28 12:32 yes 2009-02-28 12:33 output of trace() 2009-02-28 12:33 anything in particular i should be looking there for? 2009-02-28 12:34 it depends on what's for 2009-02-28 12:34 it is for this panic? 2009-02-28 12:34 marcim, it's sort of like information wallpager 2009-02-28 12:35 do you want me to go rerun the crashing code with the tracer on? 2009-02-28 12:35 did you still get a crach with the ileaf fixes? 
2009-02-28 12:35 nope, clean as a daisy ;) 2009-02-28 12:37 good 2009-02-28 12:43 that's fault-in-fault probably caused a stack overflow 2009-02-28 12:44 which can cause a pretty weird crash 2009-02-28 12:45 or memory corruption by unexcepted freeing memory 2009-02-28 13:04 shoudl the tracer seriously slow it down? 2009-02-28 13:04 yes 2009-02-28 13:05 if you stoped the syslog, it may be better off 2009-02-28 13:28 hirofumi, should we work out the remaining issues with the rdev patch? 2009-02-28 13:30 ileaf and rdev 2009-02-28 13:30 furthermore 2009-02-28 13:30 there are some races 2009-02-28 13:31 open_inode() and alloc_cursor() race 2009-02-28 13:33 the rdev patch looks just about right 2009-02-28 13:33 new_encode_dev does not seem to have a purpose 2009-02-28 13:34 new_encode_dev() is used by huge_encode_dev()? 2009-02-28 13:34 well, it allows conversion from 32 to 64 bits 2009-02-28 13:34 is that actually necessary? 2009-02-28 13:34 which one? 2009-02-28 13:35 huge_encode_dev -> new_encode_dev 2009-02-28 13:35 well, new_encode_dev is for the filesystem is supporing 32bit dev 2009-02-28 13:35 and, huge_encode_dev is for 64bit filesystem 2009-02-28 13:35 we don't really have a reason to support 32 bit dev 2009-02-28 13:35 device nodes are rare 2009-02-28 13:36 and unlike other filesystems, the dev attribute is optional 2009-02-28 13:36 well, the kernel itself is still supporting 32bit dev only though 2009-02-28 13:37 sure, but that should not stop us from encoding as 64 bits 2009-02-28 13:37 probably 2009-02-28 13:37 it is one of points I was not decided 2009-02-28 13:37 let's use 64 bits, it has no real cost 2009-02-28 13:38 ok, so there is no change 2009-02-28 13:38 right, we can move new_encode_dev inside huge_enccode_dev and just call it encode_dev 2009-02-28 13:38 small thing compared to working or not ;-) 2009-02-28 13:39 well, it is just copy of the kernel 2009-02-28 13:39 one other very small point... 
so we use an extra attributde kind for rdev, where we could in theory reuse some other attribute, and know it is a device from the inode flags 2009-02-28 13:39 ah 2009-02-28 13:40 well, there is no real issue, the patch can merge as is 2009-02-28 13:40 small cleanups later 2009-02-28 13:40 yes, interface should be rethink soon or later 2009-02-28 13:40 reuse? 2009-02-28 13:41 we could save an attribute kind number if we really wanted, by overloading some other attribute 2009-02-28 13:41 some 64 bit attribute 2009-02-28 13:41 I don't necessarily think it is a good idea to overload the attribute kind numbers, but I just raise the possibility 2009-02-28 13:41 if it's unneeded, it is not present? 2009-02-28 13:41 true 2009-02-28 13:42 and the only drawback is, we take one value from our 16 value range, just for rdev 2009-02-28 13:42 so, save one bit? 2009-02-28 13:42 not even one bit, one kind out of 16 2009-02-28 13:42 (don't reserve a bit for it) 2009-02-28 13:43 so, it is a tiny issue, but because it is a disk format question, I ask the question 2009-02-28 13:43 if we want more than 16 attributes? 2009-02-28 13:44 right, that is not decided 2009-02-28 13:44 what is the chance we might need more? we also have the xattr mechanism 2009-02-28 13:44 if we only had 1 atkind left, we could use it to mean "extended atkind" 2009-02-28 13:45 and use a similar format to xattr 2009-02-28 13:45 where there is an indentifying code 2009-02-28 13:45 identifying code 2009-02-28 13:45 btw, attr is starting at 6 2009-02-28 13:45 yes 2009-02-28 13:45 about time to change that 2009-02-28 13:45 i see 2009-02-28 13:45 it was for noticing errors more easiliy 2009-02-28 13:46 there are more 6 attributes space is remaining? 
2009-02-28 13:46 by the way, I like the style of making each of the atkind = isntead of just letting the enum count for us, it is a kind of documentation 2009-02-28 13:46 ah, 5 2009-02-28 13:46 if you are looking at a hexdump, it is handy to see each number beside the symbol 2009-02-28 13:47 let's see how many we have used, and how many more are planned 2009-02-28 13:47 rdev is planned 2009-02-28 13:48 with rdev patch, 6~15 are using 2009-02-28 13:48 ah, no 2009-02-28 13:48 with rdev patch, 6~13 are using 2009-02-28 13:49 9 in total 2009-02-28 13:49 8? 2009-02-28 13:50 right 2009-02-28 13:50 btw, what is IDATA_ATTR? 2009-02-28 13:50 the reason for starting at 6 was, I thought that with corrupted inode, finding a zero atkind was likely, and so zero should be avoided 2009-02-28 13:50 "immediate data" 2009-02-28 13:50 where the file data is encoded in the inode 2009-02-28 13:50 so, it's not used yet? 2009-02-28 13:51 not yet 2009-02-28 13:51 but certainly planned 2009-02-28 13:51 ok, well, so actually 7 2009-02-28 13:51 well, idata is certain to arrive 2009-02-28 13:51 so 8 is fair to say 2009-02-28 13:51 and idata is planning 2009-02-28 13:51 including rdev 2009-02-28 13:52 acl is another one, maybe 2009-02-28 13:52 it maybe xattr? 2009-02-28 13:52 yes, however maybe it is important enough to be work saving the xattr atom code 2009-02-28 13:52 s/work/worth/ 2009-02-28 13:53 i see 2009-02-28 13:53 so, 7 is used, 2 is planning 2009-02-28 13:53 i_blocks would be needed 2009-02-28 13:53 also 2009-02-28 13:53 whoops 2009-02-28 13:54 probably 2009-02-28 13:54 7 is used, 3 is planning (idata, acl, i_blocks) 2009-02-28 13:54 yes, with 6 for future 2009-02-28 13:55 maybe never use zero 2009-02-28 13:55 because it is would be a common result of corruption 2009-02-28 13:55 ok, so, well, 5 2009-02-28 13:56 maybe, inode flag would be needed 2009-02-28 13:56 extended mode flags? 
2009-02-28 13:56 possibly optimization flags 2009-02-28 13:56 yes 2009-02-28 13:56 some allocation hints 2009-02-28 13:57 well, some of those can be merge to one attr 2009-02-28 13:58 may be able to merge 2009-02-28 13:58 if we are always going to use a blocks field, it might be part of the CTIME_SIZE attribute 2009-02-28 13:58 yes 2009-02-28 13:59 and i_version? 2009-02-28 13:59 and i_generation 2009-02-28 13:59 different from our version field? because every attribute already has a version 2009-02-28 13:59 generation, maybe 2009-02-28 14:00 I forgot i_version is what's for 2009-02-28 14:00 generation is a runtime thing I thought 2009-02-28 14:00 but, it's not our version 2009-02-28 14:00 right 2009-02-28 14:00 well I will write it down as a question to resolve 2009-02-28 14:00 it increase when dirty 2009-02-28 14:02 yes 2009-02-28 14:02 nfs thing 2009-02-28 14:02 bfields wrote an informative post about it 2009-02-28 14:02 yes, both seems to be for nfs? 2009-02-28 14:02 when I did not respond to :-/ 2009-02-28 14:02 maybe 2009-02-28 14:02 nfs = big mess 2009-02-28 14:02 yes 2009-02-28 14:03 new networkfs is interesting 2009-02-28 14:03 anyway, that is 12 atkinds at most 2009-02-28 14:03 however, nfs seems too common 2009-02-28 14:03 with 0 never used, and 3 for future expansion 2009-02-28 14:03 it seems like enough 2009-02-28 14:04 idata, acl, i_blocks, generation, version? 
2009-02-28 14:04 + current 7 2009-02-28 14:04 yes 2009-02-28 14:04 allocation optimization bits 2009-02-28 14:04 ah 2009-02-28 14:04 idata, acl, i_blocks, generation, version, optimization 2009-02-28 14:05 so, 13 2009-02-28 14:05 maybe, some of those can be merged 2009-02-28 14:05 I also wrote "extended inode mode bits", I don't know if it is needed 2009-02-28 14:05 ah 2009-02-28 14:05 idata, acl, i_blocks, generation, version, optimization, inode flag 2009-02-28 14:05 14 2009-02-28 14:06 anyway, even with wild imagination I did not use up all 16 2009-02-28 14:06 yes 2009-02-28 14:06 and if we only have 1 left in can be a "more kinds" kind 2009-02-28 14:06 so I think 16 is enough 2009-02-28 14:06 yes 2009-02-28 14:08 but there is one more hard question 2009-02-28 14:08 the atkinds are divided into fixed and variable 2009-02-28 14:08 so, the remaining unused values have to be divided between fixed and variable as well 2009-02-28 14:09 divided? 2009-02-28 14:09 just implementation issue? 2009-02-28 14:10 its a backward compatibility issue 2009-02-28 14:10 the first N kinds are fixed, the remaining 16 - N are variable 2009-02-28 14:10 variable? 2009-02-28 14:11 variable size 2009-02-28 14:11 um... 2009-02-28 14:11 variable size attributes 2009-02-28 14:11 we use that fact in decode 2009-02-28 14:11 and in calculating the size required to store the attributes 2009-02-28 14:12 what is the problem of backward compatibility? 2009-02-28 14:12 of course, new bit is uncompatible 2009-02-28 14:13 backward compatibility... we do not want to shift the VAR_ATTRS value 2009-02-28 14:14 so, we have two kinds of future atkinds... fixed atkind and variable atkind 2009-02-28 14:14 by the way 2009-02-28 14:14 let's use kind = 0 for rdev 2009-02-28 14:14 um... 2009-02-28 14:15 but, if it's unknown attr, what does old code do? 
2009-02-28 14:15 marcin hit an issue on rm -r of a huge directory 2009-02-28 14:15 so did I 2009-02-28 14:15 we will have to chase at some point 2009-02-28 14:15 well, we need to improve the behavior on decoding random junk 2009-02-28 14:16 currently , decode_attrs returns NULL if it finds an unknown one, that is kind of wrong 2009-02-28 14:17 but, what is that can do? 2009-02-28 14:17 ACTION thinks 2009-02-28 14:17 I guess it can do nothing 2009-02-28 14:18 well at least we might return ERR_PTR 2009-02-28 14:18 because other errors are possible 2009-02-28 14:18 ...maybe 2009-02-28 14:18 ah, yes 2009-02-28 14:19 like xattr too big 2009-02-28 14:19 it would be the kind of error handling bug 2009-02-28 14:19 yes 2009-02-28 14:19 decode_attrs needs to be robust if we give it random junk 2009-02-28 14:19 yes 2009-02-28 14:20 hopefully, all decode function 2009-02-28 14:20 yes 2009-02-28 14:20 I'm thinking it's hard problem for fs 2009-02-28 14:20 all fs 2009-02-28 14:20 yes, ext3 handles it well\ 2009-02-28 14:20 fs can check various things 2009-02-28 14:21 even so, ext3 does not catch all corruption 2009-02-28 14:21 I'm thinking ext3 is not enough at all 2009-02-28 14:21 after many years of fixes 2009-02-28 14:21 I guess it doesn't handle even -EIO 2009-02-28 14:22 perfectly 2009-02-28 14:22 well, however, checking various things become fsck 2009-02-28 14:22 I think the balance is needed 2009-02-28 14:23 -!- frindly(~frindly@i59F4C67F.versanet.de) has joined #tux3 2009-02-28 14:23 hello 2009-02-28 14:23 hi 2009-02-28 14:24 hi, english speaking here? 2009-02-28 14:24 yes 2009-02-28 14:24 ok 2009-02-28 14:24 are you a file system developer? 2009-02-28 14:24 yes, I think so :) 2009-02-28 14:24 ACTION tries 2009-02-28 14:25 ok. i am interested in file systems. i am right here? 
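[Editor's sketch] The point above, that decode_attrs should report a specific error instead of returning NULL on unknown input, can be illustrated in userspace C. This is a minimal sketch and not the tux3 code: err_ptr(), ptr_err() and is_err() are local stand-ins modeled on the kernel's ERR_PTR()/PTR_ERR()/IS_ERR() helpers, and toy_atsize, toy_decode_attrs, the big-endian head, and the choice of -EINVAL versus -EIO are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/PTR_ERR()/IS_ERR() helpers. */
#define MAX_ERRNO 4095

static inline void *err_ptr(long error) { return (void *)error; }
static inline long ptr_err(const void *ptr) { return (long)ptr; }
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical payload sizes per kind; 0 marks a kind this decoder does not know. */
static const uint8_t toy_atsize[16] = { [1] = 4, [2] = 8 };

/*
 * Walk the attributes stored in [attrs, attrs + size). Each attribute is
 * assumed to be a 2 byte head (kind in the top 4 bits) plus a fixed payload.
 * Returns the end pointer on success or an encoded errno, never NULL.
 */
static const uint8_t *toy_decode_attrs(const uint8_t *attrs, size_t size)
{
	const uint8_t *limit = attrs + size;

	while (attrs < limit) {
		if (limit - attrs < 2)
			return err_ptr(-EIO);	/* truncated head */
		unsigned head = attrs[0] << 8 | attrs[1];
		unsigned kind = head >> 12;
		attrs += 2;
		if (!toy_atsize[kind])
			return err_ptr(-EINVAL);	/* unknown kind */
		if ((size_t)(limit - attrs) < toy_atsize[kind])
			return err_ptr(-EIO);	/* truncated payload */
		attrs += toy_atsize[kind];
	}
	return attrs;
}
```

A caller would test the result with is_err() and pass ptr_err() upward, so that "unknown kind" and "truncated data" stay distinguishable instead of both collapsing into NULL.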
2009-02-28 14:25 right, the purpose of runtime checking in decode_attrs is, to avoid corrupting cache and on-disk 2009-02-28 14:25 yes 2009-02-28 14:26 frindly, I think if it's tux3 2009-02-28 14:26 sorry. not tux3. i use the more mainstream... reiserfs 2009-02-28 14:28 enum atkind { 2009-02-28 14:28 RDEV_ATTR = 0, 2009-02-28 14:28 MODE_OWNER_ATTR = 1, 2009-02-28 14:28 DATA_BTREE_ATTR = 2, 2009-02-28 14:28 CTIME_SIZE_ATTR = 3, 2009-02-28 14:28 LINK_COUNT_ATTR = 4, 2009-02-28 14:28 MTIME_ATTR = 5, 2009-02-28 14:28 BLOCKS = 6 2009-02-28 14:28 = 7 2009-02-28 14:28 = 8 2009-02-28 14:28 = 9 2009-02-28 14:28 = 10 2009-02-28 14:28 VAR_ATTRS, 2009-02-28 14:28 IDATA_ATTR = 11, 2009-02-28 14:28 XATTR_ATTR = 12, 2009-02-28 14:28 = 13 2009-02-28 14:28 = 14 2009-02-28 14:28 = 15 2009-02-28 14:28 }; 2009-02-28 14:28 (proposal) 2009-02-28 14:29 allocation hints is variable? 2009-02-28 14:29 I think so 2009-02-28 14:29 frindly, unfortunately, I don't know well about reiserfs 2009-02-28 14:29 i see 2009-02-28 14:29 whatever we decide, we will want to change it later I think 2009-02-28 14:29 different kinds of reservation 2009-02-28 14:29 I don't really know ;) 2009-02-28 14:29 well, anyway, it looks good for now 2009-02-28 14:30 ok. how far is the development of tux3? not realy for productiv work or? 
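[Editor's sketch] The fixed/variable split in the proposal above can be written out as compilable C. This is only a sketch under assumptions: the unnamed slots from the proposal are left unassigned, VAR_ATTRS is given the explicit value 11 so that it marks the first variable size kind, and NUM_ATKINDS and atkind_is_variable() are illustrative names taken from or invented around the discussion, not confirmed tux3 identifiers.

```c
#include <assert.h>

enum atkind {
	/* Fixed size kinds: 0 .. VAR_ATTRS - 1 */
	RDEV_ATTR = 0,
	MODE_OWNER_ATTR = 1,
	DATA_BTREE_ATTR = 2,
	CTIME_SIZE_ATTR = 3,
	LINK_COUNT_ATTR = 4,
	MTIME_ATTR = 5,
	BLOCKS = 6,
	/* 7 .. 10 left unassigned for future fixed size kinds */

	/* Variable size kinds: VAR_ATTRS .. NUM_ATKINDS - 1 */
	VAR_ATTRS = 11,
	IDATA_ATTR = 11,
	XATTR_ATTR = 12,
	/* 13 .. 15 left unassigned for future variable size kinds */

	NUM_ATKINDS = 16,
};

/* A kind is variable size exactly when it sits at or above the boundary. */
static inline int atkind_is_variable(unsigned kind)
{
	assert(kind < NUM_ATKINDS);
	return kind >= VAR_ATTRS;
}
```

Because the decoder classifies a kind purely by comparing it with VAR_ATTRS, moving that boundary later would reinterpret kinds already on disk, which is the backward compatibility concern raised above.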
2009-02-28 14:30 right, we can make this change as part of the rdev patch 2009-02-28 14:30 ok 2009-02-28 14:31 we will not need MIN/MAX_ATTRS any more 2009-02-28 14:32 ok, that is all the rdev questions I think 2009-02-28 14:32 we can get that small thing done 2009-02-28 14:32 MAX would be good if we have 2009-02-28 14:32 it will be 16 2009-02-28 14:32 yes 2009-02-28 14:32 so, it will just be NUM_ATKINDS 2009-02-28 14:32 just for readability 2009-02-28 14:33 yes 2009-02-28 14:33 and for corruption check 2009-02-28 14:33 ah no 2009-02-28 14:33 just for making arrays the right size 2009-02-28 14:34 frindly, at least, I'm going to use tux3 as rootfs 2009-02-28 14:34 and for assert on encode 2009-02-28 14:34 yes 2009-02-28 14:34 hirofumi, you are an adventurer 2009-02-28 14:34 well, using tux3 for a presentation was an adventure 2009-02-28 14:34 :) 2009-02-28 14:35 well, it would be later though 2009-02-28 14:35 in future 2009-02-28 14:35 I had made a dd copy of the tux3 partition, in case I crashed 2009-02-28 14:35 and also an ext3 copy 2009-02-28 14:35 these were not needed, luckily 2009-02-28 14:36 probably, it would be after atomic commit in the case of me 2009-02-28 14:36 I think, maybe 2-3 months away from running as rootfs on my gcc machine 2009-02-28 14:36 maybe sooner 2009-02-28 14:36 yes, hopefully 2009-02-28 14:37 what about fragmentation and tux3? 2009-02-28 14:37 frindly, fragmentation will be bad at first 2009-02-28 14:37 and we will improve it 2009-02-28 14:39 ok. well there is no system with good fragmentaton expect reiser54 2009-02-28 14:39 reiser4 2009-02-28 14:39 yes, or ssd like backing storage may helps storage system more fast 2009-02-28 14:40 reiser4 has good for fragmentaion avoidance? 2009-02-28 14:42 yes, it has no fragmentation problem. 
reiserfs 3.6 and ext3 are not so good, and xfs is great too 2009-02-28 14:47 i see 2009-02-28 14:48 hirofumik, I think the atbin enumeration is not useful, time to drop it 2009-02-28 14:48 just write (1 << ...) where needed 2009-02-28 14:49 ok 2009-02-28 14:49 but, with separated patch 2009-02-28 14:54 ok 2009-02-28 15:00 we need bit to pos conversion 2009-02-28 15:00 to get size 2009-02-28 15:05 bit to pos? 2009-02-28 15:05 -!- dcg(~dcg@106.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-02-28 15:05 for atsize[] 2009-02-28 15:06 that is indexed by *_ATTR 2009-02-28 15:06 yes 2009-02-28 15:06 what is (1 << ...) meaning? 2009-02-28 15:06 ah, on decode 2009-02-28 15:06 um 2009-02-28 15:06 not on decode ;) 2009-02-28 15:07 generates a mask 2009-02-28 15:07 used for present 2009-02-28 15:07 the biggest use is in the default present mask generation 2009-02-28 15:07 that is a single line expression if all the _BITs are defined 2009-02-28 15:07 well 2009-02-28 15:08 it will be a define in tux3.h now, I think 2009-02-28 15:08 just add default present mask to current one? 2009-02-28 15:13 ? 2009-02-28 15:13 current what? 2009-02-28 15:13 ah 2009-02-28 15:13 current code 2009-02-28 15:13 something like that 2009-02-28 15:13 we probably need two defaults 2009-02-28 15:13 one for files and one for devices 2009-02-28 15:14 maybe, I found the bug in iattr.c 2009-02-28 15:14 ah, no 2009-02-28 15:14 iattr.c is so simple, how could it have a bug? ;-) 2009-02-28 15:15 :) 2009-02-28 15:16 well, fixed size and variable size calcs at different function 2009-02-28 15:16 trial patch for rdev posted 2009-02-28 15:16 it actually needs VAR_ATTRS = 11, 2009-02-28 15:17 ok, i_blocks, I think we decided there is no way to avoid it, and it has to be updated every time a block is allocated or freed from an inode 2009-02-28 15:17 now... 
I think i_blocks should not be per-version 2009-02-28 15:17 well 2009-02-28 15:17 it could be 2009-02-28 15:18 because it will have a version tag 2009-02-28 15:18 i_blocks per version seems like a lot of work to update 2009-02-28 15:18 I guess we just have to do it 2009-02-28 15:18 I guess it should be consists with i_size 2009-02-28 15:19 it is used by some utilities to determine spareness 2009-02-28 15:19 the patch seems almost same with my patch I'm working now :) 2009-02-28 15:19 yes 2009-02-28 15:19 that is because it is your patch ;) 2009-02-28 15:19 it will have --user=hirofumi when committed 2009-02-28 15:20 so, I guess it should be per-version 2009-02-28 15:20 I'm also changing like that 2009-02-28 15:21 removing MIN_ATTR etc 2009-02-28 15:25 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2009-02-28 15:25 -!- konrad(~konrad@smurf.cs.washington.edu) has joined #tux3 2009-02-28 15:26 4bits of attr are using as version? 2009-02-28 15:26 10 bits 2009-02-28 15:26 let me see 2009-02-28 15:26 -!- konrad(~konrad@attu3.cs.washington.edu) has joined #tux3 2009-02-28 15:27 -!- konrad(~konrad@dante01.u.washington.edu) has joined #tux3 2009-02-28 15:27 12 bits 2009-02-28 15:27 more version bits than for pointers/extents 2009-02-28 15:27 which have 10 bits 2009-02-28 15:28 unsigned version = head & 0xfff, kind = head >> 12; 2009-02-28 15:28 yes 2009-02-28 15:28 and decode16(, &head) 2009-02-28 15:29 kind is 4bit? 
2009-02-28 15:29 yes 2009-02-28 15:29 ah 2009-02-28 15:29 I was wondering if it should be 5 or 6 bit, but 4 seems to be enough 2009-02-28 15:29 we can think about that for a couple more months before finalizing 2009-02-28 15:30 I misread it, somehow I was thinking it is *_BIT 2009-02-28 15:30 if we did not reserve that space for versions, our inodes could be about 15% smaller 2009-02-28 15:31 fortunately, kind is coded as a scalar, not a mask 2009-02-28 15:31 yes 2009-02-28 15:34 http://userweb.kernel.org/~hirofumi/rdev-support.patch 2009-02-28 15:34 changed the order of those 2009-02-28 15:34 rdev is top 2009-02-28 15:35 "Variable size inode" is slightly confusing to read 2009-02-28 15:35 just drop "inode" and it is clear 2009-02-28 15:35 i see 2009-02-28 15:36 ok I like your cleanup 2009-02-28 15:37 ready to pull? 2009-02-28 15:37 ah 2009-02-28 15:37 my comment to the MAGIC change is a little more informative 2009-02-28 15:38 + * 2008-02-28: Attributes renumbered, rdev added 2009-02-28 15:38 ok 2009-02-28 15:39 ready to pull? 
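[Editor's sketch] The 16 bit attribute head discussed above, with a 4 bit kind in the high bits and a 12 bit version in the low bits, can be packed and unpacked as follows. The unpacking mirrors the quoted line `unsigned version = head & 0xfff, kind = head >> 12;`. The names pack_head, head_kind and head_version are hypothetical helpers for illustration, not functions from the tux3 source.

```c
#include <assert.h>
#include <stdint.h>

enum { VERSION_BITS = 12, KIND_BITS = 4 };

/* Build a head from a scalar kind (not a bit mask) and a version. */
static inline uint16_t pack_head(unsigned kind, unsigned version)
{
	assert(kind < 1u << KIND_BITS);
	assert(version < 1u << VERSION_BITS);
	return (uint16_t)(kind << VERSION_BITS | version);
}

/* Kind lives in the top 4 bits of the head. */
static inline unsigned head_kind(uint16_t head)
{
	return head >> VERSION_BITS;
}

/* Version lives in the low 12 bits of the head. */
static inline unsigned head_version(uint16_t head)
{
	return head & 0xfff;
}
```

For example, kind 12 with version 0x123 packs to 0xc123, and the two accessors recover 12 and 0x123 from it.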
2009-02-28 15:39 I'd like to change decode_attrs interface for rdev 2009-02-28 15:39 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-02-28 15:39 sure 2009-02-28 15:44 the variable sized attribute aspect of tux3 design is a fairly satisfactory design element 2009-02-28 15:44 not much to complain about with it 2009-02-28 15:44 not as much extra complexity as you would expect 2009-02-28 15:45 and big space saving 2009-02-28 15:47 well, I heared it first, I expect it is having perfect backward compatibility without hack like ext* 2009-02-28 15:47 :) 2009-02-28 15:47 "perfect", maybe not ;) 2009-02-28 15:48 but pretty good, it seems likely 2009-02-28 15:48 well 2009-02-28 15:48 when we finalize the encoding it is unlikely to need a backward-incompatible change 2009-02-28 15:49 hmm, one thing I did not consider is the possibility to have some variant attribute forms that are used for smaller number range 2009-02-28 15:49 in far future, we may need to add new bit 2009-02-28 15:50 if we add a new bit it will be on the "wrong side" of the existing bits 2009-02-28 15:50 so lets try to have enough bits ;) 2009-02-28 15:50 we might want to allow for optional, high precision times 2009-02-28 15:51 normally encode time in 48 bits, but optionally 64 2009-02-28 15:51 there are people who think that 64 bit file times make sense 2009-02-28 15:51 ACTION is not one of them 2009-02-28 15:51 I would rather save two bytes per time field 2009-02-28 15:51 by default 2009-02-28 15:51 well, 32bit sec may not be enough 2009-02-28 15:52 true 2009-02-28 15:52 we need to set the sec/subsec split a little more to the right 2009-02-28 15:52 and think about our base time 2009-02-28 15:52 nanosec is enough for me 2009-02-28 15:52 I thought, make time relative to a base time in the superblock, the time the filesystem was made 2009-02-28 15:53 oh 2009-02-28 15:53 i see 2009-02-28 15:53 um... 
2009-02-28 15:54 however, it will not save space a lot, probably 2009-02-28 15:54 so, if we have 34 bit seconds and 30 bit nanoseconds, and a wider base seconds field in the superblock that is a 64 bit time field with enough range and precision, 2009-02-28 15:54 well, there are several time fields per inode 2009-02-28 15:54 if there are three fields and we save 2 bytes each, that is 10% reduction in inode size 2009-02-28 15:54 that is significant I think 2009-02-28 15:55 it meant it need negative time after all 2009-02-28 15:55 so, 48 bit time would be 14 bits of fraction, or one 1/16k 2009-02-28 15:55 no, the filesystem should not need negative time 2009-02-28 15:56 if somebody wants that, they can set their clock back when they create the filesystem ;) 2009-02-28 15:56 well 2009-02-28 15:56 um 2009-02-28 15:56 untar 2009-02-28 15:56 yes 2009-02-28 15:56 preserving time stamps 2009-02-28 15:56 so this is not completely clear 2009-02-28 15:58 ok then: high precision time format is 34:30, base = standard unich epoch 2009-02-28 15:58 unix 2009-02-28 15:58 so that untar is gauranteed to work 2009-02-28 15:58 it sounds good 2009-02-28 15:59 default precision is 34:14 2009-02-28 15:59 making inodes 10% smaller 2009-02-28 15:59 14bit is... 2009-02-28 15:59 1/16K 2009-02-28 15:59 1/15th of a millisecond 2009-02-28 15:59 1/16th 2009-02-28 16:00 which is really _way_ more than needed 2009-02-28 16:00 about > 10microsec 2009-02-28 16:00 ? 
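[Editor's sketch] The two time formats proposed above, 34:14 by default and 34:30 for high precision, are binary fixed point: seconds since the Unix epoch in the upper 34 bits and a binary fraction of a second below. A 14 bit fraction gives steps of 1/16384 s, about 61 microseconds, and the 48 bit result fits in 6 bytes instead of 8. The helper names and the truncating conversion below are assumptions for illustration, not the tux3 implementation.

```c
#include <assert.h>
#include <stdint.h>

enum { SEC_BITS = 34, FRAC_BITS_DEFAULT = 14, FRAC_BITS_HIGH = 30 };

/* Pack seconds since the Unix epoch plus nanoseconds into sec:frac fixed point. */
static inline uint64_t pack_time(uint64_t sec, uint32_t nsec, unsigned frac_bits)
{
	assert(sec < (uint64_t)1 << SEC_BITS);
	assert(nsec < 1000000000u);
	/* Scale nanoseconds to a binary fraction, rounding down. */
	uint64_t frac = ((uint64_t)nsec << frac_bits) / 1000000000u;
	return sec << frac_bits | frac;
}

/* Whole seconds are the bits above the fraction. */
static inline uint64_t time_sec(uint64_t t, unsigned frac_bits)
{
	return t >> frac_bits;
}

/* Convert the binary fraction back to nanoseconds, rounding down. */
static inline uint32_t time_nsec(uint64_t t, unsigned frac_bits)
{
	uint64_t frac = t & (((uint64_t)1 << frac_bits) - 1);
	return (uint32_t)((frac * 1000000000u) >> frac_bits);
}
```

With a 14 bit fraction, a timestamp loses at most one step (just over 61035 ns) on a round trip. With a 30 bit fraction the step is slightly under one nanosecond, so the stored value is close to, but not exactly, a nanosecond count.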
2009-02-28 16:00 100 usec 2009-02-28 16:00 um 2009-02-28 16:00 64 usec 2009-02-28 16:00 oh 2009-02-28 16:01 it may confuse "make" 2009-02-28 16:01 if somebody requires higher resolution than that, they are probably trying to hack the fs ;) 2009-02-28 16:01 make uses 1 sec resolution currently 2009-02-28 16:01 oh 2009-02-28 16:01 so this is way better than needed for make 2009-02-28 16:01 make is really broken 2009-02-28 16:01 yes 2009-02-28 16:02 I doubt there are any valid applications for > 1 ms resolution 2009-02-28 16:02 marcin has an opinion no doubt 2009-02-28 16:02 applications of high precision fs timestamps are mainly hacking and forensic investigation 2009-02-28 16:02 I was thinking nanosec timestamp is for app include make 2009-02-28 16:02 in that case, they can set an option 2009-02-28 16:03 make doesn't care about nanoseconds 2009-02-28 16:03 really :) 2009-02-28 16:03 :) 2009-02-28 16:03 well, we should at least assign a high precision time format atkind 2009-02-28 16:04 and we should assign its size, and skip it in decode 2009-02-28 16:04 that is "forward compatible" 2009-02-28 16:05 http://userweb.kernel.org/~hirofumi/rdev-support.patch 2009-02-28 16:06 changed the interface of decode_attr 2009-02-28 16:06 good 2009-02-28 16:06 still just a idea though 2009-02-28 16:06 still just a idea though <- my patch 2009-02-28 16:07 having a separate parameter just for rdev is not pretty 2009-02-28 16:07 yes 2009-02-28 16:07 however, it's handling as special in kernel 2009-02-28 16:08 34:32 time format delays the 2037 problem by 200 years, to 2238 2009-02-28 16:08 sorry 2009-02-28 16:08 34:30 2009-02-28 16:08 let's use some other way of knowing rdev 2009-02-28 16:09 I guess it's ok with 200 years 2009-02-28 16:09 200 years gives us enough time to improve it ;) 2009-02-28 16:09 :) 2009-02-28 16:09 and we still have "nanoseconds" 2009-02-28 16:10 though not quiet exact nanoseconds, and we don't care :) 2009-02-28 16:10 I don't know whether someone may want to 
store future timestamp or not 2009-02-28 16:10 they can store it in an xattr ;) 2009-02-28 16:10 oh 2009-02-28 16:10 well, I am thinking, as long as we don't break unix time, we are ok 2009-02-28 16:11 yes 2009-02-28 16:11 sun might do something like allowing nanosecond resolution since the beginning of the universe, but we do not need to ;) 2009-02-28 16:11 128 bit times? ;) 2009-02-28 16:12 ok, the rdev handling issue 2009-02-28 16:14 oh 2009-02-28 16:14 uint64_t zp_mtime[2] 2009-02-28 16:14 ah, we forgot the atime attr 2009-02-28 16:14 I was joking about sun 2009-02-28 16:15 128 bit times is too much even for them 2009-02-28 16:15 ok, having optional high precision times requires 3 additional attribute kinds 2009-02-28 16:15 it seems 64:64 2009-02-28 16:16 :p 2009-02-28 16:16 wow 2009-02-28 16:16 :) 2009-02-28 16:16 I thought I was joking 2009-02-28 16:16 you're right 2009-02-28 16:17 ok, I guess we better increase the size of the atkind field by one bit 2009-02-28 16:17 because we are getting too close to 16 different atkinds 2009-02-28 16:17 we still have reserved bit for now 2009-02-28 16:17 yes, but on the wrong side of the field 2009-02-28 16:17 the field is in the high order bits 2009-02-28 16:18 both of fixed and variable 2009-02-28 16:18 that's just a single reserved value 2009-02-28 16:18 one fixed, one variable reserved 2009-02-28 16:18 yes 2009-02-28 16:18 perhaps introducing an extended format 2009-02-28 16:19 for atime? 2009-02-28 16:19 now, the version field does not really have to be 12 bits, my thought is, that field encodes up to 1 << 10 versions _per inode table block_ 2009-02-28 16:20 extended format for atkind is what I meant 2009-02-28 16:20 not an attractive idea 2009-02-28 16:20 a mess, like ext* 2009-02-28 16:21 also, widening a field and putting the new info in a noncontiguous place is like ext*, I hope we can avoid it 2009-02-28 16:21 it works, but its ugly and verbose 2009-02-28 16:21 btw, idata is what's for? 
2009-02-28 16:21 and if you ever look at it with hexump, it's confusing 2009-02-28 16:21 idata -> immediate file data 2009-02-28 16:22 only file data? 2009-02-28 16:22 could be symlink 2009-02-28 16:22 yes 2009-02-28 16:22 it's like an xattr, but we save the atom field 2009-02-28 16:22 what's big difference with ileaf formated data? 2009-02-28 16:23 ileaf formatted? 2009-02-28 16:23 ah 2009-02-28 16:23 it's very similar to an xattr 2009-02-28 16:23 I mean ileaf has atkind+fixed size data or atkind+variable size data 2009-02-28 16:24 we already support variable sized xattrs 2009-02-28 16:24 and idata has only data 2009-02-28 16:24 idata is just like an xattr, the only difference is, no atom field 2009-02-28 16:25 -!- ned(~ned@c-76-19-208-96.hsd1.ma.comcast.net) has joined #tux3 2009-02-28 16:25 so we save 2 or 4 bytes or whatever we decide the atom field size is 2009-02-28 16:25 just a idea though, what do it have atkind in it? 2009-02-28 16:25 speaking of which, we might want to have a variant xattr format for large atom numbers 2009-02-28 16:25 instead of making all xattr atoms 4 bytes 2009-02-28 16:26 oh, yes, that is what I was refering to above 2009-02-28 16:26 we can actually have an atkind that means "higher range atkind" 2009-02-28 16:26 not really pretty 2009-02-28 16:26 ok, so there are remaining atkind questions 2009-02-28 16:27 I will make a note of them 2009-02-28 16:27 it is will unresolved whether 4 atkind bits are enough 2009-02-28 16:27 ah, yes 2009-02-28 16:29 one thing that is true: filesystem designers usually allow for lots of extra fields that are never used 2009-02-28 16:30 take a look at all the ext2 features that were never implemented 2009-02-28 16:30 and unused bits 2009-02-28 16:30 yes 2009-02-28 16:32 what is wrong if we add the some bytes later 2009-02-28 16:32 it's ok 2009-02-28 16:32 ah, of course, it's uncompatible 2009-02-28 16:32 yes 2009-02-28 16:32 adds cruft 2009-02-28 16:32 let's just try to avoid having to do that in its first 
year ;) 2009-02-28 16:33 but, I guess new bit itself is uncompatible change 2009-02-28 16:33 when we review, we should invite people to ask for their favorite attributes 2009-02-28 16:33 so we can prioritze them 2009-02-28 16:34 in some cases we can make a forward compatible change, like a high precision time field we skip on decode 2009-02-28 16:34 yes 2009-02-28 16:34 but, the code should already know about it 2009-02-28 16:35 another thing we can do is store the atsize field in the superblock 2009-02-28 16:35 size is known 2009-02-28 16:35 in fact, that would be a very nice thing to do 2009-02-28 16:35 yes 2009-02-28 16:35 and we can offer optional time precision that way 2009-02-28 16:35 I think it is worth doing 2009-02-28 16:35 I was thinking about feature flag like ext* 2009-02-28 16:36 feature flag we probably want too, but storing the atsize table makes it a lot easier to do forward compatible attributes 2009-02-28 16:36 16BIT_ATKIND_FEATURE 2009-02-28 16:36 s/probably/definitely/ 2009-02-28 16:36 ah 2009-02-28 16:37 FEATURE_LOTS_OF_ATKINDS 2009-02-28 16:37 yes 2009-02-28 16:37 we better start doing that part of the design, as a background project 2009-02-28 16:38 these small things can be very time consuming 2009-02-28 16:38 probably, yes for the future 2009-02-28 16:38 i see 2009-02-28 16:39 storing the atkind table, maybe 64 bytes, seems like a good idea 2009-02-28 16:39 well 2009-02-28 16:39 it can wait a while 2009-02-28 16:39 ah 2009-02-28 16:40 I really don't want to have lots of different atkinds, just for different time precisions 2009-02-28 16:40 and do want to have inodes 10% smaller 2009-02-28 16:40 but do not want to have to say to people: you can't have nanosecond precision 2009-02-28 16:40 yes 2009-02-28 16:40 ah, I meant the atsize table 2009-02-28 16:41 atsize and what attr bits is using 2009-02-28 16:42 we could also store the fraction width for time fields in the superblock, or it can be implied by the atsize 2009-02-28 16:42 yes 
2009-02-28 16:42 mapping of attr bits 2009-02-28 16:42 and number of fixed vs variable 2009-02-28 16:42 I do not think it will affect efficiency of encode/decode measurably 2009-02-28 16:43 well, probably, bits are enough 2009-02-28 16:43 because, unkown bits can't be handled by old code 2009-02-28 16:44 we can use a bitmap to determine which attrs are variable sized 2009-02-28 16:44 yes 2009-02-28 16:45 however, the old code doesn't know wether unkonwn bit is necessary or not 2009-02-28 16:45 it will have enough information to be able to skip unknown attributes 2009-02-28 16:45 that is useful 2009-02-28 16:46 like compatible feature and uncompatible feature? 2009-02-28 16:46 for example, we can introduce allocation hints later without needing to reformat old filesystems 2009-02-28 16:46 yes 2009-02-28 16:46 i see 2009-02-28 16:46 to make this work, we need to define the count field format of variable sized attributes 2009-02-28 16:48 well, I'd like to delay to later 2009-02-28 16:48 good idea :) 2009-02-28 16:48 thanks :) 2009-02-28 16:49 well, back to rdev issue 2009-02-28 16:49 decode_attr should initialize inode? 2009-02-28 16:50 that is ok for now 2009-02-28 16:51 it is strictly an internal thing 2009-02-28 16:51 just make a comment "this is ugly because..." 2009-02-28 16:52 one small thing, since RDEV = 0, maybe put it at the beginning of the encode/decode case switch 2009-02-28 16:52 yes 2009-02-28 16:54 RDEV=0 is a bit wrong? 2009-02-28 16:54 so... the format of variable sized attrs is already well defined, there is always a 16 bit count of bytes immediately after the atribute kind/version 2009-02-28 16:55 ? 2009-02-28 16:55 why wrong? 2009-02-28 16:55 wrong name for sure 2009-02-28 16:55 don't know, one small thing is meanning it? 2009-02-28 16:55 ? 
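[Editor's sketch] The forward compatibility idea above (an atsize table and a variable size bitmap stored in the superblock, plus a 16 bit byte count right after the head of every variable size attribute) gives a decoder enough information to step over kinds it does not understand. The sketch below is a guess at how that could look. struct toy_attr_layout, get16 and attr_span are hypothetical names, the big-endian byte order is assumed, and the count is assumed to cover the payload only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-filesystem layout record, as it might be read from the superblock. */
struct toy_attr_layout {
	uint8_t atsize[16];	/* payload bytes for each fixed size kind */
	uint16_t variable;	/* bit k set: kind k is variable size */
};

/* Read a 16 bit big-endian value (byte order is an assumption). */
static unsigned get16(const uint8_t *p)
{
	return p[0] << 8 | p[1];
}

/*
 * Return the total number of bytes occupied by the attribute at `at`,
 * head included, or 0 if it would run past `limit`. The kind does not
 * need to be understood, only sized.
 */
static size_t attr_span(const struct toy_attr_layout *layout,
			const uint8_t *at, const uint8_t *limit)
{
	const uint8_t *start = at;
	size_t size;

	if (limit - at < 2)
		return 0;
	unsigned kind = get16(at) >> 12;
	at += 2;
	if (layout->variable >> kind & 1) {
		/* Variable size: a 16 bit byte count follows the head. */
		if (limit - at < 2)
			return 0;
		size = get16(at);
		at += 2;
	} else {
		size = layout->atsize[kind];
	}
	if ((size_t)(limit - at) < size)
		return 0;
	return (size_t)(at - start) + size;
}
```

Old code could use a helper like this to skip, say, a later allocation hint attribute while still decoding the kinds it knows. It still cannot tell whether a skipped attribute was safe to ignore, which is where a compatible/incompatible feature flag would come in.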
2009-02-28 16:55 flips> one small thing, since RDEV = 0, maybe put it at the beginning of the encode/decode case switch 2009-02-28 16:55 oh 2009-02-28 16:56 I just mean it is a small point 2009-02-28 16:56 just code layout 2009-02-28 16:56 does not affect anything 2009-02-28 16:56 no problem with beginning of encode/decode? 2009-02-28 16:57 no problem :) 2009-02-28 16:57 ok :) 2009-02-28 16:58 we can do patches for above forward-compatible changes to attribute handling during review 2009-02-28 16:58 they are time consuming 2009-02-28 16:59 ok, back to rdev inode update 2009-02-28 17:00 well, MODE_OWNER assigns an inode field 2009-02-28 17:00 so if RDEV does also, it is not a change 2009-02-28 17:00 rdev is assigned in init_special_inode() 2009-02-28 17:01 kernel code, hmm 2009-02-28 17:01 somehow, I'd like to avoid assigning directly 2009-02-28 17:01 vfs code 2009-02-28 17:01 yes 2009-02-28 17:02 by the way, my server response seems to be much better ever since I updated my robots.txt to keep them from reading my whole git tree 2009-02-28 17:04 well, I would say init_special is stupidly designed 2009-02-28 17:04 it should not assign i_rdev, and it should not pass it as a parameter 2009-02-28 17:05 it should just assign i_fop according to the mode 2009-02-28 17:06 yes, probably 2009-02-28 17:06 so, it is ok to use the i_rdev field to store the rdev in decode_attrs 2009-02-28 17:06 one day, we will patch vfs to be prettier ;) 2009-02-28 17:06 ok, just set it with a comment 2009-02-28 17:06 sure 2009-02-28 17:07 /* vfs, trying to be helpful, will rewrite the field */ 2009-02-28 17:07 there is no issue needing a change to our api there 2009-02-28 17:07 ok 2009-02-28 17:10 we can remove the rdev parameter from tux_setup_inode: init_special_inode(inode, inode->i_mode, inode->i_rdev); 2009-02-28 17:11 yes 2009-02-28 17:15 http://userweb.kernel.org/~hirofumi/rdev-support.patch 2009-02-28 17:15 could you review it 2009-02-28 17:16 ok 2009-02-28 17:16 wow, unix time is defined
to increase exactly 86400 seconds per day, regardless of how many seconds are actually in the day 2009-02-28 17:17 that is brain damage 2009-02-28 17:17 in utc? 2009-02-28 17:18 http://en.wikipedia.org/wiki/Unix_epoch 2009-02-28 17:18 unix time is defined to be utc, which has this weirdness 2009-02-28 17:18 so unix time is not continuous 2009-02-28 17:18 it should be 2009-02-28 17:18 regardless of utc not being continuous 2009-02-28 17:19 time that suddenly accelerates/decelerates on all computers in the world is not good 2009-02-28 17:19 yes 2009-02-28 17:19 or another way of putting it: never relay on timeofday for anything important 2009-02-28 17:20 use jiffies 2009-02-28 17:20 liner time is good for on-disk format 2009-02-28 17:20 localtime is just wrong 2009-02-28 17:20 and even if utc 2009-02-28 17:21 utc seems to think about leap sec 2009-02-28 17:22 geez, i leave for 4hrs and backlog is big enough to be a book ;) 2009-02-28 17:22 it looks good 2009-02-28 17:22 just let me know when it is ready to pull 2009-02-28 17:23 ok 2009-02-28 17:23 speaking of time resolution, i've asked around for any govt spec for time resolution,and there arent any ;) 2009-02-28 17:23 with ileaf stuff, or without? 2009-02-28 17:23 marcin, the smallest things generate the most chat ;) 2009-02-28 17:24 marcin, you will be pleased to know that today's discussion resulted in the conclusion that tux3 will have optional precision 2009-02-28 17:24 your choice of 14 bit, 30 bit, or maybe 64 bit fraction 2009-02-28 17:24 is that precision encoded somewhere? as in, if i take the disk out and put it in another system with different resolution, is it gonna set proper or can it be 'cast' to something else? 
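A rough illustration, in C, of what a selectable fraction width could look like: seconds in the high bits and a binary fraction of a second in the low `frac_bits` bits. This is only a sketch under assumptions. The function names, the single 64 bit packed value and the parameter `frac_bits` are hypothetical, and the sketch only covers widths up to 30 bits, not the 64 bit fraction mentioned above.

```c
#include <stdint.h>

/*
 * Pack seconds and nanoseconds into one value with a binary fraction
 * of frac_bits bits (frac_bits <= 30 in this sketch, nsec < 1e9).
 */
uint64_t time_encode(uint32_t sec, uint32_t nsec, unsigned frac_bits)
{
	uint64_t frac = ((uint64_t)nsec << frac_bits) / 1000000000u;

	return (uint64_t)sec << frac_bits | frac;
}

/* Whole seconds stored in a packed value. */
uint32_t time_decode_sec(uint64_t t, unsigned frac_bits)
{
	return (uint32_t)(t >> frac_bits);
}

/* Nanoseconds recovered from the fraction, rounded down. */
uint32_t time_decode_nsec(uint64_t t, unsigned frac_bits)
{
	uint64_t frac = t & ((1ULL << frac_bits) - 1);

	return (uint32_t)((frac * 1000000000u) >> frac_bits);
}
```

A 14 bit fraction resolves 1/16384 of a second, roughly 61 microseconds, while a 30 bit fraction resolves a little under one nanosecond.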
2009-02-28 17:25 so zfs cannot bost about being the only fs able to measure zillionths of seconds since the big bang 2009-02-28 17:25 and until the next big bang 2009-02-28 17:25 it will be incoded in the superblock 2009-02-28 17:25 decision made at fs create time 2009-02-28 17:25 how hard/easy would be to force it to act as something else? 2009-02-28 17:26 what sort of something else? 2009-02-28 17:26 let's say i make a 64bit res, but use only 14bit, then use the other 50bits as a covert channel 2009-02-28 17:26 good luck 2009-02-28 17:26 pretty much impossible 2009-02-28 17:27 or, impossible, period 2009-02-28 17:27 then if i stuck it in another system with 14bit, would i be able to read the other 50bits ? 2009-02-28 17:27 no 2009-02-28 17:27 we don't let any spurious bits though 2009-02-28 17:27 your best bet would be to fiddle the clock 2009-02-28 17:28 if you really need to do that ;) 2009-02-28 17:28 yea, timing channel is pretty much unavoidable 2009-02-28 17:28 we're good about not letting random garbage onto disk 2009-02-28 17:29 good to know it's being thought about 2009-02-28 17:29 you and your friends can give the source a good scrubbing 2009-02-28 17:29 i hate unused/reserved/future features sort of bits, noone checks for this shit, they get left behind for easy covert channels 2009-02-28 17:30 I presume "rectal examination" is the proper technical term 2009-02-28 17:30 i wish people here would be interested/able to do that sort of thing 2009-02-28 17:31 well if you are able to use a covert channel that transmits one bit per remount, you are leet 2009-02-28 17:31 slow than smoke signals by a lot 2009-02-28 17:31 oh 2009-02-28 17:32 one bit per filesystem create ;) 2009-02-28 17:32 because once you turn the feature bit on (by using it) you can't turn it off 2009-02-28 17:32 so: one if by land, two if by sea is doable 2009-02-28 17:32 anything more than a bit is outta luck 2009-02-28 17:32 there are projects where allowed covert bandwidth is 1bit/5yrs ;) 
2009-02-28 17:32 got to find yourself a better covert channel 2009-02-28 17:33 heh 2009-02-28 17:33 they'd better not use computers then 2009-02-28 17:33 solid chunk of steel sounds about right, or maybe that is iffy 2009-02-28 17:37 so an example of a future feature bit that was very successful is the ext2/3 index bit 2009-02-28 17:37 it eventually got used (by me) 2009-02-28 17:38 index bit? what does that do? 2009-02-28 17:38 indexed directories or not 2009-02-28 17:39 the cool thing was, ext2/3 would always clear that bit for a particular directory inode bit whenever it changed the directory 2009-02-28 17:39 I don't know why ted/stephen thought to do that, but he/they did 2009-02-28 17:40 so that allowed me to design a new index format that would automatically revert to being the old linear format if _a filesystem used on a newer kernel with indexing was remounted on an old kernel without indexing_ 2009-02-28 17:40 this is extreme compatibility 2009-02-28 17:40 weird, never heard of it 2009-02-28 17:41 no news is good news 2009-02-28 17:41 if affected you for sure 2009-02-28 17:41 if you used linux for the last 5 years 2009-02-28 17:42 we will get to considerations like that in the later steps of finalizing the disk formats 2009-02-28 17:42 and listen to our security advisors too of course 2009-02-28 17:45 we might do the same trick again in tux3, if we start getting real users before the directory indexing arrives 2009-02-28 17:46 probably better is just make sure indexing arrives in a timely manner 2009-02-28 17:46 i'm still not sure what indexing are you talking about here :/ 2009-02-28 17:46 like beagle sort of stuff? 
2009-02-28 17:46 directory btree 2009-02-28 17:47 used to be linear lookup/create/delete 2009-02-28 17:47 which went verticle in cpu at about 8,000 files 2009-02-28 17:47 vertical 2009-02-28 17:54 the way to search directory entries 2009-02-28 17:54 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-02-28 17:55 at this time, rdev support patch only 2009-02-28 17:58 good 2009-02-28 17:59 there are pending patches though 2009-02-28 17:59 you mean, there are other patches not in your hg repository? 2009-02-28 17:59 ACTION still cloning 2009-02-28 17:59 yes 2009-02-28 18:00 ileaf stuff 2009-02-28 18:00 ileaf stuff? 2009-02-28 18:00 ileaf fixes 2009-02-28 18:00 right 2009-02-28 18:02 yes, rdev should always be present if chrdev/blkdev, not worth a revision right now though 2009-02-28 18:03 ok 2009-02-28 18:03 nice patch 2009-02-28 18:03 cleaner each rev 2009-02-28 18:03 time to pull 2009-02-28 18:07 hirofumi, I get a two-headed repo when I pull that 2009-02-28 18:07 ACTION tries again 2009-02-28 18:07 pulled to public repo? 2009-02-28 18:08 pulled to my private, which was pushed to public 2009-02-28 18:08 um.. 2009-02-28 18:08 there is no change with public? 
2009-02-28 18:08 no
2009-02-28 18:08 just checked
2009-02-28 18:08 pulling from static-http://userweb.kernel.org/~hirofumi/tux3/
2009-02-28 18:08 searching for changes
2009-02-28 18:08 adding changesets
2009-02-28 18:08 adding manifests
2009-02-28 18:08 adding file changes
2009-02-28 18:08 added 1 changesets with 7 changes to 7 files (+1 heads)
2009-02-28 18:08 (run 'hg heads' to see heads, 'hg merge' to merge)
2009-02-28 18:08 daniel@moonbase:/src/tux3$ hg heads
2009-02-28 18:09 changeset: 1021:969915712d5f
2009-02-28 18:09 tag: tip
2009-02-28 18:09 parent: 1006:3118107954f1
2009-02-28 18:09 user: OGAWA Hirofumi
2009-02-28 18:09 date: Sun Mar 01 10:46:38 2009 +0900
2009-02-28 18:09 summary: Introduce inode->i_rdev support
2009-02-28 18:09 changeset: 1020:e034620d446a
2009-02-28 18:09 user: OGAWA Hirofumi
2009-02-28 18:09 date: Tue Feb 24 06:40:53 2009 +0900
2009-02-28 18:09 summary: Use info->key in tree_chop(), instead of info->resume
2009-02-28 18:09 ah
2009-02-28 18:10 I was forgetting to update my repo :)
2009-02-28 18:10 good :)
2009-02-28 18:10 I don't want it to be a "mercurial doesn't work" thing :)
2009-02-28 18:11 you oughtta be using github ;)
2009-02-28 18:13 ok
2009-02-28 18:13 updated
2009-02-28 18:13 github/hg
2009-02-28 18:20 damm msnbot chose this exact moment to index the mail archives
2009-02-28 18:20 whacking my hg pull performance by 3 orders of magnitude or so
2009-02-28 18:21 2 anyway
2009-02-28 18:22 I need kill -9 msnbot
2009-02-28 18:23 well I think I will ban that agent
2009-02-28 18:23 msnbot indexing tux3 mail archives seems as useless as you can get
2009-02-28 18:25 if it's network delay?
2009-02-28 18:25 is it network delay?
2009-02-28 18:25 yes 2009-02-28 18:25 caused by search bot 2009-02-28 18:26 on my slow dsl 2009-02-28 18:26 I will ban msnbot 2009-02-28 18:26 googlebot does not seem to be as bad 2009-02-28 18:26 apache may be able to control bandwidth 2009-02-28 18:27 ah, better 2009-02-28 18:27 we will give .01% to msnbot :) 2009-02-28 18:27 marcin will like that 2009-02-28 18:27 given the discussion of using fs feature bits as a covert channel 2009-02-28 18:28 ok, what is marcin's ability to crash tux3 now? 2009-02-28 18:28 delete bug still? 2009-02-28 18:29 lemme check on the status 2009-02-28 18:30 the rm bug happened on hiro's code, the ileaf went away 2009-02-28 18:31 ok 2009-02-28 18:32 TUX3_rmsegfault.zip <- this is for the rm bug? 2009-02-28 18:32 yes 2009-02-28 18:32 yea 2009-02-28 18:34 segfault in readdir 2009-02-28 18:39 could you post your recipe to repeat this to the list? 2009-02-28 18:39 maybe try to repeat it with a shorter procedure? 2009-02-28 18:39 for the crash? 2009-02-28 18:39 yes 2009-02-28 18:39 oh sure, it's very simple 2009-02-28 18:43 did the full oops make it to (the virtualized) /var/log/messages? 
2009-02-28 18:43 hold on, i'm recreating it 2009-02-28 18:47 gr, now rm -fr ./ worked fine 2009-02-28 18:47 took a while, but worked cleanly 2009-02-28 18:53 yes, there is a bug that makes it take a while 2009-02-28 18:53 but we believe you are not making this up ;) 2009-02-28 18:54 or using your cosmic ray improbability generator 2009-02-28 18:54 this isnt highschool projcts, i dont need to make data up ;) 2009-02-28 18:55 a lot of windows bugs are caused by improbability generators you know 2009-02-28 18:55 they have a large improbability generator set up in redmond hg, on the upper floors 2009-02-28 18:59 wow, killzone takes shooter rendering to the next level 2009-02-28 19:00 might have to take a couple minutes for that 2009-02-28 19:00 could not let mey 4 year old play this, there's no way to turn off the blood and cussing 2009-02-28 19:01 ACTION recommends "flower" for kids 2009-02-28 19:05 -!- chesse_(~eworm@dslb-084-062-150-122.pools.arcor-ip.net) has joined #tux3 2009-02-28 19:24 -!- ned(~ned@c-76-19-208-96.hsd1.ma.comcast.net) has joined #tux3 2009-02-28 19:41 damn it, i cant reproduce it now :/ 2009-02-28 19:41 that's bad I think 2009-02-28 19:41 for you or for me? 
:) 2009-02-28 19:41 or, with luck, heh 2009-02-28 19:41 got to think about that 2009-02-28 19:42 i've been reading all afternoon about filesystem fuzzing 2009-02-28 19:42 well, in the best possible scenario it means that the earlier oops was actually before your ileaf fixes, and that you forgot 2009-02-28 19:43 which is bad for your prognosis of early senility, but good for stability 2009-02-28 19:43 no no, i did it after the ileaf fix 2009-02-28 19:43 ohoh 2009-02-28 19:43 I mean, the fuzzing 2009-02-28 19:43 basic strategy is: cat /dev/urandom to the volume and try to mount it 2009-02-28 19:44 better, dd if=/dev/urandom of=/dev/victim seek=whatever 2009-02-28 19:45 better, dd if=/dev/urandom of=/dev/victim seek=whatever count=1 bs=512 2009-02-28 19:46 i gotta hit up my gradschool buddy, he's been working on a full BNF-fueled fuzzing engine 2009-02-28 19:46 scary 2009-02-28 19:46 to me all the fuzzing engines are missing a vital component: time 2009-02-28 19:47 that's a one rough bitch to control, infinite resolution and all 2009-02-28 19:47 and with async events, timing is crucial 2009-02-28 19:49 i dont wanna just dd'ing crap to volumes, i'd rather feed your utils, work through your interfaces, this way it tests both the utils and the fs 2009-02-28 19:49 yes 2009-02-28 19:50 /dev/urandom is good for testing fsck 2009-02-28 19:50 i didnt think we have a fsck yet? 2009-02-28 19:53 how fast of /dev/urandom you get? 2009-02-28 19:55 fast? 2009-02-28 19:56 yea, how many mb/sec can you get? 
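The `dd if=/dev/urandom ... count=1 bs=512` idea above amounts to overwriting a single 512 byte sector with noise. A small C sketch of that step follows, working on an in-memory copy of a volume image so it can be tried without a device. The function name `fuzz_one_sector` is made up for illustration, and `rand(3)` is used only for brevity. A real run would write the sector back to the image or device, which is not shown here.

```c
#include <stdint.h>
#include <stdlib.h>

enum { SECTOR = 512 };

/*
 * Overwrite one pseudo-randomly chosen sector of an in-memory volume
 * image with noise. The seed selects both the sector and the noise, so
 * the same seed with the same libc repeats the same corruption.
 * Returns the index of the sector hit, or -1 if the image is smaller
 * than one sector.
 */
long fuzz_one_sector(uint8_t *image, size_t size, unsigned seed)
{
	size_t sectors = size / SECTOR;

	if (sectors == 0)
		return -1;

	srand(seed);
	size_t victim = (size_t)rand() % sectors;

	for (size_t i = 0; i < SECTOR; i++)
		image[victim * SECTOR + i] = (uint8_t)rand();

	return (long)victim;
}
```

Logging the seed with each run is what makes a failing case repeatable on the same machine.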
2009-02-28 19:56 i get like 7mb/sec on my 2.4ghz machine 2009-02-28 19:56 which isnt much 2009-02-28 19:56 i'm thinking of just leaving it overnight and let it generate few gigs, then just cat the file 2009-02-28 19:57 for fuzzing it'd be a speedup, generate once, read off at disk speed many time 2009-02-28 19:58 3.3MB/sec, pretty slow 2009-02-28 19:58 on my pentium M 2009-02-28 19:59 I usually use rand(3) 2009-02-28 19:59 much faster 2009-02-28 19:59 yup, that's kinda slow, considering how many permutations you need for fuzzing, the curse of dimensionality and all 2009-02-28 19:59 it's meant more for key generation 2009-02-28 19:59 how does rand() do it? 2009-02-28 20:00 cuz for fuzzing i dont particularly care for cryptographic strengh of my noise ;) 2009-02-28 20:05 -!- chesse(~eworm@dslb-084-062-146-029.pools.arcor-ip.net) has joined #tux3 2009-02-28 21:13 marcin, can grab the course 2009-02-28 21:13 could be linear shift feedback 2009-02-28 21:13 prolly not linear congruential, used to be the fashion 2009-02-28 21:14 well, it claims to be slow too 2009-02-28 21:15 damn man, is there an area of CS where you dont know something? :) 2009-02-28 21:15 i avoid crypto like the plague 2009-02-28 21:15 something however little? 
;) 2009-02-28 21:16 more than most 2009-02-28 21:19 one thing is, libc maintainers have limited scope to change the generator over time, because people do not like it when the same seed produces a different sequence 2009-02-28 21:19 so best is to find source for a decent generator 2009-02-28 21:20 matlab uses "mersenne twister" ooh leet 2009-02-28 21:20 i thought they've used like interrupts and network traffic 2009-02-28 21:20 to seed a sequence, sure 2009-02-28 21:21 using that stuff only would be very nonrandom 2009-02-28 21:21 to much internal correlation 2009-02-28 21:21 it's just used to de-pseudo the generator 2009-02-28 21:23 http://en.wikipedia.org/wiki/Mersenne_twister 2009-02-28 21:23 you probably want this 2009-02-28 21:23 http://en.wikipedia.org/wiki/Mersenne_twister 2009-02-28 21:23 whoops 2009-02-28 21:24 i can generate sequences in matlab and just use that ;) 2009-02-28 21:24 http://en.wikipedia.org/wiki/Mersenne_twister#Pseudocode 2009-02-28 21:24 "designed to have a period of 219937 ? 1 (the creators of the algorithm proved this property)" 2009-02-28 21:25 that is, 2**19937 2009-02-28 21:25 (lost some formatting there) 2009-02-28 21:25 period long enough for you? 
2009-02-28 21:26 yea, marginally ;)
2009-02-28 21:27 http://en.wikipedia.org/wiki/Lagged_Fibonacci_generator <- fast one
2009-02-28 21:27 wikipedia is great :)
2009-02-28 21:27 "never roll yer own" is knuth's takeaway
2009-02-28 21:27 oh yes, i fully agree
2009-02-28 21:28 i leave crypto to the few that truly get it
2009-02-28 21:28 well this is number theory
2009-02-28 21:28 applied
2009-02-28 21:28 i'm a semi-skilled user of crypto, i know what not to do ;)
2009-02-28 23:09 -!- RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3
2009-02-28 23:43 http://www.bedaux.net/mtrand/ <- c++ source for mersenne twister
2009-02-28 23:43 I would value something short, personally
2009-02-28 23:45 http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/CODES/mt19937ar.c
2009-02-28 23:46 (non free license, has advertising clause)
2009-02-28 23:47 well, it says binary distributions must reproduce the copyright notice, not display it
2009-02-28 23:49 "623-dimensional equidistribution property is assured"
2009-02-28 23:49 wow, that's more than 3
2009-02-28 23:55 http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/VERSIONS/C-LANG/tt800.c <- nice short version
2009-03-01 00:05 unsigned long KISS() {
2009-03-01 00:05 static unsigned long x=123456789,
2009-03-01 00:05 y=362436, z=521288629, c=7654321;
2009-03-01 00:05 unsigned long long t, a=698769069LL;
2009-03-01 00:05 x=69069*x+12345;
2009-03-01 00:05 y^=(y<<13); y^=(y>>17); y^=(y<<5);
2009-03-01 00:05 t=a*z+c; c=(t>>32);
2009-03-01 00:05 return x+y+(z=t); } <- now this one is short
2009-03-01 00:05 this passes my "good for hacking" filter
2009-03-01 00:06 x=123456789 <- this number clearly has deeply pleasing theoretical properties
2009-03-01 00:06 c=7654321; <- likewise
2009-03-01 00:07 x=69069*x+12345; <- and another find constant
2009-03-01 00:29 Nice.
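The KISS snippet pasted above keeps its state in function-local statics and relies on the width of `unsigned long`. A hedged restatement in C with explicit 32 bit types and caller-owned state is sketched below, on the assumption that the pasted version was meant for a 32 bit `unsigned long`. The struct, the function names and the `kiss_fill` helper are hypothetical additions. The constants are copied from the paste as-is.

```c
#include <stddef.h>
#include <stdint.h>

/* Generator state, owned by the caller so runs can be replayed. */
struct kiss {
	uint32_t x, y, z, c;
};

/* Same starting values as the snippet pasted in the log. */
static const struct kiss kiss_default = {
	123456789, 362436, 521288629, 7654321
};

/* One step: the three sub-generators of the paste, summed. */
uint32_t kiss_next(struct kiss *k)
{
	uint64_t t;

	k->x = 69069u * k->x + 12345u;		/* linear congruential step */
	k->y ^= k->y << 13;			/* xorshift step */
	k->y ^= k->y >> 17;
	k->y ^= k->y << 5;
	t = 698769069ULL * k->z + k->c;		/* multiply with carry step */
	k->c = (uint32_t)(t >> 32);
	k->z = (uint32_t)t;
	return k->x + k->y + k->z;
}

/* Fill a buffer with noise, one output byte per generator step. */
void kiss_fill(struct kiss *k, uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		buf[i] = (uint8_t)(kiss_next(k) >> 24);
}
```

Because the state lives in a struct, a test can copy `kiss_default` (or any recorded state) and get the same byte stream on every platform, independent of the libc. If custom seeds are used, `y` should be nonzero, since a zero xorshift state stays zero.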
2009-03-01 01:28 I should put this in the version.c selftest, so if somebody finds a bug in the fuzz test I am guaranteed to be able to reproduce it 2009-03-01 01:28 I doubt I have that guarantee with rand 2009-03-01 01:28 libc rand 2009-03-01 10:13 hirofumi, there? 2009-03-01 10:22 ah, you're just in time for me crashing readdir again 2009-03-01 10:24 heh 2009-03-01 10:24 got any new clues on trigger conditions? 2009-03-01 10:24 nope, this one was quite unintended 2009-03-01 10:25 well I suppose we will instrument readdir 2009-03-01 10:25 i think stack has to get corrupted somewhere 2009-03-01 10:25 doesn't sound like a stack thing 2009-03-01 10:25 cuz it gets invalid opcode 0000 2009-03-01 10:25 it could be 2009-03-01 10:26 that's BUG 2009-03-01 10:26 can you see the oops/bug message? 2009-03-01 10:27 well, i've done plenty of exploits where i clobber the stack's return pointer, function exits and it jumps into boonies, and sometimes it grabs random data, like 0000 trying to interpret it as code 2009-03-01 10:27 it's coming 2009-03-01 10:28 since there are no BUGs or asserts in tux3/dir.c, your theory gains merit 2009-03-01 10:29 you're overrunning a boundary somewhere 2009-03-01 10:29 my hacking sense is tingling ;) 2009-03-01 10:30 mmm, sushi for breakfast 2009-03-01 10:32 sent to your mail 2009-03-01 10:33 -!- amey_(~amey@117.195.37.190) has joined #tux3 2009-03-01 10:44 i'm reading tux3/user/kernel/dir.c readdir to be precise...any obvious targets i should be aware of? 2009-03-01 10:44 ah, tux_error 2009-03-01 10:44 hmm, I wonder why that had to be changed from error 2009-03-01 10:45 so it's our BUG 2009-03-01 10:45 hm? 2009-03-01 10:45 it means, our readdir hit an empty dirent 2009-03-01 10:45 that is, a zero 2009-03-01 10:45 that would explain the 0000 opcode 2009-03-01 10:45 but why would it dump it in RET?
2009-03-01 10:45 not the way you think ;) 2009-03-01 10:46 it found _data_ = 0 2009-03-01 10:46 where it is not allowed 2009-03-01 10:46 it is our check 2009-03-01 10:46 an assert, essentially 2009-03-01 10:47 geezus, this is all pointer arithmetic, i'm amazed it works at all 2009-03-01 10:47 tux_readdir: zero length entry at <86007:18> 2009-03-01 10:47 care to translate that for people who didnt write it? :) 2009-03-01 10:48 ok, i see the tux_error 2009-03-01 10:48 inode number 0x86007, dirent block 18 2009-03-01 10:48 that's a good sized directory 2009-03-01 10:48 wtf is brelse? 2009-03-01 10:48 what's in it? 2009-03-01 10:49 "buffer release" 2009-03-01 10:49 bad old bsd terminology from way back 2009-03-01 10:49 shitloads of files, i have a small program that generates many files in many directories 2009-03-01 10:49 good, well you're exercising something ;) 2009-03-01 10:50 i run it till it hits out of space ;) 2009-03-01 10:50 directory corruption on delete 2009-03-01 10:50 in this case was there an out of space before the BUG? 2009-03-01 10:50 tha'ts the program that i asked you about filling 2.2gb partition with 7xx kb of files 2009-03-01 10:50 yes 2009-03-01 10:50 ah 2009-03-01 10:50 well 2009-03-01 10:50 we're getting closer :) 2009-03-01 10:50 i fill it up, i clean it up 2009-03-01 10:50 wow 2009-03-01 10:51 hm? 2009-03-01 10:51 we're not supposed to be testing out of space at this point ;) 2009-03-01 10:51 I'm amazed we survive even one 2009-03-01 10:51 i didnt get that demo 2009-03-01 10:51 memo 2009-03-01 10:51 heh 2009-03-01 10:51 well I'm happy if this is the issue 2009-03-01 10:51 ehm...i'm glad i could help? 2009-03-01 10:52 :) 2009-03-01 10:52 so what's so special about out of space testing? 
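The check described above (readdir finds a zero where a directory entry length must be nonzero and reports inode number and dirent block) can be sketched in C roughly as follows. This is not the tux3 code. The entry layout (a 12 byte head with a big-endian 16 bit record length at offset 8) and the name `scan_dirent_block` are assumptions chosen only to make the example self-contained.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical entry head: 8 byte inum, 2 byte rec_len, name_len, type. */
enum { HEAD_SIZE = 12, REC_LEN_OFFSET = 8 };

/*
 * Count the entries in one directory block by following record lengths.
 * Returns -1 on a zero length entry (reported in the style of the
 * message quoted above) or on any other impossible length.
 */
int scan_dirent_block(const uint8_t *block, size_t blocksize,
		      uint64_t inum, unsigned blocknr)
{
	size_t pos = 0;
	int count = 0;

	while (pos + HEAD_SIZE <= blocksize) {
		unsigned rec_len = (unsigned)block[pos + REC_LEN_OFFSET] << 8
				 | block[pos + REC_LEN_OFFSET + 1];

		if (rec_len == 0) {
			/* Without this check the loop would never advance. */
			fprintf(stderr, "zero length entry at <%llx:%u>\n",
				(unsigned long long)inum, blocknr);
			return -1;
		}
		if (rec_len < HEAD_SIZE || rec_len > blocksize - pos)
			return -1;

		count++;
		pos += rec_len;
	}
	return count;
}
```

A block that reads back as all zeroes fails on its very first entry, which matches the "it found _data_ = 0 where it is not allowed" description.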
2009-03-01 10:52 ok, well I suppose we should think about how the out of space turned into directory corruption 2009-03-01 10:52 we haven't added the out of space handling yet 2009-03-01 10:52 random bad things can/will happen 2009-03-01 10:53 apparently ;) 2009-03-01 10:53 but it seems we're doing pretty well, for not having thought about it at all 2009-03-01 10:53 i love these, nice sideeffects of good design 2009-03-01 10:53 side effect of dumb luck 2009-03-01 10:54 how do you create a file? 2009-03-01 10:54 write something to it? 2009-03-01 10:54 or touch? 2009-03-01 10:54 or? 2009-03-01 10:54 gimme a min, my armchair just fell apart :/ 2009-03-01 10:56 ok it's back together...strange 2009-03-01 10:56 lemme post the code that creates the dirs/files 2009-03-01 10:56 good 2009-03-01 10:57 http://www.marcintology.com/make-many-files.c 2009-03-01 10:58 mail it to the list? 2009-03-01 10:58 it's just somehting i found on the internet, it's quite simple 2009-03-01 10:59 ok, you are essentially touching files to create them 2009-03-01 10:59 creat with some random buffer 2009-03-01 11:00 how does that affect your assement of the bug? 2009-03-01 11:00 right, and write something 2009-03-01 11:00 sorry, missed that the first time ;) 2009-03-01 11:02 ok, it makes a 3 level tree of directories with a leaf layer of files, directories are numbered sequentially 2009-03-01 11:02 and files 2009-03-01 11:03 so you must see some of the perrors? 2009-03-01 11:03 what's a perror? 2009-03-01 11:03 print error 2009-03-01 11:03 would that go to syslog? 2009-03-01 11:03 console 2009-03-01 11:03 would be things like "out of space" 2009-03-01 11:04 when it runs it's clean 2009-03-01 11:04 then it just hits out of space 2009-03-01 11:04 and what do you see? 
2009-03-01 11:04 the out of space message 2009-03-01 11:04 and it stops 2009-03-01 11:04 right, that's the perror 2009-03-01 11:04 nice and clean dismount 2009-03-01 11:04 hmm, stops 2009-03-01 11:04 right 2009-03-01 11:04 exit(1) 2009-03-01 11:04 ok 2009-03-01 11:05 well this is a good clue 2009-03-01 11:05 it's a good test program 2009-03-01 11:05 yea i was looking for a simple 'populator' 2009-03-01 11:06 if it runs to completion it will create 405,000 files 2009-03-01 11:07 and with 4+8k per file that's... alot 2009-03-01 11:07 165 GB of file data 2009-03-01 11:08 sorry 2009-03-01 11:08 1.65 GB of file data 2009-03-01 11:09 1.54GB 2009-03-01 11:09 (binary gigabytes) 2009-03-01 11:09 or 4.something GB with our fluff metadata 2009-03-01 11:10 the partition size is? 2009-03-01 11:10 2.2 2009-03-01 11:10 gb 2009-03-01 11:10 ok, so everything makes much sense 2009-03-01 11:10 you want me to make a bigger one 2009-03-01 11:10 this is fine 2009-03-01 11:10 a smaller one, actually 2009-03-01 11:10 much smaller 2009-03-01 11:11 and see if you can produce the corruption with a run that takes just a few seconds 2009-03-01 11:11 will do 2009-03-01 11:11 that's called "tightening up the conditions" 2009-03-01 11:11 usual step in bug squishing 2009-03-01 11:12 meanwhile I will actually think about what we need to do for ENOSPC 2009-03-01 11:12 have thought about it a little 2009-03-01 11:14 posting your test program to the list would be good 2009-03-01 11:14 it's a nice little test 2009-03-01 11:15 dude should have signed it 2009-03-01 11:34 yup, blows up quickly, but the same way 2009-03-01 11:35 yes, good 2009-03-01 11:35 next is to post your recipe to the list 2009-03-01 11:35 then we have a group think about ENOSPC handling 2009-03-01 11:35 not much of a recipe, compile code off the net, execute ;) 2009-03-01 11:35 for the specific case of creating lots of files (which is actually a pretty general case) 2009-03-01 11:36 sure, however: size of partition, commands to run 
to compile and run the test, what to expect... 2009-03-01 11:36 lemme snapshot this badboy and i'll rm the files, see if we get it to blow up the same way 2009-03-01 11:36 maximizes the chance that somebody else will try it 2009-03-01 11:37 and that they will be a coder, and offer helpful suggestions or code :) 2009-03-01 11:37 it happens 2009-03-01 11:37 I guess you will 2009-03-01 11:37 in fact if I stop being lazy, I will think about it and realize the path to corruption 2009-03-01 11:38 if this was userland, i could just gdb this thing 2009-03-01 11:39 well, for some reason they don't check for ENOSPC in the write 2009-03-01 11:39 they? 2009-03-01 11:39 author of the proggy 2009-03-01 11:39 they = he/she 2009-03-01 11:39 they = he/she/it 2009-03-01 11:40 cuz it coulda been a dog, we don't know 2009-03-01 11:40 they catch an error, a more general case 2009-03-01 11:40 if this was written in a better language, you could do some real exception handling 2009-03-01 11:40 ACTION hides 2009-03-01 11:41 they don't check for error on write and we would tell them if they did, it's a fair deal 2009-03-01 11:41 we wouln't tell them I mean 2009-03-01 11:42 ok it rm'ed cleanly 2009-03-01 11:42 anyway, we don't know if the enospc happens when the vm flushes the data cache, or when the file is actually created 2009-03-01 11:42 so... 2009-03-01 11:43 we should add some printks on the ENOSPC path 2009-03-01 11:43 and find out which 2009-03-01 11:43 so what is __tux3_get_block: map_region failed: 28 ? 2009-03-01 11:43 it's very helpful :) 2009-03-01 11:43 i found a lot of these in the kernel logs 2009-03-01 11:43 thankyou :) 2009-03-01 11:43 ok, that tells me precisely what happened 2009-03-01 11:44 well 2009-03-01 11:44 not completely precisely 2009-03-01 11:44 ok 2009-03-01 11:44 we might want to put a stack dump there 2009-03-01 11:45 is that some sort of allocation thingy? 
2009-03-01 11:45 it is 2009-03-01 11:46 it doesn't tell us whether it's trying to allocate a dirent block or regular file data to flush the buffered write to disk 2009-03-01 11:46 either way, it's a file flush 2009-03-01 11:47 I assume it could happen in either place, but if it happens to occur during a directory flush, we can end up with dir metadata updated but not the dir block 2009-03-01 11:47 wheee 2009-03-01 11:47 that was a sound of how it just went over my head 2009-03-01 11:47 proper handling is to report the ENOSPC at dirent create time 2009-03-01 11:48 lemme ask a basic question first: why are dirs stored differently? i thought in unix everything's a file 2009-03-01 11:48 in tux3, dirs are files, they aren't different 2009-03-01 11:49 in btrfs they are not files 2009-03-01 11:49 they are part of the big btree 2009-03-01 11:49 then why is it dirent, and not some generic filent? 2009-03-01 11:49 so it can be either way 2009-03-01 11:49 dirent is one directory name, that is name + inum 2009-03-01 11:50 ugh, sounds like this is where we need a whiteboard and some beer 2009-03-01 11:50 that's about the inum? 
2009-03-01 11:50 inum = inode number 2009-03-01 11:50 simple :) 2009-03-01 11:51 time to make a strong cuppa java 2009-03-01 13:52 getting close to sk8 oclock 2009-03-01 13:53 new design note on the way 2009-03-01 13:53 specifically intended to help Hirofumi with the logging work 2009-03-01 13:54 by the way, I remember why it is necessary to redirect btree nodes when first dirtied, rather than at time of flush 2009-03-01 13:54 it is because we only know the parent block at time of dirty, that is, the parent is in the btree cursor 2009-03-01 13:55 finding the parent later when we only know that some volmap block is dirty, would require some complex means of tracking the parent 2009-03-01 13:55 not worth doing, when it is so easy to do the redirect at the time we know the parent 2009-03-01 13:56 not only do we know the parent from the cursor, but we have it locked, which will matter when we refine the locking granularity 2009-03-01 14:00 marcin, we may not do much about out of space handling until after atomic commit 2009-03-01 14:00 maybe some cheap temporary hack 2009-03-01 14:01 but atomic commit provides part of the mechanism we need for proper ENOSPC handling, that is, per-change space reservation 2009-03-01 14:01 or "log credits" 2009-03-01 14:01 maybe call that "change credits" 2009-03-01 14:01 better 2009-03-01 14:07 so, when a creat finds there is no space available in the target directory, first thing to do is reserve several blocks worth of "change credits", that is, 1 new direct block, and maybe: 1 new data index leaf, 1 new data index node, 1 new inode table leaf (if the directory has never been flushed) 1 new inode table index block, 1 new log block, and maybe 1 new bitmap block for each of those. 2009-03-01 14:07 s/direct/dirent/ 2009-03-01 14:08 that is, about a dozen blocks at worst 2009-03-01 14:09 then we actually do the change and count the number of blocks newly dirtied. 
We return the difference between credit reservation and actual new dirties to the credit pool 2009-03-01 14:09 if at the start of this change there are less than freeblocks - needed credits available, we return ENOSPC 2009-03-01 14:15 as it is, we just stumble along and bail out, and the vfs will "helpfully" flush our updated directory inode to disk with an increased i_size but a hole in the directory where the updated dirent block should be. 2009-03-01 14:15 this is why we see zeroes in the dirent block later when readdir tries to scan it 2009-03-01 14:17 question for hirofumi: can we now drop our MAX_LFS_FILESIZE hack, that is, are we using our own blockread for bitmaps now? 2009-03-01 14:17 ACTION should know that... 2009-03-01 14:38 sorry, had to talk to parental units, birthday or something 2009-03-01 15:25 -!- dcg(~dcg@157.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-01 15:29 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-01 16:32 marcin, I guess I lied, I'm going to make a preliminary patch for directory ENOSPC later today 2009-03-01 16:32 after the new design note 2009-03-01 17:02 -!- konrad(~konrad@D-128-208-53-188.dhcp4.washington.edu) has joined #tux3 2009-03-01 17:18 ACTION needs to update the hall of fame 2009-03-01 18:34 http://mailman.tux3.org/pipermail/tux3/2009-March/000754.html <- latest design note 2009-03-01 18:34 and cusses, I forgot to wrap the paragraphs 2009-03-01 18:34 I hope I don't make too many eyes bleed over there on dragonfly list 2009-03-01 18:38 firebug + adding the white-space: pre-wrap style rule makes it readable :) 2009-03-01 18:39 ah 2009-03-01 18:39 well, kmail has been improved so it doesn't wrap on save as draft anymore, a surprise to me 2009-03-01 18:39 so the resend is also unwrapped 2009-03-01 18:39 double cusses 2009-03-01 18:40 kmail does not seem to provide any way to do hard wrapping now 2009-03-01 18:40 let alone an automatic way 2009-03-01 18:51 hey 2009-03-01 19:01 -!- 
tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-03-01 19:05 -!- chesse_(~eworm@dslb-084-062-172-162.pools.arcor-ip.net) has joined #tux3 2009-03-01 20:05 -!- chesse(~eworm@dslb-084-062-154-127.pools.arcor-ip.net) has joined #tux3 2009-03-01 20:46 hirofumi, there? 2009-03-01 21:18 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-03-01 22:47 -!- RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3 2009-03-02 00:01 we forgot to update the super magic date yesterday 2009-03-02 00:01 now I know what happens when tux3 tries to load an incompatible disk format 2009-03-02 00:01 it goes BUG 2009-03-02 00:01 that's pretty nice actually 2009-03-02 00:02 at least its not oops 2009-03-02 00:06 then the next mount of anything hangs trying to do down_read(&sb->s_umount) 2009-03-02 00:06 so... only sorta nice 2009-03-02 00:06 that's how linux is, fail to release a lock or complete some little thing and its game over 2009-03-02 00:07 not really big iron class yet 2009-03-02 00:14 ok, just about to try marcin's test 2009-03-02 01:40 hey fli 2009-03-02 01:41 bah 2009-03-02 02:56 -!- flips(~phillips@phunq.net) has joined #tux3 2009-03-02 04:39 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-03-02 04:50 -!- flips(~phillips@phunq.net) has joined #tux3 2009-03-02 07:05 -!- flips(~phillips@phunq.net) has joined #tux3 2009-03-02 08:04 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-03-02 09:56 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-03-02 10:07 -!- flips(~phillips@phunq.net) has joined #tux3 2009-03-02 11:12 -!- flips(~phillips@phunq.net) has joined #tux3 2009-03-02 11:26 marcin, there? 
2009-03-02 11:45 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-02 11:47 yea 2009-03-02 11:50 flips, ping 2009-03-02 11:52 hey 2009-03-02 11:52 got a preliminary patch to handle out of space 2009-03-02 11:53 are you gonna sync up yours and hiro's trees? 2009-03-02 11:53 they're pretty close 2009-03-02 11:54 so what difference in behaviour should i see? 2009-03-02 11:55 no oops 2009-03-02 11:55 on readdir 2009-03-02 11:55 otherwise the same, gives no space error when it fills up 2009-03-02 11:55 hopefully performs the same 2009-03-02 11:55 I'll test that now 2009-03-02 11:58 looks like our trees are synced 2009-03-02 11:59 ok, i'll test that after 5pm EST 2009-03-02 12:00 or maybe not, i'm getting sick, who knows 2009-03-02 12:01 ACTION beams health waves at marcin 2009-03-02 12:04 i gotta figure out how to run my vm's remotely 2009-03-02 12:04 automation? 2009-03-02 12:05 well, wrong phrasing, i am running my vm's remotely, but they're in a gui window, even if they're text 2009-03-02 12:06 if i can just make them look like yet another screen session... ;) 2009-03-02 12:17 uml 2009-03-02 12:24 does uml have snapshots? 2009-03-02 12:32 probably not 2009-03-02 12:34 you kinda want snapshots for repetitive tests, especially when you're corrupting partitions ;) 2009-03-02 13:47 proper nospace handling requires making reservations against remaining free disk blocks 2009-03-02 13:47 so, strategy of the current crude hack is to make a gross overestimate of the blocks that will be required 2009-03-02 13:48 and keep accumulating a more and more wildly overestimated reservation as file changes proceed 2009-03-02 13:48 then, when it hits the wall... looks like nospace when actually there is lots of space... 
force a flush, set the reservations to 0, then see if we are really out of space 2009-03-02 13:49 this produces an interesting effect 2009-03-02 13:49 flushes to check space are very rare until the filesystem starts getting near full 2009-03-02 13:49 then they occur faster and faster 2009-03-02 13:49 until there is a flush on every change 2009-03-02 13:50 until that last few blocks are used up 2009-03-02 13:50 you can see that happen in the test run 2009-03-02 13:50 it's kind of cool 2009-03-02 13:51 flush rate accelerate until finally it gives up with "no space left on device" 2009-03-02 13:51 and otherwise, everything looks non-corrupt 2009-03-02 13:52 with considerably more work we can refine this strategy 2009-03-02 13:53 by refining the reservation credits calculation as a change proceeds 2009-03-02 13:53 and returning unused credits to the free pool 2009-03-02 13:54 well, the actual block usage normally happens outside the scope of the filesystem change 2009-03-02 13:54 if you're doing writes in batches, cant you count how much space you're gonna need by the next batch being done? 2009-03-02 13:54 that is, if we write to a buffer cache, we don't know how much metadata that will use until way later at delta commit time 2009-03-02 13:55 "doing writes in batches" ? 2009-03-02 13:57 i dont know the proper nomenclature for it... commit groups? 
2009-03-02 13:57 you had a good name for it 2009-03-02 14:03 alright, i'm going home, ttyl in 30mins 2009-03-02 14:06 delta commit 2009-03-02 14:06 the problem is, the amount of space required isn't exactly fixed per write 2009-03-02 14:07 it depends on how full existing metadata blocks are, for one thing 2009-03-02 14:07 some blocks might be split or redirected 2009-03-02 14:07 a log flush might be required, creating new metadata blocks 2009-03-02 14:08 so a worst case estimate of new blocks needed has to be a pretty wild overestimate 2009-03-02 14:09 therefore it is a good thing to design in the ability to handle wild overestimates accurately, but iteratively flushing and working with smaller deltas as the volume gets near full 2009-03-02 14:09 when the volume is near full, you don't care about performance very much 2009-03-02 14:09 it will degenerate for much bigger reasons than the freespace calculation anyway, like seeking all over the place to fill in the last few holes 2009-03-02 14:11 ok, one more detail to work out... everybody has to wait for a flush to finish when the purpose of the flush is to determine whether the volume is really full 2009-03-02 14:11 some kind of mutex required here 2009-03-02 14:11 or other synchronizer 2009-03-02 14:12 hirofumi, around? 2009-03-02 15:00 -!- ceatinge_(~ceatinge@veryclever.net) has joined #tux3 2009-03-02 15:27 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-02 16:46 out of space patch posted 2009-03-02 16:49 http://mailman.tux3.org/pipermail/tux3/2009-March/000756.html 2009-03-02 16:58 just ran the test with your new code, get a segfault 2009-03-02 16:58 whoops 2009-03-02 16:58 EIP in tux3_add_dirent 2009-03-02 16:59 and i just got a shitload of __tux3_get_block: map_region failed: 28 2009-03-02 16:59 that is after an out of space? 2009-03-02 16:59 oh 2009-03-02 16:59 it's not working at all 2009-03-02 16:59 are you actually running the patched code? 
2009-03-02 16:59 i synced like 15 mins ago 2009-03-02 17:00 it's not checked in 2009-03-02 17:00 you have to patch from the mailing list message 2009-03-02 17:00 booo...ok, lemme go get that 2009-03-02 17:00 then remember to unpatch before your next sync ;) 2009-03-02 17:00 oy, why not just include it then? 2009-03-02 17:00 not a big deal if you forget, but you will need to do some fixup 2009-03-02 17:01 too experimental? 2009-03-02 17:01 well it's gross 2009-03-02 17:01 yes 2009-03-02 17:01 not entirely good taste 2009-03-02 17:01 so it needs some review 2009-03-02 17:01 the central idea is sound, probably 2009-03-02 17:01 i'm reading the writeup...kinda funny 2009-03-02 17:02 kenrel development is a barrel of laughes 2009-03-02 17:02 kernel development is too 2009-03-02 17:08 ok rebooting to try the new code 2009-03-02 17:08 anything in particular i should be watching for? 2009-03-02 17:10 you put some new logging things to dump to kernel's syslog didnt you? 2009-03-02 17:11 not yet 2009-03-02 17:11 I've been playing with it 2009-03-02 17:11 not checked in 2009-03-02 17:12 i got some extra stuff 2009-03-02 17:13 reserve credits >>> freeblocks = 64, credits = a 2009-03-02 17:13 good 2009-03-02 17:13 that's just some tracing code to see its doing the right thing 2009-03-02 17:14 that must be just before it halted with out of space 2009-03-02 17:14 yup 2009-03-02 17:15 oh, you made a booboo 2009-03-02 17:15 i'm trying to remove the files, and now it's giving me 'cannot remove xyz: no space on device' 2009-03-02 17:21 heh 2009-03-02 17:21 yes 2009-03-02 17:22 i'm guessing that's not the desired effect? 2009-03-02 17:22 not unless you also like wearing hair shirts and sleeping on a bed of nails 2009-03-02 17:26 so why would it need more space to remove files? 
2009-03-02 17:26 it's just a bug in my code 2009-03-02 17:27 but actually, removing a file can easily require allocating space 2009-03-02 17:27 it depends on the fielsystem design 2009-03-02 17:27 btrfs hits that problem a lot 2009-03-02 17:27 i know i'm gonna be sorry for asking this...how so? 2009-03-02 17:27 "allocate in free" is always nasty 2009-03-02 17:29 if you are holding a snapshot for example, a new block has to be allocated to contain the freed version of a directory entry 2009-03-02 17:30 whats so hard about a bounded reserve? ;) 2009-03-02 17:30 :) 2009-03-02 17:30 the "bounded" part? 2009-03-02 17:30 just like extents, easy :P 2009-03-02 17:30 (if you dont have to code it) 2009-03-02 17:31 and the "not necessarily monotonic" part? 2009-03-02 17:33 you two make me wanna chug nyquil ;) 2009-03-02 17:33 marcin: you want to do that anyway 2009-03-02 17:36 marcin, you can remove your old patch with: hg diff | patch -Rp1 2009-03-02 17:38 http://mailman.tux3.org/pipermail/tux3/2009-March/000757.html <- revised patch allowing delete 2009-03-02 17:38 (hopefully) 2009-03-02 17:40 ACTION sends it to his test machine 2009-03-02 17:43 one nice thing about this approach to enospace, it seems to be able to run right down to about 10 actual blocks remaining on the fs 2009-03-02 17:43 can probably squeeze closer than that to actually zero 2009-03-02 17:44 ok, rm works now 2009-03-02 17:44 confirmed ;) 2009-03-02 17:45 rm -r, even 2009-03-02 17:45 well, not gauranteed 2009-03-02 17:45 i was actually able to remount the old partition with leftover files, and remove them without a problem 2009-03-02 17:45 there's still fuzz in rm that hasn't been cleaned up 2009-03-02 17:45 nice 2009-03-02 17:45 well we will let this patch sit around for a while before committing 2009-03-02 17:46 aged patches seem to oops less 2009-03-02 17:46 some kind of special mold that grows on them I think 2009-03-02 17:47 i'm running the same test on my 2gb partition, to see if the behaviour is 
the same 2009-03-02 17:47 that would be nice 2009-03-02 17:47 I run on a 10 MB partition 2009-03-02 17:47 actually, fdisk refused to even create a partition that small 2009-03-02 17:48 gave me 16 MB 2009-03-02 17:48 i have like a 128mb partition and a 2.xx gb 2009-03-02 17:49 there are linux distributions that boot in 1MB 2009-03-02 17:49 or there were 2009-03-02 17:49 now the compressed kernel starts about that size 2009-03-02 17:49 that is, 1MB of disk 2009-03-02 17:49 old floppy 2009-03-02 17:50 something aint right 2009-03-02 17:50 the big partition test is still running 2009-03-02 17:50 normally quits by now? 2009-03-02 17:50 ya 2009-03-02 17:50 to dmesg and see what it says 2009-03-02 17:51 i'm watching all the logs, i got nothing 2009-03-02 17:51 that's good 2009-03-02 17:51 how about du /mountpoint ? 2009-03-02 17:51 it's only up to 176mb so far 2009-03-02 17:51 very slow 2009-03-02 17:52 didn't really change anything that should slow it 2009-03-02 17:52 in fact, it seems to be running more the speed it should 2009-03-02 17:52 15K files is really too much for a linear directory search 2009-03-02 17:53 what does top say? 2009-03-02 17:54 I'm running it on a big partition now 2009-03-02 17:54 make many files and pdflush are in D state but no utilization 2009-03-02 17:54 that's a bug 2009-03-02 17:55 different bug probably 2009-03-02 17:55 well 2009-03-02 17:55 yes, a bug 2009-03-02 17:55 mine is still running, hit the first flush cycle 2009-03-02 17:56 few mins later and it's still sitting at 176mb filled... 2009-03-02 17:56 completed 2009-03-02 17:56 this is very slow 2009-03-02 17:56 yes, it's a bug 2009-03-02 17:56 different bug 2009-03-02 17:56 mine completed 2009-03-02 17:56 so... on to the next bug 2009-03-02 17:57 anything i can do to help? 2009-03-02 17:57 can you go Alt-Sysrq-t with your vmthing? 
2009-03-02 17:57 dunno, lemme check 2009-03-02 17:58 it's supposed to support it 2009-03-02 17:58 it's not doing anything 2009-03-02 17:59 and vmware docs on sysrq are not highly visible to google 2009-03-02 18:00 i'm wondering if i have that funcitonality compiled into the kernel 2009-03-02 18:00 it's on by default 2009-03-02 18:00 but may be disabled in vmware configuration 2009-03-02 18:00 yea but i dont do default kernels ;) 2009-03-02 18:00 however, it's not easy to find vmware docs 2009-03-02 18:00 well look in "kernel hacking" 2009-03-02 18:01 how much memory do you have in the virtual instance? 2009-03-02 18:01 512mb 2009-03-02 18:01 and it's on in the kernel config 2009-03-02 18:02 some vmware reason for it not working, or not obviously working 2009-03-02 18:02 the trace may in fact be in your /var/log/messages 2009-03-02 18:02 can you open another console? 2009-03-02 18:02 rm -r of the full tree completed 2009-03-02 18:03 this machine has 1 GB 2009-03-02 18:03 so its possible that flush activity is higher on your test setup, and exposed something 2009-03-02 18:04 something's hung like john holmes 2009-03-02 18:04 marcin: echo t > /proc/sysrq-trigger 2009-03-02 18:04 cannot kill the make-many-files process 2009-03-02 18:04 right 2009-03-02 18:04 comes with the territory when you're in D state 2009-03-02 18:04 or should we call this dd state? 2009-03-02 18:05 acutally theres one better now 2009-03-02 18:05 wages of sin 2009-03-02 18:05 echo w >/proc/sysrq-trigger 2009-03-02 18:05 will only show "blocked" processes 2009-03-02 18:05 in case you have many 2009-03-02 18:05 nice 2009-03-02 18:06 nothing, both clean 2009-03-02 18:06 w is short for wtf? 2009-03-02 18:06 "waiting" ? 2009-03-02 18:06 "watching"? 2009-03-02 18:06 marcin: clean how? 2009-03-02 18:06 "wanking"? 
2009-03-02 18:06 he means, no obvious result I think 2009-03-02 18:06 flips: would probably be interested in the strack trace of the D state process 2009-03-02 18:06 clean as in it comes back with nothing 2009-03-02 18:06 marcin: it will output to dmesg 2009-03-02 18:06 oh yes 2009-03-02 18:06 or syslog 2009-03-02 18:07 depending how your syslog is setupo 2009-03-02 18:07 dmesg is the one 2009-03-02 18:07 there will certainly be lots of stack traces in your dmesg ;) 2009-03-02 18:07 it dumped a lot of stuff there 2009-03-02 18:07 :) 2009-03-02 18:07 what does that do? i never seen that stuff 2009-03-02 18:07 that's one thing: set the printk buffer size as high as it will go in the kernel config 2009-03-02 18:08 marcin, your next mission should you choose to accept it is to forward that dump to the tux3 mailing list 2009-03-02 18:10 http://www.marcintology.com/dmesgdump 2009-03-02 18:12 now how do you read that stuff? what are you looking for? 2009-03-02 18:13 marcin, is that a -t or a -w? 2009-03-02 18:13 we're busy reading it, watabit :) 2009-03-02 18:13 i did both, as shap told me 2009-03-02 18:14 first t then w 2009-03-02 18:14 congestion_wait 2009-03-02 18:14 it's kernel braindamage methinks 2009-03-02 18:14 not necessarily, but possibly 2009-03-02 18:15 why are there two make-many-files processes? 2009-03-02 18:15 did you start two at the same time? 2009-03-02 18:16 that's probably cuz i did both t and w? 2009-03-02 18:16 i've sent it some kill signals but that's about it 2009-03-02 18:16 one instance fired off 2009-03-02 18:17 oh, store_attrs is called recursively 2009-03-02 18:17 ah 2009-03-02 18:17 that's reasonable 2009-03-02 18:17 time to reboot 2009-03-02 18:17 that's a good error report 2009-03-02 18:17 ok, looks like a deadlock caused by the freeze_bdev 2009-03-02 18:17 recursive call to store_attrs 2009-03-02 18:18 how's your analysis going, shapor? 2009-03-02 18:18 why would that hapen? 2009-03-02 18:18 bug? 2009-03-02 18:18 misdesign? 
2009-03-02 18:18 not paying your taxes in full? 2009-03-02 18:18 is that ours or kernels? 2009-03-02 18:18 ours 2009-03-02 18:19 congestion_wait is just where the vm flushes goes to while away time while waiting for us to never do anything 2009-03-02 18:23 well, I don't really see the place where it blocks 2009-03-02 18:23 store_attrs doesn't itself take any locks 2009-03-02 18:24 damm compiler has optimized some stack frames away 2009-03-02 18:24 store_attrs does not directly call write_cache_pages 2009-03-02 18:25 how nice it would be if kernel build with -O0 actually worked 2009-03-02 18:26 anyway, it's clear that the recursive store_attrs is bad 2009-03-02 18:28 down_write(&cursor->btree->lock); <- this may be the issue 2009-03-02 18:28 it is definitely an issue 2009-03-02 18:29 probably what it means is, freeze_bdev is no good as a solution, which we already knew, but it does serve to demonstrate the principle 2009-03-02 18:29 just don't run out of memory 2009-03-02 18:29 I guess, zfs has that rule, even in production 2009-03-02 19:05 -!- chesse_(~eworm@dslb-084-062-134-177.pools.arcor-ip.net) has joined #tux3 2009-03-02 19:16 ok, I see a general approach to dealing with the flush deadlock 2009-03-02 19:17 in short: at the point the big btree lock is released, check for a pending event needing servicing, for example, a volume flush (later: delta transition) 2009-03-02 19:17 taking a delta transition deep inside a lock seems like an idea doomed to fail 2009-03-02 20:05 -!- chesse(~eworm@dslb-084-062-178-219.pools.arcor-ip.net) has joined #tux3 2009-03-02 20:54 -!- tim_dimm(~timothyhu@cpe-76-168-94-231.socal.res.rr.com) has joined #tux3 2009-03-02 21:11 -!- cdk(~chinmay@115.109.10.27) has joined #tux3 2009-03-02 21:12 shapor, there? 2009-03-02 21:58 -!- cdk(~chinmay@115.109.10.27) has joined #tux3 2009-03-02 22:02 hi flips 2009-03-02 22:06 hi cdk 2009-03-02 22:06 the mail that i sent you, does it have everything you wanted ? 
2009-03-02 22:06 ACTION hasn't updated the hall of fame yet 2009-03-02 22:06 was pinging shapor about that 2009-03-02 22:06 I'll send to him 2009-03-02 22:07 ok 2009-03-02 22:08 about the kernel port , what locking mechanism should be placed in case of meta-data blocks like the buckets we use? 2009-03-02 22:08 lock_page() ? 2009-03-02 22:09 it does 2009-03-02 22:10 I just sent it to shapor 2009-03-02 22:10 are you working on the kernel port? 2009-03-02 22:10 yes .. 2009-03-02 22:10 just use the btree lock 2009-03-02 22:10 but buckets are not part of the btree 2009-03-02 22:10 like the other two kinds of btrees 2009-03-02 22:11 just lock the whole tree 2009-03-02 22:11 -!- gaurav(~gaurav@115.109.10.27) has joined #tux3 2009-03-02 22:11 where are buckets stored? 2009-03-02 22:11 -!- kushal(~kushal@115.109.10.27) has joined #tux3 2009-03-02 22:11 in physical blocks 2009-03-02 22:11 I would suggest storing them in a file 2009-03-02 22:11 not in physical blocks 2009-03-02 22:12 and do the kernel port only if you want to 2009-03-02 22:12 the user space demonstration is more important 2009-03-02 22:12 if you feel you need to do the kernel port to complete your project, then by all means do it 2009-03-02 22:12 otherwise, more work on the user space model would probably be more useful 2009-03-02 22:12 -!- amey_(~amey@115.109.10.27) has joined #tux3 2009-03-02 22:13 yes, i think we will need it for our university review, and we are working on collision handling for userspace at the same time 2009-03-02 22:15 i think we will try the alternative approach of using a smaller hash after our project review 2009-03-02 22:18 -!- amey_(~amey@115.109.10.27) has joined #tux3 2009-03-02 22:19 gaurav, the result of that would be very interesting to me 2009-03-02 22:20 ok, so you are going to do both the kernel port and algorithm variant :) 2009-03-02 22:20 ok 2009-03-02 22:20 yes we will try to 2009-03-02 22:20 :) 2009-03-02 22:20 in a production version, we would move the hash btree 
into a logical file 2009-03-02 22:20 as is planned for the tux3 directory index 2009-03-02 22:20 but that mechanism is not ready for you to use yes 2009-03-02 22:21 so having the hash as a physical btree for now is fine 2009-03-02 22:21 the buckets should probably be stored as blocks of a file though 2009-03-02 22:22 if that is too big a change, just keep it as you have it 2009-03-02 22:22 otherwise, look at how bitmap blocks are handled 2009-03-02 22:22 ok 2009-03-02 22:23 for our review we will keep as it is. and later change it..when we attempt the variant 2009-03-02 22:23 -!- amey_m(~amey@115.109.10.27) has joined #tux3 2009-03-02 22:24 well, don't bother changing it even then 2009-03-02 22:24 the behavior of the variant hash is much more interesting 2009-03-02 22:25 ok 2009-03-02 22:25 so for now we have to lock_page ? 2009-03-02 22:25 it is relatively easy for us to change the physical/logical represenation later 2009-03-02 22:25 you can use lock_page, yes 2009-03-02 22:25 or you can add a mutex to the struct sb and use that 2009-03-02 22:26 lock_page should be fine 2009-03-02 22:26 ok. will use lock_page then 2009-03-02 22:26 but no such lock exists in user space 2009-03-02 22:27 it is desireable to keep your code running both in userspace and kernel, if you can 2009-03-02 22:27 for that reason, a mutex in struct sb would be preferable 2009-03-02 22:27 ok 2009-03-02 22:27 we provide an (non-smp) emulation of mutexes in userspace 2009-03-02 22:27 so mutex locking code is portable between userspace and kernel 2009-03-02 22:28 planning to use kernel crypto sha1 implementation that hirofumi mentioned 2009-03-02 22:28 fine 2009-03-02 22:53 folks 2009-03-02 22:53 hey flips 2009-03-02 22:53 hi bh 2009-03-02 22:53 how's the bug count thing going ? 2009-03-02 22:53 ACTION hits the backlog 2009-03-02 22:53 who's counting? 
2009-03-02 22:53 -!- RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3 2009-03-02 22:53 the world :) 2009-03-02 22:53 the #tux3 count is going fine, 28 = new record 2009-03-02 22:55 mostly edge cases with the b-trees ? 2009-03-02 22:55 mostly this and that 2009-03-02 22:55 not many bugs 2009-03-02 22:56 no bugs with btree for many moons 2009-03-02 22:56 nice 2009-03-02 22:57 still working on scheduler stuff here, 1 of 3 parts is nearing completion 2009-03-02 22:57 ACTION goes private 2009-03-02 22:59 -!- amey_m(~amey@115.109.10.27) has joined #tux3 2009-03-02 23:00 flips, i userspace to lock a particular buffer , we would need to lock the entire sb? or just that buffer ? 2009-03-02 23:01 just that buffer 2009-03-02 23:01 well 2009-03-02 23:01 we don't emulate buffer lock in userspace 2009-03-02 23:01 I don't think we do 2009-03-02 23:02 best is to take the btree mutex if you're working in your own btree 2009-03-02 23:02 and in kernel space ? we use lock_page ? 2009-03-02 23:02 if not, create a mutex in the sb 2009-03-02 23:02 well I would not choose to use lock_page at this stage 2009-03-02 23:02 you can though 2009-03-02 23:03 it is usually best to start with a global lock, test your code under heavy multitheading, and if it works, _then_ use a finer granularity lock like page_lock 2009-03-02 23:03 lock_page 2009-03-02 23:03 ok 2009-03-02 23:05 so we use mutex in sb, lock it when we need to modify the buffer and release it after modification is done 2009-03-03 02:01 chk, that's it 2009-03-03 02:47 -!- amey_m(~amey@115.109.10.27) has joined #tux3 2009-03-03 02:56 I guess btree->lock should be fine for dedup btree 2009-03-03 02:56 not sb 2009-03-03 02:57 well, this means it should be fine with same way of other btrees 2009-03-03 02:59 yes hirofumi, fine for the dedup btree ... we were just thinking about the buckets 2009-03-03 02:59 i see 2009-03-03 02:59 as they are not exactly in the btree 2009-03-03 02:59 those are where is in? 
2009-03-03 03:00 just stored in physical bloks 2009-03-03 03:00 is it pointed by where? 2009-03-03 03:01 -!- kushal(~kushal@115.109.10.27) has joined #tux3 2009-03-03 03:01 the btree leaf entries point to them 2009-03-03 03:01 i mean the btree leaves have the block numbers of the buckets 2009-03-03 03:01 ok, so, btree->lock should be lock those too? 2009-03-03 03:01 like data btree (dleaf)? 2009-03-03 03:02 but are the actual data blocks locked when the data btree is locked ? 2009-03-03 03:02 if so then yes...only a btree lock will do 2009-03-03 03:03 ah, buckets is modifying outside of btree? 2009-03-03 03:04 yes 2009-03-03 03:04 the ref counts will be increased 2009-03-03 03:07 ACTION reading the patch a bit 2009-03-03 03:07 i see 2009-03-03 03:07 btree->lock for htree, and sb lock for bucket? 2009-03-03 03:08 yes 2009-03-03 03:08 i see 2009-03-03 03:09 well, or maybe just sb lock in hash_lookup() 2009-03-03 03:09 ok, it needs bucket, not only btree 2009-03-03 03:10 yes....seems so....lets see....first we get the kernel code to compile...then work on locks 2009-03-03 03:11 yes 2009-03-03 03:11 CONFIG_LOCKDEP would help a bit 2009-03-03 03:12 for debugging 2009-03-03 03:12 ok 2009-03-03 03:15 btw, in latest FAST, someone seems to mention about dedup 2009-03-03 03:15 you may have interest it 2009-03-03 03:15 http://www.usenix.org/events/fast09/tech/ 2009-03-03 03:16 ok...will check it out 2009-03-03 04:13 -!- kushal_(~kushal@115.109.13.122) has joined #tux3 2009-03-03 04:25 -!- amey_m(~amey@115.109.13.122) has joined #tux3 2009-03-03 06:46 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-03-03 08:00 -!- dcg(~dcg@75.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-03 08:24 -!- cdk(~chinmay@115.109.15.151) has joined #tux3 2009-03-03 09:06 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-03 10:11 -!- tim_dimm_(~mobile@166.190.24.203) has joined #tux3 2009-03-03 10:32 -!- mingming(~mingming@32.97.110.51) has joined 
#tux3 2009-03-03 12:19 -!- dcg(~dcg@50.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-03 13:17 -!- dcg(~dcg@13.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-03 13:33 -!- ceatinge(~ceatinge@veryclever.net) has joined #tux3 2009-03-03 14:43 hirofumi, this paper: http://www.usenix.org/events/fast09/tech/full_papers/lillibridge/lillibridge_html/index.html 2009-03-03 14:43 seems very relevant 2009-03-03 14:45 since I'm very slow to read english, so I'm not reading it fully 2009-03-03 14:45 however, yes 2009-03-03 14:45 well, optimization in future 2009-03-03 14:46 I will point it out to our Pune Institute students 2009-03-03 14:46 did you look at my nospace patch? 2009-03-03 14:46 a bit 2009-03-03 14:47 reserve base handling? 2009-03-03 14:47 the main idea is, force a flush before a filesystem change begins 2009-03-03 14:48 so that we can use a loose worst case estimate of blocks reserved for update 2009-03-03 14:48 try to wait some extra reserved space if allocation was failed? 2009-03-03 14:48 and the main question is, can a flush be forced without deadlock? 2009-03-03 14:48 not wait 2009-03-03 14:48 but to know the actual space available 2009-03-03 14:48 before each change, we reserve a worst case number of blocks 2009-03-03 14:49 and in my patch, I do not check the number of blocks actually used by the change 2009-03-03 14:49 yes 2009-03-03 14:49 but just let the worst case estimates keep adding up 2009-03-03 14:49 force a flush is what's meaning? 2009-03-03 14:49 so after some time, the worst case estimate is much higher than the space actually required 2009-03-03 14:49 yes 2009-03-03 14:49 force a flush = force a delta transition 2009-03-03 14:50 and wait to free the reserved space? 
2009-03-03 14:50 which changes the predicted block usage to actual block usage 2009-03-03 14:50 no wait 2009-03-03 14:50 after the flush, we know nothing needs to be reserved 2009-03-03 14:50 because everything is already flushed 2009-03-03 14:50 so we set the reserved credits to zero 2009-03-03 14:51 but, it needs to wait to free reserved space (credits) 2009-03-03 14:51 it just needs to wait for the flush to finish 2009-03-03 14:51 the flush will allocate blocks 2009-03-03 14:52 yes, some sort of wait 2009-03-03 14:52 not so long though 2009-03-03 14:52 and so it does not need reserved blocks after the flush, because the actual blocks are already allocated 2009-03-03 14:52 yes 2009-03-03 14:52 I was thinking to just return -ENOSPC 2009-03-03 14:52 the way it flushes and waits right now is freeze_bdev 2009-03-03 14:52 when can't user reserve space 2009-03-03 14:53 it is not correct to return ENOSPC before the flush, because the number of reserved blocks is an over estimate 2009-03-03 14:53 yes 2009-03-03 14:53 but, I guess it's not big problem 2009-03-03 14:53 anyway, freeze_bdev deadlocks in some cases 2009-03-03 14:53 I have changed the patch to see if the deadlock can be avoided 2009-03-03 14:54 the deadlock seems to be on save_inode 2009-03-03 14:55 what is intent of freeze_bdev? 2009-03-03 14:55 just to flush the inodes 2009-03-03 14:55 like our delta transition 2009-03-03 14:56 but we do not have a delta commit right now, so I used freeze_bdev to see if the ENOSPC strategy works 2009-03-03 14:56 the nospace strategy seems to work pretty well 2009-03-03 14:57 here is a deadlock trace: http://www.marcintology.com/dmesgdump 2009-03-03 14:57 except deadlock? 
2009-03-03 14:57 yes 2009-03-03 14:57 yes, it would deadlock 2009-03-03 14:58 it does not always deadlock 2009-03-03 14:58 change_begin() seems to be called too deep position 2009-03-03 14:58 right 2009-03-03 14:58 so I split it into reserve_credits and change_begin 2009-03-03 14:59 and call reserve_credits before taking the btree lock 2009-03-03 14:59 I have not tried that yet 2009-03-03 14:59 there are issues with compiling the code for user space 2009-03-03 14:59 it seems to be called in writepage 2009-03-03 14:59 the source include strategy is causing problems 2009-03-03 15:00 writepage may be wrong 2009-03-03 15:00 should be in begin_write 2009-03-03 15:00 I removed it from write_page already 2009-03-03 15:01 but, dmesgdump seems to be deadlock with writepage recursive 2009-03-03 15:01 what is *pagep = NULL in writepage for? 2009-03-03 15:01 yes 2009-03-03 15:01 writepage? 2009-03-03 15:02 write_begin? 2009-03-03 15:02 tux3_da_write_begin 2009-03-03 15:02 now does reserve_credits 2009-03-03 15:03 write_begin() can be take allocated page, otherwise it's NULL and allocate the page in it 2009-03-03 15:03 ah 2009-03-03 15:03 comment would be nice there :) 2009-03-03 15:03 write_begin() is 2009-03-03 15:04 writepage is using change_begin() 2009-03-03 15:04 yes 2009-03-03 15:04 well, anyway, I think reserved base strategy is right one 2009-03-03 15:05 ext3 uses a different strategy 2009-03-03 15:05 harder to implement 2009-03-03 15:06 what is different with ext3? 
2009-03-03 15:06 it tracks ever buffer dirtied by a transation, using the transaction handle 2009-03-03 15:06 s/ever/every/ 2009-03-03 15:07 we could do that too, but it is a lot of extra parameter passing and we would need to allocate transaction handles 2009-03-03 15:07 it tracks dirtied buffer, but iirc, it reserves journal space in future 2009-03-03 15:07 yes, it makes a reservation, but it tracks actual block allocation for a transation too 2009-03-03 15:08 actually, this is for journal credits 2009-03-03 15:08 because ext3 has to worry about the journal filling up too 2009-03-03 15:08 it has to worry about out of disk blocks, and out of journal space 2009-03-03 15:09 yes 2009-03-03 15:10 it is about ext3_journal_start()? 2009-03-03 15:10 about second parameter 2009-03-03 15:11 yes 2009-03-03 15:11 ok 2009-03-03 15:11 so, the base strategy seems same 2009-03-03 15:11 later, the handle parameter 2009-03-03 15:11 yes 2009-03-03 15:12 I don't think ext3 uses this flush idea to see if it is really out of space 2009-03-03 15:12 I mean reserve space for future, and return -ENOSPC early 2009-03-03 15:12 yes 2009-03-03 15:12 that is necessary 2009-03-03 15:12 there is no alternative I think 2009-03-03 15:13 well, the different is journal space vs disk space 2009-03-03 15:15 that is a big difference 2009-03-03 15:15 the ext3 strategy is so complicated, it is difficult for me to follow it completely 2009-03-03 15:16 which is a complicated point? 2009-03-03 15:17 of course, it's not needed to follow completely 2009-03-03 15:17 transaction handle life cycle 2009-03-03 15:17 ah, yes 2009-03-03 15:17 I guess we don't need it 2009-03-03 15:17 also, possibility of recursive transactions 2009-03-03 15:17 for now, we don't 2009-03-03 15:18 ok 2009-03-03 15:20 have you made changes to user/commit.c ? 
2009-03-03 15:20 well, so, the problem of our reserved space is, it's hard to calc to future possibilty 2009-03-03 15:20 no 2009-03-03 15:20 well, actually, changed, but one line only 2009-03-03 15:20 ok, I will reorganize it 2009-03-03 15:20 ok 2009-03-03 15:20 ok 2009-03-03 15:21 the source included strategy is becoming a more difficult problem 2009-03-03 15:21 we might eliminate the source includes pretty soon 2009-03-03 15:21 it is getting very hard to change the organization, for all the test cases 2009-03-03 15:21 well 2009-03-03 15:21 ok 2009-03-03 15:21 source includes work pretty well for the kernel fiels 2009-03-03 15:21 files 2009-03-03 15:22 but the user files including each other is a problem 2009-03-03 15:22 I guess it's commit.c only 2009-03-03 15:22 yes 2009-03-03 15:22 change_begin and friends 2009-03-03 15:22 it is because we have different implementation in user and kernel right now 2009-03-03 15:23 include problem is against user and kernel? 2009-03-03 15:24 it's not a problem for kernel 2009-03-03 15:24 include user/* from user/*? 2009-03-03 15:25 rignt now, the problem is compiling user/commit.c 2009-03-03 15:26 I guess we should use atomic commit entirely 2009-03-03 15:26 yes 2009-03-03 15:27 I am moving it into kernel right now 2009-03-03 15:27 ok 2009-03-03 15:28 back to reserved credits 2009-03-03 15:29 strategy is reserve by frontend and backend replace with actual balloc? 2009-03-03 15:30 yes 2009-03-03 15:31 and frontend will wait on delta commit if credits is greater that freeblocks 2009-03-03 15:31 um... 2009-03-03 15:32 why do frontend wait it? 2009-03-03 15:33 well, I guess we may be reserving too many blocks 2009-03-03 15:33 exactly 2009-03-03 15:34 it is because I was too lazy to calculate exactly the blocks used by the change 2009-03-03 15:34 but, why don't we fix too many, instead of wait? 2009-03-03 15:34 fix too many? 
2009-03-03 15:35 reduce too many reserved blocks, to near of actual needed blocks 2009-03-03 15:35 don't reserve too many blocks 2009-03-03 15:36 I mean we should calc near of exactly blocks in future 2009-03-03 15:36 even if the reservation is very accurate, the error accumulates over many changes 2009-03-03 15:37 well, after every delta transition we can reset the error 2009-03-03 15:37 so this will work pretty well 2009-03-03 15:38 yes, so, if it's enough small, I feel we can just return -ENOSPC 2009-03-03 15:38 if we do not deadlock waiting for the delta when we think we might be ENOSPC, but do not know for sure 2009-03-03 15:38 well 2009-03-03 15:38 returning ENOSPC when space is available is not good 2009-03-03 15:39 if it is a difference of 10 blocks on a million block volume, that is ok 2009-03-03 15:39 or 100 blocks 2009-03-03 15:39 but 10,000 block error on a million block volume is not ok 2009-03-03 15:39 yes 2009-03-03 15:40 we can also have a 100 block volume (floppy disk) 2009-03-03 15:40 so, it should be 100 or so 2009-03-03 15:40 then we must be accurate to within a block or two 2009-03-03 15:40 of course nobody cares much about floppy disks 2009-03-03 15:41 but somebody might put such a small disk in a consumer device 2009-03-03 15:41 well, enough small percentage 2009-03-03 15:41 yes 2009-03-03 15:42 even on a 500 GB disk, I have seen users complain when a few megabytes is not available when they think it should be 2009-03-03 15:42 that is because many users run normally with the disk 99% full 2009-03-03 15:43 I guess many users don't know about 5% reserved blocks on ext* 2009-03-03 15:43 for root 2009-03-03 15:43 yes 2009-03-03 15:43 if they did, they would just run as root, then complain 2009-03-03 15:43 users always want more ;) 2009-03-03 15:44 anyway, that space is there to allow the administrator to clean up a full disk 2009-03-03 15:44 otherwise it is a kind of deadlock 2009-03-03 15:45 if the user does some copies as part of the cleanup, and
the disk goes ENOSPC before using all of the 5%, the administrator will be mad 2009-03-03 15:45 on tux3, I guess fs should reserve some blocks 2009-03-03 15:45 not for root 2009-03-03 15:46 yes 2009-03-03 15:46 well, it's reserved for root so that normal user applications do not use of the space the administrator needs 2009-03-03 15:46 yes, admin issue 2009-03-03 15:46 the ext3 5% is not to avoid problems in the ENOSPC calculation :) 2009-03-03 15:47 it's better to have an algorithm that goes ENOSPC exactly when all space is used 2009-03-03 15:47 in practice, that is not possible, because of the relationship between metadata and data 2009-03-03 15:47 but we can get very close, normally within 1-5 blocks 2009-03-03 15:48 by the way, this gets more complicated when we get to versioning 2009-03-03 15:49 and also, delta commit 2009-03-03 15:49 more complex reservation would be able to return -ENOSPC 2009-03-03 15:49 where certain blocks cannot be freed before the delta has committed 2009-03-03 15:49 yes 2009-03-03 15:49 so, in general, we cannot free any of the blocks of the previous delta before committing the new one 2009-03-03 15:50 that means: as freespace gets close to zero, delta has to get very small 2009-03-03 15:50 similar to what I have implemented in the current nospace patch 2009-03-03 15:51 um... 2009-03-03 15:52 well, for example, we can add the reserved space for each objects 2009-03-03 15:53 log, btree, file data blocks 2009-03-03 15:53 in that case, we will add reserved space by operations 2009-03-03 15:54 yes 2009-03-03 15:54 e.g. create the new file 2009-03-03 15:54 currently, I just did that calculation inaccurately in my head 2009-03-03 15:54 but sometimes it is not possible to know 2009-03-03 15:54 for example, will adding a new data block cause a dleaf split? 
2009-03-03 15:55 it adds to split space to btree 2009-03-03 15:55 we can't just check the dleaf, because there might be other changes in the same delta that add to the same dleaf 2009-03-03 15:55 so, we normally have to make a worst case guess that the dleaf will always be split, and that every node in the btree will be split too, and the same for the inode update 2009-03-03 15:57 so we will often need to make a worst case guess for the reservation that is 5 - 10 times higher than the actual usage 2009-03-03 15:57 yes 2009-03-03 15:58 I thought, if we add the reserved space to objects, it wouldn't be counted twice 2009-03-03 15:58 if object has space already, it doesn't need to space anymore 2009-03-03 15:59 well, it seems to complex 2009-03-03 15:59 yes, that is what I meant by complex 2009-03-03 15:59 ext3 does things like that 2009-03-03 16:00 I feel the another strategy now 2009-03-03 16:00 I guess we need to reserve the space for fs 2009-03-03 16:00 the current strategy is much more simple, if the deadlock can be solved 2009-03-03 16:00 I guess it shouldn't be deadlock 2009-03-03 16:01 well, because I think we need to space to remove the files 2009-03-03 16:02 ok, user space is finally compiling again 2009-03-03 16:02 I guess fs is really full, user can't remove file anymore 2009-03-03 16:02 and now kernel needs to be fixed... 2009-03-03 16:03 then I need to split the patch to reorganize user/commit.c from the nospace patch 2009-03-03 16:03 the user needs to be able to remove files 2009-03-03 16:03 yes 2009-03-03 16:04 so we have to reserve enough space to allow rm 2009-03-03 16:04 I already made that change 2009-03-03 16:04 we will need to improve it 2009-03-03 16:04 marcin already hit that problem yesterday in my first version of the nospace patch 2009-03-03 16:05 well, so, if fs needs reserve space for rm, just return -ENOSPC? 2009-03-03 16:06 um.. it doesn't sounds good 2009-03-03 16:06 um... 
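The worst-case guess described above (the dleaf always splits, every btree node above it splits, and the same again for the inode update) can be put into rough numbers. This is only a sketch under stated assumptions: `btree_split_worst` and `worst_case_credits` are hypothetical names, the depths are passed in by hand, and the accounting ignores log blocks and any block redirection, so it is not the formula used by the nospace patch.

```c
#include <assert.h>

/*
 * Hypothetical helper: blocks newly allocated if a leaf and every index
 * level above it split. `index_levels` counts the index levels including
 * the root. One new sibling per index level, one new leaf, and one new
 * root because the old root split too.
 */
static long btree_split_worst(int index_levels)
{
	assert(index_levels >= 0);
	return (long)index_levels + 2;
}

/*
 * Hypothetical worst-case credit estimate for adding `data_blocks` to one
 * file: the data itself, a full split of the file's data btree, and a
 * full split of the inode table btree for the inode update.
 */
static long worst_case_credits(long data_blocks, int dtree_levels,
			       int itable_levels)
{
	return data_blocks
		+ btree_split_worst(dtree_levels)
		+ btree_split_worst(itable_levels);
}
```

With one data block, a two-level data btree and a three-level inode table, this gives 10 credits for a change that usually consumes a single block, which is in line with the "5 - 10 times higher than the actual usage" remark.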
2009-03-03 16:07 correct thing to do is have a special reserve for rm, which is included in the check for non-rm changes 2009-03-03 16:07 again, this reserve has to be large when a delta is large 2009-03-03 16:07 and can be just a few blocks when a delta is restricted to one change 2009-03-03 16:07 that is, the rm 2009-03-03 16:08 yes 2009-03-03 16:08 we reserve file blocks 2009-03-03 16:09 and reserve space is lager than one delta 2009-03-03 16:09 metadata blocks in one detla 2009-03-03 16:09 large than the previous delta anyway 2009-03-03 16:09 well 2009-03-03 16:09 it's complicated :) 2009-03-03 16:10 larger than the current delta, yes 2009-03-03 16:10 this also gets more interesting with multiple deltas in the pipeline 2009-03-03 16:10 it's a pretty important part of a filesystem 2009-03-03 16:11 with snapshots, rm does not necessarily free space 2009-03-03 16:11 then it may be necessary to free a snapshot 2009-03-03 16:12 anyway, I will commit the commit.c reorganization, and post the new revision of the nospace patch for review 2009-03-03 16:12 ok 2009-03-03 16:12 and maybe let marcin try it, he is good at hitting the deadlock 2009-03-03 16:12 I did not hit it 2009-03-03 16:12 probably have to run with less memory 2009-03-03 16:13 sk8 oclock 2009-03-03 16:13 ok 2009-03-03 16:59 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-03-03 18:15 hirofumi, still there? 2009-03-03 19:06 -!- chesse_(~eworm@dslb-084-062-174-082.pools.arcor-ip.net) has joined #tux3 2009-03-03 20:06 -!- chesse(~eworm@dslb-084-062-160-114.pools.arcor-ip.net) has joined #tux3 2009-03-03 20:23 uml is hanging sometimes on boot 2009-03-03 20:23 Setting the System Clock using the Hardware Clock as reference... 
2009-03-03 20:23 IRQ 13/xterm: IRQF_DISABLED is not guaranteed on shared IRQs 2009-03-03 20:23 then nothing 2009-03-03 20:24 annoyingly, it always boots under gdb 2009-03-03 20:25 then after booting under gdb, it will boot by itself 2009-03-03 20:25 some uninitialized memory thing seems like 2009-03-03 20:26 -!- konrad(~konrad@D-128-208-53-188.dhcp4.washington.edu) has joined #tux3 2009-03-03 20:49 -!- cdk(~chinmay@115.109.11.94) has joined #tux3 2009-03-03 21:38 -!- konrad(~konrad@D-128-208-53-188.dhcp4.washington.edu) has joined #tux3 2009-03-03 21:52 -!- konrad(~konrad@D-128-208-53-188.dhcp4.washington.edu) has joined #tux3 2009-03-03 22:54 new version of nospace patch going into testing now 2009-03-03 22:54 hopefully resolves yesterday's deadlock 2009-03-03 22:55 well it still functions on the small partition 2009-03-03 22:58 well it ran on the big partition too 2009-03-03 22:58 including one iteration of freeze_bdev 2009-03-03 22:58 marcin, around? 2009-03-03 22:58 ACTION didn't think so 2009-03-03 22:58 ok, I'll post it 2009-03-03 23:29 -!- ned(~ned@c-76-19-208-96.hsd1.ma.comcast.net) has joined #tux3 2009-03-03 23:32 -!- konrad(~konrad@D-128-208-53-188.dhcp4.washington.edu) has joined #tux3 2009-03-03 23:36 -!- RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3 2009-03-04 00:11 -!- cdk(~chinmay@115.109.11.94) has joined #tux3 2009-03-04 01:44 ok, slightly different lockup now 2009-03-04 01:45 hmm, maybe just a slight mistake 2009-03-04 01:45 let's see 2009-03-04 01:46 btw, the order seems strange 2009-03-04 01:46 reserve -> change_begin() 2009-03-04 01:46 but, I think it should be 2009-03-04 01:46 change_begin() -> reserve 2009-03-04 01:47 because, reserve space depends on delta counter 2009-03-04 01:49 hmm 2009-03-04 01:50 simple race is 2009-03-04 01:50 could you explain more?
2009-03-04 01:50 reserve space -> change_begin() -> wait delta 2009-03-04 01:51 on another process, change_begin() -> change_end() -> commit -> clear reserve 2009-03-04 01:51 I guess this clear is including first process's reservation 2009-03-04 01:52 my intuition is that the reserve can be done outside the change_begin 2009-03-04 01:52 but intuition is not always right 2009-03-04 01:54 cpu1 cpu2 2009-03-04 01:54 2009-03-04 01:54 2009-03-04 01:54 2009-03-04 01:54 2009-03-04 01:54 change_begin() 2009-03-04 01:54 change_begin() 2009-03-04 01:54 change_end() 2009-03-04 01:54 stage_delta() 2009-03-04 01:54 2009-03-04 01:54 commit_delta() 2009-03-04 01:54 change_end() 2009-03-04 01:55 I'm thinking this race 2009-03-04 01:55 adds space to sb->xxxx 2009-03-04 01:56 I'm thinking about it 2009-03-04 01:56 ACTION thinks slowly 2009-03-04 01:56 sort of like moss growing sometimes 2009-03-04 01:56 ok, there was a mistake in my patch 2009-03-04 01:56 trying again 2009-03-04 01:57 well, but, I'm thinking the intuition is important 2009-03-04 01:58 so, I say "add credit" when a file change starts 2009-03-04 01:58 (it's not deadlocking this time, at least not as soon) 2009-03-04 01:58 good 2009-03-04 01:58 freeze_bdev is locking unneeded lock 2009-03-04 01:59 it's doing a lot of freeze_bdevs 2009-03-04 01:59 so, it might be deadlock 2009-03-04 01:59 because I set the partition size just a little too small 2009-03-04 01:59 however, it doesn't matter 2009-03-04 01:59 it's expected 2009-03-04 01:59 whoops, segfault 2009-03-04 01:59 ok, back to thinking 2009-03-04 02:00 segfault is in __set_page_dirty 2009-03-04 02:00 a little strange 2009-03-04 02:00 ah 2009-03-04 02:00 in _tux_create_entry\ 2009-03-04 02:01 ok, back to your race 2009-03-04 02:01 btw, generic_sync_sb_inodes() would be more prefer for testing 2009-03-04 02:02 ah 2009-03-04 02:02 much better 2009-03-04 02:02 I'll use it 2009-03-04 02:02 but the freeze_bdev seems to work, to my surprise 2009-03-04 02:03 good 2009-03-04 
02:03 I really expected it to deadlock 2009-03-04 02:03 well, the deadlock appears to be gone but I have an oops in dirent create 2009-03-04 02:04 so, I could change to generic_sync_sb_inodes now, or I could keep the freeze_bdev, to try to track down the oops 2009-03-04 02:04 I suppose I will make it an ifdef 2009-03-04 02:05 and try the generic_sync now, see how it affects performance 2009-03-04 02:05 should be better 2009-03-04 02:07 um.. __set_page_dirty 2009-03-04 02:09 ok, the resolution to your race above... 2009-03-04 02:10 yes 2009-03-04 02:10 reserve_credits establishes whether a transaction may proceed, if there is insufficient disk space then it will force a delta transition 2009-03-04 02:10 inside the reserve 2009-03-04 02:10 so we do not want that to happen inside change_begin 2009-03-04 02:10 I think 2009-03-04 02:10 well 2009-03-04 02:10 ah 2009-03-04 02:10 it will probably work that way too 2009-03-04 02:11 however, since reserve_credits can be outside the btree_lock, it makes some kind of sense 2009-03-04 02:11 I am not completely sure about this :) 2009-03-04 02:11 that is why there is a prototype for discussion 2009-03-04 02:12 I was thinking immidiately after change_begin 2009-03-04 02:12 if it can't reserve... 2009-03-04 02:12 yes 2009-03-04 02:12 well, but why should it be part of change_begin? 
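The behaviour described above for reserve_credits (decide whether a transaction may proceed, and force a delta transition before concluding the disk is really full) might look roughly like this in user space. It is a sketch, not the patch: `struct sb_sketch`, `flush_delta` and the `allocated_by_flush` test hook are made-up stand-ins, and there is no locking, which the real code would need.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the superblock fields discussed in the log. */
struct sb_sketch {
	long freeblocks;         /* blocks actually free on disk */
	long credits;            /* overestimate reserved by pending changes */
	long allocated_by_flush; /* test hook: blocks the next flush really uses */
};

/*
 * Stand-in for forcing a delta transition: the real allocations replace
 * the overestimate, so the credit count drops back to zero.
 */
static void flush_delta(struct sb_sketch *sb)
{
	sb->freeblocks -= sb->allocated_by_flush;
	sb->allocated_by_flush = 0;
	sb->credits = 0;
}

/*
 * Reserve `want` credits. If the overestimate says we might be out of
 * space, flush first and check again against the true free count, so
 * that -ENOSPC is returned only when the space is really not there.
 */
static int reserve_credits(struct sb_sketch *sb, long want)
{
	assert(want >= 0);
	if (sb->credits + want > sb->freeblocks) {
		flush_delta(sb);
		if (want > sb->freeblocks)
			return -ENOSPC;
	}
	sb->credits += want;
	return 0;
}
```

Because the flush happens inside the reservation, a caller that invokes this before change_begin never waits on a delta while holding whatever change_begin takes, which seems to be the motivation given in the log for keeping the two checks separate.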
2009-03-04 02:13 it is testing a different limit 2009-03-04 02:13 limit of disk blocks, rather than limit of cache memory 2009-03-04 02:13 so if it is convenient to do those two different kinds of tests in different places, that is ok 2009-03-04 02:13 disk blocks is yes 2009-03-04 02:13 however, reserve space is not disk space 2009-03-04 02:14 it is disk space 2009-03-04 02:14 it clears per-delta 2009-03-04 02:14 not cache 2009-03-04 02:14 it just clears the overestimate 2009-03-04 02:14 credits is an overestimate 2009-03-04 02:14 yes 2009-03-04 02:14 we might want to make credits per delta 2009-03-04 02:14 I have not thought about that very much 2009-03-04 02:14 yes, it can per delta 2009-03-04 02:15 I thought, try to handle nospace without thinking about deltas, first 2009-03-04 02:15 well, so, it is why I'm thinking it is under delta_lock 2009-03-04 02:15 I guess it is unrelated to deltas 2009-03-04 02:16 but, reservation implement is relating to delta 2009-03-04 02:16 we can develop our thinking about it as we go 2009-03-04 02:16 now, I need a struct writeback_control for generic_sync 2009-03-04 02:16 I guess I have to invent it 2009-03-04 02:16 struct writeback_control wbc = { 2009-03-04 02:17 .sync_mode = wait ? WB_SYNC_ALL : WB_SYNC_NONE, 2009-03-04 02:17 .range_start = 0, 2009-03-04 02:17 .range_end = LLONG_MAX, 2009-03-04 02:17 }; 2009-03-04 02:17 :) 2009-03-04 02:17 thankyou 2009-03-04 02:17 whoops 2009-03-04 02:17 wait is 1 2009-03-04 02:17 so, WB_SYNC_ALL 2009-03-04 02:17 it will sync all, and wait 2009-03-04 02:18 and .range_start can be implicitly initialized to zero 2009-03-04 02:19 yes 2009-03-04 02:19 I was forgetting about ->nr_to_write 2009-03-04 02:20 suggestion for that? 2009-03-04 02:21 LONG_MAX 2009-03-04 02:24 I need to set up my test machine to autologin 2009-03-04 02:24 trying the new version 2009-03-04 02:25 inittab? 
2009-03-04 02:25 use bash or something instead of login 2009-03-04 02:26 I have it set up on my main workstation 2009-03-04 02:26 customized init 2009-03-04 02:26 inittty 2009-03-04 02:26 getty 2009-03-04 02:26 getty -l /bin/bash? 2009-03-04 02:28 let me see 2009-03-04 02:28 -!- konrad(~konrad@D-128-208-53-188.dhcp4.washington.edu) has joined #tux3 2009-03-04 02:28 rungetty 2009-03-04 02:28 customized gettty 2009-03-04 02:28 supports autologin 2009-03-04 02:28 oh 2009-03-04 02:28 I could try yours above 2009-03-04 02:29 ok, the generic_sync_ version ran about the same, also segfaulted the same 2009-03-04 02:29 the segfault happens when space runs out 2009-03-04 02:30 I guess my reserve full report is not passed back to the application 2009-03-04 02:30 let's see wy 2009-03-04 02:30 this time the segfault is in _tux_create_entry 2009-03-04 02:31 I will post the revised patch, and the description of how I run the test 2009-03-04 02:31 ok 2009-03-04 02:31 and the oops 2009-03-04 02:31 because it's getting late here 2009-03-04 02:31 I think this is progress 2009-03-04 02:36 well I see why the ENOSPC didn't get passed back 2009-03-04 02:36 so I will try again ;) 2009-03-04 02:36 ok :) 2009-03-04 02:54 here we go again 2009-03-04 02:54 ok 2009-03-04 02:56 there are some periods of no disk activity 2009-03-04 02:56 hmm 2009-03-04 02:56 now there is a long period of dead :) 2009-03-04 02:58 yes 2009-03-04 02:58 oh a recursive generic_sync, that can't be good 2009-03-04 02:58 we don't need to wait writeout actually 2009-03-04 02:59 true 2009-03-04 02:59 we just need the flush, to allocate 2009-03-04 02:59 but, there is no convinient way 2009-03-04 02:59 we should be doing this with our own code 2009-03-04 02:59 yes 2009-03-04 02:59 but it is useful to see how far the generic kernel takes us 2009-03-04 03:00 now, which pastebin site should I use? 
2009-03-04 03:00 I can paste the traceback 2009-03-04 03:00 of course, it isn't much use without the patch too 2009-03-04 03:00 so I should post both to our list 2009-03-04 03:01 it's save_inode? 2009-03-04 03:01 it is 2009-03-04 03:01 how did you know? 2009-03-04 03:01 just read a path 2009-03-04 03:01 tux3_write_inode 2009-03-04 03:02 I am now sure I should even be checking credits there 2009-03-04 03:02 save_inode is path of sync_inodes_sb 2009-03-04 03:02 because the credits should already be reserved by the namei operation 2009-03-04 03:02 so... 2009-03-04 03:02 I will remove that check 2009-03-04 03:02 yes 2009-03-04 03:05 I am very glad to have such a reliable filesystem as ext3, for all the reboots I am doing 2009-03-04 03:05 "unplanned interruption" 2009-03-04 03:05 next test... 2009-03-04 03:06 I don't like how the disk light goes out for long periods, I suspect that is not the fault of tux3 2009-03-04 03:06 yes 2009-03-04 03:07 yes 2009-03-04 03:07 it waits the blocks is not efficient order 2009-03-04 03:09 it completed :) 2009-03-04 03:09 with an error message 2009-03-04 03:09 error message? 
2009-03-04 03:10 "no space left on device" 2009-03-04 03:10 as it should 2009-03-04 03:10 because I made the partition just a little too small 2009-03-04 03:10 oh, good 2009-03-04 03:17 another cycle, then post it 2009-03-04 03:18 clean up a little 2009-03-04 03:18 ok 2009-03-04 03:33 posted 2009-03-04 03:34 thaw_bdev is remaining 2009-03-04 03:34 I will try it with WB_SYNC_NONE 2009-03-04 03:35 oops :) 2009-03-04 03:35 that was a quick catch 2009-03-04 03:35 btw, WB_SYNC_NONE doesn't garantee to flush all dirty buffers 2009-03-04 03:35 iirc 2009-03-04 03:36 it's a good thing we will write our own :) 2009-03-04 03:36 yes 2009-03-04 03:37 in fact now would be a good time to write it 2009-03-04 03:37 yes, delta flush would be our own 2009-03-04 03:37 even before delta commit 2009-03-04 03:37 it can 2009-03-04 03:37 write a _sync_inodes_ just the way we want it 2009-03-04 03:37 for this application 2009-03-04 03:38 well, I'll think about to-make-dirty-buffer side though for now 2009-03-04 03:38 ok, I will try it with _NONE and see what happens 2009-03-04 03:38 ok 2009-03-04 03:38 this shuttle is a very nice test machine 2009-03-04 03:38 it has two big disks in it, and it boots fast 2009-03-04 03:38 cpu is good enough for kernel compiles 2009-03-04 03:39 needs more cores though 2009-03-04 03:39 oh, that didn't work at all 2009-03-04 03:39 failed with bogus out of space 2009-03-04 03:39 so now we know that _NONE really means none 2009-03-04 03:39 seems a little pointless 2009-03-04 03:41 _NONE is if inode is busy, it will ignore 2009-03-04 03:42 well they they must all have been busy 2009-03-04 03:42 because it didn't flush anything at all 2009-03-04 03:42 this busy means I/O is congested or inode/pages is locked 2009-03-04 03:43 the state can't start to submit right now 2009-03-04 03:44 what is reduce_begin()? 2009-03-04 03:44 for changes that are supposed to free space 2009-03-04 03:44 that is, reduce usage 2009-03-04 03:44 like rm 2009-03-04 03:45 um... 
2009-03-04 03:45 which has a slightly lower theshold than mknod 2009-03-04 03:45 for ENOSPC 2009-03-04 03:45 however, free space is available after commit? 2009-03-04 03:47 ? 2009-03-04 03:48 rm is actually reserving the space 2009-03-04 03:48 true 2009-03-04 03:48 it should probably not reserve space at all 2009-03-04 03:48 why? 2009-03-04 03:49 well 2009-03-04 03:49 I was thinking, a lot of rms would reserve a lot of space 2009-03-04 03:49 but that is ok 2009-03-04 03:49 it will be handled in the same way, by doing a flush before reporting nospace 2009-03-04 03:50 ah 2009-03-04 03:50 so, we tell, this can use special disk space 2009-03-04 03:50 use 2009-03-04 03:51 a tiny bit 2009-03-04 03:51 the difference is only 8 blocks 2009-03-04 03:51 um.. 2009-03-04 03:52 to make sure, the term for now, credicts - sb->credits, reserve is disk space 2009-03-04 03:53 credits is sb->credits 2009-03-04 03:53 reserve is a verb 2009-03-04 03:53 credits is a noun 2009-03-04 03:53 ah 2009-03-04 03:53 well 2009-03-04 03:53 so, credits are the unit of measurement 2009-03-04 03:53 that term comes from ext3 2009-03-04 03:53 as in, journal credits 2009-03-04 03:54 well, I mean just to talk about it with some term 2009-03-04 03:54 yes 2009-03-04 03:54 credits 2009-03-04 03:54 I want to talk 2009-03-04 03:54 ok 2009-03-04 03:54 credits, yes 2009-03-04 03:54 and free blocks 2009-03-04 03:54 and I want to talk about the reserved space on disk too 2009-03-04 03:55 reserved blocks on disk 2009-03-04 03:55 there isn't any reserved space on disk yet 2009-03-04 03:55 yes 2009-03-04 03:55 however, it's needed for rm 2009-03-04 03:55 ok 2009-03-04 03:55 well it is implied by the smaller credit number for reduce_begin 2009-03-04 03:56 but we should make a sb variable to define that, probably 2009-03-04 03:56 or at least a define 2009-03-04 03:56 I guess the rm should have some credits 2009-03-04 03:57 it does 2009-03-04 03:57 it seems 0 2009-03-04 03:57 sb->margin 2009-03-04 03:57 the "safety 
margin" 2009-03-04 03:58 this margin is handled in reserve_credits 2009-03-04 03:58 so that is your "reserved" I think 2009-03-04 03:59 margin is used for all transaction? 2009-03-04 04:00 I'm thinking about the margin for rm only 2009-03-04 04:00 yes, for all 2009-03-04 04:01 so the margin for rm is sb->minchange 2009-03-04 04:01 not really very clear :) 2009-03-04 04:01 but then, these ideas are being developed now 2009-03-04 04:04 btw, I'm not understanding a bit about sb->margin 2009-03-04 04:04 margin is 100 or -100 2009-03-04 04:04 so I have a new measurement: time to create all the files when there is plenty of space, vs when the flushes are repeatedly called 2009-03-04 04:05 it takes twice as long to do all the creates, with the flushing 2009-03-04 04:05 so this needs investigation 2009-03-04 04:06 ah 2009-03-04 04:06 the -margin is a stupid hack 2009-03-04 04:07 it says "a flush is in progress, so just succeed, assuming that the caller that sees -margin is doing the flush" 2009-03-04 04:07 this is not very good code ;) 2009-03-04 04:07 in fact, it's horrible code 2009-03-04 04:07 I should have kept the hack.nospace name 2009-03-04 04:08 well, my idea is tux3 have reserved space 2009-03-04 04:08 it would be like sb->margin probably 2009-03-04 04:08 yes, something like that is needed 2009-03-04 04:09 but, it's sb->freeblocks - sb->margin for normal operation 2009-03-04 04:09 and for rm, sb->freeblocks 2009-03-04 04:09 to know free blocks count 2009-03-04 04:10 and rm and normal operations tell credits which is needed 2009-03-04 04:11 I'm not sure, the patch is intenting it or not 2009-03-04 04:12 one observation: all we want for the flush is a stage_delta 2009-03-04 04:12 or, not even the full stage_delta 2009-03-04 04:12 we just need space allocated for dirty inode pages 2009-03-04 04:13 yes 2009-03-04 04:13 and inode table blocks 2009-03-04 04:13 I think that's it 2009-03-04 04:13 oh, and the log blocks 2009-03-04 04:13 that's easy 2009-03-04 04:14 that's 
just a number 2009-03-04 04:14 it means credits is replaced by freeblocks on stage_delta? 2009-03-04 04:14 I think credits usage stays the same 2009-03-04 04:14 just the flush improves 2009-03-04 04:14 to be less heavyweight 2009-03-04 04:15 it seems, generic_sync is very inefficient 2009-03-04 04:15 yes 2009-03-04 04:15 it must do a lot of synchronous waiting 2009-03-04 04:15 it is for "sync" 2009-03-04 04:15 and not efficient enough 2009-03-04 04:15 well, then it seems sync could be faster than it is 2009-03-04 04:15 of course 2009-03-04 04:15 this only affects filesystems like ext2 and fat 2009-03-04 04:15 yes 2009-03-04 04:16 well, probably, so many users is not using "sync" in real life 2009-03-04 04:16 so, it didn't become the problem 2009-03-04 04:16 well 2009-03-04 04:17 replace by freeblocks means 2009-03-04 04:17 reduce credits durning stage_delta 2009-03-04 04:17 and credits will be cleared at end of stage_delta 2009-03-04 04:18 well, if it is complex, it is unnecessary 2009-03-04 04:19 I noticed, probably it is not so simple than I was thinking 2009-03-04 04:20 yes, something like that 2009-03-04 04:20 no, it's not simple :) 2009-03-04 04:20 it's pretty challenging 2009-03-04 04:20 see all the code to handle it in ext3 2009-03-04 04:20 it is much simple, if credit is fixed number 2009-03-04 04:20 and btrfs guys are still having trouble with it after a couple of years 2009-03-04 04:20 yes 2009-03-04 04:21 btrfs seems to be working free space issue now 2009-03-04 04:21 yes, now they have started the serious work on it 2009-03-04 04:21 ah, it is simple 2009-03-04 04:22 add credit with some number bigger than actually needed 2009-03-04 04:22 as is done now 2009-03-04 04:22 and reduce it actually used by balloc() 2009-03-04 04:22 and extra credit will be cleared by end of stage_delta() 2009-03-04 04:22 but we don't see the balloc until stage_delta 2009-03-04 04:23 yes 2009-03-04 04:23 and even then, it is not so simple to do all that accounting 2009-03-04 
04:23 but not very hard either 2009-03-04 04:23 anyway, we know after stage delta that all credits have been cleared 2009-03-04 04:24 yes 2009-03-04 04:24 some thoughts is needed though 2009-03-04 04:24 so, by tracking the actual ballocs, we just reduce the size of the overestimate 2009-03-04 04:24 yes 2009-03-04 04:24 about locking or something 2009-03-04 04:24 yes 2009-03-04 04:24 it is just an optimization 2009-03-04 04:24 I'm happy that the current patch runs without deadlocking 2009-03-04 04:24 to reduce to wait flush 2009-03-04 04:24 well, I have not run it under memory pressure 2009-03-04 04:25 current patch seems to work under memory pressue too 2009-03-04 04:25 yes, it is probably a worthwhile optimization 2009-03-04 04:25 however, there is no locking 2009-03-04 04:25 it's pretty sloppy 2009-03-04 04:25 it needs atomics too 2009-03-04 04:25 yes 2009-03-04 04:25 or mb 2009-03-04 04:26 I like mb more, it does not mess up the source code as much 2009-03-04 04:26 atomic_* is pretty ugly 2009-03-04 04:26 probably, it would be needed the lock 2009-03-04 04:26 something better than -margin too 2009-03-04 04:26 and per-delta 2009-03-04 04:27 since we have to write the inode flush anyway, I think it is logical to do that next 2009-03-04 04:28 that is, force map_region for all dirty inode data and start writeout 2009-03-04 04:28 btw, about locking for counter, it would be per-cpu counter 2009-03-04 04:29 yes, it could be per-cpu 2009-03-04 04:29 since it is an overestimate, that is reasonable 2009-03-04 04:29 it is more code though 2009-03-04 04:29 an optimization for later 2009-03-04 04:29 yes 2009-03-04 04:31 if flush is replaced by stage/commit_delta, it would need to increment delta counter 2009-03-04 04:32 and need to lock delta_lock 2009-03-04 04:32 true 2009-03-04 04:32 but it is ok to do that outside the normal change_begin/end 2009-03-04 04:32 it might be ok to always do this check as part of change_begin/end too 2009-03-04 04:33 I don't know yet 2009-03-04 
04:33 I guess sb->credit = 0 can't without serialize delta pipeline 2009-03-04 04:33 something smarter is needed there 2009-03-04 04:33 yes 2009-03-04 04:34 maybe an array of credit indexed by lower bits of delta 2009-03-04 04:44 yes, something like it 2009-03-04 05:12 oyasumi time (or long past it) 2009-03-04 05:12 oyasumi 2009-03-04 06:15 -!- cdk(~chinmay@115.109.10.31) has joined #tux3 2009-03-04 06:41 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-03-04 07:35 -!- gaurav(~gaurav@59.95.27.40) has joined #tux3 2009-03-04 07:50 -!- amey(~amey@socks.wantstofly.org) has joined #tux3 2009-03-04 07:50 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-03-04 09:28 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-03-04 09:28 -!- amey_m(~amey@117.195.38.10) has joined #tux3 2009-03-04 10:18 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-03-04 11:10 -!- tim_dimm(~timothyhu@rrcs-64-183-50-58.west.biz.rr.com) has joined #tux3 2009-03-04 12:39 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-04 12:53 -!- dcg(~dcg@188.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-04 13:54 -!- tim_dimm(~timothyhu@rrcs-64-183-50-58.west.biz.rr.com) has joined #tux3 2009-03-04 14:05 -!- Man_of_Wax(~wax@gualtiero.cs.unibo.it) has joined #tux3 2009-03-04 15:10 1:2345:respawn:/sbin/rungetty --autologin root tty1 2009-03-04 15:10 how's that for insecure? 2009-03-04 15:10 works great 2009-03-04 15:15 heh, what are you trying to do? 2009-03-04 15:19 not have to type in id and password every time I reboot my test machine 2009-03-04 15:19 yea, that'd do it 2009-03-04 15:20 alright, lemme cook up some dinner, and we'll get to testing your new patch, cool? 
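The suggestion made earlier in this log, "an array of credit indexed by lower bits of delta", could be sketched as follows. Everything here is an assumption for illustration: the names, the slot count of 4 and the absence of locking or per-cpu counters are not taken from the tux3 source.

```c
#include <assert.h>

#define DELTA_SLOTS 4                 /* assumed pipeline depth, power of two */
#define DELTA_MASK  (DELTA_SLOTS - 1)

/* Hypothetical per-delta credit counters, one slot per in-flight delta. */
struct credit_ring {
	long credits[DELTA_SLOTS];
};

/* Charge an overestimate of `n` blocks to the delta a change belongs to. */
static void add_credits(struct credit_ring *ring, unsigned delta, long n)
{
	assert(n >= 0);
	ring->credits[delta & DELTA_MASK] += n;
}

/*
 * Once a delta is staged its real allocations are known, so only that
 * delta's overestimate is dropped. Later deltas keep their credits.
 */
static void clear_delta_credits(struct credit_ring *ring, unsigned delta)
{
	ring->credits[delta & DELTA_MASK] = 0;
}

/* Total outstanding overestimate across all deltas in the pipeline. */
static long total_credits(const struct credit_ring *ring)
{
	long sum = 0;
	for (int i = 0; i < DELTA_SLOTS; i++)
		sum += ring->credits[i];
	return sum;
}
```

The point of the indexing is that clearing one slot does not require serializing the delta pipeline, which is the problem raised with a single `sb->credit = 0`. It only works if no more than `DELTA_SLOTS` deltas are ever in flight at once.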
2009-03-04 15:20 cool 2009-03-04 15:51 -!- tim_dimm(~timothyhu@rrcs-64-183-50-58.west.biz.rr.com) has joined #tux3 2009-03-04 16:01 ok so what am i applying to what 2009-03-04 16:14 marcin, sync from the latest mercurial, then apply the latest nospace patch 2009-03-04 16:15 http://mailman.tux3.org/pipermail/tux3/2009-March/000760.html 2009-03-04 16:16 my results: this patch is stable, doesn't lock up and does report out of space without corruption, but it costs some performance when the filesystem is near full 2009-03-04 16:16 in other words, a good start 2009-03-04 16:47 it's not applying cleanly :/ 2009-03-04 16:48 whoops 2009-03-04 16:48 well, applied cleanly, but when i compile tux3/user then it's complaining 2009-03-04 16:48 let me check 2009-03-04 16:48 ah 2009-03-04 16:48 right 2009-03-04 16:48 I broke the user code 2009-03-04 16:48 I will fix 2009-03-04 16:48 ETA one hour 2009-03-04 16:49 sk8 oclock 2009-03-04 16:49 For now, you can run the user code without the patch 2009-03-04 16:49 just to be able to do the tux3 mkfs 2009-03-04 16:50 the kernel code, with the patch 2009-03-04 16:50 that is, apply the patch to the mercurial tree, copy the files to kernel, then reverse the patch and compile the user code 2009-03-04 16:51 oh hmm, you don't copy the files do you 2009-03-04 16:51 just just build the module 2009-03-04 16:51 well: apply the patch to the mercurial tree, build the module, then reverse the patch and compile the user code and do the tux3 mkfs 2009-03-04 17:12 hirofumi, there? 
2009-03-04 17:49 ok, new patch is in, ran on both big and small partitions, seems to behave well on out of space 2009-03-04 17:49 fine, next is to make it not slow things down 2009-03-04 17:49 i'm trying to rm /mnt2/* and it's giving me out of space again 2009-03-04 17:49 ah 2009-03-04 17:49 sorry 2009-03-04 17:49 :) 2009-03-04 17:50 didn't get you the latest version of the patch it seems 2009-03-04 17:50 i just downloaded it off the mailing list 2009-03-04 17:50 the new new 2009-03-04 17:50 right 2009-03-04 17:50 let me check and see if it's really the newest 2009-03-04 17:50 sometimes code on my test machine gets ahead of the mercurial machine 2009-03-04 17:51 do these patches have some sort of ID system? 2009-03-04 17:53 no, they're ad hoc 2009-03-04 17:53 pretty soon it will start going into the repo, then it will be more organized 2009-03-04 17:53 great, can you call them something remotely unique, like enospc_patch01? :) 2009-03-04 17:53 it's progressing from an off the wall idea to a mergeable patch 2009-03-04 17:53 how about the date/time of the email? 2009-03-04 17:54 that'd work too 2009-03-04 17:54 patch appears to be up to date 2009-03-04 17:54 let me see if I can reproduce the rm failure 2009-03-04 17:54 don't care as long as we don't confuse ourselves ;) 2009-03-04 17:54 don't confuse ourselves that way, anyway 2009-03-04 17:54 ACTION has many methods of confusing himself 2009-03-04 17:55 huh? what? what is that voice from my computer? 2009-03-04 17:55 wake up Neo 2009-03-04 17:58 being able to change the matrix brings with it a certain level of responsibility 2009-03-04 17:59 ah, I easily reproduce the failure here 2009-03-04 17:59 will fix 2009-03-04 17:59 so we're not just confusing each other here, good ;) 2009-03-04 18:01 so is that gonna be like a 5 mins fix or something bigger? you've guilt-tripped me the other day, so i'm trying to figure out lxr again 2009-03-04 18:05 I did?
2009-03-04 18:06 well it will be a few minutes 2009-03-04 18:06 i'm asking cuz lxr is such a mess there is no way of starting it for few mins 2009-03-04 18:06 s/no way/no point/ 2009-03-04 18:07 oh right :) 2009-03-04 18:08 the new lxr is a bear to install, indeed 2009-03-04 18:08 but it sure is useful 2009-03-04 18:08 have you seen the fxr.something site? it's got a bunch of linux and bsd... 2009-03-04 18:08 using lxr 0.3 I think 2009-03-04 18:09 easier to install, but considerably less useful 2009-03-04 18:09 does not track structure fields for one thing 2009-03-04 18:09 also depends on glimpse which is non-free 2009-03-04 18:10 ok, I see another bug 2009-03-04 18:10 after rm -r, umount takes forever 2009-03-04 18:11 well 2009-03-04 18:11 we have some big changes to make in that part 2009-03-04 18:12 ok, so should i just leave you fix stuff and do lxr for a change of pace? 2009-03-04 18:20 new patch posted 2009-03-04 18:20 suit yourself ;) 2009-03-04 18:22 Offtopic... so answer me this: why is it that China does a better job of following Keynesian economics than we so-called capitalists? 2009-03-04 18:22 what's Keynesian? 
2009-03-04 18:23 govt is supposed to accumulate budget surplus during times of expansion and spend on public works during contraction 2009-03-04 18:23 has a variety of benefits, including roads without potholes 2009-03-04 18:24 nah, we'd rather spending cash on chasing around weapons of mass phantomness 2009-03-04 18:24 and developing new and bigger ponzi schemes 2009-03-04 18:25 ACTION drops into lxr to find out why generic_sync_sb_inodes sucks so much 2009-03-04 18:26 it's about time to try chris mason's seekwatcher 2009-03-04 18:27 rm -r ran to completion for me 2009-03-04 18:27 slowly, but it ran 2009-03-04 18:27 ah, now we have the silly long umount time 2009-03-04 18:27 got to find out why 2009-03-04 18:31 it's doing something very silly in generic_shutdown_super -> generic_sync_sb_inodes 2009-03-04 18:32 even though all inodes are supposedly removed 2009-03-04 18:33 by the way, writeout seems to spend a huge percentage of its time in bio -> mempool alloc 2009-03-04 18:33 this part of the block IO path sucks enormously 2009-03-04 18:33 only hidden behind slow disk IO 2009-03-04 18:34 elevator bloat shows too 2009-03-04 18:35 so... it looks like rm -r removes the backing disk structure but not the inodes 2009-03-04 18:35 cache inodes 2009-03-04 18:35 doesn't really explain why the generic_sync is going nuts 2009-03-04 18:37 474 wbc->encountered_congestion = 1; 2009-03-04 18:37 475 if (!sb_is_blkdev_sb(sb)) 2009-03-04 18:37 476 break; /* Skip a congested fs */ <- ah, we probably broke this feature 2009-03-04 18:37 21static inline int sb_is_blkdev_sb(struct super_block *sb) 2009-03-04 18:37 22{ 2009-03-04 18:37 23 return sb == blockdev_superblock; 2009-03-04 18:37 24} 2009-03-04 18:37 well, we want to write our own inode flusher anyway 2009-03-04 18:37 time to do that now 2009-03-04 18:38 488 /* Was this inode dirtied after sync_sb_inodes was called? 
*/ 2009-03-04 18:39 489 if (time_after(inode->dirtied_when, start)) 2009-03-04 18:39 490 break; <- this looks like a probable culprit 2009-03-04 18:39 dirty-in-dirty would produce the observed effect 2009-03-04 18:39 yet another reason for writing our own 2009-03-04 18:51 ran the last patch, made files, now rm'ing them, so far no blowups 2009-03-04 18:56 cabal time is nigh upon us 2009-03-04 18:58 um, so after deleting the files, i created the files again, and du -sh /mnt1 (that's the small one, 128mb) says only 14mb is used...could it really be that inefficient? 2009-03-04 18:59 and we got ourselv some sort of blowup :) 2009-03-04 19:00 EIP in probe, does that mean anything? 2009-03-04 19:02 kernel BUG at kernel/btree.c:245 2009-03-04 19:04 it means we have a bug :) 2009-03-04 19:04 could you post your traceback please? 2009-03-04 19:05 that is, the oops 2009-03-04 19:05 http://marcintology.com/btreebug.txt 2009-03-04 19:05 thankyou 2009-03-04 19:05 there's a leadup, then the crash, then another crash (not sure why) 2009-03-04 19:05 rm is only lighly worked on so far 2009-03-04 19:05 -!- chesse_(~eworm@dslb-084-062-161-050.pools.arcor-ip.net) has joined #tux3 2009-03-04 19:06 rm is kinda useful to have for testing ;) 2009-03-04 19:06 probe: Failed assert((btree->ops->leaf_sniff)(btree, bufdata(buffer)))! <- means a btree index node points at something that is not a btree leaf 2009-03-04 19:07 usually indicating something went fubar :) 2009-03-04 19:07 well, cabal time 2009-03-04 19:07 pointer math? 2009-03-04 19:07 thanks for the destruction :) 2009-03-04 19:07 unlikely 2009-03-04 19:07 it's more likely to be an smp race 2009-03-04 19:07 not something you want to debug as your first kernel project 2009-03-04 19:07 marcin: is it an smp machine? 2009-03-04 19:08 vm with 2 cpus assigned to it 2009-03-04 19:08 could try it with one cpu 2009-03-04 19:08 you want me to? 
2009-03-04 19:08 see if it happens again 2009-03-04 19:08 would be an interesting data point 2009-03-04 19:08 one sec 2009-03-04 19:08 which can be gathered while flips eats pizza and drinks wine :) 2009-03-04 19:09 and slogs through the middle of soggy santa monica in his gumshoes 2009-03-04 19:10 sounds very santa monica ;) 2009-03-04 19:11 it actually rained this year 2009-03-04 19:11 oh the herecy! 2009-03-04 19:15 1. cleanup problem. first time i run make-many-files, it hits ENOSPC at 42mb, after rm and recreating the files again, it fills up at 14mb 2009-03-04 19:23 yea, cleanup is epic borked, after rm finishes 'sucessfully' df showed used amount the same as the du showed BEFORE rm'ing 2009-03-04 19:24 http://marcintology.com/btreebug_singlecpu.txt 2009-03-04 20:05 -!- chesse(~eworm@dslb-084-062-146-197.pools.arcor-ip.net) has joined #tux3 2009-03-04 20:12 -!- cdk(~chinmay@121.246.33.3) has joined #tux3 2009-03-04 20:34 -!- kushal(~kushal@115.109.8.2) has joined #tux3 2009-03-04 20:41 -!- cdk(~chinmay@115.109.8.2) has joined #tux3 2009-03-04 22:07 marcin, thanks for the bug report 2009-03-04 22:14 hi flips 2009-03-04 22:14 hi kushal 2009-03-04 22:14 shapor, there? 2009-03-04 22:15 in the deduplication patch, we need to blockget the block to calculate the SHA1 hash in map_region 2009-03-04 22:15 seems we are waiting on a lock there 2009-03-04 22:15 got a stack trace? 2009-03-04 22:15 how to do that in the kernel ? 2009-03-04 22:16 Alt-Sysrq-w 2009-03-04 22:16 ok ... be back with it in a few minutes 2009-03-04 23:00 -!- RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3 2009-03-04 23:05 hey flips 2009-03-04 23:32 flips: pong 2009-03-05 00:00 hi shapor 2009-03-05 00:20 flips, sysrq was not enabled in my kernel config. will need to do that and get back later . 
2009-03-05 00:25 ok 2009-03-05 00:40 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-03-05 01:27 -!- cdk(~chinmay@121.246.33.3) has joined #tux3 2009-03-05 02:06 -!- kushal(~kushal@121.246.34.189) has joined #tux3 2009-03-05 03:23 so, generic_sync_sb_inodes is about 300 lines of rambling bloat 2009-03-05 03:23 doing all kinds of things that are not useful for us 2009-03-05 03:24 makes heavy use of a "redirty_tail" function 2009-03-05 03:24 that seems scary 2009-03-05 03:26 there is a lot of interaction with queue congestion mechanism here 2009-03-05 03:27 * Otherwise fully redirty the inode so that 2009-03-05 03:27 * other inodes on this superblock will get some 2009-03-05 03:27 * writeout. Otherwise heavy writing to one 2009-03-05 03:27 * file would indefinitely suspend writeout of 2009-03-05 03:27 * all the other files. 2009-03-05 03:27 need to think about what we want there 2009-03-05 03:31 redirty_tail is called from 5 places in what is essentially one function 2009-03-05 03:45 now, how can I find out which symbols are not exported to modules? 2009-03-05 03:45 ACTION supposes... compile the module and try to modprobe it 2009-03-05 03:56 ok, the most interesting symbol not exported is do_writepages 2009-03-05 03:56 let me see 2009-03-05 03:57 the mechanism is exported I think 2009-03-05 03:57 it's just a wrapper for generic_writepages 2009-03-05 03:59 now... __iget 2009-03-05 04:01 some strange stuff going on there 2009-03-05 04:01 inode_in_user list, and some means of avoiding write/rm races 2009-03-05 04:02 inode_in_use I mean 2009-03-05 04:04 inode_lock is not exported 2009-03-05 04:07 http://lkml.org/lkml/2009/1/17/173 2009-03-05 04:09 * The whole writeout design is quite complex and fragile. We want to avoid 2009-03-05 04:09 * starvation of particular inodes when others are being redirtied, prevent 2009-03-05 04:09 * livelocks, etc. 
2009-03-05 04:09 ah, I noticed 2009-03-05 06:16 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-03-05 09:24 -!- kushal(~kushal@121.246.34.189) has joined #tux3 2009-03-05 10:09 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-03-05 10:34 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-05 10:50 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-05 11:06 -!- tim_dimm_(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-05 11:06 -!- gaurav(~gaurav@59.95.15.170) has joined #tux3 2009-03-05 15:16 -!- konrad(~konrad@D-128-208-53-188.dhcp4.washington.edu) has joined #tux3 2009-03-05 17:31 -!- konrad_(~konrad@D-128-208-53-188.dhcp4.washington.edu) has joined #tux3 2009-03-05 17:35 -!- konrad(~konrad@D-128-208-53-188.dhcp4.washington.edu) has joined #tux3 2009-03-05 19:02 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-05 19:05 -!- chesse_(~eworm@dslb-084-062-149-080.pools.arcor-ip.net) has joined #tux3 2009-03-05 19:15 let's see, how hard will it be to write a tux3_writepage using write_cache_pages 2009-03-05 19:15 tux3_writepages that is 2009-03-05 19:20 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-05 20:04 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-05 20:05 -!- chesse(~eworm@dslb-084-062-130-254.pools.arcor-ip.net) has joined #tux3 2009-03-05 20:31 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-05 20:46 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-05 21:13 -!- kushal(~kushal@115.109.9.113) has joined #tux3 2009-03-05 22:17 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-03-05 22:18 -!- kushal(~kushal@115.109.9.113) has joined #tux3 2009-03-05 22:19 hi flips 2009-03-05 22:46 -!- 
RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3 2009-03-05 23:20 -!- cdk(~chinmay@115.109.9.113) has joined #tux3 2009-03-05 23:20 hi flips 2009-03-05 23:24 -!- gaurav(~gaurav@115.109.9.113) has joined #tux3 2009-03-05 23:37 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-03-06 00:17 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-03-06 01:41 hi kushal 2009-03-06 01:42 about the alternate algo using a smaller hash... 2009-03-06 01:43 everytime when a new block's hash matches one of the entries in our index... 2009-03-06 01:43 we have to read the already present block, and perform byte by byte comparision.. 2009-03-06 01:44 because we would not be sure whether it is a hash collision or an actual duplicate 2009-03-06 01:44 yes 2009-03-06 01:44 whether it is a win depends on the frequency of hash matches 2009-03-06 01:45 http://www.usenix.org/events/fast09/tech/full_papers/lillibridge/lillibridge_html/index.html <- have you seen this? 2009-03-06 01:51 -!- kushal_(~kushal@121.246.33.209) has joined #tux3 2009-03-06 01:51 lights down 2009-03-06 01:51 http://www.usenix.org/events/fast09/tech/full_papers/lillibridge/lillibridge_html/index.html <- kushal, have you seen this? 2009-03-06 04:18 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-03-06 04:32 -!- kushal(~kushal@115.109.11.58) has joined #tux3 2009-03-06 04:32 hi flips 2009-03-06 04:32 hi 2009-03-06 04:32 sorry for the earlier episode... 2009-03-06 04:32 http://www.usenix.org/events/fast09/tech/full_papers/lillibridge/lillibridge_html/index.html <- have you seen this? 2009-03-06 04:33 ya...Hirofumi had sent us... 2009-03-06 04:33 havent seen it completely 2009-03-06 04:34 continuing the earlier discussion... 2009-03-06 04:34 we still feel that the 160bit hash is necessary... 
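The short-hash alternative described above (a hash match only proves a possible duplicate, so the stored block has to be read and compared byte by byte) can be sketched roughly as follows. This is a minimal sketch and not the dedup patch itself: FNV-1a stands in for whatever 64-bit hash would really be used, and BLOCK_SIZE, hash64 and classify are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096	/* assumed block size, for illustration only */

/* FNV-1a: a stand-in 64-bit hash, not the hash used by the dedup patch. */
static uint64_t hash64(const unsigned char *data, size_t len)
{
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV offset basis */

	for (size_t i = 0; i < len; i++) {
		h ^= data[i];
		h *= 0x100000001b3ULL;		/* FNV prime */
	}
	return h;
}

enum dedup_result { DEDUP_NEW, DEDUP_COLLISION, DEDUP_DUPLICATE };

/*
 * Hypothetical check: 'stored' is the block already on disk whose index
 * entry carries 'stored_hash'.  Only a full comparison can tell a real
 * duplicate from a hash collision.
 */
static enum dedup_result classify(const unsigned char *incoming,
				  const unsigned char *stored,
				  uint64_t stored_hash)
{
	if (hash64(incoming, BLOCK_SIZE) != stored_hash)
		return DEDUP_NEW;
	return memcmp(incoming, stored, BLOCK_SIZE) ? DEDUP_COLLISION
						    : DEDUP_DUPLICATE;
}
```

The cost raised in the discussion is visible here: every hash match pays for reading the stored block and one memcmp, so the trade-off depends on how often matches occur.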
2009-03-06 04:35 and if we use the 64bit hash...there will be huge performance hit in cases where most data is duplicate 2009-03-06 04:35 which is a typical scenario in backups 2009-03-06 04:36 well in the article they use a 40 byte hash - 320 bits 2009-03-06 04:37 did you measure it? 2009-03-06 04:37 what about the case where not very much data is duplicate? 2009-03-06 04:37 no...but theoretically speaking... 2009-03-06 04:38 and our primary use case would be backups 2009-03-06 04:38 your current strategy seems to be effective 2009-03-06 04:38 based on the initial numbers you mentioned 2009-03-06 04:39 we'll still try out the alternate solution 2009-03-06 04:39 and try to get actual numbers.... 2009-03-06 04:40 sounds good 2009-03-06 04:40 well it is time to sleep here 2009-03-06 04:41 ACTION waiting for the Hall of Fame to be updated :) 2009-03-06 04:41 good night... 2009-03-06 04:41 good night 2009-03-06 05:15 flips: have you thought about applying to GSoC? 2009-03-06 05:15 ok, too late :) 2009-03-06 08:10 -!- dagle1(~dagle@host162-104.bornet.net) has joined #tux3 2009-03-06 08:50 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-06 09:32 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-03-06 10:38 -!- cdk(~chinmay@115.109.11.168) has joined #tux3 2009-03-06 11:13 -!- data(~data@echo489.server4you.de) has joined #tux3 2009-03-06 12:12 -!- gebi_(~gebi@84-119-69-67.dynamic.xdsl-line.inode.at) has joined #tux3 2009-03-06 14:50 -!- dcg(~dcg@220.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-06 15:39 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-06 18:13 -!- marcin_(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-03-06 19:05 -!- chesse_(~eworm@dslb-084-062-137-133.pools.arcor-ip.net) has joined #tux3 2009-03-06 20:05 -!- chesse(~eworm@dslb-084-062-133-051.pools.arcor-ip.net) has joined #tux3 2009-03-07 00:04 -!- 
RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3 2009-03-07 05:25 -!- amey(~amey@socks.wantstofly.org) has joined #tux3 2009-03-07 05:25 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-03-07 07:52 -!- dcg(~dcg@131.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-07 09:21 -!- dcg(~dcg@131.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-07 09:24 -!- cdk(~chinmay@115.109.11.168) has joined #tux3 2009-03-07 10:27 -!- ned(~ned@c-76-19-208-96.hsd1.ma.comcast.net) has joined #tux3 2009-03-07 11:07 -!- cdk(~chinmay@115.109.11.168) has joined #tux3 2009-03-07 11:07 hi flips 2009-03-07 12:47 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-07 12:53 -!- konrad(~konrad@D-128-208-53-188.dhcp4.washington.edu) has joined #tux3 2009-03-07 12:57 -!- dcg(~dcg@136.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-07 16:08 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-03-07 16:38 -!- konrad(~konrad@D-128-208-53-188.dhcp4.washington.edu) has joined #tux3 2009-03-07 18:09 -!- cdk(~chinmay@115.109.11.168) has joined #tux3 2009-03-07 18:10 hi flips 2009-03-07 18:41 hi cdk 2009-03-07 18:41 whoops, bye 2009-03-07 22:22 -!- cdk(~chinmay@121.246.35.128) has joined #tux3 2009-03-07 22:22 hi flips 2009-03-07 22:22 hi 2009-03-07 22:22 wanted to talk about the lock problem that i mentioned. 2009-03-07 22:22 u asked for the stack trace. 2009-03-07 22:23 got it 2009-03-07 22:23 ok 2009-03-07 22:24 where? 2009-03-07 22:24 yes. just a min 2009-03-07 22:26 also the mkdir patch for userspace is ready. 2009-03-07 22:26 will send it in a few minutes 2009-03-07 22:30 heres the trace 2009-03-07 22:30 http://paste2.org/p/160568 2009-03-07 22:45 -!- amey_m(~amey@117.195.37.78) has joined #tux3 2009-03-07 22:48 back 2009-03-07 22:49 cdk, is this current repo, plus a patch? 2009-03-07 22:49 under what test? 
2009-03-07 22:50 -!- kushal(~kushal@121.246.33.34) has joined #tux3 2009-03-07 22:50 this is current repo + our code 2009-03-07 22:50 this happens when i umount the partition 2009-03-07 22:51 is your kernel configured with stack frames enabled? 2009-03-07 22:51 let me check 2009-03-07 22:51 under "kernel hacking" 2009-03-07 22:51 l 2009-03-07 22:51 sorry 2009-03-07 22:54 warn for stack frames larger than 1024 ? 2009-03-07 22:54 CONFIG_FRAME_POINTER=y 2009-03-07 22:54 yes seems so 2009-03-07 22:55 it's checked 2009-03-07 22:57 the map_region trace at the start is before buffer = (blockget(mapping(inode),start)); /* DREAMZ */ 2009-03-07 22:59 this lockup should also occur if you do sync, did you try that? 2009-03-07 22:59 -!- gaurav(~gaurav@59.95.24.225) has joined #tux3 2009-03-07 23:00 no...but yes it should 2009-03-07 23:01 how do i do that ? need to call tuxsync() ? 2009-03-07 23:01 ah, blockget calls ->write_begin 2009-03-07 23:02 which tries to lock a page that is already locked 2009-03-07 23:03 and we are stalled...so how to read the page to calculate the hash without blockget ? 2009-03-07 23:04 when does this hash calculation take place? 2009-03-07 23:04 are you writing here? 2009-03-07 23:04 no .. just reading the block 2009-03-07 23:04 when? 2009-03-07 23:04 at what stage in the algorithm? 2009-03-07 23:05 in map_region at the very start ... line 141 here http://bitbucket.org/kushal/tux3/src/3f15683382f1/user/kernel/filemap.c 2009-03-07 23:05 I should be looking at your patch of course 2009-03-07 23:05 this is the old code .. but the hash is calculated at the same place 2009-03-07 23:05 I am not right now 2009-03-07 23:06 ok, here goes 2009-03-07 23:08 line 141? 2009-03-07 23:09 yes 2009-03-07 23:19 well, you know at that point that you have a block that is not mapped in the file, so why are you reading it? 2009-03-07 23:21 well the physical mapping will happen here....and we need hash ... reading from the logical mapping ?
2009-03-07 23:23 blockget isn't quite the right way 2009-03-07 23:23 the data is in cache and never needs to be read 2009-03-07 23:25 so how do we get it directly in the buffer to calculate the hash ? in user space blockget used to check whether it is in the cache 2009-03-07 23:27 i mean it used to get from the hlist 2009-03-07 23:28 the buffer you want is the one that was passed to tux3_get_block 2009-03-07 23:28 hlist? 2009-03-07 23:28 you looked up in the mapping hash? 2009-03-07 23:29 that was in userspace .. i mean blockget did 2009-03-07 23:29 yes 2009-03-07 23:30 well in kernel there is not a good interface for doing this yet 2009-03-07 23:31 :( we are stuck ? 2009-03-07 23:31 there's always a way 2009-03-07 23:35 your code should be in tux3_get_block, not map_region 2009-03-07 23:35 you check the hash first 2009-03-07 23:35 you do not call map_region for buffer data that matches an existing block 2009-03-07 23:35 ok? 2009-03-07 23:36 you only call map_region if the match fails, thus a new block needs to be allocated 2009-03-07 23:37 ok...but when we started working tux3_get_block was not present :( 2009-03-07 23:37 and what about userspace? 2009-03-07 23:43 and map_region eventually adds entry to dleaf so how can we avoid it totally?? 2009-03-07 23:43 -!- RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3 2009-03-08 03:30 -!- kushal(~kushal@115.109.10.180) has joined #tux3 2009-03-08 05:36 -!- cdk(~chinmay@115.109.10.180) has joined #tux3 2009-03-08 05:40 -!- kushal(~kushal@115.109.10.180) has joined #tux3 2009-03-08 05:40 -!- amey_m(~amey@115.109.10.180) has joined #tux3 2009-03-08 05:40 hi flips 2009-03-08 11:10 -!- cdk(~chinmay@121.246.34.236) has joined #tux3 2009-03-08 12:13 -!- kushal(~kushal@115.109.10.180) has joined #tux3 2009-03-08 13:12 -!- dcg(~dcg@107.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-08 13:53 hi shapor 2009-03-08 14:29 hi cdk 2009-03-08 14:29 hmm, late there?
2009-03-08 18:55 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-08 19:30 -!- slingr(santas@will.one.day.hack-the-pla.net) has joined #tux3 2009-03-08 19:44 has anyone seen inverse in here lately? 2009-03-08 19:47 let me see 2009-03-08 19:48 Feb 08 20:35:24 * inverse has quit (Ping timeout: 480 seconds) 2009-03-08 19:48 hrm 2009-03-08 19:48 does he hang around here often 2009-03-08 19:48 ? 2009-03-08 19:48 seems to 2009-03-08 19:50 thx for the info 2009-03-08 20:22 hi flips 2009-03-08 20:26 hi cdk 2009-03-08 20:27 we sent the mkdir patch yesterday, forgot to cc to shapor 2009-03-08 20:27 shapor's away till tomorrow anyway 2009-03-08 20:27 it seems to work? 2009-03-08 20:28 yes. i tried it at my end. seems fine. 2009-03-08 20:28 I'll merge it 2009-03-08 20:29 should I take the patch from the mailing list or pull from your repo? 2009-03-08 20:29 there is just one thing, check the comment 2009-03-08 20:29 from the mailing list 2009-03-08 20:29 what is the magic number 1000000? 2009-03-08 20:30 seems the directory bit was not getting set for some reason even in the mkdir call.. 2009-03-08 20:30 that is a decimal number 2009-03-08 20:30 did you mean to write an octal or hex number there? 2009-03-08 20:31 yes. actually, it should not be needed at all . 2009-03-08 20:31 but as it is, it must be setting a lot of bits 2009-03-08 20:32 yes. all i need to do there is set the dir bit to 1 in mkdir call . I still dont understand why that is necessart 2009-03-08 20:34 otherwise how will tuxcreate know you are creating a directory? 2009-03-08 20:35 i mean, shouldn't fuse pass it as set already ? 2009-03-08 20:35 have you printed out the mode that fuse passes? 2009-03-08 20:36 yes. it was not set and without that or , a directory is not created . 2009-03-08 20:36 just a min i will check again 2009-03-08 20:37 should it be | S_IFREG? 2009-03-08 20:37 sorry 2009-03-08 20:37 | S_IFDIR? 2009-03-08 20:38 yes. 
2009-03-08 20:38 do you have a mercurial tree I can pull from? 2009-03-08 20:39 no. not without the dedup code. sorry :( 2009-03-08 20:39 I'll apply the patch then, and fix the magic constant 2009-03-08 20:40 ok. checked it with S_IFDIR works fine 2009-03-08 20:41 shapor, host the tux3.org site and the irc ? 2009-03-08 20:41 just tux3.org 2009-03-08 20:42 ACTION waiting for the hall of fame to be updated 2009-03-08 20:42 :) 2009-03-08 20:43 about our deduplication code, we have completed design and coding of collision handling for the 64 bit htree leaf entries 2009-03-08 20:43 will commit to our mercurial tree soon 2009-03-08 20:45 htree leaf? 2009-03-08 20:45 hash tree? 2009-03-08 20:45 yes 2009-03-08 20:45 i have to leave now. will be back later. bye 2009-03-08 20:47 will post a note about the design along with the commit to the mailing list. 2009-03-08 20:47 some of your blank lines actually have tabs on them 2009-03-08 20:47 ok 2009-03-08 20:57 oops. sorry. will make sure that it does not happen again.
2009-03-08 23:34 -!- RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3 2009-03-09 00:00 -!- konrad(~konrad@D-128-208-53-188.dhcp4.washington.edu) has joined #tux3 2009-03-09 00:14 -!- konrad(~konrad@D-128-208-53-188.dhcp4.washington.edu) has joined #tux3 2009-03-09 00:16 -!- konrad(~konrad@D-128-208-53-188.dhcp4.washington.edu) has joined #tux3 2009-03-09 01:20 -!- chesse(~eworm@dslb-084-062-185-190.pools.arcor-ip.net) has joined #tux3 2009-03-09 06:32 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-03-09 08:04 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-09 08:05 -!- dcg(~dcg@107.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-09 08:06 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-03-09 10:09 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-03-09 10:29 -!- konrad(~konrad@D-128-208-53-188.dhcp4.washington.edu) has joined #tux3 2009-03-09 11:02 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-03-09 11:22 -!- gaurav(~gaurav@59.95.22.216) has joined #tux3 2009-03-09 11:23 -!- cdk(~chinmay@121.246.34.236) has joined #tux3 2009-03-09 12:01 -!- cdk(~chinmay@121.246.34.236) has joined #tux3 2009-03-09 12:01 hi flips 2009-03-09 12:15 -!- dcg(~dcg@155.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-09 12:22 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-09 12:24 -!- gebi_(~gebi@84-119-45-182.dynamic.xdsl-line.inode.at) has joined #tux3 2009-03-09 13:26 flips, there? 2009-03-09 13:27 why do we rewrite log->magic in stage_delta() in user/kernel/commit.c 2009-03-09 13:27 ? 
2009-03-09 14:31 -!- slingr(santas@will.one.day.hack-the-pla.net) has joined #tux3 2009-03-09 14:32 -!- slingr(santas@will.one.day.hack-the-pla.net) has left #tux3 2009-03-09 15:28 hi hirofumi 2009-03-09 15:28 hi 2009-03-09 15:28 hmm, let me see 2009-03-09 15:29 in "allocate and write log blocks" 2009-03-09 15:30 yes 2009-03-09 15:30 it isn't set anywhere else 2009-03-09 15:31 log->magic is initialized by log_begin? 2009-03-09 15:31 oh :) 2009-03-09 15:31 sure, it can be initialized there 2009-03-09 15:32 it doesn't really matter 2009-03-09 15:32 so, it should be assert instead to rewrite magic? 2009-03-09 15:32 I guess it is better to initialize earlier, that we we can use it as a check in memory as well as on disk 2009-03-09 15:32 yes 2009-03-09 15:32 ok 2009-03-09 15:33 ok, today is the day to see what history I can import into the git repo 2009-03-09 15:33 oh 2009-03-09 15:37 http://kerneltrap.org/mailarchive/git/2007/3/6/240631 <- python script to hack here 2009-03-09 15:38 http://repo.or.cz/w/hg2git.git 2009-03-09 15:42 yes 2009-03-09 15:42 it just convert hg to git 2009-03-09 15:43 right 2009-03-09 15:43 well, it would be first step 2009-03-09 15:43 then what :) 2009-03-09 15:43 git tree is for kernel? 2009-03-09 15:47 yes 2009-03-09 15:47 only user/kernel/* will be in it 2009-03-09 15:47 yes 2009-03-09 15:47 for it, I guess we would need to rewrite history more or less 2009-03-09 15:49 yes 2009-03-09 15:51 yes, it would be lazy :) 2009-03-09 16:00 hg2git and hg-fast-import scripts are pretty rough 2009-03-09 16:02 btw, I'm using hg-to-git.py 2009-03-09 16:04 hg-fast-import can't convert heads, otherwise it's way faster and less error prone than hg2git 2009-03-09 16:07 running it right now 2009-03-09 16:08 ok, I have a git repo of all of hg/tux3 2009-03-09 16:09 btw... 
you can also do a subtree merge, so you don't have to rewrite history 2009-03-09 16:10 ACTION learns about subtree merge 2009-03-09 16:28 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-03-09 16:47 -!- tux3bot(~tux3bot@yzf.shapor.com) has joined #tux3 2009-03-09 16:51 http://userweb.kernel.org/~hirofumi/log-test.png 2009-03-09 16:51 test result of logchain dump 2009-03-09 16:51 hey flips 2009-03-09 16:51 very pretty 2009-03-09 16:51 and company 2009-03-09 16:52 thanks 2009-03-09 16:52 moving ot a new revision control system ? 2009-03-09 16:53 btw, flush_log() has problem 2009-03-09 16:53 flush_log() make log empty 2009-03-09 16:53 gh, no, just importing the kernel part to a kernel tree 2009-03-09 16:54 it seems to lost bfree log 2009-03-09 16:54 ACTION looks 2009-03-09 16:55 ah 2009-03-09 16:55 that's kind of important to have recent kernel sources and stff 2009-03-09 16:55 stuff 2009-03-09 16:56 hirofumi, oops :) 2009-03-09 16:56 we have to do something around it 2009-03-09 16:57 pondering 2009-03-09 16:57 well, anyway, the code seems to work more or less 2009-03-09 16:58 bitmap fork and redirect, and logging bfree 2009-03-09 16:58 it seems good 2009-03-09 16:58 the original, or as changed by you? 2009-03-09 16:58 changed by me 2009-03-09 16:59 however, not so many 2009-03-09 16:59 probably, it still need to work on many code though 2009-03-09 17:00 sb->logbase = sb->lognext; <- this should not be done in flush_log, I think 2009-03-09 17:01 yes 2009-03-09 17:01 for now, I just moved it to the above of bitmap flush 2009-03-09 17:02 well, I'm going to try to think about it 2009-03-09 17:02 still not thinking 2009-03-09 17:02 unstash(sb, &sb->deflush, move_deferred); <- this was supposed to save the deferred frees 2009-03-09 17:02 it doesn't work? 2009-03-09 17:02 sb->logbase = sb->lognext; <- actually, this is probably ok 2009-03-09 17:03 I guess deferred free and bfree logging is different stuff? 
2009-03-09 17:04 it's the same 2009-03-09 17:04 well, current map_region() is doing log_bfree() 2009-03-09 17:04 so, it doesn't work for now 2009-03-09 17:07 ah, I'm beginning to remember what I was trying to do there 2009-03-09 17:07 wow, it really needs comments 2009-03-09 17:07 so, we have two kinds of deferred free as you suspected 2009-03-09 17:08 probably 2009-03-09 17:08 we have the deferred frees that must be held until after the next delta commit 2009-03-09 17:08 and we have the deferred frees that must be held until after the next log flush 2009-03-09 17:08 yes 2009-03-09 17:08 the second kind becomes the first kind at the time of the log flush 2009-03-09 17:09 this is confusing, but necessary 2009-03-09 17:10 well, I'm going to try to flush for each delta 2009-03-09 17:10 just return 1 from need_flush() always 2009-03-09 17:10 then, if it works, I'm try to change it 2009-03-09 17:11 the problem is, you can't free a btree node block that might be used as the base to reconstruct dirty cache 2009-03-09 17:11 sure 2009-03-09 17:11 this only affects ability to replay 2009-03-09 17:11 which is the entire purpose of logging ;) 2009-03-09 17:11 however, it would be nice to see it working without leaking frees 2009-03-09 17:12 even if replay is broken 2009-03-09 17:12 ok, maybe, I see 2009-03-09 17:12 log is to restruct the defree list 2009-03-09 17:12 and actual free is from defree list 2009-03-09 17:13 restruct? 2009-03-09 17:13 reconstruct 2009-03-09 17:13 reconstruct 2009-03-09 17:13 yes 2009-03-09 17:13 ah, yes :) 2009-03-09 17:13 so, I guess log_bfree() should add entry to defree list 2009-03-09 17:14 the defree list must be reconstructed on replay, which is not implemented in replay yet 2009-03-09 17:14 ok 2009-03-09 17:15 would you like to check in your fixes so we are looking at the same code? 
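The two kinds of deferred free described above (one held until after the next delta commit, one held until after the next log flush, with the second kind becoming the first kind at flush time) can be modelled with two small queues. This is only a sketch under assumptions: the struct layout, the fixed-size arrays, and the function names are hypothetical, and the "freed" counter stands in for actually returning blocks to the allocator.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEFERRED 64	/* arbitrary capacity for the sketch */

/* Hypothetical fixed-size queue of block numbers awaiting release. */
struct stash {
	unsigned long block[MAX_DEFERRED];
	size_t count;
};

struct deferred {
	struct stash defree;	/* held until after the next delta commit */
	struct stash deflush;	/* held until after the next log flush */
	size_t freed;		/* blocks actually released so far */
};

/* Returns 0 on success, -1 if the stash is full. */
static int stash_add(struct stash *s, unsigned long block)
{
	if (s->count == MAX_DEFERRED)
		return -1;
	s->block[s->count++] = block;
	return 0;
}

/* Log flush: every "until flush" entry becomes an "until commit" entry. */
static int on_log_flush(struct deferred *d)
{
	if (d->defree.count + d->deflush.count > MAX_DEFERRED)
		return -1;	/* refuse rather than move only part of the list */
	for (size_t i = 0; i < d->deflush.count; i++)
		d->defree.block[d->defree.count++] = d->deflush.block[i];
	d->deflush.count = 0;
	return 0;
}

/* Delta commit finished: blocks in defree can now really be freed. */
static size_t on_delta_commit(struct deferred *d)
{
	size_t n = d->defree.count;

	d->freed += n;		/* stand-in for releasing the blocks */
	d->defree.count = 0;
	return n;
}
```

A block queued on deflush therefore survives at least one log flush and one further delta commit before it is released, which is the property the replay discussion depends on.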
2009-03-09 17:16 23 patches 2009-03-09 17:16 iirc, small patches though 2009-03-09 17:17 it would not do any harm to take them now I think 2009-03-09 17:20 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-03-09 17:20 this is my current code 2009-03-09 17:21 what is the url to your notes on fork? 2009-03-09 17:22 http://userweb.kernel.org/~hirofumi/fork-buffer.note 2009-03-09 17:22 this? 2009-03-09 17:22 yes 2009-03-09 17:25 um... 2009-03-09 17:25 empty the log at end of flush_log() 2009-03-09 17:26 it will empty the log for writing soon 2009-03-09 17:26 did you mean, too soon? 2009-03-09 17:26 it will lost the bfree log of bitmap 2009-03-09 17:27 it means, flush_log() empty the log, and stage_delta() will see empty log 2009-03-09 17:27 yes 2009-03-09 17:27 however, bitmap is flushing in flush_log() 2009-03-09 17:28 and it adds bfree log 2009-03-09 17:28 ok, that is confusing :) 2009-03-09 17:28 but, bfree log will be removed by empty 2009-03-09 17:28 yes 2009-03-09 17:28 I'm not sure if it is only bitmap or not 2009-03-09 17:29 probably bitmap and btree index nodes 2009-03-09 17:29 ah 2009-03-09 17:29 no, just bitmap 2009-03-09 17:29 it's another bitmap recursion :) 2009-03-09 17:29 kind of 2009-03-09 17:30 i see 2009-03-09 17:30 ok 2009-03-09 17:30 well, I'll try to think what is occuring 2009-03-09 17:31 I'd like to think about it with logging 2009-03-09 17:32 I was not thinking it yet 2009-03-09 17:32 it is an interaction between deferred freeing and logging 2009-03-09 17:32 I will think about it too 2009-03-09 17:32 thanks 2009-03-09 17:32 so, I will continue reading your patches, I would like to merge them so we can be thinking about the same code 2009-03-09 17:33 bitmap-atomic-commit.patch is too dirty 2009-03-09 17:33 there as some important bug fixes there 2009-03-09 17:33 "Fix inode number offset in find_empty_inode()" 2009-03-09 17:33 yes 2009-03-09 17:34 some of patches are ready to pull 2009-03-09 17:34 basically, non-atomic commit stuff 2009-03-09 
17:34 ok, well maybe I should pull everything except that patch, and you can post bitmap-atomic-commit.patch to the list 2009-03-09 17:35 I haven't read them all yet 2009-03-09 17:35 ok 2009-03-09 17:35 I'll make repo for pull 2009-03-09 17:35 and post rest of patche 2009-03-09 17:35 thanks 2009-03-09 17:35 patches 2009-03-09 17:40 heh, we had to patch tux3.h 3 times for the last change to the super magic, over a period of 3 weeks or so 2009-03-09 17:40 yes 2009-03-09 17:40 maybe, I added the bug to it 2009-03-09 17:40 sorry 2009-03-09 17:40 well the magic check does work really well, and was a big help when I was setting up the rootfs 2009-03-09 17:40 it's just funny 2009-03-09 17:41 it was also my oversight, to miss the 2008 2009-03-09 17:43 do you think alignment in the disk super matters? 2009-03-09 17:44 I think the endian access functions don't care about alignment 2009-03-09 17:44 I might be wrong about that 2009-03-09 17:45 alignment is matter for 32bit and 64bit difference 2009-03-09 17:45 my belief is that __attribute__ ((packed)) forces bytewise access 2009-03-09 17:46 how does it affect 32/64 bit? 2009-03-09 17:46 it would also work 2009-03-09 17:46 ah, I forgot to declare struct disksup PACKED 2009-03-09 17:47 ah 2009-03-09 17:47 next is 64bit 2009-03-09 17:47 so, it may not make difference 2009-03-09 17:47 -}; 2009-03-09 17:47 +} PACKED; 2009-03-09 17:47 next field 2009-03-09 17:48 it is no problem to align it, we can do both 2009-03-09 17:48 "belt and suspenders" 2009-03-09 17:48 um... 2009-03-09 17:48 do you know that term? 
2009-03-09 17:48 basically, I dislike to use packed 2009-03-09 17:49 I don't know 2009-03-09 17:49 you can use a belt to hold your pants up, or you can use suspenders 2009-03-09 17:49 it is not necessary to use both 2009-03-09 17:50 i see 2009-03-09 17:50 my experience of alignment problem in past 2009-03-09 17:50 it was packet handling though 2009-03-09 17:50 ip is 4byte alignment 2009-03-09 17:51 however, ether header is 2byte alignment 2009-03-09 17:51 iirc 2009-03-09 17:51 well, something like it 2009-03-09 17:51 in inode attributes and logs, we acceses many fields unaligned, and rely on ((packed)) 2009-03-09 17:51 so, it generated the alignment fault on embeded cpu for each packed 2009-03-09 17:51 for each packet 2009-03-09 17:51 that would suck 2009-03-09 17:52 arm? 2009-03-09 17:52 mips? 2009-03-09 17:52 arm or mips 2009-03-09 17:52 I forgot 2009-03-09 17:52 well, so, it made to slow packet handling, and it's not so small overhead 2009-03-09 17:53 well, we only access unaligned fields with the endian conversion functions 2009-03-09 17:53 yes 2009-03-09 17:53 and now, only disksuper 2009-03-09 17:53 so... we could make those macros use bytewise access, even when endian flipping is not required 2009-03-09 17:54 ileaf and log blocks also have unaligned fields 2009-03-09 17:54 both could cause performance problems on embedded cpu, with many faults 2009-03-09 17:54 however, we don't have to worry about that now 2009-03-09 17:55 btw, ileaf has padding already 2009-03-09 17:55 to handle it, be can adjust the endian functions 2009-03-09 17:55 some of the attributes have unaligned fields 2009-03-09 17:55 yes 2009-03-09 17:56 anyway, I am satisfied that we have a strategy that will avoid alignment faults 2009-03-09 17:56 and we don't have to worry about Tux3 on cell phones just yet 2009-03-09 17:56 ok, thanks 2009-03-09 18:00 ok, it looks good 2009-03-09 18:00 please ping me when you are ready 2009-03-09 18:02 ok 2009-03-09 18:15 gebi, still there? 
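The alignment discussion above suggests making the endian access macros use bytewise access even when no byte flipping is needed, so that unaligned fields never fault on embedded CPUs. A minimal sketch of that idea follows. The helper names are hypothetical and are not taken from the tux3 source, and a big-endian on-disk layout is assumed here only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers, not tux3 code: read and write a big-endian
 * 32-bit value one byte at a time. Only single-byte loads and stores
 * are issued, so the pointer may have any alignment.
 */
static uint32_t get_be32_bytewise(const void *p)
{
	const unsigned char *b = p;

	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

static void put_be32_bytewise(void *p, uint32_t v)
{
	unsigned char *b = p;

	b[0] = (unsigned char)(v >> 24);
	b[1] = (unsigned char)(v >> 16);
	b[2] = (unsigned char)(v >> 8);
	b[3] = (unsigned char)v;
}
```

Calling `put_be32_bytewise(buf + 1, value)` on a byte buffer and reading it back with `get_be32_bytewise(buf + 1)` works at an odd offset, which is the case that would trap with a plain 32-bit load on a strict-alignment CPU. The cost is four byte accesses per field instead of one word access.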
2009-03-09 18:16 ACTION thinks it's pretty late in .at 2009-03-09 18:22 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-03-09 18:22 I put the mergable patches 2009-03-09 18:25 pulled, thankyou 2009-03-09 18:25 and pushed to public 2009-03-09 18:25 thanks 2009-03-09 18:26 I have started to write a tux3_writepages based on write_cache_pages 2009-03-09 18:26 this looks pretty easy 2009-03-09 18:26 but isn't really necessary to do right now 2009-03-09 18:27 yes 2009-03-09 18:27 I will put it aside and get the git repo working 2009-03-09 18:27 it is very tempting to do that writepages work instead, because it is fun and will allow some cleanups 2009-03-09 18:27 but... git repo is more important 2009-03-09 18:28 writepages would be need to port atomic commit to kernel 2009-03-09 18:28 probably 2009-03-09 18:29 why do you think so? 2009-03-09 18:30 there are two slightly different ways to write it: 1) writepages submits io 2) writepages just makes a list of bios 2009-03-09 18:30 either way, there have to be bio completions that count down, to complete the delta 2009-03-09 18:31 so there is some part of the atomic commit that is needed, but not logging or replay 2009-03-09 18:31 still, it is just a cleanup 2009-03-09 18:33 because, we will not use radix-tree dirty mark, probably 2009-03-09 18:33 generic_* stuff depends on it 2009-03-09 18:33 right, I was going to avoid generic_* completely 2009-03-09 18:33 well 2009-03-09 18:34 the radix_tree dirty flag should still work for this 2009-03-09 18:34 write_cache_pages uses it, I thought 2009-03-09 18:34 ACTION checks 2009-03-09 18:34 yes 2009-03-09 18:34 however, fontend marks it as dirty 2009-03-09 18:35 so, I guess backend can't get stable dirty pages from it 2009-03-09 18:35 you mean, it won't work properly with fork? 
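The remark above that "there have to be bio completions that count down, to complete the delta" can be sketched as a completion counter. This is a toy model under stated assumptions, not the tux3 writepages code: the struct and function names are invented, and C11 atomics stand in for whatever the kernel side would use. The counter starts at 1 so the delta cannot complete while IO is still being submitted.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy model, not tux3 code: one delta with a count of IOs in flight. */
struct toy_delta {
	atomic_int pending;	/* in-flight IOs plus one submitter reference */
	int complete;		/* set when the last reference is dropped */
};

static void toy_delta_init(struct toy_delta *d)
{
	atomic_init(&d->pending, 1);	/* submitter holds one reference */
	d->complete = 0;
}

/* Called once per IO submitted for this delta. */
static void toy_submit(struct toy_delta *d)
{
	atomic_fetch_add(&d->pending, 1);
}

/* Called from each IO completion, and once by the submitter when done. */
static void toy_endio(struct toy_delta *d)
{
	/* atomic_fetch_sub returns the previous value: 1 means we were last */
	if (atomic_fetch_sub(&d->pending, 1) == 1)
		d->complete = 1;
}
```

Either way of writing writepages mentioned in the chat (submitting IO directly, or building a list of bios) would end with the same countdown: the delta is complete only after the submitter has dropped its reference and every completion has run.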
2009-03-09 18:35 yes 2009-03-09 18:36 I intended to do it without fork 2009-03-09 18:36 but 2009-03-09 18:36 it is better to do it together 2009-03-09 18:36 and it is better to wait ;) 2009-03-09 18:36 it was worthwhile reading the code and writing a rough draft 2009-03-09 18:37 yes, good 2009-03-09 18:37 this is tied together with ENOSPC handling too 2009-03-09 18:37 rest of patches has posted 2009-03-09 18:37 which is why I worked on it 2009-03-09 18:37 i see 2009-03-09 18:38 but ENOSPC also does not have to be done before kernel merge 2009-03-09 18:38 after all, btrfs still hasn't done it ;) 2009-03-09 18:38 we will just have to ask marcin to not run out of space 2009-03-09 18:39 yes, it's good :) 2009-03-09 18:45 the tux3graph that includes a dump of log blocks raises an interesting issue: the log blocks shown are the ones needed to reconstruct the tree that is shown 2009-03-09 18:45 so, the log blocks are "in the past" while the rest of the tree is "the present" 2009-03-09 18:46 it does give a clear picture of the difference between the on-disk tree state and the cached state 2009-03-09 18:47 ah 2009-03-09 18:48 a tux3graph that shows the pre-replay state of the disk image would be tricky 2009-03-09 18:48 because the pre-replay state can be a disconnected tree 2009-03-09 18:48 yet, this would be very interesting and useful to look at graphically 2009-03-09 18:49 or to put it another way, fsck repair needs to be able to make sense of a partly disconnected tree, in case log replay fails 2009-03-09 18:49 just something to think about 2009-03-09 18:49 i see 2009-03-09 18:50 I'm not sure yet, tux3graph is what to do 2009-03-09 18:50 I also am not sure 2009-03-09 18:50 every time I think about this, my head hurts ;) 2009-03-09 18:50 it may show the image after the replay 2009-03-09 18:50 yes 2009-03-09 18:50 it's very nice for that 2009-03-09 18:53 well, since we never overwrote any part of the unflushed metadata, the "old" tree should be consistent too 2009-03-09 
18:54 so, tux3graph can produce two different tree images, one before replay and one after 2009-03-09 18:54 this would be _very_ useful to see 2009-03-09 18:56 yes 2009-03-09 18:56 well, we would need multiple sb or something like that 2009-03-09 18:56 metadata block 2009-03-09 18:58 the sb strategy (which is not implemented yet) does not change 2009-03-09 18:58 it is the "metablock" strategy, which I wrote a note about 2009-03-09 18:59 hmm 2009-03-09 18:59 yes, correct 2009-03-09 18:59 the highest sequenced metablock gives the current tree root 2009-03-09 19:00 the two different versions of the tree both descend from the same root 2009-03-09 19:00 in every case, except when the root itself is rewritten to make the inode table btree deeper 2009-03-09 19:01 in that case, we do indeed have two roots for a short time 2009-03-09 19:01 I would prefer not to have two roots, ever 2009-03-09 19:02 so we should make that time as short as possible, by requiring a flush when the inode table tree gets deeper. Maybe 2009-03-09 19:03 s/root itself is rewritten/pointer to root is rewritten/ 2009-03-09 19:03 um... 2009-03-09 19:04 if delta was commited, previous defree is freed 2009-03-09 19:04 can be freed 2009-03-09 19:04 so, previous btree may not be available already? 
2009-03-09 19:05 true, the inode index blocks will all be still on disk, but the btree leaf blocks may be freed 2009-03-09 19:06 so we can expect to be able to print out the inode table btree before replay, but not necessarily the file btrees 2009-03-09 19:06 still, it is very useful to be able to see the inode table btree even without the leaves, before replay 2009-03-09 19:07 well, anyway, I guess it's not so hard to do 2009-03-09 19:07 probably 2009-03-09 19:07 just thinking about it 2009-03-09 19:09 ok, that will help think about the current defree logging issue 2009-03-09 19:09 I mean, that already helped 2009-03-09 19:10 i see 2009-03-09 19:10 + assert(buffer->state - BUFFER_DIRTY == ((sb->flush - 1) & (BUFFER_DIRTY_STATES - 1))); <- nice catch 2009-03-09 19:11 you changed ->delta to ->flush 2009-03-09 19:11 yes 2009-03-09 19:12 it hit actually by delta++ in stage_delta() 2009-03-09 19:12 ah, good 2009-03-09 19:13 "error in error check" ;) 2009-03-09 19:13 btw, why does current code do sb->delta++ in stage_delta()? 2009-03-09 19:14 you mean, why not somewhere else? 2009-03-09 19:14 no 2009-03-09 19:15 change_end() is incrementing sb->delta++ 2009-03-09 19:15 and stage_delta() is also incrementing it 2009-03-09 19:15 again 2009-03-09 19:15 whoops 2009-03-09 19:15 I just missed it 2009-03-09 19:15 it was why that assert hit 2009-03-09 19:15 ah 2009-03-09 19:16 it's because I wrote sb->delta++ in one place and ++sb->delta in another 2009-03-09 19:16 the one in change_end should be removed I think 2009-03-09 19:17 change_end's? 2009-03-09 19:18 ah 2009-03-09 19:18 no 2009-03-09 19:18 I guess sb->delta++ should be before sb->commit list flush? 
2009-03-09 19:19 yes 2009-03-09 19:20 ok 2009-03-09 19:20 so that fork works properly for in-flight IO 2009-03-09 19:21 probably 2009-03-09 19:22 when I wrote that, I was thinking that we would not allow fork for now, and so it did not matter whether ->delta++ is before or after initiating the ->commit IO 2009-03-09 19:23 i see 2009-03-09 19:23 it is better to do it in the correct position 2009-03-09 19:26 yes 2009-03-09 19:33 ok, to state the flush_log issue simply, we emptied the log without ever writing some of it to disk 2009-03-09 19:34 s/we/I/ :) 2009-03-09 19:35 however, maybe, almost of logs is not needed 2009-03-09 19:36 it must all be needed for replay 2009-03-09 19:36 hmm, some might not be needed, if full bitmap blocks are being written out in the flush 2009-03-09 19:37 yes 2009-03-09 19:37 err, no, the logged changes to bitmaps are for the _next_ delta 2009-03-09 19:37 balloc are already in bitmap data blocks 2009-03-09 19:37 yes 2009-03-09 19:38 but, I guess bfree is only for next delta in the case of bitmap 2009-03-09 19:41 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-09 19:41 yes 2009-03-09 19:42 so it is important not to lose that log block, as we now do 2009-03-09 19:43 now, sb->logbase = sb->lognext is after bitmap flush 2009-03-09 19:44 and it should be before 2009-03-09 19:44 I think that is it 2009-03-09 19:44 ok 2009-03-09 19:44 it was 2009-03-09 19:44 however, somehow, I reverted to original 2009-03-09 19:45 ah, so you already realized this 2009-03-09 19:46 I'm not sure though 2009-03-09 19:51 -!- amey_m(~amey@117.195.36.174) has joined #tux3 2009-03-09 20:05 -!- chesse_(~eworm@dslb-084-062-183-063.pools.arcor-ip.net) has joined #tux3 2009-03-09 20:18 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-09 21:04 hirofumi, still there? 
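The flush_log issue stated above (the log is emptied after the bitmap flush, so the bfree record logged during that flush is lost) comes down to the position of `sb->logbase = sb->lognext`. The following toy model illustrates the two orderings. It is a sketch based on one reading of the chat, not tux3 code: the log is reduced to a window of record numbers, and the bitmap flush is assumed to log exactly one record.

```c
#include <assert.h>

/*
 * Toy model, not tux3 code: the live log is the window
 * [logbase, lognext) of record numbers.
 */
struct toy_sb {
	unsigned logbase, lognext;
};

/* Flushing the bitmap logs one bfree record meant for the next delta. */
static void toy_flush_bitmap(struct toy_sb *sb)
{
	sb->lognext++;
}

/* Ordering described as buggy: empty the log after the bitmap flush. */
static unsigned toy_flush_log_late(struct toy_sb *sb)
{
	toy_flush_bitmap(sb);
	sb->logbase = sb->lognext;	/* the new bfree record is dropped */
	return sb->lognext - sb->logbase;
}

/* Ordering proposed in the chat: advance logbase before the flush. */
static unsigned toy_flush_log_early(struct toy_sb *sb)
{
	sb->logbase = sb->lognext;
	toy_flush_bitmap(sb);		/* the record stays in the window */
	return sb->lognext - sb->logbase;
}
```

Each function returns how many records are carried into the next delta. Starting from a log with three records, the late ordering carries zero and the early ordering carries one, which matches the point that the bfree record must not be lost.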
2009-03-09 21:04 yes 2009-03-09 21:05 -!- chesse(~eworm@dslb-084-062-166-043.pools.arcor-ip.net) has joined #tux3 2009-03-09 21:06 I did a remote add and expected to see .git/tux3 2009-03-09 21:06 like this: git remote add tux3 /src/tux3git 2009-03-09 21:07 yes 2009-03-09 21:07 I'm trying to follow these instructions: http://www.kernel.org/pub/software/scm/git/docs/howto/using-merge-subtree.html 2009-03-09 21:07 but there is no .git/tux3 after the remote add 2009-03-09 21:08 grep tux3 .git -rI 2009-03-09 21:08 .git/config:[remote "tux3"] 2009-03-09 21:08 .git/config: url = /src/tux3git 2009-03-09 21:08 .git/config: fetch = +refs/heads/*:refs/remotes/tux3/* 2009-03-09 21:08 ah 2009-03-09 21:08 I did not fetch 2009-03-09 21:09 ok :) 2009-03-09 21:09 what would be the fetch command? 2009-03-09 21:09 git fetch ??? 2009-03-09 21:09 git fetch tux3? 2009-03-09 21:09 :) 2009-03-09 21:10 git merge -s ours --no-commit tux3/master 2009-03-09 21:10 Automatic merge went well; stopped before committing as requested 2009-03-09 21:10 ACTION is glad it went well 2009-03-09 21:23 it seems not what we want to do 2009-03-09 21:27 true 2009-03-09 21:33 well, after those steps the tux3 history is in git 2009-03-09 21:34 but the files are in the wrong directories, and we do not want the /user/ directory 2009-03-09 21:34 yes 2009-03-09 21:35 I don't know to do it eaisly 2009-03-09 21:35 I was going to do "git format-patch", then modify it 2009-03-09 21:36 so it is easy to get fs/tux3/{doc, user} 2009-03-09 21:37 then I could delete the unwanted files and move the others, but that would leave some ugly history in the git repo 2009-03-09 21:37 yes 2009-03-09 21:37 so probably want I want to do is change the remote repository first, then somehow delete some unwanted history 2009-03-09 21:37 :p 2009-03-09 21:38 you strategy with patch is more likely to succeed 2009-03-09 21:38 probably 2009-03-09 21:46 ok, well at least the problem is reduced to 1) make a git repo that just has kernel/* in 
the root; 2) pull it as a subtree 2009-03-09 21:54 it would need to modify email address, and add signed-off-by 2009-03-09 21:55 not sure though 2009-03-09 21:56 cd user/kernel; git format-patch --relative .. 2009-03-09 21:56 and modify 2009-03-09 21:56 well 2009-03-09 22:22 -!- RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3 2009-03-09 22:26 cd user/kernel; git format-patch -o .. 2009-03-09 22:26 cd 2009-03-09 22:26 find -empty|xargs rm -f 2009-03-09 22:26 perl -spi -e 's#^diff --git a/(.*) b/(.*)##' * 2009-03-09 22:26 perl -spi -e 's#^--- a/(.*)#--- a/fs/tux3/$1#' * 2009-03-09 22:26 perl -spi -e 's#^\+\+\+ b/(.*)#+++ b/fs/tux3/$1#' * 2009-03-09 22:27 then modify email 2009-03-09 22:27 and probably signed-off-by 2009-03-09 22:27 git am /* 2009-03-09 22:53 is there any canned git command for applying a bunch of patches in a directory, each treated as a separate commit? 2009-03-09 22:54 ah, format-patch is sensitive to the directory it is run in? 2009-03-09 23:09 yes, if --relative 2009-03-09 23:09 and "git am" can apply the patches 2009-03-09 23:10 --relative? 2009-03-09 23:12 --relative makes patches in relative path from current working dir 2009-03-09 23:13 oh, above 2009-03-09 23:13 sorry 2009-03-09 23:13 should have read closer 2009-03-09 23:21 -!- flips(~phillips@phunq.net) has joined #tux3 2009-03-09 23:23 how do I specify just the chain of revisions from the tip back to the initial commit? 2009-03-09 23:23 git makes this unreasonably difficult 2009-03-09 23:24 it should be the default 2009-03-09 23:25 usually, it's easy 2009-03-09 23:25 HEAD.. 2009-03-09 23:25 I tried master.. 2009-03-09 23:25 but, format-patch is strange 2009-03-09 23:25 well 2009-03-09 23:25 strange things 2009-03-09 23:25 git log|tail 2009-03-09 23:25 I did it 2009-03-09 23:25 get commit-id from it 2009-03-09 23:26 and .. 
2009-03-09 23:26 I did that too, I got everything except the initial commit 2009-03-09 23:26 so there is a --root syntax (which should be the default) 2009-03-09 23:27 there's another problem, it doesn't just give the chain back through the parents 2009-03-09 23:27 it gives all parents, including merges 2009-03-09 23:27 so the resulting set of patches does not apply in order 2009-03-09 23:29 I guess merge commit is needed 2009-03-09 23:29 well, it can filter 2009-03-09 23:29 merge commit? 2009-03-09 23:30 including merges means commit of merge? 2009-03-09 23:31 when I try applying the resulting patches, I get double applies beginning at mercurial changeset 82:6f97b85da869 "More 64 bit format string cleanups" 2009-03-09 23:32 which is the first place a merge occurs, between me and shapor 2009-03-09 23:34 is it userland? 2009-03-09 23:35 yes 2009-03-09 23:35 well the same thing will happen in the kernel directory 2009-03-09 23:36 I need to teach git to just give one one unique chain of patches, from the tip back to the initial commit 2009-03-09 23:37 it is kind of surprising that it produces sets of patches that will not apply, by default 2009-03-09 23:40 -!- cdk(~chinmay@121.246.32.110) has joined #tux3 2009-03-09 23:43 it would be problem of hg2git 2009-03-09 23:44 I've done by hg-to-git.py 2009-03-09 23:44 and format-patch 2009-03-09 23:44 ah 2009-03-09 23:44 I didn't use hg2git 2009-03-09 23:45 not pretty good 2009-03-09 23:45 however, it seems not so bad 2009-03-09 23:45 ok, I will try hg-to-git 2009-03-09 23:45 I'll put patches 2009-03-09 23:47 you're already done? 
2009-03-09 23:47 there are some unknown emails 2009-03-09 23:47 yes 2009-03-09 23:47 maybe, need to adjust 2009-03-09 23:47 I have no chance of keeping up with you on things like this :) 2009-03-09 23:47 ok 2009-03-09 23:48 bobby@lappy 2009-03-09 23:48 Michael Pattrick 2009-03-09 23:48 qhfeng 2009-03-09 23:48 macan 2009-03-09 23:48 ok, I will find full emails/names for those 2009-03-09 23:48 do you know correct emails? 2009-03-09 23:49 perl -spi -e 's#daniel\@moonbase.phunq.net \<\>#Daniel Phillips #' * 2009-03-09 23:49 perl -spi -e 's#Jonas Fietz <>#Jonas Fietz #' * 2009-03-09 23:49 perl -spi -e 's#Pranith Kumar <>#Pranith Kumar #' * 2009-03-09 23:49 perl -spi -e 's#hirofumi <>#OGAWA Hirofumi #' * 2009-03-09 23:49 perl -spi -e 's#shapor\@yzf.shapor.com <>#Shapor Naghibzadeh #' * 2009-03-09 23:49 perl -spi -e 's#bobby@lappy#Pranith Kumar #' * 2009-03-09 23:49 perl -spi -e 's#Michael Pattrick <>#Michael Pattrick #' * 2009-03-09 23:49 perl -spi -e 's#qhfeng <>#qhfeng #' * 2009-03-09 23:49 perl -spi -e 's#macan <>#macan #' * 2009-03-09 23:49 this is current convert scripts for emails 2009-03-09 23:50 bobby@lappy is Pranith Kumar 2009-03-09 23:51 bobby@lappy is "Pranith Kumar" 2009-03-09 23:53 ok 2009-03-09 23:53 "Michael Pattrick" 2009-03-09 23:54 ok 2009-03-09 23:54 qhfeng is Qinghuang Feng ok 2009-03-09 23:55 I'm having trouble with macan ;) 2009-03-09 23:56 macan(~macan@xbl.dnsbl.oftc.net) 2009-03-09 23:56 nick 2009-03-09 23:57 um... 2009-03-09 23:57 2009-01-07 23:58 the latest tux3 makes VFS complain, and... 
2009-03-09 23:58 ah, I'm getting closer 2009-03-09 23:59 macan is: "Ma Can" 2009-03-09 23:59 we have fallen behind on updating the Tux3 hall of fame 2009-03-09 23:59 ok 2009-03-10 00:00 I need to fix that 2009-03-10 00:01 ok, time to review my notes on maintaining the kernel.org files 2009-03-10 00:01 I am sure you have done this many times 2009-03-10 00:09 http://userweb.kernel.org/~hirofumi/hg-git/ 2009-03-10 00:09 it's not so bad 2009-03-10 00:09 however, we may need to modify again 2009-03-10 00:15 grabbing... 2009-03-10 00:17 sorry about my long initial commit comments :) 2009-03-10 00:17 at that point, they looked better in hgweb, that was because hgweb was unfinished 2009-03-10 00:19 btw, you may want to use "git am -s" 2009-03-10 00:19 signed-off-by 2009-03-10 00:20 it also works with this: 2009-03-10 00:21 for this in $(ls $1); do 2009-03-10 00:21 patch <$1/$this -p1 || exit 2009-03-10 00:21 done 2009-03-10 00:21 ? 2009-03-10 00:21 just a batch loop to call patch 2009-03-10 00:21 lost commit and comments? 2009-03-10 00:21 sure 2009-03-10 00:22 the commit can be added, along with -m etc 2009-03-10 00:22 but git-am is no doubt a better approach 2009-03-10 00:22 yes 2009-03-10 00:22 I started writing that when I didn't know about git-am 2009-03-10 00:23 it reads date, authoer, etc. from git formatted email 2009-03-10 00:23 right 2009-03-10 00:23 actually, it reads any file, iirc 2009-03-10 00:24 but, git formatted email would make better result 2009-03-10 00:25 it's really nice 2009-03-10 00:25 yes 2009-03-10 00:25 it is easy to apply from email patch 2009-03-10 00:25 it also can read from pipe 2009-03-10 00:31 here they come, about 3/sec 2009-03-10 00:32 other than picking up the most recent patches from the hg repo, what other changes are needed? 
2009-03-10 00:37 it looks good enough to use now 2009-03-10 00:37 I suppose I could go edit the long commit message lines to be more reasonable 2009-03-10 00:38 I'm not sure for now 2009-03-10 00:39 ok, well it looks fine 2009-03-10 00:39 I meant we may notice the problem later 2009-03-10 00:39 it is our actual history 2009-03-10 00:39 yes 2009-03-10 00:39 it is only user/kernel/* 2009-03-10 00:39 yes, perfect 2009-03-10 00:39 so, it may be a bit strange 2009-03-10 00:40 it's good enough, if somebody wants the history of the other files they can see the hg repo 2009-03-10 00:40 well, yes, it looks like good 2009-03-10 00:41 also, I think it is reasonable to git-mail changes from hg to the git repo, or the other way, as we work 2009-03-10 00:41 well 2009-03-10 00:41 going both ways could get ugly 2009-03-10 00:42 we will worry about that the first time somebody wants us to pull from their git tree 2009-03-10 00:42 ok, how about this... 2009-03-10 00:43 if we do get changes in the git tree that we want to take back into hg, we just take them all in one big diff with commit message "sync up with git" 2009-03-10 00:43 so the the kernel hacking history is in the git repo 2009-03-10 00:43 on the other hand, when we push from hg into the git repo, we use hg-to-git 2009-03-10 00:44 so that the git repo has the more complete history, but we can still offer our contributers the ease of use of mercurial 2009-03-10 00:44 the more I use Git, the more I feel it is a good way to confuse new project members 2009-03-10 00:48 um... 2009-03-10 00:49 my proposal doesn't quite work :) 2009-03-10 00:50 well I will think about it 2009-03-10 00:50 the important thing is to get our current kernel code into git with history 2009-03-10 00:51 yes 2009-03-10 00:57 ok, I think I will edit some of the worst spelling mistakes and commit comments 2009-03-10 00:58 like the long lines 2009-03-10 00:58 Signed-off-by... 
I don't think we need it 2009-03-10 00:58 I will just do one Signed-off-by for everybody, for the initial commit 2009-03-10 00:59 after merge we can take signed-off-by in the usual way 2009-03-10 00:59 we don't need to fake it for our history 2009-03-10 01:00 I guess signed-off-by is your's only 2009-03-10 01:00 for now 2009-03-10 01:00 yes 2009-03-10 01:01 until merge, or if somebody sends git patches/pulls 2009-03-10 01:02 I could mindlessly add Signed-off-by to every patch now, but I am not sure what use that would be 2009-03-10 01:04 it is to protect from SCO like crazy people 2009-03-10 01:04 crazy people like SCO 2009-03-10 01:04 sure 2009-03-10 01:12 well, since the way a filesystem enters kernel is just to pull the Git tree, I better add Signed-off-by to every commit 2009-03-10 01:15 yes 2009-03-10 01:18 where is the home page for hg-to-git? 2009-03-10 01:19 http://git.grml.org/?p=hg-to-git.git;a=summary <- maybe 2009-03-10 01:20 it is from git tree 2009-03-10 01:20 ah 2009-03-10 01:20 git.git/contrib/hg-to-git 2009-03-10 01:20 found it 2009-03-10 02:34 usually the kernel desides in which shape to merge, btrfs afaik was the first which history merge, don't know how they decide in your case 2009-03-10 02:55 I expect, history is wanted 2009-03-10 02:55 but maybe not as much as btrfs had 2009-03-10 02:56 that was really a lot, and now every git tree carries 2009-03-10 02:56 carries it 2009-03-10 05:04 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-03-10 07:11 -!- cdk(~chinmay@121.246.32.110) has joined #tux3 2009-03-10 07:42 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-03-10 08:07 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-10 08:35 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-10 08:49 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-10 09:32 -!- 
tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-10 09:49 -!- gaurav(~gaurav@59.95.58.17) has joined #tux3 2009-03-10 10:10 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-10 10:12 -!- dcg(~dcg@74.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-10 11:32 -!- kunir(~kunir@dsl-hkibrasgw2-fef0de00-120.dhcp.inet.fi) has joined #tux3 2009-03-10 11:36 -!- dagle(~dagle@host162-104.bornet.net) has joined #tux3 2009-03-10 11:46 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-03-10 11:55 -!- gaurav(~gaurav@59.95.12.78) has joined #tux3 2009-03-10 12:02 -!- kunir(~kunir@dsl-hkibrasgw2-fef0de00-120.dhcp.inet.fi) has joined #tux3 2009-03-10 12:04 -!- kunir(~kunir@dsl-hkibrasgw2-fef0de00-120.dhcp.inet.fi) has joined #tux3 2009-03-10 12:14 flips, there? 2009-03-10 13:20 -!- dcg(~dcg@73.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-10 15:36 hi cdk 2009-03-10 15:37 some bugs seem to have slipped past me in the mkdir patch 2009-03-10 15:38 sync_super(inode->i_sb); after tuxclose(inode) ... wrong 2009-03-10 15:38 tuxclose calls free(inode) 2009-03-10 15:39 oops 2009-03-10 15:39 also the use of tuxclose() and tuxsync() calls there looks a bit wrong 2009-03-10 15:39 I will look a little closer 2009-03-10 15:39 I only briefly scanned it 2009-03-10 15:40 my thinking is: any patch, even with bugs, is better than an unimplemented function 2009-03-10 15:40 got a patch? 2009-03-10 15:40 no .. 
wanted to discuss the tuxsync and tuxclose with you first 2009-03-10 15:41 ok 2009-03-10 15:42 i guess in both tux3_create parent_ino can be closed 2009-03-10 15:42 and in mkdir inode can be closed as well 2009-03-10 15:43 ACTION looks 2009-03-10 15:45 on the deduplication from just knocked off 2 out of 3 things from our TODO list 2009-03-10 15:45 1) Make bucket handling per inode rather than a global bucket 2009-03-10 15:45 2) Collision handling 2009-03-10 15:45 3) Kernel porting 2009-03-10 15:45 now only kernel porting remain. 2009-03-10 15:46 of course more testing and bug fixes continue 2009-03-10 15:46 how much longer have you got on the project? 2009-03-10 15:47 we have a review coming in the next week. 2009-03-10 15:47 after that i guess another week. 2009-03-10 15:48 will get a bit busy with examinations after that 2009-03-10 15:48 it's moving along very well 2009-03-10 15:49 yes :) thanks to your continued support :) 2009-03-10 15:49 heh, thanks to your hard work I think 2009-03-10 15:50 i guess both are necessary 2009-03-10 15:54 tux3_release seems to do the inode closing you want, why don't you check to see if it is already called after the mkdir? 2009-03-10 15:54 ok 2009-03-10 15:55 note: the inode closing is wrong all the way through tux3fuse, we should not be closing inodes after every operation, and should not be doing sync_super after every operation 2009-03-10 15:56 this was the lazy initial implementation that nobody has fixed yet 2009-03-10 15:56 anyway, we want your mkdir to be wrong in the same way 2009-03-10 15:56 yes. we are waiting for atomic commit right ? 
2009-03-10 15:56 it can be fixed even without atomic commit 2009-03-10 15:56 we're just busy :) 2009-03-10 15:58 tux3_release does not seem to be called 2009-03-10 15:59 ok, well you can do the tuxclose/sync_super just like it does 2009-03-10 15:59 preferably with error return 2009-03-10 15:59 sorry 2009-03-10 16:00 tuxsync/sync_super 2009-03-10 16:00 tuxsync(inode); 2009-03-10 16:00 if ((errno = -sync_super(sb))) 2009-03-10 16:00 goto eek; 2009-03-10 16:01 we also have a tux3_releasedir 2009-03-10 16:03 tux3_release calls tuxclose/syncsuper 2009-03-10 16:11 what do we do about our kernel port , map_region problem ? 2009-03-10 16:11 ok, I will elaborate on my suggestion 2009-03-10 16:12 so, the issue is, tux3_get_block knows about the _contents_ of the buffer, and map_region knows how to update dleaf pointers 2009-03-10 16:12 the challenge is to get them to work together 2009-03-10 16:13 i think we can calculate the hash tux3_get_block and then send it to map_region 2009-03-10 16:13 so my suggestion is to have tux3_get_block check the hash, and pass the physical block number in the map[] array to map_region if there is a match 2009-03-10 16:13 yes 2009-03-10 16:13 well 2009-03-10 16:13 don't send the hash, send the address of the matching block 2009-03-10 16:14 you could also add an extra parameter to map_region for the hash, but why? 2009-03-10 16:14 map_region doesn't need it 2009-03-10 16:14 and use map_region to make entry in the dleaf 2009-03-10 16:14 yes 2009-03-10 16:15 you will need to copy out the passed-in block number, because map_region will use the map, and you don't want to mess with that code much 2009-03-10 16:15 for now, you can expect that count will always be 1 2009-03-10 16:15 ok, 2009-03-10 16:17 we will continue working on that. 2009-03-10 16:17 its time to get a nap close to 0500 here. 2009-03-10 16:17 bye. 
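The mkdir bug noted above, `sync_super(inode->i_sb)` after `tuxclose(inode)` when tuxclose frees the inode, is a use-after-free. A minimal sketch of the safe ordering follows. All names are stand-ins invented for illustration and none of this is the tux3fuse code: the point is only that the superblock pointer has to be copied out before the inode is released.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins, not tux3 code. */
struct toy_sb {
	int synced;
};

struct toy_inode {
	struct toy_sb *i_sb;
};

static int toy_sync_super(struct toy_sb *sb)
{
	sb->synced++;
	return 0;
}

/* Like the tuxclose described in the chat, this frees the inode. */
static void toy_close(struct toy_inode *inode)
{
	free(inode);
}

/* Safe ordering: save sb first, so nothing reads the freed inode. */
static int toy_finish_op(struct toy_inode *inode)
{
	struct toy_sb *sb = inode->i_sb;

	toy_close(inode);
	return toy_sync_super(sb);
}
```

The form suggested in the chat avoids the problem differently, by calling tuxsync on the inode instead of closing it and passing an already saved `sb` to sync_super, with the error return checked.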
2009-03-10 16:17 thanks 2009-03-10 16:19 bye 2009-03-10 19:19 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-10 20:05 -!- chesse_(~eworm@dslb-084-062-162-062.pools.arcor-ip.net) has joined #tux3 2009-03-10 20:24 git refuses to create a branch in a git repository that has no non-empty files 2009-03-10 20:25 kind of line the ancient sumarians who had an arithmetic system without the concept of zero :) 2009-03-10 20:33 however you can put come bogus text in a file, commit it, then delete the file and commit that, giving an empty repo 2009-03-10 20:33 :p 2009-03-10 20:34 the problem is, branch master is not created until the initial commit 2009-03-10 20:34 which is bogus in itself 2009-03-10 20:34 it should be created by git init 2009-03-10 20:35 and commit refuses to run on a repo that it thinks is empty 2009-03-10 20:35 it does not treat empty files or directories as really existing for the purpose of commit 2009-03-10 20:35 this is very bogus 2009-03-10 20:53 ok, check this out: http://phunq.net/files/tux3git.tgz 2009-03-10 20:53 git repo with just fs/tux3 in it 2009-03-10 20:53 can import that into a full git tree using a subtree merge 2009-03-10 20:54 I will test that theory pretty soon :) 2009-03-10 21:05 -!- chesse(~eworm@dslb-084-062-149-065.pools.arcor-ip.net) has joined #tux3 2009-03-10 22:22 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-10 22:25 -!- tim_dimm_(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-10 22:33 -!- cdk(~chinmay@121.246.32.110) has joined #tux3 2009-03-10 22:54 -!- cdk(~chinmay@121.246.32.110) has joined #tux3 2009-03-10 23:20 ok, I have an up to date clone of linus's kernel tree with tux3 merged into it 2009-03-10 23:20 now, should tux3 be on a tux3 branch or on master branch? 
2009-03-10 23:20 I imagine, just on master, not on a branch 2009-03-10 23:32 -!- RazvanM(~RazvanM@pool-173-67-52-67.bltmmd.east.verizon.net) has joined #tux3 2009-03-11 01:03 hey flips 2009-03-11 01:03 hi 2009-03-11 01:03 how's it going ? well ? 2009-03-11 01:03 yes, fine 2009-03-11 01:03 good 2009-03-11 01:05 tux3 development ? was there a posting about progress recently ? 2009-03-11 01:06 did know that you did a talk recently, I might have driven up to the expo if I had known about it 2009-03-11 01:07 there will be a post about to.do soon 2009-03-11 01:07 ok 2009-03-11 01:07 good 2009-03-11 01:07 regular postings makes your work publicly known 2009-03-11 01:07 indeed 2009-03-11 01:07 otherwise you just fall over the overload bitstream of life 2009-03-11 01:08 and into a pit of oblivion 2009-03-11 01:13 sunny thoughts 2009-03-11 01:13 but bug fixing has gone well right ? 2009-03-11 01:13 no "oh crap" moments, eh ? 2009-03-11 01:29 no 2009-03-11 01:29 situation nominal 2009-03-11 01:34 -!- RazvanM_(~RazvanM@96.234.240.160) has joined #tux3 2009-03-11 02:40 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2009-03-11 02:49 flips: goo 2009-03-11 02:49 good 2009-03-11 05:03 hirofumi, there? 2009-03-11 05:03 yes 2009-03-11 05:04 well, the tux3 git repo is 90% uploaded to kernel.org, I guess I will cancel that 2009-03-11 05:04 should Tux3 be added on a branch? 2009-03-11 05:05 um... 2009-03-11 05:05 if we are going to recreate repo again, master would be good 2009-03-11 05:06 it's on master now 2009-03-11 05:06 ah 2009-03-11 05:06 if we are going to recreate repo again, master would not be good 2009-03-11 05:07 so a branch is good? 2009-03-11 05:07 I meant, if we are using this repo for merge to linus, master would be good 2009-03-11 05:07 if not, branch _may_ be useful 2009-03-11 05:08 I think we would go into -mm rather than linus 2009-03-11 05:08 or? 
2009-03-11 05:08 what would be the point of being in -mm 2009-03-11 05:08 since it has no impact on the rest of the kernel 2009-03-11 05:09 yes 2009-03-11 05:09 well, merge to linus or -mm or -next 2009-03-11 05:09 right, my plan is to CC akpm and ask him what form is best 2009-03-11 05:10 he would like quilt or git 2009-03-11 05:11 and if we recreate the repo (commit-id was recreated), it would become trouble 2009-03-11 05:13 right, it would be a problem for anybody who cloned it 2009-03-11 05:13 yes 2009-03-11 05:13 well, cloning from Linus mainline seems safe 2009-03-11 05:13 -mm and -next track it 2009-03-11 05:13 yes 2009-03-11 05:14 based on linus tree is neccesary 2009-03-11 05:14 ok, so I will fix the email addresses and redo the upload 2009-03-11 05:14 yes 2009-03-11 05:14 actually 2009-03-11 05:14 well 2009-03-11 05:15 probably, actually it would depend on our workflow 2009-03-11 05:15 I was going to say, I could expose my git repo on phunq.net, and pull into the repo I just uploaded 2009-03-11 05:15 but no 2009-03-11 05:15 I will do more simply, and slower 2009-03-11 05:15 yes 2009-03-11 05:16 you can just push to kernel.org 2009-03-11 05:16 push can be done via ssh 2009-03-11 05:19 well I can't fix the email addresses by a push 2009-03-11 05:20 yes 2009-03-11 05:26 -!- amey_m(~amey@117.195.33.199) has joined #tux3 2009-03-11 05:43 making the git repo with just fs/tux3 was worth it to get the catch re email addresses 2009-03-11 05:57 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-03-11 05:59 new git tree is on its way to kernel.org, ETA 2.5 hours 2009-03-11 05:59 I guess I better get some sleep 2009-03-11 06:00 ok, oyasumi 2009-03-11 06:01 Is it correct if I reply "oyasumi" ? 
2009-03-11 06:02 in English that makes sense 2009-03-11 06:02 I should not assume about japanese though 2009-03-11 06:02 yes 2009-03-11 06:02 oyasumi is correct as answer too 2009-03-11 06:02 it is like good night 2009-03-11 06:05 ok, oyasumi then 2009-03-11 06:57 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-11 08:16 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-11 08:50 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-11 10:02 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-03-11 10:10 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-11 10:16 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-03-11 10:58 -!- amey_m(~amey@117.195.33.199) has joined #tux3 2009-03-11 11:22 -!- cdk(~chinmay@121.246.32.110) has joined #tux3 2009-03-11 11:32 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-11 12:46 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-11 13:10 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-11 14:35 flipzzz: https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781 2009-03-11 16:31 bh, it sounds like mingming is going to be pretty busy for a while 2009-03-11 16:31 well, and I guess we are going to be pretty busy fixing Andrew's comments 2009-03-11 16:40 flips, it's the delayed allocation in ext4 gets the ondisk file size update less often than before. 
2009-03-11 16:40 flips, I think Ted had proposed a solution: force block allocation at file close 2009-03-11 16:40 mingming, maybe you need to put the new file size in the journal 2009-03-11 16:41 ah 2009-03-11 16:41 good luck :) 2009-03-11 16:41 the new file size are journalled, but it won't get updated(and journalled) until blocks are allocated 2009-03-11 16:42 well maybe journal it before allocating blocks 2009-03-11 16:42 with the delayed allocation we postpone the new file size update from the write() time to dirty page flush time 2009-03-11 16:42 you can always notice that the blocks haven't been allocated yet 2009-03-11 16:43 ACTION thinks about how tux3 will handle this 2009-03-11 16:43 deltas are supposed to handle that accurately 2009-03-11 16:44 the new blocks and the new file size are always together in the same delta 2009-03-11 16:44 so the size either changes, and the new blocks arrive on disk, or neither changes 2009-03-11 16:45 that's what ext4 does also: the block allocation and the size are in the same transaction 2009-03-11 16:45 but the in memory size is updated, makes user "think" the file has expanded. 2009-03-11 16:45 ondisk the block allocation and the new file size has not updated 2009-03-11 16:46 ext3 has the same issue, it is the window is much smaller 2009-03-11 16:49 -!- tim_dimm_(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-11 17:04 there's extensive commentary from ted in the ubuntu bug 2009-03-11 17:04 should be required reading for tux3ish people :) 2009-03-11 17:10 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-03-11 17:12 flips: true, I just finished the thread 2009-03-11 17:12 :) 2009-03-11 17:12 big thread 2009-03-11 17:13 the last comment says 1h ago, I don't dare to hit the refresh button 2009-03-11 17:13 I"m making driveby changes re akpm's comments 2009-03-11 17:13 big job 2009-03-11 17:13 btw: what is the cast with the best open-source drivers? 2009-03-11 17:14 cast? 
2009-03-11 17:14 case? 2009-03-11 17:15 drivers for what? 2009-03-11 17:15 case :D 2009-03-11 17:15 on my desktop I'm using the framebuffer :D (intelfb) 2009-03-11 17:15 ACTION is an open-source SUV driver 2009-03-11 17:15 not sure whether I'm the best though 2009-03-11 17:15 it was the best way to get fast desktop switching for wmii/fluxbox 2009-03-11 17:16 ah 2009-03-11 17:16 something from the thread 2009-03-11 17:16 I haven't read it all yet 2009-03-11 17:16 I thought I'd better start working on akpm's comments first 2009-03-11 17:16 I'm on my way to compile tux3 2009-03-11 17:17 2.6.28.7 should work, right? 2009-03-11 17:17 I think I tried on 2.6.28 at some point and it didn't work 2009-03-11 17:17 it should 2009-03-11 17:17 I forget exactly where in 2.6.28.7 it starts working 2009-03-11 17:18 you can also pull from the tux3 git repo, to get a full kernel along with your tux3 code 2009-03-11 17:18 sorry, exactly where in 2.6.28.* 2009-03-11 17:18 when the buf delay stuff was added 2009-03-11 17:19 the same that is keeping mingming busy now :) 2009-03-11 17:19 I also want to compile btrfs and I would like to avoing pulling git trees 2009-03-11 17:19 sure 2009-03-11 17:19 well btrfs is in the same tree 2009-03-11 17:19 it's linus's bleeding edge tree 2009-03-11 17:20 nice! 2009-03-11 17:20 maybe it's time to start getting in the habit of pulling git? 
2009-03-11 17:21 the thing is, since everything has linus's tree as a parent, if you pull it once then pulls of other trees will be much faster 2009-03-11 17:21 I do usually use git 2009-03-11 17:21 but this time I wanted to compile all stuff from .0 :P 2009-03-11 17:21 anyway, our mission is to support both ways 2009-03-11 17:21 hg is still the primary repo 2009-03-11 17:22 I don't see btrfs in 2.6.28.7 :| 2009-03-11 17:22 actually, my working directory is managed both by git and hg at the moment 2009-03-11 17:22 so git is the way :D 2009-03-11 17:22 hmm 2009-03-11 17:23 it's pretty bizarre that you can have two version control systems working on the same file, but it works fine 2009-03-11 17:23 I did that for svn and cvs 2009-03-11 17:23 very bad combination 2009-03-11 17:23 heh 2009-03-11 17:23 difference with hg and git is, it works 2009-03-11 17:23 I switch to git and life was heaven 2009-03-11 17:23 it's nice actually 2009-03-11 17:24 cloning in progress... 2009-03-11 17:33 done 2009-03-11 17:33 pretty slow :| 2009-03-11 17:33 that was way faster that I got 2009-03-11 17:33 663M linux-tux3/ 2009-03-11 17:34 I cloned at 40 kb/sec 2009-03-11 17:34 I'm at school :P 2009-03-11 17:34 took 2.5 hours 2009-03-11 17:39 first attempt to compile... 2009-03-11 17:54 ...and? 2009-03-11 17:56 my script to compile everything didn't work 2009-03-11 17:56 $ grep TUX .config 2009-03-11 17:56 CONFIG_USB_ADUTUX=m 2009-03-11 17:57 after I did a make allmodconfig 2009-03-11 17:58 $ grep BTRFS .config 2009-03-11 17:58 CONFIG_BTRFS_FS=m 2009-03-11 17:58 CONFIG_BTRFS_FS_POSIX_ACL=y 2009-03-11 17:59 this fellow is fine 2009-03-11 18:02 he gives me two warning though :P (btrfs) 2009-03-11 18:10 flips: what about andrew's comments ? 2009-03-11 18:11 bh, send patches 2009-03-11 18:11 ok 2009-03-11 18:11 so folks are working on this problem eh ? 
2009-03-11 18:11 seems like the commits could be batched up better 2009-03-11 18:12 because KDE blowing out like that is wacked 2009-03-11 18:12 it's an arguments for nvram if there was a fsync associated with it 2009-03-11 18:16 I'm sure it can be fixed 2009-03-11 18:16 whether it can be fixed without slowing it down is an interesting question 2009-03-11 19:11 flips, interesting thread re: ext4 and nvram 2009-03-11 19:11 got no time to read until akpm's comments are addressed 2009-03-11 19:12 that was interesting too 2009-03-11 19:12 no dinner for you until you finish all your homework 2009-03-11 19:12 :) 2009-03-11 19:35 :) 2009-03-11 19:50 -!- amey_m(~amey@117.195.32.60) has joined #tux3 2009-03-11 20:05 -!- chesse_(~eworm@dslb-084-062-138-234.pools.arcor-ip.net) has joined #tux3 2009-03-11 21:05 -!- chesse(~eworm@dslb-084-062-188-060.pools.arcor-ip.net) has joined #tux3 2009-03-11 21:50 http://farm4.static.flickr.com/3609/3347815435_13bc203d61_o.png :P 2009-03-11 21:50 featuring both tux3 and btrfs 2009-03-11 21:52 the jdb2 sharing between ext4 and ocfs2 shows up 2009-03-11 21:52 and also the fact that jdb is only used by ext3 2009-03-11 21:54 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-03-11 21:59 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-11 22:56 ACTION tries to figure out RazvanM's png 2009-03-11 23:39 -!- RazvanM(~RazvanM@96.234.240.160) has joined #tux3 2009-03-11 23:54 -!- cdk(~chinmay@115.109.11.1) has joined #tux3 2009-03-12 00:21 hi flips 2009-03-12 00:33 2 2009-03-12 00:34 hey shapor 2009-03-12 00:34 -!- gaurav(~gaurav@115.109.11.1) has joined #tux3 2009-03-12 00:34 had a small request 2009-03-12 00:35 hi cdk 2009-03-12 00:35 whats up 2009-03-12 00:36 hi cdk 2009-03-12 00:38 read the tux3 report 2009-03-12 00:39 thanks for mentioning our work 2009-03-12 00:40 it's fine work 2009-03-12 00:40 well get more mentionds 2009-03-12 00:41 I wanted to do a whole post on it, but didn't have 
time this week 2009-03-12 00:41 our univ project review is about to come and it would be nice and very helpful if we found a mention on the Tux3 hall of fame :) 2009-03-12 00:41 it will be there 2009-03-12 00:41 when is is coming? 2009-03-12 00:42 tuesday 2009-03-12 00:42 ok, no problem 2009-03-12 00:42 actually monday by pacific time :) 2009-03-12 00:42 I was busy getting git etc working otherwise would already be done 2009-03-12 00:42 cdk: i'll do it now 2009-03-12 00:43 ok, thanks. 2009-03-12 00:44 cdk: i noticed all your email addresses are @gmail, is that what you guys prefer to university email? 2009-03-12 00:45 just wondering 2009-03-12 00:45 yes...we dont have a separate university address... 2009-03-12 00:46 new link: http://farm4.static.flickr.com/3448/3348866720_27b16b459d_o.png (I fixed a typo :P) 2009-03-12 00:47 cdk: wow! 2009-03-12 00:47 aliases should be cheap 2009-03-12 00:57 $ for f in ext2.o ext3.o ext4.o btrfs.o tux3.o ; do echo -n $f\ ; nm $f | grep 'U ' | wc -l ; done 2009-03-12 00:57 ext2.o 185 2009-03-12 00:57 ext3.o 262 2009-03-12 00:57 ext4.o 324 2009-03-12 00:57 btrfs.o 252 2009-03-12 00:57 tux3.o 103 2009-03-12 00:58 I'll do some plotting tomorrow :P 2009-03-12 01:02 RazvanM: what is the picture of ? 2009-03-12 01:03 shapor: external calls for 53 fs modules 2009-03-12 01:04 each tick is a external call 2009-03-12 01:04 you can see the jdb2 used by ext4 and ocfs2 2009-03-12 01:04 and the fact that nobody except ext3 is using jdb 2009-03-12 01:04 RazvanM: you should post your findings (and what you use to generate it on the list) ;) 2009-03-12 01:05 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-03-12 01:06 I will :D 2009-03-12 01:06 btw: there are 1228 external calls 2009-03-12 01:06 (the X axis) 2009-03-12 01:07 http://farm4.static.flickr.com/3461/3347815417_09a5de9901_o.png 2009-03-12 01:07 my attempt to compile all the FSs :P 2009-03-12 01:09 can anyone guess what is the vertical line? 
:D 2009-03-12 01:10 i'm not sure what is on the x-axis? 2009-03-12 01:11 shapor: which of the graphs? 2009-03-12 01:11 oh i see your loop now 2009-03-12 01:12 you, the X axis contains the 1228 external calls 2009-03-12 01:12 you = yup 2009-03-12 01:13 I'm off to bed now 2009-03-12 01:13 happy hacking to everyone! :-) 2009-03-12 01:13 vertical line = jbd2 stuff? 2009-03-12 01:14 shapor: nope 2009-03-12 01:14 jdb2 show up in 2.6.19 with ext4dev 2009-03-12 01:16 the stuff to the right of the vertical line I was able to compile with the 4.1 I have as default on my machine 2009-03-12 01:16 for the left stuff I switched to 2.95.3 2009-03-12 01:32 shapor : your bitbucket repository no longer in sync with the official one? 2009-03-12 01:33 cdk: no, i wasn't able to rebase it and remove the extra head 2009-03-12 01:34 i think i need to contact them to do that, the interface doesn't seem to allow you to 2009-03-12 01:34 oh. 2009-03-12 01:39 ah, lifetime of filesystems 2009-03-12 01:39 only intermezzo and devfs ever died off 2009-03-12 01:40 otherwise... being a filesystem is roughly akin to being immortal :) 2009-03-12 01:40 oh jffs died too 2009-03-12 01:41 I'd say the vertical line is end of 2.4-ish, if I wasn't already told it was start of gcc 3 2009-03-12 01:42 shapor, well let's just make tux3.org spiffier and have an integrated hg page then :) 2009-03-12 01:43 shows the value of a open tool chain, even open tools on a closed site are dodgy 2009-03-12 01:45 cdk: http://tux3.org/about.html 2009-03-12 01:46 thanks a lot :) 2009-03-12 01:46 flips: yeah.. need to do that.. 
2009-03-12 01:46 looks spiffy 2009-03-12 01:47 bashbucket 2009-03-12 01:48 heh 2009-03-12 01:49 there should be a penguin in front of "your name here" :) 2009-03-12 01:49 flips: the site needs updates, news, etc 2009-03-12 01:49 yes it's getting a little dusty 2009-03-12 01:49 if i wasn't behind on sleep i'd jump on it now 2009-03-12 01:49 tomorrow night 2009-03-12 01:49 me too, behind on sleep that is 2009-03-12 01:49 posted the Git announc well after sunrise today 2009-03-12 01:50 around 9:30 I think 2009-03-12 01:50 and that was not because I got up early 2009-03-12 01:59 -!- amey_m(~amey@115.109.11.1) has joined #tux3 2009-03-12 02:00 hirofumi's missing all the review fun 2009-03-12 06:51 -!- amey_m(~amey@117.195.37.245) has joined #tux3 2009-03-12 07:48 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-12 08:04 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-03-12 09:32 flips, magic.h means linux/include/linux/magic.h 2009-03-12 09:46 -!- cdk(~chinmay@121.246.32.110) has joined #tux3 2009-03-12 10:12 -!- amey_m(~amey@117.195.45.227) has joined #tux3 2009-03-12 10:44 and for sb->s_magic 2009-03-12 11:55 hirofumi, oh 2009-03-12 11:55 hi 2009-03-12 11:56 it was a bit fast to change 2009-03-12 11:57 sure, well let's change again :) 2009-03-12 11:57 btw, I wouldn't like to change the _many_ code _for now_ 2009-03-12 11:57 but I don't see that linux/magic.h is better 2009-03-12 11:58 which code to you want to leave alone? 2009-03-12 11:58 it is to see users know from one file 2009-03-12 11:58 I meant 2009-03-12 11:58 ... 2009-03-12 11:59 #define FUTEXFS_SUPER_MAGIC 0xBAD1DEA 2009-03-12 11:59 <- haha 2009-03-12 11:59 I wouldn't like to change the code with lazyness 2009-03-12 11:59 um... 
2009-03-12 11:59 ok, I see 2009-03-12 12:00 we can use the same form basically 2009-03-12 12:00 change bindly 2009-03-12 12:00 yes 2009-03-12 12:00 well it was good to show some response, even the wrong response in one case 2009-03-12 12:00 well, basically, I have no interesting trivial things like coding style 2009-03-12 12:01 yes, response is good 2009-03-12 12:01 ah, tytso wants some buffer_head killing 2009-03-12 12:01 brave man 2009-03-12 12:02 yes, many people seems to dislike buffer_head 2009-03-12 12:04 well, we can put TUX3_MAGIC in linux/magic.h just as it is, but TUX3_MAGIC_SIZE can stay in tux3.h, or it can be a macro 2009-03-12 12:04 I guess it more confusable 2009-03-12 12:05 #define TUX3_MAGIC_SIZE sizeof((char[])TUX3_MAGIC) 2009-03-12 12:05 we need TUX3_MAGIC already for sb->s_magic 2009-03-12 12:06 it is still open coded though 2009-03-12 12:06 but we don't really need to worry about where it is used, that is only about 3 places 2009-03-12 12:06 we need to worry that people understand the coding and update it properly 2009-03-12 12:07 of course 2009-03-12 12:07 it is very useful to be able to see not only that the magic number failed, but that it failed because it is out of date, and exactly what the old date was 2009-03-12 12:07 this is the kind of coding style thing, I think 2009-03-12 12:07 it doesn't matter actually 2009-03-12 12:07 sure 2009-03-12 12:07 however, people are sensitive 2009-03-12 12:07 at least akpm did not complain about the form of it :) 2009-03-12 12:08 good :) 2009-03-12 12:08 I guess the best is 2009-03-12 12:08 ok, that takes Tux3 up from just one outside change (fs/Makefile) to two (also linux/magic.h) 2009-03-12 12:09 magic number (tux3) is in linux/magic.h 2009-03-12 12:09 doubles the number of outside changes for a fs 2009-03-12 12:09 and use it in tux3.h 2009-03-12 12:09 yes 2009-03-12 12:10 oh, and we have a problem with userspace 2009-03-12 12:10 because we can't look it up in linux/magic.h 2009-03-12 12:10 yes 
2009-03-12 12:10 we need copy of it 2009-03-12 12:10 now it needs to be updated in two places 2009-03-12 12:10 yes 2009-03-12 12:10 :p 2009-03-12 12:10 well 2009-03-12 12:11 or wait distribution to include magic.h including tux3 :) 2009-03-12 12:11 great idea :) 2009-03-12 12:11 that means I can get lots of rest :) 2009-03-12 12:12 well, how about separate it to two parts? 2009-03-12 12:12 magic and date 2009-03-12 12:12 yes 2009-03-12 12:13 and that way the date can be together with the comments 2009-03-12 12:13 yes 2009-03-12 12:15 #define TUX3_MAGIC { 't', 'u', 'x', '3' } 2009-03-12 12:15 or #define TUX3_MAGIC "tux3" 2009-03-12 12:15 um... 2009-03-12 12:17 and 0xdd -> 0x02 for the next revision I think 2009-03-12 12:17 "tux3" may be good 2009-03-12 12:17 0x20? 2009-03-12 12:17 yes 2009-03-12 12:17 :) 2009-03-12 12:18 makes it easier to read the error message 2009-03-12 12:18 umm... TUX3_MAGIC "tux3", and TUX3_FORMAT_DATE "\x20\x09\x03\x13"? 2009-03-12 12:18 good 2009-03-12 12:19 then putting them together is just string concatenation 2009-03-12 12:19 yes 2009-03-12 12:20 and sb->s_magic is... 2009-03-12 12:21 sb->s_magic = cpu_to_le32((u32 *)TUX3_MAGIC)? 2009-03-12 12:21 can stay as char[] 2009-03-12 12:21 it's a char array 2009-03-12 12:21 our code doesn't have to change much 2009-03-12 12:22 whoops, sb->s_magic = cpu_to_le32(*(u32 *)TUX3_MAGIC)? 2009-03-12 12:22 there's no need to convert to integer I think 2009-03-12 12:22 ah 2009-03-12 12:22 s_magic is long 2009-03-12 12:24 (struct disksuper){ .magic = { TUX3_MAGIC TUX3_MAGIC_REVISION }, 2009-03-12 12:25 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-12 12:25 looks good to me 2009-03-12 12:25 or TUX3_MAGIC_STRING or something 2009-03-12 12:25 wait, is cpu_to_le32 correct there? 
2009-03-12 12:25 provide one define 2009-03-12 12:25 unless tux3 is not byteorder independent yet 2009-03-12 12:26 ok, 1) coffee, 2) make a patch 2009-03-12 12:26 oh, yes 2009-03-12 12:26 le32_to_cpu() 2009-03-12 12:26 we can keep if (memcmp(super->magic, (char[])TUX3_MAGIC, sizeof(super->magic))) 2009-03-12 12:27 it's nice to have one field that isn't endian :) 2009-03-12 12:27 TUX3_MAGIC is not including REVISION now? 2009-03-12 12:27 well, it has to have the string concatentation too 2009-03-12 12:27 yes 2009-03-12 12:28 memcmp(super->magic, (char[]){ TUX3_MAGIC TUX3_MAGIC_REVISION }, sizeof(super->magic)) 2009-03-12 12:28 I guess one define is better 2009-03-12 12:29 yes 2009-03-12 12:29 so, TUX3_MAGIC_PREFIX and *_REVISION? 2009-03-12 12:29 and put _PREFIX in linux/magic.h? 2009-03-12 12:29 yes 2009-03-12 12:29 it's clear anyway 2009-03-12 12:30 says "but wait, there's more" to the reader 2009-03-12 12:30 yes 2009-03-12 12:30 and statfs provides PREFIX only 2009-03-12 12:30 let's try it, it gives us one more small thing to fix if somebody doesn't like it 2009-03-12 12:30 yes 2009-03-12 12:31 then the only change to our code is, we add { } around it 2009-03-12 12:31 well, it's tux3 internal things, so I guess people don't care it 2009-03-12 12:31 eh? 2009-03-12 12:32 ? 2009-03-12 12:32 I guess we don't need {} 2009-03-12 12:32 right 2009-03-12 12:32 :) 2009-03-12 12:32 ok :) 2009-03-12 12:32 and don't need the (char[]) cast 2009-03-12 12:32 yes 2009-03-12 12:33 well its beetter 2009-03-12 12:33 better 2009-03-12 12:33 yes 2009-03-12 12:34 should it be TUX_SUPER_MAGIC_PREFIX ? 
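A minimal userspace sketch of the scheme being discussed above: a 4-byte prefix and a 4-byte format date joined by string literal concatenation, then compared against the 8-byte on-disk magic with memcmp. The macro names follow the chat, but the final spellings were still being debated, so treat them as assumptions. The struct, both helper functions and the older date used in the usage note are hypothetical and exist only for illustration.

```c
#include <string.h>

/* Names taken from the chat; the spellings in the real tree may differ. */
#define TUX3_MAGIC_PREFIX   "tux3"
#define TUX3_MAGIC_REVISION "\x20\x09\x03\x13"	/* format date 2009-03-13 */

/* Adjacent string literals concatenate: 8 magic bytes plus a trailing NUL. */
#define TUX3_MAGIC TUX3_MAGIC_PREFIX TUX3_MAGIC_REVISION

/* Hypothetical stand-in for the on-disk superblock, magic field only. */
struct disksuper_sketch {
	char magic[8];
};

/* Returns 1 when all 8 magic bytes (prefix and revision) match. */
static int magic_ok(const struct disksuper_sketch *super)
{
	return memcmp(super->magic, TUX3_MAGIC, sizeof(super->magic)) == 0;
}

/* Returns 1 when the prefix matches but the revision does not, i.e. the
 * volume is tux3 but was written with a different format date. */
static int magic_stale(const struct disksuper_sketch *super)
{
	return !magic_ok(super) &&
	       memcmp(super->magic, TUX3_MAGIC_PREFIX, 4) == 0;
}
```

Copying TUX3_MAGIC into a superblock makes magic_ok() return 1. Copying "tux3" followed by some other date, say "\x20\x09\x02\x24", makes magic_stale() return 1 instead, which is the "failed because it is out of date" case the chat wants to be able to report.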
2009-03-12 12:34 I don't think _SUPER_ is helpful, but the others use it 2009-03-12 12:34 btw, I'm having 2009-03-12 12:34 #define TUX3_MAGIC_LOG 0x10ad 2009-03-12 12:34 #define TUX3_MAGIC_DLEAF 0x1eaf 2009-03-12 12:34 #define TUX3_MAGIC_ILEAF 0x90de 2009-03-12 12:34 yes, it's better 2009-03-12 12:34 I don't think _SUPER_ is needed 2009-03-12 12:35 ok, I like it more without _SUPER_ 2009-03-12 12:35 oh 2009-03-12 12:35 sure, almost has _SUPER_ 2009-03-12 12:35 however, a few fses don't have _SUPER_ 2009-03-12 12:36 #define REISERFS_SUPER_MAGIC 0x52654973 /* used by gcc */ <- what is this? 2009-03-12 12:36 I don't know it 2009-03-12 12:36 -!- cdk(~chinmay@121.246.32.110) has joined #tux3 2009-03-12 12:36 36 /* used by file system utilities that 2009-03-12 12:36 37 look at the superblock, etc. */ 2009-03-12 12:36 it seems for statsf 2009-03-12 12:36 statfs 2009-03-12 12:37 statfs->f_type 2009-03-12 12:37 that seems a wasteful use space in magic.h 2009-03-12 12:38 when it is just for reiserfs code convenience 2009-03-12 12:38 ok, we can just have a string in magic.h and not a number, unless somebody complains 2009-03-12 12:38 it is need to export to userspace 2009-03-12 12:38 oh 2009-03-12 12:39 what uses it? 2009-03-12 12:39 if magic is using on statfs->f_type 2009-03-12 12:40 btw, tux3 is also using 'tux3' part as statfs->f_type 2009-03-12 12:40 now 2009-03-12 12:41 http://en.wikipedia.org/wiki/Tux3 2009-03-12 12:41 somebody added you as developer :) 2009-03-12 12:41 oh, yes 2009-03-12 12:41 magic: "TUX3" {0x54, 0x55, 0x58, 0x33} 2009-03-12 12:41 (that's wrong) 2009-03-12 12:41 :) 2009-03-12 12:42 I mean, the magic is wrong 2009-03-12 12:42 yes 2009-03-12 12:44 (gdb) p/x "tux3" 2009-03-12 12:44 $2 = {0x74, 0x75, 0x78, 0x33, 0x0} 2009-03-12 12:44 yes 2009-03-12 12:44 oh 2009-03-12 12:44 wiki was right 2009-03-12 12:44 it was? 
2009-03-12 12:45 s_magic is using 2009-03-12 12:45 sb->s_magic = 0x54555833; 2009-03-12 12:45 sb->s_magic was wrong 2009-03-12 12:46 we should share #define 2009-03-12 12:46 right 2009-03-12 12:46 I got that from the wiki, probably :) 2009-03-12 12:46 le32_to_cpu(*(be32 *)TUX3_MAGIC_PREFIX) or like this 2009-03-12 12:47 or get it from disksuper 2009-03-12 12:48 yes 2009-03-12 12:48 what uses that? 2009-03-12 12:48 it never seemed to hurt us and has always been wrong 2009-03-12 12:48 sb->s_magic = le32_to_cpu(*(be32 *)TUX3_MAGIC_PREFIX) 2009-03-12 12:48 Is TUX3_MAGIC_PREFIX aligned? 2009-03-12 12:48 yes 2009-03-12 12:48 um 2009-03-12 12:48 yes 2009-03-12 12:48 Is that guaranteed for constant strings? :) 2009-03-12 12:48 magic happens to be aligned 2009-03-12 12:49 the structure is __packed 2009-03-12 12:49 I'm not talking about sb 2009-03-12 12:49 I'm talking about the string constant TUX3_MAGIC_PREFIX 2009-03-12 12:49 which you are dereferencing through a be32* 2009-03-12 12:50 it happens to be aligned 2009-03-12 12:51 hey, nobody complained about our endian respellings yet 2009-03-12 12:51 I'm surprised 2009-03-12 12:52 okay, I'll assume you are somehow forcing it to be aligned to a 4 byte boundary then :) 2009-03-12 12:52 I guess it was not noticed 2009-03-12 12:52 well, it can be fixed easily 2009-03-12 12:52 get_unaligned() 2009-03-12 12:52 akpm noticed it you can be sure 2009-03-12 12:52 looks like he read every line 2009-03-12 12:52 Well, you /could/ use get_unaligned or something, or you could just define it as a numeric constant...?
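The alignment worry raised above can be sidestepped by building the 32-bit value one byte at a time, so nothing is ever dereferenced through a wider pointer. This is a userspace sketch, not the kernel's get_unaligned helpers; load_be32 is a hypothetical name chosen for illustration.

```c
#include <stdint.h>

/* Assemble a big-endian 32-bit value byte by byte. The pointer may have
 * any alignment and the result does not depend on host byte order. */
static uint32_t load_be32(const void *p)
{
	const unsigned char *b = p;

	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}
```

With this, load_be32("tux3") gives 0x74757833, matching the gdb dump of the lowercase string, while 0x54555833 is what the uppercase "TUX3" bytes produce.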
2009-03-12 12:53 it's only 6.6K lines, and akpm reads fast 2009-03-12 12:53 we can use from_be_u32 if it matters 2009-03-12 12:53 it's not clear s_magic is used for anything 2009-03-12 12:55 s_magic is just for statfs->f_type 2009-03-12 12:55 lxr shows only setting it, not using it 2009-03-12 12:55 ok 2009-03-12 12:55 used only within the fs 2009-03-12 12:55 I guess, yes 2009-03-12 12:56 so we better endian convert it 2009-03-12 12:57 yes 2009-03-12 12:57 and make sure not to exception on !x86 from an unaligned read ;) 2009-03-12 12:57 I hope the endian conversion macros allow unaligned read 2009-03-12 12:57 because we use them unaligned 2009-03-12 12:57 Why would they? 2009-03-12 12:58 well if they don't we have to fix them 2009-03-12 12:58 because we use unaligned endian accesses 2009-03-12 12:59 mhm, a quick read through include/linux/byteorder/* doesn't suggest anything of that sort 2009-03-12 12:59 which makes sense, why would le32_to_cpu(*foo) be able to affect how *foo is evaluated 2009-03-12 13:00 there is get_unaligned_le{16,32,36}, looks like 2009-03-12 13:00 er 2009-03-12 13:00 ,64 2009-03-12 13:01 which is unaligned endian acesses? 2009-03-12 13:02 indeed 2009-03-12 13:03 it's the responsibility of gcc to respect __packed 2009-03-12 13:03 I guess encode/decode is using temporary values 2009-03-12 13:03 well 2009-03-12 13:03 oh, sure __packed is fine. But you can't just do le32_to_cpu(*(le32*)"tux3") 2009-03-12 13:03 -!- data(~data@84.19.190.213) has joined #tux3 2009-03-12 13:04 as "tux3" has no alignment constraints 2009-03-12 13:04 it does actually 2009-03-12 13:04 oh? 
2009-03-12 13:04 disksup is declared __packed 2009-03-12 13:04 disksuper 2009-03-12 13:04 "tux3" isn't a field of disksup 2009-03-12 13:04 sorry 2009-03-12 13:04 you're right 2009-03-12 13:04 :) 2009-03-12 13:04 we will fetch it from disksuper 2009-03-12 13:05 yes, it sounds good 2009-03-12 13:07 -!- cdk(~chinmay@121.246.32.110) has joined #tux3 2009-03-12 13:09 (L) -> (long long) mass edit will be today I think 2009-03-12 13:09 gross 2009-03-12 13:09 is it needed? 2009-03-12 13:10 not according to me 2009-03-12 13:10 I guess generic one is right way 2009-03-12 13:10 yes, not do anything for now is easier 2009-03-12 13:10 there are a bunch of other things to fix 2009-03-12 13:10 some big things like "write helpful comments" 2009-03-12 13:11 I only did about half the things akpm wanted so far 2009-03-12 13:11 well, today I will write some comments and re-patch the magic, that will be enough 2009-03-12 13:12 well, honestly, I don't have so interest to merge to linus right now 2009-03-12 13:14 that is fine with me 2009-03-12 13:14 merge is months away anyway 2009-03-12 13:14 and will be even more months away if review doesn't start now 2009-03-12 13:15 I don't think it is a good use of your time to get involved in trivial merge patches 2009-03-12 13:15 well, it's for your time actually 2009-03-12 13:16 I can't avoid it 2009-03-12 13:16 it's now or later 2009-03-12 13:16 the plan is to stretch this out over a long period 2009-03-12 13:17 and not spend a lot of time on it after these few days 2009-03-12 13:17 yes 2009-03-12 13:17 the worst part was converting the repo history, which you did :) 2009-03-12 13:18 I'm almost sure, this work is easy to do (almost people has satisfaction like coding style) 2009-03-12 13:18 however, it should be at a time 2009-03-12 13:18 major things like endian conversions are done 2009-03-12 13:19 endian conversions?
2009-03-12 13:19 the ones we already did 2009-03-12 13:19 that was a major project 2009-03-12 13:19 support big endian on-disk format? 2009-03-12 13:19 yes 2009-03-12 13:20 yes 2009-03-12 13:20 the only remaining issue is bitmap endian ordering 2009-03-12 13:20 ah, yes 2009-03-12 13:21 I will see what happens when we use lib/bitmap.c 2009-03-12 13:21 lib/bitmap.c is for native endian 2009-03-12 13:21 iirc 2009-03-12 13:21 it seemed to me to be native endian 2009-03-12 13:22 with conversion by copy, not so useful 2009-03-12 13:22 but I didn't read it closely 2009-03-12 13:22 if we want to use those, I guess we have to add endian support 2009-03-12 13:23 that would be ugly 2009-03-12 13:23 well, those are linux way :) 2009-03-12 13:23 and slow :) 2009-03-12 13:24 slow? 2009-03-12 13:24 slow to write? 2009-03-12 13:24 or slow to run? 2009-03-12 13:25 well, btw, I'm starting to think magic to separate to two fields 2009-03-12 13:25 u32 magic; u32 revison 2009-03-12 13:25 or something 2009-03-12 13:36 slow to run 2009-03-12 13:36 that is not the linux way 2009-03-12 13:37 the revision has proved very useful 2009-03-12 13:38 every fs should have it :) 2009-03-12 13:39 if it's slow, please fix it 2009-03-12 13:39 this is linux way :) 2009-03-12 14:17 http://www.webopedia.com/TERM/T/Tux3.html 2009-03-12 14:17 we are in webopedia, we are leet :) 2009-03-12 14:50 well it seems I wasted a lot of my time the other night by copying all the checked out git files onto kernel.org 2009-03-12 14:51 twice :-/ 2009-03-12 14:54 yes, it was thing that I'd like to avoid 2009-03-12 14:54 by review, and git conversion 2009-03-12 14:54 honestly, I couldn't constrate to tux3 recently 2009-03-12 14:54 I should constrate more 2009-03-12 14:55 :-/ 2009-03-12 14:56 s/constrate/concentrate/ 2009-03-12 14:56 concatenate 2009-03-12 14:57 :) 2009-03-12 14:58 well the noise has died down 2009-03-12 14:58 and we have a public git tree, and hopefully some new volunteers 2009-03-12 14:59 yes 2009-03-12 
14:59 also, akpm's eyes on the code 2009-03-12 14:59 and... ext4 now thinking about getting rid of buffer_head 2009-03-12 15:00 andi kleen talking about making a per module namespace 2009-03-12 15:00 fsblock? 2009-03-12 15:00 yes 2009-03-12 15:00 ah, yes 2009-03-12 15:00 I think the block handles scheme is better, actually 2009-03-12 15:00 because way less code 2009-03-12 15:00 but it isn't tried yet 2009-03-12 15:00 yes 2009-03-12 15:01 fsblock seems to try to have more feature 2009-03-12 15:01 I guess the goal is a bit different 2009-03-12 15:01 it seems like not everybody hates (L) 2009-03-12 15:01 oh, well, it should be good 2009-03-12 15:02 so maybe just leave it for now 2009-03-12 15:02 well, it might not be "L" though 2009-03-12 15:02 sure 2009-03-12 15:02 (FOO) :) 2009-03-12 15:02 :) 2009-03-12 15:04 well, atomic commit is the most important thing again 2009-03-12 15:04 where were we 2009-03-12 15:04 yes 2009-03-12 15:04 clearing the log 2009-03-12 15:05 it needs to be done before the flush 2009-03-12 15:05 I need time to concentrate to think 2009-03-12 15:05 um 2009-03-12 15:05 yes 2009-03-12 15:05 well, current my thing is bitmap log and flush 2009-03-12 15:07 sb->logbase = sb->lognext; should be at the beginning of flush_log I think 2009-03-12 15:08 probably 2009-03-12 15:08 I'd like to make sure a bit more 2009-03-12 15:08 this is only used to calculate how log the log chain is for the super update, so it doesn't actually matter where it is done 2009-03-12 15:09 as long as it is done before the delta commit block is written 2009-03-12 15:09 s/how log/how long/ 2009-03-12 15:11 bitmap bfree log should be logged 2009-03-12 15:12 and defree may also be 2009-03-12 15:12 however, other inodes too? 2009-03-12 15:12 probably, no 2009-03-12 15:12 but, why? 
2009-03-12 15:13 I know it, maybe 2009-03-12 15:13 but, I'm not sure why 2009-03-12 15:13 it will be logged, for the following delta 2009-03-12 15:14 that is, frees that occur during the flush will be logged for the next flush 2009-03-12 15:14 actually, for the next delta 2009-03-12 15:18 probably 2009-03-12 15:18 I was going to think about it 2009-03-12 15:19 I'm also not sure, can we free bitmap's block on flush? 2009-03-12 15:19 I guess it can be, however, how complex is it? 2009-03-12 15:19 yes 2009-03-12 15:19 it's supposed to work that way 2009-03-12 15:20 it is a little tricky to think about 2009-03-12 15:21 yes 2009-03-12 15:21 the only thing is, we have to be sure that the block can't be reallocated and written before the delta completes 2009-03-12 15:21 well, I was going to think it, and probaby, just delay bfree to next delta after all 2009-03-12 15:21 yes 2009-03-12 15:21 so it is safest to flush the defree list into the bitmaps after the delta commits 2009-03-12 15:22 yes 2009-03-12 15:23 but, it means bfree log should be added to previous logchain 2009-03-12 15:23 I guess 2009-03-12 15:23 I thought I did that 2009-03-12 15:24 list_splice_tail_init(&sb->pinned, &sb->commit); 2009-03-12 15:24 ah, not what you meant 2009-03-12 15:24 yes, bfree log 2009-03-12 15:25 are you worrying about whether the freed bitmaps can leak? 
2009-03-12 15:25 it is 2009-03-12 15:26 however, actually, I'm thinking about clean way to do 2009-03-12 15:26 bitmap flush seens to clean, there is no many special case 2009-03-12 15:26 ok, well I have an idea for something to do today, I will write a code comment about the purpose of the deferred free logs 2009-03-12 15:27 that will help think about the issue, and also help with akpm's comments 2009-03-12 15:27 good 2009-03-12 15:39 hey flips 2009-03-12 20:05 -!- chesse_(~eworm@dslb-084-062-144-150.pools.arcor-ip.net) has joined #tux3 2009-03-12 21:05 -!- chesse(~eworm@dslb-084-062-143-048.pools.arcor-ip.net) has joined #tux3 2009-03-12 22:14 diff --git a/include/linux/magic.h b/include/linux/magic.h 2009-03-12 22:14 index 439f6f3..fd18181 100644 2009-03-12 22:14 --- a/include/linux/magic.h 2009-03-12 22:14 +++ b/include/linux/magic.h 2009-03-12 22:14 @@ -41,6 +41,7 @@ 2009-03-12 22:14 #define REISER2FS_JR_SUPER_MAGIC_STRING "ReIsEr3Fs" 2009-03-12 22:14 2009-03-12 22:14 #define SMB_SUPER_MAGIC 0x517B 2009-03-12 22:14 +#define TUX3_SUPER_MAGIC "tux3" 2009-03-12 22:14 #define USBDEVICE_SUPER_MAGIC 0x9fa2 2009-03-12 22:14 #define CGROUP_SUPER_MAGIC 0x27e0eb 2009-03-12 22:14 2009-03-12 22:14 tux3 sits right between samba and some usb fs thingy 2009-03-12 22:15 ACTION is pleased to be in the vicinity of the samba folks 2009-03-12 22:18 :-) 2009-03-12 22:25 razvanm, I still haven't figured out the first of your two charts from yesterday 2009-03-12 22:25 the function sharing one 2009-03-12 23:37 flips: working with akpm on getting it included ? 
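[annotation] The deferred free rule agreed on around 15:21 (a block freed during a delta must not be reallocated and written before that delta completes, so the defree list is flushed into the bitmaps only after the delta commits) can be sketched roughly as follows. This is a toy model, not the tux3 code: the array standing in for the bitmap and the names `toy_sb`, `toy_alloc`, `toy_defer_free` and `toy_commit_delta` are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_BLOCKS 16

/* Toy superblock (hypothetical): an allocation map plus a deferred free list. */
struct toy_sb {
	bool used[TOY_BLOCKS];   /* stand-in for the free space bitmap */
	int defree[TOY_BLOCKS];  /* blocks freed during the current delta */
	size_t defree_count;
};

/* Allocate the lowest free block, or return -1 if none is free. */
static int toy_alloc(struct toy_sb *sb)
{
	for (int i = 0; i < TOY_BLOCKS; i++) {
		if (!sb->used[i]) {
			sb->used[i] = true;
			return i;
		}
	}
	return -1;
}

/* Record a free without touching the map, so the block cannot be reused yet. */
static void toy_defer_free(struct toy_sb *sb, int block)
{
	if (sb->defree_count < TOY_BLOCKS)
		sb->defree[sb->defree_count++] = block;
}

/* Once the delta has committed, apply the deferred frees to the map. */
static void toy_commit_delta(struct toy_sb *sb)
{
	for (size_t i = 0; i < sb->defree_count; i++)
		sb->used[sb->defree[i]] = false;
	sb->defree_count = 0;
}
```

In this sketch, a block passed to `toy_defer_free` stays unavailable to `toy_alloc` until `toy_commit_delta` runs, which is the property the discussion is after.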
2009-03-13 00:08 -!- RazvanM(~RazvanM@96.234.242.71) has joined #tux3 2009-03-13 00:49 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-03-13 03:38 -!- pythonstar(~kavli@c-0afee455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-03-13 04:08 -!- data(~data@84.19.190.213) has joined #tux3 2009-03-13 04:25 #define TUX3_SUPER_MAGIC 0x74757833 2009-03-13 04:25 #define TUX3_SUPER_REV "\x20\x09\x03\x10" 2009-03-13 04:25 struct { 2009-03-13 04:25 u32 prefix; 2009-03-13 04:25 char rev[4]; 2009-03-13 04:25 } tux3_magic = { 2009-03-13 04:25 .prefix = cpu_to_be32(TUX3_SUPER_MAGIC), 2009-03-13 04:25 .rev = TUX3_SUPER_REV, 2009-03-13 04:25 }; 2009-03-13 04:25 int main(int argc, char *argv[]) 2009-03-13 04:25 { 2009-03-13 04:25 if (!memcmp(&tux3_magic, "tux3" TUX3_SUPER_REV, sizeof(tux3_magic))) 2009-03-13 04:25 printf("yes\n"); 2009-03-13 04:25 else 2009-03-13 04:25 printf("no\n"); 2009-03-13 04:25 return 0; 2009-03-13 04:25 } 2009-03-13 04:25 how about this? 2009-03-13 04:26 with this, TUX3_SUPER_MAGIC can use for sb->s_magic and userland too 2009-03-13 04:27 disksuper is not need to change, and can compare by memcmp 2009-03-13 07:41 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-13 09:21 -!- amey_m(~amey@117.195.42.35) has joined #tux3 2009-03-13 10:05 flips, I suggest to re-clone with bare 2009-03-13 10:08 git clone --bare -s -l /pub/... 
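[annotation] The magic-plus-revision layout pasted at 04:25 can be tried in userspace with a sketch like the one below. It is an adaptation, not the pasted code: `cpu_to_be32` is a kernel helper, so this version stores the prefix byte by byte through a hypothetical `put_be32`, and the struct and function names are illustrative only.

```c
#include <stdint.h>
#include <string.h>

#define TUX3_SUPER_MAGIC 0x74757833u         /* "tux3" read as a big-endian u32 */
#define TUX3_SUPER_REV   "\x20\x09\x03\x10"  /* revision bytes from the log */

struct tux3_magic {
	uint8_t prefix[4];  /* TUX3_SUPER_MAGIC, most significant byte first */
	char rev[4];        /* revision bytes, no trailing NUL */
};

/* Store a 32-bit value most significant byte first, on any host (hypothetical helper). */
static void put_be32(uint8_t out[4], uint32_t v)
{
	out[0] = (uint8_t)(v >> 24);
	out[1] = (uint8_t)(v >> 16);
	out[2] = (uint8_t)(v >> 8);
	out[3] = (uint8_t)v;
}

/* Return 1 if the eight magic bytes equal "tux3" followed by the revision. */
static int tux3_magic_matches(void)
{
	struct tux3_magic m;

	put_be32(m.prefix, TUX3_SUPER_MAGIC);
	memcpy(m.rev, TUX3_SUPER_REV, sizeof(m.rev));
	return memcmp(&m, "tux3" TUX3_SUPER_REV, sizeof(m)) == 0;
}
```

Because the struct holds only byte arrays it has no padding, so the `memcmp` over `sizeof(m)` covers exactly the eight bytes that would sit on disk.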
2009-03-13 10:26 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-13 10:31 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-03-13 10:31 -!- mingming_(~mingming@32.97.110.51) has joined #tux3 2009-03-13 10:54 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-13 14:51 sb->s_magic = from_be_u32(*(be_u32 *)&sbi->super.magic); 2009-03-13 14:52 ok, after this commit kernel build with fail unless include/linux/magic.h has the tux3 magic 2009-03-13 14:52 I think it's not clean for userland 2009-03-13 14:52 this is only kernel 2009-03-13 14:52 the above line is only kernel 2009-03-13 14:52 what value does it set to statfs->f_type? 2009-03-13 14:53 tux3_fill_super: magic = 74757833 2009-03-13 14:53 yes 2009-03-13 14:53 so, userland also need that value 2009-03-13 14:54 instead of a string? 2009-03-13 14:56 f_type is unsigned long 2009-03-13 14:56 so, it would need some trick 2009-03-13 14:58 right, we have /usr/include/linux/magic.h 2009-03-13 14:59 which is just a copy of linux/include/linux/magic.h 2009-03-13 14:59 so, no choice but to have SUPER_MAGIC be an integer constant I guess 2009-03-13 15:00 or another idea is 2009-03-13 15:00 we can just make put unrelated value to linux/magic.h 2009-03-13 15:00 yes 2009-03-13 15:01 that's sensible 2009-03-13 15:01 e.g. internal stuff just does curret way 2009-03-13 15:01 and TUX3_MAGIC 0x53... (e.g. TUX3) 2009-03-13 15:01 ok 2009-03-13 15:01 now it looks completely normal 2009-03-13 15:02 yes, it may be better 2009-03-13 15:03 #define TUX3_MAGIC "tux3" "\xdd\x09\x03\x10" 2009-03-13 15:03 so, we can use #define TUX3_MAGIC "tux3\x20\x09\x03\x10" or something 2009-03-13 15:03 yes :) 2009-03-13 15:03 and #define TUX3_SUPER_MAGIC 0x... 
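[annotation] The relation between the on-disk string and the integer reported through `sb->s_magic` (the `magic = 74757833` line printed at 14:53) can be checked with a small sketch. The helper names `get_be32` and `magic_from_label` are hypothetical, and the `TUX3_MAGIC` value is the "or something" candidate from 15:03, not a confirmed final definition.

```c
#include <stdint.h>

#define TUX3_MAGIC       "tux3\x20\x09\x03\x10"  /* candidate on-disk label (assumed) */
#define TUX3_SUPER_MAGIC 0x74757833u            /* integer form for s_magic / f_type */

/* Read four bytes as a big-endian 32-bit value, independent of host byte order. */
static uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Value s_magic would get if it were derived from the first four label bytes. */
static uint32_t magic_from_label(const char *label)
{
	return get_be32((const unsigned char *)label);
}
```

With these definitions the first four bytes of the label decode to the same number as the integer constant, which is why the two `#define`s can be kept separate yet consistent.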
2009-03-13 15:03 sb->s_magic = TUX3_SUPER_MAGIC 2009-03-13 15:03 right 2009-03-13 15:03 patch is ready 2009-03-13 15:04 and I will post a patch that just has the two changes needed outside fs/tux3, that is, fs/Makefile and include/linux/magic.h 2009-03-13 15:05 that patch should be in the mercurial repository somewhere 2009-03-13 15:05 maybe tux3/patches, though it is not clear what kernel version to use for the patch name 2009-03-13 15:12 for now it is called linux.tux3.patch 2009-03-13 15:13 and it is for 2.6.28.1... 2009-03-13 15:14 fs/tux3? 2009-03-13 15:14 ah 2009-03-13 15:14 yes 2009-03-13 15:15 ok, I will see if I can get mercurial to generate a mail patch directly for import into git 2009-03-13 15:15 syncing these to repositories by hand is going to be painful 2009-03-13 15:16 when I have had enough pain, then we can do something ;) 2009-03-13 15:16 btw, I suggest fix your email for commit to hg or git 2009-03-13 15:16 I thought I did 2009-03-13 15:16 maybe not for mercurial yet 2009-03-13 15:16 for hg? 2009-03-13 15:16 git is supposed to be fixed 2009-03-13 15:17 yes 2009-03-13 15:17 probably 2009-03-13 15:17 mercurial is supposed to be fixed too, but I haven't commited anything new yet 2009-03-13 15:17 well, on hg -> git conversion, the patch will have to be fixed again 2009-03-13 15:18 oh, ok 2009-03-13 15:18 [ui] 2009-03-13 15:18 username = Daniel Phillips 2009-03-13 15:19 yes 2009-03-13 15:19 looks good 2009-03-13 15:19 and git is ~/.gitconfig or per .git config 2009-03-13 15:21 yes, also set 2009-03-13 15:21 in both places I hope 2009-03-13 15:21 We now have: char magic[8]; /* Contains TUX3_LABEL magic string */ 2009-03-13 15:21 which defines the magic size 2009-03-13 15:22 so no need for a #define 2009-03-13 15:22 ok 2009-03-13 15:22 if we try to define the size from the string, the null char would be included 2009-03-13 15:22 btw, kernel.org problem was fixed? 
2009-03-13 15:22 yes 2009-03-13 15:22 the repo is now on gitweb 2009-03-13 15:22 well, sizeof("string") - 1 2009-03-13 15:23 and a few changes to improve the repo were suggested to me 2009-03-13 15:23 I'm seeing linux-tux3.git/config 2009-03-13 15:23 we could do that too, but just saying "8" in the struct declaration is precise 2009-03-13 15:23 it has [remove origin] 2009-03-13 15:23 hmm 2009-03-13 15:23 what's that? 2009-03-13 15:24 it is for pull 2009-03-13 15:24 right, to pull from linus 2009-03-13 15:24 and I have to override it to pull from me 2009-03-13 15:24 well I guess I better set it to me 2009-03-13 15:24 push only repo doesn't have remote origin 2009-03-13 15:25 ah 2009-03-13 15:25 my last update was a pull ;) 2009-03-13 15:25 and not from linus 2009-03-13 15:25 ah, that is 2009-03-13 15:25 well I will get in the habit of pushing soon 2009-03-13 15:27 /pix/tmp looks stupid ;) 2009-03-13 15:27 yes 2009-03-13 15:27 I will fix it to be something reasonable 2009-03-13 15:27 git clone --bare -s -l /pub... 2009-03-13 15:27 delete I guess 2009-03-13 15:27 that's what I should have done 2009-03-13 15:27 but making a git repo bare by hand is not hard, I think 2009-03-13 15:28 so just delete the [remote section? 2009-03-13 15:28 I'm not sure 2009-03-13 15:28 /pub/scm/linux/kernel/git/hirofumi/fatfs-2.6.git/ 2009-03-13 15:29 this was created by --bare 2009-03-13 15:29 it seems simple 2009-03-13 15:30 maybe, git clone --bare -s -l /pub/.... 
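[annotation] The point made at 15:22 about the NUL character can be seen in a short sketch: `sizeof` on a string literal counts the terminator, so an eight-byte label reports nine. The label value and the names `toy_disksuper`, `label_size_with_nul` and `label_size_on_disk` are assumptions for illustration; only `char magic[8]` comes from the log.

```c
#include <stddef.h>

#define TUX3_LABEL "tux3\x20\x09\x03\x10"  /* assumed label bytes, 8 characters */

struct toy_disksuper {
	char magic[8];  /* exactly the eight label bytes, no terminating NUL */
};

/* sizeof a string literal includes the terminating NUL. */
static size_t label_size_with_nul(void)
{
	return sizeof(TUX3_LABEL);
}

/* Subtracting one gives the number of bytes that belong on disk. */
static size_t label_size_on_disk(void)
{
	return sizeof(TUX3_LABEL) - 1;
}
```

Either `sizeof("string") - 1` or a literal 8 in the struct declaration gives the same field size; the literal simply avoids the off-by-one trap.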
2009-03-13 15:30 and push current tree to it 2009-03-13 15:30 it may be easy 2009-03-13 15:31 well, probably, current one has no problem though 2009-03-13 15:33 ok, config is the same as yours now 2009-03-13 15:33 http://git.kernel.org/?p=linux/kernel/git/daniel/linux-tux3.git;a=shortlog;h=f2d2fef0cd161c5d33cb7362962f4cd688fdcce6;pg=4 2009-03-13 15:34 this shows remote repo by gitweb 2009-03-13 15:34 clone may clone the remote repo tag too 2009-03-13 15:36 the config doesn't have remote any more 2009-03-13 15:36 um 2009-03-13 15:36 tags 2009-03-13 15:36 refs I mean 2009-03-13 15:37 hirofumi@hera (~)$ GIT_DIR=/pub/scm/linux/kernel/git/daniel/linux-tux3.git/ git branch -r 2009-03-13 15:37 origin/HEAD 2009-03-13 15:37 origin/master 2009-03-13 15:37 hirofumi@hera (~)$ GIT_DIR=/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6.git/ git branch -a 2009-03-13 15:37 * master 2009-03-13 15:38 grep remote * -rI 2009-03-13 15:38 hooks/update: refs/remotes/*,commit) 2009-03-13 15:38 hooks/update: refs/remotes/*,delete) 2009-03-13 15:38 info/refs:16b71fdf97599f1b1b7f38418ee9922d9f117396 refs/remotes/origin/HEAD 2009-03-13 15:38 info/refs:16b71fdf97599f1b1b7f38418ee9922d9f117396 refs/remotes/origin/master 2009-03-13 15:38 packed-refs:16b71fdf97599f1b1b7f38418ee9922d9f117396 refs/remotes/origin/master 2009-03-13 15:38 refs/remotes/origin/HEAD:ref: refs/remotes/origin/master 2009-03-13 15:38 git remote del? 
2009-03-13 15:38 not sure 2009-03-13 15:43 ok 2009-03-13 15:43 rm -rf refs/remotes 2009-03-13 15:43 it would be fix remote 2009-03-13 15:45 refs/remotes contains remotes info 2009-03-13 15:46 rm -rf refs/remotes index COMMIT_EDITMSG FETCH_HEAD ORIG_HEAD branches 2009-03-13 15:46 those are only used for pull and local checkout files 2009-03-13 15:46 ok 2009-03-13 15:47 grep remote * -rI 2009-03-13 15:47 hooks/update: refs/remotes/*,commit) 2009-03-13 15:47 hooks/update: refs/remotes/*,delete) 2009-03-13 15:47 info/refs:16b71fdf97599f1b1b7f38418ee9922d9f117396 refs/remotes/origin/HEAD 2009-03-13 15:47 info/refs:16b71fdf97599f1b1b7f38418ee9922d9f117396 refs/remotes/origin/master 2009-03-13 15:47 packed-refs:16b71fdf97599f1b1b7f38418ee9922d9f117396 refs/remotes/origin/master 2009-03-13 15:48 I will ask about the info/refs 2009-03-13 15:49 I guess it would be the cache of remote, um... 2009-03-13 15:49 I put the question to the git gurus 2009-03-13 15:50 also, to say that some cleanup has been done 2009-03-13 15:50 it looks like several other git.kernel.org repos have similar issues 2009-03-13 15:50 "git remote rm origin" may clean the those too 2009-03-13 15:50 well, it wouldn't have actual problem 2009-03-13 15:50 git remote rm origin 2009-03-13 15:50 error: Could not remove config section 'remote.origin' 2009-03-13 15:51 ok, let's try this for a while ;) 2009-03-13 15:51 I gave the above grep to git experts 2009-03-13 15:51 because you remoted .git/config already 2009-03-13 15:51 removed 2009-03-13 15:51 I should cc you on those mails 2009-03-13 15:52 user.kernel.org? 2009-03-13 15:52 users@linux.kernel.org 2009-03-13 15:52 email thread spun off from lkml not cced to a list 2009-03-13 15:53 if it's users@, I can read those 2009-03-13 15:53 it seems lots of people are confused by git config ;) 2009-03-13 15:53 well, almost git works any config 2009-03-13 15:54 just not clean though 2009-03-13 15:54 ok, it's cced to users. 
2009-03-13 15:55 I can probably just remove the grep lines above 2009-03-13 15:55 but I will wait for advice 2009-03-13 15:56 yes 2009-03-13 15:57 git update-server-info 2009-03-13 15:57 this fixes those? 2009-03-13 15:57 GIT_DIR=/pub/... git update-server-info 2009-03-13 15:59 reduces it by one: 2009-03-13 15:59 grep remote * -rI 2009-03-13 15:59 hooks/update: refs/remotes/*,commit) 2009-03-13 15:59 hooks/update: refs/remotes/*,delete) 2009-03-13 15:59 info/refs:16b71fdf97599f1b1b7f38418ee9922d9f117396 refs/remotes/origin/master 2009-03-13 15:59 packed-refs:16b71fdf97599f1b1b7f38418ee9922d9f117396 refs/remotes/origin/master 2009-03-13 16:00 um... 2009-03-13 16:01 ah, probably, pack has remote 2009-03-13 16:01 ah, probably, pack has remote ref 2009-03-13 16:03 will be out for an hour 2009-03-13 16:03 ok 2009-03-13 16:03 note: we failed to follow akpm's suggestion to include TUX3_MAGIC from include/linux/magic.h 2009-03-13 16:03 however, now we have good reasons 2009-03-13 16:03 we want a string in our code base and magic.h wants a number 2009-03-13 16:04 neither will change over time, by definition 2009-03-13 16:04 I think it is good enough, just that we now have a proper definition in magic.h 2009-03-13 16:05 yes, I guess that's enough 2009-03-13 16:06 well today I would like to make some progress on atomic commit 2009-03-13 16:06 make at least one meaningful change 2009-03-13 16:06 I should find a part you're not working on 2009-03-13 16:07 ok 2009-03-13 16:07 later... 2009-03-13 16:08 btw, if git config has not enough, I guess just re-push is easy 2009-03-13 16:08 git clone -s -l /pub/... 2009-03-13 16:08 rm -rf /pub/... 2009-03-13 16:08 git clone --bare /pub/.../torvalds/... 2009-03-13 16:09 git push 2009-03-13 16:09 well 2009-03-13 17:24 or I could git clone the tux3 repo 2009-03-13 17:24 sure 2009-03-13 17:25 lots of alternatives 2009-03-13 17:25 yes 2009-03-13 17:25 clone tux3 repo would be not good 2009-03-13 17:26 because it would have the wrong parent? 
2009-03-13 17:26 it will lost linus tree reference 2009-03-13 17:26 I see 2009-03-13 17:26 yes 2009-03-13 17:26 ok, I get it 2009-03-13 17:26 the current repo looks ok 2009-03-13 17:26 good 2009-03-13 17:26 it is a clone of linus 2009-03-13 17:26 yes 2009-03-13 17:26 I have not updated from linus yet 2009-03-13 17:26 I should probably try that 2009-03-13 17:27 next thing is, I need to apply the magic.h change 2009-03-13 17:27 if repo has objects/info/alternates to linus, update will just change sha1 ref 2009-03-13 17:28 so I will start by making a private clone of the kernel.org tux3 repo, apply the patch there, then push to kernel.org 2009-03-13 17:28 there is no actual pull 2009-03-13 17:28 ah 2009-03-13 17:28 good 2009-03-13 17:28 you can just clone on your local machine 2009-03-13 17:29 cat ./objects/info/alternates 2009-03-13 17:29 yes 2009-03-13 17:29 /pub/scm/linux/kernel/git/torvalds/linux-2.6.git/objects 2009-03-13 17:29 it's created by clone -s -l 2009-03-13 17:29 I just did clone 2009-03-13 17:29 originally 2009-03-13 17:29 so I don't know how it got that 2009-03-13 17:30 you added -s -l 2009-03-13 17:30 or git try it by default now 2009-03-13 17:30 ah 2009-03-13 17:30 I originally cloned it from another repo on my machine 2009-03-13 17:31 so it must have assumed -l 2009-03-13 17:31 well, push stuff 2009-03-13 17:31 yes, see what happens 2009-03-13 17:31 better to do it soon, and see what breaks 2009-03-13 17:31 it will only get harder to fix later 2009-03-13 17:31 you can get, git clone ssh://master.kernel.org:/pub/scm/linux/... 2009-03-13 17:32 ok 2009-03-13 17:32 or git clone ssh://daniel@master.kernel.org:/pub/scm/linux/... 2009-03-13 17:32 I forgot 2009-03-13 17:33 well, so remote of this repo is ssh://master.. 
2009-03-13 17:33 it means, "git push" will push to master.kernel.org 2009-03-13 17:33 via ssh 2009-03-13 17:34 the .git/config is 2009-03-13 17:34 git clone ssh://daniel@master.kernel.org:/pub/scm/linux/kernel/git/daniel/linux-tux3.git 2009-03-13 17:34 Initialized empty Git repository in /more/src/tmp/linux-tux3/.git/ 2009-03-13 17:34 Bad port '' 2009-03-13 17:34 fatal: The remote end hung up unexpectedly 2009-03-13 17:34 [remote "origin"] 2009-03-13 17:34 url = master.kernel.org:/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6.git 2009-03-13 17:34 fetch = +refs/heads/*:refs/remotes/origin/* 2009-03-13 17:34 [branch "master"] 2009-03-13 17:34 remote = origin 2009-03-13 17:34 merge = refs/heads/master 2009-03-13 17:34 ah 2009-03-13 17:35 git clone master.kernel.org:/pub... 2009-03-13 17:35 is work? 2009-03-13 17:35 "Bad port" 2009-03-13 17:35 there is no ssh:// 2009-03-13 17:35 um.. 2009-03-13 17:35 git clone master.kernel.org:/pub/scm/linux/kernel/git/daniel/linux-tux3.git 2009-03-13 17:36 this works for me 2009-03-13 17:36 ok 2009-03-13 17:36 works for me too 2009-03-13 17:37 good 2009-03-13 17:37 in the case of mine, I'm using it as local master repo for _push_ 2009-03-13 17:38 you push from kernel.org to your local? 2009-03-13 17:38 or? 
2009-03-13 17:38 no 2009-03-13 17:38 local to kernel.org 2009-03-13 17:38 right, just as I will do today 2009-03-13 17:38 to push the magic.h patch 2009-03-13 17:38 yes 2009-03-13 17:38 then I have to sync mercurial to git again 2009-03-13 17:39 the above cloned repo can be used as repo to push to kernel.org 2009-03-13 17:39 yes, that is my intention 2009-03-13 17:39 however, I was thinking to work on it is not good way 2009-03-13 17:40 except for the one patch to magic.h 2009-03-13 17:40 change files, create branches, or something like developing 2009-03-13 17:40 well, yes 2009-03-13 17:40 well, another local clone for development 2009-03-13 17:40 -s -l 2009-03-13 17:40 yes 2009-03-13 17:40 except, I don't know if my editor is hard link safe 2009-03-13 17:41 it doesn't have the copy of .git 2009-03-13 17:41 ok fine 2009-03-13 17:41 just a checkouted files 2009-03-13 17:41 so... the worse that can happen is it compiles the wrong files by accident 2009-03-13 17:41 if they are changed via the hard link in the other repo 2009-03-13 17:41 and branch or something doesn't have effect to push repo 2009-03-13 17:42 but if the other repo has no checkout, it is ok 2009-03-13 17:42 then, -s doesn't do anything I think 2009-03-13 17:42 so just clone -l 2009-03-13 17:43 -s is hardlike of .git objects 2009-03-13 17:43 ah 2009-03-13 17:43 that is ok then 2009-03-13 17:43 yes 2009-03-13 17:43 because only git touches them 2009-03-13 17:43 I will try not to touch them myself :) 2009-03-13 17:43 yes, and it doesn't share config or something like it 2009-03-13 17:43 ah 2009-03-13 17:44 right 2009-03-13 17:44 hard link of objects makes lots of sense 2009-03-13 17:44 yes 2009-03-13 17:44 it's correct by definition, whether the link is broken or not 2009-03-13 17:45 and -l is trying to avoid hardlink too 2009-03-13 17:45 it try to read from referencing remote directly if possible 2009-03-13 17:46 so, it shares fs memory cache too 2009-03-13 17:47 without reading inode 2009-03-13 19:59 
38 minutes to clone, much better than my 2.5 hour upload 2009-03-13 19:59 ACTION has a pathetic provider 2009-03-13 20:05 -!- chesse(~eworm@dslb-084-062-135-161.pools.arcor-ip.net) has joined #tux3 2009-03-13 21:05 -!- chesse_(~eworm@dslb-084-062-135-218.pools.arcor-ip.net) has joined #tux3 2009-03-13 23:59 -!- amey_m(~amey@117.195.41.101) has joined #tux3 2009-03-14 00:21 -!- RazvanM(~RazvanM@96.234.242.71) has joined #tux3 2009-03-14 07:05 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-14 07:57 -!- dcg(~dcg@112.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-14 08:14 -!- amey_(~amey@117.195.41.145) has joined #tux3 2009-03-14 08:22 -!- amey_(~amey@117.195.41.145) has joined #tux3 2009-03-14 08:27 -!- amey_(~amey@117.195.41.145) has joined #tux3 2009-03-14 08:32 -!- amey_(~amey@117.195.41.145) has joined #tux3 2009-03-14 09:54 oh, TUX3_SUPER_MAGIC is decimal 2009-03-14 09:55 and sb->s_magic = from_be_u32(*(be_u32 *)&sbi->super.magic); 2009-03-14 09:55 I guess it should be TUX3_SUPER_MAGIC 2009-03-14 09:55 -!- dcg_(~dcg@211.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-14 10:07 -!- pythonstar(~kavli@c-0afee455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-03-14 10:43 whoops 2009-03-14 10:44 hirofumi, reading it out of sbi-> is better 2009-03-14 10:44 (a little) 2009-03-14 10:44 need to fix the 0x 2009-03-14 10:45 ? 2009-03-14 10:45 out of sbi->? 2009-03-14 10:45 ah 2009-03-14 10:46 why? 2009-03-14 10:47 I suggest to think those are not related fields 2009-03-14 10:47 i.e. sb->s_magic is not needed to "tux3" 2009-03-14 10:49 well, it has to be set correctly or it is a bug 2009-03-14 10:49 so it must have the correct value 2009-03-14 10:50 I think of the include/linux version as just the export to user space 2009-03-14 10:50 what is correct value? 
2009-03-14 10:50 the value that was found and checked on disk 2009-03-14 10:50 it is super.magic 2009-03-14 10:51 so, I'm suggesting statfs->f_type can be unrelated to it 2009-03-14 10:51 why would it ever be different? 2009-03-14 10:52 because those are different stuff 2009-03-14 10:52 of course, we will set those with same value 2009-03-14 10:53 well linux/magic.h is fixed 2009-03-14 10:53 but, I think statfs->f_type and internal on-disk magic can be different 2009-03-14 10:53 ah, then my understanding is wrong 2009-03-14 10:54 and my suggestion is somehow to workaround linux/magic.h issue 2009-03-14 10:54 issue? 2009-03-14 10:54 statfs->f_type goes to magic.h 2009-03-14 10:54 and internal stuff is still internal 2009-03-14 10:54 so, tux3 userspace is not need magic.h 2009-03-14 10:55 I guess if it's not internal, people want to put those to magic.h 2009-03-14 10:56 e.g. current TUX3_SUPER_MAGIC is not used anywhere 2009-03-14 10:56 so, TUX3_MAGIC may goes to magic.h 2009-03-14 10:56 yes, linux/magic.h and ondisk magic can be allowed to be different, but I think they should be the same 2009-03-14 10:56 yes 2009-03-14 10:56 why would we ever let them be different? 2009-03-14 10:57 we set those same value, but different #define 2009-03-14 10:57 i.e. 
sb->s_magic and userland use magic.h 2009-03-14 10:57 and other internal stuff uses TUX3_MAGIC 2009-03-14 10:58 even if magic.h was changed, there is no effect to internal 2009-03-14 10:58 it's what I suggest 2009-03-14 11:01 by using the constant directly we may save 4 bytes of code, or 8 on 64 bit arch ;) 2009-03-14 11:02 oh wait 2009-03-14 11:02 no, it's the same 2009-03-14 11:02 a 4 byte constant vs a 4 byte address 2009-03-14 11:02 on 64 bit arch we save 4 bytes 2009-03-14 11:03 I guess non-constant would be bigger than others 2009-03-14 11:03 it uses data section with alignment 2009-03-14 11:03 and code slightly smaller than embeded constant data 2009-03-14 11:03 well 2009-03-14 11:03 alignment is noop on x86, and ds is the default ;) 2009-03-14 11:04 actually, fpu is needed to some alignement, iirc 2009-03-14 11:04 I forgot instraction requires alignment though 2009-03-14 11:05 16bytes alignment or something like that 2009-03-14 11:05 let's be the first to use a floating point magic number 2009-03-14 11:05 ;_ 2009-03-14 11:05 ;) 2009-03-14 11:05 will be back in a hour 2009-03-14 11:05 well, I guess x86 elf also adds padding to data section if needed 2009-03-14 11:07 my intent of this is, I don't want to see people say, please TUX3_MAGIC to magic.h 2009-03-14 11:08 ah right 2009-03-14 11:09 ok, see you later 2009-03-14 11:09 see you 2009-03-14 11:09 btw, I may sleep a bit 2009-03-14 11:35 -!- dcg_(~dcg@211.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-14 12:24 -!- dcg_(~dcg@211.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-14 13:14 -!- ferraro(~moina@121.96.69.130) has joined #tux3 2009-03-14 13:14 -!- ferraro(~moina@121.96.69.130) has left #tux3 2009-03-14 13:15 -!- ya-shu(~theresin@conexaologinnet.vescnet.com.br) has joined #tux3 2009-03-14 13:15 -!- ya-shu(~theresin@conexaologinnet.vescnet.com.br) has left #tux3 2009-03-14 13:17 -!- 
dcg_(~dcg@76.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-14 13:30 -!- ribaldo(~merrile@dns1.pme.nthu.edu.tw) has joined #tux3 2009-03-14 13:30 -!- ribaldo(~merrile@dns1.pme.nthu.edu.tw) has left #tux3 2009-03-14 15:02 on flush_log(), it's removes all logs 2009-03-14 15:02 however, probably, it's not all logs... 2009-03-14 15:03 bitmap before flush can be removed... 2009-03-14 15:03 btree log also can be removed... 2009-03-14 15:03 um... 2009-03-14 15:16 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-14 15:17 ok, I see what was the intent of this 2009-03-14 17:00 hi 2009-03-14 17:00 hi 2009-03-14 17:00 flush_log doesn't actually remove the log 2009-03-14 17:01 it just sets the log index so that the next commit shows that the log was flushed 2009-03-14 17:01 it removes logchain from sb? 2009-03-14 17:01 and after that, the log blocks can be freed and reused, which maybe I did not implement in the prototype 2009-03-14 17:01 it doesn't remove the logchain 2009-03-14 17:01 it just says "the log is shorter" 2009-03-14 17:02 yes 2009-03-14 17:02 this is because maybe at some point we will not remove all the log blocks 2009-03-14 17:02 just shorten the log 2009-03-14 17:02 however, log block can already be resued 2009-03-14 17:02 ok good 2009-03-14 17:02 sure 2009-03-14 17:23 I guess we have two types of log 2009-03-14 17:23 one is deletable at the flush_log() point 2009-03-14 17:24 and another one is not 2009-03-14 17:24 I guess btree and bitmap can be deleted 2009-03-14 17:24 however, others is not 2009-03-14 17:37 hey flips 2009-03-14 17:37 hi 2009-03-14 17:38 two types of log? 
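[annotation] The description of `flush_log` given at 17:00 (it does not remove the log, it only moves the log index so the next commit records a shorter chain) can be modelled with two counters. This is a minimal sketch under that reading; `toy_logsb` and the three function names are hypothetical, and only the `logbase` and `lognext` field names come from the log.

```c
#include <stdint.h>

/* Toy superblock (hypothetical) holding only the two log indexes. */
struct toy_logsb {
	uint32_t logbase;  /* index of the oldest log block still needed */
	uint32_t lognext;  /* index the next log block will receive */
};

/* Append one log block to the chain. */
static void toy_log_append(struct toy_logsb *sb)
{
	sb->lognext++;
}

/* Number of log blocks a replay would have to walk. */
static uint32_t toy_log_chain_length(const struct toy_logsb *sb)
{
	return sb->lognext - sb->logbase;
}

/* Flush: erase nothing, just declare that the chain starts here. */
static void toy_flush_log(struct toy_logsb *sb)
{
	sb->logbase = sb->lognext;
}
```

In this model the blocks behind `logbase` still exist after a flush; they are merely no longer counted, which matches the remark that they can be freed and reused afterwards.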
2009-03-14 17:39 ah 2009-03-14 17:39 yes, deletable and not deletable 2009-03-14 17:39 at the flush_log8) 2009-03-14 17:39 just one type is the intention 2009-03-14 17:39 flush works by dumping the deferred operations into the current delta 2009-03-14 17:40 so ok, there could be two logs 2009-03-14 17:40 maybe it is a better design 2009-03-14 17:40 because some of the log blocks that only contain per-delta information could be freed earlier 2009-03-14 17:40 so... 2009-03-14 17:40 yes 2009-03-14 17:41 yes, it could be easier to implement and easier to understand with two logs 2009-03-14 17:41 yes, probably 2009-03-14 17:41 and there will be fewer log blocks to replay 2009-03-14 17:41 I'll think about those more 2009-03-14 17:41 so I think it, yes and I will think about it too 2009-03-14 17:42 I meant, so I like it 2009-03-14 17:42 yes, it may be clean 2009-03-14 17:42 and more efficient 2009-03-14 17:42 at replay 2009-03-14 17:42 because there is only a short flush log and a short delta log to replay 2009-03-14 17:43 maybe 2009-03-14 17:43 instead of having several deltas pinning log blocks in place 2009-03-14 17:43 ah, yes 2009-03-14 17:44 per-delta logs can be freed more early? 2009-03-14 17:44 so I am convincing myself its better 2009-03-14 17:44 exactly 2009-03-14 17:44 yes 2009-03-14 17:44 after each delta 2009-03-14 17:44 is that true? 2009-03-14 17:44 and relay will not be bothered by those? 2009-03-14 17:44 I guess, yes 2009-03-14 17:44 I'll think about it to make sure those 2009-03-14 17:45 what is an example of a delta-only log entry? 2009-03-14 17:45 I'm not sure, maybe ileaf or something like that 2009-03-14 17:45 I'm not sure either ;) 2009-03-14 17:45 ok, let's enumerate 2009-03-14 17:45 :) 2009-03-14 17:46 I'm not understanding all needs logs 2009-03-14 17:46 there are btree pointer log entries, allocation log entries and... 
2009-03-14 17:46 yes 2009-03-14 17:47 ...change root 2009-03-14 17:47 so actually, these are all per-flush 2009-03-14 17:48 what is ileaf logs? 2009-03-14 17:48 and per-delta is just the increment of extending the log 2009-03-14 17:48 we aren't logging ileaf stuff 2009-03-14 17:48 except ileaf redirect 2009-03-14 17:48 we don't try to log inode attribute changes 2009-03-14 17:48 it would be possible, but it would be complex code 2009-03-14 17:48 ok 2009-03-14 17:48 and replay is dependent on inode attribute packing details 2009-03-14 17:49 and order of storing inode attributes 2009-03-14 17:49 the complexity of that is why we decided not to attempt logical logging of inode attribute updates 2009-03-14 17:49 for now 2009-03-14 17:49 and also, there is not much to gain from it 2009-03-14 17:49 i see 2009-03-14 17:50 as you pointed out, it is often more efficient to write the whole inode table block 2009-03-14 17:50 and also, inode table blocks can be written out at their "ideal" position usually without adding an extra seek 2009-03-14 17:50 it's not a few case only though 2009-03-14 17:50 unlike inode table index nodes 2009-03-14 17:50 it's a few cases only though 2009-03-14 17:50 for inode table attribute updates? 
2009-03-14 17:51 oh 2009-03-14 17:51 the cases where full block is more efficient 2009-03-14 17:51 create many inodes at the time 2009-03-14 17:51 well that includes one important common case: 2009-03-14 17:51 that is, mass creation of files 2009-03-14 17:51 yes 2009-03-14 17:51 however, usual cases would be update 2009-03-14 17:51 update inode attributes like timestamp or size 2009-03-14 17:52 on that case we don't care as much about a few extra writes 2009-03-14 17:52 we care about a few extra seeks of course 2009-03-14 17:52 ok 2009-03-14 17:52 but because the data, ileaf and dirent are all supposed to be close together, it should not be a lot of extra seeking for a single update 2009-03-14 17:53 anyway, it seems like we still have a single log for today 2009-03-14 17:53 not sure, it would be allocation policy 2009-03-14 17:53 because there is no such thing as a delta-only log entry that can be discarded on the next delta 2009-03-14 17:54 ok 2009-03-14 17:54 it's good 2009-03-14 17:54 it withstood a critical attack ;) 2009-03-14 17:54 :) 2009-03-14 17:55 well, so, probably, to write logs code would be changed 2009-03-14 17:55 I guess it will be moved to flush_log() 2009-03-14 17:55 because it's part of flush_log() 2009-03-14 17:56 well, I'll think more 2009-03-14 17:56 with the above 2009-03-14 18:22 ACTION thinks about that 2009-03-14 18:23 well there isn't really a reason for flush_log to be a separate function 2009-03-14 18:23 it should just be part of stage_delta 2009-03-14 18:25 the idea is, flush log moves certain blocks from the per-flush dirty list to the per-delta list, then the rest of normal delta processing follows 2009-03-14 18:25 it is flushed with different cycle, so I guess separate functions itself is good 2009-03-14 18:25 s/certain blocks/all per-flush dirty blocks/ 2009-03-14 18:25 sure, if it makes it easier to read, and it also makes it less indented 2009-03-14 18:28 ah 2009-03-14 18:28 I was unsderstanding log flush wrong 2009-03-14 18:29 
log flush itself is flushed per-delta 2009-03-14 18:29 so, it should be at per-delta position 2009-03-14 18:30 right 2009-03-14 18:30 it's kind of nice, isn't it? 2009-03-14 18:31 having two flush cycles doesn't really double the amount of flushing code 2009-03-14 18:31 yes 2009-03-14 18:33 but, position can be wrong 2009-03-14 18:33 um... 2009-03-14 18:37 ok, I guess retire_bfree is not right position 2009-03-14 18:37 I guess flushed bitmap can be including retire log blocks 2009-03-14 18:38 so, unstash() and empty log 2009-03-14 18:38 maybe 2009-03-14 19:13 yes, positions could easily be wrong 2009-03-14 20:05 -!- chesse(~eworm@dslb-084-062-179-186.pools.arcor-ip.net) has joined #tux3 2009-03-14 20:10 ah, there is per-delta log, it's bfree log of dtree 2009-03-14 20:10 except bitmap 2009-03-14 21:05 -!- chesse_(~eworm@dslb-084-062-179-045.pools.arcor-ip.net) has joined #tux3 2009-03-14 21:20 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-03-14 21:29 heh, I forget I wrote L as a typedef 2009-03-14 21:29 a macro would be an accident waiting to happen 2009-03-14 21:31 yes 2009-03-14 21:36 I guess current review is not same with usual review for merge? 2009-03-14 21:36 btw, I was thinking the review is for merge 2009-03-14 22:12 ah, review is just for review :) 2009-03-14 22:12 attract more eyeballs 2009-03-14 22:13 oh 2009-03-14 22:13 it may confuse linux-kernel people a bit 2009-03-14 22:14 it's what akpm asked for I thought 2009-03-14 22:14 he said review would take months 2009-03-14 22:14 actually, I thought, the review is, whether we can merge this? 
2009-03-14 22:14 well we already know the answer to that ;) 2009-03-14 22:14 some note for review may helps for reviewers 2009-03-14 22:15 ok, I will try to clarify it this week 2009-03-14 22:15 well, this is just my thought 2009-03-14 22:15 so, not require 2009-03-14 22:15 I think I will post a patch set, one patch per file, and make it clear we are not asking for merge 2009-03-14 22:16 just asking people to complain about what they want to complain about ;) 2009-03-14 22:16 joke 2009-03-14 22:16 we're really asking people to help 2009-03-14 22:16 :) 2009-03-14 22:16 yes 2009-03-14 22:17 so far, the only interesting complaint is Andi Kleen, he wants atomic commit to be working 2009-03-14 22:17 well it's pretty close, but that is not a reason for holding back the review of all the rest 2009-03-14 22:17 I don't agree with Andi by the way 2009-03-14 22:17 yes, well, Andi would be thinking about merge 2009-03-14 22:17 yes 2009-03-14 22:17 even so 2009-03-14 22:18 I don't agree that it is right to have the filesystem "usable" at merge 2009-03-14 22:18 nice, but not necessily the best way to proceed 2009-03-14 22:18 yes, for devlopers 2009-03-14 22:19 I guess Andi is thinking mainline is for users more or less 2009-03-14 22:19 linus tree 2009-03-14 22:19 I think, maybe in three weeks or so, I will start running a tux3 partition with my source code on it 2009-03-14 22:19 with backup of course 2009-03-14 22:19 I don't recommend it :) 2009-03-14 22:19 and recovery must be working at least most of the time 2009-03-14 22:19 I know 2009-03-14 22:20 you probably would not have recommended I use it for the scale talk either ;) 2009-03-14 22:20 yes :) 2009-03-14 22:20 but in my experience, putting a system into use before you think it's ready is worth the pain 2009-03-14 22:21 just don't give it to a user and tell them everything is ok 2009-03-14 22:21 it's not 2009-03-14 22:21 it's a big risk 2009-03-14 22:21 yes 2009-03-14 22:22 in old devloping style of linux, anybody 
complain about it 2009-03-14 22:22 nobody complain about it 2009-03-14 22:22 well 2009-03-14 22:23 -mm, -next would be good even with the current code 2009-03-14 22:23 I guess 2009-03-14 22:24 I think linus tree is arguable 2009-03-14 22:24 of course, I'd like to be included in linus tree too 2009-03-14 22:25 however, I guess people may have objection 2009-03-14 22:25 sure, we are old school ;) 2009-03-14 22:25 :) 2009-03-14 22:25 -next is the one I think 2009-03-14 22:25 I am not sure what -mm is for any more 2009-03-14 22:26 -next is the new -mm 2009-03-14 22:26 or maybe -mm is even earlier than -next? 2009-03-14 22:26 -next is blindly merged by scripts 2009-03-14 22:26 ok 2009-03-14 22:26 and -mm is the staging area for push to linus? 2009-03-14 22:27 so, I guess testers are using -mm 2009-03-14 22:27 um 2009-03-14 22:27 yes 2009-03-14 22:27 ok 2009-03-14 22:27 so -next is anything goes 2009-03-14 22:27 and -mm is "this mostly works" 2009-03-14 22:28 well, if it's in -next, -mm pulls from -next periodically 2009-03-14 22:28 yes 2009-03-14 22:29 I think akpm compiles -mm at least 2009-03-14 22:30 btw, big one would be checkpatch.pl 2009-03-14 22:31 compiles it and subjects it to some pretty good testing 2009-03-14 22:31 and tries to get others to do the same, I am not sure how many do 2009-03-14 22:31 but I will certainly run -mm when tux3 is in it 2009-03-14 22:32 for a long time, the only reason I would run -mm is because it had kgdb 2009-03-14 22:32 now that reason went away 2009-03-14 22:32 so...
maybe akpm should put in kdb now ;) 2009-03-14 22:33 I recommend to use kvm instead :) 2009-03-14 22:36 heh 2009-03-14 22:36 and coming from you, I take that seriously 2009-03-14 22:36 the only reason I have not is, busy with other things 2009-03-14 22:36 oh 2009-03-14 22:37 I will try it pretty soon 2009-03-14 22:37 I was thinking the reason is machine doesn't have vmx 2009-03-14 22:37 intel vt 2009-03-14 22:38 btw, kvm is really easy to start to use 2009-03-14 22:39 if you need help of me, please let me know 2009-03-14 22:42 well that too 2009-03-14 22:42 it's a pentium-m 2009-03-14 22:42 but hopefully it will be a 2x 4 core pretty soon 2009-03-14 22:42 with vmx 2009-03-14 22:42 good 2009-03-14 22:43 btw, bfree log of normal inodes are per-delta log 2009-03-14 22:43 dinner time 2009-03-14 22:43 ok 2009-03-14 22:43 yes, true 2009-03-14 22:43 I'll think about that over sinner 2009-03-14 22:44 sushi :) 2009-03-14 22:44 with sake 2009-03-14 22:44 good :) 2009-03-14 22:44 sake means nihonsyu 2009-03-14 22:44 um... 
2009-03-14 22:45 ok 2009-03-14 22:47 sake is japanese alcohol 2009-03-14 22:48 I didn't know there is it in english 2009-03-14 23:57 we call it sake :) 2009-03-14 23:58 ah 2009-03-14 23:58 sake is salmon 2009-03-15 00:00 no, it was sake and hamachine with nihonsyu after 2009-03-15 00:00 s/no/so/ 2009-03-15 00:00 hamachin 2009-03-15 00:00 eh 2009-03-15 00:00 hamachi 2009-03-15 00:11 -!- RazvanM(~RazvanM@96.234.242.71) has joined #tux3 2009-03-15 00:54 hmm, something tells me that tytso will have only limited success convinsing people that it is ok for ext4 to lose files 2009-03-15 00:55 http://lwn.net/Articles/323169/ 2009-03-15 06:17 I didn't follow the this thread though 2009-03-15 06:18 I can't see what is new 2009-03-15 06:18 well, delalloc can introduce new problem 2009-03-15 06:20 however, to require fsync (I guess actually fsyncdata) to guarantee data to disk, it seems usual manner 2009-03-15 06:22 -!- cdk(~Chinmay@117.195.39.135) has joined #tux3 2009-03-15 06:25 um... 2009-03-15 06:27 people is expecting the journal means RDBMS like behavior...? 2009-03-15 07:18 they probably expect that when metadata is written to disk, so is also the data 2009-03-15 07:19 never write the metadata for the data before the data is written to disk 2009-03-15 07:20 however, open and write is on different transaction 2009-03-15 07:20 so, I guess empty file can be 2009-03-15 07:21 -!- dcg(~dcg@177.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-15 10:38 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-15 12:43 -!- amey_m(~amey@117.195.39.135) has joined #tux3 2009-03-15 13:33 -!- marcin(~marcin@76.23.106.132) has joined #tux3 2009-03-15 13:43 hirofumi, nice summary: "open and write is on different transaction..." 
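A note on the fsync point above: since open/truncate and write are separate transactions, an application that wants "old file or new file, never an empty one" usually writes a temporary file, fsyncs it, and renames it over the target. The following is a minimal sketch of that pattern, not anything taken from ext4 or tux3. The function name replace_file and the ".tmp" suffix are made up for illustration, and it assumes a POSIX system. It does not fsync the containing directory, which some applications also do so that the rename itself is durable.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write all of buf, retrying on short writes. Returns 0 on success. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);
		if (n < 0)
			return -1;
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Hypothetical helper: replace `path` with new contents without ever
 * truncating the existing file in place. The ".tmp" suffix is an
 * arbitrary choice for this sketch.
 */
int replace_file(const char *path, const char *buf, size_t len)
{
	char tmp[4096];
	int fd;

	assert(path != NULL && buf != NULL);
	if (snprintf(tmp, sizeof tmp, "%s.tmp", path) >= (int)sizeof tmp)
		return -1;
	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	/* The data has to reach disk before the rename makes it visible. */
	if (write_all(fd, buf, len) < 0 || fsync(fd) < 0) {
		close(fd);
		unlink(tmp);
		return -1;
	}
	if (close(fd) < 0 || rename(tmp, path) < 0) {
		unlink(tmp);
		return -1;
	}
	return 0;
}
```

The design choice here is that the truncate never happens on the visible name, so the window discussed above (file truncated, data still sitting in cache) does not apply to the target path.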
2009-03-15 13:59 ok, this must be the issue: people are shutting down their machines after disk activity has ended and they are coming back up with files truncated, but not yet written 2009-03-15 14:00 because the write data sits in cache doing nothing for a long time 2009-03-15 14:00 the disk should never sit idle when there is dirty file data in cache 2009-03-15 14:01 it is true that for a shutdown in the middle of write activity, there is no perfect way to keep the write together with the initial truncate, but this is not what users are doing 2009-03-15 14:05 -!- cdk(~Chinmay@117.195.34.142) has joined #tux3 2009-03-15 14:30 -!- dcg(~dcg@47.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-15 14:45 -!- gebi_(~gebi@84-119-57-55.dynamic.xdsl-line.inode.at) has joined #tux3 2009-03-15 18:36 -!- data(~data@84.19.190.213) has joined #tux3 2009-03-15 20:05 -!- chesse(~eworm@dslb-084-062-190-120.pools.arcor-ip.net) has joined #tux3 2009-03-15 21:05 -!- chesse_(~eworm@dslb-084-062-173-056.pools.arcor-ip.net) has joined #tux3 2009-03-15 21:24 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-15 21:26 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-15 22:14 -!- amey_m(~amey@117.195.40.25) has joined #tux3 2009-03-15 22:23 -!- RazvanM(~RazvanM@96.234.242.71) has joined #tux3 2009-03-15 22:41 -!- samlh(~sam@67.129.121.145) has joined #tux3 2009-03-15 23:33 I assume shutdown is meaning abnormal shutdown like crash 2009-03-15 23:33 because normal shutdown does umount 2009-03-15 23:34 idle is... 2009-03-15 23:36 I think, if it is fist write in idle, some delay is useful and prefer 2009-03-15 23:36 because write() speed can be slow than disk speed 2009-03-15 23:38 e.g. 
read data from network, then write to disk with small chunk 2009-03-15 23:39 on this situation, if we try to keep disk active, I guess fragmentation issue become more big problem 2009-03-15 23:41 so, maybe, some scheduling rule and watermark or something like it would be needded 2009-03-16 01:02 btw, we are not writing dleaf/ileaf 2009-03-16 01:02 those are written at per-delta? 2009-03-16 01:15 -!- arima(~other@123-204-128-113.adsl.dynamic.seed.net.tw) has joined #tux3 2009-03-16 02:05 -!- macan(~macan@159.226.41.129) has joined #tux3 2009-03-16 05:23 -!- pythonstar(~kavli@c-0afee455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-03-16 06:59 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-16 07:07 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-16 07:46 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-03-16 08:23 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-16 09:59 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-03-16 11:39 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-16 11:59 -!- amey(~amey@socks.wantstofly.org) has joined #tux3 2009-03-16 12:00 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-03-16 12:12 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-03-16 13:44 hirofumi, yes, dlead/ileaf are written per delta 2009-03-16 13:44 ok 2009-03-16 13:44 btw, I'm trying to list those up 2009-03-16 13:44 per delta per flush 2009-03-16 13:44 broot 2009-03-16 13:44 bnode x 2009-03-16 13:44 ileaf x 2009-03-16 13:44 dleaf x 2009-03-16 13:44 data x 2009-03-16 13:44 bitmap x 2009-03-16 13:44 --------------------------------------------------------------------- 2009-03-16 13:44 btree update x 2009-03-16 13:44 btree redirect 2009-03-16 13:44 balloc (data) log x 2009-03-16 13:44 bfree (data) log x 2009-03-16 13:44 balloc (bitmap) 
log 2009-03-16 13:44 bfree (bitmap) log 2009-03-16 13:46 broot (if changed) is per flush 2009-03-16 13:46 what is "btree update"? 2009-03-16 13:47 btree update means bnode modify log 2009-03-16 13:48 right 2009-03-16 13:48 log blocks are per-delta, including all log entries 2009-03-16 13:49 I guess log of bitmap are bit difference 2009-03-16 13:49 well, bfree log of bitmap is genereted for earch bitmap flush 2009-03-16 13:50 yes 2009-03-16 13:50 so, I separated it from bfree (data) 2009-03-16 13:50 yes, it helps us understand what to expect at what position in the log 2009-03-16 13:51 yes 2009-03-16 13:51 broot was confusion 2009-03-16 13:52 I guess btree root bnode is just bnode 2009-03-16 13:52 same way 2009-03-16 13:52 well, the log entry for a redirect of a bitmap block can occur anywhere in the log, at any delta - the entry is made when the first time the block is dirtied after a flush 2009-03-16 13:53 that is, the balloc for the bitmap block 2009-03-16 13:53 redirect log of bitmap block? 2009-03-16 13:53 yes 2009-03-16 13:53 um... 2009-03-16 13:54 current code doesn't have it at all? 2009-03-16 13:55 log_redirect(sb, oldblock, newblock); at line 367 of btree.c 2009-03-16 13:55 it is redirect of bnode? 2009-03-16 13:58 or ileaf 2009-03-16 13:58 ah, and or dleaf? 2009-03-16 13:58 yes 2009-03-16 13:59 so, we don't need the redirect log of bitmap? 2009-03-16 13:59 the defer_free there is wrong for ileaf 2009-03-16 13:59 we also need the redirect log for a bitmap block 2009-03-16 14:00 um... 2009-03-16 14:00 bfree log of bitmap is not enough? 2009-03-16 14:00 log replay needs to know where the bitmap block is redirected to 2009-03-16 14:00 the bitmap block redirect can occur at any time 2009-03-16 14:01 bitmap block redirect means bitmap flush in flush_log()? 2009-03-16 14:02 no, it means the bitmap was dirtied for the first time after a flush 2009-03-16 14:03 at the time of dirtied, we don't know actual physical address of bitmap block until flush? 
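The "redirect on first dirty after a flush" rule described above can be sketched in a few lines. This is only a toy model under stated assumptions: the names toy_fs, toy_block, toy_dirty and toy_flush are invented for illustration, the allocator is a trivial counter, and nothing here is the actual tux3 log_redirect() code. Which block types really need a logged redirect is still being worked out in the conversation.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long block_t;

/* One logged redirect: the block moved from oldblock to newblock. */
struct toy_redirect {
	block_t oldblock, newblock;
};

struct toy_fs {
	block_t next_free;            /* trivial bump allocator (assumption) */
	struct toy_redirect log[16];  /* redirect entries, in log order */
	size_t logged;
};

struct toy_block {
	block_t physical;             /* current physical address */
	int dirty;                    /* dirtied since the last flush? */
};

/*
 * Dirty a block. Only the first dirtying after a flush picks a new
 * physical address and records a redirect; later ones do nothing extra.
 */
void toy_dirty(struct toy_fs *fs, struct toy_block *b)
{
	block_t newblock;

	if (b->dirty)
		return;
	assert(fs->logged < 16);
	newblock = fs->next_free++;
	fs->log[fs->logged].oldblock = b->physical;
	fs->log[fs->logged].newblock = newblock;
	fs->logged++;
	b->physical = newblock;
	b->dirty = 1;
}

/* After a flush the block is clean again, so the next dirty redirects. */
void toy_flush(struct toy_block *b)
{
	b->dirty = 0;
}
```

In this model the redirect entry can land anywhere in the log, because it is created at dirty time rather than at flush time.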
2009-03-16 14:04 we choose the new bitmap block address when it is dirtied 2009-03-16 14:04 um 2009-03-16 14:04 :p 2009-03-16 14:04 you are right 2009-03-16 14:04 ACTION is stupid 2009-03-16 14:04 I was thinking for a moment, bitmap blocks are physical 2009-03-16 14:04 ah 2009-03-16 14:05 ok, from " so, we don't need the redirect log of bitmap?" again 2009-03-16 14:05 yes 2009-03-16 14:06 it will be the same as a data block redirect 2009-03-16 14:06 yes 2009-03-16 14:06 and will only occur during a flush 2009-03-16 14:06 and will show up in the delta following the one being staged 2009-03-16 14:07 and bfree log is enough of those? 2009-03-16 14:08 just a moment... 2009-03-16 14:17 I am not sure what you meant 2009-03-16 14:18 bitmap block redirect needs a redirect log entry just like any data block 2009-03-16 14:18 um 2009-03-16 14:18 :p 2009-03-16 14:18 no it doesn't 2009-03-16 14:18 bfree log is enough :) 2009-03-16 14:18 ok :) 2009-03-16 14:19 ok, hopefully I am awake now 2009-03-16 14:20 there are any special case related to bnode root? 
2009-03-16 14:20 actually, maybe, the pointer to bnode root 2009-03-16 14:22 I changed the order of list 2009-03-16 14:23 two special cases, one for data root and one for inode table root 2009-03-16 14:23 per delta per flush 2009-03-16 14:23 broot 2009-03-16 14:23 bnode x 2009-03-16 14:23 ileaf x 2009-03-16 14:23 dleaf x 2009-03-16 14:23 bnode update x 2009-03-16 14:23 bnode redirect 2009-03-16 14:23 ileaf redirect 2009-03-16 14:23 dleaf redirect 2009-03-16 14:23 --------------------------------------------------------------------- 2009-03-16 14:23 data x 2009-03-16 14:23 balloc (data) log x 2009-03-16 14:23 bfree (data) log x 2009-03-16 14:23 --------------------------------------------------------------------- 2009-03-16 14:23 bitmap x 2009-03-16 14:23 balloc (bitmap) log 2009-03-16 14:23 bfree (bitmap) log 2009-03-16 14:23 i see 2009-03-16 14:25 data root change means the ileaf is dirtied and will be flushed in the same delta 2009-03-16 14:26 inode table root change means the superblock is updated 2009-03-16 14:26 i see 2009-03-16 14:27 on current code, iroot update is logged, however data root is not 2009-03-16 14:28 what is meaning the iroot log? 
2009-03-16 14:30 iroot log entry might be redundant, we can just write the superblock iroot 2009-03-16 14:31 i see 2009-03-16 14:31 I have not thought about that a lot, because the superblock update is not handled safely anyway right now 2009-03-16 14:31 I thought, that will be the last detail we take care of 2009-03-16 14:32 it's a very small window 2009-03-16 14:32 yes 2009-03-16 14:32 ok, so, I will assume iroot means just update the superblock for now 2009-03-16 14:32 well, I know metablocks already though :) 2009-03-16 14:33 yes 2009-03-16 14:34 so, map_region() may update the ileaf 2009-03-16 14:34 yes 2009-03-16 14:34 I guess we have to take care it 2009-03-16 14:34 it will for most file writes 2009-03-16 14:35 ileaf dirty -> cursor_redirect should do it 2009-03-16 14:36 and I guess map_region should be done before ileaf flush 2009-03-16 14:37 yes 2009-03-16 14:37 that is, before walking the list of dirty inodes and doing store_attrs on them 2009-03-16 14:38 which is currently done by VFS for us 2009-03-16 14:38 but we have to take control of that 2009-03-16 14:38 to be sure to do it at the right time 2009-03-16 14:38 yes 2009-03-16 14:39 if (create == 2) { <- needs something 2009-03-16 14:40 ok, so, order is, data block -> data btree -> ileaf -> inode btree -> superblock 2009-03-16 14:40 yes 2009-03-16 14:41 the chain can be broken at "data btree" and "inode btree" because of "promises" 2009-03-16 14:41 which is good, otherwise we would have to go all the way up the tree much more often 2009-03-16 14:43 yes, we try to avoid to dirty update structure with logging? 
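The dirtying order above (data block -> data btree -> ileaf -> inode btree -> superblock) and the way "promises" can break the chain can be modelled roughly as follows. This is a hedged toy model, not tux3 code: the enum, the promise[] array and propagate_dirty() are all hypothetical names, and it assumes a promise at a level simply means "the change is recorded as a log entry instead of dirtying the next level up".

```c
#include <assert.h>

/* The chain from the discussion, lowest level first. */
enum level {
	DATA_BLOCK,
	DATA_BTREE,
	ILEAF,
	INODE_BTREE,
	SUPERBLOCK,
	NLEVELS
};

/*
 * Mark levels dirty starting at `from` and walking up the chain.
 * promise[l] != 0 means a change at level l is logged as a promise,
 * so it does not dirty level l + 1. Returns the highest level dirtied.
 */
enum level propagate_dirty(enum level from, const int promise[NLEVELS],
			   int dirty[NLEVELS])
{
	enum level l = from;

	assert(from < NLEVELS);
	dirty[l] = 1;
	while (l < SUPERBLOCK && !promise[l]) {
		l = (enum level)(l + 1);
		dirty[l] = 1;
	}
	return l;
}
```

With promises at DATA_BTREE and INODE_BTREE, a data block write stops at the data btree and an ileaf change stops at the inode btree, which is the "we would have to go all the way up the tree much more often" saving mentioned above.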
2009-03-16 14:43 avoid to dirty upper structure 2009-03-16 14:43 map_region needs blockdirty for dleaf 2009-03-16 14:44 yes, we try to avoid redirecting btree node blocks on every dirty, with logging 2009-03-16 14:44 yes, avoid dirtying upper nodes 2009-03-16 14:44 sorry 2009-03-16 14:44 yes, and not all though 2009-03-16 14:45 not all cases though 2009-03-16 14:45 ah, not sorry, that's correct 2009-03-16 14:45 correct, not all cases 2009-03-16 14:45 ok 2009-03-16 14:46 we avoid all cases where a child pointer is updated or inserted 2009-03-16 14:47 yes 2009-03-16 14:47 data btree -> ileaf 2009-03-16 14:47 and inode btree -> superblock 2009-03-16 14:47 try to avoid the above chain 2009-03-16 14:47 and minimize 2009-03-16 14:59 ok, did I say all those points above in my earlier design note? 2009-03-16 14:59 ACTION looks for it 2009-03-16 15:00 um..., I forgot 2009-03-16 15:01 hmm, I can't find it 2009-03-16 15:01 ah 2009-03-16 15:01 logging records are a bit clear what is needed 2009-03-16 15:01 "Btree index block life cycle" 2009-03-16 15:01 if we can't replay, we need some logging 2009-03-16 15:02 so, some of those points are not stated very clearly 2009-03-16 15:02 probably 2009-03-16 15:02 yes, that is one way to think about it 2009-03-16 15:02 so, writing the replay code tells us if the logging is sufficient 2009-03-16 15:02 more or less 2009-03-16 15:03 ok, I should improve the replay code 2009-03-16 15:04 it needs two passes, one pass to reconstruct btree nodes, the second to reconstruct bitmaps 2009-03-16 15:05 yes 2009-03-16 15:06 btw, current list is 2009-03-16 15:06 per delta per flush 2009-03-16 15:06 bnode x 2009-03-16 15:06 ileaf x 2009-03-16 15:06 dleaf x 2009-03-16 15:06 bnode update x 2009-03-16 15:06 bnode redirect 2009-03-16 15:06 --------------------------------------------------------------------- 2009-03-16 15:06 ileaf data 2009-03-16 15:06 ileaf redirect 2009-03-16 15:06 ---------------------------------------------------------------------
2009-03-16 15:06 dleaf data 2009-03-16 15:06 dleaf redirect 2009-03-16 15:06 --------------------------------------------------------------------- 2009-03-16 15:06 data x 2009-03-16 15:06 balloc (data) log x 2009-03-16 15:06 bfree (data) log x 2009-03-16 15:06 --------------------------------------------------------------------- 2009-03-16 15:06 bitmap x 2009-03-16 15:06 balloc (bitmap) log 2009-03-16 15:06 bfree (bitmap) log 2009-03-16 15:06 [note] 2009-03-16 15:06 The left structure can dirty right structre 2009-03-16 15:06 data block -> data btree -> ileaf -> inode btree -> superblock. 2009-03-16 15:10 some do not have any x? 2009-03-16 15:10 it means unknown 2009-03-16 15:10 yet 2009-03-16 15:11 ileaf data, and dleaf data are per delta 2009-03-16 15:11 um... 2009-03-16 15:11 ye 2009-03-16 15:11 yes 2009-03-16 15:11 those are per delta? 2009-03-16 15:11 yes 2009-03-16 15:12 I think that in some sense they are all per delta 2009-03-16 15:12 ileaf/dleaf redirect means bfree? 2009-03-16 15:12 the bfree is implied 2009-03-16 15:13 I was forgot why do we need original block address 2009-03-16 15:13 for replay 2009-03-16 15:13 replay loads the original block into the new physical location 2009-03-16 15:14 um... 2009-03-16 15:14 however, new data is already written on new physical location? 2009-03-16 15:16 not yet 2009-03-16 15:16 um... 2009-03-16 15:16 that is why replay has to load the old block 2009-03-16 15:17 ileaf data is per delta 2009-03-16 15:17 so, I thought new data is also same state after redirect 2009-03-16 15:18 indeed, only the last redirect in a flush cycle is important 2009-03-16 15:18 well 2009-03-16 15:19 let me think 2009-03-16 15:19 thanks 2009-03-16 15:21 so yes, ileaf or dleaf can be redirected/written out every delta, and the original block will be freed after delta commit 2009-03-16 15:21 so, ileaf/dleaf redirect means same with bfree log for now? 2009-03-16 15:22 same with bfree log? 
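The two-pass replay mentioned earlier ("one pass to reconstruct btree nodes, the second to reconstruct bitmaps") might look roughly like this. It is a sketch under heavy assumptions: the entry types, the single toy bnode with child slots, and toy_replay() are invented for illustration and do not reflect the real tux3 log format or replay code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BLOCKS 64   /* size of the toy volume, in blocks */
#define TOY_SLOTS  8    /* child pointers in the single toy bnode */

/* Hypothetical log entry kinds, loosely named after the list above. */
enum toy_type { TOY_BNODE_UPDATE, TOY_BALLOC, TOY_BFREE };

struct toy_entry {
	enum toy_type type;
	unsigned block;   /* block allocated or freed, or the new child */
	unsigned slot;    /* child slot, only used by TOY_BNODE_UPDATE */
};

struct toy_state {
	unsigned child[TOY_SLOTS];        /* reconstructed bnode contents */
	uint8_t bitmap[TOY_BLOCKS / 8];   /* reconstructed allocation map */
};

static void toy_set_bit(uint8_t *map, unsigned bit, int value)
{
	if (value)
		map[bit / 8] |= (uint8_t)(1u << (bit % 8));
	else
		map[bit / 8] &= (uint8_t)~(1u << (bit % 8));
}

int toy_test_bit(const uint8_t *map, unsigned bit)
{
	return (map[bit / 8] >> (bit % 8)) & 1;
}

void toy_replay(struct toy_state *st, const struct toy_entry *log, size_t count)
{
	size_t i;

	/* Pass 1: rebuild btree index nodes from the update entries. */
	for (i = 0; i < count; i++) {
		if (log[i].type != TOY_BNODE_UPDATE)
			continue;
		assert(log[i].slot < TOY_SLOTS);
		st->child[log[i].slot] = log[i].block;
	}

	/* Pass 2: rebuild the bitmap from balloc/bfree entries, in log order. */
	for (i = 0; i < count; i++) {
		if (log[i].type == TOY_BNODE_UPDATE)
			continue;
		assert(log[i].block < TOY_BLOCKS);
		toy_set_bit(st->bitmap, log[i].block, log[i].type == TOY_BALLOC);
	}
}
```

Replaying the bitmap entries strictly in log order means a later bfree of a block overrides an earlier balloc of the same block, so only the final state of each block survives.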
2009-03-16 15:22 on replay, it makes the defree entry 2009-03-16 15:23 yes 2009-03-16 15:23 ok 2009-03-16 15:23 still thinking about ileaf/dleaf redirect 2009-03-16 15:23 ok 2009-03-16 15:23 log replay can find more than one of them for the same logical data 2009-03-16 15:24 and the earlier redirects may reference physical blocks that have been freed and reused 2009-03-16 15:24 yes 2009-03-16 15:24 so, a simple log replay may load stale data into the directed location 2009-03-16 15:24 but I think that does not hurt 2009-03-16 15:25 the final redirect will load good data 2009-03-16 15:25 I guess, if we reconstruct btree at first, we will see final data 2009-03-16 15:25 it does not feel very clean though 2009-03-16 15:26 um... 2009-03-16 15:27 may not so simple like it 2009-03-16 15:27 so my original plan was to first scan the log for btree entries, and not replay any log transaction against a freed block 2009-03-16 15:28 sounds good 2009-03-16 15:28 I'm not sure though 2009-03-16 15:28 I guess it is the cleanest thing to do 2009-03-16 15:28 probably 2009-03-16 15:28 I don't really like loading random data into cache 2009-03-16 15:28 i see 2009-03-16 15:29 per delta per flush 2009-03-16 15:29 bnode x 2009-03-16 15:29 ileaf x 2009-03-16 15:29 dleaf x 2009-03-16 15:29 bnode update log x 2009-03-16 15:29 bnode redirect log x 2009-03-16 15:29 --------------------------------------------------------------------- 2009-03-16 15:29 ileaf data x 2009-03-16 15:29 ileaf redirect log x 2009-03-16 15:29 --------------------------------------------------------------------- 2009-03-16 15:29 dleaf data x 2009-03-16 15:29 dleaf redirect log x 2009-03-16 15:29 --------------------------------------------------------------------- 2009-03-16 15:29 data x 2009-03-16 15:29 balloc (data) log x 2009-03-16 15:29 bfree (data) log x 2009-03-16 15:29 --------------------------------------------------------------------- 2009-03-16 15:29 bitmap x 2009-03-16 15:29 balloc (bitmap) log 2009-03-16 
15:29 bfree (bitmap) log x 2009-03-16 15:29 [note] 2009-03-16 15:29 The left structure can dirty right structre 2009-03-16 15:29 data block -> data btree -> ileaf -> inode btree -> superblock. 2009-03-16 15:29 updated one 2009-03-16 15:29 balloc (bitmap) log may not be needed 2009-03-16 15:29 and may not listed all yet 2009-03-16 15:30 ok, nice 2009-03-16 15:31 thanks 2009-03-16 15:31 I guess this talk cleared my brain for atomic commit more or less 2009-03-16 15:31 mine too 2009-03-16 15:32 I will improve the replay code 2009-03-16 15:32 ok 2009-03-16 15:32 I'll try the flush code 2009-03-16 16:01 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-16 16:59 hirofumi: how are you finding the time and money to do tux3 nearly full time ? 2009-03-16 17:15 not sure for now 2009-03-16 18:01 -!- amey(~amey@socks.wantstofly.org) has joined #tux3 2009-03-16 18:01 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-03-16 18:36 hirofumi, you were right, there is no reason to log a redirect for btree leaf nodes, dleaf or ileaf 2009-03-16 18:36 because there are no promises logged against these 2009-03-16 18:37 when a btree leaf is redirected: 1) it was clean before being redirected 2) it will be clean again after the delta completes 2009-03-16 18:37 this can be a code comment 2009-03-16 18:39 I guess leaf redirect is needed to make defree entry 2009-03-16 18:40 well, so, ileaf/dleaf blocks were working like data blocks 2009-03-16 18:40 yes 2009-03-16 18:40 it sounds really good 2009-03-16 18:41 it sounds simple, no need to figure out which blocks were freed before doing the first replay pass 2009-03-16 18:41 we can make it more complex later, if we try to log some promises against dleaf or ileaf 2009-03-16 18:41 yes 2009-03-16 18:41 probably, complex is only bnode stuff 2009-03-16 18:42 it's really good 2009-03-16 18:42 and that is not very complex, I will draft the code today 2009-03-16 18:42 probably 2009-03-16 18:43 
well, if there is other complex stuff, I guess it makes stuff complex doublely 2009-03-16 18:43 complex + complex is really complex 2009-03-16 18:43 complex * complex 2009-03-16 18:43 :) 2009-03-16 18:43 or complex ** complex 2009-03-16 18:43 sure 2009-03-16 18:44 extents * versions * clever dtree format * promises 2009-03-16 18:45 well 2009-03-16 18:45 extents * versions * clever dtree format 2009-03-16 18:45 that is the worst one 2009-03-16 18:45 yes 2009-03-16 18:46 I guess it will be a summer project 2009-03-16 18:46 maybe sooner 2009-03-16 18:46 well, it's a systematic 2009-03-16 18:46 probably sooner 2009-03-16 18:46 yes 2009-03-16 18:46 anyway, replay... 2009-03-16 18:47 there are not many lines of code to add before recovery is testable 2009-03-16 18:47 it is mainly a question of being clear on what has to be done 2009-03-16 18:47 ok 2009-03-16 18:54 per delta per flush 2009-03-16 18:54 bnode x 2009-03-16 18:54 bnode update log x 2009-03-16 18:54 bnode redirect log x 2009-03-16 18:54 --------------------------------------------------------------------- 2009-03-16 18:54 ileaf data x 2009-03-16 18:54 ileaf redirect log x 2009-03-16 18:54 --------------------------------------------------------------------- 2009-03-16 18:54 dleaf data x 2009-03-16 18:54 dleaf redirect log x 2009-03-16 18:54 --------------------------------------------------------------------- 2009-03-16 18:54 data x 2009-03-16 18:54 balloc (data) log x 2009-03-16 18:54 bfree (data) log x 2009-03-16 18:54 --------------------------------------------------------------------- 2009-03-16 18:54 bitmap x 2009-03-16 18:54 balloc (bitmap) log 2009-03-16 18:54 bfree (bitmap) log x 2009-03-16 18:54 [NOTE] 2009-03-16 18:54 * bnode flush retires 2009-03-16 18:55 "bnode update log" and "bnode redirect log" 2009-03-16 18:55 * bitmap flush retires 2009-03-16 18:55 "balloc (data) log" and "bfree (data) log" 2009-03-16 18:55 * The left structure can dirty right structure 2009-03-16 18:55 data block -> data 
btree -> ileaf -> inode btree -> superblock. 2009-03-16 18:55 current one 2009-03-16 18:55 I guess "bfree (bitmap) log" is not retired until next bitmap flush 2009-03-16 18:56 should it be a code comment? 2009-03-16 18:57 not sure yet 2009-03-16 18:57 true re next bitmap flush 2009-03-16 18:57 i see 2009-03-16 18:57 well 2009-03-16 18:58 I think the bfree (bitmap) log entry does not go into the log until it is safe to free, that is, just after a flush 2009-03-16 18:59 then, that log entry will be retired along with all the log blocks since the last flush 2009-03-16 18:59 so there is no special treatment 2009-03-16 19:00 ah 2009-03-16 19:00 -!- rmull(~rmull@acsx02.bu.edu) has joined #tux3 2009-03-16 19:01 -!- rmull(~rmull@acsx02.bu.edu) has left #tux3 2009-03-16 19:20 we are not logging bnode allocation explicitly 2009-03-16 19:20 bnode redirect is including bnode allocation mean? 2009-03-16 19:26 yes 2009-03-16 19:26 well, that does not cover the case of a new bnode 2009-03-16 19:27 ah 2009-03-16 19:27 so, explicit alloc log would be simple? 2009-03-16 19:27 it would 2009-03-16 19:27 to start, it's fine 2009-03-16 19:27 saving log entries is important but it's just an optimization 2009-03-16 19:28 ok 2009-03-16 19:29 however, I'll see what's it 2009-03-16 19:30 those are leafs or bnode 2009-03-16 19:31 new_btree or bnode split is allocating new bnode 2009-03-16 19:33 yes 2009-03-16 19:34 * leaf 2009-03-16 19:34 new_btree(), redirect, new dleaf, and new ileaf 2009-03-16 19:34 * bnode 2009-03-16 19:34 redirect, and bnode split 2009-03-16 19:34 I guess 2009-03-16 19:35 yes 2009-03-16 19:37 [allocating point] 2009-03-16 19:37 * leaf 2009-03-16 19:37 new_btree(), redirect, new dleaf, and new ileaf 2009-03-16 19:37 * bnode 2009-03-16 19:37 new_btree(), redirect, and bnode split 2009-03-16 19:37 redirect are already done 2009-03-16 19:37 others are not yet 2009-03-16 19:37 um... 
2009-03-16 19:38 I thought I did something about split 2009-03-16 19:39 hmm, I remembered wrongly 2009-03-16 19:41 ok 2009-03-16 19:52 ah, maybe, balloc (bitmap) log is needed 2009-03-16 20:06 -!- chesse_(~eworm@dslb-084-062-134-253.pools.arcor-ip.net) has joined #tux3 2009-03-16 21:05 -!- chesse(~eworm@dslb-084-062-179-102.pools.arcor-ip.net) has joined #tux3 2009-03-16 22:17 -!- RazvanM(~RazvanM@96.234.242.71) has joined #tux3 2009-03-16 23:38 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2009-03-16 23:38 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-17 02:48 -!- pythonstar(~kavli@c-0afee455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-03-17 08:23 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-17 10:13 -!- arima(~other@123.204.0.189) has joined #tux3 2009-03-17 10:27 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-03-17 10:39 -!- arima(~other@123-204-16-26.dynamic.seed.net.tw) has joined #tux3 2009-03-17 11:17 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-17 11:40 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-17 12:20 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-17 13:15 -!- arima(~other@123-204-22-227.dynamic.seed.net.tw) has joined #tux3 2009-03-17 15:34 -!- flips(~phillips@phunq.net) has joined #tux3 2009-03-17 18:15 -!- macan(~user@159.226.41.129) has joined #tux3 2009-03-17 18:43 cat 127.0.0.1 googleads.g.doubleclick.net >>/etc/hosts 2009-03-17 18:44 :) 2009-03-17 19:04 s/cat/echo/ ;) 2009-03-17 19:13 oh yeah 2009-03-17 19:13 actually I put it in with an editor 2009-03-17 19:13 bunch of fricking annoying new google redirects 2009-03-17 19:13 slow 2009-03-17 20:06 -!- chesse_(~eworm@dslb-084-062-157-102.pools.arcor-ip.net) has joined #tux3 2009-03-17 21:05 -!- 
chesse(~eworm@dslb-084-062-134-024.pools.arcor-ip.net) has joined #tux3 2009-03-17 22:09 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-17 22:27 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-17 22:57 -!- RazvanM(~RazvanM@96.234.242.71) has joined #tux3 2009-03-18 00:42 -!- macan`(~user@159.226.41.129) has joined #tux3 2009-03-18 01:26 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-03-18 08:32 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-18 09:39 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-03-18 10:28 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-18 10:59 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-18 11:47 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-18 12:04 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-18 12:47 -!- tim_dimm(~timothyhu@pool-71-165-105-201.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-18 14:43 ok, there is my code comment checkin for the day 2009-03-18 14:43 I hope to do one of those each day, until we have enough 2009-03-18 16:18 -!- konrad(~konrad@c-24-16-107-163.hsd1.wa.comcast.net) has joined #tux3 2009-03-18 19:58 -!- amey_m(~amey@117.195.36.96) has joined #tux3 2009-03-18 20:06 -!- chesse_(~eworm@dslb-084-062-131-090.pools.arcor-ip.net) has joined #tux3 2009-03-18 21:06 -!- chesse(~eworm@dslb-084-062-172-189.pools.arcor-ip.net) has joined #tux3 2009-03-19 01:48 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-03-19 03:11 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-03-19 05:43 -!- data(~data@84.19.190.213) has joined #tux3 2009-03-19 08:53 -!- persson_(persson@nescafe.bsnet.se) has joined #tux3 2009-03-19 08:53 -!- 
Chip_M_(stefanc@apollo.orakel.ntnu.no) has joined #tux3 2009-03-19 09:32 -!- cdk(~chinmay@115.109.15.243) has joined #tux3 2009-03-19 09:53 -!- arima(other@123-204-98-248.adsl.dynamic.seed.net.tw) has joined #tux3 2009-03-19 10:16 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-19 10:16 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-03-19 16:45 flips: ext4 needs non-volatile ram to commit it's logs and then write it when appropriate to the disk with regards to allocation layout 2009-03-19 16:45 I remember you wrote a RAM based file system or something like that a while bak 2009-03-19 16:45 back 2009-03-19 16:47 metdata write backs might benefit from also being delayed as well 2009-03-19 20:05 -!- chesse_(~eworm@dslb-084-062-166-152.pools.arcor-ip.net) has joined #tux3 2009-03-19 20:28 -!- kunir(~kunir@dsl-hkibrasgw2-fef0de00-120.dhcp.inet.fi) has joined #tux3 2009-03-19 21:06 -!- chesse(~eworm@dslb-084-062-144-242.pools.arcor-ip.net) has joined #tux3 2009-03-19 22:30 -!- data(~data@84.19.190.213) has joined #tux3 2009-03-19 23:01 -!- RazvanM(~RazvanM@96.234.242.71) has joined #tux3 2009-03-20 00:54 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-03-20 02:41 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-03-20 02:43 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-03-20 09:09 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-03-20 09:53 -!- pythonstar(~kavli@c-0afee455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-03-20 10:23 -!- ciphergoth(~paul@host226.lshift.net) has joined #tux3 2009-03-20 10:24 -!- ciphergoth(~paul@host226.lshift.net) has left #tux3 2009-03-20 11:10 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-20 11:15 -!- dcg(~dcg@84.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-20 20:05 -!- chesse(~eworm@dslb-084-062-142-237.pools.arcor-ip.net) has joined #tux3 2009-03-20 21:05 -!- 
chesse_(~eworm@dslb-084-062-130-073.pools.arcor-ip.net) has joined #tux3 2009-03-20 21:25 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-03-21 07:09 -!- dcg(~dcg@193.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-21 07:32 -!- arima(other@123-204-237-191.adsl.dynamic.seed.net.tw) has joined #tux3 2009-03-21 07:56 -!- arima(other@123-204-134-124.adsl.dynamic.seed.net.tw) has joined #tux3 2009-03-21 08:21 -!- dcg_(~dcg@85.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-21 10:24 -!- kunir(~kunir@dsl-hkibrasgw2-fef0de00-120.dhcp.inet.fi) has joined #tux3 2009-03-21 12:40 -!- dcg__(~dcg@247.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-21 14:39 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-21 14:43 -!- dcg__(~dcg@247.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-21 15:16 -!- dcg(~dcg@197.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-21 16:19 -!- dcg(~dcg@197.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-21 20:05 -!- chesse(~eworm@dslb-084-062-171-009.pools.arcor-ip.net) has joined #tux3 2009-03-21 20:11 current my atomic-commit snapshot 2009-03-21 20:11 http://userweb.kernel.org/~hirofumi/atomic-snapshot.tar.gz 2009-03-21 20:11 still need more work 2009-03-21 20:11 ... 
2009-03-21 20:48 -!- edt(~Ed@56-76.162.dsl.aei.ca) has joined #tux3 2009-03-21 21:05 -!- chesse_(~eworm@dslb-084-062-157-151.pools.arcor-ip.net) has joined #tux3 2009-03-21 21:43 -!- edt(~Ed@dsl-60-240.aei.ca) has joined #tux3 2009-03-21 23:20 -!- RazvanM(~RazvanM@96.234.242.71) has joined #tux3 2009-03-22 01:46 -!- RazvanM(~RazvanM@pool-173-67-57-126.bltmmd.east.verizon.net) has joined #tux3 2009-03-22 02:55 -!- RazvanM(~RazvanM@pool-173-67-57-126.bltmmd.east.verizon.net) has joined #tux3 2009-03-22 07:21 -!- valdyn_(~valdyn@host-88-217-143-53.customer.m-online.net) has joined #tux3 2009-03-22 08:15 -!- dcg(~dcg@206.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-22 09:14 -!- edt(~Ed@dsl-60-240.aei.ca) has joined #tux3 2009-03-22 09:46 -!- edt(~Ed@dsl-60-240.aei.ca) has joined #tux3 2009-03-22 11:13 -!- dcg_(~dcg@2.pool80-103-2.dynamic.orange.es) has joined #tux3 2009-03-22 11:51 -!- cdk(~chinmay@121.246.34.108) has joined #tux3 2009-03-22 11:59 -!- amey_m(~amey@117.195.34.234) has joined #tux3 2009-03-22 12:20 -!- amey_m(~amey@117.195.32.109) has joined #tux3 2009-03-22 14:05 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-22 20:05 -!- chesse(~eworm@dslb-084-062-153-182.pools.arcor-ip.net) has joined #tux3 2009-03-22 21:05 -!- chesse_(~eworm@dslb-084-062-189-023.pools.arcor-ip.net) has joined #tux3 2009-03-22 21:31 -!- edt(~Ed@dsl-60-240.aei.ca) has joined #tux3 2009-03-22 22:35 -!- RazvanM(~RazvanM@pool-173-67-57-126.bltmmd.east.verizon.net) has joined #tux3 2009-03-23 07:40 -!- arima(other@123-204-6-114.dynamic.seed.net.tw) has joined #tux3 2009-03-23 07:43 -!- dcg(~dcg@202.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-23 07:54 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-03-23 10:00 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-03-23 10:54 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-23 12:29 hirofumi, there? 
2009-03-23 12:29 yes 2009-03-23 12:29 I was offline for a few days with family 2009-03-23 12:29 back 2009-03-23 12:30 ok 2009-03-23 12:30 I'm thinking about retire blocks for now 2009-03-23 12:31 actually, I'm going 2009-03-23 12:31 going? 2009-03-23 12:31 I'm going to think about it 2009-03-23 12:32 ok, reading your patches now 2009-03-23 12:33 need clean more 2009-03-23 12:33 need to clean up more 2009-03-23 12:34 however, atomic commit become to complete more except bnode 2009-03-23 12:34 complete except bnode? 2009-03-23 12:34 nearly complete except bnode? 2009-03-23 12:35 yes 2009-03-23 12:36 ah, commit-fix is getting rid of the extra delta+++ 2009-03-23 12:36 well, for now, flush_log() is running every delta, so bnode split and others may be not needed 2009-03-23 12:37 yes 2009-03-23 12:37 anyway, still need more work 2009-03-23 12:38 needs my replay code 2009-03-23 12:38 working on it today 2009-03-23 12:39 ok 2009-03-23 12:39 I'm going to complete bitmap stuff 2009-03-23 12:40 iirc, the rest of bitmap is retire blocks 2009-03-23 12:40 retire, as in free deferred? 2009-03-23 12:40 yes 2009-03-23 12:40 btw, I think free may be confusible 2009-03-23 12:41 free can be block and memory 2009-03-23 12:41 yes 2009-03-23 12:41 so, in those patches, I changed the *free to *bfree 2009-03-23 12:42 fine 2009-03-23 12:42 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-23 12:43 do the dleaf changes have any effect on atomic commit? 2009-03-23 12:44 dleaf change? 2009-03-23 12:45 dleaf-reorder.patch 2009-03-23 12:45 dleaf-split-with-dwalk.patch 2009-03-23 12:45 dwalk_delete.patch 2009-03-23 12:45 maybe, those are old patches 2009-03-23 12:45 ok 2009-03-23 12:45 unrelated to atomic commit 2009-03-23 12:49 BFREE_ON_FLUSH is different from BFREE? 
2009-03-23 12:49 yes 2009-03-23 12:50 one is per-delta, and one is per-flush 2009-03-23 12:50 on replay they are treated the same, I think 2009-03-23 12:51 I think, no 2009-03-23 12:51 bfree of bitmap would be into sb->deflush 2009-03-23 12:51 and others would be into sb->defree 2009-03-23 12:51 ah, correct :) 2009-03-23 12:52 well, next, I'm going to think around of this stuff more 2009-03-23 12:52 that was accurate thinking 2009-03-23 12:53 probably, however, not sure completely yet 2009-03-23 12:54 the above would be right though 2009-03-23 12:57 will check in a small cleanup to log.c that you might need to merge 2009-03-23 12:59 log-cleanup.patch? 2009-03-23 12:59 well, either is ok for me 2009-03-23 13:00 those may rewrite after more thinking 2009-03-23 13:00 not sure for now 2009-03-23 13:00 checked in 2009-03-23 13:00 small change 2009-03-23 13:01 ok 2009-03-23 13:01 now, user/commit.c replay -> kernel/commit.c, and make it two passes 2009-03-23 13:02 btw, I guess current log memory management is not good 2009-03-23 13:02 it's just GFP_KERNEL 2009-03-23 13:03 worried about block IO deadlock? 
2009-03-23 13:03 log is in radix tree 2009-03-23 13:03 ah 2009-03-23 13:03 it will bother us by page reclaim 2009-03-23 13:03 good reason for taking it out of page cache 2009-03-23 13:04 yes 2009-03-23 13:04 I guess stash like management would be good 2009-03-23 13:04 slightly less convenient for replay, but that is not very important 2009-03-23 13:05 well, it could still use a mapping for replay 2009-03-23 13:05 I was thinking about to use volmap for replay 2009-03-23 13:05 there is no reason to keep log blocks cached, actually 2009-03-23 13:05 yes 2009-03-23 13:05 however, it's easy 2009-03-23 13:06 volmap would be ok 2009-03-23 13:06 yes, probably 2009-03-23 13:07 the blocks can be linked together in a list through the buffer_head for replay processing 2009-03-23 13:08 i see 2009-03-23 13:08 this can be a separate change I think, a buffer ref count will keep log blocks from being evicted before they are written 2009-03-23 13:10 let me see how vmscan decides which inodes to scan 2009-03-23 13:10 it can 2009-03-23 13:11 however, I thought stash like structure may be more easy to use 2009-03-23 13:11 I'm not thinking about replay many though 2009-03-23 13:13 changing it to a stash-thing can be later 2009-03-23 13:13 ah, ok 2009-03-23 13:17 of course, vmscan does not scan by inode, it scans more or less randomly by the lru list 2009-03-23 13:18 holding a refcount on a dirty log block will keep it away from the block 2009-03-23 13:18 holding a refcount on the page 2009-03-23 13:18 or the buffer_head, both work 2009-03-23 13:18 yes 2009-03-23 13:19 it's an inefficient mechanism, scanning pages that are known to be pinned 2009-03-23 13:19 yes, for it, there is unevictable list 2009-03-23 13:19 there is? 
2009-03-23 13:19 that's new for me 2009-03-23 13:19 yes 2009-03-23 13:20 it's the kind of new thing 2009-03-23 13:20 CONFIG_UNEVICTABLE_LRU 2009-03-23 13:20 split vm list 2009-03-23 13:20 yes 2009-03-23 13:20 like mlock pages 2009-03-23 13:21 and there must be an internal api for moving to and from 2009-03-23 13:21 well, it is an optimization 2009-03-23 13:21 yes 2009-03-23 13:22 ah, a mapping can be unevictable 2009-03-23 13:27 not sure 2009-03-23 13:28 well, anyway, I guess mapping is not efficient 2009-03-23 13:28 for log 2009-03-23 13:29 it doesn't make much difference, nearly all access is via the cached pointer 2009-03-23 13:30 but if there is any difficulty with memory management, then it should be changed 2009-03-23 13:30 yes 2009-03-23 13:30 mapping can't free pages easily 2009-03-23 13:31 it is for caching 2009-03-23 13:31 and there is no random access 2009-03-23 13:31 for log 2009-03-23 13:32 not really 2009-03-23 13:32 so, not much argument for it being the way it is :) 2009-03-23 13:36 conditinal-attrs <- typo 2009-03-23 13:36 conditinal-attrs.patch 2009-03-23 13:36 oh 2009-03-23 13:36 yes 2009-03-23 13:37 well, filename is not logged in repo 2009-03-23 13:37 anyway, I'll fix it 2009-03-23 13:38 btw, it would be broken, and it's just a reminder for me 2009-03-23 13:41 mostly small changes in the patchset 2009-03-23 13:41 free_blocks is the biggest change, I think 2009-03-23 13:41 a second kind of BFREE is a big fix 2009-03-23 13:42 maybe 2009-03-23 13:42 it's the same fix, actually 2009-03-23 13:42 however, I'm not sure yet 2009-03-23 13:43 I'm thinking those are some sort of temporary change to think those 2009-03-23 13:43 it seems correct, filemap.c -> free_blocks can happen in the bitmap inode 2009-03-23 13:43 yes 2009-03-23 13:44 however, I'm not sure, those are correct way to do 2009-03-23 13:44 and in that case, it is necessary to put on a different deferred free list, and to log it differently so that replay can reconstruct the deferred free 
lists correctly, as you noticed 2009-03-23 13:44 yes 2009-03-23 13:45 but, I'm not thinking about those completely 2009-03-23 13:45 probably, it would be work though 2009-03-23 13:45 it makes perfect sense to me 2009-03-23 13:51 -!- dcg(~dcg@235.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-23 15:18 -!- edt(~Ed@dsl-60-240.aei.ca) has joined #tux3 2009-03-23 16:35 -!- arima(other@123-204-100-14.adsl.dynamic.seed.net.tw) has joined #tux3 2009-03-23 17:23 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-03-23 19:19 -!- arima(other@123-204-105-115.adsl.dynamic.seed.net.tw) has joined #tux3 2009-03-23 20:03 -!- data(~data@84.19.190.213) has joined #tux3 2009-03-23 20:05 -!- arima(other@123-204-72-219.adsl.dynamic.seed.net.tw) has joined #tux3 2009-03-23 20:05 -!- chesse(~eworm@dslb-084-062-179-045.pools.arcor-ip.net) has joined #tux3 2009-03-23 20:37 -!- samlh(~sam@67.129.121.145) has joined #tux3 2009-03-23 20:45 -!- arima(other@123-204-98-6.adsl.dynamic.seed.net.tw) has joined #tux3 2009-03-23 21:02 -!- cdk(~chinmay@121.246.34.108) has joined #tux3 2009-03-23 21:05 -!- chesse_(~eworm@dslb-084-062-157-211.pools.arcor-ip.net) has joined #tux3 2009-03-23 21:07 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-03-23 21:32 -!- edt(~Ed@dsl-60-240.aei.ca) has joined #tux3 2009-03-23 22:15 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-03-23 22:32 -!- RazvanM(~RazvanM@pool-173-67-57-126.bltmmd.east.verizon.net) has joined #tux3 2009-03-23 23:36 hi flips 2009-03-23 23:54 hi cdk 2009-03-23 23:55 the review is over. went very well :) 2009-03-23 23:55 good :) 2009-03-23 23:55 so the project is officially finished? 2009-03-23 23:56 maybe. still have not talked with our mentor. 2009-03-23 23:57 anyway, we never managed to send the collision support design and patch to the list. will do that today or tomorrow. 2009-03-23 23:57 did you write a report? 2009-03-23 23:58 aah, that is the part that is remaining . 
the documentation 2009-03-24 00:04 currently the stable dedup repo is at http://bitbucket.org/cdkamat/tux3_dedup/ 2009-03-24 00:04 it includes the collision handling part 2009-03-24 00:10 ACTION goes to look 2009-03-24 00:15 i have made the necessary changes to mkdir that we talked about in the repo 2009-03-24 00:19 good 2009-03-24 00:19 flips, i need to go now. will be back later. bye. 2009-03-24 00:19 is it in the stable repo? 2009-03-24 00:19 yes 2009-03-24 00:19 see you 2009-03-24 00:19 it is in the stable repo 2009-03-24 02:26 http://butnotyet.tumblr.com/post/89312148/timeline-of-linux-kernel-releases 2009-03-24 02:26 (part of the 'lets make a plot before going to sleep' series :P) 2009-03-24 02:47 -!- samlh(~sam@67.129.121.145) has joined #tux3 2009-03-24 08:28 -!- firefly(~firefly@1503031970.dhcp.dbnet.dk) has joined #tux3 2009-03-24 09:04 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-24 10:17 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-03-24 10:49 -!- cdk(~chinmay@115.109.15.195) has joined #tux3 2009-03-24 12:23 -!- pythonstar(~kavli@c-0afee455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-03-24 12:28 -!- firefly(~firefly@1503031970.dhcp.dbnet.dk) has joined #tux3 2009-03-24 12:41 -!- dcg(~dcg@132.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-24 14:17 -!- dcg(~dcg@210.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-24 14:55 -!- dcg_(~dcg@91.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-24 16:35 -!- edt(~Ed@dsl-60-240.aei.ca) has joined #tux3 2009-03-24 17:01 hirofumi, there? 2009-03-24 19:01 cabal time 2009-03-24 19:01 just about 2009-03-24 19:08 yup 2009-03-24 19:17 what does that mean? 
2009-03-24 19:18 marcin: you should come visit LA and find out ;) 2009-03-24 19:18 just tell me when's a good time for occupying your couch again 2009-03-24 19:18 you're always welcome :) 2009-03-24 19:19 watch out what you're wish for ;) 2009-03-24 19:19 this time i might not leave 2009-03-24 19:19 south is teh suxxors 2009-03-24 19:46 -!- RazvanM(~RazvanM@pool-173-67-57-126.bltmmd.east.verizon.net) has joined #tux3 2009-03-24 20:20 http://userweb.kernel.org/~hirofumi/note.flush 2009-03-24 20:20 note of flushing 2009-03-24 20:20 not completed though 2009-03-24 20:21 [flush strategy and retire blocks] 2009-03-24 20:21 the list of this section explains about image of log and retires blocks 2009-03-24 20:22 well, so, there is the one of question 2009-03-24 20:22 do we merge last delta and flush bitmap/bnode? 2009-03-24 20:23 or flush delta, then flush bitmap/bnode 2009-03-24 20:26 well, I guess it explains why current defree/deflush is a bit complex 2009-03-24 20:29 I think, the merge those mean we can't ignore bfree in the latest delta which is merging 2009-03-24 22:41 back 2009-03-24 22:44 we should merge the last delta with the bitmap/bnode flush, yes 2009-03-24 22:44 hirofumi, we can't ignore which bfree? 2009-03-24 22:45 hi 2009-03-24 22:45 hi 2009-03-24 22:45 I think, bfree in latest delta is can't overwrite 2009-03-24 22:46 can not be overwritten 2009-03-24 22:46 true 2009-03-24 22:46 so, it's a special case 2009-03-24 22:47 um... 
2009-03-24 22:47 I didn't think it was a special case 2009-03-24 22:47 but let's look at it 2009-03-24 22:49 bfree log of latest delta is needed to including the logs after bitmap flush 2009-03-24 22:50 yes 2009-03-24 22:50 because those bfree can be freed after bitmap flush 2009-03-24 22:50 however, the logs before bitmap flush are including various logs other than bfree 2009-03-24 22:51 so, it's special case 2009-03-24 22:51 the log entries are not special 2009-03-24 22:52 except for having a different type code, so that the deferred free list can be reconstructed correctly 2009-03-24 22:52 the only difference is which deferred free list is used for the deferred free 2009-03-24 22:53 I thought you explained this to me yesterday? 2009-03-24 22:53 no 2009-03-24 22:53 bitmap is including the latest logs allocation already 2009-03-24 22:54 ah yes 2009-03-24 22:54 I see your point 2009-03-24 22:54 or do I 2009-03-24 22:54 yes 2009-03-24 22:55 ok, I think the question is, when exactly do we allocate the log blocks 2009-03-24 22:55 ah, no, that isn't quite it 2009-03-24 22:55 yes 2009-03-24 22:55 logs allocation means balloc() of filemap 2009-03-24 22:56 yes 2009-03-24 22:56 or bnode, and leaf 2009-03-24 22:56 I think we do this after the delta increment 2009-03-24 22:56 which means that the 1 bits for the log blocks do not appear in the flushed bitmaps 2009-03-24 22:57 1 bits? 2009-03-24 22:57 allocated bits 2009-03-24 22:57 1 = allocated 2009-03-24 22:58 allocation bit of bitmap for log blocks? 2009-03-24 22:58 yes 2009-03-24 22:58 is that what you were talking about? 2009-03-24 22:58 no 2009-03-24 22:59 it may also be 2009-03-24 22:59 well, current is latest delta is including allocation log of filemap, bnode and others 2009-03-24 22:59 however, those are already in bitmap 2009-03-24 23:00 when flushing bitmap 2009-03-24 23:00 true 2009-03-24 23:00 that seems ok 2009-03-24 23:00 ok means replay ignore those? 
2009-03-24 23:01 they are not in the previous versions of those bitmap blocks 2009-03-24 23:01 yes 2009-03-24 23:01 we replay against the previous versions 2009-03-24 23:01 but, bfree is needed after flushed bitmap 2009-03-24 23:02 so, when we flush out bitmap blocks, we empty the log except for entries that are generated during the flush itself 2009-03-24 23:02 because it's a not freed yet 2009-03-24 23:02 no 2009-03-24 23:02 latest delta is not written yet 2009-03-24 23:02 well the log isn't actually emptied until the delta commit block is written 2009-03-24 23:03 I think I still don't understand the point 2009-03-24 23:03 let me again at first 2009-03-24 23:03 good 2009-03-24 23:03 um... 2009-03-24 23:04 when flushing bitmap/bnode, we have the jobs of the delta and bitmap/bnode flush 2009-03-24 23:05 on this case, we merge the delta and bitmap/bnode flush in the one commit? 2009-03-24 23:05 into the one commit? 2009-03-24 23:05 yes 2009-03-24 23:06 if so, the delta may have bfree/balloc, and others logs 2009-03-24 23:06 I think we want those log entries to go into the log blocks for the _next_ delta 2009-03-24 23:06 after the one we are committing 2009-03-24 23:06 hmm 2009-03-24 23:06 is that right? 
2009-03-24 23:07 no, I think you are right as you said it 2009-03-24 23:08 we can do either one 2009-03-24 23:08 yes 2009-03-24 23:08 I guess we better decide which 2009-03-24 23:08 bitmap/bnode flush, then the delta commit 2009-03-24 23:08 yes 2009-03-24 23:08 or, before flush 2009-03-24 23:08 and merge those 2009-03-24 23:09 merge those means efficient, but complex 2009-03-24 23:09 only slightly more complex I think 2009-03-24 23:09 mainly, more confusing 2009-03-24 23:09 yes 2009-03-24 23:09 not much extra code, if any 2009-03-24 23:09 not sure 2009-03-24 23:09 I tried to merge those in my prototype 2009-03-24 23:10 I guess, we need to replay the latest delta slightly 2009-03-24 23:10 and merge to the logs of flush 2009-03-24 23:11 did you mean, replay before commit? 2009-03-24 23:11 replay on memory 2009-03-24 23:11 yes 2009-03-24 23:11 I don't see the need 2009-03-24 23:12 search logs and remove unneeded logs, and merge the needed logs to next delta 2009-03-24 23:12 I don't think it's necessary. Maybe you should provide a specific example? 2009-03-24 23:13 e.g. 2009-03-24 23:13 the latest delta has bfree and balloc of filemap 2009-03-24 23:13 in that case, bitmap is including balloc already 2009-03-24 23:13 and bfree is not including yet 2009-03-24 23:14 ah, good example 2009-03-24 23:14 so, we don't need to reconstruct the bitmap and other structure 2009-03-24 23:14 ok, now I see the question 2009-03-24 23:15 however, need to search logs 2009-03-24 23:16 well, we just flushed all the dirty bitmaps and all the dirty index nodes, so no log replay is needed 2009-03-24 23:16 yes 2009-03-24 23:16 not need to reconstruct 2009-03-24 23:17 but, need to play on logs 2009-03-24 23:17 why do we need to replay logs? 
2009-03-24 23:17 well, it's the special case 2009-03-24 23:17 because balloc is already in bitmap 2009-03-24 23:18 but, bfree is not in bitmap 2009-03-24 23:18 so, we need to pick up the bfree 2009-03-24 23:19 and bfree is needed to merge to logs after retire logs 2009-03-24 23:19 please see note.flush 2009-03-24 23:19 I think it is ok for the bfree to be in the bitmap too 2009-03-24 23:19 http://userweb.kernel.org/~hirofumi/note.flush 2009-03-24 23:19 no 2009-03-24 23:20 if bfree is already in bitmap, we may reuse those before commit 2009-03-24 23:21 in note.flush, --------------------------------- is separated stage 2009-03-24 23:21 I should be looking at the lower part of the note? 2009-03-24 23:21 the merge means, we need to merge those 2009-03-24 23:22 yes, the list of latest part 2009-03-24 23:22 just to talk this 2009-03-24 23:23 right part is data 2009-03-24 23:23 left part is logs 2009-03-24 23:24 whoops 2009-03-24 23:24 left part is data 2009-03-24 23:24 right part is logs 2009-03-24 23:24 :) 2009-03-24 23:24 :) 2009-03-24 23:24 in flush, we do "retire logs" 2009-03-24 23:25 right 2009-03-24 23:25 however, if we merge those, we are not writing the logs before "retire logs" 2009-03-24 23:26 right, the retire actually happens with the commit block arrives on disk 2009-03-24 23:26 so, we pick up the needed logs from the logs in the above part 2009-03-24 23:26 yes 2009-03-24 23:27 ah, I think I see the issue 2009-03-24 23:28 after the combined commit, some log entries should still exist 2009-03-24 23:28 for the bfrees 2009-03-24 23:28 is that it? 
2009-03-24 23:28 yes 2009-03-24 23:29 I think we can make that happen, without much complexity 2009-03-24 23:29 I'm not sure whether those are bfree only 2009-03-24 23:29 probably just the deferred frees 2009-03-24 23:29 so we don't leak them 2009-03-24 23:30 everything else should be in the flushed bitmaps and btree nodes 2009-03-24 23:32 ACTION is thinking 2009-03-24 23:32 probably 2009-03-24 23:36 in the case for bfree, we make the log entry before actually changing the bitmap 2009-03-24 23:36 which makes it a little different from balloc 2009-03-24 23:36 (just thinking to myself) 2009-03-24 23:40 I was forgeting to think about defree bnode/bitmap 2009-03-24 23:40 at least, those are not in note.flush 2009-03-24 23:40 ileaf/dleaf/data/bitmap/bnode 2009-03-24 23:40 bfree (data) log 2009-03-24 23:40 balloc (data) log 2009-03-24 23:40 ileaf redirect log 2009-03-24 23:40 dleaf redirect log 2009-03-24 23:40 bnode redirect log 2009-03-24 23:40 bnode update log 2009-03-24 23:40 ileaf/dleaf/data 2009-03-24 23:40 --------------------------------------------------------------------- 2009-03-24 23:40 defree bfree (data) log 2009-03-24 23:41 defree ileaf redirect log 2009-03-24 23:41 defree dleaf redirect log 2009-03-24 23:41 --------------------------------------------------------------------- 2009-03-24 23:41 retire logs 2009-03-24 23:41 bnode 2009-03-24 23:41 bfree (bitmap) log 2009-03-24 23:41 balloc (bitmap) log 2009-03-24 23:41 bnode redirect log 2009-03-24 23:41 bnode update log 2009-03-24 23:41 bitmap 2009-03-24 23:41 --------------------------------------------------------------------- 2009-03-24 23:41 defree logs 2009-03-24 23:41 defree bfree (bitmap) log 2009-03-24 23:41 defree bnode redirect log 2009-03-24 23:41 defree bfree (bitmap) log, and defree bnode redirect log 2009-03-24 23:41 should I get a new copy of the note? 
2009-03-24 23:42 not needed 2009-03-24 23:42 I need to think about those 2009-03-24 23:42 defree bfree (bitmap) log, and defree bnode redirect log 2009-03-24 23:44 "retire logs" is *after* the commit of bitmap/bnode flush? 2009-03-24 23:46 yes, the retire log is done by changing the global log length 2009-03-24 23:46 the last step of the commit 2009-03-24 23:47 ok, it seems to me that the only issue is, leaking deferred frees when we discard the log blocks 2009-03-24 23:47 correct? 2009-03-24 23:47 probably 2009-03-24 23:47 probably is the right word :) 2009-03-24 23:47 :) 2009-03-24 23:48 I'm not thinking about split yet 2009-03-24 23:48 so, if we could, it would be nice to have the corresponding bitmap blocks cleared in the flushed bitmap blocks 2009-03-24 23:48 well, flush for each delta should be covering those for now 2009-03-24 23:49 except that there isn't a commit block for the final delta before the flush 2009-03-24 23:50 if there was, then the problem goes away I think 2009-03-24 23:50 because after writing a commit block for the delta, we can reuse the freed blocks 2009-03-24 23:51 which I think you said above, in different words 2009-03-24 23:52 however, avoiding a commit cycle is worth some effort 2009-03-24 23:52 or, it may be a premature optimization 2009-03-24 23:52 yes 2009-03-24 23:53 it certainly is a big optimization if we flush after every delta for now 2009-03-24 23:53 or even, every four deltas or something like that 2009-03-24 23:53 yes 2009-03-24 23:53 well, flush every delta is very temporary, I think 2009-03-24 23:54 flush on every delta 2009-03-24 23:54 yes, it actually makes it harder to think about the algorithm 2009-03-24 23:56 I wonder if we can clear all the deferred free bits just before submitting the bitmap blocks 2009-03-24 23:57 we may need it for flush 2009-03-24 23:57 "it" ? 
2009-03-24 23:57 free blocks before bitmap 2009-03-24 23:58 free deferred blocks before submitting bitmap 2009-03-24 23:58 yes 2009-03-24 23:58 I'm trying to think about what could go wrong 2009-03-24 23:59 this would be before incrementing sb->flush 2009-03-24 23:59 so that the bitmap blocks are not forked 2009-03-24 23:59 this? 2009-03-25 00:00 we would clear those bits before incrementing sb->flush 2009-03-25 00:00 ah, yes 2009-03-25 00:00 clearing a bit should never allocate a new block 2009-03-25 00:00 yes 2009-03-25 00:01 that is, it is an error if the bit is not already set 2009-03-25 00:01 yes, double free 2009-03-25 00:03 or, "retire logs" after the comming the bitmap flush 2009-03-25 00:04 that is also a possibility 2009-03-25 00:05 that makes a lot of sense 2009-03-25 00:06 maybe 2009-03-25 00:06 just thinking a little more 2009-03-25 00:06 well 2009-03-25 00:06 otherwise, I guess we have to reserve the some region of bitmap to flush bitmap 2009-03-25 00:07 after the delta completes, the bitmap btree will be pointing at the new bitmap blocks 2009-03-25 00:07 so we don't want to replay old logs against those 2009-03-25 00:07 it is better to make sure that the new bitmap blocks have exactly the right bits 2009-03-25 00:08 I think it's not problem 2009-03-25 00:08 bfree logs is just the log of region of bitmap 2009-03-25 00:08 there is no physical address 2009-03-25 00:09 the only problem is, if we replay a balloc against a bitmap region that is already set, it looks like an error 2009-03-25 00:09 yes 2009-03-25 00:09 we have to ignore in that case 2009-03-25 00:10 in fact, we would ignore every log entry except the bfree entries 2009-03-25 00:10 yes 2009-03-25 00:10 and only use those to clear the corresponding bits in the bitmaps 2009-03-25 00:10 it's complex more or less 2009-03-25 00:10 yes 2009-03-25 00:11 so, it is probably better to clear those bits before writing bit bitmap blocks, if we can do that without complications 2009-03-25 00:11 yes 
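The two strategies compared just above (clear the deferred-free bits before the bitmap blocks are written, or keep the log and replay only its bfree entries against the flushed bitmap) can be sketched with a toy model. This is only an illustration in Python, not tux3 code: `Bitmap`, `clear_deferred_before_flush`, `replay_bfree_only` and the `(start, count)` extent tuples are hypothetical names invented here, and the sketch ignores the separate question of when freed blocks may safely be reused.

```python
# Toy model only. All names here are hypothetical and are not taken from
# the tux3 source; extents are (start, count) tuples of block numbers.

class Bitmap:
    """Allocation bitmap: 1 = allocated, 0 = free."""

    def __init__(self, nblocks):
        self.bits = [0] * nblocks

    def balloc(self, start, count):
        for block in range(start, start + count):
            self.bits[block] = 1

    def bfree(self, start, count):
        for block in range(start, start + count):
            # Clearing a bit that is not set means a double free.
            if not self.bits[block]:
                raise ValueError(f"double free of block {block}")
            self.bits[block] = 0


def clear_deferred_before_flush(bitmap, defree):
    """Strategy 1: apply the deferred frees, then take the image to write."""
    for start, count in defree:
        bitmap.bfree(start, count)
    defree.clear()
    return list(bitmap.bits)  # stands in for the bitmap blocks sent to disk


def replay_bfree_only(flushed_bits, log):
    """Strategy 2: replay after a flush, ignoring everything except bfree."""
    bitmap = Bitmap(len(flushed_bits))
    bitmap.bits = list(flushed_bits)
    for kind, start, count in log:
        if kind == "bfree":
            bitmap.bfree(start, count)
        # balloc and other entries are already reflected in the flushed image
    return bitmap.bits


# The last delta allocated blocks 0..3 and deferred the free of blocks 1..2.
log = [("balloc", 0, 4), ("bfree", 1, 2)]

live = Bitmap(8)
live.balloc(0, 4)
image_if_flushed_as_is = list(live.bits)      # deferred frees not applied

image_a = clear_deferred_before_flush(live, [(1, 2)])
image_b = replay_bfree_only(image_if_flushed_as_is, log)
assert image_a == image_b                     # both end with blocks 1..2 free
```

In this model both routes reach the same bitmap; the difference is only where the work happens: before writeout in the first case, at replay time (with every non-bfree entry skipped) in the second.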
2009-03-25 00:11 well, I can't think of a complication right now 2009-03-25 00:12 that is probably because I have not tried to code it ;) 2009-03-25 00:12 complex of pick up bfree logs only? 2009-03-25 00:13 well, we have to search flush-bitmap in log 2009-03-25 00:13 I think it is simpler to clear those bits before writing the bitmaps 2009-03-25 00:13 yes 2009-03-25 00:13 both strategies can work 2009-03-25 00:13 yes 2009-03-25 00:13 and neither is very complicated 2009-03-25 00:14 probably, not simple though 2009-03-25 00:14 not easy to think about 2009-03-25 00:14 yes 2009-03-25 00:15 the solution may be as simple as running through the deferred free list, clearing bitmap regions, just before the bitmap flush 2009-03-25 00:15 and just before sb->flush++ 2009-03-25 00:16 we need to reserve the region instead though 2009-03-25 00:16 so that it can't be allocated to hold a bitmap block, yes 2009-03-25 00:16 bleah 2009-03-25 00:17 we can do the map_regions for bitmap, then clear the deferred, then initiate writeout 2009-03-25 00:18 sb->flush++? 2009-03-25 00:18 the flush counter 2009-03-25 00:19 sb->flush is before map_regions for bitmap? 2009-03-25 00:19 we have sb->delta and sb->flush 2009-03-25 00:19 so clear the deferred is not in bitmap? 2009-03-25 00:20 I meant... 2009-03-25 00:20 or another idea is 2009-03-25 00:21 we re-logging the bfree in bitmap flush 2009-03-25 00:21 i.e. we list those up in sb->deflush 2009-03-25 00:21 how about: clear deferred regions; sb->flush++; flush bitmap to disk 2009-03-25 00:22 map_region is where? 
2009-03-25 00:22 in the "flush bitmap" 2009-03-25 00:22 using the algorithm we already developed 2009-03-25 00:22 flush bitmap have to know deferred region 2009-03-25 00:23 for not reusing those 2009-03-25 00:23 yes :) 2009-03-25 00:24 well 2009-03-25 00:24 right 2009-03-25 00:24 defree bfree (bitmap) log, and defree bnode redirect log 2009-03-25 00:24 I guess those are having the same issue 2009-03-25 00:25 yes, the issue being deferred free in each case 2009-03-25 00:26 btw, I guess defree bfree (data) log, and ileaf/dleaf are not having 2009-03-25 00:27 those are write data blocks, then defree 2009-03-25 00:27 so, there is no problem 2009-03-25 00:27 but, bitmap may be cyclic 2009-03-25 00:28 there is another alternative 2009-03-25 00:28 which is to log the deferred frees into a new log block 2009-03-25 00:29 which is not retired 2009-03-25 00:29 if we use the same way of data blocks, I guess bitmap is also "log + data, and defree is in next delta" 2009-03-25 00:30 i.e. all deferred bfree logs + bitmap data 2009-03-25 00:30 and "retire log" of other logs 2009-03-25 00:30 are you agreeing with my comment just above? 2009-03-25 00:31 it's just an idea 2009-03-25 00:31 of course 2009-03-25 00:31 um..., well, I was thinking the why data blocks is ok 2009-03-25 00:32 ok 2009-03-25 00:32 ok, special replay is one solution 2009-03-25 00:32 bfree and other logs is used on different cycle 2009-03-25 00:32 ah 2009-03-25 00:32 so, we should have two logs? 2009-03-25 00:32 yes 2009-03-25 00:33 that was your suggestion from way back 2009-03-25 00:33 special replay sounds like the best suggestion so far 2009-03-25 00:35 which is, we don't immediately forget all the log blocks; we somehow know when we are replaying after a flush; and we ignore all log entries except deferred free on replay 2009-03-25 00:36 um... 2009-03-25 00:37 well, yes 2009-03-25 00:41 we can't log the bnode update without physical address? 
2009-03-25 00:42 well, it means, I guess "bnode redirect log" is only one of can't retire 2009-03-25 00:42 you mean, use the logical key instead of physical address? 2009-03-25 00:42 on flush 2009-03-25 00:42 or just the log of key and data 2009-03-25 00:43 we may not have reconstructed enough of the btree yet to find the correct physical block, using the key 2009-03-25 00:43 i see 2009-03-25 00:44 well, the issue you raised today is indeed a tricky and messy issue, but it is a small issue 2009-03-25 00:44 like the allocate-in-bitmap-flush issue 2009-03-25 00:44 yes 2009-03-25 00:44 a small issue that is very tricky to solve nicely 2009-03-25 00:45 yes 2009-03-25 00:45 however, bitmap flush issue was solved very cleanly 2009-03-25 00:45 there is no special case 2009-03-25 00:45 after a _lot_ of discussion 2009-03-25 00:45 yes :) 2009-03-25 00:49 ok, returning to one of the ideas above... 2009-03-25 00:49 ok 2009-03-25 00:50 at the point we are sure that no further allocations are needed for the delta, we could start a new log block (oops we just allocated) and make log entries for the deferred frees for the current delta 2009-03-25 00:51 not really attractive, because of the extra log block write 2009-03-25 00:52 this is not as bad as an extra commit though 2009-03-25 00:53 yes 2009-03-25 00:54 returning to the idea of not discarding the log immediately at flush... 
it is only the final delta that we need to keep, and only for the deferred free entries 2009-03-25 00:54 ok, so I like this idea the best 2009-03-25 00:55 no, we have to keep all deltas until previous flush 2009-03-25 00:55 because bnode redirect log is there 2009-03-25 00:55 but we just flushed the "real" bnodes 2009-03-25 00:55 yes 2009-03-25 00:56 but, we keep to original blocks until commit 2009-03-25 00:56 the log is not actually flushed until the commit 2009-03-25 00:57 yes, but, those could be overwritten 2009-03-25 00:57 hey flips 2009-03-25 00:57 busy, bh 2009-03-25 00:58 how could they be overwritten? 2009-03-25 00:58 the allocation of bitmap may reuse the those blocks if we free those 2009-03-25 00:59 the allocation of bitmap blocks itself 2009-03-25 00:59 you mean, the old log blocks could be overwritten 2009-03-25 00:59 I meant old bnode blocks 2009-03-25 01:00 those should be on the deferred free list 2009-03-25 01:00 ah I see 2009-03-25 01:00 yes 2009-03-25 01:00 you are warning me that we can't actually free them before the commit 2009-03-25 01:00 flips: ok, as long as things are going well 2009-03-25 01:00 yes 2009-03-25 01:00 true 2009-03-25 01:02 ileaf/dleaf/data/bitmap/bnode 2009-03-25 01:02 [+]bfree (data) log 2009-03-25 01:02 [+]balloc (data) log 2009-03-25 01:02 [+]ileaf redirect log 2009-03-25 01:02 [+]dleaf redirect log 2009-03-25 01:02 [%]bnode redirect log 2009-03-25 01:02 [+]bnode update log 2009-03-25 01:02 ileaf/dleaf/data 2009-03-25 01:02 --------------------------------------------------------------------- 2009-03-25 01:02 defree bfree (data) log 2009-03-25 01:02 defree ileaf redirect log 2009-03-25 01:02 defree dleaf redirect log 2009-03-25 01:02 --------------------------------------------------------------------- 2009-03-25 01:02 retire logs 2009-03-25 01:02 bnode 2009-03-25 01:02 [%]bfree (bitmap) log 2009-03-25 01:02 [%]balloc (bitmap) log 2009-03-25 01:02 [%]bnode redirect log 2009-03-25 01:02 [%]bnode update log 2009-03-25 
01:02 bitmap 2009-03-25 01:02 --------------------------------------------------------------------- 2009-03-25 01:02 defree logs 2009-03-25 01:02 defree bfree (bitmap) log 2009-03-25 01:02 defree bnode redirect log 2009-03-25 01:02 [+] is already in bitmap, [%] is not in bitmap yet 2009-03-25 01:04 I guess, simple way is separate [+] and [%] to different log 2009-03-25 01:04 and "retire logs" retires the [+] log 2009-03-25 01:04 [%] log is retired in next flush 2009-03-25 01:05 two logs is pretty easy 2009-03-25 01:05 yes 2009-03-25 01:05 ok 2009-03-25 01:05 you convinced me I think 2009-03-25 01:05 not efficient though 2009-03-25 01:05 not efficient because? 2009-03-25 01:05 I guess [%] log is small on per-delta 2009-03-25 01:06 however, it would be needed to use one block 2009-03-25 01:06 typically 2009-03-25 01:06 so, I guess [%] log block is almost zeroed data 2009-03-25 01:06 just guess really 2009-03-25 01:06 how does the bfree for the deleaf redirect get entered into the bitmap? 2009-03-25 01:07 this list is not merging delta and flush 2009-03-25 01:07 ok 2009-03-25 01:07 makes sense now 2009-03-25 01:08 if we want to merge those, we have to merge deflush and defree logs on flush 2009-03-25 01:08 I think 2009-03-25 01:09 I think I went back to liking a single log more 2009-03-25 01:09 the only reason for the % log would be to do the deferred frees, only immediately after a flush 2009-03-25 01:09 yes 2009-03-25 01:10 ok, going back to another idea above... 
try to write the "correct" bitmap blocks to disk 2009-03-25 01:10 except the log of bitmap flush itself 2009-03-25 01:11 ok 2009-03-25 01:11 yes, our only record of the bitmap allocation is in the log 2009-03-25 01:13 well, if we have a reserved region on flush, we don't need to defree on flush 2009-03-25 01:18 well, the "correct bitmap blocks" idea seems pretty hard to do, while the competing idea of keeping the log chain until the next delta seems pretty easy 2009-03-25 01:18 I think we should do the "keep the log chain on flush" strategy 2009-03-25 01:19 and combine flush with delta, not have an extra commit for flush 2009-03-25 01:20 well, correct bitmap blocks is not so hard to do, but needs new code 2009-03-25 01:20 I guess several filesystems have a reserved region in memory 2009-03-25 01:20 well 2009-03-25 01:21 the "correct bitmap" idea interacts with block forking I think 2009-03-25 01:21 reserved region? 2009-03-25 01:21 just free region before sb->flush++ 2009-03-25 01:22 and instead of it, reserve the region in memory 2009-03-25 01:22 and somehow avoid reallocating it 2009-03-25 01:22 yes, exactly 2009-03-25 01:22 I was trying to avoid that :) 2009-03-25 01:22 it is a possibility 2009-03-25 01:22 the reason for avoiding is, it slows down every allocation (a little) 2009-03-25 01:23 yes, just listing it as a possibility 2009-03-25 01:23 yes 2009-03-25 01:23 which is planned for inode allocation 2009-03-25 01:23 yes 2009-03-25 01:24 it would be nice to avoid it for block allocation... I am not sure why I feel differently about these two things 2009-03-25 01:24 I guess inode allocation can't be solved by logs 2009-03-25 01:25 it is also possible to bypass the forking mechanism and do the bitmap clears after the sb->flush++ 2009-03-25 01:26 in fact we want to avoid the forking in this case 2009-03-25 01:27 maybe, just don't do the sb->flush++ until after bitmap allocation, on a flush cycle 2009-03-25 01:28 however... 
if we avoid forking, we also get the possibility of infinite allocate-on-bitmap-flush 2009-03-25 01:28 I guess it introduces a cycle of allocating bitmap blocks 2009-03-25 01:28 right 2009-03-25 01:30 it is only for bitmap block allocations that we have to check a reserved list 2009-03-25 01:30 that is not so bad 2009-03-25 01:30 yes 2009-03-25 01:30 so, in bitmap block allocation, just check the deferred free list 2009-03-25 01:30 did we fix our problem? 2009-03-25 01:31 yes, it can 2009-03-25 01:32 so: dump deferred frees into bitmap; sb->flush++; flush bitmaps (treating deferred free list as reserved) 2009-03-25 01:34 yes, it can be 2009-03-25 01:34 if so, we may want deflush to be a tree instead of a list 2009-03-25 01:35 later, yes 2009-03-25 01:35 well, deflush is not meaning the deflush 2009-03-25 01:35 I know what you meant 2009-03-25 01:36 however, it may be one of the simple solutions 2009-03-25 01:36 a better solution may come up 2009-03-25 01:36 but this one seems workable 2009-03-25 01:37 yes 2009-03-25 01:37 and it will have clean data on disk 2009-03-25 01:38 bitmap flush can include all defree 2009-03-25 01:38 ah, no 2009-03-25 01:39 including bitmap itself is not clean to implement 2009-03-25 01:39 ok, that problem 2009-03-25 01:40 well, bitmap is special already 2009-03-25 01:40 so, there is no actual problem 2009-03-25 01:40 however, we lose the balloc log record for a bitmap allocation 2009-03-25 01:41 I'm thinking, retire log on memory is before map_region of bitmap 2009-03-25 01:41 like the list of not.flush 2009-03-25 01:42 note.flush 2009-03-25 01:42 so, flushed bitmap blocks do not include the bitmap itself 2009-03-25 01:43 however, the logs are written instead 2009-03-25 01:43 yes, and we may not retire that log block yet 2009-03-25 01:44 and there is a little extra complexity on replay 2009-03-25 01:44 I guess if we use defree as reserved strategy, we can retire the logs before bitmap 
flush 2009-03-25 01:45 I think the replay just sees 2009-03-25 01:45 bnode 2009-03-25 01:45 [%]bfree (bitmap) log 2009-03-25 01:45 [%]balloc (bitmap) log 2009-03-25 01:45 [%]bnode redirect log 2009-03-25 01:45 [%]bnode update log 2009-03-25 01:45 bitmap 2009-03-25 01:45 these logs 2009-03-25 01:45 that seems right 2009-03-25 01:47 probably 2009-03-25 01:48 essentially, we start a new log just before the flush 2009-03-25 01:49 yes 2009-03-25 01:49 that concept seems clean enough for now 2009-03-25 01:49 my patchset was starting new log 2009-03-25 01:49 yes, for now 2009-03-25 01:50 so while we have been chatting, I have been fiddling with replay 2009-03-25 01:50 to make it two pass 2009-03-25 01:50 if we can avoid reserved region, it sounds perfect 2009-03-25 01:51 two pass? 2009-03-25 01:51 I don't immediately see how to avoid that, but maybe an idea will come when I go skating tomorrow 2009-03-25 01:51 two pass replay: 1) physical 2) logical 2009-03-25 01:51 we can't enter bitmap updates into the bitmaps before reconstructing the bitmap btree 2009-03-25 01:51 ah 2009-03-25 01:52 i see 2009-03-25 01:52 so the first pass ignores all the logical records, and the second pass ignores all the physical records 2009-03-25 01:53 I guess exception is just balloc 2009-03-25 01:53 I guess bfree goes into sb->defree 2009-03-25 01:53 or sb->deflush 2009-03-25 01:54 there have to be log records too 2009-03-25 01:54 yes 2009-03-25 01:55 on replay, if it sees a bfree log, replay just adds it to the sb->defree list 2009-03-25 01:55 not applied to the bitmap yet 2009-03-25 01:55 ah, correct :) 2009-03-25 01:56 so, I guess just list balloc up, and apply after replay was completed 2009-03-25 01:57 well, I'm not sure though 2009-03-25 01:57 actually, some bfrees go into the bitmaps immediately 2009-03-25 01:57 oh 2009-03-25 01:57 which one? 
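The two-pass replay described just above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the tux3 replay code: the record layout, the toy in-memory bitmap, the fixed-size deferred free array standing in for sb->defree, and the names LOG_BNODE_REDIRECT, LOG_BALLOC, replay_state and replay() are all hypothetical. Only LOG_BFREE, the physical-then-logical ordering, and the rule that a replayed bfree is queued rather than applied come from the discussion. Applying balloc directly in the second pass is an assumption, since that point was left open in the channel.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_BLOCKS 64   /* toy volume size, illustration only */
#define MAX_DEFREE 16   /* toy capacity of the deferred free list */

/* Hypothetical record types; only LOG_BFREE is a name used in the channel. */
enum log_type {
	LOG_BNODE_REDIRECT,  /* physical: needed to rebuild btree nodes */
	LOG_BALLOC,          /* logical: block allocation */
	LOG_BFREE,           /* logical: block free, deferred on replay */
};

struct log_rec {
	enum log_type type;
	unsigned block;
};

struct replay_state {
	bool bitmap[MAX_BLOCKS];      /* true = allocated */
	unsigned defree[MAX_DEFREE];  /* stand-in for sb->defree */
	size_t defree_count;
	size_t redirects;             /* physical records handled in pass 1 */
};

static bool is_physical(enum log_type type)
{
	return type == LOG_BNODE_REDIRECT;
}

/* Pass 1: physical records only, so the btree can be reconstructed first. */
static void replay_physical(struct replay_state *st,
			    const struct log_rec *log, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (is_physical(log[i].type))
			st->redirects++;  /* a real replay would load the redirected bnode */
}

/* Pass 2: logical records only; frees are queued, not applied to the bitmap. */
static int replay_logical(struct replay_state *st,
			  const struct log_rec *log, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (is_physical(log[i].type))
			continue;
		if (log[i].block >= MAX_BLOCKS)
			return -1;
		if (log[i].type == LOG_BALLOC) {
			st->bitmap[log[i].block] = true;  /* assumption: applied directly */
		} else {  /* LOG_BFREE */
			if (st->defree_count == MAX_DEFREE)
				return -1;
			st->defree[st->defree_count++] = log[i].block;
		}
	}
	return 0;
}

/* Run both passes over the same log, physical first, then logical. */
int replay(struct replay_state *st, const struct log_rec *log, size_t n)
{
	replay_physical(st, log, n);
	return replay_logical(st, log, n);
}
```

In this sketch a block that is allocated and then freed in the log stays set in the bitmap after replay and only shows up on the deferred list, which is the "not applied to the bitmap yet" behaviour mentioned above.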
2009-03-25 01:57 per-delta bfrees 2009-03-25 01:58 where the deferred free list is flushed into the bitmaps immediately after the delta commit 2009-03-25 01:58 this would be for normal data blocks 2009-03-25 01:58 ah 2009-03-25 01:58 do unstash() after replay? 2009-03-25 01:58 that is easy enough 2009-03-25 01:59 i see 2009-03-25 01:59 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-03-25 01:59 ah, it may be sb->deflush too 2009-03-25 02:00 ah, no 2009-03-25 02:12 -!- arima(other@123-204-74-21.adsl.dynamic.seed.net.tw) has joined #tux3 2009-03-25 02:15 new improved replay is posted, not tested at all 2009-03-25 02:15 the rest of the log types need to be filled in 2009-03-25 02:48 already posted to the ml? 2009-03-25 03:18 just checked in 2009-03-25 03:18 this code is not active yet 2009-03-25 03:21 ah 2009-03-25 03:39 oyasumi :) 2009-03-25 03:39 -!- cdk(~chinmay@121.246.32.10) has joined #tux3 2009-03-25 03:40 ok, oyasumi :) 2009-03-25 04:09 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-03-25 12:30 -!- dcg(~dcg@229.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-25 12:48 -!- dcg_(~dcg@21.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-25 12:48 -!- dcg_(~dcg@21.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-25 14:00 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-25 14:15 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-25 14:48 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-25 14:54 -!- dcg_(~dcg@21.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-25 17:48 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-25 18:22 -!- edt(~Ed@dsl-60-240.aei.ca) has joined #tux3 2009-03-25 20:40 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-25 21:16 -!- 
tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-25 21:31 -!- edt(~Ed@dsl-60-240.aei.ca) has joined #tux3 2009-03-25 21:41 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-25 21:46 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-25 22:11 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-26 01:22 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-03-26 01:22 I was thinking of another strategy, instead of reserved region of bitmap 2009-03-26 01:23 reserved region needs additional code, and may not be so efficient 2009-03-26 01:24 well, I guess it's not especially efficient 2009-03-26 01:24 so, my suggestion is 2009-03-26 01:24 re-logging the deflush entries as defree log 2009-03-26 01:25 i.e. LOG_BFREE_ON_FLUSH is logged as LOG_BFREE in flush_log() 2009-03-26 01:25 so, there is no special case to handle it 2009-03-26 01:25 re-logged LOG_BFREE is freed in next delta 2009-03-26 01:26 and those logs are packed into log blocks 2009-03-26 01:26 instead of dirtying bitmap blocks 2009-03-26 01:27 I hope this strategy is simple, and efficient enough 2009-03-26 04:29 -!- RazvanM(~RazvanM@pool-173-67-57-126.bltmmd.east.verizon.net) has joined #tux3 2009-03-26 06:10 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-26 08:38 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-26 09:20 -!- dcg(~dcg@229.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-26 10:09 -!- bd__(~foo@satoko.is.fushizen.net) has joined #tux3 2009-03-26 10:09 -!- RazvanM_(~RazvanM@dazzler.isi.jhu.edu) has joined 
#tux3 2009-03-26 10:09 -!- shapor_(~shapor@yzf.shapor.com) has joined #tux3 2009-03-26 10:10 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-26 10:13 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-03-26 10:23 -!- flips(~phillips@phunq.net) has joined #tux3 2009-03-26 11:29 um..., the flushed bnode/bitmap became visible too early 2009-03-26 11:30 if it becomes visible at the next flush, that problem may be gone 2009-03-26 11:31 I guess this cycle is natural with defree strategy 2009-03-26 11:32 i.e. flush means, flushing the current dirty data, and making the previous flush visible 2009-03-26 11:32 and retire previous flush logs 2009-03-26 11:34 ah, damn, no 2009-03-26 11:34 maybe, to start future deltas, current flush should be visible 2009-03-26 11:35 um... 2009-03-26 12:07 hirofumi, there? 2009-03-26 12:16 yes 2009-03-26 12:16 hi 2009-03-26 12:16 hi 2009-03-26 12:17 I'm going to sleep soon though 2009-03-26 12:17 ok, I was also thinking about what you said above 2009-03-26 12:17 good 2009-03-26 12:17 after commit of the flush, there will be a few log entries remaining for the next delta 2009-03-26 12:18 yes 2009-03-26 12:18 that is, the entries caused by forking bitmap blocks 2009-03-26 12:18 hmm 2009-03-26 12:18 yes 2009-03-26 12:18 by allocating bitmap blocks, I meant 2009-03-26 12:18 yes 2009-03-26 12:18 map_region of bitmap 2009-03-26 12:18 yes 2009-03-26 12:19 now, I'm thinking of adding LOG_RETIRE when flushing the bitmap 2009-03-26 12:19 what does it do? 
2009-03-26 12:19 ah 2009-03-26 12:19 it means bitmap was flushed 2009-03-26 12:19 yes 2009-03-26 12:20 ok, I was thinking of something similar 2009-03-26 12:20 so, replay needs to search for it in the first pass 2009-03-26 12:20 a log entry to say which log entries are invalid 2009-03-26 12:21 yes 2009-03-26 12:21 ok, so we thought the same things 2009-03-26 12:21 for now, it's the simplest strategy 2009-03-26 12:22 I also thought of a more efficient strategy than search of the defree list, that also avoids re-allocating the deferred free, which we can implement later 2009-03-26 12:22 I will think more though 2009-03-26 12:22 I will add more of the obvious code to replay 2009-03-26 12:23 re-allocating the deferred free? 2009-03-26 12:23 that avoids allocating a deferred free after adding the deferred frees into the bitmap 2009-03-26 12:23 ah, strategy based on reserved bitmap? 2009-03-26 12:23 actually, pre-allocating the blocks we need to populate the bitmap 2009-03-26 12:24 ah 2009-03-26 12:24 yes 2009-03-26 12:24 it can 2009-03-26 12:24 this approach is slightly messier, and much more efficient 2009-03-26 12:24 yes, maybe, efficient 2009-03-26 12:25 however, code is not so good 2009-03-26 12:25 yes, but not so bad either 2009-03-26 12:25 yes 2009-03-26 12:25 this is a messy little corner of the commit algorithm 2009-03-26 12:25 but, if this is the messiest part, it is not too bad 2009-03-26 12:26 yes 2009-03-26 12:26 however, if we can avoid special code, it's much better 2009-03-26 12:26 of course 2009-03-26 12:26 of course 2009-03-26 12:28 well, oyasumi 2009-03-26 12:28 oyasumi 2009-03-26 12:29 btw, I'm forgetting about droot change 2009-03-26 12:30 it uses the log? 
2009-03-26 12:30 ACTION thinks 2009-03-26 12:31 no log necessary 2009-03-26 12:31 it changes the ileaf, which is part of the delta 2009-03-26 12:32 the droot log code should be deleted 2009-03-26 12:32 i see 2009-03-26 12:33 it refers to the redirected bnode before bnode flush 2009-03-26 12:33 log_iroot is not necessary either, if we record the iroot in the superblock every time it changes 2009-03-26 12:33 i see 2009-03-26 12:34 it mean there is the issue 2009-03-26 12:34 ugh 2009-03-26 12:34 it doesn't mean there is the issue 2009-03-26 12:35 it just meant the referred droot is not written to disk 2009-03-26 12:35 however, it should be dirty if it's not on disk 2009-03-26 12:35 it should be written to disk with the dirty ileaf 2009-03-26 12:35 ileaf is per-delta 2009-03-26 12:35 and bnode is per-flush 2009-03-26 12:36 so, ileaf can refer to the new bnode before flush 2009-03-26 12:36 but it is not wrong to write the redirected pointer out with the ileaf, per delta 2009-03-26 12:37 yes, an ileaf can refer to the new bnode which is not written yet 2009-03-26 12:37 yes, because, the log of bnode redirect is there 2009-03-26 12:37 that is ok 2009-03-26 12:37 yes 2009-03-26 12:37 replay will put the correct bnode in cache 2009-03-26 12:37 yes 2009-03-26 12:37 ok 2009-03-26 12:37 so, droot log is not necessary 2009-03-26 12:39 right 2009-03-26 12:39 i see 2009-03-26 12:39 and we can settle details of iroot later, for now just change the superblock and write it out 2009-03-26 12:39 yes 2009-03-26 12:40 so, I will remove the iroot and droot logging code today 2009-03-26 12:40 and add some of the replay code that is needed, like bnode split 2009-03-26 12:42 maybe we should merge some of your bug fix and cleanup patches tomorrow 2009-03-26 12:42 to reduce the diff 2009-03-26 12:42 ok 2009-03-26 13:42 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-26 16:18 flips: you around ? 
2009-03-26 16:18 I'm thinking about heading up to SF either tomorrow or later tonight 2009-03-26 16:18 hi bh 2009-03-26 16:19 so I might drive through your parts, if you want me to visit, I'll stop by 2009-03-26 16:20 I'll be around, could do coffee on the promenade or something 2009-03-26 16:20 from what time to what time ? 2009-03-26 16:20 7 - 9? 2009-03-26 16:21 I'll see what happens. I was thinking about headin to aikido class which gets out at about 7 2009-03-26 16:21 I'll take me about 1.5 hours to get up to Santa Monica if I do it tonight, otherwise maybe tomorrow or Saturday 2009-03-26 16:22 flips: btw, I figured out how to implement priority inheritance for the EDF scheduler I'm working on. I was reading your backlog 2009-03-26 16:23 looks like you folks are well on your way of getting atomic logging working 2009-03-26 16:23 seems like 2009-03-26 16:23 how's tux3 development going overall ? bug count good ? 2009-03-26 16:24 it has always been pretty low 2009-03-26 16:24 good 2009-03-26 16:24 reliable as a root file system right now ? 
2009-03-26 16:25 the sooner I get this scheduler and userspace work done, the sooner I can get at tux3 2009-03-26 20:24 -!- firefly(~firefly@1503031970.dhcp.dbnet.dk) has joined #tux3 2009-03-26 23:48 -!- RazvanM(~RazvanM@pool-173-67-57-126.bltmmd.east.verizon.net) has joined #tux3 2009-03-27 00:54 hi 2009-03-27 00:54 I've posted the pull request of random patches 2009-03-27 00:54 not all patches though 2009-03-27 00:55 those are almost of commit/log patches except new code 2009-03-27 03:08 hi 2009-03-27 03:08 random patches are good 2009-03-27 07:08 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-27 07:56 -!- dcg(~dcg@185.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-27 08:09 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-27 08:59 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-27 09:53 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-27 10:15 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-27 12:21 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-27 12:26 -!- dcg_(~dcg@66.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-27 12:44 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-27 13:35 -!- dcg__(~dcg@45.pool80-103-2.dynamic.orange.es) has joined #tux3 2009-03-27 15:00 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-03-27 15:49 -!- data`(~data@84.19.190.213) has joined #tux3 2009-03-27 15:52 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-03-27 23:47 -!- RazvanM(~RazvanM@pool-173-67-57-126.bltmmd.east.verizon.net) has joined #tux3 2009-03-28 05:23 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-03-28 06:14 -!- dagle(~dagle@host162-104.bornet.net) has joined #tux3 2009-03-28 07:02 -!- 
pythonstar(~kavli@c-0afee455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-03-28 07:56 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-28 09:16 -!- dcg(~dcg@232.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-28 09:52 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-28 10:13 -!- dagle(~dagle@148.160.162.104) has joined #tux3 2009-03-28 10:35 -!- dcg_(~dcg@76.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-28 13:10 -!- dcg_(~dcg@76.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-28 14:24 -!- konrad(~konrad@D-128-208-53-43.dhcp4.washington.edu) has joined #tux3 2009-03-28 14:57 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-28 15:54 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-28 16:00 -!- dagle1(~dagle@host162-104.bornet.net) has joined #tux3 2009-03-28 16:02 reading the lkml fsync thread 2009-03-28 16:03 seems to be multiple problems, ext3 fsync is just one 2009-03-28 16:03 about dang time :) 2009-03-28 16:03 cfq seems to be buggy 2009-03-28 16:03 and cache dirty limit always was a dodgy idea 2009-03-28 16:04 linus thinks ssd will solve the problem 2009-03-28 16:04 I think, no 2009-03-28 16:04 just push it to a new place 2009-03-28 16:15 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-28 16:33 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-28 16:42 ACTION loves these flamewars 2009-03-28 17:00 -!- dcg__(~dcg@22.pool80-103-2.dynamic.orange.es) has joined #tux3 2009-03-28 21:40 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-29 00:59 -!- RazvanM(~RazvanM@pool-173-67-57-126.bltmmd.east.verizon.net) has joined #tux3 2009-03-29 04:05 -!- dcg(~dcg@113.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-29 05:45 -!- 
dcg(~dcg@27.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-29 08:51 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-29 13:17 -!- cdk(~chinmay@115.109.13.56) has joined #tux3 2009-03-29 13:59 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-29 14:32 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-29 14:36 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-29 16:41 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-29 18:08 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-29 18:40 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-29 19:14 -!- firefly(~firefly@1503031970.dhcp.dbnet.dk) has joined #tux3 2009-03-29 22:28 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-29 22:44 -!- RazvanM(~RazvanM@pool-173-67-57-126.bltmmd.east.verizon.net) has joined #tux3 2009-03-29 23:48 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-03-30 06:31 -!- dcg(~dcg@10.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-30 07:55 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-03-30 07:58 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-30 08:26 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-30 08:42 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-30 09:39 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-30 09:48 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-03-30 10:06 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-30 10:16 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 
2009-03-30 12:33 -!- dcg(~dcg@155.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-30 14:14 -!- cdk(~chinmay@115.109.12.83) has joined #tux3 2009-03-30 14:57 -!- dcg_(~dcg@155.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-30 15:49 -!- pythonstar(~kavli@c-0afee455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-03-30 15:55 -!- dcg__(~dcg@162.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-03-30 16:03 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-30 16:12 -!- cdk(~chinmay@115.109.12.83) has joined #tux3 2009-03-30 17:12 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-30 17:24 looks to me like the big ext3 fsync thread died out without any useful action on the latency issue 2009-03-30 17:25 every ended up piling on to a patch to add blockdev flushing to non-transactional filesystems 2009-03-30 17:26 I thought, somewhere in there, there was a discussion of dirty cache limits? 2009-03-30 17:26 but I must have overlooked it 2009-03-30 17:26 -!- firefly(~firefly@1503031970.dhcp.dbnet.dk) has joined #tux3 2009-03-30 19:20 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-03-30 21:38 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-03-30 21:53 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-03-30 23:39 -!- RazvanM(~RazvanM@pool-173-67-57-126.bltmmd.east.verizon.net) has joined #tux3 2009-03-31 03:18 -!- cdk(~chinmay@115.109.15.174) has joined #tux3 2009-03-31 06:30 -!- pythonstar(~kavli@c-0afee455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-03-31 06:33 -!- dcg(~dcg@63.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-31 07:39 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-31 08:43 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-31 09:07 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 
2009-03-31 09:17 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-03-31 10:00 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-03-31 10:32 -!- dcg(~dcg@63.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-03-31 10:49 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-03-31 12:31 -!- cdk(~chinmay@115.109.12.219) has joined #tux3 2009-03-31 15:58 -!- gebi(~gebi@84-119-43-219.dynamic.xdsl-line.inode.at) has joined #tux3 2009-03-31 16:10 -!- gebi(~gebi@84-119-57-210.dynamic.xdsl-line.inode.at) has joined #tux3 2009-03-31 18:49 -!- arima(other@123-204-107-87.adsl.dynamic.seed.net.tw) has joined #tux3 2009-03-31 22:22 -!- RazvanM(~RazvanM@pool-173-67-57-126.bltmmd.east.verizon.net) has joined #tux3 2009-04-01 02:23 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-04-01 04:15 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-04-01 04:20 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-01 04:27 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-04-01 08:00 -!- dcg(~dcg@22.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-01 08:47 -!- dcg_(~dcg@20.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-01 14:50 -!- gebi_(~gebi@84.119.81.115) has joined #tux3 2009-04-01 16:39 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-01 17:17 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-01 17:36 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-01 17:55 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-01 20:37 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-01 20:51 -!- RazvanM(~RazvanM@pool-173-67-57-126.bltmmd.east.verizon.net) has joined #tux3 2009-04-01 21:30 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-01 
21:46 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-02 02:59 -!- data(~data@84.19.190.213) has joined #tux3 2009-04-02 04:25 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-04-02 04:38 -!- telmich(telmich@tee.schottelius.org) has joined #tux3 2009-04-02 04:38 hello 2009-04-02 04:39 anyone recognized that gitweb is misconfigured? 2009-04-02 04:39 http://git.tux3.org/ddtree?p=tux3fs;a=shortlog misses the css 2009-04-02 06:43 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-04-02 07:28 -!- dcg(~dcg@81.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-02 08:06 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-04-02 10:07 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-04-02 11:33 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-02 13:19 -!- cdk(~Chinmay@115.109.15.138) has joined #tux3 2009-04-02 13:27 -!- pgquiles(~pgquiles@185.Red-83-39-140.dynamicIP.rima-tde.net) has joined #tux3 2009-04-02 14:18 -!- gebi(~gebi@84.119.81.184) has joined #tux3 2009-04-02 16:04 telmich, that git repo is not used any more 2009-04-02 16:04 yes, I knew about the css 2009-04-02 16:07 -!- ChanServ changed mode/#tux3 -> +o flips 2009-04-02 16:07 -!- flips changed topic to "Topic for #tux3 is: http://tux3.org ~ git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git" 2009-04-02 16:08 -!- flips changed topic to "http://tux3.org ~ git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git" 2009-04-02 16:08 -!- flips changed mode/#tux3 -> -o flips 2009-04-02 19:31 -!- cdk(~Chinmay@115.109.14.131) has joined #tux3 2009-04-02 20:18 -!- cdk(~Chinmay@115.109.8.247) has joined #tux3 2009-04-02 21:01 -!- cdk(~Chinmay@115.109.8.121) has joined #tux3 2009-04-02 23:21 -!- RazvanM(~RazvanM@pool-173-75-179-210.bltmmd.east.verizon.net) has joined #tux3 2009-04-03 01:31 flips: aha, maybe remove the link to it? 
;-) 2009-04-03 01:32 good idea 2009-04-03 01:32 shapor? 2009-04-03 01:50 -!- arima(other@123-204-96-81.adsl.dynamic.seed.net.tw) has joined #tux3 2009-04-03 06:53 -!- dcg(~dcg@201.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-03 09:39 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-04-03 11:59 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-03 12:24 -!- pythonstar(~kavli@c-0afee455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-04-03 12:35 -!- cdk(~chinmay@121.246.32.181) has joined #tux3 2009-04-03 13:51 -!- pythonstar(~kavli@c-0afee455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-04-03 19:21 -!- cdk(~chinmay@115.109.13.139) has joined #tux3 2009-04-03 19:43 -!- cdk(~chinmay@115.109.14.84) has joined #tux3 2009-04-03 22:02 -!- RazvanM(~RazvanM@pool-173-75-176-188.bltmmd.east.verizon.net) has joined #tux3 2009-04-04 03:50 -!- arima(other@123-204-22-154.dynamic.seed.net.tw) has joined #tux3 2009-04-04 06:13 -!- dcg(~dcg@172.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-04 07:47 -!- rayvd(rayvd@arthur.bludgeon.org) has joined #tux3 2009-04-04 10:21 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-04 10:22 -!- rosaleen(~melicent@tor-irc.dnsbl.oftc.net) has joined #tux3 2009-04-04 10:22 -!- rosaleen(~melicent@tor-irc.dnsbl.oftc.net) has left #tux3 2009-04-04 10:39 -!- pythonstar(~kavli@c-affbe455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-04-04 11:30 -!- dcg(~dcg@172.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-04 14:55 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-05 00:10 -!- RazvanM(~RazvanM@pool-173-75-176-188.bltmmd.east.verizon.net) has joined #tux3 2009-04-05 04:12 -!- RazvanM(~RazvanM@pool-173-67-58-171.bltmmd.east.verizon.net) has joined #tux3 2009-04-05 08:44 -!- RazvanM_(~RazvanM@pool-173-67-56-117.bltmmd.east.verizon.net) has joined #tux3 2009-04-05 09:14 -!- 
pgquiles(~pgquiles@27.Red-81-32-36.dynamicIP.rima-tde.net) has joined #tux3 2009-04-05 12:09 -!- dcg(~dcg@27.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-05 13:32 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-05 13:41 -!- cdk(~chinmay@115.109.10.161) has joined #tux3 2009-04-05 14:51 -!- pgquiles(~pgquiles@176.Red-83-44-238.dynamicIP.rima-tde.net) has joined #tux3 2009-04-05 15:21 -!- pgquiles_(~pgquiles@218.Red-88-18-198.staticIP.rima-tde.net) has joined #tux3 2009-04-05 19:57 -!- cdk_(~chinmay@115.109.12.85) has joined #tux3 2009-04-05 23:43 -!- pgquiles__(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-04-06 00:24 -!- RazvanM(~RazvanM@pool-173-67-56-117.bltmmd.east.verizon.net) has joined #tux3 2009-04-06 00:30 -!- pgquiles_(~pgquiles@218.Red-88-18-198.staticIP.rima-tde.net) has joined #tux3 2009-04-06 00:42 -!- pgquiles__(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-04-06 00:58 -!- pgquiles_(~pgquiles@218.Red-88-18-198.staticIP.rima-tde.net) has joined #tux3 2009-04-06 06:39 -!- dcg(~dcg@32.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-06 07:59 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-04-06 09:54 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-04-06 09:58 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-04-06 10:41 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-06 10:43 -!- dcg(~dcg@32.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-06 13:09 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-04-06 14:57 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-04-06 15:33 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-06 15:56 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-06 23:29 -!- pgquiles_(~pgquiles@218.Red-88-18-198.staticIP.rima-tde.net) has joined #tux3 2009-04-06 23:40 -!- 
RazvanM(~RazvanM@pool-173-67-56-117.bltmmd.east.verizon.net) has joined #tux3 2009-04-07 00:03 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-04-07 00:17 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-04-07 00:17 -!- konrad(~konrad@D-128-208-53-43.dhcp4.washington.edu) has joined #tux3 2009-04-07 00:17 -!- rayvd(rayvd@arthur.bludgeon.org) has joined #tux3 2009-04-07 02:48 -!- pythonstar(~kavli@c-affbe455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-04-07 02:54 -!- pgquiles__(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-04-07 03:19 -!- pgquiles_(~pgquiles@218.Red-88-18-198.staticIP.rima-tde.net) has joined #tux3 2009-04-07 07:00 -!- dcg(~dcg@116.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-07 08:56 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-07 09:36 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-04-07 10:13 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-07 11:06 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-07 11:46 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-07 11:46 -!- dcg(~dcg@198.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-07 13:07 -!- pgquiles__(~pgquiles@26.Red-79-155-126.dynamicIP.rima-tde.net) has joined #tux3 2009-04-07 13:26 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-07 14:47 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-04-07 15:15 -!- dcg(~dcg@89.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-07 16:43 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-04-07 17:07 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-07 17:11 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-04-07 22:59 -!- 
RazvanM(~RazvanM@pool-173-67-56-117.bltmmd.east.verizon.net) has joined #tux3 2009-04-07 23:02 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-04-07 23:14 -!- kedars_(~kedars@socks.wantstofly.org) has joined #tux3 2009-04-08 00:01 test 2009-04-08 00:01 bah 2009-04-08 04:36 -!- pgquiles__(~pgquiles@26.Red-79-155-126.dynamicIP.rima-tde.net) has joined #tux3 2009-04-08 05:01 -!- pgquiles__(~pgquiles@26.Red-79-155-126.dynamicIP.rima-tde.net) has joined #tux3 2009-04-08 05:05 -!- marcin_(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-04-08 06:28 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-08 06:40 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-08 07:46 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-08 09:35 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-08 10:34 -!- arima(other@123-204-6-200.dynamic.seed.net.tw) has joined #tux3 2009-04-08 11:45 -!- arima(other@123-204-7-217.dynamic.seed.net.tw) has joined #tux3 2009-04-08 11:50 -!- dcg(~dcg@61.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-08 12:06 -!- arima(other@123-204-7-84.dynamic.seed.net.tw) has joined #tux3 2009-04-08 12:16 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-08 12:30 -!- dcg_(~dcg@60.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-08 12:58 -!- arima(other@123-204-17-145.dynamic.seed.net.tw) has joined #tux3 2009-04-08 13:24 -!- arima(other@123-204-22-88.dynamic.seed.net.tw) has joined #tux3 2009-04-08 13:49 -!- arima(~other@123-204-17-42.dynamic.seed.net.tw) has joined #tux3 2009-04-08 13:52 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-08 14:21 -!- arima(other@123-204-6-207.dynamic.seed.net.tw) has joined #tux3 2009-04-08 14:42 -!- arima(other@123-204-64-222.adsl.dynamic.seed.net.tw) has joined 
#tux3 2009-04-08 15:12 -!- arima(other@123-204-64-248.adsl.dynamic.seed.net.tw) has joined #tux3 2009-04-08 15:25 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-04-08 15:48 -!- pgquiles(~pgquiles@2.Red-83-41-234.dynamicIP.rima-tde.net) has joined #tux3 2009-04-08 16:37 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-08 16:48 -!- gebi_(~gebi@84-119-54-65.dynamic.xdsl-line.inode.at) has joined #tux3 2009-04-08 22:00 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-04-08 23:05 -!- RazvanM(~RazvanM@pool-173-67-56-117.bltmmd.east.verizon.net) has joined #tux3 2009-04-09 01:01 back are the days where you need a LD_PRELOADED fsync which just does nothing :/ 2009-04-09 12:26 -!- tim_dimm(~timothyhu@cpe-76-166-229-148.socal.res.rr.com) has joined #tux3 2009-04-09 13:08 -!- fran(~proudfoo@61.148.115.250) has joined #tux3 2009-04-09 13:08 -!- fran(~proudfoo@61.148.115.250) has left #tux3 2009-04-09 14:32 -!- konrad(~konrad@D-128-208-53-43.dhcp4.washington.edu) has joined #tux3 2009-04-09 15:12 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-04-09 18:55 gebi, there? 2009-04-09 19:32 be back in a moment with a new disk 2009-04-09 19:59 -!- flips(~phillips@phunq.net) has joined #tux3 2009-04-09 19:59 -!- flipz(~phillips@phunq.net) has joined #tux3 2009-04-09 22:25 hmm, root partition on my workstation/server somehow got to be a 25G partition with 1K blocks 2009-04-09 22:26 after maybe 2 years with just journal recovery instead of fsck, one file has the wrong size and a whole bunch of . and .. 
"have filetype set" 2009-04-09 22:27 probably did 50 journal recoveries in that time, thanks to firefox ooms 2009-04-09 22:27 so that is not too bad, but not perfect by any means 2009-04-09 22:28 no data lost, one dubious file size change 2009-04-09 22:42 hey flips 2009-04-09 23:33 -!- RazvanM(~RazvanM@pool-173-67-56-117.bltmmd.east.verizon.net) has joined #tux3 2009-04-10 01:07 -!- flips(~phillips@phunq.net) has joined #tux3 2009-04-10 02:47 -!- flips(~phillips@phunq.net) has joined #tux3 2009-04-10 03:28 -!- flips(~phillips@phunq.net) has joined #tux3 2009-04-10 05:00 flips: yes? 2009-04-10 09:26 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-04-10 10:43 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-10 12:00 -!- flipz(~phillips@phunq.net) has joined #tux3 2009-04-10 12:15 -!- gebi_(~gebi@84-119-43-7.dynamic.xdsl-line.inode.at) has joined #tux3 2009-04-10 12:18 -!- dcg(~dcg@110.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-10 13:01 -!- marcin_(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has left #tux3 2009-04-10 13:27 -!- dcg_(~dcg@179.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-10 14:00 -!- dcg__(~dcg@26.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-10 15:20 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-10 20:26 -!- flips(~phillips@phunq.net) has joined #tux3 2009-04-10 20:27 installing grub using grub commands is not easy 2009-04-10 20:27 even using the easy-to-use helper script 2009-04-10 20:27 I don't suppose this was ever intended to be done by mortals 2009-04-10 20:31 now... 
just need to explain to grub that it should load its menu instead of waiting for me to tell it to 2009-04-10 21:00 -!- flips(~phillips@phunq.net) has joined #tux3 2009-04-10 21:17 -!- flips(~phillips@phunq.net) has joined #tux3 2009-04-10 21:26 -!- flips(~phillips@phunq.net) has joined #tux3 2009-04-10 21:33 ACTION <- reduced to examining the boot sectors with hexdump 2009-04-10 21:35 ok, the grub setup command makes wrong assumptions 2009-04-10 21:36 assumes the boot menu is /boot/grub/menu.lst when actually it is /grub/menu.lst 2009-04-10 21:38 grub> setup --prefix=/grub (hd0) 2009-04-10 21:38 seems to make it do the right thing 2009-04-10 21:39 ACTION wonders why somebody thought it would be a good idea to have parens around the file name 2009-04-10 21:39 device name 2009-04-10 21:47 -!- flips(~phillips@phunq.net) has joined #tux3 2009-04-10 21:48 finally done with grub 2009-04-10 21:48 nobody should be subjected to that 2009-04-10 21:48 ACTION pines for the simple days of lilo 2009-04-10 22:16 one more reboot to remove the old hd 2009-04-10 22:25 -!- flips(~phillips@phunq.net) has joined #tux3 2009-04-10 23:58 -!- RazvanM(~RazvanM@pool-173-67-56-117.bltmmd.east.verizon.net) has joined #tux3 2009-04-11 01:40 flips: yea, grub is not that nice 2009-04-11 01:40 what version have you used? 
1 or 2 2009-04-11 01:41 but grub2 is the only one who can boot from degraded raid1 2009-04-11 01:48 gebi, I was wondering what you meant by LD_PRELOADED 2009-04-11 01:48 ah, good point about grub(2) 2009-04-11 01:49 it's just annoying by stupid programs to call useless fsyncs 2009-04-11 01:49 ah, so it just stubs them out 2009-04-11 01:49 right 2009-04-11 01:49 secret tricks of the ultragurus 2009-04-11 01:49 imho the user should be the only one to decide if the data is valuable or not 2009-04-11 01:50 well, a mta has a pretty good idea 2009-04-11 01:50 ack, postfixs default is to disable fsyncs since ages 2009-04-11 01:51 heh 2009-04-11 01:51 and it relies on rename behavior alone? 2009-04-11 01:51 imho it's just stupid by those gnome programs to rewrite the same file, isn't it the unix way to just write and rename? 2009-04-11 01:53 yes, only rename 2009-04-11 01:53 yes, overwriting files you care about is always wrong 2009-04-11 01:54 i must have missed the one who stand up and calls those program behaviour stupid on lkml and just don't care 2009-04-11 01:59 well now I'm dealing with an issue that must affect just about everybody 2009-04-11 02:00 trying to decide what to do with data from old, small disks that have been replaced in various computers over the years 2009-04-11 02:00 some of which might be worth keeping 2009-04-11 02:00 but most of which is easily replaceable junk 2009-04-11 02:00 like obsolete debian installs 2009-04-11 02:00 or http cache 2009-04-11 02:01 so... 
the only media big enough to store this stuff is a hard disk 2009-04-11 02:01 which means I end up with stuff I might want to keep scattered over lots of random hard disks 2009-04-11 02:02 to remove the debian install around valuable data i use dpkg -L most of the time 2009-04-11 02:02 heh 2009-04-11 02:03 then a last mksquashfs and it gets quite small most of the time ;) 2009-04-11 02:03 well it's too much work to strip away all the gunk, I keep any "collectable" data in a couple of special places, like home and src 2009-04-11 02:03 heh 2009-04-11 02:03 good idea about squashfs 2009-04-11 02:04 better than b/g/rzipping it 2009-04-11 02:04 which makes it hard to access 2009-04-11 02:04 yea, not being able to mount is a pita 2009-04-11 02:05 maybe there is a fuse fs that can directly mount a *zip 2009-04-11 02:05 anyway, the data isn't usually very big compared to the current gen hard disk 2009-04-11 02:05 organizing it is the problem 2009-04-11 02:06 after a few years, one ends up with a box full of hard disks, scared to throw any away 2009-04-11 02:06 ...then you die, and somebody throws them all away for you 2009-04-11 02:06 I suppose that is the solution 2009-04-11 02:06 *gg* 2009-04-11 02:08 as all of these disks are encrypted they can do nothing more than throw them away 2009-04-11 02:08 it's quite common nowadays to exploit systems through their backups 2009-04-11 02:11 I have a name for it: sedimentary data 2009-04-11 02:24 flips: is tux3 conceptually faster on metadata heavy workloads than a journaled or cow fs? 2009-04-11 02:25 faster than journalled for sure 2009-04-11 02:26 any cow (write anywhere) can tie 2009-04-11 02:27 nice :) 2009-04-11 02:31 flips: btw... it looks like the gitweb on tux3.org is missing the css file 2009-04-11 02:31 hey flips 2009-04-11 02:32 right, rather than me updating that old broken gitweb, we will change the link to point at git.kernel.org 2009-04-11 02:34 hi bh 2009-04-11 02:34 what's up? 
2009-04-11 02:36 oh thinking about some scheduler rebalancing issues that I need to solve 2009-04-11 02:38 and what metafiles I need to check before doing inode tree checking 2009-04-11 02:38 ACTION talks to flips privately 2009-04-11 02:41 ACTION wonders why ext3 always wants to create lost+found in /, why not just create it if detached inodes are found? 2009-04-11 02:42 reserve the space for it? 2009-04-11 02:43 isn't lost+found created by e2fsck? 2009-04-11 02:44 i've many ext2/3s without the folder 2009-04-11 02:44 I don't think any space is reserved for lost+found 2009-04-11 02:45 except in the inode table 2009-04-11 02:45 e2fsck insists that your filesystem still has errors if you don't let it create l+f 2009-04-11 02:47 2 40755 (2) 1000 1000 1024 11-Apr-2009 18:46 . 2009-04-11 02:47 2 40755 (2) 1000 1000 1024 11-Apr-2009 18:46 .. 2009-04-11 02:47 11 40700 (2) 0 0 12288 11-Apr-2009 18:46 lost+found 2009-04-11 02:47 it seems 12kb 2009-04-11 02:48 dd if=/dev/zero of=file bs=1M count=50 2009-04-11 02:48 mke2fs -j file 2009-04-11 02:48 debugfs: ls 2009-04-11 02:48 ls -l 2009-04-11 02:48 it's not so big, however it seems to have some blocks 2009-04-11 02:54 ah, maybe a few indirect blocks 2009-04-11 02:54 to improve reliability maybe? 2009-04-11 02:54 it seems like an anachronism 2009-04-11 02:54 ext4 should drop it 2009-04-11 02:54 I'll put in my request to mingming 2009-04-11 02:54 anyway, konnichi wa hirofumi 2009-04-11 02:55 it's been a while 2009-04-11 02:55 konnichiwa :) 2009-04-11 02:57 well, I have my disks organized and cleaned up, and the phunq.net server disk cleaned up and backed up... 
time to do some real work 2009-04-11 02:58 but sleep first 2009-04-11 02:58 :) 2009-04-11 02:58 yes 2009-04-11 02:58 me too 2009-04-11 02:58 it was long term break 2009-04-11 02:58 btw, recent kernel was introduced the some interesting features 2009-04-11 02:59 ext3 is now using data=writeback as default 2009-04-11 02:59 :p 2009-04-11 02:59 configuarable though 2009-04-11 02:59 bad idea 2009-04-11 03:00 wow, what a crazy way for the monster fsync thread to end 2009-04-11 03:00 however, it seems to solved the fsync problem 2009-04-11 03:00 yes 2009-04-11 03:00 eh, traded it for a broken filesystem problem 2009-04-11 03:00 yes, it's a trade off 2009-04-11 03:00 bad trade 2009-04-11 03:00 however, latency of ext3 is too bad 2009-04-11 03:01 I really hate having broken files after a crash 2009-04-11 03:01 much more than having a jerky system 2009-04-11 03:01 data=ordered can be broken 2009-04-11 03:01 too 2009-04-11 03:01 but doesn't break in practice 2009-04-11 03:01 or very rarely 2009-04-11 03:01 writeback breaks instantly and massively 2009-04-11 03:02 well 2009-04-11 03:02 I wonder what stephen tweedie has to say about it 2009-04-11 03:02 well, fsync problem was not solveable without it 2009-04-11 03:03 well 2009-04-11 03:03 I need to catch up on the discussion 2009-04-11 03:03 and adaptive mutex was introduced 2009-04-11 03:03 it seems like a wrong decision 2009-04-11 03:03 and WRITE_SYNC_PLUG and friend was introduced 2009-04-11 03:04 adaptive mutex is cool, somebody once suggested calling it a spinaphore 2009-04-11 03:04 fs can tell intent to io-sched more 2009-04-11 03:04 arjan ven de ven to be exact 2009-04-11 03:04 oh 2009-04-11 03:05 around 2001 or so 2009-04-11 03:05 takes a while for these things to get done ;) 2009-04-11 03:05 well, I'm not beliveing adaptive lock perfectly 2009-04-11 03:06 however, now, all mutex is adaptive mutex 2009-04-11 03:06 well, tux3 has to be basically functioning, then we can play in the IO latency game too 2009-04-11 03:06 
yes, exactly 2009-04-11 03:06 without so much of the legacy cruft that ext3 has, like having to support recursive transactions so that quota files work 2009-04-11 03:07 performance is really interesting game for me 2009-04-11 03:07 i see 2009-04-11 03:08 performance is very interesting to me too 2009-04-11 03:09 and... people will only care about spinning media performance for another 5-10 years, so that fun stuff has to get done now 2009-04-11 03:09 it's like puzzles 2009-04-11 03:09 btw, probably, with many memory, fsync latency problem is really annoy me 2009-04-11 03:10 of ext3 2009-04-11 03:10 it annoys me even with a gig or so 2009-04-11 03:10 I thought about how we can avoid the same thing 2009-04-11 03:11 yes, we must avoid it 2009-04-11 03:11 we are going to hit it much like ext3, at first 2009-04-11 03:11 then there are two different categories of fsync dependencies: 1) data writes 2) namespace changes 2009-04-11 03:12 with 4g mem, compile kernel on background means stop my work frequently 2009-04-11 03:12 yes 2009-04-11 03:12 yes, that is bad 2009-04-11 03:12 flips: adaptive locks should only be used for short critical sections 2009-04-11 03:12 spinning for a long time blows 2009-04-11 03:12 if we use the behavior like data=ordered 2009-04-11 03:12 and we will, at first 2009-04-11 03:13 yes 2009-04-11 03:13 bh, yes 2009-04-11 03:13 my concern is it 2009-04-11 03:13 in general, to handle an fsync we want to create a short delta and commit it separately 2009-04-11 03:13 but, now, iirc, all mutex is 2009-04-11 03:14 that is, a delta that will be committed before the active delta is committed 2009-04-11 03:14 yes 2009-04-11 03:14 this is not too hard, because we can log the data allocs involved 2009-04-11 03:14 I was thinking about it 2009-04-11 03:14 well that helps anyway 2009-04-11 03:15 and we have to update the affected inode table block 2009-04-11 03:15 I guess it's not so easy 2009-04-11 03:15 or maybe we can avoid that inode table update with a new log 
message 2009-04-11 03:16 transactions has dependency to other transactions probably 2009-04-11 03:16 data fsync is the easy one, directory fsync is hard 2009-04-11 03:16 yes 2009-04-11 03:16 and data fsync and dir fsync can be mix 2009-04-11 03:16 but directory fsync is not a common operation, I think 2009-04-11 03:16 I could be wrong about that 2009-04-11 03:16 create and write, then fsync 2009-04-11 03:17 so, what do we do for it? 2009-04-11 03:17 it's hard :) 2009-04-11 03:17 yes 2009-04-11 03:17 I didn't find a lot of good information about the expected semantics of directory fsync 2009-04-11 03:18 yes 2009-04-11 03:18 I think, atomic rename is more commonly used than directory fsync 2009-04-11 03:19 yes 2009-04-11 03:19 well, I guess people doesn't care about directory 2009-04-11 03:19 anyway, after thinking about directory fsync for a while, I realized that the general case of dependencies is very complicated 2009-04-11 03:19 night 2009-04-11 03:19 goodnight 2009-04-11 03:19 good night 2009-04-11 03:20 yes, directory is really complex 2009-04-11 03:20 but most of the time, a directory fsync is pretty simple, only one directory is affected 2009-04-11 03:21 maybe 2009-04-11 03:21 there may be a simple way of detecting the case where none of the unflushed directory enties are involved in a mv from some other directory 2009-04-11 03:21 in that case, the directory fsync is just a data fsync 2009-04-11 03:22 because there are no cross directory ordering dependencies to preserve 2009-04-11 03:22 -!- bobby1234(~bobby@122.163.51.205) has joined #tux3 2009-04-11 03:23 so, if somebody does a move between directories, then directory fsync, we can just give up and do a full flush of all dirty directories 2009-04-11 03:24 um... 
2009-04-11 03:24 perhaps, some dirty inode flushes can be avoided 2009-04-11 03:24 create 2 files 2009-04-11 03:24 those have some dependency already 2009-04-11 03:24 2 inodes should be flushed 2009-04-11 03:25 ah, depedency between dirent and inode table 2009-04-11 03:25 and two files should be flushed at a time, or correct order 2009-04-11 03:25 I don't think the order they appear on durable media is defined 2009-04-11 03:26 for rename, I think the order the changes appear on durable media is important, but not for create 2009-04-11 03:27 well 2009-04-11 03:27 ordering between create and rename is significant 2009-04-11 03:28 if files was created (name is A, and B), A, then B 2009-04-11 03:28 in the same directory 2009-04-11 03:29 and fsync the directory created A and B 2009-04-11 03:29 we should the flush both 2009-04-11 03:29 ah 2009-04-11 03:30 if fsync the B, what do we do 2009-04-11 03:30 fsync on b is just a data sync 2009-04-11 03:30 fsync on the directory would write both new dirents 2009-04-11 03:31 ok 2009-04-11 03:33 updating the inode table for just a subset of dirty inodes seems like it could be tricky 2009-04-11 03:34 I think it is ok in general, to flush some dirty inodes to disk that are not involved in the directory fsync, but just share an inode table block with an inode that must be flushed 2009-04-11 03:36 well, oyasumi hirofumi 2009-04-11 03:36 oyasumi 2009-04-11 03:36 I was reading sus3 2009-04-11 03:36 there is not good define 2009-04-11 03:37 then it was not just my imagination 2009-04-11 03:37 so, we should define those over posix/sus3 2009-04-11 03:37 yes 2009-04-11 03:37 well that will be fun too, kind of 2009-04-11 03:37 yes 2009-04-11 03:37 and try to come up with some accurate definition of rename ordering 2009-04-11 03:38 good 2009-04-11 03:38 ok, oyasumi again, I hope tomorrow will be a good day for tux3 logging over here 2009-04-11 03:38 oyasumi 2009-04-11 09:43 -!- dcg(~dcg@43.pool80-103-1.dynamic.orange.es) has joined #tux3 
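Editorial sketch of the "write and rename" idiom and the two kinds of fsync discussed in the log above (data fsync on a file versus fsync on its directory). It assumes a POSIX system where a directory can be opened read-only and passed to fsync; the function name `durable_replace` and the `.tmp-` prefix are made up for illustration, and none of this is tux3 code. Whether the final directory fsync is needed for the new name to survive a crash depends on the filesystem, which is the under-specified area the participants point out in SUS3.

```python
import os
import tempfile


def durable_replace(path: str, data: bytes) -> None:
    """Replace `path` with `data` without exposing a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new contents to a temporary file in the same directory,
    # so the final rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # data fsync: push file contents to disk
        os.replace(tmp_path, path)  # atomic rename over the old name
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Directory fsync: request that the new directory entry be made durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A reader opening `path` at any moment sees either the complete old contents or the complete new contents, which is the property that overwriting a file in place cannot offer. Note that `mkstemp` creates the file with owner-only permissions, so a real program may want to copy the mode of the file it replaces.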
2009-04-11 10:49 -!- kedars_(~kedars@socks.wantstofly.org) has joined #tux3 2009-04-11 11:27 -!- pgquiles_(~pgquiles@180.Red-83-35-112.dynamicIP.rima-tde.net) has joined #tux3 2009-04-11 13:41 disk is the new tape (t/f) 2009-04-11 13:41 or: gallons of data with a soda straw to transport them 2009-04-11 14:50 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-11 14:59 folks 2009-04-11 16:49 to copy a 750 GB disk at 60 MB/sec: 3.5 hours 2009-04-11 16:50 full disk copy time increases each year 2009-04-11 16:50 got to be some conclusion there 2009-04-11 16:51 soon, moving data will be much like moving a glacier 2009-04-11 16:51 this data doesn't flow, it creeps 2009-04-11 16:52 or it seeps 2009-04-11 16:52 forming data stalagtites in information caves :) 2009-04-11 20:25 -!- kedars_(~kedars@socks.wantstofly.org) has joined #tux3 2009-04-11 23:37 -!- kedars_(~kedars@socks.wantstofly.org) has joined #tux3 2009-04-12 13:56 -!- dcg(~dcg@24.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-12 21:05 -!- kedars_(~kedars@socks.wantstofly.org) has joined #tux3 2009-04-13 00:49 flips: http://tux3.org/source.html 2009-04-13 00:49 updated 2009-04-13 01:13 -!- samlh(~sam@67.129.121.145) has joined #tux3 2009-04-13 05:53 -!- dcg(~dcg@12.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-13 07:46 -!- dcg_(~dcg@95.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-13 07:53 -!- dcg__(~dcg@72.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-13 08:01 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-04-13 09:49 -!- kedars_(~kedars@socks.wantstofly.org) has joined #tux3 2009-04-13 09:55 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-13 09:59 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-04-13 11:25 -!- pythonstar(~kavli@c-affbe455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-04-13 12:54 -!- dcg(~dcg@72.pool80-103-1.dynamic.orange.es) has 
joined #tux3 2009-04-13 12:56 -!- kedars_(~kedars@socks.wantstofly.org) has joined #tux3 2009-04-13 13:22 -!- dcg_(~dcg@169.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-13 14:35 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-13 17:29 exhibit A: "Intel will ultimately be forced to redesign their flash write algorithms" http://www.mail-archive.com/tux3@tux3.org/msg00689.html 2009-04-13 17:30 exhibit B: "Intel has essentially admitted to the problem by releasing a new firmware for the X25-M" http://hardware.slashdot.org/article.pl?sid=09/04/13/2332211 2009-04-13 21:52 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-14 02:14 flips: Pwn! 2009-04-14 02:14 dagle? 2009-04-14 02:14 the intel thingy. 2009-04-14 02:15 right, gave me a little start when somebody said intel fixed a different bug 2009-04-14 02:16 fortunately for my slashdot rep, a false alarm ;) 2009-04-14 02:16 ACTION wonders if a slashdot rep matters even a little bit 2009-04-14 03:42 hey flips 2009-04-14 04:32 -!- kedars_(~kedars@socks.wantstofly.org) has joined #tux3 2009-04-14 04:48 -!- pgquiles(~pgquiles@162.Red-79-152-213.dynamicIP.rima-tde.net) has joined #tux3 2009-04-14 08:09 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-14 08:14 -!- dcg(~dcg@38.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-14 09:23 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-14 09:34 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-04-14 10:07 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-14 13:40 -!- dcg(~dcg@79.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-14 14:24 -!- kedars_(~kedars@socks.wantstofly.org) has joined #tux3 2009-04-14 15:19 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-14 16:34 -!- dcg_(~dcg@251.pool80-103-0.dynamic.orange.es) has 
joined #tux3 2009-04-14 17:34 -!- kedars_(~kedars@socks.wantstofly.org) has joined #tux3 2009-04-14 18:35 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-14 19:39 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-14 22:26 -!- rayvd(rayvd@arthur.bludgeon.org) has joined #tux3 2009-04-15 04:32 -!- tripp(~belle@189.14.64.148) has joined #tux3 2009-04-15 04:32 -!- tripp(~belle@189.14.64.148) has left #tux3 2009-04-15 04:36 -!- devonne(~peyton@221.234.24.46) has joined #tux3 2009-04-15 04:36 -!- devonne(~peyton@221.234.24.46) has left #tux3 2009-04-15 04:38 -!- arnett(~reiman@213.185.116.152) has joined #tux3 2009-04-15 04:38 -!- arnett(~reiman@213.185.116.152) has left #tux3 2009-04-15 04:42 -!- yeal(~damiano@200.63.17.162) has joined #tux3 2009-04-15 04:42 -!- yeal(~damiano@200.63.17.162) has left #tux3 2009-04-15 04:47 -!- lapointe(~duthie@190.24.132.2) has joined #tux3 2009-04-15 04:47 -!- lapointe(~duthie@190.24.132.2) has left #tux3 2009-04-15 07:17 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-04-15 07:58 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-04-15 08:16 http://blog.lifepattern.org/2009/04/11/io-performance-monitoring/ 2009-04-15 08:30 -!- pgquiles__(~pgquiles@162.Red-79-152-213.dynamicIP.rima-tde.net) has joined #tux3 2009-04-15 08:37 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-15 08:46 -!- kedars_(~kedars@socks.wantstofly.org) has joined #tux3 2009-04-15 11:07 -!- gayleen(~ellul@proxy3.library.nuigalway.ie) has joined #tux3 2009-04-15 11:07 -!- gayleen(~ellul@proxy3.library.nuigalway.ie) has left #tux3 2009-04-15 11:24 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-15 11:27 -!- mingming(~mingming@32.97.110.51) has joined 
#tux3 2009-04-15 12:47 -!- konrad(~konrad@D-128-208-53-43.dhcp4.washington.edu) has joined #tux3 2009-04-15 13:52 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-15 18:45 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-04-15 18:53 -!- ajonat(~ajonat@190.48.114.21) has joined #tux3 2009-04-15 19:01 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-15 19:32 -!- konrad(~konrad@D-128-208-53-43.dhcp4.washington.edu) has joined #tux3 2009-04-15 19:49 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-15 21:09 -!- data(~data@84.19.190.213) has joined #tux3 2009-04-15 21:29 -!- amey_m(~amey@117.195.43.15) has joined #tux3 2009-04-15 22:18 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-15 22:48 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-15 23:17 -!- ajonat(~ajonat@190.48.114.21) has joined #tux3 2009-04-16 00:14 -!- kedars_(~kedars@socks.wantstofly.org) has joined #tux3 2009-04-16 01:27 -!- zaka(~mariland@56.Red-80-58-205.staticIP.rima-tde.net) has joined #tux3 2009-04-16 02:03 -!- pgquiles__(~pgquiles@162.Red-79-152-213.dynamicIP.rima-tde.net) has joined #tux3 2009-04-16 02:03 -!- ajonat_(~ajonat@190.48.126.68) has joined #tux3 2009-04-16 08:40 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-16 08:48 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-16 09:08 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-04-16 09:17 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-04-16 09:49 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-16 10:33 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-04-16 15:05 -!- 
tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-16 20:39 -!- konrad(~konrad@D-128-208-53-43.dhcp4.washington.edu) has joined #tux3 2009-04-16 23:12 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-16 23:24 -!- ajonat(~ajonat@190.48.102.216) has joined #tux3 2009-04-17 00:39 -!- pgquiles_(~pgquiles@34.Red-83-39-61.dynamicIP.rima-tde.net) has joined #tux3 2009-04-17 02:33 -!- kunir(~kunir@3ecc1103.tietoverkkopalvelut.fi) has joined #tux3 2009-04-17 08:54 -!- konrad(~konrad@D-128-208-53-43.dhcp4.washington.edu) has joined #tux3 2009-04-17 09:29 ACTION is away: Away 2009-04-17 10:11 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-04-17 11:04 -!- pythonstar(~kavli@c-affbe455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-04-17 13:06 http://www.dedoimedo.com/computers/lkcd.html 2009-04-17 13:10 nice project 2009-04-17 13:12 ah, you're alive! 2009-04-17 13:12 ACTION pinches self 2009-04-17 13:12 yep, it hurts 2009-04-17 13:13 sounds like a NIN song ;) 2009-04-17 13:13 I am ashamed to admit, I have never used a crash dump 2009-04-17 13:13 pain's the only thing that's real 2009-04-17 13:13 I should try it 2009-04-17 13:13 never used in user space, never mind kernel 2009-04-17 13:13 i discovered a new debug technique today ;) 2009-04-17 13:13 these days, crash dump defaults to off 2009-04-17 13:14 in the good old days, my directories would end up littered with gnome crash dumps 2009-04-17 13:14 slow-mo -- i record my desktop when stuff crashes too fast to see and there's no scrollback. 
then i go frame by frame to locate the screen with info i need ;) 2009-04-17 13:15 oh yeah, slow-mo 2009-04-17 13:15 where's the button 2009-04-17 13:15 got to be here on this puter somewhere 2009-04-17 13:15 no more turbo buttons like in XT-AT days :( 2009-04-17 13:16 wah 2009-04-17 13:16 now I can't make my computer go fast 2009-04-17 13:16 apparently Nehalem does it automatically 2009-04-17 13:17 temporary overclock when it detects prolonged high load 2009-04-17 13:17 it's not the same without the button 2009-04-17 13:17 and where's my cassette port 2009-04-17 13:17 things are going backwards 2009-04-17 13:20 hmm, intel turbo mode apparently increases voltage on selected cores while other cores are powered off 2009-04-17 13:21 so the total thermal output is curbed? 2009-04-17 13:21 it's only a short step from there to having a gas pedal as standard 2009-04-17 13:21 I guess that's the theory 2009-04-17 13:21 soon my computer will have a tach with a redline 2009-04-17 13:22 there will be specially configured racing computers 2009-04-17 13:22 oh wait, there already are 2009-04-17 13:22 actually, some turbo cars these days have a 'scramble' boost... 2009-04-17 13:22 all my fetishes are converging... 2009-04-17 13:40 Let's add a tux3 --boost 2009-04-17 13:41 excellent 2009-04-17 13:41 ext4 has something like that, an option to boost the priority of the journal thread 2009-04-17 13:41 Make it 10% slower and see how many gentoo users use it. 
:D 2009-04-17 13:54 -!- shreve(~plante@88.87.129.118) has joined #tux3 2009-04-17 13:54 -!- shreve(~plante@88.87.129.118) has left #tux3 2009-04-17 13:55 -!- shreve(~plante@88.87.129.118) has joined #tux3 2009-04-17 13:55 -!- shreve(~plante@88.87.129.118) has left #tux3 2009-04-17 14:34 -!- gebi_(~gebi@84-119-66-10.dynamic.xdsl-line.inode.at) has joined #tux3 2009-04-17 15:18 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-17 15:50 he flips 2009-04-17 17:18 -!- samlh(~sam@67.129.121.145) has joined #tux3 2009-04-17 20:00 -!- gebi(~gebi@84-119-72-56.dynamic.xdsl-line.inode.at) has joined #tux3 2009-04-17 22:16 -!- bd___(~foo@satoko.is.fushizen.net) has joined #tux3 2009-04-18 12:32 -!- cdk(~chinmay@115.109.11.104) has joined #tux3 2009-04-18 12:38 hi flips 2009-04-18 12:38 hi cdk 2009-04-18 12:38 long time 2009-04-18 12:38 finished with finals? 2009-04-18 12:39 nope...they are about to begin and they take a loong time to finish :( 2009-04-18 12:39 wanted to tell you that we are still here :) 2009-04-18 12:40 very much interested in working on the dedup part as well as tux3 2009-04-18 12:41 :) 2009-04-18 12:42 you saw my post? 2009-04-18 12:42 ah yes 2009-04-18 12:42 and responded 2009-04-18 12:42 yes 2009-04-18 12:45 when btree is generalized to also support btrees mapped into files, it should be easy to convert the dedup btree to be mapped into a file 2009-04-18 12:46 yes . you spoke about that earlier .. and also mapping the buckets 2009-04-18 12:47 when will that happen ? i mean btree generalisation ? 
2009-04-18 12:52 i am thinking , till that happens we better get acquainted ourselves with the kernel part of tux3 2009-04-18 13:28 btree generalization is not far in the future 2009-04-18 13:28 it is pretty easy 2009-04-18 13:29 yes, there are plenty of kernel details to worry about 2009-04-18 13:29 the btree generalization is not affected by kernel issues 2009-04-18 13:45 away for a couple days 2009-04-18 14:49 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-18 14:53 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-04-18 16:32 cdk: What do you study? Are you an undergrad? 2009-04-18 19:59 -!- gebi_(~gebi@84-119-63-171.dynamic.xdsl-line.inode.at) has joined #tux3 2009-04-18 22:12 -!- RazvanM(~RazvanM@pool-173-67-60-217.bltmmd.east.verizon.net) has joined #tux3 2009-04-18 22:24 -!- domiel(~chatzilla@58.172.208.134) has joined #tux3 2009-04-18 22:26 hi! I'm interesting in having a play with Tux3 in its current state... I have an OpenSUSE box which is already configured as a Xen dom0, any suggestions of an appropriate domU to play with.. I'm sure any linux distro should work, but what are the developers running? 
2009-04-19 13:44 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-19 19:18 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-19 20:54 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-19 22:42 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-04-19 22:42 not much tux3 channel chat recently 2009-04-19 22:45 -!- ajonat(~ajonat@190.48.115.175) has joined #tux3 2009-04-19 23:15 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-04-19 23:33 -!- RazvanM(~RazvanM@pool-173-67-60-217.bltmmd.east.verizon.net) has joined #tux3 2009-04-20 07:33 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-20 08:05 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-20 08:07 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-04-20 08:32 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-20 08:58 -!- ajonat(~ajonat@190.48.102.106) has joined #tux3 2009-04-20 09:15 -!- ajonat(~ajonat@190.48.102.106) has joined #tux3 2009-04-20 09:25 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-20 09:51 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-04-20 09:58 -!- cdk(~chinmay@115.109.15.230) has joined #tux3 2009-04-20 10:36 -!- dagle(~dagle@host162-104.bornet.net) has joined #tux3 2009-04-20 11:02 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-20 11:58 -!- benning(~emad@219.159.199.34) has joined #tux3 2009-04-20 11:58 -!- benning(~emad@219.159.199.34) has left #tux3 2009-04-20 12:33 -!- waltraud(~catthoor@163.15.64.8) has joined #tux3 2009-04-20 12:33 -!- waltraud(~catthoor@163.15.64.8) 
has left #tux3 2009-04-20 13:49 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-04-20 14:38 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-20 15:11 -!- konrad(~konrad@johannes.cs.washington.edu) has joined #tux3 2009-04-20 15:16 -!- pythonstar(~kavli@c-affbe455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-04-20 15:26 -!- Chip_M(stefanc@apollo.orakel.ntnu.no) has joined #tux3 2009-04-20 16:50 -!- gebi_(~gebi@84-119-56-168.dynamic.xdsl-line.inode.at) has joined #tux3 2009-04-20 17:03 -!- konrad(~konrad@johannes.cs.washington.edu) has joined #tux3 2009-04-20 19:00 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-20 21:52 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-20 22:27 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-20 22:29 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-20 23:22 -!- RazvanM(~RazvanM@pool-173-67-60-217.bltmmd.east.verizon.net) has joined #tux3 2009-04-20 23:23 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-20 23:59 -!- pgquiles(~pgquiles@19.Red-88-25-134.staticIP.rima-tde.net) has joined #tux3 2009-04-21 01:21 ACTION is back (gone 87:51:38) 2009-04-21 01:34 -!- pgquiles(~pgquiles@19.Red-88-25-134.staticIP.rima-tde.net) has joined #tux3 2009-04-21 04:54 -!- dcg(~dcg@220.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-21 05:17 -!- RazvanM_(~RazvanM@96.234.237.244) has joined #tux3 2009-04-21 06:38 -!- dcg(~dcg@60.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-21 09:22 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-04-21 10:54 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-21 11:00 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-21 12:01 -!- 
dcg(~dcg@60.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-21 12:14 -!- pythonstar(~kavli@c-affbe455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-04-21 12:17 -!- pythonstar(~kavli@c-affbe455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-04-21 12:44 -!- gebi(~gebi@84-119-67-11.dynamic.xdsl-line.inode.at) has joined #tux3 2009-04-21 13:33 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-04-21 13:38 -!- pgquiles__(~pgquiles@19.Red-88-25-134.staticIP.rima-tde.net) has joined #tux3 2009-04-21 13:53 -!- dcg(~dcg@50.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-21 15:10 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-21 17:14 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-21 17:55 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-21 18:33 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-21 19:01 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-21 21:17 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-21 21:26 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-21 22:00 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-21 22:12 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-21 22:56 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-21 23:10 -!- kedars(~kedars@socks.wantstofly.org) has left #tux3 2009-04-22 00:11 http://userweb.kernel.org/~hirofumi/atomic.tar.gz 2009-04-22 00:11 there is a bit progress of atomic commit 2009-04-22 00:12 LOG_NEW_CYCLE and LOGBLOCK_FLUSH was added 2009-04-22 00:12 those would be needed to replay flush cycle properly 2009-04-22 00:13 and 
I noticed some performance issues 2009-04-22 00:13 the current frontend changes the btree on some operations (e.g. truncate) 2009-04-22 00:14 I guess it shouldn't 2009-04-22 00:14 because those changes race between frontend and backend 2009-04-22 00:15 it means the frontend or backend needs to block the other one 2009-04-22 00:15 so, for performance, truncate also should be delayed 2009-04-22 00:16 btw, some sort of replay is needed sooner or later 2009-04-22 00:17 without replay, the filesystem doesn't have its previous state, of course 2009-04-22 00:18 atomic.tar.gz is already writing the bitmap structure to the log 2009-04-22 00:18 next will try btree, maybe 2009-04-22 00:24 hey 2009-04-22 00:25 good news, where's flips these days ? on vacation ? 2009-04-22 01:33 yes, flips seems to be out for a couple of days 2009-04-22 02:01 yeah I was wondering where he was 2009-04-22 02:13 Linux file systems would benefit greatly from having non-volatile ram backing atomic commits like that and other filer based atomic operations 2009-04-22 02:14 that commit time window is kind of tricky 2009-04-22 02:14 you really would like it to hit some kind of storage ASAP 2009-04-22 02:14 and have the ordering of metadata commits done in a way that'll minimize data loss 2009-04-22 02:14 just like that ext4 problem that came up recently 2009-04-22 02:15 ACTION no longer reads lkml 2009-04-22 02:21 yes, cheap non-volatile ram is very interesting 2009-04-22 02:22 I guess it would be available in future with future technology 2009-04-22 02:22 new hardware elements technology 2009-04-22 02:49 well, it's available now with cards, but it's not a common thing unless you're doing some kind of custom hardware for a filer of some sort 2009-04-22 02:49 and you build it directly into your custom hardware 2009-04-22 02:50 I wish it was standard in modern consumer hardware 2009-04-22 02:50 it would make things better for server purposes 2009-04-22 02:50 I seem to think that flips wrote something that was kind of like that 
year or so ago 2009-04-22 03:04 yes, of course, enterprise level hardware seems to be using non-volatile ram for various perpose 2009-04-22 03:04 however, I think it's still expesive 2009-04-22 03:07 I hope more cheap and more big non-volatile ram like TB scale 2009-04-22 03:17 I wouldn't expect that you need to have a lot of nvram 2009-04-22 03:17 just enough to handle most journaling cases and stuff 2009-04-22 03:18 and to match the general write speed of the disk array or what have you 2009-04-22 03:18 yes, if nvram is for assistance of disk 2009-04-22 03:18 yeah, that's what I mean 2009-04-22 03:18 I hope nvram like device become main storage 2009-04-22 03:19 well, it's coming, but it's still developing 2009-04-22 03:19 yes 2009-04-22 03:19 I wish I could help out with tux3 but I'm doing something else at the moment 2009-04-22 03:19 it's in the plans for this years after I finish my project 2009-04-22 03:19 I'm going to make an attempt to write the online checker 2009-04-22 03:19 sounds good 2009-04-22 03:20 oh, it's good 2009-04-22 03:20 I hope to be finished by August 2009-04-22 03:20 good luck :) 2009-04-22 03:20 or sooner, but this project is freaking hard. 
It's regarding real time and the scheduler 2009-04-22 03:20 freaking hard 2009-04-22 03:20 I keep running into this and that, that's screwing the nice little conceptually clean thing that I have written 2009-04-22 03:21 it's making core paths longer and more complicated of course 2009-04-22 03:21 I never expect to have to interact with the rebalancing code as an intrinsic part of the "yield" operation 2009-04-22 03:22 but it's critical for handling the priority lending code currently in -rt 2009-04-22 03:22 correctly 2009-04-22 03:22 kind of hard to explain 2009-04-22 03:22 if you're not an -rt hacker 2009-04-22 03:22 yes 2009-04-22 03:22 I'm not so familiar of real time stuff 2009-04-22 03:23 yeah, it's my claim to pseudo fame in the community 2009-04-22 03:23 but nobody cares 2009-04-22 03:23 except for Novell and others in the valley that I can use this as a resume piece 2009-04-22 06:19 -!- flips(~daniel@phunq.net) has joined #tux3 2009-04-22 06:19 good morning 2009-04-22 07:38 -!- pgquiles(~pgquiles@19.Red-88-25-134.staticIP.rima-tde.net) has joined #tux3 2009-04-22 08:35 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-22 10:02 hi tim_dimm 2009-04-22 10:03 flips, hi 2009-04-22 10:03 how was the early morning wakeup call? 2009-04-22 10:04 it was early 2009-04-22 10:04 heh 2009-04-22 10:25 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-04-22 10:30 -!- pgquiles__(~pgquiles@19.Red-88-25-134.staticIP.rima-tde.net) has joined #tux3 2009-04-22 12:40 -!- dcg(~dcg@61.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-22 12:54 hey flips 2009-04-22 12:54 wondered what happened to you 2009-04-22 12:55 he's around- tied up at the moment 2009-04-22 12:55 how's things with you bh? 
2009-04-22 13:19 ok 2009-04-22 13:19 working on stuff that's killing me 2009-04-22 13:19 heh 2009-04-22 13:19 or I'm too distracted to make a lot more forward progress 2009-04-22 13:20 it's been weeks now 2009-04-22 13:20 complexity or volume 2009-04-22 13:20 complexity 2009-04-22 13:20 this for work? 2009-04-22 13:21 i have to come up with a relatively novel solution to a screwed up problem that didn't know existed until I ran into a conceptual problem with my code 2009-04-22 13:21 no, I was laid-off in November 2009-04-22 13:21 it's a pet project of mine since 2002 2009-04-22 13:21 i didn't know that 2009-04-22 13:21 I was laid off in oct 2009-04-22 13:22 yeah, it happens. I Novell was laying off folks last year 2009-04-22 13:22 it's been continugin 2009-04-22 13:22 continuing 2009-04-22 13:22 rough times 2009-04-22 13:22 laid off isn't the right term- consulting contract wasn't extended 2009-04-22 13:22 -!- samlh(~sam@67.129.121.145) has joined #tux3 2009-04-22 13:22 u r in san diego, right? 2009-04-22 13:23 contiguous tough times 2009-04-22 13:23 so what's the pet project about? 2009-04-22 13:24 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-04-22 13:25 yeah 2009-04-22 13:25 there's hiring freezes across the entire industry at the moment 2009-04-22 13:25 that might change later this year or next 2009-04-22 13:25 it's a scheduler project 2009-04-22 13:25 with the -rt patch but it's kicking my ass 2009-04-22 13:27 bbiaf- modem needs a reboot 2009-04-22 13:33 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-22 13:35 i think verizon is worse than time warner cable 2009-04-22 13:37 hey bh 2009-04-22 14:10 hey flips 2009-04-22 14:11 how's it going ? were you on vacation ? 
2009-04-22 14:34 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-22 15:19 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-22 15:23 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-22 15:24 bh, something like that 2009-04-22 15:51 welcome back 2009-04-22 17:10 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-22 19:32 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-22 20:58 -!- ajonat(~ajonat@190.48.104.235) has joined #tux3 2009-04-22 22:19 -!- RazvanM(~RazvanM@96.234.232.189) has joined #tux3 2009-04-22 23:40 -!- konrad(~konrad@kas.cs.washington.edu) has joined #tux3 2009-04-22 23:58 -!- konrad(~konrad@kas.cs.washington.edu) has joined #tux3 2009-04-23 05:36 -!- ajonat(~ajonat@190.48.104.235) has joined #tux3 2009-04-23 06:08 -!- ajonat_(~ajonat@190.48.114.150) has joined #tux3 2009-04-23 06:46 -!- flips(~daniel@phunq.net) has joined #tux3 2009-04-23 06:46 wow, I had no idea things had gotten so bad for windows users 2009-04-23 06:47 now your windows box will suddently reboot in the middle of whatever you are doing 2009-04-23 06:47 and this is _by design_ 2009-04-23 06:49 install driver or kernel component? 2009-04-23 06:50 I thought it was improved nt series 2009-04-23 06:50 at NT series 2009-04-23 06:50 probably updating a virus hole 2009-04-23 06:50 it didn't tell me 2009-04-23 06:50 automatic update, mandatory reboot 2009-04-23 06:50 virus scanner? 
2009-04-23 06:51 seems to be standard windows handling of updates from microsoft 2009-04-23 06:51 i see 2009-04-23 06:51 keeps popping up a nag asking if you want to reboot now or later 2009-04-23 06:51 you can't say 'never' 2009-04-23 06:51 ah 2009-04-23 06:51 well yea, cuz 'never' wouldnt fix anything 2009-04-23 06:51 and then if you just leave it, maybe because you are watching a video, then it will go ahead and reboot 2009-04-23 06:52 and it _forgets everything you were doing_ 2009-04-23 06:52 I would say the "linux experience" is much better than the windows experience at this point 2009-04-23 06:52 yes 2009-04-23 06:53 you gotta understand that giving options to idiots means it will never get done 2009-04-23 06:53 you gotta force feed security, or it just wont happen 2009-04-23 06:53 maybe windows developer is lazy to implement to ondemand load 2009-04-23 06:54 in windows, when a program is executing, you cannot 'replace' it's executable file 2009-04-23 06:54 thus upgrades must happen when the programs arent executing, thus forced reboot 2009-04-23 06:54 no kill -HUP `pidof proggy` 2009-04-23 06:55 yes 2009-04-23 06:56 windows doesn't support to delete the opening file 2009-04-23 06:56 ophaned file is nothing 2009-04-23 07:29 marcin, still, having your computer suddenly reboot has to be annoying, even for an idiot 2009-04-23 07:29 and it doesn't even try to remember what you were doing 2009-04-23 07:38 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-04-23 07:51 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-23 07:58 my favorite is when i have like 2gb of data in matlab swapped out, and then it goes to reboot, but it takes too long to shut it down peacefully (cuz it's swapped out) so it just kills it hard, no questions asked 2009-04-23 08:04 sweet 2009-04-23 08:04 ACTION gently points out to marcin that matlab runs on linux 2009-04-23 08:08 -!- 
pgquiles__(~pgquiles@19.Red-88-25-134.staticIP.rima-tde.net) has joined #tux3 2009-04-23 08:11 i have that too, but i like to prototype in excel then convert to matlab 2009-04-23 08:11 and open office calc i must confess, sucks hard 2009-04-23 08:11 so... you're enslaved by excel macros 2009-04-23 08:12 maybe gnumeric has usable macros 2009-04-23 08:12 i'ts not the macros 2009-04-23 08:12 it's the nice colors? 2009-04-23 08:13 its the user experience ;-) 2009-04-23 08:13 ok, like the spontaneous reboots :) 2009-04-23 08:13 trendlines, graphing, and a nonlinear solver are my top 3 2009-04-23 08:13 oo finally got a nonlinear solver like a month ago as a plugin, and it's quite nice i must admit 2009-04-23 08:14 i've been beating on it the other day and it hasnt crashed once 2009-04-23 08:14 unlike excel which i can crash on command 2009-04-23 08:14 you must be losing your touch 2009-04-23 08:14 my crashing touch? :) 2009-04-23 08:14 that would be the one 2009-04-23 08:15 nah, it's under control, otherwise i wouldnt be able to go to the loo ;) 2009-04-23 08:15 oh that's what it's for 2009-04-23 08:15 ACTION has been doing it all wrong 2009-04-23 08:16 so i have a virtualization/io scheduler question if you got few mins... 2009-04-23 08:19 if you have a guest inside of a host, or better yet multiple guests, all their io requests are scheduled internally, but then they have to get scheduled again by the host's scheduler--would it be better to just switch all guests' schedulers to the dumbest one possible, and let the host group it up from a 'god's eye' view? 2009-04-23 08:19 I'm not sure the guest will submit enough requests for that to be feasible 2009-04-23 08:20 why wouldnt it? 2009-04-23 08:20 Because it expects the host to act like a normal block device and only work with $smallnumber of requests ;) 2009-04-23 08:24 yea but ultimately wouldnt the host's IO have to pick it up? 
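A toy sketch of the question above about letting the guests stay dumb and having the host sort with a "god's eye" view. Everything here is an assumption made for illustration: the two guest request streams and their sector numbers are invented, and total head travel is only a crude stand-in for real I/O cost. It is not how any Linux scheduler is implemented.

```python
def head_travel(sectors, start=0):
    """Total distance the head moves when sectors are served in the given order."""
    travel, pos = 0, start
    for sector in sectors:
        travel += abs(sector - pos)
        pos = sector
    return travel

# Hypothetical request streams from two guests sharing one disk.
guest_a = [10, 900, 20, 910]
guest_b = [500, 30, 510, 40]

# Guests pass requests through unsorted, so the host sees them interleaved.
arrival = [sector for pair in zip(guest_a, guest_b) for sector in pair]

# Host serves them exactly as they arrive.
fifo_travel = head_travel(arrival)

# Host sorts the combined queue once before serving it.
sorted_travel = head_travel(sorted(arrival))

print("served in arrival order:", fifo_travel)
print("sorted once by the host:", sorted_travel)
```

In this made-up case the single sort on the host cuts head travel to roughly a quarter, but only because the host can see all eight requests at once.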
2009-04-23 08:25 the host ultimately has to convert the addresses of the virtual hdd to some real hardware, so i dont think you can dodge it 2009-04-23 08:29 Right, but the point is, if you give the host 8 requests when you really have 200 to play with, you might end up a bit less efficient if you pick those 8 randomly 2009-04-23 08:59 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-23 09:27 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-23 09:31 marcin, sorry, missed your class 4 question there 2009-04-23 09:32 marcin, you're right 2009-04-23 09:32 well, the scheduler is actually determined by the block device driver 2009-04-23 09:32 so if its a virtual driver, it really has no business using an elevator type scheduler 2009-04-23 09:33 it should just use "dumb" 2009-04-23 09:33 actually, it should use something that just merges requests without changing the order much 2009-04-23 09:33 which I think is a little smarter than " 2009-04-23 09:33 "dumb" 2009-04-23 09:33 a scheduler we may not actually have on linux 2009-04-23 09:34 dumb will work ok though 2009-04-23 09:34 and something like cfq will just waste cpu cycles 2009-04-23 09:35 bd_, these days we throw huge numbers of requests at block devices 2009-04-23 10:38 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-04-23 12:06 hey 2009-04-23 13:13 flips: that stuff will need a lot of work 2009-04-23 13:17 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-04-23 13:18 indeed 2009-04-23 13:53 most of the Linux kernel isn't oriented towards filer work of any serious sort 2009-04-23 13:53 it'll be a challenge 2009-04-23 13:53 flips: how's it going ? atomic commits working yet ? 
:) 2009-04-23 13:56 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-04-23 14:18 -!- pgquiles__(~pgquiles@19.Red-88-25-134.staticIP.rima-tde.net) has joined #tux3 2009-04-23 15:16 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-23 17:12 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-04-23 17:27 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-23 17:34 -!- RazvanM_(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-23 20:14 -!- ajonat(~ajonat@190.48.123.203) has joined #tux3 2009-04-23 23:53 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-04-24 00:50 -!- RazvanM(~RazvanM@96.234.232.189) has joined #tux3 2009-04-24 02:30 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-04-24 03:38 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-04-24 03:49 -!- pgquiles__(~pgquiles@19.Red-88-25-134.staticIP.rima-tde.net) has joined #tux3 2009-04-24 05:03 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-24 07:33 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-24 07:51 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-24 09:34 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-24 12:00 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-04-24 13:24 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-24 15:46 bh: What do you mean by "filer work"? 2009-04-24 15:54 hey 2009-04-24 15:54 long time no see 2009-04-24 15:54 heavy duty NFS server work and stuff like that 2009-04-24 15:55 filers is a term at NetApp that we referred to the box that did all things related to WAFL 2009-04-24 15:55 our boxes were all called "filers" 2009-04-24 15:58 Neato. 
2009-04-24 16:20 -!- ajonat(~ajonat@190.48.103.140) has joined #tux3 2009-04-24 16:25 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-24 23:46 -!- RazvanM(~RazvanM@96.234.232.189) has joined #tux3 2009-04-25 10:05 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-25 11:21 -!- marcin__(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-04-25 13:26 -!- pgquiles(~pgquiles@111.Red-81-44-156.dynamicIP.rima-tde.net) has joined #tux3 2009-04-25 14:35 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-25 15:17 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-25 19:04 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-25 19:59 -!- gebi_(~gebi@84-119-43-134.dynamic.xdsl-line.inode.at) has joined #tux3 2009-04-25 20:14 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-25 23:35 -!- RazvanM(~RazvanM@96.234.232.189) has joined #tux3 2009-04-26 07:33 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-26 08:32 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-26 09:45 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-26 10:04 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-26 10:15 -!- dcg(~dcg@69.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-26 13:11 -!- pgquiles(~pgquiles@66.Red-79-147-232.dynamicIP.rima-tde.net) has joined #tux3 2009-04-26 14:30 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-26 14:51 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-26 14:54 -!- dcg(~dcg@69.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-26 15:08 -!- 
tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-26 17:53 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-26 18:23 -!- samlh(~sam@67.129.121.145) has joined #tux3 2009-04-26 20:40 -!- ajonat(~ajonat@190.48.92.191) has joined #tux3 2009-04-26 22:49 -!- rayvd(rayvd@arthur.bludgeon.org) has joined #tux3 2009-04-26 23:25 -!- ajonat_(~ajonat@190.48.124.130) has joined #tux3 2009-04-27 00:45 hey flips 2009-04-27 00:45 quiet channel 2009-04-27 00:45 ACTION wanted to see if anybody is awake 2009-04-27 02:02 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-04-27 02:05 -!- pgquiles__(~pgquiles@66.Red-79-147-232.dynamicIP.rima-tde.net) has joined #tux3 2009-04-27 02:37 -!- dagle(~dagle@host162-104.bornet.net) has joined #tux3 2009-04-27 06:35 -!- dcg(~dcg@239.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-27 06:43 -!- dcg(~dcg@239.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-27 07:23 -!- samlh_(~sam@67.129.121.145) has joined #tux3 2009-04-27 07:36 -!- dcg_(~dcg@80.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-27 07:53 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-27 08:04 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-04-27 08:04 good morning mingming 2009-04-27 08:04 and tim_dimm 2009-04-27 08:04 good morning flips 2009-04-27 08:20 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-27 09:25 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-27 09:42 -!- mingming(~mingming@32.97.110.51) has joined #tux3 2009-04-27 12:41 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-27 12:41 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-27 12:56 -!- dcg__(~dcg@251.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-27 
13:18 -!- dcg__(~dcg@251.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-27 13:58 -!- dcg__(~dcg@251.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-27 14:02 -!- dcg(~dcg@7.pool80-103-2.dynamic.orange.es) has joined #tux3 2009-04-27 14:30 -!- dcg(~dcg@7.pool80-103-2.dynamic.orange.es) has joined #tux3 2009-04-27 14:30 -!- pythonstar(~kavli@c-affbe455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-04-27 15:20 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-27 16:21 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-04-27 16:58 -!- ajonat(~ajonat@190.48.124.130) has joined #tux3 2009-04-27 22:23 hey flips 2009-04-27 23:40 -!- RazvanM(~RazvanM@96.234.232.189) has joined #tux3 2009-04-27 23:57 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-04-28 01:06 -!- ajonat(~ajonat@190.48.126.109) has joined #tux3 2009-04-28 03:50 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-28 04:26 -!- Mark__T(~Mark__T@jabber.freenet.de) has joined #tux3 2009-04-28 04:35 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-28 07:33 -!- telmich(telmich@tee.schottelius.org) has left #tux3 2009-04-28 07:46 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-28 07:46 -!- Mark__T(~Mark__T@jabber.freenet.de) has left #tux3 2009-04-28 08:03 -!- dcg(~dcg@2.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-28 08:07 -!- mingming(~mingming@c-24-22-112-92.hsd1.or.comcast.net) has joined #tux3 2009-04-28 08:12 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-28 08:43 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-04-28 09:00 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-28 09:13 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-28 09:30 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 
2009-04-28 09:41 -!- pgquiles(~pgquiles@66.Red-79-147-232.dynamicIP.rima-tde.net) has joined #tux3 2009-04-28 10:03 -!- dagle(~dagle@host162-104.bornet.net) has joined #tux3 2009-04-28 10:07 -!- data(~data@84.19.190.213) has joined #tux3 2009-04-28 10:10 -!- gebi(~gebi@84-119-43-134.dynamic.xdsl-line.inode.at) has joined #tux3 2009-04-28 10:35 -!- pythonstar(~kavli@c-affbe455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-04-28 10:57 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-28 11:22 -!- dcg_(~dcg@203.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-28 11:35 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-04-28 12:48 -!- gebi_(~gebi@84-119-54-245.dynamic.xdsl-line.inode.at) has joined #tux3 2009-04-28 15:40 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-28 17:57 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-28 18:16 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-04-28 18:20 cabal convenes in 10 minutes 2009-04-28 18:20 maze, how fast can you get down here? 2009-04-28 18:21 lol 2009-04-28 18:21 theoretically possible since it's less than a light second 2009-04-28 18:55 -!- cdk(~chinmay@115.109.8.211) has joined #tux3 2009-04-28 21:09 too bad accelerating up to the speed of light or near it takes a while ;) 2009-04-28 21:09 especially if you want to live 2009-04-28 21:16 I got my ps3 back after giving $160 to fix a manufacturing flaw in the blu ray drive, it comes back after two weeks, and now doesn't work at all 2009-04-28 21:17 to say I am livid would be an understatement 2009-04-28 21:17 anybody up for a class action against sony? 
2009-04-28 21:21 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-28 21:23 so the question of the hour 2009-04-28 21:24 is how fast can you get from SF to LA 2009-04-28 21:24 and survive 2009-04-28 21:24 hah 2009-04-28 21:24 according to my calculation is just about 2.3 minutes, accleration constantly at 5g 2009-04-28 21:25 i thought you were talking about your bike 2009-04-28 21:25 shapor: Is that considering deceleration? 2009-04-28 21:25 thats assuming you don't care about stopping when you get there 2009-04-28 21:25 and a distance of 300 miles 2009-04-28 21:25 yeah hopefully you dont want to spend much time in LA 2009-04-28 21:25 because you'll be doing over 15000 mph 2009-04-28 21:26 stopping makes it take a *lot* longer 2009-04-28 21:27 stopping. you could pull 8 g's 2009-04-28 21:27 thats a good quesiton, the orientation of your body matters alot 2009-04-28 21:27 on how many g's you can pull 2009-04-28 21:27 5g is a pretty good number for a healthy person 2009-04-28 21:28 g suit 2009-04-28 21:28 keeps all your blood in all the right places 2009-04-28 21:29 interesting proposition, btw 2009-04-28 21:29 it was on topic because flips was inviting maze to get to la in less than 10 minutes 2009-04-28 21:29 i was curious if it was possible so i started calculating 2009-04-28 21:32 The human body is better at surviving g-forces that are perpendicular to the spine. In general when the acceleration is forwards, so that the g-force pushes the body backwards (colloquially known as "eyeballs in"[8]) a much higher tolerance is shown than when the acceleration is backwards, and the g-force is pushing the body forwards ("eyeballs out") since blood vessels in the retina appear more sensitive in the latter direction. 
2009-04-28 21:32 Early experiments showed that untrained humans were able to tolerate 17 g eyeballs-in (compared to 12 g eyeballs-out) for several minutes without loss of consciousness or apparent long-term harm.[9] 2009-04-28 21:32 wow thats a lot of g's 2009-04-28 21:33 i've pulled 6 in an acrobatic stunt plane 2009-04-28 21:33 couldn't hold my head up in the loops unless I had it squarely balanced 2009-04-28 21:34 can't imaging what 12 or 17 would feel like 2009-04-28 21:34 like crossing an event horizon 2009-04-28 21:35 so if you have to stop, 12 g's will get you there in right around the 2 minute mark 2009-04-28 21:35 can't imagine that either 2009-04-28 21:35 holy shit 2009-04-28 21:35 unfortunately you'll have to be standing up, 2009-04-28 21:35 which makes you displace a ton of air 2009-04-28 21:36 you'd overheat from the friction 2009-04-28 21:36 well i'm assuming you're in some kind of vehicle 2009-04-28 21:37 not exposed to the air 2009-04-28 21:38 you'll hit 16835 mi/hr 2009-04-28 21:38 at the half way point 2009-04-28 21:38 parabolic in acceleration 2009-04-28 21:38 yeah, this is assuming constant acceleration, 12g forward, the -12g 2009-04-28 21:38 then* 2009-04-28 21:39 you could possibly do it at closer to 17g if you turned around half way 2009-04-28 21:39 inside 2009-04-28 21:39 there is acceleration to acceleration. 
you can't go from zero to 12g instantaneously 2009-04-28 21:39 so you wouldnt get hit with the 17g "eyes out" 2009-04-28 21:40 sure you can 2009-04-28 21:40 you'd have to flip 180 on re-entry 2009-04-28 21:40 i'm not talking about going up 2009-04-28 21:40 "straight line" 2009-04-28 21:40 that makes it more than 300 mi 2009-04-28 21:40 you'd get compression at high g acceleration 2009-04-28 21:41 you can't go from 0 to 100 mph instantly, that would be infinite acceleration 2009-04-28 21:41 you'd be shorter 2009-04-28 21:41 but you certainly can go from +12g to -12g instance 2009-04-28 21:41 instantly 2009-04-28 21:41 you'd burst 2009-04-28 21:41 no 2009-04-28 21:41 i dont think so 2009-04-28 21:41 would be like jumping 2009-04-28 21:41 acceleration is just a rate of change remember 2009-04-28 21:42 shapor: Yes, but acceleration isn't applied uniformly to the body 2009-04-28 21:42 so your velocity immediately before and immediately after would be the same 2009-04-28 21:42 could certainly snap a bone if you were standing funny 2009-04-28 21:42 Changing the acceleration = changing force = pressure wave going through the body 2009-04-28 21:42 how many g's would it take to burst a cell? 2009-04-28 21:42 tim_dimm: i think quite a lot 2009-04-28 21:42 bd_: you're right about that 2009-04-28 21:43 but i think it would be relatively insignificant 2009-04-28 21:43 100? 
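The travel times quoted in this exchange can be checked with constant-acceleration kinematics. The sketch below is only a rough model: it assumes a straight 300-mile path, standard gravity, an instantaneous switch from acceleration to braking, and no air resistance. The function names are made up for illustration.

```python
import math

G = 9.80665                 # standard gravity, m/s^2
MILE = 1609.344             # metres per mile
MPS_TO_MPH = 3600.0 / MILE  # metres per second -> miles per hour


def one_way_no_stop(distance_m, accel_g):
    """Constant acceleration from rest, no braking: d = a * t^2 / 2."""
    a = accel_g * G
    t = math.sqrt(2.0 * distance_m / a)
    return t, a * t         # (seconds, final speed in m/s)


def accelerate_then_brake(distance_m, accel_g):
    """Accelerate over the first half, brake at the same rate over the second."""
    t_half, v_peak = one_way_no_stop(distance_m / 2.0, accel_g)
    return 2.0 * t_half, v_peak


d = 300 * MILE
t5, v5 = one_way_no_stop(d, 5)
t12, v12 = accelerate_then_brake(d, 12)
print(f"5 g, no stop:    {t5 / 60:.2f} min, arriving at {v5 * MPS_TO_MPH:.0f} mph")
print(f"12 g, with stop: {t12 / 60:.2f} min, peak {v12 * MPS_TO_MPH:.0f} mph")
```

Under these assumptions the results land close to the figures given in the channel (a bit over two minutes either way); small differences come from the exact constants used.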
2009-04-28 21:43 take an extras second to go from +12 to -12 2009-04-28 21:43 we're talking on the order of minutes 2009-04-28 21:43 wonder what the tensile strength of human cells is 2009-04-28 21:44 i'm thinking the ramp up would be more comfy over say a second 2009-04-28 21:44 http://www.plantphysiol.org/cgi/content/abstract/79/2/485 <-- measurements for plant cells 2009-04-28 21:44 but slamming it instant on sounds painful 2009-04-28 21:44 pilots probably do 0g to 5g in milliseconds 2009-04-28 21:44 ms, k, I could handle that over instant 2009-04-28 21:45 guess it depends on what your r strapped into too 2009-04-28 21:45 my point is insignificant compared to the 2 minute trip ;) 2009-04-28 21:45 Remember, hitting a brick wall is just going from 0g to -somethingbig g in an instant :) 2009-04-28 21:45 heh 2009-04-28 21:45 but its the -big g that kills you 2009-04-28 21:45 and unless you apply that force evenly, you will slosh your brains 2009-04-28 21:45 and makes you explose 2009-04-28 21:46 explode 2009-04-28 21:46 not the rate-of-change-of-rate-of-change 2009-04-28 21:46 shapor: No, it's the shock when one part of you is decelerating at -big g, and another part is in constant motion 2009-04-28 21:46 hm yeah 2009-04-28 21:46 who's head exploded in Total Recall? 2009-04-28 21:47 the one that was tossed 2009-04-28 21:47 100 and 95 atmospheres 2009-04-28 21:48 i've experienced around 3.5g's 2009-04-28 21:48 0 to 3.5 g's in a very short amount of time 2009-04-28 21:48 karts do 2.5g's 2009-04-28 21:49 tim_dimm: yeah and you do 2.5g in one direction to 2.5g in another direction 2009-04-28 21:49 bouncing the rear axle produces enough g's to break your ribs 2009-04-28 21:49 also i'm talking off the line in a fast drag car 2009-04-28 21:49 tim_dimm: thats your ribs hitting on the seat though right? 
2009-04-28 21:49 y 2009-04-28 21:49 top fuel cars do 8+ g's off the line 2009-04-28 21:50 they *average* over 4'gs 2009-04-28 21:50 thats sick 2009-04-28 21:50 wonder how long it will be before electric dragster beats top fuel 2009-04-28 21:50 hard to beat the power-to-weight ratio of a top fuel motor 2009-04-28 21:51 esp if you have to lug batteries 2009-04-28 21:51 maybe if you jettison them along the way? 2009-04-28 21:51 might make a mess of the track ;) 2009-04-28 21:52 i think top fuel guys pull their chutes under power on some tracks 2009-04-28 21:52 thats +4g to -4g 2009-04-28 21:52 rate of change of acceleration does not matter at all though 2009-04-28 21:53 it means nothing 2009-04-28 21:54 http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V5S-4RGFR7H-1&_user=10&_rdoc=1&_fmt=&_orig=search&_sort=d&view=c&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=8607fad444c9feb6727dc4b238936a38 2009-04-28 21:54 hrm 2009-04-28 21:55 well i guess it does since bodies aren't rigid members 2009-04-28 21:55 'jerk' 2009-04-28 22:04 -!- RazvanM(~RazvanM@96.234.232.189) has joined #tux3 2009-04-28 22:06 hm 41 million horsepower required just to compact air resistance at the top speed of a perfectly streamlined object (Cd = 0.04), plus whatever will be required to accelerate 2009-04-28 22:06 s/compact/combat/ 2009-04-28 22:06 damn velocity cubed :) 2009-04-28 23:00 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-29 00:18 -!- Mark__T(~Mark__T@jabber.freenet.de) has joined #tux3 2009-04-29 00:24 folks 2009-04-29 00:24 good to see come channel activity here 2009-04-29 00:24 come=some 2009-04-29 00:24 haha, other than being totally off-topic :) 2009-04-29 00:32 let's bring it on topic. 
Any news on tux3, I didn't see much activity in ml/irc/git/hg recently 2009-04-29 00:38 Mark__T: yeah its been a bit quiet recently, flips has been a bit busy with other things, but I suspect activity will pick back up this weekend 2009-04-29 00:42 wow I just ran a du -ks in the root of my machine 2009-04-29 00:42 load is normally .5-1 2009-04-29 00:42 shot up to over 100 2009-04-29 00:42 almost completely unresponsive 2009-04-29 00:42 (not related to tux3, just a plain debian box) 2009-04-29 00:44 248320 216099 87% 0.77K 49664 5 198656K ext3_inode_cache 2009-04-29 00:49 ugh, atimes 2009-04-29 00:49 ugh we have to make sure we don't suck at that 2009-04-29 00:55 hi 2009-04-29 00:55 btw, relatime is on by default in current linus git 2009-04-29 01:49 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-04-29 06:16 oh man, there was a physics discussion and i missed it? booo... 2009-04-29 06:17 -!- dcg(~dcg@29.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-29 06:26 -!- dcg_(~dcg@60.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-29 06:35 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-04-29 07:59 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-29 08:13 -!- Mark__T(~Mark__T@jabber.freenet.de) has left #tux3 2009-04-29 09:14 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-29 10:47 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-04-29 11:36 -!- dcg(~dcg@1.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-04-29 12:12 -!- dcg(~dcg@40.pool80-103-2.dynamic.orange.es) has joined #tux3 2009-04-29 12:42 -!- bh_(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-04-29 15:22 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-04-29 16:26 -!- dcg(~dcg@40.pool80-103-2.dynamic.orange.es) has joined #tux3 2009-04-29 19:14 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-29 20:16 -!- 
ed__(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-29 22:10 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-04-29 22:37 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-30 03:21 -!- Mark__T(~Mark__T@twitter.freenet-rz.de) has joined #tux3 2009-04-30 03:29 -!- dagle(~dagle@host162-104.bornet.net) has joined #tux3 2009-04-30 05:00 -!- Mark__T(~Mark__T@twitter.freenet-rz.de) has left #tux3 2009-04-30 06:23 -!- flips(~daniel@phunq.net) has joined #tux3 2009-04-30 06:23 konishiwa hirofumi 2009-04-30 06:23 hi 2009-04-30 06:24 I am running on a different schedule these days 2009-04-30 06:24 wake up about the time I used to go to sleep 2009-04-30 06:26 oh, why? 2009-04-30 06:26 are you on different timezone? 2009-04-30 06:27 on new york time, but still livining in santa monica 2009-04-30 06:28 today I must analyze the verioned pointers analysis 2009-04-30 06:28 ok 2009-04-30 06:28 I was thinking about atomic commit recently 2009-04-30 06:28 http://mailman.tux3.org/pipermail/tux3/attachments/20090411/fcfa03ed/attachment-0001.pdf 2009-04-30 06:28 ah, and? 2009-04-30 06:29 and my brain became clear more or less 2009-04-30 06:29 is it as we had discussed? 2009-04-30 06:29 probably 2009-04-30 06:30 however, it would be a bit different 2009-04-30 06:30 I am interested 2009-04-30 06:30 frontend of current codes is modifying the btree and such metadata 2009-04-30 06:30 e.g. 
truncate 2009-04-30 06:31 yes 2009-04-30 06:31 it means frontend is conflicting the flush_log() of backend 2009-04-30 06:31 frontend/backend separation is expected to be less than perfect at first 2009-04-30 06:31 conflict means it blocks other 2009-04-30 06:31 yes 2009-04-30 06:32 so the resolution is to wait for each flush to complete, at first 2009-04-30 06:32 which is done by taking the delta write lock 2009-04-30 06:32 I was thinking in future, can we really do it, and how to do 2009-04-30 06:32 front end truncate does not need to directly modify the btree 2009-04-30 06:33 yes, first version will blocks frontend at the end of backend 2009-04-30 06:33 just change the size in the cached inode 2009-04-30 06:33 however, it need to handle the hole 2009-04-30 06:34 front end has to read the btree to know if there is a hole 2009-04-30 06:34 if it changes the size, 10000 -> 500 -> 10000 2009-04-30 06:34 yes 2009-04-30 06:34 so, frontend/backend separation is only for write, not read 2009-04-30 06:34 so, if truncate didn't change the btree, it can't know the hole 2009-04-30 06:34 it can't? 2009-04-30 06:35 yes 2009-04-30 06:35 true 2009-04-30 06:35 good observation 2009-04-30 06:35 we might need to introduce a new concept there 2009-04-30 06:35 to remember the "planned" truncate 2009-04-30 06:35 yes, exactly 2009-04-30 06:35 that has not been written into the btree yet 2009-04-30 06:36 it is what I was thinking 2009-04-30 06:36 it seems like a good idea 2009-04-30 06:36 that way, we have a very nice deferred truncate 2009-04-30 06:36 yes 2009-04-30 06:36 that prevents some traditional stalls 2009-04-30 06:36 with it, frontend and backend are separated 2009-04-30 06:37 however, delta and flush are not separated though 2009-04-30 06:37 it shouldn't be performance problem 2009-04-30 06:37 btree or bitmap flush? 
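The "planned truncate" idea discussed here can be modelled in a few lines. This is a toy sketch in Python, not tux3 code: the class, its fields, and the block-granular sizes are all assumptions made for illustration. The frontend only records the lowest size the file was cut to since the last flush, and uses that to answer "is this a hole?" without modifying the btree; the backend applies the truncate later.

```python
class CachedInode:
    """Frontend view of one file; sizes are in blocks for simplicity."""

    def __init__(self, btree):
        self.btree = btree              # block index -> data, owned by the backend
        self.size = (max(btree) + 1) if btree else 0
        self.planned_truncate = None    # lowest size seen since the last flush
        self.dirty = {}                 # blocks written since the last flush

    def truncate(self, new_size):
        # Frontend only: record the intent, never touch the btree.
        if new_size < self.size:
            if self.planned_truncate is None or new_size < self.planned_truncate:
                self.planned_truncate = new_size
            # Dirty blocks beyond the new end of file are simply dropped.
            self.dirty = {b: d for b, d in self.dirty.items() if b < new_size}
        self.size = new_size

    def write(self, block, data):
        self.dirty[block] = data
        self.size = max(self.size, block + 1)

    def read(self, block):
        if block >= self.size:
            raise IndexError("read past end of file")
        if block in self.dirty:
            return self.dirty[block]
        cut = self.planned_truncate
        if cut is not None and block >= cut:
            return None                 # hole: stale btree data must not show
        return self.btree.get(block)    # None also means a hole

    def flush(self):
        # Backend: apply the deferred truncate, then the dirty blocks.
        cut = self.planned_truncate
        if cut is not None:
            for b in [b for b in self.btree if b >= cut]:
                del self.btree[b]
        self.btree.update(self.dirty)
        self.planned_truncate = None
        self.dirty = {}


# The 10000 -> 500 -> 10000 case from the discussion, scaled down to blocks.
inode = CachedInode({b: f"old{b}" for b in range(10)})
inode.truncate(2)
inode.truncate(10)
print(inode.read(1), inode.read(5), len(inode.btree))
```

In this model the btree still holds all ten old blocks after the two truncates, yet block 5 already reads as a hole; only `flush()` removes the stale entries.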
2009-04-30 06:37 oh 2009-04-30 06:37 right 2009-04-30 06:37 yes 2009-04-30 06:37 it will be serialize with stage_delta() 2009-04-30 06:37 I think, let's implement it and then analyze the backend bottlenecks 2009-04-30 06:38 yes 2009-04-30 06:38 so, next is, I'm on implement phase 2009-04-30 06:38 I guess it can implement 2009-04-30 06:38 first is dump though 2009-04-30 06:39 dump? 2009-04-30 06:39 um.. 2009-04-30 06:39 easy one 2009-04-30 06:39 but, performance is not good 2009-04-30 06:39 ah, serialize all 2009-04-30 06:40 I mean, it will implement incrementally 2009-04-30 06:40 first one is blocks frontend version 2009-04-30 06:40 yes, first task is to generate correct log blocks and serialize at the right place 2009-04-30 06:40 and improve it incrementlly 2009-04-30 06:41 yes 2009-04-30 06:41 and to bypass vfs flush 2009-04-30 06:41 yes 2009-04-30 06:41 actually, bypass vfs flush would be a good first step 2009-04-30 06:41 and implement ->sync_fs 2009-04-30 06:41 well, I'm thinking first one is userland 2009-04-30 06:41 is that the right method name? 2009-04-30 06:41 also good 2009-04-30 06:42 the logging and replay code is now completely shared by userland and kernel 2009-04-30 06:42 there is some #ifdef though 2009-04-30 06:42 so, we could do #ifdef ATOMIC -> bypass vfs flush 2009-04-30 06:43 probably 2009-04-30 06:43 well, I'm thinking userland -> kernel porting is not hard 2009-04-30 06:43 so, the easiest way to bypass vfs flush is to mark inodes dirty, but move them off the dirty list 2009-04-30 06:43 is there another method? 2009-04-30 06:44 or redirty it 2009-04-30 06:44 in ->writepage? 
2009-04-30 06:45 or add new method 2009-04-30 06:45 also a possibility 2009-04-30 06:45 I like the "fake dirty inode" method best 2009-04-30 06:45 it is used in a number of places in kernel already 2009-04-30 06:46 I think, we should use it, then we should ask akpm to make it an "official" api 2009-04-30 06:46 two nice things about it: 1) it is very efficient 2) no change to core kernel needed 2009-04-30 06:46 anyway, I agree that the kernel issues are nicely separated from userland 2009-04-30 06:46 yes 2009-04-30 06:47 and that working on logging in userland is the right thing to do 2009-04-30 06:47 while I am thinking about the kernel implementation of flush... 2009-04-30 06:48 kernel does a pretty good job of deciding which file inodes to flush, I think we should let the current code continue to make those decisions 2009-04-30 06:48 let the kernel decide which data inode to flush, but not to do the actual flush 2009-04-30 06:49 now... how to do that 2009-04-30 06:49 a question for later 2009-04-30 06:49 in the case of non-data inodes, vfs does not know which is the right one to flush 2009-04-30 06:49 it is for normal files? 2009-04-30 06:49 yes, normal files 2009-04-30 06:50 vm flushing code cycles through dirty inodes in a reasonable order 2009-04-30 06:50 and does partial flushes of dirty data inodes in a reasonable way 2009-04-30 06:50 yes 2009-04-30 06:50 sync strategy is not good though 2009-04-30 06:51 :) 2009-04-30 06:51 yes, we must look at that carefully 2009-04-30 06:51 for now, we will have fsync = full fs sync 2009-04-30 06:51 yes, probably 2009-04-30 06:53 or file data might be ignore at first version 2009-04-30 06:53 don't care any order of file data 2009-04-30 06:54 well, anyway, it would be not so big problem to implement 2009-04-30 06:55 right, posix allows fsync to be a noop I think 2009-04-30 06:55 but... 
posix is not a popular idea these days ;) 2009-04-30 06:56 :) 2009-04-30 06:58 -!- dcg(~dcg@34.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-30 07:28 -!- dcg(~dcg@34.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-30 07:28 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-04-30 07:28 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-04-30 07:28 -!- gebi_(~gebi@84-119-54-245.dynamic.xdsl-line.inode.at) has joined #tux3 2009-04-30 07:28 -!- data(~data@84.19.190.213) has joined #tux3 2009-04-30 07:28 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-04-30 07:28 -!- Chip_M(stefanc@apollo.orakel.ntnu.no) has joined #tux3 2009-04-30 07:28 -!- persson_(persson@nescafe.bsnet.se) has joined #tux3 2009-04-30 07:54 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-30 08:50 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-30 10:10 hirofumi, did you see this? http://www.linux-watch.com/news/NS3228322313.html 2009-04-30 10:11 no, I don't see it 2009-04-30 10:11 however, now reading :) 2009-04-30 10:11 well, now you have :-) 2009-04-30 10:17 yes, alternative fs is really good 2009-04-30 10:17 unfortunately, main target of many company is windows 2009-04-30 10:18 so, those hope that fs is supported by default 2009-04-30 10:19 I think it is why FAT is used 2009-04-30 10:20 if we can develops open standard fs for many plathome, it's really good 2009-04-30 10:21 exfat is also not good as replacement 2009-04-30 10:21 thought you'd like that article 2009-04-30 10:22 however, SD forum decided to use exfat for next-gen SD spec 2009-04-30 10:22 saw you and flips had a good chat this morning 2009-04-30 10:22 yes 2009-04-30 10:22 have you got some thoughts about the sync strategy? 
2009-04-30 10:23 no, I'm not so thinking very much yet 2009-04-30 10:23 heh 2009-04-30 10:23 however, I thought about atomic commit strategy 2009-04-30 10:24 it means I thought about metadata handling 2009-04-30 10:24 separating the front end from the back end? 2009-04-30 10:24 not file data ordering 2009-04-30 10:24 yes 2009-04-30 10:24 right on 2009-04-30 10:24 and how to implement it 2009-04-30 10:24 outside of vfs 2009-04-30 10:25 yes, more or less 2009-04-30 10:25 metadata will be handled outside of vfs 2009-04-30 10:25 like ext3's jbd 2009-04-30 10:25 file data may use vfs path 2009-04-30 10:26 I'm not sure yet 2009-04-30 10:26 wonder how that will play out when versioning is introduced 2009-04-30 10:27 file data on versioning system? 2009-04-30 10:27 metadata handling? 2009-04-30 10:27 either 2009-04-30 10:27 well, anyway, I'm not sure though, I guess it is not so hard to do 2009-04-30 10:28 will vfs flush ordering interfere with atomic commit of versioned pointers 2009-04-30 10:28 it may not be easy, however, I guess atomic commit basically does not need to change 2009-04-30 10:28 ah 2009-04-30 10:28 i'm above my understanding here, I must admit 2009-04-30 10:29 maybe, file data is handled like directory data 2009-04-30 10:30 um...
2009-04-30 10:30 yes, probably, it's handled by delta 2009-04-30 10:31 maybe, frontend changes the file cache, then backend flushes the dirty data 2009-04-30 10:31 so, vfs flush path would be not used, I guess 2009-04-30 10:32 so then a method for finding/filling the hole will need to be found 2009-04-30 10:32 yes 2009-04-30 10:33 well, if backend blocks the frontend, it's unnecessary 2009-04-30 10:33 however, for performance, I guess it's needed 2009-04-30 10:34 if the backend is waiting on disk seeks, then it sounds like an issue needing resolving 2009-04-30 10:34 if there are no seeks, then its logical to assume its not a significant issue 2009-04-30 10:34 yes, probably 2009-04-30 10:39 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-04-30 12:28 hirofumi, yes, file data is similar to directory data for versioning 2009-04-30 12:28 the main difference is mmap 2009-04-30 12:28 another difference is truncate 2009-04-30 12:28 we do need to version file data in order to do online backup 2009-04-30 12:29 for now we don't version anything 2009-04-30 12:33 flips, Like lvm snapshots that suck less, right? 2009-04-30 12:47 -!- dcg(~dcg@174.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-04-30 13:26 -!- pgquiles(~pgquiles@66.Red-79-147-232.dynamicIP.rima-tde.net) has joined #tux3 2009-04-30 13:47 flips: how long until we get memory backed writable snapshots? :) 2009-04-30 13:58 sejeff, that would be the one 2009-04-30 13:59 npmccallum, just run on a ramdisk? 2009-04-30 14:00 flips: hehe, I mean the ability to create a single snapshot in ram, which you could then merge onto disk 2009-04-30 14:00 If only there was something that gave you your ramback 2009-04-30 14:01 npmccallum, merge onto disk? 2009-04-30 14:01 flips: normal_disk_load(); snapshot2ram() ; heavy_disk_load() ; merge_ram2disk() ; normal_disk_load() 2009-04-30 14:02 basically tmpfs, but a writable snapshot 2009-04-30 14:02 why is that not just buffered IO? 
2009-04-30 14:02 because you lie to apps that do fsync() :) 2009-04-30 14:02 fs data, including snapshot, doesn't get flushed to disk until sync, or memory pressure 2009-04-30 14:03 if an app does fsync, a real fsync better happen 2009-04-30 14:03 I'm talking a sync time in days, not seconds 2009-04-30 14:03 I can see having an admin option to turn it off 2009-04-30 14:03 like, laptop mode for the filesystem? 2009-04-30 14:03 ok, I've got a long running job 2009-04-30 14:04 it will only use 1GB of space 2009-04-30 14:04 but its total writes and reads (if you summed them all together) would be in exabytes 2009-04-30 14:04 I could run it on a normal disk 2009-04-30 14:04 and it would take years to complete 2009-04-30 14:04 or 2009-04-30 14:05 I could run it on tmpfs and then when the job is done, I could copy the files off to real disk 2009-04-30 14:05 tmpfs would complete the job in days not hours 2009-04-30 14:05 s/hours/years/ 2009-04-30 14:06 I'm starting to see what you're driving at 2009-04-30 14:06 now imagine that this long running task doesn't have nice separate data files 2009-04-30 14:07 its actually going to write like crazy to my system files 2009-04-30 14:07 you also imply that you don't have control over the long running task, otherwise you would remove the bogus fsyncs 2009-04-30 14:07 flips: yes, this isn't a real app, I haven't gotten to the *real* why yet 2009-04-30 14:07 the real why would be helpful 2009-04-30 14:08 I'm basically implying something like ramback, which you write a while ago 2009-04-30 14:09 sure 2009-04-30 14:09 but are you asking for something like ramback, but working at the filesystem level? 
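The snapshot2ram / merge_ram2disk sequence proposed above can be pictured with a small model. This is only an illustrative sketch: the `RamSnapshot` class and its counters are invented here, a dict stands in for the disk, and nothing in it reflects how tux3 or ramback actually work. It shows writes landing in memory, fsync being ignored, and a single merge pushing the final state to disk.

```python
class RamSnapshot:
    """Writable snapshot held in memory on top of an on-disk state."""

    def __init__(self, disk):
        self.disk = disk        # path -> contents, stands in for the real disk
        self.ram = {}           # changes made since the snapshot was taken
        self.disk_writes = 0    # counts writes that actually reach the disk

    def write(self, path, data):
        self.ram[path] = data   # the heavy load only ever touches memory

    def read(self, path):
        return self.ram[path] if path in self.ram else self.disk.get(path)

    def fsync(self, path):
        pass                    # the "lie": nothing is forced out to disk

    def merge_ram2disk(self):
        for path, data in self.ram.items():
            self.disk[path] = data
            self.disk_writes += 1
        self.ram.clear()


disk = {"job.dat": "v0"}
snap = RamSnapshot(disk)            # snapshot2ram()
for i in range(1000):               # heavy_disk_load()
    snap.write("job.dat", f"v{i + 1}")
    snap.fsync("job.dat")
before_merge = disk["job.dat"]
snap.merge_ram2disk()               # merge_ram2disk()
print(before_merge, disk["job.dat"], snap.disk_writes)
```

The thousand rewrites collapse into one disk write at merge time, which is the saving being argued for; the cost is that an fsync no longer means the data is durable.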
2009-04-30 14:09 but instead of implemented as a separate block device, implemented as a snapshot 2009-04-30 14:09 it could be implemented at either level 2009-04-30 14:10 though, at the fs level you could more or less always have a restorable state 2009-04-30 14:10 seems to me you just want normal buffered file IO to the snapshot, plus the ability to flush just the snapshot to disk 2009-04-30 14:10 yes 2009-04-30 14:11 but, if you crash (power outage lets say) 2009-04-30 14:11 ok, that should be part of the sync model when we get to versioning 2009-04-30 14:11 because its just a snapshot, the disk has a working copy 2009-04-30 14:11 if it was implemented at the block level, you could never guarantee the state of the disk 2009-04-30 14:12 my real secret plan I think you hinted at: laptops 2009-04-30 14:12 true 2009-04-30 14:12 so yes, it would be good to define a sync that applies to a specific snapshot 2009-04-30 14:12 before you remount / to rw, you take a snapshot 2009-04-30 14:12 even if there is no posix command for that 2009-04-30 14:13 then all writes occur in memory 2009-04-30 14:13 ok, sure 2009-04-30 14:13 throwaway snapshot 2009-04-30 14:13 laptops get nice long sleep times for hard drives 2009-04-30 14:13 volatile backing model 2009-04-30 14:13 something like that 2009-04-30 14:13 yes 2009-04-30 14:13 but 2009-04-30 14:13 the snapshot becomes battery aware :) 2009-04-30 14:14 per-version volatile attribute maybe 2009-04-30 14:14 it will create a new snapshot periodically and flush the old snapshot to disk 2009-04-30 14:14 well, the 'create periodically' should not be filesystem policy, it should be a script of something like that 2009-04-30 14:15 right 2009-04-30 14:15 I'm just giving you the high level idea 2009-04-30 14:15 but the script needs the right support from the fs 2009-04-30 14:15 sure 2009-04-30 14:15 there should always be enough time left on the battery to flush to disk 2009-04-30 14:15 *but* 2009-04-30 14:15 in case something drastic 
happens 2009-04-30 14:16 (battery malfunction, user rips out the battery, etc) 2009-04-30 14:16 ok, I see 2009-04-30 14:16 you should always have a bootable state 2009-04-30 14:16 sure, it's a good idea 2009-04-30 14:16 you might lose some data, but not total loss 2009-04-30 14:17 which is why you want it implemented *not* as a block device, but as part of the fs 2009-04-30 14:17 you'd get huge increases in battery life 2009-04-30 14:18 huge decreases in ssd writes, prolonging live 2009-04-30 14:18 and huge speed increases 2009-04-30 14:18 especially for short burstable writes *cough*mozilla*cough* 2009-04-30 14:20 package managers could integrate with it too, providing rollback and much faster package install times 2009-04-30 14:21 flips: how hard would this be to add? 2009-04-30 14:22 flips: and btw, it was fun having drinks and talking after SCALE 2009-04-30 14:33 npmccallum, it should not be hard 2009-04-30 14:33 has to be kept in mind as a goal 2009-04-30 14:34 :) 2009-04-30 15:37 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-30 17:21 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-30 17:43 -!- ed__(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-30 20:30 -!- edt(~Ed@dsl-216-221-32-17.aei.ca) has joined #tux3 2009-04-30 21:47 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-04-30 21:54 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-30 22:15 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-04-30 22:54 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-01 02:21 -!- pgquiles(~pgquiles@66.Red-79-147-232.dynamicIP.rima-tde.net) has joined #tux3 2009-05-01 06:41 -!- dcg(~dcg@73.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-01 07:20 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-01 07:38 -!- 
npmccallum(~npmccallu@cpe-76-177-118-207.natcky.res.rr.com) has joined #tux3 2009-05-01 08:13 -!- tim_dimm_(~mobile@32.133.183.178) has joined #tux3 2009-05-01 08:13 shapor: ping 2009-05-01 08:44 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-01 09:37 -!- tim_dimm(~timothyhu@static-71-165-35-33.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-01 09:44 -!- pgquiles(~pgquiles@97.Red-81-33-102.dynamicIP.rima-tde.net) has joined #tux3 2009-05-01 09:51 -!- tim_dimm_(~timothyhu@static-71-165-35-33.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-01 10:05 -!- tim_dimm(~timothyhu@static-71-165-35-33.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-01 11:16 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-01 11:35 good morning 2009-05-01 11:35 morning flips 2009-05-01 11:35 its almost evening for you 2009-05-01 11:35 need to make a note of the special in memory snapshot thing we discussed yesterday 2009-05-01 11:36 separation of front end from back end? 2009-05-01 13:02 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-01 13:11 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-01 13:53 hey flips 2009-05-01 15:20 -!- dcg_(~dcg@63.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-01 23:00 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-02 00:24 -!- RazvanM(~RazvanM@96.234.241.246) has joined #tux3 2009-05-02 00:26 I just noticed that the last merge with the main tree was done on Mar 10 2009-05-02 00:26 Would it be possible to merge with the latest tree? 
:D 2009-05-02 04:17 http://userweb.kernel.org/~hirofumi/atomic.tar.gz 2009-05-02 04:17 this is the part of my current codes 2009-05-02 04:18 it try to flush btree, bitmap, and logs 2009-05-02 04:19 not completed at all, well, anyway, next is more complete the btree logging 2009-05-02 06:37 added the dump support to tux3graph 2009-05-02 06:37 the bitmap dump 2009-05-02 06:37 the result of it is http://userweb.kernel.org/~hirofumi/atomic.png 2009-05-02 06:39 it means 0-11 blocks was allocated, and 12 is also allocated, but it is in the log not in bitmap yet 2009-05-02 06:42 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-02 06:43 -!- dcg(~dcg@210.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-02 06:56 -!- dcg_(~dcg@45.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-02 07:14 -!- ed__(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-02 07:26 btw, I think "flush" is needed new word 2009-05-02 07:27 it seems confusable 2009-05-02 07:27 I'd like to use "flush" for flushing the buffer 2009-05-02 07:52 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-02 08:01 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-02 08:41 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-02 09:48 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-02 09:51 flips, do you remember what is the intent of parent in LOG_UPDATE? 
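The bitmap dump described above (blocks 0-11 recorded in the bitmap, block 12 allocated but so far only in the log) suggests that the effective allocation state is the on-disk bitmap with the log replayed on top. The sketch below illustrates that reading; the record layout `(op, start, count)` and the function name are hypothetical and are not tux3's actual log format.

```python
def replay(bitmap, log):
    """Return the effective allocation set: on-disk bitmap plus logged changes."""
    effective = set(bitmap)
    for op, start, count in log:
        blocks = range(start, start + count)
        if op == "alloc":
            effective.update(blocks)
        elif op == "free":
            effective.difference_update(blocks)
        else:
            raise ValueError(f"unknown log record: {op}")
    return effective


on_disk_bitmap = set(range(12))     # blocks 0-11 already recorded in the bitmap
log = [("alloc", 12, 1)]            # block 12 is so far only recorded in the log
allocated = replay(on_disk_bitmap, log)
print(sorted(allocated))
```

Until the bitmap itself is written out, an allocator would have to consult this combined view rather than the on-disk bitmap alone, otherwise block 12 could be handed out twice.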
2009-05-02 09:59 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-02 10:17 btw, now, I'm working on itree split 2009-05-02 10:41 ah, btw, I'll be almost off tomorrow for house cleaning skipped previous month :) 2009-05-02 12:40 -!- dcg__(~dcg@132.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-02 13:08 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-02 13:10 -!- dcg__(~dcg@175.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-02 13:39 refactored the patches: http://userweb.kernel.org/~hirofumi/atomic.png 2009-05-02 13:39 and one bug was fixed, allocation of log block was included the flush_log() 2009-05-02 13:53 um.. defree/deflush of log block may still be buggy 2009-05-02 13:55 note to help to think it: http://userweb.kernel.org/~hirofumi/note.flush 2009-05-02 13:55 -------------------- is stage_delta() 2009-05-02 13:55 ~~~~~~~~~~~~~~~~~~~~~~~~~ is flush_log() cycle 2009-05-02 13:55 ========================== is end of flush_log() cycle 2009-05-02 13:59 time to sleep 2009-05-02 14:17 hirofumi, reading 2009-05-02 14:17 that png is very cool 2009-05-02 14:18 makes me feel like hacking :) 2009-05-02 14:43 hey flipz 2009-05-02 17:51 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-02 18:09 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-02 18:50 Oooh, flipz' latest message on the mailing list is inspiring. I need to get off of my lazy-student behind and give tux3 a test run. 2009-05-02 20:49 Will the latest snapshot apply against linus' git tree? 2009-05-02 20:51 I suppose I'll find out. :) 2009-05-02 21:03 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-02 21:14 Oops, missed the tree that's hosed on kernel.org!
2009-05-02 21:29 heh, hosed 2009-05-02 21:29 ACTION goes to bed 2009-05-02 22:30 kspaans, it should 2009-05-02 23:16 -!- RazvanM(~RazvanM@96.234.241.246) has joined #tux3 2009-05-02 23:46 -!- RazvanM_(~RazvanM@96.234.246.19) has joined #tux3 2009-05-03 00:17 -!- RazvanM(~RazvanM@pool-173-67-54-86.bltmmd.east.verizon.net) has joined #tux3 2009-05-03 04:37 -!- sim(~simon@lib59-3-82-233-188-87.fbx.proxad.net) has joined #tux3 2009-05-03 05:56 flipz: It did. 2009-05-03 06:14 -!- dcg(~dcg@137.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-03 06:42 -!- dcg(~dcg@137.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-03 07:15 -!- dcg(~dcg@137.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-03 07:22 hi everyone. i am back from the everest (base camp) and ready to code again :) 2009-05-03 07:24 Fun! 2009-05-03 07:24 so. not much going on in the official repository 2009-05-03 07:25 Seems so. 2009-05-03 07:26 http://userweb.kernel.org/~hirofumi/ may be newer 2009-05-03 07:26 already looking at it, thanks :) 2009-05-03 07:42 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-03 08:12 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-03 08:47 -!- ed__(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-03 09:29 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-03 09:47 I've compiled the programs in tux3/user/, should I be able to run `tux3 mkfs ...` without even having the tux3 drivers installed in my kernel yet? 2009-05-03 09:49 kspaans, yes you can 2009-05-03 09:49 and you can run the graphics dump to see what you got 2009-05-03 09:50 and run the fuse code to access it 2009-05-03 09:51 -!- dcg(~dcg@137.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-03 09:54 Right, but I shouldn't be able to read or write until I've got either FUSE or the kernel driver going? 
2009-05-03 09:57 Props on using graphviz by the way, I've been using it almost daily since I first discovered it about 2 years ago. 2009-05-03 10:02 getting fuse going is easy 2009-05-03 10:03 say thanks to hirofumi re graphviz 2009-05-03 10:03 I'm impressed and will use that tool in the future 2009-05-03 10:03 Will do. 2009-05-03 10:48 Oh, my hard drive decided to unplug itself while I was building the kernel, how nice! :-P 2009-05-03 11:04 -!- dcg_(~dcg@226.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-03 11:47 -!- pgquiles(~pgquiles@241.Red-88-0-137.dynamicIP.rima-tde.net) has joined #tux3 2009-05-03 13:28 hey flipz 2009-05-03 13:57 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-03 15:16 Perhaps I applied the patch incorrectlyto Linus' tree, but shouldn't it modify fs/Kconfig to add an entry for tux3? 2009-05-03 16:40 Hmm, maybe not but I can see there should have been a line added to fs/Makefile 2009-05-03 18:06 kspaans, Kconfig is now per-fs 2009-05-03 18:07 so we now have fs/tux3/Kconfig 2009-05-03 18:08 there should indeed be a line added to fs/Mkaefile 2009-05-03 18:08 Makefile 2009-05-03 18:09 http://tux3.org/patches/tux3-2.6.29-rc7-2 <- and it is 2009-05-03 18:16 flipz: Yep, silly me for forgetting to check the patch before asking. ;-) 2009-05-03 18:17 ;) 2009-05-03 18:17 I'm still compiling the kernel though. D: Perhaps it's time to upgrade my hardware... 2009-05-03 18:17 compile with make defconfig 2009-05-03 18:17 so you don't build every module in the world 2009-05-03 18:17 and try compilercache 2009-05-03 18:18 That would probably help! I just copied the config from /boot in this Ubuntu install. 
2009-05-03 18:18 bleah 2009-05-03 18:18 only ever do that once 2009-05-03 18:18 just to be able to boot with whatever crazy hardware you have 2009-05-03 18:18 then do lsmod 2009-05-03 18:19 make defconfig, then engable all the things you saw in the lsmod 2009-05-03 18:19 enable I mean 2009-05-03 18:19 ACTION thinks engable really ought to be a word 2009-05-03 18:19 sounds like it means something 2009-05-03 18:19 Reminds me of "Ann of Green Gables". 2009-05-03 18:19 like, "give Clark Gable a job" -> engable him 2009-05-03 18:20 oh yes 2009-05-03 18:20 also, "engable a nice canadian redhet" -> make her famous 2009-05-03 18:20 redhead 2009-05-03 18:20 him, redhet sounds like it should be a word too 2009-05-03 18:21 How appropriate, I'm a Canadian Redhead. :P 2009-05-03 18:21 you r so engabled 2009-05-03 18:22 oop, time to go watch a movie with my roommates. Classes start tomorrow! 2009-05-03 18:22 night! 2009-05-03 18:22 night! 2009-05-03 19:07 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-03 20:41 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-04 03:43 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-05-04 04:01 retire of log block itself is not good 2009-05-04 04:01 rather seems bad 2009-05-04 04:02 log block can be retired after 2 cycle of flush_log() 2009-05-04 04:03 it means we must have the almost log blocks of 3 flush_log() 2009-05-04 04:04 3 flush_log() retires oldest log blocks 2009-05-04 04:04 why? 2009-05-04 04:05 the log blocks of previous cycle is needed to know the deflush log records 2009-05-04 04:05 so, 2 previous cycle can be retired after flush_log() 2009-05-04 04:07 maybe, this would be hard to see why, without some figure 2009-05-04 04:07 so, I'll explain with chat, maybe, tommorow? 
2009-05-04 04:11 well, so, I'm thinking to make the log of bfree for retired log blocks 2009-05-04 04:12 because, 2 previous log blocks is just used to know log blocks address itself 2009-05-04 04:12 if I'm not missing something 2009-05-04 06:05 -!- dcg(~dcg@234.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-04 06:24 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-05-04 06:39 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-04 09:32 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-04 10:26 -!- dcg_(~dcg@224.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-04 11:06 hi hirofumi 2009-05-04 11:12 hi 2009-05-04 11:13 do you have time for a hour or so? 2009-05-04 11:13 I'd like to talk about retiring the log blocks 2009-05-04 11:13 for you, always :) 2009-05-04 11:14 thanks :) 2009-05-04 11:14 ok 2009-05-04 11:14 I'll make some figure for it 2009-05-04 11:16 ok, I had a small observation that might simplify things a little 2009-05-04 11:17 http://userweb.kernel.org/~hirofumi/logblock.note 2009-05-04 11:17 which is: for the first log block in the sequence, we can record the beginning of valid log data 2009-05-04 11:17 or rather, for the entire log sequence 2009-05-04 11:18 valid log data? 2009-05-04 11:18 yes, so that we can drop part of a log block while keeping some log messages at the end of the block 2009-05-04 11:18 for example... 2009-05-04 11:19 if we are freeing (retiring) log blocks, we can write a log_bfree log messages instead of updating the bitmaps 2009-05-04 11:19 and that message goes at the end of the current log block 2009-05-04 11:19 yes 2009-05-04 11:20 it may be needed to make things simple 2009-05-04 11:20 and all the previous log blocks are freed, except for the last one 2009-05-04 11:20 and I'm thinking which is needed it? 
2009-05-04 11:20 yes, basically 2009-05-04 11:20 I think this is useful for retiring the log blocks 2009-05-04 11:20 a slight simplification 2009-05-04 11:21 yes, probably 2009-05-04 11:21 it gets a little tricky otherwise 2009-05-04 11:21 yes, exactly 2009-05-04 11:21 the tricky part is trying to update bitmap blocks, it is easier to leave them alone and write log records 2009-05-04 11:21 by recording the beginning of valid log data, we don't have to start a new log block to record the bfrees 2009-05-04 11:22 I think it also reduces number of log blocks that have to be written a little 2009-05-04 11:22 ok, I will read the note 2009-05-04 11:22 and the cause of tricky is, defree is used to free *and* reservation to prevent overwrite it 2009-05-04 11:22 yes 2009-05-04 11:23 note is just to talk with the mark, (1), (2) ... 2009-05-04 11:23 and another tricky detail is, we have to be able to reconstruct the defree list 2009-05-04 11:23 as you pointed out earlier 2009-05-04 11:23 **** is flush_log cycle 2009-05-04 11:23 yes 2009-05-04 11:24 so, let me talk current situation 2009-05-04 11:24 from (1) to after (2), is needed logs 2009-05-04 11:24 you ascii picture is fine 2009-05-04 11:25 it is exactly my thinking 2009-05-04 11:25 we are on after (2) 2009-05-04 11:25 ah 2009-05-04 11:25 and after (1) contains the logs for defree/deflush 2009-05-04 11:26 so, we can't free those until next flush_log() 2009-05-04 11:26 5 minutes please 2009-05-04 11:26 yes 2009-05-04 11:26 I will be back 2009-05-04 11:26 ok 2009-05-04 11:28 and the issue is, when do we can retire log block itself? 2009-05-04 11:28 when can we retire the log blocks itself? 
2009-05-04 11:30 I've added the (1a) and (2a) 2009-05-04 11:30 added to http://userweb.kernel.org/~hirofumi/logblock.note 2009-05-04 11:30 back 2009-05-04 11:31 ok 2009-05-04 11:31 I see it 2009-05-04 11:33 this is my thinking: we will write bfree log messages for all the existing log blocks except the log block containing the bfree messages themselves (the last log block) 2009-05-04 11:33 this effectively frees all log blocks except the last one, which has to be retained until the next log flush cycle 2009-05-04 11:34 now, we are on where point in figure? 2009-05-04 11:34 (2a)? 2009-05-04 11:35 what is the event between (2) and (2a)? 2009-05-04 11:35 (2) is before flush_log(), (2a) is after flush_log() 2009-05-04 11:36 ****** is in flush_log() 2009-05-04 11:36 ah, and we should have a (3) to show the next delta flush 2009-05-04 11:36 *** is durning flush_log() 2009-05-04 11:36 (1) and (1a) is not enough? 2009-05-04 11:36 it depends what we mean by "flush delta" 2009-05-04 11:37 ah 2009-05-04 11:37 to me, flush delta does not generate any disk activity itself 2009-05-04 11:37 flush delta means stage_delta() and flush_log() 2009-05-04 11:37 sorry 2009-05-04 11:37 to me, flush log does not generate any disk activity itself 2009-05-04 11:37 disk activity is only generated by flush delta 2009-05-04 11:38 flush log does something very simple 2009-05-04 11:38 btw, flush_log() is not flushing log 2009-05-04 11:38 you mean the code I wrote? 2009-05-04 11:38 yes 2009-05-04 11:38 ah, no 2009-05-04 11:38 I wrote 2009-05-04 11:38 ah 2009-05-04 11:38 ok, good 2009-05-04 11:38 however, there is no big change iirc 2009-05-04 11:39 flush_log() should do two things: write bfree records into the log and reset the sb->logbase (I think that is the variable name) 2009-05-04 11:39 btw, this is why I'm looking the new word of "flush" for flush_log() 2009-05-04 11:40 what are the two different meanings? 
2009-05-04 11:40 it seems to me that there should be only one flush_log, and it should be part of delta transition 2009-05-04 11:41 flush is "write the dirty buffers to disk", and flush_log() is "flushing the btree and bitmap, then write log blocks in atomic commit" 2009-05-04 11:41 that is, sometimes part of the delta transition, if it is time to flush the log 2009-05-04 11:41 to me, these are exactly the same concept 2009-05-04 11:42 concept may be same, however the detail is confusable 2009-05-04 11:42 btw, now, I'm using to flushing the log blocks, write_log() 2009-05-04 11:43 but, it can also call flush_log() 2009-05-04 11:43 :) 2009-05-04 11:44 I want to word of "flush delta" in figure 2009-05-04 11:44 I'm thinking we are calling it "flush" 2009-05-04 11:45 flush_log() is good, until you convince me otherwise ;) 2009-05-04 11:45 :) 2009-05-04 11:46 e.g. now, I'm using flush_buffer_list() 2009-05-04 11:46 it flushes the buffers in list 2009-05-04 11:47 but, flush_log() is not meaning, to flush the log buffers 2009-05-04 11:47 it does "flush cycle" in atomic commit 2009-05-04 11:47 i.e. flushing the bitmap, btree, and log blocks 2009-05-04 11:48 e.g. if we call it fdelta like normal delta 2009-05-04 11:48 it helps me 2009-05-04 11:48 flush_log() will become flush_fdelta() 2009-05-04 11:48 it means to flush the fdelta cycle 2009-05-04 11:49 just use the name you like most 2009-05-04 11:49 i.e. I want the word like "delta" for flush cycle 2009-05-04 11:49 but, I want to share it with you 2009-05-04 11:50 yes, otherwise I won't understand it 2009-05-04 11:50 thanks 2009-05-04 11:50 anything that reduces confusion is good at this point 2009-05-04 11:51 yes 2009-05-04 11:51 so, for now, let's call flush cycle is fdelta 2009-05-04 11:52 well, back to the retire log blocks 2009-05-04 11:54 now, we are on (2a) point 2009-05-04 11:55 so, which log blocks can we free? 
2009-05-04 11:56 I'm expecting to free before (1) at the between (2) and (2a) 2009-05-04 12:12 we can free every log block up to the block where the bfree log records begin 2009-05-04 12:13 bfree log records? 2009-05-04 12:13 ah 2009-05-04 12:14 new log record, now we introducing? 2009-05-04 12:37 not a new log record, just the log record for a deferred free 2009-05-04 12:38 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-04 12:38 freeing log blocks is a deferred free, just like other deferred frees 2009-05-04 12:38 um... 2009-05-04 12:39 ah, ok 2009-05-04 12:40 when do we log the deflush of log blocks? 2009-05-04 12:41 in the fdelta cycle? 2009-05-04 12:41 yes 2009-05-04 12:42 that is why I suggested adding a variable to track the start of the valid part of the log 2009-05-04 12:43 this is just an offset in the oldest log block 2009-05-04 12:44 replay ignores any log records before the "valid log offset" 2009-05-04 12:44 so log flush just needs to make the defree entries, then set the logbase and logvalid (suggested variable name) 2009-05-04 12:45 or maybe logoffset or maybe logbase_offset 2009-05-04 12:48 um... 2009-05-04 12:49 logvalid is like sb->next_logbase and sb->logbase in my patchset? 2009-05-04 12:51 flips, idea for tux3.org 2009-05-04 12:51 we should make files out of the tux3 U logs 2009-05-04 12:51 was explaining to someone about how we used to have tux3 U 2009-05-04 13:21 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-04 13:34 -!- dcg(~dcg@134.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-04 13:48 hirofumi, just like that 2009-05-04 13:48 I see 2009-05-04 13:48 that is, there are two numbers to describe the start of the log 2009-05-04 13:49 a block number, and an offset within the block 2009-05-04 13:49 ah 2009-05-04 13:50 and merges to fdelta log and delta log? 
2009-05-04 13:50 now, I'm writing it as separated blocks 2009-05-04 13:51 by using the log start offset, we avoid allocating a new log block 2009-05-04 13:51 I think that is the only effect 2009-05-04 13:51 but it is a nice effect 2009-05-04 13:52 the optimize things? 2009-05-04 13:52 exactly, it merges the fdelta and delta log 2009-05-04 13:52 i see 2009-05-04 13:52 that is the point, the fdelta log is not really separate 2009-05-04 13:53 at least, I think it is not really separate 2009-05-04 13:53 however, I think it makes retire log block complex a little 2009-05-04 13:54 because those log records are having different retire cycle 2009-05-04 13:55 before fdelta is retired at next fdelta 2009-05-04 13:55 but, fdelta is not 2009-05-04 13:55 um... 2009-05-04 13:56 to make sure, the log records of fdelta means logs of bitmap/btree durning flushing those 2009-05-04 14:32 sorry, time to sleep 2009-05-04 14:32 I'll think it more based on your bfree log of log block 2009-05-04 14:47 I didn't think it would make the log block retiring more complex 2009-05-04 14:47 but maybe I overlooked some issue 2009-05-04 14:48 ok, what happens at a flush cycle is, we generate some log records like bitmap bfrees and so on, then immediately discard them, because we are writing the actual bitmaps out in the same delta 2009-05-04 14:48 I may also be overlooking something, because I'm confusing more or less to handling this 2009-05-04 14:48 that is probably the confusing part 2009-05-04 14:49 yes 2009-05-04 14:49 ok, will the explanation is: we are generating some log records that we immediately throw away 2009-05-04 14:49 I've almost implemented the change_end() except this 2009-05-04 14:49 and we just do that to keep the code simple 2009-05-04 14:49 but this part is rewrited some times 2009-05-04 14:50 can I get your latest prototype and work on it with you? 2009-05-04 14:50 ok 2009-05-04 14:50 I'll push latest my patches 2009-05-04 14:50 post a link to the mailing list? 
2009-05-04 14:50 ok 2009-05-04 14:50 good 2009-05-04 14:50 ok 2009-05-04 14:51 it will be a few hours before I can look at it 2009-05-04 14:53 http://userweb.kernel.org/~hirofumi/atomic.tar.gz 2009-05-04 14:53 this is my current patchset 2009-05-04 14:53 ah, I should post it to ml? 2009-05-04 14:54 ah, ok, I'll post it 2009-05-04 15:29 folks 2009-05-04 15:30 post to ml is good 2009-05-04 15:30 this is regarding atomic commits ? 2009-05-04 15:31 yes, I posted it 2009-05-04 15:31 bh, yes 2009-05-04 15:31 good :) 2009-05-04 15:31 tux3 development seemed to be stalled for a while, I was worried 2009-05-04 15:32 bh, never fear, hirofumi is hear ;-) 2009-05-04 15:32 well, I was working for it more or less, however, not in public repo 2009-05-04 15:32 here rather 2009-05-04 15:32 :) 2009-05-04 15:32 :) 2009-05-04 15:32 yeah, I was looking at the logs recently and noticed there wasn't too much activity 2009-05-04 15:33 glad to hear it's moving again publically 2009-05-04 15:33 its been way down, but there's been plenty of activity behind the scenes 2009-05-04 15:33 me too 2009-05-04 15:33 glad you noticed 2009-05-04 15:33 call it spring break if you like 2009-05-04 15:33 well, my current patchset breaks current functional 2009-05-04 15:34 so, it is not into public 2009-05-04 15:34 i noticed a crudehack in there somewhere :) 2009-05-04 15:34 maybe, until working write and replay more or less 2009-05-04 15:37 btw, time to sleep 2009-05-04 15:37 night 2009-05-04 15:37 good night 2009-05-04 15:37 night hirofumi 2009-05-04 15:37 it's already morning though :) 2009-05-04 15:38 7:38 2009-05-04 15:41 you are Daniel's Japanese twin 2009-05-04 15:41 creativity fueled by lack of sleep :-) 2009-05-04 18:55 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-04 23:36 -!- RazvanM(~RazvanM@pool-173-67-57-242.bltmmd.east.verizon.net) has joined #tux3 2009-05-05 05:50 I've thought about the freeing log blocks issue 2009-05-05 05:50 and wrote the figure for it 2009-05-05 05:51 
http://userweb.kernel.org/~hirofumi/note.defree-logs 2009-05-05 05:51 I hope this solves the issue 2009-05-05 05:51 if there is any suggestion or something, please let me know 2009-05-05 06:26 good morning 2009-05-05 06:26 hi 2009-05-05 06:27 reading the note 2009-05-05 06:28 if there is any question, please let me know 2009-05-05 06:28 I know it's not enough informative 2009-05-05 06:28 yes, it's correct 2009-05-05 06:29 looking some more 2009-05-05 06:29 and a bit complex point 2009-05-05 06:29 is 2009-05-05 06:29 it is indeed 2009-05-05 06:29 (1b) on from (1a) to (1c) 2009-05-05 06:30 (1a) - (1c) means backend of flush_log() cycle 2009-05-05 06:30 so, it has "delta" and "fdelta" 2009-05-05 06:31 where fdelta is just a variant delta? 2009-05-05 06:32 delta is flushing the file/dir data, and leaf 2009-05-05 06:32 fdelta is flushing the btree and bitmap 2009-05-05 06:33 there is a (2c) that should be a (2b) I think 2009-05-05 06:33 whoops 2009-05-05 06:34 you are right 2009-05-05 06:34 anyway, it still seems write 2009-05-05 06:34 there is no separate disk write for an fdelta, the actual writing is done by the delta path 2009-05-05 06:35 (just confirming) 2009-05-05 06:35 actual writing? 
2009-05-05 06:36 well, the detail of implement is
2009-05-05 06:36 both is in change_end()
2009-05-05 06:36 if (sb->delta == delta) {
2009-05-05 06:36     int new_cycle = need_flush(sb);
2009-05-05 06:36     trace(">>>>>>>>> commit delta %u", delta);
2009-05-05 06:36     stage_delta(sb);
2009-05-05 06:36     if (new_cycle) {
2009-05-05 06:36         err = flush_log(sb);
2009-05-05 06:36         if (err)
2009-05-05 06:36             goto out;
2009-05-05 06:36     }
2009-05-05 06:36     write_log(sb, new_cycle);
2009-05-05 06:36     commit_delta(sb);
2009-05-05 06:36     trace("<<<<<<<<< commit done %u", delta);
2009-05-05 06:36 }
2009-05-05 06:36 stage_delta() is delta
2009-05-05 06:36 flush_log() is fdelta in my imagine
2009-05-05 06:37 yes, exactly
2009-05-05 06:37 and complex point is
2009-05-05 06:38 generated by stage_delta() and flush_log() is not retired same point
2009-05-05 06:38 generated logs
2009-05-05 06:38 generated logs by flush_log() is after flushed the btree/bitmap
2009-05-05 06:38 and stage_delta is before
2009-05-05 06:39 yes
2009-05-05 06:39 it is why I'm thinking those separately
2009-05-05 06:40 I wonder what happens if generating the bfree log entry allocates a new log block
2009-05-05 06:40 it is probably ok
2009-05-05 06:40 yes
2009-05-05 06:40 I'm thinking bfree log is generated at same point with bitmap/btree logs
2009-05-05 06:41 yes
2009-05-05 06:42 and current write_log() doesn't generates the bfree logs in writing logs
2009-05-05 06:42 and when the log blocks are released, there are two choices: 1) start a new log block for the bfree records (because this block will not be released in this delta) or 2) introduce a log offset to define where valid log data starts in the oldest log block
2009-05-05 06:43 um...
2009-05-05 06:43 I like the second method more because it saves a significant number of log block writes (average 1/2 per flush cycle) 2009-05-05 06:44 but the first method should work too 2009-05-05 06:45 I'm not understanding yet, why log offset is needed 2009-05-05 06:45 why needed 1) or 2) 2009-05-05 06:46 let me talk with example 2009-05-05 06:46 because after the delta, the release log blocks will be freed and can be overwritten, however the log block(s) containing the bfree log entries must not be overwritten until the next flush cycle 2009-05-05 06:46 in my figure, 2009-05-05 06:47 [deflush] region is freed at (2c) 2009-05-05 06:47 ah 2009-05-05 06:48 it is about the between of (1a) and (1c)? 2009-05-05 06:48 and (1b) is seperated log block or log offset? 2009-05-05 06:49 yes 2009-05-05 06:49 i see 2009-05-05 06:49 makes sense? 2009-05-05 06:49 probably 2009-05-05 06:49 but, it may a bit complex for replay too 2009-05-05 06:50 umm... 2009-05-05 06:50 replay seems ok 2009-05-05 06:51 replay is also handle log offset 2009-05-05 06:51 in fact, replay is the only thing that uses the log offset 2009-05-05 06:52 log offset is bytes offset in oldest log block? 2009-05-05 06:52 ah, total length of log records? 
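The replay side of the log offset idea discussed above can be sketched in C as follows. This is only an illustration under stated assumptions: the fixed-size `struct logrec`, the `replay_block()` helper, and its parameters are hypothetical and are not taken from the tux3 sources, and the log offset is assumed to be a byte offset into the oldest log block. The point it shows is that only the oldest block is affected: records before the offset are skipped, and every later block is replayed from its start.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size log record; the real tux3 log encoding is not shown in this log. */
struct logrec {
	uint8_t type;
	uint8_t pad[7];
	uint64_t block;
};

/*
 * Copy the records that replay should apply from one log block into `out`
 * and return how many were copied.
 *
 * `used` is the number of valid bytes in the block. `is_oldest` marks the
 * first block of the log chain; only for that block is `logoffset` (a byte
 * offset, assumed here) honoured. Records before the offset are skipped
 * because their effects were already written out as full blocks.
 */
static size_t replay_block(const uint8_t *data, size_t used, int is_oldest,
			   size_t logoffset, struct logrec *out, size_t max)
{
	size_t pos = is_oldest ? logoffset : 0;
	size_t count = 0;

	while (pos + sizeof(struct logrec) <= used && count < max) {
		memcpy(&out[count++], data + pos, sizeof(struct logrec));
		pos += sizeof(struct logrec);
	}
	return count;
}
```

With three records in the oldest block and an offset of two records, only the third record is replayed; the same block treated as a non-oldest block yields all three.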
2009-05-05 06:53 because the records it will skip refer to bitmap entries or btree index updates that were already written out as part of full blocks
2009-05-05 06:53 log offset is bytes offset in the oldest log block, yes
2009-05-05 06:54 ileaf/dleaf/data/bitmap/bnode
2009-05-05 06:54 [-]bfree (data) log [defree]
2009-05-05 06:54 [-]balloc (data) log
2009-05-05 06:54 [-]ileaf redirect log [defree]
2009-05-05 06:54 [-]dleaf redirect log [defree]
2009-05-05 06:54 [#]bnode redirect log [deflush]
2009-05-05 06:54 [-]bnode update log
2009-05-05 06:54 ileaf/dleaf/data
2009-05-05 06:54 ---------------------------------------------------------------------
2009-05-05 06:54 defree bfree (data) log
2009-05-05 06:54 defree ileaf redirect log
2009-05-05 06:54 defree dleaf redirect log
2009-05-05 06:54 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2009-05-05 06:54 ileaf/dleaf/data/bitmap/bnode
2009-05-05 06:54 [#]bfree (data) log [defree]
2009-05-05 06:54 [-]balloc (data) log
2009-05-05 06:54 [#]ileaf redirect log [defree]
2009-05-05 06:54 [#]dleaf redirect log [defree]
2009-05-05 06:54 [#]bnode redirect log [deflush]
2009-05-05 06:54 [-]bnode update log
2009-05-05 06:54 ileaf/dleaf/data
2009-05-05 06:54 ---------------------------------------------------------------------
2009-05-05 06:54 defree bfree (data) log
2009-05-05 06:54 defree ileaf redirect log
2009-05-05 06:54 defree dleaf redirect log
2009-05-05 06:54 ---------------------------------------------------------------------
2009-05-05 06:54 new cycle log
2009-05-05 06:54 bnode
2009-05-05 06:55 [%]bfree (bitmap) log [deflush]
2009-05-05 06:55 [+]balloc (bitmap) log
2009-05-05 06:55 [%]bnode redirect log [deflush]
2009-05-05 06:55 [+]bnode update log
2009-05-05 06:55 bitmap
2009-05-05 06:55 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2009-05-05 06:55 =====================================================================
2009-05-05 06:55
2009-05-05 06:55 this is another 
note 2009-05-05 06:55 this may help to talk it 2009-05-05 06:55 http://userweb.kernel.org/~hirofumi/notes/note.flush 2009-05-05 06:55 sorry, I've pushed it to kernel.org 2009-05-05 06:56 another complex point of log records in (1a)-(1c) is, 2009-05-05 06:56 log record in (1a)-(1b) is not same with normal delta 2009-05-05 06:57 e.g. ileaf redirect log is not included into fdelta yet 2009-05-05 06:58 um... 2009-05-05 07:01 with log offset strategy, do we need sb->logoffset and sb->next_logoffset? 2009-05-05 07:04 ah, ok 2009-05-05 07:05 it is about how do we handle (1b) on (1a)-(1c) 2009-05-05 07:06 yes, sb->logoffset but not sb->next_logoffset 2009-05-05 07:07 but, on (2c), we will set sb->logoffset to (1b) 2009-05-05 07:08 and we have to remember (2b) for next cycle 2009-05-05 07:08 I guess 2009-05-05 07:09 btw, I'm thinking this strategy is good than the log of new-cycle 2009-05-05 07:10 I'm starting to think those can replace the LOGBLOCK_FLUSH flag and LOG_NEW_CYCLE log 2009-05-05 07:15 which strategy? 2009-05-05 07:15 sb->logoffset 2009-05-05 07:15 ah, good 2009-05-05 07:15 and I guess we need sb->next_logoffset 2009-05-05 07:16 I don't see what next_logoffset is for 2009-05-05 07:17 I don't think we ever need two log offsets at the same time 2009-05-05 07:17 we will have the log blocks from (1a) to (2c) after (2c) 2009-05-05 07:17 if it's right, I think we need to know the points of (1b) and (2b) 2009-05-05 07:18 (1b) is to know the start of valid records 2009-05-05 07:18 (2b) is for next cycle 2009-05-05 07:19 but it is only necessary to know the start of valid records at replay time, so only the offset of next cycle needs to be recorded 2009-05-05 07:19 btw, (2b) is not used until next fdelta 2009-05-05 07:20 right (re 2b) 2009-05-05 07:20 but, if we don't remember (2b) for next cycle, we can't know (2b) anymore? 
2009-05-05 07:20 we need to remember (2b), not (1b) 2009-05-05 07:20 or in other words, these are just the same variable at different times 2009-05-05 07:21 (I think) 2009-05-05 07:21 I think it happens same time 2009-05-05 07:21 well, more variables is better than not enough :) 2009-05-05 07:21 :) 2009-05-05 07:21 well, I may missing the something 2009-05-05 07:22 btw, (1a) - (2c) is right? immediately after (2c) 2009-05-05 07:22 we have the log records from (1a) to (2c) 2009-05-05 07:23 it looks right, everything looks right 2009-05-05 07:24 ok 2009-05-05 07:24 and at the point of (2c), from (1a) to (1b) is obsolated log records 2009-05-05 07:25 i.e. those are already included into bitmap/btree 2009-05-05 07:25 so, to know valid range, we will remember the (1b) as sb->logoffset? 2009-05-05 07:26 yes 2009-05-05 07:26 ok 2009-05-05 07:26 exactly 2009-05-05 07:27 then, on next cycle, we have to update the new sb->logoffset to (2b) 2009-05-05 07:27 however, it is where come from... 2009-05-05 07:31 (2b) is obtained from the current log pointer just before creating the bfree log entries for the log blocks 2009-05-05 07:31 yes, if we are on before (2c) 2009-05-05 07:32 but, if the system was crashed immediately after (2c)? 2009-05-05 07:33 there are two cases: 1) if the superblock was updated then we use (2b) 2) otherwise fall back to (1b) 2009-05-05 07:34 ah 2009-05-05 07:35 we update the sb->logoffset to (1b) at (2c) point 2009-05-05 07:36 because, at (2c) point, bitmap blocks is including the deflush of before (1b) 2009-05-05 07:36 yes 2009-05-05 07:37 so, after (2c), sb->logoffset is (1b) in superblock 2009-05-05 07:37 so, I guess we have to remember the (2b) 2009-05-05 07:37 yes 2009-05-05 07:38 we remember it by updating the superblock for now 2009-05-05 07:38 ok 2009-05-05 07:38 so, sb->logoffset and sb->next_logoffset or something? 
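One way to picture the two-variable variant raised above (`sb->logoffset` plus `sb->next_logoffset`) is the small C sketch below. It is an assumption-laden illustration, not code from the patchset: `struct logpos`, `struct sb_logstate`, and `advance_logoffset()` are hypothetical names, and a log position is modelled as a block number plus a byte offset, following the "two numbers to describe the start of the log" remark earlier in the log. Each flush cycle promotes the previously remembered mark, for example (1b), to the replay start and remembers the new mark, for example (2b), for the following cycle.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical log position: a log block number plus a byte offset inside it. */
struct logpos {
	uint64_t block;
	uint32_t offset;
};

/* Hypothetical slice of superblock state; field names follow the chat, the layout is assumed. */
struct sb_logstate {
	struct logpos logoffset;      /* replay starts at this position */
	struct logpos next_logoffset; /* becomes logoffset at the next flush cycle */
};

/*
 * Called once per flush cycle. `mark` is the log position captured just
 * before the deferred-free records for the retired log blocks are written,
 * i.e. the (1b) or (2b) point in the notes.
 */
static void advance_logoffset(struct sb_logstate *sb, struct logpos mark)
{
	sb->logoffset = sb->next_logoffset; /* the older mark is now the start of valid log data */
	sb->next_logoffset = mark;          /* remember the new mark for the next cycle */
}
```

If the superblock carrying these fields is written as part of the commit, a crash before that write leaves the older pair in place, which matches the "fall back" case mentioned above. Whether one variable or two is enough is left open in the discussion itself.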
2009-05-05 07:41 yes, either is good 2009-05-05 07:42 ok 2009-05-05 07:43 btw, I remembered the another issue in (1a) - (1c) 2009-05-05 07:43 the defree entires of (1a) - (1b) is special 2009-05-05 07:44 because, those defree is not included into bitmap yet 2009-05-05 07:44 so, we have to handle it somehow 2009-05-05 07:44 now, I'm marking those log lock as LOGBLOCK_FLUSH 2009-05-05 07:44 to know, those blocks is special 2009-05-05 07:46 deferred frees are never included in the bitmaps 2009-05-05 07:48 note: sometimes above I have said "bfree log entries" when I really meant "deferred free log entries" 2009-05-05 07:48 yes 2009-05-05 07:48 we create deferred free log entries when we free log blocks 2009-05-05 07:48 yes 2009-05-05 07:48 e.g. defree log by ileaf redirect, it will included into this fdelta cycle? 2009-05-05 07:48 I meant the bfree by ileaf redirect is included into bitmap blocks on this fdelta cycle 2009-05-05 07:48 s/this/next/ 2009-05-05 07:51 [well, we can handle all defree as deflush though] 2009-05-05 08:00 the defree log record for an ileaf redirect becomes obsolete in a flush cycle, because the dirty parent is written out in full 2009-05-05 08:00 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-05 08:01 yes 2009-05-05 08:01 in other words, the defree log record for and ileaf redirect should appear in a log block that will be released, or before logoffset 2009-05-05 08:01 (in a flush cycle) 2009-05-05 08:02 but, if it's generated on (1a)-(1b), those have some differency 2009-05-05 08:03 true 2009-05-05 08:03 because those log records are not commited yet until (1c) was commited 2009-05-05 08:03 then it belongs to the next flush cycle 2009-05-05 08:03 yes 2009-05-05 08:03 this is subtle and nice :) 2009-05-05 08:03 and we have to know those differentcy 2009-05-05 08:04 yes 2009-05-05 08:04 which is commited or not 2009-05-05 08:04 similar to flushing the bitmap 2009-05-05 08:04 so, now, I'm using to know 
it LOGBLOCK_FLUSH flag 2009-05-05 08:04 ah 2009-05-05 08:04 ok 2009-05-05 08:05 but, it might be not so good 2009-05-05 08:05 maybe, it will work though 2009-05-05 08:05 if you have some ideas, please let me know 2009-05-05 08:06 I will 2009-05-05 08:06 thanks 2009-05-05 08:06 you already have heard my best ideas :) 2009-05-05 08:06 :) thanks 2009-05-05 08:06 I'll try to implement those 2009-05-05 08:07 :) 2009-05-05 08:15 -!- pythonstar(~kavli@c-affbe455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-05-05 09:22 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-05 09:34 -!- dcg(~dcg@3.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-05 09:47 -!- dcg(~dcg@3.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-05 09:47 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-05 10:04 npmccallum, I do remember you from scale 2009-05-05 10:04 welcome :) 2009-05-05 10:10 flips: thanks :) 2009-05-05 10:11 I thought about implementing my "snapshot to a different medium" idea as a fuse fs called queuefs 2009-05-05 10:11 where all writes would be queued until flushed 2009-05-05 10:11 it could then layer on top of any fs 2009-05-05 10:12 could do a proof of concept that way, maybe 2009-05-05 10:12 proving concepts is good 2009-05-05 10:13 I'm not sure how much of a bottleneck a queue would become 2009-05-05 10:20 why wouldnt we be able to make a raid0 mirror with ram disk and a real partition? 2009-05-05 10:20 woudlnt then the raid system be responsible for syncing up? 2009-05-05 10:22 marcin: it would require free RAM == disk size 2009-05-05 10:23 yes, but that's what i want ;) 2009-05-05 10:23 small but superfast, and i dont wanna do my own syncing 2009-05-05 10:24 there's uses for it 2009-05-05 10:24 plus i got a box with 32gb ram so that's plenty of useful info 2009-05-05 10:25 DDR2 RDIMM? 
2009-05-05 10:25 i've been thinking of implementing something like that after i saw some mad fools using a usb hub and a shitload of usb sticks to make a huge raid 2009-05-05 10:26 not sure what's inside exactly, whatever new xeons want 2009-05-05 10:26 ddr3 2009-05-05 10:26 probably 2009-05-05 10:26 nehalem architecture 2009-05-05 10:26 it's not nehalem, it's the last one before that 2009-05-05 10:26 5100 2009-05-05 10:26 fb-dimm 2009-05-05 10:27 32GB is a lot of fb-dimm 2009-05-05 10:27 nehalem will go up to 144GB of DDR3 2009-05-05 10:27 Samsung has 16GB DDR3 modules 2009-05-05 10:28 would be plenty for metadata 2009-05-05 10:29 i'm drooling over my 32gb 2009-05-05 10:29 it's very useful 2009-05-05 10:30 if i work with a large dataset, instead of making work directories, i just make a ram disk 2009-05-05 10:30 takes half a minute off a build time of a kernel 2009-05-05 10:30 ns latency certainly helps 2009-05-05 10:32 yea, i'm thinking of cooking up some scripts that'd just put the tmp stuff on ram, like all the stupid web browser cache stuff, oh so famous now for IO issues 2009-05-05 10:33 cool 2009-05-05 10:36 hm, why would just firefox cache in ram instead of disk anyway? 
or at least give people an option 2009-05-05 12:26 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-05-05 12:56 -!- dcg(~dcg@3.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-05 17:36 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-05 20:31 06:56 < hirofumi> http://userweb.kernel.org/~hirofumi/notes/note.flush 2009-05-05 20:31 06:56 < hirofumi> sorry, I've pushed it to kernel.org 2009-05-05 20:31 06:57 < hirofumi> another complex point of log records in (1a)-(1c) is, 2009-05-05 20:31 sorry 2009-05-05 20:44 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-05 21:22 samlh :) 2009-05-05 21:23 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-05 22:07 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-06 06:40 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-06 06:52 -!- dcg(~dcg@96.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-06 06:57 -!- dcg_(~dcg@59.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-06 07:32 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-06 07:56 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-06 07:57 -!- tim_dimm__(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-06 09:23 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-06 09:42 -!- bd__(~foo@satoko.is.fushizen.net) has joined #tux3 2009-05-06 09:42 -!- bh_(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-05-06 09:42 -!- rayvd_(rayvd@arthur.bludgeon.org) has joined #tux3 2009-05-06 09:42 -!- shapor_(~shapor@yzf.shapor.com) has joined #tux3 2009-05-06 09:42 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-05-06 09:42 -!- flips(~daniel@phunq.net) has joined #tux3 2009-05-06 09:42 -!- 
kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-05-06 09:53 -!- flipz(~phillips@phunq.net) has joined #tux3 2009-05-06 11:19 -!- ijuz__(~ijuz@p5B1266D9.dip.t-dialin.net) has joined #tux3 2009-05-06 11:48 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-06 16:14 -!- samlh(~sam@67.129.121.145) has joined #tux3 2009-05-06 21:43 -!- ijuz__(~ijuz@p5B126377.dip.t-dialin.net) has joined #tux3 2009-05-06 22:34 -!- RazvanM(~RazvanM@96.234.242.16) has joined #tux3 2009-05-07 09:05 -!- RazvanM(~RazvanM@96.234.242.16) has joined #tux3 2009-05-07 09:25 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-07 11:45 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-07 14:02 -!- tim_dimm(~timothyhu@72.244.170.106) has joined #tux3 2009-05-07 14:23 -!- dcg(~dcg@84.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-07 14:45 -!- tim_dimm(~timothyhu@72.244.170.106) has joined #tux3 2009-05-07 14:59 -!- tim_dimm(~timothyhu@72.244.170.106) has joined #tux3 2009-05-07 15:08 -!- Yoshimi(~kvirc@110.Red-83-46-49.dynamicIP.rima-tde.net) has joined #tux3 2009-05-07 15:56 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-07 16:49 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-07 16:59 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-07 21:50 -!- ijuz__(~ijuz@p5B126427.dip.t-dialin.net) has joined #tux3 2009-05-07 22:00 -!- ajonat(~ajonat@190.48.109.66) has joined #tux3 2009-05-07 22:08 -!- ijuz__(~ijuz@p5B1273D9.dip.t-dialin.net) has joined #tux3 2009-05-07 23:38 -!- RazvanM(~RazvanM@96.234.242.16) has joined #tux3 2009-05-08 06:17 -!- dcg(~dcg@122.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-08 06:34 -!- ijuz_(~ijuz@p5B1273D9.dip.t-dialin.net) has joined #tux3 2009-05-08 06:58 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-08 08:02 let me see, what is the best 
way to tell if stdin is coming from a file or a pipe? 2009-05-08 08:03 S_ISREG(something(stdin)) ? 2009-05-08 08:15 S_ISREG(fileno(stdin)) maybe 2009-05-08 08:21 actually... I want to distinguish between console vs file piped to a command 2009-05-08 08:22 that is, the difference between echo xxxx | ... and cat xxxx | ... 2009-05-08 08:23 istty()? 2009-05-08 08:25 whoops, isatty(fd) 2009-05-08 08:25 actually, probably, isatty(STDOUT_FILENO) 2009-05-08 08:26 right 2009-05-08 08:28 isatty(fileno(stdin)) works 2009-05-08 08:28 thanks 2009-05-08 08:29 my example above is wrong 2009-05-08 08:30 I didn't mean echo vs cat, but echo or cat vs console 2009-05-08 08:30 yes 2009-05-08 08:30 just out of interest, how to distinguish between echo vs cat? 2009-05-08 08:30 I suppose there may be no difference when looking at stdin, the difference is only on the sending side 2009-05-08 08:31 yes 2009-05-08 08:31 either way is a pipe, and libc does not optimize by leaving out the pipe for cat 2009-05-08 08:31 it would take an OS-specific way to know it 2009-05-08 08:31 although I wonder why it does not do that 2009-05-08 08:32 leaving out? 2009-05-08 08:33 yes, leaving out the pipe 2009-05-08 08:33 so fileno(stdin) is a file, not a pipe 2009-05-08 08:33 should be able to read from it the same way 2009-05-08 08:34 cat is a different process, and the pipe is set up by bash 2009-05-08 08:34 so... it's because two systems can't cooperate?
2009-05-08 08:34 seems likely 2009-05-08 08:35 if "cat" is builtin command of bash, bash will do 2009-05-08 08:35 like echo 2009-05-08 08:35 ah 2009-05-08 08:35 so bash can optimize it by intercepting cat 2009-05-08 08:35 I think I have seen this happen 2009-05-08 08:36 via strace 2009-05-08 08:36 actually, bash setup the stdin/stdout as pipe, then fork-exec 2009-05-08 08:37 so, child processes can comminucate via pipe 2009-05-08 08:37 without help of bash 2009-05-08 08:38 -!- dcg_(~dcg@139.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-08 08:42 here is something strange: "echo foo | time " executes /usr/bin/time while "time " executes the bash builtin 2009-05-08 08:43 I guess you used the strace to see it? 2009-05-08 08:44 actually, just noticed the different output form 2009-05-08 08:44 yes 2009-05-08 08:44 /usr/bin/time and builtin time is different on debian at least 2009-05-08 08:45 iirc, builtin time has some options 2009-05-08 08:45 it's funny that bash misses the chance to call the builtin as destination of a pipe, I guess it is hard to pipe to a builtin 2009-05-08 08:45 I am not sure why it should be hard to pipe to a builtin 2009-05-08 08:45 sounds like an internal bash design difficulty 2009-05-08 08:46 builtin time can write, however, it seems can't read 2009-05-08 08:47 time ls | cat 2009-05-08 08:47 it seems to be using builtin 2009-05-08 08:47 so, one thing time does is pass its own stdin to the called prog 2009-05-08 08:47 maybe the bash builtin can't do that 2009-05-08 08:48 iirc, it would be only time 2009-05-08 08:48 because, iirc, echo can do it 2009-05-08 08:49 I see 2009-05-08 08:49 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-08 08:51 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-08 08:51 wow, it's hard to believe that calling echo opens /dev/urandom 2009-05-08 08:51 maybe it is a good thing that glibc has been forked 2009-05-08 08:52 program startup overhead is getting 
silly 2009-05-08 08:52 open("/usr/lib/locale/en_US.UTF-8/LC_CTYPE", O_RDONLY) = -1 ENOENT (No such file or directory) 2009-05-08 08:52 there are lots of these, just for an echo 2009-05-08 08:52 that is disgusting 2009-05-08 08:52 ok 2009-05-08 08:52 some real work now :) 2009-05-08 09:25 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-08 09:26 flips: yeah lots of stupid overhead 2009-05-08 09:26 i've l/strace'ed bash quite a few times 2009-05-08 09:33 I'll go check out the debian fork now 2009-05-08 09:35 http://www.eglibc.org/home 2009-05-08 09:36 um... 2009-05-08 09:36 I'm not sure glibc is wrong or not 2009-05-08 09:36 yes, glibc is big unexpectly 2009-05-08 09:37 however, posix is also big enough, if it is including extensions 2009-05-08 09:38 especially, i18n is big and complex 2009-05-08 09:38 well, if glibc can be separated core part and others, it may be good 2009-05-08 09:40 I don't know, how does eglibc handle those 2009-05-08 09:43 I hope eglibc is not just alternative of dietlibc, uclibc, or such 2009-05-08 09:44 well there are a log of problems with glibc implementation 2009-05-08 09:44 it does way too many failed opens of program exec 2009-05-08 09:45 and nscd is seriously broken, for a number of reasons 2009-05-08 09:45 including breaking library versioning, which breaks static builds 2009-05-08 09:45 well, library versioning and static builds are separately broken by nscd 2009-05-08 09:45 just a bad design 2009-05-08 09:46 need to look at eblibc and see what they plan there 2009-05-08 09:47 even embedded needs something like nscd, although it really should not be a separately executed daemon 2009-05-08 09:47 execed I mean 2009-05-08 09:47 another really bad idea is using a well known port for nscd 2009-05-08 09:51 I don't use nscd at all though, why nscd is so bad? 2009-05-08 09:52 I meant why shouldn't it run as daemon? 
2009-05-08 09:53 suppose you have two different versions of libc that use different protocols to communicate with the daemon 2009-05-08 09:53 this actually happens 2009-05-08 09:53 it does not work well, to say it nicely 2009-05-08 09:54 ah, well, nss is separating with libc 2009-05-08 09:54 libnss* 2009-05-08 09:54 it can run as a daemon, but it should not be execed as a separate program, that is just wrong 2009-05-08 09:54 nss is inseparable from libc 2009-05-08 09:54 unfortunately 2009-05-08 09:55 oh 2009-05-08 09:55 gethostbyname needs nss 2009-05-08 09:56 ACTION dimly remembers that 2009-05-08 09:56 other things 2009-05-08 09:56 getuid I think 2009-05-08 09:56 shapor knows these better than I do 2009-05-08 09:57 get user groups, whatever that is called 2009-05-08 09:57 actually, gethostbyname needs nscd, not nss 2009-05-08 09:58 various get user things need nss 2009-05-08 09:59 umm..., I think nscd is not needed for gethostbyname 2009-05-08 10:00 nscd is caching daemon, iirc 2009-05-08 10:00 actually, I'm not running it 2009-05-08 10:00 yes, it needs nss 2009-05-08 10:01 nscd is not needed by gethostbyname, but without it performance is very bad 2009-05-08 10:02 on some setup, yes 2009-05-08 10:02 right 2009-05-08 10:02 so nscd has to be an option 2009-05-08 10:02 the way it is, it is an option, but implemented in a perverse way 2009-05-08 10:03 sort of a mandatory option, which you will see just by compiiling a program with -static 2009-05-08 10:03 ah, static 2009-05-08 10:03 well it is not the fault if static, the issue is, glibc's nscd implementation breaks library versioning 2009-05-08 10:04 i see 2009-05-08 10:04 it breaks static because the daemon cannot be embedded in a versioned way 2009-05-08 10:04 i see 2009-05-08 10:06 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-05-08 10:39 morning sejeff :) 2009-05-08 10:42 flips, morning flips 2009-05-08 10:44 flips, So hows that log replay code coming? 
2009-05-08 10:47 -!- cwood(~cwood@66.151.59.138) has joined #tux3 2009-05-08 10:48 hi SEJeff 2009-05-08 10:48 cwood, hey 2009-05-08 11:10 sejeff, hirofumi is making a userspace prototype 2009-05-08 11:10 I am doing "other stuff" 2009-05-08 11:20 -!- dcg(~dcg@20.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-08 11:42 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-08 11:49 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-08 12:04 -!- pgquiles(~pgquiles@241.Red-88-0-137.dynamicIP.rima-tde.net) has joined #tux3 2009-05-08 13:57 hey flips 2009-05-08 15:08 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-08 16:54 -!- ijuz__(~ijuz@p5B126ED2.dip.t-dialin.net) has joined #tux3 2009-05-08 19:22 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-08 21:33 -!- ijuz_(~ijuz@p5B126305.dip.t-dialin.net) has joined #tux3 2009-05-08 21:42 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-05-08 23:31 -!- RazvanM(~RazvanM@96.234.242.16) has joined #tux3 2009-05-09 00:26 -!- dcg(~dcg@102.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-09 02:38 -!- pgquiles(~pgquiles@241.Red-88-0-137.dynamicIP.rima-tde.net) has joined #tux3 2009-05-09 07:57 -!- rayvd(rayvd@arthur.bludgeon.org) has joined #tux3 2009-05-09 08:17 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-09 09:18 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-09 09:27 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-09 10:03 ok 2009-05-09 10:04 I've added the logoffset and related stuff 2009-05-09 10:04 http://userweb.kernel.org/~hirofumi/atomic.tar.gz 2009-05-09 10:04 http://userweb.kernel.org/~hirofumi/notes/ 2009-05-09 10:04 http://userweb.kernel.org/~hirofumi/atomic.png 2009-05-09 10:04 the above are updated files 2009-05-09 10:05 it's not so tested, and may not be efficient way 2009-05-09 10:05 but, 
delta logs and fdelta logs are merged into same blocks 2009-05-09 10:06 logcount/logoffset and next_logcount/next_logoffset are in superblock 2009-05-09 10:07 structure to manage the defree of log blocks in sb->decycle and sb->new_decycle 2009-05-09 10:07 those may be able to merge to one ring buffer 2009-05-09 10:07 however, now, it's separated 2009-05-09 10:10 well, so, I'll back to btree logging at next 2009-05-09 10:23 ah, btw, my naming of variables would be bad like usual 2009-05-09 10:23 good name are wellcome 2009-05-09 10:24 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-09 10:29 ah, and if we handle the deflush like log block defree, it would make simple 2009-05-09 10:29 I'm not thinking about, how unefficient it is 2009-05-09 10:30 though 2009-05-09 10:49 well, so, if we revisit (and rethink) to it with real performance, it may be good 2009-05-09 13:42 howdy 2009-05-09 14:19 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-09 14:55 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-09 18:02 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-09 19:57 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-09 20:22 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-05-09 20:27 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-09 21:25 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-09 21:49 -!- ijuz_(~ijuz@p5B1275D7.dip.t-dialin.net) has joined #tux3 2009-05-09 22:07 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-09 23:43 -!- RazvanM(~RazvanM@96.234.242.16) has joined #tux3 2009-05-10 06:42 -!- pgquiles(~pgquiles@62.Red-81-39-154.dynamicIP.rima-tde.net) has joined #tux3 2009-05-10 06:45 -!- dcg(~dcg@60.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-10 07:27 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-05-10 07:41 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined 
#tux3 2009-05-10 07:48 -!- dcg(~dcg@60.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-10 08:20 -!- dcg(~dcg@60.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-10 08:29 -!- dcg(~dcg@60.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-10 08:43 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-10 08:51 -!- dcg(~dcg@56.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-10 09:06 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-10 09:07 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-10 09:14 -!- flips(~phillips@phunq.net) has joined #tux3 2009-05-10 09:50 -!- dcg(~dcg@56.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-10 09:56 -!- flips(~phillips@phunq.net) has joined #tux3 2009-05-10 10:31 -!- flips(~phillips@phunq.net) has joined #tux3 2009-05-10 11:40 -!- pgquiles(~pgquiles@36.Red-83-53-120.dynamicIP.rima-tde.net) has joined #tux3 2009-05-10 13:04 -!- dcg(~dcg@131.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-10 13:12 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-10 15:28 -!- flipz(~phillips@phunq.net) has joined #tux3 2009-05-10 21:53 -!- ijuz_(~ijuz@p5B127377.dip.t-dialin.net) has joined #tux3 2009-05-10 23:09 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-11 01:26 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-05-11 01:29 -!- pgquiles__(~pgquiles@36.Red-83-53-120.dynamicIP.rima-tde.net) has joined #tux3 2009-05-11 03:16 -!- pythonstar(~kavli@c-affbe455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-05-11 04:12 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-11 05:39 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-05-11 06:16 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-11 07:49 -!- flipz(~daniel@phunq.net) has 
joined #tux3 2009-05-11 09:18 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-11 09:26 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-05-11 09:32 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-11 12:22 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-11 14:41 -!- dcg(~dcg@39.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-11 18:03 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-11 20:41 -!- ajonat(~ajonat@190.48.115.14) has joined #tux3 2009-05-11 21:28 -!- ijuz__(~ijuz@p5B126B42.dip.t-dialin.net) has joined #tux3 2009-05-11 23:45 -!- RazvanM(~RazvanM@96.234.242.16) has joined #tux3 2009-05-12 06:40 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-12 06:51 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-12 07:22 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-12 08:00 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-12 08:27 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-12 10:21 -!- flips_(~daniel@phunq.net) has joined #tux3 2009-05-12 10:21 good morning 2009-05-12 11:12 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-12 11:35 -!- dcg(~dcg@44.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-12 11:43 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-12 12:05 hey flipz 2009-05-12 13:09 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-12 16:15 -!- tim_dimm(~timothyhu@rrcs-64-183-50-58.west.biz.rr.com) has joined #tux3 2009-05-12 17:22 -!- ajonat(~ajonat@190.48.104.7) has joined #tux3 2009-05-12 17:27 -!- ajonat(~ajonat@190.48.111.174) has joined #tux3 2009-05-12 18:36 -!- ajonat(~ajonat@190.48.111.174) has joined #tux3 
2009-05-12 19:21 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-05-12 21:29 -!- marcin_(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-05-12 21:44 -!- ijuz__(~ijuz@p5B126BE9.dip.t-dialin.net) has joined #tux3 2009-05-12 21:59 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-12 22:04 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-12 22:47 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-13 03:48 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-13 04:52 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-13 07:04 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-13 07:17 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-13 07:31 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-13 08:50 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-13 11:06 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-05-13 11:23 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-05-13 15:44 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-13 16:16 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-05-13 18:34 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-05-13 19:42 hirofumi, asleep probably? 2009-05-13 19:42 yes, almost 2009-05-13 19:42 heh 2009-05-13 19:42 well 2009-05-13 19:43 fine posts from you, which I did not read 2009-05-13 19:43 I will fix that 2009-05-13 19:43 which one? 2009-05-13 19:43 "current my atomic-commit prototype" 2009-05-13 19:44 ah 2009-05-13 19:44 which code are you going to fix? 2009-05-13 19:45 I don't know yet, what needs fixing? 
2009-05-13 19:45 btw, my current patchset was updated 2009-05-13 19:45 currently, I'm trying to implement btree logging 2009-05-13 19:46 and, I noticed the insert_leaf() may have a bug 2009-05-13 19:46 and next one would be replay 2009-05-13 19:47 ok, well there are some obvious things to do in replay that I have partly done 2009-05-13 19:47 split and merge of btree nodes 2009-05-13 19:47 yes, and others 2009-05-13 19:48 well, now, I'm trying to implement create() path at first 2009-05-13 19:49 http://userweb.kernel.org/~hirofumi/notes/note_flush-create.txt 2009-05-13 19:49 this note is what I'm doing now 2009-05-13 19:50 create is in fact what I used in my first userspace test (checked in I think) 2009-05-13 19:51 well, so, FIXME is not implemented yet 2009-05-13 19:51 what is the column on the right in -create.txt? 2009-05-13 19:51 ah 2009-05-13 19:51 there is a notation at the bottom 2009-05-13 19:51 right column is what logging is doing in code 2009-05-13 19:52 oh, very interesting notation 2009-05-13 19:52 nice 2009-05-13 19:52 maybe, I'm going to create this type of note for every path 2009-05-13 19:53 well, my plan is to complete create() path 2009-05-13 19:53 then, try to implement replay 2009-05-13 19:54 then, try the other paths (both logging and replay) 2009-05-13 19:55 ACTION thinks about the fixme 2009-05-13 19:55 ah, before those, it needs to test with -DATOMIC 2009-05-13 19:55 ah, no 2009-05-13 19:55 replay would be first 2009-05-13 19:55 then, test with -DATOMIC 2009-05-13 19:55 directory data... 
should not need any special treatment 2009-05-13 19:56 yes 2009-05-13 19:56 those FIXME should be easy 2009-05-13 19:56 ok, then inode change for new root then 2009-05-13 19:57 yes 2009-05-13 19:57 I think we just need to make sure the inode is scheduled to be flushed in the same delta 2009-05-13 19:57 well, it is almost same with store_attrs 2009-05-13 19:57 yes 2009-05-13 19:58 yes 2009-05-13 19:58 basic code was already done (dirty inodes list) 2009-05-13 19:58 an inode write in tux3 is just store_attrs 2009-05-13 19:58 well, needs to rethink if those is fine 2009-05-13 19:58 right 2009-05-13 19:58 yes, it needs to be thought about carefully 2009-05-13 19:59 it seems simple though 2009-05-13 19:59 yes 2009-05-13 19:59 the only really hard thing is what we discussed... the flush cycle, handling the deferred frees of log blocks 2009-05-13 19:59 clearly, it is not far from needing replay 2009-05-13 20:00 well I better check in my partly done patch then 2009-05-13 20:00 log defree was already done though 2009-05-13 20:00 clean up and check in 2009-05-13 20:01 it might need to revisit though 2009-05-13 20:01 ah 2009-05-13 20:01 there are not any [%] entries (good) 2009-05-13 20:01 I'm thinking to delay check-in until replay done 2009-05-13 20:02 [%] is happned on only flush_log() path 2009-05-13 20:02 let me see what diff I have now 2009-05-13 20:03 I would be need to udpate tarball 2009-05-13 20:03 + LOG_INDEX_SPLIT, 2009-05-13 20:03 + LOG_INDEX_MERGE, 2009-05-13 20:03 + LOG_INDEX_UPDATE, 2009-05-13 20:03 + LOG_INDEX_DELETE, 2009-05-13 20:03 things like that 2009-05-13 20:03 oh 2009-05-13 20:03 those are implemented? 
2009-05-13 20:03 just the declarations 2009-05-13 20:03 ah, ok 2009-05-13 20:04 and a log_split() 2009-05-13 20:04 +void log_split(struct sb *sb, block_t newblock, block_t oldblock) 2009-05-13 20:04 enum { 2009-05-13 20:04 LOG_BALLOC = 0x33, /* Log of block allocation */ 2009-05-13 20:04 LOG_BFREE, /* Log of freeing block */ 2009-05-13 20:04 LOG_BFREE_ON_FLUSH, /* Log of freeing block after next cycle */ 2009-05-13 20:04 LOG_LEAF_REDIRECT, /* Log of leaf redirect */ 2009-05-13 20:04 LOG_BNODE_REDIRECT, /* Log of bnode redirect */ 2009-05-13 20:04 LOG_BNODE_ROOT, /* Log of new bnode root allocation */ 2009-05-13 20:04 LOG_BNODE_SPLIT, /* Log of spliting bnode to new bnode */ 2009-05-13 20:04 LOG_BNODE_ADD, /* Log of adding bnode entry */ 2009-05-13 20:04 LOG_BNODE_UPDATE, /* Log of bnode entry update */ 2009-05-13 20:04 LOG_TYPES 2009-05-13 20:04 }; 2009-05-13 20:04 this is in my new patchset 2009-05-13 20:05 (not uploaded) 2009-05-13 20:05 good, you covered all of mine 2009-05-13 20:05 I think split is need to position to split 2009-05-13 20:05 LOG_BNODE_SPLIT == LOG_INDEX_SPLIT 2009-05-13 20:05 yes 2009-05-13 20:06 merge and delete is not needed for create() path 2009-05-13 20:06 right 2009-05-13 20:06 so, I'm not thinking about those 2009-05-13 20:06 yet 2009-05-13 20:06 it's fine to leave them out now 2009-05-13 20:07 yes 2009-05-13 20:07 same with _DELETE 2009-05-13 20:07 yes 2009-05-13 20:40 -!- ajonat(~ajonat@190.48.111.174) has joined #tux3 2009-05-13 21:47 -!- ijuz(~ijuz@p5B12684E.dip.t-dialin.net) has joined #tux3 2009-05-13 23:07 hey flips 2009-05-14 00:29 -!- pgquiles(~pgquiles@36.Red-83-53-120.dynamicIP.rima-tde.net) has joined #tux3 2009-05-14 03:46 -!- tux3bot(~tux3bot@yzf.shapor.com) has joined #tux3 2009-05-14 03:50 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-05-14 05:10 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-14 08:20 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 
2009-05-14 10:24 good morning 2009-05-14 10:24 howdy 2009-05-14 10:25 mornin' flips, marcin, et al 2009-05-14 10:25 hey, east coast is awake 2009-05-14 10:25 marcin, flash your east coast gang sign 2009-05-14 10:26 east sayyd (of mississippi)! 2009-05-14 10:26 :-) 2009-05-14 10:27 http://patrick-nagel.net/blog/archives/125 can i get your analysis on this? 2009-05-14 10:28 marcin, ooh topical 2009-05-14 10:29 and maybe we mingming can check my analsis 2009-05-14 10:29 s/we// 2009-05-14 10:29 i fighred that'd tickle your pickle 2009-05-14 10:30 you make consider my pickle tickled 2009-05-14 10:32 i'm about to nap due to overdosing on fried rice with chinese sauseges, so i figured if you explain it to me, it'd keep me awake ;) 2009-05-14 10:33 "From loading the kernel to KDM being ready for login on a" <- some text seems to be missing 2009-05-14 10:33 it made me think how stuff is allocated on the disk should be dependant on how things are accessed in groups 2009-05-14 10:33 ooh, the video quality blows 2009-05-14 10:33 i cant few the videos at work, that will have to work for after 5pm :( 2009-05-14 10:34 anyway, this process is called "aging" the filesystem 2009-05-14 10:34 is there a way to hook into the fs so it would keep track of how io requests get sequenced, so i could build some sort of probabilistic chains or something... 
2009-05-14 10:35 it would be really useful to know with distro he was using 2009-05-14 10:36 It's fairly clear what is going on 2009-05-14 10:36 he's german and talking about kdm, so my bet is SuSE 2009-05-14 10:37 he's doing something like apt-get dist-upgrade, between episodes of filling the disk with jpegs, pdfs and pr0n 2009-05-14 10:37 probably not ever doing apt-get clean, or the mad^Wredhat equivalent 2009-05-14 10:38 i dont think rpm has an analog 2009-05-14 10:38 as a result, he ends up with much longer seeks on init as it loads the updated libraries and binaries, now scattered around the disk 2009-05-14 10:38 but than again last time i used RH was like '98 2009-05-14 10:39 the reason the updated files get scattered is, the spaces where they really belong are filled with the undeleted old versions, and nearby space is filled with pr0n 2009-05-14 10:39 so why not cram the new libs into the spots of the old libs, as long as the originals were placed correctly? 2009-05-14 10:39 because the originals are there 2009-05-14 10:39 hmm...my momma taught me to keep my libraries and my pr0n separate 2009-05-14 10:40 if you're upgrading, should'nt you be nuking the old libs? 2009-05-14 10:40 you generally install the new one and don't delete the original until you are _sure_ the new one is ok 2009-05-14 10:40 that means the new one has to go in a new place, typically far away 2009-05-14 10:40 or is this yet another side effect of trying to maintain backwards compatibility 2009-05-14 10:40 under this load 2009-05-14 10:41 imagine what would happen if libc upgrade failed partway through, causing a restart, and the old one was already gone 2009-05-14 10:41 then why not move the whole cluster of files to a new location? 
2009-05-14 10:41 you are talking rescue cd at best 2009-05-14 10:42 ext3 and every other known filesystem that actually functions is not smart enough to move clusters 2009-05-14 10:42 and even if it was, people would probably turn the option off 2009-05-14 10:42 yea, but some new one some geniuses are working on should be smart enough ;) 2009-05-14 10:42 because it is annoying for your disk to trundle on for seconds or minutes longer than expected 2009-05-14 10:43 sure, it should be possible to do a better job 2009-05-14 10:43 ext3 takes a rather static view of things 2009-05-14 10:43 it relies pretty much entirely on orlov alocator 2009-05-14 10:43 (see wikipedia) 2009-05-14 10:44 ext4 is much the same 2009-05-14 11:33 hey flipz 2009-05-14 13:02 -!- dcg(~dcg@246.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-14 15:37 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-14 18:13 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-14 18:21 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-14 21:28 -!- ijuz_(~ijuz@p5B126942.dip.t-dialin.net) has joined #tux3 2009-05-15 06:19 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-15 09:07 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-15 09:07 morning 2009-05-15 09:34 hi 2009-05-15 09:34 morning 2009-05-15 09:45 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-05-15 12:09 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-15 12:35 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-15 13:26 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-15 13:29 -!- npmccallum_(~npmccallu@76.177.118.207) has joined #tux3 2009-05-15 15:33 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-15 15:46 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 
2009-05-15 20:18 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-15 20:26 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-15 20:33 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-15 20:39 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-15 21:44 -!- ijuz_(~ijuz@p5B1265EE.dip.t-dialin.net) has joined #tux3 2009-05-15 23:43 -!- RazvanM(~RazvanM@96.234.242.16) has joined #tux3 2009-05-16 06:02 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-16 06:02 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2009-05-16 08:42 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-16 09:01 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-05-16 09:21 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-16 10:24 -!- strelok7(~user@CBL217-132-113-13.bb.netvision.net.il) has joined #tux3 2009-05-16 11:54 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-16 11:59 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-16 12:17 I can officially declare that my own email uptime this month has more nines than google 2009-05-16 12:18 a one hour outage in the month drops google to two nines 2009-05-16 12:19 whereas, this month phunq.net has zero outages 2009-05-16 12:19 tux3.org is even better I think 2009-05-16 12:19 let's see what netcraft says 2009-05-16 12:20 http://searchdns.netcraft.com/?restriction=site+contains&host=tux3.org&lookup=wait..&position=limited 2009-05-16 12:21 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-16 12:22 Site rank 497791 2009-05-16 12:22 is that good? 2009-05-16 12:22 tux3? 
2009-05-16 12:22 tux3.org 2009-05-16 12:23 in the top 500,000 sites I think it means 2009-05-16 12:23 I'd think that's pretty impressive 2009-05-16 12:23 I'm not sure 2009-05-16 12:24 http://uptime.netcraft.com/up/graph?site=tux3.org <- apparently we don't qualify for uptime tracking 2009-05-16 12:24 uptime 2009-05-16 12:24 12:24:24 up 35 days, 14:00, 4 users, load average: 0.13, 0.04, 0.01 <- phunq.net 2009-05-16 13:40 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-16 16:13 -!- edt(~Ed@254-78.162.dsl.aei.ca) has joined #tux3 2009-05-16 19:39 hey flips 2009-05-16 19:54 hi, flips, there? 2009-05-16 19:54 hi 2009-05-16 19:54 hi 2009-05-16 19:54 I found the bug of kernel code 2009-05-16 19:55 ACTION listens 2009-05-16 19:55 now, I think kernel code doesn't write the inode 2009-05-16 19:55 however, it needs to update the inode for btree root 2009-05-16 19:55 ywa 2009-05-16 19:55 yes 2009-05-16 19:55 hey folks 2009-05-16 19:55 I can't see why those were working 2009-05-16 19:56 hi 2009-05-16 19:56 you mean, you don't know why it works now? 2009-05-16 19:56 or you don't see how it should work with logging? 2009-05-16 19:56 now 2009-05-16 19:56 without logging 2009-05-16 19:57 well, I tested only small backing storage 2009-05-16 19:57 it's part of write_inodes (I forget the exact name) 2009-05-16 19:57 ACTION heads to lxr 2009-05-16 19:57 so, probably, bitmap didn't need new root 2009-05-16 19:58 now, write_inode has BUG_ON(inode == bitmap) 2009-05-16 19:58 correct 2009-05-16 19:58 it never needed a new root 2009-05-16 19:58 I don't think we need to fix that for the existing, non-atomic code 2009-05-16 19:59 first new root is how big backing storge... 
2009-05-16 19:59 we do need a way of driving inode flushes 2009-05-16 19:59 now, I'm writing the inodes flush codes 2009-05-16 19:59 ah, good 2009-05-16 19:59 in writeback manner 2009-05-16 19:59 it is a part that I did not think about very much 2009-05-16 19:59 I knew it had to be done 2009-05-16 20:00 our inode flush will be driven by deltas 2009-05-16 20:00 well, so, I noticed it would have the bug 2009-05-16 20:00 yes 2009-05-16 20:00 I don't think we need to fix that bug, if I understand you correctly 2009-05-16 20:01 we just need to have a flush_inodes at every delta 2009-05-16 20:01 however, I'll write the codes with non-atomic commit manner 2009-05-16 20:01 fine 2009-05-16 20:01 code = inode? 2009-05-16 20:01 inodes flushing codes 2009-05-16 20:01 ah 2009-05-16 20:02 do you want to talk about how it should work when finished? 2009-05-16 20:02 actully, we are using tuxsync(inode) for now 2009-05-16 20:02 no 2009-05-16 20:02 ok 2009-05-16 20:02 I want to know why it was working 2009-05-16 20:02 iirc, you are trying the rootfs 2009-05-16 20:02 because we don't change inode roots in the current revision 2009-05-16 20:03 so, I thought if it was big, bitmap would need to new root 2009-05-16 20:03 as far as I know, the inodes are only guaranteed to be flushed at umount when I ran as root_fs 2009-05-16 20:03 I expected that I would need to remake the filesystem if I crashed 2009-05-16 20:03 and I did not crash :) 2009-05-16 20:04 no 2009-05-16 20:04 bitmap inode to ileaf is never happen 2009-05-16 20:04 hmm 2009-05-16 20:04 if it happen on current code, it should go BUG_ON() 2009-05-16 20:04 let me think 2009-05-16 20:05 I guess we set the file size at file create time 2009-05-16 20:05 and we never have to update anything in the bitmap inode after create 2009-05-16 20:05 yes, I was thinking so 2009-05-16 20:05 this is probably the explanation 2009-05-16 20:05 however, I noticed the btree root is needed to update 2009-05-16 20:06 you mean, the root of the bitmap 
inode should be updated for atomic commit? 2009-05-16 20:06 no 2009-05-16 20:07 I just mean, if bitmap was not updated, it shouldn't work on big storage 2009-05-16 20:08 however, we didn't noticed until now 2009-05-16 20:08 only if the bitmap btree gets bigger than 511 blocks 2009-05-16 20:08 hmm 2009-05-16 20:09 255 blocks 2009-05-16 20:09 that covers 34 GB I think 2009-05-16 20:10 ah 2009-05-16 20:10 your rootfs is smaller than 34GB? 2009-05-16 20:10 probably 2009-05-16 20:10 I didn't tested bigger than 34GB 2009-05-16 20:10 ah, ok 2009-05-16 20:10 and only bitmap blocks that are actually used take up space in the index 2009-05-16 20:11 most of the fs isn't used 2009-05-16 20:11 ah, yes 2009-05-16 20:11 my rootfs was bigger than 34 GB 2009-05-16 20:11 um 2009-05-16 20:11 wait 2009-05-16 20:11 no, it was around 5-10 GB 2009-05-16 20:12 anyway, it would have been ok, because we do not create the bitmap blocks until the first store into that region 2009-05-16 20:12 ok 2009-05-16 20:12 well, so, maybe, I noticed the bug :) 2009-05-16 20:13 only if you actually filled the filesystem with some data. Did you? 2009-05-16 20:13 no 2009-05-16 20:13 I just noticed it with code review 2009-05-16 20:14 in other news... I compiled ddsnap for 2.6.29.3 today 2009-05-16 20:15 I am thinking about porting our improved userspace buffer code back to ddsnap 2009-05-16 20:15 many of the techniques we developed for tux3 can also be used for ddsnap 2009-05-16 20:15 and there are still very good reasons for having a snapshotting block device, in addition to a snapshotting filesystem 2009-05-16 20:16 it would be good 2009-05-16 20:16 the block device is simpler in many ways, which means that we can get some cross testing of code 2009-05-16 20:16 so... in the past, ddsnap had two main drawbacks 2009-05-16 20:17 what are those? 
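The 34 GB estimate given above for 255 bitmap blocks can be checked with a little arithmetic. This is only a sketch: it assumes 4096-byte blocks and one bitmap bit per filesystem block, neither of which is stated in the log, and `bitmap_coverage_bytes` is a hypothetical helper, not a tux3 function.

```c
#include <stdint.h>

/*
 * Bytes of volume covered by `bitmap_blocks` allocation-bitmap blocks.
 * Assumption (not from the log): each bitmap bit tracks exactly one
 * filesystem block of `block_size` bytes.
 */
uint64_t bitmap_coverage_bytes(uint64_t bitmap_blocks, uint64_t block_size)
{
	uint64_t bits_per_block = block_size * 8;	/* blocks tracked per bitmap block */

	return bitmap_blocks * bits_per_block * block_size;
}
```

Under those assumptions, `bitmap_coverage_bytes(255, 4096)` is 34,225,520,640 bytes, roughly 34.2 GB in decimal units, which agrees with the figure quoted in the conversation.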
2009-05-16 20:17 1) it would copy free space needlessly 2009-05-16 20:17 2) writes slow down by a factor of up to 9 2009-05-16 20:18 the tux3 atomic commit model will fix problem number 2 2009-05-16 20:18 i see 2009-05-16 20:18 and problem 1 can be fixed by a new kernel api, to indicate which blocks on the device are unused 2009-05-16 20:19 with those two changes, the snapshotting block device would be very efficient 2009-05-16 20:19 copy free space is defragment? 2009-05-16 20:19 copying free space wastes IO bandwidth and wastes space in the snapshot store 2009-05-16 20:19 why need to copy free space? 2009-05-16 20:20 both problems are solved if the block device knows which blocks are free 2009-05-16 20:20 ah 2009-05-16 20:20 right now, ddsnap does not know which blocks the filesystem thinks are free, so if somebody writes to one, it has to copy the block to the snapshot store whether the filesystem thinks the block is free or not 2009-05-16 20:21 it might help with TRIM command or something 2009-05-16 20:21 exactly 2009-05-16 20:21 anyway, porting the improved buffer.c back to ddsnap would be a good thing to do 2009-05-16 20:22 probably 2009-05-16 20:22 I simplified the kernel patch by removing some proc output and so on 2009-05-16 20:22 those were just debugging features 2009-05-16 20:22 proc is not needed? 2009-05-16 20:22 ah 2009-05-16 20:22 it would be good 2009-05-16 20:22 they created maintenance problems 2009-05-16 20:23 so...
if we need to add a debugging interface back in, we will use a more robust approach 2009-05-16 20:23 if it's just debugging 2009-05-16 20:23 either debugfs, relayfs or ddlink 2009-05-16 20:23 ftrace or something may be better 2009-05-16 20:23 sure 2009-05-16 20:23 this output was for tracking statistics related to deadlock prevention 2009-05-16 20:24 ah 2009-05-16 20:24 and there was a sysctl added to turn on/off the deadlock prevention 2009-05-16 20:24 this was intended to prove that the deadlock prevention actually solved a problem 2009-05-16 20:24 but of course, we already knew that it did 2009-05-16 20:25 so the code is better without that extra complexity 2009-05-16 20:25 yes 2009-05-16 20:25 anyway, I will set up a ddsnap git tree and make it public on phunq.net 2009-05-16 20:25 if it's not enough simple, it makes maintainance hard 2009-05-16 20:26 it need to rename? 2009-05-16 20:26 it is needed to rename to other name? 2009-05-16 20:26 ddsnap will be fine 2009-05-16 20:26 ddsnap is copyfree? 2009-05-16 20:26 yes 2009-05-16 20:26 good 2009-05-16 20:27 ddsnap [t] is a trademark of Daniel Phillips, all rights reserved ;) 2009-05-16 20:27 ddsnap [tm] <- joke 2009-05-16 20:27 good :) 2009-05-16 20:28 ddsnap is (c) various contributors of course 2009-05-16 20:29 including, sistina, red hat and google 2009-05-16 20:29 and various individual contributors 2009-05-16 20:29 it is very solidly GPL 2009-05-16 20:30 soon it will be (c) Ogawa Hirofumi as well 2009-05-16 20:30 codes is ok, but the name may not be 2009-05-16 20:30 because of contributions to the buffer code 2009-05-16 20:30 the name is fine 2009-05-16 20:30 it's my name 2009-05-16 20:30 good 2009-05-16 20:30 ethereal and many others was needed to rename to other name 2009-05-16 20:30 this is not an issue 2009-05-16 20:31 good 2009-05-16 20:36 ddsnap can be offered for merging, after some cleanup 2009-05-16 20:36 where to merge? 2009-05-16 20:36 zumastor? 
2009-05-16 20:37 merge to mainline 2009-05-16 20:37 as a device mapper target 2009-05-16 20:37 ah 2009-05-16 20:37 one thing I want to do though, is add the ddlink interface 2009-05-16 20:37 oh 2009-05-16 20:37 as a means of simplifying the way ddsnap devices are configured 2009-05-16 20:38 anyway, that is enough ddsnap for now 2009-05-16 20:38 I just thought you would be interested 2009-05-16 20:38 yes 2009-05-16 20:38 actually, virtual block devices are an interesting topic 2009-05-16 20:39 linux is very weak in this area 2009-05-16 20:39 compared to bsd, solaris or even windows 2009-05-16 20:39 um... 2009-05-16 20:40 however, I think usually, why does it do in fs layer? 2009-05-16 20:43 well, anyway, I'm not tracking this area much 2009-05-16 20:45 what feature is needed? 2009-05-16 20:45 "why does it do in fs layer" <- I don't understand 2009-05-16 20:45 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-05-16 20:45 hi ckwood 2009-05-16 20:45 hey 2009-05-16 20:46 well, e.g. I imaged the snapshot feature 2009-05-16 20:46 ckwood, we were just talking about ddsnap, the snapshotting block device 2009-05-16 20:46 ah -- 2009-05-16 20:46 ah, ok 2009-05-16 20:46 how can i see what was said while i was disconnected? 2009-05-16 20:46 Isn't something like that already in the kernel, as dd-snapshot? 2009-05-16 20:46 er, dm-snapshot 2009-05-16 20:46 hirofumi, I think your question was, why do snapshots in the fs layer, was that it? 
2009-05-16 20:47 chkwood, it's on the web 2009-05-16 20:47 I meant, why do snaphosts in the block layer, not fs layer 2009-05-16 20:48 http://shapor.com/tux3/irclogs/ 2009-05-16 20:48 thx 2009-05-16 20:48 hirofumi, block layer snapshots work with every filesystem 2009-05-16 20:48 yes 2009-05-16 20:48 so you can even have snapshots with vfat 2009-05-16 20:48 however, it would not be efficient enough 2009-05-16 20:49 a secondary reason is, a block device snapshot is simpler than a filesystem snapshot, therefore more likely to work reliably 2009-05-16 20:49 a block level snapshot using the versioned pointers technique and tux3-style atomic commit will be very efficient 2009-05-16 20:49 it will run a close to platter speed 2009-05-16 20:50 however, block device can't block the users operations 2009-05-16 20:50 so, I think it has other issues 2009-05-16 20:50 there is no reason for the block device to block the operations of the filesystem 2009-05-16 20:51 because, user wants to snapshot of reasonable point 2009-05-16 20:51 user wants the snapshot 2009-05-16 20:51 the semantics of journalling filesystems, and other atomic commit strategies, work together accurately with the semantics of a snapshotting block device, so that there is no need ever to stall the filesystem for a snapshot 2009-05-16 20:52 we discovered this, and proved it with ext3 2009-05-16 20:52 "we" being the zumastor dev team 2009-05-16 20:53 it means block device should know the commit point of fs? 
2009-05-16 20:54 the block device does not care when the filesystem commits 2009-05-16 20:54 because the filesystem is always able to recover from any point in time 2009-05-16 20:54 ah 2009-05-16 20:54 the block device only cares that writes to the block device are recorded accurately 2009-05-16 20:55 however, user doesn't want to lose any data 2009-05-16 20:55 the user will not lose data, because block dev flush flushes the snapshotting block device to disk 2009-05-16 20:56 that is, journal commit forces media write, even for the virtual block device 2009-05-16 20:57 sure 2009-05-16 20:57 so, user do sync or something, after that, make snapshot? 2009-05-16 20:58 the user can do that if they want 2009-05-16 20:58 but it is not really necessary 2009-05-16 20:59 flips: did you port the bio throttling patch? 2009-05-16 20:59 shapor, yes 2009-05-16 20:59 cool :) 2009-05-16 20:59 without the statistics 2009-05-16 20:59 and without the sysctl interface for disabling it 2009-05-16 20:59 is the bio throttling still needed? 
2009-05-16 20:59 yes 2009-05-16 20:59 nothing has really changed in core kernel 2009-05-16 20:59 same old same old 2009-05-16 20:59 ah i thought something had, guess not 2009-05-16 21:00 congestion_wait still rears its head 2009-05-16 21:01 ACTION notices hirofumi is starting to get interested in block devices 2009-05-16 21:03 ok, the other thing I need to do today is apply and try hirofumi's patches 2009-05-16 21:03 let me see 2009-05-16 21:04 ah, and did my way through my 3,000 mail backlog 2009-05-16 21:04 which is 95% spam 2009-05-16 21:06 there were some reports that 2.6.24 made some congestion control deadlocks go away 2009-05-16 21:07 probably just the luck of the draw though 2009-05-16 21:07 just fiddling 2009-05-16 21:07 no actual progress 2009-05-16 21:07 like you said 2009-05-16 21:08 by the way, those would be the congestion control deadlocks that don't actually exist, according to what certain kernel hackers were saying prior to 2.6.24 2009-05-16 21:09 shapor... got a like to above reports? 2009-05-16 21:09 link 2009-05-16 21:10 looking... 
2009-05-16 21:10 https://bugzilla.redhat.com/show_bug.cgi?id=249563 2009-05-16 21:11 http://www.google.com/search?q=congestion_wait+deadlock 2009-05-16 21:11 lots of interesting results 2009-05-16 21:12 like xfs trying to detect memory recusion deadlock and calling congestion_wait 2009-05-16 21:12 looks like just in someone's svn tree 2009-05-16 21:12 https://open.neurostechnology.com/hackers/turran/kernel/fs/xfs/linux-2.6/kmem.c 2009-05-16 21:13 don't need to apply my patchset 2009-05-16 21:13 reading the redhat report now 2009-05-16 21:13 well, I'll send patchset after inodes flush 2009-05-16 21:14 it will have bug fixes and inodes flush 2009-05-16 21:14 ok, well I will _read_ the patchset 2009-05-16 21:14 atomic commit stuff is still too draft 2009-05-16 21:14 I am sorry I have not already read it 2009-05-16 21:14 no, problem 2009-05-16 21:14 btw, bug fixes patchset is still pending 2009-05-16 21:15 actually, I will make it after inodes flush 2009-05-16 21:15 any time 2009-05-16 21:15 ok, thanks 2009-05-16 21:17 http://lkml.org/lkml/2007/2/27/309 2009-05-16 21:17 thats kinda interesting, pretty simple test case 2009-05-16 21:18 shapor, the redhat bug above seems to be just basic device mapper, no snapshot etc, which is... 
sad 2009-05-16 21:19 yeah 2009-05-16 21:19 simplest case 2009-05-16 21:20 the lkml one above is a loopback deadlock 2009-05-16 21:21 known danger, not fixed in any robust way even now 2009-05-16 21:21 there have been a couple of hare-brained proposals for fixing it ;) 2009-05-16 21:22 anyway when the topic comes up again the bio-throttle patch will be fresh 2009-05-16 21:22 it's beyond pathetic that the fix has existed for years now and is not in mainline 2009-05-16 21:23 btw, loopback was rewritten by normal write method 2009-05-16 21:23 actually, write_begin/write_end, iirc 2009-05-16 21:24 good 2009-05-16 21:24 well, it may still have bugs though 2009-05-16 21:24 it is known device as buggy 2009-05-16 21:24 yes 2009-05-16 21:26 ACTION google deadlock 2009-05-16 21:26 well, let me think about it for a moment, the basic scenario is that a bio is submitted for write to a file 2009-05-16 21:27 filesystems do not expect to receive bios, so the loopback device has to translate the bio transfer to a file write 2009-05-16 21:27 yes 2009-05-16 21:27 so...
I do not see immediately how the write_begin/write_end hookup is done for loopback 2009-05-16 21:27 loopback gets bio and pass fs as write 2009-05-16 21:29 even if file is blockdev 2009-05-16 21:29 ah, and the issue is, the bio to write translation is done in a separate task from the submitter of the bio 2009-05-16 21:29 that will deadlock 2009-05-16 21:29 and calling write_begin/write_end is not a fix 2009-05-16 21:29 well 2009-05-16 21:29 let me see 2009-05-16 21:30 if the call to the filesystem is done in the bio submit, then it is all one call chain 2009-05-16 21:30 and should not deadlock 2009-05-16 21:30 but it can overflow the stack 2009-05-16 21:30 iirc, it would usually same with submitter 2009-05-16 21:31 because it pass the block layer almost 2009-05-16 21:31 the correct way to fix the problem is to introduce an interface that allows a filesystem to directly accept a bio 2009-05-16 21:32 this will have the good effect of also doing the right thing for O_DIRECT 2009-05-16 21:32 anyway... step 1) make tux3 work; step 2) save the world from bio deadlock 2009-05-16 21:33 btw, what is bio deadlock? 2009-05-16 21:33 loopback only? 2009-05-16 21:33 not just loopback 2009-05-16 21:33 many situations 2009-05-16 21:33 i see 2009-05-16 21:33 I used to call it memory deadlock 2009-05-16 21:33 but then... that allowed the bio maintainer to deny responsibility ;) 2009-05-16 21:34 so now I call it more accurately, bio deadlock 2009-05-16 21:34 it is when a bio handler has to allocate memory to service a vm writeout request, and the memory is not available, hence deadlock 2009-05-16 21:36 ah 2009-05-16 21:36 mempool is not solved it?
2009-05-16 21:36 I think bio and some stuff is using mempool 2009-05-16 21:36 s/I think/iirc/ 2009-05-16 21:40 well, s/solved/workaround/ 2009-05-16 21:47 no, mempool does not solve these deadlocks 2009-05-16 21:48 i see 2009-05-16 21:48 normal processes just use up the mempool, then a write in PF_MEMALLOC mode deadlocks 2009-05-16 21:48 the only thing that solves it is throttling the bio traffic per device 2009-05-16 21:48 a simple solution that is 100% effective 2009-05-16 21:49 and very efficient, even under conditions of memory full 2009-05-16 21:49 um..., mempool is reserved the memory for io path 2009-05-16 21:49 I guess, it shouldn't be full 2009-05-16 21:50 shouldn't be used up 2009-05-16 21:51 actually, memory is recycled after 2009-05-16 21:52 maybe, some path on dd* is not using mempool 2009-05-16 21:53 or reserved size is not enough 2009-05-16 21:53 mempool does not solve the issue 2009-05-16 21:53 um... 2009-05-16 21:53 with mempool, non-writeout transfers may use up the entire mempool, then the situation is the same as if there were no mempool 2009-05-16 21:54 without bio throttling, nothing prevents this 2009-05-16 21:54 hence the only correct solution is to throttle the bio traffic, which is easy 2009-05-16 21:55 why throttle can prevent it? 2009-05-16 21:55 throttle by memory usage? 2009-05-16 21:56 um..., mempool reserves memory per slab 2009-05-16 21:56 usually 2009-05-16 21:57 well, per perpose 2009-05-16 21:57 so, I can't see why can't it workaround the issue 2009-05-16 21:58 throttling prevents it by preventing the mempool from being entirely used up 2009-05-16 21:58 the throttling is simply by number of bios in flight on a given block device 2009-05-16 21:58 -!- ijuz_(~ijuz@p5B123CCB.dip.t-dialin.net) has joined #tux3 2009-05-16 21:58 ah 2009-05-16 21:59 um... 2009-05-16 21:59 it means, per queue mempool can workaround it? 
2009-05-16 21:59 just for example 2009-05-16 22:00 per queue mempool, plus per queue bio throttling 2009-05-16 22:00 the bio throttling is the key ingredient, not the mempool 2009-05-16 22:00 um... 2009-05-16 22:00 there are alternatives to mempool, but no alternatives to bio throttling 2009-05-16 22:01 well, probably, I'm not understanding the issue 2009-05-16 22:01 most core kernel devs do not understand the issue 2009-05-16 22:02 the only other one I know who understands it deeply is peterz, and for some reason he has not proposed the simple fix 2009-05-16 22:02 e.g. mempool is reserving the memory for 100 bios 2009-05-16 22:02 but instead, proposed many more complex fixes, some of them merged, that fail to solve the problem 2009-05-16 22:02 I can't see why 100 is go away 2009-05-16 22:02 it can happen 2009-05-16 22:02 no matter how big the mempool is 2009-05-16 22:02 and if the mempool is very large, it wastes memory 2009-05-16 22:03 i see 2009-05-16 22:03 yes 2009-05-16 22:03 we once saw 400,000 bios in flight on ddsnap 2009-05-16 22:03 it is possible 2009-05-16 22:03 that was before throttling 2009-05-16 22:03 even then, deadlock was rare 2009-05-16 22:03 and why 400,000 bios is rescycled? 2009-05-16 22:03 with throttling, there are never more than 1000 in flight 2009-05-16 22:04 rescycled?
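The throttling idea described above (cap the number of requests in flight per device, so that submitters wait instead of draining the reserve) can be sketched in userspace C. This is only an analogy under stated assumptions: it is not the bio-throttle kernel patch, and `struct throttle` and its functions are hypothetical names invented for illustration.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical per-device throttle: at most `limit` requests in flight. */
struct throttle {
	pthread_mutex_t lock;
	pthread_cond_t room;	/* signalled whenever a request completes */
	unsigned inflight;
	unsigned limit;
};

void throttle_init(struct throttle *t, unsigned limit)
{
	pthread_mutex_init(&t->lock, NULL);
	pthread_cond_init(&t->room, NULL);
	t->inflight = 0;
	t->limit = limit;
}

/* Submit path: sleep until a slot is free, then take it. */
void throttle_acquire(struct throttle *t)
{
	pthread_mutex_lock(&t->lock);
	while (t->inflight >= t->limit)
		pthread_cond_wait(&t->room, &t->lock);
	t->inflight++;
	pthread_mutex_unlock(&t->lock);
}

/* Non-blocking variant: returns false if the device is already at its limit. */
bool throttle_try_acquire(struct throttle *t)
{
	bool ok;

	pthread_mutex_lock(&t->lock);
	ok = t->inflight < t->limit;
	if (ok)
		t->inflight++;
	pthread_mutex_unlock(&t->lock);
	return ok;
}

/* Completion path: give the slot back and wake one waiting submitter. */
void throttle_release(struct throttle *t)
{
	pthread_mutex_lock(&t->lock);
	t->inflight--;
	pthread_cond_signal(&t->room);
	pthread_mutex_unlock(&t->lock);
}
```

The point of the sketch is that the bound is enforced at submission time: with a limit of 1000, the count of outstanding requests can never reach the hundreds of thousands, regardless of how large or small the reserve behind it is.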
2009-05-16 22:04 it meant, if IO was done, bios should be re-used 2009-05-16 22:04 for new IO 2009-05-16 22:04 the io wasn't done 2009-05-16 22:05 i see 2009-05-16 22:05 because the snapshot write was much slower than the application IO, and nothing prevents the application from submitting writes until all of memory is full 2009-05-16 22:05 yes 2009-05-16 22:05 some of the fiddling with bdi has reduced this effect somewhat, but not eliminated it 2009-05-16 22:06 even if memory is full, I thought mempool should be reserved memory for some IO 2009-05-16 22:11 and app can't submit new I/O by waiting memory 2009-05-16 22:13 the problem is that the mempool can be exhausted 2009-05-16 22:13 nothing prevents it, without throttling 2009-05-16 22:14 if the mempool is entirely used by reads, it may have no memory available for writes 2009-05-16 22:14 and other dealock scenarios are also possible 2009-05-16 22:14 but, read is done in future 2009-05-16 22:14 if all reserved pool was used at once, I can see why 2009-05-16 22:15 another read may come and take the mempool memory as soon as the first finishes 2009-05-16 22:15 yes 2009-05-16 22:15 mempool requests are not fifo, they are just by random scheduler order 2009-05-16 22:17 but, I guess if app was stopped, IO is done in future 2009-05-16 22:17 what if the disk is temporarily unplugged? 
2009-05-16 22:17 it may easily be the case that the entire system is blocked because of that 2009-05-16 22:18 actual deadlock occurs when servicing the bio transfer requires allocating memory 2009-05-16 22:18 this is different from just bio exhaustion 2009-05-16 22:18 i see 2009-05-16 22:19 I guess allocation memory is not mempool and it's necessary for IO 2009-05-16 22:19 memory is required to complete a bio transfer in a number of cases, including loopback 2009-05-16 22:19 also, in non-trivial virtual block devices like ddsnap and dmcrypt 2009-05-16 22:20 i see 2009-05-16 22:20 well, I know this type of deadlock is there 2009-05-16 22:20 actually, it was usb 2009-05-16 22:21 but, urb is not mempool though 2009-05-16 22:21 usb-request-block 2009-05-16 22:21 bio exhaustion can also occur in current mainline, because there is only one bio pool for the entire system regardless of number of devices 2009-05-16 22:21 ah, usb 2009-05-16 22:22 and actually, mempool did help a lot with bio exhaustion 2009-05-16 22:22 yes 2009-05-16 22:22 before the mempool arrived (mingo) exhaustion was a big problem 2009-05-16 22:22 now it is a rare problem, but not a completely solved problem 2009-05-16 22:22 i see 2009-05-16 22:23 bio throttling makes mempool unnecesary, as a side effect 2009-05-16 22:23 instead of using mempool, the bio can just be allocated using PF_MEMALLOC 2009-05-16 22:23 why PF_MEMALLOC is helping? 
2009-05-16 22:23 iirc, it just use with GFP_ATOMIC 2009-05-16 22:24 PF_MEMALLOC gives access to a pool of reserve memory, just as mempool does 2009-05-16 22:24 it makes very little difference where the reserve comes from, as long as the user is properly throttled 2009-05-16 22:24 but, it is not reserve memory of IO 2009-05-16 22:24 that does not matter 2009-05-16 22:24 no 2009-05-16 22:25 GFP_ATOMIC can use those reserved memory 2009-05-16 22:25 iirc 2009-05-16 22:25 yes, but not all of it 2009-05-16 22:25 eh 2009-05-16 22:25 GFP_ATOMIC has access to 1/2 or 1/4 of it, something like that 2009-05-16 22:25 oh 2009-05-16 22:26 otherwise we would have big problems 2009-05-16 22:26 because the network stack does many GFP_ATOMIC allocations for normal network traffic 2009-05-16 22:26 thankyou for reminding me of all these issues ;) 2009-05-16 22:27 I will eventually have to make the argument again on lkml 2009-05-16 22:27 it is always better to practice the argument first 2009-05-16 22:28 GFP_ATOMIC, yes 2009-05-16 22:28 so, I was thinking mempool was why introduced 2009-05-16 22:29 mempool was introduced mainly to solve the bio problem if I recall correctly 2009-05-16 22:29 oh, here is a problem with mempool\ 2009-05-16 22:30 bio allocation actually uses many mempools, one for each different size of biovec 2009-05-16 22:30 so this means that either there are not very many bios in each mempool (the current case) or a lot of memory is wasted 2009-05-16 22:31 it also means that there are not very many different sizes of bios, which also wastes memory 2009-05-16 22:31 yes 2009-05-16 22:32 mempools for some bio sizes only have two bios in them 2009-05-16 22:33 so if there are two block devices, and each of them has one bio, things can get very slow, which seems like a deadlock 2009-05-16 22:33 even though it is not, it is just a system running at 1,000th of its normal speed 2009-05-16 22:33 to the user, it appears to be deadlocked 2009-05-16 22:34 essentially, the entire 
system is reduced to running in O_SYNC mode 2009-05-16 22:34 which can be several orders of magnitude slower than normal buffered IO 2009-05-16 22:36 yes, it would not be efficient 2009-05-16 22:36 however, deadlock and not efficient is big different 2009-05-16 22:39 if the user thinks it is deadlocked, they are not different 2009-05-16 22:39 either way, the user will hit the reset button or power switch 2009-05-16 22:39 in a server environment, a watchdog may trigger a reboot 2009-05-16 22:40 anyway, we have both types of problem: systems running so slowly they might as well be deadlocked, and systems that are actually deadlocked 2009-05-16 22:41 i see 2009-05-16 22:42 we need a new term I think: slowlock 2009-05-16 22:42 or snaillock 2009-05-16 22:42 livelock? 2009-05-16 22:43 livelock makes no progress, so that would be different from slowlock 2009-05-16 22:43 slowlock progresses, but so slowly that it is useless 2009-05-16 22:45 e.g. bad implement of sync() 2009-05-16 22:45 some app feed dirty buffer to sync 2009-05-16 22:45 ah right 2009-05-16 22:46 I noticed that with the current code, writeback_inodes can sometimes run very slowly 2009-05-16 22:48 which brings us to the area we are interested in 2009-05-16 22:48 http://lxr.linux.no/linux+v2.6.29/fs/fs-writeback.c#L441 2009-05-16 22:49 generic_sync_sb_inodes 2009-05-16 22:49 um.., I'm not thinking writeback code is not bad 2009-05-16 22:49 but, wait is bad 2009-05-16 22:49 yes 2009-05-16 22:50 ah, ok 2009-05-16 22:51 this function actually uses timestamps 2009-05-16 22:51 that is scary 2009-05-16 22:51 if (time_after(inode->dirtied_when, start)) 2009-05-16 22:51 break; 2009-05-16 22:52 what's wrong with that?
2009-05-16 22:52 timestamps are fuzzy things 2009-05-16 22:53 at the risk of overgeneralizing, most systems that rely on timestamps tend to show poor behavior in corner cases 2009-05-16 22:53 oh 2009-05-16 22:53 nfs and make are two good examples 2009-05-16 22:53 maybe, this is old code 2009-05-16 22:53 kerberos is an excellent example 2009-05-16 22:53 now, I guess it should be needed 2009-05-16 22:54 it shouldn't be needed 2009-05-16 22:54 it dates from some time in 2.5 2009-05-16 22:54 we should not need this 2009-05-16 22:54 and current code is also not needed, I think 2009-05-16 22:54 but I think there is no way to bypass this code, per filesystem 2009-05-16 22:54 yes 2009-05-16 22:55 as in "yes, there is no way to bypass it?" 2009-05-16 22:55 yes 2009-05-16 22:55 :) 2009-05-16 22:55 shapor, do you know what all the hooting an hollering is out there? 2009-05-16 22:55 btw, current code has separated lists of io queue and dirty queue 2009-05-16 22:55 out where? 2009-05-16 22:55 some football or basketball game or something? 
2009-05-16 22:55 outside 2009-05-16 22:56 sounds like excited americans 2009-05-16 22:56 dunno its quiet here 2009-05-16 22:56 it might just be excited santa monicans 2009-05-16 22:56 its far too late to be any mjaor american sporting event 2009-05-16 22:57 hard to make advertising revenue at 2am eastern 2009-05-16 22:57 there sure is a lot of rambling code in this sync path 2009-05-16 22:57 oh right 2009-05-16 22:57 maybe it's the mightnight racing championships 2009-05-16 22:57 more likely, it's the get as drunk as you can get sweepstakes 2009-05-16 22:58 you must live in a rough neighborhood 2009-05-16 22:58 mostly losers here, yes 2009-05-16 22:58 the homeless rich 2009-05-16 22:59 hmm, there is also honking, it must be something that everybody knows about except venice beachians 2009-05-16 23:00 or maybe it was just a honk 2009-05-16 23:00 time for me to sleep 2009-05-16 23:00 oyasumi 2009-05-16 23:00 oyasumi 2009-05-17 00:04 -!- RazvanM(~RazvanM@96.234.242.16) has joined #tux3 2009-05-17 07:21 -!- edt(~Ed@243-76.162.dsl.aei.ca) has joined #tux3 2009-05-17 08:15 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-17 08:35 -!- dcg(~dcg@37.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-17 08:58 -!- dcg_(~dcg@201.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-17 09:54 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-17 10:34 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-17 11:58 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-05-17 12:10 Hallo i have a problem compiling master 2009-05-17 12:10 gcc -m64 -std=gnu99 -Wall -g -rdynamic -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wextra -Werror -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers -Dbuild_buffer -c -o buffer.o buffer.c 2009-05-17 12:10 cc1: warnings being treated as errors 2009-05-17 12:10 buffer.c: In 
function 'preallocate_buffers': 2009-05-17 12:10 buffer.c:478: error: attempt to free a non-heap object 'buffers' 2009-05-17 12:10 make: *** [buffer.o] Error 1 2009-05-17 12:10 the gcc is version 4.4.0 2009-05-17 12:10 thx in av 2009-05-17 12:13 oh 2009-05-17 12:14 4.4 2009-05-17 12:14 well, it seems the bug of code 2009-05-17 12:14 free(buffers) should be free(prealloc_heads) 2009-05-17 12:18 will try 2009-05-17 12:20 works(tm) 2009-05-17 12:20 could it be put in hg ? 2009-05-17 12:20 so the live gentoo ebuild dont have to patch 2009-05-17 12:23 yes, this fix will go into hg 2009-05-17 12:24 well, I have some pending bug fix patches 2009-05-17 12:24 so, this fix will merge to those 2009-05-17 12:25 hi 2009-05-17 12:25 thx keep up with developement 2009-05-17 12:26 hi 2009-05-17 12:26 hirofumi, you put that fix in your bug fix set? 2009-05-17 12:27 yes 2009-05-17 12:27 geos_one: doing a gentoo ebuild with kernel support for tux3? 2009-05-17 12:27 yes i am working on this 2009-05-17 12:27 awesome 2009-05-17 12:27 :) 2009-05-17 12:28 that will light a fire under our tails 2009-05-17 12:30 hrm tux3 is already mentioned here: http://en.gentoo-wiki.com/wiki/Filesystem 2009-05-17 12:31 "bit unstable" is generous 2009-05-17 12:31 the make file is a little basic 2009-05-17 12:31 no install and no way th specify extra LDFLAGS (for all those cracy persions like me compiling with --as-needed) 2009-05-17 12:31 true 2009-05-17 12:32 it is intended that the fuse code will be run directly from the build directory 2009-05-17 12:32 likewise the tux3 binary, used for creating filesystems 2009-05-17 12:34 so i can't put the tux3* binaries to /usr/sbin 2009-05-17 12:34 geos_one, feel free to post a makefile patch that makes it more gentoo-friendly 2009-05-17 12:34 you can if you like 2009-05-17 12:35 it works 2009-05-17 12:35 I would like to avoid autoconf/automake 2009-05-17 12:36 if possible 2009-05-17 12:36 ok will make them linux dist packager friendly 2009-05-17 12:36 -!- 
dcg(~dcg@181.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-17 12:38 no will not use the autohell 2009-05-17 12:57 :) 2009-05-17 13:27 so here ist the first implementation of the dsit packager patch 2009-05-17 13:27 http://tinyurl.com/q7ront 2009-05-17 13:30 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-17 13:30 a direct link http://tinyurl.com/pdgvns 2009-05-17 13:32 now i have another compile error i think it has something to do with the extra cflags 2009-05-17 13:33 x86_64-pc-linux-gnu-gcc -march=k8 -msse3 -Os -pipe -m64 -std=gnu99 -Wall -g -rdynamic -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wextra -Werror -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers utility.o iattr.o -o iattr -Wl,--hash-style=both,--as-needed 2009-05-17 13:33 x86_64-pc-linux-gnu-gcc -march=k8 -msse3 -Os -pipe -m64 -std=gnu99 -Wall -g -rdynamic -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wextra -Werror -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers -Dbuild_xattr -c -o xattr.o xattr.c 2009-05-17 13:33 cc1: warnings being treated as errors 2009-05-17 13:33 xattr.c: In function 'main': 2009-05-17 13:33 xattr.c:31: error: ignoring return value of 'ftruncate', declared with attribute warn_unused_result 2009-05-17 13:33 make: *** [xattr.o] Error 1 2009-05-17 13:40 it seems the lazyness of test code 2009-05-17 13:40 assert(!ftruncate(...)) will work 2009-05-17 13:54 works(tm) 2009-05-17 13:55 ok, I've fixed the xattr.c/filemap.c/inode.c 2009-05-17 13:59 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-17 14:04 and the next 2009-05-17 14:05 tux3graph.c: In function 'merge_file': 2009-05-17 14:05 tux3graph.c:749: error: ignoring return value of 'fwrite', declared with attribute warn_unused_result 2009-05-17 14:05 oh 2009-05-17 14:06 new libc seems to added the unused attribute to basic functions 2009-05-17 14:07 -Wno-unused flag to gcc may be easy for now 2009-05-17 14:09 
tux3graph.c was the last one that did not want to compile 2009-05-17 14:09 oh, good 2009-05-17 14:11 size_t n = fwrite(...) 2009-05-17 14:11 assert(n == 1) 2009-05-17 14:11 this will fix it 2009-05-17 14:17 thx that worked 2009-05-17 14:17 now to the kernel module 2009-05-17 14:17 CC [M] /var/tmp/portage/sys-fs/tux3-9999/work/tux3/user/kernel/super.o 2009-05-17 14:17 /var/tmp/portage/sys-fs/tux3-9999/work/tux3/user/kernel/super.c: In function 'tux3_fill_super': 2009-05-17 14:17 /var/tmp/portage/sys-fs/tux3-9999/work/tux3/user/kernel/super.c:140: error: 'TUX3_SUPER_MAGIC' undeclared (first use in this function) 2009-05-17 14:17 /var/tmp/portage/sys-fs/tux3-9999/work/tux3/user/kernel/super.c:140: error: (Each undeclared identifier is reported only once 2009-05-17 14:17 /var/tmp/portage/sys-fs/tux3-9999/work/tux3/user/kernel/super.c:140: error: for each function it appears in.) 2009-05-17 14:17 make[2]: *** [/var/tmp/portage/sys-fs/tux3-9999/work/tux3/user/kernel/super.o] Error 1 2009-05-17 14:17 make[1]: *** [_module_/var/tmp/portage/sys-fs/tux3-9999/work/tux3/user/kernel] Error 2 2009-05-17 14:17 make[1]: Leaving directory `/usr/src/linux-2.6.29-geos_one-r4' 2009-05-17 14:17 make: *** [all] Error 2 2009-05-17 14:18 I guess the tux3 patch has not been applied to the kernel 2009-05-17 14:19 http://git.kernel.org/?p=linux/kernel/git/daniel/linux-tux3.git;a=summary 2009-05-17 14:19 does the current patch work? 2009-05-17 14:20 hmm, we should link the kernel patches from the tux3.org source page 2009-05-17 14:21 shapor, ping?
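The warn_unused_result errors discussed above (ftruncate in xattr.c, fwrite in tux3graph.c) could also be handled by checking the return values explicitly rather than inside assert(), which compiles away when NDEBUG is defined. The following is only a sketch of that idea; the helper names write_record and resize_file are hypothetical and do not appear in the tux3 tree.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write exactly one record of `size` bytes.
 * Returns 0 on success, -1 if fwrite did not write the whole record.
 * `size` is expected to be nonzero (fwrite returns 0 for empty items).
 */
static int write_record(FILE *file, const void *data, size_t size)
{
	return fwrite(data, size, 1, file) == 1 ? 0 : -1;
}

/*
 * Hypothetical helper: set the length of an open file.
 * Returns 0 on success, -1 on failure (errno is set by ftruncate).
 */
static int resize_file(int fd, off_t size)
{
	return ftruncate(fd, size) == 0 ? 0 : -1;
}
```

Because the result is consumed by the comparison, gcc has nothing to warn about, and the caller decides whether a failure is fatal.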
2009-05-17 14:22 http://tux3.org/patches/ <- kernel patches 2009-05-17 14:22 well 2009-05-17 14:22 I know, it's easy to get a patch out of git ;) 2009-05-17 14:22 iirc, now, it just need the TUX3_SUPER_MAGIC 2009-05-17 14:22 the kernel patch works 2009-05-17 14:22 but i am trying to build tux3 out of tree as a module 2009-05-17 14:22 ah 2009-05-17 14:23 just add 2009-05-17 14:23 #ifndef TUX3_SUPER_MAGIC 2009-05-17 14:24 so the tester dont need to patch his linux sources and only needs to merge sys-fs/tux3 2009-05-17 14:24 #define TUX3_SUPER_MAGIC 0x74757833 2009-05-17 14:24 #endif 2009-05-17 14:24 to kernel/super.c 2009-05-17 14:24 geos_one, sure, it is supposed to be able to build as a module 2009-05-17 14:24 this will work without kernel patch 2009-05-17 14:25 ok will add 2009-05-17 14:25 and after merged to kernel, this issue should go away 2009-05-17 14:26 hirofumi, and maybe it is simpler to fill the magic by using the on-disk superblock 2009-05-17 14:26 there isn't really a reason to use the kernel define 2009-05-17 14:26 ah, true 2009-05-17 14:26 no 2009-05-17 14:26 re merging 2009-05-17 14:27 iirc, copy from on-disk has issue of alignment 2009-05-17 14:27 not really 2009-05-17 14:28 it is aligned in the on-disk super 2009-05-17 14:28 it is not tux3 internal 2009-05-17 14:28 statfs() 2009-05-17 14:29 we have to export magic by number 2009-05-17 14:30 it is actually not needed, however, it is eailer way 2009-05-17 14:36 well, so, I think it would be good the above #ifdef until merged to kernel 2009-05-17 14:43 I've added the #define TUX3_SUPER_MAGIC for now 2009-05-17 14:43 well, time to sleep for me 2009-05-17 14:43 oyasumi 2009-05-17 14:44 oyasumi 2009-05-17 14:44 by the ebuilds for gentoo are ready 2009-05-17 14:44 :) 2009-05-17 14:44 good :) 2009-05-17 14:46 http://tinyurl.com/p7dcvg 2009-05-17 14:46 sys-fs/tux3progs - the tools tux3* to sbin and the testbins libexec/tux3 take a look at the files dir it holds the riquired patches 2009-05-17 14:46 
sys-fs/tux3 - the kernel module also the files dir holds the needed patches 2009-05-17 14:48 ah, that's what one of those looks like 2009-05-17 14:49 for the gentoo users my layman file is located at http://ftp.mars.arge.at/pub/overlay/geos_one-overlay.xml 2009-05-17 14:49 so, how is gentoo doing these days? 2009-05-17 14:49 everybody lovin each other? 2009-05-17 14:50 my servers at the corp are running an gentoo for all production servers 2009-05-17 14:51 so i think it is dooing well aven with the small problems the community had 2009-05-17 14:51 good 2009-05-17 14:51 gentoo is traditionally the earliest adopter of new linux goodies 2009-05-17 14:52 and the first that bothers upstream (glibc, bash, ..) 2009-05-17 14:53 have you taken a look at the dist patch ? 2009-05-17 14:55 dist patch? 2009-05-17 14:56 -!- dcg(~dcg@181.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-17 14:56 http://mars.arge.at/svn/linamh/trunk/linamh/sys-fs/tux3progs/files/tux3-dist-packager-1.patch 2009-05-17 14:57 it adds the install target and LDFLAGS 2009-05-17 14:58 reading now 2009-05-17 14:58 not that complicatd 2009-05-17 14:59 you need it to be tux3bin instead of $(binaries) ? 2009-05-17 15:00 it is only that i can say in a later version that it dont install the testing bins when thy are not needed 2009-05-17 15:01 so a install-test target can be added in further version 2009-05-17 15:01 sure 2009-05-17 15:01 for now also a nice bash script for the test is needed that runs the test out of libexec/tux3 2009-05-17 15:02 by the way, do you remember the geos operating system? 2009-05-17 15:02 that are at the now in the make file 2009-05-17 15:03 yes that is the origin of my nick name 2009-05-17 15:03 geos version 1.0 was the first gui i used 2009-05-17 15:03 -!- dcg(~dcg@181.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-17 15:03 geos kicked ass 2009-05-17 15:03 but got done in by microsoft 2009-05-17 15:04 wasnt it coded at berkely 2009-05-17 15:04 geos? I don't know. 
It started on comodore 64 2009-05-17 15:04 then appeared on PC in the pcdos days 2009-05-17 15:04 geos 1.0 for the c64 2009-05-17 15:05 it was by far the best gui at the time 2009-05-17 15:05 i still have a C64 + ram expansion + geos i wonder if that stuff still works 2009-05-17 15:05 -!- dcg(~dcg@181.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-17 15:06 ijuz_, better dust it off and try it at least once per decade, to keep the circuits from getting dusty 2009-05-17 15:06 i most worried about the floppies 2009-05-17 15:06 I need to try my floppies too 2009-05-17 15:06 got a whole bunch that should be copied to hd 2009-05-17 15:07 first thing: vacuum out the floppy reader 2009-05-17 15:07 I haven't used a floppy disk in 5 years or so 2009-05-17 15:07 i have 3 brotkasten 2 c64 II 5 1541 1 1571 1 c128D and many expansions cards 2009-05-17 15:07 i am creating new schamtics and pcb for the c64 that fits in a defekt applebook 2009-05-17 15:08 just for fun 2009-05-17 15:09 to get the floppies to hd can use opencbm with a pc with parallel port and the xa1541 / xe1541 cable 2009-05-17 15:11 uhm... ok :) 2009-05-17 15:12 a verilog version of the C64 would be nice, so you could fit it nicely in an fpga 2009-05-17 15:12 no thats not fun 2009-05-17 15:13 and the c64 in verilog alredy exists 2009-05-17 15:13 surely without some of the most important bugs of the real hardware 2009-05-17 15:14 i want to build it diskret with the original MOS chips with some exceptins like the PLA 2009-05-17 15:15 i already had the original c64 running on the applebook accumulator uptime 15min 2009-05-17 15:16 15 min? 
2009-05-17 15:17 yes not more this chips arnt compariable to the chips today 2009-05-17 15:19 ok no 9V ~ on the userport and the vic and sid got some remped voltage with 50Hz but it worked 2009-05-17 15:22 hirofumi, how about this, it's shorter and avoids the small maintenance issue: 2009-05-17 15:22 diff -r cb5655728089 user/kernel/super.c 2009-05-17 15:22 --- a/user/kernel/super.c Fri Mar 27 16:12:32 2009 +0900 2009-05-17 15:22 +++ b/user/kernel/super.c Sun May 17 15:22:35 2009 -0700 2009-05-17 15:22 @@ -137,7 +137,6 @@ static int tux3_fill_super(struct super_ 2009-05-17 15:22 sbi->vfs_sb = sb; 2009-05-17 15:22 sb->s_fs_info = sbi; 2009-05-17 15:22 sb->s_maxbytes = MAX_LFS_FILESIZE; 2009-05-17 15:22 - sb->s_magic = TUX3_SUPER_MAGIC; 2009-05-17 15:22 sb->s_op = &tux3_super_ops; 2009-05-17 15:22 sb->s_time_gran = 1; 2009-05-17 15:22 2009-05-17 15:22 @@ -162,6 +161,7 @@ static int tux3_fill_super(struct super_ 2009-05-17 15:22 } 2009-05-17 15:22 goto error; 2009-05-17 15:22 } 2009-05-17 15:22 + sb->s_magic = *(be_u32 *)sbi->super.magic; 2009-05-17 15:23 2009-05-17 15:23 if (sbi->blocksize != blocksize) { 2009-05-17 15:23 if (!sb_set_blocksize(sb, sbi->blocksize)) { 2009-05-17 15:23 ...when you wake up 2009-05-17 15:24 ok by i need to get some sleep 2009-05-17 15:24 see you later 2009-05-17 15:24 do you want some of your makefile changes merged? 
2009-05-17 15:25 yep would be great 2009-05-17 15:25 $(CC) $(CFLAGS) $(LDFLAGS) $$(pkg-config --cflags fuse) utility.o tux3fuse.c -lfuse -otux3fuse <- a little better 2009-05-17 15:25 see you here tomorrow re merging 2009-05-17 15:26 no this will not work if you use the as-needed linkflag http://www.gentoo.org/proj/en/qa/asneeded.xml 2009-05-17 15:27 I see 2009-05-17 15:27 well you can explain in detail tomorrow 2009-05-17 15:27 maybe 2009-05-17 15:28 if the ldflags is before the lib it should link the resulting bin wont get linked with the lib 2009-05-17 15:29 ok the last comment should be rearranged 2009-05-17 15:30 +OWNER = root 2009-05-17 15:30 +GROUP = root 2009-05-17 15:30 <- is it correct to hard code these? 2009-05-17 15:31 only root should be able to use the tools or ? 2009-05-17 15:32 not just root 2009-05-17 15:32 it is typical to make tux3 filesystems on loopback files, for development 2009-05-17 15:33 if we just let it default, will install do the right thing? 2009-05-17 15:36 the creation mask is 755 so everyone can run the progs.
2009-05-17 15:36 but the tools are owned my root 2009-05-17 15:37 sure 2009-05-17 15:37 that is fine 2009-05-17 15:37 duh 2009-05-17 15:41 by til tomorrow 2009-05-17 15:47 bye 2009-05-17 16:19 -!- dagle1(~dagle@host162-104.bornet.net) has joined #tux3 2009-05-17 18:04 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-05-17 18:18 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-05-17 20:40 earthquake 2009-05-17 20:40 little one 2009-05-17 20:41 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-17 20:41 earthquake 2009-05-17 20:41 here to talk about it I presume 2009-05-17 20:41 sure 2009-05-17 20:42 big one though 2009-05-17 20:42 and I presume that venice beach did not liquify this time 2009-05-17 20:42 nope 2009-05-17 20:42 went on for 8-10 seconds 2009-05-17 20:42 big enough to dos the online earthquake reports 2009-05-17 20:42 still, little one compared to hirofumi's 2009-05-17 20:42 there was a 1.8 6 minutes ago 2009-05-17 20:43 it was more than that 2009-05-17 20:43 that couldn't have been it 2009-05-17 20:43 3 something 2009-05-17 20:43 4 something 2009-05-17 20:43 http://www.scec.org/ <- can't get a page 2009-05-17 20:43 http://www.data.scec.org/recenteqs/Maps/Los_Angeles.html 2009-05-17 20:44 5.0 2009-05-17 20:44 hawthorne 2009-05-17 20:44 I see it 2009-05-17 20:44 ok, it was relatively big 2009-05-17 20:44 14km deep 2009-05-17 20:45 moderate 2009-05-17 20:45 dang, wanna see the waveform 2009-05-17 20:46 thanks for the ddnsnap patch 2009-05-17 20:46 service with a smile 2009-05-17 20:47 I take it you didn't brain yourself in latigo canyon 2009-05-17 20:47 heh 2009-05-17 20:47 hot day, low traction, worn wheels, speeds were very slow 2009-05-17 20:47 I got a skateboarding helmet today 2009-05-17 20:47 hrm 2009-05-17 20:47 it probably sucks for safety, but it has a nice skull on the back 2009-05-17 20:47 spouse time 2009-05-17 20:48 got to play earthquake hunter for a moment 2009-05-17 
20:48 I'll hit u later 2009-05-17 20:48 see you 2009-05-17 21:15 re the above patch 2009-05-17 21:16 thought you were sleeping ;) 2009-05-17 21:16 I slept and woke up :) 2009-05-17 21:16 I'd like to use the same value as userland 2009-05-17 21:16 it is actually same 2009-05-17 21:17 however, define is different 2009-05-17 21:17 the kernel define should be a hex number 2009-05-17 21:17 in short, I'd like to think those are internal magic and external magic 2009-05-17 21:17 it is ok for it to be ascii in our tree, and hex in the kernel include 2009-05-17 21:18 yes 2009-05-17 21:18 should be the same 2009-05-17 21:18 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-17 21:18 those can be different 2009-05-17 21:18 of course, we will use same value 2009-05-17 21:18 they can be different, but there is no need for them to be different 2009-05-17 21:18 yes 2009-05-17 21:19 but, after merged, I think current code is better 2009-05-17 21:20 (gdb) p/x *(int *)"3XUT" 2009-05-17 21:20 $2 = 0x54555833 2009-05-17 21:20 anyway, time for me to sleep 2009-05-17 21:21 ah, bad 2009-05-17 21:21 bad?
2009-05-17 21:21 it should be cpu order 2009-05-17 21:22 it is 2009-05-17 21:22 it is a bigendian value, fetched on a little endian cpu 2009-05-17 21:22 so I reversed it 2009-05-17 21:23 *(be_32 *)magic is not changing order 2009-05-17 21:23 gcc doesn't have a byte swap function that I know of 2009-05-17 21:23 wait 2009-05-17 21:23 maybe ntoh 2009-05-17 21:24 no, I reversed the string 2009-05-17 21:25 (gdb) p/x ntohl(*(int *)"TUX3") 2009-05-17 21:25 $5 = 0x54555833 2009-05-17 21:25 there 2009-05-17 21:26 ah 2009-05-17 21:26 right 2009-05-17 21:26 I wrote the patch wrong indeed 2009-05-17 21:26 and we already had this discussion a few months ago 2009-05-17 21:26 yes 2009-05-17 21:27 btw, geos_one's patch seems has the issue about $tux3bin 2009-05-17 21:28 sb->s_magic = from_be_u32(*(be_u32 *)sbi->super.magic); 2009-05-17 21:28 tux3fuse is added to tux3bin, however tux3bin has different make rule 2009-05-17 21:28 it can 2009-05-17 21:30 however, if kernel has define, I think it is better to use kernel define 2009-05-17 21:31 because, s_magic is just for userland 2009-05-17 21:31 and linux/magic.h is also 2009-05-17 21:31 in theory, we can use different magic for on-disk magic 2009-05-17 21:31 we don't though 2009-05-17 21:32 I think we should just make a design decision that they are the same 2009-05-17 21:32 and that a kernel define that does not match is a bug 2009-05-17 21:32 yes 2009-05-17 21:33 that said... 
my patch was just for fun 2009-05-17 21:33 and it was wrong 2009-05-17 21:33 I should at least post the right version 2009-05-17 21:33 sparse would have noticed ;) 2009-05-17 21:34 ah, I remember 2009-05-17 21:34 and I thought I'd like to avoid to hear TUX3_MAGIC should go to linux/magic.h 2009-05-17 21:36 -!- ijuz__(~ijuz@p5B1268A5.dip.t-dialin.net) has joined #tux3 2009-05-17 21:37 we also need to decide tux3 or TUX3 2009-05-17 21:37 currently it is "tux3" 2009-05-17 21:37 yes 2009-05-17 21:37 wikipedia says TUX3 2009-05-17 21:37 currently both of *_MAGIC is tux3 2009-05-17 21:38 so... how about we keep "tux3" until we finalize the disk format 2009-05-17 21:38 on-disk was always "tux3" 2009-05-17 21:38 then change to "TUX3" 2009-05-17 21:38 I think that was my intention 2009-05-17 21:38 "TUX3" is? 2009-05-17 21:38 my proposal is: use small letters until we finalize the disk format, then change to large letters 2009-05-17 21:39 I guess change is not good 2009-05-17 21:40 because, people seems to start to distribute current one too 2009-05-17 21:40 ah, but it is never right to distributed an unfinalized disk format 2009-05-17 21:40 so I think, the change in this case is good 2009-05-17 21:41 but, distributed means someone may start to use TUX3_SUPER_MAGIC 2009-05-17 21:41 well we also have the incompatible disk format code 2009-05-17 21:41 yes 2009-05-17 21:42 ok, then let us decide whether we like "tux3" or "TUX3" 2009-05-17 21:42 you need my help? ;-) 2009-05-17 21:42 I don't have any preference about it 2009-05-17 21:42 TUX3 looks loud 2009-05-17 21:42 tux3 is subtle, compact 2009-05-17 21:42 tim_dimm votes for "tux3" ? 2009-05-17 21:43 TUX3 looks like the monster truck version 2009-05-17 21:43 my vote is neutral, hirofumi votes neutral, tim votes small letters, nobody else votes, so... 
2009-05-17 21:43 "tux3" it is 2009-05-17 21:43 in other words, no change 2009-05-17 21:43 :) 2009-05-17 21:44 good :) 2009-05-17 21:44 and I have posted a correct version of my patch 2009-05-17 21:44 hirofumi, so you will decide which way to write the source code, with the #ifdef ..MAGIC..., or with my patch 2009-05-17 21:44 hate to see great hackers get sidetracked by capitalization issues 2009-05-17 21:44 either is ok with me 2009-05-17 21:45 tim_dimm, these are the sorts of issues that are most important to great hackers ;-) 2009-05-17 21:45 also, the strength of the tea, color of the package... 2009-05-17 21:45 anyway, oyasumi 2009-05-17 21:46 the durometer of the wheels 2009-05-17 21:46 I'd like to use TUX3_SUPER_MAGIC 2009-05-17 21:46 as external reason, it can be different 2009-05-17 21:46 ok, done 2009-05-17 21:47 internal reason, I'd not like to hear "please magic to linux/magic.h" 2009-05-17 21:47 linux/magic.h has the correct declaration in either case 2009-05-17 21:48 yes 2009-05-17 21:48 hey flips hirofumi 2009-05-17 21:48 and company 2009-05-17 21:48 hi 2009-05-17 21:48 hi bh, and goodnight 2009-05-17 21:50 g'night all 2009-05-17 21:50 oyasumi 2009-05-17 21:58 morning i made some small corrections to the last patch (i realized i made some conflicts) added a new variable PREFIX this is shorter in the package install line 2009-05-17 21:58 http://tinyurl.com/oyufv3 2009-05-17 21:59 and split the install in install-test and install-bin 2009-05-17 22:02 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-17 22:02 night flips 2009-05-17 22:02 btw, why do we have to force OWNER/GROUP? 
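The magic-number discussion above comes down to two points: an out-of-tree build can supply its own fallback for TUX3_SUPER_MAGIC, and the four on-disk bytes "tux3", read as a big-endian 32-bit value, yield the same number on any CPU. A minimal userland sketch of both follows. The helper name magic_from_disk is hypothetical; the tux3 code itself uses from_be_u32 on the superblock field, as quoted in the log.

```c
#include <stdint.h>

/* Fallback for kernels whose linux/magic.h lacks the define (value from the log). */
#ifndef TUX3_SUPER_MAGIC
#define TUX3_SUPER_MAGIC 0x74757833
#endif

/*
 * Hypothetical helper: interpret four on-disk bytes as a big-endian
 * 32-bit number. Assembling the value byte by byte avoids both the
 * alignment question and any dependence on host byte order.
 */
static uint32_t magic_from_disk(const unsigned char magic[4])
{
	return ((uint32_t)magic[0] << 24) |
	       ((uint32_t)magic[1] << 16) |
	       ((uint32_t)magic[2] << 8) |
	        (uint32_t)magic[3];
}
```

With the bytes 't', 'u', 'x', '3' this yields 0x74757833, matching the define. With "TUX3" it yields 0x54555833, the value gdb printed in the log.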
2009-05-17 22:03 user/devloper may want to path $HOME for example 2009-05-17 22:03 want to use path 2009-05-17 22:09 the dev can always overwrite "make OWNER=mario GROUP=users all" 2009-05-17 22:09 this two vars are only there for packagers to easyly overwrite the defaults 2009-05-17 22:10 make OWNER=mario GROUP=users PREFIX=~/work/tux3 all 2009-05-17 22:26 i see 2009-05-17 22:27 I thought those are enough by current owner/group 2009-05-17 22:27 but, packagers 2009-05-17 22:28 because, usually, user can't change those except root 2009-05-17 22:28 well 2009-05-17 22:35 -!- RazvanM(~RazvanM@96.234.242.16) has joined #tux3 2009-05-17 22:39 and mybe tux3fuse also works on macosx then it is root:bin 2009-05-17 22:39 and on freebsd it is bin:bin 2009-05-17 22:40 ah, i see 2009-05-17 22:40 btw, can we leave the $(tux3bin) for binaries except fuse? 2009-05-17 22:41 tux3bin was for binaries to use normal make rule 2009-05-17 22:42 I guess tux3bin will be added more, e.g. fsck, etc. 2009-05-17 22:43 so, it would be lazy to copy the tux3bin binaries to rule target 2009-05-17 22:44 ok then a new build and install target will be needed 2009-05-17 22:44 ok will add those 2009-05-17 22:45 how about the binaries for all binaries except testbin? 2009-05-17 22:45 and use $(testbin) $(tux3bin) instead of current $(binaries)? 
2009-05-17 22:46 well, I guess you may have more good idea 2009-05-17 22:47 ok that would be ok 2009-05-17 22:47 i have not tuched the binaries target as it was alredy there 2009-05-17 22:47 but if i can reassign the binaries it would be ok 2009-05-17 22:49 or fusebin may be easiler 2009-05-17 22:50 binaries = $(testbin) $(tux3bin) $(fusebin) 2009-05-17 22:50 ok 2009-05-17 22:59 the new version of the patch http://tinyurl.com/obcu7k 2009-05-17 23:00 "+$(testbin) tux3 tux3grath:" should be "$(testbin) $(tux3bin)" 2009-05-17 23:00 otherwise, it looks good to me 2009-05-17 23:01 thanks 2009-05-17 23:02 o an oversight 2009-05-17 23:15 ok updated 2009-05-17 23:16 looks good 2009-05-17 23:16 I think this should go to flips 2009-05-17 23:21 would be fun to remove the patches (every patch that got upstream is a win for a packager) 2009-05-17 23:21 well, now, flips seems to be sleeping 2009-05-17 23:23 I guess flips will apply 2009-05-17 23:26 ok bye need to go to work 2009-05-17 23:27 see you 2009-05-18 02:11 -!- geos_one(~chatzilla@213.229.35.178) has joined #tux3 2009-05-18 02:25 hallo 2009-05-18 02:25 would it make sense to create man pages for the tux3* tools 2009-05-18 02:25 it would be my kind of contribution 2009-05-18 03:29 -!- pgquiles(~pgquiles@127.Red-79-153-82.dynamicIP.rima-tde.net) has joined #tux3 2009-05-18 03:58 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-05-18 05:05 geos_one: it certainly would. 
Although they might change rather often, but missing documentation is always a hassle 2009-05-18 05:08 ok so lets start scanning the sourcecode and create man pages (docbook) 2009-05-18 05:26 geos_one, you have my vote too 2009-05-18 05:26 flips: hallo 2009-05-18 05:27 there are also my "how to" posts, including how to make a new filesystem 2009-05-18 05:27 well I guess that's pretty obvious 2009-05-18 05:27 it would help if i can get a list oftions for the the tux3* progs 2009-05-18 05:28 the output of help dont say mutch 2009-05-18 05:28 most of it can be found in tux3fuse.c ;) 2009-05-18 05:28 sorry 2009-05-18 05:28 tux3.c 2009-05-18 05:28 that is the only one with any complexity 2009-05-18 05:28 uses libpopt 2009-05-18 05:29 got to go, back later 2009-05-18 07:03 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-18 07:33 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-05-18 08:23 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-05-18 08:24 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-05-18 08:25 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-18 08:59 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-18 09:26 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-18 12:18 the README INSTALL are finisched 2009-05-18 12:18 the manpages the skeleton are here 2009-05-18 12:18 the makefile is updated to generate the manpages with the help of xmlto 2009-05-18 12:33 -!- SEJeff(~jeff__@66.151.59.138) has left #tux3 2009-05-18 12:33 -!- DanK(~jeff__@66.151.59.138) has joined #tux3 2009-05-18 13:28 geos_one, where do I look? 
2009-05-18 13:29 not put to public at the moment 2009-05-18 13:29 but schould be ready in about 30min the the manpage tux3.1 will be finished 2009-05-18 13:30 could you plz describe wat the truncate command for tux3 is dooing 2009-05-18 13:36 -!- dcg(~dcg@18.pool80-103-2.dynamic.orange.es) has joined #tux3 2009-05-18 13:50 the work in progress files can be viewed here http://tinyurl.com/o2n9c2 2009-05-18 13:51 the makefile also got a dist target (tar.gz) 2009-05-18 15:24 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-18 17:23 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-18 18:12 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-18 18:51 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-18 20:33 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-18 21:52 -!- ijuz__(~ijuz@p5B126210.dip.t-dialin.net) has joined #tux3 2009-05-18 22:43 -!- RazvanM(~RazvanM@96.234.242.16) has joined #tux3 2009-05-18 23:18 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-05-19 00:00 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-05-19 01:19 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-05-19 02:12 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-05-19 03:13 -!- pgquiles_(~pgquiles@127.Red-79-153-82.dynamicIP.rima-tde.net) has joined #tux3 2009-05-19 06:02 -!- npmccallum(~npmccallu@static-72-81-253-234.bltmmd.fios.verizon.net) has joined #tux3 2009-05-19 08:23 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-19 09:10 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-19 10:06 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-19 10:14 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 
2009-05-19 13:15 -!- dcg(~dcg@35.pool80-103-2.dynamic.orange.es) has joined #tux3 2009-05-19 13:37 flips, there? 2009-05-19 13:37 well, I've thinking about ileaf handling 2009-05-19 13:37 ileaf/dleaf 2009-05-19 13:38 on current code, ileaf/dleaf is changed before and after sb->delta++ 2009-05-19 13:39 this clearly make things complex 2009-05-19 13:40 dleaf can be delay until sb->delta++ by delalloc 2009-05-19 13:40 maybe, just remove flush_buffers() 2009-05-19 13:40 however, ileaf is 2009-05-19 13:41 if dleaf change is after sb->delta++, ileaf change would be after sb->delta++ 2009-05-19 13:42 so, delay ileaf change is necessary? 2009-05-19 13:43 if so, I think userland is also needed to delay to change ileaf 2009-05-19 13:45 I forgot the original perpose of delalloc of inode number 2009-05-19 14:00 manpage by manpage http://tinyurl.com/o2n9c2 2009-05-19 14:00 the actual work in progress 2009-05-19 14:02 I don't have good knowlege for making manpage though 2009-05-19 14:02 it seems good 2009-05-19 14:02 flips may have some comment 2009-05-19 14:03 hey folks 2009-05-19 14:04 ah, docbook to manpage 2009-05-19 14:04 hi 2009-05-19 14:05 jep editing xml is a littel bit easyer and more output formats is also passible html pdf txt .... 2009-05-19 14:06 yes its not perfect but a start 2009-05-19 14:08 yes, it would be good than editing the plain manpage 2009-05-19 14:08 xml or asciidoc or something 2009-05-19 14:10 xml 2009-05-19 14:11 UTF-8 2009-05-19 14:12 i am using the personal edition of XMLmind XLM Editor good docbook support 2009-05-19 14:12 hirofumi, I'm here 2009-05-19 14:12 hi 2009-05-19 14:12 i see 2009-05-19 14:14 flipz, can you see the above ileaf stuff? 
2009-05-19 14:14 thinking about it now 2009-05-19 14:15 I don't propose to delay inode number allocation at this point, but to delay updating the inode table block after choosing an inode number 2009-05-19 14:16 the reason for delaying the inode table update is, to allow front end to allocate inodes independently from backend flush 2009-05-19 14:17 the frontend is not allowed to change an inode table block that stage_delta may need to change 2009-05-19 14:17 (inode table block = ileaf" 2009-05-19 14:18 yes 2009-05-19 14:18 btw, I meant delalloc of inode number is delay of ileaf update 2009-05-19 14:19 well, so, we need to delay ileaf update on userland too? 2009-05-19 14:19 ok, well I think the easiest thing to do is keep a list of newly allocated inums, then check that list after searching and finding a free inode in an ileaf 2009-05-19 14:19 if the found inum is in the list of new inums, keep searching 2009-05-19 14:20 yes 2009-05-19 14:21 I've implemented the basic inode dirty tracking code 2009-05-19 14:21 with that change, I think ileaf is only ever updated by a delta 2009-05-19 14:21 then, I was thinking about inode reference count 2009-05-19 14:21 yes 2009-05-19 14:22 inode reference count should only be updated by store_attrs 2009-05-19 14:22 that is, the disk image of it 2009-05-19 14:22 it means refcnt of in-core inode 2009-05-19 14:23 e.g. inode->i_count 2009-05-19 14:23 it will be used by iget/iput 2009-05-19 14:23 by iget, anyway 2009-05-19 14:24 I was thinking whether this stuff is needed on userland or not 2009-05-19 14:24 and, if we need to delay the ileaf update, it would be necessary 2009-05-19 14:24 yes, userland should have the newly allcoated inum list 2009-05-19 14:24 ok 2009-05-19 14:25 that makes it easier to debug anyway 2009-05-19 14:25 btw, now, we free inodes by free_inode() 2009-05-19 14:25 a reasonable place to keep the list is in a stash, like defree 2009-05-19 14:26 allocated inum? 
2009-05-19 14:26 yes, newly allocated 2009-05-19 14:26 I was thinking it is the list of inodes 2009-05-19 14:26 clear the list at each delta 2009-05-19 14:26 ok, that is fine 2009-05-19 14:26 better 2009-05-19 14:27 ok 2009-05-19 14:27 much better :) 2009-05-19 14:27 :) 2009-05-19 14:27 well, allocate and delete 2009-05-19 14:27 I'm imaging the delete is like ophan inodes 2009-05-19 14:27 inode free is easier than inode allocate, because it can easily be delayed 2009-05-19 14:27 yes 2009-05-19 14:28 I guess it doesn't have big difference with ophan inodes 2009-05-19 14:28 yes, it is very much like orphans 2009-05-19 14:28 we should do truncate like that too, something to think about 2009-05-19 14:29 well, so, I'll try to implement inode refcnt and other if needed 2009-05-19 14:29 before atomic-commit like inodes flush 2009-05-19 14:29 yes 2009-05-19 14:29 however, truncate would be complex 2009-05-19 14:29 ok, so for deleted inode... how about linking deleted struct inodes together using the same link field we use for newly allocated? 2009-05-19 14:30 then stage_delta updates the ileafs 2009-05-19 14:30 I guess it's should be different 2009-05-19 14:30 different field? 2009-05-19 14:30 different list 2009-05-19 14:31 to prevent a newly allocated inode from being immediately reused? 2009-05-19 14:31 um... 2009-05-19 14:32 well, if we have i_count, it doesn't call delete_inode() 2009-05-19 14:32 like ophan inodes 2009-05-19 14:32 so, I thought it can be same 2009-05-19 14:33 funny, I just changed my mind and thought it had to be a different field 2009-05-19 14:33 different field? 
2009-05-19 14:33 deleted inode link field different from newly allocated inode link field 2009-05-19 14:33 ah 2009-05-19 14:34 yes 2009-05-19 14:35 well, I guess it would become more clear by implementing i_count and related stuff 2009-05-19 14:36 btw, with delay ileaf update, I think ileaf is assumed to be updated by the backend only 2009-05-19 14:37 it is simple 2009-05-19 14:37 very much 2009-05-19 14:37 that was the intent 2009-05-19 14:37 however, it has to be a bit different handling with data buffers 2009-05-19 14:37 i see 2009-05-19 14:38 all ileaf changes would be after sb->delta++ 2009-05-19 14:38 but, ileaf is changed by backend only 2009-05-19 14:38 data buffers are easier, because the page cache decouples the frontend from the backend 2009-05-19 14:38 so, I guess leaf is not needed by sb->delta handling at all 2009-05-19 14:39 leaf? 2009-05-19 14:39 ileaf and dleaf 2009-05-19 14:39 I guess those leaf are only changed by backend 2009-05-19 14:39 right 2009-05-19 14:39 just what we want 2009-05-19 14:39 i see 2009-05-19 14:40 it makes it friendly to optimization and locking 2009-05-19 14:40 in addition to atomic commit 2009-05-19 14:40 yes, frontend would be read and backend is write 2009-05-19 14:41 backend is read/write 2009-05-19 14:41 yes 2009-05-19 14:41 isn't it? 
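A minimal sketch of the newly allocated inum list agreed on above (14:19 to 14:26): the frontend picks an inum, remembers it in a list, skips listed inums on later searches, and the list is cleared at each delta so that only the backend ever touches the ileaf. This is Python rather than tux3's C, and every name in it (InodeAllocator, alloc_inum, the set standing in for the ileaf blocks) is hypothetical, chosen for illustration only.

```python
class InodeAllocator:
    """Toy model: frontend picks inums, backend alone updates the ileaf."""

    def __init__(self, on_disk_inums):
        # Inums already recorded in the (simulated) ileaf blocks.
        self.ileaf = set(on_disk_inums)
        # Inums chosen by the frontend since the last delta; the ileaf
        # does not know about them yet.
        self.pending = []

    def alloc_inum(self, goal=0):
        """Search from goal for an inum free in the ileaf and not pending."""
        inum = goal
        while inum in self.ileaf or inum in self.pending:
            inum += 1  # taken, or in the list of new inums: keep searching
        self.pending.append(inum)
        return inum

    def stage_delta(self):
        """Backend: apply pending allocations to the ileaf, clear the list."""
        self.ileaf.update(self.pending)
        applied, self.pending = self.pending, []
        return applied


alloc = InodeAllocator(on_disk_inums={0, 1, 2})
first = alloc.alloc_inum()
second = alloc.alloc_inum()
ileaf_before_delta = set(alloc.ileaf)  # frontend left the ileaf untouched
applied = alloc.stage_delta()
```

In this model the ileaf set changes only inside stage_delta, which is the property the conversation is after; a real implementation would presumably keep the list in a stash, as suggested at 14:25.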
2009-05-19 14:41 good :) 2009-05-19 14:41 :) 2009-05-19 14:42 so, another one of leaf is bitmap handling 2009-05-19 14:42 now, data -> dleaf -> btree -> ileaf -> sb 2009-05-19 14:42 right can dirty left structure 2009-05-19 14:43 so, bitmap flushing can dirty the dleaf and ileaf 2009-05-19 14:43 yes 2009-05-19 14:43 it also gives the order in which structures must be flushed 2009-05-19 14:43 yes 2009-05-19 14:43 so that we know all the direct elements at each step 2009-05-19 14:43 sorry 2009-05-19 14:44 so that we know all the _dirty_ elements at each step 2009-05-19 14:44 bitmap is a bit different 2009-05-19 14:44 indeed 2009-05-19 14:44 data (file/dir) -> dleaf -> data btree -> ileaf (file/dir) -> inode btree -> sb 2009-05-19 14:45 after this, bitmap will be started to flush 2009-05-19 14:45 if it is a flush cycle 2009-05-19 14:45 yes 2009-05-19 14:45 so, bitmap flush can redirty ileaf again 2009-05-19 14:46 I think it is ok to flush the bitmap before all of the above 2009-05-19 14:47 that is the purpose of remembering exactly where the valid log begins 2009-05-19 14:47 um... 2009-05-19 14:49 actually, by the same argument, the inode index blocks can be flushed before the data 2009-05-19 14:49 probably 2009-05-19 14:50 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-05-19 14:50 so the order is: if flush cycle { flush bitmaps; flush inode index blocks }; flush data blocks; flush dleaf blocks; flush ileaf blocks; 2009-05-19 14:51 if flush cycle { flush bitmaps; flush inode index blocks; mark new beginning of log; }; flush data blocks; flush dleaf blocks; flush ileaf blocks; 2009-05-19 14:52 or flush ileaf after bitmap flush? 2009-05-19 14:53 um... 2009-05-19 14:53 flushing the dleaf blocks may change an inode 2009-05-19 14:54 ah 2009-05-19 14:54 stage_delta(); flush_log(); flush volmap 2009-05-19 14:54 ? 
2009-05-19 14:54 volmap is having the ileaf/dleaf change too 2009-05-19 14:55 I think, flush_log; stage_delta(); 2009-05-19 14:55 so, flush_log() can be including the stage_delta()'s balloc() 2009-05-19 14:56 it isn't necessary for flush_log to include the new allocations 2009-05-19 14:56 those can simply be logged 2009-05-19 14:56 this simplifies things I think 2009-05-19 14:56 flush_log(); stage_delta()? 2009-05-19 14:57 yes, where flush_log is: flush btree index blocks; flush bitmaps; mark new start of log; 2009-05-19 14:59 so some changes that are made by the index block and bitmap flush will affect the following stage_delta 2009-05-19 15:00 I'd like to rethink what happens with this 2009-05-19 15:00 me too 2009-05-19 15:01 merged commit is not simple 2009-05-19 15:01 I agree 2009-05-19 15:01 well, I was assuming the delta -> flush order 2009-05-19 15:02 it is tempting to start with a separate commit for log flushing, but actually I do not think it is really simpler that way, and is certainly slower 2009-05-19 15:02 I am pretty sure the correct order is flush; delta; 2009-05-19 15:02 but then... I have been sure about wrong things before ;) 2009-05-19 15:02 :) 2009-05-19 15:03 well, delta -> flush order with merged commit 2009-05-19 15:03 let me try to explain why I think this 2009-05-19 15:03 it was almost working, however ileaf was having the problem 2009-05-19 15:04 thanks 2009-05-19 15:05 flush_log will discard the head of the log, because the only entries in the head of the log that matter are entries for bitmap block and btree index block updates 2009-05-19 15:05 those entries can be discarded, because we are writing out the full, current copies of those blocks 2009-05-19 15:05 head of the log means except deflush entries? 2009-05-19 15:06 deflush means? 
2009-05-19 15:06 ACTION is sorry for having invented that word ;) 2009-05-19 15:06 deferred free after flush cycle 2009-05-19 15:07 the log entries of deferred free blocks after flush cycle 2009-05-19 15:07 ah, ok, the correct order is: mark new beginning of log; flush index blocks and bitmaps; 2009-05-19 15:08 ah, yes 2009-05-19 15:08 so that frees of index blocks just appear as bfree entries, after the new start of log 2009-05-19 15:09 for the after next flush cycle, yes 2009-05-19 15:09 yes, and those bfree entries will be properly preserved in the log 2009-05-19 15:09 and no actual bfrees until after the delta commit 2009-05-19 15:09 make sense? 2009-05-19 15:09 delta commit? 2009-05-19 15:10 delta commit of next flush cycle? 2009-05-19 15:10 delta commit = superblock updated 2009-05-19 15:10 to point to tail of log 2009-05-19 15:10 yes, if the commit is for next flush cycle 2009-05-19 15:11 actually, the upcoming delta commit, the deferred bfrees for the updated index and bitmap blocks can be entered into bitmaps immediately after this delta commit 2009-05-19 15:12 yes 2009-05-19 15:12 ok, I will write these ideas down in an email and post it, so that you can attack them properly 2009-05-19 15:12 but, flushing to storage is next flush cycle 2009-05-19 15:13 ok 2009-05-19 15:49 earthquake 2009-05-19 15:49 small one 2009-05-19 15:53 3.9 2009-05-19 15:53 same area as before 2009-05-19 15:54 wow- upgraded already to 4.1 2009-05-19 15:59 nice little cluster going 2009-05-19 15:59 another 2.5 hit 2009-05-19 15:59 same area 2009-05-19 19:23 -!- npmccallum(~npmccallu@static-72-81-253-234.bltmmd.fios.verizon.net) has joined #tux3 2009-05-19 20:24 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-19 20:26 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-05-19 20:28 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-05-19 20:47 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 
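A minimal sketch of why the order settled on at 15:07 puts "mark new beginning of log" before the index block and bitmap flush: bfree entries generated by the flush must land after the new start, or they are discarded along with the old head of the log. This is a toy model in Python, not tux3 code; the Log class, the tuple-shaped entries and the assumption that writing a redirected block logs a bfree of its old location are all illustrative.

```python
class Log:
    """Toy log: a list of entries plus the index where the valid log begins."""

    def __init__(self):
        self.entries = []
        self.start = 0

    def append(self, entry):
        self.entries.append(entry)

    def mark_new_start(self):
        # Everything logged before this point may be discarded.
        self.start = len(self.entries)

    def live_entries(self):
        return self.entries[self.start:]


def flush_index_and_bitmaps(log, old_locations):
    # Assumption: writing a redirected block to its new location logs a
    # deferred free (bfree) of the old location.
    for block in old_locations:
        log.append(("bfree", block))


def flush_log(log, old_locations, mark_first):
    if mark_first:
        log.mark_new_start()
        flush_index_and_bitmaps(log, old_locations)
    else:
        flush_index_and_bitmaps(log, old_locations)
        log.mark_new_start()


def run(mark_first):
    log = Log()
    log.append(("bnode_update", 7))  # older entry, obsoleted by the flush
    flush_log(log, old_locations=[100, 101], mark_first=mark_first)
    return log.live_entries()


kept = run(mark_first=True)   # bfree entries survive past the new start
lost = run(mark_first=False)  # bfree entries fall before the new start
```

With the mark taken first, both bfree entries remain in the live part of the log, so the frees can be applied after the delta commit; with the mark taken last, they would be dropped and the old block locations leaked.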
2009-05-19 20:48 ACTION reads the backlog 2009-05-19 21:27 -!- ijuz_(~ijuz@p5B12698F.dip.t-dialin.net) has joined #tux3 2009-05-19 21:39 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-05-19 22:17 -!- RazvanM(~RazvanM@96.234.242.16) has joined #tux3 2009-05-19 22:28 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-05-19 23:15 -!- geos_one(~chatzilla@213.229.35.178) has joined #tux3 2009-05-19 23:15 hi 2009-05-20 01:37 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-05-20 06:33 hmm, did something dark and dank pass through our channel? 2009-05-20 06:35 flips, No never, that is heresy 2009-05-20 06:35 flips, Look, a DanKle http://annieblogs.com/wp-content/uploads/2008/11/cankle2.jpg 2009-05-20 06:37 eww 2009-05-20 06:37 well 2009-05-20 06:37 lemesee, where were we? 2009-05-20 06:37 promised hirofumi a writeup on flush order 2009-05-20 06:37 rather fascinating topic 2009-05-20 06:37 To the list 2009-05-20 06:37 indeed 2009-05-20 06:38 to the list, and to the world 2009-05-20 06:38 it's about time the flow of tech notes got going again 2009-05-20 06:43 Yes the tux3 lurkers would be interested in more progress notes 2009-05-20 06:51 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-20 07:24 -!- npmccallum(~npmccallu@static-72-81-253-234.bltmmd.fios.verizon.net) has joined #tux3 2009-05-20 07:32 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-05-20 08:14 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-20 10:46 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-20 11:17 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-05-20 11:59 -!- pgquiles_(~pgquiles@127.Red-79-153-82.dynamicIP.rima-tde.net) has joined #tux3 2009-05-20 11:59 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-05-20 11:59 -!- data(~data@84.19.190.213) has joined 
#tux3 2009-05-20 13:14 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-05-20 13:35 -!- dcg(~dcg@180.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-20 13:39 -!- dcg_(~dcg@230.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-20 14:54 -!- dcg_(~dcg@230.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-20 16:22 -!- edt(~Ed@243-76.162.dsl.aei.ca) has joined #tux3 2009-05-20 17:31 -!- ajonat(~ajonat@190.48.100.165) has joined #tux3 2009-05-20 18:03 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-20 18:06 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-20 18:10 -!- edt(~Ed@243-76.162.dsl.aei.ca) has joined #tux3 2009-05-20 18:48 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-20 18:50 -!- npmccallum(~npmccallu@166.196.133.102) has joined #tux3 2009-05-20 19:12 -!- edt(~Ed@243-76.162.dsl.aei.ca) has joined #tux3 2009-05-20 19:20 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-20 19:30 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-20 19:36 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-20 19:56 -!- npmccallum(~npmccallu@166.197.174.219) has joined #tux3 2009-05-20 21:44 -!- ijuz_(~ijuz@p5B12668D.dip.t-dialin.net) has joined #tux3 2009-05-20 22:06 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-20 22:20 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-20 23:45 -!- RazvanM(~RazvanM@96.234.242.16) has joined #tux3 2009-05-21 03:27 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-05-21 07:43 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-05-21 08:27 -!- 
tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-21 10:26 flips, ping 2009-05-21 11:06 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-21 11:28 -!- bobby1234(~bobby@122.162.67.81) has joined #tux3 2009-05-21 11:28 ping 2009-05-21 12:36 tim_dimm, pong 2009-05-21 12:36 hey bobby1234 2009-05-21 13:38 -!- dcg(~dcg@66.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-21 13:39 hey flips 2009-05-21 13:46 hi 2009-05-21 13:48 bc? 2009-05-21 14:03 folks 2009-05-21 14:08 hey bh 2009-05-21 15:15 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-21 16:21 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-21 19:00 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-05-21 19:24 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-21 19:32 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has left #tux3 2009-05-21 21:21 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-21 21:50 -!- ijuz_(~ijuz@p5B125C85.dip.t-dialin.net) has joined #tux3 2009-05-21 23:42 -!- RazvanM(~RazvanM@96.234.242.16) has joined #tux3 2009-05-22 00:01 -!- ijuz_(~ijuz@p5B125C85.dip.t-dialin.net) has joined #tux3 2009-05-22 00:01 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-05-22 00:01 -!- data(~data@84.19.190.213) has joined #tux3 2009-05-22 00:01 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-05-22 00:01 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-05-22 00:01 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-05-22 00:01 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2009-05-22 00:01 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-05-22 00:01 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-05-22 00:01 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-05-22 00:01 
-!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-05-22 00:01 -!- flips(~phillips@phunq.net) has joined #tux3 2009-05-22 00:26 -!- edt(~Ed@243-76.162.dsl.aei.ca) has joined #tux3 2009-05-22 00:26 -!- RazvanM(~RazvanM@96.234.242.16) has joined #tux3 2009-05-22 00:26 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-05-22 00:26 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-05-22 00:26 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-05-22 00:26 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2009-05-22 00:26 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-05-22 00:26 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-05-22 00:26 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-05-22 00:26 -!- data(~data@84.19.190.213) has joined #tux3 2009-05-22 00:26 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-05-22 00:26 -!- ijuz_(~ijuz@p5B125C85.dip.t-dialin.net) has joined #tux3 2009-05-22 00:26 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-05-22 00:26 -!- flips(~phillips@phunq.net) has joined #tux3 2009-05-22 00:39 -!- flips(~phillips@phunq.net) has joined #tux3 2009-05-22 00:39 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-05-22 00:39 -!- ijuz_(~ijuz@p5B125C85.dip.t-dialin.net) has joined #tux3 2009-05-22 00:39 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-05-22 00:39 -!- data(~data@84.19.190.213) has joined #tux3 2009-05-22 00:39 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-05-22 00:39 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-05-22 00:39 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-05-22 00:39 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2009-05-22 00:39 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-05-22 00:39 -!- 
kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-05-22 00:39 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-05-22 00:40 -!- RazvanM(~RazvanM@96.234.242.16) has joined #tux3 2009-05-22 00:40 -!- edt(~Ed@243-76.162.dsl.aei.ca) has joined #tux3 2009-05-22 00:45 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-05-22 00:45 -!- flips(~phillips@phunq.net) has joined #tux3 2009-05-22 01:44 -!- ChanServ changed topic to "http://tux3.org ~ git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git" 2009-05-22 06:39 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-22 06:52 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-22 07:38 -!- edt(~Ed@243-76.162.dsl.aei.ca) has joined #tux3 2009-05-22 07:40 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-22 08:01 good morning 2009-05-22 08:37 mornin 2009-05-22 09:53 -!- tim_dimm(~timothyhu@static-71-165-35-33.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-22 09:58 -!- tim_dimm_(~timothyhu@static-71-165-35-33.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-22 10:04 -!- tim_dimm(~timothyhu@static-71-165-35-33.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-22 10:46 -!- domiel(~chatzilla@58.172.210.231) has joined #tux3 2009-05-22 10:52 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-22 12:34 -!- dcg(~dcg@209.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-22 12:39 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-22 13:34 -!- ckwood_(~ckwood@68.42.104.95) has joined #tux3 2009-05-22 16:13 there are 199 kernel threads on this here 8 core box, before userspace does anything at all 2009-05-22 16:13 this can't possibly be the best way to run a kernel 2009-05-22 16:13 sorry, only 119 2009-05-22 16:13 same comment 2009-05-22 16:27 flipz: what kernel is that? 
2009-05-22 17:04 let me see 2009-05-22 17:05 2.6.28.11, nothing exotic 2009-05-22 17:05 hrm i haven't seen more than 70, that was on a 4 core though 2009-05-22 17:06 sounds like twice as many cores will do the trick 2009-05-22 17:06 70 is still silly 2009-05-22 17:06 as a minimum, before the computer does anything useful at all 2009-05-22 17:07 and so many of them are disk io threads 2009-05-22 17:08 kblock, kintegrity (whatever that is) ata, aio... 2009-05-22 17:08 ...scsi... 2009-05-22 17:08 what is that kondemand thing? 2009-05-22 17:09 then there is kswap0, just one but it is prepared to multiply 2009-05-22 17:09 8 of each of the others 2009-05-22 17:10 acpi has two tasks... I guess one was not enough 2009-05-22 17:10 what on earth is kstop? 2009-05-22 17:10 then there are 8 kernel events threads 2009-05-22 17:11 and tons of softirq, migration and watchdog threads 2009-05-22 17:11 good grief 2009-05-22 17:11 there is even a kthreadd 2009-05-22 17:11 a thread thread it seems 2009-05-22 17:11 ooh, and 8 more disk threads... kmpathd 2009-05-22 17:12 plus a kstripd and a ksnapd 2009-05-22 17:12 I forgot to mention pdflush, round out the disk-related horde 2009-05-22 17:16 iirc, almost those are thread for own workqueue 2009-05-22 17:17 some of those might be enough with single thread 2009-05-22 17:17 yes, like pdflush does 2009-05-22 17:18 well ddsnap is very wasteful of threads, but at least I feel guilty about it 2009-05-22 17:18 each thread is a source of task switches and obscure race conditions 2009-05-22 17:18 well, I guess, per cpu workqueue is good for hotplug cpu 2009-05-22 17:19 it may make simple much workqueue 2009-05-22 17:19 right, but why not have just one per cpu thread that does everything that needs to be done per cpu? 
2009-05-22 17:19 task and stack switches, while cheap, are not free 2009-05-22 17:19 some workqueue blocks long time 2009-05-22 17:19 it's bad for other workqueue 2009-05-22 17:20 about "fork on block" 2009-05-22 17:20 any time this per-cpu task is going to block, it forks instead 2009-05-22 17:20 not a real unix fork 2009-05-22 17:21 just spawn another kernel thread, or take one from a little pool like pdflush 2009-05-22 17:21 new fork if blocked? 2009-05-22 17:21 it can 2009-05-22 17:21 and it seems to be in progress 2009-05-22 17:21 I forgot the name of it 2009-05-22 17:22 worker pool or something like that 2009-05-22 17:23 well, iirc, history is 2009-05-22 17:23 shared workqueue 2009-05-22 17:24 but, blocking had problem at some time 2009-05-22 17:24 so, create own workqueue 2009-05-22 17:24 and the default was per-cpu 2009-05-22 17:25 default behaviour of create_workqueue() 2009-05-22 18:06 ah good 2009-05-22 18:06 apparently, I am not the only one who notices the increasing number of kernel threads then 2009-05-22 18:07 this code could be a little tricky 2009-05-22 18:07 that is, to spawn a new helper at the time one is going to block 2009-05-22 20:09 flush_log is: flush btree index blocks; flush bitmaps; mark new start of log; 2009-05-22 20:10 I guess this is bad 2009-05-22 20:10 probably 2009-05-22 20:10 flush_log is: mark new start of log; flush btree index blocks; flush bitmaps; 2009-05-22 20:11 flush bitmaps generates the logs of bnode change 2009-05-22 20:11 and ileaf redirect etc 2009-05-22 20:12 so, if mark new was after bitmap, those logs are obsoleted too early 2009-05-22 20:12 btw, now, I'm thinking the both strategy can work 2009-05-22 20:13 and both has advantage/disadvantage 2009-05-22 20:14 flush_log() -> stage_delta(): this would be simple on flush and replay 2009-05-22 20:14 stage_delta() -> flush_log(): this would be efficient slightly, but would be complex 2009-05-22 20:15 thinking is not completed though 2009-05-22 20:18 btw, the order of 
"flush btree index blocks" and "mark new start of log" is not important 2009-05-22 20:19 because, "flush btree index blocks" shouldn't make any logs 2009-05-22 20:20 so, probably, btree -> new log -> bitmap would be understandable a bit 2009-05-22 20:22 it means "make snapshot of btree to on-disk" -> "make logs against it" -> "make bitmap snapshot with logs" 2009-05-22 20:25 ah, and, flush_log() -> stage_delta(): this doesn't need sb->logoffset 2009-05-22 20:25 stage_delta() -> flush_log(): this needs sb->logoffset 2009-05-22 21:29 -!- ijuz__(~ijuz@p5B126230.dip.t-dialin.net) has joined #tux3 2009-05-22 22:45 -!- RazvanM(~RazvanM@96.234.242.16) has joined #tux3 2009-05-22 22:46 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-22 23:54 -!- RazvanM_(~RazvanM@96.234.240.234) has joined #tux3 2009-05-23 06:20 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-23 07:53 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-05-23 09:24 -!- flips(~phillips@phunq.net) has left #tux3 2009-05-23 09:24 -!- flips(~phillips@phunq.net) has joined #tux3 2009-05-23 09:26 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-23 09:38 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-23 10:11 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-23 10:33 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-23 13:04 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-23 13:27 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-23 14:06 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-23 14:34 http://www.linux-mag.com/id/7336/, a bit interesting 2009-05-23 14:37 that is interesting. 
2009-05-23 14:37 would love to see that for different verticals 2009-05-23 14:37 web 2.0, Media + Entertainment, HPC, etc 2009-05-23 14:38 yes 2009-05-23 14:38 and ratio of read/write is also interesting, it says 2:1 2009-05-23 14:38 which vertical do you have most expertise? 2009-05-23 14:43 it's comment of article 2009-05-23 14:43 "recent data study" 2009-05-23 14:43 "recent data age study" 2009-05-23 14:44 this would be a nice feature for a storage management gui 2009-05-23 14:44 agedu? 2009-05-23 14:44 y 2009-05-23 14:44 yes 2009-05-23 14:45 agedu seems to be just using [mca]time though 2009-05-23 14:46 to know read/write ratio, another tool would be needed 2009-05-23 14:46 I have expertise with render farms for visual effects and animation 2009-05-23 14:47 read/write ratio is very heavy reads 2009-05-23 14:47 yes 2009-05-23 14:47 can be as high as 2000:1 2009-05-23 14:47 oh 2009-05-23 14:47 or more- depends on how many nodes are in the render farm 2009-05-23 14:47 and how many users are working on a particular scene 2009-05-23 14:48 well, yes, I was assuming the read centric 2009-05-23 14:48 so, 2:1 was surprised a bit 2009-05-23 14:49 yes 2009-05-23 15:01 folks 2009-05-23 15:17 hi 2009-05-23 15:54 hi hirofumi 2009-05-23 16:17 hi 2009-05-23 16:17 what's up? 
2009-05-23 16:18 thinking about ileaf updates 2009-05-23 16:18 and writing the inodes dirty management 2009-05-23 16:20 btw, with it, I noticed the userland would need to use mark_inode_dirty() some places 2009-05-23 16:20 I am writing the promised technical note on delta and volmap flush 2009-05-23 16:20 and xattr stuff is needed to some cleanup about dirty inodes 2009-05-23 16:20 good 2009-05-23 16:21 mark_inode_dirty would be good to have in userland 2009-05-23 16:21 yes 2009-05-23 16:21 basic code was almost done 2009-05-23 16:21 some fixes is needed though 2009-05-23 16:22 btw, I'm starting to think, flush_log() -> stage_delta() would be good 2009-05-23 16:22 at least, as first version 2009-05-23 16:24 good, that is my thinking 2009-05-23 16:26 btw, I'm thinking, maybe, flush_log() is - flush volmap -> start new log -> flush bitmap 2009-05-23 16:28 ah, my patchset became 70 patches or so 2009-05-23 16:28 so, I'm thinking to push bug fixes patches to send 2009-05-23 16:28 exactly 2009-05-23 16:29 I just wrote the same thing in my note 2009-05-23 16:29 good 2009-05-23 16:29 ok, I am ready to read patches 2009-05-23 16:29 ok 2009-05-23 16:30 maybe, I'll prepare it, and send tomorrow or so 2009-05-23 16:31 these are the first few steps I have written: 2009-05-23 16:31 if (need to flush log now) 2009-05-23 16:31 - add all redirected btree index blocks to delta write list 2009-05-23 16:31 - add all dirty bitmap blocks to delta write list 2009-05-23 16:31 - mark new beginning of log 2009-05-23 16:31 - log deferred free entries for log blocks before beginning of log 2009-05-23 16:31 - add these freeable log blocks to deferred flush (deflush) list 2009-05-23 16:31 For each dirty inode 2009-05-23 16:31 - add all dirty inode pages to the delta write list 2009-05-23 16:31 - store_inode_attrs 2009-05-23 16:31 so your flush volmap is "add all index blocks" and "add all bitmap blocks" 2009-05-23 16:32 and "start new log" is "mark new beginning of log" 2009-05-23 16:32 then, 
handling of log blocks that will be freed is simple, and much like the existing code 2009-05-23 16:33 after that are the steps that are done every delta 2009-05-23 16:33 "add to delta writeout list" can also be "launch write IO immediately" 2009-05-23 16:35 I need to include the /* map dirty bitmap blocks to disk and write out */ step 2009-05-23 16:35 that is, the mapping to disk part 2009-05-23 16:36 and that means that "mark new beginning" should be before "add all dirty bitmap blocks" 2009-05-23 16:36 yes 2009-05-23 16:36 well, I am properly started :) 2009-05-23 16:38 let's try this: 2009-05-23 16:38 if (need to flush log now) 2009-05-23 16:38 - mark new beginning of log 2009-05-23 16:38 - add all redirected btree index blocks and initiate writeout 2009-05-23 16:38 - increment flush counter 2009-05-23 16:38 - map all bitmap blocks dirty in previous flush cycle to disk and initiate writeout 2009-05-23 16:38 - log deferred free entries for log blocks before beginning of log 2009-05-23 16:39 - add these freeable log blocks to deferred flush (deflush) list 2009-05-23 16:39 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-23 16:39 hi tim_dimm 2009-05-23 16:39 yes, good 2009-05-23 16:39 hey flps 2009-05-23 16:39 flips 2009-05-23 16:40 putting on new skates now 2009-05-23 16:40 Ill email u a pic 2009-05-23 16:40 defree of log blocks may have difference with my thoughts 2009-05-23 16:40 gtg 2009-05-23 16:40 not sure though 2009-05-23 16:41 I think that the method above avoids the chance of leaking log blocks 2009-05-23 16:42 yes 2009-05-23 16:42 might have slightly different 2009-05-23 16:42 deflush may be defree 2009-05-23 16:42 not sure 2009-05-23 16:43 I need to prove that all new log entries made after the new beginning of log is marked, and none of the entries before the new beginning, belong to this delta 2009-05-23 16:44 yes, correct re defree 2009-05-23 16:44 yes 2009-05-23 16:44 so, I guess sb->logoffset is not 
needed with this strategy 2009-05-23 16:44 should be: - add deferred freeable log blocks to deferred free list 2009-05-23 16:45 um 2009-05-23 16:45 yes 2009-05-23 16:45 should be: - add per-flush deferred freeable blocks to deferred free list 2009-05-23 16:45 because, flush_log() is first operation of new cycle 2009-05-23 16:45 because there may be other per-flush deferred freeable blocks, that were created because of redirect 2009-05-23 16:46 about sb->logoffset? 2009-05-23 16:46 that is "mark new start" 2009-05-23 16:47 but, I think change_end() is 2009-05-23 16:47 flush_log() -> stage_delta() 2009-05-23 16:47 yes 2009-05-23 16:48 is that not how I described it? 2009-05-23 16:48 and flush_log() is started by "mark new start log" 2009-05-23 16:48 eys 2009-05-23 16:48 yes 2009-05-23 16:48 so, there is no log recorded on this block when marking new start log 2009-05-23 16:49 you mean, how do we record the new log start? 2009-05-23 16:49 or? 2009-05-23 16:49 I mean we don't need sb->logoffset with this order of change_end() 2009-05-23 16:50 because, log is generated by backend only 2009-05-23 16:50 maybe true 2009-05-23 16:50 yes, maybe 2009-05-23 16:52 with this, I think this strategy is simpler than another one 2009-05-23 16:52 there is no sb->logoffset and LOGBLOCK_FLUSH 2009-05-23 16:52 yes 2009-05-23 16:53 needs to be checked carefully 2009-05-23 16:53 yes 2009-05-23 16:57 I'm thinking, next is, prepare bug fix patchset, bug fixes inodes dirty stuff, deferred ileaf update, change change_end() stuff 2009-05-23 16:57 and it must be: For each dirty inode except bitmap inode 2009-05-23 16:58 then, back to bnode logging (for create() path, almost done though), start replay 2009-05-23 16:58 yes, good 2009-05-23 16:58 yes, bitmap and volmap 2009-05-23 16:58 right 2009-05-23 16:58 first version would be stupid codes 2009-05-23 16:58 For each dirty inode except bitmap and volmap inode 2009-05-23 16:58 - add all dirty inode pages to the delta write list 2009-05-23 16:58 - 
store_inode_attrs 2009-05-23 16:59 may not be stupid, however, similar of kernel code 2009-05-23 16:59 right, stupid and robust without leaking, and provably correct recovery 2009-05-23 17:02 http://userweb.kernel.org/~hirofumi/inode-flush-sync_inodes.patch 2009-05-23 17:02 For each dirty inode except bitmap and volmap inode 2009-05-23 17:02 - add all dirty inode pages to the delta write list 2009-05-23 17:02 - store_inode_attrs 2009-05-23 17:02 Initiate writeout for dirty inode table blocks 2009-05-23 17:02 Initiate writeout for new log blocks 2009-05-23 17:02 Write superblock 2009-05-23 17:02 this is part of inodes flush stuff 2009-05-23 17:03 ah, so userspace knows about marking dirty inodes 2009-05-23 17:03 it is about time :) 2009-05-23 17:03 with this, we can fix the fuse code to be more efficient 2009-05-23 17:03 yes 2009-05-23 17:04 well, for fuse, it needs inode refcnt stuff too 2009-05-23 17:04 well, this is writeback manner stuff 2009-05-23 17:04 we can find a more clever way of avoiding writing out bitmap and volmap inodes later 2009-05-23 17:04 yes 2009-05-23 17:05 based on this, we will implement the atomic-commit stuff 2009-05-23 17:05 it is same with kernel stuff more or less 2009-05-23 17:06 same situation is what I want for now 2009-05-23 17:06 improved version: 2009-05-23 17:06 For each dirty inode except bitmap and volmap inode: 2009-05-23 17:06 - Initiate writeout for all dirty inode pages 2009-05-23 17:06 - store_inode_attrs 2009-05-23 17:06 Initiate writeout for dirty inode table blocks. 2009-05-23 17:06 Initiate writeout for new log blocks. 2009-05-23 17:06 Write superblock. 
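A minimal sketch that strings together the flush-cycle steps posted at 16:38 (with the 16:45 correction to the last step) and the per-delta steps in the "improved version" just above. It only records the order of steps as strings; it is a paper model in Python, not tux3 code, and delta_steps and its parameters are hypothetical names.

```python
def delta_steps(dirty_inodes, flush_cycle):
    """Return the ordered list of steps for one delta (a paper model only)."""
    steps = []
    if flush_cycle:
        steps += [
            "mark new beginning of log",
            "write out redirected btree index blocks",
            "increment flush counter",
            "map bitmap blocks dirtied in previous flush cycle and write out",
            "log deferred free entries for log blocks before beginning of log",
            "add per-flush deferred freeable blocks to deferred free list",
        ]
    for inode in dirty_inodes:
        if inode in ("bitmap", "volmap"):
            continue  # skipped in the per-inode loop, per the 16:58 correction
        steps.append(f"write out dirty pages of {inode}")
        steps.append(f"store_inode_attrs({inode})")
    steps += [
        "write out dirty inode table blocks",
        "write out new log blocks",
        "write superblock",
    ]
    return steps


plain = delta_steps(["file1", "bitmap"], flush_cycle=False)
full = delta_steps(["file1"], flush_cycle=True)
```

Laying the steps out this way makes the two invariants from the discussion easy to check: the new log start is marked before any bitmap block is written, and the superblock write is always the final step of the delta.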
2009-05-23 17:06 ok, I will spend some time checking this pseudocode for accuracy now 2009-05-23 17:07 looks good to me 2009-05-23 17:08 btw, I'm thinking to use volmap->map->dirty for ileaf buffers 2009-05-23 17:08 ah, log blocks need to be assigned physical addresses 2009-05-23 17:08 that has to be shown 2009-05-23 17:08 ah, yes 2009-05-23 17:08 and preparation for next defree of log blocks 2009-05-23 17:08 For each dirty inode except bitmap and volmap inode: 2009-05-23 17:08 - Initiate writeout for all dirty inode pages 2009-05-23 17:08 - store_inode_attrs 2009-05-23 17:08 Initiate writeout for dirty inode table blocks. 2009-05-23 17:08 Allocate physical addresses and initiate writeout for new log blocks. 2009-05-23 17:08 Write superblock. 2009-05-23 17:09 we have: - add per-flush deferred freeable blocks to deferred free list 2009-05-23 17:09 I think that is the correct preparation for next defree of log blocks 2009-05-23 17:10 this means all deflush records are logging after new start log? 2009-05-23 17:10 yes 2009-05-23 17:10 oh 2009-05-23 17:11 hmm, is that good or bad 2009-05-23 17:11 it's a bit different topic with ileaf stuff though 2009-05-23 17:11 ACTION thinks 2009-05-23 17:11 I think it is also simpler or efficient stuff 2009-05-23 17:12 I think, all deflush records mean it will re-logging the log record other than log blocks 2009-05-23 17:12 yes, this method of freeing log blocks is correct I think 2009-05-23 17:13 the blocks will actually be freed into the bitmaps immediately after the delta commit (i.e., update superblock) 2009-05-23 17:13 all deflush has more effect to freeing log block 2009-05-23 17:14 I think, all deflush records mean it will re-logging the log record other than log blocks <- please explain? 
2009-05-23 17:14 bnode redirect will generate deflush log record
2009-05-23 17:14 and write it out on delta cycle
2009-05-23 17:15 right
2009-05-23 17:15 "all deflush" means, we will re-logging those at new flush cycle
2009-05-23 17:16 when would the re-logging happen?
2009-05-23 17:16 handling of defree of log blocks
2009-05-23 17:18 http://userweb.kernel.org/~hirofumi/notes/note_defree-logs.txt
2009-05-23 17:18 I'm thinking it with this
2009-05-23 17:19 the ascii art has become much prettier :)
2009-05-23 17:19 I'm thinking, defree of log blocks generates "bfree log of log blocks"
2009-05-23 17:19 :)
2009-05-23 17:21 if this figure is true, deflush and retire of log blocks has different cycle
2009-05-23 17:21 perhaps I am missing something, but I think that all we have to do is log the intent to free a log block as part of the log flush, then the normal delta handling will do the right thing
2009-05-23 17:21 retire of log blocks are next cycle of deflush was done
2009-05-23 17:22 yes
2009-05-23 17:23 For each dirty inode except bitmap and volmap inode:
2009-05-23 17:23   - Initiate writeout for all dirty inode pages
2009-05-23 17:23   - store_inode_attrs
2009-05-23 17:23 Initiate writeout for dirty inode table blocks.
2009-05-23 17:23 Allocate physical addresses and initiate writeout for new log blocks.
2009-05-23 17:23 Write superblock.
2009-05-23 17:23 Actually free the deferred free blocks
2009-05-23 17:23 yes
2009-05-23 17:23 is there still an issue?
2009-05-23 17:24 it depends on the timing of freeing log blocks
2009-05-23 17:24 I think operations are right
2009-05-23 17:24 log blocks are not added to the per-delta deferred free list except on a flush cycle
2009-05-23 17:25 yes
2009-05-23 17:25 I hope that is the correct timing
2009-05-23 17:25 main issue are, log blocks can freed after deflush was applied to buffer
2009-05-23 17:26 applied to buffer?
2009-05-23 17:26 after commit
2009-05-23 17:26 um...
2009-05-23 17:27 the pseudocode above does not violate that fule
2009-05-23 17:27 commit -> can free deflush record -> can free those log blocks
2009-05-23 17:27 rule
2009-05-23 17:27 yes
2009-05-23 17:27 but, deflush and defree of log blocks have different cycle
2009-05-23 17:28 yes, and I think the pseudocode handles that
2009-05-23 17:28 I think we remember the list of log blocks
2009-05-23 17:28 I'll off a bit
2009-05-23 17:28 I will paste the whole algorithm into the channel now...
2009-05-23 17:29 if (this delta includes a flush cycle)
2009-05-23 17:29   - mark new beginning of log
2009-05-23 17:29   - initiate writeout for all redirected btree index blocks
2009-05-23 17:29   - increment flush counter
2009-05-23 17:29   - map all bitmap blocks dirty in previous flush cycle to disk and initiate writeout
2009-05-23 17:29   - log deferred free entries for log blocks before beginning of log
2009-05-23 17:29   - add per-flush deferred freeable blocks to deferred free list
2009-05-23 17:29 For each dirty inode except bitmap and volmap inode:
2009-05-23 17:29   - Initiate writeout for all dirty inode pages
2009-05-23 17:29   - store_inode_attrs
2009-05-23 17:29 Initiate writeout for dirty inode table blocks.
2009-05-23 17:29 Allocate physical addresses and initiate writeout for new log blocks.
2009-05-23 17:29 Write superblock.
2009-05-23 17:29 Actually free the deferred free blocks
2009-05-23 17:30 ...so the "if (flush)" implements the idea of two separate cycles
2009-05-23 17:31 ok, I will see you later
2009-05-23 17:40 back
2009-05-23 17:41 this may be nitpick
2009-05-23 17:41 - log deferred free entries for log blocks before beginning of log
2009-05-23 17:42 this entry needs the list of log blocks
2009-05-23 17:43 so I thought, after "Allocate physical addresses and initiate writeout for new log blocks"
2009-05-23 17:43 it would need to add the allocated addresses to the list of log blocks
2009-05-23 17:44 ah
2009-05-23 17:45 I think totally 3 separate cycles
2009-05-23 17:45 defree, deflush, and defree of log blocks
2009-05-23 17:45 I guess this is the point of me
2009-05-23 17:54 1) defree is applied to buffer for each after commit
2009-05-23 17:54 2) deflush is applied to buffer for each after commit of flush cycle
2009-05-23 17:55 3) defree of log blocks is applied to buffer for each after next flush cycle of deflush cycle
2009-05-23 18:02 http://userweb.kernel.org/~hirofumi/notes/note_defree-logs.txt
2009-05-23 18:02 I've also added the [defree] to note
2009-05-23 19:14 I'm back also
2009-05-23 19:16 this entry needs the list of log blocks <- the list of physical log block addresses is stored in the log chain pointers within the log blocks
2009-05-23 19:18 yes
2009-05-23 19:19 but, on the retire point there is no in cache
2009-05-23 19:19 it is same with log entries of defree
2009-05-23 19:20 it may not be in cache
2009-05-23 19:20 true, it does not make sense to keep the log blocks in cache just to know how to free them
2009-05-23 19:21 read log blocks again to know physical address?
2009-05-23 19:21 so it makes sense to store the log block physical addresses in a "stash" at replay time
2009-05-23 19:21 replay?
2009-05-23 19:21 startup
2009-05-23 19:21 we always have to read the log blocks at startup
2009-05-23 19:21 why is it needed at runtime?
2009-05-23 19:22 we only need this at two times: 1) on startup (replay) 2) in a flush cycle, to free the blocks
2009-05-23 19:22 yes
2009-05-23 19:23 so, I think it is needed at runtime for 2)
2009-05-23 19:23 so we might as well just remember that list of physical log block addresses for (2)
2009-05-23 19:23 yes
2009-05-23 19:23 yes
2009-05-23 19:23 it will normally always be needed, not too far in the future
2009-05-23 19:23 yes
2009-05-23 19:23 that is a point that had not occurred to me before
2009-05-23 19:24 well, my patchset has the stash for log blocks though
2009-05-23 19:25 I guess the another point is this stash for log blocks is not same with deflush
2009-05-23 19:25 same cycle
2009-05-23 19:26 it is needed for "if (this delta includes a flush cycle)"
2009-05-23 19:27 needed to write the defree log entries, then it should be kept until after the delta commit, so that the blocks can actually be freed
2009-05-23 19:27 there will be a new one started for the next flush cycle
2009-05-23 19:27 new log blocks can be generated right after setting the new start of log
2009-05-23 19:28 so at that point, there is a list of old log blocks, and a list of new log blocks
2009-05-23 19:29 this could be two separate "stash" lists, or we can remember exactly where the new list begins
2009-05-23 19:29 does that make sense?
2009-05-23 19:29 2 stash is defree and deflush?
2009-05-23 19:29 yes, correct :)
2009-05-23 19:29 so there is no issue
2009-05-23 19:30 I think 3 stash is needed, defree, deflush, and defree of log blocks
2009-05-23 19:30 I think the defree of log blocks can just be moved to the defree list as part of the flush
2009-05-23 19:31 I think it is the point of a bit complex
2009-05-23 19:31 if it seems simpler to have three lists, that is easy to do
2009-05-23 19:32 and if I think it can be simplified by having only two, then it is my job to submit the patch ;)
2009-05-23 19:32 if log blocks is having deflush log entries, those log blocks can be freed after deflush log entries stored to on-disk bitmap
2009-05-23 19:32 :)
2009-05-23 19:36 ok, here is a point I think... the deferred flush log entries do not have to be stored to the on-disk bitmap, the deferred free can just be logged
2009-05-23 19:36 this simplifies things
2009-05-23 19:38 no, it means log the deflush, then apply to bitmap later, and flush bitmap data, then we can free log of deflush
2009-05-23 19:38 yes, I think that happens naturally in the pseudocode I pasted above
2009-05-23 19:40 ok
2009-05-23 19:41 btw, it means deflush and defree of log blocks is not freed same cycle
2009-05-23 19:42 it means, for each flush cycle, defree of log blocks -> deflush -> defree
2009-05-23 19:42 deflush become defree on flush cycle
2009-05-23 19:42 exactly
2009-05-23 19:42 I think that is good
2009-05-23 19:42 likewise, defree of log blocks become deflush
2009-05-23 19:42 ok, good
2009-05-23 19:43 this is why I thought 3 stash is needed
2009-05-23 19:43 it's quite nice, isn't it?
2009-05-23 19:43 I think the move of entries from one stash to the other takes care of it, but please feel free to prove me wrong ;)
2009-05-23 19:44 I agree
2009-05-23 19:45 I was thinking, 2 stash means deflush -> defree only
2009-05-23 19:45 and it might mean the defree of log blocks may be merged deflush
2009-05-23 19:46 for make sure, 3 stash is
2009-05-23 19:46 allocate physical address to log blocks
2009-05-23 19:47 then add the physical address to "defree of log blocks"
2009-05-23 19:47 this list will become deflush at the next flush cycle
2009-05-23 19:48 ah, no
2009-05-23 19:48 sorry
2009-05-23 19:49 "defree of log blocks" is not become "deflush"
2009-05-23 19:49 my thought is
2009-05-23 19:50 "defree of log blocks" -> "defree", and "deflush" -> "defree"
2009-05-23 19:50 it is why I'm thinking 3 stash
2009-05-23 19:50 I will be out for about 3 hours now
2009-05-23 19:50 3 stashes is fine with me
2009-05-23 19:50 ok, good
2009-05-23 19:51 well, I wanted whether I was missing something or not
2009-05-23 19:51 I wanted to make sure
2009-05-23 19:52 like ileaf stuff
2009-05-23 19:52 good, see you later
2009-05-23 19:53 see you
2009-05-23 20:36 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3
2009-05-23 21:36 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3
2009-05-23 21:46 -!- ijuz__(~ijuz@p5B124039.dip.t-dialin.net) has joined #tux3
2009-05-23 22:10 http://userweb.kernel.org/~hirofumi/notes/note_flush.txt
2009-05-23 22:11 I've updated my note
2009-05-23 22:11 [Example of commiting operations] part is what I'm thinking
2009-05-23 22:12 now, somehow, I'm thinking BFREE_ON_FLUSH is not needed, um...
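One reading of the three-stash scheme discussed above ("defree of log blocks" -> "defree", and "deflush" -> "defree" on a flush cycle) can be modeled in a few lines of Python. This is a toy sketch under stated assumptions, not Tux3 code: the class `DeferredStashes`, its attribute names, and the use of a set as a stand-in for the free-space bitmap are all hypothetical, and block addresses are plain integers.

```python
class DeferredStashes:
    """Toy model of the three stashes: defree, deflush, and log blocks."""

    def __init__(self):
        self.defree = []      # freed after every delta commit
        self.deflush = []     # becomes freeable only on a flush cycle
        self.log_blocks = []  # addresses of log blocks, retired on a flush cycle
        self.freed = set()    # stand-in for the free-space bitmap

    def commit_delta(self, new_log_blocks, flush=False):
        """Run one delta; return the blocks actually freed after the commit."""
        if flush:
            # On a flush cycle both per-flush stashes move to the defree list.
            self.defree += self.deflush + self.log_blocks
            self.deflush = []
            self.log_blocks = []
        # Log blocks written by this delta lie after the new start of the log,
        # so they are remembered for the next flush cycle.
        self.log_blocks += new_log_blocks
        # ... the superblock write (the commit) would happen here ...
        freed_now, self.defree = self.defree, []
        self.freed.update(freed_now)  # "actually free the deferred free blocks"
        return freed_now


if __name__ == "__main__":
    stashes = DeferredStashes()
    stashes.defree.append(10)
    stashes.deflush.append(20)
    print(stashes.commit_delta([100]))
    print(stashes.commit_delta([101], flush=True))
```

In this model an ordinary delta frees only what is already on the defree list, while a flush delta also retires the deflush entries and the log blocks written before the new start of the log.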
2009-05-23 23:06 folks
2009-05-23 23:36 -!- RazvanM(~RazvanM@96.234.240.234) has joined #tux3
2009-05-24 03:33 if (this delta includes a flush cycle)
2009-05-24 03:33   - mark new beginning of log
2009-05-24 03:33   - initiate writeout for all redirected btree index blocks
2009-05-24 03:33   - increment flush counter
2009-05-24 03:33   - map all bitmap blocks dirty in previous flush cycle to disk and initiate writeout
2009-05-24 03:33   - log deferred free entries for log blocks before beginning of log
2009-05-24 03:33   - add per-flush deferred freeable blocks to deferred free list
2009-05-24 03:33 I think "add per-flush deferred freeable blocks..." is too late
2009-05-24 03:34 I think you know though, because "map all bitmap" generates deflush blocks
2009-05-24 03:45 http://userweb.kernel.org/~hirofumi/temp-commit.c
2009-05-24 03:45 btw, this is my draft code of it
2009-05-24 03:45 it is not writing inodes yet though
2009-05-24 06:36 -!- edt(~Ed@243-76.162.dsl.aei.ca) has joined #tux3
2009-05-24 07:23 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3
2009-05-24 07:51 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3
2009-05-24 10:23 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3
2009-05-24 10:29 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3
2009-05-24 11:10 -!- pgquiles(~pgquiles@127.Red-79-153-82.dynamicIP.rima-tde.net) has joined #tux3
2009-05-24 11:38 good morning
2009-05-24 12:40 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3
2009-05-24 13:19 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3
2009-05-24 13:21 -!- flips(~phillips@phunq.net) has left #tux3
2009-05-24 13:21 -!- flips(~phillips@phunq.net) has joined #tux3
2009-05-24 14:12 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3
2009-05-24 14:38 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3
2009-05-24 14:46 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3
2009-05-24 14:48 hirofumi, I am analyzing your above remark re "add..." is too late
2009-05-24 14:50 yes, I think you are right
2009-05-24 15:06 Here is the new, improved pseudocode:
2009-05-24 15:07 if (this delta includes a flush cycle)
2009-05-24 15:07   - in any order:
2009-05-24 15:07     - add per-flush deferred freeable blocks to deferred free list
2009-05-24 15:07     - mark new beginning of log
2009-05-24 15:07     - initiate writeout for all redirected btree index blocks
2009-05-24 15:07     - increment flush counter
2009-05-24 15:07   - log deferred free entries for log blocks before beginning of log
2009-05-24 15:07   - map all bitmap blocks dirty in previous flush cycle to disk and initiate writeout
2009-05-24 15:08 This version may be equivalent:
2009-05-24 15:08 if (this delta includes a flush cycle)
2009-05-24 15:08   - in any order:
2009-05-24 15:08     - add per-flush deferred freeable blocks to deferred free list
2009-05-24 15:08     - mark new beginning of log
2009-05-24 15:08     - initiate writeout for all redirected btree index blocks
2009-05-24 15:08     - increment flush counter
2009-05-24 15:08   - in any order:
2009-05-24 15:08     - log deferred free entries for log blocks before beginning of log
2009-05-24 15:08     - map all bitmap blocks dirty in previous flush cycle to disk and initiate writeout
2009-05-24 15:10 the reason being, that there is no dependency between the "log deferred" and "map bitmap blocks" steps, because new
2009-05-24 15:11 log blocks do not generate changes to the (possibly forked) bitmaps that will be written out, and "map bitmap blocks" does not change the set of old log blocks that will be logged for deferred freeing
2009-05-24 15:12 sk8 oclock
2009-05-24 16:57 yes, I think you are right
2009-05-24 17:33 -!- marcin_(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3
2009-05-24 18:06 -!-
tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-24 19:10 ACTION is back 2009-05-24 21:29 -!- ijuz_(~ijuz@p5B126E3C.dip.t-dialin.net) has joined #tux3 2009-05-24 21:45 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-05-24 22:44 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-05-24 22:57 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-24 23:00 -!- RazvanM(~RazvanM@96.234.240.234) has joined #tux3 2009-05-25 03:14 -!- data(~data@84.19.190.213) has joined #tux3 2009-05-25 07:34 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-25 07:41 -!- pgquiles(~pgquiles@127.Red-79-153-82.dynamicIP.rima-tde.net) has joined #tux3 2009-05-25 07:52 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-05-25 08:02 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-25 08:22 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-25 08:51 -!- npmccallum(~npmccallu@cpe-76-177-118-207.natcky.res.rr.com) has joined #tux3 2009-05-25 09:35 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-25 11:37 english wikipedia is borked 2009-05-25 11:37 wikipedia.org 2009-05-25 12:03 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-05-25 13:31 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-25 13:58 -!- dcg(~dcg@220.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-25 14:44 -!- dcg_(~dcg@12.pool80-103-2.dynamic.orange.es) has joined #tux3 2009-05-25 16:36 I posted the bug fixes patchset to ml, when you have time, please check it 2009-05-25 16:36 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-05-25 16:37 -!- ajonat(~ajonat@190.48.99.157) has joined #tux3 2009-05-25 18:39 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined 
#tux3 2009-05-25 19:08 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-25 20:01 checking now 2009-05-25 20:03 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-25 20:15 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-25 20:34 -!- edt(~Ed@243-76.162.dsl.aei.ca) has joined #tux3 2009-05-25 21:19 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-25 21:26 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-25 21:38 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-25 21:45 -!- ijuz_(~ijuz@p5B126704.dip.t-dialin.net) has joined #tux3 2009-05-25 22:11 -!- ajonat(~ajonat@190.48.99.157) has joined #tux3 2009-05-25 22:32 -!- RazvanM(~RazvanM@96.234.240.234) has joined #tux3 2009-05-26 00:50 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-05-26 06:11 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-05-26 06:25 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-26 07:48 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-26 09:29 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-26 09:36 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-05-26 10:42 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-26 11:43 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-26 13:13 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-05-26 13:46 -!- bh_(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-05-26 13:59 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-26 15:02 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 
2009-05-26 15:04 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-26 16:51 -!- npmccallum(~npmccallu@cpe-76-177-118-207.natcky.res.rr.com) has joined #tux3 2009-05-26 17:10 -!- edt(~Ed@243-76.162.dsl.aei.ca) has joined #tux3 2009-05-26 17:11 hey flips 2009-05-26 21:30 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-05-26 21:54 -!- ijuz_(~ijuz@p5B127390.dip.t-dialin.net) has joined #tux3 2009-05-26 22:32 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-27 00:05 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-05-27 06:20 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-27 07:37 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-27 07:39 -!- edt(~Ed@243-76.162.dsl.aei.ca) has joined #tux3 2009-05-27 08:23 -!- npmccallum(~npmccallu@cpe-76-177-118-207.natcky.res.rr.com) has joined #tux3 2009-05-27 12:53 flips, care to explain in more detail: http://yoshinorimatsunobu.blogspot.com/2009/05/overwriting-is-much-faster-than_28.html 2009-05-27 13:07 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-05-27 13:11 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-27 13:36 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-05-27 15:43 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-27 16:46 hey, do you folks hold semaphores across IO operations ? 2009-05-27 16:47 because I'm hearing from a FreeBSD buddy of mine that they do that for certain file system operations 2009-05-27 16:49 -!- ajonat(~ajonat@190.48.94.69) has joined #tux3 2009-05-27 17:01 it would usually do 2009-05-27 17:01 e.g. 
rename will use semaphore to provide atomicity 2009-05-27 17:02 and will read blocks with it 2009-05-27 18:31 marcin: man fdatasync 2009-05-27 18:32 if the size of the file has changed, fdatasync is required to update the filesize in the inode table block 2009-05-27 18:32 if the size of the file does not change, then only the data block and file index block have to be updated 2009-05-27 18:33 still, a factor of four difference seems rather large 2009-05-27 19:47 hirofumi, I see you took mario fetka's patch for install etc 2009-05-27 19:50 - ftruncate(fd, 1 << 24); 2009-05-27 19:50 + assert(!ftruncate(fd, 1 << 24)); <- I suppose we should not really rely on assert this way, because assert can be defined to just throw away its args 2009-05-27 19:50 but this is done in other places, so fine for now 2009-05-27 20:16 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-27 21:11 yes, it's right 2009-05-27 21:12 however, there are many places to do it, and assert(!ftruncate(...)) is not new with this patch 2009-05-27 21:13 so, I thought you don't care 2009-05-27 21:16 anyway, if we want to do it, I think it would be good with other patch 2009-05-27 21:18 use #include , and fix exsisted wrong usage of assert() 2009-05-27 21:20 but, I didn't finish it, because those are test programs 2009-05-27 21:25 ah, you said this already :) 2009-05-27 21:25 sorry 2009-05-27 21:54 -!- ijuz__(~ijuz@p5B126378.dip.t-dialin.net) has joined #tux3 2009-05-27 23:29 -!- RazvanM(~RazvanM@pool-173-67-58-50.bltmmd.east.verizon.net) has joined #tux3 2009-05-27 23:49 -!- pgquiles(~pgquiles@62.43.226.52) has joined #tux3 2009-05-28 00:26 hey hirofumi so file systems in Linux hold semaphore across blocking operations ? 
2009-05-28 00:26 blocking io operations specifically 2009-05-28 00:33 yes 2009-05-28 00:34 just memory allocation with __GFP_WAIT, it can be I/O 2009-05-28 00:36 well, you can see mutex_lock(&inode->i_mutex) in vfs 2009-05-28 00:58 ok 2009-05-28 00:58 I didn't know that 2009-05-28 00:58 I feel like an idiot now not knowing that 2009-05-28 01:09 right, I was wondering what that was about. Now I know 2009-05-28 01:10 so when there's contention statistics against it, it's potentialy just another task blocking against what is really an IO operation, right ? 2009-05-28 01:10 and not really a true contention per se, correct ? 2009-05-28 01:22 I'm not sure what contention is meaning here 2009-05-28 01:22 well, if mutex_lock is already taked, it will go to sleep 2009-05-28 01:26 and it will wait that task is doing I/O and mutex_unlock 2009-05-28 01:35 what about any other threads wanting that mutex ? 2009-05-28 01:36 wouldn't it then be put to sleep when acquiring the mutex since the task holding that mutex is blocked already by some kind of wait queue operation ? 2009-05-28 02:02 if the mutex was already holding, other threads that try to take the mutex will sleep 2009-05-28 02:49 right 2009-05-28 02:49 thats measured as a contention right now 2009-05-28 03:02 ah 2009-05-28 03:15 it's incorrect stats right ? 2009-05-28 03:24 if it's merged with spinlock contention, I think it's not good 2009-05-28 03:25 however, if it's not merged, I guess it would be useful to know it's how long sleeping 2009-05-28 03:31 it's both 2009-05-28 03:31 it's measured as a contention and it has hold times 2009-05-28 03:32 it's generic stats for the most part 2009-05-28 03:34 btw, is it stats of lockdep? 
2009-05-28 03:39 something like that 2009-05-28 03:40 I wrote the first revsion of that befoe peterz wrote it into lockdep directly 2009-05-28 03:40 that's why I know about it 2009-05-28 03:40 i see 2009-05-28 03:40 it uses the same hashing object mechanism 2009-05-28 03:41 well, if so, it's not useful for debug of mutex itself 2009-05-28 03:41 I wrote it originall to find out whether or not there was over-schedling going on in -rt 2009-05-28 03:41 it would be almost sleep time of mutex 2009-05-28 03:41 i see 2009-05-28 03:42 yeah, the inode mutexes were coming up as frequent contenders in the stats 2009-05-28 03:42 under find load 2009-05-28 03:42 so it amkes sense 2009-05-28 03:42 makes 2009-05-28 03:43 ok night 2009-05-28 03:43 good night 2009-05-28 04:07 -!- edt(~Ed@243-76.162.dsl.aei.ca) has joined #tux3 2009-05-28 06:23 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-28 07:45 -!- npmccallum_(~npmccallu@76.177.118.207) has joined #tux3 2009-05-28 08:09 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-28 10:58 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-05-28 11:47 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-05-28 20:57 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-05-28 21:54 -!- ijuz__(~ijuz@p5B126B1F.dip.t-dialin.net) has joined #tux3 2009-05-29 01:12 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-05-29 06:06 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-29 10:46 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-29 11:25 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-29 14:25 folks 2009-05-29 14:26 flips around ? 2009-05-29 14:26 Yeah he's on the other side of the room :P 2009-05-29 14:27 really ? he working on Tux3 ? 
:) 2009-05-29 14:27 heh, I wish 2009-05-29 14:31 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-29 14:32 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-29 14:33 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-05-29 14:36 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-29 15:54 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-05-29 16:16 -!- fbloom(~chatzilla@62-50-199-118.client.stsn.net) has joined #tux3 2009-05-29 20:54 -!- ckwood_(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-05-29 21:38 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-29 21:54 -!- ijuz__(~ijuz@p5B12559C.dip.t-dialin.net) has joined #tux3 2009-05-29 23:42 -!- RazvanM(~RazvanM@pool-173-75-179-112.bltmmd.east.verizon.net) has joined #tux3 2009-05-30 02:39 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-05-30 04:27 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-05-30 05:45 -!- data(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-05-30 06:34 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-05-30 06:39 -!- flips(~phillips@phunq.net) has joined #tux3 2009-05-30 07:23 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-30 07:58 -!- npmccallum_(~npmccallu@76.177.118.207) has joined #tux3 2009-05-30 09:08 -!- pythonstar(~kavli@c-0df1e455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-05-30 14:24 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-30 15:54 -!- ed__(~Ed@243-76.162.dsl.aei.ca) has joined #tux3 2009-05-30 21:41 -!- ijuz_(~ijuz@p5B124762.dip.t-dialin.net) has joined #tux3 2009-05-31 00:08 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-05-31 00:19 -!- 
RazvanM(~RazvanM@pool-173-75-179-112.bltmmd.east.verizon.net) has joined #tux3 2009-05-31 00:29 -!- dagle1(~dagle@host162-104.bornet.net) has joined #tux3 2009-05-31 03:59 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-05-31 05:48 -!- dcg(~dcg@5.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-31 06:32 -!- dcg(~dcg@5.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-31 08:16 -!- dcg_(~dcg@82.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-05-31 10:32 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-31 10:54 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-05-31 11:48 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-05-31 12:14 -!- pgquiles(~pgquiles@206.Red-217-125-197.dynamicIP.rima-tde.net) has joined #tux3 2009-05-31 12:49 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-31 12:51 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-05-31 13:35 -!- dcg(~dcg@55.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-05-31 14:56 -!- data`(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-05-31 16:58 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-05-31 18:53 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-05-31 20:09 -!- edt(~Ed@dsl-216-221-39-46.aei.ca) has joined #tux3 2009-05-31 21:51 -!- edt(~Ed@dsl-216-221-39-46.aei.ca) has joined #tux3 2009-05-31 21:57 -!- ijuz_(~ijuz@p5B126522.dip.t-dialin.net) has joined #tux3 2009-05-31 23:11 -!- RazvanM(~RazvanM@pool-173-75-179-112.bltmmd.east.verizon.net) has joined #tux3 2009-06-01 00:40 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-06-01 07:03 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-01 07:07 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined 
#tux3 2009-06-01 07:09 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-01 08:19 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-01 09:41 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-01 12:36 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-01 12:46 -!- dcg(~dcg@193.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-06-01 14:43 how do you find the definion of a syscall now that can't just look for sys_ now? 2009-06-01 15:29 hey flipz 2009-06-01 15:29 long time no see 2009-06-01 15:33 awk '/^#define.*NR_/{print $2}' linux-2.6/arch/x86/include/asm/unistd_64.h | sed s:^__NR_::g 2009-06-01 15:33 flips, wouldn't something beautiful like that do what you want? 2009-06-01 15:36 sejeff, yep, I'm pretty much resigned to grepping from now on 2009-06-01 15:36 actually, the fastest way seems to be to consult old kernel source 2009-06-01 15:38 git grep is fast 2009-06-01 15:39 you know this though 2009-06-01 15:43 flipz: how's it going ? 
2009-06-01 15:54 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-06-01 18:27 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-01 22:04 -!- ijuz_(~ijuz@p5B127276.dip.t-dialin.net) has joined #tux3 2009-06-01 22:19 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-01 23:25 -!- RazvanM(~RazvanM@pool-173-75-179-112.bltmmd.east.verizon.net) has joined #tux3 2009-06-02 06:33 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-02 07:14 hi 2009-06-02 07:14 http://userweb.kernel.org/~hirofumi/tux3/ 2009-06-02 07:14 this is almost mergable patchset 2009-06-02 07:15 those are writeback stuff and reference count of inode 2009-06-02 07:15 I guess we can implement the defered ileaf update based on this 2009-06-02 07:16 rest is some test, and final review by myself 2009-06-02 07:40 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-02 08:04 start to think about implementation of defer ileaf update 2009-06-02 09:50 hi hirofumi 2009-06-02 09:50 hi 2009-06-02 09:51 where should I look for the diffs? 2009-06-02 09:52 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-06-02 09:52 I pushed patches to my hg repo 2009-06-02 09:52 thanks 2009-06-02 09:54 I will compare my pseudocode to your code, and fix my pseudocode ;) 2009-06-02 09:54 thanks 2009-06-02 09:55 btw, pseudocode is which one? 
2009-06-02 10:00 the last version pasted into the channel 2009-06-02 10:00 ah 2009-06-02 10:00 ok 2009-06-02 10:01 those patches is not including the atomic commit stuff 2009-06-02 10:03 ok 2009-06-02 10:03 draft of atomic-commit is almost same with your pseudocode, actual order is not same though 2009-06-02 10:04 but, it's not vaiolating the order of pseudocode 2009-06-02 10:07 btw, http://userweb.kernel.org/~hirofumi/commit.c is my draft 2009-06-02 10:09 start new log stuff is one function as new_cycle_log() 2009-06-02 10:09 and 2 new stash was introduced 2009-06-02 10:13 one of stashes - logs for pending deflush 2009-06-02 10:13 and another one - recently logs to need to replay 2009-06-02 10:14 with same reason, there are 2 ->logcount 2009-06-02 10:21 btw, time to sleep 2009-06-02 10:22 oyasumi 2009-06-02 10:22 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-02 10:46 oyasumi 2009-06-02 10:47 hirofumi, I'll correct the pseudocode if it makes sense 2009-06-02 11:40 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-02 12:52 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-02 13:07 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-06-02 13:49 folks 2009-06-02 13:49 hey flipz 2009-06-02 14:15 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-06-02 14:47 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-02 15:29 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-02 16:03 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-02 20:17 flips, thanks 2009-06-02 20:17 good morning hirofumi 2009-06-02 20:17 good morning 2009-06-02 20:17 btw, do you have any thought about orphaned inode? 
2009-06-02 20:18 yes, my thought is, we better do something ;-) 2009-06-02 20:18 I'm thinking it may be related to defer ileaf update 2009-06-02 20:18 ok, I will have a much more complete thought soon 2009-06-02 20:18 :) 2009-06-02 20:18 good 2009-06-02 20:18 do you mean, the same mechanism may help? 2009-06-02 20:19 I mean, delete operation and orphaned would be similar 2009-06-02 20:19 defer ileaf update of delete operation 2009-06-02 20:19 the log is a very natural way to take care of orphaned inode, except for one thing: the inode may stay orphaned for much longer than the life of the log block 2009-06-02 20:20 yes 2009-06-02 20:20 ok, I remember my intention was to have an orphaned inode list and store it in an inode, something like atom table 2009-06-02 20:20 but not update that inode all the time 2009-06-02 20:21 instead, log the orphan, and update the orphan table only on flush 2009-06-02 20:21 flush is per-delta? 2009-06-02 20:21 per flush cycle, a longer cycle than delta 2009-06-02 20:21 because that is the lifetime of a log block 2009-06-02 20:22 um... 2009-06-02 20:22 that was my original thought, I did not think about the details very carefully though 2009-06-02 20:22 but, if we don't update ileaf, orphan can be lost? 2009-06-02 20:22 ok 2009-06-02 20:23 the idea is, in the same delta that we discard the log block carrying the orphan, we update the orphan table 2009-06-02 20:23 the orphan table is constructed so that many orphans can be added and removed in efficient batches 2009-06-02 20:24 i see 2009-06-02 20:24 exactly what kind of table construction would satisfy that, I did not think about 2009-06-02 20:24 and it is very possible a simpler mechanism would be better 2009-06-02 20:24 yes 2009-06-02 20:25 ext3/4 make linked lists of orphans, through the inode table 2009-06-02 20:25 i see 2009-06-02 20:25 I think that must have some unfortunate side effects with large numbers of orphans 2009-06-02 20:25 now... 
what kind of load can create a large number of orphans? 2009-06-02 20:25 many temporary files? 2009-06-02 20:26 yes, for some strangely designed application 2009-06-02 20:26 amateur hour here- what if there's a counter in the atom table, once the orphaned list either reaches a certain number or certain age, it is then flushed 2009-06-02 20:26 *pardon my interruption 2009-06-02 20:27 the proposal is that the orphan table is a file on disk, so it's already flushed 2009-06-02 20:27 my proposal that is 2009-06-02 20:27 ah 2009-06-02 20:27 I expect hirofumi will have a counter proposal pretty soon 2009-06-02 20:28 linking through the inode table requires an additional cleanup step at some point 2009-06-02 20:28 i'm sure- just trying my hand at hacking 2009-06-02 20:28 btw, nfs or someone mentioned about orphan 2009-06-02 20:28 orphan handling is not optional 2009-06-02 20:28 it would like to use orphan inode over reboot 2009-06-02 20:28 yes 2009-06-02 20:28 right, nfs would like persistent orphans 2009-06-02 20:28 I remember 2009-06-02 20:29 it went on the must have list 2009-06-02 20:29 thanks for the reminder 2009-06-02 20:29 ok, that makes the orphan table sound even better 2009-06-02 20:29 the orphan table should be a simple array of inums, each of which is an orphan 2009-06-02 20:29 probably 2009-06-02 20:30 there can be free entries in the orphan table, these would be chained together 2009-06-02 20:31 if there are no free entries, then a new orphan is added by making the orphan table larger (isize++) 2009-06-02 20:31 at startup, all the inodes referenced by the orphan table are actually deleted, and the orphan table is set empty 2009-06-02 20:32 unless somehow nfs gets involved, then we do something to make it happy 2009-06-02 20:32 or, we can give nfs access to the orphan table before mount 2009-06-02 20:32 I think that is what they wanted 2009-06-02 20:32 yes 2009-06-02 20:33 sounds pretty simple, doesn't it? 
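The orphan table proposed just above (a flat array of inums, free entries chained together, growth by one entry only when no free entry exists, everything reaped at startup) can be sketched roughly as follows. This is only an illustration under stated assumptions, not tux3 code: the names `orphan_table`, `orphan_add`, `orphan_del`, `orphan_reap`, `FREE_TAG` and `NO_SLOT` are all hypothetical, an in-memory array stands in for the on-disk table file, and tagging free slots with the top bit is one possible encoding that the discussion does not specify.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define FREE_TAG (1ULL << 63)   /* assumption: top bit marks a free slot */
#define NO_SLOT  (FREE_TAG - 1) /* end of the free chain */

struct orphan_table {
	uint64_t *slot;    /* slot[i] holds an inum, or FREE_TAG | next free index */
	uint64_t count;    /* number of slots; stands in for the table file's isize */
	uint64_t freelist; /* first free slot, or NO_SLOT */
};

static void orphan_init(struct orphan_table *t)
{
	t->slot = NULL;
	t->count = 0;
	t->freelist = NO_SLOT;
}

/* Record inum as an orphan; returns its slot index, or -1 if out of memory. */
static int64_t orphan_add(struct orphan_table *t, uint64_t inum)
{
	assert(!(inum & FREE_TAG));
	if (t->freelist != NO_SLOT) {
		uint64_t i = t->freelist;
		t->freelist = t->slot[i] & ~FREE_TAG; /* unchain the reused slot */
		t->slot[i] = inum;
		return (int64_t)i;
	}
	uint64_t *bigger = realloc(t->slot, (t->count + 1) * sizeof *bigger);
	if (!bigger)
		return -1;
	t->slot = bigger;
	t->slot[t->count] = inum; /* no free entry, so the table grows by one */
	return (int64_t)t->count++;
}

/* Forget an orphan and chain its slot onto the free list; -1 if not found. */
static int orphan_del(struct orphan_table *t, uint64_t inum)
{
	for (uint64_t i = 0; i < t->count; i++) {
		if (t->slot[i] == inum) {
			t->slot[i] = FREE_TAG | t->freelist;
			t->freelist = i;
			return 0;
		}
	}
	return -1;
}

/* Startup pass: hand every live orphan to destroy(), then empty the table. */
static unsigned orphan_reap(struct orphan_table *t, void (*destroy)(uint64_t inum))
{
	unsigned reaped = 0;
	for (uint64_t i = 0; i < t->count; i++) {
		if (!(t->slot[i] & FREE_TAG)) {
			if (destroy)
				destroy(t->slot[i]);
			reaped++;
		}
	}
	free(t->slot);
	orphan_init(t);
	return reaped;
}
```

The linear search in `orphan_del` is the obvious weak point of this sketch; the log never says how a table entry would be located on removal, so a real design would presumably remember the slot index somewhere.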
2009-06-02 20:33 yes, probably 2009-06-02 20:33 using the log, we flush the orphan table to disk only once per flush cycle, it should be very efficient 2009-06-02 20:34 i see 2009-06-02 20:34 we have add_orphan and del_orphan log entries 2009-06-02 20:34 i see 2009-06-02 20:34 now, when we discard a log block, we want to scan it for orphan records and make the corresponding changes to the orphan table 2009-06-02 20:35 I guess real delete also be in orphan table with flag 2009-06-02 20:35 does it help? 2009-06-02 20:35 it can delay real free of bitmap and ileaf 2009-06-02 20:35 it might be a nice way to do deferred truncate 2009-06-02 20:36 ah, avoid some inode table updates 2009-06-02 20:36 by logging the delete instead 2009-06-02 20:36 yes 2009-06-02 20:36 well in that case, we probably want to flush the logged orphan entries into the inode table, rather than the orphan table 2009-06-02 20:36 and it is possibly simple 2009-06-02 20:37 sorry 2009-06-02 20:37 ah 2009-06-02 20:37 I meant, in that case, we probably want to flush the logged _deleted inums_ into the inode table, rather than the orphan table 2009-06-02 20:37 oh, why? 
2009-06-02 20:38 because it is about the same cost to update the inode table as update the orphan table 2009-06-02 20:38 ok, the inode table is a little more expensive to update, but not much 2009-06-02 20:39 btw, delete has some issue 2009-06-02 20:39 delete_inode handler will be called after last reference was gone 2009-06-02 20:39 it is same with orphan and delete 2009-06-02 20:40 both have 0 links, but the orphan also has non-zero use count 2009-06-02 20:40 I want to say here, iput() will be on the end of system call 2009-06-02 20:40 yes 2009-06-02 20:41 so, ->delete() and ->delete_inode() is not same time, even if real delete 2009-06-02 20:41 because unlink() itself has refcnt of inode 2009-06-02 20:42 well, this may not be issue actually though 2009-06-02 20:42 another issue is, an orphan can become actually deleted before the log is flushed 2009-06-02 20:42 yes 2009-06-02 20:43 this does not seem hard to handle, but we have to remember to handle it 2009-06-02 20:43 we can just add flag for deleted orphan inode? 
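One way to picture the case raised above, where an orphan is really deleted before the log is flushed, is to replay the add_orphan and del_orphan records of a log block in order when that block is discarded, so that an add followed by a del for the same inum cancels out. The sketch below is a guess at the shape of such a pass, not the tux3 implementation: the record layout, `LOG_ADD_ORPHAN`, `LOG_DEL_ORPHAN`, `orphan_set` and `flush_orphan_log` are all hypothetical names, and a small fixed array stands in for the orphan table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum log_type { LOG_ADD_ORPHAN, LOG_DEL_ORPHAN }; /* hypothetical record types */

struct log_record {
	enum log_type type;
	uint64_t inum;
};

#define MAX_ORPHANS 64

struct orphan_set {
	uint64_t inum[MAX_ORPHANS]; /* stand-in for the on-disk orphan table */
	size_t count;
};

/* Index of inum in the set, or -1 if it is not recorded as an orphan. */
static int find_orphan(const struct orphan_set *set, uint64_t inum)
{
	for (size_t i = 0; i < set->count; i++)
		if (set->inum[i] == inum)
			return (int)i;
	return -1;
}

/*
 * Apply the orphan records of one log block that is about to be discarded.
 * Records are replayed in log order, so an inode that was orphaned and then
 * really deleted within the same flush cycle leaves no entry behind.
 * Returns 0 on success, or -1 if the stand-in table is full.
 */
static int flush_orphan_log(struct orphan_set *set, const struct log_record *log, size_t records)
{
	assert(set->count <= MAX_ORPHANS);
	for (size_t i = 0; i < records; i++) {
		int at = find_orphan(set, log[i].inum);
		if (log[i].type == LOG_ADD_ORPHAN && at < 0) {
			if (set->count == MAX_ORPHANS)
				return -1;
			set->inum[set->count++] = log[i].inum;
		} else if (log[i].type == LOG_DEL_ORPHAN && at >= 0) {
			set->inum[at] = set->inum[--set->count]; /* move last entry into the gap */
		}
	}
	return 0;
}
```

Because the table would only be written out once per flush cycle, the short-lived orphan in this sketch costs nothing on disk beyond its two log records.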
2009-06-02 20:43 yes 2009-06-02 20:44 well, this is why I think delete and orphan is similar 2009-06-02 20:44 delete is very short time, orphan can be longer than it 2009-06-02 20:44 they are, and we should try to make that similarity part of the design, as you suggest 2009-06-02 20:45 ok 2009-06-02 20:45 anyway, I'm almost not thinking those yet, need to think 2009-06-02 20:46 I think it goes something like this: when we flush a dentry block to disk that has a deleted entry for a file that is still open, we also need to log an orphan record 2009-06-02 20:46 yes 2009-06-02 20:46 in other words, if the dentry isn't deleted in the disk image, it isn't an orphan 2009-06-02 20:47 it's oyasumi time for me 2009-06-02 20:47 yes 2009-06-02 20:47 ok, oyasumi 2009-06-02 20:47 I will see you again in about 9 hours 2009-06-02 20:47 ...I hope :) 2009-06-02 20:47 ok, see you :) 2009-06-02 21:43 folks 2009-06-02 21:43 hello again 2009-06-02 22:09 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-02 23:40 -!- RazvanM(~RazvanM@pool-173-75-179-112.bltmmd.east.verizon.net) has joined #tux3 2009-06-02 23:45 -!- RazvanM_(~RazvanM@96.234.242.13) has joined #tux3 2009-06-03 01:18 hi 2009-06-03 01:45 -!- pgquiles(~pgquiles@43.Red-88-16-35.dynamicIP.rima-tde.net) has joined #tux3 2009-06-03 02:01 hey hirofumi 2009-06-03 03:30 -!- pgquiles_(~pgquiles@194.Red-88-0-156.dynamicIP.rima-tde.net) has joined #tux3 2009-06-03 04:00 -!- pgquiles__(~pgquiles@57.Red-88-17-197.dynamicIP.rima-tde.net) has joined #tux3 2009-06-03 05:55 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-03 06:29 -!- dcg(~dcg@58.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-06-03 07:14 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-03 07:52 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-03 08:29 -!- dcg(~dcg@1.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-06-03 
09:35 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-03 10:50 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-03 11:07 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-03 12:04 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-06-03 14:34 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-03 14:37 -!- bh_(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-06-03 15:17 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-03 18:48 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-03 19:18 hi, I've posted the inode refcnt and writeback stuff 2009-06-03 19:18 please see when you have time 2009-06-03 19:18 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-06-03 19:37 hi 2009-06-03 19:38 ok 2009-06-03 19:38 thanks 2009-06-03 19:46 the unatom fix looks right 2009-06-03 19:47 yes 2009-06-03 19:47 inode list move is good 2009-06-03 19:48 did I add that list in tuxi, or you? 2009-06-03 19:48 me :) 2009-06-03 19:50 what I the special meaning of set_buffer_dirty vs mark_buffer_dirty? 2009-06-03 19:50 set_buffer_dirty is force to set dirty 2009-06-03 19:51 mark_buffer_dirty is dirty and with management 2009-06-03 19:51 um... 2009-06-03 19:51 set_buffer_dirty is only for buffer 2009-06-03 19:52 mark_buffer_dirty is with management stuff 2009-06-03 19:53 well, purpose of that is to separate inode stuff from buffer.c 2009-06-03 19:53 ah, and we have user/writeback.c 2009-06-03 19:53 yes 2009-06-03 19:54 seems fine 2009-06-03 19:54 thanks 2009-06-03 19:55 next patch puts more meat into writeback.c 2009-06-03 19:55 so, writeback.c will simulate in userspace what kernel writeback does? 
2009-06-03 19:56 looks like 2009-06-03 19:56 yes 2009-06-03 19:56 it's sensible 2009-06-03 19:56 it has some differences though 2009-06-03 19:56 good :) 2009-06-03 19:56 :) 2009-06-03 19:56 :) 2009-06-03 19:56 we're all smiling now ;) 2009-06-03 19:57 hi guys 2009-06-03 19:57 hi 2009-06-03 19:57 and the I_DIRTY flags also arrive in userspace 2009-06-03 19:57 yes 2009-06-03 19:57 I was never sure if that was a good design in kernel, but being the same is good 2009-06-03 19:58 yes 2009-06-03 19:59 what is the difference between I_DIRTY_SYNC and I_DIRTY_DATASYNC? 2009-06-03 19:59 it is only affect fdatasync(), iirc 2009-06-03 19:59 fdatasync() needs to flush inode too? 2009-06-03 20:00 _SYNC is not, _DATASYNC should 2009-06-03 20:00 this is where the kernel design seems a little odd 2009-06-03 20:00 well, I can worry about that later 2009-06-03 20:00 well, it's very rare to use _DATASYNC, iirc 2009-06-03 20:01 maybe, for special case 2009-06-03 20:02 the move-and-fix patch... what was the fix? 2009-06-03 20:03 which patch? 2009-06-03 20:04 move blockdirty() to commit.c, and fix 2009-06-03 20:04 ah 2009-06-03 20:04 just add __mark_inode_dirty() 2009-06-03 20:04 ah 2009-06-03 20:04 __weak... ? 2009-06-03 20:05 if app is not including commit.c, blockdirty is not available 2009-06-03 20:06 so, writeback.c provides dummy code instead of commit.c 2009-06-03 20:09 it's good to see you're fixing the xattr code as you go 2009-06-03 20:10 ok, with userspace sync_inodes, we should be able to fix the fuse code 2009-06-03 20:10 It's about time to call for a fuse code volunteer again 2009-06-03 20:11 ah, userspace also got inode ref counts 2009-06-03 20:11 yes 2009-06-03 20:12 ref counts is needed to delay flushing buffers 2009-06-03 20:12 yes, and for orphan stuff 2009-06-03 20:13 yes 2009-06-03 20:14 now, userspace sync_super knows how to sync inodes, good 2009-06-03 20:14 btw, ref counts has difference with kernel 2009-06-03 20:15 what is it? 
2009-06-03 20:15 writeback takes ref count, so, it makes different to life time of inode with kernel 2009-06-03 20:15 userspace writeback takes refcount, and kernel does not? 2009-06-03 20:16 yes 2009-06-03 20:16 kernel is caching the inodes via dentry 2009-06-03 20:16 ah, and we really don't want to emulate the dentry cache 2009-06-03 20:16 and if inode was expired, kernel will flush inode 2009-06-03 20:16 yes 2009-06-03 20:17 userspace takes ref count of inode, so, there is no expire when inode is dirty 2009-06-03 20:20 find_dirty_inode handles some of the functionality provided by dentry cache, I think 2009-06-03 20:20 so the dentry cache is slowly starting to appear in userspace 2009-06-03 20:20 find_dirty_inode is just finding the sb->dirty_inodes 2009-06-03 20:20 well, without the dentries 2009-06-03 20:20 yes 2009-06-03 20:21 there is no caching mechanism of inode 2009-06-03 20:21 if that is not done, then we don't get the effect of file caching 2009-06-03 20:21 yes 2009-06-03 20:21 and orphan handling would not be testable in userspace 2009-06-03 20:22 yes, probably 2009-06-03 20:23 oh you did the fuse :) 2009-06-03 20:23 - if (save_inode(inode)) 2009-06-03 20:23 - printf("save_inode error\n"); 2009-06-03 20:23 + mark_inode_dirty(inode); 2009-06-03 20:23 <- very nice to see 2009-06-03 20:24 ready for a pull? 2009-06-03 20:24 yes, a little though 2009-06-03 20:24 yes 2009-06-03 20:24 I hope it's mergeable 2009-06-03 20:24 yes, more fuse to do 2009-06-03 20:24 it looks mergeable 2009-06-03 20:25 thanks 2009-06-03 20:25 it compiles and runs still, I assume ;) 2009-06-03 20:25 it has the look of tested code 2009-06-03 20:25 yes 2009-06-03 20:26 I tested those more or less 2009-06-03 20:26 and it looks good 2009-06-03 20:27 pushed to public 2009-06-03 20:28 thanks 2009-06-03 20:28 have you thought more about orphan/deleted inodes? 
2009-06-03 20:28 not much though 2009-06-03 20:29 well, it seems not so easy to me 2009-06-03 20:29 ok, I will work on a design note 2009-06-03 20:29 maybe, inode state is "added" -> "orphaned" -> "deleted" 2009-06-03 20:30 good, thanks 2009-06-03 20:30 maybe 2009-06-03 20:30 "added" may be delta aware 2009-06-03 20:30 I still like the basic idea of log + orphan table, more details are needed 2009-06-03 20:30 yes 2009-06-03 20:31 I guess the log record is not generated by frontend 2009-06-03 20:32 frontend pass info to backend, and backend modify 2009-06-03 20:32 right 2009-06-03 20:33 orphan is not a frontend concept at all 2009-06-03 20:33 i see 2009-06-03 20:33 orphan is just the presence of inode structures on disk when there is no corresponding dentry 2009-06-03 20:35 yes 2009-06-03 20:49 it's oyasumi time 2009-06-03 20:50 ok, oyasumi 2009-06-03 21:31 hey folks 2009-06-03 22:56 -!- RazvanM(~RazvanM@96.234.242.13) has joined #tux3 2009-06-04 02:07 -!- Mark__T(~Mark__T@twitter.freenet-rz.de) has joined #tux3 2009-06-04 02:09 -!- Mark__T(~Mark__T@twitter.freenet-rz.de) has left #tux3 2009-06-04 04:43 -!- edt(~Ed@dsl-216-221-39-46.aei.ca) has joined #tux3 2009-06-04 05:48 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2009-06-04 08:27 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-06-04 09:04 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-04 10:49 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-04 11:15 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-06-04 18:58 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-04 20:09 ok, it's time for another think about how deletes are logged 2009-06-04 20:09 one thing: we can do much, much better than ext3 2009-06-04 20:10 ext3 takes forever to delete a big file 2009-06-04 20:10 I think it is right, a delete is like an orphan 2009-06-04 20:10 because... 
2009-06-04 20:10 because it recovers and frees all the file's index blocks 2009-06-04 20:11 ext4 is likely better, with extents 2009-06-04 20:11 extents get freed much faster, by a big multiple 2009-06-04 20:11 but still, we should only have to do two things to delete a file: 1) remove the directory entry; 2) log that the file is deleted 2009-06-04 20:13 then, if the file is still open at the next flush cycle, it becomes an orphan by adding its inode number to the orphan table 2009-06-04 20:14 so you can delete an open file- never thought of that before 2009-06-04 20:14 it's a nice trick unix can do 2009-06-04 20:14 that windows can't 2009-06-04 20:15 and it makes a big difference, for example, to how easy it is to upgrade a running system 2009-06-04 20:15 macs don't behave that way 2009-06-04 20:15 yes, macs do this too 2009-06-04 20:15 hmm, never tried 2009-06-04 20:15 max are unics 2009-06-04 20:15 oh wait, you can trash it, but can't empty the trash (I would assume that frees it) if the file is open 2009-06-04 20:16 you'd get an error msg 2009-06-04 20:16 that's just weird 2009-06-04 20:16 back to unix 2009-06-04 20:16 not unix 2009-06-04 20:16 yeah 2009-06-04 20:16 linux 2009-06-04 20:16 that's apple-broken unix 2009-06-04 20:16 hfs+ broken 2009-06-04 20:17 it might have something to do with hfs, yes 2009-06-04 20:17 hfs has been around longer than os/x 2009-06-04 20:17 yup 2009-06-04 20:17 pokey 2009-06-04 20:17 when it was designed, it ran on macos which probably can't do this 2009-06-04 20:18 let's see, hirofumi raised a couple of other issues 2009-06-04 20:18 let's see if they're there in the log 2009-06-04 20:19 ah, another issue is, an orphan can become actually deleted before the log is flushed 2009-06-04 20:24 truncate is also similar to deletion 2009-06-04 20:24 and versioning adds issues 2009-06-04 20:57 yes 2009-06-04 20:57 however, truncate has another issue 2009-06-04 20:58 it can shrink and expand the size, and write is also 2009-06-04 20:59 
write is also happened with same delta 2009-06-04 23:37 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-06-04 23:38 -!- RazvanM(~RazvanM@96.234.242.13) has joined #tux3 2009-06-05 05:21 -!- npmccallum(~npmccallu@cpe-76-177-118-207.natcky.res.rr.com) has joined #tux3 2009-06-05 06:52 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-05 06:54 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-05 07:24 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-06-05 07:38 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-05 07:55 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-05 10:18 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-05 10:56 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-06-05 11:12 ext3 doesn't do nice things at all while deleting multigigabyte files 2009-06-05 11:12 denies service to other tasks working on unrelated files for seconds at a time 2009-06-05 11:27 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-06-05 12:05 If only tux3 was in the kernel to try out... 
2009-06-05 12:08 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-05 13:47 hey flipz 2009-06-05 13:48 flipz: linux file systems incredibly suck as a whole 2009-06-05 13:48 but that's generally known 2009-06-05 15:01 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-05 17:24 -!- edt(~Ed@dsl-216-221-39-46.aei.ca) has joined #tux3 2009-06-05 20:24 ok, time to think about delayed delete some more 2009-06-05 20:25 delete + orphans + truncated, all related 2009-06-05 21:51 -!- edt(~Ed@dsl-216-221-39-46.aei.ca) has joined #tux3 2009-06-06 04:59 -!- edt(~Ed@dsl-216-221-39-46.aei.ca) has joined #tux3 2009-06-06 05:36 defer ileaf update may not be complex than I was thinking 2009-06-06 05:37 add - new_btree() should be delayed, and provide way to search pending added inode 2009-06-06 05:37 orphan - it can know inode->i_nlink and dirty? 2009-06-06 05:38 delete - provide inode number was deleted, and remove from add/orphan 2009-06-06 05:38 well, need to think more 2009-06-06 05:38 btw, I'm still not thinking about backend stuff 2009-06-06 06:07 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-06 07:32 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-06 08:33 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-06 09:06 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-06 09:41 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-06 13:12 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-06 13:13 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-06 14:09 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-06 14:33 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-06 18:10 -!- 
pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-06-06 18:10 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2009-06-06 18:10 -!- data(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-06-06 18:10 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-06-06 18:12 -!- flips(~phillips@phunq.net) has joined #tux3 2009-06-06 18:12 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-06 18:12 -!- edt(~Ed@dsl-216-221-39-46.aei.ca) has joined #tux3 2009-06-06 18:12 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-06-06 18:12 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-06-06 18:12 -!- dagle1(~dagle@host162-104.bornet.net) has joined #tux3 2009-06-06 20:58 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-07 04:18 -!- pythonstar(~kavli@c-0df1e455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-06-07 05:20 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-06-07 06:24 -!- edt(~Ed@dsl-216-221-39-46.aei.ca) has joined #tux3 2009-06-07 08:33 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-07 09:45 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-07 14:38 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-07 19:21 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-06-07 19:58 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-06-07 20:31 -!- edt(~Ed@dsl-216-221-39-46.aei.ca) has joined #tux3 2009-06-07 21:07 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-07 21:42 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-08 05:36 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-08 06:28 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-08 06:48 
-!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-08 06:56 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-08 07:17 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-08 07:35 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-06-08 07:46 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-08 08:54 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-08 09:10 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-08 09:11 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-06-08 09:58 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-08 10:00 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-06-08 10:43 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-08 10:53 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-08 11:35 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-08 15:03 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-08 15:03 -!- ajonat(~ajonat@190.48.122.107) has joined #tux3 2009-06-08 15:05 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-08 15:26 -!- edt(~Ed@dsl-216-221-39-46.aei.ca) has joined #tux3 2009-06-08 16:30 -!- edt(~Ed@dsl-216-221-39-46.aei.ca) has joined #tux3 2009-06-08 17:01 -!- edt(~Ed@dsl-216-221-39-46.aei.ca) has joined #tux3 2009-06-08 21:00 \quit 2009-06-08 22:33 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-08 23:16 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-06-09 03:54 -!- flips(~phillips@phunq.net) has joined #tux3 2009-06-09 07:37 -!- 
npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-09 09:07 good morning 2009-06-09 09:10 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-09 09:16 hi 2009-06-09 10:00 hi 2009-06-09 11:11 I will make a small but significant improvement to user/tux3: add proper options parsing 2009-06-09 11:12 hopefully with a new options parser that does not suck like getopt (3) and libpopt 2009-06-09 11:13 what is improvement point? 2009-06-09 11:16 well, anyway, I don't care almost about code of option parser 2009-06-09 11:17 I would just be care the cost of maintenance 2009-06-09 11:18 I hope maintenance cost is enough small 2009-06-09 11:29 I think this gives the minimum maintenance 2009-06-09 11:29 does not introduce a library dependency 2009-06-09 11:29 this is one of those things that is either right or wrong 2009-06-09 11:30 it will give us mkfs options for example 2009-06-09 11:30 what is your dislike points? 2009-06-09 11:30 mkfs? 2009-06-09 11:30 for example, block size 2009-06-09 11:31 dislike points of getopts? 2009-06-09 11:31 or? 
2009-06-09 11:31 yes 2009-06-09 11:31 getopts or libpopt 2009-06-09 11:34 ah, getopt (3) does not generate help or usage text, and it destructively modifies argv, libpopt has more features but is not installed by default and is quirky to use, both getopt and libpopt are verbose to use and generate error messages in their own format, not the way the app wants them 2009-06-09 11:36 these are pretty big deficiencies for a task that can be handled in a small amount of code 2009-06-09 11:36 libpopt is one of the main causes of bloat in ddsnap userspace support 2009-06-09 11:37 maybe 2000 lines ended up being used just for options parsing, and nobody wants to maintain that much code 2009-06-09 11:38 yes, well, personally, I also dislike libpopt 2009-06-09 11:39 it's not in libc 2009-06-09 11:40 however, I'm not so dislike getopt 2009-06-09 11:40 if app is using, getopt style options 2009-06-09 11:41 main disadvantage of getopt is help or usage text? 2009-06-09 11:41 btw, I have no objection to it 2009-06-09 11:42 I'm just interest why 2009-06-09 11:42 yes, help/usage is the main functional disadvantage 2009-06-09 11:43 the main practical disadvantage is needing to write a parsing loop 2009-06-09 11:43 the option parser could easily implement the loop itself 2009-06-09 11:44 it is also not nice to destructively change argv, and to use static variables for parse position 2009-06-09 11:44 yes 2009-06-09 11:44 also, getopt should not print its own error text 2009-06-09 11:46 your choices with getopt are 1) let getopt print errors in its own format or 2) disable error printing, and not know what is wrong 2009-06-09 11:50 i see 2009-06-09 11:51 I guess, error message stuff is designed for system commands 2009-06-09 11:51 yes, mainly user/tux3 2009-06-09 11:52 ah, it meant the error message design of getopt 2009-06-09 11:52 two options we want are blocksize and volume label 2009-06-09 11:52 there are probably more 2009-06-09 11:52 ah right 2009-06-09 11:52 sure 2009-06-09 11:52 
in general, library functions should not print 2009-06-09 11:53 fsck/mkfs would have standard options 2009-06-09 11:53 unless the main purpose of the library function is to print 2009-06-09 11:53 yes 2009-06-09 11:53 opterr can disable it, however, it doesn't tell error code? 2009-06-09 11:53 right, fsck would have options describing whether to prompt for fixes or not, etc 2009-06-09 11:54 ACTION will be back in a few minutes 2009-06-09 12:42 getopt will return '?' whether an option was not recognized, or an option value was missing, it does not provide a way to tell the difference 2009-06-09 12:44 it also returns '?' for an ambiguous match 2009-06-09 12:45 it is not clear what it does when an option has a value when it was not supposed to have a value 2009-06-09 12:46 even if it did return enough codes to tell each of the errors, it would require the application to write a lot of code to determine which error occurred 2009-06-09 12:47 so I solve several problems by giving my option parser a little bit of working memory, which it uses not only to store the reordered argv, but also holds error text if there is an error 2009-06-09 12:47 so all the application has to do is notice that negative error return, then print the error text 2009-06-09 12:50 as a convenience to the user, argv and argc are updated to point to the new arg vector, so that a program does not have to change the way it uses argv at all, if it was already written using positional parameters and just wants to add some options 2009-06-09 12:51 the object is to be able to add basic option processing to a program that already uses argv with about ten lines of code 2009-06-09 12:53 I think argv overwrite is confusing 2009-06-09 12:54 argv is pointer to memory which kernel prepared 2009-06-09 12:54 yes, not good behavior 2009-06-09 12:54 libpopt fixes that, but gets other things wrong 2009-06-09 12:55 we need libflipz 2009-06-09 12:55 ddlib 2009-06-09 12:58 argv update is also not good I think 2009-06-09 
12:59 actually I meant this 2009-06-09 12:59 I don't update the original argv 2009-06-09 13:00 but, argv = some_pointer? 2009-06-09 13:00 you pass the address of argv to the option scanner and the scanner reassigns it, provided there are no errors 2009-06-09 13:00 if you really don't want argv reassigned, pass some other variable 2009-06-09 13:01 I can't imagine why somebody would not want argv updated 2009-06-09 13:01 I guess it is confusing 2009-06-09 13:02 until learn the parse is what is doing 2009-06-09 13:04 int optc = scanopts(options, &argc, &argv, optv, sizeof(optv)); 2009-06-09 13:04 argc and argv will be updated if there are no errors 2009-06-09 13:05 pass structure to get result? 2009-06-09 13:06 int err = scanopts(&options_struct, argc, argv) 2009-06-09 13:07 and then reassign argc and argv yourself if you want to? 2009-06-09 13:07 yes, if I really want to do it, somehow 2009-06-09 13:07 well, in that case options_struct is just doing the work of a parameter list 2009-06-09 13:08 there is no another helpers? 2009-06-09 13:08 you could also write char **myargv = argv; int myargc = argc, optc = scanopts(options, &myargc, &myargv, optv, sizeof(optv)); 2009-06-09 13:09 yes 2009-06-09 13:09 another helper is a function to tell you how big optv should be, given existing argv and argc 2009-06-09 13:10 there is also a helper to return the error message from argv after an error 2009-06-09 13:11 there is no helpers after scanopts()? 
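The interface being debated here (the caller passes the addresses of argc and argv, the scanner builds a new argument vector in caller-supplied working memory, leaves the original argv untouched, and on error returns a negative value with the error text in that same working memory) might look roughly like the sketch below. It is not the scanopts() from options.c, whose internals are not shown in the log: `sketch_scanopts`, `struct optdesc`, `struct optseen`, `struct optwork` and the fixed limits are all hypothetical, a struct replaces the raw optv buffer to sidestep alignment questions, and only exact long options of the form `--name`, `--name=value` and `--name value` are handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAXARGS 32
#define MAXOPTS 16

struct optdesc { const char *name; int takes_value; }; /* table ends with a NULL name */
struct optseen { int index; const char *value; };      /* index into the optdesc table */

struct optwork {                   /* caller-supplied working memory */
	char *args[MAXARGS];       /* new argv: program name plus positional arguments */
	struct optseen seen[MAXOPTS];
	char error[80];            /* filled in only when the scan fails */
};

static int fail(struct optwork *work, const char *why, const char *arg)
{
	snprintf(work->error, sizeof work->error, "%s: %s", why, arg);
	return -1;
}

/* Returns the number of options found, or -1 with work->error set.
 * *argc and *argv are reassigned only on success; the old argv is never written. */
static int sketch_scanopts(const struct optdesc *options, int *argc, char ***argv,
			   struct optwork *work)
{
	assert(options && argc && argv && work);
	int nargs = 0, nopts = 0, literal = 0;
	char **old = *argv;

	work->args[nargs++] = old[0]; /* keep the program name */
	for (int i = 1; i < *argc; i++) {
		char *arg = old[i];
		if (literal || strncmp(arg, "--", 2) != 0) {
			if (nargs >= MAXARGS - 1)
				return fail(work, "too many arguments", arg);
			work->args[nargs++] = arg;
			continue;
		}
		if (!arg[2]) { /* a bare "--" ends option processing */
			literal = 1;
			continue;
		}
		const char *name = arg + 2, *eq = strchr(name, '=');
		size_t len = eq ? (size_t)(eq - name) : strlen(name);
		const struct optdesc *opt = options;
		while (opt->name && (strlen(opt->name) != len || strncmp(opt->name, name, len)))
			opt++;
		if (!opt->name)
			return fail(work, "unknown option", arg);
		if (eq && !opt->takes_value)
			return fail(work, "unexpected value", arg);
		const char *value = eq ? eq + 1 : NULL;
		if (opt->takes_value && !value) {
			if (i + 1 >= *argc)
				return fail(work, "missing value", arg);
			value = old[++i];
		}
		if (nopts >= MAXOPTS)
			return fail(work, "too many options", arg);
		work->seen[nopts].index = (int)(opt - options);
		work->seen[nopts].value = value;
		nopts++;
	}
	work->args[nargs] = NULL;
	*argc = nargs;
	*argv = work->args;
	return nopts;
}
```

A caller that already indexes argv positionally would only need to declare the option table and a `struct optwork`, call the scanner once at the top of main, and print `work.error` on a negative return. Passing copies of argc and argv instead, as suggested in the log, keeps the originals visible for anyone who finds the reassignment surprising.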
2009-06-09 13:11 none necessary 2009-06-09 13:12 some for convienience, like to return the number of times an option occurred 2009-06-09 13:12 well, why I'm not so like (&argc, &argv) is, if I see this, I will usually look the internal of it 2009-06-09 13:13 to see, what is doing to my argc/argv 2009-06-09 13:13 that well defined 2009-06-09 13:13 yes 2009-06-09 13:13 it updates them to point to a new argv, stored inside the optv 2009-06-09 13:13 but, need to learn 2009-06-09 13:14 true 2009-06-09 13:15 if it's not overwrite, I just see how to use it on below of scanopts() 2009-06-09 13:15 well, it would be only me, and it would not be important 2009-06-09 13:18 um..., yes, I guess, probably what I feel personally 2009-06-09 13:18 it's not important 2009-06-09 13:27 http://library.gnome.org/devel/glib/unstable/glib-Commandline-option-parser.html#g-option-context-parse 2009-06-09 13:27 like this interface? 2009-06-09 13:27 without context 2009-06-09 13:44 hey folks 2009-06-09 13:49 hirofumi, it has some similarities 2009-06-09 13:50 i see 2009-06-09 13:50 GOptionContext apparently combines the functionality of my "options" array and the optv working space 2009-06-09 13:51 the gnome version of this seems very heavyweight 2009-06-09 13:51 when in fact options parsing should be a lightweight task 2009-06-09 13:51 yes 2009-06-09 13:52 "If a long option in the main group has this name, it is not treated as a regular option. 
Instead it collects all non-option arguments which would otherwise be left in argv" 2009-06-09 13:52 it seems to have the grouping for gnome apps 2009-06-09 13:52 this seems excessive 2009-06-09 13:53 just for your amusement, check out the option aliasing feature provided by libpopt 2009-06-09 13:53 it allows users to provide their own option names that expand into other lists of options, per application 2009-06-09 13:54 it is one of the worst interface ideas I have heard of :) 2009-06-09 13:54 oh 2009-06-09 13:54 luckily, it is rarely used, except for the original rpm set of utilities it was designed for 2009-06-09 13:55 well, libpopt is not the option for me 2009-06-09 13:56 function naming is not match to me 2009-06-09 13:57 so, I don't use it until there is good reason, however, there is no the good reason to use for now 2009-06-09 13:57 :) 2009-06-09 14:18 time to sleep 2009-06-09 14:18 oyasumi 2009-06-09 14:55 -!- pythonstar(~kavli@c-0df1e455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-06-09 15:08 -!- edt(~Ed@130-79.162.dsl.aei.ca) has joined #tux3 2009-06-09 15:59 -!- edt(~Ed@dsl-62-180.aei.ca) has joined #tux3 2009-06-09 16:34 -!- edt(~Ed@dsl-62-180.aei.ca) has joined #tux3 2009-06-09 19:39 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-09 21:46 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-09 22:32 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-10 02:58 -!- edt(~Ed@dsl-62-180.aei.ca) has joined #tux3 2009-06-10 03:55 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-06-10 05:22 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-06-10 06:22 -!- edt(~Ed@dsl-62-180.aei.ca) has joined #tux3 2009-06-10 06:54 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-10 07:02 -!- edt(~Ed@dsl-62-180.aei.ca) has joined #tux3 2009-06-10 07:26 -!- 
tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-10 07:56 -!- edt(~Ed@dsl-62-180.aei.ca) has joined #tux3 2009-06-10 08:00 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-06-10 08:46 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-10 09:53 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-10 10:40 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-10 12:39 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-10 13:10 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-10 15:26 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-06-10 18:52 -!- edt(~Ed@dsl-62-180.aei.ca) has joined #tux3 2009-06-10 19:12 -!- edt(~Ed@dsl-62-180.aei.ca) has joined #tux3 2009-06-10 19:19 -!- ckwood_(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-06-10 20:05 ok, now what should I do with this option parsing code 2009-06-10 20:05 put it on phunq.net for review maybe 2009-06-10 20:34 http://phunq.net/files/options.c 2009-06-10 20:35 c99 options.c && ./a.out this program has --foo and --bar options 2009-06-10 20:36 turns out you can add option parsing to a program with half a dozen lines of c or so 2009-06-10 22:04 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-10 22:44 um... 2009-06-10 22:44 ./options --bar aaa ccc 2009-06-10 22:44 it seems segv 2009-06-10 22:46 "./options --b" seems to be parsed as "-b" 2009-06-10 22:59 "--" seems to be not supported 2009-06-10 23:26 hey folks 2009-06-10 23:43 -!- edt(~Ed@dsl-62-180.aei.ca) has joined #tux3 2009-06-10 23:59 -!- RazvanM(~RazvanM@pool-173-67-54-52.bltmmd.east.verizon.net) has joined #tux3 2009-06-11 00:19 hi 2009-06-11 00:21 for now, options.c doesn't have help and usage, it is going to be added? 
2009-06-11 00:33 -!- tux3bot(~tux3bot@yzf.shapor.com) has joined #tux3 2009-06-11 01:39 btw, honestly, I can't see big functional difference with getopt_long() 2009-06-11 01:40 however, I guess OPT_ARG_* would be extented more (OPT_ARG_NUMERIC?) 2009-06-11 01:41 if so, I think it would be useful than getopt_long() 2009-06-11 01:41 add callback to do it? 2009-06-11 01:42 { "bar", OPT_ARG_REQUIRE, opt_arg_nurmeric, &bar_value } ? 2009-06-11 01:43 opt_arg_nurmeric is callback function, &bar_value is optional argument for callback 2009-06-11 02:21 well, anyway, I wouldn't want to use your time for this job though 2009-06-11 05:22 thanks for the review hirofumi 2009-06-11 05:22 in practise, code using this interface is shorter and tidier 2009-06-11 05:23 one reason is, the scan loop is in the parser, not the application 2009-06-11 05:25 an oversight in the driving code 2009-06-11 05:25 - int optc = scanopts(options, &argc, &argv, optv, sizeof(optv), 1); 2009-06-11 05:25 + int optc = scanopts(options, &argc, &argv, optv, sizeof(optv), 0); 2009-06-11 05:26 that extra parameter is an idea from getopt (3) of doubtful use: option to treat all args as option values 2009-06-11 05:27 3 args: 2009-06-11 05:27 argv[0] = './a.out' 2009-06-11 05:27 argv[1] = 'aaa' 2009-06-11 05:27 argv[2] = 'ccc' 2009-06-11 05:27 1 opts: 2009-06-11 05:27 --bar = '(null)' 2009-06-11 05:28 -- works for me 2009-06-11 05:31 for numeric, the code is mostly there. I think, scanopts should just verify that the arg is numeric and let the app convert it with atoi etc 2009-06-11 05:32 it does not save much application code by having the options scanner store the numeric value, 2009-06-11 06:06 so now why... 
well, the first thing you see in a C program is the options parsing 2009-06-11 06:07 if that is ugly, then the code is off to an ugly start 2009-06-11 06:08 also, all nontrivial shell executables should parse gnu style options in a consistent format 2009-06-11 06:09 so the work to do that should be as small as possible. getopt (3) does not achieve that very well 2009-06-11 06:11 the separation of single character options away from the options table means that getopt (3) cannot automatically generate help and usage 2009-06-11 06:13 and it is a very bad idea to destructively change argv 2009-06-11 06:13 anyway, it's done :) 2009-06-11 06:13 can return to more important issues, like consistent handling of delete 2009-06-11 06:20 ./options --zot -- --zot 2009-06-11 06:21 1 args: 2009-06-11 06:21 argv[0] = './options' 2009-06-11 06:21 2 opts: 2009-06-11 06:21 --zot = '(null)' 2009-06-11 06:21 --d = '--zot' 2009-06-11 06:21 well, I've noticed it seems 32bit/64bit issue 2009-06-11 06:22 on x86_64, same arguments was coredump 2009-06-11 06:28 and this needs to loop to convert string to value? 2009-06-11 06:47 ah, it's in scanopts() 2009-06-11 06:47 *(top - optc) = (struct opt){-1, opt}; 2009-06-11 06:48 index == -1, so, optindex(optv, i) seems to be invalid for options[].name 2009-06-11 06:48 btw, this would have alignment issues 2009-06-11 06:50 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-11 06:51 maybe, scanopts() needs to verify the size 2009-06-11 06:54 ah, and char optv[...] wouldn't be good for alignment 2009-06-11 08:30 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-11 09:01 sorry, I didn't try it on 64 bit 2009-06-11 09:01 my bad 2009-06-11 09:02 I know what the problem is 2009-06-11 09:02 dropping the , 1 as above will fix it 2009-06-11 09:04 it means, (struct opt){ .value = opt}? 
2009-06-11 09:05 btw, I'm going to try deferring the ileaf update for inode creation 2009-06-11 09:05 and, before that, it seems to need to allow empty btree 2009-06-11 09:06 i.e. don't allocate root/leaf at inode creation 2009-06-11 09:10 what kind of list did you have in mind for newly created inodes? 2009-06-11 09:10 just link them together through a new field in the tuxinode ? 2009-06-11 09:11 re opts: I think the best thing to do is drop the index == -1 feature 2009-06-11 09:11 it is only intended to support some really rare use case where the ordering between opts and args is important 2009-06-11 09:12 I think, that app can use getopt (3) 2009-06-11 09:12 it was a bad idea to add complexity just to support that fringe case 2009-06-11 09:12 yes, for now, I'm thinking simply of a new list 2009-06-11 09:12 and linear search on each inode create, fine 2009-06-11 09:13 simple and powerful, and if it needs to be accelerated later, it can be 2009-06-11 09:13 it might need optimizing later, however, I'm thinking of leaving that for later :) 2009-06-11 09:13 right 2009-06-11 09:14 (struct opt){ .value = opt}?
2009-06-11 09:15 for -1 case 2009-06-11 09:15 (struct opt){ .value = opt} <- "opt" is spelled wrong here, should be "arg" 2009-06-11 09:15 it's a stupid feature I added, and will remove 2009-06-11 09:15 I was assumed "drop" meant to drop "-1" 2009-06-11 09:15 similar to the same stupid feature in getopt (3) 2009-06-11 09:16 drop the whole feature 2009-06-11 09:16 i see 2009-06-11 09:16 people should not use mixed args and opts that way, and if they really want to they can use getops (3) 2009-06-11 09:16 scanopts should just implement standard gnu opts the way users expect, nothing else 2009-06-11 09:17 the whole point of it is to be small enough to include per project, instead of using the library 2009-06-11 09:17 well, not the whole point, but a major point 2009-06-11 09:18 mixed args and opts wouldn't be big issue, probably 2009-06-11 09:19 however, I guess, "--" is needed to handle 2009-06-11 09:19 can be ignored, and anybody who does not agree can implement it ;) 2009-06-11 09:19 -- is handled 2009-06-11 09:19 oh 2009-06-11 09:19 just change the , 1) to , 0) as it should have been 2009-06-11 09:19 and is now 2009-06-11 09:19 I updated the file 2009-06-11 09:20 ah 2009-06-11 09:22 yes, with the change, it seems to work 2009-06-11 10:29 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-11 10:57 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-11 11:27 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-06-11 11:47 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-11 12:29 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-06-11 12:29 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2009-06-11 12:29 -!- data(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-06-11 12:29 -!- flips(~phillips@phunq.net) has joined #tux3 2009-06-11 13:16 -!- pgquiles(~pgquiles@7.Red-88-0-138.dynamicIP.rima-tde.net) has joined #tux3 
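The scanopts discussion above turns on two points: the scanner should leave the caller's argv untouched, and a bare "--" should end option processing. What follows is only a minimal sketch of that idea, not the code from options.c. The names scan_long_opts, struct optdesc and struct optfound are hypothetical, it handles long options only, and it demands an exact name match so that "--b" is rejected rather than guessed at.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry; not the struct used by options.c. */
struct optdesc {
	const char *name;  /* long name without the leading "--" */
	int takes_arg;     /* nonzero if the option consumes the next word */
};

/* One recognized option: index into the table plus its value, if any. */
struct optfound {
	int index;
	const char *value;
};

/*
 * Scan argv without modifying it. Recognized options go to found[],
 * every other word (including argv[0]) goes to args[]. A bare "--"
 * ends option processing. Returns the number of options found, or -1
 * on an unknown option, a missing value, or a full found[] array.
 * args[] must have room for argc pointers.
 */
static int scan_long_opts(const struct optdesc *table, int ntable,
	int argc, char *const argv[],
	struct optfound *found, int maxfound,
	const char **args, int *nargs)
{
	int optc = 0, argn = 0, literal = 0;

	assert(argc >= 1);
	args[argn++] = argv[0];
	for (int i = 1; i < argc; i++) {
		const char *word = argv[i];

		/* After "--", or without a "--" prefix, the word is a plain arg. */
		if (literal || strncmp(word, "--", 2) != 0) {
			args[argn++] = word;
			continue;
		}
		if (word[2] == '\0') {
			literal = 1;
			continue;
		}

		/* Exact match only: no abbreviations in this sketch. */
		int which = -1;
		for (int j = 0; j < ntable; j++)
			if (strcmp(word + 2, table[j].name) == 0)
				which = j;
		if (which < 0 || optc == maxfound)
			return -1;

		const char *value = NULL;
		if (table[which].takes_arg) {
			if (++i == argc)
				return -1;
			value = argv[i];
		}
		found[optc++] = (struct optfound){ which, value };
	}
	*nargs = argn;
	return optc;
}
```

Because results are written into caller-supplied arrays, the caller decides whether to replace argc and argv afterwards, which is the choice debated in the channel.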
2009-06-11 18:19 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-06-11 19:05 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-11 19:45 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-06-11 21:56 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-11 23:33 -!- RazvanM(~RazvanM@pool-173-67-54-52.bltmmd.east.verizon.net) has joined #tux3 2009-06-11 23:47 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-06-12 06:03 goog morning 2009-06-12 06:04 whoops :) 2009-06-12 06:04 freudian slip 2009-06-12 06:04 good morning 2009-06-12 07:23 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-12 08:28 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-12 09:40 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-12 09:54 -!- npmccallum(~npmccallu@cpe-76-177-118-207.natcky.res.rr.com) has joined #tux3 2009-06-12 11:29 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-12 12:43 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-06-12 13:46 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-06-12 15:38 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-12 18:45 -!- edt(~Ed@dsl-62-180.aei.ca) has joined #tux3 2009-06-12 20:27 -!- edt(~Ed@dsl-62-180.aei.ca) has joined #tux3 2009-06-12 23:05 -!- RazvanM(~RazvanM@pool-173-67-54-52.bltmmd.east.verizon.net) has joined #tux3 2009-06-13 02:02 -!- pgquiles(~pgquiles@7.Red-88-0-138.dynamicIP.rima-tde.net) has joined #tux3 2009-06-13 06:25 -!- domiel(~dnj@58.172.210.231) has joined #tux3 2009-06-13 13:13 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-13 13:38 -!- ckwood_(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-06-13 14:04 hey flips 2009-06-13 14:05 blast through your area at about 15 miles/hr on the 405 yesterday :) 
2009-06-13 17:29 hi 2009-06-13 17:30 this is for empty btree 2009-06-13 17:30 http://userweb.kernel.org/~hirofumi/inode-defer-alloc-prepare.patch 2009-06-13 17:30 it may be too hack 2009-06-13 17:30 um... 2009-06-13 17:34 and I noticed new issue for me 2009-06-13 17:35 sync_inode() doesn't handle I_DIRTY_SYNC if inode was dirtyed by flushing data 2009-06-13 17:36 the issue would be in kernel too 2009-06-13 17:36 the kernel is flushing inode twice (nowait, then wait), so it is handled 2009-06-13 17:47 hi 2009-06-13 17:47 hi 2009-06-13 17:48 it's short :) 2009-06-13 17:48 short? 2009-06-13 17:48 need to explain more? 2009-06-13 17:48 I mean, the patch is short 2009-06-13 17:48 ah 2009-06-13 17:49 but, it is too hack 2009-06-13 17:49 probably 2009-06-13 17:49 ah, btw, defer ileaf update is not implemented yet 2009-06-13 17:50 and I've noticed there is some bugs now :) 2009-06-13 17:54 for sure we want to create the btree in map_region, not on file create as now, which this patch does 2009-06-13 17:54 that was ambiguous ;) 2009-06-13 17:54 I meant, creating the btree at inode create time was always a hack by me 2009-06-13 17:55 :) 2009-06-13 17:55 what part of this do you not like? 2009-06-13 17:55 code is not clean 2009-06-13 17:56 which part do you hate the most? 
2009-06-13 17:56 all addition in map_region 2009-06-13 17:56 it just added the special case for empty btree 2009-06-13 17:57 I wanted to handle empty btree more naturaly 2009-06-13 17:58 maybe, it would be same with insert_leaf() almost 2009-06-13 17:58 in map_region seems like the correct place to add it 2009-06-13 17:59 yes, maybe 2009-06-13 17:59 probably 2009-06-13 17:59 yes, so, maybe, this patch is hack 2009-06-13 18:01 ok, so the hack is, you create the btree root just before probing it 2009-06-13 18:02 yes, exactly 2009-06-13 18:02 it's not the worst hack ever 2009-06-13 18:02 thanks 2009-06-13 18:02 another hack is the change of writeback 2009-06-13 18:03 and it was buggy 2009-06-13 18:03 I'm sure the worst hack ever must have been from me, but I can't remember what it was ;) 2009-06-13 18:03 it was so bad, it erased my memory 2009-06-13 18:03 :) 2009-06-13 18:04 where is the writeback part? 2009-06-13 18:05 user/writeback.c 2009-06-13 18:06 (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) <- doesn't gcc complain about this? 2009-06-13 18:06 if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) 2009-06-13 18:07 I thought it says "suggest parens" then 2009-06-13 18:07 my gcc version didn't complain about it 2009-06-13 18:08 well, it seems same with if (inode->state & I_DIRTY_SYNC) 2009-06-13 18:09 well, that code doesn't handle redirty correctly 2009-06-13 18:10 mark_btree_dirty is what we do when the btree changed because of store_attrs? 2009-06-13 18:11 yes 2009-06-13 18:11 it is special case of inode change 2009-06-13 18:12 it imply the btree root was changed 2009-06-13 18:13 maybe the name could be improved 2009-06-13 18:13 mark_btree_rerooted ? 
2009-06-13 18:14 or mark_btree_new_root 2009-06-13 18:14 yes 2009-06-13 18:15 well, I thought it's similar of mark_inode_dirty() 2009-06-13 18:15 so, s/inode/btree/ 2009-06-13 18:15 :) 2009-06-13 18:16 mark_buffer_dirty, mark_inode_dirty, mark_btree_dirty 2009-06-13 18:16 but your explanation above was clear: when the btree root is changed 2009-06-13 18:17 buffers don't get new roots ;) 2009-06-13 18:18 can you explain the interaction between I_DIRTY_SYNC and _reroot? 2009-06-13 18:19 currently, mark_btree_dirty just tell to writeback stuff, the marked inode is needed to flush 2009-06-13 18:20 and I_DIRTY_SYNC means, the inode needs to flush, not pages 2009-06-13 18:20 well, so, those are all part of writeback stuff 2009-06-13 18:21 for now 2009-06-13 18:21 I think that kernel writeback should not drive the writing of ileaf 2009-06-13 18:22 maybe 2009-06-13 18:22 we should just put the inode on our own list if ileaf needs to be updated 2009-06-13 18:22 however, vfs code may be needed some change 2009-06-13 18:23 what kind of change? 2009-06-13 18:24 probably, writeback stuff is similar with current userland 2009-06-13 18:24 userland writeback 2009-06-13 18:25 added code with this patch may be needed for kernel too 2009-06-13 18:25 we may disable those though 2009-06-13 18:27 um... 2009-06-13 18:27 probably, this is a bit pointless 2009-06-13 18:27 what is? 2009-06-13 18:27 if we think about atomic commit, writeback stuff will not be good already 2009-06-13 18:28 and if writeback manner, it will flush twice 2009-06-13 18:28 sync(0), and sync(1) 2009-06-13 18:28 true, we can't handle writeback of some subset of dirty inodes at this point, we have to write back all dirty inodes at each delta 2009-06-13 18:30 yes 2009-06-13 18:30 so, for user/writeback.c, flush twice for now? 
2009-06-13 18:30 fine 2009-06-13 18:30 ok 2009-06-13 18:31 I will be back in a hour 2009-06-13 18:31 ok 2009-06-13 18:31 btw, time to sleep for me 2009-06-13 18:31 I think this is not a bad hack 2009-06-13 18:31 thanks 2009-06-13 18:31 oyasumi 2009-06-13 18:32 usually, what time are you free? 2009-06-13 18:32 see you tomorrow 2009-06-13 18:32 on weekends, from 9 in the morning till 10 at night, usually 2009-06-13 18:33 if not weekends? 2009-06-13 18:33 well, I thought meeting everyday might be good 2009-06-13 18:34 yes 2009-06-13 18:34 6 am, my time 2009-06-13 18:34 even if few minites 2009-06-13 18:34 I will always be here 2009-06-13 18:34 6 am? 2009-06-13 18:34 yes 2009-06-13 18:35 now? 2009-06-13 18:35 it's 6 pm now :) 2009-06-13 18:35 oh :) 2009-06-13 18:35 it would be 10 pm your time I think 2009-06-13 18:35 10 am in japan 2009-06-13 18:36 I mean, 6 am my time is 10 pm your time, that is a good time to ping me 2009-06-13 18:36 I think you are always up then 2009-06-13 18:36 ah 2009-06-13 18:36 and so am I 2009-06-13 18:37 6 am isn't too early? 2009-06-13 18:37 it's fine 2009-06-13 18:37 oh 2009-06-13 18:37 anyway, if I respond then I was awake 2009-06-13 18:37 :) 2009-06-13 18:37 ok :) 2009-06-13 18:38 well, I'll see world clock when ping 2009-06-13 18:38 I am just about done with options parsing 2009-06-13 18:38 good 2009-06-13 18:38 I know that the options project was unnecessary 2009-06-13 18:38 but it was bothering me, needed to get it done so I can forget about it 2009-06-13 18:39 so... 
theory of writeback is a lot more important 2009-06-13 18:39 yes, can forget is important 2009-06-13 18:40 at least for me, I can't work parallel 2009-06-13 18:41 ok, when you get to the computer tomorrow, please ping me 2009-06-13 18:41 ok, thanks 2009-06-13 18:41 oyasumi 2009-06-13 18:42 oyasumi 2009-06-13 19:56 can I ask a noob question: what does "fuse: bad mount point `./test': Transport endpoint is not connected" mean 2009-06-13 19:58 ckwood_: probably means the filesystem crashed 2009-06-13 19:59 ah 2009-06-13 19:59 thanks 2009-06-13 20:18 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-13 20:20 hi ckwook_ 2009-06-13 20:20 ckwood_ 2009-06-13 20:21 need to fix fuse to give better error messages 2009-06-13 20:23 how do I do that? 2009-06-13 20:24 I meant, I need to do that 2009-06-13 20:24 more power to you if you can do that ;) 2009-06-13 20:24 ah, gotcha 2009-06-13 20:24 involves grabbing fuse source, making it do something reasonable, sending a patch 2009-06-13 20:25 do: make debug 2009-06-13 20:25 that will run the filesystem server in the foreground so you can see the tracing and error text 2009-06-13 20:26 make mkfs && make debug 2009-06-13 20:26 ah, nice -- works 2009-06-13 20:26 makes a small test filesystem and mounts it on mountpoint 'test' in your cwd 2009-06-13 20:27 loopback mount of a 1 MB file 2009-06-13 20:28 ah -- it crashed when I tried to touch a file in test -- I suppose I should investigate, eh =) 2009-06-13 20:28 yes, means we broke something relatively recently 2009-06-13 20:28 (actually asserted, not crashed) 2009-06-13 20:28 ah good 2009-06-13 20:31 touch test/foo works for me 2009-06-13 20:33 hmm -- no doubt I have something set up wrong (i'll keep fishing around), anyway I hit: tux3_releasedir: Failed assert(inode->inum == ino)! 2009-06-13 20:33 no doubt I have something set up wrong -- I'll keep fishing around; here's what I hit: tux3_releasedir: Failed assert(inode->inum == ino)! 
2009-06-13 20:33 tux3_releasedir: Failed assert(inode->inum == ino)! 2009-06-13 20:34 hmm, it's hard to set fuse up wrong 2009-06-13 20:34 something unanticipated 2009-06-13 20:35 actually -- i think it was on an ls in test 2009-06-13 20:35 ah, reproduced here 2009-06-13 20:35 thanks 2009-06-13 20:36 i'll look into it some more -- its a good first little issue 2009-06-13 20:36 hirofumi's recent improvements to the fuse code I think, it's a lot more efficient but a glitch crept in 2009-06-13 20:36 it is 2009-06-13 20:36 if you can hunt that down, you're hot ;) 2009-06-13 20:37 haha 2009-06-13 20:37 not kidding 2009-06-13 20:37 it's pretty hardcore stuff 2009-06-13 20:38 yeah -- most likely i wont figure anything out but it will be instructive for me to search 2009-06-13 20:40 guaranteed instructive 2009-06-13 20:49 so where does tux_set_inum(inode, inum); come from if I haven't the kernel sub directory of code compiled? 2009-06-13 20:51 it comes from the kernel directory 2009-06-13 20:51 most of the kernel code also runs in userspace 2009-06-13 20:51 ah 2009-06-13 21:12 why is TUX_ROOTDIR_INO set to 13 ? 2009-06-13 21:13 got a better number? 2009-06-13 21:13 0xd -> root _d_irectory 2009-06-13 21:14 in init -- looks like the FUSE_ROOT_ID gets this inum and then when I ls in the root, it compares the ino (1) to that and asserts... 2009-06-13 21:15 line number/file of the assert? 2009-06-13 21:15 tux3fuse.c:750 2009-06-13 21:16 I only have 749 lines in my tux3fuse.c 2009-06-13 21:16 i think it ends up opening sb->rootdir and its inum doesn't match 2009-06-13 21:16 oh yeah sorry, its 560 2009-06-13 21:17 no -- 318 2009-06-13 21:17 (final answer) 2009-06-13 21:17 I don't have an assert there either 2009-06-13 21:17 let's compare versions 2009-06-13 21:18 wc tux3fuse.c 2009-06-13 21:18 748 2182 17959 tux3fuse.c 2009-06-13 21:19 wc tux3fuse.c 2009-06-13 21:19 750 2192 18088 tux3fuse.c 2009-06-13 21:19 i just did an hg clone this evening 2009-06-13 21:21 clone from where? 
2009-06-13 21:22 http://phunq.net/tux3 2009-06-13 21:23 what does hg diff give you? 2009-06-13 21:26 2 new trace lines I just added 2009-06-13 21:26 ok that clears up the mystery 2009-06-13 21:27 so the original position of the assert was most likely tux3fuse.c#316 2009-06-13 21:27 assert(inode->inum == ino); 2009-06-13 21:29 anyway, 13 is the correct inum for root dir 2009-06-13 21:30 oh yeah -- sorry 2009-06-13 21:30 but did you say fuse_ino_t ino is !? 2009-06-13 21:30 by the way, were you able to unmount? 2009-06-13 21:33 fuse_ino_t for me is 1 2009-06-13 21:34 i unmount by: make untest 2009-06-13 21:34 that's a good way 2009-06-13 21:35 ok, I don't immediately know what the disconnect between FUSE_ROOT_ID and tux3's root inum is 2009-06-13 21:35 but I have confidence I soon will :) 2009-06-13 21:35 either I will find out or you will tell me 2009-06-13 21:36 I am currently finishing up automatic help text generation for the options parser 2009-06-13 21:37 got a little work to do there, wrapping the text specified width 2009-06-13 21:55 nice -- i will look more tomorrow 2009-06-13 22:02 see you tomorrow 2009-06-13 23:10 hi 2009-06-13 23:10 it is my fault 2009-06-13 23:12 assert((ino == FUSE_ROOT_ID && inode->inum == TUX_ROOTDIR_INO) || 2009-06-13 23:12 inode->inum == ino); 2009-06-13 23:12 I guess this change may fix it 2009-06-13 23:12 I thought you were sleeping 2009-06-13 23:13 yes, I slept few hours 2009-06-13 23:13 just now, I've waked up 2009-06-13 23:14 so fuse just assumes root inode number = 1 2009-06-13 23:14 maybe because ext2 does that? 
2009-06-13 23:15 probably 2009-06-13 23:15 ah, no 2009-06-13 23:16 ext2 seems to be 2 2009-06-13 23:21 ok, cameron's test works now 2009-06-13 23:21 I will commit that 2009-06-13 23:21 thanks 2009-06-13 23:28 ah, btw, inode-defer-alloc-prepare.patch store btree to ileaf always 2009-06-13 23:29 even if btree doesn't have root, it stores (.root = 0, .depth = 0) for now 2009-06-13 23:29 a small wart 2009-06-13 23:30 yes 2009-06-13 23:30 conditional inode->present is not implemented yet at all 2009-06-13 23:32 I will sleep pretty soon 2009-06-13 23:33 oyasumi 2009-06-13 23:33 domo, oyasumi 2009-06-13 23:33 :) 2009-06-14 01:52 -!- RazvanM(~RazvanM@pool-173-67-54-52.bltmmd.east.verizon.net) has joined #tux3 2009-06-14 02:14 -!- RazvanM(~RazvanM@96.234.240.179) has joined #tux3 2009-06-14 06:43 http://userweb.kernel.org/~hirofumi/inode-defer-ialloc-prepare.patch 2009-06-14 06:43 http://userweb.kernel.org/~hirofumi/inode-defer-ialloc.patch 2009-06-14 06:43 rough draft of defer ileaf update 2009-06-14 06:44 almost untested, and not thinking whether those are really correct or mot 2009-06-14 06:44 not 2009-06-14 06:46 however, it may be explain what I'm thinking 2009-06-14 06:57 -!- cdk(~chinmay@59.95.20.72) has joined #tux3 2009-06-14 07:35 -!- cdk(~chinmay@59.95.12.43) has joined #tux3 2009-06-14 08:19 -!- cdk(~chinmay@59.95.30.246) has joined #tux3 2009-06-14 14:25 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-14 16:00 -!- edt(~Ed@dsl-62-180.aei.ca) has joined #tux3 2009-06-14 17:40 some visual entertainment: http://cs.jhu.edu/~razvanm/fs-expedition/ 2009-06-14 19:12 hi razvanm 2009-06-14 19:12 pretty 2009-06-14 19:13 need forward/back links <- official beta test 2009-06-14 19:14 Linux Kernel 2.6.29 + tux3 <- wow good taste 2009-06-14 19:15 is it good or bad to touch 1/3 the number of symbols ext4 does? 
2009-06-14 19:16 one way of looking at it: the kernel is just a set of helper functions for ext4 2009-06-14 19:22 the number of external symbols for ext4 is boosted by the jbd2 2009-06-14 19:22 my thinking is, fewer symbols, coupled with high functionality = good design 2009-06-14 19:22 I think it's hard to say that having smaller number of symbols is better or worse 2009-06-14 19:22 that is true! :D 2009-06-14 19:23 less symbols for the same functionality and performance is better 2009-06-14 19:23 any argument? 2009-06-14 19:23 the cool part is that nobody is tuning their code for this metric :P 2009-06-14 19:23 nope, I agree with you :D 2009-06-14 19:24 some of our internal apis have become, ahem, somewhat rambling of late 2009-06-14 19:24 it isn't good in the long run 2009-06-14 19:25 time for some cleanup? 2009-06-14 19:26 always 2009-06-14 19:26 but it isn't a popular task 2009-06-14 19:26 and reception from maintainers tends to be a little frosty 2009-06-14 19:26 so you have to have a thick skin to do it 2009-06-14 19:26 in practice that means it doesn't happen 2009-06-14 19:26 acme excepted 2009-06-14 19:27 and he pretty much sticks to networks 2009-06-14 19:27 and not core network, which really needs some cleanup 2009-06-14 19:28 I was pitching to Raluca some time ago the idea of producing a linux tree with a very simplified networking part 2009-06-14 19:28 razvanm, what I would suggest for your next project if I may is, analyze the vfs+vm writeback+sync model 2009-06-14 19:29 flips: I haven't abandoned the fsbox project yet ;-) 2009-06-14 19:29 this part of the kernel is one of those parts that has failed to keep pace with evolution of the systems it serves 2009-06-14 19:29 it is still basically an upper half for ext2 2009-06-14 19:30 not even ext3 2009-06-14 19:30 frozen in time 2009-06-14 19:30 interesting 2009-06-14 19:30 so what you say is to find if some parts in kernel are only used by one file system?
2009-06-14 19:30 that's a very interesting thing to look into 2009-06-14 19:30 this is at the core of the recent ext4 sync debate as well 2009-06-14 19:31 how much of the kernel is used by kernel modules 2009-06-14 19:33 I think I missed the last few developments of the debate :P 2009-06-14 19:35 http://lxr.linux.no/linux+v2.6.29/+code=async_synchronize_cookie_domain <- something new 2009-06-14 19:35 since 2.6.29 2009-06-14 19:36 could be useful to tux3 2009-06-14 19:37 http://lxr.linux.no/linux+v2.6.29/kernel/async.c#L21 2009-06-14 19:37 I wonder if it's used ourside the boot part 2009-06-14 19:37 ourside = outside 2009-06-14 19:38 well it feels like something we want 2009-06-14 19:38 homework assignment: have opinions on that by tomorrow ;) 2009-06-14 19:39 this would be connected with the writeback model that hirofumi and I were discussing the last few days 2009-06-14 19:41 good idea 2009-06-14 20:20 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-14 21:51 -!- cdk(~chinmay@59.95.47.204) has joined #tux3 2009-06-14 22:33 -!- npmccallum(~npmccallu@cpe-76-177-118-207.natcky.res.rr.com) has joined #tux3 2009-06-14 22:49 -!- RazvanM(~RazvanM@96.234.240.179) has joined #tux3 2009-06-14 23:55 folks 2009-06-15 00:26 hi 2009-06-15 00:26 http://userweb.kernel.org/~hirofumi/inode-defer-ialloc.patch 2009-06-15 00:26 if you have time, please review a bit 2009-06-15 00:27 this is for deferred ileaf update 2009-06-15 05:01 http://cs.jhu.edu/~razvanm/fs-expedition/tux3.html 2009-06-15 05:29 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-06-15 06:41 -!- flips_(~daniel@phunq.net) has joined #tux3 2009-06-15 06:41 good morning hirofumi 2009-06-15 07:04 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-15 07:21 -!- dagle(~dagle@host162-104.bornet.net) has joined #tux3 2009-06-15 07:30 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-15 07:49 -!- 
tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-15 08:03 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-15 08:24 hi 2009-06-15 08:25 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-15 08:52 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-15 11:10 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-15 11:20 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-15 11:35 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-15 12:31 -!- Rahne(~chatzilla@tm.82.192.55.95.dc.telemach.net) has joined #tux3 2009-06-15 14:19 http://userweb.kernel.org/~hirofumi/foodev.png 2009-06-15 14:20 http://userweb.kernel.org/~hirofumi/defer-ialloc/ 2009-06-15 14:20 some cleanup and add comment to patches 2009-06-15 15:09 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-15 15:24 -!- edt(~Ed@dsl-62-180.aei.ca) has joined #tux3 2009-06-15 17:55 http://userweb.kernel.org/~hirofumi/defer-ialloc/ 2009-06-15 17:56 there were several bugs in this patchset 2009-06-15 17:56 I might be rethink and re-review this patchset 2009-06-15 17:57 however, now, this patchset is tested more or less on both of userland and kernel 2009-06-15 17:57 so, I'd like to tell this patchset early 2009-06-15 17:58 now, time to sleep 2009-06-15 17:58 oyasumi 2009-06-15 17:59 just missed you 2009-06-15 18:00 oyasumi 2009-06-15 18:12 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-15 18:50 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-15 19:06 razvanm, your article is getting web hits 2009-06-15 19:06 http://tech.slashdot.org/story/09/06/15/015207/A-Visual-Expedition-Inside-the-Linux-File-Systems 2009-06-15 19:07 ooh, a slashdotting 
2009-06-15 19:08 hah, a self slashdotting 2009-06-15 19:08 very slick 2009-06-15 20:19 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-15 22:51 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-15 23:36 -!- RazvanM(~RazvanM@96.234.240.179) has joined #tux3 2009-06-16 06:54 -!- npmccallum(~npmccallu@static-72-81-253-234.bltmmd.fios.verizon.net) has joined #tux3 2009-06-16 07:08 good morning 2009-06-16 07:16 hii 2009-06-16 07:17 I've found the another bugs in that patchset 2009-06-16 07:18 -!- pythonstar(~kavli@c-0df1e455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-06-16 07:19 wow, kde4 recognizes your patches as emails and opens them in kmail 2009-06-16 07:20 pretty slick, but I think I'd rather have it open in an editor, or better yet, right in the browser 2009-06-16 07:20 in other words I would like the browser to be less helpful 2009-06-16 07:21 ah, the patch must have been created with git-mail 2009-06-16 07:21 actually, it was created by my patch scripts 2009-06-16 07:22 yes, it is written to file as email format 2009-06-16 07:23 To: 2009-06-16 07:23 dummy@dummy 2009-06-16 07:23 yes, the script is to send patch as email 2009-06-16 07:24 handling a missing btree root as a hole is exactly right 2009-06-16 07:24 http://userweb.kernel.org/~hirofumi/defer-ialloc/ 2009-06-16 07:24 I've updated it now 2009-06-16 07:24 yes, read it 2009-06-16 07:24 ah 2009-06-16 07:24 http://userweb.kernel.org/~hirofumi/defer-ialloc.old/ 2009-06-16 07:25 is old version 2009-06-16 07:25 http://userweb.kernel.org/~hirofumi/defer-ialloc.diff 2009-06-16 07:25 is diff -urNp old new 2009-06-16 07:25 well 2009-06-16 07:26 one concern(?) 
is, caller can't know the end of file from map_region)( 2009-06-16 07:26 map_region() 2009-06-16 07:27 well, if it's needed, caller can manage it by inode->i_size though 2009-06-16 07:28 ah 2009-06-16 07:28 map_region() can't know without ->i_size actually 2009-06-16 07:28 end of file is determined by the inode 2009-06-16 07:29 inode->i_size 2009-06-16 07:29 yes 2009-06-16 07:29 as you said 2009-06-16 07:29 so, caller should manage it if needed 2009-06-16 07:29 "current" end of file as represented by btree doesn't really matter 2009-06-16 07:29 that's my idea 2009-06-16 07:30 so if we want to truncate a file, we just change the size attribute, we can leave the btree unchanged if we want 2009-06-16 07:30 I think that is a good thing 2009-06-16 07:30 i see 2009-06-16 07:31 it gives a deferred truncate, if we just remember which inode was truncated 2009-06-16 07:31 one issue is expanding the size again by truncate 2009-06-16 07:31 indeed :) 2009-06-16 07:31 so, we have the concept of "clear" when expanding 2009-06-16 07:31 well 2009-06-16 07:32 actually, it's the concept of "finish the truncate" 2009-06-16 07:32 yes, well, it seems the checking ->i_size in map_region() is not important 2009-06-16 07:32 yes, I think that is a good simplification 2009-06-16 07:33 ok 2009-06-16 07:38 the bug was in tux_create_inode, in kernel? 2009-06-16 07:39 it's from defer-ialloc.diff? 
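[editor's sketch] The deferred truncate idea from the 07:29 to 07:32 exchange can be modeled in a few lines. This is a toy in Python, not tux3's C code: `ToyFile`, `truncate`, `finish_truncate` and the dict standing in for the data btree are all hypothetical names, and sizes are counted in whole blocks to keep it short. It only shows that the size attribute decides end of file, that shrinking can leave the mapping alone, and that growing again has to "finish the truncate" first so old data does not reappear.

```python
HOLE = b"\0\0\0\0"  # what an unmapped block inside the file reads as


class ToyFile:
    """Toy model: a dict of block index -> data stands in for the data btree."""

    def __init__(self):
        self.i_size = 0                # authoritative end of file, in blocks
        self.blocks = {}               # stand-in for the btree mapping
        self.truncate_pending = False  # "remember which inode was truncated"

    def write_block(self, index, data):
        self.blocks[index] = data
        self.i_size = max(self.i_size, index + 1)

    def read_block(self, index):
        # End of file comes from i_size, never from what the mapping holds.
        if index >= self.i_size:
            return None
        return self.blocks.get(index, HOLE)

    def finish_truncate(self):
        # Drop every mapping at or past the current end of file.
        for index in [i for i in self.blocks if i >= self.i_size]:
            del self.blocks[index]
        self.truncate_pending = False

    def truncate(self, size):
        if size > self.i_size and self.truncate_pending:
            # Expanding after a deferred shrink: stale blocks must go first.
            self.finish_truncate()
        if size < self.i_size:
            self.truncate_pending = True  # shrink only touches the size
        self.i_size = size
```

Shrinking is just a size change here, so the cost of freeing blocks moves to whenever `finish_truncate` runs; when a real filesystem would choose to run it is outside what this sketch assumes.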
2009-06-16 07:39 yes 2009-06-16 07:39 bug was in several codes 2009-06-16 07:39 basically, it didn't handled empty btree correctly 2009-06-16 07:40 and writeback handling bug 2009-06-16 07:40 always having a btree is something I did just to get something working quickly 2009-06-16 07:40 now you are making it right 2009-06-16 07:40 thanks 2009-06-16 07:41 but, implement may not be right, we need to care 2009-06-16 07:41 later, we will also allow a pointer to a btree leaf straight from the ileaf, that would be depth=0 2009-06-16 07:41 I hope that we have depth=1 now 2009-06-16 07:42 depth==1 is bnode root and ileaf 2009-06-16 07:42 right, good 2009-06-16 07:42 so we should be able to allow depth=0 pretty easily 2009-06-16 07:42 and that will save a block per btree, it is just an optimization 2009-06-16 07:43 so not worth a lot of effort now, but it should only be a small change 2009-06-16 07:43 for it, we need to remove depth==0 btree 2009-06-16 07:43 remove? 2009-06-16 07:44 currently, depth==0 means there is no btree 2009-06-16 07:44 right, so what should it be? 2009-06-16 07:44 physical block = 0 is allowed 2009-06-16 07:44 I guess conditional ->present 2009-06-16 07:45 do we always have ->present everywhere btree root is used? 
2009-06-16 07:45 I think we do 2009-06-16 07:45 if there is no btree, inode doesn't have BTREE_BIT 2009-06-16 07:45 yes 2009-06-16 07:45 much better 2009-06-16 07:46 then we can assert on !BTREE_BIT && btree root not zero 2009-06-16 07:46 then we can assert on !BTREE_BIT && btree root not zero 2009-06-16 07:46 just an extra check 2009-06-16 07:46 yes 2009-06-16 07:47 so, the argument parsing project is finished :) 2009-06-16 07:47 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-16 07:47 I will post tonight 2009-06-16 07:47 :) 2009-06-16 07:47 ok 2009-06-16 07:47 good morning tim_dimm 2009-06-16 07:47 morning 2009-06-16 07:47 good morning 2009-06-16 07:48 the last bit needed was handling some strange situations like parsing part of the command line with one set of options, and the rest with a different set 2009-06-16 07:48 which is actually pretty common 2009-06-16 07:48 ddsnap does it 2009-06-16 07:48 so do things like sudo 2009-06-16 07:49 yes 2009-06-16 07:49 and basically any shell command that includes multiple subcommands 2009-06-16 07:49 call parser multiply 2009-06-16 07:50 yes, and I had to break up the parsing loop so you can call the parser one option/arg at a time 2009-06-16 07:50 similar to how getopt (3) works 2009-06-16 07:50 yes 2009-06-16 07:51 the nice thig is, you only have to do that if you have a special situation 2009-06-16 07:51 otherwise, you can just use the one line options parser 2009-06-16 07:51 also, automatic help and usage generation was done, it is at least as good as popt 2009-06-16 07:52 so now that whole thing is done, and can just be maintained 2009-06-16 07:52 yes, probably 2009-06-16 07:52 I think it is pretty likely to be popular and eventually be a library, but right now it is small enough just to cut and paste into the code 2009-06-16 07:52 which was a goal 2009-06-16 07:52 less hassle 2009-06-16 07:52 it can handle like "--blocksize=xxxx" as number? 
2009-06-16 07:52 yes 2009-06-16 07:52 good 2009-06-16 07:53 automatically checks it 2009-06-16 07:53 also automatically checks that an option only occurs once 2009-06-16 07:53 and you can give a flag to let it occur more than once 2009-06-16 07:54 it does not automatcally set variables for you like getopt does 2009-06-16 07:54 the reason for this is 1) it's extra complexity; 2) that messes up the program, the application variables get mixed with the parsing variables 2009-06-16 07:55 I actually like "set variables automatically" 2009-06-16 07:55 well add it ;) 2009-06-16 07:55 but, it should be control of user of parser 2009-06-16 07:55 of course it is 2009-06-16 07:55 and could easily be added 2009-06-16 07:55 sounds good 2009-06-16 07:55 just add an address field to the end of struct option 2009-06-16 07:56 call me lazy :) 2009-06-16 07:56 also, that feature does mess up your program 2009-06-16 07:56 because you end up with your variables declared in the wrong place 2009-06-16 07:56 at the option parser, instead of where the application code begins 2009-06-16 07:57 ddsnap suffers from this badly, it's one of the reasons it was hard to clean up 2009-06-16 07:58 um... 
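[editor's sketch] The parser behaviour described above (one option or argument per call, numeric values checked, an option accepted only once unless flagged, and no variables set on the caller's behalf) might look roughly like this. It is a Python illustration, not the parser discussed in the channel: `OPTIONS`, `parse_step` and the tuple shapes are hypothetical.

```python
# Hypothetical option table: name -> (takes_number, may_repeat).
OPTIONS = {
    "blocksize": (True, False),
    "verbose": (False, True),
}


def parse_step(args, seen, options=OPTIONS):
    """Consume one item from args.

    Returns ("opt", name, value), ("arg", text), or None when args is empty.
    The caller decides what to do with each result; nothing is stored here.
    """
    if not args:
        return None
    item = args.pop(0)
    if not item.startswith("--"):
        return ("arg", item)
    name, _, text = item[2:].partition("=")
    if name not in options:
        raise ValueError("unknown option --" + name)
    takes_number, may_repeat = options[name]
    if name in seen and not may_repeat:
        raise ValueError("option --" + name + " given more than once")
    seen.add(name)
    if not takes_number:
        if text:
            raise ValueError("option --" + name + " takes no value")
        return ("opt", name, None)
    if not (text.isascii() and text.isdigit()):
        raise ValueError("option --" + name + " needs a number")
    return ("opt", name, int(text))
```

Because each call handles a single item, a caller could switch to a different `options` table partway through the command line, which is the subcommand case mentioned at 07:48.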
 2009-06-16 07:58 another reason it was hard to clean up is, popt has a parsing state that needs to be destroyed 2009-06-16 07:58 yes 2009-06-16 07:59 popt seems to do too many things 2009-06-16 07:59 parser should be parser 2009-06-16 07:59 and set variable and others, can be done by using parser, I guess 2009-06-16 08:01 well, I'm not thinking about detail at all, so, it may not be true easily 2009-06-16 08:01 btw, I added the if (has_root_bnode(btree)) many places 2009-06-16 08:02 I guess it's not good, and we need to rethink 2009-06-16 08:03 right, it is convenient to set option variables in the option accept loop that every program has to have anyway 2009-06-16 08:03 and the bug was that I forgot to add it multiple times 2009-06-16 08:03 well, has_root_bnode sounds basically good 2009-06-16 08:03 thanks 2009-06-16 08:04 just needs to check a present flag instead 2009-06-16 08:04 well, I care I was forgetting to add multiple times 2009-06-16 08:04 forgetting to add the root? 2009-06-16 08:05 forgetting to add check empty root 2009-06-16 08:05 usually for me, it means, I'm not understanding code correctly, or implement is not natural 2009-06-16 08:06 it is why I care this bug 2009-06-16 08:08 I agree 2009-06-16 08:09 it seems trivial, but actually it is part of the object lifetime of a btree 2009-06-16 08:09 and therefore needs to be precise 2009-06-16 08:11 yes 2009-06-16 08:12 well, I was trying to add that handling as not special case 2009-06-16 08:12 however, it did not succeed 2009-06-16 08:12 getting closer 2009-06-16 08:13 well, I wanted to share current situation of code with you 2009-06-16 08:13 especially, bad point of code 2009-06-16 08:13 it looks basically right 2009-06-16 08:13 thanks 2009-06-16 08:13 and the bad points are just warts, not fundamental problems 2009-06-16 08:14 ok 2009-06-16 08:15 it is correct that the btree root should not be created until delta 2009-06-16 08:16 or in the current code, not until sync 2009-06-16 08:16 that means, create the 
btree root in filemap I think 2009-06-16 08:16 which is what you have written 2009-06-16 08:17 yes 2009-06-16 08:17 I think it's a good point of patchset 2009-06-16 08:18 please let me know when you think it is ready for a pull 2009-06-16 08:18 ok 2009-06-16 08:19 probably, a few days for review and test 2009-06-16 08:19 did you see razvanm's mention of tux3 on slashdot? 2009-06-16 08:19 no 2009-06-16 08:20 http://linux.slashdot.org/story/09/06/15/015207/A-Visual-Expedition-Inside-the-Linux-File-Systems?art_pos=1 2009-06-16 08:20 enjoy :) 2009-06-16 08:40 :) many graphs 2009-06-16 09:12 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-16 10:20 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-16 14:51 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-16 15:43 -!- tim_dimm_(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-16 17:25 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-16 19:19 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-16 20:38 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-16 21:30 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-16 23:54 -!- RazvanM(~RazvanM@96.234.240.179) has joined #tux3 2009-06-17 05:46 -!- npmccallum(~npmccallu@static-72-81-253-234.bltmmd.fios.verizon.net) has joined #tux3 2009-06-17 07:08 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-06-17 08:40 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-17 11:17 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-17 11:57 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-17 11:59 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-17 12:35 -!- 
cdk(~chinmay@59.95.38.32) has joined #tux3 2009-06-17 12:41 hi flips 2009-06-17 13:09 folks 2009-06-17 13:10 -!- ajonat(~ajonat@190.48.121.236) has joined #tux3 2009-06-17 13:50 hi cdk 2009-06-17 14:00 -!- dcg(~dcg@26.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-06-17 14:25 -!- edt(~Ed@dsl-62-238.aei.ca) has joined #tux3 2009-06-17 15:53 -!- ajonat_(~ajonat@190.48.93.189) has joined #tux3 2009-06-17 16:36 -!- noodles10(47a0be13@webchat.mibbit.com) has joined #tux3 2009-06-17 18:13 -!- npmccallum(~npmccallu@32.167.49.20) has joined #tux3 2009-06-17 21:20 -!- cdk(~chinmay@59.95.37.41) has joined #tux3 2009-06-18 05:46 -!- cdk(~chinmay@59.95.35.143) has joined #tux3 2009-06-18 05:50 good morning cdk 2009-06-18 05:50 hi 2009-06-18 05:51 hi 2009-06-18 05:51 good evening here 2009-06-18 05:51 :) 2009-06-18 05:51 long time 2009-06-18 05:54 so, what can we do now ? 2009-06-18 05:58 -!- npmccallum(~npmccallu@32.167.49.20) has joined #tux3 2009-06-18 06:58 -!- npmccallum(~npmccallu@static-72-81-253-234.bltmmd.fios.verizon.net) has joined #tux3 2009-06-18 07:13 cdk, how did your exams go/ 2009-06-18 07:13 ? 2009-06-18 07:44 -!- ralu_(~ralu@tm.82.192.55.95.dc.telemach.net) has joined #tux3 2009-06-18 08:05 flips , you there ? 
2009-06-18 08:06 exams were fine :) 2009-06-18 08:06 ready to get back to tux3 work 2009-06-18 08:42 -!- cdk(~chinmay@59.95.35.143) has joined #tux3 2009-06-18 09:11 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-18 10:20 -!- hatseflats(~hatseflat@193.200.132.183) has joined #tux3 2009-06-18 10:20 evening everyone 2009-06-18 10:24 hi 2009-06-18 11:19 http://userweb.kernel.org/~hirofumi/defer-ialloc/ 2009-06-18 11:19 I've updated the patchset 2009-06-18 11:19 on theory, writeback stuff didn't handle dirty flags correctly 2009-06-18 11:20 I think there is no case of the bug though 2009-06-18 11:21 now, sync_inode is more simpler, and it works like fsync 2009-06-18 11:21 and some xattr bugs fixes, and add some comments 2009-06-18 11:22 I may still be missing something 2009-06-18 11:24 if you can review, I appreciate 2009-06-18 11:48 -!- juice(1000@65.28.97.1) has joined #tux3 2009-06-18 12:42 hi hirofumi 2009-06-18 12:42 hi 2009-06-18 12:42 will do 2009-06-18 12:42 thanks 2009-06-18 12:43 changed patches from previous version are p01* ~ p05* 2009-06-18 12:53 "The requested URL /~hirofumi/defer-ialloc/index.html was not found on this server." 2009-06-18 12:53 http://userweb.kernel.org/~hirofumi/defer-ialloc/ 2009-06-18 12:53 ? 2009-06-18 12:54 yes 2009-06-18 12:54 um... 2009-06-18 12:54 it seems to works for me 2009-06-18 12:55 hmm 2009-06-18 12:55 reload? 2009-06-18 12:56 ok got it, somehow I got /index.html added 2009-06-18 12:57 oh 2009-06-18 12:57 I was thinking apache is providing fake /index.html 2009-06-18 12:58 links did :-] 2009-06-18 12:58 but, it seems to handle / directly 2009-06-18 13:00 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-18 13:11 -!- cdk(~chinmay@59.95.35.143) has joined #tux3 2009-06-18 13:39 hi cdk, there? 
2009-06-18 13:42 hirofumi, it looks good and just like we discussed, I would be happy to pull when you are ready 2009-06-18 13:42 ok, thanks 2009-06-18 13:43 I'll review again a bit, and probably pull request is tomorrow 2009-06-18 13:44 time to think about what next 2009-06-18 13:44 besides bug hunting this 2009-06-18 13:44 I'm thinking, next is, back to logging for creation path 2009-06-18 13:45 which uses the deferred inode create 2009-06-18 13:45 orphan/delete is pending 2009-06-18 13:45 ok, I will do some more work on it 2009-06-18 13:45 argument parsing is done ;-) 2009-06-18 13:45 ok :) 2009-06-18 13:46 I will post code tonight, then it will be officially somebody else's problem 2009-06-18 13:46 well, first is, apply my patches to current tree 2009-06-18 13:46 ok 2009-06-18 13:47 btw, patches meant my atomic-commit patchset 2009-06-18 13:47 and check what I need to do with defer-ialloc 2009-06-18 13:47 ah certainly 2009-06-18 13:48 flipz, here 2009-06-18 13:48 and I hope atomic-commit is starting to work for creation path 2009-06-18 13:49 if it seems to work, I'll see replay 2009-06-18 13:49 cdk, atomic commit is the current project 2009-06-18 13:49 btw, I'm not testing fuse at all 2009-06-18 13:49 I think, if your group is willing to examine and comment on the algorithms it would be a big help 2009-06-18 13:49 ok. what can be done to help ? 
2009-06-18 13:50 hirofumi, fine, we need to automate that 2009-06-18 13:50 cdk, automate the fuse testing ;-) 2009-06-18 13:50 yes, it's good 2009-06-18 13:50 make test-fuse 2009-06-18 13:51 and make tux3 is also 2009-06-18 13:51 ah, no test-tux3 2009-06-18 13:51 make test-fuse would automatically make and mount a small test filesystem and mount on a loopback file (as already exists in the makefile) then run a few commands on it, like touch, ls, and check the results are as expected 2009-06-18 13:52 make test-tux3 would be good 2009-06-18 13:52 yes 2009-06-18 13:52 ok will work on it, but i doubt that all of us can work now. 2009-06-18 13:52 some bugs was found by tux3 test 2009-06-18 13:53 it will probably be me and one other guy :( 2009-06-18 13:54 btw, now, with writeback, fuse can improve more or less 2009-06-18 13:54 actually, we have used test_ in the makefile 2009-06-18 13:55 ah, yes 2009-06-18 13:55 I found a bunch of bugs with tux3 2009-06-18 13:55 cdk, you and one other guy would be fine 2009-06-18 13:55 that's a "group" 2009-06-18 13:56 if you don't mind, I think of you as the chinmay group :) 2009-06-18 13:56 :) 2009-06-18 14:00 btw, when are you usually here? 2009-06-18 14:05 about 6 am to 8 pm, pacific time 2009-06-18 14:08 ok. i am off for the day. will start working again in a day or two. 2009-06-18 14:09 what kind of job? 2009-06-18 14:09 oh 2009-06-18 14:09 working on tux3 :) 2009-06-18 14:10 actually a job as well. that will start in july. but i guess i will have time to work here as well :) 2009-06-18 14:10 ACTION needs to study the additions / changes to the code. 2009-06-18 14:11 bye 2009-06-18 14:42 -!- mib_eokxuqab(75c302d3@webchat.mibbit.com) has joined #tux3 2009-06-18 14:55 -!- hatseflats(~hatseflat@193.200.132.183) has left #tux3 2009-06-18 15:14 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-18 16:11 tim_dimm, there? 2009-06-18 16:11 yo 2009-06-18 16:11 any plan on the server? 
2009-06-18 16:12 its up 2009-06-18 16:12 wha? 2009-06-18 16:12 batc 2009-06-18 18:01 -!- npmccallum(~npmccallu@32.147.212.210) has joined #tux3 2009-06-18 18:58 folks 2009-06-18 20:44 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-18 23:30 -!- mib_oqezbdyq(75c30bd7@webchat.mibbit.com) has joined #tux3 2009-06-19 00:42 -!- RazvanM(~RazvanM@96.234.239.105) has joined #tux3 2009-06-19 02:04 -!- pythonstar(~kavli@c-0df1e455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-06-19 04:09 -!- pythonstar(~kavli@c-0df1e455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-06-19 06:10 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-19 06:29 good morning 2009-06-19 06:31 hi 2009-06-19 06:38 http://userweb.kernel.org/~hirofumi/dirent/ 2009-06-19 06:39 I'm going to add 3 patches later, cleanup/fixes of FIXME of xattr 2009-06-19 06:39 (i.e. don't modify inode for xattr) 2009-06-19 06:40 now, *_entry is for buffer, *_dirent is modify inode timestamp too 2009-06-19 08:26 ah, like be sure we have space for the new xattr before removing the old one 2009-06-19 08:28 which patch? 2009-06-19 08:28 it's in the existing xattr code 2009-06-19 08:28 ah 2009-06-19 08:28 - use -= remove_old(xcache, xattr); 2009-06-19 08:28 + create = 1; 2009-06-19 08:29 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-19 08:29 which FIXMEs did you mean? 2009-06-19 08:29 it's for dir.c 2009-06-19 08:30 /* FIXME: this should be only dir, not xattr */ 2009-06-19 08:41 you mean, don't update the atom table mtime when creating a new atom? 
2009-06-19 08:41 yes 2009-06-19 08:41 and also for delete 2009-06-19 08:42 I suppose we don't care what time the last atom create was 2009-06-19 08:42 I guess it would be the cause of unneeded ileaf update 2009-06-19 08:42 yes, it will significantly improve xattr performance 2009-06-19 08:42 yes 2009-06-19 08:43 now, I noticed one of causes why I worry 2009-06-19 08:43 did I describe the xattr atom cache scheme? 2009-06-19 08:43 ah 2009-06-19 08:43 why you worry is more important 2009-06-19 08:44 btw, it is about defer ileaf update 2009-06-19 08:44 now, I was talking about different topic 2009-06-19 08:45 I was forgetting to add has_root_bnode() test to someplaces 2009-06-19 08:45 one of those was free_btree() 2009-06-19 08:45 now, we allow btree->root.block == 0 2009-06-19 08:46 so, free_btree can be done by truncate 2009-06-19 08:46 and I didn't forget to add has_root_bnode test to truncate 2009-06-19 08:47 I think it is why I forgot to add it to free_btree() 2009-06-19 08:48 well 2009-06-19 08:48 so, I added the FIXME to free_btree 2009-06-19 08:48 maybe, we should be done it by truncate? 2009-06-19 08:49 ah, actually, it should be done by tree_chop()? 2009-06-19 08:57 btw, what is atom cache scheme? 2009-06-19 09:05 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-19 09:53 ok, atom cache scheme 2009-06-19 09:53 ok 2009-06-19 09:53 just a simple hash to look up atom names fast 2009-06-19 09:53 only if the name is not found in the cache, then we read the atom table 2009-06-19 09:53 yes, via sb->atable? 2009-06-19 09:54 so that should mean, just one read of the atom table per atom type per mount 2009-06-19 09:54 um... 2009-06-19 09:54 instead of one read per xattr op 2009-06-19 09:56 find_atom() is finding atom for name? 
2009-06-19 09:57 yes, find_atom is where the atom cache goes 2009-06-19 09:57 atom usecount would be tracked in the cache instead of in the atom table 2009-06-19 09:58 we log new atom usecounts per delta 2009-06-19 09:58 find_atom is for each get_xattr? 2009-06-19 09:58 yes 2009-06-19 09:58 so it has to be fast 2009-06-19 09:59 our file reading is pretty fast, but cache lookup with be several times faster 2009-06-19 09:59 just one read means, just one disk read? 2009-06-19 10:01 it would normally just be one disk read to find an atom in the atom table, yes 2009-06-19 10:01 i see 2009-06-19 10:01 if atable buffer cache is not expired? 2009-06-19 10:01 right 2009-06-19 10:02 and if there aren't a lot of different atom names 2009-06-19 10:02 ah, yes 2009-06-19 10:02 do you know xattr format of other fs? 2009-06-19 10:02 yes 2009-06-19 10:02 ext2/3/4 2009-06-19 10:03 it's weird :) 2009-06-19 10:03 btw, I'm not checking xattr deeply 2009-06-19 10:03 it's ok 2009-06-19 10:03 when tux3 is actually interesting enough to run samba on it, then our xattrs will get checked very carefully 2009-06-19 10:04 yes 2009-06-19 10:04 with the atom cache, I think we will have xattr performance as good or better than any other fs 2009-06-19 10:04 and without the atom cache, it will be ok but not great 2009-06-19 10:04 ah, btw, checking meant, I'm not checking the format of other fs 2009-06-19 10:05 well, I'm not checking xattr itself deeply 2009-06-19 10:05 ext2 etc try to group identical attr values together, and they have a level of indirection to do that 2009-06-19 10:05 atom cache means buffer cache of atable? 2009-06-19 10:05 xattrs for multiple inodes are grouped together in a shared block 2009-06-19 10:05 this makes the code very complex 2009-06-19 10:06 our scheme is much simpler, and most likely, more efficient 2009-06-19 10:06 atom cache is a simple, in-memory hash table of atom names 2009-06-19 10:07 it sounds like, ext2 has central database for xattr? 
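[editor's sketch] The atom cache as described so far (a simple in-memory hash of atom names in front of the atom table, consulted first by find_atom) can be illustrated like this. It is a Python toy, not tux3 code: `ToyAtomTable`, `AtomCache` and the read counter are hypothetical, and the use count tracking mentioned at 09:57 is deliberately left out.

```python
class ToyAtomTable:
    """Stand-in for the on-disk atom table; counts how often it is read."""

    def __init__(self, names):
        self.names = list(names)  # atom number = position in this list
        self.reads = 0

    def lookup(self, name):
        self.reads += 1           # models one atom table read
        if name not in self.names:
            self.names.append(name)  # unseen name: create a new atom
        return self.names.index(name)


class AtomCache:
    """Name -> atom number hash consulted before the atom table."""

    def __init__(self, atable):
        self.atable = atable
        self.atoms = {}

    def find_atom(self, name):
        atom = self.atoms.get(name)
        if atom is None:          # atom 0 is valid, so compare against None
            atom = self.atable.lookup(name)
            self.atoms[name] = atom
        return atom
```

With this shape, repeated xattr operations on the same attribute name cost one table read in total, which matches the "one read of the atom table per atom type per mount" estimate given in the discussion.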
2009-06-19 10:07 not really 2009-06-19 10:07 and inode points to it 2009-06-19 10:07 it just tries to share xattrs between nearby inodes 2009-06-19 10:07 oh 2009-06-19 10:07 btrfs uses special btree keys to look up xattrs 2009-06-19 10:08 that scheme obviously has more overhead than the tux3 arrangement 2009-06-19 10:08 which looks up the xattr in an already open inode 2009-06-19 10:09 btrfs at least has no real limit on xattr size 2009-06-19 10:09 btrfs's key is key for value + name? 2009-06-19 10:09 ext2 etc have a small limit, one block - some overhead 2009-06-19 10:10 btrfs xattr key is something like inum + xattrname + 'xattr' 2009-06-19 10:10 they don't really have inum 2009-06-19 10:10 but an object id, which is like an inode number 2009-06-19 10:11 ah 2009-06-19 10:11 hammer does it that way too I think 2009-06-19 10:11 btw, I can't see big different of object base fs 2009-06-19 10:12 there is no way it can be as efficient as reading the xattrs with the inode open 2009-06-19 10:12 object base fs? 2009-06-19 10:12 all pointer is object id? 2009-06-19 10:12 btrfs and reiserfs4, zfs(?) 2009-06-19 10:12 ah 2009-06-19 10:13 well the btree lookup per xattr will be costly 2009-06-19 10:13 i see 2009-06-19 10:14 probably, the assumpion is, xattr is rare or not? 
2009-06-19 10:14 I think that is the assumption 2009-06-19 10:15 however, with samba it is not rare 2009-06-19 10:15 samba performance depends heavily on xattr performance 2009-06-19 10:15 yes, probably 2009-06-19 10:15 if xattr performance is not good, then details of filenames have to be sotred in a regular file 2009-06-19 10:15 I'm not using samba, so I don't know well 2009-06-19 10:16 however, I see somewhere, samba is using xattr for each files 2009-06-19 10:16 not just details of files, but windows acls too 2009-06-19 10:16 yes 2009-06-19 10:17 yes, windows acls don't mapp onto unix acls, so xattrs are used, and when they are not available or too slow, a hidden file is used 2009-06-19 10:17 well, at least, if posix acl is used, it seems those should be handled like ->i_mode 2009-06-19 10:18 oh 2009-06-19 10:18 samba can use hidden file for it? 2009-06-19 10:19 I'm just checking that claim 2009-06-19 10:19 tridge metioned it 2009-06-19 10:19 that it can use a file-based fallback 2009-06-19 10:19 i see 2009-06-19 10:19 and that the fallback is often faster than using the filesystem xattrs 2009-06-19 10:19 oh 2009-06-19 10:20 if so, xattr should be faster than it 2009-06-19 10:21 otherwise, it seems there is no reason to use xattr for samba almost 2009-06-19 10:22 http://lkml.indiana.edu/hypermail/linux/kernel/0411.2/1150.html 2009-06-19 10:22 "performance of filesystem xattrs with Samba4" 2009-06-19 10:28 ext2 seems to be good enough 2009-06-19 10:29 ext3 seems to be a bit slow 2009-06-19 10:31 especially, if xattr is enough small 2009-06-19 10:31 tux3 xattrs with the atom name cache added should be faster than ext2 2009-06-19 10:33 we add the another level of cache to xattr? 2009-06-19 10:33 (hash of name?) 2009-06-19 10:33 yes 2009-06-19 10:34 i see 2009-06-19 10:34 at the beginning of find_atom 2009-06-19 10:34 a very simple hash 2009-06-19 10:34 atable is not enough? 
2009-06-19 10:34 putting the hash in front of the atable will cut the overhead a lot 2009-06-19 10:35 by a factor of 10 or more I think 2009-06-19 10:35 ah, small hash 2009-06-19 10:35 we can use the kernel library hash 2009-06-19 10:35 for name lookup locality 2009-06-19 10:35 or code up a custom on 2009-06-19 10:35 yes 2009-06-19 10:35 big gain for little work 2009-06-19 10:35 i see, it sounds good 2009-06-19 10:36 I was thinking full hash of entrires 2009-06-19 10:36 that would suck ;) 2009-06-19 10:36 :) 2009-06-19 10:36 we have the inode cache for that 2009-06-19 10:37 well, the hash is for name -> atom though 2009-06-19 10:38 right, and the inode cache optimizes reading or writing and xattr value 2009-06-19 10:38 maybe, the issue is for big size xattr 2009-06-19 10:39 to support big attrs we need to introduce a new inode attribute 2009-06-19 10:39 which will be almost the same as a btree attribute 2009-06-19 10:40 this will make xattrs very similar to file data 2009-06-19 10:40 i see 2009-06-19 10:40 and they will be even more similar when we introduce an immediate data attribute for files 2009-06-19 10:40 yes 2009-06-19 10:41 so then, a short file can be embedded in the itable, so can a short xattr 2009-06-19 10:41 or a big file can be a data btree, so can an xattr 2009-06-19 10:41 yes 2009-06-19 10:42 the only difference should be, whether an atom number is present in the inode attribute or not 2009-06-19 10:42 we could even just reserve one atom number to mean "file data" 2009-06-19 10:43 but that would make most inodes larger 2009-06-19 10:43 so I think it is worth keeping a special form for file data attributes 2009-06-19 10:43 yes 2009-06-19 10:44 it sounds like same with ->i_mode or something 2009-06-19 10:44 ->i_mode is unix permission, not posix acl 2009-06-19 10:45 I guess it doesn't have big difference with posix acl, except file type 2009-06-19 10:45 so, it can be xattr 2009-06-19 10:45 but, I think it doesn't make sence 2009-06-19 10:46 right, 
because every inode has i_mode 2009-06-19 10:46 yes 2009-06-19 10:47 ACTION thinks why hg view is not using "diff -p" 2009-06-19 10:49 my hg used diff -p 2009-06-19 10:50 in "hg view"? 2009-06-19 10:50 ah 2009-06-19 10:50 it's config 2009-06-19 10:51 it seems showfunc = True 2009-06-19 10:52 ok, now, I can see diff -p 2009-06-19 10:52 right, forgot about that 2009-06-19 10:54 found the global config file in */etc 2009-06-19 10:56 ok, I think patches is mergable 2009-06-19 10:57 ready for a pull? 2009-06-19 10:57 now, I'm uploading to server 2009-06-19 10:57 ok, just ping when ready 2009-06-19 10:57 yes 2009-06-19 10:58 flips, ok, please check it 2009-06-19 10:58 I'll send email to ml 2009-06-19 10:59 btw, the pending issue is 2009-06-19 10:59 alloc_empty_btree() may be merged to insert_leaf() 2009-06-19 10:59 and free_empty_btree() may be merged to tree_chop 2009-06-19 11:31 ok 2009-06-19 11:32 agreed it would be nice to have those in the right place later 2009-06-19 11:33 yes 2009-06-19 11:34 especially free_empty_btree(), also the code would be simpler than current 2009-06-19 11:35 btw, do you know why some fs is using object id? 2009-06-19 11:37 object id? 2009-06-19 11:38 like btrfs? 2009-06-19 11:38 yes 2009-06-19 11:38 it allows a uniform style of access to a filesystem structure consisting of a single btree with different object types 2009-06-19 11:39 actually, I don't know why they call them object inodes and not inode numbers 2009-06-19 11:39 the latter would make more sense to me 2009-06-19 11:40 I think it is just to try to break with the "old" way of doing things 2009-06-19 11:40 i see 2009-06-19 11:40 sorry, I meant I don't know why the call them object ids 2009-06-19 11:40 btw, pointer to block is not object id? 
2009-06-19 11:41 maybe to give the sense that an inode is just one type of object in the tree 2009-06-19 11:41 ah, i see 2009-06-19 11:41 my feeling is, the only things in a filesystem are inodes 2009-06-19 11:41 in other words files 2009-06-19 11:42 yes 2009-06-19 11:43 btw, ileaf is some sort of key-value database, and it seems to be trend in web people 2009-06-19 11:44 it's a nice model 2009-06-19 11:44 and, I'm feeling there is good papers from those people 2009-06-19 11:45 probably 2009-06-19 11:45 yes, probably 2009-06-19 11:45 one thing though, they don't really care about wasting bytes, and we do 2009-06-19 11:45 for the future optimization 2009-06-19 11:46 i see 2009-06-19 11:46 they value regularity more than anything 2009-06-19 11:47 for example, they would want file data attributes to have atom numbers 2009-06-19 11:47 and they would want atom numbers even for i_size etc 2009-06-19 11:47 ah 2009-06-19 11:47 and they would want only one kind of file data attribute, and make everything be a btree 2009-06-19 11:48 the theory is, having a simpler schema is more important than efficiency 2009-06-19 11:49 i see 2009-06-19 11:49 I thing the web people have taken this a little too far, but that is just me 2009-06-19 11:49 ok, I'll be away from the keyboard for an hour or so 2009-06-19 11:50 ok, I'll sleep 2009-06-19 11:50 oyasumi 2009-06-19 12:25 oyasumi 2009-06-19 12:57 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-19 18:55 folks 2009-06-19 18:55 ACTION reads the backlog 2009-06-19 19:14 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-19 22:46 -!- RazvanM(~RazvanM@96.234.239.105) has joined #tux3 2009-06-19 22:51 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-20 08:45 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-20 13:59 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-20 14:53 -!- 
tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-20 16:20 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-06-20 21:13 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-20 23:52 -!- RazvanM(~RazvanM@96.234.239.105) has joined #tux3 2009-06-21 01:01 -!- RazvanM_(~RazvanM@pool-173-67-57-143.bltmmd.east.verizon.net) has joined #tux3 2009-06-21 07:14 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-21 11:20 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-21 11:23 -!- edt(~Ed@dsl-62-238.aei.ca) has joined #tux3 2009-06-21 14:15 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-21 15:28 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-06-21 18:36 -!- edt(~Ed@dsl-62-238.aei.ca) has joined #tux3 2009-06-21 19:58 -!- ajonat(~ajonat@190.48.103.70) has joined #tux3 2009-06-22 00:09 -!- RazvanM(~RazvanM@pool-173-67-57-143.bltmmd.east.verizon.net) has joined #tux3 2009-06-22 00:43 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-06-22 06:06 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-06-22 08:19 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-06-22 10:37 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-22 14:42 -!- edt(~Ed@dsl-62-238.aei.ca) has joined #tux3 2009-06-22 19:48 I wonder why mark_inode_dirty avoids unhashed inodes? 2009-06-22 19:48 there must be a story there 2009-06-22 19:48 unhashed? 2009-06-22 19:48 ah, in kernel 2009-06-22 19:48 http://lxr.linux.no/linux+v2.6.30/fs/fs-writeback.c#L150 2009-06-22 19:49 I don't know why 2009-06-22 19:49 need to check history of git 2009-06-22 19:51 it's been there since at least 2.6.11, let's see if it goes back before git 2009-06-22 19:53 goes back before 2.5.0 2009-06-22 19:54 2.2.0... 
2009-06-22 19:57 http://marc.info/?l=linux-kernel&m=93907280019386&w=2 2009-06-22 19:57 the first patch seems this 2009-06-22 19:58 how did you find it? 2009-06-22 19:59 just google with comment 2009-06-22 19:59 ah, no 2009-06-22 19:59 search ml with comment 2009-06-22 19:59 s/ml/linux-kernel/ 2009-06-22 19:59 very efficient, I was getting there with lxr but you beat me 2009-06-22 20:00 it seems like the reason for that is lost in time 2009-06-22 20:00 well, I'm a bit familiar than you, because of my history 2009-06-22 20:00 #ifndef CONFIG_FASTSYNC 2009-06-22 20:01 so... some inodes can be unhashed if they are deleted but busy 2009-06-22 20:02 busy because of sys_sync maybe 2009-06-22 20:03 oh 2009-06-22 20:03 it's not first 2009-06-22 20:03 there is the comment before that patch 2009-06-22 20:06 ah, one possible reason is make_bad_inode() 2009-06-22 20:06 it removes inode from hash 2009-06-22 20:06 it came in at 2.1.55 2009-06-22 20:07 67 * CAREFUL! We mark it dirty unconditionally, but 2009-06-22 20:07 68 * move it onto the dirty list only if it is hashed. 2009-06-22 20:07 69 * If it was not hashed, it will never be added to 2009-06-22 20:07 70 * the dirty list even if it is later hashed, as it 2009-06-22 20:07 71 * will have been marked dirty already. 2009-06-22 20:08 73 * In short, make sure you hash any inodes _before_ 2009-06-22 20:08 74 * you start marking them dirty.. 2009-06-22 20:09 http://git.kernel.org/?p=linux/kernel/git/davej/history.git;a=commit;h=b860551715b45a130a8f543816201833e19f0a35 2009-06-22 20:10 http://git.kernel.org/?p=linux/kernel/git/davej/history.git;a=commitdiff;h=4784473356a12e91e46507b870e5eeb470f8421e 2009-06-22 20:10 latter one seems it 2009-06-22 20:12 can alloc_inum do mark_inode_dirty? 
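The old kernel comment quoted above boils down to a single rule: the dirty flag is set unconditionally, but the inode is only queued on the dirty list if it is already hashed. A toy model of that rule, with hypothetical names and no claim to match the real kernel code paths, might look like this:

```python
# Toy model of the rule in the quoted comment; this is NOT kernel code.
# The class and the plain list standing in for the dirty list are assumptions.
class Inode:
    def __init__(self):
        self.hashed = False
        self.dirty = False


def mark_inode_dirty(inode, dirty_list):
    """Set the dirty flag unconditionally; queue only if hashed and newly dirty."""
    was_dirty = inode.dirty
    inode.dirty = True
    if not was_dirty and inode.hashed:
        dirty_list.append(inode)


dirty_list = []

early = Inode()                      # dirtied before being hashed
mark_inode_dirty(early, dirty_list)
early.hashed = True
mark_inode_dirty(early, dirty_list)  # already dirty, so it is never queued

late = Inode()                       # hashed first, then dirtied
late.hashed = True
mark_inode_dirty(late, dirty_list)   # queued as expected
```

In this model the `early` inode stays off the dirty list forever, which is the trap the comment warns about: hash the inode before marking it dirty.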
2009-06-22 20:13 ah, it does 2009-06-22 20:14 @@ -398,8 +399,10 @@ int main(int argc, char *argv[]) 2009-06-22 20:14 assert(inode1 && inode2 && inode3 && inode4); 2009-06-22 20:14 /* both is deferred allocation */ 2009-06-22 20:14 err = alloc_inum(inode1, 0x1000); 2009-06-22 20:14 + mark_inode_dirty(inode1); <- isn't this already done by alloc_inum? 2009-06-22 20:14 no 2009-06-22 20:14 where did I misread? 2009-06-22 20:14 probably, now, you are on middle of patchset 2009-06-22 20:14 yes 2009-06-22 20:14 later patch removes it 2009-06-22 20:14 ah 2009-06-22 20:15 ok, well I can't find any flaw by inspection, time to pull 2009-06-22 20:15 flawless patch set ;) 2009-06-22 20:15 thanks :) 2009-06-22 20:16 btw, that patchset doesn't handle itable->btree path 2009-06-22 20:17 what is not handled? 2009-06-22 20:17 the code is assuming itable->btree has root 2009-06-22 20:17 ah, well that is fine 2009-06-22 20:18 it is not true in make_tux3() 2009-06-22 20:18 how does it become true? 2009-06-22 20:18 ah 2009-06-22 20:18 if make_tux3() is handled by atomic-commit manner 2009-06-22 20:19 itable->btree will need to delay to update ileaf 2009-06-22 20:19 but, now, it's allocate blocks by new_btree() 2009-06-22 20:20 make_tux3 doesn't need atomic commit I think 2009-06-22 20:20 but it is nice to pay attention to it 2009-06-22 20:20 yes 2009-06-22 20:21 known problem is only new_btree() of itable 2009-06-22 20:22 it touches bitmap, and make logs by frontend 2009-06-22 20:23 it can delay easily 2009-06-22 20:23 new_btree for itable does not have to be perfectly clean, it only ever happens once 2009-06-22 20:24 however, the real problem would be we are having the path for each btree 2009-06-22 20:25 just about oyasumi time 2009-06-22 20:25 ok 2009-06-22 20:26 oyasumi 2009-06-22 20:26 see you in about 9 hours 2009-06-22 20:26 yes 2009-06-22 20:27 btw, I thought a bit about some format 2009-06-22 20:27 see you 2009-06-22 20:27 I'm not gone yet 2009-06-22 20:27 listening 2009-06-22 
20:28 oh 2009-06-22 20:28 simple one is about limitations 2009-06-22 20:28 * 2^60 maximum file size <- 2^48 blocks 2009-06-22 20:28 * 2^60 maximum volume size <- 2^48 blocks 2009-06-22 20:28 * 2^48 maximum versions <- um... 2009-06-22 20:28 early docs say 2^60 2009-06-22 20:28 it confuses me a bit 2009-06-22 20:29 because some archs seem to be able to use 4/8/16/64k blocksize 2009-06-22 20:31 and version is 2^48 2009-06-22 20:32 we can it? 2009-06-22 20:36 and I googled about xattrs 2009-06-22 20:36 right, that is for 4K blocksize 2009-06-22 20:36 http://www.osronline.com/article.cfm?article=457 2009-06-22 20:36 um.. about xattrs are a bit long 2009-06-22 20:37 s/long/long time/ 2009-06-22 20:37 and memo is still japanese :) 2009-06-22 20:38 oh, about 2^48 versions... we will need to introduce a version mapping table to map per-ileaf version to fs version 2009-06-22 20:38 well, in short, xattr may be bigger than I was thinking 2009-06-22 20:38 and may be using many inodes 2009-06-22 20:38 xattr on linux has a small size 2009-06-22 20:38 (I don't use at all though) 2009-06-22 20:39 it is defined by ext3 as less than one block 2009-06-22 20:39 if it is thinking about samba, it may be bigger than that 2009-06-22 20:39 so we will eventually support xattrs as large as a file, but we don't have to hurry 2009-06-22 20:39 samba has to work within the limitations of ext3 xattrs 2009-06-22 20:40 samba seems to be supporting up to 16k on xattr 2009-06-22 20:40 and on tdb (alternative of xattr) is 1M, iirc 2009-06-22 20:40 the question here is, what is the maximum xattr size for a cifs filesystem?
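The size limits quoted earlier in this exchange (2^60 bytes from 2^48 blocks, "for 4K blocksize") are easy to check, and the same arithmetic shows what the larger block sizes mentioned would imply. A small sketch, assuming the 2^48 figure is a block count as the "<- 2^48 blocks" notes suggest; the helper names are made up for illustration:

```python
# Assumption: 2^48 is the maximum number of addressable blocks.
MAX_BLOCKS = 2 ** 48


def max_bytes(block_size):
    """Largest byte size addressable with MAX_BLOCKS blocks of this size."""
    return MAX_BLOCKS * block_size


def as_power_of_two(n):
    """Return k such that n == 2**k (n must be a power of two)."""
    assert n > 0 and n & (n - 1) == 0
    return n.bit_length() - 1


for kib in (4, 8, 16, 64):
    size = max_bytes(kib * 1024)
    print(f"{kib:>2}K blocks -> 2^{as_power_of_two(size)} bytes")
```

With 4K blocks this gives the 2^60 figure from the docs; larger block sizes would raise the byte limit while the block count stays at 2^48.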
2009-06-22 20:41 well we know how to do big xattrs 2009-06-22 20:41 I guess it's unlimited 2009-06-22 20:41 or really same with file data 2009-06-22 20:42 that url is concept of streams of ntfs 2009-06-22 20:42 I would like to do big xattrs, but it isn't the top of the wish list 2009-06-22 20:42 and samba seems to map it to xattr 2009-06-22 20:42 I suppose it is worth asking tridge what the real limitation on cifs xattr size is 2009-06-22 20:42 tridge and co 2009-06-22 20:43 now... why did I decide 48 bits limitation on versions? 2009-06-22 20:43 ok 2009-06-22 20:43 well, xattr stuff is not only samba, it would be tomorrow 2009-06-22 20:44 s/xattr stuff/info by goole/ 2009-06-22 20:45 yes, xattrs should have been defined to be as big as files right from the beginning 2009-06-22 20:45 yes 2009-06-22 20:45 oh, and you need streams to read big xattrs 2009-06-22 20:45 getxattr will not do 2009-06-22 20:46 in other words, you need to be able to open them like files 2009-06-22 20:46 however, the issue is, it has mix of requestment 2009-06-22 20:46 we don't have an interface for that... 
yet 2009-06-22 20:46 yes 2009-06-22 20:47 samba usage of xattr is not fit to unix xattr, more or less 2009-06-22 20:47 at least, from view of interfaces 2009-06-22 20:48 ntfs streams is really alternate file data 2009-06-22 20:48 actually, the file data seems to be just one of streams 2009-06-22 20:48 well, back to version 2009-06-22 20:49 or tomorrow 2009-06-22 20:49 ok, tomorrow 2009-06-22 20:49 ok, oyasumi 2009-06-22 20:49 I will think about where the limitations come from 2009-06-22 20:49 oyasumi 2009-06-22 20:49 ok 2009-06-22 21:49 hey flks 2009-06-22 23:34 -!- RazvanM(~RazvanM@pool-173-67-57-143.bltmmd.east.verizon.net) has joined #tux3 2009-06-23 03:20 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2009-06-23 06:23 good morning 2009-06-23 06:23 hi 2009-06-23 06:23 well, I still don't know why unhashed inodes can't be dirty 2009-06-23 06:23 probably should know that 2009-06-23 06:24 anyway, what is next? 2009-06-23 06:24 version? 2009-06-23 06:24 2^48 2009-06-23 06:25 a very large number of versions 2009-06-23 06:25 there are 10 bits of version tag per attribute or extent 2009-06-23 06:25 so having a 48 bit version number space implies that the 10 bit versions have to be translated to 48 bits 2009-06-23 06:26 yes 2009-06-23 06:26 I forget why version number is not 64 bits, there was probably a reason at one time 2009-06-23 06:26 but 48 bits is already is ridiculously large number of versions 2009-06-23 06:27 each version table entry is something like 32 bytes 2009-06-23 06:27 so a full version table would be 2^53 = 8 petabytes 2009-06-23 06:27 too big 2009-06-23 06:28 reading an 8 petabyte table to translate it into a version tree would take some time 2009-06-23 06:28 ;) 2009-06-23 06:29 anyway, 256 quadrillion versions ought to be enough for anybody 2009-06-23 06:29 btw, there is version per attr 2009-06-23 06:29 it seems 12bits 2009-06-23 06:30 I think we should give a couple more bits to the attribute tag 2009-06-23 06:30 6 bit tag, 10 bit version 
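Two numbers from the discussion above can be made concrete: the size of a fully populated version table, and the proposed 6 bit tag plus 10 bit version split of the 16 bit attribute field. The sketch below assumes the tag occupies the high bits, which the chat does not state; `pack` and `unpack` are hypothetical helpers, not tux3 functions.

```python
TAG_BITS = 6        # proposed tag width from the chat
VERSION_BITS = 10   # version tag per attribute or extent
ENTRY_BYTES = 32    # "something like 32 bytes" per version table entry


def pack(tag, version):
    """Pack tag and version into 16 bits, tag in the high bits (assumed order)."""
    assert 0 <= tag < (1 << TAG_BITS)
    assert 0 <= version < (1 << VERSION_BITS)
    return (tag << VERSION_BITS) | version


def unpack(field):
    """Split a 16 bit field back into (tag, version)."""
    return field >> VERSION_BITS, field & ((1 << VERSION_BITS) - 1)


# A table with one entry for each of 2^48 versions:
full_table_bytes = (2 ** 48) * ENTRY_BYTES
```

Here `full_table_bytes` comes out as 2^53 bytes, the 8 petabyte figure mentioned in the chat, which is why a full table is never expected to exist in practice.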
2009-06-23 06:31 if version is 2^48, attr should have 2^48 space for each attr? 2009-06-23 06:31 there has to be a per-ileaf translation table 2009-06-23 06:31 so just 10 bits version field per attribute, translated into 48 bits 2009-06-23 06:32 version tables are probably shareable by many itable blocks 2009-06-23 06:32 I haven't analyzed this deeply 2009-06-23 06:32 um... 2009-06-23 06:32 one inode can have 10bits version? 2009-06-23 06:32 the plan is to make it work first with just 1K versions = 512 visible versions 2009-06-23 06:33 that is already more than ddsnap or netapp support 2009-06-23 06:33 it's enough to implement a nice granular backup scheme 2009-06-23 06:33 netapp is supporting how many snapshots? 2009-06-23 06:34 I think, 256 2009-06-23 06:34 i see 2009-06-23 06:34 ddsnap supports 64 2009-06-23 06:34 backup and replication are by far the most common use of filesystem level versioning (i.e., snapshots) 2009-06-23 06:35 probably 2009-06-23 06:35 I'm not sure what is using it 2009-06-23 06:35 everybody would use it if it were easy 2009-06-23 06:35 yes 2009-06-23 06:35 and if it were available, which it is not on linux 2009-06-23 06:36 and if it's writable, people may use it in new ways 2009-06-23 06:36 yes 2009-06-23 06:36 well, ok 2009-06-23 06:36 for example, you could have multiple simultaneous checkouts in git 2009-06-23 06:37 git? 2009-06-23 06:37 ah 2009-06-23 06:38 git == snapshot 2009-06-23 06:39 hammer, and btrfs is supporting how many? 2009-06-23 06:39 ACTION googles it 2009-06-23 06:43 much more 2009-06-23 06:43 a version is a new tree root in btrfs, or a new btree key in hammer 2009-06-23 06:44 new tree root can be limited?
2009-06-23 06:44 anyway, nobody uses the ability to have large numbers of versions yet 2009-06-23 06:44 I don't think there is any real limit to new tree roots 2009-06-23 06:45 algorithms are per-tree, so there isn't a slowdown either 2009-06-23 06:45 need to search root itself 2009-06-23 06:45 true 2009-06-23 06:45 in btrfs I think it's looked up in a directory, which is a btree 2009-06-23 06:45 so not too bad 2009-06-23 06:46 the negative aspect of multiple tree roots is, it can be a wasteful representation if the parts of the trees that differ are different only in a few bytes 2009-06-23 06:47 yes 2009-06-23 06:47 it also requires a refcounting scheme to know when the shared parts can be freed 2009-06-23 06:48 somewhat similar to the tux3 atom table 2009-06-23 06:48 however, it is quite practical to hold the entire atom table in memory, while that is not true of an entire filesystem structure 2009-06-23 06:49 so a refcounting scheme will have some unavoidable overhead 2009-06-23 06:49 however, multiple root is O(1)? 2009-06-23 06:49 yes 2009-06-23 06:50 i see 2009-06-23 06:50 it's not a bad approach 2009-06-23 06:50 it is, however, the subject of a patent dispute 2009-06-23 06:50 at least, a closely related technique is 2009-06-23 06:50 oh 2009-06-23 06:51 and it is for the lawyers to decide whether the techniques are similar or different 2009-06-23 06:51 ok 2009-06-23 06:51 so, there are not only technical but nontechnical reasons for adopting a different approach 2009-06-23 06:51 hammer is? 2009-06-23 06:51 hammer is very different, it is more like tux3 in its versioning scheme 2009-06-23 06:52 i see 2009-06-23 06:52 my concern of version for each attr is 2009-06-23 06:52 it makes one inode bigger?
2009-06-23 06:52 for each btree object, hammer records the version sequence number at which the object was created, and the sequence number at which it was deleted 2009-06-23 06:52 if inode has may versions 2009-06-23 06:53 yes, many different versions will make the inode larger 2009-06-23 06:53 i see 2009-06-23 06:53 for huge numbers of versions, the inode itself has to become a structure 2009-06-23 06:53 bad side 2009-06-23 06:54 yes indeed 2009-06-23 06:54 in practice, with the number of versions that are actually used now, it is not a problem 2009-06-23 06:55 and there is plenty of time to introduce a more efficient inode structure for larger numbers of versions 2009-06-23 06:55 also, a simple optimization for the 'current' version is easy 2009-06-23 06:55 it makes code complex? 2009-06-23 06:56 more or less 2009-06-23 06:56 a little 2009-06-23 06:56 um... 2009-06-23 06:56 the versioned pointer algorithms are not really simple 2009-06-23 06:56 i see 2009-06-23 06:56 however, it simplifies the rest of the filesystem 2009-06-23 06:56 btw, I felt what happen if version for each inode, not attr? 2009-06-23 06:57 data is inherited from the older versions of attributes 2009-06-23 06:58 ah 2009-06-23 06:58 needs refcount 2009-06-23 06:58 i see 2009-06-23 06:58 doesn't need refcount, because they are easy to count, they are all in one place 2009-06-23 06:59 ah 2009-06-23 06:59 seach same inode number 2009-06-23 06:59 s/seach/search/ 2009-06-23 07:00 yes, for a give set of versioned attributes, it can be efficiently determined which can no longer be accessed 2009-06-23 07:00 it may be near in btree leaf 2009-06-23 07:00 yes 2009-06-23 07:01 i see 2009-06-23 07:01 so, there is some set of attributes which form the "current" version. That set can be stored first in the inode, then the rest of the inode does not have to be loaded on open, for example 2009-06-23 07:01 um... 2009-06-23 07:02 what happen if user delete current? 
2009-06-23 07:02 there are lots of possibilities to optimize this layout, when it becomes a problem. It won't be a problem at just a few hundred versions 2009-06-23 07:02 if current is deleted, the remaining attributes can be resorted 2009-06-23 07:03 most inodes will not have to be changed 2009-06-23 07:03 only those that were altered in current 2009-06-23 07:04 if snapshot is for backup or something 2009-06-23 07:05 e.g. per user writable snapshot..., um... 2009-06-23 07:06 well 2009-06-23 07:07 btw, on current linux, we can't know which attribute changed 2009-06-23 07:07 was changed 2009-06-23 07:08 we need to add some advisory bits to the itable index blocks 2009-06-23 07:08 there are plenty of bit positions available 2009-06-23 07:08 hammer uses 'dirty' bits there 2009-06-23 07:09 vfs changes some attributes 2009-06-23 07:09 and calls mark_inode_dirty() 2009-06-23 07:10 the main offender is atime 2009-06-23 07:10 all timestamps 2009-06-23 07:11 other timestamps change less often 2009-06-23 07:11 yes 2009-06-23 07:11 atimes will be handled specially 2009-06-23 07:11 in an atime table, with 32 bits per inum 2009-06-23 07:11 directly indexed by inum 2009-06-23 07:12 and versioned 2009-06-23 07:12 so that atime works correctly, but does not affect the structure of the rest of the system 2009-06-23 07:13 um... 2009-06-23 07:13 isn't it slow? 2009-06-23 07:14 it won't be too bad 2009-06-23 07:14 not as bad as updating every inode the way ext3 must do 2009-06-23 07:15 for write, maybe 2009-06-23 07:15 however, it needs to read from disk? 2009-06-23 07:19 for sys_stat? yes 2009-06-23 07:19 need to read from two places, oh well 2009-06-23 07:20 if atimes aren't used then it is efficient 2009-06-23 07:20 the atime table will be sparse there 2009-06-23 07:20 easy to optimize 2009-06-23 07:25 um... 2009-06-23 07:26 my main concern might be complexity 2009-06-23 07:26 now, linux has relatime option by default 2009-06-23 07:27 um...
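The atime table described above (32 bits per inum, directly indexed by inum) makes the location of any inode's atime a matter of plain arithmetic. A minimal sketch, assuming a 4096 byte block size and ignoring the versioning that the chat says the table would also carry; `atime_location` is a hypothetical helper, not a tux3 function:

```python
ATIME_BYTES = 4      # 32 bits per inum, from the chat
BLOCK_SIZE = 4096    # assumed block size


def atime_location(inum):
    """Return (block index, byte offset within block) of the atime slot for inum."""
    return divmod(inum * ATIME_BYTES, BLOCK_SIZE)
```

Under these assumptions one block holds the atimes of 1024 consecutive inodes, so a stat needs at most one extra block read, and regions with no recorded atimes can simply be left unallocated (the "sparse" case mentioned above).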
2009-06-23 07:29 well, recently reading of code, ileaf/dleaf seems to become complex (especially, dleaf) 2009-06-23 07:30 well 2009-06-23 07:30 xattr infos 2009-06-23 07:31 samba seems to be using it for NTACL and streams 2009-06-23 07:31 and selinux is for selinux context, this seems for each inode 2009-06-23 07:32 beagle seems to be using it for own metadata 2009-06-23 07:32 both acl may be inherit from parent (really maybe) 2009-06-23 07:33 acl and selinux context 2009-06-23 07:34 so, I guess ext* may be assuming those 2009-06-23 07:45 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-06-23 08:52 I don't see how ext* can support ntfs streams with the xattrs it has 2009-06-23 08:53 agreed, dleaf is complex 2009-06-23 08:53 the thing is, this complexity is local 2009-06-23 08:53 it doesn't affect the whole system 2009-06-23 08:53 local complexity is much easier to deal with, test coverage can be more complete 2009-06-23 09:42 -!- RazvanM(~RazvanM@pool-173-67-57-143.bltmmd.east.verizon.net) has joined #tux3 2009-06-23 11:22 it meant ext* is assuming posix acl and selinux context 2009-06-23 11:32 um.. 
2009-06-23 11:33 dleaf affect to map_region 2009-06-23 11:33 and map_region is used from many path 2009-06-23 11:33 true, but the complexity always stays in one call chain 2009-06-23 11:34 well, it forces the use of an access api 2009-06-23 11:35 we will see how it evolves 2009-06-23 11:35 as the main author of the access api, you are entitled to criticize it ;) 2009-06-23 11:35 :) 2009-06-23 11:36 anyway, the reason for prematurely optimizating the dleaf is, this is file format, which is painful to change later 2009-06-23 11:37 so the strategy is to choose an efficient and flexible format right from the beginning 2009-06-23 11:37 I worry if simpler one may efficient (from cost, and pure efficient) 2009-06-23 11:37 well, not so big though for now 2009-06-23 11:38 any alternative would need to be able to handle the versioning info that is coming 2009-06-23 11:38 in the end, the api is not a lot of code 2009-06-23 11:38 i see 2009-06-23 11:39 e.g. if we remove group/entry, it would be simpler 2009-06-23 11:40 well, ok 2009-06-23 11:40 indeed 2009-06-23 11:40 but not a lot faster 2009-06-23 11:40 i see 2009-06-23 11:41 related to xattr, I was seeing about attr 2009-06-23 11:41 what is difference with xattr and normal attr? 
2009-06-23 11:42 not much 2009-06-23 11:42 - xcache_lookup()/xcache_update()/remove_old() is needed to access it 2009-06-23 11:42 1) normal attr is stored in a standard inode struct 2009-06-23 11:42 2) standard attr is stored more compactly than xattr 2009-06-23 11:42 3) standard attrs are used a lot more than xattrs 2009-06-23 11:43 4) attrs has ->present bit for each type 2009-06-23 11:43 4) xattrs are accessible via the posix xattr interface, standard attrs are not 2009-06-23 11:44 ah, mine is 5) then 2009-06-23 11:44 5) xattr can support variable length 2009-06-23 11:44 :) 2009-06-23 11:44 :) 2009-06-23 11:44 so, attr is fast, small metadata, xattr is big and slow 2009-06-23 11:45 I hope, not too big and not too slow ;) 2009-06-23 11:45 :) 2009-06-23 11:46 I am pretty happy with the model of storing xattrs as a small structure attached to the inode 2009-06-23 11:46 and question is attr tag and atom is almost same 2009-06-23 11:46 an xattr has both an attr tag and an atom 2009-06-23 11:47 yes 2009-06-23 11:47 probably, it's unnecessary 2009-06-23 11:48 each attribute, including xattrs and standard attrs has 16 bit tag:version 2009-06-23 11:48 as you know, I'm just making my argument ;) 2009-06-23 11:48 :) 2009-06-23 11:49 I think the question is, why can't the two kinds of attributes be more similar? 2009-06-23 11:50 umm...., forget 2009-06-23 11:50 :) 2009-06-23 11:50 oh, here is the big difference between standard attributes and xattrs: 0) standard attributes are defined by the system, xattrs are defined by the user 2009-06-23 11:51 but, if it's posix acl or selinux context? 2009-06-23 11:51 or more accurately: 0) standard attribute types are defined by the system, xattr types are defined by the user 2009-06-23 11:51 are you suggesting, give selinux access to the attribute tag api? 
2009-06-23 11:52 no 2009-06-23 11:52 not really 2009-06-23 11:52 this is from google info, I think xattr is mix of both 2009-06-23 11:52 and interface is limited 2009-06-23 11:53 so, it's not so useful 2009-06-23 11:53 :) 2009-06-23 11:53 I wonder what kind of interface would work for huge xattrs 2009-06-23 11:54 could we create an inode for it, and open as a fd? 2009-06-23 11:54 and, could we open a small xattr the same way, so access is uniform? 2009-06-23 11:55 I think it can 2009-06-23 11:55 and the argument for why linux needs the new api is? 2009-06-23 11:56 no 2009-06-23 11:56 it is just what I felt :) 2009-06-23 11:56 argument is... 2009-06-23 11:57 docs said 2^48 versions 2009-06-23 11:57 so, I though current version bits of attr is not enough already 2009-06-23 11:58 thought 2009-06-23 11:58 and if version is moved from attr tag, tag can have 16bits 2009-06-23 11:59 now, attr tag is having 16bits, and it's same with atom bits 2009-06-23 11:59 oh, why attr tag and atom is different? 2009-06-23 12:06 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-23 12:23 well, and if it's same, it will map to ->present, and may be able to share more code 2009-06-23 12:42 ah, ok I should define the way 10 bits maps to 48 2009-06-23 12:42 I'll make a post 2009-06-23 12:43 the object is, keep all the version fields at 10 bits, but map to 48 bits using a per-ileaf or per-dleaf mapping table 2009-06-23 12:44 i see 2009-06-23 12:44 actually, it would be more efficient to map to 32 bits 2009-06-23 12:44 um... 2009-06-23 12:44 I would be in favor of 2**32 version limit 2009-06-23 12:45 folks 2009-06-23 12:45 it needs to lookup table 2009-06-23 12:45 ? 2009-06-23 12:46 hi 2009-06-23 12:46 yes 2009-06-23 12:47 how's it going folks ? 
2009-06-23 12:47 so each ileaf and each dleaf will carry a version lookup table id 2009-06-23 12:47 I should add that to the format 2009-06-23 12:47 these lookup tables have to be stored persistently 2009-06-23 12:48 they can all be the same size, based on the number of different versions one block can hold 2009-06-23 12:49 (4096/64) * entry or something? 2009-06-23 12:49 something like that 2009-06-23 12:50 4096/8 I think 2009-06-23 12:50 you folks extending the format to store more information about some data structure ? 2009-06-23 12:50 existant format yet not implemented 2009-06-23 12:51 right, planned improvement 2009-06-23 12:51 um..., it sounds like complex 2009-06-23 12:51 the big simplification is just using the 10 bits directly for now 2009-06-23 12:51 it is for save space? 2009-06-23 12:52 48 bits of version info in every pointer would be excessive 2009-06-23 12:52 yes 2009-06-23 12:52 yes, just to save space 2009-06-23 12:52 but it saves a lot of space 2009-06-23 12:52 in ileaf especially 2009-06-23 12:53 but, we don't care xattr much to have for each versions 2009-06-23 12:53 well, 10bits sounds like not so small for now, from netapp 2009-06-23 12:54 yes, more than netapp = good enough 2009-06-23 12:54 effectively unlimited is better, but not necessary just now 2009-06-23 12:54 yes 2009-06-23 12:56 10bits is 42days per hour snapshot 2009-06-23 12:56 we only make half of those visible externally, so 512 2009-06-23 12:56 oh, i see 2009-06-23 12:57 21days 2009-06-23 12:57 um... 
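The translation scheme being discussed keeps a 10 bit version in every attribute or pointer and maps it through a per-leaf table to a much wider filesystem version. The following is only a sketch of that idea under stated assumptions: the class name, the method names, and the reverse dictionary are invented for illustration, and the 48 bit width is the figure from the docs (the chat also floats 32 bits).

```python
LOCAL_BITS = 10    # version field stored per attribute or extent
GLOBAL_BITS = 48   # filesystem-wide version number space (from the docs)


class VersionTable:
    """Hypothetical per-leaf table mapping local 10 bit versions to global ones."""

    def __init__(self):
        self.to_global = []   # index is the local version
        self.to_local = {}    # reverse map, global version -> local version

    def intern(self, global_version):
        """Return the local id for a global version, allocating one if needed."""
        assert 0 <= global_version < (1 << GLOBAL_BITS)
        if global_version in self.to_local:
            return self.to_local[global_version]
        if len(self.to_global) >= (1 << LOCAL_BITS):
            raise OverflowError("leaf already references 1024 distinct versions")
        local = len(self.to_global)
        self.to_global.append(global_version)
        self.to_local[global_version] = local
        return local

    def resolve(self, local):
        """Translate a stored local version back to the global version."""
        return self.to_global[local]
```

The space saving comes from storing only the small local id in each pointer; the wide version number appears once per table entry rather than once per attribute.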
2009-06-23 12:57 the other half are to handle ghost versions, that is, versions that are just placeholders for other versions to inherit data from 2009-06-23 12:58 for per hour snapshot, I guess it needs fast delete of snapshot 2009-06-23 12:58 btw, folks are moving to SSD rather quickly 2009-06-23 12:58 actual backup strategy goes something like this: 24 hourly snapshots, 7 daily snapshots, 4 weekly snapshots, 3 monthly snapshots 2009-06-23 12:59 especially in places where they can do faster logging/caching or something like that 2009-06-23 13:00 so, 38 snapshots total, that would be a gold-plated backup strategy 2009-06-23 13:00 it's going to blow SATA disks out of the water with regard to performance 2009-06-23 13:00 ddsnap also uses some temporary snapshots for replicating 2009-06-23 13:00 so, 50 snapshots total is more than enough for today's snapshot user 2009-06-23 13:00 i see 2009-06-23 13:01 btw, can we delete a snapshot quickly? 2009-06-23 13:01 bh, is 2.2 GB/second read, 1.5 GB/sec write good performance for an SSD box? 2009-06-23 13:01 GB is giga bytes, not bit? 2009-06-23 13:02 all we have to do to delete a snapshot is remove it from the version table; recovering the space used is more work 2009-06-23 13:02 I know that a company is pairing up with another company to enhance their filer with SSD on a PCI card 2009-06-23 13:02 bh, see above question 2009-06-23 13:02 don't know 2009-06-23 13:03 hirofumi, GB is bytes 2009-06-23 13:03 I'd have to ask my friend, but the question that's raised by SSD is how to deal with bad sectors and stuff 2009-06-23 13:03 bh, I'd be interested in what your friend says 2009-06-23 13:03 bh: Don't most of the big SSD drives these days have firmware that deals with error recovery and wear levelling, etc, for you?
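The snapshot budget from the backup discussion above can be checked with a few lines of arithmetic: the schedule of hourly, daily, weekly and monthly snapshots, and how far a 10 bit version field stretches when half of it is set aside for ghost versions. The variable names are illustrative only.

```python
# Retention schedule quoted in the chat.
schedule = {"hourly": 24, "daily": 7, "weekly": 4, "monthly": 3}
total = sum(schedule.values())

VERSION_BITS = 10
all_versions = 1 << VERSION_BITS
visible = all_versions // 2     # the other half is reserved for ghost versions

# How many whole days of one-snapshot-per-hour each budget would cover.
days_all = all_versions // 24
days_visible = visible // 24

print(total, all_versions, visible, days_all, days_visible)
```

The schedule needs 38 snapshots, so even with temporary replication snapshots added it stays far below the 512 visible versions, and the 42 and 21 day figures for hourly snapshots follow from dividing 1024 and 512 by 24.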
2009-06-23 13:04 I guess, right now, good firmware is not so many 2009-06-23 13:06 and it would also depend spare space of disk 2009-06-23 13:07 bd_, they do 2009-06-23 13:08 bd_: it's a pci-e board 2009-06-23 13:09 I'll send him an email now 2009-06-23 13:11 btw, what is memory bandwidth? 2009-06-23 13:12 12 GB/sec? 2009-06-23 13:13 flipz just /msg-ed you 2009-06-23 13:17 afk for a moment 2009-06-23 20:51 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-06-23 20:56 well, there is progress on the verion number mapping design note 2009-06-23 20:56 but will not be posted today, probably tomorrow 2009-06-23 20:56 tim_dimm, is that yhou? 2009-06-23 20:56 dat me 2009-06-23 20:56 downloading family pics 2009-06-23 20:56 dialin in from far away? 2009-06-23 20:57 in the berkshires (Massachusetts) 2009-06-23 20:57 oh, that is far 2009-06-23 22:15 -!- RazvanM(~RazvanM@pool-173-67-57-143.bltmmd.east.verizon.net) has joined #tux3 2009-06-24 01:24 -!- ckwood_(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-06-24 02:12 -!- pgquiles(~pgquiles@7.Red-88-0-138.dynamicIP.rima-tde.net) has joined #tux3 2009-06-24 02:39 -!- data(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-06-24 02:45 -!- data(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-06-24 05:38 -!- edt(~Ed@dsl-62-238.aei.ca) has joined #tux3 2009-06-24 05:47 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-06-24 06:01 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-06-24 06:13 good morning 2009-06-24 07:01 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-06-24 08:59 hi 2009-06-24 09:08 I worked on some details of version translation tables 2009-06-24 09:09 post is not quite ready yet 2009-06-24 09:09 ok 2009-06-24 09:09 indeed, there will be some complexity, but not too bad I think 2009-06-24 09:10 well worth it in order to have lots of versions instead of 512 2009-06-24 
09:10 I've applied my pending patches to current tree 2009-06-24 09:10 including logging for create? 2009-06-24 09:10 yes 2009-06-24 09:11 well, not working yet though 2009-06-24 09:11 got something to look at? 2009-06-24 09:13 new_btree(itable) was checked by assert() in commit 2009-06-24 09:13 I would like to follow up your last mailing list post with a short explanation of why we needed deferred inode create 2009-06-24 09:13 the original reason was to avoid having the front end change the ileaf 2009-06-24 09:14 yes 2009-06-24 09:14 I seem to recall another reason came up 2009-06-24 09:14 oh 2009-06-24 09:14 ah, dirty handling? 2009-06-24 09:15 dirty handling? 2009-06-24 09:16 inode dirty handling, I wonder if the deferred inode create list made it cleaner 2009-06-24 09:18 however, we need to know whether allocated pending inode or not 2009-06-24 09:21 ah 2009-06-24 09:21 my email subject was wrong 2009-06-24 09:21 it copyied from previous email for patchset 2009-06-24 09:23 the subject for deferred inode create, it should be "bug fixes, and deferred ileaf update for create" or something 2009-06-24 09:23 ah, well that was clear from the patch set 2009-06-24 09:23 yes 2009-06-24 09:28 maybe, I'm still not understanding what is meaning inode dirty handling and defer create 2009-06-24 09:31 so far, kernel drives writeback by write_inodes 2009-06-24 09:31 yes 2009-06-24 09:31 if we create a new inode, we create it dirty and kernel may try to write it back at any time 2009-06-24 09:32 which is not allowed 2009-06-24 09:32 in atomic-commit manner? 2009-06-24 09:32 yes 2009-06-24 09:32 yes 2009-06-24 09:33 it is all dirty inode too 2009-06-24 09:37 I'm reading __writeback_single_inode, one more time 2009-06-24 09:37 __sync_single_inode 2009-06-24 09:38 yes 2009-06-24 09:38 -> do_writepages 2009-06-24 09:39 so, are we going to ignore ->writepage page, and just put the inode on our list? 
2009-06-24 09:40 or are we going to prevent the kernel from flushing the inode (by moving it off the inode dirty list with dirty flag still set)? 2009-06-24 09:42 I'm not sure yet 2009-06-24 09:42 I'm not thinking about kernel port deeply 2009-06-24 09:43 ok, I will 2009-06-24 09:43 we can trade places, you in userspace and me in kernel 2009-06-24 09:44 btw, there is planing change by lkml in this area 2009-06-24 09:44 per-bdi writeback 2009-06-24 09:44 about time 2009-06-24 09:44 oh 2009-06-24 09:44 that's too bad, bdi is a mess 2009-06-24 09:44 well 2009-06-24 09:44 address that later 2009-06-24 09:45 let me see 2009-06-24 09:45 bdi is essentially the block device 2009-06-24 09:45 yes 2009-06-24 09:46 so, go flush all inodes on one block device, hard to see why that is an improvement of generic_sync_sb_inodes 2009-06-24 09:46 and jens is tring to create per-bdi flusher 2009-06-24 09:46 I better read the thread 2009-06-24 09:47 I have noticed that simple writeback sucks very badly for md raid0 2009-06-24 09:47 well, the point of him may be per-bdi thread, not global 2009-06-24 09:47 cpu is way too high 2009-06-24 09:47 Wasn't his point mainly that pdflush doesn't always do things optimally 2009-06-24 09:47 yes 2009-06-24 09:48 basicly pdflush sucks by design and this doesn't suck by design 2009-06-24 09:48 I am not sure what the problem is yet 2009-06-24 09:48 but, somehow, he removed sb locality 2009-06-24 09:48 pdflush can run on same bdi 2009-06-24 09:48 on some cases 2009-06-24 09:49 I'll look for the thread 2009-06-24 09:49 subject line? 
2009-06-24 09:49 yes 2009-06-24 09:49 per-bdi would be enough 2009-06-24 09:49 http://osdir.com/ml/linux-kernel/2009-06/msg08876.html 2009-06-24 09:49 v11 was the most recent 2009-06-24 09:50 adds another 1500 lines of core kernel code :( 2009-06-24 09:51 this kind of patch set should start by taking some cruft away 2009-06-24 09:52 http://thread.gmane.org/gmane.linux.kernel/855800 Read this thread 2009-06-24 09:52 anyway, I have a nice load to try this patch set on 2009-06-24 09:52 thanks, sejeff 2009-06-24 09:53 No problem. It has a much better explanation of what he's trying to solve 2009-06-24 09:53 by the way, all you have to do to kill performance on any linux box is, cp a few gigs of file to /dev/null 2009-06-24 09:53 well, personally, I'm thinking some people are thinking this is not so good 2009-06-24 09:53 Why is that? The vm? 2009-06-24 09:53 hirofumi, akpm included 2009-06-24 09:53 yes, me too 2009-06-24 09:54 some people are requesting to wait until 2.6.32 2009-06-24 09:54 well, my point is sb locality though 2009-06-24 09:54 It's unclear to me actually _why_ the performance changes which were 2009-06-24 09:54 observed have actually occurred. In fact it's a bit unclear (to me) 2009-06-24 09:54 why the patchset was written and what it sets out to achieve :( 2009-06-24 09:54 -- akpm 2009-06-24 09:55 http://axboe.livejournal.com/2258.html 2009-06-24 09:55 If those numbers don't lie they are certainly compelling. Even if they aren't the write approach they prove neither is pdflush as is currently 2009-06-24 09:55 the "write" approach, nice slip 2009-06-24 09:56 And you caught it 2009-06-24 09:57 yes, per-bdi itself sounds good 2009-06-24 09:57 but, the implementation is...
2009-06-24 10:11 Well it will go into .32 if no one strongly objects 2009-06-24 10:12 apart from being yet another big rambling code dump into core, the only credible objection would be better results from a better approach 2009-06-24 10:13 I mean would it be possible to take his idea and write a better implementation quickly? 2009-06-24 10:13 probably 2009-06-24 10:14 and it's topical at the moment 2009-06-24 10:14 ssds totally break linux 2009-06-24 10:14 they magnify the longstanding writeout suckage 2009-06-24 10:15 Which is all Jens's code, right? 2009-06-24 10:15 totally break is a slight exaggeration 2009-06-24 10:15 a lot of it is core vm 2009-06-24 10:15 Oh the linux vm is broken. This is old news 2009-06-24 10:16 the best you can say about sync_single_inode is, it is pretty reliable 2009-06-24 10:16 the worst you can say about it is... pretty bad 2009-06-24 10:17 I suppose that is properly vfs 2009-06-24 10:17 but really, writeback is vm 2009-06-24 10:30 howdy 2009-06-24 10:30 hey 2009-06-24 10:49 -!- dagle(~dagle@host162-104.bornet.net) has joined #tux3 2009-06-24 11:50 -!- abiggs(~spirit.of@142.46.164.30) has joined #tux3 2009-06-24 12:30 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-06-24 13:18 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-24 13:50 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-06-24 13:58 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-06-24 14:03 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2009-06-24 14:04 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-06-24 16:10 -!- dagle(~dagle@host162-104.bornet.net) has joined #tux3 2009-06-24 16:47 -!- edt(~Ed@dsl-62-238.aei.ca) has joined #tux3 2009-06-24 17:18 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-06-24 17:53 http://blogs.netapp.com/dave/2007/09/netapp-sues-sun.html 2009-06-24 17:56 accidentally (?, maybe, this english word is not corrent), I've found this article 2009-06-24 17:57 well, there 
is no matter 2009-06-24 17:58 I was thinking netapp sues 2009-06-24 17:58 netapp seems to have the reason for it against sun 2009-06-24 17:59 it can't see which is saying true though 2009-06-24 18:18 ah, I was forgetting why I was looking netapp site 2009-06-24 18:18 netapp's snapshot seems 255 2009-06-24 18:56 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-06-24 19:32 folks 2009-06-24 19:33 hey 2009-06-24 19:33 so... what did you find out about ssd speed? 2009-06-24 19:33 hey is for horses 2009-06-24 19:33 how much is lots? 2009-06-24 19:34 zfs does more I beleive 2009-06-24 19:34 hi 2009-06-24 19:35 flips: he hasn't replied to the email yet 2009-06-24 19:35 if I'm missing something, btrfs seems to be storing root as btree object 2009-06-24 19:40 indeed, it does 2009-06-24 19:40 is that a good or bad thing ? 2009-06-24 19:41 suppose the b-tree gets corrupted, how are they going to recover from that 2009-06-24 19:41 ? 2009-06-24 19:42 every filesystem has btrees 2009-06-24 19:42 these days 2009-06-24 19:43 bh, it doesn't make sense to say zfs is faster than a disk array 2009-06-24 19:44 eh ? 2009-06-24 19:44 how much is lots? 2009-06-24 19:44 zfs does more I beleive 2009-06-24 19:45 I'm not sure what you're talking about. 
I'm under the understanding that ZFS is able to store more snapshots than WAFL 2009-06-24 19:45 nothing really about disk speeds 2009-06-24 19:45 ah, I didn't look higher in the window ;) 2009-06-24 19:45 zfs has a high limit on number of snapshots, yes 2009-06-24 19:45 netapp for some reason seems to rely on a bitmap or something, giving a limit of 256 2009-06-24 19:45 maybe, I guess zfs is also storing root as btree object 2009-06-24 19:46 last time I checked 2009-06-24 19:46 yes, zfs and btrfs are both using what amounts to the original tux3 scheme 2009-06-24 19:46 flips: msg window 2009-06-24 19:46 if netapp is using older paper strategy of WAFL, it may be using nvram for root 2009-06-24 19:47 or it can be limitation of nvram space 2009-06-24 19:48 netapp does indeed still use nvram 2009-06-24 19:48 but not for root 2009-06-24 19:48 it is used as an nfs accelerator 2009-06-24 19:49 nfs accelerator? 2009-06-24 19:49 stores nfs write messages 2009-06-24 19:49 i see 2009-06-24 19:49 yes, netapps are basically nfs/cifs boxes 2009-06-24 19:49 with iscsi bolted onto the side 2009-06-24 19:50 netapp is not providing OSD stuff? 2009-06-24 19:50 well 2009-06-24 19:50 not the last time I checked 2009-06-24 19:50 sun seems to be trying to relpace those with zfs stack 2009-06-24 19:51 you mean, for lustre osds? 2009-06-24 19:51 they tried 2009-06-24 19:51 apparently found zfs was too slow 2009-06-24 19:52 flips: more ssd info private 2009-06-24 19:52 lustre osds? 2009-06-24 19:52 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-06-24 19:55 you know lustre? 2009-06-24 19:56 opensource network fs? 2009-06-24 19:56 lustre uses servers running a stripped down ext3 as osds 2009-06-24 19:56 many servers in parallel 2009-06-24 19:56 total bandwidth is in the hundreds of gigabytes 2009-06-24 19:57 it is using OSD interface as storage? 
2009-06-24 19:57 I was thinking it is fs 2009-06-24 19:57 when sun bought lustre, that is, cluster filesystems inc, it tried to convert the osd servers to zfs 2009-06-24 19:57 it worked, but it was considerably slower 2009-06-24 19:57 yes, lustre uses osd storage over the net as its physical devices 2009-06-24 20:00 oh 2009-06-24 20:00 I didn't know it 2009-06-24 20:00 lustre is pretty impressive 2009-06-24 20:01 local filesystem semantics on a distributed filesystem, not easy to do at all 2009-06-24 20:01 and scales very high 2009-06-24 20:01 yes 2009-06-24 20:02 distributed locks is enough complex for me :) 2009-06-24 20:02 unfortunately for lustre, and pretty well every cluster filesystem, the wold does not know it needs it 2009-06-24 20:02 instead, the world goes for big servers attached to monster disk arrays 2009-06-24 20:03 and the reason the world does this is, the single server is much simpler, easier to administer and more reliable 2009-06-24 20:03 there are only a very few installations that need bandwidth so high that only something like lustre can provide it 2009-06-24 20:03 and it will affect to running cost 2009-06-24 20:03 too 2009-06-24 20:04 i see 2009-06-24 20:04 may I talk about version bits for a moment? 2009-06-24 20:04 yes 2009-06-24 20:04 please 2009-06-24 20:04 I didn't work out the details before, because I assumed it would be easy 2009-06-24 20:05 just pick some mapping strategy and use it 2009-06-24 20:05 of course, when we try to find the best strategy, or even a simple one that will work well, it gets more complex ;) 2009-06-24 20:05 so I have been working on it 2009-06-24 20:05 sounds good 2009-06-24 20:06 btw, my current question is 2009-06-24 20:06 if mapping can't, what happen? 
2009-06-24 20:06 my first observation is, a mapping table for 10 bits worth of version numbers will be 4K, less 4 bytes 2009-06-24 20:06 if mapping doesn't work, 512 snapshots is still a lot 2009-06-24 20:06 just not as many as zfs or btrfs 2009-06-24 20:06 but more than wafl 2009-06-24 20:06 well, mapping will work 2009-06-24 20:07 but we can delay the implementation if we want, without problems 2009-06-24 20:07 ah, it meant, local version is needing to more version 2009-06-24 20:07 however, it would be good to have a design prepared for supporting large numbers of versions via mapping 2009-06-24 20:08 what did you mean? 2009-06-24 20:08 suppose global version has 2^32, but local version is 2^10 2009-06-24 20:08 it means one inode has only 2^10 versions? 2009-06-24 20:08 right, so I want to stay with 10 bits local version 2009-06-24 20:08 local means, local to a small group of ileaf and/or dleaf blocks 2009-06-24 20:09 ok 2009-06-24 20:10 so, if few files are always hot, it can become limitation of versions? 2009-06-24 20:10 not really 2009-06-24 20:10 hot == modified always 2009-06-24 20:10 oh 2009-06-24 20:10 each ileaf and each dleaf contains the id of a mapping table 2009-06-24 20:11 number of different versions per ileaf or dleaf is less than 512, because of the size of a pointer or attribute 2009-06-24 20:12 so it is always possible to map all the 10 bit version fields in one block to 4 byte versions 2009-06-24 20:12 if blocksize is 4k? 
2009-06-24 20:13 in practice, there will usually only be a few different versions for an ileaf or dleaf, and the most busy ileaf/dleaf might have around 500 2009-06-24 20:13 a mapping block would be 4K, even if the filesystem block size is something else 2009-06-24 20:13 mapping blocks will be stored logically in a file 2009-06-24 20:13 like other tables in tux3 2009-06-24 20:14 so map block size does not have to match underlying filesystem blocksize 2009-06-24 20:14 we need to avoid having too many map blocks on a filesystem 2009-06-24 20:15 or it will significantly increase the metadata size 2009-06-24 20:15 if we were very lazy, it would nearly double the metadata size 2009-06-24 20:15 i see 2009-06-24 20:15 however, even a very small sharing factor for mapping blocks will reduce the number of them to insignificance 2009-06-24 20:16 so, when we want to write a given version to an ileaf or dleaf, we first check the 'nearest' mapping block (that is, one used in a lower numbered ileaf or dleaf), and see if the version is already there 2009-06-24 20:17 we can do a linear search, or think of something more clever 2009-06-24 20:17 a linear search will work fine 2009-06-24 20:17 if we find the version, we use the offset in the mapping block as the local version number 2009-06-24 20:18 if we don't, we add the new version in the first free u32 in the block 2009-06-24 20:18 free mapping block entries are zero 2009-06-24 20:18 which is not a valid version number 2009-06-24 20:20 we have the concept of a "current" mapping block, which will be the mapping block for the next ileaf or dleaf that is newly created 2009-06-24 20:20 u32 version + local version == global version? 2009-06-24 20:20 global_version = map[local_version] 2009-06-24 20:21 uint32_t map[1024]; 2009-06-24 20:21 or, version_t map[1024]; 2009-06-24 20:22 how can we get from global version to local version 2009-06-24 20:23 ?
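A minimal sketch of the lookup described above, assuming a 1024-entry table of 32-bit global versions in which zero marks a free slot. The names `struct version_map`, `map_find_or_add`, `map_global` and the `MAP_FULL` return value are hypothetical, chosen only for illustration; what to do when the table is full is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

#define MAP_ENTRIES 1024   /* 10-bit local version => 1024 slots, as in the chat */
#define MAP_FULL (-1)      /* hypothetical "no free slot" result */

typedef uint32_t version_t;

/* One mapping block: map[local] == global version, 0 == free slot. */
struct version_map {
	version_t map[MAP_ENTRIES];
};

/* Translate a local version back to its global version. */
static version_t map_global(const struct version_map *vmap, unsigned local)
{
	assert(local < MAP_ENTRIES);
	return vmap->map[local];
}

/*
 * Linear search for a global version. If it is absent, claim the first
 * free (zero) slot. Returns the local version, or MAP_FULL when the
 * version is absent and every slot is taken.
 */
static int map_find_or_add(struct version_map *vmap, version_t global)
{
	int free_slot = MAP_FULL;

	assert(global != 0); /* zero means "free", so it cannot be a version */
	for (int i = 0; i < MAP_ENTRIES; i++) {
		if (vmap->map[i] == global)
			return i;
		if (vmap->map[i] == 0 && free_slot == MAP_FULL)
			free_slot = i;
	}
	if (free_slot != MAP_FULL)
		vmap->map[free_slot] = global;
	return free_slot;
}
```

Taking the first zero slot rather than appending means slots zeroed by a version delete get reused, which is one simple way to limit growth of a shared table; whether that is the right policy is exactly the open question in this discussion.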
2009-06-24 20:23 when we write, we are always writing some version 2009-06-24 20:23 which we know 2009-06-24 20:23 usually, the whole filesystem is mounted as a given version 2009-06-24 20:24 the version changes with a new snapshot, then all writes are with the new version 2009-06-24 20:24 so... 2009-06-24 20:24 yes 2009-06-24 20:24 let's it call sb->version 2009-06-24 20:24 and local version is ileaf->inode->version 2009-06-24 20:24 when we write to an ileaf that already has a map table, we search the map table for the 'current' version (that is, the version we are writing) 2009-06-24 20:25 ok 2009-06-24 20:25 thanks :) 2009-06-24 20:25 well 2009-06-24 20:25 ok, fine 2009-06-24 20:25 that is correct 2009-06-24 20:25 do we need to map somehow from sb->version to ileaf->inode->version? 2009-06-24 20:26 so, when we write an extent to an ileaf that already has a mapping table, we search for sb->version in the mapping table, if it is there, the offset gives the local version 2009-06-24 20:27 if it is not there, we add sb->version to the mapping table, and use the position where we added it as the local version 2009-06-24 20:27 if the mapping table is full, we have to use a new mapping table 2009-06-24 20:27 this is where we get to play with algorithms 2009-06-24 20:28 i see 2009-06-24 20:28 if we just randomly choose some other mapping table, we could get a lot of false sharing between unrelated btree leaves 2009-06-24 20:28 well, I'm not understanding fully though 2009-06-24 20:28 some code would help :) 2009-06-24 20:28 yes 2009-06-24 20:28 :) 2009-06-24 20:29 sb->version == map[ileaf->inode->ileaf] 2009-06-24 20:29 um... 
2009-06-24 20:30 the mapping id is stored right in the ileaf 2009-06-24 20:30 let me see 2009-06-24 20:31 when we create a new extent we do: extent->version = searchmap(ileaf->map, sb->version) 2009-06-24 20:32 and need to handle the case where the version isn't found 2009-06-24 20:33 http://www.enterprisestorageforum.com/sans/news/print.php/3807706 2009-06-24 20:33 found lustre's news 2009-06-24 20:34 what is extent in here? 2009-06-24 20:34 extent is extent of dleaf, e.g.? 2009-06-24 20:35 which is like: extent->version = ifindzero(ileaf->map); ileaf->map[extent->version] = sb->version 2009-06-24 20:35 extent is the 8 byte extents we use 2009-06-24 20:35 48 bits address, 6 bits count, 10 bits local version 2009-06-24 20:35 so, now, it is changing data btree? 2009-06-24 20:36 yes 2009-06-24 20:36 ok 2009-06-24 20:36 we are writing a new version of some extent to the dleaf 2009-06-24 20:36 similar thing for ileaf attributes 2009-06-24 20:38 the easiest case is actually when btree leaves have huge numbers of versions 2009-06-24 20:38 because then each mapping block will not be shared by many btree leaves 2009-06-24 20:39 when there are small numbers of versions, we will share a mapping block with many btree leaves before it fills up 2009-06-24 20:40 in most cases, one mapping block will be enough for an entire filesystem 2009-06-24 20:40 which is ileaf->map pointing? 2009-06-24 20:40 ileaf->map is a logical 4K block number in a tux3 special file 2009-06-24 20:41 how is it initialized? 2009-06-24 20:42 initialized or allocated? 
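A small sketch of the 8 byte extent layout mentioned above (48 bits address, 6 bits count, 10 bits local version). The field widths come from the chat; the order of the fields inside the 64-bit word and the helper names `extent_pack`, `extent_addr`, `extent_count` and `extent_version` are assumptions for illustration, not the tux3 on-disk format.

```c
#include <assert.h>
#include <stdint.h>

/* Widths are from the discussion; the bit order is an assumption. */
enum { ADDR_BITS = 48, COUNT_BITS = 6, VERSION_BITS = 10 };

/* Pack address, block count and local version into one 64-bit extent. */
static uint64_t extent_pack(uint64_t addr, unsigned count, unsigned version)
{
	assert(addr < (1ULL << ADDR_BITS));
	assert(count < (1U << COUNT_BITS));
	assert(version < (1U << VERSION_BITS));
	return ((uint64_t)version << (ADDR_BITS + COUNT_BITS)) |
	       ((uint64_t)count << ADDR_BITS) | addr;
}

static uint64_t extent_addr(uint64_t extent)
{
	return extent & ((1ULL << ADDR_BITS) - 1);
}

static unsigned extent_count(uint64_t extent)
{
	return (unsigned)((extent >> ADDR_BITS) & ((1U << COUNT_BITS) - 1));
}

/* The 10-bit local version is an index into the leaf's mapping table. */
static unsigned extent_version(uint64_t extent)
{
	return (unsigned)(extent >> (ADDR_BITS + COUNT_BITS));
}
```

With this layout the three fields add up to exactly 64 bits, which is why the local version has to stay at 10 bits and the wider global version lives in the mapping table instead.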
2009-06-24 20:44 empty, with a highwater mark 2009-06-24 20:45 we can use the filesize as the highwater mark 2009-06-24 20:45 ah, I assumed ileaf->map is pointer to offset of vtable 2009-06-24 20:46 and if it's an offset, where does it come from 2009-06-24 20:47 vtable offset is just globalversion * sizeof(struct vtable_entry) 2009-06-24 20:47 just a big array of version entries 2009-06-24 20:47 i see 2009-06-24 20:47 that structure is well understood and very efficient 2009-06-24 20:48 globalversion * sizeof(vtable entry)? 2009-06-24 20:48 vtable entry is fixed structure? 2009-06-24 20:48 yes 2009-06-24 20:48 small, fixed size 2009-06-24 20:49 each vtable entry has the global version number of its parent version 2009-06-24 20:49 ileaf->map is vtable? 2009-06-24 20:49 at mount, we scan the version table and build a version tree according to the parent version fields 2009-06-24 20:49 ACTION is online from the air 2009-06-24 20:49 no, ileaf->map is a different thing 2009-06-24 20:49 ah, ok 2009-06-24 20:49 shapor, you are a high flyer 2009-06-24 20:50 the vtable is pretty cool, but I am completely satisfied with the design, so I don't talk about it much 2009-06-24 20:51 ok 2009-06-24 20:51 ok, mapping tables are another matter, I am not satisfied yet 2009-06-24 20:51 there are a couple of tricky issues 2009-06-24 20:52 one issue: deciding which partly-full mapping table to use for a newly created ileaf or dleaf 2009-06-24 20:52 actually, there aren't too many other issues 2009-06-24 20:54 ok, another issue: on deletion we will get zeroes in some mapping tables, how do we track which mapping tables have the most zeros, and are thus good candidates to choose for sharing? 2009-06-24 20:55 and a question: how will fragmentation of mapping blocks affect filesystem performance as the filesystem ages? what simple things can we do to keep false sharing to a reasonable level?
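The version table mentioned earlier in this exchange (a flat array indexed by global version, each entry naming its parent version) could look roughly like this. It is a sketch under assumptions: the single-field `struct vtable_entry`, the use of 0 for "no parent", and the helper `version_inherits` are all hypothetical, and a real entry would presumably carry more than a parent link.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small fixed-size entry; only the parent link is sketched here. */
struct vtable_entry {
	uint32_t parent; /* global version of the parent, 0 = none (assumption) */
};

/* Byte offset of a version's entry: globalversion * sizeof(entry). */
static size_t vtable_offset(uint32_t version)
{
	return (size_t)version * sizeof(struct vtable_entry);
}

/*
 * Walk the parent links to test whether `ancestor` is `version` itself
 * or one of its ancestors. Assumes the table holds no cycles.
 */
static int version_inherits(const struct vtable_entry *vtable,
			    uint32_t version, uint32_t ancestor)
{
	while (version != 0) {
		if (version == ancestor)
			return 1;
		version = vtable[version].parent;
	}
	return 0;
}
```

Because the entries are fixed size, one scan at mount is enough to rebuild the whole version tree from the parent fields.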
2009-06-24 20:55 I guess delete is important like create 2009-06-24 20:56 the only real issue I see on delete is keeping track of new zero mapping entries 2009-06-24 20:56 btw, deleting version needs to scan all inodes? 2009-06-24 20:57 yes, unless we add hint bits in the index blocks 2009-06-24 20:57 hammer has such hint bits 2009-06-24 20:58 I see 2009-06-24 20:58 maybe, several design is not caring quick/cheap deletion? 2009-06-24 20:59 several (btrfs, zfs, hammer?) 2009-06-24 21:00 quick version deletion is not very important because it is easy to do in the background 2009-06-24 21:00 but of course, it is always better to be more efficient 2009-06-24 21:00 well, anyway, I need to learn about snapshot/versioning 2009-06-24 21:01 I think background is not so good, because it usually affects frontend jobs 2009-06-24 21:01 here is an idea: the version mapping table also tells us which versions might be present in a given part of the itable 2009-06-24 21:02 so, we can reserve a pointer in the itable index block, and point at a mapping table 2009-06-24 21:02 to know if a given version is present in a subtree of the itable, we just search the mapping table 2009-06-24 21:02 that will work really well 2009-06-24 21:02 i see 2009-06-24 21:03 and it can be used as fsck too? 
2009-06-24 21:03 it would be helpful 2009-06-24 21:03 actually, a better idea 2009-06-24 21:03 put the list of present versions right in the index node 2009-06-24 21:03 usually it will be small 2009-06-24 21:03 and not take away too many index entries 2009-06-24 21:04 for now, probably, I'm not understanding fully, unfortunatelly 2009-06-24 21:05 this is about providing hints about the contents of subtrees in order to accelerate searching for leafs with a given version or set of versions 2009-06-24 21:05 well, my feeling is of version, I mapped it to branch of SCM 2009-06-24 21:05 it is similar 2009-06-24 21:06 so, I thought if branch creation and deletion is not cheap, it is not useful after all 2009-06-24 21:06 branch creation is essentially free 2009-06-24 21:07 yes, good 2009-06-24 21:07 deletion is needed to cheap for, aggresive use 2009-06-24 21:07 like branch of git 2009-06-24 21:08 well, this would not be necessary one 2009-06-24 21:08 with hints in the index nodes, it can be very cheap 2009-06-24 21:08 good 2009-06-24 21:08 each ileaf that has the version has to be rewritten (redirected) 2009-06-24 21:09 yes 2009-06-24 21:09 and dleaf too? 2009-06-24 21:10 yes 2009-06-24 21:10 this is best done in the background 2009-06-24 21:10 so that the only effect is, space is not recovered immediately 2009-06-24 21:10 yes, and until done, global version is reserved? 
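The idea of keeping a list of present versions in each index node can be sketched as a pruned tree walk. Everything here is hypothetical: `struct node`, the `MAX_HINTS` and `MAX_KIDS` limits, and `leaves_to_rewrite` are illustration only, assuming each node lists every global version found anywhere below it and that 0 marks an unused hint slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HINTS 8 /* hypothetical cap on versions listed per node */
#define MAX_KIDS 4  /* hypothetical fanout, kept tiny for illustration */

struct node {
	uint32_t present[MAX_HINTS];  /* versions present below; 0 = unused */
	struct node *child[MAX_KIDS]; /* all NULL for a leaf */
};

/* Linear search of the node's hint list. */
static int has_version(const struct node *node, uint32_t version)
{
	for (int i = 0; i < MAX_HINTS; i++)
		if (node->present[i] == version)
			return 1;
	return 0;
}

/*
 * Count the leaves that would have to be rewritten to delete `version`,
 * skipping every subtree whose hint list does not mention it.
 */
static unsigned leaves_to_rewrite(const struct node *node, uint32_t version)
{
	unsigned total = 0, kids = 0;

	assert(version != 0); /* zero marks an unused hint slot */
	if (!node || !has_version(node, version))
		return 0;
	for (int i = 0; i < MAX_KIDS; i++) {
		if (node->child[i]) {
			kids++;
			total += leaves_to_rewrite(node->child[i], version);
		}
	}
	return kids ? total : 1; /* a node without children is a leaf */
}
```

The cost of a version delete then scales with the number of leaves that actually hold the version instead of with the size of the whole itable.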
2009-06-24 21:10 yes 2009-06-24 21:11 which is ok if there are a lot of versions 2009-06-24 21:11 i see 2009-06-24 21:14 for what it's worth, every snapshot strategy has similar tradeoffs to deal with 2009-06-24 21:14 zfs has to process a "dead list" of blocks, btrfs has to update many reference counts 2009-06-24 21:15 i see 2009-06-24 21:15 btrfs sounds like most bad one, if there is no hint 2009-06-24 21:16 btrfs version delete algorithm is to walk the version tree, decrement the use count of each index node, if the count hits zero then also delete the index node 2009-06-24 21:17 so it has to walk the part of the filesystem tree that was rewritten 2009-06-24 21:17 if this is not very much, then it should be fast 2009-06-24 21:18 um... 2009-06-24 21:18 if root is rewriten, it need to search all? 2009-06-24 21:18 the search stops on any subtree with reference count greater than one 2009-06-24 21:19 this part of btrfs is pretty nice 2009-06-24 21:19 what is not nice is, if you only change one pointer in an index block, the entire block has to be cloned, and both copies are kept 2009-06-24 21:20 and all parents of it? 2009-06-24 21:20 yes 2009-06-24 21:21 and this must be done before the write operation can complete 2009-06-24 21:21 so, by comparison, version deletion in the background does not sound so bad 2009-06-24 21:22 the number of blocks tux3 has to change to delete a version is similar to btrfs 2009-06-24 21:23 i see 2009-06-24 21:23 bad side of tux3 needs more jobs in memory? 2009-06-24 21:24 the main drawback to tux3 design is that btree leaf blocks can contain a lot of version information that we are not interested in 2009-06-24 21:25 I do not think this is a serious problem 2009-06-24 21:25 ah, i see 2009-06-24 21:26 it affects to fanout of btree? 
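The reference-counted delete walk attributed to btrfs above can be illustrated with a generic sketch. This is not btrfs code: `struct rnode`, its two-child layout, the `freed` flag and `drop_tree` are hypothetical, and the sketch only shows the stated rule that the walk stops at any subtree still referenced by another version.

```c
#include <assert.h>
#include <stddef.h>

struct rnode {
	unsigned refs;          /* how many version trees point at this node */
	struct rnode *child[2]; /* NULL when absent */
	int freed;              /* set instead of really freeing, for the demo */
};

/*
 * Drop one reference to `node`. Only when the count reaches zero is the
 * node released and the walk continued into its children. Returns the
 * number of nodes released.
 */
static unsigned drop_tree(struct rnode *node)
{
	unsigned freed = 0;

	if (!node)
		return 0;
	assert(node->refs > 0);
	if (--node->refs > 0)
		return 0; /* still shared with another version: stop here */
	for (int i = 0; i < 2; i++)
		freed += drop_tree(node->child[i]);
	node->freed = 1;
	return freed + 1;
}
```

Only the part of the tree private to the deleted version is visited, which matches the remark that the walk is fast when little was rewritten.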
2009-06-24 21:26 on the other hand, for an operation that requires considering every version at the same time (like git blame) the tux3 design can way outperform a shared tree design like btrfs 2009-06-24 21:26 yes, it affects fanout but only at the leaf nodes of the btree 2009-06-24 21:26 not a serious problem 2009-06-24 21:27 especially for rotating media, an advantage to tux3 is much less metadata in total 2009-06-24 21:27 because exactly the minimum amount of data is stored 2009-06-24 21:28 probably 2009-06-24 21:28 if we can know how much space is different 2009-06-24 21:28 do you calc it already? 2009-06-24 21:29 yes, I have estimated it 2009-06-24 21:29 it depends on pattern of writing 2009-06-24 21:29 oh, great 2009-06-24 21:29 :) 2009-06-24 21:30 e.g. after installation? 2009-06-24 21:30 and database pattern 2009-06-24 21:30 btrfs just turns off versioning for database use I think 2009-06-24 21:31 databases have their own versioning 2009-06-24 21:31 well, oracle does 2009-06-24 21:31 mysql doesn't, really 2009-06-24 21:31 oh 2009-06-24 21:31 but, mysql has backup strategy? 2009-06-24 21:32 ok, mysql is a good example 2009-06-24 21:33 a good example as database pattern? 2009-06-24 21:33 yes 2009-06-24 21:33 i.e. rewrite file data 2009-06-24 21:33 because it is stupid ;) 2009-06-24 21:33 :) 2009-06-24 21:34 so if you make a lot of random changes in the database, btrfs will make a copy of nearly every index block, even though only a few bytes might have changed in each extent 2009-06-24 21:35 tux3 will just add some new versioned pointers, and a few blocks will split 2009-06-24 21:35 you can see that tux3 will generate a small fraction of new metadata blocks 2009-06-24 21:36 keep doing that, and setting new snapshots, pretty soon btrfs will have as much metadata as data 2009-06-24 21:36 tux3 will wtill have much less metadata than data 2009-06-24 21:37 yes, it may be depends on fanout? 
2009-06-24 21:38 another common example: a slowly growing log file, while snapshots are being set 2009-06-24 21:38 yes, maybe, this is good bad case of btrfs strategy 2009-06-24 21:39 bad case (to explain well) of btrfs 2009-06-24 21:40 a mail server is another good case to think about 2009-06-24 21:40 in that case, many names in a directory are changing 2009-06-24 21:41 if snapshots are also happening, then directory data will expand enormously 2009-06-24 21:41 well, to be fair, so will it in tux3 2009-06-24 21:42 however, with btrfs it will expand something like twice as fast 2009-06-24 21:42 including the index blocks that have to be cloned 2009-06-24 21:43 i see 2009-06-24 21:44 good side of btrfs would be O(1)? 2009-06-24 21:44 1 of what? 2009-06-24 21:45 versions 2009-06-24 21:46 it's not a bad design 2009-06-24 21:46 yes 2009-06-24 21:46 many common operations should be efficient 2009-06-24 21:46 however, with filesystems, people care a lot about 10% better or worse 2009-06-24 21:46 well, I'd like to see ours and others well 2009-06-24 21:46 they are very critical 2009-06-24 21:48 it means 10% faster or worse? 2009-06-24 21:48 I think tux3 is easily the best in metadata compactness, for typical linux filesystems, we will see what kind of difference that makes 2009-06-24 21:48 i see 2009-06-24 21:48 anyway, it is time to sleep 2009-06-24 21:49 ok 2009-06-24 21:49 oyasumi, and see you 2009-06-24 21:49 oyasumi 2009-06-25 00:06 -!- RazvanM(~RazvanM@pool-173-67-57-143.bltmmd.east.verizon.net) has joined #tux3 2009-06-25 01:38 um..., can't we pick up the really key point and subset of whole design for version 0.01? 2009-06-25 01:39 at least for me, honestly, already complex 2009-06-25 01:42 what do you think?
2009-06-25 01:42 well, I'll sleep :) 2009-06-25 04:32 -!- edt(~Ed@dsl-60-122.aei.ca) has joined #tux3 2009-06-25 09:27 hirofumi, agreed 2009-06-25 09:27 ...good morning in advance 2009-06-25 10:00 hirofumi, for version mapping I will just add the map id field to ileaf and dleaf header, which will be a format changes, to rev the format number 2009-06-25 10:00 then later, implementing mapping tables will not require a disk format change 2009-06-25 13:05 hirofumi, there? 2009-06-25 13:05 hi 2009-06-25 13:06 so, the result of the versioning discussion is, I will add a u32 map field to ileaf and dleaf, rev the file format version, and leave that problem alone for a while 2009-06-25 13:08 ok 2009-06-25 13:09 well, it is not only against to this though 2009-06-25 13:10 my intent of that is first non-optimized version, but developers can see what happen 2009-06-25 13:13 um..., small basic version to start or join developers 2009-06-25 13:23 it hope to separate the issues to small piece, and so, hope it can be solved small motivation with some fun 2009-06-25 13:26 that's it 2009-06-25 13:26 now, I am ready to start coding seriously again 2009-06-25 13:26 so what should I work on? 
2009-06-25 13:27 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-06-25 13:27 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-25 13:27 hi tim_dimm 2009-06-25 13:28 hey 2009-06-25 13:28 I'm not sure 2009-06-25 13:28 well then, what I was working on before, replay 2009-06-25 13:28 I think it should be part of your fun now 2009-06-25 13:28 indeed, it should be fun 2009-06-25 13:28 replay is fun :) 2009-06-25 13:28 :) 2009-06-25 13:29 yes 2009-06-25 13:31 for replay, maybe, I should push the current patches to somewhere 2009-06-25 13:31 um..., but, it's not starting to work yet 2009-06-25 13:34 FWIW, I've pushed the current snapshot of my patches 2009-06-25 13:34 http://userweb.kernel.org/~hirofumi/patchset.tar.gz 2009-06-25 13:35 hope it helps to replay or something 2009-06-25 13:36 I'll try to clean it up, and usable 2009-06-25 13:38 with this series, I hope creation path works as atomic-commit 2009-06-25 13:38 wow- that's awesome hirofumi 2009-06-25 13:38 atomic-commit? 
2009-06-25 13:39 no, the progress is awesome 2009-06-25 13:39 ah 2009-06-25 13:39 thanks 2009-06-25 13:39 :) 2009-06-25 13:39 well, honestly, not so many though 2009-06-25 13:40 and it is still userspace only 2009-06-25 13:40 you guys have been working hard on it for a long time, so yes, its still awesome 2009-06-25 13:41 thanks 2009-06-25 13:44 -!- pgquiles(~pgquiles@7.Red-88-0-138.dynamicIP.rima-tde.net) has joined #tux3 2009-06-25 13:48 hirofumi, it's ok to push it even if it's not working 2009-06-25 13:49 well, not working means still not compilable 2009-06-25 13:50 (grabbed your patch set) 2009-06-25 13:50 I'll apply it later today 2009-06-25 13:50 ok 2009-06-25 13:50 well 2009-06-25 13:50 _read it_ then 2009-06-25 13:50 before this series, writeback.c was introduced 2009-06-25 13:50 yes 2009-06-25 13:51 I'll try to teach writeback.c to this series 2009-06-25 13:51 hope today or tomorrow 2009-06-25 14:09 http://userweb.kernel.org/~hirofumi/tux3.tar.gz 2009-06-25 14:09 this is patched full source of tux3, it would be usual more or less 2009-06-25 14:11 btw, below patches of update-magic-number.patch in patchset/series are unrelated stuff 2009-06-25 14:11 got it 2009-06-25 14:13 thanks 2009-06-25 14:29 -!- lucente(~barnhous@190.9.69.135) has joined #tux3
2009-06-25 14:29 -!- lucente(~barnhous@190.9.69.135) has left #tux3 2009-06-25 14:29 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-06-25 14:32 -!- sherban(~lakshmin@80.148.30.4) has joined #tux3 2009-06-25 14:32 -!- sherban(~lakshmin@80.148.30.4) has left #tux3 2009-06-25 14:32 -!- giorgos(~mathias@202.51.102.18) has joined #tux3 2009-06-25 14:32 -!- giorgos(~mathias@202.51.102.18) has left #tux3 2009-06-25 15:01 -!- abiggs(~spirit.of@142.46.164.30) has joined #tux3 2009-06-25 18:00 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-06-25 19:16 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-06-25 23:58 -!- RazvanM(~RazvanM@pool-173-67-57-143.bltmmd.east.verizon.net) has joined #tux3 2009-06-26 00:34 -!- persson_(persson@nescafe.bsnet.se) has joined #tux3 2009-06-26 00:45 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-06-26 02:35 -!- pgquiles(~pgquiles@7.Red-88-0-138.dynamicIP.rima-tde.net) has joined #tux3 2009-06-26 05:29 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-06-26 05:38 -!- npmccallum(~npmccallu@cpe-76-177-118-207.natcky.res.rr.com) has joined #tux3 2009-06-26 05:55 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-06-26 06:30 -!- pgquiles(~pgquiles@7.Red-88-0-138.dynamicIP.rima-tde.net) has joined #tux3 2009-06-26 07:13 good morning 2009-06-26 11:36 -!- ajonat(~ajonat@190.48.127.2) has joined #tux3 2009-06-26 13:00 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-26 14:19 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-26 14:35 -!- npmccallum(~npmccallu@76.177.118.207) has joined #tux3 2009-06-26 14:53 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-06-26 15:47 hey flipz 2009-06-26 15:47 having a discussion on whether or not C++ should be put into the kernel.
:) 2009-06-26 15:48 Linus would never accept that patch :) 2009-06-26 15:49 well, it's just that folks that want C++ in systems programming don't give a concrete reason why 2009-06-26 15:49 C++ is primarily a language for reusability of data structures, that doesn't really happen in Linux other than list.h 2009-06-26 15:49 and rbtree stuff 2009-06-26 15:50 rbtree still needs a lot of templatable code to be written out by hand though 2009-06-26 15:52 it's not hard to use 2009-06-26 15:52 true, but it's the sort of thing C++ templates would make easier :) 2009-06-26 15:53 yeah, but it's not *necessary* 2009-06-26 15:53 moving to that implementation doesn't help simplify things either 2009-06-26 15:53 since Linux uses container_of so much 2009-06-26 15:53 it works 2009-06-26 15:53 nothing in C++ is necessary. Every C++ feature can be implemented on any other finite-space turing machine ;) 2009-06-26 15:54 yeah, but it's not needed because it actually makes things more complicated which the opposite of what folks want when move to a language 2009-06-26 15:54 doesn't matter 2009-06-26 15:55 I think that constructors/destructors in stack objects might help with error paths though. But it does obscure things. 
2009-06-26 15:55 anyway it's kind of moot, because I don't see people accepting a move to C++ anytime soon :) 2009-06-26 15:56 yeah, object storage is a good reason 2009-06-26 15:56 but we have things that already track that 2009-06-26 15:57 the timer code uses such things 2009-06-26 15:57 smarter pointers is a good reason, but that's not why folks use C++ 2009-06-26 15:58 a language like Cyclone would be a better choice for this kind of development 2009-06-26 15:58 it's very C like but has a much more strict displine with object store 2009-06-26 17:42 -!- edt(~Ed@dsl-60-122.aei.ca) has joined #tux3 2009-06-26 18:29 I don't see the point of banning c++ as kernel compiler, when used as just a fancier C compiler 2009-06-26 18:30 i.e., no execeptions, no virtual functions 2009-06-26 18:30 yes 2009-06-26 18:30 well, in realwork, c++ is not it 2009-06-26 18:30 realworld 2009-06-26 18:31 well 2009-06-26 18:31 I meat the issue of atomic-commit 2009-06-26 18:32 with defer-ileaf stuff 2009-06-26 18:32 which issue? 
2009-06-26 18:32 inode doesn't have blockfork() stuff 2009-06-26 18:33 doesn't need it because we don't have pipelined commit yet 2009-06-26 18:33 well, I knew it was issue, but issue was more early 2009-06-26 18:33 the only place we use it for now is in flushing the bitmap 2009-06-26 18:33 I thought so, but, bitmap 2009-06-26 18:34 bitmap needs to use fork in flush_log() 2009-06-26 18:34 yes 2009-06-26 18:34 I thought that is already working 2009-06-26 18:34 btree is not 2009-06-26 18:34 btree->root/depth 2009-06-26 18:36 I'll be offline for 30 min of so 2009-06-26 18:36 ok 2009-06-26 19:08 -!- npmccallum(~npmccallu@h11.26.23.98.dynamic.ip.windstream.net) has joined #tux3 2009-06-26 19:49 -!- ajonat(~ajonat@190.48.120.137) has joined #tux3 2009-06-26 19:52 back 2009-06-26 19:53 ok 2009-06-26 20:00 ACTION wonders what dddd patch is 2009-06-26 20:01 it is test change before becoming real patch 2009-06-26 20:02 I'm using quilt like scripts, so, I'm using patch to revert it eaily 2009-06-26 20:03 http://userweb.kernel.org/~hirofumi/foodev.png 2009-06-26 20:04 is my current temporary hack 2009-06-26 20:04 I'm not checking what is happening at all though 2009-06-26 20:05 btw, it is flushing _volmap_ after commit to see result without replay 2009-06-26 20:06 let me see, LOG_BNODE_ROOT 2009-06-26 20:07 let me find where that log entry is written 2009-06-26 20:07 LOG_BNODE_ROOT is, allocation of btree root 2009-06-26 20:07 it means write_log? 2009-06-26 20:08 well, I'm also forgetting previous state in this patchset 2009-06-26 20:08 I'm trying to look it 2009-06-26 20:09 is emitted when cursor_redirect hits the root of a tree, I guess 2009-06-26 20:10 redirects the root 2009-06-26 20:11 no 2009-06-26 20:11 redirect is just logging as redirect 2009-06-26 20:11 it is new root allocation 2009-06-26 20:12 for now, it's insert_leaf() and alloc_empty_btree() 2009-06-26 20:12 ok, and we need to log that because? 
2009-06-26 20:12 because we don't write bnode on every delta 2009-06-26 20:12 normally, it would be because we will not write the affected ileaf in that delta 2009-06-26 20:12 ileaf is written? 2009-06-26 20:13 I meant, ileaf will be written, but bnode is not 2009-06-26 20:13 for now, all changed ileaf blocks should be written on each delta 2009-06-26 20:13 yes 2009-06-26 20:13 so if a data btree root is redirected or newly allocated, it should be enough just to mark the inode dirty 2009-06-26 20:14 why? 2009-06-26 20:14 btree points to root bnode, and root bnode points to leaf 2009-06-26 20:14 because then the inode's ileaf block will be written in the next delta 2009-06-26 20:14 leaf is, yet 2009-06-26 20:15 are we talking about the root of the itable? 2009-06-26 20:15 root bnode 2009-06-26 20:15 for all btree 2009-06-26 20:15 al, root bnode = itable root 2009-06-26 20:16 it can also be root bnode == data btree root 2009-06-26 20:17 inode->btree will be written as ileaf, but block(inode->btree.root.block) will not be written 2009-06-26 20:17 ok, well for now, I thought we would just update the superblock to point at the new itable root 2009-06-26 20:17 because we have to update it anyway, to point at the commit block of the log chain 2009-06-26 20:18 sb->itable->root.block is updated 2009-06-26 20:18 but, buffer(sb->itable->root.block) is not written yet?
2009-06-26 20:18 yes, that is supposed to be updated on every delta for now 2009-06-26 20:19 ah 2009-06-26 20:19 LOG_BNODE_ROOT is for it 2009-06-26 20:20 I get it :) 2009-06-26 20:20 ok :) 2009-06-26 20:20 but let me think it through a moment 2009-06-26 20:20 of course 2009-06-26 20:21 ok, it is because the data btree root will only be written out on a flush cycle 2009-06-26 20:21 exactly 2009-06-26 20:22 well, I'm not thinking about new log-tags deeply 2009-06-26 20:23 those are introduced, just because it is needed to work 2009-06-26 20:23 I am just thinking if that allows us to avoid dirtying the inode at all on some data writes 2009-06-26 20:24 some data writes? 2009-06-26 20:24 normally, we will have to dirty the inode and write it out every time a data write takes place 2009-06-26 20:24 to update mtime 2009-06-26 20:25 yes 2009-06-26 20:25 ok, and for now we are just going to do that, but we could log the mtime change instead, then we don't have to write the inode 2009-06-26 20:26 yes, if it improves, it can 2009-06-26 20:26 but the important thing right now is that I understand what LOG_BNODE_ROOT is for, and how to replay it 2009-06-26 20:26 it is same state with btree->root 2009-06-26 20:26 good 2009-06-26 20:27 well, I was thinking a little, it should be replyed easily 2009-06-26 20:27 I thought a little 2009-06-26 20:27 LOG_BNODE_ROOT has initial data and block address 2009-06-26 20:28 in short, LOG_BNODE_ROOT is a promise to update the in-memory btree root and dirty the inode, at the next flush cycle 2009-06-26 20:28 so, I thought, replay can do it without any dependency 2009-06-26 20:29 inode may not be necessary 2009-06-26 20:29 because it may already be written 2009-06-26 20:29 how do know at the flush cycle that the btree root has to be updated? 2009-06-26 20:29 buffer itself is dirty 2009-06-26 20:29 that is, the in-memory btree root 2009-06-26 20:30 which buffer? 
2009-06-26 20:30 now, btree root has both meanings, which confuses me 2009-06-26 20:30 yes 2009-06-26 20:30 btree->root is in ileaf, root bnode is buffer of root, how about this? 2009-06-26 20:31 well, so, btree root is which one? 2009-06-26 20:32 ok good 2009-06-26 20:32 ok 2009-06-26 20:33 the problem is, when cursor_redirect redirects the root of a data btree, it can't just change the inode->btree.root in the cached inode, because that will be written to disk on the next delta 2009-06-26 20:34 yes 2009-06-26 20:34 so one approach is to wait until the flush cycle, and update the inode->btree.root then 2009-06-26 20:34 is this our plan? 2009-06-26 20:34 redirect generates the log of redirect, so there is no problem? 2009-06-26 20:35 right :) 2009-06-26 20:35 good 2009-06-26 20:35 I like it when there is no problem 2009-06-26 20:35 for now, it seems to work 2009-06-26 20:36 well, we will see when the creation path starts to work 2009-06-26 20:36 ok, so then what do we need LOG_BNODE_ROOT for then? the new root should already be stored in the dirty inode 2009-06-26 20:37 it is ok if the bnode has not actually been written yet 2009-06-26 20:37 which I should have said earlier when you raised the issue 2009-06-26 20:37 but it took me a while to remember this 2009-06-26 20:37 need to think a bit 2009-06-26 20:38 ok, I will be back to interrupt your thinking in 5 minutes ;) 2009-06-26 20:38 could you write pseudocode for the problem case?
2009-06-26 20:38 :) 2009-06-26 20:40 it's not a problem, it's something good 2009-06-26 20:41 ah, ok 2009-06-26 20:41 well, with it, some complex path may have a problem 2009-06-26 20:41 I feel 2009-06-26 20:42 the reason we can write the inode out with the new inode->btree.root _before_ the btree root block is actually written is, a) we can have already chosen the new location for the root and b) we can _reconstruct_ the root block in cache on replay 2009-06-26 20:42 yes, exactly 2009-06-26 20:43 so, let's continue to log LOG_BNODE_ROOT, but I do not think I have to do anything special for it on replay 2009-06-26 20:43 it might be useful for debugging 2009-06-26 20:43 special? 2009-06-26 20:44 actually, I don't have to do anything at all for that log entry on replay 2009-06-26 20:44 it is excluding, it needs to reconstruct bnode 2009-06-26 20:44 ? 2009-06-26 20:45 ok, try this: if replay sees a LOG_BNODE_ROOT, what does it have to do? 2009-06-26 20:46 ok, I think, replay needs to write data to that buffer (not disk) 2009-06-26 20:48 but that LOG_REDIRECT log entries for the block's child pointers are enough to create the correct data for that buffer 2009-06-26 20:49 s/that/the/ 2009-06-26 20:49 no 2009-06-26 20:49 LOG_REDIRECT needs original buffer 2009-06-26 20:49 but, new allocated block doesn't have it 2009-06-26 20:50 what do you mean by "needs original buffer" ? 
2009-06-26 20:51 orignal buffer meant source block of dest block 2009-06-26 20:52 for example, it is insert_leaf() 2009-06-26 20:52 yes, LOG_REDIRECT only has the old and new locations of the block that was redirected, and does not specify the parent block 2009-06-26 20:54 LOG_UPDATE knows the parent and child blocks 2009-06-26 20:54 well, so, bottom of insert_leaf(), it allocates new block, and change btree->block 2009-06-26 20:54 yes 2009-06-26 20:54 I think it is part of redirect job 2009-06-26 20:55 LOG_BNODE_ROOT is out of redirect job 2009-06-26 20:55 you mean, cursor_redirect should write a LOG_UPDATE? 2009-06-26 20:56 yes 2009-06-26 20:56 actually, it does 2009-06-26 20:56 yes 2009-06-26 20:56 ah, no 2009-06-26 20:56 I meant, LOG_UPDATE is part of redirect job 2009-06-26 20:56 but, it is not enough 2009-06-26 20:57 I think we need to handle, e.g. bottom of insert_leaf() 2009-06-26 20:57 cursor_redirect does a LOG_REDIRECT and LOG_UPDATE on every iteration, probably we can combine those log entries 2009-06-26 20:57 well 2009-06-26 20:58 maybe 2009-06-26 20:58 yes, well 2009-06-26 20:58 LOG_BNODE_ROOT 2009-06-26 20:58 more example of LOG_BNODE_ROOT 2009-06-26 20:59 cursor has A->B->C 2009-06-26 20:59 A, B, and C is block 2009-06-26 20:59 right 2009-06-26 20:59 redirect will make A2->B2->C2 2009-06-26 21:00 but, insert_leaf() may add the new root, D->A2->B2->C2 2009-06-26 21:00 I think we need the log for D block 2009-06-26 21:00 yes we do 2009-06-26 21:00 very nice description 2009-06-26 21:01 ok, LOG_BNODE_ROOT is for when the btree gets a new level 2009-06-26 21:01 exactly 2009-06-26 21:01 well wait 2009-06-26 21:01 we can just store the pointer to D in the inode 2009-06-26 21:01 pointer is yes 2009-06-26 21:02 but, data is not stored 2009-06-26 21:02 because D is empty except for the pointer to A2 2009-06-26 21:02 yes 2009-06-26 21:02 so, LOG_BNODE_ROOT is logging the A2 2009-06-26 21:03 and that pointer is inserted by LOG_INSERT 2009-06-26 21:03 ok, but 
that is the same as any other new child insertion 2009-06-26 21:03 no 2009-06-26 21:04 another mean of LOG_BNODE_ROOT is, it indicates new allocation 2009-06-26 21:04 it needs to mark it to ->bitmap 2009-06-26 21:04 but we have LOG_BALLOC for that 2009-06-26 21:05 yes 2009-06-26 21:05 actually, it can separate to some log entries 2009-06-26 21:05 true 2009-06-26 21:05 we don't know what the perfect log entries are yet 2009-06-26 21:05 so, anything that works is good enough 2009-06-26 21:06 and too many log entries is better than too few 2009-06-26 21:06 yes, good 2009-06-26 21:06 however, this discussion has allowed me to remember the essential logic 2009-06-26 21:07 good 2009-06-26 21:07 and I see I never defined a LOG_INSERT 2009-06-26 21:07 a LOG_DELETE is also necessary 2009-06-26 21:07 for truncate 2009-06-26 21:07 yes, DELETE is not including my path 2009-06-26 21:08 patch 2009-06-26 21:08 but, LOG_INSERT is LOG_BNODE_ADD 2009-06-26 21:08 patch and my thinking 2009-06-26 21:08 ah, good 2009-06-26 21:09 we also need LOG_BNODE_ADD when a new leaf is inserted 2009-06-26 21:09 yes, exactly 2009-06-26 21:09 I added it to insert_leaf() 2009-06-26 21:10 do you mind if we rename LOG_BNODE_ADD to LOG_INSERT? 
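The D->A2->B2->C2 case above can be modelled in a few lines. This is only a sketch under stated assumptions: the entry layouts, the `replay` function, the block numbers and the sibling block `A3` are hypothetical and are not the tux3 log format. It illustrates the point made in the discussion that a newly allocated root has no earlier version to redirect from, so its log entry has to carry the block address and the initial contents, and that replay rebuilds the buffer in cache rather than on disk.

```python
# Hypothetical model of log replay; entry layouts are assumptions,
# not the tux3 on-disk log format.

def replay(log, cache, allocated):
    """Rebuild dirty bnode buffers in cache (block -> sorted [(key, child)])."""
    for entry in log:
        kind = entry[0]
        if kind == "LOG_BALLOC":
            # The block was allocated in this delta: mark it in the bitmap model.
            allocated.add(entry[1])
        elif kind == "LOG_BNODE_ROOT":
            # A new root has no previous version on disk, so the entry
            # carries its initial contents: one pointer to the old root.
            _, root, key, child = entry
            cache[root] = [(key, child)]
        elif kind == "LOG_BNODE_ADD":
            # Insert one child pointer into a parent that is already in cache.
            _, parent, key, child = entry
            cache[parent].append((key, child))
            cache[parent].sort()
        else:
            raise ValueError("unknown log entry: %r" % (kind,))
    return cache

# A2 is the redirected old root, A3 an assumed sibling, D the new root.
A2, A3, D = 102, 103, 200
log = [
    ("LOG_BALLOC", D),
    ("LOG_BNODE_ROOT", D, 0, A2),
    ("LOG_BNODE_ADD", D, 50, A3),
]
cache, allocated = {}, set()
replay(log, cache, allocated)
print(cache[D], allocated)
```

Whether the allocation is folded into the root entry or kept as a separate balloc entry is left open in the discussion; the model keeps them separate only because that is easier to read.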
2009-06-26 21:10 I thought LOG_INSERT is too short 2009-06-26 21:10 I forgot why..., try to remember 2009-06-26 21:11 we don't insert anything else 2009-06-26 21:11 only children are insert or deleted from parents 2009-06-26 21:11 ah, it may be for redirect 2009-06-26 21:11 LOG_UPDATE is the worst name :) 2009-06-26 21:11 it doesn't say anything 2009-06-26 21:12 I added LOG_LEAF_REDIRECT, and LOG_BNODE_REDIRECT 2009-06-26 21:12 ok, that would be what I called LOG_UPDATE 2009-06-26 21:12 so, I added BNODE to all bnode operations 2009-06-26 21:12 ok, let's go with your names 2009-06-26 21:13 thanks 2009-06-26 21:13 btw, LEAF and BNODE has different defree cycle, so, those are separeted 2009-06-26 21:13 yes 2009-06-26 21:14 however, we only log changes to the parent nodes 2009-06-26 21:14 yes 2009-06-26 21:14 ah 2009-06-26 21:15 LOG_UPDATE may be too short for newbie? 2009-06-26 21:15 not sure at all 2009-06-26 21:15 we already decided to use your names ;) 2009-06-26 21:15 thanks :) 2009-06-26 21:15 LOG_UPDATE will be gone on the next pull 2009-06-26 21:16 so... split 2009-06-26 21:16 ok, it will become LOG_BNODE_UPDATE 2009-06-26 21:16 well, this can be _ADD actually 2009-06-26 21:16 yes 2009-06-26 21:17 but I like _UPDATE more than _ADD 2009-06-26 21:17 but, for now, it would be usefull for debugging 2009-06-26 21:17 update, insert, delete 2009-06-26 21:17 the 3 fundamental edit operations 2009-06-26 21:17 or, insert, delete, change 2009-06-26 21:17 ok, let's use insert 2009-06-26 21:18 and "change" is better than "update" 2009-06-26 21:18 ADD was just shorter than INSERT 2009-06-26 21:18 ADD, DEL, SET? 2009-06-26 21:18 :) 2009-06-26 21:18 :) 2009-06-26 21:19 well, ok, let's rename those later 2009-06-26 21:19 anyway, I know what they are now 2009-06-26 21:19 so... split 2009-06-26 21:19 I added split is just for insert_leaf(), iirc 2009-06-26 21:20 yes, now, insert_leaf is only user 2009-06-26 21:20 so, how do we express split in terms of the other operations? 
2009-06-26 21:20 well, we don't 2009-06-26 21:21 yes 2009-06-26 21:21 split is special, because the data needed to reconstruct the two resulting dirty blocks comes from the original dirty block 2009-06-26 21:21 it has new parameter, pos 2009-06-26 21:21 yes 2009-06-26 21:22 well, purely, pos is not need to split 2009-06-26 21:22 if we are sure, we are using same algorithm with log was written 2009-06-26 21:23 btree-bnode-split-log.patch <- adds the correct log entry 2009-06-26 21:24 I feel better having the split pos explicit :) 2009-06-26 21:24 yes, it 2009-06-26 21:24 me too 2009-06-26 21:24 it means, if we change our split algorithm and release an update, replay of an old filesystem will not break 2009-06-26 21:24 exactly 2009-06-26 21:25 ok, so it looks like your patch set covers everything we need to do all operations except truncate 2009-06-26 21:26 I'm not sure for now, I'm thinking only creation path though 2009-06-26 21:26 yes, good 2009-06-26 21:27 so... I think I should apply the patch set and get to work on replay 2009-06-26 21:27 any reason why not? 2009-06-26 21:27 main reason is it is not work :) 2009-06-26 21:27 well, should we make a branch for it? 2009-06-26 21:28 it is about time I tried working on a branch in mercurial 2009-06-26 21:28 find out if it works ok 2009-06-26 21:28 it may work 2009-06-26 21:29 since I'm thinking those are still temporary, I would like to use scripts to change history though 2009-06-26 21:30 temporary means I would like to merge/split patches, not basic change though 2009-06-26 21:30 sounds like extra work 2009-06-26 21:31 ok, if I was using quilt style, I would just use your series to apply the correct patches in the correct order 2009-06-26 21:31 does mercurial know how to do that? 2009-06-26 21:31 I don't know 2009-06-26 21:32 you do this with your own fork of quilt? 
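The reason for logging the split position explicitly can be sketched as follows. This is a minimal model with hypothetical names (`replay_split`, the block numbers and the entry tuples are assumptions, not tux3 code): replay only moves the entries from `pos` onward into the new block, so it never needs to know which policy chose `pos` when the log was written.

```python
# Hypothetical model: a bnode is a list of (key, child) entries.

def replay_split(cache, src, dst, pos):
    """Redo a logged split: entries from pos onward move from src to dst."""
    entries = cache[src]
    cache[src] = entries[:pos]
    cache[dst] = entries[pos:]

# The log says "block 7 was split into block 8 at position 3".
cache = {7: [(0, 30), (10, 31), (20, 32), (30, 33), (40, 34)]}
replay_split(cache, src=7, dst=8, pos=3)
print(cache[7])
print(cache[8])
```

If a later release changes the split policy (say, from "split in the middle" to "split at the insertion point"), old logs still replay the same way, because the position comes from the log entry and not from the code doing the replay.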
2009-06-26 21:32 if you are not lazy to use my scripts, I'll just send it to you 2009-06-26 21:32 I'll do it :) 2009-06-26 21:32 quilt is fork of akpm scripts 2009-06-26 21:32 yes 2009-06-26 21:32 my scripts is fork of akpm 2009-06-26 21:33 http://userweb.kernel.org/~hirofumi/patch-scripts.tar.gz 2009-06-26 21:33 ok, I should use your scripts to apply your patches, and I will do it in a temporary repository 2009-06-26 21:33 is my patch-scripts 2009-06-26 21:34 export PATH=/foo/bar/patch-scripts:$PATH 2009-06-26 21:34 and pushpatch 99999 2009-06-26 21:34 it should work 2009-06-26 21:35 how does it know what the source directory for patches is? 2009-06-26 21:36 should I be in the current directory? 2009-06-26 21:36 in the parent directory I mean? 2009-06-26 21:36 it search "patchset" directory from current dir 2009-06-26 21:36 ok, so I should be in the parent of patchset/ 2009-06-26 21:36 yes 2009-06-26 21:37 if patchset is in topdir of project, it should work on any dir in project 2009-06-26 21:38 pushpatch 99999 2009-06-26 21:38 patching file user/kernel/tux3.h 2009-06-26 21:38 applied update-magic-number 2009-06-26 21:38 patching file user/kernel/inode.c 2009-06-26 21:38 Hunk #1 succeeded at 629 (offset 63 lines). 2009-06-26 21:38 applied inode-init-race-fix 2009-06-26 21:38 patching file user/kernel/btree.c 2009-06-26 21:38 Hunk #1 succeeded at 474 with fuzz 2 (offset 7 lines). 2009-06-26 21:38 patching file user/kernel/filemap.c 2009-06-26 21:38 Hunk #1 FAILED at 107. 2009-06-26 21:38 1 out of 1 hunk FAILED -- saving rejects to file user/kernel/filemap.c.rej 2009-06-26 21:38 patching file user/kernel/inode.c 2009-06-26 21:38 Hunk #2 succeeded at 235 (offset 33 lines). 2009-06-26 21:38 Hunk #3 FAILED at 287. 2009-06-26 21:38 Hunk #4 FAILED at 314. 
2009-06-26 21:38 2 out of 4 hunks FAILED -- saving rejects to file user/kernel/inode.c.rej 2009-06-26 21:38 alloc_cursor-lock-fix does not apply 2009-06-26 21:38 ah 2009-06-26 21:38 sorry 2009-06-26 21:39 I can download your fixed patchset 2009-06-26 21:39 > patchset/applied-patches 2009-06-26 21:40 my tarball applied some patches already 2009-06-26 21:40 and use pushpatch -F 9999 2009-06-26 21:40 right, should I get a new tarball, or try to recover? 2009-06-26 21:40 -F is safety flag 2009-06-26 21:40 ok, -F sounds good :) 2009-06-26 21:41 I updated the dddd, so I'll push new tarball 2009-06-26 21:41 ok 2009-06-26 21:41 anyway, it was easy 2009-06-26 21:42 http://userweb.kernel.org/~hirofumi/patchset.tar.gz 2009-06-26 21:42 yes, it's really easy 2009-06-26 21:42 poppatch 9999 2009-06-26 21:42 will revert patches 2009-06-26 21:43 it did 2009-06-26 21:43 and the .rej files did not actually seem to be created 2009-06-26 21:43 poppatch , pushpatch -F , will do until 2009-06-26 21:43 ah 2009-06-26 21:44 rejected patch is not actually applied 2009-06-26 21:44 ok, so you use --dryrun ? 2009-06-26 21:44 yes, iirc 2009-06-26 21:44 -F, not allow even if fuzzy 2009-06-26 21:44 or something it 2009-06-26 21:44 I often wish that you could tell patch to write the .rej, but not apply the patch 2009-06-26 21:45 pushpatch -f is force option 2009-06-26 21:45 but, it will actually apply the patch 2009-06-26 21:46 have you ever looked at the mercurial q commands? 
2009-06-26 21:46 I see several of variants 2009-06-26 21:46 stgit or something 2009-06-26 21:47 but, I think this job should be out of scm 2009-06-26 21:47 qpop pop the current patch off the stack 2009-06-26 21:47 qprev print the name of the previous patch 2009-06-26 21:47 qpush push the next patch onto the stack 2009-06-26 21:47 it's part of mercurial now 2009-06-26 21:47 built in 2009-06-26 21:47 yes 2009-06-26 21:47 it seems quilt 2009-06-26 21:48 the scm can pop faster, but reverting to a previous version 2009-06-26 21:48 anyway, I'm using your scripts ;) 2009-06-26 21:48 later I will try the hg quilt commands 2009-06-26 21:49 I recommend quilt or something 2009-06-26 21:49 because it is not depending scm, so your finger can do it without scm 2009-06-26 21:50 well, your way is obviously a good way, you can manage patches much faster than I can 2009-06-26 21:50 well, it is what favorate 2009-06-26 21:50 and split them up better 2009-06-26 21:51 yes 2009-06-26 21:51 ok, it's oyasumi time for me 2009-06-26 21:51 this works well with scm 2009-06-26 21:51 ok, oyasumi 2009-06-26 21:52 I'll try to make sane my patchset 2009-06-26 21:52 should I download a new tarball? 
2009-06-26 21:52 ok 2009-06-26 21:52 and I will look here for instructions, if you are not awake 2009-06-26 21:52 oyasumi 2009-06-26 21:52 ok 2009-06-26 21:53 pushpatch -F - apply the next patch with safety flag 2009-06-26 21:53 :) 2009-06-26 21:53 pushpatch -f - apply the next patch even if .rej 2009-06-26 21:53 :) 2009-06-26 21:54 pushpatch [-F|-f] 2009-06-26 21:54 poppatch 2009-06-26 21:58 fpatch foo/bar.c - prepare to generate a patch, .patch by diff of foo/bar.c 2009-06-26 22:00 fpatch foo/bar.c - add foo/bar.c to current patch 2009-06-26 22:01 refpatch - refresh the patch 2009-06-26 22:01 usage: 2009-06-26 22:02 cd user/kernel 2009-06-26 22:02 fpatch my-change btree.c 2009-06-26 22:02 2009-06-26 22:02 fpatch tux3.h 2009-06-26 22:03 2009-06-26 22:03 refpatch 2009-06-26 22:03 [refpatch creates patchset/patches/my-change.patch] 2009-06-26 22:03 poppatch 2009-06-26 22:04 [poppatch cancel the top of patches] 2009-06-26 22:04 echo "my-change.patch" >> $TOP_DIR/patchset/series 2009-06-26 22:05 [add patch for pushpatch] 2009-06-26 22:05 pushpatch -F 9999 2009-06-26 22:05 [apply patches by patchset/series order] 2009-06-26 22:06 ah, so the way you swap patch order is by editing series 2009-06-26 22:07 yes 2009-06-26 22:07 and try "pushpatch -F" whether it can apply or not 2009-06-26 22:08 then, if work, "refpatch" 2009-06-26 22:09 refpatch is not needed, if it's really clean to apply 2009-06-26 22:17 basic work would be enough with "pushpatch/poppatch/fpatch/refpatch" 2009-06-26 22:18 extra one: 2009-06-26 22:18 import_patch - import the patch 2009-06-26 22:19 mv-patch - rename patch 2009-06-26 22:19 kill-patch - delete patch and metafiles 2009-06-26 22:21 patch-bomb [-d|-n|-c|-t|-p] - email patches 2009-06-26 22:23 join-patch - join the patch to current one, and apply 2009-06-26 22:24 more: 2009-06-26 22:24 patchset/patches/* - created patches 2009-06-26 22:24 patchset/pc/* - list of files in the patch 2009-06-26 22:25 patchset/txt/* - comment to include the patch 
2009-06-26 22:25 patchset/applied-patches - applied patches 2009-06-26 22:25 patchset/series - order of patch stack 2009-06-26 22:25 patchset/config - config of patch scripts 2009-06-26 23:12 -!- RazvanM(~RazvanM@pool-173-67-57-143.bltmmd.east.verizon.net) has joined #tux3 2009-06-27 00:38 ah, another nice to have things, it may be good if leaf has pointer to next leaf 2009-06-27 00:38 it can be used for link list of leaf 2009-06-27 00:39 if we have it, I guess btree index can be rebuild by using it 2009-06-27 00:40 and advance() can not be read btree-index 2009-06-27 00:41 and if btree-index block is corrupted or become bad block, we wouldn't lost all of subtree 2009-06-27 00:52 btw, I'm not thinking about any maintainance/implement/runtime cost at all, so, need to think later 2009-06-27 02:15 -!- pythonstar(~kavli@c-0df1e455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-06-27 08:00 -!- npmccallum(~npmccallu@cpe-76-177-118-207.natcky.res.rr.com) has joined #tux3 2009-06-27 08:48 hirofumi, instead of a pointer to the next leaf, we should put the inum in the leaf, that way the index can be reconstructed even if one of the leaf nodes is missing 2009-06-27 08:50 how can we know leaf of block address? 2009-06-27 08:51 the leaf already includes the logical extent address 2009-06-27 08:51 dleaf does, and inleaf includes the base inode number 2009-06-27 08:51 ileaf I mean 2009-06-27 08:52 it is ibase for ileaf? 2009-06-27 08:52 yes 2009-06-27 08:52 we should probably have a sequence number, so we know which version of a leaf is the latest 2009-06-27 08:53 well 2009-06-27 08:53 the global version number of last update 2009-06-27 08:53 it can do without btree-index? 2009-06-27 08:55 yes 2009-06-27 08:55 um... 2009-06-27 08:56 the index can be treated as just an accelerator that can be reconstructed 2009-06-27 08:56 suppose btree root bnode was became bad-block 2009-06-27 08:56 how can we know physical address of ileaf block? 
2009-06-27 08:57 scan the disk for unreferenced blocks with the magic + inode + missing offset 2009-06-27 08:59 we may see history of several ileaf blocks 2009-06-27 09:02 true 2009-06-27 09:02 this is likely in fact 2009-06-27 09:02 well, it is not high priority at all 2009-06-27 09:03 s/it/ileaf next pointer/ 2009-06-27 09:03 right, we know we are going to add some additional data to help reconstruction 2009-06-27 09:04 ok 2009-06-27 09:05 what should I do about your patch set that doesn't quite apply? 2009-06-27 09:05 well, why I said it is, I read the chunkfs pdf 2009-06-27 09:05 fix it or wait for an updated patch set? 2009-06-27 09:05 ah yes 2009-06-27 09:05 may I talk about that for a moment? 2009-06-27 09:05 yes 2009-06-27 09:06 I have a plan to do what chunkfs tries to do, in a more practical way 2009-06-27 09:06 oh 2009-06-27 09:06 the key idea is to be able to reverse map the volume, so that given a physical block you can know the parent block that points to it 2009-06-27 09:07 but that reverse information is only recorded if the parent is in a different block group than the child 2009-06-27 09:08 a block group is the number of physical blocks that can be mapped by a single bitmap block 2009-06-27 09:08 i see 2009-06-27 09:08 we have a table of all pointers that point to other block groups, and we always try to allocate a child in the same block group as its parent 2009-06-27 09:09 parent means parent btree? 
2009-06-27 09:09 parent block, I was thinking 2009-06-27 09:09 i see 2009-06-27 09:09 so, an index node or an itable node 2009-06-27 09:10 the table of "far pointers" should be quite small and not expensive to maintain 2009-06-27 09:10 so, online fsck will work block group by block group 2009-06-27 09:11 um..., I need time to think, it helps from what dameges 2009-06-27 09:11 for each block group it reconstructs a reverse map of all the blocks in the group, by looking at the itable blocks and index blocks within the same group 2009-06-27 09:11 what damage do it help 2009-06-27 09:11 random block corruption, for one thing 2009-06-27 09:11 i see 2009-06-27 09:12 um..., for example, if btree bnode root was corrupted, what can we know from it? 2009-06-27 09:12 when the reverse map is fully constructed, you end up with a list of unreferenced blocks in the block group 2009-06-27 09:13 you mean, we get a read error, trying to read a data index root? 2009-06-27 09:14 yes 2009-06-27 09:14 so, the index tree will end up as unreferenced blocks, and all the data blocks 2009-06-27 09:15 reverse map can't be used to help some or all data? 2009-06-27 09:15 for each unreferenced block in the group, we first check the "far map" to see if the block is referenced from another group 2009-06-27 09:16 if the block is unreferenced, we look for an ileaf magic number, and we end up with a list of all unreferenced ileaf blocks 2009-06-27 09:16 that is, for that damaged inode 2009-06-27 09:17 then we "somehow" determine which of those ileaf blocks is the most recently written 2009-06-27 09:17 ah 2009-06-27 09:19 the "somehow" is an interesting question to think about 2009-06-27 09:19 yes 2009-06-27 09:19 "somehow" have the above history problem? 
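The block group and far pointer idea described above can be sketched like this. Everything here is an assumption for illustration: the 4096-byte block size, the function names, and the representation of the far map as a dictionary from child block to parent block. The model shows the two rules from the discussion: a pointer is recorded in the far map only when parent and child are in different groups, and a per-group check reports the allocated blocks that nothing points to.

```python
# Hypothetical model of per-group checking with a far pointer table.
BLOCK_SIZE = 4096                      # assumed block size in bytes
BLOCKS_PER_GROUP = BLOCK_SIZE * 8      # one bitmap block holds one bit per block

def group_of(block):
    return block // BLOCKS_PER_GROUP

def record_pointer(far_map, parent, child):
    """Remember the parent only when it lives in a different group."""
    if group_of(parent) != group_of(child):
        far_map[child] = parent

def unreferenced_in_group(group, allocated, local_pointers, far_map):
    """List allocated blocks of a group that no parent points to.

    local_pointers: (parent, child) pairs found by scanning the index
    and itable blocks inside the same group.
    """
    reverse = {child: parent for parent, child in local_pointers
               if group_of(child) == group}
    return sorted(b for b in allocated
                  if group_of(b) == group
                  and b not in reverse
                  and b not in far_map)

far_map = {}
record_pointer(far_map, 70000, 10)     # parent in group 2, child in group 0
record_pointer(far_map, 40000, 13)     # parent in group 1, child in group 0
record_pointer(far_map, 10, 11)        # same group: nothing recorded

allocated = {10, 11, 12, 13, 14}
local_pointers = [(10, 11), (10, 12)]
orphans = unreferenced_in_group(0, allocated, local_pointers, far_map)
print(orphans)
```

In this example block 14 is allocated but unreferenced, so it is a candidate for the magic-number scan that looks for lost leaf blocks.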
2009-06-27 09:20 I think I actually meant "dleaf" above, but the same is true of damaged inode table 2009-06-27 09:21 we are likely to have many similar copies of a btree leaf present 2009-06-27 09:21 probably 2009-06-27 09:21 so we need to record something in it to know which is the most recent, a timestamp might be good 2009-06-27 09:21 a sequence number could also work, except for wrap 2009-06-27 09:22 timestamp has resolution problem 2009-06-27 09:22 and timestamp can be back 2009-06-27 09:22 yes, bad 2009-06-27 09:23 ok, here is an idea: record the physical address of the previous version of a redirected leaf in the leaf 2009-06-27 09:23 well, it seems to be big to think right now 2009-06-27 09:23 well I thought about it :) 2009-06-27 09:23 s/big/need time/ 2009-06-27 09:23 I'm happy with the latest idea 2009-06-27 09:23 :) 2009-06-27 09:24 happy enough to write a design note about it 2009-06-27 09:25 e.g. if user is runing kvm with tux3, we can know correct metadata? 2009-06-27 09:25 kvm meant, tux3 in file image 2009-06-27 09:26 you mean, when a filesystem image is present as file contents? 2009-06-27 09:26 yes 2009-06-27 09:26 that is a challenge of course, and it is common 2009-06-27 09:26 i see, ok 2009-06-27 09:26 maybe we can make all the virtual blocks a different color? :) 2009-06-27 09:26 well 2009-06-27 09:26 :) 2009-06-27 09:26 better have a filesystem id too 2009-06-27 09:27 hard to handle that case without it 2009-06-27 09:27 and it is a common case 2009-06-27 09:27 but, id is in each blocks? 
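The "previous version address" idea can be sketched as follows, as an alternative to timestamps or sequence numbers. The function name and the dictionary layout are hypothetical: each candidate leaf copy found by the scan is keyed by its physical address and maps to the address of the version it was redirected from, or `None` for the first version. The newest copy is then the one that no other candidate names as its predecessor.

```python
# Hypothetical model: candidates maps physical address -> previous address.

def latest_versions(candidates):
    """Return the copies that no other copy claims as its previous version."""
    superseded = {prev for prev in candidates.values() if prev is not None}
    return sorted(addr for addr in candidates if addr not in superseded)

# Three copies of the same leaf: 100 was redirected to 250, then 250 to 90.
copies = {100: None, 250: 100, 90: 250}
newest = latest_versions(copies)
print(newest)
```

The function returns a list because the chain can be broken, for example when an intermediate copy has been overwritten by unrelated data; in that case more than one candidate survives and some other tie-breaker is still needed.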
2009-06-27 09:27 just index leaf blocks 2009-06-27 09:27 ah 2009-06-27 09:28 well, user can do copy image from one to other 2009-06-27 09:28 without mkfs 2009-06-27 09:28 and file ids can collide 2009-06-27 09:28 yes 2009-06-27 09:29 well, back to patchset 2009-06-27 09:29 well, it is ok of the image is copied, we just need to know about the blocks of the _host_ filesystem 2009-06-27 09:29 id collision is the biggest problem, and if we just make the id random 32 bit hash, I think we are ok 2009-06-27 09:30 ok, back to patchset 2009-06-27 09:30 what was happened? 2009-06-27 09:30 rejected the patches? 2009-06-27 09:30 yes, as above 2009-06-27 09:31 "pushpatch -F dddd" is work? 2009-06-27 09:31 patching file user/kernel/filemap.c 2009-06-27 09:31 Hunk #1 FAILED at 107. 2009-06-27 09:31 1 out of 1 hunk FAILED -- saving rejects to file user/kernel/filemap.c.rej 2009-06-27 09:31 patching file user/kernel/inode.c 2009-06-27 09:31 Hunk #2 succeeded at 235 (offset 33 lines). 2009-06-27 09:31 Hunk #3 FAILED at 287. 2009-06-27 09:31 Hunk #4 FAILED at 314. 2009-06-27 09:31 2 out of 4 hunks FAILED -- saving rejects to file user/kernel/inode.c.rej 2009-06-27 09:31 I will try 2009-06-27 09:31 what patch was rejected? 2009-06-27 09:32 pushpatch -F dddd did work 2009-06-27 09:32 ah, ok 2009-06-27 09:32 my series is including pending patches to remember 2009-06-27 09:33 and, it is not updated to current one 2009-06-27 09:33 below of dddd.patch 2009-06-27 09:33 "applied update-magic-number" seems to be the one that failed 2009-06-27 09:33 it is ok 2009-06-27 09:34 below of dddd.patch is not needed for now 2009-06-27 09:34 but does that message come after applying the patch, or before? 
2009-06-27 09:34 ok 2009-06-27 09:35 patch --dry-run foo.patch 2009-06-27 09:35 echo "result of patch" 2009-06-27 09:35 also, I did not do a pull before applying the patches, I used your tarball just as it was 2009-06-27 09:35 iirc, this order 2009-06-27 09:36 so now I will pop the patch set, pull from my working repository, and reapply dddd 2009-06-27 09:36 this time I will fix anything that breaks 2009-06-27 09:36 is this the right idea? 2009-06-27 09:37 or maybe I should clone my working repository 2009-06-27 09:37 yes, sounds good 2009-06-27 09:37 I will check the hg log and decide whether to pull or clone 2009-06-27 09:37 clone or branch would be safe 2009-06-27 09:37 this is a nice way of working 2009-06-27 09:38 ok, the log of your hg repo is the same as mine 2009-06-27 09:38 this means this scripts? 2009-06-27 09:38 yes, the scripts combined with hg 2009-06-27 09:38 yes 2009-06-27 09:38 I have been making things difficult for myself for years :) 2009-06-27 09:39 scripts is for exactly jobs before scm 2009-06-27 09:39 I need both 2009-06-27 09:40 and the version control system makes it easy to recover from any patch damage 2009-06-27 09:40 yes, and manages the merge, distribute, history, or more 2009-06-27 09:41 the diff of your working patchset is 1700 lines 2009-06-27 09:42 oh, unfortunately, it's big 2009-06-27 09:42 22 files changed, 639 insertions(+), 226 deletions(-) 2009-06-27 09:42 that's ok 2009-06-27 09:43 we have a powerful tool to control that now 2009-06-27 09:44 it is because I didn't know what happen with this job 2009-06-27 09:44 it is easy to read 2009-06-27 09:44 I think it is essentially correct 2009-06-27 09:44 btw, this scripts is not manage conflict/merge/distribute well 2009-06-27 09:45 I understand 2009-06-27 09:45 i.e., it is not intended to share patches by two or more developers 2009-06-27 09:46 best idea is probably to work with the patchset for a while, then merge it for distribution and begin a new patchset 2009-06-27 09:46 yes,
exactly 2009-06-27 09:47 well, so, I'll try to clean this patchset up, and try mergable :) 2009-06-27 09:50 it looks very close 2009-06-27 09:51 and we are very close to having working logging except for delete 2009-06-27 09:51 I suppose ./tux3 is the best test tool for this right now 2009-06-27 09:52 yes, I hope so 2009-06-27 09:52 probably 2009-06-27 09:52 also, I have used the unit test in user/commit.c before, I can update that 2009-06-27 09:53 I will try it works more every where 2009-06-27 09:54 this patchset is trying to work atomic-commit stuff like writeback stuff 2009-06-27 09:54 yes, that is clear 2009-06-27 09:54 but, it is not working yet 2009-06-27 09:56 let me see if I can see why 2009-06-27 09:56 or you can just tell me ;) 2009-06-27 09:56 well, writeback.c is 2009-06-27 09:57 this patchset just disable old stuff by hack 2009-06-27 09:57 and it is everwhere more or less 2009-06-27 09:57 you mean, the #if 0? 2009-06-27 09:57 yes 2009-06-27 09:58 and add change_begin/change_end directly 2009-06-27 09:59 1677 int sync_super(struct sb *sb) 2009-06-27 09:59 1678 { 2009-06-27 09:59 1680 + change_begin(sb); 2009-06-27 09:59 1681 + change_end(sb); <- this defines our immediate goal 2009-06-27 09:59 yes 2009-06-27 10:00 this diff is very easy to read 2009-06-27 10:00 thanks 2009-06-27 10:00 dddd.patch would be complex 2009-06-27 10:01 it is my first version of patch - make patch with encounter 2009-06-27 10:01 encounter? 2009-06-27 10:02 I see what I do here, then write code, make patch 2009-06-27 10:02 the result of patch would be mix of some works 2009-06-27 10:02 reading sync_inodes now 2009-06-27 10:03 after that, I try to split with rethink of it 2009-06-27 10:03 handling of bitmap and volmap makes complete sense 2009-06-27 10:03 s/split/split or merge or cleanup/ 2009-06-27 10:04 the error handling is interesting... 
put the dirty inode list back on the sb 2009-06-27 10:05 ok, sync_inodes does not have to sync the bitmap and volmap I think 2009-06-27 10:06 it can force a flush cycle if it wants, but this is optional 2009-06-27 10:07 ah, you remove that in your diff 2009-06-27 10:07 remove the bitmap sync, but not the volmap sync 2009-06-27 10:07 ah 2009-06-27 10:07 yes 2009-06-27 10:08 so the volmap sync can be removed as well, I think 2009-06-27 10:08 bitmap is already part of atomic-commit 2009-06-27 10:08 yes 2009-06-27 10:08 but, it forces me to implement replay to see result 2009-06-27 10:09 sync inode is very easy to read 2009-06-27 10:09 I wish the core kernel code was this clean 2009-06-27 10:09 I think, it is possible that we could backport your userspace implementation of sync_inodes to kernel and clean up a lot of cruft 2009-06-27 10:10 well, userspace code is not including locking rule 2009-06-27 10:10 well, I think that implementing replay to see the result is exactly the right thing to do 2009-06-27 10:11 backport is just something to think about 2009-06-27 10:11 develop a sync model that is more useful for modern filesystems, and backport it 2009-06-27 10:11 to kernel 2009-06-27 10:12 -!- dagle(~dagle@host162-104.bornet.net) has joined #tux3 2009-06-27 10:12 it may can 2009-06-27 10:12 we can include locking the same way as other parts of userspace, with fake locks 2009-06-27 10:13 and debug the locking in kernel 2009-06-27 10:13 well, inode state is not same with kernel 2009-06-27 10:14 sure, maybe your inode states are better though 2009-06-27 10:14 so, it may hard to share with kernel 2009-06-27 10:14 it is similar with my buffer model, I think my buffer state handling is superior to kernel 2009-06-27 10:14 but fitting that back into historical filesystems would be too much work 2009-06-27 10:14 and too many bugs 2009-06-27 10:15 so instead, we introduce the per-filesystem handle concept, and modern filesystems can use it 2009-06-27 10:15 yes, problem is 
almost it 2009-06-27 10:15 I guess it would be good 2009-06-27 10:16 at least, for flush path 2009-06-27 10:17 how about this proposal: instead of sync->inodes -> sync_inode(sb->volmap), we just force a flush cycle? 2009-06-27 10:17 but, share the code is also important 2009-06-27 10:17 flush cycle? 2009-06-27 10:17 ah 2009-06-27 10:17 that way you see the result on disk 2009-06-27 10:18 current one is just hack 2009-06-27 10:18 whole of those would be removed 2009-06-27 10:18 yes 2009-06-27 10:18 forcing a flush cycle is just a different hack 2009-06-27 10:18 I'm thinking to implement it in commit.c 2009-06-27 10:19 btw, forcing a flush cycle is what's for? 2009-06-27 10:20 oh, there is an improved graphical bitmap dump :) 2009-06-27 10:20 yes 2009-06-27 10:20 I was needing to debug bitmap atomic-commit stuff 2009-06-27 10:20 force a flush cycle would be just so you can see the updated volmap blocks in your graphical too 2009-06-27 10:20 well 2009-06-27 10:21 ah 2009-06-27 10:21 ok, if I want to #ifdef 0 the volmap flush, then I will :) 2009-06-27 10:21 yes 2009-06-27 10:23 1313 + /* sb->bitmap is always dirty */ 2009-06-27 10:23 1314 + invalidate_buffers(sb->bitmap->map); <- this is to supress an assert? 2009-06-27 10:24 yes, iirc 2009-06-27 10:24 assert, or valgrind error 2009-06-27 10:25 not sure, what happen without it now 2009-06-27 10:25 http://userweb.kernel.org/~hirofumi/dddd.patch 2009-06-27 10:26 this is current dddd.patch 2009-06-27 10:26 it is needed to blockfork of bitmap 2009-06-27 10:26 poppatch 9999 2009-06-27 10:27 cp dddd.patch patchset/patches/ 2009-06-27 10:27 pcpatch dddd 2009-06-27 10:27 pushpatch -F 9999 2009-06-27 10:27 pcpatch - regenerate dddd.pc file from patch 2009-06-27 10:28 it may be needed to see correct bitmap state for atomic-commit 2009-06-27 10:28 current patch moves blockdirty from user/commit.c to kernel/btree.c ? 
2009-06-27 10:28 yes 2009-06-27 10:29 user/commit.c is not including super.c or others 2009-06-27 10:29 I didn't notice it 2009-06-27 10:30 it is another dirty hack 2009-06-27 10:30 and kernel never does blockdirty in current patch? 2009-06-27 10:30 yes 2009-06-27 10:30 what does it do instead? 2009-06-27 10:31 maybe, with this patchset, kernel would be broken 2009-06-27 10:31 ok, fine, we know what to do about that 2009-06-27 10:32 well, after cleanup, if possible, kernel would work with writeback manner 2009-06-27 10:33 we have a partially debugged kernel port of blockdirty somewhere 2009-06-27 10:33 you improved it iirc 2009-06-27 10:33 yes, those are fork*.patch in this patchset 2009-06-27 10:34 maybe, those can't be applied cleanly 2009-06-27 10:34 though 2009-06-27 10:34 fine 2009-06-27 10:36 do we care if kernel is broken in the mercurial repo? 2009-06-27 10:36 it is not broken in the online git 2009-06-27 10:36 I think that so far, users who pull the mercurial repo only run fuse 2009-06-27 10:37 and ./tux3 2009-06-27 10:37 well, sometimes they build the module 2009-06-27 10:37 let's see if that still works 2009-06-27 10:37 yes 2009-06-27 10:38 but, it would be need to care more or less, if we want to share the code with kernel 2009-06-27 10:38 recover it may need more time 2009-06-27 10:38 /src/tux3/kernel/filemap.c:494: error: 'AOP_FLAG_NOFS' undeclared (first use in this function) <- it only works on recent kernel 2009-06-27 10:38 yes 2009-06-27 10:39 how do I tell it to build for a kernel that was not booted on my machine? 2009-06-27 10:39 also, delalloc and inode initialization too 2009-06-27 10:40 for old kernel? 2009-06-27 10:40 newer kernel than my workstation 2009-06-27 10:41 which is also my server, so I don't like to upgrade often ;) 2009-06-27 10:41 by the way, I will move to a separate server pretty soon 2009-06-27 10:41 is it have kvm or such? 
2009-06-27 10:41 I don't really want to run the module 2009-06-27 10:41 just build it 2009-06-27 10:42 I can run the module on a different machine, or kvm, or even uml, if I can specify the arch and kernel version for the module build 2009-06-27 10:43 LINUX = /lib/modules/`uname -r`/build/ 2009-06-27 10:43 so, now, you are trying to test tux3 of kernel part? 2009-06-27 10:45 just trying to build the module without compile errors 2009-06-27 10:46 try to use newer kernel source? 2009-06-27 10:47 yes 2009-06-27 10:47 it is pretty clear what to do 2009-06-27 10:47 just point LINUX somewhere else 2009-06-27 10:47 yes 2009-06-27 10:53 then it seems to get the wrong kernel headers 2009-06-27 10:54 /usr/include/linux is wrong I think 2009-06-27 10:55 yes 2009-06-27 10:55 make LINUX=/foo/bar? 2009-06-27 10:56 ah, you are trying to build with my patchset? 2009-06-27 10:56 no, current hg, but my installed kernel is old 2009-06-27 10:56 kernel/balloc.c has #include 2009-06-27 10:57 and that include file conflicts with LINUX=/usr/src/linux-2.6.29.3 2009-06-27 10:57 LANG=C make LINUX=/devel/linux/works/linux-2.6-devron V=1 2009-06-27 10:58 it seems to be working 2009-06-27 10:58 ah, it's only source? 2009-06-27 10:58 /usr/src/linux-2.6.29.3 2009-06-27 10:59 yes 2009-06-27 10:59 .config and auto generated header is needed 2009-06-27 10:59 ah right 2009-06-27 10:59 I need to build with ARCH=um :) 2009-06-27 11:00 "make modules_prepare" may work 2009-06-27 11:01 it worked 2009-06-27 11:01 I just need ARCH=um 2009-06-27 11:01 ok 2009-06-27 11:05 make LINUX=/src/2.6.29.3.uml ARCH=um 2009-06-27 11:06 works 2009-06-27 11:06 yes, looks good to me 2009-06-27 11:07 I've tested with 2.6.30-rc7 or later in git 2009-06-27 11:07 ok, this was to try to answer the question "does it matter if we break the kernel in public hg repo?" 
2009-06-27 11:08 I suspect it doesn't matter because you and I are the only ones running it right now 2009-06-27 11:08 maybe ckwood_ will run it soon :) 2009-06-27 11:09 maybe 2009-06-27 11:09 if we decide we should not break the kernel in hg, then we can make a branch 2009-06-27 11:09 if it is ok to break the kernel, then we just break it in trunk, and fix it a few days later 2009-06-27 11:10 both is ok for me, if it is really recovered within few days 2009-06-27 11:10 that is the goal 2009-06-27 11:11 at worst, we revert user/kernel and open a branch 2009-06-27 11:11 my concern is, if we ignore it for a while, we need to recover it if we want 2009-06-27 11:11 I don't intend to ignore it 2009-06-27 11:11 it would need more time than timely manner 2009-06-27 11:11 just not worry too much about breaking it for a short time 2009-06-27 11:12 by breaking it, I mean it still compiles, but it may not run correctly 2009-06-27 11:12 sounds good 2009-06-27 11:13 ok, then I think we should merge your working patch pretty soon, maybe tomorrow 2009-06-27 11:14 this gives me something concrete to develop with 2009-06-27 11:14 ok 2009-06-27 11:15 I'll try to fix current issue until tomorrow :) 2009-06-27 11:15 so, kernel module doesn't build in your working patch, as you knew 2009-06-27 11:16 yes, I'll not so care about kernel for now 2009-06-27 11:16 well, it would not be hard to reserve current code 2009-06-27 11:16 mark_buffer_dirty_non 2009-06-27 11:17 yes, it's for both of reserve and debugging 2009-06-27 11:17 actually, it found some bugs on atomic-commit 2009-06-27 11:18 on atomic-commit, it woks as assert() 2009-06-27 11:24 the reason I didn't see it in the hg diff is, kernel/atomic-commit.h isn't added to hg 2009-06-27 11:25 yes 2009-06-27 11:26 current issue of it is, it replaces the some patches before it 2009-06-27 11:26 and I'm not sure it was completed or not 2009-06-27 11:28 ? user/kernel/atomic-commit.h 2009-06-27 11:28 ? user/tux3-test.sh 2009-06-27 11:28 ? 
user/utility.h 2009-06-27 11:29 you want to see those in hg, then you will work with hg? 2009-06-27 11:30 I just wanted to know which files were added 2009-06-27 11:32 it can't know from patchset easily 2009-06-27 11:32 well, from the above list, user/utility.h and user/kernel/atomic-commit.h 2009-06-27 11:32 I'm setting up .hgignore 2009-06-27 11:33 tux3-test.sh is test script of me 2009-06-27 11:33 not in patchset 2009-06-27 11:33 I can easily guess which of those three is intended to be commited, and I guess... only utility.h 2009-06-27 11:34 I'm trying to commit atomic-commit.h for a while 2009-06-27 11:34 main intent is for debugging 2009-06-27 11:34 and development of atomic-commit 2009-06-27 11:35 mark_buffer_dirty() doesn't work for bnode 2009-06-27 11:36 it needs to handle on different dirty list 2009-06-27 11:36 cat .hgignore 2009-06-27 11:36 syntax: regexp 2009-06-27 11:36 \~ 2009-06-27 11:36 patchset/ 2009-06-27 11:37 yes 2009-06-27 11:37 maybe add that to your working tree, and I'll have it next time? 2009-06-27 11:38 ah 2009-06-27 11:39 well, patchset doesn't handle merge, so, I should try to merge to hg ASAP 2009-06-27 11:39 ok, well I should stop bother you then 2009-06-27 11:39 and work on replay, plus more ssd benchmarks 2009-06-27 11:40 btw, if you are working on my patchset, it can be imported to hg 2009-06-27 11:40 patch-bomb exports patches by email format 2009-06-27 11:40 and recent scm can import email format patches 2009-06-27 11:40 I was just typing: how does hg know which patches to import 2009-06-27 11:40 sure, that makes sense 2009-06-27 11:41 so, if you want to import, I can create it easily 2009-06-27 11:41 for now 2009-06-27 11:41 without comment of patch though 2009-06-27 11:41 or I can pull hg from you, either way 2009-06-27 11:41 yes 2009-06-27 11:41 you need it? 
2009-06-27 11:41 your current patchset seems to be a good working base 2009-06-27 11:42 so yes, I would pull it to a non-public hg repo 2009-06-27 11:42 and work on replay 2009-06-27 11:42 I also like the patchset way of working both are good 2009-06-27 11:42 in fact, I would prefer to work with your scripts for now 2009-06-27 11:42 and learn your methods, which are clearly good 2009-06-27 11:42 ok 2009-06-27 11:43 if you need hg, please let me know 2009-06-27 11:43 it is easy 2009-06-27 11:43 the only reason for needing hg is so that the commit credit goes properly to you, with --user 2009-06-27 11:44 with email format, hg may create it from "From:" line 2009-06-27 11:44 ah, fine 2009-06-27 11:44 yes, it is 2009-06-27 11:44 I did it 2009-06-27 11:45 ok, it is less work for you to give me a patch tarball I think 2009-06-27 11:45 ok 2009-06-27 11:47 I'll off for a hour or so 2009-06-27 11:48 bye 2009-06-27 11:48 bye, see you 2009-06-27 12:10 back 2009-06-27 12:15 hi 2009-06-27 12:15 hi 2009-06-27 12:16 well, I got the dell inspirion N laptop running Ubuntu and it is a very nice laptop for the price 2009-06-27 12:17 oh 2009-06-27 12:17 $299 base, and with a web cam, dual processor and led backlight screen, it was $365 2009-06-27 12:18 the biggest flaw so far is, wireless does not work out of the box 2009-06-27 12:18 which is surprising 2009-06-27 12:18 but soon it will work 2009-06-27 12:19 oh, it sounds cheap for laptop 2009-06-27 12:19 well, I don't know well about laptop though 2009-06-27 12:19 inspiron N? 2009-06-27 12:20 ah, wireless is connected 2009-06-27 12:20 ok 2009-06-27 12:20 cool 2009-06-27 12:21 I will call myself a satisfied customer 2009-06-27 12:22 hmm, the wireless disconnected and reconnected 2009-06-27 12:22 I wonder what that is about 2009-06-27 12:22 roaming? 2009-06-27 12:22 it's connected to my wireless router 2009-06-27 12:22 which is 1 meter away 2009-06-27 12:23 um... 
2009-06-27 12:24 my wireless knowledge was stopped at several years ago 2009-06-27 12:27 well I am just happy it connected 2009-06-27 12:27 I don't want to spend time analyzing wifi driver problems 2009-06-27 12:27 and I need to give this laptop to my wife, so I can get back the fit pc she is using, and use it for my server 2009-06-27 12:28 then I will be able to upgrade my workstation without shutting down the server :) 2009-06-27 12:28 good 2009-06-27 12:28 this is IT infrastructure plan for the phillips family 2009-06-27 12:29 iirc, iwconfig should say something if it's electronic wave problem 2009-06-27 12:29 this driver has issues with iwconfig 2009-06-27 12:29 oh 2009-06-27 12:30 ah, no issues now 2009-06-27 12:30 so something broke the first time I ran it 2009-06-27 12:30 on initial install 2009-06-27 12:30 perhaps a module loaded in the wrong order 2009-06-27 12:30 IW version mismatch or something 2009-06-27 12:30 ah 2009-06-27 12:30 something ugly about modules and/or driver loading 2009-06-27 12:31 that is all complex and shakey in linux, at the moment 2009-06-27 12:31 combination of efforts of rusty and greg ;) 2009-06-27 12:31 well they did lots of good things too 2009-06-27 12:32 yes 2009-06-27 12:32 also, we are only starting to get open source drivers for vendors like broadcom 2009-06-27 12:33 anyway, this laptop cost $365, and it is hard to believe it is such a nice one 2009-06-27 12:33 that is the same as an eee pc 2009-06-27 12:33 which is not linux friendly any more 2009-06-27 12:34 wireless? 2009-06-27 12:34 which, the eee or the dell? 2009-06-27 12:34 not linux friendly 2009-06-27 12:34 both are 2009-06-27 12:34 oh, asus is not linux friendly 2009-06-27 12:35 what was problem? 
2009-06-27 12:35 does not ship with linux preinstalled any more 2009-06-27 12:35 ah 2009-06-27 12:35 they were paid by microsoft to stop shipping linux, I believe 2009-06-27 12:35 or threatened 2009-06-27 12:36 maybe, linux test cost are too big or desktop or laptop pc 2009-06-27 12:36 to be fair, unfortunately, it is small 2009-06-27 12:36 the dell laptop has 4 times more memory and 7 times as much cpu power as the eee pc, and a 50% bigger screen, for the same price 2009-06-27 12:36 well yes 2009-06-27 12:37 I put my eee 900 in my camera bag 2009-06-27 12:37 it goes in the flap of the camera bag 2009-06-27 12:37 the 900 and 901 are still available with linux I think 2009-06-27 12:37 but it is hard to get the 10 inch version with linux 2009-06-27 12:37 hard or impossible 2009-06-27 12:38 i see 2009-06-27 12:39 amazon has 2 left, and does not take the order directly: http://www.amazon.com/10-Inch-Netbook-Processor-Storage-Battery/dp/B001BYD16E 2009-06-27 12:41 oh, linux 2009-06-27 12:43 http://www.newegg.com/Product/Product.aspx?Item=N82E16834220368&Tpk=eee 10" linux 2009-06-27 12:43 but $429 sounds like not cheap 2009-06-27 12:44 right, I believe that is because of the microsoft deal 2009-06-27 12:44 it should be less than $300 by now 2009-06-27 12:44 but microsoft did not like that 2009-06-27 12:45 well, it is good to know that it costs microsoft a lot of money to keep asus from selling linux pcs 2009-06-27 12:45 I find it more amusing than irritating 2009-06-27 12:45 and there are better linux machines coming all the time, like the dell 2009-06-27 12:46 and the arm netbooks that will be available soon 2009-06-27 12:46 anyway, back to atomic commit 2009-06-27 12:47 well, linux needs to get more customers 2009-06-27 12:47 more desktop customer, yes 2009-06-27 12:47 in desktop area 2009-06-27 12:47 yes 2009-06-27 12:47 it is not only technical problem I think, not easy 2009-06-27 12:48 well 2009-06-27 12:48 it doesn't hurt me at all if linux desktop share grows 
slowly 2009-06-27 12:48 as long as it has more than 10 million desktop users, that is plenty 2009-06-27 12:49 I think it is more like 60 million now 2009-06-27 12:49 well 2009-06-27 12:49 hmm 2009-06-27 12:49 maybe not 2009-06-27 12:49 it is about 10 million probably 2009-06-27 12:49 that is a lot 2009-06-27 12:50 benefit(?), um... 2009-06-27 12:51 maybe, profit 2009-06-27 12:52 well, profit(?) is 1%, 10 million may still be small 2009-06-27 12:52 if profit(?) 2009-06-27 12:53 I guess it is enough for Dell 2009-06-27 12:53 and asus, if Microsoft did not pay them to stop 2009-06-27 12:53 Dell doesn't listen to microsoft any more, it seems 2009-06-27 12:53 they listened to microsoft for too long, and their sales decreased 2009-06-27 12:54 I hope so 2009-06-27 12:54 wel, I have a nice laptop from them anyway :) 2009-06-27 12:54 amazingly cheap 2009-06-27 12:55 I have home server from them :) 2009-06-27 12:55 I don't know how they can do it 2009-06-27 12:56 so it seems like, my wifi problems were only on the first boot, it seemed to be actually installing ubuntu then 2009-06-27 12:56 and I guess that does not work perfectly 2009-06-27 12:56 so, one reboot and it works 2009-06-27 12:57 well, compare that to windows, it will be maybe 10 reboots 2009-06-27 12:57 probably 2009-06-27 12:58 then, thanks to arjan's work, linux laptops should start in a few seconds soon 2009-06-27 12:58 arjan's work? 2009-06-27 12:58 arjan van de ven, formerly of red hat, now at intel 2009-06-27 12:58 powersave stuff? 
2009-06-27 12:58 fast boot stuff 2009-06-27 12:58 ah 2009-06-27 12:59 http://lwn.net/Articles/299483/ LPC: Booting Linux in five seconds 2009-06-27 12:59 yes 2009-06-27 13:00 more quick work, we need to work on suspend stuff 2009-06-27 13:00 quck boot 2009-06-27 13:00 yes 2009-06-27 13:01 I'll whine about that to arjan ;) 2009-06-27 13:02 :) 2009-06-27 13:16 here is the test of fire: the dell plays myspace music on the first try 2009-06-27 13:16 so my wife is happy 2009-06-27 13:18 it is doing an apt-get dist-upgrade over the wireless and streaming from myspace at the same time, without any pauses in the music 2009-06-27 13:18 sounds good 2009-06-27 13:19 now, we must ahieve a similar level of quality with the filesystem ;) 2009-06-27 13:20 yes, really :) 2009-06-27 13:21 for now, change_begin/end can be latency problem 2009-06-27 13:21 maybe 2009-06-27 13:21 right, but it will be mergable 2009-06-27 13:22 if replay works 2009-06-27 13:22 yes 2009-06-27 13:22 we just say, sorry it isn't perfect, but it is usable 2009-06-27 13:22 it is about kernel for future 2009-06-27 13:22 and I guess it will be by far the smallest code base of any atomic commit filesystem 2009-06-27 13:26 probably 2009-06-27 13:32 I'm slowly recalling my series is what is doing 2009-06-27 13:32 first try was bitmap is working 2009-06-27 13:33 what about ileaf writout? 2009-06-27 13:34 maybe, I was assuming it will be done with flush(map->volmap->dirty) 2009-06-27 13:35 ok, well I will check it 2009-06-27 13:35 and middle of series try to work for bnode stuff 2009-06-27 13:35 right 2009-06-27 13:35 and I thought it needs separate mark_buffer_dirty() 2009-06-27 13:36 bnode and leaf 2009-06-27 13:36 I think so 2009-06-27 13:36 yes 2009-06-27 13:36 exactly 2009-06-27 13:36 they go on different dirty lists 2009-06-27 13:37 it is trying to use volmap->dirty for leaf buffer 2009-06-27 13:37 and pinned is for bnode 2009-06-27 13:37 but, sb->commit is still remain 2009-06-27 13:37 um... 
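[Editor's sketch] The exchange above says that bnode and leaf buffers go on different dirty lists, with volmap->dirty used for leaf buffers and a "pinned" list for bnodes. A minimal C illustration of that routing follows. Every type and function name in it (buffer, dirty_list, sb_lists_sketch, mark_dirty_sketch) is hypothetical and simplified for illustration; it is not the tux3 code and does not claim to match the patch series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types; these are not the tux3 definitions. */
enum buf_kind { BUF_LEAF, BUF_BNODE };

struct buffer {
	enum buf_kind kind;
	int dirty;		/* nonzero once the buffer sits on a dirty list */
	struct buffer *next;	/* link on whichever dirty list holds it */
};

struct dirty_list {
	struct buffer *head;
	size_t count;
};

struct sb_lists_sketch {
	struct dirty_list volmap_dirty;	/* leaf buffers */
	struct dirty_list pinned;	/* bnode buffers */
};

/* Push a buffer onto the front of a dirty list. */
static void list_push(struct dirty_list *list, struct buffer *buf)
{
	buf->next = list->head;
	list->head = buf;
	list->count++;
}

/* Route a newly dirtied buffer to the list that matches its kind. */
static void mark_dirty_sketch(struct sb_lists_sketch *sb, struct buffer *buf)
{
	assert(sb != NULL && buf != NULL);
	if (buf->dirty)
		return;		/* already queued, nothing to do */
	buf->dirty = 1;
	list_push(buf->kind == BUF_BNODE ? &sb->pinned : &sb->volmap_dirty, buf);
}
```

Keeping the two lists apart means a flush can walk one kind of buffer without touching the other, which seems to be the point of asking for a separate mark_buffer_dirty() per kind.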
2009-06-27 13:39 volmap->dirty (leaf buffers) makes sense 2009-06-27 13:42 ah, maybe, we need to change buffer_dirty() 2009-06-27 13:42 because? 2009-06-27 13:42 it may need to take delta/flush counter parameter 2009-06-27 13:43 yes 2009-06-27 13:43 ACTION thinks 2009-06-27 13:45 well, honestly, we should think "untested or need to think deeply" bottom of half patches in series 2009-06-27 13:45 at least 2009-06-27 13:46 true 2009-06-27 13:46 there is no rush to merge 2009-06-27 13:46 tomorrow does not need to be a merge 2009-06-27 13:47 ah, bottom of half means "direent-delete-fix.patch + dddd.patch" / 2 2009-06-27 13:47 new log tags stuff, and dirty list management 2009-06-27 13:48 right 2009-06-27 13:50 grep mark_buffer_dirty kernel/*.c -I | wc 2009-06-27 13:50 16 2009-06-27 13:51 in 4 different files 2009-06-27 13:51 should not take too long to think deeply about that ;) 2009-06-27 13:53 probably 2009-06-27 14:00 um..., first one..., we need to think blockfork() should be in which file 2009-06-27 14:00 maybe, it would be unsharable with kernel 2009-06-27 14:01 other similar functions are in kernel/filemap.c 2009-06-27 14:02 for example, blockget 2009-06-27 14:03 sounds good 2009-06-27 14:03 I'll see what happen 2009-06-27 14:12 it seems to work, the issue is blockdirty will enable for many tests if define ATOMIC 2009-06-27 14:46 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-27 14:56 well that is good, isn't it? 2009-06-27 14:56 not sure 2009-06-27 14:57 those are assuming it is not atomic-commit 2009-06-27 14:57 well, it would not be big problem 2009-06-27 15:02 ok, main dirty hack is, series is not handling sync_* at all 2009-06-27 15:03 for example, sync_inode? 
2009-06-27 15:03 I added code to sync_* just as callback before exit 2009-06-27 15:03 and sync_super 2009-06-27 15:05 and sync_super is just userspace 2009-06-27 15:06 well, I'm ignoring kernel on this series though 2009-06-27 15:06 right 2009-06-27 15:07 probably, sync_super may be mapped to ->sync_fs, not sure 2009-06-27 15:07 well, sync_inode looks believable 2009-06-27 15:07 yes 2009-06-27 15:08 sync_inode may be for fsync() 2009-06-27 15:08 we will ignore for now? 2009-06-27 15:08 yes 2009-06-27 15:08 ok 2009-06-27 15:09 sync_super calls sync_inodes though, it looks right 2009-06-27 15:10 why do you say it is not handling it? 2009-06-27 15:10 so, if concentrate merge, I'll just remove sync_super() hack entrirely 2009-06-27 15:10 yes 2009-06-27 15:10 ah, #if 1 2009-06-27 15:10 the concept of sync_* is data sync/fsync 2009-06-27 15:11 well 2009-06-27 15:11 you can leave it, it is not confusing 2009-06-27 15:11 int sync_super(struct sb *sb) 2009-06-27 15:11 { 2009-06-27 15:11 #if 1 2009-06-27 15:11 change_begin(sb); 2009-06-27 15:11 change_end(sb); 2009-06-27 15:11 #else 2009-06-27 15:11 this is easy to understand 2009-06-27 15:11 yes, well, at least, it should be ifdef ATOMIC 2009-06-27 15:12 good 2009-06-27 15:12 we will need writeback stuff to compare result to atomic comiit 2009-06-27 15:12 for a while 2009-06-27 15:14 begin/end is assuming it forces to do commit 2009-06-27 15:14 ok, commit_delta should be factored out of change_end, and sync_super should just be commit_delta 2009-06-27 15:14 now, only it is true if another hack was applied in this series 2009-06-27 15:14 this would not be a hack 2009-06-27 15:15 hack is in need_delta/need_flush 2009-06-27 15:15 whoops, commit_delta is already used 2009-06-27 15:16 so, the name should be... 2009-06-27 15:16 right 2009-06-27 15:16 so let's factor this out and it will not be a hack 2009-06-27 15:16 force_delta? 
2009-06-27 15:16 or 2009-06-27 15:16 just delta() 2009-06-27 15:17 well, and I thought it would be later 2009-06-27 15:17 later == after mergable of this series 2009-06-27 15:17 if you like :) 2009-06-27 15:17 yes, I like :) 2009-06-27 15:18 well, anyway, it would be soon though 2009-06-27 15:18 force_commit? 2009-06-27 15:18 so, need_delta -> 1, fine 2009-06-27 15:19 force_ sounds a little, eh, forceful 2009-06-27 15:19 sync_commit? 2009-06-27 15:20 sync_delta? 2009-06-27 15:20 well, force_delta is correct 2009-06-27 15:20 now that I think of it 2009-06-27 15:20 it means, cause a new delta and wait for it to commit 2009-06-27 15:20 ok 2009-06-27 15:20 so that is good 2009-06-27 15:20 and other filesystems use similar terminology 2009-06-27 15:20 it is good 2009-06-27 15:21 understandable for others too 2009-06-27 15:21 yes, good 2009-06-27 15:22 and it is documentation to, because when you see it in change_end, you know that we have more optimization work to do there 2009-06-27 15:22 ok, so for now we are going to commit a delta on every change, according to your patch 2009-06-27 15:22 that seems like a good place to start 2009-06-27 15:23 probably 2009-06-27 15:23 essentially, running as a sync mount 2009-06-27 15:23 well, I'll just leave those entriely after this series 2009-06-27 15:23 so, this series can concentrate to another things 2009-06-27 15:35 planning to sleep at all today? 2009-06-27 15:36 probably, sleep soon for few hours 2009-06-27 16:10 ah 2009-06-27 16:10 flips, still there? 2009-06-27 16:26 still here 2009-06-27 16:27 I've added kernel code to userspace 2009-06-27 16:27 it would be issue of lisence 2009-06-27 16:27 gpl v3? 2009-06-27 16:27 like it 2009-06-27 16:29 so... it needs a "permitting addition" 2009-06-27 16:30 btw, there is any preference about license? 
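[Editor's sketch] Going back to the sync_super discussion above: the plan is to factor the commit out of change_end under the name force_delta ("cause a new delta and wait for it to commit"), have sync_super call it, and for now commit a delta on every change, "essentially running as a sync mount". A minimal C illustration of that shape follows. The function names come from the conversation, but the bodies, the struct sb_delta_sketch type and its counter fields are assumptions made up for this sketch, not the tux3 implementation.

```c
#include <assert.h>

/* Hypothetical stand-in for the tux3 superblock; it holds only counters. */
struct sb_delta_sketch {
	unsigned delta;		/* delta currently accepting changes */
	unsigned committed;	/* number of deltas committed so far */
	int in_change;		/* nonzero between change_begin and change_end */
};

/*
 * Start a new delta and wait for the previous one to commit.
 * In this sketch the commit is synchronous, so "waiting" is implicit.
 */
static void force_delta(struct sb_delta_sketch *sb)
{
	sb->delta++;
	sb->committed++;
}

static void change_begin(struct sb_delta_sketch *sb)
{
	assert(!sb->in_change);
	sb->in_change = 1;
}

static void change_end(struct sb_delta_sketch *sb)
{
	assert(sb->in_change);
	sb->in_change = 0;
	/* Commit on every change for now; seeing force_delta here marks
	 * the spot where later optimization work would go. */
	force_delta(sb);
}

/* With the commit factored out, sync_super reduces to a forced delta. */
static int sync_super(struct sb_delta_sketch *sb)
{
	force_delta(sb);
	return 0;
}
```

In this shape sync_super no longer needs the empty change_begin/change_end pair that the "#if 1" hack used to trigger a commit.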
2009-06-27 16:31 not a strong preference 2009-06-27 16:31 i see 2009-06-27 16:31 I don't care much whether it is v2 or v3 2009-06-27 16:32 but, someone would care 2009-06-27 16:32 I chose v3 because tridge chose ;) 2009-06-27 16:32 oh 2009-06-27 16:32 samba uses v3 2009-06-27 16:32 yes, tridge seems to have strong preference 2009-06-27 16:33 I can think about it more carefully now 2009-06-27 16:33 well, maybe, can change would be problem 2009-06-27 16:33 by the way, I did apt-get dist-upgrade on the dell and the sound stopped working 2009-06-27 16:34 so I guess Linux has still got pretty serious issues 2009-06-27 16:34 dist-upgrade to which version? 2009-06-27 16:35 it's ubuntu intrepid 2009-06-27 16:35 um..., do you know, if it's debian, interepid is which? 2009-06-27 16:35 ubuntu 2009-06-27 16:35 well, recently, on debian, I also had some problem of sound 2009-06-27 16:36 repository is ubuntu.com 2009-06-27 16:36 there seem to be problems with pulseaudio 2009-06-27 16:36 ah, interepid is mapped to debian version 2009-06-27 16:36 ah, so, it may be sequeze? 2009-06-27 16:37 squeeze 2009-06-27 16:37 in debian 2009-06-27 16:37 squeeze? 2009-06-27 16:37 next stable version 2009-06-27 16:37 ubuntu branches from sid 2009-06-27 16:38 unstable 2009-06-27 16:38 testing and stable debian is too old for ubuntu users 2009-06-27 16:38 and it stop to change version at some point? 
2009-06-27 16:38 so ubuntu guys try to stabilize sid 2009-06-27 16:38 and there are not as many ubuntu devs as debian developers 2009-06-27 16:39 the result is that ubunta is only somewhat stable 2009-06-27 16:40 well, so, which version is pulseaudio 2009-06-27 16:40 good question 2009-06-27 16:40 I'm just reading about it now 2009-06-27 16:42 .9.10 2009-06-27 16:42 and there is gstreamer 2009-06-27 16:42 and puseaudio-module-hal 2009-06-27 16:42 so many layers 2009-06-27 16:42 layers of bugs 2009-06-27 16:43 alsa by itself has many layers 2009-06-27 16:43 and these things are layers on top of it 2009-06-27 16:43 yes, sounds layer seems to have cominication problem 2009-06-27 16:44 well, gstreamer version is? 2009-06-27 16:44 removing pluseaudio also removes ubuntu-desktop 2009-06-27 16:44 that is scary 2009-06-27 16:44 gnome choose the pulseaudio 2009-06-27 16:45 gstreamer 0.10 2009-06-27 16:45 this seems to be conntected to pulseaudio somehow 2009-06-27 16:45 well I will remove it, then see if alsa will work 2009-06-27 16:47 gst-launch filesrc location="$1" ! ffdemux_mp3 ! ffdec_mp3 ! pulsesink 2009-06-27 16:47 or something work? 2009-06-27 16:47 there is a complaint from kde about the intel hd audio driver not working 2009-06-27 16:48 oh, kde 2009-06-27 16:48 I just shut down gnome and started kde 2009-06-27 16:49 only kde actually notifies me that anything is wrong 2009-06-27 16:49 gnome just silently fails to work 2009-06-27 16:49 oh 2009-06-27 16:50 well, I don't know about kde at all 2009-06-27 16:50 btw, now, I'm using, pulse - 0.9.15, gst - 0.10.14 2009-06-27 16:51 ubuntu decided to ship kde4 before it was ready, which was stupid 2009-06-27 16:51 kde 3.5 is fine 2009-06-27 16:51 which is the default in debian 2009-06-27 16:51 i see 2009-06-27 16:51 folks 2009-06-27 16:51 kde is using pulse or alsa? 
2009-06-27 16:52 well, maybe config though 2009-06-27 16:55 so, removing pulseaudio broke gnome 2009-06-27 16:55 well 2009-06-27 16:55 this is not your problem :) 2009-06-27 16:56 :) 2009-06-27 16:56 gnome can select with config, iirc 2009-06-27 16:56 kde starts 2009-06-27 16:57 audio device is stac9xxx, which kde says doesn't work 2009-06-27 16:57 now, working on kde? 2009-06-27 16:58 sound is working on kde? 2009-06-27 16:58 not yet 2009-06-27 16:58 I guess it probably will be 2009-06-27 16:58 it worked this morning :) 2009-06-27 16:58 oh :) 2009-06-27 16:59 the intel hd sound module loaded without complaining 2009-06-27 16:59 alsa is working? 2009-06-27 16:59 it's hard to tell with alsa 2009-06-27 16:59 gst-launch filesrc location="$1" ! ffdemux_mp3 ! ffdec_mp3 ! alsasink 2009-06-27 17:00 this would try with alsa 2009-06-27 17:01 gst-launch filesrc location="$1" ! decodebin ! alsasink 2009-06-27 17:01 this would be a bit easy to do 2009-06-27 17:06 aplay plays a sound without complaining 2009-06-27 17:06 but I think that is because the alsa device is set to some null device 2009-06-27 17:07 null device? 2009-06-27 17:07 plays, but, no sound? 2009-06-27 17:08 yes 2009-06-27 17:08 already check the mute? 2009-06-27 17:08 amixer or something? 2009-06-27 17:09 alsamixer 2009-06-27 17:09 sounds good 2009-06-27 17:09 not muted 2009-06-27 17:10 um... 2009-06-27 17:10 I think I will remove gstreamer next 2009-06-27 17:11 I guess aplay is not using gstreamer 2009-06-27 17:12 btw, there is /etc/asound.conf or ~/.asoundrc? 2009-06-27 17:13 neither 2009-06-27 17:13 i see 2009-06-27 17:14 it seems like driver's problem, um... 2009-06-27 17:14 hard to say 2009-06-27 17:15 kde and gnome both use the same "system settings" dialog it seems, and it isn't very good 2009-06-27 17:15 especially for sound 2009-06-27 17:16 http://blogs.gnome.org/uraeus/files/2007/09/gnome-sound-properties.png 2009-06-27 17:16 this? 
2009-06-27 17:17 I haven't see that yet 2009-06-27 17:17 that is what I would like to see 2009-06-27 17:18 well it seems gstream can't be removed 2009-06-27 17:18 everything depends on it 2009-06-27 17:18 now how to turn it off 2009-06-27 17:19 esound was installed? 2009-06-27 17:19 http://ubuntuforums.org/showthread.php?p=6068479 2009-06-27 17:19 I should try that 2009-06-27 17:19 it is weird that something like openoffice should depend on gstreamer 2009-06-27 17:20 oh 2009-06-27 17:25 ah, and libasound-plugins 2009-06-27 17:25 it is including the pulseaudio plugin 2009-06-27 17:27 sound is back 2009-06-27 17:27 now... why? 2009-06-27 17:27 :) 2009-06-27 17:27 rebooted? 2009-06-27 17:27 I removed pulseaudio, that's all 2009-06-27 17:27 and rebooted 2009-06-27 17:27 ok, pulseaudio is daemon 2009-06-27 17:28 so, reboot killed it actually 2009-06-27 17:28 maybe 2009-06-27 17:28 well now I am happy 2009-06-27 17:28 gnome is gone, pulseaudio is gone, and things work 2009-06-27 17:29 my wife will be far happier with kde than gnome 2009-06-27 17:29 btw, libasound-plugins is installed? 
2009-06-27 17:29 let me see 2009-06-27 17:30 yes 2009-06-27 17:30 i see 2009-06-27 17:30 I guessed it may be cause of pulseaudio problem 2009-06-27 17:30 but, it seems not true 2009-06-27 17:30 well, ok now 2009-06-27 17:36 btw, I separated the patches to non-dirty and dirty 2009-06-27 17:37 and dirty stuff are on last of series 2009-06-27 17:37 I guess closing to mergable 2009-06-27 17:37 I need to add comment to patches 2009-06-27 17:40 sounds good 2009-06-27 17:41 dirty stuff is - bitmap-atomic-commit.patch, split-test.patch, commit-flush-commit-via-volmap.patch, dddd.patch 2009-06-27 17:42 result code is not changed so much 2009-06-27 17:43 just merge some small patches, and move some pieace from dirty to other patch 2009-06-27 17:43 well 2009-06-27 17:46 small patches 2009-06-27 17:47 yes 2009-06-27 17:48 those are quick hack to work atomic-commit for some path 2009-06-27 17:48 really temporary hack 2009-06-27 17:49 maybe, it will break all other path, untested at all 2009-06-27 17:49 we will need to revisit it 2009-06-27 17:54 no problem 2009-06-27 18:15 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-06-27 18:16 -!- tim_dimm_(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-06-27 19:33 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-06-27 21:42 sound works in kde4 now 2009-06-27 21:43 solution: install cine-backend for phonon, the gstreamer backend doesn't work 2009-06-27 21:45 sounds good 2009-06-27 21:45 ACTION still not sleeping, whoops 2009-06-27 21:45 cine-backend? 
2009-06-27 21:50 oh, sound library abstruction 2009-06-27 21:56 whoops ;) 2009-06-27 21:56 :) 2009-06-27 22:09 -!- dagle(~dagle@host162-104.bornet.net) has joined #tux3 2009-06-27 23:33 -!- RazvanM(~RazvanM@pool-173-67-57-143.bltmmd.east.verizon.net) has joined #tux3 2009-06-28 04:08 -!- dagle1(~dagle@host162-104.bornet.net) has joined #tux3 2009-06-28 05:50 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-06-28 06:22 -!- edt(~Ed@dsl-60-1.aei.ca) has joined #tux3 2009-06-28 06:25 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-06-28 06:35 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-06-28 10:08 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-06-28 10:24 -!- npmccallum(~npmccallu@h11.26.23.98.dynamic.ip.windstream.net) has joined #tux3 2009-06-28 12:24 -!- brandel(~annalea@srv-pvm.vinaren.vn) has joined #tux3 2009-06-28 12:24 -!- brandel(~annalea@srv-pvm.vinaren.vn) has left #tux3 2009-06-28 12:29 -!- allman(~chiverto@1898729133.ssa.megazon.com.br) has joined #tux3 2009-06-28 12:29 -!- allman(~chiverto@1898729133.ssa.megazon.com.br) has left #tux3 2009-06-28 12:29 -!- yanan(~celina@60.173.11.34) has joined #tux3 2009-06-28 12:29 -!- yanan(~celina@60.173.11.34) has left #tux3 2009-06-28 12:30 -!- whetston(~juscesak@20158195164.user.veloxzone.com.br) has joined #tux3 2009-06-28 12:30 -!- whetston(~juscesak@20158195164.user.veloxzone.com.br) has left #tux3 2009-06-28 12:30 -!- allman(~chiverto@1898729133.ssa.megazon.com.br) has joined #tux3 2009-06-28 12:30 -!- allman(~chiverto@1898729133.ssa.megazon.com.br) has left #tux3 2009-06-28 12:38 -!- cyrus(~brod@p50999cd2.dip0.t-ipconnect.de) has joined #tux3 2009-06-28 12:38 -!- 
cyrus(~brod@p50999cd2.dip0.t-ipconnect.de) has left #tux3 2009-06-28 13:27 -!- tiong-ho(~whitford@c-76-108-101-11.hsd1.fl.comcast.net) has joined #tux3 2009-06-28 13:27 -!- tiong-ho(~whitford@c-76-108-101-11.hsd1.fl.comcast.net) has left #tux3 2009-06-28 13:31 -!- Ganneff(~joerg@ganneff.noc.oftc.net) has joined #tux3 2009-06-28 14:16 -!- npmccallum_(~npmccallu@h55.31.23.98.dynamic.ip.windstream.net) has joined #tux3 2009-06-28 14:52 -!- reinaldo(~alioto@125.163.120.182) has joined #tux3 2009-06-28 14:52 -!- reinaldo(~alioto@125.163.120.182) has left #tux3 2009-06-28 14:58 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-28 16:09 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-06-28 17:45 flips, there? 2009-06-28 17:48 yes 2009-06-28 17:48 the patchset is almost mergeable 2009-06-28 17:49 it would break something though 2009-06-28 17:49 ok, just explain what it breaks, and let's merge :) 2009-06-28 17:49 oh 2009-06-28 17:49 it's a bit hard to do 2009-06-28 17:50 that's ok 2009-06-28 17:50 because fundamental problem is untested 2009-06-28 17:50 just a general idea 2009-06-28 17:50 sure 2009-06-28 17:50 that's the point 2009-06-28 17:50 to test it and make it work 2009-06-28 17:50 well 2009-06-28 17:50 the patchset would start logging everywhere 2009-06-28 17:51 on creation path 2009-06-28 17:51 it is first try of it 2009-06-28 17:51 good, I know what you're breaking, and I think it's fine 2009-06-28 17:52 so, ->logmap may say something 2009-06-28 17:52 known one is tux3.c 2009-06-28 17:52 it is not creating sb->logmap 2009-06-28 17:53 ah 2009-06-28 17:53 that seems easy 2009-06-28 17:53 and writeback stuff is still flusher 2009-06-28 17:54 so, atomic-commit will need to replace those somehow 2009-06-28 17:54 good, that's enough of a description 2009-06-28 17:55 main problem may be those 2009-06-28 17:55 but, probably, I'm forgetting something 2009-06-28 
17:56 that's a good enough description 2009-06-28 17:56 so: a) merge b) change it to drive the flush from force_delta 2009-06-28 17:56 -!- npmccallum(~npmccallu@h55.31.23.98.dynamic.ip.windstream.net) has joined #tux3 2009-06-28 17:58 probably 2009-06-28 17:58 http://userweb.kernel.org/~hirofumi/tux3/ 2009-06-28 17:59 the change from previous patchset is 2009-06-28 17:59 c) implement truncate d) fix kernel e) install on my laptop 2009-06-28 18:00 I introduced untested force_delta() 2009-06-28 18:00 and removed the dirty stuff 2009-06-28 18:00 sounds good 2009-06-28 18:01 yes 2009-06-28 18:01 well, before that, it has still many problems 2009-06-28 18:01 on userland 2009-06-28 18:02 flush_log() is not writing the bitmap inode 2009-06-28 18:02 I was forgetting it 2009-06-28 18:02 stage_delta is not flushing the inodes 2009-06-28 18:02 main problems of commit.c may be those 2009-06-28 18:02 ok, I left out the "fix bugs" step between b and c 2009-06-28 18:03 yes 2009-06-28 18:04 btw, c) is including orphan stuff? 2009-06-28 18:04 yes 2009-06-28 18:04 ok 2009-06-28 18:05 should be "truncate and delete including orphan handling" 2009-06-28 18:05 yes 2009-06-28 18:06 ah 2009-06-28 18:06 btw, we don't have delete_id like hammer, we don't need it? 2009-06-28 18:06 well 2009-06-28 18:06 correct, we don't need it 2009-06-28 18:06 I thought truncate and extend it with hole may need it 2009-06-28 18:07 let me think about that 2009-06-28 18:08 that would be, within the same snapshot 2009-06-28 18:09 because if it happens in a later snapshot, the new data will have a different version 2009-06-28 18:10 I'm not reading version pointer docs yet though 2009-06-28 18:10 well, how about this: on truncate of a big file (that has only one version) we just detach the data btree and record it in the log 2009-06-28 18:10 if parent version has data, but, on new version, the some renge has hole 2009-06-28 18:10 it can? 
2009-06-28 18:11 ah, you are right 2009-06-28 18:11 you read the versioned pointer docs closely enough to catch that 2009-06-28 18:12 anyway, this is about an optimization 2009-06-28 18:12 well, I read emails related to that 2009-06-28 18:12 the easy thing to do is just complete the truncated synchronously 2009-06-28 18:12 -!- npmccallum(~npmccallu@h55.31.23.98.dynamic.ip.windstream.net) has joined #tux3 2009-06-28 18:13 so, I'm still not understanding basic one 2009-06-28 18:14 synchronously? 2009-06-28 18:15 yes, just walk the btree and free all the extents and index nodes 2009-06-28 18:17 if parent version is there, what do we do? 2009-06-28 18:18 we have a delete_id :) 2009-06-28 18:18 let me think about that 2009-06-28 18:19 :) 2009-06-28 18:19 you are assuming, a data btree with multiple versions in it 2009-06-28 18:19 ah, I was forgetting to say, the patchset may change on disk format 2009-06-28 18:20 ah 2009-06-28 18:20 but, the patchset is not udpating the rev for now 2009-06-28 18:20 maybe, slightly changed 2009-06-28 18:20 ok, changing the format is fine, I was going to change it anyway 2009-06-28 18:21 ok 2009-06-28 18:21 Here are some new dleaf fields to be added: 2009-06-28 18:21 For reconstructions: 2009-06-28 18:21 * fs id (32 bit hash generated at create time) 2009-06-28 18:21 * inode number (48 bits) 2009-06-28 18:21 * commit number (32 bits) 2009-06-28 18:21 For versioning: 2009-06-28 18:21 * version mapping table id (32 bits) 2009-06-28 18:22 No checksum in there yet, but it might happen 2009-06-28 18:22 yes 2009-06-28 18:23 inode number? 2009-06-28 18:23 so know which inode this dleaf belongs to 2009-06-28 18:23 in case the index is damaged 2009-06-28 18:23 ah 2009-06-28 18:24 the commit number lets us know which is the most recent version of an extent mapped at the same logical offset 2009-06-28 18:25 the commit number can wrap, but maybe not for 30 years or so 2009-06-28 18:25 um... 
2009-06-28 18:25 which is enough time to figure out how to do an online cleanup of really old commit numbers 2009-06-28 18:25 I think no problem to add 2009-06-28 18:26 however, honestly, I _may_ not be fan of scan correct block 2009-06-28 18:26 not sure 2009-06-28 18:27 but, I don't have more good solution, and it's good 2009-06-28 18:27 it's only supposed to increase the chance of doing a correct reconstruction, not guarantee it 2009-06-28 18:28 yes 2009-06-28 18:28 maybe, another issue is storage can be too large to scan 2009-06-28 18:30 but, there is no another option at least for now, so, it's ok 2009-06-28 18:31 yes, if they are useless fields we have plenty of time to remove them 2009-06-28 18:31 my thinking is mainly about doing the reconstruction without unmounting 2009-06-28 18:31 then, it should be done one block group at a time 2009-06-28 18:31 oh 2009-06-28 18:32 scan is done by fs driver? 2009-06-28 18:32 yes 2009-06-28 18:32 i see 2009-06-28 18:32 with great care :) 2009-06-28 18:32 that is why it is important to keep the large scale structure of the filesystem simple 2009-06-28 18:33 and that is my justification for making the dleaf and ileaf formats just a little more complicated, to make the large scale structure simple 2009-06-28 18:33 yes, simple itself affects many good things 2009-06-28 18:33 so we can realistically attack problems like online reconstruction, a group at a time 2009-06-28 18:34 ok, I will be out for an hour 2009-06-28 18:34 ok 2009-06-28 18:34 and I will think about the holes issue with truncate 2009-06-28 18:34 ok, thanks 2009-06-28 18:34 ah, version stuff 2009-06-28 18:40 ah, another one is license of utility.* 2009-06-28 18:40 I've copied kernel code to those 2009-06-28 20:15 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-06-28 20:25 hirofumi, there? 
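A rough sketch of the proposed dleaf identification fields discussed above (fs id 32 bits, inode number 48 bits, commit number 32 bits, version mapping table id 32 bits). The field widths come from the chat; the field order, big-endian packing, function names, and the idea of packing them contiguously are assumptions made only for illustration, not the tux3 on-disk format. The last helper checks the "maybe not for 30 years or so" remark by computing how many commits per second it would take to wrap a 32-bit counter in a given number of years.

```python
import struct

# Field widths are from the chat; order and endianness are assumptions.
INUM_BITS = 48
COMMIT_BITS = 32


def pack_dleaf_ids(fs_id, inum, commit, vermap_id):
    """Pack the four proposed fields into 4 + 6 + 4 + 4 = 18 bytes."""
    if not 0 <= inum < (1 << INUM_BITS):
        raise ValueError("inode number does not fit in 48 bits")
    commit &= (1 << COMMIT_BITS) - 1  # the commit number is allowed to wrap
    return (struct.pack(">I", fs_id)
            + inum.to_bytes(6, "big")
            + struct.pack(">II", commit, vermap_id))


def unpack_dleaf_ids(raw):
    """Inverse of pack_dleaf_ids; returns (fs_id, inum, commit, vermap_id)."""
    (fs_id,) = struct.unpack(">I", raw[:4])
    inum = int.from_bytes(raw[4:10], "big")
    commit, vermap_id = struct.unpack(">II", raw[10:18])
    return fs_id, inum, commit, vermap_id


def commits_per_second_to_wrap(years, bits=COMMIT_BITS):
    """Sustained commit rate needed to exhaust a counter in `years`."""
    seconds = years * 365.25 * 24 * 3600
    return (1 << bits) / seconds
```

Under these assumptions a 32-bit commit number lasts 30 years at a sustained rate of roughly 4.5 commits per second, which is consistent with the estimate in the conversation.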
2009-06-28 20:25 hi 2009-06-28 20:26 ok, here is the plan for handling truncation with versions 2009-06-28 20:26 ok 2009-06-28 20:26 we do not need a delete_id 2009-06-28 20:27 instead, we have a variant of the data btree attribute 2009-06-28 20:27 first, in the case of a truncate of an entire file, nothing is inherited by the child version 2009-06-28 20:28 yes 2009-06-28 20:28 so we just add a new data attribute for the child version 2009-06-28 20:28 in the case of a partial attribute, below the new size data is inherited from the parent version, and above that size, it is in the new attribute 2009-06-28 20:29 so this new type of attribute has a threshold field, to specify this 2009-06-28 20:29 I was planning to inherit between attributes anyway, eventually 2009-06-28 20:30 to handle the case where there are a very large number of attributes in one inode 2009-06-28 20:30 this mechanism also suggests an interesting way to handle truncation within the same version\ 2009-06-28 20:31 what is difference with delete_id and threshold? 2009-06-28 20:31 threshold is per-data attribute 2009-06-28 20:31 other data versioning is per extent 2009-06-28 20:31 maybe it is a delete_id 2009-06-28 20:31 I don't think so though 2009-06-28 20:32 hammer's delete ids are per data element, and there can be many data elements per file 2009-06-28 20:32 I could ask matt about that point 2009-06-28 20:33 um... 2009-06-28 20:34 btw, this issue is same on ileaf? 2009-06-28 20:34 i.e. delete of inode 2009-06-28 20:34 similar 2009-06-28 20:35 i see 2009-06-28 20:35 maybe, ileaf is more easy to understand for me 2009-06-28 20:35 ok, well ileaf is easy 2009-06-28 20:35 we don't have to do anything 2009-06-28 20:35 oh 2009-06-28 20:35 the entry is just removed from the directory 2009-06-28 20:36 the inode is not actually deleted and recovered until all directory entries for all versions have been deleted 2009-06-28 20:36 this makes the inode empty 2009-06-28 20:37 um... 
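A toy model of the threshold idea described above, for the case where a file is truncated in a child version: blocks below the threshold are inherited from the parent version's data attribute, and blocks at or above it come only from the child's own attribute. Everything here (the class, the function names, measuring the threshold in blocks, using a dict instead of a btree) is a hypothetical illustration, and it does not model writes below the threshold, which the chat does not cover.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DataAttr:
    """Toy data attribute for one version of one file (hypothetical)."""
    parent: Optional["DataAttr"] = None
    threshold: int = 0  # blocks below this are inherited from parent
    blocks: Dict[int, str] = field(default_factory=dict)


def read_block(attr, block):
    """Return the data at logical `block`, or None for a hole."""
    if attr.parent is not None and block < attr.threshold:
        return read_block(attr.parent, block)
    return attr.blocks.get(block)


def truncate_in_child(parent, new_size):
    """Truncate to `new_size` blocks in a new child version.

    A full truncate is new_size == 0: nothing is inherited.
    """
    return DataAttr(parent=parent, threshold=new_size)
```

In this model, truncating a four-block file to two blocks in the child and then writing block 3 leaves block 2 as a hole in the child, while the parent version still sees its original blocks 2 and 3. That is the "truncate and extend it with hole" case raised earlier.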
2009-06-28 20:38 when do we can delete inode actually? 2009-06-28 20:38 an empty inode is deleted by definition 2009-06-28 20:39 we should update some statistic to know how many empty inodes there are in a given region of the inode table, but otherwise we don't have to do anything 2009-06-28 20:39 empty inode means 0 size inode? 2009-06-28 20:39 yes 2009-06-28 20:39 the initial condition for an empty inode 2009-06-28 20:39 ok 2009-06-28 20:39 when do we can empty the inode? 2009-06-28 20:39 when all versions have been deleted, the inode should be completely empty 2009-06-28 20:40 so, if the version delete algorithm is implemented correctly, this should work 2009-06-28 20:40 inodes do not have any unversioned attributes 2009-06-28 20:43 this may not be same with delete_id though 2009-06-28 20:43 different approach from hammer, yes 2009-06-28 20:43 I claimed that delete_id was not necessary 2009-06-28 20:44 now I guess that claim looks correct 2009-06-28 20:44 I think hammer actually calls it birth and death 2009-06-28 20:44 if inode has referenced by some versions, then if deleted by one version, what happen? 2009-06-28 20:44 birth sequence number and death sequence number 2009-06-28 20:44 and hammer does not have a version tree, only a linear chain of versions 2009-06-28 20:45 i see 2009-06-28 20:46 if the inode is deleted by one version, then the link count attribute for that inode will be reduced, if link count reaches zero the link count attribute itself will be removed, and all data attibutes should also be empty 2009-06-28 20:47 e.g. we add to new entry to directory, it creates new version of directory block? 2009-06-28 20:48 yes 2009-06-28 20:48 this part is a little bit like btrfs 2009-06-28 20:48 we get that behavior for free 2009-06-28 20:48 after that, existed entries points same inode by 2 version? 
2009-06-28 20:49 yes 2009-06-28 20:49 there are now two link count attributes in the inode, one for each version 2009-06-28 20:49 and, then if we removed one existing entry, inode is removed? 2009-06-28 20:50 the inode is not free until it is empty 2009-06-28 20:50 it is not empty until all versions of all attributes have been deleted 2009-06-28 20:50 link count field is shared by versions? 2009-06-28 20:50 no, the link count attribute is versioned 2009-06-28 20:50 ok 2009-06-28 20:51 it's pretty simple, isn't it? 2009-06-28 20:51 it's nice not to have to invent new mechanisms to handle these complex situations 2009-06-28 20:51 so, I think existing entry doesn't increment the link count 2009-06-28 20:52 the threshold mechanism above, did need to be invented, so thank you for pushing me :) 2009-06-28 20:52 um..., I can't still understand about link count 2009-06-28 20:53 just like a traditional filesystem, there will be one link count for each reference from a directory entry 2009-06-28 20:53 yes 2009-06-28 20:53 however, the link counts can be divided up between several different link count attributes, for different versions 2009-06-28 20:54 but, inode which didn't modified is not incremented the link count? 2009-06-28 20:54 right 2009-06-28 20:54 so, inode has no versions? 2009-06-28 20:54 I am not sure what you mean 2009-06-28 20:55 um... 2009-06-28 20:55 create("/foo") 2009-06-28 20:55 create snapshot 2009-06-28 20:56 mount created snapshot 2009-06-28 20:56 create("/bar") 2009-06-28 20:56 unlink("/foo") 2009-06-28 20:56 after this, what happen to inode which is pointed by /foo? 
2009-06-28 21:01 the inode is not empty because there is a directory entry in the old snapshot that references it 2009-06-28 21:01 so, there is a non-zero link count attribute for that version 2009-06-28 21:02 ok 2009-06-28 21:02 another variant is 2009-06-28 21:02 instead of unlink("/foo") 2009-06-28 21:02 umount created snapshot 2009-06-28 21:02 backup orignal 2009-06-28 21:02 unlink("/foo") 2009-06-28 21:02 ? 2009-06-28 21:03 whoops, s/backup/back to/ 2009-06-28 21:05 backup original? 2009-06-28 21:05 oh 2009-06-28 21:05 back to :) 2009-06-28 21:05 yes :) 2009-06-28 21:06 the result is similar: there is a non-zero link count for the new version 2009-06-28 21:06 so the inode is not empty 2009-06-28 21:06 new version? 2009-06-28 21:06 the new global version created by "create snapshot" above 2009-06-28 21:07 um... 2009-06-28 21:08 "back to orignal" means "back to orignal snapshot" 2009-06-28 21:08 I understand 2009-06-28 21:09 another way to say it is, "mount the original version" 2009-06-28 21:09 but, if we create the new snapshot, global version is different to orignal version? 2009-06-28 21:09 yes 2009-06-28 21:10 I could just leave out the word "global" above, I just meant a newly allocated version number 2009-06-28 21:10 what I'm assuming is 2009-06-28 21:11 1 - create("/foo") -> 1 - create inode 2009-06-28 21:11 mount 2 version 2009-06-28 21:11 2 - create("/bar") 2009-06-28 21:11 mount 1 version 2009-06-28 21:11 1 - unlink("/foo") -> ? 2009-06-28 21:12 yes, that is what I understood 2009-06-28 21:13 ah 2009-06-28 21:13 um... 
2009-06-28 21:13 wait 2009-06-28 21:13 I didn't notice "/bar" 2009-06-28 21:13 oh 2009-06-28 21:13 eh 2009-06-28 21:13 sorry :) 2009-06-28 21:14 bar will be created as a different inum from foo 2009-06-28 21:14 that's is point I wonder 2009-06-28 21:14 yes 2009-06-28 21:14 unclear point from above is 2009-06-28 21:14 so, in the example immediately above, /foo's inode will be empty 2009-06-28 21:14 and can be used for some new inode 2009-06-28 21:15 2 - create("/bar) -> make copy block from block is including "/foo" 2009-06-28 21:16 make copy block? 2009-06-28 21:17 ah 2009-06-28 21:17 e.g. dir block has "/foo" 2009-06-28 21:17 are you asking about what happens to the directory block, or to the ileaf? 2009-06-28 21:17 ileaf 2009-06-28 21:17 wait 2009-06-28 21:17 well, I'm assuming the dir block is 2009-06-28 21:18 ileaf situation is simple above 2009-06-28 21:18 2 - copy the dir block, then add "/bar" 2009-06-28 21:18 copying the ileaf is forced on any dirty of the ileaf, it does not matter which version dirtied it 2009-06-28 21:20 um... 2009-06-28 21:20 or, stating it better: ileaf redirect is caused by the first dirty of the ileaf in a flush cycle, it does not matter which version caused the dirty 2009-06-28 21:20 ah, yes 2009-06-28 21:20 question is different one 2009-06-28 21:21 um..., if we try to add the "/bar" to block which is including "foo" on version 2 2009-06-28 21:21 I'm assuming, copy the block as version 2, then add "/bar" 2009-06-28 21:22 this is true? 2009-06-28 21:25 this is true 2009-06-28 21:25 ok 2009-06-28 21:25 simple? 2009-06-28 21:26 it just relies on the ability to version a data file 2009-06-28 21:26 where the directory acts as a data file 2009-06-28 21:26 so, now, we have 2 pointer to inode of "/foo" 2009-06-28 21:26 yes, which is correct 2009-06-28 21:27 I wonder, the inode may be 1 link count? 
2009-06-28 21:27 link count of inode == 1 2009-06-28 21:27 oh I see :) 2009-06-28 21:28 thankyou for your patience 2009-06-28 21:28 no problem at all, I just asking the my question 2009-06-28 21:31 ok, if we delete the foo (2), we can't delete the link count attribute, we must create a link count (2) with count = 0 2009-06-28 21:31 i see 2009-06-28 21:31 um... 2009-06-28 21:31 we know we have to do this because there exists a parent version of the link count 2009-06-28 21:32 the issue may be, when can we empty the inode? 2009-06-28 21:33 the rule for that is the same, only our attribute delete algorithm gets slightly more complicated 2009-06-28 21:33 um... 2009-06-28 21:34 but, we don't know number of pointers until unlink(foo - 2) 2009-06-28 21:34 when the parent version of foo's link count goes to zero, and it has no parent, we can remove that link count attribute 2009-06-28 21:35 um... 2009-06-28 21:35 and at the same time, any child link count attributes with count zero can be removed 2009-06-28 21:35 the versioned pointer algorithms already handle cases like this 2009-06-28 21:35 um... 2009-06-28 21:35 ok, reading your point above 2009-06-28 21:36 unlink(foo - 2) ? 2009-06-28 21:36 it meant, 2 - unlink("foo") 2009-06-28 21:36 unlink("foo") by 2 version 2009-06-28 21:37 right 2009-06-28 21:38 so, if 1 - unlink("foo") is called before 2 - unlink("foo), it is problem? 2009-06-28 21:39 so I invented a new rule above: when unlink an inode, if it has a parent version, then we have to create a new link count attribute with count zero 2009-06-28 21:39 1 is having parent? 
2009-06-28 21:40 that is, if link count (1) = 1 is present and we delete foo (2), then we must add link count (2) = 0 2009-06-28 21:40 yes, it is ok 2009-06-28 21:40 later, when we delete foo (1), link count (1) reaches zero, and we can delete both link count (1) and link count (2) 2009-06-28 21:41 yes 2009-06-28 21:41 thankyou very much for a very subtle catch 2009-06-28 21:41 I guess you understand it better than me now ;) 2009-06-28 21:41 :) 2009-06-28 21:41 no 2009-06-28 21:42 but, I wonder, if unlink("foo") on 1 version, what happen? 2009-06-28 21:42 well, you saw that point clearly, that the number of entries in all the different versions of the inode blocks have to add up to the same as the total of all the link counts attributes in the inode 2009-06-28 21:42 sorry 2009-06-28 21:42 the number of entries in all the different versions of the dirent block have to add up to the same as the total of all the link counts attributes in the inode 2009-06-28 21:43 this is a rule we can easily check with a fsck 2009-06-28 21:44 well, that is a wrong way of stating it 2009-06-28 21:44 it's more subtle than that 2009-06-28 21:50 actually, I don't think we have to create the link count = 0 attribute at all 2009-06-28 21:50 so, two cases 2009-06-28 21:50 case a: foo (2) delete then foo (1) 2009-06-28 21:51 case b: foo (1) deleted then foo (2) 2009-06-28 21:51 yes 2009-06-28 21:52 in case a, foo can't be referenced from in version (2), and an inode count (1) = 1 is present 2009-06-28 21:53 yes 2009-06-28 21:53 so it does not matter that the inode count (1) is inherited by version (2) 2009-06-28 21:54 we can't access the inode in version (2) anyway 2009-06-28 21:54 wait a bit 2009-06-28 21:55 now, we are adding the new link count attribute (version 2) to inode? 2009-06-28 21:55 already 2009-06-28 21:56 we did not need to do that in this case 2009-06-28 21:57 ah, inherited meant there is no copy from version 1? 
2009-06-28 21:57 right 2009-06-28 21:57 ok 2009-06-28 21:57 by the way, inheritance is the key idea of versioned pointers 2009-06-28 21:57 i see 2009-06-28 21:58 or versioned attributes is what I should have called it 2009-06-28 21:58 initially designed for pointers, by applies just as well to attributes 2009-06-28 21:58 s/by/but/ 2009-06-28 21:59 i see 2009-06-28 21:59 well, so, case b is what I wonder 2009-06-28 22:00 in base b above, we also do not have to do anything to the version count attribute 2009-06-28 22:00 ah, version count 2009-06-28 22:00 sorry 2009-06-28 22:00 link count 2009-06-28 22:00 getting late for me ;) 2009-06-28 22:00 oh 2009-06-28 22:00 in base b above, we also do not have to do anything to the link count attribute 2009-06-28 22:01 ok 2009-06-28 22:01 link count (1) can stay = 1, even though foo has been deleted 2009-06-28 22:02 by the same argument: foo's inode cannot be referenced, so it does not matter 2009-06-28 22:02 when can inode be deleted? 2009-06-28 22:02 that remains the same: when the inode is empty of all attributes 2009-06-28 22:03 but the above is don't delete link count attr? 2009-06-28 22:03 in case b above, when foo (2) is deleted, we have to know _somehow_ that link count (1) can be deleted 2009-06-28 22:04 and that is easy, with a new rule 2009-06-28 22:04 oh 2009-06-28 22:04 it is what I'm missing 2009-06-28 22:04 s/missing/not understanding/ 2009-06-28 22:04 when foo (1) is deleted, and link count (1) goes to zero, then instead of being removed, its version is changed to (1) 2009-06-28 22:05 the versioned attribute algorithm with do this automatically 2009-06-28 22:05 that is how it works 2009-06-28 22:05 well I have to sleep 2009-06-28 22:05 this was fun :) 2009-06-28 22:05 soo 2009-06-28 22:05 sorry 2009-06-28 22:05 when foo (1) is deleted, and link count (1) goes to zero, then instead of being removed, its version is changed to (2) 2009-06-28 22:06 um... 2009-06-28 22:06 how can we know version 2? 
2009-06-28 22:06 the version delete algorithm knows that from the version tree 2009-06-28 22:07 i see 2009-06-28 22:07 probably, we had better group all attributes of the same type together in the inode, that is what the versioning algorithms want 2009-06-28 22:08 i see 2009-06-28 22:08 when we delete an attribute, the versioned delete algorithm looks at all versions of a given attribute type 2009-06-28 22:09 so, to state it simply: when a link count hits zero, we delete the attribute 2009-06-28 22:09 versioned delete does the right thing 2009-06-28 22:09 with this, to know more, I need to read docs 2009-06-28 22:09 yes 2009-06-28 22:09 ok 2009-06-28 22:09 well, it was subtle for me, and I have the advantage of knowing it well 2009-06-28 22:10 so, in the end, I convinced myself that we don't have to do anything special, that is not already described in the versioning algorithms 2009-06-28 22:10 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-06-28 22:11 it's oyasumi time 2009-06-28 22:11 ok, oyasumi 2009-06-28 22:18 by the way, ready for a pull? 
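A small sketch of the inheritance rule the link count discussion above relies on: an attribute stored for version 1 is visible in child version 2 without any copy, and an inode counts as empty (and therefore free) only when no version of any attribute remains. The dict-based layout and both function names are invented for illustration, and the sketch deliberately leaves out the versioned delete algorithm itself, which is only summarized in the chat.

```python
def visible_value(attr_versions, parent_of, version):
    """Return the attribute value seen by `version`, or None if absent.

    attr_versions maps version -> explicitly stored value.
    parent_of maps version -> parent version (None for the root).
    The nearest ancestor with a stored value wins: that is inheritance.
    """
    v = version
    while v is not None:
        if v in attr_versions:
            return attr_versions[v]
        v = parent_of.get(v)
    return None


def inode_is_empty(inode):
    """An inode is empty when no attribute has any version left.

    inode maps attribute name -> {version: value}.
    """
    return all(len(versions) == 0 for versions in inode.values())
```

With `parent_of = {1: None, 2: 1}` and a single stored `link_count` of 1 for version 1, version 2 also sees a link count of 1 purely by inheritance. Once the last stored version of the last attribute is removed, `inode_is_empty` reports the inode as reusable.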
2009-06-28 23:09 -!- RazvanM(~RazvanM@pool-173-67-57-143.bltmmd.east.verizon.net) has joined #tux3 2009-06-28 23:12 yes 2009-06-28 23:13 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-06-28 23:13 btw, I posted the email to ml with issue of this 2009-06-28 23:15 some issues might not be mentioned in here 2009-06-29 04:10 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-06-29 05:06 -!- edt(~Ed@dsl-60-1.aei.ca) has joined #tux3 2009-06-29 06:39 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-06-29 08:40 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-06-29 09:27 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-06-29 09:44 good morning 2009-06-29 09:48 howdy 2009-06-29 09:59 hey, long time 2009-06-29 10:10 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-06-29 10:13 hirofum, pulled 2009-06-29 10:13 hirofumi I mean 2009-06-29 11:53 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-29 13:16 -!- mayhugh(~ralph@115.127.15.3) has joined #tux3 2009-06-29 13:16 -!- mayhugh(~ralph@115.127.15.3) has left #tux3 2009-06-29 13:27 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-06-29 15:26 folks 2009-06-29 16:11 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-06-29 22:11 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-06-29 22:49 -!- edt(~Ed@dsl-216-221-38-194.aei.ca) has joined #tux3 2009-06-30 00:11 -!- RazvanM(~RazvanM@pool-173-67-57-143.bltmmd.east.verizon.net) has joined #tux3 2009-06-30 01:36 -!- edt(~Ed@dsl-216-221-38-194.aei.ca) has joined #tux3 2009-06-30 01:36 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-06-30 01:36 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-06-30 01:36 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-06-30 01:36 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-06-30 02:21 -!- data(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-06-30 04:11 
-!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-06-30 06:37 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-06-30 07:38 -!- dcg(~dcg@193.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-06-30 07:59 -!- dcg_(~dcg@22.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-06-30 10:12 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-06-30 11:10 good morning 2009-06-30 12:34 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-06-30 13:38 hi 2009-06-30 15:08 hi hirofumi 2009-06-30 15:08 hi 2009-06-30 15:08 where were we? 2009-06-30 15:09 eh? 2009-06-30 15:09 what does it mean? 2009-06-30 15:09 it means "what were we discussing last time we talked" 2009-06-30 15:09 but I have to do something for about an hour 2009-06-30 15:09 then come back 2009-06-30 15:10 i see 2009-06-30 15:10 ok 2009-06-30 15:10 marcin is interested in working on the version map issue 2009-06-30 15:10 oh, great 2009-06-30 15:11 I am ready to describe a complete _mechanism_ for it, that is easy, the hard part is policy 2009-06-30 15:11 will be back in one hour 2009-06-30 15:11 ok 2009-06-30 16:11 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-06-30 16:32 hirofumi, back 2009-06-30 16:32 yes 2009-06-30 16:33 well, copy from kernel stuff 2009-06-30 16:34 find_next_bit and such 2009-06-30 16:34 find_next_bit is just for the graphics I thought? 
2009-06-30 16:34 yes 2009-06-30 16:34 at least for now 2009-06-30 16:35 I guess you must hate my 32 bit fix 2009-06-30 16:35 for the 32 bit shift warning 2009-06-30 16:35 no problem at all 2009-06-30 16:35 it should be #ifdef though 2009-06-30 16:36 I forget why I can't it 2009-06-30 16:37 well, I hope the compiler is smart enough to optimize it back to one shift 2009-06-30 16:37 if not, I guess I need to email the gcc list :) 2009-06-30 16:37 sizeof() didn't work for #if 2009-06-30 16:37 right 2009-06-30 16:38 I tried to do it that way 2009-06-30 16:38 failed :) 2009-06-30 16:38 :) 2009-06-30 16:38 the correct solution is to take the stupid warning out of gcc 2009-06-30 16:38 maybe we can turn that one off 2009-06-30 16:38 it is wrong to warn on 32 bit shift of a 32 bit value 2009-06-30 16:38 way wrong 2009-06-30 16:39 33 bit shift, maybe 2009-06-30 16:39 well, warn itself may be right 2009-06-30 16:39 >> 32 is always 0 2009-06-30 16:40 but, gcc may be able to detect BITS==64 2009-06-30 16:40 well 2009-06-30 16:40 so, license issue 2009-06-30 16:48 ah right 2009-06-30 16:48 ok, well obviously I can make an exception for gplv2 2009-06-30 16:48 so just need to state that correctly 2009-06-30 16:48 and write it in the source, if any change is required 2009-06-30 16:49 the main reason I like gplv3 is, tridge does 2009-06-30 16:49 I figure, if tridge likes it, it must be good :) 2009-06-30 16:50 oh 2009-06-30 16:50 samba is now gplv3 2009-06-30 16:51 yes 2009-06-30 16:51 fourth biggest open source project I think 2009-06-30 16:51 well, it's samba 2009-06-30 16:51 true, it has to have special patent protection 2009-06-30 16:51 but so does anything to do with storage 2009-06-30 16:52 anyway, I think what I should do is, ask the samba folks why they chose v3 2009-06-30 16:53 for tux3, we want to permit linking with gpl v2 code 2009-06-30 16:53 Can you cc: the list? This would be lwn worthy. 
Any publicity is good publicity 2009-06-30 16:53 so that should be explicitly stated 2009-06-30 16:53 sure 2009-06-30 16:54 hmm, will it be accepted for merge if it's GPLv3-only-can-be-linked-with-GPLv2? 2009-06-30 16:55 if it enters linux upstream then, then distributors will need to treat the kernel as a whole as if it were GPLv3 2009-06-30 16:55 "GPLv2 is compatible with GPLv3 if the program allows you to choose "any later version" of the GPL" -- http://www.gnu.org/licenses/quick-guide-gplv3.html 2009-06-30 16:55 kernel part should be gplv2 2009-06-30 16:55 well, I don't like that "any later version", I think it can be more specific than that 2009-06-30 16:57 the kernel part is gplv2 2009-06-30 16:57 Licensed under the GPL version 2 2009-06-30 16:57 -- kernel/inode.c 2009-06-30 16:57 I might have forgotten to update some files 2009-06-30 16:58 I should actually update it to read, under v2 _and_ v3 2009-06-30 16:58 v2 and v3? 2009-06-30 16:58 yes 2009-06-30 16:58 v2 or v3? 2009-06-30 16:58 maybe more correct to say or 2009-06-30 16:59 FSF official terminology is "or any later version" 2009-06-30 16:59 which I don't like 2009-06-30 16:59 yes 2009-06-30 17:00 iirc, kernel is removing that part 2009-06-30 17:00 right, linus hates it 2009-06-30 17:01 or v3, is kernel part? 2009-06-30 17:09 I think for both 2009-06-30 17:09 in kernel, we want our v3 code to be compatible with existing v2 code, which it is if that is tated 2009-06-30 17:09 stated 2009-06-30 17:10 and in userspace, we want people to be able to use it with v2 code also 2009-06-30 17:10 it would be stupid to say "GPL isn't compatible with itself" 2009-06-30 17:11 license itself may be compatible 2009-06-30 17:11 I'm not sure 2009-06-30 17:12 some like "licensed for redistribution under the GPL v2 or GPL v3" 2009-06-30 17:12 but, kernel would be distributed with v2? 2009-06-30 17:12 all the kernel files should have something like the above 2009-06-30 17:13 above means v2 or v3? 
2009-06-30 17:13 ah, kernel files are our kernel files? 2009-06-30 17:13 just need to tweak the wording a little, so that it is completely clear, but also not a huge amount of license text 2009-06-30 17:13 I really don't like the style where the first 30 lines of the file are about the license, instead of what the file is for 2009-06-30 17:14 yes, our kernel files 2009-06-30 17:15 Licensed under the GPL version 2 2009-06-30 17:15 sorry 2009-06-30 17:15 currently they say "Licensed under the GPL version 2" 2009-06-30 17:16 they should be revised to something like: "Licensed for distribution under the GPL version 2 or 3" 2009-06-30 17:16 userspace only files are not a pressing issue 2009-06-30 17:17 so just leave them as they are, GPL v3 only, until there is a reason to change 2009-06-30 17:18 kernel has an immediate reason to be v2 of course, and we want to be very clear that they are still compatible with v3 2009-06-30 17:18 yes 2009-06-30 17:18 maybe: "Licensed for distribution under the GNU GPL version 2 or 3" 2009-06-30 17:19 well, unclear would be problem 2009-06-30 17:21 http://en.wikipedia.org/wiki/Gpl <- good place to start reading 2009-06-30 17:21 the FSF site is, understandably, slightly biased 2009-06-30 17:22 ah, unclear means license of our code is unclear 2009-06-30 17:23 exactly 2009-06-30 17:34 um... 2009-06-30 17:35 if wiki is believable, figure seems v3 doesn't have any compatible? 2009-06-30 17:45 the figure is not very complete 2009-06-30 17:45 i see 2009-06-30 17:46 http://gplv3.fsf.org/wiki/index.php/Compatible_licenses 2009-06-30 17:48 um..., v2 is incompatible with v3 2009-06-30 17:48 unclear what is meaning "compatible" 2009-06-30 17:50 if you write a gplv2 application, you can't link against gplv3 libraries 2009-06-30 17:51 the "compatible" in that url is really it? 
2009-06-30 17:51 That is the official gplv3 website run by FSF so likely yes 2009-06-30 17:53 sejeff, unless the original author says it is ok to do exactly that 2009-06-30 17:53 ah, "compatible" in that url, it is meaning "can't link"? 2009-06-30 17:53 and I am saying that it is ok 2009-06-30 17:53 just need to write that in the files 2009-06-30 17:53 Yes, but that is always questionable. It's your code though so do as you please. 2009-06-30 17:54 here is another way of thinking about it: if FSF says licensing under "any later version" makes v2 compatible with v3, then it must be true that stating a specific later version, and no other, makes v2 compatible with v3 2009-06-30 17:56 hirofumi, yes, "compatible" means "can link with" 2009-06-30 17:56 also means "can mix in the same source file" 2009-06-30 17:56 um..., v2 compatible is too many than I was thinking 2009-06-30 17:58 so, the FSF has not really stated which licenses are compatible with or incompatible with GPL v3 2009-06-30 17:58 only one of each is listed 2009-06-30 17:58 this is a curious omission 2009-06-30 17:59 I think I will ask eben about that 2009-06-30 17:59 eben? 
2009-06-30 18:00 eben moglen, main author of the gpl v3 2009-06-30 18:00 ah 2009-06-30 18:00 v2 was mainly rms, but v3 was mainly eben 2009-06-30 18:01 http://gplv3.fsf.org/dd3-faq 2009-06-30 18:02 I'll be offline for a few minutes 2009-06-30 18:02 ok 2009-06-30 18:03 well, so, "compatible" is at least two means, link and copy 2009-06-30 18:20 http://www.gnu.org/licenses/gpl-faq.html#AllCompatibility 2009-06-30 18:20 not draft 2009-06-30 19:02 the chart is a little misleading 2009-06-30 19:02 it does not cover the case where the code license says GPL v2 or GPL v3 2009-06-30 19:03 it only covers the case where it is one specific version, or all later versions as well 2009-06-30 19:03 I think v2 or v3 should be same with v2 or later 2009-06-30 19:04 "The only time you may not be able to combine code under two of these licenses is when you want to use code that's only under an older version of a license with code that's under a newer version" 2009-06-30 19:04 that is pretty clear 2009-06-30 19:04 so tux3 will not be "only" 2009-06-30 19:05 it is already not v2 nor v3 2009-06-30 19:05 v2 or later also includes v4, which I have not seen yet 2009-06-30 19:05 so I don't like "or later" 2009-06-30 19:05 it is not a problem in this chart 2009-06-30 19:06 "v2 or later, up to v3" :) 2009-06-30 19:06 :) 2009-06-30 19:07 well, from this chart, v2 and v3 seems can't mix 2009-06-30 19:13 the chart is misleading, it tries to make you think that every body has to say "or later" 2009-06-30 19:14 let's see what samba says 2009-06-30 19:14 I think not so 2009-06-30 19:14 http://news.samba.org/announcements/samba_gplv3/ 2009-06-30 19:14 or later seems not solves 2009-06-30 19:14 it seems to say it must update to v3 2009-06-30 19:15 s/update/upgrade/ 2009-06-30 19:16 which line? 2009-06-30 19:16 "OK if you upgrade to GPLv3"? 
2009-06-30 19:19 I think what the chart tries to say is that it is impossible to release the same piece of code under both the GPL v2 and GPL v3 license, which is clearly false 2009-06-30 19:20 it says about "copy code" 2009-06-30 19:21 btw, samba seems v3 or later 2009-06-30 19:28 yes, correct 2009-06-30 19:28 for example: http://git.samba.org/?p=tridge/samba.git;a=blob;f=source/smbd/aio.c;h=74275368bdd2c2f119ca022a58f17f5ee72a81f5;hb=HEAD 2009-06-30 19:28 yes 2009-06-30 19:28 however, I still don't like the "or later" provision 2009-06-30 19:28 yes 2009-06-30 19:29 I'm not saying we should same with samba at all :) 2009-06-30 19:29 by the way, how did samba manage to switch from v2 to v3? 2009-06-30 19:29 when kernel could not? 2009-06-30 19:29 i guess unclear 2009-06-30 19:29 I'd like to know the story :) 2009-06-30 19:30 maybe, switched it by some core members 2009-06-30 19:30 without all agreement 2009-06-30 19:30 just guess 2009-06-30 19:31 well, maybe, anybody didn't blame to it 2009-06-30 19:31 maybe just to show linus that it is possible 2009-06-30 19:32 nobody complained 2009-06-30 19:32 because it is tridge 2009-06-30 19:32 if linus did it, I think some people would complain 2009-06-30 19:32 just a guess 2009-06-30 19:33 I certainly would not 2009-06-30 19:33 well, I can't also see why samba team want to do 2009-06-30 19:34 patent protection 2009-06-30 19:34 what protection? 
2009-06-30 19:35 well, samba is using lgplv3, so it would not be same with kernel 2009-06-30 19:36 lgplv3 seems to be able to be linked from many licenses 2009-06-30 19:54 good point 2009-06-30 20:02 http://www.gnu.org/licenses/gpl-3.0.txt <- section 11, patents 2009-06-30 20:03 yes 2009-06-30 20:04 for example, this says that I can't secretly patent the underlying technology of tux3, or if I do make a patent, by releasing the code under gpl v3, you have a license to use the patent 2009-06-30 20:04 that protects you from me, a good start 2009-06-30 20:04 of course, nobody needs to be protected from me :) 2009-06-30 20:04 yes :) 2009-06-30 20:04 but, suppose Microsoft gave me a $10 million cheque, which turned me evil 2009-06-30 20:04 then you would need to be protected from me 2009-06-30 20:05 if you did :) 2009-06-30 20:05 I guess microsoft does not plan to give me such a large cheque 2009-06-30 20:05 well, suppose netapp did 2009-06-30 20:05 rather more likely 2009-06-30 20:06 although still not very likely to tell the truth 2009-06-30 20:10 the same is true of any other contributor to the project 2009-06-30 20:10 ah, contributor 2009-06-30 20:24 -!- edt(~Ed@dsl-216-221-36-133.aei.ca) has joined #tux3 2009-06-30 21:55 -!- edt(~Ed@dsl-216-221-37-164.aei.ca) has joined #tux3 2009-06-30 22:13 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-06-30 23:05 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-06-30 23:05 -!- RazvanM(~RazvanM@pool-173-67-57-143.bltmmd.east.verizon.net) has joined #tux3 2009-06-30 23:47 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-06-30 23:48 -!- kedars(~kedars@socks.wantstofly.org) has left #tux3 2009-07-01 03:40 -!- new_guy(ca4bcee2@webchat.mibbit.com) has joined #tux3 2009-07-01 03:41 cloning the git tree 2009-07-01 03:41 will test it out a little 2009-07-01 03:41 see how it goes 2009-07-01 03:41 the git tree up-to-date? 
2009-07-01 04:12 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-01 05:59 good morning 2009-07-01 05:59 new_guy, the kernel code in the git tree is pretty close to the latest in the mercurial tree 2009-07-01 09:06 -!- toto03(~thomas@pD9EC67F1.dip.t-dialin.net) has joined #tux3 2009-07-01 09:33 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-07-01 10:12 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-01 16:12 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-01 22:12 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-02 00:56 any ideas on how a new_guy like me can go through the code? where to start etc.,? 2009-07-02 04:13 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-02 07:27 good morning 2009-07-02 09:08 -!- RazvanM(~RazvanM@pool-173-75-177-133.bltmmd.east.verizon.net) has joined #tux3 2009-07-02 10:13 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-02 10:40 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-07-02 13:57 -!- Ganneff(~joerg@ganneff.noc.oftc.net) has left #tux3 2009-07-02 16:12 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-02 22:13 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-02 22:34 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-02 22:35 -!- edt(~Ed@dsl-216-221-37-164.aei.ca) has joined #tux3 2009-07-02 22:35 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-07-02 22:35 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-07-02 23:27 -!- RazvanM(~RazvanM@pool-173-75-177-133.bltmmd.east.verizon.net) has joined #tux3 2009-07-03 04:12 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-03 06:10 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-07-03 08:00 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-03 10:13 -!- 
dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-03 11:52 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-03 14:35 -!- ajonat(~ajonat@190.48.97.220) has joined #tux3 2009-07-03 16:13 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-03 16:16 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-07-03 17:53 -!- edt(~Ed@dsl-216-221-39-222.aei.ca) has joined #tux3 2009-07-03 18:15 -!- edt(~Ed@dsl-216-221-39-222.aei.ca) has joined #tux3 2009-07-03 23:43 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-07-04 01:18 -!- RazvanM(~RazvanM@pool-173-75-177-133.bltmmd.east.verizon.net) has joined #tux3 2009-07-04 03:44 -!- edt(~Ed@dsl-216-221-39-222.aei.ca) has joined #tux3 2009-07-04 04:13 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-04 09:13 -!- new_guy(~bobby@122.162.72.153) has joined #tux3 2009-07-04 09:13 hey everyone 2009-07-04 09:13 tux3 got a wiki? 2009-07-04 09:18 -!- new_guy(~bobby@122.162.72.153) has joined #tux3 2009-07-04 09:52 -!- new_guy(~bobby@122.162.72.153) has joined #tux3 2009-07-04 10:11 -!- new_guy(~bobby@122.162.72.153) has joined #tux3 2009-07-04 10:14 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-04 10:22 -!- new_guy(~bobby@122.162.72.153) has joined #tux3 2009-07-04 11:08 -!- new_guy(~bobby@122.162.72.153) has joined #tux3 2009-07-04 11:48 -!- new_guy(~bobby@122.162.72.153) has joined #tux3 2009-07-04 12:52 -!- new_guy(~bobby@122.162.72.153) has joined #tux3 2009-07-04 13:36 -!- ajonat(~ajonat@190.48.97.220) has joined #tux3 2009-07-04 15:10 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-07-04 15:48 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-07-04 16:06 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-07-04 16:13 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-04 18:28 -!- data(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-07-04 18:28 -!- 
pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-07-04 18:28 -!- persson_(persson@nescafe.bsnet.se) has joined #tux3 2009-07-04 18:35 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-04 18:35 -!- edt(~Ed@dsl-216-221-39-222.aei.ca) has joined #tux3 2009-07-04 18:35 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-04 18:35 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-07-04 18:35 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-07-04 18:35 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-07-04 18:35 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-07-04 18:35 -!- pythonstar(~kavli@c-0df1e455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-07-04 18:35 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-07-04 18:35 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-04 18:35 -!- persson_(persson@nescafe.bsnet.se) has joined #tux3 2009-07-04 18:35 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-07-04 18:35 -!- data(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-07-04 18:35 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-07-04 18:35 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-07-04 22:14 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-04 23:47 -!- RazvanM(~RazvanM@pool-173-75-177-133.bltmmd.east.verizon.net) has joined #tux3 2009-07-05 03:45 -!- aks(~project_t@123.236.190.94) has joined #tux3 2009-07-05 03:45 -!- aks(~project_t@123.236.190.94) has left #tux3 2009-07-05 04:16 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-05 08:42 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-05 10:16 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-05 14:15 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-07-05 16:55 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-07-05 19:57 -!- 
cwood(~cwood@66.151.59.138) has joined #tux3 2009-07-05 20:21 -!- yanzheng(~zhyan@inet-netcache3-o.oracle.com) has joined #tux3 2009-07-05 20:22 -!- yanzheng(~zhyan@inet-netcache3-o.oracle.com) has left #tux3 2009-07-05 21:58 -!- RazvanM(~RazvanM@pool-173-75-177-133.bltmmd.east.verizon.net) has joined #tux3 2009-07-05 22:17 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-06 04:14 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-06 07:04 -!- ajonat(~ajonat@190.48.125.114) has joined #tux3 2009-07-06 07:05 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-07-06 09:57 -!- ajonat(~ajonat@190.48.125.114) has joined #tux3 2009-07-06 10:14 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-06 10:35 good morning 2009-07-06 12:34 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-07-06 13:18 hey flipz 2009-07-06 13:33 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-07-06 14:08 -!- npmccallum(~npmccallu@cpe-76-177-118-80.natcky.res.rr.com) has joined #tux3 2009-07-06 15:27 -!- ajonat(~ajonat@190.48.102.68) has joined #tux3 2009-07-06 16:15 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-06 17:49 -!- edt(~Ed@dsl-216-221-39-222.aei.ca) has joined #tux3 2009-07-06 21:31 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-07-06 22:14 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-07 02:57 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2009-07-07 04:09 -!- persson_(persson@nescafe.bsnet.se) has joined #tux3 2009-07-07 04:15 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-07 09:09 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-07-07 09:11 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-07 09:11 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-07-07 09:11 -!- cwood(~cwood@66.151.59.138) has joined #tux3 2009-07-07 
09:14 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-07-07 09:20 -!- cwood(~cwood@66.151.59.138) has joined #tux3 2009-07-07 09:24 -!- ajonat(~ajonat@190.48.104.51) has joined #tux3 2009-07-07 10:15 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-07 16:15 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-07 17:49 -!- cwood(~cwood@66.151.59.138) has joined #tux3 2009-07-07 22:15 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-08 00:10 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-08 00:12 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-08 03:42 -!- pgquiles(~pgquiles@193.145.130.250) has joined #tux3 2009-07-08 04:04 -!- pgquiles_(~pgquiles@193.145.130.250) has joined #tux3 2009-07-08 04:15 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-08 04:18 -!- pgquiles__(~pgquiles@193.145.130.250) has joined #tux3 2009-07-08 04:33 -!- pgquiles_(~pgquiles@193.145.130.250) has joined #tux3 2009-07-08 07:46 -!- shapor_(~shapor@yzf.shapor.com) has joined #tux3 2009-07-08 07:46 -!- marcin_(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-07-08 07:47 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-08 07:52 -!- edt(~Ed@dsl-216-221-39-222.aei.ca) has joined #tux3 2009-07-08 07:52 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-07-08 08:02 good morning 2009-07-08 08:29 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-07-08 10:17 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-08 14:15 hey flipz 2009-07-08 14:27 -!- ajonat(~ajonat@190.48.126.229) has joined #tux3 2009-07-08 15:28 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-07-08 16:16 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-08 22:15 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-08 23:16 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-08 23:16 -!- flips(~phillips@phunq.net) 
has joined #tux3 2009-07-09 00:07 -!- RazvanM(~RazvanM@pool-173-67-59-221.bltmmd.east.verizon.net) has joined #tux3 2009-07-09 01:24 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-09 01:24 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-09 03:47 -!- edt(~Ed@dsl-216-221-39-222.aei.ca) has joined #tux3 2009-07-09 04:18 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-09 04:59 -!- edt(~Ed@dsl-216-221-39-222.aei.ca) has joined #tux3 2009-07-09 05:57 good morning 2009-07-09 06:08 hi 2009-07-09 06:08 flipz, ping 2009-07-09 06:29 hi marcin 2009-07-09 06:46 how's your schedule looking like these days? when's a good time for you to chat? 2009-07-09 06:54 pretty busy 2009-07-09 06:54 but now is ok 2009-07-09 06:55 if I respond to pings, I can chat :) 2009-07-09 06:55 was it about the version mapping issue? 2009-07-09 06:58 yea, i need some background info explained, you were dropping mad terms around and i am not up to your speeds 2009-07-09 06:59 ok, this is about allowing a 32 bit version number space, but only providing a 10 bit field in the versioned object (pointer or inode attr) 2009-07-09 07:01 the mechanism is simple: each index leaf (inode table or data btree) specifies a mapping table indexed by a 10 bit version number local to the block, giving a 32 bit global version number 2009-07-09 07:01 slow down there sparky ;) 2009-07-09 07:02 inode in the classic inode sense right? 2009-07-09 07:02 as in, on-disk 2009-07-09 07:02 in an inode table block 2009-07-09 07:02 got it 2009-07-09 07:02 or as we say in tux3, an ileaf 2009-07-09 07:03 that is, a leaf block of the inode table btree 2009-07-09 07:03 why leaf,if it's linking to other things? 
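A minimal sketch of the version mapping mechanism described above: each leaf carries a small table, indexed by a 10 bit local version number, that yields a 32 bit global version number. The class and method names are hypothetical and not taken from the tux3 source, and the "table full" error is only a placeholder for the policy question that the discussion calls the hard part.

```python
# Hypothetical sketch, not tux3 code: a per-leaf version table.
LOCAL_BITS = 10    # width of the version field stored in each versioned object
GLOBAL_BITS = 32   # width of the global version number space


class LeafVersionMap:
    """Maps 10 bit leaf-local version numbers to 32 bit global ones."""

    def __init__(self):
        # Index = local version number, value = global version number.
        self.table = []

    def local_for(self, global_version):
        """Return the local number for a global version, adding it if new."""
        if not 0 <= global_version < (1 << GLOBAL_BITS):
            raise ValueError("global version out of range")
        if global_version in self.table:
            return self.table.index(global_version)
        if len(self.table) >= (1 << LOCAL_BITS):
            # What to do here (split the leaf, retire versions, ...) is policy.
            raise OverflowError("leaf version table is full")
        self.table.append(global_version)
        return len(self.table) - 1

    def global_for(self, local_version):
        """Translate a local version number back to its global value."""
        return self.table[local_version]
```

With this shape, a versioned pointer only needs to store the small local number; the leaf's table supplies the full global version when the object is read.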
2009-07-09 07:03 to me leaf is the end of any tree-like structure 2009-07-09 07:04 it's a terminal block of the inode table 2009-07-09 07:04 you really oughtta make me write a tree-walker for tux, that would explain 95% of these questions 2009-07-09 07:04 nothing says that a leaf can't reference other objects, in fact this is often the case with trees 2009-07-09 07:05 weird, i thought leaf is different from regular nodes by not linking...ok, important clearup ;) 2009-07-09 07:05 hirofumi has written a fine tree walker, it's in tux3graph.c 2009-07-09 07:05 yea but i dont learn nearly as well by reading :/ 2009-07-09 07:05 i gotta do it the painful way 2009-07-09 07:06 right, good exercise 2009-07-09 07:06 ok so main tree is a bunch of btree nodes, with ileaf's at the end 2009-07-09 07:06 and the ileafs actually link to data chunks proper? 2009-07-09 07:08 btree walking is done by the "advance" function in btree.c, which walks all the leaves of a btree 2009-07-09 07:08 it's a good starting point for writing a tree walker 2009-07-09 07:08 noted 2009-07-09 07:09 so is my basic 'vision' of tux anywhere near correct? 2009-07-09 07:09 an ileaf does not link directly to data extents at present 2009-07-09 07:09 ok, what does it link to? 2009-07-09 07:09 instead, the ileaf contains data btree attributes 2009-07-09 07:10 umm...like what kind of stuff? 2009-07-09 07:10 a data btree attribute specifies the root of a data btree, this is the actual file data 2009-07-09 07:11 so there's two types of btrees? one for metadata and one for actual file data? 
2009-07-09 07:11 so, the main structure of tux3 is a two level btree: inode table btree with file data btrees descending from the leaf blocks of the inode table btree 2009-07-09 07:11 yes, two types of btrees 2009-07-09 07:11 so what's your nomenclature for them, to be clear and not get confused 2009-07-09 07:11 this maps well to the vfs internal organization 2009-07-09 07:12 1) inode table btree 2) file data btree 2009-07-09 07:12 cool 2009-07-09 07:12 index nodes for both kinds of btree are identical at present, only the leaves differ 2009-07-09 07:13 lemme draw some pictures, i remember stuff much better graphically 2009-07-09 07:20 i think i got it, one question tho: 2009-07-09 07:21 in "the main structure of tux3 is a two level btree: inode table btree with file data btrees descending from the leaf blocks of the inode table btree" you mention inode table btree 2 times. is this correct or should it be inode table btree and data table btree? 2009-07-09 07:21 umm..i just confused myself some more 2009-07-09 07:22 could you just expand on this sentence? i think this is the nuts and bolts here, gotta get it right 2009-07-09 07:25 description was correct 2009-07-09 07:25 tux3 is a btree of btrees 2009-07-09 07:26 top level btree is the inode table, with lots of file data btrees hanging off the leaf blocks of the inode table 2009-07-09 07:26 ext* has a similar structure 2009-07-09 07:29 so top to bottom it goes: inode table btrees->ileafs->data btree attrs? 2009-07-09 07:29 inode table btree index block -> inode table leaf block -> data btree index block -> data btree leaf block 2009-07-09 07:30 the attrs are in the inode table leaf blocks 2009-07-09 07:31 or more precisely: inode table btree index block -> inode table btree index block ... -> inode table leaf block -> data btree index block -> data btree index block ... -> data btree leaf block 2009-07-09 07:32 ... means lots of them? 
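The "btree of btrees" chain above (index blocks, then inode table leaf blocks, then data btree index blocks, then data btree leaf blocks) can be sketched as a toy in-memory walker. This is only an illustration under stated assumptions: the class names `IndexNode`, `Ileaf` and `Dleaf` and the tuple form of an extent are hypothetical, and the real walker is the "advance" function in btree.c mentioned earlier, which this does not reproduce.

```python
# Hypothetical in-memory model of the two level structure, not tux3 code.
from dataclasses import dataclass


@dataclass
class IndexNode:
    # Index nodes look the same in both kinds of btree; only leaves differ.
    children: list


@dataclass
class Ileaf:
    # Inode table leaf: inode number -> root of that file's data btree.
    inodes: dict


@dataclass
class Dleaf:
    # Data btree leaf: list of (start_block, block_count) extents.
    extents: list


def leaves(node):
    """Yield every leaf below node, left to right."""
    if isinstance(node, IndexNode):
        for child in node.children:
            yield from leaves(child)
    else:
        yield node


def walk(inode_table_root):
    """Yield (inode number, extent) for every extent in the filesystem."""
    for ileaf in leaves(inode_table_root):
        for inum, data_root in sorted(ileaf.inodes.items()):
            for dleaf in leaves(data_root):
                for extent in dleaf.extents:
                    yield inum, extent
```

The same `leaves` helper serves both levels, which mirrors the remark that the index nodes of the two btree types are identical.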
2009-07-09 07:32 yes, lots 2009-07-09 07:33 it's capable of representing very large numbers of elements, both inodes and data extents 2009-07-09 07:33 lots = 2^48 of both 2009-07-09 07:33 ok, so do the data btree leaf blocks finally point to extents yet? :) 2009-07-09 07:34 yes 2009-07-09 07:34 ok, so what kind of stuff is contained in each level nodes? 2009-07-09 07:34 attrs you said are in ileaf 2009-07-09 07:34 inode table leaf blocks (ileaf) contain attributes and data btree leaf blocks (dleaf) contain extents 2009-07-09 07:37 so by extent you mean another data struct or the actual chunk of data on drive? 2009-07-09 07:42 both, its ambiguous 2009-07-09 07:42 in the context of a data leaf, it's an 8 byte object 2009-07-09 07:42 which points at actual file data up to 64 blocks long 2009-07-09 07:43 we use the term "extent" for both the dleaf element and the data it points at 2009-07-09 07:43 typical confusion ;) 2009-07-09 07:49 ugh, you're not making this easier to learn ;) 2009-07-09 07:49 ok so the dleaf points to the 8byte object, or the 64 block long file data? 2009-07-09 07:49 i'm drawing myself a map basically, to be 'walked' over the weekend ;) 2009-07-09 07:56 so with extents, how do you deal with fragmentation? there's gotta be some sort of map of extents somewhere 2009-07-09 08:04 well, this confusion between representing thing and represented thing is completely normal throughout compsci 2009-07-09 08:04 for example, what is a memory page? is it a 32 bit page number or a 4K piece of memory? 
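A minimal sketch of how an 8 byte extent could hold a block address and a length of up to 64 blocks. The field layout is an assumption built from numbers mentioned in this discussion (a 2^48 block space, a 10 bit version field, up to 64 blocks per extent, which add up to 48 + 6 + 10 = 64 bits); the actual on-disk layout, field order and byte order in tux3 may differ, and the function names are hypothetical.

```python
# Hypothetical bit layout, not confirmed against the tux3 source.
BLOCK_BITS = 48    # block address, counted in blocks
COUNT_BITS = 6     # stores count - 1, so 1..64 blocks
VERSION_BITS = 10  # leaf-local version number


def pack_extent(block, count, version=0):
    """Pack one extent into 8 bytes (big-endian, by assumption)."""
    if not 0 <= block < (1 << BLOCK_BITS):
        raise ValueError("block address out of range")
    if not 1 <= count <= (1 << COUNT_BITS):
        raise ValueError("an extent covers 1 to 64 blocks")
    if not 0 <= version < (1 << VERSION_BITS):
        raise ValueError("version out of range")
    word = (version << (BLOCK_BITS + COUNT_BITS)) | ((count - 1) << BLOCK_BITS) | block
    return word.to_bytes(8, "big")


def unpack_extent(raw):
    """Return (block, count, version) from an 8 byte extent."""
    word = int.from_bytes(raw, "big")
    block = word & ((1 << BLOCK_BITS) - 1)
    count = ((word >> BLOCK_BITS) & ((1 << COUNT_BITS) - 1)) + 1
    version = word >> (BLOCK_BITS + COUNT_BITS)
    return block, count, version
```

Storing `count - 1` is a design choice of this sketch: it lets 6 bits cover lengths 1 through 64, since a zero-length extent never needs to be represented.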
2009-07-09 08:04 we use the same term, "page" for both 2009-07-09 08:05 and we use the same term "extent" both for the 8 byte thing and the up-to-64-blocks thing 2009-07-09 08:05 the deleaf _contains_ 8 byte extents 2009-07-09 08:06 the 8 byte extent _points at_ up to 64 blocks of file data 2009-07-09 08:06 s/deleaf/dleaf/ 2009-07-09 08:07 there is a bitmap table that maps all free blocks, one bit per block 2009-07-09 08:07 and extent is just a range of blocks, nothing more or less 2009-07-09 08:08 so, free extents are mapped by the bitmap table 2009-07-09 08:14 so the 8byte extent stores the address of the file data? seems a bit small, unless you're counting it in blocks or some other trick 2009-07-09 08:39 yes, it stores the address of the file data, yes, it is small, that's a good thing 2009-07-09 08:39 and yes, it is counted in blocks 2009-07-09 08:40 the size of filesystem metadata is directly related to the size of an extent pointer 2009-07-09 10:18 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-09 11:58 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-09 12:48 sorry marcin, dropped off irc for a while 2009-07-09 12:51 np 2009-07-09 12:51 was busy anyway 2009-07-09 12:52 if you got time, i could continue to bug you with tux-remedialsummerschool questions ;) 2009-07-09 14:58 back 2009-07-09 14:59 http://www.codase.com/search/display?file=L2dlbnRvbzIvdmFyL3RtcC9yZXBvcy9jb2Rhc2UuYy9tb3ppbGxhLTEuNy4xMC1yMS9pbWFnZS91c3IvbGliL21vemlsbGEvaW5jbHVkZS9qYXZhL2puaS5o&lang=c%2B%2B <- for hirofumi, example of triple-licensed code 2009-07-09 14:59 it's kind of funny, the license text is far longer than the contents of the file 2009-07-09 16:16 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-09 17:48 hey folks 2009-07-09 18:06 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-07-09 18:21 -!- ajonat(~ajonat@190.48.127.43) has joined #tux3 2009-07-09 21:15 -!- ajonat(~ajonat@190.48.127.43) has joined #tux3 2009-07-09 
21:20 -!- ajonat(~ajonat@190.48.127.43) has joined #tux3 2009-07-09 22:16 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-09 22:58 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-09 22:59 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-09 23:51 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-09 23:54 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-10 00:03 -!- flipz_(~daniel@phunq.net) has joined #tux3 2009-07-10 00:12 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-10 00:13 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-10 00:24 -!- ajonat(~ajonat@190.48.127.43) has joined #tux3 2009-07-10 01:02 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-10 01:11 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-10 01:12 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-10 01:36 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-10 02:04 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-10 02:05 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-10 02:21 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-10 02:39 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-10 02:49 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-10 02:57 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-10 03:15 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-10 03:16 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-10 03:29 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-10 03:32 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-10 03:50 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-10 03:50 -!- flipz_(~daniel@phunq.net) has joined #tux3 2009-07-10 03:59 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-10 04:03 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-10 04:16 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-10 04:27 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-10 04:27 -!- flips(~phillips@phunq.net) has 
joined #tux3 2009-07-10 04:37 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-10 04:46 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-10 04:46 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-10 04:52 -!- flipz_(~daniel@phunq.net) has joined #tux3 2009-07-10 10:18 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-10 10:19 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-10 10:25 hi tim_dimm 2009-07-10 10:25 hey flipz 2009-07-10 10:26 quiet here 2009-07-10 10:26 where are you now? 2009-07-10 10:26 manshack industries 2009-07-10 10:45 aloha 2009-07-10 10:45 bon voyage 2009-07-10 10:45 ;) 2009-07-10 10:48 hi shapor 2009-07-10 13:46 -!- Sypi(~anderson@eclipse.knight-rider.org) has joined #tux3 2009-07-10 13:48 I know this not a direct tux3 question.. but since it uses ddsnap.. quick question.. does anyone know if you can grow a snap target that has already been defined for a zumastor volume? 2009-07-10 14:49 Sypi: i'm pretty sure you can't grow the snapshot store for zumastor on the fly 2009-07-10 14:49 we were planning on implementing that but I don't think we did yet 2009-07-10 14:50 and also tux3 does not really "use" ddsnap 2009-07-10 14:50 the plan is to backport some ideas from tux3 to ddsnap at some point but that hasn't happened yet 2009-07-10 14:50 you can grow the snapshot store size, and re-initialize it 2009-07-10 14:50 but you will lose all your snapshots 2009-07-10 14:51 your origin volume will not be touched by snapshot store initialization 2009-07-10 16:17 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-10 19:00 -!- flipz_(~daniel@phunq.net) has joined #tux3 2009-07-10 19:11 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-10 19:35 -!- Sypi(~anderson@eclipse.knight-rider.org) has joined #tux3 2009-07-10 20:11 shapor_ thank you for the reply! yeah, been kinda finding that out by giving it a whirl. 
For my needs I need to either be able to resize the online volume on the fly or be able to apply snapshots to an already online non-zumastor volume. 2009-07-10 20:12 If you had to estimate roughly how much work it would require to allow real time resize of the zumastor volume, what would you say? (we have some development resources at our disposal and would commit back) 2009-07-10 22:18 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-10 22:53 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-10 23:35 -!- Sypi(~anderson@eclipse.knight-rider.org) has joined #tux3 2009-07-11 02:07 -!- flipz_(~daniel@phunq.net) has joined #tux3 2009-07-11 04:17 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-11 06:02 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-11 06:02 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-11 06:17 -!- flipz_(~daniel@phunq.net) has joined #tux3 2009-07-11 07:07 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-11 10:18 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-11 11:19 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-07-11 16:18 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-11 22:01 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-07-11 22:18 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-12 00:28 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-07-12 03:10 -!- flipz_(~daniel@phunq.net) has joined #tux3 2009-07-12 03:52 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-12 04:13 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-12 04:14 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-12 04:18 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-12 05:54 -!- flipz_(~daniel@phunq.net) has joined #tux3 2009-07-12 06:10 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-12 06:14 -!- flips(~phillips@phunq.net) has 
joined #tux3 2009-07-12 07:01 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-12 07:08 -!- flipz_(~daniel@phunq.net) has joined #tux3 2009-07-12 08:24 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-12 10:19 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-12 11:41 -!- ajonat(~ajonat@190.48.123.169) has joined #tux3 2009-07-12 13:43 -!- pgquiles(~pgquiles@125.Red-81-39-35.dynamicIP.rima-tde.net) has joined #tux3 2009-07-12 16:18 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-12 18:13 -!- flipz_(~daniel@phunq.net) has joined #tux3 2009-07-12 22:19 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-12 22:38 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-12 23:04 -!- flipz_(~daniel@phunq.net) has joined #tux3 2009-07-12 23:10 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-12 23:32 -!- flipz_(~daniel@phunq.net) has joined #tux3 2009-07-12 23:54 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-13 00:11 -!- ajonat(~ajonat@190.48.127.170) has joined #tux3 2009-07-13 00:13 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-13 00:13 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-13 00:52 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-13 01:16 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-13 01:22 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-13 01:43 -!- anderson(~anderson@eclipse.knight-rider.org) has joined #tux3 2009-07-13 01:53 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-13 03:25 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-13 03:41 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-13 04:19 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-13 10:20 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-13 11:10 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-13 15:14 -!- ajonat(~ajonat@190.48.118.2) has joined #tux3 2009-07-13 15:42 -!- 
tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-13 16:21 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-13 20:17 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-13 22:21 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-13 22:30 -!- tim_dimm(~timothyhu@pool-71-107-51-9.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-13 23:09 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-13 23:32 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-13 23:46 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-14 00:17 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-14 00:21 -!- Sypil(~anderson@eclipse.knight-rider.org) has joined #tux3 2009-07-14 00:31 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-14 00:43 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-14 01:06 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-14 01:33 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-14 02:03 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-14 02:21 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-14 03:47 -!- edt(~Ed@dsl-216-221-39-222.aei.ca) has joined #tux3 2009-07-14 04:20 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-14 04:21 -!- Sypil(~anderson@eclipse.knight-rider.org) has joined #tux3 2009-07-14 04:43 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-14 10:22 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-14 10:33 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-14 11:34 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-07-14 15:35 -!- ajonat(~ajonat@190.48.103.71) has joined #tux3 2009-07-14 16:20 -!- tim_dimm_(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-14 16:22 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-14 22:21 -!- 
dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-14 22:25 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-14 23:35 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-14 23:52 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-15 00:04 -!- Sypil(~anderson@eclipse.knight-rider.org) has joined #tux3 2009-07-15 00:13 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-15 01:00 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-15 01:14 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-15 01:45 -!- npmccallum(~npmccallu@host81-129-177-194.range81-129.btcentralplus.com) has joined #tux3 2009-07-15 03:34 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-15 03:48 -!- npmccallum(~npmccallu@host81-129-177-194.range81-129.btcentralplus.com) has joined #tux3 2009-07-15 04:21 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-15 04:44 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-15 05:17 -!- ajonat(~ajonat@190.48.126.105) has joined #tux3 2009-07-15 05:52 -!- flips_(~daniel@phunq.net) has joined #tux3 2009-07-15 05:52 good morning 2009-07-15 06:04 -!- geos_one(~chatzilla@213.229.35.178) has joined #tux3 2009-07-15 06:24 top of the morning to ya 2009-07-15 06:24 -!- npmccallum(~npmccallu@host81-129-178-174.range81-129.btcentralplus.com) has joined #tux3 2009-07-15 07:23 -!- npmccallum_(~npmccallu@host81-129-178-174.range81-129.btcentralplus.com) has joined #tux3 2009-07-15 07:42 -!- flipz_(~daniel@phunq.net) has joined #tux3 2009-07-15 08:12 -!- dcg(~dcg@45.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-07-15 08:15 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-15 08:16 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-15 08:17 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-15 08:21 marcin, there? 
2009-07-15 08:39 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3
2009-07-15 08:54 -!- flipz(~daniel@phunq.net) has joined #tux3
2009-07-15 09:05 -!- flips(~phillips@phunq.net) has joined #tux3
2009-07-15 09:06 -!- flipz(~daniel@phunq.net) has joined #tux3
2009-07-15 09:14 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3
2009-07-15 10:22 flipz, i'm back
2009-07-15 10:22 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3
2009-07-15 10:39 hi marcin
2009-07-15 10:40 so... how did the tux3 tree structure tutorial work out?
2009-07-15 10:40 so i've been going through tux3graph.c for 'entertainment'
2009-07-15 10:40 right, that shows it nicely
2009-07-15 10:40 i'm starting to get it slowly
2009-07-15 10:40 mostly it's just me flipping back and forth to underlying structs
2009-07-15 10:41 i just gotta draw it all out
2009-07-15 10:41 structs tell me the most actually because i know what kind of info goes where
2009-07-15 10:41 running tux3graph to make some pictures would help a lot
2009-07-15 10:41 it's easy
2009-07-15 10:42 uhm yea right ;)
2009-07-15 10:42 i'm just gonna go through it slowly to see how it all connects
2009-07-15 10:43 so what's 'magic' in dleaf?
2009-07-15 10:43 some of the variables aren't exactly well commented :/
2009-07-15 12:12 -!- ajonat(~ajonat@190.48.115.57) has joined #tux3
2009-07-15 12:45 marcin, dleaf magic is for debug checking and later, to help reconstruction of damaged filesystems
2009-07-15 12:45 just a magic number to scan for
2009-07-15 12:59 so the reconstruction would consist of reading the raw disk, stumbling upon a magic number and using that as a marker of where parts of the data are?
2009-07-15 13:04 yes 2009-07-15 13:04 it's more complex of course 2009-07-15 16:22 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-15 17:56 -!- ajonat(~ajonat@190.48.115.57) has joined #tux3 2009-07-15 18:59 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-07-15 19:05 -!- Sypil(~anderson@eclipse.knight-rider.org) has joined #tux3 2009-07-15 20:27 hey folks 2009-07-15 22:21 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-15 22:59 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-15 22:59 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-16 00:12 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-16 00:15 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-16 01:33 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-16 01:42 -!- npmccallum(~npmccallu@host81-129-177-117.range81-129.btcentralplus.com) has joined #tux3 2009-07-16 02:09 -!- flipz_(~daniel@phunq.net) has joined #tux3 2009-07-16 02:13 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-16 02:22 -!- Sypil(~anderson@eclipse.knight-rider.org) has joined #tux3 2009-07-16 02:53 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-16 02:58 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-16 03:27 -!- flipz_(~daniel@phunq.net) has joined #tux3 2009-07-16 04:22 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-16 05:03 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-16 05:04 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-16 05:16 -!- flipz_(~daniel@phunq.net) has joined #tux3 2009-07-16 05:22 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-16 05:30 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-16 05:46 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-16 06:42 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-16 09:22 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-16 09:39 -!- 
tim_dimm_(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-16 09:53 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-16 10:09 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-16 10:09 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-16 10:22 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-16 15:35 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-16 15:53 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-16 16:22 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-16 17:33 -!- ajonat(~ajonat@190.48.109.116) has joined #tux3 2009-07-16 19:59 hey bh 2009-07-16 22:23 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-16 23:00 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-17 01:30 hey flips 2009-07-17 01:58 -!- npmccallum(~npmccallu@host81-129-177-11.range81-129.btcentralplus.com) has joined #tux3 2009-07-17 04:23 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-17 05:35 -!- npmccallum(~npmccallu@host81-129-177-11.range81-129.btcentralplus.com) has joined #tux3 2009-07-17 07:12 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-17 09:36 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-17 10:23 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-17 11:11 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-17 13:10 -!- pgquiles(~pgquiles@125.Red-81-39-35.dynamicIP.rima-tde.net) has joined #tux3 2009-07-17 14:24 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-17 14:56 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-17 15:31 -!- ajonat(~ajonat@190.48.100.81) has 
joined #tux3 2009-07-17 15:48 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-17 16:23 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-17 16:27 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-17 18:40 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-17 22:23 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-18 04:23 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-18 07:10 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-18 07:25 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-18 09:21 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-18 10:05 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-18 10:24 -!- dagle2(~dagle@host162-104.bornet.net) has joined #tux3 2009-07-18 11:01 good morning 2009-07-18 11:37 -!- tim_dimm(~timothyhu@pool-71-160-32-80.lsanca.dsl-w.verizon.net) has joined #tux3 2009-07-18 14:51 -!- ajonat(~ajonat@190.48.93.205) has joined #tux3 2009-07-18 14:53 -!- ajonat(~ajonat@190.48.93.205) has joined #tux3 2009-07-18 15:26 -!- ajonat(~ajonat@190.48.93.205) has joined #tux3 2009-07-18 20:56 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-07-18 23:41 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-07-18 23:41 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-18 23:41 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-07-19 02:00 -!- pgquiles(~pgquiles@125.Red-81-39-35.dynamicIP.rima-tde.net) has joined #tux3 2009-07-19 03:49 -!- edt(~Ed@dsl-216-221-39-222.aei.ca) has joined #tux3 2009-07-19 11:21 -!- pythonstar(~kavli@c-0df1e455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-07-19 13:23 -!- ajonat(~ajonat@190.48.93.205) has joined 
#tux3 2009-07-19 16:14 -!- pgquiles(~pgquiles@125.Red-81-39-35.dynamicIP.rima-tde.net) has joined #tux3 2009-07-19 16:15 -!- data(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-07-19 16:15 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-07-19 16:16 -!- persson_(persson@nescafe.bsnet.se) has joined #tux3 2009-07-19 16:16 -!- ajonat(~ajonat@190.48.93.205) has joined #tux3 2009-07-19 16:24 -!- ChanServ changed topic to "http://tux3.org ~ git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git" 2009-07-19 18:51 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-07-19 23:30 -!- RazvanM(~RazvanM@pool-173-67-59-221.bltmmd.east.verizon.net) has joined #tux3 2009-07-19 23:35 -!- ajonat(~ajonat@190.48.93.205) has joined #tux3 2009-07-19 23:50 -!- ajonat_(~ajonat@190.48.123.69) has joined #tux3 2009-07-20 00:45 -!- pgquiles_(~pgquiles@201.Red-83-44-238.dynamicIP.rima-tde.net) has joined #tux3 2009-07-20 05:05 -!- pythonstar(~kavli@c-0df1e455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-07-20 09:02 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-20 10:45 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-07-20 11:54 -!- ajonat(~ajonat@190.48.102.2) has joined #tux3 2009-07-20 13:20 -!- cwood(~cwood@66.151.59.138) has joined #tux3 2009-07-20 14:26 -!- ajonat(~ajonat@190.48.123.16) has joined #tux3 2009-07-20 14:53 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-20 15:13 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-20 15:30 -!- tim_dimm_(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-20 15:54 -!- tim_dimm(~timothyhu@rrcs-64-183-50-58.west.biz.rr.com) has joined #tux3 2009-07-20 16:38 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-20 18:05 -!- edt(~Ed@dsl-216-221-39-222.aei.ca) has joined #tux3 
2009-07-20 19:10 -!- pgquiles_(~pgquiles@201.Red-83-44-238.dynamicIP.rima-tde.net) has joined #tux3 2009-07-20 19:10 -!- persson_(persson@nescafe.bsnet.se) has joined #tux3 2009-07-20 19:10 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-07-20 19:10 -!- data(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-07-20 19:14 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-07-20 19:20 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-07-20 22:14 -!- ajonat(~ajonat@190.48.123.16) has joined #tux3 2009-07-20 23:14 -!- ajonat_(~ajonat@190.48.124.40) has joined #tux3 2009-07-20 23:36 -!- RazvanM(~RazvanM@pool-173-67-59-221.bltmmd.east.verizon.net) has joined #tux3 2009-07-21 05:10 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-21 05:27 -!- npmccallum(~npmccallu@static-72-81-253-234.bltmmd.fios.verizon.net) has joined #tux3 2009-07-21 05:58 -!- npmccallum_(~npmccallu@static-72-81-253-234.bltmmd.fios.verizon.net) has joined #tux3 2009-07-21 06:45 -!- pgquiles(~pgquiles@201.Red-83-44-238.dynamicIP.rima-tde.net) has joined #tux3 2009-07-21 06:51 -!- npmccallum(~npmccallu@static-72-81-253-234.bltmmd.fios.verizon.net) has joined #tux3 2009-07-21 12:31 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-21 12:31 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-07-21 14:05 -!- ajonat(~ajonat@190.48.122.126) has joined #tux3 2009-07-21 18:02 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-21 18:37 -!- edt(~Ed@dsl-216-221-39-222.aei.ca) has joined #tux3 2009-07-21 20:19 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-21 20:54 -!- tim_dimm_(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-21 20:57 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-21 21:39 -!- ajonat(~ajonat@190.48.122.126) has 
joined #tux3 2009-07-21 23:02 -!- RazvanM(~RazvanM@pool-173-67-59-221.bltmmd.east.verizon.net) has joined #tux3 2009-07-22 05:45 -!- npmccallum(~npmccallu@static-72-81-253-234.bltmmd.fios.verizon.net) has joined #tux3 2009-07-22 08:38 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-22 09:07 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-22 10:03 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-07-22 10:11 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-22 11:48 -!- tim_dimm_(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-22 12:12 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-22 12:34 -!- tim_dimm_(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-22 12:48 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-22 14:21 -!- tim_dimm_(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-22 15:13 -!- ajonat(~ajonat@190.48.126.121) has joined #tux3 2009-07-22 15:26 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-22 16:16 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-07-22 16:16 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-07-22 16:16 -!- pythonstar(~kavli@c-0df1e455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-07-22 16:16 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-07-22 16:16 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-07-22 16:16 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-22 16:16 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-22 16:16 -!- edt(~Ed@dsl-216-221-39-222.aei.ca) has joined #tux3 2009-07-22 16:16 -!- 
ajonat(~ajonat@190.48.126.121) has joined #tux3 2009-07-22 16:16 -!- persson_(persson@nescafe.bsnet.se) has joined #tux3 2009-07-22 16:16 -!- data(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-07-22 16:16 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-07-22 17:41 -!- ajonat(~ajonat@190.48.113.226) has joined #tux3 2009-07-22 19:59 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-22 20:15 -!- ajonat(~ajonat@190.48.125.194) has joined #tux3 2009-07-22 20:31 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-22 21:05 -!- ajonat(~ajonat@190.48.125.194) has joined #tux3 2009-07-22 21:52 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-22 21:59 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-22 22:11 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-23 00:20 -!- RazvanM(~RazvanM@pool-173-67-59-221.bltmmd.east.verizon.net) has joined #tux3 2009-07-23 05:25 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-07-23 06:44 -!- npmccallum(~npmccallu@static-72-81-253-234.bltmmd.fios.verizon.net) has joined #tux3 2009-07-23 09:15 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-23 10:53 -!- tim_dimm_(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-23 10:55 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-23 10:57 -!- tim_dimm_(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-23 11:45 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-07-23 14:33 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-23 14:34 -!- dcg(~dcg@84.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-07-23 14:43 -!- 
ajonat(~ajonat@190.48.125.194) has joined #tux3 2009-07-23 15:02 -!- dcg(~dcg@84.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-07-23 17:22 -!- bd_(~foo@satoko.is.fushizen.net) has left #tux3 2009-07-23 17:23 -!- tim_dimm_(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-23 17:28 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-23 17:40 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-23 17:49 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-23 17:52 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-23 18:22 -!- ajonat(~ajonat@190.48.125.194) has joined #tux3 2009-07-23 21:07 -!- ajonat(~ajonat@190.48.125.194) has joined #tux3 2009-07-23 21:09 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-07-23 22:48 -!- RazvanM(~RazvanM@pool-173-67-59-221.bltmmd.east.verizon.net) has joined #tux3 2009-07-24 03:49 -!- edt(~Ed@dsl-216-221-39-222.aei.ca) has joined #tux3 2009-07-24 07:58 -!- dcg(~dcg@74.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-07-24 08:22 -!- dcg(~dcg@74.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-07-24 08:53 -!- dcg(~dcg@74.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-07-24 08:59 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-24 09:06 -!- dcg(~dcg@74.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-07-24 09:15 -!- dcg(~dcg@74.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-07-24 09:18 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-07-24 09:38 -!- dcg(~dcg@142.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-07-24 10:55 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-24 12:01 -!- dcg(~dcg@142.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-07-24 12:20 -!- dcg(~dcg@250.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-07-24 13:12 -!- 
ajonat(~ajonat@190.48.107.26) has joined #tux3 2009-07-24 13:33 -!- npmccallum(~npmccallu@201.198.34.214) has joined #tux3 2009-07-24 13:54 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-24 15:12 -!- pgquiles(~pgquiles@83.Red-83-42-63.dynamicIP.rima-tde.net) has joined #tux3 2009-07-24 16:11 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-07-24 16:12 -!- unambiguous(~munchies@linux2.e-insites.com) has joined #tux3 2009-07-24 16:12 -!- unambiguous(~munchies@linux2.e-insites.com) has left #tux3 2009-07-24 19:36 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-24 19:49 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-24 23:41 -!- RazvanM(~RazvanM@pool-173-67-59-221.bltmmd.east.verizon.net) has joined #tux3 2009-07-24 23:57 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-07-25 05:34 -!- edt(~Ed@dsl-216-221-39-222.aei.ca) has joined #tux3 2009-07-25 08:57 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-25 09:46 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-25 11:52 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-25 12:12 -!- ajonat(~ajonat@190.48.107.26) has joined #tux3 2009-07-25 13:29 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-07-25 13:53 -!- ajonat(~ajonat@190.48.103.240) has joined #tux3 2009-07-25 15:01 -!- ajonat_(~ajonat@190.48.115.27) has joined #tux3 2009-07-25 15:35 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-07-25 20:40 -!- npmccallum(~npmccallu@201.198.34.214) has joined #tux3 2009-07-25 20:43 -!- npmccallum_(~npmccallu@66.101.199.54) has joined #tux3 2009-07-25 21:42 -!- marcin_(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-07-25 21:44 -!- 
marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-07-25 22:07 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-07-25 22:32 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-07-26 00:11 -!- RazvanM(~RazvanM@pool-173-67-59-221.bltmmd.east.verizon.net) has joined #tux3 2009-07-26 03:32 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-07-26 05:51 -!- pgquiles(~pgquiles@83.Red-83-42-63.dynamicIP.rima-tde.net) has joined #tux3 2009-07-26 08:49 -!- npmccallum_(~npmccallu@201.198.34.214) has joined #tux3 2009-07-26 09:01 -!- npmccallum_(~npmccallu@66.101.199.54) has joined #tux3 2009-07-26 10:30 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-07-26 10:44 wc 2009-07-26 10:44 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has left #tux3 2009-07-26 10:46 word count! 2009-07-26 11:40 -!- dcg(~dcg@17.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-07-26 11:53 -!- ajonat(~ajonat@190.48.94.116) has joined #tux3 2009-07-26 12:31 -!- dcg(~dcg@19.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-07-26 13:47 -!- red(~red@cs27066117.pp.htv.fi) has joined #tux3 2009-07-26 13:59 -!- red(~red@cs27066117.pp.htv.fi) has joined #tux3 2009-07-27 08:14 -!- npmccallum(~npmccallu@66.101.199.54) has joined #tux3 2009-07-27 10:04 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-27 10:52 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-07-27 11:07 -!- npmccallum(~npmccallu@ns.frutex.net) has joined #tux3 2009-07-27 13:30 -!- ajonat(~ajonat@190.48.103.134) has joined #tux3 2009-07-27 13:51 -!- npmccallum_(~npmccallu@ns.frutex.net) has joined #tux3 2009-07-27 14:02 -!- ajonat_(~ajonat@190.48.109.183) has joined #tux3 2009-07-28 00:17 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-07-28 00:25 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-07-28 
01:28 -!- ajonat(~ajonat@190.48.109.183) has joined #tux3 2009-07-28 01:38 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-07-28 01:38 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-07-28 01:38 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-07-28 01:38 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-07-28 01:38 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-07-28 01:38 -!- edt(~Ed@dsl-216-221-39-222.aei.ca) has joined #tux3 2009-07-28 01:38 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-28 01:38 -!- ajonat(~ajonat@190.48.109.183) has joined #tux3 2009-07-28 01:38 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-07-28 01:38 -!- persson_(persson@nescafe.bsnet.se) has joined #tux3 2009-07-28 04:50 -!- pythonstar(~kavli@c-0df1e455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-07-28 07:03 -!- pgquiles(~pgquiles@201.Red-83-44-238.dynamicIP.rima-tde.net) has joined #tux3 2009-07-28 07:35 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-07-28 07:35 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-07-28 07:35 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-07-28 07:35 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-07-28 07:44 -!- flips(~phillips@phunq.net) has joined #tux3 2009-07-28 07:45 -!- edt(~Ed@dsl-216-221-39-222.aei.ca) has joined #tux3 2009-07-28 07:45 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-07-28 07:49 -!- ajonat(~ajonat@190.48.122.148) has joined #tux3 2009-07-28 07:59 -!- npmccallum(~npmccallu@ns.frutex.net) has joined #tux3 2009-07-28 08:07 -!- ajonat(~ajonat@190.48.109.212) has joined #tux3 2009-07-28 08:40 -!- ajonat(~ajonat@190.48.98.51) has joined #tux3 2009-07-28 08:59 -!- ajonat_(~ajonat@190.48.98.31) has joined #tux3 2009-07-28 09:04 -!- _ajonat(~ajonat@190.48.97.132) has joined #tux3 2009-07-28 09:16 
-!- ajonat(~ajonat@190.48.99.185) has joined #tux3 2009-07-28 09:46 -!- ajonat_(~ajonat@190.48.96.149) has joined #tux3 2009-07-28 09:48 -!- npmccallum(~npmccallu@ns.frutex.net) has joined #tux3 2009-07-28 09:56 -!- _ajonat(~ajonat@190.48.97.71) has joined #tux3 2009-07-28 11:35 -!- ajonat(~ajonat@190.48.122.65) has joined #tux3 2009-07-28 11:37 -!- dcg(~dcg@154.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-07-28 12:49 -!- dcg(~dcg@154.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-07-28 14:35 -!- npmccallum(~npmccallu@190.10.19.115) has joined #tux3 2009-07-28 15:02 -!- dcg(~dcg@85.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-07-28 16:25 -!- ajonat(~ajonat@190.48.116.12) has joined #tux3 2009-07-28 19:15 -!- ajonat(~ajonat@190.48.99.202) has joined #tux3 2009-07-29 02:21 -!- kunir(~kunir@ssfi.movial.fi) has joined #tux3 2009-07-29 03:49 -!- edt(~Ed@dsl-216-221-39-222.aei.ca) has joined #tux3 2009-07-29 06:00 -!- ajonat(~ajonat@190.48.99.202) has joined #tux3 2009-07-29 07:38 -!- npmccallum(~npmccallu@66.101.199.54) has joined #tux3 2009-07-29 09:09 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-07-29 09:10 -!- red(~red@cs27066117.pp.htv.fi) has joined #tux3 2009-07-29 09:16 -!- _kunir_(~red@cs27066117.pp.htv.fi) has joined #tux3 2009-07-29 09:30 -!- kunir(~kunir@cs27066117.pp.htv.fi) has joined #tux3 2009-07-29 10:10 -!- kunir(~kunir@cs27066117.pp.htv.fi) has joined #tux3 2009-07-29 14:04 -!- dcg(~dcg@150.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-07-29 15:41 -!- data(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-07-29 16:26 -!- ajonat(~ajonat@190.48.99.202) has joined #tux3 2009-07-29 20:45 -!- samlh(~sam@99.144.113.17) has joined #tux3 2009-07-29 20:48 -!- samlh(~sam@99.144.113.17) has joined #tux3 2009-07-29 22:48 -!- samlh(~sam@99.144.113.17) has joined #tux3 2009-07-30 01:34 -!- kunir(~kunir@ssfi.movial.fi) has joined #tux3 2009-07-30 08:08 -!- dcg(~dcg@10.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-07-30 
09:09 -!- dcg(~dcg@10.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-07-30 10:54 -!- pgquiles(~pgquiles@201.Red-83-44-238.dynamicIP.rima-tde.net) has joined #tux3 2009-07-30 11:25 -!- dcg(~dcg@10.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-07-30 12:36 -!- dcg(~dcg@58.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-07-30 12:50 -!- kunir(~kunir@cs27066117.pp.htv.fi) has joined #tux3 2009-07-30 15:30 -!- dcg_(~dcg@67.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-07-30 15:58 -!- ajonat(~ajonat@190.48.99.202) has joined #tux3 2009-07-30 17:16 -!- ajonat_(~ajonat@190.48.120.98) has joined #tux3 2009-07-30 18:07 -!- _ajonat(~ajonat@190.48.122.136) has joined #tux3 2009-07-30 18:28 -!- _ajonat(~ajonat@190.48.109.212) has joined #tux3 2009-07-30 18:49 -!- _ajonat(~ajonat@190.48.125.84) has joined #tux3 2009-07-30 20:41 -!- ajonat(~ajonat@190.48.117.61) has joined #tux3 2009-07-31 06:26 -!- setheus(~setheus@pool-173-74-124-37.dllstx.fios.verizon.net) has joined #tux3 2009-07-31 10:16 -!- henrix(~Luis@ip-89-234-125-123.lmk.metro.digiweb.ie) has joined #tux3 2009-07-31 10:38 -!- reinwald(~bach@121.8.124.42) has joined #tux3 2009-07-31 10:38 -!- reinwald(~bach@121.8.124.42) has left #tux3 2009-07-31 10:45 -!- pawlikow(~merline@222.66.110.174) has joined #tux3 2009-07-31 10:45 -!- pawlikow(~merline@222.66.110.174) has left #tux3 2009-07-31 10:46 -!- abeu(~accounti@220.189.227.2) has joined #tux3 2009-07-31 10:46 -!- abeu(~accounti@220.189.227.2) has left #tux3 2009-07-31 13:23 -!- dcg(~dcg@141.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-07-31 14:33 -!- edt(~Ed@dsl-216-221-36-170.aei.ca) has joined #tux3 2009-07-31 17:51 -!- ajonat(~ajonat@190.48.103.203) has joined #tux3 2009-07-31 19:14 -!- ajonat(~ajonat@190.48.99.226) has joined #tux3 2009-07-31 20:00 -!- 
edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-07-31 21:29 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-07-31 23:56 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-08-01 05:05 -!- iwanttux3(577a0a33@webchat.mibbit.com) has joined #tux3 2009-08-01 05:05 hi 2009-08-01 05:05 is there anybody in here? 2009-08-01 05:10 my english is not very good - can anybody tell me when tux3 will be released in a stable and useable version? 2009-08-01 05:12 -!- iwanttux3(577a0a33@webchat.mibbit.com) has left #tux3 2009-08-01 06:51 -!- henrix(~Luis@ip-89-234-125-123.lmk.metro.digiweb.ie) has joined #tux3 2009-08-01 06:52 Hi. I'm having some problems creating a simple fs image in a file using tux3 command 2009-08-01 06:53 basically, after running tux3 mkfs image.fs i got 2009-08-01 06:53 make tux3 filesystem on testdev (0x100000 bytes) 2009-08-01 06:53 make_tux3: create bitmap 2009-08-01 06:53 make_tux3: reserve superblock 2009-08-01 06:53 filemap_extent_io: read inode 0x3 block 0x0 2009-08-01 06:53 ---- extent 0x0/1 ---- 2009-08-01 06:53 map_region: --- index 0, limit 1 --- 2009-08-01 06:53 filemap_extent_io: extent 0x0/1 => 0 2009-08-01 06:53 filemap_extent_io: block 0x0 => 0 2009-08-01 06:53 make_tux3: reserve 0 2009-08-01 06:53 make_tux3: reserve 1 2009-08-01 06:53 make_tux3: create inode table 2009-08-01 06:53 balloc: balloc extent -> [2/1] 2009-08-01 06:53 balloc: balloc extent -> [3/1] 2009-08-01 06:53 ileaf_init: initialize inode leaf 0xe01600 2009-08-01 06:53 alloc_empty_btree: root at 2 2009-08-01 06:53 alloc_empty_btree: leaf at 3 2009-08-01 06:53 Segmentation fault 2009-08-01 06:53 any ideas? 2009-08-01 06:53 btw, code is from the mercurial repo 2009-08-01 07:01 yes 2009-08-01 07:02 the cause seems to be incomplete of defer root allocation 2009-08-01 07:03 hmm, and what can i do about it? i mean, i know nothing on tux3 and i was just taking a look at it 2009-08-01 07:04 just debug it, right? 
:) 2009-08-01 07:04 it's good :) 2009-08-01 07:05 heh 2009-08-01 07:05 well, I have the patch for it, it still dirty though 2009-08-01 07:05 is it on the mailing list? 2009-08-01 07:05 no 2009-08-01 07:06 it is in my machine 2009-08-01 07:07 ok. so... is there a more stable version that i could use to create the fs? 2009-08-01 07:09 changeset 1096 would be good 2009-08-01 07:09 it's before merging atomic commit 2009-08-01 07:10 ok, i'll try it then. thanks hirofumi 2009-08-01 07:11 no problem 2009-08-01 07:11 or, fix it temporary with some patches 2009-08-01 07:33 -!- kunir(~kunir@cs27066117.pp.htv.fi) has joined #tux3 2009-08-01 08:30 -!- henrix(~Luis@ip-89-234-125-123.lmk.metro.digiweb.ie) has left #tux3 2009-08-01 09:13 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-01 09:15 -!- npmccallum_(~npmccallu@76.177.118.80) has joined #tux3 2009-08-01 13:51 folks 2009-08-01 13:51 hasn't been life on this channel in a long time, good to see some chatters 2009-08-01 13:51 chatter 2009-08-02 04:58 -!- dcg(~dcg@29.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-08-02 11:29 -!- kunir(~kunir@cs27066117.pp.htv.fi) has joined #tux3 2009-08-02 15:20 -!- pgquiles(~pgquiles@59.Red-79-146-251.dynamicIP.rima-tde.net) has joined #tux3 2009-08-02 19:00 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-03 02:48 -!- pgquiles(~pgquiles@59.Red-79-146-251.dynamicIP.rima-tde.net) has joined #tux3 2009-08-03 06:21 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-03 08:07 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-08-03 08:29 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-08-03 08:48 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-03 09:52 -!- kunir(~kunir@cs27066117.pp.htv.fi) has joined #tux3 2009-08-03 15:33 -!- ajonat(~ajonat@190.48.124.240) has joined #tux3 2009-08-03 23:00 -!- kunir(~kunir@ssfi.movial.fi) has joined #tux3 2009-08-04 02:54 -!- 
pgquiles(~pgquiles@59.Red-79-146-251.dynamicIP.rima-tde.net) has joined #tux3 2009-08-04 07:10 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-04 10:33 -!- bagwill(~chatzilla@honga.nist.gov) has joined #tux3 2009-08-04 10:36 -!- bagwill(~chatzilla@honga.nist.gov) has left #tux3 2009-08-04 11:46 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-04 12:33 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-08-05 01:23 -!- pythonstar(~kavli@c-0df1e455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-08-05 03:09 -!- flips(~phillips@phunq.net) has joined #tux3 2009-08-05 03:12 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-08-05 08:08 -!- kunir(~kunir@cs27066117.pp.htv.fi) has joined #tux3 2009-08-05 08:32 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-08-05 08:47 -!- kunir(~kunir@cs27066117.pp.htv.fi) has joined #tux3 2009-08-05 10:10 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-08-05 10:25 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-08-05 14:39 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-05 20:15 hey 2009-08-05 20:16 ACTION is back 2009-08-05 20:42 digging through a huge pile of mail 2009-08-05 20:42 this is going to take a while 2009-08-05 21:30 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-05 23:17 hey flips 2009-08-05 23:17 long time no chat 2009-08-05 23:29 hope things are well 2009-08-06 01:25 -!- pgquiles(~pgquiles@59.Red-79-146-251.dynamicIP.rima-tde.net) has joined #tux3 2009-08-06 11:22 -!- cydork(~vihang@59.184.18.147) has joined #tux3 2009-08-06 11:39 -!- bd_(~foo@2001:470:1f07:61f::feed:f00d) has joined #tux3 2009-08-06 14:01 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-08-06 15:22 -!- ajonat(~ajonat@190.48.107.33) has joined #tux3 2009-08-06 15:50 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-08-06 17:43 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-06 18:29 -!- 
ajonat_(~ajonat@190.48.118.107) has joined #tux3 2009-08-06 20:08 -!- cydork(~vihang@59.184.42.197) has joined #tux3 2009-08-07 00:44 -!- pgquiles(~pgquiles@59.Red-79-146-251.dynamicIP.rima-tde.net) has joined #tux3 2009-08-07 03:53 -!- pgquiles_(~pgquiles@89.Red-83-40-82.dynamicIP.rima-tde.net) has joined #tux3 2009-08-07 04:51 -!- pgquiles_(~pgquiles@89.Red-83-40-82.dynamicIP.rima-tde.net) has joined #tux3 2009-08-07 04:54 -!- pgquiles_(~pgquiles@89.Red-83-40-82.dynamicIP.rima-tde.net) has joined #tux3 2009-08-07 04:56 -!- pgquiles_(~pgquiles@89.Red-83-40-82.dynamicIP.rima-tde.net) has joined #tux3 2009-08-07 04:58 -!- pgquiles_(~pgquiles@89.Red-83-40-82.dynamicIP.rima-tde.net) has joined #tux3 2009-08-07 04:59 -!- pgquiles_(~pgquiles@89.Red-83-40-82.dynamicIP.rima-tde.net) has joined #tux3 2009-08-07 05:59 -!- flipz_(~daniel@phunq.net) has joined #tux3 2009-08-07 05:59 good morning 2009-08-07 06:00 hmm, too many flipzs 2009-08-07 06:01 there we go 2009-08-07 09:22 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-08-07 15:15 hey flipz 2009-08-07 17:31 hi bh 2009-08-07 17:39 how's it going ? 2009-08-07 17:44 pretty busy 2009-08-07 18:39 -!- ajonat(~ajonat@190.48.106.75) has joined #tux3 2009-08-07 22:28 -!- ajonat(~ajonat@190.48.127.222) has joined #tux3 2009-08-07 23:05 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-08-08 11:02 good morning 2009-08-08 11:02 hi 2009-08-08 11:03 got an email to write 2009-08-08 11:03 to lkml 2009-08-08 11:03 answer ingo 2009-08-08 11:04 i see 2009-08-08 14:21 -!- ajonat(~ajonat@190.48.127.222) has joined #tux3 2009-08-08 15:59 hirofumi, still there? 
2009-08-08 16:35 hi 2009-08-08 16:47 ok, lkml mail is sent 2009-08-08 16:47 hirofumi, I was going to send it to you for proofreading, but I think it is ok as it is 2009-08-08 16:48 http://lkml.org/lkml/2009/8/8/162 2009-08-08 16:48 yes 2009-08-08 16:48 basically, it is mainly a public request for a sponsor for you 2009-08-08 16:48 I'll read it on tux3-ml or lkml 2009-08-08 16:48 I didn't think you would mind :) 2009-08-08 16:48 eh? 2009-08-08 16:49 see if it's ok 2009-08-08 16:49 I apologize in advance if not 2009-08-08 16:52 next thing on the agenda is, picnic with my family, then after that, coding 2009-08-08 16:52 :) 2009-08-08 16:53 well, I guess sponsor request is not strong one 2009-08-08 16:54 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-08 16:54 I'd like to think it slowly for now :) 2009-08-08 16:56 enjoy picnic 2009-08-08 16:58 I'm looking forward to learn about fs design more and again :) 2009-08-08 16:58 from you 2009-08-08 17:04 :) 2009-08-08 17:04 ACTION too :P 2009-08-08 17:05 :) 2009-08-08 23:01 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-08 23:25 -!- alpha_one_x86(~kvirc@62.64.58.41) has joined #tux3 2009-08-08 23:26 Hello, I have some question about tux3, it planned to support: compression, defragmentation on line, check on line? 2009-08-09 00:19 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-08-09 01:16 -!- rkakkar(~rkakkar@122.167.103.242) has joined #tux3 2009-08-09 01:17 -!- rkakkar(~rkakkar@122.167.103.242) has left #tux3 2009-08-09 02:22 flips: reading the thread 2009-08-09 02:26 flips: you shouldn't attack or blame folks, it just polarizes the situation 2009-08-09 02:26 there are ways of communicating without escalating a situation and making it emotional. I can help you with that if you want 2009-08-09 09:16 huh, attack? 
2009-08-09 09:16 you must have read some other post 2009-08-09 09:21 well, I think I don't have good skill of english, as far as I can see it's far from attack 2009-08-09 09:22 we have to admit project is so slow though :) 2009-08-09 09:51 ok, where should I hack now? 2009-08-09 09:52 not sure 2009-08-09 09:52 as usual, it should be the part what you want 2009-08-09 09:53 last commit was "suppress idiotic gcc warning" 2009-08-09 09:53 I must have been in a foul mood that day 2009-08-09 09:53 :) 2009-08-09 09:53 5 weeks ago :( 2009-08-09 09:54 yes 2009-08-09 09:54 well, 5 weeks ago was almost busy state 2009-08-09 09:55 yes 2009-08-09 09:55 I have the patch for BITS_PER_LONG, iirc, I sent it tux3-ml too 2009-08-09 09:56 ah, I have to read a few message on tux3 ml 2009-08-09 09:56 I just cleared out all the spam a couple days ago 2009-08-09 09:56 :) 2009-08-09 09:56 well, small patch 2009-08-09 09:56 kernel/commit.c is where I have to start I think 2009-08-09 09:58 maybe improve the unit test in user/commit.c 2009-08-09 09:58 sounds good 2009-08-09 09:59 segfault in make mkfs 2009-08-09 09:59 however, to work commit, iirc, some works is still needed 2009-08-09 09:59 yes 2009-08-09 09:59 it's known 2009-08-09 09:59 what is it? 
2009-08-09 09:59 we started the logging 2009-08-09 10:00 ok, I'll start there then 2009-08-09 10:00 but, tux3 doesn't have logmap yet 2009-08-09 10:00 and itable is need to defer allocation 2009-08-09 10:01 those are easy to fix with quick hack 2009-08-09 10:01 (gdb) p sb->logmap 2009-08-09 10:01 $1 = (struct inode *) 0x0 2009-08-09 10:01 indeed 2009-08-09 10:02 but, I guess quick hack may make hard situation to fix later 2009-08-09 10:02 yes, better to understand it 2009-08-09 10:03 well, anyway, I was thinking to use stash or something or logmap, instead of page cache 2009-08-09 10:03 and teach empty btree root to btree.c 2009-08-09 10:03 ok, so the issue is, bootstrapping the logging in mkfs 2009-08-09 10:03 right, there is no real reason to use the page cache for the log 2009-08-09 10:04 btw, I have the patches of quick hack though 2009-08-09 10:04 that would be helpful 2009-08-09 10:04 ok, I'll push tarball for it 2009-08-09 10:06 maybe we just want to disable logging during the mkfs 2009-08-09 10:06 it can 2009-08-09 10:07 we are will archiving the writeback stuff to work 2009-08-09 10:07 we are still 2009-08-09 10:08 archiving the writeback stuff? 2009-08-09 10:08 we are still not switching to atomic commit way 2009-08-09 10:08 we have both ways 2009-08-09 10:08 ifdef ATOMIC 2009-08-09 10:08 yes 2009-08-09 10:09 we want to drop writeback if it causes any problem 2009-08-09 10:09 what is the collision in this case? 
2009-08-09 10:09 there is no collision at all 2009-08-09 10:09 atomic commit stuff is just not working 2009-08-09 10:10 and log_*() doesn't have #ifdef ATOMIC 2009-08-09 10:10 ah 2009-08-09 10:11 right, so I just tried to run with !ATOMIC 2009-08-09 10:11 well, it should be work with quick hack 2009-08-09 10:12 http://userweb.kernel.org/~hirofumi/quick-hack.tar.gz 2009-08-09 10:13 we can ifdef ATOMIC out, but it would just help full writeback mode 2009-08-09 10:14 I was thinking we just logging on both mode, and atomic-commit is write it out, and writeback is just free it 2009-08-09 10:14 that is fine 2009-08-09 10:15 it is a step towards dropping writeback mode 2009-08-09 10:15 this quick hack is doing it partly 2009-08-09 10:15 yes 2009-08-09 10:15 it sounds like forward progress then 2009-08-09 10:15 but, writeback mode is much helping me to debug for now 2009-08-09 10:16 that is important 2009-08-09 10:16 yes 2009-08-09 10:16 it can create correct image 2009-08-09 10:16 and I guess it may be compare performance with atomic-commit way 2009-08-09 10:17 well, if it is starting to bother us, we should kill it though 2009-08-09 10:18 with quick-hack I get much further and hit an assert in __destroy_buffers 2009-08-09 10:19 __destroy_buffers: dirty buffer leak, or list corruption? 2009-08-09 10:19 map [0x80660f0] 0/0* 2009-08-09 10:19 __destroy_buffers: Failed assert(list_empty(&lru_buffers))! 2009-08-09 10:19 make: *** [mkfs] Trace/breakpoint trap 2009-08-09 10:20 um... 2009-08-09 10:20 I guess it has finished successfully except for the destroy 2009-08-09 10:20 it seems to logmap didn't invalidate the dirty buffers 2009-08-09 10:21 ah 2009-08-09 10:21 revert p11 and p12? 
2009-08-09 10:22 just a moment 2009-08-09 10:24 p12 is pretty big, just having a look 2009-08-09 10:24 p12 is to debug atomic commit 2009-08-09 10:25 I guess it would break other stuff than debug 2009-08-09 10:31 indeed, make mkfs works 2009-08-09 10:31 ok 2009-08-09 10:31 ok, let me browse the diff for quick-hack for a few minutes 2009-08-09 10:32 yes, those are simple stuff 2009-08-09 10:32 btw, first few patches are just cleanup/bugfix 2009-08-09 10:32 5 weeks ago I was learning quilt from you 2009-08-09 10:32 I need to find all those notes and review 2009-08-09 10:32 :) 2009-08-09 10:33 cleanup_garbage_for_writeback :) 2009-08-09 10:33 nice clear name 2009-08-09 10:34 :) 2009-08-09 10:40 find_first_bit is just an optimization for bitmap output in tux3graph? 2009-08-09 10:41 I didn't find to do it in tux3 2009-08-09 10:41 didn't find function 2009-08-09 10:42 anyway, it's just for bitmap output in tux3graph 2009-08-09 10:42 yes, similar to bitmap_dump but more efficient 2009-08-09 10:43 I can't remember detail, however, maybe I thought bitmap_dump is not enough 2009-08-09 10:43 not important, I'm just getting reoriented 2009-08-09 10:44 ah, license issue 2009-08-09 10:45 ah right, forgot about that 2009-08-09 10:45 gplv2 may make our life easy 2009-08-09 10:45 just because of copy from kernel 2009-08-09 10:46 true 2009-08-09 10:46 and fsf will lose one gplv3 "customer" because of unclear statements re license compatibility 2009-08-09 10:47 that is ok with me 2009-08-09 10:47 ok 2009-08-09 10:47 but I don't think this is urgent, I can fiddle with the license any time, the only important thing is that the kernel code is all gplv2 2009-08-09 10:47 well, so, still pending though, basically we are thinking to use gplv2 2009-08-09 10:48 yes 2009-08-09 10:48 ok 2009-08-09 10:48 actually how about this: make sure every file has at least a gplv2 license, then let lkml reviewers provide their opinions on compatibility 2009-08-09 10:49 what is meaning "at least a gplv2"? 
2009-08-09 10:49 gplv2 or later? 2009-08-09 10:49 gplv2 or gplv3? 2009-08-09 10:50 let's see if gplv3 is even mentioned in kernel 2009-08-09 10:50 it isn't 2009-08-09 10:50 yes 2009-08-09 10:50 so there is no issue with and vs or for kernel files 2009-08-09 10:50 so there is no issue with "and vs or" for kernel files 2009-08-09 10:50 um.. 2009-08-09 10:51 gplv3 can't copy from/to gplv2 2009-08-09 10:51 it probably can if we say it can, so my intention is to post it that way and let the license experts decide 2009-08-09 10:52 I'm interested in what the compatibility issues really are 2009-08-09 10:52 but don't have time to investigate them as deeply as necessary 2009-08-09 10:52 it would lost of gplv3 intention if copy to/from gplv2 for patent, drm, or something 2009-08-09 10:53 I'm thinking it's why gplv3 is incompatible 2009-08-09 10:53 right, I think that is possible, but I don't really want to speculate 2009-08-09 10:54 and I don't want to just give up on gplv3 until somebody demonstrates a convincing reason 2009-08-09 10:54 ok 2009-08-09 10:54 and I don't want to spend a lot of time searching for that convincing reason myself 2009-08-09 10:54 but, it's a challenge and may make flame 2009-08-09 10:54 so... my recommended course of action is to change as little as possible just for now 2009-08-09 10:54 :) 2009-08-09 10:55 there is no issue with the kernel code 2009-08-09 10:55 ah 2009-08-09 10:55 ok 2009-08-09 10:55 and the user/kernel code issue can be formed as a question, not flamebait 2009-08-09 10:56 ok 2009-08-09 10:57 the kernel files seem to mostly have the right license text 2009-08-09 10:57 probably missed a few files 2009-08-09 10:57 I will commit updates for those 2009-08-09 10:57 in practice, we allow to copy from gplv2 to userland 2009-08-09 10:58 yes 2009-08-09 10:58 whether we allow 2009-08-09 10:58 ok 2009-08-09 10:59 that is good because we have already copied thinks like list.h 2009-08-09 10:59 yes 2009-08-09 10:59 so... 
if somebody like fsf tells us we can't do that, then I will remove the gplv3 immediately, leaving only gplv2 2009-08-09 11:00 in this way, we encourage fsf to find a compatible interpretation, if one exists 2009-08-09 11:01 I would prefer that v3 and v2 compatibility issues are real ones, and not just exaggerated issues designed to encourage projects to switch to v3 2009-08-09 11:01 honestory, I don't think fsf can change it though, I guess this would be enough 2009-08-09 11:01 fsf can interpret, which is exactly what they do on their license compatibility web page 2009-08-09 11:01 so, I would like them to interpret a little more deeply in this case 2009-08-09 11:02 instead of us having to do that 2009-08-09 11:02 this seems issue is intented 2009-08-09 11:02 do you mean, an intentionally created issue? 2009-08-09 11:03 I meant, license can't make compatible due to other issues 2009-08-09 11:03 that is not clear to me, so I intend to just wait until somebody does make it clear to me 2009-08-09 11:04 then either remove gplv3 or leave things as they are 2009-08-09 11:04 either way, there is no harm to the project 2009-08-09 11:05 well, license is just for users 2009-08-09 11:05 so, I think we don't have any problem 2009-08-09 11:05 agreed 2009-08-09 11:06 well, so we make sure soon or later, license is useful 2009-08-09 11:06 for users 2009-08-09 11:07 ok, 99 + if (!has_root(itable)) 2009-08-09 11:07 100 + goto skip_itable; <- relevant part of "quick-hack" 2009-08-09 11:07 yes 2009-08-09 11:08 it's just for make_tux3 2009-08-09 11:08 for now 2009-08-09 11:08 I'm thinking whether can't we handle it more cleanly 2009-08-09 11:09 I will think about that too, but it isn't the most important thing 2009-08-09 11:09 yes 2009-08-09 11:09 it is always satisfying to come up with a minimal bootstrap 2009-08-09 11:09 but right now "any bootstrap" will do 2009-08-09 11:09 as long as we both understand it 2009-08-09 11:09 but, I feel code is complex/dirty over and over 2009-08-09 
11:09 yes, that can happen 2009-08-09 11:09 but it isn't horribly complex yet 2009-08-09 11:10 and, it makes hard to change code 2009-08-09 11:10 it deserves a FIXME 2009-08-09 11:10 well, it can 2009-08-09 11:11 but, I'm thinking cleanup things is needed if this code is not prototype 2009-08-09 11:11 the tux3_truncate changes are just to support writeback? 2009-08-09 11:11 this code/this project 2009-08-09 11:12 I agree about cleanup 2009-08-09 11:12 always 2009-08-09 11:12 it makes me feel better about the code for one thing 2009-08-09 11:12 sometimes, after doing a small cleanup, the system just feels faster and more solid, even though it actually didn't change 2009-08-09 11:13 yes 2009-08-09 11:13 and more, the things what I want now are to make easy to change 2009-08-09 11:14 I'm thinking it's speeding devlopment up after all 2009-08-09 11:15 btw, truncate change is which patch? 2009-08-09 11:15 ah, delete_inode-fix.patch? 2009-08-09 11:15 good question, I combined them 2009-08-09 11:15 just a moment 2009-08-09 11:16 if so, we can't nest the change_begin 2009-08-09 11:16 the one that introduces __tux3_truncate 2009-08-09 11:16 ah 2009-08-09 11:16 but we are doing it on several places 2009-08-09 11:16 right, so let's think about that a moment 2009-08-09 11:17 so, I made the one patch to remember it 2009-08-09 11:17 luckly, I remembered it :) 2009-08-09 11:17 if we nest change_begin we can deadlock on taking the write lock on delta_lock? 
2009-08-09 11:17 yes 2009-08-09 11:18 ext3 does allow nested journal transactions 2009-08-09 11:18 that is a complexity it would be nice to avoid 2009-08-09 11:18 for now, I guess we can avoid 2009-08-09 11:19 good, if we can't it is not really hard to change, we just relax a restriction by introducing a recursive lock 2009-08-09 11:19 and we lose some built-in error checking that way 2009-08-09 11:20 and the recursive lock will not be quite as efficient as a non-recursive rw lock 2009-08-09 11:20 yes 2009-08-09 11:21 maybe, there are several issues to handle though 2009-08-09 11:21 lock/unlock_super locking was missing in destroy_inode before these changes? 2009-08-09 11:21 fput() would be call truncate() 2009-08-09 11:21 so, we can't call fput() inside change_* 2009-08-09 11:22 etc. 2009-08-09 11:22 unless there is a matching fget 2009-08-09 11:22 yes 2009-08-09 11:22 lock_super was lifted down to fs driver 2009-08-09 11:23 from vfs in 2.6.31 2009-08-09 11:23 ah, upstream change? 2009-08-09 11:23 yes, for 2.6.31 2009-08-09 11:23 is there a thread on it? 2009-08-09 11:23 and write_super() is not used for data integrity change 2009-08-09 11:23 git would be more easy 2009-08-09 11:24 ok 2009-08-09 11:25 I think I should pull these changes except for 11 and 12 2009-08-09 11:26 some of those shouldn't 2009-08-09 11:26 which? 2009-08-09 11:27 truncate and 2.6.31 2009-08-09 11:27 what is wrong with the truncate change? 2009-08-09 11:28 gitk 875287c..aa7dfb8 2009-08-09 11:28 maybe, this is locking change 2009-08-09 11:28 ok 2009-08-09 11:28 it may be buggy, and not complete 2009-08-09 11:28 I'm not testing it or reviewing it at all 2009-08-09 11:29 it just remember the issue 2009-08-09 11:29 :) 2009-08-09 11:29 truncate change is p05 2009-08-09 11:29 yes 2009-08-09 11:30 truncate is not needed for recent work 2009-08-09 11:30 and 2.6.31 is p06 2009-08-09 11:30 yes 2009-08-09 11:30 so, all but 4 are good? 
2009-08-09 11:30 4 is same with truncate 2009-08-09 11:31 both depends on others, iirc 2009-08-09 11:31 how about I wait for a pullable change :) 2009-08-09 11:32 at least I understand what you did now 2009-08-09 11:32 yes, I think it's good :) 2009-08-09 11:33 first 3 patches are just bugfixes/cleanup, so those would be ok 2009-08-09 11:34 3 patches for mkfs, I'd like to do it clean way 2009-08-09 11:34 however, it can be use more early with FIXME, I think 2009-08-09 11:35 ok, I will use your quick fix for now until the clean patch is ready 2009-08-09 11:35 this allows me to run mkfs and start working 2009-08-09 11:35 I think others are just unneeded 2009-08-09 11:35 ok 2009-08-09 11:35 ok 2009-08-09 11:37 it seems like the easiest thing to do for me is apply all but p11 and p12 locally 2009-08-09 11:38 I recommend to apply 6 patches of first and later 2009-08-09 11:38 truncate stuff may break things 2009-08-09 11:38 first and later? 2009-08-09 11:38 cleanup/bugfix and mkfs stuff 2009-08-09 11:39 1~3, and 8~10 2009-08-09 11:40 http://userweb.kernel.org/~hirofumi/patchset.tar.gz 2009-08-09 11:40 I pushed full patchset 2009-08-09 11:40 patchset/series would be useful 2009-08-09 11:40 ok, I did 1~3, and 8~10 2009-08-09 11:41 mkfs works, I am happy 2009-08-09 11:41 grabbing the full patchset now 2009-08-09 11:41 pushed to public? 2009-08-09 11:41 no 2009-08-09 11:41 ah, ok 2009-08-09 11:41 just for me to get started again 2009-08-09 11:41 a quick reboot :) 2009-08-09 11:43 29 patches in the full set? 
2009-08-09 11:44 probably 2009-08-09 11:44 but, those are for future 2009-08-09 11:44 so, mock dies :) 2009-08-09 11:44 yes :) 2009-08-09 11:44 it was kind of cute, but cute is not always best 2009-08-09 11:45 well, we don't use mock now at all 2009-08-09 11:46 right 2009-08-09 11:46 ok, so you didn't touch replay 2009-08-09 11:46 yes 2009-08-09 11:46 and that is where I should be working 2009-08-09 11:46 ok 2009-08-09 11:47 I was thinking to continue logging stuff 2009-08-09 11:47 however, it was a bit annoy work to change without cleanup 2009-08-09 11:47 ok, I will work in a local repo until I get synced up again 2009-08-09 11:48 ok 2009-08-09 11:48 well, I'll push those 6 patches with FIXME ASAP without big change 2009-08-09 11:49 ok, good 2009-08-09 11:49 and I may work for logging, or cleanup 2009-08-09 11:50 probably, I'll cleanup more or less, because iirc I was needed to copy&paste for logging 2009-08-09 11:52 cleanup is always good 2009-08-09 11:52 yes 2009-08-09 11:53 and it would do before horrible state 2009-08-09 11:57 right, and since I made the claim that tux3 code base is clean and simple, it better stay that way 2009-08-09 11:57 or become that way :) 2009-08-09 11:57 I guess it is still clean compared to ext*, reiser or btrfs 2009-08-09 11:59 um..., I can't say we are clean than ext2 at least 2009-08-09 11:59 http://linux.com/news/software/developer/33087-how-are-open-source-software-projects-surviving-the-economic-recession <- timely article 2009-08-09 11:59 well, it doesn't matter, it has different complexity 2009-08-09 12:00 ext2 has its nasty spots, inode.c for example 2009-08-09 12:00 lockless radix tree updates 2009-08-09 12:00 without any mention of the word "lockless" or "radix tree" 2009-08-09 12:01 ext2/inode.c? 2009-08-09 12:02 yes 2009-08-09 12:03 what radix tree? 2009-08-09 12:03 traditional ufs style index 2009-08-09 12:03 it's a kind of radix tree, according to me 2009-08-09 12:03 index block? 
2009-08-09 12:04 for data block 2009-08-09 12:04 direct/indirect/double/triple 2009-08-09 12:04 ah 2009-08-09 12:04 a radix tree without adding new root 2009-08-09 12:04 which makes it possible to do lockless update 2009-08-09 12:05 so this code is very fast 2009-08-09 12:05 it has lock of page 2009-08-09 12:05 I think 2009-08-09 12:06 the word "block" makes it very hard to scan for "lock" :) 2009-08-09 12:07 well, anyway, I think we are seeing different of complexity 2009-08-09 12:07 different kind of complexity 2009-08-09 12:07 read_lock(&EXT2_I(inode)->i_meta_lock); <- here's one 2009-08-09 12:07 yes 2009-08-09 12:07 it's doing a very simple thing in a very complex way, for a small cpu advantage 2009-08-09 12:08 that is, ext2 is 2009-08-09 12:08 well, not "very" complex, but complex 2009-08-09 12:08 complex enough to be hard to read 2009-08-09 12:08 yes, kind of 2009-08-09 12:09 it was optimized for several years 2009-08-09 12:09 so, every time we get to near that complexity, like in filemap.c, it is a reason to worry, and to simplify 2009-08-09 12:09 and filemap.c was simplified 2009-08-09 12:09 and will be again in the future 2009-08-09 12:09 yes 2009-08-09 12:10 in here, my complexity meant, e.g. btree stuff 2009-08-09 12:10 we have two way to update btree 2009-08-09 12:10 for now, logging will get a little more complex, until it works, then it can go through a phase of simplifying 2009-08-09 12:10 ah 2009-08-09 12:10 which two? 2009-08-09 12:11 e.g. 
itable uses tree_expand to expand 2009-08-09 12:11 but, dtree uses insert_leaf directly 2009-08-09 12:11 it makes hard/complex to update functionality more or less 2009-08-09 12:11 yes, the nature of the itable vs dleaf operations are a little different 2009-08-09 12:12 yes 2009-08-09 12:12 I think they always will be different, but we may find more elegant ways to express that difference 2009-08-09 12:12 but, I think we can do one way 2009-08-09 12:12 get rid of tree_expand I guess 2009-08-09 12:13 yes, probably 2009-08-09 12:13 and, if btree operations has different, I think it would become a bit big problem 2009-08-09 12:14 because, we don't use everything to one btree mode 2009-08-09 12:14 agreed 2009-08-09 12:18 still, map_region and store_attrs complexity is not too bad 2009-08-09 12:19 yes 2009-08-09 12:19 it's why I can try for now :) 2009-08-09 12:22 /* XXX LOCKING probably should have i_meta_lock ?*/ <- ext2 contributers are apparently not sure if the locking is correct 2009-08-09 12:22 that means something is too complex 2009-08-09 12:22 yes 2009-08-09 12:23 let's just lean from it :) 2009-08-09 12:23 right, I'm done with reading ext2 for today 2009-08-09 12:24 about locking, I guess those are needed to documented more or less 2009-08-09 12:25 but, documented is a bit hard to uptodate, and it's hard to add later without reading code deeeeeply 2009-08-09 12:27 in my feeling, locking is simple at first, but people re-use existant lock other purpose to write code rapidly 2009-08-09 12:28 and it seems to become complex more and more 2009-08-09 12:29 however, those doesn't have compatible with good hackers :) 2009-08-09 12:30 right, let's document locking later 2009-08-09 12:31 there is a little doc about locking in kernel/filemap.c 2009-08-09 12:31 added by you, thanks 2009-08-09 12:31 :) 2009-08-09 12:33 time to sleep for me 2009-08-09 12:33 oyasumi 2009-08-09 12:34 and I get the good life cycle for coding 2009-08-09 12:35 I couldn't concentrate to 
think/write code 2009-08-09 12:35 you go to sleep now about the same time as I wake up, in local time 2009-08-09 12:36 yes 2009-08-09 12:37 so I think, if I ping you as soon as I wake up, we will have about 8 hours when we are both awake 2009-08-09 12:37 that is pretty good 2009-08-09 12:38 and psychic(?) thing too, like mortivation(?) 2009-08-09 12:38 psychological 2009-08-09 12:38 psychic means ghosts :) 2009-08-09 12:38 oh :) 2009-08-09 12:39 or telepathy, which cant be useful 2009-08-09 12:39 can be useful 2009-08-09 12:41 I just set my tab display to 4 spaces instead of 8 and the code is much nicer to look at 2009-08-09 12:41 I think I will keep it that way, which means that long lines will increase a little 2009-08-09 12:42 as usual, I will try to not let them get too long 2009-08-09 12:42 yes 2009-08-09 12:43 coding style just feeds fuel to flame, and not productive at all, in my thinking 2009-08-09 12:44 yes 2009-08-09 12:44 well, anyway, I need to get the motivation again somehow 2009-08-09 12:45 motivation(?) 2009-08-09 12:45 working replay may help :) 2009-08-09 12:45 hey flips 2009-08-09 12:45 sponsor might help even more 2009-08-09 12:45 yes, maybe 2009-08-09 12:45 bh, how would you like to be the official tux3 sponsor chaser? 2009-08-09 12:46 well, I'm not sure whether "motivation" is correct work what I want to say :) 2009-08-09 12:46 I know what you're saying 2009-08-09 12:46 when I'm in the right frame of mind, and I can do 500 lines/day of core code and have fun doing it 2009-08-09 12:47 yes 2009-08-09 12:47 flips: that's beyond my abilities 2009-08-09 12:47 bh, how so? 
2009-08-09 12:47 and I want to control it myself 2009-08-09 12:47 the best way of getting a sponsor is to get more stuff wokring 2009-08-09 12:47 bh, we don't need the best way, only a way 2009-08-09 12:48 lots of stuff is already working 2009-08-09 12:48 anybody can always ask for more 2009-08-09 12:48 it never ends 2009-08-09 12:48 well, I'll ask my friends to find sponsor 2009-08-09 12:48 good 2009-08-09 12:48 likely, e.g. linux-foundation may be possible 2009-08-09 12:49 I would not be surprised 2009-08-09 12:49 I'm not sure at all though 2009-08-09 12:50 back to previous one though, I'm finding the way to control(?) my motivation 2009-08-09 12:52 that sounds good 2009-08-09 12:53 yes, but for now, I couldn't find good way though :) 2009-08-09 13:04 I'd wish he would chance the name to a more "enterprise" name. Even ToxicFS would be a better name <- somebody apparently doesn't like tux3 as a name 2009-08-09 13:04 http://www.phoronix.com/forums/showthread.php?t=18453#post86283 2009-08-09 13:05 I guess I will just ignore that 2009-08-09 13:05 could be just a penguin hater 2009-08-09 13:11 :) 2009-08-09 13:20 well, oyasumi :) 2009-08-09 13:52 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-09 19:31 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-08-09 23:37 flips: I don't have that kind of pull in the community, but ted does 2009-08-10 00:30 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-10 02:27 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-10 05:36 -!- debdev(8bb58f22@webchat.mibbit.com) has joined #tux3 2009-08-10 05:36 hey all 2009-08-10 05:37 hi 2009-08-10 05:37 hi hirofumi 2009-08-10 05:37 i just replied to Daniel's mail for your sponsor 2009-08-10 05:37 did you read it? 2009-08-10 05:38 I've read it now, well, please don't care for my sponsor 2009-08-10 05:38 I think it is important. Not just you, may be Daniel can sponsor other developers... 
2009-08-10 05:38 with those donations 2009-08-10 05:39 it may work 2009-08-10 05:39 but, I guess it can be problem too 2009-08-10 05:40 if donations was not enough 2009-08-10 05:41 lets hope for the best :) 2009-08-10 05:41 :) 2009-08-10 05:42 well, I don't think sponsorship is not a top problem for me, at least for now 2009-08-10 05:43 hmm 2009-08-10 05:43 you cant work for free forever :) 2009-08-10 05:44 yes, of course :) 2009-08-10 05:44 so what are you working on? 2009-08-10 05:45 I'm trying to back to tux3/kernel work 2009-08-10 05:45 it's not only time, motivation or something too 2009-08-10 05:46 you working on atomic commit? 2009-08-10 05:46 yes 2009-08-10 05:46 i think thats why you need a sponsor.. motivation :) 2009-08-10 05:46 :) 2009-08-10 06:10 The report "Versioned Pointers in Tux3" is really a good starting point to understand the basics of tux3 2009-08-10 06:10 may be that should go on the tux3 page 2009-08-10 06:10 shapor: ? 2009-08-10 06:11 i meant the report posted by Martin Harwig 2009-08-10 06:19 yes, we may want the wiki to do this easily 2009-08-10 06:19 but, it was pending, or was forgetting :) 2009-08-10 06:28 good morning 2009-08-10 06:28 hi 2009-08-10 06:29 I'm drafting a new technical note on logging + replay 2009-08-10 06:29 our rollup process is an "isomorphism" 2009-08-10 06:30 http://wikipedia.org/wiki/isomorphic 2009-08-10 06:30 a transformation that does preserves information content 2009-08-10 06:31 does not change information content 2009-08-10 06:31 just shifts the form used to represent it 2009-08-10 06:33 in our case, shifts data from log form to tree form without changing the tree rooted in cache 2009-08-10 06:34 hi debdev 2009-08-10 06:35 i see 2009-08-10 06:35 um... 2009-08-10 06:35 ACTION looks for that report 2009-08-10 06:36 which point is import point for isomorphism? 
2009-08-10 06:36 s/import/important/ 2009-08-10 06:36 I don't actually see any "versioned pointers in tux3" report 2009-08-10 06:36 the point about isomorphism is that the rollup flush does not change the filesystem data 2009-08-10 06:37 whereas the delta flush does 2009-08-10 06:37 I assumed it meant pdf of that 2009-08-10 06:37 I don't remember making a pdf of that 2009-08-10 06:37 ah 2009-08-10 06:37 the student paper 2009-08-10 06:37 right, I have to respond to the comments on the algorithms 2009-08-10 06:37 yes, this is just my guess though 2009-08-10 06:38 i see 2009-08-10 06:38 identifying that as an isomorphism gives a little bit of formal basis to it 2009-08-10 06:39 isomorphism means it can change form, A to B and B to A? 2009-08-10 06:39 though I am really an amateur at that kind of analysis, perhaps one of our students could take it further 2009-08-10 06:39 yes, that's what isomorphism does 2009-08-10 06:39 in our case it changes from from partially logged, to fully tree structured 2009-08-10 06:40 in theory it could go the other way as well, though I don't know a use for that 2009-08-10 06:40 I'm almost sure I'm not familier to that analysis than you :) 2009-08-10 06:41 you are, you just don't know it 2009-08-10 06:41 true, I don't know that 2009-08-10 06:42 well 2009-08-10 06:42 isomorphism means we can btree to logging form? 2009-08-10 06:43 it does, but the other direction is the one we use 2009-08-10 06:44 ah, I see 2009-08-10 06:45 delta flush and logging is one direction, and replay is another direction? 2009-08-10 06:45 that's not how I would say it 2009-08-10 06:46 logging and replay are inverses in some sense 2009-08-10 06:46 I don't think inverse is the right work for it 2009-08-10 06:46 delta flush changes the information content of the filesystem 2009-08-10 06:47 I don't really know nice words from mathematics to descript logging/replay or delta flush 2009-08-10 06:47 so... 
the writeup will not be logically perfect 2009-08-10 06:48 but at least I have a name for one part of it 2009-08-10 06:48 the rest will just be informally described 2009-08-10 06:49 i see 2009-08-10 06:50 actually, replay is another isomorphism 2009-08-10 06:51 there are two different ways of changing from the partial tree + partial log form to fully tree 2009-08-10 06:51 1) by a rollup 2) by replay 2009-08-10 06:53 what I don't have good technical words for, is the idea of providing a persistent disk representation for the cached, updated filesystem tree 2009-08-10 06:54 I am sure a formal basis for this exists, from the database world 2009-08-10 06:55 but anyway, that may be going too far for a little design note 2009-08-10 06:55 I will just use the word isomorphism to describe rollup and replay, and point to the wikipedia article 2009-08-10 06:56 and maybe also write that any further formal basis anyone wants to provide would be appreciated 2009-08-10 07:11 I see there is a post from tytso to respond to 2009-08-10 07:11 please wait 2009-08-10 07:12 I'm going to send "please stop argument of sponsorship" 2009-08-10 07:12 actually, About sponsorship, I guess Daniel just worried about me. But, it's not argument on lkml. So, let's stop argument about sponsorship. 2009-08-10 07:15 flipz? 2009-08-10 07:17 hi 2009-08-10 07:17 already respond to tytso? 
2009-08-10 07:17 hirofumi, that's good I think 2009-08-10 07:17 no, not for a few hours 2009-08-10 07:18 ok, I'll send about sponsorship 2009-08-10 07:18 I did not intend to continue the sponsorship thread 2009-08-10 07:19 I agree that paypal style sponsorship is too much work to be practical 2009-08-10 07:19 although I may repost the "where to send the beer" link 2009-08-10 07:19 beer is certainly helpful 2009-08-10 07:19 :) 2009-08-10 07:20 I think, my main response to tytso is that simple filesystem structure is more important than he suggests 2009-08-10 07:20 well, about me, if I get money from it, I guess it would become presussure to me 2009-08-10 07:21 ok, I sent about sponsorship 2009-08-10 07:22 it's funny that ted does not see the link between single-referenced filesystem blocks and online check 2009-08-10 07:23 imagine the extra complexity of trying to do online fsck when any block can have any number of pointers to it, from anywhere on the filesystem 2009-08-10 07:29 maybe, well, anyway, let's argue it productively 2009-08-10 07:29 hirofumi, by the way the response to tytso is not to be about sponsorship 2009-08-10 07:29 it is about "why do we need tux3" 2009-08-10 07:30 which is a good point to argue 2009-08-10 07:30 yes, I understood it 2009-08-10 07:30 or rather, clarify 2009-08-10 07:30 well, say about me, please don't do it strongly 2009-08-10 07:31 I really hate to be including to flame :/ 2009-08-10 07:31 right, not strongly 2009-08-10 07:31 size tux3.o 2009-08-10 07:31 text data bss dec hex filename 2009-08-10 07:31 53739 232 24 53995 d2eb tux3.o 2009-08-10 07:32 the tux3 program text is more than I expected 2009-08-10 07:32 but is only 1/5th of btrfs 2009-08-10 07:32 this is kernel module? 2009-08-10 07:32 yes 2009-08-10 07:33 and 1/2 of ext4 2009-08-10 07:33 there is a lot of dumping code in the tux3 text 2009-08-10 07:33 yes 2009-08-10 07:33 have we got a DEBUG define yet? 
2009-08-10 07:33 well, features are really small though 2009-08-10 07:34 yes, iirc, we are using a few DEBUG code 2009-08-10 07:34 but not one DEBUG define that controls all of them, plus it should turn off the dumping code 2009-08-10 07:34 and the asserts 2009-08-10 07:35 I think this is worth adding, because it is easy and interesting 2009-08-10 07:35 I guess the program text will shrink by about 40% 2009-08-10 07:36 not sure about percentage, however, yes, we would need 2009-08-10 07:41 anyway, it is not program text that matters most for memory footprint, but cache size 2009-08-10 07:41 yes 2009-08-10 07:41 although program text size matters too, because of small L2 cache size on recent processors 2009-08-10 07:41 only 256K/core 2009-08-10 07:42 i see 2009-08-10 07:42 it's about x86? 2009-08-10 07:42 so if a filesystem has 200K of program text, and most of it is used in the hot path, that filesystem will have significantly higher cpu than one with 50K of program text in the hot path 2009-08-10 07:42 yes 2009-08-10 07:42 new generation from intel 2009-08-10 07:42 well, anyway, embeded cpu is more small cpu cache 2009-08-10 07:43 very small per-processor L2 cache 2009-08-10 07:43 right 2009-08-10 08:26 -!- flips(~phillips@phunq.net) has joined #tux3 2009-08-10 09:57 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-10 11:06 -!- pythonstar(~kavli@c-0df1e455.19-30-64736c10.cust.bredbandsbolaget.se) has joined #tux3 2009-08-10 11:19 -!- bd_(~foo@2001:470:1f07:61f::feed:f00d) has joined #tux3 2009-08-10 11:51 -!- kunir(~kunir@cs27066117.pp.htv.fi) has joined #tux3 2009-08-10 13:47 hey 2009-08-10 21:30 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-10 21:35 -!- ajonat(~ajonat@190.48.111.69) has joined #tux3 2009-08-10 22:49 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-08-10 23:32 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-11 01:25 -!- 
pythonstar(~kavli@c-0df1e455.19-30-64736c10.cust.bredbandsbolaget.se) has left #tux3 2009-08-11 06:03 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-11 06:29 good morning 2009-08-11 08:32 -!- dcg(~dcg@230.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-08-11 08:54 hi 2009-08-11 08:55 hi 2009-08-11 08:55 still have to post my "why we need/want tux3" email 2009-08-11 08:56 ok 2009-08-11 08:56 first point is, single pointer per extent is a pretty big design element 2009-08-11 08:56 vs multiple 2009-08-11 08:57 just as big a design element as single btree for entire fs, vs tree of trees 2009-08-11 08:57 well, personally, if I have fun, I don't care almost :) 2009-08-11 08:57 right, that's point number one 2009-08-11 08:57 just reimplementing an existing filesystem might not be a lot of fun 2009-08-11 08:58 yes 2009-08-11 08:58 so the question is, what is different that makes it fun 2009-08-11 08:58 working also in userspace is kind of fun 2009-08-11 08:58 small and fast fs? 2009-08-11 08:58 yes 2009-08-11 08:58 an end in itself 2009-08-11 09:00 it's like pitching a movie: all possible movie plots have already been used, but that does not stop people from making new ones 2009-08-11 09:00 if ted found bad, it would be useful for us 2009-08-11 09:00 yes 2009-08-11 09:01 ted's pretty good at that 2009-08-11 09:01 yes 2009-08-11 09:09 I'll go to shop for late dinner 2009-08-11 09:22 -!- dcg_(~dcg@30.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-08-11 10:15 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-08-11 11:04 -!- debdev(8bb58f22@webchat.mibbit.com) has joined #tux3 2009-08-11 11:06 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-11 11:22 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-08-11 11:22 For interest to fs hackers: http://lwn.net/Articles/346219. If you don't have an lwn account ping me for a subscriberlink 2009-08-11 11:25 napi for block device? 
2009-08-11 11:25 http://lwn.net/Articles/346187/ Read this also 2009-08-11 11:25 Yeah. The use case is really fast SSD storage or lots of SSDs in raid. Like one of flips's side projects 2009-08-11 11:26 yes, it sounds like sane 2009-08-11 11:27 however, it may easier to do it by hardware than nic 2009-08-11 11:27 hirofumi, sorry I don't understand 2009-08-11 11:28 nic doesn't know about when come packets 2009-08-11 11:28 Oh right. Agreed 2009-08-11 12:15 sejeff, this is quite timely 2009-08-11 12:16 my immediate reaction: is this a cut n paste of napi? 2009-08-11 12:17 it's timely because we saw about 55K interrupts/second running 10 ssds in a sw raid0 2009-08-11 12:17 that is, md raid0 2009-08-11 12:18 flipz, Yeah and it might make your numbers better which in turn gets you more toys 2009-08-11 12:18 and as hirofumi said (I think) the solution was to use the raid cards in the intended way, instead of making raid groups of 1 drive each 2009-08-11 12:19 Yeah I was shocked you tried soft raid from the beginning though 2009-08-11 12:19 didn't 2009-08-11 12:19 tried hw raid first, then wanted to know the sw raid behavior 2009-08-11 12:19 Oh my mistake 2009-08-11 12:20 answer: issues arose 2009-08-11 12:20 issues of 100% CPU x 8 cpus 2009-08-11 12:20 How do you do that with hardware raid? 
2009-08-11 12:20 hardware raid uses < cpu than soft raid in theory 2009-08-11 12:20 cpu was fairly high for hw raid too, but not that high 2009-08-11 12:21 I didn't write clearly I guess 2009-08-11 12:21 the 100% cpu / 55K interrupts/sec was sw raid0 2009-08-11 12:21 hw raid0 was considerably better 2009-08-11 12:21 Sure 2009-08-11 12:21 And the quality of the raid card also matters a lot 2009-08-11 12:21 I forget what the cpu numbers were, maybe 30% at 2.2 GB/sec 2009-08-11 12:22 shapor would know 2009-08-11 12:22 these raid cards were probably not the greatest 2009-08-11 12:22 they did ok though (see 2.2 GB/sec above) 2009-08-11 12:22 I think the best we got was 2.4 GB/sec, read and write 2009-08-11 12:22 out of 10 drives 2009-08-11 12:23 that's only 10% short of rated max 2009-08-11 12:26 hmm, jens performance numbers don't look very impressive 2009-08-11 12:26 http://lwn.net/Articles/346256/ 2009-08-11 12:26 maybe I'm reading it wrong? 2009-08-11 12:26 ints/ect is reduced about 25%, but sys time is hardly improved 2009-08-11 12:27 ints/sec I mean 2009-08-11 12:29 napi has nasty race condition which a bit complex to implement 2009-08-11 12:30 but, if hardware didn't have interrupt rate control like intel epro nic, it may help much 2009-08-11 12:33 [PATCH 0/3]: blk-iopoll, a polled completion API for block devices 2009-08-11 12:34 I guess discussion is still continue 2009-08-11 12:34 I'm marking that thread as "read it later" :) 2009-08-11 12:59 flipz: there was a polling mode for the megaraid-sas driver which didn't seem to work on the card we had 2009-08-11 13:00 we weren't pegging all the cores, core 0 was handling all the interrupts of all controllers which was kinda broken 2009-08-11 13:01 interestingly as we added more drives the iowait on the cpu actually decreased, using all the otherwise idle cpu, but the system increased 2009-08-11 13:01 not quite sure what was going on there 2009-08-11 13:02 i have dstat logs i can post 2009-08-11 13:02 would be nice 
2009-08-11 13:02 they are csv with timestamps i also have the commands i ran saved in files 2009-08-11 13:02 so they can be correlated based on time, but i haven't dont that 2009-08-11 13:02 done 2009-08-11 13:03 would like to graph some of it 2009-08-11 13:03 done where? 2009-08-11 13:03 no i meant s/dont/done/ 2009-08-11 13:03 it will take more than 10 seconds ;) 2009-08-11 13:03 it makes perfect sense that iowait would decrease with more drives 2009-08-11 13:03 more like 30 minutes, just need to get around to it 2009-08-11 13:04 ah 2009-08-11 13:04 yeah more drives and more io in flight in parallel 2009-08-11 13:04 i guess so 2009-08-11 13:52 -!- geos_one(~chatzilla@chello084115149052.4.graz.surfer.at) has joined #tux3 2009-08-11 15:11 -!- ajonat(~ajonat@190.48.124.34) has joined #tux3 2009-08-11 17:13 hey folks 2009-08-11 17:14 hirofumi: it's scary how much working code you can write in a short amount of time 2009-08-11 17:14 I'm probably 10x slower I think 2009-08-11 17:14 from lack of familiarity 2009-08-11 17:55 -!- bd_(~foo@2001:470:1f07:61f::feed:f00d) has joined #tux3 2009-08-11 18:04 -!- ajonat(~ajonat@190.48.124.34) has joined #tux3 2009-08-11 18:04 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-11 18:04 -!- bd___(~foo@2001:470:1f07:61f::feed:f00d) has joined #tux3 2009-08-11 18:05 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-11 18:05 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-11 23:17 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-12 01:11 -!- debdev(8bb58f22@webchat.mibbit.com) has joined #tux3 2009-08-12 06:50 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-12 10:52 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-08-12 11:16 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-12 12:28 -!- ajonat(~ajonat@190.48.116.195) has joined #tux3 2009-08-12 14:09 -!- 
tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-12 18:27 -!- FSa(~P^A^N^D^U@202.182.174.98) has joined #tux3 2009-08-12 18:27 -!- FSa(~P^A^N^D^U@202.182.174.98) has left #tux3 2009-08-12 18:34 -!- ajonat(~ajonat@190.48.116.195) has joined #tux3 2009-08-12 19:47 -!- in_t4n(~tolll@212.62.97.20) has joined #tux3 2009-08-12 19:47 -!- in_t4n(~tolll@212.62.97.20) has left #tux3 2009-08-12 22:21 -!- ajonat(~ajonat@190.48.116.195) has joined #tux3 2009-08-12 23:19 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-12 23:36 -!- MICHEL--047(~Tueur_Vag@118.98.163.66) has joined #tux3 2009-08-12 23:36 -!- MICHEL--047(~Tueur_Vag@118.98.163.66) has left #tux3 2009-08-13 00:05 please stop spamming the channel 2009-08-13 06:25 http://lwn.net/SubscriberLink/346299/5e413939653c64b5/ 2009-08-13 06:25 for tim_dimm 2009-08-13 07:45 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-13 07:51 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-13 08:49 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-13 11:03 let's see what Jon wrote 2009-08-13 11:09 all of those doesn't matter at all 2009-08-13 11:10 we just have to step forward to good goal 2009-08-13 11:17 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-13 11:19 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-13 11:50 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-13 11:59 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-13 13:12 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-13 15:00 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-13 17:58 hey tim_dimm 2009-08-13 17:59 ACTION 
wonders who came up with SOCKOPS_WRAP 2009-08-13 17:59 looks like c trying to be c++ 2009-08-13 18:00 wow, it enshrines BLK in every sockop 2009-08-13 18:00 BKL 2009-08-13 18:00 sucks well beyond belief 2009-08-13 18:03 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-13 18:20 hey flipz 2009-08-13 18:20 haha bkl 2009-08-13 19:20 -!- ajonat(~ajonat@190.48.118.3) has joined #tux3 2009-08-14 00:42 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-14 03:54 -!- ajonat(~ajonat@190.48.118.3) has joined #tux3 2009-08-14 06:05 -!- ajonat_(~ajonat@190.48.107.79) has joined #tux3 2009-08-14 07:28 -!- msindia(~maheshsha@117.254.1.153) has joined #tux3 2009-08-14 07:29 flips, we are from Pune Insitute of Computer Technology. 2009-08-14 07:29 We are looking for some project in Tux3, with reference to CDK et.al, the dedup guys. 2009-08-14 07:29 Could you suggest some? 2009-08-14 07:33 flips may be sleeping yet :) 2009-08-14 07:33 Well, it was a "broadcast" request. 2009-08-14 07:34 Could you suggest us? 2009-08-14 07:35 I'm not sure 2009-08-14 07:36 usually, I'll answer, it should be what you want 2009-08-14 07:38 what part are you interesting? 2009-08-14 07:39 Snapshotting, ZFS style. 2009-08-14 07:39 i see 2009-08-14 07:41 Is it possible to incubate Storage pool concept in Tux3? 2009-08-14 07:41 snapshot is not my so interesting area 2009-08-14 07:41 zfs's pool? 2009-08-14 07:42 iirc, flips was thinking about it 2009-08-14 07:42 Similar to it. 2009-08-14 07:43 but, it is to add devicemap or lvm interface 2009-08-14 07:43 That is something very unique to ZFS, i suppose. 2009-08-14 07:43 well, I guess lvm has similar concept more or less 2009-08-14 07:43 but, it can be used from fs 2009-08-14 07:44 dinamicly 2009-08-14 07:44 s/dinamicly/dynamicly/ 2009-08-14 07:45 ok. 
2009-08-14 07:45 ah, s/it can be used/it can't be used/ 2009-08-14 07:46 so, if lvm was extended to control from fs, it may work 2009-08-14 07:46 -!- amod(~amod@117.195.11.116) has joined #tux3 2009-08-14 07:47 -!- amod(~amod@117.195.11.116) has left #tux3 2009-08-14 07:47 probably, I guess flips may have idea 2009-08-14 07:47 Why not to directly add support to Tux3? 2009-08-14 07:48 zfs doesn't also have pool directly 2009-08-14 07:48 if pool tie to one fs, I guess it would break storage stack 2009-08-14 07:48 -!- ek0ar(~ek0ar@117.195.11.116) has joined #tux3 2009-08-14 07:48 It has. AFAIK. 2009-08-14 07:49 iirc, zfs layer uses pool layer 2009-08-14 07:49 also, ufs uses pool layer via volume layer 2009-08-14 07:50 it just in my understand though 2009-08-14 07:50 We could enhance pool concept. 2009-08-14 07:51 good morning 2009-08-14 07:51 By storing the "frequently" used files on a faster disk. 2009-08-14 07:51 hi 2009-08-14 07:51 hi msindia 2009-08-14 07:51 Morning, flips. 2009-08-14 07:51 good timing 2009-08-14 07:52 hi hirofumi 2009-08-14 07:52 msindia, how about snapshotting, tux3 style? 2009-08-14 07:52 Flipz, we were discussing, storage pool in Tux3, whats your take? 2009-08-14 07:52 let me see what a storage pool is 2009-08-14 07:52 ah 2009-08-14 07:52 lvm? 2009-08-14 07:52 Not really. 
2009-08-14 07:53 free space pool like zfs's 2009-08-14 07:53 http://en.wikipedia.org/wiki/ZFS#Storage_pools 2009-08-14 07:53 I think hirofumi's answer is right: lvm can be extended to do what ZFS's pool does 2009-08-14 07:54 the fs must have better control of the lvm 2009-08-14 07:54 actually, my intention was to move the lvm functionality into the block layer 2009-08-14 07:54 this would be a very interesting project 2009-08-14 07:54 but it goes outside the filesystem 2009-08-14 07:55 I think, to stay inside the filesystem, work on snapshotting would be idea 2009-08-14 07:55 and it's my uninteresting are :)) 2009-08-14 07:55 rightj :) 2009-08-14 07:55 What is status of snapshotting? 2009-08-14 07:55 No work as yet?. 2009-08-14 07:55 however, I think you will become interested at some point, there are some interesting issues in mounting multiple snapshots simultaneously 2009-08-14 07:56 msindia, base work on snapshotting is done 2009-08-14 07:56 see user/version.c 2009-08-14 07:56 well, base work may be too strong 2009-08-14 07:57 proof of concept 2009-08-14 07:57 basic concept, probably 2009-08-14 07:57 yes 2009-08-14 07:57 so... the biggest issue between that proof of concept and actually having snapshotting is, more work on the dleaf editing 2009-08-14 07:57 I will discuss with my mentor and get back to you soon :) 2009-08-14 07:58 great :) 2009-08-14 07:58 yes, great :) 2009-08-14 08:00 hirofumi, it's worth thinking about how struct inode works with multiple snapshots mounted simultaneously 2009-08-14 08:01 the inode hashing becomes ino+version I think 2009-08-14 08:02 pages that are the same version may be shared read-only between multiple inodes 2009-08-14 08:02 but since we have no way to share pages between inodes in current vfs, those pages would be duplicated, read-only 2009-08-14 08:03 s/read-only// 2009-08-14 08:03 yes, it can 2009-08-14 08:03 but, probably, it would need too many work 2009-08-14 08:04 to share pages? 
2009-08-14 08:04 yes, share some objects from different sb 2009-08-14 08:05 yes, that would be a good project for a master's thesis 2009-08-14 08:06 -!- cdk(~Chinmay@59.95.26.100) has joined #tux3 2009-08-14 08:06 good, and very good, but, it can be too hard :) 2009-08-14 08:06 our goal would be just to have consistency between simultaneous mounts of the same volume, and not worry about memory efficiency 2009-08-14 08:06 hi flipz 2009-08-14 08:06 hi cdk, nice to see you again 2009-08-14 08:06 yeah ... long time ... :( 2009-08-14 08:06 been busy this month ... 2009-08-14 08:06 me too 2009-08-14 08:06 me too :) 2009-08-14 08:06 but i think i am ready to get back to work :) 2009-08-14 08:06 :) 2009-08-14 08:07 me too :) 2009-08-14 08:07 hirofumi, so one thing we have to support is multiple r/w mounts of the same volume 2009-08-14 08:08 well, about ino+version, it may not be easy 2009-08-14 08:08 what issue do you see? 2009-08-14 08:08 i see people from my college are interested in contributing :) ... good 2009-08-14 08:08 I guess, it means ino can be changed by write 2009-08-14 08:09 I think the version number for the inode is just the version it was opened under, this does not change 2009-08-14 08:09 ...except maybe when a new snapshot is set, need to think about that 2009-08-14 08:09 um... 2009-08-14 08:10 it means, e.g. on-disk is ino=100 and version=3, so global ino=1003 2009-08-14 08:10 and mount volume as version=4, global ino=1004? 2009-08-14 08:11 good question 2009-08-14 08:11 if we mount version 3 of the filesystem, it is some historical version 2009-08-14 08:12 yes 2009-08-14 08:12 so, all opened inodes will be version 3, or a parent version of version 3 2009-08-14 08:12 if we then take a snapshot of the mounted filesystem... what? 2009-08-14 08:13 does an opened file stay at version 3, or does it go to the "current" version? 
2009-08-14 08:13 I think it goes to the current version 2009-08-14 08:14 not sure though, I think so too 2009-08-14 08:14 say, version 5 (because version 4 might be already taken) 2009-08-14 08:14 assuming the inode goes to the current version, how does this affect the idea of hashing via inode+version? 2009-08-14 08:14 I think it breaks that idea :) 2009-08-14 08:14 :) 2009-08-14 08:15 we do need ino+something for the hash though 2009-08-14 08:15 ino+instance maybe 2009-08-14 08:16 where instance is a small number handed out at mount time, which is unique across all current mounts 2009-08-14 08:16 so, it is shared from multiple mounted snapshot? 2009-08-14 08:17 I think it has to be, otherwise a write to a file on one mount would not be visible on another 2009-08-14 08:17 e.g., writing to the same version via two different mounts 2009-08-14 08:17 yes 2009-08-14 08:18 because, those are different snapshots? 2009-08-14 08:18 they can be the same or different 2009-08-14 08:18 why do we want multiple mounts vs .snapshot or something 2009-08-14 08:18 other than the obvious namespace violation 2009-08-14 08:19 multiple mounts means bind mount? 2009-08-14 08:19 suppose you want to go see the earlier version of a filesystem, but not unmount the current version? 
2009-08-14 08:19 the closest existing model would be ddsnap 2009-08-14 08:20 what about an ioctl or something 2009-08-14 08:20 where you can mount multiple versions simultaneously from the same physical volume 2009-08-14 08:20 yes 2009-08-14 08:20 shapor, the user interface is a separate issue from how the data is handled 2009-08-14 08:20 its annoying to have a mount-per-snapshot imo 2009-08-14 08:21 I'm thinking user want to mount multiple version at same time 2009-08-14 08:21 shapor, we also plan alternative methods, such as the versioned link scheme, but mounting a particular version is the "base" facility 2009-08-14 08:21 hirofumi, yes 2009-08-14 08:22 yeah i guess mount suppose is needed either way 2009-08-14 08:23 so, if a user opens the same file version via two different mounts, they better get the same inode 2009-08-14 08:23 yes 2009-08-14 08:23 and by implication, the same sb 2009-08-14 08:23 sounds a bit tricky 2009-08-14 08:23 it is, that's why I think hirofumi will be interested 2009-08-14 08:23 it's new territory for linux, the closest thing we have to this now is clustered filesystems 2009-08-14 08:24 "same file version via two different mounts" is not sure what is meaning 2009-08-14 08:24 so if you open /snapshot/file you actually get /file 2009-08-14 08:25 "same file version via two different mounts", it can say, we can it already by bind mount 2009-08-14 08:25 hirofumi, say latest modification of a file was version 3, then somebody mounts the filesystem at version 4 and somebody else mounts it at version 5, they should see the same version 3 file 2009-08-14 08:25 ok 2009-08-14 08:26 hirofumi, I think bind mount only gives access to some subdirectory of a mounted filesystem 2009-08-14 08:26 so it's not quite what we want 2009-08-14 08:26 explicitly, two mounts of different snapshot 2009-08-14 08:26 yes 2009-08-14 08:27 whole sb too 2009-08-14 08:27 so, both mounts would have the same sb 2009-08-14 08:27 that is like a bind mount 2009-08-14 08:27 
yes 2009-08-14 08:27 mount /dev/foo /mnt; mount /dev/foo /usr/local 2009-08-14 08:27 we can this 2009-08-14 08:28 it's one of bind mount 2009-08-14 08:28 well, snapshot 2009-08-14 08:29 so now we are doing mount -o version=4 /dev/foo /mnt; mount -o version=5 /dev/foo /usr/local; 2009-08-14 08:29 if snapshot is writable, I think it has difficult things 2009-08-14 08:29 ok, I understood it 2009-08-14 08:29 right, it is writeable 2009-08-14 08:30 the 4 and 5 are "version tags" 2009-08-14 08:30 that is, they are really a name for a snapshot, they are not internal version numbers 2009-08-14 08:30 ok 2009-08-14 08:31 each of them maps to some internal version number, which can change when a new snapshot is created 2009-08-14 08:31 well, I thought a little bit about it in past 2009-08-14 08:32 if share the inode, I guess vfs would be needed many changes 2009-08-14 08:33 the combination ino+tag uniquely identifies file, where tag is an external version number 2009-08-14 08:33 I think we can do this without vfs changes 2009-08-14 08:33 only change the filesystem's inode hashing scheme 2009-08-14 08:34 suppose 4 and 5, and internal inode ino=100,ver=4 2009-08-14 08:34 4 and 5 means external version in the above 2009-08-14 08:34 I use '4' and '5' to mean "external versions" 2009-08-14 08:35 so, if we share the ino=100,ver=4 for those, 5 see the change of 4? 2009-08-14 08:36 I think so 2009-08-14 08:36 eh? 2009-08-14 08:36 it is good behavior? 2009-08-14 08:36 it depends which numbers are external vs internal version numbers in your example ;) 2009-08-14 08:37 ino=100,ver='4' ? 2009-08-14 08:37 ok, start from begining 2009-08-14 08:37 mount ver=4 2009-08-14 08:37 echo aaa > file 2009-08-14 08:37 umount ver=4 2009-08-14 08:38 so, ver=4 wrote the inode=100,ver=4 2009-08-14 08:38 is this true? 
2009-08-14 08:38 changes to ver=4 and ver=5 are isolated 2009-08-14 08:39 neither mount sees the changes of the other 2009-08-14 08:39 ok 2009-08-14 08:39 this means that they have two different struct inodes for the same inode number 2009-08-14 08:39 the above example is what I was thinking 2009-08-14 08:40 i see 2009-08-14 08:41 now... do we ever allow the same struct inode to be opened by two different mounts? (we do share the same sb) 2009-08-14 08:42 it means bind mount? 2009-08-14 08:43 bind mount is an orthogonal feature 2009-08-14 08:43 it's more like a clustered filesystem mount 2009-08-14 08:44 let's assume that a struct inode always belongs to just one mount, and see if there are any consistency problems 2009-08-14 08:45 on the above 4 and 5 example? 2009-08-14 08:45 sure 2009-08-14 08:45 ok 2009-08-14 08:45 clean data pages are not a problem, they can just be duplicated between different struct inodes 2009-08-14 08:45 there are 2009-08-14 08:46 however, what happens when a data page is changed? 2009-08-14 08:46 (stepping away from the keyboard for a few minutes) 2009-08-14 08:46 I thought in past, we have to share internal structures 2009-08-14 08:46 that's the question 2009-08-14 08:48 well, my thinking was stopped with it. at least it's not easy :) 2009-08-14 08:48 well, probably, almost all of backend structures 2009-08-14 08:50 ah, I think it may be easy 2009-08-14 08:50 we should hash the inode by ino+tag, where tag is the tag of the version the inode was opened under 2009-08-14 08:51 that way, if two user open the same file for the same mounted version, they get the same struct inode 2009-08-14 08:52 I'm not sure the benefit of ino+tag 2009-08-14 08:52 we are sharing the sb? 
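Sketch (illustrative, not part of the original conversation): the ino+tag hashing idea discussed above might look roughly like the following. The `Inode` and `InodeCache` classes and the `iget` method are hypothetical stand-ins for the kernel's inode hash, and the dict-based cache is an assumption made only to show the keying; the real change would live in the filesystem's inode lookup code.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    """Stand-in for an in-memory struct inode (hypothetical)."""
    ino: int
    tag: int  # external version tag the inode was opened under
    pages: dict = field(default_factory=dict)  # page index -> cached data


class InodeCache:
    """One cache per volume, shared by every mount (i.e. one shared sb)."""

    def __init__(self):
        self._hash = {}  # (ino, tag) -> Inode

    def iget(self, ino, tag):
        # Key on ino+tag instead of ino alone: opens under the same version
        # tag share one Inode, opens under different tags are isolated.
        key = (ino, tag)
        inode = self._hash.get(key)
        if inode is None:
            inode = Inode(ino, tag)
            self._hash[key] = inode
        return inode


cache = InodeCache()
a = cache.iget(100, tag=4)   # first mount, version tag 4
b = cache.iget(100, tag=4)   # second open under the same tag
c = cache.iget(100, tag=5)   # other mount, version tag 5
a.pages[0] = b"aaa"          # a write under tag 4
```

Here `a` and `b` are the same object, so the write is visible to both openers under tag 4, while `c` has its own page cache and does not see it. How a tag maps to an internal version number is deliberately left out.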
yes 2009-08-14 08:52 ah 2009-08-14 08:52 we must share the sb 2009-08-14 08:52 I was thinking frontend sb+inode+page, backend others 2009-08-14 08:52 otherwise filesystem damage will occur due to multiple inconsistent, dirty caches 2009-08-14 08:53 ah, that is another approach 2009-08-14 08:53 a harder approach I think, but with advantages 2009-08-14 08:53 well, anyway, I'm not sure for now 2009-08-14 08:54 sharing sb across mounts is pretty easy, and is supported by vfs 2009-08-14 08:54 I guess it's not so easy 2009-08-14 08:54 there is a namespace issue 2009-08-14 08:54 I guess 2009-08-14 08:54 um... 2009-08-14 08:55 it seemed easy when I did it for ddlink 2009-08-14 08:56 e.g. root dentry is pointed to by sb->s_root 2009-08-14 08:56 it doesn't have multiple dentries 2009-08-14 08:56 ah, good point 2009-08-14 08:57 well we get to interpret s_root how we want to 2009-08-14 08:57 -!- cdk_(~Chinmay@115.109.8.233) has joined #tux3 2009-08-14 08:57 it doesn't even have to point at a real inode I think 2009-08-14 08:58 what do it means (english skill problem :)) 2009-08-14 08:58 s_root doesn't even have to point at a real inode I think 2009-08-14 08:58 ah 2009-08-14 08:58 it may be able to 2009-08-14 08:59 I'm not sure which is good for now though 2009-08-14 09:00 -!- stargazr(~gaurav@59.95.0.141) has joined #tux3 2009-08-14 09:00 no change necessary for now 2009-08-14 09:00 -!- _ajonat(~ajonat@190.48.112.109) has joined #tux3 2009-08-14 09:00 we only need to address this issue when we want to allow multiple simultaneous mounts of different versions 2009-08-14 09:01 yes 2009-08-14 09:01 I think the idea of hashing by ino+tag works properly 2009-08-14 09:01 and with writable 2009-08-14 09:01 I think ino+tag works even for simultaneous r/w opens of the same inode, with different versions 2009-08-14 09:01 well, I can say now, only I'm not sure 2009-08-14 09:03 it would depend on complexity and later implementation, obviously 2009-08-14 09:03 one obvious reason
why simultaneous mounts of different versions have to share the same sb is, btree locking 2009-08-14 09:03 backend should be shared 2009-08-14 09:04 yes 2009-08-14 09:04 backend and frontend are both shared, according to me 2009-08-14 09:04 the ino+tag hash key makes everything ok 2009-08-14 09:04 frontend was including inode, it shouldn't 2009-08-14 09:05 the ino+tag hashing gives different struct inodes to mounts of two different versions 2009-08-14 09:05 maybe, but I'm not sure how is it easy to implement 2009-08-14 09:05 it's just a change to the inode hashing scheme 2009-08-14 09:06 yes, it was just to make sure 2009-08-14 09:06 it seems easy to me 2009-08-14 09:06 well, complexity apears on courner case :) 2009-08-14 09:07 so, I'd like to be thinking both way, at least for now 2009-08-14 09:07 and finding another better way 2009-08-14 09:07 maybe. I will try to spot those corner cases 2009-08-14 09:08 good 2009-08-14 09:08 note: I did not propose to allow simultaneous mounts of the same version by this means, that would just be a bind mount 2009-08-14 09:09 yes 2009-08-14 09:09 bind would be do automagically 2009-08-14 09:09 and we can probably use the same mechanism to do it transparently 2009-08-14 09:10 that is, we detect whenever the same volume is mounted more than once 2009-08-14 09:11 ah 2009-08-14 09:11 it was done by more upper layer, i.e. 
vfs 2009-08-14 09:11 for multiple mounts of the same version we use bind mount, otherwise we rely on the different version tags to isolate changes to the two different versions 2009-08-14 09:11 for get_sb_bdev 2009-08-14 09:11 ah, no 2009-08-14 09:11 that is called by our filesystem 2009-08-14 09:12 so we have control 2009-08-14 09:12 we would use custom get_sb_bdev 2009-08-14 09:12 yes 2009-08-14 09:12 and not very different from existing flavors 2009-08-14 09:12 yes 2009-08-14 09:12 well, so far from current state 2009-08-14 09:12 right, it is months away 2009-08-14 09:12 and my very interesting area is not it :) 2009-08-14 09:13 understood, but we need to be sure that we're heading in the right direction 2009-08-14 09:13 multiple versions is interesting to many users 2009-08-14 09:13 if we are on current speed, it would be years unfortunately 2009-08-14 09:13 so... back to atomic commit :) 2009-08-14 09:14 so I for one will improve my speed 2009-08-14 09:14 that will not be hard 2009-08-14 09:14 since speed was zero for some months 2009-08-14 09:14 yes, sure (check right direction repeatedly) 2009-08-14 09:14 good 2009-08-14 09:15 and me too 2009-08-14 09:15 flipz, before you start with atomic commit 2009-08-14 09:15 about the patches that you mentioned 2009-08-14 09:15 unit tests 2009-08-14 09:15 version.c? 2009-08-14 09:15 ah 2009-08-14 09:15 I think that is too easy for you :) 2009-08-14 09:15 well any improvements are appreciated 2009-08-14 09:15 what do you want done there ? add returns from main ? 2009-08-14 09:16 because asserts() will exit by default i presume ? 2009-08-14 09:16 the idea is to add some functionality to check for expected state of the filesystem at the end of the unit test 2009-08-14 09:16 i.e., return pass/fail from the test 2009-08-14 09:16 asserting is also ok 2009-08-14 09:17 no change necessary for now, we have to find good design and implementation 2009-08-14 09:17 that's a 'fail' 2009-08-14 09:17 ugh 2009-08-14 09:17 humbly(?)
and greedy(?), we have to continue to find good design and implementaion 2009-08-14 09:17 sorry 2009-08-14 09:17 ok ... so when it completes the test we need a way to find whether the test went as wanted 2009-08-14 09:18 right 2009-08-14 09:19 and for that we traverse the fs like tux3graph does checking what we need ? 2009-08-14 09:19 so for example, a test of the directory would examine the final directory to find out that all the correct names are there, and only the correct names, and they refer to the correct inodes 2009-08-14 09:19 btw, also dedup seems my interesting area :) 2009-08-14 09:19 hirofumi , i think we can get back to dedup when we are done with atomic commit 2009-08-14 09:20 yes 2009-08-14 09:20 right, I think the main change that should be done to dedup is, eliminate the extra physical btree 2009-08-14 09:20 i believe we might have to change a few things in the design and then port it to the kernel 2009-08-14 09:20 by moving this information to a logical file 2009-08-14 09:20 atomic commit and optimization is top interesting one for me 2009-08-14 09:21 flipz, anything i can do in atomic commit work? it will help me get closer to the kernel-port so that i can contribute during kernel part of dedup 2009-08-14 09:21 cdk_, traverse of the filesystem tree would be the correct approach for checking many of the unit tests 2009-08-14 09:21 cdk_, yes, there are things to do in the atomic commit work 2009-08-14 09:21 yes 2009-08-14 09:22 I think the first useful task would be to try to understand the atomic commit design, and where it is not possible to understand, make us explain it more clearly 2009-08-14 09:22 current state of atomic commit would be able to see in kernel/commit.c 2009-08-14 09:22 and create() path 2009-08-14 09:22 and replay.c 2009-08-14 09:22 yes 2009-08-14 09:23 and also my design not, which is in progress ;) 2009-08-14 09:23 design note I mean 2009-08-14 09:23 good 2009-08-14 09:24 ok ... will start with that ... 
and i will see about the unit test things as well 2009-08-14 09:24 ah, it would be a little though, my memo may help to see implement of atomic commit 2009-08-14 09:25 its gonna be slow ... wont have as much as I had in college but will be much more regular now 2009-08-14 09:25 especially, about logging and freeing logging blocks 2009-08-14 09:25 wont have as much time i mean 2009-08-14 09:25 hirofumi, where can get the memo ? 2009-08-14 09:26 http://userweb.kernel.org/~hirofumi/notes/ 2009-08-14 09:26 I'll pushed current memos to same place 2009-08-14 09:27 -!- gaurav(~gaurav@59.95.0.141) has joined #tux3 2009-08-14 09:27 done 2009-08-14 09:28 hello hirofumi, flips 2009-08-14 09:28 hi 2009-08-14 09:28 i will chip in along with cdk 2009-08-14 09:29 good 2009-08-14 09:29 have fun 2009-08-14 09:30 but as cdk said,less time..so mostly will work on weekends.. 2009-08-14 09:30 I think it's good enough 2009-08-14 09:31 -!- setheus(~setheus@pool-173-74-124-37.dllstx.fios.verizon.net) has joined #tux3 2009-08-14 09:31 hi gaurav 2009-08-14 09:32 I was also worked for linux on weekend in almost past 2009-08-14 09:32 :) 2009-08-14 09:32 cdk_, most of the details on atomic commit are in the irc channel logs 2009-08-14 09:32 ah, yes 2009-08-14 09:33 so this could be a little frustrating, trying to find it 2009-08-14 09:33 yes, really 2009-08-14 09:33 so... the design note 2009-08-14 09:33 maybe, I or you should collect those :) 2009-08-14 09:33 I will :) 2009-08-14 09:34 well 2009-08-14 09:34 thanks! :) 2009-08-14 09:34 :) 2009-08-14 09:35 that would help us a lot! thanks! 2009-08-14 09:35 flipz, will wait for the design note ... 
will look through the irc logs till then 2009-08-14 09:38 and kernel/log.c, kernel/commit.c as hirofumi mentioned 2009-08-14 09:38 some of it is quite clear 2009-08-14 09:39 yes 2009-08-14 09:39 unclear part at all would be freeing logging-block strategy 2009-08-14 09:40 but, there is memo more or less 2009-08-14 09:40 ok 2009-08-14 09:41 time to sleep for me 2009-08-14 09:41 oyasumi 2009-08-14 09:42 same here ... good night. 2009-08-14 09:43 good night!or oyasumi ! 2009-08-14 10:30 -!- msindia(~maheshsha@117.254.1.153) has joined #tux3 2009-08-14 10:34 -!- msindia(~maheshsha@117.254.1.153) has left #tux3 2009-08-14 10:51 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-14 11:01 -!- msindia(~maheshsha@117.254.1.153) has joined #tux3 2009-08-14 11:44 -!- KillerBee(~bee@117.195.9.129) has joined #tux3 2009-08-14 11:57 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-14 15:27 ~. 2009-08-14 20:37 -!- SEJeff(~jeff__@209.160.84.1) has joined #tux3 2009-08-14 21:05 -!- jeff(~jeff__@209.160.84.1) has joined #tux3 2009-08-14 21:06 -!- jeff_(~jeff__@64.74.250.4) has joined #tux3 2009-08-14 21:22 -!- jeff_(~jeff__@209.160.84.1) has joined #tux3 2009-08-14 22:50 -!- covel(~talktome_@117.254.1.153) has joined #tux3 2009-08-15 01:19 -!- jeff(~jeff__@64.74.250.4) has joined #tux3 2009-08-15 01:21 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-08-15 05:41 -!- cdk(~Chinmay@115.109.15.158) has joined #tux3 2009-08-15 05:55 -!- caveats(~talktome_@117.254.15.101) has joined #tux3 2009-08-15 06:11 -!- KillerBee(~bee@117.195.16.185) has joined #tux3 2009-08-15 06:13 -!- msindia(~mr.mahesh@117.254.15.101) has joined #tux3 2009-08-15 07:34 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-15 07:47 -!- msindia(~mr.mahesh@117.254.15.101) has left #tux3 2009-08-15 07:47 -!- msindia(~mr.mahesh@117.254.15.101) has joined #tux3 2009-08-15 08:09 -!- msindia(~mrs1512@117.254.15.101) has joined #tux3 2009-08-15 08:18 -!- 
mecaveats(~mrs1512@117.254.2.74) has joined #tux3 2009-08-15 08:19 -!- mecaveats(~mrs1512@117.254.2.74) has joined #tux3 2009-08-15 09:23 -!- msindia(~mrs1512@117.254.11.133) has joined #tux3 2009-08-15 10:01 good morning 2009-08-15 10:36 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-15 12:26 -!- npmccallum(~npmccallu@74-143-208-253.static.insightbb.com) has joined #tux3 2009-08-15 14:48 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-15 14:54 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-15 14:55 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-08-15 15:21 I've pushed the various fixes, include "tux3 mkfs" fix 2009-08-15 15:21 static-http://userweb.kernel.org/~hirofumi/tux3/ 2009-08-15 15:24 next, I'll try to check btree.c where I can rewrite proper operations or not 2009-08-15 15:24 s/where/whether/ 2009-08-15 15:31 ACTION goes to look 2009-08-15 17:33 fls for blocksize must have been a bug 2009-08-15 17:34 written by me I think 2009-08-15 17:38 it seems this patch set is the "proper fix" for the mkfs with logging issue 2009-08-15 17:42 make mkfs and make tests run 2009-08-15 17:42 fls may not be bug 2009-08-15 17:42 ah right 2009-08-15 17:42 because we're just checking for even power of two 2009-08-15 17:42 but, it's not standard 2009-08-15 17:43 whoops, I always read those backwards 2009-08-15 17:43 if blocksize is not power of two, it's another issue 2009-08-15 17:43 fls finds the highest order one bit 2009-08-15 17:43 oh 2009-08-15 17:44 well, it's already bug of blocksize sanitalize 2009-08-15 17:45 patchset is not so proper fix 2009-08-15 17:45 logmap of inode should die 2009-08-15 17:45 and, um... 
2009-08-15 17:45 forgot 2009-08-15 17:45 anyway, ffs and fls both work 2009-08-15 17:46 since we are just testing that only one bit is set 2009-08-15 17:46 yes, there is no change 2009-08-15 17:46 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-15 17:46 ffs is part of libc 2009-08-15 17:46 fls is borrowed from kernel 2009-08-15 17:46 oh 2009-08-15 17:47 oh, sure 2009-08-15 17:47 anyway it all looks good 2009-08-15 17:47 runs here without problems 2009-08-15 17:47 and fixes the mkfs crash 2009-08-15 17:48 I never noticed it 2009-08-15 17:49 yes 2009-08-15 17:49 pushed to public 2009-08-15 17:50 but, those should be temporary change 2009-08-15 17:50 yes 2009-08-15 17:50 until real proper change 2009-08-15 17:50 ah 2009-08-15 17:50 logmap, and itable defer btree root allocation 2009-08-15 19:03 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-15 21:31 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-15 23:11 -!- msindia(~mrs1512@117.254.11.133) has joined #tux3 2009-08-16 00:03 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-08-16 00:05 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-16 00:07 -!- flips(~phillips@phunq.net) has joined #tux3 2009-08-16 03:50 -!- bushwacker(~Om_muda_p@202.133.82.114) has joined #tux3 2009-08-16 03:50 -!- bushwacker(~Om_muda_p@202.133.82.114) has left #tux3 2009-08-16 03:51 -!- bushwacker(~Om_muda_p@202.133.82.114) has joined #tux3 2009-08-16 03:51 -!- bushwacker(~Om_muda_p@202.133.82.114) has left #tux3 2009-08-16 06:26 -!- KillerBee(~bee@117.195.0.207) has joined #tux3 2009-08-16 07:53 -!- KillerBee(~bee@117.195.0.207) has joined #tux3 2009-08-16 10:02 flips, there? 2009-08-16 10:02 hi 2009-08-16 10:02 well, I'm thinking about dleaf on-disk format 2009-08-16 10:02 hi 2009-08-16 10:03 current dleaf's design is what is main purpose? 
2009-08-16 10:04 I suspect it might be efficiently for sparse file or something 2009-08-16 10:04 capable of representing versioned pointers 2009-08-16 10:05 I guess there is more... 2009-08-16 10:05 the design it descended from was to represent a disk volume 2009-08-16 10:05 which would have many data regions belonging to different versions 2009-08-16 10:06 oh, i see 2009-08-16 10:06 so, fs may have different pattern? 2009-08-16 10:06 the mapping from logical to physical must therefore be capable of representing multiple physical locations for each logical address 2009-08-16 10:07 yes 2009-08-16 10:07 depending on the type of file, it may be the same pattern, e.g., a file may be a virtual volume 2009-08-16 10:07 or a file may be a database, which has a pattern like a disk volume 2009-08-16 10:08 yes 2009-08-16 10:08 but usually the file pattern will be simpler than a volume 2009-08-16 10:08 but, fairly rare 2009-08-16 10:08 yes 2009-08-16 10:08 but important when it happens 2009-08-16 10:08 I've looked the my disks 2009-08-16 10:09 yes 2009-08-16 10:09 most files on a linux install should just have a single extent 2009-08-16 10:09 so, if it was just using more disk space, it may be acceptable 2009-08-16 10:09 total 1975157, <: 7676 (0.389), >: 0 (0.000), =: 1967481 (99.611) 2009-08-16 10:09 it may 2009-08-16 10:10 probably, my disks is normal *desktop* pattern 2009-08-16 10:10 mine too 2009-08-16 10:10 or developers 2009-08-16 10:10 but you can easily create a virtual volume, which is normal too 2009-08-16 10:10 total is number of files 2009-08-16 10:11 I have several virtual volumes for kvm 2009-08-16 10:11 <: is? 2009-08-16 10:11 <: is sparse file 2009-08-16 10:11 >: may be bug of my tool 2009-08-16 10:11 so there are 7676 sparse files, interesting, it is more than I would expect 2009-08-16 10:11 my tool would not be perfect, however, luckly there is no ">:" :) 2009-08-16 10:12 do you know what kind of files they are? 
2009-08-16 10:12 yes 2009-08-16 10:12 there is full log 2009-08-16 10:12 e.g. 2009-08-16 10:12 /usr/lib/locale/locale-archive 2009-08-16 10:12 and many of those are *.o 2009-08-16 10:12 object file by compiler 2009-08-16 10:12 1.9M files total? 2009-08-16 10:13 yes 2009-08-16 10:13 scaned files 2009-08-16 10:13 ah, .o are sparse, didn't know that 2009-08-16 10:13 me too 2009-08-16 10:13 probably, gcc does seek and write 2009-08-16 10:13 shall I run your tool on my system? 2009-08-16 10:13 because, it needs to sure related offset 2009-08-16 10:13 of course 2009-08-16 10:14 what is >: ? 2009-08-16 10:14 mainly detect bug of my tools 2009-08-16 10:14 ah, it would be a filesystem bug? 2009-08-16 10:14 right 2009-08-16 10:14 so this tool counts sparse files 2009-08-16 10:15 http://userweb.kernel.org/~hirofumi/find-sparse.pl 2009-08-16 10:15 get stat->st_blocks, and calc metadata blocks and data block from stat->st_size 2009-08-16 10:15 only supporting ext[23] 2009-08-16 10:16 and normal file only 2009-08-16 10:16 sh find-sparse.pl / ? 2009-08-16 10:16 it is hard coding "/" 2009-08-16 10:16 just ./find-sparse.pl 2009-08-16 10:16 or perl find-sparse.pl 2009-08-16 10:17 ok, it's running 2009-08-16 10:18 anyway, a partially versioned file will also be rare, similar to a sparse file 2009-08-16 10:18 well, so, it _seems_ sparse files are *.o and db (bdb, gdbm or something) 2009-08-16 10:18 probably 2009-08-16 10:19 I'm thinking group/entry may makes complex things 2009-08-16 10:19 may be making 2009-08-16 10:19 well I planned an additional data format to take care of that 2009-08-16 10:20 it means format is became more complex? 
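Sketch (illustrative, not part of the original conversation): the "<", "=", ">" classification described above for find-sparse.pl can be approximated as follows. This is not the actual tool. The `classify` and `classify_path` names and the 4096-byte default block size are assumptions, and the sketch ignores the metadata (indirect) blocks that the real script accounts for on ext2/ext3, so it will misreport large files there.

```python
import os

SECTOR = 512  # st_blocks is counted in 512-byte units on Linux


def classify(st_size, st_blocks, blocksize=4096):
    """Compare allocated blocks against what st_size would need.

    "<" means fewer blocks than expected (sparse file),
    "=" means exactly as expected,
    ">" means more than expected (in this sketch: metadata or a tool bug).
    """
    data_blocks = -(-st_size // blocksize)           # ceiling division
    expected = data_blocks * (blocksize // SECTOR)   # in 512-byte units
    if st_blocks < expected:
        return "<"
    if st_blocks > expected:
        return ">"
    return "="


def classify_path(path, blocksize=4096):
    """Classify one regular file from its stat information."""
    st = os.stat(path)
    return classify(st.st_size, st.st_blocks, blocksize)
```

A 1 MiB file that occupies only one 4 KiB block would be counted under "<", which matches how the totals line in the log is read.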
2009-08-16 10:20 just an extent stored directly as an inode attribute 2009-08-16 10:20 it was always planned 2009-08-16 10:20 the idea was, to debug the more complex, general format first 2009-08-16 10:21 instead of implementing the easy format, then having many subtle bugs trying to extend it to the general case 2009-08-16 10:21 it sounds good 2009-08-16 10:21 but, I feel it's too complex :) 2009-08-16 10:22 and you have done most of the work on it, so you would know 2009-08-16 10:22 on the other hand, it seems to function pretty well 2009-08-16 10:22 e.g. we can't binary search for dleaf 2009-08-16 10:22 binary search within the dleaf? 2009-08-16 10:22 yes 2009-08-16 10:23 but the dleaf format is already a kind of tree 2009-08-16 10:23 ok, in common cases it will not show much tree structure 2009-08-16 10:24 we need to iterate the extent (and group/entry) at first entry 2009-08-16 10:24 now, why can't we binsearch? 2009-08-16 10:24 we can't binsearch across groups, true 2009-08-16 10:24 because we need to calc offset 2009-08-16 10:24 but within a group we can binsearch, the common case 2009-08-16 10:25 extent position is previous offset + number of entries 2009-08-16 10:26 there is no within a group 2009-08-16 10:27 group needs to calc entries offset too, iirc 2009-08-16 10:27 the group entry format gives a limit for each extent, relative to the logical offset of the group 2009-08-16 10:27 that can be binsearched by logical address 2009-08-16 10:27 group is no, it's entry 2009-08-16 10:27 ?
2009-08-16 10:28 group has entry's info, entry has extent's info 2009-08-16 10:28 so, we can't calc extent directly from gorup 2009-08-16 10:29 I mean, the "entry" format gives a limit for each extent, relative to the logical offset of the group 2009-08-16 10:29 struct entry { be_u32 limit_and_keylo; }; 2009-08-16 10:29 yes 2009-08-16 10:30 all the limits within one group are relative to the same logical offset, allowing binsearch 2009-08-16 10:30 no 2009-08-16 10:30 but linear search is fast enough 2009-08-16 10:30 group needs to know previous extent offset of group 2009-08-16 10:31 group needs to know extent offset of previous group 2009-08-16 10:31 well, it may be fast 2009-08-16 10:32 true, so we can't find the group by binsearch, but after we have found the group, we can find a logical address by binsearch within the group 2009-08-16 10:32 maybe 2009-08-16 10:32 within group, maybe 2009-08-16 10:32 right, that was the intention 2009-08-16 10:32 but, it means we need to iterate group and entry 2009-08-16 10:33 the common case is, there is only one group 2009-08-16 10:33 maybe, it is the point 2009-08-16 10:33 if so, we can more simple group/entry format? 2009-08-16 10:34 well, actually I don't care very much whether we can use binary search or not 2009-08-16 10:35 however, we can't know offset easily means, I thought we can't calc needed space or something 2009-08-16 10:38 btw, currently, I'm thinking dleaf code is acceptable more or less 2009-08-16 10:38 it doesn't seem to be broken 2009-08-16 10:38 calc needed space for what? 2009-08-16 10:39 but, I'm fearing future, e.g. 
versioning and proper extent handling for optimization 2009-08-16 10:40 "calc needed space" or something, well, I'm not sure for now, what is good 2009-08-16 10:40 if it breaks then, we can fix it 2009-08-16 10:40 however, I'm thinking why I can't make simple the code of dleaf.c and filemap.c 2009-08-16 10:41 let's see how complex it is now 2009-08-16 10:41 ok 2009-08-16 10:41 I'm reading my own description of the format now 2009-08-16 10:41 dleaf.c is enough complex 2009-08-16 10:41 it's 840 lines long, one of our bigger files 2009-08-16 10:42 yes 2009-08-16 10:42 compared to 1455 lines long for ext2/inode.c 2009-08-16 10:42 and dleaf escapes the btree_ops operations 2009-08-16 10:42 the equivalent ext2 file also includes tree editng code, so comparison is not exact 2009-08-16 10:43 right 2009-08-16 10:43 that compare seems meaningless more or less 2009-08-16 10:43 let's compare btree.c + dleaf.c to ext2/inode.c 2009-08-16 10:43 maybe, we should compare ->i_data[] codes vs dleaf.c 2009-08-16 10:44 comparing to ext3 extents would be more fair 2009-08-16 10:44 I think we win 2009-08-16 10:44 I mean 2009-08-16 10:44 ext4 2009-08-16 10:44 because ext3 does not have extents 2009-08-16 10:44 yeah 2009-08-16 10:44 ext4 also achieves a 12 byte per extent compactness 2009-08-16 10:45 which is considered an important point by ext4 devs 2009-08-16 10:45 well, btrfs or hammer may be more clear 2009-08-16 10:45 yes, btrfs and hammer do not worry too much about compact representation 2009-08-16 10:45 zfs is the worst 2009-08-16 10:45 and it pays for that in slowness 2009-08-16 10:45 it can 2009-08-16 10:46 but, another benefit of it is simplify? 
2009-08-16 10:46 yes 2009-08-16 10:46 but zfs has many complexities that make it complex in the end 2009-08-16 10:46 btrfs too 2009-08-16 10:47 yes 2009-08-16 10:47 what we have is very simple tree editing code 2009-08-16 10:47 and complex leaf code 2009-08-16 10:47 I would rather have that than simple leaf code and complex tree editing 2009-08-16 10:47 maybe, it's an important design tradeoff 2009-08-16 10:48 that is my theory: if you have to choose, choose to keep your complexity local 2009-08-16 10:48 what does local mean? 2009-08-16 10:49 well, local means may be dtree? 2009-08-16 10:49 local means, the algorithm only has to take into account information that is contained within a specific part of the database 2009-08-16 10:49 yes 2009-08-16 10:49 yes, local in this case means, complexity of leaf editing is all contained within dleaf.c 2009-08-16 10:50 yes, it's good basically, I agree 2009-08-16 10:50 an example of where we may violate that is, filemap.c may have to work with two dleaf nodes at the same time 2009-08-16 10:50 but, dleaf is too main part of fs :) 2009-08-16 10:50 yes, it's also the point by me 2009-08-16 10:51 I don't want to touch dleaf internals in filemap.c 2009-08-16 10:52 and it's what I'm trying now 2009-08-16 10:52 actually, each of the dleaf editing operations is pretty short 2009-08-16 10:53 split, merge, probe, copy, chop, add 2009-08-16 10:53 dleaf operations is just infrastructure of dleaf editing, so yes 2009-08-16 10:54 we would need to add versioning to all of those 2009-08-16 10:55 well, to make sure, I'm not blaming it, I'm just finding a better way 2009-08-16 10:56 yes, when versioning is added to dleaf editing I think of it as another chance to examine the format critically 2009-08-16 10:56 I have a basic concept of how to add versioned extent editing 2009-08-16 10:56 ddsnap already does it, successfully, but it uses a simpler format 2009-08-16 10:57 sounds good 2009-08-16 10:57 for extent editing we have two main operations:
insert and delete 2009-08-16 10:57 and we remember that we don't have sane extent handling in filemap.c/dleaf.c 2009-08-16 10:58 we would have to remember 2009-08-16 10:58 merge contiguous range or something 2009-08-16 10:58 right 2009-08-16 10:59 additional complexity is, extents may overlap the logical addresses of other extents 2009-08-16 10:59 insert and delelte, and maybe merge and separate 2009-08-16 10:59 yes 2009-08-16 10:59 yes, all the same operations 2009-08-16 10:59 but insert and delete are the two most different 2009-08-16 11:00 i see 2009-08-16 11:01 anyway, for both, the idea is, we have a source stream of extents and an output stream 2009-08-16 11:02 um..., yes 2009-08-16 11:02 we read one block at a time from the source stream, and output one block at a time to the output stream 2009-08-16 11:03 current one, or future/dream? 2009-08-16 11:03 current one, yes 2009-08-16 11:03 future/dream 2009-08-16 11:04 if dream, we want to handle multiple blocks at a time? 2009-08-16 11:04 um... 
2009-08-16 11:04 insert goes: input blocks from the source and write to output until we get to where we will write, output blocks of the write, then input blocks from source until all blocks for the dleaf are done 2009-08-16 11:04 yes, we want to handle multiple blocks at a time 2009-08-16 11:05 but that is an optimization 2009-08-16 11:05 to make it work, we write the algorithm to work one block at a time 2009-08-16 11:05 we handle the conversion from extents to individual blocks in the stream input and stream output code 2009-08-16 11:06 that is, we merge contiguous blocks into extents on output 2009-08-16 11:06 yes 2009-08-16 11:06 this is similar to how it is now 2009-08-16 11:06 the reason it is important to do things this way is, the versioned pointer algorithms work one logical block at a time 2009-08-16 11:07 it would require a great deal of complexity to make them work by extents 2009-08-16 11:07 so we push that complexity to the stream input and stream output operations 2009-08-16 11:07 optimizations are clearly possible 2009-08-16 11:07 um... 2009-08-16 11:08 versioning operation is not local to dleaf, so it's not so clear to me 2009-08-16 11:09 ah, versioning is local to a dleaf 2009-08-16 11:09 um... 2009-08-16 11:09 that is the main point of this approach to versioning 2009-08-16 11:09 ileaf? 2009-08-16 11:09 versioning of attributes is local to ileaf 2009-08-16 11:09 yes 2009-08-16 11:09 versioning of extents is local to dleaf 2009-08-16 11:10 so, on total system, versioning across some places? 2009-08-16 11:10 across some places? 2009-08-16 11:11 versioning is there across on some places (?) 2009-08-16 11:12 to start with it can be strictly local to dleaf 2009-08-16 11:12 versioning affects to some places 2009-08-16 11:13 so, later, what happen? 2009-08-16 11:13 there are fancy things we can do, like let some inodes be unversioned 2009-08-16 11:13 start on dleaf, and later is? 
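The stream-based insert described at 11:04 to 11:07 can be sketched roughly as follows. This is an illustration in Python, not the C code in dleaf.c: the names `expand`, `merge` and `insert_extent`, and the `(logical, physical, count)` tuple layout, are hypothetical. Versioning and the on-disk limit on extent size are left out, and the source extents are assumed to be sorted by logical address and non-overlapping.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical in-memory representation, not the tux3 on-disk format.
Extent = Tuple[int, int, int]   # (logical start, physical start, block count)
Block = Tuple[int, int]         # (logical block, physical block)


def expand(extents: Iterable[Extent]) -> Iterator[Block]:
    """Stream input: turn each extent into its individual blocks."""
    for logical, physical, count in extents:
        for i in range(count):
            yield logical + i, physical + i


def merge(blocks: Iterable[Block]) -> List[Extent]:
    """Stream output: merge contiguous blocks back into extents."""
    out: List[Extent] = []
    for logical, physical in blocks:
        if out:
            l, p, n = out[-1]
            if logical == l + n and physical == p + n:
                out[-1] = (l, p, n + 1)   # extends the previous extent
                continue
        out.append((logical, physical, 1))
    return out


def insert_extent(source: List[Extent], new: Extent) -> List[Extent]:
    """Copy source blocks up to the write position, emit the written
    blocks, then copy the remaining source blocks, dropping the ones
    that the write replaced."""
    start, _, count = new
    end = start + count

    def stream() -> Iterator[Block]:
        written = False
        for logical, physical in expand(source):
            if logical >= start and not written:
                yield from expand([new])
                written = True
            if start <= logical < end:
                continue                  # overwritten by the new extent
            yield logical, physical
        if not written:                   # write lands past the last source block
            yield from expand([new])

    return merge(stream())
```

All of the extent handling sits in `expand` and `merge`, so the middle of the algorithm only ever sees single blocks, which is the property the versioned pointer algorithms are said to need.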
2009-08-16 11:13 and we could have per-directory versioning 2009-08-16 11:14 um... 2009-08-16 11:14 it means inode has unversioning ->i_flags or something? 2009-08-16 11:14 yes 2009-08-16 11:14 ah, yes 2009-08-16 11:15 feature sounds like good 2009-08-16 11:15 the issues is how do we implement it simply 2009-08-16 11:16 basic versioning will give us the most important things: ability to replicate and do incremental online backup 2009-08-16 11:16 i see 2009-08-16 11:16 how do we do it with versioning? 2009-08-16 11:17 incremental replication relies on being able to take a snapshot 2009-08-16 11:17 if we take two snapshots on upstream, we can compute the data block difference between them 2009-08-16 11:17 i see 2009-08-16 11:17 this is possible because neither snapshot is changing 2009-08-16 11:18 so the block difference is stable 2009-08-16 11:18 we call that a delta 2009-08-16 11:18 um... 2009-08-16 11:18 it means snapshot should be read-only? 2009-08-16 11:18 at least, old snapshot 2009-08-16 11:18 for replication, yes 2009-08-16 11:19 i see 2009-08-16 11:19 that is easy enough to enforce 2009-08-16 11:19 a snapshot that was set for replication is made read-only 2009-08-16 11:19 more clearly, a snapshot that will be used for replication is read-only 2009-08-16 11:19 and backup also should be read-only? 2009-08-16 11:19 yes 2009-08-16 11:20 ok, sounds like work 2009-08-16 11:20 the downstream replication target already has the earlier of the two upstream snapshots 2009-08-16 11:20 to transform the downstream snapshot into the later of the two upstream snapshots, we write the computed delta to it 2009-08-16 11:21 i see 2009-08-16 11:21 then set a new downstream snapshot, which is now identical to the later of the two upstream snapshots, and one replication cycle is finished 2009-08-16 11:21 replication is working like this on other expensive system too? 
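One replication cycle as described above (compute the delta between two read-only upstream snapshots, write it to the downstream copy of the earlier one) might look like the following sketch. It models a snapshot as a plain Python dict from logical block number to contents. That model and the names `compute_delta` and `apply_delta` are assumptions for illustration only, not the ddsnap or tux3 delta format.

```python
from typing import Dict, Optional

# Hypothetical model: a snapshot maps logical block number -> block contents.
Snapshot = Dict[int, bytes]
# A delta maps block number -> new contents, or None if the block went away.
Delta = Dict[int, Optional[bytes]]


def compute_delta(earlier: Snapshot, later: Snapshot) -> Delta:
    """Difference between two snapshots. The result is stable only
    because neither snapshot changes while it is being computed."""
    delta: Delta = {}
    for block, data in later.items():
        if earlier.get(block) != data:
            delta[block] = data           # new or changed block
    for block in earlier:
        if block not in later:
            delta[block] = None           # block no longer present
    return delta


def apply_delta(base: Snapshot, delta: Delta) -> Snapshot:
    """Transform the downstream copy of the earlier snapshot into the
    later one. Returns a new dict; the base snapshot stays untouched."""
    result = dict(base)
    for block, data in delta.items():
        if data is None:
            result.pop(block, None)
        else:
            result[block] = data
    return result


# One cycle: downstream already holds the earlier upstream snapshot.
upstream_a = {0: b"a", 1: b"b", 2: b"c"}
upstream_b = {0: b"a", 1: b"B", 3: b"d"}
downstream = dict(upstream_a)
downstream = apply_delta(downstream, compute_delta(upstream_a, upstream_b))
```

Only the changed blocks cross the wire, and the cycle is only correct if the downstream base really is identical to the earlier upstream snapshot, which is why the replication target is kept read-only.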
2009-08-16 11:22 yes 2009-08-16 11:22 it is working like that on ddsnap 2009-08-16 11:22 and it works well 2009-08-16 11:22 so, most of the work for replication is already done 2009-08-16 11:22 e.g. do you know about EMC's system? 2009-08-16 11:22 hmm 2009-08-16 11:22 good question 2009-08-16 11:22 I think emc too 2009-08-16 11:22 but I normally think of netapp when thinking about replication 2009-08-16 11:23 i see 2009-08-16 11:23 the main difference between ddsnap replication and tux3 replication is, the replication delta is physical volume blocks for ddsnap, but it is logical file data for tux3 2009-08-16 11:24 this is actually an advantage for tux3 2009-08-16 11:24 well, my feeling from it, I would like to change the snapshot on downstream system :) 2009-08-16 11:24 change the snapshot, you mean write to it? 2009-08-16 11:24 yes 2009-08-16 11:24 we do that in ddsnap, but first we set another snapshot on downstream 2009-08-16 11:25 so we have a read-only snapshot held on downstream for replication, and a writeable snapshot that is initially identical to the read-only snapshot 2009-08-16 11:25 yes 2009-08-16 11:26 however, I though I would like to merge replicated snapshot and writable snapshot 2009-08-16 11:26 we had to do that because when we mount the snapshot on downstream, ext3 writes to the volume even if it is a read-only mount 2009-08-16 11:26 replicated snapshot and writeable snapshot are in fact merged, they share any data that was not changed 2009-08-16 11:27 oh 2009-08-16 11:27 if there are conflict, what happen? 2009-08-16 11:27 unless you also want to replicate changes back upstream, there is no conflict 2009-08-16 11:28 um... 2009-08-16 11:28 downstream can have its own changes to an upstream snapshot 2009-08-16 11:28 for example 2009-08-16 11:28 upstream changes /foo/bar 2009-08-16 11:28 then downstream also changed /foo/bar on writable snapshot 2009-08-16 11:29 and replacation was started 2009-08-16 11:29 is this conflict? 
2009-08-16 11:29 the replication has to be done against a read-only snapshot on downstream 2009-08-16 11:29 yes 2009-08-16 11:29 ah 2009-08-16 11:30 it's pretty cool how this works 2009-08-16 11:30 then downstream want to merge writable snapshot and read-only snapshot 2009-08-16 11:30 s/read-only/read-only replicated/ 2009-08-16 11:31 if downstream wants, it can use diff and patch to "roll forward" its local differences to the next replicated snapshot from upstream 2009-08-16 11:31 it means that's not part of replication? 2009-08-16 11:31 or it can use rsync, or some more clever method based on analyzing differences using internal filesystem structures 2009-08-16 11:32 no, it's not part of replication 2009-08-16 11:32 do do that within the filesystem is a difficult problem 2009-08-16 11:32 but not impossible 2009-08-16 11:32 ok 2009-08-16 11:32 read-only replication is already very useful 2009-08-16 11:33 hirofumi> however, I though I would like to merge replicated snapshot and writable snapshot 2009-08-16 11:33 it is also the basis of online backup 2009-08-16 11:33 writeable downstream snapshots that are replicated back upstream is a challenging and fun problem 2009-08-16 11:33 yes, well, that merge meant it 2009-08-16 11:34 yes 2009-08-16 11:34 this is very difficult to do with physical replication like ddsnap, but will be easier with logical replication like tux3 2009-08-16 11:34 and I wonder EMC's or netapp's is what is doing 2009-08-16 11:34 neither of them do anything as fancy as that 2009-08-16 11:34 let's see what emc has for replication 2009-08-16 11:35 anyway, I should be very difficult 2009-08-16 11:35 s/I/it/ 2009-08-16 11:35 I think emc just does volume mirroring for replication, which is not the same thing 2009-08-16 11:35 I could be wrong about that 2009-08-16 11:36 http://www.emc.com/products/detail/software/replication-manager.htm 2009-08-16 11:36 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-16 11:36 
hi tim_dimm 2009-08-16 11:37 hey flips 2009-08-16 11:37 the techniques describe on that emc page are essentially what zumastor implements 2009-08-16 11:37 (that was for hirofumi) 2009-08-16 11:37 link? just came into the conversation 2009-08-16 11:37 http://www.emc.com/products/detail/software/replication-manager.htm 2009-08-16 11:37 btw, zumastor's delta is what format? 2009-08-16 11:38 just trying to figure out if emc implements delta replication, or whether it is just mirroring 2009-08-16 11:38 zumastor's delta is a custom format 2009-08-16 11:38 defined by ddsnap 2009-08-16 11:38 well, I wonder if it works on different fs on downstream and upstream 2009-08-16 11:38 emc's is "automated" so it must be awesome ;-) 2009-08-16 11:39 zumastor has to have the same fs on upstream and downstream 2009-08-16 11:39 tux3 can in theory have different fs on upstream and downstream if the delta format is standardized 2009-08-16 11:40 I guess it would be good to be depending on fs detail 2009-08-16 11:40 one thing that is generally true about vendor's descriptions of their supported features: if they don't actually support something, it can be very difficult to determine that from their own technical descriptions 2009-08-16 11:41 well. e.g. 
I imagine downstream is using dedup and upstream is not 2009-08-16 11:41 hmm, let me think about that a moment 2009-08-16 11:42 we should make it a requirement of the dedup design that it is transparent to replication 2009-08-16 11:43 I guess, if delta is logical diff, it's not depending on internal format 2009-08-16 11:43 -!- msindia(~mrs1512@117.254.11.133) has joined #tux3 2009-08-16 11:43 that might mean it is necessary to re-do the dedup on replication 2009-08-16 11:43 that should "just work" 2009-08-16 11:43 probaly, it can be slow though 2009-08-16 11:43 dedup is expected to be slow 2009-08-16 11:43 yes 2009-08-16 11:44 it is more likely to be used on a downstream replication target than a primarly volume 2009-08-16 11:44 I guess, dedup needs to re-do always 2009-08-16 11:44 primary I mean 2009-08-16 11:45 yes, online backup or somehitng 2009-08-16 11:46 even if it's offline backup, maybe useful 2009-08-16 11:46 I will be away for a few minutes 2009-08-16 11:46 ok 2009-08-16 12:00 back 2009-08-16 12:03 ok, let's think about dleaf 2009-08-16 12:03 for a while 2009-08-16 12:03 I meant, a hour or so, I'll sleep after that :) 2009-08-16 12:04 just one addition to the above... it looks like emc implements ddsnap-style replication based on volume snapshots 2009-08-16 12:04 it would be interesting to know whether this feature came before or after ddsnap 2009-08-16 12:04 probably before 2009-08-16 12:05 I guess too 2009-08-16 12:05 and there is before emc too 2009-08-16 12:06 "EMC Snap's copies take roughly 30% of the capacity required for the source, and that's only the space necessary to protect the original data's integrity, which would subsequently be overwritten on the source volume. " 2009-08-16 12:06 sounds like ddsnap 2009-08-16 12:07 http://processor.com/editorial/article.asp?article=articles%2Fp2537%2F33p37%2F33p37.asp&guid=&searchtype=&WordList=&bJumpTo=True 2009-08-16 12:08 ok, dleaf? 
2009-08-16 12:11 yes 2009-08-16 12:12 EMC seems to have realtime replication? 2009-08-16 12:12 well 2009-08-16 12:12 I guess emc has both snapshot-based and mirror-based replication 2009-08-16 12:12 i see 2009-08-16 12:13 they rely on battery-backed ram to make the snapshot-based replication fast, whereas ddsnap just lets write performance suffer 2009-08-16 12:13 one big goal of tux3 is to allow snapshot-based replication without a high write performance cost 2009-08-16 12:14 i see 2009-08-16 12:15 my goal is, make simple, stable, and fastest fs :) 2009-08-16 12:15 right 2009-08-16 12:16 well, dleaf 2009-08-16 12:16 um... 2009-08-16 12:16 ah, I want to remove filemap.c's dleaf stuff 2009-08-16 12:21 and replace with? 2009-08-16 12:22 I'm thinking replace those with btree operations 2009-08-16 12:22 that seems hard to me 2009-08-16 12:22 generic btree operations, and btree one calls internal one 2009-08-16 12:22 ah 2009-08-16 12:22 s/internal/leaf operations/ 2009-08-16 12:22 yes, possible 2009-08-16 12:23 well I like the map_region factoring though 2009-08-16 12:23 and upper layer just know a little about leaf internal 2009-08-16 12:23 what problem would the new factoring solve? 
2009-08-16 12:24 I'm not sure at least for now though 2009-08-16 12:24 I'm thinking it makes add good layer 2009-08-16 12:25 upper layer of backend uses btree layer 2009-08-16 12:25 btree layer uses leaf layer 2009-08-16 12:26 right now, filemap.c treats btree as a library helper 2009-08-16 12:26 if it can, I guess it may be easy to replace leaf internal 2009-08-16 12:26 yes 2009-08-16 12:26 and is depending on leaf internal 2009-08-16 12:27 yes, which seems right to me 2009-08-16 12:27 yes, and it also means, we might be able to add new leaf easily more or less 2009-08-16 12:28 a direct data attribute will not even have a btree, but it will still have map_region 2009-08-16 12:28 sure 2009-08-16 12:32 ok, my design note 2009-08-16 12:33 ah, maybe, there is another serious problem 2009-08-16 12:33 let me explain a bit more 2009-08-16 12:33 listening 2009-08-16 12:34 we keep dleaf broken while intering new extents 2009-08-16 12:34 because we copy the part of dleaf to memory temporary 2009-08-16 12:34 worried about smp problems? 2009-08-16 12:34 and insert data, then merge temporary to dleaf 2009-08-16 12:35 no 2009-08-16 12:35 I think bitmap has problem 2009-08-16 12:36 I think bitmap need to read dtree before merge temporary memory 2009-08-16 12:37 at least bitmap is unversioned 2009-08-16 12:37 well, we are not even talking about those issues 2009-08-16 12:38 we talked related issues in past 2009-08-16 12:38 ok, so it is another allocate-in-allocate issue? 
2009-08-16 12:38 yes 2009-08-16 12:38 probably 2009-08-16 12:39 we fixed the cursor_redirect(), but map_region may still have problem 2009-08-16 12:39 maybe this one is solved because the allocated bit is set only in the page cache 2009-08-16 12:40 yes 2009-08-16 12:40 but, extent on tail of dleaf may not be on page cache 2009-08-16 12:41 ah, the read issue, I always overlook it 2009-08-16 12:42 yes, map_region() for write -> balloc() -> map_region() for read 2009-08-16 12:42 so we must support reading from the bitmap btree even when an edit is in progress 2009-08-16 12:42 I must be more careful about analyzing that case 2009-08-16 12:42 probably 2009-08-16 12:43 I'm missing something though 2009-08-16 12:43 I will think about it while you are sleeping ;) 2009-08-16 12:43 :) 2009-08-16 12:44 btw, cursor_redirect() was solved by chaning btree->root at last 2009-08-16 12:44 i.e. read will see old blocks 2009-08-16 12:45 however, map_region() may be different case 2009-08-16 12:45 - when we are updating the filemap for the bitmap itself, we may have altered the btree 2009-08-16 12:45 but now we need to read in a bitmap in the altered region. 2009-08-16 12:45 <- is that the issue? 2009-08-16 12:46 it's not sure 2009-08-16 12:47 if region has dirty page cache, I guess there is no problem 2009-08-16 12:47 but, if those are outside of dirty pages, I guess it's problem 2009-08-16 12:47 and we might need to take care to read the bitmap into cache before beginning the update 2009-08-16 12:48 it seems to me, this is a robust solution 2009-08-16 12:48 I guess it's complex 2009-08-16 12:48 it needs to pin page cache 2009-08-16 12:48 yes 2009-08-16 12:49 seems like a good idea 2009-08-16 12:49 well, we don't let vfs evict our pages 2009-08-16 12:49 or, we won't let it 2009-08-16 12:49 I guess "we can't allocate blocks if dleaf is not stable" is simple 2009-08-16 12:49 stable means? 
2009-08-16 12:50 can present needed extents 2009-08-16 12:51 usual operations are protected by mutex, of course 2009-08-16 12:51 this can happen only flush path of bitma, I guess 2009-08-16 12:52 s/bitma/bitmap/ 2009-08-16 12:53 right 2009-08-16 12:53 and we have investigated it a number of times 2009-08-16 12:53 once more would be good ;) 2009-08-16 12:53 :) 2009-08-16 12:54 ok, technical note on commit + replay... it starts with overview of efficiency goals for spinning disk 2009-08-16 12:54 good 2009-08-16 12:55 our goal is, mostly linear access for both read and write, for common loads 2009-08-16 12:55 probably, we need good sanity check for bitmap write case 2009-08-16 12:55 linear access for read conflicts with write 2009-08-16 12:55 ye 2009-08-16 12:55 yes 2009-08-16 12:55 we resolve that conflict by using logging 2009-08-16 12:55 iirc, it's one of problem of logfs 2009-08-16 12:55 that is the main point of the note 2009-08-16 12:56 um... 2009-08-16 12:56 loging + rollup 2009-08-16 12:56 so we avoid the logfs problem of fragmented read access 2009-08-16 12:56 data blocks is also solved? 2009-08-16 12:56 which issue? 2009-08-16 12:56 not sure :) 2009-08-16 12:57 what do you mean by data blocks? 2009-08-16 12:57 well, I was assuming data blocks is also written as liner 2009-08-16 12:57 at least for now 2009-08-16 12:58 in here, data blocks means page cache of normal file 2009-08-16 12:58 yes 2009-08-16 12:58 I should mention that explicitly 2009-08-16 12:59 linear writing means both data and metadata 2009-08-16 12:59 oh 2009-08-16 12:59 and the difficulty is, writing the higher level btree nodes and bitmap blocks linearly... it is easy to do, but it will be fragmented for read 2009-08-16 13:00 so that is why we defer the update of higher level btree nodes and bitmap blocks 2009-08-16 13:00 data blocks is also fragmented easily? 2009-08-16 13:01 yes, especially with versioning, but not just with versioning 2009-08-16 13:01 so... 
we assume so sane higher level allocation policy that we have not yet implemented 2009-08-16 13:01 "some sane" I meant 2009-08-16 13:02 ah, i see 2009-08-16 13:02 yes 2009-08-16 13:03 higher level allocation policy seems easier to me that the low level issues of how to avoid write vs read fragmentation of metadata 2009-08-16 13:03 it sounds hard, but we would need to do 2009-08-16 13:03 for unversioned, we can follow the model of ext3, it is not too bad 2009-08-16 13:03 um... 2009-08-16 13:04 for versioned, we need to do some creative new work, but then that problem exists for every versioning filesystem 2009-08-16 13:04 I don't think any of them has solved it very well 2009-08-16 13:04 ext* separates metadata (inodes, bitmap, or something) and data? 2009-08-16 13:05 I guess it's big difference 2009-08-16 13:05 ext3 has a name for its allocation strategy 2009-08-16 13:05 I forgot it just now 2009-08-16 13:05 orlov? 2009-08-16 13:05 yes 2009-08-16 13:05 a reasonable way of organizing data vs directories 2009-08-16 13:05 yes 2009-08-16 13:06 but, inode, bitmap, and btree placement may be issue 2009-08-16 13:06 well 2009-08-16 13:06 I'm still not thinking those almost though 2009-08-16 13:07 yes, we want to group inodes together in batches, and place the directories near them, and the data not too far away 2009-08-16 13:07 so, the allocation strategy ends up mainly driven by choice of inode number 2009-08-16 13:08 i see 2009-08-16 13:08 doesn't use perfect liner write? 2009-08-16 13:08 perfect is impossible 2009-08-16 13:09 except for initial write to empty filesystem 2009-08-16 13:09 if it's like logfs, it's possible? 
2009-08-16 13:09 for logfs, writing may be perfect, but reading will suck 2009-08-16 13:09 of course, there is issues by nearing full 2009-08-16 13:09 yes 2009-08-16 13:10 yes 2009-08-16 13:10 :) 2009-08-16 13:10 if everybody else is like me, their filesystem is nearly full all the time 2009-08-16 13:10 so behavior at the nearly full state is very important 2009-08-16 13:10 oh 2009-08-16 13:10 I always try to avoid nearly full :) 2009-08-16 13:11 I try to avoid and always fail :) 2009-08-16 13:11 :) 2009-08-16 13:12 yes, sure 2009-08-16 13:12 at the nearly full state, the advantage of handling small extents well, with compact metadata becomes more important 2009-08-16 13:12 the dleaf nodes may hold many small extents then 2009-08-16 13:13 performance will still be ok if the small extents are reasonably close together 2009-08-16 13:13 well, it's depends on human's behavior 2009-08-16 13:13 it does indeed 2009-08-16 13:13 I meant, human will use disk space if there is 2009-08-16 13:13 in the worst case, the only solution is active defrag 2009-08-16 13:13 yes 2009-08-16 13:13 even if fs does best :) 2009-08-16 13:13 "humans expand to fill all available space" 2009-08-16 13:13 like a gas 2009-08-16 13:14 yes 2009-08-16 13:15 well, anyway, we need allocation policy 2009-08-16 13:16 yes, but it seems easier to me than atomic commit 2009-08-16 13:16 which is nearly there, so I guess I better concentrate 2009-08-16 13:17 good 2009-08-16 13:17 I guess there is no perfect for it, so in some terms it's hard though 2009-08-16 13:19 well, it sounds like interesting to analyze and optimize cycle 2009-08-16 13:19 if you do an incorrect allocation policy, the filesystem just runs slower, but if atomic commit is wrong, there will be corruption and nobody will want to use the filesystem 2009-08-16 13:20 yes 2009-08-16 13:20 stable is top priority for fs and me 2009-08-16 13:28 http://en.wikipedia.org/wiki/Orlov_block_allocator <- wikipedia is fine 2009-08-16 13:29 yes 2009-08-16 
13:29 oh 2009-08-16 13:29 I was thinking it's from freebsd, but wiki says openbsd 2009-08-16 13:30 hmm, I thought freebsd too 2009-08-16 13:32 probably, btrfs, hammer, zfs compare would be interesting if it can be done easily 2009-08-16 13:33 oh, original is openbsd 2009-08-16 13:33 how did you find that? 2009-08-16 13:34 click wike links :) 2009-08-16 13:34 http://web.archive.org/web/20080131082712/http://www.ptci.ru/gluk/dirpref/old/dirpref.html 2009-08-16 13:35 s/wike/wiki/ 2009-08-16 13:36 :) 2009-08-16 14:13 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-16 14:19 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-16 14:50 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-16 16:03 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-16 20:33 -!- mebored(~talktome_@117.254.11.133) has joined #tux3 2009-08-16 20:42 -!- msindia(~lol@117.254.11.133) has joined #tux3 2009-08-16 21:37 -!- anthy(~DoctorX29@196.202.72.2) has joined #tux3 2009-08-16 21:37 -!- anthy(~DoctorX29@196.202.72.2) has left #tux3 2009-08-16 21:43 -!- anthy(~DoctorX29@196.202.72.2) has joined #tux3 2009-08-16 21:43 -!- anthy(~DoctorX29@196.202.72.2) has left #tux3 2009-08-16 23:47 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-17 01:20 -!- msindia(~lol@117.254.11.133) has joined #tux3 2009-08-17 08:02 -!- KillerBee(~bee@117.195.9.123) has joined #tux3 2009-08-17 08:16 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-17 10:18 -!- msindia(~lol@117.254.11.133) has joined #tux3 2009-08-17 10:40 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-17 11:23 -!- KillerBee(~abc@117.195.9.123) has joined #tux3 2009-08-17 11:59 -!- KillerBee(~killerbee@117.195.9.123) has joined #tux3 2009-08-17 12:09 -!- KillerBee(~killerbee@117.195.9.123) has joined #tux3 2009-08-17 12:10 -!- 
david(~david@vpn199a.rzuser.uni-heidelberg.de) has joined #tux3 2009-08-17 12:48 -!- KillerBee(~killerbee@117.195.9.123) has joined #tux3 2009-08-17 13:10 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-17 13:11 -!- msindia(~lol@117.254.11.133) has joined #tux3 2009-08-17 15:13 flips, there? 2009-08-17 15:14 group/entry doesn't have version, is it ok? 2009-08-17 15:15 well, it's ok, but group/entry is copyed to match extented? 2009-08-17 15:16 anyway, I'll try to see what happen, if I changed those with simple one 2009-08-17 15:16 and whether there are enough benefit 2009-08-17 16:19 -!- ajonat(~ajonat@190.48.100.182) has joined #tux3 2009-08-17 16:35 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-17 17:43 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-17 18:35 -!- ajonat(~ajonat@190.48.100.182) has joined #tux3 2009-08-17 19:48 btw, I guess you know about schedular activation for M:N thread model 2009-08-17 19:49 in my understand, it sounds really good, however, it was too hard to implement completely 2009-08-17 19:50 solaris, and seems freebsd gave up 2009-08-17 19:52 well, anyway, it is one of reasons why I'm sensitive about complexity 2009-08-17 20:47 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-17 22:42 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-18 01:49 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-18 07:08 good morning 2009-08-18 07:08 hirofumi, there? 
2009-08-18 08:17 -!- boy_Jkt(~wettime@67.212.81.75) has joined #tux3 2009-08-18 08:17 -!- boy_Jkt(~wettime@67.212.81.75) has left #tux3 2009-08-18 08:33 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-18 09:13 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-18 09:15 hi tim_dimm 2009-08-18 09:15 morning flipz 2009-08-18 09:17 hi 2009-08-18 09:30 hi hirofumi 2009-08-18 09:31 hi 2009-08-18 09:31 < hirofumi> group/entry doesn't have version, is it ok? 2009-08-18 09:31 group/entry is copyed to match extented? 2009-08-18 09:31 group/entry is copyed to match extented? 2009-08-18 09:31 group/entry is copyed to match extented? <- ? 2009-08-18 09:32 whoops :) 2009-08-18 09:32 my isp is crazy 2009-08-18 09:32 you too :) 2009-08-18 09:32 :) 2009-08-18 09:33 well, stopping server without announce is unacceptable 2009-08-18 09:34 my isp does that frequently 2009-08-18 09:34 group/entry is copyed to match to extents 2009-08-18 09:34 speakeasy 2009-08-18 09:34 lost emails frequently? 2009-08-18 09:34 getting very tired of it, want to move but that is work 2009-08-18 09:34 no lost emails because I run my own smtp 2009-08-18 09:35 good 2009-08-18 09:35 exim4 (debian mta) 2009-08-18 09:35 maybe, I'm losting my emails now 2009-08-18 09:35 sucks 2009-08-18 09:35 really 2009-08-18 09:36 setting up exim4 isn't as hard as everybody says ;) 2009-08-18 09:36 well, yes 2009-08-18 09:36 but, it needs domain 2009-08-18 09:37 domains are cheap 2009-08-18 09:37 oh 2009-08-18 09:40 I see hirofumi.org is taken 2009-08-18 09:41 oh 2009-08-18 09:45 < hirofumi> group/entry doesn't have version, is it ok? 
2009-08-18 09:45 I was wondering what you meant 2009-08-18 09:45 ah 2009-08-18 09:45 yes it's ok 2009-08-18 09:46 the version is part of the extent 2009-08-18 09:46 I think current format can be used for several ways 2009-08-18 09:47 one group/entry can have multiple extents 2009-08-18 09:47 one group/entry can have only a extent 2009-08-18 09:47 etc. 2009-08-18 09:47 yes, the current format is nearly ready for versions 2009-08-18 09:47 very small changes required 2009-08-18 09:48 um... 2009-08-18 09:48 I think there may be some places where multiple extents per entry are not handled 2009-08-18 09:48 I'm thinking group/entry should have version instead of extent 2009-08-18 09:49 what would the advantage be? 2009-08-18 09:49 I think it can search interst extent easily 2009-08-18 09:50 mainly for lookup? 2009-08-18 09:56 yes 2009-08-18 09:56 I just thought for it 2009-08-18 09:57 the version is per-extent, but the group/entry covers multiple extents, maybe with different versions 2009-08-18 09:58 so putting the version in the extent is natural 2009-08-18 09:58 map_region will just skip extents it isn't interested in 2009-08-18 09:58 yes 2009-08-18 09:59 note: because of version inheritance, map_region will not look for an exact match of the version, but instead it looks for the "nearest enclosing" version 2009-08-18 09:59 but, I'm thinking the fact might be it's rare 2009-08-18 10:00 it's not rare if the file is a virtual volume 2009-08-18 10:00 I guess it's not so 2009-08-18 10:00 let's think it 2009-08-18 10:01 our direct data attribute will not have version tags on its extents, I think 2009-08-18 10:01 e.g. ver=1 has l=1~10,p=1~10 2009-08-18 10:01 and this will be the common case 2009-08-18 10:01 direct data attribute? 2009-08-18 10:02 attribute to be implemented, to optimize the case of small files all of the same version 2009-08-18 10:02 um... 
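The "nearest enclosing" lookup mentioned at 09:59 can be illustrated with a small sketch. It assumes versions form a tree in which a child inherits from its parent, and that the extents covering one logical address each carry a version tag. The `parents` map, the `(version, physical)` pairs and the function name are hypothetical, and the real versioned pointer algorithms in tux3 are more involved than this.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical version tree: version -> parent version (None for the root).
Parents = Dict[int, Optional[int]]
# Hypothetical versioned extent for one logical address: (version tag, physical block).
VersionedExtent = Tuple[int, int]


def nearest_enclosing(parents: Parents,
                      extents: List[VersionedExtent],
                      version: int) -> Optional[int]:
    """Return the physical block visible from `version`: the extent
    tagged with the version itself or, failing that, with its closest
    ancestor. Extents tagged with unrelated versions are skipped."""
    by_version = {tag: physical for tag, physical in extents}
    v: Optional[int] = version
    while v is not None:
        if v in by_version:
            return by_version[v]
        v = parents[v]                    # inherit from the parent version
    return None                           # nothing mapped for this version
```

For example, with versions 1 and 3 as children of 0 and version 2 as a child of 1, an extent tagged 1 is what version 2 sees, while version 3 falls back to the extent tagged 0.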
2009-08-18 10:03 in that case, the version tag goes on the direct data attribute, extents are all of the same version, there are no groups 2009-08-18 10:03 attribute means dleal's group/entry/extent? 2009-08-18 10:03 attribute means inode table block attribute 2009-08-18 10:03 i.e., setattr 2009-08-18 10:04 I mean, store_attrs 2009-08-18 10:04 ok 2009-08-18 10:04 by "all of the same version", I meant, all extents of the file the same version 2009-08-18 10:04 also, direct data attribute would be non-sparse 2009-08-18 10:05 so, if the file is sparse, or very large, or has more extents with different versions, it uses the general tree format with dleaf nodes 2009-08-18 10:06 s/more// 2009-08-18 10:06 it sounds good 2009-08-18 10:06 and assumption may be true 2009-08-18 10:06 small file doesn't sparse file 2009-08-18 10:07 right, from your measurements on the weekend something like 99% of files would use the direct pointer attribute 2009-08-18 10:08 direct data I mean 2009-08-18 10:08 no, no 2009-08-18 10:08 that just tell sparse file 2009-08-18 10:08 not small file 2009-08-18 10:09 by small, I mean less than a 10 GB or so 2009-08-18 10:09 let me see 2009-08-18 10:09 maybe smaller ;) 2009-08-18 10:09 10GB is enough big 2009-08-18 10:10 depending on fragmentation 2009-08-18 10:10 our current count=6bits 2009-08-18 10:10 extent->count 2009-08-18 10:10 we can allow the extent block count to be 16 bits instead of 6 bits, because the version field is available 2009-08-18 10:10 for direct data extents 2009-08-18 10:11 2^16 is 256MB for 4k page 2009-08-18 10:11 s/page/block/ 2009-08-18 10:11 so that is 2**28 bytes per direct data extent => 256 MB/extent 2009-08-18 10:11 right 2009-08-18 10:12 and we can allow a direct data extent to be almost as big as an inode table block 2009-08-18 10:12 so... 
2009-08-18 10:12 sorry, I mean direct data attribute 2009-08-18 10:12 40 extents for 10GB 2009-08-18 10:12 right 2009-08-18 10:12 so if fragmentation is not too bad, that can be done 2009-08-18 10:13 40*sizeof(group/entry/extent)=640bytes 2009-08-18 10:13 high fragmentation would reduce the maximum size that can be represented, but it would still be much larger than the average file size on a linux install 2009-08-18 10:14 so even on a fragmented system, nearly all the files would be direct data attributes 2009-08-18 10:15 if larger than multiple hundred MB, it would be hard to allocate contiguous range 2009-08-18 10:15 without god :) 2009-08-18 10:16 depends on how big your disk is 2009-08-18 10:17 -!- pgquiles(~pgquiles@159.Red-81-39-154.dynamicIP.rima-tde.net) has joined #tux3 2009-08-18 10:17 if the disk is that big, of course we do not care a lot about the size of the metadata 2009-08-18 10:17 as long as the metadata isn't stupidly large, like in ZFS 2009-08-18 10:27 sorry for delay 2009-08-18 10:27 no problem 2009-08-18 10:27 was checking email server 2009-08-18 10:27 well 2009-08-18 10:28 I think the problem is 2009-08-18 10:28 ZFS has pretty good metadata/data ratio, but the smallest possible data pointer is 128 bytes, which really sucks for cache performance 2009-08-18 10:28 just clarifying me previous comment 2009-08-18 10:29 well, we should be more careful for cache performance mesurement 2009-08-18 10:29 well, issue of fragment 2009-08-18 10:29 I think, usual fragment case is 2009-08-18 10:30 I think our scheme will be really good for performance on a fragemented system 2009-08-18 10:30 even if contiguous write, it's interrupted by another job 2009-08-18 10:30 yes, maybe not bad 2009-08-18 10:30 in the worst case, each block needs an extent of one block, that is 8 bytes/block 2009-08-18 10:30 but, I prefer to think our is bad 2009-08-18 10:31 um... 2009-08-18 10:31 still a ratio of 1/512, metadata/data 2009-08-18 10:32 predilection? 
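The numbers in this exchange can be checked with a few lines of Python. The 16-byte entry size is not stated directly; it is inferred here from the "40*sizeof(group/entry/extent)=640bytes" message, so treat it as an assumption.

```python
from fractions import Fraction

BLOCK_SIZE = 4096        # 4k blocks, as in the discussion
COUNT_BITS = 16          # proposed width of the extent block count
ENTRY_BYTES = 16         # assumed: inferred from 640 bytes / 40 extents
EXTENT_BYTES = 8         # worst case: one 8-byte extent per data block

# Largest direct data extent: 2**16 blocks of 4k each = 2**28 bytes.
max_extent_bytes = (1 << COUNT_BITS) * BLOCK_SIZE

# Extents needed for an unfragmented 10 GB file.
extents_for_10gb = (10 << 30) // max_extent_bytes

# Metadata needed to describe those extents.
metadata_for_10gb = extents_for_10gb * ENTRY_BYTES

# Worst case metadata/data ratio when every block needs its own extent.
worst_case_ratio = Fraction(EXTENT_BYTES, BLOCK_SIZE)

print(max_extent_bytes // (1 << 20), "MB per extent")
print(extents_for_10gb, "extents,", metadata_for_10gb, "bytes of metadata")
print("worst case metadata/data ratio:", worst_case_ratio)
```

The worst case ratio ignores the dleaf fill factor and directory overhead, so a real figure would be somewhat higher than 1/512.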
2009-08-18 10:32 I meant 2009-08-18 10:32 prediction 2009-08-18 10:32 yes 2009-08-18 10:33 actually, could be 1/100 or so, if dleaf is on average half full, and there is some overhead for the dleaf directory 2009-08-18 10:33 still ok 2009-08-18 10:33 ah, no, I meant, I prefere to add exra point to competer 2009-08-18 10:34 s/prefere/prefer/ 2009-08-18 10:34 competer? 2009-08-18 10:34 competitor 2009-08-18 10:34 english is hard :) 2009-08-18 10:35 ah :) 2009-08-18 10:35 competer is a logical spelling ;) 2009-08-18 10:35 we should add that word 2009-08-18 10:36 and it should be competion, not competition 2009-08-18 10:37 well, what I want to say is 2009-08-18 10:37 I prefer, add disadvantage point to our, and add advantage point to competitors 2009-08-18 10:38 and with this point, if I can win, it would be good surely 2009-08-18 10:38 yes, good 2009-08-18 10:39 I know there is proverb(?) in japan 2009-08-18 10:39 ext2/3 uses 4 bytes/pointer, a big advantage with highly fragmented filesystems 2009-08-18 10:40 but, I don't know what say in english 2009-08-18 10:40 yes 2009-08-18 10:40 well, backto fragment case 2009-08-18 10:40 you want to say, it is better to assume that the competitor is very good 2009-08-18 10:40 I think 2009-08-18 10:41 yes 2009-08-18 10:41 and if it seems to same, competior's win 2009-08-18 10:41 at the same time, it is worth noticing any weaknesses, and try to avoid them 2009-08-18 10:42 yes, good 2009-08-18 10:43 and to compare it, likewise, I prefer to think we are also weak 2009-08-18 10:43 well 2009-08-18 10:44 usual situation, I think jobs is interrupted 2009-08-18 10:44 so, it's hard to allocate contiguous range 2009-08-18 10:44 the dleaf design looks pretty good for the fragmented case, with 12 bytes or less overhead per extent 2009-08-18 10:45 also, bitmaps are very good for the fragmented case, as opposed to an extent-based free map 2009-08-18 10:45 for bitmap, it depends on volume's fragment, not earch file 2009-08-18 10:45 by a slight 
change in dleaf semantics, we could reduce the overhead to nearly 8 bytes/extent 2009-08-18 10:46 yes, I meant volume fragmentation 2009-08-18 10:47 well, so, group/entry/extent separation is what's for? 2009-08-18 10:47 to make the per-extent overhead low 2009-08-18 10:48 this gives us the 12 bytes/extent, otherwise it would be more bytes. ZFS uses 128 bytes/extent (that is extreme) 2009-08-18 10:48 per-extent? 2009-08-18 10:48 I want real example 2009-08-18 10:48 yes, we normally have one 8 byte extent plus 4 bytes of dictionary in the dleaf, per disk extent 2009-08-18 10:48 e.g. l=1~10,p=5=15 2009-08-18 10:49 l is logical address, p is physical address 2009-08-18 10:49 and rvalue is range 2009-08-18 10:49 1~10 means? 2009-08-18 10:49 range 2009-08-18 10:50 ok 2009-08-18 10:50 so, we will use group=0,entry=1,extent=5:count=10? 2009-08-18 10:51 yes 2009-08-18 10:51 ok 2009-08-18 10:51 and I'm thinking another format 2009-08-18 10:52 struct { logical=0:pysical=5:count=10} 2009-08-18 10:52 or 2009-08-18 10:52 whoops 2009-08-18 10:52 struct { logical=1:pysical=5:count=10} 2009-08-18 10:52 or 2009-08-18 10:53 logical=1,extent=5:count=10 2009-08-18 10:53 similar to ext4's extents 2009-08-18 10:53 might be 2009-08-18 10:53 it's pretty close 2009-08-18 10:53 I'm not checking ext4 though 2009-08-18 10:54 ext4 has a good extent representation, but it does not have to allow for multiple versions of the same extent 2009-08-18 10:55 logical=1,extent=5:count=10 2009-08-18 10:55 this can? 2009-08-18 10:56 and I guess the multiple versions for same logical range is rare 2009-08-18 10:57 it is not rare if the file is a virtual volume 2009-08-18 10:57 the current dleaf format allows it 2009-08-18 10:57 same logical range? 
2009-08-18 10:57 and ddsnap uses that format 2009-08-18 10:57 yes, multiple different physical regions per same logical region 2009-08-18 10:57 point is same logical range for one file 2009-08-18 10:57 yes 2009-08-18 10:57 this is easy to cause 2009-08-18 10:58 first, write a file, then set a snapshot, then seek to the middle of the file and write one block 2009-08-18 10:58 it is not same logical range 2009-08-18 10:58 it has different logical range 2009-08-18 10:59 the logical range of the single-block write lies inside the logical range of the entire file 2009-08-18 10:59 these logical ranges overlap 2009-08-18 11:00 within the dleaf, we want to have all extents that overlap next to each other 2009-08-18 11:00 adjacent in the dleaf 2009-08-18 11:00 because the versioning algorithm requires this 2009-08-18 11:01 think real represent 2009-08-18 11:01 first write is 2009-08-18 11:02 group=0,entry=1,extent=100:count=1000 2009-08-18 11:02 and overwrite one block 2009-08-18 11:02 group=0,entry=100,extent=200:count=1 2009-08-18 11:03 it saves group if match 2009-08-18 11:04 you mean, those are for this one group? 2009-08-18 11:05 are they the same version? 2009-08-18 11:05 different version 2009-08-18 11:05 because those are overlap 2009-08-18 11:05 yes 2009-08-18 11:06 and to search for the interst version 2009-08-18 11:06 we can't have count=1000, maximum is 64 2009-08-18 11:06 we have to see both 2009-08-18 11:06 oh, yes 2009-08-18 11:06 rewrite the example with that limitation? 2009-08-18 11:07 ok 2009-08-18 11:07 group=0,entry=1,extent=100:count=64 2009-08-18 11:07 group=0,entry=100,extent=132:count=1 2009-08-18 11:08 you meant this? 
2009-08-18 11:08 yes 2009-08-18 11:08 ok 2009-08-18 11:08 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-18 11:09 we have to decide whether we will allow the overlapped extents in the dleaf, or whether we will cut the big extent into pieces, so that one piece can begin at the same logical address as the single block extent 2009-08-18 11:10 yes 2009-08-18 11:10 cutting up the big extent is pretty easy, but it introduces a difficult allocate-in-free problem, where deleting a version can make a dleaf grow 2009-08-18 11:11 it will take me a moment to remember the details of this problem 2009-08-18 11:11 anyway, it is not the current question 2009-08-18 11:11 let's assume that we will allow the overlapped representation 2009-08-18 11:12 probably, it would be related 2009-08-18 11:12 my main point is to make those simple 2009-08-18 11:12 we would need to have an additional dleaf dictionary entry for the 1 block extent 2009-08-18 11:12 for what? 2009-08-18 11:13 hello folks 2009-08-18 11:13 hi 2009-08-18 11:13 it meant if we cut the range to 3 or 2 piece? 2009-08-18 11:13 in my view, we have one dleaf entry for each distinct start of a logical extent 2009-08-18 11:14 and we try not to cut, to avoid the allocate-in-free problem 2009-08-18 11:15 if we do cut, then we cut so that logical extent starts all align 2009-08-18 11:15 and we don't worry much about where the extents end, except at the end of the dleaf block 2009-08-18 11:16 probably, yes 2009-08-18 11:18 however, I worry current format would be not efficient for primary case 2009-08-18 11:18 for example, no version, and big file 2009-08-18 11:19 because of the small extents? 2009-08-18 11:19 yes 2009-08-18 11:19 small, and separated group/entry/extent 2009-08-18 11:19 metadata/data ratio is something like 1/32K with the current format 2009-08-18 11:20 especially may be group/entry 2009-08-18 11:20 1/32k? 
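A minimal sketch of the "cut the big extent" option, assuming an extent is just a (logical, physical, count) triple. The function name and tuple layout are hypothetical, and the numbers reuse the first example from the chat, ignoring the 64-block limit for readability.

```python
def cut_extent(logical, physical, count, cut_at):
    """Split one extent so that a piece begins exactly at logical block cut_at.

    Returns the extent unchanged when cut_at is not strictly inside it.
    Only the start is aligned; where the pieces end is left alone.
    """
    if not logical < cut_at < logical + count:
        return [(logical, physical, count)]
    head = cut_at - logical
    return [
        (logical, physical, head),                 # blocks before the overwrite
        (cut_at, physical + head, count - head),   # piece aligned with the new extent
    ]

# Big extent at logical 1, physical 100, 1000 blocks; one block rewritten at logical 100.
pieces = cut_extent(1, 100, 1000, 100)
```

The second piece now starts at the same logical address as the single-block extent, which is the alignment described above. The cost is one extra extent, and the growth on version delete mentioned in the chat comes from needing such extra pieces.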
2009-08-18 11:20 1 byte of metadata for every 32K bytes of file data 2009-08-18 11:21 let me give my formula for that 2009-08-18 11:21 (4+4+8) / 256k? 2009-08-18 11:22 ah, 256k/8 2009-08-18 11:22 actually, (4+4+8) / 256k? 2009-08-18 11:22 sorry 2009-08-18 11:23 actually, (4+8) / 256k = 1/46k 2009-08-18 11:24 21k? 2009-08-18 11:24 which simply says, 12 byte extents where each extent covers 2**18 bytes (as is obvious to you :) 2009-08-18 11:24 why 21k? 2009-08-18 11:24 256k/12 2009-08-18 11:25 right 2009-08-18 11:25 :) 2009-08-18 11:26 my math was borked 2009-08-18 11:26 so, metadata/data ratio = 1/21k 2009-08-18 11:26 which is fine 2009-08-18 11:26 larger extents would have a very small benefit 2009-08-18 11:27 probably not possible to measure, but we will see 2009-08-18 11:27 the main place this would be visible is delete, and we can easily defer the truncate 2009-08-18 11:29 even if 21k * 4096, it's 88M? 2009-08-18 11:29 well, just say 80MB 2009-08-18 11:30 we have to read a leaf for each 80MB? 2009-08-18 11:31 (and for each 4k, we have to search on a cached leaf) 2009-08-18 11:33 oh, in readahead case, it might be for each 128k 2009-08-18 11:33 80MB / 128KB == 640 2009-08-18 11:34 we have to search 640 times for each 80MB 2009-08-18 11:35 I'm not sure how big or small those are 2009-08-18 11:35 but, I may care 2009-08-18 11:35 I think I need real test 2009-08-18 11:54 sorry, was away for a moment 2009-08-18 11:55 I think that one out of line seek per 80MB will not be noticed in total performance 2009-08-18 11:55 um... 
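The numbers traded in this exchange can be reproduced as follows. This is a sketch under the chat's own assumptions (64-block maximum extent, 4 KB blocks, 8-byte extent plus 4-byte dictionary entry); it ignores any leaf header overhead, and the names are illustrative only.

```python
BLOCK_SIZE = 4096
MAX_COUNT = 64                     # 6-bit count, so at most 64 blocks per extent
BYTES_PER_EXTENT = 8 + 4           # 8-byte extent plus 4 bytes of dleaf dictionary

# One maximum-size extent covers 64 * 4096 = 2**18 bytes = 256K.
extent_data_bytes = MAX_COUNT * BLOCK_SIZE

# Data bytes described per byte of metadata: 256K / 12, about 21k.
data_per_metadata_byte = extent_data_bytes // BYTES_PER_EXTENT

# Data described by one full 4K leaf of such extents (the "88M" in the chat).
leaf_coverage_bytes = data_per_metadata_byte * BLOCK_SIZE

# Lookups in a cached leaf if readahead asks in 128 KB units over 80 MB.
lookups_per_80mb = (80 * 2**20) // (128 * 2**10)
```

This gives a ratio of roughly 1/21845, about 89 million bytes of file data per leaf, and 640 lookups per 80 MB, consistent with the rounded figures above.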
2009-08-18 11:55 but, it would be slow than others 2009-08-18 11:56 and if the seek is not out of line (because we succeeded in placing the metadata near the beginning of the data) it will cost even less 2009-08-18 11:56 my theory is that the performance difference vs a scheme with large extents will be impossible to measure 2009-08-18 11:57 i see 2009-08-18 11:57 if my theory is wrong, then we introduce a method of representing larger extents 2009-08-18 11:58 however, in here, I prefer to think we are bad :) 2009-08-18 11:58 it is a possible deficiency, yes 2009-08-18 11:59 i see 2009-08-18 12:00 well, and big problem of me is, I'm not understanding benefit of current format well 2009-08-18 12:01 the benefit of small number of bytes per extent is clear? 2009-08-18 12:02 separation of group/entry/extent 2009-08-18 12:02 I'm thinking for right now, if we merge those, we can get some additional bytes of space 2009-08-18 12:02 mainly, that is the method by which the compact representation of extents is achieved 2009-08-18 12:02 it is a simple form of compression 2009-08-18 12:03 yes 2009-08-18 12:03 it's compression 2009-08-18 12:03 yes, that optimization also occurred to me 2009-08-18 12:03 that is, we can take advantage of extents that are end-to-end and only store a dictionary entry for the first one 2009-08-18 12:04 simple format with more good compression, I think we also can 2009-08-18 12:04 I would be interested to see a simpler format with better compression, that also supports versioning 2009-08-18 12:05 well, if we ignore cpu, e.g. 
simple format with lzo compression 2009-08-18 12:06 "if" :) 2009-08-18 12:06 yes :) 2009-08-18 12:06 not just cpu, but additional memory to hold the expanded representation 2009-08-18 12:06 well, however, it's tradeoff 2009-08-18 12:07 exactly 2009-08-18 12:07 iirc, lzo (or something) can be in place 2009-08-18 12:07 one of the nice features of ext* is, the on-disk format is also suitable as the in-memory database 2009-08-18 12:08 but, of course, less compression ratio though 2009-08-18 12:08 i see 2009-08-18 12:08 it's very interesting 2009-08-18 12:08 ...and you would have to recompress before flushing to disk 2009-08-18 12:08 yes 2009-08-18 12:09 we have no problem for it 2009-08-18 12:09 with atomic commit 2009-08-18 12:09 true 2009-08-18 12:09 but extra memory would be needed 2009-08-18 12:09 might be 2009-08-18 12:10 but, stream type compression would be fixed space 2009-08-18 12:10 not like gzip, bzip2 2009-08-18 12:11 there could be some promise there 2009-08-18 12:11 the cpu issue remains of course 2009-08-18 12:11 I think the cpu issue is non-trivial 2009-08-18 12:12 yes 2009-08-18 12:12 sometimes there are claims that compression can improve overall performance, by reducing size of disk transfer 2009-08-18 12:12 there are more claims of this than there is proof, and anyway, it seems like this kind of thing should be implemented by an lvm, not a filesystem 2009-08-18 12:13 um... 
2009-08-18 12:13 maybe, lvm layer lost many info for it 2009-08-18 12:14 it may need to compress unneeded data 2009-08-18 12:14 well 2009-08-18 12:14 dleaf format 2009-08-18 12:14 yes 2009-08-18 12:15 so, I'm not sure about group compression 2009-08-18 12:16 the proposal is that compression ratio makes the complexity increase worthwhile 2009-08-18 12:16 and we can't search interst version until see extent 2009-08-18 12:17 that last requirement is not a problem 2009-08-18 12:17 we will normally always examine every extent in a given logical region to find the ones we are interested in 2009-08-18 12:18 um... 2009-08-18 12:18 why? 2009-08-18 12:18 because of version inheritance 2009-08-18 12:18 group=0,entry=1,extent=100:count=64 2009-08-18 12:18 group=0,entry=100,extent=132:count=1 2009-08-18 12:18 in this example 2009-08-18 12:19 if we read 100, we just want to know extent=132:count=1? 2009-08-18 12:19 should also put a version= element in the example 2009-08-18 12:19 sure 2009-08-18 12:19 ver=1 group=0,entry=1,extent=100:count=64 2009-08-18 12:19 ver=2 group=0,entry=100,extent=132:count=1 2009-08-18 12:20 if we read ver=2 logical=100, we just want to know extent=132:count=1? 2009-08-18 12:20 true 2009-08-18 12:20 but, we need to read both 2009-08-18 12:21 because, we don't know about version until read extent 2009-08-18 12:23 whereas if version is in the dictionary entry, we can learn something about the versions before looking at the extents... 
however issues such as multiple versions for the same logical address have to be addressed 2009-08-18 12:23 after those issues are addressed, not much cpu is saved, but considerably more space is used 2009-08-18 12:23 that is my theory 2009-08-18 12:23 but prove me wrong :) 2009-08-18 12:25 well, prove someone wrong is hard, usually :) 2009-08-18 12:25 because, all has tradeoff :) 2009-08-18 12:25 that's why everybody is always right :) 2009-08-18 12:25 :) 2009-08-18 12:27 well, so, moving version from extent to logical range is good thing? 2009-08-18 12:28 um... 2009-08-18 12:28 it would be depends on whether exactly same logical adress is common or not 2009-08-18 12:29 oh, at first, why I say this 2009-08-18 12:29 I'm thinking, no version is primary case 2009-08-18 12:30 and should faster than complex versions 2009-08-18 12:34 I don't think moving the version to the logical part is good 2009-08-18 12:34 the version is a property of the physical extent, not logical 2009-08-18 12:35 that is, versioned physical extents are many to one vs logical extent 2009-08-18 12:36 no version stops being the primary case as soon as there as snapshots 2009-08-18 12:36 and when people have snapshots, they always use them 2009-08-18 12:37 um... 
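One way to picture why every extent in a logical region gets examined: with version inheritance, a read of a given version may be satisfied by an extent written under an ancestor version. The sketch below is hypothetical throughout. The tuple layout, the parent map and the nearest-ancestor rule are assumptions for illustration and are not the tux3 or ddsnap algorithm; the numbers are also invented.

```python
def lookup(extents, parent, logical, version):
    """Return the physical block for (logical, version), or None if unmapped.

    extents: list of (version, logical_start, physical_start, count)
    parent:  dict mapping each version to its parent version (None at the root)
    """
    # Pass 1: look at every extent covering this logical block, whatever its version.
    hits = {}
    for ver, start, phys, count in extents:
        if start <= logical < start + count:
            hits[ver] = phys + (logical - start)
    # Pass 2: prefer the requested version, then walk up through its ancestors.
    v = version
    while v is not None:
        if v in hits:
            return hits[v]
        v = parent.get(v)
    return None

# Version 1 wrote the whole file; version 2 (a child of 1) rewrote logical block 100.
extents = [(1, 1, 100, 1000), (2, 100, 2000, 1)]
parent = {1: None, 2: 1}
```

Reading logical 100 as version 2 finds the rewritten block, while reading logical 50 as version 2 falls back to the version 1 extent, so both extents have to be visible to the search.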
2009-08-18 12:37 if automatically snapshots, people will not disable it 2009-08-18 12:38 well, people will just use default 2009-08-18 12:39 it's not just that 2009-08-18 12:40 when they realize that they can have online backup and access to old versions, they don't want to give it up 2009-08-18 12:40 this is the netapp effect 2009-08-18 12:41 users and administrators get instantly addicted to the capability, when available 2009-08-18 12:41 and wonder how they were able to survive without it 2009-08-18 12:41 it's depending on _they_ 2009-08-18 12:42 enterprise people will use obviously 2009-08-18 12:42 in my observation, nearly everybody 2009-08-18 12:42 if they find they have it at work, they want it at home 2009-08-18 12:42 if desktop user, I disagree 2009-08-18 12:43 usually, they don't backup at all 2009-08-18 12:43 the microsoft previous versions is primarily for desktop users 2009-08-18 12:43 right, they don't back up because it is painful 2009-08-18 12:43 if it is not painful, then they do 2009-08-18 12:43 we haven't seen that yet :) 2009-08-18 12:43 at least, not in the home 2009-08-18 12:44 well, in enterprise it is still painful, but necessary as you say 2009-08-18 12:44 yes 2009-08-18 12:44 so... 
imagine if you could do a trial upgrade to the latest version of your distro, and back it out with a single command if it broke your system 2009-08-18 12:44 well, I want to see faster version of fs 2009-08-18 12:45 I think I would have been saved from problems by that maybe 10-20 times since I started using linux 2009-08-18 12:45 yes, good of course 2009-08-18 12:45 some of the upgrade problems have wasted days of my time 2009-08-18 12:45 anyway, yes, faster version of fs 2009-08-18 12:45 it's depending on cost 2009-08-18 12:45 and it should be good, even with no versions, I agree 2009-08-18 12:46 if usual work is slower than I expected, I'll not use it 2009-08-18 12:46 true 2009-08-18 12:46 well, and I expected tux3 is faster than ext3 2009-08-18 12:47 I expect so 2009-08-18 12:47 I also think the dleaf design is pretty good even for the no snapshots case 2009-08-18 12:48 but if there is a significantly better design, it is important to know it 2009-08-18 12:49 um... 2009-08-18 12:51 I'm not sure you are calc about complexity of dleaf and version 2009-08-18 12:53 well, ok, are you thinking current format is best for all of view? 2009-08-18 12:53 I means, you have alternative yourself? 2009-08-18 12:56 I think it is a good all round format 2009-08-18 12:56 I've been thinking about the complexity of adding versioning to the format for some time, and feel ok about it 2009-08-18 12:58 the stream-based update is a good technique for controlling the complexity, and we nearly have that implemented already 2009-08-18 12:59 i see 2009-08-18 13:00 e.g. merge group/entry is bad? it can't be alternatetive? 
2009-08-18 13:02 If there is an advantage demonstrated it could be an alternative of course 2009-08-18 13:02 it doesn't seem to be a promising direction to go, but if I am wrong about that I would like to know 2009-08-18 13:03 well, honestly, I'm not sure 2009-08-18 13:03 but, I'm almost sure, it's simple 2009-08-18 13:04 merge group/entry means, we remove multiple entry per group, of course 2009-08-18 13:05 so, top is extent array, and bottom is logical-address array 2009-08-18 13:08 this would be just difference of culture, I can't say current our format is good for all round 2009-08-18 13:08 and of course, I'm not understanding all well 2009-08-18 13:10 well, different culture is ture, I know more or less, and it doesn't mean there is problem, it means just different 2009-08-18 13:10 well 2009-08-18 13:11 can you allow me somehow works for simpler one? 2009-08-18 13:11 I'd like to compare or something 2009-08-18 13:12 for example, the logical address array could be 6 or 8 bytes per entry, and the extent array 8 bytes, so at worst it would be 14 (or 16) bytes per extent? 2009-08-18 13:12 then you would try to have multiple physical extents that are end-to-end logically, per logical entry? 2009-08-18 13:13 yes 2009-08-18 13:13 just say 16bytes 2009-08-18 13:13 ok, say 8 bytes logical, 8 bytes phyiscal to make it simple 2009-08-18 13:14 yes, and it makes me wrong more 2009-08-18 13:14 because? 
2009-08-18 13:14 ah 2009-08-18 13:14 -!- ajonat(~ajonat@190.48.100.182) has joined #tux3 2009-08-18 13:14 oh 2009-08-18 13:14 "bigger = wronger" :) 2009-08-18 13:15 now, if there is usually only one logical entry per dleaf, you aren't wrong 2009-08-18 13:15 I meant, 8bytes would be bad case than 6byte, it's good to think about it 2009-08-18 13:15 sure 2009-08-18 13:16 I'm thinking usual case is 2009-08-18 13:16 logical address start 0 2009-08-18 13:16 and end 2009-08-18 13:17 it means there is no sparse region 2009-08-18 13:18 which would normally be handled by our (not implemented yet) direct data attribute 2009-08-18 13:18 as a suggestion, you could work on the design of the direct data attribute, which is actually the one that will affect efficiency most 2009-08-18 13:18 physical contiguous? 2009-08-18 13:19 logically contiguous 2009-08-18 13:19 the direct data attribute can contain multiple extents that are logically but not physically contiguous 2009-08-18 13:19 ok, so, multiple extents with no version? 2009-08-18 13:19 yes 2009-08-18 13:19 ah, and with no sparse region 2009-08-18 13:19 the version is in the attribute tag, like other inode attributes 2009-08-18 13:20 ah 2009-08-18 13:20 maximum file size of worst case? 2009-08-18 13:21 depends on the design details of the attribute 2009-08-18 13:21 sure 2009-08-18 13:22 well, maybe 10 extents or so? 2009-08-18 13:22 sounds like enough to handle most files 2009-08-18 13:23 might not be enough recently 2009-08-18 13:23 file size average is increased 2009-08-18 13:23 from blogs though 2009-08-18 13:25 well, anyway, perhaps direct extents is interesting one for me 2009-08-18 13:25 toilet time for a while :) 2009-08-18 13:32 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-18 13:40 back 2009-08-18 13:40 um... 
2009-08-18 13:41 after thinking deeply :) 2009-08-18 13:41 direct extents also introduce complexity 2009-08-18 13:41 we need to switch to another one with versioninig 2009-08-18 13:42 true, however we expect this complexity 2009-08-18 13:42 it is similar to the requirement to switch from immediate data to direct data or btree 2009-08-18 13:43 we want to have that ability, and also go the other way, so that xattr can be generalized, for one thing 2009-08-18 13:43 there is no bad almost in theory 2009-08-18 13:43 you mean, everything works in theory? 2009-08-18 13:43 but, in real work, we have to think human power and brain 2009-08-18 13:44 yes, more or less 2009-08-18 13:44 in the real world, this change is similar to what I did in htree, to go from the single linear block to the indexed case 2009-08-18 13:44 it works well and did not add much complexity 2009-08-18 13:45 we need real stable working code to run 2009-08-18 13:45 it's not easy always 2009-08-18 13:45 right, we need that before adding new data optimizations 2009-08-18 13:46 well dleaf is pretty stable, thanks in large part to your fixes 2009-08-18 13:46 unfortunately, it's not stable 2009-08-18 13:47 it has limitation 2009-08-18 13:47 there's a test that fails? 2009-08-18 13:47 and there is known bug 2009-08-18 13:47 I don't have test case for it 2009-08-18 13:47 I didn't know about the known bug :) 2009-08-18 13:48 we talked yesterday or so, about map_region problem 2009-08-18 13:50 the bitmap issue? 
2009-08-18 13:51 yes 2009-08-18 13:52 that issue is independent of the method of extent representation 2009-08-18 13:52 not really 2009-08-18 13:53 my intent of simplify is including those operations makes common and stable 2009-08-18 13:54 well, I was forgetting it was bitmap though :) 2009-08-18 13:54 not dleaf 2009-08-18 13:54 bitmap can't get a lot simpler :) 2009-08-18 13:54 yes 2009-08-18 13:55 the allocate-in-allocate issue will occur in any filesystem that represents freespace in a structure that itself is allocated 2009-08-18 13:55 but, I'm thinking fundamental problem is dleaf's exeption to handle it 2009-08-18 13:55 that includes all modern filesystems that I know of 2009-08-18 13:55 yes 2009-08-18 13:56 we see those some times 2009-08-18 13:56 s/see/saw/ 2009-08-18 14:00 -!- ajonat(~ajonat@190.48.100.182) has joined #tux3 2009-08-18 15:15 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-18 16:35 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-18 16:51 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-18 17:29 flips, still there? 
2009-08-18 19:20 -!- ajonat(~ajonat@190.48.100.182) has joined #tux3 2009-08-18 19:40 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-18 21:33 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-18 22:43 -!- msindia(~lol@117.254.205.152) has joined #tux3 2009-08-18 23:08 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-18 23:23 -!- debdev(8bb58f22@webchat.mibbit.com) has joined #tux3 2009-08-18 23:50 -!- msindia(~lol@117.254.2.117) has joined #tux3 2009-08-19 00:07 -!- msindia(~lol@117.254.2.117) has joined #tux3 2009-08-19 00:07 -!- debdev(8bb58f22@webchat.mibbit.com) has joined #tux3 2009-08-19 00:07 -!- ajonat(~ajonat@190.48.100.182) has joined #tux3 2009-08-19 00:07 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-19 00:07 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-08-19 00:07 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-08-19 00:07 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-08-19 00:07 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-08-19 00:07 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-08-19 00:07 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-08-19 02:23 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-19 02:23 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-08-19 02:23 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-08-19 02:23 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-08-19 02:23 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-08-19 02:23 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-08-19 02:23 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-08-19 05:16 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-19 06:33 -!- mib_jie0jdf8(c02401fc@webchat.mibbit.com) has joined #tux3 2009-08-19 06:34 Do 
you use gitosis for access control to your projects? 2009-08-19 06:40 -!- mib_jie0jdf8(c02401fc@webchat.mibbit.com) has left #tux3 2009-08-19 07:30 -!- pgquiles(~pgquiles@159.Red-81-39-154.dynamicIP.rima-tde.net) has joined #tux3 2009-08-19 07:49 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-08-19 08:15 hirofumi, back 2009-08-19 08:18 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-19 08:41 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-19 08:51 hi 2009-08-19 08:51 good morning 2009-08-19 08:51 hi 2009-08-19 08:51 well 2009-08-19 08:52 I'll give up on getting your help to make tux3 simple :) 2009-08-19 08:54 well it's comparatively simple 2009-08-19 08:54 ok 2009-08-19 08:54 compared to ext3, so far 2009-08-19 08:54 maybe read some zfs code, for fun 2009-08-19 08:55 well, so, I'll just leave those for you and others, and will ignore those for now 2009-08-19 08:55 and instead, I'll tackle the other part 2009-08-19 08:56 right, the part that hasn't been tried yet 2009-08-19 08:58 for interest: http://zfs.macosforge.org/trac/browser/ 2009-08-19 08:59 um.., Forbidden 2009-08-19 08:59 ah: http://zfs.macosforge.org/trac/browser/zfs_lib/libzfs/libzfs_graph.c 2009-08-19 08:59 funny, I'm in it 2009-08-19 09:00 works now 2009-08-19 09:02 wow, there is documentation of zfs operation in the libzfs_graph.c file 2009-08-19 09:02 no actual description of what the file does, though 2009-08-19 09:02 must be a story behind that 2009-08-19 09:03 like, for example, the graphics file started as a cut and paste of the actual filesystem code 2009-08-19 09:06 it seems to be the destroy command 2009-08-19 09:07 "zfs destroy" 2009-08-19 09:45 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-19 09:46 this graphics seems to just describe the pool structure 2009-08-19 09:47 it seems tux3graph is more ambitious 2009-08-19 09:48 ah wait 2009-08-19 09:48 this doesn't do graphics at all 
2009-08-19 09:48 it constructs a graph of the pool 2009-08-19 09:51 yes 2009-08-19 10:00 looks like the kernel code isn't actually in this repo 2009-08-19 10:05 http://www.opensolaris.org/os/community/zfs/source/;jsessionid=50D4634DFB90105BE54C6DF7325BBD3D <- source tour 2009-08-19 10:06 "The ZIO pipeline is where all data must pass when going to or from the disk. It is responsible for translation DVAs (Device Virtual Addresses) into logical locations on a vdev, as well as checksumming and compressing data as necessary. It is implemented as a multi-stage pipeline, with a bit mask to control which stage gets executed for each I/O. The pipeline itself is quite complex, but can be summed up by the following digram:" 2009-08-19 10:08 will the real zfs please stand up 2009-08-19 10:09 seems to be split between zio.c and vdev.c 2009-08-19 10:09 I'm using git://repo.or.cz/opensolaris mirror to see it 2009-08-19 10:09 well, I'm not looking it almost all for now 2009-08-19 10:10 it's a big reading project 2009-08-19 10:10 very different organization from any linux filesystem 2009-08-19 10:10 as I understand it, quite different from other unix filesystems too 2009-08-19 10:11 it seems usual *bsd style interface 2009-08-19 10:11 VOP_* 2009-08-19 10:11 which file? 
2009-08-19 10:12 but, I'm not sure 2009-08-19 10:12 forgot 2009-08-19 10:12 it is from I looked it last time 2009-08-19 10:13 usr/src/uts/common/fs/zfs/zfs_vnops.c 2009-08-19 10:13 http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_cache.c <- includes physical block caching 2009-08-19 10:14 yes, lower layer should be different at all 2009-08-19 10:59 http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zfs_vfsops.c <- something fairly familiar 2009-08-19 11:01 mount option parsing is much shorter than a typical linux filesystem 2009-08-19 11:04 zfs_vnops.c <- getting into the meat 2009-08-19 11:06 zfs_write includes work normally doing by generic_* in linux 2009-08-19 11:07 yes 2009-08-19 11:07 I guess I will just drill down until I hit something resembling dleaf 2009-08-19 11:07 iirc, bsd is using binary custom protocol for mount option 2009-08-19 11:08 and, yes, bsd vfs does nothing almost, even namespace locking 2009-08-19 11:08 767 /* 2009-08-19 11:08 768 * If we made no progress, we're done. If we made even 2009-08-19 11:08 769 * partial progress, update the znode and ZIL accordingly. 
2009-08-19 11:08 770 */ 2009-08-19 11:08 well, opensolaris vfs might do though 2009-08-19 11:08 hmm 2009-08-19 11:09 ah, dleaf 2009-08-19 11:09 I guess opensolaris is more like *bsd than different 2009-08-19 11:10 yes, probably 2009-08-19 11:10 dragonfly vfs sounds like doing more though 2009-08-19 11:14 I should set up a machine and install hammer 2009-08-19 11:19 kvm or something would be easy 2009-08-19 11:19 not perfect for test though 2009-08-19 11:19 counts as a machine 2009-08-19 11:19 I have plenty of real machines though 2009-08-19 11:20 good 2009-08-19 11:21 kvm has some advantage than real hardware for development 2009-08-19 11:21 there are some disadvantage too though 2009-08-19 11:21 I guess I will leave the development to Matt ;) 2009-08-19 11:21 just want to get a general feelilng for how it works, and how fast 2009-08-19 11:22 if it works well and fast, then somebody should port it to linux maybe 2009-08-19 11:22 :) well, devlopment was including some experiments 2009-08-19 11:22 (not me!) 2009-08-19 11:23 porting hammer to linux seems to be hard job 2009-08-19 11:23 unfortunately 2009-08-19 11:23 because of mismatch to vfs? 
2009-08-19 11:24 yes 2009-08-19 11:24 and more, buffer handling 2009-08-19 11:24 at least from doc, it seems to need *bsd like buffer handling 2009-08-19 11:24 well if we can completely take over buffer handling so can hammer 2009-08-19 11:24 I think we already proved that 2009-08-19 11:26 iirc, it's really hard 2009-08-19 11:26 bsd can have multiple blocks size at a time 2009-08-19 11:27 mutiple buffer size 2009-08-19 11:27 like xfs 2009-08-19 11:27 I don't know about xfs 2009-08-19 11:28 well, I guess it would be big barrier to port fs from *bsd 2009-08-19 11:39 http://opensolaris.org/os/community/zfs/source/ 2009-08-19 11:51 dn->dn_phys->dn_blkptr[blkid] 2009-08-19 11:51 zfs might have direct pointer 2009-08-19 12:10 -!- Guest153(~tomek@132-goc-32.acn.waw.pl) has joined #tux3 2009-08-19 12:12 exit 2009-08-19 12:12 exit 2009-08-19 12:12 exit 2009-08-19 12:12 exit 2009-08-19 12:12 quit 2009-08-19 12:45 http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/spa.h : 183 2009-08-19 12:45 seems logical pointer 2009-08-19 12:47 zfs seems to be using ffs style blockpointer 2009-08-19 12:48 but, it's logical pointer 2009-08-19 13:05 um... 2009-08-19 13:06 zfs is not using extent style block pointer? 2009-08-19 13:15 http://lwn.net/Articles/342892/ 2009-08-19 13:16 zfs doesn't seems to use btree and extent 2009-08-19 13:22 it does, maybe not obviously 2009-08-19 13:22 where it come from? 
2009-08-19 13:24 now I need to prove it :) 2009-08-19 13:24 :) 2009-08-19 13:24 just a moment 2009-08-19 13:24 well, I read from zfs_read() to spa/arc/zio stuff slightly 2009-08-19 13:25 it will be in the DMU 2009-08-19 13:25 dnode is using blkptr_t as source 2009-08-19 13:25 logical offset to blkptr 2009-08-19 13:25 yes, dnode 2009-08-19 13:26 blkptr has vdev pointer, size, raid info 2009-08-19 13:26 that's an zfs-style extent 2009-08-19 13:26 but, it's not extent actually 2009-08-19 13:26 it seems multiple size fixed objects 2009-08-19 13:27 multiple fixed size objects 2009-08-19 13:28 and for regular file data, it seems to use the fixed block size (maybe per file?) 2009-08-19 13:28 because, dnode points array of blkptr like ffs/ext* style data pointer 2009-08-19 13:30 well, ffs/ext* may be some sort of btree though 2009-08-19 13:31 ah, that may be 2009-08-19 13:31 I mean, zfs may use a ufs-like index scheme 2009-08-19 13:31 interesting point is file has logical pointer only 2009-08-19 13:31 which is more like a radix tree than a btree 2009-08-19 13:31 yes, it seems to do for file data 2009-08-19 13:31 yes 2009-08-19 13:32 and all logical pointer seems to be able to have checksum and raid capability 2009-08-19 13:32 that's the dnode 2009-08-19 13:33 blkptr has those 2009-08-19 13:33 dnode has tree or array 2009-08-19 13:33 of those 2009-08-19 13:33 well, so, intersting point is 2009-08-19 13:34 it means upper layer doesn't know about physical address at all 2009-08-19 13:34 http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/dnode.h#dnode_t 2009-08-19 13:35 list_t dn_dbufs; /* linked list of descendent dbuf_t's */ 2009-08-19 13:35 dnode_phys_t is maybe 2009-08-19 13:35 dbuf seems it memory buffer 2009-08-19 13:35 dbuf seems memory buffer management 2009-08-19 13:37 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-19 13:38 well, like http://lwn.net/Articles/342892/ says, it seems to try to add MMU like layer 
2009-08-19 13:38 DMU may be 2009-08-19 13:38 one obvious point: zfs struct dnode is more complex than any data structure we use 2009-08-19 13:39 it's complex 2009-08-19 13:39 and yes, dnode_phys is the zfs pointer I was thinking of 2009-08-19 13:39 128 bytes in size iirc 2009-08-19 13:39 but, blkptr algorithm is really simple 2009-08-19 13:39 blkptr search algorithm 2009-08-19 13:40 it's just like radix tree, or array 2009-08-19 13:40 true. The complexity is pushed off to other parts of the system 2009-08-19 13:40 yes 2009-08-19 13:40 really 2009-08-19 13:40 my theory is: it's better to keep the complexity local to where the work is done 2009-08-19 13:41 now, honestly, it's depending on you think what is "local" 2009-08-19 13:41 indeed, there is a lot of room for interpretation in my claim 2009-08-19 13:42 yes 2009-08-19 13:42 well, I agree, local is good 2009-08-19 13:43 um..., no, I think I meant, local complexity is acceptable than larger complexity 2009-08-19 13:44 right, and all complexity is bad 2009-08-19 13:44 s/larger/complex relation?/ 2009-08-19 13:45 yes, especially, I guess I sensitive to it 2009-08-19 13:45 I'm trying to figure out how zfs translates a logical address to physical using the dnode 2009-08-19 13:45 because, it affects to stable directly 2009-08-19 13:46 ok 2009-08-19 13:46 it does seem to be like a radix tree, except that each dnode_phys seems to have its own blocksize 2009-08-19 13:46 probably 2009-08-19 13:47 maybe, dnode is used for other objects 2009-08-19 13:47 like partition, I guess 2009-08-19 13:47 dn_indblkshift <- does this mean, logical range covered by indirect block? 
2009-08-19 13:48 yes 2009-08-19 13:48 ok, that's it then, it is a sort-of radix tree 2009-08-19 13:49 http://www.google.co.jp/url?sa=t&source=web&ct=res&cd=1&url=http%3A%2F%2Fopensolaris.org%2Fos%2Fcommunity%2Fzfs%2Fdocs%2Fondiskformat0822.pdf&ei=M2WMStGmPMeBkQXAtNEk&usg=AFQjCNGs8V139wOT8sIMaX3IhrgVhcY8RA&sig2=BT0zoRjcpNOkQ4WmXuPIsg 2009-08-19 13:49 nice url ;) 2009-08-19 13:49 opensolaris.org/os/community/zfs/docs/ondiskformat0822.pdf 2009-08-19 13:49 maybe, it was including google crap :) 2009-08-19 13:50 25 page of that pdf 2009-08-19 13:50 it's radix tree of dnode 2009-08-19 13:50 yes, I've been in it before 2009-08-19 13:50 forgot everything though 2009-08-19 13:51 well, zfs seems to be very interesting fs 2009-08-19 13:52 but, probably, unfortunately it's not what I want to use 2009-08-19 13:53 the point for today is just to compare complexity 2009-08-19 13:53 yes 2009-08-19 13:53 well, to see zfs was interst 2009-08-19 13:53 s/interst/fun/ 2009-08-19 13:53 it's good to know the alternatives well 2009-08-19 13:54 yes 2009-08-19 13:54 btrfs can be another day 2009-08-19 13:54 so far from our fs, so it can't compare easily 2009-08-19 13:55 yes, btrfs would also be fun 2009-08-19 13:59 well, however, maybe zfs is, file offset => lookup radix tree or array => translate physical pointer 2009-08-19 13:59 translate would be another lookup of radix tree or something 2009-08-19 14:00 so, just guess though, lookup 2 (or more) radix tree lookup 2009-08-19 14:05 oh 2009-08-19 14:06 if we use lvm, it may be same lookup? 
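[Editor's sketch] The "sort-of radix tree" lookup worked out above can be illustrated with a few lines of C. This is only a sketch under stated assumptions, not ZFS code: the 128-byte pointer size is the figure recalled from memory in the log, the names PTR_SHIFT, bits_per_level and radix_slot are hypothetical, and the indirect block shift of 14 used in the usage note is an example value rather than something taken from the discussion. The idea is that an indirect block of 2^indblkshift bytes holds 2^(indblkshift - 7) pointers, so each tree level consumes that many bits of the logical block number.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: every block pointer occupies 128 bytes (2^7). */
#define PTR_SHIFT 7

/* Index bits consumed per tree level for a given indirect block shift. */
static unsigned bits_per_level(unsigned indblkshift)
{
    assert(indblkshift > PTR_SHIFT && indblkshift - PTR_SHIFT < 32);
    return indblkshift - PTR_SHIFT;
}

/*
 * Slot to follow at `level` when looking up logical block `blkid`.
 * Level 0 is the indirect block directly above the data blocks;
 * higher levels sit closer to the root.
 */
static unsigned radix_slot(uint64_t blkid, unsigned level, unsigned indblkshift)
{
    unsigned bits = bits_per_level(indblkshift);

    assert(level * bits < 64);
    return (unsigned)((blkid >> (level * bits)) & ((1u << bits) - 1));
}
```

With an indirect block shift of 14 (128 pointers per indirect block), logical block 0x12345 resolves to slot 4 at level 2, slot 70 at level 1 and slot 69 at level 0. No key comparison happens at any level, which is what makes the structure radix-like rather than btree-like.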
2009-08-19 14:07 yes, similar 2009-08-19 14:08 a feature of zfs design is supposed to be, collapsing lvm and fs pointer layers 2009-08-19 14:08 at least from this, zfs is assuming enterprize level system 2009-08-19 14:08 it seems zfs is 2009-08-19 14:09 in practice, zfs ends up paying a big performance penalty 2009-08-19 14:09 yes 2009-08-19 14:09 ext3 outperforms on equivalent hardware 2009-08-19 14:10 yes, well, however, zfs seems to have interesting ability 2009-08-19 14:12 zfs doesn't depends on physical address at all 2009-08-19 14:12 it is interesting 2009-08-19 14:13 but, disbenefit may be big too 2009-08-19 14:14 dnode seems to translate from logical to physical address, I'm not sure what you mean 2009-08-19 14:14 you mean blkptr? 2009-08-19 14:15 folks 2009-08-19 14:15 yes 2009-08-19 14:15 it seems only have virtual address 2009-08-19 14:15 vdev, and offset 2009-08-19 14:15 and size 2009-08-19 14:16 so, I guess physical address can be changed by DMU(?) silently 2009-08-19 14:17 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2009-08-19 14:19 by the way, zfs locking does not seem much more sophisticated that ours 2009-08-19 14:20 ours? 2009-08-19 14:20 linux? 2009-08-19 14:20 tux3? 
2009-08-19 14:20 well, per dnode mutex instead of per-tree mutex 2009-08-19 14:20 tux3 2009-08-19 14:21 oh 2009-08-19 14:21 so zfs is more granular 2009-08-19 14:21 but nothing really clever, just an ordinary mutex per node 2009-08-19 14:21 -!- ajonat(~ajonat@190.48.122.251) has joined #tux3 2009-08-19 14:21 we may see it carefully 2009-08-19 14:22 *bsd's vfs doesn't have locking like linux vfs 2009-08-19 14:22 it might be inode->i_mutex 2009-08-19 14:23 well, I'm not sure though 2009-08-19 14:23 hmm, zfs does not seem to use extents 2009-08-19 14:23 yes 2009-08-19 14:23 I also seems so 2009-08-19 14:25 at least zfs layer 2009-08-19 14:25 lower layer is hard to see 2009-08-19 14:27 but I seem to remember reading a blog by jeff bonwick talk about advantages of extents 2009-08-19 14:28 ah, it is just for freespace mapping 2009-08-19 14:28 not for file index 2009-08-19 14:28 curious inconsistency 2009-08-19 14:28 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-19 14:28 http://blogs.sun.com/bonwick/entry/space_maps 2009-08-19 14:28 yes, it's free space 2009-08-19 14:29 this blog is fun 2009-08-19 14:29 it seems there is no update recently though 2009-08-19 14:30 would be blogs.oracle.com ;) 2009-08-19 14:31 :) 2009-08-19 14:31 we may also defer to free defree-stash more in future 2009-08-19 14:32 yes, the argument for btree allocation is strong when space is not fragmented 2009-08-19 14:32 maybe not what you meant 2009-08-19 14:33 :) 2009-08-19 14:33 we can use more big bulk free 2009-08-19 14:34 keep defree more for a long term 2009-08-19 14:34 and sort and merge, and apply to free map 2009-08-19 14:35 yes 2009-08-19 14:36 I was referring to the idea I have described, where unfragmented regions of the allocation map can be represented as a btree of extents 2009-08-19 14:36 extra complexity, in return for extra scalability 2009-08-19 14:36 yes 2009-08-19 14:37 if there is single format, it would be extra good 2009-08-19 
14:38 (bitmap + extent) / 2 2009-08-19 14:38 :) 2009-08-19 14:38 instead of switching those 2009-08-19 14:39 I was thinking, the bitmap file stays the way it is, but is sparse 2009-08-19 14:40 btree allocation map is mapped logically into another inode 2009-08-19 14:41 i see 2009-08-19 14:41 um..., it means, switch those? 2009-08-19 14:41 switch? 2009-08-19 14:42 bitmap and btree allocation map (<- free map base on extent?) 2009-08-19 14:42 the btree allocation map would be the primary structure, it may contain regions that are marked as being bitmaps, so we look in the bitmap file for those 2009-08-19 14:43 ah, i see 2009-08-19 14:43 um... 2009-08-19 14:43 we might find that it's more trouble than it's work 2009-08-19 14:44 s/work/worth/ 2009-08-19 14:44 i see 2009-08-19 14:45 well, anyway, it sounds like one of good ideas 2009-08-19 14:45 it may have manegement trouble though 2009-08-19 14:45 flipz: what ever you do, release the code under GPL 2009-08-19 14:46 so that you have some patent protection 2009-08-19 14:46 and maybe ask for some help at OSDL for this kind of stuff 2009-08-19 14:46 of course 2009-08-19 14:47 patent protection? 2009-08-19 14:47 there's enough new ideas here that you might get fucking sued off of you ass for this kind of stuff 2009-08-19 14:47 from contributers? 
2009-08-19 14:47 of course, I expect that IBM and RH to help out in this case, but you never know 2009-08-19 14:48 flipz: the patent thing is about mutual destruction 2009-08-19 14:51 well 2009-08-19 14:51 I was thinking, that is mix of bitmap and extent free map 2009-08-19 14:52 but, it is 2 layer of extent and bitmap 2009-08-19 14:53 yes, I was thinking the same thing 2009-08-19 14:54 probably doesn't hurt 2009-08-19 14:54 an example would help 2009-08-19 14:54 flipz: the logging of this discussion might be admittable in a court 2009-08-19 14:54 so we should talk as much as possible :) 2009-08-19 14:55 and nobody will ever patent anything again 2009-08-19 14:55 well, patent system is not for us only 2009-08-19 14:55 we just need to describe every possible new idea in detail 2009-08-19 14:55 various countries has own patent system 2009-08-19 14:56 anyway, btree for free data is now an ancient concept 2009-08-19 14:56 yes 2009-08-19 14:56 well 2009-08-19 14:57 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-19 14:57 so, I was thinking, mixing bitmap and extent in one btree 2009-08-19 14:58 I don't know how to do it at all though :) 2009-08-19 14:59 for some reason I thought it would be more robust if bitmaps that exist, are at the same logical locations in the bitmap file as they are now 2009-08-19 14:59 then the bitmap btree could just represent each bitmap as a single extent, say 16 bytes\ 2009-08-19 15:00 a 128 MB extent 2009-08-19 15:00 where 128 MB is the size covered by a 4K bitmap 2009-08-19 15:00 max volume size / 16? 2009-08-19 15:01 so, one 16 byte extent can represent a 128 MB region that is entirely free 2009-08-19 15:01 what is meaning 128MB extent? 2009-08-19 15:02 extent in this case just means address, blockcount 2009-08-19 15:02 maybe region is a better word 2009-08-19 15:02 this extent is not dleaf's extent? 
2009-08-19 15:03 no, there is no need to use the exact same format 2009-08-19 15:03 or should I say "yes" ;) 2009-08-19 15:03 :) 2009-08-19 15:03 luckly, I noticed it in this case 2009-08-19 15:04 so, the a region in the free tree might be marked as "represented by a bitmap" 2009-08-19 15:04 ok, so, this extent is internal format in bitmap-file datas? 2009-08-19 15:04 then we look in the bitmap file just as we do now, for free space 2009-08-19 15:05 internal format for allocation btree 2009-08-19 15:05 ok 2009-08-19 15:06 btree, but not leaf? 2009-08-19 15:10 well, I guess it means 2 layer 2009-08-19 15:10 um... 2009-08-19 15:10 probably, it needs to free unused region on bitmap? 2009-08-19 15:12 if a region is represented by a bitmap, it is only represented by the bitmap 2009-08-19 15:12 I meant, block of bitmap itself 2009-08-19 15:12 do freeing in a bitmap works exactly as it does now 2009-08-19 15:12 ah 2009-08-19 15:13 well, I noticed it is unnecessary 2009-08-19 15:13 but, zero clear 2009-08-19 15:13 and write it out 2009-08-19 15:13 that should be about the same issue as it is now, except we would have two inodes to worry about 2009-08-19 15:13 yes 2009-08-19 15:14 ah 2009-08-19 15:15 if we free the block of bitmap itself, it has already good ability? 2009-08-19 15:15 I'm ignoring how hard "freeing the block of bitmap itself" 2009-08-19 15:16 you mean, there is some advantage? 2009-08-19 15:16 yes, good advantage 2009-08-19 15:16 I don't see it right away 2009-08-19 15:16 I guess, btree know free region by dleaf 2009-08-19 15:17 so, bitmap btree's dleaf doesn't that block, it means those blocks are free 2009-08-19 15:18 I imaged it works like free extent map 2009-08-19 15:18 yes 2009-08-19 15:19 also, we can add free counter to for each bitmap block 2009-08-19 15:19 to see it became zero easily 2009-08-19 15:21 um... 
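[Editor's sketch] The allocation map idea discussed above (regions that are entirely free or entirely full stored as extents, everything else deferred to the existing bitmap file) might look roughly like the following. This is a minimal sketch, not tux3 code: the type and function names are hypothetical, a linear scan over an array stands in for the btree, the bitmap is a flat in-memory array indexed by absolute block number, and a set bit is assumed to mean "allocated". The helper functions also reproduce the arithmetic from the log: one 4K bitmap block has 4096 * 8 = 32768 bits, and 32768 blocks of 4K each cover 128 MB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* How a region of the volume is represented in the allocation map. */
enum region_kind { REGION_FREE, REGION_FULL, REGION_BITMAP };

struct region {
    uint64_t start;          /* first block of the region */
    uint64_t count;          /* number of blocks in the region */
    enum region_kind kind;   /* extent (free/full) or "see the bitmap" */
};

/* One bit per block, so a bitmap block tracks blocksize * 8 blocks. */
static uint64_t blocks_per_bitmap_block(uint32_t blocksize)
{
    return (uint64_t)blocksize * 8;
}

/* Bytes of volume space covered by a single bitmap block. */
static uint64_t bytes_per_bitmap_block(uint32_t blocksize)
{
    return blocks_per_bitmap_block(blocksize) * blocksize;
}

/* Returns 1 if `block` is free, 0 if it is allocated. */
static int block_is_free(const struct region *map, size_t nregions,
                         const uint8_t *bitmap, uint64_t block)
{
    for (size_t i = 0; i < nregions; i++) {
        const struct region *r = &map[i];

        if (block < r->start || block - r->start >= r->count)
            continue;
        switch (r->kind) {
        case REGION_FREE:
            return 1;
        case REGION_FULL:
            return 0;
        case REGION_BITMAP:
            /* Only mixed regions need the bitmap at all. */
            assert(bitmap != NULL);
            return !((bitmap[block >> 3] >> (block & 7)) & 1);
        }
    }
    assert(!"block not covered by any region");
    return 0;
}
```

In this sketch a region that becomes completely free or completely full would be switched back to an extent entry, at which point its bitmap block no longer needs to be consulted.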
2009-08-19 15:22 if we can free bitmap block easily, this sounds good 2009-08-19 15:22 oh 2009-08-19 15:22 not so good 2009-08-19 15:22 hard to update the counter 2009-08-19 15:22 we have to know fulled bitmap block 2009-08-19 15:23 oh, why? 2009-08-19 15:23 we have to keep it somewhere 2009-08-19 15:23 actually, not that hard 2009-08-19 15:23 I thought, the block of bitmap itself has counter 2009-08-19 15:24 so, bitmap data is (4096 - 4) * 8 bits, for example 2009-08-19 15:25 I guess it can know big free region, but it can't know big fully filled region 2009-08-19 15:31 it's a little inconvenient when the bitmap does not represent exactly a power of two blocks 2009-08-19 15:31 what is problem? 2009-08-19 15:34 well, anyway, it has to solve the fully filled block problem 2009-08-19 15:34 maybe it's just me, I like bitmaps to align exactly 2009-08-19 15:35 yes, fully filled 128 MB region would be represented by an extent again 2009-08-19 15:35 in general, nearly empty and nearly full regions are best represented by extents 2009-08-19 15:35 yes 2009-08-19 15:36 well, idea is dleaf has extent already :) 2009-08-19 15:40 -!- RazvanM_(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-19 15:45 true 2009-08-19 15:46 well I like the pattern of mapping a btree into a logical file 2009-08-19 15:46 and then the dleaf is not really visible to the logically-mapped btree 2009-08-19 15:47 we have two layers of btrees 2009-08-19 15:47 it sounds inefficient, but it isn't (I think) 2009-08-19 15:48 I think it's inefficient, but it's necessary 2009-08-19 15:48 the page cache is the reason it is not inefficient 2009-08-19 15:49 most accesses to logically mapped btree nodes are looked in efficiently via the page cache radix tree 2009-08-19 15:49 it's inefficient than one level btree clearly 2009-08-19 15:50 that's not so clear 2009-08-19 15:50 but we can't avoid it 2009-08-19 15:50 one level btree only lookup one btree 2009-08-19 15:50 metadata * one btree + 
data 2009-08-19 15:51 but, two btree is, metadata * 2 btree + data? 2009-08-19 15:51 when we access a physical block, we look it up in the volmap, a deeper radix tree than the one mapping a file 2009-08-19 15:51 it requires cache 2009-08-19 15:52 we only have to do a lookup in the underlying tree after a miss in the page cache 2009-08-19 15:52 there are many more hits than misses 2009-08-19 15:52 yes 2009-08-19 15:52 but, I can't see why I can't call it inefficient :) 2009-08-19 15:52 you can call whatever you like inefficient :) 2009-08-19 15:53 :) 2009-08-19 15:53 well, anyway, 2 btree is unnecessary for atomic commit, I think 2009-08-19 15:55 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-19 15:57 yes 2009-08-19 15:57 so... back to that 2009-08-19 15:57 bitmap? 2009-08-19 15:57 I will be away from keyboard for a little while now 2009-08-19 15:57 back to replay 2009-08-19 15:58 specifically, to replay technical note 2009-08-19 15:58 ok, good 2009-08-19 15:58 I'll sleep 2009-08-19 16:01 oyasumi 2009-08-19 19:46 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-19 21:19 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-19 22:14 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-19 22:36 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-08-20 02:33 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-08-20 06:40 -!- msindia(~lol@117.254.201.222) has joined #tux3 2009-08-20 09:42 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-20 10:11 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-20 10:37 good morning 2009-08-20 10:37 that too 2009-08-20 12:01 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-20 12:25 -!- pgquiles(~pgquiles@159.Red-81-39-154.dynamicIP.rima-tde.net) has joined #tux3 2009-08-20 
15:28 -!- ajonat(~ajonat@190.48.99.135) has joined #tux3 2009-08-20 15:35 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-08-20 20:56 hey folks, flipz 2009-08-20 21:31 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-20 21:45 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-20 22:15 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-20 22:51 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-20 23:54 -!- setheus_(~setheus@pool-173-74-124-37.dllstx.fios.verizon.net) has joined #tux3 2009-08-21 06:11 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-21 07:30 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-21 07:46 -!- shapor_(~shapor@yzf.shapor.com) has joined #tux3 2009-08-21 07:46 -!- hirofumi_(~hirofumi@210.171.168.39) has joined #tux3 2009-08-21 07:46 -!- vcgomes`(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-08-21 08:23 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-21 10:22 good morning 2009-08-21 12:27 hey 2009-08-21 14:13 ~. 
2009-08-21 17:26 -!- marcin(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2009-08-21 19:38 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-21 21:29 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-08-21 22:20 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-08-22 08:44 -!- npmccallum(~npmccallu@205.167.128.153) has joined #tux3 2009-08-22 09:27 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2009-08-22 15:56 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-22 23:27 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-23 03:57 -!- yousef(~yousef@helium.yousef.org) has joined #tux3 2009-08-23 03:57 hi 2009-08-23 14:31 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-23 16:41 -!- npmccallum(~npmccallu@205.167.128.153) has joined #tux3 2009-08-23 17:19 -!- npmccallum(~npmccallu@32.131.93.70) has joined #tux3 2009-08-23 18:50 -!- tim_dimm(~timothyhu@adsl-66-166-137.asm.bellsouth.net) has joined #tux3 2009-08-23 22:53 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-24 02:37 -!- kedars_(~kedars@socks.wantstofly.org) has joined #tux3 2009-08-24 02:43 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2009-08-24 10:45 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-24 16:35 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-24 20:56 -!- yousef(~yousef@helium.yousef.org) has joined #tux3 2009-08-24 20:56 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-08-24 20:56 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-08-24 20:57 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2009-08-24 23:34 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-25 05:36 anybody alive here? 
:) 2009-08-25 05:48 hi 2009-08-25 06:26 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-25 08:20 hi hirofumi 2009-08-25 08:20 sorry I left earlier... are you still around? 2009-08-25 08:30 hi 2009-08-25 08:30 I'm going to sleep soon though 2009-08-25 08:31 no problem... basically, I saw the link for tux3 on kernelnewbies.org and I would like to get involved with the project.. who should I talk to? 2009-08-25 08:33 flips would be good for it 2009-08-25 08:33 ok, I'll ask him when he's around. thanks. 2009-08-25 08:33 btw, I've knew kernelnewbies.org has link now 2009-08-25 08:35 yeah, apparently someone called "debiandev" added the link to the kernel projects page. 2009-08-25 08:36 ah, i see 2009-08-25 08:37 well, the import thing is what you want to 2009-08-25 08:37 though 2009-08-25 08:37 http://tux3.org/tux3/ 2009-08-25 08:38 this is hg repository, and latest source of tux3 2009-08-25 08:38 it's including the both of userland and kernel 2009-08-25 08:39 kernel part is tux3/user/kernel/* 2009-08-25 08:40 userland is tux3/user/*, but those also use tux3/user/kernel/* 2009-08-25 08:40 ok... I'll take a look at those till I talk to flips... 2009-08-25 08:42 if you have question, here or tux3-ml would be good 2009-08-25 08:43 I'll probably just stick around here, but I have the ml as a second option. 2009-08-25 08:44 if you have question for design, flips is best to answer, I think 2009-08-25 08:44 yes 2009-08-25 08:44 well, flips seems to be busy recently 2009-08-25 08:45 perhaps, he have some time at weekend 2009-08-25 08:45 I messaged him and we'll see when he responds... it's not urgent or anything. 2009-08-25 08:46 so I can wait. 
2009-08-25 08:47 good 2009-08-25 09:25 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-25 09:25 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-25 09:25 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-08-25 09:25 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-08-25 09:25 -!- yousef(~yousef@helium.yousef.org) has joined #tux3 2009-08-25 09:25 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-08-25 09:25 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-08-25 09:25 -!- shapor_(~shapor@yzf.shapor.com) has joined #tux3 2009-08-25 09:25 -!- setheus_(~setheus@pool-173-74-124-37.dllstx.fios.verizon.net) has joined #tux3 2009-08-25 09:25 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-25 09:25 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-08-25 09:25 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-08-25 09:26 -!- kedars_(~kedars@socks.wantstofly.org) has joined #tux3 2009-08-25 09:26 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-08-25 09:26 -!- data(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-08-25 09:26 -!- persson(persson@nescafe.bsnet.se) has joined #tux3 2009-08-25 09:26 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-25 09:26 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-25 09:26 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-08-25 09:26 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-08-25 09:26 -!- yousef(~yousef@helium.yousef.org) has joined #tux3 2009-08-25 09:26 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-08-25 09:26 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-08-25 09:26 -!- shapor_(~shapor@yzf.shapor.com) has joined #tux3 2009-08-25 09:26 -!- 
setheus_(~setheus@pool-173-74-124-37.dllstx.fios.verizon.net) has joined #tux3 2009-08-25 09:26 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-25 09:26 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-08-25 09:26 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-08-25 10:46 -!- kunir(~kunir@cs27066117.pp.htv.fi) has joined #tux3 2009-08-25 11:24 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-25 17:14 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-25 17:24 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-25 17:27 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-25 17:31 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-25 18:01 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-25 20:30 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-25 20:59 -!- ajonat(~ajonat@190.48.105.24) has joined #tux3 2009-08-25 21:32 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-25 22:21 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-26 09:16 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-26 10:31 good morning 2009-08-26 10:31 yousef, there? 2009-08-26 10:32 yeah 2009-08-26 10:32 you rang? 2009-08-26 10:32 yeah... 2009-08-26 10:33 being the kernel hacker wannabe that I am, I'm interested in contributing to this project 2009-08-26 10:33 I have some _very_ basic kernel programming knowledge (basically from a class that I've taken before) 2009-08-26 10:33 there's a project for every level of kernel skill 2009-08-26 10:34 what part of kernel interests you? 2009-08-26 10:34 note: we have a number of tux3 university alumni here who will be able to help you 2009-08-26 10:36 we went over several topics (cpu scheduling and cpu scheduling algorithms e.g. 
fcfs, sjf, srtf, etc), paging and page-replacement algorithms (lru, mfu, lfu, ..etc), disk IO scheduling (fcfs, sstf, scan, c-scan, ..etc), filesystems (attributes: name, type, location, size, protection, ..etc, operations: create, rename, delete, write, ..etc, access methods: sequential, direct, indexed, ..etc), important structures in the kernel (file, dentry, inode, super_bloc 2009-08-26 10:37 we had to modify a CPU scheudler, IO scheduler, add a system call to the kernel, ..etc 2009-08-26 10:37 so again, I have some basic knowledge of various topics in the kernel 2009-08-26 10:38 and honestly, I'm just trying to get involved... my interest would the networking subsystems, but that's quite difficult to start at (I've actually had some attempts but never really did anything) 2009-08-26 10:39 so I figured since you guys need some more help (according to the kernel projects page at kernelnewbies.org), I thought you guys may have something for me here. 2009-08-26 10:42 sure, vfs is a good place to start with, regardless of where you eventually end up working in kernel 2009-08-26 10:42 everything in linux involves the vfs at some level 2009-08-26 10:43 same as saying "everything is a file" <- traditional unix saying 2009-08-26 10:43 yeah... I remember we cover the FCB and FIT at least 2009-08-26 10:43 covered* 2009-08-26 10:43 yeah. 2009-08-26 10:46 ok, well let's see, we still don't have anybody doing the fsck prototype 2009-08-26 10:46 that would be a pretty good introduction to filesystem basics 2009-08-26 10:46 need to understand disk layout and how to traverse it 2009-08-26 10:47 from there you can compare the way we access disk in userspace vs kernel, very useful to know 2009-08-26 10:47 tracks, blocks, etc? 2009-08-26 10:47 blocks and extents 2009-08-26 10:47 where an extent is a range of blocks 2009-08-26 10:47 ok 2009-08-26 10:48 and buffers, object that holds a block 2009-08-26 10:49 alright.. 
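[Editor's sketch] For readers new to the terms used above, here is a small illustration of "block", "extent" and "buffer". It is a sketch only: the struct and function names are hypothetical and are not the definitions used in the tux3 source, and the 4K block size is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this example: 4K blocks, i.e. 2^12 bytes. */
#define BLOCK_BITS 12
#define BLOCK_SIZE (1u << BLOCK_BITS)

/* An extent is a contiguous range of blocks: start address plus count. */
struct extent {
    uint64_t start;
    uint32_t count;
};

/* A buffer is an in-memory object holding the contents of one block. */
struct buffer {
    uint64_t index;                  /* which block this buffer caches */
    unsigned char data[BLOCK_SIZE];  /* the block's bytes */
};

/* Does the extent cover the given block number? */
static int extent_contains(struct extent e, uint64_t block)
{
    return block >= e.start && block - e.start < e.count;
}

/* Byte offset on the volume at which a block begins. */
static uint64_t block_to_byte_offset(uint64_t block)
{
    assert(block <= (UINT64_MAX >> BLOCK_BITS));
    return block << BLOCK_BITS;
}
```

An extent { 100, 8 } covers blocks 100 through 107, and block 2 starts at byte offset 8192 on the volume.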
2009-08-26 10:49 starting point would be to look at tux3graph.c 2009-08-26 10:49 (hirofumi's code) 2009-08-26 10:50 ok 2009-08-26 10:51 so, let's see... 1) read tux3graph.c and 2) read about disk layouts ... would that be a good first step? 2009-08-26 10:52 yes, good 2009-08-26 10:52 any questions, just ask 2009-08-26 10:52 most people on the channel can answer pretty well I think 2009-08-26 10:52 yeah, I'll probably bug you guys a lot :) 2009-08-26 11:58 folks 2009-08-26 11:59 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-26 12:12 -!- ajonat(~ajonat@190.48.107.102) has joined #tux3 2009-08-26 12:55 -!- ajonat(~ajonat@190.48.98.154) has joined #tux3 2009-08-26 14:26 -!- edt(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-26 14:31 -!- ed__(~Ed@233-76.162.dsl.aei.ca) has joined #tux3 2009-08-26 15:09 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2009-08-26 15:10 hirofumi, Have you seen this: http://lwn.net/Articles/348825/ 2009-08-26 15:10 hi 2009-08-26 15:11 ah, yes 2009-08-26 15:11 I saw it 2009-08-26 15:11 hi 2009-08-26 15:11 Didn't you reverse engineer that format? 2009-08-26 15:11 yes, I did 2009-08-26 15:11 it's still read-only though 2009-08-26 15:12 miscrosoft has patent for exfat 2009-08-26 15:12 It would be great if you cared to comment on that article. Especially seeing how you wrote a ro driver for exfat 2009-08-26 15:13 um... 2009-08-26 15:13 I would not have big interest to that 2009-08-26 15:14 what comment? 2009-08-26 15:14 Oh ok 2009-08-26 15:15 nevermind then 2009-08-26 15:15 I'm still not reading comments of that article... 2009-08-26 15:19 SEJeff, did you care about proprietary driver? 2009-08-26 15:20 Well for things like USB sticks, this will likely be a big deal 2009-08-26 15:20 People will want them in windows and Linux. 
Microsoft will push exFat 2009-08-26 15:21 yes, probably 2009-08-26 15:21 Most of my usb sticks are ext3 so I don't care 2009-08-26 15:21 And hopefully they will soon be tux3 :) 2009-08-26 15:21 :) 2009-08-26 15:22 well, it would not be hard to write exfat driver 2009-08-26 15:22 But it wouldn't go mainline with patents would it 2009-08-26 15:23 I think it can include to mainline, but yes 2009-08-26 15:23 unclear patent issue is bad 2009-08-26 15:24 patent number of exfat wikipedia would not be hard to workaround though 2009-08-26 15:24 microsoft may have another patents 2009-08-26 15:24 -!- ajonat_(~ajonat@190.48.103.168) has joined #tux3 2009-08-26 15:25 well, anyway, thanks for info 2009-08-26 15:29 you're welcome 2009-08-26 18:02 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-26 19:32 -!- ajonat(~ajonat@190.48.103.168) has joined #tux3 2009-08-26 20:03 that exera story is just plain weird 2009-08-26 20:04 seems like yet another msft plot 2009-08-26 23:10 I think msft is thinking to get money by exfat's patent like vfat 2009-08-26 23:33 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-27 04:15 -!- bd_(~foo@2001:470:1f07:61f::feed:f00d) has joined #tux3 2009-08-27 05:59 good morning 2009-08-27 06:04 hi 2009-08-27 11:39 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-27 12:03 -!- bd_(~foo@2001:470:1f07:61f::feed:f00d) has joined #tux3 2009-08-27 12:55 -!- ajonat(~ajonat@190.48.127.132) has joined #tux3 2009-08-27 14:50 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2009-08-27 17:00 -!- ajonat(~ajonat@190.48.109.159) has joined #tux3 2009-08-27 18:39 -!- ajonat(~ajonat@190.48.114.26) has joined #tux3 2009-08-27 23:16 -!- ajonat(~ajonat@190.48.123.186) has joined #tux3 2009-08-28 00:15 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-28 06:18 in tux3graph.c, a struct dev is initialized in main(). 
I can't find the type declaration of struct dev anywhere in tux3graph.c or inode.c (which tux3graph.c includes). Where is it defined? 2009-08-28 06:46 the linux cross-referencer isn't helping much either as i can't seem to find a struct dev defined anywhere (there is a struct device, but not struct dev -- and struct device doesn't have an 'fd' member, so the one used in tux3graph.c must be a different one) 2009-08-28 08:13 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-08-28 08:33 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-08-28 10:14 -!- Mekapaedia(~Mekapaedi@208-98-203-182.cable.dynamic.sunwave.net) has joined #tux3 2009-08-28 11:04 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-28 12:01 -!- Mekapaedia(~Mekapaedi@208-98-203-182.cable.dynamic.sunwave.net) has joined #tux3 2009-08-28 14:32 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-08-28 15:06 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-08-28 16:01 -!- Mekapaedia(~Mekapaedi@208-98-203-182.cable.dynamic.sunwave.net) has joined #tux3 2009-08-28 16:58 -!- Mekapaedia(~Mekapaedi@208-98-203-182.cable.dynamic.sunwave.net) has joined #tux3 2009-08-28 17:00 -!- Mekapaedia(~Mekapaedi@208-98-203-182.cable.dynamic.sunwave.net) has joined #tux3 2009-08-28 18:38 -!- Mekapaedia(~Mekapaedi@208-98-203-182.cable.dynamic.sunwave.net) has joined #tux3 2009-08-28 19:08 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-08-28 19:22 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-08-28 19:42 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-08-28 19:53 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-08-28 21:15 yousef, probably, I guess you already found 2009-08-28 21:15 it's in buffer.h 2009-08-28 21:16 tux3/user/buffer.h 2009-08-28 
21:45 hirofumi: thanks. I'm still going over your code, so I may have some more questions. 2009-08-28 21:51 -!- Mekapaedia(~Mekapaedi@208-98-203-182.cable.dynamic.sunwave.net) has joined #tux3 2009-08-28 22:18 bah, what about struct sb? 2009-08-28 22:22 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-28 23:46 -!- bobby1234(~bobby@122.161.90.63) has joined #tux3 2009-08-29 01:11 sb is in tux3/user/kernel/tux3.h 2009-08-29 01:12 I suggest you use a tag tool to read the sources, if you are not using one already: 2009-08-29 01:13 ctags, etags, cscope, global, id, or something 2009-08-29 01:13 I'm using a slightly modified cscope to read the source 2009-08-29 01:15 well, to read source efficiently, the development environment (editor, tags, build, scm, etc.) is important 2009-08-29 01:40 -!- kunir(~kunir@cs27066117.pp.htv.fi) has joined #tux3 2009-08-29 01:41 -!- bobby1234(~bobby@122.161.90.63) has joined #tux3 2009-08-29 01:56 -!- kunir(~kunir@cs27066117.pp.htv.fi) has joined #tux3 2009-08-29 02:23 -!- bobby1234(~bobby@122.161.184.23) has joined #tux3 2009-08-29 02:48 -!- Mekapaedia(~Mekapaedi@208-98-203-182.cable.dynamic.sunwave.net) has joined #tux3 2009-08-29 05:18 hirofumi: yeah... I've cloned the repository to my local disk and used ctags to generate an index of the source... 2009-08-29 05:51 what is the default block size used in tux3? 512? 1k? 2k? 4k? and does struct sb (sb for superblock, I assume) add up to that number? 
2009-08-29 06:07 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-08-29 07:24 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-08-29 11:40 the default block size is 4k for now 2009-08-29 11:42 struct sb is the in-core structure used to manage the superblock 2009-08-29 11:42 struct disksuper is the on-disk superblock 2009-08-29 11:46 tux3 reserves the first 8k of the partition, and disksuper is written at a fixed position (the 4k offset) 2009-08-29 11:50 for now, the size of disksuper is 4k even if the block size is not 4k 2009-08-29 11:51 load_sb/save_sb are the functions that read/write disksuper 2009-08-29 12:21 btw, iirc, ctags can't find callers, but it can find the definition of a structure :) 2009-08-29 12:23 well, it's no problem to ask, but a good tag tool would save you time 2009-08-29 12:34 tux3 time :) 2009-08-29 13:24 -!- ajonat(~ajonat@190.48.118.66) has joined #tux3 2009-08-29 14:37 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-29 15:47 hirofumi: yeah, ctags seems to be doing the job for me for now, and thanks for all the explanations. 
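A small C sketch of the layout described above (first 8k of the partition reserved, a 4k disksuper at the 4k offset, independent of the block size). This is only an illustration under those stated numbers: every `DEMO_*` / `demo_*` name is hypothetical and made up for this example, and it is not how tux3's own load_sb/save_sb are written.

```c
#define _POSIX_C_SOURCE 200809L /* for pread() */
#include <sys/types.h>
#include <unistd.h>

/* Values taken from the discussion above; the names are hypothetical. */
enum {
	DEMO_RESERVED = 8192, /* first 8k of the partition is reserved */
	DEMO_SB_LOC   = 4096, /* the on-disk superblock sits at the 4k offset */
	DEMO_SB_LEN   = 4096, /* and is 4k long whatever the block size is */
};

/* Index of the block that contains the start of the superblock,
 * for a block size of (1 << blockbits) bytes. */
unsigned demo_super_first_block(unsigned blockbits)
{
	return DEMO_SB_LOC >> blockbits;
}

/* Number of blocks the 4k superblock area spans (at least one). */
unsigned demo_super_block_count(unsigned blockbits)
{
	unsigned blocksize = 1u << blockbits;
	return (DEMO_SB_LEN + blocksize - 1) / blocksize;
}

/* Read the raw 4k superblock area from an open device or image file.
 * Returns 0 on success, -1 on error or short read. */
int demo_read_raw_super(int fd, unsigned char buf[DEMO_SB_LEN])
{
	ssize_t got = pread(fd, buf, DEMO_SB_LEN, DEMO_SB_LOC);
	return got == DEMO_SB_LEN ? 0 : -1;
}
```

With 4k blocks the superblock is block 1 and occupies one block; with 512-byte blocks it starts at block 8 and spans 8 blocks. In both cases offset plus length (4k + 4k) ends exactly at the 8k reserved boundary.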
2009-08-29 15:47 I've warned you guys that I'll be bugging you a lot with my questions :) 2009-08-29 16:16 -!- edt(~Ed@dsl-216-221-34-142.aei.ca) has joined #tux3 2009-08-29 16:35 -!- edt(~Ed@dsl-216-221-34-142.aei.ca) has joined #tux3 2009-08-29 16:52 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-08-29 17:07 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-08-29 17:32 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-08-29 18:31 -!- lulzer(~Mekapaedi@S0106001ee5376df1.cg.shawcable.net) has joined #tux3 2009-08-29 18:32 -!- Mekapaedia(~Mekapaedi@S0106001ee5376df1.cg.shawcable.net) has joined #tux3 2009-08-29 18:54 -!- edt(~Ed@dsl-216-221-33-175.aei.ca) has joined #tux3 2009-08-29 19:05 -!- edt(~Ed@232-76.162.dsl.aei.ca) has joined #tux3 2009-08-29 19:20 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-08-29 19:36 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-08-29 19:47 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-30 00:00 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-30 05:16 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-08-30 06:01 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-08-30 08:23 -!- cydork(~vihang@triband-mum-59.184.2.37.mtnl.net.in) has joined #tux3 2009-08-30 09:53 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-08-30 12:26 -!- kedars_(~kedars@socks.wantstofly.org) has joined #tux3 2009-08-30 13:04 -!- Mekapaedia(~Mekapaedi@S0106001ee5376df1.cg.shawcable.net) has joined #tux3 2009-08-30 15:26 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-30 19:35 -!- tim_dimm(~timothyhu@cpe-69-204-166-130.nycap.res.rr.com) has joined #tux3 2009-08-30 22:20 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has 
joined #tux3 2009-08-30 22:38 -!- RazvanM_(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-31 02:36 -!- mecaveats(~ROR@117.254.13.253) has joined #tux3 2009-08-31 02:36 -!- mecaveats(~ROR@117.254.13.253) has left #tux3 2009-08-31 04:04 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-08-31 06:46 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-31 07:56 -!- cdk(~Chinmay@59.95.5.66) has joined #tux3 2009-08-31 10:39 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-08-31 10:41 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-08-31 13:51 -!- codemac_(d8f01e17@webchat.mibbit.com) has joined #tux3 2009-08-31 21:21 alright folks, so tux3 doesn't seem to be in linus' tree (yet)... to make testing/development easier, I'd like to use tux3 on my thumbdrive/external hdd... how do I build tux3? do i just run make in user/kernel/ and then insmod/modprobe the module? 2009-08-31 21:27 yes 2009-08-31 21:28 the kernel module may not work lately, though 2009-08-31 21:28 cd user/kernel; make LINUX=/path/to/kernel 2009-08-31 21:28 should work 2009-08-31 21:30 cool... and there's a mktux3 (or similar) utility in user/ I assume? 2009-08-31 21:30 you know, to create the tux3 filesystem on my thumbdrive. 2009-08-31 21:31 tux3 mkfs /dev/foo 2009-08-31 21:31 sweet. 2009-08-31 21:31 well, even if it works, there are some known memory leaks 2009-08-31 21:32 so long-term tests would not work 2009-08-31 21:34 that's alright... I just want to be able to use the filesystem on a basic level and run tux3graph on it 2009-08-31 21:36 for basic-level use, the tux3 command or fuse may be what you want 2009-08-31 21:37 both can be tested in userland 2009-08-31 21:38 e.g. 
2009-08-31 21:38 ./tux3 mkfs file-foo 2009-08-31 21:38 well, by "on a basic level" I mean I need to at least have a block device formatted in tux3 so I could do `tux3graph /dev/foo` 2009-08-31 21:39 echo test-data | ./tux3 write file-foo "filename" 2009-08-31 21:39 oh I see 2009-08-31 21:39 yes 2009-08-31 21:39 yeah that's a great idea... I don't have to use my thumbdrive for it then 2009-08-31 21:39 yes 2009-08-31 21:40 so you can run tux3graph -v file-foo 2009-08-31 21:40 great 2009-08-31 21:42 basic development can be done in userland 2009-08-31 21:42 that's one of the reasons why there is a tux3 command and fuse 2009-08-31 21:43 well, good luck :) 2009-08-31 21:44 thanks. as usual, i'll yell here if i have more questions. 2009-08-31 21:45 yes, good 2009-08-31 22:03 erm, what package/toolkit is the `dot` utility part of? 2009-08-31 22:04 it's graphviz 2009-08-31 22:21 cool... so in tux3graph.c in main(), after parsing commandline arguments/options, we open our device and get a file descriptor, create a temporary dev struct with the fd member set to the file we just opened, pass that dev struct to the rapid_sb macro which creates a pointer to a temporary sb struct and initializes its members... ignoring the details, does this sound right? 2009-08-31 22:22 (up to that point) 2009-08-31 23:43 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-08-31 23:48 yes 2009-08-31 23:50 it's right 2009-09-01 01:24 we then create some buffers (seems to me they're a bunch of linked lists looking at init_buffers() and preallocate_buffers()) we a pool size of 2^19 (a "pool", if I understand the terminology correctly, is basically how much space you have in the volume -- why is it set to a constant here) ? 
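The tux3graph startup sequence confirmed above (open the volume, wrap the fd in a temporary dev struct, hand it to a macro that yields a pointer to a temporary sb struct) can be sketched in C with a C99 compound literal. This is a sketch under assumptions: `demo_dev`, `demo_sb`, `demo_rapid_sb` and `demo_startup` are hypothetical stand-ins, the member names other than `fd` are invented, and whether the real rapid_sb macro is written exactly this way is not established by the discussion.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical stand-ins for the real structs, which live in
 * tux3/user/buffer.h (struct dev) and tux3/user/kernel/tux3.h (struct sb). */
struct demo_dev {
	int fd;        /* open file descriptor of the volume */
	unsigned bits; /* log2 of the block size (assumed member) */
};

struct demo_sb {
	struct demo_dev *dev; /* backing device */
	unsigned blockbits;   /* copied from the device for convenience */
};

/* A macro in the spirit of rapid_sb: it expands to a pointer to a
 * temporary struct (a compound literal) with its members initialized.
 * The literal stays valid until the enclosing block ends.
 * Note that the argument is evaluated twice. */
#define demo_rapid_sb(d) \
	(&(struct demo_sb){ .dev = (d), .blockbits = (d)->bits })

/* Walk through the same startup steps for a volume or image file.
 * Returns the block size in bytes, or -1 if the open fails. */
long demo_startup(const char *volname)
{
	int fd = open(volname, O_RDONLY);
	if (fd < 0)
		return -1;
	struct demo_dev dev = { .fd = fd, .bits = 12 }; /* 4k blocks */
	struct demo_sb *sb = demo_rapid_sb(&dev);
	long blocksize = 1L << sb->blockbits;
	close(sb->dev->fd);
	return blocksize;
}
```

The point of the compound literal is that the caller gets a fully initialized struct pointer in one expression, without a separate declaration or heap allocation, which suits a short-lived command-line tool.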
2009-09-01 01:24 s/we a pool/with a pool/ 2009-09-01 03:29 if there are no free buffers, the buffer allocator will reclaim a clean (and unreferenced) buffer 2009-09-01 03:29 in theory, a constant size is not enough 2009-09-01 03:30 but for testing in userland, it is enough in practice 2009-09-01 08:33 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-09-01 09:43 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-01 12:15 -!- pgquiles(~pgquiles@32.Red-83-44-237.dynamicIP.rima-tde.net) has joined #tux3 2009-09-01 12:40 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-01 16:06 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-01 17:03 -!- Mekapaedia(~Mekapaedi@S0106001ee5376df1.cg.shawcable.net) has joined #tux3 2009-09-01 18:07 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-09-01 21:40 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-09-01 22:33 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-09-02 00:02 -!- shweta(~shweta@117.195.38.171) has joined #tux3 2009-09-02 00:08 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-09-02 00:30 -!- pgquiles(~pgquiles@176.Red-79-144-195.dynamicIP.rima-tde.net) has joined #tux3 2009-09-02 00:53 -!- dcg(~dcg@12.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-09-02 08:53 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-02 09:58 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-02 10:24 -!- tim_dimm_(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-02 11:12 hi tim_dimm 2009-09-02 11:31 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-02 11:31 tux3 cabal tonight 2009-09-02 11:31 summer holidays are over ;) 2009-09-02 12:23 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-02 12:52 
7:30, right? 2009-09-02 13:59 I'll be there at 7 2009-09-02 13:59 get the table warmed up 2009-09-02 17:22 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-02 21:06 -!- Mekapaedia(~Mekapaedi@S0106001ee5376df1.cg.shawcable.net) has joined #tux3 2009-09-02 21:30 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-02 21:35 -!- Mekapedia(~Mekapaedi@S0106001ee5376df1.cg.shawcable.net) has joined #tux3 2009-09-02 23:34 -!- Mekapaedia(~Mekapaedi@S0106001ee5376df1.cg.shawcable.net) has joined #tux3 2009-09-02 23:51 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-09-03 00:41 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-09-03 08:29 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-03 08:29 morning tim_dimm 2009-09-03 08:30 morning flips 2009-09-03 08:30 uh, flipz 2009-09-03 08:31 so, wiki ideas 2009-09-03 08:32 what is the most usable, least vulnerable, most elegant wiki thingy? 2009-09-03 09:08 mornin' 2009-09-03 09:39 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-03 09:40 hi shapor 2009-09-03 09:40 < flipz> what is the most usable, least vulnerable, most elegant wiki thingy? 2009-09-03 09:40 or another way of putting it, what is the least worst? 2009-09-03 10:33 mediawiki seems to be popular so I assume it's good... other than that, i have no idea. 
2009-09-09 11:59 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has 
joined #tux3 2009-09-09 12:27 -!- Anshul(~Anshul@117.195.65.98) has joined #tux3 2009-09-09 12:40 can anyone tell me how to join the device drivers channel? 2009-09-09 12:44 Same way you joined here, except with a different channel. 2009-09-09 12:44 yeah i was trying with dd but it didn't work 2009-09-09 12:44 :( 2009-09-09 12:44 i mean #dd 2009-09-09 12:45 Wrong channel, I'd assume. *shrug* 2009-09-09 12:45 can you please 2009-09-09 12:45 help me regarding this 2009-09-09 12:45 I don't know what channel you're trying to join, or even if there is any channel about device drivers on this network. 2009-09-09 12:46 okay 2009-09-09 12:46 i will find it out 2009-09-09 12:46 :) 2009-09-09 12:56 yeah joined that one man 2009-09-09 12:56 :) 2009-09-09 20:16 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-09 20:54 -!- ajonat(~ajonat@190.48.127.119) has joined #tux3 2009-09-09 23:39 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-09-10 08:01 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-10 11:12 -!- Ansh(~Anshul@117.195.67.133) has joined #tux3 2009-09-10 11:20 -!- Ansh(~Anshul@117.195.67.133) has joined #tux3 2009-09-10 16:55 -!- ajonat(~ajonat@190.48.127.119) has joined #tux3 2009-09-10 17:08 -!- ajonat_(~ajonat@190.48.110.75) has joined #tux3 2009-09-10 23:28 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-09-10 23:31 -!- pgquiles(~pgquiles@176.Red-79-144-195.dynamicIP.rima-tde.net) has joined #tux3 2009-09-11 04:01 -!- mohaa(~mohaa@89.16.14.236) has joined #tux3 2009-09-11 04:07 -!- mohaa(~mohaa@89.16.14.236) has left #tux3 2009-09-11 05:25 -!- pgquiles(~pgquiles@176.Red-79-144-195.dynamicIP.rima-tde.net) has joined #tux3 2009-09-11 10:52 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-11 12:50 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-09-11 13:39 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) 
has joined #tux3 2009-09-11 15:32 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-11 16:04 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-11 19:03 -!- tim_dimm_(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-11 21:20 -!- ajonat(~ajonat@190.48.110.75) has joined #tux3 2009-09-11 21:32 here's a little informational backfill, showing that gcc is expected to generate good code for unaligned integer accesses, provided we declare them packed: http://lkml.indiana.edu/hypermail/linux/kernel/0902.1/02147.html 2009-09-11 21:33 which comes up with tux3 xattr scheme I think 2009-09-12 00:38 -!- RazvanM(~RazvanM@pool-173-67-51-165.bltmmd.east.verizon.net) has joined #tux3 2009-09-12 04:30 -!- tim_dimm_(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-12 06:34 -!- loomsen(~loomsen@dslb-088-078-129-001.pools.arcor-ip.net) has joined #tux3 2009-09-12 07:59 -!- dcg(~dcg@190.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-09-12 08:35 -!- tim_dimm_(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-12 09:50 -!- tim_dimm_(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-12 14:31 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-12 14:38 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-12 16:38 -!- pgquiles(~pgquiles@75.Red-81-33-103.dynamicIP.rima-tde.net) has joined #tux3 2009-09-12 23:52 -!- RazvanM(~RazvanM@pool-173-67-60-144.bltmmd.east.verizon.net) has joined #tux3 2009-09-13 05:05 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-13 08:50 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-13 10:56 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-13 11:20 -!- pgquiles(~pgquiles@176.Red-79-144-195.dynamicIP.rima-tde.net) has joined #tux3 2009-09-13 11:40 -!- 
tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-14 23:42 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-09-14 23:44 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-09-15 01:52 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-09-15 07:45 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-15 08:05 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-15 09:01 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-15 09:40 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-15 11:18 -!- pgquiles(~pgquiles@176.Red-79-144-195.dynamicIP.rima-tde.net) has joined #tux3 2009-09-15 11:22 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-15 16:19 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-15 16:46 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-09-15 18:29 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-09-15 19:40 violin memory made the news 2009-09-15 19:40 oh yeah? 2009-09-15 19:41 link? 2009-09-15 19:41 http://www.theregister.co.uk/2009/09/15/violin_gets_basile/ 2009-09-15 19:41 hmm 2009-09-15 19:41 thought you'd already have the scoop 2009-09-15 19:41 I knew about Basile 2009-09-15 19:44 given the divisions in fusionio as a company, they might consider renaming to fissionio 2009-09-15 19:45 hah 2009-09-15 19:45 that's funny 2009-09-15 19:46 Basile is smart 2009-09-15 19:46 He was trying to raise another $20M or so for Violin 2009-09-15 19:47 my guess is that he's done it 2009-09-15 19:48 pr all over the place. 
hit hpc wire too 2009-09-15 19:50 http://www.hpcwire.com/offthewire/Solace-Breaks-the-Microsecond-Barrier-for-Shared-Memory-Messaging-59231992.html 2009-09-15 19:50 that's cool too 2009-09-15 19:55 more power to them 2009-09-15 19:55 never give up, never surrender 2009-09-15 19:55 talking about violin, right? 2009-09-15 19:56 severe constipation on their part 2009-09-15 23:44 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-09-15 23:59 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-09-16 00:38 hey folks 2009-09-16 00:42 flips: you work for violin now ? 2009-09-16 01:21 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-09-16 03:00 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-09-16 04:36 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-16 07:00 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-09-16 08:03 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2009-09-16 08:07 Hello? 2009-09-16 08:21 Personne ici except us chickens? 2009-09-16 09:04 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3 2009-09-16 09:04 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-16 09:36 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-16 10:50 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-09-16 11:14 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-16 11:55 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-09-16 12:50 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-16 12:59 -!- mib_x5yv94og(d92c063c@webchat.mibbit.com) has joined #tux3 2009-09-16 12:59 hi anyone around? 
2009-09-16 13:02 -!- mib_x5yv94og(d92c063c@webchat.mibbit.com) has left #tux3 2009-09-16 14:14 -!- npmccallum(~npmccallu@cpe-76-177-118-80.natcky.res.rr.com) has joined #tux3 2009-09-16 18:03 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2009-09-16 18:28 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3 2009-09-16 18:29 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2009-09-16 18:30 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3 2009-09-16 23:24 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-09-17 00:30 -!- Schachmann(~volker@p50866158.dip.t-dialin.net) has joined #tux3 2009-09-17 00:33 -!- Schachmann(~volker@p50866158.dip.t-dialin.net) has left #tux3 2009-09-17 04:58 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-17 07:01 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-09-17 07:40 -!- npmccallum(~npmccallu@cpe-76-177-118-80.natcky.res.rr.com) has joined #tux3 2009-09-17 09:23 good morning 2009-09-17 09:36 -!- npmccallum(~npmccallu@cpe-76-177-118-80.natcky.res.rr.com) has joined #tux3 2009-09-17 12:06 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-17 12:18 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-17 18:52 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-17 21:37 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-17 22:29 -!- yousef_(~yousef@helium.yousef.org) has joined #tux3 2009-09-17 23:12 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-09-18 01:38 hey flips 2009-09-18 05:05 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-18 07:13 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-09-18 07:37 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 
2009-09-18 08:06 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-18 09:56 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-18 10:10 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-18 10:51 hi tim_dimm 2009-09-18 10:52 hi flipz 2009-09-18 11:40 folks 2009-09-18 11:43 hey bh 2009-09-18 11:45 what's been going on with tux3 ? 2009-09-18 15:30 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-18 19:03 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-18 19:22 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-18 19:43 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-09-18 20:15 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-18 22:55 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-09-19 02:10 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-09-19 03:52 -!- bobby(~bobby@122.162.74.14) has joined #tux3 2009-09-19 03:53 hey all 2009-09-19 08:18 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-09-19 08:43 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-19 09:18 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-09-19 11:00 good morning 2009-09-19 13:04 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-19 13:44 -!- tim_dimm(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-19 14:15 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-19 14:22 -!- bd_(~foo@2001:470:1f07:61f::feed:f00d) has joined #tux3 2009-09-19 14:22 -!- bd_(~foo@2001:470:1f07:61f::feed:f00d) has joined #tux3 2009-09-19 14:57 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-19 15:36 -!- flips(~phillips@phunq.net) 
has left #tux3 2009-09-20 13:13 -!- flips(~phillips@phunq.net) has joined #tux3 2009-09-21 00:02 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-09-21 00:30 -!- kunir(~kunir@ssfi.movial.fi) has joined #tux3 2009-09-21 00:31 -!- kunir(~kunir@ssfi.movial.fi) has joined #tux3 2009-09-21 02:24 -!- tux3bot(~tux3bot@yzf.shapor.com) has joined #tux3 2009-09-21 02:29 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-09-21 07:30 morning shapor 2009-09-21 08:10 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-09-21 10:33 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-21 12:50 hey flipz 2009-09-21 17:39 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-21 17:52 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-21 18:50 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-21 18:50 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-21 20:29 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-21 21:05 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-21 23:42 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-09-22 00:06 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-09-22 02:06 -!- ckwood(~ckwood@cpe-75-82-56-173.socal.res.rr.com) has joined #tux3 2009-09-22 05:37 -!- npmccallum(~npmccallu@cpe-76-177-118-80.natcky.res.rr.com) has joined #tux3 2009-09-22 07:08 well, linus has officially pronounced linux bloated 2009-09-22 07:09 a good enough reason for me to get back to work on our nonbloated-by-design fs 2009-09-22 09:22 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-22 12:42 -!- pgquiles(~pgquiles@50.Red-81-35-101.dynamicIP.rima-tde.net) has joined #tux3 2009-09-22 13:39 -!- dcg(~dcg@206.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-09-22 14:56 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 
2009-09-22 15:07 -!- dcg_(~dcg@154.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-09-22 15:33 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-22 16:26 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-22 16:45 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-22 21:06 -!- bd_(~foo@2001:470:1f07:61f::feed:f00d) has joined #tux3 2009-09-22 23:29 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-09-23 04:32 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-23 06:11 -!- npmccallum(~npmccallu@74-93-194-82-WashingtonDC.hfc.comcastbusiness.net) has joined #tux3 2009-09-23 10:24 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-23 15:23 -!- pgquiles_(~pgquiles@50.Red-81-35-101.dynamicIP.rima-tde.net) has joined #tux3 2009-09-23 15:32 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-23 19:28 -!- npmccallum(~npmccallu@32.165.83.170) has joined #tux3 2009-09-23 20:13 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-09-23 22:15 -!- ajonat(~ajonat@190.48.119.109) has joined #tux3 2009-09-23 23:22 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-09-24 00:08 -!- kunir(~kunir@ssfi.movial.fi) has joined #tux3 2009-09-24 00:44 -!- kunir(~kunir@ssfi.movial.fi) has joined #tux3 2009-09-24 02:11 -!- pranith(c05e2302@webchat.mibbit.com) has joined #tux3 2009-09-24 02:12 hey all 2009-09-24 02:12 flipz, whats cooking? 
2009-09-24 02:12 hirofumi, ping 2009-09-24 05:20 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-24 09:05 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-09-24 09:41 -!- npmccallum(~npmccallu@74-93-194-82-WashingtonDC.hfc.comcastbusiness.net) has joined #tux3 2009-09-24 10:01 -!- npmccallum(~npmccallu@74-93-194-82-WashingtonDC.hfc.comcastbusiness.net) has joined #tux3 2009-09-24 10:30 -!- pgquiles(~pgquiles@50.Red-81-35-101.dynamicIP.rima-tde.net) has joined #tux3 2009-09-24 12:02 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-24 14:12 -!- ajonat(~ajonat@190.48.119.109) has joined #tux3 2009-09-24 16:33 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-24 22:43 -!- pranith(c05e2302@webchat.mibbit.com) has joined #tux3 2009-09-24 22:53 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-09-25 00:07 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-09-25 05:08 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-25 09:58 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-25 14:19 -!- pgquiles(~pgquiles@50.Red-81-35-101.dynamicIP.rima-tde.net) has joined #tux3 2009-09-25 15:10 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-25 19:13 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-26 00:10 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-09-26 05:39 -!- pgquiles(~pgquiles@12.Red-81-37-107.dynamicIP.rima-tde.net) has joined #tux3 2009-09-26 05:44 -!- pgquiles(~pgquiles@12.Red-81-37-107.dynamicIP.rima-tde.net) has joined #tux3 2009-09-26 05:49 -!- pgquiles(~pgquiles@12.Red-81-37-107.dynamicIP.rima-tde.net) has joined #tux3 2009-09-26 14:50 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-26 23:14 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-09-27 00:39 -!- kedars(~kedars@socks.wantstofly.org) has joined 
#tux3 2009-09-27 01:26 -!- Ansh(~Anshul@117.195.68.208) has joined #tux3 2009-09-27 03:19 -!- pgquiles(~pgquiles@157.Red-81-33-102.dynamicIP.rima-tde.net) has joined #tux3 2009-09-27 08:08 -!- pgquiles(~pgquiles@50.Red-81-35-101.dynamicIP.rima-tde.net) has joined #tux3 2009-09-27 08:31 -!- pgquiles(~pgquiles@50.Red-81-35-101.dynamicIP.rima-tde.net) has joined #tux3 2009-09-27 08:36 -!- pgquiles(~pgquiles@50.Red-81-35-101.dynamicIP.rima-tde.net) has joined #tux3 2009-09-27 08:48 -!- pgquiles(~pgquiles@50.Red-81-35-101.dynamicIP.rima-tde.net) has joined #tux3 2009-09-27 14:01 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-27 17:10 -!- ajonat(~ajonat@190.48.111.193) has joined #tux3 2009-09-27 22:22 -!- ajonat(~ajonat@190.48.111.193) has joined #tux3 2009-09-27 22:47 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-09-28 01:16 -!- pgquiles(~pgquiles@50.Red-81-35-101.dynamicIP.rima-tde.net) has joined #tux3 2009-09-28 02:24 -!- geos_one(~chatzilla@213.229.35.178) has joined #tux3 2009-09-28 03:22 flipz: what's been going on ? you there ? 
2009-09-28 05:06 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-28 05:54 -!- npmccallum(~npmccallu@cpe-76-177-118-80.natcky.res.rr.com) has joined #tux3 2009-09-28 09:57 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-28 12:57 -!- ajonat(~ajonat@190.48.120.238) has joined #tux3 2009-09-28 23:52 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-09-28 23:58 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-09-29 00:28 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-09-29 05:18 -!- npmccallum(~npmccallu@cpe-76-177-118-80.natcky.res.rr.com) has joined #tux3 2009-09-29 08:48 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-09-29 10:37 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-09-29 10:58 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-09-29 11:44 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-09-29 11:59 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-29 15:27 -!- ajonat(~ajonat@190.48.125.153) has joined #tux3 2009-09-29 23:36 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-09-30 03:17 -!- data(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-09-30 04:10 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-30 05:40 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-09-30 05:55 -!- npmccallum(~npmccallu@cpe-76-177-118-80.natcky.res.rr.com) has joined #tux3 2009-09-30 06:50 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-30 06:50 morning all 2009-09-30 06:55 -!- ajonat(~ajonat@190.48.125.153) has joined #tux3 2009-09-30 07:05 -!- dcg(~dcg@85.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-09-30 07:16 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-30 07:25 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has 
joined #tux3 2009-09-30 07:43 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-30 07:44 morning timothy 2009-09-30 07:44 morning flipz 2009-09-30 07:45 oftc admin said clonebots hit the irc pretty hard two weeks ago 2009-09-30 07:45 you're disconnecting every few minutes 2009-09-30 07:45 I got knocked off with their sweep 2009-09-30 07:45 laptop goes to sleep 2009-09-30 07:45 need a low power desktop 2009-09-30 07:46 it only happened 3 times 2009-09-30 07:48 timothyhuber: I recommend a MiniITX system, I just got one myself. ;-) http://anandtech.com/video/showdoc.aspx?i=3562 2009-09-30 07:49 I'll check it out kspaans 2009-09-30 08:09 kspaans, I have a couple fit-pcs, amd geode, unfortunately I don't think amd is going to develop that processor any more 2009-09-30 08:10 processor runs on 2 watts and is fast enough for most things 2009-09-30 08:10 definitely fast enough for an always on server 2009-09-30 08:10 Yeah, I stopped playing games after my first year of University, so now the only thing I need power for is compiling kernels really fast. :D 2009-09-30 08:11 which doesn't have to be always on 2009-09-30 08:12 I guess my next low power system is likely to be mini-itx, most probably cyrix 2009-09-30 08:12 well 2009-09-30 08:12 maybe arm if something decent shows up 2009-09-30 08:13 maybe android will get enough spin so vendors start making server-capable arm boxes 2009-09-30 08:13 Why Cyrix? Do they even make CPUs anymore? 2009-09-30 08:15 ah I guess they are via now 2009-09-30 08:15 oh, cyrix became geode 2009-09-30 08:16 so that's what I'm using now 2009-09-30 08:16 I like it, it's sad it's not continuing 2009-09-30 08:17 actually, the processor is .9 watts 2009-09-30 08:22 That would be cool. 2009-09-30 08:23 Heck, I don't use X very much anymore, I'd be happy to jack a serial console into my WRT54GL, and use that -- if I could get a large enough framebuffer. :P 80x25 isn't enough for a young'un like me! 
2009-09-30 08:23 via c7 looks about the same power/speed wise 2009-09-30 08:24 the most power hungry component in the fit pc is the hard disk, which will get replaced with an SSD pretty soon 2009-09-30 08:25 oh, the wifi runs a little hot too 2009-09-30 08:28 wikipedia calls the atom a "huge leap forward" vs the c7 2009-09-30 08:28 http://en.wikipedia.org/wiki/Mini-itx 2009-09-30 08:28 course, wikipedia is always right 2009-09-30 08:32 Well, mine has the N330 atom, a dual core 1.6GHz core, with hyper-threading. So it's not too much of a slouch. 2009-09-30 08:33 got a link? 2009-09-30 08:33 fan or not? 2009-09-30 08:36 Oh, sure, just a sec. 2009-09-30 08:38 http://anandtech.com/video/showdoc.aspx?i=3562 2009-09-30 08:38 That's a review of the board I have. 2009-09-30 08:38 They took better pictures than I did. :P 2009-09-30 08:40 fanless is a non-negotiable requirement for me, for always-on 2009-09-30 08:41 ACTION nods 2009-09-30 08:42 It gets a little warm without a fan, but I don't believe it gets "problematic warm". I haven't had too much chance to play with it yet -- too busy with work and being social at the moment. 2009-09-30 08:43 most of the hear from the fit pc is the hard disk 2009-09-30 08:43 heat I mean 2009-09-30 08:57 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-30 09:19 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-30 09:24 morning all 2009-09-30 09:25 chatty morning :) 2009-09-30 09:30 good morning 2009-09-30 09:31 random chat, mostly 2009-09-30 09:31 good for the chats/day stats though ;-) 2009-09-30 09:31 we chat, therefore we exist 2009-09-30 09:31 there's that 2009-09-30 09:32 not quite as useful as "isn't the new checkin totally amazing" 2009-09-30 09:32 hardly 2009-09-30 09:33 speaking of which- what's next in the queue? 
2009-09-30 09:34 atomic commit 2009-09-30 09:34 and the next step in that is a design note 2009-09-30 09:39 shapor, seems it's time for a going away event 2009-09-30 09:40 set a departure date? 2009-09-30 09:50 Tux3 isn't in mainline right? But it is in the kernel (as opposed to in FUSE)? 2009-09-30 09:56 yes 2009-09-30 09:57 kernel code isn't ready for real use though, it can't handle crashing before umount 2009-09-30 09:57 atomic commit will fix that 2009-09-30 09:58 I was disappointed with the File Systems section of my OS class this summer. I should skip ahead in the Minix3 book to the filesystem part so I can feel like I know things. All the while reading the tux3 code of course. :) 2009-09-30 10:00 how long did you spend on filesystems? 2009-09-30 10:00 in the course? 2009-09-30 10:03 Let's see... 2009-09-30 10:03 It was a single section in our course notes... I'll look at my own notes. 2009-09-30 10:05 Eegads! We only had one or two 1.5 hour lectures on FSes. 2009-09-30 10:05 No wonder why I don't remember learning much. :) 2009-09-30 10:05 :) 2009-09-30 10:05 roughly enough time to learn about index blocks vs data blocks 2009-09-30 10:05 My school doesn't seem very big on systems researchy stuff though. *sigh* 2009-09-30 10:06 Yeah, pretty much. 2009-09-30 10:06 I mostly understand how FAT works! Wow! 2009-09-30 10:06 yummy 2009-09-30 10:06 well all filesystems are the same at that level 2009-09-30 10:07 I'm interested by the idea of log-structured FSs, but haven't read into them that much just yet. 2009-09-30 10:08 here's a good place to start reading more: http://en.wikipedia.org/wiki/Virtual_file_system 2009-09-30 10:08 I really need to learn how _any_ FS works before I get too excited about the fancy stuff. I get ahead of myself a lot. :) 2009-09-30 10:08 flipz: Ooh, I had neglected to consider the VFS, thanks! 
2009-09-30 10:08 hmm, maybe that's not the greatest link 2009-09-30 10:09 flipz: indeed 2009-09-30 10:10 here's a better one: http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html 2009-09-30 10:10 written from the point of view of a relative newbie 2009-09-30 10:11 Well, it's the thought that counts. If I get antsy about not knowing where to start when I look at Linux's source, I'll dive into the VFS stuff. 2009-09-30 10:11 flipz: Perfect. 2009-09-30 10:12 understanding the vfs is essential to any filesystem study 2009-09-30 10:12 even userspace filesystems are written following the vfs model 2009-09-30 10:13 that page looks really good 2009-09-30 10:13 we need more newbies with that amount of clarity :) 2009-09-30 10:17 ah the newbie was michael k johnston 2009-09-30 10:17 how bout that 2009-09-30 11:12 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-09-30 11:22 -!- dcg_(~dcg@109.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-09-30 13:27 -!- dcg_(~dcg@41.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-09-30 13:37 -!- dcg__(~dcg@154.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-09-30 15:18 hey flipz 2009-09-30 15:18 long time no see 2009-09-30 15:19 dang bh, should have logged on earlier 2009-09-30 15:19 we had a party ;-) 2009-09-30 15:19 tux3 party in LA ? 2009-09-30 15:19 how's tux3 going ? is it stalled right now ? 2009-09-30 15:19 no, we actually had a few lines of chat 2009-09-30 15:19 not stalled, just in low priority mode 2009-09-30 15:19 lower rather 2009-09-30 15:20 what's been worked on at the moment ? 2009-09-30 15:20 atomic commit 2009-09-30 15:20 flipz is coming over later- I'll get more out of him 2009-09-30 15:20 ok, brb, rebooting 2009-09-30 15:20 k 2009-09-30 15:52 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-09-30 15:52 back for a bit 2009-09-30 15:52 hey 2009-09-30 15:52 hey 2009-09-30 15:52 long time no see, what's been going on ? 
2009-09-30 15:53 getting back to tux3 any time soon ? 2009-09-30 15:53 been occupied with non tux3 stuff 2009-09-30 15:53 yes, getting back to it 2009-09-30 15:53 about now I think 2009-09-30 15:54 not as heavily as before, but hopefully steady 2009-09-30 15:54 hopefully, when it gets to a certain point more folks can share the load 2009-09-30 15:55 the reason for tux3 is clear: because linux is bloated 2009-09-30 15:55 I happen to agree with linus 2009-09-30 15:57 ...out for a bit 2009-09-30 16:00 good 2009-09-30 16:00 working as a consultant right now or something ? 2009-09-30 16:52 -!- dcg(~dcg@122.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-09-30 19:25 -!- edt(~Ed@dsl-216-221-38-250.aei.ca) has joined #tux3 2009-09-30 19:44 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-30 20:14 -!- edt(~Ed@dsl-216-221-36-104.aei.ca) has joined #tux3 2009-09-30 20:35 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-30 21:00 -!- ajonat(~ajonat@190.48.125.153) has joined #tux3 2009-09-30 21:22 -!- flips(~phillips@phunq.net) has joined #tux3 2009-09-30 21:35 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-09-30 21:44 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-09-30 22:51 -!- ajonat(~ajonat@190.48.125.153) has joined #tux3 2009-09-30 23:12 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-09-30 23:12 back 2009-09-30 23:23 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-09-30 23:24 -!- ajonat(~ajonat@190.48.125.153) has joined #tux3 2009-10-01 00:09 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-10-01 01:05 -!- gr8rahul(~rahul@socks.wantstofly.org) has joined #tux3 2009-10-01 02:19 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3 2009-10-01 02:28 -!- pranith(8bb58f22@webchat.mibbit.com) has joined #tux3 2009-10-01 02:28 hey all 2009-10-01 
06:15 hi pranith 2009-10-01 06:15 whoops 2009-10-01 06:15 slight timezone issue it seems 2009-10-01 07:09 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-01 07:20 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-01 07:34 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-01 08:57 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-01 09:03 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-01 09:28 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-01 10:20 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-01 10:33 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-01 10:34 -!- ajonat(~ajonat@190.48.125.153) has joined #tux3 2009-10-01 11:15 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-10-01 12:23 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-01 13:28 -!- pgquiles(~pgquiles@1.Red-81-37-107.dynamicIP.rima-tde.net) has joined #tux3 2009-10-01 14:00 hey flipz 2009-10-01 14:30 -!- cdk(~Chinmay@59.95.1.36) has joined #tux3 2009-10-01 14:52 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-10-01 20:23 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-01 20:44 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-01 22:49 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-01 22:59 -!- ajonat(~ajonat@190.48.113.29) has joined #tux3 2009-10-02 00:09 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-10-02 02:26 -!- cdk(~Chinmay@59.95.47.172) has joined #tux3 2009-10-02 02:45 
-!- soliko(~soliko@85.65.59.203.dynamic.barak-online.net) has joined #tux3 2009-10-02 06:04 -!- cdk(~Chinmay@59.95.47.172) has joined #tux3 2009-10-02 06:53 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-02 07:39 good morning 2009-10-02 07:39 hi cdk 2009-10-02 07:40 hi 2009-10-02 07:41 long time 2009-10-02 09:00 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-02 09:16 morning flipz 2009-10-02 09:25 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-02 09:56 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-02 10:12 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-10-02 10:27 hi timothyhuber 2009-10-02 10:52 -!- pranith(~bobby@122.162.70.27) has joined #tux3 2009-10-02 10:52 hi pranith 2009-10-02 10:52 hey flips! 2009-10-02 10:52 long time :) 2009-10-02 10:52 I saw you were on a few times when I was asleep 2009-10-02 10:52 sorry, can't answer when I'm asleep ;) 2009-10-02 10:53 yeah, time zones... 2009-10-02 10:53 how's school going? 2009-10-02 10:53 its been a year since I passed out :) 2009-10-02 10:54 im working for mentor graphics now.. 2009-10-02 10:54 passed out == graduated 2009-10-02 10:55 i see you are closing in on the atomic commit thingy 2009-10-02 10:55 wow, congratulations 2009-10-02 10:56 i will be applying for post graduation next year 2009-10-02 10:56 yes, I'll be working on atomic commit over the weekend 2009-10-02 10:59 are we close to it being complete ? 
:) 2009-10-02 12:48 -!- gr8rahul(~rahul@socks.wantstofly.org) has joined #tux3 2009-10-02 13:02 -!- ajonat(~ajonat@190.48.96.204) has joined #tux3 2009-10-02 13:31 hey flipz 2009-10-02 14:58 sk8 oclock 2009-10-02 19:03 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-10-02 19:43 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-02 22:14 -!- ajonat(~ajonat@190.48.96.204) has joined #tux3 2009-10-02 23:01 -!- ajonat(~ajonat@190.48.96.204) has joined #tux3 2009-10-02 23:47 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-10-03 01:34 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-10-03 01:34 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-10-03 01:34 -!- flips(~phillips@phunq.net) has joined #tux3 2009-10-03 01:34 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-10-03 01:42 -!- pranith(~bobby@122.162.69.147) has joined #tux3 2009-10-03 08:20 -!- pranith(~bobby@122.162.73.84) has joined #tux3 2009-10-03 10:40 -!- pranith(~bobby@122.162.73.84) has joined #tux3 2009-10-03 11:35 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-03 11:52 -!- bd_(~foo@2001:470:1f07:61f::feed:f00d) has joined #tux3 2009-10-03 12:21 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-03 12:59 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-03 13:47 -!- ajonat(~ajonat@190.48.96.204) has joined #tux3 2009-10-03 14:08 -!- bd_(~foo@2001:470:1f07:61f::feed:f00d) has left #tux3 2009-10-03 14:30 -!- ajonat(~ajonat@190.48.96.204) has joined #tux3 2009-10-03 17:32 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-03 17:38 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-03 17:42 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-03 18:53 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-03 23:10 -!- 
RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-10-04 05:04 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-04 05:06 -!- kunir(~kunir@195.148.105.102) has left #tux3 2009-10-04 05:07 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-04 07:55 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-04 09:07 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-04 10:27 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-04 10:42 -!- dcg(~dcg@30.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-10-04 10:50 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-04 11:12 -!- dcg_(~dcg@122.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-10-04 11:38 -!- dcg_(~dcg@98.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-10-04 12:49 -!- dcg__(~dcg@184.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-10-04 13:43 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-04 14:00 -!- dcg(~dcg@20.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-10-04 14:36 -!- ajonat(~ajonat@190.48.96.204) has joined #tux3 2009-10-04 14:37 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-04 14:48 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-04 15:19 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-04 15:20 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-04 16:02 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-04 16:30 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-04 16:39 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-04 16:41 -!- bd__(~foo@satoko.is.fushizen.net) has joined #tux3 2009-10-04 17:12 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-04 20:49 -!- ajonat(~ajonat@190.48.96.204) has joined #tux3 2009-10-04 21:59 -!- 
kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-10-04 23:03 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-10-05 04:21 -!- bh(~billh@ip68-107-19-169.sd.sd.cox.net) has joined #tux3 2009-10-05 05:10 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-05 07:14 -!- ajonat(~ajonat@190.48.117.128) has joined #tux3 2009-10-05 09:42 -!- dcg(~dcg@84.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-10-05 10:41 -!- dcg_(~dcg@112.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-10-05 10:55 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-05 11:21 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-05 13:04 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-05 14:16 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-05 14:36 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-10-05 14:39 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-10-05 14:56 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-10-05 16:53 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-05 17:14 -!- ajonat(~ajonat@190.48.122.189) has joined #tux3 2009-10-05 18:08 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-05 19:53 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-05 19:55 -!- ajonat(~ajonat@190.48.122.189) has joined #tux3 2009-10-05 20:09 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-05 20:27 -!- data`(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-10-05 21:36 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-05 22:30 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-05 22:56 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-05 23:48 -!- 
bh_(~billh@ip68-107-19-169.sd.sd.cox.net) has joined #tux3 2009-10-06 00:07 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-10-06 06:12 -!- pixelbeat(~padraig@84.203.137.218) has joined #tux3 2009-10-06 06:55 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-06 07:06 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-06 09:00 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-06 09:16 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-06 09:21 -!- dcg(~dcg@53.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-10-06 09:22 -!- ajonat(~ajonat@190.48.121.203) has joined #tux3 2009-10-06 09:29 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-06 09:52 -!- dcg_(~dcg@85.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-10-06 10:14 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-06 11:27 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-06 11:51 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-06 12:05 -!- dcg__(~dcg@114.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-10-06 17:13 -!- ajonat(~ajonat@190.48.121.203) has joined #tux3 2009-10-06 17:26 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-06 17:28 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-06 18:06 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-06 18:13 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-06 18:30 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-06 18:48 -!- data(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-10-06 19:02 -!- 
timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-06 19:20 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-06 20:21 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-06 20:41 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-06 22:37 -!- ajonat(~ajonat@190.48.121.203) has joined #tux3 2009-10-06 23:37 -!- ajonat_(~ajonat@190.48.125.112) has joined #tux3 2009-10-07 00:22 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-10-07 00:58 hey flips 2009-10-07 02:05 flips: writing scheduler code is a major bitch 2009-10-07 03:37 -!- pgquiles(~pgquiles@1.Red-81-37-107.dynamicIP.rima-tde.net) has joined #tux3 2009-10-07 03:43 -!- pgquiles(~pgquiles@1.Red-81-37-107.dynamicIP.rima-tde.net) has joined #tux3 2009-10-07 03:44 -!- pgquiles(~pgquiles@1.Red-81-37-107.dynamicIP.rima-tde.net) has joined #tux3 2009-10-07 04:48 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-07 05:54 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-10-07 06:05 -!- Ansh(~Anshul@117.195.69.173) has joined #tux3 2009-10-07 07:31 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-07 07:40 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-07 08:05 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-07 08:10 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-07 08:24 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-07 08:35 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-07 09:39 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2009-10-07 11:13 -!- 
RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-07 11:14 -!- timothyhuber(~timothyhu@pool-72-87-168-82.plspca.dsl-w.verizon.net) has joined #tux3 2009-10-07 12:32 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-07 13:02 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-07 13:59 -!- dcg(~dcg@2.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-10-07 14:16 -!- ajonat(~ajonat@190.48.121.208) has joined #tux3 2009-10-07 15:47 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-07 16:25 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-10-07 16:25 -!- gr8rahul(~rahul@socks.wantstofly.org) has joined #tux3 2009-10-07 18:09 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-07 20:45 -!- ajonat(~ajonat@190.48.121.208) has joined #tux3 2009-10-08 00:34 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-10-08 05:38 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-10-08 06:26 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-10-08 06:33 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-08 07:34 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-08 07:59 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-08 08:07 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-08 08:22 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-08 10:05 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-08 10:55 -!- ajonat(~ajonat@190.48.100.211) has joined #tux3 2009-10-08 10:57 -!- ajonat(~ajonat@190.48.100.211) has joined #tux3 2009-10-08 12:00 -!- dcg(~dcg@29.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-10-08 
12:43 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-08 13:14 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-08 13:38 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-08 17:34 -!- ajonat(~ajonat@190.48.93.61) has joined #tux3 2009-10-08 20:33 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has left #tux3 2009-10-08 20:33 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-08 21:40 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-08 22:14 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-10-09 04:43 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-09 06:16 -!- npmccallum(~npmccallu@cpe-76-177-118-80.natcky.res.rr.com) has joined #tux3 2009-10-09 07:12 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-09 07:32 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-09 07:38 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-10-09 07:39 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-09 07:42 -!- npmccallum_(~npmccallu@76.177.118.80) has joined #tux3 2009-10-09 08:28 -!- npmccallum_(~npmccallu@76.177.118.80) has joined #tux3 2009-10-09 09:16 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-09 09:19 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-09 09:58 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-09 11:51 -!- dcg(~dcg@98.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-10-09 13:01 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-09 13:27 flipz, ping 2009-10-09 13:27 
hiya 2009-10-09 13:28 howdy 2009-10-09 13:54 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-09 15:12 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-09 16:56 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-09 18:47 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-09 19:07 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) has joined #tux3 2009-10-09 22:47 -!- ajonat(~ajonat@190.48.104.186) has joined #tux3 2009-10-10 00:45 hey flipz 2009-10-10 06:17 -!- bobby(~bobby@122.161.184.31) has joined #tux3 2009-10-10 07:33 -!- npmccallum(~npmccallu@76.177.118.80) has joined #tux3 2009-10-10 07:48 -!- bobby(~bobby@122.161.184.31) has joined #tux3 2009-10-10 07:53 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-10 08:34 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-10 09:03 -!- bobby(~bobby@122.161.184.31) has joined #tux3 2009-10-10 09:19 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-10 09:46 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-10 12:03 -!- bobby(~bobby@122.161.184.31) has joined #tux3 2009-10-10 12:09 -!- dcg(~dcg@50.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-10-10 12:35 -!- dcg(~dcg@120.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-10-10 12:53 -!- ajonat(~ajonat@190.48.119.164) has joined #tux3 2009-10-10 13:01 -!- dcg_(~dcg@224.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-10-10 16:20 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-10 19:00 -!- pgquiles_(~pgquiles@1.Red-81-37-107.dynamicIP.rima-tde.net) has joined #tux3 2009-10-10 21:36 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-10 22:32 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-10 23:55 -!- RazvanM(~RazvanM@pool-173-67-56-200.bltmmd.east.verizon.net) 
has joined #tux3 2009-10-11 00:38 -!- RazvanM_(~RazvanM@96.234.237.221) has joined #tux3 2009-10-11 08:17 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-11 10:39 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-11 14:34 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-11 14:56 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-11 15:18 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-11 17:34 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-11 21:12 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-11 21:51 -!- gr8rahul(~rahul@socks.wantstofly.org) has left #tux3 2009-10-12 00:30 -!- RazvanM(~RazvanM@96.234.237.221) has joined #tux3 2009-10-12 06:02 -!- npmccallum(~npmccallu@76.177.130.9) has joined #tux3 2009-10-12 08:09 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-12 08:21 -!- pranith(8bb58f22@webchat.mibbit.com) has joined #tux3 2009-10-12 08:21 hey all 2009-10-12 08:21 flips: is the mercurial repo updated? can I pull from it? 
2009-10-12 08:50 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-12 10:08 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-12 10:58 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-12 10:59 -!- pgquiles(~pgquiles@228.Red-79-146-250.dynamicIP.rima-tde.net) has joined #tux3 2009-10-12 11:04 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-10-12 11:32 -!- ajonat(~ajonat@190.48.122.227) has joined #tux3 2009-10-12 12:46 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-12 12:59 -!- dcg(~dcg@25.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-10-12 16:44 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-12 19:50 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-12 20:18 pranith, mercurial is up to date 2009-10-12 20:49 -!- ajonat(~ajonat@190.48.122.227) has joined #tux3 2009-10-13 00:25 -!- RazvanM(~RazvanM@96.234.237.221) has joined #tux3 2009-10-13 01:13 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-10-13 02:02 -!- pranith(8bb58f22@webchat.mibbit.com) has joined #tux3 2009-10-13 02:02 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-10-13 04:19 hey flips 2009-10-13 07:52 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-10-13 07:52 -!- bh(~billh@ip68-107-19-169.sd.sd.cox.net) has joined #tux3 2009-10-13 07:52 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-10-13 07:52 -!- yousef_(~yousef@helium.yousef.org) has joined #tux3 2009-10-13 08:13 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-13 08:34 -!- npmccallum(~npmccallu@cpe-76-177-130-9.natcky.res.rr.com) has joined #tux3 2009-10-13 10:16 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has
joined #tux3 2009-10-13 11:58 -!- ajonat(~ajonat@190.48.122.227) has joined #tux3 2009-10-13 12:34 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-13 13:16 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-10-13 13:43 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-10-13 13:58 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-10-13 17:24 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-13 18:20 -!- timothyhuber_(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-13 20:24 -!- timothyhuber(~timothyhu@pool-71-119-254-96.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-13 22:35 -!- ajonat(~ajonat@190.48.122.227) has joined #tux3 2009-10-13 22:57 morning everyone. 2009-10-13 22:57 tux3 segfaults when i try to touch a file... maybe it is something you already know.. 2009-10-13 22:58 this is in the fuse version.. 2009-10-13 23:11 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-10-13 23:38 pranith: any idea why? tried running under valgrind or gdb? 2009-10-13 23:38 yeah, it segfaults under valgrind... 2009-10-13 23:38 didn't get the time to check it though... 2009-10-13 23:38 will do that today 2009-10-13 23:38 but I guess you can reproduce it 2009-10-13 23:39 make mkfs 2009-10-13 23:39 make debug 2009-10-13 23:39 cd test/ 2009-10-13 23:39 touch hello --> segfault 2009-10-14 00:16 -!- RazvanM(~RazvanM@96.234.237.221) has joined #tux3 2009-10-14 05:31 -!- asn(~fafa@labs.cs.unipi.gr) has joined #tux3 2009-10-14 05:38 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-10-14 05:50 Being new in the filesystem business, I'm reading tux3's documentation to get in touch. I'm trying to understand the second/third paragraph of http://lwn.net/Articles/288896/.
Can you recommend me some reading that would break that down to something a bit more detailed so that newbies could get it? :) 2009-10-14 07:31 Cooome on guys, one of you that can help me is surely alive 2009-10-14 07:33 Help the tux3 project by helping a poor soul asking for articles regarding tux3! 2009-10-14 08:01 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-14 08:38 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-14 09:27 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-10-14 09:33 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-14 10:56 -!- bobby_(~bobby@122.162.67.190) has joined #tux3 2009-10-14 12:16 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-14 12:58 -!- pgquiles(~pgquiles@228.Red-79-146-250.dynamicIP.rima-tde.net) has joined #tux3 2009-10-14 13:01 -!- pgquiles(~pgquiles@228.Red-79-146-250.dynamicIP.rima-tde.net) has joined #tux3 2009-10-14 13:05 -!- pgquiles(~pgquiles@228.Red-79-146-250.dynamicIP.rima-tde.net) has joined #tux3 2009-10-14 13:19 -!- asn(~fafa@83.212.104.29) has joined #tux3 2009-10-14 14:11 So, is any kind soul alive in here? 
2009-10-14 14:23 -!- ajonat(~ajonat@190.48.122.227) has joined #tux3 2009-10-14 15:40 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-14 16:37 -!- asn(~fafa@83.212.104.29) has joined #tux3 2009-10-14 16:56 -!- timothyhuber_(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-14 17:00 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-10-14 17:02 -!- npmccallum(~npmccallu@cpe-76-177-130-9.natcky.res.rr.com) has joined #tux3 2009-10-14 18:03 -!- timothyhuber_(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-14 18:06 -!- timothyhuber_(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-14 18:52 -!- ajonat(~ajonat@190.48.106.185) has joined #tux3 2009-10-14 19:45 flips, ping 2009-10-14 19:55 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-14 20:37 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-14 21:11 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-14 22:36 asn, hello 2009-10-14 23:48 -!- RazvanM(~RazvanM@96.234.237.221) has joined #tux3 2009-10-15 07:14 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-15 09:16 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-15 10:25 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-15 11:53 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-15 12:55 -!- ajonat(~ajonat@190.48.103.100) has joined #tux3 2009-10-15 17:13 -!- ajonat(~ajonat@190.48.103.100) has joined #tux3 2009-10-15 17:51 -!- yousef(~yousef@helium.yousef.org) has joined #tux3 2009-10-15 17:52 -!- npmccallum(~npmccallu@76.177.130.9) has joined #tux3 2009-10-15 17:52 -!- 
kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-10-15 17:52 -!- bh(~billh@ip68-107-19-169.sd.sd.cox.net) has joined #tux3 2009-10-15 19:14 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-15 19:35 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-15 21:14 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-15 22:09 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-16 01:25 -!- RazvanM(~RazvanM@96.234.237.221) has joined #tux3 2009-10-16 06:45 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-16 07:40 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-16 09:36 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-16 09:55 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-16 10:15 -!- shapor_(~shapor@yzf.shapor.com) has joined #tux3 2009-10-16 10:16 -!- pixelbeat_(~padraig@84.203.137.218) has joined #tux3 2009-10-16 10:19 -!- bh(~billh@ip68-107-19-169.sd.sd.cox.net) has joined #tux3 2009-10-16 10:22 -!- dcg(~dcg@45.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-10-16 10:43 -!- dcg_(~dcg@117.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-10-16 11:46 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-16 12:47 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-16 13:45 -!- pgquiles(~pgquiles@11.Red-81-32-36.dynamicIP.rima-tde.net) has joined #tux3 2009-10-16 13:53 -!- pgquiles(~pgquiles@11.Red-81-32-36.dynamicIP.rima-tde.net) has joined #tux3 2009-10-16 14:07 -!- pgquiles(~pgquiles@11.Red-81-32-36.dynamicIP.rima-tde.net) has joined #tux3 2009-10-16 14:19 -!- ajonat(~ajonat@190.48.98.154) has 
joined #tux3 2009-10-16 15:13 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-16 15:13 -!- bh(~billh@ip68-107-19-169.sd.sd.cox.net) has joined #tux3 2009-10-16 15:20 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-16 16:07 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-16 16:43 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-16 16:43 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-17 00:13 -!- RazvanM(~RazvanM@96.234.237.221) has joined #tux3 2009-10-17 06:09 -!- dcg(~dcg@227.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-10-17 06:19 -!- dcg_(~dcg@36.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-10-17 06:48 -!- dcg__(~dcg@78.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-10-17 07:09 -!- dcg(~dcg@31.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-10-17 07:36 -!- dcg_(~dcg@101.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-10-17 08:13 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-17 12:06 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-17 13:50 -!- bh_(~billh@ip68-107-19-169.sd.sd.cox.net) has joined #tux3 2009-10-17 15:11 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-17 22:52 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-17 23:11 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-18 00:37 -!- RazvanM(~RazvanM@96.234.239.172) has joined #tux3 2009-10-18 01:09 -!- RazvanM_(~RazvanM@pool-173-67-53-14.bltmmd.east.verizon.net) has joined #tux3 2009-10-18 07:33 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-18 07:45 -!- 
timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-18 08:04 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-18 08:23 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-18 09:27 -!- RazvanM(~RazvanM@pool-173-67-51-185.bltmmd.east.verizon.net) has joined #tux3 2009-10-18 10:30 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-10-18 10:35 -!- pgquiles(~pgquiles@228.Red-79-146-250.dynamicIP.rima-tde.net) has joined #tux3 2009-10-18 10:49 -!- pgquiles(~pgquiles@228.Red-79-146-250.dynamicIP.rima-tde.net) has joined #tux3 2009-10-18 10:51 -!- pgquiles(~pgquiles@228.Red-79-146-250.dynamicIP.rima-tde.net) has joined #tux3 2009-10-18 12:42 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-18 14:02 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-18 18:13 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-18 19:24 -!- ajonat(~ajonat@190.48.115.251) has joined #tux3 2009-10-18 19:30 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-18 19:42 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-18 19:49 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-18 20:30 -!- ajonat(~ajonat@190.48.115.251) has joined #tux3 2009-10-18 21:42 -!- timothyhuber_(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-19 00:29 -!- RazvanM(~RazvanM@pool-173-67-51-185.bltmmd.east.verizon.net) has joined #tux3 2009-10-19 06:09 -!- npmccallum(~npmccallu@76.177.130.9) has joined #tux3 2009-10-19 07:36 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-19 10:39 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) 
has joined #tux3 2009-10-19 10:44 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-19 10:47 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-19 11:10 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-19 15:11 -!- ajonat(~ajonat@190.48.127.85) has joined #tux3 2009-10-19 19:02 -!- ajonat(~ajonat@190.48.127.85) has joined #tux3 2009-10-19 22:27 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-10-20 00:33 -!- RazvanM(~RazvanM@pool-173-67-51-185.bltmmd.east.verizon.net) has joined #tux3 2009-10-20 00:39 -!- pranith(8bb58f22@webchat.mibbit.com) has joined #tux3 2009-10-20 00:40 flips: ping 2009-10-20 01:02 hey folks 2009-10-20 01:03 the channel is quite chronically idle which is unfortunate 2009-10-20 01:06 bh: hello 2009-10-20 01:13 how's it going ? 2009-10-20 05:15 hi pranith 2009-10-20 05:23 flips: long time! how are u? 2009-10-20 11:15 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-20 11:17 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-20 12:59 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2009-10-20 13:15 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3 2009-10-20 13:30 -!- pgquiles(~pgquiles@228.Red-79-146-250.dynamicIP.rima-tde.net) has joined #tux3 2009-10-20 14:46 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-10-20 14:51 hey flips 2009-10-20 15:32 -!- ajonat(~ajonat@190.48.103.112) has joined #tux3 2009-10-20 20:26 -!- ajonat(~ajonat@190.48.103.112) has joined #tux3 2009-10-20 21:38 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-21 00:16 -!- RazvanM(~RazvanM@pool-173-67-51-185.bltmmd.east.verizon.net) has joined #tux3 2009-10-21 00:34 -!- ajonat(~ajonat@190.48.121.94) has joined #tux3 2009-10-21 01:54 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-10-21 01:54 -!- flips(~phillips@phunq.net) has joined #tux3 2009-10-21 
03:47 hey flips 2009-10-21 03:47 not sure if you're awake, but saying hi anyways 2009-10-21 08:19 -!- npmccallum(~npmccallu@h22.54.23.98.dynamic.ip.windstream.net) has joined #tux3 2009-10-21 09:00 -!- npmccallum_(~npmccallu@h13.26.190.173.dynamic.ip.windstream.net) has joined #tux3 2009-10-21 11:24 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-21 11:37 -!- ajonat(~ajonat@190.48.121.94) has joined #tux3 2009-10-21 11:51 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-21 13:19 -!- dcg(~dcg@249.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-10-21 13:21 -!- npmccallum__(~npmccallu@h180.60.23.98.dynamic.ip.windstream.net) has joined #tux3 2009-10-21 14:36 -!- ajonat(~ajonat@190.48.121.94) has joined #tux3 2009-10-21 15:30 -!- ajonat(~ajonat@190.48.89.127) has joined #tux3 2009-10-21 16:43 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-10-21 16:58 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-10-21 16:58 -!- yousef(~yousef@helium.yousef.org) has joined #tux3 2009-10-21 18:53 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2009-10-21 18:53 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3 2009-10-21 19:51 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-10-21 21:10 -!- ajonat(~ajonat@190.48.108.103) has joined #tux3 2009-10-22 00:12 -!- RazvanM(~RazvanM@pool-173-67-51-185.bltmmd.east.verizon.net) has joined #tux3 2009-10-22 03:42 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-10-22 04:23 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-10-22 06:25 -!- npmccallum(~npmccallu@h180.60.23.98.dynamic.ip.windstream.net) has joined #tux3 2009-10-22 08:15 -!- npmccallum(~npmccallu@76.177.102.115) has joined #tux3 2009-10-22 08:54 -!- npmccallum(~npmccallu@76.177.102.115) has joined #tux3 2009-10-22 11:08 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-22 11:14 hey bh 
2009-10-22 11:42 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-22 13:10 -!- tim(~Tim@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-22 13:23 tim = tim_dimm? 2009-10-22 13:23 y 2009-10-22 13:23 hi 2009-10-22 13:23 on my old laptop 2009-10-22 13:23 hi 2009-10-22 13:23 battery died 2009-10-22 13:57 hey flipz tim 2009-10-22 13:58 hey bh 2009-10-22 14:06 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-10-22 14:06 hi all 2009-10-22 14:29 -!- tim(~Tim@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-22 14:49 -!- ajonat(~ajonat@190.48.115.50) has joined #tux3 2009-10-22 15:34 -!- tim(~Tim@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-22 19:02 -!- flips(~phillips@phunq.net) has joined #tux3 2009-10-22 19:03 shapor, still there? 2009-10-22 19:04 network was down here, don't know what that was about 2009-10-22 19:04 ACTION does not recommend speakeasy 2009-10-22 19:07 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-10-22 19:26 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-22 20:36 hey flipz 2009-10-22 22:11 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2009-10-22 23:18 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3 2009-10-23 00:43 -!- RazvanM(~RazvanM@pool-173-67-51-185.bltmmd.east.verizon.net) has joined #tux3 2009-10-23 05:27 good morning 2009-10-23 06:12 -!- npmccallum(~npmccallu@76.177.102.115) has joined #tux3 2009-10-23 06:14 flips: good to see you 2009-10-23 08:01 hi kunir 2009-10-23 08:15 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-10-23 10:06 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-23 10:28 it's been so quiet lately 2009-10-23 12:03 it has 2009-10-23 12:05 ready for some new patches?
2009-10-23 12:38 11688000 2009-10-23 12:38 whoops ;) 2009-10-23 14:06 -!- ajonat(~ajonat@190.48.115.50) has joined #tux3 2009-10-23 14:47 -!- pgquiles(~pgquiles@143.Red-83-41-234.dynamicIP.rima-tde.net) has joined #tux3 2009-10-23 15:19 hey flipz 2009-10-23 15:20 when are you going to commit those patches ? 2009-10-23 15:41 hey flips 2009-10-23 15:41 long time no see 2009-10-23 15:41 bh, good question 2009-10-23 15:41 but you might be interested in this: http://jobs.apple.com/index.ajs?method=mExternal.showJob&RID=42559 2009-10-23 15:42 hi pgquiles, que pasa? 2009-10-23 15:42 pgquiles, one thing I don't need is a job ;) 2009-10-23 15:43 not even if it makes tux3 as the default filesystem in mac os x in the future ;-) 2009-10-23 15:43 ooh 2009-10-23 15:44 it does look like a nice job for somebody 2009-10-23 15:44 surely they must know about Mark McKusick 2009-10-23 15:45 what do you think bh, would he do it? 2009-10-23 15:46 heh, they want git experience 2009-10-23 15:48 git is very much like a filesystem 2009-10-23 15:49 ACTION started a git-wrapper library with a Qt API - libQtGit http://gitorious.org/libqtgit/ 2009-10-23 15:50 hmm, interesting, are they thinking about building a "versioning filesystem" ? 2009-10-23 15:51 I just assumed they wanted the candidate to be familiar with source control tools that will be used on the job 2009-10-23 15:52 flipz: I have no idea what they are trying to build but Apple has that "Time Machine" software which could really use a versioning filesystem 2009-10-23 15:55 it would make sense, and it would explain why they would not just be happy with ZFS 2009-10-23 15:55 ZFS is not very good about snapshots of snapshots 2009-10-23 15:55 anyway, it's pure speculation 2009-10-23 15:56 it's interesting that apple is willing to back some filesystem work 2009-10-23 15:56 btrfs envy? 2009-10-23 15:56 couldn't be 2009-10-23 15:57 pgquiles: yeah but i think they recently dropped time machine? 
nice ui, but the implementation was a bit of a horrible hack 2009-10-23 15:58 shapor: did they? I bought a Macbook pro last week and Mac OS X 10.6.1 comes with Time Machine :-? 2009-10-23 15:58 oh i could be wrong then 2009-10-23 15:58 time machine seemed to be pretty popular 2009-10-23 15:58 my condolences on the mac purchase 2009-10-23 15:58 shapor: maybe you mean they dropped zfs, which they did 2009-10-23 15:58 shapor: :-D 2009-10-23 15:59 pgquiles, oh, I didn't know that, why did they drop ZFS? 2009-10-23 15:59 its ufs with directory hard links to emulate snapshots i think 2009-10-23 15:59 shapor: it's just I want to develop on Mac, too. I already do windows-linux cross-platform development, so Mac was the next logical platform 2009-10-23 15:59 http://www.macrumors.com/2009/10/23/apple-shuts-down-open-source-zfs-project/ 2009-10-23 15:59 flipz: no reason said, they just dropped it 2009-10-23 15:59 that links to the job posting too 2009-10-23 16:00 oh, very recent 2009-10-23 16:00 flipz: it was dropped from snow leopard which came out in august? 2009-10-23 16:01 it was suspected to be dropped for a while, now it's official i guess 2009-10-23 16:01 citing license issue 2009-10-23 16:01 I suppose that would be patent licensing issues 2009-10-23 16:01 http://dustin.github.com/2009/10/23/mac-zfs.html 2009-10-23 16:02 what ever happened with the netapp/sun lawsuit? 2009-10-23 16:02 got very quiet 2009-10-23 16:02 i wonder if that's a sticking point in the oracle acquisition 2009-10-23 16:03 sun continues to burn cash and lay off employees 2009-10-23 16:03 haven't heard any hint of that 2009-10-23 16:03 pure conjecture :) 2009-10-23 16:03 it's about to become the oracle netapp lawsuit 2009-10-23 16:03 http://www.dw-world.de/dw/article/0,,4812340,00.html 2009-10-23 16:04 so what's new on the tux3 front?
2009-10-23 16:05 I feel a checkin coming on 2009-10-23 16:05 :) 2009-10-23 16:05 how about a cleanup for mainline submission 2009-10-23 16:05 screw atomic commit 2009-10-23 16:05 :P 2009-10-23 16:05 patches happily accepted 2009-10-23 16:05 even whitespace patches 2009-10-23 16:05 what needs to be done 2009-10-23 16:06 even gross, scripts/Lindent compliance patches 2009-10-23 16:06 I need to fish out the old, initial review thread 2009-10-23 16:06 pretty minor things mostly 2009-10-23 16:06 like not using printf in kernel 2009-10-23 16:07 i.e., use printk instead 2009-10-23 16:07 and wrap it to 80 columns :p 2009-10-23 16:07 lame :p 2009-10-23 16:07 so basically make it less readable? ;) 2009-10-23 16:07 it's a linux thing 2009-10-23 16:08 what's going on with btrfs 2009-10-23 16:08 squeeze the code on the left by 8 character tabs and on the right by the length of a punch card 2009-10-23 16:08 getting usable yet? 2009-10-23 16:08 does anyone follow it? 2009-10-23 16:08 the squished stuff that ends up in the few characters of remaining space in the middle must be pure goodness ;) 2009-10-23 16:08 haven't checked up recently 2009-10-23 16:09 seems to be progressing 2009-10-23 16:09 ext4 marches on 2009-10-23 16:11 hey flipz long time no chat 2009-10-23 16:11 hi bh 2009-10-23 16:11 bh, are you involved in the edf scheduler threads on lkml? 2009-10-23 16:11 kirk mckusick 2009-10-23 16:11 no 2009-10-23 16:11 sorry ;) 2009-10-23 16:11 I knew that 2009-10-23 16:12 him and eric allman are partners 2009-10-23 16:12 so does this mean the edf scheduler on offer is entirely satisfactory to you, freeing you up for other things? 2009-10-23 16:12
I was at their house for the BSDi 10th anniversary and Eric kept hitting on me 2009-10-23 16:12 wee 2009-10-23 16:13 flipz: no, the reason why I've been working on EDF since Feb is not because of a simple EDF policy 2009-10-23 16:13 I've been working on something that's much harder 2009-10-23 16:13 and it's sure to turn heads when it's done 2009-10-23 16:13 I guarantee that 2009-10-23 16:13 like with my -rt pseudo release? same kind of impact 2009-10-23 16:14 it's why I've held out so long; my part of the project is very complicated, never been done before in a modern kernel 2009-10-23 16:14 with multiprocessor support 2009-10-23 16:14 or uniprocessor support even 2009-10-23 16:15 flipz: it was a pretty funny night, met up with very high powered folks, I just wish the BSD community wasn't so fucked 2009-10-23 16:16 flipz: word was that kirk was too expensive for them 2009-10-23 16:16 cheap ass megacorp ;) 2009-10-23 16:17 just tell them to see a few more iphones 2009-10-23 16:17 sell I mean 2009-10-23 16:17 and I spose Matt has no interest in punching a clock 2009-10-23 16:17 I've got a Palm Pre 2009-10-23 16:17 kicks the iphone's ass for core functionality 2009-10-23 16:18 not even close 2009-10-23 16:19 why can't this be an android phone? http://www.mobileburn.com/news.jsp?Id=8091 2009-10-23 16:19 flipz: I regret doing the project at times 2009-10-23 16:19 instead of that funky htc thingy 2009-10-23 16:19 well, it's a linux phone 2009-10-23 16:20 may be enough to finally move me to a smartphone, which is to say, light enough 2009-10-23 16:20 flipz: palm pre 2009-10-23 16:20 if you use facebook and stuff like that, it's a winner, look at synergy for that device 2009-10-23 16:21 it's contact unification across gmail, facebook and yahoo 2009-10-23 16:21 and I'm sure others will be added to it 2009-10-23 16:21 IM is integrated, sms, gtalk, etc..
it's all one same stream 2009-10-23 16:21 same for mail 2009-10-23 16:21 and it's got reasonable copy/paste functionality 2009-10-23 16:22 Flash is coming to that device as well 2009-10-23 16:22 that's my recommendation 2009-10-23 16:22 the only complaint I have with that phone is lag and the crappy dialer program 2009-10-23 16:22 but the core functionality of webOS totally makes it worth putting up with that 2009-10-23 16:22 the dialer program can be updated, hacked. 2009-10-23 16:23 the core OS is written in JavaScript so you can spawn a vim task to edit the stuff on a rooted device for hacking it 2009-10-23 16:23 the JS is human readable 2009-10-23 16:23 ok, off to Home Depot 2009-10-23 16:23 enjoy 2009-10-23 16:24 flipz: get it committed man, too much lag is happening 2009-10-23 16:24 imo, I should have done more file systems work 2009-10-23 16:24 because I think it's the next biggest thing 2009-10-23 16:26 well, it's a fun day for filesystem news 2009-10-23 16:48 by the mailing list it looks like btrfs is still having enospc issues 2009-10-23 16:49 -!- ajonat(~ajonat@190.48.115.50) has joined #tux3 2009-10-23 17:05 shapor, then we should be heroic and solve enospc for tux3 2009-10-23 17:41 flipz: post the announcement here when you have the bits committed or something 2009-10-23 17:41 ACTION heads off to aikido class 2009-10-23 18:40 -!- ajonat(~ajonat@190.48.115.50) has joined #tux3 2009-10-23 19:37 ~,., 2009-10-24 00:06 -!- RazvanM(~RazvanM@pool-173-67-51-185.bltmmd.east.verizon.net) has joined #tux3 2009-10-24 03:01 -!- pgquiles(~pgquiles@143.Red-83-41-234.dynamicIP.rima-tde.net) has joined #tux3 2009-10-24 16:43 -!- npmccallum(~npmccallu@76.177.102.115) has joined #tux3 2009-10-24 17:02 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-24 17:16 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-25 00:26 -!- RazvanM(~RazvanM@pool-173-67-51-185.bltmmd.east.verizon.net) has joined #tux3 2009-10-25 03:35 -!- kunir(~kunir@195.148.105.102)
has joined #tux3 2009-10-25 03:47 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-25 03:51 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-25 03:53 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-25 03:58 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-25 03:59 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-25 04:03 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-25 04:15 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-25 04:21 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-25 04:23 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-25 04:44 hey flipz, so I went over tux3graph.c, and I think I understand a good chunk of it (I understand everything syntax-wise, but not why certain things are done in a certain way)... what can I do next? 2009-10-25 06:28 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-25 06:30 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-25 07:14 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-25 07:17 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-25 07:21 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-25 07:23 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-25 07:29 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-25 07:37 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-25 12:16 -!- pgquiles(~pgquiles@209.Red-79-148-70.dynamicIP.rima-tde.net) has joined #tux3 2009-10-25 12:21 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-25 12:46 -!- pgquiles(~pgquiles@209.Red-79-148-70.dynamicIP.rima-tde.net) has joined #tux3 2009-10-25 12:47 -!- pgquiles(~pgquiles@209.Red-79-148-70.dynamicIP.rima-tde.net) has joined #tux3 2009-10-25 13:09 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-25 14:48 -!- ajonat(~ajonat@190.48.119.64) has joined #tux3 2009-10-25 16:04 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-25 21:38 -!-
edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-25 23:33 -!- RazvanM(~RazvanM@pool-173-67-51-185.bltmmd.east.verizon.net) has joined #tux3 2009-10-26 03:48 -!- pgquiles(~pgquiles@209.Red-79-148-70.dynamicIP.rima-tde.net) has joined #tux3 2009-10-26 03:49 -!- pgquiles_(~pgquiles@248.Red-79-152-102.dynamicIP.rima-tde.net) has joined #tux3 2009-10-26 06:08 -!- npmccallum(~npmccallu@76.177.102.115) has joined #tux3 2009-10-26 09:59 -!- cdk(~chinmayka@59.95.29.145) has joined #tux3 2009-10-26 10:24 -!- cdk_(~chinmayka@59.95.30.242) has joined #tux3 2009-10-26 10:26 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-26 13:50 -!- dcg(~dcg@10.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-10-26 15:45 -!- dcg_(~dcg@198.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-10-26 19:59 -!- flips(~phillips@phunq.net) has joined #tux3 2009-10-26 22:47 -!- RazvanM(~RazvanM@pool-173-67-51-185.bltmmd.east.verizon.net) has joined #tux3 2009-10-27 00:58 -!- RazvanM_(~RazvanM@pool-173-67-55-32.bltmmd.east.verizon.net) has joined #tux3 2009-10-27 09:41 -!- dcg(~dcg@179.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-10-27 11:39 -!- dcg(~dcg@29.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-10-27 12:38 -!- ajonat(~ajonat@190.48.124.91) has joined #tux3 2009-10-27 12:42 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-10-27 12:58 -!- dcg(~dcg@9.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-10-27 13:30 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-27 13:30 flipz, ping 2009-10-27 14:04 -!- bd_(~foo@2001:470:1f07:61f::feed:f00d) has joined #tux3 2009-10-27 14:04 -!- bd_(~foo@2001:470:1f07:61f::feed:f00d) has left #tux3 2009-10-27 15:36 -!- pgquiles(~pgquiles@248.Red-79-152-102.dynamicIP.rima-tde.net) has joined #tux3 2009-10-27 16:19 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-27 17:03 -!- yousef_(~yousef@helium.yousef.org) has joined 
#tux3 2009-10-27 17:14 -!- flips(~phillips@phunq.net) has joined #tux3 2009-10-27 17:50 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-27 20:33 -!- bd_(~foo@2001:470:1f07:61f::feed:f00d) has joined #tux3 2009-10-28 08:29 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-28 09:07 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-28 09:57 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-28 12:24 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-28 12:48 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-28 13:03 flipz ping 2009-10-28 19:25 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-28 20:11 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-28 22:16 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-28 22:23 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-29 04:22 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-29 07:17 -!- npmccallum(~npmccallu@76.177.102.115) has joined #tux3 2009-10-29 07:31 -!- tux3bot(~tux3bot@yzf.shapor.com) has joined #tux3 2009-10-29 07:31 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-10-29 07:42 -!- pixelbeat_(~padraig@84.203.137.218) has joined #tux3 2009-10-29 08:23 -!- bd_(~foo@2001:470:1f07:61f::feed:f00d) has joined #tux3 2009-10-29 10:04 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-29 13:16 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-29 14:07 -!- bh(~billh@ip68-107-19-169.sd.sd.cox.net) has joined #tux3 2009-10-29 14:12 -!- 
data(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-10-29 14:12 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-10-29 14:12 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-10-29 14:12 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-29 14:12 -!- pgquiles(~pgquiles@248.Red-79-152-102.dynamicIP.rima-tde.net) has joined #tux3 2009-10-29 14:12 -!- bd_(~foo@2001:470:1f07:61f::feed:f00d) has joined #tux3 2009-10-29 14:12 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-29 14:12 -!- shapor_(~shapor@yzf.shapor.com) has joined #tux3 2009-10-29 14:12 -!- yousef_(~yousef@helium.yousef.org) has joined #tux3 2009-10-29 14:12 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-29 14:12 -!- npmccallum(~npmccallu@76.177.102.115) has joined #tux3 2009-10-29 14:12 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-10-29 14:14 -!- shapor_(~shapor@yzf.shapor.com) has joined #tux3 2009-10-29 14:14 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-29 15:21 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-29 15:36 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-29 16:45 ah, I was just thinking about why apple zapped zfs 2009-10-29 16:45 other theories aside, I think they found it runs considerably slower than hfs 2009-10-29 16:46 not a feature that would endear it to mac heads 2009-10-29 16:57 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-29 17:23 hey 2009-10-29 17:24 flipz: volume recovery is near impossible if it gets corrupted 2009-10-29 17:24 since folks aren't using RAID typically for recovery 2009-10-29 18:07 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-29 21:55 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has 
joined #tux3 2009-10-29 21:56 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-29 22:52 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-30 05:59 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-10-30 07:00 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-30 07:07 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-30 07:43 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-30 11:31 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-30 11:38 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-30 12:22 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-30 15:23 -!- pgquiles(~pgquiles@222.Red-83-41-45.dynamicIP.rima-tde.net) has joined #tux3 2009-10-30 21:39 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-30 22:31 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-31 00:10 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-31 02:35 -!- RazvanM(~RazvanM@96.234.245.163) has joined #tux3 2009-10-31 05:38 -!- pgquiles(~pgquiles@222.Red-83-41-45.dynamicIP.rima-tde.net) has joined #tux3 2009-10-31 07:47 -!- npmccallum(~npmccallu@76.177.102.115) has joined #tux3 2009-10-31 07:56 -!- kspaans(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-10-31 08:48 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-31 09:00 -!- RazvanM(~RazvanM@pool-173-75-185-149.bltmmd.east.verizon.net) has joined #tux3 2009-10-31 09:00 -!- pgquiles(~pgquiles@222.Red-83-41-45.dynamicIP.rima-tde.net) has joined #tux3 2009-10-31 10:42 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-31 11:22 -!- 
timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-31 12:44 -!- npmccallum(~npmccallu@76.177.102.115) has joined #tux3 2009-10-31 15:42 -!- pgquiles(~pgquiles@222.Red-83-41-45.dynamicIP.rima-tde.net) has joined #tux3 2009-10-31 15:56 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-10-31 15:57 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-10-31 19:06 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-10-31 20:12 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-31 21:13 -!- bh(~billh@ip68-107-19-155.sd.sd.cox.net) has joined #tux3 2009-10-31 21:36 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-10-31 22:27 -!- bh(~billh@ip68-107-19-152.sd.sd.cox.net) has joined #tux3 2009-11-01 00:52 -!- RazvanM(~RazvanM@pool-173-75-185-149.bltmmd.east.verizon.net) has joined #tux3 2009-11-01 01:28 -!- flips(~phillips@phunq.net) has joined #tux3 2009-11-01 01:32 -!- RazvanM_(~RazvanM@pool-173-67-59-195.bltmmd.east.verizon.net) has joined #tux3 2009-11-01 02:04 -!- bd_(~foo@2001:470:1f07:61f::feed:f00d) has joined #tux3 2009-11-01 06:59 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-01 07:14 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-01 09:39 -!- pgquiles(~pgquiles@248.Red-79-152-102.dynamicIP.rima-tde.net) has joined #tux3 2009-11-01 09:41 -!- pgquiles(~pgquiles@248.Red-79-152-102.dynamicIP.rima-tde.net) has joined #tux3 2009-11-01 09:42 -!- pgquiles(~pgquiles@248.Red-79-152-102.dynamicIP.rima-tde.net) has joined #tux3 2009-11-01 09:43 -!- pgquiles(~pgquiles@248.Red-79-152-102.dynamicIP.rima-tde.net) has joined #tux3 2009-11-01 10:07 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-01 11:48 -!- npmccallum(~npmccallu@76.177.102.115) has joined #tux3 2009-11-01 13:31 -!- 
RazvanM(~RazvanM@pool-173-75-186-103.bltmmd.east.verizon.net) has joined #tux3 2009-11-01 15:18 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-11-01 15:27 -!- npmccallum(~npmccallu@76.177.102.115) has joined #tux3 2009-11-01 15:41 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-01 16:09 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-01 16:26 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-01 21:55 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-01 23:35 -!- RazvanM(~RazvanM@pool-173-75-186-103.bltmmd.east.verizon.net) has joined #tux3 2009-11-02 03:17 -!- pgquiles(~pgquiles@62.43.226.52) has joined #tux3 2009-11-02 03:35 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-11-02 07:49 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-02 08:29 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-02 09:08 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-02 09:09 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-11-02 09:12 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-11-02 09:30 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-02 09:39 -!- timothyhuber(~timothyhu@pool-71-104-223-10.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-02 10:39 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-11-02 11:26 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-02 11:57 -!- npmccallum(~npmccallu@cpe-76-177-102-115.natcky.res.rr.com) has joined #tux3 2009-11-02 16:41 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-02 17:15 hey timothy 2009-11-02 17:15 
hey 2009-11-02 17:15 http://www.google.com/hostednews/ap/article/ALeqM5iZyuOu30N1NSJEhA0clIAAucwRTgD9BNNGUO0 2009-11-02 17:16 I remember you telling me about this when it happened 2009-11-02 17:16 awesome news 2009-11-02 17:16 moral of the story is "never trust an SUV" 2009-11-02 17:16 I've been following that story 2009-11-02 17:16 especially a red one 2009-11-02 17:16 I think he was in a sedan of some type 2009-11-02 17:16 oh right 2009-11-02 17:16 cyclist went right through the rear window 2009-11-02 17:17 well, never trust an SUV just stands to reason anyway :) 2009-11-02 17:17 heh 2009-11-02 17:17 I don't trust anyone on the road 2009-11-02 17:17 except me of course 2009-11-02 17:22 crazy 2009-11-02 17:23 never trust anyone on the road, including yourself ;) 2009-11-02 17:24 never trust yourself, especially when you hit the road 2009-11-02 17:24 never trust the road 2009-11-02 17:24 especially when on a R1 2009-11-02 17:37 :) 2009-11-02 20:14 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-02 20:42 -!- tux3bot(~tux3bot@yzf.shapor.com) has joined #tux3 2009-11-02 20:44 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-11-02 23:50 hey flipz and company 2009-11-02 23:50 flipz: it was slashotted today that zfs got deduplication features 2009-11-03 00:10 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-11-03 00:22 -!- kspaans_(kspaans@artificial-flavours.csclub.uwaterloo.ca) has joined #tux3 2009-11-03 01:15 flipz: an overly large indirect block pointer is problematic for performance because of the limitations related to getting that stuff into memory for use right ? 
2009-11-03 01:53 -!- RazvanM(~RazvanM@pool-173-75-186-103.bltmmd.east.verizon.net) has joined #tux3 2009-11-03 06:00 -!- npmccallum(~npmccallu@76.177.102.115) has joined #tux3 2009-11-03 11:05 hi bh 2009-11-03 11:05 long block pointers reduce index fanout 2009-11-03 11:06 zfs designer seems to hope this effect can be counteracted by allocating fewer, larger extents 2009-11-03 11:07 natural fragmentation, especially with versioning/snapshotting is the fly in this ointment 2009-11-03 11:08 when fragmentation sets in, and it always does, small pointers are better 2009-11-03 11:09 zfs pointers are 128 bytes :-o 2009-11-03 11:20 bytes? 2009-11-03 11:46 true 2009-11-03 11:46 not bits 2009-11-03 11:48 really? 2009-11-03 11:48 thats... impressive 2009-11-03 11:49 so this is a pointer? http://friendfeed.com/tontcoles/0bdde4f8/3d-demo-in-128-bytes-yeah-and-with-no-help-from 2009-11-03 12:01 let's see... I'll give them 512 bits for the data hash, and we'll assume drive IDs and offsets within them are 8 bytes each. That gives 8 physical location entries in addition to the hash. 2009-11-03 12:01 seems understandable 2009-11-03 12:01 er, 4 physical location entries even, with those assumptions 2009-11-03 12:02 Now consider that in actuality, zfs can replicate to 5 or more locations in some cases, I believe (metadata affecting all volumes, when you have 3-way replicated data) 2009-11-03 12:02 and you start to see where they could get rid of 512 bits :) 2009-11-03 12:02 er, 128 bytes rather 2009-11-03 12:04 the bottom line is that zfs is slow ;) 2009-11-03 12:04 by empirical observation 2009-11-03 12:04 but has a lot of neat features 2009-11-03 12:04 like automatic dedup 2009-11-03 12:04 yes 2009-11-03 12:05 but try to explain to a mac head why their disk is now half the speed, but has shiny features 2009-11-03 12:07 sort of like "this horse drawn buggy has a built-in jacuzzi, do you like it more than your car?"
2009-11-03 12:08 opensolaris.org/os/community/zfs/docs/ondiskformat0822.pdf 2009-11-03 12:08 ah page 15 2009-11-03 12:09 haha theres 24 bytes of padding in blkptr_t 2009-11-03 12:10 brings the "disk is cheap" refrain to an entirely new level 2009-11-03 12:10 so how is tux3 going to get moving again 2009-11-03 12:11 theres obviously so much room for improvement in "state of the art" 2009-11-03 12:12 I suppose when I start coding again, we seem to be waiting for me to demo the atomic commit 2009-11-03 12:12 aren't you due some vacation soon? holidays? :) 2009-11-03 12:12 and now it is a holy mission to save the filesystem world from bloat and cruft 2009-11-03 12:13 I am, and my family needs some or most of it 2009-11-03 12:14 what I need to do is break the immediate logjam so some other artists can contribute 2009-11-03 12:16 shapor, the world apparently wants a tux3 wiki, do you think your server could stand the taint of a little php? 2009-11-03 12:18 incidentally, the zfs "on disk format" manual is the closest zfs gets to having design documentation 2009-11-03 12:18 any further tech info has to be scraped from blog posts and mail lists 2009-11-03 12:42 flipz: thanks 2009-11-03 12:42 got into an argument last night about why zfs was slower than other FSes 2009-11-03 12:44 there were discussions about this within NetApp regarding the metadata overhead that zfs has over other FSes 2009-11-03 12:45 it was felt that the amount of metadata zfs has affects the cache somehow 2009-11-03 12:48 can't remember the specific discussion, but it was a kind of tradeoff decision that they felt that ZFS didn't properly make 2009-11-03 13:10 well lots of metadata is certainly going to take up a lot of precious cache 2009-11-03 13:10 flipz: i dont know about php but i'm sure theres something i can do 2009-11-03 13:19 hey shapor 2009-11-03 13:32 hi bh 2009-11-03 13:46 shapor: yeah, was just talking about how slow ZFS is 2009-11-03 13:46 that's all 2009-11-03 14:29 -!-
timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-03 19:04 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-03 19:08 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-04 00:13 -!- pgquiles(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2009-11-04 04:06 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-11-04 06:47 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-04 07:25 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-04 07:31 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-04 09:42 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-04 10:14 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-04 10:40 -!- cdk(~chinmayka@59.95.23.77) has joined #tux3 2009-11-04 12:35 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-04 13:20 wow btrfs looks really similar to zfs 2009-11-04 13:20 http://btrfs.wiki.kernel.org/index.php/Btrfs_design 2009-11-04 13:24 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-04 18:27 -!- bd_(~foo@2001:470:1f07:61f::feed:f00d) has joined #tux3 2009-11-04 22:23 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-05 00:22 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-11-05 01:35 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-11-05 02:10 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-11-05 03:49 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-11-05 03:52 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-11-05 03:54 -!- kunir(~kunir@195.148.105.102) has joined #tux3 
2009-11-05 03:58 -!- kunir(~kunir@195.148.105.102) has joined #tux3 2009-11-05 07:33 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-11-05 07:34 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-05 07:47 good morning 2009-11-05 07:48 hi 2009-11-05 07:52 hi hirofumi, did you go to the kernel summit? 2009-11-05 07:52 hi 2009-11-05 07:52 no, i didn't 2009-11-05 07:53 I thought, so close to you 2009-11-05 07:53 yes 2009-11-05 07:53 probably 1 hour or so 2009-11-05 07:54 but, I can't hear english well :) 2009-11-05 07:54 :) 2009-11-05 07:55 it's ok, they don't actually speak english, they speak geek 2009-11-05 07:55 :) 2009-11-05 07:55 well, I'm recently thinking about mmap of tux3 2009-11-05 07:56 how implement it without complex or pain 2009-11-05 07:56 and implications of atomic commit? 2009-11-05 07:56 yes 2009-11-05 07:57 I have thought about it too, but it was a while ago, so it will take some time to recall the details 2009-11-05 07:57 oh, you found some way for it? 
2009-11-05 07:58 first obvious point is, only the data portion of regular files is affected 2009-11-05 07:58 yes 2009-11-05 08:00 next point, which may be wrong is, except for snapshots and sync point like msync, it is ok for the data content that arrives in the file to be slightly inconsistent 2009-11-05 08:01 so we don't have to worry about getting an exact point in time view of memory into the file at each commit 2009-11-05 08:01 maybe, it's not enough 2009-11-05 08:01 only at msync or new snapshot (when we have that) 2009-11-05 08:02 because, we make copy of data for write() 2009-11-05 08:02 so, I guess mmap writes data to old block 2009-11-05 08:02 copy of data is block-fork 2009-11-05 08:03 yes, oops 2009-11-05 08:03 yes 2009-11-05 08:03 it bothers me :) 2009-11-05 08:03 me too 2009-11-05 08:03 let me think a bit 2009-11-05 08:04 well, even if without mmap, it would be enough useful though 2009-11-05 08:05 one very crude solution is to disable data COW for mmapped file 2009-11-05 08:05 but, probably it can't use as rootfs 2009-11-05 08:05 not very satisfying 2009-11-05 08:05 yes 2009-11-05 08:06 well, disable cow may also be a bit hard to do 2009-11-05 08:06 messy, yes 2009-11-05 08:06 especially as only part of a file may be memmapped 2009-11-05 08:06 probably, disable mmap(2) may be easy 2009-11-05 08:07 also not nice 2009-11-05 08:07 yes 2009-11-05 08:08 well, so, I was thinking to wait I/O by page_mkwrite() 2009-11-05 08:08 well, the changed, memmapped data is on the memory page, we have control of where we write that 2009-11-05 08:09 if page is not under I/O? 2009-11-05 08:10 btw, I read googlefs pdf recently 2009-11-05 08:10 do you know about googlefs design? 
2009-11-05 08:11 a little 2009-11-05 08:11 i see 2009-11-05 08:11 it's a userspace, non-posix hack 2009-11-05 08:11 it was my first distributed fs which read design 2009-11-05 08:11 yes 2009-11-05 08:12 pretty crude actually, but served a need 2009-11-05 08:12 yes, really 2009-11-05 08:12 it wastes a lot of space, but that makes it simple 2009-11-05 08:12 simple is good 2009-11-05 08:13 yes, it was interesting 2009-11-05 08:13 simple, and it seems some jobs/guarantee moves to app 2009-11-05 08:14 yes 2009-11-05 08:14 well, so distributed fs seems interesting :) 2009-11-05 08:15 we have a very nice example in kernel already, ocfs2 2009-11-05 08:15 which is a lan-oriented dfs 2009-11-05 08:15 i see 2009-11-05 08:15 as opposed to googlefs, which is inet-oriented 2009-11-05 08:15 oh 2009-11-05 08:16 but there are many, interesting and tricky problems that come up even with a lan 2009-11-05 08:16 see "split brain" 2009-11-05 08:16 also, distributed locking and synchronization is a big problem, usually with horrible solutions 2009-11-05 08:17 oh, i see 2009-11-05 08:17 ocfs2 has a more simple solution to distributed locking than most dfs 2009-11-05 08:18 makes good use of udp broadcast 2009-11-05 08:18 ocfs2 is using dlm? 
2009-11-05 08:18 this idea should be developed further, but unfortunately, the developers seem to be distracted by other things 2009-11-05 08:18 yes, ocfs2 has its own dlm 2009-11-05 08:19 it seems, each distributed filesystem has its own dlm 2009-11-05 08:19 OCFS2_FS_USERSPACE_CLUSTER seems to use DLM(gfs2's) 2009-11-05 08:19 they were talking about doing that 2009-11-05 08:19 an unfortunate regression 2009-11-05 08:20 i see 2009-11-05 08:20 gfs2's dlm has had persistent reliability issues, and missed essential functionality like lock migration 2009-11-05 08:21 i see 2009-11-05 08:21 the simple udp dlm that ocfs2 started with is a much stronger base 2009-11-05 08:21 a tcp-based dlm will never approach the performance of a udp based dlm 2009-11-05 08:21 oh, i see 2009-11-05 08:22 which one is simple? 2009-11-05 08:22 also scalable? 2009-11-05 08:22 udp broadcast as used in the original ocfs2 dlm is far simpler than point-to-point tcp used in gfs2's dlm 2009-11-05 08:22 oh, good 2009-11-05 08:22 let's see what OCFS2_FS_USERSPACE_CLUSTER is 2009-11-05 08:23 distributed filesystems in general are supposed to offer many advantages, including reduced cost by using commodity servers instead of expensive, large servers like superdomes etc 2009-11-05 08:23 it seems fs/ocfs2/stack_user.c 2009-11-05 08:24 and reliability is supposed to be increased, but in practice it is worse 2009-11-05 08:24 worse? 2009-11-05 08:24 efficiency is supposed to increase, but last time I checked, gfs2's throughput scaled negatively with cluster size 2009-11-05 08:25 worse -> distributed filesystems tend to suffer from cluster bugs 2009-11-05 08:25 oh, i see 2009-11-05 08:26 distributed fs's simplify is very important?
2009-11-05 08:26 gfs2 in particular suffers from unexpected full-cluster reboots under load 2009-11-05 08:26 yes, simplicity is very important in a dfs 2009-11-05 08:26 i see 2009-11-05 08:27 because many new kinds of bugs are possible 2009-11-05 08:27 i see, interesting 2009-11-05 08:27 in practice, cluster filesystems have not been commercially successful 2009-11-05 08:27 there are a number of good attempts, for example ibrix 2009-11-05 08:28 on that view, googlefs sounds like right direction 2009-11-05 08:28 and lustre is an example of a very complex cluster fs that has been somewhat successful in government labs 2009-11-05 08:28 oh, great 2009-11-05 08:28 actually, there's a much better one than googlefs 2009-11-05 08:29 it's called... 2009-11-05 08:29 let me see 2009-11-05 08:29 where's shapor? :) 2009-11-05 08:29 :) 2009-11-05 08:30 after the googlefs, I saw elliptics and pohmelfs a little 2009-11-05 08:30 http://en.wikipedia.org/wiki/GlusterFS 2009-11-05 08:30 it seems simple 2009-11-05 08:30 glusterfs 2009-11-05 08:33 seems not so big 2009-11-05 08:33 rather it seems simple 2009-11-05 08:33 it's an active project with solid results 2009-11-05 08:34 oh 2009-11-05 08:37 it is one of good dfs? 
2009-11-05 08:37 good question 2009-11-05 08:37 I have not used it, just met the developers 2009-11-05 08:37 i see 2009-11-05 08:38 at SCALE, the local linux expo 2009-11-05 08:38 performance scalability is impressively linear 2009-11-05 08:38 if I start to learn dfs, I'd like to start with very simple dfs, but work enough 2009-11-05 08:39 oh, sounds great 2009-11-05 08:47 http://lkml.indiana.edu/hypermail/linux/kernel/0903.2/00916.html <- ocfs2 still maintains its own dlm, but seems to now support the gfs2 dlm as well 2009-11-05 08:49 ah, i see 2009-11-05 08:49 O2CB is using own dlm 2009-11-05 08:50 and userspace is using gfs2's dlm 2009-11-05 08:50 from help text, O2CB seems kernelspace clustering 2009-11-05 09:13 kernelspace clustering is necessary for a kernel based filesystem, otherwise deadlock 2009-11-05 09:14 or at least, deadlock unless severe restrictions are applied to the userspace code 2009-11-05 09:15 um.. 2009-11-05 09:15 what is clustering in here? 2009-11-05 09:16 protocol? 2009-11-05 09:16 clustering = dlm + clustermembership management 2009-11-05 09:16 i see 2009-11-05 09:17 sure 2009-11-05 09:17 so, if a node drops out of the cluster, kernel has to be able to handle that without involving userspace 2009-11-05 09:17 for example 2009-11-05 09:17 otherwise, userspace may try to allocate memory when the cluster filesystem is trying to flush memory => deadlock 2009-11-05 09:19 i see 2009-11-05 09:19 userspace must be on localfs in this case? 2009-11-05 09:19 so, clusterfs can return EIO or such 2009-11-05 09:26 hey folks 2009-11-05 09:26 hi 2009-11-05 09:26 hi 2009-11-05 09:27 hi 2009-11-05 09:27 what's been going on ? how's atomic commit going ? 
2009-11-05 09:27 in this case, deadlock can occur even if the userspace is on the local filesystem 2009-11-05 09:28 my example is a simple allocate-in-flush deadlock 2009-11-05 09:28 yes 2009-11-05 09:28 flipz: yeah, into a discussion about zfs not taking into account tradeoff issues with the size of the meta-data such as indirect blocks. Maybe I have the wrong term to describe it here 2009-11-05 09:29 but, clusterfs can detect timeout or fail? 2009-11-05 09:29 where it effects the cache 2009-11-05 09:30 bh: apparently btrfs made the exact same tradeoff (or just copied zfs) 2009-11-05 09:30 shapor: like how ? will that effect performance as well ? 2009-11-05 09:31 > 100 bytes block pointer i believe 2009-11-05 09:31 http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html 2009-11-05 09:31 Would something like the size of the indirect block pointer cause problems with read speed in that it'll lag getting all of the relevant bits into memory so that the data block can be reached ? 2009-11-05 09:32 i dont think so 2009-11-05 09:32 ok 2009-11-05 09:32 then I misinterpreted something in the discussion then 2009-11-05 09:32 but its just much less cache-friendly to the system 2009-11-05 09:32 fs metadata gets accessed frequently on many systems 2009-11-05 09:33 cache is only so big 2009-11-05 09:33 there was a discussion in the WAFL group between a former ZFS engineer and another engineer about how the size of the indirect block pointer could cause problems with caching but I don't know 2009-11-05 09:33 the specifics 2009-11-05 09:33 could be, i dont know that much about it 2009-11-05 09:34 because of the granularity of the object itself in the cache ? 2009-11-05 09:35 ACTION has been sick but feels just about normal today 2009-11-05 09:36 u8 fsid[16]; 2009-11-05 09:36 that seems a bit crazy 2009-11-05 09:36 16 bytes ? 
2009-11-05 09:37 in every block pointer 2009-11-05 09:37 I always thought that halving the amount of indirect block pointers in an indirect block was crazy 2009-11-05 09:38 but I guess if that reduces fanout that sort of might work 2009-11-05 09:39 i think it would be pretty easy to benchmark the effects of the size of the block pointer 2009-11-05 09:39 just comment out the completely useless fields 2009-11-05 09:39 reformat and test I guess 2009-11-05 09:39 yeah 2009-11-05 09:40 I was under the impression that it would be 2009-11-05 09:40 but maybe I misunderstood the conversation 2009-11-05 09:42 i would think it is important enough to even do things like intelligent variable length encoding of fields such as size 2009-11-05 09:42 like http://archives.free.net.ph/message/20090706.153931.b6351354.en.html 2009-11-05 09:43 what's that for ? 2009-11-05 10:00 shapor, agreed about the fsid 2009-11-05 10:00 smacks of committee design 2009-11-05 10:01 I bet zillions of things will break if you change the size of the zfs block pointer 2009-11-05 10:02 I guess it is for raid 2009-11-05 10:02 if so, those may be compared with fs+soft-raid 2009-11-05 10:04 shapor, I didn't think btrfs block pointers are as big as 100 bytes, how do you figure that? 2009-11-05 10:06 btw, if raid blocksize is same with fs blocksize, I guess overhead might be big 2009-11-05 10:07 on btrfs/zfs 2009-11-05 10:07 hirofumi, yes, for redundant storage of metadata blocks, but I do not think that they needed to make every block pointer 128 bytes just to store metadata redundantly 2009-11-05 10:08 yes, probably 2009-11-05 10:08 I have thought about alternative ways that tux3 could use 2009-11-05 10:08 I guess some bytes for each 1M or so on raid 2009-11-05 10:09 use? 
2009-11-05 10:10 some improved interface between filesystem and block layer that would allow the filesystem to have finer control of raid layout 2009-11-05 10:10 that's what I mean by "use" 2009-11-05 10:10 ah, yes 2009-11-05 10:10 lvm stuff controlled by fs sounds good 2009-11-05 10:11 so for example, if the filesystem could specify level of redundancy per block range, it could define certain regions of blocks with higher redundancy, and store metadata there 2009-11-05 10:12 the effect of this would be to require only a single pointer to metadata, and still have multiple copies of metadata blocks 2009-11-05 10:12 yes 2009-11-05 10:13 I guess some (trade off) problems may be on there though 2009-11-05 10:13 I mean 2009-11-05 10:13 yes 2009-11-05 10:13 block layer also provide atomicity by own? 2009-11-05 10:14 flipz: in that link? 2009-11-05 10:14 shapor, I guess I just didn't read closely enough 2009-11-05 10:14 unless i understand wrong, struct btrfs_header no? 2009-11-05 10:14 struct btrfs_header { u8 csum[32]; u8 fsid[16]; __le64 blocknr; __le64 generation; __le64 owner; __le32 nritems; __le16 flags; u8 level; 2009-11-05 10:14 I thought the header is only once per block 2009-11-05 10:15 same in zfs no?
2009-11-05 10:15 hirofumi, my thinking is, it is good to try to separate block layer from filesystem cleanly 2009-11-05 10:16 but the current interface to block layer does not allow much more than "read/write these blocks to this location" 2009-11-05 10:16 yes, it's good side 2009-11-05 10:16 flipz: yes very much so 2009-11-05 10:17 I am thinking, maybe it would be nice to say "read/write these blocks to this location with this geometry" 2009-11-05 10:17 in reality the vast majority of filesystems in the world are on exactly 1 device 2009-11-05 10:17 shapor, yes 2009-11-05 10:18 that case has to work really well 2009-11-05 10:18 especially, on desktop 2009-11-05 10:18 true words, those 2009-11-05 10:18 many servers too 2009-11-05 10:18 lots of hardware raid1 setups 2009-11-05 10:19 probably more than software raid1 2009-11-05 10:20 well, fs needs to tell some things to block layer for same advantage though 2009-11-05 10:20 e.g. fs would need to teach freespace region to block layer (like trim command) 2009-11-05 10:21 yes it seems like the interface should be simple and clean 2009-11-05 10:21 not fully integrated 2009-11-05 10:21 "i no longer care about this block(s)" 2009-11-05 10:21 yes 2009-11-05 10:21 "this is metadata" 2009-11-05 10:21 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-05 10:22 "this probably won't ever get read again" 2009-11-05 10:22 oh 2009-11-05 10:22 what is it? 2009-11-05 10:22 "won't ever get read" 2009-11-05 10:23 something you may want to put on slower media 2009-11-05 10:23 i see 2009-11-05 10:23 whaaa- actual discussion of tux3 or am I late joining another conversation? 
2009-11-05 10:23 hi guys 2009-11-05 10:24 hi 2009-11-05 10:24 could be useful if you had a tiered storage block device 2009-11-05 10:24 or wanted to implement one 2009-11-05 10:24 with ssds, fast disk, and big slow disk 2009-11-05 10:24 ah 2009-11-05 10:25 I was thinking it would be controled by fs 2009-11-05 10:25 and big memory footprints :-) 2009-11-05 10:25 its possible 2009-11-05 10:26 I was not imaging to push it to block layer 2009-11-05 10:26 i guess its a question of where to draw the line 2009-11-05 10:26 yes 2009-11-05 10:26 filesystems already seem so complex to me 2009-11-05 10:27 yes 2009-11-05 10:27 hirofumi, yes it is essential for the block layer API to include information about what is freespace 2009-11-05 10:27 yes 2009-11-05 10:27 i think everyone agrees on that :) 2009-11-05 10:28 and I was playing to push up something to userland 2009-11-05 10:28 slight increase in API traffic in some cases, to tell the block layer "this is free" 2009-11-05 10:28 s/playing/thinking/ 2009-11-05 10:28 hirofumi: howso? 2009-11-05 10:29 well, I recently read googlefs 2009-11-05 10:29 it seems to provide few guarantee 2009-11-05 10:30 app needs to handle undefine region by own 2009-11-05 10:30 but, fs is simple, and app would handle it effiently 2009-11-05 10:31 well, probaby I can't explain those well 2009-11-05 10:31 :) 2009-11-05 10:31 it's just possibility 2009-11-05 10:32 indeed it makes a lot of sense to push a lot of responsibility to the application 2009-11-05 10:32 you really dont need a filesystem at all 2009-11-05 10:32 sure 2009-11-05 10:32 except that application coders can't be bothered with all that detail, so another layer is developed and inserted in the stack 2009-11-05 10:33 http://evlan.org/applications/filesystem/ 2009-11-05 10:33 yes 2009-11-05 10:33 this process eventually results in many layers, and the whole tower eventually topples over 2009-11-05 10:34 flipz: if the language you programmed in was a bit smarter... 
you wouldnt need one 2009-11-05 10:34 if app is enough important and big, maybe app layer can be done those efficiently and simplely 2009-11-05 10:34 it would be easier in fact 2009-11-05 10:35 open(), read(), ... if error... 2009-11-05 10:35 that crap is written a lot 2009-11-05 10:35 hirofumi, exactly, like a search engine perhaps 2009-11-05 10:35 its actually *harder* to use a filesystem ... i think 2009-11-05 10:35 however, this breaks down with smaller apps 2009-11-05 10:38 yes, it seems to mean the generic os/fs is hard 2009-11-05 10:39 a generic filesystem almost seems like a mistake for anything other than usb sticks or cd's 2009-11-05 10:40 things you wish to pass around and share 2009-11-05 10:41 flipz: i disagree about it breaking down with smaller apps 2009-11-05 10:46 shapor, you mean coders of small apps should just program directly to the googlefs api? 2009-11-05 10:47 well not the googlefs one 2009-11-05 10:47 but sure why not program against a better api than posix 2009-11-05 10:51 shapor, what specifically is troublesome in posix? 
2009-11-05 10:51 time to sleep 2009-11-05 10:51 oyasumi, and see you 2009-11-05 11:20 flipz: not something wrong with it, its just designed around a paradigm that is very free form 2009-11-05 11:20 with generic layers that probably get in the way of efficiency 2009-11-05 12:21 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-05 12:22 well the file access part of posix is actually very lightweight 2009-11-05 12:23 it's the consistency rules that really matter 2009-11-05 12:24 so, the main idea of googlefs is to relax the posix consistency rules 2009-11-05 12:57 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-11-05 13:57 -!- npmccallum(~npmccallu@72-255-53-208.client.stsn.net) has joined #tux3 2009-11-05 13:59 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-05 14:34 -!- data(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-11-05 14:37 flipz: if we do that, it's ike going backwards in time which will blow up the universe right ? 
2009-11-05 15:28 bh, either that or make you billions of dollars 2009-11-05 15:28 which is what it did for larry and sergey 2009-11-05 15:41 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-05 15:56 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-05 16:14 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-11-05 19:22 -!- npmccallum(~npmccallu@72-255-53-208.client.stsn.net) has joined #tux3 2009-11-05 19:52 -!- npmccallum(~npmccallu@72-255-53-208.client.stsn.net) has joined #tux3 2009-11-05 20:36 -!- edt(~Ed@dsl-62-144.aei.ca) has joined #tux3 2009-11-05 23:34 -!- RazvanM(~RazvanM@pool-173-75-186-103.bltmmd.east.verizon.net) has joined #tux3 2009-11-06 02:24 -!- data(~data@ns.nbi33627.nbiserv.de) has joined #tux3 2009-11-06 03:17 -!- tuxirclogreader(548f26df@webchat.mibbit.com) has joined #tux3 2009-11-06 03:17 hi 2009-11-06 03:19 anybody available? 2009-11-06 03:21 I'm looking for some news on the tux3 page. 
It would be nice to be informed about the state of the project 2009-11-06 06:46 -!- cdk(~chinmayka@59.95.39.19) has joined #tux3 2009-11-06 12:07 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-11-06 12:30 -!- pgquiles(~pgquiles@143.Red-81-44-62.dynamicIP.rima-tde.net) has joined #tux3 2009-11-06 16:09 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2009-11-06 16:55 -!- ajonat(~ajonat@190.48.101.203) has joined #tux3 2009-11-06 21:00 -!- flips(~phillips@phunq.net) has joined #tux3 2009-11-06 21:01 -!- flipz(~daniel@phunq.net) has joined #tux3 2009-11-07 00:49 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-11-07 01:22 -!- RazvanM(~RazvanM@pool-173-75-186-103.bltmmd.east.verizon.net) has joined #tux3 2009-11-07 16:59 -!- ajonat(~ajonat@190.48.99.184) has joined #tux3 2009-11-08 06:31 -!- pgquiles(~pgquiles@143.Red-81-44-62.dynamicIP.rima-tde.net) has joined #tux3 2009-11-08 07:15 -!- edt(~Ed@dsl-216-221-39-50.aei.ca) has joined #tux3 2009-11-08 09:58 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-11-08 12:28 -!- pgquiles(~pgquiles@248.Red-79-152-102.dynamicIP.rima-tde.net) has joined #tux3 2009-11-08 12:29 -!- pgquiles(~pgquiles@248.Red-79-152-102.dynamicIP.rima-tde.net) has joined #tux3 2009-11-08 12:30 -!- pgquiles(~pgquiles@248.Red-79-152-102.dynamicIP.rima-tde.net) has joined #tux3 2009-11-08 13:19 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-11-08 13:44 -!- ajonat(~ajonat@190.48.97.32) has joined #tux3 2009-11-08 14:49 -!- dcg(~dcg@109.pool80-103-1.dynamic.orange.es) has joined #tux3 2009-11-08 14:56 -!- dcg_(~dcg@10.pool80-103-0.dynamic.orange.es) has joined #tux3 2009-11-08 16:59 -!- ajonat(~ajonat@190.48.97.32) has joined #tux3 2009-11-08 20:48 -!- bd__(~foo@satoko.is.fushizen.net) has joined #tux3 2009-11-08 20:51 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-08 23:05 -!- bd_(~foo@satoko.is.fushizen.net) has joined #tux3 2009-11-08 23:05 
-!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-11-09 00:07 -!- RazvanM(~RazvanM@96.234.238.132) has joined #tux3 2009-11-09 06:08 -!- npmccallum(~npmccallu@76.177.102.115) has joined #tux3 2009-11-09 07:24 Hirofumi, there? 2009-11-09 07:24 hi 2009-11-09 07:25 Hi, I think I have the mmap issues sorted out 2009-11-09 07:26 oh, great 2009-11-09 07:26 what does it handle? 2009-11-09 07:26 first thing is, except for msync and new snapshot, it is ok if we do not know exactly which commit a memory write will show up in 2009-11-09 07:26 umm.. 2009-11-09 07:27 why? 2009-11-09 07:27 the reason for this is, the user did not explicitly sync 2009-11-09 07:27 it's an expected race 2009-11-09 07:27 I think race is critical 2009-11-09 07:27 just like writing from two different tasks to the same location without locks 2009-11-09 07:27 our synchronizer in this case is msync 2009-11-09 07:28 and we also support some other consistency gaurantees like read after write 2009-11-09 07:28 but, we do cow when background is writing page 2009-11-09 07:28 i.e. 2009-11-09 07:28 page is under I/O 2009-11-09 07:29 frontend does cow for this page to write 2009-11-09 07:29 and userland is mapping this page too 2009-11-09 07:30 in this case, frontend can't cow, because userland is mapping this page 2009-11-09 07:30 ? 2009-11-09 07:31 well, we can use the rmap list to change all users, in the case we change the cache page 2009-11-09 07:31 as in fork 2009-11-09 07:31 yes 2009-11-09 07:31 but, it might be high cost 2009-11-09 07:32 we only do that for pages that are actually forked 2009-11-09 07:32 it should be rare 2009-11-09 07:32 um... 
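For readers less familiar with the userland side of this, here is a minimal sketch of the pattern being discussed: a memory write through a shared mapping, followed by an explicit msync as the synchronization point. It uses Python's standard `mmap` module, whose `flush()` maps to msync on Unix. The helper name `write_via_mmap_then_read` is made up for illustration, and nothing here is tux3-specific.

```python
import mmap
import tempfile


def write_via_mmap_then_read(payload: bytes) -> bytes:
    """Write payload through a shared mapping, msync it, then read it back
    through the ordinary file read path."""
    with tempfile.TemporaryFile() as f:
        # Back the mapping with one page of zeros.
        f.write(b"\0" * mmap.PAGESIZE)
        f.flush()
        with mmap.mmap(f.fileno(), mmap.PAGESIZE) as m:
            m[:len(payload)] = payload   # memory write: no syscall, no explicit commit point
            m.flush()                    # msync: the explicit synchronizer
        f.seek(0)
        return f.read(len(payload))      # read path sees the data via the page cache


print(write_via_mmap_then_read(b"hello"))
```

Between the memory write and the `flush()` call the application has made no request about when the data reaches disk, which is the window in which the log argues it is acceptable not to know which commit the write lands in.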
2009-11-09 07:32 not sure 2009-11-09 07:33 it is a point to consider carefully 2009-11-09 07:33 anyway, there are some more details 2009-11-09 07:33 well, sync all userland which is using I/O page, would be racey and deadlockable 2009-11-09 07:34 that is the next issue 2009-11-09 07:34 ok 2009-11-09 07:34 so when we have an msync or new snapshot, then we care about which commit a mmap write affects 2009-11-09 07:35 in that case, we must copy-protect any mmap page that is in flight 2009-11-09 07:36 this should be a small number of pages 2009-11-09 07:37 all I/O page, not msync or snapshot? 2009-11-09 07:38 I think we only need to do this for msync or snapshot 2009-11-09 07:38 um... 2009-11-09 07:38 userland can mmap page under I/O? 2009-11-09 07:39 yes 2009-11-09 07:39 so, we should care about it too? 2009-11-09 07:40 any page we replace in cache has to be updated in all user page tables 2009-11-09 07:40 yes 2009-11-09 07:40 and it is harder point 2009-11-09 07:40 this gives a point in time snapshot of the page, even for mmap writes 2009-11-09 07:42 indeed 2009-11-09 07:42 so we are sure that 1) the version of the page before the fork will be written to disk and 2) an exact copy of the written page is now mapped, to which further changes will be made 2009-11-09 07:43 so, the key idea here is that we do allow a race in the case there is no msync or snapshot 2009-11-09 07:43 um... 2009-11-09 07:44 the race is a small one: we are not sure whether the old or new version of a mmapped page will be in a given commit 2009-11-09 07:44 what happend if there is the race 2009-11-09 07:44 this race will never cause loss of data 2009-11-09 07:44 um... 2009-11-09 07:45 process (A) reads new page, and same time process (B) reads old page? 2009-11-09 07:45 on same file and same offset 2009-11-09 07:46 how could that happen? 
We always have the current copy of the page in page cache 2009-11-09 07:47 this only affects what we see on disk if we crash and restart 2009-11-09 07:47 but, mmap can referer the old page via PTE? 2009-11-09 07:48 you mean, the memory page, or the image of the page on disk? 2009-11-09 07:48 I meant page cache 2009-11-09 07:48 i.e. memory page 2009-11-09 07:48 pte is not supposed to be able to refer to the old page 2009-11-09 07:49 um... 2009-11-09 07:49 let me explain what I'm thinking 2009-11-09 07:49 process A) does mmap 2009-11-09 07:50 suppose it maps page A1 2009-11-09 07:50 process B) writes data to A1 2009-11-09 07:50 in here, A) and B) still see same data 2009-11-09 07:50 after that, tux3 start to write A1 to disk 2009-11-09 07:51 and B) writes data to A1 again 2009-11-09 07:51 in this case, tux3 will cow A1, because it's under I/O 2009-11-09 07:52 it makes the copy of A1' from A1 2009-11-09 07:52 but, process A) is still mapping A1 2009-11-09 07:53 what do you think? 2009-11-09 07:54 I think we need to set the page read-only before doing the cow, we know we have to do that because we can see that at least one process has it memmapped 2009-11-09 07:54 to close the race where A can change the old page data after we do our cow 2009-11-09 07:54 yes, and I noticed more problem now 2009-11-09 07:55 :) 2009-11-09 07:55 process A) can read A1 without page fault 2009-11-09 07:55 after B) makes A1' 2009-11-09 07:56 ah, we set the page invalid instead of read-only 2009-11-09 07:56 yes 2009-11-09 07:57 so, any I/O and mmap read can be cause of page fault 2009-11-09 07:57 ok, here come the deadlocks :) 2009-11-09 07:58 :) 2009-11-09 07:58 well, I'm thinking tux3 design is usefull without mmap support though 2009-11-09 07:59 as backend of storage service 2009-11-09 07:59 well, mmap still seems supportable 2009-11-09 08:00 or if file was mmaped, we disable page-fork for that file? 
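The A1 / A1' scenario above can be sketched as a toy model. This is not tux3 code and not the kernel's data structures: `Page`, `PageCache`, and the `mappers` dictionary standing in for page table entries are all hypothetical names chosen for illustration. The model follows the conclusion reached in the log, that a fork must invalidate the mappers' PTEs rather than merely write-protect them, so that a later memory read faults and picks up the new page.

```python
class Page:
    def __init__(self, data):
        self.data = bytearray(data)
        self.under_io = False


class PageCache:
    """Toy model of one file offset: the current cache page plus its mappers."""

    def __init__(self, data):
        self.current = Page(data)
        self.mappers = {}   # process name -> Page its PTE points at (None = invalid PTE)
        self.faults = 0

    def mmap(self, proc):
        self.mappers[proc] = self.current

    def start_writeback(self):
        """Backend begins writing the current page; return the page in flight."""
        self.current.under_io = True
        return self.current

    def vfs_write(self, offset, payload):
        if self.current.under_io:
            self._fork()
        self.current.data[offset:offset + len(payload)] = payload

    def _fork(self):
        old = self.current
        self.current = Page(old.data)        # A1' starts as an exact copy of A1
        for proc, page in self.mappers.items():
            if page is old:
                self.mappers[proc] = None    # invalidate, do not just write-protect

    def mem_read(self, proc):
        if self.mappers[proc] is None:       # page fault: remap to the current page
            self.faults += 1
            self.mappers[proc] = self.current
        return bytes(self.mappers[proc].data)


cache = PageCache(b"aaaa")
cache.mmap("A")                       # process A maps page A1
cache.vfs_write(0, b"b")              # process B writes; no I/O yet, same page
in_flight = cache.start_writeback()   # tux3 starts writing A1 to disk
cache.vfs_write(1, b"c")              # B writes again: A1 is under I/O, so fork to A1'
print(bytes(in_flight.data), cache.mem_read("A"), cache.faults)
```

In this model the page in flight keeps its pre-fork contents, and process A takes exactly one fault before seeing the forked page. The real cost discussed in the log (walking the mappings, locking, TLB flushes) is not modeled at all.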
2009-11-09 08:00 the point is, the case where we require page table tricks are rare, for example, we do not need to write-protect every mmapped page on every commit 2009-11-09 08:00 yes 2009-11-09 08:01 yes, we cam disable atomic data update for mmapped files if we hit difficult problems 2009-11-09 08:01 this is not so bad 2009-11-09 08:01 however, mmap and msync style file write would be expencive 2009-11-09 08:01 yes, probably 2009-11-09 08:01 and maybe just wait I/O 2009-11-09 08:02 what is the expensive part? 2009-11-09 08:02 many page fault 2009-11-09 08:02 maybe, e.g. userland write-ahead logging 2009-11-09 08:03 with mmap and msync 2009-11-09 08:03 but only for pages that are actually under IO in a given commit 2009-11-09 08:03 msync has to walk all dirty pages even for normal filesystems 2009-11-09 08:04 yes, so, userland have page fault for each msync? 2009-11-09 08:04 I don't think so, only in the case where we have both a vfs write and a memory write to the same page 2009-11-09 08:05 and then we only have a fault if there is a second memory write 2009-11-09 08:05 ah 2009-11-09 08:05 or memory read 2009-11-09 08:05 otherwise, we just have the expense of flushing the tlb 2009-11-09 08:05 yes, read too 2009-11-09 08:05 I would have missed the point about the read ;) 2009-11-09 08:06 ok, yes 2009-11-09 08:07 probably, vfs write for mmaped page is expencive 2009-11-09 08:07 only 2009-11-09 08:07 back to the deadlocks, I think the main source of deadlock would be the block library, i.e., block_write_full_page, and I hope we will not be using that 2009-11-09 08:08 I guess backend would be simple 2009-11-09 08:08 let me think for a moment about the cost of vfs write to a mmapped page 2009-11-09 08:08 and I think all of those are handled on frontend 2009-11-09 08:10 ok, the expensive case is where we have a vfs write to a mmapped page _dirty in the previous delta_ 2009-11-09 08:10 yes, I think so too 2009-11-09 08:11 I would like mmap to work :) 2009-11-09 08:11 
well, of course :) 2009-11-09 08:12 the mmap consistency requirements for snapshots are similar to the cross-node consistency requirements for a cluster filesystem 2009-11-09 08:13 i see 2009-11-09 08:14 I think I'd like to think more about mmap to solve more simple way 2009-11-09 08:15 the point about cluster filesystem is, similar page table tricks were required to solve a similar problem 2009-11-09 08:15 walking mmaped process might be not so good 2009-11-09 08:16 it's not too bad because of the rmap list 2009-11-09 08:16 i see 2009-11-09 08:16 probably 2009-11-09 08:16 but, it might need mmap_sem or something 2009-11-09 08:16 yes 2009-11-09 08:17 mmap_sem seems to be contensive 2009-11-09 08:17 contentious 2009-11-09 08:17 I think contensive is a great new word :) 2009-11-09 08:18 :) 2009-11-09 08:18 "contended" is the usual word 2009-11-09 08:18 i see :) 2009-11-09 08:20 let me see, mmap_sem is per-mm ? 2009-11-09 08:22 yes, it's per-mm 2009-11-09 08:23 https://bugzilla.redhat.com/show_bug.cgi?id=435734 <- mmap_sem already has big latency issues 2009-11-09 08:23 I'm not sure we actually have to take mmap_sem 2009-11-09 08:24 um 2009-11-09 08:25 me too 2009-11-09 09:02 I think mmap_sem protects changes to vmas, but we will only chang a pte for an existing vma 2009-11-09 09:04 umm... 2009-11-09 09:04 not sure though, page fault takes down_read(mmap_sem) 2009-11-09 09:04 #to change a pte we just need the mm->page_table_lock spinlock 2009-11-09 09:05 page fault may add a new vma, so it meeds mmap_sem 2009-11-09 09:05 please check my facts :) 2009-11-09 09:05 it has been a while since my last mm patch 2009-11-09 09:09 it looks like mmap_sem used to protect both the ptes and vmas, a long time ago 2009-11-09 09:09 http://mail.nl.linux.org/linux-mm/1999-06/msg00063.html 2009-11-09 09:10 probably, i_mmap_lock and ptl 2009-11-09 09:11 maybe, similar to clear_page_dirty_for_io() 2009-11-09 09:18 why do you think i_mmap_lock is needed? 
2009-11-09 09:19 just because clear_page_dirty_for_io() does 2009-11-09 09:19 a good hint 2009-11-09 09:20 it search vma for specified page 2009-11-09 09:21 btw, rmap is used for MAP_SHARED page? 2009-11-09 09:21 maybe because of dirty page accounting 2009-11-09 09:22 rmap should be used for any mapped page including shared, however details of this changeed a few times so I need to re-study 2009-11-09 09:25 rmap may not be used for map_shared, because it is possible to find the mapping by searching the mapping->i_mmap_writable list 2009-11-09 09:25 which used to be i_mmap_shared I think 2009-11-09 09:26 this needs to be checked 2009-11-09 09:27 http://lkml.indiana.edu/hypermail/linux/kernel/0405.0/0641.html <- hugh got rid of i_mmap_shared, looks like 2009-11-09 09:27 and i_mmap_writable is something new 2009-11-09 09:28 linux locking is lots of fun, because the rules are never written down 2009-11-09 09:28 the rules are not written down, partly because nobody is really sure what they are 2009-11-09 09:35 page_mkclean_file() seems to do it what we want to 2009-11-09 09:36 it search all vmas which maps i_mapping of interest file 2009-11-09 09:37 ah, ok, so we search vmas now instead of having an rmap list 2009-11-09 09:38 yes, probably 2009-11-09 09:39 so that is why the i_mmap_list is used 2009-11-09 09:39 um 2009-11-09 09:39 i_mmap_lock 2009-11-09 09:39 yes 2009-11-09 09:40 it seems to lock vma's tree 2009-11-09 09:40 for that inode 2009-11-09 09:40 yes 2009-11-09 09:40 this only handles file-backed pages, which are the only ones we care about 2009-11-09 09:40 yes 2009-11-09 09:42 however, it has race with page is dirtied 2009-11-09 09:42 I'm not sure whether we need to care it or not 2009-11-09 09:42 and we have to invalidate, not just make the page clean 2009-11-09 09:43 but the locking and searching is what we need 2009-11-09 09:44 we have to invalidate so that a future memory read will read from the correct page, as you pointed out 2009-11-09 09:45 yes, 
probably 2009-11-09 09:46 maybe, one of solutions 2009-11-09 09:47 invalidating a page range is a common operation, there is probably a function that does what we want 2009-11-09 09:48 probably 2009-11-09 09:48 but, it might be for syscall 2009-11-09 09:48 mprotect 2009-11-09 09:49 also needed internally for cluster filesystems 2009-11-09 09:49 oh 2009-11-09 09:51 migrate.c is another good place to look 2009-11-09 09:53 maybe remove_file_migration_ptes 2009-11-09 09:53 what we need is very similar to page migration 2009-11-09 09:54 even migrate_page_copy 2009-11-09 09:59 migrate_page_copy seems to just move radix tree 2009-11-09 10:00 remove_file_migration_ptes seems to do 2009-11-09 10:00 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-09 10:35 right, it just invalidates the ptes and lets fault bring in the new ones 2009-11-09 10:36 migrate can also fail if it can't find all the references to a page 2009-11-09 10:37 yes 2009-11-09 10:49 if filesystem is trying to migrate a cache page it should know all the references 2009-11-09 10:51 I coded the initial version of page fork based on the migration code, probably it would be better to use the real thing 2009-11-09 10:52 sounds good 2009-11-09 10:53 I also did it without mmap stuff 2009-11-09 10:53 probably, it should share with migrate code 2009-11-09 10:54 seems like the right thing to do 2009-11-09 10:55 update the migration code if necessary 2009-11-09 10:55 yes, if possible 2009-11-09 10:57 btw, do you familir to floating point? 2009-11-09 10:57 yes 2009-11-09 10:57 somewhat :) 2009-11-09 10:58 :) 2009-11-09 10:58 intel fpu has stack to calc 2009-11-09 10:59 it's normalized floating value? 
2009-11-09 10:59 yes, it seemed like a good idea at the time 2009-11-09 10:59 internal precision is ten bytes, I think this is normalized 2009-11-09 11:00 it is certainly normalized when stored as 8 or 4 byte float 2009-11-09 11:00 unnormalized values are used when very close to zero, this is covered by ieee standard I think 2009-11-09 11:01 also, even intel's 10 byte fp format was added to ieee standard 2009-11-09 11:02 yes 2009-11-09 11:04 I can't seem to find proof of that just now 2009-11-09 11:05 it has sign,exponent,integer,fraction ? 2009-11-09 11:05 yes, the same format as other ieee forms but wider exponent and fraction 2009-11-09 11:06 so, what is integer in here? 2009-11-09 11:06 the 64 bit fraction is also used for exact integer calculations on the fpu 2009-11-09 11:07 I seem to recall there is a fpu flags bit that tells you when the result of a calculation is not exact 2009-11-09 11:07 integer means normalized or denormalized format? 2009-11-09 11:08 when stored, it will be normalized 2009-11-09 11:08 you never get to see the internal form 2009-11-09 11:08 except if you decode a register save area 2009-11-09 11:08 now, I'm playing with fpu registers 2009-11-09 11:09 :) 2009-11-09 11:09 it's strange, isn't it 2009-11-09 11:10 well, actually, I'm debugging bogus behavior 2009-11-09 11:10 userland is using fpu 2009-11-09 11:11 so, I'd like to know what does in fpu registers snapshot 2009-11-09 11:11 0x0000 401c ee6b 2800 0000 0000 2009-11-09 11:11 I think you can see denormals in the register snapshot 2009-11-09 11:11 oh 2009-11-09 11:11 40 is a typical exponent for values near zero 2009-11-09 11:12 fpu's stack has above value 2009-11-09 11:12 well 2009-11-09 11:12 the exponent should be in the highest addressed byte, so exponent of zero which is a denormal 2009-11-09 11:13 sorry, very large negative exponent 2009-11-09 11:13 negative? 2009-11-09 11:14 yes, because of exponent bias 2009-11-09 11:15 exponet is 401c? 
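One way to test a reading of the value quoted above (0x0000 401c ee6b 2800 0000 0000) is to decode it by hand. The sketch below assumes the digits are printed most significant first, so that 0x0000 is padding, 0x401c is the 16-bit sign/exponent word, and ee6b 2800 0000 0000 is the 64-bit significand with its explicit integer bit. That byte order is an assumption made for this example, not something the log confirms at this point. The function name `decode_x87_extended` is ours.

```python
from fractions import Fraction


def decode_x87_extended(sign_exp, significand):
    """Decode an 80-bit x87 extended value from its 16-bit sign/exponent word
    and 64-bit significand (explicit integer bit at bit 63)."""
    sign = -1 if sign_exp & 0x8000 else 1
    biased = sign_exp & 0x7FFF
    if biased == 0x7FFF:
        raise ValueError("infinity or NaN, not handled in this sketch")
    # A biased exponent of 0 marks a denormal; it scales like biased == 1.
    exponent = (biased if biased else 1) - 16383
    # The significand is an integer scaled by 2**-63.
    return sign * Fraction(significand) * Fraction(2) ** (exponent - 63)


# Assumed split of the quoted value: exponent word 0x401c, significand 0xee6b280000000000.
value = decode_x87_extended(0x401C, 0xEE6B280000000000)
print(value)
```

Under that assumed byte order the biased exponent is nonzero and the integer bit is set, so the value decodes as an ordinary normalized number. A different byte order would give a different answer, which is why the layout of the save area matters for the debugging question.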
2009-11-09 11:15 the whole thing is stored backwards, so that would be part of the fraction 2009-11-09 11:18 http://en.wikipedia.org/wiki/Subnormal_number <- seen this? 2009-11-09 11:18 I was right the first time, exponent of zero is a denormal 2009-11-09 11:19 ah 2009-11-09 11:19 the value is 2009-11-09 11:19 0x0000401cee6b280000000000 2009-11-09 11:19 in this format 2009-11-09 11:19 0x0, 0xee6b2800, 0x401c, 0x0 2009-11-09 11:19 that is, biased exponent of zero 2009-11-09 11:19 actual dump is this 2009-11-09 11:19 it's part of unsigned long st_space[32] 2009-11-09 11:23 struct _fpreg_ia32 st_space[8]; 2009-11-09 11:23 why st_space[32] ? 2009-11-09 11:24 ah, ia32_user_fxsr_struct 2009-11-09 11:24 -!- bd__(~foo@2001:470:1f07:61f::feed:f00d) has joined #tux3 2009-11-09 11:25 ok, so 16 bytes per 10 byte long double 2009-11-09 11:31 finding a definition of the fsave save area format is not easy 2009-11-09 11:32 very few people ever look at this 2009-11-09 11:34 http://ragestorm.net/downloads/387intel.txt 2009-11-09 11:36 the registers are storeed in 10 byte fields 2009-11-09 11:38 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2009-11-09 11:45 hey 2009-11-09 11:45 hi 2009-11-09 11:45 how's it going ? 2009-11-09 11:46 well, I'm finding new company recently 2009-11-09 11:48 already found or still finding? 2009-11-09 11:48 hirofumi: where were you working before ? 2009-11-09 11:48 it's tough economic times currently 2009-11-09 11:48 still finding 2009-11-09 11:48 well, started very recently 2009-11-09 11:48 I could probably still get a job but the kinds of jobs out there aren't really about development 2009-11-09 11:48 before working for miracle-linux 2009-11-09 11:50 well, currently, working as part-time job for miracle-linux 2009-11-09 11:52 finding job as regular employee 2009-11-09 11:53 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2009-11-09 11:53 hirofumi: are you in Japan ? 
2009-11-09 11:53 yes 2009-11-09 11:54 how's the Linux industry there ? 2009-11-09 11:54 I'm not aware of any Japanese companies there that do serious Linux stuff 2009-11-09 11:54 not so many 2009-11-09 11:55 well, there are NEC, FUJITSU, Hitachi or such 2009-11-09 11:55 those are using and supporting Linux 2009-11-09 11:55 however, would not be Linux company 2009-11-09 11:56 and they would not need me :) 2009-11-09 11:57 well many companies need linux kernel developers, that are not really linux companies 2009-11-09 11:57 yes, probably 2009-11-09 11:57 hugh dickens works for veritas, which is part of semantec for example 2009-11-09 11:58 and he seems to have more freedom than most developers working for linux companies 2009-11-09 11:58 but, interesting companies seem to be in foregin contries 2009-11-09 11:58 most folks that I know move to silicon valley or work remotely for some American company 2009-11-09 11:58 that are in the industry 2009-11-09 11:58 hugh sounds like leaved veritas 2009-11-09 11:59 eh ? he left ? 2009-11-09 11:59 yes 2009-11-09 11:59 he's very anti -rt patch as I remember 2009-11-09 11:59 from news site of JLS 2009-11-09 12:00 well, if I remember correctly :) 2009-11-09 12:00 I thought that his background was the VM system ? 2009-11-09 12:00 there is always somebody needing linux filesystem developers 2009-11-09 12:01 I don't know much about him 2009-11-09 12:01 flipz: not that I know of 2009-11-09 12:01 however, probably yes 2009-11-09 12:02 filesystem or userland filesystem or such are interesting 2009-11-09 12:02 well, still not sure what company is good 2009-11-09 12:36 ubuntu? 
2009-11-09 12:36 interesting 2009-11-09 12:36 but, it would not be in japan 2009-11-09 12:45 I'm not sure what their policy is on remote, I think it is pretty good 2009-11-09 12:46 i see 2009-11-09 12:47 I'll try to see ubuntu 2009-11-09 12:47 at least, check web site 2009-11-09 12:47 btw, mostly time to sleep, now 5:47am 2009-11-09 12:47 :) 2009-11-09 12:51 oyasumi 2009-11-09 12:52 oyasumi, and see you 2009-11-09 14:06 hirofumi: it would be a great if you can work for them remotely 2009-11-09 14:06 they apparently have a cloud data product that they're pushing 2009-11-09 16:36 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-09 17:32 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-09 17:55 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-09 20:02 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3 2009-11-09 21:00 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-09 21:05 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-09 21:46 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3 2009-11-09 21:48 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-09 22:30 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-10 00:14 -!- RazvanM(~RazvanM@96.234.238.132) has joined #tux3 2009-11-10 06:05 -!- npmccallum(~npmccallu@76.177.102.115) has joined #tux3 2009-11-10 07:06 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-10 07:25 -!- timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w.verizon.net) has joined #tux3 2009-11-10 09:06 -!- kedars(~kedars@socks.wantstofly.org) has joined #tux3 2009-11-10 09:10 -!- 
timothyhuber(~timothyhu@pool-71-107-53-233.lsanca.dsl-w