Large-scale MC production: M. Moulson
Frequently Asked Questions (FAQ)
31-Mar-2004
=============================================================

-------------------------------------------------------------
Q. How do I submit MC jobs using the new mechanism?
-------------------------------------------------------------

A. The offline group is responsible for the production of MC jobs
using the new machinery. If you are the offline expert and need help
(or if you are just interested), see the following web page:

   http://www.lnf.infn.it/kloe/private/mc/mcinst.txt

Individuals cannot submit these jobs, but any requests will be
considered. Forward your request to M. Moulson or C. Bloise and it
will be discussed at the next offline meeting.

Note that the new production mechanism is not well suited to the
production of small jobs (e.g., rare signal samples). The old MC
procedure is still active for this purpose.

-------------------------------------------------------------
Q. How do I interface with the MC output for analysis?
-------------------------------------------------------------

A. One significant feature of the new MC production machinery is the
production of MC DST's. MC DST's are very similar to data DST's. For
the time being, there are 5 MC DST streams:

   stream_id   stream_code   analogous to
   ---------   -----------   ------------
      61           mkc           dkc
      62           mk0           dk0
      63           m3p           d3p
      64           mrn           drn
      65           mrc           drc

There is now a new protocol for accessing MC DST's, called dbmcdst.
It uses the logger.dtr_mcs_data view. The SQL query that you use from
A/C with "input url" can use any of the fields in this view. You can
obtain the list of field names with

   dbonl fields logger.dtr_mcs_data

Here are some examples:

!-- all KSKL MC DST's for 2002 data, any card
input url "dbmcdst:run_nr>23400 and dtr_stream_id=62"
!-- or alternately...
input url "dbmcdst:run_nr>23400 and dtr_stream_code='mk0'"

!-- all neutral rad MC DST's for the stated run range in 2001 data, any card
input url "dbmcdst:run_nr between 19000 and 21000 and dtr_stream_code='mrn'"

!-- same as above but only for files generated with the all_phys card
input url "dbmcdst:run_nr between 19000 and 21000 and dtr_stream_code='mrn' and mc_mccard_id=2"
!-- or alternately...
input url "dbmcdst:run_nr between 19000 and 21000 and dtr_stream_code='mrn' and mc_mccard_code='all_phys'"

!-- all files generated with eps_ppg in run range, any DST type
input url "dbmcdst:mc_mccard_code='eps_ppg' and run_nr between 20010 and 20020"

!-- specified by filename
input url "dbmcdst:dtr_filename='mrc_20235_eps_ppg_17.dst'"

The command kls has also been updated to use the new protocol:

!-- all neutral rad MC DST's for the stated run range in 2001 data, all_phys card
kls mcdst "run_nr between 19000 and 21000 and dtr_stream_code='mrn' and mc_mccard_code='all_phys'"

-------------------------------------------------------------
Q. What modules do I have to run when analyzing MC DST's?
-------------------------------------------------------------

A. Only your analysis code, plus whatever it requires. In many cases,
EMCDBINI and DCDBINI are good to put in your path, but it really
depends on what you're doing in your analysis. There are no modules
that are always required. In particular, the MC DST's don't contain
any data encoded with the SQZ library, so you don't need KBKMDD.
(The .mcr files, however, do require KBKMDD to be read.)

-------------------------------------------------------------
Q. How is the streaming handled for MC DST's?
-------------------------------------------------------------

A. In the old reconstructed MC files (.mcr), the event classification
decision was recorded in ECLS/ECLO, but not enforced. This allowed
studies of event classification efficiency. The same is true for the
new .mcr files.

However, we now have MC DST's, which are streamed. MC DST's are
streamed on the union of the event type in MC truth and the event
classification decision. In other words, an mk0 DST will contain all
events generated as KS/KL events, plus any events reconstructed and
classified as KS/KL events. So, if you can do your analysis from a
single stream in data, you should be able to do your MC analysis from
the corresponding MC DST stream, including event classification
efficiencies. (The TSKT and FILFO decisions are similarly recorded
but not enforced.)
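Pictured as code, the union rule amounts to the following. This is a
purely illustrative sketch: the function and argument names are
hypothetical, not the actual datarec code.

      LOGICAL FUNCTION KEEP0(MCKSKL, ECLKSK)
*     Illustrative union rule for the mk0 stream: keep the event if
*     it was generated as KS/KL in MC truth OR classified as KS/KL
*     by event classification. Names are hypothetical.
      LOGICAL MCKSKL, ECLKSK
      KEEP0 = MCKSKL .OR. ECLKSK
      END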
Note that, exactly as for data, additional stream-specific algorithms
are run for MC events in the DST-making phase. A single .mcr is
written for each MC run. This is then analyzed by four separate
datarec processes to make the DST's. For mkc, the retracking
performed for charged kaons is applied (obviously, only for those
events recognized as K+/K-; events that are K+/K- in MC truth but not
recognized as such are passed along but do not receive the special
treatment). For mk0, the t0 step 1 algorithm is applied for KL tag
events. Radiative events are divided into the charged and neutral
streams during the DST phase. For neutral radiatives, the t0 step 1
algorithm is applied.

-------------------------------------------------------------
Q. Do I have to worry about overlapping events in MC DST's?
-------------------------------------------------------------

A. Yes, if you have to worry about them when analyzing data.
Otherwise, no. For example, if you look at more than one stream when
analyzing data, there may be overlaps. In MC DST's, the part of the
streaming based on MC truth should be completely orthogonal, so there
is really no difference from data in this respect.

-------------------------------------------------------------
Q. Is there any documentation on the MC DST banks?
-------------------------------------------------------------

A. I have a partially written memo describing the formats of all
banks used in DST's: regular DST's, kpm DST's, and MC DST's.
Finishing this memo is a priority for me (although I do have a lot of
priorities these days...). Anyway, it is on the way.

In the meantime, please keep in mind that the routines in the TLS
library that fill the structures used in PROD2NTU have all been
updated to transparently read either full reconstructed files or DST
files, and this is true for both MC and data. In certain cases
(particularly for the trigger), the information in the DST's does not
allow the structures to be completely filled. However, these routines
should return whatever information is available. Note that this means
that 1) PROD2NTU should work out of the box for MC DST's, and 2) the
PROD2NTU subroutines can be used by people doing analysis in their
own code, without any detailed knowledge of the underlying bank
structure.

-------------------------------------------------------------
Q. What improvements in the simulation have been implemented?
-------------------------------------------------------------

A. Too many to list here. Documentation is on the way. In the
meantime, for the full story, consult the extensive list of
presentations and meeting summaries at

   http://www.lnf.infn.it/kloe/private/mc/pres

Of particular note is the short list below of things that vary
run-by-run in the campaign.

-------------------------------------------------------------
Q. What changes run-by-run in the simulation?
-------------------------------------------------------------

A.
-. Inserted background in both the EmC and DC
-. sqrt(s)
-. The production cross section (sometimes; see below)
-. Mean phi momentum (z component)
-. Position of the luminous region
-. Extent of the luminous region in x and z
-. Beam energy spread (in 3 large run groups)
-. Dead and hot DC wires
-. Trigger thresholds (in large run groups)

-------------------------------------------------------------
Q. What doesn't change run-by-run in the simulation that I should
know about?
-------------------------------------------------------------

A. The production cross section, sometimes. For continuum channels
(such as eps_ppg), a reference cross section is used. Note that the
phi BR's change with sqrt(s) in any case: that is part of
M. Antonelli's new phi generator. So, if we're simulating a run with
sqrt(s) = 1018.0 MeV, the MC output contains a larger fraction of
rho-pi events than it does when we're simulating a run with
sqrt(s) = 1019.5 MeV.

-------------------------------------------------------------
Q. How is the number of events to generate determined?
-------------------------------------------------------------

A. The number of events generated when an individual run is simulated
is calculated using the VLAB luminosity for that run and the cross
section for the card that governs the generation. For the cross
section, either a reference value is used (for example, for continuum
processes or other situations where we don't have a good
parameterization of the energy dependence of the cross section
ready), or the cross section is parameterized as a function of
sqrt(s). At the moment, fits to KLOE data are used to obtain the
energy-dependent cross sections for the all_phys and neu_kaon cards.

The actual number of events generated is then obtained by multiplying
by the luminosity scale factor (LSF) for the campaign. For example,
in the all_phys running, we use an LSF of 1:5. The number of MC
events generated for the 430 pb-1 2001-2002 data set (counting good
data only) then corresponds to the number expected for 86 pb-1 of
data. Of course, these events are distributed in run space in the
same way as the events in data are.
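To make the arithmetic concrete, here is a hedged sketch of the
calculation. The effective 620 nb (cross section times the 1:5 LSF
for all_phys) is the value quoted in the background-recycling answer
below; the luminosity value and variable names are just illustrative.

      PROGRAM NGCALC
*     Sketch of the event-count arithmetic described above.
      REAL LUMI, XSEFF
      INTEGER NGEN
*     VLAB luminosity of the run, in nb-1 (1000 nb-1 = 1 pb-1)
      LUMI  = 1000.0
*     sigma(all_phys, sqrt(s)) x (1/5 LSF), in nb (see below)
      XSEFF = 620.0
      NGEN  = NINT(LUMI*XSEFF)
      PRINT *, 'Events to generate for this run:', NGEN
      END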
-------------------------------------------------------------
Q. What is the significance of the MC run number? How does it differ
from the run number simulated?
-------------------------------------------------------------

A. The significance of the MC run number has not changed. The MC run
number (mcrun_nr in most DB tables) is just a progressive index of
the MC runs for each MC card. For technical reasons, it turned out to
be easiest to keep it this way.

-------------------------------------------------------------
Q. Where are the MC and data run numbers in the YBOS file?
-------------------------------------------------------------

A. The MC run number is in the LRID bank. As such, it is loaded into
the jobsta common by A/C. (This means that it is the MC run number
that A/C shows you in the status report when you process an MC DST.)

The run number being simulated is in a new bank called BRID. The BRID
bank is actually a copy of the LRID bank from the inserted background
event, and is present in every event. BRID is identical in format to
LRID. It is not loaded by A/C, however. You can BLOCAT BRID and
obtain the simulated run number at zero offset from inddat in the IW
array.

Note that BRID is only present in EVENT records! This is because in
principle the source of inserted events can change during an MC run
(an MC run could in theory correspond to more than one physical run,
though we're not currently doing things that way). This means that,
if you want the simulated run number for your analysis, you have to
look for it every event, NOT in a Begin Run routine. Of course, if
you use the simulated run number to access HepDB for the boost, etc.,
read the run number from BRID every event, but make the database
calls only when it actually changes.
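A minimal sketch of that per-event access pattern follows. It assumes
the standard YBOS BLOCAT calling sequence and that NDATA = 0 signals
a missing bank; the routine name, the include file, and the HepDB
refresh hook are illustrative.

      SUBROUTINE GETSIM(IRUNS)
*     Read the simulated run number from BRID every event, as
*     described above. Sketch only: the include file name and the
*     bank-not-found convention are assumptions.
      IMPLICIT NONE
#include "bcs.inc"
      INTEGER IRUNS, NDATA, INDDAT, LASTRN
      SAVE LASTRN
      DATA LASTRN /-1/
      CALL BLOCAT(IW, 'BRID', 0, NDATA, INDDAT)
      IF (NDATA .GT. 0) THEN
*        Simulated run number at zero offset from inddat (see above)
         IRUNS = IW(INDDAT)
         IF (IRUNS .NE. LASTRN) THEN
*           Run changed: refresh HepDB quantities (boost, etc.) here
            LASTRN = IRUNS
         END IF
      END IF
      END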
-------------------------------------------------------------
Q. What do we do with the reconstructed MC files?
-------------------------------------------------------------

A. The maximum number of events for a given MC run is determined by
the desire to keep the reconstructed MC file (.mcr file) size below
1 GB, and is established empirically for each card. A given run in
data can therefore require more than one MC run to simulate.

Background events are available for each raw data file taken. When
the generation of a run has to be split up, it is split up into
groups of the raw files of which the run consists. Each group of raw
files is simulated separately, as far as background insertion is
concerned. The exact mechanism is complex, but it ensures that the
background spectra in the sum of the .mcr files for any simulated
physical run faithfully reproduce the background spectra in the data
for that run. The .mcr files are recombined by physical run number
when MC DST's are made.

The reconstructed MC files are tagged by their MC run number and are
archived. If you really need to use them, you can see them with kls
or kid using the dbmc protocol. It will take some SQL gymnastics to
get the right combination to fully reproduce each run and/or to get
the correspondence to physical run numbers.

Note that the .mcr files in principle contain everything necessary
for re-reconstruction. All of the information exists to allow the
background to be removed and the pristine .mco file to be recovered.
This allows us to reconstruct with no background, or with a different
source of background, if need be.

-------------------------------------------------------------
Q. What about the .mco files?
-------------------------------------------------------------

A. They're toast. See above; we don't need 'em. By the way, we never
kept them before, either.

-------------------------------------------------------------
Q. How is the background obtained?
-------------------------------------------------------------

A. This will be documented more extensively in the future. In brief,
the background is obtained from gamma-gamma events. Since these
events are neutral, all DC hits in these events are considered
background. The calorimeter clusters not identified as belonging to
the gamma-gamma topology are considered background. The separation is
imperfect in the calorimeter, so event downscaling techniques are
used to make sure that the energy and polar angle distributions of
the clusters selected as background correspond to the background
only. The timing distribution of background clusters has been shown
to be well reproduced.

The histograms used for the statistical separation of background
clusters from gamma-gamma fragments and additional clusters from
3-gamma events are made in groups of 2 to 5 pb-1, where the group
boundaries are chosen on the basis of the accidental rate measured in
cosmic events as recorded by physmon.

-------------------------------------------------------------
Q. Are background events recycled?
-------------------------------------------------------------

A. Yes. The cross section for an identified gamma-gamma event that
yields background is roughly 40 nb. So, in all_phys running at 1:5
scale (CS = 620 nb), each background event will be used in 15.5
simulated events, on average.

For each raw data file in the data set, we make a background (bgg)
file. An MC run corresponds to a set of raw files in data. To the
extent that the cross section for an extractable bgg event is
constant, if we insert all of the bgg events for the set of raw files
into the MC run (recycling each by the appropriate reuse factor),
then we reproduce the time dependence of the background in the data.

The reuse factor is a non-integer value. The number of times to reuse
each background event is obtained by a remainder series (sketched
below). If the reuse factor is 15.537, slightly more than half of the
background events will be used 16 times and the rest will be used 15
times. Reconstruction of a file stops when either the MC events run
out or the background does. When the reuse factor is calculated
correctly, at most a few MC events are lost because the background
runs out.
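For concreteness, here is one standard way to realize such a
remainder series. It is a sketch only, not the actual insert module;
the caller keeps the accumulator ACC, initialized to zero, across
calls.

      INTEGER FUNCTION NREUSE(F, ACC)
*     Remainder-series sketch: each background event is reused
*     INT(F) or INT(F)+1 times, with the fractional parts
*     accumulated so that the average reuse is exactly F. For
*     F = 15.537, about 53.7% of events are used 16 times and the
*     rest 15 times.
      REAL F, ACC
      NREUSE = INT(F)
      ACC = ACC + (F - INT(F))
      IF (ACC .GE. 1.0) THEN
         NREUSE = NREUSE + 1
         ACC = ACC - 1.0
      END IF
      END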
(Note: an earlier version of the insert module used a pseudorandom
sequence rather than the remainder series. On average, with that
method, slightly fewer events are reconstructed than are generated.
This technique was used for the eps_ppg campaign.)

-------------------------------------------------------------
Q. What if I want to study the background itself?
-------------------------------------------------------------

A. Any distribution that you make that includes measured quantities
from background clusters or hits will obviously show the fluctuations
from the underlying background statistics. This is an unavoidable
consequence of the fact that the cross section for gamma-gamma events
with isolable background is relatively small, which necessitates the
recycling of background events for high-statistics MC production. In
many cases--such as if you're studying events of a particular type
that pass a series of selection cuts--this probably won't matter.

The obvious exception, as pointed out to me by C. Gatti, is if you're
studying the background itself. In this case, you may want to exploit
the fact that when the background is recycled, it is recycled into
contiguous events, at least for now. So, if this is a crucial concern
for you, you can test on the run and/or event number of the inserted
event (in the BRID bank) and fill your histograms and/or Ntuples only
when the inserted event changes (see the sketch below).
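A hedged sketch of that guard follows. The run number sits at zero
offset from inddat, as stated above; that the event number sits at
offset 1 (mirroring LRID) is my ASSUMPTION, so check it before
relying on it.

      LOGICAL FUNCTION NEWBG()
*     Return .TRUE. only when the inserted background event changes,
*     so histograms are filled once per underlying bgg event.
*     ASSUMPTION: event number at offset 1 from inddat in BRID.
      IMPLICIT NONE
#include "bcs.inc"
      INTEGER NDATA, INDDAT, IRUN, IEV, LRUN, LEV
      SAVE LRUN, LEV
      DATA LRUN, LEV /-1, -1/
      NEWBG = .FALSE.
      CALL BLOCAT(IW, 'BRID', 0, NDATA, INDDAT)
      IF (NDATA .LE. 0) RETURN
      IRUN = IW(INDDAT)
      IEV  = IW(INDDAT+1)
      IF (IRUN .NE. LRUN .OR. IEV .NE. LEV) THEN
         NEWBG = .TRUE.
         LRUN = IRUN
         LEV  = IEV
      END IF
      END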
Note that in principle, if we are simulating a frequent process at a
high LSF, then it may take more than one MC run to span a single raw
file in data. In this case, all insertions of the same event are not
guaranteed to be contiguous in the MC output files. We are not
running this way for now, and probably won't be for a while. But it
can't be excluded that we will at some point, so if you have a doubt,
ask me.

-------------------------------------------------------------
Q. How is the background inserted?
-------------------------------------------------------------

A. This will also be documented more extensively in the future.
Roughly speaking, the hits in both the EmC and DC are inserted,
preserving their timing relative to the t0 of the gamma-gamma event
from which they were extracted. This happens before the t0 smearing.
The simulated and background hits are then smeared together using the
standard t0 smearing algorithm. Background hits preceding simulated
hits will clobber the simulated hits, and vice versa. Background hits
with negative times are inserted with their negative times preserved.
In the DC, a background hit with negative time will cause the
simulated hit to be removed from the event.
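In other words, for a single wire or cell the rule reduces to keeping
the earlier hit. A minimal illustrative sketch (not the actual
insertion code):

      REAL FUNCTION THIT(TSIM, TBG)
*     Clobbering rule for one DC wire, per the text above: the
*     earlier of a simulated and a background hit survives, so a
*     background hit at negative time masks the simulated hit.
      REAL TSIM, TBG
      THIT = MIN(TSIM, TBG)
      END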