NIMH Data Archive (NDA) Data Validation & Submission Webinar
Malcolm Jackson: My name is Malcolm Jackson and I'll be giving you this presentation today. You should currently be able to hear me and see my screen.
I am showing the NIMH Data Archive home page.
And today, we’re going to be discussing how to validate and submit your data into the NIMH data archive as part of this current ongoing submission period.
Now, our standard submission periods are either January 15th or July 15th if you're an NIMH awardee, or April 1st or October 1st if you're an NIAAA awardee. So, at the top here, the first thing I'd like to show you is some of the resources that are available on our website as a reference tool for you going forward, our contact information, and so on.
Then we will go into some of the assumptions of this presentation, some of the prerequisites to submitting data that, at this point, if you were planning on submitting during the current period, will need to have already been taken care of. And, if not, it should be taken care of as soon as possible. And then, I will move on to a demonstration of the actual tool that is used to upload data to the system, and to validate your data against the standards as they were defined by you in the NIMH data archive’s data dictionary, and then we'll discuss a little bit of what happens after you upload data.
So, as I said, first let's take a look at the web pages we have that provide written information about this that you can reference.
So, the contribute data section here on our homepage is where you can find all our documentation on the expected data standards, the approach to harmonization, as well as information on post submission, quality assurance checks, these automated checks we run after your data are uploaded. And then the header here of the section provides a good overview of the overall process, including how you plan for and execute a data submission.
In addition to that, we have a page with information on the tools. As I mentioned before, the validation and upload tool is the key tool that you will be using to validate and upload your data. This is an HTML-based tool. Essentially, that just means it's a webpage; you’re using your browser.
There’s also a version of the tool that's a Python package you can use to upload data if you prefer that, but the basic or default workflow that most people will be using is going to involve the HTML validation tool. So, we're going to cover that in particular today. We also have past webinars and video tutorials you can reference under the Webinars and Tutorials section.
So, this webinar, the Data Validation and Submission webinar, is actually the third in a series of four webinars that we provide. The first is an overall orientation webinar for just getting started. Some of you may have attended that or seen a recording of it.
The next one is visible on this page.
Now, that covers a lot of the startup steps we can briefly look at as part of the prerequisite section I mentioned. The second webinar concerns data harmonization. That covers, in general, what is covered in the data standards and data harmonization approach pages here. A lot of that information is on the website.
And then, it goes over how to start preparing your data and making it consistent with your data dictionary in NDA and how to prepare your data for submission. And of course, this webinar will go over the details of how to use the tool to upload data.
And then, the fourth webinar is on how to access data. Now, that refers to accessing data for secondary analysis and for your own research. That may not necessarily be applicable to those of you on this presentation. But we, of course, encourage anyone to re-use data, or to look into re-using data as part of a project. So, when that’s scheduled, that may also be of use. And then, of course, very general information is available under our About Us page, including our contact information.
Now, I want to mention this at the top because we put a lot of resources into being available to provide active support throughout the process.
So, if you have any questions at all, or any feedback, or any uncertainty whatsoever about any of this, don't hesitate at all to e-mail us at NDAHelp@mail.nih.gov or to call us on the phone at 301-443-3265. We're available during all normal business hours on the East Coast. And as far as questions during this presentation are concerned, again, I encourage you to reach out to us directly at this e-mail address or this phone number. If you type your question into the GoToWebinar chat window during the presentation, we will try to go over all of those together at the end. We will end the webinar at 3 PM, so if there are questions we don't get to, you can send them to NDAHelp@mail.nih.gov or call us at 301-443-3265.
Now with that, let's go ahead and move into the first part of our presentation I mentioned, which is the prerequisites or assumptions about your projects. When your project gets started, there are two things that we ask you for in the introductory e-mails you receive. Those are primarily going to be the data submission agreement, which I'm going to open here and go over very quickly, and also the data expected list, which we'll take a look at in a moment. So, here's our data submission agreement; this is a PDF file.
Now, if you have not completed one of these, one of these is required per project. It needs to be filled out and signed by the PI, which is who the submitter refers to, and by an authorized Institutional Business Official. Now, that's someone at your institution who has been assigned an SO role, which stands for signing official, in NIH's eRA Commons system. So, if you have not turned in one of these, you won't even have access to the system to submit data or to begin preparing your submission. So hopefully, if you intend to submit data in the near future, you have already sent in a submission agreement for your project. And just to make a quick note there, one is required per project, so not everyone involved needs to submit their own; only the PI needs to submit one, and that covers the entire grant throughout the entire duration of that grant's effective dates.
But if one has not been completed yet, I recommend that you do that as soon as possible. That's generally expected within six months of award. The second thing that is expected within six months of the award is the data expected list in your Collection.
In terms of the overall process, the data submission agreement is what gets things started. And once we receive that, the principal investigator is given ownership of an NDA Collection.
Now, a Collection is this virtual container on the website into which all of the data associated with your project will be uploaded. So, after you've turned in the agreement, this is turned over to you, and then you're able to use this to start getting set up. Now, that primarily involves creating the data expected list.
Now, before I go into this in a little bit of detail, you might be wondering what page this is. Now, I would expect many of you will have already seen your Collection if we're at this stage of submitting data. But this is what a Collection looks like.
In order to navigate to your Collection, from our webpage, from anywhere on the website, you can click My Dashboard, and this will prompt you to log in.
I'm already logged in, so it's just going to show my personal information here on my profile.
And then, I also have these other dashboards, including the Collection's Dashboard, which will display all of the Collections that I have permissions in as part of my submission credentials.
So, you can see here's my demonstration Collection.
This is the Collection I was just in, and if I open this in a new tab, this is my Collection, 2913; you’ll have your own Collection number with all of your grant information listed.
This is just a shell for demonstration, so it doesn't have grant information, but yours will and you can navigate to it in this way.
If you are working on a grant, and you don't see that Collection listed on your dashboard, like you'd expect it to, that means that you have not been assigned privileges in the Collection yet.
So, the permissions tab that I'm looking at now will become visible if you're logged in as a user with permissions.
Initially, the only user with those privileges will be the principal investigator. And they get those privileges when the data submission agreement is received. So, you can see it's all coming together, and they will basically need to come in here and delegate access to others. So, if you have privileges in the Collection at all, it will appear on your dashboard. If you don't, then the principal investigator or someone who already has administrator privileges will need to come in and assign you access. You can see there's some dummy information here for these other fake users as well as myself.
And administrator privileges are what will be necessary to edit users or to edit the data expected list, which I will get back to in just a second. Submission privileges without admin will be sufficient for anyone who's just going to be uploading data and doesn't need the ability to assign access to people.
And it's important to note that anyone with access to the Collection like this will have access to any research data you upload into it.
When you upload the data to the Collection, initially, only those people in that permissions tab, NDA staff, and of course the NIH program staff will have access to it. It's definitely not made available to the general research community at that point. So, the data expected list, which at this point should already be completed, needs to be done as soon as possible if it is not. It is basically a list of all the data you're collecting, and it's going to be tied to data structures in the data dictionary. Now, this is gone over in detail in the data harmonization webinar, the preceding webinar in the series. So, if you have no idea at all what I'm talking about and you've never seen this before, the thing to do is go watch a recording of that webinar and then go into your Collection and set this up.
If you have not started this, it will probably create a delay in your submission because, as covered in another webinar, this is basically the method through which you ensure the data that you are preparing to deposit are consistent with defined data dictionary structures, which all data are required to have.
So, you can't just upload data in whatever structure or format you happen to have on hand at the moment. It needs to be harmonized to a definition in the system.
The tool will validate it against that definition, which you create in conjunction with NDA staff, in this case the data curators.
And that does take time; it takes time to set that up. So, if you haven't started that, you may not be able to submit it until it's completed. And there may be a backlog of that, so I would anticipate a delay in that case.
So those are the things that, if they are not done already, should be resolved urgently. And to finish up with what is a prerequisite, let's take a look at the GUID tool quickly. Now, GUIDs are these alphanumeric IDs, and they correspond to an individual subject or participant in your research project. They will need to have been created already, or you can create them now using this tool, which I've just downloaded from the Tools menu. So, hopefully, you will do that. I'm going to log in with my normal NDA credentials. This does require Java, version 8. And you can see it's this basic data entry interface. This is also gone over in detail in the data harmonization webinar. So, I would, again, refer you to that.
TECHNICAL DIFFICULTY: SKIP TO 20:10
Hello, can everyone hear me now? Yes, I think we are good. OK, thank you, Tracy. Thank you for bearing with us. Sorry about that, everyone; it seems we did have one of these network earthquakes I was mentioning at the top, so thank you for bearing with us! So, it sounds like you can see my screen and hear me again, so let’s get back into it.
So, as I was saying:
The GUID tool, which you can see right here on my screen, is the Java-based program that you need to download onto your computer and then run and enter all of the personal information for your research participants to get a GUID. You can also use this to create pseudoGUIDs. I'm just going to do that now so you can generally see what GUIDs look like. That is what you would use as the IDs or the primary keys for your participants as you're uploading data.
Again, this is covered more in the data harmonization webinar, but I do like to mention these things in this webinar to make sure that no one missed any of the earlier steps as they're getting ready to try to upload so you don't get to the tool and wonder why it's not letting you deposit the data.
So, now let's move on to how you actually upload data.
So first, let's take a look at the data dictionary. And let’s also open up.
We're going to our data dictionary here, where all our data structures are located.
So, this is a good resource to have open. We'll get into this in a little bit. And let's also take a look at our validation tool.
So, in addition to the Tools menu, on our web page, you can access the tool directly here. This is what it looks like. In terms of the anatomy of this tool, here's our introduction page. You can either start a new submission. You can also resume an incomplete submission. So, if you're in the middle of submitting data and your network shuts completely off, just like what mine just did, you can come back and resume it. You don't necessarily have to start over from scratch.
A couple of other things I would mention: all these pages have, well, this one's just the login page, so it doesn't have one, but this little icon in the upper right is going to show you all of our assistive instructional content specific to each page of this tool. So, that should be a valuable resource if you run into any acute problems while you're doing it. And then, I want to mention this. Down here at the bottom, we have this Request Help. This opens up another pane where you can enter your information, your Collection ID, and a message. Now, if you need to contact the helpdesk while you're using the validation tool, there's no better way of doing that than to fill this out and submit it. Emailing us is great too.
But if you use this tool to contact us, it not only sends us your message, it also automatically includes a variety of troubleshooting information that we can use to help address your question a lot faster. That dispenses with us having to ask you for that troubleshooting information when you haven't provided it.
So, definitely use that if you need to submit a ticket during your time in the validation and upload tool.
So, let's say that you have all your GUIDs created, your data expected list is complete, your data files have all been loaded with your data. You've turned in your submission agreement, and everything is just ready to go. So, to proceed from here, you would click Create new Submission. And you can see here, the workflow at the top is going to validate the data.
There will be a step to associate files, and that will be relevant for people who are submitting not just the spreadsheets, the CSV files that I'm about to show you, and just to sort of back up and frame that. The data you'll be uploading using this tool will be CSV tables.
All the data that gets uploaded to the NDA will be in a CSV template, and those are what get pushed to our system through this tool. And then some types of data, like imaging, genomics files, EEG data, and those kinds of things, will also have the associated raw or underlying data files of some kind. Those also have a CSV spreadsheet that links to those files. That step is where we will link the files directly and push that, and then it builds a package and uploads the data. So, let's look at some data files and start validating them.
So, I have my demonstration data folder here that has a variety of things in it, so I'm going to take that. And I'm just going to drag all of them into the tool, just like that. So, that's how you get started. And this is really all there is to it.
Now, the core takeaway from this is going to be that the tool, itself, provides you the feedback and the information that should be necessary in order to identify the problem, hopefully, resolve it, but, if not, at least identify it, and then perhaps determine the appropriate course of action to resolve it. So, as you can see here, I have three entries now in my tool. Even though I dragged in the subfolder and all these other files. So, what the tool's doing is it's automatically ignoring anything that's not a CSV file because that's really what the tool grabs and wants to validate and submit. So, these are the CSVs. It's identified them as potential data files based on the fact that they're CSVs. And then it's identified them as being a certain data structure.
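To make the filtering step concrete, here is a minimal Python sketch of picking the CSV files out of a mixed set of dropped paths. It is only an illustration and not the tool's actual code; the function name and the sample file names are hypothetical.

```python
from pathlib import PurePath

def pick_csv_files(dropped_paths):
    """Keep only paths whose extension is .csv (case-insensitive)."""
    return [p for p in dropped_paths if PurePath(p).suffix.lower() == ".csv"]

# Hypothetical contents of a demonstration folder
dropped = [
    "abc_community02.csv",
    "ndar_subject01.csv",
    "image03.csv",
    "notes.txt",
    "image03/scan1.png",
]
candidates = pick_csv_files(dropped)  # only the three .csv entries remain
```

Everything that is not a CSV is simply skipped rather than reported as a problem, which matches what the three entries on screen show.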
That's what the short name column indicates. Now, from this point I can proceed, or I can continue to add others. There is also a big Help button, so you can request help.
I can continue to add to this and just keep tossing files in. On your end, you can validate all the files you have at once and upload them in one gigantic package, or you can upload them one file at a time in 100 different packages; it's totally irrelevant as far as we're concerned. All that matters is that all of it gets uploaded and that all of it is validated. So, from this point, you can see these are bolded; they are complete with errors. It's displaying errors, it's displaying some number of warnings, and then I have the option to re-validate. So, if I re-validate, it's going to get the exact same result that I did last time.
So, let’s open up one of these files. This one has a huge number of errors. So, actually we do not need to open the file. So, we can actually see the errors.
We can see the errors right here in the tool by clicking on the Table, and then it drops down, and in this second viewing pane, we can see the errors themselves.
You can see there are a lot of errors; there are 175 of them, in fact. So, let's say I don't want to see the errors. I only want to view the warnings. I can uncheck Errors and just display Warnings.
I can also check both, well, let's uncheck Warnings, and then I can click grouped by error message here. And that's going to collapse this into the types of errors that I have and then display the record level errors below that. So, what exactly am I looking at here? This is a list of the problems that the tool has detected in this particular file. And I have this file. I'm going to open it now, right here.
And when I say problems, that's really what the tool refers to as an error.
An error represents where your file and the NDA data dictionary structure that this template is identified as being are not consistent.
Or in some cases, like this first one, there are other sorts of general standards that need to be met that aren't necessarily in your data dictionary structure, but they're consistent across all data dictionary structures. So, I believe if I collapse these, I'll have just these two types of errors.
There are a lot of these invalid range errors that are taking a while to collapse. But those are the only two kinds. So, these unrecognized GUID errors:
So, here's the message showing me exactly what it's finding. The element with type GUID must be recognized by the system. Then we have this string of alphanumeric characters that represents the GUID. Now, since this is an unrecognized GUID error, clearly, that's indicating that this particular GUID is not recognized. It's an error. So, the tool is checking all the GUIDs in your files against the GUID matching system. It's not checking anything other than the fact that they exist as valid GUIDs. And then, consistent throughout all error types, it's going to show you the record number, it’s going to show you a column name, and then it's going to allow you to hide these as you go through them to clear things out. So, we could even just hide this entire group.
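The "group by error message" view can be pictured with a short Python sketch. This is an assumption about the shape of the information only (a record number, a column name, and a message per finding), not the tool's internal representation; the column name `item_a` and the message strings are made up.

```python
from collections import defaultdict

def group_by_message(findings):
    """Group (record, column) pairs under each distinct error message."""
    groups = defaultdict(list)
    for finding in findings:
        groups[finding["message"]].append((finding["record"], finding["column"]))
    return dict(groups)

# Made-up findings: record number, column name, message
findings = [
    {"record": 6, "column": "subjectkey", "message": "unrecognized GUID"},
    {"record": 1, "column": "item_a", "message": "invalid range"},
    {"record": 7, "column": "subjectkey", "message": "unrecognized GUID"},
]
grouped = group_by_message(findings)
```

Each key is one error type, and the list under it holds the record-level entries that would appear when that group is expanded.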
The invalid range error, as it says, indicates that data fields must match the value range field, and note that text is case sensitive in the value range. So, let's take a look at the data dictionary structure. I'm going back to my data dictionary on the NDA website, and I'll search for ABC since that is the same structure. So, here's the data structure I was just looking at. The data dictionary can be searched using a variety of these tools. Oops, that was the wrong one.
Here’s the ABC Community checklist. This is the structure that I was just looking at. The ID for these structures is called, in a lot of these pages, the short name. You can see the short name here is abc_community02. That is the short name here. And if we go to the file, you will see that that's also what's here in the top row. This is my fake data sample file. This is the file whose results are displayed on the page behind me there. So, abc_community, and then 2, in the first two columns in the first row, the tool reads that. And that's how it knows that this is the data structure against which to match your file.
So, when you load this into the tool, it's taking that ID from the first header row, and then it's taking the second header row, and it's parsing that to identify which column is which element. And then it's going to look at the data dictionary and compare each of those elements to the definition. So, in our subject key column, which is a GUID, that's our first one here, where we're showing this fake GUID. This is what it's checking to see if it's a valid GUID or not. So, when the tool sees that this element is of data type GUID, it knows that means it's one of our GUIDs, and it needs to check to make sure they're all valid. So, it did, only it found that this is not a valid GUID.
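As a rough sketch of the two header rows just described, the Python below reads the structure name and version from the first row and the element names from the second. It is not the tool's implementation; the zero-padding of the version, the element names, and the GUID value are assumptions for illustration.

```python
import csv
import io

def read_template(text):
    """Split an NDA-style CSV into structure ID, element names, and records."""
    rows = list(csv.reader(io.StringIO(text)))
    name, version = rows[0][0].strip(), rows[0][1].strip()
    # Assumption: the version is zero-padded to two digits in the short name
    short_name = f"{name}{int(version):02d}"
    elements = [cell.strip() for cell in rows[1]]
    records = [dict(zip(elements, row)) for row in rows[2:]]
    return short_name, elements, records

# Fake sample: row 1 = structure name and version, row 2 = element names
sample = (
    "abc_community,2\n"
    "subjectkey,interview_age\n"
    "FAKE_GUID_1,120\n"
)
short_name, elements, records = read_template(sample)
```

Once the columns are mapped to element names this way, each value can be compared against that element's definition in the data dictionary.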
So, you can see this one probably is a valid GUID since it passed validation. In short, the record number is going to tell you the row. So, you can see in records 6, 7, 8, 9, and 10 we have our bad GUID. And that's the column name, Subject key.
So, the tool is telling us, in sequence, which row it's in and which column it's in. And then, by the category, we know what the issue is: the GUID is invalid. So, in the invalid range group, you can see 1, 2, 3, 4, 5; these are by the record. And it's the same elements over and over again.
So, there’s some set of elements in my file where the value range is just not matched, and I think that's going to be right here. You can see the inter-relation parent. That was the first one in the sequence, and yes, no is the value range, and then you can see the values. Those are actually right here. They all have numbers in them. So, this is not matching the range. It's not even the same type of data, so the tool is detecting that and basically perceiving that as a problem.
So, the solution to that might be that you've made a data entry mistake somehow, and it needs to be corrected. Or the explanation might be that there's something about the data dictionary that needs to be updated to accommodate your data.
This is a good example of the latter case because we have yes or no, and if you click on this little “i,” it shows the supported translation rules currently in place for this element. The translation rule is basically what tells the system how to read other values. Our default standard definition says Yes and No with a capitalized letter. It can also take, however, “y” in lowercase or uppercase and translate it.
So, that basically allows you, if you have one of those rules implemented, to leave your data in a different form and then have the tool automatically parse it correctly and interpret it correctly.
And rules like that, that's an example of a translation rule, in which a value on your end is interpreted as a different kind of a default value in the dictionary. And the other kind of rule you would find is an alias where your element names are different.
And we implement an alias rule that allows the tool to interpret your element name as the correct element name for that element in that structure for you.
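An alias rule can be thought of as a simple rename applied to your column headers before validation. The sketch below is hypothetical: the function, the local name `age_months`, and the rule itself are invented for illustration and are not how NDA stores aliases.

```python
def apply_aliases(columns, aliases):
    """Replace any aliased column name with its dictionary element name."""
    return [aliases.get(column, column) for column in columns]

# Hypothetical alias rule: the local name on the left maps to the element name on the right
aliases = {"age_months": "interview_age"}
renamed = apply_aliases(["subjectkey", "age_months"], aliases)
```

Columns without an alias pass through unchanged, so only the names that differ from the structure need a rule.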
So, those are set up with your data curator as part of the data harmonization process. If you don't know who your data curator is and you need a rule like that implemented in order to get one of these errors resolved, or you're going to encounter it based on what the data structure says, you can go ahead and request that, and have them help you set it up.
So, as a kind of an example, if you were collecting this instrument and were using 0 and 1 as the coding, instead of yes or no, you would need to have a translation rule implemented. And then your data could still say 1 or 0 like these do, but the tool would recognize it and interpret it based on that rule.
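The 0 and 1 example can be sketched in a few lines of Python. This is a minimal illustration under assumed names, not the validation tool's code: a value is first passed through any translation rule and then compared, case-sensitively, against the allowed value range.

```python
def check_value(value, allowed, translations=None):
    """Return the dictionary value for `value`, or None if it is out of range."""
    translated = (translations or {}).get(value, value)
    return translated if translated in allowed else None

allowed = {"Yes", "No"}          # value range from the structure definition
rule = {"1": "Yes", "0": "No"}   # hypothetical translation rule for this Collection

without_rule = check_value("1", allowed)        # out of range
with_rule = check_value("1", allowed, rule)     # interpreted as "Yes"
wrong_case = check_value("yes", allowed)        # out of range: matching is case sensitive
```

Without the rule the value 1 would be reported as an invalid range error; with the rule in place the same file passes unchanged.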
Now, one thing to note about that is that if you have any rules like that set up, you need to have this box checked to validate against a custom scope. So, this custom scope refers to this idea of the data dictionary being customized for the scope of your specific project.
When we have a rule like a translation rule or an alias, those aren't global rules that affect every single person who uses NDA or submits data. They are specific to your Collection. So, you would need to validate against a custom scope.
When you do that, it will ask you to select your Collection ID, and then the tool will look at your specific data dictionary rules and use those to validate.
So that's how you would implement that in the tool once the rules are set up on your end.
Now, let's move on from this file. The main takeaway from this should be that when you drop a file in here, the tool will tell you where in the file all of the errors are and what the error is.
And then, you need to determine whether that's either a mistake in the data or if it's a mistake in terms of the harmonization between your data and the dictionary. And in the former case, fix it, and in the latter case, talk to your data curator to see the best solution.
And, if that's all you get out of this presentation, that should be sufficient to get you well on the way to getting your data submitted. So, I have the luxury of just being able to remove that entirely. Let's ignore that with those 200 errors and move on to these two other files, and we'll take a look at two other kinds of errors.
This is basically the same principle, but let's take a look at these other files, as opposed to these general wide-ranging errors. I'm going to close this, and what I'm doing now is I'm just going to show you how easy it is to resolve one of these problems. And it is obviously a very simplified system.
So, now I'm looking at the NDAR subject file.
This, of course, is the Research Subject and Pedigree data structure. Now I'm going to look up that structure.
Hopefully, it's clear that the data dictionary is a very useful resource for helping troubleshoot any problems you see in your file; just because that's where the definition is, that’s what the validation rule is checking against. So here I have a single error. There are 62 warnings. So, as I said, an error represents a place where the validation tool finds a discrepancy between your files and the data dictionary.
A warning represents, basically like an optional field that's been left out or a best practice that's been violated, like the removal of a blank column or something like that. So, those are issues that could potentially contribute to problems at some point, but they don't actually cause any errors, and they don't actually block your technical submission.
So, warnings can safely be disregarded in most cases. You can go ahead and ignore the warnings in your files if you so choose. So, this file only has one error, and you can see the error is not an integer. So that's already telling us exactly what the problem is. The column name is Interview Age. The record number is five, and the message is that the field provided was not an integer; that seems fairly straightforward. And indeed, if we go here, you can see we have our column interview age. Record five. You'll note that record five corresponds to row seven in the file because rows 1 and 2 are the headers. So, to find the row in your file, just add two to the record number. And you can see here that it says 12. It's a string; it’s not an integer. So, I'll just change that. Obviously, that's not exactly 12. But it's correct based on this age sequence in this fake data.
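The record-to-row arithmetic and the integer check are easy to express in Python. This is a sketch under the assumption stated in the talk that rows 1 and 2 are headers; the function names are hypothetical and the integer test is only one plausible way to do such a check.

```python
HEADER_ROWS = 2  # row 1: structure name and version, row 2: element names

def record_to_row(record_number):
    """Convert a record number from the tool into a spreadsheet row number."""
    return record_number + HEADER_ROWS

def is_integer(value):
    """True if the cell text parses as a whole number."""
    try:
        int(value.strip())
    except ValueError:
        return False
    return True

row = record_to_row(5)  # record five sits on row seven
```

So a message about record five points at row seven of the spreadsheet, and a cell such as "12 months" would fail the integer check where "144" would pass.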
So, I just corrected it, and I just saved this file. Then if I go and click revalidate. There, now you can see it's no longer bold, zero errors.
So, this file is now ready to upload. So, if I remove this offending file with errors still, it will give me an error-free view here, and I would be able to proceed with a packaged upload, but first, let's resolve this other error.
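The gating rule at work here, where errors block a package and warnings do not, can be summarized in a small Python sketch. The result dictionaries and the counts are made up, and the real tool may apply further conditions.

```python
def ready_to_submit(results):
    """A package can be built only if at least one file is loaded and none has errors."""
    return bool(results) and all(result["errors"] == 0 for result in results)

# Made-up validation results for two files
results = [
    {"file": "ndar_subject01.csv", "errors": 0, "warnings": 62},
    {"file": "image03.csv", "errors": 1, "warnings": 0},
]
blocked = ready_to_submit(results)       # one file still has an error
allowed = ready_to_submit(results[:1])   # warnings alone do not block
```

Removing or fixing the file that still has errors is what turns the first case into the second.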
You may have a lot more data, you may have real data instead of the simplified fake data that I have, but the general principle is the same, which is that you use the tool to troubleshoot errors in the data, identify their cause, and then resolve them one way or the other.
So, let's look at our image spreadsheet now:
Now, this is important to note: those first two files I showed you are just questionnaires and sort of basic structures. Ultimately, those data are just going to be a spreadsheet. What we're looking at here is an instance of our image structure; image03 is the ID of this structure. And this is the basic general structure for all raw imaging data of any kind. So, a lot of you may be uploading imaging data. You may be using this. And the spreadsheet itself basically contains metadata on the files. There are a lot of elements in the structure that go all the way out.
And then, most importantly, these structures are also how the tool goes and gets your actual underlying files. You can see here DICOMs are typically what we expect from imaging. There may be other file types for other data types. But it's going to be the same basic principle for all these data types with associated files like this, which is that one of the elements in the data structure will be a file type element, like an image file. It may also be called a data file in some of the structures.
In all cases, that element will take the path of the file, or of a zipped-up archive of files, associated with the record whose metadata is in that row.
So, if Subject X came in, we’re looking at a subject who came in for five visits, once every month for five months. You can see our dates are matching monthly. The interview age, which is in months, is increasing monthly. And we have our image file element, which is providing the actual path to the files that were collected at each of these visits under these parameters. So, if you have the same visit but have different kinds of imaging, those are two rows. But if you have one visit where they have a lot of images in the same exact setup, then you can zip those together and upload them all under the same record.
So here, you can see I'm pointing to these PNG images just as a sample, and I'm showing the absolute path from the drive root all the way down to the subdirectory where this file is located. And if I open back up my demonstration folder, you can see it's going to duplicate this. And then my image03: C, Users, my name, Desktop, demo, image03. So here are the PNGs.
And if I go back to my tool, it's telling me that this, OK, actually, sex is null, so this is required; it can't be null. Missing required field, record six. I know that's row eight. And indeed, right here. So, I just need to fill that in. I'll save it, but I'll leave that file open for later. And then when I re-validate, it's passed. So obviously, this is, once again, pretty simple, but this is how it's done.
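A missing-required-field check like the one that flagged record six can be sketched as follows. This is an illustration only; the function name is hypothetical, and which elements are required is defined by the data dictionary structure, not by this code.

```python
def missing_required(records, required):
    """Return (record_number, element) for every empty required value."""
    problems = []
    for record_number, record in enumerate(records, start=1):
        for element in required:
            if not (record.get(element) or "").strip():
                problems.append((record_number, element))
    return problems

# Fake records: the second one has a blank required value
records = [
    {"subjectkey": "FAKE_GUID_1", "sex": "F"},
    {"subjectkey": "FAKE_GUID_2", "sex": ""},
]
problems = missing_required(records, ["subjectkey", "sex"])
```

Record numbers start at 1 for the first data row, so they line up with what the tool reports.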
So, now, to move on to the next, there was an error message up there at the top, but it went away. So, it must have been a temporary problem. So, I click Next, and it will take me to my associated Files stage.
So, you can see here my NDAR subject has zero files required; there are no associated files. But my image structure does have six required files. Now, as we go to this, you'll see it's found all six of these files. And you'll note that it did that even with this path not being exactly correct.
OK, so I need to pick this directory and upload it here.
OK, so I think it actually did this on its own, sorry. This interface has changed a little bit since the last time I used it, but it says it matched six out of six.
So, it's going to let me build a package because I had those files in the same directory as I have my image03; it scanned in all its subfolders and found these exact match filenames irrespective of the path.
So, to provide a clearer account of this: the path in your CSV file should be the exact absolute path of the file's current location. As we just observed, though, when you use relative paths, the tool is designed to accommodate that. It does so by searching on the file names within the working directory and all of its subdirectories. So, my path got interpreted as a relative path because the files are in the same working directory.
But the best practice is going to be to use an absolute path to the files' direct location; that will also allow you to drop them in from a different directory if necessary. So hopefully, that was clear.
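The filename-based matching described here could be approximated in Python as follows. This is a minimal sketch of the general idea, not the tool's real implementation: `match_by_filename` is a hypothetical helper, and the paths and file names are invented for the example.

```python
import ntpath
import tempfile
from pathlib import Path

def match_by_filename(csv_paths, working_dir):
    """Match each path from the CSV to a file under working_dir by file name only."""
    found = {}
    for candidate in Path(working_dir).rglob("*"):
        if candidate.is_file():
            # Keep the first file seen for each name; the directory is ignored.
            found.setdefault(candidate.name, candidate)
    # ntpath.basename splits on both "/" and "\", so either path style works.
    return {path: found.get(ntpath.basename(path)) for path in csv_paths}

with tempfile.TemporaryDirectory() as working_dir:
    nested = Path(working_dir) / "image03" / "scans"
    nested.mkdir(parents=True)
    (nested / "visit1.png").write_bytes(b"")

    csv_paths = [
        "C:\\not\\the\\real\\location\\visit1.png",  # wrong directory, right name
        "/somewhere/else/missing.png",               # no such file anywhere
    ]
    matches = match_by_filename(csv_paths, working_dir)
    matched_count = sum(1 for match in matches.values() if match is not None)
    print(f"matched {matched_count} of {len(csv_paths)}")
```

A name-only match like this is convenient, but it cannot tell apart two different files that share a name, which is one more reason to prefer exact absolute paths.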
Now that I have that set up and my package is ready to get built, I'm going to log in now. You might be noticing now that this is only the first time I've logged in, and up until this point, I have not logged in at all. Over the course of your project, you can be validating this every day. You know, every day, every time you add to a data file, you could just validate it and then make sure that it’s remaining consistent with the standards you've defined. And then, only when you go to push data into NDA will it ask you to log in and authenticate with NDA to allow you to pull your project information and build a package.
So, after you're done validating and you're ready to push a package of any size, once again, you could do this with one file; you could do this with ten files or a million files. It's irrelevant.
Once you're ready, you'll see this page, and it will list all the files you have.
It will include the count of Associated files. My six awkwardly associated Files are right here. They're all accounted for. And I will also mention: The reason you have this step where you have to go in and confirm these are the files you want is that this is a tool running in your browser.
So, since it's not actually running locally, you need additional authorization to get the privileges it needs to get that file as a security measure. That's just how web browser-based tools like this need to work.
In any case, once you get to this page, you're going to want to select your Collection. Since I'm a system administrator, I have every Collection available. You'll only have a couple, maybe one, two, depending on how many grants you're working on that have NDA data sharing expectations.
I just selected this test Collection, and I'll give this a title now and give it a description. Now, the descriptions and the titles are mainly for internal use, but it's really recommended that you give it a very descriptive name. You can go in and look at these submissions later.
And a lot of times, people end up needing to go in and verify what they uploaded, what data got uploaded by a different site on our project, who's uploaded what. And then, they go into their submission tab, and the people who've been uploading the data have named their submissions.
So, they have 50 submissions that are all named "data submission." Obviously, that's not particularly helpful to the person who's validating that stuff, who then needs to go in and actually download every single submission to see what was uploaded when.
So, it's definitely advised to provide a descriptive title and an apt description. Don't do what I've just done and give it the bare minimum; that might come back to bite you later.
Now, once again, you'll need Collection permissions in order to find yours here. You need submission privileges; if you don't have that, you won't be able to proceed.
So, from here, once I've named it and selected my Collection, I'll just click Build Package, and it will create my submission package.
OK, so now it's ready to go; I’m all at 0%. So, after I check this to verify that nothing I'm uploading contains personally identifiable information, I can submit this. Once you submit this data, let's move on to what happens next. We lost some time to the network problem.
Let's briefly cover what happens at the end of this process and what comes next. And we will have time to address your questions.
After you click Submit, you'll start seeing progress. At this point, and only at this point, can you resume the upload if it gets interrupted.
And by this point, I mean clicking submit data. Only after you click submit data will you be able to come back and resume an in-progress submission. So, once you click Submit, that's when it initiates it and starts the process, and you can come back and resume it.
So, this may take some time, depending on how large your package is, or it may happen very quickly. This is a very small package, so I suspect it would upload quite quickly. Yours might be significantly larger; it might take hours to upload. Once it's completed, you will get an automated e-mail notification that says we've received your submission. At that point, for that submission, in and of itself, you're basically done.
So, you can go to your Submissions tab in your Collection, and the submission will show up here. This is an empty Collection. So, there's nothing here, but yours will be listed in this tab.
This is the tab I was just alluding to. It lists what has been uploaded and when, with its name and so on, and you'll be able to use that to verify your submission was received.
And at that point, you won't hear anything from NDA for a while.
It might be a couple of months, or up to four months, before you hear anything. And as for what happens after this: let's go to Contribute Data, our general information section on contributing, and take a look at our Data QA reporting step. So, after your data are uploaded, over the following months, there are a series of automated quality checks that are run on the data.
And these are some of the main reasons why it's important to upload cumulative datasets for those types of data that call for it. So, for example, you don't need to be uploading the exact same duplicate DICOM files every six months. But it is good to re-upload a cumulative dataset of the spreadsheet data every six months. That allows the comparison between cumulative datasets to take place in these post-submission quality checks. So, you can see this page for a list of the checks we run.
And after a few months, once this is finished, you may receive a notification from us, informing you that we found errors in your submission. Now, that might be that you forgot to upload cumulatively, and we detected that, and we're asking that you do upload a cumulative dataset. Or it might be some of these issues. You can see the interview age and date consistencies are checked. Sexes are checked for consistency, just in case. GUID-to-source-subject-ID mappings are checked, and so on. So as an example of this, let's take a look at our fake data again in our image sheet.
As I said, these are rows of the same subject. They're the same subject because they're the same ID.
And they've come in, based on the date, once a month, on the first, and they've aged one month per month as expected. Had you uploaded data where someone one month later was ten years older, we would detect that. If they got younger, we would also detect that. And we would flag that as an error in the data, and we would notify you about it through these reports that go out within a couple of months afterward. So, that's what you should expect next.
And then, in this ongoing, cumulative submission cycle, where every six months you're uploading a cumulative dataset, we'll continue to check it. If errors are found, the expectation is that those will be corrected either in the next upload or in an intermediary repair upload or something like that. So that's how this will proceed.
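To make the age-and-date consistency idea concrete, here is a rough Python sketch of such a check. It is not NDA's actual QA logic: the one-month tolerance, the choice of the first visit as the baseline, and the function names are all assumptions, and the visit data are made up.

```python
from datetime import date

def months_between(earlier, later):
    """Whole calendar months from one date to another."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)

def find_age_inconsistencies(visits, tolerance_months=1):
    """Flag visits whose age change does not match the elapsed time.

    visits: list of (interview_date, interview_age_in_months) for one subject.
    Each later visit is compared against the earliest visit.
    """
    ordered = sorted(visits)
    first_date, first_age = ordered[0]
    problems = []
    for visit_date, age in ordered[1:]:
        expected = months_between(first_date, visit_date)
        actual = age - first_age
        if abs(actual - expected) > tolerance_months:
            problems.append((visit_date, expected, actual))
    return problems

# Made-up visits for one subject, one month apart.
visits = [
    (date(2020, 1, 1), 120),
    (date(2020, 2, 1), 121),  # consistent: one month later, one month older
    (date(2020, 3, 1), 241),  # ten years older than expected
    (date(2020, 4, 1), 110),  # younger than at the first visit
]
problems = find_age_inconsistencies(visits)
for visit_date, expected, actual in problems:
    print(f"{visit_date}: expected about +{expected} months, got {actual:+d}")
```

Running something like this on your own cumulative spreadsheet before uploading may catch the kind of issue that would otherwise come back in a QA report months later.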
Now as one final note, well, let's go ahead and move on to questions. I will just put our contact information up again; we have five minutes. So hopefully we will get through a few, and if you do not hear your question, email it to us or call us.
The first question is, do we need to notify our data curator if we plan to use preexisting aliases in place of the data element name listed in the data dictionary?
So, I would recommend that you do, just to make sure it will work, especially if you already have a curator and you're working with them. So, if the aliases are already listed on that dictionary page, those are older global aliases from before we implemented the system that allows us to do it on a per-project basis. So, I don't know the answer to this question for sure, but it's possible that those are not automatically included in your specific scope. I would check with them; that is the best thing to do in general when in doubt.
OK, so the next question is, does it matter what IP address I'm accessing this from?
You can use this at home; the IP address should not make any difference for anything on the website or for accessing the webinar recording.
OK, so we have another question here.
We did not collect the middle name or the place of birth of our subjects. This is going back to that data that's needed in order to create a GUID. Is there some kind of exception we can request so that we don't need that information in our GUID?
So, an actual GUID cannot be created without all of that information.
However, in a case like this, the sort of default, prescriptive course of action would be to take all the data you have; you would still upload that. But you would be able to use a pseudoGUID, which is a totally just random arbitrary ID, and you'd be able to use that to upload the data now. The expectation there would be that, if possible, you're going to be re-contacting those participants in order to get those two additional pieces of information and create the GUID.
If there's a reason why it's not going to be feasible to re-contact and get that information, I would recommend contacting the NDA helpdesk to get the process started to have an exception documented for that. Otherwise, go ahead and upload with the pseudoGUIDs. Then as you re-contact subjects and get that information, the GUID tool also has a promote pseudoGUID function that allows you to create a GUID using the pseudoGUID in addition to their PII and link it.
So, in a nutshell, there, you're uploading a pseudoGUID as a placeholder and then correcting them later. So that's the sort of default. But again, if you need an exception, contact the helpdesk. So, hopefully, that will answer the question to anyone who runs into it.
The next question is for the 15th, for a data submission deadline. So, for example, on January 15th, are we required to submit all data collected until that date, or is there a cutoff point?
So, obviously, if someone came in for a clinical visit on January 14th, you're probably not going to get that person's data collected at that visit into the upload by the 15th.
The submission period that ends in January goes from December 1st to January 15th. So, in terms of a cutoff point, we would give that December 1st date as the cutoff point. So, you would upload everything collected through November, and then the data you're collecting in December and January, during the submission period, can be rolled into the next period's dataset. That makes sure you have enough time to clean and enter the data you're collecting right before the deadline.
OK, so we have two more questions here. For the first one, you mentioned the special instructions for uploading imaging data. Are there similar instructions for uploading ecological momentary assessment data?
So, I'm not familiar with ecological momentary assessment data. That is something I would ask your data curator. They may need a different structure for it. There may be an existing structure that I'm not aware of already. They may have an answer they can give you with confidence right off the bat. Reach out directly to us offline for this.
The second question is if you do not have any data to input for a subject, do you leave it blank? The answer to that question is yes. So, if a data field in a structure is marked as recommended, as opposed to required, and you just didn't collect it, you just leave it blank.
If it's required, and you just didn't collect it, there will be a missing-data code to use instead.
So, that's something you would want to check in the data dictionary. But if it's not required, and you didn't collect it, just leave it blank. If you did collect it, then you should upload it.
OK, that covers the hour, and we do not have additional questions at the moment. Hopefully, that was helpful for everyone. Thank you for bearing with us during the brief technical outage we had. And if anything else comes to mind, or if you did not get to ask us right here at the end, go ahead and e-mail us at NDAHelp@mail.nih.gov, or contact us by phone at 301-443-3265. The recording will be made available and sent via email afterward. So, just send us a question.
On that note, thank you very much for attending.