NIMH Data Archive (NDA) Data Harmonization Training Webinar
TRACY KING: Hello, everyone.
My name is Tracy King, and I would like to welcome you to today's NDA Data Harmonization Webinar. A portion of today’s webinar will be played using your computer audio. This means that if you dial into the audio conference using your phone audio, you will hear the video play through your computer or device speakers rather than over the telephone.
If you are not hearing the audio properly, whether you're connected via the computer or the phone, you should check the volume on your computer speakers and/or check the device output. It's possible that the audio is playing through a different output, like a USB headset. If for any reason during today's webinar we experience any technical difficulties, we will provide you with the recorded webinar link via e-mail.
Please e-mail us at NDAHelp@mail.nih.gov any questions that you have, and we will get back to you as soon as possible. With that, we will start the webinar.
MALCOLM JACKSON: Hello, everyone. My name is Malcolm Jackson, and welcome to the NDA Data Harmonization Webinar.
Today, we'll be going over how to take the data that you are collecting as part of your NIH grant and harmonize it with the NDA data standard. And, by harmonize, we mean take steps to make sure your data are as consistent as possible with data from other laboratories when they are deposited into the NDA system.
In our presentation today, I'll be going through a slide deck quickly, to give an overview of the different pieces of the data harmonization procedure. And then I will be going to the actual NDA website to give you a live demonstration of some of these features we'll be discussing and tools, and how to approach this.
Let’s get started. If you have any questions during this presentation please go ahead and type those into the Webinar Client, the Questions section there, and we will go over all the questions at the end.
First, let's discuss our webinar program in general. This is the NDA Data Harmonization webinar. This is the second in a series of four webinars we give, the first being an overall Orientation for new grantees that covers the basics of the entire data sharing process. This presentation, as I said, is on NDA data harmonization. This is the step of taking the data you are collecting and preparing it for submission into the infrastructure for sharing.
We also have a webinar on how to submit the data itself, an entire webinar just on that. And then we also have a fourth webinar on accessing data in the system.
Now, all data in NDA are eventually shared for re-use with the research community. That fourth webinar is on how to do that: how to access the system and the data within it to use in your own research. So, this training is not going to cover how to get your project started. It's not going to cover any of the paperwork that you need to do as your award comes in and so on. We're going to focus today on, as I said, the steps involved in taking your data and preparing it for submission, and working with NDA to make it eligible for submission by associating it with a structure in our Data Dictionary.
We will be covering the GUID, the Global Unique Identifier, which is the ID for individual people. Each person in your data will have to have a GUID. That is the primary key for that individual, as their data are submitted to the system, and eventually shared. You’ll have to take steps on your own to create those and sort of plug them into your data.
We'll also be covering the Data Dictionary. This refers to the set of all the measures, instruments, and data types that are defined in NDA as having a standard structure, and your data will need to match that or have a new definition created to accommodate it. So, we'll cover the Data Dictionary, and we’ll cover how to set up your Data Expected list. That is a deliverable expected as part of the data sharing process within six months of your grant award.
And as you completed the data submission agreement, which will have initiated your project and given you control over your Collection in NDA, the Data Expected list will really be the next step in that process. That's our primary tracking mechanism for your project. We're going to be covering that today in detail.
We're also going to understand the data harmonization process.
So how you're going to approach this process, and it is an ongoing process, of harmonizing your data to allow it to be submitted and shared to NDA and maximizing its value for re-use. And as always, most importantly, we will understand how to get help, how to find more information, how to get the support that you need to complete these other steps successfully.
Given that, we will move on to this overview of the data sharing process. This presentation will cover the middle steps. This linear chart shows the overall process from when you're initially applying for your grant, all the way up until close out.
And what we're going to be covering now is this second row: getting your GUIDs created, defining Data Expected, uploading your data. Well, we won’t discuss uploading your data, but we're going up to that point. We're working with the Data Dictionary to harmonize.
This presentation is not going to cover the startup points, those are in the orientation webinar, we're also not going to be covering how to prepare your manuscripts or publish or upload the data or the QA process. That's all going to be covered in the next webinar, the submission one.
So just as a sort of illustration of where we are, how to be oriented in this process.
So, with that, let's go ahead and dive in with the first real subject matter here. As I mentioned, the GUID. This is an abbreviation for the Global Unique Identifier. The GUID itself is a series of alphanumeric characters, a bunch of letters and numbers, and it's generated by a tool.
The tool is a piece of software that you download onto your local computer, and then run, login, and use it to create these identifiers. You create these IDs by entering participant information, PII, from their birth certificate into the tool’s interface.
Many of the people listening to this will be relatively at the beginning of their project, and if you've not been collecting the information specific to the GUID from the participants, that does need to occur. If possible, people who've already been enrolled and didn't provide the necessary information can and should be contacted.
What you need to be collecting is the first name, middle name, last name, and then birth date, sex, and the city, community, or municipality of birth, exactly as they appear on the participant's birth certificate. The birth certificate is used as the gold standard just to ensure that these data don't change over the course of anyone's life. Even if your name changes, that won't retroactively change the name that was on your birth certificate. It will remain consistent.
So, none of this information, which is personally identifiable, is ever sent to us at NIH. When you download this tool that generates the GUIDs, which we call the GUID tool, onto your computer, run it, and enter in these data, that tool, locally on your computer, is going to use those to create a series of one-way hash codes.
Those hash codes, from which these data are not deducible, are what is sent to NIH, and they are then matched against a database of other hash codes like that. If a match is found, that existing GUID is returned; otherwise a new one is created, and that GUID is returned to you and can be used for that person.
In this way, the data from the same individual in different potential studies can be provided. It can be linked across time and space using this ID without anyone ever having access to or sharing any of their personal information to increase the value of all those data. It is going to be present in all NDA data as you go to prepare this for submission and harmonized and uploaded, every file you upload, every data record uploaded to NDA will have this GUID in it, as the subject key. It's the primary key for this individual. There's a specific data element for it, called subject key. We'll see an example of that in the Data Dictionary a little bit later.
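To make the idea of a one-way hash concrete, here is a minimal Python sketch. It is not the GUID tool's actual algorithm, which this webinar does not describe. The normalization, the choice of SHA-256, the use of a single combined hash rather than a series of hashes, and the in-memory `registry` standing in for the NIH-side database are all assumptions made purely for illustration.

```python
import hashlib
import secrets


def one_way_hash(first, middle, last, birth_date, sex, city):
    """Hash normalized PII locally; only the digest would leave the machine."""
    fields = [first, middle, last, birth_date, sex, city]
    # Assumed normalization: trim whitespace and ignore letter case.
    normalized = "|".join(f.strip().upper() for f in fields)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical stand-in for the server-side table of hash code -> identifier.
registry = {}


def lookup_or_create(hash_code):
    """Return the existing ID for a known hash, otherwise mint a new one."""
    if hash_code not in registry:
        # Made-up ID format; real GUIDs are issued by NDA, not generated here.
        registry[hash_code] = "ID-" + secrets.token_hex(4).upper()
    return registry[hash_code]


# The same person entered by two different labs yields the same hash,
# so both labs receive the same identifier without sharing any PII.
h1 = one_way_hash("Ada", "", "Lovelace", "1815-12-10", "F", "London")
h2 = one_way_hash(" ada ", "", "LOVELACE", "1815-12-10", "f", "london")
id_from_lab_a = lookup_or_create(h1)
id_from_lab_b = lookup_or_create(h2)
```

The point of the sketch is the data flow: the personal fields stay inside `one_way_hash` on the local machine, and only the digest is ever used for matching.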
So, speaking of the Data Dictionary, the Data Dictionary itself, this term we use just refers to all the data, the measures, instruments, data collection instruments, and questionnaires.
In some cases, these are specific types of data; for example, there's a data structure for general raw imaging data overall. But all these instruments are defined as what we refer to as structures in NDA, or in the Data Dictionary. So, each piece of data collected and submitted must have an associated structure.
Then it must match the parameters, requirements of that structure as it is uploaded to be accepted into the database.
So, if you're collecting a kind of diagnostic instrument, if you have a questionnaire, if you have a demographic sheet, if you're collecting raw imaging data, each of these types of data will need to be associated with the data structure in the dictionary before it's uploaded.
So, if one doesn't exist, we will help you create it, if one does exist, but it doesn't exactly match yours, we can help work with you to get it matched as it’s possible that one already exists and has been defined by another group. And in that case, you can upload your data using the existing structure. So that process of the kind of back and forth to get your data and the Data Dictionary consistent so that they can be uploaded, that's what we mean by data harmonization. That's really what this is all about.
So, the Data Dictionary is fundamentally a list of these structures, which represent recognized and defined data, sort of standard shells. Each of these structures is itself just a list of elements. Each of these elements is a variable or question or an item on one of these instruments.
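The relationship just described, a dictionary of structures where each structure is a list of elements, can be sketched in a few lines of Python. The structure name `example01` and its element names are made up for illustration, and the class and field names are assumptions rather than anything defined by NDA.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One variable, question, or item on an instrument."""
    name: str
    required: bool = False


@dataclass
class Structure:
    """A standard 'shell' for one instrument or data type."""
    short_name: str
    elements: list = field(default_factory=list)

    def column_names(self):
        # The columns a spreadsheet for this structure would carry.
        return [e.name for e in self.elements]


# A toy dictionary with a single hypothetical structure.
dictionary = {
    "example01": Structure("example01", [
        Element("subjectkey", required=True),   # the GUID column
        Element("interview_date", required=True),
        Element("item_1"),
    ]),
}

columns = dictionary["example01"].column_names()
```

Harmonizing a dataset then amounts to making its columns and values line up with one such structure.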
The dictionary as a webpage and tool also allows you to generate blank template files for submitting. It allows you to check all the definitions for all the structures. It's a powerful resource throughout the entire data sharing process. At the bottom of the slide, you'll see a screenshot of the Data Dictionary search page. We'll look at that in real life a little bit later. And then we're going to move on to the Data Expected list. So, you may have received e-mails about this.
It is expected within six months of your award, and it is something in your project's Collection that you, or someone to whom you have delegated this privilege, would need to go in and define on the website. It is our primary tracking mechanism for your data sharing progress. And once it is completed by you, it will be a list of all the data you're collecting, along with the associated counts we should receive and the dates upon which we begin receiving data.
Each of these instruments, kind of your data sharing and submission schedule, is defined by you, in the form of the Data Expected list. The dates should be based on the standard terms and conditions, the expectations placed on your project based on the terms of your grant.
Each of the items on this list will correspond to a dictionary item. So essentially, the Data Expected list is a combination of your data sharing schedule and your own project's sort of personal, individual Data Dictionary set. And as I mentioned before, we recommend doing it within six months. The sooner you do this from the start the better, because this is also the mechanism through which harmonization is constructed. So, as you go and create this list, we're going to be working with you.
And that's how this process is initiated and worked out. At the moment, we have these two different data sharing schedules that can be used to determine the dates, based on your own project's individual start and end dates. First, there's raw data, also known as descriptive data in some of our documents.
This is data that characterizes a research subject, data not related to your primary aims, raw imaging data, things like that.
And, you know, there's sort of a functional definition for it here, which is that ultimately these are data that can be shared four months after submission.
These are data that you're collecting over the course of your project and, as it progresses, depositing into the system every six months during our standard, biannual cumulative upload period. And then they are shared four months later, after they have been QA’d (or according to the terms and conditions of your grant award).
This is to be contrasted with analyzed data, which are data that are related to your primary aims, data that you are generating that will be inappropriate for sharing with the general community until they are published. Those data are shared right away when they are published, or all the data in your project are shared one year after your project end date. That includes all these data, which should be submitted as soon as possible. Now, I will note that this is changing: as of January of 2019, these distinctions will disappear, and all data will be shared one year after the project end date, including the first no cost extension.
That is for projects starting effective January of 2020.
And you can contact NDAHelp@mail.nih.gov for any clarifications about the schedule, which is in transition.
Now, for some data types:
This is just to illustrate, sort of what I'm talking about, when I discuss these different types, and it also characterizes each of the types in terms of what needs to be done in addition to the actual data itself.
What sort of other tasks are required before it can be uploaded, based on some of these more complex types, complex in the sense of the way in which they are uploaded to our system. So, first we can consider data you might call clinical assessments. Essentially, these are just spreadsheets. Ultimately, when these data are prepared and submitted to NDA, they will most likely be in the form of a CSV spreadsheet.
Now, there are alternative ways of doing it.
But the basic default process is that a CSV spreadsheet will be uploaded.
These clinical assessment data types are the data that only consist of the spreadsheet. They don't have associated source files. They don't have associated additional pages on the website that need to be updated to provide the right metadata and linkages to the metadata. So those are just spreadsheets.
Then we have imaging, non-functional imaging. All of these data will have an associated CSV spreadsheet, and this is the standard image03 structure.
And then it will have these associated files, the raw data files themselves, maybe a DICOM file for imaging. Likewise, fMRI will have an associated spreadsheet. It will have associated files with this image03 spreadsheet, which is a Data Dictionary structure for imaging data. And then you also must define the experiment.
So, an experiment definition is something in NDA on the website that you will need to create, and then link with your data in that spreadsheet to upload it.
That definition is also required for these other data types that all involve a kind of experiment that needs to be run.
And the purpose of these definitions is to allow someone to reproduce it as easily as possible, and that includes EEG, omics data (genomics and other kinds of omics data), as well as eye tracking data. Now, when you see image03 here, or eeg_sub_files01, each of those terms, that sort of format, is what you may come to recognize as the ID of a data structure.
And these are those very data structures I was referring to earlier.
To be specific, the structure eeg_sub_files01 is a generic data structure for all raw EEG data, regardless of the experiment you're running or any tasks that may be performed during the experiment.
Those are defined in their own structures and the actual data on your scanners or the equipment. Those are what's in the experiment definition.
So, here, this also covers experiments. You can see down at the bottom, there's a screenshot of what an experiment looks like on the website. We’ll see one of those briefly a little bit later. But, as I said before, the experiments are what equipment you're using, the design, what stimuli were presented, and so on.
And then these are assigned an ID, and that's linked in your spreadsheet to the data record in that spreadsheet, as well as the Raw underlying files, which are also uploaded.
Now, once all your data are in these spreadsheets, and they're arranged appropriately with either their associated files, their experiment definitions completed, or both, and everything checks out, they are ready to be validated and submitted using the NDA Validation and Submission tool. So, the validation tool is a tool we also maintain, and this is going to be covered in detail and demonstrated in the submission webinar. The third, in our series, this is the second in our series. So, we're not going to cover this tool today.
The tool is what takes in those files, and it validates them against the appropriate Data Dictionary structure to make sure that all the data within them match, and are harmonized to, the standard that you define for your own data within that structure.
So, it's going to check that values are within ranges that are defined, that the fields required for upload are not missing and so on, and then we'll make sure that it has valid source files and experiment definitions when it checks, if that's appropriate, and then we'll package and upload the data. So essentially this is a post collection, pre-upload QA check on the data before it's even submitted. We have other QA checks that are performed after it's submitted. Those will also be discussed in the third webinar. However, this is a pre-upload QA check.
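As a rough illustration of the two checks just mentioned, required fields not missing and values within defined ranges, here is a small Python sketch. It is not the NDA Validation and Submission tool and does not use its API; the structure definition, the `score` element, and the example identifier are hypothetical.

```python
def validate_record(record, structure):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for name, rule in structure.items():
        value = record.get(name, "")
        if value == "":
            if rule.get("required"):
                problems.append(f"{name}: required field is missing")
            continue  # optional and empty: nothing more to check
        if "range" in rule:
            low, high = rule["range"]
            try:
                number = float(value)
            except ValueError:
                problems.append(f"{name}: {value!r} is not a number")
                continue
            if not low <= number <= high:
                problems.append(f"{name}: {value} is outside {low}-{high}")
    return problems


# Hypothetical structure definition: element name -> rules.
EXAMPLE_STRUCTURE = {
    "subjectkey": {"required": True},
    "score": {"range": (0, 3)},
}

good = {"subjectkey": "NDARAA000AAA", "score": "2"}
bad = {"subjectkey": "", "score": "7"}

good_problems = validate_record(good, EXAMPLE_STRUCTURE)
bad_problems = validate_record(bad, EXAMPLE_STRUCTURE)
```

Running a check like this each time new data are recorded mirrors the recommendation to validate early and often rather than only at submission time.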
And you can use this tool to validate your data while you're harmonizing it and mark your progress on that, even if you don't intend to upload it, since validation and upload are separate functions of the tool. We recommend using the Validation and Upload tool as frequently as every time you record new data. This will give you accurate data that is already formatted for submission to NDA later.
So, in general, there's a couple of tools we're going to highlight as part of the harmonization process.
Because, as I said, you can use the validation tool to check your data against the structure once your structures are defined, and that makes submission a little easier, because you're using this harmonization check as the data come in. So, there is an HTML-based client that's basically a webpage you visit in your browser. That's the primary validation tool for submitting data. And then there's also a Python client for some more advanced use cases and pipelines. Both of those are just different clients. They use the same underlying Web services.
Both of those are available for you to set up your harmonization, validation, and submission workspace. Then we also have our Data Dictionary Tool on our end, which is an internal tool that allows us to create project specific aliases and translations for your Collection. Now, we'll discuss what those mean exactly in detail later. And as I said at the beginning, it's also important to know where to get help. You can see NDAHelp@mail.nih.gov is available as our e-mail helpdesk. We are available during all normal business hours.
Let's move on to the actual website demonstration phase now that our overview and slideshow is complete. So now I'm going to show you what the actual tool to create GUIDs looks like. What the Data Dictionary looks like, how you use that webpage, and how you go in and create your Data Expected list.
And finally, I will also be showing you an example of what the data will look like in your spreadsheets.
So first, let's look at some web resources.
So, this is the NDA website.
You can see the homepage nda.nih.gov is our website, and under our section on contributing data, you can see we have NDA data standards and NDA harmonization approach.
So, we're going to be going over how to use the site in this presentation as a resource.
However, if you're interested in documentation, if you prefer the visual learning of reading text, we do have these two pages: one documents what is expected for each different kind of data that we've seen before, and how it's expected to be dealt with.
Then we also have a document outlining our general harmonization approach. So, this covers a lot of what I'll be going over today, in terms of what kind of data goes back and forth between members of your project and our Data Dictionary data curator. So, I would highlight this section overall as a good resource for information on data submission in general. And I would highlight these two pages in particular as good resources for information specifically on the data harmonization piece of submission.
Let's move over to the website and look at the NDA GUID tool. So, I have my GUID tool here. I'm going to launch it.
And as you see when I click launch, it's downloading this file as I mentioned before. This is a file you download locally. It needs to be run on your system so that we can avoid sending any of the PII to us. This does require Java, unfortunately, a version of Java eight is required to run this application on your system, so that needs to be installed. I'm going to close my console here, since you probably won't see that.
So, I'm going to log in with my credentials.
I can use my normal NDA username and password to log into that.
You can create one of those from the homepage, and then you just need to make sure you register for the GUID tool on your profile page after you've created your account.
Now, once I accept the warning message, I see that the GUID tool is quite a simple graphic interface for entering these data. It's just a basic double entry interface for each of the fields.
So, you would enter in their first name, last name, we have this answer regarding their middle name.
If they have a middle name, it must be entered here, and then, as I mentioned before, their birthday, their sex at birth, and their city/municipality of birth.
So, once you have all this information entered, you would basically just click Generate GUID. I'm not going to enter in a lot of information now, but you would click Generate GUID, and the GUID would appear right here in this little box. You then have this copy-to-clipboard button, which copies it, and then you can paste it into a document where you're maintaining it appropriately.
So that's basically all there is to creating a single GUID. This new button clears the form, and then exit closes the tool.
Now, if you're creating GUIDs one at a time and then copying them and pasting them into a document to maintain, that's basically all there is to know about this.
However, the tool does have a few other features that I would highlight.
So first, the tool can create multiple GUIDs. So, to do that, click first on this GUID template link.
It's going to download this CSV Excel File.
As you see here, it's populated with sample information. Various historical figures' information is inserted here as a placeholder, so you would delete all of that, leaving this ID the way it is. This is just a sequential ID that you might need to fill out if you have many more people, and you can enter in all the data here.
The only difference, I would say, is this "use existing GUID" flag.
This should be set to yes for everyone in your spreadsheet, unless you know for a fact that they are a high-risk pair within this sheet for a false positive.
So, the main example of that would be, say, a pair of twins, maybe whose first names start with the same letter, or something like that.
That's essentially the main case we see where false positives come in. And in that case, you would put no in their fields for this, and yes in the rest.
This isn't in the graphic interface, because the tool will automatically detect things like that, and prompt you, with this question, if it arises, but in the bulk generation, it's a field in the spreadsheet you need to upload.
So once this is all populated with your data, now you save it. And then you would click Get GUIDs for multiple subjects. That would take in that CSV file, it's going to give you a file browser, you would find it and load it. And then the tool would just give you back all those IDs. So, that is how you would create GUIDs for many subjects at once.
You can also use this tool to create pseudo GUIDs one at a time.
So, a pseudo GUID looks like this. A normal GUID looks like this; it won't have this underscore INV, _INV, in the middle of it. A pseudo GUID, unlike a GUID, is just random; it's arbitrary. It's a totally random sequence that has no bearing on the person's PII, and it's not linked to anything.
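When both kinds of identifier end up in the same spreadsheet, it can help to tell them apart programmatically. The Python sketch below assumes a format that the webinar only shows on screen: a GUID written as `NDAR` followed by eight letters or digits, and a pseudo GUID written with `_INV` inserted after `NDAR`. Treat both patterns as assumptions to confirm against the identifiers your GUID tool actually returns.

```python
import re

# Assumed formats, not taken from official documentation.
GUID_PATTERN = re.compile(r"^NDAR[A-Z0-9]{8}$")
PSEUDO_GUID_PATTERN = re.compile(r"^NDAR_INV[A-Z0-9]{8}$")


def classify_identifier(value):
    """Label an identifier as a GUID, a pseudo GUID, or unrecognized."""
    if PSEUDO_GUID_PATTERN.match(value):
        return "pseudo GUID"
    if GUID_PATTERN.match(value):
        return "GUID"
    return "unrecognized"


# Made-up identifiers used only to exercise the patterns.
kinds = [classify_identifier(v) for v in ("NDARAB123CDE", "NDAR_INVAB123CDE", "subject-7")]
```

A list of the rows classified as pseudo GUIDs is also a convenient to-do list of participants whose identifiers still need to be promoted to full GUIDs.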
So, a pseudo GUID is appropriate in cases where, for whatever reason, it's totally impossible to get the data to create the GUID for those participants.
So, for example, if someone declined to give you their birth date or something like that, you could still upload their data using a pseudo GUID. Or your project may be retrospective, where the city/municipality of birth wasn't collected and it's impossible to recontact participants.
And then in cases like this, the pseudo GUID can be used. To use pseudo GUIDs for cases that are not just one-off participants declining to provide all of this information, while still consenting to have their data shared, you would need to e-mail NDAHelp@mail.nih.gov to check in about that. And you also would need to check in with the helpdesk to get more than one pseudo GUID at once because this tool cannot do that. So, if you need many, you will need to contact the Helpdesk.
Pseudo GUIDs can also be used as a placeholder to upload data for participants while you're in the process of getting these data. So, with that, I'll highlight how to convert a pseudo GUID.
When I click convert pseudo GUID it toggled over to this interface, you'll notice it's exactly the same except it has this pseudo GUID field.
This is for promoting a pseudo GUID into a full GUID if data have been uploaded using a pseudo GUID and, at a later date, the actual data required for a full GUID are collected.
Now, you can also do this in bulk. The pseudo GUID promotion template here and the convert multiple pseudo GUIDs option here will perform that function in exactly the same way that the tool does it for normal GUIDs. And that's the GUID tool! That is what you need to know to generate IDs for all your participants.
These steps don't necessarily have to be done in a linear sequence, but if we're thinking of it that way, we can say GUIDs are the first step: everyone must have a GUID. So, let's say that now you've completed that, all your participants have a GUID, and they're getting enrolled, and we're ready to move on to the next step.
The next step will be going into your NDA Collection and setting up the Data Expected list, and we're going to cover this, and the NDA Data Dictionary kind of hand in hand, because they really do go hand in hand.
Let's get started with that by just taking a quick look at the Data Dictionary interface, and how to navigate it. When I click here, Data Dictionary data structures, I have this view. This has given me sort of a query view, where I can see what data are actually already shared in each of these structures. You can see they’re sort of alphabetical. I can toggle between a detailed view and a table view.
So, here's my table view. That's probably a little bit better for you since you are not interested necessarily at this juncture in, say, how many female participants already have data shared?
We still show which ones are getting used more here, but this is probably easier to navigate. Now, there are thousands of structures here, so browsing this as a whole is not really going to be an option regardless of which view you use. So, considering that, I have this text search field. I can type in there and apply a filter with just text.
And that's going to reduce it a little bit, but there are still a lot of results for anxiety. Well, it says right here, there are 44. So, I can also filter on different fields.
Selecting NDA will narrow it down that way, and apply. I guess they're all classified as NDA, because I still have 44 results.
But in addition to that, you can search by Category one or multiple selections, data type, and then apply or reset filters using these buttons.
So, this allows you to kind of narrow down the 3,000 data structures that are currently defined to find the one that you're looking for.
And so, how would you determine which one to look for in the first place?
Well, when you go to create your Data Expected list, it will more or less look like this.
It will have some dates that were set on this item. Well, first of all, it will have one item on it.
If you're collecting genomic data of some kind, it will probably have two. But, at a minimum, it will have this one: it will have Research Subject and Pedigree. So, this structure, Research Subject and Pedigree, is linked to directly right here, and ndar_subject01 is its ID.
Here's a data structure page.
Every project that shares data through NDA is expected to provide this data structure.
This is a summary structure that provides just one record per subject information that characterizes that subject, and then this is relevant especially in genomic studies, pedigree information, things like that.
It's a summary structure that our query tools use, and it is therefore necessary for everyone to upload. It's probably not something you're collecting specifically as part of your project, and if we go to the page, you can see everyone is doing it.
There are over 300,000 participants who have this structure in the system.
So, this is included by default in your Data Expected list as sort of the seed item. It's therefore likely to be the only one when you come in. So, when there's only one here and it says Research Subject and Pedigree, it will actually have a targeted enrollment of one probably.
And then it will have an initial submission date and an initial sharing date that are set based on the sort of default schedule for projects and your award.
So, the dates that you will see here are going to be probably the first data submission period for your schedule that is more than six months after your award was given.
If your grant was awarded in, let's say December of 2019, and your submission schedule operates on the April/ October, bi-annual submission periods, then your initial submission date seeded in your Data Expected list would be October of 2020.
So, in terms of the different submission dates, you should expect to see a July/January submission paradigm if you are getting a grant from the National Institute of Mental Health, and you should expect to see an April/October paradigm if you were awarded a grant from NIAAA, the National Institute on Alcohol Abuse and Alcoholism.
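The seeding rule described above, the first submission period on your schedule that falls more than six months after the award, can be worked through in Python. This is a sketch of the rule as stated in this webinar, computed at whole-month granularity, which is an assumption; the function and dictionary names are made up, and the dates actually seeded in your Collection are the authoritative ones.

```python
from datetime import date

# Submission months per institute, as described in the webinar.
SUBMISSION_MONTHS = {
    "NIMH": (1, 7),    # January / July
    "NIAAA": (4, 10),  # April / October
}


def first_submission_period(award, institute):
    """First submission month strictly more than six months after the award."""
    months = SUBMISSION_MONTHS[institute]
    # Count months from year 0 so that adding months is plain integer math.
    threshold = award.year * 12 + (award.month - 1) + 6
    candidate = threshold + 1
    while (candidate % 12) + 1 not in months:
        candidate += 1
    return date(candidate // 12, candidate % 12 + 1, 1)


# The example from the webinar: awarded December 2019, April/October schedule.
niaaa_example = first_submission_period(date(2019, 12, 1), "NIAAA")
# The same award date on the January/July schedule.
nimh_example = first_submission_period(date(2019, 12, 1), "NIMH")
```

For the December 2019 award, six months later is June 2020, so the first April/October period after that point is October 2020, matching the example given above.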
So, this is pretty much mental health Data Expected from July of 2019. So how do we actually create this list, and how does it relate at all to the Data Dictionary that I was just showing you?
Well, when I click Add New Data Expected, I can see I have this dialog. And it prompts me to enter enrollment.
So, let's say, I'm going to continue this trend of having a targeted enrollment of 100, I'll duplicate the initial submission date that I've seen here, and the initial sharing date, just so they're consistent, and then I'll search. So, you can see, I'm searching for Research Subject and Pedigree, but that's already on here. So, let's search for ABC Community.
Whoops, you can see I typed it too fast, and it overrode the actual title string. So aberrant behavior checklist, this structure. This is one of our, just sort of clinical instruments.
I'm adding that just as an example, so, I went ahead and added that and you can see, now, it's on the list. I'll go ahead and save, just to be safe. It should save automatically, but it never hurts to save. So, I can edit this. I can delete it. Once you upload data, you won't be able to delete these items anymore.
But you can edit it in order to change the targeted enrollment number, which is expected to be the final unique count of subjects in that structure. The submission date and the sharing date are adjustable as your project progresses. So, I'll cancel that.
I'll add a new item.
So, I'll add 100. Let's give it the same dates. Just for consistency, and let's say I'm also collecting raw imaging data.
You can see I search for image, and in case you haven't picked up on it from the data structure search here, this list, what's getting displayed here, is just coming from the Data Dictionary. These are the titles: Image is a title, Body Image States Scale is a title. And biss01, image03, MRI QM one, these things are the structure IDs in the Data Dictionary.
So, by selecting this, I'm indicating here that I'm collecting raw imaging data. That will be uploaded in July of 2019 and is getting added to the list.
And once again, I'll save just to be safe. Although it should save automatically.
Now, in any case, as you go through this and add, you will be adding the data structures that you've identified in the Data Dictionary as matching or being usable for your project. So, one way to approach this would be to look at your grant, look at your project, and determine what you're collecting and go through the Data Dictionary and find a structure that corresponds to all of the instruments you're collecting, or all data types you're collecting.
And once all your data are accounted for, you can go in and add them to the Data Expected list. That would be one way of approaching it in sort of a simplified fashion.
What I would recommend that you do instead, and the way we expect this to play out, is that you can start with those that you can find that are very straightforward, that you can identify as being either usable outright or that need to be modified.
Or you can start by identifying ones that don't exist at all, obviously, and then adding them as new items, which I'll show you in just one second. But the point I'm getting to is that the most important part of this is just get started on adding items to the Data Expected list. Even if it's just updating your enrollment for Research Subject and Pedigree and making sure that dates are realistic based on your timeline and editing that, you know, that's a good start. Going in and adding other really straightforward ones. That's a good start.
Because once you start editing your Data Expected list, only at that point is a data curator assigned to your Collection. We have a team of people who work on the dictionary, and they work specifically on getting people's data harmonized to the Data Dictionary and getting the Data Dictionary updated to accommodate new projects' data.
And the process of getting your data harmonized will ultimately be kind of a back and forth between yourself and the data curator assigned to your Collection.
And that person will be your dedicated point of contact for all things related to your data, or the dictionary as things go forward, but that person won't be assigned to your project until you've started editing this Data Expected list in order to start adding items and getting set up.
So, whether you go through and prepare your whole Data Expected list offline, and then add it as a data entry task, or whether you go through and just identify a couple of structures at a time and add them over the course of a week or so. It's really up to you.
Once we see activity in this direction, a data curator will be assigned, and you will see them in the Permissions tab here when they're assigned.
You can see I'm here, and then there's a couple of test accounts. But in your Collection, this will be you, your PI, other administrators, and then someone from NDA may also be added as a curator to monitor your progress and help out. And as you start submitting these structures or asking questions at the helpdesk, that curator will be your main point of contact. They'll go back and forth with you and try to make sure everything's squared away in time for submission. So, in the event that you do have structures that need to be defined from scratch.
So, say you're collecting something, and you've checked the Data Dictionary, and you can't find any example of it, or anything even remotely close. You need an entirely new structure.
That's also done through the Data Expected list.
And that's one of the most common questions we get is how do I define an entirely novel structure in your system for new data I'm collecting? Well, to do that, you would click Add New Data Expected, and you add it just like these others, except that you would click upload definition and toggle to that menu instead of searching by data structure title.
Let's say I'm still collecting 100 subjects’ data, and we'll stick with our arbitrary dates.
Just for consistency, but, again, these dates you'll be setting based on your own project’s expectations.
I would provide a title, so we're defining this from scratch, so we'll need a title and then you would pick a file to upload.
So, let us say this file, this PDF file. And what you upload here can be a PDF file.
If what you're uploading is a questionnaire, and you have a scanned copy of it, including all of the manuals and coding, in a PDF file, that would be perfectly appropriate. It could also be a zipped archive of multiple files.
Basically, you would just need to attach here whatever is necessary, in your opinion, for NDA to go in, take a look at it, and create an initial draft of the structure.
So, this is what we want to see in terms of new structure definition. We have a name, we have a file, and then a data curator can take that and get started on drafting it out. And once they're done with a draft, they'll contact you, and you can kind of go back and forth with them to make sure everything works the way you want it to.
So, we've discussed the Data Dictionary and how to find structures. We've discussed how to add those to the Data Expected list.
We've discussed how to add new structure requests to the Data Expected list. So, before we move on to an actual file, and then on to experiments, let's take a look at a few structures themselves, so that you're familiar with it, and then that will help as we go into the actual files. So, I'm going to pull back up this example that I've found of this community checklist.
So, when you go into a structure, it will look like this. You'll have this kind of header section, and then the structure itself is just this list of elements.
So, as you can see here, there's this initial block of elements. Well, it includes this comments element, but sex, interview age, interview date, the source subject ID, which is your internal ID for the participant, and the subject ID, which is the GUID of the participant.
So, those five elements are required in every single record of every single structure as the kind of core identifier for this record. Everything else in this structure is an item or a question in this instrument.
You can see, we have the name, there's a type provided, there's a size provided, there's a status of whether it's required or recommended.
There's a description, a value range, there's a notes column, which typically contains the coding for the values. You can see that down here. Here's an example of the notes column being used to provide the sort of underlying values that the code stands for. And then there's an alias column.
So, there are a few things that I want to highlight here. First, is the required versus recommended status in this required column.
Now, a column being recommended just means that if you don't have data in that field, then the system will not block you from uploading.
So, the rule for this is that if you collected the data, and it's in a recommended column, then it must be provided.
If it's in a required column, then the system will technically block you from uploading if you don't have a value in that field. Now, the other pieces of this here, the size, the value range, the notes where the coding is explained:
These are all the things that are really the definition. Those are what you're defining. That's what constitutes this structure and what the validation tool is going to be checking to try to catch any mistakes or errors in either the standard or the data that's been entered into these fields.
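To make the required versus recommended distinction concrete, here is a small Python sketch of that kind of check. It is not the NDA validation tool; the `Element` class, the element names, and the value ranges are hypothetical stand-ins for what a structure definition provides.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Element:
    name: str
    required: bool  # True = Required, False = Recommended
    value_range: Optional[Tuple[int, int]] = None  # inclusive bounds, if any


def validate_record(record: Dict[str, str], definition: List[Element]) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    errors = []
    for element in definition:
        value = record.get(element.name, "").strip()
        if value == "":
            # A blank only blocks the upload when the element is required.
            if element.required:
                errors.append(f"{element.name}: required value is missing")
            continue
        if element.value_range is not None:
            low, high = element.value_range
            try:
                number = int(value)
            except ValueError:
                errors.append(f"{element.name}: {value!r} is not an integer")
                continue
            if not low <= number <= high:
                errors.append(f"{element.name}: {number} is outside {low}-{high}")
    return errors


# Hypothetical two-element definition, for illustration only.
DEFINITION = [
    Element("interview_age", required=True, value_range=(0, 1440)),
    Element("item_1", required=False, value_range=(0, 3)),
]

print(validate_record({"interview_age": "100", "item_1": ""}, DEFINITION))
print(validate_record({"interview_age": "", "item_1": "7"}, DEFINITION))
```

Note that the sketch only mirrors what the system enforces. The rule that collected data in a recommended column must still be provided is a policy a script like this cannot check.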
Now, on a couple of occasions so far, I've mentioned aliases and I've mentioned translation rules. These are two tools we have access to that we can use to make your job a lot easier.
First, let's talk about aliases. So, in this column you can see, GUID is an alias of subject key.
An alias is an alternate element name.
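One way to picture an alias is as a lookup from an alternate column name to the element name in the definition. The Python sketch below assumes the element is spelled `subjectkey` and that alias matching ignores case; both are assumptions for illustration, and how the NDA tool resolves aliases internally is not described here.

```python
# alias -> element name in the structure definition (hypothetical spelling)
ALIASES = {"guid": "subjectkey"}


def canonical_header(header, aliases=ALIASES):
    """Replace any aliased column names with their element names."""
    return [aliases.get(name.strip().lower(), name.strip()) for name in header]


print(canonical_header(["GUID", "interview_age", "sex"]))
```

The practical upshot is that a file whose column is headed with the alias can still line up with the structure without renaming the column on your end.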
This would be a good time to introduce you to a submission template.
Up here in the header, you can see we have a few download options. This is an external link to the source of this particular instrument. So, let's open the ABC community checklist, a blank structured template. This is actually generated on the fly by our web service when you click that. So, this is a template for submission.
So, if I had a project where I was submitting this, in theory, I could just click that, get this template, and fill this in with data, and upload it - it's ready to go. And that's because it has the two sort of core pieces of a structure of a template already configured. I'll show you one that has data populated in a minute, but this is what you will be uploading if you're following the default expected process of uploading CSV Templates via the tool, the Validation Submission Tool.
So, the elements here that are relevant are the first two rows. If I expand this column, you can see we have ABC Community and two. And that matches ABC Community here. These two columns in the first header row must be the two components of this so-called short name, the ID of the structure. That's how the tool and our system know which structure this file is. So that's necessary.
Second is the row of elements. You'll notice that this is basically just the element name column from the definition transposed. That's exactly what it is.
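The two header rows can be read back with a few lines of Python. This is only a sketch of the layout described above: the short name spelling `abc_community`, the zero-padded version suffix, and the element names in the sample are assumptions, not copied from the real template.

```python
import csv
import io


def read_template_header(csv_text):
    """Return (structure_id, element_names) from the first two rows of a template."""
    rows = csv.reader(io.StringIO(csv_text))
    name_row = next(rows)     # row 1: the two components of the short name
    element_row = next(rows)  # row 2: the element names, one per column
    short_name, version = name_row[0].strip(), name_row[1].strip()
    # Assumption: the version becomes a zero-padded two-digit suffix.
    structure_id = f"{short_name}{int(version):02d}"
    return structure_id, [name.strip() for name in element_row]


# Hypothetical template text; the column names are illustrative only.
template = "abc_community,2\nsubjectkey,src_subject_id,interview_date,interview_age,sex\n"
structure_id, elements = read_template_header(template)
print(structure_id, elements)
```

Everything from the third row down would then be data, one record per row.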
So, each of these is one of the elements and let's go ahead and open now one that has data in it and then you'll be able to see what I'm talking about a little bit better.
So here we go. This is fake data.
And you can see, it's fake data, so this is what it looks like with data in it.
Each of the rows is one subject's record at one time point. And then each column is an element.
And in each row for an individual's visit or time point, it will have the data, the value for that visit or time point. So, you can see here how it's organized in a spreadsheet, ultimately, and this also demonstrates what your data will look like if they're longitudinal.
So, this spreadsheet is actually a demonstration of what it would look like
if you had two subjects who were each coming in for five different visits, each a month apart. Across a subject's visits, they have the same ID.
They have a different date, and then their age is increasing. The age must be in months, by the way. That's why it's 100, and so on.
So, the age in months is going up, that's matching the time passage, and then the IDs are the same.
And each column contains the appropriate data value for that visit. And this is obviously a very simple example, but ultimately this is what your data will look like.
It will be like this in spreadsheets with the subject ID and the data going out in the columns.
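If your data live in an internal capture system, rows like these can be generated rather than typed. The Python sketch below builds the same shape as the demonstration: two subjects, five visits each, one month apart, with age in months increasing. The element names, the MM/DD/YYYY date format, the sex codes, and the placeholder IDs are all assumptions to check against the actual structure definition.

```python
import csv
import io
from datetime import date


def add_months(day: date, months: int) -> date:
    """Shift a date by whole months (used here only with day 1, so always valid)."""
    index = day.year * 12 + (day.month - 1) + months
    return date(index // 12, index % 12 + 1, day.day)


def build_rows(subjects, first_visit, visits=5):
    """One row per subject per visit; the IDs repeat, date and age advance."""
    rows = []
    for guid, local_id, sex, age_months in subjects:
        for visit in range(visits):
            rows.append({
                "subjectkey": guid,
                "src_subject_id": local_id,
                "interview_date": add_months(first_visit, visit).strftime("%m/%d/%Y"),
                "interview_age": age_months + visit,  # age in months
                "sex": sex,
            })
    return rows


# Placeholder identifiers, not real GUIDs.
subjects = [("GUID_PLACEHOLDER_A", "S001", "F", 100),
            ("GUID_PLACEHOLDER_B", "S002", "M", 112)]
rows = build_rows(subjects, date(2019, 1, 1))

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

A real file would also carry the structure's short name row above the element names, plus the instrument's own item columns to the right.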
So, if you can configure a way on your end to have those files result automatically from your own internal capture system, that’s something that some people have found very helpful. Otherwise, manual data entry is the ultimate fallback position.
In terms of the actual structure, this is what we're talking about. So, a structure like this with any elements and all the different parameters defined is going to be ultimately required for all the data you're uploading.
So, to kind of go back to our framing here, we have been working on the Data Expected list. We've used the Data Dictionary to identify structures that need to be created from scratch. I've added my request for that to the list. I've added existing structures we can use right out of the box to the list.
And one thing that doesn't happen in the Data Expected list is edits to existing structures. Once you have a data curator, those requests can be sent to that person directly via e-mail, or e-mailed to NDAHelp@mail.nih.gov, our helpdesk address.
And now we know what our data will need to start to look like.
So now, just to wrap up, I'm going to show you what this looks like with a more complicated kind of data with imaging data. And then, we'll take a quick look at experiments, and then that will wrap up the live presentation.
So, to do that, I have my image03 spreadsheet prepared, and as I said before, you know, even these other types, they're all going to have a CSV spreadsheet like this. This one is just going to be pointing to other files as well.
So here we have the same kind of spreadsheet, the elements out here in these columns.
And then you can see, this is the same person coming in, once per month, five times in a row.
In most respects, this is exactly the same as the spreadsheet I was just showing you. The data just need to be populated in these columns.
The key differences are associated files and experiments, which are these two columns.
So, first, let's deal with associated files. This has an element called image file. Similar structures, and by similar I mean the other structures that are type-specific file metadata structures, like EEG subject and so on,
EEG sub files, rather, will also have a file type element like this.
And what goes in that element is not data you've collected, but it's actually a path to the file in question.
So, subject X here, in their visit on January first, 2009, et cetera, had these scans in these files taken and then you know that there's data about the rest of the files throughout the rest of this spreadsheet.
So, you can zip up all the scans that were done by one person under the same parameters on the same day and upload that in this element.
There are ultimately more file elements you can use to add more data, up to some number of individual files, if you prefer not to zip too many of them. Then, they need to be located in a directory on your local system, the same system as this file. And then the tool, as it parses this spreadsheet, will automatically go and identify those files and get them.
So, you can see here, I have really the absolute path, starting at the drive root and then going through my user folder into this demonstration folder in the image03.
You'll notice this duplicates exactly the directory structure I see, when I go here, my Demonstration folder.
And then this image dot PNG is, of course, exactly the file I'm referencing here.
Now, if you're uploading a zip file full of DICOMs for this person, it's not going to be this simple PNG, but this is the principle underlying the organization of this: it must take a path to the file in question. It can be an absolute path like this. It can also be a relative path, starting at the location, the working directory of this CSV spreadsheet itself when it goes into the tool. But you can't go wrong with an absolute path like this.
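Before handing a spreadsheet to the tool, you could check the paths yourself. The Python sketch below assumes a relative path is resolved against the folder containing the CSV, which is one reading of the description above; the function names are hypothetical and this is not how the NDA tool is implemented.

```python
import tempfile
from pathlib import Path


def resolve_associated_file(csv_path, cell_value):
    """Turn the value in a file element into a full path."""
    candidate = Path(cell_value)
    if not candidate.is_absolute():
        # Assumption: relative paths start at the CSV's own folder.
        candidate = Path(csv_path).resolve().parent / candidate
    return candidate


def missing_files(csv_path, cell_values):
    """Return the non-blank cell values that do not point at an existing file."""
    return [value for value in cell_values
            if value.strip() and not resolve_associated_file(csv_path, value).is_file()]


# Small self-contained demonstration using a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "image03.csv"
    csv_path.write_text("placeholder\n")
    (Path(folder) / "image.png").write_bytes(b"")
    absolute = str(Path(folder) / "image.png")
    result = missing_files(csv_path, [absolute, "image.png", "not_there.zip"])

print(result)
```

In the demonstration, both the absolute and the relative reference to `image.png` resolve, and only the file that was never created is reported.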
And once those are populated correctly, the tool will check that. If it fails to find one, you'll know, and you'll need to make sure that the path is correct, or that the directory structure is correct. So, that's how the files are uploaded. It’s basically just as simple as getting the path into the file type element. And then, let's look now at experiments.
As I said before, this MRI is not going to require an experiment. Only functional MRI does, and then the other experiment types do, like EEG and omics, and so on. So, for that reason, the experiment ID here is going to be empty for those, and in this purported fMRI record, it will be filled in. You can see the scan type's different. That's how I know that's fMRI. So, let's take a look at Image03.
Just as a navigation aid here, these little “i’s” can show you links directly to the structures for each of your Data Expected items.
So, here's Image. Here's my image file structure.
And here's my experiment ID. So, this is conditional; a few of these are conditional. And those go into these logical operators.
And if the operator's met, then it becomes required, and if not, then it's not necessary. When the scan type is fMRI, then an experiment ID must be provided. So, let's go back to our Collection and take a look at the experiments tab, which is where those are defined. First, I would click Add new Experiment.
Add a new experiment.
You can copy existing experiments as a base to start from in the event that the experiment is very similar.
Let's stick with FMRI.
I'm only going to show it for fMRI, but all the different types have their own slightly different, but basically the same in principle, kind of interface. They all look like this, more or less. And they require you to add a name. So, let's give this a name like new experiment.
And as I alluded to before, and we saw on the screenshot, you have the option to select your scanning equipment, software, what equipment was used to present stimuli, if any.
And then, adding events or blocks to define your experimental design, including computer files that are used, audio files, video files, if those are stimuli you used. Then, whether it's post processed or not, allows you to provide these other files in support of it.
So, these are designed to be as general as possible, accommodate all different kinds of experiments, while allowing you, the original researcher, to provide as much information as necessary for someone else to come in later and understand how to reproduce this. So, it's basically just this form you fill out, and as you do so, you will notice it's extensible.
So, if there's software you use that isn't there, you can click Add New.
Then select the parent node under which you want to insert this element, and then you can add a new one.
So, if you were using a new kind of neuro behavioral system software, you could add it here, name it, and then use that yourself.
And that way, as more people define experiments, we'll have more possibilities already included, as that will become available for others to select in the future. So, once you've filled this out and saved it, this will be your ID, and then you would need to enter that in the experiment ID column of your data file for any records that were collected using those experimental parameters.
So, 417 is kind of a sample experiment that's just plugged in here. You need a number in this row. For rows where it's not necessary, you can just leave those blank. That's another important point.
You don't need to put a missing code in every single blank cell in your data.
If you didn't collect it and it's not required, you should simply leave it blank in the structure.
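The conditional rule for the experiment ID can be expressed as a short check. This Python sketch assumes the columns are named `scan_type` and `experiment_id` and that fMRI rows are labeled exactly "fMRI"; these are illustrative assumptions, and the scan type labels in the sample rows are made up.

```python
def experiment_id_errors(rows):
    """Flag fMRI rows with no experiment ID; blanks elsewhere are fine."""
    errors = []
    for number, row in enumerate(rows, start=1):
        scan_type = row.get("scan_type", "").strip().lower()
        experiment_id = row.get("experiment_id", "").strip()
        if scan_type == "fmri" and not experiment_id:
            errors.append(f"row {number}: fMRI record needs an experiment ID")
        elif experiment_id and not experiment_id.isdigit():
            errors.append(f"row {number}: experiment ID {experiment_id!r} is not a number")
    return errors


# Hypothetical rows; 417 is the sample experiment ID from the demonstration.
rows = [
    {"scan_type": "structural MRI", "experiment_id": ""},  # blank is allowed
    {"scan_type": "fMRI", "experiment_id": "417"},          # condition met, ID given
    {"scan_type": "fMRI", "experiment_id": ""},             # condition met, ID missing
]
print(experiment_id_errors(rows))
```

Only the third row is reported, which matches the point above: leave a cell blank when the value is not required, rather than filling it with a missing code.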
So, at this point we've covered the GUID, we've covered the Data Dictionary, we've covered how to create your Data Expected list by adding to it.
We've covered that the process will primarily involve you working back and forth with the data curator who will be assigned after you start your Data Expected list.
We also covered which web pages you can use to find written information about this.
Let's go back to our final point of knowing where to get help. You can e-mail us at any time for assistance.
NDAHelp@mail.nih.gov is our helpdesk e-mail address available during all normal East Coast business hours.
If you have any questions at all or any comments, do not hesitate to contact us. We have a lot of people available and dedicated to trying to make your data harmonization project and the subsequent depositing and sharing of data as efficient as possible.
The goal is to just put you on the right path and get you started in understanding what our system looks like, and how your data will ultimately need to get set up in order to be uploaded. So, with that, thank you very much for joining me this afternoon. And once again, please let us know if you need any assistance. Thank you.