Office for Disparities Research and Workforce Diversity Webinar Series: Advancing Methods and Measures to Examine the Underlying Mechanisms of Violent Deaths in LGBTQ Populations
TAMARA LEWIS JOHNSON: Good afternoon and welcome to the 2022 National Institute of Mental Health LGBTQ Mental Health Lecture. My name is Tamara Lewis Johnson, and I am the chief of the Women’s Mental Health Research Program at the Office for Disparities Research and Workforce Diversity at the NIMH.
The purpose of the webinar series is to spotlight research on mental health disparities, women’s mental health, minority mental health, and rural mental health.
This afternoon we are spotlighting the research of Drs. Cochran, Arseniev-Koehler, Foster, Chang, and Mays. This research was funded by the NIMH’s Division of Services and Intervention Research.
Now, I will say a little bit about the topic. Due to multiple minority stressors, LGBTQ individuals are at increased risk for suicide and violent death. Health research literature suggests that LGBTQ people are more likely to die by violent death than their non-LGBTQ peers. But because sexual orientation and gender identity data are not systematically collected at the time of death, actual mortality disparities and the context of these deaths are not well known.
This research fills a gap in LGBTQ research and will transform the status of mortality data collection to better identify sexual orientation/gender identity data, which facilitates studies to guide prevention and intervention research around preventable deaths among LGBTQ populations.
The researchers will discuss their efforts to measure sexual orientation/gender identity using postmortem data when there is a violent death among LGBTQ individuals and ways to improve surveillance and electronic medical record systems for sexual orientation/gender identity collection, use, and quality.
In addition, they will present findings using data from the National Violent Death Reporting System to better understand violent deaths among LGBTQ individuals, along with methods for increasing the ability to identify sexual orientation/gender identity after a violent death has occurred. Together these researchers present evidence-based data and analytical techniques that demonstrate the need for better usability of mortality data on LGBTQ populations.
Now, I will introduce the speakers. The first is Dr. Susan Cochran. She is a professor in the Department of Epidemiology at the UCLA Fielding School of Public Health and in the Department of Statistics. By training, she is both a clinical psychologist and a social epidemiologist. Her research interests focus on the psycho-social determinants of health and health-related behaviors and the role of social stigma and discrimination in health care access, health behaviors, mental health, and health outcomes.
With funding from NIMH over the years, she has investigated patterns of sexual risk taking related to HIV infection control, the burden of mental health and substance use disorders among sexual and racial and ethnic minorities and serves as a principal investigator of the California Quality of Life Survey, a statewide population-based study of mental health over time among sexual minorities and heterosexuals.
Dr. Cochran will take questions immediately following the presentation. All other questions will be held for the end of the panel of all the speakers.
Dr. Cochran will be followed by Dr. Arseniev-Koehler and Dr. Jacob Foster. Dr. Arseniev-Koehler is a computational and cultural sociologist with substantive interests in health, language, and social categories. Her research aims to clarify core concepts and debates about shared meanings, such as stereotypes. Empirically, she focuses on cases linked to health and inequality such as the moral meanings attached to body weight, the stigmatizing meanings of disease, and gender stereotypes. To investigate these topics, she uses computational methods and machine learning, especially computational text analysis.
Dr. Jacob Foster is a computational sociologist interested in the social production of collective intelligence, the evolutionary dynamics of ideas, and the co-construction of culture and cognition. His empirical work blends computational methods with qualitative insights from science studies to probe the strategies, dispositions, and social processes that shape the production and persistence of scientific and technological ideas. He uses machine learning to mine the cultural meanings buried in text, and computational methods from macro-evolution to understand the dynamics of cultural populations.
Dr. Foster also develops formal models of the structure and dynamics of ideas and institutions, with an emerging theoretical and empirical focus on the rich nexus of cognition, culture, and computation.
Dr. Chang is an associate professor in the Department of Computer Science at UCLA. He leads a natural language processing group. His research includes designing robust machine-learning methods for large and complex data and building fair and accountable language-processing technologies for social-good applications. Dr. Chang has published more than 120 papers in machine learning, artificial intelligence and natural language processing.
Dr. Vickie Mays will close out the panel. She is the distinguished professor in the Department of Psychology in the College of Letters and Sciences and in Health Policy and Management at the UCLA Fielding School of Public Health, and a senior fellow in mental health at the UCLA California Health Policy Research Center. She is originally trained as a clinical psychologist as well as a professor at the UCLA School of Public Health.
She has served for 15 years as an NIH P60 center director on minority health disparities. Her research program in mental and physical health disparities among racial, ethnic, and sexual minorities has a particular emphasis on identifying how various statuses associated with these disparities intersect to produce specific and sometimes unique pathways for negative mental and physical health outcomes. Two current foci of her work are big data and COVID-19. Her NIMH-funded work focuses on the development of artificial intelligence methods for the classification of race, ethnicity, and sexual orientation and big data on suicides and homicides. She is funded to develop a predictive model of health equity for the mitigation of COVID, specifically for the black population, focused on the identification of social vulnerabilities.
She has served on a number of NIH, National Academies of Sciences, Engineering, and Medicine, federal, state, and local board committees. She currently is a congressional appointee to the National Committee on Vital and Health Statistics where she serves as the co-chair of the workgroup to assess SOGI, sexual orientation and gender identity, and structural determinants of health data measures, definitions, collections, use, and protection.
And now, we will go to some housekeeping. These are our housekeeping notes. Participants who have entered are on mute or in listen-only mode. Your cameras have been disabled. Participants may submit questions via the Q&A box at any time during the webinar. Please address your question to the intended speaker. Questions will be answered towards the end of the webinar during the Q&A session with the exception of Dr. Cochran’s presentation, which is the first one.
If you have technical difficulties hearing or viewing the webinar, please note these in the Q&A box and our technicians will work to fix the problem. You can also send an email to the email on the screen.
All webinars in this series are being recorded and will be made available in the coming weeks. CEUs and certificates of attendance are not being offered for this webinar/lecture.
With no further ado, I will have Dr. Susan Cochran take it away.
SUSAN D. COCHRAN: Good morning, everyone. Welcome to this seminar. I am going to be speaking first in the series to kind of set the framework for the work that we have been doing at UCLA and so in this talk, what I would like to do is spend some time talking about why we submitted an R21 to do this work and spend some time also talking about the challenges of using the database that we decided to use, which is the National Violent Death Reporting System, NVDRS, some of the early attempts that have been made by others to use this database, and then share with you some of our preliminary findings for work that we are still working on.
It is sometimes hard to realize that the measurement of sexual orientation in health data is a relatively new phenomenon. Following the onset of the HIV epidemic, the measurement of sexual orientation started to appear inside of different health data systems, and here in this slide, I am showing the appearance date of these measurements. One of the effects of including these measurements in health data was to revolutionize the field of the study of sexual orientation and health. Sorry I am having difficulties with my mouse.
Before the inclusion of these measures, we knew that there were high rates of suicide related morbidity in LGBT populations. But we could not compare those results to other populations to see whether or not they were higher or lower. We just knew they were high.
And what happened when sexual orientation was included as a variable in regular health data sets was that for the first time, we could compare these high rates of suicide morbidity that we were seeing with comparison groups and so over time, we were able to demonstrate without any doubt that there were much higher rates of suicide ideation, suicide attempts. It did not matter which population you were looking at, young people, older people, men, women. It did not matter whether you were talking about suicide ideation in the last year or lifetime. There just was this much higher rate of suicide morbidity in the population.
That effect remains today and so even though there has been quite a bit of social change that has occurred in the last 35 years since we started being able to measure this effect, it is also true that this effect has not waned. This effect has not disappeared even though social change has led to more rights for LGB individuals. It has led to more discussion. There are more people who are out. They are able to come out at work. There are all sorts of ways in which life has changed dramatically for sexual minorities in the United States. But we still see this higher rate of suicide attempts particularly among younger individuals.
Whether this higher rate of suicide attempts is also matched by a higher rate of suicide mortality is difficult to determine within the United States health data systems. There is evidence from Europe, where they have different data systems, that the rate of suicide mortality is higher among sexual minorities. There is also some emerging evidence from Canada, where they have a different data system, that there are somewhat higher rates of suicide. But here in the United States, we are hampered in being able to determine whether or not there are higher rates of suicide simply because our data sets are just not set up to determine that.
We do not measure sexual orientation in the death certificate. We cannot tell from a death certificate whether a person might be gay or bisexual or heterosexual or whatever. We have trouble linking data from our health data where we do have those measurements into our mortality statistics for a number of different reasons.
Given this problem of trying to figure out is there a higher rate or not of mortality, our group started thinking about is there a data set out there that we can use that will allow us to answer this question of whether or not there are higher rates of suicide mortality in this population.
That led to this R21 application moonshot that we submitted. We said we are going to take the National Violent Death Reporting System database, which is an administrative database. We knew that database was large. That is what you need to be able to do this kind of research. We also knew that the database had begun to measure sexual orientation and transgender identity of the victims whose cases are reported inside this database. But we also knew that it was incomplete. That even though the database had begun in 2003 that the inclusion of this measurement only started to come into the database around 2012. We had the idea that we could take the narratives that are in this database and use the narratives to determine sexual orientation of all those cases that had not yet been coded. That was the plan behind the R21. That was what we applied for when we submitted our application to NIMH. We were fortunate that NIMH review panels saw the value of that work and we were funded to begin this work.
Now that we were funded, we actually sat down with the database and we started trying to think about how can we make use of this administrative database to determine sexual orientation of the now more than 400,000 victims whose cases are reported inside this database. That led us to all the challenges that exist in the NVDRS.
Let me spend a few minutes just talking about what is the NVDRS. This is a database that was begun in 2002 by the Centers for Disease Control. Gradually over time, different state health departments have come into this database and as of now, all 50 states and Puerto Rico are included inside this database. And what happens is that at the state level, public health workers take information about a death that has been classified as a violent death, which is a death that is by suicide, by homicide, or by some sort of unintentional injury that happens to the person. Most of the deaths in the database are suicides or homicides.
The public health worker – so there are hundreds of public health workers who are doing this work – follows a CDC coding manual and they fill out the information in the coding sheet. And this coding sheet is then transmitted electronically to the Centers for Disease Control and the case is included in the database.
The form that they are filling out – they fill it out using information that they have available at the time. They have information from the medical examiner. They have information from the law enforcement reports. They have the death certificate. They might have toxicology reports. They even can sometimes look at other information that we do not have available on our end. They might be able to look at social media or newspaper reports about the death. And then from this information, they fill out the form and they also write two narratives about what happened with the death, who the person was who died and perhaps some of the things that were going on at the time around when they died.
Before I go to the next slide, I wanted to put a triggering notice in this slide because some of the things that I am going to show might be upsetting and some of the later presentations also might have some upsetting information in the slide. I just wanted to pause for a moment, let you all take a breath. We are talking about violent death here. There may be some things that you might want to pause the presentation and then come back to it to listen to.
What I show here is a couple of examples of these death narratives. These are not actual death narratives. These are death narratives honestly that I wrote because this is a restricted data set with the CDC. When you obtain the data set, you agree that you are not going to disclose this information because of privacy for the people who are left behind by these victims.
This narrative shows you that typically they follow a similar format. They are brief. They describe the person. They describe some of the events that are going on around the person. There can be up to two narratives for each death.
In the NVDRS, the sexual orientation classification is a very rare event actually. In this slide, I show you that most cases in the NVDRS have unknown classification of sexual orientation. The reason for this unknown classification of sexual orientation is because of the instruction to the coder. The public health coder can only code this information if in fact sexual orientation is mentioned in the narrative. And if it is not mentioned in the narrative, they are not allowed to code the data. That is one of the reasons why so many cases are missing.
The other thing to recognize about the NVDRS is that this is a very different measurement of sexual orientation. It is not a person’s individual identity. But it is a phenotypic assessment of the individual. It is what people around that individual at the time of their death say about them and whether or not their death happens in the context of something related to sexual orientation or transgender identity.
And then finally, the last thing I want to mention is that base rate really matters here. You can have an algorithm that perfectly predicts. But if you have a very rare event you are trying to predict to, you are going to have problems in accuracy.
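An illustrative sketch of the base-rate point, added for readers and not part of the spoken remarks: even a classifier with high sensitivity and specificity yields mostly false positives when the target class is rare. The function name and all numbers below are hypothetical and are not NVDRS figures.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Share of positive predictions that are truly positive (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical classifier: 95% sensitivity and 95% specificity in both cases.
ppv_rare = positive_predictive_value(0.95, 0.95, prevalence=0.01)    # rare class
ppv_common = positive_predictive_value(0.95, 0.95, prevalence=0.30)  # common class

print(f"PPV at 1% prevalence:  {ppv_rare:.3f}")
print(f"PPV at 30% prevalence: {ppv_common:.3f}")
```

Under these assumed numbers, most flagged cases in the rare-class scenario are false positives, even though the classifier itself is unchanged.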
There have been a couple of studies that have tried to use this database with some success. One was Ream, who chose to just say I am going to use the classified data and look at differences. Those are very few cases in the database and there is a very strong assumption here that all the ones that are missing are missing at random, which they probably are not.
Another study came out of our group, which was done by Kirsty Clark. And what she did was she searched. She used qualitative methods to search the database for indicators of sexual orientation and transgender status. Then she went back through the narratives that were identified and had readers read them and come up with a very good classification of whether or not there was a minority sexual orientation mentioned in the database and whether or not bullying occurred. She was able to show that bullying is a very strong component of suicides where people are LGBTQ in the NVDRS.
And then finally, I just wanted to share with you some of the work that we have been doing, which is that we have been trying to use information that is in the database and in the narratives to try to predict to sexual orientation status. We have a couple of gold standards in the database. We have the actual coding for LGBT status. We have information from the CDC on a study that they did where they coded for LGBT status.
And then we had these markers that we have developed by going through the narratives, trying to code information that we think might be indicative of sexual orientation, including terms related to sexual orientation and also the nature of the relationships that people are in whether they are the same or different gender, whether or not HIV is mentioned or whether or not a family member is mentioned in the narrative.
From this, we have been able to do a pretty good job of predicting what the sexual orientation is of the victim in the death record. We are still not at a level where we are ready to report it. We think we are there, but we are still tweaking some things, trying to come up with some better solutions.
And in the process of doing this work, we realize that we needed to take a step back and think about these narratives in a slightly different way. We needed to expand our team and bring on experts in some other fields. In the next talks, what you are going to hear is from some of the people who we started to bring into our team to be able to think about these narratives in a slightly different way than what we have been thinking about.
I am going to stop here and pass it on to the next speaker, who is Alina.
TAMARA LEWIS JOHNSON: Actually, Dr. Cochran, I am going to step in just for a minute. We do have a question from the audience for you, Dr. Cochran. Do you want to turn your camera back on? The question is why is having a mother or brother in the narrative potentially indicative of LGBTQ identity?
SUSAN D. COCHRAN: Well, it is actually potentially indicative of heterosexual identity. One of the ways to think about what we are trying to do here is we are trying to predict the sexual orientation or the transgender status of the case whether it be LGBTQ or whether it be heterosexual. It turns out that it is predictive of being heterosexual to have kin, family members, children, brothers, sisters, aunts, and uncles described inside the narrative.
TAMARA LEWIS JOHNSON: Thank you very much. And then I have one other question. What factors do you suggest researchers consider to improve gathering sexual orientation/gender identity characteristics from the National Violent Death Reporting System?
SUSAN D. COCHRAN: I think that question might be better answered at the end of the presentation by Dr. Mays.
TAMARA LEWIS JOHNSON: Here is one more. Why is a mention of HIV/AIDS being used as a potential indicator for sexual and/or gender minority status? This may feel incredibly stigmatizing.
SUSAN D. COCHRAN: One of the problems with not using HIV is that you are looking – you are deciding I am not going to look at information in a database that will help me to identify a population when that population we know is likely to have higher rates of HIV infection than other populations. It is not a question of stigmatizing a group. It is actually important to measure it.
When we first started measuring sexual orientation in health surveys, this concern about ‘you are going to stigmatize people by asking them what their sexual orientation is’ was used as a reason not to measure sexual orientation. If we had not measured sexual orientation, we would not know today that sexual orientation is associated with higher rates of suicide. We just would not know what we know.
One of the things that we have to think about in the field is that we can worry about whether something is stigmatizing or not or we can use a correlation that we see is there to find out something. To my mind, it is important that we know. It is much more important that we know the health risks that are in our community. We are not saying by using it that everyone who has HIV is gay. But I do know in the database that most of the people for whom HIV infection or AIDS is mentioned in the narrative are gay men. I know from reading the database, from reading the narrative. Ignoring that information because of some concern that you are stigmatizing a population I think is honestly – does a disservice to the community. I am just going to say that.
I think if we had another measure that said that all gay people are over 7 feet tall, I would certainly use that to try and identify cases in the database. This is I think the difference between using what you have and wishing it is something else. I am just going to say that. I know politically that may not be a positive way of saying it.
This is, for example, why we are using this measure of kin being mentioned in the database. We have discovered that the mentioning of uncles and aunts is a predictor of heterosexuality. It is not a statement about heterosexuality. It is just a co-occurrence that we observe, and we are going to use it.
TAMARA LEWIS JOHNSON: Thank you so much for responding to those two questions, Dr. Cochran, and for your talk this afternoon.
We are going to proceed on with the next speaker who is Dr. Arseniev-Koehler. Take it away.
ALINA ARSENIEV-KOEHLER: Thank you. My name is Alina Arseniev-Koehler. This part of the presentation will go back and forth between myself and Professor Foster.
One of the big points I want to make is that these narratives and the NVDRS exemplify some of the promises and challenges of working with text data more generally. Text data is something that is all around us. It is very prevalent. We see it in things like social media, in the NVDRS, in other surveillance systems and electronic health records. It is not the kind of data that we can just ignore.
It also has some information that is different from what might occur in structured data. We saw that with sexual orientation, and this also occurs in many other formats as well.
At the same time, text data is very difficult to work with and we will see that with the NVDRS. Even just summarizing a large set of narratives like the NVDRS is very difficult. There is a lot of ambiguity. Words can have multiple meanings. It might be very domain specific, so you have things like abbreviations. You have typos that might be messy. That might be okay for a human reader. But it is a lot more challenging when we are trying to offload some of this work to a computer, which we have to do when we have things like hundreds of thousands of narratives in the case of the NVDRS. For these reasons, text data often goes underutilized and that includes these hundreds of thousands of narratives in the NVDRS.
We have this case of a broader methodological problem, which is how do we wrangle this information in a way that is scalable. Generally, you can think of this massive haystack of data, and we want, at least to start, to summarize it and pick out some of the important needles or patterns going on in this haystack. This is one of the core challenges that we have focused on in getting to know these narratives and that we will talk a little bit about today.
In general, one of the core solutions in computational text analysis to summarizing a large haystack of data like the NVDRS is an approach called topic modeling. The basic idea of topic modeling is that it is a way to get to know a large data set. It asks two questions. One, what are the basic building blocks or themes or topics that are comprising this large data set? Two, how is each specific text like each narrative drawing on these building blocks? What topics are involved in a given narrative?
Topic modeling is very useful because it is scalable so you can use it with data sizes that are really big or medium size. It also is inductive, meaning we do not need to read through the hundreds of thousands of narratives to be able to get out these topics. We do not need to know ahead of time what are the important topics across these hundreds of thousands of narratives or whatever text data we are working with.
I am showing you here a table on the right, which is topic modeling applied to another data set called the Malawi journals project just to show you a case where topic modeling worked really well. This data is a set of journals that are collected by participants in Malawi from 1999 to 2012. They were asked to document conversations around them about HIV/AIDS. This was a supplement to a quantitative survey, which the researchers were frustrated by because they felt like they were not getting some of the information they wanted.
The result was like a thousand notebooks, each of 7000 words. It is a lot of text data and that means it is too much again for one or even a handful of humans to go through and read and summarize and get a sense of what this data is trying to tell us.
When they applied topic modeling, they were able to get out a series of topics that summarize the data. I am showing this table of topics so each row here is a topic. And what the computer actually spits out for a given topic is in this blue box – it is a list of words or terms or phrases, which are the most representative words for that topic. It is a way for you to get a sense of what this topic is about.
Once you see this list of terms, you can then interpret the terms or look at the documents that have the highest loading of this topic and perhaps add a description like they did in this table. Perhaps you also instead of calling it topic number 3, you give it a label like in this case. They call this particular topic condoms.
A little bit more specifically, there are lots of different approaches to topic modeling, but in general, what it gives you is two different products. You can think of these as two different data frames. First, we get on the left this blue matrix or data frame and what this is showing you is the loading of each word in the vocabulary on each of your topics.
You can think of topics kind of like when we do factor analysis. Out of all the different ways that words are co-occurring in the corpus, you are trying to extract these latent groups of words that tend to co-occur and have these grouped patterns. You can think of topics kind of like what we are doing with factor analysis where you are pulling out factors.
From this blue matrix, we can identify which topics are characterized by which words because we just look at the words with the highest loadings on a given topic.
And you also get this purple matrix or data frame on the right. Here, we have each row in the matrix or data frame is a document such as an NVDRS narrative. And then a column here is the loading of that narrative on each of your topics. Again, you can characterize a given narrative by which of the topics it has the most of or look at the distribution or profile of a document based on its different topics.
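To make the two outputs concrete, here is a minimal sketch added for readers, not part of the spoken remarks. It does not fit a topic model; it only shows how the two matrices are read once a model has produced them. The topic names, words, document names, and loadings are all hypothetical.

```python
# Topic-by-word loadings (the "blue" matrix): hypothetical values.
topic_word = {
    "topic_0": {"condom": 0.40, "use": 0.25, "protect": 0.20, "clinic": 0.10, "funeral": 0.05},
    "topic_1": {"funeral": 0.45, "died": 0.30, "sick": 0.15, "clinic": 0.07, "use": 0.03},
}

# Document-by-topic loadings (the "purple" matrix): hypothetical values.
doc_topic = {
    "journal_a": {"topic_0": 0.8, "topic_1": 0.2},
    "journal_b": {"topic_0": 0.3, "topic_1": 0.7},
}

def top_words(loadings, k=3):
    """Return the k words with the highest loading on one topic."""
    ranked = sorted(loadings.items(), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:k]]

def dominant_topic(mixture):
    """Return the topic a document loads on most heavily."""
    return max(mixture, key=mixture.get)

for name, loadings in topic_word.items():
    print(name, top_words(loadings))
for name, mixture in doc_topic.items():
    print(name, dominant_topic(mixture))
```

A human reader would then label each topic from its top words, as in the table where one topic was labeled condoms.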
And then moving back to the Malawi journals project, we can now think of this as a structured variable. We went from unstructured text data to having these topics that are characterized in the documents and we can use them like any other structured variable. This is when things start to get really cool because we can start to look at things across covariates like time in this case. What you are seeing here is the prevalence across time from 1999 to 2012 of two different topics, one about dying and AIDS and another about antiretrovirals and health. You can see that at first, the topic about dying of AIDS was very prevalent. It kind of petered off. Meanwhile this topic of antiretrovirals and health was slowly increasing in prevalence.
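As a sketch of how topic loadings can be treated like any other structured variable (added for readers, not part of the spoken remarks), the code below averages each document's loading on a topic within each year. The records, topic names, and loadings are hypothetical and are not taken from the Malawi journals data.

```python
from collections import defaultdict

# (year, document-topic loadings) pairs: hypothetical values for illustration.
records = [
    (1999, {"dying_of_aids": 0.7, "arv_and_health": 0.1}),
    (1999, {"dying_of_aids": 0.5, "arv_and_health": 0.1}),
    (2012, {"dying_of_aids": 0.2, "arv_and_health": 0.6}),
    (2012, {"dying_of_aids": 0.1, "arv_and_health": 0.4}),
]

def prevalence_by_year(records, topic):
    """Mean loading of one topic across the documents written in each year."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, mixture in records:
        totals[year] += mixture.get(topic, 0.0)
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}

dying = prevalence_by_year(records, "dying_of_aids")
arv = prevalence_by_year(records, "arv_and_health")
print(dying)
print(arv)
```

The same grouping could be done over any other covariate attached to the documents, not only time.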
The problem is that these traditional topic models, which are kind of our key tool to summarize large-scale text data, did not actually work so well on the NVDRS narratives. I am showing you here in this table some of the topics that we got out when we first tried these traditional topic models. They are okay. You can see these most representative terms. They look fine. It is not like they do not make any sense but they were not very useful. They are not telling us much more than we get from the hundreds of structured variables already there.
Some of the possible reasons could be because this is really specific language. It is not like general English. There are typos. The document lengths are all over the place. That could be some of the reasons the topic modeling did not work well in the NVDRS. But those are characteristics that a lot of different data sets have, not just the NVDRS. If topic modeling did not work well, there is probably a large amount of data out there that topic modeling also does not work well on.
That led us to go back to thinking about basic science and methods and develop a new approach to model topics or summarize large data, including the NVDRS but learning lessons that will apply well beyond NVDRS.
And this model that we developed is called the discourse atom topic model. The novelty here is that compared to other approaches to model topics, we integrate another key text analysis approach called word embedding, which we will explain next.
But the result and the big picture is that we still get these two outputted data frames, which structure this text information for us. Each topic is still interpreted by the most representative words. And each document is still characterized as a mix of topics.
Now, I will turn it over to Professor Foster, who will explain a little bit more about this other key approach.
JACOB G. FOSTER: As Alina mentioned, the other critical ingredient of our new approach is this method called word embedding. Word embedding describes a set of techniques that solves a critical problem, which is how do you represent human language on a computer.
Fundamentally, computers work with numbers or with things built out of numbers. You could, for example, just assign a distinct number to each word. You could say the word hello. It will get the number 172. The problem with this approach is that the sort of things that it makes sense for you to do with numbers like adding them or subtracting them or seeing how close or how far apart they are from one another, those do not really translate into things that it makes sense to do with words. And that is because that simple recipe I just gave you does not capture the kind of structure that language has. Words are used in systematic ways.
So, for example, we think of some words as being similar. Man is more like woman than it is like telescope. In part, this is because man and woman tend to show up in more similar contexts across a wide range of texts than man and telescope do.
By the same token, because our language is shot through with gender bias, reflecting bias that exists in our society, the word man and the word telescope are more likely to show up together than the word woman and the word telescope. The word telescope has more male connotation than the word house cat might.
Word embedding is an approach that turns some of the elementary units of human language, words, into mathematical objects called word vectors, and it does so in such a way that the properties of those vectors capture the human-perceived meaning of the words based on how those words are actually used within a corpus, a collection of documents like the NVDRS. The vectors for man and woman should be more similar to each other, in a sense of similarity that makes sense for vectors, than either of them is to the vector for telescope.
Now, a vector is just a mathematical term for a list of numbers. And the simplest way to represent words as vectors would be to make a list as long as the number of distinct words in the corpus. And then fill it up with zeroes except for a single spot. And in that single spot, you put a one. And each distinct word would get a different spot where you write a one instead of a zero. You might put the one in spot 172 for hello.
Now, there are several problems with this approach. First, it is pretty wasteful. A list as long as every distinct word in your vocabulary puts a lot of zeroes in there. But also, the properties of vectors that look like that do not capture the properties of real human word usage. You might wonder for a second why I am kind of obsessed with talking about vectors, and this is because vectors are really well-understood mathematical objects that have lots of natural operations that you can do with them. It turns out that we can figure out a way to leverage those natural operations to map onto things that it makes sense to do with human language, that kind of reflect actual word usage.
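A minimal sketch of the one-hot scheme just described, using a made-up four-word vocabulary. The vocabulary and the helper names `one_hot` and `cosine` are illustrative assumptions, not part of the talk. It shows the second problem: every pair of distinct words comes out equally dissimilar, so man is no closer to woman than to telescope.

```python
import math

def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical toy vocabulary; a real corpus has many thousands of distinct words.
vocab = ["man", "woman", "telescope", "hello"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

sim_man_woman = cosine(vectors["man"], vectors["woman"])
sim_man_telescope = cosine(vectors["man"], vectors["telescope"])
print(sim_man_woman, sim_man_telescope)
```

Both similarities are zero, which is why one-hot vectors cannot express that some words are used in more similar contexts than others.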
So, with word2vec and with similar embedding algorithms, what we are doing is forcing the computer to use a much shorter vector, a list of a few hundred rather than many thousands of numbers, to represent the word. In each spot in that list, what we call a dimension in the vector or embedding, you can have any number, but we typically normalize the vectors so they are all the same length, which means that the numbers that show up tend to be relatively small numbers.
The word2vec algorithm constructs those vectors by playing many, many iterations of a sort of guessing game where you give it short snatches of text and it has to, say, guess a missing word from that snatch of text based on something like the average of all of the word vectors, averaging being an operation that makes sense to take with vectors. By doing this task many, many times, the computer learns how to represent each word as a vector so that it succeeds at this guessing game.
Now, vectors also have a nice geometric interpretation. You are almost certainly familiar with the idea of vectors in two dimensions, just an ordered pair of numbers, an X and a Y coordinate. Here is an example where there are a bunch of words represented as two-dimensional vectors. To find in this space the word vector for girl, which you can see here as .66 and .89, you go .66 units on the X-axis, .89 units on the Y-axis. As you can see, many of the other words in that vicinity are semantically similar. Girl is close to boy or to child.
Now the notion of similarity that makes sense for word vectors like the ones we are talking about is something called cosine similarity. I will show you what this looks like in a moment. But basically, it means are these two vectors pointing in similar directions.
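As a small illustration of cosine similarity in two dimensions: the coordinates for girl (.66, .89) are the ones read off the slide, while the coordinates for boy and telescope are invented here purely for the example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

girl = (0.66, 0.89)        # from the slide described in the talk
boy = (0.70, 0.85)         # hypothetical coordinates
telescope = (0.90, -0.30)  # hypothetical coordinates

sim_girl_boy = cosine(girl, boy)
sim_girl_telescope = cosine(girl, telescope)
print(round(sim_girl_boy, 3), round(sim_girl_telescope, 3))
```

Because only the direction matters, scaling a vector (for example doubling both coordinates of girl) leaves its cosine similarity to every other vector unchanged.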
The big difference in our case is that our word vectors do not live in two dimensions. They live in anywhere between 200 and 1000 dimensions. Humans can’t visualize what that looks like. If you happen to be able to, I am sure there are many people who would love to talk with you to understand how on earth you can do that. But the main point here is we do not have to be able to do that visualization. The computer is able to work with these high-dimensional objects for us.
Now actually, understanding how the word embedding approach works its magic has been a major project in a field called theoretical machine learning. You can think of theoretical machine learning as like the theoretical statistics of the machine learning community. It aims to understand and sort of provide guarantees about why something that you might use in the machine learning world actually works.
In a series of papers, the computer scientist Sanjeev Arora at Princeton and his colleagues provided a really compelling explanation of why it was that the training process I described, guessing missing words based on some representation of an average context, produced word vectors whose math of similarity and difference, even whose math of vector subtraction and vector addition, maps onto human-recognizable meanings like similarity or even things like solving analogy problems.
So, this is the work that we have built on. It gives our approach a sort of special edge. Unlike a lot of very powerful recent machine learning methods, this approach has very firm theoretical foundations. You know why it is that something is working.
And because of the sort of work that all of us in this webinar do as social scientists, as public health researchers or public health officials, it matters that our methods are firmly grounded in theoretical foundations and that we can actually explain why they predict the things that they do. We are naturally skeptical of using opaque black box approaches on things like literal matters of life and death.
What I am going to try to do in the rest of my time is give you a quick picture of the theoretical story behind our method, and then I will hand it back to Alina, who will show you some of the interesting social science and public health relevant things that you can learn from the NVDRS using this approach.
Our approach provides a path to go from word embeddings, these ways of representing words as vectors in say a 300-dimensional space to topic models, these sort of thematic building blocks of a discourse that Alina told you about.
The first step in this story is the notion of a semantic space. A semantic space is just this embedding space that we were talking about before. It is where the word vectors live. I am showing it here as a 3D sphere. It is normalized. That is why it is a sphere. But in reality, it does not live in three dimensions. It lives in 300 dimensions.
And the theoretical model that was developed by Arora and his colleagues is a simple model of how text is produced. In that model, we have something called a discourse vector. This is just some location, some arrow pointing in the semantic space. It represents – you can think of it as a latent variable representing what is being talked about in a particular moment in time.
Now, how does that discourse vector actually determine the words that show up in the text to manifest things that we can observe? There is a very simple model here where the probability of a particular word, W, being written down or spoken or somehow put in the text at a particular moment, given where that discourse vector is at that particular moment is just proportional to the similarity of the discourse vector at that time and the word vector. In other words, how similar in this cosine similarity sense is C sub-T, the discourse vector at time T, the little red vector and V sub-W, the vector representing that particular word.
So, for example, if the vectors are very close, and you can see they are pointing in a similar direction, which means they are similar in this cosine similarity sense, then there is a high probability that whatever word that gray arrow represents will be produced, will be in the text.
By contrast, if the vectors are far apart, as you can see here, then there is a low probability of the word, whatever one is represented by the dotted gray line, being used.
Now note, this is basically like a topic in a topic model. It is a distribution, a probability distribution over words. In fact, given this recipe I have just told you, every position that the discourse vector points, in fact, any vector in the embedding space defines a topic in this sense.
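A minimal sketch of how a discourse vector defines a distribution over words. The talk says only that the probability is proportional to the similarity; this sketch assumes a softmax, that is, probability proportional to the exponential of the dot product between the discourse vector and the word vector. The three words and their 2D vectors are hypothetical.

```python
import math

def word_probabilities(discourse, word_vectors):
    """Distribution over words given a discourse vector (softmax of dot products)."""
    scores = {
        word: math.exp(sum(c * x for c, x in zip(discourse, vec)))
        for word, vec in word_vectors.items()
    }
    total = sum(scores.values())
    return {word: score / total for word, score in scores.items()}

# Hypothetical unit-length word vectors in a 2D "semantic space".
word_vectors = {
    "pills": (1.0, 0.0),
    "medication": (0.8, 0.6),
    "argument": (0.0, 1.0),
}

discourse = (1.0, 0.0)  # the discourse vector currently points toward "pills"
probs = word_probabilities(discourse, word_vectors)
for word, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(word, round(p, 3))
```

Words whose vectors point in nearly the same direction as the discourse vector receive the highest probability, which is the sense in which any position in the space acts like a topic.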
So now, imagine that we pick words to appear in the text or to be spoken or however language is being generated. We pick them according to this probabilistic process while the discourse vector is slowly doing some random walk in that space. It turns out that that generative model, that picture of how text is produced, generates text such that if you do the word2vec trick, you train a machine learning algorithm to predict words based on some average representation of the context as shown here where, for example, you might say let us average the word vectors for ‘the cat sits on the’ and then predict mat on the basis of that. When you train an algorithm in that way, you get vectors whose mathematical properties provably encode some aspects of their semantic structure. This encoding is one of the things that got everyone so very excited about word embeddings. You can use them to solve analogies, for example, because there turn out to be hidden directions in that space that capture key semantic and syntactic, that is, meaning and grammatical, relationships.
The direction that you move to go from man to woman is similar to the direction you move to get from king to queen. That is a simple encoding of a kind of gender dichotomy in the embedding space. The direction you move when you go from walking to walked is similar to the direction you move when going from swimming to swam. That is syntax, a syntactical dimension having to do with verb tense.
The direction going from Turkey to Ankara is similar to the one going from Canada to Ottawa or Japan to Tokyo. The country capital relationship. That is a direction too and this is why these vector representations are so powerful. They encode a ton of semantic and syntactic structure and especially the semantic structure, the meaning structure is ripe for answering social science questions as Alina will show you.
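The analogy trick can be sketched with hand-built vectors. The three dimensions below (royalty, female, person) and all coordinates are invented so that the arithmetic is easy to follow; learned embeddings have hundreds of dimensions with no such labels, and the relationship holds only approximately.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical vectors: (royalty, female, person)
vectors = {
    "man": (0.0, 0.0, 1.0),
    "woman": (0.0, 1.0, 1.0),
    "king": (1.0, 0.0, 1.0),
    "queen": (1.0, 1.0, 1.0),
    "telescope": (0.1, 0.0, -0.9),
}

def solve_analogy(a, b, c, vectors):
    """a is to b as c is to ?  Find the word nearest to vec(b) - vec(a) + vec(c)."""
    target = tuple(vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]  # exclude the query words
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(solve_analogy("man", "woman", "king", vectors))
```

Moving from man to woman adds one unit along the second coordinate; adding that same step to king lands on queen.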
Okay now, I have told you the forward side of the story, the theoretical generative model that Arora et al. used to explain how word embeddings work. But as is often the case, once you have this generative model, it provides you a way to go from observed stuff, like the words that are actually in a sentence like this thing describing what word embeddings are, to the latent stuff, like the position of a discourse vector.
So, we can estimate the position of the discourse vector, get a maximum a posteriori estimate of its position with a simple calculation, basically, taking a weighted average that is actually justified by inverting this generative model that I have just described to you.
So now, we have a way to represent any particular piece of text as a trajectory in the semantic space. You take a set of words like the pink words here, a little semantic window, and then we estimate where the discourse vector was when say one of the words in that window was produced. Then we slide the window along, estimate the discourse vector position again. So, as we move through the sentence, the paragraph, the document, we reconstruct how this unobserved discourse vector is moving through the space.
This is an example of something called a sentence embedding, a procedure that represents an entire sequence of words like ‘the parents talk to their children’ as a single vector produced by averaging all of those word vectors together. What you can see here as we are sliding that little window along is we are producing a sequence of sentence embeddings in this way.
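The sliding-window estimate can be sketched as follows. The talk says the discourse vector is estimated with a weighted average; the specific weight used here, a / (a + p(w)), which down-weights frequent words, is an assumption borrowed from related sentence-embedding work, and the vectors, word probabilities, and window size are all made up for illustration.

```python
import math

def normalize(v):
    """Scale v to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v) if norm > 0 else tuple(v)

def estimate_discourse(window, word_vectors, word_prob, a=0.001):
    """Weighted average of the word vectors in one window, then normalized."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in window:
        if word not in word_vectors:
            continue  # skip out-of-vocabulary tokens
        weight = a / (a + word_prob.get(word, 0.0))  # assumed weighting scheme
        for i, x in enumerate(word_vectors[word]):
            total[i] += weight * x
    return normalize(total)

def discourse_trajectory(tokens, word_vectors, word_prob, window_size=3):
    """Slide a window along the tokens; return one discourse vector per position."""
    return [
        estimate_discourse(tokens[i:i + window_size], word_vectors, word_prob)
        for i in range(len(tokens) - window_size + 1)
    ]

# Hypothetical 2D word vectors and corpus-wide word probabilities.
word_vectors = {
    "the": (0.9, 0.1), "to": (0.8, 0.2), "their": (0.7, 0.3),
    "parents": (0.1, 0.9), "children": (0.2, 0.9), "talk": (0.5, 0.5),
}
word_prob = {"the": 0.05, "to": 0.03, "their": 0.01,
             "parents": 0.001, "talk": 0.002, "children": 0.001}

tokens = "the parents talk to their children".split()
trajectory = discourse_trajectory(tokens, word_vectors, word_prob)
for vec in trajectory:
    print(tuple(round(x, 3) for x in vec))
```

A six-word sentence with a window of three yields four discourse vectors, one for each position of the window.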
To recap so far, in our approach to topic modeling, a topic is still a distribution over words. It is a probability distribution over words. Some words are very probable. Some words are less probable. And every position of that discourse vector defines a topic.
But now a document is not a distribution over topics. It is not some mixture or probability of talking about topic 1 versus topic 7. It is a trajectory through the semantic space and that is a big difference. It allows you to represent at fine granularity what is being talked about.
Now, the problem with all of this is that semantic space is big. If we have a narrative of 150 words, I have now told you how to represent it as a sequence of 150 discourse vectors that live in 300 dimensions. So how can we simplify this?
Just like in normal topic modeling, what we want is a way to decompose semantic space into some semantic regions, some sort of conceptually coherent chunks, ideally, a small number of them, that trace out highly semantically related words.
In other words, what we want to do is come up with what is called a sparse dictionary for the semantic space. Basically, what this means is I want to be able to write down any word vector in the semantic space by combining a handful of building blocks called atoms or discourse atoms plus an error term.
You might ask yourself why a handful of building blocks. As Alina mentioned, real words have this property of being polysemous. They have many meanings. For example, the word vector for bank is a combination of meanings having to do with finance, a meaning having to do with rivers, a meaning that maybe has to do with the way balls bank off of things in pool, et cetera.
Here is what this looks like represented in equations. I want to write down the word vector V sub–W as what is called a sparse linear combination of these discourse atoms, which are just special vectors in the embedding space plus an error.
Now, remember, each of these atoms is just a vector in the embedding space. Any vector in the embedding space, as I told you, can be interpreted as a topic. So, to actually figure out what the atom means, we just use the same recipe as before. Whatever that atom is, it is most likely to generate words whose vectors are close to it. We just look at the closest words to characterize an atom. Those are the most probable things that would be talked about if the discourse vector were pointing exactly in this way. This is just like what Alina was telling us to do with traditional topic models: characterize topics by the most probable words so we can interpret them.
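A minimal sketch of the two ideas above: writing a word vector as a combination of a few atoms plus an error, and labeling an atom by its closest words. The atoms here are given by hand, whereas in the approach described they are learned from the embedding space. The greedy matching pursuit routine, the atom names, and all vectors are assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sparse_code(vector, atoms, n_nonzero=2):
    """Greedy matching pursuit: approximate `vector` with a few unit-length atoms."""
    residual = list(vector)
    coefs = {}
    for _ in range(n_nonzero):
        # Pick the atom that best explains what is left of the vector.
        name = max(atoms, key=lambda k: abs(dot(residual, atoms[k])))
        c = dot(residual, atoms[name])
        coefs[name] = coefs.get(name, 0.0) + c
        residual = [r - c * a for r, a in zip(residual, atoms[name])]
    return coefs, residual  # the residual is the error term

def closest_words(atom, word_vectors, n=1):
    """Words whose vectors have the highest cosine similarity to the atom."""
    def cosine(u, v):
        return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
    return sorted(word_vectors, key=lambda w: -cosine(atom, word_vectors[w]))[:n]

# Hypothetical unit-length atoms in a 3D space.
atoms = {"finance": (1.0, 0.0, 0.0), "river": (0.0, 1.0, 0.0), "pool": (0.0, 0.0, 1.0)}
# Hypothetical word vectors.
word_vectors = {"loan": (0.9, 0.1, 0.0), "shore": (0.1, 0.95, 0.0), "cue": (0.0, 0.1, 0.9)}
bank = (0.8, 0.6, 0.05)  # mostly finance, partly river, a trace of pool

coefs, error = sparse_code(bank, atoms, n_nonzero=2)
print(coefs, error)
print(closest_words(atoms["finance"], word_vectors))
```

With two atoms allowed, bank is reconstructed from the finance and river atoms, and the small leftover along the pool direction is the error term.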
Now we figured out how to combine these various elements together to give what you might think of as a full-blown topic model, something that can answer those two basic questions Alina mentioned before. What are the building blocks of a discourse and what building blocks are used in a specific text?
We can answer these questions in the framework of word embeddings. That allows us to leverage a very powerful machinery for working with these rich, powerful semantic representations and that is one of the things that we are so excited about.
So here is the basic workflow. Given a corpus like the NVDRS, we use an algorithm like word2vec to learn a semantic space, a space of embeddings. Then we use a sparse dictionary learning approach. We use something called K-SVD to figure out the discourse atoms that best represent that semantic embedding space. These atoms are in and of themselves extremely interesting just as in normal topic modeling, just as Alina showed you.
They tell us a lot about what the discourse in that corpus is like and Alina will give you some examples from the NVDRS in a moment. They do so in a way that deals gracefully with some of those problems Alina mentioned like domain-specific vocabulary, like stop words, like text of highly variable length. And they produce very semantically crisp topics that are highly coherent and quite distinct from one another.
With the embedding space and the atoms, we now have the ingredients we will use to take a concrete example, a narrative from the NVDRS and figure out what topic best represents it or what topics best represent it. And the recipe here is also very simple. For a set of words in the text, we use a simple procedure to estimate the position of the discourse vector. Then we find the nearest atom to that discourse vector and assign that to the narrative. Slide the window along. Estimate a new discourse vector. See if we have a new atom and repeat.
So now, a narrative is represented as a sequence of atoms. We can turn that sequence into a distribution over atoms, which ones are used a lot, which ones are used a little. We can even turn it into a simple presence-absence kind of variable. This narrative uses these atoms. It does not use these others.
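The last two steps can be sketched as below: assign each discourse vector to its nearest atom, then summarize the resulting sequence as a distribution and as presence or absence. The atom names, atom vectors, and discourse vectors are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def atom_sequence(discourse_vectors, atoms):
    """For each window's discourse vector, the name of the nearest atom."""
    return [max(atoms, key=lambda name: cosine(vec, atoms[name]))
            for vec in discourse_vectors]

def atom_distribution(sequence):
    """Share of windows assigned to each atom that appears."""
    counts = Counter(sequence)
    return {name: count / len(sequence) for name, count in counts.items()}

def atom_presence(sequence, atoms):
    """True/False for every atom: does the narrative use it at all?"""
    used = set(sequence)
    return {name: name in used for name in atoms}

# Hypothetical atoms and one narrative's discourse vectors (one per window).
atoms = {"medication": (1.0, 0.0), "relationship": (0.0, 1.0), "weapon": (-1.0, 0.0)}
narrative = [(0.9, 0.1), (0.8, 0.3), (0.2, 0.9), (0.95, 0.05)]

sequence = atom_sequence(narrative, atoms)
distribution = atom_distribution(sequence)
presence = atom_presence(sequence, atoms)
print(sequence)
print(distribution)
print(presence)
```

The presence-or-absence form is the kind of variable that can then be used like any other structured field.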
Now, I am going to turn it back over to Alina, who will show you how this basic procedure can be used to drive social science insights and potentially actionable public health information.
ALINA ARSENIEV-KOEHLER: Could we switch? Oh, thank you. So now, I am showing you some of the results after we apply this new topic model to the actual NVDRS narratives. This table is showing you just a couple of the 225 topics that we discovered. Each row here is still a topic or a list of terms or words. For example, this first row is showing you here the seven most representative terms for this first topic, which we then interpreted and looked up the narratives that loaded very highly on this topic and then assigned it a label rather than just calling it topic number 1 or topic number 2.
Some of the things to point out here are, one, these topics now are a lot more precise and focused and ultimately more informative than when we were applying the prior topic modeling approach. Something else that we found is that there were not any topics related explicitly to sexual orientation. This reflects the fact that this is an inductive process: the model is discovering topics rather than us telling it what to discover.
And then the last point I will make about these topics that we discovered is they really cover this very wide range of information, some of which is reflected in the structured data. For example, we have some topics that are about mental health. And then there are some topics that are not reflected in those hundreds of structured variables. We have this topic here about cleanliness or reclusive behavior. Those are things that are mentioned in the narratives but are not captured in the structured data.
Here, I am showing you a heat map, which is illustrating how different manners of death have different profiles of topics. They tend to use different topics. This offers us a little bit of a face validity check to start to trust that all of this theoretical modeling and this prior work is actually really producing something that is informative when applied to the NVDRS narrative.
To interpret this heat map, we have the different manners of death, for example, undetermined, unintentional, homicide. And then each row is a topic out of the 225 topics. There are 225 rows. Each row is showing you whether there is a lot of this topic or less. If it is a lot, it is going to be a darker blue. If it is less, it is going to be closer to white.
Some of the things to point out here are, again, one, we see that there are these different profiles of topics for different manners of death. And the other thing is if we start to look at which manners of death have the most similar profiles using this clustering, which is shown where my cursor is going right now, we are able to see which manners of death have similar profiles. For example, undetermined deaths and suicide have the most similar topics. That makes sense given what we know about misclassification of suicide: many undetermined deaths might not have the evidence or information available to be classified as suicide, but it is very likely that they may have been suicide cases.
We can also do this with our other variables that are maybe more nuanced, such as sexual orientation. Now, this looks a little bit different. What you are seeing here is there is this different profile by different categories for sexual orientation. Here, you will see we have the cases where it is missing and then also the cases where it is unknown as well as heterosexual, bisexual, lesbian, and gay.
And something that automatically jumps out from this picture versus the previous one is that there are just a lot of topics in general among cases coded as bisexual. That actually is telling us something a little bit different, not that they are just using different topics. It is actually that cases coded bisexual tended to have a lot more words in general. These narratives tend to be very long, which is telling us then we also need to think about word count when we are analyzing these topics, but it is also just informing us about the nature of these differences in the narratives.
You can also see from the clustering that the cases, for example, that are bisexual, lesbian, and gay coded tend to be more similar to one another in terms of the topics that they use. And cases that are coded as heterosexual tend to be more similar to unknown, which makes sense, given that the heterosexual cases are the majority class.
So, this new topic model has provided us a large list of topics. I encourage you to go to our paper and look at the 225 topics to give you a better sense of what these themes are. We have summarized this haystack of data. We found some of these really interesting needles in the haystack, but now what?
Again, we can use topics like any other type of structured variable. We can start to characterize a topic of interest in a little bit more depth and we can do so using it like any other structured variable, for example, putting it in a regression equation, which is now back to kind of more of a familiar method.
We can take a topic like here is topic 53. It is about sedative and pain medications. I am showing you some of the words most representative of this topic. We can characterize using regression, which types of decedents and which types of cases tend to have a lot or less of this topic or in this case, I am going to show you the results when we have any amount of this topic versus none.
I am showing you a regression table now, not necessarily to highlight every single result in the regression table. But I think it is more to just drive home this point that we can now use a topic, like here we have sedative and pain medications, as a dependent variable and look at its relationship with all these other characteristics. We see, for example, that cases where the decedent is female are more likely to have this topic. And we see that by manner of death, compared to suicides, homicides are a lot less likely to have this topic about sedative and pain medications while undetermined deaths are a lot more likely to have this topic, which again corroborates what we know from previous work. So we are showing you that this topic modeling is really picking up something very informative.
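To make the idea of using a topic as a dependent variable concrete, here is a minimal sketch that assumes a logistic regression of topic presence (any amount versus none) on a single predictor. The talk does not specify the model form beyond "regression", and the twenty cases below are invented toy data, not NVDRS results.

```python
import math

def fit_logistic(xs, ys, lr=0.5, steps=5000):
    """One-predictor logistic regression fitted by plain gradient descent."""
    intercept, coef = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_intercept = grad_coef = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(intercept + coef * x)))  # predicted probability
            grad_intercept += p - y
            grad_coef += (p - y) * x
        intercept -= lr * grad_intercept / n
        coef -= lr * grad_coef / n
    return intercept, coef

# Invented toy data: 10 female decedents (6 narratives contain the topic)
# and 10 male decedents (2 narratives contain the topic).
female = [1] * 10 + [0] * 10
has_topic = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8

intercept, coef = fit_logistic(female, has_topic)
odds_ratio = math.exp(coef)
print(round(coef, 3), round(odds_ratio, 2))
```

A positive coefficient (an odds ratio above 1) means the topic is more likely to appear when the predictor is 1; an actual analysis would include many covariates at once and report uncertainty for each estimate.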
So, some takeaways. First, I started with this point that the NVDRS really exemplifies the challenges and promises of text data. I hope that has come across. It is this really exciting data to work with, but it is also difficult. At the same time, some of the work that we do and what we learned using NVDRS can be applied to a range of different data sources like electronic health records, for example.
We ended up needing to not use existing text analysis tools but developing a new one, this discourse atom topic model, which also will be very useful, we think, for a range of data.
And then what we have provided substantively is this topical summary of the information of the narratives. We have been able to look at some of the specific patterns like the presence of sedative and pain medications and what kinds of situations this topic is occurring in.
With that, I will turn it over to Dr. Chang. Thank you.
KAI-WEI CHANG: Thanks, Alina and Jacob, for introducing the NLP techniques here. I am Kai-Wei Chang. I am an associate professor in the Computer Science Department at UCLA. In the next section, I would like to talk about some concerns about societal bias in language processing, especially when applying these techniques to NVDRS narrative text that mentions LGBT individuals.
I will start with the word embedding models that Alina and Jacob were mentioning. Jacob has introduced the word embedding model. So just to recall here, word embedding is a technique that enables machines to understand the meaning of words by mapping each word to a vector in an embedding space.
As these embeddings carry the meaning of words, my research has found that you can take the direction between words and use that to represent certain semantic information. For example, if you take the vector for the word woman and subtract the vector for the word man in the embedding space, that vector can represent gender.
That is, the direction between woman and man will be parallel to the direction between other gendered pairs, like aunt and uncle or queen and king. So those directions will be parallel to each other.
Another way of thinking about this is that we can use these gender directions to identify the words that solve an analogy puzzle. For example, we can give the model he is to brother and then ask the model what she is to. In this way, the model can identify sister as the word that solves this analogy puzzle.
If we play with this type of analogy task, then we find that when you give it he is to beer, the model identifies she is to cocktail. So that is representative of the beverages that each gender usually drinks.
This example is kind of cute, but when you apply it to certain other cases, the results can be triggering. For example, the model identifies he is to physician as she is to registered nurse, which is problematic because it carries bias about occupations.
And even more triggering, the model would say he is to professor as she is to associate professor. So somehow the model captured the rank and associated that with gender. Of course, this would be very problematic when you apply these embeddings to downstream applications.
To get a better understanding of how gender information is carried in the embedding space, in this figure we map each word onto the gender direction based on how close it is to the pronouns she and he. The words on the left side are closer to the pronoun she, so they are more related to female. And the words on the right are closer to the male pronoun he.
We find that there are certain words that inherently carry gender information, like mom, dad, brother, boy. Those were mapped to their corresponding genders successfully. But if you look at the other words on the top, those words supposedly should not carry gender information. Still, they have been mapped to either the male or the female gender. For example, game is closer to male, whereas sewing is actually closer to female.
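A minimal sketch of projecting words onto a gender direction. The direction is taken as the vector for he minus the vector for she, so negative scores fall on the she side and positive scores on the he side. All coordinates are invented to mirror the pattern described in the figure; they are not measurements from any trained embedding.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)

def gender_score(word, vectors, direction):
    """Projection of a word vector onto the gender direction."""
    return sum(a * b for a, b in zip(vectors[word], direction))

# Hypothetical 2D word vectors.
vectors = {
    "she": (-1.0, 0.2), "he": (1.0, 0.2),
    "mom": (-0.9, 0.5), "dad": (0.9, 0.5),
    "sewing": (-0.4, 0.8), "game": (0.3, 0.9),
}

# Gender direction: from "she" toward "he".
direction = normalize(tuple(h - s for h, s in zip(vectors["he"], vectors["she"])))

scores = {w: gender_score(w, vectors, direction) for w in ("mom", "dad", "sewing", "game")}
for word, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(word, round(score, 2))
```

Words like mom and dad are expected to land on opposite sides; a nonzero score for a word like sewing or game is the kind of learned association the talk identifies as bias.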
So, then we confirm that this type of gender information is carried by the word embedding model. This is not surprising because the model is learned from a large collection of text, and in that text there are associations between gender and these words.
The problem is that when we apply these word embedding models in downstream applications like coreference resolution, those biases in the data may affect the model’s performance.
Here I will use coreference as an example. Coreference resolution is one of the most important tasks in natural language processing. In this task, we are trying to identify the phrases that represent the same entity.
For example, look at the text on the top. In this paragraph, we say the president is more vulnerable than most. He is blah, blah, blah. You can see that the pronoun his in the second sentence co-refers with the mention in the first sentence. We know that when we mention this pronoun his, it refers to the president in the first sentence.
As you can imagine, this is an important task in an NLP pipeline because if you want to understand a paragraph like this, then you need to know what this pronoun refers to. Then we can check the information around this pronoun.
The problem we find is that the current NLP model can successfully identify that the pronoun his refers to the president in the text on the top. But if you look at the case on the bottom, this is an identical paragraph. The only difference is now we change the male pronoun to a female pronoun.
But even this small change will cause the model to make a mistake. In this case, the model cannot identify that the female pronoun her refers to the president.
The problem is the model actually learns the bias in the embedding space. It thinks the female pronoun is farther away from an occupation like president than the male pronoun is. So, it fails to identify the correct relation.
It is also because the data that the coreference model is trained on contains twice as many associations between male mentions and occupations as between female mentions and occupations. So the model also learns the association that a male pronoun is more likely to be associated with an occupation than a female pronoun, and we find that is problematic when we analyze the data.
In terms of NVDRS narratives, again, this is an important task. We need to be able to identify the coreference relations so that we can do more detailed analysis, and especially for NVDRS, there are several cases with LGBT individuals as victims. This is especially important for us because LGBT youth are five times more likely to attempt suicide than heterosexual youth. We really want to be able to analyze those cases well so then we can have a way to prevent suicide.
So here, I show an example of a narrative in the NVDRS. Again, this is an example that we wrote to mimic the data in the corpus. As you can see, in this example, we have a sentence saying the primary victim is a 50-year-old male. The primary victim’s partner stated that he and the primary victim have been living together for three years.
In this case, as humans, when we read this text, we know this he should refer to the primary victim’s partner. However, we find that the model may be confused by the gender information shared between male and the pronoun he, and then identify he as the primary victim rather than the primary victim’s partner.
Even though from the grammatical structure we can see that he must refer to the phrase in the same sentence, the models all make mistakes.
This is one of the challenges that we find when applying current models to the NVDRS narratives. To better understand how current models perform on these narratives, we performed a data annotation and then did a comprehensive study to see how current coreference systems work on these data sets.
What we did is we sampled several narratives from the data based on the gender of the victim and the partner. Specifically, we got 30 narratives where the victim is male and the partner is female, another 30 cases where the victim is female and the partner is male, and 30 cases where the victim and the partner are of the same gender.
And then we hired three experienced coders to annotate the coreference information. The coders were guided by senior public health experts.
As we can see, the model actually struggles to understand the coreference relations in NVDRS narratives, and especially for neopronouns the model performance is worse. If we use a current coreference model, the model can achieve around 80 percent F1 on general NLP text. But for this specific type of LGBT narrative, the model achieves around 40 percent. And for the neopronoun cases, the model performs even worse.
One reason is the model does not really understand neopronouns, and also the model cannot understand expressions like partner used instead of girlfriend or boyfriend or wife and husband. These different ways of expressing relationships make it hard for the model to understand the text.
So, a question here is how we can improve the model’s understanding of those narratives. We tried a simple way to deal with that. That is, we augment the data by replacing some of the phrases with other phrases. For example, we can switch the gender, so when we see boyfriend in a text, we switch that to girlfriend. When we see the pronoun he, we also switch the gender. In this way, we can eliminate the gender information in the annotated data, and we find that even using these simple techniques we can improve the models.
But to be honest, even with these techniques, we are still not able to achieve 100 percent performance on this data set. How to improve the model on this type of narrative is still an open research direction.
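A minimal sketch of this kind of gender-swap augmentation is below. The swap table is an illustrative assumption, not the rule list the team used, and it is deliberately crude: for instance, “her” is always mapped to “his,” although in some sentences “him” would be the correct swap.

```python
import re

# Illustrative swap table (an assumption, not the speakers' actual rules).
SWAPS = {
    "he": "she", "she": "he",
    "him": "her", "his": "her",
    "her": "his",  # simplification: "her" can also correspond to "him"
    "boyfriend": "girlfriend", "girlfriend": "boyfriend",
    "husband": "wife", "wife": "husband",
    "male": "female", "female": "male",
}

# One pattern matching any whole word in the table, so every swap happens
# in a single pass and a swapped word is never swapped back.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SWAPS)) + r")\b", re.IGNORECASE
)

def swap_gender(text):
    """Return a copy of `text` with gendered terms exchanged."""
    def replace(match):
        word = match.group(0)
        swapped = SWAPS[word.lower()]
        # Preserve sentence-initial capitalization.
        return swapped.capitalize() if word[0].isupper() else swapped
    return _PATTERN.sub(replace, text)

def augment(narratives):
    """Return the original narratives plus a gender-swapped copy of each."""
    return list(narratives) + [swap_gender(n) for n in narratives]
```

Training on both the original and the swapped copies means the gender of a word is no longer a reliable cue for which mention a pronoun refers to, which is the effect described above.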
Okay, so this is an example where we find that the current model has trouble dealing with LGBT cases in the NVDRS data set. But in general, we find that current NLP models have issues handling text about LGBT individuals.
For example, current language models can do very well on the following game. I give you a sentence and I take out one word, and the model predicts which word should be put in to make the sentence correct.
So, suppose I give you a sentence like this: Alex went to the hospital for their appointment, and [blank] felt sick. If the model understands non-binary gender, it should fill in they, to stay consistent with the pronoun their.
But the model actually puts she or he with high probability, and they gets a very low probability score. The model does not give the right answer.
And this is a consistent phenomenon that we find in many cases. For example, when we look at the closest words to a gendered pronoun, we find that the model has trouble understanding neopronouns such as xe and ze.
To get a better understanding of the problem, we conducted a survey with 19 participants to understand what the issues are when NLP models are applied to text with LGBT mentions. We find that in many NLP applications, such as entity identification, coreference resolution, or machine translation, there are different cases that the LGBT community finds problematic and that can cause harm to LGBT individuals.
So, we summarize some of these cases in our paper. In the interest of time, I am not going to go through the details of all these cases. But I refer people to this paper for the discussion and analysis of how current NLP techniques have trouble dealing with those LGBT cases.
Just to give a quick summary of what I said: We find that there is bias in the automatic models, and when they are used to analyze the data, they may influence the results. It is essential to improve the inclusiveness of NLP techniques so that they work better on those cases.
Okay, with this, I will turn the mic over to Dr. Mays.
VICKIE MAYS: What I am going to talk about today has to do with using non-AI approaches, as I think Dr. Cochran started in the beginning talking about. There are two different types of data that are in the NVDRS. One is the narratives, and the other is the quantitative data.
And so, part of what we decided to think about is using the qualitative data to indicate things that we thought should be variables in the quantitative data so that you did not have to keep searching and searching.
The study goals were these. What kinds of indicators are associated with LGBT suicide in the data set? Can we develop these into quantitative measures that would be specific questions asked? And last, can we reduce public health coder bias and increase reliance on facts that would come out of the investigations, because it would be required to have this actual data?
In terms of the narratives for the study, we used the data from 2003 to 2017. We restricted it just to the suicide data. What we did was we selected 1200 suicide deaths, which meant that if we were using the narratives, there are two narratives for each case. One is from the medical examiner or coroner, and the other one is from law enforcement. We could have as many as 2400.
We knew that 621 of those 1200 were LGBT cases. I think Dr. Cochran talked about that in the beginning, that we used things that were identified by CDC. In the ones that we knew to be LGBT, 54 percent were gay males, 28 percent lesbians, 10 percent transgender, and 8 percent bisexual. Then we randomly selected another 579 cases to give us 1200.
We used six undergraduate students as raters. All of them were trained in terms of suicide awareness. Some of them were very knowledgeable about LGBT populations either from their own lived experiences or they had siblings. They had worked on hotlines. There was a lot of knowledge that we wanted to make sure they had before we put them into this.
And then we had one undergraduate who was a stat major. You can see the influence of Dr. Cochran here. That person actually was in the background, putting our batches together and setting up what the coder agreements were.
Each time we had a set of trainings. Here is what we did in order to reach inter-rater reliability. The coders finished a batch. They did it individually. They met. They discussed it. They came up with what their disagreements were and then they re-coded again. And then we looked again at where they still had disagreements.
The way in which we resolved this is that they had another session. We looked at who were our coders that had the most discrepancy. Who were the coders that did well? And then we matched them. It is like a training. We matched them with coders who had done well to then explain what the agreement and disagreement were, and that allowed us to be able to do our inter-rater reliability.
Here, what you can see is how we did this, because we did not just hand them 1200. We started out with a batch of what we call practice cases. These were not in the actual 1200. And then we started it. You can see how we started off in the practice cases with only about 33 percent agreement. But we would go through, and you can see when we had people come back again, we could reach agreement. It told us a lot about some of the things that I think my colleagues have talked about, which is that this is not easy work.
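For readers who want to track agreement across coding batches in a similar way, here is a minimal sketch. The talk does not say which agreement statistic was used, so both measures below are assumptions: the share of cases on which all raters gave the same code, and Cohen's kappa for one pair of raters. With six raters, a multi-rater statistic such as Fleiss' kappa may be more appropriate.

```python
from collections import Counter

def percent_agreement(ratings):
    """Share of cases on which every rater gave the same code.

    `ratings` is a list of per-case tuples, one code per rater.
    """
    unanimous = sum(1 for case in ratings if len(set(case)) == 1)
    return unanimous / len(ratings)

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same cases."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if each rater assigned codes independently
    # at their own observed rates.
    expected = sum(
        counts_a[code] * counts_b[code] for code in set(rater_a) | set(rater_b)
    ) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical code throughout
    return (observed - expected) / (1 - expected)
```

Computing these after each batch, before and after the discussion session, would show whether the retraining is closing the gap between coders.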
I think one of the things you will find over time is that as language changes, the training will need to be even greater. A really big issue in this work is that language does tend to change, particularly in the SOGI population. We are going to have to pay attention to that.
You heard Dr. Cochran talk about Dr. Clark, who at the time was a graduate student with us, her work on bullying. Here is an example of what a code sheet looks like. Since this paper is published, we actually took this one.
What we did was we needed in this back and forth to come up with a definition of what is coding. What you have to do is decide in these narratives when to code the presence of bullying and when it is actually there and when to code that it is absent. What you are trying to do is to code whether it is present or absent. Then there might be some specific things that will come up in the course of doing this work that you want to actually give the coders an idea about in case there are things that are a little off or what have you.
Let us look at the results. The question is what factors were more characteristic in LGBT suicide narratives. We found seven characteristics. I will give you a couple, because for the rest of them we are in the midst of writing and finishing up the paper. But, again, there is this HIV/AIDS history. I know some people think it is stigmatizing. But particularly for gay men, if that was in the narrative, we were able to tell with great probability that this was likely to be a gay man.
Part of what we also have to think about is that we focus sometimes on the pharmacologic and behavioral interventions in terms of HIV, and it may be that what we are discovering is that there is a relationship between HIV and suicide that we may need to attend to. In this instance, I think having this variable, as Dr. Cochran said, is actually insightful.
What we did come up with in terms of the other six characteristics is that typically they were very specific kinds of conflicts or discrimination around being LGBT. If, for example, family conflict around coming out or something like that were there, then even though the words about LGBT relationships might not be there, we were pretty good at identifying that that was an LGBT case.
What did not work, and I have to applaud my undergraduates. They spent a lot of time on this. We tried to use the interpersonal theory of suicide. This is Joiner’s theory. And we actually developed code sheets for perceived burdensomeness and thwarted belongingness. I think as a matter of fact, this work actually began when Kirsty Clark was with us and we attempted to develop these code sheets, hoping to be able to use this. It just did not work.
If you just use stigma and it is not a specific stigma, it does not work either, because stigma in and of itself did not identify LGBT. It just identified suicide. I think it is important to think about what we can come up with, given the way that NVDRS runs, which is that the LGBT status is only there if it has something very specific to do with the death. We are finding that there are contexts that really have to do with LGBT deaths, and those should be in the quantitative data so that we can more easily identify LGBT individuals.
What are the recommendations that we made from this particular study? First of all, we need to treat SOGI as classification variables and include them in the quantitative data rather than only coding it if it is a factor in the suicide. We do this in terms of age. We do this in terms of race ethnicity. We need to think about doing this also to the extent that it is possible because in an investigation, you might not always get this.
But if it is included as part of the quantitative data, you get better ability for surveillance and monitoring of suicides as a population health activity as opposed to just worrying about the person. It will allow states to actually do much better at looking at the population of LGB and some of the other parts.
Also, add a specific set of questions as part of the data gathering in investigations, and cross-reference these records with the vital statistics, where what you are looking at is, if you know who the person is, can you look? In some states, you will be able to tell if people have changed their birth record. I say the birth record and not the certificate because it is in the birth record, which is kind of behind a firewall. But being able to do that would actually really facilitate better use of this data.
My job, being the last, is to make some summary comments about the R21 in general. What did we learn in this research? The R21 was an appropriate mechanism. For those of you who were thinking about what to do, R21 was great. We developed a proof of concept in our model. That leads us to the point where we can apply for an R01. We have been so busy generating papers; we have not done that yet.
We started out with an intention to focus on LGBT studies, but it required a lot of data cleaning and exploratory development work. Yes, now we are poised for an R01. But what you need to hear is we are almost in a privileged position to be able to do this because we had other funding. Almost all of us have regular FTEs. It really took a lot of work to clean the data. I think you heard about misspelled words, all this other stuff to get the machine learning to work. It takes a lot of work to clean the data set up.
I think Dr. Cochran talked about this. The most useful way to do this work is to realize that you need a team. That it cannot be just a couple of you. Again, R21s do not give you a lot of money to do that. I have to applaud my colleagues for joining us in this because it was not about money. But it was a lot about making these methods work better.
The work was actually much more difficult than we thought. Some of us are very senior investigators. We thought we could just blow through this. But part of what we needed was this innovation of methodologies, which is why we brought more colleagues on.
We also have to move beyond this assuming the political nature of the records to pursue an understanding of representations of representations of representations, meaning that – in terms of thinking about bias, thinking about how bias is generated, we really have to realize that there are a lot of different kinds of biases here. We have to really kind of figure out how we are going to deal with that. We are still in the process of achieving some of our goals. We have about four to six papers still in the works.
What are some of the questions that we still have to answer? Can we reduce suicides through interventions in clinical settings? We are hopeful to be able to use these models in terms of EHRs. Can we understand the context of the violent deaths, particularly in trans populations? Here, we think the work is really understanding the homicides. What we want to do is look at what the investigations are like.
Can we provide evidence-based rationale for states to engage in surveillance and monitoring as a cost-effective intervention? That is something that I think is very important.
Is the right approach to these questions to actually use this dataset that we are using or is there other data? Should we be thinking about modifying the data systems or modifying the data?
We have suicide prevention methods. A question is what we can learn, using the data that we have, that will allow us to adapt them to LGB and especially T. Or is it much like what we discovered in HIV, where they wanted us to adapt methods and it really did not work, and we needed to start anew? For those of you who are HIV researchers, it was in the DEBIs. That is a big question I think that still exists.
I am going to stop here and move us to the questions that may be in the question and answer. I know I answered, and Dr. Cochran answered. Some of the questions have been answered. But if people feel they are still unanswered, please ask again.
TAMARA LEWIS JOHNSON: Thank you, Dr. Mays and all of the presenters, for that outstanding presentation, a panel of talks, very important cutting-edge questions on what ways to address violent deaths among LGBTQ populations and the application and the development of innovative technologies to be able to track and monitor this.
If all the speakers could unmute and turn their cameras on, I am going to go to the Q&A box, and we will get started. Thank you to all of you who have hung in there with us, because we are getting to the good part, the opportunity to get your questions answered.
Here is a question. If the machine learning process takes place at a certain time with text samples from a certain time period, does it change the accuracy of the prediction of the word vector? For example, with words related to SOGI that are very dynamic, it may change significantly over time. How does this process of embedding account for that change over time?
ALINA ARSENIEV-KOEHLER: I can take that question. That is a great question. That is actually a bigger area of work in text analysis: how do we do all these methods knowing that language changes? It is not static.
In the particular examples that we are showing, we are taking all the vocabulary, whatever time it is from. So even if there was a word that was only used in 2000, we will have that word at least in our bucket of possible words that can be used.
But ideally, you can also look at how a word that is used in two different time periods might itself change across time. We know that the word queer, for example, has really changed meaning across time, and that is another active area of work in computational text analysis.
VICKIE MAYS: Can I also comment? One of the things that I have great worries about because I sit on all these different committees, and one is within academic medical centers, and we are trying to figure out in the electronic health record to get people to put in their identity variables. The problem is that in doing that when we start to figure out what is quality data, what is quality control, it is almost like if the bioinformatics group does not almost every six months or so relook at what they are collecting and how to make sure that they can expand it then we are going to lose. And what you know is that in electronic health records, that is expensive.
We have to figure out some way because I think whoever asked that question -- it is a great question. It is a question that we are worrying about in terms of transferring all these great scientific facts we are finding into an applied system, a system that normally – remember, we use – I will not call out brand names. But remember in the electronic health record, that is very costly in terms of going in and doing some re-engineering. It is costly for CDC in terms of the National Center for Health Statistics to re-engineer systems. It is costly for the states to re-engineer how they collect that data in vital records. We have a problem that we have to figure out in order to use all this wonderful science.
TAMARA LEWIS JOHNSON: Great. Thank you so much.
Here is another question that is coming in. The person says, I am curious to know more about the coding of interpersonal theory of suicide variables and constructs within the NVDRS narratives. Do you think this means that these constructs are less relevant to LGBTQ folks? Also, would you be willing to speak more about and/or share the coding sheets for how burdensomeness and belongingness were coded within the NVDRS?
VICKIE MAYS: What we think – we spent a lot of time on this because we wanted to have the theory actually match what we were seeing. And Joiner’s suicide theory is the only one that actually talks about actual completed suicides. We thought it would be a perfect match.
What we found is that those terms do not distinguish. Part of what you want to think about is that those experiences are just more universal. They are just not as specific to LGBT.
It is funny. Whoever asked that question, thank you, because it makes me think that we should still write a paper to say that what we learned from it not working is that this is a universal experience, and it is not an LGBT and other letters experience. I think we have learned something. I do not know how to make it be specific.
Again, our code sheets that work in terms of predictability all tended to have some very specific, unique things that many LGBT people would experience, and if a state put that in their coding, they would be able to do much better surveillance. If whoever it was wants to contact me, I will be happy to talk through a little bit more about how our code sheets did.
TAMARA LEWIS JOHNSON: Here is a question for Dr. Arseniev-Koehler and Dr. Foster. How can the topic modeling approach to examine sexual orientation and gender identity data on suicide be applied to other mental illness such as depression, anxiety, and things like that?
ALINA ARSENIEV-KOEHLER: I can take it and then Jacob can add to whatever I miss. Two different ways. One, you could – just like we did the heat map with sexual orientation, and we had that heat map with manner of death. You could also think of doing that kind of same approach where you are looking at the different distributions of topics across different categories of the structured mental health variables. There are variables such as had this person had a recent crisis in mental health. Does the person seem like they had depressed mood prior to the death? Did they have a diagnosed mental illness? You could search and use those variables and look at the topics.
And then the flipside is you could go through the 225 topics, which again are just public on the paper. You can look at them. Then you can start to think some of these match up with the structured variables and some of them offer some insights relevant to mental health that are not captured in the structured variables. You could use those in that regression framework like I showed you with the sedative and pain medications and look which types of death tend to have this particular topic about mental health. That can give you some more insights.
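The first approach described in this answer, comparing how topics are distributed across categories of a structured variable, could be sketched as follows. It assumes each narrative has already been given one topic label and one category label. The function name and input format are illustrative only and are not part of the published topic model code.

```python
from collections import Counter, defaultdict

def topic_share_by_group(rows):
    """Compute, for each group, the share of narratives assigned to each topic.

    `rows` is an iterable of (group_label, topic_label) pairs, for example
    a structured-variable category paired with a narrative's topic.
    Returns {group: {topic: share}}; shares within a group sum to 1.
    """
    counts = defaultdict(Counter)
    for group, topic in rows:
        counts[group][topic] += 1
    return {
        group: {
            topic: n / sum(topic_counts.values())
            for topic, n in topic_counts.items()
        }
        for group, topic_counts in counts.items()
    }
```

The resulting table of shares is the kind of input a heat map would display, with groups on one axis and topics on the other.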
TAMARA LEWIS JOHNSON: Great. Here is an incoming question from the audience. Is there anything a patient advocate can do to help support patients, health care providers, administration, and researchers to promote the collection of data?
VICKIE MAYS: I am happy to answer that one. Thank you very much. One, bug the systems. What you want to do is to ask the states, whatever state you are in, or work with the state organizations to say, you have all this administrative data, but it cannot be mined as easily as it could be because it needs better coding. It needs better ways of identifying it. As an advocate, ask to have LGBT as an identity status variable, the same way you have age or race. Think about that. I cannot guarantee that it is easy to get. But at least if you have it in some, it allows us to be able to use it better.
I think that one of the things I am going to caution you about and I am right in the midst of writing something about this is we have to be careful of what we ask for. You can ask for data to be collected. But you must in these days ask about privacy protection. Right now, I do not have to tell you that as jurisdictions begin to change kind of their legal policies, we put people at risk. As you ask, balance with protection.
TAMARA LEWIS JOHNSON: Thank you so much, Dr. Mays, for that question.
Dr. Foster, did you have something that you wanted to say about the interpersonal theory of suicide variables and constructs as it relates to NVDRS narratives?
JACOB G. FOSTER: No. I just wanted to amplify Alina’s comment about the analysis of our mental health questions vis-à-vis the discourse atom topic model approach and say, whoever asked the question, great question. And it is something we are actively looking at because obviously that is a very important dimension, particularly in the suicide violent deaths. As Alina said, there are really interesting facets of some of those conditions that do not come up in the structured variables that we can see in the narratives and the way that they are talked about.
I would underline, as Alina said, that the paper is open access on PNAS. It is also available because of NIH funding. Just spend some time, with the caveats that Dr. Cochran made. Some of this is upsetting stuff to read about. But just spend some time looking at those topics.
We made the choice to give all of them in the supporting materials so you can see what they are and read some of the terms, because I think there is a ton of interesting work that can be done along these lines with the NVDRS and many other unstructured text data resources using this approach. Spend some time looking at it and see what you can do.
TAMARA LEWIS JOHNSON: Great. Thank you so much.
Dr. Chang, I have a question for you. Given that augmentation rules may not fully address diversities of sexual orientation, what considerations should investigators incorporate into their study design?
KAI-WEI CHANG: That is a very good question. I think technically, there are several ways that we can make the model more inclusive. Data augmentation is one way to do that. There are many other ways that we can consider. For example, we can do some projections in the embedding space to make the model more aware of different genders.
But another thing that is more important is that when people design a study, they should be aware of this issue. They should be aware that a model may not perform well in all cases, and especially for some LGBT groups or some other minority groups, the model may perform much worse. When we apply the model, we need to be careful, and we need to go deeper and look at those cases to see if the performance of the model is what we expect. If not, we can try to improve the model from the technical side. But on the other side, we need to be careful when we draw any conclusion from those models.
TAMARA LEWIS JOHNSON: Great. Thank you so much, Dr. Chang.
Here is another question. We, as a state, are creating systems integration across all of our health care and human services to collect SOGI data for service, population needs as well as population and health and health equity. Are there some tips and lessons learned outside of the NVDRS for a broader collection and usage of SOGI data? And insights that you might have about data privacy and confidentiality as we have user input asking different questions for their identity than current questions so the complexity increases.
VICKIE MAYS: One of the things that I think is important is if you are doing this in terms of your system is first to figure out what linkages can you do so that you do not have to keep asking in every place but instead you have checks and balances. That is number one. Can you, for example, go to the vital records? Will your mortality in vital records be a check and balance? Linkages, I think, would be one of my first wishes.
The second wish, and I kind of talked about this, is that language is going to constantly change. You are going to need somebody like Professor Chang, who is going to be able to give you some kind of formula by which you can constantly upgrade that language so that you can capture some of the differences. There may be generational differences. There may be race and ethnicity differences, and we have not talked about that a lot. That is also important. That language is about culture, and you need to understand how to do that. This is why we are very happy that we have Dr. Chang with us, because that is part of what he does.
The third thing is in terms of bias. I think you need to understand what biases are. We always think about discrimination. That is not the only bias. Dr. Chang has talked about bias. Everyone else has talked. You have to figure out that there are many different types of biases to address that come into data in very different ways. Some are human biases. Some are machine biases. Some of them are kind of the system bias of how you structured your EHR or how you structured your data set. You have to remember that the use of the data will make a big difference. Clinical data and survey data are totally different in terms of the comfort that people may have in giving you information. I applaud whoever is re-doing their system and wants to do this. I say bring many people to the table and get a data governance agreement about where you want to end up.
TAMARA LEWIS JOHNSON: Thank you so much, Dr. Mays, and to all the presenters this afternoon for this outstanding panel of presentations on a very important and relevant topic to the NIMH and to public health.
I want to say that our next and final webinar of the ODWD webinar series is the NIH scientific workforce diversity initiatives, promoting inclusive excellence in the extramural research ecosystem, on the 22nd of September. If you have not already done so, I encourage you to sign up and register to attend that. Thank you. We are recording and archiving this presentation. There will be a transcript. All of this information will be available in the coming weeks on the ODWD webpage. Thank you so much for taking time to listen to the presentations and to participate. Have a good rest of your day.