Postcards From The Data Edge
By Vanessa Raymond
Feb. 17, 2023
Often what lies behind a dataset is a story, or a set of stories. Sometimes, it’s an epic saga complete with heroes, foes, trials and tribulations. As in all great stories, there are periods of woe and periods of triumph. Data stories are inextricable from human stories, with all the high drama that accompanies funding cycles, trends in research, and the crucial role that charismatic, passionate movers and shakers can have.
Working with data, as dry as that may sound to some, is actually quite an emotional process. Anyone who has spent any amount of time creating or collecting data, analyzing it, managing it, describing it, sharing it, or reusing it can tell you how frustrating and exciting the process is. The challenges may stem from an ethical concern, a data quality concern, or a technical concern. Below are three data vignettes, accompanied by some bigger-picture framing from data academics, featuring some of ACEP’s data champions: Michelle Wilber, Emily Richmond, and Dylan Palmieri.
Michelle Wilber, an ACEP researcher, describes the care and caution she brings to her data work on electric vehicles in Alaska:
Most of the data that I play with is crowd-sourced electric vehicle (EV) data. In some cases, for example the Municipality of Anchorage’s electrical box truck data, which comes from a publicly funded vehicle, the data is all public. Everyone involved in that project is signed on to making the data public, and so we have very few ethical concerns or quandaries on a project like that.
The rest of the data I play with often comes from Alaskans who own and operate an EV and, through the goodness of their hearts, share their data with me. When someone contributes their data to my research I have them sign a data sharing agreement, at which point I remove any personal info such as personally identifiable information (PII)... but even after all that we still have some challenging questions.
For example, when we mention a “Fairbanks Chevy Bolt” on a plot or graph about electric vehicles in Alaska. Well, as far as I know, there’s only one Chevy Bolt in Fairbanks. Even without the personally identifiable information, it’s still identifiable. Now most of what we are sharing from a data perspective is somewhat niche and esoteric; it’s published in a few papers and read by a handful of like-minded researchers. Mostly we’re talking about the impact of external temperature on trip efficiency, nothing overtly personal and not the hottest topic out there (literally!). Now if we had a broader audience for our data and our charts were being broadcast on the evening news, then that might be a different story.
Michelle describes how ACEP researchers, through their relationships and networks, maintain the trust of research partners and collaborators by protecting the people behind the data. Michelle also touches on the fact that when a larger spotlight is shone on research, the tools and techniques for managing the data may need to change to address the change in the scale of the interest, or the politicization of a research topic.
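Michelle’s “Fairbanks Chevy Bolt” example can be turned into a simple check to run before publishing a chart. The sketch below is only an illustration, not ACEP’s actual process: the function name `small_groups`, the threshold `k`, and the vehicle list are all hypothetical. It counts how many records share each combination of seemingly harmless attributes and flags any combination shared by fewer than `k` records.

```python
from collections import Counter


def small_groups(records, keys, k=3):
    """Return attribute combinations shared by fewer than k records.

    A combination that only one or two records share can point back to a
    specific person even after names and addresses have been removed.
    """
    counts = Counter(tuple(record[key] for key in keys) for record in records)
    return {combo: n for combo, n in counts.items() if n < k}


# Made-up vehicle list for illustration only (not ACEP data).
vehicles = [
    {"city": "Anchorage", "model": "Chevy Bolt"},
    {"city": "Anchorage", "model": "Chevy Bolt"},
    {"city": "Anchorage", "model": "Chevy Bolt"},
    {"city": "Fairbanks", "model": "Chevy Bolt"},
]

risky = small_groups(vehicles, keys=("city", "model"), k=3)
print(risky)  # {('Fairbanks', 'Chevy Bolt'): 1}
```

A flagged combination does not say what to do next. Whether to relabel the vehicle, merge it into a broader region, or leave it off the plot remains a judgment call for the researcher and the data contributor.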
ACEP researchers seek to go above and beyond the requirements with their data relationships and networks, because, sometimes, behind the data we find the faintest whispers of people’s lives. Proceeding with caution, care, and empathy is the only just way to approach this. Wilber describes how even with powerhouse data, data that comes from an electric utility about the electrical output of the powerhouse, we can have a people-centric approach.
This situation illustrates how, even when we are using powerhouse data from communities, we firmly believe that data belongs to the people who created it and the organization who shared it. Beyond just the personal protections, there’s the whole other challenge to ensure we are supporting energy sovereignty. At that level it’s not just personal data, it’s a relationship that we try to make sure is respectful of another entity in total, be it a tribal council or community.
Michelle’s work echoes the thinking and cautioning of the biggest thinkers in data ethics, such as Kate Crawford, who speaks about the new harms introduced by big data and data science, challenging traditional research ethics guidelines:
Big Data stretches our concepts of ethical research in significant ways (Boyd and Crawford, 2012). It moves ethical inquiry away from traditional harms such as physical pain or a shortened lifespan to less tangible concepts such as information privacy impact and data discrimination. It may involve the traditional concept of a human subject as an individual, or it may affect a much wider distributed grouping or classification of people. It fundamentally changes our understanding of research data to be (at least in theory) infinitely connectable, indefinitely repurposable, continuously updatable and easily removed from the context of collection. By doing so, it forces us to grapple with the ways in which familiar and practical ethical constraints depend upon research data being temporally and contextually constrained and restricted by technical infrastructures and financial cost. Further, data science methods create an abstract relationship between researchers and subjects, where work is being done at a distant remove from the communities most concerned...1
Sometimes, the data story is just one of cleaning, or fixing. Like tinkerers outside of the data sphere, the headaches are numerous, unexpected, and can be quite vexing. ACEP’s data science analyst, Emily Richmond, is super annoyed with data right now. Why? Well, she just discovered that some (but not all) of the data points she has been building an analysis off of are off by a decimal point. She vents because, to put it simply, she needs the story behind the data just as much as she needs the data itself.
We should be able to find this data online in a nice format, but no, it’s not that simple. It’s never that simple. We don’t have access to the source file [the original data collected that is then analyzed to create a published & final dataset], so we don’t know where the data came from. Why isn’t this data already public? It feels like someone is transcribing it, maybe some expert was making adjustments as they identified issues, but it makes it really hard to verify these numbers. I’m not an expert in the field so it’s hard for me to understand the difficulties they faced getting this data. I really wish they had some metadata, or source data files accompanying this dataset - it’s necessary for telling the story of the data so that we can use it. I feel like I’m floating between spreadsheets made by different people.
Emily’s challenge is a familiar one in the data world. We inherit a dataset that’s incredibly valuable, but we don’t have the decoder ring: we can’t understand or verify how the data was made, whether the data is accurate, or what decisions were made along the way to produce it. Without good documentation, metadata, and standard best practices, valuable data can become unusable for statistical analysis or other data science pursuits, simply because of the high degree of uncertainty the data product introduces into the research process. The gold standard for datasets is that we receive a source or raw data file, the script that did the analysis, the data product that comes from the analysis, and metadata about the dataset: the who, what, when, where, and why, or the data’s origin story.
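When the source file is missing, one modest safeguard is to screen a column for values that look like decimal-point slips. The sketch below is an illustration, not Emily’s actual method: the function name `flag_decimal_shifts`, the `tolerance` parameter, and the readings are all hypothetical. It assumes every value measures the same quantity at a similar scale, which would not hold for data that legitimately spans several orders of magnitude.

```python
import math
from statistics import median


def flag_decimal_shifts(values, tolerance=0.3):
    """Flag values that sit roughly a power of ten away from the median.

    Returns (index, shift) pairs, where shift is the number of decimal
    places the value appears to be off by (positive means too large).
    """
    center = median(values)
    flagged = []
    for index, value in enumerate(values):
        if value <= 0 or center <= 0:
            continue  # the log ratio is undefined for non-positive numbers
        shift = math.log10(value / center)
        nearest = round(shift)
        if nearest != 0 and abs(shift - nearest) <= tolerance:
            flagged.append((index, nearest))
    return flagged


# Made-up readings: two of them look like misplaced decimal points.
readings = [4.12, 4.30, 41.9, 4.05, 0.418, 4.27]
print(flag_decimal_shifts(readings))  # [(2, 1), (4, -1)]
```

The check only flags candidates and does not correct them. Without the source file or metadata, nobody can say whether 41.9 is a typo or a real spike, which is exactly the uncertainty Emily describes.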
Mimi Onuoha is a data artist who has written at length about missing data, and in her 2018 essay “What is Missing Is Still There” 2, she attempts to define and describe data.
...academic Mitchell Whitelaw defines data as measurements extracted from the flux of the real. When we typically think of collecting data, we think of big important things: census information, UN data on health and diseases, data mined from large companies like Google, Amazon, or Facebook. From this perspective, Whitelaw’s definition of data is admirably concise and effective. With its clever use of the word “extraction”, it hints at the resource-driven nature of data collection... Whitelaw’s definition calls to mind corporate imaginings of data as a resource. In a capitalist society, it is always a smart business decision to collect data. A world collected is a world classified is a world rendered legible is a world made profitable. ...a simpler definition comes to mind. Data: the things that we measure and care about.... Missing datasets is the term I have for these blank spots in a world that nowadays seems soaked in data. They form a ghostly parallel... they too are the facts of our world, the vertices of measurements. But they are the ones that we know little about. Data are what people care enough about to measure. Missing datasets are the things that people care about, but cannot measure.
Emily, in her work with ACEP, has chanced upon another type of data, the dataset that got forgotten. The data we used to care about, or perhaps that we used to be able to measure but can no longer.
ACEP’s Dylan Palmieri, a winter 2022 graduate of UAF’s computer science master’s degree program, is tackling a different data challenge right now. Dylan’s challenge is one of documentation. It requires digging deep, really deep, into the way some air quality sensors were designed to understand the way the data is being created by the sensor. And if that wasn’t hard enough, he then needs to write code to translate this data into a format that ACEP researchers can easily understand in order to analyze the data coming out of the sensors.
“I basically want to write this software so that no one, and I mean no one, ever has to do what I have had to do,” says Dylan. “I guess we can call it data democratization. It shouldn’t require a degree in computer science to draw conclusions from this data.” He goes on to describe his process further:
We start with the sensor documentation from the manufacturer. Yes, we read the manual! Actually I have read it four different times now. Then we look at the data outputs of the sensor. It’s not obvious what it all means. I realized I had to go deeper, and understand how the data gets created and structured to see how the sensor is creating the data. It’s “invisible work”, no one knows I had to go this far to be able to document the data. But ultimately, the researchers don’t want bits or bit strings, they want to see some numbers that make sense to them. I want this software and its associated documentation to abstract out the “niche” things and make it accessible to a general audience.
The day I figured out how this sensor worked was a good day. That was nice, that was fun. I enjoy the architecture work that comes with computer science and data work. I enjoy making pieces of code and software that you can string together to make a process. So I am creating a tool that translates the sensor data into something understandable, some format that the researchers care about. I am also making a few small tools, widgets of sorts, that analyze things for the researchers. I’d love one day to connect the sensor to the internet and create a way to stream data, but we’re not there yet. For now I am working on creating a dataset that describes itself really well: it has all the metadata [the data about the data] right there with the data. I want to write this documentation and craft the dataset interface so that for the majority of the people consuming it, it’s relatively intuitive.
Dylan’s work to fully understand a piece of hardware, write software and create data processes, and document the way the data is structured and its outputs is what makes research possible. It allows ACEP’s energy researchers and fuel cost research teams to focus on the analysis, bringing all their subject matter expertise to bear on the research question at hand. This “invisible work” is also a time-saving and cost-saving measure that gets the data into a usable shape, without others also having to go down the rabbit hole later on.
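To make the idea of translating bit strings into numbers that travel with their own metadata a little more concrete, here is a minimal sketch. It is not Dylan’s software, and the frame layout is invented for illustration: the article does not describe the real sensor’s format, so the field names, units, scale factors, and the `decode_frame` function are all assumptions.

```python
import struct

# Hypothetical frame layout (NOT the real sensor's format):
#   bytes 0-1: PM2.5 in tenths of ug/m3, unsigned, big-endian
#   bytes 2-3: temperature in hundredths of a degree C, signed, big-endian
FIELDS = [
    {"name": "pm25", "unit": "ug/m3", "scale": 0.1,
     "description": "Fine particulate matter concentration"},
    {"name": "temperature", "unit": "degC", "scale": 0.01,
     "description": "Air temperature at the sensor"},
]


def decode_frame(frame: bytes) -> dict:
    """Turn a 4-byte raw frame into readable values plus their metadata."""
    raw = struct.unpack(">Hh", frame)  # one unsigned and one signed 16-bit int
    values = {
        field["name"]: round(number * field["scale"], 2)
        for field, number in zip(FIELDS, raw)
    }
    # Keep the data about the data right next to the data itself.
    return {"values": values, "metadata": FIELDS}


frame = bytes([0x00, 0x7B, 0xF8, 0x30])  # made-up example frame
record = decode_frame(frame)
print(record["values"])  # {'pm25': 12.3, 'temperature': -20.0}
```

Keeping the field descriptions in one table that both decodes and documents the data means a researcher reading `record` never has to open the sensor manual to learn what a number means or what unit it is in.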
Like Dylan, leading thinkers in data ethics, and specifically data feminists, are looking at this same concept of invisible work and data supply chains. Catherine D’Ignazio and Lauren Klein write in the chapter “Show Your Work” from their 2020 book Data Feminism:
Coding is work, as anyone who’s ever programmed anything knows well. But it’s not always work that is easy to see. The same is true for collecting, analyzing, and visualizing data. We tend to marvel at the scale and complexity of an interactive visualization ...but we are less often exposed to the networks of processes and people that help constitute the visualization itself... Unfortunately, however, when releasing a data product to the public, we tend not to credit the many hands who perform this work. We often cite the source of the dataset, and the names of the people who designed and implemented the code and graphic elements. But we rarely dig deeper to discover who created the data in the first place, who collected the data and processed them for use, and who else might have labored to make creations like the Ship Map possible. Admittedly, this information is sometimes hard to find. And when project teams (or individuals) are already operating at full capacity, or under budgetary strain, this information can, ironically, simply be too much additional work to pursue. Even in cases in which there are both resources and desire, information about the range of the contributors to any particular project sometimes can’t be found at all. But the various difficulties we encounter when trying to acknowledge this work reflects a larger problem in what information studies scholar Miriam Posner calls our data supply chain.
The danger of invisible work is that it can also be siloed work, work that researchers, funders, community partners, or university stakeholders don’t see or understand. In this context, the invisible work can be underestimated in terms of the time it takes to get from a sensor’s raw outputs to a beautiful and compelling data visualization, in terms of the cost of cleaning and preparing data, and in terms of the skills and human capacity a data-capable research team needs to produce reliable data assets for research activities that address some of Alaska’s most complex and pressing questions about the shape of our world today and in the future.
ACEP enters 2023 with a deep commitment to Alaska’s energy data ecosystem that extends beyond individual grants and research projects to create a broader network of support for energy data in Alaska, one of the most unique and data-rich energy landscapes in the nation. Under the direction of executive director Jeremy Kasper and executive officer Jennifer Harris, ACEP is poised to hire a cohort of data experts, including DevOps system engineers who can create and maintain a data infrastructure that allows ACEP researchers to more easily receive and rely on data collected across remote communities in Alaska, and data science analysts like Emily Richmond who can support research teams at ACEP to clean and analyze data. In addition to creating these positions, ACEP is also launching a new initiative to have a data librarian internship project as part of the ACEP Undergraduate Summer Internship (AUSI) program, pilot a data club for high schoolers, and engage a cohort of computer science students in data, geospatial, and programming tasks related to energy data.
ACEP has also invested in a Data Governance Lead to facilitate the dynamic, data-rich environment of ACEP’s researchers. The data governance lead’s role is to create a culture and norms around data decisions at all levels of ACEP’s research enterprise, establishing consistent best practices for ethical and, where appropriate, accessible energy data products that benefit remote Alaskan communities, Alaskan energy researchers, and researchers around the world. As part of this effort, ACEP has joined the Interagency Arctic Research Policy Committee (IARPC) as co-chair of the Data Management team.
Data haikus
After the snow fall
data crunching hard like ice
How to melt and share
- Vanessa Raymond
Funding never lasts
Beyond nearest horizon
Halfway there will do
- Vanessa Raymond
data mgmt plan
required, confer
for a joyful exercise
- Vanessa Raymond
good, bad, dirty, flawed
wrangled, wrassled, abandoned
rescued, ninja'ed, qa'ed
- Vanessa Raymond
We just want to talk
There's no need to be afraid
It's all just numbers
- Emily Richmond
For we must muster
Up the obvious questions
To find the answers
- Emily Richmond
Where did it begin,
The minds of data design;
The means to my end
- Emily Richmond
Data is vital
Accuracy is a must
Provides answers
- Alora Greer
Data, vast and deep
Endless streams of information
A world to explore
- OpenAI
Searching for meaning
Answers exist in data
Statistics blossom
- Dylan Palmieri
"a haiku", noted
Emily, "appears to have a
normal distribution"
- Kelsey Aho, US Forestry Service
b iological
i mpulsive events, or a
t hresholdy ʞɔɒd dılɟ
- Kelsey Aho, US Forestry Service
1 # do 'good' 'bad' labels
2 # remove the humanity
3 # from the data sets(?)
- Kelsey Aho, US Forestry Service
1 Metcalf, J., & Crawford, K. (2016). Where are human subjects in Big Data research? The emerging ethics divide. Big Data & Society, 3(1).
2 Onuoha, M. 2018. What is Missing is Still There. Nichons-Nous Dans L'Internet. Accessed at .