Posts Tagged ‘textmining’


November 18, 2022

Introducing #Galactica. A large language model for science.

Can summarize #academic literature, solve #math problems, generate #Wiki articles, write scientific code, annotate molecules and proteins

Explore and get weights:

Text mining on BOG22 abstracts

May 22, 2022

‘Lost’ medieval literature uncovered by techniques used to track wildlife | Science | AAAS

February 23, 2022

Could be used in other contexts than medieval lit.

Only a tenth of the human genome is studied | The Economist

April 28, 2021

There are roughly 20,000 genes in the human genome. Understanding genes and the proteins they encode can help to unravel the causes of diseases, and inspire new drugs to treat them. But most research focuses on only about ten percent of genes. Thomas Stoeger, Luis Amaral and their colleagues at Northwestern University in Illinois used machine learning to investigate why that might be.

First the team assembled a database of 430 biochemical features of both the genes themselves (such as the levels at which they are expressed in different cells) and the proteins for which they code (for example, their solubility). When they fed these data to their algorithm, they were able to explain about 40% of the difference in the attention paid to each gene (measured by the number of papers published) using just 15 features. Essentially, there were more papers on abundantly expressed genes that encode stable proteins. That suggests researchers—perhaps not unreasonably—focus on genes that are easier to study. Oddly, though, the pattern of publication has not changed much since 2000, despite the completion of the human genome project in 2003 and huge advances in DNA-sequencing technology.

Robo-writers: the rise and risks of language-generating AI

April 17, 2021


A neural network’s size — and therefore its power — is roughly measured by how many parameters it has. These numbers define the strengths of the connections between neurons. More neurons and more connections means more parameters; GPT-3 has 175 billion. The next-largest language model of its kind has 17 billion (see ‘Larger language models’). (In January, Google released a model with 1.6 trillion parameters, but it’s a ‘sparse’ model, meaning each parameter does less work. In terms of performance, this is equivalent to a ‘dense’ model that has between 10 billion and 100 billion parameters, says William Fedus, a researcher at the University of Montreal, Canada, and Google.)
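To make the parameter counting concrete, here is a minimal sketch for a toy fully connected network. It is only an illustration of "more neurons and more connections means more parameters": the function name and layer sizes are made up for this example, and GPT-3 itself is a transformer whose parameters are not counted this simply.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a toy fully connected network.

    layer_sizes lists the number of neurons per layer, input layer first.
    (Hypothetical helper for illustration; not how GPT-3 is structured.)
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between the layers
        total += n_out         # one bias per neuron in the receiving layer
    return total


small = count_dense_parameters([4, 8, 2])
wide = count_dense_parameters([4, 16, 2])
print(small, wide)
```

Doubling the hidden layer from 8 to 16 neurons roughly doubles the count here, because every extra neuron brings a connection to each neuron in the neighbouring layers.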

Small research teams ‘disrupt’ science more radically than large ones

February 28, 2019

“The authors describe and validate a citation-based index of ‘disruptiveness’ that has previously been proposed for patents [6]. The intuition behind the index is straightforward: when the papers that cite a given article also reference a substantial proportion of that article’s references, then the article can be seen as consolidating its scientific domain. When the converse is true — that is, when future citations to the article do not also acknowledge the article’s own intellectual forebears — the article can be seen as disrupting its domain.

The disruptiveness index reflects a characteristic of the article’s underlying content that is clearly distinguishable from impact as conventionally captured by overall citation counts. For instance, the index finds that papers that directly contribute to Nobel prizes tend to exhibit high levels of disruptiveness, whereas, at the other extreme, review articles tend to consolidate their fields.”
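The intuition in the quote can be sketched in a few lines. The excerpt gives no formula, so this assumes one common formulation: count later papers that cite only the focal article (n_i), those that cite both the article and its references (n_j), and those that cite only its references (n_k), then take (n_i - n_j) / (n_i + n_j + n_k). The function name and the toy citation data are hypothetical.

```python
def disruption_index(focal, focal_refs, later_papers):
    """Score a focal article from -1 (consolidating) to +1 (disrupting).

    focal        -- identifier of the article being scored
    focal_refs   -- set of identifiers the focal article itself cites
    later_papers -- dict mapping each later paper to the set of works it cites
    Assumed formulation: (n_i - n_j) / (n_i + n_j + n_k).
    """
    n_i = n_j = n_k = 0
    for cited in later_papers.values():
        cites_focal = focal in cited
        cites_refs = bool(cited & focal_refs)
        if cites_focal and cites_refs:
            n_j += 1  # acknowledges the article and its forebears
        elif cites_focal:
            n_i += 1  # cites the article but ignores its forebears
        elif cites_refs:
            n_k += 1  # cites the forebears but bypasses the article
    total = n_i + n_j + n_k
    if total == 0:
        return None  # no relevant later citations, index undefined
    return (n_i - n_j) / total


# Made-up citation data, for illustration only.
later = {
    "A": {"F"},
    "B": {"F"},
    "C": {"F", "R1"},
    "D": {"R2"},
    "E": {"X"},  # unrelated paper, ignored
}
print(disruption_index("F", {"R1", "R2"}, later))
```

In this toy case two papers cite only the article, one cites it together with a forebear, and one cites only a forebear, so the score is mildly positive.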

How to identify anonymous prose – Johnson

November 3, 2018

How to identify anonymous prose: interesting parallels between #textmining & genome seq. analysis (e.g. finding characteristic k-mers for a bacterial species)
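The parallel is easy to see in code: the same k-mer counting works on prose and on DNA. A minimal sketch, assuming character k-mers and cosine similarity as the comparison (my choice for illustration, not necessarily what the article or any particular stylometry tool uses); the function names are hypothetical.

```python
from collections import Counter
import math


def kmer_profile(text, k=3):
    """Count overlapping character k-mers after lowercasing and squeezing whitespace."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + k] for i in range(len(cleaned) - k + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two k-mer count profiles (0.0 if either is empty)."""
    dot = sum(count * b[kmer] for kmer, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# The same function handles a sentence or a sequence.
prose = kmer_profile("It was the best of times, it was the worst of times")
dna = kmer_profile("ACGTACGT", k=4)
print(prose.most_common(3))
print(dna["acgt"])
```

Comparing an anonymous text's profile against profiles built from known authors (or a read against reference genomes) then comes down to ranking the similarity scores.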

Reading by the Numbers: When Big Data Meets Literature

November 11, 2017

Reading by the Numbers: When #BigData Meets Literature. Distant reading as a complement to close reading for literary texts. Perhaps a useful dichotomy for biosequences too!

“Literary criticism typically tends to emphasize the singularity of exceptional works that have stood the test of time. But the canon, Mr. Moretti argues, is a distorted sample. Instead, he says, scholars need to consider the tens of thousands of books that have been forgotten, a task that computer algorithms and enormous digitized databases have now made possible.

“We know how to read texts,” he wrote in a much-quoted essay included in his book “Distant Reading,” which won the 2014 National Book Critics Circle Award for Criticism. “Now let’s learn how to not read them.””


Wikipedia shapes language in scientific papers

October 27, 2017

“Wikipedia is one of the world’s most popular websites, but scientists rarely cite it in their papers. Despite this, the online encyclopedia seems to be shaping the language that researchers use in papers, according to an experiment showing that words and phrases in recently published Wikipedia articles subsequently appeared more frequently in scientific papers.”

“Thompson and co-author Douglas Hanley, an economist at the University of Pittsburgh in Pennsylvania, commissioned PhD students to write 43 chemistry articles on topics that weren’t yet on Wikipedia. In January 2015, they published a randomized set of half of the articles to the site. The other half, which served as control articles, weren’t uploaded.

Using text-mining techniques to measure the frequency of words, they found that the language in the scientific papers drifted over the study period as new terms were introduced into the field. This natural drift equated to roughly one new term for every 250 words, Thompson told Nature. On top of those natural changes in language over time, the authors found that, on average, another 1 in every 300 words in a scientific paper was influenced by language in the Wikipedia article.”


#Wikipedia shapes lang. in science: seeding it with new pages & watching them evolve (vs. controls) as a type of social experiment

What the Enron E-mails Say About Us

August 6, 2017

Mark as Read highlights #Enron email as a canonical corpus for #textmining, w/ >3K academic papers published on this