Posts Tagged ‘textmining’

American journalism sounds much more Democratic than Republican
https://www.economist.com/united-states/2023/12/14/american-journalism-sounds-much-more-democratic-than-republican
QT:{{”
The first step in our analysis was compiling a partisan “dictionary”. We took all speeches in Congress in 2009-22 and broke them up into two-word phrases. We then filtered this list to terms used by large shares of one party’s lawmakers, but rarely by the other’s. The result was a collection of 428 phrases that reliably distinguish Democratic and Republican speeches, such as “unborn baby” versus “reproductive care” or “illegal alien” versus “undocumented immigrant”.
…Next, we collected 242,000 articles from news websites in 2016-22, and transcripts of 397,000 prime-time TV segments from 2009-22. We calculated an ideological score for each one by comparing the frequencies of terms on our list. For example, a story in which 0.1% of distinct phrases are Republican and 0.05% are Democratic has a conservative slant of 0.05 percentage points, or five per 10,000 phrases.
…Finally, we calculated the average partisan leaning of each news source’s coverage, weighting each story by the share of its content about domestic politics.
“}}
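The scoring rule they describe is simple enough to sketch in a few lines of Python. This is a minimal illustration of the calculation only; the dictionary entries below are placeholders, not the Economist’s actual 428-phrase list.

# Minimal sketch of the slant score described above, with placeholder
# dictionary entries rather than the real partisan phrase list.
REPUBLICAN = {"unborn baby", "illegal alien"}
DEMOCRATIC = {"reproductive care", "undocumented immigrant"}

def bigrams(text: str) -> set[str]:
    """Distinct two-word phrases in a document."""
    words = text.lower().split()
    return {" ".join(pair) for pair in zip(words, words[1:])}

def slant(text: str) -> float:
    """Conservative slant in percentage points: the share of distinct
    bigrams that are Republican minus the Democratic share.
    Positive leans Republican, negative leans Democratic."""
    phrases = bigrams(text)
    if not phrases:
        return 0.0
    rep = sum(p in REPUBLICAN for p in phrases) / len(phrases)
    dem = sum(p in DEMOCRATIC for p in phrases) / len(phrases)
    return 100 * (rep - dem)

This reproduces the worked example in the quote: a story whose distinct phrases are 0.1% Republican and 0.05% Democratic scores +0.05 percentage points, i.e. five per 10,000 phrases.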
Measuring media ideology: Bias or reality
December 20, 2023

Litmaps: Literature Map Software for Lit Reviews & Research
August 31, 2023

Galactica
November 18, 2022
Introducing #Galactica. A large language model for science.
Can summarize #academic literature, solve #math problems, generate #Wiki articles, write scientific code, annotate molecules and proteins
Explore and get weights: galactica.org
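For anyone who wants to poke at it, here is a minimal sketch of prompting the released weights through Hugging Face transformers. The checkpoint name (facebook/galactica-125m, the smallest release) and the [START_REF] citation-prompt convention are my assumptions from the release materials, so treat the details accordingly.

# Minimal sketch: load a small Galactica checkpoint and generate.
# Checkpoint name and prompt convention are assumptions, not verified here.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/galactica-125m")

# [START_REF] reportedly prompts the model to predict a citation.
inputs = tokenizer("The Transformer architecture [START_REF]", return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_new_tokens=60)
print(tokenizer.decode(outputs[0]))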
Text mining on BOG22 abstracts
May 22, 2022

‘Lost’ medieval literature uncovered by techniques used to track wildlife | Science | AAAS
February 23, 2022
Could be used in other contexts than medieval lit.
https://www.science.org/content/article/lost-medieval-literature-uncovered-techniques-used-track-wildlife
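The technique borrowed from wildlife tracking is an “unseen species” estimator: treat each work as a species and each surviving manuscript copy as a sighting, then estimate how many works were never sighted at all. A minimal sketch of Chao1, the classic such estimator (my assumption as the illustrative choice; the counts below are invented, not the study’s data):

# Minimal sketch of the Chao1 "unseen species" estimator applied to
# surviving manuscript copies. Example counts are invented.
from collections import Counter

def chao1(copies_per_work: list[int]) -> float:
    """Estimate total works (seen + unseen) from the number of
    surviving copies of each observed work."""
    counts = Counter(copies_per_work)
    s_obs = len(copies_per_work)   # works with at least one surviving copy
    f1 = counts[1]                 # works surviving in exactly one copy
    f2 = counts[2]                 # works surviving in exactly two copies
    if f2 == 0:
        return s_obs + f1 * (f1 - 1) / 2   # bias-corrected variant
    return s_obs + f1 * f1 / (2 * f2)

# e.g. five works with one copy each, two with two copies, one with six:
print(chao1([1, 1, 1, 1, 1, 2, 2, 6]))    # ~14 works estimated overall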
Only a tenth of the human genome is studied | The Economist
April 28, 2021
QT:{{”
There are roughly 20,000 genes in the human genome. Understanding genes and the proteins they encode can help to unravel the causes of diseases, and inspire new drugs to treat them. But most research focuses on only about ten percent of genes. Thomas Stoeger, Luis Amaral and their colleagues at Northwestern University in Illinois used machine learning to investigate why that might be.
First the team assembled a database of 430 biochemical features of both the genes themselves (such as the levels at which they are expressed in different cells) and the proteins for which they code (for example, their solubility). When they fed these data to their algorithm, they were able to explain about 40% of the difference in the attention paid to each gene (measured by the number of papers published) using just 15 features. Essentially, there were more papers on abundantly expressed genes that encode stable proteins. That suggests researchers—perhaps not unreasonably—focus on genes that are easier to study. Oddly, though, the pattern of publication has not changed much since 2000, despite the completion of the human genome project in 2003 and huge advances in DNA-sequencing technology.
“}}
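The shape of the analysis is easy to mock up: predict (log) publication counts per gene from a feature matrix and ask how much variance a handful of features explains. Everything below is a toy stand-in; the features, model choice and data are placeholders, not the Northwestern pipeline.

# Toy sketch of the analysis shape: regress attention (log paper counts)
# on gene/protein features and measure variance explained (R^2).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_genes = 2000
X = rng.normal(size=(n_genes, 15))                    # 15 toy features
y = 2 * X[:, 0] + X[:, 1] + rng.normal(size=n_genes)  # toy log(paper count)

model = RandomForestRegressor(n_estimators=200, random_state=0)
r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
print(f"variance explained: {r2:.0%}")                # the study reports ~40%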
Robo-writers: the rise and risks of language-generating AI
April 17, 2021
https://www.nature.com/articles/d41586-021-00530-0
GPT-3
QT:{{”
A neural network’s size — and therefore its power — is roughly measured by how many parameters it has. These numbers define the strengths of the connections between neurons. More neurons and more connections means more parameters; GPT-3 has 175 billion. The next-largest language model of its kind has 17 billion (see ‘Larger language models’). (In January, Google released a model with 1.6 trillion parameters, but it’s a ‘sparse’ model, meaning each parameter does less work. In terms of performance, this is equivalent to a ‘dense’ model that has between 10 billion and 100 billion parameters, says William Fedus, a researcher at the University of Montreal, Canada, and Google.)
“}}
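The 175-billion figure is easy to sanity-check with a back-of-envelope count for a dense transformer. The depth and width below are GPT-3’s published configuration; the formula deliberately ignores embeddings, biases and layer norms, which add comparatively little.

# Rough parameter count for a dense transformer.
# Per layer: ~4*d^2 for attention (Q, K, V, output projections)
# plus ~8*d^2 for the MLP (two matrices with a 4x hidden expansion).
def dense_params(n_layers: int, d_model: int) -> int:
    return n_layers * 12 * d_model ** 2

print(f"{dense_params(96, 12288):.2e}")  # ~1.7e11, i.e. ~175 billion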
Small research teams ‘disrupt’ science more radically than large ones
February 28, 2019
QT:{{”
“The authors describe and validate a citation-based index of ‘disruptiveness’ that has previously been proposed for patents6. The intuition behind the index is straightforward: when the papers that cite a given article also reference a substantial proportion of that article’s references, then the article can be seen as consolidating its scientific domain. When the converse is true — that is, when future citations to the article do not also acknowledge the article’s own intellectual forebears — the article can be seen as disrupting its domain.
The disruptiveness index reflects a characteristic of the article’s underlying content that is clearly distinguishable from impact as conventionally captured by overall citation counts. For instance, the index finds that papers that directly contribute to Nobel prizes tend to exhibit high levels of disruptiveness, whereas, at the other extreme, review articles tend to consolidate their fields.”
“}}
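The index itself is a few lines of code. This is a minimal sketch following the Funk and Owen-Smith formulation that the paper builds on; the citation sets in the example are invented.

# Minimal sketch of the disruptiveness (CD) index described above.
def cd_index(citers_of_focal: set, citers_of_refs: set) -> float:
    """citers_of_focal: later papers citing the focal paper.
    citers_of_refs: later papers citing any of the focal paper's references.
    Returns a value in [-1, 1]: +1 fully disruptive, -1 fully consolidating."""
    n_i = len(citers_of_focal - citers_of_refs)  # cite focal only (disrupting)
    n_j = len(citers_of_focal & citers_of_refs)  # cite both (consolidating)
    n_k = len(citers_of_refs - citers_of_focal)  # cite only the references
    total = n_i + n_j + n_k
    return (n_i - n_j) / total if total else 0.0

# e.g. six papers cite only the focal work, two cite it with its references,
# one cites the references alone:
print(cd_index({"a", "b", "c", "d", "e", "f", "g", "h"}, {"g", "h", "x"}))  # ~0.44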
How to identify anonymous prose – Johnson
November 3, 2018
How to identify anonymous prose
http://Economist.com/books-and-arts/2018/09/22/how-to-identify-anonymous-prose Interesting parallels between #textmining & genome seq. analysis (e.g. finding characteristic k-mers for a bacterial species)
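A minimal sketch of one standard stylometric move, comparing function-word frequency profiles with cosine similarity. This is a generic illustration, not necessarily the article’s exact method, and the word list is a tiny placeholder.

# Minimal stylometry sketch: compare function-word frequency profiles.
import numpy as np

FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "was"]

def profile(text: str) -> np.ndarray:
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = np.array([words.count(w) for w in FUNCTION_WORDS], dtype=float)
    return counts / max(len(words), 1)

def similarity(known: str, anonymous: str) -> float:
    """Cosine similarity between two texts' function-word profiles."""
    a, b = profile(known), profile(anonymous)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

Swapping function words for character k-mers gives essentially the genome-style version of the same comparison noted above.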
Reading by the Numbers: When Big Data Meets Literature
November 11, 2017
Reading by the Numbers: When #BigData Meets Literature
https://www.NYTimes.com/2017/10/30/arts/franco-moretti-stanford-literary-lab-big-data.html Distant reading as a complement to close reading for literary texts. Perhaps a useful dichotomy for biosequences too!
QT:{{”
“Literary criticism typically tends to emphasize the singularity of exceptional works that have stood the test of time. But the canon, Mr. Moretti argues, is a distorted sample. Instead, he says, scholars need to consider the tens of thousands of books that have been forgotten, a task that computer algorithms and enormous digitized databases have now made possible.
“We know how to read texts,” he wrote in a much-quoted essay included in his book “Distant Reading,” which won the 2014 National Book Critics Circle Award for Criticism. “Now let’s learn how to not read them.””
“}}
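A distant-reading exercise in this spirit can be a few lines: compute one crude feature over every text in a digitized corpus rather than reading any single text closely. The directory layout and the choice of feature below are assumptions for illustration.

# Minimal distant-reading sketch: one crude feature per book, whole corpus.
from pathlib import Path

def type_token_ratio(text: str) -> float:
    """Crude lexical-diversity measure: distinct words / total words."""
    words = text.lower().split()
    return len(set(words)) / max(len(words), 1)

# Assumes a corpus/ directory of plain-text books.
corpus = {p.name: type_token_ratio(p.read_text(encoding="utf-8"))
          for p in Path("corpus/").glob("*.txt")}
for name, ttr in sorted(corpus.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{ttr:.3f}  {name}")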