
A Text is a Text is a Text

I have a bachelor's degree in Literature, and at college I spent most of my time thinking, writing, and talking about texts: what a text says, what a text means, and what a text is. When I decided to make the switch from studying literature to Computer Science a year and a half ago, I would never have thought I would be fortunate enough to still be thinking about those very topics.

Here at Scripted, I work on automatically grouping similar texts or documents together. In Machine Learning terms, I'm talking about document clustering and classification. There are a variety of ways to accomplish this, and the 'best' algorithm isn't always a clear-cut choice, as it depends largely on why you're trying to group documents together.

One preprocessing requirement most of these algorithms have in common is the need to reduce a document to a 'bag of words'. That is, these algorithms aren't concerned with word order. They just look at how many times each word appears in a given text. We can also take the bag of words model and derive a sometimes more telling representation like tf*idf vectors, which weight each term's count by how rare that term is across the corpus. To get an idea of what this looks like, here are the top terms of this very blog post in bag of words form (term frequency or tf) and tf*idf form (term frequency multiplied by inverse document frequency) when considered against the entire Scripted corpus:

Term Frequency:          tf*idf:

text          15         grouping      0.7349142790721426
document      10         vector        0.4409485674432856
literature    5          literature    0.2755928546520534
computer      5          corpus        0.2204742837216428
algorithm     5          classifying   0.1837285697680356
grouping      4          startling     0.1377964273260267
vector        4          arc           0.1102371418608214
time          4          grouped       0.0918642848840178
term          4          extracting    0.0787408156148724
clustering    6          algorithm     0.06263473969364851
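
For readers who want to see the two representations side by side, here is a minimal sketch in Python. It is not the code we run at Scripted: the tokenizer, the three-sentence toy corpus, and the choice of idf = log(N / df) are all assumptions made for illustration, so it will not reproduce the numbers in the table above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequency(tokens):
    """Bag of words: map each term to its count, discarding word order."""
    return Counter(tokens)


def tf_idf(tokens, corpus):
    """Weight each term count by log(N / df), one common idf variant (an assumption).

    `corpus` is a list of token lists and must include the document itself,
    so every term has a document frequency of at least 1.
    """
    doc_sets = [set(doc) for doc in corpus]
    n_docs = len(doc_sets)
    scores = {}
    for term, count in term_frequency(tokens).items():
        df = sum(1 for doc in doc_sets if term in doc)
        scores[term] = count * math.log(n_docs / df)
    return scores


# Toy corpus, invented for illustration only.
corpus = [tokenize(t) for t in (
    "The text is a text is a text",
    "The beer has a hop flavor",
    "The cloud stores the data",
)]

bag = term_frequency(corpus[0])
scores = tf_idf(corpus[0], corpus)

print(bag.most_common(3))
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3])

# Word order is lost: both sentences reduce to the same bag.
print(term_frequency(tokenize("dog bites man")) == term_frequency(tokenize("man bites dog")))
```

In this toy corpus "the" appears in every document, so its idf is log(1) = 0 and it drops out of the tf*idf ranking even though it still counts in the plain bag of words.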


Using representations like these can be incredibly effective for grouping documents together and, when clustering, they can be very useful for discovering some of the central topics the documents in a corpus are about. For example, here are a few of the topics businesses have hired our writers to write about, one topic per line:

heart, blood, artery, pressure, high, vessel
beer, ale, brewing, flavor, hop, yeast
data, cloud, quality, information, storage, system
beach, resort, luxury, island, hotel, vacation
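
To give a feel for how topic lists like these can fall out of clustering, here is a small sketch, again in Python. This post does not say which clustering algorithm produced the topics above, so treat everything here as an assumption: the sketch uses k-means with cosine similarity on length-normalized tf*idf vectors, a deterministic farthest-first seeding, and six tiny documents invented for the example. The helper names (`tfidf_vectors`, `kmeans`, `top_terms`) are hypothetical.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a sparse vector (dict of term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def tfidf_vectors(docs):
    """Turn token lists into unit-length sparse tf*idf vectors."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [normalize({t: c * math.log(n / df[t]) for t, c in Counter(doc).items()})
            for doc in docs]


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, k, iterations=10):
    """Cluster with cosine similarity; returns (labels, centroids)."""
    # Farthest-first seeding: start from the first document, then repeatedly
    # add the document least similar to every centroid chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        labels = [max(range(k), key=lambda i: cosine(v, centroids[i])) for v in vectors]
        # Move each centroid to the normalized sum of its members.
        for i in range(k):
            total = {}
            for vec, label in zip(vectors, labels):
                if label == i:
                    for t, w in vec.items():
                        total[t] = total.get(t, 0.0) + w
            if total:
                centroids[i] = normalize(total)
    return labels, centroids


def top_terms(centroid, n):
    """The n highest-weighted terms of a centroid, a rough 'topic' label."""
    return [t for t, _ in sorted(centroid.items(), key=lambda kv: kv[1], reverse=True)[:n]]


# Six toy documents, invented for illustration only.
docs = [text.split() for text in (
    "heart blood artery pressure",
    "beer ale brewing hop yeast",
    "blood pressure heart vessel",
    "beach resort island hotel",
    "hop yeast beer flavor",
    "hotel vacation beach luxury",
)]

labels, centroids = kmeans(tfidf_vectors(docs), k=3)
for i, centroid in enumerate(centroids):
    print(i, top_terms(centroid, 3))
```

The top terms of each centroid play the same role as the topic lines above: they are the words that the documents in a cluster share most strongly.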


This type of grouping is useful to us as a kind of finger on the pulse of our writers, so that we can see who writes about what in more detail. But one thing about the bag of words model bothers me, perhaps irrationally so--it's a simplification that changes the very form of a document. Once converted, a text is no longer a text, but a vector. It's a simplification that's necessary for this type of grouping, but both the literature student and the computer scientist in me are saddened every time I convert a document to a vector. The literature student part of me is saddened because I know I am losing important features of the text like tone, style, narrative arc, and meaning. The computer scientist part of me is saddened for the exact same reason--because these features are ultimately information. Indeed, they are almost the entire reason we care about these documents at all.

This simplification is necessary, at least for the time being, because computers are still ineffective at extracting and handling these features. That is to say: computers are not good readers (and even worse writers). Advancements are always being made, but as of now it's necessary to break a text down into a bag of words so a clustering or classifying algorithm can process it. And I can't deny that there is something pleasantly surprising about the fact that the simplification of a text into a vector can be incredibly useful. My literature student's inclination is to resist anything but a holistic approach when analysing a text, but it's always startling to see how the hyper-specialized nature of an algorithm can produce fascinating results.

Published by Scripted Writers on Friday, July 20, 2012 in Development, Featured, Staff.
