Sunday, January 15, 2023

Ok, Machine Learning

You are a scholar and a teacher. You're worried about these AI chat systems; you don't necessarily care that your students are using the thing. What you really care about is that if they do ask one of these systems a question, they get the right answer.

And, for your own research, you wonder if you can get a good answer to your own questions. How do you tell if they're any good?

First you go and feed one of these systems your own homework question, right?

Do not try this, at least not first thing out of the gate. You can be fooled by your own head if you try to "grade" the results without knowing what the system is doing.

Instead, try this. Ask it a question that looks and sounds like something Wikipedia can answer, then see if it does two things: do the answers correspond to the Wikipedia page relevant to the question? And, just as importantly, does it use only the answers found in the Wikipedia page in question?

The first test is of course for accuracy. Note, I don't mean that the answer is quote-for-quote from Wikipedia; in fact it's better here if it doesn't pull quotes directly. I just mean: do the facts and assertions match up to those of the Wikipedia page?

The second test is for completeness. Extra information here is not extra credit by default, and should be discounted unless you are dealing with a field you know well enough to find that information in a trustworthy, publicly available digital source. Only trustworthy, creditable information that's publicly available and verifiable independent of the chat system should be included.
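The two tests above can be sketched as code, if only to make them concrete. This is a toy, not a real fact-checker: the function names and the crude keyword-overlap heuristic are my own illustration, and it assumes you've already pasted in the chat system's answer and the relevant Wikipedia article as plain text.

```python
def keyword_overlap(sentence: str, reference: str, threshold: float = 0.5) -> bool:
    """Crude support check: does most of the sentence's vocabulary appear in the reference?"""
    ref_words = set(reference.lower().split())
    words = [w.strip(".,;:!?").lower() for w in sentence.split()]
    words = [w for w in words if len(w) > 3]  # skip short function words, very roughly
    if not words:
        return True
    hits = sum(1 for w in words if w in ref_words)
    return hits / len(words) >= threshold

def grade_answer(answer: str, wikipedia_text: str):
    """Test 1 (accuracy) and test 2 (containment), judged sentence by sentence."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    supported = [s for s in sentences if keyword_overlap(s, wikipedia_text)]
    unsupported = [s for s in sentences if not keyword_overlap(s, wikipedia_text)]
    return supported, unsupported
```

Any sentence that lands in the `unsupported` pile is exactly the "extra information" to be discounted: either it fails test one (it's wrong) or it fails test two (it came from somewhere you can't verify). A real grader is your own eyes on the two texts; the point is only that the procedure is mechanical enough to write down.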

And yes, you should also try this with "known shitty" internet questions. If you start seeing lunatic fringe answers in the results you know the system in question has not been evaluated completely for Garbage In, Garbage Out. Not all data sets are valid for the purpose presented.

You should also try this with questions in fields where you aren't necessarily an expert, but where you can readily track down both the Wikipedia page and the top 10 or 20 field-standard references. This is a test for breadth of knowledge: has the system been built to fool you in particular?

And then, if you're ready for finding out if the system really knows its stuff, find out if it can do the same thing with a well-known review article in your field or one you're interested in learning...

You are an artist. Really, you're intrigued by whether these systems can work for you. And, deep down maybe you're worried that it's using your own art somehow. How do you know if the system is useful, first? How do you know that it's actually doing something artistically worthwhile, and not just copying in a hidden way?

First thing you do is feed it a prompt for one of your own artworks, right?

Don't do this first. Wait a bit on fishing for your stuff and try something else. Your eyes will play tricks on you.

Instead, try this: ask the system to reproduce your favorite Van Gogh. Or Rembrandt. Or whomever, just make it a public-domain piece that you know well. One that you've studied yourself.

How did it do? Now, find out if it can do Jackson Pollock, or Andy Warhol. And yes, I'm serious: if it has Jackson's or Andy's work in its dataset, it should be able to reliably get to a named artwork. If not?

It's restricted in some way from reproducing that newer work. This can be good or bad depending on your view on copyright, but know that this means that, artistically, there's a hole in its view of the world somewhere. Whether or not it's useful for your purpose I'll leave to your artistic mind.

Depending on how well it did with a newer, name artist, now is also the time to ask it if it's capable of producing one of your works. Then, if you're interested in how well it works under the hood, go on to find out how it combines two well-known works to produce something you haven't seen before. Here's where you get to judge whether or not it can do something useful for you. What would have happened had Annie Leibovitz been able to work with Ansel Adams? How would Picasso have done the Sistine Chapel? What would Van Gogh's Forty Views Of Fuji look like?

You're a pro musician: you're booked. Can you use one of these systems to compose, produce? How do you know they're doing something useful and not just sampling?

First, ask it to reproduce a piece you know, and not one of your own. Bach, the Beatles, listen widely and deeply.

Did it work for all of your tests? Get wild: pick one and ask it to change the key. After that, ask it for a different rhythm.

Note: depending on what the algorithm is doing, these two questions in particular can be either very easy or very nearly impossible. If they do work, the system is doing it properly (i.e. signal analysis is involved at the important levels). If not, it's sampling in an obscured way, in which case you can ask it for your own works with a completely different purpose in mind.

The point being: an expert system that is only sampling (Type 1) has its uses. However, an expert system that can actually morph something properly (Type 2), like a key change or a samba to four-on-the-floor rhythm change, now that's a different tool entirely. And, fundamentally, there's a very real difference in what's going on under the hood between the two: a sampling machine that reproduces one of your own works is straight-up copying.
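The Type 1/Type 2 split can be made concrete in symbolic (MIDI-note) terms rather than audio. This is a toy illustration under my own made-up names; the melody fragment and the "sample library" are invented for the example.

```python
# Type 1, the Copier: it can only return recordings it has stored.
SAMPLE_LIBRARY = {"ode_to_joy": [64, 64, 65, 67, 67, 65, 64, 62]}

def type1_request(name: str, key_shift: int = 0):
    """A sampler can fetch, but a key change is outside its vocabulary."""
    if key_shift != 0:
        return None  # it has no working representation of pitch to operate on
    return SAMPLE_LIBRARY.get(name)

# Type 2, the Analyzer: it works on a representation of the music itself,
# so a key change is just arithmetic on pitch.
def type2_transpose(notes, semitones):
    """Shift every pitch by the same interval -- e.g. +2 moves C major to D major."""
    return [n + semitones for n in notes]
```

The request that separates them is exactly the test above: ask for the same piece up a whole step. The Copier has nothing to say; the Analyzer answers trivially, because it actually holds a representation it can transform.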

A music-signal analysis expert system can get to your work through a different route entirely. It sounds weird, but this kind of system may indeed know you well enough to reproduce something you wrote without directly copying.

In fact, this applies to the artist, the musician, and the scholar as well: if you find a system that can quote you, or that can reproduce one of your works, whether it's a Copier (Type 1) or an Analyzer (Type 2) matters. Type 2 systems are the most useful, the most properly constructed, and the most likely to be capable of reproducing your work without directly copying it.

At least in the immediate gold-rush mentality that always accompanies new tech, I suspect we'll see quite a few Type 1, Copier, systems: taking computational and data-analysis shortcuts is one of the easiest ways for those in a hurry to produce something that can fool people into thinking they're dealing with a Type 2, Analyzer, system. But as with sampling as it already exists, Type 1 systems that can reliably re-word known information very much have uses, if in a quite different manner than Type 2 systems do.

Wednesday, January 4, 2023

Alas, Machine Learning

Thoughts and ramblings for my own purposes.

Under the hood, it's as computationally and communications bound as ever. In practical terms, this means there's a point coming where mass computational bounds will kick in. What's economically viable to build for computers limits us all, but then that's in effect what ML was built to address, in some ways. The limit still kicks in, just at a new level.

I wonder what the effective "word" length here is, or will be? Think of letters, then words, then sentences as 1-point, 2-point, ... n-point basis functions. Are paragraphs then the n+1 limit? Essay length? Not in the sense of not being able to construct longer systems; more in the sense of repetitiveness, enforced periodicity by basis-set limit rather than formal limit. "Perception of the machine" falls here.
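The "enforced periodicity by basis-set limit" intuition shows up even in the smallest possible model. Here's a toy word-level Markov chain of my own construction (the training phrase is made up): with a context of only one word, generation is forced into a cycle as soon as the vocabulary repeats.

```python
def build_bigrams(text: str) -> dict:
    """Map each word to the word that followed it (last occurrence wins -- a toy)."""
    words = text.split()
    return {a: b for a, b in zip(words, words[1:])}

def generate(model: dict, start: str, length: int) -> list:
    """Walk the chain; with a one-word 'basis', the output must eventually loop."""
    out = [start]
    for _ in range(length - 1):
        nxt = model.get(out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return out
```

Trained on "sun rises moon sets sun rises", this thing can only ever cycle sun, rises, moon, sets, forever. A larger basis (longer context) pushes the period out; it doesn't abolish it. That, crudely, is the repetitiveness question for any fixed "word" length.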

Some years back, generic sports articles based simply on the line scores began to be generated this way, similarly with AP style news reports. "So and so won, so and so lost, here's the breakdown" kind of thing. Certain kinds of traffic and weather reports could as well be generated this way. Web pages, summaries.

Where does the error creep in, and how do you work with or around it? Garbage in, garbage out always applies. In a purely numerical context, new algorithms can always be measured. How do you ensure accuracy here? Replicability, too?

In a couple of the major fields involved, when asked long ago I made the comparison to the periodic table: meaning that what was missing was an empirical map. How does X relate to Y? Everything is foggy and dim; is it even possible to lay out a map in such flickering shadow and light? Here then is ML coming in with at least a possible construction.

Which is of course where the formal part began, or one way into it. Here's this arbitrary data set we know nought of. How does it relate to itself? What can we do with this arbitrarily large volume of presumptive knowledge that we don't yet understand?

Suppose you had a library accumulated by a sage since passed on. The sage was mysterious, crusty and cranky, and disinclined to tell anyone of their methods. Now your hands pore over old manuscripts in forgotten tongues, all organized, clearly, but in some fashion our old friend forgot to teach us. What do we do? We don't speak any of these languages, we don't know how our friend did it, what they meant by putting this scroll next to this codex next to this little sheet of paper much scratched and stained.

Let us consult the crystal; can it tell us where and how and why each text fits with another? Can it summarize for us what is contained there, and, better, which questions we can ask of which text? What if we could, then, summon forth both a librarian to organize, to systematize, and a scholar to help us understand what we have? And, perhaps, if we're dreaming, a new sage to add to the collection of knowledge?

This last is, formally, where we break down. "Artificial Intelligence" was/is market speak. Machine learning is what the experts preferred, though whether the distinction continues to be respected with mass adoption, I dunno. That aside, the difference is that the first two questions relate to transformations contained within a data set.

The third relates both to generalization between data sets, and to generalization beyond data sets. Crudely, interpolation versus extrapolation, though just like diagonalization versus singular value decomposition the equivalence is there. Still and all... asking for something new becomes the frontier.

Just like Wikipedia, scholarly communities will be obligated to query, refine, and strengthen a given instance, out of self-defense. You'll need to make sure that if such a thing is out there it's giving correct answers. This took a long time to even begin happening with Wikipedia, and it's only done now in narrow instances. Professional obligations will expand; disciplines unused to programming should now understand that they'll need to require it.

Just like every other stage of computational development: does the computer do what I need done? Calculator, spreadsheet, web, can I get the answer I need? Can I trust the answer? It's a tool, how do I use it?

Listen: transformative work is transformative. That this allows automated transformation is irrelevant. The copyright office recognizes this; it also recognizes that the person using the computer to transform my work into something else, no matter what work they've put into the computer, isn't creating something the same way they would have if they had written it themselves. Thus, at present, ML-generated works are not copyrightable.

This has many implications.

First, cheap copies, where someone takes one of my works, changes just enough of it to fly under the radar at Amazon or wherever, and tries to cash in, become untenable in the long run. Why would you need to do that when you can just ask an instance to generate a new work? Even if it's incorporating my work into the melange, so what? That's what would happen anyway, just in bits on a computer rather than in the memories of the next generation of artists.

Second, it means there's going to be an almighty fight when the media conglomerates realize what uncopyrightable means in this context. Right now, the media conglomerates appear to recognize that their catalogue has significant value in the brand-new future.

And it does. Warner Bros. or Disney or whoever appear to sit on the gold mine for training the next generation of ML machines to spit out branded media.

Sounds great. Each house will be able to perpetuate their secret sauce, down to the actors and voices and music and images... too bad for them it's not creation in the artistic sense, and thus, for now, uncopyrightable. Neither is it something they can prevent others from doing. At least not if they actually want someone to view their product in the first place. If you can use today's actors in perpetuity, so can anyone else, sayeth the copyright office.

Which of course means that the media conglomerates are going to raise high holy hell when they figure it out. Gods preserve us. You thought they howled when Mickey was threatened? Look out.

For movies and music, assuming that no one manages to completely screw up all of copyright law by doing something "novel", I suspect the fine line that makes this work economically for the conglomerates is finding someone who can use the ML systems to generate as part of something larger. In other words, ML systems as an element of a broader, complete artistic creation process. Like sampling, only with broader extent than audio.

At the same time, there will then also be video and written-story equivalents of Muzak, generated for airplane seatbacks or waiting rooms or whatever.

So, video and audio; Dylan and Simon and all the rest selling their catalogues, Cameron and Avatar 2, the last great cash grabs available before the previous financial landscape changes irrevocably.

What then of text? I'll use Stephen King as an example here, not because I know anything of what he or his heirs are planning, but because he's one of the primary household names in the written word.

Suppose that someone involved recognizes that King's life's work represents not just a present value, but a future value: in an ML world, all of King's works become the basis for future works, long after the author has left us.

If the copyright office says, great, fine, but it's still not copyrightable, is this life's work still valuable as the basis for generating ML work in the future?

It is if you've heirs then capable of their own transformations and creative contributions to the eventual new work. Or, failing that, well able to hire it done. If we accept that conglomerates will find producers and directors who can successfully generate "based on" work to be monetized, then so too can estates find a combination of writers to generate "based on" novels and stories.

Only, now, without even the need to go digging for half-finished trunk books, or outlines, or notes on, or all the other ways they've done it in the past. The computer can generate that outline to order. And the estate can commission, or ask a son or daughter or...

So, then, thus: if there are now multiple generations of writers who "grew up" as Star Wars or Star Trek or "insert media here" writers for hire, the future will then hold estate-trademarked (because remember this: you can't stop someone else from using already published works to do their own transformations. But that is what trademarks help with, if used properly...) Stephen King media writers, and Dean Koontz media writers. Think what Brandon Sanderson did with the Wheel of Time, but now perpetual and at much larger scale. No longer half a dozen at best, but, like Tom Clancy's estate, over and over again as needed. At least for the 70 years after the original author passes.

And this applies not just to someone of the stature of King or Clancy or Koontz. Imagine what will happen with the Song of Ice and Fire. Or the Name of the Wind. Or even your own works, you little writer you. Maybe there's room here, not just for your heirs to keep a little bit of money coming their way, but even to extend it a little. We can all make a little business for our kids to work in, even if how they do it doesn't quite resemble the way we did it.

So: there are creative ways that ML will be used to jump the uncopyrightable hurdle. Book it, it's already happening. And thus, the financial landscape will change, not burn down.

This provides opportunity. Protection, in that the silly cheap-copy bullshit will likely fall away as unnecessary. And yeah, they'll be using your work, but transformative is transformative; you're never protected from that. It's actually better to have your work be part of a much broader library that's built from: then it's part of a stew, not a sushi bar. And the sort of Tom Clancy/Frank Herbert perpetual zombiehood now becomes a tool that any reasonably savvy artist can use for their heirs and assigns.

It's kind of a big deal, ain't it? And in a good way if you're ready for it.

The doom and gloomers here are missing the forest for the trees, especially on one big thing: there's always someone better than you at what you do. So what? That great orators exist doesn't stop me from speaking to those I must needs talk to. That Andres Segovia played and was recorded stops me not at all from picking up my guitar. If I need to sketch, I don't let all the much better artists and draftsfolks out there prevent me from doing my little cartoons up.

Art is communication. If you are not simply to be a consumer of art, you will have your place to go to when you need it. You have a way to let your voice ring out. No one can prevent it.

And, perhaps, just maybe, and with care, the computer will show you some more new options for how to pass your voice on to others. Find that inner 15 year old that doesn't give a shit what their parents say, doesn't look for a moment to whether it's worth anything, doesn't know or care who's done it before, damnit they're gonna make their own art come hell or high water.

That voice? That hand? That story, that song? It's always yours. It's always you. Embrace it no matter what. Let the worry warts go bother someone else.

Sunday, January 1, 2023

Just Gotta Love The Great Computer Randomizations

Ah, that wonderful feeling: some update blobbed a configuration file somewhere. One that I haven't touched in (scanning file dates) six years now. One that I set a customization flag in by hand the last time it drove me crazy that I couldn't get a gui widget to give me the environment I prefer, damnit, not the as-shipped.

It's really amazing how fast you get out of the habit of looking under the hood when you don't have a daily. I say how fast; six years isn't. Well, it used to not be, but that's one of those things you can understand intellectually, not at the gut level. Not until you look up and see the dust on the proverbial bookshelf, anyway.

One of this bit twirler's bad habits is to have ingrained a handful of key-bindings into the subconscious long ago, and then never to have revisited the muscle memory, because it was faster at any given moment to just go with the habits ingrained. It's something like having learned scales with a given rhythmic pattern, and then never having gone back and re-learned them with another. I get locked into something that isn't necessarily the best use of keystrokes, but isn't sufficiently troublesome to make me take the time to code a different way.

I like to think I should have done something like that for book formatting, for instance. I could and should sit down and write out the various steps, script them, and automate the process, at least so far as the steps for which automation is the better choice. I just haven't yet. I'll get to it... eventually. There's always something else, though, you see.

And look at me. I did actually want to talk about Machine Learning today. I blame Noah Smith; he put up a Substack post that set out some things I like, and a couple I don't, in relation to where Machine Learning is at just this moment. And then here I am, caught on the first of the year arguing with my computer.

Maybe that's it. Maybe the computer's trying to keep me away from discussing the topic. Hmm, I'll have to meditate on that.

If so, I'd like to think that it's my computer telling me that, if I am going to be spending time at the keyboard, wouldn't I rather be fictioning?

Yes, yes I would. We are our bits then, aren't we? inside and outside the brain case. Kind of cool, ain't it? And here we thought we'd all need to directly use our brain stem for signal transport. Understanding really does only come in small chunks.