Hacker News

Copilot is crazy. The other day, I was writing a Python function that would call a Wikipedia API. I pulled from the internet an example of a GET request, and pasted it as a comment in my code.

  # sample call: https://en.wikipedia.org/w/api.php?action=query&format=json&list=geosearch&gscoord=37.7891838%7C-122.4033522&gsradius=10000&gslimit=100
Then I defined a variable,

  base_url = "https://en.wikipedia.org/w/api.php?"
Then, like magic, Copilot suggested all the remaining keys that would go in the query params. It even knew which params were to be kept as-is, and which ones would come from my previous code:

  action = "query"  # action=query
  format = "json"  # or xml
  lat = str(latitude.value)  # 37.7891838
  lon = str(longitude.value)  # -122.4033522
  gscoord = lat + "%7C" + lon
  ...
  api_path = base_url + "action=" + action + "&format=" + format + ... + "&gscoord=" + gscoord
As a guy who gets easily distracted while programming, Copilot saves me a lot of time and keeps me engaged with my work. I can only imagine what it'll look like 10 years from now.



This comment is accidentally the perfect example of why copilot is a horrific idea.

The old "just copy-paste from Stack Overflow" approach to development is satirised and ridiculed these days (though I'm certain it's still common practice), because as we all know by now, an accepted answer on SO does not always equate to a correct answer. Yes, the SO team and community do their best to improve answer quality iteratively (wiki answers, etc.), but there are still a lot of bad answers, and even many of the "good" ones become outdated or fail to keep up with modern best practice (especially when it comes to security).

Omitting URL encoding isn't the biggest crime, but it is a pretty standard URL-building step, and the fact that a tool released this year is spitting out code that omits something so simple is fairly damning. It's also a micro-example of the much larger errors Copilot will surely be responsible for. Missing URL encoding can be an injection vector in many applications, even if it's not the most common risk, and miss encoding in other string-building ops and you've made your way into the OWASP Top 10.
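To make that concrete, here's a minimal sketch (hypothetical URL and search term, stdlib only) of what skipping the encoding step actually breaks:

```python
from urllib.parse import urlencode

# Hypothetical user-supplied value containing reserved URL characters.
term = "C&A #1"

# Naive concatenation: '&' starts a new parameter and '#' starts a fragment,
# so the server sees q=C, a bogus "A " parameter, and "#1" never arrives at all.
naive = "https://example.com/search?q=" + term

# Encoding keeps the value intact as a single parameter.
safe = "https://example.com/search?" + urlencode({"q": term})
print(safe)  # https://example.com/search?q=C%26A+%231
```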

The big difference between copilot and SO is there's no community engaging in an open & transparent iterative process to improve the quality of answers.


+1. URL encoding is also very relevant on the backend and has security implications, e.g., you want to be sure you are protecting against double encoded URLs. If you elide details like URL encoding using copilot that's dangerous.
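A minimal illustration of the double-encoding trap (payload invented for illustration):

```python
from urllib.parse import unquote

# A path-traversal payload hidden behind a second layer of encoding.
raw = "%252e%252e%252fadmin"

once = unquote(raw)    # "%2e%2e%2fadmin" -- still looks inert
twice = unquote(once)  # "../admin"       -- the real payload

# A backend that decodes twice (or decodes after a proxy already has)
# sees the traversal; one that only validates the singly-decoded form misses it.
print(once, twice)
```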


For me APIs are actually one of the places it performs the worst. Copilot is like having an inexperienced yet annoyingly optimistic pair programmer. The code it generates appears conceivable in some hypothetical universe. No guarantee it's this one though.

Remember it doesn't actually know an API or how it should be used: it's putting things together to look like typical code. For me that has meant difficult to spot bugs like linking up incorrect variables from the rest of my code.

I wish it could integrate the first SO answer to a generated question, because I always end up there anyway having to fix things.


I think my experience has been sort of between you two. Maybe 1/3 times it's spot on. The rest of the time, there is some minor tweak I need to make (it gets a parameter or variable name wrong). I've yet to hit cases where the code it generates looks right but doesn't run as expected, thankfully.

I've only had it for about a week now, but overall I'm happy with it. None of the code I'm writing is crazy cutting-edge stuff, and in aggregate I'm sure it saves me more time than it takes, including the time I spend reviewing and potentially changing the generated code.


That's a bummer. I just got whitelisted and was hoping it could save me time with some APIs where they only have code in X language or curl and I have to work backwards if I run into any issues.


Bit of a dodgy way to form query parameters though, other than for a quick script.


Even for a quick script this worries me about copilot; if it suggests this, then more people use it and think this is right, commit it, and then copilot suggests this more - that’s a bad feedback loop. At least in StackOverflow you get someone commenting why it’s bad and showing how to use a dictionary instead


I think they only pick starred repos, not truly in-the-wild code. That's no guarantee of good code, but it's still a decent check.


I'm not against "copying" code. I just looked up "python build url query". The first link describes the `urllib.parse.urlencode` function, which takes a dict.

So I would build the query like so:

    from urllib.parse import urlencode
    urlencode({
        "action": "query",
        "format": "json",
        ...
        "gscoord": f"{str(latitude.value)}|{str(longitude.value)}",
    })
I think this is orders of magnitude clearer code. But clarity is subjective, and that's a parameter Copilot can't adjust for (although it could do better).


I'm surprised no one has suggested using `requests` considering how easy, safe and readable it is:

    >>> import requests, pprint
    >>> 
    >>> 
    >>> url = "https://en.wikipedia.org/w/api.php"
    >>> resp = requests.get(
    ...     url, 
    ...     params=dict(
    ...         action="query",
    ...         list="geosearch",
    ...         format="json",
    ...         gsradius=10000,
    ...         gscoord=f"{latitude.value}|{longitude.value}"
    ...     )
    ... )
    >>> 
    >>> pprint.pprint(resp.json())
    {'batchcomplete': '',
     'query': {'geosearch': [{'dist': 26.2,
                              'lat': 37.7868194444444,
                              'lon': -122.399905555556,
                              'ns': 0,
    ...


For what it's worth, Copilot can do it.

I typed the following prompt:

    def search_wikipedia(lat, lon):
        """
        use "requests" to do a geosearch on Wikipedia and pretty-print the resulting JSON
        """
And it completed it with:

    r = requests.get('https://en.wikipedia.org/w/api.php?action=query&list=geosearch&gsradius=10000&gscoord={0}|{1}&gslimit=20&format=json'.format(lat, lon))
    pprint.pprint(r.json())


It's like a junior dev who won't quit unnecessary code golfing. Somehow the AI is more comfortable with string-based URL manipulation, which is a straight anti-pattern.


Presumably because that's what it's seen in the training data. Remember, it doesn't care about what the code does, it's just doing a search for similar looking code.


That doesn't exactly do what the guy above you was talking about, though.


That's what the rest of the thread is complaining about, it's still slapping the strings in there with basic formatting. No different than the top level approach.


This. Code should be optimized for reading. I think this kind of code is OK for exploratory stuff, but it needs to be rewritten later.


Well. Code should be optimized first for correctness, and simple string concatenation will not work for URL params.


It'll certainly work, just seems sloppy.


Plenty of edge cases there (e.g. url encoding), but I don't want to preach to the choir and rabbit hole on this minor detail.


Speaking as a former pentester, this is a fine way to form query params in this specific case, if lat and long are floats.

They're the only data you can control, and unless they're strings, it's useless for exploitation. Even denormal floats / INF / NAN won't help achieve an objective.

I broadly agree with you, but people are pummeling Copilot for writing code I've seen hundreds of times. Yes, sometimes I was able to exploit some of that code. But the details matter.
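A sketch of the distinction being drawn here (the hostile "latitude" string is invented for illustration):

```python
# If lat/lon really are floats, formatting them can only ever emit digits,
# '.', '-', 'e', 'inf' or 'nan' -- none of which are reserved URL characters.
lat, lon = 37.7891838, -122.4033522
coord = f"{lat}|{lon}"  # worst case: an ugly but harmless value

# If they arrive as strings, the caller controls the query structure:
lat_s = "0&gslimit=500&list=allusers"  # hypothetical hostile input
url = "https://en.wikipedia.org/w/api.php?action=query&gscoord=" + lat_s
# The "latitude" has injected extra parameters into the request.
print(url)
```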


But I would still never not escape the params because you don’t know how that code will change one day or where it will end up, and chances are that you won’t remember to fix it later if you don’t fix it now.

We just had a major failure at work recently because someone decided to not decode URL params and their code worked fine for years because it never mattered… until it did.

Just do it right. It’s so easy. Why risk yourself a ton of headache in the future to save you a few seconds?


If the example code is everything that Copilot generated, there's no guarantee that lat or long are floats and that seems to be an implementation detail left to the user.

Isn't that a pretty big risk though? Specifically, that people will use co-pilot recommendations "as-is" and give little thought to the actual workings of the recommendation?

After all, if you have to intimately understand the code it's recommending are you really saving that much time over vetting a Googled solution yourself?


How so? I'd prefer a proper structured library, is that what you mean? If so, the Copilot code actually seems not dodgy - because the author _started_ with `base = ...` , indicating that they were string formatting the params.

Or did you mean something else?


It probably does suck! I’m not very experienced, and I was just whipping up something quick to test if my random MSFS2020 mod idea could work.


Copilot may be the first of many computer dictionaries and thesauruses.

Oxford English Dictionary, for example, is a human version of defining language and a thesaurus is a completion engine.

Human language didn't suffer by having a dictionary and thesaurus. Computer language doesn't suffer either.


As a noob, what's the issue with this?


It's harder to read than other methods, and it doesn't encode the URL parameters, which means it potentially produces an invalid URL, and in some cases could lead to security problems (similar to SQL injection).


How so?


Concatenating strings for example. As shown, it's the query string equivalent of sql injection.

Use something like URLBuilder, or URIParams, or whatever your platform supports. Don't use string concatenation ever, if at all possible, and if not possible (wtf?), then at least escape strings.


I usually try to avoid working with URLs as bare strings like this, both for readability and correctness (URL encoding is tricky). With ‘requests’ you can do something like pass a dictionary of your query params and it takes care of forming the actual request URL.

https://docs.python-requests.org/en/latest/user/quickstart/#...


It's much safer (i.e. fewer degrees of freedom for bugs to appear) to use f-strings in situations like this one.

One correlated but ancillary benefit is that there are fewer variables whose state you have to simulate in your brain while you're reading the code. You don't have to wonder if a variable is going to change on you in between when it is initialized and when it is used.

It's safer still to use a library (e.g. urllib3) that does encoding for you (allowing you to omit magic strings like `"%7C"` from the logic of this function altogether).

Like GP said, very handy for one-off scripts or areas of your codebase where quality is "less important". I may be pedantic, but I wouldn't give this a pass on code review.
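For comparison, a sketch of the library route using stdlib `urllib.parse` (sample coordinates from the thread), where the `"%7C"` magic string disappears entirely:

```python
from urllib.parse import urlencode

lat, lon = 37.7891838, -122.4033522

# urlencode percent-escapes '|' (and anything else reserved) for us.
query = urlencode({
    "action": "query",
    "format": "json",
    "gscoord": f"{lat}|{lon}",
})
print(query)  # action=query&format=json&gscoord=37.7891838%7C-122.4033522
```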


The code lacks context-sensitive escaping:

  api_path = base_url + urllib.parse.urlencode({
    'action': action,
    'format': letThisBeVariable,
    ...
    'gscoord': str(latitude.value) + '|' + str(longitude.value)
  })
see: https://docs.python.org/3/library/urllib.parse.html#urllib.p...

Mantra: when inserting data into a context (like a URL), escape the data for that context.


It has no escape logic. Okay for scripts, as GP stated, very bad for production code.


the "nice" way of doing this would would be to create a list of your stringified arguments, mapped urlencoding over them, and then join them with the parameter separator. this ends up being resilient to someone adding something that ends up being incorrect, and makes explicit in the code what you're trying to do.


That's impressive. Discoverability and the varying quality of documentation are a big headache for new programmers or people engaging with an unfamiliar framework/API/library. I really like the comments pointing out alternatives (json or xml) and the static lat-long values.

One reason I've championed the development of visual programming (flow-based, node diagrams, etc) is that while you don't want to compress a big complex program down into a single layer and become unreadable in the process, graphical methods are a great way for people to see what options they have available and just point at things they want to try out.

Instead of struggling with syntax at the same time as trying to find out what they can do with a new API, they can engage in trial-and-error to find out the capabilities and what makes it worth using, then build up their competency once they are clear about their objectives.

I'm looking forward to trying this now that it's available for my favorite IDE, but I'll probably want to set up a hotkey to switch it on and off as I need it. Once I get fully comfortable with something I often find code completion tools get in the way of my flow.


The next version of Copilot will submit its answers to HN and return the highest-voted comment that compiles, after stripping out the "well, actually" spurious tokens. Just look how well it worked this time!


Yesterday I tried to convert the representation of a number into another representation/type in Rust:

    let coeff = BigUint::from_str(x).unwrap();
    let coeff: <<G1Affine as AffineCurve>::ScalarField as PrimeField>::BigInt =
        coeff.try_into().unwrap();
    let x: <G1Affine as AffineCurve>::ScalarField = coeff.into();

I wrote that, then I wanted to move that code in a function so I wrote:

    fn string_int_to_field_el(s: &str)

copilot suggested the following one-liner that did exactly the same thing:

    fn string_int_to_field_el(s: &str) -> <G1Affine as AffineCurve>::ScalarField {
        let x: <G1Affine as AffineCurve>::ScalarField = s.parse().unwrap();
        x
    }
I still don't understand how some heuristics could produce this code. It's mind blowing.


That code is unadulterated garbage


But what if I told you I wrote it fast?

… as a top level comment said, this is optimizing for the wrong problem.


In a way it's impressive that Copilot knew how to separate the query string while still being correct. However, this is a fairly naive way to build a URL, and I wouldn't encourage committing it. If I saw this in a review I would recommend using a dict and `urlencode`, or one of the various other URL builders available (either in the stdlib or through another library like `yarl`/etc.).


Now we just need to train it to build a dictionary for that info instead of forming a long URL. And if it has to use a long URL, to at least use urljoin and/or string formatting.


btw the `mwclient` library makes querying the Wikipedia API a breeze!



