Productivity tools don't help when you're dealing with people problems. You can't throw a TODO app or some other bullshit at someone who is underperforming, or use it to coach someone toward a promotion.
Isn't increasing productivity by solving hard problems why we get paid? My biggest question is: why isn't it even talked about, let alone given aspirational goals?
Solutions Architect Associate is what I'd recommend starting with. Practitioner is for non-technical people; Developer Associate and SysOps Associate are only marginally different from SA, often focusing on services that are marginally useful at best (few people use AWS CI/CD solutions, and those make up most of the difference for the developer cert), so I wouldn't bother with them. I haven't heard anything about the data engineer one (it seems new).
Funnily enough, the site which switched to Next.js did. They actually had a lovely offers.json file (~1MB, thousands of offers, just without the long-form HTML description), but they were also using it for their transactional needs (meaning that everyone had to load this file before the offers showed up). So they rewrote everything in Next.js instead of adding some pagination...
It also shows how awful some sites are. I'm scraping a job-offer portal written in Next.js (although I use the hydration endpoint rather than the rendered pages). One of the lovely things they do is pass in props containing all possible skills a candidate can have: they send about 200KB per listing, of which less than 10KB on average is job-specific (actually required) data.
Previously they were rendering it from a REST endpoint which only had the necessary data, but now I get a lovely dictionary of all possible values of various properties, provided on a silver platter.
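For context on how little effort that kind of scraping takes: Next.js embeds the page props as JSON in a script tag with the id __NEXT_DATA__, so a sketch along these lines is usually enough (the URL and the exact prop paths are placeholders, not the real portal's):

    // Pull the hydration payload out of a server-rendered Next.js page.
    async function fetchNextData(url: string): Promise<unknown> {
      const html = await (await fetch(url)).text();
      // Next.js serializes all page props into this script tag.
      const match = html.match(
        /<script id="__NEXT_DATA__" type="application\/json"[^>]*>(.*?)<\/script>/s
      );
      if (!match) throw new Error("No __NEXT_DATA__ found - probably not a Next.js page");
      return JSON.parse(match[1]);
    }

    // Placeholder URL; the job data (and all the unused dictionaries) usually
    // sit under props.pageProps in the parsed object.
    fetchNextData("https://jobs.example.com/offers/123").then((data) => {
      console.log(JSON.stringify(data).length, "bytes of hydration payload");
    });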
Another portal I'm scraping uses GraphQL with no access control: I get all the data they have on impressions, I know how much money somebody paid for a listing, and I know when the offer is going to be auto-bumped - just lovely. And no need to use Playwright either.
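To make "no access control" concrete: the public endpoint happily answers queries for fields the UI never displays, something along these lines (the field names are invented for illustration):

    // Hypothetical query against a portal's public GraphQL endpoint; nothing
    // server-side stops you from asking for the internal fields.
    const query = `
      query Offer($id: ID!) {
        offer(id: $id) {
          title
          impressions        # analytics the UI never shows
          paidPackagePrice   # what the company paid for the listing
          nextAutoBumpAt     # when the "posted" date gets refreshed
        }
      }
    `;

    const res = await fetch("https://jobs.example.com/graphql", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ query, variables: { id: "123" } }),
    });
    console.log(await res.json());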
However, the entire post is worth reading, because other people contributed some good points as well.
The Mangadex devs seem to have followed some of the advice, and the referenced inefficient JS API call has been made more efficient since my comment, but notably it still sends many times more data than required.
I once had a discussion with management who thought it was fine not to implement access control for GraphQL because no one knows the endpoint. I explained that it's possible to figure out the endpoint through the Network tab in the browser console. They decided it was fine anyway, because that's too technical.
That's when you stop involving management in technical discussions. As a software engineer, you do what must be done. If access control is a must, management can wait on the less important features. If they are stupid and don't understand despite having it explained to them, they don't have to know.
Yeah, but don't get yourself in trouble by going renegade either.
I do not agree that engineers should avoid technical talks with managers. They should just always be prepared to break it down and provide examples. Look people in the eye, empathize with their sentiments, don’t be an asshole if you can help it, in my humble opinion.
One of the easiest fixes for Next.js page sizes is pruning unused prop data. Bloated props are super common, even in relatively professional settings. I think it isn't intuitive that unused data will be sent; the whole tree-shaking and optimization pipeline seems like it should handle stuff like that, but sure enough, it doesn't (for good reason, though).
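A minimal sketch of what that pruning looks like with getServerSideProps - the API URL and field names are made up, but the mechanism is real: whatever you return as props is serialized into the page's hydration payload, whether or not the component uses it.

    // pages/jobs/[id].tsx - hypothetical listing page
    import type { GetServerSideProps } from "next";

    type ListingProps = { title: string; company: string; skills: string[] };

    export const getServerSideProps: GetServerSideProps<ListingProps> = async (ctx) => {
      const res = await fetch(`https://api.example.com/listings/${ctx.params?.id}`);
      const listing = await res.json();

      // Anti-pattern: `return { props: { listing } }` would ship the whole API
      // response (every possible skill, internal metadata, ...) to the browser.
      // Instead, pick only the fields the component actually renders:
      return {
        props: {
          title: listing.title,
          company: listing.company,
          skills: listing.requiredSkills ?? [],
        },
      };
    };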
The irony is that I discovered this also while scraping a job portal. I wonder if we are scraping the same site lol. I am building a job board for tech jobs, but with very granular filters.
How do you plan to implement very granular filters when the data is rarely complete or neatly organised?
I had this discussion with my friend who runs ArbeitNow. It's hard to create good filters when the data is just paragraphs of text. It's hard to augment these listings when the data is not reliable.
Won't comment on this now, but will follow up once it launches. I've been working on this for 3 months now, so it isn't a simple aggregator. There is a lot of data enrichment happening by collecting auxiliary data from LinkedIn, Crunchbase, GitHub, etc.
Interesting. I was initially intending to do the same, but now I just use the project as an OSINT source for research on various companies; if I or any of my friends is looking for a job, they get a dump of everything those companies posted on popular portals in the last 3 years or so.
I'm looking for a job, and just started hacking together a job posts database with data scraped from a few aggregators and a couple of specific companies. Currently it's all just local and for myself, mostly raw data with some basic statistics and keyword similarity heuristics.
I decided to try something new and different and take it into my own hands (instead of hoping LinkedIn shows me the "right job").
I'm frustrated by the sheer volume of clearly duplicate job posts which often only vary by the "city" (even if they're 100% remote) and the lack of any kind of basic filtering for programming languages, tech stacks, platforms, years of experience requirements, company industry / size / age etc.
I'm really curious what others who have been doing this for much longer have learned! How do you store your data? How do you process it? Have you eliminated manual cleansing and grooming? So many questions...
I'm scraping jobs for a specific country I don't think you're in, but a couple of things:
- always keep the original data you scraped in as close to its original form as you can (in the Next.js example, I just strip the unnecessary props - but only those I know are useless, see the next point) - there aren't that many job offers out there, text is easy to compress, and storage is cheap. Eventually you will change your mind about what's important, and having the original files to work with allows you to re-process them. Kind of like the data vault approach in data engineering.
- prepare for the APIs to evolve, and version your stuff. Have stringent validation (like a JSON schema with enums and no extra fields allowed) to detect this evolution, so that you can adapt - there's a small validation sketch after this list. This is extra important for portals with complex data models which you might not fully understand at first.
- understand the key fields - it's often the case that the "posted date" is auto-bumped every now and then, depending on which package the poster bought. Try going through the buy-a-listing workflow to understand the options. For multi-city listings there is often some way of de-duplicating (like the URL slug having a common prefix, or the data pointing to a "main" listing). Spending an hour going through the data you scraped goes a long way.
- a lot of the data comes from manual user input or from imports from the company's ATS. You will need to normalize those values in some way, and it's not always possible to do it losslessly. A good example is company names, which tend to change over time or be used in different formats. This is a pretty difficult problem to solve and I don't have any good pointers for it.
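A small sketch of the validation idea from the second point, using Ajv; the listing fields are invented, but the pattern is the same: enums plus additionalProperties: false make the scraper fail loudly the moment the portal adds or renames something.

    import Ajv from "ajv";

    const ajv = new Ajv();

    // Hypothetical shape of one scraped listing; keep it as tight as you can.
    const listingSchema = {
      type: "object",
      additionalProperties: false, // any new field the portal adds fails validation
      required: ["id", "title", "contractType"],
      properties: {
        id: { type: "string" },
        title: { type: "string" },
        contractType: { enum: ["b2b", "permanent", "internship"] }, // new values fail too
        salaryMin: { type: "number" },
        salaryMax: { type: "number" },
      },
    };

    const validate = ajv.compile(listingSchema);

    export function checkListing(raw: unknown): void {
      if (!validate(raw)) {
        // Still store the raw payload (first point), but flag it for schema review.
        throw new Error("Schema drift detected: " + ajv.errorsText(validate.errors));
      }
    }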
For the most part I keep the data in SQLite databases, in a format specific to each job portal; I haven't found a decent way of normalizing the data, as the portals have very different ideas about how much information they keep in a structured form. The only processing I do is indexing the values I need and deduplicating multi-location listings and the same offers reposted across many days/weeks/months. This is good enough for my use case, but falls short of making it useful for any commercial use.
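For what it's worth, a sketch of roughly that shape - one table per portal, the raw payload stored verbatim, plus a derived column for de-duplication. better-sqlite3 and the column names are my own choices here, not a description of the setup above:

    import Database from "better-sqlite3";

    const db = new Database("portal-x.db");

    db.exec(`
      CREATE TABLE IF NOT EXISTS raw_listings (
        id          TEXT NOT NULL,  -- portal's own listing id
        scraped_at  TEXT NOT NULL,  -- ISO date of the scrape run
        slug_prefix TEXT NOT NULL,  -- URL slug minus the per-city suffix
        payload     TEXT NOT NULL,  -- original JSON, kept verbatim
        PRIMARY KEY (id, scraped_at)
      );
      CREATE INDEX IF NOT EXISTS idx_slug ON raw_listings (slug_prefix);
    `);

    // One logical offer per slug prefix, keeping the first time it was seen,
    // which collapses both multi-city copies and periodic auto-bumps.
    const dedup = db.prepare(`
      SELECT slug_prefix, MIN(scraped_at) AS first_seen, COUNT(*) AS copies
      FROM raw_listings
      GROUP BY slug_prefix
    `);

    console.log(dedup.all());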
That's not quite right: if you have a single-person company (with unlimited personal liability) you can choose between revenue tax and income tax. Both are flat rates - revenue tax is 12% for certain groups of companies, including software engineering with revenues below several million euro, while income tax is 19% universally.
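Taking those two rates at face value, the choice comes down to your cost base; a quick back-of-the-envelope comparison (numbers purely illustrative):

    // 12% tax on revenue vs 19% tax on profit (revenue minus costs),
    // using the rates quoted above.
    const revenueTax = (revenue: number) => 0.12 * revenue;
    const incomeTax = (revenue: number, costs: number) => 0.19 * Math.max(revenue - costs, 0);

    const revenue = 100_000;
    for (const costs of [10_000, 36_842, 60_000]) {
      console.log(costs, revenueTax(revenue), incomeTax(revenue, costs).toFixed(0));
    }
    // Break-even is at costs ≈ (1 - 0.12 / 0.19) ≈ 36.8% of revenue: below that,
    // the 12% revenue tax is cheaper; above it, the 19% income tax wins.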
My main use case for Postman-like tools is to get me into the middle of a workflow to reproduce some sort of scenario. Stateless tools just don't do it, as I need to spend a long time copying things around.
It allows arbitrary scripts to save context between requests. Hardcoded examples you can't change without committing to the contract are not even close to that functionality.
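For a concrete idea of what those scripts look like: in Postman, a post-response script on one request can stash part of the response for later requests (the token field name here is an assumption about the API):

    // Post-response script on a login request: capture the session token so
    // subsequent requests in the collection can reference it as {{authToken}}.
    const body = pm.response.json();              // parse the JSON response body
    pm.environment.set("authToken", body.token);  // "token" field name is assumed

    // A later request can then send it in a header:
    //   Authorization: Bearer {{authToken}}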