Input from IT/Coding/Techie types required

**Ketara**

So this is from my girlfriend:-

I'm doing my Master's degree, and my dissertation involves analysing fandom archives. A lot of the data is too old to reflect certain changes that took place in 2018 that I want to map. A person has made their scripts available on GitHub (for anyone to use) and they are designed to perform scrapes on the sites I want to gather data from (Fanfiction.net, Archive of Our Own, etc.). I know little about Python and coding in general, so I don't know how much I need to know to be able to do this.

I suppose my question is: could I perform the data scrapes with the scripts provided, or is it beyond my skillset and a waste of time trying?

P.s. scripts https://github.com/fandomstats/toastystats

So I know absolutely jack about anything to do with this. Nada. Nothing. Barely understand what she's talking about. I don't know if what she's asking makes technical sense, whether it's even feasible, and if it is, whether it consists of hitting a button or five days hard work. Any advice about these things would be appreciated.

**Herzlos**

Perfectly possible to do.

She just needs to install python, download the scripts and run them.

There's a readme file here - https://github.com/fandomstats/toastystats/blob/master/README.md which hopefully explains what she needs to do.

Hopefully if she already has some understanding of scripting/programming even if it's not python then she should be OK to produce the data.

(If she's reading Python: it's fairly unusual in that it uses indentation, rather than braces, to identify blocks.)
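
To show what that means in practice, here is a tiny made-up example (the function name and the titles are hypothetical, not taken from the toastystats scripts). Whatever is indented under a `for` or `if` line belongs to that block:

```python
def count_long_titles(titles, min_length=10):
    """Count the titles that have at least min_length characters."""
    count = 0
    for title in titles:              # the indented lines below belong to the loop
        if len(title) >= min_length:  # the line indented further belongs to the if
            count += 1
    return count                      # back at function level: runs after the loop


print(count_long_titles(["Short", "A much longer title"]))
```

Changing the indentation changes what the code does, so it is worth using an editor that shows it clearly.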

If you've got any specific questions, fire away and I'll see if I can help.

**john_chandler**

Yup, as above. Looks like each of the scrapers has a decent README file on usage, and there are no external dependencies (which should make things easier).

Just to add, as someone who sometimes has to write scrapers, they can be incredibly fragile. Assuming there have been no changes to the sites, things should be okay. However, if any of the sites have changed since July 2019 (looking at the repo) the respective scraper might have a problem.

**Ketara**

john_chandler wrote:
Yup, as above. Looks like each of the scrapers have a decent Readme file on usage, and there are no external dependencies (which should make things easier).

Just to add, as someone who sometimes has to write scrapers, they can be incredibly fragile. Assuming there have been no changes to the sites, things should be okay. However, if any of the sites have changed since July 2019 (looking at the repo) the respective scraper might have a problem.

So is it possible that the code might need rewriting slightly? If so, who would you hire to do that and how complex would it be?

Again, I've absolutely no idea about any of this, so do excuse if what I'm asking is idiotic.

**Herzlos**

Anyone who's comfortable with the scripts and with HTML (which is what the pages being scraped are written in) should be able to update the scraper.

How easy that'll be will depend on the complexity of the pages being scraped and how neat the code is. I'd expect most Comp Sci students to be able to do it.

**techsoldaten**

Tell your girlfriend she's going to find this scraper is dated and doesn't reflect recent UI changes to these platforms.

The AO3 scraper, for instance, is using BeautifulSoup to find links in list tags. Major parts of the site don't use list tags anymore, so a lot of data will be missed. It won't throw errors on this either; it just won't be aware of the information it's missing.

So results may not be representative.
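
A made-up illustration of that kind of silent failure, as a sketch only. It uses Python's built-in `html.parser` rather than BeautifulSoup so it runs without installing anything, and both page snippets are hypothetical, not real AO3 markup. The helper `require_results` is also an invention for this example:

```python
from html.parser import HTMLParser


class ListLinkParser(HTMLParser):
    """Collect href values from <a> tags that sit inside <li> tags."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "a" and self.li_depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1


def links_in_list_items(html):
    """Return every link found inside a list item in the given HTML."""
    parser = ListLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


def require_results(links, page_name):
    """Fail loudly instead of silently accepting an empty scrape."""
    if not links:
        raise ValueError(f"No links found on {page_name}; has the layout changed?")
    return links


# Hypothetical markup: the same two works before and after a redesign.
OLD_PAGE = '<ul><li><a href="/works/1">One</a></li><li><a href="/works/2">Two</a></li></ul>'
NEW_PAGE = '<div><article><a href="/works/1">One</a></article><article><a href="/works/2">Two</a></article></div>'

old_links = require_results(links_in_list_items(OLD_PAGE), "old page")
new_links = links_in_list_items(NEW_PAGE)  # no error here, just nothing found
print(len(old_links), len(new_links))
```

The second call runs without complaint and simply returns nothing, which is why a check like `require_results` (or comparing counts against what the site itself reports) is worth having.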

Looking at the scraper, seems it was designed to let you write your own scripts on top of it. She should know she will need to write at least a little code if she wants to ask anything beyond surface level questions.

Maybe hire a developer on freelancer.com to make it work?

**Ketara**

So the general consensus here is that it will be difficult for someone with no knowledge to execute well, due to potentially messy and obsolete code.

What sort of timeframe would it take an average professional to fix it up and have it feed out the data? If it's possible to guess, that is (I accept it may not be). Are we talking half an hour, an afternoon, or a week?

**techsoldaten**

Hard to give estimates, depends on exactly what she needs to do.

TBH, you might have an easier time writing a scraper from scratch. All it uses is BeautifulSoup, a Python library that reads content from HTML pages.

If you were to go that route, and you were just looking for a dump of all content since 2018, that's the sort of thing someone could do in a day or two.
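
To give a feel for the size of the job, here is a rough sketch of the pieces involved. Everything specific in it is an assumption: the markup (`<a class="work" data-posted="...">`), the function names and the User-Agent string are invented, and a real site will need its own parsing rules. It uses the standard library's `html.parser` instead of BeautifulSoup purely so it runs as-is:

```python
import time
import urllib.request
from datetime import date
from html.parser import HTMLParser


class WorkParser(HTMLParser):
    """Parse hypothetical markup: <a class="work" href=... data-posted="YYYY-MM-DD">Title</a>."""

    def __init__(self):
        super().__init__()
        self.works = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "work" and "data-posted" in attrs:
            self._current = {
                "url": attrs.get("href", ""),
                "posted": date.fromisoformat(attrs["data-posted"]),
                "title": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.works.append(self._current)
            self._current = None


def parse_works(html):
    """Turn one page of HTML into a list of work dictionaries."""
    parser = WorkParser()
    parser.feed(html)
    parser.close()
    return parser.works


def works_since(works, cutoff):
    """Keep only the works posted on or after the cutoff date."""
    return [work for work in works if work["posted"] >= cutoff]


def fetch(url, delay_seconds=5.0):
    """Download one page, pausing first so the site is not hammered."""
    time.sleep(delay_seconds)
    request = urllib.request.Request(url, headers={"User-Agent": "dissertation-research-script"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


# Made-up sample page, so the parsing can be tried without touching any site.
SAMPLE = (
    '<a class="work" href="/works/1" data-posted="2017-11-02">Older story</a>'
    '<a class="work" href="/works/2" data-posted="2018-06-15">Newer story</a>'
)

recent = works_since(parse_works(SAMPLE), date(2018, 1, 1))
print([work["title"] for work in recent])
```

A real run would call `fetch` in a loop over the listing pages and feed each result to `parse_works`; the parsing rules are the part that has to be rewritten per site.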

**WildeGirl**

Hello all, I am girlfriend!
Firstly, thank you so much for your replies.
I tried to read the readmes when I originally found the codes, and just got more confused!

Is it possible to track platform usage over time, e.g. 2015 - 2020, or would I have needed to track that in real time?

Also if BeautifulSoup is outdated now, would I need someone to write me a whole scraper, or could I get away with having adjustments made to BeautifulSoup?

I'll look at freelancer.com as suggested, I think I'm definitely going to need outsider help!

**warhammer_4**

Another programmer guy here,

WildeGirl wrote:
Also if BeautifulSoup is outdated now, would I need someone to write me a whole scraper, or could I get away with having adjustments made to BeautifulSoup?

BeautifulSoup is a reusable piece of code (a library) used to parse HTML (the markup language that web pages are written in). The scripts linked in the OP use it to pull information from the fanfiction sites. There is nothing wrong with BeautifulSoup; the particular scripts in question are just out of date and need to be updated to work with the current page layouts of the sites you are looking at. As the whole purpose of the scripts is to hook into those page layouts, the outdated versions are not super useful, and it would make more sense to just rewrite them.

WildeGirl wrote:
Is it possible to track platform usage over time, e.g. 2015 - 2020, or would I have needed to track that in real time?

It should be possible to analyse historic usage, depending on what specific figures you are looking for. E.g., if you want to check site visits, this may be tricky to trawl (though other sources will track this for you), but working out something like the number of articles published per day, or the topics of articles over time, would be very possible.
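
As a sketch of what that looks like once the posting dates have been collected (the dates below are made up, and `works_per_month` is just an illustrative name):

```python
from collections import Counter
from datetime import date


def works_per_month(posted_dates):
    """Return a {'YYYY-MM': count} dictionary in chronological order."""
    counts = Counter(d.strftime("%Y-%m") for d in posted_dates)
    return dict(sorted(counts.items()))


# Made-up posting dates standing in for scraped metadata.
sample_dates = [
    date(2017, 12, 30),
    date(2018, 1, 5),
    date(2018, 1, 20),
    date(2018, 3, 2),
]

monthly = works_per_month(sample_dates)
print(monthly)
```

Because each work carries its own posting date, this kind of count can be built after the fact; it does not need to have been tracked in real time.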

Ketara wrote:
So the general consensus here is that it will be difficult for someone with no knowledge to execute well, due to potentially messy and obsolete code.

What sort of timeframe would it take an average professional to fix it up and have it feed out the data? If it's possible to guess, that is (I accept it may not be). Are we talking half an hour, an afternoon, or a week?

As techsoldaten says, this should be a relatively simple job for a professional programmer, depending on what data you are specifically trying to get your hands on.

The overall constraints that define the cost/complexity of the task are:

The exact sites you are interested in. Each separate site will require its own custom script to retrieve the information you want. So, this job gets more expensive with each new website you want to look at.

The exact time range you need data for. A longer time range requires pulling more data, which may necessitate a more robust script.

What specific data you want. E.g., do you need the actual stories, or just the metadata associated with them (date posted, user, title, etc.)? If you are interested in only metadata, it is important to specify the exact fields you want, as depending on the site layout getting some types of data may be easier than others.
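
One way to pin down "the exact fields you want" before talking to a developer is to write them out as a record and agree on the output format. This is only a sketch: the `WorkRecord` name, the field list and the sample row are assumptions for illustration, not anything the existing scripts produce.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class WorkRecord:
    """One row of metadata per story; adjust the fields to the research question."""

    site: str
    title: str
    author: str
    date_posted: str  # ISO format, e.g. "2018-06-01"


def to_csv(records):
    """Write the records as CSV text, with one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(WorkRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Made-up example row.
csv_text = to_csv([WorkRecord("AO3", "Example story", "example_user", "2018-06-01")])
print(csv_text)
```

A short list like this makes it much easier for whoever writes the scraper to quote a price, since each extra field may need its own parsing rule on each site.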

Another thought is simply getting in touch with the owners of the sites you are interested in and asking them for a relevant dump of their database. From the web host’s perspective it is actually more cost effective to give you the information you want, instead of having you waste bandwidth pulling it from the user interface. Obviously this may not be possible depending on how reachable/amenable the site owners are.

**Herzlos**

It may also be worth asking around to see if any other research students have done or are planning on doing similar work, and share the resources.

**filbert**

What about going on Fiverr or something similar and getting an expert to whip up a script?

**trexmeyer**

Can you provide clarification on the specific data that you need to gather?