Eric B.
Alsasua, Spain
97% Job Success
Top Rated

Scraping | Browser Automation | Python | Scrapy | Selenium | Playwright

I have 10 years of experience as a developer coding automated web crawling and scraping solutions (data extraction). Since 2012 I have developed hundreds of different scrapers and crawlers for various purposes: business directories, e-commerce sites, review sites, etc. I have scraped data from websites that require parsing JavaScript/Ajax and dealing with infinite scrolls, popups, and lazy loads.

What kind of jobs can I do for you?
------------------------------------------------------------
- scrape a directory (e.g., Yellow Pages, White Pages)
- scrape an e-commerce website (e.g., Amazon)
- scrape search engines (e.g., Google Search, Bing Search, Yahoo Search)
- scrape any other kind of website
- extract emails (portfolio)
- create a custom web scraping tool to use on demand

Do I scrape the data, or do you?
------------------------------------------------------------
It is your choice. I can give you the script or the installer if you want to run it yourself; that is usually the best option if you are going to scrape data for a very long time or with a certain frequency. Or I can extract the data for you using the technologies that fit best; that is the best option if you want something fast and just once.

How does the scraper work?
------------------------------------------------------------
It gets all the URLs (commercial sites) or can gather data by region, county, or city (directories). In other cases, we adapt.

Crawler or Scraper?
------------------------------------------------------------
Both. To scrape data points, we first need to crawl the URLs we want to extract data from, and then we scrape the data. The crawler can be vertical (the categories of a commercial site, for example) and horizontal (the pagination of each category). It can be static (we have a list of URLs) or dynamic (we autogenerate the list of URLs, in commercial sites, for example). In the case of directories, we need input data: it can be a list of keywords and a list of zip codes. The Scrapy sketch after the Automation tools section below shows both crawl directions.

What output formats do I offer?
------------------------------------------------------------
- .xls, .xlsx, .txt, .csv, .json, .xml
- MS Access (.mdb, .accdb), MySQL (SQL), SQL Server (SQL)

What technologies do I use?
------------------------------------------------------------
Scripts
.......................................................................
- Python + Beautiful Soup or LXML
- Scrapy
I give you the script. You run it.

Source code
............................................................
- Visual Basic .NET / C# + WinForms (WebRequest, HttpWebRequest, WebClient, HttpClient, Web Browser (Internet Explorer), WebKit browser (Google Chrome), Gecko browser (Mozilla Firefox))
I give you the installer, you install the software, and you run it.

How do I extract the data?
.................................
- XPath
- CSS selectors
- String processing (directly from the HTML as simple text)

Automation tools
...................................................
- Selenium
Automation tools such as Selenium are meant to resolve more complex scenarios where human behavior must be mimicked: for example, clicking a button, opening a popup, or dealing with an infinite scroll. They are needed on websites where the HTML is loaded dynamically using technologies such as JavaScript and Ajax (RIA, Rich Internet Applications). Selenium can be used with Visual Basic .NET + WinForms, and with Python and Scrapy too (Scrapy being Python-based).
- Playwright
Playwright is an alternative to Selenium and is a wonder to work with: very simple code, straight to the point. To be considered very seriously. A few illustrative sketches follow.
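To make the crawler/scraper distinction concrete, here is a minimal Scrapy sketch of a vertical crawl (category links), a horizontal crawl (pagination), and data extraction with both CSS selectors and XPath. The site, URLs, and selectors are hypothetical placeholders, not a real target.

```python
import scrapy

class ShopSpider(scrapy.Spider):
    # Run with: scrapy runspider shop_spider.py -o products.json
    name = "shop"
    start_urls = ["https://shop.example.com/categories"]  # hypothetical site

    def parse(self, response):
        # Vertical crawl: follow every category link (CSS selector).
        for href in response.css("a.category::attr(href)").getall():
            yield response.follow(href, callback=self.parse_category)

    def parse_category(self, response):
        # Scrape: extract the data points of each product (XPath).
        for product in response.xpath("//div[@class='product']"):
            yield {
                "name": product.xpath(".//h2/text()").get(),
                "price": product.xpath(".//span[@class='price']/text()").get(),
            }
        # Horizontal crawl: follow the pagination of this category.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_category)
```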
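For the more complex scenarios, a minimal Selenium sketch, assuming a hypothetical page with a cookie popup and an infinite scroll (it needs a local Chrome/chromedriver to run):

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://reviews.example.com")  # hypothetical page

# Mimic human behavior: dismiss the popup by clicking its button.
driver.find_element(By.CSS_SELECTOR, "button.accept-cookies").click()

# Deal with the infinite scroll: scroll until the page stops growing.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the Ajax call time to load more items
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# With everything loaded, extract the data points as usual.
reviews = [e.text for e in driver.find_elements(By.CSS_SELECTOR, "div.review")]
driver.quit()
```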
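And the same flow in Playwright's sync API, to show how little code it needs (same hypothetical page and selectors):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://reviews.example.com")  # hypothetical page
    page.click("button.accept-cookies")       # dismiss the popup
    # Trigger the lazy load by scrolling a few screens down.
    for _ in range(10):
        page.mouse.wheel(0, 2000)
        page.wait_for_timeout(1000)
    reviews = page.locator("div.review").all_inner_texts()
    browser.close()
```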
What about databases?
------------------------------------------------------------
We are not usually obliged to use databases for scraping purposes. Still, in some cases the website structure requires us to reverse-engineer the underlying database and create our own, with several tables and several one-to-many relationships. In that case the data is distributed across several tables, and we can get not one but several datasets simply by running queries; a small sketch follows below.

What about data cleansing?
------------------------------------------------------------
It is crucial. I always deliver clean data, after applying cleansing functions that deal with HTML special characters, change the case, concatenate fields, and so on.
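As a sketch of the database case described above, with two illustrative tables in a one-to-many relationship and one dataset produced by a query (table and column names are hypothetical):

```python
import sqlite3

con = sqlite3.connect("scrape.db")
con.executescript("""
    CREATE TABLE IF NOT EXISTS business (
        id   INTEGER PRIMARY KEY,
        name TEXT
    );
    -- One business has many reviews: a one-to-many relationship.
    CREATE TABLE IF NOT EXISTS review (
        id          INTEGER PRIMARY KEY,
        business_id INTEGER REFERENCES business(id),
        rating      INTEGER,
        body        TEXT
    );
""")

# One dataset among many: the average rating per business.
rows = con.execute("""
    SELECT b.name, AVG(r.rating) AS avg_rating
    FROM business b JOIN review r ON r.business_id = b.id
    GROUP BY b.id
""").fetchall()
con.close()
```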
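And a minimal sketch of the kind of cleansing functions meant here, covering HTML special characters, case changes, and concatenation:

```python
import html

def cleanse(value: str) -> str:
    # Decode HTML entities, trim whitespace, normalize the case.
    return html.unescape(value).strip().title()

first, last = "  JOHN ", "d&oacute;e"
full_name = f"{cleanse(first)} {cleanse(last)}"  # concatenate the fields
print(full_name)  # -> John Dóe
```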


Skills

  • Web Scraping
  • Bot Development
  • Beautiful Soup
  • XLSX
  • CSV