This comes from someone in the process of becoming a data scientist. They had a question about regular expressions that made almost no sense. It appears that the core concepts of ETL -- Extracting source data, Transforming it into a useful form, and then Loading it into some persistent storage for long-term analysis -- had not been embraced. It appears the design pattern was unknown. All I could gather from the sketchy email chain was that something involving regular expressions had become difficult.
I wrote this in response: Handling Irregular File Formats.
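For reference, the three ETL stages can be kept separate even in a tiny script. This is only a sketch under assumptions: the email never showed the actual file, so the sample data, the `LINE_PAT` pattern, the `measurement` table, and the function names are all hypothetical. The point is that the regular expression lives in exactly one place, the transform step.

```python
import io
import re
import sqlite3

# Hypothetical irregular source: a name, a colon, a number, optional trailing units.
RAW = """\
width: 12.5 cm
height:7cm
# a comment line to skip
depth : 3.25
"""

# Assumed line layout; a real file would need its own pattern.
LINE_PAT = re.compile(r"^\s*(?P<name>\w+)\s*:\s*(?P<value>[-+]?\d+(?:\.\d+)?)")


def extract(source):
    """Extract: yield raw lines from a file-like object."""
    for line in source:
        yield line.rstrip("\n")


def transform(lines):
    """Transform: yield (name, float) pairs; silently skip non-matching lines."""
    for line in lines:
        match = LINE_PAT.match(line)
        if match:
            yield match.group("name"), float(match.group("value"))


def load(rows, connection):
    """Load: persist the rows in a SQLite table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS measurement (name TEXT, value REAL)"
    )
    connection.executemany("INSERT INTO measurement VALUES (?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(io.StringIO(RAW))), connection)
rows = connection.execute(
    "SELECT name, value FROM measurement ORDER BY name"
).fetchall()
print(rows)
```

Because each stage is a plain function, a different file format only means swapping `transform`; the extract and load steps stay as they are.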
Here's part of the follow-up.
"I have been focusing on the math associated w/ math optimization. I have been using spreadsheets to perform the computations."
Really.
Spreadsheets.
The ETL pipeline question/rant/complaint was part of loading a spreadsheet?
That seems somehow wrong. There are real tools available that really do real data science work. The word "optimization" hints that scipy.optimize might be a more useful exercise than hacking around with spreadsheets.
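To give a sense of scale, here is roughly what a small optimization looks like with `scipy.optimize.minimize`. The email never described the actual problem, so the `cost` function below is a made-up quadratic with a known minimum at (3, -1), and the starting point is arbitrary. Treat it as a sketch, not a solution to anyone's homework.

```python
from scipy.optimize import minimize


def cost(x):
    """Hypothetical objective: a quadratic bowl with its minimum at (3, -1)."""
    return (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2


# Start from an arbitrary guess; with no constraints or bounds,
# minimize() falls back to its default method (BFGS).
result = minimize(cost, x0=[0.0, 0.0])

print(result.success)  # whether the solver reports convergence
print(result.x)        # the point it found, close to (3, -1)
print(result.fun)      # the objective value there, close to 0
```

The whole model is one function and one call. Constraints and bounds can be passed to the same `minimize` call through its `constraints` and `bounds` arguments, which is harder to keep straight in a grid of cells.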
Perhaps some advice from a real data scientist might help: http://www.becomingadatascientist.com