Time flies! It has already been a month since I came to Israel for a high-tech internship.
I'm interning at a company called Sixgill, which focuses on analyzing dark web data and sells the results as a SaaS product to customers who are interested in our analysis.
My mission for these two months is to help them build a Chinese machine learning model that categorizes the content of posts, for example whether a post is about drugs, fraud, weapons, or porn.
- The first step is to find Chinese dark web sites and crawl them. This is my first time using Tor to explore the dark web. Some of the sites are really dark, which surprised me a lot. It's like a totally different world from the one I knew.
- The second step is to label the data. I'm the only one who can read Chinese, so this important job falls entirely to me. By the way, labeling data is really exhausting!
- The third step is to train a good model. My colleague and I discuss which algorithms will give good results. We use TF-IDF, LSA, and XGBoost for now. Maybe there is a more suitable algorithm for our data.
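To give a feel for the first piece of that pipeline, here is a minimal sketch of TF-IDF in plain Python. It is only an illustration, not our actual code: splitting the text into character bigrams, the smoothed IDF formula, and the three toy posts are all assumptions made up for this example. LSA and XGBoost are left out; they would take these weighted vectors as input.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams (assumed tokenization)."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def tfidf(docs, n=2):
    """Return one {ngram: weight} dict per document."""
    tokenized = [char_ngrams(d, n) for d in docs]
    # Document frequency: in how many documents each n-gram appears.
    df = Counter(g for toks in tokenized for g in set(toks))
    total = len(docs)
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = {}
        for gram, count in counts.items():
            tf = count / len(toks)
            # Smoothed IDF (an assumed variant): rarer n-grams weigh more.
            idf = math.log((1 + total) / (1 + df[gram])) + 1
            vec[gram] = tf * idf
        vectors.append(vec)
    return vectors


# Toy posts invented for this example, not real crawled data.
posts = ["出售信用卡资料", "出售枪支弹药", "信用卡资料便宜"]
vectors = tfidf(posts)
```

In the second toy post, the bigram "枪支" appears in only one post while "出售" appears in two, so "枪支" ends up with the larger weight. That is the whole idea: n-grams that are distinctive for a post count more than ones that show up everywhere.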
By the way, we have a special strategy for labeling data, and I think it's nice!
Because labeling all the data is too time-consuming, we only label part of it. We use this data to train a model and then predict labels for the data that has not been labeled yet. I just need to look at the results, revise any label that is wrong, and put the data back into our training set. This way, we speed up labeling and make our model better.
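The loop described above can be sketched roughly like this. Everything in it is hypothetical: `KeywordModel` is a toy stand-in that just counts characters per category (not the TF-IDF, LSA, and XGBoost model), `labeling_round` and the `review` callback are names invented for the example, and the posts are made up.

```python
from collections import Counter, defaultdict


class KeywordModel:
    """Toy stand-in classifier: counts characters seen per category."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.counts[label].update(text)  # count each character
        return self

    def predict(self, text):
        scores = {
            label: sum(seen[ch] for ch in text)
            for label, seen in self.counts.items()
        }
        # On a tie, the category seen first in training wins.
        return max(scores, key=scores.get)


def labeling_round(model, labeled, unlabeled, review):
    """Train, predict unlabeled posts, let a human fix labels, merge results."""
    texts, labels = zip(*labeled)
    model.fit(texts, labels)
    merged = list(labeled)
    fixes = 0
    for text in unlabeled:
        guess = model.predict(text)
        final = review(text, guess)  # human keeps or corrects the guess
        if final != guess:
            fixes += 1
        merged.append((text, final))
    return merged, fixes


# Toy data invented for this example.
seed = [("出售大麻", "drugs"), ("出售枪支", "weapons")]
todo = ["大麻便宜", "枪支弹药", "出售弹药"]
truth = {"大麻便宜": "drugs", "枪支弹药": "weapons", "出售弹药": "weapons"}

# The lambda simulates the human reviewer by looking up the true label.
labeled, fixes = labeling_round(
    KeywordModel(), seed, todo, lambda text, guess: truth[text]
)
```

Here the toy model gets two of the three new posts right, so only one label has to be corrected by hand, and all three posts join the training data for the next round.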
I'm looking forward to August! What a great model we will make!!