# go-enry [![GoDoc](https://godoc.org/github.com/go-enry/go-enry?status.svg)](https://pkg.go.dev/github.com/go-enry/go-enry/v2) [![Test](https://github.com/go-enry/go-enry/workflows/Test/badge.svg)](https://github.com/go-enry/go-enry/actions?query=workflow%3ATest+branch%3Amaster) [![codecov](https://codecov.io/gh/go-enry/go-enry/branch/master/graph/badge.svg)](https://codecov.io/gh/go-enry/go-enry)
Programming language detector and toolbox to ignore binary or vendored files. _enry_ started as a port to _Go_ of the original [Linguist](https://github.com/github/linguist) _Ruby_ library and offers an improved _2x performance_.
- [CLI](#cli)
- [Library](#library)
  - [Use cases](#use-cases)
    - [By filename](#by-filename)
    - [By text](#by-text)
    - [By file](#by-file)
    - [Filtering](#filtering-vendoring-binaries-etc)
    - [Coloring](#language-colors-and-groups)
  - [Languages](#languages)
    - [Go](#go)
    - [Java bindings](#java-bindings)
    - [Python bindings](#python-bindings)
  - [Divergences from linguist](#divergences-from-linguist)
  - [Benchmarks](#benchmarks)
  - [Why Enry?](#why-enry)
  - [Development](#development)
    - [Sync with github/linguist upstream](#sync-with-githublinguist-upstream)
  - [Misc](#misc)
  - [License](#license)
# CLI
The CLI binary is hosted in a separate repository, [go-enry/enry](https://github.com/go-enry/enry).
# Library
_enry_ is also a Go library for guessing a programming language. It exposes its API through FFI to multiple programming environments.
## Use cases
_enry_ guesses a programming language using a sequence of matching _strategies_ that are
applied progressively to narrow down the possible options. Each _strategy_ differs in the type
of input data that it needs to make a decision: file name, extension, the first line of the file, the full content of the file, etc.

Depending on the available input data, the enry API can be roughly divided into the following categories or use cases.
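
The narrowing process can be pictured with a small, self-contained sketch. This is not enry's implementation: the `strategy` type, the two toy strategies and their tiny lookup tables are hypothetical and only illustrate how each step reduces the list of candidates.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// strategy is a hypothetical signature: it receives the file name, the
// content and the candidates found so far, and returns a narrower list
// (or an empty one when it has no opinion).
type strategy func(filename string, content []byte, candidates []string) []string

// byExtension is a toy strategy backed by a made-up extension table.
func byExtension(filename string, _ []byte, _ []string) []string {
	table := map[string][]string{
		".h":  {"C", "C++", "Objective-C"},
		".go": {"Go"},
	}
	return table[filepath.Ext(filename)]
}

// byContent is a toy content heuristic: it keeps only the candidates
// whose made-up marker string appears in the content.
func byContent(_ string, content []byte, candidates []string) []string {
	markers := map[string]string{
		"C++":         "std::",
		"Objective-C": "@interface",
	}
	var kept []string
	for _, c := range candidates {
		if m, ok := markers[c]; ok && strings.Contains(string(content), m) {
			kept = append(kept, c)
		}
	}
	return kept
}

// guess applies the strategies in order and stops as soon as a single
// candidate is left. A strategy with no opinion leaves the list untouched.
func guess(filename string, content []byte, strategies ...strategy) []string {
	var candidates []string
	for _, s := range strategies {
		if next := s(filename, content, candidates); len(next) > 0 {
			candidates = next
		}
		if len(candidates) == 1 {
			break
		}
	}
	return candidates
}

func main() {
	fmt.Println(guess("vector.h", []byte("std::vector<int> v;"), byExtension, byContent))
	fmt.Println(guess("main.go", nil, byExtension, byContent))
}
```

When no strategy can decide, the sketch returns every remaining candidate, which mirrors the idea that an ambiguous input yields several possible languages.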
### By filename
The following functions require only the name of the file to make a guess:

- `GetLanguageByExtension` uses only the file extension (which may be ambiguous)
- `GetLanguageByFilename` is useful for cases like `.gitignore`, `.bashrc`, etc.
- all [filtering helpers](#filtering-vendoring-binaries-etc)

Please note that such guesses are not expected to be very accurate.
### By text
To make a guess based only on the content of the file or a text snippet, use:

- `GetLanguageByShebang` reads only the first line of text to identify the [shebang](<https://en.wikipedia.org/wiki/Shebang_(Unix)>).
- `GetLanguageByModeline` for cases when a Vim/Emacs modeline, e.g. `/* vim: set ft=cpp: */`, may be present at the head or the tail of the text.
- `GetLanguageByClassifier` uses a Bayesian classifier trained on all the `./samples/` from Linguist.

  It is usually a last-resort strategy used to disambiguate the guess of the previous strategies, and thus it requires a list of "candidate" guesses. If more intelligent hypotheses are not available, one can provide a list of all known languages (the keys of `data.LanguagesLogProbabilities`) as possible candidates, at the price of possibly suboptimal accuracy.
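
To give a feel for what a first-line strategy involves, here is a minimal sketch of shebang-based detection using only the standard library. It is an illustration, not the code behind `GetLanguageByShebang`: the `languageByShebang` function and the `interpreters` table are made up for this example.

```go
package main

import (
	"bytes"
	"fmt"
	"path"
	"strings"
)

// interpreters is a made-up, deliberately tiny lookup table.
var interpreters = map[string]string{
	"python3": "Python",
	"bash":    "Shell",
	"sh":      "Shell",
	"ruby":    "Ruby",
}

// languageByShebang looks only at the first line of content and maps the
// interpreter named there to a language.
func languageByShebang(content []byte) (string, bool) {
	line := content
	if i := bytes.IndexByte(content, '\n'); i >= 0 {
		line = content[:i]
	}
	text := string(line)
	if !strings.HasPrefix(text, "#!") {
		return "", false
	}
	fields := strings.Fields(strings.TrimPrefix(text, "#!"))
	if len(fields) == 0 {
		return "", false
	}
	interpreter := path.Base(fields[0])
	// "#!/usr/bin/env python3" names the interpreter in the next field.
	if interpreter == "env" && len(fields) > 1 {
		interpreter = fields[1]
	}
	lang, ok := interpreters[interpreter]
	return lang, ok
}

func main() {
	lang, ok := languageByShebang([]byte("#!/usr/bin/env python3\nprint('hi')\n"))
	fmt.Println(lang, ok)
}
```

Because only the first line is inspected, a strategy like this is cheap but says nothing about files without a shebang, which is why it is combined with the other strategies.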
### By file
The most accurate guess is made when both the file name and the content are available:

- `GetLanguagesByContent` uses only the file extension and a set of regexp-based content heuristics.
- `GetLanguages` uses the full set of matching strategies and is expected to be the most accurate.
### Filtering: vendoring, binaries, etc
_enry_ exposes a set of file-level helpers `Is*` to simplify filtering out the files that are less interesting for the purpose of source code analysis:

- `IsBinary`
- `IsVendor`
- `IsConfiguration`
- `IsDocumentation`
- `IsDotFile`
- `IsImage`
- `IsTest`
- `IsGenerated`
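
The way such helpers are typically combined can be sketched as follows. The predicates below (`isVendorSketch`, `isDotFileSketch`) and their patterns are hypothetical stand-ins written for this example; they are not the library's `Is*` functions and do not reproduce their rules.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
	"strings"
)

// vendorPatterns is a made-up, tiny pattern list used only for illustration.
var vendorPatterns = []*regexp.Regexp{
	regexp.MustCompile(`(^|/)vendor/`),
	regexp.MustCompile(`(^|/)node_modules/`),
}

// isVendorSketch reports whether any illustrative vendor pattern matches.
func isVendorSketch(p string) bool {
	for _, re := range vendorPatterns {
		if re.MatchString(p) {
			return true
		}
	}
	return false
}

// isDotFileSketch reports whether the base name starts with a dot.
func isDotFileSketch(p string) bool {
	return strings.HasPrefix(path.Base(p), ".")
}

// keep returns the paths for which none of the skip predicates match.
func keep(paths []string, skip ...func(string) bool) []string {
	var out []string
outer:
	for _, p := range paths {
		for _, s := range skip {
			if s(p) {
				continue outer
			}
		}
		out = append(out, p)
	}
	return out
}

func main() {
	paths := []string{"main.go", "vendor/x/y.go", ".gitignore", "web/node_modules/a.js"}
	fmt.Println(keep(paths, isVendorSketch, isDotFileSketch))
}
```

Expressing each filter as a predicate keeps the walking code independent of which kinds of files a particular analysis wants to skip.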
### Language colors and groups
_enry_ exposes functions to get the color of a language, for example to present statistics in graphs, and to group similar languages:

- `GetColor`
- `GetLanguageGroup` can be used to group similar languages together, e.g. for `Less` this function will return `CSS`
## Languages
### Go
In a [Go module](https://github.com/golang/go/wiki/Modules),
add `enry` to the module by running:

```sh
go get github.com/go-enry/go-enry/v2
```

The rest of the examples will assume you have either done this or fetched the
library into your `GOPATH`.

```go
// The examples here and below assume you have imported the library.
package main

import (
	"fmt"

	"github.com/go-enry/go-enry/v2"
)

func main() {
	lang, safe := enry.GetLanguageByExtension("foo.go")
	fmt.Println(lang, safe)
	// result: Go true

	lang, safe = enry.GetLanguageByContent("foo.m", []byte("<matlab-code>"))
	fmt.Println(lang, safe)
	// result: Matlab true

	lang, safe = enry.GetLanguageByContent("bar.m", []byte("<objective-c-code>"))
	fmt.Println(lang, safe)
	// result: Objective-C true

	// all strategies together; GetLanguage returns only the language name
	lang = enry.GetLanguage("foo.cpp", []byte("<cpp-code>"))
	fmt.Println(lang)
	// result: C++
}
```

Note that the returned boolean value `safe` is `true` if there is only one possible language detected.

A plural version of the same API allows getting a list of all possible languages for a given file.

```go
langs := enry.GetLanguages("foo.h", []byte("<cpp-code>"))
// result: []string{"C", "C++", "Objective-C"}

langs = enry.GetLanguagesByExtension("foo.asc", []byte("<content>"), nil)
// result: []string{"AGS Script", "AsciiDoc", "Public Key"}

langs = enry.GetLanguagesByFilename("Gemfile", []byte("<content>"), []string{})
// result: []string{"Ruby"}
```
### Java bindings
Generated Java bindings using a C shared library and JNI are available under [`java`](https://github.com/go-enry/go-enry/blob/master/java).

A library is published on Maven as [tech.sourced:enry-java](https://mvnrepository.com/artifact/tech.sourced/enry-java) for macOS and Linux platforms. Windows support is planned under [src-d/enry#150](https://github.com/src-d/enry/issues/150).
### Python bindings
Generated Python bindings using a C shared library and cffi are WIP under [src-d/enry#154](https://github.com/src-d/enry/issues/154).

A library is going to be published on PyPI as [enry](https://pypi.org/project/enry/) for
macOS and Linux platforms. Windows support is planned under [src-d/enry#150](https://github.com/src-d/enry/issues/150).
## Divergences from Linguist
The `enry` library is based on the data from `github/linguist` version **v7.12.1**.

When parsing [linguist/samples](https://github.com/github/linguist/tree/master/samples), the following `enry` results differ from Linguist:

- [Heuristics for ".txt" extension](https://github.com/github/linguist/blob/8083cb5a89cee2d99f5a988f165994d0243f0d1e/lib/linguist/heuristics.yml#L521) in Vim Help File could not be parsed, due to unsupported negative lookahead in RE2 regexp engine.

- [Heuristics for ".sol" extension](https://github.com/github/linguist/blob/8083cb5a89cee2d99f5a988f165994d0243f0d1e/lib/linguist/heuristics.yml#L464) in Solidity could not be parsed, due to unsupported negative lookahead in RE2 regexp engine.

- [Heuristics for ".es" extension](https://github.com/github/linguist/blob/e761f9b013e5b61161481fcb898b59721ee40e3d/lib/linguist/heuristics.yml#L103) in JavaScript could not be parsed, due to unsupported backreference in RE2 regexp engine.

- [Heuristics for ".rno" extension](https://github.com/github/linguist/blob/3a1bd3c3d3e741a8aaec4704f782e06f5cd2a00d/lib/linguist/heuristics.yml#L365) in RUNOFF could not be parsed, due to unsupported lookahead in RE2 regexp engine.

- [Heuristics for ".inc" extension](https://github.com/github/linguist/blob/f0e2d0d7f1ce600b2a5acccaef6b149c87d8b99c/lib/linguist/heuristics.yml#L222) in NASL could not be parsed, due to unsupported possessive quantifier in RE2 regexp engine.

- [Heuristics for ".as" extension](https://github.com/github/linguist/blob/223c00bb80eff04788e29010f98c5778993d2b2a/lib/linguist/heuristics.yml#L67) in ActionScript could not be parsed, due to unsupported positive lookahead in RE2 regexp engine.

- As of [Linguist v5.3.2](https://github.com/github/linguist/releases/tag/v5.3.2), Linguist uses a [flex-based scanner in C for tokenization](https://github.com/github/linguist/pull/3846). Enry still uses the regex-based [extract_token](https://github.com/github/linguist/pull/3846/files#diff-d5179df0b71620e3fac4535cd1368d15L60) algorithm. See [#193](https://github.com/src-d/enry/issues/193).

- The Bayesian classifier can't distinguish "SQL" from "PLpgSQL". See [#194](https://github.com/src-d/enry/issues/194).

- Detection of [generated files](https://github.com/github/linguist/blob/bf95666fc15e49d556f2def4d0a85338423c25f3/lib/linguist/generated.rb#L53) is not supported yet.
  (Thus they are not excluded from CLI output.) See [#213](https://github.com/src-d/enry/issues/213).

- Overriding languages and types through `.gitattributes` is not yet supported. See [#18](https://github.com/src-d/enry/issues/18).

- `enry` CLI output does NOT exclude `.gitignore`d files and git submodules, as Linguist does.

For all the cases above that have an issue number, we plan to update enry to match Linguist behavior.
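
The RE2-related divergences can be reproduced with the Go standard library alone. The sketch below uses toy patterns (they are not taken from `heuristics.yml`) to show that Go's `regexp` package, which implements RE2 syntax, refuses to compile lookaheads, backreferences and possessive quantifiers.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's regexp package accepts the pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	patterns := []string{
		`^foo(?!bar)`, // negative lookahead
		`^foo(?=bar)`, // positive lookahead
		`(['"])x\1`,   // backreference
		`a*+b`,        // possessive quantifier
		`^foo[^b]`,    // plain RE2 syntax, for comparison
	}
	for _, p := range patterns {
		fmt.Printf("%q compiles: %v\n", p, compiles(p))
	}
}
```

Only the last pattern is accepted, which is why heuristics relying on the other constructs cannot be ported as they are.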
## Benchmarks
Enry's language detection has been compared with Linguist's on [_linguist/samples_](https://github.com/github/linguist/tree/master/samples).

We got these results:

![histogram](benchmarks/histogram/distribution.png)

The histogram shows the _number of files_ (y-axis) per _time interval bucket_ (x-axis).
Most of the files were detected faster by enry.

There are several cases where enry is slower than Linguist because
Go's regexp engine is slower than Ruby's, which is based on the [oniguruma](https://github.com/kkos/oniguruma) library, written in C.

See [instructions](#misc) for running enry with oniguruma.
## Why Enry?
In the movie [My Fair Lady](https://en.wikipedia.org/wiki/My_Fair_Lady), [Professor Henry Higgins](http://www.imdb.com/character/ch0011719/) is a linguist who at the very beginning of the movie enjoys guessing the origin of people based on their accent.

"Enry Iggins" is how [Eliza Doolittle](http://www.imdb.com/character/ch0011720/) [pronounces](https://www.youtube.com/watch?v=pwNKyTktDIE) the name of the Professor.
## Development
To run the tests use:

    go test ./...

Setting `ENRY_TEST_REPO` to the path of an existing checkout of Linguist will avoid cloning it and speed tests up.
Setting `ENRY_DEBUG=1` will provide insight into the Bayesian classifier building done by `make code-generate`.
### Sync with github/linguist upstream
_enry_ re-uses parts of the original [github/linguist](https://github.com/github/linguist) to generate internal data structures.
In order to update to the latest release of Linguist do:

```bash
$ git clone https://github.com/github/linguist.git .linguist
$ cd .linguist; git checkout <release-tag>; cd ..

# put the new release's commit sha in the generator_test.go (to re-generate .gold test fixtures)
# https://github.com/go-enry/go-enry/blob/13d3d66d37a87f23a013246a1b0678c9ee3d524b/internal/code-generator/generator/generator_test.go#L18

$ make code-generate
```

To stay in sync, enry needs to be updated when a new release of Linguist includes changes to any of the following files:

- [languages.yml](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml)
- [heuristics.yml](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.yml)
- [vendor.yml](https://github.com/github/linguist/blob/master/lib/linguist/vendor.yml)
- [documentation.yml](https://github.com/github/linguist/blob/master/lib/linguist/documentation.yml)

There is no automation for detecting the changes in the Linguist project, so the process above has to be done manually from time to time.

When submitting a pull request syncing up to a new release, please make sure it only contains the changes in
the generated files (in the [data](https://github.com/go-enry/go-enry/blob/master/data) subdirectory).

Separating all the necessary "manual" code changes into a different PR that includes some background description and an update to the documentation on ["divergences from linguist"](#divergences-from-linguist) is very much appreciated as it simplifies the maintenance (review/release notes/etc).
## Misc

<details>
  <summary>Running a benchmark & faster regexp engine</summary>

### Benchmark

All benchmark scripts are in the [_benchmarks_](https://github.com/go-enry/go-enry/blob/master/benchmarks) directory.

#### Dependencies

As the benchmarks depend on Ruby and the GitHub Linguist gem, make sure you have:

- Ruby (e.g. using [`rbenv`](https://github.com/rbenv/rbenv)) and [`bundler`](https://bundler.io/) installed
- Docker
- [native dependencies](https://github.com/github/linguist/#dependencies) installed
- Build the gem: `cd .linguist && bundle install && rake build_gem && cd -`
- Install it: `gem install --no-rdoc --no-ri --local .linguist/github-linguist-*.gem`

#### Quick benchmark

To run quicker benchmarks, use

    make benchmarks

to get average times for the primary detection function and strategies for the whole samples set. If you want to see measures per sample file use:

    make benchmarks-samples

#### Full benchmark

If you want to reproduce the same benchmarks as reported above:

- Make sure all [dependencies](#dependencies) are installed
- Install [gnuplot](http://gnuplot.info) (in order to plot the histogram)
- Run `ENRY_TEST_REPO="$PWD/.linguist" benchmarks/run.sh` (takes ~15h)

It will run the benchmarks for enry and Linguist, parse the output, create csv files and plot the histogram.

### Faster regexp engine (optional)

[Oniguruma](https://github.com/kkos/oniguruma) is CRuby's regular expression engine.
It is very fast and performs better than the one built into the Go runtime. _enry_ supports swapping
between those two engines thanks to the [rubex](https://github.com/moovweb/rubex) project.
The typical overall speedup from using Oniguruma is 1.5-2x. However, it requires CGo and the external shared library.
On macOS with [Homebrew](https://brew.sh/), it is:

```
brew install oniguruma
```

On Ubuntu, it is

```
sudo apt install libonig-dev
```

To build enry with Oniguruma regexps use the `oniguruma` build tag

```
go get -v -t --tags oniguruma ./...
```

and then rebuild the project.

</details>
## License
Apache License, Version 2.0. See [LICENSE](LICENSE)