Free Frontend

1612437669

21 Sublime Text Themes

Collection of free Sublime 📝 Text UI themes (dark, light and other). Update of February 2019 collection. 3 new items.

#html #css #javascript

Navigating Between DOM Nodes in JavaScript

In the previous chapters you've learnt how to select individual elements on a web page. But there are many occasions where you need to access a child, parent or ancestor element. See the JavaScript DOM nodes chapter to understand the logical relationships between the nodes in a DOM tree.

DOM node provides several properties and methods that allow you to navigate or traverse through the tree structure of the DOM and make changes very easily. In the following section we will learn how to navigate up, down, and sideways in the DOM tree using JavaScript.

Accessing the Child Nodes

You can use the firstChild and lastChild properties of a DOM node to access its first and last direct child node, respectively. If the node has no child nodes, both properties return null.

Example

<div id="main">
    <h1 id="title">My Heading</h1>
    <p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");
console.log(main.firstChild.nodeName); // Prints: #text

var hint = document.getElementById("hint");
console.log(hint.firstChild.nodeName); // Prints: SPAN
</script>

Note: The nodeName is a read-only property that returns the name of the current node as a string. For example, it returns the tag name for element node, #text for text node, #comment for comment node, #document for document node, and so on.

In the example above, the nodeName of the first child of the main <div> is #text rather than H1. This is because whitespace characters such as spaces, tabs and newlines are valid text: they form #text nodes that become part of the DOM tree. Since the <div> tag is followed by a newline before the <h1> tag, that newline creates a #text node as the first child.

To avoid firstChild and lastChild returning #text or #comment nodes, you can use the firstElementChild and lastElementChild properties instead, which return only the first and last element node, respectively. Note that these properties are not supported in Internet Explorer 8 and earlier.

Example

<div id="main">
    <h1 id="title">My Heading</h1>
    <p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");
alert(main.firstElementChild.nodeName); // Outputs: H1
main.firstElementChild.style.color = "red";

var hint = document.getElementById("hint");
alert(hint.firstElementChild.nodeName); // Outputs: SPAN
hint.firstElementChild.style.color = "blue";
</script>

Similarly, you can use the childNodes property to access all child nodes of a given element, where the first child node is assigned index 0. Here's an example:

Example

<div id="main">
    <h1 id="title">My Heading</h1>
    <p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");

// First check that the element has child nodes 
if(main.hasChildNodes()) {
    var nodes = main.childNodes;
    
    // Loop through node list and display node name
    for(var i = 0; i < nodes.length; i++) {
        alert(nodes[i].nodeName);
    }
}
</script>

The childNodes property returns all child nodes, including non-element nodes such as text and comment nodes. To get a collection of only element nodes, use the children property instead.

Example

<div id="main">
    <h1 id="title">My Heading</h1>
    <p id="hint"><span>This is some text.</span></p>
</div>

<script>
var main = document.getElementById("main");

// First check that the element has child nodes 
if(main.hasChildNodes()) {
    var nodes = main.children;
    
    // Loop through the element collection and display each element's name
    for(var i = 0; i < nodes.length; i++) {
        alert(nodes[i].nodeName);
    }
}
</script>

#javascript 

Hermann Frami

1642173480

Flight Rules for Git

What are "flight rules"?

A guide for astronauts (now, programmers using Git) about what to do when things go wrong.

Flight Rules are the hard-earned body of knowledge recorded in manuals that list, step-by-step, what to do if X occurs, and why. Essentially, they are extremely detailed, scenario-specific standard operating procedures. [...]

NASA has been capturing our missteps, disasters and solutions since the early 1960s, when Mercury-era ground teams first started gathering "lessons learned" into a compendium that now lists thousands of problematic situations, from engine failure to busted hatch handles to computer glitches, and their solutions.

— Chris Hadfield, An Astronaut's Guide to Life on Earth.

Conventions for this document

For clarity's sake all examples in this document use a customized bash prompt in order to indicate the current branch and whether or not there are staged changes. The branch is enclosed in parentheses, and a * next to the branch name indicates staged changes.

All commands should work for at least git version 2.13.0. See the git website to update your local git version.

Repositories

I want to start a local repository

To initialize an existing directory as a Git repository:

(my-folder) $ git init

I want to clone a remote repository

To clone (copy) a remote repository, copy the URL for the repository, and run:

$ git clone [url]

This will save it to a folder named the same as the remote repository's. Make sure you have a connection to the remote server you are cloning from (for most purposes this means making sure you are connected to the internet).

To clone it into a folder with a different name than the default repository name:

$ git clone [url] name-of-new-folder

I set the wrong remote repository

There are a few possible problems here:

If you cloned the wrong repository, simply delete the directory created after running git clone and clone the correct repository.

If you set the wrong repository as the origin of an existing local repository, change the URL of your origin by running:

$ git remote set-url origin [url of the actual repo]

For more, see this StackOverflow topic.

I want to add code to someone else's repository

Git doesn't allow you to add code to someone else's repository without access rights. Neither does GitHub, which is not the same as Git, but rather a hosted service for Git repositories. However, you can suggest code using patches, or, on GitHub, forks and pull requests.

First, a bit about forking. A fork is a copy of a repository. It is not a git operation, but is a common action on GitHub, Bitbucket, GitLab — or anywhere people host Git repositories. You can fork a repository through the hosted UI.

Suggesting code via pull requests

After you've forked a repository, you normally need to clone the repository to your machine. You can do some small edits on GitHub, for instance, without cloning, but this isn't a github-flight-rules list, so let's go with how to do this locally.

# if you are using ssh
$ git clone git@github.com:k88hudson/git-flight-rules.git

# if you are using https
$ git clone https://github.com/k88hudson/git-flight-rules.git

If you cd into the resulting directory, and type git remote, you'll see a list of the remotes. Normally there will be one remote - origin - which will point to k88hudson/git-flight-rules. In this case, we also want a remote that will point to your fork.

First, to follow a Git convention, we normally use the remote name origin for your own repository and upstream for whatever you've forked. So, rename the origin remote to upstream:

$ git remote rename origin upstream

You can also do this using git remote set-url, but it takes more steps.

Then, set up a new remote that points to your project.

$ git remote add origin git@github.com:YourName/git-flight-rules.git

Note that now you have two remotes.

  • origin references your own repository.
  • upstream references the original one.

From origin, you can read and write. From upstream, you can only read.
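
The remote setup described above can be tried without touching any hosted service. The following is a minimal sketch, assuming two local bare repositories (original.git and fork.git, both hypothetical names) stand in for the original project and your fork:

```shell
set -eu
work="$(mktemp -d)"

# Hypothetical stand-ins for the hosted repositories: two local bare repos.
git init -q --bare "$work/original.git"
git init -q --bare "$work/fork.git"

# Cloning the original makes it the "origin" remote by default.
# (The warning about cloning an empty repository is silenced.)
git clone -q "$work/original.git" "$work/clone" 2>/dev/null
cd "$work/clone"

# Follow the convention: the original becomes "upstream", the fork "origin".
git remote rename origin upstream
git remote add origin "$work/fork.git"

# List both remotes with their URLs.
git remote -v
```

With real hosted repositories only the URLs change; the rename and add steps are the same.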

When you've finished making whatever changes you like, push your changes (normally in a branch) to the remote named origin. If you're on a branch, you could use --set-upstream to avoid specifying the remote tracking branch on every future push using this branch. For instance:

(feature/my-feature)$ git push --set-upstream origin feature/my-feature

Git itself has no command for opening a pull request on a hosting service (although there are tools, like hub, which will do this for you). So, when you're ready to make a pull request, go to GitHub (or your other Git host) and create a new pull request there. Note that your host automatically links the original and forked repositories.

After all of this, do not forget to respond to any code review feedback.

Suggesting code via patches

Another approach to suggesting code changes that doesn't rely on third party sites such as GitHub is to use git format-patch.

format-patch creates a .patch file for one or more commits. This file is essentially a list of changes that looks similar to the commit diffs you can view on GitHub.

A patch can be viewed and even edited by the recipient and applied using git am.

For example, to create a patch for the most recent commit you would run git format-patch HEAD^, which creates a .patch file called something like 0001-My-Commit-Message.patch.

To apply this patch file to your repository you would run git am ./0001-My-Commit-Message.patch.

Patches can also be sent via email using the git send-email command. For information on usage and configuration see: https://git-send-email.io
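
As a rough end-to-end illustration of the patch workflow, the sketch below creates a commit in one throwaway repository, exports it with git format-patch, and applies it in a second copy with git am. The directory names, file names and identities are hypothetical.

```shell
set -eu
work="$(mktemp -d)"

# The author's repository (hypothetical identity and content).
git init -q "$work/author"
cd "$work/author"
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"
git config user.email "author@example.com"
echo "first" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# The recipient starts from the same base commit.
git clone -q "$work/author" "$work/recipient"

# The author makes the change that will be suggested as a patch.
echo "second" >> notes.txt
git commit -q -am "Extend notes"

# Write one .patch file for the most recent commit into $work/patches.
git format-patch -o "$work/patches" HEAD^ >/dev/null

# Apply it in the recipient's copy; git am recreates the commit there.
cd "$work/recipient"
git config user.name "Example Recipient"
git config user.email "recipient@example.com"
git am -q "$work"/patches/*.patch
git --no-pager log --oneline
```

The commit recreated by git am keeps the original author and message, while the recipient is recorded as the committer.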

I need to update my fork with latest updates from the original repository

After a while, the upstream repository may have been updated, and these updates need to be pulled into your origin repo. Remember that like you, other people are contributing too. Suppose that you are in your own feature branch and you need to update it with the original repository updates.

You probably have set up a remote that points to the original project. If not, do this now. Generally we use upstream as a remote name:

(main)$ git remote add upstream <link-to-original-repository>
# (main)$ git remote add upstream git@github.com:k88hudson/git-flight-rules.git

Now you can fetch from upstream and get the latest updates.

(main)$ git fetch upstream
(main)$ git merge upstream/main

# or using a single command
(main)$ git pull upstream main
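
To see the fetch-and-merge step in isolation, here is a small sketch in which a local repository (the hypothetical directory original) plays the role of the upstream project and gains a commit after you have cloned it:

```shell
set -eu
work="$(mktemp -d)"

# Hypothetical stand-in for the original project.
git init -q "$work/original"
cd "$work/original"
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Maintainer"
git config user.email "maintainer@example.com"
echo "v1" > README
git add README
git commit -q -m "Initial commit"

# Your working copy, with the original registered under the name "upstream".
git clone -q -o upstream "$work/original" "$work/mine"

# Meanwhile, the original project moves ahead.
echo "v2" > README
git commit -q -am "Update README"

# Bring the new upstream commit into the local main branch.
cd "$work/mine"
git fetch -q upstream
git merge -q upstream/main
git --no-pager log --oneline
```

Because the local branch has no commits of its own here, the merge is a fast-forward; with local commits present, Git would create a merge commit instead.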

Editing Commits

 

What did I just commit?

Let's say that you just blindly committed changes with git commit -a and you're not sure what the actual content of the commit you just made was. You can show the latest commit on your current HEAD with:

(main)$ git show

Or

$ git log -n1 -p

If you want to see a file at a specific commit, you can also do this (where <commitid> is the commit you're interested in):

$ git show <commitid>:filename

I wrote the wrong thing in a commit message

If you wrote the wrong thing and the commit has not yet been pushed, you can do the following to change the commit message without changing the changes in the commit:

$ git commit --amend --only

This will open your default text editor, where you can edit the message. On the other hand, you can do this all in one command:

$ git commit --amend --only -m 'xxxxxxx'

If you have already pushed the message, you can amend the commit and force push, but this is not recommended.

 

I committed with the wrong name and email configured

If it's a single commit, amend it

$ git commit --amend --no-edit --author "New Authorname <authoremail@mydomain.com>"

An alternative is to correctly configure your author settings with git config --global user.(name|email) and then use

$ git commit --amend --reset-author --no-edit

If you need to change all of history, see the man page for git filter-branch.
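
Both single-commit fixes can be checked in a scratch repository. This is a minimal sketch with made-up names and addresses; it sets the identity in the repository's local configuration rather than with --global so that it does not touch your real settings.

```shell
set -eu
work="$(mktemp -d)"
git init -q "$work/repo"
cd "$work/repo"

# Commit with a deliberately wrong identity (hypothetical values).
git config user.name "Wrong Name"
git config user.email "wrong@example.com"
echo "hello" > file.txt
git add file.txt
git commit -q -m "Add file"

# Option 1: set the author explicitly on the amended commit.
git commit -q --amend --no-edit --author "Right Name <right@example.com>"
git --no-pager log -1 --format='%an <%ae>'

# Option 2: fix the configuration, then reset the author from it.
git config user.name "Configured Name"
git config user.email "configured@example.com"
git commit -q --amend --reset-author --no-edit
git --no-pager log -1 --format='%an <%ae>'
```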

I want to remove a file from the previous commit

In order to remove changes for a file from the previous commit, do the following:

$ git checkout HEAD^ myfile
$ git add myfile
$ git commit --amend --no-edit

In case the file was newly added to the commit and you want to remove it (from Git alone), do:

$ git rm --cached myfile
$ git commit --amend --no-edit

This is particularly useful when you have an open patch and you have committed an unnecessary file, and need to force push to update the patch on a remote. The --no-edit option is used to keep the existing commit message.
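
The second case (a newly added file that should not have been committed) looks like this in a throwaway repository. The file names app.txt and debug.log are hypothetical.

```shell
set -eu
work="$(mktemp -d)"
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Example User"
git config user.email "user@example.com"

echo "code" > app.txt
git add app.txt
git commit -q -m "Add app"

# The commit to repair: it accidentally includes a new file, debug.log.
echo "more code" >> app.txt
echo "noise" > debug.log
git add app.txt debug.log
git commit -q -m "Improve app"

# Drop debug.log from the index only, then rewrite the last commit.
git rm -q --cached debug.log
git commit -q --amend --no-edit

# The amended commit no longer lists debug.log, but the file is still on disk.
git --no-pager show --stat --format=%s HEAD
```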

 

I want to delete or remove my last commit

If you need to delete pushed commits, you can use the following. However, it will irreversibly change your history, and mess up the history of anyone else who had already pulled from the repository. In short, if you're not sure, you should never do this, ever.

$ git reset HEAD^ --hard
$ git push --force-with-lease [remote] [branch]

If you haven't pushed, to reset Git to the state it was in before you made your last commit (while keeping your staged changes):

(my-branch*)$ git reset --soft HEAD@{1}

HEAD@{1} is the reflog entry for where HEAD pointed before its most recent move, so this assumes the commit was the last thing you did; HEAD^ names the parent commit directly. This only works if you haven't pushed. If you have pushed, the only truly safe thing to do is git revert SHAofBadCommit. That will create a new commit that undoes all of the bad commit's changes. Or, if the branch you pushed to is rebase-safe (i.e. other devs aren't expected to pull from it), you can just use git push --force-with-lease. For more, see the above section.
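
For the unpushed case, the effect of a soft reset can be observed in a scratch repository. This sketch uses HEAD^ (the parent of the last commit) rather than a reflog entry, and the file names are hypothetical.

```shell
set -eu
work="$(mktemp -d)"
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Example User"
git config user.email "user@example.com"

echo "one" > a.txt
git add a.txt
git commit -q -m "First"

echo "two" > b.txt
git add b.txt
git commit -q -m "Second"

# Undo the last commit but keep its changes staged.
git reset -q --soft HEAD^

# b.txt shows up as a staged addition; the branch tip is "First" again.
git status --short
git --no-pager log --oneline
```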

 

Delete/remove arbitrary commit

The same warning applies as above. Never do this if possible.

$ git rebase --onto SHA1_OF_BAD_COMMIT^ SHA1_OF_BAD_COMMIT
$ git push --force-with-lease [remote] [branch]

Or do an interactive rebase and remove the line(s) corresponding to commit(s) you want to see removed.

 

I tried to push my amended commit to a remote, but I got an error message

To https://github.com/yourusername/repo.git
! [rejected]        mybranch -> mybranch (non-fast-forward)
error: failed to push some refs to 'https://github.com/tanay1337/webmaker.org.git'
hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. Integrate the remote changes (e.g.
hint: 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

Note that, as with rebasing (see below), amending replaces the old commit with a new one, so you must force push (--force-with-lease) your changes if you have already pushed the pre-amended commit to your remote. Be careful when you do this – always make sure you specify a branch!

(my-branch)$ git push origin mybranch --force-with-lease

In general, avoid force pushing. It is best to create and push a new commit rather than force-pushing the amended commit as it will cause conflicts in the source history for any other developer who has interacted with the branch in question or any child branches. --force-with-lease will still fail, if someone else was also working on the same branch as you, and your push would overwrite those changes.

If you are absolutely sure that nobody is working on the same branch or you want to update the tip of the branch unconditionally, you can use --force (-f), but this should be avoided in general.

 

I accidentally did a hard reset, and I want my changes back

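If you accidentally do git reset --hard, you can normally still get your commit back, as Git keeps a log (the reflog) of where HEAD has pointed. By default, reflog entries are kept for 90 days, or 30 days for entries that are no longer reachable from the current tip of the branch.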

Note: This is only valid if your work is backed up, i.e., either committed or stashed. git reset --hard will remove uncommitted modifications, so use it with caution. (A safer option is git reset --keep.)

(main)$ git reflog

You'll see a list of your past commits, and a commit for the reset. Choose the SHA of the commit you want to return to, and reset again:

(main)$ git reset --hard SHA1234

And you should be good to go.
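
The recovery can be rehearsed safely in a scratch repository. In this sketch the commit is looked up through the reflog entry HEAD@{1} instead of a SHA copied by hand; the file name is hypothetical.

```shell
set -eu
work="$(mktemp -d)"
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Example User"
git config user.email "user@example.com"

echo "one" > a.txt
git add a.txt
git commit -q -m "First"
echo "two" >> a.txt
git commit -q -am "Second"
good="$(git rev-parse HEAD)"

# The accident: the branch tip moves back and "Second" disappears from the log.
git reset -q --hard HEAD^

# The reflog still records where HEAD pointed before the reset.
git --no-pager reflog

# HEAD@{1} is the position just before the most recent move of HEAD.
git reset -q --hard "HEAD@{1}"
git --no-pager log --oneline
```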

 

I accidentally committed and pushed a merge

If you accidentally merged a feature branch to the main development branch before it was ready to be merged, you can still undo the merge. But there's a catch: A merge commit has more than one parent (usually two).

The command to use

(feature-branch)$ git revert -m 1 <commit>

where the -m 1 option says to select parent number 1 (the branch into which the merge was made) as the parent to revert to.

Note: the parent number is not a commit identifier. Rather, a merge commit has a line such as Merge: 8e2ce2d 86ac2e7. The parent number is the 1-based index of the desired parent on this line: the first identifier is number 1, the second is number 2, and so on.
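
A small local experiment makes the role of -m 1 more concrete. The sketch below merges a hypothetical feature branch into main and then reverts the merge relative to its first parent; nothing is pushed.

```shell
set -eu
work="$(mktemp -d)"
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "Base"

# Work on a feature branch.
git checkout -q -b feature
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "Add feature"

# Unrelated work on main, followed by the premature merge.
git checkout -q main
echo "main" > main.txt
git add main.txt
git commit -q -m "Main work"
git merge -q --no-ff -m "Merge feature" feature

# Revert the merge, keeping the state of parent 1 (main before the merge).
git revert --no-edit -m 1 HEAD >/dev/null
git --no-pager log --oneline
```

After the revert, feature.txt is gone from main again while main.txt is untouched, and the merge commit itself stays in the history.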

 

I accidentally committed and pushed files containing sensitive data

If you accidentally pushed files containing sensitive, or private data (passwords, keys, etc.), you can amend the previous commit. Keep in mind that once you have pushed a commit, you should consider any data it contains to be compromised. These steps can remove the sensitive data from your public repo or your local copy, but you cannot remove the sensitive data from other people's pulled copies. If you committed a password, change it immediately. If you committed a key, re-generate it immediately. Amending the pushed commit is not enough, since anyone could have pulled the original commit containing your sensitive data in the meantime.

If you edit the file and remove the sensitive data, then run

(feature-branch)$ git add edited_file
(feature-branch)$ git commit --amend --no-edit
(feature-branch)$ git push --force-with-lease origin [branch]

If you want to remove an entire file (but keep it locally), then run

(feature-branch)$ git rm --cached sensitive_file
(feature-branch)$ echo sensitive_file >> .gitignore
(feature-branch)$ git add .gitignore
(feature-branch)$ git commit --amend --no-edit
(feature-branch)$ git push --force-with-lease origin [branch]

Alternatively store your sensitive data in local environment variables.

If you want to completely remove an entire file (and not keep it locally), then run

(feature-branch)$ git rm sensitive_file
(feature-branch)$ git commit --amend --no-edit
(feature-branch)$ git push --force-with-lease origin [branch]

If you have made other commits in the meantime (i.e. the sensitive data is in a commit before the previous commit), you will have to rebase.

 

I want to remove a large file from ever existing in repo history

If the file you want to delete is secret or sensitive, instead see how to remove sensitive files.

Even if you delete a large or unwanted file in a recent commit, it still exists in git history, in your repo's .git folder, and will make git clone download unneeded files.

The actions in this part of the guide will require a force push, and rewrite large sections of repo history, so if you are working with remote collaborators, check first that any local work of theirs is pushed.

There are two options for rewriting history, the built-in git-filter-branch or bfg-repo-cleaner. bfg is significantly cleaner and more performant, but it is a third-party download and requires java. We will describe both alternatives. The final step is to force push your changes, which requires special consideration on top of a regular force push, given that a great deal of repo history will have been permanently changed.

Recommended Technique: Use third-party bfg

Using bfg-repo-cleaner requires java. Download the bfg jar from the link here. Our examples will use bfg.jar, but your download may have a version number, e.g. bfg-1.13.0.jar.

To delete a specific file.

(main)$ git rm path/to/filetoremove
(main)$ git commit -m "Commit removing filetoremove"
(main)$ java -jar ~/Downloads/bfg.jar --delete-files filetoremove

Note that in bfg you must use the plain file name even if it is in a subdirectory.

You can also delete a file by pattern, e.g.:

(main)$ git rm *.jpg
(main)$ git commit -m "Commit removing *.jpg"
(main)$ java -jar ~/Downloads/bfg.jar --delete-files '*.jpg'

With bfg, the files that exist in your latest commit are not affected. For example, if you had several large .tga files in your repo and an earlier commit deleted a subset of them, this call does not touch the files still present in the latest commit.

Note: if a commit renamed a file, e.g. from LargeFileFirstName.mp4 to LargeFileSecondName.mp4, running java -jar ~/Downloads/bfg.jar --delete-files LargeFileSecondName.mp4 will not remove the file under its first name from git history. Either run the --delete-files command with both filenames, or with a matching pattern.

Built-in Technique: Use git-filter-branch

git-filter-branch is more cumbersome and has fewer features, but you may use it if you cannot install or run bfg.

In the command below, replace filepattern with a specific name or a pattern, e.g. *.jpg. This will remove files matching the pattern from all history and branches.

(main)$ git filter-branch --force --index-filter 'git rm --cached --ignore-unmatch filepattern' --prune-empty --tag-name-filter cat -- --all

Behind-the-scenes explanation:

--tag-name-filter cat is a cumbersome, but simplest, way to apply the original tags to the new commits, using the command cat.

--prune-empty removes any now-empty commits.

Final Step: Pushing your changed repo history

Once you have removed your desired files, test carefully that you haven't broken anything in your repo - if you have, it is easiest to re-clone your repo to start over. To finish, optionally use git garbage collection to minimize your local .git folder size, and then force push.

(main)$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
(main)$ git push origin --force --tags

Since you just rewrote the entire git repo history, the git push operation may be too large, and return the error “The remote end hung up unexpectedly”. If this happens, you can try increasing the git post buffer:

(main)$ git config http.postBuffer 524288000
(main)$ git push --force

If this does not work, you will need to manually push the repo history in chunks of commits. In the command below, try increasing <number> until the push operation succeeds.

(main)$ git push -u origin HEAD~<number>:refs/heads/main --force

Once the push operation succeeds the first time, decrease <number> gradually until a conventional git push succeeds.

 

I need to change the content of a commit which is not my last

Suppose you created some (e.g. three) commits and later realize that you left out something that belongs in the first of those commits. This bothers you, because if you created a new commit containing those changes, you'd have a clean code base, but your commits wouldn't be atomic (i.e. changes that belong together wouldn't be in the same commit). In such a situation you may want to amend the commit where these changes belong and leave the following commits unaltered. In such a case, git rebase might save you.

Consider a situation where you want to change the third last commit you made.

(your-branch)$ git rebase -i HEAD~3

gets you into interactive rebase mode, which allows you to edit any of your last three commits. A text editor pops up, showing you something like

pick 9e1d264 The third last commit
pick 4b6e19a The second to last commit
pick f4037ec The last commit

which you change into

edit 9e1d264 The third last commit
pick 4b6e19a The second to last commit
pick f4037ec The last commit

This tells rebase that you want to edit your third last commit and keep the other two unaltered. Then you'll save (and close) the editor. Git will then start to rebase. It stops on the commit you want to alter, giving you the chance to edit that commit. Now you can apply the changes which you missed applying when you initially committed that commit. You do so by editing and staging them. Afterwards you'll run

(your-branch)$ git commit --amend --no-edit

which tells Git to recreate the commit, but to leave the commit message unedited. Having done that, the hard part is solved.

(your-branch)$ git rebase --continue

will do the rest of the work for you.
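
The whole sequence can be rehearsed non-interactively. The sketch below is only an approximation of the manual workflow: it assumes a scratch repository with four hypothetical commits, picks the range so that the todo list contains exactly the last three of them, and uses GIT_SEQUENCE_EDITOR with sed as a stand-in for changing the first pick to edit by hand.

```shell
set -eu
work="$(mktemp -d)"
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Example User"
git config user.email "user@example.com"

# Four small commits; "Commit 2" is the third to last.
for n in 1 2 3 4; do
  echo "$n" > "file$n.txt"
  git add "file$n.txt"
  git commit -q -m "Commit $n"
done

# Stand-in for the manual edit: turn the first "pick" of the todo list into "edit".
GIT_SEQUENCE_EDITOR="sed -i.bak -e '1s/^pick/edit/'" git rebase -i HEAD~3 >/dev/null 2>&1

# The rebase is now stopped at "Commit 2". Add the change that was forgotten.
echo "forgotten" > extra.txt
git add extra.txt
git commit -q --amend --no-edit

# Replay the remaining two commits on top of the amended one.
git rebase --continue >/dev/null 2>&1
git --no-pager log --oneline
```

In everyday use you would simply edit the todo list in your editor; the environment variable is only there so the example runs unattended.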

Staging

 

I want to stage all tracked files and leave untracked files

$ git add -u

To stage part of tracked files

# to stage files with ext .txt
$ git add -u *.txt

# to stage all files inside directory src
$ git add -u src/

 

I need to add staged changes to the previous commit

(my-branch*)$ git commit --amend

If you already know you don't want to change the commit message, you can tell git to reuse the commit message:

(my-branch*)$ git commit --amend -C HEAD

 

I want to stage part of a new file, but not the whole file

Normally, if you want to stage part of a file, you run this:

$ git add --patch filename.x

-p will work for short. This will open interactive mode. You would be able to use the s option to split a hunk into smaller ones; however, if the file is new, you will not have this option. To add a new file, do this:

$ git add -N filename.x

Then, you will need to use the e option to manually choose which lines to add. Running git diff --cached or git diff --staged will show you which lines you have staged compared to which are still saved locally.

 

I want to add changes in one file to two different commits

git add will add the entire file to a commit. git add -p will allow you to interactively select which changes you want to add.

 

I staged too many edits, and I want to break them out into a separate commit

git reset -p will open a patch mode reset dialog. This is similar to git add -p, except that selecting "yes" will unstage the change, removing it from the upcoming commit.

 

I want to stage my unstaged edits, and unstage my staged edits

In many cases, you should unstage all of your staged files and then pick the file you want and commit it. However, if you want to switch the staged and unstaged edits, you can create a temporary commit to store your staged files, stage your unstaged files and then stash them. Then, reset the temporary commit and pop your stash.

$ git commit -m "WIP"
$ git add . # This will also add untracked files.
$ git stash
$ git reset HEAD^
$ git stash pop --index 0

NOTE 1: pop is used here to keep the operation as idempotent as possible.

NOTE 2: Your staged files will be marked as unstaged if you don't use the --index flag. (This link explains why.)

Unstaged Edits

 

I want to move my unstaged edits to a new branch

$ git checkout -b my-branch

 

I want to move my unstaged edits to a different, existing branch

$ git stash
$ git checkout my-branch
$ git stash pop

 

I want to discard my local uncommitted changes (staged and unstaged)

If you want to discard all your local staged and unstaged changes, you can do this:

(my-branch)$ git reset --hard
# or
(main)$ git checkout -f

This will unstage all files you might have staged with git add:

$ git reset

This will revert all local uncommitted changes (should be executed in repo root):

$ git checkout .

You can also revert uncommitted changes to a particular file or directory:

$ git checkout [some_dir|file.txt]

Yet another way to revert all uncommitted changes (longer to type, but works from any subdirectory):

$ git reset --hard HEAD

This will remove all local untracked files, so only files tracked by Git remain:

$ git clean -fd

-x will also remove all ignored files.

I want to discard specific unstaged changes

When you want to get rid of some, but not all changes in your working copy.

Checkout undesired changes, keep good changes.

$ git checkout -p
# Answer y to all of the snippets you want to drop

Another strategy involves using stash. Stash all the good changes, reset working copy, and reapply good changes.

$ git stash -p
# Select all of the snippets you want to save
$ git reset --hard
$ git stash pop

Alternatively, stash your undesired changes, and then drop stash.

$ git stash -p
# Select all of the snippets you don't want to save
$ git stash drop

I want to discard specific unstaged files

When you want to get rid of one specific file in your working copy.

$ git checkout myFile

Alternatively, to discard multiple files in your working copy, list them all.

$ git checkout myFirstFile mySecondFile

I want to discard only my unstaged local changes

When you want to get rid of all of your unstaged local uncommitted changes

$ git checkout .

 

I want to discard all of my untracked files

When you want to get rid of all of your untracked files

$ git clean -f

 

I want to unstage a specific staged file

Sometimes we have one or more files that accidentally ended up being staged, and these files have not been committed before. To unstage them:

$ git reset -- <filename>

This unstages the file and makes it look untracked again.

Branches

I want to list all branches

List local branches

$ git branch

List remote branches

$ git branch -r

List all branches (both local and remote)

$ git branch -a

 

Create a branch from a commit

$ git checkout -b <branch> <SHA1_OF_COMMIT>

 

I pulled from/into the wrong branch

This is another chance to use git reflog to see where your HEAD pointed before the bad pull.

(main)$ git reflog
ab7555f HEAD@{0}: pull origin wrong-branch: Fast-forward
c5bc55a HEAD@{1}: checkout: checkout message goes here

Simply reset your branch back to the desired commit:

$ git reset --hard c5bc55a

Done.

 

I want to discard local commits so my branch is the same as one on the server

Confirm that you haven't pushed your changes to the server.

git status should show how many commits you are ahead of origin:

(my-branch)$ git status
# On branch my-branch
# Your branch is ahead of 'origin/my-branch' by 2 commits.
#   (use "git push" to publish your local commits)
#

One way of resetting to match origin (to have the same as what is on the remote) is to do this:

(my-branch)$ git reset --hard origin/my-branch

 

I committed to main instead of a new branch

Create the new branch while remaining on main:

(main)$ git branch my-branch

Reset the branch main to the previous commit:

(main)$ git reset --hard HEAD^

HEAD^ is short for HEAD^1, which stands for the first parent of HEAD. Similarly, HEAD^2 stands for the second parent of the commit (merges can have 2 parents).

Note that HEAD^2 is not the same as HEAD~2 (see this link for more information).

Alternatively, if you don't want to use HEAD^, find the hash of the commit you want to set your main branch to (git log should do the trick). Then reset to that hash. git push will make sure that this change is reflected on your remote.

For example, if the hash of the commit that your main branch is supposed to be at is a13b85e:

(main)$ git reset --hard a13b85e
HEAD is now at a13b85e

Check out the new branch to continue working:

(main)$ git checkout my-branch
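
Put together, the three steps look like this in a scratch repository. The commit messages and file names are hypothetical, and nothing has been pushed.

```shell
set -eu
work="$(mktemp -d)"
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "Base"

# The mistake: feature work committed directly on main.
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "Feature work"

# 1. Create the branch at the current commit, staying on main.
git branch my-branch
# 2. Move main back by one commit.
git reset -q --hard HEAD^
# 3. Continue on the new branch, which still has the feature commit.
git checkout -q my-branch

git --no-pager log --oneline --all
```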

 

I want to keep the whole file from another ref-ish

Say you have a working spike (see note), with hundreds of changes. Everything is working. Now, you commit into another branch to save that work:

(solution)$ git add -A && git commit -m "Adding all changes from this spike into one big commit."

When you want to put it into a branch (maybe feature, maybe develop), you're interested in keeping whole files. You want to split your big commit into smaller ones.

Say you have:

  • branch solution, with the solution to your spike. One ahead of develop.
  • branch develop, where you want to add your changes.

You can solve it by bringing the contents to your branch:

(develop)$ git checkout solution -- file1.txt

This will get the contents of that file in branch solution to your branch develop:

# On branch develop
# Your branch is up-to-date with 'origin/develop'.
# Changes to be committed:
#  (use "git reset HEAD <file>..." to unstage)
#
#        modified:   file1.txt

Then, commit as usual.
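
Here is a compact version of that scenario, assuming a scratch repository in which file1.txt and file2.txt (hypothetical names) were both changed on the solution branch but only file1.txt should be brought over for now:

```shell
set -eu
work="$(mktemp -d)"
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/develop
git config user.name "Example User"
git config user.email "user@example.com"

echo "old 1" > file1.txt
echo "old 2" > file2.txt
git add file1.txt file2.txt
git commit -q -m "Initial files"

# The spike: one big commit on the solution branch.
git checkout -q -b solution
echo "new 1" > file1.txt
echo "new 2" > file2.txt
git commit -q -am "Adding all changes from this spike into one big commit."

# Back on develop, take the whole of file1.txt from solution and nothing else.
git checkout -q develop
git checkout solution -- file1.txt

# file1.txt is now staged with the solution content; file2.txt is unchanged.
git status --short
```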

Note: Spike solutions are made to analyze or solve the problem. These solutions are used for estimation and discarded once everyone gets clear visualization of the problem. ~ Wikipedia.

 

I made several commits on a single branch that should be on different branches

Say you are on your main branch. Running git log, you see you have made two commits:

(main)$ git log

commit e3851e817c451cc36f2e6f3049db528415e3c114
Author: Alex Lee <alexlee@example.com>
Date:   Tue Jul 22 15:39:27 2014 -0400

    Bug #21 - Added CSRF protection

commit 5ea51731d150f7ddc4a365437931cd8be3bf3131
Author: Alex Lee <alexlee@example.com>
Date:   Tue Jul 22 15:39:12 2014 -0400

    Bug #14 - Fixed spacing on title

commit a13b85e984171c6e2a1729bb061994525f626d14
Author: Aki Rose <akirose@example.com>
Date:   Tue Jul 21 01:12:48 2014 -0400

    First commit

Let's take note of our commit hashes for each bug (e3851e8 for #21, 5ea5173 for #14).

First, let's reset our main branch to the correct commit (a13b85e):

(main)$ git reset --hard a13b85e
HEAD is now at a13b85e

Now, we can create a fresh branch for our bug #21:

(main)$ git checkout -b 21
(21)$

Now, let's cherry-pick the commit for bug #21 on top of our branch. That means we will be applying that commit, and only that commit, directly on top of whatever our head is at.

(21)$ git cherry-pick e3851e8

At this point, there is a possibility there might be conflicts. See the There were conflicts section in the interactive rebasing section above for how to resolve conflicts.

Now let's create a new branch for bug #14, also based on main:

(21)$ git checkout main
(main)$ git checkout -b 14
(14)$

And finally, let's cherry-pick the commit for bug #14:

(14)$ git cherry-pick 5ea5173

 

I want to delete local branches that were deleted upstream

Once you merge a pull request on GitHub, it gives you the option to delete the merged branch in your fork. If you aren't planning to keep working on the branch, it's cleaner to delete the local copies of the branch so you don't end up cluttering up your working checkout with a lot of stale branches.

$ git fetch -p upstream

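where upstream is the remote you want to fetch from. The -p (--prune) option removes remote-tracking references, such as upstream/my-branch, that no longer exist on the remote; the local branches themselves still have to be deleted with git branch -d.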

 

I accidentally deleted my branch

If you're regularly pushing to remote, you should be safe most of the time. But still sometimes you may end up deleting your branches. Let's say we create a branch and create a new file:

(main)$ git checkout -b my-branch
(my-branch)$ git branch
(my-branch)$ touch foo.txt
(my-branch)$ ls
README.md foo.txt

Let's add it and commit.

(my-branch)$ git add .
(my-branch)$ git commit -m 'foo.txt added'
[my-branch 4e3cd85] foo.txt added
 1 files changed, 1 insertions(+)
 create mode 100644 foo.txt
(my-branch)$ git log

commit 4e3cd85a670ced7cc17a2b5d8d3d809ac88d5012
Author: siemiatj <siemiatj@example.com>
Date:   Wed Jul 30 00:34:10 2014 +0200

    foo.txt added

commit 69204cdf0acbab201619d95ad8295928e7f411d5
Author: Kate Hudson <katehudson@example.com>
Date:   Tue Jul 29 13:14:46 2014 -0400

    Fixes #6: Force pushing after amending commits

Now we're switching back to main and 'accidentally' removing our branch.

(my-branch)$ git checkout main
Switched to branch 'main'
Your branch is up-to-date with 'origin/main'.
(main)$ git branch -D my-branch
Deleted branch my-branch (was 4e3cd85).
(main)$ echo oh noes, deleted my branch!
oh noes, deleted my branch!

At this point you should get familiar with 'reflog', an upgraded logger. It records every change to HEAD and to your branch tips in the local repository.

(main)$ git reflog
69204cd HEAD@{0}: checkout: moving from my-branch to main
4e3cd85 HEAD@{1}: commit: foo.txt added
69204cd HEAD@{2}: checkout: moving from main to my-branch

As you can see, we have the commit hash from our deleted branch. Let's see if we can restore our deleted branch.

(main)$ git checkout -b my-branch-help
Switched to a new branch 'my-branch-help'
(my-branch-help)$ git reset --hard 4e3cd85
HEAD is now at 4e3cd85 foo.txt added
(my-branch-help)$ ls
README.md foo.txt

Voila! We got our removed file back. git reflog is also useful when rebasing goes terribly wrong.
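
As a variation on the steps above, you can also recreate the branch directly at the lost commit with git branch, without checking out a helper branch first. The sketch below replays the whole accident in a throwaway repository; the file and branch names mirror the example, and looking the hash up by its reflog message is just one possible approach.

```shell
# Throwaway repository; names mirror the example above and are illustrative.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

git commit -q --allow-empty -m "Initial commit"
git checkout -q -b my-branch
echo "hello" > foo.txt
git add foo.txt
git commit -q -m "foo.txt added"
lost=$(git rev-parse HEAD)             # remembered only to verify the result

git checkout -q main
git branch -q -D my-branch             # the 'accident'

# The reflog still records the commit; look it up by its reflog message.
found=$(git reflog --format='%H %gs' | grep 'commit: foo.txt added' | head -n 1 | cut -d' ' -f1)

# Recreate the branch at that commit without leaving main.
git branch my-branch "$found"
git log -1 --oneline my-branch
```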

I want to delete a branch

To delete a remote branch:

(main)$ git push origin --delete my-branch

You can also do:

(main)$ git push origin :my-branch

To delete a local branch:

(main)$ git branch -d my-branch

To delete a local branch that has not been merged to the current branch or an upstream:

(main)$ git branch -D my-branch

I want to delete multiple branches

Say you want to delete all branches that start with fix/:

(main)$ git branch | grep 'fix/' | xargs git branch -d
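
Note that grep 'fix/' also matches branches that merely contain fix/, such as hotfix/urgent. A possible alternative, sketched below in a throwaway repository with made-up branch names, is to let git for-each-ref select only refs under refs/heads/fix/.

```shell
# Throwaway repository; the branch names are illustrative assumptions.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

git commit -q --allow-empty -m "Initial commit"
git branch fix/login
git branch fix/typo
git branch hotfix/urgent               # contains "fix/" but does not start with it

# Select only branches whose names start with fix/, then delete them.
git for-each-ref --format='%(refname:short)' 'refs/heads/fix/*' | xargs git branch -d

# Show what is left.
git for-each-ref --format='%(refname:short)' refs/heads
```

As in the original command, git branch -d refuses to delete branches that are not fully merged, so nothing unmerged is lost by accident.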

I want to rename a branch

To rename the current (local) branch:

(main)$ git branch -m new-name

To rename a different (local) branch:

(main)$ git branch -m old-name new-name

To delete the old-name remote branch and push the new-name local branch:

(main)$ git push origin :old-name new-name

 

I want to checkout to a remote branch that someone else is working on

First, fetch all branches from remote:

(main)$ git fetch --all

Say you want to checkout to daves from the remote.

(main)$ git checkout --track origin/daves
Branch daves set up to track remote branch daves from origin.
Switched to a new branch 'daves'

(--track is shorthand for git checkout -b [branch] [remotename]/[branch])

This will give you a local copy of the branch daves, and any updates you push from it will also show up on the remote.

I want to create a new remote branch from current local one

$ git push <remote> HEAD

If you would also like to set that remote branch as upstream for the current one, use the following instead:

$ git push -u <remote> HEAD

With the upstream mode and the simple (default in Git 2.0) mode of the push.default config, the following command will push the current branch with regards to the remote branch that has been registered previously with -u:

$ git push

The behavior of the other modes of git push is described in the doc of push.default.

I want to set a remote branch as the upstream for a local branch

You can set a remote branch as the upstream for the current local branch using:

$ git branch --set-upstream-to [remotename]/[branch]
# or, using the shorthand:
$ git branch -u [remotename]/[branch]

To set the upstream remote branch for another local branch:

$ git branch -u [remotename]/[branch] [local-branch]

 

I want to set my HEAD to track the default remote branch

By checking your remote branches, you can see which remote branch your HEAD is tracking. In some cases, this is not the desired branch.

$ git branch -r
  origin/HEAD -> origin/gh-pages
  origin/main

To change origin/HEAD to track origin/main, you can run this command:

$ git remote set-head origin --auto
origin/HEAD set to main

I made changes on the wrong branch

You've made uncommitted changes and realise you're on the wrong branch. Stash changes and apply them to the branch you want:

(wrong_branch)$ git stash
(wrong_branch)$ git checkout <correct_branch>
(correct_branch)$ git stash apply

 

I want to split a branch into two

You've made a lot of commits on a branch and now want to separate it into two, ending with a branch up to an earlier commit and another with all the changes.

Use git log to find the commit where you want to split. Then do the following:

(original_branch)$ git checkout -b new_branch
(new_branch)$ git checkout original_branch
(original_branch)$ git reset --hard <sha1 split here>

If you had previously pushed the original_branch to remote, you will need to do a force push. For more information, check Stack Overflow.

Rebasing and Merging

 

I want to undo rebase/merge

You may have merged or rebased your current branch with a wrong branch, or you can't figure it out or finish the rebase/merge process. Git saves the original HEAD pointer in a variable called ORIG_HEAD before doing dangerous operations, so it is simple to recover your branch at the state before the rebase/merge.

(my-branch)$ git reset --hard ORIG_HEAD

 

I rebased, but I don't want to force push

Unfortunately, you have to force push, if you want those changes to be reflected on the remote branch. This is because you have changed the history. The remote branch won't accept changes unless you force push. This is one of the main reasons many people use a merge workflow, instead of a rebasing workflow - large teams can get into trouble with developers force pushing. Use this with caution. A safer way to use rebase is not to reflect your changes on the remote branch at all, and instead to do the following:

(main)$ git checkout my-branch
(my-branch)$ git rebase -i main
(my-branch)$ git checkout main
(main)$ git merge --ff-only my-branch

For more, see this SO thread.

 

I need to combine commits

Let's suppose you are working in a branch that is/will become a pull-request against main. In the simplest case when all you want to do is to combine all commits into a single one and you don't care about commit timestamps, you can reset and recommit. Make sure the main branch is up to date and all your changes committed, then:

(my-branch)$ git reset --soft main
(my-branch)$ git commit -am "New awesome feature"
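
Here is a small sketch of the reset-and-recommit approach in a throwaway repository, assuming three work-in-progress commits on my-branch. The file name and commit messages are invented for the example.

```shell
# Throwaway repository; file and commit names are illustrative assumptions.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

git commit -q --allow-empty -m "Base"
git checkout -q -b my-branch
for n in 1 2 3; do
    echo "change $n" >> feature.txt
    git add feature.txt
    git commit -q -m "WIP $n"
done

git reset --soft main                    # move the branch pointer; keep index and files
git commit -q -m "New awesome feature"   # one commit containing all three changes

count=$(git rev-list --count main..my-branch)
echo "commits ahead of main: $count"
```

Because --soft leaves the index untouched, everything from the three commits is already staged when the new commit is made.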

If you want more control, and also to preserve timestamps, you need to do something called an interactive rebase:

(my-branch)$ git rebase -i main

If you aren't working against another branch you'll have to rebase relative to your HEAD. If you want to squash the last 2 commits, for example, you'll have to rebase against HEAD~2. For the last 3, HEAD~3, etc.

(main)$ git rebase -i HEAD~2

After you run the interactive rebase command, you will see something like this in your text editor:

pick a9c8a1d Some refactoring
pick 01b2fd8 New awesome feature
pick b729ad5 fixup
pick e3851e8 another fix

# Rebase 8074d12..b729ad5 onto 8074d12
#
# Commands:
#  p, pick = use commit
#  r, reword = use commit, but edit the commit message
#  e, edit = use commit, but stop for amending
#  s, squash = use commit, but meld into previous commit
#  f, fixup = like "squash", but discard this commit's log message
#  x, exec = run command (the rest of the line) using shell
#
# These lines can be re-ordered; they are executed from top to bottom.
#
# If you remove a line here THAT COMMIT WILL BE LOST.
#
# However, if you remove everything, the rebase will be aborted.
#
# Note that empty commits are commented out

All the lines beginning with a # are comments; they won't affect your rebase.

Then you replace pick commands with any in the list above, and you can also remove commits by removing corresponding lines.

For example, if you want to leave the oldest (first) commit alone and combine all the following commits with the second oldest, you should edit the letter next to each commit except the first and the second to say f:

pick a9c8a1d Some refactoring
pick 01b2fd8 New awesome feature
f b729ad5 fixup
f e3851e8 another fix

If you want to combine these commits and rename the commit, you should additionally add an r next to the second commit or simply use s instead of f:

pick a9c8a1d Some refactoring
pick 01b2fd8 New awesome feature
s b729ad5 fixup
s e3851e8 another fix

You can then rename the commit in the next text prompt that pops up.

Newer, awesomer features

# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
# rebase in progress; onto 8074d12
# You are currently editing a commit while rebasing branch 'main' on '8074d12'.
#
# Changes to be committed:
#   modified:   README.md
#

If everything is successful, you should see something like this:

(main)$ Successfully rebased and updated refs/heads/main.

Safe merging strategy

--no-commit performs the merge but pretends the merge failed and does not autocommit, giving the user a chance to inspect and further tweak the merge result before committing. --no-ff maintains evidence that a feature branch once existed, keeping project history consistent.

(main)$ git merge --no-ff --no-commit my-branch

I need to merge a branch into a single commit

(main)$ git merge --squash my-branch

 

I want to combine only unpushed commits

Sometimes you have several work in progress commits that you want to combine before you push them upstream. You don't want to accidentally combine any commits that have already been pushed upstream because someone else may have already made commits that reference them.

(main)$ git rebase -i @{u}

This will do an interactive rebase that lists only the commits that you haven't already pushed, so it will be safe to reorder/fix/squash anything in the list.

I need to abort the merge

Sometimes a merge produces conflicts in certain files. In those cases, you can use the --abort option to stop the current conflict resolution process and try to reconstruct the pre-merge state.

(my-branch)$ git merge --abort

This command is available in Git 1.7.4 and later.

I need to update the parent commit of my branch

Say I have a main branch, a feature-1 branch branched from main, and a feature-2 branch branched off of feature-1. If I make a commit to feature-1, then the parent commit of feature-2 is no longer accurate (it should be the head of feature-1, since we branched off of it). We can fix this with git rebase --onto.

(feature-2)$ git rebase --onto feature-1 <the first commit in your feature-2 branch that you don't want to bring along> feature-2

This helps in sticky scenarios where you might have a feature built on another feature that hasn't been merged yet, and a bugfix on the feature-1 branch needs to be reflected in your feature-2 branch.
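
The scenario can be replayed in a throwaway repository. In this sketch the file names and commit messages are invented, and old_base holds the commit where feature-2 originally forked from feature-1, which is the commit you don't want to bring along.

```shell
# Throwaway repository; file and commit names are illustrative assumptions.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo base > base.txt && git add base.txt && git commit -q -m "main: base"
git checkout -q -b feature-1
echo one > f1.txt && git add f1.txt && git commit -q -m "feature-1: first"
git checkout -q -b feature-2
echo two > f2.txt && git add f2.txt && git commit -q -m "feature-2: first"

old_base=$(git rev-parse feature-1)      # where feature-2 forked from feature-1

git checkout -q feature-1
echo fix > bugfix.txt && git add bugfix.txt && git commit -q -m "feature-1: bugfix"

# Replay the commits in old_base..feature-2 on top of the new tip of feature-1.
git rebase -q --onto feature-1 "$old_base" feature-2

new_parent=$(git log -1 --format=%s feature-2~1)
echo "feature-2 now sits on: $new_parent"
```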

Check if all commits on a branch are merged

To check if all commits on a branch are merged into another branch, you should diff between the heads (or any commits) of those branches:

(main)$ git log --graph --left-right --cherry-pick --oneline HEAD...feature/120-on-scroll

This will tell you if any commits are in one branch but not the other, and will list the commits that the branches do not share. Another option is to do this:

(main)$ git log main ^feature/120-on-scroll --no-merges
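
If you only need a yes/no answer in a script, one possible approach is git merge-base --is-ancestor, which exits with status 0 when the first commit is reachable from the second. Unlike --cherry-pick, this checks ancestry only and will not recognise commits that were cherry-picked or rebased. The is_merged helper and the branch names in this sketch are hypothetical.

```shell
# Throwaway repository; branch names and the helper are illustrative assumptions.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

git commit -q --allow-empty -m "Base"
git branch merged-feature                # points at a commit already on main
git checkout -q -b unmerged-feature
git commit -q --allow-empty -m "Work not yet on main"
git checkout -q main

# Hypothetical helper: succeeds when branch $1 is fully contained in branch $2.
is_merged() {
    git merge-base --is-ancestor "$1" "$2"
}

if is_merged merged-feature main; then merged_result=yes; else merged_result=no; fi
if is_merged unmerged-feature main; then unmerged_result=yes; else unmerged_result=no; fi
echo "merged-feature: $merged_result, unmerged-feature: $unmerged_result"
```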

Possible issues with interactive rebases

 

The rebase editing screen says 'noop'

If you're seeing this:

noop

That means you are trying to rebase against a branch that is at an identical commit, or is ahead of your current branch. You can try:

  • making sure your main branch is where it should be
  • rebase against HEAD~2 or earlier instead

 

There were conflicts

If you are unable to successfully complete the rebase, you may have to resolve conflicts.

First run git status to see which files have conflicts in them:

(my-branch)$ git status
On branch my-branch
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

  both modified:   README.md

In this example, README.md has conflicts. Open that file and look for the following:

   <<<<<<< HEAD
   some code
   =======
   some code
   >>>>>>> new-commit

You will need to resolve the differences between the code that was added in your new commit (in the example, everything from the middle line to new-commit) and your HEAD.

If you want to keep one branch's version of the code, you can use --ours or --theirs:

(main*)$ git checkout --ours README.md
  • When merging, use --ours to keep changes from the local branch, or --theirs to keep changes from the other branch.
  • When rebasing, use --theirs to keep changes from the local branch, or --ours to keep changes from the other branch. For an explanation of this swap, see this note in the Git documentation.

If the merges are more complicated, you can use a visual diff editor:

(main*)$ git mergetool -t opendiff

After you have resolved all conflicts and tested your code, git add the files you have changed, and then continue the rebase with git rebase --continue

(my-branch)$ git add README.md
(my-branch)$ git rebase --continue

If after resolving all the conflicts you end up with an identical tree to what it was before the commit, you need to git rebase --skip instead.

If at any time you want to stop the entire rebase and go back to the original state of your branch, you can do so:

(my-branch)$ git rebase --abort

 

Stash

Stash all edits

To stash all the edits in your working directory

$ git stash

If you also want to stash untracked files, use -u option.

$ git stash -u

Stash specific files

To stash only one file from your working directory

$ git stash push working-directory-path/filename.ext

To stash multiple files from your working directory

$ git stash push working-directory-path/filename1.ext working-directory-path/filename2.ext

 

Stash with message

$ git stash save <message>

or, since git stash save is deprecated in favor of git stash push:

$ git stash push -m <message>

 

Apply a specific stash from list

First check your list of stashes with message using

$ git stash list

Then apply a specific stash from the list using

$ git stash apply "stash@{n}"

Here, 'n' indicates the position of the stash in the stack. The topmost stash will be position 0.

Furthermore, using a time-based stash reference is also possible.

$ git stash apply "stash@{2.hours.ago}"

 

Stash while keeping unstaged edits

You can manually create a stash commit, and then use git stash store.

$ git stash create
$ git stash store -m <message> CREATED_SHA1
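
A short sketch of how the two commands fit together, run in a throwaway repository with an invented file and message: git stash create prints the hash of a new stash commit without touching your files, and git stash store records that hash in the stash list.

```shell
# Throwaway repository; file name and message are illustrative assumptions.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo v1 > notes.txt
git add notes.txt
git commit -q -m "Add notes"
echo v2 > notes.txt                      # uncommitted edit

sha=$(git stash create)                  # builds a stash commit, changes nothing on disk
git stash store -m "snapshot of notes" "$sha"

working_copy=$(cat notes.txt)            # the edit is still in the working directory
stash_count=$(git stash list | wc -l | tr -d ' ')
git stash list
```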

Finding

I want to find a string in any commit

To find a certain string which was introduced in any commit, you can use the following structure:

$ git log -S "string to find"

Common parameters:

--source shows the ref name given on the command line by which each commit was reached.

--all starts from every branch.

--reverse prints the commits in reverse order, so the first commit that made the change is shown first.
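
Put together, a search might look like the sketch below. It runs in a throwaway repository; the file settings.ini and the commit messages are made up for the example.

```shell
# Throwaway repository; file and commit names are illustrative assumptions.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo 'timeout = 10' > settings.ini
git add settings.ini
git commit -q -m "Add settings"
echo 'retries = 3' >> settings.ini
git commit -q -am "Add retries"
echo 'verbose = true' >> settings.ini
git commit -q -am "Add verbose flag"

# Oldest match first, searching every branch and showing where each commit was found.
git log -S "retries" --source --all --reverse --oneline

# Capture just the subject of the commit that introduced the string.
introduced=$(git log -S "retries" --all --reverse --format=%s | head -n 1)
echo "introduced by: $introduced"
```

Only the commit that changed the number of occurrences of the string is listed; the later commit that merely carries the line along is not.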

 

I want to find by author/committer

To find all commits by author/committer you can use:

$ git log --author=<name or email>
$ git log --committer=<name or email>

Keep in mind that author and committer are not the same. The author is the person who originally wrote the code; the committer is the person who committed the code on behalf of the original author.

I want to list commits containing specific files

To find all commits containing a specific file you can use:

$ git log -- <path to file>

You would usually specify an exact path, but you may also use wild cards in the path and file name:

$ git log -- **/*.js

While using wildcards, it's useful to pass --name-status to see the list of committed files:

$ git log --name-status -- **/*.js

 

I want to view the commit history for a specific function

To trace the evolution of a single function you can use:

$ git log -L :FunctionName:FilePath

Note that you can combine this with further git log options, like revision ranges and commit limits.

Find a tag where a commit is referenced

To find all tags containing a specific commit:

$ git tag --contains <commitid>

Submodules

 

Clone all submodules

$ git clone --recursive https://github.com/foo/bar.git

If already cloned:

$ git submodule update --init --recursive

 

Remove a submodule

Creating a submodule is pretty straight-forward, but deleting them less so. The commands you need are:

$ git submodule deinit submodulename
$ git rm submodulename
$ git rm --cached submodulename
$ rm -rf .git/modules/submodulename

Miscellaneous Objects

Copy a folder or file from one branch to another

$ git checkout <branch-you-want-the-directory-from> -- <folder-name or file-name>

Restore a deleted file

First find the last commit that touched the file, which is the commit that deleted it:

$ git rev-list -n 1 HEAD -- filename

Then check out the file from that commit's parent:

$ git checkout deletingcommitid^ -- filename

Delete tag

$ git tag -d <tag_name>
$ git push <remote> :refs/tags/<tag_name>

 

Recover a deleted tag

If you want to recover a tag that was already deleted, you can do so by following these steps: First, you need to find the unreachable tag:

$ git fsck --unreachable | grep tag

Make a note of the tag's hash. Then, restore the deleted tag with following, making use of git update-ref:

$ git update-ref refs/tags/<tag_name> <hash>

Your tag should now have been restored.

Deleted Patch

If someone has sent you a pull request on GitHub, but then deleted their original fork, you will be unable to clone their repository or to use git am as the .diff, .patch URLs become unavailable. But you can checkout the PR itself using GitHub's special refs. To fetch the content of PR#1 into a new branch called pr_1:

$ git fetch origin refs/pull/1/head:pr_1
From github.com:foo/bar
 * [new ref]         refs/pull/1/head -> pr_1

Exporting a repository as a Zip file

$ git archive --format zip --output /full/path/to/zipfile.zip main

Push a branch and a tag that have the same name

If there is a tag on a remote repository that has the same name as a branch you will get the following error when trying to push that branch with a standard $ git push <remote> <branch> command.

$ git push origin <branch>
error: dst refspec same matches more than one.
error: failed to push some refs to '<git server>'

Fix this by specifying you want to push the head reference.

$ git push origin refs/heads/<branch-name>

If you want to push a tag to a remote repository that has the same name as a branch, you can use a similar command.

$ git push origin refs/tags/<tag-name>

Tracking Files

 

I want to change a file name's capitalization, without changing the contents of the file

(main)$ git mv --force myfile MyFile

I want to overwrite local files when doing a git pull

(main)$ git fetch --all
(main)$ git reset --hard origin/main

 

I want to remove a file from Git but keep the file

(main)$ git rm --cached log.txt

I want to revert a file to a specific revision

Assuming the hash of the commit you want is c5f567:

(main)$ git checkout c5f567 -- file1/to/restore file2/to/restore

If you want to revert to changes made just 1 commit before c5f567, pass the commit hash as c5f567~1:

(main)$ git checkout c5f567~1 -- file1/to/restore file2/to/restore

I want to list changes of a specific file between commits or branches

Assuming you want to compare last commit with file from commit c5f567:

$ git diff HEAD:path_to_file/file c5f567:path_to_file/file

Same goes for branches:

$ git diff main:path_to_file/file staging:path_to_file/file

I want Git to ignore changes to a specific file

This works great for config templates or other files that require locally adding credentials that shouldn't be committed.

$ git update-index --assume-unchanged file-to-ignore

Note that this does not remove the file from source control - it is only ignored locally. To undo this and tell Git to notice changes again, this clears the ignore flag:

$ git update-index --no-assume-unchanged file-to-stop-ignoring

Debugging with Git

The git-bisect command uses a binary search to find which commit in your Git history introduced a bug.

Suppose you're on the main branch, and you want to find the commit that broke some feature. You start bisect:

$ git bisect start

Then you should specify which commit is bad, and which one is known to be good. Assuming that your current version is bad, and v1.1.1 is good:

$ git bisect bad
$ git bisect good v1.1.1

Now git-bisect selects a commit in the middle of the range that you specified, checks it out, and asks you whether it's good or bad. You should see something like:

Bisecting: 5 revisions left to test after this (roughly 5 steps)
[c44abbbee29cb93d8499283101fe7c8d9d97f0fe] Commit message
(c44abbb)$

You will now check if this commit is good or bad. If it's good:

(c44abbb)$ git bisect good

and git-bisect will select another commit from the range for you. This process (selecting good or bad) will repeat until there are no more revisions left to inspect, and the command will finally print a description of the first bad commit.
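
If the good/bad check can be expressed as a command, git bisect run can answer the questions for you: an exit status of 0 marks a commit as good, and most non-zero statuses mark it as bad. The sketch below is an illustration only. It builds a throwaway history in which status.txt switches from pass to fail at "Commit 4", and uses grep as a stand-in for a real test script.

```shell
# Throwaway repository; files, tag and commit names are illustrative assumptions.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

for n in 1 2 3 4 5 6; do
    if [ "$n" -lt 4 ]; then echo pass > status.txt; else echo fail > status.txt; fi
    echo "$n" > version.txt
    git add status.txt version.txt
    git commit -q -m "Commit $n"
done
git tag v1.1.1 HEAD~5                    # the first commit, known to be good

git bisect start HEAD v1.1.1 >/dev/null  # bad commit first, then the good one
git bisect run grep -q pass status.txt >/dev/null

first_bad=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null 2>&1         # return to the branch you started from
echo "first bad commit: $first_bad"
```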

Configuration

I want to add aliases for some Git commands

On OS X and Linux, your git configuration file is stored in ~/.gitconfig. I've added some example aliases I use as shortcuts (and some of my common typos) in the [alias] section as shown below:

[alias]
    a = add
    amend = commit --amend
    c = commit
    ca = commit --amend
    ci = commit -a
    co = checkout
    d = diff
    dc = diff --cached
    ds = diff --staged
    extend = commit --amend -C HEAD
    f = fetch
    loll = log --graph --decorate --pretty=oneline --abbrev-commit
    m = merge
    one = log --pretty=oneline
    outstanding = rebase -i @{u}
    reword = commit --amend --only
    s = status
    unpushed = log @{u}..
    wc = whatchanged
    wip = rebase -i @{u}
    zap = fetch -p
    day = log --reverse --no-merges --branches=* --date=local --since=midnight --author=\"$(git config --get user.name)\"
    delete-merged-branches = "!f() { git checkout --quiet main && git branch --merged | grep --invert-match '\\*' | xargs -n 1 git branch --delete; git checkout --quiet @{-1}; }; f"

I want to add an empty directory to my repository

You can’t! Git doesn’t support this, but there’s a hack. You can create a .gitignore file in the directory with the following contents:

 # Ignore everything in this directory
 *
 # Except this file
 !.gitignore

Another common convention is to make an empty file in the folder, titled .gitkeep.

$ mkdir mydir
$ touch mydir/.gitkeep

You can also name the file just .keep, in which case the second line above would be touch mydir/.keep

I want to cache a username and password for a repository

You might have a repository that requires authentication, in which case you can cache a username and password so you don't have to enter them on every push and pull. A credential helper can do this for you.

$ git config --global credential.helper cache
# Set git to use the credential memory cache
$ git config --global credential.helper 'cache --timeout=3600'
# Set the cache to timeout after 1 hour (setting is in seconds)

To find a credential helper:

$ git help -a | grep credential
# Shows you possible credential helpers

For OS specific credential caching:

$ git config --global credential.helper osxkeychain
# For OSX
$ git config --global credential.helper manager
# Git for Windows 2.7.3+
$ git config --global credential.helper gnome-keyring
# Ubuntu and other GNOME-based distros

More credential helpers can likely be found for different distributions and operating systems.

I want to make Git ignore permissions and filemode changes

$ git config core.fileMode false

If you want to make this the default behaviour for logged-in users, then use:

$ git config --global core.fileMode false

I want to set a global user

To configure user information used across all local repositories, and to set a name that is identifiable for credit when reviewing version history:

$ git config --global user.name "[firstname lastname]"

To set an email address that will be associated with each history marker:

$ git config --global user.email "[valid-email]"

I've no idea what I did wrong

So, you're screwed - you reset something, or you merged the wrong branch, or you force pushed and now you can't find your commits. You know, at some point, you were doing alright, and you want to go back to some state you were at.

This is what git reflog is for. reflog keeps track of any changes to the tip of a branch, even if that tip isn't referenced by a branch or a tag. Basically, every time HEAD changes, a new entry is added to the reflog. This only works for local repositories, sadly, and it only tracks movements (not changes to a file that weren't recorded anywhere, for instance).

(main)$ git reflog
0a2e358 HEAD@{0}: reset: moving to HEAD~2
0254ea7 HEAD@{1}: checkout: moving from 2.2 to main
c10f740 HEAD@{2}: checkout: moving from main to 2.2

The reflog above shows a checkout from main to the 2.2 branch and back. From there, there's a hard reset to an older commit. The latest activity is represented at the top labeled HEAD@{0}.

If it turns out that you accidentally moved back, the reflog will contain the commit main pointed to (0254ea7) before you accidentally dropped 2 commits.

$ git reset --hard 0254ea7

Using git reset it is then possible to change main back to the commit it was before. This provides a safety net in case history was accidentally changed.

(copied and edited from Source).

 

Git Shortcuts

Git Bash

Once you're comfortable with what the above commands are doing, you might want to create some shortcuts for Git Bash. This allows you to work a lot faster by doing complex tasks in really short commands.

alias sq=squash

function squash() {
    git rebase -i HEAD~$1
}

Copy those commands to your .bashrc or .bash_profile.
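
If you want the shortcut to fail early on a missing or mistyped argument, a slightly more defensive variant could look like this. It is only a sketch: the usage message is made up, and the function behaves like the one above whenever it is given a positive number.

```shell
# Defensive variant of the squash shortcut; the usage text is an assumption.
squash() {
    # Reject an empty argument, anything containing a non-digit, and zero.
    case "$1" in
        ''|*[!0-9]*|0)
            echo "usage: squash <number-of-commits>" >&2
            return 1
            ;;
    esac
    git rebase -i "HEAD~$1"
}
```

For example, squash 3 opens the interactive rebase for the last three commits, while squash abc prints the usage line and returns a non-zero status without calling Git.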

PowerShell on Windows

If you are using PowerShell on Windows, you can also set up aliases and functions. Add these commands to your profile, whose path is defined in the $profile variable. Learn more at the About Profiles page on the Microsoft documentation site.

Set-Alias sq Squash-Commits

function Squash-Commits {
  # $args[0] is the first argument passed to the function
  git rebase -i "HEAD~$($args[0])"
}

Other Resources

Books

Tutorials

Scripts and Tools

  • firstaidgit.io A searchable selection of the most frequently asked Git questions
  • git-extra-commands - a collection of useful extra Git scripts
  • git-extras - GIT utilities -- repo summary, repl, changelog population, author commit percentages and more
  • git-fire - git-fire is a Git plugin that helps in the event of an emergency by adding all current files, committing, and pushing to a new branch (to prevent merge conflicts).
  • git-tips - Small Git tips
  • git-town - Generic, high-level Git workflow support! http://www.git-town.com

GUI Clients

  • GitKraken - The downright luxurious Git client, for Windows, Mac & Linux
  • git-cola - another Git client for Windows and OS X
  • GitUp - A newish GUI that has some very opinionated ways of dealing with Git's complications
  • gitx-dev - another graphical Git client for OS X
  • Sourcetree - Simplicity meets power in a beautiful and free Git GUI. For Windows and Mac.
  • Tower - graphical Git client for OS X (paid)
  • tig - terminal text-mode interface for Git
  • Magit - Interface to Git implemented as an Emacs package.
  • GitExtensions - a shell extension, a Visual Studio 2010-2015 plugin and a standalone Git repository tool.
  • Fork - a fast and friendly Git client for Mac (beta)
  • gmaster - a Git client for Windows that has 3-way merge, analyze refactors, semantic diff and merge (beta)
  • gitk - a Git client for linux to allow simple view of repo state.
  • SublimeMerge - Blazing fast, extensible client that provides 3-way merges, powerful search and syntax highlighting, in active development.

Author: K88hudson
Source Code: https://github.com/k88hudson/git-flight-rules 
License: CC-BY-SA-4.0 License


So erstellen Sie einen Fake-News-Detektor in Python

Erkennung gefälschter Nachrichten in Python

Untersuchen des Fake-News-Datensatzes, Durchführen von Datenanalysen wie Wortwolken und Ngrams und Feinabstimmen des BERT-Transformators, um einen Fake-News-Detektor in Python mithilfe der Transformer-Bibliothek zu erstellen.

Fake news is the deliberate spread of false or misleading claims presented as news.

Newspapers, tabloids, and magazines have been supplanted by digital news platforms, blogs, social media feeds, and a wide range of mobile news applications. News organizations have benefited from the growing use of social media and mobile platforms by delivering up-to-the-minute information to their subscribers.

Consumers now have instant access to the latest news. These digital media platforms have gained prominence because they connect easily to the rest of the world, letting users discuss and share ideas and debate topics such as democracy, education, health, research, and history. Fake news on digital platforms is becoming increasingly common and is used for profit, such as political and financial gain.

How Big is this Problem?

Because the internet, social media, and digital platforms are so widely used, anyone can spread inaccurate and biased information. The spread of fake news is almost impossible to prevent. There has been a huge increase in the spread of false news, which is not limited to one sector such as politics but also covers sports, health, history, entertainment, and science and research.

The Solution

It is important to recognize and distinguish false news from accurate news. One method is to have an expert decide and fact-check every piece of information, but this takes time and requires expertise that cannot be shared. Alternatively, we can use machine learning and artificial intelligence tools to automate the identification of fake news.

Online news information includes various kinds of unstructured data (such as documents, videos, and audio), but here we focus on news in text format. With advances in machine learning and natural language processing, we can now detect the misleading and false character of an article or statement.

Many studies and experiments are being conducted to detect fake news across all media.

Our main goals in this tutorial are:

  • Explore and analyze the fake news dataset.
  • Build a classifier that can distinguish fake news as accurately as possible.

Here is the table of contents:

  • Introduction
  • How Big is this Problem?
  • The Solution
  • Data Exploration
    • Distribution of Classes
  • Data Cleaning for Analysis
  • Explorative Data Analysis
    • Single-word Cloud
    • Most Frequent Bigram (Two-word Combination)
    • Most Frequent Trigram (Three-word Combination)
  • Building a Classifier by Fine-tuning BERT
    • Data Preparation
    • Tokenizing the Dataset
    • Loading and Fine-tuning the Model
    • Model Evaluation
  • Appendix: Creating a Submission File for Kaggle
  • Conclusion

Data Exploration

In this work, we used the fake news dataset from Kaggle to classify untrustworthy news articles as fake news. We have a complete training dataset containing the following attributes:

  • id: unique ID for a news article
  • title: title of a news article
  • author: author of the news article
  • text: text of the article; could be incomplete
  • label: a label that marks the article as potentially unreliable, denoted by 1 (unreliable or fake) or 0 (reliable).

It is a binary classification problem in which we have to predict whether a given news story is reliable or not.

If you have a Kaggle account, you can simply download the dataset from the website there and extract the ZIP file.

I also uploaded the dataset to Google Drive, and you can get it here or use the gdown library to download it automatically in Google Colab or Jupyter notebooks:

$ pip install gdown
# download from Google Drive
$ gdown "https://drive.google.com/uc?id=178f_VkNxccNidap-5-uffXUW475pAuPy&confirm=t"
Downloading...
From: https://drive.google.com/uc?id=178f_VkNxccNidap-5-uffXUW475pAuPy&confirm=t
To: /content/fake-news.zip
100% 48.7M/48.7M [00:00<00:00, 74.6MB/s]

Unzipping the files:

$ unzip fake-news.zip

Three files will appear in the current working directory: train.csv, test.csv, and submit.csv. We will use train.csv for most of the tutorial.

Installing the required dependencies:

$ pip install transformers nltk pandas numpy matplotlib seaborn wordcloud

Note: If you are in a local environment, make sure you install PyTorch with GPU support; go to this page for proper installation.

Let's import the essential libraries for analysis:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

The NLTK corpora and modules must be installed using the standard NLTK downloader:

import nltk
nltk.download('stopwords')
nltk.download('wordnet')

The fake news dataset comprises original and fictitious article titles and texts by various authors. Let's import our dataset:

# load the dataset
news_d = pd.read_csv("train.csv")
print("Shape of News data:", news_d.shape)
print("News data columns", news_d.columns)

Output:

 Shape of News data: (20800, 5)
 News data columns Index(['id', 'title', 'author', 'text', 'label'], dtype='object')

Here is what the dataset looks like:

# by using df.head(), we can immediately familiarize ourselves with the dataset. 
news_d.head()

Output:

id	title	author	text	label
0	0	House Dem Aide: We Didn’t Even See Comey’s Let...	Darrell Lucus	House Dem Aide: We Didn’t Even See Comey’s Let...	1
1	1	FLYNN: Hillary Clinton, Big Woman on Campus - ...	Daniel J. Flynn	Ever get the feeling your life circles the rou...	0
2	2	Why the Truth Might Get You Fired	Consortiumnews.com	Why the Truth Might Get You Fired October 29, ...	1
3	3	15 Civilians Killed In Single US Airstrike Hav...	Jessica Purkiss	Videos 15 Civilians Killed In Single US Airstr...	1
4	4	Iranian woman jailed for fictional unpublished...	Howard Portnoy	Print \nAn Iranian woman has been sentenced to...	1

We have 20,800 rows with five columns. Let's look at some statistics of the text column:

# Text word statistics: min, mean, max and interquartile range

txt_length = news_d.text.str.split().str.len()
txt_length.describe()

Output:

count    20761.000000
mean       760.308126
std        869.525988
min          0.000000
25%        269.000000
50%        556.000000
75%       1052.000000
max      24234.000000
Name: text, dtype: float64

Statistics for the title column:

#Title statistics 

title_length = news_d.title.str.split().str.len()
title_length.describe()

Output:

count    20242.000000
mean        12.420709
std          4.098735
min          1.000000
25%         10.000000
50%         13.000000
75%         15.000000
max         72.000000
Name: title, dtype: float64

The statistics for the training set are as follows:

  • The text attribute has the higher word count, with an average of 760 words; 75% of the articles have at most about 1,050 words.
  • The title attribute is a short statement with an average of 12 words; 75% of the titles have at most 15 words.

Our experiment will use both the text and the title together.

Distribution of Classes

Count plots for both labels:

sns.countplot(x="label", data=news_d);
print("1: Unreliable")
print("0: Reliable")
print("Distribution of labels:")
print(news_d.label.value_counts());

Output:

1: Unreliable
0: Reliable
Distribution of labels:
1    10413
0    10387
Name: label, dtype: int64

Distribution of labels

print(round(news_d.label.value_counts(normalize=True),2)*100);

Output:

1    50.0
0    50.0
Name: label, dtype: float64

The number of untrustworthy articles (fake, label 1) is 10,413, while the number of trustworthy articles (reliable, label 0) is 10,387. Almost exactly 50% of the articles are fake. Because the classes are balanced, accuracy is a suitable metric for measuring how well our classifier performs.
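
Because the two classes are almost the same size, a trivial model that always predicts the more frequent class would reach only about 50% accuracy, so any accuracy well above that reflects real signal. A minimal sketch of that baseline, using only the label counts printed above (the variable names are illustrative):

```python
# Label counts taken from the value_counts() output above
label_counts = {1: 10413, 0: 10387}

total = sum(label_counts.values())
# share of each class in the dataset
shares = {label: count / total for label, count in label_counts.items()}
# accuracy of a model that always predicts the most frequent class
majority_baseline = max(label_counts.values()) / total

print(total)                        # 20800
print(round(majority_baseline, 4))  # 0.5006
```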

Data Cleaning for Analysis

In this section, we will clean our dataset to do some analysis:

  • Drop unused rows and columns.
  • Perform null value imputation.
  • Remove special characters.
  • Remove stop words.

# Constants that are used to sanitize the datasets

column_n = ['id', 'title', 'author', 'text', 'label']
remove_c = ['id','author']
categorical_features = []
target_col = ['label']
text_f = ['title', 'text']
# Clean Datasets
import nltk
from nltk.corpus import stopwords
import re
from nltk.stem.porter import PorterStemmer
from collections import Counter

ps = PorterStemmer()
wnl = nltk.stem.WordNetLemmatizer()

stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)

# Remove unused columns
def remove_unused_c(df,column_n=remove_c):
    df = df.drop(column_n,axis=1)
    return df

# Impute null values with None
def null_process(feature_df):
    for col in text_f:
        feature_df.loc[feature_df[col].isnull(), col] = "None"
    return feature_df

def clean_dataset(df):
    # remove unused column
    df = remove_unused_c(df)
    #impute null values
    df = null_process(df)
    return df

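# Cleaning text from unused characters
# NOTE: str.replace() treats its first argument as literal text, not as a
# regular expression, so the four replace() calls below only match those exact
# character sequences and leave ordinary article text unchanged. In effect,
# this function lowercases and strips the text; punctuation is removed later
# by re.sub() in nltk_preprocess().
def clean_text(text):
    text = str(text).replace(r'http[\w:/\.]+', ' ')  # intended: remove URLs
    text = str(text).replace(r'[^\.\w\s]', ' ')  # intended: keep only word characters, whitespace and periods
    text = str(text).replace('[^a-zA-Z]', ' ')  # intended: keep only letters
    text = str(text).replace(r'\s\s+', ' ')  # intended: collapse repeated whitespace
    text = text.lower().strip()
    return text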

## NLTK preprocessing options include:
# stop word removal, stemming and lemmatization
# For our project we remove stop words and lemmatize the remaining words
def nltk_preprocess(text):
    text = clean_text(text)
    wordlist = re.sub(r'[^\w\s]', '', text).split()
    #text = ' '.join([word for word in wordlist if word not in stopwords_dict])
    #text = [ps.stem(word) for word in wordlist if not word in stopwords_dict]
    text = ' '.join([wnl.lemmatize(word) for word in wordlist if word not in stopwords_dict])
    return  text

In the code block above:

  • We imported NLTK, a well-known platform for developing Python applications that work with human language. Next, we import re for regex.
  • We import stop words from nltk.corpus. When working with words, particularly when considering semantics, we sometimes need to eliminate common words that add no significant meaning to a statement, such as "but", "can", "we", etc.
  • PorterStemmer is used to perform stemming with NLTK. Stemmers strip words of their morphological affixes, leaving only the word stem.
  • We import WordNetLemmatizer() from the NLTK library for lemmatization. Lemmatization is generally more accurate than stemming. It goes beyond word reduction and draws on a language's vocabulary to apply morphological analysis to words, aiming to remove only inflectional endings and return the base or dictionary form of a word, known as the lemma.
  • stopwords.words('english') returns the list of all English stop words supported by NLTK.
  • The remove_unused_c() function is used to remove the unused columns.
  • We impute null values with the string "None" using the null_process() function.
  • Inside the clean_dataset() function, we call the remove_unused_c() and null_process() functions. This function is responsible for data cleaning.
  • To clean text of unused characters, we created the clean_text() function.
  • For preprocessing, we remove stop words and lemmatize the remaining words. We created the nltk_preprocess() function for this purpose.
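
Python's built-in str.replace() matches its first argument literally, so regex patterns such as the ones in clean_text() only take effect when passed to re.sub(). A minimal sketch of a regex-based variant follows; the name clean_text_regex and the sample sentence are illustrative, and the letters-only pattern is left out so that digits survive. Applying it to the dataset may produce different tokens than the output shown in this tutorial.

```python
import re

def clean_text_regex(text):
    """Regex-based cleaning sketch (hypothetical helper, not the tutorial's function)."""
    text = re.sub(r'http[\w:/\.]+', ' ', str(text))  # remove URLs
    text = re.sub(r'[^\.\w\s]', ' ', text)           # keep word characters, whitespace and periods
    text = re.sub(r'\s\s+', ' ', text)               # collapse repeated whitespace
    return text.lower().strip()

sample = "Read  more at http://example.com/news, NOW!"

# str.replace() looks for the literal pattern text, which is not in the sample
literal = sample.replace(r'http[\w:/\.]+', ' ')
print(literal == sample)         # True: nothing was replaced

print(clean_text_regex(sample))  # read more at now
```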

Preprocessing the text and title:

# Perform data cleaning on train and test dataset by calling clean_dataset function
df = clean_dataset(news_d)
# apply preprocessing on text through apply method by calling the function nltk_preprocess
df["text"] = df.text.apply(nltk_preprocess)
# apply preprocessing on title through apply method by calling the function nltk_preprocess
df["title"] = df.title.apply(nltk_preprocess)
# Dataset after cleaning and preprocessing step
df.head()

Output:

title	text	label
0	house dem aide didnt even see comeys letter ja...	house dem aide didnt even see comeys letter ja...	1
1	flynn hillary clinton big woman campus breitbart	ever get feeling life circle roundabout rather...	0
2	truth might get fired	truth might get fired october 29 2016 tension ...	1
3	15 civilian killed single u airstrike identified	video 15 civilian killed single u airstrike id...	1
4	iranian woman jailed fictional unpublished sto...	print iranian woman sentenced six year prison ...	1

Explorative Data Analysis

In this section, we will perform:

  • Univariate analysis: a statistical analysis of the text. We will use a word cloud for this purpose. A word cloud is a visualization approach for text data in which the most frequent term is shown in the largest font size.
  • Bivariate analysis: bigrams and trigrams are used here. According to Wikipedia: "An n-gram is a contiguous sequence of n items from a given sample of text or speech. Depending on the application, the items can be phonemes, syllables, letters, words, or base pairs. The n-grams are typically collected from a text or speech corpus."

Single-word Cloud

The most frequent words appear in bold and in a larger font in a word cloud. This section builds a word cloud for all words in the dataset.

The WordCloud class from the wordcloud library is used, and its generate() method is used to generate the word cloud image:

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# initialize the word cloud
wordcloud = WordCloud( background_color='black', width=800, height=600)
# generate the word cloud by passing the corpus
text_cloud = wordcloud.generate(' '.join(df['text']))
# plotting the word cloud
plt.figure(figsize=(20,30))
plt.imshow(text_cloud)
plt.axis('off')
plt.show()

Output:

Word cloud for the whole fake news data

Word cloud for reliable news only:

true_n = ' '.join(df[df['label']==0]['text']) 
wc = wordcloud.generate(true_n)
plt.figure(figsize=(20,30))
plt.imshow(wc)
plt.axis('off')
plt.show()

Output:

Word cloud for reliable news

Word cloud for fake news only:

fake_n = ' '.join(df[df['label']==1]['text'])
wc= wordcloud.generate(fake_n)
plt.figure(figsize=(20,30))
plt.imshow(wc)
plt.axis('off')
plt.show()

Output:

Word cloud for fake news

Most Frequent Bigram (Two-word Combination)

An n-gram is a sequence of letters or words. A character unigram is made up of a single character, while a character bigram comprises a series of two characters. Similarly, word n-grams are made up of a series of n words. The word "united" is a 1-gram (unigram). The combination of the words "united state" is a 2-gram (bigram), and "new york city" is a 3-gram.
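
To make the definition concrete, here is a minimal sketch that builds word n-grams with the standard library only and counts them. The helper name word_ngrams and the toy sentence are illustrative; the tutorial itself uses nltk.ngrams for the same purpose.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return all runs of n consecutive words as tuples."""
    words = text.split()
    # zip the word list against copies of itself shifted by 1, 2, ..., n-1 positions
    return list(zip(*(words[i:] for i in range(n))))

corpus = "new york city is in new york state"

bigrams = Counter(word_ngrams(corpus, 2))
trigrams = word_ngrams(corpus, 3)

print(bigrams.most_common(1))  # [(('new', 'york'), 2)]
print(trigrams[0])             # ('new', 'york', 'city')
```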

Let's plot the most frequent bigrams in the reliable news:

def plot_top_ngrams(corpus, title, ylabel, xlabel="Number of Occurrences", n=2):
  """Utility function to plot the 20 most frequent n-grams of a corpus"""
  true_b = (pd.Series(nltk.ngrams(corpus.split(), n)).value_counts())[:20]
  true_b.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
  plt.title(title)
  plt.ylabel(ylabel)
  plt.xlabel(xlabel)
  plt.show()

plot_top_ngrams(true_n, 'Top 20 Frequently Occurring True news Bigrams', "Bigram", n=2)

Top bigrams in reliable news

The most frequent bigrams in the fake news:

plot_top_ngrams(fake_n, 'Top 20 Frequently Occurring Fake news Bigrams', "Bigram", n=2)

Top bigrams in fake news

Most Frequent Trigram (Three-word Combination)

The most frequent trigrams in reliable news:

plot_top_ngrams(true_n, 'Top 20 Frequently Occurring True news Trigrams', "Trigrams", n=3)

The most frequent trigrams in reliable news

For fake news now:

plot_top_ngrams(fake_n, 'Top 20 Frequently Occurring Fake news Trigrams', "Trigrams", n=3)

The most frequent trigrams in fake news

The plots above give us some idea of what both classes look like. In the next section, we will use the Transformers library to build a fake news detector.

Building a Classifier by Fine-tuning BERT

This section draws extensively on code from the fine-tuning BERT tutorial to build a fake news classifier using the Transformers library. For more detailed information, you can go to the original tutorial.

If you haven't installed transformers, you have to:

$ pip install transformers

Let's import the necessary libraries:

import torch
from transformers.file_utils import is_tf_available, is_torch_available, is_torch_tpu_available
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import numpy as np
from sklearn.model_selection import train_test_split

import random

We want to make our results reproducible even if we restart our environment:

def set_seed(seed: int):
    """
    Helper function for reproducible behavior to set the seed in ``random``, ``numpy``, ``torch`` and/or ``tf`` (if
    installed).

    Args:
        seed (:obj:`int`): The seed to set.
    """
    random.seed(seed)
    np.random.seed(seed)
    if is_torch_available():
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # ^^ safe to call this function even if cuda is not available
    if is_tf_available():
        import tensorflow as tf

        tf.random.set_seed(seed)

set_seed(1)
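
The effect of seeding can be seen with Python's own random module: re-seeding with the same value replays the same pseudo-random sequence. This is a minimal sketch of the idea only; the helper name draw is illustrative, and set_seed() above applies the same principle to NumPy, PyTorch and TensorFlow.

```python
import random

def draw(seed, k=3):
    """Seed Python's generator, then draw k pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(k)]

first = draw(1)
second = draw(1)
print(first == second)  # True: same seed, same sequence
```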

The model we're going to use is bert-base-uncased:

# the model we gonna train, base uncased BERT
# check text classification models here: https://huggingface.co/models?filter=text-classification
model_name = "bert-base-uncased"
# max sequence length for each document/sentence sample
max_length = 512

Loading the tokenizer:

# load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)

Data Preparation

Let's now clean NaN values from the text, author, and title columns:

news_df = news_d[news_d['text'].notna()]
news_df = news_df[news_df["author"].notna()]
news_df = news_df[news_df["title"].notna()]

Next, make a function that takes the dataset as a Pandas dataframe and returns the train/validation splits of texts and labels as lists:

def prepare_data(df, test_size=0.2, include_title=True, include_author=True):
  texts = []
  labels = []
  for i in range(len(df)):
    text = df["text"].iloc[i]
    label = df["label"].iloc[i]
    if include_title:
      text = df["title"].iloc[i] + " - " + text
    if include_author:
      text = df["author"].iloc[i] + " : " + text
    if text and label in [0, 1]:
      texts.append(text)
      labels.append(label)
  return train_test_split(texts, labels, test_size=test_size)

train_texts, valid_texts, train_labels, valid_labels = prepare_data(news_df)

The function above takes the dataset as a dataframe and returns it as lists split into training and validation sets. Setting include_title to True means that we add the title column to the text we will use for training; setting include_author to True means we also add the author column to the text.
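
The string that reaches the tokenizer therefore has the layout "author : title - text". A minimal sketch of that composition for a single article; the helper name compose_input and the sample values are illustrative and not part of the tutorial code:

```python
def compose_input(text, title=None, author=None):
    """Mirror the string layout built in prepare_data: 'author : title - text'."""
    if title is not None:
        text = title + " - " + text
    if author is not None:
        text = author + " : " + text
    return text

print(compose_input("Body of the article.", title="Some Headline", author="Jane Doe"))
# Jane Doe : Some Headline - Body of the article.
```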

Let's make sure the labels and texts have the same length:

print(len(train_texts), len(train_labels))
print(len(valid_texts), len(valid_labels))

Output:

14628 14628
3657 3657

Tokenizing the Dataset

Let's use BERT's tokenizer to tokenize our dataset:

# tokenize the dataset, truncate when passed `max_length`, 
# and pad with 0's when less than `max_length`
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)
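
What truncation and padding do to a batch can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: it ignores special tokens, the token IDs are made up, and the helper name truncate_and_pad is hypothetical. It assumes a padding ID of 0, as the comment in the code above states.

```python
def truncate_and_pad(batch, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad all to the longest one left."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return input_ids, attention_mask

# Hypothetical token IDs, not real BERT vocabulary entries
batch = [[5, 6, 7, 8, 9, 10], [11, 12], [13, 14, 15]]
ids, mask = truncate_and_pad(batch, max_length=4)
print(ids)   # [[5, 6, 7, 8], [11, 12, 0, 0], [13, 14, 15, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0], [1, 1, 1, 0]]
```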

Converting the encodings into a PyTorch dataset:

class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)

# convert our tokenized data into a torch Dataset
train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)

Loading and Fine-tuning the Model

We will use BertForSequenceClassification to load our BERT transformer model:

# load the model
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

We set num_labels to 2 since it's a binary classification. The function below is a callback to calculate the accuracy on each validation step:

from sklearn.metrics import accuracy_score

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  # calculate accuracy using sklearn's function
  acc = accuracy_score(labels, preds)
  return {
      'accuracy': acc,
  }
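
The same computation can be traced by hand on a few rows: take the index of the largest score per row (argmax) and compare it with the true label. A minimal sketch in plain Python; the logits and labels are made-up values and the helper names are illustrative:

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

# Hypothetical logits for four articles: [score for class 0, score for class 1]
logits = [[2.1, -1.3], [-0.4, 1.7], [0.2, 0.1], [-2.0, 3.0]]
labels = [0, 1, 1, 1]
print(accuracy(logits, labels))  # 0.75
```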

Let's initialize the training parameters:

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=10,  # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=100,                # number of warmup steps for learning rate scheduler
    logging_dir='./logs',            # directory for storing logs
    load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
    # but you can specify `metric_for_best_model` argument to change to accuracy or other metric
    logging_steps=200,               # log & save weights each logging_steps
    save_steps=200,
    evaluation_strategy="steps",     # evaluate each `logging_steps`
)

I've set per_device_train_batch_size to 10, but you should set it as high as your GPU can fit. Setting logging_steps and save_steps to 200 means we will evaluate and save the model weights every 200 training steps.

You can check this page for more detailed information about the available training parameters.

Let's instantiate the trainer:

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
)

Training the model:

# train the model
trainer.train()

Training takes anywhere from under an hour to a few hours, depending on your GPU. If you're using the free version of Colab, it should take about an hour with an NVIDIA Tesla K80. Here is the output:

***** Running training *****
  Num examples = 14628
  Num Epochs = 1
  Instantaneous batch size per device = 10
  Total train batch size (w. parallel, distributed & accumulation) = 10
  Gradient Accumulation steps = 1
  Total optimization steps = 1463
 [1463/1463 41:07, Epoch 1/1]
Step	Training Loss	Validation Loss	Accuracy
200		0.250800		0.100533		0.983867
400		0.027600		0.043009		0.993437
600		0.023400		0.017812		0.997539
800		0.014900		0.030269		0.994258
1000	0.022400		0.012961		0.998086
1200	0.009800		0.010561		0.998633
1400	0.007700		0.010300		0.998633
***** Running Evaluation *****
  Num examples = 3657
  Batch size = 20
Saving model checkpoint to ./results/checkpoint-200
Configuration saved in ./results/checkpoint-200/config.json
Model weights saved in ./results/checkpoint-200/pytorch_model.bin
<SNIPPED>
***** Running Evaluation *****
  Num examples = 3657
  Batch size = 20
Saving model checkpoint to ./results/checkpoint-1400
Configuration saved in ./results/checkpoint-1400/config.json
Model weights saved in ./results/checkpoint-1400/pytorch_model.bin

Training completed. Do not forget to share your model on huggingface.co/models =)

Loading best model from ./results/checkpoint-1400 (score: 0.010299865156412125).
TrainOutput(global_step=1463, training_loss=0.04888018785440506, metrics={'train_runtime': 2469.1722, 'train_samples_per_second': 5.924, 'train_steps_per_second': 0.593, 'total_flos': 3848788517806080.0, 'train_loss': 0.04888018785440506, 'epoch': 1.0})
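
The step counts in the log above follow directly from the dataset sizes and batch sizes. A short worked calculation, assuming a single device and no gradient accumulation (as the log reports); the variable names are illustrative:

```python
import math

train_examples = 14628   # size of the training split
valid_examples = 3657    # size of the validation split
train_batch_size = 10    # per_device_train_batch_size
eval_batch_size = 20     # per_device_eval_batch_size
eval_every = 200         # logging_steps / save_steps

# one optimization step per training batch, last batch may be partial
steps_per_epoch = math.ceil(train_examples / train_batch_size)
# steps at which evaluation and checkpointing happen
eval_steps = list(range(eval_every, steps_per_epoch + 1, eval_every))
# number of batches in one pass over the validation set
eval_batches = math.ceil(valid_examples / eval_batch_size)

print(steps_per_epoch)  # 1463
print(eval_steps)       # [200, 400, 600, 800, 1000, 1200, 1400]
print(eval_batches)     # 183
```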

Model Evaluation

Since load_best_model_at_end is set to True, the best weights will be loaded when training completes. Let's evaluate it on our validation set:

# evaluate the current model after training
trainer.evaluate()

Output:

***** Running Evaluation *****
  Num examples = 3657
  Batch size = 20
 [183/183 02:11]
{'epoch': 1.0,
 'eval_accuracy': 0.998632759092152,
 'eval_loss': 0.010299865156412125,
 'eval_runtime': 132.0374,
 'eval_samples_per_second': 27.697,
 'eval_steps_per_second': 1.386}

Saving the model and the tokenizer:

# saving the fine tuned model & tokenizer
model_path = "fake-news-bert-base-uncased"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

After running the above cell, a new folder containing the model configuration and weights will appear. If you want to perform a prediction, simply use the from_pretrained() method we used when loading the model, and you're good to go.

Next, let's make a function that accepts the article text as an argument and returns whether it's fake or not:

def get_prediction(text, convert_to_label=False):
    # prepare our text into tokenized sequence
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to("cuda")
    # perform inference to our model
    outputs = model(**inputs)
    # get output probabilities by doing softmax
    probs = outputs[0].softmax(1)
    # executing argmax function to get the candidate label
    d = {
        0: "reliable",
        1: "fake"
    }
    if convert_to_label:
      return d[int(probs.argmax())]
    else:
      return int(probs.argmax())
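
The last two steps of get_prediction(), softmax followed by argmax, can be illustrated without a model. A minimal sketch in plain Python, where the logits are made-up values and the softmax helper is written out by hand rather than taken from PyTorch:

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # shift for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "reliable", 1: "fake"}  # same mapping as in get_prediction

logits = [0.5, 2.5]                    # hypothetical model output for one article
probs = softmax(logits)
pred = probs.index(max(probs))         # argmax: index of the most probable class
print(id2label[pred])                  # fake
```

Since softmax preserves the ordering of the scores, taking argmax of the raw logits would select the same class; the probabilities are useful when a confidence value is wanted as well.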

I've taken an example from test.csv that the model never saw during training to perform inference. I checked it, and it is a real article from The New York Times:

real_news = """
Tim Tebow Will Attempt Another Comeback, This Time in Baseball - The New York Times",Daniel Victor,"If at first you don’t succeed, try a different sport. Tim Tebow, who was a Heisman   quarterback at the University of Florida but was unable to hold an N. F. L. job, is pursuing a career in Major League Baseball. <SNIPPED>
"""

The original text is in the Colab environment if you want to copy it, as it's a complete article. Let's pass it to the model and see the results:

get_prediction(real_news, convert_to_label=True)

Output:

reliable

Appendix: Creating a Submission File for Kaggle

In this section, we will predict all the articles in test.csv to create a submission file and see our accuracy on the test set of the Kaggle competition:

# read the test set
test_df = pd.read_csv("test.csv")
# make a copy of the testing set
new_df = test_df.copy()
# add a new column that contains the author, title and article content
new_df["new_text"] = new_df["author"].astype(str) + " : " + new_df["title"].astype(str) + " - " + new_df["text"].astype(str)
# get the prediction of all the test set
new_df["label"] = new_df["new_text"].apply(get_prediction)
# make the submission file
final_df = new_df[["id", "label"]]
final_df.to_csv("submit_final.csv", index=False)
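
The resulting file has just two columns, id and label, and no index column (which is what index=False ensures). A minimal sketch of that layout using the standard csv module and an in-memory buffer; the ids and labels are made-up values standing in for real predictions:

```python
import csv
import io

# Hypothetical (id, label) pairs standing in for the model's predictions
predictions = [(20800, 0), (20801, 1), (20802, 1)]

buffer = io.StringIO()  # in-memory stand-in for submit_final.csv
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "label"])  # header row
writer.writerows(predictions)     # one row per article

print(buffer.getvalue())
```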

After concatenating the author, title, and article text, we apply the get_prediction() function to the new column to fill the label column, then use the to_csv() method to create the submission file for Kaggle. Here's my submission score:

Submission score

We got 99.78% and 100% accuracy on the private and public leaderboards. That's awesome!

Conclusion

Alright, we're done with the tutorial. You can check this page to see the various training parameters you can tweak.

If you have a custom fake news dataset for fine-tuning, you simply have to pass a list of samples to the tokenizer as we did; you won't need to change any other code after that.

Check the complete code here, or the Colab environment here.

Royce Reinger

1649392464

Flight Rules for Git: Guide About What to Do When Things Go Wrong

Flight rules for Git

What are "flight rules"?

A guide for astronauts (now, programmers using Git) about what to do when things go wrong.

Flight Rules are the hard-earned body of knowledge recorded in manuals that list, step-by-step, what to do if X occurs, and why. Essentially, they are extremely detailed, scenario-specific standard operating procedures. [...]

NASA has been capturing our missteps, disasters and solutions since the early 1960s, when Mercury-era ground teams first started gathering "lessons learned" into a compendium that now lists thousands of problematic situations, from engine failure to busted hatch handles to computer glitches, and their solutions.

— Chris Hadfield, An Astronaut's Guide to Life on Earth.

Conventions for this document

For clarity's sake all examples in this document use a customized bash prompt in order to indicate the current branch and whether or not there are staged changes. The branch is enclosed in parentheses, and a * next to the branch name indicates staged changes.

All commands should work for at least git version 2.13.0. See the git website to update your local git version. 

Repositories

I want to start a local repository

To initialize an existing directory as a Git repository:

(my-folder) $ git init

I want to clone a remote repository

To clone (copy) a remote repository, copy the URL for the repository, and run:

$ git clone [url]

This will save it to a folder named the same as the remote repository's. Make sure you have a connection to the remote server you are cloning from (for most purposes this means making sure you are connected to the internet).

To clone it into a folder with a different name than the default repository name:

$ git clone [url] name-of-new-folder

I set the wrong remote repository

There are a few possible problems here:

If you cloned the wrong repository, simply delete the directory created after running git clone and clone the correct repository.

If you set the wrong repository as the origin of an existing local repository, change the URL of your origin by running:

$ git remote set-url origin [url of the actual repo]

For more, see this StackOverflow topic.

I want to add code to someone else's repository

Git doesn't allow you to add code to someone else's repository without access rights. Neither does GitHub, which is not the same as Git, but rather a hosted service for Git repositories. However, you can suggest code using patches, or, on GitHub, forks and pull requests.

First, a bit about forking. A fork is a copy of a repository. It is not a git operation, but is a common action on GitHub, Bitbucket, GitLab — or anywhere people host Git repositories. You can fork a repository through the hosted UI.

Suggesting code via pull requests

After you've forked a repository, you normally need to clone the repository to your machine. You can do some small edits on GitHub, for instance, without cloning, but this isn't a github-flight-rules list, so let's go with how to do this locally.

# if you are using ssh
$ git clone git@github.com:k88hudson/git-flight-rules.git

# if you are using https
$ git clone https://github.com/k88hudson/git-flight-rules.git

If you cd into the resulting directory, and type git remote, you'll see a list of the remotes. Normally there will be one remote - origin - which will point to k88hudson/git-flight-rules. In this case, we also want a remote that will point to your fork.

First, to follow a Git convention, we normally use the remote name origin for your own repository and upstream for whatever you've forked. So, rename the origin remote to upstream

$ git remote rename origin upstream

You can also do this using git remote set-url, but that takes more steps.

Then, set up a new remote that points to your project.

$ git remote add origin git@github.com:YourName/git-flight-rules.git

Note that now you have two remotes.

  • origin references your own repository.
  • upstream references the original one.

From origin, you can read and write. From upstream, you can only read.

When you've finished making whatever changes you like, push your changes (normally in a branch) to the remote named origin. If you're on a branch, you could use --set-upstream to avoid specifying the remote tracking branch on every future push using this branch. For instance:

(feature/my-feature)$ git push --set-upstream origin feature/my-feature

Git itself has no command for opening a pull request (although there are tools, like hub, which will do this for you). So, if you're ready to make a pull request, go to your fork on GitHub (or another Git host) and create a new pull request. Note that your host automatically links the original and forked repositories.

After all of this, do not forget to respond to any code review feedback.

Suggesting code via patches

Another approach to suggesting code changes that doesn't rely on third-party sites such as GitHub is to use git format-patch.

format-patch creates a .patch file for one or more commits. This file is essentially a list of changes that looks similar to the commit diffs you can view on GitHub.

A patch can be viewed and even edited by the recipient and applied using git am.

For example, to create a patch based on the previous commit you would run git format-patch HEAD^ which would create a .patch file called something like 0001-My-Commit-Message.patch.

To apply this patch file to your repository you would run git am ./0001-My-Commit-Message.patch.
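
The round trip can be tried out without any hosting service. The following is a minimal sketch, assuming two throwaway repositories in a temporary directory; the directory names (author, recipient, patches), the file notes.txt and the identities are made up for the example.

```shell
set -e
work=$(mktemp -d)

# "author" repository with a first commit
git init -q "$work/author" && cd "$work/author"
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author" && git config user.email "author@example.com"
echo "one" > notes.txt && git add notes.txt && git commit -q -m "Add notes"

# The recipient clones before the second commit exists
git clone -q "$work/author" "$work/recipient"
echo "two" >> notes.txt && git commit -q -am "Extend notes"

# Write a patch file for the most recent commit into $work/patches
git format-patch -q -o "$work/patches" HEAD^

# The recipient applies the patch as a regular commit
cd "$work/recipient"
git config user.name "Example Recipient" && git config user.email "recipient@example.com"
git am -q "$work/patches"/*.patch
applied_subject=$(git log -1 --format=%s)
echo "$applied_subject"
```

Because git am recreates the commit, the author name and commit message travel with the patch, while the recipient is recorded as the committer.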

Patches can also be sent via email using the git send-email command. For information on usage and configuration see: https://git-send-email.io

I need to update my fork with latest updates from the original repository

After a while, the upstream repository may have been updated, and these updates need to be pulled into your origin repo. Remember that like you, other people are contributing too. Suppose that you are in your own feature branch and you need to update it with the original repository updates.

You probably have set up a remote that points to the original project. If not, do this now. Generally we use upstream as a remote name:

(main)$ git remote add upstream <link-to-original-repository>
# (main)$ git remote add upstream git@github.com:k88hudson/git-flight-rules.git

Now you can fetch from upstream and get the latest updates.

(main)$ git fetch upstream
(main)$ git merge upstream/main

# or using a single command
(main)$ git pull upstream main
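
To see the whole sequence in one place, here is a rough sketch that uses local directories as stand-ins for the hosted repositories. The paths (original, fork.git, local) and the identity are assumptions for illustration only; with a real host you would use URLs instead.

```shell
set -e
work=$(mktemp -d)

# Stand-in for the original project (normally a URL on your Git host)
git init -q "$work/original" && cd "$work/original"
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Maintainer" && git config user.email "maintainer@example.com"
echo "v1" > README && git add README && git commit -q -m "Initial commit"

# Stand-ins for your fork on the host and for your local clone of it
git clone -q --bare "$work/original" "$work/fork.git"
git clone -q "$work/fork.git" "$work/local"

# The original project gains a commit after you forked it
echo "v2" > README && git commit -q -am "Upstream update"

# In your clone: register the original as upstream and bring main up to date
cd "$work/local"
git remote add upstream "$work/original"
git fetch -q upstream
git merge -q --ff-only upstream/main
synced_subject=$(git log -1 --format=%s)
echo "$synced_subject"
```

The sketch uses --ff-only so that the merge refuses to proceed if your local main has diverged; that flag is a choice made here, not part of the recipe above.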

Editing Commits

 

What did I just commit?

Let's say that you just blindly committed changes with git commit -a and you're not sure what the actual content of the commit you just made was. You can show the latest commit on your current HEAD with:

(main)$ git show

Or

$ git log -n1 -p

If you want to see a file at a specific commit, you can also do this (where <commitid> is the commit you're interested in):

$ git show <commitid>:filename

I wrote the wrong thing in a commit message

If you wrote the wrong thing and the commit has not yet been pushed, you can do the following to change the commit message without altering the content of the commit:

$ git commit --amend --only

This will open your default text editor, where you can edit the message. On the other hand, you can do this all in one command:

$ git commit --amend --only -m 'xxxxxxx'

If you have already pushed the message, you can amend the commit and force push, but this is not recommended.

 

I committed with the wrong name and email configured

If it's a single commit, amend it:

$ git commit --amend --no-edit --author "New Authorname <authoremail@mydomain.com>"

An alternative is to correctly configure your identity with git config --global user.(name|email) and then use

$ git commit --amend --reset-author --no-edit

If you need to change all of history, see the man page for git filter-branch.

I want to remove a file from the previous commit

In order to remove changes for a file from the previous commit, do the following:

$ git checkout HEAD^ myfile
$ git add myfile
$ git commit --amend --no-edit

In case the file was newly added to the commit and you want to remove it (from Git alone), do:

$ git rm --cached myfile
$ git commit --amend --no-edit

This is particularly useful when you have an open patch and you have committed an unnecessary file, and need to force push to update the patch on a remote. The --no-edit option is used to keep the existing commit message.
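
As an illustration of the second recipe, the sketch below builds a throwaway repository, commits a stray file by mistake, and then drops it from that commit while keeping it on disk. The file names (feature.txt, debug.log) and the identity are made up for the example.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q && git symbolic-ref HEAD refs/heads/main
git config user.name "Example User" && git config user.email "user@example.com"
echo "base" > README && git add README && git commit -q -m "Initial commit"

# Commit a feature file together with a log file that should not be tracked
echo "feature" > feature.txt && echo "noise" > debug.log
git add feature.txt debug.log && git commit -q -m "Add feature"

# Remove the log from the index only, then fold that change into the previous commit
git rm -q --cached debug.log
git commit -q --amend --no-edit

tracked=$(git ls-tree -r --name-only HEAD)   # files recorded in the amended commit
echo "$tracked"
```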

 

I want to delete or remove my last commit

If you need to delete pushed commits, you can use the following. However, it will irreversibly change your history, and mess up the history of anyone else who had already pulled from the repository. In short, if you're not sure, you should never do this, ever.

$ git reset HEAD^ --hard
$ git push --force-with-lease [remote] [branch]

If you haven't pushed, to reset Git to the state it was in before you made your last commit (while keeping your staged changes):

(my-branch*)$ git reset --soft HEAD^

This only works if you haven't pushed. If you have pushed, the only truly safe thing to do is git revert SHAofBadCommit. That will create a new commit that undoes all of the bad commit's changes. Or, if the branch you pushed to is rebase-safe (i.e. other devs aren't expected to pull from it), you can just use git push --force-with-lease. For more, see the above section.
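
To make the difference concrete, here is a small sketch of the revert route in a throwaway repository (the file name app.txt and the commit messages are invented for the example). History grows by one commit instead of being rewritten.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q && git symbolic-ref HEAD refs/heads/main
git config user.name "Example User" && git config user.email "user@example.com"
echo "good" > app.txt && git add app.txt && git commit -q -m "Good commit"
echo "bad" >> app.txt && git commit -q -am "Bad commit"
bad_sha=$(git rev-parse HEAD)

# Undo the bad commit by adding a new commit with the inverse change
git revert --no-edit "$bad_sha" > /dev/null

commit_count=$(git rev-list --count HEAD)   # both old commits plus the revert
content=$(cat app.txt)
echo "$commit_count commits, app.txt contains: $content"
```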

 

Delete/remove arbitrary commit

The same warning applies as above. Avoid this whenever possible.

$ git rebase --onto SHA1_OF_BAD_COMMIT^ SHA1_OF_BAD_COMMIT
$ git push --force-with-lease [remote] [branch]

Or do an interactive rebase and remove the line(s) corresponding to commit(s) you want to see removed.

 

I tried to push my amended commit to a remote, but I got an error message

To https://github.com/yourusername/repo.git
! [rejected]        mybranch -> mybranch (non-fast-forward)
error: failed to push some refs to 'https://github.com/tanay1337/webmaker.org.git'
hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. Integrate the remote changes (e.g.
hint: 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

Note that, as with rebasing (see below), amending replaces the old commit with a new one, so you must force push (--force-with-lease) your changes if you have already pushed the pre-amended commit to your remote. Be careful when you do this – always make sure you specify a branch!

(my-branch)$ git push origin mybranch --force-with-lease

In general, avoid force pushing. It is best to create and push a new commit rather than force-pushing the amended commit as it will cause conflicts in the source history for any other developer who has interacted with the branch in question or any child branches. --force-with-lease will still fail, if someone else was also working on the same branch as you, and your push would overwrite those changes.

If you are absolutely sure that nobody is working on the same branch or you want to update the tip of the branch unconditionally, you can use --force (-f), but this should be avoided in general.

 

I accidentally did a hard reset, and I want my changes back

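If you accidentally do git reset --hard, you can normally still get your commit back, as Git records every position of HEAD in the reflog (by default, entries are kept for 90 days, or 30 days for commits that are no longer reachable).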

Note: This is only valid if your work is backed up, i.e., either committed or stashed. git reset --hard will remove uncommitted modifications, so use it with caution. (A safer option is git reset --keep.)

(main)$ git reflog

You'll see a list of your past commits, and a commit for the reset. Choose the SHA of the commit you want to return to, and reset again:

(main)$ git reset --hard SHA1234

And you should be good to go.
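
The recovery can be rehearsed safely in a scratch repository. The sketch below is only an illustration: it uses HEAD@{1} (the reflog entry for where HEAD pointed before the reset) in place of reading the SHA from the git reflog output by hand, and the file name is made up.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q && git symbolic-ref HEAD refs/heads/main
git config user.name "Example User" && git config user.email "user@example.com"
echo "one" > work.txt && git add work.txt && git commit -q -m "First commit"
echo "two" >> work.txt && git commit -q -am "Second commit"
lost_sha=$(git rev-parse HEAD)

git reset -q --hard HEAD^                    # the "accidental" hard reset

# The reflog still remembers the previous position of HEAD
recovered_sha=$(git rev-parse 'HEAD@{1}')
git reset -q --hard "$recovered_sha"
git log --oneline
```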

I accidentally committed and pushed a merge

If you accidentally merged a feature branch to the main development branch before it was ready to be merged, you can still undo the merge. But there's a catch: A merge commit has more than one parent (usually two).

The command to use is:

(feature-branch)$ git revert -m 1 <commit>

where the -m 1 option says to select parent number 1 (the branch into which the merge was made) as the parent to revert to.

Note: the parent number is not a commit identifier. Rather, a merge commit has a line Merge: 8e2ce2d 86ac2e7. The parent number is the 1-based index of the desired parent on this line: the first identifier is number 1, the second is number 2, and so on.
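
Here is a minimal sketch of reverting a merge, assuming a scratch repository with a main and a feature branch (all branch, file and message names are invented). After the revert, the file that came in through the merge is gone while the work done directly on main stays.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q && git symbolic-ref HEAD refs/heads/main
git config user.name "Example User" && git config user.email "user@example.com"
echo "base" > README && git add README && git commit -q -m "Initial commit"

git checkout -q -b feature
echo "feature" > feature.txt && git add feature.txt && git commit -q -m "Feature work"
git checkout -q main
echo "main" > main.txt && git add main.txt && git commit -q -m "Main work"

# The premature merge: parent 1 is main, parent 2 is feature
git merge -q --no-ff -m "Merge feature" feature
parents=$(git show -s --format=%P HEAD)

# Revert the merge relative to parent 1 (the branch that was merged into)
git revert -m 1 --no-edit HEAD > /dev/null
ls
```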

I accidentally committed and pushed files containing sensitive data

If you accidentally pushed files containing sensitive, or private data (passwords, keys, etc.), you can amend the previous commit. Keep in mind that once you have pushed a commit, you should consider any data it contains to be compromised. These steps can remove the sensitive data from your public repo or your local copy, but you cannot remove the sensitive data from other people's pulled copies. If you committed a password, change it immediately. If you committed a key, re-generate it immediately. Amending the pushed commit is not enough, since anyone could have pulled the original commit containing your sensitive data in the meantime.

If you edit the file and remove the sensitive data, then run

(feature-branch)$ git add edited_file
(feature-branch)$ git commit --amend --no-edit
(feature-branch)$ git push --force-with-lease origin [branch]

If you want to remove an entire file (but keep it locally), then run

(feature-branch)$ git rm --cached sensitive_file
(feature-branch)$ echo sensitive_file >> .gitignore
(feature-branch)$ git add .gitignore
(feature-branch)$ git commit --amend --no-edit
(feature-branch)$ git push --force-with-lease origin [branch]

Alternatively, store your sensitive data in local environment variables rather than in tracked files.

If you want to completely remove an entire file (and not keep it locally), then run

(feature-branch)$ git rm sensitive_file
(feature-branch)$ git commit --amend --no-edit
(feature-branch)$ git push --force-with-lease origin [branch]

If you have made other commits in the meantime (i.e. the sensitive data is in a commit before the previous commit), you will have to rebase.

I want to remove a large file from ever existing in repo history

If the file you want to delete is secret or sensitive, instead see how to remove sensitive files.

Even if you delete a large or unwanted file in a recent commit, it still exists in git history, in your repo's .git folder, and will make git clone download unneeded files.

The actions in this part of the guide will require a force push, and rewrite large sections of repo history, so if you are working with remote collaborators, check first that any local work of theirs is pushed.

There are two options for rewriting history: the built-in git-filter-branch or bfg-repo-cleaner. bfg is significantly cleaner and faster, but it is a third-party download and requires Java. We will describe both alternatives. The final step is to force push your changes, which requires special consideration on top of a regular force push, given that a great deal of repo history will have been permanently changed.

Recommended Technique: Use third-party bfg

Using bfg-repo-cleaner requires Java. Download the bfg jar from the link here. Our examples will use bfg.jar, but your download may have a version number, e.g. bfg-1.13.0.jar.

To delete a specific file:

(main)$ git rm path/to/filetoremove
(main)$ git commit -m "Commit removing filetoremove"
(main)$ java -jar ~/Downloads/bfg.jar --delete-files filetoremove

Note that in bfg you must use the plain file name even if it is in a subdirectory.

You can also delete a file by pattern, e.g.:

(main)$ git rm *.jpg
(main)$ git commit -m "Commit removing *.jpg"
(main)$ java -jar ~/Downloads/bfg.jar --delete-files '*.jpg'

With bfg, files that exist in your latest commit are not affected. For example, if your repo contained several large .tga files and an earlier commit deleted a subset of them, this call removes only the deleted ones from history and does not touch the files still present in the latest commit.

Note, if you renamed a file as part of a commit, e.g. if it started as LargeFileFirstName.mp4 and a commit changed it to LargeFileSecondName.mp4, running java -jar ~/Downloads/bfg.jar --delete-files LargeFileSecondName.mp4 will not remove it from git history. Either run the --delete-files command with both filenames, or with a matching pattern.

Built-in Technique: Use git-filter-branch

git-filter-branch is more cumbersome and has fewer features, but you may use it if you cannot install or run bfg.

In the command below, replace filepattern with a specific name or a pattern, e.g. *.jpg. This will remove files matching the pattern from all history and branches.

(main)$ git filter-branch --force --index-filter 'git rm --cached --ignore-unmatch filepattern' --prune-empty --tag-name-filter cat -- --all

Behind-the-scenes explanation:

--tag-name-filter cat is a cumbersome, but simplest, way to apply the original tags to the new commits, using the command cat.

--prune-empty removes any now-empty commits.

Final Step: Pushing your changed repo history

Once you have removed your desired files, test carefully that you haven't broken anything in your repo - if you have, it is easiest to re-clone your repo to start over. To finish, optionally use git garbage collection to minimize your local .git folder size, and then force push.

(main)$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
(main)$ git push origin --force --tags

Since you just rewrote the entire git repo history, the git push operation may be too large, and return the error “The remote end hung up unexpectedly”. If this happens, you can try increasing the git post buffer:

(main)$ git config http.postBuffer 524288000
(main)$ git push --force

If this does not work, you will need to manually push the repo history in chunks of commits. In the command below, try increasing <number> until the push operation succeeds.

(main)$ git push -u origin HEAD~<number>:refs/heads/main --force

Once the push operation succeeds the first time, decrease <number> gradually until a conventional git push succeeds.

 

I need to change the content of a commit which is not my last

Suppose you created some (e.g. three) commits and later realize you missed something that belongs in the first of them. Putting the fix in a new commit would leave the code base correct, but your commits would no longer be atomic (i.e. changes that belong together would not be in the same commit). In such a situation you may want to amend the commit where the changes belong and leave the following commits unaltered. This is where git rebase can help.

Consider a situation where you want to change the third last commit you made.

(your-branch)$ git rebase -i HEAD~3

gets you into interactive rebase mode, which allows you to edit any of your last three commits. A text editor pops up, showing you something like

pick 9e1d264 The third last commit
pick 4b6e19a The second to last commit
pick f4037ec The last commit

which you change into

edit 9e1d264 The third last commit
pick 4b6e19a The second to last commit
pick f4037ec The last commit

This tells rebase that you want to edit your third last commit and keep the other two unaltered. Then you'll save (and close) the editor. Git will then start to rebase. It stops on the commit you want to alter, giving you the chance to edit that commit. Now you can apply the changes which you missed applying when you initially committed that commit. You do so by editing and staging them. Afterwards you'll run

(your-branch)$ git commit --amend --no-edit

which tells Git to recreate the commit, but to leave the commit message unedited. Having done that, the hard part is solved.

(your-branch)$ git rebase --continue

will do the rest of the work for you.
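
The whole procedure can also be scripted, which is handy for trying it out. The sketch below is an illustration under a few assumptions: a scratch repository with invented file names, and a sed one-liner supplied through GIT_SEQUENCE_EDITOR as a stand-in for changing pick to edit by hand in the todo list.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q && git symbolic-ref HEAD refs/heads/main
git config user.name "Example User" && git config user.email "user@example.com"
echo "base" > README && git add README && git commit -q -m "Initial commit"
for name in third-last second-last last; do
  echo "$name" > "$name.txt" && git add "$name.txt" && git commit -q -m "The $name commit"
done

# Stand-in for editing the todo list by hand: turn the first "pick" into "edit"
GIT_SEQUENCE_EDITOR="sed -i.bak -e '1s/^pick/edit/'" git rebase -i HEAD~3

# The rebase is now paused on the third last commit; add the forgotten change
echo "forgotten" > forgotten.txt && git add forgotten.txt
git commit -q --amend --no-edit
GIT_EDITOR=true git rebase --continue

# Files touched by the rewritten third last commit
edited_files=$(git diff-tree --no-commit-id --name-only -r HEAD~2)
echo "$edited_files"
```

Note that the two later commits are recreated with new hashes even though their content is unchanged, which is why a force push is needed if they were already published.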

Staging

I want to stage all tracked files and leave untracked files

$ git add -u

To stage only some of the tracked files:

# to stage files with ext .txt
$ git add -u *.txt

# to stage all files inside directory src
$ git add -u src/

I need to add staged changes to the previous commit

(my-branch*)$ git commit --amend

If you already know you don't want to change the commit message, you can tell git to reuse the commit message:

(my-branch*)$ git commit --amend -C HEAD

I want to stage part of a new file, but not the whole file

Normally, if you want to stage part of a file, you run this:

$ git add --patch filename.x

-p will work for short. This will open interactive mode. You can use the s option to split a hunk into smaller ones; however, if the file is new, you will not have this option. To add a new file, do this:

$ git add -N filename.x

Then, you will need to use the e option to manually choose which lines to add. Running git diff --cached or git diff --staged will show you which lines you have staged compared to which are still saved locally.

I want to add changes in one file to two different commits

git add will add the entire file to a commit. git add -p lets you interactively select which changes you want to add.

I staged too many edits, and I want to break them out into a separate commit

git reset -p will open a patch mode reset dialog. This is similar to git add -p, except that selecting "yes" will unstage the change, removing it from the upcoming commit.

I want to stage my unstaged edits, and unstage my staged edits

In many cases, you should unstage all of your staged files and then pick the file you want and commit it. However, if you want to switch the staged and unstaged edits, you can create a temporary commit to store your staged files, stage your unstaged files and then stash them. Then, reset the temporary commit and pop your stash.

$ git commit -m "WIP"
$ git add . # This will also add untracked files.
$ git stash
$ git reset HEAD^
$ git stash pop --index 0

NOTE 1: pop is used here, rather than apply, so that the stash entry is removed and the stash list ends up as it was before.

NOTE 2: Your staged files will be marked as unstaged if you don't use the --index flag. (This link explains why.)

Unstaged Edits

I want to move my unstaged edits to a new branch

$ git checkout -b my-branch

I want to move my unstaged edits to a different, existing branch

$ git stash
$ git checkout my-branch
$ git stash pop

I want to discard my local uncommitted changes (staged and unstaged)

If you want to discard all your local staged and unstaged changes, you can do this:

(my-branch)$ git reset --hard
# or
(main)$ git checkout -f

This will unstage all files you might have staged with git add:

$ git reset

This will revert all local uncommitted changes (should be executed in repo root):

$ git checkout .

You can also revert uncommitted changes to a particular file or directory:

$ git checkout [some_dir|file.txt]

Yet another way to revert all uncommitted changes (longer to type, but works from any subdirectory):

$ git reset --hard HEAD

This will remove all local untracked files, so only files tracked by Git remain:

$ git clean -fd

-x will also remove all ignored files.
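
Since git clean deletes files that Git cannot restore, it can be worth previewing the result first. The sketch below assumes a scratch repository with an invented untracked directory; it uses the -n (dry run) flag to list what would be removed before running the real command.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q && git symbolic-ref HEAD refs/heads/main
git config user.name "Example User" && git config user.email "user@example.com"
echo "base" > README && git add README && git commit -q -m "Initial commit"

# An untracked directory that we want to get rid of
mkdir scratch && echo "tmp" > scratch/notes.tmp

preview=$(git clean -n -d)      # dry run: reports what would go, deletes nothing
echo "$preview"
if [ -e scratch ]; then kept_after_preview=yes; else kept_after_preview=no; fi

git clean -q -f -d              # now actually remove untracked files and directories
```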

I want to discard specific unstaged changes

When you want to get rid of some, but not all changes in your working copy.

Checkout undesired changes, keep good changes.

$ git checkout -p
# Answer y to all of the snippets you want to drop

Another strategy involves using stash. Stash all the good changes, reset working copy, and reapply good changes.

$ git stash -p
# Select all of the snippets you want to save
$ git reset --hard
$ git stash pop

Alternatively, stash your undesired changes, and then drop stash.

$ git stash -p
# Select all of the snippets you don't want to save
$ git stash drop

I want to discard specific unstaged files

When you want to get rid of one specific file in your working copy.

$ git checkout myFile

Alternatively, to discard multiple files in your working copy, list them all.

$ git checkout myFirstFile mySecondFile

I want to discard only my unstaged local changes

When you want to get rid of all of your unstaged local uncommitted changes

$ git checkout .

I want to discard all of my untracked files

When you want to get rid of all of your untracked files

$ git clean -f

I want to unstage a specific staged file

Sometimes we have one or more files that accidentally ended up being staged, and these files have not been committed before. To unstage them:

$ git reset -- <filename>

This unstages the file and makes it look like it's untracked.

Branches

I want to list all branches

List local branches

$ git branch

List remote branches

$ git branch -r

List all branches (both local and remote)

$ git branch -a

 

Create a branch from a commit

$ git checkout -b <branch> <SHA1_OF_COMMIT>

 

I pulled from/into the wrong branch

This is another chance to use git reflog to see where your HEAD pointed before the bad pull.

(main)$ git reflog
ab7555f HEAD@{0}: pull origin wrong-branch: Fast-forward
c5bc55a HEAD@{1}: checkout: checkout message goes here

Simply reset your branch back to the desired commit:

$ git reset --hard c5bc55a

Done.

 

I want to discard local commits so my branch is the same as one on the server

Confirm that you haven't pushed your changes to the server.

git status should show how many commits you are ahead of origin:

(my-branch)$ git status
# On branch my-branch
# Your branch is ahead of 'origin/my-branch' by 2 commits.
#   (use "git push" to publish your local commits)
#

One way of resetting to match origin (to have the same as what is on the remote) is to do this:

(my-branch)$ git reset --hard origin/my-branch

 

I committed to main instead of a new branch

Create the new branch while remaining on main:

(main)$ git branch my-branch

Reset the branch main to the previous commit:

(main)$ git reset --hard HEAD^

HEAD^ is short for HEAD^1, which stands for the first parent of HEAD. Similarly, HEAD^2 stands for the second parent of the commit (merge commits have two or more parents).

Note that HEAD^2 is not the same as HEAD~2 (see this link for more information).
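
The difference is easiest to see on a merge commit. The following sketch, which assumes a scratch repository with invented branch and file names, builds one and resolves both expressions.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q && git symbolic-ref HEAD refs/heads/main
git config user.name "Example User" && git config user.email "user@example.com"
echo "base" > README && git add README && git commit -q -m "Base"
base_sha=$(git rev-parse HEAD)

git checkout -q -b feature
echo "feature" > feature.txt && git add feature.txt && git commit -q -m "Feature work"
feature_sha=$(git rev-parse HEAD)

git checkout -q main
echo "main" > main.txt && git add main.txt && git commit -q -m "Main work"
git merge -q --no-ff -m "Merge feature" feature

second_parent=$(git rev-parse 'HEAD^2')   # second parent of the merge: the feature tip
grandparent=$(git rev-parse 'HEAD~2')     # first parent of the first parent: the base commit
echo "HEAD^2 = $second_parent"
echo "HEAD~2 = $grandparent"
```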

Alternatively, if you don't want to use HEAD^, find out what the commit hash you want to set your main branch to (git log should do the trick). Then reset to that hash. git push will make sure that this change is reflected on your remote.

For example, if the hash of the commit that your main branch is supposed to be at is a13b85e:

(main)$ git reset --hard a13b85e
HEAD is now at a13b85e

Checkout the new branch to continue working:

(main)$ git checkout my-branch

 

I want to keep the whole file from another ref-ish

Say you have a working spike (see note), with hundreds of changes. Everything is working. Now, you commit into another branch to save that work:

(solution)$ git add -A && git commit -m "Adding all changes from this spike into one big commit."

When you want to put it into a branch (maybe feature, maybe develop), you're interested in keeping whole files. You want to split your big commit into smaller ones.

Say you have:

  • branch solution, with the solution to your spike. One ahead of develop.
  • branch develop, where you want to add your changes.

You can solve it bringing the contents to your branch:

(develop)$ git checkout solution -- file1.txt

This will get the contents of that file in branch solution to your branch develop:

# On branch develop
# Your branch is up-to-date with 'origin/develop'.
# Changes to be committed:
#  (use "git reset HEAD <file>..." to unstage)
#
#        modified:   file1.txt

Then, commit as usual.

Note: Spike solutions are made to analyze or solve the problem. These solutions are used for estimation and discarded once everyone gets clear visualization of the problem. ~ Wikipedia.

 

I made several commits on a single branch that should be on different branches

Say you are on your main branch. Running git log, you see you have made two commits:

(main)$ git log

commit e3851e817c451cc36f2e6f3049db528415e3c114
Author: Alex Lee <alexlee@example.com>
Date:   Tue Jul 22 15:39:27 2014 -0400

    Bug #21 - Added CSRF protection

commit 5ea51731d150f7ddc4a365437931cd8be3bf3131
Author: Alex Lee <alexlee@example.com>
Date:   Tue Jul 22 15:39:12 2014 -0400

    Bug #14 - Fixed spacing on title

commit a13b85e984171c6e2a1729bb061994525f626d14
Author: Aki Rose <akirose@example.com>
Date:   Tue Jul 21 01:12:48 2014 -0400

    First commit

Let's take note of our commit hashes for each bug (e3851e8 for #21, 5ea5173 for #14).

First, let's reset our main branch to the correct commit (a13b85e):

(main)$ git reset --hard a13b85e
HEAD is now at a13b85e

Now, we can create a fresh branch for our bug #21:

(main)$ git checkout -b 21
(21)$

Now, let's cherry-pick the commit for bug #21 on top of our branch. That means we will be applying that commit, and only that commit, directly on top of whatever our head is at.

(21)$ git cherry-pick e3851e8

At this point, there is a possibility there might be conflicts. See the There were conflicts section in the interactive rebasing section below for how to resolve conflicts.

Now let's create a new branch for bug #14, also based on main:

(21)$ git checkout main
(main)$ git checkout -b 14
(14)$

And finally, let's cherry-pick the commit for bug #14:

(14)$ git cherry-pick 5ea5173
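
For reference, the same steps can be run end to end in a scratch repository. This is a sketch only: the files title.txt and csrf.txt are invented, and the hashes are captured into variables instead of being copied from git log.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q && git symbolic-ref HEAD refs/heads/main
git config user.name "Example User" && git config user.email "user@example.com"
echo "base" > base.txt && git add base.txt && git commit -q -m "First commit"
base=$(git rev-parse HEAD)
echo "title" > title.txt && git add title.txt && git commit -q -m "Bug #14 - Fixed spacing on title"
fix14=$(git rev-parse HEAD)
echo "csrf" > csrf.txt && git add csrf.txt && git commit -q -m "Bug #21 - Added CSRF protection"
fix21=$(git rev-parse HEAD)

# Move main back to the commit it should be at
git reset -q --hard "$base"

# One branch per bug, each receiving only its own commit
git checkout -q -b 21
git cherry-pick "$fix21" > /dev/null
on_21=$(git log -1 --format=%s)

git checkout -q main
git checkout -q -b 14
git cherry-pick "$fix14" > /dev/null
on_14=$(git log -1 --format=%s)
echo "21: $on_21"; echo "14: $on_14"
```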

 

I want to delete local branches that were deleted upstream

Once you merge a pull request on GitHub, it gives you the option to delete the merged branch in your fork. If you aren't planning to keep working on the branch, it's cleaner to delete the local copies of the branch so you don't end up cluttering up your working checkout with a lot of stale branches.

$ git fetch -p upstream

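where upstream is the remote you want to fetch from. The -p (--prune) flag removes remote-tracking references (such as upstream/my-branch) whose branches no longer exist on the remote; local branches that you created yourself still have to be deleted with git branch -d.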

 

I accidentally deleted my branch

If you're regularly pushing to remote, you should be safe most of the time. But still sometimes you may end up deleting your branches. Let's say we create a branch and create a new file:

(main)$ git checkout -b my-branch
(my-branch)$ git branch
(my-branch)$ touch foo.txt
(my-branch)$ ls
README.md foo.txt

Let's add it and commit.

(my-branch)$ git add .
(my-branch)$ git commit -m 'foo.txt added'
(my-branch)$ foo.txt added
 1 files changed, 1 insertions(+)
 create mode 100644 foo.txt
(my-branch)$ git log

commit 4e3cd85a670ced7cc17a2b5d8d3d809ac88d5012
Author: siemiatj <siemiatj@example.com>
Date:   Wed Jul 30 00:34:10 2014 +0200

    foo.txt added

commit 69204cdf0acbab201619d95ad8295928e7f411d5
Author: Kate Hudson <katehudson@example.com>
Date:   Tue Jul 29 13:14:46 2014 -0400

    Fixes #6: Force pushing after amending commits

Now we're switching back to main and 'accidentally' removing our branch.

(my-branch)$ git checkout main
Switched to branch 'main'
Your branch is up-to-date with 'origin/main'.
(main)$ git branch -D my-branch
Deleted branch my-branch (was 4e3cd85).
(main)$ echo oh noes, deleted my branch!
oh noes, deleted my branch!

At this point you should get familiar with git reflog. It records every change to the position of HEAD and of your branch tips in the local repository.

(main)$ git reflog
69204cd HEAD@{0}: checkout: moving from my-branch to main
4e3cd85 HEAD@{1}: commit: foo.txt added
69204cd HEAD@{2}: checkout: moving from main to my-branch

As you can see, we have the commit hash from our deleted branch. Let's see if we can restore our deleted branch.

(main)$ git checkout -b my-branch-help
Switched to a new branch 'my-branch-help'
(my-branch-help)$ git reset --hard 4e3cd85
HEAD is now at 4e3cd85 foo.txt added
(my-branch-help)$ ls
README.md foo.txt

Voila! We got our removed file back. git reflog is also useful when rebasing goes terribly wrong.
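
A slightly shorter variant, shown here as a sketch in a scratch repository, is to recreate the branch directly at the lost commit with git branch instead of creating a helper branch and resetting it. The reflog lookup by commit message is an assumption made to keep the script non-interactive; normally you would read the hash from the git reflog output yourself.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q && git symbolic-ref HEAD refs/heads/main
git config user.name "Example User" && git config user.email "user@example.com"
echo "base" > README.md && git add README.md && git commit -q -m "Initial commit"

git checkout -q -b my-branch
touch foo.txt && git add foo.txt && git commit -q -m "foo.txt added"
deleted_tip=$(git rev-parse HEAD)
git checkout -q main
git branch -q -D my-branch            # the "accident"

# Find the lost commit through its "commit:" entry in the reflog
lost=$(git reflog --format='%H %gs' | grep 'commit: foo.txt added' | head -n 1 | cut -d' ' -f1)

# Recreate the branch at that commit
git branch my-branch "$lost"
git log --oneline -1 my-branch
```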

I want to delete a branch

To delete a remote branch:

(main)$ git push origin --delete my-branch

You can also do:

(main)$ git push origin :my-branch

To delete a local branch:

(main)$ git branch -d my-branch

To delete a local branch that has not been merged to the current branch or an upstream:

(main)$ git branch -D my-branch

I want to delete multiple branches

Say you want to delete all branches that start with fix/:

(main)$ git branch | grep 'fix/' | xargs git branch -d
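
Note that grep 'fix/' matches the text anywhere in the name, so a branch such as hotfix/something would be caught as well. One possible alternative, sketched below in a scratch repository with invented branch names, is to let git for-each-ref select only the branches under refs/heads/fix/.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q && git symbolic-ref HEAD refs/heads/main
git config user.name "Example User" && git config user.email "user@example.com"
echo "base" > README && git add README && git commit -q -m "Initial commit"
git branch fix/login && git branch fix/typo && git branch hotfix/keep-me

# Select only branches whose names start with fix/ and delete them
git for-each-ref --format='%(refname:short)' refs/heads/fix/ | xargs git branch -q -d

remaining=$(git for-each-ref --format='%(refname:short)' refs/heads/)
echo "$remaining"
```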

I want to rename a branch

To rename the current (local) branch:

(main)$ git branch -m new-name

To rename a different (local) branch:

(main)$ git branch -m old-name new-name

To delete the old-name remote branch and push the new-name local branch:

(main)$ git push origin :old-name new-name

 

I want to checkout to a remote branch that someone else is working on

First, fetch all branches from remote:

(main)$ git fetch --all

Say you want to check out the branch daves from the remote:

(main)$ git checkout --track origin/daves
Branch daves set up to track remote branch daves from origin.
Switched to a new branch 'daves'

(--track is shorthand for git checkout -b [branch] [remotename]/[branch])

This will give you a local copy of the branch daves that tracks origin/daves, so git pull will bring in any updates pushed to the remote branch, and git push will send your commits to it.

I want to create a new remote branch from current local one

$ git push <remote> HEAD

If you would also like to set that remote branch as upstream for the current one, use the following instead:

$ git push -u <remote> HEAD

With the upstream mode and the simple (default in Git 2.0) mode of the push.default config, the following command will push the current branch with regards to the remote branch that has been registered previously with -u:

$ git push

The behavior of the other modes of git push is described in the doc of push.default.

I want to set a remote branch as the upstream for a local branch

You can set a remote branch as the upstream for the current local branch using:

$ git branch --set-upstream-to [remotename]/[branch]
# or, using the shorthand:
$ git branch -u [remotename]/[branch]

To set the upstream remote branch for another local branch:

$ git branch -u [remotename]/[branch] [local-branch]

 

I want to set my HEAD to track the default remote branch

By checking your remote branches, you can see which remote branch your HEAD is tracking. In some cases, this is not the desired branch.

$ git branch -r
  origin/HEAD -> origin/gh-pages
  origin/main

To change origin/HEAD to track origin/main, you can run this command:

$ git remote set-head origin --auto
origin/HEAD set to main

I made changes on the wrong branch

You've made uncommitted changes and realise you're on the wrong branch. Stash changes and apply them to the branch you want:

(wrong_branch)$ git stash
(wrong_branch)$ git checkout <correct_branch>
(correct_branch)$ git stash apply

 

I want to split a branch into two

You've made a lot of commits on a branch and now want to separate it into two, ending with a branch up to an earlier commit and another with all the changes.

Use git log to find the commit where you want to split. Then do the following:

(original_branch)$ git checkout -b new_branch
(new_branch)$ git checkout original_branch
(original_branch)$ git reset --hard <sha1 split here>

If you had previously pushed the original_branch to remote, you will need to do a force push. For more information, check Stack Overflow.

Rebasing and Merging

 

I want to undo rebase/merge

You may have merged or rebased your current branch with the wrong branch, or you can't figure out how to finish the rebase/merge process. Git saves the original HEAD pointer in a reference called ORIG_HEAD before doing dangerous operations, so it is simple to restore your branch to its state before the rebase/merge.

(my-branch)$ git reset --hard ORIG_HEAD
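
As a quick illustration, the sketch below performs an unwanted merge in a scratch repository and then undoes it through ORIG_HEAD. Branch and file names are invented; the same reset also applies after a rebase, as long as no other command has overwritten ORIG_HEAD in the meantime.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q && git symbolic-ref HEAD refs/heads/main
git config user.name "Example User" && git config user.email "user@example.com"
echo "base" > README && git add README && git commit -q -m "Initial commit"

git checkout -q -b wrong-branch
echo "oops" > wrong.txt && git add wrong.txt && git commit -q -m "Work that should stay separate"
git checkout -q main
before=$(git rev-parse HEAD)

git merge -q --no-ff -m "Merge wrong-branch" wrong-branch   # the merge we regret
git reset -q --hard ORIG_HEAD                                # back to the pre-merge commit
after=$(git rev-parse HEAD)
echo "before: $before"; echo "after:  $after"
```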

 

I rebased, but I don't want to force push

Unfortunately, you have to force push, if you want those changes to be reflected on the remote branch. This is because you have changed the history. The remote branch won't accept changes unless you force push. This is one of the main reasons many people use a merge workflow, instead of a rebasing workflow - large teams can get into trouble with developers force pushing. Use this with caution. A safer way to use rebase is not to reflect your changes on the remote branch at all, and instead to do the following:

(main)$ git checkout my-branch
(my-branch)$ git rebase -i main
(my-branch)$ git checkout main
(main)$ git merge --ff-only my-branch

For more, see this SO thread.

 

I need to combine commits

Let's suppose you are working in a branch that is/will become a pull-request against main. In the simplest case when all you want to do is to combine all commits into a single one and you don't care about commit timestamps, you can reset and recommit. Make sure the main branch is up to date and all your changes committed, then:

(my-branch)$ git reset --soft main
(my-branch)$ git commit -am "New awesome feature"

If you want more control, and also to preserve timestamps, you need to do something called an interactive rebase:

(my-branch)$ git rebase -i main

If you aren't working against another branch you'll have to rebase relative to your HEAD. If you want to squash the last 2 commits, for example, you'll have to rebase against HEAD~2. For the last 3, HEAD~3, etc.

(main)$ git rebase -i HEAD~2

After you run the interactive rebase command, you will see something like this in your text editor:

pick a9c8a1d Some refactoring
pick 01b2fd8 New awesome feature
pick b729ad5 fixup
pick e3851e8 another fix

# Rebase 8074d12..b729ad5 onto 8074d12
#
# Commands:
#  p, pick = use commit
#  r, reword = use commit, but edit the commit message
#  e, edit = use commit, but stop for amending
#  s, squash = use commit, but meld into previous commit
#  f, fixup = like "squash", but discard this commit's log message
#  x, exec = run command (the rest of the line) using shell
#
# These lines can be re-ordered; they are executed from top to bottom.
#
# If you remove a line here THAT COMMIT WILL BE LOST.
#
# However, if you remove everything, the rebase will be aborted.
#
# Note that empty commits are commented out

All the lines beginning with a # are comments; they won't affect your rebase.

Then you replace pick commands with any in the list above, and you can also remove commits by removing corresponding lines.

For example, if you want to leave the oldest (first) commit alone and combine all the following commits with the second oldest, you should edit the letter next to each commit except the first and the second to say f:

pick a9c8a1d Some refactoring
pick 01b2fd8 New awesome feature
f b729ad5 fixup
f e3851e8 another fix

If you want to combine these commits and rename the commit, you should additionally add an r next to the second commit or simply use s instead of f:

pick a9c8a1d Some refactoring
pick 01b2fd8 New awesome feature
s b729ad5 fixup
s e3851e8 another fix

You can then rename the commit in the next text prompt that pops up.

Newer, awesomer features

# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
# rebase in progress; onto 8074d12
# You are currently editing a commit while rebasing branch 'main' on '8074d12'.
#
# Changes to be committed:
#   modified:   README.md
#

If everything is successful, you should see something like this:

(main)$ Successfully rebased and updated refs/heads/main.

Safe merging strategy

--no-commit performs the merge but pretends the merge failed and does not autocommit, giving the user a chance to inspect and further tweak the merge result before committing. --no-ff maintains evidence that a feature branch once existed, keeping project history consistent.

(main)$ git merge --no-ff --no-commit my-branch

I need to merge a branch into a single commit

(main)$ git merge --squash my-branch

 

I want to combine only unpushed commits

Sometimes you have several work in progress commits that you want to combine before you push them upstream. You don't want to accidentally combine any commits that have already been pushed upstream because someone else may have already made commits that reference them.

(main)$ git rebase -i @{u}

This will do an interactive rebase that lists only the commits that you haven't already pushed, so it will be safe to reorder/fix/squash anything in the list.

I need to abort the merge

Sometimes a merge produces conflicts in certain files. In those cases you can use the --abort option to abort the current conflict resolution process and try to reconstruct the pre-merge state.

(my-branch)$ git merge --abort

This command is available since Git 1.7.4.

I need to update the parent commit of my branch

Say I have a main branch, a feature-1 branch branched from main, and a feature-2 branch branched off of feature-1. If I make a commit to feature-1, then the parent commit of feature-2 is no longer accurate (it should be the head of feature-1, since we branched off of it). We can fix this with git rebase --onto.

(feature-2)$ git rebase --onto feature-1 <the first commit in your feature-2 branch that you don't want to bring along> feature-2

This helps in sticky scenarios where you might have a feature built on another feature that hasn't been merged yet, and a bugfix on the feature-1 branch needs to be reflected in your feature-2 branch.

Check if all commits on a branch are merged

To check if all commits on a branch are merged into another branch, you should diff between the heads (or any commits) of those branches:

(main)$ git log --graph --left-right --cherry-pick --oneline HEAD...feature/120-on-scroll

This will tell you if any commits are in one but not the other, and will give you a list of any nonshared between the branches. Another option is to do this:

(main)$ git log main ^feature/120-on-scroll --no-merges

Possible issues with interactive rebases

 

The rebase editing screen says 'noop'

If you're seeing this:

noop

That means you are trying to rebase against a branch that is at an identical commit, or is ahead of your current branch. You can try:

  • making sure your main branch is where it should be
  • rebase against HEAD~2 or earlier instead

 

There were conflicts

If you are unable to successfully complete the rebase, you may have to resolve conflicts.

First run git status to see which files have conflicts in them:

(my-branch)$ git status
On branch my-branch
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

  both modified:   README.md

In this example, README.md has conflicts. Open that file and look for the following:

   <<<<<<< HEAD
   some code
   =======
   some code
   >>>>>>> new-commit

You will need to resolve the differences between the code that was added in your new commit (in the example, everything from the middle line to new-commit) and your HEAD.

If you want to keep one branch's version of the code, you can use --ours or --theirs:

(main*)$ git checkout --ours README.md
  • When merging, use --ours to keep changes from the local branch, or --theirs to keep changes from the other branch.
  • When rebasing, use --theirs to keep changes from the local branch, or --ours to keep changes from the other branch. For an explanation of this swap, see this note in the Git documentation.

If the merges are more complicated, you can use a visual diff editor:

(main*)$ git mergetool -t opendiff

After you have resolved all conflicts and tested your code, git add the files you have changed, and then continue the rebase with git rebase --continue

(my-branch)$ git add README.md
(my-branch)$ git rebase --continue

If after resolving all the conflicts you end up with an identical tree to what it was before the commit, you need to git rebase --skip instead.

If at any time you want to stop the entire rebase and go back to the original state of your branch, you can do so:

(my-branch)$ git rebase --abort

 

Stash

Stash all edits

To stash all the edits in your working directory

$ git stash

If you also want to stash untracked files, use -u option.

$ git stash -u

Stash specific files

To stash only one file from your working directory

$ git stash push working-directory-path/filename.ext

To stash multiple files from your working directory

$ git stash push working-directory-path/filename1.ext working-directory-path/filename2.ext

 

Stash with message

$ git stash save <message>

or, preferably, since git stash save is deprecated in favor of git stash push:

$ git stash push -m <message>

 

Apply a specific stash from list

First check your list of stashes with message using

$ git stash list

Then apply a specific stash from the list using

$ git stash apply "stash@{n}"

Here, 'n' indicates the position of the stash in the stack. The topmost stash will be position 0.

Furthermore, using a time-based stash reference is also possible.

$ git stash apply "stash@{2.hours.ago}"

 

Stash while keeping unstaged edits

You can manually create a stash commit, and then use git stash store.

$ git stash create
$ git stash store -m <message> CREATED_SHA1

Finding

I want to find a string in any commit

To find a certain string which was introduced in any commit, you can use the following structure:

$ git log -S "string to find"

Common parameters:

--source shows the ref name given on the command line by which each commit was reached.

--all starts from every branch.

--reverse prints in reverse order, so the first commit that made the change is shown first.

 

I want to find by author/committer

To find all commits by author/committer you can use:

$ git log --author=<name or email>
$ git log --committer=<name or email>

Keep in mind that author and committer are not the same. The author is the person who originally wrote the code; the committer is the person who committed the code on behalf of the original author.

I want to list commits containing specific files

To find all commits containing a specific file you can use:

$ git log -- <path to file>

You would usually specify an exact path, but you may also use wild cards in the path and file name:

$ git log -- **/*.js

When using wildcards, it's useful to pass --name-status to see the list of committed files:

$ git log --name-status -- **/*.js

 

I want to view the commit history for a specific function

To trace the evolution of a single function you can use:

$ git log -L :FunctionName:FilePath

Note that you can combine this with further git log options, like revision ranges and commit limits.

Find a tag where a commit is referenced

To find all tags containing a specific commit:

$ git tag --contains <commitid>

Submodules

 

Clone all submodules

$ git clone --recursive git://github.com/foo/bar.git

If already cloned:

$ git submodule update --init --recursive

 

Remove a submodule

Creating a submodule is pretty straightforward, but deleting one is less so. The commands you need are:

$ git submodule deinit submodulename
$ git rm submodulename
$ git rm --cached submodulename
$ rm -rf .git/modules/submodulename

Miscellaneous Objects

Copy a folder or file from one branch to another

$ git checkout <branch-you-want-the-directory-from> -- <folder-name or file-name>

Restore a deleted file

First find the commit that deleted the file (the last commit that touched it):

$ git rev-list -n 1 HEAD -- filename

Then check out the file from the parent of that commit:

$ git checkout deletingcommitid^ -- filename

Delete tag

$ git tag -d <tag_name>
$ git push <remote> :refs/tags/<tag_name>

 

Recover a deleted tag

If you want to recover a tag that was already deleted, you can do so by following these steps: First, you need to find the unreachable tag:

$ git fsck --unreachable | grep tag

Make a note of the tag's hash. Then, restore the deleted tag with following, making use of git update-ref:

$ git update-ref refs/tags/<tag_name> <hash>

Your tag should now have been restored.

Deleted Patch

If someone has sent you a pull request on GitHub, but then deleted their original fork, you will be unable to clone their repository or to use git am as the .diff, .patch URLs become unavailable. But you can checkout the PR itself using GitHub's special refs. To fetch the content of PR#1 into a new branch called pr_1:

$ git fetch origin refs/pull/1/head:pr_1
From github.com:foo/bar
 * [new ref]         refs/pull/1/head -> pr_1

Exporting a repository as a Zip file

$ git archive --format zip --output /full/path/to/zipfile.zip main

Push a branch and a tag that have the same name

If there is a tag on a remote repository that has the same name as a branch you will get the following error when trying to push that branch with a standard $ git push <remote> <branch> command.

$ git push origin <branch>
error: dst refspec same matches more than one.
error: failed to push some refs to '<git server>'

Fix this by specifying you want to push the head reference.

$ git push origin refs/heads/<branch-name>

If you want to push a tag to a remote repository that has the same name as a branch, you can use a similar command.

$ git push origin refs/tags/<tag-name>

Tracking Files

 

I want to change a file name's capitalization, without changing the contents of the file

(main)$ git mv --force myfile MyFile

I want to overwrite local files when doing a git pull

(main)$ git fetch --all
(main)$ git reset --hard origin/main

 

I want to remove a file from Git but keep the file

(main)$ git rm --cached log.txt

I want to revert a file to a specific revision

Assuming the hash of the commit you want is c5f567:

(main)$ git checkout c5f567 -- file1/to/restore file2/to/restore

If you want to revert to changes made just 1 commit before c5f567, pass the commit hash as c5f567~1:

(main)$ git checkout c5f567~1 -- file1/to/restore file2/to/restore

I want to list changes of a specific file between commits or branches

Assuming you want to compare last commit with file from commit c5f567:

$ git diff HEAD:path_to_file/file c5f567:path_to_file/file

Same goes for branches:

$ git diff main:path_to_file/file staging:path_to_file/file

I want Git to ignore changes to a specific file

This works great for config templates or other files that require locally adding credentials that shouldn't be committed.

$ git update-index --assume-unchanged file-to-ignore

Note that this does not remove the file from source control - it is only ignored locally. To undo this and tell Git to notice changes again, this clears the ignore flag:

$ git update-index --no-assume-unchanged file-to-stop-ignoring

Debugging with Git

The git-bisect command uses a binary search to find which commit in your Git history introduced a bug.

Suppose you're on the main branch, and you want to find the commit that broke some feature. You start bisect:

$ git bisect start

Then you should specify which commit is bad, and which one is known to be good. Assuming that your current version is bad, and v1.1.1 is good:

$ git bisect bad
$ git bisect good v1.1.1

Now git-bisect selects a commit in the middle of the range that you specified, checks it out, and asks you whether it's good or bad. You should see something like:

$ Bisecting: 5 revision left to test after this (roughly 5 step)
$ [c44abbbee29cb93d8499283101fe7c8d9d97f0fe] Commit message
$ (c44abbb)$

You will now check if this commit is good or bad. If it's good:

$ (c44abbb)$ git bisect good

and git-bisect will select another commit from the range for you. This process (selecting good or bad) will repeat until there are no more revisions left to inspect, and the command will finally print a description of the first bad commit.
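
The narrowing that git-bisect performs can be sketched in a few lines of Python. This is only an illustration of the binary search idea, not how Git implements it: the first_bad_commit function, the commit names, and the check callback are all hypothetical, and the sketch assumes a linear history in which every commit after the first bad one is also bad.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, the last one is bad,
    and every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # hi always points at a known-bad commit
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # the culprit is mid or earlier
        else:
            lo = mid + 1    # the culprit is after mid
    return commits[lo]


# Hypothetical history of eight commits; the bug was introduced in "c5".
history = [f"c{i}" for i in range(1, 9)]
tested = []


def check(commit: str) -> bool:
    """Stand-in for manually answering 'git bisect good' or 'git bisect bad'."""
    tested.append(commit)
    return int(commit[1:]) >= 5


culprit = first_bad_commit(history, check)
print(culprit, "found after testing", len(tested), "commits")
```

Each answer halves the remaining range, which is why only a handful of checks are needed even for long histories.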

Configuration

I want to add aliases for some Git commands

On OS X and Linux, your git configuration file is stored in ~/.gitconfig. I've added some example aliases I use as shortcuts (and some of my common typos) in the [alias] section as shown below:

[alias]
    a = add
    amend = commit --amend
    c = commit
    ca = commit --amend
    ci = commit -a
    co = checkout
    d = diff
    dc = diff --changed
    ds = diff --staged
    extend = commit --amend -C HEAD
    f = fetch
    loll = log --graph --decorate --pretty=oneline --abbrev-commit
    m = merge
    one = log --pretty=oneline
    outstanding = rebase -i @{u}
    reword = commit --amend --only
    s = status
    unpushed = log @{u}
    wc = whatchanged
    wip = rebase -i @{u}
    zap = fetch -p
    day = log --reverse --no-merges --branches=* --date=local --since=midnight --author=\"$(git config --get user.name)\"
    delete-merged-branches = "!f() { git checkout --quiet main && git branch --merged | grep --invert-match '\\*' | xargs -n 1 git branch --delete; git checkout --quiet @{-1}; }; f"

I want to add an empty directory to my repository

You can’t! Git doesn’t support this, but there’s a hack. You can create a .gitignore file in the directory with the following contents:

 # Ignore everything in this directory
 *
 # Except this file
 !.gitignore

Another common convention is to make an empty file in the folder, titled .gitkeep.

$ mkdir mydir
$ touch mydir/.gitkeep

You can also name the file just .keep, in which case the second line above would be touch mydir/.keep

I want to cache a username and password for a repository

You might have a repository that requires authentication. In which case you can cache a username and password so you don't have to enter it on every push and pull. Credential helper can do this for you.

$ git config --global credential.helper cache
# Set git to use the credential memory cache
$ git config --global credential.helper 'cache --timeout=3600'
# Set the cache to timeout after 1 hour (setting is in seconds)

To find a credential helper:

$ git help -a | grep credential
# Shows you possible credential helpers

For OS specific credential caching:

$ git config --global credential.helper osxkeychain
# For OSX
$ git config --global credential.helper manager
# Git for Windows 2.7.3+
$ git config --global credential.helper gnome-keyring
# Ubuntu and other GNOME-based distros

More credential helpers can likely be found for different distributions and operating systems.

I want to make Git ignore permissions and filemode changes

$ git config core.fileMode false

If you want to make this the default behaviour for logged-in users, then use:

$ git config --global core.fileMode false

I want to set a global user

To configure user information used across all local repositories, and to set a name that is identifiable for credit when reviewing version history:

$ git config --global user.name "[firstname lastname]"

To set an email address that will be associated with each history marker:

$ git config --global user.email "[valid-email]"

I've no idea what I did wrong

So, you're screwed - you reset something, or you merged the wrong branch, or you force pushed and now you can't find your commits. You know, at some point, you were doing alright, and you want to go back to some state you were at.

This is what git reflog is for. reflog keeps track of any changes to the tip of a branch, even if that tip isn't referenced by a branch or a tag. Basically, every time HEAD changes, a new entry is added to the reflog. This only works for local repositories, sadly, and it only tracks movements (not changes to a file that weren't recorded anywhere, for instance).

(main)$ git reflog
0a2e358 HEAD@{0}: reset: moving to HEAD~2
0254ea7 HEAD@{1}: checkout: moving from 2.2 to main
c10f740 HEAD@{2}: checkout: moving from main to 2.2

The reflog above shows a checkout from main to the 2.2 branch and back. From there, there's a hard reset to an older commit. The latest activity is represented at the top labeled HEAD@{0}.

If it turns out that you accidentally moved back, the reflog will contain the commit main pointed to (0254ea7) before you accidentally dropped 2 commits.

$ git reset --hard 0254ea7

Using git reset it is then possible to change main back to the commit it was before. This provides a safety net in case history was accidentally changed.

(copied and edited from Source).

Git Shortcuts

Git Bash

Once you're comfortable with what the above commands are doing, you might want to create some shortcuts for Git Bash. This allows you to work a lot faster by doing complex tasks in really short commands.

alias sq=squash

function squash() {
    git rebase -i HEAD~$1
}

Copy those commands to your .bashrc or .bash_profile.

PowerShell on Windows

If you are using PowerShell on Windows, you can also set up aliases and functions. Add these commands to your profile, whose path is defined in the $profile variable. Learn more at the About Profiles page on the Microsoft documentation site.

Set-Alias sq Squash-Commits

function Squash-Commits {
  git rebase -i "HEAD~$($args[0])"
}

Other Resources

Books

Tutorials

Scripts and Tools

  • firstaidgit.io A searchable selection of the most frequently asked Git questions
  • git-extra-commands - a collection of useful extra Git scripts
  • git-extras - GIT utilities -- repo summary, repl, changelog population, author commit percentages and more
  • git-fire - git-fire is a Git plugin that helps in the event of an emergency by adding all current files, committing, and pushing to a new branch (to prevent merge conflicts).
  • git-tips - Small Git tips
  • git-town - Generic, high-level Git workflow support! http://www.git-town.com

GUI Clients

  • GitKraken - The downright luxurious Git client, for Windows, Mac & Linux
  • git-cola - another Git client for Windows and OS X
  • GitUp - A newish GUI that has some very opinionated ways of dealing with Git's complications
  • gitx-dev - another graphical Git client for OS X
  • Sourcetree - Simplicity meets power in a beautiful and free Git GUI. For Windows and Mac.
  • Tower - graphical Git client for OS X (paid)
  • tig - terminal text-mode interface for Git
  • Magit - Interface to Git implemented as an Emacs package.
  • GitExtensions - a shell extension, a Visual Studio 2010-2015 plugin and a standalone Git repository tool.
  • Fork - a fast and friendly Git client for Mac (beta)
  • gmaster - a Git client for Windows that has 3-way merge, analyze refactors, semantic diff and merge (beta)
  • gitk - a Git client for Linux that gives a simple view of repo state.
  • SublimeMerge - Blazing fast, extensible client that provides 3-way merges, powerful search and syntax highlighting, in active development.


Author: K88hudson
Source Code: https://github.com/k88hudson/git-flight-rules 
License: CC-BY-SA-4.0 License

#git #guide 

How to Build a Fake News Detector in Python

Fake News Detection in Python

Explore the fake news dataset, perform data analysis such as word clouds and n-grams, and fine-tune the BERT transformer to build a fake news detector in Python using the transformers library.

Fake news is the intentional broadcasting of false or misleading claims as news, where the statements are purposely deceitful.

Newspapers, tabloids, and magazines have been supplanted by digital news platforms, blogs, social media feeds, and a plethora of mobile news applications. News organizations benefited from the increased use of social media and mobile platforms by providing subscribers with up-to-the-minute information.

Consumers now have instant access to the latest news. These digital media platforms have grown in prominence because they connect easily with the rest of the world and let users discuss and share ideas and debate topics such as democracy, education, health, research, and history. Fake news on digital platforms is increasingly common and is used for profit, such as political and financial gain.

How big is this problem?

Because the Internet, social media, and digital platforms are widely used, anyone can propagate inaccurate and biased information. It is almost impossible to prevent the spread of fake news. There has been a tremendous surge in the distribution of false news, which is not restricted to one sector such as politics but also includes sports, health, history, entertainment, and science and research.

The solution

It is vital to recognize and differentiate between false and accurate news. One method is to have an expert decide and fact-check every piece of information, but this takes time and requires expertise that cannot be shared. Alternatively, we can use machine learning and artificial intelligence tools to automate the identification of fake news.

Online news information includes various data in unstructured formats (such as documents, videos, and audio), but here we will concentrate on news in text format. With the progress of machine learning and natural language processing, we can now recognize the misleading and false character of an article or statement.

Several studies and experiments are being conducted to detect fake news across all media.

Our main goals in this tutorial are to:

  • Explore and analyze the fake news dataset.
  • Build a classifier that can distinguish fake news with as much accuracy as possible.

Here is the table of contents:

  • Introduction
  • How big is this problem?
  • The solution
  • Data exploration
    • Class distribution
  • Data cleaning for analysis
  • Exploratory data analysis
    • Single-word cloud
    • Most frequent bigram (two-word combination)
    • Most frequent trigram (three-word combination)
  • Building a classifier by fine-tuning BERT
    • Data preparation
    • Tokenizing the dataset
    • Loading and fine-tuning the model
    • Model evaluation
  • Appendix: Creating a submission file for Kaggle
  • Conclusion

Data Exploration

In this work, we use the fake news dataset from Kaggle to classify untrustworthy news articles as fake news. We have a complete training dataset containing the following characteristics:

  • id: unique id for a news article
  • title: title of a news article
  • author: author of the news article
  • text: text of the article; could be incomplete
  • label: a label that marks the article as potentially unreliable, denoted by 1 (unreliable or fake) or 0 (reliable).

It is a binary classification problem in which we must predict whether a particular news story is reliable or not.

If you have a Kaggle account, you can simply download the dataset from the website and extract the ZIP file.

I also uploaded the dataset to Google Drive, and you can get it here or use the gdown library to download it automatically in Google Colab or Jupyter notebooks:

$ pip install gdown
# download from Google Drive
$ gdown "https://drive.google.com/uc?id=178f_VkNxccNidap-5-uffXUW475pAuPy&confirm=t"
Downloading...
From: https://drive.google.com/uc?id=178f_VkNxccNidap-5-uffXUW475pAuPy&confirm=t
To: /content/fake-news.zip
100% 48.7M/48.7M [00:00<00:00, 74.6MB/s]

Unzipping the files:

$ unzip fake-news.zip

Three files will appear in the current working directory: train.csv, test.csv, and submit.csv. We will use train.csv for most of the tutorial.

Installing the required dependencies:

$ pip install transformers nltk pandas numpy matplotlib seaborn wordcloud

Note: If you are in a local environment, make sure you install PyTorch for GPU; head to this page for a proper installation.

Let's import the essential libraries for analysis:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

The NLTK corpora and modules must be installed using the standard NLTK downloader:

import nltk
nltk.download('stopwords')
nltk.download('wordnet')

The fake news dataset comprises original and fictitious article titles and text from various authors. Let's import our dataset:

# load the dataset
news_d = pd.read_csv("train.csv")
print("Shape of News data:", news_d.shape)
print("News data columns", news_d.columns)

Output:

 Shape of News data: (20800, 5)
 News data columns Index(['id', 'title', 'author', 'text', 'label'], dtype='object')

Here is how the dataset looks:

# by using df.head(), we can immediately familiarize ourselves with the dataset. 
news_d.head()

Output:

id	title	author	text	label
0	0	House Dem Aide: We Didn’t Even See Comey’s Let...	Darrell Lucus	House Dem Aide: We Didn’t Even See Comey’s Let...	1
1	1	FLYNN: Hillary Clinton, Big Woman on Campus - ...	Daniel J. Flynn	Ever get the feeling your life circles the rou...	0
2	2	Why the Truth Might Get You Fired	Consortiumnews.com	Why the Truth Might Get You Fired October 29, ...	1
3	3	15 Civilians Killed In Single US Airstrike Hav...	Jessica Purkiss	Videos 15 Civilians Killed In Single US Airstr...	1
4	4	Iranian woman jailed for fictional unpublished...	Howard Portnoy	Print \nAn Iranian woman has been sentenced to...	1

We have 20,800 rows with five columns. Let's look at some statistics of the text column:

# Text word statistics: min, mean, max and interquartile range

txt_length = news_d.text.str.split().str.len()
txt_length.describe()

Output:

count    20761.000000
mean       760.308126
std        869.525988
min          0.000000
25%        269.000000
50%        556.000000
75%       1052.000000
max      24234.000000
Name: text, dtype: float64

Statistics for the title column:

#Title statistics 

title_length = news_d.title.str.split().str.len()
title_length.describe()

Output:

count    20242.000000
mean        12.420709
std          4.098735
min          1.000000
25%         10.000000
50%         13.000000
75%         15.000000
max         72.000000
Name: title, dtype: float64

The statistics above can be summarized as follows:

  • The text attribute has a higher word count, with an average of 760 words and 75% of the articles having about 1,050 words or fewer.
  • The title attribute is a short statement with an average of 12 words, and 75% of the titles have 15 words or fewer.

Our experiment will use the text and the title together.
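
One way to use the title and the text together is to concatenate them into a single column. The snippet below is a minimal sketch on a toy DataFrame, not the tutorial's own preparation code; the rows and the content column name are hypothetical.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical rows, same column names).
toy = pd.DataFrame({
    "title": ["Breaking story", None],
    "text": ["Full article body.", "Untitled article body."],
})

# Fill missing values first so that a missing title does not turn the
# whole concatenation into NaN.
toy[["title", "text"]] = toy[["title", "text"]].fillna("")

# "content" is a hypothetical column name for the combined input.
toy["content"] = (toy["title"] + " " + toy["text"]).str.strip()
print(toy["content"].tolist())
```

Filling missing values before concatenating matters because the title count above (20,242) is lower than the number of rows, so some titles are missing.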

Class Distribution

Count plots for both labels:

sns.countplot(x="label", data=news_d);
print("1: Unreliable")
print("0: Reliable")
print("Distribution of labels:")
print(news_d.label.value_counts());

Output:

1: Unreliable
0: Reliable
Distribution of labels:
1    10413
0    10387
Name: label, dtype: int64

Distribution of labels

print(round(news_d.label.value_counts(normalize=True),2)*100);

Output:

1    50.0
0    50.0
Name: label, dtype: float64

The number of unreliable articles (fake, or 1) is 10413, while the number of trustworthy articles (reliable, or 0) is 10387. Almost 50% of the articles are fake. Because the classes are balanced, accuracy is a reasonable metric for measuring how well our model does when building a classifier.
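
To see why accuracy is informative here, it helps to compute the score of a trivial classifier that always predicts the majority class. The sketch below uses only the label counts printed above; the variable names are illustrative and not part of the tutorial code.

```python
from collections import Counter

# Label counts reported above: 1 = unreliable, 0 = reliable.
label_counts = Counter({1: 10413, 0: 10387})
total = sum(label_counts.values())

# Accuracy of a trivial classifier that always predicts the majority class.
majority_label, majority_count = label_counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print(f"majority class: {majority_label}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

With a baseline this close to 50%, any accuracy well above one half reflects what the model has learned rather than class imbalance.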

Data Cleaning for Analysis

In this section, we will clean our dataset to do some analysis:

  • Drop unused rows and columns.
  • Perform null value imputation.
  • Remove special characters.
  • Remove stop words.

# Constants that are used to sanitize the datasets 

column_n = ['id', 'title', 'author', 'text', 'label']
remove_c = ['id','author']
categorical_features = []
target_col = ['label']
text_f = ['title', 'text']
# Clean Datasets
import nltk
from nltk.corpus import stopwords
import re
from nltk.stem.porter import PorterStemmer
from collections import Counter

ps = PorterStemmer()
wnl = nltk.stem.WordNetLemmatizer()

stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)

# Remove unused columns
def remove_unused_c(df,column_n=remove_c):
    df = df.drop(column_n,axis=1)
    return df

# Impute null values with None
def null_process(feature_df):
    for col in text_f:
        feature_df.loc[feature_df[col].isnull(), col] = "None"
    return feature_df

def clean_dataset(df):
    # remove unused column
    df = remove_unused_c(df)
    #impute null values
    df = null_process(df)
    return df

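# Cleaning text from unused characters
# NOTE: str.replace() matches its first argument literally, not as a regex,
# so the four replace() calls below leave ordinary text unchanged; only the
# lowercasing and stripping take effect. Use re.sub() to apply the patterns.
def clean_text(text):
    text = str(text).replace(r'http[\w:/\.]+', ' ')  # intended: remove urls
    text = str(text).replace(r'[^\.\w\s]', ' ')  # intended: keep only word characters, whitespace and periods
    text = str(text).replace('[^a-zA-Z]', ' ')  # intended: keep only letters
    text = str(text).replace(r'\s\s+', ' ')  # intended: collapse repeated whitespace
    text = text.lower().strip()
    return text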

## NLTK preprocessing can include:
# stop word removal, stemming and lemmatization
# For our project we remove stop words and lemmatize the remaining words
def nltk_preprocess(text):
    text = clean_text(text)
    # drop punctuation, then split into words
    wordlist = re.sub(r'[^\w\s]', '', text).split()
    #text = ' '.join([word for word in wordlist if word not in stopwords_dict])
    #text = [ps.stem(word) for word in wordlist if not word in stopwords_dict]
    text = ' '.join([wnl.lemmatize(word) for word in wordlist if word not in stopwords_dict])
    return text

In the above code block:

  • We've imported NLTK, a well-known platform for developing Python applications that work with human language. Next, we import re for regular expressions.
  • We import stop words from nltk.corpus. When working with words, particularly when considering semantics, we sometimes need to remove common words that add no significant meaning to a statement, such as "but", "can", "we", etc.
  • PorterStemmer is used to perform stemming with NLTK. Stemmers strip words of their morphological affixes, leaving only the word stem.
  • We import WordNetLemmatizer() from the NLTK library for lemmatization. Lemmatization is much more effective than stemming. It goes beyond word reduction and draws on a language's full vocabulary to apply morphological analysis to words, aiming to remove inflectional endings and return the base or dictionary form of a word, known as the lemma.
  • stopwords.words('english') lets us look at the list of all the English stop words supported by NLTK.
  • The remove_unused_c() function is used to remove the unused columns.
  • We impute null values with the string "None" using the null_process() function.
  • Inside the clean_dataset() function, we call the remove_unused_c() and null_process() functions. This function is responsible for data cleaning.
  • To clean text of unused characters, we created the clean_text() function.
  • For preprocessing, we remove stop words and lemmatize the remaining words. We created the nltk_preprocess() function for that purpose.
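One caveat about clean_text(): Python's str.replace() treats its first argument as literal text, so the regex patterns passed to it never match ordinary input, and the sample output in this tutorial reflects that. If you do want the patterns applied, a sketch using re.sub() could look like the following. The name clean_text_regex is hypothetical, and switching to it would change the preprocessed text (for example, digits would be dropped), so the outputs shown in this tutorial would no longer match exactly.

```python
import re

def clean_text_regex(text):
    """Regex-based variant of clean_text(); a sketch, not the tutorial's code."""
    text = str(text)
    text = re.sub(r'http[\w:/\.]+', ' ', text)  # remove urls
    text = re.sub(r'[^\.\w\s]', ' ', text)      # keep word characters, whitespace and periods
    text = re.sub(r'[^a-zA-Z]', ' ', text)      # keep letters only
    text = re.sub(r'\s\s+', ' ', text)          # collapse repeated whitespace
    return text.lower().strip()

# str.replace() leaves the text untouched because the pattern is taken literally
literal = "two  spaces".replace(r'\s\s+', ' ')
cleaned = clean_text_regex("Visit http://example.com NOW!!  It's 2016.")
print(repr(literal))
print(repr(cleaned))
```

Whether to adopt the regex variant is a design choice: it removes more noise, but it also discards numbers and splits contractions, which may or may not help the downstream model.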

Preprocessing the text and title:

# Perform data cleaning on train and test dataset by calling clean_dataset function
df = clean_dataset(news_d)
# apply preprocessing on text through apply method by calling the function nltk_preprocess
df["text"] = df.text.apply(nltk_preprocess)
# apply preprocessing on title through apply method by calling the function nltk_preprocess
df["title"] = df.title.apply(nltk_preprocess)
# Dataset after cleaning and preprocessing step
df.head()

Output:

title	text	label
0	house dem aide didnt even see comeys letter ja...	house dem aide didnt even see comeys letter ja...	1
1	flynn hillary clinton big woman campus breitbart	ever get feeling life circle roundabout rather...	0
2	truth might get fired	truth might get fired october 29 2016 tension ...	1
3	15 civilian killed single u airstrike identified	video 15 civilian killed single u airstrike id...	1
4	iranian woman jailed fictional unpublished sto...	print iranian woman sentenced six year prison ...	1

Exploratory Data Analysis

In this section, we will perform:

  • Univariate Analysis: It is a statistical analysis of the text. We will use a word cloud for that purpose. A word cloud is a visualization approach for text data where the most common term is presented in the largest font size.
  • Bivariate Analysis: Bigrams and trigrams will be used here. According to Wikipedia: "an n-gram is a contiguous sequence of n items from a given sample of text or speech. According to the application, the items can be phonemes, syllables, letters, words, or base pairs. The n-grams are typically collected from a text or speech corpus."

Single Word Cloud

The most frequent words appear in a bold and bigger font in a word cloud. This section builds a word cloud for all the words in the dataset.

We'll use the WordCloud class from the wordcloud library, and its generate() method to produce the word cloud image:

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# initialize the word cloud
wordcloud = WordCloud( background_color='black', width=800, height=600)
# generate the word cloud by passing the corpus
text_cloud = wordcloud.generate(' '.join(df['text']))
# plotting the word cloud
plt.figure(figsize=(20,30))
plt.imshow(text_cloud)
plt.axis('off')
plt.show()

Output:

WordCloud for the whole fake news data
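A word cloud is essentially a picture of word frequencies. As a rough text-only counterpart, the sketch below counts words with collections.Counter. The helper name top_words and the sample string are made up for illustration; in the tutorial you would pass ' '.join(df['text']) instead.

```python
from collections import Counter

def top_words(corpus, k=3):
    """Return the k most frequent whitespace-separated words with their counts."""
    return Counter(corpus.split()).most_common(k)

sample = "news said report said people said news"
print(top_words(sample, k=2))
```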

Word cloud for reliable news only:

true_n = ' '.join(df[df['label']==0]['text']) 
wc = wordcloud.generate(true_n)
plt.figure(figsize=(20,30))
plt.imshow(wc)
plt.axis('off')
plt.show()

Output:

Word cloud for reliable news

Word cloud for fake news only:

fake_n = ' '.join(df[df['label']==1]['text'])
wc= wordcloud.generate(fake_n)
plt.figure(figsize=(20,30))
plt.imshow(wc)
plt.axis('off')
plt.show()

Output:

Word cloud for fake news

Most Frequent Bigram (Two-word Combination)

An N-gram is a sequence of letters or words. A character unigram is made up of a single character, while a character bigram comprises a series of two characters. Similarly, word N-grams are made up of a series of n words. The word "united" is a 1-gram (unigram). The combination of the words "united state" is a 2-gram (bigram), and "new york city" is a 3-gram.
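To make the definition concrete, here is a small sketch that builds word n-grams with plain Python. The helper name word_ngrams is illustrative; the tutorial itself relies on nltk.ngrams, which produces the same kind of tuples.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return all contiguous word n-grams of `text` as tuples."""
    tokens = text.split()
    # zip over n shifted views of the token list; zip stops at the shortest one
    return list(zip(*(tokens[i:] for i in range(n))))

print(word_ngrams("new york city", 2))
print(word_ngrams("new york city", 3))

# Counting n-grams gives the frequencies that a bar chart of top n-grams visualises
bigram_counts = Counter(word_ngrams("united state of america united state", 2))
print(bigram_counts.most_common(1))
```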

Let's plot the most common bigrams in the reliable news:

def plot_top_ngrams(corpus, title, ylabel, xlabel="Number of Occurrences", n=2):
  """Utility function to plot top n-grams"""
  # count every n-gram in the corpus and keep the 20 most frequent
  true_b = (pd.Series(nltk.ngrams(corpus.split(), n)).value_counts())[:20]
  true_b.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
  plt.title(title)
  plt.ylabel(ylabel)
  plt.xlabel(xlabel)
  plt.show()

plot_top_ngrams(true_n, 'Top 20 Frequently Occurring True news Bigrams', "Bigram", n=2)

Top bigrams in reliable news

The most common bigrams in the fake news:

plot_top_ngrams(fake_n, 'Top 20 Frequently Occurring Fake news Bigrams', "Bigram", n=2)

Top bigrams in fake news

Most Frequent Trigram (Three-word Combination)

The most common trigrams in reliable news:

plot_top_ngrams(true_n, 'Top 20 Frequently Occurring True news Trigrams', "Trigrams", n=3)

Most common trigrams in reliable news

For fake news now:

plot_top_ngrams(fake_n, 'Top 20 Frequently Occurring Fake news Trigrams', "Trigrams", n=3)

Most common trigrams in fake news

The plots above give us some idea of what both classes look like. In the next section, we'll use the transformers library to build a fake news detector.

Building a Classifier by Fine-tuning BERT

This section borrows code extensively from the fine-tuning BERT tutorial to build a fake news classifier using the transformers library. For more detailed information, you can head to the original tutorial.

If you haven't installed transformers, you need to:

$ pip install transformers

Let's import the necessary libraries:

import torch
from transformers.file_utils import is_tf_available, is_torch_available, is_torch_tpu_available
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import numpy as np
from sklearn.model_selection import train_test_split

import random

We want our results to be reproducible even if we restart our environment:

def set_seed(seed: int):
    """
    Helper function for reproducible behavior to set the seed in ``random``, ``numpy``, ``torch`` and/or ``tf`` (if
    installed).

    Args:
        seed (:obj:`int`): The seed to set.
    """
    random.seed(seed)
    np.random.seed(seed)
    if is_torch_available():
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # ^^ safe to call this function even if cuda is not available
    if is_tf_available():
        import tensorflow as tf

        tf.random.set_seed(seed)

set_seed(1)

The model we're going to use is bert-base-uncased:

# the model we are going to train, base uncased BERT
# check text classification models here: https://huggingface.co/models?filter=text-classification
model_name = "bert-base-uncased"
# max sequence length for each document/sentence sample
max_length = 512

Loading the tokenizer:

# load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)

Data Preparation

Let's now remove the NaN values from the text, author, and title columns:

news_df = news_d[news_d['text'].notna()]
news_df = news_df[news_df["author"].notna()]
news_df = news_df[news_df["title"].notna()]

Next, let's create a function that takes the dataset as a Pandas dataframe and returns the train/validation splits of texts and labels as lists:

def prepare_data(df, test_size=0.2, include_title=True, include_author=True):
  texts = []
  labels = []
  for i in range(len(df)):
    text = df["text"].iloc[i]
    label = df["label"].iloc[i]
    if include_title:
      text = df["title"].iloc[i] + " - " + text
    if include_author:
      text = df["author"].iloc[i] + " : " + text
    if text and label in [0, 1]:
      texts.append(text)
      labels.append(label)
  return train_test_split(texts, labels, test_size=test_size)

train_texts, valid_texts, train_labels, valid_labels = prepare_data(news_df)

The function above takes the dataset as a dataframe and returns it as lists split into training and validation sets. Setting include_title to True means that we prepend the title column to the text we're going to use for training; setting include_author to True means we also prepend the author to the text.
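To see what a single training string looks like after these two options are applied, here is a small sketch that mirrors the concatenation order used in prepare_data(). The helper name compose_input and the sample values are hypothetical and only serve the illustration.

```python
def compose_input(text, title=None, author=None):
    """Mimic how prepare_data() builds one training string."""
    if title is not None:
        text = title + " - " + text    # same separator as include_title
    if author is not None:
        text = author + " : " + text   # same separator as include_author
    return text

example = compose_input("Body of the article.", title="Some Headline", author="Jane Doe")
print(example)
```

The author ends up first, then the title, then the article body, which is the same layout the appendix uses when building the Kaggle submission.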

Let's make sure the labels and texts have the same length:

print(len(train_texts), len(train_labels))
print(len(valid_texts), len(valid_labels))

Output:

14628 14628
3657 3657

Tokenizing the Dataset

Let's use BERT's tokenizer to tokenize our dataset:

# tokenize the dataset, truncate when passed `max_length`, 
# and pad with 0's when less than `max_length`
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)
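As a rough illustration of what truncation and padding do to a batch, here is a toy sketch on plain lists of token ids. It is not the tokenizer's actual implementation: the function name pad_and_truncate and the ids are made up, and the real BERT tokenizer also keeps the closing [SEP] token when it truncates, which this sketch does not. Note that with padding=True, sequences are padded to the longest one in the batch rather than always to max_length.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad all to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return input_ids, attention_mask

toy_batch = [[101, 7, 8, 9, 102], [101, 7, 102]]  # made-up token ids
input_ids, attention_mask = pad_and_truncate(toy_batch, max_length=4)
print(input_ids)
print(attention_mask)
```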

Converting the encodings into a PyTorch dataset:

class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)

# convert our tokenized data into a torch Dataset
train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)

Loading and Fine-tuning the Model

We'll use BertForSequenceClassification to load our BERT transformer model:

# load the model
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

We set num_labels to 2 since this is a binary classification task. The function below is a callback that computes the accuracy at each validation step:

from sklearn.metrics import accuracy_score

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  # calculate accuracy using sklearn's function
  acc = accuracy_score(labels, preds)
  return {
      'accuracy': acc,
  }
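To unpack what argmax(-1) followed by accuracy_score does, here is a dependency-free sketch of the same computation on a few made-up logits. The function name argmax_accuracy and the numbers are illustrative only.

```python
def argmax_accuracy(logits, labels):
    """Pick the highest-scoring class per row and compare with the true labels."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    return preds, correct / len(labels)

toy_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 1.7], [-0.2, 0.4]]  # one row per article
toy_labels = [0, 1, 0, 1]
preds, acc = argmax_accuracy(toy_logits, toy_labels)
print(preds, acc)
```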

Let's initialize the training parameters:

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=10,  # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=100,                # number of warmup steps for learning rate scheduler
    logging_dir='./logs',            # directory for storing logs
    load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
    # but you can specify `metric_for_best_model` argument to change to accuracy or other metric
    logging_steps=200,               # log & save weights each logging_steps
    save_steps=200,
    evaluation_strategy="steps",     # evaluate each `logging_steps`
)

I've set per_device_train_batch_size to 10, but you should set it as high as your GPU memory allows. Setting logging_steps and save_steps to 200 means we're going to perform evaluation and save the model weights every 200 training steps.
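These settings determine how many optimization steps the run takes and when evaluation happens. A quick back-of-the-envelope sketch, assuming a single device and no gradient accumulation (the variable names are illustrative):

```python
import math

num_examples = 14628   # training samples, from the length check above
batch_size = 10        # per_device_train_batch_size (single device assumed)
num_epochs = 1
eval_every = 200       # logging_steps / save_steps

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
# Steps at which evaluation runs and a checkpoint is saved
eval_steps = list(range(eval_every, total_steps + 1, eval_every))

print(total_steps)
print(eval_steps)
```

This agrees with the training log in this tutorial, which reports 1463 total optimization steps and checkpoints from 200 through 1400.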

You can check this page for more detailed information about the available training parameters.

Let's instantiate the trainer:

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
)

Training the model:

# train the model
trainer.train()

Training can take up to a few hours to finish, depending on your GPU. If you're on the free version of Colab, it should take about an hour with an NVIDIA Tesla K80. Here is the output:

***** Running training *****
  Num examples = 14628
  Num Epochs = 1
  Instantaneous batch size per device = 10
  Total train batch size (w. parallel, distributed & accumulation) = 10
  Gradient Accumulation steps = 1
  Total optimization steps = 1463
 [1463/1463 41:07, Epoch 1/1]
Step	Training Loss	Validation Loss	Accuracy
200		0.250800		0.100533		0.983867
400		0.027600		0.043009		0.993437
600		0.023400		0.017812		0.997539
800		0.014900		0.030269		0.994258
1000	0.022400		0.012961		0.998086
1200	0.009800		0.010561		0.998633
1400	0.007700		0.010300		0.998633
***** Running Evaluation *****
  Num examples = 3657
  Batch size = 20
Saving model checkpoint to ./results/checkpoint-200
Configuration saved in ./results/checkpoint-200/config.json
Model weights saved in ./results/checkpoint-200/pytorch_model.bin
<SNIPPED>
***** Running Evaluation *****
  Num examples = 3657
  Batch size = 20
Saving model checkpoint to ./results/checkpoint-1400
Configuration saved in ./results/checkpoint-1400/config.json
Model weights saved in ./results/checkpoint-1400/pytorch_model.bin

Training completed. Do not forget to share your model on huggingface.co/models =)

Loading best model from ./results/checkpoint-1400 (score: 0.010299865156412125).
TrainOutput(global_step=1463, training_loss=0.04888018785440506, metrics={'train_runtime': 2469.1722, 'train_samples_per_second': 5.924, 'train_steps_per_second': 0.593, 'total_flos': 3848788517806080.0, 'train_loss': 0.04888018785440506, 'epoch': 1.0})

Model Evaluation

Since load_best_model_at_end is set to True, the best weights are loaded when training completes. Let's evaluate the model on our validation set:

# evaluate the current model after training
trainer.evaluate()

Output:

***** Running Evaluation *****
  Num examples = 3657
  Batch size = 20
 [183/183 02:11]
{'epoch': 1.0,
 'eval_accuracy': 0.998632759092152,
 'eval_loss': 0.010299865156412125,
 'eval_runtime': 132.0374,
 'eval_samples_per_second': 27.697,
 'eval_steps_per_second': 1.386}

Saving the model and tokenizer:

# saving the fine tuned model & tokenizer
model_path = "fake-news-bert-base-uncased"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

A new folder containing the model configuration and weights will appear after running the above cell. If you want to perform prediction, simply use the from_pretrained() method we used when loading the model, and you're good to go.

Next, let's make a function that accepts the article text as an argument and returns whether it's fake or not:

def get_prediction(text, convert_to_label=False):
    # prepare our text into tokenized sequence and move it to the model's device
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to(model.device)
    # perform inference to our model
    outputs = model(**inputs)
    # get output probabilities by doing softmax
    probs = outputs[0].softmax(1)
    # map class indices to human-readable labels
    d = {
        0: "reliable",
        1: "fake"
    }
    # executing argmax function to get the candidate label
    if convert_to_label:
      return d[int(probs.argmax())]
    else:
      return int(probs.argmax())
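The softmax and argmax steps can be followed by hand on a pair of made-up logits. This is a plain-Python sketch of the arithmetic only, not the model's code; the function names and the logits are assumptions for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_label(logits, names=("reliable", "fake")):
    """Return the name and probability of the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return names[best], probs[best]

label, confidence = to_label([-2.0, 3.0])  # made-up logits for one article
print(label, round(confidence, 4))
```

Since softmax preserves the ordering of the logits, the predicted label would be the same with or without it; it only matters if you also want a probability.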

I took an example from test.csv that the model never saw to perform inference. I checked it, and it's an actual article from The New York Times:

real_news = """
Tim Tebow Will Attempt Another Comeback, This Time in Baseball - The New York Times",Daniel Victor,"If at first you don’t succeed, try a different sport. Tim Tebow, who was a Heisman   quarterback at the University of Florida but was unable to hold an N. F. L. job, is pursuing a career in Major League Baseball. <SNIPPED>
"""

The original text is in the Colab environment if you want to copy it, as it's a complete article. Let's pass it to the model and see the results:

get_prediction(real_news, convert_to_label=True)

Output:

reliable

Appendix: Creating a Submission File for Kaggle

In this section, we will predict all the articles in test.csv to create a submission file and see our accuracy on the test set of the Kaggle competition:

# read the test set
test_df = pd.read_csv("test.csv")
# make a copy of the testing set
new_df = test_df.copy()
# add a new column that contains the author, title and article content
new_df["new_text"] = new_df["author"].astype(str) + " : " + new_df["title"].astype(str) + " - " + new_df["text"].astype(str)
# get the prediction of all the test set
new_df["label"] = new_df["new_text"].apply(get_prediction)
# make the submission file
final_df = new_df[["id", "label"]]
final_df.to_csv("submit_final.csv", index=False)

After concatenating the author, title, and article text, we apply the get_prediction() function to the new column to fill the label column, then use the to_csv() method to create the submission file for Kaggle. Here's my submission score:
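The submission file itself is just a two-column CSV with id and label. As a sketch of that layout without pandas, the snippet below writes a few rows to an in-memory buffer. The function name write_submission, the ids, and the predictions are invented for illustration; the tutorial produces the real file with to_csv().

```python
import csv
import io

def write_submission(ids, labels, handle):
    """Write (id, label) rows in the two-column layout used for the submission."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "label"])
    for article_id, label in zip(ids, labels):
        writer.writerow([article_id, int(label)])

buffer = io.StringIO()
write_submission([20800, 20801, 20802], [0, 1, 1], buffer)  # made-up ids and predictions
print(buffer.getvalue())
```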

Submission score

We got 99.78% and 100% accuracy on the private and public leaderboards, respectively. That's great!

Conclusion

Alright, we're done with the tutorial. You can check this page to see various training parameters you can tweak.

If you have a custom fake news dataset for fine-tuning, you simply have to pass a list of samples to the tokenizer as we did; you won't need to change any other code after that.

Check the complete code here, or the Colab environment here.