Mike Verdone [Wed, 12 Feb 2014 14:06:23 +0000 (15:06 +0100)]
Merge pull request #199 from adonoho/use-bytearray-buffer
Reduce memory usage by writing directly into byte array buffer.
Gentlefolk,
The whole point of using a `bytearray`, as opposed to concatenating reads of `bytes`, was to reduce memory usage. This pull request takes that strategy to its logical conclusion by writing the balance of the chunk directly into the `bytearray`. This avoids creating a temporary `bytes` object of up to 8 KiB.
It does this by creating a `memoryview` of the `bytearray`. While this is still an allocation, the `memoryview`s are much smaller and are, presumably, reclaimed faster.
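The pattern can be sketched like this (a hypothetical helper written to illustrate the technique, not the library's actual code):

```python
import socket

def recv_exactly(sock, buf, start, nbytes):
    """Read exactly nbytes from sock into buf[start:start+nbytes].

    Illustrative only: buf is a preallocated bytearray, and slicing a
    memoryview of it lets recv_into() write straight into the buffer
    with no temporary bytes object.
    """
    view = memoryview(buf)[start:start + nbytes]
    while nbytes:
        n = sock.recv_into(view, nbytes)  # writes directly into buf
        if n == 0:
            raise IOError("connection closed mid-read")
        view = view[n:]                   # cheap slice, no data copy
        nbytes -= n
```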
This patch has been running for over 24 hours and has processed over 4 MTw under Python v3.3.3 on OS X 10.8.5. It has also been run for about 10 minutes on Python v2.7.6 on a similar machine. `memoryview` does not appear to have been backported to Python v2.6.*. Hence, this pull request is incompatible with that platform.
Anon,
Andrew
P.S. The changes in this pull request are larger than strictly necessary for adding this functionality. I chose to improve the naming of my variables and move some of them closer to where they are used.
Mike Verdone [Tue, 4 Feb 2014 14:40:12 +0000 (06:40 -0800)]
Merge pull request #197 from RouxRC/pr-rouxrc
Add image support + fix various issues
All right, Mike's motivation was contagious, so I cherry-picked and merged my pending commits covering everything other than the streaming code.
This should fix a few more pending issues I tagged back then: #127 #177 #192 #126 #10 #64
I also seized the occasion to add @adonoho to the authors.
RouxRC [Sat, 28 Dec 2013 16:56:46 +0000 (17:56 +0100)]
Complete tildes fix for python 2
The fix for #64 by @grahame resolved the issue of sending tweets containing
'~' on Python 3 but not on Python 2, because Python 2's urllib.urlencode
has no way to mark characters as safe. Here's a dirty hack to do it anyway.
Any better fix welcome :)
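For illustration only (this may differ from the hack actually merged here): since '~' is an unreserved character under RFC 3986, one workaround is to encode normally and then undo the tilde escape:

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

def urlencode_keep_tilde(params):
    # Python 2's urlencode offers no way to mark '~' as safe, so it
    # escapes it to %7E; RFC 3986 lists '~' as unreserved, so it is
    # safe to restore after the fact.
    return urlencode(params).replace('%7E', '~').replace('%7e', '~')
```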
RouxRC [Sat, 28 Dec 2013 16:44:14 +0000 (17:44 +0100)]
Handle multipart oauth to use image sending api
So the calls that send media (update_with_media, update_profile_image,
update_profile_background_image and update_profile_banner) require the
request to be sent differently, as a multipart form, and the OAuth
signature must be computed without taking any body parameters into account.
This does the trick, following the rules here and there:
https://dev.twitter.com/docs/uploading-media
https://dev.twitter.com/docs/api/1.1/post/statuses/update_with_media
https://dev.twitter.com/discussions/1059
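The signing rule can be illustrated with a sketch of the OAuth 1.0a signature base string: with a multipart body, only the oauth_* parameters (plus any query-string parameters) are included. The names below are illustrative, not the library's internals:

```python
from urllib.parse import quote

def oauth_base_string(method, url, oauth_params):
    # With a multipart body, the body parameters are excluded: only
    # the oauth_* parameters enter the signature base string.
    pairs = sorted((quote(k, safe=''), quote(v, safe=''))
                   for k, v in oauth_params.items())
    param_str = '&'.join('%s=%s' % kv for kv in pairs)
    # Base string: METHOD & encoded-URL & encoded-parameter-string
    return '&'.join(quote(s, safe='') for s in (method, url, param_str))
```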
RouxRC [Fri, 27 Dec 2013 09:38:15 +0000 (10:38 +0100)]
Use POST for all methods requiring it in specs
Added all missing methods from https://dev.twitter.com/docs/api/1.1
Also included some of the streaming methods which work with both GET
and POST but accept arguments like "track" whose values can quickly
grow long enough to require POST.
Mike Verdone [Mon, 3 Feb 2014 21:51:53 +0000 (13:51 -0800)]
Merge pull request #196 from adonoho/pr-fix-stream
A Simpler Fix to the Streaming Code due to Changes from Twitter on Jan. 13, 2014.
Gentlefolk,
This is a candidate release patch. I propose it become the formal branch of this library and have dubbed it version v1.10.3. I once again formally thank RouxRC for his efforts moving this library forward. Any errors in this patch remain mine and do not reflect upon RouxRC or his code.
This library is a high-performance streaming library. Compared to other Twitter libraries, it is easily an order of magnitude faster at delivering tweets to your application. Why? When streaming, this library pierces Python's urllib abstraction and takes control of the socket, interpreting the HTTP stream directly. That makes it fast. It also makes it vulnerable to changes: it needed to be upgraded when Twitter upgraded the protocol version.
Twitter's switch to HTTP v1.1 was long overdue.
Summary of changes:
- Based upon RouxRC's code, I turned off gzip compression. My version is slightly different from RouxRC's.
- Instead of incrementally reading arbitrary lengths of bytes from the socket and seeing whether they parse in the JSON parser (a good technique), the switch to HTTP chunking forced us to process in chunk-sized blocks. Based upon inspection, Twitter never sends partial JSON in a chunk, and it sends keep-alive delimiters in single 7-byte chunks. This code depends upon both of these observations. It does not do general-purpose HTTP chunk processing; it is a Twitter-specific HTTP chunk parser.
- Chunk oriented processing allowed me to isolate stream interpretation to the chunk code and migrate the wrapper code to operate exclusively using strings. This makes the wrapper code more readable.
- Once I had opened up the wrapper code, I cleaned it up. This involved modest edits in how certain socket parameters were determined and moving data exclusive to the generator into the generator and out of the containing object.
- As this is exclusively socket-oriented code, the HTTP exception catching was removed from the method. The exception handling was moved to wrap the opening of the socket by urllib.
- Due to reading the data in larger chunks and, hence, running it through the JSON parser less often, this code is about 10% faster than the prior generation.
- When Twitter hangs up on us, this code emits a `hangup` message in the stream.
- This code has been tested using Python v2.7.6 and v3.3.3 on OS X 10.8.5 (Mountain Lion). I have tested it on the high-volume sample stream and on a user stream under both versions of Python. It is believed, but not tested, that it will function under Python v2.6.x: it uses the bytearray type, which I believe has been backported all the way to Python v2.6.x. As the code is not particularly tricky, I do not foresee that it has introduced any new issues that were not already apparent in this library.
- I use this patch in production and have captured 50M+ tweets with it. It is solid and reliable. If you find it to not be so, please contact me. I use it in production and have a vested interest in ensuring that it catches all corner cases.
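The Twitter-specific chunk handling described above can be sketched as follows. This is a minimal illustration assuming, per the observations above, that every chunk holds complete JSON and keep-alives arrive as their own small chunks; it is not the library's actual parser:

```python
def split_twitter_chunks(raw):
    # raw is a chunked HTTP/1.1 body as bytes.  Each chunk is
    # "<hex size>\r\n<data>\r\n"; a zero-size chunk terminates the body.
    messages = []
    pos = 0
    while True:
        eol = raw.index(b'\r\n', pos)
        size = int(raw[pos:eol], 16)          # chunk length is hexadecimal
        if size == 0:
            break                             # terminating chunk
        data = raw[eol + 2:eol + 2 + size]
        pos = eol + 2 + size + 2              # skip data plus trailing CRLF
        if data.strip():                      # ignore bare keep-alive chunks
            messages.append(data.decode('utf-8'))
    return messages
```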
Thank you for your patience while I refined this patch. I ask Mr. Verdone to select it as the basis for moving this library forward.
Andrew W. Donoho [Tue, 28 Jan 2014 14:13:06 +0000 (08:13 -0600)]
Further refine socket management.
All HTTP chunks are read in their entirety.
Cosmetic code improvements. (The socket's blocking state is set in a more compact form after a De Morgan boolean transformation.)
Hangups by Twitter, as with timeouts, are signaled via a message to allow graceful recovery.
Andrew W. Donoho [Mon, 27 Jan 2014 13:26:44 +0000 (07:26 -0600)]
As Twitter appears to send complete JSON in the chunks, we can simplify buffer management to only operate on strings and not re-encode the string as bytes. This improves readability at the expense of breakage if Twitter starts spanning JSON across HTTP chunks. This is an unlikely change to their infrastructure. That said, this is a totally optional patch.
Andrew W. Donoho [Thu, 23 Jan 2014 23:44:46 +0000 (17:44 -0600)]
Minimize string decoding and move to use a bytearray for the buffer. This reduces memory consumption and is faster than the += operator for buffer concatenation and trimming.
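The buffer pattern can be sketched like this (illustrative, not the library's exact code):

```python
# A bytearray mutates in place, so appending new socket data and
# trimming consumed bytes avoid the full-copy cost of bytes +=.
buf = bytearray()
buf += b'{"a":1}\r\n{"b":2}'       # in-place append of received data
end = buf.index(b'\r\n')           # one complete message is available
message = bytes(buf[:end])
del buf[:end + 2]                  # in-place trim of the consumed bytes
```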
Mike Verdone [Mon, 4 Nov 2013 11:47:16 +0000 (03:47 -0800)]
Merge pull request #185 from cegme/json_status_dump
Added a json format option
This addition allows the user to get the raw JSON tweet information from each row. This is helpful when the Twitter JSON format is needed by another process.
Example usage: `twitter --format=json friends`
This would get the latest tweets from friends in raw JSON format. That JSON can be fed into another process or another database for processing.
Mike Verdone [Mon, 4 Nov 2013 11:43:32 +0000 (03:43 -0800)]
Merge pull request #178 from dkanygin/master
added timeout option to TwitterStream
In the case of low tweet volume, we can now time out and exit the iterator to update the search query or perform other housekeeping tasks.
Christan Grant [Fri, 18 Oct 2013 22:44:49 +0000 (18:44 -0400)]
Added a json format option
This addition allows the user to get the raw JSON tweet information from each row. This is helpful when the Twitter JSON format is needed by another process.
Mike Verdone [Mon, 2 Sep 2013 16:35:56 +0000 (09:35 -0700)]
Merge pull request #167 from lumbric/master
Add stream documentation
It was very difficult to find information on this topic. Now that I figured out how to get direct messages, I added it to the README.
See also questions and discussions on this topic:
http://stackoverflow.com/a/17536438/859591
https://dev.twitter.com/discussions/8081
https://dev.twitter.com/discussions/8110
Mike Verdone [Mon, 2 Sep 2013 16:34:50 +0000 (09:34 -0700)]
Merge pull request #174 from RouxRC/master
POST for "statuses/filter" in Streaming API
Twitter recommends preferring POST for the filter method in the Streaming API: https://dev.twitter.com/docs/api/1.1/post/statuses/filter
So it should be listed here.
Mike Verdone [Sat, 22 Jun 2013 17:11:13 +0000 (10:11 -0700)]
Merge pull request #156 from mattcen/master
DM archiving, Twitter API upgrade, better timestamps.
You know what's awesome? Patching a program, realising you should rebase your patch on the latest commit (I based off twitter-1.8.0, so had a fair few changes to make), and then finding all the features (namely Favourites and Mentions) that got added to master in the meantime! Love your project! I will likely tweak the Favourites and Mentions behaviours in the near future, though, so they and Timeline-fetching aren't mutually exclusive.
NOTE: You'd need to update your Twitter App settings to allow viewing and posting of DMs for this to work out of the box for people.
- Add argument to get DMs
- Adapt statuses_portion()
- Adapt statuses() to optionally handle DMs
- Adapt main() to pull down DMs if instructed
- Enforce Twitter API 1.1 for archiver and follow.
- Add option to allow more accurate timestamps (specifically the timezone specification) in output files.
Matthew Cengia [Sun, 9 Jun 2013 06:57:14 +0000 (16:57 +1000)]
Convert archiver.py and follow.py to API 1.1
This is mostly done. I've not yet decided on a tidy way to re-implement
the API limit tests, since this has changed significantly between API
versions 1.0 and 1.1.
Further, as I understand it, API 1.1 requires OAuth for everything, but
OAuth is still an optional command argument that is off by default. This
should be fairly trivial to fix, but I've not yet done so.