Mike Verdone [Sun, 9 Mar 2014 21:11:15 +0000 (22:11 +0100)]
Merge pull request #205 from sixohsix/stream_cleanup
Explicit support for heartbeat handling
Yet another crack at the stream iterator.
Now we explicitly track timeouts and heartbeat timeouts separately. Regular timeouts produce `Timeout` tokens (dicts). Heartbeat timeouts produce a `HeartbeatTimeout` token, and the iterator raises `StopIteration` if iterated again. This is a solid fix for issue #202.
If the stream is set to use `timeout=None, block=False`, then we yield only data or `None`. If `timeout` is set to a number, we yield data or `Timeout` tokens, never `None`. If we are set to `block=True` then we yield only data.
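A minimal sketch of how a consumer might tell these item kinds apart. The two token dicts below are hypothetical stand-ins defined only for this example; the library's own `Timeout` and `HeartbeatTimeout` objects may be shaped differently.

```python
# Hypothetical stand-ins for the tokens named above. The real objects are
# defined by the library and may be shaped differently.
TIMEOUT = {"timeout": True}
HEARTBEAT_TIMEOUT = {"hangup": True, "heartbeat_timeout": True}


def consume(stream):
    """Drain an iterable of stream items and sort them by kind."""
    tweets, idle_polls, timeouts = [], 0, 0
    for item in stream:
        if item is None:
            # Non-blocking mode without a timeout: nothing was available.
            idle_polls += 1
        elif item.get("heartbeat_timeout"):
            # The connection is considered dead; stop reading.
            break
        elif item.get("timeout"):
            # A regular timeout: a chance to do housekeeping.
            timeouts += 1
        else:
            tweets.append(item)
    return tweets, idle_polls, timeouts
```

In practice only some of these branches fire for a given `timeout`/`block` combination, as described above; the sketch handles all of them so that one loop works in every mode.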
Also: improve the documentation and remove some code weirdness.
Also: make the example stream program highly configurable to test every bizarre combination from the command line.
Mike Verdone [Mon, 17 Feb 2014 21:29:35 +0000 (22:29 +0100)]
Merge pull request #203 from RouxRC/pr-fix-timeout
Fix streams timeout & hangup behavior + ensure python2.6 compat
These changes fix the timeout misbehavior on low-volume streams by catching Twitter's keep-alive heartbeat signals. They build on the select.select call originally added in #178, which was still problematic, as @ksecrist pointed out in #202.
I also generalized the hangup handling to all cases, since there is no reason to stay in an infinite loop after a hangup in non-blocking mode.
And to make things easier and avoid merge issues, I adapted the refactoring from @adonoho's #201 and fixed its python2.6 compatibility.
Mike Verdone [Wed, 12 Feb 2014 14:06:23 +0000 (15:06 +0100)]
Merge pull request #199 from adonoho/use-bytearray-buffer
Reduce memory usage by writing directly into byte array buffer.
Gentlefolk,
The whole point of using a `bytearray` as opposed to concatenating reads of `bytes` was to reduce memory usage. This pull request now takes that strategy to its logical conclusion by writing the balance of the chunk directly into the `bytearray`. This saves the creation of a temporary `bytes` array, up to 8KiB in size.
It does this by creating a `memoryview` of the `bytearray`. While this is still an allocation, the `memoryview`s are much smaller and are, presumably, reclaimed faster.
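A rough sketch of the idea, not the code from this pull request: the function name and the use of a file-like `source` with `readinto` are assumptions made for illustration.

```python
import io


def fill_buffer(source, size):
    """Read up to `size` bytes from `source` straight into a bytearray.

    Returns the buffer and the number of bytes actually written into it.
    """
    buf = bytearray(size)
    view = memoryview(buf)  # slicing a memoryview does not copy the bytes
    filled = 0
    while filled < size:
        # readinto() writes into the unfilled tail of the buffer, so no
        # temporary bytes object is created for each read.
        count = source.readinto(view[filled:])
        if not count:
            break  # end of data
        filled += count
    return buf, filled
```

With a real socket, `socket.recv_into` plays the same role as `readinto` here.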
This patch has been running for over 24 hours and has processed over 4 million tweets under Python v3.3.3 on OS X 10.8.5. It has also been run for about 10 minutes on Python v2.7.6 on a similar machine. `memoryview` does not appear to have been backported to Python v2.6.*. Hence, this pull request is incompatible with that platform.
Anon,
Andrew
P.S. The changes in this pull request are larger than strictly necessary for adding this functionality. I chose to improve the naming of my variables and move some of them closer to where they are used.
Mike Verdone [Tue, 4 Feb 2014 14:40:12 +0000 (06:40 -0800)]
Merge pull request #197 from RouxRC/pr-rouxrc
Add image support + fix various issues
All right, Mike's motivation was contagious, so I cherry-picked and merged my pending commits for the parts other than streaming.
This should fix a few more pending issues I tagged back then: #127 #177 #192 #126 #10 #64
I also took the occasion to add @adonoho to the authors.
RouxRC [Sat, 28 Dec 2013 16:56:46 +0000 (17:56 +0100)]
Complete tildes fix for python 2
The fix in #64 by @grahame solved the issue of sending tweets containing
'~' on python 3 but not on python 2, because python 2's urllib.urlencode
has no `safe` argument. Here's a dirty hack to do it anyway.
Any better fix welcome :)
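One possible workaround, shown only as a sketch and not necessarily the hack used in this commit: encode as usual, then undo the escaping of '~'. The helper name is hypothetical.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def urlencode_keep_tilde(params):
    """Form-encode `params` (a dict) while leaving '~' unescaped."""
    encoded = urlencode(sorted(params.items()))
    # A literal "%7E" in the input is encoded as "%257E", so this
    # replacement only touches escapes that stood for a real '~'.
    return encoded.replace("%7E", "~")
```

OAuth 1.0 treats '~' as an unreserved character that must not be percent-encoded, which is why an escaped tilde leads to a signature mismatch.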
RouxRC [Sat, 28 Dec 2013 16:44:14 +0000 (17:44 +0100)]
Handle multipart oauth to use image sending api
The calls that send media (update_with_media, update_profile_image,
update_profile_background_image and update_profile_banner) must be
sent differently, as multipart forms, and the OAuth signature must be
computed without taking any parameter into account.
This does the trick, following the rules here and there:
https://dev.twitter.com/docs/uploading-media
https://dev.twitter.com/docs/api/1.1/post/statuses/update_with_media
https://dev.twitter.com/discussions/1059
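The signing rule can be sketched as follows. This is a generic OAuth 1.0a HMAC-SHA1 illustration, not the library's code: the function name and arguments are hypothetical, and it reads "any parameter" as the request's form fields, with the `oauth_*` values still being signed.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def _enc(value):
    # OAuth percent-encoding: everything but unreserved characters.
    return quote(str(value), safe="~")


def sign_request(method, url, oauth_params, body_params,
                 consumer_secret, token_secret, multipart=False):
    """Return a base64 HMAC-SHA1 signature for the request."""
    params = dict(oauth_params)
    if not multipart:
        # Form-encoded bodies take part in the signature;
        # multipart bodies are left out entirely.
        params.update(body_params)
    pairs = sorted((_enc(k), _enc(v)) for k, v in params.items())
    normalized = "&".join("%s=%s" % pair for pair in pairs)
    base = "&".join([method.upper(), _enc(url), _enc(normalized)])
    key = "%s&%s" % (_enc(consumer_secret), _enc(token_secret))
    digest = hmac.new(key.encode("ascii"), base.encode("ascii"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")
```

With `multipart=True` the signature is the same whatever the form fields contain, which is the property the media endpoints rely on.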
RouxRC [Fri, 27 Dec 2013 09:38:15 +0000 (10:38 +0100)]
Use POST for all methods requiring it in specs
Added all missing methods from https://dev.twitter.com/docs/api/1.1
Also included some of the streaming methods which work with both GET
and POST but accept arguments like "track" which can quickly require
POST.
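As an illustration only, the choice of HTTP verb can be pictured as a lookup. The set names and the short endpoint lists below are hypothetical; the real list in the library is much longer.

```python
# Hypothetical subset of endpoints, for illustration only.
POST_ONLY = {"statuses/update", "favorites/create", "friendships/create"}
# Endpoints that accept GET and POST but are safer to send as POST.
POST_PREFERRED = {"statuses/filter"}


def http_method_for(endpoint):
    """Pick the HTTP verb for an API endpoint path."""
    if endpoint in POST_ONLY or endpoint in POST_PREFERRED:
        return "POST"
    return "GET"
```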
Mike Verdone [Mon, 3 Feb 2014 21:51:53 +0000 (13:51 -0800)]
Merge pull request #196 from adonoho/pr-fix-stream
A Simpler Fix to the Streaming Code due to Changes from Twitter on Jan. 13, 2014.
Gentlefolk,
This is a candidate release patch. I propose it become the formal branch of this library and have dubbed it version v1.10.3. I once again formally thank RouxRC for his efforts moving this library forward. Any errors in this patch remain mine and do not reflect upon RouxRC or his code.
This library is a high performance streaming library. Compared to other Twitter libraries, it is easily an order of magnitude faster at delivering tweets to your application. Why is that? When streaming, this library pierces Python's urllib abstraction and takes control of the socket. It interprets the HTTP stream directly. That makes it fast. It also makes it vulnerable to changes. It needed to be upgraded when Twitter upgraded the protocol version.
Twitter's switch to HTTP v1.1 was long overdue.
Summary of changes:
- Based upon RouxRC's code, I turned off gzip compression. My version is slightly different than RouxRC's version.
- Previously, the code incrementally read arbitrary lengths of bytes from the socket and checked whether they parsed as JSON, which is a good technique. The switch to HTTP chunking forced us to process chunk-sized blocks instead. Based upon inspection, Twitter never sends partial JSON in a chunk. They also send keep-alive delimiters in single 7-byte chunks. This code depends upon both of these observations. It does not do general-purpose HTTP chunk processing. It is a Twitter-specific HTTP chunk parser.
- Chunk oriented processing allowed me to isolate stream interpretation to the chunk code and migrate the wrapper code to operate exclusively using strings. This makes the wrapper code more readable.
- Once I had opened up the wrapper code, I cleaned it up. This involved modest edits in how certain socket parameters were determined and moving data exclusive to the generator into the generator and out of the containing object.
- As this is exclusively socket-oriented code, the HTTP exception catching was removed from the method. The exception handling was moved to wrap the opening of the socket by urllib.
- Due to reading the data in larger chunks and, hence, running it through the JSON parser less often, this code is about 10% faster than the prior generation.
- When Twitter hangs up on us, this code emits a `hangup` message in the stream.
- This code has been tested using Python v2.7.6 and v3.3.3 on OS X 10.8.5 (Mountain Lion). I have tested it on the high volume sample stream and on a user stream under both versions of Python. It is believed, but not tested, that it will function under Python v2.6.x. It uses the bytearray type. I believe that has been backported all the way to Python v2.6.x. As the code is not particularly tricky, I do not foresee that it has introduced any new issues that were not already apparent in this library.
- I use this patch in production and have captured 50M+ tweets with it. It is solid and reliable. If you find it to not be so, please contact me; I have a vested interest in ensuring that it catches all corner cases.
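The chunk-oriented approach in the summary can be sketched roughly as below. This is not the patch itself: the function names are hypothetical, and it assumes, as the summary does, that every chunk carries only complete JSON documents or a bare keep-alive. Under that framing a chunk whose payload is a lone CRLF occupies 7 bytes on the wire, which seems to match the keep-alive observation, though that correspondence is an inference.

```python
import io
import json


def encode_chunk(payload):
    """Frame `payload` (bytes) as one HTTP/1.1 chunk (test helper)."""
    return ("%x\r\n" % len(payload)).encode("ascii") + payload + b"\r\n"


def iter_chunks(stream):
    """Yield the payload of each HTTP chunk read from a binary stream."""
    while True:
        size_line = stream.readline()
        if not size_line:
            return  # connection closed
        # The size is hexadecimal; ignore any chunk extension after ';'.
        size = int(size_line.split(b";")[0].strip().decode("ascii"), 16)
        if size == 0:
            return  # terminating chunk
        data = stream.read(size)
        stream.read(2)  # CRLF that closes the chunk
        yield data


def iter_messages(stream):
    """Decode each chunk as JSON, skipping keep-alive chunks."""
    for chunk in iter_chunks(stream):
        for line in chunk.decode("utf-8").split("\r\n"):
            if line.strip():
                yield json.loads(line)
```

A general-purpose parser would also have to buffer JSON that spans several chunks; this sketch deliberately does not.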
Thank you for your patience while I refine this patch and I ask Mr. Verdone to select this patch as the basis for moving this library forward.
Andrew W. Donoho [Tue, 28 Jan 2014 14:13:06 +0000 (08:13 -0600)]
Further refine socket management.
All HTTP chunks are read in their entirety.
Cosmetic code improvements. (The socket's blocking state is set in a more compact form after a DeMorgan's boolean transformation.)
Hangups by Twitter, as with timeouts, are signaled via a message to allow graceful recovery.
Andrew W. Donoho [Mon, 27 Jan 2014 13:26:44 +0000 (07:26 -0600)]
As Twitter appears to send complete JSON in the chunks, we can simplify buffer management to only operate on strings and not re-encode the string as bytes. This improves readability at the expense of breakage if Twitter starts spanning JSON across HTTP chunks. This is an unlikely change to their infrastructure. That said, this is a totally optional patch.
Andrew W. Donoho [Thu, 23 Jan 2014 23:44:46 +0000 (17:44 -0600)]
Minimize string decoding and move to use a bytearray for the buffer. This reduces memory consumption and is faster than the += operator for buffer concatenation and trimming.
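A small sketch of the buffer handling this describes, with a hypothetical helper name: data is appended with `extend` and consumed with `del`, both of which modify the `bytearray` in place instead of building a new object as `+=` on `bytes` does.

```python
def pop_lines(buf, delimiter=b"\r\n"):
    """Remove and return every complete line held in the bytearray `buf`."""
    lines = []
    while True:
        end = buf.find(delimiter)
        if end < 0:
            return lines  # only a partial line (or nothing) is left
        lines.append(bytes(buf[:end]))
        # Trim the consumed bytes in place.
        del buf[:end + len(delimiter)]
```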
Mike Verdone [Mon, 4 Nov 2013 11:47:16 +0000 (03:47 -0800)]
Merge pull request #185 from cegme/json_status_dump
Added a json format option
This addition allows the user to get the raw json tweet information from each row. This is helpful when the twitter json format is needed by another process.
Example usage: `twitter --format=json friends`
This would get the latest tweets from friends in the raw json format. That json can be ported into another process or another database for processing.
Mike Verdone [Mon, 4 Nov 2013 11:43:32 +0000 (03:43 -0800)]
Merge pull request #178 from dkanygin/master
added timeout option to TwitterStream
In case of low tweet volume, we can now time out and exit the iterator to update the search query or perform other housekeeping tasks.
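The general mechanism for waiting on a socket with a deadline can be sketched with `select.select`. The helper name is hypothetical and this is not the code from the pull request.

```python
import select
import socket


def wait_readable(sock, timeout):
    """Return True if `sock` has data to read within `timeout` seconds."""
    readable, _, _ = select.select([sock], [], [], timeout)
    return bool(readable)
```

A caller can loop on `wait_readable` and, whenever it returns False, hand control back to the application instead of blocking on a quiet stream.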
Christan Grant [Fri, 18 Oct 2013 22:44:49 +0000 (18:44 -0400)]
Added a json format option
This addition allows the user to get the raw json tweet information from each row. This is helpful when the twitter json format is needed by another process.