抖音链接复制了怎么找

人气:223 ℃/2023-02-10 03:25:38

抖音链接复制了怎么找?接下来就来为大家介绍一下方法,一起来看看吧。

1、打开手机上的抖音APP。

2、打开想要复制链接的视频,点击“分享”按钮。

3、弹出的页面列表滑到最右边。

4、点击“复制链接”。

5、点击后界面上显示“内容已复制”。

6、在微信里粘贴后发送给好友或者即可找到链接。

以上就是为大家介绍了抖音链接复制了怎么找,希望对大家有所帮助。

抖音链接提取网站

PHP根据抖音的分享链接来抓包抖音视频

现在抖音是个很火的短视频平台,上面有许多不错的小视频。今天教大家怎么用PHP技术来获取到抖音上的的内容。

1:打开抖音选中你认为好的视频点击分享,复制链接,然后你会获取到如下的内容:

#科比 愿你去的地方也有篮球陪伴,也能披着24号紫金战衣! #动态壁纸 https://v.douyin.com/36xkCS/ 复制此链接,打开【抖音短视频】,直接观看视频!

这段内容就是我们进行抓包使用的路径。

2:需要使用到php解析html类库文件simple_html_dom

创建 simple_html_dom.php 代码如下:

1 <?php 2 /** 3 * Website: http://sourceforge.net/projects/simplehtmldom/ 4 * Acknowledge: Jose Solorzano (https://sourceforge.net/projects/php-html/) 5 * Contributions by: 6 * Yousuke Kumakura (Attribute filters) 7 * Vadim Voituk (Negative indexes supports of "find" method) 8 * Antcs (Constructor with automatically load contents either text or file/url) 9 * 10 * all affected sections have comments starting with "PaperG" 11 * 12 * Paperg - Added case insensitive testing of the value of the selector. 13 * Paperg - Added tag_start for the starting index of tags - NOTE: This works but not accurately. 14 * This tag_start gets counted AFTER \r\n have been crushed out, and after the remove_noice calls so it will not reflect the REAL position of the tag in the source, 15 * it will almost always be smaller by some amount. 16 * We use this to determine how far into the file the tag in question is. This "percentage will never be accurate as the $dom->size is the "real" number of bytes the dom was created from. 17 * but for most purposes, it's a really good estimation. 18 * Paperg - Added the forceTagsClosed to the dom constructor. Forcing tags closed is great for malformed html, but it CAN lead to parsing errors. 19 * Allow the user to tell us how much they trust the html. 20 * Paperg add the text and plaintext to the selectors for the find syntax. plaintext implies text in the innertext of a node. text implies that the tag is a text node. 21 * This allows for us to find tags based on the text they contain. 22 * Create find_ancestor_tag to see if a tag is - at any level - inside of another specific tag. 23 * Paperg: added parse_charset so that we know about the character set of the source document. 24 * NOTE: If the user's system has a routine called get_last_retrieve_url_contents_content_type availalbe, we will assume it's returning the content-type header from the 25 * last transfer or curl_exec, and we will parse that and use it in preference to any other method of charset detection. 26 * 27 * Found infinite loop in the case of broken html in restore_noise. Rewrote to protect from that. 28 * PaperG (John Schlick) Added get_display_size for "IMG" tags. 29 * 30 * Licensed under The MIT License 31 * Redistributions of files must retain the above copyright notice. 32 * 33 * @author S.C. Chen <me578022@gmail.com> 34 * @author John Schlick 35 * @author Rus Carroll 36 * @version 1.5 ($Rev: 196 $) 37 * @package PlaceLocalinclude 38 * @subpackage simple_html_dom 39 */ 40 41 /** 42 * All of the Defines for the classes below. 43 * @author S.C. Chen <me578022@gmail.com> 44 */ 45 define('HDOM_TYPE_ELEMENT', 1); 46 define('HDOM_TYPE_COMMENT', 2); 47 define('HDOM_TYPE_TEXT', 3); 48 define('HDOM_TYPE_ENDTAG', 4); 49 define('HDOM_TYPE_ROOT', 5); 50 define('HDOM_TYPE_UNKNOWN', 6); 51 define('HDOM_QUOTE_DOUBLE', 0); 52 define('HDOM_QUOTE_SINGLE', 1); 53 define('HDOM_QUOTE_NO', 3); 54 define('HDOM_INFO_BEGIN', 0); 55 define('HDOM_INFO_END', 1); 56 define('HDOM_INFO_QUOTE', 2); 57 define('HDOM_INFO_SPACE', 3); 58 define('HDOM_INFO_TEXT', 4); 59 define('HDOM_INFO_INNER', 5); 60 define('HDOM_INFO_OUTER', 6); 61 define('HDOM_INFO_ENDSPACE',7); 62 define('DEFAULT_TARGET_CHARSET', 'UTF-8'); 63 define('DEFAULT_BR_TEXT', "\r\n"); 64 define('DEFAULT_SPAN_TEXT', " "); 65 define('MAX_FILE_SIZE', 600000); 66 // helper functions 67 // ----------------------------------------------------------------------------- 68 // get html dom from file 69 // $maxlen is defined in the code as PHP_STREAM_COPY_ALL which is defined as -1. 70 function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT) 71 { 72 // We DO force the tags to be terminated. 73 $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText); 74 // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done. 75 $contents = file_get_contents($url, $use_include_path, $context, $offset); 76 // Paperg - use our own mechanism for getting the contents as we want to control the timeout. 77 //$contents = retrieve_url_contents($url); 78 if (empty($contents) || strlen($contents) > MAX_FILE_SIZE) 79 { 80 return false; 81 } 82 // The second parameter can force the selectors to all be lowercase. 83 $dom->load($contents, $lowercase, $stripRN); 84 return $dom; 85 } 86 87 // get html dom from string 88 function str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT) 89 { 90 $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText); 91 if (empty($str) || strlen($str) > MAX_FILE_SIZE) 92 { 93 $dom->clear(); 94 return false; 95 } 96 $dom->load($str, $lowercase, $stripRN); 97 return $dom; 98 } 99 100 // dump html dom tree 101 function dump_html_tree($node, $show_attr=true, $deep=0) 102 { 103 $node->dump($node); 104 } 105 106 107 /** 108 * simple html dom node 109 * PaperG - added ability for "find" routine to lowercase the value of the selector. 110 * PaperG - added $tag_start to track the start position of the tag in the total byte index 111 * 112 * @package PlaceLocalInclude 113 */ 114 class simple_html_dom_node 115 { 116 public $nodetype = HDOM_TYPE_TEXT; 117 public $tag = 'text'; 118 public $attr = array(); 119 public $children = array(); 120 public $nodes = array(); 121 public $parent = null; 122 // The "info" array - see HDOM_INFO_... for what each element contains. 123 public $_ = array(); 124 public $tag_start = 0; 125 private $dom = null; 126 127 function __construct($dom) 128 { 129 $this->dom = $dom; 130 $dom->nodes[] = $this; 131 } 132 133 function __destruct() 134 { 135 $this->clear(); 136 } 137 138 function __toString() 139 { 140 return $this->outertext(); 141 } 142 143 // clean up memory due to php5 circular references memory leak... 144 function clear() 145 { 146 $this->dom = null; 147 $this->nodes = null; 148 $this->parent = null; 149 $this->children = null; 150 } 151 152 // dump node's tree 153 function dump($show_attr=true, $deep=0) 154 { 155 $lead = str_repeat(' ', $deep); 156 157 echo $lead.$this->tag; 158 if ($show_attr && count($this->attr)>0) 159 { 160 echo '('; 161 foreach ($this->attr as $k=>$v) 162 echo "[$k]=>\"".$this->$k.'", '; 163 echo ')'; 164 } 165 echo "\n"; 166 167 if ($this->nodes) 168 { 169 foreach ($this->nodes as $c) 170 { 171 $c->dump($show_attr, $deep 1); 172 } 173 } 174 } 175 176 177 // Debugging function to dump a single dom node with a bunch of information about it. 178 function dump_node($echo=true) 179 { 180 181 $string = $this->tag; 182 if (count($this->attr)>0) 183 { 184 $string .= '('; 185 foreach ($this->attr as $k=>$v) 186 { 187 $string .= "[$k]=>\"".$this->$k.'", '; 188 } 189 $string .= ')'; 190 } 191 if (count($this->_)>0) 192 { 193 $string .= ' $_ ('; 194 foreach ($this->_ as $k=>$v) 195 { 196 if (is_array($v)) 197 { 198 $string .= "[$k]=>("; 199 foreach ($v as $k2=>$v2) 200 { 201 $string .= "[$k2]=>\"".$v2.'", '; 202 } 203 $string .= ")"; 204 } else { 205 $string .= "[$k]=>\"".$v.'", '; 206 } 207 } 208 $string .= ")"; 209 } 210 211 if (isset($this->text)) 212 { 213 $string .= " text: (" . $this->text . ")"; 214 } 215 216 $string .= " HDOM_INNER_INFO: '"; 217 if (isset($node->_[HDOM_INFO_INNER])) 218 { 219 $string .= $node->_[HDOM_INFO_INNER] . "'"; 220 } 221 else 222 { 223 $string .= ' NULL '; 224 } 225 226 $string .= " children: " . count($this->children); 227 $string .= " nodes: " . count($this->nodes); 228 $string .= " tag_start: " . $this->tag_start; 229 $string .= "\n"; 230 231 if ($echo) 232 { 233 echo $string; 234 return; 235 } 236 else 237 { 238 return $string; 239 } 240 } 241 242 // returns the parent of node 243 // If a node is passed in, it will reset the parent of the current node to that one. 244 function parent($parent=null) 245 { 246 // I am SURE that this doesn't work properly. 247 // It fails to unset the current node from it's current parents nodes or children list first. 248 if ($parent !== null) 249 { 250 $this->parent = $parent; 251 $this->parent->nodes[] = $this; 252 $this->parent->children[] = $this; 253 } 254 255 return $this->parent; 256 } 257 258 // verify that node has children 259 function has_child() 260 { 261 return !empty($this->children); 262 } 263 264 // returns children of node 265 function children($idx=-1) 266 { 267 if ($idx===-1) 268 { 269 return $this->children; 270 } 271 if (isset($this->children[$idx])) return $this->children[$idx]; 272 return null; 273 } 274 275 // returns the first child of node 276 function first_child() 277 { 278 if (count($this->children)>0) 279 { 280 return $this->children[0]; 281 } 282 return null; 283 } 284 285 // returns the last child of node 286 function last_child() 287 { 288 if (($count=count($this->children))>0) 289 { 290 return $this->children[$count-1]; 291 } 292 return null; 293 } 294 295 // returns the next sibling of node 296 function next_sibling() 297 { 298 if ($this->parent===null) 299 { 300 return null; 301 } 302 303 $idx = 0; 304 $count = count($this->parent->children); 305 while ($idx<$count && $this!==$this->parent->children[$idx]) 306 { 307 $idx; 308 } 309 if ( $idx>=$count) 310 { 311 return null; 312 } 313 return $this->parent->children[$idx]; 314 } 315 316 // returns the previous sibling of node 317 function prev_sibling() 318 { 319 if ($this->parent===null) return null; 320 $idx = 0; 321 $count = count($this->parent->children); 322 while ($idx<$count && $this!==$this->parent->children[$idx]) 323 $idx; 324 if (--$idx<0) return null; 325 return $this->parent->children[$idx]; 326 } 327 328 // function to locate a specific ancestor tag in the path to the root. 329 function find_ancestor_tag($tag) 330 { 331 global $debugObject; 332 if (is_object($debugObject)) { $debugObject->debugLogEntry(1); } 333 334 // Start by including ourselves in the comparison. 335 $returnDom = $this; 336 337 while (!is_null($returnDom)) 338 { 339 if (is_object($debugObject)) { $debugObject->debugLog(2, "Current tag is: " . $returnDom->tag); } 340 341 if ($returnDom->tag == $tag) 342 { 343 break; 344 } 345 $returnDom = $returnDom->parent; 346 } 347 return $returnDom; 348 } 349 350 // get dom node's inner html 351 function innertext() 352 { 353 if (isset($this->_[HDOM_INFO_INNER])) return $this->_[HDOM_INFO_INNER]; 354 if (isset($this->_[HDOM_INFO_TEXT])) return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]); 355 356 $ret = ''; 357 foreach ($this->nodes as $n) 358 $ret .= $n->outertext(); 359 return $ret; 360 } 361 362 // get dom node's outer text (with tag) 363 function outertext() 364 { 365 global $debugObject; 366 if (is_object($debugObject)) 367 { 368 $text = ''; 369 if ($this->tag == 'text') 370 { 371 if (!empty($this->text)) 372 { 373 $text = " with text: " . $this->text; 374 } 375 } 376 $debugObject->debugLog(1, 'Innertext of tag: ' . $this->tag . $text); 377 } 378 379 if ($this->tag==='root') return $this->innertext(); 380 381 // trigger callback 382 if ($this->dom && $this->dom->callback!==null) 383 { 384 call_user_func_array($this->dom->callback, array($this)); 385 } 386 387 if (isset($this->_[HDOM_INFO_OUTER])) return $this->_[HDOM_INFO_OUTER]; 388 if (isset($this->_[HDOM_INFO_TEXT])) return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]); 389 390 // render begin tag 391 if ($this->dom && $this->dom->nodes[$this->_[HDOM_INFO_BEGIN]]) 392 { 393 $ret = $this->dom->nodes[$this->_[HDOM_INFO_BEGIN]]->makeup(); 394 } else { 395 $ret = ""; 396 } 397 398 // render inner text 399 if (isset($this->_[HDOM_INFO_INNER])) 400 { 401 // If it's a br tag... don't return the HDOM_INNER_INFO that we may or may not have added. 402 if ($this->tag != "br") 403 { 404 $ret .= $this->_[HDOM_INFO_INNER]; 405 } 406 } else { 407 if ($this->nodes) 408 { 409 foreach ($this->nodes as $n) 410 { 411 $ret .= $this->convert_text($n->outertext()); 412 } 413 } 414 } 415 416 // render end tag 417 if (isset($this->_[HDOM_INFO_END]) && $this->_[HDOM_INFO_END]!=0) 418 $ret .= '</'.$this->tag.'>'; 419 return $ret; 420 } 421 422 // get dom node's plain text 423 function text() 424 { 425 if (isset($this->_[HDOM_INFO_INNER])) return $this->_[HDOM_INFO_INNER]; 426 switch ($this->nodetype) 427 { 428 case HDOM_TYPE_TEXT: return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]); 429 case HDOM_TYPE_COMMENT: return ''; 430 case HDOM_TYPE_UNKNOWN: return ''; 431 } 432 if (strcasecmp($this->tag, 'script')===0) return ''; 433 if (strcasecmp($this->tag, 'style')===0) return ''; 434 435 $ret = ''; 436 // In rare cases, (always node type 1 or HDOM_TYPE_ELEMENT - observed for some span tags, and some p tags) $this->nodes is set to NULL. 437 // NOTE: This indicates that there is a problem where it's set to NULL without a clear happening. 438 // WHY is this happening? 439 if (!is_null($this->nodes)) 440 { 441 foreach ($this->nodes as $n) 442 { 443 $ret .= $this->convert_text($n->text()); 444 } 445 446 // If this node is a span... add a space at the end of it so multiple spans don't run into each other. This is plaintext after all. 447 if ($this->tag == "span") 448 { 449 $ret .= $this->dom->default_span_text; 450 } 451 452 453 } 454 return $ret; 455 } 456 457 function xmltext() 458 { 459 $ret = $this->innertext(); 460 $ret = str_ireplace('<![CDATA[', '', $ret); 461 $ret = str_replace(']]>', '', $ret); 462 return $ret; 463 } 464 465 // build node's text with tag 466 function makeup() 467 { 468 // text, comment, unknown 469 if (isset($this->_[HDOM_INFO_TEXT])) return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]); 470 471 $ret = '<'.$this->tag; 472 $i = -1; 473 474 foreach ($this->attr as $key=>$val) 475 { 476 $i; 477 478 // skip removed attribute 479 if ($val===null || $val===false) 480 continue; 481 482 $ret .= $this->_[HDOM_INFO_SPACE][$i][0]; 483 //no value attr: nowrap, checked selected... 484 if ($val===true) 485 $ret .= $key; 486 else { 487 switch ($this->_[HDOM_INFO_QUOTE][$i]) 488 { 489 case HDOM_QUOTE_DOUBLE: $quote = '"'; break; 490 case HDOM_QUOTE_SINGLE: $quote = '\''; break; 491 default: $quote = ''; 492 } 493 $ret .= $key.$this->_[HDOM_INFO_SPACE][$i][1].'='.$this->_[HDOM_INFO_SPACE][$i][2].$quote.$val.$quote; 494 } 495 } 496 $ret = $this->dom->restore_noise($ret); 497 return $ret . $this->_[HDOM_INFO_ENDSPACE] . '>'; 498 } 499 500 // find elements by css selector 501 //PaperG - added ability for find to lowercase the value of the selector. 502 function find($selector, $idx=null, $lowercase=false) 503 { 504 $selectors = $this->parse_selector($selector); 505 if (($count=count($selectors))===0) return array(); 506 $found_keys = array(); 507 508 // find each selector 509 for ($c=0; $c<$count; $c) 510 { 511 // The change on the below line was documented on the sourceforge code tracker id 2788009 512 // used to be: if (($levle=count($selectors[0]))===0) return array(); 513 if (($levle=count($selectors[$c]))===0) return array(); 514 if (!isset($this->_[HDOM_INFO_BEGIN])) return array(); 515 516 $head = array($this->_[HDOM_INFO_BEGIN]=>1); 517 518 // handle descendant selectors, no recursive! 519 for ($l=0; $l<$levle; $l) 520 { 521 $ret = array(); 522 foreach ($head as $k=>$v) 523 { 524 $n = ($k===-1) ? $this->dom->root : $this->dom->nodes[$k]; 525 //PaperG - Pass this optional parameter on to the seek function. 526 $n->seek($selectors[$c][$l], $ret, $lowercase); 527 } 528 $head = $ret; 529 } 530 531 foreach ($head as $k=>$v) 532 { 533 if (!isset($found_keys[$k])) 534 $found_keys[$k] = 1; 535 } 536 } 537 538 // sort keys 539 ksort($found_keys); 540 541 $found = array(); 542 foreach ($found_keys as $k=>$v) 543 $found[] = $this->dom->nodes[$k]; 544 545 // return nth-element or array 546 if (is_null($idx)) return $found; 547 else if ($idx<0) $idx = count($found) $idx; 548 return (isset($found[$idx])) ? $found[$idx] : null; 549 } 550 551 // seek for given conditions 552 // PaperG - added parameter to allow for case insensitive testing of the value of a selector. 553 protected function seek($selector, &$ret, $lowercase=false) 554 { 555 global $debugObject; 556 if (is_object($debugObject)) { $debugObject->debugLogEntry(1); } 557 558 list($tag, $key, $val, $exp, $no_key) = $selector; 559 560 // xpath index 561 if ($tag && $key && is_numeric($key)) 562 { 563 $count = 0; 564 foreach ($this->children as $c) 565 { 566 if ($tag==='*' || $tag===$c->tag) { 567 if ( $count==$key) { 568 $ret[$c->_[HDOM_INFO_BEGIN]] = 1; 569 return; 570 } 571 } 572 } 573 return; 574 } 575 576 $end = (!empty($this->_[HDOM_INFO_END])) ? $this->_[HDOM_INFO_END] : 0; 577 if ($end==0) { 578 $parent = $this->parent; 579 while (!isset($parent->_[HDOM_INFO_END]) && $parent!==null) { 580 $end -= 1; 581 $parent = $parent->parent; 582 } 583 $end = $parent->_[HDOM_INFO_END]; 584 } 585 586 for ($i=$this->_[HDOM_INFO_BEGIN] 1; $i<$end; $i) { 587 $node = $this->dom->nodes[$i]; 588 589 $pass = true; 590 591 if ($tag==='*' && !$key) { 592 if (in_array($node, $this->children, true)) 593 $ret[$i] = 1; 594 continue; 595 } 596 597 // compare tag 598 if ($tag && $tag!=$node->tag && $tag!=='*') {$pass=false;} 599 // compare key 600 if ($pass && $key) { 601 if ($no_key) { 602 if (isset($node->attr[$key])) $pass=false; 603 } else { 604 if (($key != "plaintext") && !isset($node->attr[$key])) $pass=false; 605 } 606 } 607 // compare value 608 if ($pass && $key && $val && $val!=='*') { 609 // If they have told us that this is a "plaintext" search then we want the plaintext of the node - right? 610 if ($key == "plaintext") { 611 // $node->plaintext actually returns $node->text(); 612 $nodeKeyValue = $node->text(); 613 } else { 614 // this is a normal search, we want the value of that attribute of the tag. 615 $nodeKeyValue = $node->attr[$key]; 616 } 617 if (is_object($debugObject)) {$debugObject->debugLog(2, "testing node: " . $node->tag . " for attribute: " . $key . $exp . $val . " where nodes value is: " . $nodeKeyValue);} 618 619 //PaperG - If lowercase is set, do a case insensitive test of the value of the selector. 620 if ($lowercase) { 621 $check = $this->match($exp, strtolower($val), strtolower($nodeKeyValue)); 622 } else { 623 $check = $this->match($exp, $val, $nodeKeyValue); 624 } 625 if (is_object($debugObject)) {$debugObject->debugLog(2, "after match: " . ($check ? "true" : "false"));} 626 627 // handle multiple class 628 if (!$check && strcasecmp($key, 'class')===0) { 629 foreach (explode(' ',$node->attr[$key]) as $k) { 630 // Without this, there were cases where leading, trailing, or double spaces lead to our comparing blanks - bad form. 631 if (!empty($k)) { 632 if ($lowercase) { 633 $check = $this->match($exp, strtolower($val), strtolower($k)); 634 } else { 635 $check = $this->match($exp, $val, $k); 636 } 637 if ($check) break; 638 } 639 } 640 } 641 if (!$check) $pass = false; 642 } 643 if ($pass) $ret[$i] = 1; 644 unset($node); 645 } 646 // It's passed by reference so this is actually what this function returns. 647 if (is_object($debugObject)) {$debugObject->debugLog(1, "EXIT - ret: ", $ret);} 648 } 649 650 protected function match($exp, $pattern, $value) { 651 global $debugObject; 652 if (is_object($debugObject)) {$debugObject->debugLogEntry(1);} 653 654 switch ($exp) { 655 case '=': 656 return ($value===$pattern); 657 case '!=': 658 return ($value!==$pattern); 659 case '^=': 660 return preg_match("/^".preg_quote($pattern,'/')."/", $value); 661 case '$=': 662 return preg_match("/".preg_quote($pattern,'/')."$/", $value); 663 case '*=': 664 if ($pattern[0]=='/') { 665 return preg_match($pattern, $value); 666 } 667 return preg_match("/".$pattern."/i", $value); 668 } 669 return false; 670 } 671 672 protected function parse_selector($selector_string) { 673 global $debugObject; 674 if (is_object($debugObject)) {$debugObject->debugLogEntry(1);} 675 676 // pattern of CSS selectors, modified from mootools 677 // Paperg: Add the colon to the attrbute, so that it properly finds <tag attr:ibute="something" > like google does. 678 // Note: if you try to look at this attribute, yo MUST use getAttribute since $dom->x:y will fail the php syntax check. 679 // Notice the \[ starting the attbute? and the @? following? This implies that an attribute can begin with an @ sign that is not captured. 680 // This implies that an html attribute specifier may start with an @ sign that is NOT captured by the expression. 681 // farther study is required to determine of this should be documented or removed. 682 // $pattern = "/([\w-:\*]*)(?:\#([\w-] )|\.([\w-] ))?(?:\[@?(!?[\w-] )(?:([!*^$]?=)[\"']?(.*?)[\"']?)?\])?([\/, ] )/is"; 683 $pattern = "/([\w-:\*]*)(?:\#([\w-] )|\.([\w-] ))?(?:\[@?(!?[\w-:] )(?:([!*^$]?=)[\"']?(.*?)[\"']?)?\])?([\/, ] )/is"; 684 preg_match_all($pattern, trim($selector_string).' ', $matches, PREG_SET_ORDER); 685 if (is_object($debugObject)) {$debugObject->debugLog(2, "Matches Array: ", $matches);} 686 687 $selectors = array(); 688 $result = array(); 689 //print_r($matches); 690 691 foreach ($matches as $m) { 692 $m[0] = trim($m[0]); 693 if ($m[0]==='' || $m[0]==='/' || $m[0]==='//') continue; 694 // for browser generated xpath 695 if ($m[1]==='tbody') continue; 696 697 list($tag, $key, $val, $exp, $no_key) = array($m[1], null, null, '=', false); 698 if (!empty($m[2])) {$key='id'; $val=$m[2];} 699 if (!empty($m[3])) {$key='class'; $val=$m[3];} 700 if (!empty($m[4])) {$key=$m[4];} 701 if (!empty($m[5])) {$exp=$m[5];} 702 if (!empty($m[6])) {$val=$m[6];} 703 704 // convert to lowercase 705 if ($this->dom->lowercase) {$tag=strtolower($tag); $key=strtolower($key);} 706 //elements that do NOT have the specified attribute 707 if (isset($key[0]) && $key[0]==='!') {$key=substr($key, 1); $no_key=true;} 708 709 $result[] = array($tag, $key, $val, $exp, $no_key); 710 if (trim($m[7])===',') { 711 $selectors[] = $result; 712 $result = array(); 713 } 714 } 715 if (count($result)>0) 716 $selectors[] = $result; 717 return $selectors; 718 } 719 720 function __get($name) { 721 if (isset($this->attr[$name])) 722 { 723 return $this->convert_text($this->attr[$name]); 724 } 725 switch ($name) { 726 case 'outertext': return $this->outertext(); 727 case 'innertext': return $this->innertext(); 728 case 'plaintext': return $this->text(); 729 case 'xmltext': return $this->xmltext(); 730 default: return array_key_exists($name, $this->attr); 731 } 732 } 733 734 function __set($name, $value) { 735 switch ($name) { 736 case 'outertext': return $this->_[HDOM_INFO_OUTER] = $value; 737 case 'innertext': 738 if (isset($this->_[HDOM_INFO_TEXT])) return $this->_[HDOM_INFO_TEXT] = $value; 739 return $this->_[HDOM_INFO_INNER] = $value; 740 } 741 if (!isset($this->attr[$name])) { 742 $this->_[HDOM_INFO_SPACE][] = array(' ', '', ''); 743 $this->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_DOUBLE; 744 } 745 $this->attr[$name] = $value; 746 } 747 748 function __isset($name) { 749 switch ($name) { 750 case 'outertext': return true; 751 case 'innertext': return true; 752 case 'plaintext': return true; 753 } 754 //no value attr: nowrap, checked selected... 755 return (array_key_exists($name, $this->attr)) ? true : isset($this->attr[$name]); 756 } 757 758 function __unset($name) { 759 if (isset($this->attr[$name])) 760 unset($this->attr[$name]); 761 } 762 763 // PaperG - Function to convert the text from one character set to another if the two sets are not the same. 764 function convert_text($text) 765 { 766 global $debugObject; 767 if (is_object($debugObject)) {$debugObject->debugLogEntry(1);} 768 769 $converted_text = $text; 770 771 $sourceCharset = ""; 772 $targetCharset = ""; 773 774 if ($this->dom) 775 { 776 $sourceCharset = strtoupper($this->dom->_charset); 777 $targetCharset = strtoupper($this->dom->_target_charset); 778 } 779 if (is_object($debugObject)) {$debugObject->debugLog(3, "source charset: " . $sourceCharset . " target charaset: " . $targetCharset);} 780 781 if (!empty($sourceCharset) && !empty($targetCharset) && (strcasecmp($sourceCharset, $targetCharset) != 0)) 782 { 783 // Check if the reported encoding could have been incorrect and the text is actually already UTF-8 784 if ((strcasecmp($targetCharset, 'UTF-8') == 0) && ($this->is_utf8($text))) 785 { 786 $converted_text = $text; 787 } 788 else 789 { 790 $converted_text = iconv($sourceCharset, $targetCharset, $text); 791 } 792 } 793 794 // Lets make sure that we don't have that silly BOM issue with any of the utf-8 text we output. 795 if ($targetCharset == 'UTF-8') 796 { 797 if (substr($converted_text, 0, 3) == "\xef\xbb\xbf") 798 { 799 $converted_text = substr($converted_text, 3); 800 } 801 if (substr($converted_text, -3) == "\xef\xbb\xbf") 802 { 803 $converted_text = substr($converted_text, 0, -3); 804 } 805 } 806 807 return $converted_text; 808 } 809 810 /** 811 * Returns true if $string is valid UTF-8 and false otherwise. 812 * 813 * @param mixed $str String to be tested 814 * @return boolean 815 */ 816 static function is_utf8($str) 817 { 818 $c=0; $b=0; 819 $bits=0; 820 $len=strlen($str); 821 for($i=0; $i<$len; $i ) 822 { 823 $c=ord($str[$i]); 824 if($c > 128) 825 { 826 if(($c >= 254)) return false; 827 elseif($c >= 252) $bits=6; 828 elseif($c >= 248) $bits=5; 829 elseif($c >= 240) $bits=4; 830 elseif($c >= 224) $bits=3; 831 elseif($c >= 192) $bits=2; 832 else return false; 833 if(($i $bits) > $len) return false; 834 while($bits > 1) 835 { 836 $i ; 837 $b=ord($str[$i]); 838 if($b < 128 || $b > 191) return false; 839 $bits--; 840 } 841 } 842 } 843 return true; 844 } 845 /* 846 function is_utf8($string) 847 { 848 //this is buggy 849 return (utf8_encode(utf8_decode($string)) == $string); 850 } 851 */ 852 853 /** 854 * Function to try a few tricks to determine the displayed size of an img on the page. 855 * NOTE: This will ONLY work on an IMG tag. Returns FALSE on all other tag types. 856 * 857 * @author John Schlick 858 * @version April 19 2012 859 * @return array an array containing the 'height' and 'width' of the image on the page or -1 if we can't figure it out. 860 */ 861 function get_display_size() 862 { 863 global $debugObject; 864 865 $width = -1; 866 $height = -1; 867 868 if ($this->tag !== 'img') 869 { 870 return false; 871 } 872 873 // See if there is aheight or width attribute in the tag itself. 874 if (isset($this->attr['width'])) 875 { 876 $width = $this->attr['width']; 877 } 878 879 if (isset($this->attr['height'])) 880 { 881 $height = $this->attr['height']; 882 } 883 884 // Now look for an inline style. 885 if (isset($this->attr['style'])) 886 { 887 // Thanks to user gnarf from stackoverflow for this regular expression. 888 $attributes = array(); 889 preg_match_all("/([\w-] )\s*:\s*([^;] )\s*;?/", $this->attr['style'], $matches, PREG_SET_ORDER); 890 foreach ($matches as $match) { 891 $attributes[$match[1]] = $match[2]; 892 } 893 894 // If there is a width in the style attributes: 895 if (isset($attributes['width']) && $width == -1) 896 { 897 // check that the last two characters are px (pixels) 898 if (strtolower(substr($attributes['width'], -2)) == 'px') 899 { 900 $proposed_width = substr($attributes['width'], 0, -2); 901 // Now make sure that it's an integer and not something stupid. 902 if (filter_var($proposed_width, FILTER_VALIDATE_INT)) 903 { 904 $width = $proposed_width; 905 } 906 } 907 } 908 909 // If there is a width in the style attributes: 910 if (isset($attributes['height']) && $height == -1) 911 { 912 // check that the last two characters are px (pixels) 913 if (strtolower(substr($attributes['height'], -2)) == 'px') 914 { 915 $proposed_height = substr($attributes['height'], 0, -2); 916 // Now make sure that it's an integer and not something stupid. 917 if (filter_var($proposed_height, FILTER_VALIDATE_INT)) 918 { 919 $height = $proposed_height; 920 } 921 } 922 } 923 924 } 925 926 // Future enhancement: 927 // Look in the tag to see if there is a class or id specified that has a height or width attribute to it. 928 929 // Far future enhancement 930 // Look at all the parent tags of this image to see if they specify a class or id that has an img selector that specifies a height or width 931 // Note that in this case, the class or id will have the img subselector for it to apply to the image. 932 933 // ridiculously far future development 934 // If the class or id is specified in a SEPARATE css file thats not on the page, go get it and do what we were just doing for the ones on the page. 935 936 $result = array('height' => $height, 937 'width' => $width); 938 return $result; 939 } 940 941 // camel naming conventions 942 function getAllAttributes() {return $this->attr;} 943 function getAttribute($name) {return $this->__get($name);} 944 function setAttribute($name, $value) {$this->__set($name, $value);} 945 function hasAttribute($name) {return $this->__isset($name);} 946 function removeAttribute($name) {$this->__set($name, null);} 947 function getElementById($id) {return $this->find("#$id", 0);} 948 function getElementsById($id, $idx=null) {return $this->find("#$id", $idx);} 949 function getElementByTagName($name) {return $this->find($name, 0);} 950 function getElementsByTagName($name, $idx=null) {return $this->find($name, $idx);} 951 function parentNode() {return $this->parent();} 952 function childNodes($idx=-1) {return $this->children($idx);} 953 function firstChild() {return $this->first_child();} 954 function lastChild() {return $this->last_child();} 955 function nextSibling() {return $this->next_sibling();} 956 function previousSibling() {return $this->prev_sibling();} 957 function hasChildNodes() {return $this->has_child();} 958 function nodeName() {return $this->tag;} 959 function appendChild($node) {$node->parent($this); return $node;} 960 961 } 962 963 /** 964 * simple html dom parser 965 * Paperg - in the find routine: allow us to specify that we want case insensitive testing of the value of the selector. 966 * Paperg - change $size from protected to public so we can easily access it 967 * Paperg - added ForceTagsClosed in the constructor which tells us whether we trust the html or not. Default is to NOT trust it. 968 * 969 * @package PlaceLocalInclude 970 */ 971 class simple_html_dom 972 { 973 public $root = null; 974 public $nodes = array(); 975 public $callback = null; 976 public $lowercase = false; 977 // Used to keep track of how large the text was when we started. 978 public $original_size; 979 public $size; 980 protected $pos; 981 protected $doc; 982 protected $char; 983 protected $cursor; 984 protected $parent; 985 protected $noise = array(); 986 protected $token_blank = " \t\r\n"; 987 protected $token_equal = ' =/>'; 988 protected $token_slash = " />\r\n\t"; 989 protected $token_attr = ' >'; 990 // Note that this is referenced by a child node, and so it needs to be public for that node to see this information. 991 public $_charset = ''; 992 public $_target_charset = ''; 993 protected $default_br_text = ""; 994 public $default_span_text = ""; 995 996 // use isset instead of in_array, performance boost about 30%... 997 protected $self_closing_tags = array('img'=>1, 'br'=>1, 'input'=>1, 'meta'=>1, 'link'=>1, 'hr'=>1, 'base'=>1, 'embed'=>1, 'spacer'=>1); 998 protected $block_tags = array('root'=>1, 'body'=>1, 'form'=>1, 'div'=>1, 'span'=>1, 'table'=>1); 999 // Known sourceforge issue #29773411000 // B tags that are not closed cause us to return everything to the end of the document.1001 protected $optional_closing_tags = array(1002 'tr'=>array('tr'=>1, 'td'=>1, 'th'=>1),1003 'th'=>array('th'=>1),1004 'td'=>array('td'=>1),1005 'li'=>array('li'=>1),1006 'dt'=>array('dt'=>1, 'dd'=>1),1007 'dd'=>array('dd'=>1, 'dt'=>1),1008 'dl'=>array('dd'=>1, 'dt'=>1),1009 'p'=>array('p'=>1),1010 'nobr'=>array('nobr'=>1),1011 'b'=>array('b'=>1),1012 'option'=>array('option'=>1),1013 );1014 1015 function __construct($str=null, $lowercase=true, $forceTagsClosed=true, $target_charset=DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)1016 {1017 if ($str)1018 {1019 if (preg_match("/^http:\/\//i",$str) || is_file($str))1020 {1021 $this->load_file($str);1022 }1023 else1024 {1025 $this->load($str, $lowercase, $stripRN, $defaultBRText, $defaultSpanText);1026 }1027 }1028 // Forcing tags to be closed implies that we don't trust the html, but it can lead to parsing errors if we SHOULD trust the html.1029 if (!$forceTagsClosed) {1030 $this->optional_closing_array=array();1031 }1032 $this->_target_charset = $target_charset;1033 }1034 1035 function __destruct()1036 {1037 $this->clear();1038 }1039 1040 // load html from string1041 function load($str, $lowercase=true, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)1042 {1043 global $debugObject;1044 1045 // prepare1046 $this->prepare($str, $lowercase, $stripRN, $defaultBRText, $defaultSpanText);1047 // strip out comments1048 $this->remove_noise("'<!--(.*?)-->'is");1049 // strip out cdata1050 $this->remove_noise("'<!\[CDATA\[(.*?)\]\]>'is", true);1051 // Per sourceforge http://sourceforge.net/tracker/?func=detail&aid=2949097&group_id=218559&atid=10440371052 // Script tags removal now preceeds style tag removal.1053 // strip out <script> tags1054 $this->remove_noise("'<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>'is");1055 $this->remove_noise("'<\s*script\s*>(.*?)<\s*/\s*script\s*>'is");1056 // strip out <style> tags1057 $this->remove_noise("'<\s*style[^>]*[^/]>(.*?)<\s*/\s*style\s*>'is");1058 $this->remove_noise("'<\s*style\s*>(.*?)<\s*/\s*style\s*>'is");1059 // strip out preformatted tags1060 $this->remove_noise("'<\s*(?:code)[^>]*>(.*?)<\s*/\s*(?:code)\s*>'is");1061 // strip out server side scripts1062 $this->remove_noise("'(<\?)(.*?)(\?>)'s", true);1063 // strip smarty scripts1064 $this->remove_noise("'(\{\w)(.*?)(\})'s", true);1065 1066 // parsing1067 while ($this->parse());1068 // end1069 $this->root->_[HDOM_INFO_END] = $this->cursor;1070 $this->parse_charset();1071 1072 // make load function chainable1073 return $this;1074 1075 }1076 1077 // load html from file1078 function load_file()1079 {1080 $args = func_get_args();1081 $this->load(call_user_func_array('file_get_contents', $args), true);1082 // Throw an error if we can't properly load the dom.1083 if (($error=error_get_last())!==null) {1084 $this->clear();1085 return false;1086 }1087 }1088 1089 // set callback function1090 function set_callback($function_name)1091 {1092 $this->callback = $function_name;1093 }1094 1095 // remove callback function1096 function remove_callback()1097 {1098 $this->callback = null;1099 }1100 1101 // save dom as string1102 function save($filepath='')1103 {1104 $ret = $this->root->innertext();1105 if ($filepath!=='') file_put_contents($filepath, $ret, LOCK_EX);1106 return $ret;1107 }1108 1109 // find dom node by css selector1110 // Paperg - allow us to specify that we want case insensitive testing of the value of the selector.1111 function find($selector, $idx=null, $lowercase=false)1112 {1113 return $this->root->find($selector, $idx, $lowercase);1114 }1115 1116 // clean up memory due to php5 circular references memory leak...1117 function clear()1118 {1119 foreach ($this->nodes as $n) {$n->clear(); $n = null;}1120 // This add next line is documented in the sourceforge repository. 2977248 as a fix for ongoing memory leaks that occur even with the use of clear.1121 if (isset($this->children)) foreach ($this->children as $n) {$n->clear(); $n = null;}1122 if (isset($this->parent)) {$this->parent->clear(); unset($this->parent);}1123 if (isset($this->root)) {$this->root->clear(); unset($this->root);}1124 unset($this->doc);1125 unset($this->noise);1126 }1127 1128 function dump($show_attr=true)1129 {1130 $this->root->dump($show_attr);1131 }1132 1133 // prepare HTML data and init everything1134 protected function prepare($str, $lowercase=true, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)1135 {1136 $this->clear();1137 1138 // set the length of content before we do anything to it.1139 $this->size = strlen($str);1140 // Save the original size of the html that we got in. It might be useful to someone.1141 $this->original_size = $this->size;1142 1143 //before we save the string as the doc... strip out the \r \n's if we are told to.1144 if ($stripRN) {1145 $str = str_replace("\r", " ", $str);1146 $str = str_replace("\n", " ", $str);1147 1148 // set the length of content since we have changed it.1149 $this->size = strlen($str);1150 }1151 1152 $this->doc = $str;1153 $this->pos = 0;1154 $this->cursor = 1;1155 $this->noise = array();1156 $this->nodes = array();1157 $this->lowercase = $lowercase;1158 $this->default_br_text = $defaultBRText;1159 $this->default_span_text = $defaultSpanText;1160 $this->root = new simple_html_dom_node($this);1161 $this->root->tag = 'root';1162 $this->root->_[HDOM_INFO_BEGIN] = -1;1163 $this->root->nodetype = HDOM_TYPE_ROOT;1164 $this->parent = $this->root;1165 if ($this->size>0) $this->char = $this->doc[0];1166 }1167 1168 // parse html content1169 protected function parse()1170 {1171 if (($s = $this->copy_until_char('<'))==='')1172 {1173 return $this->read_tag();1174 }1175 1176 // text1177 $node = new simple_html_dom_node($this);1178 $this->cursor;1179 $node->_[HDOM_INFO_TEXT] = $s;1180 $this->link_nodes($node, false);1181 return true;1182 }1183 1184 // PAPERG - dkchou - added this to try to identify the character set of the page we have just parsed so we know better how to spit it out later.1185 // NOTE: IF you provide a routine called get_last_retrieve_url_contents_content_type which returns the CURLINFO_CONTENT_TYPE from the last curl_exec1186 // (or the content_type header from the last transfer), we will parse THAT, and if a charset is specified, we will use it over any other mechanism.1187 protected function parse_charset()1188 {1189 global $debugObject;1190 1191 $charset = null;1192 1193 if (function_exists('get_last_retrieve_url_contents_content_type'))1194 {1195 $contentTypeHeader = get_last_retrieve_url_contents_content_type();1196 $success = preg_match('/charset=(. )/', $contentTypeHeader, $matches);1197 if ($success)1198 {1199 $charset = $matches[1];1200 if (is_object($debugObject)) {$debugObject->debugLog(2, 'header content-type found charset of: ' . $charset);}1201 }1202 1203 }1204 1205 if (empty($charset))1206 {1207 $el = $this->root->find('meta[http-equiv=Content-Type]',0);1208 if (!empty($el))1209 {1210 $fullvalue = $el->content;1211 if (is_object($debugObject)) {$debugObject->debugLog(2, 'meta content-type tag found' . $fullvalue);}1212 1213 if (!empty($fullvalue))1214 {1215 $success = preg_match('/charset=(. )/', $fullvalue, $matches);1216 if ($success)1217 {1218 $charset = $matches[1];1219 }1220 else1221 {1222 // If there is a meta tag, and they don't specify the character set, research says that it's typically ISO-8859-11223 if (is_object($debugObject)) {$debugObject->debugLog(2, 'meta content-type tag couldn\'t be parsed. using iso-8859 default.');}1224 $charset = 'ISO-8859-1';1225 }1226 }1227 }1228 }1229 1230 // If we couldn't find a charset above, then lets try to detect one based on the text we got...1231 if (empty($charset))1232 {1233 // Have php try to detect the encoding from the text given to us.1234 $charset = mb_detect_encoding($this->root->plaintext . "ascii", $encoding_list = array( "UTF-8", "CP1252" ) );1235 if (is_object($debugObject)) {$debugObject->debugLog(2, 'mb_detect found: ' . $charset);}1236 1237 // and if this doesn't work... then we need to just wrongheadedly assume it's UTF-8 so that we can move on - cause this will usually give us most of what we need...1238 if ($charset === false)1239 {1240 if (is_object($debugObject)) {$debugObject->debugLog(2, 'since mb_detect failed - using default of utf-8');}1241 $charset = 'UTF-8';1242 }1243 }1244 1245 // Since CP1252 is a superset, if we get one of it's subsets, we want it instead.1246 if ((strtolower($charset) == strtolower('ISO-8859-1')) || (strtolower($charset) == strtolower('Latin1')) || (strtolower($charset) == strtolower('Latin-1')))1247 {1248 if (is_object($debugObject)) {$debugObject->debugLog(2, 'replacing ' . $charset . ' with CP1252 as its a superset');}1249 $charset = 'CP1252';1250 }1251 1252 if (is_object($debugObject)) {$debugObject->debugLog(1, 'EXIT - ' . $charset);}1253 1254 return $this->_charset = $charset;1255 }1256 1257 // read tag info1258 protected function read_tag()1259 {1260 if ($this->char!=='<')1261 {1262 $this->root->_[HDOM_INFO_END] = $this->cursor;1263 return false;1264 }1265 $begin_tag_pos = $this->pos;1266 $this->char = ( $this->pos<$this->size) ? $this->doc[$this->pos] : null; // next1267 1268 // end tag1269 if ($this->char==='/')1270 {1271 $this->char = ( $this->pos<$this->size) ? $this->doc[$this->pos] : null; // next1272 // This represents the change in the simple_html_dom trunk from revision 180 to 181.1273 // $this->skip($this->token_blank_t);1274 $this->skip($this->token_blank);1275 $tag = $this->copy_until_char('>');1276 1277 // skip attributes in end tag1278 if (($pos = strpos($tag, ' '))!==false)1279 $tag = substr($tag, 0, $pos);1280 1281 $parent_lower = strtolower($this->parent->tag);1282 $tag_lower = strtolower($tag);1283 1284 if ($parent_lower!==$tag_lower)1285 {1286 if (isset($this->optional_closing_tags[$parent_lower]) && isset($this->block_tags[$tag_lower]))1287 {1288 $this->parent->_[HDOM_INFO_END] = 0;1289 $org_parent = $this->parent;1290 1291 while (($this->parent->parent) && strtolower($this->parent->tag)!==$tag_lower)1292 $this->parent = $this->parent->parent;1293 1294 if (strtolower($this->parent->tag)!==$tag_lower) {1295 $this->parent = $org_parent; // restore origonal parent1296 if ($this->parent->parent) $this->parent = $this->parent->parent;1297 $this->parent->_[HDOM_INFO_END] = $this->cursor;1298 return $this->as_text_node($tag);1299 }1300 }1301 else if (($this->parent->parent) && isset($this->block_tags[$tag_lower]))1302 {1303 $this->parent->_[HDOM_INFO_END] = 0;1304 $org_parent = $this->parent;1305 1306 while (($this->parent->parent) && strtolower($this->parent->tag)!==$tag_lower)1307 $this->parent = $this->parent->parent;1308 1309 if (strtolower($this->parent->tag)!==$tag_lower)1310 {1311 $this->parent = $org_parent; // restore origonal parent1312 $this->parent->_[HDOM_INFO_END] = $this->cursor;1313 return $this->as_text_node($tag);1314 }1315 }1316 else if (($this->parent->parent) && strtolower($this->parent->parent->tag)===$tag_lower)1317 {1318 $this->parent->_[HDOM_INFO_END] = 0;1319 $this->parent = $this->parent->parent;1320 }1321 else1322 return $this->as_text_node($tag);1323 }1324 1325 $this->parent->_[HDOM_INFO_END] = $this->cursor;1326 if ($this->parent->parent) $this->parent = $this->parent->parent;1327 1328 $this->char = ( $this->pos<$this->size) ? $this->doc[$this->pos] : null; // next1329 return true;1330 }1331 1332 $node = new simple_html_dom_node($this);1333 $node->_[HDOM_INFO_BEGIN] = $this->cursor;1334 $this->cursor;1335 $tag = $this->copy_until($this->token_slash);1336 $node->tag_start = $begin_tag_pos;1337 1338 // doctype, cdata & comments...1339 if (isset($tag[0]) && $tag[0]==='!') {1340 $node->_[HDOM_INFO_TEXT] = '<' . $tag . $this->copy_until_char('>');1341 1342 if (isset($tag[2]) && $tag[1]==='-' && $tag[2]==='-') {1343 $node->nodetype = HDOM_TYPE_COMMENT;1344 $node->tag = 'comment';1345 } else {1346 $node->nodetype = HDOM_TYPE_UNKNOWN;1347 $node->tag = 'unknown';1348 }1349 if ($this->char==='>') $node->_[HDOM_INFO_TEXT].='>';1350 $this->link_nodes($node, true);1351 $this->char = ( $this->pos<$this->size) ? $this->doc[$this->pos] : null; // next1352 return true;1353 }1354 1355 // text1356 if ($pos=strpos($tag, '<')!==false) {1357 $tag = '<' . substr($tag, 0, -1);1358 $node->_[HDOM_INFO_TEXT] = $tag;1359 $this->link_nodes($node, false);1360 $this->char = $this->doc[--$this->pos]; // prev1361 return true;1362 }1363 1364 if (!preg_match("/^[\w-:] $/", $tag)) {1365 $node->_[HDOM_INFO_TEXT] = '<' . $tag . $this->copy_until('<>');1366 if ($this->char==='<') {1367 $this->link_nodes($node, false);1368 return true;1369 }1370 1371 if ($this->char==='>') $node->_[HDOM_INFO_TEXT].='>';1372 $this->link_nodes($node, false);1373 $this->char = ( $this->pos<$this->size) ? $this->doc[$this->pos] : null; // next1374 return true;1375 }1376 1377 // begin tag1378 $node->nodetype = HDOM_TYPE_ELEMENT;1379 $tag_lower = strtolower($tag);1380 $node->tag = ($this->lowercase) ? $tag_lower : $tag;1381 1382 // handle optional closing tags1383 if (isset($this->optional_closing_tags[$tag_lower]) )1384 {1385 while (isset($this->optional_closing_tags[$tag_lower][strtolower($this->parent->tag)]))1386 {1387 $this->parent->_[HDOM_INFO_END] = 0;1388 $this->parent = $this->parent->parent;1389 }1390 $node->parent = $this->parent;1391 }1392 1393 $guard = 0; // prevent infinity loop1394 $space = array($this->copy_skip($this->token_blank), '', '');1395 1396 // attributes1397 do1398 {1399 if ($this->char!==null && $space[0]==='')1400 {1401 break;1402 }1403 $name = $this->copy_until($this->token_equal);1404 if ($guard===$this->pos)1405 {1406 $this->char = ( $this->pos<$this->size) ? $this->doc[$this->pos] : null; // next1407 continue;1408 }1409 $guard = $this->pos;1410 1411 // handle endless '<'1412 if ($this->pos>=$this->size-1 && $this->char!=='>') {1413 $node->nodetype = HDOM_TYPE_TEXT;1414 $node->_[HDOM_INFO_END] = 0;1415 $node->_[HDOM_INFO_TEXT] = '<'.$tag . $space[0] . $name;1416 $node->tag = 'text';1417 $this->link_nodes($node, false);1418 return true;1419 }1420 1421 // handle mismatch '<'1422 if ($this->doc[$this->pos-1]=='<') {1423 $node->nodetype = HDOM_TYPE_TEXT;1424 $node->tag = 'text';1425 $node->attr = array();1426 $node->_[HDOM_INFO_END] = 0;1427 $node->_[HDOM_INFO_TEXT] = substr($this->doc, $begin_tag_pos, $this->pos-$begin_tag_pos-1);1428 $this->pos -= 2;1429 $this->char = ( $this->pos<$this->size) ? $this->doc[$this->pos] : null; // next1430 $this->link_nodes($node, false);1431 return true;1432 }1433 1434 if ($name!=='/' && $name!=='') {1435 $space[1] = $this->copy_skip($this->token_blank);1436 $name = $this->restore_noise($name);1437 if ($this->lowercase) $name = strtolower($name);1438 if ($this->char==='=') {1439 $this->char = ( $this->pos<$this->size) ? $this->doc[$this->pos] : null; // next1440 $this->parse_attr($node, $name, $space);1441 }1442 else {1443 //no value attr: nowrap, checked selected...1444 $node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_NO;1445 $node->attr[$name] = true;1446 if ($this->char!='>') $this->char = $this->doc[--$this->pos]; // prev1447 }1448 $node->_[HDOM_INFO_SPACE][] = $space;1449 $space = array($this->copy_skip($this->token_blank), '', '');1450 }1451 else1452 break;1453 } while ($this->char!=='>' && $this->char!=='/');1454 1455 $this->link_nodes($node, true);1456 $node->_[HDOM_INFO_ENDSPACE] = $space[0];1457 1458 // check self closing1459 if ($this->copy_until_char_escape('>')==='/')1460 {1461 $node->_[HDOM_INFO_ENDSPACE] .= '/';1462 $node->_[HDOM_INFO_END] = 0;1463 }1464 else1465 {1466 // reset parent1467 if (!isset($this->self_closing_tags[strtolower($node->tag)])) $this->parent = $node;1468 }1469 $this->char = ( $this->pos<$this->size) ? $this->doc[$this->pos] : null; // next1470 1471 // If it's a BR tag, we need to set it's text to the default text.1472 // This way when we see it in plaintext, we can generate formatting that the user wants.1473 // since a br tag never has sub nodes, this works well.1474 if ($node->tag == "br")1475 {1476 $node->_[HDOM_INFO_INNER] = $this->default_br_text;1477 }1478 1479 return true;1480 }1481 1482 // parse attributes1483 protected function parse_attr($node, $name, &$space)1484 {1485 // Per sourceforge: http://sourceforge.net/tracker/?func=detail&aid=3061408&group_id=218559&atid=10440371486 // If the attribute is already defined inside a tag, only pay atetntion to the first one as opposed to the last one.1487 if (isset($node->attr[$name]))1488 {1489 return;1490 }1491 1492 $space[2] = $this->copy_skip($this->token_blank);1493 switch ($this->char) {1494 case '"':1495 $node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_DOUBLE;1496 $this->char = ( $this->pos<$this->size) ? $this->doc[$this->pos] : null; // next1497 $node->attr[$name] = $this->restore_noise($this->copy_until_char_escape('"'));1498 $this->char = ( $this->pos<$this->size) ? $this->doc[$this->pos] : null; // next1499 break;1500 case '\'':1501 $node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_SINGLE;1502 $this->char = ( $this->pos<$this->size) ? $this->doc[$this->pos] : null; // next1503 $node->attr[$name] = $this->restore_noise($this->copy_until_char_escape('\''));1504 $this->char = ( $this->pos<$this->size) ? $this->doc[$this->pos] : null; // next1505 break;1506 default:1507 $node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_NO;1508 $node->attr[$name] = $this->restore_noise($this->copy_until($this->token_attr));1509 }1510 // PaperG: Attributes should not have \r or \n in them, that counts as html whitespace.1511 $node->attr[$name] = str_replace("\r", "", $node->attr[$name]);1512 $node->attr[$name] = str_replace("\n", "", $node->attr[$name]);1513 // PaperG: If this is a "class" selector, lets get rid of the preceeding and trailing space since some people leave it in the multi class case.1514 if ($name == "class") {1515 $node->attr[$name] = trim($node->attr[$name]);1516 }1517 }1518 1519 // link node's parent1520 protected function link_nodes(&$node, $is_child)1521 {1522 $node->parent = $this->parent;1523 $this->parent->nodes[] = $node;1524 if ($is_child)1525 {1526 $this->parent->children[] = $node;1527 }1528 }1529 1530 // as a text node1531 protected function as_text_node($tag)1532 {1533 $node = new simple_html_dom_node($this);1534 $this->cursor;1535 $node->_[HDOM_INFO_TEXT] = '</' . $tag . '>';1536 $this->link_nodes($node, false);1537 $this->char = ( $this->pos<$this->size) ? $this->doc[$this->pos] : null; // next1538 return true;1539 }1540 1541 protected function skip($chars)1542 {1543 $this->pos = strspn($this->doc, $chars, $this->pos);1544 $this->char = ($this->pos<$this->size) ? $this->doc[$this->pos] : null; // next1545 }1546 1547 protected function copy_skip($chars)1548 {1549 $pos = $this->pos;1550 $len = strspn($this->doc, $chars, $pos);1551 $this->pos = $len;1552 $this->char = ($this->pos<$this->size) ? $this->doc[$this->pos] : null; // next1553 if ($len===0) return '';1554 return substr($this->doc, $pos, $len);1555 }1556 1557 protected function copy_until($chars)1558 {1559 $pos = $this->pos;1560 $len = strcspn($this->doc, $chars, $pos);1561 $this->pos = $len;1562 $this->char = ($this->pos<$this->size) ? $this->doc[$this->pos] : null; // next1563 return substr($this->doc, $pos, $len);1564 }1565 1566 protected function copy_until_char($char)1567 {1568 if ($this->char===null) return '';1569 1570 if (($pos = strpos($this->doc, $char, $this->pos))===false) {1571 $ret = substr($this->doc, $this->pos, $this->size-$this->pos);1572 $this->char = null;1573 $this->pos = $this->size;1574 return $ret;1575 }1576 1577 if ($pos===$this->pos) return '';1578 $pos_old = $this->pos;1579 $this->char = $this->doc[$pos];1580 $this->pos = $pos;1581 return substr($this->doc, $pos_old, $pos-$pos_old);1582 }1583 1584 protected function copy_until_char_escape($char)1585 {1586 if ($this->char===null) return '';1587 1588 $start = $this->pos;1589 while (1)1590 {1591 if (($pos = strpos($this->doc, $char, $start))===false)1592 {1593 $ret = substr($this->doc, $this->pos, $this->size-$this->pos);1594 $this->char = null;1595 $this->pos = $this->size;1596 return $ret;1597 }1598 1599 if ($pos===$this->pos) return '';1600 1601 if ($this->doc[$pos-1]==='\\') {1602 $start = $pos 1;1603 continue;1604 }1605 1606 $pos_old = $this->pos;1607 $this->char = $this->doc[$pos];1608 $this->pos = $pos;1609 return substr($this->doc, $pos_old, $pos-$pos_old);1610 }1611 }1612 1613 // remove noise from html content1614 // save the noise in the $this->noise array.1615 protected function remove_noise($pattern, $remove_tag=false)1616 {1617 global $debugObject;1618 if (is_object($debugObject)) { $debugObject->debugLogEntry(1); }1619 1620 $count = preg_match_all($pattern, $this->doc, $matches, PREG_SET_ORDER|PREG_OFFSET_CAPTURE);1621 1622 for ($i=$count-1; $i>-1; --$i)1623 {1624 $key = '___noise___'.sprintf('% 5d', count($this->noise) 1000);1625 if (is_object($debugObject)) { $debugObject->debugLog(2, 'key is: ' . $key); }1626 $idx = ($remove_tag) ? 0 : 1;1627 $this->noise[$key] = $matches[$i][$idx][0];1628 $this->doc = substr_replace($this->doc, $key, $matches[$i][$idx][1], strlen($matches[$i][$idx][0]));1629 }1630 1631 // reset the length of content1632 $this->size = strlen($this->doc);1633 if ($this->size>0)1634 {1635 $this->char = $this->doc[0];1636 }1637 }1638 1639 // restore noise to html content1640 function restore_noise($text)1641 {1642 global $debugObject;1643 if (is_object($debugObject)) { $debugObject->debugLogEntry(1); }1644 1645 while (($pos=strpos($text, '___noise___'))!==false)1646 {1647 // Sometimes there is a broken piece of markup, and we don't GET the pos 11 etc... token which indicates a problem outside of us...1648 if (strlen($text) > $pos 15)1649 {1650 $key = '___noise___'.$text[$pos 11].$text[$pos 12].$text[$pos 13].$text[$pos 14].$text[$pos 15];1651 if (is_object($debugObject)) { $debugObject->debugLog(2, 'located key of: ' . $key); }1652 1653 if (isset($this->noise[$key]))1654 {1655 $text = substr($text, 0, $pos).$this->noise[$key].substr($text, $pos 16);1656 }1657 else1658 {1659 // do this to prevent an infinite loop.1660 $text = substr($text, 0, $pos).'UNDEFINED NOISE FOR KEY: '.$key . substr($text, $pos 16);1661 }1662 }1663 else1664 {1665 // There is no valid key being given back to us... We must get rid of the ___noise___ or we will have a problem.1666 $text = substr($text, 0, $pos).'NO NUMERIC NOISE KEY' . substr($text, $pos 11);1667 }1668 }1669 return $text;1670 }1671 1672 // Sometimes we NEED one of the noise elements.1673 function search_noise($text)1674 {1675 global $debugObject;1676 if (is_object($debugObject)) { $debugObject->debugLogEntry(1); }1677 1678 foreach($this->noise as $noiseElement)1679 {1680 if (strpos($noiseElement, $text)!==false)1681 {1682 return $noiseElement;1683 }1684 }1685 }1686 function __toString()1687 {1688 return $this->root->innertext();1689 }1690 1691 function __get($name)1692 {1693 switch ($name)1694 {1695 case 'outertext':1696 return $this->root->innertext();1697 case 'innertext':1698 return $this->root->innertext();1699 case 'plaintext':1700 return $this->root->text();1701 case 'charset':1702 return $this->_charset;1703 case 'target_charset':1704 return $this->_target_charset;1705 }1706 }1707 1708 // camel naming conventions1709 function childNodes($idx=-1) {return $this->root->childNodes($idx);}1710 function firstChild() {return $this->root->first_child();}1711 function lastChild() {return $this->root->last_child();}1712 function createElement($name, $value=null) {return @str_get_html("<$name>$value</$name>")->first_child();}1713 function createTextNode($value) {return @end(str_get_html($value)->nodes);}1714 function getElementById($id) {return $this->find("#$id", 0);}1715 function getElementsById($id, $idx=null) {return $this->find("#$id", $idx);}1716 function getElementByTagName($name) {return $this->find($name, 0);}1717 function getElementsByTagName($name, $idx=-1) {return $this->find($name, $idx);}1718 function loadFile() {$args = func_get_args();$this->load_file($args);}1719 }1720 1721 ?>

3:创建抓包代码 test.php,代码如下:

1 <?php 2 //error_reporting(0); 3 set_time_limit(0); 4 include_once 'simple_html_dom.php'; 5 echo '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'; 6 //$data = '#在抖音,记录美好生活#【库存】抛个硬币 如果摔碎了 今天我就不吃饭了 #校园生活 #大学 https://v.douyin.com/p9taJa/ 复制此链接,打开【抖音短视频】,直接观看视频!'; 7 $data = '#科比 愿你去的地方也有篮球陪伴,也能披着24号紫金战衣! #动态壁纸 https://v.douyin.com/36xkCS/ 复制此链接,打开【抖音短视频】,直接观看视频!'; 8 $data = getData($data); 9 echo json_encode($data); 10 11 function getData($data){ 12 $url = getUrl($data); 13 $cookie_jar = dirname(__FILE__).'/tmp.txt';//tempnam('./tmp','cookie'); 14 $data = get_content($url, $cookie_jar); 15 16 $page = str_get_html($data); 17 18 $data = array( 19 'base'=>array( 20 'headimg'=>false, // 头像 21 'name'=>false, // 昵称 22 'title'=>false, // 标题(姑且叫标题吧) 23 'description'=>false // 描述 24 ), 25 'video'=>array( 26 'cover'=>false, // 封面 27 'src'=>false, // 路径 28 'width'=>false, // 宽度 29 'height'=>false // 高度 30 ) 31 ); 32 $user = $page->find('div[class=user-info]'); 33 // 头像、 昵称 34 if(count($user) > 0){ 35 $img = $user[0]->find('div[class=avatar]'); 36 if(count($img) > 0){ 37 $img = $img[0]->find('img'); 38 if(count($img) > 0){ 39 // 头像 40 $data['base']['headimg'] = $img[0]->src; 41 // 昵称 42 $data['base']['name'] = $img[0]->alt; 43 } 44 } 45 } 46 // 标题、描述 47 $title = $page->find('div[class=challenge-info]'); 48 if(count($title) > 0){ 49 $description = $title[0]->next_sibling(); 50 $title = $title[0]->first_child()->first_child(); 51 $data['base']['title'] = $title->innertext; 52 $data['base']['description'] = $description->innertext; 53 } 54 $video = $page->find('div[id=pageletReflowVideo]'); 55 if(count($video) > 0){ 56 $script = $video[0]->next_sibling(); 57 if(!empty($script)){ 58 $script = $script->next_sibling(); 59 if(!empty($script)){ 60 $script = $script->next_sibling()->innertext; 61 $data['video'] = getVideo($script); 62 } 63 } 64 65 66 } 67 return $data; 68 } 69 70 function getVideo($scripts){ 71 $video = array(); 72 $scripts = preg_replace('/\s /','',$scripts); 73 // 宽度 74 preg_match('/videoWidth:([0-9.]*),/' , $scripts, $matches); 75 if(empty($matches) || count($matches) < 2){ 76 $video['width'] = false; 77 }else{ 78 $video['width'] = $matches[1]; 79 } 80 // 高度 81 preg_match('/videoHeight:([0-9.]*),/' , $scripts, $matches); 82 if(empty($matches) || count($matches) < 2){ 83 $video['height'] = false; 84 }else{ 85 $video['height'] = $matches[1]; 86 } 87 // 视频路径 88 preg_match('/playAddr:"(.*)",/' , $scripts, $matches); 89 if(empty($matches) || count($matches) < 2){ 90 $video['src'] = false; 91 }else{ 92 $video['src'] = $matches[1]; 93 } 94 // 封面 95 preg_match('/cover:"(.*)"}/' , $scripts, $matches); 96 if(empty($matches) || count($matches) < 2){ 97 $video['cover'] = false; 98 }else{ 99 $video['cover'] = $matches[1];100 }101 return $video;102 }103 function get_content($url, $cookie,$referfer='') {104 //var_dump($post);exit;105 $useragent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";106 /*if ($curl_loops >= $curl_max_loops) {107 $curl_loops = 0;108 return false;109 }*/110 if($referfer == ''){111 $referfer = 'https://www.kujiale.com/';112 }113 $header = array("Referer: ".$referfer); 114 $curl = curl_init();//初始化curl模块 115 curl_setopt($curl, CURLOPT_URL, $url);//登录提交的地址 116 curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); //不验证证书117 curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false); //不验证证书118 curl_setopt($curl, CURLOPT_HEADER, 1);//是否显示头信息 119 curl_setopt($curl, CURLOPT_HTTPHEADER,$header); 120 //curl_setopt ($curl,CURLOPT_REFERER,'http://www.kujiale.com/');121 curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);//是否自动显示返回的信息 122 curl_setopt($curl, CURLOPT_COOKIEFILE, $cookie); //设置Cookie信息保存在指定的文件中 123 curl_setopt($curl, CURLOPT_COOKIEJAR, $cookie); //设置Cookie信息保存在指定的文件中 124 //curl_setopt($curl, CURLOPT_POST, 1);//post方式提交 125 //curl_setopt($curl, CURLOPT_POSTFIELDS, http_build_query($post));//要提交的信息 126 //curl_setopt($curl,CURLOPT_POSTFIELDS,$post);127 128 curl_setopt($curl, CURLOPT_USERAGENT, $useragent);129 //curl_setopt($curl, CURLOPT_REFERER, 'http://www.kujiale.com/');130 $data = curl_exec($curl);//执行cURL 131 $ret = $data;132 list($header, $data) = explode("\r\n\r\n", $data, 2);133 $http_code = curl_getinfo($curl, CURLINFO_HTTP_CODE);134 $last_url = curl_getinfo($curl, CURLINFO_EFFECTIVE_URL);135 //var_dump($last_url);136 //$httpCode = curl_getinfo($curl,CURLINFO_HTTP_CODE);137 //var_dump($httpCode);138 //echo '<hr/>';139 curl_close($curl);//关闭cURL资源,并且释放系统资源 140 if ($http_code == 301 || $http_code == 302) {141 $matches = array();142 preg_match('/Location:(.*?)\n/', $header, $matches);143 $url = @parse_url(trim(array_pop($matches)));144 if (!$url) {145 $curl_loops = 0;146 return $data;147 }148 $new_url = $url['scheme'] . '://' . $url['host'] . $url['path']149 . (isset($url['query']) ? '?' . $url['query'] : '');150 $new_url = stripslashes($new_url);151 return get_content($new_url,$cookie);152 } else {153 $curl_loops = 0;154 list($header, $data) = explode("\r\n\r\n", $ret, 2);155 return $data;156 }157 } 158 function getUrl($data){159 preg_match('/https:\/\/v.douyin.com\/.*\//' , $data, $matches);160 if(empty($matches) || count($matches) != 1){161 return false;162 }else{163 return $matches[0];164 }165 }166 ?>

将test.php中的$data 换成自己复制出来的链接就可以了。返回的是json格式的内容,可以直接渲染在前台 也可以存到数据库里面。

到此 从抖音上抓包视频内容就完成了,写的不好,请大家勿喷。大家有什么意见,欢迎在评论区留言。我看到了会回复大家 ,谢谢。

推荐

首页/电脑版/网名
© 2025 NiBaKu.Com All Rights Reserved.