In a rapidly growing digital era, the ability to sift through massive amounts of data to extract specific information is not only crucial but increasingly accessible. This blog post walks through my process of analyzing YouTube transcripts to discover house-related issues mentioned by the previous owner of my property, who happened to be an avid YouTuber.
The Challenge
Understanding the Task
The goal was to identify potential house-related problems by analyzing a collection of text transcripts from the previous owner’s YouTube channel. The challenge was to unearth comments about problems or issues related to the house without the time-consuming process of watching each of their videos.
Initial Exploration
I didn’t really know where to start, so I used some exploratory prompting with ChatGPT to determine the necessary components for extracting the transcripts. For example:
I want to extract the text transcripts from youtube videos with python so the content can be analyzed. what is the best way to do that?
The Approach
Here’s the eventual prompt that helped me craft a functioning script:
I'm looking to extract transcripts from a specific YouTube channel, and I need a Python script to automate the process of fetching and saving these transcripts as text files. The script must be capable of doing the following:
1. Retrieve Video IDs from a Specific Channel: Using the YouTube API (v3), the script should fetch all video IDs from a specified channel ID. It should handle pagination to retrieve all videos.
2. Extract Transcripts: For each video ID, use the `youtube_transcript_api` to fetch the transcript of the video. If the transcript is disabled or not found, handle these exceptions and continue to the next video.
3. Save Transcripts as Text Files: Each transcript should be saved as a text file, with the video ID as the filename.
4. Avoid Duplicates: If a transcript file already exists for a specific video ID, skip fetching the transcript for that video to avoid redundancy.
5. Handle Errors: If any other error occurs, log it and continue to the next video.
Please make sure to include necessary imports and any setup required to authenticate with the YouTube API. The final script should be a standalone program that can be executed to fetch and save the transcripts.
Note: The script should be designed with the assumption that the necessary YouTube API key and channel ID will be available as variables in the code.
Transcript Extraction Script
Here is the script ChatGPT wrote to extract the transcripts:
import os

from googleapiclient.discovery import build
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled, NoTranscriptFound

# As noted in the prompt, the API key and channel ID are assumed to be
# available as variables in the code.
api_key = "YOUR_API_KEY"
channel_id = "TARGET_CHANNEL_ID"

def get_channel_videos(channel_id):
    youtube = build('youtube', 'v3', developerKey=api_key)
    all_videos = []
    next_page_token = None

    while True:
        request = youtube.search().list(
            part="snippet",
            channelId=channel_id,
            maxResults=50,  # Maximum allowed by API is 50
            pageToken=next_page_token
        )
        response = request.execute()

        videos = response.get('items', [])
        for video in videos:
            if video['id']['kind'] == 'youtube#video':
                all_videos.append(video['id']['videoId'])

        next_page_token = response.get('nextPageToken')
        if next_page_token is None:
            break

    return all_videos

def save_transcript_as_text(video_id, transcript):
    with open(f"{video_id}.txt", "w", encoding='utf-8') as file:
        for entry in transcript:
            file.write(entry['text'] + '\n')

def get_transcripts(video_ids):
    for video_id in video_ids:
        # Check if transcript file already exists
        if os.path.exists(f"{video_id}.txt"):
            print(f"Transcript for video: {video_id} already exists, skipping...")
            continue
        try:
            transcript = YouTubeTranscriptApi.get_transcript(video_id)
            save_transcript_as_text(video_id, transcript)
        except TranscriptsDisabled:
            print(f"Transcripts are disabled for video: {video_id}")
        except NoTranscriptFound:
            print(f"No transcript found for video: {video_id}")
        except Exception as e:
            print(f"An error occurred for video: {video_id}, error: {e}")

def main():
    video_ids = get_channel_videos(channel_id)
    get_transcripts(video_ids)

if __name__ == "__main__":
    main()
Data Ingestion
When trying to identify positive and negative statements related to specific subjects like property issues, determining the optimal chunk size and overlap is a critical step.
Chunk Size
The chunk size refers to the number of words or characters you will process at once. If it’s too small, you might miss the context or overall sentiment of a statement, especially if the speaker uses lots of run-on sentences and filler words. If it’s too large, it could create unnecessary complexity and might even encompass unrelated content.
Overlap
Overlap refers to the amount of shared text between successive chunks. If run-on sentences are frequent, overlapping can ensure that you do not lose context or split a statement related to property issues across two different chunks.
The Prompt
Determine the optimal chunk size and overlap for processing text transcriptions from English YouTube videos, with the goal of identifying positive and negative statements about a house or property, taking into consideration that the speaker uses lots of run-on sentences. The analysis should focus on problem keywords related to property issues.
Chunk Size:
- Small Chunk (e.g., 50-100 words): Too small a chunk might break up the long run-on sentences and miss the connection between different parts of the statement.
- Large Chunk (e.g., 1000-2000 words): This could capture the complete context but might become computationally inefficient or include unrelated information.
- Optimal Chunk Size (e.g., 200-400 words): This middle ground should allow for capturing enough of the run-on sentences without including too much unrelated text.
Overlap:
- Small Overlap (e.g., 10-20 words): Might miss some connections between sentences.
- Large Overlap (e.g., 100-200 words): Ensures that statements are not split but might lead to redundancy.
- Optimal Overlap (e.g., 40-80 words): Should allow for capturing the run-on nature of the sentences without too much repetition.
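To make the recommendation concrete, the sliding-window idea can be sketched as a simple word-based splitter. This is only an illustration of how a ~300-word chunk with a ~60-word overlap behaves; the ingestion tool used below does its own chunking internally.

```python
def chunk_words(text, chunk_size=300, overlap=60):
    """Split text into overlapping chunks of roughly chunk_size words."""
    assert chunk_size > overlap, "chunk_size must exceed overlap"
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final chunk already reaches the end of the text
    return chunks
```

Each chunk repeats the last 60 words of the previous one, so a run-on sentence that straddles a chunk boundary still appears intact in at least one chunk.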
Embeddings
I leveraged Randy Lin’s chat-with-your-doc tool (https://github.com/linjungz/chat-with-your-doc) to create embeddings and ingest the transcripts.
python chat_cli.py ingest --path "./data/source_documents/*.txt" --name youtube_transcripts
After creating the embeddings, the power of OpenAI’s ChatGPT could be leveraged to query the data.
python chat_cli.py chat --name youtube_transcripts
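Under the hood, tools like this embed each chunk as a vector and answer a query by retrieving the chunks whose embeddings are most similar to the query’s embedding. A minimal sketch of that retrieval step (illustrative only; not chat-with-your-doc’s actual implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunks most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

The retrieved chunks are then passed to the language model as context, which is why chunk size and overlap matter: they determine what the model actually sees when answering a question.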
Findings
Through analysis, several positive mentions about the house were uncovered, including preferences for lighting, bedroom features, and appreciation of outdoor plants. Thankfully, nothing particularly negative was discovered.
Find all sentences that mention a house, home, residence, dwelling, property, or yard, along with a problem, damage, or issue. Highlight examples of common problems associated with houses, such as leaks, structural damage, or renovation needs. Analyze the context of mentions of houses and related terms to identify whether they are associated with a problem or negative situation. Determine the sentiment of sentences that contain keywords related to houses, and highlight those with negative connotations.
- Sentence: “It kind of got in the way of like moving around the space and like he definitely would get really annoyed with me with like having plants in the bedroom.”
Context and Sentiment: The context of this sentence is about the speaker’s previous living situation with their ex-boyfriend and the issues that arose due to having too many plants in the bedroom. The sentiment is negative, as the speaker describes the situation as annoying and causing problems in their relationship.
- Sentence: “I don’t like live in a very busy neighborhood and like I kind of live on like a hill so people can’t like easily see into my house since my room or my house isn’t like level with the road.”
Context and Sentiment: The context of this sentence is about the speaker’s house and its location on a hill. The sentiment is neutral, as the speaker is simply describing their living situation and how it affects their privacy.
Are there specific problems or issues related to houses mentioned in the transcripts?
In the transcripts, there are no specific problems or issues mentioned regarding the houses. The main issues discussed are related to having a large number of plants in the house, which caused some disagreements in a past relationship, and the echoey nature of the new house due to a lack of furniture and rugs.
was there anything the person mentioned that they liked about their house or property?
Yes, the person mentioned several things they liked about their house or property:
- They prefer to have the windows completely open to let in as much light as possible for their plants.
- Their bedroom has a king bed, beautiful windows that let in southern light, and a walk-in closet.
- They appreciate outdoor plants and landscapes.
was there anything mentioned about issues with privacy or neighbors?
Yes, there were mentions of privacy and neighbor-related issues in the transcripts. The person mentioned not having window coverings initially because they didn’t think they were necessary and they didn’t live in a busy neighborhood. They also mentioned living on a hill, making it difficult for people to see into their house. However, they eventually decided to get drapes for more privacy at nighttime.
Additionally, they mentioned the benefit at their new house of not having neighbors directly behind them when filming outside, as it made them feel less self-conscious. They also talked about being respectful and maintaining privacy when filming in public places, making sure not to capture other people’s faces on camera.
what is the average number of times per sentence that they say the word "like"?
I cannot calculate the exact average frequency of the word “like” being used per sentence in their conversation without knowing the total number of sentences and the total number of times “like” was used. However, it appears that the word “like” is used quite frequently in their conversation.
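For what it’s worth, this is exactly the kind of question plain Python answers more reliably than a language model. A rough sketch, assuming the transcripts are the .txt files produced by the extraction script (the sentence split is naive, since transcripts rarely punctuate cleanly):

```python
import glob
import re

def average_likes_per_sentence(pattern="*.txt"):
    """Average occurrences of the word 'like' per sentence across transcript files."""
    total_likes = 0
    total_sentences = 0
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            text = f.read()
        # Approximate sentence boundaries with ., !, and ?
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        total_sentences += len(sentences)
        total_likes += len(re.findall(r"\blike\b", text, flags=re.IGNORECASE))
    return total_likes / total_sentences if total_sentences else 0.0
```

Run it from the directory holding the transcripts, and it gives a hard number instead of "quite frequently."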
Conclusion
This exploration into YouTube transcripts illustrated the power of text analysis and machine learning: a task that once meant watching hours of video was simplified by tools like ChatGPT and chat-with-your-doc. The method should generalize to other domains, from mining customer feedback to tracking social media trends.
Acknowledgments
A heartfelt thanks to the open-source community, whose innovation and support enable remarkable projects like this.