|
- ---
- title: "funwithdata"
- output: rmarkdown::html_vignette
- vignette: >
-   %\VignetteIndexEntry{funwithdata}
-   %\VignetteEngine{knitr::rmarkdown}
-   %\VignetteEncoding{UTF-8}
- ---
-
- ```{r, include = FALSE}
- knitr::opts_chunk$set(
-   collapse = TRUE,
-   comment = "#>"
- )
- ```
-
- ```{r setup}
- library(hateimparlament)
- library(dplyr)
- library(stringr)
- library(ggplot2)
- ```
-
- ## Preparation of data
-
- First, you need to download all records of the current legislative period.
- ```r
- fetch_all("../records/") # path to directory where records should be stored
- ```
- Second, those `.xml` files need to be parsed into R tibbles. This is accomplished by:
- ```r
- read_all("../records/") %>% repair() -> res
-
- reden <- res$reden
- redner <- res$redner
- talks <- res$talks
- ```
- Here `repair()` fixes a number of formatting issues in the records. The result is then
- unpacked into more descriptive variables.
-
- For development purposes, we load the tables from CSV files instead.
- ```{r}
- tables <- read_from_csv('../csv/')
-
- comments <- tables$comments
- reden <- tables$reden
- redner <- tables$redner
- talks <- tables$talks
- ```
-
- Further, we need to load a list of words that appear in texts by Hitler but not in standard German texts.
- ```{r}
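- # Each line of the file becomes one entry of `Worte`. Passing the path directly
- # lets readLines() open and close the file itself.
- Worte <- readLines('../hitler_texts/hitler_words')
- hitlerwords <- tibble(Worte)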
- ```
-
- ## Analysis
-
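- Next, we extract the words that one party uses with higher frequency than the parliament as a whole and compare them with `hitlerwords`. As a first step, we concatenate all talks of each `fraktion` into a single text.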
- ```{r}
- # Attach the speaker metadata, then join all talks of a fraktion into one text.
- talks %>%
-   left_join(redner, by = c(redner = 'id')) %>%
-   group_by(fraktion) %>%
-   summarize(full_text = str_c(content, collapse = "\n")) -> talks_by_fraktion
- talks_by_fraktion
- ```
- For each party, we build a tibble of words with their absolute count (`n`) and their relative frequency (`freq`).
- ```{r}
- # Tokenise a text and count each word. The result stays grouped by `Worte` and
- # holds the absolute count `n` and the relative frequency `freq`.
- word_freq <- function(text) {
-   Worte <- str_extract_all(text, "\\b[a-zA-ZäöüÄÖÜß]+\\b")[[1]]
-   total <- length(Worte)
-   tibble(Worte) %>% group_by(Worte) %>% count() %>% mutate(freq = n / total)
- }
-
- # The indices rely on the row order of `talks_by_fraktion` printed above.
- texts <- talks_by_fraktion$full_text
- afd_words                <- word_freq(texts[[1]]) # AfD
- afdundfraktionslos_words <- word_freq(texts[[2]]) # AfD&Fraktionslos
- gruene_words             <- word_freq(texts[[3]]) # BÜNDNIS 90 / DIE GRÜNEN
- cdu_words                <- word_freq(texts[[4]]) # CDU/CSU
- linke_words              <- word_freq(texts[[5]]) # DIE LINKE
- fdp_words                <- word_freq(texts[[6]]) # FDP
- fraktionslos_words       <- word_freq(texts[[7]]) # Fraktionslos
- spd_words                <- word_freq(texts[[8]]) # SPD
- na_words                 <- word_freq(texts[[9]]) # NA
-
- # All parties combined: sum the counts per word and compute the overall share `part`.
- all_words <- bind_rows(afd_words, afdundfraktionslos_words, gruene_words, cdu_words, linke_words, fdp_words, fraktionslos_words, spd_words, na_words)
- total <- sum(all_words$n)
- # Inside summarize(), `n` in the second expression is already the summed count.
- all_words %>% group_by(Worte) %>% summarize(n = sum(n), part = n / total) -> all_words
- ```
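-
- The tokenising pattern used above can be tried on a single sentence. The following is a minimal sketch. The sentence `sample_text` is a made-up example, not a quote from the records.
- ```r
- library(stringr)
-
- # Hypothetical sample sentence (an assumption for illustration only).
- sample_text <- "Die Bürger zahlen 19 Prozent Mehrwertsteuer, sagt die SPD-Fraktion."
-
- # Same pattern as in the vignette: runs of German letters between word boundaries.
- words <- str_extract_all(sample_text, "\\b[a-zA-ZäöüÄÖÜß]+\\b")[[1]]
- words
- ```
- The number `19` is dropped because it contains no letters, and `SPD-Fraktion` is split at the hyphen. `Die` and `die` remain two distinct tokens here, since lower-casing only happens right before the comparison with `hitlerwords`.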
-
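- Now we extract the words that a specific `fraktion` (here the AfD) uses more frequently than the parliament overall, i.e. the words whose relative frequency within the party exceeds their overall frequency (`rel_quotient > 1`).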
- ```{r}
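- afd_words %>%
-   ungroup() %>% # drop the grouping by `Worte` and carry the column explicitly
-   transmute(Worte, freq, fraktion_n = n) %>%
-   left_join(all_words, by = "Worte") %>%
-   transmute(
-     Worte,
-     fraktion_freq = freq,
-     total_freq = part,
-     fraktion_n,
-     total_n = n,
-     rel_quotient = fraktion_freq / total_freq, # > 1: over-represented in this fraktion
-     abs_quotient = fraktion_n / total_n        # share of all occurrences owed to this fraktion
-   ) %>%
-   arrange(-abs_quotient, -fraktion_n) %>%
-   filter(rel_quotient > 1) -> afd_high_frequent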
- ```
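-
- To make the two quotients concrete, here is a minimal sketch on made-up counts. The tables `party_words` and `all_words_toy` and all numbers in them are assumptions for illustration, not values from the records.
- ```r
- library(dplyr)
-
- # Made-up counts (illustration only): one party vs. the whole parliament.
- party_words   <- tibble(Worte = c("haushalt", "grenze", "und"), n = c(2, 6, 40))
- all_words_toy <- tibble(Worte = c("haushalt", "grenze", "und"), n = c(20, 8, 400))
-
- party_total   <- sum(party_words$n)   # 48 words spoken by the party
- overall_total <- sum(all_words_toy$n) # 428 words spoken overall
-
- toy_quotients <- party_words %>%
-   transmute(Worte, fraktion_n = n, fraktion_freq = n / party_total) %>%
-   left_join(
-     all_words_toy %>% transmute(Worte, total_n = n, total_freq = n / overall_total),
-     by = "Worte"
-   ) %>%
-   mutate(
-     rel_quotient = fraktion_freq / total_freq,
-     abs_quotient = fraktion_n / total_n
-   ) %>%
-   filter(rel_quotient > 1)
- toy_quotients
- ```
- With these numbers only `grenze` passes the filter: it makes up 6/48 of the party's words but only 8/428 of all words, so its `rel_quotient` is about 6.7, while its `abs_quotient` of 6/8 says that three quarters of its occurrences come from this party.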
-
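- Finally, we compare these words with `hitlerwords`. The words are converted to lower case first, and an inner join keeps only those that occur in both lists.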
-
- ```{r}
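- afd_high_frequent %>%
-   ungroup() %>% # make sure `Worte` is not a grouping variable before modifying it
-   mutate(Worte = str_to_lower(Worte)) %>%
-   inner_join(hitlerwords, by = "Worte")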
- ```
|