--- title: "SML201 Chapter 2.2 ^[© 2021 Daniel Persia & Daisy Huang. All rights reserved.]" author: "Daniel Persia ORCID ID:0000-0001-6097-7161 " date: "Fall 2021" output: #html_document: #fig_caption: yes #df_print: paged #toc: yes pdf_document: fig_caption: yes toc: yes geometry: margin=1.5in subtitle: 'Transforming Data with dplyr in R: The Translation Database' editor_options: chunk_output_type: console --- ```{r setup, include=FALSE} knitr::opts_chunk$set(fig.align="center", fig.height=5, fig.width=9, collapse=F, comment="", prompt=F, echo = TRUE, cache=TRUE, autodep=TRUE, tidy=TRUE, tidy.opts=list(width.cutoff=60)) options(width=70) ``` # Overview Today we'll be looking at data from the Translation Database, an open-source project that aims to track all fiction, poetry, children's books, and nonfiction translated into English and published in the United States since 2008. The project was created by the literary hub Three Percent, in collaboration with Open Letter Books at the University of Rochester. The name "Three Percent" is derived from the oft-cited statistic that about 3% of all books published in the US are translations. In many parts of the globe, this figure is much higher, hovering around 30, 40, or even 50% or more. Even if translation is not a primary component of your research, it no doubt affects your work. Translation is one of the primary mechanisms through which knowledge is exchanged worldwide. It is likely that many of the research articles or even primary sources that you are reading have been translated from other languages into English! Recent scholarship has been approaching topics such as translation across race and gender. Scholars, writers and translators alike must be familiar with the metadata around publishing: which writers are being translated? From what countries? From what languages? Who is doing the translating? Who is doing the publishing? 
VIDA, an organization supporting women in the literary arts, has taken up a similar analysis with original writing, showing the numeric breakdown between men and women published in major literary journals and magazines. The inequities are startling.

Just as data is more than simply letters and numbers, data analysis is more than a computer at work. It relies on human interaction and interpretation. Consider the guiding questions below as you work through the module. We'll return to these later for discussion.\

**Guiding Question 1** Consider the variables in the translation database. What information is present? What information is missing that might be relevant to your analysis? What would you like to see in the dataset that isn't there? Outline the processes (and potential roadblocks) to acquiring such information. \

**Guiding Question 2** Consider variables that may not fully represent all possible contributors. For example, what are the limitations to categorizing gender as binary (male or female)? How might the dataset better account for transgender writers and translators? \

**Guiding Question 3** Make a running list of questions that a researcher could tackle by using this dataset. Think big! For example, how has the ratio of female to male translators changed over the last decade?

Keep these questions in mind as you familiarize yourself with the summary statistics and dplyr transformations taught throughout. Good luck and enjoy!
# R functions covered

--------

* Arithmetic: `diff()`
* Check and remove objects in the environment: `objects()`, `ls()`, `rm()`
* Save and load variables: `save(..., file=...)`, `load(..., file=...)`
* Data extraction: `$` by name; `[,]` by indices; with conditions or logical arguments
* Combining conditions: `&`, `|`, `!`
* Data manipulation: `na.omit()`, `sort(..., decreasing = FALSE)`, `unique()`, `rank()`, `%in%`
* Statistical functions: `sd(..., na.rm = ...)`, `var(..., na.rm = ...)`, `quantile(..., na.rm =..., p=...)`, `IQR(..., na.rm =...)`
* {dplyr} functions: see R videos (e.g. `select()`, `filter()`, `slice()`, `slice_max()`, `slice_min()`, `distinct()`, `count()`, `group_by()`, `summarize()`, `arrange()`, `n()`, etc.)

# Introduction

----------

In the last chapter we introduced the data frame, an object type in R. We also introduced various statistics that summarize one-dimensional data. We hope you are convinced that these statistics are useful for extracting interesting information from a dataset. We will now go into more detail, showing you how to calculate other, more advanced statistics in R. We will also show you how to transform data with the package dplyr, using real-world data from the Translation Database.

--------

```{r}
library(dplyr)
library(ggplot2)
```

----------

# Some helpful notes

We'll begin with some information that will help you navigate your workspace in R.
----------

Clearing variables in your workspace (i.e., global environment)

--------

```{r}
ls() # check what R objects are currently in your workspace
rm(list=ls()) # remove all the R objects in your workspace

x = c(3,1,4) # create a variable x that stores the vector c(3,1,4)
y = 1:2
rm(x) # remove just x from the workspace

x = c(3,1,4)
y = 1:2
rm(list=c('x', 'y')) # remove x and y from the workspace

x = c(3,1,4)
y = 1:2
rm(list=ls()) # remove all R objects in your workspace
# same as rm(list=c('x', 'y')) since x and y are the only R objects in your workspace
```

--------

To run a line of code from the script file use

* `Command + Return` for Mac
* `Control + Return` for PC

You can see all the shortcut keys in RStudio under Tools > Keyboard Shortcuts Help.

Functions in R

--------

A function in R takes some data (i.e., a column or multiple columns in the dataset), manipulates the data (e.g., sums it up, takes its maximum, or makes a graph with it), and outputs the desired result. A function in R has the following format:
![function_anatomy](./Ch2.2_function_anatomy.png)
Input arguments are separated by commas.

--------

Reading the help manual for a function

--------

In the help manual for a function, if an input argument already has a default value assigned to it, `R` will use that default whenever the user does not supply a value for that argument.
![function_default_values](./Ch2.2_function_default_values.png)
For example,

```{r}
seq() # with no inputs supplied, seq() falls back on its defaults (from = 1, to = 1) and returns 1
```

--------

Save and load specific variables

--------

```{r}
x = c(3,1,4)
y = 1:2

save("x", "y", file="xy.RData") # save the x, y variables to xy.RData
# xy.RData will be saved to the current working directory
# you can also specify a full path in `file` to tell R where to save x and y

rm(list=c('x', 'y')) # remove x and y from the workspace to see if save() works
ls()

load(file = "xy.RData") # load x, y back
ls() # check if we get x and y back
x
y

rm(list=ls())
```

--------

# Getting Started

--------

For today's lesson, let's first read in the dataset.

```{r load-file, warning=FALSE}
# This line will require you to find the data on your own machine. The data can be
# found on the [Translation Database](https://www.publishersweekly.com/pw/translation/search/index.html) website.
# Once you've saved the csv to your computer, insert your own file path below.
library(readr)
translations = read_csv(file = '~/Documents/Princeton/CDH/SML 201/Module/TranslationDatabase.csv')
```

--------

We should always familiarize ourselves with the dataset before beginning any kind of analysis.

```{r}
class(translations) # this tells us what kind of R object this is
dim(translations) # 8628 rows by 18 columns
```

----------

```{r}
head(translations) # look at the first 6 (by default) rows of the dataset
```

----------

```{r}
tail(translations, n=3) # look at the last 3 rows of the dataset
```

----------

We can see that some observations have NA listed where no information is provided. We'll account for these missing values later on.

```{r}
# check the data type of each column in `translations`
str(translations, strict.width = 'cut')
```

Notice that, next to each variable, the data type is listed.
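You can also check the type of any individual value directly with `class()`. A quick self-contained illustration (the example values here are made up for demonstration, not taken from the dataset):

```r
# class() reports the data type of any R object
class("Fiction")    # a string of characters
class(19.95)        # a number (double)
class(TRUE)         # a logical value
class(c(19.95, NA)) # an NA does not change the type of a numeric vector
```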
There are three basic data types in R that we'll use throughout this course: character, numeric, and logical.

(1) **Character variable**: each element is a string of one or more characters (essentially, letters)

(2) **Numeric variable**: each element is a number. The two most common classes of numeric variables are integer (whole number) and double (decimal)

(3) **Logical variable**: binary, two possible values (TRUE or FALSE).

Later on in the course, we'll also see factor variables, which are essentially numeric codes tied to character-valued levels.

----------

Before analyzing the data, we should check each variable type in R. Here, we see that price is listed as a character variable, when it should in fact be a numeric variable. We can change this with a simple command. To reference a single column from the dataframe, we use the $ symbol.

```{r warning=FALSE}
translations$price <- as.numeric(translations$price)
```

Given that price is now numeric, we can find its corresponding 5-number summary:

```{r}
summary(translations$price)
```

We can also input the entire dataset into the summary function, but in this case the only relevant output will be for price, given that the other variables are predominantly character variables:

```{r}
summary(translations)
```

# Manipulating Variables in a Data Frame with dplyr

----------

The dplyr package in R has several useful functions to manipulate a data frame. Here we'll use the translation database to try out these functions.

Using select( )

----------

Let's have another look at an overview of our dataset:

```{r}
# check the data type of each column in `translations`
str(translations, strict.width = 'cut')
```

There are several variables that might be less useful to us in analyzing the data. ISBN and e-ISBN numbers, for instance, might not tell us much. We can keep certain variables (columns) while removing others by using the `select( )` function in the dplyr package.
For instance:

```{r}
translations %>%
  select(title, genre, country, language, price) # create a data frame with the selected variables
```

This new dataframe, with fewer variables, will be easier to work with. Essentially we're tuning out the extra noise. We can assign it a new name:

```{r}
new_translations <- translations %>%
  select(title, genre, country, language, price)
```

Alternatively, we might want to select columns (variables) according to their position in the dataframe.

```{r}
translations %>% select(3:5) # select columns 3-5
translations %>% select(3:ncol(translations)) # select all columns beginning with the 3rd column
translations %>% select(-1) # remove only the first column
```

Using mutate( )

----------

In some cases, we may want to change a column or add an additional column to the dataframe. Suppose that we are presenting to an international audience and would like to include the book prices in dollars and euros. We can add a price column in euros using the `mutate( )` function. (Note: at the moment, 1 USD is approximately equal to 0.85 euros)

```{r}
translations %>% mutate(EuroPrice=price*0.85)
```

Note that the new column is given the name "EuroPrice." We could have written "price=price*0.85" within our function, but this would have overwritten the original column in the dataset. We recommend against altering the original dataset, so that you can always return to the original data.

Using pull( )

----------

If you want to extract the values in a column as a vector, you can use the `pull( )` function. Suppose we want to make a list of all of the countries included in the translation database.

```{r}
translations %>%
  pull(country) %>% # extract a vector with country names
  unique() %>% # list each country only once
  sort() # list in alphabetical order
```

That first entry with the dashes might seem a bit strange. If we go back to the original dataset, we'll see that this corresponds to some translations that include work from multiple countries.
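If you'd like to inspect those multi-country entries, one option is to filter for country values made up entirely of dashes. Treat this as a sketch: the exact marker used in the csv may differ, so check the output of `unique(translations$country)` first and adjust the pattern accordingly.

```r
library(dplyr)
# keep rows whose country field is a run of one or more dashes
# (the assumed multi-country marker); grepl("^-+$", x) is TRUE
# only when x consists entirely of dashes
translations %>%
  filter(grepl("^-+$", country)) %>%
  select(title, country, language)
```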
These exceptional cases are important to fully understanding the dataset, so make sure to spend some time with them.

# Manipulating Cases in a Data Frame in dplyr

----------

Note that functions such as `select( )` and `mutate( )` have to do with the variables (columns) of the data frame. Sometimes, we may want to extract particular observations (rows) from the data frame, without altering the variables.

Using filter( )

----------

To extract rows that meet certain logical criteria, we can use the `filter( )` function in dplyr. Suppose that our area of research interest is Spanish, and we only want to look at translations from Spanish.

```{r}
translations %>% filter(language=="Spanish")
```

It's easy to add multiple criteria. Suppose we're interested only in Fiction translated from Spanish.

```{r}
translations %>% filter(language=="Spanish" & genre=="Fiction") # using the AND logical operator
```

Or maybe we're interested in two particular countries, Mexico and Nicaragua.

```{r}
translations %>% filter(country=="Mexico" | country=="Nicaragua") # using the OR logical operator
```

Or in translations from Spanish that fall in the price range of 10-20 dollars.

```{r}
translations %>% filter(language=="Spanish" & price >= 10 & price <= 20) # GREATER THAN OR EQUAL TO and LESS THAN OR EQUAL TO
```

Or maybe we want to exclude a country: translations from Spanish, not including Spain.

```{r}
translations %>% filter(language=="Spanish" & country != "Spain") # making an exclusion with NOT
```

The `filter()` function is a useful way to limit the observations to only those of interest. It will also come in handy later on when we want to graph observations that meet certain criteria.

Using slice( ) and related functions

----------

To select rows by position, use the `slice( )` function.

```{r}
translations %>% slice(5:10) # selects rows 5-10
```

To select rows with the lowest values, use `slice_min( )`.
```{r}
translations %>% slice_min(price) # there are 20 translations at the minimum cost of 0 dollars
translations %>% slice_min(price, prop=0.25) # returns the 25% of all translations with the lowest cost
```

To select rows with the highest values, use `slice_max( )`.

```{r}
translations %>% slice_max(price) # there is 1 translation at the maximum cost of 282 dollars
translations %>% slice_max(price, prop=0.10) # returns the 10% of all translations with the highest cost
```

To randomly select a subset of rows, use `slice_sample( )`.

```{r}
translations %>% slice_sample(n=10) # randomly select 10 rows
```

Note that if you run `slice_sample( )` multiple times, you should get a new random selection each time. If you want to create a simulation that can be reproduced, use the `set.seed( )` function.

```{r}
set.seed(800) # you can set the seed with any number
translations %>% slice_sample(n=10)
```

Every time you run the chunk above, you will get the same randomly selected 10 rows.

Using arrange( )

----------

Sometimes we may want to rearrange the order of our dataset. Let's `arrange( )` the dataset by price:

```{r}
translations %>% arrange(price) # orders from lowest price to highest price
translations %>% arrange(desc(price)) # orders from highest price to lowest price
```

Using group_by( ) and summarize( )

----------

Sometimes we may want to create certain groups within the dataset to carry out our analysis. Suppose we want to find the mean (average) price of books by country.

```{r}
translations %>%
  group_by(country) %>%
  summarize(AveragePrice=mean(price, na.rm=TRUE)) # na.rm=TRUE so countries with a missing price don't return NA
```

We can also `group_by` more than one variable. R will group from left to right. Can you tell the difference between the following two results?
```{r}
translations %>%
  group_by(genre, country) %>% # groups by genre and then country
  summarize(AveragePrice=mean(price, na.rm=TRUE))

translations %>%
  group_by(country, genre) %>% # groups by country and then genre
  summarize(AveragePrice=mean(price, na.rm=TRUE))
```

Note that the `group_by` function does not alter the original dataframe. It simply creates groups within the data to facilitate analysis.

Using count( ) and n( )

----------

To count the number of rows (observations) in a particular group, we can use the `count( )` function.

```{r}
translations %>% count(country) # counts the number of translations by country
```

To count the number of rows within the summarize function, use the `n( )` function. For instance, we can obtain the same result as above with the following code:

```{r}
translations %>%
  group_by(country) %>%
  summarize(n())
```

To count the number of unique values, use the `n_distinct( )` function.

```{r}
translations %>% summarize(n_distinct(country)) # determine the number of distinct countries
```

----------

These are some of the many dplyr functions available to you. For a cheat sheet that includes these and more, see the dplyr cheat sheet on the RStudio (Posit) website.

# Summary functions for numeric variables

----------

We will now turn our attention to numeric variables, which we'll analyze in terms of center and spread. One numeric variable in the translation database is `price`.

--------

Three ways to summarize the central location and the spread of a dataset:

* **Mean, standard deviation/variance**
* **Five number summary: minimum, 1st quartile, median (i.e., 2nd quartile), 3rd quartile, maximum**
* **Median, interquartile range**

--------

## Mean, standard deviation and variance

--------

Recall that the **mean** of a set of data points is the *average* of all the data points in the set

\[
mean\_of\_a\_dataset = \frac{pt.1 + pt.2 + ... + pt.n}{n}
\]

*pt.i* stands for the $i^{th}$ data point in your dataset

--------

Variance gives a measure of the spread of the dataset; it is (approximately) the average of the squared differences between each data point and the mean of the dataset (note that for a sample we divide by n-1 rather than n):

\begin{align}
& variance\_of\_dataset \\
& = \frac{(pt.1-mean)^2 + (pt.2-mean)^2 + ... + (pt.n-mean)^2}{n-1}
\end{align}

--------

A question for you: Why do we need to square the difference?

--------

Note: the mean of a dataset is the quantity that minimizes the average of the squared difference between the data points and the quantity.

--------

Taking the square root of the variance gives the standard deviation, whose unit is consistent with the units of the original dataset.

\[
SD\_of\_dataset = \sqrt{variance\_of\_dataset}
\]

--------

## Five number summary: minimum, 1st quartile, median, 3rd quartile, maximum

--------

Recall that the 1st quartile, median and 3rd quartile are just the 25th, 50th and 75th percentiles, respectively.

--------

While mean and SD give a quick summary of the central location and the spread of a dataset, they do not provide information on the shape of the distribution; e.g., we cannot tell from the values of mean and SD whether the distribution is symmetric or if there are extreme values in the dataset.

--------

The following three datasets all have means approximately 0 and SDs approximately 1. However, we can see that the shapes of their distributions are quite different. The first two distributions are symmetric while the last one is skewed right (i.e., with a long right tail).
--------

\newpage

```{r, include=F}
set.seed(2346)
unif4000 = runif(n=10000, min=-sqrt(12)/2, max=sqrt(12)/2)
mean(unif4000)
var(unif4000)

rnorm4000 = rnorm(n=10000, mean=0, sd = 1)
mean(rnorm4000)
sd(rnorm4000)

exp4000 = rexp(n=10000, rate=1)-1
mean(exp4000)
sd(exp4000)
```

```{r fig.cap = "Histograms for three datasets each with mean around 0 and SD around 1", echo=F, fig.height=6, fig.width=7}
par(mfrow=c(3,1), cex.main=.8, cex.lab=.8, cex.axis=.8)
hist(unif4000, prob=T, main='Histogram of a Dataset with Mean about 0 and SD about 1',
     xlab='data points', xlim=c(-6, 6))
hist(rnorm4000, prob=T, main='Histogram of a Dataset with Mean about 0 and SD about 1',
     xlab='data points', xlim=c(-6, 6))
hist(exp4000, prob=T, main='Histogram of a Dataset with Mean about 0 and SD about 1',
     xlab='data points', xlim=c(-6, 6))
par(mfrow=c(1,1))
```

--------

The five number summary gives a better idea of the shape of the distribution of a dataset. The median gives an idea of the central location (a different notion of central location from the mean). For the three datasets above, the five number summaries, along with the means, are as follows:

--------

```{r echo=F}
knitr::kable(
  round(
    rbind(
      summary(unif4000),
      summary(rnorm4000),
      summary(exp4000)
    ), digits = 2)
)
```

Compared to the median, the mean was affected by the extreme values in the third dataset.

--------

## Median, interquartile range

We already know that the median is just the 50th percentile. The interquartile range (IQR) gives a measure of the spread of the "central" 50% of the data:

Interquartile range = Q3-Q1

--------

Finding measures of center and spread

--------

Let's try to find the mean (average) price of a translation in the translation database. Recall that to refer to a particular column, we use the $ symbol.

```{r}
mean(translations$price) # notice that this returns NA
```

A result of NA here is caused by the missing values in the dataset.
There are several translations in the translation database that do not have a price listed. Setting `na.rm=TRUE` tells `mean()` to ignore these missing values in the calculation.

```{r}
mean(translations$price, na.rm=TRUE)
```

We should be careful about excluding too much data. Let's check to see how many entries were excluded.

```{r}
sum(is.na(translations$price)) # count how many entries do not have a price listed
```

Note that the output of `summary()` also reports the number of NAs.

```{r}
summary(translations$price)
```

This small number of exclusions shouldn't have a major effect on our results.

--------

Standard deviation and variance are common measures of spread, which we'll study in depth later on in the course.

```{r}
sd(translations$price, na.rm = T) # calculate the standard deviation
var(translations$price, na.rm = T) # calculate the variance
```

--------

We might also want to find the median and interquartile range.

```{r}
median(translations$price, na.rm = T) # calculate the median
```

Three ways to find the interquartile range:

```{r}
# first way: calculate each quantile separately first
# then, take the difference
quantile(translations$price, na.rm = T, p = .75) # .75 quantile
quantile(translations$price, na.rm = T, p = .25) # .25 quantile
quantile(translations$price, na.rm = T, p=.75) - quantile(translations$price, na.rm = T, p=.25)
# you can use this method to calculate the difference between
# any two quantiles, not just the 1st and the 3rd quartiles
```

--------

```{r}
# alternatively, we can output the values of the two quantiles
# as elements in a vector and then take the difference
# of the two quantiles
diff(quantile(translations$price, na.rm = T, p=c(.25, .75)))
```

--------

```{r}
# lastly, we can also calculate this with
# the `IQR()` function
IQR(translations$price, na.rm = T)
```

--------

# Check for understanding

We can display the full five-number summary of `price` with one line of code.

```{r}
summary(translations$price)
```

With the outputs above, answer the following questions.
Question 1. Roughly what percentage of the translations cost between 15 and 24 dollars?

--------

Answer: About 50%.

--------

Question 2. Roughly what percentage of translations cost more than 24 dollars?

--------

Answer: 100% - 75% = 25%.

--------

Question 3. Suppose that about 67.5% of the translations cost less than `x` dollars; `x` is called the ______ quantile. Find out what price this quantile corresponds to.

--------

Answer: The price that about 67.5% of the translations cost less than is called the $\underline{.675}$ quantile. The price this .675 quantile corresponds to is 20 dollars.

```{r}
quantile(translations$price, p = .675, na.rm = T)
```

--------

Question 4. Let's look at the histogram for translation prices.

--------

Answer:

```{r warning=F, message=F, echo=F}
theme_set(theme_bw())
ggplot(translations) +
  geom_histogram(mapping = aes(x = price, y = ..density..),
                 fill = "lightblue", color = "white") +
  labs(x='Price (in dollars)',
       title = 'Histogram for Translation Prices'
       )
```

--------

```{r}
max(new_translations$price, na.rm=T) # some translations seem to be a lot more expensive than the rest
```

--------

We see that some translations cost a lot more than others. Which translations are these? We can subset the part of the data that pertains to the highest-cost translations. From the histogram, it looks like only a few books cost more than 100 dollars. Let's see how many there are.

-------

```{r}
nrow(translations %>% filter(price > 100))
```

--------

We can also sort the book prices in decreasing order:

```{r}
translations %>%
  arrange(desc(price)) %>%
  pull(price)
```

--------

The list above is too long; in general you don't want to print out figures that are irrelevant to your analysis. Let's look at just the top 20 prices.
```{r}
# print only the first 20 lines
translations %>%
  arrange(desc(price)) %>%
  pull(price) %>%
  head(n = 20)
```

Another way to do this is:

```{r}
translations %>%
  slice_max(order_by = price, n = 20) %>%
  pull(price)
```

Note that the most expensive translation is almost 100 dollars more than the second highest one! Why might these books be so expensive? Perhaps they are rare, limited-edition prints. Or perhaps translations from some languages are more expensive than others. It's also possible that these could be input errors. We should study the dataset more to come to a proper conclusion. A price that falls well outside the main price range is called an outlier; we are almost certain that these top prices are outliers just from looking at the histogram.

--------

## Outliers

An **outlier** is an observation that falls well outside the main range of the data. We should treat outliers with caution; sometimes outliers occur due to recording errors, and other times outliers actually provide us insightful information about the data.

--------

There are different rules to identify outliers. One set of rules that is commonly used defines an outlier to be a data point that falls more than 1.5 times the IQR below Q1 or above Q3.

```{r}
quantile(translations$price, na.rm=T) # show all quartiles
quartiles <- quantile(translations$price, na.rm=T)

# Use the index to reference the proper quartile
quartiles[2] - 1.5*IQR(translations$price, na.rm=T) # lower bound
quartiles[4] + 1.5*IQR(translations$price, na.rm=T) # upper bound

lower <- quartiles[2] - 1.5*IQR(translations$price, na.rm=T) # lower bound
upper <- quartiles[4] + 1.5*IQR(translations$price, na.rm=T) # upper bound

translations %>%
  filter(price < lower | price > upper) %>%
  nrow() # calculate the number of outliers
```

---------

In addition, a boxplot provides a graphical way to identify outliers.
```{r fig.height=10, echo = F}
boxplot(translations$price, main='Boxplot for Translation Prices',
        ylab = 'Price (in Dollars)')
```

--------

## Mode

The **modes** of a dataset are the peaks on the histogram for the dataset. The location of a mode tells us on which part of the x-interval the data has high frequency. The most prominent mode is called the **major mode** (there could be more than one major mode depending on the distribution). The less prominent ones are called the minor modes.

--------

Question 5. Observe the following modified histogram for `price`, which shows only translations costing between 0 and 50 dollars; are there any modes?

--------

```{r warning=F, message=F, echo=F}
ggplot(translations) +
  geom_histogram(mapping = aes(x = price, y = ..density..),
                 fill = "lightblue", color = "white", binwidth = 2) +
  labs(x='Price (in Dollars)',
       title = 'Price Distribution for Translations') +
  xlim(0,50) # limit the x-axis scale to exclude far-right outliers
```

--------

Answer: There is one major mode around 14 dollars and one minor mode around 24 dollars.

--------

Question 6. Exactly how many translations have a price within the interval (13,15]? (Hint: When you apply mathematical operations to a vector of logical values, `R` treats the logical values as numbers; `TRUE` is treated as 1 and `FALSE` is treated as 0).

--------

```{r}
sum(translations$price > 13 & translations$price <= 15, na.rm=T)
```

--------

What is the proportion of translations with a price in the interval (13,15]?

```{r}
mean(translations$price > 13 & translations$price <= 15, na.rm=T)
```

--------

`geom_histogram()` uses `stat_bin()` to divide the data into bins. The default value for `stat_bin()` is `closed = "right"`. This means that each bin includes the value at its right endpoint but not the one at its left endpoint. Use the histogram to approximate what proportion of translations have a price in the interval (13, 15].
--------

```{r}
.1125*2 # the height (.1125) times the width (2) of the corresponding bar
```

--------

* How much is the most expensive translation?
* How much is the least expensive translation?

--------

```{r}
range(translations$price, na.rm=T)
```

--------

Question 7. What are the prices of the four most expensive translations?

--------

```{r}
# sort price in decreasing order
translations %>%
  slice_min(order_by = -price, n = 20) %>% # note that -price tells R to sort in descending order
  pull(price)
```

--------

The four most expensive translations in the database cost:

```{r}
translations %>%
  slice_min(order_by = -price, n = 4) %>%
  pull(price)
```

--------

# Additional Practice

--------

Question 1. Create the following objects in R.

(1) A data frame with translation title, author's first and last name, translator's first and last name, publication year, author's gender and translator's gender.

```{r}
translations %>%
  select(title, auth_first, auth_last, trnsl_first, trnsl_last, pubdate_yr, auth_gend, trnsl_gend)
```

Or alternatively:

```{r}
translations %>%
  select(title, starts_with("auth"), starts_with("trnsl"), pubdate_yr)
```

--------

(2) A dataframe with only translations from Germany or France.

```{r}
translations %>%
  filter(country=="Germany" | country=="France") %>%
  arrange(country) # list by country in alphabetical order
```

--------

(3) A vector that includes all of the unique languages in the translation database.

```{r}
translations %>%
  pull(language) %>%
  unique()
```

How many unique languages are there?

```{r}
length(
  translations %>%
    pull(language) %>%
    unique()
)
```

If we're being super attentive, we'll actually notice that one of the languages is listed as NA (i.e., the language was not listed for that observation). So there are actually 83 unique languages in the dataset. Details matter!

--------

Question 2. Create a table that shows the number of female and male translators by country.
For now you can ignore observations for which the translator's gender is not provided.

```{r}
translations %>%
  group_by(country) %>%
  summarize(ct.female=sum(trnsl_gend=="Female", na.rm=T),
            ct.male=sum(trnsl_gend=="Male", na.rm=T))
```

--------

Question 3. Modify one function from your answer to the previous question to instead show the *proportion* of female and male translators by country.

```{r}
# Table with the proportion of female and male translators by country
translations %>%
  group_by(country) %>%
  summarize(prp.female=mean(trnsl_gend=="Female", na.rm=T),
            prp.male=mean(trnsl_gend=="Male", na.rm=T))
```

We could easily modify our code to show the proportions as percentages:

```{r}
# Table with the percentage of female and male translators by country
translations %>%
  group_by(country) %>%
  summarize(pct.female=mean(trnsl_gend=="Female", na.rm=T)*100,
            pct.male=mean(trnsl_gend=="Male", na.rm=T)*100)
```

--------

Question 4. Create a table to show which languages are most represented in the translation database.

```{r}
translations %>%
  group_by(language) %>%
  summarize(count=n(),
            percentage=round((n()/nrow(translations))*100, digits=1)) %>%
  arrange(desc(count)) %>%
  mutate(rank=rank(-count)) # Add a column to show rank
```

Change one part of your code to do the same for country, and then for publisher.

```{r}
translations %>%
  group_by(country) %>%
  summarize(count=n(),
            percentage=round((n()/nrow(translations))*100, digits=1)) %>%
  arrange(desc(count)) %>%
  mutate(rank=rank(-count)) # Add a column to show rank
```

```{r}
translations %>%
  group_by(publisher) %>%
  summarize(count=n(),
            percentage=round((n()/nrow(translations))*100, digits=1)) %>%
  arrange(desc(count)) %>%
  mutate(rank=rank(-count)) # Add a column to show rank
```

--------

Question 5. Determine if each of the following countries is included in the translation database.
* Brazil
* Cyprus
* Laos
* Nigeria
* Zimbabwe

You could print out a list of all countries included in the translation database and then verify each one individually, but there's a much simpler way. We can use the `%in%` operator to check this instantly.

```{r}
"Brazil" %in% translations$country
"Cyprus" %in% translations$country
"Laos" %in% translations$country
"Nigeria" %in% translations$country
"Zimbabwe" %in% translations$country
```

We can write this more efficiently with a vector:

```{r}
c("Brazil", "Cyprus", "Laos", "Nigeria", "Zimbabwe") %in% translations$country
```

# For Discussion

--------

Just as data is more than simply letters and numbers, data analysis is more than a computer at work. It relies on human interaction and interpretation. Consider the guiding questions below, to be discussed with your fellow classmates.

Guiding Question 1

--------

Review the variables in the translation database. What information is missing that might be relevant to your analysis? What would you like to see in the dataset that isn't there? Outline the processes (and potential roadblocks) to acquiring such information.

Guiding Question 2

--------

Consider variables that may not fully represent all possible contributors. For example, what are the limitations to categorizing gender as binary (male or female)? How might the dataset better account for transgender writers and translators?

Guiding Question 3

--------

Make a list of 3-5 questions that a researcher could try tackling using this dataset. Think big! For example, how has the ratio of female to male translators changed over the last decade?
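To get you started on the example in Guiding Question 3, here is a hedged sketch built from the dplyr functions covered in this chapter. It assumes the `pubdate_yr` and `trnsl_gend` columns used in the Additional Practice section; treat it as a starting point rather than a finished analysis.

```r
library(dplyr)
# Share of translations credited to a female translator, by publication year;
# missing genders are ignored via na.rm = TRUE
translations %>%
  group_by(pubdate_yr) %>%
  summarize(prp.female = mean(trnsl_gend == "Female", na.rm = TRUE)) %>%
  arrange(pubdate_yr)
```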