Transform integer values in a char column in R

I was trying to transform some datasets in R when I ran into the following issue:
I have got a char column that shows the income of some people (a census). So what I was trying to do is to standardize the data for future analysis. This is a sample of the data:
income
2000,3 Thousand Euros
50,14 Thousand Euros
54000 Euros
52312 Euros
This is what I am expecting:
income
2000.3 k€
50.14 k€
54 k€
52.31 k€
And finally, this is the code I have so far, but it is still not working. I am new to R and still searching for methods. To clarify, in the if statement I was trying to find all the values that have more than 4 digits, but I think it is easier to search for the ones that contain " Euros". However, to do arithmetic I believe I have to convert the char column into an integer one, so the " Euros" regex would no longer match (I believe).
census$income <- str_replace_all(census$income, " Thousand Euros", '')
census$income <- str_replace_all(census$income, " Euros", '')
census$income <- as.integer(census$income)
if(floor(log10(census$income))+1>4){
census$income/1000
}
census$income <- as.character(census$income)
Thank you very much for any help! =)

A solution with nested sub:
dplyr:
library(dplyr)
df %>%
mutate(income = sub("(000\\s|\\sThousand\\s)?Euros", " k€",
sub(",", ".", income)))
income
1 2000.3 k€
2 50.14 k€
3 54 k€
base R:
df$income <- sub("(000\\s|\\sThousand\\s)?Euros", " k€",
sub(",", ".", df$income))
Data:
df <- data.frame(
income = c("2000,3 Thousand Euros","50,14 Thousand Euros","54000 Euros")
)
EDIT:
Here's a solution for more complex data (as per OP's comment):
library(dplyr)
library(stringr)
df %>%
mutate(
# change comma into dot:
income = gsub(",", ".", income),
# remove text:
income = gsub("[ A-Za-z]", "", income),
# divide integer by 1000:
income = ifelse(str_detect(income, "^\\d+$"),
as.numeric(str_extract(income, "\\d+"))/1000,
income),
# add " k€":
income = paste0(income, " k€"))
Data:
df <- data.frame(
income = c("2000,3 Thousand Euros","50,14 Thousand Euros","54000 Euros", "43156 Euros")
)

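I think you can accomplish this with a combination of readr::parse_number and str_detect(tolower(income), "thousand"). Two details matter: the comma has to be declared as the decimal mark (by default parse_number treats it as a grouping mark), and it is the plain Euro values that have to be divided by 1000 to end up in thousands.
library(dplyr)
library(stringr)
library(readr)
census %>%
  mutate(
    # parse the number with "," as the decimal mark
    parsed_income = parse_number(income, locale = locale(decimal_mark = ",")),
    # values not already given in thousands are converted to thousands
    parsed_income = if_else(
      str_detect(tolower(income), "thousand"),
      parsed_income,
      parsed_income / 1000
    )
  )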
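The same idea can be sketched in base R without extra packages. This is only a minimal sketch: it assumes every value uses a comma as the decimal mark, and that any row not containing "Thousand" is given in plain Euros. The names is_thousand, value, k_eur and income_std are made up for illustration.

```r
# Sample values from the question
income <- c("2000,3 Thousand Euros", "50,14 Thousand Euros",
            "54000 Euros", "52312 Euros")

# Flag the rows that are already expressed in thousands
is_thousand <- grepl("thousand", income, ignore.case = TRUE)

# Keep only digits and the comma, then turn the decimal comma into a dot
value <- as.numeric(sub(",", ".", gsub("[^0-9,]", "", income)))

# Plain Euro amounts are divided by 1000 so that everything is in k€
k_eur <- ifelse(is_thousand, value, value / 1000)

# Back to a character column, rounded to two decimals
income_std <- paste(round(k_eur, 2), "k€")
print(income_std)
```

Keeping k_eur as a numeric vector is probably the more useful result for later analysis; the formatted strings are only needed for display.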


How to access values in DF to write a text using R markdown?

I have a processed dataframe as follows (the list of staff differs for different weekly reports):
df <- structure(list(Department = c("DP1", "DP1", "DP2", "DP2", "DP2",
"DP4"), `Staff Name` = c("Bray Laura", "Fognani Mikaela", "Despain Taylor",
"Housum Zachary", "Herman Trenton", "Burgette Lesley"), `Non-compliance Criteria` = c("0 temperature reporting for >/= 3 days",
"0 temperature reporting for 2 consecutive readings", "0 temperature reporting for 2 consecutive readings",
"0 temperature reporting for 2 consecutive readings", "0 temperature reporting for 2 consecutive readings",
"1 temperature reporting/day for >/= 5 days")), row.names = c(NA,
6L), class = "data.frame")
There are 4 fixed values in Department:
DP1
DP2
DP3
DP4
And 3 fixed values in Non-compliance Criteria:
0 temperature reporting for >/= 3 days
0 temperature reporting for 2 consecutive readings
1 temperature reporting/day for >/= 5 days
The list of Staff Name will differ for different weekly reports.
I would like to output the texts in R markdown as follows:
I wish to access the values in the dataframe and write them out as text and bullet points. So within a Department, we would show the list of staff that are non-compliant and which criteria they fulfill. If, within a Department, more than one staff member fulfills the same criteria, we would collapse them with "," and "and" as shown in the "DP2" example.
I was stuck after this:
non_compliants <- df %>%
group_by(Department, `Non-compliance Criteria`) %>%
summarise(Text = paste(paste(`Staff Name`, collapse = " and "), "had", unique(`Non-compliance Criteria`))) %>%
ungroup()
lapply(c("DP1", "DP2", "DP3", "DP4"),
function(x){
ifelse(dim(filter(non_compliants, Department == x))[1] == 0,
"NA",
non_compliants$Text[non_compliants$Department == x])})
You can use pander to output the bullet list.
The code looks like this:
non_compliants <- df %>%
group_by(Department, `Non-compliance Criteria`) %>%
summarise(Text =paste("*", paste(paste(`Staff Name`, collapse = " and "), "had", unique(`Non-compliance Criteria`)))) %>%
ungroup()
non<-non_compliants %>%
group_by(Department) %>%
summarise(messages= paste(Text, collapse = " \n "))
library(pander)
panderOptions("list.style", 'bullet')
non %>% pander(keep.line.breaks = TRUE,style = 'grid', justify = 'left')
We need a character string that simulates markdown syntax. So you can do it like this in your R Markdown document:
persons <- sapply(c("DP1", "DP2", "DP3", "DP4"),
function(x){
ifelse(dim(filter(non_compliants, Department == x))[1] == 0,
"NA",
non_compliants$Text[non_compliants$Department == x])})
text <- sapply(names(persons),
FUN = function(person){
paste("###", person, "\n\n", persons[person], "\n")
}) %>% paste(collapse = "\n")
# not run
cat(text)
# ### DP1
#
# Bray Laura had 0 temperature reporting for >/= 3 days
#
# ### DP2
#
# Despain Taylor and Housum Zachary and Herman Trenton had 0 temperature reporting for 2 consecutive readings
#
# ### DP3
#
# NA
#
# ### DP4
#
# Burgette Lesley had 1 temperature reporting/day for >/= 5 days
(inside a chunk)
Then just put `r text` in your markdown. It will render the desired output.
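One caveat with the ifelse() call above: ifelse() returns a result of the same length as its test, so with a length-one test only the first matching Text is kept per department (which is why Fognani Mikaela is missing under DP1). A minimal base R sketch that keeps every criterion and joins names with "," and "and" could look like this. The column names Staff and Criteria and the helper collapse_names() are assumptions made for the example, not part of the original data.

```r
# Sample data from the question, with simplified (assumed) column names
df <- data.frame(
  Department = c("DP1", "DP1", "DP2", "DP2", "DP2", "DP4"),
  Staff = c("Bray Laura", "Fognani Mikaela", "Despain Taylor",
            "Housum Zachary", "Herman Trenton", "Burgette Lesley"),
  Criteria = c("0 temperature reporting for >/= 3 days",
               rep("0 temperature reporting for 2 consecutive readings", 4),
               "1 temperature reporting/day for >/= 5 days"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: "A", "A and B", "A, B and C"
collapse_names <- function(x) {
  if (length(x) < 2) return(x)
  paste(paste(x[-length(x)], collapse = ", "), "and", x[length(x)])
}

# One row per Department/Criteria pair, with the staff names collapsed
grouped <- aggregate(Staff ~ Department + Criteria, data = df,
                     FUN = collapse_names)
grouped$Text <- paste(grouped$Staff, "had", grouped$Criteria)

# Build one markdown section per department, "NA" when nothing matches
departments <- c("DP1", "DP2", "DP3", "DP4")
sections <- vapply(departments, function(dep) {
  items <- grouped$Text[grouped$Department == dep]
  body <- if (length(items) == 0) "NA" else paste0("* ", items, collapse = "\n")
  paste0("### ", dep, "\n\n", body, "\n")
}, character(1))

text <- paste(sections, collapse = "\n")
cat(text)
```

In an R Markdown document, printing this with cat(text) from a chunk that has the option results = "asis" should let the headings and bullets be rendered as markdown.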

reading txt file and converting it to dataframe

I have a .txt file that consists of some investment data. I want to convert the data in file to data frame with three columns. Data in .txt file looks like below.
Date:
06-04-15, 07-04-15, 08-04-15, 09-04-15, 10-04-15
Equity :
-237.79, -170.37, 304.32, 54.19, -130.5
Debt :
16318.49, 9543.76, 6421.67, 3590.47, 2386.3
If you are going to use read.table(), then the following may help:
Assuming the dat.txt contains above contents, then
dat <- read.table("dat.txt", fill = TRUE, sep = ",")
df <- as.data.frame(t(dat[seq(2,nrow(dat),by=2),]))
rownames(df) <- seq(nrow(df))
colnames(df) <- trimws(gsub(":","",dat[seq(1,nrow(dat),by=2),1]))
yielding:
> df
Date Equity Debt
1 06-04-15 -237.79 16318.49
2 07-04-15 -170.37 9543.76
3 08-04-15 304.32 6421.67
4 09-04-15 54.19 3590.47
5 10-04-15 -130.5 2386.3
Assuming the text file name is demo.txt here is one way to do this
#Read the file line by line
all_vals <- readLines("demo.txt")
#Since the column names and data are in alternate lines
#We first gather column names together and clean them
column_names <- trimws(sub(":", "", all_vals[c(TRUE, FALSE)]))
#we can then paste the data part together and assign column names to it
df <- setNames(data.frame(t(read.table(text = paste0(all_vals[c(FALSE, TRUE)],
collapse = "\n"), sep = ",")), row.names = NULL), column_names)
#Since most of the data is read as factors, we use type.convert to
#convert data in their respective format.
type.convert(df)
# Date Equity Debt
#1 06-04-15 -237.79 16318.49
#2 07-04-15 -170.37 9543.76
#3 08-04-15 304.32 6421.67
#4 09-04-15 54.19 3590.47
#5 10-04-15 -130.50 2386.30
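If typed columns are needed right away, the alternating header/value layout can also be handled with strsplit(). The following is a minimal sketch: it uses an in-memory character vector instead of a file so that it runs on its own, and it assumes the dates are written as day-month-year, which the sample alone does not settle.

```r
# Lines as they would come from readLines("demo.txt")
all_vals <- c("Date:",
              "06-04-15, 07-04-15, 08-04-15, 09-04-15, 10-04-15",
              "Equity :",
              "-237.79, -170.37, 304.32, 54.19, -130.5",
              "Debt :",
              "16318.49, 9543.76, 6421.67, 3590.47, 2386.3")

# Odd lines hold the column names, even lines hold the values
column_names <- trimws(sub(":", "", all_vals[c(TRUE, FALSE)]))
values <- strsplit(all_vals[c(FALSE, TRUE)], ",\\s*")
names(values) <- column_names

# A named list of equal-length vectors becomes a data frame column by column
df <- as.data.frame(values, stringsAsFactors = FALSE)

# Convert each column explicitly (date format is an assumption: dd-mm-yy)
df$Date   <- as.Date(df$Date, format = "%d-%m-%y")
df$Equity <- as.numeric(df$Equity)
df$Debt   <- as.numeric(df$Debt)
str(df)
```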

Separating a column using big spaces in strings in R

This is my data frame, composed of only one observation. It is a long string in which 4 different parts are identifiable:
example <- "4.6 (19 ratings) Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately. 151 students enrolled "
df <- data.frame(example)
As you can see, the first observation is composed of a string with 4 different parts: rating (4.6), number of ratings (19 ratings), a sentence (Course...accurately), and students enrolled (151).
I used the separate() function to divide that column into 4 columns:
df1 <- separate(df, example, c("Rating", "Number of rating", "Sentence", "Students"), sep = " ")
However, this does not behave as expected.
Any ideas?
UPDATE:
This is what I get with your suggestion, @nicola:
> df1 <- separate(df, example, c("Rating", "Number of rating", "Sentence", "Students"), sep=" {4,}")
Warning message:
Expected 4 pieces. Additional pieces discarded in 1 rows [1].
How about this:
x <- str_split(example, " ") %>%
unlist()
x <- x[x != ""]
df <- tibble("a", "b", "c", "d")
df[1, ] <- x
colnames(df) <- c("Rating", "Number of rating", "Sentence", "Students")
> str(df)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 4 variables:
$ Rating : chr "4.6"
$ Number of rating: chr " (19 ratings)"
$ Sentence : chr " Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of ra"| __truncated__
$ Students : chr "151 students enrolled"
There are two keys to the answer. The first is the correct regex for the separator, sep = "[[:space:]]{2,}", which means two or more whitespace characters (\\s{2,} would be a more common alternative). The second is that your example actually has a lot of trailing whitespace, which separate() tries to put into another column. It can simply be removed using trimws(). The solution therefore looks like this:
library(tidyr)
library(dplyr)
example <- "4.6 (19 ratings) Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately. 151 students enrolled "
df <- data.frame(example)
df_new <- df %>%
mutate(example = trimws(example)) %>%
separate(col = "example",
into = c("rating", "number_of_ratings", "sentence", "students_enrolled"),
sep = "[[:space:]]{2,}")
as_tibble(df_new)
# A tibble: 1 x 4
rating number_of_ratings sentence students_enrolled
<chr> <chr> <chr> <chr>
1 4.6 (19 ratings) Course Ratings are calculated from individual students’ ratings and a vari~ 151 students enr~
as_tibble() is only used to format the output.
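The two points above (trim first, then split on runs of two or more whitespace characters) can also be sketched in base R. The string below is a shortened, made-up version of the example with the wide gaps written out explicitly, since the exact spacing of the original is not known.

```r
# Shortened stand-in for the original string; the gaps are 4 spaces wide
example <- paste0("4.6    (19 ratings)    ",
                  "Course Ratings are calculated fairly and accurately.    ",
                  "151 students enrolled    ")

# Remove trailing whitespace, then split on 2+ whitespace characters
parts <- strsplit(trimws(example), "\\s{2,}", perl = TRUE)[[1]]

# Assemble the four pieces, converting the numeric ones
df_new <- data.frame(
  rating            = as.numeric(parts[1]),
  number_of_ratings = as.integer(gsub("\\D", "", parts[2], perl = TRUE)),
  sentence          = parts[3],
  students_enrolled = as.integer(gsub("\\D", "", parts[4], perl = TRUE)),
  stringsAsFactors  = FALSE
)
str(df_new)
```

Single spaces inside the sentence are left alone because the separator requires at least two whitespace characters in a row.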
Certainly possible with the stringr package and a bit of regular expressions:
rating_mean n_ratings n_students descr
1 4.65 19 151 "Course (...) accurately."
Code
library(stringr)
# create result data frame
result <- data.frame(cbind(rating_mean = 0, n_ratings = 0, n_students = 0, descr = 0))
# loop through rows of example data frame
for (i in seq_len(nrow(example))){
# replace spaces
example[i, 1] <- gsub("\\s+", " ", example[i, 1])
# match and extract mean rating
result[i, 1] <- as.numeric(str_match(example[i, 1], "^[0-9]+\\.[0-9]+"))
# match and extract number of ratings
result[i, 2] <- as.numeric(str_match(str_match(example[i, 1], "\\(.+\\)"), "[0-9]+"))
# match and extract number of enrolled students
result[i, 3] <- as.numeric(str_match(str_match(example[i, 1], "\\s[0-9].+$"), "[0-9]+"))
# match and extract sentence
result[i, 4] <- str_match(example[i, 1], "[A-Z].+\\.")
}
Data
example <- "4.65 (19 ratings) Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately. 151 students enrolled "
example <- data.frame(example, stringsAsFactors = FALSE)

Reformatting downloaded Excel data

I have downloaded some GDP data in .xls-format from the OECD website. However, to make this data workable in R, I need to reformat the data to a .csv file. More specifically, I need the year, day and month in the first column, and after the comma I need the GDP values (for example: 1990-01-01, 234590).
The column with GDP values can be easily copied and transposed, but how does one quickly add dates? Is there a fast way to do this, without having to add in the dates manually?
Thanks for the help!
Best,
Sean
PS. Link to (one of) the specific OECD files: https://ufile.io/8ogav or https://stats.oecd.org/index.aspx?queryid=350#
PPS. I have now changed the file to this:
Which I would like to transform into the same style as example 1.
Code that I use for reading in the data:
gdp.start <- c(1970,1) # type "double"
gdp.end <- c(2018,1)
gdp.raw <- "rawData/germany_gdp.csv"
gdp.table <- read.table(gdp.raw, skip = 1, header = F, sep = ',', stringsAsFactors = F)
gdp.ger <- ts(gdp.table[,2], start = gdp.start, frequency = 4) # time-series representation
PPPS.
dput(head(gdp.table))
structure(list(V1 = c("Q2-1970;1.438.810 ", "Q3-1970;1.465.684 ",
"Q4-1970;1.478.108 ", "Q1-1971;1.449.712 ", "Q2-1971;1.480.136 ",
"Q3-1971;1.505.743 ")), row.names = c(NA, 6L), class = "data.frame")
Using your data:
z <- structure(list(V1 = c("Q2-1970;1.438.810 ", "Q3-1970;1.465.684 ",
"Q4-1970;1.478.108 ", "Q1-1971;1.449.712 ", "Q2-1971;1.480.136 ",
"Q3-1971;1.505.743 ")), row.names = c(NA, 6L), class = "data.frame")
dat <- read.csv2(text=paste(z$V1, collapse='\n'), stringsAsFactors=FALSE, header=FALSE)
dat
# V1 V2
# 1 Q2-1970 1.438.810
# 2 Q3-1970 1.465.684
# 3 Q4-1970 1.478.108
# 4 Q1-1971 1.449.712
# 5 Q2-1971 1.480.136
# 6 Q3-1971 1.505.743
and a simple function to replace quarters with the first date of each quarter
quarters <- function(s, format) {
qs <- c("Q1","Q2","Q3","Q4")
dts <- c("01-01", "04-01", "07-01", "10-01")
for (i in seq_along(qs))
s <- sub(qs[i], dts[i], s)
if (! missing(format))
s <- as.Date(s, format=format)
s
}
We can change them into strings of dates, preserving the order:
str(quarters(dat$V1))
# chr [1:6] "04-01-1970" "07-01-1970" "10-01-1970" "01-01-1971" ...
or we can convert into Date objects by setting the format:
str( quarters(dat$V1, format='%m-%d-%Y') )
# Date[1:6], format: "1970-04-01" "1970-07-01" "1970-10-01" "1971-01-01" ...
so replacing the column with the actual Date object is simply dat$V1 <- quarters(dat$V1, format='%m-%d-%Y').
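To get all the way to the "1990-01-01, 234590" layout asked for, the GDP column still has to become numeric. The sketch below does both steps in base R. It assumes the dots in values such as 1.438.810 are thousands separators (so the value is 1438810), which seems likely for this export but is not confirmed in the question; the output file name is hypothetical.

```r
# A few raw rows in the format shown by dput(head(gdp.table))
raw <- c("Q2-1970;1.438.810 ", "Q3-1970;1.465.684 ", "Q1-1971;1.449.712 ")

# Split "Q2-1970;1.438.810" into the period and the value
parts  <- strsplit(trimws(raw), ";", fixed = TRUE)
period <- vapply(parts, `[`, character(1), 1)
value  <- vapply(parts, `[`, character(1), 2)

# Quarter number and year, then the first day of that quarter
quarter <- as.integer(substr(period, 2, 2))
year    <- as.integer(substr(period, 4, 7))
date    <- as.Date(sprintf("%d-%02d-01", year, (quarter - 1L) * 3L + 1L))

# Drop the dots (assumed thousands separators) and convert to numeric
gdp <- as.numeric(gsub(".", "", value, fixed = TRUE))

out <- data.frame(date = date, gdp = gdp)
print(out)

# Writing the result would then be, for example:
# write.csv(out, "germany_gdp_clean.csv", row.names = FALSE)
```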

How do I average a sentiment score for a day with multiple texts?

I am doing a text sentiment analysis in R using the tm package. I have scraped news articles from Reuters and gave them a variable name according to their date. I added a,b,c etc. to indicate multiple articles per day, like this:
art170411a
art170411b
art170411c
art170410a
...
...
I then run a standard positive/negative terms analysis which gives me the sentiment score per article. My question is: how do I average these scores so that I get a sentiment score per day?
I have a VCorpus containing my 2000+ articles over 3 years. Every article has a date stamp. For the matching with the positive/negative terms I have converted my Corpus to a list and then a bag of words like this:
corp_list <- lapply(corp, FUN = paste, collapse=" ")
corp_bag <- str_split(corp_list, pattern = "\\s+")
I have the final score in two formats:
score_naive_list <- lapply(corp_bag, function(x) { sum(!is.na(match(x, pos))) - sum(!is.na(match(x, neg)))})
score_naive <- unlist(lapply(corp_bag, function(x) { sum(!is.na(match(x, pos))) - sum(!is.na(match(x, neg)))}))
So my question: how do I average the multiple sentiment scores into a one day score?
I redid my answer with reproducible data. Once you get your data sorted, this should work just fine.
library(tm)
library(stringr)
reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578),readerControl = list(reader = readReut21578XMLasPlain))
timestamps <- meta(corp,"datetimestamp")
days <- sapply(timestamps,strftime,format="%Y-%m-%d")
pos <- c("good","excellent","positive","effective")
neg <- c("bad","terrible","negative")
corp_list <- lapply(corp, FUN = paste, collapse=" ")
daily_bows <- aggregate(corp_list ~ days,data.frame(corp_list = unlist(corp_list),days = days),FUN=paste,collapse = " ")
corp_bag <- str_split(daily_bows$corp_list, pattern = "\\s+")
score_string <- function(x){
sum(!is.na(match(x, pos))) - sum(!is.na(match(x, neg)))
}
daily_bows$scores <- sapply(corp_bag,score_string)
print(daily_bows[,c("days","scores")])
# days scores
# 1 1987-02-26 3
# 2 1987-03-01 1
# 3 1987-03-02 1
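The code above scores the concatenated text of each day, which gives a daily total rather than an average. If per-article scores already exist, as with score_naive in the question, a mean per day can be sketched as follows. The score values here are invented, and the sketch assumes the article names encode the date as yymmdd in characters 4 to 9 (so art170411a is 2017-04-11).

```r
# Article names as in the question, with made-up per-article scores
article_ids <- c("art170411a", "art170411b", "art170411c", "art170410a")
score_naive <- c(2, -1, 5, 3)

# Extract the date part of each name (assumed format: yymmdd)
day <- as.Date(substr(article_ids, 4, 9), format = "%y%m%d")

# Average the article scores within each day
daily <- aggregate(score ~ day,
                   data = data.frame(day = day, score = score_naive),
                   FUN = mean)
print(daily)
```

With a real corpus, the day vector could instead come from the datetimestamp metadata, as in the days variable computed above.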
