This is my data frame, which contains only one observation: a long string in which four different parts are identifiable:
example <- "4.6 (19 ratings) Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately. 151 students enrolled "
df <- data.frame(example)
As you can see, the first observation is composed of a string with 4 different parts: rating (4.6), number of ratings (19 ratings), a sentence (Course...accurately), and students enrolled (151).
I used the separate() function to split that column into four:
df1 <- separate(df, example, c("Rating", "Number of rating", "Sentence", "Students"), sep = " ")
However, this does not behave as expected.
Any ideas?
UPDATE:
This is what I get with your suggestion, @nicola:
> df1 <- separate(df, example, c("Rating", "Number of rating", "Sentence", "Students"), sep=" {4,}")
Warning message:
Expected 4 pieces. Additional pieces discarded in 1 rows [1].
How about this:
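library(stringr)
library(tibble)
# split the string on the separator and flatten the result
x <- str_split(example, " ") %>%
  unlist()
# drop the empty pieces left over by the split
x <- x[x != ""]
df <- tibble("a", "b", "c", "d")
df[1, ] <- x
colnames(df) <- c("Rating", "Number of rating", "Sentence", "Students")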
> str(df)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 4 variables:
$ Rating : chr "4.6"
$ Number of rating: chr " (19 ratings)"
$ Sentence : chr " Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of ra"| __truncated__
$ Students : chr "151 students enrolled"
There are two keys to the answer. The first is the correct regex to use as the separator: sep = "[[:space:]]{2,}", which means two or more whitespace characters (\\s{2,} would be a more common alternative). The second is that your example actually has a lot of trailing whitespace, which separate() tries to put into another column. It can simply be removed using trimws(). The solution therefore looks like this:
library(tidyr)
library(dplyr)
example <- "4.6 (19 ratings) Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately. 151 students enrolled "
df <- data.frame(example)
df_new <- df %>%
  mutate(example = trimws(example)) %>%
  separate(col = "example",
           into = c("rating", "number_of_ratings", "sentence", "students_enrolled"),
           sep = "[[:space:]]{2,}")
as_tibble(df_new)
# A tibble: 1 x 4
rating number_of_ratings sentence students_enrolled
<chr> <chr> <chr> <chr>
1 4.6 (19 ratings) Course Ratings are calculated from individual students’ ratings and a vari~ 151 students enr~
as_tibble() is only used to format the output.
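The same split can also be sketched in base R, without tidyr. The string below is a shortened, hypothetical stand-in: it assumes the four parts are separated by runs of three spaces and followed by trailing spaces, which may not match the real data exactly.

```r
# Hypothetical stand-in for the original string: parts separated by runs of
# spaces (assumed here to be three), plus trailing whitespace.
example <- "4.6   (19 ratings)   Course Ratings are calculated fairly.   151 students enrolled   "

# Trim the ends first, then split wherever two or more whitespace characters occur.
parts <- strsplit(trimws(example), "\\s{2,}")[[1]]

# Assemble a one-row data frame with one column per part.
df_new <- as.data.frame(
  setNames(as.list(parts),
           c("rating", "number_of_ratings", "sentence", "students_enrolled")),
  stringsAsFactors = FALSE
)
df_new
```

Skipping trimws() here would not add an empty fifth piece, because strsplit() drops a trailing empty string, but trimming keeps the behaviour in line with separate().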
Certainly possible with the stringr package and a bit of regular expressions:
rating_mean n_ratings n_students descr
1 4.65 19 151 "Course (...) accurately."
Code
library(stringr)
# create result data frame
result <- data.frame(cbind(rating_mean = 0, n_ratings = 0, n_students = 0, descr = 0))
# loop through rows of example data frame
for (i in seq_len(nrow(example))) {
  # collapse runs of whitespace into a single space
  example[i, 1] <- gsub("\\s+", " ", example[i, 1])
  # match and extract mean rating
  result[i, 1] <- as.numeric(str_match(example[i, 1], "^[0-9]+\\.[0-9]+"))
  # match and extract number of ratings
  result[i, 2] <- as.numeric(str_match(str_match(example[i, 1], "\\(.+\\)"), "[0-9]+"))
  # match and extract number of enrolled students
  result[i, 3] <- as.numeric(str_match(str_match(example[i, 1], "\\s[0-9].+$"), "[0-9]+"))
  # match and extract sentence
  result[i, 4] <- str_match(example[i, 1], "[A-Z].+\\.")
}
Data
example <- "4.65 (19 ratings) Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately. 151 students enrolled "
example <- data.frame(example, stringsAsFactors = FALSE)
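The loop can also be avoided by capturing all four parts with one pattern. The following is a minimal base R sketch under a few assumptions: the input is a shortened, hypothetical string, every row matches the pattern, and the description itself contains no text of the form "<number> students enrolled".

```r
# Hypothetical shortened input; the real column holds the full sentence.
x <- c("4.65 (19 ratings) Course Ratings are calculated fairly and accurately. 151 students enrolled ")

# One pattern with four capture groups: mean rating, rating count,
# description, and enrolled students.
pattern <- "^\\s*([0-9]+\\.[0-9]+)\\s*\\(([0-9]+) ratings\\)\\s*(.*?)\\s*([0-9]+) students enrolled\\s*$"
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Each list element is c(full match, group 1, ..., group 4); drop the full match.
groups <- do.call(rbind, lapply(m, function(g) g[-1]))

result <- data.frame(
  rating_mean = as.numeric(groups[, 1]),
  n_ratings   = as.numeric(groups[, 2]),
  n_students  = as.numeric(groups[, 4]),
  descr       = groups[, 3],
  stringsAsFactors = FALSE
)
result
```

Because regexec() is vectorised over x, the same code handles a whole column of such strings at once.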
I have a column that lists the race/ethnicity of individuals. I am trying to make it so that if the cell contains an 'H' then I only want H. Similarly, if the cell contains an 'N' then I want an N. Finally, if the cell has multiple races, not including H or N, then I want it to be M. Below is how it is listed currently and the desired output.
Current output
People | Race/Ethnicity
PersonA| HAB
PersonB| NHB
PersonC| AB
PersonD| ABW
PersonE| A
Desired output
PersonA| H
PersonB| N
PersonC| M
PersonD| M
PersonE| A
You can try the following dplyr approach, which combines grepl with dplyr::case_when. It first searches for N values; among the rows without an N, it searches for H values; and among those with neither H nor N, it assigns M to rows with more than one race and keeps the original letter for rows with only one race (assuming each race is represented by a single character).
A base R approach is below as well: no dependencies needed, but less elegant.
Data
df <- read.table(text = "person ethnicity
PersonA HAB
PersonB NHB
PersonC AB
PersonD ABW
PersonE A", header = TRUE)
dplyr (note order matters given your priority)
library(dplyr)
df %>% mutate(eth2 = case_when(
  grepl("N", ethnicity) ~ "N",
  grepl("H", ethnicity) ~ "H",
  !grepl("H|N", ethnicity) & nchar(ethnicity) > 1 ~ "M",
  TRUE ~ ethnicity
))
You could also do it "manually" in base R by indexing (note order matters given your priority):
df[grepl("H", df$ethnicity), "eth2"] <- "H"
df[grepl("N", df$ethnicity), "eth2"] <- "N"
df[!grepl("H|N", df$ethnicity) & nchar(df$ethnicity) > 1, "eth2"] <- "M"
df[nchar(df$ethnicity) %in% 1, "eth2"] <- df$ethnicity[nchar(df$ethnicity) %in% 1]
In both cases the output is:
# person ethnicity eth2
# 1 PersonA HAB H
# 2 PersonB NHB N
# 3 PersonC AB M
# 4 PersonD ABW M
# 5 PersonE A A
Note that this is based on your comment about priority (an N anywhere takes precedence, even when the value contains both N and H, and so on).
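The priority rule can also be wrapped in a small reusable function. This is only a sketch: collapse_race is a hypothetical helper name, and it assumes the input is a character vector in which each race is a single letter.

```r
# Hypothetical helper: collapse a vector of race codes under the priority
# N, then H, then "M" for any other multi-letter code.
collapse_race <- function(x) {
  out <- x                                   # single-letter codes stay as they are
  out[nchar(x) > 1] <- "M"                   # multi-letter codes default to "M"
  out[grepl("H", x, fixed = TRUE)] <- "H"    # any "H" overrides "M"
  out[grepl("N", x, fixed = TRUE)] <- "N"    # any "N" overrides both
  out
}

collapse_race(c("HAB", "NHB", "AB", "ABW", "A"))
```

The assignments run from lowest to highest priority, so a later line overwrites an earlier one, which is the same ordering idea as in the base R indexing approach.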
We could use str_extract. When the number of characters in the column is greater than 1, extract 'N' and 'H' separately, then coalesce the extracted elements along with 'M' (so if there is no match we get 'M', otherwise the result follows the order in which the inputs were passed to coalesce). In the other case, i.e. when the number of characters is 1, return the column value. Thus 'N' supersedes 'H' no matter its position in the string.
library(dplyr)
library(stringr)
df1 %>%
mutate(output = case_when(nchar(`Race/Ethnicity`) > 1
~ coalesce(str_extract(`Race/Ethnicity`, 'N'),
str_extract(`Race/Ethnicity`, 'H'), "M"),
TRUE ~ `Race/Ethnicity`))
-output
People Race/Ethnicity output
1 PersonA HAB H
2 PersonB NHB N
3 PersonC AB M
4 PersonD ABW M
5 PersonE A A
data
df1 <- structure(list(People = c("PersonA", "PersonB", "PersonC", "PersonD",
"PersonE"), `Race/Ethnicity` = c("HAB", "NHB", "AB", "ABW", "A"
)), class = "data.frame", row.names = c(NA, -5L))
I was trying to transform some datasets in R when I ran into the following issue:
I have got a char column that shows the income of some people (a census). So what I was trying to do is to standardize the data for future analysis. This is a sample of the data:
income
2000,3 Thousand Euros
50,14 Thousand Euros
54000 Euros
52312 Euros
This is what I am expecting:
income
2000.3 k€
50.14 k€
54 k€
52.31 k€
And finally, this is the code I have so far, but it is still not working. I am new to R and I am still searching for methods. To clarify, in the if statement I was trying to find all the values that have more than 4 digits, but I think it would be easier to search for the ones that contain " Euros". However, to do arithmetic I believe I have to convert the char column into an integer one, so the " Euros" regex would no longer be valid (I believe).
census$income <- str_replace_all(census$income, " Thousand Euros", '')
census$income <- str_replace_all(census$income, " Euros", '')
census$income <- as.integer(census$income)
if(floor(log10(census$income))+1>4){
census$income/1000
}
census$income <- as.character(census$income)
Thank you very much for any help! =)
A solution with nested sub:
dplyr:
library(dplyr)
df %>%
mutate(income = sub("(000\\s|\\sThousand\\s)?Euros", " k€",
sub(",", ".", income)))
income
1 2000.3 k€
2 50.14 k€
3 54 k€
base R:
df$income <- sub("(000\\s|\\sThousand\\s)?Euros", " k€",
sub(",", ".", df$income))
Data:
df <- data.frame(
income = c("2000,3 Thousand Euros","50,14 Thousand Euros","54000 Euros")
)
EDIT:
Here's a solution for more complex data (as per OP's comment):
library(dplyr)
library(stringr)
df %>%
  mutate(
    # change comma into dot:
    income = gsub(",", ".", income),
    # remove text:
    income = gsub("[ A-Za-z]", "", income),
    # divide integer by 1000:
    income = ifelse(str_detect(income, "^\\d+$"),
                    as.numeric(str_extract(income, "\\d+"))/1000,
                    income),
    # add " k€":
    income = paste0(income, " k€"))
Data:
df <- data.frame(
income = c("2000,3 Thousand Euros","50,14 Thousand Euros","54000 Euros", "43156 Euros")
)
I think you can accomplish this with a combination of readr::parse_number and str_detect(tolower(income), "thousand").
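library(dplyr)
library(readr)
library(stringr)
census %>%
  mutate(
    # "," is the decimal mark in this data, so say so explicitly
    parsed_income = parse_number(income, locale = locale(decimal_mark = ",")),
    # values not given in thousands are divided by 1000, so everything is in k€
    parsed_income = if_else(
      str_detect(tolower(income), "thousand"),
      parsed_income,
      parsed_income / 1000
    )
  )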
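For comparison, here is a minimal base R sketch of the same conversion to k€. It assumes the comma is always a decimal mark and that every value is either tagged "Thousand Euros" or given in plain euros, as in the sample.

```r
income <- c("2000,3 Thousand Euros", "50,14 Thousand Euros",
            "54000 Euros", "52312 Euros")

# Rows already expressed in thousands are flagged before the text is removed.
in_thousands <- grepl("thousand", income, ignore.case = TRUE)

# Keep only digits and the decimal comma, then turn the comma into a dot.
value <- as.numeric(sub(",", ".", gsub("[^0-9,]", "", income), fixed = TRUE))

# Plain euro amounts are divided by 1000 so that every value is in k€.
income_k <- ifelse(in_thousands, value, value / 1000)

# Round for display and attach the unit.
paste(round(income_k, 2), "k€")
```

Keeping income_k numeric and only pasting the unit on for display means the column stays usable for later calculations.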
I am reading in 2 big .TXT files and filtering them based off a certain code. The codes are located in the 16th column of each file.
Colleges <- read.table("Colleges.txt", sep ="|", fill = TRUE)
Majors <- read.table("Majors.txt", sep ="|", fill = TRUE)
The Data looks like this
bld_name  dpt_name  majors      admin  code  college  year
MLK       English   Literature  Ms. W  T     A&S      18
Freedom   Math      Stats       Ms. B  R     STEM     18
MLK       Math      CALC        Ms. B  P     STEM     18
After I create the subsets and append the two files, I want to create a unique ID using bld_name and dpt_name.
college_sub <- subset(colleges,colleges[[16]] %in% c("T", "R"), drop = TRUE)
majors_sub <- subset(majors,majors[[16]] %in% c("T", "R"), drop = TRUE)
combine <- do.call(rbind,list(college_sub,majors_sub)) #Append both files
uniqueID$id <- paste(combine$dpt_name,"-",combine$bld_name)
cols_g <- c("dpt_name", "Majors", "Admin", "Year")
combine <- combine[,cols_g]
It should look like this:
Unique ID    majors      admin  code  college  year
MLK-English  Literature  Ms. W  T     A&S      18
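One way the ID column could be built is sketched below. The data frame is a small hypothetical stand-in that reuses the column names from the sample; the name unique_id is also an assumption. Note that paste(a, "-", b) puts spaces around the dash because the default separator is a space, whereas sep = "-" gives the compact "MLK-English" form.

```r
# Hypothetical stand-in for the combined data, using the column names from the sample.
combine <- data.frame(
  bld_name = c("MLK", "Freedom"),
  dpt_name = c("English", "Math"),
  majors   = c("Literature", "Stats"),
  stringsAsFactors = FALSE
)

# sep = "-" joins the two columns without the spaces that paste() adds by default.
combine$unique_id <- paste(combine$bld_name, combine$dpt_name, sep = "-")

# Keep the ID plus the remaining columns of interest.
combine <- combine[, c("unique_id", "majors")]
combine
```

The new column has to be assigned to the data frame that already exists (here combine); assigning into an object that has not been created yet, such as uniqueID, fails.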
I am doing a text sentiment analysis in R using the tm package. I have scraped news articles from Reuters and gave them a variable name according to their date. I added a,b,c etc. to indicate multiple articles per day, like this:
art170411a
art170411b
art170411c
art170410a
...
...
I then run a standard positive/negative terms analysis which gives me the sentiment score per article. My question is: how do I average these scores so that I get a sentiment score per day?
I have a VCorpus containing my 2000+ articles over 3 years. Every article has a date stamp. For the matching with the positive/negative terms I have converted my Corpus to a list and then a bag of words like this:
corp_list <- lapply(corp, FUN = paste, collapse=" ")
corp_bag <- str_split(corp_list, pattern = "\\s+")
I have the final score in two formats:
score_naive_list <- lapply(corp_bag, function(x) { sum(!is.na(match(x, pos))) - sum(!is.na(match(x, neg)))})
score_naive <- unlist(lapply(corp_bag, function(x) { sum(!is.na(match(x, pos))) - sum(!is.na(match(x, neg)))}))
So my question: how do I average the multiple sentiment scores into a one day score?
I redid my answer with reproducible data. Once you get your data sorted, this should work just fine.
library(tm)
library(stringr)
reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578),readerControl = list(reader = readReut21578XMLasPlain))
timestamps <- meta(corp,"datetimestamp")
days <- sapply(timestamps,strftime,format="%Y-%m-%d")
pos <- c("good","excellent","positive","effective")
neg <- c("bad","terrible","negative")
corp_list <- lapply(corp, FUN = paste, collapse=" ")
daily_bows <- aggregate(corp_list ~ days,data.frame(corp_list = unlist(corp_list),days = days),FUN=paste,collapse = " ")
corp_bag <- str_split(daily_bows$corp_list, pattern = "\\s+")
score_string <- function(x){
sum(!is.na(match(x, pos))) - sum(!is.na(match(x, neg)))
}
daily_bows$scores <- sapply(corp_bag,score_string)
print(daily_bows[,c("days","scores")])
# days scores
# 1 1987-02-26 3
# 2 1987-03-01 1
# 3 1987-03-02 1
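If the per-article scores already exist and a plain daily average is wanted instead of one score for each day's pooled text, the date can be read from the article names. This is a sketch with made-up scores; it assumes the score vector is named with the art + yymmdd + letter scheme from the question, which the real score_naive may not be.

```r
# Hypothetical per-article scores, named the way the question names its articles.
score_naive <- c(art170411a = 2, art170411b = -1, art170411c = 5, art170410a = 3)

# Characters 4 to 9 of each name hold the date as yymmdd.
day <- as.Date(substr(names(score_naive), 4, 9), format = "%y%m%d")

# Average the article scores within each day.
daily_score <- tapply(score_naive, day, mean)
daily_score
```

Pooling the text first and averaging the scores afterwards give different numbers: the average gives every article the same weight regardless of its length.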
I am really new to R and this is probably a really basic question, but let's say I have a data set with 2 columns describing students, who are male or female. One column has the student and the other has the gender. How do I find the percentage of each?
Another way using data.table:
students <- data.frame( names = c( "Bill", "Stacey", "Fred", "Jane", "Sarah" ),
gender = c( "M", "F", "M", "F", "F" ),
stringsAsFactors = FALSE )
library( data.table )
setDT( students )[ , 100 * .N / nrow( students ), by = gender ]
# gender V1
# 1: M 40
# 2: F 60
Or dplyr:
library( dplyr )
students %>%
group_by( gender ) %>%
summarise( percent = 100 * n() / nrow( students ) )
# A tibble: 2 × 2
# gender percent
# <chr> <dbl>
# 1 F 60
# 2 M 40
These are both popular packages for operations like these but, as has already been pointed out, you can also stick with base R if you prefer.
You can use the table() function to produce a table telling you how many males and females there are among the students. Then just divide this table by the total number of students (you can get this with the length() function). Finally, multiply the result by 100.
Your code should be something like:
proportions <- table(your_data_frame$gender_column)/length(your_data_frame$gender_column)
percentages <- proportions*100
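As a concrete run of this idea, the sketch below uses a small hypothetical data frame and prop.table(), which divides a table by its total and so does the same job as dividing by length().

```r
# Hypothetical example data.
students <- data.frame(
  name   = c("Bill", "Stacey", "Fred", "Jane", "Sarah"),
  gender = c("M", "F", "M", "F", "F"),
  stringsAsFactors = FALSE
)

counts <- table(students$gender)          # number of students per gender
percentages <- 100 * prop.table(counts)   # same as 100 * counts / sum(counts)
percentages
```

One difference to keep in mind: table() drops NA values by default, so if the gender column has missing entries, dividing by length() and using prop.table() will not give the same percentages.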
There are already some good answers to this question, but as the original submitter admits to being new to R, I wanted to provide a very long form answer. The answer below takes more than the minimum necessary number of steps and doesn't use helpers like pipes.
Hopefully, providing an answer in this way helps the original submitter understand what is happening with each step.
# Load the dplyr library
library("dplyr")
# Create an example data frame
students <-
data.frame(
names = c("Bill", "Stacey", "Fred", "Jane", "Sarah"),
gender = c("M", "F", "M", "F", "F"),
stringsAsFactors = FALSE
)
# Count the total number of students.
total_students <- nrow(students)
# Use dplyr filter to obtain just Female students
all_female_students <- dplyr::filter(students, gender %in% "F")
# Count total number of female students
total_female <- nrow(all_female_students)
# Repeat to find total number of male students
all_male_students <- dplyr::filter(students, gender %in% "M")
total_male <- nrow(all_male_students)
# Divide total female students by total students
# and multiply result by 100 to obtain a percentage
percent_female <- (total_female / total_students) * 100
# Repeat for males
percent_male <- (total_male / total_students) * 100
> percent_female
[1] 60
> percent_male
[1] 40
This is probably not the most efficient way to do this, but this is one way to solve the problem.
First you have to create a data.frame. Here is an artificial one:
students <- data.frame(student = c("Carla", "Josh", "Amanda","Gabriel", "Shannon", "Tiffany"), gender = c("Female", "Male", "Female", "Male", "Female", "Female"))
View(students)
Then I use prop.table, which gives me a table of proportions for the cells of the table. I coerce it to a data.frame because I love data.frames, and I multiply by 100 to turn the proportions into percentages.
tablature <- as.data.frame.matrix(prop.table(table(students)) * 100)
tablature
I decided to call my data frame table tablature.
So it says "Amanda" is 16 + (2 / 3) % on the female column. Basically that means that she is a Female and thus 0 for male, and my data.frame has 6 students so (1 / 6) * 100 makes her 16.667 percent of the set.
Now what percentage of females and males are there?
Two ways: 1) get the total of each column at the same time with the apply function, or 2) get the total of each column one at a time. Either way, we use the sum function.
apply(tablature, 2, FUN = sum)
Female Male
66.66667 33.33333
Imagine that in terms of percentages.
Here tablature is the proportion-table data frame that I am applying the sum function to, and 2 means across the columns (2 for columns, 1 for rows).
So if you just eyeball the small amount of data, you can see that there are 2 / 6 = 33.3333% males in the data.frame students, and 4 / 6 = 66.66667 % females in the data.frame so I did the calculation correctly.
Alternatively,
sum(tablature$Female)
[1] 66.66667
sum(tablature$Male)
[1] 33.33333
And you can make a barplot. As I formatted it, you would have to refer to it as a matrix to get a barplot.
And from here you can make a stacked visual comparison of Gender barplot.
barplot(as.matrix(tablature), xlab = "Gender", main = "Barplot comparison of Gender Among Students", ylab = "Percentages of Student Group")
It's stacking because R made each student a box of 16.6667%.
To be honest, it looks better if you just plot the output of the apply function. Of course you could save it to a variable. But naahhh ...
barplot(apply(tablature, 2, FUN = sum), col = c("green", "blue"),xlab = "Gender", ylab = "Percentage of Total Students", main = "Barplot showing the Percentages of Gender Represented Among Students", cex.main = 1)
Now it doesn't stack.