I'm trying to use R on a large CSV file that, for this example, can be said to represent a list of people and forms of transportation. If a person owns a mode of transportation, this is represented by an X in the corresponding cell. Example data:
Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,
What I'm after is to learn which persons have identical modes of transportation or, ideally, where the modes of transportation differ by no more than one.
The format is a bit weird, but, assuming the CSV file is named example.csv, I can read it into a data frame and transpose it as below (it should be fairly obvious that I'm a complete R noob):
ex <- read.csv('example.csv')
ext <- as.data.frame(t(ex))
This post explained how to find duplicates, and it seems to work:
duplicated(ext) | duplicated(ext[nrow(ext):1, ])[nrow(ext):1]
which(duplicated(ext) | duplicated(ext[nrow(ext):1, ])[nrow(ext):1])
This returns the following indexes:
1 2 4 5 6 7
That does indeed correspond with what I consider to be duplicate rows: Peter has the same modes of transportation as Mary and Stan (indexes 2, 4 and 6), and Don and Mike likewise share the same modes of transportation (indexes 5 and 7).
Again, that seems to work OK, but if the number of modes of transportation and people is significant, it becomes really difficult to find not just which rows are duplicates, but which indexes actually matched: in this case, that indexes 2, 4 and 6 are identical and that 5 and 7 are identical.
Is there an easy way of getting that information so that one doesn't have to try and find the matches manually?
Also, given all of the above, is it possible to alter the code so that it considers rows to match if they differ in only a limited number of X positions? For example, if a difference of one is acceptable, then as long as two persons in the example above have no more than one differing mode of transportation, they would still be considered a match.
Happy to elaborate further and very grateful for any and all help.
library(dplyr)
library(tidyr)
ex <- read.csv(text = "Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,", )
ext <- tidyr::pivot_longer(ex, -Type, names_to = "person")
# head(ext)
ext <- ext %>%
  group_by(person) %>%
  filter(value == "X") %>%
  summarise(Modalities = n(), Which = paste(Type, collapse = ", ")) %>%
  arrange(desc(Modalities), Which) %>%
  mutate(IdenticalGrp = rle(Which)$lengths %>% {rep(seq(length(.)), .)})
ext
#> # A tibble: 6 x 4
#> person Modalities Which IdenticalGrp
#> <chr> <int> <chr> <int>
#> 1 Paul 3 Scooter, Skateboard, Boat 1
#> 2 Don 2 Car, Skateboard 2
#> 3 Mike 2 Car, Skateboard 2
#> 4 Mary 2 Scooter, Skateboard 3
#> 5 Peter 2 Scooter, Skateboard 3
#> 6 Stan 2 Scooter, Skateboard 3
To get a membership list for any particular IdenticalGrp you can just pull like this:
ext %>% filter(IdenticalGrp == 3) %>% pull(person)
#> [1] "Mary" "Peter" "Stan"
So I've got a dataset with a column that I need to clean. The column has entries like "$10,000 - $19,999" and "$40,000 and over".
How do I code this so that, for example, "$10,000 - $19,999" becomes 15000 and "$40,000 and over" becomes 40000 in a new column?
I am new to R, so I have no idea how to start. I need to do a regression analysis on this data, but that won't work unless I get this fixed. I have been told that some basic string/regex operations are what I need. How should I proceed?
Here's a solution using the tidyverse.
Load packages
library(dplyr) # for general cleaning functions
library(stringr) # for string manipulations
library(magrittr) # for the '%<>%' operator
Make a dummy dataset based on your example.
df <- data_frame(price = sample(c(rep('$40,000 and over', 10),
                                  rep('$10,000', 10),
                                  rep('$19,999', 10),
                                  rep('$9,000', 10),
                                  rep('$28,000', 10))))
Inspect the new dataframe
print(df)
#> # A tibble: 50 x 1
#> price
#> <chr>
#> 1 $9,000
#> 2 $40,000 and over
#> 3 $28,000
#> 4 $10,000
#> 5 $10,000
#> 6 $9,000
#> 7 $19,999
#> 8 $10,000
#> 9 $19,999
#> 10 $40,000 and over
#> # ... with 40 more rows
Clean up the format of the price strings by removing the $ symbol and the commas. Note the use of '\\' before the $ symbol: $ is a special character in regex, so it has to be escaped as \$, and within an R string the escaping backslash itself must be written as \\ (the first \ tells R that the second \ is a literal backslash to pass to the regex engine).
df %<>%
  mutate(price = str_remove(string = price, pattern = '\\$'), # remove $ sign
         price = str_remove(string = price, pattern = ','))   # remove comma
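As a side note (a small variation, not part of the original approach), both characters can be removed in one pass with a character class, inside which $ needs no escaping:
df %<>% mutate(price = str_remove_all(price, '[$,]')) # strip $ and commas together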
Quick check of the data.
head(df)
#> # A tibble: 6 x 1
#> price
#> <chr>
#> 1 9000
#> 2 40000 and over
#> 3 28000
#> 4 10000
#> 5 10000
#> 6 9000
Process the number strings into numerics. First convert '40000 and over' to '40000', then convert all the strings to numerics, then use logical statements to map the numbers onto the values you want. The functions ifelse() and case_when() are largely interchangeable, but I tend to use ifelse() for single rules and case_when() when there are multiple rules, because of case_when()'s more compact format.
df %<>%
  mutate(price = ifelse(price == '40000 and over', # convert 40000+ to 40000
                        yes = '40000',
                        no = price),
         price = as.numeric(price), # convert all to numeric
         price = case_when( # use logic statements to change values to desired value
           price == 40000 ~ 40000,
           price >= 30000 & price < 40000 ~ 35000,
           price >= 20000 & price < 30000 ~ 25000,
           price >= 10000 & price < 20000 ~ 15000,
           price >= 0 & price < 10000 ~ 5000
         ))
Have a final look.
print(df)
#> # A tibble: 50 x 1
#> price
#> <dbl>
#> 1 5000
#> 2 40000
#> 3 25000
#> 4 15000
#> 5 15000
#> 6 5000
#> 7 15000
#> 8 15000
#> 9 15000
#> 10 40000
#> # ... with 40 more rows
Created on 2018-11-18 by the reprex package (v0.2.1)
First, you should see what exactly your data is composed of: use the table() function on data$column to see how many unique entries you must account for.
table(data$column)
If whoever was entering this data was consistent about their wording, it may be easiest to hard-code a substitution for each unique entry. So if unique(data$column)[1] == "$10,000 - $19,999" and unique(data$column)[2] == "$40,000 and over.":
data$column[which(data$column==unique(data$column)[1])] <- "15000"
data$column[which(data$column==unique(data$column)[2])] <- "40000"
...
If you have too many unique entries for this approach to be viable, I'd suggest looking for consistencies in character sequences that can be used to make replacements. You might find that whoever entered this data was inconsistent about how they wrote "$40,000 and over", such that you had:
data$column==unique(data$column)[2]
>"$40,000 and over."
data$column==unique(data$column)[3]
>"$40,000 and over"
data$column==unique(data$column)[4]
>"above $40,000"
...
If there weren't instances of "$40,000" that belonged to other categories, you could combine these entries for substitution. Note that $ is a regex special character, so it must be escaped in the pattern (or pass fixed = TRUE to grepl):
data$column[which(grepl("\\$40,000", data$column))] <- "40000"
Inconsistency in qualitative data entry is a very human problem, and dealing with it requires exploring your data for trends and easy ways to consolidate your replacements. It's a fine idea to use R to identify and replace the patterns you find, to save time, but it will ultimately require a fine touch as you get down to individual cases where you have to interpret or correct someone's entries to fit them into your desired bins. Depending on your data-quality standards, you can always throw out entries that don't match the patterns you observe.
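If there are too many distinct strings to hard-code, here is a general sketch of the midpoint idea from the question (the midpoint() helper is my own): pull every number out of each entry and average them, so a range collapses to its midpoint and a lone bound stays as-is.
# Strip commas, extract all digit runs, and average them per entry
midpoint <- function(x) {
  nums <- regmatches(x, gregexpr("[0-9]+", gsub(",", "", x)))
  vapply(nums, function(n) mean(as.numeric(n)), numeric(1))
}
midpoint(c("$10,000 - $19,999", "$40,000 and over"))
#> [1] 14999.5 40000.0
The exact midpoint of 10000 and 19999 is 14999.5; round it if you want 15000.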
I have a dataframe that contains survey responses, with each row representing a different person. One column, "Text", is an open-ended text question. I would like to use tidytext::unnest_tokens so that I can do text analysis by row, including sentiment scores, word counts, etc.
Here is the simple dataframe for this example:
Satisfaction <- c("Satisfied", "Satisfied", "Dissatisfied", "Satisfied", "Dissatisfied")
Text <- c("I'm very satisfied with the services", "Your service providers are always late which causes me a lot of frustration", "You should improve your staff training, service providers have bad customer service", "Everything is great!", "Service is bad")
Gender <- c("M", "M", "F", "M", "F")
df <- data.frame(Satisfaction, Text, Gender)
I then turned the Text column into character:
df$Text <- as.character(df$Text)
Next I grouped by the id column and nested the dataframe.
df <- df %>%
  mutate(id = row_number()) %>%
  group_by(id) %>%
  unnest_tokens(word, Text) %>%
  nest(-id)
Getting this far seems to have worked OK, but now how do I use purrr::map functions to work on the nested list column "word"? For example, how would I use dplyr::mutate to create a new column with word counts for each row?
Also, is there a better way to nest the dataframe so that only the "Text" column is a nested list?
I love using purrr::map to do modeling for different groups, but for what you are talking about doing, I think you can stick with just straight dplyr.
You can set up your dataframe like this:
library(dplyr)
library(tidytext)
Satisfaction <- c("Satisfied",
                  "Satisfied",
                  "Dissatisfied",
                  "Satisfied",
                  "Dissatisfied")
Text <- c("I'm very satisfied with the services",
          "Your service providers are always late which causes me a lot of frustration",
          "You should improve your staff training, service providers have bad customer service",
          "Everything is great!",
          "Service is bad")
Gender <- c("M", "M", "F", "M", "F")
df <- data_frame(Satisfaction, Text, Gender)
tidy_df <- df %>%
  mutate(id = row_number()) %>%
  unnest_tokens(word, Text)
Then to find, for example, the number of words per line, you can use group_by and mutate.
tidy_df %>%
  group_by(id) %>%
  mutate(num_words = n()) %>%
  ungroup
#> # A tibble: 37 × 5
#> Satisfaction Gender id word num_words
#> <chr> <chr> <int> <chr> <int>
#> 1 Satisfied M 1 i'm 6
#> 2 Satisfied M 1 very 6
#> 3 Satisfied M 1 satisfied 6
#> 4 Satisfied M 1 with 6
#> 5 Satisfied M 1 the 6
#> 6 Satisfied M 1 services 6
#> 7 Satisfied M 2 your 13
#> 8 Satisfied M 2 service 13
#> 9 Satisfied M 2 providers 13
#> 10 Satisfied M 2 are 13
#> # ... with 27 more rows
You can do sentiment analysis by implementing an inner join; check out some examples here.
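As a sketch of that inner join (using the Bing lexicon that ships with tidytext; column names follow the tidy_df built above):
tidy_df %>%
  inner_join(get_sentiments("bing"), by = "word") %>% # keep only sentiment-bearing words
  count(id, sentiment)                                # tally them per response
And if you do want the nested/purrr version the question asked about, one possible sketch (keeping the older nest(-id) interface from the question):
library(purrr)
library(tidyr)

df %>%
  mutate(id = row_number()) %>%
  unnest_tokens(word, Text) %>%
  nest(-id) %>%                           # one list-column row per response
  mutate(num_words = map_int(data, nrow)) # word count via purrr::map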
I'm still learning R and have been given the task of grouping a long list of students into groups of four based on another variable. I have loaded the data into R as a data frame. How do I sample entire rows without replacement, one from each of the 4 levels of a variable, and have R output the data into a spreadsheet?
So far I have been tinkering with a for loop and the sample function, but I'm quickly getting in over my head. Any suggestions? Here is a sample of what I'm attempting to do. Given:
Last.Name <- c("Picard", "Troi", "Riker", "La Forge", "Yar", "Crusher", "Crusher", "Data")
First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data")
Email <- c("a@a.com", "b@b.com", "c@c.com", "d@d.com", "e@e.com", "f@f.com", "g@g.com", "h@h.com")
Section <- c(1, 1, 2, 2, 3, 3, 4, 4)
df <- data.frame(Last.Name, First.Name, Email, Section)
I want to randomly select a Star Trek character from each section and end up with 2 groups of 4. I would want the entire row's worth of information to make it over to a new data frame containing all groups with their corresponding group number.
I'd use the wonderful package 'dplyr'
require(dplyr)
random_4 <- df %>% group_by(Section) %>% slice(sample(c(1, 2), 1))
random_4
Source: local data frame [4 x 4]
Groups: Section
  Last.Name First.Name   Email Section
1      Troi     Deanna b@b.com       1
2  La Forge     Geordi d@d.com       2
3   Crusher    Beverly f@f.com       3
4      Data       Data h@h.com       4
Run the sampling again and you get a different draw:
random_4
Source: local data frame [4 x 4]
Groups: Section
  Last.Name First.Name   Email Section
1    Picard   Jean-Luc a@a.com       1
2     Riker    William c@c.com       2
3   Crusher    Beverly f@f.com       3
4      Data       Data h@h.com       4
%>% means 'and then'.
The code is read as:
Take df AND THEN, for each 'Section', select one row by position (slice), either 1 or 2. Voila.
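If your sections can hold more than two students, the same idea generalizes by sampling a position from n() instead of hard-coding 1 or 2 (a small variation on the code above; random_pick is my own name):
random_pick <- df %>%
  group_by(Section) %>%
  slice(sample(n(), 1)) %>% # one random row per section, whatever its size
  ungroup()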
I suppose you have 8 students: First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data").
If you wish to randomly assign a section number to the 8 students, and assuming you would like each section to have 2 students, then you can either permute Section <- c(1, 1, 2, 2, 3, 3, 4, 4) or permute the list of the students.
First approach, permute the sections:
> assigned_section <- print(sample(Section))
[1] 1 4 3 2 2 3 4 1
Then the following data frame gives the assignments:
assigned_students <- data.frame(First.Name, assigned_section)
Second approach, permute the students:
> assigned_students <- print(sample(First.Name))
[1] "Data" "Geordi" "Tasha" "William" "Deanna" "Beverly" "Jean-Luc" "Wesley"
Then, the following data frame gives the assignments:
assigned_students <- data.frame(assigned_students, Section)
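Either way, wrapping the permutation in set.seed() makes the assignment reproducible, and a quick table() confirms each section still has two students (a sanity check, not part of the assignment itself):
set.seed(42) # make the shuffle reproducible
assigned_section <- sample(Section)
table(assigned_section)
#> assigned_section
#> 1 2 3 4
#> 2 2 2 2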
Alex, thank you. Your answer wasn't exactly what I was looking for, but it inspired the correct one for me. I had been thinking about the process from a far too complicated point of view. Instead of having R select rows and put them into a new data frame, I decided to have R assign a random number to each of the students and then sort the data frame by that number:
First, I broke up the data frame into sections:
df1 <- subset(df, Section == 1)
df2 <- subset(df, Section == 2)
df3 <- subset(df, Section == 3)
df4 <- subset(df, Section == 4)
Then I randomly generated a group number 1 through 4.
Groupnumber <- sample(1:4, 4, replace = FALSE)
Next, I told R to bind the columns:
Assigned1 <- cbind(df1,Groupnumber)
I then ran the group-number generator and cbind in alternating order until I got through the whole set (I wanted to make sure the order of the numbers was unique for each section).
Finally, row-binding the data set back together:
Final_List<-rbind(Assigned1,Assigned2,Assigned3,Assigned4)
Thank you everyone who looked this over. I am new to data science, R, and stackoverflow, but as I learn more I hope to return the favor.
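For what it's worth, the repeated subset/sample/cbind steps can be automated; a sketch of the same idea using split() and lapply() (group numbers here run from 1 to the section size):
# Split by Section, give each piece a freshly shuffled group number, recombine
pieces <- lapply(split(df, df$Section), function(d) {
  d$Groupnumber <- sample(nrow(d)) # random permutation of 1..(section size)
  d
})
Final_List <- do.call(rbind, pieces)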
I'd suggest the randomizr package to "block assign" according to section. The block_ra function lets you do this in an easy-to-read one-liner.
install.packages("randomizr")
library(randomizr)
df$group <- block_ra(block_var = df$Section,
                     condition_names = c("group_1", "group_2"))
You can inspect the resulting sets in a variety of ways. Here's with base R subsetting:
df[df$group == "group_1",]
  Last.Name First.Name   Email Section   group
2      Troi     Deanna b@b.com       1 group_1
3     Riker    William c@c.com       2 group_1
6   Crusher    Beverly f@f.com       3 group_1
7   Crusher     Wesley g@g.com       4 group_1
df[df$group == "group_2",]
  Last.Name First.Name   Email Section   group
1    Picard   Jean-Luc a@a.com       1 group_2
4  La Forge     Geordi d@d.com       2 group_2
5       Yar      Tasha e@e.com       3 group_2
8      Data       Data h@h.com       4 group_2
If you want to roll your own:
set <- tapply(1:nrow(df), df$Section, FUN = sample, size = 1)
df[set,]  # show the sampled set
df[-set,] # show the complementary set
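To keep both halves labelled in a single data frame rather than two views, one small extension (the group column name is my own choice):
df$group <- ifelse(seq_len(nrow(df)) %in% set, "group_1", "group_2")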
Imagine I have a dataset of soccer players' salaries, nationalities, and heights. I'm interested in seeing whether there is an association between nationality and height and players' salaries. I have come up with a few different models and would like to compare how well they predict, but to do this I need train and test data that contain the same levels of nationality.
So imagine I have data that look like this:
> soccer_player_df
salary nationality height
1 504731.1 USA 6.466627
2 485333.2 USA 5.468320
3 483259.4 USA 4.694929
4 493594.2 USA 5.685126
5 530805.8 England 5.856093
6 520851.5 England 6.031963
7 484309.9 Spain 6.127087
8 462986.6 Portugal 6.023823
9 492580.1 Brazil 5.949609
10 470410.0 Brazil 5.978207
How would I go about splitting the data such that I am guaranteed to have at least one observation of each nationality in the train and test data?
How would I remove a soccer player if he were the only representative of his nationality (since for that country I could not form a train/test pair)?
As I mentioned in my comments, I would suggest checking out my stratifiedDT function, with a caveat: you need to be using at least version 1.9.3 of "data.table", which can be obtained from the "data.table" GitHub page.
I've also used "dplyr" for convenient filtering.
Once you've loaded the function, load the relevant packages and just do:
library(dplyr)
library(data.table)
set.seed(1)
soccer_player_df %>%
  group_by(nationality) %>%
  filter(length(nationality) > 1) %>%
  stratifiedDT("nationality", .5, bothSets = TRUE)
# $SAMP1
# Source: local data frame [4 x 3]
# Groups:
#
# salary nationality height
# 1 492580.1 Brazil 5.949609
# 2 530805.8 England 5.856093
# 3 483259.4 USA 4.694929
# 4 493594.2 USA 5.685126
#
# $SAMP2
# Source: local data frame [4 x 3]
# Groups:
#
# salary nationality height
# 1 470410.0 Brazil 5.978207
# 2 520851.5 England 6.031963
# 3 504731.1 USA 6.466627
# 4 485333.2 USA 5.468320
bothSets is a new argument that lets you return a list with the two subsets.
If you don't like living on the bleeding edge, you can use the data.frame version of the function, which is pretty fast (but not nearly as fast as the "data.table" version).
The usage is pretty much the same:
soccer_player_df %>%
  group_by(nationality) %>%
  filter(length(nationality) > 1) %>%
  stratified("nationality", .5, bothSets = TRUE)
Update:
If you just want to use the function and don't want to use "dplyr" just for filtering and piping, you can also do the subsetting directly in the stratified or stratifiedDT functions. I've added the names of the arguments so that you can see more clearly what is happening:
set.seed(1)
stratified(
  soccer_player_df,
  group = "nationality",
  size = .5,
  select = list(
    nationality = names(which(table(soccer_player_df$nationality) > 1))),
  bothSets = TRUE)
Note that there is a select argument that lets you specify the subsets you're interested in.
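If you'd rather not depend on an external function, a base-R sketch of the same idea (drop single-member nationalities, then take half of each remaining nationality for training; all object names below are my own):
set.seed(1)
# Keep only nationalities with at least two players
keep <- soccer_player_df$nationality %in%
  names(which(table(soccer_player_df$nationality) > 1))
df2 <- soccer_player_df[keep, ]
# Sample half of each nationality's row indices for the training set
train_idx <- unlist(lapply(split(seq_len(nrow(df2)), df2$nationality),
                           function(i) sample(i, ceiling(length(i) / 2))))
train <- df2[train_idx, ]
test  <- df2[-train_idx, ]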