I have a dataset with a column that I need to clean.
The column contains character entries like "$10,000 - $19,999" and "$40,000 and over."
How do I code this so that, for example, "$10,000 - $19,999" becomes 15000 and "$40,000 and over" becomes 40000 in a new column?
I am new to R, so I have no idea how to start. I need to run a regression analysis on this data, but it won't work until this column is fixed.
I have been told that some basic string/regex operations are what I need. How should I proceed?
Here's a solution using the tidyverse.
Load packages
library(dplyr) # for general cleaning functions
library(stringr) # for string manipulations
library(magrittr) # for the %<>% operator
Make a dummy dataset based on your example.
df <- data_frame(price = sample(c(rep('$40,000 and over', 10),
rep('$10,000', 10),
rep('$19,999', 10),
rep('$9,000', 10),
rep('$28,000', 10))))
Inspect the new dataframe
print(df)
#> # A tibble: 50 x 1
#> price
#> <chr>
#> 1 $9,000
#> 2 $40,000 and over
#> 3 $28,000
#> 4 $10,000
#> 5 $10,000
#> 6 $9,000
#> 7 $19,999
#> 8 $10,000
#> 9 $19,999
#> 10 $40,000 and over
#> # ... with 40 more rows
Clean up the format of the price strings by removing the $ symbol and the comma. Note the use of '\\' before the $ symbol: $ is a special character in regular expressions (it anchors the end of a string), so it has to be escaped with a backslash for the regex engine, and inside an R string that backslash itself must be written as \\.
df %<>%
mutate(price = str_remove(string = price, pattern = '\\$'), # remove $ sign
price = str_remove(string = price, pattern = ',')) # remove comma
Quick check of the data.
head(df)
#> # A tibble: 6 x 1
#> price
#> <chr>
#> 1 9000
#> 2 40000 and over
#> 3 28000
#> 4 10000
#> 5 10000
#> 6 9000
Process the number strings into numerics. First convert '40000 and over' to '40000', then convert all the strings to numerics, then use logic statements to map the numbers onto the values you want. The functions ifelse() and case_when() are interchangeable here; I tend to use ifelse() for a single rule and case_when() when there are multiple rules, because of case_when()'s more compact format.
df %<>%
mutate(price = ifelse(price == '40000 and over', # convert 40000+ to 40000
yes = '40000',
no = price),
price = as.numeric(price), # convert all to numeric
price = case_when( # use logic statements to change values to desired value
price == 40000 ~ 40000,
price >= 30000 & price < 40000 ~ 35000,
price >= 20000 & price < 30000 ~ 25000,
price >= 10000 & price < 20000 ~ 15000,
price >= 0 & price < 10000 ~ 5000
))
Have a final look.
print(df)
#> # A tibble: 50 x 1
#> price
#> <dbl>
#> 1 5000
#> 2 40000
#> 3 25000
#> 4 15000
#> 5 15000
#> 6 5000
#> 7 15000
#> 8 15000
#> 9 15000
#> 10 40000
#> # ... with 40 more rows
Created on 2018-11-18 by the reprex package (v0.2.1)
First, see exactly what your data is composed of: use the table() function on data$column to see how many unique entries you must account for.
table(data$column)
If whoever entered this data was consistent about their wording, it may be easiest to hard-code a substitution for each unique entry. For example, if unique(data$column)[1] == "$10,000 - $19,999" and unique(data$column)[2] == "$40,000 and over.", then:
data$column[which(data$column==unique(data$column)[1])] <- "15000"
data$column[which(data$column==unique(data$column)[2])] <- "40000"
...
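Not from the original answer, but a more compact sketch of the same idea using a named lookup vector (assuming these are the only two labels that need recoding):
lookup <- c("$10,000 - $19,999" = "15000",
            "$40,000 and over." = "40000")
hit <- data$column %in% names(lookup)          # rows whose label appears in the lookup table
data$column[hit] <- lookup[data$column[hit]]   # replace each label with its midpoint value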
If you have too many unique entries for this approach to be viable, I'd suggest looking for consistencies in character sequences that can be used to make replacements. If you found that whoever entered this data was inconsistent about how they would write "$40,000 and over" such that you had:
data$column==unique(data$column)[2]
>"$40,000 and over."
data$column==unique(data$column)[3]
>"$40,000 and over"
data$column==unique(data$column)[4]
>"above $40,000"
...
If there weren't instances of "$40,000" that belonged to other categories, you could combine these entries for substitution like so (note the fixed = TRUE, since $ would otherwise be interpreted as a regex anchor and the pattern would never match):
data$column[which(grepl("$40,000", data$column, fixed = TRUE))] <- "40000"
Inconsistency in qualitative data entry is a very human problem and requires exploring your data to search for trends and easy ways to consolidate your replacements. I think it's a fine idea to use R to identify and replace the patterns you find to save time, but ultimately it will require a fine touch as you get down to individual cases where you have to interpret or correct someone's entries to include them in your desired bins. Depending on your data quality standards, you can always throw out entries that don't seem to fit your observed patterns.
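As a sketch of that last step (not part of the original answer): flag the entries that match none of the patterns you have decided to recode, inspect them, and drop them if they cannot be salvaged.
known_patterns <- c("$10,000 - $19,999", "$40,000")       # patterns handled above
matched <- Reduce(`|`, lapply(known_patterns,
                              function(p) grepl(p, data$column, fixed = TRUE)))
table(data$column[!matched])   # inspect whatever is left over
data <- data[matched, ]        # or simply drop the unmatched rows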
I have a dataset with 405 observations and 39 variables. But just two columns are important for further analysis.
I would like to group similar names in the first column together and add up their numbers from the second column.
Reproducible dataset looks like this:
df1 <- data.frame(name=c("Google Ads", "Google Doubleclick","Facebook Login",
"Facebook Ads","Twitter MoPub","Flurry","Amazon advertisment","Microsoft ","Ad4screen","imobi"),
value=c(10,20,30,40,50,60,70,80,90,100),unimportant=c(1,2,3,4,5,6,7,8,9,10))
The outcome should be a new data.frame that looks like this:
df2 <- data.frame (name=c("Google","Facebook","Twitter","Flurry","Amazon","Microsoft","Others"),
value=c(30,70,50,60,70,80,190))
A tidyverse way of doing it.
First, store all valid names in a vector, say valid_names.
Then create a new column, say all_names, in df1 by:
first splitting each string at spaces (' ') using str_split;
then using purrr::map_chr() to check whether any of the split pieces matches your valid_names, keeping that piece if so and returning 'others' otherwise.
Then group_by on this field. (I skipped a separate mutate step and created the new field directly inside the group_by() call, which works.)
Now summarise your important values as desired.
valid_names =c("Google","Facebook","Twitter","Flurry","Amazon","Microsoft")
valid_names
#> [1] "Google" "Facebook" "Twitter" "Flurry" "Amazon" "Microsoft"
df1 <- data.frame(name=c("Google Ads", "Google Doubleclick","Facebook Login",
"Facebook Ads","Twitter MoPub","Flurry","Amazon advertisment","Microsoft ","Ad4screen","imobi"),
value=c(10,20,30,40,50,60,70,80,90,100),unimportant=c(1,2,3,4,5,6,7,8,9,10))
df1
#> name value unimportant
#> 1 Google Ads 10 1
#> 2 Google Doubleclick 20 2
#> 3 Facebook Login 30 3
#> 4 Facebook Ads 40 4
#> 5 Twitter MoPub 50 5
#> 6 Flurry 60 6
#> 7 Amazon advertisment 70 7
#> 8 Microsoft 80 8
#> 9 Ad4screen 90 9
#> 10 imobi 100 10
library(tidyverse)
df1 %>% group_by(all_names = str_split(name, ' '),
all_names = map_chr(all_names, ~ ifelse(any(.x %in% valid_names),.x[.x %in% valid_names], 'others'))) %>%
summarise(value = sum(value), .groups = 'drop')
#> # A tibble: 7 x 2
#> all_names value
#> <chr> <dbl>
#> 1 Amazon 70
#> 2 Facebook 70
#> 3 Flurry 60
#> 4 Google 30
#> 5 Microsoft 80
#> 6 others 190
#> 7 Twitter 50
Created on 2021-06-22 by the reprex package (v2.0.0)
This works on the sample data using the adist function with partial=TRUE to look at partial string matches. It requires defining the known groups, rather than trying to find them, but I think that legwork is worth doing as it simplifies the problem a lot once the desired output is known.
df1 <- data.frame(name=c("Google Ads", "Google Doubleclick","Facebook Login",
"Facbook Ads","Twitter MoPub","Flurry","Amazon advertisment","Microsoft ","Ad4screen","imobi"),
value=c(10,20,30,40,50,60,70,80,90,100),unimportant=c(1,2,3,4,5,6,7,8,9,10))
# types we want to map. known is the groupings
types <- unique(df1$name)
known <- c("Google","Facebook","Twitter","Flurry","Amazon","Microsoft")
# use distance measures, and look for matches on partial strings, e.g.
# ignore the Doubleclick part when matching on Google
distance <- adist(known, types, partial=TRUE)
# cap controls leniency in matching, e.g. Facbook and Facebook have a dist of 1
# whilst Facebook and Facebook is a perfect match with a score of 0
# Raise to be more lenient
cap <- 1
# loop through the types
map_all <- sapply(seq_along(types), function(i){
# find minimum value, check if its below the cap. If so, assign to the closest
# group, else assign to others
v <- min(distance[,i])
if(v <= cap){
map_i <- known[which.min(distance[,i])]
}else{
map_i <- "Others"
}
map_i
})
# now merge in to df1, then sum out using your preferred method
df_map <- data.frame(name=types, group=map_all)
df_merged <- merge(df1, df_map, by="name")
df2 <- aggregate(value ~ group, sum, data=df_merged)
df2
group value
1 Amazon 70
2 Facebook 70
3 Flurry 60
4 Google 30
5 Microsoft 80
6 Others 190
7 Twitter 50
I have a dataframe that keeps track of the activities associated with a bank account (example below).
The initial balance is $5,000 (type "initial"). If type is "in", that means a cash deposit. In this example each deposit is $1,000. If type is "out", that means a withdrawal from the account. In this example each withdrawal is 10% of the account balance.
data <- tibble(
activity=1:6,
type=c("initial","in","out","out","in","in"),
input=c(5000,1000,10,10,1000,1000))
Is there a dplyr solution to keep track of the balance after each activity? I have tried several ways but I can't seem to find a way to efficiently calculate running totals and the withdrawal amount (which depends on the running total).
For this example the output should be:
result <- tibble(
activity=1:6,
type=c("initial","in","out","out","in","in"),
input=c(5000,1000,10,10,1000,1000),
balance=c(5000,6000,5400,4860,5860,6860))
Thanks in advance for any suggestions or recommendations!
You can use purrr::accumulate2() to condition the calculation on the value of type:
library(dplyr)
library(purrr)
library(tidyr)
data %>%
mutate(balance = accumulate2(input, type[-1], .f = function(x, y, type) if(type == "out") x - x * y/100 else x + y)) %>%
unnest(balance)
# A tibble: 6 x 4
activity type input balance
<int> <chr> <dbl> <dbl>
1 1 initial 5000 5000
2 2 in 1000 6000
3 3 out 10 5400
4 4 out 10 4860
5 5 in 1000 5860
6 6 in 1000 6860
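If accumulate2() feels opaque, an equivalent plain loop (a minimal sketch, not part of the original answer) makes the running-balance logic explicit. Note that accumulate2() requires its second argument to be one element shorter than the first, which is why the answer passes type[-1].
balance <- numeric(nrow(data))
balance[1] <- data$input[1]                      # initial balance
for (i in 2:nrow(data)) {
  balance[i] <- if (data$type[i] == "out") {
    balance[i - 1] * (1 - data$input[i] / 100)   # withdrawal: a percentage of the current balance
  } else {
    balance[i - 1] + data$input[i]               # deposit: a fixed amount
  }
}
data$balance <- balance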
This question already has answers here:
Rounding the numeric values in a dplyr tbl_df upon printing
(2 answers)
I want to see more digits in the aggregated output using group_by() and summarise() from the dplyr package. My code is below:
library(dplyr)
# download 2 datasets
download.file('https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv','GDP.csv',mode = 'wb')
GDP<-read.csv('GDP.csv',skip=4,stringsAsFactors = F,na.strings = '')
GDP<-GDP%>%filter(!is.na(X),!is.na(X.1))%>%mutate(X.1=as.numeric(X.1))
download.file('https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv','EDSTATS.csv',mode = 'wb')
edu<-read.csv('EDSTATS.csv',stringsAsFactors = F)
# join these two datasets
df<-inner_join(GDP,edu,by=c('X'='CountryCode'))%>%arrange(desc(X.1))
# aggregation
df%>%group_by(Income.Group)%>%summarise(avg_GDP=mean(X.1))
The result I get from console:
# A tibble: 5 x 2
Income.Group avg_GDP
<chr> <dbl>
1 High income: nonOECD 91.9
2 High income: OECD 33.0
3 Low income 134.
4 Lower middle income 108.
5 Upper middle income 92.1
Clearly, the numbers are not shown in full. So how do I see more digits in avg_GDP?
If I assign the result to a new dataframe and view it in RStudio, I get to see more digits, but still only 5 digits:
df2<-df%>%group_by(Income.Group)%>%summarise(avg_GDP=mean(X.1))
View(df2)
So how do I see more digits both in console print and dataframe View()?
I tried:
df%>%group_by(Income.Group)%>%summarise(avg_GDP=mean(X.1,digits=10))
It didn't work (digits is not an argument of mean(), so it is silently ignored).
My question differs from the potential duplicate in that I want code that can do the job within the %>% chain. From that post, I like the answer with:
# this is my favorite, because it fits well with my original code with %>%.
print.data.frame(my_tbl, digits = 3)
or
options(digits = 3)
print.data.frame(my_tbl)
From my post, I like options(pillar.sigfig = 10).
For the tibble package you need to modify the option pillar.sigfig.
pillar.sigfig: The number of significant digits that will be printed and highlighted, default: 3
library(tibble)
options(pillar.sigfig = 10)
set.seed(1)
tibble(a = rnorm(3), b = rexp(3))
# A tibble: 3 x 2
# a b
# <dbl> <dbl>
#1 -0.6264538107 0.4360686258
#2 0.1836433242 2.894968537
#3 -0.8356286124 1.229562053
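If you would rather keep everything inside the %>% chain instead of setting a global option, one sketch (combining the question's pipeline with the print.data.frame() suggestion quoted above) is:
df %>%
  group_by(Income.Group) %>%
  summarise(avg_GDP = mean(X.1)) %>%
  print.data.frame(digits = 10)   # data.frame printing honours an explicit digits argument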
I have some repeated measures data I'm trying to clean in R. At this point, it is in the long format and I'm trying to fix some entries before I move to a wide format - for example, if people took my survey too many times I'm going to drop the rows. I have two main problems that I'm trying to solve:
Changing an entry
If someone took the survey from the "pre-test link" when it was actually supposed to be a post-test, I'm fixing it with the following code:
data[data$UserID == 52118254, "Prepost"][2] <- 2
This filters out the entries from that person based on ID, then changes the second entry to be coded as a post-test. This code has enough meaning that reviewing it tells me what is happening.
Dropping a row
I'm struggling to get meaningful code to delete extra rows - for example if someone accidentally clicked on my link twice. I have data like the following:
UserID Prepost Duration..in.seconds.
1 52118250 1 357
2 52118284 1 226
3 52118284 1 11 #This is an extra attempt to remove
4 52118250 2 261
5 52118284 2 151
#to reproduce:
structure(list(UserID = c(52118250, 52118284, 52118284, 52118250, 52118284), Prepost = c("1", "1", "1", "2", "2"), Duration..in.seconds. = c("357", "226", "11", "261", "151")), class = "data.frame", row.names = c(NA, -5L), .Names = c("UserID", "Prepost", "Duration..in.seconds."))
I can filter by UserID to see who has taken it too many times, and I'm looking for a way to easily remove those rows from the dataset. In this case, UserID 52118284 has taken it three times and the second attempt needs to be removed. If it is "readable" like the other fix, that is better.
I'd use a collection of dplyr functions as shown below. To explain:
group_by(UserID) will help to apply functions separately to each User.
mutate(click_n = row_number()) iteratively counts User appearances and saves it as a new variable click_n.
library(dplyr)
data %>%
group_by(UserID) %>%
mutate(click_n = row_number())
#> Source: local data frame [5 x 4]
#> Groups: UserID [4]
#>
#> UserID Prepost Duration..in.seconds. click_n
#> <dbl> <chr> <chr> <int>
#> 1 52118254 1 357 1
#> 2 52118284 1 226 1
#> 3 52118284 1 11 2
#> 4 52118250 2 261 1
#> 5 52118280 2 151 1
filter(click_n == 1) can then be used to keep only 1st attempts as shown below.
data <- data %>%
group_by(UserID) %>%
mutate(click_n = row_number()) %>%
filter(click_n == 1)
data
#> Source: local data frame [4 x 4]
#> Groups: UserID [4]
#>
#> UserID Prepost Duration..in.seconds. click_n
#> <dbl> <chr> <chr> <int>
#> 1 52118254 1 357 1
#> 2 52118284 1 226 1
#> 3 52118250 2 261 1
#> 4 52118280 2 151 1
Note that this approach assumes that your data frame is ordered. I.e., first clicks appear close to the top.
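If your export is not already in attempt order, a minimal sketch (assuming a hypothetical StartDate column that records when each attempt began) would be to sort within each user before numbering:
data %>%
  group_by(UserID) %>%
  arrange(StartDate, .by_group = TRUE) %>%  # StartDate is hypothetical; use whatever column records attempt time
  mutate(click_n = row_number()) %>%
  filter(click_n == 1)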
If you're unfamiliar with %>%, look for help on the "pipe operator".
EXTRA:
To bring the comment into the answer: once you're comfortable with what's going on here, you can skip the mutate line and just do the following:
data %>% group_by(UserID) %>% filter(row_number() == 1)
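An equivalent idiom, if you prefer slice() (both keep the first row per user):
data %>% group_by(UserID) %>% slice(1)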
A simple solution to remove duplicates is below:
subset(data, !duplicated(data$UserID))
However, you may also want to consider subsetting by duration, for example dropping attempts that took less than 30 seconds; a sketch of that follows below.
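A sketch of what that might look like, assuming a 30-second cutoff (Duration..in.seconds. is stored as character in the example data, so convert it first):
valid <- subset(data, as.numeric(Duration..in.seconds.) >= 30)  # drop suspiciously fast attempts
subset(valid, !duplicated(UserID))                               # then keep each user's first remaining attempt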
Thanks #Simon for the suggestions. One criterion I wanted was that the code made sense as I "read" it. As I thought more, another criterion was that I wanted to be deliberate about what changes to make. So I incorporated Simon's recommendation to make a separate column and then use dplyr::filter() to exclude those rows. Here's what an example segment of code looked like:
#Change pre/post entries
data[data$UserID == 52118254, "Prepost"][2] <- 2
#Mark rows to delete
data$toDelete <- NA #Makes new empty column for marking deletions
data[data$UserID == 52118284,][2, "toDelete"] <- 1 #Marks row for deletion
#Filter to exclude rows
data %>% filter(is.na(toDelete))
#Optionally add "%>% select(-toDelete)" to remove the extra column
In my context, the advantages here are that everything is deliberate rather than automatic and changes are anchored to the data rather than to row numbers that might change. I'd still welcome any feedback or other ways of achieving this (maybe in a single step).
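Since the post asks about a single step, here is one sketch (not part of the original answer) that chains both fixes into one pipeline, using the same IDs as the code above:
data %>%
  group_by(UserID) %>%
  mutate(attempt = row_number()) %>%                                             # number each user's attempts
  ungroup() %>%
  mutate(Prepost = ifelse(UserID == 52118254 & attempt == 2, "2", Prepost)) %>%  # recode the mislabelled post-test
  filter(!(UserID == 52118284 & attempt == 2)) %>%                               # drop the accidental extra attempt
  select(-attempt)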
I have a dataframe that contains survey responses with each row representing a different person. One column - "Text" - is an open-ended text question. I would like to use Tidytext::unnest_tokens so that I do text analysis by each row, including sentiment scores, word counts, etc.
Here is the simple dataframe for this example:
Satisfaction <- c("Satisfied","Satisfied","Dissatisfied","Satisfied","Dissatisfied")
Text<-c("I'm very satisfied with the services", "Your service providers are always late which causes me a lot of frustration", "You should improve your staff training, service providers have bad customer service","Everything is great!","Service is bad")
Gender<-c("M","M","F","M","F")
df<-data.frame(Satisfaction,Text,Gender)
I then turned the Text column into character...
df$Text<-as.character(df$Text)
Next I grouped by the id column and nested the dataframe.
df<-df%>%mutate(id=row_number())%>%group_by(id)%>%unnest_tokens(word,Text)%>%nest(-id)
Getting this far seems to have worked ok, but now how do I use purrr::map functions to work on the nested list column "word"? For example, if I want to create a new column using dplyr::mutate with word counts for each row?
Also, is there a better way to nest the dataframe so that only the "Text" column is a nested list?
I love using purrr::map to do modeling for different groups, but for what you are talking about doing, I think you can stick with just straight dplyr.
You can set up your dataframe like this:
library(dplyr)
library(tidytext)
Satisfaction <- c("Satisfied",
"Satisfied",
"Dissatisfied",
"Satisfied",
"Dissatisfied")
Text <- c("I'm very satisfied with the services",
"Your service providers are always late which causes me a lot of frustration",
"You should improve your staff training, service providers have bad customer service",
"Everything is great!",
"Service is bad")
Gender <- c("M","M","F","M","F")
df <- data_frame(Satisfaction, Text, Gender)
tidy_df <- df %>%
mutate(id = row_number()) %>%
unnest_tokens(word, Text)
Then to find, for example, the number of words per line, you can use group_by and mutate.
tidy_df %>%
group_by(id) %>%
mutate(num_words = n()) %>%
ungroup
#> # A tibble: 37 × 5
#> Satisfaction Gender id word num_words
#> <chr> <chr> <int> <chr> <int>
#> 1 Satisfied M 1 i'm 6
#> 2 Satisfied M 1 very 6
#> 3 Satisfied M 1 satisfied 6
#> 4 Satisfied M 1 with 6
#> 5 Satisfied M 1 the 6
#> 6 Satisfied M 1 services 6
#> 7 Satisfied M 2 your 13
#> 8 Satisfied M 2 service 13
#> 9 Satisfied M 2 providers 13
#> 10 Satisfied M 2 are 13
#> # ... with 27 more rows
You can do sentiment analysis by implementing an inner join; check out some examples here.
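For instance, a minimal sketch using the Bing lexicon bundled with tidytext (an assumption about which lexicon you would pick):
library(dplyr)
library(tidytext)

tidy_df %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep only words that appear in the lexicon
  count(id, sentiment)                                  # positive/negative word counts per response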