I am stuck converting strings to times. I am aware that there are many topics on Stack Overflow about converting strings to times, but none of the solutions I found fixed this problem.
Situation
I have a file with times like this:
> dput(df$Time[1:50])
c("1744.3", "2327.54", "1718.51", "2312.3200000000002", "1414.16",
"2046.15", "1442.5", "1912.22", "2303.2199999999998", "2146.3200000000002",
"1459.02", "1930.15", "1856.23", "2319.15", "1451.05", "25.460000000000036",
"1453.25", "2309.02", "2342.48", "2322.5300000000002", "2101.5",
"2026.07", "1245.04", "1945.15", "5.4099999999998545", "1039.5",
"1731.37", "2058.41", "2030.36", "1814.31", "1338.18", "1858.33",
"1731.36", "2343.38", "1733.27", "2304.59", "1309.47", "1916.11",
"1958.3", "1929.54", "1756.4", "1744.23", "1731.26", "1844.47",
"1353.25", "1958.3", "1746.44", "1857.53", "2047.15", "2327.2199999999998", "1915"
)
In this example, the times should be like this:
"1744.3" = 17:44:30
"2327.54" = 23:27:54
"1718.51" = 17:18:51
"2312.3200000000002" = 23:12:32
...
"25.460000000000036" = 00:25:46 # as you can see, the first two 00 are missing.
"1915" = 19:15:00
However, I tried multiple things (and now I am even stuck with str_replace()). Hopefully someone knows how I can transform this.
What have I tried?
format(df$Time, "%H%M.%S") # Yes I know...
# So therefore I thought, lets replace the strings to get them in a proper format
# like HH:MM:SS. First step was to replace the "." for a ":"
str_replace("." , ":", df$Time) # this was leading to "." (don't know why)
And that was the point at which I got so frustrated that I posted it here. I hope you can help me.
Many thanks in advance!
Here is a way to do this, storing the output from dput in x.
library(magrittr)
# Remove all the dots
gsub('\\.', '', x) %>%
  # Select only the first 6 characters
  substr(1, 6) %>%
  # Pad 0's at the end
  stringr::str_pad(6, pad = '0', side = 'right') %>%
  # Add colon (:) separators
  sub('(.{2})(.{2})', '\\1:\\2:', .)
# [1] "17:44:30" "23:27:54" "17:18:51" "23:12:32" "14:14:16" "20:46:15"
# [7] "14:42:50" "19:12:22" "23:03:21" "21:46:32" "14:59:02" "19:30:15"
#[13] "18:56:23" "23:19:15" "14:51:05" "25:46:00" "14:53:25" "23:09:02"
#...
Note that this can be done without pipes as well; I'm using them here for clarity. From here you can convert the time to POSIXct format if needed.
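For example (a minimal sketch, assuming the cleaned "HH:MM:SS" strings from the pipeline above are stored in a vector called times):
times <- c("17:44:30", "23:27:54", "17:18:51")
# as.POSIXct() fills in the current date when only a time format is supplied
as.POSIXct(times, format = "%H:%M:%S")
If you only need a time-of-day class rather than a full date-time, hms::parse_hms(times) is an alternative.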
The main problem is the time "25.460000000000036". But I think I found a clear though somewhat verbose solution:
library(tidyverse)
df %>%
  mutate(hours = formatC(as.numeric(Time), width = 4, format = "d", flag = "0"),
         seconds = as.numeric(str_extract(Time, "[.].+")) * 100) %>%
  mutate(Time_new = stringi::stri_datetime_parse(paste0(hours, seconds), format = "HHmm.ss"))
#> # A tibble: 51 x 4
#> Time hours seconds Time_new
#> <chr> <chr> <dbl> <dttm>
#> 1 25.460000000000036 0025 46. 2020-02-19 00:25:46 # I changed the order of the times so the weird format is on top
#> 2 1744.3 1744 30 2020-02-19 17:44:30
#> 3 2327.54 2327 54 2020-02-19 23:27:54
#> 4 1718.51 1718 51 2020-02-19 17:18:51
#> 5 2312.3200000000002 2312 32. 2020-02-19 23:12:32
#> 6 1414.16 1414 16 2020-02-19 14:14:16
#> 7 2046.15 2046 15 2020-02-19 20:46:15
#> 8 1442.5 1442 50 2020-02-19 14:42:50
#> 9 1912.22 1912 22 2020-02-19 19:12:22
#> 10 2303.2199999999998 2303 22.0 2020-02-19 23:03:21
#> # ... with 41 more rows
If you also have times without fractions (i.e., without the dot) you could use this approach:
normalize_time <- function(t) {
  formatC(as.numeric(t) * 100, width = 6, format = "d", flag = "0")
}
df %>%
  mutate(Time_new = as.POSIXct(normalize_time(Time), format = "%H%M%S"))
A roundabout way of doing it
tmp <- sapply(strsplit(as.character(df$Time), "\\."), function(x) nchar(x[1]))
ifelse(tmp > 2,
       substr(as.POSIXct(df$Time, format = "%H%M.%S"), 12, 19),
       substr(as.POSIXct(df$Time, format = "%M.%S"), 12, 19))
a data.table way
First, convert the strings in your vector to numeric, multiply by 100 (to get the relevant part of HMS before the decimal separator) and cast to integer. Then use sprintf() to add leading zeros to get a 6-digit string. Finally, convert to time.
data.table::as.ITime(sprintf("%06d",
                             as.integer(as.numeric(time) * 100)),
                     format = "%H%M%S")
# [1] "17:44:30" "23:27:54" "17:18:51" "23:12:32" "14:14:16" "20:46:15" "14:42:50" "19:12:22" "23:03:21" "21:46:32" "14:59:02" "19:30:15"
# [13] "18:56:23" "23:19:15" "14:51:05" "00:25:46" "14:53:25" "23:09:02" "23:42:48" "23:22:53" "21:01:50" "20:26:07" "12:45:04" "19:45:15"
# [25] "00:05:40" "10:39:50" "17:31:37" "20:58:41" "20:30:36" "18:14:31" "13:38:18" "18:58:33" "17:31:36" "23:43:38" "17:33:27" "23:04:59"
# [37] "13:09:47" "19:16:11" "19:58:30" "19:29:54" "17:56:40" "17:44:23" "17:31:26" "18:44:47" "13:53:25" "19:58:30" "17:46:44" "18:57:53"
# [49] "20:47:15" "23:27:21"
Related
I'm trying to select strings based on multiple criteria but so far without success.
My vector contains the following strings (a total of 48 strings): (1_A, 1_B, 1_C, 1_D, 2_A, 2_B, 2_C, 2_D... 12_A, 12_B, 12_C, 12_D)
I need to randomly select 12 strings. The criteria are:
I need one string containing each number
I need exactly three strings containing each letter.
I need the final output to be something like: 1_A, 2_A, 3_A, 4_B, 5_B, 6_B, 7_C, 8_C, 9_C, 10_D, 11_D, 12_D.
Any help will be appreciated.
All the best,
Angelica
The trick here is not to use your vector at all, but to create the sample strings from their components, which are randomly chosen according to your criteria.
sample(paste(sample(12), rep(LETTERS[1:4], 3), sep = '_'))
#> [1] "12_D" "8_C" "7_B" "1_B" "6_D" "5_A" "4_B" "10_A" "2_C" "3_A" "11_D" "9_C"
This will give a different result each time.
Note that all 4 letters are always represented exactly 3 times since we use rep(LETTERS[1:4], 3), all numbers 1 to 12 are present exactly once but in a random order since we use sample(12), and the final result is shuffled so that neither the order of the letters nor the order of the numbers is predictable.
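A quick sanity check (just a sketch; it splits each string on the "_" separator) confirms both constraints:
out <- sample(paste(sample(12), rep(LETTERS[1:4], 3), sep = '_'))
table(sub(".*_", "", out))              # every letter A-D appears exactly 3 times
sort(as.integer(sub("_.*", "", out)))   # the numbers 1 to 12 each appear exactly once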
If you want the result to give you the indices of your original vector where the samples are from, then it's easy to do that using match. We can recreate your vector by doing:
vec <- paste(rep(1:12, each = 4), rep(LETTERS[1:4], 12), sep = "_")
vec
#> [1] "1_A" "1_B" "1_C" "1_D" "2_A" "2_B" "2_C" "2_D" "3_A" "3_B"
#> [11] "3_C" "3_D" "4_A" "4_B" "4_C" "4_D" "5_A" "5_B" "5_C" "5_D"
#> [21] "6_A" "6_B" "6_C" "6_D" "7_A" "7_B" "7_C" "7_D" "8_A" "8_B"
#> [31] "8_C" "8_D" "9_A" "9_B" "9_C" "9_D" "10_A" "10_B" "10_C" "10_D"
#> [41] "11_A" "11_B" "11_C" "11_D" "12_A" "12_B" "12_C" "12_D"
And to find the location of the random samples we can do:
samp <- match(sample(paste(sample(12), rep(LETTERS[1:4], 3), sep = '_')), vec)
samp
#> [1] 30 26 37 43 46 20 8 3 33 24 15 9
So that, for example, you can retrieve an appropriate sample from your vector with:
vec[samp]
#> [1] "8_B" "7_B" "10_A" "11_C" "12_B" "5_D" "2_D" "1_C" "9_A" "6_D"
#> [11] "4_C" "3_A"
Created on 2022-04-10 by the reprex package (v2.0.1)
I have a data frame of words (the tweets have been tokenised), the number of uses of each word, the sentiment score attached to it, and the total score (n * value). I have created another data frame of all the words in my corpus that follow a negation (I made bigrams and filtered for word_1 being a negation word).
I want to subtract the number of negated occurrences from the original data frame so that it shows the net count for each word.
library(tidyverse)
library(tidyr)
library(tidytext)
tweets <- read_csv("http://nodeassets.nbcnews.com/russian-twitter-trolls/tweets.csv")
custom_stop_words <- bind_rows(tibble(word = c("https", "t.co", "rt", "amp"),
                                      lexicon = c("custom")), stop_words)
tweet_tokens <- tweets %>%
  select(user_id, user_key, text, created_str) %>%
  na.omit() %>%
  mutate(row = row_number()) %>%
  unnest_tokens(word, text, token = "tweets") %>%
  filter(!word %in% custom_stop_words$word)
sentiment <- tweet_tokens %>%
  count(word, sort = T) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  mutate(total_score = n * value)
#df showing contribution of overall sentiment to each word
negation_words <- c("not", "no", "never", "without", "won't", "dont", "doesnt", "doesn't", "don't", "can't")
bigrams <- tweets %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) # re-tokenise our tweets with bigrams
bigrams_separated <- bigrams %>%
  separate(bigram, c("word_1", "word_2"), sep = " ")
not_words <- bigrams_separated %>%
  filter(word_1 %in% negation_words) %>%
  inner_join(get_sentiments("afinn"), by = c(word_2 = "word")) %>%
  count(word_2, value, sort = TRUE) %>%
  mutate(value = value * -1) %>%
  mutate(contribution = value * n)
I would like the outcome to be one data frame. So if sentiment shows 'matter' appears 696 times, but the not_words df shows it was preceded by a negation 274 times, the new data frame should show an n value of 422 for 'matter'.
(Without really knowing the specifics) I think you did a good job massaging the tweet_tokens and not_words datasets. Nevertheless, you'll have to modify them slightly for them to work as you (probably) want.
Comment out the mutate(row = ...) line in your tweet_tokens <- ... dataframe, as it will cause trouble if you don't. Also re-run your sentiment <- ... dataframe, just to be on the safe side.
tweet_tokens <- tweets %>%
  select(user_id, user_key, text, created_str) %>%
  na.omit() %>%
  # mutate(row = row_number()) %>%
  unnest_tokens(word, text, token = "tweets") %>%
  filter(!word %in% custom_stop_words$word)
Cut the last three lines of your not_words <- ... dataframe, as that summary count(...) would otherwise prevent you from referencing your dataframes later on. The select(user_id, user_key, created_str, word = word_2) line gives you a dataframe with the same "standards" as your tweet_tokens dataframe. Note also how the "word_2" column is now called "word" (in the new not_words dataframe).
not_words <- bigrams_separated %>%
  filter(word_1 %in% negation_words) %>%
  inner_join(get_sentiments("afinn"), by = c(word_2 = "word")) %>%
  select(user_id, user_key, created_str, word = word_2)
Now, for your particular example/case, when using the word "matter" (for tweet_tokens) we have indeed a dataframe of 696 rows...
> matter_tweet = tweet_tokens[tweet_tokens$word=='matter',]
> dim(matter_tweet)
[1] 696 4
and when using the word "matter" (for not_words) we end up with a dataframe of 274 rows.
> matter_not = not_words[not_words$word=='matter',]
> dim(matter_not)
[1] 274 4
So if we just subtract matter_not from matter_tweet you would have those 422 rows you're looking for.
Well... not so fast... and strictly speaking I'm also sure that's not what you really want.
The simple and accurate answer is:
> anti_join(matter_tweet,matter_not)
Joining, by = c("user_id", "user_key", "created_str", "word")
# A tibble: 429 x 4
user_id user_key created_str word
<dbl> <chr> <dttm> <chr>
1 1671234620 hyddrox 2016-10-17 07:22:47 matter
2 1623180199 jeffreykahunas 2016-09-14 12:53:37 matter
3 1594887416 jery_robertsyo 2016-10-21 14:24:05 matter
4 1680366068 willisbonnerr 2017-02-14 09:14:24 matter
5 2533221819 lazykstafford 2015-12-25 13:41:12 matter
6 1833223908 dorothiebell 2016-09-29 21:08:14 matter
7 2587100717 judelambertusa 2014-12-13 14:41:08 matter
8 2606301939 finley1589 2016-09-19 08:24:37 matter
9 4272870988 pamela_moore13 2016-08-03 18:21:01 matter
10 2531159968 traceyhappymom 2017-01-14 12:07:55 matter
# … with 419 more rows
Now allow me to explain why it is that you end up with 429 rows when you asked for 422.
> #-not taking into account NAs in the 'user_id column' (you'll decide what to do with that issue later, I guess)
> matter_not_clean = matter_not[!is.na(matter_not$user_id),]
> dim(matter_not_clean)
[1] 256 4
> #-the above dataframe also contains duplicates, which we ('have to'?) get rid of
> #-the 'matter' dataframe is the cleanest you can have
> matter = matter_not_clean[!duplicated(matter_not_clean),]
> dim(matter)
[1] 250 4
#-you'd be tempted to say that 696-250=446 are the rows you'd want now;
#-...which is not true as some of the 250 rows from 'matter' are also duplicated in
#-...'matter_tweet', but that should not worry you. You can later delete them... if that's what you want.
> #-then I jump to 'data.table' as it helps me to prove my point
> library(data.table)
> #-transforming those 'tbl_df' into 'data.table'
> mt = as.data.table(matter_tweet)
> mm = as.data.table(matter)
> #-I check if (all) 'mm' is contained in 'mt'
> test = mt[mm,on=names(mt)]
> dim(test)
[1] 267 4
These 267 rows are the ones you want to get rid of! Hence you're looking for a dataframe of 696 - 267 = 429 rows.
> #-the above implies that there are indeed duplicates... but this doesn't mean that all of 'mm' is contained in 'mt'
> #-now I remove the duplicates
> test[!duplicated(test),]
user_id user_key created_str word
1: 1.518857e+09 nojonathonno 2016-11-08 10:36:14 matter
2: 1.594887e+09 jery_robertsyo 2016-11-08 20:57:07 matter
3: 1.617939e+09 paulinett 2017-01-14 16:33:38 matter
4: 1.617939e+09 paulinett 2017-03-05 18:16:48 matter
5: 1.617939e+09 paulinett 2017-04-03 03:21:34 matter
---
246: 4.508631e+09 thefoundingson 2017-03-23 13:40:00 matter
247: 4.508631e+09 thefoundingson 2017-03-29 01:05:01 matter
248: 4.840552e+09 blacktolive 2016-07-19 15:32:04 matter
249: 4.859142e+09 trayneshacole 2016-04-09 23:16:13 matter
250: 7.532149e+17 margarethkurz 2017-03-05 16:31:43 matter
> #-and here I test that all 'matter' is in 'matter_tweet', which IT IS!
> identical(mm,test[!duplicated(test),])
[1] TRUE
> #-in this way we keep the duplicates from/in 'matter_tweet'
> answer = mt[!mm,on=names(mt)]
> dim(answer)
[1] 429 4
> #-if we remove the duplicates we end up with a dataframe of 415 rows
> #-...and this is where I am not sure if that's what you want
> answer[!duplicated(answer),]
user_id user_key created_str word
1: 1671234620 hyddrox 2016-10-17 07:22:47 matter
2: 1623180199 jeffreykahunas 2016-09-14 12:53:37 matter
3: 1594887416 jery_robertsyo 2016-10-21 14:24:05 matter
4: 1680366068 willisbonnerr 2017-02-14 09:14:24 matter
5: 2533221819 lazykstafford 2015-12-25 13:41:12 matter
---
411: 4508630900 thefoundingson 2016-09-13 12:15:03 matter
412: 1655194147 melanymelanin 2016-02-21 02:32:50 matter
413: 1684524144 datwisenigga 2017-04-27 02:45:25 matter
414: 1660771422 garrettsimpson_ 2016-10-14 01:14:04 matter
415: 1671234620 hyddrox 2017-02-19 19:40:39 matter
> #-you'll get this same 'answer' if you do:
> setdiff(matter_tweet,matter)
# A tibble: 415 x 4
user_id user_key created_str word
<dbl> <chr> <dttm> <chr>
1 1671234620 hyddrox 2016-10-17 07:22:47 matter
2 1623180199 jeffreykahunas 2016-09-14 12:53:37 matter
3 1594887416 jery_robertsyo 2016-10-21 14:24:05 matter
4 1680366068 willisbonnerr 2017-02-14 09:14:24 matter
5 2533221819 lazykstafford 2015-12-25 13:41:12 matter
6 1833223908 dorothiebell 2016-09-29 21:08:14 matter
7 2587100717 judelambertusa 2014-12-13 14:41:08 matter
8 2606301939 finley1589 2016-09-19 08:24:37 matter
9 4272870988 pamela_moore13 2016-08-03 18:21:01 matter
10 2531159968 traceyhappymom 2017-01-14 12:07:55 matter
# … with 405 more rows
> #-but now you know why ;)
> #-testing equality in both methods
> identical(answer[1:429,],as.data.table(anti_join(matter_tweet,matter_not))[1:429,])
Joining, by = c("user_id", "user_key", "created_str", "word")
[1] TRUE
CONCLUSION 1: use setdiff(matter_tweet, matter) if you don't want duplicated rows from your tweet_tokens dataframe in the result; use anti_join(matter_tweet, matter) if you want to keep them.
CONCLUSION 2: as you may have noticed, anti_join(matter_tweet, matter_not) and anti_join(matter_tweet, matter) give you the same answer. This means that anti_join() does not take NAs into account in its matching.
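The difference between the two is easiest to see on a toy example (a minimal sketch, not your data): anti_join() keeps duplicated rows of the first dataframe, while setdiff() returns distinct rows only.
library(dplyr)
x <- tibble(word = c("matter", "other", "other"))
y <- tibble(word = "matter")
anti_join(x, y, by = "word")   # 2 rows: "other" appears twice (duplicates kept)
setdiff(x, y)                  # 1 row: the result is de-duplicated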
So I got a dataset with a column that I need to clean.
The column has objects with stuff like: "$10,000 - $19,999", "$40,000 and over."
How do I code this so for example "$10,000 - $19,999" becomes 15000 instead, and "$40,000 and over" becomes 40000 in a new column?
I am new to R so I have no idea how to start. I need to do a regression analysis on this but it doesn't work if I don't get this fixed.
I have been told that some basic string/regex operations are what I need. How should I proceed?
Here's a solution using the tidyverse.
Load packages
library(dplyr) # for general cleaning functions
library(stringr) # for string manipulations
library(magrittr) # for the '%<>%' operator
Make a dummy dataset based on your example.
df <- data_frame(price = sample(c(rep('$40,000 and over', 10),
                                  rep('$10,000', 10),
                                  rep('$19,999', 10),
                                  rep('$9,000', 10),
                                  rep('$28,000', 10))))
Inspect the new dataframe
print(df)
#> # A tibble: 50 x 1
#> price
#> <chr>
#> 1 $9,000
#> 2 $40,000 and over
#> 3 $28,000
#> 4 $10,000
#> 5 $10,000
#> 6 $9,000
#> 7 $19,999
#> 8 $10,000
#> 9 $19,999
#> 10 $40,000 and over
#> # ... with 40 more rows
Clean up the format of the price strings by removing the $ symbol and the ,. Note the use of '\\' before the $ symbol. This is how special characters are escaped within R (the second \ is a standard regex escape, and the first \ tells R to escape the second \).
df %<>%
  mutate(price = str_remove(string = price, pattern = '\\$'), # remove $ sign
         price = str_remove(string = price, pattern = ','))   # remove comma
Quick check of the data.
head(df)
#> # A tibble: 6 x 1
#> price
#> <chr>
#> 1 9000
#> 2 40000 and over
#> 3 28000
#> 4 10000
#> 5 10000
#> 6 9000
Process the number strings into numerics. First convert 40000 and over to 40000, then convert all the strings to numerics, then use logic statements to convert the numbers to the values you want. The functions ifelse() and case_when() are interchangeable, but I tend to use ifelse() for single rules and case_when() when there are multiple rules, because of case_when()'s more compact format.
df %<>%
  mutate(price = ifelse(price == '40000 and over', # convert 40000+ to 40000
                        yes = '40000',
                        no = price),
         price = as.numeric(price), # convert all to numeric
         price = case_when( # use logic statements to change values to desired value
           price == 40000 ~ 40000,
           price >= 30000 & price < 40000 ~ 35000,
           price >= 20000 & price < 30000 ~ 25000,
           price >= 10000 & price < 20000 ~ 15000,
           price >= 0 & price < 10000 ~ 5000
         ))
Have a final look.
print(df)
#> # A tibble: 50 x 1
#> price
#> <dbl>
#> 1 5000
#> 2 40000
#> 3 25000
#> 4 15000
#> 5 15000
#> 6 5000
#> 7 15000
#> 8 15000
#> 9 15000
#> 10 40000
#> # ... with 40 more rows
Created on 2018-11-18 by the reprex package (v0.2.1)
First you should see exactly what your data is composed of: use the table() function on data$column to see how many unique entries you must account for.
table(data$column)
If whoever was entering this data was consistent about their wording, it may be easiest to hard-code a substitution for each unique entry. So if unique(data$column)[1] == "$10,000 - $19,999" and unique(data$column)[2] == "$40,000 and over.":
data$column[which(data$column==unique(data$column)[1])] <- "15000"
data$column[which(data$column==unique(data$column)[2])] <- "40000"
...
If you have too many unique entries for this approach to be viable, I'd suggest looking for consistencies in character sequences that can be used to make replacements. If you found that whoever entered this data was inconsistent about how they would write "$40,000 and over" such that you had:
data$column==unique(data$column)[2]
>"$40,000 and over."
data$column==unique(data$column)[3]
>"$40,000 and over"
data$column==unique(data$column)[4]
>"above $40,000"
...
If there weren't instances of "$40,000" that belonged to other categories, you could combine these entries for substitution a la:
data$column[which(grepl("$40,000", data$column, fixed = TRUE))] <- "40000"  # fixed = TRUE so the "$" is matched literally, not as a regex anchor
Inconsistency in qualitative data entry is a very human problem and requires exploring your data to search for trends and easy ways to consolidate your replacements. I think it's a fine idea to use R to identify and replace for patterns you find to save time, but ultimately it will require a fine touch as you get down to individual cases where you have to interpret/correct someone's entries to include them in your desired bins. Depending on your data quality standards, you can always throw out these entries that don't seem to fit your observed patterns.
I apologize if this is a duplicate; I've searched through all of the "add leading zero" content I can find, and I'm struggling to find a solution I can work with. I have the following:
siteid<-c("1","11","111")
modifier<-c("44","22","11")
df<-data.frame(siteid,modifier)
and I want a modified siteid that is always six (6) characters long with zeroes to fill the gaps. The Site ID can vary in nchar from 1-3, the modifier is always a length of 2, and the number of zeroes can vary depending on the length of the site ID (so that 6 is always the final modified length).
I would like the following final output:
df
# siteid modifier mod.siteid
#1 1 44 440001
#2 11 22 220011
#3 111 11 110111
Thanks for any suggestions or direction. This could also be numeric, but it seems like character manipulation has more options...?
The vocabulary here is "left pad" and "paste". Here is one way using sprintf():
df$mod.siteid <- with(df, sprintf("%s%04d", modifier, as.integer(siteid)))
# Note:
# code simplified thanks to suggestion by Maurits.
Output:
siteid modifier mod.siteid
1 1 44 440001
2 11 22 220011
3 111 11 110111
Data:
df <- data.frame(
siteid = c("1", "11", "111"),
modifier = c("44", "22", "11"),
stringsAsFactors = FALSE
)
Extra: If you don't want to left pad with 0, then using the stringi package is one option: with(df, paste0(modifier, stringi::stri_pad_left(siteid, 4, "q")))
siteid<-c("1","11","111")
modifier<-c("44","22","11")
df<-data.frame(siteid,modifier, stringsAsFactors = FALSE)
df$mod.siteid <- paste0(df$modifier,
                        formatC(as.numeric(df$siteid), width = 4, format = "d", flag = "0"))
df
# siteid modifier mod.siteid
# 1 1 44 440001
# 2 11 22 220011
# 3 111 11 110111
I'm a beginner dealing with R and working with strings.
I've been trying to remove periods from data but unfortunately I can't find a solution.
This is the data I'm working on in a dataframe df:
df <- read.table(text = " n mesAno receita
97 1/2009 3.812.819.062,06
98 2/2009 4.039.362.599,36
99 3/2009 3.652.885.587,18
100 4/2009 3.460.247.960,02
101 5/2009 3.465.677.403,12
102 6/2009 3.131.903.622,55
103 7/2009 3.204.983.361,46
104 8/2009 3.811.786.009,24
105 9/2009 3.180.864.095,05
106 10/2009 3.352.535.553,88
107 11/2009 5.214.148.756,95
108 12/2009 4.491.795.201,50
109 1/2010 4.333.557.619,30
110 2/2010 4.808.488.277,86
111 3/2010 4.039.347.179,81
112 4/2010 3.867.676.530,69
113 5/2010 6.356.164.873,94
114 6/2010 3.961.793.391,19
115 7/2010 3797656130.81
116 8/2010 4709949715.37
117 9/2010 4047436592.12
118 10/2010 3923484635.28
119 11/2010 4821729985.03
120 12/2010 5024757038.22",
header = TRUE,
stringsAsFactors = TRUE)
My objective is to transform the receita column to numeric, as it is being stored as a factor. But applying conversion functions like as.numeric(as.factor(x)) does not work in the interval 97:114 (it coerces to NA's).
I suppose that this is because of the periods separating billion/million/thousands in this column.
The mentioned conversion functions will work only if I have something like 3812819062.06 as in 115:120.
I tried mutating the dataset adding another column and modelling.
I don't really know if what I'm doing is fine, but I also tried extracting the anomalous numbers to a variable and applying sub/gsub on them, without success.
Is there some straightforward way of doing this, that is, to remove the first two occurrences of '.' and then replace the comma with a '.'?
I'm very confident that the function I need is gsub, but I'm having a hard time finding the correct usage. Any help will be appreciated.
Edit: My approach using dplyr::mutate(). Ugly but works.
df <- df %>%
  mutate(receita_temp = receita) %>%
  mutate(dot_count = str_count(receita, '\\.')) %>%
  mutate(receita_temp = ifelse(dot_count == 3,
                               gsub('\\.', '', as.factor(receita_temp)),
                               gsub('\\,', '.', as.factor(receita_temp)))) %>%
  mutate(receita_temp = ifelse(dot_count == 3,
                               gsub('\\,', '.', as.factor(receita_temp)),
                               receita_temp)) %>%
  select(-c(dot_count, receita)) %>%
  rename(., receita = receita_temp)
I'm using regex and some stringr functions to remove all the periods except those followed by two digits and the end of the string. That way, periods denoting separation like in 3.811.786.009,24 are removed, but periods denoting the start of a decimal like in 4821729985.03 are not. Using str_remove_all rather than str_remove lets me not have to worry about removing the matches repeatedly or about how well it will scale. Then replace the remaining commas with periods, and make it numeric.
library(tidyverse)
df2 <- df %>%
  mutate(receita = str_remove_all(receita, "\\.(?!\\d{2,}$)") %>%
           str_replace_all(",", ".") %>%
           as.numeric())
print(head(df2), digits = 12)
#> n mesAno receita
#> 1 97 1/2009 3812819062.06
#> 2 98 2/2009 4039362599.36
#> 3 99 3/2009 3652885587.18
#> 4 100 4/2009 3460247960.02
#> 5 101 5/2009 3465677403.12
#> 6 102 6/2009 3131903622.55
Created on 2018-09-04 by the reprex package (v0.2.0).
You can use the following:
First create a function that will be used for replacement:
repl = function(x) setNames(c("", "."), c(".", ","))[x]
This function takes either "." or "," and returns "" or "." respectively.
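A quick check of what the helper returns (just to illustrate; the names that setNames() attaches to the result are incidental here):
repl(".")   # returns "" (named ".")
repl(",")   # returns "." (named ",")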
Now use this function to do the replacement:
stringr::str_replace_all(as.character(df[,3]), "[.](?!\\d+$)|,", repl)
[1] "3812819062.06" "4039362599.36" "3652885587.18" "3460247960.02" "3465677403.12" "3131903622.55"
[7] "3204983361.46" "3811786009.24" "3180864095.05" "3352535553.88" "5214148756.95" "4491795201.50"
[13] "4333557619.30" "4808488277.86" "4039347179.81" "3867676530.69" "6356164873.94" "3961793391.19"
[19] "3797656130.81" "4709949715.37" "4047436592.12" "3923484635.28" "4821729985.03" "5024757038.22"
Of course you can do the rest. ie calling as.numeric() etc.
To do this in base R:
sub(',','.',gsub('[.](?!\\d+$)','',as.character(df[,3]),perl=T))
Or, if you know the exact number of . and , in your data, you could do:
a = as.character(df[,3])
regmatches(a,gregexpr('[.](?!\\d+$)|,',df[,3],perl = T)) = list(c("","","","."))
a
df$num <- as.numeric(sapply(as.character(df$receita),
                            function(x) gsub("\\,", "\\.", ifelse(grepl("\\,", x), gsub("\\.", "", x), x))))
should do the trick.
First, the function searches for entries containing ",", removes the "." in those entries, and finally converts every remaining "," into ".", so that the result can be converted to numeric without problems.
Use print(df$num, digits = 12) to see your data with 2 decimals.