Splitting a column in a dataframe in R into two based on content

I have a column in an R dataframe that holds a product weight, e.g. 20 kg, but it contains mixed measurement systems, e.g. 1 lbs and 2 kg. I want to separate the value from the unit, put them in separate columns, and then convert them in a new column to a standard weight. Any thoughts on how I might achieve that? Thanks in advance.

Assume you have the column given as
x <- c("20 kg","50 lbs","1.5 kg","0.02 lbs")
and you know that there is always a space between the number and the measurement. Then you can split the string at the space character, e.g. via
splitted <- strsplit(x, " ")
This results in a list of character vectors of length two, where the first element is the number and the second is the measurement.
Now grab the numbers and convert them via
numbers <- as.numeric(sapply(splitted,"[[",1))
and grab the units via
units <- sapply(splitted,"[[",2)
Now you can put everything together in a data.frame (a full sketch follows below).
Note: When using as.numeric, the decimal separator has to be a dot. If you have commas instead, you need to replace them with a dot first, for example via gsub(",", ".", ...).
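Putting it together, here is a minimal sketch of the whole answer, assuming only "kg" and "lbs" occur and using the standard conversion 1 lb = 0.45359237 kg:
x <- c("20 kg", "50 lbs", "1.5 kg", "0.02 lbs")
splitted <- strsplit(x, " ")
numbers <- as.numeric(sapply(splitted, "[[", 1))
units <- sapply(splitted, "[[", 2)
# standardise everything to kilograms
weights <- data.frame(value = numbers, unit = units,
                      kg = ifelse(units == "kg", numbers, numbers * 0.45359237))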

separate(DataFrame, VariableName, into = c("Value", "Metric"), sep = " ")
My case was simple enough that I could get away with a single space as the separator, but I learned that you can also pass a regular expression here for more complex separators.
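For completeness, a minimal sketch with tidyr's separate (the data frame and column names here are made up for illustration); convert = TRUE coerces the extracted value to a number:
library(tidyr)
DataFrame <- data.frame(Weight = c("20 kg", "50 lbs"))
separate(DataFrame, Weight, into = c("Value", "Metric"), sep = " ", convert = TRUE)
#>   Value Metric
#> 1    20     kg
#> 2    50    lbs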

Related

R: how to extract the first integer or decimal number from a text and, if the first number equals one of a given set of numbers, extract the second integer/decimal number

The data is like this:
example - the name of the data table
detail - the first column; it contains strings with numbers in them (the number can be attached to $ and the like, e.g. 25m$, and can also be decimal, e.g. 1.2m$ or $1.2M)
Let's say the data table looks like this:
example$detail<- c("The cole mine market worth every year 100M$ and the equipment they use worth 30$m per capita", "In 2017 the first enterpenur realized there is a potential of 500$M in cole mining", "The cole can make 23b$ per year ans help 1000000 familys living on it")
I want to add a column to the example data table, named "number", that will extract the first number from the string in column "detail". BUT if this number is equal to one of the numbers in the vector "years" (it's not in the example data - it's a separate vector I created), I want it to extract the second number of the string example$detail instead.
So I create the years vector (separate from the data),
years <- c(2016:2030)
I'm trying to create a new column - number.
What I did so far:
I managed to add a variable that extracts the first number of a string with the following commands:
example$number <- as.integer(sub("\\D*(\\d+).*", "\\1", example$detail)) # EXTRACT ONLY INTEGERS
example$number1 <- format(round(as.numeric(str_extract(example$detail, "\\d+\\.*\\d*")), 2), nsmall = 2) # EXTRACT THE NUMBERS AS DECIMALS WITH TWO DIGITS AFTER THE "." (THAT'S ENOUGH FOR ME)
example$number1 <- ifelse(example$number %in% years, TRUE, example$number1) # IF THE FIRST NUMBER EXTRACTED IS IN THE YEARS VECTOR, RETURN "TRUE"
And then I tried to write code that extracts the second number according to that condition, but it's not working - it just returns errors.
I tried:
gsub("[^\d]*[\d]+[^\d]+([\d]+)", example$detail)
str_extract(example$detail, "\d+(?=[A-Z\s.]+$)",[[2]])
as.integer( sub("\\D*(\\d+).*", "\\1", example$detail) )
as.numeric(strsplit(example$detail, "\\D+")[1])
I didn't understand how to match any number (integer/decimal), or how to target THE SECOND number in a string.
thanks a lot!!
Since no good example data is provided I'm just going to 'wing-it' here.
Imagine the dataframe df has the columns year (int) and details (char), then
library(dplyr)
df <- df %>% mutate(
  # strip everything except digits, dots and minus signs
  clean_details = gsub("[^0-9.-]", "", details),
  # the part before the first dot, and the part after it
  clean_details_part1 = as.integer(sapply(strsplit(clean_details, "[.]"), `[`, 1)),
  clean_details_part2 = as.integer(sapply(strsplit(clean_details, "[.]"), `[`, 2))
)
This works with the code I wrote up. I didn't apply the year logic because I see you're proficient enough to do that: I believe a simple ifelse statement would do to create a boolean which you can then filter on, or something more direct; see the sketch below.
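A hedged sketch of that year logic applied to the original question, using stringr to pull every number out of each row and falling back to the second one whenever the first is a year (the pattern and the names nums/number are illustrative, not a fixed recipe):
library(stringr)
years <- 2016:2030
# every integer or decimal number in each string, as a list of character vectors
nums <- str_extract_all(example$detail, "\\d+\\.?\\d*")
example$number <- sapply(nums, function(n) {
  n <- as.numeric(n)
  if (length(n) == 0) return(NA_real_)
  # if the first number is a year, take the second number instead
  if (n[1] %in% years && length(n) > 1) n[2] else n[1]
})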

Tidying financial data with mixed decimal and grouping digits

Context
I need to clean financial data with mixed formats. The data has been punched in manually by different departments, some of them using "." as decimal and "," as grouping digit (e.g. US notation: $1,000,000.00) while others are using "," as decimal and "." as grouping digit (e.g. notation used in certain European countries: $1.000.000,00).
Input:
Here's a fictional example set:
df <- data.frame(Y2019= c("17.530.000,03","28000000.05", "256.000,23", "23,000",
"256.355.855","2565467,566","225,453.126")
)
Y2019
1 17.530.000,03
2 28000000.05
3 256.000,23
4 23,000
5 256.355.855
6 2565467,566
7 225,453.126
Desired result:
Y2019
1 17530000.03
2 28000000.05
3 256000.23
4 23000.00
5 256355855.00
6 2565467.566
7 225453.126
My attempt:
I got pretty close by treating the first occurrence (counting from the right) of "," or "." as the decimal separator and replacing the other occurrences accordingly. However, some entries have no decimals (e.g. entries 4 and 5) or have a variable number of decimals, rendering this strategy less useful.
Any input is greatly appreciated!
Edit:
As per request, I salvaged some of the code of the original attempt. I am sure it could be written a lot cleaner.
df %>%
  mutate(Y2019r = ifelse(str_length(Y2019) - data.frame(str_locate(pattern = ",", Y2019))[, 1] == 2,
                         gsub("\\.", "", Y2019), NA)) %>%
  mutate(Y2019r = ifelse(is.na(Y2019r) & str_length(Y2019) - data.frame(str_locate(pattern = "\\.", Y2019))[, 1] == 2,
                         gsub("\\.", ",", Y2019), Y2019r)) %>%
  mutate(Y2019r = gsub(",", ".", Y2019r))
Y2019 Y2019r
1 17.530.000,03 17530000.03
2 28000000.05 28000000.05
3 256.000,23 256000.23
4 23,000 <NA>
5 256.355.855 <NA>
6 2565467,566 <NA>
7 225,453.126 <NA>
Here's a functional approach to build up the logic needed to parse the strings you might come across. I suppose it is built up from thinking about how we parse these strings when we read them, and trying to emulate that.
I think the key is realising that all we really need to know is whether the value after the last delimiter is decimal or not. If we could somehow label the strings as having a decimal portion it would be easy to parse the strings then.
The following method involves splitting the character strings at the points and commas and trying to label them as having a terminal decimal or not. The split strings will be held as a list of string vectors, with each vector being composed of the "chunks" of digits between the delimiters.
First we will write two helper functions to create the final numbers from the string vectors once we have correctly labeled them as having a terminal decimal portion or not:
last_element_is_decimal <- function(x)
{
as.numeric(paste0(paste(x[-length(x)], collapse = ""), ".", x[length(x)]))
}
last_element_is_whole <- function(x)
{
as.numeric(paste0(x, collapse = ""))
}
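As a quick sanity check against the example rows (the chunks are what splitting on "[.]|," returns):
# chunks of "17.530.000,03"
last_element_is_decimal(c("17", "530", "000", "03"))  # 17530000.03
# chunks of "256.355.855"
last_element_is_whole(c("256", "355", "855"))         # 256355855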
It will be easy to decide what to do in the event of no delimiters, since we assume these are just whole numbers. Similarly, it is easy to see that any numbers containing both a comma and a stop (in either order) must have a terminal decimal component.
However, it is less obvious what to do when there is only a single delimiter; in these cases we have to use the length of the digit chunks to decide. If any chunk is longer than three digits, then a thousands separator isn't in use, and the presence of a delimiter indicates we have a decimal component. If the terminal chunk contains fewer than three digits, we must also have a decimal. In all other cases, we assume a whole number.
This says the same thing in code:
decide_last_element <- function(x)
{
if(max(nchar(x)) > 3)
return(last_element_is_decimal(x))
if(nchar(x[length(x)]) < 3)
return(last_element_is_decimal(x))
return(last_element_is_whole(x))
}
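Checking this decision rule against the trickier rows of the desired result:
# "23,000": no chunk longer than three digits and the last chunk has three, so whole
decide_last_element(c("23", "000"))        # 23000
# "2565467,566": a chunk longer than three digits, so the comma must be a decimal mark
decide_last_element(c("2565467", "566"))   # 2565467.566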
Now we can write our main function. It takes our strings as input and classifies each string into having either two types of delimiter, one type of delimiter or no delimiter. Then we can apply the functions above using lapply accordingly.
parse_money <- function(money_strings)
{
any_comma <- grepl(",", money_strings)
any_point <- grepl("[.]", money_strings)
both <- any_comma & any_point
neither <- !any_comma & !any_point
single <- (any_comma & !any_point) | (any_point & !any_comma)
digit_groups <- strsplit(money_strings, "[.]|,")
values <- rep(0, length(money_strings))
values[neither] <- as.numeric(money_strings[neither])
values[both] <- sapply(digit_groups[both], last_element_is_decimal)
values[single] <- sapply(digit_groups[single], decide_last_element)
return(format(round(values, 2), nsmall = 2))
}
So now we can just do
parse_money(df$Y2019)
#> [1] " 17530000.03" " 28000000.05" " 256000.23" " 23000.00" "256355855.00"
#> [6] " 2565467.57" " 225453.13"
Note I have output as strings so that rounding inaccuracies in the console output aren't ascribed to mistakes in the code.

Gathering the correct amount of digits for numbers when text mining

I need to search for specific information within a set of documents that follows the same standard layout.
After I used grep to find the keywords in every document, I went on collecting the numbers or characters of interest.
One piece of data I have to collect is the Total Power that appears as following:
TotalPower: 986559. (UoPow)
Since I had already correctly selected this excerpt, I created the following function that takes the characters between positions n and m, where n and m start counting up from right to left.
substrRight <- function(x, n,m){
substr(x, nchar(x)-n+1, nchar(x)-m)
}
It's important to say that from the ":" to the number 986559, there are 2 spaces; and from the "." to the "(", there's one space.
So I wrote:
TotalP = substrRight(myDf[i], 17, 9)  # [1]
where myDf is a character vector with all the relevant observations.
Line [1], after I loop over all my observations, gives me the numbers I want, but I noticed that when the number was 986559, the result was 98655. It simply doesn't "see" the final 9.
The code seems to work fine for the rest of the data. This number (986559) is indeed the highest number in the data and the only one of order of magnitude 10^5.
How can I make sure that I will gather all digits in every number?
Thank you for the help.
We can extract the digits before a "." by using a regex lookaround (str1 here is the excerpt quoted above):
library(stringr)
str1 <- "TotalPower:  986559. (UoPow)"
str_extract(str1, "\\d+(?=\\.)")
#[1] "986559"
The \\d+ matches one or more digits, and the (?=\\.) lookahead requires them to be immediately followed by a literal dot.
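If you prefer base R, the same lookahead works with regexpr/regmatches, provided you ask for Perl-compatible regular expressions:
regmatches(str1, regexpr("\\d+(?=\\.)", str1, perl = TRUE))
#[1] "986559"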

Split character string into unequal segments R

I have some data that I need to split into multiple elements, but there is no specific delimiter within the row to split on. I know the positions of the different variables; is there a way I can split the string into multiple uneven parts based on my prior information? Example:
String: " 00008 L 1957110642706 194711071019561030R 1/812.5000000"
Desired result:
" 00008 "," ","L"," "," ","19571106","42706"," ","19471107","10","19561030","R 1/8","12.5000000"
So, my prior information is that the first element begins on the first position and is seven spaces long; the second begins at the 8th position in the string and is 8 spaces long; the 3rd element starts at the 16th position and is 1 space long, etc, etc.
xstr <- " 00008 L 1957110642706 194711071019561030R 1/812.5000000"
Rather than use this description:
first element begins on the first position and is seven spaces long; the second begins at the 8th position in the string and is 8 spaces long; the 3rd element starts at the 16th position and is 1 space long, etc, etc. ...
I'm just going to take the desired widths from your specified answer (nchar(res)):
res <- c(" 00008 "," ","L"," "," ","19571106","42706"," ","19471107","10","19561030","R 1/8","12.5000000")
Make sure that all variables are read as character strings:
res2 <- read.fwf(textConnection(xstr), widths = nchar(res),
                 colClasses = rep("character", length(res)))
Test:
all.equal(unname(unlist(res2)),res) ## TRUE
You can also use a simple substr function over your array of read lines:
my_lines <- readLines("your_file") # or however you read the lines in
firstColumn <- substr(my_lines,1,7) #you can also use as.numeric and others if needed
secondColumn <- substr(my_lines,8,11)
# ..etc
rm(my_lines) #to save memory
Sometimes this is actually faster than the various read.* functions, especially if you don't use them correctly.
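A related base-R trick: substring() is vectorised over its start and stop arguments, so given the field widths (taken from nchar(res) as in the previous answer) you can cut a line into all of its fields at once:
widths <- nchar(res)
starts <- cumsum(c(1, head(widths, -1)))
stops  <- cumsum(widths)
fields <- substring(xstr, starts, stops)
all.equal(fields, res) ## TRUE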

Counting specific characters in a string, across a data frame, with sapply

I have found similar problems to this here:
Count the number of words in a string in R?
and here
Faster way to split a string and count characters using R?
but I can't get either to work in my example.
I have quite a large dataframe. One of the columns has genomic locations for features and the entries are formatted as follows:
[hg19:2:224840068-224840089:-]
[hg19:17:37092945-37092969:-]
[hg19:20:3904018-3904040:+]
[hg19:16:67000244-67000248,67000628-67000647:+]
I am splitting these entries out into their individual elements to get the following (i.e. for the first entry):
hg19 2 224840068 224840089 -
But in the case of the fourth entry, I would like to parse this into two separate locations,
i.e.
[hg19:16:67000244-67000248,67000628-67000647:+]
becomes
hg19 16 67000244 67000248 +
hg19 16 67000628 67000647 +
(with all the associated data in the adjacent columns filled in from the original)
An easy way for me to identify which rows need this action is to simply count the rows containing a comma ',', as commas don't appear in any other text in any other column, except where there are multiple genomic locations for the feature.
However I am failing at the first hurdle because the sapply command incorrectly returns '1' for every entry.
testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates), length)
(or)
testdat$multiple <- sapply(gregexpr("\\,", testdat$genome_coordinates), length)
table(testdat$multiple)
1
4
Using the example I have posted above, I would expect the output to be
testdat$multiple
0
0
0
1
Actually doing
grep -c
on the same data in the command line shows I have 10 entries containing ','.
So initially I would like to get this working, but I am also a bit stumped for ideas as to how to then extract the two (or more) locations and put them on their own rows, filling in the adjacent data.
Actually, what I intended to do was stick to something I know (on the command line): grep out the rows with ',', duplicate the file, split and awk the selected columns (first and second location in the respective files), then cat and sort them. If there is a niftier way to do this in R, I would love a pointer.
gregexpr returns one match vector per input string, and that vector always has length at least 1, because a match failure is encoded as -1. So if you want to find the rows which have a match vs. the ones which don't, you need to look at the returned values, not their length.
Try has_comma <- sapply(gregexpr(",", testdat$genome_coordinates), function(m) any(m > 0)) to get the rows with a comma.
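If you also want the counts themselves (0 0 0 1 for the four example coordinates above), count the positive match positions instead:
comma_counts <- sapply(gregexpr(",", testdat$genome_coordinates),
                       function(m) sum(m > 0))
comma_counts
#[1] 0 0 0 1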
