Split numeric variables by decimals in R

I have a data frame with a column that contains numeric values, which represent the price.
ID    Total
1124  12.34
1232  12.01
1235  13.10
I want to split the column Total by "." and create 2 new columns with the euro and cent amount. Like this:
ID    Total  Euro  Cent
1124  12.34  12    34
1232  12.01  12    01
1235  13.10  13    10
1225  13.00  13    00
The Euro and Cent columns should also be numeric.
I tried:
df[c('Euro', 'Cent')] <- str_split_fixed(df$Total, "(\\.)", 2)
But I get two new columns of type character that look like this:
ID    Total  Euro  Cent
1124  12.34  12    34
1232  12.01  12    01
1235  13.10  13    1
1225  13.00  13
If I convert the character columns (Euro and Cent) to numeric like this:
as.numeric(df$Euro)
the 00 cent value turns into NA and the 10 cent value turns into 1 cent.
Any help is welcome.

Two methods:
If class(dat$Total) is numeric, you can do this:
dat <- transform(dat, Euro = Total %/% 1, Cent = 100 * (Total %% 1))
dat
# ID Total Euro Cent
# 1 1124 12.34 12 34
# 2 1232 12.01 12 1
# 3 1235 13.10 13 10
%/% is the integer-division operator, %% the modulus operator.
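One caveat worth flagging: because Total is stored as a double, 100 * (Total %% 1) can land a hair below the intended integer (for example 33.999999... for 12.34). A minimal guard, assuming the same dat as above, is to round the Cent column:
# Hedged sketch: wrap the cent calculation in round() so floating-point
# representation error (12.34 is not stored exactly) cannot leave values
# like 33.99999999999999 in Cent.
dat <- transform(dat, Euro = Total %/% 1, Cent = round(100 * (Total %% 1)))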
If class(dat$Total) is character, then
dat <- transform(dat, Euro = sub("\\..*", "", Total), Cent = sub(".*\\.", "", Total))
dat
# ID Total Euro Cent
# 1 1124 12.34 12 34
# 2 1232 12.01 12 01
# 3 1235 13.10 13 10
The two new columns are also character. From here, you may want one of two more steps:
Removing leading 0s and keeping them as character:
dat[,c("Euro", "Cent")] <- lapply(dat[,c("Euro", "Cent")], sub, pattern = "^0+", replacement = "")
dat
# ID Total Euro Cent
# 1 1124 12.34 12 34
# 2 1232 12.01 12 1
# 3 1235 13.10 13 10
Convert to numbers:
dat[,c("Euro", "Cent")] <- lapply(dat[,c("Euro", "Cent")], as.numeric)
dat
# ID Total Euro Cent
# 1 1124 12.34 12 34
# 2 1232 12.01 12 1
# 3 1235 13.10 13 10
(You can also use as.integer if you know both columns will always hold whole numbers.)

Just use standard numeric functions:
df$Euro <- floor(df$Total)
df$Cent <- df$Total %% 1 * 100
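If the zero-padded look from the desired output matters ("01", "00"), a padded display column can sit alongside the numeric ones. A small sketch, assuming the four-row df from the question:
# Rebuild the example data, then add numeric Euro/Cent plus a padded label.
df <- data.frame(ID = c(1124, 1232, 1235, 1225),
                 Total = c(12.34, 12.01, 13.10, 13.00))
df$Euro <- floor(df$Total)
df$Cent <- round(df$Total %% 1 * 100)     # round() guards against floating-point drift
df$CentLabel <- sprintf("%02d", df$Cent)  # "34", "01", "10", "00" -- character, display only
df
#     ID Total Euro Cent CentLabel
# 1 1124 12.34   12   34        34
# 2 1232 12.01   12    1        01
# 3 1235 13.10   13   10        10
# 4 1225 13.00   13    0        00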

Related

Merge different datasets

I have a question: I need to merge two different datasets into one, but they have different classes. How can I do it? rbind doesn't work, any ideas?
nycounties <- rgdal::readOGR("https://raw.githubusercontent.com/openpolis/geojson-italy/master/geojson/limits_IT_provinces.geojson")
city <- c("Novara", "Milano","Torino","Bari")
dimension <- c("150000", "5000000","30000","460000")
df <- cbind(city, dimension)
total <- rbind(nycounties,df)
Are you looking for something like this?
nycounties@data = data.frame(nycounties@data,
                             df[match(nycounties@data[, "prov_name"],
                                      df[, "city"]), ])
Output
nycounties@data[!is.na(nycounties@data$dimension), ]
prov_name prov_istat_code_num prov_acr reg_name reg_istat_code reg_istat_code_num prov_istat_code city dimension
0 Torino 1 TO Piemonte 01 1 001 Torino 30000
2 Novara 3 NO Piemonte 01 1 003 Novara 150000
12 Milano 15 MI Lombardia 03 3 015 Milano 5000000
81 Bari 72 BA Puglia 16 16 072 Bari 460000
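If you prefer not to touch the @data slot directly, the sp package also provides a merge() method for Spatial*DataFrame objects. A hedged sketch along the same lines (dimension stays character here unless you convert it):
# Assumes nycounties and df as built above; keeps every polygon and attaches
# city/dimension where prov_name matches a city in df.
total <- sp::merge(nycounties, as.data.frame(df),
                   by.x = "prov_name", by.y = "city", all.x = TRUE)
head(total@data[!is.na(total@data$dimension), ])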

How to add the value of a row to other rows based on some criteria in R?

I have panel data for costs, sampled monthly for various product types. I also have "Generic" costs which don't belong to any product type. A super simple representative df looks like this:
type <- c("A","A","B","B","C","C","Generic","Generic")
year <- c(2020,2020,2020,2020,2020,2020,2020,2020)
month <- c(1,2,1,2,1,2,1,2)
cost <- c(1,2,3,4,5,6,600,630)
volume <- c(10,11,20,21,30,31,60,63)
df <- data.frame(type,year,month,cost,volume)
type year month cost volume
A 2020 1 1 10
A 2020 2 2 11
B 2020 1 3 20
B 2020 2 4 21
C 2020 1 5 30
C 2020 2 6 31
Generic 2020 1 600 60
Generic 2020 2 630 63
I need to distribute the "Generic" costs to product types according to their "Volume".
For example,
For 2020-1, the volume ratio of
product type A: 10 / (10 + 20 + 30) = 1/6
product type B: 20 / (10 + 20 + 30) = 2/6
product type C: 30 / (10 + 20 + 30) = 3/6
For 2020-2, the volume ratio of
product type A: 11 / (11 + 21 + 31) = 11/63
product type B: 21 / (11 + 21 + 31) = 21/63
product type C: 31 / (11 + 21 + 31) = 31/63
So, I would like to distribute "Generic" costs for 2020-1 to product types like this:
1/6 * 600 = 100 for product type A
2/6 * 600 = 200 for product type B
3/6 * 600 = 300 for product type C
Similarly for 2020-2, I would like to distribute "Generic" costs like:
11/63 * 630 = 110 for product type A
21/63 * 630 = 210 for product type B
31/63 * 630 = 310 for product type C
In the end, I would like to have the following data frame:
type year month new_cost volume
A 2020 1 101 10
A 2020 2 112 11
B 2020 1 203 20
B 2020 2 214 21
C 2020 1 305 30
C 2020 2 316 31
I already have the total volume in the original data frame within the "Generic" type, so there is no need to calculate that separately.
I was trying to do these calculations via the dplyr package's group_by() and mutate() functions, but I couldn't figure out how.
Any help is appreciated.
We can do this using data.table: first merge in the generic costs separately, then spread them according to each type's share of the volume in each month/year:
df <- setDT(df)
generic <- df[type == "Generic"]
setnames(generic, "cost", "generic_cost")
df <- df[type != "Generic"]
df[, volume_ratio := volume / sum(volume), by = c("year", "month")]
df <- merge(df, generic[, c("year", "month", "generic_cost")], by = c("year", "month"))
df[, new_cost := cost + (generic_cost * volume_ratio)]
Which gives us:
df
year month type cost volume volume_ratio generic_cost new_cost
1: 2020 1 A 1 10 0.1666667 600 101
2: 2020 1 B 3 20 0.3333333 600 203
3: 2020 1 C 5 30 0.5000000 600 305
4: 2020 2 A 2 11 0.1746032 630 112
5: 2020 2 B 4 21 0.3333333 630 214
6: 2020 2 C 6 31 0.4920635 630 316
This keeps a few helper columns, but new_cost is the column of interest.
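Since the question mentions group_by() and mutate(), here is a hedged dplyr sketch of the same idea (assuming the original df from the question): compute each type's share of the non-Generic volume per year/month, then add that share of the Generic cost to its own cost.
library(dplyr)
# Pull the Generic rows out as a small lookup of year/month -> generic_cost.
generic <- df %>%
  filter(type == "Generic") %>%
  select(year, month, generic_cost = cost)
# Share out the generic cost by each type's volume ratio within year/month.
df %>%
  filter(type != "Generic") %>%
  group_by(year, month) %>%
  mutate(volume_ratio = volume / sum(volume)) %>%
  ungroup() %>%
  left_join(generic, by = c("year", "month")) %>%
  mutate(new_cost = cost + generic_cost * volume_ratio)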

R - Regex to extract numbers prior to keyword with varying formatting

I need to extract numbers prior to their respective units from a string. Unfortunately the inputs sometimes vary and this is giving me trouble.
Sample data:
df <- data.frame(id = c(1, 2, 3, 4),
                 targets = c("1800 kcal 75 g", "2000kcal 80g", "1900 kcal,87g", "2035kcal,80g"))
> df
id targets
1 1 1800 kcal 75 g
2 2 2000kcal 80g
3 3 1900 kcal,87g
4 4 2035kcal,80g
Desired output:
df <- data.frame(id = c(1, 2, 3, 4),
                 targets = c("1800 kcal 75 g", "2000kcal 80g", "1900 kcal,87g", "2035kcal,80g"),
                 kcal_target = c("1800", "2000", "1900", "2035"),
                 protein_target = c("75", "80", "87", "80"))
> df
id targets kcal_target protein_target
1 1 1800 kcal 75 g 1800 75
2 2 2000kcal 80g 2000 80
3 3 1900 kcal,87g 1900 87
4 4 2035kcal,80g 2035 80
I got as far as this, but it breaks down when there is a space between the number and the unit keyword, or a comma after the unit keyword.
df <- df %>%
  mutate(calorie_target = str_extract_all(targets, regex("\\d+(?=kcal)|\\d+(?=kcal,)"))) %>%
  mutate(protein_target = str_extract_all(targets, regex("\\d+(?=g)")))
> df
id targets calorie_target protein_target
1 1 1800 kcal 75 g
2 2 2000kcal 80g 2000 80
3 3 1900 kcal,87g 87
4 4 2035kcal,80g 2035 80
edit: removed portion of code I'm not trying to capture
Base R with strcapture:
strcapture("(\\d+)\\D+(\\d+)", df$targets, list(calorie=0L, protein=0L))
# calorie protein
# 1 1800 75
# 2 2000 80
# 3 1900 87
# 4 2035 80
You can cbind this to the original:
cbind(df, strcapture("(\\d+)\\D+(\\d+)", df$targets, list(calorie=0L, protein=0L)))
# id targets calorie protein
# 1 1 1800 kcal 75 g 1800 75
# 2 2 2000kcal 80g 2000 80
# 3 3 1900 kcal,87g 1900 87
# 4 4 2035kcal,80g 2035 80
If you wanted to put this in a dplyr pipe, then
library(dplyr)
df %>%
  bind_cols(strcapture("(\\d+)\\D+(\\d+)", .$targets, list(calorie=0L, protein=0L)))
# id targets calorie protein
# 1 1 1800 kcal 75 g 1800 75
# 2 2 2000kcal 80g 2000 80
# 3 3 1900 kcal,87g 1900 87
# 4 4 2035kcal,80g 2035 80
Note that strcapture uses regexec and regmatches under the hood, so it is similar to @ThomasIsCoding's answer in that respect.
For the regex:
\\d matches any digit (including unicode digits); it is similar to [0-9] but also covers other numerals (see https://stackoverflow.com/a/16621778/3358272);
\\D matches any non-digit;
+ means one or more (of the preceding character/class).
A good reference if you need it is https://stackoverflow.com/a/22944075/3358272.
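Since the question started from a dplyr pipe, the same "(\\d+)\\D+(\\d+)" pattern also drops into tidyr::extract(). A sketch, assuming the df from the question:
library(dplyr)
library(tidyr)
# extract() pulls both capture groups in one pass; convert = TRUE turns them
# into integers instead of leaving them as character.
df %>%
  extract(targets, into = c("kcal_target", "protein_target"),
          regex = "(\\d+)\\D+(\\d+)", remove = FALSE, convert = TRUE)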
Here is a data.table option using regmatches + transpose
setDT(df)[, setNames(transpose(regmatches(targets, gregexpr("\\d+", targets))),
                     c("kcal_target", "protein_target")), id]
which gives
id kcal_target protein_target
1: 1 1800 75
2: 2 2000 80
3: 3 1900 87
4: 4 2035 80
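If data.table::transpose() is unfamiliar: it flips the list returned by regmatches() from one element per row into one element per captured position, which is what lets setNames() label the two new columns. A tiny standalone illustration:
library(data.table)
x <- c("1800 kcal 75 g", "2000kcal 80g")
m <- regmatches(x, gregexpr("\\d+", x))
m
# [[1]] "1800" "75"
# [[2]] "2000" "80"
transpose(m)
# [[1]] "1800" "2000"
# [[2]] "75"   "80"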

Splitting a column with multiple and unevenly distributed delimiters in R

I have a column/vector of character data that I need to separate into different columns. The problem? There are different delimiters (which mean different things), and different lengths between each delimiter. For example:
column_name
akjhaa 1-29 y 12-30
bsd, 14-20
asdf asdf del 2-5 y 6
dkljwv 3-31
joikb 6-22
sqwzsxcryvyde jd de 1-2
pk, ehde 1-2
jsd 1-15
asdasd asedd 1,3
The numbers need to be separated into columns, apart from the characters. However, the numbers can be separated by a comma, a dash, or a 'y'. Moreover, the numbers separated by a dash should be designated somehow because, eventually, I need a document/vector where each number in such a range gets its own column (so the split akjhaa row would become akjhaa 1 2 3 4 5 ... 29 12 13 ... 30).
So far, I have tried separating into columns based on the different delimiters, but because the values sometimes contain more than one '-' or 'y', or a 'y' appears as part of the leading character portion, it is getting complicated. Is there an easier way?
For clarification, for the particular "column_name" I gave, the final output would have n columns, where n = (the largest count of numbers in any row) + 1 (for the character string). So, in the example of the provided "column_name", it would look like:
column_name n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13 n14 n15 n16 n17 n18 n19 n20 n21 n22 n23 n24 n25 n26 n27 n28 n29 n30 n31 n32 n33 n34 n35 n36 n37 n38 n39 n40 n41 n42 n43 n44 n45 n46 n47 n48 n49 n50 n51 n52 n53 n54 n55 n56 n57 n58
akjhaa 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
bsd 14 15 16 17 18 19 20
asdf asdf del 2 3 4 5 6
dkljwv 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
joikb 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
sqwzsxcryvyde jd de 1 2
pk ehde 1 2
jsd 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
asdasd asedd 1 3
This isn't pretty, but it works. The result is a list column with the relevant values.
library(magrittr)
library(splitstackshape)
setDT(mydf)[, CN := gsub("(.*?) ([0-9].*)", "\\1SPLIT\\2", column_name)] %>%
  cSplit("CN", "SPLIT") %>%
  cSplit("CN_2", "[y,]", "long", fixed = FALSE) %>%
  cSplit("CN_2", "-") %>%
  .[, list(values = list(if (is.na(CN_2_2)) CN_2_1 else CN_2_1:CN_2_2)),
    .(CN_1, rowid(CN_1))] %>%
  .[, list(values = list(unlist(values))), .(CN_1)]
# CN_1 values
# 1: akjhaa 1,2,3,4,5,6,...
# 2: bsd, 14,15,16,17,18,19,...
# 3: asdf asdf del 2,3,4,5,6
# 4: dkljwv 3,4,5,6,7,8,...
# 5: joikb 6, 7, 8, 9,10,11,...
# 6: sqwzsxcryvyde jd de 1,2
# 7: pk, ehde 1,2
# 8: jsd 1,2,3,4,5,6,...
# 9: asdasd asedd 1,3
To get the extra columns instead of a list, you would need one more line: cbind(., .[, data.table::transpose(values)]):
as.data.table(mydf)[, CN := gsub("(.*?) ([0-9].*)", "\\1SPLIT\\2", column_name)] %>%
  cSplit("CN", "SPLIT") %>%
  cSplit("CN_2", "[y,]", "long", fixed = FALSE) %>%
  cSplit("CN_2", "-") %>%
  .[, list(values = list(if (is.na(CN_2_2)) CN_2_1 else CN_2_1:CN_2_2)),
    .(CN_1, rowid(CN_1))] %>%
  .[, list(values = list(unlist(values))), .(CN_1)] %>%
  cbind(., .[, data.table::transpose(values)])
The basic idea is to do the following steps:
Split the column names from the values.
Split values separated by "y" or by a "," into new rows.
Split values separated by "-" into multiple columns.
Create your list of vectors according to the rule: if the value in the second split column is NA, return just the value from the first column; otherwise, create the sequence from the value in the first column to the value in the second column. Since you have duplicated "id" values after converting the data to a longer form, use rowid() to help with the grouping.
Consolidate the values in the list column according to the actual IDs.
(Optionally, in my opinion) transform the list data into multiple columns.
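For comparison, here is a hedged base-R sketch of the same expansion, without splitstackshape (it assumes strings shaped like the sample: a label, then numbers separated by "-", "," or a standalone "y"):
expand_row <- function(s) {
  label  <- sub("\\s*\\d.*$", "", s)  # everything before the first digit
  nums   <- sub("^\\D*", "", s)       # the numeric tail
  # pieces separated by "," or a standalone "y" are independent; "a-b" expands to a:b
  pieces <- strsplit(nums, "\\s*(,|\\by\\b)\\s*")[[1]]
  vals   <- unlist(lapply(pieces, function(p) {
    rng <- as.numeric(strsplit(p, "-")[[1]])
    if (length(rng) == 2) seq(rng[1], rng[2]) else rng
  }))
  list(label = label, values = vals)
}
expand_row("akjhaa 1-29 y 12-30")
# $label: "akjhaa"; $values: 1 2 ... 29 12 13 ... 30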

deletion of leading zeros in string split in R

The code below downloads census data from the United States Census Bureau, names the columns, and aims to split the column called FIPS into two. The FIPS column is numeric. The characters in positions 1 and 2 should go into one column, StateFIPS, and the characters in positions 4 and 5 will make up the CountyFIPS column. The character in the 3rd position will be discarded. The problem I run into is that leading zeros are deleted.
In a previous post, I provided only a segment of code to learn how to split the string, which helped. However, when I applied it to my bigger code chunk it did not work. How do I prevent the deletion of leading zeros while splitting a string in the code below?
#State census data from 1990 to 1999
censusneeded <- seq(90, 99, 1)
for (i in 1:length(censusneeded)) {
  URL <- paste("https://www.census.gov/popest/data/intercensal/st-co/tables/STCH-Intercensal/STCH-icen19", censusneeded[i], ".txt", sep = "")
  destfile <- paste(censusneeded[i], "statecensus.txt", sep = "")
  download.file(URL, destfile)
}
#Data fields Year, FIPS Code, FIPS code county, Age Group, Race-Sex, Ethnic Origin, POP
#We need to give names to the columns and separate the FIPS State Code and FIPS Code county
cleancensus_1990_1999 <- function(statecensus) {
  colnames(statecensus_90_99) <- c("Year", "FIPS", "AgeGroup", "RaceSex",
                                   "HispanicStatus", "Population")  # label the columns
  ## separate the FIPS column into a column of State FIPS code and County FIPS code by
  x <- c(as.character(statecensus_90_99$FIPS))
  # x <- as.vector(as.character(statecensus_90_99$FIPS)) # I thought converting the column to a character vector would prevent the drop of leading zeros when splitting the string
  newfips <- lapply(2:3, function(i) if (i == 2) str_sub(x, end = i) else str_sub(x, i + 1))
  StateFIPS <- newfips[[1]]
  # StateFIPS <- substr(x, 1, 2) # 2nd attempt also doesn't work
  CountyFIPS <- newfips[[2]]
  # CountyFIPS <- str_sub(x, 4, 5) # 2nd attempt also did not work because it drops leading zeros
  return(statecensus)
}
#lets apply the cleaning to census 90 to 99
for (i in 1:length(censusneeded)) {
  statecensus <- read.table(paste(censusneeded[i], "statecensus.txt", sep = ""))
  newcensus <- cleancensus_1990_1999(statecensus)
  write.csv(newcensus, paste(censusneeded[i], "state1990_1999.txt", sep = ""))
}
Thank you!
I rewrote your function so that it returns the original data frame plus two additional columns, StateFIPS and CountyFIPS. (Side note: do you really want only a 2-character CountyFIPS? That way 06001 (Alameda County, CA) and 06101 (Sutter County, CA) will both get a CountyFIPS of "01".)
cleancensus <- function(d) {
  colnames(d) <- c("Year", "FIPS", "AgeGroup", "RaceSex",
                   "HispanicStatus", "Population")
  d$FIPS <- sprintf("%05d", d$FIPS)
  d$StateFIPS <- substr(d$FIPS, 1, 2)
  d$CountyFIPS <- substr(d$FIPS, 4, 5)
  d
}
Try out the function:
data_url <- "https://www.census.gov/popest/data/intercensal/st-co/tables/STCH-Intercensal/STCH-icen1999.txt"
statecensus <- read.table(url(data_url))
d <- cleancensus(statecensus)
head(d)
# Year FIPS AgeGroup RaceSex HispanicStatus Population StateFIPS CountyFIPS
# 1 99 01001 0 1 1 218 01 01
# 2 99 01001 0 2 1 239 01 01
# 3 99 01001 1 1 1 947 01 01
# 4 99 01001 1 2 1 928 01 01
# 5 99 01001 2 1 1 1460 01 01
# 6 99 01001 2 2 1 1355 01 01
It behaves as expected (leading zeros are retained). Now, suppose we write it to csv, and read it back:
write.csv(d, "~/Desktop/census99.csv", row.names = FALSE)
d <- read.csv("~/Desktop/census99.csv")
head(d)
# Year FIPS AgeGroup RaceSex HispanicStatus Population StateFIPS CountyFIPS
# 1 99 1001 0 1 1 218 1 1
# 2 99 1001 0 2 1 239 1 1
# 3 99 1001 1 1 1 947 1 1
# 4 99 1001 1 2 1 928 1 1
# 5 99 1001 2 1 1 1460 1 1
# 6 99 1001 2 2 1 1355 1 1
The leading zeros are gone. This is because read.csv coerces character vectors to numeric where it can. There are (at least) two ways to solve this:
sprintf. Use the sprintf function to pad the numbers with leading zeros, so e.g. calling sprintf("%03d", 7) -- take an integer value ("d") and make it 3 characters wide, padding with leading 0s when necessary -- returns "007":
d$FIPS <- sprintf("%05d", d$FIPS)
d$StateFIPS <- sprintf("%02d", d$StateFIPS)
d$CountyFIPS <- sprintf("%02d", d$CountyFIPS)
Specify the column classes when you read in the data:
d <- read.csv("~/Desktop/census99.csv",
              colClasses = c("numeric",           # Year
                             "character",         # FIPS
                             rep("numeric", 4),   # AgeGroup..Population
                             rep("character", 2)  # StateFIPS, CountyFIPS
              ))
head(d)
# Year FIPS AgeGroup RaceSex HispanicStatus Population StateFIPS CountyFIPS
# 1 99 01001 0 1 1 218 01 01
# 2 99 01001 0 2 1 239 01 01
# 3 99 01001 1 1 1 947 01 01
# 4 99 01001 1 2 1 928 01 01
# 5 99 01001 2 1 1 1460 01 01
# 6 99 01001 2 2 1 1355 01 01
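If the readr package is already in your workflow, the same effect is available at read time via col_types. A sketch, assuming the census99.csv written above:
library(readr)
# Keep FIPS and the two split columns as character; read everything else as numeric.
d <- read_csv("~/Desktop/census99.csv",
              col_types = cols(FIPS = col_character(),
                               StateFIPS = col_character(),
                               CountyFIPS = col_character(),
                               .default = col_double()))
head(d)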
