Subtracting rows from dataframe - r

Let's say I have the following dataset:
Industry  Country  Year  AUS   AUS   AUT   AUT    ...
A         AUS      1     0.5   0.2   0.1   0.01
B         AUS      2     0.3   0.5   2     0.1
A         AUT      3     1     1.2   1.3   0.3
B         AUT      4     0.5   0     0.8   2
...       ...      ...   ...   ...   ...   ...
VA                       11    10    47    55
tot                      24    23    50    70
How can I subtract ONLY the last two rows (tot = tot - VA) to get:
Industry  Country  Year  AUS   AUS   AUT   AUT    ...
A         AUS      1     0.5   0.2   0.1   0.01
B         AUS      2     0.3   0.5   2     0.1
A         AUT      3     1     1.2   1.3   0.3
B         AUT      4     0.5   0     0.8   2
...       ...      ...   ...   ...   ...   ...
VA                       11    10    47    55
FI                       13    13    3     15
FI/VA                    1.2   1.3   0.06  0.27
Where FI is simply tot-VA

You could try this: check which columns are numeric, use sapply to calculate the new row FI, then bind the rows together:
num <- sapply(df, class) == "numeric"
df_tot<- data.frame(as.list(sapply(df[, num], function(x) x[length(x)]-x[length(x)-1])))
df_tot$Industry <- "FI"
df <- data.table::rbindlist(list(df, df_tot), fill = TRUE)
EDIT:
If you just want to sum up all rows but the last one, then you could try this:
num <- sapply(df, class) == "numeric"
df_tot <- data.frame(as.list(sapply(df[1:(nrow(df)-1), num], sum)))
df_tot$Industry <- "FI"
df <- data.table::rbindlist(list(df, df_tot), fill = TRUE)

Here's a tidyverse approach to the issue of subtracting selected rows across columns:
library(tidyverse)
df %>%
  # subtract across the relevant columns:
  summarise(across(matches("^AU"), ~ (.x[Industry == "tot"] - .x[Industry == "VA"]))) %>%
  # add the 'new' column `Industry`:
  mutate(Industry = "FI") %>%
  # bind result back into `df`:
  bind_rows(df, .)
Industry AU1 AU2 AU3
1 A 0.1 0.4 7.0
2 B 0.7 3.0 1.0
3 A 3.0 2.5 0.1
4 VA 11.0 10.0 47.0
5 tot 24.0 23.0 50.0
6 FI 13.0 13.0 3.0
If you no longer need rows #4 and #5, add this to the pipe:
filter(!Industry %in% c("VA", "tot"))
Data:
df <- data.frame(
Industry = c("A","B","A","VA","tot"),
AU1 = c(0.1,0.7,3,11,24),
AU2 = c(0.4,3,2.5,10,23),
AU3 = c(7, 1, 0.1,47,50)
)
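The question also asks for an FI/VA row, which the answers above do not compute. Here is a minimal base R sketch, assuming the rows are identified by the labels "VA" and "tot" in Industry (as in the sample data above) and that the tot row should be replaced by FI:

```r
# Sample data from the answer above
df <- data.frame(
  Industry = c("A", "B", "A", "VA", "tot"),
  AU1 = c(0.1, 0.7, 3, 11, 24),
  AU2 = c(0.4, 3, 2.5, 10, 23),
  AU3 = c(7, 1, 0.1, 47, 50),
  stringsAsFactors = FALSE
)

# Work only on the numeric columns
num <- vapply(df, is.numeric, logical(1))
va  <- unlist(df[df$Industry == "VA", num])
tot <- unlist(df[df$Industry == "tot", num])

fi    <- tot - va   # FI = tot - VA
ratio <- fi / va    # FI/VA

# Drop tot, keep VA, then append the FI and FI/VA rows
out <- rbind(
  df[df$Industry != "tot", ],
  data.frame(Industry = "FI", as.list(fi), stringsAsFactors = FALSE),
  data.frame(Industry = "FI/VA", as.list(ratio), stringsAsFactors = FALSE)
)
rownames(out) <- NULL
out
```

Selecting the rows by label rather than by position keeps the result correct even if VA and tot are not the last two rows.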


Make a loop function by including gender into the algorithm

I have the following data set:
Age<-c(2,2.1,2.2,3.4,3.5,4.2,4.7,4.8,5,5.6,NA, 5.9, NA)
R<-c(2,2.1,2.2,3.4,3.5,4.2,4.7,4.8,5,5.6,NA, 5.9, NA)
sex<-c(1,0,1,1,1,1,1,0,0,0,NA, 0,1)
df1<-data.frame(Age,R,sex)
# Second dataset:
Age2<-seq(2,20,0.25)
Mspline<-rnorm(73)
df2.F<-data.frame(Age2, Mspline)
# Third data
Age2<-seq(2,20,0.25)
Mspline<-rnorm(73)
df2.M<-data.frame(Age2, Mspline)
I was wondering how I can include gender in the calculation and combine these two algorithms into a single function. What I need is:
If sex=1 then use the following function to calculate Time
last = dim(df2.F)[1]
fM.F<-approxfun(df2.F$Age2, df2.F$Mspline, yleft = df2.F$Mspline[1] , yright = df2.F$Mspline[last])
df1$Time<-fM.F(df1$Age)
and If sex=0 then use this function to calculate Time
last = dim(df2.M)[1]
fM.M<-approxfun(df2.M$Age2, df2.M$Mspline, yleft = df2.M$Mspline[1] , yright = df2.M$Mspline[last])
df1$Time<-fM.M(df1$Age)
In other words: read each record in df1; if it is female (say, with age = 4.1), then Time = fM.F(4.1), but if it is male, apply fM.M to the age instead, so Time = fM.M(4.1).
You can create a function that takes the Age vector, the sex value, and the male and female specific dataframes, and selects the frame to use based on the sex value.
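f <- function(age, s, m, f) {
  # no sex recorded, so no curve can be chosen
  if (is.na(s)) return(NA)
  # sex == 0 selects the male frame, anything else the female frame
  if (s == 0) df <- m else df <- f
  last <- nrow(df)
  fM <- approxfun(df$Age2, df$Mspline, yleft = df$Mspline[1], yright = df$Mspline[last])
  fM(age)
}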
Now, just apply the function by group, using pull(cur_group(),sex) to get the sex value for the current group.
library(dplyr)
df1 %>%
  group_by(sex) %>%
  mutate(time = f(Age, pull(cur_group(), sex), df2.M, df2.F))
Output:
Age R sex time
<dbl> <dbl> <dbl> <dbl>
1 2 2 1 -0.186
2 2.1 2.1 0 1.02
3 2.2 2.2 1 -1.55
4 3.4 3.4 1 -0.461
5 3.5 3.5 1 0.342
6 4.2 4.2 1 -0.560
7 4.7 4.7 1 -0.114
8 4.8 4.8 0 0.247
9 5 5 0 -0.510
10 5.6 5.6 0 -0.982
11 NA NA NA NA
12 5.9 5.9 0 -0.231
13 NA NA 1 NA
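An alternative sketch without grouping: build both interpolation functions once and choose between them element-wise with ifelse(). It assumes, as in the question, that sex == 1 should use df2.F and sex == 0 should use df2.M. Here rule = 2 is approxfun()'s built-in way of extending the end values, which matches the yleft/yright arguments used above. The set.seed() call is only there to make the random example reproducible.

```r
set.seed(1)  # reproducible random curves (not part of the original question)

Age <- c(2, 2.1, 2.2, 3.4, 3.5, 4.2, 4.7, 4.8, 5, 5.6, NA, 5.9, NA)
sex <- c(1, 0, 1, 1, 1, 1, 1, 0, 0, 0, NA, 0, 1)
df1 <- data.frame(Age, sex)

df2.F <- data.frame(Age2 = seq(2, 20, 0.25), Mspline = rnorm(73))
df2.M <- data.frame(Age2 = seq(2, 20, 0.25), Mspline = rnorm(73))

# One interpolation function per sex; rule = 2 extends the end values
fM.F <- approxfun(df2.F$Age2, df2.F$Mspline, rule = 2)
fM.M <- approxfun(df2.M$Age2, df2.M$Mspline, rule = 2)

# Pick the curve row by row; a missing sex or age yields NA
df1$Time <- ifelse(df1$sex == 1, fM.F(df1$Age), fM.M(df1$Age))
df1
```

Both functions are evaluated for every row and ifelse() keeps the relevant value, which is wasteful for very large data but keeps the code short.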

For loop warning: "number of items to replace is not a multiple of replacement length" with two dataframes

I am trying to create a new vector by applying a transformation to a variable in one of my dataframes, based on data from another dataframe.
I have two dataframes, df1 and df2, with different dimensions: df1 has over 20,000 rows and df2 has 76 rows.
df1 is my original dataset. I created df2 for Ag_ppm as follows:
df2 <- df1 %>%
  filter(!is.na(Ag_ppm)) %>%
  group_by(Year, Zone, SubZone) %>%
  summarise(
    n = sum(!is.na(Ag_ppm)),
    min = min(Ag_ppm),
    max = max(Ag_ppm),
    mean = mean(Ag_ppm),
    sd = sd(Ag_ppm),
    iqr = IQR(Ag_ppm),
    Q1 = quantile(Ag_ppm, 0.25),
    median = median(Ag_ppm),
    Q3 = quantile(Ag_ppm, 0.75),
    LW = min(Ag_ppm > (quantile(Ag_ppm, .25) - 1.5 * IQR(Ag_ppm))),
    UF = quantile(Ag_ppm, .75) + 1.5 * IQR(Ag_ppm))
Here is what the first rows of each data frame look like:
head(df1, n=5)
# A tibble: 5 x 12
Year Zone SubZone Au_ppm Ag_ppm Cu_ppm Pb_ppm Zn_ppm As_ppm Sb_ppm Bi_ppm Mo_ppm
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1990 BugLake BugLake 0.007 3.7 17 27 23 1 1 NA 1
2 1983 Johnny Mountain Johnny Mountain 0.01 1.6 71 63 550 4 NA NA NA
3 1983 Khyber Pass Khyber Pass 0.12 11.5 275 204 8230 178 7 60 NA
4 1987 Chebry Ridge Line Grid 0.05 2.2 35 21 105 16 6 NA NA
5 1987 Chebry Handel Grid 0.004 1.3 29 27 663 45 2 NA NA
head(df2, n=5)
# A tibble: 5 x 14
# Groups: Year, Zone [3]
Year Zone SubZone n min max mean sd iqr Q1 median Q3 LW UF
<chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 1981 Chebry Handel 52 0.6 5.1 1.83 0.947 0.925 1.2 1.6 2.12 1 3.51
2 1981 Imperial Metals Handel 24 0.9 6.9 2.81 1.43 1.35 1.95 2.65 3.3 1 5.33
3 1983 Chebry Chebry 5 0.7 3.7 1.78 1.19 0.9 1.2 1.2 2.1 1 3.45
4 1983 Chebry Handel 17 0.1 0.7 0.318 0.163 0.2 0.2 0.3 0.4 1 0.7
5 1983 Chebry Handel Grid 225 0.1 16 0.892 1.33 0.7 0.3 0.6 1 1 2.05
I want to apply the following equation to my column Ag_ppm in df1 using the median and IQR calculated for each subgroup in df2:
Z = (X - median)/IQR
For that purpose, I wrote:
# Initialize Ag_std vector with NA values
Ag_std <- rep(NA, times = nrow(df1))
# Populate Ag_std vector with standardized Ag values
Ag_std <-
  for (i in 1:nrow(df1)) {
    if (!is.na(df1$Ag_ppm[i])) {
      filter(df2, Zone == df1$Zone[i], Year == df1$Year[i],
             SubZone == df1$SubZone[i])
      Ag_std[i] <- (df1$Ag_ppm[i] - df2$median)/df2$iqr
    }
  }
But the loop does not work (it returns a NULL vector) and I have this warning:
1: In Ag_std[i] <- (df1$Ag_ppm[i] - df2$median)/df2$iqr :
number of items to replace is not a multiple of replacement length
I've looked at similar questions, but I did not find an answer that would work for me. Any help would be much appreciated!
If there are better ways of achieving the same without a loop (I'm sure there are, e.g. apply()), I would appreciate such comments as well. Unfortunately I'm not familiar enough with the alternatives to be able to implement them quickly.
This can be done relatively easily in data.table
library(data.table)
DT <- data.table(df1)
#function to apply
# na.rm = TRUE because Ag_ppm contains missing values
fun <- function(x) (x - median(x, na.rm = TRUE)) / IQR(x, na.rm = TRUE)
# create a new column with desired result
DT[, Ag_std := fun(Ag_ppm), by = list(Year, Zone, SubZone)]
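Also, I think your loop can be fixed by assigning the result of filter() to a temporary object, and by not assigning the for loop itself to Ag_std (a for loop returns NULL, which is why you end up with a NULL vector):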
for (i in 1:nrow(df1)) {
  if (!is.na(df1$Ag_ppm[i])) {
    temp.var <- filter(df2, Zone == df1$Zone[i], Year == df1$Year[i],
                       SubZone == df1$SubZone[i])
    Ag_std[i] <- (df1$Ag_ppm[i] - temp.var$median)/temp.var$iqr
  }
}
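Two separate things go wrong in the original loop, and both can be reproduced in isolation. The sketch below uses made-up vectors rather than the question's data: a for loop always evaluates to NULL, and assigning a longer vector to a single element triggers the replacement-length warning.

```r
# 1. A for loop returns NULL, so `Ag_std <- for (...) {...}`
#    replaces the pre-filled vector with NULL.
res <- for (i in 1:3) i
is.null(res)

# 2. Assigning several values to one element raises the warning.
#    In the question, df2$median has 76 values but Ag_std[i] holds one.
x <- rep(NA_real_, 3)
msg <- tryCatch(
  {
    x[1] <- c(10, 20, 30)
    NA_character_
  },
  warning = function(w) conditionMessage(w)
)
msg

# Assigning exactly one value works without a warning.
x[1] <- c(10, 20, 30)[1]
x
```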
Since you have df2 as a separate dataframe, you can join and mutate:
df1 %>%
  left_join(df2, by = c("Year", "Zone", "SubZone")) %>%
  mutate(Z = (Ag_ppm - median) / iqr)
In fact, you could skip df2 and generate the same per-group median and IQR in df1 itself, using the same summary functions inside group_by() and mutate().
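A sketch of computing the statistic inside df1 itself, here in base R: ave() applies a function within groups and returns a vector aligned with the original rows, so no second data frame or join is needed. The toy values below are invented for illustration; only the column names come from the question.

```r
# Toy data (hypothetical values, for illustration only)
df1 <- data.frame(
  Year    = c("1983", "1983", "1983", "1987", "1987", "1987"),
  Zone    = c("Chebry", "Chebry", "Chebry", "Chebry", "Chebry", "Chebry"),
  SubZone = c("Handel", "Handel", "Handel", "Line Grid", "Line Grid", "Line Grid"),
  Ag_ppm  = c(1, 2, 4, 10, NA, 30),
  stringsAsFactors = FALSE
)

# Z = (X - median) / IQR, ignoring missing values within each group
robust_z <- function(x) (x - median(x, na.rm = TRUE)) / IQR(x, na.rm = TRUE)

# One grouping factor per Year/Zone/SubZone combination that actually occurs
grp <- interaction(df1$Year, df1$Zone, df1$SubZone, drop = TRUE)

df1$Ag_std <- ave(df1$Ag_ppm, grp, FUN = robust_z)
df1
```

Rows with a missing Ag_ppm keep NA in Ag_std, which matches the behaviour intended by the original loop.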

How do I split or create a new column for a list of data in a dataframe?

Please have a look at the preview of the data in the image. I would like to create 3 new columns, i.e. Start, End and Density, and create a new row for each record in these 3 columns.
In accordance with the comments above, you can convert the list into a data.frame as below:
# simulation of data.frame with one row and one cell with histogram
z <- hist(rnorm(1000))
z$start <- z$breaks[-length(z$breaks)]
z$end <- z$breaks[-1]
z[c("mids", "xname", "breaks", "equidist", "counts")] <- NULL
names_z <- names(z)
attributes(z) <- NULL
df <- data.frame(a = 1, b = 2, x = I(list((z))))
# Conversion of list to dataframe
setNames(as.data.frame(unlist(df["x"], recursive = FALSE)), names_z)
Output:
density start end
1 0.012 -3.0 -2.5
2 0.042 -2.5 -2.0
3 0.082 -2.0 -1.5
4 0.182 -1.5 -1.0
5 0.288 -1.0 -0.5
6 0.354 -0.5 0.0
7 0.418 0.0 0.5
8 0.300 0.5 1.0
9 0.172 1.0 1.5
10 0.088 1.5 2.0
11 0.050 2.0 2.5
12 0.012 2.5 3.0
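If each cell of the list column holds a histogram-like object with breaks and density components (an assumption, since the original data is only shown as an image), one row per bin can be built as in this sketch. The names id, h, Start, End and Density are illustrative.

```r
set.seed(42)  # reproducible example data

# Hypothetical data frame with one histogram object per row
df <- data.frame(id = 1:2)
df$h <- list(
  hist(rnorm(100), plot = FALSE),
  hist(rnorm(100), plot = FALSE)
)

# Expand every histogram into one row per bin
rows <- lapply(seq_len(nrow(df)), function(i) {
  h <- df$h[[i]]
  data.frame(
    id      = df$id[i],
    Start   = head(h$breaks, -1),  # left edge of each bin
    End     = h$breaks[-1],        # right edge of each bin
    Density = h$density
  )
})
long <- do.call(rbind, rows)
head(long)
```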

extract irregular numeric data from strings

I have data like below. I wish to extract the first and last year from each string here called my.string. Some strings only contain one year and some strings contain no years. No strings contain more than two years. I have provided the desired result in the object named desired.result below the example data set. I am using R.
When a string contains two years those years are contained within a portion of the string that looks like this ga49.51 or ea22.24
When a string contains only one year that year is contained in a portion of the string that looks like this: time11
I know a bit about regex, but this problem seems too irregular and complex for me to figure out. I am not even sure where to begin. Thank you for any advice.
EDIT
Perhaps delete the numbers before the first colon (:) and the remaining numbers are what I want.
my.data <- read.table(text = '
my.string cov1 cov2
42:Alpha:ga6.8 -0.1 2.2
43:Alpha:ga9.11 -2.5 0.6
44:Alpha:ga30.32 -1.3 0.5
45:Alpha:ga49.51 -2.5 0.6
50:Alpha:time1:ga.time -1.7 0.9
51:Alpha:time2:ga.time -1.5 0.8
52:Alpha:time3:ga.time -1.0 1.0
2:Beta:ea2.9 -1.7 0.6
3:Beta:ea17.19 -5.0 0.8
4:Beta:ea22.24 -6.4 1.0
8:Beta:as 0.2 0.6
9:Beta:sd 1.7 0.4
12:Beta:time1:ea.tim -2.6 1.8
13:Beta:time10:ea.ti -3.6 1.1
14:Beta:time11:ea.ti -3.1 0.7
', header = TRUE, stringsAsFactors = FALSE, na.strings = "NA")
desired.result <- read.table(text = '
my.string cov1 cov2 time1 time2
42:Alpha:ga6.8 -0.1 2.2 6 8
43:Alpha:ga9.11 -2.5 0.6 9 11
44:Alpha:ga30.32 -1.3 0.5 30 32
45:Alpha:ga49.51 -2.5 0.6 49 51
50:Alpha:time1:ga.time -1.7 0.9 1 NA
51:Alpha:time2:ga.time -1.5 0.8 2 NA
52:Alpha:time3:ga.time -1.0 1.0 3 NA
2:Beta:ea2.9 -1.7 0.6 2 9
3:Beta:ea17.19 -5.0 0.8 17 19
4:Beta:ea22.24 -6.4 1.0 22 24
8:Beta:as 0.2 0.6 NA NA
9:Beta:sd 1.7 0.4 NA NA
12:Beta:time1:ea.tim -2.6 1.8 1 NA
13:Beta:time10:ea.ti -3.6 1.1 10 NA
14:Beta:time11:ea.ti -3.1 0.7 11 NA
', header = TRUE, stringsAsFactors = FALSE, na.strings = "NA")
I suggest using the stringr library to extract the data you need, since it handles NA values better and also allows using a constrained-width lookbehind:
> library(stringr)
> my.data$time1 <- str_extract(my.data$my.string, "(?<=time)\\d+|(?<=\\b[ge]a)\\d+")
> my.data$time2 <- str_extract(my.data$my.string, "(?<=\\b[ge]a\\d{1,100}\\.)\\d+")
> my.data
my.string cov1 cov2 time1 time2
1 42:Alpha:ga6.8 -0.1 2.2 6 8
2 43:Alpha:ga9.11 -2.5 0.6 9 11
3 44:Alpha:ga30.32 -1.3 0.5 30 32
4 45:Alpha:ga49.51 -2.5 0.6 49 51
5 50:Alpha:time1:ga.time -1.7 0.9 1 <NA>
6 51:Alpha:time2:ga.time -1.5 0.8 2 <NA>
7 52:Alpha:time3:ga.time -1.0 1.0 3 <NA>
8 2:Beta:ea2.9 -1.7 0.6 2 9
9 3:Beta:ea17.19 -5.0 0.8 17 19
10 4:Beta:ea22.24 -6.4 1.0 22 24
11 8:Beta:as 0.2 0.6 <NA> <NA>
12 9:Beta:sd 1.7 0.4 <NA> <NA>
13 12:Beta:time1:ea.tim -2.6 1.8 1 <NA>
14 13:Beta:time10:ea.ti -3.6 1.1 10 <NA>
15 14:Beta:time11:ea.ti -3.1 0.7 11 <NA>
The first regex matches:
(?<=time)\\d+ - 1+ digits that have time before them
| - or
(?<=\\b[ge]a)\\d+ - 1+ digits preceded by ga or ea that starts at a word boundary
The second regex matches:
(?<=\\b[ge]a\\d{1,100}\\.) - check if the current position is preceded by ga or ea starting at a word boundary, followed by 1 to 100 digits (that should be enough for your scenario, since 100-digit chunks are hardly expected here; you may even decrease the value), and then a .
\\d+ - 1+ digits
Here's a regex that will extract either of the two types, and output them to different columns at the end of the lines:
Search: .*(?:time(\d+)|(?:[ge]a)(\d+)\.(\d+)).*
Replace: $0\t$1\t$2\t$3
Breakdown:
.*(?: ... ).* ensures that the whole line is matched, and uses a non-capturing group for the main alternation
time(\d+): this is the first half of the alternation, capturing any digits after a "time"
(?:[ge]a)(\d+)\.(\d+): the second half of the alternation matches "ga" or "ea" followed by two sets of digits, each in its own capture group
Replacement: $0 puts the whole line back. Each of the other capture groups is then added, with tabs in between.
See regex101 example
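The idea from the question's EDIT (drop everything up to the first colon, then take whatever numbers remain) can be sketched in base R as follows. It assumes, as stated in the question, that at most two numbers remain once the leading id is removed.

```r
# A few strings taken from the question's example data
my.string <- c("42:Alpha:ga6.8", "50:Alpha:time1:ga.time",
               "8:Beta:as", "14:Beta:time11:ea.ti")

# Remove the leading id, i.e. everything up to and including the first colon
rest <- sub("^[^:]*:", "", my.string)

# Collect every remaining run of digits, one character vector per string
nums <- regmatches(rest, gregexpr("[0-9]+", rest))

# First and second number; indexing past the end gives NA
time1 <- vapply(nums, function(x) as.integer(x[1]), integer(1), USE.NAMES = FALSE)
time2 <- vapply(nums, function(x) as.integer(x[2]), integer(1), USE.NAMES = FALSE)

data.frame(my.string, time1, time2)
```

Unlike the lookbehind patterns, this does not check for the ga, ea or time prefixes, so it would also pick up digits in any other part of the string after the first colon.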

r - subsetting one row of data frame drops zero decimal from number

Data
Given a data frame
df <- data.frame("id"=c(1,2,3), "a"=c(10.0, 11.2, 12.3),"b"=c(10.1, 11.9, 12.9))
> df
id a b
1 1 10.0 10.1
2 2 11.2 11.9
3 3 12.3 12.9
> str(df)
'data.frame': 3 obs. of 3 variables:
$ id: num 1 2 3
$ a : num 10 11.2 12.3
$ b : num 10.1 11.9 12.9
Question
When subsetting the first row, the .0 decimal part from the 10.0 in column a gets dropped
> df[1,]
id a b
1 1 10 10.1
> str(df[1,])
'data.frame': 1 obs. of 3 variables:
$ id: num 1
$ a : num 10
$ b : num 10.1
I 'assume' this is intentional, but how do I subset the first row so that it keeps the .0 part?
Notes
Subsetting two rows keeps the .0
> df[1:2,]
id a b
1 1 10.0 10.1
2 2 11.2 11.9
I assume you understand this is a matter of how the number is printed, and not about how the value is stored by R. Anyway, you can use format to ensure the digits will be printed:
> format(df[1,], nsmall = 1)
id a b
1 1.0 10.0 10.1
> format(df[1,], nsmall = 2)
id a b
1 1.00 10.00 10.10
The reason for this behavior is not about the number of rows being printed. R will try to display the minimum number of decimals possible. But all numbers in a column will have the same number of digits to improve the display:
> df2 <- data.frame(a=c(1.00001, 1), b=1:2)
> df2
a b
1 1.00001 1
2 1.00000 2
Now if I print only the row with the non-integer number:
> df2[1,]
a b
1 1.00001 1
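To confirm that only the display changes, one can compare the stored value with its formatted forms. A small sketch using the df from the question:

```r
df <- data.frame(id = c(1, 2, 3), a = c(10.0, 11.2, 12.3), b = c(10.1, 11.9, 12.9))
row1 <- df[1, ]

# The stored value is still the number 10; nothing was lost by subsetting
row1$a == 10

# Character representations with a fixed number of decimals
a_fmt <- format(row1$a, nsmall = 1)
a_spr <- sprintf("%.1f", row1$a)
c(a_fmt, a_spr)
```

Both format() and sprintf() return character strings, so they are best applied at the point of printing or export rather than stored in place of the numeric column.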
