I have a very large dataframe (around 100 rows, 200 columns). A subset of my data looks like this:
example <- data.frame("Station" = c("012", "013", "014"), "Value1" = c(145.23453, 1.022342, 0.4432),
"Value2" = c(2.1221213, 4445.2231412, 0.3333421), "Name" = c("ABC", "SDS", "EFG"))
I would like to round all numeric variables in my table according to these conditions:
if x<1, then 1 sig fig
if 1<= x < 99, then 2 sig figs
if x>= 100, then 3 sig figs
I know how to do something like this for a specific column:
example$Value1 <- ifelse(example$Value1 < 1, signif(example$Value1, 1), example$Value1)
but I'm not sure what to do for a large dataframe with a mix of numeric and character values.
Just put the ifelse into an lapply. To identify the numeric columns, use is.numeric in an sapply. You could also Vectorize a small replacement function (FUN) with all your desired conditions to use in the lapply, which might be convenient. However, note #GKi's comment that your conditions are not complete: values in [99, 100) are not covered by any branch.
nums <- sapply(example, is.numeric)
FUN <- Vectorize(function(x) {
  if (x < 1) x <- signif(x, 1)
  if (1 <= x & x < 99) x <- signif(x, 2)
  if (x >= 100) x <- signif(x, 3)
  x
})
example[nums] <- lapply(example[nums], FUN)
# Station Value1 Value2 Name
# 1 012 145.0 2.1 ABC
# 2 013 1.0 4450.0 SDS
# 3 014 0.4 0.3 EFG
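For example (99.5 is my own illustrative value, not taken from the question), a value in [99, 100) falls through all three branches of FUN and comes back unchanged:
FUN(99.5)
# [1] 99.5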
A tidyverse approach: pivot the Value columns to long format, pick the number of significant figures with case_when, round with signif, then pivot back to wide.
library(dplyr)
library(tidyr)
library(purrr)
example %>%
  pivot_longer(contains("Value")) %>%
  mutate(
    signf = case_when(
      value < 1 ~ 1,
      value >= 1 & value < 99 ~ 2,
      TRUE ~ 3
    ),
    value = map2_dbl(value, signf, ~ signif(.x, .y))
  ) %>%
  select(-signf) %>%
  pivot_wider(names_from = "name", values_from = "value")
Output:
# A tibble: 3 x 4
Station Name Value1 Value2
<fct> <fct> <dbl> <dbl>
1 012 ABC 145 2.1
2 013 SDS 1 4450
3 014 EFG 0.4 0.3
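If the real table has many numeric columns rather than just these two, the same pipeline works by selecting columns on type instead of on name; a sketch, assuming a reasonably current tidyr/tidyselect where where() is accepted by pivot_longer():
example %>%
  pivot_longer(where(is.numeric)) %>%   # instead of contains("Value")
  mutate(signf = case_when(value < 1 ~ 1, value < 99 ~ 2, TRUE ~ 3),
         value = map2_dbl(value, signf, ~ signif(.x, .y))) %>%
  select(-signf) %>%
  pivot_wider(names_from = "name", values_from = "value")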
I'll give the answer using data.table instead of data.frame because it's better and I don't remember data.frame syntax that well anymore.
library(data.table)
example = data.table(
Station = c("012", "013", "014"),
Value1 = c(145.23453, 1.022342, 0.4432),
Value2 = c(2.1221213, 4445.2231412, 0.3333421),
Name = c("ABC", "SDS", "EFG"))
numeric_colnames = names(example)[sapply(example, is.numeric)]
for (x in numeric_colnames) {
  example[, (x) := ifelse(
    get(x) < 1,
    signif(get(x), 1),
    ifelse(
      get(x) < 99,
      signif(get(x), 2),
      signif(get(x), 3)
    ))]
}
Result:
Station Value1 Value2 Name
1: 012 145.0 2.1 ABC
2: 013 1.0 4450.0 SDS
3: 014 0.4 0.3 EFG
PS: Don't worry about the 145.0 and 4450.0; that's a display issue, not a data issue:
> example[,as.character(Value1)]
[1] "145" "1" "0.4"
> example[,as.character(Value2)]
[1] "2.1" "4450" "0.3"
PPS: the 99 cutoff produces some strange results, e.g.,
> signif(98.9,2)
[1] 99
> signif(99.1,3)
[1] 99.1
Why not use a cutoff of 100 instead?
> signif(99.4,2)
[1] 99
> signif(99.5,2)
[1] 100
> signif(100.1,3)
[1] 100
Use apply and nested ifelse:
If you do not know in advance which columns are numeric and you want to keep the original dataframe:
example[sapply(example, is.numeric)] <- apply(example[sapply(example, is.numeric)], 2,
  function(x) ifelse(x < 1, signif(x, 1),
                     ifelse(x >= 1 & x < 99, signif(x, 2), signif(x, 3))))
example
Station Value1 Value2 Name
1 012 145.0 2.1 ABC
2 013 1.0 4450.0 SDS
3 014 0.4 0.3 EFG
You can use findInterval to set the number of significant digits for signif:
i <- sapply(example, is.numeric)
x <- unlist(example[,i])
example[,i] <- signif(x, findInterval(x, c(1, 99))+1)
example
# Station Value1 Value2 Name
#1 012 145.0 2.1 ABC
#2 013 1.0 4450.0 SDS
#3 014 0.4 0.3 EFG
findInterval result for the example from #webb's comment (thanks!):
findInterval(c(145.23453, 1.022342, 0.4432, 2.1221213, 4445.2231412, 0.3333421), c(1, 99))
#[1] 2 1 0 1 2 0
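This works in a single call because signif() recycles its digits argument along x, so each value gets its own number of significant figures; a quick illustration on the first three values:
signif(c(145.23453, 1.022342, 0.4432), c(3, 2, 1))
# [1] 145.0   1.0   0.4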
x1<- split(df, cumsum(df$Date < 1900-01-01))
x2<- split(df, cumsum(df$Date >= 1945-01-01 & df$Date <= 1955-01-01))
x3<- split(df, cumsum(df$Date > 2000-01-01))
I am trying to split the data frame based on the conditions above. However, this only works for the first condition, not for the other two.
Here is what the data frame df looks like
Date T_min T_max
<chr> <dbl> <dbl>
1 1878-01-01 6.4 22.5
2 1878-01-02 8.2 23.4
3 1878-01-03 8.8 25
4 1878-01-04 8 24.5
5 1878-01-05 7.8 22.4
6 1878-01-06 7.9 20.8
7 1878-01-07 7 21.5
8 1878-01-08 7.7 21.4
9 1878-01-09 10 18.5
10 1878-01-10 7.3 19.3
Let's take this sample:
df <- data.frame(V1 = c("1800-01-01","1950-01-01","2005-01-01"))
df <- df %>%
mutate(V1 = as.Date(V1))
V1
1 1800-01-01
2 1950-01-01
3 2005-01-01
Code:
library(tidyverse)
df <- df %>%
  mutate(indic = case_when(
    V1 < "1900-01-01" ~ 1,
    V1 >= "1945-01-01" & V1 <= "1955-01-01" ~ 2,
    V1 > "2000-01-01" ~ 3,
    TRUE ~ 4
  ))
list_of_df <- split(df, df$indic)
i <- 1
MyF <- function(input){
  out <- as.data.frame(input)
  out <- out %>% select(-indic)
  nom <- paste0("dfE", i)
  assign(nom, out, envir = .GlobalEnv)
  i <<- i + 1
}
lapply(list_of_df, MyF)
This creates the dataframes dfE1, dfE2 and dfE3 in your environment, based on the date conditions you specified.
> dfE1
V1
1 1800-01-01
> dfE2
V1
2 1950-01-01
> dfE3
V1
3 2005-01-01
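As an aside (my own note, not from the original answer): part of why the original attempt misbehaves is that the unquoted dates are evaluated as arithmetic before any comparison happens, so the conditions never refer to actual dates:
1900-01-01
# [1] 1898
# i.e. df$Date < 1900-01-01 compares against the number 1898, not the date.
# Quoting the dates, as in the case_when above, or wrapping them in as.Date() avoids this.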
I'm trying to divide a long-formatted dataframe by a factor (e.g. for each subject) and then put the factor (subject) before the data of each one as a label. The simplified dataframe looks like this: columns X and Y are numeric, and column Subject is a factor. The real dataset actually has hundreds of subjects.
X <- c(1,1,2,2)
Y <- c(0.2, 0.3, 1, 0.5)
Subject <- as.factor(c("A", "A", "B", "B"))
M <- tibble(X,Y,Subject)
> M
# A tibble: 4 x 3
X Y Subject
<dbl> <dbl> <fct>
1 1 0.2 A
2 1 0.3 A
3 2 1 B
4 2 0.5 B
The resulting dataframe should look like this:
> M_trans
A
1 0.2
1 0.3
B
2 1
2 0.5
Thank you for your help!
I tried this code and it produces the output below. I couldn't find a way to introduce the factor labels directly, since everything in R works in vector format. If you find a better solution, please post it for us.
library(dplyr)
library(tibble)
X <- c(1,1,2,2,3,3)
Y <- c(0.2, 0.3, 1, 0.5,0.2,0.9)
Subject <- as.factor(c("A", "A", "B", "B","C","C"))
M <- tibble(X,Y,Subject)
unq_subjects <- unique(Subject)
final <- data.frame()
for (i in seq_along(unq_subjects)) {
  sub <- unq_subjects[i]
  tmp <- as.data.frame(M %>%
    filter(Subject == sub) %>%
    select(-Subject) %>%
    mutate(X = as.character(X)) %>%   # X must be character so the label row can be added
    add_row(X = as.character(sub), Y = NA) %>%
    arrange(desc(X)))
  final <- union_all(tmp, final)
}
final
Output:
X Y
1 C NA
2 3 0.2
3 3 0.9
4 B NA
5 2 1.0
6 2 0.5
7 A NA
8 1 0.2
9 1 0.3
Does it answer your question now?
Using dplyr and tidyr
library(dplyr)
library(tidyr)
M %>%
group_by(Subject) %>%
nest()
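For reference, the nested result looks roughly like this (a sketch of the printed tibble; exact formatting depends on your tidyr/tibble versions):
# A tibble: 2 x 2
#   Subject data
#   <fct>   <list>
# 1 A       <tibble [2 x 2]>
# 2 B       <tibble [2 x 2]>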
Hope this helps!
Here is an inelegant solution that worked for me, inspired by Bertil Baron's answer. I would be happy to see any simpler code...
trans_output <- function(M){
  M1 <- M %>%
    group_by(Subject) %>%   # the column is "Subject", not "subject"
    nest()
  df <- NULL
  for (i in seq_len(nrow(M1)))   # one iteration per subject rather than a hard-coded 1:2
  {
    output2 <- M1$data[[i]]
    df_sub <- rbind(as.character(M1$Subject[[i]]), # subject ID as a label row
                    output2)                       # output data
    idx <- c(1L)
    df_sub <- df_sub %>%
      mutate(Y = ifelse(row_number() %in% idx, NA, Y)) %>% # blank out Y in the label row
      transmute(X = X,
                Y = as.numeric(Y))
    df <- rbind(df, df_sub)
    rm(df_sub)
  }
  return(df)
}
M_trans <- trans_output(M)
The output looks like this:
> M_trans
# A tibble: 6 x 2
X Y
<chr> <dbl>
1 A NA
2 1 0.2
3 1 0.3
4 B NA
5 2 1
6 2 0.5
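Since the answer above asks for simpler code, here is one possible cleaner sketch of the same idea using dplyr and purrr (my own suggestion, not from the thread); group_split() and map_dfr() replace the explicit loop and rbind:
library(dplyr)
library(purrr)
library(tibble)
M_trans <- M %>%
  group_split(Subject) %>%
  map_dfr(function(d) {
    bind_rows(
      tibble(X = as.character(d$Subject[1]), Y = NA_real_),   # label row
      d %>% transmute(X = as.character(X), Y = Y)              # data rows
    )
  })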
Given a dataframe such as,
num <- c(5,10,15,20,25)
letter <- c("A", "B", "A", "C", "B")
thelist <- data.frame(num, letter)
I need to find the indices where the letters are the same.
Output:
A 1 3
B 2 5
C 4
Then, take these indices and find the mean of those indices in num.
Output:
A 10
B 17.5
C 20
I cannot use loops or if statements; I am looking at using functions such as apply, which, etc.
As the objective is to find the mean for each 'letter', it is better to group by 'letter' and get the mean of 'num':
library(dplyr)
thelist %>%
group_by(letter) %>%
summarise(num = mean(num))
# A tibble: 3 x 2
# letter num
# <fct> <dbl>
#1 A 10
#2 B 17.5
#3 C 20
or in base R
aggregate(num ~ letter, thelist, mean)
To find the indices of the same 'letter', we can split the sequence of rows by 'letter':
split(seq_len(nrow(thelist)), thelist$letter)
#$A
#[1] 1 3
#$B
#[1] 2 5
#$C
#[1] 4
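If you also want to go from those index groups straight to the means without any loop, you can split num itself and sapply over the pieces (a small extra sketch, not part of the original answer):
sapply(split(thelist$num, thelist$letter), mean)
#    A    B    C
# 10.0 17.5 20.0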
Another option using data.table:
library(data.table)
setDT(thelist)[, .(ind = paste(.I, collapse = " "),
                   mean_num = mean(num)),
               by = letter]
Output:
letter ind mean_num
1: A 1 3 10.0
2: B 2 5 17.5
3: C 4 20.0
I'd use dplyr/tidyverse for this:
# setup
library(tidyverse)
# group by letters then get mean of num
thelist %>%
group_by(letter) %>%
summarise(mean_num = mean(num))
You could also use base R with a for loop:
lets <- unique(thelist$letter)
x <- rep(NA, length(lets))
for (i in seq_along(lets)) {
  x[i] <- mean(thelist$num[thelist$letter %in% lets[i]])
}
x
I have a function that counts the zeros in each column of a large dataframe. Now I want to count zeros in each column after grouping by a category.
Here is the example:
zero_rate <- function(df) {
  z_rate_list <- sapply(df, function(x) {
    data.frame(
      n_zero = length(which(x == 0)),
      n = length(x),
      z_rate = length(which(x == 0)) / length(x))
  })
  d <- data.frame(z_rate_list)
  d <- sapply(d, unlist)
  d <- as.data.frame(d)
  return(d)
}
df = data.frame(var1 = c(1,0,NA,4,NA,6,7,0,0,10), var2 = c(11,NA,NA,0,NA,16,0,NA,19,NA))
df1 = data.frame(cat = c(1,1,1,1,1,2,2,2,2,2), df)
library(dplyr)
zero_rate_df = df1 %>% group_by(cat) %>% do(zero_rate(.))
Here zero_rate(df) works just as I expected. But when I group the data by cat and calculate in each category the zero_rate for each column, the result is not as I expected.
I expect something like this:
cat var1 var2
1 n_zero 1 1
n 5 5
z_rate 0.2 0.2
2 n_zero 2 1
n 5 5
z_rate 0.4 0.2
Any suggestion? Thank you.
I came up with the following code; .[-1] is used to remove the grouping column:
zero_rate <- function(df){
  res <- lapply(df, function(x){
    y <- c(sum(x == 0, na.rm = T), length(x))
    c(y, y[1]/y[2])
  })
  res <- do.call(cbind.data.frame, res)
  res$vars <- c('n_zero', 'n', 'z_rate')
  res
}
df1 %>% group_by(cat) %>% do( zero_rate(.[-1]))
# cat var1 var2 vars
# <dbl> <dbl> <dbl> <chr>
# 1 1 1.0 1.0 n_zero
# 2 1 5.0 5.0 n
# 3 1 0.2 0.2 z_rate
# 4 2 2.0 1.0 n_zero
# 5 2 5.0 5.0 n
# 6 2 0.4 0.2 z_rate
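With a current dplyr (>= 1.0) you could get the same numbers without the helper function, at the cost of a wide layout with one row per cat; a sketch under that assumption:
library(dplyr)
df1 %>%
  group_by(cat) %>%
  summarise(across(everything(),
                   list(n_zero = ~ sum(.x == 0, na.rm = TRUE),
                        n      = ~ length(.x),
                        z_rate = ~ sum(.x == 0, na.rm = TRUE) / length(.x))))
# one row per cat, with columns var1_n_zero, var1_n, var1_z_rate, var2_n_zero, ...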
I have a large data table with over 300 columns. For each letter column I would like to get:
-- the sum of (each observation in the column * the weight of that observation);
-- the sum of the weights where the observation in the letter column is greater than 0.
Here is an example for one column.
id <- c("0001", "0002", "0003", "0004")
a <- c(0, 9, 8, 5)
b <- c(0,5,5,0)
c <- c(1.5, 0.55, 0, 0.06)
weight <- c(102.354, 34.998, 84.664, .657)
data <- data.frame(id, a, b, c, weight)
data
id a b c weight
1 0001 0 0 1.50 102.354
2 0002 9 5 0.55 34.998
3 0003 8 5 0.00 84.664
4 0004 5 0 0.06 0.657
sum(data$a * data$weight)
[1] 995.579
sum(data$weight[data$a >0])
[1] 120.319
Any idea?
A possible data.table solution:
You could define a helper function:
tempfunc <- function(x) c(sum(x * data$weight), sum(data$weight[x > 0]))
Then do either
library(data.table)
setDT(data)[, lapply(.SD, tempfunc), .SDcols = -c("id", "weight")]
# a b c
# 1: 995.579 598.310 172.8193
# 2: 120.319 119.662 138.0090
Or
library(dplyr)
setDT(data) %>% summarise_each(funs(tempfunc), -c(id, weight))
## a b c
## 1: 995.579 598.310 172.8193
## 2: 120.319 119.662 138.0090
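Note that summarise_each()/funs() in the dplyr variant have since been deprecated; a rough modern equivalent (my own sketch, assuming dplyr >= 1.1, where multi-row summaries go through reframe()) would be:
library(dplyr)
data %>%
  reframe(across(-c(id, weight), tempfunc))
# row 1: sum(column * weight); row 2: sum(weight[column > 0])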
The following code should solve your question:
my.names <- names(data)[names(data) %in% letters]
res <- lapply(my.names, function(x){
  c(sum(data[[x]] * data[["weight"]]), sum(data[["weight"]][data[[x]] > 0]))
})
names(res) <- my.names
or directly to data.frame:
do.call("rbind", lapply(my.names, function(letter){
  data.frame(letter,
             "sum1_name" = sum(data[[letter]] * data[["weight"]]),
             "sum2_name" = sum(data[["weight"]][data[[letter]] > 0]))
}))
# letter sum1_name sum2_name
# 1 a 995.5790 120.319
# 2 b 598.3100 119.662
# 3 c 172.8193 138.009