Data
Given a data frame
df <- data.frame("id"=c(1,2,3), "a"=c(10.0, 11.2, 12.3),"b"=c(10.1, 11.9, 12.9))
> df
id a b
1 1 10.0 10.1
2 2 11.2 11.9
3 3 12.3 12.9
> str(df)
'data.frame': 3 obs. of 3 variables:
$ id: num 1 2 3
$ a : num 10 11.2 12.3
$ b : num 10.1 11.9 12.9
Question
When subsetting the first row, the .0 decimal part of the 10.0 in column a gets dropped:
> df[1,]
id a b
1 1 10 10.1
> str(df[1,])
'data.frame': 1 obs. of 3 variables:
$ id: num 1
$ a : num 10
$ b : num 10.1
I 'assume' this is intentional, but how do I subset the first row so that it keeps the .0 part?
Notes
Subsetting two rows keeps the .0
> df[1:2,]
id a b
1 1 10.0 10.1
2 2 11.2 11.9
I assume you understand this is a matter of how the number is printed, not of how the value is stored by R. In any case, you can use format to ensure the digits are printed:
> format(df[1,], nsmall = 1)
id a b
1 1.0 10.0 10.1
> format(df[1,], nsmall = 2)
id a b
1 1.00 10.00 10.10
The reason for this behavior is not the number of rows being printed. R tries to display the minimum number of decimals possible, but all numbers in a column are printed with the same number of digits so that the column lines up:
> df2 <- data.frame(a=c(1.00001, 1), b=1:2)
> df2
a b
1 1.00001 1
2 1.00000 2
Now if I print only the row with the non-integer number:
> df2[1,]
a b
1 1.00001 1
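If the goal is fixed decimals in a single column rather than the whole frame, another option is to format that column directly. A minimal sketch (note that sprintf() returns character, so this is for display only):
out <- df[1, ]
out$a <- sprintf("%.1f", out$a)  # force exactly one decimal place
out
#   id    a    b
# 1  1 10.0 10.1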
Related
I have the following data set:
Age<-c(2,2.1,2.2,3.4,3.5,4.2,4.7,4.8,5,5.6,NA, 5.9, NA)
R<-c(2,2.1,2.2,3.4,3.5,4.2,4.7,4.8,5,5.6,NA, 5.9, NA)
sex<-c(1,0,1,1,1,1,1,0,0,0,NA, 0,1)
df1<-data.frame(Age,R,sex)
# Second dataset:
Age2<-seq(2,20,0.25)
Mspline<-rnorm(73)
df2.F<-data.frame(Age2, Mspline)
# Third data
Age2<-seq(2,20,0.25)
Mspline<-rnorm(73)
df2.M<-data.frame(Age2, Mspline)
I was wondering how I can include gender in the calculation and combine these two algorithms into a loop function. What I need is:
If sex=1 then use the following function to calculate Time
last = dim(df2.F)[1]
fM.F<-approxfun(df2.F$Age2, df2.F$Mspline, yleft = df2.F$Mspline[1] , yright = df2.F$Mspline[last])
df1$Time<-fM.F(df1$Age)
and If sex=0 then use this function to calculate Time
last = dim(df2.M)[1]
fM.M<-approxfun(df2.M$Age2, df2.M$Mspline, yleft = df2.M$Mspline[1] , yright = df2.M$Mspline[last])
df1$Time<-fM.M(df1$Age)
In other words: read the first record in df1; if it is female (say age = 4.1), then Time = fM.F(4.1), but if the gender is male, then apply fM.M to its age, so Time = fM.M(4.1).
You can create a function that takes the Age vector, the sex value, and the male and female specific dataframes, and selects the frame to use based on the sex value.
f <- function(age, s, m, f) {
  if (is.na(s)) return(NA)
  if (s == 0) df = m
  else df = f
  last = dim(df)[1]
  fM <- approxfun(df$Age2, df$Mspline, yleft = df$Mspline[1], yright = df$Mspline[last])
  fM(age)
}
Now, just apply the function by group, using pull(cur_group(),sex) to get the sex value for the current group.
library(dplyr)
df1 %>%
  group_by(sex) %>%
  mutate(time = f(Age, pull(cur_group(), sex), df2.M, df2.F))
Output:
Age R sex time
<dbl> <dbl> <dbl> <dbl>
1 2 2 1 -0.186
2 2.1 2.1 0 1.02
3 2.2 2.2 1 -1.55
4 3.4 3.4 1 -0.461
5 3.5 3.5 1 0.342
6 4.2 4.2 1 -0.560
7 4.7 4.7 1 -0.114
8 4.8 4.8 0 0.247
9 5 5 0 -0.510
10 5.6 5.6 0 -0.982
11 NA NA NA NA
12 5.9 5.9 0 -0.231
13 NA NA 1 NA
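For what it's worth, a base R sketch without grouping would build both interpolation functions up front, as in the question (nrow() in place of dim()[1]), and pick per row with ifelse(); an NA in sex or Age simply propagates to Time:
fM.F <- approxfun(df2.F$Age2, df2.F$Mspline,
                  yleft = df2.F$Mspline[1], yright = df2.F$Mspline[nrow(df2.F)])
fM.M <- approxfun(df2.M$Age2, df2.M$Mspline,
                  yleft = df2.M$Mspline[1], yright = df2.M$Mspline[nrow(df2.M)])
df1$Time <- ifelse(df1$sex == 1, fM.F(df1$Age), fM.M(df1$Age))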
I need to compute weighted Mann-Whitney U test results a few hundred times. Each iteration is a two-sample test for differences between two groups. I can't figure out how to get the existing function to handle missing values without dynamically deleting cases.
The data for a few of the comparisons are here, in a data frame I call dat. All variables with numbers in this sheet are numeric in type.
Here's how I call the sjstats::mannwhitney() function:
mannwhitney(dat, measure1, group)
When I do so, I get the following error:
Error in `[[<-.data.frame`(`*tmp*`, "grp1.label", value = character(0)) :
replacement has 0 rows, data has 1
I suspect this is because of the missing value in the 212th observation of measure1. But wrapping the vector names in na.omit() or !is.na() doesn't address the problem, perhaps because doing so still results in a data frame where the number of non-NA values of group is greater than the number of non-NA values of measure1.
Any thoughts on how I could incorporate dynamic NA handling into the function call?
I am not sure what class your group column is, but if I do it like this:
library(sjstats)
dat = read.csv("question - Sheet1.csv")
str(dat)
'data.frame': 301 obs. of 5 variables:
$ measure1 : num 2 1.6 2.2 2.7 1.8 1.8 4 4 3.9 -3.7 ...
$ measure2 : num 0.9 0.1 0 0.4 -1 -1.3 2.1 0 -1.1 -3.9 ...
$ measure3 : num 1.1 1.1 2.2 1.2 1.9 1.2 0 3 1.9 -3.8 ...
$ measurre4: num 2 2 2 3 3 2 3 4 3 2.36 ...
$ group : int 0 0 0 0 0 0 0 0 0 0 ...
I get:
mannwhitney(dat, measure1, group)
Error in `[[<-.data.frame`(`*tmp*`, "grp1.label", value = character(0)) :
replacement has 0 rows, data has 1
Convert your group to a factor:
dat$group = factor(dat$group)
mannwhitney(dat, measure1, group)
# Mann-Whitney-U-Test
Groups 1 = 0 (n = 110) | 2 = 1 (n = 190):
U = 16913.000, W = 10808.000, p = 0.621, Z = 0.495
effect-size r = 0.029
rank-mean(1) = 153.75
rank-mean(2) = 148.62
Reading the code, the bug comes from this:
labels <- sjlabelled::get_labels(grp, attr.only = F, values = NULL,
                                 non.labelled = T)
If your group is numeric, it doesn't have attributes and hence you get no labels:
sjlabelled::get_labels(0:1)
NULL
sjlabelled::get_labels(factor(0:1))
[1] "0" "1"
Here is a toy data frame
df <- data.frame(alpha = c(rep(.005, 5)),
                 a1 = c(1:5),
                 b1 = c(4:8),
                 c1 = c(10:14),
                 a2 = c(9:13),
                 b2 = c(3:7),
                 c2 = c(15:19))
Here is a nonsensical toy function that requires two variables, both of which must have the same letter prefix. The specific function calculation is not important. Rather, the issue is how to pass two or more separate named variables to the function from the data frame where the order of the arguments matters.
toy_function <- function(x, y) {
  z = x + y
  w = x / y
  v = z + w
  return(v)
}
Manual calculation of new variables using the function would look like this. Not practical when you've got dozens or hundreds of variable pairs.
df2 <- df %>%
mutate(va = toy_function(a1,a2),
vb = toy_function(b1,b2),
vc = toy_function(c1,c2)
)
How can I do this across all matching pairs of variables? This problem seems similar to How to use map from purrr with dplyr::mutate to create multiple new columns based on column pairs, but that example applied a simple mathematical function (e.g., +) for which argument order does not matter. I'm having trouble figuring out how to modify it for this case.
Here is one base R approach using split.default.
cbind(df, sapply(split.default(df[-1],
                               sub('\\d+', '', names(df)[-1])),
                 function(x) toy_function(x[[1]], x[[2]])))
# alpha a1 b1 c1 a2 b2 c2 a b c
#1 0.005 1 4 10 9 3 15 10.1 8.33 25.7
#2 0.005 2 5 11 10 4 16 12.2 10.25 27.7
#3 0.005 3 6 12 11 5 17 14.3 12.20 29.7
#4 0.005 4 7 13 12 6 18 16.3 14.17 31.7
#5 0.005 5 8 14 13 7 19 18.4 16.14 33.7
We ignore the first column ([-1]) since we don't want it in the calculation, then group the similarly named columns and split them into a list. Using sapply, we apply toy_function to each element of the list.
sub is used to remove the numbers from the names and create groups to split on.
sub('\\d+', '', names(df)[-1])
#[1] "a" "b" "c" "a" "b" "c"
If you wish to use a tidyverse approach you could do:
library(dplyr)
library(purrr)
unique_names <- unique(sub('\\d+', '', names(df)[-1]))
map_dfc(unique_names, ~ df[-1] %>%
          select(matches(.x)) %>%
          mutate(!!paste0('v', .x) := toy_function(.[[1]], .[[2]])))
# a1 a2 va b1 b2 vb c1 c2 vc
#1 1 9 10.1 4 3 8.33 10 15 25.7
#2 2 10 12.2 5 4 10.25 11 16 27.7
#3 3 11 14.3 6 5 12.20 12 17 29.7
#4 4 12 16.3 7 6 14.17 13 18 31.7
#5 5 13 18.4 8 7 16.14 14 19 33.7
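With dplyr >= 1.0 you could also skip the map step entirely. This is only a sketch of one possible pattern (it assumes the glue expression in .names and the get() lookup of the partner column both resolve inside the data mask):
library(dplyr)
df %>%
  mutate(across(ends_with("1"),
                ~ toy_function(.x, get(sub("1$", "2", cur_column()))),
                .names = "v{sub('1$', '', .col)}"))
# adds va, vb, vc with the same values as above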
You can do something like this:
First, create a dataframe with the function arguments as columns and the values to be used for each function call as rows.
vars <- letters[1:3]
args <- tibble(
  arg1 = setNames(paste0(vars, 1), paste0("set_output_names_like_this_", vars)),
  arg2 = paste0(vars, 2)
)
> str(args)
tibble [3 x 2] (S3: tbl_df/tbl/data.frame)
$ arg1: Named chr [1:3] "a1" "b1" "c1"
..- attr(*, "names")= chr [1:3] "set_output_names_like_this_a" "set_output_names_like_this_b" "set_output_names_like_this_c"
$ arg2: chr [1:3] "a2" "b2" "c2"
Then, use pmap_dfc
df %>% mutate(pmap_dfc(args, function(arg1, arg2, d) toy_function(d[[arg1]], d[[arg2]]), .data))
Output
alpha a1 b1 c1 a2 b2 c2 set_output_names_like_this_a set_output_names_like_this_b set_output_names_like_this_c
1 0.005 1 4 10 9 3 15 10.11111 8.333333 25.66667
2 0.005 2 5 11 10 4 16 12.20000 10.250000 27.68750
3 0.005 3 6 12 11 5 17 14.27273 12.200000 29.70588
4 0.005 4 7 13 12 6 18 16.33333 14.166667 31.72222
5 0.005 5 8 14 13 7 19 18.38462 16.142857 33.73684
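Note the design here: pmap_dfc() iterates over the rows of args, and the extra .data argument after the anonymous function is passed through to each call as d, so every invocation runs toy_function() on one pair of columns of the full frame. The output columns take their names from the names attached to arg1, which is what the setNames() call in the args definition controls.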
I have a dataframe (dcc) loaded in R which I have narrowed down to complete cases.
str(dcc)
'data.frame': 41715 obs. of 9 variables:
$ XCoord : num 661382 661412 661442 661472 661502 ...
$ YCoord : num 648092 648092 648092 648092 648092 ...
$ OBJECTID : int 1 2 3 4 5 6 7 8 9 10 ...
$ POINTID : int 1 2 3 4 5 6 7 8 9 10 ...
$ GRID_CODE : int 0 0 0 0 0 0 0 0 0 0 ...
$ APPL_COST_DIST_RIV_COAST: num 21350 21674 22185 22748 23448 ...
$ APPL_DEM30 : int 785 793 792 769 765 777 784 789 781 751 ...
$ APPL_DEM30_SLOPE : num 19.7 13.3 18.6 23.2 21 ...
$ APPL_SITE_NONSITE : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
I want to standardize the numeric and integer variables by subtracting the mean and dividing by the standard deviation. When I apply the following code, I inadvertently drop the factor variable APPL_SITE_NONSITE from the dataframe:
ind <- sapply(dcc, is.numeric)
dcc.s<-sapply(dcc[,ind], function(x) (x-mean(x))/sd(x))
dcc.s<-data.frame(dcc.s)
If I'm not mistaken, that happens because ind=FALSE for that variable. It seems like I need some combination of a for loop and if/else statement to standardize the numeric variables and leave the factor variable alone. I have tried a number of permutations, but keep getting errors. For example, the following code:
dcc.s <- for (i in 1:ncol(dcc)){ sapply(dcc[,i],
if (is.numeric(dcc[,i])==TRUE) {
function(x) (x-mean(x))/sd(x) }
else {dcc[,i]})
}
returns the error:
Error in match.fun(FUN) :
c("'if (is.numeric(dcc[, i]) == TRUE) {' is not a function, character or symbol", "' function(x) (x - mean(x))/sd(x)' is not a function, character or symbol", "'} else {' is not a function, character or symbol", "' dcc[, i]' is not a function, character or symbol", "'}' is not a function, character or symbol")
Perhaps this is a simple formatting error or misplaced bracket, but I'm thoroughly stuck. I am open to other approaches if there is a more elegant way to do this. Any help would be much appreciated.
You need to use rapply instead of sapply
set.seed(1)
> df=data.frame(A=rnorm(10),b=1:10,C=as.factor(rep(1:2,5)))
> str(df)
'data.frame': 10 obs. of 3 variables:
$ A: num -0.626 0.184 -0.836 1.595 0.33 ...
$ b: int 1 2 3 4 5 6 7 8 9 10
$ C: Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2
The code you need to use:
> D=rapply(df,scale,c("numeric","integer"),how="replace")
> D
A b C
1 -0.97190653 -1.4863011 1
2 0.06589991 -1.1560120 2
3 -1.23987805 -0.8257228 1
4 1.87433300 -0.4954337 2
5 0.25276523 -0.1651446 1
6 -1.22045645 0.1651446 2
7 0.45507643 0.4954337 1
8 0.77649606 0.8257228 2
9 0.56826358 1.1560120 1
10 -0.56059319 1.4863011 2
> str(D)
'data.frame': 10 obs. of 3 variables:
$ A: num [1:10, 1] -0.9719 0.0659 -1.2399 1.8743 0.2528 ...
..- attr(*, "scaled:center")= num 0.132
..- attr(*, "scaled:scale")= num 0.781
$ b: num [1:10, 1] -1.486 -1.156 -0.826 -0.495 -0.165 ...
..- attr(*, "scaled:center")= num 5.5
..- attr(*, "scaled:scale")= num 3.03
$ C: Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2
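Note that scale() returns a one-column matrix, which is why str(D) shows num [1:10, 1] together with scaled:center and scaled:scale attributes on each column. If you would rather end up with plain numeric vectors, one sketch is to wrap scale() in an anonymous function:
D2 <- rapply(df, function(x) as.numeric(scale(x)),
             c("numeric", "integer"), how = "replace")
str(D2)
# 'data.frame': 10 obs. of 3 variables:
#  $ A: num -0.9719 0.0659 -1.2399 1.8743 0.2528 ...
#  $ b: num -1.486 -1.156 -0.826 -0.495 -0.165 ...
#  $ C: Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2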
Here is a dplyr and scale solution.
For dplyr < 1.0.0
require(dplyr)
df %>% mutate_if(is.numeric, scale)
# a runif(20) rnorm(20)
#1 y 0.5783877 -0.004177104
#2 n -0.2344854 -0.866626472
#3 m 1.5629961 1.526857969
#4 h 0.9648646 -1.557975547
#5 u -0.7212756 0.533400304
#6 u 1.4753675 -0.072289864
#7 b 0.5346870 -0.464299111
#8 l -0.4287559 0.426600473
#9 m -1.2050841 -0.880135405
#10 h -0.6150410 -0.040636433
#11 r 1.3768249 -0.719785950
#12 a -1.3929511 0.083010969
#13 a -0.4422665 0.385574213
#14 l -0.7719473 -0.934716525
#15 m 1.4483803 0.131974911
#16 k 0.6291919 2.598581195
#17 k -1.0356817 -1.018890381
#18 s -1.0960083 1.560216350
#19 y -0.8826702 -0.367821579
#20 v 0.2554671 -0.318862011
For dplyr >= 1.0.0
df %>% mutate(across(where(is.numeric), scale))
Note that scale(x) will do the same as (x - mean(x)) / sd(x); if you want to scale based on different metrics (e.g. a robust/modified Z score based on the median and MAD) you can do that using sweep.
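For instance, a minimal sketch of a robust Z score with sweep(), using column medians and MADs in place of means and standard deviations (this reuses the sample df defined below; any data frame with numeric columns works):
num <- sapply(df, is.numeric)
centered <- sweep(df[num], 2, apply(df[num], 2, median), "-")  # subtract column medians
df[num] <- sweep(centered, 2, apply(df[num], 2, mad), "/")     # divide by column MADs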
Sample data
set.seed(2017);
df <- cbind.data.frame(a = factor(sample(letters, 20, replace = T)), runif(20), rnorm(20));
Another base R option: scale just the numeric columns with lapply and bind the scaled copies onto the original frame (the factor column is left untouched):
ind <- sapply(dcc, is.numeric)
dcc.s <- as.data.frame(lapply(dcc[,ind], function(x) (x-mean(x))/sd(x)))
dcc.s <- cbind(dcc, dcc.s)
If you don't need the "old" dataframe you can also do
ind <- sapply(dcc, is.numeric)
dcc[,ind] <- vapply(dcc[,ind], function(x) (x-mean(x))/sd(x), numeric(nrow(dcc)))
I have recently transitioned from Stata + Excel to R, so I would appreciate it if someone could help me write efficient code. I have tried my best to research the answer before posting on SO.
Here's what my data looks like:
mydata<-data.frame(sassign$buyer,sassign$purch,sassign$total_)
str(mydata)
'data.frame': 50000 obs. of 3 variables:
$ sassign.buyer : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 2 1 ...
$ sassign.purch : num 10 3 2 1 1 1 1 11 11 1 ...
$ sassign.total_: num 357 138 172 272 149 113 15 238 418 123 ...
head(mydata)
sassign.buyer sassign.purch sassign.total_
1 no 10 357
2 no 3 138
3 no 2 172
4 no 1 272
5 no 1 149
6 yes 1 113
My objective is to find the average number of buyers with # of purchases > 1.
So, here's what I did:
Method 1: Long method
library(psych)
check<-as.numeric(mydata$sassign.buyer)-1
myd<-cbind(mydata,check)
abcd<-psych::describe(myd[myd$sassign.purch>1,])
abcd$mean[4]
The output I got is 0.1031536697, which is correct.
@Sathish: Here's what check looks like:
head(check)
0 0 0 0 0 1
This did serve my purpose.
Pros of this method: it's easy and typically beginner level.
Cons: too many. I need an extra variable (check). Plus, I don't like this method; it's too clunky.
Side question: I realized that by default, functions don't show higher precision even though options(digits = 10) is set. For instance, here's what I got from running:
psych::describe(myd[myd$sassign.purch>1,])
vars n mean sd median trimmed mad min max range skew
sassign.buyer* 1 34880 1.10 0.30 1 1.00 0.00 1 2 1 2.61
sassign.purch 2 34880 5.14 3.48 4 4.73 2.97 2 12 10 0.65
sassign.total_ 3 34880 227.40 101.12 228 226.13 112.68 30 479 449 0.09
check 4 34880 0.10 0.30 0 0.00 0.00 0 1 1 2.61
kurtosis se
sassign.buyer* 4.81 0.00
sassign.purch -1.05 0.02
sassign.total_ -0.72 0.54
check 4.81 0.00
It was only when I ran
abcd$mean[4]
that I got 0.1031536697.
Method 2: Using dplyr
I tried pipes and a function call, but finally gave up.
Method 2 | Try1:
psych::describe(dplyr::filter(mydata,mydata$sassign.purch>1)[,dplyr::mutate(as.numeric(mydata$sassign.buyer)-1)])
Output:
Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "c('double', 'numeric')"
Method 2 | Try2: Using pipes:
mydata %>%
  mutate(newcol = as.numeric(sassign.buyer) - 1) %>%
  dplyr::filter(sassign.purch > 1) %>%
  summarise(meanpurch = mean(newcol))
This did work, and I got meanpurch = 0.1031537. However, I am still not sure about Try 1.
Any thoughts why this isn't working?
Data:
> dt
# sassign.buyer sassign.purch sassign.total_
# 1 no 10 357
# 2 no 3 138
# 3 no 2 172
# 4 no 1 272
# 5 no 1 149
# 6 yes 1 113
Number of Buyers with purchases greater than 1
library(dplyr)
dt %>%
  group_by(sassign.buyer) %>%
  filter(sassign.purch > 1)
#
# Source: local data frame [3 x 3]
# Groups: sassign.buyer [1]
#
# sassign.buyer sassign.purch sassign.total_
# (chr) (int) (int)
# 1 no 10 357
# 2 no 3 138
# 3 no 2 172
Average number of buyers with purchases greater than 1
dt %>%
  group_by(sassign.buyer) %>%
  filter(sassign.purch > 1) %>%
  summarise(avg_no_buyers_gt_1 = length(sassign.buyer) / nrow(dt))
# Source: local data frame [1 x 2]
#
# sassign.buyer avg_no_buyers_gt_1
# (chr) (dbl)
# 1 no 0.5
If no grouping of buyers is required,
dt %>%
  filter(sassign.purch > 1) %>%
  summarise(avg_no_buyers_gt_1 = length(sassign.buyer) / nrow(dt))
# avg_no_buyers_gt_1
# 1 0.7777778
Finding the proportion of cases that suit a condition is easy to do with mean(). Here's a blog post explaining it: https://drsimonj.svbtle.com/proportionsfrequencies-with-mean-and-booleans, and here's a simple example:
buyer <- c("yes", "yes", "no", "no")
mean(buyer == "yes")
#> [1] 0.5
So in your case, you can do mean(d$sassign.buyer[d$sassign.purch > 1] == "yes"). Here's a worked example:
d <- data.frame(
  sassign.buyer = factor(c("yes", "yes", "no", "no")),
  sassign.purch = c(1, 10, 0, 200)
)
mean(d$sassign.buyer[d$sassign.purch > 1] == "yes")
#> [1] 0.5
This gets all cases where d$sassign.purch is greater than 1, and then computes the proportion (using mean()) of these cases in which d$sassign.buyer is equal to "yes".