Given a data frame with 6 variables:
x1 var1 x2 var2 x3 var3
How do you count the missing values in variables: var1, var2, var3 BY ROW such that the data frame will have these variables:
x1 var1 x2 var2 x3 var3 num.missing
A reproducible data set with the expected answer would have been very helpful. I'll create one for you:
set.seed(1337)
dat <- data.frame(x1=1:10, var1=runif(10),
x2=11:20, var2=runif(10),
x3=21:30, var3=runif(10))
dat
x1 var1 x2 var2 x3 var3
1 1 0.57632155 11 0.97943029 21 0.84916377
2 2 0.56474213 12 0.99371759 22 0.72408821
3 3 0.07399023 13 0.82735873 23 0.04661798
4 4 0.45386562 14 0.19398230 24 0.15367816
5 5 0.37327926 15 0.98132543 25 0.56259417
6 6 0.33131745 16 0.02522857 26 0.98142569
7 7 0.94763002 17 0.97238848 27 0.93177423
8 8 0.28111731 18 0.92379666 28 0.89861494
9 9 0.24540405 19 0.33913968 29 0.46979326
10 10 0.14604362 20 0.24657940 30 0.99500811
Now delete a random sample of values:
dat[sample(1:10, 3), "var1"] <- NA
dat[sample(1:10, 3), "var2"] <- NA
dat[sample(1:10, 3), "var3"] <- NA
dat
x1 var1 x2 var2 x3 var3
1 1 NA 11 0.9794303 21 0.8491638
2 2 0.56474213 12 0.9937176 22 0.7240882
3 3 0.07399023 13 NA 23 NA
4 4 0.45386562 14 0.1939823 24 0.1536782
5 5 0.37327926 15 0.9813254 25 0.5625942
6 6 NA 16 NA 26 0.9814257
7 7 0.94763002 17 0.9723885 27 NA
8 8 0.28111731 18 NA 28 0.8986149
9 9 NA 19 0.3391397 29 0.4697933
10 10 0.14604362 20 0.2465794 30 NA
Because logicals are coerced to integers in arithmetic (TRUE is 1, FALSE is 0), we can simply sum the is.na() tests:
dat$num.missing <- is.na(dat$var1) + is.na(dat$var2) + is.na(dat$var3)
dat
x1 var1 x2 var2 x3 var3 num.missing
1 1 NA 11 0.9794303 21 0.8491638 1
2 2 0.56474213 12 0.9937176 22 0.7240882 0
3 3 0.07399023 13 NA 23 NA 2
4 4 0.45386562 14 0.1939823 24 0.1536782 0
5 5 0.37327926 15 0.9813254 25 0.5625942 0
6 6 NA 16 NA 26 0.9814257 2
7 7 0.94763002 17 0.9723885 27 NA 1
8 8 0.28111731 18 NA 28 0.8986149 1
9 9 NA 19 0.3391397 29 0.4697933 1
10 10 0.14604362 20 0.2465794 30 NA 1
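With more than a handful of columns, the same idea can be written with rowSums() instead of adding each is.na() test by hand. This is a minimal sketch, not part of the original answer; the small data frame and the var_cols name are made up for illustration.

```r
# Small illustrative data frame (values are invented, not the seeded data above)
dat <- data.frame(x1 = 1:4,
                  var1 = c(NA, 0.5, 0.1, NA),
                  x2 = 11:14,
                  var2 = c(0.9, 0.3, NA, NA),
                  x3 = 21:24,
                  var3 = c(0.8, 0.7, NA, NA))

# Pick the columns to check by name pattern
var_cols <- grep("^var", names(dat), value = TRUE)

# is.na() on a data frame returns a logical matrix; rowSums() counts the TRUEs per row
dat$num.missing <- rowSums(is.na(dat[var_cols]))
dat
```

Selecting the columns by pattern means the count keeps working if var4, var5, and so on are added later.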
I have a dataset with two groups of subjects, Group A, Group B like this.
Id Group Age
1 A 17
2 A 14
3 A 10
4 A 17
5 A 12
6 A 6
7 A 18
8 A 7
9 B 18
9 B 13
10 B 6
10 B 12
11 B 16
11 B 17
12 B 11
12 B 18
The subjects in Group A are unique: one row per subject. The subjects in Group B are not unique. There are two or, in some cases, three rows of observations per subject in Group B (for example, IDs 9 and 10).
What I am trying to do is
a) estimate the average distance of subjects in GroupB to everyone in Group A. Using Age to estimate the distance.
b) estimate the distance of subjects in GroupB to the mode of subjects in Group A. Using Age to estimate the mode in Group A and Age in Group B to estimate the distance from the mode.
Expecting a dataset like this.
ID Group Age AvDistance DistanceToMedian
1 A 17 NA NA
2 A 14 NA NA
3 A 10 NA NA
4 A 17 NA NA
5 A 12 NA NA
6 A 6 NA NA
7 A 18 NA NA
8 A 7 NA NA
9 B 18 5.375 2.11
9 B 13 3.875 2.88
10 B 6 ... ...
10 B 12 ... ...
11 B 16 ... ...
11 B 17 ... ...
12 B 11 ... ...
12 B 18 ... ...
I can do this manually, but any suggestions on how to make this more efficient are much appreciated. Thanks.
# Estimate average distance of each Group B row to all subjects in Group A
(sqrt((17 - 18)^2) + sqrt((14 - 18)^2) + sqrt((10 - 18)^2) + sqrt((17 - 18)^2) + sqrt((12 - 18)^2) + sqrt((6 - 18)^2) + sqrt((18 - 18)^2) + sqrt((7 - 18)^2)) / 8 # 5.375
(sqrt((17 - 13)^2) + sqrt((14 - 13)^2) + sqrt((10 - 13)^2) + sqrt((17 - 13)^2) + sqrt((12 - 13)^2) + sqrt((6 - 13)^2) + sqrt((18 - 13)^2) + sqrt((7 - 13)^2)) / 8 # 3.875
estimate_mode <- function(x) {
d <- density(x)
d$x[which.max(d$y)]
}
# Estimate Mode for Age in Group A
x <- c(17, 14, 10, 17, 12, 6, 18, 7)
estimate_mode(x)
m1 <- estimate_mode(x)
# Estimate distance of each Group B age from the Group A mode
sqrt((18 - m1)^2) # 2.11
sqrt((13 - m1)^2) # 2.88
This will be easier with a unique row ID, so I'll create one:
library(dplyr)
library(tibble)
df = df %>%
mutate(rownum = paste0("row", row_number()))
ages = setNames(df$Age, df$rownum)
## make a distance matrix
dist = outer(ages[df$Group == "B"], ages[df$Group == "A"], FUN = \(x, y) abs(x - y))
## calculate average distances
av_dist = data.frame(AvDist = rowMeans(dist)) %>% rownames_to_column("rownum")
## calculate median age for A
med_a = median(ages[df$Group == "A"])
## add back to original data
df %>%
left_join(av_dist, by = "rownum") %>%
mutate(DistanceToMedian = ifelse(Group == "B", abs(Age - med_a), NA))
# Id Group Age rownum AvDist DistanceToMedian
# 1 1 A 17 row1 NA NA
# 2 2 A 14 row2 NA NA
# 3 3 A 10 row3 NA NA
# 4 4 A 17 row4 NA NA
# 5 5 A 12 row5 NA NA
# 6 6 A 6 row6 NA NA
# 7 7 A 18 row7 NA NA
# 8 8 A 7 row8 NA NA
# 9 9 B 18 row9 5.375 5
# 10 9 B 13 row10 3.875 0
# 11 10 B 6 row11 6.625 7
# 12 10 B 12 row12 3.875 1
# 13 11 B 16 row13 4.375 3
# 14 11 B 17 row14 4.625 4
# 15 12 B 11 row15 4.125 2
# 16 12 B 18 row16 5.375 5
I used median, not mode, because I was looking at your column names, but you can easily swap in your mode instead.
Using this sample data:
df = read.table(text = 'Id Group Age
1 A 17
2 A 14
3 A 10
4 A 17
5 A 12
6 A 6
7 A 18
8 A 7
9 B 18
9 B 13
10 B 6
10 B 12
11 B 16
11 B 17
12 B 11
12 B 18', header = T)
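The answer above notes that the mode can be swapped in for the median. A minimal base R sketch of that swap follows, reusing the asker's estimate_mode() function; the DistanceToMode column name is an assumption made here for illustration.

```r
# Density-based mode estimate, as defined in the question
estimate_mode <- function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
}

df <- data.frame(
  Id = c(1:8, 9, 9, 10, 10, 11, 11, 12, 12),
  Group = rep(c("A", "B"), each = 8),
  Age = c(17, 14, 10, 17, 12, 6, 18, 7, 18, 13, 6, 12, 16, 17, 11, 18)
)

# Mode of Age within Group A only
mode_a <- estimate_mode(df$Age[df$Group == "A"])

# Distance from that mode for Group B rows; Group A rows stay NA
df$DistanceToMode <- ifelse(df$Group == "B", abs(df$Age - mode_a), NA)
df
```

Because density() depends on the chosen bandwidth, the estimated mode (and therefore the distances) can shift slightly if the bw argument is changed.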
I want to take differences for each pair of consecutive columns but for an arbitrary number of columns. For example...
library(tibble)
df <- as_tibble(data.frame(group = rep(c("a", "b", "c"), each = 4),
subgroup = rep(c("adam", "boy", "charles", "david"), times = 3),
iter1 = 1:12,
iter2 = c(13:22, NA, 24),
iter3 = c(25:35, NA)))
I want to calculate the differences by column. I would normally use...
df %>%
mutate(diff_iter2 = iter2 - iter1,
diff_iter3 = iter3 - iter2)
But... I'd like to:
accommodate an arbitrary number of columns and
treat NAs such that:
if the number we're subtracting from is NA, then the result should be NA. E.g. NA - 11 = NA
if the number we're subtracting is NA, then that NA is effectively treated as a 0. E.g. 35 - NA = 35
The result should look like this...
group subgroup iter1 iter2 iter3 diff_iter2 diff_iter3
<chr> <chr> <int> <dbl> <int> <dbl> <dbl>
1 a adam 1 13 25 12 12
2 a boy 2 14 26 12 12
3 a charles 3 15 27 12 12
4 a david 4 16 28 12 12
5 b adam 5 17 29 12 12
6 b boy 6 18 30 12 12
7 b charles 7 19 31 12 12
8 b david 8 20 32 12 12
9 c adam 9 21 33 12 12
10 c boy 10 22 34 12 12
11 c charles 11 NA 35 NA 35
12 c david 12 24 NA 12 NA
Originally, this df was in long format. The problem there is that, as far as I can tell, lag() operates on position within groups, and the groups are not all the same size because some have missing records (hence the NAs in the wider table shown above).
A solution that starts from the long format would also be fine, but then please assume that the records shown above with NA values do not exist in that longer data frame.
Any help is appreciated.
One tidyverse option is to loop across the 'iter' columns other than iter1. For each one, build the name of the preceding column by subtracting 1 from the number in the current column name (cur_column()) with str_replace, and fetch that column with get. Replace its NA elements with 0 (replace_na), following the OP's logic, and subtract it from the looped column. The new columns are named by adding a prefix in .names ("diff_{.col}", where {.col} is the original column name).
library(dplyr)
library(stringr)
library(tidyr)
df <- df %>%
mutate(across(iter2:iter3, ~
. - replace_na(get(str_replace(cur_column(), '\\d+',
function(x) as.numeric(x) - 1)), 0), .names = 'diff_{.col}'))
Output:
df
# A tibble: 12 × 7
group subgroup iter1 iter2 iter3 diff_iter2 diff_iter3
<chr> <chr> <int> <dbl> <int> <dbl> <dbl>
1 a adam 1 13 25 12 12
2 a boy 2 14 26 12 12
3 a charles 3 15 27 12 12
4 a david 4 16 28 12 12
5 b adam 5 17 29 12 12
6 b boy 6 18 30 12 12
7 b charles 7 19 31 12 12
8 b david 8 20 32 12 12
9 c adam 9 21 33 12 12
10 c boy 10 22 34 12 12
11 c charles 11 NA 35 NA 35
12 c david 12 24 NA 12 NA
Find the positions of the columns whose names start with iter (ix), then take all but the first as df1 and all but the last as df2, and replace the NAs in df2 with 0. Then subtract df2 from df1 and cbind the result to df. No packages are used.
ix <- grep("^iter", names(df))
df1 <- df[tail(ix, -1)]
df2 <- df[head(ix, -1)]
df2[is.na(df2)] <- 0
cbind(df, diff = df1 - df2)
giving:
group subgroup iter1 iter2 iter3 diff.iter2 diff.iter3
1 a adam 1 13 25 12 12
2 a boy 2 14 26 12 12
3 a charles 3 15 27 12 12
4 a david 4 16 28 12 12
5 b adam 5 17 29 12 12
6 b boy 6 18 30 12 12
7 b charles 7 19 31 12 12
8 b david 8 20 32 12 12
9 c adam 9 21 33 12 12
10 c boy 10 22 34 12 12
11 c charles 11 NA 35 NA 35
12 c david 12 24 NA 12 NA
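The base R approach above relies on the iter columns already being in the right order. The sketch below wraps the same logic in a small helper that sorts the columns by their numeric suffix first. The helper name add_consecutive_diffs and the three-row data frame are hypothetical, introduced only for illustration.

```r
# Hypothetical helper: adds diff_<col> for each consecutive pair of <prefix><number> columns
add_consecutive_diffs <- function(df, prefix = "iter") {
  cols <- grep(paste0("^", prefix, "[0-9]+$"), names(df), value = TRUE)
  # Sort by the numeric suffix so iter10 comes after iter9
  cols <- cols[order(as.numeric(sub(prefix, "", cols, fixed = TRUE)))]
  if (length(cols) < 2) return(df)
  for (k in seq_along(cols)[-1]) {
    prev <- df[[cols[k - 1]]]
    prev[is.na(prev)] <- 0  # an NA being subtracted counts as 0
    df[[paste0("diff_", cols[k])]] <- df[[cols[k]]] - prev  # NA minuend stays NA
  }
  df
}

df <- data.frame(iter1 = c(1, 11, 12),
                 iter2 = c(13, NA, 24),
                 iter3 = c(25, 35, NA))
out <- add_consecutive_diffs(df)
out
```

The two NA rules from the question fall out of ordinary arithmetic: only the subtracted column is zero-filled, so an NA in the column being subtracted from still propagates.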
I am looking to separate rows of data by Cue and add a row after each group that holds the averages for that group. Here is an example:
Before:
Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 4 22 0.26855 0.17487 0.22461
4 4 20 0.15106 0.48767 0.49072
5 7 18 0.11627 0.12604 0.2832
6 7 24 0.50201 0.14252 0.21454
7 12 16 0.27649 0.96008 0.42114
8 12 18 0.60852 0.21637 0.18799
9 22 20 0.32867 0.65308 0.29388
10 22 24 0.25726 0.37048 0.32379
After:
Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 0.67978 0.51071 0.31723
4 4 22 0.26855 0.17487 0.22461
5 4 20 0.15106 0.48767 0.49072
6 0.209 0.331 0.357
7 7 18 0.11627 0.12604 0.2832
8 7 24 0.50201 0.14252 0.21454
9 0.309 0.134 0.248
10 12 16 0.27649 0.96008 0.42114
11 12 18 0.60852 0.21637 0.18799
12 0.442 0.588 0.304
13 22 20 0.32867 0.65308 0.29388
14 22 24 0.25726 0.37048 0.32379
15 0.292 0.511 0.308
So in the "after" example, line 3 is the average of lines 1 and 2 (line 6 is the average of lines 4 and 5, etc...).
Any help/information would be greatly appreciated!
Thank you!
You can use base R to do something like:
Reduce(rbind,by(data,data[1],function(x)rbind(x,c(NA,NA,colMeans(x[-(1:2)])))))
Cue ITI a b c
1 0 16 0.820620 0.521850 0.276790
2 0 24 0.538940 0.499570 0.357670
3 NA NA 0.679780 0.510710 0.317230
32 4 22 0.268550 0.174870 0.224610
4 4 20 0.151060 0.487670 0.490720
31 NA NA 0.209805 0.331270 0.357665
5 7 18 0.116270 0.126040 0.283200
6 7 24 0.502010 0.142520 0.214540
33 NA NA 0.309140 0.134280 0.248870
7 12 16 0.276490 0.960080 0.421140
8 12 18 0.608520 0.216370 0.187990
34 NA NA 0.442505 0.588225 0.304565
9 22 20 0.328670 0.653080 0.293880
10 22 24 0.257260 0.370480 0.323790
35 NA NA 0.292965 0.511780 0.308835
Here is one idea. Split the data frame, perform the analysis, and then combine them together.
DF_list <- split(DF, f = DF$Cue)
DF_list2 <- lapply(DF_list, function(x){
df_temp <- as.data.frame(t(colMeans(x[, -c(1, 2)])))
df_temp[, c("Cue", "ITI")] <- NA
df <- rbind(x, df_temp)
return(df)
})
DF2 <- do.call(rbind, DF_list2)
rownames(DF2) <- 1:nrow(DF2)
DF2
# Cue ITI a b c
# 1 0 16 0.820620 0.521850 0.276790
# 2 0 24 0.538940 0.499570 0.357670
# 3 NA NA 0.679780 0.510710 0.317230
# 4 4 22 0.268550 0.174870 0.224610
# 5 4 20 0.151060 0.487670 0.490720
# 6 NA NA 0.209805 0.331270 0.357665
# 7 7 18 0.116270 0.126040 0.283200
# 8 7 24 0.502010 0.142520 0.214540
# 9 NA NA 0.309140 0.134280 0.248870
# 10 12 16 0.276490 0.960080 0.421140
# 11 12 18 0.608520 0.216370 0.187990
# 12 NA NA 0.442505 0.588225 0.304565
# 13 22 20 0.328670 0.653080 0.293880
# 14 22 24 0.257260 0.370480 0.323790
# 15 NA NA 0.292965 0.511780 0.308835
DATA
DF <- read.table(text = " Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 4 22 0.26855 0.17487 0.22461
4 4 20 0.15106 0.48767 0.49072
5 7 18 0.11627 0.12604 0.2832
6 7 24 0.50201 0.14252 0.21454
7 12 16 0.27649 0.96008 0.42114
8 12 18 0.60852 0.21637 0.18799
9 22 20 0.32867 0.65308 0.29388
10 22 24 0.25726 0.37048 0.32379", header = TRUE)
A data.table approach, but if someone can offer some improvements I'd be keen to hear.
library(data.table)
dt <- data.table(df)
dt2 <- dt[, lapply(.SD, mean), by = Cue][,ITI := NA][]
data.table(rbind(dt, dt2))[order(Cue)][is.na(ITI), Cue := NA][]
Cue ITI a b c
1: 0 16 0.820620 0.521850 0.276790
2: 0 24 0.538940 0.499570 0.357670
3: NA NA 0.679780 0.510710 0.317230
4: 4 22 0.268550 0.174870 0.224610
5: 4 20 0.151060 0.487670 0.490720
6: NA NA 0.209805 0.331270 0.357665
If you want to leave the Cue values as-is to confirm group, just drop the [is.na(ITI), Cue := NA] from the last line.
I would use group_by and summarise from the dplyr package to get a data frame with the average values. Then rbind the new data frame with the old one and sort by Cue:
library(dplyr)
df_averages <- df_orig %>%
group_by(Cue) %>%
summarise(ITI = NA, a = mean(a), b = mean(b), c = mean(c)) %>%
ungroup()
df_all <- rbind(df_orig, df_averages)
# order() is stable, so each average row lands after the rows of its Cue group
df_all <- df_all[order(df_all$Cue), ]
I have a dataframe of records of varying lengths, with NAs at the end. If there are more than three x-values in a record, I want to make the value of the third x-value equal to the value of the last x-value. Each record already tells me how many x-values it has.
I can make x3 be equal to the name of the last x-value (x4 or x5 etc) but what I really need is to make x3 take the value of that last x-value.
I'm sure there is some simple answer. Any help would be greatly appreciated! Thank you.
Here is a simple case:
ii <- "n x1 x2 x3 x4 x5 x6
1 3 30 40 20 NA NA NA
2 4 10 50 16 25 NA NA
3 6 20 15 26 16 18 28
4 5 10 10 18 17 19 NA
5 2 65 41 NA NA NA NA
6 5 10 11 23 16 23 NA
7 1 99 NA NA NA NA NA"
df <- read.table(text=ii, header = TRUE, na.strings="NA", colClasses="character")
oo <- "n x1 x2 x3
1 3 30 40 20
2 4 10 50 25
3 6 20 15 28
4 5 10 10 19
5 2 65 41 NA
6 5 10 11 23
7 1 99 NA NA"
desireddf <- read.table(text=oo, header = TRUE, na.strings="NA", colClasses="character")
df$lastx <- as.character(paste("x", df$n, sep=""))
#df$lastx <- df[[get(df$lastx)]] #How can I make lastx equal to the _value_ of lastx???
df[df$n>3, c('x3')] <- df[df$n>3, 'lastx']
df <- df[,1:4]
print(df)
This yields the following, not the desireddf above.
n x1 x2 x3
1 3 30 40 20
2 4 10 50 x4
3 6 20 15 x6
4 5 10 10 x5
5 2 65 41 <NA>
6 5 10 11 x5
7 1 99 <NA> <NA>
This seems like a pretty arbitrary task, but here goes:
desireddf <- data.frame(n=df$n, x1=df$x1, x2=df$x2, x3=df[cbind(1:nrow(df), paste("x", pmax(3,as.numeric(df$n)), sep=""))])
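The one-liner above works because a two-column matrix used as an index picks one cell per row. A minimal sketch of the same technique with a numeric index matrix follows. It assumes the columns are read as numbers rather than characters, and the shortened four-row data set is made up for illustration.

```r
df <- read.table(text = "n x1 x2 x3 x4 x5 x6
3 30 40 20 NA NA NA
4 10 50 16 25 NA NA
6 20 15 26 16 18 28
2 65 41 NA NA NA NA", header = TRUE)

# Matrix of the x-values only, so column k of the matrix is xk
x_cols <- as.matrix(df[paste0("x", 1:6)])

# Column to read for each row: the last x-value, but never earlier than x3
last_col <- pmax(3, df$n)

# Each row of the index matrix is a (row, column) pair; one value is returned per pair
df$x3 <- x_cols[cbind(seq_len(nrow(df)), last_col)]
result <- df[c("n", "x1", "x2", "x3")]
result
```

Rows with three or fewer x-values read their own x3, so they are left unchanged (including an NA x3 when n is below 3).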
I have two dataframes and I want to put one above the other "with" column names of second as a row of the new dataframe. Column names are different and one dataframe has more columns.
For example:
mydf1 <- data.frame(V1=c(1:5), V2=c(21:25))
mydf1
V1 V2
1 1 21
2 2 22
3 3 23
4 4 24
5 5 25
mydf2 <- data.frame(C1=c(1:10), C2=c(21:30),C3=c(41:50))
mydf2
C1 C2 C3
1 1 21 41
2 2 22 42
3 3 23 43
4 4 24 44
5 5 25 45
6 6 26 46
7 7 27 47
8 8 28 48
9 9 29 49
10 10 30 50
Result:
mydf
V1 V2
1 1 21 NA
2 2 22 NA
3 3 23 NA
4 4 24 NA
5 5 25 NA
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
I don't care if all numeric values are treated as characters.
Many thanks
You can do this easily without any packages:
mydf1 <- data.frame(V1=c(1:5), V2=c(21:25))
mydf1[,3] <- NA
names(mydf1) <- c("one", "two", "three")
mydf2 <- data.frame(C1=c(1:10), C2=c(21:30),C3=c(41:50))
names <- t(as.data.frame(names(mydf2)))
names <- as.data.frame(names)
names(mydf2) <- c("one", "two", "three")
names(names) <- c("one", "two", "three")
mydf3 <- rbind(mydf1, names)
mydf4 <- rbind(mydf3, mydf2)
> mydf4
one two three
1 1 21 <NA>
2 2 22 <NA>
3 3 23 <NA>
4 4 24 <NA>
5 5 25 <NA>
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
Of course, you can edit the <- c("one", "two", "three") to make the final column names whatever you'd like. For example:
> mydf1 <- data.frame(V1=c(1:5), V2=c(21:25))
> mydf1[,3] <- NA
> names(mydf1) <- c("V1", "V2", "NA")
> mydf2 <- data.frame(C1=c(1:10), C2=c(21:30),C3=c(41:50))
> names <- t(as.data.frame(names(mydf2)))
> names <- as.data.frame(names)
> names(mydf2) <- c("V1", "V2", "NA")
> names(names) <- c("V1", "V2", "NA")
> mydf3 <- rbind(mydf1, names)
> mydf4 <- rbind(mydf3, mydf2)
> row.names(mydf4) <- NULL
> mydf4
V1 V2 NA
1 1 21 <NA>
2 2 22 <NA>
3 3 23 <NA>
4 4 24 <NA>
5 5 25 <NA>
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
If you need to resort to a package for any reason when scaling this up to your real use case, then try melt from reshape2 or the plyr package. However, using a package shouldn't be necessary.
I don't know what you tried with write.table, but that seems to me like the way to go.
I would create a function something like this:
myFun <- function(...) {
L <- list(...)
temp <- tempfile()
maxCol <- max(vapply(L, ncol, 1L))
lapply(L, function(x)
suppressWarnings(
write.table(x, file = temp, row.names = FALSE,
sep = ",", append = TRUE)))
read.csv(temp, header = FALSE, fill = TRUE,
col.names = paste0("New_", sequence(maxCol)),
stringsAsFactors = FALSE)
}
Usage would then simply be:
myFun(mydf1, mydf2)
# New_1 New_2 New_3
# 1 V1 V2
# 2 1 21
# 3 2 22
# 4 3 23
# 5 4 24
# 6 5 25
# 7 C1 C2 C3
# 8 1 21 41
# 9 2 22 42
# 10 3 23 43
# 11 4 24 44
# 12 5 25 45
# 13 6 26 46
# 14 7 27 47
# 15 8 28 48
# 16 9 29 49
# 17 10 30 50
The function is written such that you can specify more than two data.frames as input:
mydf3 <- data.frame(matrix(1:8, ncol = 4))
myFun(mydf1, mydf2, mydf3)
# New_1 New_2 New_3 New_4
# 1 V1 V2
# 2 1 21
# 3 2 22
# 4 3 23
# 5 4 24
# 6 5 25
# 7 C1 C2 C3
# 8 1 21 41
# 9 2 22 42
# 10 3 23 43
# 11 4 24 44
# 12 5 25 45
# 13 6 26 46
# 14 7 27 47
# 15 8 28 48
# 16 9 29 49
# 17 10 30 50
# 18 X1 X2 X3 X4
# 19 1 3 5 7
# 20 2 4 6 8
Here's one approach with the rbind.fill function (part of the plyr package).
library(plyr)
setNames(rbind.fill(setNames(mydf1, names(mydf2[seq(mydf1)])),
rbind(names(mydf2), mydf2)), names(mydf1))
V1 V2 NA
1 1 21 <NA>
2 2 22 <NA>
3 3 23 <NA>
4 4 24 <NA>
5 5 25 <NA>
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
Give this a try.
Assign the column names from the second data set to a vector, and then replace the second set's names with the names from the first set. Then create a list where the middle element is the vector you assigned. Now when you call rbind, it should be fine since everything is in the right order.
d1$V3 <- NA
nm <- names(d2)
names(d2) <- names(d1)
dc <- do.call(rbind, list(d1,nm,d2))
rownames(dc) <- NULL
dc
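The snippet above uses d1 and d2 without defining them. The following self-contained sketch applies the same idea to the mydf1 and mydf2 objects from the question. It converts every column to character up front so that the header row stacks cleanly, which is an extra step assumed here rather than part of the answer; the names top, bottom, header_row, and stacked are illustrative.

```r
mydf1 <- data.frame(V1 = 1:5, V2 = 21:25)
mydf2 <- data.frame(C1 = 1:10, C2 = 21:30, C3 = 41:50)

# Pad the narrower data frame with an all-NA column, then make everything character
top <- mydf1
top$V3 <- NA
top[] <- lapply(top, as.character)

bottom <- mydf2
bottom[] <- lapply(bottom, as.character)

# One-row data frame holding the second data frame's column names
header_row <- as.data.frame(as.list(names(mydf2)), stringsAsFactors = FALSE)

# rbind() on data frames matches columns by name, so give all three the same names
names(header_row) <- names(top)
names(bottom) <- names(top)

stacked <- rbind(top, header_row, bottom)
rownames(stacked) <- NULL
stacked
```

Converting to character first avoids relying on rbind() to coerce integer columns when the character header row is appended.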