What is causing this cryptic error message in bind_rows? - r

I have the below mentioned data frame which I am trying to bind with another data frame
X_df
1 2
1 18 NA
2 3 NA
3 6 NA
4 8 8
y_df
1 2
35 8
y_df is actually the column sums of X_df. I have been trying to bind these two data frames using bind_rows, but it shows me the following error. Can I get some advice on how to rectify it? I am relatively new to R.
*Error in (function (cond) :
error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': Can't combine `..1$1` <table> and `..2$1` <double>*
Thanks in advance

I tried running the same bind_rows function you've mentioned:
x_df
X1 X2
1 18 NA
2 3 NA
3 6 NA
4 8 8
y_df
X1 X2
1 35 8
x_df %>% bind_rows(y_df)
X1 X2
1 18 NA
2 3 NA
3 6 NA
4 8 8
5 35 8
Another approach:
x_df %>% bind_rows(x_df %>% summarise(across(everything(), ~ sum(., na.rm = TRUE))))
X1 X2
1 18 NA
2 3 NA
3 6 NA
4 8 8
5 35 8
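The error message points at the likely cause: in the original question the columns of X_df carry the class table rather than being plain numeric vectors, while y_df's columns are plain doubles, and bind_rows refuses to combine the two. A minimal sketch of a fix, assuming the column names already match, is to flatten the columns before binding:
library(dplyr)
# Drop the table class so every column is a plain numeric vector
X_df[] <- lapply(X_df, as.numeric)
y_df[] <- lapply(y_df, as.numeric)
X_df %>% bind_rows(y_df)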

I suggest not using y_df for the totals, but rather something like janitor::adorn_totals() to calculate the column totals directly.
library(janitor)
library(tidyverse)
X_df %>%
tibble::rowid_to_column() %>%
janitor::adorn_totals(where = "row")
# rowid X1 X2
# 1 18 NA
# 2 3 NA
# 3 6 NA
# 4 8 8
# Total 35 8

Try using rbind:
result <- rbind(X_df, y_df)
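Note that rbind only works if both data frames have identical column names; if they differ, rename one first. A minimal sketch, assuming the columns of X_df and y_df are in the same order:
names(y_df) <- names(X_df)  # assumes both frames list their columns in the same order
result <- rbind(X_df, y_df)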

Related

Add a unique identifier to the same column value in R data frame

I have a data frame as follows:
index val sample_id
1 1 14 5
2 2 22 6
3 3 1 6
4 4 25 7
5 5 3 7
6 6 34 7
For each row with the same sample_id, I would like to add a unique identifier as follows:
index val sample_id
1 1 14 5
2 2 22 6-A
3 3 1 6-B
4 4 25 7-A
5 5 3 7-B
6 6 34 7-C
Any suggestion? Thank you for your help.
Base R
dat$id2 <- ave(dat$sample_id, dat$sample_id,
FUN = function(z) if (length(z) > 1) paste(z, LETTERS[seq_along(z)], sep = "-") else as.character(z))
dat
# index val sample_id id2
# 1 1 14 5 5
# 2 2 22 6 6-A
# 3 3 1 6 6-B
# 4 4 25 7 7-A
# 5 5 3 7 7-B
# 6 6 34 7 7-C
tidyverse
library(dplyr)
dat %>%
group_by(sample_id) %>%
mutate(id2 = if (n() > 1) paste(sample_id, LETTERS[row_number()], sep = "-") else as.character(sample_id)) %>%
ungroup()
Minor note: it might be tempting to drop the as.character(z) from either or both of the code blocks. In the base R version, nothing will change here: base R allows you to be a little sloppy. But if we rely on that and need the new field to always be character, then in the rare circumstance where every row has a unique sample_id the column will remain integer. dplyr is much more careful about guarding against this; if you run the tidyverse code without as.character, you'll see an error.
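For illustration, this is roughly what that looks like; the exact error wording depends on your dplyr version:
library(dplyr)
# Same as above, but without as.character() in the else branch
dat %>%
group_by(sample_id) %>%
mutate(id2 = if (n() > 1) paste(sample_id, LETTERS[row_number()], sep = "-") else sample_id) %>%
ungroup()
# dplyr refuses: the new column would be character in the multi-row groups
# but numeric in the single-row group, and those types can't be combined.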
Using dplyr:
library(dplyr)
dplyr::group_by(df, sample_id) %>%
dplyr::mutate(sample_id = paste(sample_id, LETTERS[seq_along(sample_id)], sep = "-"))
index val sample_id
<int> <dbl> <chr>
1 1 14 5-A
2 2 22 6-A
3 3 1 6-B
4 4 25 7-A
5 5 3 7-B
6 6 34 7-C
If you just want to create unique tags for the same sample_id, maybe you can try make.unique like below
transform(
df,
sample_id = ave(as.character(sample_id),sample_id,FUN = function(x) make.unique(x,sep = "_"))
)
which gives
index val sample_id
1 1 14 5
2 2 22 6
3 3 1 6_1
4 4 25 7
5 5 3 7_1
6 6 34 7_2
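Note that make.unique leaves the first occurrence untouched and only starts appending suffixes from the second occurrence onward, which is why 6 and 7 keep their plain form here; if every duplicated value should be tagged (as in the 6-A/6-B output asked for), the ave()/LETTERS approaches above are closer to what you want.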

conditional merge or left join two dataframes in R

I am trying to add additional data from a reference table onto my primary data frame. I see similar questions have been asked about this; however, I can't find anything for my specific case.
An example of my data frame is set up like this
df <- data.frame("participant" = rep(1:3,9), "time" = rep(1:9, each = 3))
lookup <- data.frame("start.time" = c(1,5,8), "end.time" = c(3,6,10), "var1" = c("A","B","A"),
"var2" = c(8,12,3), "var3"= c("fast","fast","slow"))
print(df)
participant time
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
10 1 4
11 2 4
12 3 4
13 1 5
14 2 5
15 3 5
16 1 6
17 2 6
18 3 6
19 1 7
20 2 7
21 3 7
22 1 8
23 2 8
24 3 8
25 1 9
26 2 9
27 3 9
> print(lookup)
start.time end.time var1 var2 var3
1 1 3 A 8 fast
2 5 6 B 12 fast
3 8 10 A 3 slow
What I want to do is merge or join these two data frames in a way that also covers the times between the start and end times of the lookup data frame, so that the columns var1, var2 and var3 are added onto df wherever time lies between start.time and end.time.
For example, in the above case the lookup value in the first row has a start time of 1 and an end time of 3, so for times 1, 2 and 3 for each participant, the first row's data should be added.
The output should look something like this:
print(output)
participant time var1 var2 var3
1 1 1 A 8 fast
2 2 1 A 8 fast
3 3 1 A 8 fast
4 1 2 A 8 fast
5 2 2 A 8 fast
6 3 2 A 8 fast
7 1 3 A 8 fast
8 2 3 A 8 fast
9 3 3 A 8 fast
10 1 4 <NA> NA <NA>
11 2 4 <NA> NA <NA>
12 3 4 <NA> NA <NA>
13 1 5 B 12 fast
14 2 5 B 12 fast
15 3 5 B 12 fast
16 1 6 B 12 fast
17 2 6 B 12 fast
18 3 6 B 12 fast
19 1 7 <NA> NA <NA>
20 2 7 <NA> NA <NA>
21 3 7 <NA> NA <NA>
22 1 8 A 3 slow
23 2 8 A 3 slow
24 3 8 A 3 slow
25 1 9 A 3 slow
26 2 9 A 3 slow
27 3 9 A 3 slow
I realise that column names don't match and they should for merging data sets.
One option would be to use the sqldf package, and phrase your problem as a SQL left join:
sql <- "SELECT t1.participant, t1.time, t2.var1, t2.var2, t2.var3
FROM df t1
LEFT JOIN lookup t2
ON t1.time BETWEEN t2.\"start.time\" AND t2.\"end.time\""
output <- sqldf(sql)
A dplyr solution:
library(dplyr)

output <- df %>%
# Create an id for the join
mutate(merge_id=1) %>%
# Use full join to create all the combinations between the two datasets
full_join(lookup %>% mutate(merge_id=1), by="merge_id") %>%
# Keep only the rows that we want
filter(time >= start.time, time <= end.time) %>%
# Select the relevant variables
select(participant,time,var1:var3) %>%
# Right join with initial dataset to get the missing rows
right_join(df, by = c("participant","time")) %>%
# Sort to match the formatting asked by OP
arrange(time, participant)
This produces the output asked by OP, but it will only work for data of reasonable size, as the full join produces a data frame with number of rows equal to the product of the number of rows of both initial datasets.
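Alternatively, if your dplyr version is 1.1.0 or newer, a non-equi join with join_by() avoids the full cross product entirely; a minimal sketch:
library(dplyr)  # join_by() requires dplyr >= 1.1.0
output <- df %>%
left_join(lookup, by = join_by(between(time, start.time, end.time))) %>%
select(participant, time, var1:var3) %>%
arrange(time, participant)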
Using tidyverse and creating an auxiliary table:
library(tidyverse)

df <- data.frame("participant" = rep(1:3,9), "time" = rep(1:9, each = 3))
lookup <- data.frame("start.time" = c(1,5,8), "end.time" = c(3,6,10), "var1" = c("A","B","A"),
"var2" = c(8,12,3), "var3"= c("fast","fast","slow"))
lookup_extended <- lookup %>%
mutate(time = map2(start.time, end.time, ~ c(.x:.y))) %>%
unnest(time) %>%
select(-start.time, -end.time)
df2 <- df %>%
left_join(lookup_extended, by = "time")

Create multiple sums

Ciao,
Here is a reproducible example.
df <- data.frame("STUDENT"=c(1,2,3,4,5),
"TEST1A"=c(NA,5,5,6,7),
"TEST2A"=c(NA,8,4,6,9),
"TEST3A"=c(NA,10,5,4,6),
"TEST1B"=c(5,6,7,4,1),
"TEST2B"=c(10,10,9,3,1),
"TEST3B"=c(0,5,6,9,NA),
"TEST1TOTAL"=c(NA,23,14,16,22),
"TEST2TOTAL"=c(10,16,15,12,NA))
I have columns STUDENT through TEST3B and want to create TEST1TOTAL and TEST2TOTAL, where TEST1TOTAL = TEST1A + TEST2A + TEST3A and likewise for TEST2TOTAL. If any score in TEST1A, TEST2A or TEST3A is missing, then TEST1TOTAL should be NA.
Here is my attempt, but is there a solution with fewer lines of code? With this approach I would need to write the line out many times, as the tests run from A through O.
TEST1TOTAL=rowSums(df[,c('TEST1A', 'TEST2A', 'TEST3A')], na.rm=TRUE)
Using just base R functions:
output <- data.frame(df1, do.call(cbind, lapply(c("A$", "B$"), function(x) rowSums(df1[, grep(x, names(df1))]))))
Customizing colnames:
> colnames(output)[(ncol(output)-1):ncol(output)] <- c("TEST1TOTAL", "TEST2TOTAL")
> output
STUDENT TEST1A TEST2A TEST3A TEST1B TEST2B TEST3B TEST1TOTAL TEST2TOTAL
1 1 NA NA NA 5 10 0 NA 15
2 2 5 8 10 6 10 5 23 21
3 3 5 4 5 7 9 6 14 22
4 4 6 6 4 4 3 9 16 16
5 5 7 9 6 1 1 NA 22 NA
Try:
library(dplyr)
df %>%
mutate(TEST1TOTAL = TEST1A+TEST2A+TEST3A,
TEST2TOTAL = TEST1B+TEST2B+TEST3B)
or
df %>%
mutate(TEST1TOTAL = rowSums(select(df, ends_with("A"))),
TEST2TOTAL = rowSums(select(df, ends_with("B"))))
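If the tests really run from A through O, you can also build all the totals programmatically instead of writing one mutate() line per letter. A minimal sketch, assuming every score column is named TEST<number><letter> and that the hypothetical test_letters vector lists the letters you actually have:
library(dplyr)
test_letters <- c("A", "B")  # extend to LETTERS[1:15] for tests A through O
for (i in seq_along(test_letters)) {
  df[[paste0("TEST", i, "TOTAL")]] <-
    rowSums(select(df, ends_with(test_letters[i])))  # no na.rm, so any NA gives an NA total
}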
I think for what you want, Jilber Urbina's solution is the way to go. For completeness' sake (and because I learned something figuring it out), here's a tidyverse way to get the score totals by test number for any number of tests.
The advantage is you don't need to specify the identifiers for the tests (beyond that they're numbered or have a trailing letter) and the same code will work for any number of tests.
library(tidyverse)
df_totals <- df %>%
gather(test, score, -STUDENT) %>% # Convert from wide to long format
mutate(test_num = paste0('TEST', gsub('[^0-9]', '', test),
'TOTAL'), # Extract test number from variable
test_let = gsub('TEST[0-9]*', '', test)) %>% # Extract test_letter (optional)
group_by(STUDENT, test_num) %>% # group by student + test
summarize(score_tot = sum(score)) %>% # Sum score by student/test
spread(test_num, score_tot) # Spread back to wide format
df_totals
# A tibble: 5 x 4
# Groups: STUDENT [5]
STUDENT TEST1TOTAL TEST2TOTAL TEST3TOTAL
<dbl> <dbl> <dbl> <dbl>
1 1 NA NA NA
2 2 11 18 15
3 3 12 13 11
4 4 10 9 13
5 5 8 10 NA
If you want the individual scores too, just join the totals together with the original:
left_join(df, df_totals, by = 'STUDENT')
STUDENT TEST1A TEST2A TEST3A TEST1B TEST2B TEST3B TEST1TOTAL TEST2TOTAL TEST3TOTAL
1 1 NA NA NA 5 10 0 NA NA NA
2 2 5 8 10 6 10 5 11 18 15
3 3 5 4 5 7 9 6 12 13 11
4 4 6 6 4 4 3 9 10 9 13
5 5 7 9 6 1 1 NA 8 10 NA

Filling NA in a data frame with a specified rule in R

Suppose I have a data frame (df1) with a column named x:
df1<-as.data.frame(x=c(4,3,2,16,7,8,9,1,12))
colnames(df1)<-"x"
df1[2,1]<-NA
df1[3,1]<-NA
df1[4,1]<-NA
The output is:
> df1
x
1 4
2 NA
3 NA
4 NA
5 7
6 8
7 9
8 1
9 12
I want to add a column to the data frame. The new column (y) should fill the NAs with the nearest non-NA value above them.
The code and the output (which is what I want) are:
df1$y<-na.locf(df1, fromLast = FALSE)
> df1
x x
1 4 4
2 NA 4
3 NA 4
4 NA 4
5 7 7
6 8 8
7 9 9
8 1 1
9 12 12
Note: I didn't understand why the second column's name is "x" although I defined it as "y".
However, the above method naturally gives an error when the first entries are NA, as below:
df2<-as.data.frame(c(4,3,2,16,7,8,9,1,12))
colnames(df2)<-"x"
df2[1,1]<-NA
df2[2,1]<-NA
df2[3,1]<-NA
> df2
x
1 NA
2 NA
3 NA
4 16
5 7
6 8
7 9
8 1
9 12
When I apply the below code:
df2$y<-na.locf(df2, fromLast = FALSE)
I get the below error:
Error in `$<-.data.frame`(`*tmp*`, "y", value = list(x = c(16, 7, 8, 9, :
replacement has 6 rows, data has 9
In such situations I just want the opposite of na.locf(df2, fromLast = FALSE), namely to fill the NAs with the first non-NA value below them.
Desired output is:
x y
1 NA 16
2 NA 16
3 NA 16
4 16 16
5 7 7
6 8 8
7 9 9
8 1 1
9 12 12
So, using the tryCatch function, I wrote the code below:
df2$y<-tryCatch(na.locf(df2, fromLast = FALSE),
error=function(err)
{na.locf(df2, fromLast = TRUE)})
However, I got such an error:
Error in `$<-.data.frame`(`*tmp*`, "y", value = list(x = c(16, 7, 8, 9, :
replacement has 6 rows, data has 9
So in summary the problem is:
if the data frame's first entry is not NA, then fill each NA with the first non-NA element above it;
if the data frame's first entry is NA, then fill each NA with the first non-NA element below it.
How can I do this in R, especially with the tryCatch function? I also don't understand why the second column's name shows as "x" instead of "y".
I will be very glad for any help. Thanks a lot.
We can do a double na.locf with the first one having the option na.rm = FALSE
library(zoo)
na.locf(na.locf(df2, na.rm = FALSE), fromLast = TRUE)
# x
#1 16
#2 16
#3 16
#4 16
#5 7
#6 8
#7 9
#8 1
#9 12
If we want to have two columns
transform(df2, y = na.locf(na.locf(x, na.rm = FALSE), fromLast = TRUE))
# x y
#1 NA 16
#2 NA 16
#3 NA 16
#4 16 16
#5 7 7
#6 8 8
#7 9 9
#8 1 1
#9 12 12
NOTE: Make sure to assign it to a new object or to the same object i.e. df2 <- transform(...
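As a side note on the question about the column name: na.locf(df1) returns a one-column data frame whose column is named x, and assigning that whole data frame to df1$y stores it as a data-frame column, so the inner name x is what shows up when printing. Passing the vector instead gives a proper y column; a minimal sketch:
library(zoo)
df1$y <- na.locf(df1$x, fromLast = FALSE)  # pass the vector, not the whole data frame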

How to return the positions of first occurrence for (different) duplicated rows in a data.frame?

Suppose you have a data frame like the following:
dfiris <- rbind(iris[1:5, -5], iris[1:5, -5], iris[1:5, -5], iris[1:5, -5], iris[1:5, -5])
Since the first 5 rows are repeated another 4 times, I would like to efficiently get:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
The function duplicated() does not help me because it only returns TRUE from the second occurrence of a duplicated row onward.
My (inefficient) solution:
apply(dfiris, 1, function(df) {
which(apply(unique(dfiris), 1, function(df_u) identical(df, df_u)))
})
There must be a quicker way to do that. Any suggestions?
Using data.table:
library(data.table)
setDT(dfiris, keep.rownames=TRUE)
print(setkey(dfiris[, list(rn=as.numeric(rn), firstOcc=.I[1]),
by=c(names(dfiris)[-1])], rn))
You may also try:
library(dplyr)
left_join(dfiris, mutate(distinct(dfiris), rn = row_number())) %>%
select(rn)
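For completeness, a base R sketch of the same idea, assuming dfiris is still a plain data frame (i.e. before the setDT() call above) and that comparing rows by their pasted string representation is acceptable:
# Index of each row's first occurrence among the unique rows
match(do.call(paste, dfiris), do.call(paste, unique(dfiris)))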
