I have a data frame with 1209 columns and 27900 rows.
In each row there are duplicated values scattered across the columns.
I tried transposing the data frame and removing the duplicates column by column, but it crashes.
After transposing, I used:
for(i in 1:ncol(df)){
#replicate column i without duplicates, fill blanks with NAs
df <- cbind.fill(df,unique(df[,1]), fill = NA)
#rename the new column
colnames(df)[n+1] <- colnames(df)[1]
#delete the old column
df[,1] <- NULL
}
But no result so far.
I would like to know if anyone has any idea.
Best
As I understand it, you would like to replace duplicated values in each column with NA?
This can be done in several ways.
First, some data:
set.seed(7)
df <- data.frame(x = sample(1: 20, 50, replace = T),
y = sample(1: 20, 50, replace = T),
z = sample(1: 20, 50, replace = T))
head(df, 10)
#output
x y z
1 20 12 8
2 8 15 10
3 3 16 10
4 2 13 8
5 5 15 13
6 16 8 7
7 7 4 20
8 20 4 1
9 4 8 16
10 10 6 5
With the purrr library:
library(purrr)
map_dfc(df, function(x) ifelse(duplicated(x), NA, x))
#output
# A tibble: 50 x 3
x y z
<int> <int> <int>
1 20 12 8
2 8 15 10
3 3 16 NA
4 2 13 NA
5 5 NA 13
6 16 8 7
7 7 4 20
8 NA NA 1
9 4 NA 16
10 10 6 5
# ... with 40 more rows
With apply in base R:
as.data.frame(apply(df, 2, function(x) ifelse(duplicated(x), NA, x)))
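A base R variant that keeps the result as a data frame is sketched below. It uses a small made-up data frame (not the one above). The second half is a row-wise version that needs no transpose, assuming the goal is to blank out repeats within each row, as the question describes.

```r
# Small illustrative data frame (made up for this sketch)
df_small <- data.frame(a = c(1, 2, 2),
                       b = c(1, 5, 2),
                       c = c(4, 2, 2))

# Column-wise: replace repeats within each column with NA.
# Assigning into by_col[] keeps the data frame structure.
by_col <- df_small
by_col[] <- lapply(df_small, function(col) replace(col, duplicated(col), NA))

# Row-wise: replace repeats within each row with NA.
# apply() returns the rows as columns, so transpose the result back.
by_row <- as.data.frame(
  t(apply(df_small, 1, function(r) replace(r, duplicated(r), NA)))
)
```

Note that apply() converts the data frame to a matrix first, so the row-wise version is only safe when all columns share one type.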
Purpose
Suppose I have four variables: two are original variables and the other two are predictions of the original variables. (The actual data has more original variables.)
I want to use a for loop and mutate to create columns that compute the difference between each original variable and its prediction. The sample data and my current approach follow:
Sample data
set.seed(10000)
id <- sample(1:20, 100, replace=T)
set.seed(10001)
dv.1 <- sample(1:20, 100, replace=T)
set.seed(10002)
dv.2 <- sample(1:20, 100, replace=T)
set.seed(10003)
pred_dv.1 <- sample(1:20, 100, replace=T)
set.seed(10004)
pred_dv.2 <- sample(1:20, 100, replace=T)
d <-
data.frame(id, dv.1, dv.2, pred_dv.1, pred_dv.2)
Current approach (with Error)
original <- d %>% select(starts_with('dv.')) %>% names(.)
pred <- d %>% select(starts_with('pred_dv.')) %>% names(.)
for (i in 1:length(original)){
d <-
d %>%
mutate(diff = original[i] - pred[i])
l <- length(d)
colnames(d[l]) <- paste0(original[i], '.diff')
}
Error: Problem with mutate() input diff.
x non-numeric argument to binary operator
ℹ Input diff is original[i] - pred[i].
library(dplyr)
d %>%
mutate(
across(
.cols = starts_with("dv"),
.fns = ~ . - (get(paste0("pred_",cur_column()))),
.names = "diff_{.col}"
)
)
# A tibble: 100 x 7
id dv.1 dv.2 pred_dv.1 pred_dv.2 diff_dv.1 diff_dv.2
<int> <int> <int> <int> <int> <int> <int>
1 15 5 1 5 15 0 -14
2 13 4 4 5 11 -1 -7
3 12 20 13 6 13 14 0
4 20 11 8 13 3 -2 5
5 9 11 10 7 13 4 -3
6 13 3 3 6 17 -3 -14
7 3 12 19 6 17 6 2
8 19 6 7 11 4 -5 3
9 6 7 12 19 6 -12 6
10 13 10 15 6 7 4 8
# ... with 90 more rows
Subtraction can be applied to data frames directly.
So you can create a vector of original column names and another vector of prediction column names, then subtract the corresponding columns to create the new ones.
orig_var <- grep('^dv', names(d), value = TRUE)
pred_var <- grep('pred', names(d), value = TRUE)
d[paste0(orig_var, '.diff')] <- d[orig_var] - d[pred_var]
d
# id dv.1 dv.2 pred_dv.1 pred_dv.2 dv.1.diff dv.2.diff
#1 15 5 1 5 15 0 -14
#2 13 4 4 5 11 -1 -7
#3 12 20 13 6 13 14 0
#4 20 11 8 13 3 -2 5
#5 9 11 10 7 13 4 -3
#...
#...
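As for why the loop in the question fails: original[i] and pred[i] are character strings holding column names, so original[i] - pred[i] tries to subtract one string from another. A minimal base R sketch of a working loop is shown below. It uses a small made-up data frame and looks each column up by name with [[ ]]; it assumes every prediction column is named "pred_" plus the original name, as in the question.

```r
# Small illustrative data frame (made up for this sketch)
d_small <- data.frame(dv.1 = c(5, 4), dv.2 = c(1, 4),
                      pred_dv.1 = c(5, 5), pred_dv.2 = c(15, 11))

# Names of the original columns, and the matching prediction columns
original <- grep("^dv\\.", names(d_small), value = TRUE)
pred <- paste0("pred_", original)

for (i in seq_along(original)) {
  # [[ ]] turns the column name (a string) into the column itself
  d_small[[paste0(original[i], ".diff")]] <-
    d_small[[original[i]]] - d_small[[pred[i]]]
}
```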
Suppose we have a column x holding a vector of, say, 21 numbers:
x
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
I want to split it into multiple columns following a flexible pattern.
The numbers are set in advance (n could be 3 or 4 or ...):
n=3, n1=2, n2=3, n3=2, .... The total number of columns is determined by n.
With n=3 columns, column 1 has n*n1 rows, column 2 has n*n2 rows, and column 3 has n*n3 rows. (These numbers could be variables.)
The final output is (this is the n=3 case, but my final goal is for n to be 4, 5, ...):
1   7  16
2   8  17
3   9  18
4  10  19
5  11  20
6  12  21
   13
   14
   15
If n is set as n=2, n1=3, n2=4, the single column needs 14 numbers, c(1:14). (In practice I do not know in advance how many columns need to be created; the number of columns is input by the user.)
Then what I want to get is n=2 columns:
1   7
2   8
3   9
4  10
5  11
6  12
   13
   14
I am trying to have the columns created automatically from variables set in advance.
Many thanks.
We can create a grouping variable with rep and split on it:
split(df1$x, rep(1:3, c(6, 9, 6)))
#$`1`
#[1] 1 2 3 4 5 6
#$`2`
#[1] 7 8 9 10 11 12 13 14 15
#$`3`
#[1] 16 17 18 19 20 21
A function can be created with the argument 'n' and additional arguments passed through ...:
f1 <- function(dat, n, ...) {
rgrp <- n * c(...)
split(dat[[1]][seq_len(sum(rgrp))], rep(seq_len(n), rgrp))
}
f1(df1, 2, 3, 4)
#$`1`
#[1] 1 2 3 4 5 6
#$`2`
#[1] 7 8 9 10 11 12 13 14
f1(df1, 3, 2, 3, 2)
#$`1`
#[1] 1 2 3 4 5 6
#$`2`
#[1] 7 8 9 10 11 12 13 14 15
#$`3`
#[1] 16 17 18 19 20 21
If the user submits a vector and we don't have n, then get n from the length of the vector:
f1 <- function(dat, vec) {
n <- length(vec)
rgrp <- n * vec
split(dat[[1]][seq_len(sum(rgrp))], rep(seq_len(n), rgrp))
}
f1(df1, 3:4)
If the user inputs 'n1', 'n2', we can use ...:
f1 <- function(dat, ...) {
vec <- c(...)
n <- length(vec)
rgrp <- n * vec
split(dat[[1]][seq_len(sum(rgrp))], rep(seq_len(n), rgrp))
}
f1(df1, 3, 4)
data
df1 <- structure(list(x = 1:21), class = "data.frame", row.names = c(NA,
-21L))
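split returns a list, while the question shows the pieces side by side as columns. One possible way to get that layout, sketched here as an assumption about the desired result, is to pad the shorter pieces with NA and bind them into a data frame. The column names col1, col2, ... are made up for the illustration.

```r
df1 <- data.frame(x = 1:21)

# Same function as the last version above: n is taken from the number
# of multipliers passed through ...
f1 <- function(dat, ...) {
  vec <- c(...)
  n <- length(vec)
  rgrp <- n * vec
  split(dat[[1]][seq_len(sum(rgrp))], rep(seq_len(n), rgrp))
}

pieces <- f1(df1, 2, 3, 2)            # lengths 6, 9, 6
max_len <- max(lengths(pieces))

# Indexing past the end of a vector yields NA, which pads the short pieces
padded <- lapply(pieces, function(v) v[seq_len(max_len)])
names(padded) <- paste0("col", seq_along(padded))  # hypothetical names

out <- as.data.frame(padded)
```

Here the padding goes at the bottom of each column, matching the layout in the question where the longer middle column runs on past the others.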
Ciao,
Here is a reproducible example.
df <- data.frame("STUDENT"=c(1,2,3,4,5),
"TEST1A"=c(NA,5,5,6,7),
"TEST2A"=c(NA,8,4,6,9),
"TEST3A"=c(NA,10,5,4,6),
"TEST1B"=c(5,6,7,4,1),
"TEST2B"=c(10,10,9,3,1),
"TEST3B"=c(0,5,6,9,NA),
"TEST1TOTAL"=c(NA,23,14,16,22),
"TEST2TOTAL"=c(10,16,15,12,NA))
I have columns STUDENT through TEST3B and want to create TEST1TOTAL and TEST2TOTAL. TEST1TOTAL = TEST1A + TEST2A + TEST3A, and likewise TEST2TOTAL = TEST1B + TEST2B + TEST3B. If any score among TEST1A, TEST2A, TEST3A is missing, then TEST1TOTAL is NA.
Here is my attempt, but is there a solution with fewer lines of code? As it stands I would need to write this line out many times, because the tests run from A through O.
TEST1TOTAL=rowSums(df[,c('TEST1A', 'TEST2A', 'TEST3A')], na.rm=TRUE)
Using just base R functions:
output <- data.frame(df1, do.call(cbind, lapply(c("A$", "B$"), function(x) rowSums(df1[, grep(x, names(df1))]))))
Customizing colnames:
> colnames(output)[(ncol(output)-1):ncol(output)] <- c("TEST1TOTAL", "TEST2TOTAL")
> output
STUDENT TEST1A TEST2A TEST3A TEST1B TEST2B TEST3B TEST1TOTAL TEST2TOTAL
1 1 NA NA NA 5 10 0 NA 15
2 2 5 8 10 6 10 5 23 21
3 3 5 4 5 7 9 6 14 22
4 4 6 6 4 4 3 9 16 16
5 5 7 9 6 1 1 NA 22 NA
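Since the question mentions suffixes running from A through O, the same grep-plus-rowSums idea can be driven by a vector of suffix letters. The following is a sketch under the assumption that every score column is named TEST, a number, then one letter; only A and B exist in the sample data, so only those are used here.

```r
df <- data.frame(STUDENT = c(1, 2, 3, 4, 5),
                 TEST1A = c(NA, 5, 5, 6, 7),
                 TEST2A = c(NA, 8, 4, 6, 9),
                 TEST3A = c(NA, 10, 5, 4, 6),
                 TEST1B = c(5, 6, 7, 4, 1),
                 TEST2B = c(10, 10, 9, 3, 1),
                 TEST3B = c(0, 5, 6, 9, NA))

suffixes <- c("A", "B")  # extend to LETTERS[1:15] for A through O

for (k in seq_along(suffixes)) {
  # Columns such as TEST1A, TEST2A, ... for the current suffix
  cols <- grep(paste0("^TEST[0-9]+", suffixes[k], "$"), names(df), value = TRUE)
  # rowSums without na.rm gives NA whenever any score in the row is missing
  df[[paste0("TEST", k, "TOTAL")]] <- unname(rowSums(df[cols]))
}
```

Leaving na.rm at its default of FALSE is what produces the NA totals the question asks for.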
Try:
library(dplyr)
df %>%
mutate(TEST1TOTAL = TEST1A+TEST2A+TEST3A,
TEST2TOTAL = TEST1B+TEST2B+TEST3B)
or
df %>%
mutate(TEST1TOTAL = rowSums(select(df, ends_with("A"))),
TEST2TOTAL = rowSums(select(df, ends_with("B"))))
I think for what you want, Jilber Urbina's solution is the way to go. For completeness' sake (and because I learned something figuring it out), here's a tidyverse way to get the score totals by test number for any number of tests.
The advantage is you don't need to specify the identifiers for the tests (beyond that they're numbered or have a trailing letter) and the same code will work for any number of tests.
library(tidyverse)
df_totals <- df %>%
gather(test, score, -STUDENT) %>% # Convert from wide to long format
mutate(test_num = paste0('TEST', gsub('[^0-9]', '', test),
'TOTAL'), # Extract test_number from variable
test_let = gsub('TEST[0-9]*', '', test)) %>% # Extract test_letter (optional)
group_by(STUDENT, test_num) %>% # group by student + test
summarize(score_tot = sum(score)) %>% # Sum score by student/test
spread(test_num, score_tot) # Spread back to wide format
df_totals
# A tibble: 5 x 4
# Groups: STUDENT [5]
STUDENT TEST1TOTAL TEST2TOTAL TEST3TOTAL
<dbl> <dbl> <dbl> <dbl>
1 1 NA NA NA
2 2 11 18 15
3 3 12 13 11
4 4 10 9 13
5 5 8 10 NA
If you want the individual scores too, just join the totals together with the original:
left_join(df, df_totals, by = 'STUDENT')
STUDENT TEST1A TEST2A TEST3A TEST1B TEST2B TEST3B TEST1TOTAL TEST2TOTAL TEST3TOTAL
1 1 NA NA NA 5 10 0 NA NA NA
2 2 5 8 10 6 10 5 11 18 15
3 3 5 4 5 7 9 6 12 13 11
4 4 6 6 4 4 3 9 10 9 13
5 5 7 9 6 1 1 NA 8 10 NA
I have a big dataset with 240 cases representing 240 patients. They have all undergone neuropsychological tests and filled in questionnaires. Additionally, their significant others (hereafter: proxies) have also filled in questionnaires. Since 'patient' and 'proxy' are nested in 'couples', I want to conduct a multilevel analysis in R. For this, I need to reshape my dataset to run that kind of analysis.
Put simply, I want to 'duplicate' my rows. For the doubled subject IDs I want to add a new variable with 1s and 2s, where 1 stands for patient data and 2 stands for proxy data. Then I want the rows to be filled with (1) all the patient data, with the columns that contain the proxy data NA or empty, and (2) all the proxy data, with all the patient data NA or empty.
Let's say this is my data:
id <- c(1:5)
names <- c('id', 'p1', 'p2', 'p3', 'pr1', 'pr2', 'pr3')
p1 <- c(sample(1:10, 5))
p2 <- c(sample(10:20, 5))
p3 <- c(sample(20:30, 5))
pr1 <- c(sample(1:10, 5))
pr2 <- c(sample(10:20, 5))
pr3 <- c(sample(20:30, 5))
mydf <- as.data.frame(matrix(c(id, p1, p2, p3, pr1, pr2, pr3), nrow = 5))
colnames(mydf) <- names
>mydf
id p1 p2 p3 pr1 pr2 pr3
1 1 6 20 22 1 10 24
2 2 8 11 24 2 18 29
3 3 7 10 25 6 20 26
4 4 3 14 20 10 15 20
5 5 5 19 29 7 14 22
I want my data finally to look like this:
id2 <- rep(c(1:5), each = 2)
names2 <- c('id', 'couple', 'q1', 'q2', 'q3')
couple <- rep(1:2, 5)
p1 <- c(sample(1:10, 5))
p2 <- c(sample(10:20, 5))
p3 <- c(sample(20:30, 5))
pr1 <- c(sample(1:10, 5))
pr2 <- c(sample(10:20, 5))
pr3 <- c(sample(20:30, 5))
mydf <- as.data.frame(matrix(c(id2, couple, p1, p2, p3, pr1, pr2, pr3), nrow = 10, ncol = 5))
colnames(mydf) <- names2
>mydf
id couple q1 q2 q3
1 1 1 6 23 16
2 1 2 10 28 10
3 2 1 1 27 14
4 2 2 7 21 20
5 3 1 5 30 18
6 3 2 12 2 27
7 4 1 10 1 25
8 4 2 13 7 21
9 5 1 11 6 20
10 5 2 18 3 23
Or, if this is not possible, like this:
id couple bb1 bb2 bb3 pbb1 pbb2 pbb3
1 1 1 6 23 16 NA NA NA
2 1 2 NA NA NA 10 28 10
3 2 1 1 27 14 NA NA NA
4 2 2 NA NA NA 7 21 20
5 3 1 5 30 18 NA NA NA
6 3 2 NA NA NA 12 2 27
7 4 1 10 1 25 NA NA NA
8 4 2 NA NA NA 13 7 21
9 5 1 11 6 20 NA NA NA
10 5 2 NA NA NA 18 3 23
Now, to get there, I've tried the melt() function and the gather() function, and it feels like I'm close, but it's still not working the way I want.
Note: in my dataset the variable names are bb1:bb54 for the patient questionnaire and pbb1:pbb54 for the proxy questionnaire.
Example of what I've tried:
df_long <- df_reshape %>%
gather(testname, value, -(bb1:bb11), -(pbb1:pbb11), -id, -pgebdat, -p_age, na.rm=T) %>%
arrange(id)
If I understand what you want correctly, you can gather everything to a very long form and then reshape back to a slightly wider form:
library(tidyverse)
set.seed(47) # for reproducibility
mydf <- data.frame(id = c(1:5),
p1 = c(sample(1:10, 5)),
p2 = c(sample(10:20, 5)),
p3 = c(sample(20:30, 5)),
pr1 = c(sample(1:10, 5)),
pr2 = c(sample(10:20, 5)),
pr3 = c(sample(20:30, 5)))
mydf_long <- mydf %>%
gather(var, val, -id) %>%
separate(var, c('couple', 'q'), -2) %>% # split off the trailing digit; tidyr versions after 0.8.1 need sep = -1 for this
mutate(q = paste0('q', q)) %>%
spread(q, val)
mydf_long
#> id couple q1 q2 q3
#> 1 1 p 10 17 21
#> 2 1 pr 10 11 24
#> 3 2 p 4 13 27
#> 4 2 pr 4 15 20
#> 5 3 p 7 14 30
#> 6 3 pr 1 14 29
#> 7 4 p 6 18 24
#> 8 4 pr 8 20 30
#> 9 5 p 9 16 23
#> 10 5 pr 3 18 25
One approach would be to use unite and separate in tidyr, along with the gather function as well.
I'm using your mydf data frame since it was provided, but it should be pretty straightforward to make any changes:
mydf %>%
unite(p1:p3, col = `1`, sep = ";") %>% # Combine responses of 'p1' through 'p3'
unite(pr1:pr3, col = `2`, sep = ";") %>% # Combine responses of 'pr1' through 'pr3'
gather(couple, value, `1`:`2`) %>% # Form into long data
separate(value, sep = ";", into = c("q1", "q2", "q3"), convert = TRUE) %>% # Separate and retrieve original answers
arrange(id)
Which gives you:
id couple q1 q2 q3
1 1 1 9 18 25
2 1 2 10 18 30
3 2 1 1 11 29
4 2 2 2 15 29
5 3 1 10 19 26
6 3 2 3 19 25
7 4 1 7 10 23
8 4 2 1 20 28
9 5 1 6 16 21
10 5 2 5 12 26
Our numbers are different since they were all randomly generated with sample.
Edited per @alistaire's comment: add convert = TRUE to the separate call to make sure the responses are still of class integer.
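For comparison, the same long layout can be sketched in base R by building one block of rows for patients and one for proxies, giving both the same question names, and stacking them. This is only an illustration under the assumption that patient and proxy columns pair up by position; the values are copied from the mydf printed in the question.

```r
mydf <- data.frame(id  = 1:5,
                   p1  = c(6, 8, 7, 3, 5),
                   p2  = c(20, 11, 10, 14, 19),
                   p3  = c(22, 24, 25, 20, 29),
                   pr1 = c(1, 2, 6, 10, 7),
                   pr2 = c(10, 18, 20, 15, 14),
                   pr3 = c(24, 29, 26, 20, 22))

q_names <- paste0("q", 1:3)

# One block per respondent type, with shared question names
patient <- cbind(mydf["id"], couple = 1L,
                 setNames(mydf[paste0("p", 1:3)], q_names))
proxy   <- cbind(mydf["id"], couple = 2L,
                 setNames(mydf[paste0("pr", 1:3)], q_names))

# Stack the blocks and interleave them by id
mydf_long <- rbind(patient, proxy)
mydf_long <- mydf_long[order(mydf_long$id, mydf_long$couple), ]
rownames(mydf_long) <- NULL
```

For the real data the column vectors would be paste0("bb", 1:54) and paste0("pbb", 1:54) instead.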
Suppose I have a data frame (df1) with a column named x:
df1<-as.data.frame(x=c(4,3,2,16,7,8,9,1,12))
colnames(df1)<-"x"
df1[2,1]<-NA
df1[3,1]<-NA
df1[4,1]<-NA
The output is:
> df1
x
1 4
2 NA
3 NA
4 NA
5 7
6 8
7 9
8 1
9 12
I want to add a column to the data frame. The new column (y) should fill each NA with the nearest non-NA value above it.
The code and the output are (this is what I want):
df1$y<-na.locf(df1, fromLast = FALSE)
> df1
x x
1 4 4
2 NA 4
3 NA 4
4 NA 4
5 7 7
6 8 8
7 9 9
8 1 1
9 12 12
Note: I don't understand why the second column's name is "x" although I defined it as "y".
However, the above method naturally gives an error when the first entry is NA, as below:
df2<-as.data.frame(c(4,3,2,16,7,8,9,1,12))
colnames(df2)<-"x"
df2[1,1]<-NA
df2[2,1]<-NA
df2[3,1]<-NA
> df2
x
1 NA
2 NA
3 NA
4 16
5 7
6 8
7 9
8 1
9 12
When I apply the below code:
df2$y<-na.locf(df2, fromLast = FALSE)
I get the below error:
Error in `$<-.data.frame`(`*tmp*`, "y", value = list(x = c(16, 7, 8, 9, :
replacement has 6 rows, data has 9
In such situations I just want the opposite of na.locf(df2, fromLast = FALSE), namely to fill the NAs with the first non-NA value below them.
Desired output is:
x y
1 NA 16
2 NA 16
3 NA 16
4 16 16
5 7 7
6 8 8
7 9 9
8 1 1
9 12 12
So, using the tryCatch function, I wrote the code below:
df2$y<-tryCatch(na.locf(df2, fromLast = FALSE),
error=function(err)
{na.locf(df2, fromLast = TRUE)})
However, I got this error:
Error in `$<-.data.frame`(`*tmp*`, "y", value = list(x = c(16, 7, 8, 9, :
replacement has 6 rows, data has 9
So, in summary, the problem is:
if the data frame's first entry is not NA, then fill each NA with the first non-NA element above it;
if the data frame's first entry is NA, then fill each NA with the first non-NA element below it.
How can I do this in R, especially with the tryCatch function? I also don't understand why the second column's name appears as "x" instead of "y".
I will be very glad for any help. Thanks a lot.
We can do a double na.locf with the first one having the option na.rm = FALSE
library(zoo)
na.locf(na.locf(df2, na.rm = FALSE), fromLast = TRUE)
# x
#1 16
#2 16
#3 16
#4 16
#5 7
#6 8
#7 9
#8 1
#9 12
If we want to have two columns
transform(df2, y = na.locf(na.locf(x, na.rm = FALSE), fromLast = TRUE))
# x y
#1 NA 16
#2 NA 16
#3 NA 16
#4 16 16
#5 7 7
#6 8 8
#7 9 9
#8 1 1
#9 12 12
NOTE: Make sure to assign it to a new object or to the same object i.e. df2 <- transform(...
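Two points from the question follow from how na.locf is called there. With its default na.rm = TRUE it drops leading NAs, which is why the result has 6 rows instead of 9 and why tryCatch never reaches a useful fallback. Passing the whole data frame returns a data frame, so the new column carries the inner name x. For readers without zoo, a rough base R equivalent of the double fill is sketched below; fill_both is a hypothetical helper name, not a zoo function.

```r
# Fill each NA with the last non-NA value above it; leading NAs
# take the first non-NA value below them instead.
fill_both <- function(x) {
  ok <- !is.na(x)
  if (!any(ok)) return(x)               # nothing to fill from
  # Position of the most recent non-NA at or before each element (0 if none)
  idx <- cummax(ifelse(ok, seq_along(x), 0L))
  idx[idx == 0] <- which(ok)[1]          # leading NAs: use the first non-NA
  x[idx]
}

df2 <- data.frame(x = c(NA, NA, NA, 16, 7, 8, 9, 1, 12))
df2$y <- fill_both(df2$x)
```

Because the function is applied to the vector df2$x rather than to the data frame, the new column is named y as intended.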