I have data with over 6k columns. Each result has columns with data that are always the same.
XCODE Age Sex ResultA Sex ResultB
1 X001 12 2 2 2 4
2 X002 23 2 4 2 66
3 X003 NA NA NA NA NA
4 X004 32 1 1 1 3
5 X005 NA NA NA NA NA
6 X001 NA NA NA NA NA
7 X002 NA NA NA NA NA
8 X003 33 1 8 1 6
9 X004 NA NA NA NA NA
10 X005 55 2 8 2 8
I would like to remove duplicated columns, e.g. the second Sex variable. Is it possible to do that with data.table?
You can use match if you need to check for equality of all values.
df2 <- df[, unique(match(df, df)), with = FALSE]
df2
# XCODE Age Sex ResultA ResultB
# 1 X001 12 2 2 4
# 2 X002 23 2 4 66
# 3 X003 NA NA NA NA
# 4 X004 32 1 1 3
# 5 X005 NA NA NA NA
# 6 X001 NA NA NA NA
# 7 X002 NA NA NA NA
# 8 X003 33 1 8 6
# 9 X004 NA NA NA NA
# 10 X005 55 2 8 8
Data used:
df <- fread('
XCODE Age Sex ResultA Sex ResultB
1 X001 12 2 2 2 4
2 X002 23 2 4 2 66
3 X003 NA NA NA NA NA
4 X004 32 1 1 1 3
5 X005 NA NA NA NA NA
6 X001 NA NA NA NA NA
7 X002 NA NA NA NA NA
8 X003 33 1 8 1 6
9 X004 NA NA NA NA NA
10 X005 55 2 8 2 8
')[, -'V1']
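The same content-based idea can be sketched in base R on a plain data.frame. This is only an illustration on a made-up three-row subset of the data; the names keep and df_unique are invented here. It relies on duplicated() flagging list elements (columns) whose contents repeat an earlier one.

```r
# Toy data with a repeated 'Sex' column (check.names = FALSE keeps the duplicated name)
df <- data.frame(XCODE   = c("X001", "X002", "X003"),
                 Age     = c(12, 23, NA),
                 Sex     = c(2, 2, NA),
                 ResultA = c(2, 4, NA),
                 Sex     = c(2, 2, NA),
                 ResultB = c(4, 66, NA),
                 check.names = FALSE)

# duplicated() on a list marks columns whose contents repeat an earlier column
keep <- !duplicated(as.list(df))

# Keep only the first occurrence of each distinct column
df_unique <- df[, keep, drop = FALSE]
names(df_unique)
```

Because the comparison is on contents, two columns that share a name but hold different values would both be kept.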
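Try this (if df is a data.table, add with = FALSE so that the names are used to select columns instead of being returned as a character vector):
df[, unique(colnames(df))]
One caveat: this selects columns by name only, so for every repeated name only the first column is kept. In your case, the second Sex column is dropped even if the two columns share a name but have different content.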
If you have duplicated columns with different names, you can transpose your dataframe, which lets you use the unique function (it removes duplicated rows) to solve your problem. You then transpose it back and convert it back to a dataframe (because it became a matrix when you transposed it).
df = data.frame(c = 1:5, a = c("A", "B","C","D","E"), b = 1:5)
df = t(df)
df = unique(df)
df = t(df)
df = data.frame(df)
Edit: as markus points out, this is probably not a good option if you have columns of multiple types, because when t() coerces your dataframe to a matrix it also coerces all your variables to the same type.
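To make the coercion caveat concrete, here is a small sketch using the same toy dataframe. The names res and fixed are made up for this illustration, and utils::type.convert() is just one possible way to recover the column types afterwards, not part of the original answer.

```r
df <- data.frame(c = 1:5, a = c("A", "B", "C", "D", "E"), b = 1:5,
                 stringsAsFactors = FALSE)

# Transpose, drop duplicated rows (i.e. duplicated columns of df), transpose back
res <- data.frame(t(unique(t(df))), stringsAsFactors = FALSE)

names(res)          # column b is gone because it duplicated column c
sapply(res, class)  # the numeric column c no longer has a numeric class

# One way to guess the types back from the character data
fixed <- type.convert(res, as.is = TRUE)
sapply(fixed, class)
```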
I have generated random data like this.
data <- replicate(10,sample(0:9,10,rep=FALSE))
ind <- which(data %in% sample(data, 5))
#now replace those indices in data with NA
data[ind]<-NA
#here is our vector with 15 random NAs
data = as.data.frame(data)
rownames(data) = 1:10
colnames(data) = 1:10
data
This results in a 10 x 10 data frame with NA in some cells. How can I reorder the entries of each column so that every numeric value is placed in the row whose number is one more than the value (i.e. value = row number - 1), with NA in every row that has no matching value? For example, in the first column a 0 should end up in row 1, a 1 in row 2, and so on.
How can I do this? I have no clue at all. We can order decreasing or increasing and put NA in the last order, but that is not what I want.
You can make a helper function to assign values to indices at (values + 1), then apply the function over all columns:
fx <- function(x) {
vals <- x[!is.na(x)]
pos <- vals + 1
out <- rep(NA, length(x))
out[pos] <- vals
out
}
as.data.frame(sapply(data, fx))
1 2 3 4 5 6 7 8 9 10
1 NA 0 NA 0 0 0 0 NA 0 0
2 NA NA NA 1 1 NA NA NA NA NA
3 2 NA 2 2 NA NA NA NA 2 NA
4 3 NA 3 3 NA NA 3 NA 3 3
5 4 4 4 4 NA 4 NA 4 4 NA
6 5 5 NA 5 NA NA 5 5 5 NA
7 NA 6 6 NA 6 NA NA 6 NA NA
8 7 NA 7 7 NA 7 7 NA NA 7
9 NA NA NA NA 8 8 8 8 8 8
10 9 9 NA NA 9 NA NA 9 NA 9
Starting data:
set.seed(13)
data <- replicate(10, sample(
c(0:9, rep(NA, 10)),
10,
replace = FALSE
))
data <- as.data.frame(data)
colnames(data) <- 1:10
data
1 2 3 4 5 6 7 8 9 10
1 2 NA NA 2 NA NA 0 NA 3 7
2 4 NA NA 4 NA NA NA NA 2 9
3 9 9 NA 3 9 4 NA 6 4 0
4 NA NA NA 1 6 NA NA 4 NA NA
5 5 6 3 0 NA NA 5 8 8 NA
6 NA NA 7 NA NA NA 7 NA 5 3
7 3 4 6 NA 1 0 NA 5 NA NA
8 NA NA NA 7 0 7 NA NA 0 NA
9 NA 0 4 NA 8 8 8 9 NA 8
10 7 5 2 5 NA NA 3 NA NA NA
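The helper above can also be read the other way round: for each row, check whether the value "row number - 1" occurs in the column. A minimal sketch of that view follows; the function name place_by_value is made up here, and it assumes, like the answer above, that each column holds values between 0 and the number of rows minus 1.

```r
# For each row i, keep the value i - 1 if it occurs in x, otherwise NA
place_by_value <- function(x) {
  target <- seq_along(x) - 1          # value expected in rows 1..n: 0, 1, ..., n - 1
  ifelse(target %in% x, target, NA)
}

# First column of the starting data shown above
x <- c(2, 4, 9, NA, 5, NA, 3, NA, NA, 7)
place_by_value(x)
# NA NA 2 3 4 5 NA 7 NA 9
```

It can be applied to every column in the same way, e.g. as.data.frame(sapply(data, place_by_value)).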
I need to create a dataframe with all possible combinations of a variable. I found an example using data.table that works like this:
df <- data.frame("Age"=1:10)
df <- setDT(df)
df[,lag.Age1 := c(NA,Age[-.N])]
That creates this:
Age lag.Age1
1: 1 NA
2: 2 1
3: 3 2
.. .. ..
10: 10 9
Now, I want to keep adding lagged vectors that produce something like this:
Age lag.Age1 lag.Age2 lag.Age3
1: 1 NA NA NA
2: 2 1 NA NA
3: 3 2 1 NA
.. .. .. .. ..
10: 10 9 8 7
I tried this for the third column:
df[,lag.Age2 := c(NA,NA,Age[1:8])]
But I really don't get how data.table works here. That line runs but it doesn't do anything.
EDIT: what if the dataframe has a group variable and I want the lag to be done by group? For the first lag it is just:
df <- data.frame("Age"=1:10, "Group"=c(rep("A",4),rep("B",6)))
setDT(df)
df[,lag.Age1 := c(NA,Age[-.N]), by="Group"]
How would this work now? Note that the groups have different lengths.
data.table::shift() is very powerful because you can provide a vector of offsets. For example, if you want n lag columns (from 1 to n), you can do this:
n=3
cols = paste0("lag.Age",1:n)
df[, c(cols):=shift(Age,1:n), Group]
Output:
Age Group lag.Age1 lag.Age2 lag.Age3
<int> <char> <int> <int> <int>
1: 1 A NA NA NA
2: 2 A 1 NA NA
3: 3 A 2 1 NA
4: 4 A 3 2 1
5: 5 B NA NA NA
6: 6 B 5 NA NA
7: 7 B 6 5 NA
8: 8 B 7 6 5
9: 9 B 8 7 6
10: 10 B 9 8 7
Alternatively:
df[, c(paste0("lag.Age",1:3)):=shift(Age,1:3), Group]
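For readers without data.table at hand, here is a minimal base R sketch of what shift(Age, 1:n) computes within each group. The helper lag_k is made up for this illustration and is not part of any package; ave() applies it group by group.

```r
# Hypothetical helper: lag a vector by k positions, padding the front with NA
lag_k <- function(x, k) {
  k <- min(k, length(x))                 # a lag longer than the group gives all NA
  c(rep(NA, k), head(x, length(x) - k))
}

df <- data.frame(Age = 1:10, Group = c(rep("A", 4), rep("B", 6)))
n <- 3
for (k in seq_len(n)) {
  # ave() splits Age by Group, applies lag_k, and puts the pieces back in place
  df[[paste0("lag.Age", k)]] <- ave(df$Age, df$Group, FUN = function(x) lag_k(x, k))
}
df
```

This loops over the offsets one at a time, so it is only meant to show the logic; shift() handles all offsets in a single call.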
If you want to have the number of lags vary by group, where the number equals the number of observations in that group-1, then one approach is to do this:
# make function to return lags based on length of x
f <- function(x) shift(x,1:(length(x)-1))
# get unique groups
grps= unique(df$Group)
# set as DT, and use lapply()
setDT(df)
grp_lags = lapply(grps, \(g) f(df[Group==g, Age]))
names(grp_lags)<-grps
Output:
$A
$A[[1]]
[1] NA 1 2 3
$A[[2]]
[1] NA NA 1 2
$A[[3]]
[1] NA NA NA 1
$B
$B[[1]]
[1] NA 5 6 7 8 9
$B[[2]]
[1] NA NA 5 6 7 8
$B[[3]]
[1] NA NA NA 5 6 7
$B[[4]]
[1] NA NA NA NA 5 6
$B[[5]]
[1] NA NA NA NA NA 5
Or, if you are okay with lots of extra columns (i.e. for the groups with fewer observations), you can do this:
n = df[, .N, Group][,max(N)]
cols = paste0("lag.Age",1:n)
df[, c(cols):=shift(Age,1:n), Group]
Output:
Age Group lag.Age1 lag.Age2 lag.Age3 lag.Age4 lag.Age5 lag.Age6
1: 1 A NA NA NA NA NA NA
2: 2 A 1 NA NA NA NA NA
3: 3 A 2 1 NA NA NA NA
4: 4 A 3 2 1 NA NA NA
5: 5 B NA NA NA NA NA NA
6: 6 B 5 NA NA NA NA NA
7: 7 B 6 5 NA NA NA NA
8: 8 B 7 6 5 NA NA NA
9: 9 B 8 7 6 5 NA NA
10: 10 B 9 8 7 6 5 NA
Probably there's a very easy solution to this but I can't figure it out for some reason. This is what my data (in R) look like (except for value_new which is the exact description of what I need!):
dat<-data.frame("id"=c(1,2,3,4,5,NA,NA,NA,NA,NA),
"value"=c(rep(NA,5),7,NA,4,1,9),
"value_new"=c(7,NA,4,1,9,rep(NA,5)))
I hope that this is self-explanatory. What I need is to take the values of "value" from the rows where id is NA (i.e. the last five rows) and paste them into the first five rows (i.e. where id is not NA) of a new variable I'd like to call "value_new".
What is an easy way of doing this? I'd basically need to cut out the bottom half and paste it as new variable(s) in the top section of the dataframe. Hope this makes sense.
dat<-data.frame("id"=c(1,2,3,4,5,NA,NA,NA,NA,NA),
"value"=c(rep(NA,5),7,NA,4,1,9))
dat$value_new = NA
dat$value_new[!is.na(dat$id)] = dat$value[is.na(dat$id)]
dat
# id value value_new
# 1 1 NA 7
# 2 2 NA NA
# 3 3 NA 4
# 4 4 NA 1
# 5 5 NA 9
# 6 NA 7 NA
# 7 NA NA NA
# 8 NA 4 NA
# 9 NA 1 NA
# 10 NA 9 NA
If you have more rows with a non-NA id than rows with an NA id, you can use:
dat<-data.frame("id"=c(1,2,3,4,5,6,NA,NA,NA,NA,NA),
"value"=c(rep(NA,6),7,NA,4,1,9))
k = sum(is.na(dat$id))
dat$value_new = NA
dat$value_new[!is.na(dat$id)][1:k] = dat$value[is.na(dat$id)]
dat
# id value value_new
# 1 1 NA 7
# 2 2 NA NA
# 3 3 NA 4
# 4 4 NA 1
# 5 5 NA 9
# 6 6 NA NA
# 7 NA 7 NA
# 8 NA NA NA
# 9 NA 4 NA
# 10 NA 1 NA
# 11 NA 9 NA
where k is the number of values you'll replace in the top part of your new column.
dat<-data.frame("id"=c(1,2,3,4,5,NA,NA,NA,NA,NA),
"value"=c(rep(NA,5),7,NA,4,1,9),
"value_new"=c(7,NA,4,1,9,rep(NA,5)))
ind <- which(!is.na(dat$value))[1]
newcol <- `length<-`(dat$value[ind:nrow(dat)], nrow(dat))
dat$value_new2 <- newcol
# id value value_new value_new2
#1 1 NA 7 7
#2 2 NA NA NA
#3 3 NA 4 4
#4 4 NA 1 1
#5 5 NA 9 9
#6 NA 7 NA NA
#7 NA NA NA NA
#8 NA 4 NA NA
#9 NA 1 NA NA
#10 NA 9 NA NA
Short version:
dat$value_new2 <- `length<-`(dat$value[which(!is.na(dat$value))[1]:nrow(dat)], nrow(dat))
This removes the leading run of NAs from value and pads the end with the same number of NAs. The ids are not considered here.
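The one-liner above can be wrapped in a small function. This is a sketch only; the name drop_leading_na is made up, and the early return for an all-NA vector is an added assumption (without it, which(...)[1] would be NA and the subsetting would fail).

```r
# Hypothetical helper: drop the leading run of NAs and pad the end with NA
drop_leading_na <- function(x) {
  first <- which(!is.na(x))[1]         # position of the first non-NA value
  if (is.na(first)) return(x)          # all NA: nothing to move
  `length<-`(x[first:length(x)], length(x))
}

value <- c(rep(NA, 5), 7, NA, 4, 1, 9)
drop_leading_na(value)
# 7 NA 4 1 9 NA NA NA NA NA
```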
I would like to build a dataframe that lists all possible combinations of 6 numbers.
I realised that I can use combn(), but only with one value for m at a time. With a bit of playing around I got the desired result by going through it step by step with the following code (rbind.fill() comes from the plyr package):
combi1 <- data.frame(c(1:6))
colnames(combi1) <- 'X1'
combi2 <- data.frame(t(combn(c(1:6), 2)))
combi3 <- data.frame(t(combn(c(1:6), 3)))
combi4 <- data.frame(t(combn(c(1:6), 4)))
combi5 <- data.frame(t(combn(c(1:6), 5)))
combi6 <- data.frame(t(combn(c(1:6), 6)))
Combi <- rbind.fill(combi1, combi2, combi3, combi4, combi5, combi6)
I had to transpose each of the DFs to get them in the right shape.
My problem is that this seems to be quite an inefficient method, and maybe a bit simplistic. I thought there must surely be some quicker way to code this, but I haven't found any solution online that gives me what I'd like.
Possibly build it into a function or a loop somehow? I'm fairly new to R though and haven't had a great deal of practice writing functions.
Is this what you want?
combis <- vector("list", 6)
combi1 <- data.frame(c(1:6))
colnames(combi1) <- 'X1'
combis[[1]] <- combi1
combis[2:6] <- lapply(2:6, function(n) data.frame(t(combn(c(1:6), n))))
do.call(plyr::rbind.fill, combis)
Result:
X1 X2 X3 X4 X5 X6
1 1 NA NA NA NA NA
2 2 NA NA NA NA NA
3 3 NA NA NA NA NA
4 4 NA NA NA NA NA
5 5 NA NA NA NA NA
6 6 NA NA NA NA NA
7 1 2 NA NA NA NA
8 1 3 NA NA NA NA
9 1 4 NA NA NA NA
10 1 5 NA NA NA NA
11 1 6 NA NA NA NA
12 2 3 NA NA NA NA
13 2 4 NA NA NA NA
14 2 5 NA NA NA NA
15 2 6 NA NA NA NA
16 3 4 NA NA NA NA
17 3 5 NA NA NA NA
18 3 6 NA NA NA NA
19 4 5 NA NA NA NA
20 4 6 NA NA NA NA
21 5 6 NA NA NA NA
22 1 2 3 NA NA NA
23 1 2 4 NA NA NA
24 1 2 5 NA NA NA
25 1 2 6 NA NA NA
26 1 3 4 NA NA NA
27 1 3 5 NA NA NA
28 1 3 6 NA NA NA
29 1 4 5 NA NA NA
30 1 4 6 NA NA NA
31 1 5 6 NA NA NA
32 2 3 4 NA NA NA
33 2 3 5 NA NA NA
34 2 3 6 NA NA NA
35 2 4 5 NA NA NA
36 2 4 6 NA NA NA
37 2 5 6 NA NA NA
38 3 4 5 NA NA NA
39 3 4 6 NA NA NA
40 3 5 6 NA NA NA
41 4 5 6 NA NA NA
42 1 2 3 4 NA NA
43 1 2 3 5 NA NA
44 1 2 3 6 NA NA
45 1 2 4 5 NA NA
46 1 2 4 6 NA NA
47 1 2 5 6 NA NA
48 1 3 4 5 NA NA
49 1 3 4 6 NA NA
50 1 3 5 6 NA NA
51 1 4 5 6 NA NA
52 2 3 4 5 NA NA
53 2 3 4 6 NA NA
54 2 3 5 6 NA NA
55 2 4 5 6 NA NA
56 3 4 5 6 NA NA
57 1 2 3 4 5 NA
58 1 2 3 4 6 NA
59 1 2 3 5 6 NA
60 1 2 4 5 6 NA
61 1 3 4 5 6 NA
62 2 3 4 5 6 NA
63 1 2 3 4 5 6
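If you would rather avoid the plyr dependency, the same table can be sketched in base R by padding each combination matrix with NA columns before stacking. The names x, rows and Combi are chosen for this illustration; note that the resulting columns are called V1 to V6 rather than X1 to X6.

```r
x <- 1:6

# For every size m, build a matrix with one combination per row,
# then pad it on the right with NA so that all matrices have length(x) columns
rows <- lapply(seq_along(x), function(m) {
  cmb <- t(combn(x, m))
  pad <- matrix(NA_integer_, nrow = nrow(cmb), ncol = length(x) - m)
  cbind(cmb, pad)
})

# Stack the padded matrices and convert to a data frame
Combi <- as.data.frame(do.call(rbind, rows))
dim(Combi)
# 63 6
```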
I have a large dataframe, 300+ columns (time series) with about 2600 observations. The columns are filled with a lot of NA's and then a short time series, and then typically NA's again. I would like to find the first non-NA value in each column and replace it with NA.
This is what I'm hoping to achieve, only with a much bigger dataframe:
Before:
x1 x2 x3 x4
1 NA NA NA NA
2 NA NA NA NA
3 1 1 NA NA
4 2 2 1 1
5 3 3 2 2
6 4 4 3 3
7 5 5 4 4
8 6 6 5 5
9 7 7 6 6
10 8 8 7 7
11 9 9 NA NA
12 10 10 NA NA
13 NA NA NA NA
14 NA NA NA NA
After:
x1 x2 x3 x4
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 2 2 NA NA
5 3 3 2 2
6 4 4 3 3
7 5 5 4 4
8 6 6 5 5
9 7 7 6 6
10 8 8 7 7
11 9 9 NA NA
12 10 10 NA NA
13 NA NA NA NA
14 NA NA NA NA
I've searched around and found a way to do this for a single column, but my efforts to apply it to the whole dataframe have proven difficult.
I have created an example dataframe to reproduce my original dataframe:
#Dataframe with NA
x1=x2=c(NA,NA,1:10,NA,NA)
x3=x4=c(NA,NA,NA,1:7,NA,NA,NA,NA)
df=data.frame(x1,x2,x3,x4)
I have used this to replace the first value with NA in one column (provided by @Joshua Ulrich here); however, I would like to apply it to all columns without manually editing the code 300+ times:
NonNAindex <- which(!is.na(df[,1]))
firstNonNA <- min(NonNAindex)
is.na(df[,1]) <- seq(firstNonNA, length.out=1)
I have tried to set the above as a function and run it for all columns with apply/lapply, as well as a for loop, but haven't really figured out how to apply the changes to my dataframe. I'm sure there is something I've completely overlooked as I'm just taking my first small steps in R.
All suggestions would be highly appreciated!
We can use base R (df is the example data frame from the question):
df[] <- lapply(df, function(x) replace(x, which(!is.na(x))[1], NA))
df
# x1 x2 x3 x4
#1 NA NA NA NA
#2 NA NA NA NA
#3 NA NA NA NA
#4 2 2 NA NA
#5 3 3 2 2
#6 4 4 3 3
#7 5 5 4 4
#8 6 6 5 5
#9 7 7 6 6
#10 8 8 7 7
#11 9 9 NA NA
#12 10 10 NA NA
#13 NA NA NA NA
#14 NA NA NA NA
Or, as @thelatemail suggested:
df[] <- lapply(df, function(x) replace(x, Position(Negate(is.na), x), NA))
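Since the question mentions trying a for loop, here is a sketch of the same logic written as an explicit loop over the example data. The variable names col and first are made up for this illustration; the explicit check for an all-NA column is an added assumption about how such columns should be treated (they are left unchanged).

```r
# Example data from the question
x1 <- x2 <- c(NA, NA, 1:10, NA, NA)
x3 <- x4 <- c(NA, NA, NA, 1:7, NA, NA, NA, NA)
df <- data.frame(x1, x2, x3, x4)

for (col in names(df)) {
  first <- which(!is.na(df[[col]]))[1]   # row of the first non-NA value, NA if none
  if (!is.na(first)) {
    df[[col]][first] <- NA               # overwrite it in place
  }
}
df
```

Assigning through df[[col]] inside the loop is what writes the change back to the data frame, which is the step that is easy to miss when wrapping the single-column code in a function.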
Since you would like to do this for all columns, you could use the mutate_all function from dplyr. See http://dplyr.tidyverse.org/ for more information; in particular, the examples there may be useful.
library(dplyr)
# funs() is deprecated since dplyr 0.8.0, so a formula lambda is used instead
mutate_all(df, ~ if_else(row_number() == min(which(!is.na(.))), NA_integer_, .))
#> x1 x2 x3 x4
#> 1 NA NA NA NA
#> 2 NA NA NA NA
#> 3 NA NA NA NA
#> 4 2 2 NA NA
#> 5 3 3 2 2
#> 6 4 4 3 3
#> 7 5 5 4 4
#> 8 6 6 5 5
#> 9 7 7 6 6
#> 10 8 8 7 7
#> 11 9 9 NA NA
#> 12 10 10 NA NA
#> 13 NA NA NA NA
#> 14 NA NA NA NA