Replace all values in a data.table given a condition - r

How would you replace all values in a data.table given a condition?
For example
ppp <- data.table(A=1:6,B=6:1,C=1:6,D=3:8)
A B C D
1 6 1 3
2 5 2 4
3 4 3 5
4 3 4 6
5 2 5 7
6 1 6 8
I want to replace all "6" by NA
A B C D
1 NA 1 3
2 5 2 4
3 4 3 5
4 3 4 NA
5 2 5 7
NA 1 NA 8
I've tried something like
ppp[,ifelse(.SD==6,NA,.SD)]
but it doesn't work, it produces a much wider table.

A native data.table way to do this would be:
for(col in names(ppp)) set(ppp, i=which(ppp[[col]]==6), j=col, value=NA)
# Test
> ppp
A B C D
1: 1 NA 1 3
2: 2 5 2 4
3: 3 4 3 5
4: 4 3 4 NA
5: 5 2 5 7
6: NA 1 NA 8
This approach, while perhaps more verbose, should be significantly faster than ppp[ppp == 6] <- NA on large tables, because set() assigns by reference and so avoids copying all the columns.
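The loop above can be wrapped in a small helper. This is only a sketch: the function name replace_with_na is hypothetical (it is not part of data.table), and it assumes the value can be compared to every column with ==.

```r
library(data.table)

# Hypothetical helper (not part of data.table): set every cell equal to
# `value` to NA, modifying `dt` by reference via set().
replace_with_na <- function(dt, value) {
  for (col in names(dt)) {
    # which() drops NA comparisons, so cells that are already NA are skipped
    rows <- which(dt[[col]] == value)
    if (length(rows) > 0) {
      set(dt, i = rows, j = col, value = NA)
    }
  }
  invisible(dt)
}

ppp <- data.table(A = 1:6, B = 6:1, C = 1:6, D = 3:8)
replace_with_na(ppp, 6L)
print(ppp)
```

Because set() works by reference, ppp itself is changed and no assignment of the result is needed.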

Even easier:
ppp[ppp == 6] <- NA
ppp
A B C D
1: 1 NA 1 3
2: 2 5 2 4
3: 3 4 3 5
4: 4 3 4 NA
5: 5 2 5 7
6: NA 1 NA 8
Importantly, this doesn't change its class:
is.data.table(ppp)
[1] TRUE
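For comparison with the ifelse(.SD == 6, NA, .SD) attempt in the question, here is a minimal sketch of a column-wise variant that keeps the original shape. It applies replace() to each column of .SD separately; unlike the set() loop, it returns a new table rather than modifying ppp in place.

```r
library(data.table)

ppp <- data.table(A = 1:6, B = 6:1, C = 1:6, D = 3:8)

# Process one column at a time so the result has the same columns as ppp.
# which() is used so that existing NA values do not produce NA indices.
res <- ppp[, lapply(.SD, function(x) replace(x, which(x == 6), NA))]
print(res)
```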


Is there a way to use a column to label my variables in R? [duplicate]

This question already has answers here:
R: Assign variable labels of data frame columns
(4 answers)
Closed 2 years ago.
I am trying to assign variable labels to my columns in R. I was able to create a list of values using the following code:
var.labels <- dataframe$`var name`
Now I want to use this list as variable labels for the columns of the data frame. I tried this code, but it did not work:
label(dataframe) = as.list(var.labels[match(names(dataframe), names(var.labels))]
Thank you for your help.
Maybe this?
var.labels <- dataframe$`var name`
colnames(dataframe) <- var.labels
Example:
df <- as.data.frame(replicate(n = 13, expr = sample(c(1:8, NA), 13, replace = TRUE)))
df$names <- LETTERS[1:13]
df
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 names
#>1 1 NA 5 3 2 5 7 NA 3 NA NA NA 1 A
#>2 7 5 1 6 4 3 1 6 4 3 NA 3 NA B
#>3 7 3 1 2 7 2 6 4 6 1 7 3 2 C
#>4 3 1 5 1 5 6 1 2 2 NA 8 5 1 D
#>5 3 6 3 7 4 2 6 7 7 NA 1 2 8 E
#>6 NA 4 4 1 8 3 8 NA 6 3 8 4 NA F
#>7 1 8 3 8 1 3 2 4 7 4 2 1 2 G
#>8 NA NA 2 3 5 4 5 1 4 7 8 5 3 H
#>9 7 NA 3 2 7 NA 2 8 7 NA 6 8 6 I
#>10 6 8 5 3 6 5 5 3 4 8 NA 5 1 J
#>11 1 7 8 5 1 2 3 NA NA 3 2 6 7 K
#>12 8 4 1 8 7 NA 6 6 5 6 7 NA 2 L
#>13 5 8 5 1 2 1 6 3 NA 1 7 3 5 M
colnames(df) <- df$names  # 13 labels for 14 columns, so the last name becomes NA
df
#> A B C D E F G H I J K L M NA
#>1 1 NA 5 3 2 5 7 NA 3 NA NA NA 1 A
#>2 7 5 1 6 4 3 1 6 4 3 NA 3 NA B
#>3 7 3 1 2 7 2 6 4 6 1 7 3 2 C
#>4 3 1 5 1 5 6 1 2 2 NA 8 5 1 D
#>5 3 6 3 7 4 2 6 7 7 NA 1 2 8 E
#>6 NA 4 4 1 8 3 8 NA 6 3 8 4 NA F
#>7 1 8 3 8 1 3 2 4 7 4 2 1 2 G
#>8 NA NA 2 3 5 4 5 1 4 7 8 5 3 H
#>9 7 NA 3 2 7 NA 2 8 7 NA 6 8 6 I
#>10 6 8 5 3 6 5 5 3 4 8 NA 5 1 J
#>11 1 7 8 5 1 2 3 NA NA 3 2 6 7 K
#>12 8 4 1 8 7 NA 6 6 5 6 7 NA 2 L
#>13 5 8 5 1 2 1 6 3 NA 1 7 3 5 M
# Finally, remove the names column
df[14] <- NULL
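A possible variant of the example avoids the NA column name by dropping the label column before renaming. This is a sketch that assumes, as in the example, that there is exactly one label per remaining column; set.seed() is added only to make the random data reproducible.

```r
set.seed(1)
df <- as.data.frame(replicate(n = 13, expr = sample(c(1:8, NA), 13, replace = TRUE)))
df$names <- LETTERS[1:13]

labels <- df$names   # keep the labels before removing the column
df$names <- NULL     # 13 columns remain
names(df) <- labels  # lengths now match, so no NA name is produced
print(names(df))
```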

Delete rows when all numbers within a cycle of another variable equal to NA

My data are as follow:
Row x y
1 1 2
2 2 3
3 3 4
4 4 3
5 5 NA
6 1 NA
7 2 NA
8 3 NA
9 4 NA
10 5 7
11 1 NA
12 2 NA
13 3 NA
14 4 NA
15 5 NA
I wish to delete Rows 11 to 15, since y is NA for the whole cycle of x (y equals NA whatever value x takes in Rows 11 to 15). I do not want to delete the other rows, since at least one value of y is not NA as x moves from 1 to 5 (for example, in Rows 6 to 10, y is 7 when x is 5, so I keep Rows 6 to 10). How should I write R code to accomplish this?
Using base R, assuming that x is sorted within each cycle and that every cycle starts from 1:
subset(df,!ave(is.na(y),cumsum(c(1,diff(x)<0)),FUN=all))
Row x y
1 1 1 2
2 2 2 3
3 3 3 4
4 4 4 3
5 5 5 NA
6 6 1 NA
7 7 2 NA
8 8 3 NA
9 9 4 NA
10 10 5 7
Using tidyverse (dplyr):
library(dplyr)
df %>%
  group_by(m = cumsum(c(1, diff(x) < 0))) %>%
  filter(!all(is.na(y)))
# A tibble: 10 x 4
# Groups: m [2]
Row x y m
<int> <int> <int> <dbl>
1 1 1 2 1
2 2 2 3 1
3 3 3 4 1
4 4 4 3 1
5 5 5 NA 1
6 6 1 NA 2
7 7 2 NA 2
8 8 3 NA 2
9 9 4 NA 2
10 10 5 7 2
Of course, you can then ungroup and drop the helper column m.
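Both answers rely on the same grouping trick: a new cycle starts wherever x decreases. The sketch below spells that step out in base R on the data from the question; the variable names cycle, all_na and keep are illustrative only.

```r
df <- data.frame(
  Row = 1:15,
  x   = rep(1:5, times = 3),
  y   = c(2, 3, 4, 3, NA, NA, NA, NA, NA, 7, rep(NA, 5))
)

# diff(x) < 0 is TRUE where x drops back to the start of a cycle;
# cumsum() turns those breakpoints into a cycle id (1, 2, 3, ...)
cycle <- cumsum(c(1, diff(df$x) < 0))
print(cycle)

# For each cycle, check whether every y is NA, then map that back to rows
all_na <- tapply(is.na(df$y), cycle, all)
keep   <- !as.vector(all_na[as.character(cycle)])

result <- df[keep, ]
print(result)
```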

R Subset matching contiguous blocks

I have a dataframe.
dat <- data.frame(k=c("A","A","B","B","B","A","A","A"),
a=c(4,2,4,7,5,8,3,2),b=c(2,5,3,5,8,4,5,8),
stringsAsFactors = F)
k a b
1 A 4 2
2 A 2 5
3 B 4 3
4 B 7 5
5 B 5 8
6 A 8 4
7 A 3 5
8 A 2 8
I would like to subset contiguous blocks based on variable k. This would be a standard approach.
#using rle rather than levels
kval <- rle(dat$k)$values
for(i in 1:length(kval))
{
subdf <- subset(dat,dat$k==kval[i])
print(subdf)
#do something with subdf
}
k a b
1 A 4 2
2 A 2 5
6 A 8 4
7 A 3 5
8 A 2 8
k a b
3 B 4 3
4 B 7 5
5 B 5 8
k a b
1 A 4 2
2 A 2 5
6 A 8 4
7 A 3 5
8 A 2 8
So the subsetting above obviously does not work the way I intended. Any elegant way to get these results?
k a b
1 A 4 2
2 A 2 5
k a b
1 B 4 3
2 B 7 5
3 B 5 8
k a b
1 A 8 4
2 A 3 5
3 A 2 8
We can use rleid from data.table to create a grouping variable
library(data.table)
setDT(dat)[, grp := rleid(k)]
dat
# k a b grp
#1: A 4 2 1
#2: A 2 5 1
#3: B 4 3 2
#4: B 7 5 2
#5: B 5 8 2
#6: A 8 4 3
#7: A 3 5 3
#8: A 2 8 3
We can group by 'grp' and do all the operations within the 'grp' using standard data.table methods.
Here is a base R option to create 'grp'
dat$grp <- with(dat, cumsum(c(TRUE, k[-1]!= k[-length(k)])))
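To get the three separate blocks shown in the question, the run id can be passed to split(). A minimal base R sketch, using the same grouping expression as above:

```r
dat <- data.frame(k = c("A", "A", "B", "B", "B", "A", "A", "A"),
                  a = c(4, 2, 4, 7, 5, 8, 3, 2),
                  b = c(2, 5, 3, 5, 8, 4, 5, 8),
                  stringsAsFactors = FALSE)

# Run id: increases by one each time k differs from the previous row
grp <- cumsum(c(TRUE, dat$k[-1] != dat$k[-nrow(dat)]))

# One data frame per contiguous block
blocks <- split(dat, grp)

for (block in blocks) {
  rownames(block) <- NULL  # restart row numbering within each block
  print(block)
  # do something with block
}
```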

Complex restructuring of R dataframe

I have a data frame like this (empty cells shown as NA):
participant  v1  v2  v3  v4  v5  v6
1             4   2  NA   9   7   2
2            NA   6   8  NA  NA   1
3            NA  NA   5   4  NA   5
4             1   1  NA   2   3  NA
Every two consecutive variables (v1 and v2, v3 and v4, v5 and v6) belong to each other (this is what I call "count" later).
I am desperately looking for a way to get the following (empty cells shown as NA):
participant  count  v(odd numbers)  v(even numbers)
1            1      4               2
1            2      NA              9
1            3      7               2
2            1      NA              6
2            2      8               NA
2            3      NA              1
3            1      NA              NA
3            2      5               4
3            3      NA              5
4            1      1               1
4            2      NA              2
4            3      3               NA
As this is my first question on stackoverflow ever, I hope you understand my request. I searched a lot for similar problems (and solutions to them) but found nothing. I would very much appreciate your support.
We can use melt
library(data.table)
melt(setDT(d1), measure = list(paste0("v", seq(1, 6, by= 2)),
paste0("v", seq(2,6, by = 2))))[order(participant)]
# participant variable value1 value2
# 1: 1 1 4 2
# 2: 1 2 NA 9
# 3: 1 3 7 2
# 4: 2 1 NA 6
# 5: 2 2 8 NA
# 6: 2 3 NA 1
# 7: 3 1 NA NA
# 8: 3 2 5 4
# 9: 3 3 NA 5
#10: 4 1 1 1
#11: 4 2 NA 2
#12: 4 3 3 NA
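The answer does not show how d1 is built. The sketch below constructs it from the values implied by the output above (blank cells taken to be NA, which is an assumption) and uses the variable.name and value.name arguments of melt() so that the result has the column names the question asks for; the names count, odd and even are illustrative.

```r
library(data.table)

# Input reconstructed from the melt output above; blanks assumed to be NA
d1 <- data.table(participant = 1:4,
                 v1 = c(4, NA, NA, 1), v2 = c(2, 6, NA, 1),
                 v3 = c(NA, 8, 5, NA), v4 = c(9, NA, 4, 2),
                 v5 = c(7, NA, NA, 3), v6 = c(2, 1, 5, NA))

# Each list element becomes one value column: odd-numbered and even-numbered v's
long <- melt(d1,
             measure.vars  = list(c("v1", "v3", "v5"), c("v2", "v4", "v6")),
             variable.name = "count",
             value.name    = c("odd", "even"))

setorder(long, participant, count)
print(long)
```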

Does fill=TRUE fail in read.table when a different number of columns occurs after the first 5 rows? [duplicate]

This question already has answers here:
How can you read a CSV file in R with different number of columns
(5 answers)
Closed 7 years ago.
Let's say we have a file named test.txt which contains an unknown number of columns:
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5 6 7 8
1 2 3 4 5
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
fill=T fails when line 8 has more than 5 columns:
read.table('test.txt', header=F, sep='\t', fill=T)
results:
V1 V2 V3 V4 V5
1 1 2 3 4 5
2 1 2 3 4 5
3 1 2 3 4 5
4 1 2 3 4 5
5 1 2 3 4 5
6 1 2 3 4 5
7 1 2 3 4 5
8 1 2 3 4 5
9 6 7 8 NA NA
10 1 2 3 4 5
11 1 2 3 4 5
12 6 NA NA NA NA
13 1 2 3 4 5
14 6 NA NA NA NA
15 1 2 3 4 5
16 6 NA NA NA NA
But with skip=3, everything works fine
read.table('test.txt', header=F, sep='\t', fill=T, skip=3)
We got what we expected:
V1 V2 V3 V4 V5 V6 V7 V8
1 1 2 3 4 5 NA NA NA
2 1 2 3 4 5 NA NA NA
3 1 2 3 4 5 NA NA NA
4 1 2 3 4 5 NA NA NA
5 1 2 3 4 5 6 7 8
6 1 2 3 4 5 NA NA NA
7 1 2 3 4 5 6 NA NA
8 1 2 3 4 5 6 NA NA
9 1 2 3 4 5 6 NA NA
Why does this happen? Is it because fill=T only checks the first 5 rows? Is there any way to work around this?
I found the answer in the Examples section of the read.table documentation.
ncol <- max(count.fields('test.txt', sep = "\t"))
read.table('test.txt', header=F, sep='\t', fill=T, col.names=paste0('V', seq_len(ncol)))
This happened because read.table determines the number of columns from the first five lines of input only, even with fill=T. The solution is to specify col.names.
Use col.names = paste0("V", seq_len(N)) within read.table, where N is the maximum number of columns.
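A self-contained sketch of this fix follows. It writes the sample data from the question to a temporary file, so the path and the n_col variable are illustrative rather than part of the original post.

```r
# Recreate the ragged, tab-separated file from the question in a temp file
path  <- tempfile(fileext = ".txt")
lines <- c(rep("1\t2\t3\t4\t5", 7),
           "1\t2\t3\t4\t5\t6\t7\t8",
           "1\t2\t3\t4\t5",
           rep("1\t2\t3\t4\t5\t6", 3))
writeLines(lines, path)

# count.fields() scans every line, so the maximum reflects the whole file
n_col <- max(count.fields(path, sep = "\t"))

# Supplying col.names of that length tells read.table how many columns to use
dat <- read.table(path, header = FALSE, sep = "\t", fill = TRUE,
                  col.names = paste0("V", seq_len(n_col)))
print(dat)
```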
