Is there a way to use a column to label my variables in R? [duplicate]

This question already has answers here:
R: Assign variable labels of data frame columns
(4 answers)
Closed 2 years ago.
I am trying to assign variable labels to the columns of a data frame in R. I was able to create the list of labels using the following code:
var.labels <- dataframe$`var name`
Now I want to use this list as the variable labels for the columns of the data frame. I tried this code, but it did not work:
label(dataframe) <- as.list(var.labels[match(names(dataframe), names(var.labels))])
Thank you for your help.

Maybe this?
var.labels <- dataframe$`var name`
colnames(dataframe) <- var.labels
Example:
df <- as.data.frame(replicate(n = 13, expr = sample(c(1:8, NA), 13, replace = TRUE)))
df$names <- LETTERS[1:13]
df
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 names
#>1 1 NA 5 3 2 5 7 NA 3 NA NA NA 1 A
#>2 7 5 1 6 4 3 1 6 4 3 NA 3 NA B
#>3 7 3 1 2 7 2 6 4 6 1 7 3 2 C
#>4 3 1 5 1 5 6 1 2 2 NA 8 5 1 D
#>5 3 6 3 7 4 2 6 7 7 NA 1 2 8 E
#>6 NA 4 4 1 8 3 8 NA 6 3 8 4 NA F
#>7 1 8 3 8 1 3 2 4 7 4 2 1 2 G
#>8 NA NA 2 3 5 4 5 1 4 7 8 5 3 H
#>9 7 NA 3 2 7 NA 2 8 7 NA 6 8 6 I
#>10 6 8 5 3 6 5 5 3 4 8 NA 5 1 J
#>11 1 7 8 5 1 2 3 NA NA 3 2 6 7 K
#>12 8 4 1 8 7 NA 6 6 5 6 7 NA 2 L
#>13 5 8 5 1 2 1 6 3 NA 1 7 3 5 M
# 13 names for 14 columns, so the name of the last column becomes NA
colnames(df) <- df$names
df
#> A B C D E F G H I J K L M NA
#>1 1 NA 5 3 2 5 7 NA 3 NA NA NA 1 A
#>2 7 5 1 6 4 3 1 6 4 3 NA 3 NA B
#>3 7 3 1 2 7 2 6 4 6 1 7 3 2 C
#>4 3 1 5 1 5 6 1 2 2 NA 8 5 1 D
#>5 3 6 3 7 4 2 6 7 7 NA 1 2 8 E
#>6 NA 4 4 1 8 3 8 NA 6 3 8 4 NA F
#>7 1 8 3 8 1 3 2 4 7 4 2 1 2 G
#>8 NA NA 2 3 5 4 5 1 4 7 8 5 3 H
#>9 7 NA 3 2 7 NA 2 8 7 NA 6 8 6 I
#>10 6 8 5 3 6 5 5 3 4 8 NA 5 1 J
#>11 1 7 8 5 1 2 3 NA NA 3 2 6 7 K
#>12 8 4 1 8 7 NA 6 6 5 6 7 NA 2 L
#>13 5 8 5 1 2 1 6 3 NA 1 7 3 5 M
# Finally, remove the names column
df[14] <- NULL
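The answer above renames the columns. If the goal is instead to keep the column names and attach the labels as metadata, a minimal base-R sketch could store each label in a "label" attribute. The data frame, the column name var_name, and the label values here are made up for illustration; whether other tools pick up a "label" attribute depends on the package you use to display it.

```r
# Hypothetical data: three measurement columns plus one column holding a label per variable
df <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9,
                 var_name = c("Age", "Height", "Weight"),
                 stringsAsFactors = FALSE)

var.labels <- df$var_name
measure_cols <- setdiff(names(df), "var_name")

# Attach the i-th label to the i-th measurement column as a "label" attribute
for (i in seq_along(measure_cols)) {
  attr(df[[measure_cols[i]]], "label") <- var.labels[i]
}

# Read the labels back to check them
labels_found <- vapply(df[measure_cols], attr, character(1), which = "label")
labels_found
```

The column names stay V1, V2, V3; only the attribute on each column changes.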

Related

Is there any way to replace a missing value based on another column's value that matches a column name?

I have a dataset:
a day day.1.time day.2.time day.3.time day.4.time day.5.time
1 NA 2 4 5 7 10 4
2 NA 5 4 1 1 6 NA
3 NA 3 7 9 6 7 4
4 NA 3 6 8 8 4 5
5 NA 3 5 2 4 5 6
6 NA 3 87 3 2 1 78
7 NA 1 NA 7 5 9 54
8 NA 5 6 6 3 2 3
9 NA 2 5 10 9 8 3
10 NA 3 9 4 10 3 3
I am trying to use the value in the day column to pick the matching day.x.time column and use its value to replace the missing value in column a. For instance, in the first row the day column is 2, so we should use the day.2.time value, 5, to replace the first value in column a.
If that day.x.time value is missing, we should use the value from the day before or the day after (-1 or +1 day) instead. For instance, in the second row the day column shows 5, so we should use the value in the day.5.time column, but it is also missing. In this case, we should use the value in the day.4.time column to replace the missing value in column a.
You can use the following to generate the sample data:
dat = data.frame(a = rep(NA,10), day = c(2,5,3,3,3,3,1,5,2,3), day.1.time = c(4,4,7,6,5,87,NA,6,5,9), day.2.time = sample(10), day.3.time = sample(10), day.4.time = sample(10), day.5.time = c(4,NA,4,5,6,78,54,3,3,3))
I have tried grep(paste0("^day.", dat$day, ".time$"), names(dat)) to match the column, but my code isn't matching in every row, so any help would be appreciated!
Here is one way to do this.
The first part, matching the day column with the corresponding day.x.time column, is easy. We can do this using matrix subsetting.
cols <- grep('day\\.\\d+\\.time', names(dat))
dat$a <- dat[cols][cbind(1:nrow(dat), dat$day)]
dat
# a day day.1.time day.2.time day.3.time day.4.time day.5.time
#1 3 2 4 3 3 3 4
#2 NA 5 4 4 10 2 NA
#3 1 3 7 8 1 8 4
#4 4 3 6 6 4 5 5
#5 6 3 5 10 6 7 6
#6 8 3 87 5 8 9 78
#7 NA 1 NA 1 7 10 54
#8 3 5 6 7 9 1 3
#9 2 2 5 2 5 6 3
#10 2 3 9 9 2 4 3
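As a small illustration of the matrix subsetting used above: indexing with a two-column matrix of (row, column) pairs returns one element per pair. The matrix m and the vector pick below are made-up stand-ins for the day.x.time columns and the day column.

```r
# Toy matrix: column 1 is 1:3, column 2 is 4:6, column 3 is 7:9, column 4 is 10:12
m <- matrix(1:12, nrow = 3)

# For each row, the column we want to read from (plays the role of dat$day)
pick <- c(2, 4, 1)

# Each row of idx is one (row, column) pair
idx <- cbind(seq_len(nrow(m)), pick)

# One value per row: m[1, 2], m[2, 4], m[3, 1]
picked <- m[idx]
picked
```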
To fill values where the day.x.time column is NA, we can select the closest non-NA value in that row.
inds <- which(is.na(dat$a))
dat$a[inds] <- mapply(function(x, y)
  na.omit(unlist(dat[x, cols[order(abs(y - seq_along(cols)))]])[1:4])[1],
  inds, dat$day[inds])
dat
# a day day.1.time day.2.time day.3.time day.4.time day.5.time
#1 3 2 4 3 3 3 4
#2 2 5 4 4 10 2 NA
#3 1 3 7 8 1 8 4
#4 4 3 6 6 4 5 5
#5 6 3 5 10 6 7 6
#6 8 3 87 5 8 9 78
#7 1 1 NA 1 7 10 54
#8 3 5 6 7 9 1 3
#9 2 2 5 2 5 6 3
#10 2 3 9 9 2 4 3
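The "closest non-NA value" step can also be pulled out into a small helper, which makes the tie-breaking easier to see. This is only a sketch; the function name nearest_non_na is made up here. Because order() leaves ties in their original order, the earlier day wins when two days are equally close.

```r
# values: the day.x.time values of one row, in day order
# pos:    the day we would like to read
nearest_non_na <- function(values, pos) {
  # Positions sorted by distance from pos; ties keep their original order,
  # so the earlier day comes first
  ord <- order(abs(seq_along(values) - pos))
  candidates <- values[ord]
  # First non-missing candidate (NA if every value is missing)
  candidates[!is.na(candidates)][1]
}

row2 <- nearest_non_na(c(4, 4, 10, 2, NA), 5)   # day 5 is NA, so day 4 is used
row7 <- nearest_non_na(c(NA, 1, 7, 10, 54), 1)  # day 1 is NA, so day 2 is used
tie  <- nearest_non_na(c(3, NA, 9), 2)          # days 1 and 3 are equally close
```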
Another option is to loop over the rows with sapply and, in each row, pick the column at position day[i] + 2 (the two leading columns are a and day).
res <- transform(dat, a=sapply(1:nrow(dat), function(i) dat[i, dat$day[i] + 2]))
res
# a day day.1.time day.2.time day.3.time day.4.time day.5.time
# 1 5 2 4 5 7 10 4
# 2 NA 5 4 1 1 6 NA
# 3 6 3 7 9 6 7 4
# 4 8 3 6 8 8 4 5
# 5 4 3 5 2 4 5 6
# 6 2 3 87 3 2 1 78
# 7 NA 1 NA 7 5 9 54
# 8 3 5 6 6 3 2 3
# 9 10 2 5 10 9 8 3
# 10 10 3 9 4 10 3 3
Edit
The -1/+1 day fallback requires a decision rule: which value to choose if the value for day is NA, but neither day - 1 nor day + 1 is NA and the two values differ.
Here is a solution that goes backwards from day and takes the first non-NA value. If it is day one, as is the case in row 7, we get NA.
res <- transform(dat, a=sapply(1:nrow(dat), function(i) {
  days <- dat[i, -(1:2)]
  day.value <- days[dat$day[i]]
  if (is.na(day.value)) {
    day.value <- tail(na.omit(unlist(days[1:dat$day[i]])), 1)
    if (length(day.value) == 0) day.value <- NA
  }
  return(day.value)
}))
res
# a day day.1.time day.2.time day.3.time day.4.time day.5.time
# 1 10 2 4 10 1 2 4
# 2 10 5 4 1 3 10 NA
# 3 2 3 7 7 2 7 4
# 4 6 3 6 2 6 6 5
# 5 10 3 5 9 10 5 6
# 6 8 3 87 6 8 4 78
# 7 NA 1 NA 3 7 1 54
# 8 3 5 6 4 4 9 3
# 9 8 2 5 8 5 8 3
# 10 9 3 9 5 9 3 3

Create a function to impute values from one data frame into another

The NA values in column A of df should be filled with the value for A from the dat data frame, and so on for the other variables.
id <- factor(rep(letters[1:2], each=5))
A <- c(1,2,NA,6,8,9,0,6,7,9)
B <- c(5,6,1,9,8,1,NA,9,7,4)
C <- c(2,3,5,NA,NA,2,7,6,4,6)
D <- c(6,5,8,3,2,9,NA,2,6,8)
df <- data.frame(id, A, B,C,D)
df
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a NA 1 5 8
4 a 6 9 NA 3
5 a 8 8 NA 2
6 b 9 1 2 9
7 b 0 NA 7 NA
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
dat <- data.frame(col=c("A","B","C","D"), value=c(23,45,26,89))
dat
col value
1 A 23
2 B 45
3 C 26
4 D 89
It should look like:
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a 23 1 5 8
4 a 6 9 26 3
5 a 8 8 26 2
6 b 9 1 2 9
7 b 0 45 7 89
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
I was thinking of something like this, but I don't know how to connect those data frames in a function:
test <- function(i){
df[,i][is.na(df[,i])] <- dat$value
}
test(2)
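If you want to keep the format of your function, look up the replacement value by column name:
test <- function(i){
  # <<- modifies df in the calling (global) environment
  df[,i][is.na(df[,i])] <<- dat$value[dat$col==i]
}
test("A")
df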
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a 23 1 5 8
4 a 6 9 NA 3
5 a 8 8 NA 2
6 b 9 1 2 9
7 b 0 NA 7 NA
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
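If you would rather not modify df through <<-, a possible variation is a function that takes both data frames and returns a filled copy. This is a sketch, not the answerer's code; the function name impute_from_lookup is made up, and it assumes every entry of dat$col names a column of df.

```r
id <- factor(rep(letters[1:2], each = 5))
A <- c(1, 2, NA, 6, 8, 9, 0, 6, 7, 9)
B <- c(5, 6, 1, 9, 8, 1, NA, 9, 7, 4)
C <- c(2, 3, 5, NA, NA, 2, 7, 6, 4, 6)
D <- c(6, 5, 8, 3, 2, 9, NA, 2, 6, 8)
df <- data.frame(id, A, B, C, D)
dat <- data.frame(col = c("A", "B", "C", "D"), value = c(23, 45, 26, 89))

# Fill the NAs of each column listed in lookup$col with the matching lookup$value
impute_from_lookup <- function(data, lookup) {
  for (k in seq_len(nrow(lookup))) {
    col <- as.character(lookup$col[k])
    data[[col]][is.na(data[[col]])] <- lookup$value[k]
  }
  data  # return the modified copy; the original is left untouched
}

filled <- impute_from_lookup(df, dat)
filled
```

Because the function returns a new data frame, df itself still contains its NA values afterwards.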
One approach is to iterate over the columns and values and use coalesce():
library(dplyr)
library(purrr)
df[-1] <- map2_df(df[-1], dat$value, coalesce)
df
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a 23 1 5 8
4 a 6 9 26 3
5 a 8 8 26 2
6 b 9 1 2 9
7 b 0 45 7 89
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
Or the same using replace():
map2_df(df[-1], dat$value, ~ replace(.x, is.na(.x), .y))

Delete rows when all numbers within a cycle of another variable equal to NA

My data are as follow:
Row x y
1 1 2
2 2 3
3 3 4
4 4 3
5 5 NA
6 1 NA
7 2 NA
8 3 NA
9 4 NA
10 5 7
11 1 NA
12 2 NA
13 3 NA
14 4 NA
15 5 NA
I wish to delete rows 11 to 15, since y is NA for the whole cycle of x (y equals NA whatever value x takes in rows 11 to 15). I do not want to delete the other rows, since at least one value of y is not NA while x moves from 1 to 5 (for example, in rows 6 to 10, y is 7 when x is 5, so I keep rows 6 to 10). How should I write R code to accomplish this?
Using base R, assuming that x is sorted within each cycle and that every cycle starts from 1:
subset(df,!ave(is.na(y),cumsum(c(1,diff(x)<0)),FUN=all))
Row x y
1 1 1 2
2 2 2 3
3 3 3 4
4 4 4 3
5 5 5 NA
6 6 1 NA
7 7 2 NA
8 8 3 NA
9 9 4 NA
10 10 5 7
Using tidyverse:
library(dplyr)
df %>%
  group_by(m = cumsum(c(1, diff(x) < 0))) %>%
  filter(!all(is.na(y)))
# A tibble: 10 x 4
# Groups: m [2]
Row x y m
<int> <int> <int> <dbl>
1 1 1 2 1
2 2 2 3 1
3 3 3 4 1
4 4 4 3 1
5 5 5 NA 1
6 6 1 NA 2
7 7 2 NA 2
8 8 3 NA 2
9 9 4 NA 2
10 10 5 7 2
Of course, you can then ungroup and drop the helper column m.
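Both answers rely on the same grouping trick: diff(x) < 0 is TRUE exactly where x drops back to the start of a new cycle, and cumsum() turns those drops into a running cycle number. A small base-R sketch with made-up vectors x and y shows the intermediate values:

```r
x <- c(1, 2, 3, 1, 2, 1, 2, 3)
y <- c(NA, 2, NA, NA, NA, 5, NA, NA)

# A new cycle starts wherever x decreases; the leading 1 opens the first cycle
cycle <- cumsum(c(1, diff(x) < 0))

# TRUE for every row whose whole cycle has only NA in y
all_na <- as.logical(ave(is.na(y), cycle, FUN = all))

# Rows to keep: cycles 1 and 3 each contain at least one non-NA y
kept_rows <- which(!all_na)
kept_rows
```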

fill=TRUE fails in read.table when a different number of columns occurs after the first 5 rows? [duplicate]

This question already has answers here:
How can you read a CSV file in R with different number of columns
(5 answers)
Closed 7 years ago.
Let's say we have a file named test.txt which contains an unknown number of columns:
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5 6 7 8
1 2 3 4 5
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
fill=T fails when line 8 has more than 5 columns:
read.table('test.txt', header=F, sep='\t', fill=T)
results:
V1 V2 V3 V4 V5
1 1 2 3 4 5
2 1 2 3 4 5
3 1 2 3 4 5
4 1 2 3 4 5
5 1 2 3 4 5
6 1 2 3 4 5
7 1 2 3 4 5
8 1 2 3 4 5
9 6 7 8 NA NA
10 1 2 3 4 5
11 1 2 3 4 5
12 6 NA NA NA NA
13 1 2 3 4 5
14 6 NA NA NA NA
15 1 2 3 4 5
16 6 NA NA NA NA
But with skip=3, everything works fine
read.table('test.txt', header=F, sep='\t', fill=T, skip=3)
We got what we expected:
V1 V2 V3 V4 V5 V6 V7 V8
1 1 2 3 4 5 NA NA NA
2 1 2 3 4 5 NA NA NA
3 1 2 3 4 5 NA NA NA
4 1 2 3 4 5 NA NA NA
5 1 2 3 4 5 6 7 8
6 1 2 3 4 5 NA NA NA
7 1 2 3 4 5 6 NA NA
8 1 2 3 4 5 6 NA NA
9 1 2 3 4 5 6 NA NA
Why does this happen? Is it because fill=T only checks the first 5 rows? Is there any way to work around this?
I found the answer right in the Examples of read.table.
ncol <- max(count.fields('test.txt', sep = "\t"))
read.table('test.txt', header=F, sep='\t', fill=T, col.names=paste0('V', seq_len(ncol)))
This happens because fill=T only checks the first five rows. The solution is to specify col.names.
Use col.names = paste0("V", seq_len(N)) within read.table, where N is the maximum number of columns.
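To try the col.names approach without an existing test.txt, the file from the question can be written to a temporary location first. This is a sketch: the temporary path and the variable names n_col and wide are just for illustration, and the file is written tab-separated to match sep = "\t".

```r
# Recreate the ragged file from the question in a temporary location
path <- tempfile(fileext = ".txt")
lines <- c(rep("1\t2\t3\t4\t5", 7),
           "1\t2\t3\t4\t5\t6\t7\t8",
           "1\t2\t3\t4\t5",
           rep("1\t2\t3\t4\t5\t6", 3))
writeLines(lines, path)

# Scan the whole file for the widest line instead of relying on the first rows
n_col <- max(count.fields(path, sep = "\t"))

# Supplying col.names fixes the number of columns up front; fill pads short rows with NA
wide <- read.table(path, header = FALSE, sep = "\t", fill = TRUE,
                   col.names = paste0("V", seq_len(n_col)))
dim(wide)
```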

R, Using reshape to pull pre post data

I have a simple data frame as follows
x = data.frame(id = seq(1,10),val = seq(1,10))
x
id val
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
I want to add 4 more columns. The first 2 are the previous two rows and the next two are the next two rows. For the first two rows and last two rows it needs to write out as NA.
How do I accomplish this using cast in the reshape package?
The final output would look like
1 1 NA NA 2 3
2 2 NA 1 3 4
3 3 1 2 4 5
4 4 2 3 5 6
... and so on...
Thanks much in advance
After you gave the example, I changed the solution:
mat <- cbind(x,
             c(c(NA, NA), head(x$id, -2)),
             c(c(NA), head(x$val, -1)),
             c(tail(x$id, -1), c(NA)),
             c(tail(x$val, -2), c(NA, NA)))
colnames(mat) <- c('id','val','idp','valp','idn','valn')
mat
id val idp valp idn valn
1 1 1 NA NA 2 3
2 2 2 NA 1 3 4
3 3 3 1 2 4 5
4 4 4 2 3 5 6
5 5 5 3 4 6 7
6 6 6 4 5 7 8
7 7 7 5 6 8 9
8 8 8 6 7 9 10
9 9 9 7 8 10 NA
10 10 10 8 9 NA NA
Here is a solution with sapply. First, choose the relative offsets for the new columns:
lags <- c(-2, -1, 1, 2)
Create the new columns:
newcols <- sapply(lags,
                  function(l) {
                    tmp <- seq.int(nrow(x)) + l
                    x[replace(tmp, tmp < 1 | tmp > nrow(x), NA), "val"]
                  })
Bind together:
cbind(x, newcols)
The result:
id val 1 2 3 4
1 1 1 NA NA 2 3
2 2 2 NA 1 3 4
3 3 3 1 2 4 5
4 4 4 2 3 5 6
5 5 5 3 4 6 7
6 6 6 4 5 7 8
7 7 7 5 6 8 9
8 8 8 6 7 9 10
9 9 9 7 8 10 NA
10 10 10 8 9 NA NA
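The same idea can be wrapped in a small helper so that the new columns get descriptive names. This is a sketch rather than part of either answer; the function name shift and the column names prev2, prev1, next1 and next2 are made up.

```r
# Return v shifted by k positions: negative k looks back, positive k looks ahead.
# Positions that fall outside the vector become NA.
shift <- function(v, k) {
  n <- length(v)
  idx <- seq_len(n) + k
  idx[idx < 1 | idx > n] <- NA
  v[idx]
}

x <- data.frame(id = seq(1, 10), val = seq(1, 10))

out <- cbind(x,
             prev2 = shift(x$val, -2),
             prev1 = shift(x$val, -1),
             next1 = shift(x$val, 1),
             next2 = shift(x$val, 2))
out
```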
