Use a vector to create empty columns in a data frame

I currently have an integer vector, exampleVector, and a data frame, exampleDF, and I would like to add each element of exampleVector as a new column of NA values in exampleDF. For illustration, I currently have:
exampleVector <- 1:6
exampleDF <- data.frame(First=1:4, Second=4:7,Third=7:10)
exampleDF
  First Second Third
1     1      4     7
2     2      5     8
3     3      6     9
4     4      7    10
And what I would like to be able to create is
exampleDF
  First Second Third    1    2    3    4    5    6
1     1      4     7 <NA> <NA> <NA> <NA> <NA> <NA>
2     2      5     8 <NA> <NA> <NA> <NA> <NA> <NA>
3     3      6     9 <NA> <NA> <NA> <NA> <NA> <NA>
4     4      7    10 <NA> <NA> <NA> <NA> <NA> <NA>
where the new columns exampleDF[4:9] are character vectors.
I am aware that I would be able to do this using variations of the below commands:
exampleDF$"1" <- as.character(NA)
exampleDF[["1"]] <- as.character(NA)
exampleDF[c("1","2","3","4","5","6")] <- as.character(NA)
But I need something more flexible. Everything I've been able to find online has been about adding one column to multiple data frames, and the suggested tools for that are mapply and cbind.
My apologies if I'm missing something obvious here - I am very new to the R language. I'm trying to do this without a for loop, if possible, as recent interactions have led me to believe that loops are mostly considered a hack in R scripts and that the apply functions are typically sufficient.

Since your exampleVector is numeric, you should convert it to character when you use it for column names. Otherwise it will be interpreted as a selection by index.
exampleVector <- 1:6
exampleDF <- data.frame(First=1:4, Second=4:7,Third=7:10)
exampleDF[as.character(exampleVector)] <- NA_character_
Note that nothing in this setup protects you against ending up with the same column name occurring several times in the data frame. That might create problems later on (if you want to subset your data frame by names), so I would add a sanity check to ensure that you do get unique names.
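For example, a minimal sketch of such a check (my addition, not from the original answer), using base R's stopifnot to halt on a name clash:
newNames <- as.character(exampleVector)
# Stop if any proposed column name already exists in the data frame
stopifnot(!any(newNames %in% names(exampleDF)))
exampleDF[newNames] <- NA_character_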

Related

R Order only one factor level (or column if after) to affect order long to wide (using spread)

I have a problem after changing my dataset from long to wide (using spread, from the tidyr library, on the Result_Type column). I have the following example df:
Group<-c("A","A","A","B","B","B","C","C","C","D", "D")
Result_Type<-c("Final.Result", "Verification","Test", "Verification","Final.Result","Fast",
"Verification","Fast", "Final.Result", "Test", "Final.Result")
Result<-c(7,1,8,7,"NA",9,10,12,17,50,11)
df<-data.frame(Group, Result_Type, Result)
df
   Group  Result_Type Result
1      A Final.Result      7
2      A Verification      1
3      A         Test      8
4      B Verification      7
5      B Final.Result     NA
6      B         Fast      9
7      C Verification     10
8      C         Fast     12
9      C Final.Result     17
10     D         Test     50
11     D Final.Result     11
In the column Result_Type there are many possible result types, and some Result_Type's that appear in one dataset will not occur in other datasets. However, one level, Final.Result, does occur in every dataset.
Also: this is example data, but the actual data has many different columns, and as these differ across the datasets I use, I used spread (from the tidyr library) so I don't have to give any specific column names other than my target columns.
library("tidyr")
df_spread<-spread(df, key = Result_Type, value = Result)
  Group Fast Final.Result Test Verification
1     A <NA>            7    8            1
2     B    9           NA <NA>            7
3     C   12           17 <NA>           10
4     D <NA>           11   50         <NA>
What I would like is that once I convert the dataset from long to wide, Final.Result is the first column; how the rest of the columns are arranged doesn't matter. So I would like it to be like this (without calling any names of the other columns that are spread, or using order index numbers):
  Group Final.Result Fast Test Verification
1     A            7 <NA>    8            1
2     B           NA    9 <NA>            7
3     C           17   12 <NA>           10
4     D           11 <NA>   50         <NA>
I saw some answers indicating that you can reverse the order of the spread columns, or turn off spread's ordering, but that doesn't ensure that Final.Result is always the first column of the spread levels.
I hope I am making myself clear, it's a little complicated to explain. If someone needs extra info I will be happy to explain more!
spread creates columns in the order of the key column's factor levels. Within the tidyverse, forcats::fct_relevel is a convenience function for rearranging factor levels. The default is that the level(s) you specify will be moved to the front.
library(dplyr)
library(tidyr)
...
levels(df$Result_Type)
#> [1] "Fast" "Final.Result" "Test" "Verification"
Calling fct_relevel will put "Final.Result" as the first level, keeping the rest of the levels in their previous order.
reordered <- df %>%
  mutate(Result_Type = forcats::fct_relevel(Result_Type, "Final.Result"))
levels(reordered$Result_Type)
#> [1] "Final.Result" "Fast" "Test" "Verification"
Adding that into your pipeline puts Final.Result as the first column after spreading.
df %>%
  mutate(Result_Type = forcats::fct_relevel(Result_Type, "Final.Result")) %>%
  spread(key = Result_Type, value = Result)
#>   Group Final.Result Fast Test Verification
#> 1     A            7 <NA>    8            1
#> 2     B           NA    9 <NA>            7
#> 3     C           17   12 <NA>           10
#> 4     D           11 <NA>   50         <NA>
Created on 2018-12-14 by the reprex package (v0.2.1)
One option is to re-level the Result_Type factor to put Final.Result first:
df$Result_Type <- factor(df$Result_Type,
                         levels = c("Final.Result",
                                    as.character(unique(df$Result_Type)[unique(df$Result_Type) != "Final.Result"])))
spread(df, key = Result_Type, value = Result)
  Group Final.Result Verification Test Fast
1     A            7            1    8   NA
2     B           NA            7   NA    9
3     C           17           10   NA   12
4     D           11           NA   50   NA
If you'd like, you can use this opportunity to also sort the rest of the columns whichever way you want.
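For instance, a hypothetical variation on the same idea that keeps Final.Result first and sorts the remaining levels alphabetically:
# Collect the other levels, sorted, then put Final.Result in front
other <- sort(setdiff(unique(as.character(df$Result_Type)), "Final.Result"))
df$Result_Type <- factor(df$Result_Type, levels = c("Final.Result", other))
spread(df, key = Result_Type, value = Result)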

Apply the same calculation in different data frames in R

I am trying to loop over many data frames in R, and I feel like this is a rather basic question. However, I only found similar questions that were solved with specific functions that don't match my problem (like calculating means or medians, changing column names, ...). I hope to find here a more general solution that can be applied for any change or calculation in various data frames.
I have a lot (about 500) of data frames that look somewhat like this (very simplified):
df0100
a b c d
1 4 3 5 NA
2 2 5 4 NA
3 4 4 3 NA
...
df0130
a b c d
1 3 2 3 NA
2 4 5 3 NA
3 4 3 2 NA
...
For each of them, I want to calculate a new value (also simplified here) from the values of a and c in the first row, and insert that value into every row of column d. It works fine like this for a single data frame:
df0100$d <- (df0100[1,1]*(df0100[1,3]+13.5))/(3*exp(df0100[1,3]))/100
which leads to
df0100
a b c d
1 4 3 5 36.60858
2 2 5 4 36.60858
3 4 4 3 36.60858
....
Since I don't want to do this for every single one of the 500 data frames, I saved them in a list and tried to loop over them as follows. I thought the easiest way would be to replace the former 'df0100' with each data frame's name, but neither version worked. Can anyone tell me what I have to change?
my_files <- list.files(pattern=".csv")
my_data <- lapply(my_files, read.csv)
Version 1:
for (n in my_data)
{
  n$d <- (n[1,1]*(n[1,3]+13.5))/(3*exp(n[1,3]))/100
}
Version 2:
my_data <- lapply(my_data, function(n){
  n$d <- (n[1,1]*(n[1,3]+13.5))/(3*exp(n[1,3]))/100
})
This is my first question here, I hope it makes sense to you.
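For what it's worth, a likely fix (my sketch, not an answer from the original thread): Version 2 is close, but the anonymous function must return the modified data frame, since lapply collects each call's return value rather than its side effects. Version 1 fails for a similar reason: n is a copy, so assigning to it never touches the list.
my_data <- lapply(my_data, function(n){
  n$d <- (n[1,1]*(n[1,3]+13.5))/(3*exp(n[1,3]))/100
  n  # return the whole data frame, not just the assigned column
})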

Remove duplicates while keeping NA in R

I have data that looks like the following:
a<-data.frame(ID=c("A","B","C","C",NA,NA),score=c(1,2,3,3,5,6),stringsAsFactors=FALSE)
print(a)
ID score
A 1
B 2
C 3
C 3
<NA> 5
<NA> 6
I am trying to remove duplicates without R treating <NA> as duplicates to get the following:
b<-data.frame(ID=c("A","B","C",NA,NA),score=c(1,2,3,5,6),stringsAsFactors=FALSE)
print(b)
ID score
A 1
B 2
C 3
<NA> 5
<NA> 6
I have tried the following:
b<-a[!duplicated(a$ID),]
library(dplyr)
b<-distinct(a,ID)
print(b)
Both treat <NA> as a duplicate ID and remove one, but I want to keep all instances of <NA>. Thoughts? Thank you!
A straightforward approach is to break the original data frame into two parts, rows where ID is NA and rows where it is not. Perform your distinct filter on the non-NA part, then combine the two data frames back together:
a<-data.frame(ID=c("A","B","C","C",NA,NA),score=c(1,2,3,3,5,6),stringsAsFactors=FALSE)
aprime<-a[!is.na(a$ID),]
aNA<-a[is.na(a$ID),]
b<-aprime[!duplicated(aprime$ID),]
b<-rbind(b, aNA)
With a little work, one can reduce this to one or two lines of code.
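For example, a hypothetical condensation of the same logic into a single line:
# Keep the first occurrence of each non-NA ID, then append all NA rows
b <- rbind(a[!duplicated(a$ID) & !is.na(a$ID), ], a[is.na(a$ID), ])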
Using dplyr (starting again from a):
a %>% group_by(ID, score) %>% distinct()
# A tibble: 5 x 2
# Groups: ID, score [5]
ID score
<chr> <dbl>
1 A 1
2 B 2
3 C 3
4 <NA> 5
5 <NA> 6
Found a very simple way to do this using the base duplicated() function.
b<-a[!duplicated(a$ID, incomparables = NA),]
Setting incomparables = NA tells duplicated() that NA values cannot be compared, so they are never flagged as duplicates (they return FALSE), and all NA rows are therefore kept in the result.
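You can see the effect directly (a small illustration, assuming the same a as above):
duplicated(a$ID, incomparables = NA)
# [1] FALSE FALSE FALSE  TRUE FALSE FALSE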

What is the best way to clean a data.frame in which row values are ordered properly, but are arbitrarily separated by NA columns?

Occasionally, I need to clean very messy datasets, which are the result of importing a pdf table to a spreadsheet. When the pdf file is converted, all of the columns remain in the correct order (in relation to each other), but blank columns are scattered arbitrarily between them.
Here is a greatly simplified example.
data <- data.frame(
W = sample(1:10),
X = c("yes","no"," ","yes","no"," "," ","no","yes"," "),
Y = c(" "," "," "," "," ","no","no"," "," ","yes"),
Z = c(" "," ","no"," "," "," "," "," "," "," ")
)
data$X <- gsub(" ", NA, data$X)
data$Y <- gsub(" ", NA, data$Y)
data$Z <- gsub(" ", NA, data$Z)
This results in:
> data
    W    X    Y    Z
1   6  yes <NA> <NA>
2   4   no <NA> <NA>
3   3 <NA> <NA>   no
4   5  yes <NA> <NA>
5   9   no <NA> <NA>
6   1 <NA>   no <NA>
7   7 <NA>   no <NA>
8   8   no <NA> <NA>
9  10  yes <NA> <NA>
10  2 <NA>  yes <NA>
I want to get this:
    W   X
1   6 yes
2   4  no
3   3  no
4   5 yes
5   9  no
6   1  no
7   7  no
8   8  no
9  10 yes
10  2 yes
How can I best accomplish this? I need code that can accommodate many versions of this problem, including successive columns with NA values before the column containing the desired values. If I could just remove each individual cell with NA values, while shifting remaining values left, that would work. Is this possible?
Using matrix subsetting in base R, we can select the non-missing values as follows. The outer cbind constructs the two-column data.frame. The second column is built by matrix subsetting: a two-column matrix of (row, column) positions is fed to data, identifying the desired elements. The rows are selected with seq_len, and the columns with max.col, which finds the column that is TRUE for each row, that is, the column that is not NA in data[-1]. A 1 is added to compensate for dropping the first column.
cbind(data[1L], response=data[cbind(seq_len(nrow(data)), max.col(!is.na(data[-1L])) + 1L)])
W response
1 10 yes
2 7 no
3 8 no
4 5 yes
5 1 no
6 2 no
7 6 no
8 4 no
9 3 yes
10 9 yes
One option in base R is to get the array indices of the non-NA values using which(), then subset the dataset by the resulting matrix of indices, sorted by row number.
indices <- which(!is.na(data[,-1]), arr.ind = TRUE)
data$X <- data[,-1][indices[order(indices[,1]),]]
Using coalesce from dplyr,
Reduce(dplyr::coalesce, data[-1])
[1] "yes" "no" "no" "yes" "no" "no" "no" "no" "yes" "yes"
Another option is pmax
cbind(data[1], response = do.call(pmax, c(data[-1], na.rm = TRUE)))
# W response
#1 3 yes
#2 6 no
#3 10 no
#4 2 yes
#5 5 no
#6 7 no
#7 8 no
#8 1 no
#9 4 yes
#10 9 yes

Fill = T won't work with single letters (?) [R]

I'm using 'fill = T' on a file that has single letters separated by commas:
Pred
1 T,T
2 NA
3 D
4 NA
5 NA
6 T
7 P,B
8 NA
9 NA
using the command:
sift <- read.table("/home/pred.txt", header=F, fill=TRUE, sep=',', stringsAsFactors=F)
which I was hoping would make sift turn out as:
V1 V2
1 T T
2 <NA>
3 D
4 <NA>
5 <NA>
6 T
7 P B
8 <NA>
9 <NA>
However, it comes out like:
V1
1 T
2 <NA>
3 D
4 <NA>
5 <NA>
6 T
7 P
8 <NA>
9 <NA>
This code works when there are multiple sampleIDs (separated by a comma) in each row - but not for single letters. Does 'fill' work for single letters? Stupid question, I know.
So here is a workaround:
url <- "https://dl.dropboxusercontent.com/s/bjb241s16t63ev8/pred.txt?dl=1&token_hash=AAEBzfCGgoeHgNTvhMSVoZK6qRGrdwwuDZB3h8lWTZNtkA"
df.1 <- read.table(url,header=F,sep=",",fill=T,stringsAsFactors=F)
dim(df.1)
# [1] 149792 1 <-- 149,792 rows and ** 1 ** column
df.2 <- read.table(url,header=F,sep=",",fill=T,stringsAsFactors=F,
col.names=c("V1","V2"))
dim(df.2)
# [1] 149633 2 <-- 149,633 rows and ** 2 ** columns
head(df.2[which(nchar(df.2$V2)>0),])
# V1 V2
# 1000 T T
# 2419 T T
# 3507 T T
# 3766 T D
# 4308 T D
# 4545 T D
read.table(...) creates a data frame with the number of columns determined by the first 5 rows of the file. Since the first 5 rows in your file have only 1 column, that's what you get. Evidently, by specifying sep="," you force read.table(...) to add the "extra" data as extra rows.
The workaround explicitly sets the number of columns by specifying column names, which could be anything, as long as length(col.names) = 2.
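Applied to the original command, the workaround amounts to adding col.names (assuming at most two comma-separated values per row):
sift <- read.table("/home/pred.txt", header=FALSE, sep=",", fill=TRUE,
                   stringsAsFactors=FALSE, col.names=c("V1","V2"))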
