I am using R for a project and I have a data frame in the following format:
A B C
1 1 0 0
2 0 1 1
I want to return a data frame that gives the Column Name when the value is 1.
i.e.
Impair1 Impair2
1 A NA
2 B C
Is there a way to do this for thousands of records? The max impairment number is 4.
Note: There are more than 3 columns. Only 3 were listed to make it easier.
You could loop through the rows of your data, returning the column names where the data is set with an appropriate number of NA values padded at the end:
`colnames<-`(t(apply(dat == 1, 1, function(x) c(colnames(dat)[x], rep(NA, 4-sum(x))))),
paste0("Impair", 1:4))
# Impair1 Impair2 Impair3 Impair4
# 1 "A" NA NA NA
# 2 "B" "C" NA NA
Using the apply family of functions, here is a general solution that should work for your larger dataset:
res <- apply(df, 1, function(x) {
out <- character(4) # create a length-4 vector of empty strings (not NAs)
tmp <- colnames(df)[which(x==1)] # store the matching column names in a tmp field
out[seq_along(tmp)] <- tmp # overwrite the leading positions (safe when tmp is empty)
out
})
# transpose and turn it into a data.frame
> data.frame(t(res))
X1 X2 X3 X4
1 A
2 B C
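As a minimal sketch of the same row-wise idea, the lookup can be wrapped in a small helper that pads with NA and also copes with rows that contain no 1s. The name impair_names and the toy data are illustrative assumptions, and the sketch assumes the data frame holds only 0/1 values with at most max_n ones per row.

```r
# Hypothetical example data, mirroring the question
dat <- data.frame(A = c(1, 0), B = c(0, 1), C = c(0, 1))

# impair_names() is an illustrative helper, not part of any package
impair_names <- function(d, max_n = 4) {
  res <- t(apply(d == 1, 1, function(hit) {
    nm <- colnames(d)[hit]   # names of the columns equal to 1 in this row
    length(nm) <- max_n      # pad with NA (or truncate) to exactly max_n entries
    nm
  }))
  colnames(res) <- paste0("Impair", seq_len(max_n))
  as.data.frame(res, stringsAsFactors = FALSE)
}

out <- impair_names(dat)
out
```

Setting the length of a vector pads it with NA, which avoids computing the number of padding values by hand.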
Related
I need to fill an R data.frame (or data.table) using named vectors as rows. The problem is that the named vectors usually do not contain all the variables. In other words, a named vector is usually shorter than the number of columns. The names in the vectors coincide with column names of the data frame:
df <- data.frame(matrix(NA, 2, 3))
colnames(df) <- c("A", "B", "C")
obs1 <- c(A=2, B=4)
obs2 <- c(A=3, C=10)
I want df as follows:
> df
A B C
1 2 4 NA
2 3 NA 10
So I want to fill in the first two rows with obs1 and obs2 respectively. When I try to do it, I get an error:
> df[1,] <- obs1
Error in `[<-.data.frame`(`*tmp*`, 1, , value = c(A = 2, B = 4)) :
replacement has 2 items, need 3
I suspect that similar question was already asked, but I could not find it. Does anybody know how to do it using data.frame or data.table?
We need to select the columns as well based on the names of 'obs1' and 'obs2'
df[1, names(obs1)] <- obs1
df[2, names(obs2)] <- obs2
-output
> df
A B C
1 2 4 NA
2 3 NA 10
When we do df[1,], it returns the first row with all the columns, i.e. its length is 3, whereas 'obs1' and 'obs2' each have a length of only 2, hence the length error.
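With more than a couple of observations, the same name-based assignment can be done in a loop. This is only a sketch: it assumes the named vectors are collected in a list (called obs here, a name not used in the question) and that the template has one row per vector.

```r
# Template data frame with the full set of columns, as in the question
df <- data.frame(matrix(NA_real_, nrow = 2, ncol = 3))
colnames(df) <- c("A", "B", "C")

# Observations kept in a list so they can be processed in one loop
obs <- list(c(A = 2, B = 4), c(A = 3, C = 10))

for (i in seq_along(obs)) {
  # Assign only into the columns named in the i-th vector;
  # the remaining columns of row i keep their NA
  df[i, names(obs[[i]])] <- obs[[i]]
}
df
```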
Also, creating a template dataset to fill is not really needed as we can use bind_rows which will automatically fill with NA for those columns not present
library(dplyr)
bind_rows(obs1, obs2)
# A tibble: 2 x 3
A B C
<dbl> <dbl> <dbl>
1 2 4 NA
2 3 NA 10
Solution with data.table:
library(data.table)
obs1 <- data.table(t(obs1))
obs2 <- data.table(t(obs2))
df <- rbindlist(list(obs1, obs2), fill = TRUE)
df
Output:
A B C
<dbl> <dbl> <dbl>
1 2 4 NA
2 3 NA 10
I have a vector of variable names and several single-column matrices.
I want to create a new matrix by matching/merging the single-column matrices on their row names.
Example:
A vector of variable names
Complete_names <- c("D","C","A","B")
Several single-column matrices
Matrix_1 <- matrix(c(1,2,3),3,1)
rownames(Matrix_1) <- c("D","C","B")
Matrix_2 <- matrix(c(4,5,6),3,1)
rownames(Matrix_2) <- c("A","B","C")
Desired output:
Desired_output <- matrix(c(1,2,NA,3,NA,6,4,5),4,2)
rownames(Desired_output) <- c("D","C","A","B")
[,1] [,2]
D 1 NA
C 2 6
A NA 4
B 3 5
I know there are several similar postings like this, but those previous answers do not work perfectly for this one.
The main job can be done with merge, returning a data frame:
merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
# Row.names V1.x V1.y
# 1 A NA 4
# 2 B 3 5
# 3 C 2 6
# 4 D 1 NA
Depending on your purposes you may then further modify names or get rid of Row.names.
The answers offered by Julius Vainora and achimneyswallow work well, but just to exactly obtain the desired output I want:
temp <- merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
temp$Row.names <- factor(temp$Row.names, levels=Complete_names)
temp <- temp[order(temp$Row.names),]
rownames(temp) <- temp[,1]
Desired_output <- as.matrix(temp[,-1])
V1.x V1.y
D 1 NA
C 2 6
A NA 4
B 3 5
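An alternative that stays with matrices throughout is to look up each name with match. This is a sketch rather than a replacement for the merge answers; it assumes every input is a single-column matrix with row names, and the name aligned is introduced here for illustration.

```r
Complete_names <- c("D", "C", "A", "B")

Matrix_1 <- matrix(c(1, 2, 3), 3, 1)
rownames(Matrix_1) <- c("D", "C", "B")
Matrix_2 <- matrix(c(4, 5, 6), 3, 1)
rownames(Matrix_2) <- c("A", "B", "C")

# For each matrix, look up every name in Complete_names;
# names that are absent give NA from match(), and indexing by NA yields NA
aligned <- sapply(list(Matrix_1, Matrix_2), function(m) {
  m[match(Complete_names, rownames(m)), 1]
})
rownames(aligned) <- Complete_names
aligned
```

Because the lookup follows Complete_names directly, no separate reordering step is needed, and further matrices can be added to the list.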
I'm trying to calculate the mode for numeric columns. The columns which are not numeric, should have a "NA" as a placeholder in the vector. I would also need percentages according to a target. Some example data:
c1= c("A", "B", "C", "C", "B", "C", "C")
c2= factor(c(1, 1, 2, 2,1,2,1), labels = c("Y","N"))
d= as.Date(c("2015-02-01", "2015-02-03","2015-02-01","2015-02-05", "2015-02-03","2015-02-01", "2015-02-03"), format="%Y-%m-%d")
x= c(1,1,2,3,1,2,4)
y= c(1,2,2,6,2,3,1)
t= c(1,0,1,1,0,0,1)
df=data.frame(c1, c2, d, x, y,t)
df
c1 c2 d x y t
1 A Y 2015-02-01 1 1 1
2 B Y 2015-02-03 1 2 0
3 C N 2015-02-01 2 2 1
4 C N 2015-02-05 3 6 1
5 B Y 2015-02-03 1 2 0
6 C N 2015-02-01 2 3 0
7 C Y 2015-02-03 4 1 1
I would need the mode for each numeric column:
mode=as.numeric(c("NA","NA", "NA", 1,2,1))
mode
[1] NA NA NA 1 2 1
and a vector of percentages of rows with t==1, when value in column == mode
[1] NA NA NA 0.33 0.33
and a vector of percentages of rows with t==1, when value in column != mode
[1] NA NA NA 0.75 0.75
How could I calculate such vectors?
The best I have found for mode is:
library(plyr)
mode_fun <- function(x) {
mode0 <- names(which.max(table(x)))
if(is.numeric(x)) return(as.numeric(mode0))
mode0
}
kdf_mode=apply(kdf,2, numcolwise(mode_fun))
But it gives an error if there are any non numeric columns.
We can use sapply to loop over the columns of 'df', apply the mode_fun to get the output vector ('v1'). We use an if/else condition to return NA for non-numeric columns.
v1 <- unname(sapply(df, function(x) if(!is.numeric(x)) NA else mode_fun(x)))
v1
#[1] NA NA NA 1 2 1
For the second case (I assume we don't need the 6th column, i.e. 't'), we loop through the columns of 'df' with sapply and use the same if/else condition. In the else branch, we check which column values are equal to the mode (mode_fun(x)==x). We use & to find the values that are equal to the mode and also correspond to t==1, take the sum, and divide by the number of values equal to the mode (sum(v1)).
unname(sapply(df[-6], function(x) if(!is.numeric(x)) {
NA
} else {
v1 <- mode_fun(x)==x
sum(v1 & t==1)/sum(v1)
} ))
#[1] NA NA NA 0.3333333 0.3333333
For the third, we change the condition to get the logical index where the column is not equal to the mode. Do the same as in the previous case.
unname(sapply(df[-6], function(x) if(!is.numeric(x)){
NA
} else {
v1 <- mode_fun(x)!=x
sum(v1 & t==1)/sum(v1)
} ))
#[1] NA NA NA 0.75 0.75
After we calculate 'v1', this can be also done without looping with sapply. We create a logical index where the column class is 'numeric' and the column names is not 't' ('indx').
indx <- sapply(df, is.numeric) & names(df)!='t'
We subset 'df' and 'v1' based on 'indx' (df[indx], v1[indx]) and make the lengths equal by replicating the vector using col, which gives the numeric column index of each element of df[indx]. Then we check whether the subset dataset is equal to the replicated vector to get a logical matrix.
indx1 <- df[indx]==v1[indx][col(df[indx])]
As in the previous code, we use & to check whether the TRUE values in 'indx1' also correspond to t==1. Do colSums, divide by the colSums of 'indx1', and concatenate (c) with the NA elements of 'v1'.
unname(c(v1[is.na(v1)], colSums(indx1& t==1)/colSums(indx1)))
#[1] NA NA NA 0.3333333 0.3333333
Similarly, we can create 'indx2' by changing the condition and then do colSums as before
indx2 <- df[indx]!=v1[indx][col(df[indx])]
unname(c(v1[is.na(v1)], colSums(indx2& t==1)/colSums(indx2)))
#[1] NA NA NA 0.75 0.75
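The two proportions can also be computed together by one helper per column. The sketch below uses only the numeric columns from the example; mode_num, target_share and target are illustrative names, and the mode is recovered from table names, which assumes the values convert cleanly back to numbers.

```r
x <- c(1, 1, 2, 3, 1, 2, 4)
y <- c(1, 2, 2, 6, 2, 3, 1)
target <- c(1, 0, 1, 1, 0, 0, 1)   # the column called 't' in the question

# Most frequent value of a numeric vector (the first one wins on ties)
mode_num <- function(v) as.numeric(names(which.max(table(v))))

# Share of rows with target == 1 among rows at the mode and away from it
target_share <- function(v, target) {
  at_mode <- v == mode_num(v)
  c(at_mode  = mean(target[at_mode] == 1),
    off_mode = mean(target[!at_mode] == 1))
}

shares <- sapply(data.frame(x, y), target_share, target = target)
round(shares, 2)
```

The result is a matrix with one column per variable and one row per condition, so both vectors asked for in the question come out of a single pass.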
I am on the lookout for a function in R that would check for the presence of particular columns, e.g.
cols=c("a","b","c","d")
in a matrix or dataframe that would insert a column with NAs in case any columns did not exist (in the position in which the columns are given in vector cols). Say if you had a matrix or dataframe with named columns "a", "d", that it would insert a column "b" and "c" filled up with NAs before column "d", and that any columns not listed in cols would be deleted (e.g. column "e"). What would be the easiest and fastest way to achieve this (I am dealing with a fairly large dataset of ca. 1 million rows)? Or is there already some function that does this?
I would separate the creation step and the ordering step. Here is an example:
cols <- letters[1:4]
## initialize test data set
my.df <- data.frame(a = rnorm(100), d = rnorm(100), e = rnorm(100))
## exclude columns not in cols
my.df <- my.df[ , colnames(my.df) %in% cols, drop = FALSE]
## add missing columns filled with NA
my.df[, cols[!(cols %in% colnames(my.df))]] <- NA
## reorder
my.df <- my.df[, cols]
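The three steps can be collected into one reusable function. This is a sketch for data frames only; conform_columns and the toy data are hypothetical names, not an existing function or the asker's data.

```r
# conform_columns() is an illustrative name, not an existing function
conform_columns <- function(d, cols) {
  # 1. drop columns not listed in cols (drop = FALSE keeps a data frame)
  d <- d[, colnames(d) %in% cols, drop = FALSE]
  # 2. add each absent column, filled with NA
  for (nm in setdiff(cols, colnames(d))) {
    d[[nm]] <- rep(NA, nrow(d))
  }
  # 3. put the columns in the requested order
  d[, cols, drop = FALSE]
}

toy <- data.frame(a = 1:2, d = 3:4, e = 5:6)
conformed <- conform_columns(toy, c("a", "b", "c", "d"))
conformed
```

Separating the steps this way keeps each one simple to check, at the cost of a few intermediate copies of the data frame.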
Another approach I just discovered uses match, but it only works for matrices:
# original matrix
matrix=cbind(a = 1:2, d = 3:4)
# required columns
coln=c("a","b","c","d")
colnmatrix=colnames(matrix)
matrix=matrix[,match(coln,colnmatrix)]
colnames(matrix)=coln
matrix
a b c d
[1,] 1 NA NA 3
[2,] 2 NA NA 4
Another possibility if your data is in a matrix
# original matrix
m1 <- cbind(a = 1:2, d = 3:4)
m1
# a d
# [1,] 1 3
# [2,] 2 4
# matrix with all columns, filled with NA
all.cols <- letters[1:4]
m2 <- matrix(nrow = nrow(m1), ncol = length(all.cols), dimnames = list(NULL, all.cols))
m2
# a b c d
# [1,] NA NA NA NA
# [2,] NA NA NA NA
# replace columns in 'NA matrix' with values from original matrix
m2[ , colnames(m1)] <- m1
m2
# a b c d
# [1,] 1 NA NA 3
# [2,] 2 NA NA 4
I am trying to combine two dataframes with different number of columns and column headers. However, after I combine them using rbind.fill(), the resulting file has filled the empty cells with NA.
This is very inconvenient since one of the columns has data that is also represented as "NA" (for North America), so when I export it to a CSV, the spreadsheet can't tell them apart.
Is there a way for me to:
Use the rbind.fill function without having it populate the empty cells with NA
or
Change the column to replace the NA values*
*I've scoured the blogs, and have tried the two most popular solutions:
df$col[is.na(df$col)] <- 0, #it does not work
df$col = ifelse(is.na(df$col), "X", df$col), #it changes all the characters to numbers, and ruins the column
Let me know if you have any advice! I (unfortunately) cannot share the df, but will be willing to answer any questions!
NA is not the same as "NA" to R, but might be interpreted as such by your favourite spreadsheet program. NA is a special value in R just like NaN (not a number). If I understand correctly, one of your solutions is to replace the "NA" values in the column representing North America with something else, in which case you should just be able to do...
df$col[ df$col == "NA" ] <- "NorthAmerica"
This assumes that your "NA" values are actually character strings. is.na() returns FALSE for the character string "NA", which is why df$col[ is.na(df$col) ] <- 0 won't work.
An example of the difference between NA and "NA":
x <- c( 1, 2, 3 , "NA" , 4 , 5 , NA )
> x[ !is.na(x) ]
[1] "1" "2" "3" "NA" "4" "5"
> x[ x == "NA" & !is.na(x) ]
[1] "NA"
Method to resolve this
I think you want to leave "NA" and any NAs as they are in the first df, but make all NA in the second df formed from rbind.fill() change to something like "NotAvailable". You can accomplish this like so...
library(plyr) # provides rbind.fill
df1 <- data.frame( col = rep( "NA" , 6 ) , x = 1:6 , z = rep( 1 , 6 ) )
df2 <- data.frame( col = rep( "SA" , 2 ) , x = 1:2 , y = 5:6 )
df <- rbind.fill( df1 , df2 )
temp <- df [ (colnames(df) %in% colnames(df2)) ]
temp[ is.na( temp ) ] <- "NotAvailable"
res <- cbind( temp , df[ !( colnames(df) %in% colnames(df2) ) ] )
#df has real NA values in column z and column y. We just want to get rid of y's
df
# col x z y
# 1 NA 1 1 NA
# 2 NA 2 1 NA
# 3 NA 3 1 NA
# 4 NA 4 1 NA
# 5 NA 5 1 NA
# 6 NA 6 1 NA
# 7 SA 1 NA 5
# 8 SA 2 NA 6
#res has "NA" strings in col representing "North America" and NA values in z, whilst those in y have been removed
#More generally, any NA in df1 will be left 'as-is', whilst NA from df2 formed using rbind.fill will be converted to character string "NotAvailable"
res
# col x y z
# 1 NA 1 NotAvailable 1
# 2 NA 2 NotAvailable 1
# 3 NA 3 NotAvailable 1
# 4 NA 4 NotAvailable 1
# 5 NA 5 NotAvailable 1
# 6 NA 6 NotAvailable 1
# 7 SA 1 5 NA
# 8 SA 2 6 NA
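If the only goal is to keep the two apart in the exported CSV, another option is to leave the data alone and control how missing values are written. The sketch below relies on the na argument of write.csv and the na.strings argument of read.csv; the toy data frame and the names d, csv_path and back are assumptions made for illustration.

```r
# Toy data: "NA" is a real string (North America), NA is a missing value
d <- data.frame(region = c("NA", "SA", NA),
                value  = c(1, NA, 3),
                stringsAsFactors = FALSE)

csv_path <- tempfile(fileext = ".csv")

# Write missing values as empty cells; the string "NA" is written as-is
write.csv(d, csv_path, na = "", row.names = FALSE)
readLines(csv_path)

# Read back, treating only empty cells as missing
back <- read.csv(csv_path, na.strings = "", stringsAsFactors = FALSE)
back
```

Whether a given spreadsheet program then shows the string "NA" unchanged depends on that program's own import settings.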
If you have a dataframe that contains NA's and you want to replace them all you can do something like:
df[is.na(df)] <- -999
This will take care of all NA's in one shot
If you only want to act on a single column you can do something like
df$col[which(is.na(df$col))] <- -999
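When the column holds text rather than numbers, the same replacement may need one extra step. The sketch below assumes the column is a factor, which would explain why the ifelse attempt in the question produced numbers; f and ch are illustrative names.

```r
# Assumption: the column is a factor with a level "NA" and one true missing value
f <- factor(c("NA", "SA", NA))

# On a factor, ifelse() works with the underlying integer codes, not the labels
ifelse(is.na(f), "X", f)

# Converting to character first keeps the labels intact
ch <- as.character(f)
ch[is.na(ch)] <- "X"
ch
```

If the column is already a character vector, the conversion is harmless and the replacement works the same way.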